GEN AI Services
GPT, one of the best-known families of LLMs, stands for Generative Pre-trained Transformer. Here's a breakdown of what each part means:
Generative: Refers to the model's ability to generate new text based on the input it receives, producing coherent and contextually relevant responses.
Pre-trained: The model is trained on large amounts of text data beforehand, learning patterns, language structures, and context before being fine-tuned or applied to specific tasks.
Transformer: GPT is based on the Transformer architecture, which is a neural network model designed for processing sequences of data, like sentences or paragraphs. Transformers are particularly good at capturing long-range dependencies in text, which makes them effective for language tasks.
So, GPT models are powerful LLMs designed to generate human-like text by leveraging their pre-training on large datasets and the Transformer architecture.
Transformers are a type of neural network architecture that is very efficient at processing these vectors and understanding the relationships between them. Rather than processing strings of text sequentially as earlier neural networks do, transformers can look at different parts of the sequence all at once and determine the most important parts. Transformer architectures have been a big contributor to the rapid growth and availability of large language models.
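To make the "look at different parts of the sequence all at once" idea concrete, here is a minimal sketch of scaled dot-product attention, the core Transformer operation, written in plain NumPy. The dimensions and data are toy values for illustration only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention, the core operation of a Transformer.

    Q, K, V are (sequence_length, d_k) matrices holding the query, key,
    and value vectors for each token in the sequence.
    """
    d_k = Q.shape[-1]
    # Each token's query is compared against every token's key at once,
    # which is how the model attends to all positions simultaneously.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns the scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # The output for each token is a weighted mix of all value vectors.
    return weights @ V

# Toy example: a 4-token sequence with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```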
GPT (Generative Pre-trained Transformer) – generates human-like text or computer code based on input prompts
BERT (Bidirectional Encoder Representations from Transformers) – similar intent to GPT, but reads the text in two directions
RNN (Recurrent Neural Network) – meant for sequential data such as time-series or text, useful in speech recognition, time-series prediction
ResNet (Residual Network) – Deep Convolutional Neural Network (CNN) used for image recognition tasks, object detection, facial recognition
SVM (Support Vector Machine) – ML algorithm for classification and regression
WaveNet – model to generate raw audio waveform, used in Speech Synthesis
GAN (Generative Adversarial Network) – models used to generate synthetic data such as images, videos, or sounds that resemble the training data. Helpful for data augmentation
XGBoost (Extreme Gradient Boosting) – an implementation of gradient boosting
Large Language Models (LLMs) like GPT have several limitations, despite their impressive capabilities:
Lack of True Understanding: LLMs generate responses based on patterns in data rather than true comprehension, which can lead to plausible-sounding but factually incorrect or nonsensical answers.
Limited Context: LLMs have a limited capacity to remember long conversations or documents, so they may lose track of earlier context, leading to inconsistent or irrelevant responses over longer interactions.
LLMs do not inherently remember context across interactions. Each query or prompt is processed independently, without memory of previous interactions. Therefore, it is the responsibility of applications (like chat interfaces or other systems) to:
Pass context: Continuously send the relevant conversation or data history along with each new prompt.
Enrich prompts: Manage and structure the input to ensure the model maintains continuity and can respond appropriately based on past context.
In short, context management is handled externally by the application, not by the LLM itself.
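As a concrete illustration of application-side context management, here is a minimal sketch using the Amazon Bedrock Converse API via boto3. The model ID is an assumption; any Bedrock chat model your account can access would work.

```python
import boto3

client = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # hypothetical choice

messages = []  # the application, not the LLM, owns this history

def ask(user_text):
    messages.append({"role": "user", "content": [{"text": user_text}]})
    # The FULL history is sent with every request, because the model
    # itself retains nothing between calls.
    response = client.converse(modelId=MODEL_ID, messages=messages)
    reply = response["output"]["message"]
    messages.append(reply)  # keep the assistant turn for future prompts
    return reply["content"][0]["text"]

print(ask("My name is Priya. What is a transformer?"))
print(ask("What is my name?"))  # answerable only because we resent history
```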
Bias in Data: Since LLMs are trained on large datasets from the internet, they can inadvertently learn and reproduce biases (e.g., gender, racial, or cultural biases) present in the data.
When variance is high, the model becomes so familiar with the training data that it can make predictions on that data with high accuracy, because it is capturing all of its features, including noise. However, when you introduce new data, the model's accuracy drops, because the new data can have different features that the model was not trained on. This is the problem of overfitting: the model performs well on the training data but does not perform well on the evaluation data, because it is memorizing the data it has seen and is unable to generalize to unseen examples.
The bias-variance tradeoff is about optimizing your model with the right balance between bias and variance, so that it is neither underfitted nor overfitted. The goal is a trained model with the lowest bias and lowest variance achievable for a given dataset. When both bias and variance are low, the regression is a curve that captures enough features of the data without capturing noise. This is what you want.
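A quick scikit-learn sketch makes the tradeoff visible: the same noisy data is fit with an underfit, a balanced, and an overfit polynomial, and only the overfit model's accuracy collapses on held-out data. The dataset and degree choices are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Noisy sine data: a curve is the "right" shape for this dataset.
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  train R^2={model.score(X_train, y_train):.2f}"
          f"  test R^2={model.score(X_test, y_test):.2f}")
# degree 1 underfits (high bias); degree 15 scores highest on training
# data but drops on the test set (high variance, i.e. overfitting).
```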
Inability to Handle Real-Time Data: LLMs, especially those trained on static data, cannot access real-time information or events happening after their training period, making them outdated in some contexts.
Overfitting or Memorization: LLMs sometimes memorize specific data points from the training dataset, which can lead to privacy concerns if sensitive information is unintentionally repeated in generated responses.
Computationally Expensive: Training and deploying LLMs requires significant computational resources, which can be costly and energy-intensive.
Difficulty with Reasoning and Math: LLMs struggle with logical reasoning, complex problem-solving, and mathematical calculations, often producing errors in these areas.
These limitations highlight that while LLMs are powerful tools, they still require human oversight and careful application in many scenarios.
Interpretability is a feature of model transparency. Interpretability is the degree to which a human can understand the cause of a decision. This might sound a lot like explainability, but there is a distinct difference.
Interpretability
Interpretability is the access into a system so that a human can interpret the model’s output based on the weights and features. For example, if a business wants high model transparency and wants to understand exactly why and how the model is generating predictions, they need to observe the inner mechanics of the AI/ML method.
Explainability
Explainability is how to take an ML model and explain the behavior in human terms. With complex models (for example, black boxes), you cannot fully understand how and why the inner mechanics impact the prediction. However, through model agnostic methods (for example, partial dependence plots, SHAP dependence plots, or surrogate models) you can discover meaning between input data attributions and model outputs. With that understanding, you can explain the nature and behavior of the AI/ML model.
Prompting techniques are strategies used to guide gen AI models. Some common techniques are listed below (a zero-shot vs. few-shot example follows the list):
Zero-shot prompting
Few-shot prompting
Chain-of-thought (CoT) prompting
Self-consistency
Tree of thoughts (ToT)
Retrieval Augmented Generation (RAG)
Automatic Reasoning and Tool-use (ART)
ReAct prompting
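To illustrate the first two techniques, here is a minimal, made-up example contrasting a zero-shot prompt with a few-shot prompt. The task and reviews are invented for illustration.

```python
# Zero-shot: the model gets only the instruction, no examples.
zero_shot = "Classify the sentiment of this review: 'The battery died in a day.'"

# Few-shot: in-context examples show the expected label format, which
# typically makes the completion more consistent.
few_shot = """Classify the sentiment of each review as Positive or Negative.

Review: 'Great screen, fast shipping.'  Sentiment: Positive
Review: 'Stopped working after a week.' Sentiment: Negative
Review: 'The battery died in a day.'    Sentiment:"""

print(few_shot)
```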
The maximum length setting determines the maximum number of tokens that the model can generate during the inference process.
Stop sequences are special tokens or sequences of tokens that signal the model to stop generating further output.
Top p
Top p is a setting that controls the diversity of the text by limiting the number of words that the model can choose from based on their cumulative probabilities. Top p is also set on a scale from 0 to 1. The following are examples of different top p settings.
Low top p (for example, 0.250)
With a low top p setting, like 0.250, the model will only consider words that make up the top 25 percent of the total probability distribution. This can help the output be more focused and coherent, because the model is limited to choosing from the most probable words given the context.
High top p (for example, 0.990)
With a high top p setting, like 0.990, the model will consider a broad range of possible words for the next word in the sequence, because it will include words that make up the top 99 percent of the total probability distribution. This can lead to more diverse and creative output, because the model has a wider pool of words to choose from.
Top k
Top k limits the number of words to the top k most probable words, regardless of their percent probabilities. For instance, if top k is set to 50, the model will only consider the 50 most likely words for the next word in the sequence, even if those 50 words only make up a small portion of the total probability distribution.
Low top k (for example, 10)
With a low setting, like 10, the model will only consider the 10 most probable words for the next word in the sequence. This can help the output be more focused and coherent, because the model is limited to choosing from the most probable words given the context.
High top k (for example, 500)
With a high top k setting, like 500, the model will consider the 500 most probable words for the next word in the sequence, regardless of their individual probabilities. This can lead to more diverse and creative output, because the model has a larger pool of potential words to choose from.
A common question: if temperature is set to the lowest value, top p to 25 percent, and top k to 500 words, and the pool of possible next words is only 100, does top k override both temperature and top p by keeping the entire pool (because the configured 500 exceeds the 100-word pool, of which top p would keep 25)? Or does the selection still come from the 25 words that top p keeps?
1. Temperature:
Think of temperature as creativity or randomness in choosing the next word.
A low temperature makes the AI pick more predictable words (like playing it safe).
A high temperature makes the AI take more risks and choose less common words.
2. Top P (Nucleus Sampling):
Top P looks at probabilities of the next words and only keeps a portion of them.
If Top P is set to 25%, it means: "Only keep the words that make up the top 25% of the total probability."
For example, if there are 100 possible words, and you set Top P to 25%, it might filter down to the 25 words with the highest probability.
3. Top K:
Top K looks at the number of possible words and chooses from the top ones.
If Top K is 500, it means: "Only consider the top 500 most likely words."
In our example with 100 possible words, Top K won't have much impact because the pool is already smaller than 500.
In the Example:
Step-by-Step Process:
You have 100 potential words.
Top P = 25%: This selects the top 25 words from the pool of 100 because they make up the top 25% of probability.
Top K = 500: Since you only have 100 words in the pool, Top K would just keep all of them (since 500 is much larger than 100).
Impact on Selection:
In this case, Top K doesn’t really override anything because it’s larger than your pool (100). So, the final selection is controlled by Top P, which is 25 words.
Temperature still applies to these 25 words to decide how risky or conservative the final choice will be.
Summary:
In the example, Top P will effectively determine that 25 words (25% of 100) are kept. Since Top K is set to 500, it doesn’t change anything because there aren’t that many words. Temperature will then influence how randomly the words are picked from those 25 options. So, Top K doesn't override Top P; it just isn't limiting in this scenario because your pool is smaller than 500.
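The interaction described above can be sketched in code. The following NumPy function applies top k, then top p, then temperature, which is a common (though not universal) ordering; the vocabulary size and logits are toy values.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Illustrative sketch of how temperature, top k, and top p combine."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]          # tokens, most probable first
    if top_k is not None:
        order = order[:top_k]                # keep at most top_k tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        order = order[:cutoff]               # keep the top_p probability mass

    # Temperature rescales only the surviving candidates.
    scaled = np.log(probs[order]) / temperature
    final = np.exp(scaled - scaled.max())
    final /= final.sum()
    return np.random.choice(order, p=final)

# With a 100-token vocabulary, top_k=500 keeps everything (it exceeds the
# pool), so top_p=0.25 does the real filtering -- as in the example above.
logits = np.random.default_rng(1).normal(size=100)
print(sample_next_token(logits, temperature=0.2, top_k=500, top_p=0.25))
```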
Tips for generating good prompts
Be clear and concise
Include context if needed
Describe task, specify rules and directions about output formatting
Consider the output in the prompt
Provide diverse examples including edge cases
Use simple language
Test and experiment
Add examples and rules to the prompt if results are not satisfactory
Fine-tuning is another way to improve the performance of a foundation model even further. Fine-tuning refers to the process of taking a pre-trained language model and further training it on a specific task or domain-specific dataset. Fine-tuning allows the model to adapt its knowledge and capabilities to better suit the requirements of the business use case.
Instruction fine-tuning uses examples of how the model should respond to a specific instruction. Prompt tuning is a type of instruction fine-tuning.
Reinforcement learning from human feedback (RLHF) provides human feedback data, resulting in a model that is better aligned with human preferences.
In the fine-tuning process, labeling the data with accurate and relevant labels is crucial for guiding the model's adjustments to specialize in the target domain.
The following list walks through the key steps in fine-tuning data preparation:
Data curation: Although curation continues from the pre-training stage, fine-tuning involves a more rigorous selection process to ensure every piece of data is highly relevant and contributes to the model's learning in the specific context.
Labeling: In fine-tuning, the accuracy and relevance of labels are paramount. They guide the model's adjustments to specialize in the target domain.
Governance and compliance: Considering fine-tuning often uses more specialized data, ensuring data governance and compliance with industry-specific regulations is critical.
Representativeness and bias checking: It is essential to ensure that the fine-tuning dataset does not introduce or perpetuate biases that could skew the model's performance in undesirable ways.
Feedback integration: For methods like RLHF, incorporating user or expert feedback directly into the training process is crucial. This is more nuanced and interactive than the initial training phase.
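As a minimal sketch of what labeled fine-tuning data can look like, the snippet below writes two invented examples in the prompt/completion JSON Lines format that Amazon Bedrock model customization accepts; the domain content is illustrative only.

```python
import json

examples = [
    {"prompt": "Summarize this support ticket: 'App crashes on login since v2.3.'",
     "completion": "Login crash introduced in v2.3; needs engineering triage."},
    {"prompt": "Summarize this support ticket: 'Refund not received after 10 days.'",
     "completion": "Refund delayed beyond SLA; escalate to billing."},
]

# One JSON object per line: the standard JSONL layout for training data.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```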
Automated metrics can be useful for rapid iterations and fine-tuning during model development, but they often fail to capture the nuances and complexities of human language and might not align perfectly with human judgments.
Still, they can provide a quick and scalable way to evaluate foundation models.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating automatic summarization and machine translation systems. It measures the quality of a generated summary or translation by comparing it to one or more reference summaries or translations.
BLEU (Bilingual Evaluation Understudy) is a metric used to evaluate the quality of machine-generated text, particularly in the context of machine translation. It measures the similarity between a generated text and one or more reference translations, considering both precision and brevity.
BERTScore is a metric for assessing the semantic similarity between two sentences. It uses pre-trained Bidirectional Encoder Representations from Transformers (BERT) models to compute contextualized embeddings for the input texts, and then calculates the cosine similarity between them.
Because of these limits, it's often recommended to combine automated metrics with human evaluation for a more comprehensive assessment. Other common metrics include:
F1 score (for evaluating classification or entity recognition tasks)
Perplexity (a measure of how well the model predicts the next token)
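For a feel of how the ROUGE and BLEU scores above are computed in practice, here is a small sketch using the nltk and rouge-score Python packages (pip install nltk rouge-score) on invented texts; the package choice is an assumption, not a requirement of any AWS service.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

# BLEU: n-gram precision of the candidate against the reference.
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-L: longest-common-subsequence overlap, recall-oriented.
scorer = rouge_scorer.RougeScorer(["rougeL"])
rouge = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}  ROUGE-L F1: {rouge:.3f}")
```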
The core dimensions of responsible AI are categories for risks and practices. These categories include
fairness
explainability
privacy and security
veracity and robustness
governance
transparency
safety
controllability.
Transparency helps to understand HOW a model makes decisions.
Explainability helps to understand WHY the model made the decision that it made. It gives insight into the limitations of a model.
a simple scenario involving an AI system that approves or denies loan applications.
Transparency (HOW): "The model looks at 5 specific factors in this order: income, credit score, employment history, current debt, and age of credit history. It weighs income as 35% of the decision, credit score as 30%, employment as 20%, current debt as 10%, and credit history age as 5%. The data passes through 3 layers of processing using a neural network architecture."
Explainability (WHY): "Your loan application was denied primarily because your current debt-to-income ratio of 45% is above our acceptable threshold of 40%. Even though you have an excellent credit score and stable employment, the high existing debt poses too much risk for an additional loan at this time."
The key difference is that transparency shows the mechanics and process (HOW), while explainability provides the specific reasoning and justification for an individual outcome (WHY).
Models that lack transparency and explainability are often referred to as opaque models. These models use complex algorithms and numerous layers of neural networks to make predictions, but they do not provide insight into their internal workings.
Explainability frameworks help summarize and interpret the decisions made by AI systems. Common frameworks are listed below (a short SHAP sketch follows the list).
LIME (Local Interpretable Model-agnostic Explanations): Creates simplified local approximations of complex models to explain individual predictions by perturbing input data and observing how predictions change.
SHAP (SHapley Additive exPlanations): Uses game theory concepts to assign importance values to each feature by considering all possible combinations of features and their marginal contributions. SHAP is a specific type of feature attribution method.
Counterfactual Explanations: Shows what changes would be needed in the input data to get a different outcome - like "if your income was $5000 higher, the loan would be approved."
Association Rule Mining: Discovers relationships between variables in large datasets to explain patterns that influence model decisions (like "When feature A is high AND feature B is low, the outcome is usually positive").
Rule Extraction: Converts complex models into simpler, interpretable rule-based systems that approximate the original model's behavior using if-then statements.
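As a brief illustration of feature attribution, here is a hedged SHAP sketch (pip install shap) on a public scikit-learn dataset; the model choice is arbitrary.

```python
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

# TreeExplainer computes Shapley values efficiently for tree models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data[:5])

# Each value is one feature's contribution to one prediction, so you can
# explain an individual outcome: which inputs pushed it up or down.
# (Older shap versions return a list of per-class arrays.)
print(shap_values[0].shape if isinstance(shap_values, list) else shap_values.shape)
```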
A model that provides transparency into a system so a human can explain the model’s output based on the weights and features is an example of interpretability in a model.
When you implement transparency in your AI system, elements of explainability, fairness, and governance will be required.
Explainability empowers users to verify system functionality, check for unwanted biases, increase useful human control, and place appropriate trust in AI systems. This dimension of AI promotes the responsible development and deployment of AI technology for the benefit of society. Without explainability, AI could lose public trust because of inscrutable failures.
With AI Service Cards, Amazon provides transparent documentation on the Amazon AI services that help you build your AI applications.
With SageMaker Model Cards, you can catalog and provide documentation on models that you create or develop yourself.
SageMaker Clarify helps identify potential bias in machine learning models and datasets without the need for extensive coding. You specify input features, such as gender or age, and SageMaker Clarify runs an analysis job to detect potential bias in those features. SageMaker Clarify then provides a visual report with a description of the metrics and measurements of potential bias so that you can identify steps to remediate the bias.
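A hedged sketch of launching a Clarify pre-training bias job with the SageMaker Python SDK might look like the following; the role ARN, S3 paths, and column names are placeholders for illustration.

```python
from sagemaker import clarify

processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/train.csv",   # hypothetical
    s3_output_path="s3://my-bucket/clarify-output",  # hypothetical
    label="approved",
    headers=["age", "gender", "income", "approved"],
    dataset_type="text/csv",
)
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],  # the favorable outcome
    facet_name="gender",            # the feature to check for bias
)
# Runs an analysis job and writes a bias report to the output path.
processor.run_pre_training_bias(data_config=data_config,
                                data_bias_config=bias_config)
```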
SageMaker Canvas gives the ability to use machine learning to generate predictions without needing to write any code.
Use Amazon SageMaker Data Wrangler to balance your data in cases of any imbalances.
SageMaker Data Wrangler offers three balancing operators to rebalance data in your unbalanced datasets (SMOTE is sketched after this list):
Random under sampling
Random oversampling
Synthetic Minority Oversampling Technique (SMOTE)
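As a standalone illustration of what the SMOTE operator does, here is a sketch using the imbalanced-learn package (pip install imbalanced-learn). This shows the underlying technique, not the Data Wrangler interface itself; the dataset is synthetic.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# A 95/5 class split: the kind of imbalance these operators address.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating
# between existing minority neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes now balanced
```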
SageMaker Clarify is integrated with Amazon SageMaker Experiments to provide scores detailing which features contributed the most to your model prediction on a particular input for tabular, natural language processing (NLP), and computer vision models.
For tabular datasets, SageMaker Clarify can also output an aggregated feature importance chart that provides insights into the overall prediction process of the model. These details can help determine if a particular model input has more influence than expected on overall model behavior.
SageMaker Experiments is a capability of SageMaker that you can use to create, manage, analyze, and compare your machine learning experiments.
SageMaker Autopilot : Amazon SageMaker Autopilot is a feature set that simplifies and accelerates various stages of the machine learning workflow by automating the process of building and deploying machine learning models (AutoML).
SageMaker Ground Truth offers the most comprehensive set of human-in-the-loop capabilities for incorporating human feedback across the ML lifecycle to improve model accuracy and relevancy. SageMaker Ground Truth includes a data annotator for RLHF capabilities. You can give direct feedback and guidance on output that a model has generated by ranking, classifying, or doing both for its responses for RL outcomes. The data, referred to as comparison and ranking data, is effectively a reward model or reward function that is then used to train the model. You can use comparison and ranking data to customize an existing model for your use case or to fine-tune a model that you build from scratch.
Amazon SageMaker Model Monitor monitors the quality of SageMaker machine learning models in production. You can set up continuous monitoring with a real-time endpoint (or a batch transform job that runs regularly), or on-schedule monitoring for asynchronous batch transform jobs. With SageMaker Model Monitor, you can set alerts that notify you when there are deviations in the model quality. With early and proactive detection of these deviations, you can take corrective actions.
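A sketch of setting up such monitoring with the SageMaker Python SDK might look like the following; the role, endpoint name, and S3 paths are placeholders, and exact arguments can vary by SDK version.

```python
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
# Baseline statistics and constraints are suggested from training data;
# production traffic is then compared against them.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train.csv",          # hypothetical
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline",              # hypothetical
)
monitor.create_monitoring_schedule(
    endpoint_input="my-endpoint",                         # hypothetical
    output_s3_uri="s3://my-bucket/monitoring",            # hypothetical
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```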
Amazon Augmented AI (Amazon A2I) is a service that helps build the workflows required for human review of ML predictions. Amazon A2I brings human review to all developers and removes the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers.
Amazon SageMaker Role Manager: With SageMaker Role Manager, administrators can define minimum permissions in minutes.
Amazon SageMaker Model Cards: With SageMaker Model Cards, you can capture, retrieve, and share essential model information, such as intended uses, risk ratings, and training details, from conception to deployment.
Model cards are a standardized format for documenting the key details about an ML model, including its intended use, performance characteristics, and potential limitations.
Model cards can be used to:
provide source citations
document data origin
understand the provenance (lineage) of the data used to train the model
Model cards can include details about:
the datasets used
dataset sources
licenses
any known biases
quality issues in the training data
Amazon SageMaker Model Dashboard: With SageMaker Model Dashboard, you can keep your team informed on model behavior in production, all in one place.
AI Service Cards are a form of documentation on responsible AI. They provide teams with a single place to find information on the intended use cases and limitations, responsible AI design choices, and deployment and performance optimization best practices for AWS AI services.
With Model Evaluation on Amazon Bedrock, you can evaluate, compare, and select the best foundation model for your use case.
With Guardrails for Amazon Bedrock, you can implement safeguards for generative AI applications. Guardrails can help filter out harmful content.
Model performance varies across a number of factors, including the following:
Level of customization – The ability to change a model’s output with new data ranging from prompt-based approaches to full model retraining
Model size – The amount of information the model has learned as defined by parameter count
Inference options – From self-managed deployment to API calls
Licensing agreements – Some agreements can restrict or prohibit commercial use
Context windows – The amount of information that can fit in a single prompt
Latency – The amount of time it takes for a model to generate an output
Performance is a function of the model and a test dataset, not just the model. So, when you are assessing a model, you need to determine how well a model performs on a particular dataset.
For example, a model might perform well on test dataset A over a period of time. The model might perform even better on test dataset B. However, the model might progressively get worse on test dataset C.
This means that you need to consider two development trajectories: the development trajectory of the model and the development trajectory of the datasets. Remember the dataset is not necessarily constant. It is often evolving.
The Open Web Application Security Project (OWASP) Top 10 is the industry standard list of the top 10 vulnerabilities that can impact a generative AI LLM system. These vulnerabilities are as follows:
1 Prompt injection: Malicious user inputs that can manipulate the behavior of a language model
2 Insecure output handling: Failure to properly sanitize or validate model outputs, leading to security vulnerabilities
3 Training data poisoning: Introducing malicious data into a model's training set, causing it to learn harmful behaviors
4 Model denial of service: Techniques that exploit vulnerabilities in a model's architecture to disrupt its availability
5 Supply chain vulnerabilities: Weaknesses in the software, hardware, or services used to build or deploy a model
6 Sensitive information disclosure: Leakage of sensitive data through model outputs or other unintended channels
7 Insecure plugin design: Flaws in the design or implementation of optional model components that can be exploited
8 Excessive agency: Granting a model too much autonomy or capability, leading to unintended and potentially harmful actions
9 Overreliance: Over-dependence on a model's capabilities, leading to over-trust and failure to properly audit its outputs
10 Model theft: Unauthorized access or copying of a model's parameters or architecture, allowing for its reuse or misuse
Amazon Q Business is a generative AI assistant that boosts employee productivity and transforms the way that you get work done. You can quickly find accurate answers from your enterprise content, backed up with citations and references. You can brainstorm new ideas, generate content, or create summaries with an Amazon Q application based on your enterprise data. Amazon Q ensures that users access enterprise content securely according to their permissions, and you can deploy an Amazon Q application quickly with built-in connectors to popular enterprise repositories.
User makes a request.
Amazon Q retrieves the information semantically most relevant to the user's request from the ingested enterprise content that the user has permission to access.
Amazon Q sends the user request, along with the retrieved information as context, to the underlying large language model (LLM).
The LLM returns a succinct response to the user's request based on the context.
The response from the LLM is sent back to the user.
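This flow can also be exercised programmatically. The following is a minimal sketch using the boto3 ChatSync API for Amazon Q Business; the application ID is a placeholder, and the response fields shown are assumptions based on the ChatSync API shape.

```python
import boto3

client = boto3.client("qbusiness")

response = client.chat_sync(
    applicationId="a1b2c3d4-example",  # hypothetical application ID
    userMessage="What is our parental leave policy?",
)
# The answer is grounded in ingested enterprise content the caller is
# permitted to see, with citations in sourceAttributions.
print(response["systemMessage"])
for source in response.get("sourceAttributions", []):
    print("-", source.get("title"))
```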
With Amazon Q, you create an application, configure a retriever, and then use the Preview Web Experience to interact with your application from the console. Alternatively, you can deploy a web experience using a SAML 2.0 compliant identity provider of your choice, such as Okta, and provide the URL to your users so that they can authenticate and use the Amazon Q application you just created.
Create an application: Use the step-by-step workflow in the AWS Management Console for Amazon Q to start creating an application.
Customize the web experience: The same workflow guides you through customizing your web experience.
Configure the retriever: Next, choose whether to use the native retriever provided by Amazon Q or a pre-existing Amazon Kendra index, in the same AWS Region as the Amazon Q application, as the retriever.
Configure the native retriever: In this step, you can upload documents from your local machine to the Amazon Q application, configure a data source with sample documents, or use one of the native data source connectors to connect to the data sources where the enterprise documents are located. If you configured a data source connector, initiate a data source sync to index the documents.
Configure the Kendra retriever: In this step, you configure an existing Kendra index as the retriever.
Deploy the web experience: In this step, you deploy the web experience for your Amazon Q application using a SAML 2.0 compliant identity provider (IdP) and provide the URL to your users so that they can authenticate and use the application.