Large Language Models (LLMs) like GPT have several limitations, despite their impressive capabilities:
Lack of True Understanding: LLMs generate responses based on patterns in data rather than true comprehension, which can lead to plausible-sounding but factually incorrect or nonsensical answers.
Limited Context: LLMs have a limited capacity to remember long conversations or documents, so they may lose track of earlier context, leading to inconsistent or irrelevant responses over longer interactions.
LLMs do not inherently remember context across interactions. Each query or prompt is processed independently, without memory of previous interactions. Therefore, it is the responsibility of applications (like chat interfaces or other systems) to:
Pass context: Continuously send the relevant conversation or data history along with each new prompt.
Enrich prompts: Manage and structure the input to ensure the model maintains continuity and can respond appropriately based on past context.
In short, context management is handled externally by the application, not by the LLM itself.
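A minimal sketch of this external context management, assuming a generic, stateless `call_llm` function (a placeholder for whatever model API you use, not a specific SDK call): the application appends each turn to its own history and re-sends it with every new prompt.

```python
# Minimal sketch: the application, not the LLM, carries the conversation state.
# `call_llm` is a placeholder for any stateless model invocation.

history = []  # list of (role, text) turns kept by the application

def ask(question, call_llm):
    # Enrich the prompt with prior turns so the model can "remember" them.
    context = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = f"{context}\nuser: {question}\nassistant:"
    answer = call_llm(prompt)            # stateless model call
    history.append(("user", question))   # the app persists the context
    history.append(("assistant", answer))
    return answer
```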
Bias in Data: Since LLMs are trained on large datasets from the internet, they can inadvertently learn and reproduce biases (e.g., gender, racial, or cultural biases) present in the data.
Inability to Handle Real-Time Data: LLMs, especially those trained on static data, cannot access real-time information or events happening after their training period, making them outdated in some contexts.
Overfitting or Memorization: LLMs sometimes memorize specific data points from the training dataset, which can lead to privacy concerns if sensitive information is unintentionally repeated in generated responses.
Computationally Expensive: Training and deploying LLMs requires significant computational resources, which can be costly and energy-intensive.
Difficulty with Reasoning and Math: LLMs struggle with logical reasoning, complex problem-solving, and mathematical calculations, often producing errors in these areas.
These limitations highlight that while LLMs are powerful tools, they still require human oversight and careful application in many scenarios.
Enterprise AI Paradox:
Despite huge potential, only ~25% of businesses see ROI from AI.
Core issue: AI struggles with context, something humans are naturally good at.
Context is Crucial:
The more differentiated value you want, the deeper your systems need to handle enterprise-specific context.
Contextual understanding is the key to unlocking ROI.
Think Systems, Not Just Models:
LLMs are only ~20% of the system.
A mediocre model + great RAG system > great model + poor RAG system.
Build full-stack AI systems, not just model demos.
Specialization Beats AGI:
Enterprise expertise is the fuel.
AGI is interesting, but real results come from domain-specific, specialized solutions.
Your Data is Your Moat:
Long-term value comes from leveraging messy, real-world enterprise data.
Don’t over-invest in data cleanup—build systems that handle noisy data at scale.
Don’t Just Build Pilots—Design for Production:
Pilots are easy; scaling is hard.
Plan for production from day one (security, scale, performance, use-case breadth).
Speed > Perfection:
Ship early, get feedback, iterate quickly.
Real users, not just test users, should drive development.
Don’t Waste Engineers on Boring Work:
Chunking, prompting, and infrastructure can be abstracted.
Let engineers focus on delivering business value, not technical busywork.
Make AI Easy to Consume:
Enterprise tools must fit into existing workflows.
Adoption hinges on seamless UX and workflow integration.
Design for User “Wow” Moments:
Fastest path to value is giving users "aha" moments.
Example: Finding forgotten documents that solve real problems.
Accuracy Isn’t Enough—Think Inaccuracy:
Observability, traceability, and attribution are crucial.
Especially important in regulated environments.
Be Ambitious:
Low-ROI projects (e.g. answering HR questions) won’t move the needle.
Aim for transformational use cases—this is a once-in-a-generation opportunity.
One of the key issues with LLMs is their tendency to hallucinate. Returning answers grounded in facts is therefore a top priority for question answering (QA), contextual chatbots, and other use cases. QA is an important task that involves extracting answers to factual queries posed in natural language. Typically, a QA system processes a query against a knowledge base containing structured or unstructured data and generates a response with accurate information. Ensuring high accuracy is key to developing a useful, reliable, and trustworthy question answering system, especially for enterprise use cases.
Generative AI models like Titan, Claude and Jurassic use probability distributions to generate responses to questions. These models are trained on vast amounts of text data, which allows them to predict what comes next in a sequence or what word might follow a particular word.
However, these models cannot provide accurate or deterministic answers to questions about private or enterprise data, because the model is simply not aware of that information.
Enterprises need to query domain specific and proprietary data and use the information to answer questions, including data on which the model has not been trained.
The challenge is that most models accept only a limited amount of contextual information in the prompt, which constrains how much of that data can be supplied in any given scenario.
This can be overcome by using the Retrieval Augmented Generation (RAG) technique.
RAG combines two steps: using embeddings to index the corpus of documents and build a knowledge base, and using a large language model to extract information from the subset of documents retrieved from that knowledge base.
As a preparation step for RAG, the documents comprising the knowledge base are split into chunks of a fixed size (matching the maximum input size of the selected embedding model), and are then passed to the model to obtain the embedding vector. The embedding together with the original chunk of the document and additional metadata are stored in a vector database. The vector database is optimized to efficiently perform similarity searches between vectors.
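A rough sketch of this preparation step, with a placeholder `embed` function and an in-memory list standing in for the vector database; the chunk size and overlap values are illustrative only and should match the chosen embedding model's input limit.

```python
# Sketch of the RAG preparation step: fixed-size chunking plus embedding.
# `embed` stands in for any embeddings model call; `vector_db` is a plain list
# standing in for a real vector database.

def chunk_text(text, chunk_size=1000, overlap=100):
    """Split a document into fixed-size, slightly overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def index_document(doc_id, text, embed, vector_db):
    """Embed each chunk and store vector + original chunk + metadata."""
    for i, chunk in enumerate(chunk_text(text)):
        vector = embed(chunk)  # call the chosen embeddings model
        vector_db.append({
            "id": f"{doc_id}-{i}",
            "vector": vector,
            "text": chunk,  # original chunk kept for retrieval
            "metadata": {"source": doc_id, "chunk": i},
        })
```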
With Knowledge Bases for Amazon Bedrock, you can give FMs and agents contextual information from your company’s private data sources for Retrieval Augmented Generation (RAG) to deliver more relevant, accurate, and customized responses.
To equip FMs with up-to-date and proprietary information, organizations use Retrieval Augmented Generation (RAG), a technique that fetches data from company data sources and enriches the prompt to provide more relevant and accurate responses. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow from ingestion to retrieval and prompt augmentation without having to build custom integrations to data sources and manage data flows. Session context management is built in, so your app can readily support multi-turn conversations.
With Knowledge Bases, you can now securely ask questions of your data without needing to set up a vector database.
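As a hedged illustration, the managed RAG flow can be invoked with a single call. The sketch below uses the boto3 `RetrieveAndGenerate` operation, with the knowledge base ID and model ARN as placeholders; check the current API reference for exact parameter names and shapes.

```python
import boto3

# Sketch: querying a Bedrock knowledge base end to end (retrieve + generate).
# KB_ID and MODEL_ARN are placeholders you must replace with real values.

client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "What is our parental leave policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID",       # placeholder
            "modelArn": "MODEL_ARN",          # placeholder foundation model ARN
        },
    },
)

print(response["output"]["text"])  # generated answer grounded in retrieved chunks
```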
In a RAG (Retrieval-Augmented Generation) architecture, the vector store is the component responsible for performing searches, not the embeddings model itself. However, the embeddings model plays a critical role in enabling effective search within the vector store. Here's how it works and where each component fits in:
RAG combines two main components:
Retrieval: A search mechanism that finds relevant information or documents from a large dataset.
Generation: A generative model (like an LLM) that uses the retrieved information to generate a coherent and contextually accurate response.
Embeddings Model
The embeddings model converts text (queries, documents, phrases) into high-dimensional vectors that capture the semantic meaning of the text.
It is used to create vectors for both the user’s query and the documents stored in the database.
The embeddings for the documents (or data) are precomputed and stored in a vector store.
When a user submits a query, the embeddings model generates a query vector based on that input. This query vector is what the search mechanism will use to find relevant information.
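For example, a query vector might be generated with an embeddings model hosted on Bedrock. The model ID and request/response field names below follow the Titan Text Embeddings format as commonly documented; verify them against the current Bedrock documentation.

```python
import json
import boto3

# Sketch: generating a query vector with an embeddings model on Amazon Bedrock.
# Model ID and field names are assumptions based on the Titan Embeddings docs.

bedrock = boto3.client("bedrock-runtime")

def embed_query(text):
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",   # assumed embeddings model ID
        body=json.dumps({"inputText": text}),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]                 # the query vector

query_vector = embed_query("What is our parental leave policy?")
```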
Vector Store
Vector databases provide the ability to store and retrieve vectors as high-dimensional points. They add additional capabilities for efficient and fast lookup of nearest-neighbors in the N-dimensional space. They are typically powered by k-nearest neighbor (k-NN) indexes and built with algorithms like the Hierarchical Navigable Small World (HNSW) and Inverted File Index (IVF) algorithms. Vector databases provide additional capabilities like data management, fault tolerance, authentication and access control, and a query engine.
The vector store holds the precomputed embeddings (vectors) of documents or chunks of data that are available for retrieval.
When a query is input, the vector store takes the query vector (generated by the embeddings model) and performs a similarity search, finding vectors (documents) that are most similar to the query vector.
The similarity search in the vector store can use methods like cosine similarity, dot product, or Euclidean distance to measure how close the query vector is to each stored vector.
Once the most relevant documents are found, they are retrieved and passed to the generative model for final processing.
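To make the scoring concrete, here is a brute-force cosine-similarity search over records shaped like the chunks indexed in the earlier sketch. A real vector store would use approximate k-NN indexes (HNSW, IVF) instead of a linear scan, but the scoring idea is the same.

```python
import numpy as np

# Brute-force nearest-neighbor sketch over records of the form
# {"vector": [...], "text": "...", ...} produced by the indexing sketch above.

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vector, records, k=3):
    """Return the k stored records whose vectors are closest to the query."""
    scored = [(cosine_similarity(query_vector, r["vector"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:k]]
```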
| Component | Responsibility | Details |
|---|---|---|
| Embeddings Model | Converts text to vectors | Encodes both queries and documents, capturing semantic meaning |
| Vector Store | Performs the search (retrieval) | Uses precomputed document vectors and query vectors to find relevant matches |
| Generative Model (LLM) | Generates the final response using retrieved documents | Utilizes context from retrieved documents to generate accurate and context-aware answers |
Efficiency: The vector store is optimized for similarity search operations. It can handle large-scale databases of vectors and perform fast lookups.
Storage: It’s specifically designed to store and manage vectors efficiently, including handling updates, indexing, and organizing data for rapid retrieval.
Scalability: Vector stores like Pinecone, FAISS, or Milvus are built to handle millions or billions of vectors, making them scalable solutions for retrieval in RAG.
The embeddings model's role is to encode the input data into a format (vector) that the vector store can use for searching.
It doesn't perform the search directly, but the quality of the search depends heavily on how well the embeddings model encodes the semantic information.
Encoding:
Documents are encoded into vectors using the embeddings model.
These vectors are stored in the vector store.
Query Encoding:
A user's query is encoded into a vector using the same embeddings model.
Similarity Search:
The vector store searches for vectors that are most similar to the query vector.
Retrieval:
Relevant vectors (documents) are retrieved and provided as context.
Generation:
The generative model (LLM) uses the retrieved information to produce a response.
The embeddings model in a RAG architecture is crucial for generating accurate vector representations, but it does not perform the search. Instead, the vector store is the search engine that finds relevant data based on the vectors provided by the embeddings model. This division of labor allows for highly efficient and accurate retrieval and generation workflows.
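Putting the steps above together, a minimal sketch of the retrieval-plus-generation loop might look like this, reusing the `embed_query` and `top_k` sketches and a placeholder `call_llm`; the prompt template is illustrative, not prescriptive.

```python
# Sketch of the final RAG steps: retrieve relevant chunks, augment the prompt,
# then generate. `embed_query`, `top_k`, and `call_llm` come from the sketches
# and placeholders introduced earlier.

def answer_with_rag(question, records, call_llm, k=3):
    hits = top_k(embed_query(question), records, k=k)   # retrieval
    context = "\n\n".join(hit["text"] for hit in hits)  # retrieved chunks
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)                             # generation
```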
There's a similarity between tokenizers and embedding models in the sense that they both deal with representing language, but they serve distinct purposes in the context of generative AI and information retrieval:
A tokenizer is a tool that converts raw text into tokens, which are the smallest units of text (words, sub-words, characters) that a model can process. It does this by assigning token IDs to each unique token.
These token IDs are used during both training and inference in large language models (LLMs). The model relies on these IDs to learn and generate coherent text based on patterns seen in the training data.
Tokens are essentially how the text data is structured and fed into the LLM. The consistency of the tokenizer ensures that the model knows what each input token means and how to predict the next token in a sequence.
Key Purpose of a Tokenizer:
Breaks down and standardizes text into numerical form (tokens).
Maintains a fixed vocabulary and token ID mappings that are consistent throughout the LLM's training and usage.
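A quick illustration using a Hugging Face tokenizer; the model choice is arbitrary, and the exact sub-word splits depend on that tokenizer's vocabulary.

```python
from transformers import AutoTokenizer

# Illustration of tokenization: for a given tokenizer, the same text always
# maps to the same tokens and token IDs.

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers split text into sub-word units."
tokens = tokenizer.tokenize(text)                 # sub-word pieces from the fixed vocabulary
ids = tokenizer.convert_tokens_to_ids(tokens)     # the numerical IDs the model consumes

print(tokens)
print(ids)
```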
An embeddings model takes text (or tokens) and converts them into high-dimensional numerical vectors. These vectors capture semantic meaning, allowing words or sentences with similar meanings to have similar vector representations.
Embeddings are used for a range of purposes, including:
Semantic search: Finding documents or information that are contextually similar.
Clustering: Grouping similar data points together.
Recommendation systems: Suggesting content based on semantic similarities.
In a vector store (like Pinecone, FAISS), embeddings are stored as vectors to enable fast and efficient similarity search, allowing you to retrieve data based on meaning rather than exact matches.
Key Purpose of Embeddings:
Transform tokens or phrases into vectors that capture meaning.
Enable comparisons between text data to identify similarities or retrieve relevant information.
| Aspect | Tokenizer (LLM) | Embeddings Model (Vector Store) |
|---|---|---|
| Function | Converts text into tokens (IDs) | Converts tokens/text into vectors (semantic meaning) |
| Output | Numerical IDs tied to specific tokens | High-dimensional vectors representing semantic space |
| Purpose | Prepares text for training/generation | Facilitates search, similarity comparisons, clustering |
| Scope | Used internally within LLMs | Used in vector databases for information retrieval |
| Consistency | Fixed IDs throughout training & usage | Dynamic; vectors can vary depending on context/model |
| Language Understanding | Token-level processing | Contextual/semantic-level understanding |
In an LLM, the tokenizer handles the initial step of converting text to tokens, and the model then processes these tokens to understand and generate language.
The internal representation in LLMs often includes an embeddings layer. This layer maps the tokens (IDs) into a vector space, similar to what embedding models do in vector stores. The LLM’s embeddings layer allows it to capture the semantic meaning and relationships between tokens for language understanding.
When using a vector store, the embeddings model takes the final tokenized input (often in the form of words or phrases) and creates a vector, which is then stored for efficient retrieval.
A tokenizer is crucial for LLMs to break down text into tokens and convert them into numerical IDs that the model can process. It provides a structured input/output format.
An embeddings model captures the deeper semantic meaning of text and converts it into vectors for tasks like similarity search or clustering, often used in vector databases.
Both play a critical role in text understanding but operate at different layers: tokenizers work at the token level, while embeddings work at the meaning level.
If you provide a paragraph to an embeddings model, the output is a single vector representing the entire paragraph: one point in a high-dimensional space that captures the paragraph's overall meaning and context. This vector is used for tasks that require understanding semantic relationships, searching for similar content, or clustering information based on meaning.
Input: You input a paragraph (a block of text) to the embeddings model.
Embedding Process: The embeddings model processes the paragraph, considering the context and meaning of the entire text, including the relationships between words, sentences, and phrases.
Output Vector: The result is a single vector—essentially a list of numbers—that represents the paragraph in a high-dimensional semantic space. This vector encodes the paragraph's meaning, capturing nuances and context beyond individual words.
The vector is a dense numerical representation where each dimension captures certain features of the input text, such as sentiment, topics, and contextual relationships.
Texts that have similar meanings or are contextually related will have vectors that are close to each other in this high-dimensional space.
The embedding effectively compresses the information in the paragraph into a format that is both compact and semantically meaningful.
The vector's dimensionality (e.g., 256, 512, 768, etc.) depends on the embeddings model being used. Higher-dimensional embeddings can capture more complex and subtle aspects of the text.
For example, if an embeddings model outputs a 512-dimensional vector, the paragraph is represented by a vector with 512 numbers.
Single Vector for Entire Paragraph:
Some embeddings models generate a single vector for the entire input paragraph. This means the whole paragraph is treated as one unit, and the output is a single vector summarizing the meaning.
Multiple Vectors (Sentence-Level or Word-Level):
Other embeddings models might generate vectors for individual words or sentences within the paragraph. In this case, you could average, pool, or otherwise combine these vectors to get a single vector representing the whole paragraph.
This is useful if you want more granular control or analysis over parts of the paragraph.
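A simple sketch of that combination step is mean pooling, assuming a placeholder sentence-level `embed` function and a deliberately naive sentence split.

```python
import numpy as np

# Sketch: combining sentence-level vectors into one paragraph vector by mean
# pooling. `embed` is a placeholder for any sentence-level embeddings model.

def paragraph_vector(paragraph, embed):
    # Naive sentence split, for illustration only.
    sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
    sentence_vectors = np.array([embed(s) for s in sentences])
    return sentence_vectors.mean(axis=0)  # average pooling over sentences
```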
The ListKnowledgeBases API in Amazon Bedrock serves several important functions and has a specific scope. Here are the key points about its scope and function:
Listing Knowledge Bases:
It retrieves a list of all knowledge bases associated with an AWS account.
Provides a comprehensive view of the knowledge bases available in Amazon Bedrock.
Information Retrieval:
For each knowledge base, it returns summary information including:
• Knowledge base ID
• Name
• Description
• Current status
• Last updated time
Pagination Support:
Allows specifying a maximum number of results per request (maxResults).
Provides a nextToken for retrieving subsequent batches of results (see the pagination sketch after this section).
Filtering and Sorting:
While not explicitly mentioned, it may support filtering and sorting options to help manage large numbers of knowledge bases.
Access Control:
Respects IAM permissions, ensuring users can only list knowledge bases they have access to.
Error Handling:
Provides specific error responses for various scenarios like access denied, internal server errors, or validation issues.
Integration with AWS SDKs:
Can be easily integrated with various programming languages through AWS SDKs.
Monitoring and Auditing:
Supports AWS CloudTrail, allowing for auditing and monitoring of API usage.
Resource Management:
Helps in managing and organizing knowledge bases within Amazon Bedrock.
Scalability:
Designed to handle accounts with potentially large numbers of knowledge bases.
When using this API, it's important to implement proper error handling, respect API rate limits, and ensure your IAM policies grant the necessary permissions while adhering to the principle of least privilege. For the most up-to-date and detailed information, always refer to the official AWS documentation.
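As an illustrative sketch, paginated listing with boto3 might look like the following; parameter and response field names follow the ListKnowledgeBases documentation as I understand it and should be verified against the current SDK reference.

```python
import boto3

# Sketch: listing every knowledge base in the account with pagination.

client = boto3.client("bedrock-agent")

def list_all_knowledge_bases():
    summaries, token = [], None
    while True:
        kwargs = {"maxResults": 50}
        if token:
            kwargs["nextToken"] = token
        response = client.list_knowledge_bases(**kwargs)
        summaries.extend(response.get("knowledgeBaseSummaries", []))
        token = response.get("nextToken")
        if not token:                      # no more pages to fetch
            return summaries
```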
To sync and list knowledge bases in Amazon Bedrock, you can follow these best practices:
Use the ListKnowledgeBases API:
This API allows you to list all knowledge bases in your account.
You can specify the maximum number of results to return per request.
Use pagination with the nextToken parameter to retrieve all results if there are more than the specified maximum.
Implement regular synchronization:
Set up a scheduled task or Lambda function to periodically call the ListKnowledgeBases API.
This ensures your local list of knowledge bases stays up-to-date with what's in Amazon Bedrock.
Store knowledge base information:
Save the retrieved knowledge base summaries in a local database or cache.
Include details like knowledge base ID, name, description, status, and last updated time.
Implement differential updates:
Compare the retrieved list with your local storage.
Update only the changed or new knowledge bases to minimize data transfer and processing (see the sketch after this list).
Use the AWS SDK for easier implementation:
The SDK provides convenient methods for pagination and error handling.
Implement error handling:
Handle potential exceptions like AccessDeniedException, ThrottlingException, or ValidationException.
Implement appropriate retry logic for transient errors.
Consider using the GetKnowledgeBase API:
If you need detailed information about specific knowledge bases, use this API after listing.
It provides more comprehensive details about individual knowledge bases.
Manage permissions carefully:
Ensure your IAM roles or users have the necessary permissions to list and access knowledge bases.
Follow the principle of least privilege when assigning permissions.
Monitor API usage:
Keep track of your API calls to stay within service limits and avoid throttling.
Implement logging:
Log synchronization activities for auditing and troubleshooting purposes.
Remember to consult the official AWS documentation for the most up-to-date information on API usage, best practices, and any service-specific considerations when working with Amazon Bedrock knowledge bases.
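To illustrate the differential-update idea above, here is a hedged sketch that compares the listed summaries against a local cache keyed by knowledge base ID; the `knowledgeBaseId` and `updatedAt` field names are assumptions based on the summary shape described in the API documentation.

```python
# Sketch: differential sync of knowledge base summaries into a local cache.
# Reuses list_all_knowledge_bases() from the earlier sketch.

def sync_knowledge_bases(local_cache):
    for summary in list_all_knowledge_bases():
        kb_id = summary["knowledgeBaseId"]            # assumed field name
        cached = local_cache.get(kb_id)
        if cached is None or cached.get("updatedAt") != summary.get("updatedAt"):
            local_cache[kb_id] = summary              # new or changed entry only
    return local_cache
```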
| Concept | OpenSearch Indexing | BM25 using TF-IDF |
|---|---|---|
| What is it? | A search engine platform (like Elasticsearch) that includes indexing and querying mechanisms | A mathematical algorithm used to rank documents by relevance |
| Role | Infrastructure for storing/searching data | Scoring model used inside search systems (like OpenSearch) |
| Focus | How data is indexed, stored, and retrieved | How relevance is calculated when you search |
| Uses TF-IDF? | Not directly; uses BM25, which improves upon TF-IDF | Based on TF-IDF logic but adds tuning parameters for better relevance |
What is OpenSearch?
OpenSearch (like Elasticsearch) is a search engine that:
Stores documents (like articles, logs, etc.)
Builds inverted indexes (mapping words → documents)
Lets you search using query DSL
Uses ranking algorithms like BM25 under the hood
Features:
Full-text search
Filtering and aggregations
Scalable, distributed indexing
Works with BM25 by default to rank results
What is TF-IDF?
TF (Term Frequency): How often a word appears in a document
IDF (Inverse Document Frequency): How rare the word is across documents
TF-IDF score = TF × IDF
Used to rank documents by how well they match a query.
What is BM25?
BM25 is an improved version of TF-IDF used in most modern search engines (including OpenSearch).
It adds:
Document length normalization (short docs don’t get unfairly favored)
Saturation curve for TF (avoids overweighting repeated terms)
BM25 ≈ smarter TF-IDF.
OpenSearch is the engine.
BM25 is the default ranking algorithm OpenSearch uses internally.
BM25 is based on ideas from TF-IDF, but more accurate and flexible.
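To make that relationship concrete, here is a toy BM25 scorer; `k1` and `b` use the common default values 1.2 and 0.75, which are illustrative rather than a statement about OpenSearch's exact configuration.

```python
import math
from collections import Counter

# Toy BM25 scorer: k1 controls term-frequency saturation,
# b controls document-length normalization.

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # BM25's IDF variant
        f = tf[term]                                      # term frequency in this doc
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * (f * (k1 + 1)) / denom if f else 0.0
    return score

corpus = [doc.split() for doc in [
    "opensearch ranks documents with bm25",
    "tf idf weighs rare terms more heavily",
    "bm25 adds length normalization to tf idf",
]]
query = "bm25 tf idf".split()
for doc in corpus:
    print(round(bm25_score(query, doc, corpus), 3), " ".join(doc))
```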
Imagine a library search system:
OpenSearch is the entire library software — cataloging books, handling user searches.
BM25 is the relevance calculator inside — deciding which books are most relevant when you type a keyword.
TF-IDF is the older method — BM25 is the upgrade.