Basic RAG
Description: Simple retrieval-then-generate approach
Flow: Query → Retrieve relevant documents → Generate response using retrieved context
Components: Single vector database, basic embedding model, LLM for generation
Use Case: Simple Q&A applications, proof of concepts
Limitations: No query optimization, basic relevance matching
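As a rough illustration of this flow, here is a toy sketch (not tied to any particular vector store): retrieval is naive keyword overlap over an in-memory list, and the generation step only assembles the augmented prompt where a real system would call an embedding model, a vector database, and an LLM.

# Toy sketch of the retrieve-then-generate flow; retrieval is naive keyword
# overlap and generate() only assembles the prompt an LLM would receive.
documents = [
    "Employees are entitled to 25 days of annual leave under the general policy.",
    "Case 001: respondent 669 filed a motion to dismiss on procedural grounds.",
    "Public guideline: employment contracts must state the notice period.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by keyword overlap with the query (stand-in for vector search).
    q_terms = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_terms & set(d.lower().split())), reverse=True)
    return ranked[:k]

def generate(query: str, context: list[str]) -> str:
    # Assemble the augmented prompt; this is where the LLM call would go.
    return f"Answer '{query}' using only:\n" + "\n".join(f"- {c}" for c in context)

question = "How many days of annual leave do employees get?"
print(generate(question, retrieve(question, documents)))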
Advanced RAG
Description: Enhanced retrieval with pre- and post-processing
Features: Query rewriting, document re-ranking, result filtering
Components: Query optimizer, multiple retrieval strategies, re-ranking models
Improvements: Better relevance, reduced hallucination, contextual understanding
Use Case: Production applications requiring higher accuracy
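To make the re-ranking step concrete, here is a hedged sketch that re-scores first-stage candidates with a cross-encoder. It assumes the sentence-transformers package is installed; the model name and candidate passages are illustrative placeholders, and in practice the candidates would come from the retriever.

# Sketch of post-retrieval re-ranking with a cross-encoder (sentence-transformers).
from sentence_transformers import CrossEncoder

query = "What is the notice period for terminating an employment contract?"
candidates = [  # normally returned by the first-stage retriever
    "Employment contracts must state the notice period for termination.",
    "Case 001 concerns a dispute over unpaid invoices.",
    "The general policy grants 25 days of annual leave.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
scores = reranker.predict([(query, passage) for passage in candidates])

# Sort candidates by relevance score and keep the best ones for generation.
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])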
1. Basic/Foundational RAG
Primary Paper:
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). "Retrieval-augmented generation for knowledge-intensive NLP tasks." Advances in Neural Information Processing Systems, 33, 9459-9474.
Venue: NeurIPS 2020
Citations: 3,000+ (highly influential)
Follow-up Work:
Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., ... & Yih, W. T. (2020). "Dense passage retrieval for open-domain question answering." EMNLP 2020 (arXiv:2004.04906).
2. Self-RAG
Primary Paper:
Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." International Conference on Learning Representations (ICLR), 2024.
Venue: ICLR 2024 (accepted)
arXiv: arXiv:2310.11511
3. Corrective RAG (CRAG)
Primary Paper:
Yan, S., Gu, J. C., Zhu, Y., & Ling, Z. H. (2024). "Corrective Retrieval Augmented Generation." International Conference on Learning Representations (ICLR), 2024.
Venue: ICLR 2024 (accepted)
arXiv: arXiv:2401.15884
4. GraphRAG (Academic Foundations)
Core Concept Papers:
Yasunaga, M., Ren, H., Bosselut, A., Liang, P., & Leskovec, J. (2021). "QA-GNN: Reasoning with language models and knowledge graphs for question answering." NAACL-HLT, 2021.
Venue: NAACL 2021
Zhang, Y., et al. (2023). "Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models." arXiv preprint arXiv:2309.01219.
Microsoft's Implementation (Technical Report):
Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., ... & Larson, J. (2024). "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." arXiv preprint arXiv:2404.16130.
5. Fusion/Hybrid RAG
Rank Fusion Foundations:
Cormack, G. V., Clarke, C. L., & Buettcher, S. (2009). "Reciprocal rank fusion outperforms condorcet and individual rank learning methods." Proceedings of the 32nd international ACM SIGIR conference.
Venue: SIGIR 2009
Dense-Sparse Hybrid:
Gao, L., Ma, X., Lin, J., & Callan, J. (2023). "Precise Zero-Shot Dense Retrieval without Relevance Labels." Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.
Venue: ACL 2023
Lin, S. C., Yang, J. H., & Lin, J. (2021). "Distilling dense representations for ranking using tightly-coupled teachers." arXiv preprint arXiv:2010.11386.
6. Multi-Modal RAG (Emerging Academic Work)
Foundation Papers:
Chen, J., Lin, H., Han, X., & Sun, L. (2023). "M-BEIR: A Multi-domain Benchmark for Multi-modal Information Retrieval." arXiv preprint arXiv:2308.14565.
Li, J., Li, D., Xiong, C., & Hoi, S. (2022). "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation." ICML, 2022.
Venue: ICML 2022
7. Conversational RAG
Academic Foundations:
Qu, C., Yang, L., Qiu, M., Croft, W. B., Zhang, Y., & Iyyer, M. (2019). "BERT with history answer embedding for conversational question answering." Proceedings of the 42nd International ACM SIGIR Conference.
Venue: SIGIR 2019
Anantha, R., Vakulenko, S., Tu, Z., Longpre, S., Pulman, S., & Chappidi, S. (2021). "Open-domain question answering goes conversational via question rewriting." NAACL-HLT, 2021.
Venue: NAACL 2021
Key peer-reviewed venues for RAG research:
NLP: ACL, EMNLP, NAACL, EACL
ML: ICLR, NeurIPS, ICML
IR: SIGIR, ECIR, CIKM, WSDM
AI: AAAI, IJCAI
These papers represent the academic foundation. Many "industry patterns" (like Modular RAG, Agentic RAG) are implementation frameworks from companies like:
LangChain/LangSmith documentation
LlamaIndex research papers (often arXiv, not peer-reviewed)
OpenAI, Anthropic technical reports
AWS, Google Cloud, Microsoft technical documentation
Agentic RAG is essentially a combination/application of:
Basic RAG (Lewis et al., 2020) - for retrieval mechanism
ReAct (Yao et al., 2022) - for reasoning and action planning
Tool Learning (various papers) - for dynamic tool selection
Multi-step reasoning - from planning literature
Academic Status:
The components are academically established
The specific combination as "Agentic RAG" is primarily industry terminology
Industry Sources:
LangChain/LangGraph documentation
LlamaIndex agent frameworks
OpenAI Assistants API documentation
Anthropic Claude function calling
AWS Bedrock Agents
Use case: Suppose we have some 10,000 legal documents. Some are case specific, some are general policies, some are topic specific but public, and some, such as documents about a company's internal cases (ongoing matters or judgements), are case specific and sensitive. If we have to index these 10k docs in a vector store, we need to do some pre-processing before we ingest these files into the vector store.
i.e. if the case file is named case_001.pdf, the metadata file should be created in the same location, named case_001.pdf.metadata.json, with:
{
  "metadataAttributes": {
    "respondent_id": 669,
    "case_id": 1
  }
}
We may need to classify groups of files and assign the right metadata, e.g. for general policies:
{
  "metadataAttributes": {
    "document_type": "policy",
    "sensitivity": "public",
    "access_level": "public",
    "topic": "employment_law"
  }
}
Confidential files would have:
{
  "metadataAttributes": {
    "case_id": "001",
    "document_type": "case_file",
    "sensitivity": "confidential",
    "access_level": "restricted",
    "respondent_id": 669,
    "case_status": "ongoing"
  }
}
Step 1: Prepare and create the metadata files
This means we have to run some kind of pipeline to understand these files and generate the metadata for our use case, which could require services such as entity extraction (Claude, Textract, or Comprehend).
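A minimal sketch of such a pipeline is below. It assumes the documents already sit in S3; the bucket name is a placeholder, and classify() stands in for the entity-extraction/classification step (Claude, Textract, or Comprehend in practice).

# Sketch: write a <filename>.metadata.json next to each document in S3 so the
# vector store (e.g. Bedrock Knowledge Bases) picks it up at ingestion time.
import json
import boto3

BUCKET = "legal-docs-bucket"  # placeholder bucket name
s3 = boto3.client("s3")

def classify(key: str) -> dict:
    # Placeholder for the entity-extraction / classification step.
    if key.startswith("cases/"):
        return {"document_type": "case_file", "sensitivity": "confidential",
                "access_level": "restricted", "case_id": "001", "respondent_id": 669}
    return {"document_type": "policy", "sensitivity": "public",
            "access_level": "public", "topic": "employment_law"}

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".pdf"):
            continue
        metadata = {"metadataAttributes": classify(key)}
        s3.put_object(Bucket=BUCKET,
                      Key=f"{key}.metadata.json",
                      Body=json.dumps(metadata).encode("utf-8"))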
Step 2: Ingest into the vector store
After we add these metadata files, we index, i.e. ingest, the documents into the vector data store.
Key Point: There's no "update metadata" API for documents that are already indexed. You must re-ingest, with the metadata files present in S3 at ingestion time.
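As a sketch, triggering that (re-)ingestion with the boto3 bedrock-agent client could look like this; the knowledge base and data source IDs are placeholders.

# Sketch: start an ingestion (sync) job after documents and .metadata.json
# files are in S3, then poll until it finishes. IDs are placeholders.
import time
import boto3

bedrock_agent = boto3.client("bedrock-agent")

job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="KB_ID_PLACEHOLDER",
    dataSourceId="DS_ID_PLACEHOLDER",
)["ingestionJob"]

while job["status"] not in ("COMPLETE", "FAILED"):
    time.sleep(30)
    job = bedrock_agent.get_ingestion_job(
        knowledgeBaseId="KB_ID_PLACEHOLDER",
        dataSourceId="DS_ID_PLACEHOLDER",
        ingestionJobId=job["ingestionJobId"],
    )["ingestionJob"]
print("Ingestion finished with status:", job["status"])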
Step 3: At query time we send a metadata filter along with the query, for example:
# User with case access
filter = {"equals": {"key": "case_id", "value": "001"}}
# Public documents only
filter = {"equals": {"key": "access_level", "value": "public"}}
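For context, a retrieval call that applies such a filter via the bedrock-agent-runtime Retrieve API might look like the sketch below (the knowledge base ID is a placeholder):

# Sketch: pass the metadata filter with the query at retrieval time.
import boto3

runtime = boto3.client("bedrock-agent-runtime")

response = runtime.retrieve(
    knowledgeBaseId="KB_ID_PLACEHOLDER",  # placeholder
    retrievalQuery={"text": "What is the current status of the motion to dismiss?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "filter": {"equals": {"key": "case_id", "value": "001"}},
        }
    },
)
for result in response["retrievalResults"]:
    print(round(result["score"], 3), result["content"]["text"][:80])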
One point to take home: this feature is meant to optimize retrieval at scale, but it could very easily be used, or rather misused, as an access control mechanism. I would like to explain why that may not be a good idea.
Technically this would work, but consider a scenario: we have a huge dataset of thousands of files, and we ran a pipeline to understand their content and create metadata files for access control. Our data store layer will then withhold documents (by filtering them out). If we hit an edge case where we need a temporary access grant for a specific scenario, we won't be able to do this without the huge cost of re-indexing just for that temporary access!
Relying solely on data-layer filters for primary access control is a security anti-pattern and a hallmark of poor system design: it creates a single point of failure and violates the principle of layered security.
Scalability and Complexity Issues: Implementing complex, fine-grained access control entirely within the database or data layer often leads to a maintenance nightmare as the application grows. It is difficult to manage user permissions and roles in a scalable way within the database itself.
Lack of Context: Data-layer filters often lack the rich contextual information (e.g., time of day, user's device posture, session risk, or business process state) available at higher application layers, which is necessary for effective, adaptive access control.
Metadata filters are NOT ideal for primary access control because:
Role changes require re-indexing - Expensive and slow for large datasets
Access policies are dynamic - User permissions change frequently
Metadata is static - Baked into vectors at ingestion time
Core Principle: Vector Stores = Immutable Content Layer
What belongs IN vector store metadata:
Intrinsic document properties (won't change)
document_type: "legal_brief"
jurisdiction: "EU"
topic: "employment_law"
publication_date: "2024-01-15"
case_id: "001" (the case itself doesn't change)
What belongs OUTSIDE vector store:
Transactional/mutable access data
User roles and permissions
Case assignments (who can access case_001)
Recommended Approach
Enforce Access Controls at the Application Level: Implement a dedicated authorization service or layer in your application logic that handles access control checks. This layer should enforce policy based on the user's identity, role, and the requested action.
Principle of Least Privilege: Ensure that users, processes, and systems only have the minimum necessary access to perform their required tasks.
Use metadata for:
Document categorization (public/internal - broad categories)
Topic/domain filtering
Performance optimization (reduce search scope)
Optimal Architecture:
Query Flow:
User Query
↓
[Access Control Layer] ← DynamoDB/RDS (mutable permissions)
↓ (determines allowed case_ids, topics, etc.)
[Vector Store Query] ← Static metadata filters
↓
[Results Post-Filter] ← Additional security check
↓
Return to User
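A sketch of that flow in application code is below. It assumes a DynamoDB table (here called user_case_access) holding the mutable user-to-case permissions; the table name, key schema, and knowledge base ID are placeholders.

# Sketch of the query flow above: mutable permissions live in DynamoDB, the
# vector store only sees a derived filter on static metadata, and results are
# post-checked before being returned. Names/IDs are placeholders.
import boto3

dynamodb = boto3.resource("dynamodb")
access_table = dynamodb.Table("user_case_access")   # placeholder table
kb_runtime = boto3.client("bedrock-agent-runtime")

def allowed_case_ids(user_id: str) -> list[str]:
    # Access Control Layer: permissions come from a mutable store, not the index.
    item = access_table.get_item(Key={"user_id": user_id}).get("Item", {})
    return list(item.get("case_ids", []))

def secure_retrieve(user_id: str, query: str) -> list[dict]:
    cases = allowed_case_ids(user_id)
    if not cases:
        return []
    # Vector Store Query: filter only on intrinsic, static metadata.
    response = kb_runtime.retrieve(
        knowledgeBaseId="KB_ID_PLACEHOLDER",
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": 10,
                "filter": {"in": {"key": "case_id", "value": cases}},
            }
        },
    )
    # Results Post-Filter: defence in depth before returning to the user.
    return [r for r in response["retrievalResults"]
            if r.get("metadata", {}).get("case_id") in cases]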
Benefits:
Cost: No re-indexing for permission changes
Performance: Vector store does what it's good at (semantic search)
Flexibility: Update access rules in milliseconds, not hours
Scalability: Independent scaling of access control vs search
Audit: Separate permission tracking from content
Multiple Knowledge Bases: When to Split
Split into separate KBs when:
Different update frequencies
Public policies (update quarterly) ≠ Active case files (daily updates)
Static legal precedents ≠ Ongoing litigation docs
Different access patterns
Public documents (all users) vs Confidential cases (restricted users)
Saves cost - don't search confidential KB for public queries
Different retrieval characteristics
Legal precedents: need high recall, search entire corpus
Case-specific: need precision, smaller focused search
Operational independence
HR legal docs managed by HR team
Corporate litigation managed by legal team
Independent update cycles, ownership, SLAs
Legal Use Case - Recommended Split:
KB 1: Public Legal Resources
- General policies
- Public case law
- Legal guidelines
- Update: Monthly
- Access: All users
KB 2: Internal Policies & Procedures
- Company policies
- Internal guidelines
- Update: Quarterly
- Access: All employees
KB 3: Active Case Files
- Ongoing litigation
- Sensitive case materials
- Update: Daily/Weekly
- Access: Case-specific RBAC
KB 4: Closed Case Archive
- Historical cases
- Settled matters
- Update: Rarely (append-only)
- Access: Legal team + authorized
Benefits of This Approach:
Cost Optimization:
Re-index only KB 3 (active cases) frequently
KB 4 (archive) almost never re-indexed
KB 1, 2 on slower schedules
Performance:
Route queries to appropriate KB based on context
Smaller search space = faster, cheaper queries
Public queries never hit expensive confidential KBs
Maintenance:
Independent teams manage their KBs
Failures isolated to one KB
Easier troubleshooting
Access Control:
Simpler RBAC per KB
IAM policies per knowledge base
Clear security boundaries
Single KB Approach - When It Makes Sense:
All documents update at similar frequency
Similar access patterns across all docs
Small dataset (<10K docs)
Unified search experience required
Metadata filtering is sufficient
Hybrid Pattern (Often Best):
3-4 KBs + Smart Routing:
Most queries hit 1 KB (fast, cheap)
Complex queries can search multiple KBs in parallel
Application layer orchestrates multi-KB search when needed
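A rough sketch of that smart routing, assuming one knowledge base per split described above (the IDs, the routing rule, and the legal-team flag are all placeholders):

# Sketch: route a query to the relevant knowledge bases, fan out in parallel
# only when needed, and merge results by score. IDs are placeholders.
from concurrent.futures import ThreadPoolExecutor
import boto3

runtime = boto3.client("bedrock-agent-runtime")

KB_IDS = {                       # placeholder IDs for the 4-KB split above
    "public": "KB_PUBLIC_ID",
    "internal": "KB_INTERNAL_ID",
    "active_cases": "KB_ACTIVE_ID",
    "archive": "KB_ARCHIVE_ID",
}

def route(query: str, user_is_legal_team: bool) -> list[str]:
    # Very rough routing rule; a classifier or LLM router could replace this.
    targets = ["public", "internal"]
    if user_is_legal_team:
        targets += ["active_cases", "archive"]
    return targets

def search_kb(kb_key: str, query: str) -> list[dict]:
    response = runtime.retrieve(
        knowledgeBaseId=KB_IDS[kb_key],
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 3}},
    )
    return response["retrievalResults"]

def multi_kb_search(query: str, user_is_legal_team: bool) -> list[dict]:
    targets = route(query, user_is_legal_team)
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        result_lists = pool.map(lambda k: search_kb(k, query), targets)
    merged = [r for results in result_lists for r in results]
    return sorted(merged, key=lambda r: r.get("score", 0), reverse=True)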
For our hypothetical use case of 10,000 legal docs, I would recommend 3-4 separate KBs based on:
Update frequency (active vs archived)
Sensitivity (public vs confidential)
Access patterns (broad vs case-specific)
This gives you operational flexibility without over-fragmenting. You can always merge later if needed, but splitting an existing KB is harder.
Bottom line: Multiple KBs aligned with your data lifecycle and access patterns = lower cost, better performance, easier maintenance.
Basic RAG: Bedrock Knowledge Bases + Foundation Models (a minimal sketch follows this list)
Agentic RAG: Bedrock Agents + Lambda functions
Conversational RAG: Bedrock + DynamoDB + API Gateway
Hybrid RAG: Kendra + OpenSearch + custom fusion logic
Multi-Modal RAG: Bedrock multimodal models + S3 + Rekognition
Self-RAG: Custom orchestration with quality assessment
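For the first mapping above (Basic RAG), a single managed call covers the retrieve-then-generate loop; a minimal sketch, with the knowledge base ID and model ARN as placeholders:

# Sketch: Basic RAG on Bedrock Knowledge Bases via the RetrieveAndGenerate API.
import boto3

runtime = boto3.client("bedrock-agent-runtime")

response = runtime.retrieve_and_generate(
    input={"text": "Summarise our policy on employee notice periods."},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_PLACEHOLDER",   # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)
print(response["output"]["text"])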
1. Basic/Naive RAG
Paper: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020, NeurIPS)
Status: Foundational paper, highly cited (~3000+ citations)
2. Self-RAG
Paper: "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" (Asai et al., 2023, ICLR 2024)
Status: Peer-reviewed, significant impact
3. Corrective RAG (CRAG)
Paper: "Corrective Retrieval Augmented Generation" (Yan et al., 2024, ICLR 2024)
Status: Recently accepted at top-tier venue
4. GraphRAG
Papers: Multiple works including "Graph-RAG" papers and Microsoft's GraphRAG implementation
Status: Mixed - concept is established, specific implementations vary
5. Fusion/Hybrid RAG
Papers: "Precise Zero-Shot Dense Retrieval without Relevance Labels" (Gao et al., 2023) and related fusion techniques
Status: Rank fusion methods well-established in IR literature
6. Modular RAG - More of an engineering pattern from LangChain, LlamaIndex
7. Agentic RAG - Emerging from agent frameworks, not formalized academically yet
8. Hierarchical RAG - Implementation pattern, limited formal research
9. Adaptive RAG - Engineering optimization, some academic work emerging
10. Multi-Modal RAG - Active research area but fragmented approaches
Others - Mostly industry terminology or implementation patterns
ACL, EMNLP, NAACL - Main NLP venues
ICLR, NeurIPS, ICML - ML conferences with RAG research
SIGIR, ECIR - Information retrieval conferences
arXiv - Preprints (not peer-reviewed but influential)
Know your model's capabilities and limitations
It is important to consider the selected model's known capabilities and limitations when prompting it. In this lab, we are using the Claude 3 models, which come with the following guidelines for effective prompting:
Claude 3 Models:
Input Format: Images need to be provided in a base64-encoded format.
Image Size: Individual image size cannot exceed 5MB
Multiple Images: Claude 3 models support prompting with up to 5 images.
Image Format: Supported image formats: PNG, JPEG, WebP and GIF.
Image Clarity: Clear images that are not too blurry are more effective with Claude 3 models.
Image Placement: Claude 3 models work better when images come before text in the prompt. However, if the use case requires it, images can follow text or be interleaved with text.
Image Resolution: If the image's long edge is more than 1568 pixels, or the image is more than ~1600 tokens, it will first be scaled down, preserving aspect ratio, until it is within size limits. If the input image is too large and needs to be resized, this increases time-to-first-token latency without giving you any additional model performance. Very small images (under 200 pixels on any edge) may lead to degraded performance.
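Putting these guidelines together, a sketch of a single-image prompt to a Claude 3 model on Bedrock follows (base64-encoded image placed before the text, per the placement guidance); the image path and model ID are placeholders.

# Sketch: image-then-text prompt to Claude 3 on Bedrock; the image is
# base64-encoded as required. File path and model ID are placeholders.
import base64
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

with open("contract_scan.png", "rb") as f:           # placeholder image (<5 MB)
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": "List the parties named in this contract page."},
        ],
    }],
}

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps(body),
)
print(json.loads(response["body"].read())["content"][0]["text"])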