Advanced RAG Patterns for Production Systems

Basic RAG — embed, retrieve, generate — gets you 60-70% of the way. These advanced patterns close the remaining gap: hybrid retrieval, multi-stage reranking, query decomposition, corrective RAG, and agentic retrieval.

Key Takeaways

  • Hybrid retrieval (vector + BM25) improves recall by 15-25% over vector-only search
  • Cross-encoder reranking boosts answer accuracy by 10-20% with minimal latency impact
  • Query decomposition handles complex multi-part questions that basic RAG fails on
  • Corrective RAG detects and recovers from retrieval failures automatically
  • Agentic RAG lets an AI agent manage the retrieval process dynamically — the most flexible pattern

Why Basic RAG Isn't Enough

Basic RAG follows a fixed pipeline: embed the query → retrieve top-K documents → stuff them into the prompt → generate. This works for simple factual questions but fails in several predictable ways:

  • Vocabulary mismatch: The user says "revenue" but the document says "gross income" — vector similarity misses the connection
  • Complex questions: "Compare our Q3 and Q4 performance" requires retrieving from two different document sections
  • Noisy retrieval: Top-K returns partially relevant documents that dilute the useful context
  • Missing context: The answer spans multiple documents, but the retriever only gets one piece
  • Wrong context: Semantically similar but actually irrelevant documents are retrieved

Advanced RAG patterns address each of these failure modes. We've deployed all of these patterns in production — most notably in our compliance review system processing 50,000+ regulatory documents.

Hybrid Retrieval

Combine dense vector search (semantic) with sparse keyword search (BM25) for the best of both worlds:

  • Vector search excels at semantic similarity — understanding meaning, paraphrases, and conceptual relationships
  • BM25 excels at exact matching — specific terms, product names, regulatory codes, acronyms

Reciprocal Rank Fusion (RRF) merges the results: each document gets a score based on its rank in each result set, then scores are combined. This is simple, robust, and requires no tuning — it's our default retrieval strategy for production RAG.
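A minimal sketch of RRF, assuming each retriever returns an ordered list of document IDs (function and variable names here are illustrative, not a specific library's API):

```python
from collections import defaultdict

def rrf_fuse(result_lists, k=60):
    """Merge ranked result lists with Reciprocal Rank Fusion.

    Each list is an ordered sequence of document IDs, best first.
    A document's fused score is the sum of 1 / (k + rank) over the
    lists it appears in; k=60 is the commonly used default.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked well by both retrievers rises to the top:
vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]
fused = rrf_fuse([vector_hits, bm25_hits])  # d1 and d3 lead
```

Because only ranks matter, RRF needs no score normalization across the two retrievers, which is why it works without tuning.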

Performance impact: 15-25% improvement in recall@10 compared to vector-only retrieval, with near-zero additional latency since BM25 runs in parallel with vector search.

Implementation Notes

  • Most vector databases now support hybrid search natively (Pinecone, Weaviate, Qdrant)
  • Weight ratio: Start with 0.7 vector / 0.3 BM25, tune based on your domain
  • For highly technical domains (legal, medical), increase BM25 weight — exact terminology matters more
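The weight ratio in the notes above implies score-level fusion rather than rank-level RRF. A sketch of that approach, under the assumption that both retrievers expose raw scores (cosine similarity and BM25 live on different scales, so normalize before blending; names are illustrative):

```python
def weighted_fuse(vector_scores, bm25_scores, w_vector=0.7):
    """Blend vector and BM25 scores with a tunable weight.

    Scores arrive on different scales (cosine similarity vs. BM25),
    so min-max normalize each set to [0, 1] before combining.
    """
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    v, b = normalize(vector_scores), normalize(bm25_scores)
    docs = set(v) | set(b)
    fused = {doc: w_vector * v.get(doc, 0.0) + (1 - w_vector) * b.get(doc, 0.0)
             for doc in docs}
    return sorted(fused, key=fused.get, reverse=True)
```

Lowering `w_vector` toward 0.5 is the knob to turn for terminology-heavy domains.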

Multi-Stage Reranking

The initial retrieval casts a wide net (retrieve 20-50 results). Reranking narrows to the most relevant results (top 3-5) for the LLM context.

Stage 1: Cross-Encoder Reranking

A cross-encoder model (such as Cohere Rerank, BAAI/bge-reranker-v2, or ms-marco-MiniLM) takes the query and each candidate document as a pair and outputs a relevance score. Unlike bi-encoder embeddings (which encode query and document separately), cross-encoders see both together — capturing fine-grained relevance that embeddings miss.

Accuracy improvement: 10-20%. Latency cost: 50-150ms for 20 candidates. Worth it for nearly every production RAG system.
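The reranking stage itself is a small piece of glue code. In the sketch below, `overlap_score` is a toy stand-in for a real cross-encoder (with sentence-transformers you would call `CrossEncoder(...).predict` on the query/document pairs instead); all names are illustrative:

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score candidates with a cross-encoder-style scorer, keep top_n.

    score_fn(query, doc) -> float stands in for a real cross-encoder
    that sees the query and document together.
    """
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc),
                    reverse=True)
    return ranked[:top_n]

def overlap_score(query, doc):
    """Toy stand-in: fraction of query terms appearing in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

docs = ["compliance costs rose in q4",
        "the cafeteria menu changed",
        "q4 compliance costs and policy changes"]
top = rerank("q4 compliance costs", docs, overlap_score, top_n=2)
```

Batching the pairs into a single model call keeps the 50-150ms latency figure realistic for 20 candidates.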

Stage 2: Diversity Filter

After reranking by relevance, apply a diversity filter to avoid redundant context. If three top documents all say the same thing, you're wasting context window space. Use MMR (Maximal Marginal Relevance) to balance relevance and diversity.
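MMR can be computed directly from the reranker's relevance scores plus pairwise document similarities. A minimal sketch, assuming those scores are precomputed (the data shapes here are an assumption, not a library API):

```python
def mmr_select(relevance, similarity, lam=0.7, n=3):
    """Greedy Maximal Marginal Relevance selection.

    relevance: {doc_id: relevance score from the reranker}
    similarity: {(doc_a, doc_b): pairwise similarity in [0, 1]}
    lam trades off relevance (1.0) against diversity (0.0).
    """
    def sim(a, b):
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < n:
        def mmr_score(doc):
            # Penalize docs similar to anything already selected.
            redundancy = max((sim(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With near-duplicate top documents, the second pick skips the duplicate in favor of a less similar but still relevant one.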

Stage 3: Recency Boost

For time-sensitive domains (news, regulations, product docs), apply a recency boost after reranking: more recent documents get a score multiplier. Make the decay function configurable — linear, exponential, or a step-based cutoff.
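An exponential variant of the decay is a one-liner; a sketch with an assumed half-life parameter (all names illustrative):

```python
from datetime import date

def recency_multiplier(doc_date, today, half_life_days=90):
    """Exponential decay: a doc exactly half_life_days old gets x0.5."""
    age_days = (today - doc_date).days
    return 0.5 ** (age_days / half_life_days)

def apply_recency_boost(scored, dates, today, half_life_days=90):
    """Multiply each reranker score by its document's recency factor."""
    boosted = {doc: score * recency_multiplier(dates[doc], today,
                                               half_life_days)
               for doc, score in scored.items()}
    return sorted(boosted, key=boosted.get, reverse=True)
```

Tune `half_life_days` per corpus: short for news, long for slow-moving regulatory text.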

Query Decomposition

Complex questions often need multiple retrieval passes. Query decomposition breaks a complex question into simpler sub-questions:

Original: "Compare our healthcare compliance costs in Q3 vs Q4 and identify which policy changes drove the increase."

Decomposed:

  1. "What were the healthcare compliance costs in Q3?"
  2. "What were the healthcare compliance costs in Q4?"
  3. "What policy changes occurred between Q3 and Q4?"
  4. "Which policy changes affect compliance costs?"

Each sub-question gets its own retrieval pass. Results are aggregated and the LLM generates a comprehensive answer using all retrieved context.
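The decompose-retrieve-aggregate flow can be sketched as follows; `llm` and `retriever` are hypothetical callables standing in for your model client and search index, and the prompt wording is an assumption:

```python
def decompose_and_retrieve(question, llm, retriever, max_subqs=5):
    """Split a complex question into sub-questions, retrieve for each.

    llm(prompt) -> str returns one sub-question per line;
    retriever(query) -> list of document strings.
    """
    prompt = ("Break this question into at most "
              f"{max_subqs} standalone sub-questions, one per line:\n"
              f"{question}")
    sub_questions = [q.strip() for q in llm(prompt).splitlines() if q.strip()]
    context, seen = [], set()
    for sub_q in sub_questions[:max_subqs]:
        for doc in retriever(sub_q):
            if doc not in seen:  # deduplicate across retrieval passes
                seen.add(doc)
                context.append(doc)
    return sub_questions, context
```

The aggregated `context` then goes into a single generation prompt alongside the original question.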

When to Use

  • Questions with comparison keywords ("compare," "vs," "difference between")
  • Multi-part questions ("What's X and how does it affect Y?")
  • Time-range questions ("How did X change from 2024 to 2025?")

Contextual Compression

Retrieved documents often contain 80% irrelevant content. A 2,000-token chunk might have only 200 tokens that answer the question. Contextual compression extracts only the relevant portions:

  1. Retrieve full documents/chunks
  2. Use an LLM or specialized model to extract only the sentences relevant to the query
  3. Pass the compressed context to the generation LLM

Benefits: Fits more relevant information in the same context window, reduces noise, and lowers generation cost (fewer input tokens). Trade-off: additional LLM call adds latency and cost. Use a small, fast model (GPT-4o-mini) for compression.
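The extraction step reduces to filtering sentences through a relevance judge. In this sketch, `relevant_fn` stands in for the small-LLM call (e.g. asking the model whether a sentence helps answer the query); the naive sentence split and all names are assumptions:

```python
def compress(query, chunk, relevant_fn):
    """Keep only the sentences the relevance judge accepts.

    relevant_fn(query, sentence) -> bool stands in for a cheap LLM
    call that judges whether the sentence helps answer the query.
    """
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    kept = [s for s in sentences if relevant_fn(query, s)]
    return ". ".join(kept) + ("." if kept else "")
```

Batching all sentences of a chunk into one judge call keeps the added latency to a single round trip per chunk.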

Corrective RAG (CRAG)

Standard RAG blindly trusts retrieval results. CRAG adds a self-correction step:

  1. Retrieve documents as normal
  2. Evaluate retrieval quality — does the retrieved content actually answer the question?
  3. If high confidence: proceed with generation
  4. If low confidence: reformulate the query and try again, or fall back to web search
  5. If no relevant results: acknowledge the knowledge gap instead of hallucinating

The evaluation step uses an LLM or fine-tuned classifier to assess relevance. This prevents the most common RAG failure: generating confident-sounding answers from irrelevant context.

Agentic RAG

The most flexible pattern. Instead of a fixed pipeline, an AI agent manages the entire retrieval process:

  • Agent receives the question
  • Decides whether to search (or if it already knows the answer)
  • Formulates the optimal search query
  • Evaluates retrieved results
  • Decides: answer now, search again with a different query, or ask for clarification
  • Can search multiple knowledge bases, web, or APIs

Agentic RAG handles questions that fixed pipelines can't: "Find all mentions of our liability coverage changes in the past 6 months and summarize the net impact." The agent iteratively searches, collects, and synthesizes across multiple retrieval passes.

Implementation: Use LangGraph for the agentic loop with search tools, evaluation nodes, and conditional routing based on retrieval confidence.
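LangGraph expresses this as a graph of nodes with conditional edges; the plain-Python loop below sketches the same control flow with stub callables (the `agent` decision format and all names are assumptions for illustration):

```python
def agentic_answer(question, agent, search_tools, max_steps=5):
    """Minimal agent loop: the agent chooses the next action each step.

    agent(question, history) -> {"action": "search", "tool": ..., "query": ...}
                             or {"action": "answer", "text": ...};
    search_tools maps tool names to retriever callables.
    """
    history = []
    for _ in range(max_steps):
        decision = agent(question, history)
        if decision["action"] == "answer":
            return decision["text"]
        results = search_tools[decision["tool"]](decision["query"])
        history.append((decision["query"], results))
    return "Step budget exhausted; answering with partial context."
```

The step budget is the safety rail: without it, an agent that never gains confidence can loop on retrieval indefinitely.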

Graph-Enhanced RAG

Knowledge graphs add relational context that vector search misses. When you need to answer questions about relationships between entities — "Who approved this contract and what other deals were they involved in?" — graph RAG shines.

Architecture: Documents are processed to extract entities and relationships, stored in a knowledge graph (Neo4j, Amazon Neptune). At query time, graph queries retrieve structured relationship data alongside vector search results. The combined context gives the LLM both document content and entity relationships.
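At query time, the combination step is mostly prompt assembly. A sketch assuming the graph query has already returned (subject, relation, object) triples (e.g. from a Cypher query against Neo4j; the data shapes and names here are illustrative):

```python
def build_graph_rag_context(vector_chunks, graph_triples):
    """Combine vector-search passages with graph relationship facts.

    vector_chunks: list of retrieved text passages.
    graph_triples: (subject, relation, object) tuples from the graph.
    """
    facts = "\n".join(f"- {s} {r} {o}" for s, r, o in graph_triples)
    passages = "\n\n".join(vector_chunks)
    return (f"Known relationships:\n{facts}\n\n"
            f"Relevant passages:\n{passages}")
```

The LLM then sees both the raw document evidence and the structured relationships in one context.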

Best for: Domains with complex entity relationships — legal (contracts, parties, obligations), healthcare (patients, providers, treatments), and financial services (accounts, transactions, regulations).

Evaluation Framework

Every RAG improvement should be measured. Key metrics:

Metric | What It Measures | Target
Retrieval Recall@K | % of relevant docs in top K results | > 85%
Retrieval Precision@K | % of top K results that are relevant | > 70%
Answer Accuracy | % of answers that are factually correct | > 90%
Faithfulness | % of answer claims supported by retrieved docs | > 95%
Latency (p95) | End-to-end response time | < 3 seconds
Cost per Query | Total API + compute cost | < $0.05

Use frameworks like RAGAS or custom evaluation suites. Run evaluations on every code change in CI/CD. Track metrics over time to catch regressions early.
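The two retrieval metrics are simple enough to compute directly in a custom evaluation suite; a sketch (function names are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k retrieved."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in set(relevant)) / len(top_k)
```

Averaging these over a labeled query set gives the recall@10 and precision@K numbers to track across pipeline changes.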


Frequently Asked Questions

What is hybrid retrieval?

Hybrid retrieval combines dense vector search (semantic similarity) with sparse keyword search (BM25) using reciprocal rank fusion. This captures both semantic meaning and exact keyword matches, improving recall by 15-25%.

What is a reranker and why does it matter?

A reranker is a cross-encoder model that re-scores retrieved documents by examining query and document together. Unlike embedding models that encode them separately, cross-encoders capture fine-grained relevance. A reranker typically adds a 10-20% accuracy improvement.

What is agentic RAG?

Agentic RAG uses an AI agent to manage retrieval dynamically — deciding when to search, what to search for, whether to refine the query, and when it has enough information. Handles complex multi-step questions that fixed pipelines fail on.
