The first version of any RAG pipeline usually looks the same: embed a query, search a vector store, stuff the results into a prompt, and let the LLM answer. This “naive RAG” gets you from zero to a working prototype fast, but it starts falling apart the moment you deploy it against real users with real questions. Queries are ambiguous, retrieved chunks miss the mark, and the LLM confidently hallucinates when the context isn’t relevant enough.
After building and debugging several RAG systems, I’ve found that four patterns consistently make the difference between a demo that impresses and a system people actually rely on. Here’s what they are and how to implement them.
Why Naive RAG Isn’t Enough
Naive RAG has three fundamental weaknesses. First, it treats every query the same way — a simple factual question like “What is our refund policy?” gets the same retrieval pipeline as a complex research question like “How does our pricing compare to competitors for enterprise contracts over 500 seats?” Second, it trusts retrieval results blindly. If the vector search returns irrelevant chunks, the LLM has no way to reject them. Third, single-query retrieval means you only see one angle of the problem, missing context that a rephrased query might surface.
The four patterns below address each of these failure modes directly.
Pattern 1 — Adaptive RAG with Query Routing
The core insight behind Adaptive RAG is simple: not every question needs retrieval. If someone asks “What’s 2 + 2?” or “Summarize this document I just pasted,” hitting your vector database is wasted compute and latency. The idea is to add a lightweight classifier in front of your pipeline that routes queries to the appropriate strategy.
A practical implementation uses a small, fast model (or even a rule-based classifier) to categorize each incoming query:
from enum import Enum
from pydantic import BaseModel


class QueryType(Enum):
    NO_RETRIEVAL = "no_retrieval"    # Simple facts, math, greetings
    SINGLE_RETRIEVAL = "single"      # Standard lookup
    MULTI_RETRIEVAL = "multi"        # Complex, multi-source needed
    EXTERNAL_SEARCH = "external"     # Needs web/current data


class QueryRouter(BaseModel):
    """Classify the query type using a small LLM call."""
    query: str
    has_attached_context: bool = False

    def classify(self, llm_client) -> QueryType:
        system_prompt = """Classify this user query into exactly one category:
- NO_RETRIEVAL: Simple factual questions, math, greetings, or queries
  where the user has already provided all needed context.
- SINGLE_RETRIEVAL: Questions that need a single knowledge base lookup.
- MULTI_RETRIEVAL: Complex questions needing multiple sources or chunks.
- EXTERNAL_SEARCH: Questions requiring current/external information not
  likely in the knowledge base.
Return ONLY the category name."""
        response = llm_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": self.query},
            ],
            temperature=0,
        )
        category = response.choices[0].message.content.strip()
        # The prompt asks for the member *name* (e.g. "NO_RETRIEVAL"),
        # so look the enum up by name, not by value
        return QueryType[category]
The cost of this extra classification step is negligible — a small model call adds roughly 20-50ms and a fraction of a cent — but the savings compound fast. In production systems I’ve seen, 30-40% of queries can bypass retrieval entirely, cutting average latency by more than half for those requests.
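The rule-based classifier mentioned above can be as simple as a few regexes that short-circuit the obvious cases and defer everything else to the LLM router. A minimal sketch — the patterns, keywords, and route names here are illustrative assumptions, not tuned values:

```python
import re
from typing import Optional

# Hypothetical fast-path rules; tune the patterns to your own traffic
GREETING_RE = re.compile(r"^(hi|hello|hey|thanks|thank you)\b", re.IGNORECASE)
MATH_RE = re.compile(r"^[\d\s+\-*/().%=?]+$")
FRESHNESS_KEYWORDS = ("today", "latest", "current price")


def rule_based_route(query: str) -> Optional[str]:
    """Route obvious cases cheaply; return None to defer to the LLM router."""
    stripped = query.strip()
    if GREETING_RE.match(stripped) or MATH_RE.match(stripped):
        return "no_retrieval"
    if any(kw in stripped.lower() for kw in FRESHNESS_KEYWORDS):
        return "external"
    return None  # Ambiguous -- fall through to the LLM-based classifier
```

Because it runs in microseconds, you lose nothing by trying the rules first and only paying for the small-model call when they return None.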
Pattern 2 — Corrective RAG with Retrieval Grading
Corrective RAG (CRAG) adds a quality gate between retrieval and generation. After you fetch documents, a grader evaluates whether they’re actually relevant to the question. If the relevance score is below a threshold, the system takes corrective action — either refining the query and retrying, falling back to a web search, or telling the LLM to answer from its own knowledge.
import json


def grade_document(query: str, document: str, llm_client) -> dict:
    """Score a single document's relevance to the query."""
    prompt = f"""On a scale of 1-10, how relevant is this document to the query?
If the document contains information that helps answer the query, score high.
If it's tangentially related but not useful, score low (1-3).
If it's completely unrelated, score 1.

Query: {query}
Document: {document[:2000]}

Respond in JSON: {{"score": <int>, "reason": "<brief explanation>"}}"""
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
def corrective_rag(query, retriever, llm_client, threshold=6):
    """Retrieve documents, grade them, and take corrective action."""
    documents = retriever.search(query, top_k=5)
    graded = []
    for doc in documents:
        result = grade_document(query, doc.text, llm_client)
        graded.append({"doc": doc, "score": result["score"]})

    passed = [g for g in graded if g["score"] >= threshold]
    if len(passed) >= 2:
        # Enough relevant documents — proceed to generation
        return [g["doc"] for g in passed]
    elif len(passed) == 1:
        # Marginal — try a refined query and merge results
        refined_docs = retriever.search(
            refine_query(query, llm_client), top_k=3
        )
        return [passed[0]["doc"]] + refined_docs
    else:
        # Retrieval failed — fall back to web search or admit ignorance
        return web_search_fallback(query)
The key design decision is the threshold. Set it too high and you’ll trigger expensive fallback paths constantly. Set it too low and you’ll pass irrelevant noise to the generator. In practice, a threshold of 5-6 out of 10 works well as a starting point, then tune based on your evaluation dataset.
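Tuning that threshold doesn't have to be guesswork. If you collect (grader score, ground-truth relevance) pairs by running `grade_document` over a labeled eval set, a simple F1 sweep picks the cutoff for you. A hedged sketch — the function name and data shape are assumptions, not part of the pattern:

```python
def sweep_threshold(graded_examples, thresholds=range(1, 11)):
    """Pick the grading threshold that maximizes F1 on a labeled eval set.

    graded_examples: list of (grader_score, is_truly_relevant) pairs,
    collected by running the grader over your evaluation queries.
    """
    best = (None, -1.0)
    for t in thresholds:
        tp = sum(1 for s, rel in graded_examples if s >= t and rel)
        fp = sum(1 for s, rel in graded_examples if s >= t and not rel)
        fn = sum(1 for s, rel in graded_examples if s < t and rel)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if f1 > best[1]:
            best = (t, f1)
    return best  # (threshold, f1)
```

Plain F1 weights false accepts and false rejects equally; if your fallback path is expensive, you might instead maximize a cost-weighted score.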
Pattern 3 — RAG Fusion with Query Expansion
A single query rarely captures everything you need. The user asks “How do I handle authentication in microservices?” but the best documents might be indexed under “JWT tokens,” “service-to-service mTLS,” or “OAuth2 in distributed systems.” RAG Fusion solves this by generating multiple rephrased queries, searching for each one, and merging the results using Reciprocal Rank Fusion (RRF).
def generate_query_variants(query: str, llm_client, n=3) -> list:
    """Generate rephrased versions of the original query."""
    prompt = f"""Generate {n} alternative phrasings of this question.
Each should capture the same intent but use different vocabulary
and phrasing. Return a JSON object with a "queries" key containing
an array of the rewritten questions.

Question: {query}"""
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        response_format={"type": "json_object"},
    )
    data = json.loads(response.choices[0].message.content)
    # Keep the original query alongside the variants
    return [query] + data["queries"]
def reciprocal_rank_fusion(result_lists: list, k: int = 60) -> list:
    """Merge multiple ranked result lists using RRF scoring."""
    scores = {}    # doc_id -> cumulative RRF score
    all_docs = {}  # doc_id -> document object
    for results in result_lists:
        for rank, doc in enumerate(results):
            doc_id = doc.id
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
            if doc_id not in all_docs:
                all_docs[doc_id] = doc
    # Sort by RRF score descending
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [all_docs[doc_id] for doc_id in ranked_ids]


def rag_fusion_search(query, retriever, llm_client):
    """Search with query expansion and RRF merging."""
    queries = generate_query_variants(query, llm_client, n=3)
    all_results = [retriever.search(q, top_k=5) for q in queries]
    return reciprocal_rank_fusion(all_results)
RRF is remarkably effective because it doesn’t need relevance scores — just rankings. This means you can combine results from completely different retrieval backends (vector search, keyword search, SQL queries) without worrying about score calibration. The k parameter controls how much the algorithm favors top-ranked results; the standard value of 60 works well for most cases.
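To see why rankings alone are enough, here is a tiny self-contained demo of RRF over bare document IDs (toy data, independent of the retriever objects above):

```python
def rrf_merge(rankings, k=60):
    """Merge ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


# Two backends with no comparable scores, only orderings:
vector_ranking = ["A", "C", "B"]
keyword_ranking = ["D", "C", "A"]
merged = rrf_merge([vector_ranking, keyword_ranking])
# "C", ranked #2 by both backends, outranks "D", a #1 pick from only one
```

Note that "A", which appears in both lists, edges out everything else, and agreement between backends ("C") beats a single backend's top pick ("D") — exactly the consensus behavior you want from fusion.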
Pattern 4 — Hybrid Search with Semantic Reranking
Vector search is great at capturing semantic meaning but terrible at exact matches. If someone searches for “ERR-1042 error code,” pure vector search might return documents about error handling in general instead of the specific error documentation. Hybrid search runs both a vector (semantic) search and a keyword (BM25) search in parallel, then uses a cross-encoder reranker to sort the combined results by true relevance.
def hybrid_search(query, vector_store, keyword_index, top_k=10):
    """Run both vector and keyword search, merge with RRF."""
    # Semantic search (dense embeddings)
    semantic_results = vector_store.similarity_search(query, k=top_k)
    # Keyword search (BM25 / sparse)
    keyword_results = keyword_index.bm25_search(query, k=top_k)
    # Merge with RRF
    return reciprocal_rank_fusion([semantic_results, keyword_results])


def cross_encoder_rerank(query, documents, model):
    """Rerank documents using a cross-encoder for precision."""
    pairs = [(query, doc.text) for doc in documents]
    scores = model.predict(pairs)
    # Sort by cross-encoder score descending
    ranked = sorted(
        zip(documents, scores), key=lambda x: x[1], reverse=True
    )
    return [doc for doc, score in ranked]
The cross-encoder is the secret weapon here. Unlike embedding models (bi-encoders) that encode the query and document independently, a cross-encoder processes them together, allowing it to capture nuanced relevance that bi-encoders miss. The tradeoff is speed — cross-encoders are too slow to use over your entire document corpus, which is why you use bi-encoder retrieval first to narrow to the top 20-50 candidates, then rerank with the cross-encoder.
Putting It All Together
These four patterns aren’t mutually exclusive — they compose. A production RAG pipeline might use query routing (Pattern 1) at the entry point, hybrid search with reranking (Pattern 4) for retrieval, retrieval grading (Pattern 2) for quality control, and query expansion with RRF (Pattern 3) when the initial results are insufficient.
from sentence_transformers import CrossEncoder

# Load the reranker once at startup, not inside the request path
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def production_rag(query, vector_store, keyword_index, llm_client):
    # Step 1: Route the query
    router = QueryRouter(query=query)
    query_type = router.classify(llm_client)
    if query_type == QueryType.NO_RETRIEVAL:
        return llm_generate(query, context=[])

    # Step 2: Hybrid retrieval
    docs = hybrid_search(query, vector_store, keyword_index, top_k=15)

    # Step 3: Grade and correct
    graded_docs = []
    for doc in docs[:10]:
        result = grade_document(query, doc.text, llm_client)
        if result["score"] >= 5:
            graded_docs.append(doc)

    # Step 4: If too few passed, expand and retry
    if len(graded_docs) < 2:
        fused = rag_fusion_search(query, vector_store, llm_client)
        graded_docs = fused[:5]

    # Step 5: Rerank with cross-encoder
    reranked = cross_encoder_rerank(query, graded_docs, reranker)

    # Step 6: Generate
    return llm_generate(query, context=reranked[:5])
The most important advice: don’t build all of this at once. Start with naive RAG, measure where it fails, and add patterns incrementally. Query routing gives you the biggest latency win for the least effort. Retrieval grading gives you the biggest accuracy improvement. Add fusion and hybrid search when your evaluation shows retrieval coverage is the bottleneck. Every extra component adds latency and cost — include it only when the data justifies it.
Evaluation Matters More Than Architecture
All the patterns in the world won’t help if you can’t measure whether they’re working. Build a small evaluation dataset of 50-100 question-answer pairs from your actual domain, and measure both retrieval quality (recall@k, mean reciprocal rank) and generation quality (faithfulness, relevancy, correctness). Run this eval after every change to your pipeline. Frameworks like RAGAS, TruLens, or Prompt Flow can automate this, but even a manual spreadsheet with scores works for getting started.
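The retrieval metrics mentioned above are only a few lines of plain Python. A minimal sketch, assuming retrieved results are lists of doc IDs and ground truth is a set of relevant IDs per query:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_reciprocal_rank(all_retrieved, all_relevant):
    """MRR: average of 1/rank of the first relevant doc for each query."""
    total = 0.0
    for retrieved_ids, relevant_ids in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(all_retrieved)
```

Generation-side metrics (faithfulness, relevancy) do need an LLM judge, which is where frameworks like RAGAS earn their keep, but the retrieval side is cheap enough to run on every commit.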
The RAG pipeline that works best is the one you’ve measured, iterated on, and tuned for your specific data and users — not the one with the most components.