Module 08 — RAG Phase

Advanced RAG & Multimodal

Basic RAG retrieves chunks and stuffs them into a prompt. Advanced RAG transforms this linear pipeline into a sophisticated system with query rewriting, hierarchical indexing, parent-child chunk relationships, graph-based retrieval, multimodal document processing, and self-correcting loops. This module covers the techniques that separate demo-quality RAG from production-grade systems handling complex, multi-hop questions across heterogeneous document collections.

Query Transformation
Parent-Child Chunks
Graph RAG
Multimodal RAG
Self-Corrective RAG
Production Patterns
01

Query Transformation

Plain Language

Users rarely ask questions in the way that is most effective for retrieval. A user might ask "Why is my app slow?" when the actual documents discuss "performance optimization," "latency reduction," and "caching strategies." Query transformation bridges this gap by rewriting, expanding, or decomposing the user's query before it hits the vector store. Think of it as having a research librarian who, when you ask "Why is my app slow?", translates that into three targeted searches: "application performance bottlenecks," "latency optimization techniques," and "caching best practices."

The three main query transformation techniques are query rewriting, multi-query generation, and step-back prompting. Query rewriting uses an LLM to reformulate the question for better retrieval — removing conversational fluff, expanding abbreviations, and adding relevant technical terms. Multi-query generation creates multiple different versions of the same question and runs separate retrievals for each, then merges the results. This is powerful because different phrasings match different documents — "deployment strategies" and "how to deploy to production" and "CI/CD pipeline setup" all retrieve different relevant chunks. Step-back prompting generates a more abstract version of the question that captures the broader concept, which helps when the specific question is too narrow to match existing documents.

Query decomposition handles complex multi-hop questions by breaking them into simpler sub-questions. When a user asks "How does our company's RAG system compare to the industry standard for financial document processing?", a single retrieval will likely fail because no single document chunk addresses this compound question. Decomposition splits it into: (1) "How does our RAG system process financial documents?" and (2) "What is the industry standard for financial document RAG?" Each sub-question retrieves its own context, and the LLM synthesizes the combined evidence into a final comparative answer.

HyDE (Hypothetical Document Embeddings) is an elegant transformation technique where instead of embedding the query directly, you first ask the LLM to generate a hypothetical answer to the question, then embed that hypothetical answer and use it for retrieval. The insight is that a hypothetical answer is semantically closer to the actual documents than the question is — documents contain statements, not questions. This often produces dramatically better retrieval results, especially for technical and scientific content where the language of questions differs significantly from the language of answers.

Deep Dive

from openai import OpenAI
import json

client = OpenAI()

# --- Multi-Query Generation ---
def generate_multi_queries(question: str, n: int = 3) -> list[str]:
    """Generate N alternative phrasings of a question for retrieval."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": f"""Generate {n} alternative versions of the given question for
document retrieval. Each version should use different keywords and phrasing
to maximize the chance of finding relevant documents. Return as JSON array."""
        }, {
            "role": "user",
            "content": question
        }],
        temperature=0.7,
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("queries", [question])

queries = generate_multi_queries("Why is my RAG system returning irrelevant results?")
# ["How to improve RAG retrieval accuracy and relevance",
#  "Common causes of poor vector search results in RAG pipelines",
#  "Debugging retrieval quality issues in retrieval-augmented generation"]

# --- HyDE (Hypothetical Document Embeddings) ---
def generate_hyde(question: str) -> str:
    """Generate a hypothetical document that would answer the question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """Write a short paragraph that would appear in a technical document
answering the given question. Write it as factual content, not as a response
to a question. This will be used for document retrieval."""
        }, {
            "role": "user",
            "content": question
        }],
        temperature=0.5,
        max_tokens=200
    )
    return response.choices[0].message.content

# Instead of embedding the question, embed the hypothetical doc
hyde_doc = generate_hyde("How does PagedAttention work?")
# "PagedAttention is a memory management technique for LLM inference
#  that partitions the key-value cache into fixed-size pages..."
# → This document-like text matches actual docs better than the question

# --- Query Decomposition ---
def decompose_query(question: str) -> list[str]:
    """Break a complex question into simpler sub-questions."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """Break the complex question into 2-4 simpler sub-questions
that can be answered independently. Each sub-question should be self-contained.
Return as JSON array under key "sub_questions"."""
        }, {
            "role": "user",
            "content": question
        }],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("sub_questions", [question])
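Step-back prompting, described earlier, has no snippet above. Here is a minimal sketch; the `llm` parameter is an assumption added so the logic can be exercised without a live API call (by default it calls the chat API):

```python
STEP_BACK_SYSTEM = (
    "Rewrite the question as a more general question about the broader "
    "concept it involves. Return only the rewritten question."
)

def step_back(question: str, llm=None) -> str:
    """Generate an abstract 'step-back' version of a question.

    `llm` is injectable for testing; by default it calls the chat API.
    """
    if llm is None:
        def llm(system: str, user: str) -> str:
            from openai import OpenAI  # lazy import keeps the sketch self-contained
            resp = OpenAI().chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "system", "content": system},
                          {"role": "user", "content": user}],
                temperature=0.0,
            )
            return resp.choices[0].message.content
    return llm(STEP_BACK_SYSTEM, question).strip()

# e.g. broadens "Why is my vLLM server running out of GPU memory?"
# toward the general concept of GPU memory management in LLM inference
```

The step-back query is then retrieved alongside (not instead of) the original query, and both result sets feed the generator.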
When to Use What

Multi-query: Always a good default — low cost, high impact. HyDE: Best for technical/scientific content where questions differ from document language. Decomposition: For multi-hop questions ("compare X and Y", "how does A affect B through C"). Step-back: When specific queries fail, broaden to the general concept.

02

Parent-Child Chunks

Plain Language

The fundamental tension in chunking is between retrieval precision and context completeness. Small chunks (256 tokens) retrieve with high precision — the retrieved content closely matches the query — but they often lack the surrounding context needed to generate a complete answer. Large chunks (2048 tokens) provide plenty of context but dilute the signal with irrelevant text, reducing retrieval precision. Parent-child chunking resolves this tension by using a two-level hierarchy: small "child" chunks are used for retrieval (high precision), but when a child chunk matches, the system returns its "parent" chunk (rich context).

Imagine you are searching through a textbook. The index at the back lists specific terms and the exact pages where they appear — this is like small chunk retrieval, highly precise. But when you flip to that page, you read the entire section, not just the sentence where the term appears — this is like returning the parent chunk. The parent provides the full context: the introduction to the concept, the explanation, the examples, and the caveats. By searching with child precision and reading with parent completeness, you get the best of both worlds.

Implementation involves creating two sets of chunks from each document. First, you create large parent chunks (e.g., 2000 tokens) that correspond to full sections or major paragraphs. Then you split each parent into multiple smaller child chunks (e.g., 400 tokens). The child chunks are embedded and stored in the vector database for retrieval, but each child carries a reference (an ID) back to its parent. When the retrieval system finds a relevant child chunk, it looks up the parent ID and returns the full parent chunk to the LLM. This way, the embedding and search operate on focused, precise chunks, but the LLM receives broader, more complete context.

A related technique called sentence window retrieval uses even finer granularity. Each individual sentence is embedded for maximum retrieval precision, and when a sentence matches, the system returns a "window" of N sentences before and after the matching sentence. This provides local context without requiring a pre-defined parent-child hierarchy. It works particularly well for documents with dense, information-rich content where every sentence matters — like legal contracts, medical guidelines, or technical specifications.

Deep Dive

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Parent chunks: large, context-rich (full sections)
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
)

# Child chunks: small, precise (for retrieval)
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
)

# Storage for parent documents
docstore = InMemoryStore()

# Vector store for child chunk embeddings
vectorstore = Chroma(
    collection_name="parent_child",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_parent_child"
)

# ParentDocumentRetriever handles the two-level logic
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Index documents (automatically creates parent + child chunks);
# `docs` is assumed to be a list of LangChain Document objects loaded earlier
retriever.add_documents(docs)

# Retrieve: searches child chunks, returns parent chunks
results = retriever.invoke("How does PagedAttention manage GPU memory?")
for doc in results:
    print(f"[{len(doc.page_content)} chars] {doc.page_content[:100]}...")
    # Returns ~2000 char parent chunks even though child chunks were searched

For a manual implementation without LangChain that gives you more control:

import uuid
import chromadb

client = chromadb.PersistentClient(path="./chroma_pc")
child_collection = client.get_or_create_collection("children")
parent_store = {}  # In production, use Redis or a database

def index_with_parent_child(documents: list[str]):
    """Index documents with parent-child chunk hierarchy."""
    for doc in documents:
        # Create parent chunks
        parents = parent_splitter.split_text(doc)
        for parent_text in parents:
            parent_id = str(uuid.uuid4())
            parent_store[parent_id] = parent_text

            # Create child chunks from this parent
            children = child_splitter.split_text(parent_text)
            child_ids = [f"{parent_id}_c{i}" for i in range(len(children))]

            child_collection.add(
                ids=child_ids,
                documents=children,
                metadatas=[{"parent_id": parent_id} for _ in children]
            )

def retrieve_parents(query: str, k: int = 3) -> list[str]:
    """Search child chunks, return parent chunks."""
    results = child_collection.query(query_texts=[query], n_results=k * 2)

    # Deduplicate parent IDs and return parent chunks
    seen_parents = set()
    parents = []
    for meta in results["metadatas"][0]:
        pid = meta["parent_id"]
        if pid not in seen_parents:
            seen_parents.add(pid)
            parents.append(parent_store[pid])
        if len(parents) >= k:
            break
    return parents
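The sentence-window variant can be sketched in plain Python. Only the windowing logic is shown; each sentence would be embedded with a `sent_idx` metadata field (an assumed name), and the splitter is deliberately naive — production systems would use spaCy or nltk:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter on sentence-ending punctuation (use spaCy/nltk in production)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_window(sentences: list[str], hit_index: int, window: int = 2) -> str:
    """Return the matched sentence plus `window` sentences of context on each side."""
    lo = max(0, hit_index - window)
    hi = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[lo:hi])

# Indexing sketch: embed each sentence with metadata {"doc_id": ..., "sent_idx": i};
# at query time, find the best-matching sentence, then hand
# sentence_window(doc_sentences, matched_sent_idx) to the LLM instead of the bare hit.
```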
Size Guidelines

Parent chunks: 1500–2500 tokens (full sections). Child chunks: 200–500 tokens (focused passages). Ratio: typically 3–6 children per parent. If children are too small, retrieval becomes noisy; if parents are too large, the LLM wastes context window on irrelevant text.

03

Graph RAG

Plain Language

Standard RAG treats each document chunk as an independent island — there is no understanding of how concepts relate to each other across chunks. Graph RAG addresses this by building a knowledge graph alongside the vector index. A knowledge graph is a network of entities (people, concepts, products, processes) connected by relationships ("uses," "depends on," "is part of," "preceded by"). When a user asks a question, the system can traverse these relationships to find information that is connected to the query topic, even if it would not be found through simple vector similarity.

Consider the question: "What are all the dependencies of our payment processing module?" Standard vector search might find chunks that directly mention "payment processing dependencies." But the actual dependencies might be spread across dozens of documents — a configuration file that imports certain libraries, an architecture diagram showing service connections, a requirements document listing external APIs, and a deployment guide mentioning infrastructure prerequisites. A knowledge graph connects all of these: "PaymentModule → depends_on → StripeAPI," "PaymentModule → uses → PostgreSQL," "PaymentModule → requires → AuthService." By traversing the graph from the "PaymentModule" node, the system finds all related information regardless of whether the word "dependency" appears in each document.

Microsoft's GraphRAG research demonstrated that knowledge graphs are particularly effective for global questions — questions about themes, summaries, and high-level concepts that span many documents. Standard RAG excels at local questions where the answer exists in one or two chunks. But questions like "What are the main themes in this document collection?" or "How do these technologies relate to each other?" require aggregating information across the entire corpus, which is exactly what a knowledge graph enables through community detection, entity resolution, and hierarchical summarization.

Deep Dive

Building a knowledge graph for RAG involves three steps: entity extraction (finding entities in text), relationship extraction (finding how entities relate), and graph construction (building the queryable graph structure). LLMs excel at the first two steps because they understand natural language context. Here is an implementation using an LLM for extraction and NetworkX for the graph:

from openai import OpenAI
import networkx as nx
import json

client = OpenAI()
graph = nx.DiGraph()

EXTRACT_PROMPT = """Extract entities and relationships from the following text.
Return JSON with:
- "entities": [{"name": "...", "type": "concept|technology|process|person|org"}]
- "relationships": [{"source": "...", "target": "...", "relation": "..."}]

Text: {text}"""

def extract_graph(chunk: str) -> dict:
    """Extract entities and relationships from a text chunk."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": EXTRACT_PROMPT.format(text=chunk)}],
        response_format={"type": "json_object"},
        temperature=0.0
    )
    return json.loads(response.choices[0].message.content)

def build_graph(chunks: list[str]):
    """Build knowledge graph from document chunks."""
    for chunk in chunks:
        data = extract_graph(chunk)
        for entity in data.get("entities", []):
            graph.add_node(
                entity["name"],
                type=entity.get("type", "concept"),
                chunks=[chunk]
            )
        for rel in data.get("relationships", []):
            graph.add_edge(
                rel["source"], rel["target"],
                relation=rel["relation"]
            )

def graph_retrieve(query: str, hops: int = 2) -> list[str]:
    """Retrieve context by traversing the knowledge graph."""
    # Step 1: Extract entities from query
    query_data = extract_graph(query)
    seed_entities = [e["name"] for e in query_data.get("entities", [])]

    # Step 2: Find matching nodes in graph
    matched = [n for n in graph.nodes if any(
        e.lower() in n.lower() for e in seed_entities
    )]

    # Step 3: Traverse N hops from matched nodes
    related_nodes = set(matched)
    for _ in range(hops):
        neighbors = set()
        for node in related_nodes:
            if node in graph:
                neighbors.update(graph.successors(node))
                neighbors.update(graph.predecessors(node))
        related_nodes.update(neighbors)

    # Step 4: Collect chunks associated with related nodes
    context_chunks = []
    for node in related_nodes:
        if node in graph.nodes:
            context_chunks.extend(graph.nodes[node].get("chunks", []))
    return context_chunks
Graph RAG vs Vector RAG

Use Graph RAG alongside vector RAG, not instead of it. Vector search handles most queries well. Graph traversal adds value for: (1) multi-hop questions, (2) "find all related..." queries, (3) global summarization, and (4) questions about relationships between concepts. The typical pattern is to run both in parallel and merge results.
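The run-both-and-merge pattern can be sketched as below; `vector_search` and `graph_search` are assumed callables (e.g. thin wrappers around a Chroma query and `graph_retrieve`) that each return a list of context strings:

```python
from itertools import zip_longest

def hybrid_retrieve(query: str, vector_search, graph_search, k: int = 5) -> list[str]:
    """Merge vector and graph retrieval results, interleaved and deduplicated."""
    merged, seen = [], set()
    # Interleave so both sources contribute near the top of the context
    for pair in zip_longest(vector_search(query), graph_search(query)):
        for chunk in pair:
            if chunk is not None and chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged[:k]
```

Interleaving is one reasonable merge policy; reranking the union with a cross-encoder is another common choice.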

04

Multimodal RAG

Plain Language

Real-world documents are not just text. They contain tables, charts, diagrams, photographs, equations, and complex layouts that carry critical information. A financial report's value is in its tables and charts as much as its text. An engineering manual's diagrams explain concepts that words alone cannot. Multimodal RAG extends the pipeline to handle these non-text elements — extracting information from images and tables, embedding them alongside text, and enabling the LLM to reason over visual content when generating answers.

There are two main approaches to multimodal RAG. The first is extraction-based: you use vision models (GPT-4V, Claude) or specialized tools (table extraction, OCR) to convert images and tables into text descriptions, which are then embedded and indexed alongside regular text chunks. This approach is simple and works with any text-based retrieval system, but it loses visual nuances that text descriptions cannot capture. The second approach is native multimodal: you store images directly and use multimodal embedding models (CLIP, Cohere multimodal) that can embed both text and images into the same vector space. During retrieval, a text query can match relevant images, and during synthesis, the multimodal LLM can see the actual images alongside text context.

Table extraction deserves special attention because tables are the most common non-text element in enterprise documents, and they contain some of the most important information (pricing, specifications, comparisons, metrics). Simply converting a table to plain text loses its structure — a row that reads "Product A | $99 | Premium" becomes meaningless without column headers. Effective table RAG preserves the tabular structure, either by converting to markdown table format (which LLMs understand well) or by storing the table as a structured object with row/column metadata.

Deep Dive

The extraction-based approach uses a vision model to describe images and tables found in documents. This works well with existing text-only RAG infrastructure:

import base64
from openai import OpenAI
from pathlib import Path

client = OpenAI()

def describe_image(image_path: str) -> str:
    """Use GPT-4V to generate a text description of an image."""
    image_data = Path(image_path).read_bytes()
    b64 = base64.b64encode(image_data).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": """Describe this image in detail for use in a document retrieval system.
Include all text visible in the image, describe any charts/diagrams/tables
with their data, and explain what the image represents."""},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{b64}"
                }}
            ]
        }],
        max_tokens=1000
    )
    return response.choices[0].message.content

# Extract and index image descriptions alongside text chunks
description = describe_image("./docs/architecture_diagram.png")
# → "This architecture diagram shows a three-tier system with a React
#    frontend communicating via REST API to a FastAPI backend, which
#    connects to PostgreSQL with pgvector for vector storage..."

# This description can be embedded and indexed like any text chunk

For table extraction from PDFs, combining pdfplumber or camelot with markdown formatting preserves structure:

import pdfplumber

def extract_tables_as_markdown(pdf_path: str) -> list[str]:
    """Extract tables from PDF and convert to markdown format."""
    tables_md = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            for table in page.extract_tables():
                if not table:
                    continue

                # Convert to markdown table
                # pdfplumber returns None for empty cells; render them as blanks
                headers = table[0]
                md = "| " + " | ".join(str(h or "") for h in headers) + " |\n"
                md += "| " + " | ".join("---" for _ in headers) + " |\n"
                for row in table[1:]:
                    md += "| " + " | ".join(str(c or "") for c in row) + " |\n"

                tables_md.append(f"[Table from page {page_num}]\n{md}")

    return tables_md
Multimodal Costs

Vision model API calls for image description are 5–10x more expensive than text calls. For a document with 50 images, budget ~$2–5 for the extraction step. Cache descriptions aggressively — re-describe images only when documents are updated.

05

Self-Corrective RAG

Plain Language

Standard RAG is a one-shot pipeline: retrieve → generate → done. If the retrieval returns irrelevant documents or the generated answer is unfaithful to the context, there is no mechanism to detect or fix the problem. Self-corrective RAG adds feedback loops that check the quality of retrieval and generation, and retry or adjust when quality is insufficient. Think of it as adding a proofreader who reads the answer, checks it against the sources, and sends it back for revision if something is wrong.

The most influential framework for self-corrective RAG is CRAG (Corrective RAG), which adds a retrieval evaluator between the retrieval and generation steps. After retrieving documents, CRAG uses an LLM to judge whether the retrieved documents are relevant to the query. If documents are judged relevant, proceed normally. If ambiguous, supplement with a web search. If irrelevant, discard the retrieved documents entirely and fall back to a web search or a different retrieval strategy. This prevents the common failure mode where the LLM generates an answer based on tangentially related but ultimately unhelpful retrieved content.

Self-RAG goes further by adding reflection tokens to the generation process. During generation, the model periodically evaluates: "Is this claim supported by the retrieved evidence?" If not, it can trigger a new retrieval for the unsupported claim, search for additional evidence, or explicitly flag the claim as uncertain. This creates a generate-evaluate-retrieve loop that continues until the model is confident that every claim in its answer is grounded in evidence.

In practice, self-corrective RAG is implemented using LangGraph or similar workflow frameworks that support conditional branching and loops. The key components are: (1) a relevance grader that evaluates retrieval quality, (2) a hallucination checker that verifies generated claims against context, (3) a query rewriter that reformulates the query when retrieval fails, and (4) a loop controller that limits iterations to prevent infinite cycles. The extra LLM calls add latency and cost, but the reliability improvement is dramatic for production applications where incorrect answers have real consequences.

Deep Dive

Here is a self-corrective RAG implementation using LangGraph that implements the CRAG pattern with retrieval grading, query rewriting, and generation with hallucination checking:

from langgraph.graph import StateGraph, END
from typing import TypedDict
from openai import OpenAI
import json

client = OpenAI()

class RAGState(TypedDict):
    question: str
    documents: list[str]
    generation: str
    retries: int

def retrieve(state: RAGState) -> RAGState:
    """Retrieve documents from the vector store (`retriever` is your existing wrapper)."""
    docs = retriever.search(state["question"], k=5)
    return {**state, "documents": [d["text"] for d in docs]}

def grade_documents(state: RAGState) -> RAGState:
    """Grade retrieved documents for relevance."""
    relevant = []
    for doc in state["documents"]:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Is this document relevant to the question?
Question: {state['question']}
Document: {doc[:500]}
Answer with just "yes" or "no"."""
            }],
            temperature=0.0, max_tokens=3
        )
        if "yes" in response.choices[0].message.content.lower():
            relevant.append(doc)
    return {**state, "documents": relevant}

def should_rewrite(state: RAGState) -> str:
    """Decision: rewrite query or generate answer."""
    if len(state["documents"]) < 2 and state["retries"] < 2:
        return "rewrite"
    return "generate"

def rewrite_query(state: RAGState) -> RAGState:
    """Rewrite query for better retrieval."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""The following question did not retrieve good results.
Rewrite it to be more specific and use different keywords.
Original: {state['question']}
Rewritten:"""
        }],
        temperature=0.7
    )
    new_q = response.choices[0].message.content.strip()
    return {**state, "question": new_q, "retries": state["retries"] + 1}

def generate(state: RAGState) -> RAGState:
    """Generate answer from retrieved context."""
    context = "\n\n".join(state["documents"])
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based only on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {state['question']}"}
        ],
        temperature=0.1
    )
    return {**state, "generation": response.choices[0].message.content}

# Build the graph
workflow = StateGraph(RAGState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade", grade_documents)
workflow.add_node("rewrite", rewrite_query)
workflow.add_node("generate", generate)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges("grade", should_rewrite, {
    "rewrite": "rewrite",
    "generate": "generate"
})
workflow.add_edge("rewrite", "retrieve")  # Loop back
workflow.add_edge("generate", END)

app = workflow.compile()

# Run
result = app.invoke({
    "question": "How does PagedAttention improve throughput?",
    "documents": [],
    "generation": "",
    "retries": 0
})
print(result["generation"])
Figure 2 — Self-corrective RAG loop: retrieve → grade → rewrite or generate
06

Production Patterns

Plain Language

Moving RAG from a demo to production requires addressing several challenges that do not surface during prototyping. Document versioning ensures that users always search the latest version of each document and that old versions are properly archived. Incremental indexing avoids re-embedding the entire corpus every time a single document changes. Metadata filtering lets users (or the system) restrict searches to specific document categories, date ranges, or departments. Caching at the embedding and retrieval layers prevents redundant computation for repeated or similar queries. Observability through logging every stage of the pipeline enables debugging when answers are wrong.

Chunking strategy selection in production often involves testing multiple approaches. You might use semantic chunking for long-form documentation, table-aware chunking for financial reports, and fixed-size chunking for chat logs. Different document types benefit from different strategies, and a production system should support this heterogeneity rather than forcing a one-size-fits-all approach. The key is to evaluate each strategy against real user queries using metrics like retrieval precision, recall, and end-to-end answer quality.

The most critical production pattern is a RAG evaluation pipeline. Without systematic evaluation, you are flying blind — you do not know whether your chunking, embedding, retrieval, or synthesis is the bottleneck. A proper evaluation pipeline tests each stage independently: Does the retriever find the right documents? (Retrieval metrics.) Is the generated answer faithful to the context? (Faithfulness metric.) Does the answer actually address the question? (Relevance metric.) Module 10 covers evaluation in depth, but the key principle is: instrument your pipeline from day one, not after problems surface in production.

Deep Dive

A production-ready RAG service wraps all the components into a clean FastAPI application with proper logging, caching, and error handling:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import json, logging, time

app = FastAPI(title="RAG Service")
logger = logging.getLogger(__name__)

class RAGRequest(BaseModel):
    question: str = Field(..., min_length=3, max_length=2000)
    collection: str = "default"
    top_k: int = Field(default=5, ge=1, le=20)
    filters: dict = Field(default_factory=dict)
    use_rerank: bool = True

class RAGResponse(BaseModel):
    answer: str
    sources: list[dict]
    retrieval_ms: float
    generation_ms: float
    total_ms: float

@app.post("/v1/rag", response_model=RAGResponse)
async def rag_endpoint(req: RAGRequest):
    start = time.perf_counter()

    # Retrieval phase
    t0 = time.perf_counter()
    chunks = retriever.search(
        req.question, k=req.top_k, use_rerank=req.use_rerank
    )
    retrieval_ms = (time.perf_counter() - t0) * 1000

    if not chunks:
        raise HTTPException(
            status_code=404,
            detail="No relevant documents found for this query."
        )

    # Generation phase
    t1 = time.perf_counter()
    answer = rag_query(req.question, chunks)
    generation_ms = (time.perf_counter() - t1) * 1000

    total_ms = (time.perf_counter() - start) * 1000

    # Log for observability
    logger.info(json.dumps({
        "question": req.question,
        "num_retrieved": len(chunks),
        "retrieval_ms": round(retrieval_ms),
        "generation_ms": round(generation_ms),
        "total_ms": round(total_ms),
        "answer_length": len(answer),
    }))

    return RAGResponse(
        answer=answer,
        sources=[{"text": c["text"][:200], "score": c["score"]} for c in chunks],
        retrieval_ms=round(retrieval_ms, 2),
        generation_ms=round(generation_ms, 2),
        total_ms=round(total_ms, 2),
    )
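The caching and incremental-indexing patterns described above can be sketched as a content-hash cache: unchanged chunks hash to the same key and are never re-embedded. `embed_fn` is an assumed wrapper around your embeddings API, and the in-memory dict stands in for Redis or a database:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings keyed by content hash, so unchanged chunks are never re-embedded."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn          # e.g. a wrapper around the embeddings API
        self.store: dict[str, list[float]] = {}  # swap for Redis/a DB in production
        self.misses = 0

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.misses += 1              # only new or changed content hits the API
            self.store[key] = self.embed_fn(text)
        return self.store[key]
```

Re-indexing a corpus through this cache touches only the chunks whose content actually changed, which is the essence of incremental indexing.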
Pattern                            When to Use                      Complexity
Basic RAG (retrieve → generate)    Prototyping, simple Q&A          Low
+ Hybrid search (vector + BM25)    Mixed keyword/semantic queries   Low-Medium
+ Reranking                        When precision matters           Medium
+ Parent-child chunks              Long documents needing context   Medium
+ Query transformation             Diverse user query styles        Medium
+ Self-corrective loop             High-stakes answers              High
+ Graph RAG                        Multi-hop, relational queries    High
+ Multimodal                       Tables, images, diagrams         High
Start Simple, Add Complexity

Begin with basic RAG + hybrid search + reranking. This covers 80% of use cases. Add parent-child chunks if users complain about incomplete answers. Add query transformation if retrieval fails on paraphrased queries. Add self-corrective loops only for high-stakes domains where answer accuracy is critical.

Interview Ready

How to Explain This in 2 Minutes

Elevator Pitch

Advanced RAG goes beyond the basic retrieve-and-stuff pattern by adding layers of intelligence at every stage of the pipeline. Before retrieval, query expansion and rewriting techniques generate multiple reformulations of the user's question to cast a wider net. During retrieval, hybrid search combines dense vector similarity with sparse keyword matching (BM25) to catch both semantic and lexical matches. After retrieval, a cross-encoder reranker scores each query-document pair jointly to surface the truly relevant chunks, discarding false positives that fooled the initial similarity search. Self-RAG and CRAG add self-corrective loops — the system evaluates whether retrieved documents actually support an answer and, if not, falls back to web search or re-retrieves with a rewritten query. Evaluation metrics like context relevance, faithfulness, and answer completeness (automated via frameworks like RAGAS) close the feedback loop, letting you measure and continuously improve each stage. These techniques collectively transform demo-quality RAG into production-grade systems that handle ambiguous, multi-hop, and adversarial queries reliably.

Likely Interview Questions

Question | What They're Really Asking
What is hybrid search and why is it better than pure vector search for RAG? | Do you understand the complementary strengths of dense embeddings (semantic similarity) and sparse retrieval (exact keyword matching via BM25)?
Explain how reranking works and where it fits in a RAG pipeline. | Can you distinguish between bi-encoder retrieval (fast, approximate) and cross-encoder reranking (slow, precise), and articulate why the two-stage approach is necessary?
What query expansion techniques would you use to improve retrieval recall? | Do you know about HyDE (Hypothetical Document Embeddings), multi-query generation, and step-back prompting — and when each is appropriate?
How do Self-RAG and CRAG differ from standard RAG? | Can you explain self-corrective loops, reflection tokens, retrieval grading, and the concept of the LLM deciding when retrieval is needed vs. when it already knows the answer?
How do you evaluate an advanced RAG system end to end? | Do you know the key metrics — context precision, context recall, faithfulness, answer relevancy — and how frameworks like RAGAS automate LLM-as-judge evaluation across retrieval and generation stages?

Model Answers

1. Hybrid search vs. pure vector search — Pure vector search excels at semantic matching — it finds passages that mean the same thing even if they use different words. But it struggles with exact keyword queries like product codes, error messages, or proper nouns where lexical matching is critical. Hybrid search combines dense vector retrieval with sparse BM25 retrieval using Reciprocal Rank Fusion (RRF) to merge the two result sets. RRF assigns each document a score of 1 / (k + rank) from each retrieval method and sums them, naturally boosting documents that appear in both lists. In practice, hybrid search improves recall by 10–20% over either method alone because dense and sparse retrievers have complementary failure modes — vectors miss exact matches, BM25 misses paraphrases. Most production vector databases (Weaviate, Qdrant, Pinecone) support hybrid search natively with a configurable alpha parameter to weight dense vs. sparse contributions.
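The RRF formula described above fits in a few lines of Python. This is a minimal sketch: the document IDs and the two ranked lists are hypothetical, and `k=60` is the commonly used default constant.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc IDs: score(d) = sum of 1 / (k + rank) per list."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; docs appearing in both lists naturally rise.
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_a", "doc_b", "doc_c"]   # from vector search
sparse_hits = ["doc_b", "doc_d", "doc_a"]  # from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# doc_b ranks first: it scored well in both lists, so its fused score is highest.
```

Note that RRF only needs ranks, not raw scores, which is why it can merge a cosine-similarity list with a BM25 list without any score normalization.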

2. Reranking in the RAG pipeline — The retrieval stage uses a bi-encoder — query and documents are embedded independently, and similarity is computed via dot product or cosine distance. This is fast (sub-millisecond over millions of vectors) but approximate, because the query and document never "see" each other during encoding. A cross-encoder reranker takes each (query, document) pair as a single input and produces a joint relevance score. This is far more accurate because the model attends across both texts simultaneously, catching nuances a bi-encoder misses. The two-stage pattern is: retrieve top-20 to top-50 candidates with the bi-encoder (fast, high recall), then rerank to select the top-3 to top-5 with the cross-encoder (slow, high precision). Models like Cohere Rerank, BGE-Reranker, or cross-encoder/ms-marco-MiniLM-L-6-v2 add 100–300ms of latency but improve answer quality by 15–25%. Always set a relevance threshold — if no reranked chunk scores above the threshold, return "I don't have enough information" rather than hallucinating.

3. Query expansion techniques — Query expansion rewrites the user's original question into multiple forms to improve retrieval coverage. HyDE (Hypothetical Document Embeddings) asks the LLM to generate a hypothetical answer, then embeds that answer instead of the question — this works because a hypothetical answer is closer in embedding space to the actual documents than a short question is. Multi-query generation produces 3–5 diverse reformulations of the question, runs retrieval for each, and merges the results. Step-back prompting transforms a specific question ("What was Apple's Q3 2024 revenue?") into a broader one ("What were Apple's recent financial results?") to retrieve more comprehensive context. Sub-question decomposition breaks multi-hop questions into atomic parts, retrieves for each, and synthesizes. The choice depends on query type: HyDE for factual lookups, multi-query for ambiguous questions, step-back for overly specific queries, and decomposition for complex multi-part questions.
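Multi-query generation plus RRF-based merging can be sketched like this. Both `generate_reformulations` (normally an LLM call) and `toy_search` (normally a vector-store query) are hypothetical stubs used only to make the control flow concrete.

```python
def generate_reformulations(question: str) -> list[str]:
    # Stand-in for an LLM call that returns diverse rewrites of the question.
    base = question.rstrip("?")
    return [question, f"techniques for {base}", f"best practices: {base}"]

def multi_query_retrieve(question: str, search_fn, k: int = 5) -> list[str]:
    # One retrieval per reformulation, then RRF-merge and dedupe by doc ID.
    scores: dict[str, float] = {}
    for q in generate_reformulations(question):
        for rank, doc_id in enumerate(search_fn(q, k), start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (60 + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]

def toy_search(query: str, k: int) -> list[str]:
    # Deterministic stand-in for a vector-store search.
    if "techniques" in query:
        return ["doc_cache", "doc_perf"][:k]
    return ["doc_perf", "doc_faq"][:k]

top_docs = multi_query_retrieve("why is my app slow?", toy_search, k=3)
# doc_perf wins: it was retrieved by all three reformulations.
```

The dedupe-by-ID step inside the score dictionary is what prevents the duplicate-chunk problem discussed in the Common Mistakes section below.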

4. Self-RAG and CRAG — Standard RAG always retrieves and always uses retrieved context, even when the LLM already knows the answer or when the retrieved documents are irrelevant. Self-RAG introduces reflection tokens that let the model decide at each step: (1) should I retrieve at all? (2) are the retrieved documents relevant? (3) is my generated response supported by the evidence? (4) is the response useful to the user? The model is trained to emit these reflection tokens, enabling it to skip retrieval for simple factual queries and to self-correct when retrieved context is insufficient. CRAG (Corrective RAG) takes a different approach — it grades retrieved documents using a lightweight evaluator, and if the documents score below a confidence threshold, it triggers a corrective action: re-retrieving with a rewritten query, falling back to web search, or decomposing the question into sub-queries. CRAG is more practical for production because it works with any off-the-shelf LLM without special training, while Self-RAG requires fine-tuning the model to produce reflection tokens.
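The CRAG control loop is easy to express with injected callables, which is also why it works with any off-the-shelf LLM. Everything below is a hedged sketch: the grader, rewriter, and web-search fallback are hypothetical stubs, and the threshold is illustrative.

```python
CONFIDENCE_THRESHOLD = 0.7  # illustrative grading cutoff

def corrective_rag(question, retrieve, grade, rewrite, web_search, max_rewrites=1):
    """CRAG-style loop: grade retrieval, then correct or fall back.

    All callables are injected, so the loop works with any retriever/LLM.
    """
    query = question
    for attempt in range(max_rewrites + 1):
        docs = retrieve(query)
        if docs and grade(question, docs) >= CONFIDENCE_THRESHOLD:
            return {"docs": docs, "source": "corpus", "query": query}
        if attempt < max_rewrites:
            query = rewrite(question)  # LLM reformulation, then re-retrieve
    # Still low confidence: fall back to web search instead of guessing.
    return {"docs": web_search(question), "source": "web", "query": question}

# Demo with stubbed components (all hypothetical):
result = corrective_rag(
    "why did my payment fail?",
    retrieve=lambda q: ["kb_17"] if "billing" in q else ["kb_99"],
    grade=lambda q, docs: 0.9 if docs == ["kb_17"] else 0.3,
    rewrite=lambda q: "billing error after address change",
    web_search=lambda q: ["web_result"],
)
# The rewritten query retrieves kb_17, which grades above threshold.
```

In production the third branch is often "escalate to a human" rather than web search, exactly as in the design scenario below; only the last line of the loop changes.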

5. Evaluating advanced RAG systems — Advanced RAG evaluation measures both retrieval and generation quality across four key metrics. Context precision asks: of the chunks we retrieved, what fraction are actually relevant to the question? Context recall asks: of all the relevant information that exists in the corpus, what fraction did we retrieve? Faithfulness measures whether the generated answer is supported by the retrieved context — every claim in the answer should trace back to a specific passage. Answer relevancy checks whether the response actually addresses the question asked. RAGAS automates these using LLM-as-judge: it prompts a strong model (e.g., GPT-4) to decompose the answer into individual claims, verify each claim against the context, and score the overall alignment. Beyond RAGAS, you should build golden evaluation sets — 50–100 question-answer-source triples curated by domain experts — and track metrics over time as you modify the pipeline. The most actionable insight comes from failure analysis: categorize bad answers into retrieval failures (right answer exists but was not retrieved) vs. generation failures (right context was retrieved but the LLM misinterpreted it) and address each with targeted improvements.
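The retrieval-vs-generation failure split described above can be partially automated once each bad answer is annotated. A minimal sketch, assuming hypothetical field names (`gold_source_id`, `retrieved_ids`, `faithful`) on your evaluation records:

```python
def categorize_failure(example: dict) -> str:
    """Sort a bad answer into a retrieval vs. generation failure.

    Hypothetical fields on `example`:
      gold_source_id - chunk ID known to contain the correct answer
      retrieved_ids  - chunk IDs the pipeline actually retrieved
      faithful       - did the answer follow the retrieved context? (LLM-judged)
    """
    if example["gold_source_id"] not in example["retrieved_ids"]:
        return "retrieval_failure"   # fix: chunking, hybrid search, reranking
    if not example["faithful"]:
        return "generation_failure"  # fix: prompt template, model choice
    return "needs_review"            # context and faithfulness OK; inspect manually

report: dict[str, int] = {}
for ex in [
    {"gold_source_id": "c1", "retrieved_ids": ["c2", "c3"], "faithful": True},
    {"gold_source_id": "c5", "retrieved_ids": ["c5"], "faithful": False},
]:
    kind = categorize_failure(ex)
    report[kind] = report.get(kind, 0) + 1
```

Run this over the sampled production stream and the resulting counts tell you whether to invest in the retriever or the generator next.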

System Design Scenario

Design Challenge

You are redesigning a customer support RAG system that currently uses basic vector search. The system handles 10,000 queries per day across 200,000 support articles, but users report that 30% of answers are irrelevant or incomplete. Queries range from simple ("How do I reset my password?") to complex multi-hop questions ("My payment failed after I changed my address and updated my card — what went wrong?"). Design an advanced RAG architecture to reduce the irrelevant answer rate below 5%.

A strong answer should cover:

  • Query classification and routing — classify incoming queries by complexity (simple factual, comparative, multi-hop, troubleshooting) and route each to an appropriate pipeline: simple queries get basic retrieval, multi-hop queries get sub-question decomposition, and troubleshooting queries get step-back prompting followed by sequential reasoning
  • Hybrid search with reranking — replace pure vector search with hybrid retrieval (dense + BM25 via RRF), retrieve top-20 candidates, then rerank with a cross-encoder to select top-5, with a minimum relevance threshold to avoid injecting low-quality context
  • Self-corrective loop (CRAG pattern) — after initial retrieval, grade the relevance of retrieved documents; if confidence is below threshold, rewrite the query using the LLM and re-retrieve; if still below threshold, escalate to a human agent rather than generating an unreliable answer
  • Evaluation and monitoring pipeline — implement RAGAS metrics (faithfulness, context precision, context recall, answer relevancy) on a sampled stream of production queries, build a dashboard tracking these metrics daily, and create an automated alert when any metric drops below its threshold
  • Feedback loop — capture user thumbs-up/thumbs-down signals, map negative feedback to retrieval vs. generation failures, and use the failure analysis to continuously refine chunking strategy, reranker fine-tuning, and prompt templates
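The query-classification-and-routing step from the first bullet can be sketched as a small dispatcher. The keyword heuristics and pipeline names here are hypothetical placeholders — in practice the classifier would be a cheap LLM call or a fine-tuned small model.

```python
def classify_query(question: str) -> str:
    # Toy heuristic stand-in for an LLM classifier over the four categories.
    q = question.lower()
    if " and " in q and ("after" in q or "then" in q):
        return "multi_hop"
    if any(w in q for w in ("failed", "error", "broken", "went wrong")):
        return "troubleshooting"
    if " vs " in q or "compare" in q:
        return "comparative"
    return "simple"

# Hypothetical pipeline names; each maps to a different retrieval strategy.
PIPELINES = {
    "simple": "basic_retrieval",
    "comparative": "multi_query",
    "multi_hop": "sub_question_decomposition",
    "troubleshooting": "step_back_then_sequential",
}

route = PIPELINES[classify_query("How do I reset my password?")]
```

Routing keeps latency and cost proportional to query difficulty: the 80% of simple queries skip the expensive decomposition and self-corrective machinery entirely.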

Common Mistakes

  • Adding every advanced technique at once instead of iterating — Teams often implement hybrid search, reranking, query expansion, and self-corrective loops simultaneously, making it impossible to measure which technique actually improved results. Start with hybrid search + reranking (the highest-impact changes), measure the improvement, then layer on query expansion and CRAG patterns incrementally. Each addition should be A/B tested against the previous baseline.
  • Using query expansion without deduplication or fusion — Multi-query expansion generates multiple reformulations and retrieves for each, but naively concatenating the results produces duplicate chunks and exceeds the context window. Always apply Reciprocal Rank Fusion to merge results and deduplicate by chunk ID. Without fusion, you waste context window tokens on redundant passages and may push the most relevant unique chunks out of the top-k selection.
  • Treating reranker scores as absolute confidence and skipping the relevance threshold — Cross-encoder rerankers produce relative scores, not calibrated probabilities. A top-ranked chunk with a score of 0.6 might still be irrelevant — it is simply less irrelevant than the alternatives. Always set a minimum relevance threshold (calibrated on your evaluation set) below which the system responds with "I could not find a reliable answer" instead of generating from poor context. This single safeguard prevents the majority of confidently wrong answers that erode user trust.