Module 07 — RAG Phase

Retrieval-Augmented Generation

RAG is the most impactful pattern in production GenAI. Instead of relying solely on what the model memorized during training, RAG retrieves relevant documents from your own data and injects them into the prompt — giving the model up-to-date, domain-specific knowledge it never saw during training. This module covers every component of the RAG pipeline: document loading, text chunking, embedding generation, vector storage, retrieval strategies, and the synthesis step where the LLM generates answers grounded in retrieved context.

Document Loading
Chunking Strategies
Embeddings
Vector Databases
Retrieval & Reranking
LLM Synthesis
01

What Is RAG

Plain Language

Imagine you are taking an open-book exam. You can look up answers in your textbook, notes, and reference materials before writing your response. Retrieval-Augmented Generation works exactly this way — instead of asking the LLM to answer purely from memory (its training data), we first search through a collection of documents to find the most relevant passages, then hand those passages to the LLM along with the user's question. The LLM reads the retrieved context and writes an answer that is grounded in actual source material rather than potentially hallucinated information.

This pattern solves the three biggest problems with using LLMs for knowledge-intensive tasks. First, knowledge cutoff: a model trained in 2024 knows nothing about events, products, or documentation updated in 2025. RAG lets the model access current information by retrieving from an up-to-date document store. Second, hallucination: when an LLM does not know an answer, it often generates plausible-sounding but completely fabricated information. With RAG, the model has actual source text to reference, dramatically reducing hallucination. Third, domain specificity: a general-purpose LLM has limited knowledge about your company's internal processes, proprietary documentation, or niche domain topics. RAG injects this domain knowledge at query time without requiring any model training.

The RAG pipeline has two main phases. The indexing phase happens offline (before any user queries): you load documents, split them into smaller chunks, convert each chunk into a numerical vector (embedding), and store these vectors in a specialized database. The query phase happens in real time: when a user asks a question, you convert their question into a vector, search the database for the most similar document vectors, retrieve those document chunks, and pass them to the LLM as context for generating the answer. The beauty of this architecture is that the indexing phase only needs to happen once (or whenever documents are updated), while the query phase can serve thousands of users in real time.
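As a minimal sketch, the two phases reduce to a pair of functions. The toy `embed` below (a normalized character-frequency vector) is a stand-in for a real embedding model — everything here is illustrative, not a specific library API:

```python
import math

def embed(text: str) -> list[float]:
    # Toy "embedding": normalized character-frequency vector over a-z.
    # A real system would call an embedding model here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha() and ch.isascii():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def index_phase(docs: list[str]) -> list[tuple[list[float], str]]:
    # Offline: embed every chunk once and store (vector, text) pairs.
    return [(embed(d), d) for d in docs]

def query_phase(question: str, store, k: int = 2) -> list[str]:
    # Real-time: embed the query, rank stored chunks by cosine similarity
    # (dot product, since the toy vectors are normalized).
    q = embed(question)
    scored = sorted(store, key=lambda p: -sum(a * b for a, b in zip(q, p[0])))
    return [text for _, text in scored[:k]]

store = index_phase([
    "rag retrieves documents",
    "the sky is blue",
    "retrieval grounds answers",
])
print(query_phase("retrieve relevant documents", store, k=1))
# → ['rag retrieves documents']
```

The separation matters operationally: `index_phase` runs once per document update, while `query_phase` runs on every user request against the stored vectors.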

RAG has become the default architecture for production GenAI applications because it offers a compelling combination of low cost, high quality, and operational simplicity. Unlike fine-tuning — which requires preparing training datasets, running expensive GPU training jobs, and managing model versions — RAG only requires an embedding model and a vector database. You can update the knowledge base by simply adding or modifying documents, without any model retraining. Most enterprise AI assistants, customer support bots, documentation search tools, and internal knowledge systems use RAG as their foundation.

Deep Dive

The theoretical foundation for RAG was established in the 2020 paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Lewis et al. at Facebook AI Research. The key insight was that combining a parametric model (the LLM) with a non-parametric memory (the retrieved documents) produces answers that are both more factual and more controllable than either approach alone. The parametric model contributes language fluency and reasoning ability, while the non-parametric memory provides factual grounding and the ability to cite sources.

In mathematical terms, a standard LLM generates output token by token according to P(y|x) — the probability of the response y given only the input x. RAG modifies this to P(y|x, z) where z represents the retrieved context documents. This seemingly simple modification has profound implications: the model's output is now conditioned on actual evidence, which means it can be verified, attributed to specific sources, and updated by changing the document store rather than retraining the model.

The complete RAG pipeline consists of six stages, each with significant engineering decisions. Stage 1: Document Loading — extracting text from diverse formats (PDF, DOCX, HTML, Markdown, databases) and preserving structural information like headers, tables, and metadata. Stage 2: Chunking — splitting documents into semantically coherent segments that fit within embedding model context windows. Stage 3: Embedding — converting text chunks into high-dimensional vectors using an embedding model. Stage 4: Indexing — storing vectors in a database optimized for similarity search. Stage 5: Retrieval — finding the most relevant chunks for a given query using vector similarity, keyword matching, or hybrid approaches. Stage 6: Generation — synthesizing a response using the LLM with retrieved context injected into the prompt.

Figure 1 — Complete RAG pipeline: offline indexing phase and real-time query phase

The quality of a RAG system depends on every stage of this pipeline. Poor chunking produces fragments that lack context; a weak embedding model creates vectors that miss semantic relationships; an inefficient retriever returns irrelevant documents; and even a great LLM will produce poor answers if the retrieved context does not contain the information needed. This is why understanding each stage in depth — which we cover in the following sections — is essential for building production-quality RAG systems.

02

Document Chunking

Plain Language

Chunking is the process of breaking large documents into smaller, manageable pieces that can be individually embedded and retrieved. Think of it like cutting a long newspaper article into individual paragraphs that you can file in separate folders. The challenge is finding the right size and boundaries for these pieces — too small and you lose context (a sentence in isolation may not make sense), too large and you waste precious context window space by retrieving irrelevant text alongside the relevant parts. The goal is to create chunks where each piece is semantically self-contained — it makes sense on its own and contains a complete thought or piece of information.

The simplest approach is fixed-size chunking, where you split text every N characters or tokens regardless of content boundaries. This is fast and predictable, but often cuts sentences in half or separates a heading from its paragraph. A much better approach is recursive character splitting, which tries to split on natural boundaries — first on double newlines (paragraph breaks), then single newlines, then sentences, then words — only breaking at smaller units when a chunk would otherwise exceed the size limit. This preserves the natural structure of the document and keeps related sentences together.

Overlap is a crucial concept in chunking. When you split a document into non-overlapping chunks, information that spans a chunk boundary is lost — if the answer to a question starts in the last sentence of chunk 5 and continues in the first sentence of chunk 6, neither chunk alone contains the complete answer. Adding overlap (typically 10-20% of the chunk size) means that each chunk includes some text from the neighboring chunks, ensuring that boundary-spanning information appears in at least one complete chunk. The tradeoff is that overlap increases storage requirements and can cause duplicate information in retrieval results.
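A minimal sliding-window sketch makes the overlap mechanics concrete (character-based for simplicity; real splitters work on tokens or natural separators):

```python
def chunk_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size chunking with a sliding window: each chunk starts
    (size - overlap) characters after the previous one, so consecutive
    chunks share `overlap` characters at their boundary."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step) if text[i:i + size]]

text = "abcdefghijklmnopqrstuvwxyz"
print(chunk_with_overlap(text, size=10, overlap=2))
# → ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz', 'yz']
```

Note how each chunk repeats the last two characters of its predecessor — a short span crossing a boundary still appears whole in at least one chunk, at the cost of some duplicated storage.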

The optimal chunk size depends on your use case. For question-answering over technical documentation, chunks of 512-1024 tokens work well — large enough to contain a complete explanation, small enough to fit multiple chunks in the LLM's context window. For conversational applications where precision matters more than coverage, smaller chunks (256-512 tokens) perform better. For summarization tasks where you need broader context, larger chunks (1024-2048 tokens) are preferable. The embedding model also constrains chunk size — maximum input lengths range from 512 tokens (typical of BERT-based models) to 8192 tokens (OpenAI's models), and embedding quality degrades for inputs near the maximum.

Deep Dive

LangChain provides the most comprehensive set of text splitters in the Python ecosystem. The RecursiveCharacterTextSplitter is the recommended default for most use cases because its hierarchical splitting strategy naturally preserves document structure. Here is a complete document loading and chunking pipeline:

from langchain_community.document_loaders import (
    PyPDFLoader, UnstructuredWordDocumentLoader,
    UnstructuredHTMLLoader, TextLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path

# --- Document Loading ---
def load_documents(directory: str) -> list:
    """Load all documents from a directory, auto-detecting format."""
    loaders = {
        ".pdf": PyPDFLoader,
        ".docx": UnstructuredWordDocumentLoader,
        ".html": UnstructuredHTMLLoader,
        ".txt": TextLoader,
        ".md": TextLoader,
    }
    docs = []
    for path in Path(directory).rglob("*"):
        loader_cls = loaders.get(path.suffix.lower())
        if loader_cls:
            loader = loader_cls(str(path))
            docs.extend(loader.load())
    return docs

# --- Chunking ---
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Target ~250 tokens (4 chars/token avg)
    chunk_overlap=200,     # 20% overlap for boundary coverage
    separators=[
        "\n\n",   # 1st: paragraph breaks
        "\n",     # 2nd: line breaks
        ". ",     # 3rd: sentence endings
        " ",      # 4th: word boundaries
        ""        # 5th: character level (last resort)
    ],
    length_function=len,
    is_separator_regex=False,
)

docs = load_documents("./knowledge_base")
chunks = splitter.split_documents(docs)

print(f"Loaded {len(docs)} documents → {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} chars")

Semantic chunking is a more advanced approach that uses embedding similarity to determine chunk boundaries. Instead of splitting at fixed sizes or syntactic markers, you compute embeddings for each sentence and split where consecutive sentences have low similarity — indicating a topic change. This produces chunks that are semantically coherent regardless of formatting:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Semantic chunker splits at topic boundaries
semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",  # Split where similarity drops
    breakpoint_threshold_amount=70,         # Below 70th percentile = new chunk
)

semantic_chunks = semantic_splitter.split_documents(docs)
print(f"Semantic chunking: {len(semantic_chunks)} chunks")

Metadata enrichment is the often-overlooked step that dramatically improves retrieval quality. Each chunk should carry metadata from its source document — the filename, page number, section heading, document date, and any custom tags. This metadata enables filtered retrieval (e.g., "search only in documents from 2024") and provides attribution information in the generated answer:

def enrich_chunks(chunks: list) -> list:
    """Add computed metadata to each chunk."""
    for i, chunk in enumerate(chunks):
        chunk.metadata["chunk_id"] = i
        chunk.metadata["char_count"] = len(chunk.page_content)
        chunk.metadata["word_count"] = len(chunk.page_content.split())

        # Extract section heading if present
        lines = chunk.page_content.split("\n")
        if lines and len(lines[0]) < 100:
            chunk.metadata["heading"] = lines[0].strip()

    return chunks
Chunk Size Rule of Thumb

Start with 512 tokens (≈2000 characters) and 20% overlap. If retrieval quality is poor, experiment with smaller chunks (256 tokens) for higher precision or larger chunks (1024 tokens) for more context. Always evaluate with real queries against your specific document collection.

03

Embeddings

Plain Language

Embeddings are the mathematical heart of RAG. An embedding model takes a piece of text and converts it into a list of numbers — a vector — that captures the meaning of that text. Similar meanings produce similar vectors, and dissimilar meanings produce distant vectors. When you search for "how to deploy a web app," the embedding for this query will be mathematically close to embeddings of document chunks about deployment, Docker, hosting, and DevOps, even if those chunks use completely different words. This is the magic that makes semantic search work — it matches by meaning rather than by keyword.

Think of embeddings as coordinates in a high-dimensional space. Just as a GPS coordinate (latitude, longitude) represents a physical location, an embedding vector (1536 numbers for OpenAI's model) represents a semantic location. Paris and Lyon are geographically close because they are both in France; similarly, "machine learning" and "artificial intelligence" have nearby embeddings because they are semantically related. The distance between two embedding vectors (measured by cosine similarity) tells you how semantically similar the two texts are — a cosine similarity of 0.95 means they are very similar, while 0.3 means they are mostly unrelated.

The choice of embedding model has an enormous impact on RAG quality. A good embedding model understands that "The server is down" and "The production instance is experiencing an outage" mean the same thing, even though they share almost no words. Popular embedding models include OpenAI's text-embedding-3-small (fast, cheap, 1536 dimensions), Cohere's embed-english-v3.0 (excellent for English, supports compression), and open-source models like BAAI/bge-large-en-v1.5 (runs locally, no API costs). The MTEB (Massive Text Embedding Benchmark) leaderboard provides standardized comparisons across dozens of embedding models on retrieval tasks.

One critical rule: you must use the same embedding model for indexing and querying. Each model creates vectors in its own coordinate system — an embedding from OpenAI and an embedding from Cohere are not comparable, even for the same text. If you index your documents with one model and query with another, the similarity scores will be meaningless and retrieval will fail. This means choosing an embedding model is a relatively permanent decision that affects your entire vector database.
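One lightweight safeguard — illustrative, not part of any particular library — is to record which model built the index and verify it at query time:

```python
# Metadata recorded when the index was built (model name is illustrative).
INDEX_METADATA = {"embedding_model": "text-embedding-3-small"}

def check_query_model(query_model: str) -> None:
    """Refuse to run a query embedded with a different model than the index.
    Vectors from different models live in incompatible coordinate systems."""
    indexed = INDEX_METADATA["embedding_model"]
    if query_model != indexed:
        raise ValueError(
            f"Index was built with {indexed!r}; queries must use the same "
            f"model, not {query_model!r} — similarity scores would be meaningless."
        )

check_query_model("text-embedding-3-small")  # OK: matches the index
```

Storing the model name (and dimension count) alongside the index turns a silent retrieval failure into a loud, immediate error.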

Deep Dive

Here is a practical comparison of embedding models for RAG and how to use them with the most common providers:

from openai import OpenAI
import numpy as np

client = OpenAI()

# --- OpenAI Embeddings ---
def embed_openai(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Embed a batch of texts using OpenAI's embedding API."""
    response = client.embeddings.create(
        input=texts,
        model=model,
    )
    vectors = [item.embedding for item in response.data]
    return np.array(vectors)

# Generate embeddings
query_vec = embed_openai(["How does RAG reduce hallucination?"])
print(f"Shape: {query_vec.shape}")  # (1, 1536)

# Batch embedding for chunks
chunk_texts = ["RAG retrieves relevant documents...", "Fine-tuning modifies weights..."]
chunk_vecs = embed_openai(chunk_texts)
print(f"Shape: {chunk_vecs.shape}")  # (2, 1536)

# --- Cosine Similarity ---
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(query_vec[0], chunk_vecs[0])
print(f"Similarity (RAG chunk): {sim:.4f}")   # ~0.85 (high)
sim2 = cosine_similarity(query_vec[0], chunk_vecs[1])
print(f"Similarity (fine-tune chunk): {sim2:.4f}")  # ~0.45 (low)

For local embeddings (no API costs, full data privacy), Sentence Transformers is the standard library. The BAAI/bge family of models consistently ranks near the top of the MTEB leaderboard while being free to run locally:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a local embedding model (downloads once, ~1.3GB)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Embed texts locally — no API calls
texts = [
    "Retrieval augmented generation grounds LLM outputs in source documents.",
    "The weather in Paris is mild in spring.",
    "RAG pipelines reduce hallucination by providing evidence.",
]

vectors = model.encode(texts, normalize_embeddings=True)
print(f"Shape: {vectors.shape}")  # (3, 1024)

# Similarity matrix
sims = np.inner(vectors, vectors)  # Cosine sim (normalized vectors)
print(f"Text 0 vs Text 2: {sims[0][2]:.4f}")  # High (~0.88)
print(f"Text 0 vs Text 1: {sims[0][1]:.4f}")  # Low (~0.25)
Model                    | Dimensions | MTEB Score | Cost            | Best For
text-embedding-3-small   | 1536       | 62.3       | $0.02/1M tokens | Cost-effective production
text-embedding-3-large   | 3072       | 64.6       | $0.13/1M tokens | Max quality (OpenAI)
BAAI/bge-large-en-v1.5   | 1024       | 64.2       | Free (local)    | Privacy-first / on-prem
Cohere embed-v3          | 1024       | 64.5       | $0.10/1M tokens | Multilingual + compression
nomic-embed-text-v1.5    | 768        | 62.3       | Free (local)    | Lightweight local
Batch Processing

Always embed in batches, not one at a time. OpenAI's API accepts up to 2048 texts per call. For local models, SentenceTransformer.encode() automatically batches and uses GPU if available. Embedding 10,000 chunks individually takes 10,000 API calls; batching reduces this to 5 calls.
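A simple batching helper shows the arithmetic (2048 is the per-call input limit the text cites for OpenAI's embeddings API):

```python
def batched(items: list[str], batch_size: int = 2048) -> list[list[str]]:
    """Split a list into sublists no larger than batch_size,
    one sublist per API call."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

chunks = [f"chunk {i}" for i in range(10_000)]
batches = batched(chunks)
print(len(batches))  # → 5 API calls instead of 10,000
```

Each batch would then be passed as the `input` list of a single embeddings request.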

04

Vector Databases

Plain Language

A vector database is a specialized storage system designed to efficiently store and search through millions of embedding vectors. Traditional databases are optimized for exact matches — find the row where id = 42. Vector databases are optimized for approximate nearest neighbor (ANN) search — find the 10 vectors most similar to this query vector out of 10 million stored vectors. This is a fundamentally different computational problem that requires specialized data structures and algorithms.

Think of a vector database like a well-organized library. Instead of organizing books alphabetically by title, this library organizes them by meaning — books about cooking are physically near each other, books about astronomy are in another cluster, and books about culinary astronomy (if such a thing existed) would be between the two clusters. When you walk in with a question, the librarian does not need to check every single book — they can walk directly to the relevant section and pull the most relevant volumes. This is what ANN algorithms do: they organize vectors into structures that allow fast approximate search without comparing against every stored vector.
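To see what ANN indexes are approximating, here is the exact brute-force baseline in NumPy — compare the query against every stored vector. This is fine for thousands of vectors but O(N) per query, which is exactly the cost that structures like HNSW avoid (the data here is random and purely illustrative):

```python
import numpy as np

# 10,000 random unit vectors standing in for stored chunk embeddings.
rng = np.random.default_rng(0)
store = rng.normal(size=(10_000, 64)).astype(np.float32)
store /= np.linalg.norm(store, axis=1, keepdims=True)  # Normalize rows

# Query: a slightly perturbed copy of vector 42, re-normalized.
query = store[42] + 0.01 * rng.normal(size=64).astype(np.float32)
query /= np.linalg.norm(query)

# Exact search: one dot product per stored vector (cosine similarity,
# since everything is normalized), then take the top 5.
sims = store @ query
top_k = np.argsort(sims)[::-1][:5]
print(top_k[0])  # → 42: the perturbed source vector is the nearest
```

ANN algorithms trade a small amount of recall (occasionally missing a true nearest neighbor) for the ability to answer this same question without touching all N vectors.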

The most popular vector databases each target different use cases. ChromaDB is the simplest option — it is an in-process database that stores vectors in memory or on disk, requires no separate server, and can be set up in three lines of Python. It is perfect for prototyping, small datasets (under 1 million vectors), and applications where simplicity matters more than scale. Pinecone is a fully managed cloud service that handles all infrastructure, scaling, and operations — you just send vectors via API. It is ideal for production applications that need zero operational burden. pgvector is a PostgreSQL extension that adds vector search to your existing Postgres database, which is perfect when you want to keep your vectors alongside your relational data without adding another infrastructure component. Qdrant and Weaviate are purpose-built vector databases that offer advanced features like filtering, multi-tenancy, and hybrid search.

Deep Dive

ChromaDB is the fastest way to get started with vector storage. Its API is clean and intuitive, and it handles embedding generation automatically if you provide an embedding function. Here is a complete example that indexes documents and performs similarity search:

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Initialize ChromaDB (persistent storage)
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection with OpenAI embeddings
embed_fn = OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-3-small"
)

collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=embed_fn,
    metadata={"hnsw:space": "cosine"}  # Distance metric
)

# Add documents (embeddings generated automatically)
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "RAG retrieves relevant documents and injects them into the LLM prompt.",
        "Fine-tuning modifies model weights using domain-specific training data.",
        "Vector databases store embeddings for efficient similarity search.",
    ],
    metadatas=[
        {"source": "rag_guide.pdf", "page": 1},
        {"source": "fine_tuning.pdf", "page": 3},
        {"source": "vector_db.pdf", "page": 1},
    ]
)

# Query — returns most similar documents
results = collection.query(
    query_texts=["How does RAG improve LLM accuracy?"],
    n_results=2,
    include=["documents", "metadatas", "distances"]
)

for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"[{dist:.4f}] {meta['source']}: {doc[:80]}...")

pgvector integrates vector search into PostgreSQL, which is powerful for applications that already use Postgres. You get the full power of SQL — joins, filters, transactions — combined with vector similarity search:

-- PostgreSQL with pgvector extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    source VARCHAR(255),
    page_num INTEGER,
    embedding vector(1536),  -- OpenAI text-embedding-3-small
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for fast similarity search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Similarity search with metadata filtering
SELECT content, source, page_num,
       1 - (embedding <=> '[0.12, -0.34, ...]'::vector) AS similarity
FROM documents
WHERE source = 'rag_guide.pdf'
ORDER BY embedding <=> '[0.12, -0.34, ...]'::vector
LIMIT 5;

Using pgvector from Python with the psycopg2 or asyncpg libraries is straightforward:

import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
register_vector(conn)
cur = conn.cursor()

# Insert a document with embedding (query_vec comes from the earlier
# embedding example)
cur.execute(
    "INSERT INTO documents (content, source, embedding) VALUES (%s, %s, %s)",
    ("RAG reduces hallucination...", "guide.pdf", query_vec[0].tolist())
)
conn.commit()  # psycopg2 does not autocommit by default

# Similarity search
cur.execute(
    """SELECT content, source, 1 - (embedding <=> %s::vector) AS similarity
       FROM documents ORDER BY embedding <=> %s::vector LIMIT 5""",
    (query_vec[0].tolist(), query_vec[0].tolist())
)
results = cur.fetchall()
Choosing a Vector Database

Prototyping: ChromaDB (zero setup, in-process). Already using Postgres: pgvector (no new infrastructure). Managed production: Pinecone (zero ops). Self-hosted production: Qdrant or Weaviate (feature-rich, scalable). Start simple with ChromaDB and migrate when you hit scale limits.

05

Retrieval & Reranking

Plain Language

Retrieval is the step where you search through your indexed document chunks to find the ones most relevant to the user's query. The simplest approach — pure vector similarity search — works surprisingly well in most cases: embed the query, find the K nearest vectors, return the corresponding chunks. But real-world retrieval often needs more sophistication. What if the user searches for an exact product ID that should be matched literally, not semantically? What if the top vector matches are from the same document and you want diversity? What if the vector search returns 20 results but only 3 are truly relevant?

Hybrid search combines vector similarity with traditional keyword matching (BM25) to get the best of both worlds. Semantic search excels at understanding meaning — it knows that "car" and "automobile" are the same concept — but can fail on exact terms like product codes, error messages, or proper nouns. Keyword search excels at exact matching but fails on paraphrased queries. Hybrid search runs both searches in parallel and combines the results using a technique called Reciprocal Rank Fusion (RRF), which gives a document a high combined score if it appears near the top of either search result list.

Reranking is a second-stage process that dramatically improves retrieval precision. The initial retrieval (either vector or hybrid) casts a wide net — returning perhaps 20 candidates that are roughly relevant. A reranker model then examines each candidate in the context of the original query and produces a much more accurate relevance score. Rerankers are typically cross-encoder models that process the query and document together (rather than independently like embedding models), giving them much better understanding of relevance. The cost is that cross-encoders are too slow to run against the entire document collection, which is why they are used as a second stage after initial retrieval narrows the candidates.

The difference between a basic and a well-tuned retrieval pipeline can be dramatic. A naive vector search might produce results where 3 out of 5 retrieved chunks are actually relevant (60% precision). Adding hybrid search might push this to 4 out of 5 (80%). Adding a reranker can approach 5 out of 5 for well-indexed domains. Since the LLM's answer quality depends directly on the relevance of the retrieved context, these improvements in retrieval precision translate directly into better final answers.

Deep Dive

Here is a complete retrieval pipeline implementing vector search, hybrid search with BM25, and cross-encoder reranking:

from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder
import numpy as np

class HybridRetriever:
    """Combines vector search + BM25 keyword search + cross-encoder reranking."""

    def __init__(self, collection, chunks: list[str]):
        self.collection = collection
        self.chunks = chunks

        # BM25 index for keyword search
        tokenized = [doc.lower().split() for doc in chunks]
        self.bm25 = BM25Okapi(tokenized)

        # Cross-encoder reranker
        self.reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

    def search(self, query: str, k: int = 5, use_rerank: bool = True) -> list[dict]:
        # Step 1: Vector search (top 20 candidates)
        vector_results = self.collection.query(
            query_texts=[query], n_results=20,
            include=["documents", "metadatas", "distances"]
        )
        vector_docs = vector_results["documents"][0]

        # Step 2: BM25 keyword search (top 20 candidates)
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_top = np.argsort(bm25_scores)[::-1][:20]
        bm25_docs = [self.chunks[i] for i in bm25_top]

        # Step 3: Reciprocal Rank Fusion
        fused = self._rrf_merge(vector_docs, bm25_docs)

        # Step 4: Rerank with cross-encoder
        if use_rerank and fused:
            pairs = [[query, doc] for doc in fused]
            scores = self.reranker.predict(pairs)
            ranked = sorted(
                zip(fused, scores), key=lambda x: x[1], reverse=True
            )
            return [{"text": doc, "score": float(s)} for doc, s in ranked[:k]]

        return [{"text": doc, "score": 0.0} for doc in fused[:k]]

    def _rrf_merge(self, list_a: list, list_b: list, k: int = 60) -> list:
        """Reciprocal Rank Fusion: merge two ranked lists."""
        scores = {}
        for rank, doc in enumerate(list_a):
            scores[doc] = scores.get(doc, 0) + 1 / (k + rank + 1)
        for rank, doc in enumerate(list_b):
            scores[doc] = scores.get(doc, 0) + 1 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)
Reranker Impact

In benchmarks, adding a cross-encoder reranker improves retrieval precision (Precision@5) by 15-30% on average. The Cohere Rerank API and open-source cross-encoder/ms-marco-MiniLM-L-12-v2 model are the most popular choices. The latency cost is typically 50-100ms for reranking 20 candidates.

06

LLM Synthesis

Plain Language

Synthesis is the final step where everything comes together: you have the user's question and the retrieved document chunks, and now you need the LLM to generate an answer that is grounded in that context. This is where prompt engineering meets retrieval — the way you structure the prompt determines whether the LLM faithfully uses the provided context, ignores it, or worse, mixes retrieved facts with hallucinated information. The synthesis prompt must clearly instruct the model to base its answer on the provided context and to say "I don't know" when the context does not contain the answer.

The standard synthesis pattern uses a system prompt that sets the behavioral rules and a user message that contains both the retrieved context and the question. The context chunks are typically formatted with clear delimiters — numbered sections, XML-like tags, or markdown headers — so the model can easily distinguish between different source documents. Including the source metadata (filename, page number, section title) in the context enables the model to provide citations in its answer, which is essential for enterprise applications where users need to verify information.

A well-designed synthesis prompt handles several edge cases. What if the retrieved context is contradictory? The prompt should instruct the model to acknowledge the contradiction and explain both perspectives. What if the context is partially relevant? The model should use what it can and explicitly state what information is missing. What if the user's question is completely outside the scope of the knowledge base? The model should clearly say it does not have information to answer rather than falling back on its training data, which might be outdated or incorrect for this specific domain.

Streaming the synthesis response is important for user experience. Since the LLM generates tokens one at a time, you can begin displaying the answer to the user while it is still being generated. This dramatically reduces perceived latency — instead of waiting 5 seconds for the complete response, the user sees the first word appear within 200 milliseconds and watches the answer flow in naturally. Combined with source citations at the end, this creates a polished, trustworthy experience.

Deep Dive

Here is a complete RAG synthesis implementation with structured prompting, source attribution, and streaming:

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are an expert assistant that answers questions based ONLY on the provided context documents. Follow these rules strictly:

1. Base your answer entirely on the provided context. Do not use information from your training data.
2. If the context does not contain enough information to answer the question, say: "I don't have enough information in the available documents to answer this question."
3. Cite your sources using [Source: filename, page N] format after each claim.
4. If the context contains contradictory information, acknowledge both perspectives.
5. Keep your answer clear, well-structured, and directly responsive to the question.
6. Use the same terminology as the source documents."""

def format_context(retrieved_chunks: list[dict]) -> str:
    """Format retrieved chunks into a structured context block."""
    sections = []
    for i, chunk in enumerate(retrieved_chunks, 1):
        source = chunk.get("source", "unknown")
        page = chunk.get("page", "?")
        text = chunk["text"]
        sections.append(
            f"--- Context Document {i} [Source: {source}, Page {page}] ---\n{text}"
        )
    return "\n\n".join(sections)

def rag_query(question: str, retriever, stream: bool = True):
    """Complete RAG pipeline: retrieve → format → synthesize."""

    # Step 1: Retrieve relevant chunks
    chunks = retriever.search(question, k=5)

    # Step 2: Format context
    context = format_context(chunks)

    # Step 3: Synthesize with LLM
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"""Context Documents:
{context}

Question: {question}

Provide a comprehensive answer based only on the context above."""}
    ]

    if stream:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0.1,   # Low temp for factual grounding
            max_tokens=1024,
            stream=True,
        )
        answer = ""
        for event in response:
            delta = event.choices[0].delta.content or ""
            answer += delta
            print(delta, end="", flush=True)
        print()
        return answer
    else:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0.1,
            max_tokens=1024,
        )
        return response.choices[0].message.content

# Usage
answer = rag_query("How does RAG reduce hallucination?", retriever)

For a complete end-to-end RAG application using LangChain, here is how all the pieces fit together:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# 1. Load documents
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
docs = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# 3. Embed and store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db"
)

# 4. Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# 5. Build RAG chain
prompt = PromptTemplate(
    template="""Use the following context to answer the question. If you cannot
answer from the context, say "I don't have that information."

Context: {context}

Question: {question}

Answer:""",
    input_variables=["context", "question"]
)

rag_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o", temperature=0.1),
    chain_type="stuff",  # Stuff all retrieved docs into prompt
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

# 6. Query
result = rag_chain.invoke({"query": "What is retrieval-augmented generation?"})
print(result["result"])
print("Sources:", [d.metadata["source"] for d in result["source_documents"]])

Temperature Matters

For RAG synthesis, use temperature 0.0–0.2. Higher temperatures encourage creative generation, which works against the goal of grounding answers in retrieved context. A low temperature makes the model stick closely to the provided evidence, reducing hallucination and improving factual accuracy.

Interview Ready

How to Explain This in 2 Minutes

Elevator Pitch

RAG — Retrieval-Augmented Generation — is the most widely deployed pattern in production GenAI. Instead of relying on what the LLM memorized during training, we retrieve relevant documents from an external knowledge base and inject them into the prompt as context. The pipeline has two phases: an offline indexing phase where documents are chunked, embedded into vectors, and stored in a vector database, and a real-time query phase where the user's question is embedded, similar chunks are retrieved via similarity search, and the LLM generates an answer grounded in the retrieved evidence. RAG solves three critical problems — knowledge cutoff (the model can access information newer than its training data), hallucination (the model has real source text to reference), and domain specificity (you inject proprietary knowledge without retraining). Compared to fine-tuning, RAG is cheaper, faster to update, and easier to audit because you can trace every answer back to specific source documents.

Likely Interview Questions

  • Walk me through a RAG pipeline end to end — from document ingestion to answer generation.
    What they're really asking: Do you understand every component (loader, chunker, embedder, vector store, retriever, LLM) and how they connect?
  • How do you choose a chunking strategy and chunk size for a RAG system?
    What they're really asking: Can you reason about the trade-offs between context completeness and retrieval precision?
  • What is a vector embedding, and how does similarity search work in a vector database?
    What they're really asking: Do you understand the math behind cosine similarity and approximate nearest neighbor search?
  • When would you choose RAG over fine-tuning, and when might you combine both?
    What they're really asking: Can you articulate the cost, latency, and accuracy trade-offs for different knowledge injection strategies?
  • How do you evaluate whether a RAG system is retrieving the right documents and generating accurate answers?
    What they're really asking: Do you know about retrieval metrics (recall@k, MRR) and generation quality metrics (faithfulness, relevance)?

Model Answers

1. End-to-end RAG pipeline — The indexing phase starts with document loading (PDFs, HTML, databases), then text chunking to split long documents into passages of 256–1024 tokens with overlap, then embedding each chunk using a model like text-embedding-3-small to produce dense vectors, and finally storing those vectors with their metadata in a vector database like Pinecone, Weaviate, or ChromaDB. At query time, the user's question is embedded with the same model, a similarity search (cosine or dot product) retrieves the top-k most relevant chunks, and those chunks are formatted into a prompt that instructs the LLM to answer based only on the provided context. The LLM generates a grounded response, ideally with citations pointing back to source chunks.

# Minimal RAG pipeline with LangChain
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Index phase
docs = PyPDFLoader("handbook.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=512, chunk_overlap=64
).split_documents(docs)
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())

# Query phase
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)
answer = qa.invoke({"query": "What is the company's PTO policy?"})["result"]

2. Choosing a chunking strategy — The right chunking strategy depends on document structure and query patterns. Fixed-size chunking (e.g., 512 tokens with 64-token overlap) is the simplest and works well for uniform text. Recursive character splitting respects natural boundaries — it tries to split on paragraphs first, then sentences, then words. Semantic chunking groups sentences by embedding similarity, keeping topically coherent passages together. For structured documents like legal contracts or technical manuals, document-aware chunking that respects headers and sections preserves context boundaries. Smaller chunks (256 tokens) improve retrieval precision but may lack context; larger chunks (1024 tokens) preserve more context but increase noise. The overlap (typically 10–15% of chunk size) ensures that information at chunk boundaries is not lost.
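A minimal fixed-size chunker makes the overlap mechanics concrete. This is a simplified sketch using character counts as a stand-in for tokens; a production system would use a tokenizer-aware splitter such as RecursiveCharacterTextSplitter:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size chunking: each chunk repeats the last `overlap` characters
    of the previous chunk, so a fact that spans a boundary survives intact
    in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = " ".join(f"sentence {i}." for i in range(200))
chunks = chunk_text(doc)
# Consecutive chunks share a 64-character window:
assert chunks[0][-64:] == chunks[1][:64]
```

Shrinking `chunk_size` in this sketch produces more, tighter chunks (better retrieval precision, less context per hit); growing it does the opposite, which is exactly the trade-off described above.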

3. Vector embeddings and similarity search — An embedding model maps text to a dense vector in high-dimensional space (e.g., 1536 dimensions for OpenAI's text-embedding-3-small). Semantically similar texts produce vectors that are close together, measured by cosine similarity (the cosine of the angle between two vectors, ranging from -1 to 1). A vector database indexes these vectors using approximate nearest neighbor (ANN) algorithms like HNSW (Hierarchical Navigable Small World graphs) or IVF (Inverted File Index) to enable sub-millisecond search over millions of vectors. At query time, the user's question is embedded into the same vector space, and the database returns the k vectors with the highest cosine similarity — these correspond to the most semantically relevant document chunks.
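The similarity measure itself is a one-liner. A toy sketch with NumPy, using made-up three-dimensional vectors rather than real 1536-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product divided by the
    product of their norms. 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 2.0, 3.0])
doc_a = np.array([2.0, 4.0, 6.0])   # same direction as the query
doc_b = np.array([3.0, 0.0, -1.0])  # dot product with the query is 0

print(cosine_similarity(query, doc_a))  # ≈ 1.0 (highly similar)
print(cosine_similarity(query, doc_b))  # ≈ 0.0 (unrelated)
```

An exhaustive scan computes this score against every stored vector; ANN indexes like HNSW exist precisely to avoid that linear scan at scale.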

4. RAG vs fine-tuning — RAG is best when you need access to frequently updated information, when you want traceable citations, or when your knowledge base is large and diverse. Fine-tuning is better for teaching the model a specific style, tone, or reasoning pattern that cannot be captured by simply providing context. In practice, you often combine both: fine-tune a model to follow your output format and citation conventions, then use RAG to inject the actual knowledge at query time. RAG is also far cheaper — updating knowledge means re-indexing documents, not retraining a model. Fine-tuning a large model costs hundreds to thousands of dollars per run, while RAG indexing costs pennies per document.

5. Evaluating RAG systems — Evaluation has two dimensions: retrieval quality and generation quality. For retrieval, measure Recall@k (what fraction of relevant documents appear in the top-k results), MRR (Mean Reciprocal Rank — how high the first relevant result ranks), and precision (what fraction of retrieved documents are actually relevant). For generation, measure faithfulness (does the answer contradict the retrieved context?), relevance (does the answer address the question?), and completeness (does the answer cover all relevant information from the context?). Frameworks like RAGAS automate these metrics using LLM-as-judge evaluation. The most common failure mode is retrieving irrelevant chunks that cause the LLM to generate plausible but incorrect answers — this is why retrieval quality is the single most important factor in RAG performance.
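Both retrieval metrics are simple to compute once each query has a labeled set of relevant document IDs. The IDs below are invented for illustration:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none appears)."""
    for rank, doc_id in enumerate(retrieved, 1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

retrieved = ["d3", "d7", "d1", "d9", "d2"]  # ranked retriever output
relevant = {"d1", "d2"}                     # labeled ground truth

print(recall_at_k(retrieved, relevant, 5))  # 1.0 — both relevant docs in top-5
print(mrr(retrieved, relevant))             # ≈ 0.33 — first relevant doc at rank 3
```

In practice you average these over a query set; faithfulness and relevance on the generation side then need an LLM-as-judge framework such as RAGAS.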

System Design Scenario

Design Challenge

You are building a RAG-powered internal knowledge assistant for a company with 50,000 documents across Confluence, Google Drive, and SharePoint. Documents range from 1-page memos to 200-page technical manuals. The system must serve 500 concurrent users with sub-3-second response times and support daily document updates. Design the complete RAG architecture.

A strong answer should cover:

  • Document ingestion pipeline — connectors for each source (Confluence API, Google Drive API, SharePoint Graph API), a unified document format, and incremental sync that only re-indexes changed documents using content hashes or last-modified timestamps
  • Chunking strategy — document-aware chunking that respects headers and sections for long manuals, with metadata preservation (source URL, title, author, date) attached to each chunk for filtering and citation
  • Embedding and vector storage — a managed vector database (Pinecone or Weaviate) with namespace separation per document source, hybrid search combining dense vectors with sparse BM25 for keyword matching, and metadata filtering to scope searches by department or document type
  • Retrieval and reranking — initial retrieval of top-20 candidates via ANN search, followed by a cross-encoder reranker (e.g., Cohere Rerank) to select the top-4 most relevant chunks, plus a relevance threshold to avoid injecting low-quality context
  • Scaling and latency — caching frequent queries and their retrieved contexts, async embedding generation for ingestion, read replicas for the vector database, and a queue-based architecture for document processing to handle update spikes
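The incremental-sync idea in the ingestion bullet can be sketched with content hashes. The function name and the in-memory dict are hypothetical, standing in for a persistent metadata table keyed by document ID:

```python
import hashlib

def needs_reindex(doc_id: str, content: str, hash_store: dict[str, str]) -> bool:
    """Re-chunk and re-embed a document only when its content hash has
    changed since the last sync; unchanged docs are skipped entirely."""
    new_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if hash_store.get(doc_id) == new_hash:
        return False  # unchanged — skip the expensive chunk/embed path
    hash_store[doc_id] = new_hash
    return True

store: dict[str, str] = {}
print(needs_reindex("memo-1", "draft v1", store))  # True — first sight
print(needs_reindex("memo-1", "draft v1", store))  # False — unchanged
print(needs_reindex("memo-1", "draft v2", store))  # True — content changed
```

With 50,000 documents and daily syncs, this check is what keeps the embedding bill proportional to the change rate rather than the corpus size.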

Common Mistakes

  • Using chunks that are too large or too small — Chunks over 1024 tokens dilute relevance and waste context window space. Chunks under 128 tokens lose context and produce fragmented retrieval. Test multiple sizes with your actual queries and measure retrieval recall to find the sweet spot for your data.
  • Ignoring chunk overlap — Without overlap, critical information that spans a chunk boundary gets split across two chunks, and neither chunk alone contains the complete answer. Always use 10–15% overlap, and for critical applications, consider sentence-level deduplication in the retrieved results.
  • Skipping reranking and using raw vector similarity scores as confidence — ANN similarity scores are not calibrated probabilities. Two chunks with scores 0.82 and 0.79 may differ dramatically in actual relevance. A cross-encoder reranker examines query-chunk pairs jointly and produces much more reliable relevance judgments, often improving answer quality by 15–25% with minimal latency overhead.
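The retrieve-then-rerank flow reads as a two-stage function. `cross_encoder_score` below is a deliberately crude token-overlap stub so the sketch runs; a real system would call a cross-encoder model such as Cohere Rerank or a sentence-transformers CrossEncoder at this step:

```python
def cross_encoder_score(query: str, passage: str) -> float:
    """Stub scorer: a real cross-encoder reads the query and passage jointly.
    Token overlap is used here purely so the example is self-contained."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query: str, candidates: list[str], top_n: int = 4,
           threshold: float = 0.2) -> list[str]:
    """Score every ANN candidate, sort by relevance, keep the top_n, and
    drop anything below the threshold to avoid injecting weak context."""
    scored = [(cross_encoder_score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_n] if score >= threshold]

candidates = [
    "RAG retrieves supporting documents before generation",
    "Bananas are rich in potassium",
    "Retrieval augmented generation grounds answers in retrieved evidence",
]
top = rerank("what is retrieval augmented generation", candidates, top_n=2)
```

Note that the threshold matters as much as the ranking: it is what lets the pipeline return fewer than `top_n` chunks instead of padding the prompt with noise.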