Why Evaluation Is Hard for LLMs
In traditional machine learning, evaluation is conceptually simple. You have a test set with known labels. Your model makes predictions. You count how often it is right. Whether you are classifying spam, detecting fraud, or predicting house prices, the evaluation question is always the same: does the output match the ground truth? The answer is a number, the number can be tracked over time, and improvements are unambiguous.
LLMs break every part of this model. When you ask an LLM to "summarize this document" or "explain quantum entanglement to a ten-year-old" or "help me debug this Python function," there is no single correct answer. There are dozens of equally valid answers, varying in length, style, focus, and vocabulary. Two expert humans shown the same response would disagree about whether it is excellent or merely adequate. This is not a solvable problem — it is a fundamental property of open-ended text generation. Any evaluation framework for LLMs has to grapple with this irreducible ambiguity from the start.
The practical consequence is that many teams in the early stages of building LLM products fall back on what practitioners call the "vibe check": a developer tries the system a few times, it seems to work, they ship it. This approach has a seductive simplicity. It requires no infrastructure, no datasets, and no methodology. The problem is that it does not scale — neither to the volume of outputs a production system generates, nor to the diversity of inputs real users bring, nor to the detection of subtle regressions when you change a prompt or swap a model version. The vibe check is how teams end up shipping changes that hurt 20% of their users while improving the 5 examples the developer happened to test.
Even when teams attempt more systematic evaluation, they quickly discover the distribution shift problem. Your handcrafted test cases reflect the queries you thought to write. Real users ask things you never anticipated. They phrase questions in unexpected ways, use domain vocabulary you did not cover, combine tasks in unusual sequences, and push against edge cases that seemed unlikely. A system that scores 95% on your internal test set may perform dramatically worse on actual production traffic — not because the model is bad, but because the eval set was not representative.
Compounding this is prompt sensitivity: LLMs are not deterministic functions with stable, predictable behavior. Adding a single word to a system prompt, changing the temperature by 0.1, or switching from one model version to another can produce noticeably different outputs on the same input. This means every change to your system — and in a production LLM app, changes happen constantly — is potentially a regression you cannot see without a rigorous eval harness. Evaluation is not a one-time activity; it is the continuous measurement infrastructure that makes your entire development process trustworthy.
Ground Truth Challenges and the Multiple-Correct-Answers Problem
For tasks like question answering over a document, you might imagine that ground truth is achievable: there is a specific passage that answers the question, and you can check whether the model found it. But even here, the problem is messier than it looks. A model might answer correctly using different words than the reference answer. It might give a correct answer at a different level of specificity. It might synthesize information from two passages rather than one. Exact-match metrics will penalize all of these correct answers. Token-overlap metrics like BLEU and ROUGE will give partial credit, but the scores are difficult to interpret and correlate poorly with actual quality for long-form responses.
For generative tasks — writing, brainstorming, explanation, code generation — the ground truth problem is even more severe. There is no canonical correct email response, no single correct explanation of a concept, no uniquely correct implementation of a function. Human annotators asked to produce reference answers for these tasks will disagree with each other nearly as much as they disagree with the model. This means that reference-based evaluation — checking the model's output against a gold standard — is only viable for a subset of LLM tasks, and requires careful methodology even in those cases.
The alternative approaches each come with their own trade-offs. Reference-free evaluation judges quality without a ground truth — typically by having a strong model evaluate the output directly. This is flexible and scalable, but it introduces the biases and reliability concerns of the judge model itself. Human evaluation is the gold standard for quality but is slow, expensive, and hard to reproduce. Behavioral testing — checking that outputs satisfy logical constraints, safety rules, or format requirements — is the most reliable automated method, but it only covers the properties you explicitly specify. A complete evaluation strategy combines multiple methods because no single approach is sufficient.
The Evaluation Metrics Taxonomy
It helps to organize the evaluation landscape into a clear taxonomy before diving into specific frameworks. At the highest level, evaluation methods divide into three families based on what they require to operate.
Reference-based metrics compare model output to a human-written reference answer. They include exact match (binary: does the output equal the reference?), token-overlap metrics like BLEU (used for translation) and ROUGE (used for summarization), and embedding-based similarity metrics like BERTScore that compare semantic meaning rather than surface text. Reference-based metrics are cheap to compute but require the expensive work of building a ground-truth dataset, and they systematically penalize valid outputs that differ from the reference.
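To make the trade-off concrete, here is a toy sketch — not any library's implementation — of exact match next to a ROUGE-1-style unigram-overlap F1, applied to a paraphrase that is perfectly correct:

```python
def exact_match(output: str, reference: str) -> int:
    """Binary: 1 if the normalized strings are identical, else 0."""
    return int(output.strip().lower() == reference.strip().lower())

def unigram_f1(output: str, reference: str) -> float:
    """F1 over shared unigrams -- gives partial credit for word overlap."""
    out_tokens = set(output.lower().split())
    ref_tokens = set(reference.lower().split())
    overlap = len(out_tokens & ref_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "the capital of france is paris"
paraphrase = "paris is the capital of france"  # correct, different word order

print(exact_match(paraphrase, reference))           # 0 -- penalized despite being right
print(round(unigram_f1(paraphrase, reference), 2))  # 1.0 -- identical unigrams
```

Note the failure in the other direction too: unigram overlap scores the paraphrase perfectly only because the words happen to match; a correct answer using different vocabulary would be penalized by both metrics.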
Reference-free metrics assess quality without requiring a human-written reference. They include perplexity (how surprised is the model by its own output — not a quality metric, but a coherence proxy), factual consistency checkers that use an NLI model to verify claims, readability metrics, and the increasingly dominant LLM-as-judge approach where a powerful model scores the output on a rubric. Reference-free metrics are scalable and flexible, but they require careful calibration against human judgments to be trusted.
Human evaluation is the ultimate arbiter but is expensive and slow. The key design choices in human eval are: the rating scale (Likert scale, pairwise preference, or categorical), the annotation guidelines (how precisely you define what evaluators should measure), the annotator pool (domain experts vs. crowd workers vs. end users), and the inter-annotator agreement methodology (how you verify that annotators are applying criteria consistently). Every automated metric is ultimately valuable only to the extent that it correlates with human judgment — which is why running human eval periodically to calibrate your automated metrics is a fundamental practice, not an optional one.
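As a concrete example of agreement methodology, here is a minimal implementation of Cohen's kappa, one standard inter-annotator agreement statistic for two annotators using a categorical scale. The example labels are invented for illustration:

```python
from collections import Counter

def cohens_kappa(ratings_a: list[str], ratings_b: list[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance.
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e
    is the agreement expected if both annotators labeled at random according
    to their own label frequencies."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators always used the same single label
    return (p_o - p_e) / (1 - p_e)

annotator_1 = ["pass", "pass", "fail", "pass", "fail", "fail"]
annotator_2 = ["pass", "fail", "fail", "pass", "fail", "fail"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # → 0.667
```

A common rule of thumb attributed to Landis and Koch treats kappa above roughly 0.6 as substantial agreement; if your annotators score lower, tighten the annotation guidelines before trusting the labels.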
The cost vs. accuracy trade-off runs through every evaluation decision. Human eval is the most accurate but costs $5–50 per sample. LLM-as-judge costs $0.01–0.10 per sample but has systematic biases. Automated metrics cost fractions of a cent but correlate poorly with quality. Build your eval stack in layers: cheap automated metrics for every run, LLM-as-judge for detailed analysis, human eval for periodic calibration.
LLM-as-Judge
The LLM-as-judge pattern is exactly what it sounds like: you use a powerful language model — typically GPT-4o, Claude Opus, or a specialized judge model — to evaluate the output of another language model. This might sound circular, but it works surprisingly well in practice, and understanding why helps you use it correctly. A strong judge model has internalized an enormous amount of human judgment about writing quality, factual accuracy, helpfulness, and reasoning coherence. When you give it a clear rubric and a structured evaluation task, it can apply that judgment consistently across thousands of examples at a fraction of the cost of human annotators.
The key insight is that evaluating a response is a much easier task than generating it. A human editor can reliably spot errors in an essay they could not have written themselves. A student who could not solve a calculus problem correctly can still verify that a given solution is wrong once they see it. LLMs exhibit the same asymmetry: a model that sometimes makes mistakes when generating answers can nonetheless be quite reliable at identifying problems in a generated answer when given a structured evaluation prompt. This is why GPT-4 as a judge achieves approximately 80% agreement with human raters on many benchmarks — comparable to the level of agreement between two human annotators.
The practical value of LLM-as-judge is that it enables evaluation at production scale. If your application handles 10,000 queries per day, you cannot have a human review each response. But you can run an automated judge on a representative sample of 500 queries and get a reliable signal about overall quality. You can run the judge on every response in your regression test suite every time you change a prompt. You can run it on adversarial examples before deployment. This transforms evaluation from a periodic manual process into a continuous automated signal integrated into your development workflow.
The LLM-as-Judge Pipeline
A well-designed judge pipeline has four components. The system prompt describes the judge's role, the evaluation task, and the scoring rubric with enough precision that the judge applies criteria consistently. The input includes the original question or task, the model's answer, and optionally a reference answer and/or the source context (for RAG systems). The output is a structured JSON object containing the numeric score and, critically, a written reasoning trace explaining why that score was assigned. The reasoning trace is not just for human review — it also forces the judge model to reason carefully before assigning a score, which improves accuracy.
There are two fundamental evaluation modes: pointwise and pairwise. In pointwise evaluation, the judge scores a single response on an absolute scale (typically 1-5 or 1-10) against a rubric. This is more scalable and works well when you have a clear definition of quality. In pairwise evaluation, the judge is shown two responses to the same question and must decide which is better (or whether they are equivalent). Pairwise evaluation is more reliable — humans and models find it easier to compare two options than to assign absolute scores — but it requires on the order of N² comparisons to rank N candidates and produces relative rather than absolute quality signals. The two approaches are complementary: use pairwise evaluation to calibrate your pointwise rubric, then use pointwise evaluation at scale.
Bias in LLM-as-Judge and Mitigation Strategies
LLM judges have well-documented systematic biases that you must actively mitigate. The three most important are: position bias (when comparing two answers, models tend to prefer the first one they see, regardless of quality — studies show this effect can be as large as 15-25 percentage points), verbosity bias (models tend to rate longer, more detailed answers higher even when the added length does not add value — this can be a 10-20% score inflation for verbose responses), and self-enhancement bias (models tend to rate responses that match their own stylistic tendencies higher, meaning GPT-4 may systematically favor GPT-4 style outputs when judging a competition between GPT-4 and Claude).
The standard mitigation for position bias is position swapping with averaging: run each pairwise evaluation twice, once with answer A first and once with answer B first, then average the results. This roughly doubles your evaluation cost but essentially eliminates position bias from the aggregate signal. For verbosity bias, the most effective approach is an explicit rubric that penalizes unnecessary verbosity: instruct the judge to score concise correct answers as highly as verbose correct answers, and to score padded responses lower than equivalently correct concise responses. For self-enhancement bias, the most reliable solution is to use a different model as your judge than the model you are evaluating.
Beyond the big three biases, LLM judges are also susceptible to sycophancy (preferring responses that agree with claims in the prompt), format bias (preferring responses with lists, headers, or other structured formatting even when plain prose is more appropriate), and anchoring (being influenced by a reference answer shown before the response to evaluate). Mitigation requires careful prompt engineering, rubric design, and — whenever your evaluation budget allows — comparison against human ratings to catch systematic errors before they become embedded in your evaluation infrastructure.
Specialized open-source judge models have emerged to address some of these problems. Prometheus is a Llama-based model trained specifically to follow evaluation rubrics and produce human-aligned scores with explicit reasoning. It is designed to be more transparent and controllable than using a general-purpose proprietary model as a judge. The Llama-3-based judge family (including models like Meta-Llama-3-70B fine-tuned on preference data) offers strong judging capability that can run locally, avoiding the cost and latency of API calls and allowing evaluation of sensitive data without sending it to a third-party service.
LLM-as-Judge: Full Implementation
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal

client = OpenAI()

# ── Structured output schema for the judge ──────────────────────────────────
class JudgeVerdict(BaseModel):
    score: int = Field(..., ge=1, le=5,
                       description="Quality score from 1 (very poor) to 5 (excellent)")
    reasoning: str = Field(...,
                           description="Step-by-step explanation justifying the score")
    strengths: list[str] = Field(default_factory=list)
    weaknesses: list[str] = Field(default_factory=list)
    verdict: Literal["pass", "fail"] = Field(
        ..., description="Pass if score >= 3, fail otherwise")

# ── Pointwise judge ──────────────────────────────────────────────────────────
JUDGE_SYSTEM_PROMPT = """You are an expert evaluator of AI assistant responses.
Your task is to score the quality of the provided answer on a 5-point scale.

RUBRIC:
5 - Excellent: Fully correct, appropriately detailed, well-organized, no errors
4 - Good: Mostly correct with minor gaps or slightly verbose/terse
3 - Adequate: Correct core answer but missing important nuance or context
2 - Poor: Partially correct but contains significant errors or omissions
1 - Very poor: Incorrect, harmful, or completely fails to address the question

IMPORTANT RULES:
- Score based on accuracy and helpfulness, NOT length
- A short but complete answer should score the same as a long but equally complete answer
- Provide specific evidence from the response to justify your score
- Always write your reasoning BEFORE assigning the score
"""

def judge_pointwise(
    question: str,
    answer: str,
    reference: str | None = None,
    context: str | None = None,
    judge_model: str = "gpt-4o"
) -> JudgeVerdict:
    """Score a single answer on a 1-5 scale using an LLM judge."""
    user_content = f"""QUESTION: {question}

ANSWER TO EVALUATE:
{answer}"""
    if reference:
        user_content += f"\n\nREFERENCE ANSWER (for guidance, not required match):\n{reference}"
    if context:
        user_content += f"\n\nSOURCE CONTEXT:\n{context}"

    response = client.beta.chat.completions.parse(
        model=judge_model,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": user_content}
        ],
        response_format=JudgeVerdict,
        temperature=0.0  # deterministic scoring
    )
    return response.choices[0].message.parsed

# ── Pairwise judge with position-swap debiasing ──────────────────────────────
class PairwiseVerdict(BaseModel):
    winner: Literal["A", "B", "tie"]
    reasoning: str
    confidence: Literal["low", "medium", "high"]

PAIRWISE_PROMPT = """Compare two answers to the same question. Decide which is better.
Answer A or Answer B may be better, or they may be equally good (tie).
Judge on: accuracy, completeness, clarity. NOT on length alone."""

def judge_pairwise(
    question: str,
    answer_a: str,
    answer_b: str,
    judge_model: str = "gpt-4o"
) -> dict:
    """Pairwise comparison with automatic position-swap debiasing."""
    def _compare(first: str, second: str, first_label: str) -> PairwiseVerdict:
        content = f"""QUESTION: {question}

ANSWER {first_label}: {first}

ANSWER {"B" if first_label == "A" else "A"}: {second}"""
        resp = client.beta.chat.completions.parse(
            model=judge_model,
            messages=[
                {"role": "system", "content": PAIRWISE_PROMPT},
                {"role": "user", "content": content}
            ],
            response_format=PairwiseVerdict,
            temperature=0.0
        )
        return resp.choices[0].message.parsed

    # Run twice with swapped positions to cancel position bias
    result_ab = _compare(answer_a, answer_b, "A")  # A first
    result_ba = _compare(answer_b, answer_a, "B")  # B first (labels swapped)

    # Aggregate: both runs must agree for a confident winner
    votes = {"A": 0, "B": 0, "tie": 0}
    votes[result_ab.winner] += 1
    votes[result_ba.winner] += 1
    if votes["A"] == 2:
        final = "A"
    elif votes["B"] == 2:
        final = "B"
    else:
        final = "tie"  # disagreement → tie
    return {
        "winner": final,
        "run_ab": result_ab.model_dump(),
        "run_ba": result_ba.model_dump(),
        "position_bias_detected": result_ab.winner != result_ba.winner
    }

# ── Batch evaluation ─────────────────────────────────────────────────────────
def batch_evaluate(test_cases: list[dict], judge_model: str = "gpt-4o") -> dict:
    """Run pointwise judge on a list of {question, answer, reference?} dicts."""
    results = []
    for case in test_cases:
        verdict = judge_pointwise(
            question=case["question"],
            answer=case["answer"],
            reference=case.get("reference"),
            context=case.get("context"),
            judge_model=judge_model
        )
        results.append({**case, "verdict": verdict.model_dump()})
    scores = [r["verdict"]["score"] for r in results]
    pass_rate = sum(1 for r in results if r["verdict"]["verdict"] == "pass") / len(results)
    return {
        "results": results,
        "summary": {
            "mean_score": sum(scores) / len(scores),
            "pass_rate": pass_rate,
            "score_distribution": {i: scores.count(i) for i in range(1, 6)}
        }
    }

# Usage:
# verdict = judge_pointwise("What is RAG?", "RAG stands for...")
# print(verdict.score, verdict.reasoning)
Always log the judge's full reasoning trace, not just the score. When you are diagnosing why your system is failing, the reasoning traces are far more informative than the numeric scores. They let you identify systematic failure patterns — "the judge always penalizes responses that don't start with a direct answer" — that you can then address in your system prompt or judge prompt.
RAGAS for RAG Evaluation
Retrieval-Augmented Generation systems have a unique evaluation challenge compared to simple chatbots: there are multiple points of failure that are largely independent. The retrieval step might find the wrong documents. The retrieved documents might be correct but not cover everything needed. The generation step might produce a response that contradicts the retrieved context. Or the response might answer a related-but-different question instead of the one actually asked. A single overall quality score cannot diagnose which of these failures is occurring — you need separate metrics for separate pipeline stages.
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that addresses this by defining four complementary metrics, each targeting a different component of the RAG pipeline. By computing all four metrics together, you get a comprehensive diagnostic picture: is your retriever finding the right documents? Is your generator staying faithful to what it retrieved? Is your answer actually addressing the user's question? Do you have a coverage gap? Teams that start using RAGAS consistently report that it reveals problems they had no idea existed — particularly faithfulness failures, where the LLM confidently generates information that is not present in the retrieved context.
The framework is designed to be largely automated: most metrics use an LLM to perform the fine-grained assessments (claim extraction, question generation, entailment checking) that would otherwise require human annotators. This makes it practical to run as part of a CI/CD pipeline. The one metric that requires human-provided ground truth is Context Recall, which needs a reference answer to assess whether the retrieval system found everything necessary. For the other three metrics, you only need the question, the retrieved context chunks, and the generated answer — all of which are available from your RAG pipeline's runtime logs.
RAGAS decomposes RAG evaluation into four targeted metrics, each diagnosing a different failure mode in the retrieval-generation pipeline.
Deep Dive: Each RAGAS Metric Explained
Faithfulness (0-1) measures whether the generated answer contains only claims that are supported by the retrieved context. It is designed to detect hallucination — the most dangerous failure mode in RAG systems, where the LLM generates confident-sounding information that has no basis in the retrieved documents. The measurement process works by first using an LLM to decompose the generated answer into a list of atomic claims (individual factual statements). Then, for each claim, a second LLM call determines whether that claim is entailed by the retrieved context. The faithfulness score is the fraction of claims that are supported. A score of 1.0 means every claim in the answer can be traced back to the context; a score of 0.6 means 40% of the answer's claims are potentially hallucinated.
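The arithmetic behind the score can be sketched in a few lines. The `entails` callable below stands in for the LLM entailment call that RAGAS makes internally; the substring-matching stub and the example claims are purely illustrative, not how RAGAS actually checks entailment:

```python
from typing import Callable

def faithfulness(answer_claims: list[str], context: str,
                 entails: Callable[[str, str], bool]) -> float:
    """Faithfulness = (claims supported by context) / (all claims).
    `entails(context, claim)` stands in for the LLM entailment judgment."""
    if not answer_claims:
        return 0.0
    supported = sum(entails(context, claim) for claim in answer_claims)
    return supported / len(answer_claims)

# Illustration with a trivial stub judge (naive substring containment):
context = ("Digital goods are non-refundable after delivery. "
           "Technical failures qualify for support review.")
claims = [
    "Digital goods are non-refundable after delivery",  # supported
    "Technical failures qualify for support review",    # supported
    "Refunds are processed within 30 days",             # NOT in context
]
stub_entails = lambda ctx, claim: claim.lower() in ctx.lower()
print(round(faithfulness(claims, context, stub_entails), 3))  # → 0.667
```

One unsupported claim out of three gives a score of 2/3: a third of the answer is potentially hallucinated.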
Answer Relevance (0-1) measures whether the generated answer actually addresses the question that was asked. A high faithfulness score does not guarantee relevance — a model could faithfully reproduce facts from the retrieved documents while completely ignoring the user's actual question. Answer Relevance is measured using a clever reverse-generation approach: the judge LLM is given the answer and asked to generate several questions for which this answer would be a good response. These generated questions are then compared to the original question using embedding similarity. If the generated questions are semantically close to the original question, the answer was relevant; if they are very different, the answer drifted from the topic.
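Assuming the reverse-generated questions have already been embedded, the final aggregation is just a mean cosine similarity between the original question's embedding and theirs. The two-dimensional toy vectors below are invented for illustration; in practice these would come from an embedding model:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def answer_relevance(original_q_emb: list[float],
                     generated_q_embs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question and the
    questions the judge generated from the answer."""
    sims = [cosine(original_q_emb, emb) for emb in generated_q_embs]
    return sum(sims) / len(sims)

original = [1.0, 0.0]                       # embedding of the user's question
generated = [[1.0, 0.0], [0.8, 0.6]]        # embeddings of reverse-generated questions
print(round(answer_relevance(original, generated), 3))  # → 0.9
```

If the answer drifted off-topic, the questions it would plausibly answer diverge from the original, the cosines drop, and the score falls toward zero.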
Context Precision (0-1) measures the signal-to-noise ratio in the retrieved chunks. When your retriever returns ten chunks but only two of them are actually relevant to the question, you have low context precision. Irrelevant chunks in the context window are a problem for two reasons: they push the relevant information further away (LLMs are sensitive to position in the context window), and they provide material for hallucination-by-distraction, where the LLM picks up on tangentially related content and incorporates it inappropriately. Context Precision is measured by asking the judge LLM to classify each retrieved chunk as relevant or not relevant, then computing the precision-at-k metric across the ranked list.
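One common formulation of this score — a sketch of the idea, not necessarily RAGAS's exact internals — averages precision@k over the ranks that hold a relevant chunk, so relevant chunks near the top of the ranking weigh more:

```python
def context_precision(relevance: list[bool]) -> float:
    """relevance[i] is the judge's verdict for the chunk at rank i (top first).
    Averages precision@k over the ranks holding a relevant chunk, rewarding
    retrievers that place relevant chunks early."""
    if not any(relevance):
        return 0.0
    precisions = []
    hits = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)  # precision@k at this relevant rank
    return sum(precisions) / len(precisions)

print(context_precision([True, True, False]))            # → 1.0: relevant chunks first
print(round(context_precision([False, False, True]), 3)) # → 0.333: relevant chunk buried
```

The two calls retrieve the same single-relevant-chunk signal-to-noise ratio in the second case, but the ordering penalty makes the difference visible.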
Context Recall (0-1) is the coverage metric: did the retriever find all the information needed to answer the question completely? Unlike the other three metrics, Context Recall requires a ground-truth reference answer. The measurement works by decomposing the reference answer into atomic claims and checking whether each claim can be attributed to the retrieved context. A score of 1.0 means the retrieved context contains everything the reference answer mentions; a score of 0.5 means the context is missing half the information needed for a complete answer. Low context recall points to gaps in your document collection, chunking strategy, or retrieval method.
Full RAGAS Evaluation Pipeline
from ragas import evaluate, EvaluationDataset
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
    AnswerCorrectness,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# ── Configure judge LLM and embeddings ──────────────────────────────────────
llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

# ── Build your evaluation dataset ───────────────────────────────────────────
# Each sample needs: user_input (question), response (answer), and
# retrieved_contexts (list of retrieved chunks).
# Context Recall also needs: reference (ground-truth answer).
eval_samples = [
    {
        "user_input": "What is the refund policy for digital products?",
        "response": "Digital products are non-refundable once downloaded, "
                    "except in cases of technical failure verified by support.",
        "retrieved_contexts": [
            "Section 4.2: Digital goods including software, ebooks, and "
            "downloadable content are non-refundable after delivery.",
            "Section 4.5: If a product fails to function due to a technical "
            "error on our side, contact support@example.com for resolution.",
            "Section 1.1: Our store sells physical and digital products worldwide.",
        ],
        "reference": "Digital products are non-refundable after download. "
                     "Technical failures may qualify for refund via support."
    },
    # ... more samples
]

# ── Run evaluation ───────────────────────────────────────────────────────────
dataset = EvaluationDataset.from_list(eval_samples)
metrics = [
    Faithfulness(llm=llm),
    AnswerRelevancy(llm=llm, embeddings=embeddings),
    ContextPrecision(llm=llm),
    ContextRecall(llm=llm),
    AnswerCorrectness(llm=llm, embeddings=embeddings),
]
results = evaluate(dataset=dataset, metrics=metrics)
df = results.to_pandas()

# ── Inspect results ──────────────────────────────────────────────────────────
print(df[["user_input", "faithfulness", "answer_relevancy",
          "context_precision", "context_recall"]])
print("\nAggregate Scores:")
print(df[["faithfulness", "answer_relevancy",
          "context_precision", "context_recall"]].mean())

# ── Find failing examples ────────────────────────────────────────────────────
low_faithfulness = df[df["faithfulness"] < 0.7]
print(f"\n{len(low_faithfulness)} examples with faithfulness < 0.7 (potential hallucination)")

# ── Synthetic test set generation ────────────────────────────────────────────
from ragas.testset import TestsetGenerator
from langchain_community.document_loaders import DirectoryLoader

# Load your knowledge base documents
loader = DirectoryLoader("./docs", glob="**/*.pdf")
docs = loader.load()

generator = TestsetGenerator(llm=llm, embedding_model=embeddings)
testset = generator.generate_with_langchain_docs(
    docs,
    testset_size=50,
    # Generates: simple, reasoning, multi-context question types
)
testset_df = testset.to_pandas()
print(testset_df.head())
# Columns: user_input (question), reference (ground truth answer),
#          reference_contexts, synthesizer_name (question type)
Use RAGAS's TestsetGenerator to build your initial evaluation set from your actual knowledge base documents. It creates diverse question types — simple factual, multi-hop reasoning, and abstractive — that surface different failure modes. A 50-100 question synthetic test set gets you started immediately, even before you have collected real user queries.
Benchmarks & Task-Specific Metrics
Benchmarks serve a different purpose from application-level evaluation. While RAGAS and LLM-as-judge tell you how well your specific application is performing on your specific task, benchmarks provide a standardized, reproducible measure of a model's general capabilities that lets you compare across model versions, fine-tuning runs, and different models entirely. When you read that a model "scores 85 on MMLU" or "achieves 70% on HumanEval," you are reading the model's performance on a standardized test administered under controlled conditions.
Understanding the major benchmarks matters even if you are not running them yourself, because they directly inform which base model you should choose for a given application. A model with strong MMLU scores tends to be better at knowledge-intensive QA tasks. A model with strong HumanEval scores is better for code generation. A model with strong MT-Bench scores is better for multi-turn conversational interactions. Benchmark scores are not perfect predictors of application performance — the gap between benchmark performance and real-world performance is well documented, and benchmark contamination (where a model's training data includes benchmark questions) is a real problem — but they remain the most available signal for initial model selection.
Beyond the standard benchmarks, every production LLM application eventually needs a custom golden dataset: a curated set of questions with human-verified reference answers that is specific to your domain, your users' vocabulary, and your application's specific requirements. No general benchmark can tell you how well your insurance claims processing assistant handles the actual questions insurance adjusters ask. Building and maintaining a golden dataset is an investment, but it is the evaluation asset with the highest long-term value for an application team.
The Major Standard Benchmarks
MMLU (Massive Multitask Language Understanding) consists of 14,000 multiple-choice questions across 57 subjects ranging from elementary mathematics to professional law, medicine, and philosophy. It tests broad knowledge and reasoning across academic disciplines. A model scoring above 70% on MMLU is considered strong; GPT-4 and Claude Opus score above 85%. MMLU is useful for selecting models for knowledge-intensive applications but is sometimes criticized for being too easy for frontier models and for measuring memorization as much as reasoning.
HumanEval contains 164 Python programming problems where the model must generate a function body given the docstring and signature. Correctness is evaluated by running unit tests, not by textual comparison — the code has to actually work. The primary metric is pass@k: the probability that at least one of k generated solutions passes all tests. Pass@1 is the most commonly reported figure (does the first generation work?). HumanEval is the standard benchmark for code generation capability, though it has been largely solved by frontier models and is being superseded by harder benchmarks like HumanEval+ and SWE-bench.
MT-Bench (Multi-Turn Benchmark) consists of 80 multi-turn conversational questions spanning writing, roleplay, extraction, reasoning, math, coding, and STEM. The unique feature is that it tests how well a model handles follow-up questions — a critical capability for real chatbots that earlier single-turn benchmarks missed entirely. Models are evaluated by GPT-4 as a judge on a 1-10 scale. MT-Bench correlates strongly with user preference in real deployments, which is why it remains heavily cited despite being relatively small.
GPQA (Graduate-Level Google-Proof Q&A) is a challenging benchmark of 448 questions written by domain experts in biology, physics, and chemistry. Questions are designed to be "Google-proof" — you cannot answer them by finding facts online; you need to understand and apply graduate-level scientific reasoning. Even PhD-level experts score only around 65% within their own domains, and skilled non-experts with unrestricted web access score roughly 34%. GPQA is useful for evaluating models for high-stakes scientific applications and for measuring true reasoning capability versus knowledge retrieval.
Task-Specific Metrics Reference
BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between the model's output and one or more reference translations. It was designed for machine translation and remains the standard there, but it is widely misused for other tasks where it correlates poorly with quality. BLEU ranges from 0 to 1 (often reported as 0-100). For state-of-the-art neural machine translation, BLEU scores above 30 are considered good. Never use BLEU to evaluate open-ended generation like summarization or conversation — the n-gram overlap assumption is too restrictive for tasks with high legitimate output variability.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics designed for summarization. ROUGE-1 measures unigram overlap, ROUGE-2 measures bigram overlap, and ROUGE-L measures longest common subsequence. Unlike BLEU (which is precision-focused), ROUGE variants also measure recall — whether the summary covers the key information from the source. ROUGE-L above 0.4 is generally considered good for abstractive summarization, but the same caveats about n-gram overlap limiting validity for creative or diverse outputs apply.
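ROUGE-L needs only a longest-common-subsequence computation. The sketch below returns the F1 of LCS precision and recall; it omits the stemming and other preprocessing a reference implementation would apply.

```python
def rouge_l(candidate, reference):
    """ROUGE-L F1 from the longest common subsequence of tokens."""
    c, r = candidate.split(), reference.split()
    # dynamic-programming table for LCS length
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```

A short candidate that copies part of a longer reference gets perfect precision but low recall, which is exactly the behavior that makes ROUGE recall-oriented summarization scoring useful.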
BERTScore addresses the n-gram problem by comparing model and reference outputs in embedding space. It computes token-level cosine similarity between BERT embeddings of the model output and reference, then takes the F1 of the precision and recall scores. BERTScore correlates more strongly with human judgments than BLEU or ROUGE for most generative tasks and can handle paraphrases and synonyms that n-gram metrics penalize unfairly. The trade-off is higher computational cost and the need to choose a base BERT model whose embeddings are appropriate for your domain.
pass@k for code generation is computed by sampling k completions per problem and checking whether any of them passes all unit tests. The formula accounts for the combinatorics correctly: pass@k = 1 - C(n-c, k)/C(n, k), where n is the number of samples, c is the number that pass, and k is the number of tries reported. In practice, pass@1 is what matters for production (your user gets one code completion, and it either works or it does not), but pass@10 or pass@100 is useful for measuring the model's ceiling capability — whether a correct solution exists in the distribution even if it is not always first.
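The estimator is a few lines of Python. This comb-based form is fine for moderate n; the HumanEval paper uses an equivalent product formulation for numerical stability at large n.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n samples of which c passed, with budget k."""
    if n - c < k:
        # fewer failing samples than the budget: any size-k draw
        # must contain at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=4 samples of which c=2 pass, pass@2 is 1 - C(2,2)/C(4,2) = 1 - 1/6 ≈ 0.833.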
Start by collecting 100 real queries from your application's production logs or beta users. Have subject-matter experts write reference answers for each. Label a subset with additional properties (difficulty, topic, required reasoning type). This dataset becomes your primary regression test suite — every model change, prompt change, or retrieval change must be evaluated against it before deployment.
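One lightweight way to store such a dataset is JSON Lines, one labeled example per line. The field names below are illustrative assumptions, not a standard schema.

```python
import json

# one record per line of, e.g., a golden_set.jsonl file
# (all field names here are illustrative, not a standard)
example = {
    "id": "q-001",
    "query": "How do I reset my password?",
    "reference_answer": "Use Settings > Security > Reset Password.",
    "difficulty": "easy",
    "topic": "account-management",
    "reasoning_type": "lookup",
}
line = json.dumps(example)

# round-trips cleanly, so the file doubles as a versionable artifact
assert json.loads(line) == example
```

Keeping the dataset as a plain text file means it can live in version control next to the prompts it tests, so every eval run is reproducible against a specific dataset revision.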
Evaluation-Driven Development
Evaluation-Driven Development (EDD) is the LLM equivalent of Test-Driven Development (TDD) in classical software engineering. The core principle is identical: define your success criteria before you build, then build toward passing those criteria, and protect your progress with automated regression tests that run on every change. If you have spent time in software engineering, you already understand the value of this approach intellectually. In LLM development, it is even more important because the system's behavior is less deterministic and more sensitive to small changes than conventional software.
The reason most LLM application teams do not start with EDD is that it requires upfront investment before you have a working system. It feels counterintuitive to write tests when you do not yet know what the system will do. But this upfront investment pays back enormously within the first few iterations. Without an eval harness, every change to your prompt, your model version, your chunking strategy, or your retrieval method requires manual spot-checking. With an eval harness, those changes produce an objective score you can compare to the previous version. You stop relying on intuition and start making data-driven decisions.
The practical starting point for EDD is writing down what "good" looks like before you write any code. For a customer support assistant, good might mean: answers questions correctly according to the knowledge base, never invents policies that do not exist, maintains a professional tone, stays on-topic and redirects off-topic queries, and responds in under 200 words for simple questions. Each of these criteria can be operationalized as an automated check or a judge prompt dimension. You write these checks first, then build the system that passes them.
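Each criterion becomes a small check function. The sketch below uses deliberately crude heuristics for the support-assistant criteria above; the policy-name regex and the word cap are illustrative assumptions, not a real implementation.

```python
import re

MAX_WORDS_SIMPLE = 200  # illustrative cap from the criteria above

def check_length(response: str) -> bool:
    """Simple questions should be answered in under 200 words."""
    return len(response.split()) <= MAX_WORDS_SIMPLE

def check_known_policies(response: str, known_policies: set) -> bool:
    """Toy heuristic: any capitalized name following the word
    'policy' must exist in the knowledge base."""
    cited = set(re.findall(r"policy\s+([A-Z]\w+)", response))
    return cited <= known_policies

def run_checks(response: str, known_policies: set) -> dict:
    return {
        "length_ok": check_length(response),
        "policies_ok": check_known_policies(response, known_policies),
    }
```

Even checks this simple catch real regressions: a prompt change that makes the assistant start citing nonexistent policy names fails `policies_ok` immediately, before any human looks at the output.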
Once you have a baseline system and an eval suite, the development process becomes a disciplined loop: measure the current baseline, identify the biggest failure category, make a targeted change (rewrite the system prompt, adjust chunk size, add few-shot examples), measure again, compare. This loop separates LLM engineering from LLM guessing. Teams that commit to this process consistently report that it reduces the time to find effective improvements dramatically — not because individual improvements are faster to implement, but because you stop spending time on changes that feel intuitively right but actually hurt performance.
The most underappreciated part of EDD is regression protection. LLM systems fail in non-obvious ways. A prompt change that improves performance on FAQ questions might degrade performance on edge cases you did not test. A new model version might be better overall but worse on your specific domain's terminology. Without regression tests, these regressions go undetected until users complain. With regression tests, they are caught before deployment. The golden dataset you painstakingly built becomes more valuable over time precisely because it accumulates the edge cases and failure modes you have discovered — and ensures you never accidentally reintroduce a bug you have already fixed.
The Evaluation Loop in Detail
The full EDD cycle has six stages that repeat continuously. Stage 1: Define success criteria. For each major capability your system must have, write down a measurable definition of success. Be specific enough that two different people would agree whether a given output passes or fails. "Helpful" is not a criterion; "correctly identifies the policy that applies to the user's situation" is a criterion.
Stage 2: Build the eval dataset. Collect or create questions that test your criteria. Include three categories: golden examples (clear, unambiguous cases that any good system should handle), edge cases (unusual inputs, corner cases, questions at the boundary of your system's intended scope), and adversarial examples (inputs designed to probe failure modes — jailbreaks, ambiguous phrasing, conflicting context). A good initial dataset has 50-200 examples; a mature production dataset has 500-2000.
Stage 3: Implement baseline. Build the simplest reasonable implementation of your system. Do not over-engineer; you need something to measure against, not the final product. Run your eval suite and record the baseline scores. These become the floor that all future iterations must meet or exceed.
Stage 4: Measure. Run the full eval suite. Record per-category scores, not just overall averages — a system that scores 80% overall but 40% on edge cases and 95% on golden examples has a very different failure profile than a system that scores 80% uniformly. Look at the failing examples in detail; manual inspection of failures is the most efficient way to identify improvement opportunities.
Stage 5: Improve. Make one targeted change at a time. Changing multiple things simultaneously makes it impossible to attribute score changes to specific interventions. Common improvements to try in roughly increasing order of effort: system prompt rewrite, few-shot examples added, model version upgrade, chunk size adjustment, retrieval top-k adjustment, hybrid search, re-ranking, query rewriting.
Stage 6: Measure again and commit. Run the full eval suite. If scores improve on the target category without degrading on other categories, commit the change. If the change helps some categories but hurts others, decide explicitly whether the trade-off is acceptable. Never ship a change without this explicit decision — "it seems better" is not an acceptable deployment criterion when you have an eval harness.
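The Stage 6 decision can be enforced mechanically. A minimal regression gate over per-category scores might look like the sketch below; the tolerance parameter and the dict-of-floats score layout are assumptions of this example.

```python
def regression_gate(baseline: dict, candidate: dict, tolerance: float = 0.0):
    """Return (ok, regressions). ok is False if any category's score
    dropped below the baseline by more than the tolerance."""
    regressions = {
        cat: (baseline[cat], candidate.get(cat, 0.0))
        for cat in baseline
        if candidate.get(cat, 0.0) < baseline[cat] - tolerance
    }
    return (not regressions, regressions)
```

Wiring this into CI turns "it seems better" into an explicit, logged decision: a change either passes the gate, or someone consciously overrides it with the regression details in front of them.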
The EDD loop runs continuously throughout the lifetime of an LLM application. Stages 4-5-6 repeat for every improvement iteration; stages 1-2-3 are revisited when the application scope changes.
CI/CD Integration and Tooling
Modern LLM development treats eval runs the same way software development treats test suites: they run automatically on every pull request, block merges if regressions are detected, and produce dashboards that track performance trends over time. The infrastructure for this is not trivial to build from scratch, which is why purpose-built platforms have emerged.
LangSmith (by LangChain) is the most tightly integrated platform for teams already using LangChain. It records every run of your LLM chain — inputs, outputs, intermediate steps, latency, token counts — and allows you to run evaluators against those traces. You can compare experiments run over the same dataset, track metric trends across versions, and replay specific failing examples against new prompt versions. The eval runs are stored persistently, so you can compare current performance against runs from three months ago.
Braintrust is a dedicated LLM eval platform that provides a flexible SDK for defining experiments, a UI for analyzing results, and tight git integration for tracking which code version produced which eval scores. It is model-agnostic and works well for teams not using LangChain. Weights & Biases (W&B) — the standard tool for ML experiment tracking — has added LLM-specific evaluation tables and trace logging to its existing experiment tracking infrastructure, making it a good choice for teams that already use W&B for traditional ML workflows.
For A/B testing in production, the pattern is: serve two system variants to random user segments (or split by user ID for consistency), log all interactions, sample a subset for human review or LLM-as-judge scoring, and use statistical testing (typically a two-proportion z-test or Mann-Whitney U test for score distributions) to determine whether the difference in performance is statistically significant before making the winning variant permanent. Always run A/B tests long enough to collect at least several hundred samples per variant in each major query category before drawing conclusions.
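The two-proportion z-test is simple enough to compute without a stats library. A sketch, using the pooled-variance normal approximation (which assumes reasonably large samples per variant):

```python
from math import sqrt, erf

def two_proportion_z_test(wins_a: int, n_a: int, wins_b: int, n_b: int):
    """Two-sided z-test for a difference in success rates.
    Returns (z, p_value) under the pooled-variance normal approximation."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    p_pool = (wins_a + wins_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # degenerate: all successes or all failures
    z = (p_a - p_b) / se
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

With 90/100 successes for variant A against 60/100 for variant B, the p-value is far below 0.001; with identical rates it is 1.0, which is why sample size, not eyeballing, decides the winner.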
Never evaluate only on the examples you used to develop your system. This is as invalid as training on test data in traditional ML. Maintain a strict separation between development examples (which you look at while iterating) and held-out test examples (which you only run when you believe a version is ready for deployment). The held-out examples provide the only unbiased estimate of real-world performance.
Human Evaluation
Every automated metric ultimately derives its authority from correlation with human judgment. BLEU, ROUGE, BERTScore, LLM-as-judge — all of them are proxies for what humans actually think is a good response. This means that however sophisticated your automated eval pipeline becomes, you need periodic human evaluation to calibrate it and to catch systematic failures that your automated metrics miss. Human eval is not a legacy practice that will be automated away; it is the foundation that makes all other evaluation trustworthy.
The practical challenge with human evaluation is cost and speed. Having a domain expert review and annotate 500 responses takes days and costs hundreds to thousands of dollars. This is why human eval is not run continuously — it is run periodically (at major version releases, when automated metrics diverge from user satisfaction signals, or when entering a new product phase) and on carefully chosen samples (stratified to cover different query types and difficulty levels, not random samples that will oversample easy cases).
The design of the human evaluation task matters as much as who does it. Annotators given vague instructions ("rate this response on a scale of 1-5 for quality") will apply wildly different interpretations of quality, producing data that is almost meaningless in aggregate. Annotators given precise rubrics ("1 = the response contains a factual error; 2 = the response is technically accurate but incomplete; 3 = the response is complete but uses overly technical language; 4 = the response is complete and appropriately worded; 5 = the response is complete, appropriately worded, and anticipates a follow-up need") will produce consistent, actionable data. Writing good annotation guidelines is skilled work that takes significant iteration.
Annotation Guidelines and Rubric Design
A well-designed annotation rubric has several properties. First, it is decomposed into independent dimensions: factual accuracy, completeness, tone appropriateness, response length, safety. Annotating each dimension separately produces more actionable data than a single holistic quality rating, because you can identify which dimension is driving low scores. An annotator can be good at judging factual accuracy but poor at judging tone appropriateness — decomposed rubrics let you route different dimensions to annotators with different expertise.
Second, good rubrics include anchor examples for each rating level on each dimension. Abstract descriptions of what constitutes a "3" on a 5-point scale are ambiguous; three concrete examples of "3" responses anchor the rating to a shared interpretation. The process of creating anchor examples is itself valuable: it forces the rubric designer to think carefully about the boundary cases and often reveals ambiguities in the rubric that need to be resolved before annotation begins.
Third, rubrics must be calibrated across annotators before the main annotation run. Have all annotators rate the same 20-30 examples independently, then compare their ratings. High disagreement on specific examples reveals ambiguities in the rubric that need to be resolved with clarification or revised anchor examples. This calibration session also serves as training — walking through disagreements together helps annotators converge on a shared understanding of the task that cannot be fully conveyed in written instructions alone.
Inter-Annotator Agreement
Inter-annotator agreement (IAA) is the statistical measure of how consistently different annotators apply the same rubric to the same examples. Low IAA is a warning sign that your rubric is ambiguous, your annotator pool is too diverse in background, or the task is inherently subjective at the level you are trying to measure. The two most widely used IAA metrics are Cohen's kappa (for two annotators) and Fleiss' kappa (for three or more annotators). Both produce a value from -1 to 1, where 0 represents chance-level agreement; by the commonly used Landis and Koch scale, values below 0.2 indicate slight agreement, 0.2-0.4 fair, 0.4-0.6 moderate, 0.6-0.8 substantial, and above 0.8 near-perfect.
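Cohen's kappa is straightforward to compute from paired labels: observed agreement corrected by the agreement expected if both annotators labeled independently at random. A sketch:

```python
from collections import Counter

def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """Cohen's kappa: observed agreement corrected for chance."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # chance agreement from each annotator's marginal label frequencies
    expected = sum((counts_a[label] / n) * (counts_b[label] / n)
                   for label in set(counts_a) | set(counts_b))
    if expected == 1.0:
        return 1.0  # degenerate: both annotators always give one identical label
    return (observed - expected) / (1 - expected)
```

Note how chance correction bites: two annotators who agree on 3 of 4 items can still land at kappa of only 0.5 if their label distributions make agreement-by-luck likely.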
For most LLM evaluation tasks, IAA of 0.4-0.6 is achievable with good rubric design. Tasks involving factual accuracy tend toward higher IAA (facts are right or wrong); tasks involving tone, helpfulness, or creativity tend toward lower IAA (these are genuinely more subjective). Reporting IAA alongside your eval results is an important practice: an annotation with low IAA and a 4.2/5 average quality score is a much weaker claim than a 4.2/5 with high IAA.
When IAA is unacceptably low on a particular dimension, you have three options: revise the rubric for that dimension (add more anchor examples, sharpen the definitions), resolve conflicts through adjudication (have a third annotator or a senior annotator resolve disagreements between the first two), or accept that the dimension is too subjective to measure reliably and remove it from the rubric. All three options are valid depending on the importance of the dimension to your application.
Crowdsourcing platforms like Scale AI, Labelbox, and Amazon Mechanical Turk provide access to large annotator pools for high-volume annotation. They are appropriate when: your task does not require specialized domain expertise, your rubric is well-tested and produces consistent results with non-expert annotators, and you need hundreds or thousands of annotations quickly. Expert annotation (using domain specialists — doctors for medical content, lawyers for legal content, software engineers for code quality) is appropriate when the task requires deep domain knowledge that crowd workers reliably lack. The choice between the two is not about cost alone: low-quality annotations produced cheaply at high volume can be worse than no annotation at all.
Red-Teaming and Shadow Mode Evaluation
Red-teaming is structured adversarial testing: a group of people whose explicit job is to find ways to make your LLM system fail, produce harmful outputs, or behave in ways that violate your guidelines. Red-teaming is qualitatively different from standard eval because it is not trying to measure average-case performance — it is trying to find the worst-case inputs your system will encounter. The output of a red-team exercise is a list of specific failure cases and input patterns that your system mishandles, which then get added to your regression test suite so those failure modes are permanently monitored.
Effective red-teaming requires diversity in the team: people with different backgrounds, mental models, and attack strategies will find different failure modes. It requires explicit guidance on what to look for: for a customer support bot, red-teamers should probe for policy hallucination, jailbreaks that get the bot to produce off-brand content, edge cases around ambiguous policy interpretation, and multi-turn manipulation attempts. It also requires a systematic output format so that discovered failures are clearly documented with the exact input sequence and the problematic output, not just a general description of the problem.
Shadow mode evaluation is the production-scale complement to red-teaming. In shadow mode, you log all production traffic and periodically sample a subset for human review. The sampling strategy matters enormously: pure random sampling will heavily oversample common, well-handled queries. A more useful strategy is stratified sampling — oversample by uncertainty proxy (e.g., low-confidence retrieval scores, long responses, or responses with high perplexity), oversample recent queries (to catch distribution shift), and include a random baseline for comparison. The humans reviewing the shadow sample become your earliest warning system for quality problems that your automated metrics have not yet been calibrated to detect.
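A sketch of such a stratified sampler follows; the log record fields (`retrieval_score`, `ts`) and the stratum sizes are assumptions of this example, not a prescribed schema.

```python
import random

def shadow_sample(logs, per_stratum=50, low_conf_threshold=0.5, seed=0):
    """Stratified shadow-mode sample: low-confidence retrievals,
    most recent traffic, and a uniform random baseline."""
    rng = random.Random(seed)  # seeded so a given sample is reproducible

    def take(pool):
        return rng.sample(pool, min(per_stratum, len(pool)))

    low_conf = [r for r in logs if r["retrieval_score"] < low_conf_threshold]
    recent = sorted(logs, key=lambda r: r["ts"])[-10 * per_stratum:]
    return {
        "low_confidence": take(low_conf),
        "recent": take(recent),
        "random_baseline": take(logs),
    }
```

The random baseline stratum is not optional decoration: without it you cannot tell whether the problems surfacing in the oversampled strata are representative or artifacts of the oversampling itself.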
Finally, the human preference data collected during evaluation need not serve evaluation alone: it can be recycled as RLHF-style training signal. Every preference judgment ("response A was better than response B for this query") is potentially a training example for a reward model that can then be used to fine-tune your base model's tendency to produce the preferred style of response. This is how the preference data that OpenAI, Anthropic, and Google collect through their human feedback processes feeds back into model training — and it is available to application teams as well, through techniques like DPO (Direct Preference Optimization) that train directly on preference pairs without requiring a separate reward model.
Do not wait until you have a perfect eval infrastructure to start collecting human judgments. Even a simple spreadsheet where team members tag 10 production responses per week as "good," "acceptable," or "needs improvement" with one-sentence explanations accumulates into a valuable dataset over time. When you are ready to build a proper annotation system, that backlog of judgments provides calibration data and reveals the most important quality dimensions for your specific application.
Interview Ready
LLM evaluation is fundamentally harder than traditional ML evaluation because there is no single correct answer for open-ended generation tasks. A production-grade eval strategy layers three approaches: cheap automated metrics (BLEU, ROUGE, BERTScore) for every CI run, LLM-as-judge for nuanced quality assessment at scale, and periodic human evaluation to calibrate everything else. For RAG systems specifically, RAGAS decomposes evaluation into four independent metrics — faithfulness, answer relevance, context precision, and context recall — so you can pinpoint exactly which pipeline stage is failing. The teams that ship reliable LLM products are the ones that treat evaluation as continuous infrastructure, not a one-time checklist.
Top Interview Questions
| Question | What They're Really Asking |
|---|---|
| How do you evaluate an LLM application in production? | Do you understand the layered eval stack and can you design a practical strategy that balances cost, speed, and accuracy? |
| What is the difference between BLEU, ROUGE, and BERTScore? | Can you explain reference-based metrics, their trade-offs, and when each is appropriate versus misleading? |
| How does LLM-as-judge work and what are its failure modes? | Do you know how to use a strong model to evaluate a weaker one, and can you identify and mitigate position bias, verbosity bias, and self-enhancement bias? |
| How would you detect and measure hallucinations in a RAG system? | Can you connect faithfulness metrics, claim decomposition, and entailment checking into a practical hallucination detection pipeline? |
| When would you use A/B testing versus offline evaluation for an LLM feature? | Do you understand the difference between offline eval (controlled, fast, cheaper) and online A/B testing (real user behavior, statistical significance), and when each is warranted? |
Model Answers
How do you evaluate an LLM application in production?
I build a three-tier eval stack. The first tier runs on every CI push: automated metrics like ROUGE for summarization tasks and exact-match or F1 for extractive QA, plus behavioral assertions that check format, length, and safety constraints. The second tier uses LLM-as-judge with a detailed rubric to score a sampled subset of production responses on dimensions like accuracy, completeness, and tone — this costs roughly $0.01-0.10 per evaluation and catches quality issues that surface-level metrics miss. The third tier is periodic human evaluation on a stratified sample, which calibrates the automated tiers and catches systematic blind spots. I also track user-facing signals like thumbs-up/down rates, session completion, and escalation rates as lagging indicators that validate the eval pipeline itself.
What is the difference between BLEU, ROUGE, and BERTScore?
BLEU measures precision of n-gram overlaps between the generated text and a reference — it asks "what fraction of the generated n-grams appear in the reference?" and was originally designed for machine translation. ROUGE measures recall of n-gram overlaps — it asks "what fraction of the reference n-grams appear in the generated text?" and is the standard metric for summarization. Both are surface-level string metrics that penalize valid paraphrases. BERTScore addresses this by computing cosine similarity between contextual embeddings of tokens in the generated and reference texts, capturing semantic similarity rather than exact word overlap. In practice, BERTScore correlates better with human judgment for open-ended generation, but all three require a reference answer, which limits their applicability to tasks where ground truth can be established.
How does LLM-as-judge work and what are its failure modes?
You prompt a strong model (GPT-4o, Claude) with a detailed rubric and ask it to score another model's output, producing both a numeric rating and a reasoning trace. The main failure modes are position bias (preferring whichever answer appears first in a pairwise comparison, mitigated by position swapping and averaging), verbosity bias (inflating scores for longer responses regardless of quality, mitigated by explicit rubric instructions penalizing padding), and self-enhancement bias (a model rating its own style of output higher, mitigated by using a different model family as judge than the one being evaluated). I always log the full reasoning trace, not just scores, because the reasoning reveals systematic evaluation patterns that need correction.
How would you detect and measure hallucinations in a RAG system?
I use the RAGAS faithfulness metric as the core hallucination detector. It works in two steps: first, an LLM decomposes the generated answer into atomic factual claims; then, for each claim, a second LLM call checks whether that claim is entailed by the retrieved context chunks. The faithfulness score is the fraction of claims that are supported — a score below 1.0 means some claims have no basis in the retrieved documents. I complement this with an NLI-based factual consistency checker for faster, cheaper continuous monitoring, and I flag responses where the retriever returned low-confidence results as high-risk for hallucination. For critical applications, I add a verification step where the system explicitly cites which context chunk supports each claim, making hallucinations visible to end users.
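The scoring step reduces to a fraction once claims are decomposed. In the sketch below, `naive_entails` is a deliberately crude token-overlap stand-in for the LLM or NLI entailment call, included only so the pipeline shape is runnable.

```python
def naive_entails(claim: str, context: str) -> bool:
    # stand-in for an NLI/LLM entailment check: every claim token
    # must literally appear in the retrieved context
    return set(claim.lower().split()) <= set(context.lower().split())

def faithfulness_score(claims: list, context: str,
                       entails=naive_entails) -> float:
    """Fraction of atomic claims supported by the retrieved context."""
    if not claims:
        return 0.0
    return sum(entails(c, context) for c in claims) / len(claims)
```

Swapping `entails` for a real entailment model turns this into the production check; the structure (decompose, check each claim, report the supported fraction) is what matters.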
When would you use A/B testing versus offline evaluation for an LLM feature?
Offline evaluation is my default for iterative development — it is fast, deterministic, and cheap. I run my eval suite on every prompt change, model swap, or retrieval adjustment to get immediate signal before anything reaches users. A/B testing is reserved for changes where I need to measure real user behavior that offline metrics cannot capture: engagement patterns, task completion rates, user satisfaction, and downstream business metrics. I require statistical significance (typically several hundred samples per variant per query category) before declaring a winner. The two approaches are complementary: offline eval gates what gets deployed to the A/B test, and A/B test results calibrate which offline metrics actually predict user satisfaction.
You are building a medical question-answering system backed by a curated knowledge base. Design the evaluation pipeline.
A strong answer layers RAGAS metrics (faithfulness is critical — hallucinated medical information is dangerous), a specialized LLM-as-judge with a medical accuracy rubric reviewed by clinicians, domain-expert human evaluation on a weekly stratified sample, automated safety checks that flag responses containing dosage information or diagnostic language for mandatory human review, and an A/B testing framework that measures not just answer quality but whether users follow up with their doctor or take action based on the response. The key insight is that evaluation criteria must reflect the domain's risk profile — a "pretty good" eval pipeline is not acceptable when wrong answers can harm patients.
Common Mistakes
- Relying solely on BLEU/ROUGE for open-ended generation. These surface-level n-gram metrics penalize valid paraphrases and correlate poorly with human judgment for creative or explanatory tasks. They are useful for translation and summarization benchmarks but misleading as the sole quality signal for conversational or reasoning-heavy applications.
- Using the same model as both generator and judge. Self-enhancement bias means GPT-4 will systematically rate GPT-4-style outputs higher than equivalently correct outputs in a different style. Always use a different model family for judging, or at minimum, validate your judge's ratings against human annotations to quantify the bias.
- Evaluating only on development examples and skipping held-out test sets. This is the LLM equivalent of training on test data. Your development examples are the queries you optimized for — they will always look good. A held-out set that you never inspect during development provides the only unbiased estimate of real-world performance, and regression tests on this set prevent you from silently breaking edge cases while improving common cases.