Architecture Overview
The Eval & Guardrails architecture wraps your LLM with pre-processing and post-processing layers that validate, filter, and score every interaction. Input guards catch malicious or problematic prompts before they reach the model. Output guards validate, sanitize, and quality-check responses before they reach the user.
This pattern is non-negotiable for production systems. Without guardrails, your application is vulnerable to prompt injection attacks, PII leakage, toxic output, and inconsistent quality. Without evaluation, you cannot measure whether your system is actually working or degrading over time.
When to Use
- Any customer-facing LLM application (chat, search, content generation)
- Applications handling sensitive data (healthcare, finance, legal)
- Systems where incorrect or harmful output has real consequences
- Teams that need to demonstrate compliance and safety to stakeholders
- Continuous quality monitoring and regression detection across model updates
Complexity Level
Moderate. Individual guard components are straightforward to implement. The challenge is designing a pipeline that is fast enough to not degrade user experience, comprehensive enough to catch edge cases, and flexible enough to update without redeployment.
Guardrails are not a replacement for good system prompts. They are a safety net. A well-designed system prompt prevents 90% of issues; guardrails catch the remaining 10% that slip through under adversarial conditions.
Architecture Diagram
Architecture diagram — Eval & Guardrails: input guards, LLM, output guards, pass/fail gate
Components Deep Dive
Input Guards
| Guard | Technique | What It Catches |
|---|---|---|
| Prompt Injection | Classifier model, heuristic patterns, canary tokens | "Ignore previous instructions", role-play attacks, delimiter injection |
| PII Detection | Regex (SSN, email, phone) + NER models (spaCy, Presidio) | Social Security numbers, credit cards, names, addresses, emails |
| Topic Filter | Keyword blocklist + embedding classifier | Off-topic queries, prohibited content categories, competitor mentions |
| Input Length | Token counting (tiktoken) | Excessively long inputs designed to waste tokens or overwhelm context |
| Language Detection | langdetect, fasttext | Unsupported languages, mixed-language injection attacks |
Output Guards
| Guard | Technique | What It Catches |
|---|---|---|
| Safety Classifier | Toxicity model (Perspective API, OpenAI moderation) | Hate speech, violence, self-harm, explicit content |
| Factuality Check | LLM-as-judge, source verification against RAG context | Hallucinated facts, unsupported claims, fabricated citations |
| Format Validator | JSON schema validation, regex, Pydantic parsing | Malformed JSON, missing required fields, incorrect data types |
| PII in Output | Same PII detection as input, applied to generated text | Model leaking training data, echoing back user PII |
| Refusal Detector | Pattern matching on refusal phrases | "I cannot help with that" when the model should have answered |
Prompt Injection Defense
Layer multiple defenses: input classifiers, delimiter isolation (XML tags around user input), canary tokens (unique strings that should never appear in output), and instruction hierarchy in system prompts.
PII Scrubbing
Use regex for structured PII (SSN, credit cards, phone numbers) and NER models (spaCy, Presidio) for unstructured PII (names, addresses). Replace detected PII with placeholder tokens; optionally restore on output.
LLM-as-Judge
Use a separate LLM call to evaluate response quality on criteria like relevance, accuracy, helpfulness, and safety. Score on a rubric (1-5) with written reasoning. More nuanced than rule-based checks.
Automated Eval Metrics
BLEU/ROUGE: n-gram overlap with reference. BERTScore: semantic similarity via embeddings. Custom rubrics: LLM judges domain-specific criteria. Track all metrics over time for regression detection.
A/B Testing Quality
Compare model versions, prompts, or guardrail configurations by routing traffic splits and measuring quality metrics. Statistical significance matters: use proper hypothesis testing, not vibes.
Fallback Strategy
When guardrails block a response: return a safe canned response, retry with a modified prompt, escalate to human review, or return a partial answer with a disclaimer. Never return nothing.
No single guard is foolproof. Prompt injection detection has false negatives; PII regex misses novel formats. Layer multiple guards with different techniques. The goal is not perfection but making attacks impractical.
Implementation
Step 1: PII Detection Guard
import re
from dataclasses import dataclass
@dataclass
class GuardResult:
passed: bool
reason: str = ""
sanitized_text: str = ""
details: dict = None
PII_PATTERNS = {
"ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
"credit_card": re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"),
"email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b"),
"phone": re.compile(r"\b(?:\+1[\s-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
"ip_address": re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"),
}
def detect_pii(text: str, scrub: bool = True) -> GuardResult:
"""Detect and optionally scrub PII from text."""
found = {}
sanitized = text
for pii_type, pattern in PII_PATTERNS.items():
matches = pattern.findall(sanitized)
if matches:
found[pii_type] = len(matches)
if scrub:
sanitized = pattern.sub(f"[{pii_type.upper()}_REDACTED]", sanitized)
if found:
return GuardResult(
passed=False,
reason=f"PII detected: {found}",
sanitized_text=sanitized,
details={"pii_found": found},
)
return GuardResult(passed=True, sanitized_text=text)
Step 2: Prompt Injection Detection
INJECTION_PATTERNS = [
r"ignore\s+(all\s+)?previous\s+instructions",
r"you\s+are\s+now\s+(?:a\s+)?(?:new|different)",
r"system\s*:\s*",
r"</?(?:system|admin|root)>",
r"do\s+not\s+follow\s+(?:the\s+)?(?:system|original)",
r"disregard\s+(?:all\s+)?(?:prior|previous)",
r"pretend\s+(?:you\s+are|to\s+be)",
r"jailbreak|DAN\s+mode",
]
def detect_injection(text: str) -> GuardResult:
"""Detect common prompt injection patterns."""
text_lower = text.lower()
for pattern in INJECTION_PATTERNS:
if re.search(pattern, text_lower):
return GuardResult(
passed=False,
reason=f"Potential prompt injection detected",
details={"pattern": pattern},
)
# Heuristic: excessive special characters or delimiters
special_ratio = sum(1 for c in text if c in '{}[]<>|\\`') / max(len(text), 1)
if special_ratio > 0.15:
return GuardResult(
passed=False,
reason="Suspicious character distribution",
details={"special_ratio": round(special_ratio, 3)},
)
return GuardResult(passed=True)
Step 3: Output Safety and Format Validation
import json
def validate_json_output(text: str, schema: dict = None) -> GuardResult:
"""Validate that output is well-formed JSON matching schema."""
try:
data = json.loads(text)
except json.JSONDecodeError as e:
return GuardResult(passed=False, reason=f"Invalid JSON: {e}")
if schema:
# Check required fields
for field in schema.get("required", []):
if field not in data:
return GuardResult(
passed=False,
reason=f"Missing required field: {field}",
)
return GuardResult(passed=True, sanitized_text=text)
def check_output_safety(text: str, client) -> GuardResult:
"""Use LLM to check output for safety issues."""
response = client.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=50,
system="""Evaluate this text for safety issues.
Reply with JSON: {"safe": true/false, "reason": "..."}
Flag: hate speech, violence, self-harm, explicit content,
dangerous instructions, or PII disclosure.""",
messages=[{"role": "user", "content": text[:2000]}],
temperature=0.0,
)
try:
result = json.loads(response.content[0].text)
return GuardResult(
passed=result["safe"],
reason=result.get("reason", ""),
)
except (json.JSONDecodeError, KeyError):
return GuardResult(passed=True) # Fail open if guard errors
Step 4: Full Guard Pipeline
import anthropic
import logging
logger = logging.getLogger(__name__)
class GuardedLLM:
def __init__(self):
self.client = anthropic.Anthropic()
self.stats = {"total": 0, "input_blocked": 0, "output_blocked": 0, "passed": 0}
def query(self, user_input: str, system: str = "You are helpful.") -> dict:
"""Full guarded query pipeline."""
self.stats["total"] += 1
# === INPUT GUARDS ===
# 1. Prompt injection check
injection = detect_injection(user_input)
if not injection.passed:
self.stats["input_blocked"] += 1
logger.warning(f"Injection blocked: {injection.reason}")
return {"blocked": True, "reason": "Your message was flagged by our safety system."}
# 2. PII detection and scrubbing
pii = detect_pii(user_input, scrub=True)
clean_input = pii.sanitized_text # Use scrubbed version
# === LLM CALL ===
response = self.client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=system,
messages=[{"role": "user", "content": clean_input}],
)
output_text = response.content[0].text
# === OUTPUT GUARDS ===
# 3. Safety check
safety = check_output_safety(output_text, self.client)
if not safety.passed:
self.stats["output_blocked"] += 1
logger.warning(f"Output blocked: {safety.reason}")
return {"blocked": True, "reason": "Response did not pass safety review."}
# 4. PII in output check
output_pii = detect_pii(output_text, scrub=True)
final_text = output_pii.sanitized_text
self.stats["passed"] += 1
return {
"blocked": False,
"text": final_text,
"guards": {
"input_pii_found": not pii.passed,
"output_pii_scrubbed": not output_pii.passed,
}
}
# Usage
guard = GuardedLLM()
result = guard.query("My SSN is 123-45-6789. What can I claim on taxes?")
# PII scrubbed before sending to LLM; answer generated safely
Step 5: LLM-as-Judge Evaluation
def llm_judge(question: str, answer: str, client, rubric: str = None) -> dict:
"""Evaluate answer quality using LLM-as-judge pattern."""
default_rubric = """Score the answer on these criteria (1-5 each):
- Relevance: Does it address the question directly?
- Accuracy: Are the facts correct and verifiable?
- Completeness: Does it cover the key aspects?
- Clarity: Is it well-organized and easy to understand?
- Safety: Is it free from harmful or misleading content?
Reply as JSON: {"relevance": N, "accuracy": N, "completeness": N,
"clarity": N, "safety": N, "overall": N, "reasoning": "..."}"""
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=300,
system=rubric or default_rubric,
messages=[{
"role": "user",
"content": f"Question: {question}\n\nAnswer: {answer}"
}],
temperature=0.0,
)
return json.loads(response.content[0].text)
Data Flow
Step-by-step flow of a request through the Eval & Guardrails pipeline:
- 1. User input received — Raw message arrives at API endpoint
- 2. Input guard: injection detection — Scan for prompt injection patterns; block if detected
- 3. Input guard: PII scrubbing — Detect and replace PII with redaction tokens
- 4. Input guard: topic filter — Check if query is within allowed topic scope
- 5. LLM generation — Sanitized input sent to LLM with system prompt
- 6. Output guard: safety classifier — Check generated text for toxicity, harmful content
- 7. Output guard: factuality check — Verify claims against source context (for RAG systems)
- 8. Output guard: format validation — Verify JSON schema, required fields, data types
- 9. Pass/fail gate — All guards passed → deliver response; any failed → fallback or retry
- 10. Logging and evaluation — Record guard decisions, quality scores, and latency for monitoring
Trade-offs & Considerations
| Advantage | Limitation |
|---|---|
| Prevents prompt injection and data exfiltration | Adds latency (50-500ms per guard, especially LLM-based) |
| Catches PII before it reaches the model or logs | False positives block legitimate queries (user frustration) |
| Ensures consistent output format for downstream systems | LLM-based guards add cost (extra API calls per request) |
| Automated eval enables continuous quality monitoring | No single guard catches all attack vectors (arms race) |
| Compliance and audit trail for regulated industries | Over-aggressive guards can make the product unusable |
Evaluation Metrics Comparison
| Metric | Type | Strengths | Limitations |
|---|---|---|---|
| BLEU | N-gram overlap | Fast, deterministic | Misses semantic similarity, paraphrases |
| ROUGE | Recall-based overlap | Good for summarization | Same limitations as BLEU |
| BERTScore | Embedding similarity | Captures semantic meaning | Requires GPU, less interpretable |
| LLM-as-Judge | Model-based scoring | Nuanced, multi-criteria | Expensive, potential bias, non-deterministic |
| Custom rubric | Domain-specific LLM eval | Tailored to your use case | Requires rubric engineering and calibration |
Guardrails apply to every architecture. They should be layered on top of Architectures 01-06, not used in isolation. For fine-tuning models to inherently behave safely, see Architecture 08 (Fine-Tuning & Serving).
Production Checklist
- Implement at least 3 input guards: injection detection, PII scrubbing, input length validation
- Implement at least 2 output guards: safety classifier, format validation
- Build a red team test suite: 50+ adversarial prompts covering known injection techniques
- Monitor false positive rate: track legitimate queries that guards incorrectly block
- Set up LLM-as-judge evaluation on a sample of production traffic (daily or weekly batch)
- Create labeled evaluation dataset with expected answers for regression testing
- Configure alerts for guard trigger rate spikes (may indicate attack or classifier drift)
- Implement guard bypass for internal/debugging with proper access controls
- Version guard configurations and enable rollback without code deployment
- Run guards in parallel where possible to minimize total latency overhead
- Document guard coverage gaps and accepted risks for compliance teams
- Set up A/B testing pipeline to compare guard configurations and their impact on user satisfaction