Agent Architecture
Plain Language
Imagine you hire a team of research analysts to write a report on a complex topic — say, "the impact of AI on healthcare in 2025." You wouldn't hand this to one person. Instead, you'd have a project manager who breaks down the topic into research questions, researchers who go out and gather information from different sources, an analyst who synthesizes the findings and identifies patterns, a writer who turns the analysis into a polished report, and an editor who reviews the final product for quality. Our autonomous report agent mirrors this exact team structure using multiple AI agents.
Each agent is a specialized LLM instance with a focused system prompt and a specific set of tools. The Planner agent takes the user's topic and breaks it into research sub-tasks. The Researcher agent has access to web search, document retrieval, and data APIs to gather raw information. The Synthesizer agent takes the collected research notes and produces structured analysis. The Writer agent transforms the analysis into a well-formatted report with sections, citations, and visualizations. And the Reviewer agent checks the report for factual consistency, proper citations, and quality.
These agents are orchestrated by a LangGraph state machine that defines the workflow: plan, research, synthesize, write, review. If the reviewer finds issues, the workflow loops back to the appropriate stage for corrections. A human-in-the-loop checkpoint lets the user review and approve the research plan before the system spends time and API credits executing it. The entire pipeline is wrapped with guardrails that prevent the agents from generating harmful content, exceeding budget limits, or going off-topic.
This capstone brings together nearly every concept from the course: LLM APIs for the agent brains, prompt engineering for the system prompts, tool use for external data access, RAG for document-grounded research, LangGraph for workflow orchestration, guardrails for safety, evaluation for quality checking, and deployment for making it all run reliably.
Deep Dive
The architecture uses a supervisor pattern where a central orchestrator (the LangGraph state machine) routes work between specialized agents. Each agent is a node in the graph with its own LLM configuration, system prompt, and tool set. The state flows through the graph as a typed dictionary that accumulates research findings, analysis, and the final report.
The state is the central data structure that flows through every node. It uses a TypedDict to ensure type safety and includes all the accumulated data from each stage of the pipeline:
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

class ReportState(TypedDict):
    topic: str                  # User's report topic
    plan: dict                  # Research plan from Planner
    plan_approved: bool         # Human approval flag
    research_notes: list[dict]  # Raw findings from Researcher
    analysis: str               # Synthesized analysis
    report: str                 # Final report (Markdown)
    review_feedback: str        # Reviewer's feedback
    review_passed: bool         # Quality gate
    revision_count: int         # Track revision loops
    messages: Annotated[list, add_messages]  # Agent messages
    budget_used: float          # Cost tracking ($)
    errors: list[str]           # Error log
The state grows as each agent contributes its output. The Planner writes the plan field, the Researcher appends to research_notes, the Synthesizer writes analysis, the Writer produces the report, and the Reviewer sets review_passed with review_feedback. The revision_count field prevents infinite loops — after 3 revisions, the system accepts the report as-is and flags it for human review. The budget_used field tracks accumulated API costs to enforce spending limits.
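The merge semantics are worth seeing in miniature: each node returns only the keys it changed, and the graph folds that partial update into the shared state, where plain keys overwrite and reducer-annotated keys (like `messages` with `add_messages`) append. A framework-free sketch of the idea:

```python
def merge_update(state: dict, update: dict, append_keys=("messages",)) -> dict:
    # Plain keys are overwritten; reducer-style keys are appended,
    # mimicking how Annotated[list, add_messages] behaves in LangGraph.
    merged = dict(state)
    for key, value in update.items():
        if key in append_keys:
            merged[key] = merged.get(key, []) + value
        else:
            merged[key] = value
    return merged

state = {"revision_count": 0, "messages": []}
state = merge_update(state, {"report": "# Draft", "messages": ["writer done"]})
state = merge_update(state, {"revision_count": 1})
assert state["report"] == "# Draft"
assert state["messages"] == ["writer done"]
assert state["revision_count"] == 1
```

Because each node touches only its own keys, the Researcher can never clobber the Planner's output, and the revision loop can update `revision_count` without rewriting the rest of the state.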
We use a supervisor pattern rather than a fully autonomous agent swarm because it provides predictability and debuggability. Each agent has a clear role, the workflow is deterministic (given the same state, it always routes to the same next node), and you can inspect the state at any point to understand what happened. This is critical for a capstone project where you need to demonstrate mastery of the concepts.
Research Agent & Tools
Plain Language
The Research agent is the workhorse of our system — it goes out into the world and gathers information. Think of it as a dedicated research assistant who can search the web, query databases, read documents, and call APIs, all to answer the specific research questions outlined in the plan.
What makes this agent powerful is its tool belt. Each tool is a well-defined function that the agent can call: a web search tool that queries Google or Bing and returns summaries of relevant pages, a RAG retrieval tool that searches through previously uploaded documents (reusing the Document Portal from Capstone I), and data API tools that can fetch structured data like stock prices, weather data, or public datasets.
The agent uses the ReAct pattern — it reasons about what information it needs, selects a tool, observes the results, and then decides whether it has enough information or needs to search further. For each research question in the plan, the agent iterates through multiple tool calls until it has gathered sufficient evidence, then records its findings as structured notes with source citations.
Importantly, the Research agent is bounded. It has a maximum number of tool calls per question (to prevent runaway API costs), a timeout per research task (to keep total execution time reasonable), and content filters that flag potentially unreliable sources. These bounds are configurable and enforced by the guardrails system we'll build in Section 5.
Deep Dive
Let's build the tools and the Research agent that uses them:
from langchain_core.tools import tool
from openai import AsyncOpenAI
import httpx, json, os

TAVILY_KEY = os.environ["TAVILY_API_KEY"]  # Tavily search credential

@tool
async def web_search(query: str) -> str:
    """Search the web for information on a topic.
    Returns summaries of the top 5 results."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(  # Tavily's search endpoint is a POST
            "https://api.tavily.com/search",
            json={"query": query, "max_results": 5,
                  "api_key": TAVILY_KEY}
        )
        results = resp.json()["results"]
    return "\n\n".join(
        f"[{r['title']}]({r['url']})\n{r['content']}"
        for r in results
    )

@tool
async def rag_retrieve(query: str, collection: str = "default") -> str:
    """Search uploaded documents using RAG.
    Returns relevant passages with citations."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "http://localhost:8000/query",
            json={"question": query, "collection": collection}
        )
        data = resp.json()
    sources = "\n".join(
        f"- {s['filename']} p.{s['page']}"
        for s in data.get("sources", [])
    )
    return f"{data['answer']}\n\nSources:\n{sources}"

@tool
async def fetch_data(url: str, description: str) -> str:
    """Fetch structured data from a public API endpoint.
    Provide the URL and a description of what you expect."""
    async with httpx.AsyncClient() as client:
        resp = await client.get(url, timeout=15.0)
        resp.raise_for_status()
    # Truncate large responses to avoid context overflow
    text = resp.text[:5000]
    return f"Data from {url}:\n{text}"

@tool
def save_note(title: str, content: str, sources: list[str]) -> str:
    """Save a research finding as a structured note."""
    return json.dumps({
        "title": title,
        "content": content,
        "sources": sources
    })

RESEARCH_TOOLS = [web_search, rag_retrieve, fetch_data, save_note]
Each tool is a decorated async function with a clear docstring that the LLM reads to understand when and how to use it. The web_search tool uses Tavily (a search API designed for LLM agents). The rag_retrieve tool calls back into the Document Portal's API from Capstone I, demonstrating how the two capstones connect. The fetch_data tool provides controlled access to public APIs with a size limit to prevent context overflow.
The Research agent wraps these tools in a ReAct loop with budget controls:
class ResearchAgent:
    def __init__(self, tools, max_steps: int = 10):
        self.client = AsyncOpenAI()
        self.tools = {t.name: t for t in tools}
        self.max_steps = max_steps
        self.tool_schemas = [
            {
                "type": "function",
                "function": {
                    "name": t.name,
                    "description": t.description,
                    "parameters": t.args_schema.schema()
                }
            }
            for t in tools
        ]

    async def research_question(
        self, question: str, context: str = ""
    ) -> list[dict]:
        messages = [{
            "role": "system",
            "content": (
                "You are a thorough research agent. For the "
                "given question, use the available tools to gather "
                "comprehensive information. Save each distinct "
                "finding as a note with proper source citations. "
                "Stop when you have sufficient evidence to answer "
                "the question thoroughly."
            )
        }, {
            "role": "user",
            "content": f"Context: {context}\n\nQuestion: {question}"
        }]
        notes = []
        for step in range(self.max_steps):
            resp = await self.client.chat.completions.create(
                model="gpt-4o", messages=messages,
                tools=self.tool_schemas, tool_choice="auto"
            )
            msg = resp.choices[0].message
            messages.append(msg)
            if not msg.tool_calls:
                break  # Agent decided it has enough info
            for tc in msg.tool_calls:
                tool_fn = self.tools[tc.function.name]
                args = json.loads(tc.function.arguments)
                result = await tool_fn.ainvoke(args)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": str(result)
                })
                # Capture saved notes
                if tc.function.name == "save_note":
                    notes.append(json.loads(result))
        return notes
The research agent loops through a maximum of 10 steps, calling tools as needed. When it calls save_note, the structured finding is captured and returned. The agent autonomously decides which tools to call based on the question — it might search the web first, find a relevant API endpoint mentioned in the results, fetch data from that API, and then save a synthesized note. This is the ReAct pattern in action: Reason (what do I need?), Act (call a tool), Observe (process results), repeat.
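The `max_steps` cap bounds the number of reasoning turns, but it does not bound wall-clock time. A per-task timeout can be layered on top with asyncio; the sketch below is an illustration (returning an empty note list on timeout is an assumed fallback convention, not something the agent code above does):

```python
import asyncio

async def bounded_research(coro, timeout_s: float = 120.0):
    # Hard wall-clock bound for one research task; on timeout we stop
    # and return an empty note list rather than let the task run on.
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return []

async def fake_task():
    await asyncio.sleep(0.01)
    return [{"title": "finding"}]

async def runaway_task():
    await asyncio.sleep(60)
    return [{"title": "too late"}]

notes = asyncio.run(bounded_research(fake_task(), timeout_s=1.0))
empty = asyncio.run(bounded_research(runaway_task(), timeout_s=0.05))
```

In the pipeline you would wrap `agent.research_question(...)` this way, so a single slow question cannot stall the whole report.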
LangGraph Orchestration
Plain Language
LangGraph is the conductor of our multi-agent orchestra. Just as a conductor doesn't play any instruments but ensures every musician plays at the right time and in the right order, LangGraph doesn't do any research or writing itself but ensures each agent runs at the right stage with the right inputs.
The orchestration defines a graph where each node is an agent and each edge is a transition. The graph starts at the Planner node, moves to a human approval checkpoint, then proceeds through Research, Synthesis, Writing, and Review. The review node has a conditional edge — if the review passes, the workflow ends; if it fails, it routes back to the Synthesizer for revision. This conditional routing is what makes the system self-correcting.
One of LangGraph's most powerful features is state persistence. The entire state can be saved to a database (or even just a JSON file) at any point, and the workflow can be resumed later. This is essential for the human-in-the-loop checkpoint — when the system pauses for human approval, it serializes its state, and when the human approves, it deserializes and continues exactly where it left off. It also enables debugging: if something goes wrong at the writing stage, you can inspect the state to see exactly what research was gathered and what analysis was produced.
Deep Dive
Here's the complete LangGraph workflow that orchestrates the five agents:
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

# --- Node Functions ---

async def plan_node(state: ReportState) -> dict:
    """Break topic into research questions."""
    client = AsyncOpenAI()
    resp = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": (
                "Create a research plan. Output JSON with keys: "
                "'title', 'sections' (list of section objects "
                "with 'heading' and 'questions' list), "
                "'estimated_sources' (int)."
            )
        }, {
            "role": "user",
            "content": f"Topic: {state['topic']}"
        }],
        response_format={"type": "json_object"}
    )
    plan = json.loads(resp.choices[0].message.content)
    return {"plan": plan, "budget_used": state["budget_used"] + 0.03}
async def research_node(state: ReportState) -> dict:
    """Execute research for each planned question."""
    agent = ResearchAgent(RESEARCH_TOOLS, max_steps=8)
    all_notes = []
    for section in state["plan"]["sections"]:
        for question in section["questions"]:
            notes = await agent.research_question(
                question,
                context=f"Report: {state['plan']['title']}, "
                        f"Section: {section['heading']}"
            )
            all_notes.extend(notes)
    return {
        "research_notes": all_notes,
        "budget_used": state["budget_used"] + len(all_notes) * 0.05
    }
async def synthesize_node(state: ReportState) -> dict:
    """Analyze research notes into structured findings."""
    client = AsyncOpenAI()
    notes_text = "\n\n".join(
        f"### {n['title']}\n{n['content']}\nSources: {', '.join(n['sources'])}"
        for n in state["research_notes"]
    )
    feedback = ""
    if state.get("review_feedback"):
        feedback = f"\n\nPrevious review feedback to address:\n{state['review_feedback']}"
    resp = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": (
                "Synthesize research notes into a structured "
                "analysis. Identify key themes, contradictions, "
                "and gaps. Organize by the report's planned "
                "sections. Include all source citations."
            )
        }, {
            "role": "user",
            "content": f"Plan: {json.dumps(state['plan'])}\n\n"
                       f"Research Notes:\n{notes_text}{feedback}"
        }],
        max_tokens=4000
    )
    return {"analysis": resp.choices[0].message.content}
async def write_node(state: ReportState) -> dict:
    """Generate the final Markdown report."""
    client = AsyncOpenAI()
    resp = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": (
                "Write a professional report in Markdown. Include: "
                "title, executive summary, table of contents, "
                "detailed sections with citations, and a conclusion. "
                "Use [1], [2], etc. for citations and include a "
                "references section at the end."
            )
        }, {
            "role": "user",
            "content": f"Report plan:\n{json.dumps(state['plan'])}\n\n"
                       f"Analysis:\n{state['analysis']}"
        }],
        max_tokens=8000
    )
    return {"report": resp.choices[0].message.content}
async def review_node(state: ReportState) -> dict:
    """Review the report for quality."""
    client = AsyncOpenAI()
    resp = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": (
                "Review this report for: factual accuracy, "
                "proper citations, logical flow, completeness, "
                "and professional quality. Output JSON with "
                "'passed' (bool), 'score' (1-10), and "
                "'feedback' (string with specific issues)."
            )
        }, {
            "role": "user",
            "content": state["report"]
        }],
        response_format={"type": "json_object"}
    )
    review = json.loads(resp.choices[0].message.content)
    return {
        "review_passed": review["passed"],
        "review_feedback": review["feedback"],
        "revision_count": state["revision_count"] + 1
    }
# --- Conditional Edge ---

def should_revise(state: ReportState) -> str:
    if state["review_passed"]:
        return "end"
    if state["revision_count"] >= 3:
        return "end"  # Max revisions reached
    return "revise"

# --- Build Graph ---

graph = StateGraph(ReportState)
graph.add_node("planner", plan_node)
graph.add_node("researcher", research_node)
graph.add_node("synthesizer", synthesize_node)
graph.add_node("writer", write_node)
graph.add_node("reviewer", review_node)

graph.set_entry_point("planner")
graph.add_edge("planner", "researcher")
graph.add_edge("researcher", "synthesizer")
graph.add_edge("synthesizer", "writer")
graph.add_edge("writer", "reviewer")
graph.add_conditional_edges("reviewer", should_revise, {
    "revise": "synthesizer",
    "end": END
})

checkpointer = MemorySaver()
app = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["researcher"]  # HITL: pause for plan approval
)
The interrupt_before=["researcher"] is the key to human-in-the-loop. When the graph reaches the research node, it pauses execution and saves state. The controlling application can then display the research plan to the user, collect their approval (or edits), update the state, and resume execution. This prevents the system from spending API credits on research before the user has validated the approach.
To execute the compiled graph:

result = await app.ainvoke(
    {"topic": "AI in Healthcare 2025", "budget_used": 0,
     "revision_count": 0, "errors": [], "research_notes": []},
    config={"configurable": {"thread_id": "report-1"}}
)

The thread_id enables state persistence: you can resume the graph later by invoking it again with the same thread_id.
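Under the hood, a checkpointer's job is simple to state: serialize the accumulated state at a pause point, deserialize it to resume. A file-based miniature of the idea (MemorySaver keeps checkpoints in memory; production setups typically use a database):

```python
import json, os, tempfile

def checkpoint(state: dict, path: str) -> None:
    # A pause point is just a write of the JSON-serializable state.
    with open(path, "w") as f:
        json.dump(state, f)

def resume(path: str) -> dict:
    # Resuming is just a read: the graph picks up exactly where it stopped.
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "report-1.json")
checkpoint({"topic": "AI in Healthcare 2025", "plan_approved": False,
            "budget_used": 0.42}, path)
state = resume(path)
```

This is also why the HITL pause is cheap: nothing stays running while the human deliberates, only the serialized state persists.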
Report Generation
Plain Language
The report generation stage transforms raw analysis into a polished, professional document. Think of the difference between a researcher's messy notes and the final published paper — the Writer agent bridges that gap. It takes the structured analysis from the Synthesizer, follows the outline from the Planner, and produces a Markdown document with proper formatting, citations, and logical flow.
The generated report follows a standard structure: an executive summary that gives busy readers the key findings in two paragraphs, a table of contents for navigation, detailed sections that present the analysis with evidence and citations, and a conclusion that ties everything together with actionable recommendations. Each claim in the report is backed by a numbered citation that traces back to the original source found during research.
Beyond text generation, the system can also produce supplementary outputs. A PDF export converts the Markdown to a formatted PDF using a library like WeasyPrint or markdown-pdf. A key metrics summary extracts quantitative findings into a structured format. And a source bibliography compiles all referenced sources into a properly formatted reference list.
Deep Dive
The Writer agent uses structured output to ensure report consistency, and we add post-processing to generate multiple output formats:
from pydantic import BaseModel, Field
import markdown, weasyprint
from pathlib import Path

class ReportSection(BaseModel):
    heading: str
    content: str  # Markdown content
    key_findings: list[str]

class StructuredReport(BaseModel):
    title: str
    executive_summary: str
    sections: list[ReportSection]
    conclusion: str
    references: list[str]
    generated_at: str

class ReportFormatter:
    def to_markdown(self, report: StructuredReport) -> str:
        parts = [
            f"# {report.title}\n",
            f"*Generated: {report.generated_at}*\n",
            f"## Executive Summary\n\n{report.executive_summary}\n",
            "## Table of Contents\n",
        ]
        for i, sec in enumerate(report.sections, 1):
            parts.append(f"{i}. [{sec.heading}](#section-{i})")
        parts.append("")
        for i, sec in enumerate(report.sections, 1):
            parts.append(f"## {i}. {sec.heading} {{#section-{i}}}\n")
            parts.append(sec.content + "\n")
            if sec.key_findings:
                parts.append("**Key Findings:**")
                for f in sec.key_findings:
                    parts.append(f"- {f}")
                parts.append("")
        parts.append(f"## Conclusion\n\n{report.conclusion}\n")
        parts.append("## References\n")
        for i, ref in enumerate(report.references, 1):
            parts.append(f"[{i}] {ref}")
        return "\n".join(parts)

    def to_html(self, md_content: str) -> str:
        # "attr_list" is needed so the {#section-N} heading IDs become
        # real anchors for the table-of-contents links
        html_body = markdown.markdown(
            md_content, extensions=["tables", "toc", "attr_list"]
        )
        return f"""<!DOCTYPE html>
<html><head>
<style>
body {{ font-family: Georgia, serif; max-width: 800px;
        margin: 40px auto; padding: 0 20px; line-height: 1.8; }}
h1 {{ color: #1a1a2e; border-bottom: 2px solid #f59e0b; }}
h2 {{ color: #16213e; margin-top: 2em; }}
code {{ background: #f5f5f5; padding: 2px 6px; border-radius: 3px; }}
blockquote {{ border-left: 3px solid #f59e0b; padding-left: 1em;
              color: #555; }}
</style>
</head><body>{html_body}</body></html>"""

    def to_pdf(self, html_content: str, output_path: str):
        weasyprint.HTML(string=html_content).write_pdf(output_path)

# Usage in the pipeline
formatter = ReportFormatter()
md = formatter.to_markdown(structured_report)
html = formatter.to_html(md)
formatter.to_pdf(html, "report.pdf")
Path("report.md").write_text(md)
Path("report.html").write_text(html)
The StructuredReport Pydantic model ensures the LLM output conforms to a strict schema — each section must have a heading, content, and key findings. The ReportFormatter then converts this structured data into multiple output formats: Markdown for version control and editing, HTML for web display, and PDF for sharing. WeasyPrint handles the HTML-to-PDF conversion with CSS styling, producing professional-looking documents without requiring LaTeX or external tools.
For the Writer agent to produce structured output, we use OpenAI's structured output mode:
async def write_structured_report(state: ReportState) -> dict:
    client = AsyncOpenAI()
    resp = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": (
                "Write a comprehensive, professional report. "
                "Use numbered citations [1], [2] etc. Each "
                "section should be 3-5 paragraphs with data "
                "and evidence from the analysis."
            )
        }, {
            "role": "user",
            "content": f"Plan:\n{json.dumps(state['plan'])}\n\n"
                       f"Analysis:\n{state['analysis']}"
        }],
        response_format=StructuredReport
    )
    report = resp.choices[0].message.parsed
    formatter = ReportFormatter()
    return {"report": formatter.to_markdown(report)}
Using response_format=StructuredReport constrains the model's output to your Pydantic schema, so message.parsed is already a validated StructuredReport object. This removes most JSON parsing error handling (you should still check the message's refusal field and handle token-limit truncation) and ensures every report has the required sections, key findings, and references.
Guardrails & Human-in-the-Loop
Plain Language
Guardrails are the safety systems that keep our autonomous agent from going off the rails. Without guardrails, an agent asked to "research competitive pricing strategies" might accidentally generate content that's legally problematic, spend hundreds of dollars in API calls, or produce a report that hallucinates data and cites non-existent sources. Guardrails prevent all of these scenarios.
We implement three categories of guardrails. Budget guardrails track and limit API spending — the system has a maximum budget per report (say, $5.00), and if the accumulated cost approaches that limit, the agent skips optional research steps and moves directly to synthesis. Content guardrails scan agent outputs for prohibited content (legal advice, medical recommendations, personally identifiable information) and either redact it or flag it for human review. Quality guardrails check that citations are valid, that the report stays on-topic, and that the agent hasn't hallucinated statistics or quotes.
Human-in-the-loop (HITL) is the ultimate guardrail — putting a human in the decision loop at critical points. In our system, the human reviews the research plan before the agent spends time and money executing it. They can modify the plan, add questions the agent missed, or reject it entirely. This single checkpoint prevents the most expensive category of errors: researching the wrong thing.
Deep Dive
Let's implement the guardrails system with budget tracking, content filtering, and the HITL checkpoint:
from dataclasses import dataclass
import re

@dataclass
class BudgetGuard:
    max_budget: float = 5.0      # dollars
    warn_threshold: float = 0.8  # warn at 80%

    def check(self, state: ReportState) -> dict:
        used = state["budget_used"]
        remaining = self.max_budget - used
        if used >= self.max_budget:
            return {"allowed": False, "reason": "Budget exhausted",
                    "action": "skip_to_synthesis"}
        if used >= self.max_budget * self.warn_threshold:
            return {"allowed": True,
                    "warning": f"Budget ${remaining:.2f} remaining",
                    "action": "reduce_scope"}
        return {"allowed": True}

class ContentGuard:
    PROHIBITED = [
        (r"(?i)this is not (legal|medical) advice", "disclaimer_needed"),
        (r"\b\d{3}-\d{2}-\d{4}\b", "ssn_detected"),
        (r"(?i)(guaranteed|certainly will|100% effective)", "overconfident_claim"),
    ]

    def scan(self, text: str) -> list[dict]:
        issues = []
        for pattern, issue_type in self.PROHIBITED:
            matches = re.findall(pattern, text)
            if matches:
                issues.append({
                    "type": issue_type,
                    "matches": matches[:3],
                    "severity": "high" if "ssn" in issue_type else "medium"
                })
        return issues

class CitationGuard:
    def verify(self, report: str, sources: list[dict]) -> dict:
        # Find all [N] citations in the report
        cited = set(re.findall(r"\[(\d+)\]", report))
        available = set(str(i) for i in range(1, len(sources) + 1))
        invalid = cited - available
        unused = available - cited
        return {
            "valid": len(invalid) == 0,
            "invalid_citations": list(invalid),
            "unused_sources": list(unused),
            "citation_coverage": len(cited) / max(len(available), 1)
        }

# --- Guardrailed node wrapper ---

def with_guardrails(node_fn, budget_guard, content_guard):
    async def wrapped(state):
        # Pre-check: budget
        budget = budget_guard.check(state)
        if not budget["allowed"]:
            return {"errors": state["errors"] + [budget["reason"]]}
        # Execute node
        result = await node_fn(state)
        # Post-check: accumulate content issues across all scanned fields
        new_errors = []
        for key in ["report", "analysis"]:
            if key in result:
                issues = content_guard.scan(result[key])
                if any(i["severity"] == "high" for i in issues):
                    new_errors += [f"Content issue: {i['type']}" for i in issues]
        if new_errors:
            result["errors"] = state.get("errors", []) + new_errors
        return result
    return wrapped
The with_guardrails wrapper can be applied to any node function, adding budget pre-checks and content post-checks transparently. The BudgetGuard has three states: allowed (proceed normally), warning (reduce scope), and blocked (skip to synthesis). The ContentGuard scans for PII patterns, overconfident claims, and other prohibited content. The CitationGuard verifies that every citation in the report references a real source and flags any "phantom citations."
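To see the phantom-citation check in isolation, here is the same logic as a standalone function with a worked example (the `[4]` citation below deliberately points at a source that does not exist):

```python
import re

def verify_citations(report: str, num_sources: int) -> dict:
    # Same idea as CitationGuard.verify: every [N] in the report
    # must map onto a real source; unused sources are surfaced too.
    cited = set(re.findall(r"\[(\d+)\]", report))
    available = {str(i) for i in range(1, num_sources + 1)}
    return {
        "valid": not (cited - available),
        "invalid_citations": sorted(cited - available),
        "unused_sources": sorted(available - cited),
    }

check = verify_citations(
    "AI diagnostics grew rapidly [1][2]. Costs fell [4].", num_sources=3
)
# [4] is a phantom citation; source 3 was never cited.
assert check == {"valid": False, "invalid_citations": ["4"],
                 "unused_sources": ["3"]}
```

A failing check like this is exactly what gets routed into the Reviewer's feedback, forcing the Synthesizer loop to repair or drop the unsupported claim.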
For the human-in-the-loop checkpoint, here's how the controlling application manages the interruption:
async def run_report_agent(topic: str) -> str:
    config = {"configurable": {"thread_id": "report-1"}}
    initial = {
        "topic": topic, "budget_used": 0.0,
        "revision_count": 0, "errors": [],
        "research_notes": [], "plan_approved": False
    }
    # Phase 1: Run until HITL interrupt (after planner)
    state = await app.ainvoke(initial, config)
    # Display plan to user
    print("=== Research Plan ===")
    print(json.dumps(state["plan"], indent=2))
    # Get human approval
    approval = input("Approve plan? (yes/no/edit): ")
    if approval.lower() == "no":
        return "Plan rejected by user."
    updates = {"plan_approved": True}
    if approval.lower() == "edit":
        edits = input("Enter modifications: ")
        # Attach the user's edits to the persisted plan so downstream
        # agents see them; a local mutation would be lost on resume
        plan = state["plan"]
        plan["user_modifications"] = edits
        updates["plan"] = plan
    await app.aupdate_state(config, updates)
    # Phase 2: Resume execution (research -> synthesis -> write -> review)
    final_state = await app.ainvoke(None, config)
    return final_state["report"]
Never deploy an autonomous agent without budget limits and human checkpoints. A research agent with access to paid APIs (web search, LLM calls) can easily spend $50+ on a single report if left unchecked. The budget guard and HITL approval gate are not optional — they're essential production safeguards.
Evaluation & Deployment
Plain Language
Before deploying our report agent, we need to know if it actually produces good reports. Evaluation for an agentic system is more complex than evaluating a single LLM call because we need to assess quality at multiple levels: Did the planner create a good research plan? Did the researcher find relevant sources? Did the synthesizer correctly interpret the findings? Did the writer produce a well-structured report? Did the reviewer catch actual quality issues?
We evaluate using a combination of automated metrics and LLM-as-judge assessments. Automated metrics include citation coverage (what percentage of claims have citations), factual consistency (do the report's claims match the research notes), and structural completeness (does the report have all required sections). LLM-as-judge uses a separate LLM instance to score the report on dimensions like clarity, depth, accuracy, and actionability.
For deployment, we wrap the entire agent pipeline in a FastAPI service with a WebSocket endpoint for real-time progress updates. Users submit a topic, receive live updates as the agent progresses through each stage (planning, researching, synthesizing, writing, reviewing), and get the final report delivered as a downloadable file. The system is containerized with Docker and deployed alongside the Document Portal on ECS, sharing the same vector store for document-grounded research.
Deep Dive
The evaluation framework scores reports across multiple dimensions:
class ReportEvaluator:
    def __init__(self):
        self.client = AsyncOpenAI()

    async def evaluate(self, state: ReportState) -> dict:
        scores = {}
        # 1. Structural completeness
        report = state["report"]
        scores["has_executive_summary"] = "executive summary" in report.lower()
        scores["has_conclusion"] = "conclusion" in report.lower()
        scores["has_references"] = "references" in report.lower()
        scores["word_count"] = len(report.split())
        # 2. Citation coverage
        citations = re.findall(r"\[\d+\]", report)
        paragraphs = [p for p in report.split("\n\n") if len(p) > 100]
        cited_paras = [p for p in paragraphs if re.search(r"\[\d+\]", p)]
        scores["citation_density"] = len(citations) / max(len(paragraphs), 1)
        scores["cited_para_ratio"] = len(cited_paras) / max(len(paragraphs), 1)
        # 3. LLM-as-judge (multi-dimension scoring)
        resp = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": (
                    "Score this report 1-10 on each dimension. "
                    "Output JSON: {clarity, depth, accuracy, "
                    "actionability, citations, overall, feedback}"
                )
            }, {
                "role": "user",
                "content": f"Topic: {state['topic']}\n\n{report}"
            }],
            response_format={"type": "json_object"}
        )
        llm_scores = json.loads(resp.choices[0].message.content)
        scores["llm_eval"] = llm_scores
        # 4. Budget efficiency
        scores["cost"] = state["budget_used"]
        scores["cost_per_word"] = state["budget_used"] / max(scores["word_count"], 1)
        scores["revisions"] = state["revision_count"]
        return scores
The evaluator combines structural checks (does the report have required sections?), statistical metrics (citation density, word count), LLM-based quality scoring (clarity, depth, accuracy), and efficiency metrics (cost per word, revision count). This multi-faceted evaluation gives you a comprehensive view of agent performance and helps you identify which stages need improvement.
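In practice these per-dimension scores are folded into a single release gate before a report ships. A sketch of such a gate; the threshold values are assumptions for illustration, not outputs of the evaluator:

```python
def quality_gate(scores: dict,
                 min_overall: float = 7.0,
                 min_cited_ratio: float = 0.8) -> bool:
    # Release the report only if structure, citation coverage, and the
    # LLM-judge overall score all clear their (assumed) thresholds.
    return (
        bool(scores.get("has_references"))
        and scores.get("cited_para_ratio", 0.0) >= min_cited_ratio
        and scores.get("llm_eval", {}).get("overall", 0) >= min_overall
    )

ok = quality_gate({"has_references": True, "cited_para_ratio": 0.9,
                   "llm_eval": {"overall": 8}})
assert ok is True
# Weak citation coverage fails the gate even with a high judge score
assert quality_gate({"has_references": True, "cited_para_ratio": 0.5,
                     "llm_eval": {"overall": 9}}) is False
```

A failed gate can route the report back into the revision loop or flag it for human review, the same escape hatch used when the revision cap is hit.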
Finally, the deployment wraps everything in a FastAPI service with real-time progress via WebSockets:
from fastapi import FastAPI, WebSocket
import asyncio, json

report_app = FastAPI(title="Report Agent API")

@report_app.websocket("/ws/generate")
async def generate_report(ws: WebSocket):
    await ws.accept()
    data = await ws.receive_json()
    topic = data["topic"]
    config = {"configurable": {"thread_id": data.get("id", "ws-1")}}

    async def send_status(stage, detail=""):
        await ws.send_json({
            "type": "status", "stage": stage,
            "detail": detail
        })

    await send_status("planning", "Creating research plan...")
    # Run to HITL interrupt
    state = await app.ainvoke(
        {"topic": topic, "budget_used": 0,
         "revision_count": 0, "errors": [],
         "research_notes": []},
        config
    )
    # Send plan for approval
    await ws.send_json({
        "type": "approval_needed",
        "plan": state["plan"]
    })
    # Wait for approval
    approval = await ws.receive_json()
    if not approval.get("approved"):
        await ws.send_json({"type": "cancelled"})
        return
    # Resume with progress updates
    await send_status("researching", "Gathering information...")
    await app.aupdate_state(config, {"plan_approved": True})
    final = await app.ainvoke(None, config)
    # Evaluate and send results
    evaluator = ReportEvaluator()
    scores = await evaluator.evaluate(final)
    await ws.send_json({
        "type": "complete",
        "report": final["report"],
        "scores": scores,
        "cost": final["budget_used"]
    })
    await ws.close()
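On the client side, each message type from this endpoint needs its own handling. A minimal dispatcher for the protocol above (the display strings are illustrative, not part of the protocol):

```python
def handle_message(msg: dict) -> str:
    # One branch per message type the /ws/generate endpoint emits.
    kind = msg.get("type")
    if kind == "status":
        return f"[{msg['stage']}] {msg.get('detail', '')}"
    if kind == "approval_needed":
        return f"Plan ready for review: {msg['plan'].get('title', '(untitled)')}"
    if kind == "complete":
        return f"Report complete, cost ${msg['cost']:.2f}"
    if kind == "cancelled":
        return "Run cancelled"
    return f"Unknown message type: {kind}"

line = handle_message({"type": "status", "stage": "researching",
                       "detail": "Gathering information..."})
```

The `approval_needed` branch is where a real client would render the plan and send back `{"approved": true}` (or false) to resume or cancel the run.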
| Evaluation Dimension | Target Score | How to Improve |
|---|---|---|
| Clarity | 8+ | Better writer system prompt, add examples |
| Depth | 7+ | More research steps, broader tool set |
| Accuracy | 9+ | Stronger citation guardrails, fact-checking |
| Actionability | 7+ | Add "recommendations" section to prompt |
| Citation Coverage | >80% | Enforce citations in writer prompt |
| Cost Efficiency | <$3/report | Use GPT-4o-mini for planning, cache searches |
A strong capstone submission demonstrates mastery of the full stack: LLM API calls (planning, synthesis, writing, review), tool use (web search, RAG, data APIs), LangGraph orchestration (state machine, conditional edges, interrupts), guardrails (budget, content, citations), evaluation (multi-dimensional scoring), and deployment (FastAPI, Docker, WebSockets). Focus on making each component work reliably rather than adding extra features.
Interview Ready
How to Explain This in 2 Minutes
I built an autonomous multi-agent system that takes a research topic, breaks it into sub-questions, dispatches specialized agents to gather data from web search, APIs, and RAG sources, then synthesizes everything into a structured, cited report. The orchestration layer uses LangGraph to model the workflow as a state machine with conditional routing, retry logic, and human-in-the-loop approval gates. A dedicated review agent scores the draft on completeness, accuracy, and coherence before the final report is delivered, and guardrails enforce budget limits, content policies, and citation requirements throughout the pipeline.
Likely Interview Questions
| Question | What They're Really Asking |
|---|---|
| How do you coordinate multiple agents without them duplicating work? | Do you understand shared state, task decomposition, and deduplication in multi-agent systems? |
| What happens when one agent fails or returns low-quality data? | Can you design fault-tolerant orchestration with retries, fallbacks, and quality gates? |
| How did you decide which parts require human approval? | Do you understand the trade-off between full autonomy and human-in-the-loop control for safety and quality? |
| How do you ensure the final report is factually grounded? | Can you implement citation tracking, hallucination checks, and a review agent that validates claims? |
| How would you adapt this system for a different domain, like financial analysis? | Is your architecture modular enough to swap tools, prompts, and evaluation criteria without rewriting the orchestration? |
Model Answers
Multi-Agent Orchestration: The system uses LangGraph to define a directed graph where each node is a specialized agent: a Planner decomposes the topic into sub-queries, a Researcher gathers evidence using web search and RAG tools, a Writer synthesizes findings into report sections, and a Reviewer scores the draft. Edges are conditional so the Reviewer can loop the draft back to the Writer if quality thresholds are not met. Shared state carries accumulated evidence and section drafts so agents never duplicate work.
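The conditional edge after the Reviewer can be sketched as a plain routing function (key names and thresholds are illustrative); LangGraph would call this on the shared state to pick the next node:

```python
def route_after_review(state: dict, quality_threshold: float = 8.0,
                       max_revisions: int = 3) -> str:
    """Decide the next node after the Reviewer scores a draft.

    Loops back to the Writer only while quality is low AND revision
    budget remains, so the Reviewer-Writer cycle always terminates.
    """
    if state["review_score"] >= quality_threshold:
        return "finalize"
    if state["revision_count"] >= max_revisions:
        return "finalize"  # graceful fallback: ship the best effort
    return "write"
```

Registered as a conditional edge, this keeps the termination logic in one inspectable place instead of buried in a prompt.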
Tool Integration: Each agent has access to a curated tool set. The Researcher calls a web search API, a document retrieval endpoint, and optionally structured data APIs. Tool calls are wrapped in a retry-with-backoff pattern and results are cached so repeated queries for the same sub-topic are instant. A token budget guardrail tracks cumulative LLM spend and halts execution if the pipeline exceeds the configured limit.
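The retry-with-backoff and caching pattern can be sketched as follows (function names are hypothetical; a production cache key would include the tool name and a normalized query):

```python
import time

_cache: dict[str, object] = {}  # in-memory result cache, keyed by query

def call_tool_with_retry(key: str, tool_fn, attempts: int = 3,
                         base_delay: float = 0.5):
    """Call an external tool with exponential backoff on failure,
    caching successful results so repeated queries are instant."""
    if key in _cache:
        return _cache[key]
    for attempt in range(attempts):
        try:
            result = tool_fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the graph
            time.sleep(base_delay * (2 ** attempt))
        else:
            _cache[key] = result
            return result
```

The same wrapper can increment a shared budget counter per call, giving the budget guardrail a single choke point to monitor.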
Quality and Safety: The Reviewer agent uses a multi-dimensional rubric that scores completeness, factual grounding, coherence, and citation density. If any dimension falls below its threshold, the draft is routed back for revision with specific feedback. Human-in-the-loop interrupts are placed before final publication so a human can approve, edit, or reject the report. Content guardrails filter toxic or off-topic material at every generation step.
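A toy version of the per-step content check (the term sets are placeholders; a real system would call a moderation API or classifier rather than match substrings):

```python
def passes_guardrails(text: str, topic_keywords: set[str],
                      blocked_terms: set[str]) -> bool:
    """Reject text that mentions any blocked term, or that has
    drifted entirely off-topic (no topic keyword appears at all)."""
    lowered = text.lower()
    if any(term in lowered for term in blocked_terms):
        return False
    return any(kw in lowered for kw in topic_keywords)
```

Running this after every generation step means off-topic or policy-violating output is caught at the node that produced it, not in the final report.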
System Design Scenario
A consulting firm wants a system that automatically generates 20-page industry analysis reports overnight. The system must pull data from five different APIs, apply company-specific style guidelines, and produce reports in both PDF and slide-deck format. Describe how you would architect the agent graph, manage long-running execution across hours, handle partial failures (e.g., one API is down), enforce style consistency across sections written by different agents, and implement a human review dashboard where analysts can approve, annotate, or request revisions on individual sections before final assembly.
Common Mistakes
- No exit condition on agent loops: Allowing the Reviewer-Writer loop to run indefinitely burns tokens and can cause infinite cycles. Always set a maximum iteration count and a graceful fallback.
- Conflating orchestration with prompting: Trying to manage multi-step workflows purely through prompt chaining instead of an explicit state machine makes the system fragile and nearly impossible to debug or extend.
- Skipping citation validation: Generating a report without verifying that every claim traces back to a retrieved source produces authoritative-looking text that may be entirely hallucinated.
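The first mistake above is avoided by bounding the loop explicitly. A sketch (function names hypothetical) that caps iterations and falls back gracefully to the best draft seen so far:

```python
def revise_until_accepted(draft, review_fn, revise_fn,
                          threshold: float = 8.0, max_iters: int = 3):
    """Run the Reviewer-Writer loop with a hard iteration cap.

    Stops as soon as a draft clears the threshold; otherwise returns
    the best-scoring draft seen, so the loop never cycles forever.
    """
    best_draft, best_score = draft, review_fn(draft)
    for _ in range(max_iters):
        if best_score >= threshold:
            break
        candidate = revise_fn(best_draft)
        score = review_fn(candidate)
        if score > best_score:
            best_draft, best_score = candidate, score
    return best_draft, best_score
```

Keeping the cap and fallback in the orchestration layer (rather than hoping the prompt self-terminates) also addresses the second mistake: the workflow logic stays explicit and debuggable.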