🛠️ Reference Architectures

GenAI Reference Architectures

10 production-ready architecture patterns for generative AI applications. From simple chat APIs to enterprise multi-agent platforms — each with detailed diagrams, code, and companion notebooks.

Begin with Architecture 01 →
10 Architectures · 3 Tiers · 10 Notebooks · 🌐 Open Source
All Architectures
Each guide covers key concepts with hands-on examples.
Foundation Tier
ARCH 01 · FOUNDATION
Simple Chat API
Single LLM call with system prompt. Stateless request-response pattern — the simplest GenAI architecture.
System Prompt · Temperature · Streaming · Error Handling
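The stateless pattern fits in a few lines. This is a minimal sketch: `call_llm` is a hypothetical stub standing in for a real provider SDK call, so the example runs without a network.

```python
def call_llm(messages, temperature=0.7):
    """Hypothetical provider call; a real SDK (OpenAI, Anthropic, etc.) goes here.
    This stub echoes the last user message so the sketch is runnable offline."""
    return "Echo: " + messages[-1]["content"]

def chat(user_input, system_prompt="You are a helpful assistant."):
    # Stateless: every request carries its full message list; nothing is stored.
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]
    try:
        return call_llm(messages, temperature=0.2)
    except Exception:
        # Basic error handling: fall back to a safe canned reply.
        return "Sorry, something went wrong. Please try again."

print(chat("What is RAG?"))
```

Because no state is kept between calls, this pattern scales horizontally with no session affinity.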
ARCH 02 · FOUNDATION
Conversational Chatbot
Multi-turn chat with memory and session management. Maintains conversation context across turns.
Memory · Sessions · Window Buffer · Summary Memory
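A window buffer, the simplest memory strategy listed, can be sketched as follows. The class and method names are illustrative, not from a specific framework.

```python
from collections import deque

class WindowBufferMemory:
    """Keep only the last `window` turns; older turns are silently dropped.
    (Summary memory would instead compress dropped turns via an LLM call.)"""

    def __init__(self, window=3):
        self.turns = deque(maxlen=window)  # deque evicts oldest automatically

    def add(self, user_msg, assistant_msg):
        self.turns.append((user_msg, assistant_msg))

    def as_messages(self):
        # Flatten stored turns into the message-list shape LLM APIs expect.
        messages = []
        for user_msg, assistant_msg in self.turns:
            messages.append({"role": "user", "content": user_msg})
            messages.append({"role": "assistant", "content": assistant_msg})
        return messages

memory = WindowBufferMemory(window=2)
memory.add("Hi", "Hello!")
memory.add("What's RAG?", "Retrieval-augmented generation.")
memory.add("Thanks", "You're welcome!")
print(len(memory.as_messages()))  # 4 — only the last two turns survive
```

In a real service, one such buffer is kept per session ID so concurrent users never share context.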
ARCH 03 · FOUNDATION
RAG Pipeline
Retrieval-augmented generation with vector store and embeddings. Ground LLM responses in your own data.
Embeddings · Vector DB · Chunking · Reranking
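The retrieval half of the pipeline can be illustrated with a toy bag-of-words "embedding" and cosine similarity; a real system would call an embedding model and a vector database instead.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; stands in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    # Rank chunks by similarity to the query; a vector DB does this at scale.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks):
    # Ground the LLM by putting retrieved context directly into the prompt.
    context = "\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = ["Paris is the capital of France.",
        "The Eiffel Tower is in Paris.",
        "Tokyo is the capital of Japan."]
print(build_prompt("What is the capital of France?", docs))
```

The reranking step listed above would re-score these top-k results with a stronger model before prompting.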
Intermediate Tier
ARCH 04 · INTERMEDIATE
Document Processing Pipeline
Ingest PDFs and images, extract text, summarize, classify, and store structured output at scale.
PDF Parsing · OCR · Structured Output · Batch Processing
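A minimal sketch of the ingest-classify-store flow. Both stages are stubs here (`extract_text` stands in for a PDF parser or OCR engine, and the keyword classifier for an LLM call); the structured-record shape is the point.

```python
def extract_text(doc_bytes):
    """Stand-in for PDF parsing / OCR; here the 'document' is already text."""
    return doc_bytes.decode("utf-8")

def classify(text):
    # Toy keyword classifier; a real pipeline would prompt an LLM for a label.
    return "invoice" if "invoice" in text.lower() else "other"

def process(doc_bytes):
    # Each document becomes one structured record, ready to store or index.
    text = extract_text(doc_bytes)
    return {
        "classification": classify(text),
        "summary": text[:50],       # stand-in for an LLM-generated summary
        "char_count": len(text),
    }

record = process(b"Invoice #42: total due $100")
print(record["classification"])  # invoice
```

At scale, `process` runs per document inside a batch framework with retries and dead-letter handling.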
ARCH 05 · INTERMEDIATE
Multi-Model Router
Route requests to different models based on task complexity and cost. Optimize spend and latency across tiers.
Cost Optimization · Latency Tiers · Fallback Chains · A/B Testing
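A sketch of complexity-based routing with a fallback chain. The model names, tier thresholds, and the word-count complexity proxy are all illustrative assumptions.

```python
MODEL_TIERS = [
    # (model name, max complexity this tier should handle, relative cost)
    ("small-fast-model", 20, 1),
    ("mid-model", 60, 5),
    ("large-model", float("inf"), 25),
]

def estimate_complexity(prompt):
    """Crude proxy: word count. Real routers use classifiers or heuristics
    (code detection, reasoning keywords, expected output length)."""
    return len(prompt.split())

def route(prompt):
    complexity = estimate_complexity(prompt)
    for model, ceiling, _cost in MODEL_TIERS:
        if complexity <= ceiling:
            return model  # cheapest tier whose ceiling covers the request

def call_with_fallback(prompt, call_fn):
    """Fallback chain: if a tier errors, escalate to the next one up."""
    start = [m for m, _, _ in MODEL_TIERS].index(route(prompt))
    for model, _, _ in MODEL_TIERS[start:]:
        try:
            return call_fn(model, prompt)
        except Exception:
            continue  # escalate to the next tier
    raise RuntimeError("all tiers failed")

print(route("Translate 'hello' to French"))  # small-fast-model
print(route(" ".join(["word"] * 100)))       # large-model
```

A/B testing slots in naturally here: route a small percentage of traffic to a different tier table and compare quality and cost.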
ARCH 06 · INTERMEDIATE
Agentic Tool Use
LLM agents that can call external tools, APIs, and functions to take actions and retrieve live data.
Tool Calling · Function APIs · ReAct Loop · Sandboxing
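The core act/observe loop can be sketched with a stubbed model. `fake_model` is a hypothetical stand-in that emits one tool call and then answers from the result; in a real agent the LLM itself decides when and what to call.

```python
import json

# Tool registry: name -> callable. Real deployments sandbox these.
TOOLS = {"get_weather": lambda city: f"22°C and sunny in {city}"}

def fake_model(messages):
    """Hypothetical LLM stub. Requests the weather tool once, then answers
    using the tool result appended to the conversation."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"type": "answer", "content": f"The weather: {last['content']}"}
    return {"type": "tool_call", "name": "get_weather",
            "arguments": json.dumps({"city": "Paris"})}

def agent_loop(user_msg, max_steps=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):            # bounded ReAct-style loop
        reply = fake_model(messages)
        if reply["type"] == "answer":
            return reply["content"]
        args = json.loads(reply["arguments"])
        result = TOOLS[reply["name"]](**args)   # execute the tool call
        messages.append({"role": "tool", "content": result})
    return "Step limit reached."

print(agent_loop("What's the weather in Paris?"))
```

The `max_steps` bound matters in production: it is the simplest defense against an agent looping forever.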
ARCH 07 · INTERMEDIATE
Evaluation & Guardrails
Systematic evaluation pipelines and runtime guardrails to ensure quality, safety, and compliance.
Eval Metrics · Input Guardrails · Output Filters · Red Teaming
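An input guardrail for PII can be sketched with two illustrative regex patterns. Production systems use dedicated PII detectors and policy engines, not a pair of regexes.

```python
import re

# Illustrative patterns only; real detectors cover far more PII categories.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def check_input(text):
    """Return the list of PII types found; an empty list means the input passes."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

def redact(text):
    # Alternative to blocking: strip the PII and let the request continue.
    for name, pat in PII_PATTERNS.items():
        text = pat.sub(f"[{name.upper()} REDACTED]", text)
    return text

msg = "My SSN is 123-45-6789, email me at a@b.com"
print(check_input(msg))  # ['email', 'ssn']
print(redact(msg))
```

Output filters run the same way in the other direction, checking the model's response before it reaches the user.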
Advanced Tier
ARCH 08 · ADVANCED
Fine-Tuning & Serving
Fine-tune foundation models with LoRA/QLoRA and serve them efficiently with optimized inference pipelines.
LoRA · QLoRA · vLLM · Model Serving
ARCH 09 · ADVANCED
Multi-Agent Orchestration
Coordinate multiple specialized agents that collaborate, delegate, and compose results for complex tasks.
Agent Teams · Delegation · Orchestrator · Message Bus
ARCH 10 · ADVANCED
Production GenAI Platform
End-to-end enterprise platform combining all patterns with observability, auth, rate limiting, and CI/CD.
Platform · Observability · CI/CD · Enterprise
Glossary
LLM
Large Language Model. A neural network with billions of parameters trained on massive text corpora. Capable of text generation, reasoning, summarization, and code generation.
Embedding
A dense vector representation of text (or other data) in a continuous space. Semantically similar inputs produce vectors that are close together, enabling similarity search.
Vector Store
A database optimized for storing and querying high-dimensional embedding vectors. Examples include Chroma, Pinecone, Weaviate, Qdrant, and pgvector.
RAG
Retrieval-Augmented Generation. A pattern that retrieves relevant documents from a knowledge base and includes them in the LLM prompt to ground responses in factual data.
Agent
An LLM-powered system that can reason about tasks, make decisions, and take actions by calling external tools and APIs in a loop until a goal is achieved.
Tool Use
The ability of an LLM to invoke external functions (APIs, databases, code interpreters) during generation. The model outputs structured tool calls that are executed by the runtime.
Guardrails
Safety and quality checks applied to LLM inputs and outputs. Include content filters, PII detection, topic restriction, format validation, and factuality checks.
Fine-Tuning
Adapting a pre-trained model to a specific domain or task by continuing training on a curated dataset. Improves performance on targeted use cases while preserving general capabilities.
LoRA
Low-Rank Adaptation. A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights, reducing compute and memory requirements significantly.
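The savings are easy to quantify. For one weight matrix, LoRA trains two small adapters B (d × r) and A (r × d) instead of the full d × d matrix; the dimensions below (d = 4096, rank r = 8) are assumed for illustration.

```python
# Assumed sizes: a square weight matrix with d=4096, LoRA rank r=8
# (typical ranks are in the 4-64 range).
d, r = 4096, 8

full_params = d * d            # updating W directly
lora_params = d * r + r * d    # adapters B (d x r) and A (r x d); W stays frozen

print(full_params)                       # 16777216
print(lora_params)                       # 65536
print(full_params // lora_params)        # 256x fewer trainable parameters
```

The ratio d / (2r) explains why even very large models become fine-tunable on a single GPU.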
Inference
The process of generating outputs from a trained model given new inputs. In GenAI, this means producing text tokens autoregressively from a prompt.
Latency
The time between sending a request and receiving a response. In GenAI, measured as time-to-first-token (TTFT) and total generation time. Critical for user experience.
Throughput
The number of requests or tokens a system can process per unit time. Measured in requests/second or tokens/second. Key metric for production GenAI deployments.
Token
The basic unit of text processed by an LLM. Roughly 0.75 words on average. Models have token limits for both input (context) and output (generation).
Context Window
The maximum number of tokens an LLM can process in a single request (input + output combined). Ranges from 4K to 1M+ tokens depending on the model.
Orchestration
The coordination and sequencing of multiple LLM calls, tool invocations, and data flows to accomplish complex tasks. Central to agent and multi-agent architectures.
System Prompt
Instructions provided to the LLM that define its role, behavior, constraints, and output format. Typically set once per conversation, it persists across all user messages.
Chunking
Splitting documents into smaller pieces for embedding and retrieval. Strategies include fixed-size, sentence-based, semantic, and recursive splitting. Chunk size affects retrieval quality.
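Fixed-size chunking with overlap, the simplest of the listed strategies, as an illustrative sketch (character-based here; real pipelines often chunk by tokens or sentences):

```python
def chunk_fixed(text, size=40, overlap=10):
    """Split text into fixed-size chunks; each chunk repeats the last
    `overlap` characters of the previous one to preserve context at edges."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "Retrieval quality depends heavily on how documents are split into chunks."
chunks = chunk_fixed(doc, size=40, overlap=10)
print(len(chunks))   # 3
print(chunks[0])
```

The overlap is what keeps a sentence cut at a chunk boundary retrievable from at least one chunk.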
Streaming
Delivering LLM output tokens incrementally as they are generated, rather than waiting for the complete response. Reduces perceived latency and improves user experience.
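The incremental interface can be sketched with a plain Python generator standing in for the model's token stream; the names here are illustrative.

```python
def token_stream(text):
    """Stand-in for a model's incremental token stream."""
    for word in text.split():
        yield word

def render_streaming(stream):
    # A real client flushes each token to the UI the moment it arrives;
    # here we just collect them to show the incremental interface.
    pieces = []
    for token in stream:
        pieces.append(token)   # flush-to-screen point
    return " ".join(pieces)

print(render_streaming(token_stream("Tokens arrive one at a time")))
```

In HTTP terms this is typically delivered via server-sent events or a chunked response.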