🛠️ Reference Architectures

GenAI Reference Architectures

10 production-ready architecture patterns for generative AI applications. From simple chat APIs to enterprise multi-agent platforms — each with detailed diagrams, code, and companion notebooks.

Begin with Architecture 01 →
10 Architectures · 3 Tiers · 10 Notebooks · 🌐 Open Source
All Architectures
Each guide covers key concepts with hands-on examples.
Foundation Tier
ARCH 01 · FOUNDATION
Simple Chat API
Single LLM call with system prompt. Stateless request-response pattern — the simplest GenAI architecture.
System Prompt · Temperature · Streaming · Error Handling
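The stateless pattern fits in a few lines. This is a minimal sketch: `call_llm` is a hypothetical stub standing in for a real provider SDK call, so the example runs without a network.

```python
def call_llm(messages, temperature=0.7):
    """Hypothetical provider call; a real SDK (OpenAI, Anthropic, etc.) goes here.
    This stub echoes the last user message so the sketch is runnable offline."""
    return "Echo: " + messages[-1]["content"]

def chat(user_input, system_prompt="You are a helpful assistant."):
    # Stateless: every request carries its full message list; nothing is stored.
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]
    try:
        return call_llm(messages, temperature=0.2)
    except Exception:
        # Basic error handling: fall back to a safe canned reply.
        return "Sorry, something went wrong. Please try again."

print(chat("What is RAG?"))
```

Because no state is kept between calls, this pattern scales horizontally with no session affinity.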
ARCH 02 · FOUNDATION
Conversational Chatbot
Multi-turn chat with memory and session management. Maintains conversation context across turns.
Memory · Sessions · Window Buffer · Summary Memory
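A window buffer, the simplest memory strategy listed, can be sketched as follows. The class and method names are illustrative, not from a specific framework.

```python
from collections import deque

class WindowBufferMemory:
    """Keep only the last `window` turns; older turns are silently dropped.
    (Summary memory would instead compress dropped turns via an LLM call.)"""

    def __init__(self, window=3):
        self.turns = deque(maxlen=window)  # deque evicts oldest automatically

    def add(self, user_msg, assistant_msg):
        self.turns.append((user_msg, assistant_msg))

    def as_messages(self):
        # Flatten stored turns into the message-list shape LLM APIs expect.
        messages = []
        for user_msg, assistant_msg in self.turns:
            messages.append({"role": "user", "content": user_msg})
            messages.append({"role": "assistant", "content": assistant_msg})
        return messages

memory = WindowBufferMemory(window=2)
memory.add("Hi", "Hello!")
memory.add("What's RAG?", "Retrieval-augmented generation.")
memory.add("Thanks", "You're welcome!")
print(len(memory.as_messages()))  # 4 — only the last two turns survive
```

In a real service, one such buffer is kept per session ID so concurrent users never share context.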
ARCH 03 · FOUNDATION
RAG Pipeline
Retrieval-augmented generation with vector store and embeddings. Ground LLM responses in your own data.
Embeddings · Vector DB · Chunking · Reranking
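The retrieval half of the pipeline can be illustrated with a toy bag-of-words "embedding" and cosine similarity; a real system would call an embedding model and a vector database instead.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; stands in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    # Rank chunks by similarity to the query; a vector DB does this at scale.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks):
    # Ground the LLM by putting retrieved context directly into the prompt.
    context = "\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = ["Paris is the capital of France.",
        "The Eiffel Tower is in Paris.",
        "Tokyo is the capital of Japan."]
print(build_prompt("What is the capital of France?", docs))
```

The reranking step listed above would re-score these top-k results with a stronger model before prompting.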
Intermediate Tier
ARCH 04 · INTERMEDIATE
Document Processing Pipeline
Ingest PDFs and images, extract text, summarize, classify, and store structured output at scale.
PDF Parsing · OCR · Structured Output · Batch Processing
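A minimal sketch of the ingest-classify-store flow. Both stages are stubs here (`extract_text` stands in for a PDF parser or OCR engine, and the keyword classifier for an LLM call); the structured-record shape is the point.

```python
def extract_text(doc_bytes):
    """Stand-in for PDF parsing / OCR; here the 'document' is already text."""
    return doc_bytes.decode("utf-8")

def classify(text):
    # Toy keyword classifier; a real pipeline would prompt an LLM for a label.
    return "invoice" if "invoice" in text.lower() else "other"

def process(doc_bytes):
    # Each document becomes one structured record, ready to store or index.
    text = extract_text(doc_bytes)
    return {
        "classification": classify(text),
        "summary": text[:50],       # stand-in for an LLM-generated summary
        "char_count": len(text),
    }

record = process(b"Invoice #42: total due $100")
print(record["classification"])  # invoice
```

At scale, `process` runs per document inside a batch framework with retries and dead-letter handling.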
ARCH 05 · INTERMEDIATE
Multi-Model Router
Route requests to different models based on task complexity and cost. Optimize spend and latency across tiers.
Cost Optimization · Latency Tiers · Fallback Chains · A/B Testing
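A sketch of complexity-based routing with a fallback chain. The model names, tier thresholds, and the word-count complexity proxy are all illustrative assumptions.

```python
MODEL_TIERS = [
    # (model name, max complexity this tier should handle, relative cost)
    ("small-fast-model", 20, 1),
    ("mid-model", 60, 5),
    ("large-model", float("inf"), 25),
]

def estimate_complexity(prompt):
    """Crude proxy: word count. Real routers use classifiers or heuristics
    (code detection, reasoning keywords, expected output length)."""
    return len(prompt.split())

def route(prompt):
    complexity = estimate_complexity(prompt)
    for model, ceiling, _cost in MODEL_TIERS:
        if complexity <= ceiling:
            return model  # cheapest tier whose ceiling covers the request

def call_with_fallback(prompt, call_fn):
    """Fallback chain: if a tier errors, escalate to the next one up."""
    start = [m for m, _, _ in MODEL_TIERS].index(route(prompt))
    for model, _, _ in MODEL_TIERS[start:]:
        try:
            return call_fn(model, prompt)
        except Exception:
            continue  # escalate to the next tier
    raise RuntimeError("all tiers failed")

print(route("Translate 'hello' to French"))  # small-fast-model
print(route(" ".join(["word"] * 100)))       # large-model
```

A/B testing slots in naturally here: route a small percentage of traffic to a different tier table and compare quality and cost.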
ARCH 06 · INTERMEDIATE
Agentic Tool Use
LLM agents that can call external tools, APIs, and functions to take actions and retrieve live data.
Tool Calling · Function APIs · ReAct Loop · Sandboxing
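The core act/observe loop can be sketched with a stubbed model. `fake_model` is a hypothetical stand-in that emits one tool call and then answers from the result; in a real agent the LLM itself decides when and what to call.

```python
import json

# Tool registry: name -> callable. Real deployments sandbox these.
TOOLS = {"get_weather": lambda city: f"22°C and sunny in {city}"}

def fake_model(messages):
    """Hypothetical LLM stub. Requests the weather tool once, then answers
    using the tool result appended to the conversation."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"type": "answer", "content": f"The weather: {last['content']}"}
    return {"type": "tool_call", "name": "get_weather",
            "arguments": json.dumps({"city": "Paris"})}

def agent_loop(user_msg, max_steps=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):            # bounded ReAct-style loop
        reply = fake_model(messages)
        if reply["type"] == "answer":
            return reply["content"]
        args = json.loads(reply["arguments"])
        result = TOOLS[reply["name"]](**args)   # execute the tool call
        messages.append({"role": "tool", "content": result})
    return "Step limit reached."

print(agent_loop("What's the weather in Paris?"))
```

The `max_steps` bound matters in production: it is the simplest defense against an agent looping forever.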
ARCH 07 · INTERMEDIATE
Evaluation & Guardrails
Systematic evaluation pipelines and runtime guardrails to ensure quality, safety, and compliance.
Eval Metrics · Input Guardrails · Output Filters · Red Teaming
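An input guardrail for PII can be sketched with two illustrative regex patterns. Production systems use dedicated PII detectors and policy engines, not a pair of regexes.

```python
import re

# Illustrative patterns only; real detectors cover far more PII categories.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def check_input(text):
    """Return the list of PII types found; an empty list means the input passes."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

def redact(text):
    # Alternative to blocking: strip the PII and let the request continue.
    for name, pat in PII_PATTERNS.items():
        text = pat.sub(f"[{name.upper()} REDACTED]", text)
    return text

msg = "My SSN is 123-45-6789, email me at a@b.com"
print(check_input(msg))  # ['email', 'ssn']
print(redact(msg))
```

Output filters run the same way in the other direction, checking the model's response before it reaches the user.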
Advanced Tier
ARCH 08 · ADVANCED
Fine-Tuning & Serving
Fine-tune foundation models with LoRA/QLoRA and serve them efficiently with optimized inference pipelines.
LoRA · QLoRA · vLLM · Model Serving
ARCH 09 · ADVANCED
Multi-Agent Orchestration
Coordinate multiple specialized agents that collaborate, delegate, and compose results for complex tasks.
Agent Teams · Delegation · Orchestrator · Message Bus
ARCH 10 · ADVANCED
Production GenAI Platform
End-to-end enterprise platform combining all patterns with observability, auth, rate limiting, and CI/CD.
Platform · Observability · CI/CD · Enterprise
Glossary
LLM
Large Language Model. A neural network with billions of parameters trained on massive text corpora. Capable of text generation, reasoning, summarization, and code generation.
Embedding
A dense vector representation of text (or other data) in a continuous space. Semantically similar inputs produce vectors that are close together, enabling similarity search.
Vector Store
A database optimized for storing and querying high-dimensional embedding vectors. Examples include Chroma, Pinecone, Weaviate, Qdrant, and pgvector.
RAG
Retrieval-Augmented Generation. A pattern that retrieves relevant documents from a knowledge base and includes them in the LLM prompt to ground responses in factual data.
Agent
An LLM-powered system that can reason about tasks, make decisions, and take actions by calling external tools and APIs in a loop until a goal is achieved.
Tool Use
The ability of an LLM to invoke external functions (APIs, databases, code interpreters) during generation. The model outputs structured tool calls that are executed by the runtime.
Guardrails
Safety and quality checks applied to LLM inputs and outputs. Include content filters, PII detection, topic restriction, format validation, and factuality checks.
Fine-Tuning
Adapting a pre-trained model to a specific domain or task by continuing training on a curated dataset. Improves performance on targeted use cases while preserving general capabilities.
LoRA
Low-Rank Adaptation. A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights, reducing compute and memory requirements significantly.
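The savings are easy to quantify. For one weight matrix, LoRA trains two small adapters B (d × r) and A (r × d) instead of the full d × d matrix; the dimensions below (d = 4096, rank r = 8) are assumed for illustration.

```python
# Assumed sizes: a square weight matrix with d=4096, LoRA rank r=8
# (typical ranks are in the 4-64 range).
d, r = 4096, 8

full_params = d * d            # updating W directly
lora_params = d * r + r * d    # adapters B (d x r) and A (r x d); W stays frozen

print(full_params)                       # 16777216
print(lora_params)                       # 65536
print(full_params // lora_params)        # 256x fewer trainable parameters
```

The ratio d / (2r) explains why even very large models become fine-tunable on a single GPU.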
Inference
The process of generating outputs from a trained model given new inputs. In GenAI, this means producing text tokens autoregressively from a prompt.
Latency
The time between sending a request and receiving a response. In GenAI, measured as time-to-first-token (TTFT) and total generation time. Critical for user experience.
Throughput
The number of requests or tokens a system can process per unit time. Measured in requests/second or tokens/second. Key metric for production GenAI deployments.
Token
The basic unit of text processed by an LLM. Roughly 0.75 words on average. Models have token limits for both input (context) and output (generation).
Context Window
The maximum number of tokens an LLM can process in a single request (input + output combined). Ranges from 4K to 1M+ tokens depending on the model.
Orchestration
The coordination and sequencing of multiple LLM calls, tool invocations, and data flows to accomplish complex tasks. Central to agent and multi-agent architectures.
System Prompt
Instructions provided to the LLM that define its role, behavior, constraints, and output format. Typically set once per conversation, it persists across all user messages.
Chunking
Splitting documents into smaller pieces for embedding and retrieval. Strategies include fixed-size, sentence-based, semantic, and recursive splitting. Chunk size affects retrieval quality.
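Fixed-size chunking with overlap, the simplest of the listed strategies, as an illustrative sketch (character-based here; real pipelines often chunk by tokens or sentences):

```python
def chunk_fixed(text, size=40, overlap=10):
    """Split text into fixed-size chunks; each chunk repeats the last
    `overlap` characters of the previous one to preserve context at edges."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "Retrieval quality depends heavily on how documents are split into chunks."
chunks = chunk_fixed(doc, size=40, overlap=10)
print(len(chunks))   # 3
print(chunks[0])
```

The overlap is what keeps a sentence cut at a chunk boundary retrievable from at least one chunk.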
Streaming
Delivering LLM output tokens incrementally as they are generated, rather than waiting for the complete response. Reduces perceived latency and improves user experience.
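The incremental interface can be sketched with a plain Python generator standing in for the model's token stream; the names here are illustrative.

```python
def token_stream(text):
    """Stand-in for a model's incremental token stream."""
    for word in text.split():
        yield word

def render_streaming(stream):
    # A real client flushes each token to the UI the moment it arrives;
    # here we just collect them to show the incremental interface.
    pieces = []
    for token in stream:
        pieces.append(token)   # flush-to-screen point
    return " ".join(pieces)

print(render_streaming(token_stream("Tokens arrive one at a time")))
```

In HTTP terms this is typically delivered via server-sent events or a chunked response.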