LLM
Large Language Model. A neural network with billions of parameters trained on massive text corpora. Capable of text generation, reasoning, summarization, and code generation.
Embedding
A dense vector representation of text (or other data) in a continuous space. Semantically similar inputs produce vectors that are close together, enabling similarity search.
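Similarity between embeddings is typically measured with cosine similarity. A minimal sketch with hand-made toy vectors (real embedding models emit hundreds to thousands of dimensions; the values here are illustrative, not model outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: near 1.0 for similar directions, near 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 4-dimensional "embeddings" for illustration only.
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.85, 0.15, 0.05, 0.25]
invoice = [0.0, 0.1, 0.95, 0.1]

# "cat" and "kitten" point in similar directions; "invoice" does not.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```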
Vector Store
A database optimized for storing and querying high-dimensional embedding vectors. Examples include Chroma, Pinecone, Weaviate, Qdrant, and pgvector.
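The core add/query interface can be sketched as a brute-force in-memory store; the named products layer approximate-nearest-neighbor indexes, persistence, and filtering on top of this same idea. `TinyVectorStore` is a hypothetical illustration, not any product's API:

```python
import math

class TinyVectorStore:
    """Brute-force in-memory vector store: add (id, vector) pairs,
    query the top-k nearest by cosine similarity."""
    def __init__(self):
        self.items = []  # list of (id, vector)

    def add(self, item_id, vector):
        self.items.append((item_id, vector))

    def query(self, vector, k=3):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(x * x for x in b)))
        # Sort all stored vectors by similarity to the query (O(n) scan).
        scored = sorted(self.items, key=lambda it: cos(vector, it[1]), reverse=True)
        return [item_id for item_id, _ in scored[:k]]

store = TinyVectorStore()
store.add("animals", [0.9, 0.1, 0.0])
store.add("finance", [0.0, 0.1, 0.9])
store.add("pets", [0.8, 0.2, 0.1])
print(store.query([1.0, 0.0, 0.0], k=2))  # ['animals', 'pets']
```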
RAG
Retrieval-Augmented Generation. A pattern that retrieves relevant documents from a knowledge base and includes them in the LLM prompt to ground responses in factual data.
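The retrieve-then-prompt flow can be sketched as below. Keyword overlap stands in for the embedding similarity search a real system would use, and the prompt template is an illustrative assumption:

```python
def retrieve(query, documents, k=2):
    """Naive keyword-overlap retrieval (a stand-in for embedding similarity search)."""
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query, documents):
    """Assemble a grounded prompt: retrieved context first, then the user question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Reset your password from the account settings page.",
    "Invoices are emailed monthly.",
    "The password reset link expires after one hour.",
]
prompt = build_rag_prompt("how do I reset my password", docs)
```

The assembled prompt would then be sent to the LLM, whose answer is constrained to the retrieved context rather than its parametric memory.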
Agent
An LLM-powered system that can reason about tasks, make decisions, and take actions by calling external tools and APIs in a loop until a goal is achieved.
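The reason-act loop can be sketched as follows. `call_model` is a hypothetical stand-in for a real LLM call, and the action format (a dict with either a tool call or a final answer) is an assumed convention, not a specific framework's schema:

```python
def run_agent(goal, call_model, tools, max_steps=5):
    """Minimal agent loop: ask the model for the next action, execute tool
    calls, feed results back, and stop when the model returns a final answer."""
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = call_model(history)  # e.g. {"tool": "add", "args": [2, 3]} or {"final": "5"}
        if "final" in action:
            return action["final"]
        result = tools[action["tool"]](*action["args"])
        history.append(f"{action['tool']}{tuple(action['args'])} -> {result}")
    return None  # goal not reached within max_steps

# A scripted "model" for demonstration: call a tool once, then answer.
def scripted_model(history):
    if len(history) == 1:
        return {"tool": "add", "args": [2, 3]}
    return {"final": history[-1].split("-> ")[1]}

print(run_agent("add 2 and 3", scripted_model, {"add": lambda a, b: a + b}))  # 5
```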
Tool Use
The ability of an LLM to invoke external functions (APIs, databases, code interpreters) during generation. The model outputs structured tool calls that are executed by the runtime.
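A sketch of the runtime side of this contract, assuming the model emits its tool call as a JSON object with `name` and `arguments` keys (a common convention, though exact schemas vary by provider); `get_weather` is a hypothetical tool:

```python
import json

# Registry of callable tools the runtime exposes to the model.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",  # hypothetical tool
}

def execute_tool_call(raw_call):
    """Parse a structured tool call emitted by the model and execute it."""
    call = json.loads(raw_call)
    return TOOLS[call["name"]](**call["arguments"])

# A model output like this would be executed by the runtime:
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
print(execute_tool_call(model_output))  # Sunny in Oslo
```

The tool's return value is then appended to the conversation so the model can use it in its next generation step.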
Guardrails
Safety and quality checks applied to LLM inputs and outputs. Include content filters, PII detection, topic restriction, format validation, and factuality checks.
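A minimal sketch of such a check, assuming a regex-based email detector as the PII rule and a length cap as the format rule; production guardrail stacks combine trained classifiers, content filters, and schema validators rather than two regex checks:

```python
import re

def check_guardrails(text):
    """Return a list of violation labels for the given input or output text."""
    violations = []
    # Naive PII detection: flag anything that looks like an email address.
    if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text):
        violations.append("pii:email")
    # Simple format/size restriction.
    if len(text) > 10_000:
        violations.append("too_long")
    return violations

print(check_guardrails("contact me at alice@example.com"))  # ['pii:email']
```

A passing result (an empty list) lets the text through; any violation can trigger blocking, redaction, or a retry.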
Fine-Tuning
Adapting a pre-trained model to a specific domain or task by continuing training on a curated dataset. Improves performance on targeted use cases, ideally while preserving general capabilities.
LoRA
Low-Rank Adaptation. A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights, reducing compute and memory requirements significantly.
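The core idea is that the frozen weight matrix W is augmented by a trainable low-rank product B·A. A toy numeric sketch of the shapes and the parameter savings (real LoRA operates on large attention/MLP weight matrices with rank r well below their dimension):

```python
def matmul(A, B):
    """Plain list-of-lists matrix multiply."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

d, r = 4, 1                 # weight dimension and adapter rank (r << d)
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weight
B = [[0.1] for _ in range(d)]        # d x r trainable adapter
A = [[0.2, 0.0, 0.0, 0.0]]           # r x d trainable adapter

delta = matmul(B, A)                 # rank-r update B @ A
W_adapted = [[W[i][j] + delta[i][j] for j in range(d)] for i in range(d)]

# Training touches only the adapters: 2*d*r parameters instead of d*d.
print(2 * d * r, "adapter params vs", d * d, "full-rank params")  # 8 adapter params vs 16 full-rank params
```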
Inference
The process of generating outputs from a trained model given new inputs. In GenAI, this means producing text tokens autoregressively from a prompt.
Latency
The time between sending a request and receiving a response. In GenAI, measured as time-to-first-token (TTFT) and total generation time. Critical for user experience.
Throughput
The number of requests or tokens a system can process per unit time. Measured in requests/second or tokens/second. Key metric for production GenAI deployments.
Token
The basic unit of text processed by an LLM. In English, one token averages roughly 0.75 words (about four characters). Models have token limits for both input (context) and output (generation).
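That average yields a common back-of-envelope estimator; this character-count heuristic is an approximation, not a real tokenizer, which would split text into learned subword units (e.g. BPE):

```python
def estimate_tokens(text):
    """Rough heuristic: about 1 token per 4 characters of English text."""
    return max(1, len(text) // 4)

print(estimate_tokens("Hello, how are you today?"))  # 6
```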
Context Window
The maximum number of tokens an LLM can process in a single request (input + output combined). Ranges from 4K to 1M+ tokens depending on the model.
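Applications must keep conversations inside this limit, commonly by dropping the oldest messages first. A sketch under an assumed per-message token estimator (`count_tokens` is a placeholder, not a real tokenizer):

```python
def fit_to_context(messages, max_tokens, count_tokens=lambda m: len(m) // 4 + 1):
    """Keep the most recent messages that fit the token budget,
    dropping the oldest first."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

msgs = ["a" * 8, "b" * 8, "c" * 8]        # each costs 3 estimated tokens
print(fit_to_context(msgs, 7))            # only the two most recent fit
```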
Orchestration
The coordination and sequencing of multiple LLM calls, tool invocations, and data flows to accomplish complex tasks. Central to agent and multi-agent architectures.
System Prompt
Instructions provided to the LLM that define its role, behavior, constraints, and output format. Typically set once per conversation, persisting across all user messages.
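In chat-style APIs this is commonly expressed as the first entry in a list of role-tagged messages; the wording below is an illustrative example, not a prescribed format:

```python
# The system prompt leads the message list and stays fixed
# while user/assistant turns accumulate after it.
messages = [
    {"role": "system",
     "content": "You are a concise support assistant. Answer in one sentence."},
    {"role": "user",
     "content": "How do I reset my password?"},
]
```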
Chunking
Splitting documents into smaller pieces for embedding and retrieval. Strategies include fixed-size, sentence-based, semantic, and recursive splitting. Chunk size affects retrieval quality.
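The simplest strategy, fixed-size chunking with overlap, can be sketched as below; the overlap keeps context from being cut cleanly at chunk boundaries, while sentence-based and semantic strategies split on meaning instead:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks, each sharing `overlap`
    characters with its predecessor."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "".join(str(i % 10) for i in range(500))
print(len(chunk_text(doc)))  # 4
```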
Streaming
Delivering LLM output tokens incrementally as they are generated, rather than waiting for the complete response. Reduces perceived latency and improves user experience.
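In Python this pattern maps naturally onto a generator: the caller renders each piece as it arrives instead of blocking on the full response. A sketch where a word-by-word generator stands in for a streaming API's token deltas:

```python
import time

def stream_tokens(text, delay=0.0):
    """Yield output word by word, the way streaming APIs deliver
    incremental token deltas, instead of returning everything at once."""
    for word in text.split():
        time.sleep(delay)  # stands in for per-token generation time
        yield word + " "

# The caller can print or render each chunk as it arrives:
response = "".join(stream_tokens("Streaming reduces perceived latency"))
```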