Transformer
The neural network architecture behind all modern LLMs. Uses self-attention to process entire sequences in parallel, unlike RNNs, which process tokens one at a time.
Token
The atomic unit LLMs work with — roughly 0.75 words in English. "tokenization" splits text into tokens; models are billed per token and have a maximum context window measured in tokens.
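A rough word-based estimate is often good enough for budgeting. A minimal sketch (the `estimate_tokens` helper and the 0.75 words-per-token ratio are illustrative assumptions; exact counts depend on the model's tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Heuristic token estimate: ~1 token per 0.75 English words."""
    return round(len(text.split()) / 0.75)

# 9 words -> roughly 12 tokens
print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 12
```

For exact counts, use the provider's own tokenizer rather than a heuristic.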
Embedding
A vector (list of numbers) that represents the meaning of text. Semantically similar texts produce similar vectors. The foundation of all similarity search and RAG systems.
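Similarity between embeddings is usually measured with cosine similarity. A toy sketch with hand-made 3-dimensional vectors (real embedding models produce hundreds to thousands of dimensions; the vectors here are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy "embeddings": two related concepts and one unrelated one.
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.15, 0.05]
invoice = [0.0, 0.2, 0.95]
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```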
Self-Attention
The mechanism that lets each token in a sequence "attend" to every other token, learning relationships regardless of positional distance. The key innovation in transformers.
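The core computation is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. A minimal pure-Python sketch on toy 2-dimensional token vectors (real implementations are batched tensor ops with learned Q/K/V projections):

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)                             # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)           # how much this token attends to each token
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three toy token vectors; using the same X for Q, K and V = self-attention.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = self_attention(X, X, X)
```

Each output row is a weighted mix of all value vectors, which is exactly how a token picks up context from the whole sequence.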
Context Window
The maximum number of tokens an LLM can see at once (prompt + response combined). GPT-4o: 128k; Claude 3.5: 200k. Larger windows cost more and can reduce focus.
Temperature
Controls the randomness of LLM output by scaling logits before sampling. 0 = deterministic (greedy decoding); higher values flatten the token distribution, producing more varied text. Use 0 for classification and extraction tasks, 0.7–1.2 for creative writing. Most APIs cap temperature at 2.
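Mechanically, temperature divides the logits before the softmax. A sketch showing how a low temperature sharpens the distribution and a high one flattens it (the logits are made up for illustration):

```python
import math

def sample_distribution(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / temperature. Lower T sharpens the distribution,
    higher T flattens it; T = 0 collapses to greedy argmax."""
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = sample_distribution(logits, 0.2)      # nearly all mass on the top token
hot = sample_distribution(logits, 1.5)       # probability spread more evenly
```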
RAG
Retrieval-Augmented Generation. Giving an LLM access to relevant documents at query time rather than relying solely on its training data. The primary solution to hallucination in domain-specific applications.
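At its simplest, RAG is retrieve-then-prompt. A toy sketch where word overlap stands in for embedding similarity (the documents and helper names are invented; production systems retrieve via a vector database):

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query.
    Real systems use embedding similarity instead."""
    q_words = set(query.lower().split())
    return sorted(documents,
                  key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    """Stuff the top-k retrieved documents into the prompt as grounding context."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The refund window is 30 days from purchase.",
    "Our office is closed on public holidays.",
    "Refunds are issued to the original payment method.",
]
prompt = build_rag_prompt("What is the refund window?", docs)
```

The model then answers from the supplied context instead of guessing from its training data.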
Vector Database
A database optimised for storing and querying embeddings via approximate nearest-neighbour search. Key options: Pinecone (managed), Qdrant (open-source), ChromaDB (local), pgvector (PostgreSQL extension).
Fine-Tuning
Continuing to train a pre-trained model on your own data to specialise its knowledge or style. Much cheaper than training from scratch. Can be full (all parameters) or parameter-efficient (LoRA/QLoRA).
LoRA
Low-Rank Adaptation. A PEFT technique that freezes the original model weights and trains only small low-rank matrices inserted into each layer. Reduces trainable parameters by ~99% while preserving quality.
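The parameter saving follows directly from the matrix shapes: instead of updating a d×k weight matrix W, LoRA trains A (d×r) and B (r×k) with r much smaller than d and k. Quick arithmetic for a single 4096×4096 projection at rank r = 8 (the dimensions are an assumed, typical example):

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Full fine-tuning trains d*k weights; LoRA trains d*r + r*k instead."""
    full = d * k
    lora = d * r + r * k
    return full, lora

# One 4096 x 4096 attention projection with LoRA rank r = 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(f"trainable fraction: {lora / full:.4f}")  # 0.0039, i.e. ~99.6% fewer
```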
RLHF
Reinforcement Learning from Human Feedback. The technique used to align LLMs with human preferences. Humans rank outputs, a reward model learns from rankings, and the LLM is optimised against that reward model.
DPO
Direct Preference Optimization. A simpler alternative to RLHF that skips the reward model entirely, directly training the LLM on preference pairs. Widely used for instruction-following alignment.
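The per-pair DPO loss is -log σ(β(Δ_chosen − Δ_rejected)), where each Δ is the policy-vs-reference log-probability ratio for that completion. A sketch with invented log-probabilities (β = 0.1 is an assumed, commonly cited default):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio))."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1 / (1 + math.exp(-margin)))

# Policy already prefers the chosen answer -> small loss.
low = dpo_loss(-1.0, -3.0, -2.0, -2.0)
# Policy prefers the rejected answer -> larger loss, pushing it to flip.
high = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```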
Hallucination
When an LLM confidently generates factually incorrect information. Arises because LLMs generate statistically plausible text, not truth-checked facts. RAG and grounding are the primary mitigations.
Chain-of-Thought (CoT)
A prompting technique where you ask the model to reason step-by-step before answering. Dramatically improves accuracy on multi-step reasoning tasks. "Let's think step by step" is the simplest CoT trigger.
ReAct
Reason + Act. A prompting pattern for agents: the model alternates between Thought (reasoning), Action (tool call), and Observation (tool result) until it reaches a final answer.
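The Thought → Action → Observation cycle can be sketched as a loop with a pluggable model and a tool registry (the stub model, `lookup` tool, and return format here are all invented for illustration):

```python
def react_loop(question, model, tools, max_steps=5):
    """Minimal ReAct loop. `model` inspects the transcript and returns either
    ("action", tool_name, arg) or ("final", answer)."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(transcript)
        if step[0] == "final":
            return step[1]
        _, tool_name, arg = step
        observation = tools[tool_name](arg)        # Action -> Observation
        transcript.append(f"Action: {tool_name}({arg!r})")
        transcript.append(f"Observation: {observation}")
    return None                                    # gave up after max_steps

# Stub "model": looks something up once, then answers from the observation.
def stub_model(transcript):
    if not any(line.startswith("Observation") for line in transcript):
        return ("action", "lookup", "capital of France")
    return ("final", transcript[-1].removeprefix("Observation: "))

tools = {"lookup": lambda q: "Paris"}
print(react_loop("What is the capital of France?", stub_model, tools))  # Paris
```

A real agent would send the transcript to an LLM at each step; the loop structure stays the same.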
Agent
An LLM that can take actions — call tools, browse the web, write code, read files — in a loop until it completes a task. Differs from a chatbot by having autonomy and access to external capabilities.
Tool Use / Function Calling
A capability where the LLM generates a structured call to a pre-defined function (e.g. search_web, run_python, query_database) rather than prose. The function executes and its result is fed back to the model.
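On the application side this is a dispatcher: parse the model's structured call, run the matching function, return the result. A sketch with a toy registry (the tool names and JSON shape are invented; real provider APIs such as OpenAI's or Anthropic's define their own schemas):

```python
import json

# Registry of functions the model is allowed to call.
TOOLS = {
    "get_weather": lambda city: f"22C and sunny in {city}",
    "add": lambda a, b: a + b,
}

def execute_tool_call(call_json: str) -> str:
    """Execute a model-emitted structured call such as
    {"name": "get_weather", "arguments": {"city": "Lisbon"}}."""
    call = json.loads(call_json)
    result = TOOLS[call["name"]](**call["arguments"])
    return json.dumps({"result": result})  # fed back to the model as the tool result

print(execute_tool_call('{"name": "get_weather", "arguments": {"city": "Lisbon"}}'))
# {"result": "22C and sunny in Lisbon"}
```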
LangGraph
A Python library for building stateful, graph-based agent workflows. Represents agent logic as a directed graph of nodes (tasks) and edges (transitions), with built-in support for loops, branching, and human-in-the-loop interrupts.
RAGAS
Retrieval-Augmented Generation Assessment — a framework for evaluating RAG systems. Key metrics: Faithfulness (does the answer stay grounded in context?), Answer Relevancy (does it actually address the question?), Context Precision, Context Recall.
Prompt Injection
An attack where malicious text in user input or retrieved documents overrides the system prompt, causing the LLM to ignore its instructions. A critical security concern for any LLM system processing untrusted text.
MCP
Model Context Protocol. An open standard by Anthropic for connecting LLMs to external tools and data sources. Defines a Host-Client-Server architecture with standardised transport (STDIO, SSE, HTTP) and capability types (Tools, Resources, Prompts).
PEFT
Parameter-Efficient Fine-Tuning. A family of techniques (LoRA, QLoRA, prefix tuning, adapters) that fine-tune only a small fraction of model parameters, making fine-tuning feasible on consumer hardware.
Chunking
The process of splitting documents into smaller pieces before embedding for RAG. Critical decisions: chunk size (too small = no context, too large = noisy retrieval), overlap, and whether to chunk at character, sentence, or semantic boundaries.
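Fixed-size character chunking with overlap is the simplest strategy. A sketch (the 500/50 defaults are illustrative, not recommendations; sentence- or semantic-boundary chunking usually retrieves better):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunks with overlap, so a sentence split at one
    chunk boundary still appears whole in the neighbouring chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 1200, chunk_size=500, overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300]
```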
LLM-as-Judge
Using a capable LLM (e.g. GPT-4o) to evaluate the output of another LLM or system. Produces scores for dimensions like helpfulness, factuality, and safety. Scalable but biased toward verbose and confident answers.
Amazon Bedrock
AWS's managed service for accessing foundation models from Anthropic, Meta, Mistral, Cohere, and Amazon. Provides enterprise-grade security, VPC integration, and access control without needing to manage infrastructure.
Guardrails
Validation and safety layers that sit before and after an LLM in a production pipeline. Prevent harmful inputs from reaching the model, and catch problematic outputs before they reach the user.
System Prompt
The invisible preamble to every conversation that defines the LLM's role, instructions, constraints, and persona. Set by the developer, not the user. Has the highest level of instruction priority in most models.
Quantization
Reducing the precision of model weights (e.g. from 32-bit float to 4-bit int) to shrink model size and increase inference speed, with a small quality trade-off. QLoRA uses quantization to enable fine-tuning on consumer GPUs.
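Symmetric linear quantization rounds w / scale into a small signed-integer range. A toy sketch on a flat list of floats (real quantization operates per-tensor or per-channel on weight matrices, often with more sophisticated schemes):

```python
def quantize(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Symmetric quantization: map floats to signed ints in [-(2^(b-1)-1), 2^(b-1)-1]."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats; the rounding error is the quality trade-off."""
    return [x * scale for x in q]

weights = [0.12, -0.70, 0.33, 0.05]
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_err = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
```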