The Transformer Architecture
Plain Language
Before transformers existed, the dominant approach to processing text was the Recurrent Neural Network (RNN). An RNN reads a sentence the way a person reads a ticker tape: one word at a time, left to right, carrying a "memory" state forward from each step. The problem with this approach is profound and practical. If you want to understand the word "bank" in a long document, you might need information from fifty words earlier — the word "river" or "deposit." An RNN has to keep that information alive through every intermediate step, and its memory degrades. Think of it like a game of telephone: by the time the message reaches the hundredth person, the original nuance is lost. RNNs also could not be parallelized across the sequence — every word had to wait for the previous word to finish, which made training on modern GPUs painfully slow.
In 2017, a team of eight researchers at Google Brain published a paper titled "Attention Is All You Need" that changed everything. The insight was deceptively simple: instead of reading text sequentially, what if every word could look at every other word simultaneously, and decide how much attention to pay to each one? You give up the sequential constraint entirely and gain two enormous advantages: the model can capture long-range dependencies directly (word 1 and word 100 can interact in a single step), and the entire sequence can be processed in parallel on GPU, making training dramatically faster.
The architecture they proposed — the Transformer — has an encoder and a decoder. The encoder reads the full input sequence and builds a rich, contextual representation of it. You can think of the encoder as a very thorough reader who annotates every word with deep notes about how it relates to every other word. The decoder then reads those annotations and generates an output sequence, one token at a time. Early Transformers used both halves: the encoder-decoder design is what powers machine translation models like Google Translate, where you read a full French sentence and produce an English one.
Modern large language models like GPT-4, Claude, and Llama are decoder-only transformers. They removed the encoder entirely and trained the decoder on the task of predicting the next word given all the previous ones. This sounds like a limitation, but it turns out to be extraordinarily powerful. By training on essentially all text on the internet and beyond, predicting the next word forces the model to build an internal representation of grammar, facts, reasoning, style, code, and everything else — because predicting the next word well requires understanding all of it.
The intuition behind the core mechanism — attention — is this: for every word in a sentence, you compute a score representing how relevant every other word is to understanding this particular word. Then you take a weighted combination of all the word representations, weighted by those relevance scores. The word "it" in "The animal didn't cross the street because it was too tired" now has a very high attention score connecting it back to "animal," not "street." The model has learned, from data alone, how to route meaning through a sentence to resolve ambiguity. This is something RNNs consistently failed at across long distances.
The original "Attention Is All You Need" paper used 6 encoder layers and 6 decoder layers (65M parameters). GPT-3 scaled to 96 decoder layers with 175 billion parameters. The architecture is essentially the same — scale is what changed.
Deep Dive
The attention mechanism begins with three learned linear projections applied to each token's embedding vector. For a token embedding x of dimension d_model, we compute three vectors: a Query (Q), a Key (K), and a Value (V). These are computed by multiplying x by three separate learned weight matrices W_Q, W_K, W_V, each of shape (d_model, d_k) (strictly, W_V projects to a dimension d_v, though in practice d_v = d_k). The Query asks "what am I looking for?", the Key says "what do I contain?", and the Value is "what information should I pass on?"
The attention score between two tokens is the dot product of one's Query with the other's Key. The dot product is a measure of similarity in vector space: if two vectors point in similar directions (i.e., have similar "meaning" in the learned space), their dot product is large. We then divide by sqrt(d_k) to prevent the dot products from becoming too large in high dimensions (which would cause softmax to produce extremely peaked distributions with near-zero gradients). The softmax then converts all the raw scores for a given Query into a probability distribution that sums to 1. Finally, we take a weighted sum of all Value vectors, using those probabilities as weights. The entire operation is:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Multi-head attention runs this process in parallel across h separate "heads," each with its own learned W_Q, W_K, W_V projections to a lower-dimensional subspace (typically d_k = d_model / h). The outputs of all heads are concatenated and projected back to d_model with another learned matrix W_O. The motivation is rich: each head can specialize in a different type of relationship. In practice, specific heads have been found to track subject-verb agreement, coreference resolution, positional proximity, and syntactic structure. No one designs these specializations explicitly — they emerge from training.
Since attention has no inherent notion of position (it treats the sequence as a set, not an ordered list), the model needs positional encodings added to the token embeddings. The original paper used sinusoidal functions: for position pos and dimension i, the encoding is sin(pos / 10000^(2i/d_model)) for even dimensions and cos(...) for odd ones. This creates a unique signature for each position that the model can use to distinguish token order. Modern models use learned positional embeddings or rotary positional encodings (RoPE, used in Llama) that generalize better to sequences longer than those seen during training.
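The sinusoidal scheme can be written out in a few lines. This is a minimal NumPy sketch of the formula above, not any particular model's implementation:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal encodings from 'Attention Is All You Need'."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
print(pe.shape)   # (128, 64) — one d_model-sized signature per position
```

Each row is a unique, bounded pattern, and nearby positions produce nearby patterns, which is what lets the model reason about relative order.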
Each transformer layer wraps both the attention sublayer and the subsequent feed-forward sublayer in residual connections: the output is LayerNorm(x + sublayer(x)). The residual connection (adding the input directly to the output) is critical for training very deep networks. Without it, gradients flowing backward through 96 layers would vanish to near-zero, and the network could not learn. With it, there is always a direct "highway" for gradients to flow to earlier layers. LayerNorm normalizes the activations to have zero mean and unit variance, which stabilizes training.
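A minimal sketch of the post-norm wrapper described above (omitting LayerNorm's learned gain and bias parameters for brevity; the lambda sublayer is a stand-in for attention or the FFN):

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each position's activations to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x: np.ndarray, sublayer) -> np.ndarray:
    """Post-norm wrapper from the original paper: LayerNorm(x + sublayer(x))."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(6, 512))   # 6 positions
out = residual_block(x, sublayer=lambda h: 0.1 * h)  # toy sublayer
print(out.shape)                 # (6, 512)
print(out.mean(axis=-1).round(6))  # each position normalized to ~0 mean
```

The addition `x + sublayer(x)` is the gradient "highway": even if the sublayer's gradient is tiny, the identity path passes gradients straight through.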
The feed-forward sublayer follows the attention sublayer in each block. It is applied independently and identically to each position (hence "position-wise"). It consists of two linear transformations with a non-linear activation in between: FFN(x) = Linear_2(GELU(Linear_1(x))). The first linear layer typically expands the dimension by a factor of 4 (e.g., from d_model=768 to 3072), and the second contracts it back. This expansion-contraction stores a surprising amount of factual knowledge — research has shown that FFN layers act somewhat like key-value memories.
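As a sketch, the whole sublayer is two matrix multiplies with a non-linearity between them (the tanh approximation of GELU shown here is the variant GPT-2 uses; the weight initialization is illustrative):

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of GELU (the variant used by GPT-2)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand 4x, apply non-linearity, contract back."""
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_ff = 768, 3072   # 4x expansion, as in GPT-2
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(0, 0.02, (d_ff, d_model)); b2 = np.zeros(d_model)

x = rng.normal(size=(6, d_model))   # 6 positions, each processed independently
print(feed_forward(x, W1, b1, W2, b2).shape)   # (6, 768)
```

Note that no information flows between positions here; mixing across the sequence happens only in the attention sublayer.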
The decoder introduces two additional mechanisms beyond the encoder's structure. First, causal masking: during training, the decoder must predict token at position t using only tokens 1...t-1. This is enforced by masking the attention scores for all future positions to -inf before softmax, so they contribute zero weight. Second, cross-attention: after the causal self-attention layer, the decoder attends to the encoder's output, using the encoder representations as Keys and Values and its own hidden state as Queries. In decoder-only models (GPT, Claude, Llama), there is no encoder and no cross-attention — only causal self-attention.
import numpy as np
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (seq_len, d_k)
    K: (seq_len, d_k)
    V: (seq_len, d_v)
    mask: optional (seq_len, seq_len) boolean mask — True = mask out
    Returns: (seq_len, d_v) context vectors
    """
    d_k = Q.shape[-1]
    # Step 1: Dot product of Q with K^T → raw attention scores
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)
    # Step 2: Apply causal mask (decoder only)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    # Step 3: Softmax over last axis → attention weights
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)  # (seq_len, seq_len)
    # Step 4: Weighted sum of values
    output = weights @ V  # (seq_len, d_v)
    return output, weights
# --- Example ---
np.random.seed(42)
seq_len, d_model, d_k = 6, 512, 64
# Simulate a single attention head
x = np.random.randn(seq_len, d_model) # token embeddings
Wq = np.random.randn(d_model, d_k) * 0.02
Wk = np.random.randn(d_model, d_k) * 0.02
Wv = np.random.randn(d_model, d_k) * 0.02
Q = x @ Wq # (6, 64)
K = x @ Wk # (6, 64)
V = x @ Wv # (6, 64)
# Causal mask: True where future positions should be hidden
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
context, attn_weights = scaled_dot_product_attention(Q, K, V, mask)
print(f"Context shape: {context.shape}") # (6, 64)
print(f"Attention weights (row 3): {attn_weights[3].round(3)}")
# Token 3 can only attend to tokens 0,1,2,3 — future is zeroed
# --- Multi-head attention sketch ---
def multi_head_attention(x, W_Q_list, W_K_list, W_V_list, W_O):
    """Run h attention heads in parallel, concat, project."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q_list, W_K_list, W_V_list):
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        head, _ = scaled_dot_product_attention(Q, K, V)
        heads.append(head)
    concat = np.concatenate(heads, axis=-1)  # (seq_len, h*d_k)
    return concat @ W_O  # (seq_len, d_model)
The transformer's power comes from three things working together: global attention (every token sees every other), residual connections (allows depth without vanishing gradients), and massive parallelism (all positions computed simultaneously on GPU). Remove any one of these and you lose a critical capability.
Tokenization
Plain Language
Neural networks are mathematical functions that operate on numbers. They cannot read the letter "A" or the word "transformer" directly. So the very first step in working with any language model is converting text into a sequence of integers. This conversion process is called tokenization, and understanding it well is critical because it affects cost, latency, model behavior, and fairness in ways that are not obvious at first glance.
The intuitive approach would be to split text into words. But this runs into problems immediately: what do you do with punctuation? With contractions like "don't"? With code like df.groupby("col")? With URLs? With emojis? With misspellings? If your vocabulary is every English word, your vocabulary table has hundreds of thousands of entries and you still cannot handle rare words, proper nouns, or any other language. Modern tokenizers solve this with subword tokenization: break text into pieces that are larger than individual characters but smaller than full words.
The most widely used algorithm is Byte Pair Encoding (BPE), originally designed for data compression. The algorithm starts with a vocabulary of all individual bytes (or characters), then repeatedly finds the most frequent adjacent pair in the training corpus and merges them into a single new token. After enough merges, common words like "the" and "and" become single tokens, while rare words like "photosynthesis" might be split into "photo", "synth", "esis." The number of merges determines the final vocabulary size. GPT-2 uses 50,257 tokens; Llama 3 uses 128,256. A larger vocabulary means fewer tokens per text (more efficient) but a larger embedding table.
Tokens are decidedly not words. This is one of the most important intuitions to internalize. "tokenization" might be tokenized as ["token", "ization"] — two tokens. The word "unfortunately" in GPT-4's tokenizer is a single token, while "a" is also a single token. Emojis often require 2-4 tokens each. Code is particularly expensive: a Python function with many special characters and indentation can use far more tokens than the equivalent English description of what it does. An important practical consequence: you pay OpenAI per token, so counting tokens before sending a prompt is not just academic.
Language fairness is a real issue with tokenization. English is dramatically more token-efficient than most other languages. Mandarin Chinese, Arabic, and low-resource languages often require 2-5x as many tokens to express the same content as English. This means: (1) prompts in those languages cost more, (2) those languages have less room in the context window, and (3) models have seen far less training data in those languages, compounding disadvantages. This is an active area of research — recent models like Llama 3 improved multilingual tokenization significantly compared to earlier generations.
When you call OpenAI's API and hit a context length limit, the error is about tokens, not characters or words. Always count tokens using the appropriate library (tiktoken for OpenAI, transformers tokenizer for open models) before assuming your input will fit. A rough rule of thumb: 1 token ≈ 4 characters in English, or about 0.75 words.
Deep Dive
The BPE algorithm is straightforward to implement and worth understanding in detail. You begin by splitting the training corpus into bytes (or Unicode characters) and counting the frequency of every adjacent pair. You then merge the most frequent pair into a new token, update all occurrences in the corpus, and repeat. Each iteration of this loop adds one token to the vocabulary. After 50,000 such merges (for GPT-2), you have a vocabulary of characters plus learned subword merges that efficiently represent the training corpus. The merge rules are stored as an ordered list — during inference, you apply them greedily in the same order.
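The loop described above can be sketched in a few dozen lines. This toy implementation (corpus, merge count, and helper names are all illustrative) operates on whole words rather than a real pre-tokenized byte stream:

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Minimal BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single characters, weighted by frequency
    word_freqs = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the (weighted) corpus
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_token = "".join(best)
        # Rewrite every word, replacing occurrences of the best pair
        new_word_freqs = Counter()
        for symbols, freq in word_freqs.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged_token)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_word_freqs[tuple(out)] += freq
        word_freqs = new_word_freqs
    return merges

corpus = ["low", "low", "low", "lower", "lowest", "newer", "newer"]
print(bpe_train(corpus, num_merges=4))
# First merges: ('l','o') then ('lo','w') — "low" becomes a single token
```

The returned merge list is exactly the ordered rule list the paragraph describes: at inference time you replay these merges greedily on new text.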
WordPiece (used in BERT and its derivatives) is similar to BPE but instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training corpus under a unigram language model. In practice this produces similar results but tends to prefer merges that are individually meaningful linguistic units. WordPiece prefixes subwords with "##" to distinguish word-initial tokens from mid-word continuations: "tokenization" becomes ["token", "##ization"].
SentencePiece (used in Llama, T5, Gemma, and many others) takes a different approach: it treats the input as a raw stream of Unicode characters with no pre-tokenization, making it completely language-agnostic. It can implement either BPE or the Unigram Language Model algorithm. The key difference from tiktoken/BPE is that it is trained end-to-end and can handle arbitrary whitespace including leading spaces as part of tokens, which is why Llama tokens often have a leading space character embedded in them (e.g., "▁Hello" rather than "Hello").
Special tokens are a critical part of any tokenizer and vary by model family. They signal structure to the model that cannot be expressed in plain text. GPT-2 and GPT-3 use <|endoftext|> to signal the boundary between documents in the training corpus. The instruction-tuned GPT models use <|im_start|> and <|im_end|> to delimit system prompts and turns (the "im" is generally read as "instant message"). Llama 2 uses <s> and </s> as BOS and EOS tokens; Llama 3 switched to <|begin_of_text|> and <|end_of_text|>. BERT uses [CLS] (prepended to every sequence, used for classification) and [SEP] (separator between sentence pairs). These tokens are never produced by tokenizing ordinary text — they are added explicitly by the tokenizer based on the context.
Vocabulary sizes have grown over time as models scaled: GPT-2 has 50,257 tokens, GPT-4 uses the cl100k_base encoding with 100,277 tokens, and Llama 3 extended this to 128,256. A larger vocabulary increases the size of the embedding table (vocab_size × d_model parameters) but reduces the average number of tokens per document, which means more text fits in the same context window and inference costs less per document.
import tiktoken
# Load OpenAI's tokenizer for GPT-4 (GPT-4o uses the newer o200k_base encoding)
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding
# --- Basic tokenization ---
text = "Hello, world! Tokenization is surprisingly nuanced."
tokens = enc.encode(text)
print(f"Token IDs: {tokens}")
# [9906, 11, 1917, 0, 9984, 2065, 374, 33407, 84697, 13]
print(f"Token count: {len(tokens)}") # 10
# --- Decode individual tokens ---
for tid in tokens:
    piece = enc.decode([tid])
    print(f"  {tid:6d} → {repr(piece)}")
# --- Compare token counts across languages ---
samples = {
"English": "The transformer architecture revolutionized natural language processing.",
"Spanish": "La arquitectura del transformador revolucionó el procesamiento del lenguaje.",
"Chinese": "Transformer架构彻底改变了自然语言处理领域。",
"Arabic": "أحدث هيكل المحوّل ثورة في معالجة اللغة الطبيعية.",
"Code": "def scaled_attention(Q, K, V):\n return softmax(Q @ K.T / sqrt(d_k)) @ V",
}
for lang, sample in samples.items():
    n = len(enc.encode(sample))
    chars = len(sample)
    print(f"{lang:10s}: {n:3d} tokens ({chars:3d} chars) ratio={chars/n:.1f} chars/tok")
# --- Special tokens ---
# cl100k_base's special-token set includes <|endoftext|> but NOT the chat
# delimiters <|im_start|>/<|im_end|>, which the API applies server-side.
# Special tokens must be explicitly allowed, or encode() raises an error.
special_tokens = enc.encode(
    "First document.<|endoftext|>Second document.",
    allowed_special={"<|endoftext|>"},
)
print(f"With special token: {special_tokens}")
# --- Count tokens before sending to API (save money!) ---
def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Approximate token count for a list of chat messages."""
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # overhead per message
        for value in msg.values():
            total += len(enc.encode(str(value)))
    total += 2  # reply priming tokens
    return total
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain transformers in one paragraph."},
]
print(f"Estimated tokens: {count_tokens(messages)}") # ~24
Embeddings
Plain Language
Once text has been tokenized into a sequence of integer IDs, the model needs to convert those integers into something it can compute with. Each integer ID indexes into a large lookup table — the embedding matrix — to retrieve a high-dimensional vector. For GPT-3, these vectors have 12,288 numbers each. For smaller models like GPT-2, it is 768 numbers. These vectors are the model's internal language — every token, every concept, every word gets translated into a point in this high-dimensional space.
The remarkable thing about embedding spaces learned during training is that they develop a geometry of meaning. Words with similar meanings cluster together. "cat" and "dog" are closer to each other than either is to "automobile." Even more striking, directions in the space carry semantic meaning. The classic demonstration: take the vector for "king", subtract the vector for "man", and add the vector for "woman." The resulting point in space is extremely close to the vector for "queen." This happens not because anyone programmed it, but because the model learned to represent the structure of meaning implicitly while training on billions of examples.
This geometric property of embeddings is why Retrieval-Augmented Generation (RAG) works, which you will build extensively later in this course. When you want to find which documents in a database are relevant to a user's query, you convert both the query and each document into embedding vectors, then find the documents whose vectors are closest to the query vector. "Closest" is measured using cosine similarity (the angle between vectors), which captures semantic relatedness regardless of the exact words used. A query about "ML model deployment" will match a document about "serving machine learning systems in production" even if none of those exact words overlap.
It is important to distinguish between two types of embeddings you will encounter. Token embeddings inside the transformer are context-dependent: the word "bank" has a different vector when preceded by "river" versus "investment." This is one of the transformer's key achievements over earlier methods. Sentence or document embeddings, which are what you produce when you call the OpenAI embeddings API, are context-free in a different sense — you put in a whole sentence or paragraph, and get back a single vector that represents the entire text. This is produced by running the text through a model and then pooling the resulting token representations (often by averaging them, or taking the special CLS token).
For practical use in this course, embeddings are primarily a tool for semantic search and similarity. You will embed user queries and document chunks, store those vectors in a vector database (like Pinecone, Weaviate, or pgvector), and retrieve the most relevant chunks at query time. Understanding how embeddings work under the hood helps you make better decisions about which embedding model to choose, how to chunk your documents, and how to interpret similarity scores.
Why 1536 dimensions (the size of OpenAI's text-embedding-3-small vectors, discussed below) and not 10? In high-dimensional spaces, you can encode an enormous number of independent "directions" (concepts). With 1536 dimensions, the embedding space can simultaneously encode syntax, semantics, sentiment, topic, language, register, and hundreds of other properties without them interfering with each other. Low-dimensional spaces get "crowded" quickly.
Deep Dive
The embedding table is a matrix of shape (vocab_size, d_model). For GPT-2 with vocab_size=50,257 and d_model=768, this is roughly 38.6 million parameters — a significant fraction of the total. Each row is a learned vector for one token. During the forward pass, embedding lookup is just E[token_id] — selecting one row per token. Conceptually this is equivalent to multiplying a one-hot vector by E, but in practice it is implemented as a direct indexed row lookup (a gather), which avoids materializing enormous one-hot vectors.
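A sketch of the lookup and its one-hot equivalence, with toy sizes standing in for GPT-2's real ones:

```python
import numpy as np

vocab_size, d_model = 1000, 64      # toy sizes (GPT-2: 50,257 x 768 ≈ 38.6M params)
rng = np.random.default_rng(0)
E = rng.normal(0, 0.02, (vocab_size, d_model))  # embedding table

token_ids = np.array([5, 42, 907])  # arbitrary illustrative IDs
emb = E[token_ids]                  # gather: one row per token
print(emb.shape)                    # (3, 64)

# The conceptual one-hot formulation gives the same result,
# but would be absurdly wasteful at a real vocabulary size:
one_hot = np.eye(vocab_size)[token_ids]   # (3, 1000)
assert np.allclose(one_hot @ E, emb)
```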
The token embeddings are added to positional embeddings of the same dimension before entering the first transformer layer. The sum token_embed + positional_embed encodes both "what this token is" and "where in the sequence it sits." In weight-tied models (like GPT-2), the embedding table is also used as the output projection: after the final layer, the hidden state is multiplied by E^T to produce a logit for every vocabulary token. This parameter sharing both saves memory and constrains the model to produce outputs that live in the same geometric space as the inputs.
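Weight tying can be sketched directly: the same matrix E maps token IDs to vectors on the way in, and hidden states to vocabulary logits on the way out (toy sizes; the hidden states here are just the raw embeddings, standing in for the final layer's output):

```python
import numpy as np

vocab_size, d_model = 1000, 64   # toy sizes
rng = np.random.default_rng(1)
E = rng.normal(0, 0.02, (vocab_size, d_model))  # single shared table

# Input side: look up token embeddings
token_ids = np.array([3, 17, 256])
h = E[token_ids]                  # (3, 64) — stand-in for final hidden states

# Output side (weight tying): project back onto the vocabulary with E^T
logits = h @ E.T                  # (3, 1000) — one score per vocabulary token
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
print(logits.shape, probs.sum(axis=-1))   # each row is a distribution over tokens
```

Because both directions use E, a token's input vector and its output direction live in the same space, which is the geometric constraint the paragraph describes.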
Contextual vs. static embeddings represent a fundamental shift in NLP. Earlier models like Word2Vec (2013) and GloVe (2014) assigned a single fixed vector to each word regardless of context. The word "bank" always had the same representation. Transformer embeddings are different: the representation of each token is determined by running the full attention mechanism, so the same token gets a completely different vector depending on its surrounding context. This is why transformers dramatically outperformed Word2Vec on tasks requiring disambiguation.
For the specific case of sentence embeddings (representing an entire text as one vector), several pooling strategies are used. The simplest is mean pooling: average all token embeddings from the last layer. This is what most modern sentence embedding models do. CLS pooling uses the embedding of a special [CLS] token prepended to every sequence; BERT was trained with a classification objective at this position, so it learns to aggregate sequence-level information. OpenAI's embedding API abstracts away this choice and returns a single embedding per input.
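Mean pooling with padding excluded can be sketched as follows (toy sizes; `attention_mask` marks real tokens with 1 and padding with 0):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the last-layer token vectors, ignoring padding positions."""
    mask = attention_mask[:, None].astype(float)   # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    return summed / mask.sum()

rng = np.random.default_rng(0)
token_embs = rng.normal(size=(8, 384))       # 8 token vectors, toy hidden size
mask = np.array([1, 1, 1, 1, 1, 0, 0, 0])    # last 3 positions are padding
sentence_vec = mean_pool(token_embs, mask)
print(sentence_vec.shape)   # (384,) — one vector for the whole text

# Equivalent to averaging only the 5 real token vectors:
assert np.allclose(sentence_vec, token_embs[:5].mean(axis=0))
```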
OpenAI currently offers two primary embedding models: text-embedding-3-small (1536 dimensions, very fast and cheap) and text-embedding-3-large (3072 dimensions, higher quality for tasks requiring fine-grained semantic distinctions). A notable feature of these models is Matryoshka Representation Learning (MRL): you can truncate the embedding to a smaller dimension (e.g., 256 or 512) by taking the first N dimensions and re-normalizing, and the result is still a useful embedding. This lets you trade off memory and speed against precision. The quality degrades gracefully rather than catastrophically.
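Matryoshka-style truncation is simple to sketch: slice, then re-normalize so cosine comparisons remain valid (the random vector here stands in for a real API embedding):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """MRL-style truncation: keep the first `dims` dimensions, re-normalize."""
    short = vec[:dims]
    return short / np.linalg.norm(short)

rng = np.random.default_rng(0)
full = rng.normal(size=1536)          # stand-in for a text-embedding-3-small vector
full = full / np.linalg.norm(full)

for dims in (1536, 512, 256):
    v = truncate_embedding(full, dims)
    print(dims, v.shape, round(float(np.linalg.norm(v)), 6))  # unit length at every size
```

Re-normalizing after the slice matters: without it, cosine similarity against full-length vectors would be silently biased by the lost magnitude.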
Cosine similarity is the standard metric for comparing embedding vectors. It measures the angle between two vectors, ignoring their magnitude: cos(a,b) = (a · b) / (|a| × |b|). It ranges from -1 (opposite directions) through 0 (orthogonal, unrelated) to 1 (identical direction, maximally similar). For semantic search purposes, you normalize all embeddings to unit length first, after which cosine similarity is equivalent to the dot product — which GPU hardware computes extremely efficiently.
import numpy as np
from openai import OpenAI
client = OpenAI() # reads OPENAI_API_KEY from environment
def get_embedding(text: str, model: str = "text-embedding-3-small") -> np.ndarray:
    """Call the OpenAI embedding API and return a numpy array."""
    text = text.replace("\n", " ")  # newlines degrade quality slightly
    response = client.embeddings.create(input=[text], model=model)
    vec = np.array(response.data[0].embedding)
    return vec / np.linalg.norm(vec)  # L2-normalize for cosine ≡ dot

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two unit-normalised vectors."""
    return float(np.dot(a, b))  # == cos(θ) when both are unit vectors
# --- Compare semantic similarity of sentence pairs ---
pairs = [
("The cat sat on the mat.", "A feline rested on a rug."),
("Machine learning model deployment.", "Serving ML systems in production."),
("How do I make pasta carbonara?", "Explain quantum entanglement."),
("king", "queen"),
("king", "automobile"),
]
print(f"{'Sentence A':40s} {'Sentence B':40s} Similarity")
print("-" * 95)
for a, b in pairs:
    emb_a = get_embedding(a)
    emb_b = get_embedding(b)
    sim = cosine_similarity(emb_a, emb_b)
    print(f"{a[:38]:40s} {b[:38]:40s} {sim:.4f}")
# --- Demonstrate vector arithmetic (king - man + woman ≈ queen) ---
words = ["king", "man", "woman", "queen", "prince", "princess"]
embeddings = {w: get_embedding(w) for w in words}
analogy = embeddings["king"] - embeddings["man"] + embeddings["woman"]
# Re-normalise after arithmetic
analogy = analogy / np.linalg.norm(analogy)
print("\nking - man + woman → similarity to:")
for word, vec in embeddings.items():
    print(f"  {word:10s}: {cosine_similarity(analogy, vec):.4f}")
# "queen" should score highest (excluding king/man/woman themselves)
# --- Batch embeddings (more efficient for large corpora) ---
documents = [
"Transformers use self-attention to model sequences.",
"Python is a dynamically-typed interpreted language.",
"Gradient descent minimises the loss function iteratively.",
"RAG retrieves relevant chunks before generating an answer.",
]
# OpenAI supports batching up to 2048 inputs per request
response = client.embeddings.create(input=documents, model="text-embedding-3-small")
doc_embeddings = np.array([r.embedding for r in response.data])
print(f"\nDocument embedding matrix shape: {doc_embeddings.shape}")
# (4, 1536) — 4 documents × 1536 dimensions
# Pairwise similarity matrix
norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
unit_embs = doc_embeddings / norms
sim_matrix = unit_embs @ unit_embs.T
print("\nPairwise cosine similarity matrix:")
print(np.round(sim_matrix, 3))
Pre-training, Instruction Tuning & RLHF
Plain Language
Training a modern large language model happens in distinct stages, each serving a different purpose. Understanding these stages helps you understand why models behave the way they do — why ChatGPT is helpful and conversational rather than producing raw statistical text, why Claude refuses certain requests, and why fine-tuned models specialize in particular domains. These are not accidents; they are the result of carefully designed training pipelines applied on top of each other.
The first stage, pre-training, is the foundation. A raw transformer model with randomly initialized weights is trained on an enormous corpus of text — web pages, books, scientific papers, code repositories, Wikipedia, and more. The training objective is simple: given the previous tokens in a sequence, predict the next token. This sounds almost trivially simple, but at scale it turns out to be an extraordinarily demanding task. To predict the next word reliably across billions of different sentences, the model must implicitly learn grammar, facts about the world, how arguments are structured, how code behaves, social conventions, scientific relationships, and much more. None of this is explicitly taught — it all emerges from the pressure of predicting the next token on enough diverse text.
Pre-training is outrageously expensive. Training GPT-3 (175B parameters) reportedly cost over $4 million in compute. Training GPT-4 is estimated at $60-100 million. These runs require clusters of thousands of specialized AI accelerator chips running for months. The output of pre-training is a "base model" or "foundation model" — an extraordinarily knowledgeable but somewhat feral entity that will complete whatever you give it (often in unexpected directions) rather than helpfully answer questions.
The second stage is instruction tuning (also called Supervised Fine-Tuning or SFT). Here you take the base model and continue training it on a carefully curated dataset of (instruction, ideal response) pairs. After instruction tuning, the model has learned to behave like an assistant rather than a document completer. It will answer "What is the capital of France?" with "Paris" rather than completing the question as if it were the start of a trivia quiz. The transformation is dramatic and requires far less compute than pre-training — thousands of high-quality examples rather than trillions of tokens.
The third stage is RLHF (Reinforcement Learning from Human Feedback), which is what made ChatGPT feel dramatically more helpful and safe than earlier GPT-3 variants. Human raters compare pairs of model responses and indicate which one is better. These preferences are used to train a separate reward model that can predict human preference scores. Then the main language model is fine-tuned using reinforcement learning (specifically, Proximal Policy Optimization or PPO) to maximize the reward model's score. The result is a model that has been shaped to produce responses humans prefer — more helpful, less likely to produce harmful content, better at following instructions precisely. The combination of all three stages is what gives you Claude, ChatGPT, or Gemini.
Research by DeepMind in 2022 showed that most large models at the time were significantly undertrained. The optimal compute budget splits roughly equally between parameters and training tokens, with the rule of thumb being approximately 20 tokens of training data per model parameter. A model with 8 billion parameters should ideally be trained on around 160 billion tokens for optimal efficiency — but Llama 3 8B was trained on 15 trillion tokens, intentionally over-training to produce a smaller model that performs better at inference time.
Deep Dive
Pre-training uses the causal language modeling (CLM) objective. For a sequence of tokens x_1, x_2, ..., x_T, the model computes the log-probability of the entire sequence as the sum of conditional log-probabilities: log P(x) = ∑_t log P(x_t | x_1,...,x_{t-1}). Each forward pass simultaneously predicts all tokens in the sequence (masked so position t cannot see t+1...T), and the loss is the average cross-entropy across all positions. This makes training extremely efficient — a single forward pass of a sequence of length 2048 generates 2048 training signal examples.
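The objective can be sketched as a toy cross-entropy over shifted positions, with random logits standing in for a model's output:

```python
import numpy as np

def causal_lm_loss(logits: np.ndarray, token_ids: np.ndarray) -> float:
    """Average cross-entropy of next-token prediction.

    logits[t] is the model's prediction after seeing tokens 0..t,
    so it is scored against the true next token token_ids[t+1].
    """
    # Shift: predictions at positions 0..T-2 vs targets at 1..T-1
    pred, targets = logits[:-1], token_ids[1:]
    # Numerically stable log-softmax
    z = pred - pred.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Pick out log P(correct next token) at each position and average
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(nll.mean())

rng = np.random.default_rng(0)
T, V = 8, 100                      # toy sequence length and vocabulary size
logits = rng.normal(size=(T, V))
token_ids = rng.integers(0, V, size=T)
print(round(causal_lm_loss(logits, token_ids), 4))
# Uninformative logits land near ln(V) ≈ 4.6 — training drives this loss down
```

Note how a sequence of length T yields T-1 prediction targets from a single forward pass, which is the efficiency the paragraph points out.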
The neural scaling laws (Kaplan et al. 2020, Hoffmann et al. 2022) describe a power-law relationship between model performance, compute budget, model parameters, and training tokens. Crucially, they reveal that performance scales smoothly and predictably: doubling compute yields a predictable improvement in loss. This allows AI labs to reliably plan multi-hundred-million-dollar training runs. The Chinchilla paper (Hoffmann et al.) showed the optimal training-to-parameter ratio is approximately 20 tokens per parameter. Llama 3 8B, trained on 15 trillion tokens, is deliberately over-trained relative to that ratio (roughly 1,875 tokens per parameter) to create a small model that performs well enough to deploy cheaply.
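The 20-tokens-per-parameter rule of thumb is simple arithmetic (a sketch; `chinchilla_optimal_tokens` is an illustrative helper, not a library function):

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb compute-optimal token budget (Hoffmann et al. 2022)."""
    return n_params * tokens_per_param

for name, params in [("8B", 8e9), ("70B", 70e9), ("175B (GPT-3)", 175e9)]:
    optimal = chinchilla_optimal_tokens(params)
    print(f"{name:>14s}: ~{optimal / 1e9:,.0f}B tokens compute-optimal")
# Llama 3 8B's 15T training tokens are nearly 100x the Chinchilla-optimal 160B:
# extra training compute traded for a smaller, cheaper-to-serve model.
```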
Instruction tuning (SFT) is technically straightforward: it is standard supervised fine-tuning using the same CLM objective, but on a curated dataset of instruction-response pairs formatted in a specific template. For example, Llama 3 uses a template wrapping system, user, and assistant turns in special tokens. The model learns to recognize this template and behave like an assistant within it. The SFT dataset need not be enormous — high-quality matters more than quantity. The seminal "Alpaca" paper showed that even 52,000 instruction-following examples were sufficient to dramatically improve instruction-following over the base model.
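A minimal sketch of such a template, with special tokens loosely modeled on Llama 3's — the exact token names here are an assumption for illustration; consult the model card for the real template:

```python
# Illustrative SFT chat template. The special-token names below are
# assumptions modeled on Llama-3-style templates, not authoritative.
def format_chat(system: str, user: str, assistant: str) -> str:
    def turn(role: str, content: str) -> str:
        return f"<|start_header_id|>{role}<|end_header_id|>\n{content}<|eot_id|>"
    return ("<|begin_of_text|>"
            + turn("system", system)
            + turn("user", user)
            + turn("assistant", assistant))

print(format_chat("You are a helpful assistant.",
                  "What is 2 + 2?",
                  "2 + 2 = 4."))
```

During SFT, the loss is typically computed only on the assistant turn; the model memorizes the template structure and learns to continue it in the assistant role.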
The full RLHF pipeline proceeds in three steps after SFT. First, a separate reward model (RM) is trained on human preference data. Human raters are shown pairs of model responses to the same prompt and must select the better one. The RM (typically initialized from the SFT model) is trained to predict these preferences, learning a scalar reward for any (prompt, response) pair. Second, the SFT model is fine-tuned using PPO (Proximal Policy Optimization), a reinforcement learning algorithm: the model generates responses, the reward model scores them, and PPO updates the model weights to increase the probability of high-reward responses. Crucially, a KL divergence penalty is added to prevent the model from drifting too far from the SFT baseline (which would cause reward hacking — generating text that maximizes the reward model's score in ways humans would not actually prefer). Third, the process iterates: new data is collected, the reward model is updated, and the policy is fine-tuned again.
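The KL-penalized reward at the heart of the PPO step can be sketched as follows. All numbers here are made up; a real implementation derives the log-probabilities from the policy and the frozen SFT reference model:

```python
# Sketch of the KL-penalized RLHF reward: the reward model's score minus
# beta times an estimate of the KL divergence from the SFT reference.
# The log-prob arrays are placeholders, not real model outputs.
import numpy as np

def rlhf_reward(rm_score: float,
                policy_logprobs: np.ndarray,
                ref_logprobs: np.ndarray,
                beta: float = 0.1) -> float:
    # Per-token KL estimate on the sampled tokens: log pi(x) - log pi_ref(x)
    kl = np.sum(policy_logprobs - ref_logprobs)
    return rm_score - beta * kl

policy_lp = np.array([-1.0, -0.5, -0.2])  # policy log-probs of chosen tokens
ref_lp    = np.array([-1.2, -1.0, -0.9])  # reference (SFT) log-probs
print(rlhf_reward(rm_score=2.0,
                  policy_logprobs=policy_lp, ref_logprobs=ref_lp))
```

The policy here is more confident than the reference on every token, so the KL term subtracts from the raw reward-model score — exactly the drift penalty described above.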
Constitutional AI (CAI), developed by Anthropic and used in training Claude, extends RLHF by replacing human preference labelers with an AI judge that evaluates responses against a set of written principles (the "constitution"). The AI critiques its own outputs, rewrites them to better satisfy the principles, and uses the (original, rewritten) pairs as preference data. This scales much more efficiently than collecting human preferences and allows Anthropic to codify specific values (harmlessness, honesty, helpfulness) in written form and train against them directly.
Direct Preference Optimization (DPO) is a more recent alternative to RLHF that bypasses the separate reward model entirely. DPO directly optimizes the language model on preference pairs using a clever mathematical equivalence: it has been shown that the optimal RLHF policy can be expressed in closed form using the reference model, enabling gradient descent directly on the preference data without needing a separate RL training loop. DPO is simpler to implement, more stable to train, and has shown comparable or better results than PPO-based RLHF on many tasks. Most fine-tuning workflows today use DPO or a variant of it.
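A minimal sketch of the DPO loss for a single preference pair. The log-probabilities are placeholders; in practice each is the sum of token log-probs of the chosen (y_w) or rejected (y_l) response under the policy or the frozen reference model:

```python
# DPO loss for one preference pair:
#   L = -log sigmoid(beta * [(log pi(y_w) - log ref(y_w))
#                            - (log pi(y_l) - log ref(y_l))])
# Placeholder sequence log-probs stand in for real model outputs.
import numpy as np

def dpo_loss(pi_w: float, pi_l: float,
             ref_w: float, ref_l: float, beta: float = 0.1) -> float:
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# Policy already prefers the chosen response more than the reference does,
# so the margin is positive and the loss is below log(2):
print(dpo_loss(pi_w=-10.0, pi_l=-14.0, ref_w=-11.0, ref_l=-12.0))
```

Minimizing this pushes the policy's implicit reward margin between chosen and rejected responses upward, with beta playing the role of the KL coefficient from RLHF — no reward model or RL loop required.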
Extended thinking (sometimes called chain-of-thought reasoning, reasoning tokens, or "thinking mode" as in Claude 3.7 Sonnet) is a training-time technique where the model is encouraged to produce an internal reasoning trace before producing its final answer. During training, reasoning traces are generated and used as additional supervision signal. At inference time, the model produces (and optionally streams) these reasoning tokens before the final answer. This dramatically improves performance on multi-step reasoning tasks like mathematics, coding, and scientific problems. In Claude's API, you can see the thinking tokens when they are enabled, giving you direct visibility into the model's intermediate reasoning steps.
Inference Mechanics
Plain Language
Inference is the process of using a trained model to generate text — what happens every time you send a message to ChatGPT or call an LLM API. Unlike training, no learning occurs during inference: the weights are frozen, and you are simply running the mathematical computation forward through the network. But inference has its own surprising complexity, and understanding how it works helps you make better decisions about temperature, sampling parameters, context windows, and cost.
Modern LLMs generate text autoregressively: one token at a time, feeding each generated token back as input for the next step. When you send "Tell me a joke," the model does not see the full answer immediately. It generates the first token ("Why"), then feeds "Tell me a joke Why" back in to generate the second token ("did"), then "Tell me a joke Why did" to generate "the", and so on, until it generates a special end-of-sequence token or hits the maximum length limit. Each step is an independent forward pass through all the model's layers. For a 70B parameter model, this is a substantial computation — typically 30-80 tokens per second on an NVIDIA A100 GPU.
Temperature controls how random or deterministic the model's choices are. At the end of each forward pass, the model produces a probability distribution over all ~100,000 vocabulary tokens. With temperature=0 (or "greedy decoding"), you always pick the single most likely next token. This is fully deterministic — the same prompt always produces the same output. As temperature increases, the distribution is spread out (lower-confidence tokens get relatively more probability), producing more varied and surprising outputs. Temperature=1 is the "natural" distribution the model learned. Temperature=2 makes choices very random and usually incoherent. For creative tasks, 0.7-1.0 works well; for factual Q&A or code, 0-0.3 is safer.
The context window is the maximum number of tokens the model can consider at once, including both your input and the generated output. GPT-4 originally had 8,192 tokens; modern models have 128K (Claude 3) or even 1M (Gemini 1.5). This matters enormously in practice. If you want to chat with a model about a 500-page book, you cannot simply paste the whole book in — it would exceed the context window. This is one of the key motivations for RAG: instead of stuffing everything into the context, you retrieve only the relevant chunks.
A common misconception is that longer context windows solve all retrieval problems. Research has consistently shown the "lost in the middle" problem: models perform much better at recalling information from the beginning and end of a long context than from the middle. Even with a 128K context window, information buried in position 64K may be effectively ignored. This is why structured retrieval (RAG) often outperforms naive context stuffing even when the context window is theoretically large enough to hold all the information.
Inference is priced per token (input tokens + output tokens, often at different rates). A 128K input token prompt to GPT-4o at $5/1M input tokens costs $0.64 just for the prompt. Output tokens are typically 3-4x more expensive per token than input tokens. At scale, the choice of temperature (which affects output length indirectly), the context window size, and the model size dominate your costs.
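A back-of-envelope calculator using the prices quoted above — $5/1M input tokens, with a 3x output multiplier ($15/1M) assumed here for illustration:

```python
# Rough API cost estimator. Prices are per million tokens; the output
# price is an assumed 3x multiple of the input price for illustration.
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float = 5.0,
                 out_price_per_m: float = 15.0) -> float:
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1e6

print(f"128K-token prompt, 1K-token reply: "
      f"${request_cost(128_000, 1_000):.3f}")
```

Even a short reply barely moves the total here — at long context lengths the prompt dominates, which is why prompt caching and retrieval (sending less context) matter so much at scale.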
Deep Dive
During the autoregressive forward pass, the model runs the full token sequence through all N transformer layers sequentially, producing a hidden state at each layer for each token position. The final layer's hidden state at the last token position is then projected through the language model head (a linear layer of shape (d_model, vocab_size)) to produce a vector of logits — one unnormalized score per vocabulary token. These logits are transformed into a probability distribution by softmax (or, in practice, by sampling operations applied to the logits directly).
The most important optimization for efficient inference is the KV cache (Key-Value cache). Recall that attention requires computing K and V matrices for every token in the sequence. During autoregressive generation, after step t, the K and V matrices for all tokens 1...t have already been computed. Without caching, every new token would require recomputing K and V for the entire prefix. The KV cache stores these matrices in GPU memory (HBM), so each new step only needs to compute K and V for the single new token, then concatenate with the cached history. This reduces inference cost from O(T²) back toward O(T) per generated token. For long contexts, the KV cache can be gigabytes of data — a key constraint on how many parallel requests a GPU can serve.
Sampling strategies determine how you convert the logit distribution into a chosen token. Greedy decoding (argmax) always picks the highest-probability token and is fully deterministic. Temperature sampling divides all logits by the temperature T before softmax: p_i = softmax(logits / T)_i. With T < 1, the distribution sharpens (confident); with T > 1, it flattens (random). Top-k sampling zeros out all but the k highest-probability tokens before sampling, preventing very unlikely tokens from ever being chosen. Top-p (nucleus) sampling instead keeps the smallest set of tokens whose cumulative probability exceeds p (e.g., p=0.9), which adapts dynamically to how spread out the distribution is. Min-p sampling (a newer technique) sets a minimum probability threshold relative to the maximum: a token is eligible only if its probability exceeds min_p * max_prob. In practice, most production deployments use a combination: temperature + top-p.
Beam search maintains B candidate sequences (beams) simultaneously rather than committing to one token at a time. At each step, each beam is extended with every possible next token, the B*vocab_size resulting sequences are scored, and the top B are kept. At the end, the highest-scoring complete sequence is returned. Beam search was the dominant decoding strategy for machine translation (where it dramatically outperformed greedy) but is rarely used for open-ended generation: it produces safe, repetitive text and is B times more expensive than greedy decoding.
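The bookkeeping can be sketched with a toy "model" whose next-token distribution depends only on the previous token (a Markov chain); real beam search calls the LM at each step, but the expand-score-prune loop is identical:

```python
# Toy beam search over a fixed next-token log-prob table.
# table[prev, next] = log P(next | prev); rows are valid distributions.
import numpy as np

rng = np.random.default_rng(1)
vocab = 5
table = np.log(rng.dirichlet(np.ones(vocab), size=vocab))

def beam_search(start_token: int, steps: int, beam_width: int):
    beams = [([start_token], 0.0)]          # (sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok in range(vocab):        # extend each beam with every token
                candidates.append((seq + [tok], score + table[seq[-1], tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]     # keep only the top B
    return beams[0]                          # highest-scoring sequence

seq, score = beam_search(start_token=0, steps=4, beam_width=3)
print(f"best sequence: {seq}, log-prob: {score:.2f}")
```

Note that `beam_width=1` reduces to greedy decoding, and the candidate pool grows as B × vocab per step — which is where the B× cost multiplier mentioned above comes from.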
Speculative decoding is a recent technique for accelerating inference from large models. A small, fast draft model generates K tokens speculatively (e.g., 4-8 tokens). Then the large model verifies all K tokens in a single parallel forward pass (since the transformer processes all positions simultaneously). If the large model agrees with the draft's token, it accepts it; if it disagrees, it corrects from that point and discards the rest. Because verification is parallel and draft generation is cheap, this can achieve 2-3x speedup on the large model's effective throughput with no change in output quality.
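The verification step for a single draft token can be sketched with the standard acceptance rule — accept with probability min(1, p_target/p_draft), and on rejection resample from the normalized residual max(0, p_target − p_draft), which preserves the large model's output distribution exactly. The distributions here are toy tables, not real model outputs:

```python
# Sketch of the speculative-decoding accept/reject rule for one token.
import numpy as np

rng = np.random.default_rng(2)

def verify_token(token: int, p_target: np.ndarray, p_draft: np.ndarray):
    accept_prob = min(1.0, p_target[token] / p_draft[token])
    if rng.random() < accept_prob:
        return token, True                       # draft token accepted
    # Rejected: resample from the corrected residual distribution
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False

p_t = np.array([0.6, 0.3, 0.1])   # large (target) model distribution
p_d = np.array([0.2, 0.5, 0.3])   # small (draft) model distribution
token, accepted = verify_token(0, p_t, p_d)      # draft proposed token 0
print(token, "accepted" if accepted else "corrected")
```

Here the target model assigns token 0 more probability than the draft did, so the acceptance probability clamps to 1 and the token is always kept; a token the draft over-favored would sometimes be rejected and corrected.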
Batching is fundamental to serving efficiency. GPUs are designed for massive parallelism, and running a single inference request uses only a fraction of available compute. By processing multiple requests simultaneously in a batch, you amortize the fixed cost of loading model weights from HBM and dramatically increase throughput. Continuous batching (also called iteration-level scheduling) extends this further: rather than waiting for all requests in a batch to finish before starting new ones, new requests are slotted into the batch as old ones complete. This is how production serving systems like vLLM and TGI (Text Generation Inference) achieve high GPU utilization. A well-tuned A100 80GB serving a 13B model can handle hundreds of concurrent users at reasonable latency.
```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Apply temperature scaling, then softmax."""
    scaled = logits / max(temperature, 1e-8)
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    """Mask all logits except the top-k with -inf."""
    threshold = np.sort(logits)[-k]
    return np.where(logits >= threshold, logits, -np.inf)

def top_p_filter(logits: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative prob >= p."""
    probs = softmax(logits)
    sorted_idx = np.argsort(probs)[::-1]   # descending by probability
    cumulative = np.cumsum(probs[sorted_idx])
    # Keep tokens up to and including the one that pushes cumulative prob past p
    cutoff = np.searchsorted(cumulative, p) + 1
    keep = sorted_idx[:cutoff]
    filtered = np.full_like(logits, -np.inf, dtype=float)
    filtered[keep] = logits[keep]
    return filtered

def sample_token(logits: np.ndarray, temperature: float = 1.0,
                 top_k: int = 0, top_p: float = 1.0) -> int:
    """
    Full sampling pipeline:
      1. Apply top-k filter (if k > 0)
      2. Apply top-p (nucleus) filter (if p < 1)
      3. Apply temperature and softmax
      4. Sample from the resulting distribution
    """
    if top_k > 0:
        logits = top_k_filter(logits, top_k)
    if top_p < 1.0:
        logits = top_p_filter(logits, top_p)
    probs = softmax(logits, temperature)
    return int(np.random.choice(len(probs), p=probs))

# --- Effect of temperature on the distribution ---
np.random.seed(0)
vocab_size = 10
logits = np.random.randn(vocab_size) * 2  # simulate raw logits

print("Effect of temperature on token probability distribution:")
print(f"{'Token':^8}", end="")
for T in [0.2, 0.7, 1.0, 1.5, 2.0]:
    print(f"  T={T:3}", end="")
print()
for i in range(vocab_size):
    print(f" tok_{i:2d}", end="")
    for T in [0.2, 0.7, 1.0, 1.5, 2.0]:
        p = softmax(logits, T)[i]
        print(f"  {p:.3f} ", end="")
    print()

# --- KV cache size estimation ---
def kv_cache_bytes(n_layers: int, n_heads: int, d_head: int,
                   seq_len: int, batch_size: int = 1,
                   dtype_bytes: int = 2) -> int:
    """
    The KV cache stores K and V for each layer, head, and token.
    Size: 2 (K and V) x layers x heads x seq_len x d_head x batch x bytes
    """
    return 2 * n_layers * n_heads * d_head * seq_len * batch_size * dtype_bytes

# Llama 3 8B: 32 layers, 8 KV heads (GQA), d_head=128, fp16
size = kv_cache_bytes(n_layers=32, n_heads=8, d_head=128,
                      seq_len=8192, batch_size=1, dtype_bytes=2)
print(f"\nLlama 3 8B KV cache (8192 ctx, batch=1): {size/1e6:.1f} MB")

# GPT-4-class model (estimated): 96 layers, 96 heads, d_head=128, 128K context
size_large = kv_cache_bytes(96, 96, 128, 131072, 1, 2)
print(f"GPT-4-class (128K ctx, batch=1): {size_large/1e9:.2f} GB")
# This is why large contexts strain GPU memory so severely
```
The KV cache calculation explains why batching large-context requests is GPU-memory-intensive. At 128K tokens and 96 layers (without GQA), a single request's KV cache can far exceed the 80 GB of HBM on an A100. This is why context length and batch size are in fundamental tension, and why techniques like sliding window attention (Mistral) and grouped-query attention (Llama 3) exist — they reduce KV cache size while preserving most of the quality.
Interview Ready
How to Explain This in 2 Minutes
Modern generative AI is built on the Transformer architecture, introduced in the 2017 "Attention Is All You Need" paper. At its core, a Transformer converts raw text into numerical tokens, maps those tokens to high-dimensional embeddings, then passes them through a stack of layers where self-attention lets every token weigh how relevant every other token is. This is fundamentally different from older recurrent models because attention operates in parallel over the full sequence — making training massively scalable on GPUs. The model is first pre-trained on vast internet text using next-token prediction (a self-supervised objective), then fine-tuned with human feedback (RLHF) to follow instructions and be safe. At inference time, it generates text autoregressively — one token at a time — with parameters like temperature and top-p controlling the creativity-vs-accuracy tradeoff. Understanding this pipeline — tokenization, embeddings, attention, training, and inference — is essential for making informed decisions about prompt design, model selection, cost optimization, and debugging unexpected outputs.
Likely Interview Questions
| Question | What They're Really Asking |
|---|---|
| What is the Transformer architecture and why did it replace RNNs/LSTMs? | Do you understand parallelism, self-attention, and why sequence models evolved? |
| Explain the self-attention mechanism. What are Q, K, and V? | Can you go beyond buzzwords and describe the actual matrix operations? |
| How does tokenization work, and why does it matter for LLM applications? | Do you understand BPE/subword tokenization and its impact on cost, context limits, and multilingual performance? |
| What is the difference between generative (autoregressive) and discriminative models? | Can you articulate why GPT generates text token-by-token vs. BERT predicting masked tokens? |
| Walk me through how an LLM is trained end-to-end: pre-training, SFT, and RLHF. | Do you understand the full training pipeline and why each stage exists? |
Model Answers
RNNs process tokens sequentially — token 500 must wait for tokens 1–499 to finish, making training slow and causing the vanishing gradient problem over long sequences. Transformers replace recurrence with self-attention, which computes relationships between all token pairs in parallel. This means an entire 8,192-token sequence is processed simultaneously on a GPU. The positional encoding component preserves order information without sequential computation. This parallelism is why Transformers can be trained on trillions of tokens in weeks, whereas an equivalent RNN would take months.
Each token's embedding is linearly projected into three vectors: Query (Q), Key (K), and Value (V). Attention scores are computed as softmax(QKᵀ / √d_k), producing a weight matrix that tells each token how much to attend to every other token. These weights are multiplied by V to get a context-aware representation. The √d_k scaling prevents dot products from growing too large and pushing softmax into saturated regions. Multi-head attention repeats this with different learned projections so the model can capture different relationship types (syntactic, semantic, positional) simultaneously.
LLMs do not see raw characters or whole words. They use subword tokenization (typically BPE or SentencePiece) that splits text into frequent subword units. For example, "unhappiness" might become ["un", "happi", "ness"]. This balances vocabulary size (~32K–128K tokens) against sequence length. Tokenization directly impacts cost (APIs charge per token), context window utilization, and multilingual capability — languages with fewer training tokens get less efficient tokenization, meaning the same sentence uses more tokens in Japanese than in English. Understanding tokenization is critical for prompt engineering and cost estimation.
Generative (autoregressive) models like GPT are trained to predict the next token given all previous tokens — they model P(x_t | x_1,...,x_{t-1}) and generate text by sampling from this distribution repeatedly. Discriminative models like BERT are trained with masked language modeling (predicting a hidden token given surrounding context in both directions) and are designed for classification, extraction, and understanding tasks rather than open-ended generation. GPT-style models see only leftward context (causal mask), while BERT sees the full sequence bidirectionally. This is why BERT excels at NER and sentiment analysis while GPT excels at chat and content creation.
The pipeline has three stages. Pre-training: the model learns general language by predicting the next token on trillions of tokens of internet text — this is the expensive stage (millions of GPU-hours). Supervised Fine-Tuning (SFT): the pre-trained model is trained on curated prompt-response pairs to follow instructions. RLHF: a reward model trained on human preference rankings is used to further optimize the model via Proximal Policy Optimization (PPO), aligning outputs with human expectations for helpfulness, harmlessness, and honesty. Each stage is orders of magnitude cheaper than the previous one, which is why fine-tuning is accessible but pre-training is not.
System Design Scenario
Scenario: Your team needs to build a real-time customer support chatbot that handles 500 concurrent users with responses under 2 seconds. The product supports 5 languages. Design the inference infrastructure.
A strong answer covers: (1) Model selection — choosing a model size that fits GPU memory while maintaining quality, likely a 7–13B parameter model with grouped-query attention to reduce KV cache overhead. (2) Tokenization awareness — non-English languages consume more tokens per message, so context window budgets differ by language. (3) KV cache management — calculating memory per concurrent session (e.g., 2 (K+V) × 32 layers × 8 KV heads × 128 d_head × 4096 tokens × 2 bytes ≈ 512 MB per session; × 500 users ≈ 256 GB, so the cache must be sharded across multiple GPUs). (4) Batching strategy — continuous batching (vLLM/TGI) to maximize throughput. (5) Temperature settings — low temperature (0.1–0.3) for factual support answers. (6) Scaling — horizontal scaling across multiple GPU nodes with a load balancer routing by language to optimize tokenizer-specific caches.
Common Mistakes
- Confusing parameters with tokens — A 7B-parameter model does not have 7 billion tokens. Parameters are learned weights; tokens are input/output units. Model size (parameters) determines capability and memory footprint; token count determines training data volume and context length.
- Thinking attention is free — Self-attention has O(n²) complexity in sequence length. Doubling context from 4K to 8K tokens quadruples attention computation. This is why long-context models use optimizations like FlashAttention, sliding window attention, or sparse attention patterns.
- Ignoring the tokenizer when estimating costs — A "short" prompt in English might be 50 tokens but 150 tokens in Korean due to tokenizer efficiency differences. Always run text through the actual tokenizer (e.g., tiktoken) to get accurate token counts before estimating API costs or context window usage.