Module 02 · Foundations Phase

LLMs, SLMs & Multimodal Models

A comprehensive guide to the model landscape you will work with throughout this course. Understand the difference between large and small models, closed-weight and open-weight families, multimodal capabilities, and how to make intelligent decisions about which model to use for which task. This module gives you the vocabulary, mental models, and practical frameworks to navigate a landscape that changes faster than any other area of software engineering.

Model Landscape
Key Models & Families
Small Language Models
Multimodal Capabilities
Selection Framework
Open Notebook in Colab
01

The LLM Landscape

Plain Language: What Is Going On Out There

When people say a model is "large," they are almost always referring to its parameter count — the number of learned numerical weights that encode the model's knowledge and behavior. A 7-billion-parameter model has 7 billion of these values; a 70-billion-parameter model has ten times as many. But parameter count is really just the headline number. What actually determines a model's capability is the combination of how many parameters it has, how much data it was trained on, and how much compute (measured in GPU-hours or floating-point operations) was used during training. A model with fewer parameters trained on better, more carefully curated data can outperform a larger model trained sloppily. This is why some of the most surprising benchmarks in recent years have been small models punching far above their weight.

The major providers in the current landscape break down roughly into two camps. On the closed-weight side you have OpenAI (GPT-4o, o1, o3), Anthropic (Claude 3, 3.5, and 3.7), Google DeepMind (Gemini Flash, Pro, Ultra), Cohere (Command R+), and xAI (Grok). These companies train their models, keep the weights proprietary, and give you access through a paid API. You can call their models, but you cannot download them, inspect their internals, or run them on your own hardware. On the open-weight side you have Meta AI (Llama 3.x), Mistral AI (Mistral 7B, Mixtral 8x7B), Alibaba (Qwen 2.5), Microsoft (Phi-3, Phi-4), Google (Gemma 2), and DeepSeek (V3, R1). These labs release their model weights publicly, which means you can download them, run them locally, fine-tune them on your own data, and deploy them on your own infrastructure.

The closed-weight advantage is that frontier models from OpenAI and Anthropic consistently sit at the top of most benchmarks; they are continuously updated without you needing to do anything; the provider handles the infrastructure; and they ship with safety features and content policies. The open-weight advantage is that once you have the weights, your cost is just the hardware you run them on; your data never leaves your premises; you can fine-tune the model for your specific domain; and you are not subject to a third party's API availability or pricing changes. The choice between them is not philosophical — it is a practical engineering decision based on your use case, your budget, and your data sensitivity requirements.

The landscape is unusual in how quickly it changes. In most fields of engineering, the set of available tools is relatively stable over a year or two. In LLMs, major capability improvements arrive every three to six months. Models that were considered state-of-the-art in early 2024 were soundly beaten by mid-2024, which were themselves beaten by end of year. This is not just incremental improvement — it is capability doublings, where models that could not reliably complete a complex multi-step coding task in one generation can do it fluently in the next. For a practitioner this means the specific model names you memorize today will be outdated within months, but the underlying frameworks for evaluating and selecting models will stay relevant.

What is driving this speed is a convergence of factors: a clear scaling recipe that still yields improvements (more data, more compute, more parameters), billions of dollars of investment creating infrastructure for rapid experimentation, a growing pool of researchers who have internalized the transformer architecture, and the network effect of open-weight models allowing the research community to build on each other's work. The word "commoditization" gets used a lot in this space — the idea that models which were premium products in 2023 are effectively free or near-free by 2025. GPT-3.5-level capability, which cost meaningful money to access in 2022, can now be run locally for free on a mid-range laptop. This trajectory is continuing.

Deep Dive: Parameters, Compute, and Context

The parameter count of a model determines several practical properties simultaneously. It sets a floor on the memory required to run the model (roughly 2 bytes per parameter in 16-bit precision, so a 7B model needs about 14GB of VRAM before you account for activations and KV cache). It also largely determines the inference speed, since the forward pass through the network must multiply activations through all those weight matrices. And it has a strong correlation (though not a perfect one) with the model's knowledge and reasoning ceiling. The commonly used size classes in the open-weight world are: 7B (fast, cheap, runs on consumer GPUs, capable enough for many production tasks), 13B (a sweet spot that was popular before larger context windows made 7B models more compelling), 34B (often used for domain-specific fine-tuning where quality matters more than speed), 70B (the current gold standard for open-weight quality — requires A100-class hardware but competitive with frontier models on many tasks), 405B (Llama 3.1 405B, Meta's most capable open release, requires multiple A100s), and beyond that, models like GPT-4 which are speculated to be in the 1T+ parameter range using Mixture-of-Experts architectures.
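The memory rule of thumb above is easy to wrap in a helper. A minimal sketch (the bytes-per-parameter figures follow the text; real deployments also need headroom for activations and the KV cache):

```python
# Rough VRAM footprint for model weights alone, per the rule of thumb above.
# Illustrative sketch, not a capacity planner: activations and KV cache are excluded.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billions: float, precision: str = "fp16") -> float:
    """Gigabytes needed just to hold the weights at a given precision."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

print(weight_vram_gb(7, "fp16"))   # 14.0  -- the 7B example from the text
print(weight_vram_gb(70, "fp16"))  # 140.0 -- why 70B needs A100-class hardware
print(weight_vram_gb(7, "int4"))   # 3.5   -- why quantized 7B fits consumer GPUs
```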

To put compute costs in context: training a 70B model from scratch on quality data requires roughly one million GPU-hours on A100 80GB GPUs. At cloud spot prices around $1-2 per GPU-hour, that is $1-2M in compute for a single training run, before you account for experimentation, hyperparameter tuning, and the fact that the first training run almost never produces the best model. Frontier labs run dozens of ablations and intermediate checkpoints. The compute budgets for models like GPT-4 and Gemini Ultra are estimated in the $50-100M range. This is why hardware and compute access are genuine moats for frontier model development, and why the open-weight ecosystem — which benefits from Meta spending the training cost and releasing the weights — is so valuable to the broader community.

Context window evolution tells a story of one of the most impactful improvements in LLM usability. GPT-2 (2019) had a context window of 1,024 tokens. GPT-3 (2020) expanded to 2,048. GPT-4 launched with 8,192 tokens, later expanded to 32,768. Claude 2 (2023) was a landmark moment with 100,000 tokens, enabling full books or large codebases in a single context. Gemini 1.5 Pro pushed this to 1,000,000 tokens — one million — which means you can theoretically fit an entire software codebase, an hour of audio, or thousands of documents. Claude 3.7 Sonnet has a 200,000 token context window. The practical implication is that "long context" techniques that required complex retrieval systems in 2023 can sometimes be replaced by simply fitting everything into the prompt in 2025 — though retrieval is still necessary for truly large corpora and for cost-effectiveness.

The MMLU benchmark (Massive Multitask Language Understanding) became the dominant academic benchmark for measuring LLM intelligence from 2021 onward. It tests a model across 57 subjects — from elementary mathematics to professional law and medicine — with multiple-choice questions. Human expert performance is approximately 89%. Representative scores: GPT-4 scored 86.4%, Claude 3 Opus scored 86.8%, Llama 3 70B scored 82.0%, and Phi-3-medium (14B) reached 78.2%. These numbers are useful for rough comparisons but should be interpreted cautiously: MMLU tests factual knowledge retrieval, not the kind of reasoning, planning, and instruction following that matters most in production systems.

Mixture-of-Experts (MoE) is an architectural innovation that decouples a model's total parameter count from its inference compute cost. The classic transformer has a single feed-forward network (FFN) block that every token passes through. An MoE model instead has multiple FFN blocks called "experts" — in Mixtral 8x7B there are 8 such experts — and a routing network that selects a small number of them (typically 2) for each token. The result is that Mixtral 8x7B has 46B total parameters but only activates about 12B during any given inference step, giving it quality closer to a 40B+ dense model but speed closer to a 12B model. GPT-4 is widely speculated (though never confirmed) to be a very large MoE model. The architecture allows labs to train larger effective models without proportionally increasing inference costs, which is why MoE is central to the frontier model scaling strategy.
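The routing idea can be sketched in a few lines of numpy. This toy version uses random weights and a single matrix per "expert" in place of a full FFN, but it shows the key property: only 2 of the 8 experts run for any given token.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" stands in for an FFN block; here just one matrix for illustration.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router_w = rng.normal(size=(d_model, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-2 experts."""
    logits = x @ router_w                    # one score per expert
    top = np.argsort(logits)[-top_k:]        # indices of the 2 best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts only
    # Only top_k of the n_experts FFNs execute -- this is the compute saving.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
out = moe_forward(token)
print(out.shape)  # (16,)
```

The total parameter count covers all 8 experts, but per-token compute touches only 2, which is exactly the decoupling the paragraph describes.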

Benchmark gaming is a genuine problem in the LLM evaluation space. When labs optimize heavily for benchmark performance — sometimes by including benchmark-adjacent data in their training sets — benchmark scores become an increasingly unreliable signal of real-world capability. The community has responded by developing harder, contamination-resistant benchmarks. GPQA (Graduate-Level Google-Proof Q&A) contains questions so difficult that PhD-level domain experts score only around 65% even within their own specialty, while skilled non-experts with unrestricted web access score far lower. LiveBench generates new questions monthly from current events and mathematical competitions so that contamination is impossible. ARC-AGI tests abstract visual pattern completion in ways that current memorization strategies cannot solve. When evaluating claims about model performance, check which benchmark is being cited and whether that benchmark might be contaminated.

Reading a HuggingFace model card is a practical skill you will use constantly in this course. The model card is the documentation page for any model on HuggingFace Hub. Key sections to read: Model Architecture (what type of transformer, number of layers, attention heads, hidden dimension — tells you about compatibility and memory requirements); Training Data (what datasets, what language distribution, what cutoff date — tells you about knowledge freshness and potential biases); Evaluation Results (benchmark scores on standardized tests — useful for rough comparisons); Intended Uses and Limitations (what the model is designed for and what it explicitly is not tested on); and License (Apache 2.0 means commercial use is permitted, while some models have non-commercial-only licenses that would block production deployment).

The concept of inference throughput matters enormously for production systems. Tokens per second (tok/s) determines the user experience and the cost of serving. A model generating 20 tok/s feels sluggish to a user expecting chat-like responses. A model generating 100 tok/s can stream text faster than most people read. The factors that determine throughput are: model size (smaller is faster), hardware (H100s are roughly 3x faster than A100s), batch size (batching multiple requests together improves GPU utilization), and quantization (reducing weight precision from float32 to int8 or int4 can double speed with modest quality loss).
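A back-of-envelope latency helper makes the tok/s numbers concrete. The 0.5s time-to-first-token default is an illustrative assumption, not a measured figure:

```python
def response_seconds(n_tokens: int, tok_per_s: float, ttft_s: float = 0.5) -> float:
    """Total perceived latency: time-to-first-token plus steady-state decode time.
    ttft_s = 0.5 is an illustrative assumption, not a benchmark result."""
    return ttft_s + n_tokens / tok_per_s

# A 300-token answer: sluggish at 20 tok/s, comfortable at 100 tok/s.
print(response_seconds(300, 20))   # 15.5
print(response_seconds(300, 100))  # 3.5
```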

[Figure: model landscape scatter plot. Phi-4 14B, Llama 8B, and Llama 70B (local), Gemini 2.0 Flash, GPT-4o mini, Claude Sonnet 4.6, and GPT-4o, positioned by capability and cost per token, with a dashed value-frontier curve. Legend: Meta/open-weight, OpenAI/closed, Anthropic, Google, Microsoft.]

Model landscape positioned by capability (vertical) versus cost-per-token (horizontal). The dashed curve represents the value frontier — models on it deliver optimal capability for their price point.

02

Key Models — What You Need to Know

Plain Language: The Models You Will Actually Use

GPT-4o is OpenAI's flagship model as of 2025 and the one you will reach for when you need maximum capability from the OpenAI ecosystem. The "o" stands for "omni" — it is natively multimodal, meaning it can accept and produce text, images, and audio within a single model architecture rather than having separate specialized sub-models bolted together. In practice this means you can send GPT-4o a photo of a whiteboard, a voice recording, and a text question in the same API call, and it processes all three modalities coherently. It is also significantly faster than GPT-4 Turbo while achieving comparable or better performance, which matters for user-facing applications where latency is visible.

Claude 3.5 and 3.7 Sonnet from Anthropic represent the best balance of speed, intelligence, and cost in the Claude family. Anthropic's positioning has always emphasized safety and instruction-following — Claude models are particularly strong at following complex, nuanced instructions without going off-script, which makes them popular for production systems where reliability is paramount. Claude 3.7 Sonnet introduced "extended thinking," a hybrid mode where the model can spend additional tokens on an internal chain-of-thought before producing its final answer. This dramatically improves performance on mathematical reasoning and multi-step problems at the cost of higher latency.

Gemini 2.0 Flash and Pro are Google DeepMind's current generation, and they are notable for two reasons beyond raw benchmark performance. First, they were designed as natively multimodal models from the ground up — not text models with vision tacked on — which gives them a structural advantage at tasks requiring deep integration of visual and textual reasoning. Second, Gemini 2.0 adds native tool use and real-time capabilities that make it well suited for agentic applications where a model must use APIs and search within a conversation. The Flash variant is among the most cost-effective capable models available, making it a default choice for high-volume applications.

Llama 3.x from Meta AI is the most important open-weight model family in 2025. The 8B variant can be fine-tuned on consumer hardware with 24GB of VRAM, making it accessible for experimentation. The 70B variant requires A100-class GPUs but achieves quality genuinely competitive with GPT-3.5-Turbo and below-frontier Claude models. The 405B variant is a research-grade model that approaches frontier quality. All variants are released under the Llama Community License, which permits commercial use, modification, and redistribution for most organizations, though it carries conditions that a permissive license like Apache 2.0 does not, including attribution requirements and a separate licensing obligation for services above roughly 700 million monthly active users.

Mistral and Mixtral are the flagship products from Mistral AI, a Paris-based startup founded by former Google DeepMind and Meta AI researchers. They have consistently pushed the envelope on open-weight efficiency — their 7B model outperformed the much larger Llama 2 13B when it was released, establishing that careful architecture choices and better training data can compensate for raw size. The Mixtral 8x7B model demonstrated that Mixture-of-Experts applied to open-weight models can achieve GPT-3.5-level quality. Mistral models are a common choice for self-hosted deployments in European enterprises where data sovereignty requirements make cloud API usage problematic.

Model Family Deep Dives

The OpenAI family tree is instructive for understanding how the field evolved. GPT-3 (2020) was trained purely on next-token prediction with no explicit guidance about what constituted a helpful response. It was powerful but erratic — it would continue a question rather than answer it, produce harmful content without resistance, and frequently hallucinate with high confidence. InstructGPT (2022) applied Reinforcement Learning from Human Feedback (RLHF) to teach the model to follow instructions, producing a qualitatively different user experience. ChatGPT was essentially InstructGPT with a chat interface, and its release in November 2022 triggered the public awareness explosion around LLMs. GPT-4 (2023) was a landmark capability jump, reliably passing bar exams and medical licensing tests. GPT-4o (2024) unified text, image, and audio into a single model. The o1 and o3 series represents a different architecture philosophy that will be covered below.

o1 and o3 are OpenAI's "reasoning models" and they represent a significant departure from standard LLM behavior. Rather than answering directly, o1-class models first generate an extended internal chain-of-thought — sometimes thousands of tokens of intermediate "thinking" — before producing their final visible output. This hidden reasoning phase is similar to how a human might work through a problem on scratch paper before writing the final answer. The result is dramatic improvements on tasks that require multi-step planning: mathematical olympiad problems, competitive programming, complex scientific reasoning. The tradeoff is latency and cost — o1 calls are typically 3-10x more expensive than GPT-4o calls and take longer because of the extended reasoning chain. The right time to use o1/o3 is when you have a genuinely hard reasoning task, not for routine text generation.

The Anthropic Claude family has undergone three major generations. Claude 1 established the Constitutional AI approach — training the model to be helpful, harmless, and honest using a set of principles rather than purely from human preference labels. Claude 2 was notable for its 100,000-token context window, which was far ahead of competitors at the time and enabled entirely new applications like summarizing entire books or codebases. Claude 3 in early 2024 introduced the three-tier product structure: Haiku (fast, cheap, for high-volume lightweight tasks), Sonnet (balanced intelligence and speed, the workhorse model), and Opus (most capable, slower and more expensive, for complex tasks). Claude 3.5 Sonnet exceeded Claude 3 Opus on most benchmarks while being faster and cheaper. Claude 3.7 Sonnet added hybrid reasoning mode.

Google's Gemini is organized similarly to Claude: Flash (fast and cheap), Pro (balanced), and Ultra (most capable). Gemini 1.5 Pro's headline feature was the 1M-token context window, which was unprecedented and allowed tasks like analyzing an entire film, a full software codebase, or thousands of documents in a single context. Gemini 2.0 extended this and added native tool use — the ability to call external APIs, run code, and search the web as first-class capabilities rather than as post-hoc function-calling. The Flash variant of Gemini is one of the most cost-effective options on the market ($0.075 per million input tokens as of early 2025) and is often the right choice for high-volume production workloads where the task complexity does not justify a premium model.

Llama 3 represents Meta's third generation of open releases and is significantly more capable than its predecessors. The 3.1 release added a 128,000-token context window across the 8B and 70B variants. The 3.2 release added vision capabilities, enabling image understanding without needing a separate vision model. Fine-tuning the 8B variant on domain-specific data with LoRA (Low-Rank Adaptation) is a common and practical technique — you can adapt the model to specialized vocabulary and reasoning patterns with a relatively modest compute budget and a few thousand training examples.
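The LoRA idea itself fits in a few lines: the pretrained weight W stays frozen, and only two small low-rank factors A and B are trained, scaled by alpha/r. A numpy sketch with illustrative dimensions (real fine-tuning would use a library such as PEFT; this only shows the math):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16              # hidden size, LoRA rank, scaling (illustrative)

W = rng.normal(size=(d, d))          # frozen pretrained weight, never updated
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, initialized to zero

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = x W^T + (alpha/r) * x A^T B^T -- only A and B receive gradients."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=d)
# Because B starts at zero, the adapter is initially a no-op:
# fine-tuning begins exactly at the pretrained model's behavior.
print(np.allclose(lora_forward(x), x @ W.T))  # True
```

The trainable parameter count here is 2·d·r = 1,024 versus d² = 4,096 for the full matrix, which is why LoRA fits in a modest compute budget.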

Mistral 7B demonstrated when it launched in September 2023 that sliding window attention — which limits each token to attending to only a fixed window of previous tokens rather than the full context — could dramatically reduce memory requirements while preserving quality on most tasks. Mixtral 8x7B applied the Mixture-of-Experts architecture to an open-weight model for the first time at significant scale, achieving GPT-3.5-comparable quality. Mistral models use a relatively permissive Apache 2.0 license for their base models, though some of the newer instruction-tuned variants have slightly more restrictive terms.

Qwen 2.5 from Alibaba Cloud is a model family that deserves more attention from Western practitioners than it typically receives. The Qwen series has consistently ranked near the top of open-weight benchmarks, with particular strength in multilingual tasks (Qwen was trained on significantly more Chinese-language data than most Western models) and coding. The 7B through 72B range gives good coverage of deployment scenarios, and the model weights are freely available. For organizations building applications for Asian markets or needing strong multilingual performance, Qwen is a serious consideration.

DeepSeek-V3 and DeepSeek-R1 from the Chinese lab DeepSeek caused significant attention when they were released in early 2025. DeepSeek-V3 is a large Mixture-of-Experts chat model (671B total parameters, roughly 37B active per token) that benchmarks competitively with GPT-4o and Claude 3.5 Sonnet. More notably, DeepSeek-R1 is an open-weight reasoning model — released with full weights downloadable — that achieves performance competitive with OpenAI's o1 on mathematical and coding benchmarks. This was remarkable because it suggested that the reasoning model capability, previously the exclusive domain of closed API providers, could be replicated and released openly. The R1 release also included distilled variants (smaller models trained to mimic R1's reasoning process) at scales from 1.5B up to 70B.

Practical Guidance

For course projects you will default to Claude Sonnet or GPT-4o for complex tasks, Gemini Flash or GPT-4o-mini for cost-sensitive pipelines, and Llama 3 8B/70B when you want to experiment with self-hosted models. This combination covers the vast majority of real-world use cases.

03

Small Language Models (SLMs)

Plain Language: When Smaller Is Better

The excitement around GPT-4-scale models can create a false impression that bigger is always better. In many production scenarios, the opposite is true. A small language model — typically defined as anything under 10 billion parameters — can run on a single consumer laptop, respond in near-real-time, operate entirely offline, and cost essentially nothing per query once you own the hardware. For the right class of tasks, these are not compromises; they are genuine advantages that make previously impossible applications possible.

Microsoft's Phi series is the best demonstration of what carefully chosen training data can accomplish. The Phi-1 model (2023) was a 1.3B parameter model trained on a custom dataset of high-quality code and "textbook-quality" synthetic examples of mathematical reasoning. It outperformed much larger models on coding benchmarks, establishing the "textbooks are all you need" thesis. Phi-3-mini at 3.8 billion parameters fits in about 4GB of VRAM in 8-bit quantized form (roughly 7.6GB at 16-bit precision), accessible to most modern GPUs, has a 128,000-token context window, and matches or beats Llama 3 8B on many reasoning benchmarks. Phi-4 at 14 billion parameters is a stronger reasoning model trained in part by distillation from GPT-4.

Gemma 2 from Google is a family of small open models at 2B, 9B, and 27B parameters, specifically designed and optimized for on-device inference. Google applied knowledge distillation from larger Gemini models during Gemma training and used specific architectural improvements — like alternating attention mechanisms — to improve efficiency. The 2B variant can run on a Raspberry Pi 5 or a modern smartphone. The 9B variant achieves quality competitive with Llama 3 8B while being faster at inference.

The scenarios where you should choose an SLM over a frontier API fall into four clear categories. First, privacy-critical applications: medical records processing, legal document analysis, financial data review — any context where sending data to an external API is either legally prohibited or would require expensive compliance overhead. Second, edge deployment: mobile apps, IoT devices, in-car systems, or any environment where internet connectivity cannot be assumed. Third, cost-sensitive high-volume tasks: if your application performs 10 million simple classifications per day, even a $0.15 per million token price becomes significant; a local SLM brings that to near zero. Fourth, offline applications: tools that need to work without internet access, such as field research tools or secure facility applications.

Phi, Gemma, and SLM Benchmarks in Detail

Phi-3-mini (3.8B) is the flagship demonstration of parameter efficiency. At the rule of thumb of 2 bytes per parameter, it needs roughly 7.6GB of VRAM in 16-bit precision, fitting on 8GB consumer GPUs going back to the NVIDIA GTX 1080. In 4-bit quantized form, it requires about 2GB — runnable on an Apple M1 MacBook with 8GB of unified memory. Its 128,000-token context window is remarkable for a model this size; most models at this scale have 8,192 or 32,768 token windows. On MMLU, Phi-3-mini scores approximately 68-70%, which is lower than frontier models but sufficient for many real-world classification, extraction, and summarization tasks.

Phi-4 (14B) represents a step up in reasoning capability while remaining deployable on a single consumer GPU. It was trained using a combination of original data, synthetic datasets generated by GPT-4, and distillation from GPT-4's outputs. Its math performance is notably strong relative to its size — it scores higher than Llama 3 70B on several mathematical benchmarks while being five times smaller. This makes it a compelling choice for technical assistants or coding helpers that need to run in resource-constrained environments.

Quantization is the technique of representing model weights in lower numerical precision than the standard 32-bit or 16-bit floating point. A float32 weight takes 4 bytes; an int8 weight takes 1 byte; an int4 weight takes 0.5 bytes. This means a model quantized from float32 to int4 uses 8x less memory, which allows models that previously required a $10,000 A100 to run on a $500 consumer GPU. The quality penalty from careful quantization is typically 1-3% on standard benchmarks for int8, and 3-7% for int4 — often acceptable for production tasks. The dominant format for quantized local models is GGUF, which is the format used by the llama.cpp inference engine and Ollama.
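A minimal symmetric int8 quantizer shows where the memory saving and the (bounded) quality penalty come from. This is a toy single-tensor sketch, not the grouped/calibrated schemes real GGUF files use:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)   # a toy weight tensor

# Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = float(np.abs(w).max()) / 127.0
w_int8 = np.round(w / scale).astype(np.int8)   # 1 byte per weight instead of 4
w_deq = w_int8.astype(np.float32) * scale      # dequantized at inference time

err = float(np.abs(w - w_deq).max())
print(w_int8.dtype)        # int8
print(bool(err <= scale))  # True -- worst-case rounding error is within one step
```

Production schemes quantize in small groups with per-group scales, which is why the measured quality loss stays in the low single digits rather than growing with tensor size.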

Ollama is the most practical tool for running local language models. A single command — ollama run llama3.2 — downloads the model, sets it up with the appropriate quantization, and opens an interactive chat session. Ollama also exposes a REST API on localhost:11434 that is intentionally compatible with the OpenAI API format, which means any code you write for the OpenAI SDK can be redirected to a local model by changing the base URL and model name. This is a powerful capability for development and testing: you can build and test locally for free, then switch to a cloud API for production without changing your application logic.

llama.cpp is the inference engine that powers Ollama and many other local model tools. It is written in C++ with no mandatory dependencies (it can run on a plain CPU without any GPU framework installed), supports CUDA and Metal GPU acceleration when available, and implements a wide range of quantization schemes. Its GGUF file format is the standard for distributing quantized model weights. Understanding llama.cpp is not strictly necessary for using Ollama, but it gives you visibility into what is happening under the hood and lets you tune inference parameters like context size, batch size, and thread count.

Knowledge distillation is a training technique where a smaller "student" model is trained to mimic the output distribution of a larger "teacher" model, rather than being trained solely on raw data. Instead of just learning to predict the next correct token, the student learns to match the teacher's probability distribution over all possible next tokens — including the "soft" signal in the near-miss predictions. This transfers not just the answers but the reasoning patterns and uncertainty calibration of the teacher model. The Phi series uses distillation from GPT-4; the DeepSeek-R1 distilled variants use DeepSeek-R1 as teacher. Distillation is why small models trained this way can punch above their weight compared to models of equivalent size trained on raw data.
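The core of the distillation objective is a KL divergence between the teacher's and student's next-token distributions, usually softened with a temperature. A numpy sketch with made-up logits:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over next-token distributions, softened by temperature T.
    The student matches the teacher's full 'soft' distribution, not just its argmax."""
    p = softmax(teacher_logits, T)   # teacher's soft targets
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [4.0, 1.0, 0.5, -2.0]      # confident but not one-hot: near-misses carry signal
aligned = [3.9, 1.1, 0.4, -1.8]      # student that mimics the whole distribution
wrong   = [0.0, 4.0, 0.0, 0.0]       # student that picks a different argmax

print(distill_loss(aligned, teacher) < distill_loss(wrong, teacher))  # True
```

Raising T flattens both distributions, amplifying the gradient signal hidden in the teacher's low-probability alternatives.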

The business case for SLMs is most compelling when you do the unit economics. A frontier API model at $3 per million input tokens with 1 million daily queries at 500 tokens average costs $1,500 per day — $547,500 per year. A 7B model served from a single cloud A100 instance at $2.50/hour, handling those same queries within a 50ms latency budget, would cost roughly $60/day in compute — $21,900 per year. That is a 25x cost reduction. The tradeoff is the quality difference on your specific task, the engineering cost of managing the self-hosted infrastructure, and the operational burden of monitoring and updating the model. For high-volume, well-defined tasks where quality is measurable and the task is bounded, the SLM economics are often compelling.
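The arithmetic behind those figures, using the same illustrative prices (not vendor quotes):

```python
# Unit economics sketch with the illustrative prices from the text above.
queries_per_day = 1_000_000
tokens_per_query = 500
api_price_per_m = 3.00                 # $ per million input tokens

api_daily = queries_per_day * tokens_per_query / 1e6 * api_price_per_m
api_yearly = api_daily * 365

gpu_hourly = 2.50                      # assumed cloud A100 on-demand rate
slm_daily = gpu_hourly * 24
slm_yearly = slm_daily * 365

print(api_daily, api_yearly)           # 1500.0 547500.0
print(slm_daily, slm_yearly)           # 60.0 21900.0
print(round(api_yearly / slm_yearly))  # 25
```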

# Ollama local call — REST API compatible with OpenAI SDK
from openai import OpenAI

# Redirect to local Ollama instance
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",  # or phi4, mistral, gemma2, etc.
    messages=[
        {"role": "user", "content": "Explain quantization in two sentences."}
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)

# The same code works for cloud APIs — just change base_url and api_key:
# base_url="https://api.openai.com/v1" for OpenAI
# This is the key advantage of OpenAI-compatible REST APIs
Local Dev Tip

Use ollama pull llama3.2 and ollama pull phi4 before the self-hosting module. Having models cached locally means you can experiment without any API costs and test your code offline. The llama3.2 default (3B) is the fastest; pull llama3.1:8b when you want better quality.

04

Multimodal Models

Plain Language: Beyond Text

A multimodal model is one that can process and reason over multiple types of data — most commonly text and images, but increasingly also audio, video, and structured data like tables and code. For the first four years of the transformer era, large language models were strictly text-in, text-out. Multimodality changed the surface area of what an AI system can do as dramatically as the original LLM capability jump changed what was possible with text.

The most immediately practical multimodal capability for developers is vision: the ability to send an image alongside a text prompt and have the model reason about both together. In practice this means you can paste a screenshot of an error message and ask the model to debug it, send a photo of a whiteboard diagram and ask for an explanation, upload a chart from a PDF and ask for an analysis, or send an image of a handwritten form and ask for the data extracted in JSON. These are not toy capabilities — they represent genuine elimination of entire categories of pre-processing pipelines that developers previously had to build.
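In the OpenAI chat format, an image travels as one part of a multi-part user message. The sketch below only builds the request payload (no API call is made); the question text and image URL are placeholders:

```python
# Shape of a vision request in the OpenAI chat format (payload only, no network call).
def vision_message(question: str, image_url: str) -> dict:
    """One user turn combining a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = vision_message(
    "What does this error message mean?",
    "https://example.com/screenshot.png",   # placeholder URL
)
print(msg["content"][0]["type"], msg["content"][1]["type"])  # text image_url
# Pass [msg] as `messages` to client.chat.completions.create(...) with a
# vision-capable model such as gpt-4o.
```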

The impact on document understanding is particularly significant. Traditional approaches to extracting information from PDFs involved a brittle pipeline: convert PDF to text (lossy, breaks tables and figures), run OCR on scanned pages (error-prone), then process the text with an NLP model. With multimodal models, you can send a page screenshot directly — the model reads the text, understands the layout, interprets the figures, and reasons about everything together. Systems like ColPali take this further, embedding full page screenshots into a vector index, bypassing the entire text extraction step entirely.

Audio adds another dimension. OpenAI's Whisper is a speech recognition model (text-only output) that achieves near-human accuracy on a wide range of accents, languages, and audio quality conditions. GPT-4o goes further with a real-time audio API that can accept audio input and produce audio output natively — enabling voice assistants that are not just speech-to-text followed by an LLM followed by text-to-speech, but a model that processes the acoustic features of speech (tone, speed, emotional content) and can respond with natural prosody. This is a qualitatively different capability from chaining separate models.

Vision Transformers and How Image Inputs Work

The Vision Transformer (ViT) architecture, introduced by Google in 2020, is the foundation for how modern LLMs process images. The key insight is that the transformer architecture — which processes sequences of vectors — can be applied to images by treating image patches as the sequence elements. Concretely: an image is divided into a grid of non-overlapping patches, typically 16x16 pixels each. A 224x224 pixel image yields 196 patches. Each patch is flattened into a vector of pixel values and then projected through a learned linear layer to the transformer's embedding dimension (typically 768 or 1024 dimensions). The resulting sequence of patch embeddings is fed to a standard transformer encoder, which applies self-attention across patches to build a representation of the entire image. The remarkable finding was that transformers, given enough data, learn to attend to semantically meaningful image regions — edges, objects, faces — without any explicit spatial inductive bias.
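The patch arithmetic above can be checked in a few lines. This is a sketch for illustration only — `patch_grid` is a hypothetical helper, not part of any ViT library:

```python
# Sketch: the ViT patch arithmetic described above (illustrative helper).

def patch_grid(image_size: int, patch_size: int, channels: int = 3):
    """Return (number of patches, flattened patch vector length) for a square image."""
    per_side = image_size // patch_size             # patches along one edge
    num_patches = per_side * per_side               # sequence length seen by the transformer
    patch_dim = patch_size * patch_size * channels  # raw pixel values per patch
    return num_patches, patch_dim

print(patch_grid(224, 16))  # (196, 768): a 14x14 grid, 16*16*3 pixel values per patch
```

Note that 16·16·3 = 768 raw values per RGB patch happens to equal a common transformer embedding width; the learned linear projection is still applied to map raw pixel values into the embedding space.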

GPT-4V and GPT-4o use a CLIP-style vision encoder — CLIP (Contrastive Language-Image Pre-training) is a model trained to produce aligned embeddings for images and text, such that the embedding of "a photo of a dog" is close to the embedding of an actual dog photo. The CLIP visual encoder processes the input image and produces a sequence of visual embeddings. A learned projection layer then maps these visual embeddings into the same vector space as the language model's token embeddings. The resulting combined sequence — visual patch embeddings followed by text token embeddings — is fed to the language model decoder as a single unified context. From the decoder's perspective, the image is just a prefix of "visual tokens" before the text begins.
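The "visual tokens as a prefix" idea can be made concrete with some shape bookkeeping. This is a pure-Python sketch with illustrative dimensions — the `project` function stands in for the learned linear projection and is not a real implementation:

```python
# Shape bookkeeping for the unified image+text sequence (illustrative dims).
NUM_PATCHES, VIT_DIM, LLM_DIM = 196, 1024, 4096

# Stand-ins for real tensors: one vector per patch / token.
visual_embeddings = [[0.0] * VIT_DIM for _ in range(NUM_PATCHES)]

def project(vec, out_dim):
    """Stand-in for the learned projection layer (here: pad/truncate to out_dim)."""
    return (vec + [0.0] * out_dim)[:out_dim]

visual_tokens = [project(v, LLM_DIM) for v in visual_embeddings]
text_tokens = [[0.0] * LLM_DIM for _ in range(12)]  # e.g. a 12-token prompt

# The decoder sees one unified sequence: image prefix, then text.
unified = visual_tokens + text_tokens
print(len(unified), len(unified[0]))  # 208 4096
```

The point of the sketch: after projection, visual and text tokens have the same dimensionality, so the decoder's self-attention treats them uniformly.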

Claude 3's vision capabilities are particularly strong at document understanding tasks that require reading complex layouts. It can reliably extract data from tables that span multiple columns, read mathematical notation, understand hand-drawn diagrams, interpret code screenshots, and handle low-quality or rotated document scans. Anthropic has published benchmarks showing Claude 3 Opus achieving near-perfect scores on chart question-answering and document extraction tasks that earlier multimodal models struggled with.

Gemini's native multimodality is architecturally different from the "vision encoder + projection layer" approach. Gemini was trained from the start on interleaved sequences of text, images, audio, and video — not text first with vision added later. This means the model has a unified internal representation space where visual, textual, and acoustic information are all encoded in a compatible way from the ground up. The practical benefit is on tasks requiring tight integration of modalities — for example, answering questions about a video where the answer requires matching audio content with visual events at the same timestamp.

Whisper is OpenAI's open-source speech recognition model, released in 2022. It was trained on 680,000 hours of multilingual audio transcribed from the internet, giving it strong performance across dozens of languages and a wide range of recording conditions. For practical purposes, Whisper is the standard choice for speech-to-text preprocessing — it can be run locally (the "medium" model fits in 5GB of VRAM) or called via the OpenAI Transcription API. GPT-4o's real-time audio API enables a different use case: streaming audio in and audio out with low latency, suitable for conversational voice interfaces. It processes audio directly rather than transcribing first, which preserves emotional content and prosodic information that transcription discards.

Sending images via the API requires encoding the image as base64 and including it in the message content as an image_url type. The OpenAI and Anthropic APIs both follow a similar pattern: the message content is not a simple string but a list of content blocks, where each block has a type ("text" or "image_url") and the relevant content. For the OpenAI API, you include a block with "type": "image_url" and the image data as a data URL. For the Anthropic API, you include a block with "type": "image", the media type, and the base64-encoded data. The size limits are generous — GPT-4o accepts images up to 20MB; Claude accepts up to 5MB per image — but the token cost of images is non-trivial: a 512x512 image consumes approximately 170 tokens in the OpenAI pricing model.
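The OpenAI-side content-block pattern described above looks roughly like this. This is a sketch of the request payload only — the image bytes are a placeholder, and actually sending the request requires the `openai` package and an API key:

```python
import base64

# Placeholder bytes; in real code, read the file, e.g. Path("chart.png").read_bytes()
image_b64 = base64.standard_b64encode(b"<png bytes here>").decode("utf-8")

# OpenAI-style message: content is a list of typed blocks, with the image
# delivered as a base64 data URL under the "image_url" type.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all data from this chart as JSON."},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            },
        ],
    }
]

# Sending it (requires an API key):
# from openai import OpenAI
# response = OpenAI().chat.completions.create(model="gpt-4o", messages=messages)
```

Contrast this with the Anthropic pattern shown below, where the block type is "image" and the base64 data and media type are passed as separate fields rather than packed into a data URL.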

ColPali is a recent technique for multimodal document retrieval that bypasses traditional PDF text extraction entirely. Instead of extracting text from a PDF and embedding it, ColPali takes a screenshot of each PDF page and embeds the full-page image using a Vision Language Model. At query time, the query text is encoded and matched against the page image embeddings. The advantage is that it correctly handles pages where text and figures are tightly integrated — like a chart with an inline caption — where text-only extraction would produce a degraded representation. This is relevant for the course's RAG module, where document understanding quality directly impacts answer quality.

[Diagram: image path (16×16 px patch split → ViT/CLIP encoder → patch embeddings, d=768 → linear projection to the LLM dimension) and text path (BPE/SentencePiece tokenizer → token embeddings, d=4096) are concatenated into one unified sequence and processed by N transformer decoder blocks (self-attention + feed-forward) to generate text output token by token.]

Multimodal input processing: images are split into 16x16 patches, encoded by a Vision Transformer, and projected into the LLM's embedding space. Text is tokenized and embedded separately. Both are concatenated into a single unified sequence before entering the language model decoder.

# Sending an image to Claude's API using base64 encoding
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

# Load and encode the image
image_path = Path("chart.png")
image_data = base64.standard_b64encode(image_path.read_bytes()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Extract all data from this chart as JSON."
                }
            ],
        }
    ],
)

print(message.content[0].text)
05

Model Selection Framework

Plain Language: There Is No Single Best Model

One of the most common mistakes developers new to the LLM ecosystem make is optimizing for the "best" model on general benchmarks. Benchmark scores measure performance on academic test sets — which may correlate only weakly with performance on your specific task. A model that scores highest on MMLU may be outperformed on your particular use case by a model that scores 5 points lower, simply because the lower-scoring model was fine-tuned on similar data or has better instruction following for the specific prompt patterns you use.

The first set of questions to ask about your task is practical: What is the acceptable latency? What is the token budget per request? Does the task require understanding images, documents, or audio? Does the output need to be structured JSON, or is natural text sufficient? Does the data contain sensitive information that cannot leave your premises? Each of these questions rules out or rules in different model choices before you ever look at a benchmark.

The "good enough" principle is one of the most commercially important ideas in applied AI. A model that achieves 90% of the quality of the best model at 20% of the cost and three times the speed is almost always the right production choice, provided you can measure that quality gap. The caveat is the measurement: you need to define what "90% of quality" means for your specific task and build a golden evaluation set to verify it. Without that measurement, you are guessing — and the cost difference between guessing wrong in either direction is substantial.

The practical implication is that model selection is not a one-time decision made at the start of a project. It is an ongoing process: start with the best model to establish a quality ceiling, measure that quality on your task, then systematically test cheaper/faster alternatives to find the knee of the tradeoff curve. This approach — build a golden eval set, run all candidate models, pick the winner — is the professional standard for production LLM systems.

Decision Tree and Use-Case Matching

A practical decision tree for model selection starts with the latency requirement. If you need a response in under 500 milliseconds — for an interactive UI, a real-time assistant, or a high-throughput pipeline — you immediately restrict yourself to models with fast inference: Gemini Flash, GPT-4o-mini, Claude Haiku, or a locally hosted small model. These are not compromised models; they are models specifically designed for throughput-sensitive applications, and they are genuinely capable for the tasks they were designed for.

If latency is less critical and the task is complex reasoning — multi-step mathematical problems, complex code generation, nuanced instruction following, document analysis requiring synthesis across many pages — you move up to Claude Sonnet 4.6, GPT-4o, or Gemini 1.5 Pro. If the task is at the frontier of difficulty — graduate-level mathematics, novel research synthesis, code generation for complex systems — you consider o1/o3 for the extra reasoning budget, accepting the higher latency and cost.

If the task involves vision, your options narrow: GPT-4V/GPT-4o, Claude 3+, Gemini 1.5+, or Llama 3.2 (open-weight vision). For vision tasks requiring high accuracy on document understanding specifically, Claude is often the first choice based on published benchmarks. For tasks requiring integration of visual and audio information (like video analysis), Gemini's native multimodality gives it a structural advantage.

For private data or on-premises requirements, you have no choice but self-hosted open-weight models: Llama 3 (8B, 70B, 405B), Mistral, Mixtral, Qwen, Phi-4, or Gemma 2. The hardware requirements range from a gaming PC (7-8B quantized) to a multi-GPU server (70B at full precision). For teams new to self-hosting, Ollama provides the lowest-friction entry point.
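The decision tree above can be encoded as a first-pass filter. This is a sketch, not an authoritative routing policy — the tiers and model names are taken from the text, and a real system would weigh these constraints with more nuance:

```python
# The decision tree above as a first-pass shortlist (names/tiers from the text).

def shortlist(latency_ms: int, needs_vision: bool, private_data: bool,
              frontier_reasoning: bool) -> list[str]:
    if private_data:
        # On-premises only: self-hosted open-weight models.
        if needs_vision:
            return ["Llama 3.2 Vision"]  # open-weight vision option from the text
        return ["Llama 3 70B", "Mistral/Mixtral", "Qwen 2.5", "Phi-4", "Gemma 2"]
    if latency_ms < 500:
        # Throughput-sensitive: fast inference tier.
        return ["Gemini Flash", "GPT-4o-mini", "Claude Haiku"]
    if frontier_reasoning:
        # Frontier-difficulty tasks: pay for extra reasoning budget.
        return ["o1/o3"]
    if needs_vision:
        return ["GPT-4o", "Claude 3+", "Gemini 1.5+"]
    return ["Claude Sonnet", "GPT-4o", "Gemini 1.5 Pro"]

print(shortlist(latency_ms=2000, needs_vision=False,
                private_data=True, frontier_reasoning=False))
```

The output of the filter is a shortlist to benchmark on your golden eval set, not a final answer.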

Pricing, Benchmarks, and Testing Methodology

Pricing comparison as of early 2025 (per million tokens, input / output): GPT-4o costs $2.50 / $10.00. Claude Sonnet 4.6 costs $3.00 / $15.00. GPT-4o-mini costs $0.15 / $0.60. Gemini 2.0 Flash costs $0.075 / $0.30. These numbers change frequently — frontier model prices have dropped by roughly 10x every 18 months as scale and competition drive costs down. For context on the magnitude: a typical conversational turn with a 200-token prompt and 300-token response costs about $0.0035 at GPT-4o pricing ($0.0005 for input plus $0.0030 for output). A million such turns costs about $3,500. At Gemini Flash pricing, the same million turns cost about $105. At that volume, the difference is substantial.

Inference speed (tokens per second for typical use cases): GPT-4o-mini achieves approximately 100 tokens/second at the API level under normal load. Claude Haiku achieves approximately 120 tokens/second. GPT-4o achieves approximately 50-70 tokens/second. Claude Sonnet achieves 50-80 tokens/second. These numbers vary significantly with load and can be improved with streaming (which makes time-to-first-token the relevant metric rather than total latency). For streaming, the first token from a fast model arrives in 200-400ms; for a large reasoning model, it might arrive after 5-10 seconds of internal computation.

The most important public benchmarks for comparing production models are: LMSYS Chatbot Arena (human preference voting by thousands of users on side-by-side comparisons — the gold standard for measuring user-perceived quality because it reflects actual human preferences rather than academic test answers); SEAL Leaderboard (third-party evaluations with contamination controls); and LiveBench (dynamically updated with new questions to prevent contamination). For coding specifically, HumanEval and SWE-bench are widely used. For reasoning, GPQA-Diamond (designed to be resistant to search-based contamination) and MATH are widely used.

The latency vs. cost vs. quality triangle is the canonical tradeoff in model selection: you can optimize for two of the three properties, but not all three simultaneously. A model that is fast and cheap (GPT-4o-mini, Gemini Flash) trades on quality. A model that is fast and high quality (Claude Sonnet, GPT-4o) trades on cost. A model that is high quality and cheap (self-hosted Llama 70B) trades on latency and operational complexity. Understanding where your application sits on this triangle determines your decision before you ever run a single benchmark.

The professional testing methodology for model selection is: (1) Define your task precisely and collect 50-200 representative examples. (2) Label these examples with correct outputs, either human-annotated or from a trusted high-quality model. This is your golden eval set. (3) Run each candidate model on all examples with your production prompt. (4) Score the outputs against the golden labels, using an automated metric where possible (exact match for structured extraction, ROUGE/BERTScore for summaries, LLM-as-judge for open-ended quality). (5) Sort models by quality-per-dollar on your specific task. (6) Pick the model at the knee of the curve — the one after which additional quality gains cost disproportionately more. This methodology, done rigorously, produces decisions that are defensible, reproducible, and often surprising in which model wins.
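Steps (3) through (6) of the methodology can be sketched as a minimal harness. The model names, candidate outputs, and prices below are made up for illustration — in practice each model's predictions come from its API:

```python
# Minimal model-selection harness: score candidates against a golden eval set
# with exact match, then rank by quality per dollar. All data here is hypothetical.

golden = [("invoice_1", "ACME Corp"), ("invoice_2", "Globex"), ("invoice_3", "Initech")]

candidate_outputs = {  # model -> predicted label per example (made up)
    "frontier-model": ["ACME Corp", "Globex", "Initech"],
    "cheap-model":    ["ACME Corp", "Globex", "Initech Inc"],
}
cost_per_1k_requests = {"frontier-model": 5.00, "cheap-model": 0.30}  # illustrative

results = {}
for model, preds in candidate_outputs.items():
    correct = sum(p == label for p, (_, label) in zip(preds, golden))
    accuracy = correct / len(golden)
    results[model] = {
        "accuracy": round(accuracy, 3),
        "accuracy_per_dollar": round(accuracy / cost_per_1k_requests[model], 3),
    }

# Sort by quality per dollar: the knee of the tradeoff curve often favors
# the cheaper model even when its raw accuracy is lower.
for model, r in sorted(results.items(), key=lambda kv: -kv[1]["accuracy_per_dollar"]):
    print(model, r)
```

Exact match works here because the task is structured extraction; for summaries or open-ended outputs, swap in ROUGE, BERTScore, or an LLM-as-judge scorer as the text describes.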

Code: Token Counting with tiktoken

# Token counting with tiktoken — useful for cost estimation before sending requests
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for OpenAI-family models using tiktoken."""
    # encoding_for_model returns the right tokenizer for the model:
    # o200k_base for GPT-4o and newer, cl100k_base for GPT-4 / GPT-3.5-turbo
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Estimate cost before sending a request
def estimate_cost(prompt: str, model: str = "gpt-4o") -> dict:
    # Prices per million tokens (input / output)
    PRICING = {
        "gpt-4o":       (2.50, 10.00),
        "gpt-4o-mini":  (0.15,  0.60),
        "claude-sonnet-4-6": (3.00, 15.00),
        "gemini-2.0-flash": (0.075, 0.30),
    }

    # tiktoken matches OpenAI tokenizers only; for Claude or Gemini models the
    # fallback encoding gives a rough approximation, not an exact count.
    input_tokens = count_tokens(prompt, model)
    price_in, price_out = PRICING.get(model, (3.0, 15.0))

    # Assume typical output is ~300 tokens
    estimated_output = 300
    input_cost  = (input_tokens / 1_000_000) * price_in
    output_cost = (estimated_output / 1_000_000) * price_out

    return {
        "input_tokens":  input_tokens,
        "input_cost_usd": round(input_cost, 6),
        "output_cost_est": round(output_cost, 6),
        "total_est_usd": round(input_cost + output_cost, 6),
    }

# Example usage
long_prompt = "Summarize the following research paper... " + "x" * 5000
result = estimate_cost(long_prompt, "gpt-4o")
print(f"Input tokens: {result['input_tokens']:,}")
print(f"Estimated cost: ${result['total_est_usd']:.4f}")
Important Note on Pricing

API prices change frequently and vary by region, tier, and volume. Always check the current pricing page for each provider before building cost estimates into business cases. The numbers here are illustrative of relative relationships, not guaranteed current prices.

🚀

Speed-First Tasks

Autocomplete, real-time suggestions, high-volume classification, simple Q&A. Use GPT-4o-mini, Gemini Flash, or Claude Haiku. Target <500ms response.

🧠

Reasoning Tasks

Complex code generation, multi-step analysis, research synthesis. Use Claude Sonnet, GPT-4o, or o1 for genuinely hard problems. Accept higher latency.

📷

Vision Tasks

Document extraction, diagram analysis, screenshot debugging, chart reading. Use GPT-4V, Claude 3+, Gemini 1.5+, or Llama 3.2 (open-weight).

🔒

Private Data

Medical, legal, financial data that cannot leave your premises. Self-host Llama 3 70B, Mistral, Qwen, or Phi-4 via Ollama or vLLM on your own infrastructure.

🌏

Multilingual

Non-English tasks, especially Asian languages. Qwen 2.5 (strong on CJK), Claude (strong on European languages), or GPT-4o (broad multilingual coverage).

📈

High Volume / Cost

More than 1M daily queries. Build a golden eval set, verify a cheaper model hits your quality bar, then self-host or use Gemini Flash / GPT-4o-mini at scale.

🎯

Interview Ready

Elevator Pitch — 2-Minute Interview Explanation

"The modern model landscape spans three categories. Large Language Models like GPT-4o and Claude Sonnet have hundreds of billions of parameters, deliver frontier reasoning and multimodal capabilities, and are accessed through paid APIs. Small Language Models like Phi-4 (14B) and Gemma 2 (9B) have fewer than 15 billion parameters, can run on a single consumer GPU or even a laptop, and are the right choice when you need privacy, offline access, or high-volume inference at near-zero marginal cost. Multimodal models extend beyond text to accept images, audio, and video — they use a Vision Transformer to split images into 16x16 patches, encode them, and project the resulting embeddings into the same vector space as text tokens so the decoder processes a unified sequence. The key engineering decision is not picking the 'best' model — it is building an evaluation set for your specific task and finding the model at the knee of the quality-per-dollar curve. A quantized 7B model running locally can replace a $500K/year API bill if the task is well-defined and bounded."

Interview Questions

Question | What They’re Really Asking
What is the difference between an LLM and an SLM, and when would you choose each? | Can you make practical cost-quality tradeoffs, not just chase the biggest model?
How does a multimodal model process an image alongside text? | Do you understand the Vision Transformer + projection architecture, or do you treat it as magic?
What is Mixture-of-Experts and why does it matter? | Can you explain how MoE decouples total parameters from inference cost?
How would you decide between a closed-weight API and a self-hosted open-weight model? | Do you think about data privacy, cost at scale, latency, and operational burden holistically?
What is quantization and what are the tradeoffs? | Can you explain how reducing precision (float32 → int4) shrinks memory 8x with only 3-7% quality loss, and when that tradeoff is worth it?

Model Answers

1. LLM vs. SLM — when to choose each: LLMs (70B+ parameters) excel at complex reasoning, multi-step code generation, and tasks requiring broad world knowledge — they are the right choice when quality is the primary constraint and you can absorb API costs or have GPU infrastructure. SLMs (under 10B parameters) are the right choice when data cannot leave your premises, when you need offline or edge deployment, or when you are running millions of queries per day and the task is well-defined enough that a smaller model meets your quality bar. The decision framework is: establish a quality ceiling with a frontier model, build a golden eval set of 50-200 representative examples, then systematically test smaller and cheaper alternatives to find the point where quality drops below your acceptable threshold.

2. How multimodal models process images: The image is divided into a grid of non-overlapping 16x16 pixel patches — a 224x224 image yields 196 patches. Each patch is flattened into a pixel vector and passed through a Vision Transformer (ViT) encoder, often CLIP-based, which applies self-attention across patches to produce a sequence of visual embeddings. A learned linear projection layer maps these visual embeddings into the same dimensional space as the language model's token embeddings. The visual embeddings are then concatenated with the text token embeddings into a single unified sequence, and the LLM decoder processes this combined sequence with standard self-attention, treating image patches essentially as "visual tokens."

3. Mixture-of-Experts (MoE): In a standard dense transformer, every token passes through the same feed-forward network (FFN). In an MoE architecture, there are multiple FFN blocks called "experts" — for example, Mixtral 8x7B has 8 experts — and a learned routing network selects a small subset (typically 2) for each token. This means Mixtral has 46B total parameters but only activates about 12B per inference step, giving quality comparable to a 40B+ dense model at the speed of a 12B model. MoE is central to frontier scaling strategy because it lets labs train larger effective models without proportionally increasing inference compute costs.
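The top-2 routing idea can be sketched in a few lines of pure Python. The gate scores below are toy numbers, and the parameter split is illustrative (chosen to roughly match Mixtral 8x7B's published totals); real routers are learned networks over hidden states:

```python
# Toy sketch of top-2 MoE routing: 8 experts, 2 active per token.
NUM_EXPERTS, TOP_K = 8, 2

def route(gate_scores: list[float]) -> list[int]:
    """Pick the indices of the top-k scoring experts for one token."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: -gate_scores[i])
    return sorted(ranked[:TOP_K])

scores = [0.1, 0.05, 0.3, 0.02, 0.25, 0.08, 0.15, 0.05]  # one token's gate output
print(route(scores))  # [2, 4]: this token is processed by experts 2 and 4

# Why active parameters stay small: only TOP_K of NUM_EXPERTS expert FFNs
# run per token. Illustrative sizes, roughly matching Mixtral 8x7B:
expert_ffn, shared = 5.6e9, 1.6e9   # per-expert FFN params, shared (attention etc.)
total_params  = shared + NUM_EXPERTS * expert_ffn   # stored: ~46.4B
active_params = shared + TOP_K * expert_ffn         # used per token: ~12.8B
print(f"{total_params / 1e9:.1f}B total, {active_params / 1e9:.1f}B active")
```

The gap between `total_params` and `active_params` is exactly the decoupling the answer describes: storage scales with expert count, inference compute with top-k.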

4. Closed-weight API vs. self-hosted open-weight: Closed-weight APIs (OpenAI, Anthropic, Google) offer frontier quality, zero infrastructure management, continuous updates, and built-in safety features — but your data leaves your premises, you are subject to pricing changes, and costs scale linearly with volume. Self-hosted open-weight models (Llama 3, Mistral, Phi-4) keep data on-premises, allow fine-tuning for your domain, and offer near-zero marginal cost at high volume — but require GPU infrastructure, operational monitoring, and manual model updates. The decision depends on data sensitivity (regulated industries often mandate self-hosting), volume (at 10M+ daily queries the cost savings from self-hosting can be 25x), and whether the quality gap on your specific task is acceptable.

5. Quantization and its tradeoffs: Quantization reduces model weight precision from float32 (4 bytes per parameter) to int8 (1 byte) or int4 (0.5 bytes), shrinking memory requirements by 4-8x. A 7B parameter model at float16 needs ~14GB VRAM; at int4 it needs ~3.5GB, making it runnable on consumer hardware. The quality penalty is typically 1-3% on benchmarks for int8 and 3-7% for int4. The dominant format is GGUF, used by llama.cpp and Ollama. Quantization is worth it when deployment constraints (edge devices, cost-sensitive infrastructure) outweigh the modest quality loss, and when the task is well-defined enough that the degradation can be measured against a golden eval set.
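The VRAM arithmetic in the answer above is a simple bytes-per-parameter calculation. A sketch (weights only — activations and the KV cache add real-world overhead on top of these figures):

```python
# Weight-memory arithmetic for quantization planning.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate GB needed to hold the model weights at a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

print(weight_memory_gb(7e9, "float16"))   # 14.0  -> ~14GB VRAM
print(weight_memory_gb(7e9, "int4"))      # 3.5   -> fits consumer GPUs
print(weight_memory_gb(70e9, "float16"))  # 140.0 -> two 80GB A100s minimum
```

The same arithmetic explains the infrastructure-planning mistake called out later in this module: proposing "just deploy Llama 70B" without checking that 70B at float16 needs roughly 140GB of weight memory alone.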

System Design Scenario

Challenge: "Design a document processing pipeline for a law firm that receives 5,000 scanned contracts per day. Each contract is 10-30 pages. The system must extract party names, dates, dollar amounts, and clause summaries. The data is highly confidential and cannot leave the firm's private cloud."

What a good answer covers: (1) Self-hosting is mandatory due to data sensitivity — a vision-capable open-weight model such as Llama 3.2 Vision served via vLLM on A100 GPUs, with Llama 3 70B or Phi-4 for text-only stages. (2) Use a multimodal approach: render each page as an image and send it to the vision-capable model rather than relying on brittle OCR + text extraction pipelines. (3) For high throughput at 5,000 docs × 20 pages = 100K pages/day, use a two-stage architecture: a fast SLM (Phi-4 quantized) for initial classification and field extraction, escalating only ambiguous pages to the larger model. (4) Build a golden eval set of 200 manually-verified contracts to measure extraction accuracy. (5) Consider ColPali-style page-image embeddings for a retrieval layer that lets lawyers search across the contract corpus by meaning rather than keyword.

Common Mistakes

  • Defaulting to the largest model for every task. Using GPT-4o or Claude Opus for simple classification or extraction tasks wastes money and adds latency. Many production tasks are better served by GPT-4o-mini, Gemini Flash, or a fine-tuned 7B model — the key is measuring quality on your specific task rather than assuming bigger is better.
  • Ignoring parameter count when planning infrastructure. A 70B model at float16 requires ~140GB of VRAM — that is two A100 80GB GPUs minimum. Candidates who propose "just deploy Llama 70B" without accounting for hardware requirements, quantization strategy, or inference throughput targets reveal a gap between theory and production readiness.
  • Treating multimodal as text extraction plus a language model. The value of native multimodal models is that they process layout, figures, and text together in a unified representation. Candidates who describe a pipeline of "OCR the document, then send the text to an LLM" are missing the architectural shift — direct image input via ViT encoders handles tables, charts, and mixed-format pages far more reliably than text-only extraction.
Previous Module
01 · Foundations of GenAI
Next Module
03 · APIs for LLMs
Phase: Foundations → API Access