Fine-Tuning vs Prompting — When to Use Which
Plain Language
The most important thing to understand about fine-tuning is that it is not the first tool you should reach for. A large fraction of developers who decide they need to fine-tune a model have not yet exhausted simpler options. Prompt engineering is free, fast to iterate on, and reversible. Retrieval-Augmented Generation (RAG) can inject facts dynamically at query time without any training cost. Fine-tuning, by contrast, requires compute time to train, engineering time to prepare data and run experiments, and produces a static artifact that cannot be easily updated without re-training. Before committing to fine-tuning, you should be able to articulate clearly why prompting alone is insufficient for your use case.
The most useful diagnostic question is: is this a knowledge problem or a behavior problem? A knowledge problem means the model does not know some facts — the contents of your internal database, your company's specific product catalog, the latest regulatory changes in your industry. Knowledge problems are almost always better solved by RAG, which injects the relevant facts into the context window at the moment of each query. The model already knows how to reason; you are just supplying the data it needs. Adding new facts through fine-tuning is expensive, slow to update, and inferior to RAG for factual recall because the model cannot cite its sources or distinguish training-time knowledge from context-window knowledge.
A behavior problem, on the other hand, means the model already knows the relevant information but does not act the way you want. It uses the wrong tone, outputs in the wrong format, doesn't know your company's specialized terminology, keeps adding unnecessary caveats, or fails to stay in character for your application. These are behavioral traits that are difficult to reliably encode in a prompt alone, especially when you need them to hold across millions of diverse requests. Behavior problems are where fine-tuning genuinely shines. Fine-tuning "bakes in" the desired behavior at the weight level, so it persists regardless of what the user says in the prompt.
There are several specific scenarios where fine-tuning reliably wins over prompting. The first is consistent output format: if your application requires structured JSON with specific field names and nesting, a system prompt can describe the format but a fine-tuned model just produces it, without needing format instructions in every call. The second is specialized vocabulary: medical, legal, and financial domains have terminology so precise that even excellent base models make subtle vocabulary errors. Fine-tuning on domain-appropriate examples fixes this. The third is latency and cost: a fine-tuned 7B parameter model can match the quality of a prompted GPT-4-class model on a narrow task, but is 10-100x cheaper per token and has dramatically lower latency. When you are making millions of calls per day, this arithmetic matters enormously.
Fine-tuning is expensive upfront. You pay compute costs for training, engineering time for data preparation and evaluation, and iteration time as you experiment with hyperparameters and data quality. However, this cost amortizes across inference calls. If your fine-tuned model removes 500 tokens of system prompt instructions from every call, and you make ten million calls per month, you are saving five billion tokens of input per month. The training cost becomes trivial compared to the cumulative inference savings at that scale. Fine-tuning therefore has a break-even point that depends on your call volume — it is economically rational at scale, but potentially wasteful for low-volume applications.
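This break-even arithmetic is easy to make concrete. A minimal sketch, where the training cost, tokens saved, and per-token price are all assumed figures for illustration:

```python
def break_even_calls(training_cost_usd, tokens_saved_per_call, price_per_m_input_tokens):
    """Number of calls after which the one-time training cost is repaid
    by the shorter prompt on every call."""
    saving_per_call = tokens_saved_per_call / 1_000_000 * price_per_m_input_tokens
    return training_cost_usd / saving_per_call

# Assumed: $2,000 of training, 500 prompt tokens saved per call,
# $1.00 per million input tokens.
calls = break_even_calls(2_000, 500, 1.00)
print(f"{calls:,.0f} calls to break even")  # -> "4,000,000 calls to break even"
```

At ten million calls per month, this assumed setup would pay for itself in under two weeks; at ten thousand calls per month it would take decades.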
Start with prompt engineering. If you hit a ceiling, try RAG. If you still have a behavior problem that persists across diverse prompts, then consider fine-tuning. The three approaches are not mutually exclusive — production systems often use all three together.
Deep Dive
The format-learning use case is arguably the most underappreciated reason to fine-tune. If your application needs the model to always output a very specific JSON schema — say, a structured clinical note with fields like `chief_complaint`, `assessment`, `plan`, and `icd10_codes` — you can describe this in a system prompt, but the model will occasionally get it wrong, especially for unusual inputs. You can add examples to the prompt (few-shot), but that costs tokens. Fine-tuning "burns in" the format at the weight level, so it is nearly always correct without any format instructions at all. This is not just a correctness improvement — it meaningfully reduces your system prompt length and therefore your per-call cost.
Domain vocabulary goes deeper than just using the right words. In medical fine-tuning, the goal is not just that the model knows the word "myocardial infarction" — a base GPT-4 knows that. The goal is that the model generates text in the register and style of a physician writing notes, uses abbreviations the way physicians do (q.d., b.i.d., PRN), distinguishes between attending note and nursing note conventions, and never uses lay language in a clinical context. This kind of deep domain fluency is genuinely hard to achieve through prompting alone and is exactly what fine-tuning on high-quality domain examples provides.
Understanding dataset size requirements is critical for planning. The good news is you need far fewer examples than people expect. Instruction tuning — teaching the model to follow a new format or respond in a particular style — needs as few as 100 to 1,000 high-quality examples. Significant behavior change, like shifting tone across a wide range of inputs or learning a niche domain, needs 5,000 to 50,000 examples. Pretraining-scale capability injection (adding a genuinely new language or skill that the model lacked) needs millions of examples and is essentially a different category of work. For most production fine-tuning, you are in the hundreds-to-thousands range, which is achievable by curating existing data or generating synthetic examples with a stronger model.
The distinction between full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) is fundamental to understanding the modern landscape. Full fine-tuning updates every weight in the model — all seven billion parameters in a 7B model, for instance. This requires storing the full model in GPU VRAM, storing all gradients (another full model's worth of memory), and storing optimizer states (two more full models' worth for Adam). A 7B model in FP16 requires 14 GB just to store — full fine-tuning requires 60-80 GB of VRAM total, putting it beyond any consumer GPU and into A100-class data-center hardware. PEFT methods like LoRA update only a tiny fraction of the parameters, making fine-tuning feasible on consumer hardware. Full fine-tuning is still done at large organizations with the resources to do it, but for most practitioners PEFT is the practical choice.
Catastrophic forgetting is one of the most important failure modes in fine-tuning that practitioners routinely encounter and are surprised by. When you fine-tune a model intensively on a narrow domain, it can "forget" general capabilities it had before. A model fine-tuned aggressively on medical notes might become worse at general reasoning, coding, or even basic conversational tasks. This happens because gradient updates from your narrow training data push the weights in directions that optimize for your task but happen to disrupt patterns the model had learned for other capabilities. The primary mitigation is data mixing: include a fraction of general-purpose instruction-following examples in your training dataset alongside your domain-specific examples, so the model maintains its breadth while acquiring your specialized behaviors.
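Data mixing is straightforward to implement. A minimal sketch, where the 20% general-data share is an assumed starting point rather than a recommendation from the text:

```python
import random

def mix_datasets(domain_examples, general_examples, general_fraction=0.2, seed=0):
    """Blend general instruction data into a domain dataset so the model
    keeps its breadth. general_fraction is the share of the final mix
    drawn from the general pool (0.2 here is an assumed starting point)."""
    rng = random.Random(seed)
    n_general = round(len(domain_examples) * general_fraction / (1 - general_fraction))
    mixed = domain_examples + rng.sample(general_examples, min(n_general, len(general_examples)))
    rng.shuffle(mixed)
    return mixed

domain = [f"medical_{i}" for i in range(800)]
general = [f"general_{i}" for i in range(1000)]
mixed = mix_datasets(domain, general)
print(len(mixed))  # 800 domain + 200 general = 1000 examples
```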
Establishing an evaluation baseline before fine-tuning is a discipline that is easy to skip and painful to regret. Before you write a single line of training code, define your evaluation set and run your baseline model on it. This evaluation should consist of real examples drawn from your actual use case, not synthetic test cases. It should measure what actually matters: output format correctness, factual accuracy on your domain, user satisfaction ratings, downstream task performance. Without a baseline, you cannot know if your fine-tuned model is actually better or just different. Many fine-tuning projects have shipped models that scored better on intuitive vibe-checks but were quantitatively worse on the metrics that mattered.
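One concrete way to start such a baseline is an automatable format-correctness metric, run against the un-tuned model's outputs before any training. A sketch, reusing the clinical-note fields from the earlier example as a hypothetical schema:

```python
import json

def format_accuracy(outputs, required_fields):
    """Fraction of outputs that parse as JSON and contain every required
    field -- one concrete, automatable 'format correctness' metric."""
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
            ok += all(field in obj for field in required_fields)
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

# Score the baseline model's raw outputs before writing any training code.
baseline_outputs = [
    '{"chief_complaint": "chest pain", "assessment": "...", "plan": "..."}',
    'Sure! Here is the note: the chief complaint is chest pain.',  # not JSON
    '{"chief_complaint": "cough"}',                                # missing fields
]
print(format_accuracy(baseline_outputs, ["chief_complaint", "assessment", "plan"]))
```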
There are cases where you should explicitly not fine-tune, and recognizing them saves significant wasted effort. If prompt engineering achieves 85-90% of your goal, the remaining gap often does not justify the cost and complexity of fine-tuning — the effort is better spent improving your prompt, adding examples, or refining your application logic. Similarly, if your data distribution will change significantly over time, a fine-tuned model becomes a liability: it is "frozen" to its training distribution and will degrade as the world changes around it. RAG systems can be updated by refreshing their knowledge base; fine-tuned models require re-training. For rapidly evolving domains — news, market data, product catalogs — RAG is almost always the better long-term choice.
The question of OpenAI fine-tuning versus self-hosted fine-tuning deserves explicit treatment. OpenAI's hosted fine-tuning is by far the most accessible option: you upload a JSONL file, click a button, and receive a custom model endpoint with zero infrastructure management. The tradeoff is that you have no access to the model weights, cannot use techniques like LoRA or QLoRA, cannot export the model or run it offline, and are permanently dependent on OpenAI's pricing and service. Self-hosted fine-tuning with open-weight models (Llama, Mistral, Qwen) gives you full weight access, portability, and the ability to run on your own infrastructure — at the cost of managing that infrastructure yourself. The right choice depends on your technical resources, data privacy requirements, and long-term strategic considerations.
| Approach | When to Use | Cost Profile | Latency |
|---|---|---|---|
| Prompt Engineering | Default starting point; behavior already possible, just needs direction | Near zero | Higher (long prompts) |
| RAG | Knowledge gap; facts change frequently; need citations | Low | Moderate (retrieval overhead) |
| Fine-Tuning | Persistent behavior change; format; domain vocabulary; cost at scale | High upfront, low per-call | Low (smaller model) |
| All Three | Production systems at scale | Highest upfront | Optimal |
LoRA (Low-Rank Adaptation)
Plain Language
Imagine you have a world-class chess grandmaster — someone who has spent thirty years developing deep intuition about the game. Now imagine you want to teach this person a new opening strategy that they have never studied. You do not need to erase all their existing chess knowledge and re-teach them everything from scratch. You just need to add a thin layer of new knowledge on top of their existing expertise. Their entire foundational understanding stays intact; you are only updating a small, specific part of how they think about certain positions. LoRA (Low-Rank Adaptation) works exactly the same way for neural networks.
Instead of updating all seven billion parameters in a 7B model during fine-tuning, LoRA introduces small additional matrices "alongside" the existing weight matrices and only trains those small additions. The original weights are frozen — they never change. Only the LoRA adapter matrices are updated during training. At the end of training, you have the original base model (unchanged) plus a set of small adapter matrices that encode everything the model learned during fine-tuning. These two components together produce the fine-tuned behavior.
The "low-rank" part of the name refers to a mathematical insight that makes this work so well. When you train a model, the actual changes to the weight matrices have a much simpler underlying structure than the full matrices themselves. It turns out that even though weight matrices might be enormous (a single attention weight matrix might be 4096 by 4096), the meaningful updates during fine-tuning can be approximated by the product of two much smaller matrices. This is the "low-rank decomposition" — the change has low rank (meaning it lives in a much lower-dimensional space than the full matrix), and LoRA exploits this mathematical structure to be extremely parameter-efficient.
The practical benefit is that LoRA adapters are tiny. A full 7B model in 16-bit precision weighs about 14 gigabytes. A LoRA adapter for that model might be 50-200 megabytes — less than 2% of the model size. This has a beautiful consequence: you can maintain one base model and a library of many different LoRA adapters for different tasks, domains, or personas. Need a customer service assistant? Swap in the customer service LoRA. Need a code reviewer? Swap in the code review LoRA. One model, many specialists.
The result of this efficiency is truly democratizing. Before LoRA, fine-tuning a 7B model required a cluster of enterprise GPUs — tens of thousands of dollars in hardware or cloud costs. With LoRA, you can fine-tune a 7B model on a single consumer GPU like an RTX 3090 or RTX 4090 (24 GB VRAM) in a matter of hours. This shifted fine-tuning from being an enterprise-only capability to something any serious developer can do at home or on a single cloud instance. The Alpaca, Vicuna, and countless subsequent open-source fine-tuned models were all made possible by LoRA.
Deep Dive
The mathematical formulation of LoRA is elegant. For any weight matrix W₀ ∈ ℝ^(d×k) in the original model, instead of learning the full update ΔW ∈ ℝ^(d×k) directly (which would have d×k parameters), LoRA decomposes the update as ΔW = BA, where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with r ≪ min(d, k). The modified forward pass becomes h = W₀x + (α/r)·BAx, where α is a scaling hyperparameter. B is initialized to all zeros and A to random Gaussian values, so the LoRA contribution starts at zero and grows during training without disrupting the pretrained initialization.
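The forward pass and initialization scheme can be checked with tiny matrices. A pure-Python sketch (no framework dependencies), showing that the zero-initialized B makes the adapter a no-op at the start of training:

```python
import random

def matvec(M, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B(A x): frozen base plus scaled low-rank update."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

random.seed(0)
d, k, r = 4, 4, 2
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]    # frozen base
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # Gaussian init
B = [[0.0] * r for _ in range(d)]                                  # zero init

x = [1.0, 2.0, 3.0, 4.0]
# With B all-zero, the adapter contributes nothing at step 0:
assert lora_forward(W0, A, B, x, alpha=16, r=r) == matvec(W0, x)
```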
The rank r is the most important LoRA hyperparameter. It controls the dimensionality of the low-rank decomposition. Typical values are 4, 8, 16, 32, or 64. A rank of 4 means B has 4 columns and A has 4 rows — very few parameters, but also limited expressive power. A rank of 64 gives much more expressive power but proportionally more parameters. The empirical sweet spot for most tasks is r=8 or r=16. Higher ranks are not always better — if your task has a simple behavioral change, low rank is sufficient and less likely to overfit. If your task requires learning genuinely complex new behaviors, higher rank may be necessary.
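The parameter savings are easy to count for the 4096-by-4096 attention matrix mentioned earlier, at the r=8 sweet spot:

```python
def lora_param_counts(d, k, r):
    """Parameters in a full d-by-k update vs. its rank-r factors B and A."""
    return d * k, d * r + r * k

full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, f"{lora / full:.4%}")  # 16777216 65536 0.3906%
```

The low-rank factors carry under half a percent of the parameters of the full update.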
The alpha (α) scaling factor controls the magnitude of the LoRA update relative to the pretrained weights. A common convention is to set α = r (giving α/r = 1, so the update is unscaled) or α = 2r (giving α/r = 2, amplifying the LoRA contribution). In practice, α is often held fixed while r is varied, which is equivalent to adjusting the effective learning rate of the LoRA parameters. Some practitioners set α to a fixed value like 16 regardless of r, treating it as a learning rate multiplier. The key insight is that α and the optimizer learning rate interact: a higher α with a lower learning rate can be equivalent to a lower α with a higher learning rate.
Choosing which modules to apply LoRA to significantly affects both model quality and parameter count. The original LoRA paper applied it only to the query and value projection matrices in the self-attention layers (q_proj and v_proj). Subsequent work found that applying LoRA to all linear layers — including the key projection, output projection, and the feed-forward (MLP) layers — consistently improves results. Modern defaults typically target all linear layers. More modules means more parameters and more expressive power, at the cost of slightly larger adapter files and longer training time.
The HuggingFace PEFT library makes LoRA implementation straightforward. You define a `LoraConfig` specifying your rank, alpha, target modules, and dropout, then wrap any HuggingFace model with `get_peft_model()`. The resulting model behaves identically to the original for inference but only updates LoRA parameters during training:

```python
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                      # rank
    lora_alpha=32,             # alpha scaling
    target_modules=[           # which layers get adapters
        "q_proj", "v_proj",
        "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 7,284,015,104 || trainable%: 0.5757
```
The memory arithmetic is what makes LoRA transformative. A 7B model in FP16 requires 14 GB just to store the weights. Full fine-tuning adds gradients (another 14 GB) plus Adam optimizer states (another 28 GB) for a total around 56 GB. With LoRA, the base model is frozen and loaded without requiring gradients (the frozen parameters don't need gradient computation). Only the LoRA adapter parameters need gradients and optimizer states — and those adapters might be only 40 million parameters, requiring less than 300 MB for optimizer states. Total VRAM for LoRA fine-tuning of a 7B model: roughly 16-18 GB, fitting comfortably on a single 24 GB GPU.
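The arithmetic in this paragraph can be reproduced in a few lines. A sketch that follows the text's own accounting (weights, gradients, and two Adam moment tensors, all at 16-bit; activations and framework overhead excluded):

```python
def vram_estimate_gb(total_params, trainable_params, bytes_per_value=2):
    """Weights + gradients + two Adam moment tensors, all at 16-bit,
    mirroring the text's accounting. Activations and overhead excluded."""
    weights = total_params * bytes_per_value
    grads = trainable_params * bytes_per_value
    adam = trainable_params * 2 * bytes_per_value  # two moment tensors
    return (weights + grads + adam) / 1e9

print(vram_estimate_gb(7e9, 7e9))   # full fine-tuning of a 7B model: 56.0
print(vram_estimate_gb(7e9, 40e6))  # LoRA with ~40M adapter params: 14.24
```

The gap between 14.24 GB and the quoted 16-18 GB is the activations and overhead this sketch deliberately ignores.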
At inference time, you have two options for handling the LoRA adapter. The merged approach computes W = W₀ + BA and stores the result as a regular weight matrix. This means zero latency overhead — the fine-tuned model is just a normal model. The separate approach keeps the base model and adapter distinct, adding the BA contribution in the forward pass. This approach allows hot-swapping adapters without reloading the base model, which is powerful when you need multiple specialized models serving different users or tasks simultaneously. Systems have been built that load one large base model into GPU VRAM once and serve hundreds of LoRA-adapted variants simultaneously by swapping adapters between requests.
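The equivalence of the two options is just distributivity: (W₀ + BA)x = W₀x + BAx. A tiny pure-Python check (the α/r scale is folded into B here for brevity):

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def madd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W0 = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weights (d x k)
B = [[0.5], [0.25]]             # d x r, with r = 1
A = [[2.0, -1.0]]               # r x k
x = [[1.0], [1.0]]              # input as a column vector

# Separate: keep the adapter distinct, add its contribution per request.
separate = madd(matmul(W0, x), matmul(B, matmul(A, x)))
# Merged: fold BA into the base weights once, then do a single matmul.
merged = matmul(madd(W0, matmul(B, A)), x)
assert separate == merged  # identical outputs; merged adds zero latency
```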
AdaLoRA (Adaptive LoRA) extends the basic LoRA idea by automatically allocating rank differently across layers. Not all weight matrices are equally important for a given task — some layers encode information that needs significant adaptation while others need very little. AdaLoRA starts with a larger rank budget and iteratively prunes unimportant singular values during training, concentrating parameters where they matter most. This produces better task performance at the same total parameter count as standard LoRA.
DoRA (Weight-Decomposed Low-Rank Adaptation) is a more recent technique that decomposes the pretrained weight matrix into its magnitude and direction components (inspired by how neural networks naturally learn). The magnitude component is fine-tuned directly while the directional component is fine-tuned using LoRA. This decomposition gives the model more expressiveness for the same parameter budget — empirically, DoRA achieves better results than LoRA at identical parameter counts and is particularly effective for multimodal fine-tuning tasks.
LoRA architecture: the frozen pretrained weight W₀ (gray, dashed) passes the input through unchanged, while the trained low-rank matrices A and B (cyan) add a small adaptation. Only A and B are updated during training.
QLoRA
Plain Language
QLoRA stands for Quantized LoRA. If LoRA was the technique that brought fine-tuning to single consumer GPUs for 7B models, QLoRA was the technique that brought fine-tuning of frontier-scale models to teams without supercomputer budgets. The core idea is to combine LoRA with 4-bit quantization: compress the base model weights to use only 4 bits per value (instead of the usual 16), which reduces memory requirements by roughly 4x. Then apply LoRA on top of this highly compressed model.
The obvious question is: if the weights are stored as 4-bit integers, how can you possibly train anything? Gradients cannot flow through integer operations because integers are not differentiable. QLoRA's clever solution is to use 4-bit storage but 16-bit computation. When the forward pass needs to use a weight matrix, it dequantizes that block from 4-bit integers back to 16-bit BrainFloat (BF16) values on-the-fly. The computation happens in full 16-bit precision, gradients flow through normally, and then the weights are re-quantized back to 4-bit for storage. This happens transparently; from the training loop's perspective, everything looks like normal 16-bit training.
The memory numbers are staggering. A 70B parameter model in standard FP16 requires approximately 140 GB of VRAM — that means a minimum of two A100 80GB GPUs (160 GB total), at roughly $30,000 per GPU. With QLoRA, the 70B model in 4-bit takes about 35 GB for the weights, plus LoRA overhead, fitting on a single A100 80GB GPU that costs around $10,000. For a 7B model, QLoRA brings the requirement from 16 GB (already manageable with LoRA) down to about 6 GB, which fits on a mid-range consumer GPU like an RTX 3060.
The impact on the research and startup ecosystem was immediate and profound. Before QLoRA, only well-funded teams at major AI companies or large research universities could experiment with fine-tuning models at the frontier scale. After the QLoRA paper was published in May 2023, researchers at universities with modest GPU budgets, startups with a single cloud GPU, and hobbyists with good consumer hardware could all run meaningful fine-tuning experiments on 13B, 34B, and even 70B models. This democratization of large-scale fine-tuning is one of the most significant shifts in the open-source AI ecosystem.
Deep Dive
The original QLoRA paper introduced three key innovations, each of which independently reduces memory usage. The first is NF4 (4-bit NormalFloat) quantization. Standard INT4 quantization divides the range of values into 16 equally-spaced buckets. But neural network weights are not uniformly distributed — they follow a roughly Gaussian (normal) distribution centered near zero. NF4 instead places quantization levels such that each level covers an equal mass of the normal distribution. This means more precision near zero (where most weights are) and less precision in the tails (where weights are rare). In practice, NF4 loses less information than INT4 for the same number of bits, because it is information-theoretically optimal for normally distributed weights.
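The idea behind NF4 can be sketched with the standard library's `statistics.NormalDist`. This is a simplified illustration, not the paper's exact construction (real NF4 additionally pins an exact zero level and treats the tails asymmetrically):

```python
from statistics import NormalDist

def normal_float_levels(bits=4):
    """Place 2**bits quantization levels at evenly spaced quantiles of
    N(0, 1), rescaled to [-1, 1] -- equal probability mass per bucket."""
    n = 2 ** bits
    nd = NormalDist()
    quantiles = [(i + 0.5) / n for i in range(n)]  # strictly inside (0, 1)
    levels = [nd.inv_cdf(q) for q in quantiles]
    scale = max(abs(levels[0]), abs(levels[-1]))
    return [v / scale for v in levels]

levels = normal_float_levels()
gaps = [b - a for a, b in zip(levels, levels[1:])]
# Fine spacing near zero (where most weights live), coarse in the tails:
print(f"center gap {gaps[7]:.3f} vs tail gap {gaps[0]:.3f}")
```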
The second innovation is double quantization. Quantization requires storing "quantization constants" — the scaling factors that map between 4-bit integers and their float values. These constants themselves require memory. QLoRA quantizes these constants a second time (to 8-bit), reducing the overhead from the quantization metadata by roughly another 0.37 bits per parameter. In a 70B model this saves approximately 3 GB of VRAM — not huge, but meaningful when every gigabyte counts.
The third and perhaps most practically important innovation is the Paged AdamW optimizer. The Adam optimizer stores two momentum tensors for each trainable parameter. Even with LoRA (where trainable parameters are a tiny fraction of total parameters), processing long sequences produces occasional memory spikes that can cause out-of-memory errors. Paged AdamW uses NVIDIA's unified memory system to automatically page optimizer states from GPU VRAM to CPU RAM when the GPU fills up, and pages them back when needed. This acts like virtual memory for the GPU, preventing OOM crashes without requiring manual memory management.
The bitsandbytes library by Tim Dettmers (the lead author of QLoRA) provides the underlying 4-bit quantization implementation. You enable QLoRA by passing a `BitsAndBytesConfig` to the model loading function:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA config: 4-bit quantization with NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in BF16
    bnb_4bit_use_double_quant=True,          # double quantization
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training (cast LayerNorm etc. to float32)
model = prepare_model_for_kbit_training(model)

# Apply LoRA on top of the 4-bit quantized model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 20,971,520 || all params: 8,051,249,152 || trainable%: 0.2604
```
In terms of training speed, QLoRA is approximately 20-30% slower than pure LoRA (without 4-bit quantization) because of the overhead from dequantizing weights during each forward pass and re-quantizing after gradient updates. This is a real cost, but the memory savings are so dramatic that for large models the tradeoff is almost always worthwhile. The alternative to accepting 20-30% slower training is not being able to train at all on available hardware. For small models (7B), LoRA without quantization is preferred when VRAM allows because it is faster. For large models (13B+), QLoRA is often the only viable option on single-GPU setups.
An important nuance: QLoRA only quantizes the base model weights. The LoRA adapter parameters — B and A matrices — are always stored and computed in full 16-bit or 32-bit precision. This is correct because these are the parameters being actively trained and their gradients need to be precise. The heavy quantization is only applied to the frozen pretrained parameters, which merely need to provide a stable forward pass approximation. This design is what makes QLoRA work: train high-precision adapters on top of a low-precision frozen base.
RLHF (Reinforcement Learning from Human Feedback)
Plain Language
To understand why RLHF exists, you need to understand what happens during pretraining. When a model is pretrained on trillions of tokens of internet text, it learns to predict likely next tokens. This makes it extraordinarily knowledgeable about the world, but it does not make it "helpful." A pretrained model asked a question might complete the text with a plausible continuation — perhaps another question, perhaps a tangent, perhaps an answer buried in a sea of irrelevant text. It is a text-completion engine, not an assistant.
The first fix for this is Supervised Fine-Tuning (SFT): take the pretrained model and fine-tune it on thousands of examples of (user question, ideal assistant response) pairs, written by skilled human annotators. Now the model knows the format: user says something, model responds helpfully. This is a significant improvement. But SFT has a ceiling: it can only teach behaviors that are directly demonstrated in training data, and human annotators writing ideal responses have their own blind spots and inconsistencies.
RLHF goes further by using human preferences rather than human demonstrations. Instead of asking annotators to write perfect answers (which is hard), you ask them a simpler question: "Here are two answers to this question — which is better?" This is much easier for humans to do reliably. A model trained on these preference signals learns to produce outputs that humans genuinely prefer, capturing subtleties that are hard to specify explicitly — things like appropriate hedging when uncertain, avoiding unnecessary verbosity, being helpful without being sycophantic.
Think of the RLHF process as a teaching relationship. The student (the model) answers questions. A teacher (the reward model, trained on human preferences) grades each answer with a score. The student then practices more, trying to get higher scores from the teacher. Over many rounds, the student learns what kinds of answers the teacher values. The teacher's judgment is a proxy for human preferences — not perfect, but systematically better than no feedback at all.
RLHF is why ChatGPT felt dramatically different from simply querying GPT-3 via API, even though the underlying architecture was similar. The base GPT-3 was knowledgeable but erratic — sometimes helpful, sometimes veering into completions that were technically "likely text" but useless or offensive. ChatGPT's RLHF training shaped those capabilities into something that reliably tried to be helpful, that admitted uncertainty, that declined harmful requests while explaining why, and that engaged conversationally. RLHF is what turns a knowledge base into an assistant.
Deep Dive
Stage 1 — Supervised Fine-Tuning (SFT): The RLHF pipeline begins with a standard instruction fine-tuning step. Human contractors write high-quality (prompt, ideal response) pairs covering a diverse range of tasks — answering questions, writing tasks, coding tasks, summarization, multi-turn dialogue. The pretrained base model is fine-tuned on this data using standard cross-entropy loss. The result is the SFT model — a well-behaved instruction follower. This model becomes the starting point for the rest of the RLHF process and also serves as the "reference model" (π_ref) in later stages.
Stage 2 — Reward Model Training: This stage creates a model that can score the quality of any (prompt, response) pair. The process starts with data collection: show human raters multiple responses to the same prompt (typically 4-9 responses, generated by the SFT model with varied sampling parameters) and ask them to rank the responses from best to worst. The reward model is initialized from the SFT model and trained to predict these rankings. The RM's final layer is replaced with a linear head that outputs a scalar reward score. Training uses a pairwise ranking loss: for every pair of responses (chosen yw, rejected yl), the loss pushes the chosen response's score higher and the rejected response's score lower.
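The pairwise ranking loss has a one-line form, often written as -log σ(r(y_w) − r(y_l)). A minimal sketch of it for a single comparison:

```python
import math

def reward_ranking_loss(score_chosen, score_rejected):
    """Pairwise (Bradley-Terry) loss: -log(sigmoid(r_w - r_l)).
    Small when the chosen response already outscores the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(reward_ranking_loss(2.0, -1.0))   # correct ranking -> small loss
print(reward_ranking_loss(-1.0, 2.0))   # inverted ranking -> large loss
```

Minimizing this loss over many human comparisons pushes chosen scores up and rejected scores down, which is exactly the training signal described above.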
Stage 3 — PPO (Proximal Policy Optimization): This is the reinforcement learning step. The SFT model is treated as a "policy" π_θ — an agent that takes a prompt as state and generates a response as action. The reward model evaluates each response and returns a scalar reward signal. PPO updates the policy's weights to maximize expected reward. Crucially, a KL penalty is added to the reward: reward_total = reward_RM − β · KL(π_θ ‖ π_ref). This term penalizes the policy for diverging too far from the SFT model, which prevents reward hacking — the tendency of RL models to find degenerate solutions that score high with the RM while producing useless or incoherent text.
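The KL-penalized reward is simple to compute once you have per-token log-probabilities from both the policy and the reference model. A sketch, with β = 0.1 and all log-prob values assumed for illustration:

```python
def penalized_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.1):
    """Sequence-level RLHF reward: RM score minus beta times a KL estimate
    built from per-token log-probs. beta=0.1 is an assumed value."""
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return rm_score - beta * kl_estimate

# A response the RM likes (score 1.0) whose tokens have drifted far from
# the reference model loses part of its reward to the KL penalty:
policy_lp = [-0.1, -0.2, -0.1]  # policy is confident in its own tokens
ref_lp = [-2.0, -1.5, -2.5]     # reference finds them unlikely -> large KL
print(penalized_reward(1.0, policy_lp, ref_lp))  # 1.0 - 0.1 * 5.6, about 0.44
```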
Reward hacking is RLHF's central challenge. The reward model is an imperfect proxy for human preferences — it was trained on a finite dataset and has its own biases and failure modes. If you optimize too aggressively against an imperfect reward model, the policy will find behaviors that exploit the RM's weaknesses. Classic reward hacking behaviors include: producing very long verbose answers (because annotators sometimes rated thoroughness positively even when content quality was low), excessive sycophancy (agreeing with the user's stated opinion), or confident-sounding wrong answers (because confident tone was sometimes rewarded). Mitigations include: diverse RM training data, training an ensemble of reward models and using their minimum or average, the KL penalty already mentioned, and continuous red-teaming to find failure modes.
Constitutional AI (Anthropic) is an influential alternative that partially replaces human feedback with AI feedback. Instead of relying on human raters to score responses, you provide the model with a written "constitution" — a list of principles like "be helpful," "be harmless," "be honest," "avoid stereotypes." The model is then used to critique and revise its own responses according to these principles (the critique-and-revision step), and the preference labels for the RL stage are generated by the model itself (RLAIF — RL from AI feedback). This approach scales more cheaply than pure human feedback and produces models with more consistent value alignment, but depends critically on the quality and comprehensiveness of the constitution.
The compute cost of RLHF is substantially higher than SFT. During PPO training, you must run the policy LLM to generate responses, run the reward model to score those responses, and run the reference SFT model to compute the KL penalty — all during the same training step. This requires holding multiple models in VRAM simultaneously and is typically 3-10x more expensive than SFT in terms of GPU hours. For this reason, RLHF is almost exclusively done by well-resourced organizations, and simpler alternatives like DPO have become popular for resource-constrained settings.
The TRL (Transformer Reinforcement Learning) library from HuggingFace implements the full RLHF
pipeline. The PPOTrainer class handles the RL training loop, including generating
responses, computing rewards with your reward model, computing the KL penalty against a reference
model, and running PPO optimization steps. The library also provides RewardTrainer
for the reward model training stage.
The foundational paper for applying RLHF to LLMs is InstructGPT (Ouyang et al., 2022), "Training language models to follow instructions with human feedback." This paper showed that a 1.3B parameter RLHF model was rated as preferable by human evaluators to a 175B GPT-3 model without RLHF — demonstrating that alignment training can outperform raw scale. The paper's methodology — SFT followed by reward model training followed by PPO — became the template that almost all subsequent alignment work has built on or departed from.
RLHF three-stage pipeline: SFT creates the reference model, the Reward Model learns human preferences from ranked comparisons, and PPO updates the policy to maximize reward while a KL penalty keeps outputs close to the reference model.
DPO (Direct Preference Optimization)
Plain Language
RLHF works, but it is complicated. You need to train three separate models: an SFT model, a reward model, and then run PPO to update the policy using the reward model's feedback. PPO is notoriously finicky — it has many hyperparameters, can be numerically unstable, and requires careful tuning to avoid reward hacking or policy collapse. Each stage introduces its own failure modes and requires its own evaluation. For a large organization with dedicated ML infrastructure teams, this is manageable. For everyone else, it is a significant barrier.
DPO (Direct Preference Optimization, Rafailov et al., 2023) solves the same problem as RLHF in a dramatically simpler way. The key insight is that you can rearrange the RLHF mathematical objective to show that the optimal policy under RLHF can be expressed directly in terms of the preference data — without ever training a separate reward model or running a reinforcement learning loop. DPO is a single supervised learning step that takes (prompt, chosen response, rejected response) triplets and directly trains the model to prefer the chosen response over the rejected one.
The practical impact is substantial. What previously required three training stages, multiple models in VRAM simultaneously, and days of PPO tuning can now be done with a single training run lasting hours. DPO is now the dominant alignment technique for fine-tuning open-weight models. Nearly all of the fine-tuned open-source models released since mid-2023 — from Llama variants to Mistral variants to countless community models — use DPO or one of its successors rather than RLHF.
The main practical limitation of DPO compared to RLHF is that it is an offline algorithm — it trains on a fixed dataset of preference pairs collected before training. RLHF with PPO is online: the policy generates responses, gets scored, and is updated, all in a loop. This means RLHF can potentially keep improving as the model improves, while DPO's quality is bounded by the quality of the pre-collected preference dataset. For most fine-tuning scenarios this is not a significant limitation, but it matters at the scale where OpenAI and Anthropic operate with continuous human feedback loops.
Deep Dive
The mathematical insight behind DPO is that under the standard RLHF objective, the optimal policy has an analytically computable form. Specifically, πr(y|x) ∝ πref(y|x) · exp(r(x,y)/β), where r(x,y) is the reward and β is the KL penalty coefficient. This means the reward function can be expressed in terms of the optimal policy and the reference policy: r(x,y) = β · log(πr(y|x) / πref(y|x)) + β · log Z(x). By substituting this expression into the Bradley-Terry preference model (which defines the probability that yw is preferred over yl), you get a loss function that depends only on the policy and reference model — no reward model needed.
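Carrying that substitution through the Bradley-Terry model gives the DPO objective; note that the intractable β·log Z(x) term cancels when the chosen and rejected rewards are subtracted:

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
  \right]
```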
The DPO loss function is:
# DPO loss (conceptual PyTorch form)
# y_w = chosen response, y_l = rejected response
# policy_logp_* / ref_logp_* = summed token log-probs of each response
#   under the trainable policy (pi_theta) and the frozen SFT reference (pi_ref)
# beta = KL coefficient (typically 0.1 - 0.5)
import torch.nn.functional as F

log_ratio_chosen = policy_logp_chosen - ref_logp_chosen
log_ratio_rejected = policy_logp_rejected - ref_logp_rejected
# DPO objective: maximize the gap between chosen and rejected log-ratios
loss = -F.logsigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
The β parameter in DPO plays the same role as the KL coefficient in RLHF: it controls how aggressively the model can diverge from the reference SFT model. A lower β (e.g., 0.1) allows more aggressive optimization toward the preference signal, potentially achieving larger behavioral changes but with more risk of degrading general capabilities. A higher β (e.g., 0.5) keeps the model closer to the SFT reference, producing more conservative but safer updates. Common defaults are β = 0.1 for general alignment and β = 0.2-0.5 when you want to preserve the SFT model's capabilities more carefully.
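A quick numeric check of how β interacts with the log-ratio margin, assuming the per-example loss -log σ(β · margin) described above (toy numbers):

```python
import math

def dpo_loss(margin: float, beta: float) -> float:
    """-log sigmoid(beta * margin), the per-example DPO loss.

    margin = (chosen log-ratio) - (rejected log-ratio)."""
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With zero margin the loss is log(2) regardless of beta.
assert abs(dpo_loss(0.0, 0.1) - math.log(2)) < 1e-12

# For the same positive margin, a larger beta drives the loss toward
# zero faster, i.e. it treats a given log-ratio gap as more decisive.
loss_low_beta = dpo_loss(2.0, beta=0.1)   # ~0.598
loss_high_beta = dpo_loss(2.0, beta=0.5)  # ~0.313
assert loss_high_beta < loss_low_beta
```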
The dataset format for DPO is straightforward JSONL with three fields:
# DPO dataset format (one example per line)
{
  "prompt": "Explain why the sky is blue.",
  "chosen": "The sky appears blue due to Rayleigh scattering...",
  "rejected": "Because molecules scatter short wavelengths, blue."
}
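Because the format is so simple, it is worth validating every line before training. A minimal checker (a hypothetical helper, not part of any library) might look like:

```python
import json

REQUIRED_FIELDS = {"prompt", "chosen", "rejected"}

def validate_dpo_line(line: str) -> list[str]:
    """Return a list of problems with one JSONL line (empty list = valid)."""
    problems = []
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    for field in REQUIRED_FIELDS & record.keys():
        if not isinstance(record[field], str) or not record[field].strip():
            problems.append(f"field '{field}' is empty or not a string")
    if "chosen" in record and "rejected" in record \
            and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems

good = '{"prompt": "Why is the sky blue?", "chosen": "Rayleigh scattering...", "rejected": "No idea."}'
bad = '{"prompt": "Why?", "chosen": "Same text", "rejected": "Same text"}'
assert validate_dpo_line(good) == []
assert validate_dpo_line(bad) == ["chosen and rejected are identical"]
```

Identical chosen/rejected pairs are worth flagging explicitly: they carry no preference signal but still consume training budget.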
ORPO (Odds Ratio Preference Optimization, Hong et al., 2024) is a further simplification that eliminates the need for a reference model entirely. ORPO combines the SFT loss and the preference loss into a single training objective using odds ratios, which means you can go directly from a base pretrained model to an aligned instruction follower in one training step. This makes ORPO cheaper and simpler than DPO, though it requires your dataset to contain instruction-following examples as well as preference pairs. ORPO has become increasingly popular for fine-tuning smaller models where compute efficiency is paramount.
SimPO (Simple Preference Optimization, Meng et al., 2024) is another DPO variant that removes the reference model by using a length-normalized reward instead of log-probability ratios. SimPO empirically outperforms DPO on several alignment benchmarks and is simpler to implement because you only need a single model in memory rather than both the policy and reference model simultaneously.
The TRL DPOTrainer from HuggingFace handles the complete DPO training loop:
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Policy and frozen reference both start from the same SFT checkpoint
policy_model = AutoModelForCausalLM.from_pretrained("your_org/sft-model")
reference_model = AutoModelForCausalLM.from_pretrained("your_org/sft-model")
tokenizer = AutoTokenizer.from_pretrained("your_org/sft-model")

dpo_config = DPOConfig(
    beta=0.1,
    learning_rate=5e-7,  # DPO needs very low LR
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    output_dir="./dpo_output",
)
dataset = load_dataset("your_org/preference_dataset")
trainer = DPOTrainer(
    model=policy_model,          # model to train
    ref_model=reference_model,   # frozen SFT reference
    args=dpo_config,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
)
trainer.train()
Understanding when to prefer DPO over RLHF comes down to scale and resources. DPO is the right choice for most practitioners: smaller teams, limited GPU budgets, preference pairs available in advance, and a need for training stability. RLHF with PPO has advantages at very large scale where you have a continuous stream of real-time human feedback (online learning), very complex multi-turn preference signals, or the resources to iterate on reward model quality. For the vast majority of open-weight model fine-tuning, DPO or ORPO is the practical choice.
Dataset Preparation
Plain Language
In machine learning, there is a maxim: garbage in, garbage out. Nowhere is this more true than in fine-tuning. You can choose the perfect learning rate, the optimal LoRA rank, the most efficient training infrastructure — and all of that is worthless if your training data is noisy, inconsistent, or misaligned with what you actually want the model to do. The quality of your fine-tuning data is the single most important variable in the entire process, and it is the one most often neglected by practitioners eager to start training.
A counterintuitive but well-validated empirical finding: 100 perfectly curated examples consistently outperform 10,000 mediocre ones for instruction tuning. The model is already highly capable — it just needs to be steered. A few dozen crystal-clear examples of exactly the behavior you want are far more informative than thousands of examples with noise, inconsistency, or ambiguity. Quality filtering, deduplication, and careful review of every example in your training set are not optional niceties; they are the primary lever of fine-tuning quality.
One of the most effective and increasingly standard approaches to dataset creation is synthetic data generation: using a strong LLM (GPT-4, Claude Opus) to generate training data for a weaker model. This is the approach behind the Phi family of models from Microsoft (Phi-1, Phi-2, Phi-3), which achieved remarkable performance at small model sizes by training on GPT-4-generated "textbook quality" data. It is also the basis of techniques like Self-Instruct and Evol-Instruct. If you have a well-defined task, you can often generate thousands of high-quality examples quickly and cheaply with a strong frontier model, at a fraction of the cost of human annotation.
The distribution of your training data must match the distribution of inputs your model will see at inference time. This sounds obvious but is easy to get wrong. If your production system will see short, terse user queries, training on long elaborate prompts will not help. If your users will ask edge-case questions, you need edge cases in your training data. Most importantly, your evaluation (test) set must be drawn from the real production distribution, not from the same synthetic data generation process as your training set — otherwise you are measuring how well the model learned your synthetic data, not how well it performs on real-world inputs.
Deep Dive
The standard OpenAI instruction tuning format is a JSONL file where each line is a JSON object with a "messages" array following the ChatML conversation format. This format is supported by OpenAI's fine-tuning API and is also widely used with HuggingFace's chat templates:
# Instruction tuning format (one JSON object per line)
{
  "messages": [
    {
      "role": "system",
      "content": "You are a medical documentation assistant..."
    },
    {
      "role": "user",
      "content": "Summarize the following clinical note..."
    },
    {
      "role": "assistant",
      "content": "Chief Complaint: chest pain..."
    }
  ]
}
The legacy completion format (used by older OpenAI fine-tuning and some HuggingFace models) uses a simpler two-field structure. This format is being phased out in favor of the messages format for instruction-following models but is still relevant for sequence-completion tasks and older APIs:
# Legacy completion format
{"prompt": "Translate to French: Hello, how are you?\n\n###\n\n",
"completion": " Bonjour, comment allez-vous?\n"}
Dataset size guidelines are approximate but give useful planning targets. For learning a new output format (e.g., always respond as JSON): 100-500 examples. For changing communication style or tone: 500-2,000 examples. For teaching a new domain's vocabulary and conventions: 2,000-10,000 examples. For learning substantial new knowledge that was not in pretraining: 10,000-50,000 examples. For building significant new capabilities: 50,000 and up. These are not hard requirements — you may need more or fewer depending on the consistency of your data and how different the desired behavior is from the base model's defaults.
Data cleaning encompasses several distinct operations that each meaningfully improve dataset quality. Deduplication removes near-identical examples using techniques like MinHash LSH — duplicates waste training budget and can cause the model to overfit to specific phrasings. Quality filtering removes low-quality examples using automated metrics (perplexity filtering removes text that is incoherent by the model's own assessment) or manual human review. PII (personally identifiable information) removal is essential for compliance — scrub names, addresses, phone numbers, emails, and other identifying information from training data. Format validation ensures every example is syntactically valid and has the expected fields.
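As a sketch of the deduplication and validation steps, the helper below drops exact duplicates after light normalization. Catching near-duplicates would require MinHash LSH (e.g., via the datasketch library), which this deliberately simplified version does not attempt:

```python
import hashlib
import json
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedupe_examples(examples: list[dict]) -> list[dict]:
    """Drop exact duplicates (after normalization), preserving order."""
    seen, unique = set(), []
    for ex in examples:
        digest = hashlib.sha256(
            normalize(json.dumps(ex, sort_keys=True)).encode()
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ex)
    return unique

examples = [
    {"prompt": "What is aspirin?", "response": "A common NSAID..."},
    {"prompt": "What is  ASPIRIN?", "response": "A common NSAID..."},  # trivial variant
    {"prompt": "What is ibuprofen?", "response": "Another NSAID..."},
]
clean = dedupe_examples(examples)  # keeps 2 of the 3
```

The same loop is a natural place to hang format validation and PII scrubbing, so every example passes one pipeline before it reaches the training file.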
Synthetic data generation pipelines follow a consistent pattern: generate a diverse set of seed prompts representing different aspects of your task, use a strong LLM to generate responses for each prompt, filter the generated (prompt, response) pairs by quality using either another LLM as a judge or a rule-based quality check, and then deduplicate:
from openai import OpenAI
import json

client = OpenAI()

def generate_training_example(seed_topic: str) -> dict:
    # Use GPT-4o to generate a (prompt, response) pair
    meta_prompt = f"""Generate a realistic user question about {seed_topic}
and a high-quality expert answer.
Return JSON: {{"prompt": "...", "response": "..."}}"""
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": meta_prompt}],
        response_format={"type": "json_object"},
    )
    data = json.loads(result.choices[0].message.content)
    return {
        "messages": [
            {"role": "user", "content": data["prompt"]},
            {"role": "assistant", "content": data["response"]},
        ]
    }

seed_topics = ["drug interactions", "dosage calculations", "clinical terminology"]
examples = [generate_training_example(t) for t in seed_topics * 100]

# Write to JSONL
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
The train/validation/test split requires deliberate thought. A common split is 90/5/5 (train/validation/test) for larger datasets, or 80/10/10 for smaller ones. The validation set is used during training to detect overfitting — if training loss keeps falling but validation loss starts rising, you are overfitting and should stop early. The test set is a held-out set evaluated only after training is complete, to get an unbiased estimate of real-world performance. Never use examples from your test set to make training decisions. Ideally, your test set consists of real examples from your actual production environment, not examples from the same synthetic generation process as your training data.
The HuggingFace Datasets library is the standard tool for managing fine-tuning datasets. It handles loading from JSONL, arrow, CSV, and many other formats; streaming large datasets that don't fit in memory; applying transformations and filters; and pushing datasets to the HuggingFace Hub for sharing:
from datasets import Dataset, load_dataset

# Load from local JSONL file
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Or create from a list of dicts
data = [{"messages": [...], "id": i} for i in range(1000)]
dataset = Dataset.from_list(data)

# Split into train and validation
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds = splits["train"]
val_ds = splits["test"]

# Push to HuggingFace Hub
dataset.push_to_hub("your_org/medical-instructions-v1", private=True)
If you use a strong LLM to generate both your training data and your evaluation data, you have contaminated your evaluation. The model will perform well on the eval because it learned the eval-generating LLM's patterns — not because it genuinely learned your task. Always hold out real human-curated examples for your test set.
OpenAI Fine-Tuning API
Plain Language
OpenAI provides a hosted fine-tuning service that abstracts away all the infrastructure complexity. You prepare your training data as a JSONL file, upload it to the OpenAI API, specify which base model you want to fine-tune and for how many epochs, and submit a fine-tuning job. OpenAI handles all the compute — the GPU allocation, training, model storage, and serving. When the job finishes (typically 30 minutes to a few hours depending on dataset size), you receive a model ID that you can use exactly like any other OpenAI model in API calls.
The appeal is the complete elimination of infrastructure overhead. There is no GPU to provision, no CUDA environment to configure, no training loop to write or debug. If you can prepare a JSONL file and make an API call, you can fine-tune a model. For teams without ML infrastructure expertise or without the desire to build and maintain GPU training pipelines, this is a compelling option. OpenAI fine-tuning is a fully managed service: you bring the data, they handle everything else.
The currently supported base models include gpt-4o-mini (the best cost-performance option for most use cases) and gpt-3.5-turbo. GPT-4o is available to some partners. The fine-tuned model is hosted on OpenAI's infrastructure and accessed via the standard Chat Completions API — the only difference is the model ID, which carries an "ft:" prefix and your organization's namespace. Fine-tuned models can only be accessed by the organization that created them.
The cost structure has two components. First, you pay a training cost per token in your training dataset — this is a one-time cost paid when the job runs. Second, you pay inference cost per token at call time — and fine-tuned models cost more per token than the corresponding base model. The economic logic is that the training cost is amortized across many inference calls, each of which saves tokens by not needing a long system prompt. For high-volume use cases this amortization makes fine-tuning cost-effective; for low-volume use cases the training cost may not be recovered.
Deep Dive
The file upload step uses the Files API. You upload your JSONL file with the purpose "fine-tune," which tells OpenAI to retain it for use in fine-tuning jobs. The API validates your file format, checks for required fields, and returns a file ID that you will reference in the job creation call:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env

# Upload training file
with open("train.jsonl", "rb") as f:
    train_file = client.files.create(file=f, purpose="fine-tune")
with open("val.jsonl", "rb") as f:
    val_file = client.files.create(file=f, purpose="fine-tune")

print(train_file.id)  # file-abc123...
Job creation submits the fine-tuning job to OpenAI's training infrastructure. You specify the base model, the training file, and optionally a validation file and hyperparameters. The job is queued and typically begins training within a few minutes of submission:
# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    validation_file=val_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": "auto",  # let OpenAI choose
        "learning_rate_multiplier": "auto",
        "batch_size": "auto",
    },
    suffix="medical-notes-v1",  # appears in model ID
)

print(job.id)      # ftjob-abc123...
print(job.status)  # "validating_files" | "queued" | "running" | "succeeded"
Monitoring training progress is done by polling job events, which include training loss and validation loss at each step. This lets you detect overfitting in real time and cancel the job early if needed:
import time

# Poll job status
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")
    if job.status in ["succeeded", "failed", "cancelled"]:
        break
    time.sleep(60)

# Stream events to see loss curves
events = client.fine_tuning.jobs.list_events(job.id, limit=50)
for event in events.data:
    print(event.message)
    # Step 100/240: train loss=0.432, val loss=0.489
    # Step 200/240: train loss=0.311, val loss=0.501 <- diverging, consider stopping

print("Fine-tuned model ID:", job.fine_tuned_model)
# ft:gpt-4o-mini-2024-07-18:your-org:medical-notes-v1:AbCdEf12
Using the fine-tuned model is identical to using any other OpenAI model — just substitute the fine-tuned model ID. The API, streaming, function calling, and all other features work exactly the same way. The key difference is that you no longer need to include lengthy formatting or style instructions in your system prompt, because those behaviors are now embedded in the model weights:
# Using the fine-tuned model — no format instructions needed
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:your-org:medical-notes-v1:AbCdEf12",
    messages=[
        {"role": "user", "content": "Summarize this clinical note: ..."}
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
Understanding the hyperparameters helps you control training quality. n_epochs
determines how many passes through the training data — OpenAI's "auto" setting typically
chooses 3-4 epochs for datasets under 1,000 examples. learning_rate_multiplier
scales the learning rate relative to OpenAI's base learning rate — values between 0.5 and 2.0
are reasonable for experimentation. batch_size affects gradient estimation quality;
larger batches are more stable but require more memory. In practice, OpenAI's auto-selected
defaults are well-tuned for typical datasets, and manual adjustment is only worthwhile
if you have strong evidence that the defaults are suboptimal.
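Returning to the cost question from the overview: the trade-off between one-time training cost and per-call savings can be checked with back-of-the-envelope arithmetic. All prices below are illustrative placeholders, not OpenAI's actual rates; substitute numbers from the current pricing page:

```python
# All prices are illustrative placeholders (USD per 1M tokens)
TRAIN_PRICE = 3.00        # one-time training cost per 1M training tokens
BASE_INPUT_PRICE = 0.15   # base model inference, input tokens
FT_INPUT_PRICE = 0.30     # fine-tuned model inference, input tokens

training_tokens = 2_000_000       # dataset size x epochs, in tokens
prompt_saved_per_call = 800       # system-prompt tokens the FT model no longer needs
remaining_tokens_per_call = 400   # tokens still sent on every call

training_cost = training_tokens / 1e6 * TRAIN_PRICE  # 6.00

def cost_per_call(price_per_m: float, tokens: int) -> float:
    return tokens / 1e6 * price_per_m

base_call = cost_per_call(BASE_INPUT_PRICE, remaining_tokens_per_call + prompt_saved_per_call)
ft_call = cost_per_call(FT_INPUT_PRICE, remaining_tokens_per_call)

# Calls needed before the cheaper per-call cost pays back the training fee
break_even_calls = training_cost / (base_call - ft_call)  # 100,000 calls
```

Under these made-up numbers the fine-tune pays for itself after roughly 100,000 calls; at low volume, or when the fine-tuned model does not actually shorten prompts, it may never break even.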
The validation file is essential for production-quality fine-tuning. Without it, you can only observe training loss (which always goes down as the model memorizes the training data) and have no way to detect overfitting. When you supply a validation file, OpenAI computes validation loss at each checkpoint, giving you a curve you can inspect to decide when training has converged. If validation loss stops decreasing while training loss continues to fall, you are overfitting — the model is memorizing training examples rather than generalizing to new inputs. In this case, you should rerun the job with fewer epochs or reduce the learning rate multiplier.
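The stopping rule described above can be automated from the checkpoint history. The hypothetical helper below flags a run once validation loss has worsened for several consecutive checkpoints:

```python
def should_stop_early(val_losses: list[float], patience: int = 3) -> bool:
    """True if validation loss has risen for `patience` consecutive checkpoints."""
    if len(val_losses) < patience + 1:
        return False  # not enough history to judge
    recent = val_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))

healthy = [0.80, 0.65, 0.55, 0.50, 0.48]          # still improving
overfitting = [0.80, 0.60, 0.50, 0.52, 0.55, 0.58]  # val loss turning upward
assert not should_stop_early(healthy)
assert should_stop_early(overfitting)
```

Since the hosted API does not stop jobs for you, the practical equivalent is cancelling the job (or rerunning with fewer epochs) once the polled validation losses show this pattern.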
Always supply a validation file when fine-tuning via OpenAI. Even if your dataset is small, the validation loss curve gives you objective evidence of whether training is helping or overfitting. Start with the "auto" hyperparameters for your first run, inspect the loss curves, and adjust from there on subsequent runs.
Interview Ready
How to Explain This in 2 Minutes
Fine-tuning is how you teach a pre-trained LLM new behaviors — a specific tone, output format, or reasoning pattern — that prompt engineering alone cannot reliably produce. Rather than updating all billions of parameters (full fine-tuning), modern techniques like LoRA inject small trainable matrices into the model's attention layers, cutting GPU memory by 10x or more while achieving comparable quality. QLoRA goes further by quantizing the base model to 4-bit precision, making it possible to fine-tune a 65B-parameter model on a single 48 GB GPU. On the alignment side, RLHF uses a reward model trained on human preference data to steer the model toward helpful, harmless outputs, while DPO simplifies the process by directly optimizing on preference pairs without a separate reward model. The practical workflow is: prepare a high-quality dataset in the right format, decide whether LoRA/QLoRA on your own hardware or a hosted API (like OpenAI's) fits your budget and latency needs, run training, evaluate on held-out data, and iterate. The cardinal rule is to exhaust prompt engineering and RAG first — fine-tuning is for behavior, not knowledge.
Likely Interview Questions
| Question | What They're Really Asking |
|---|---|
| What is LoRA and why is it more practical than full fine-tuning? | Do you understand low-rank decomposition, parameter efficiency, and the memory savings that make LoRA viable on commodity hardware? |
| How does QLoRA differ from LoRA, and when would you choose it? | Can you explain 4-bit NormalFloat quantization, double quantization, and the trade-off between memory reduction and potential quality loss? |
| What is PEFT and how does it relate to LoRA, prefix tuning, and adapters? | Do you know the HuggingFace PEFT ecosystem and can you compare different parameter-efficient strategies beyond just LoRA? |
| Walk me through how you would prepare a fine-tuning dataset for a customer-support chatbot. | Can you handle real-world data engineering — collection, cleaning, formatting (chat-style JSONL), train/val splits, and quality filtering? |
| When should you fine-tune a model versus using RAG, and can you combine both? | Do you understand that fine-tuning changes behavior while RAG injects knowledge, and can you articulate when each approach (or both together) is appropriate? |
Model Answers
1. LoRA and its practicality — Full fine-tuning updates every parameter in the model, which for a 7B-parameter LLM means storing a full copy of gradients and optimizer states — easily 60+ GB of GPU memory. LoRA (Low-Rank Adaptation) freezes the entire pre-trained model and injects pairs of small trainable matrices (A and B) into each attention layer. Because the rank r is typically 8–64, the number of trainable parameters drops to less than 1% of the original. At inference time, the adapter weights can be merged back into the base model with zero additional latency. This makes LoRA practical on a single consumer GPU (24 GB), enables serving multiple task-specific adapters from one base model by hot-swapping adapter weights, and drastically reduces storage since each adapter is only a few hundred megabytes.
2. QLoRA versus LoRA — QLoRA combines LoRA with 4-bit quantization of the base model using a novel NormalFloat4 data type that is information-theoretically optimal for normally distributed weights. It also introduces double quantization — quantizing the quantization constants themselves — saving an additional ~0.4 bits per parameter. Paged optimizers handle memory spikes by offloading optimizer state to CPU RAM when the GPU runs out of memory. The result is that you can fine-tune a 65B model on a single 48 GB GPU (A6000 or A100), which is impossible with standard LoRA. You choose QLoRA when your model is too large to fit in GPU memory even with LoRA alone, or when you want to maximize model size within a fixed hardware budget. The quality trade-off is minimal — the original QLoRA paper showed results matching full 16-bit fine-tuning on benchmarks.
3. Dataset preparation for a customer-support chatbot — Start by mining real support transcripts, filtering for conversations where the customer issue was resolved successfully and the customer rated the interaction positively. Convert each conversation into the chat-completion format: a system message defining the bot persona and policies, followed by alternating user/assistant turns. Clean the data by removing PII, normalizing formatting, and discarding incomplete or off-topic exchanges. Aim for at least 500–1,000 high-quality examples; more data with lower quality is worse than less data with high quality. Split 90/10 into train and validation sets, ensuring no conversation appears in both. Validate the JSONL format programmatically before uploading. After fine-tuning, evaluate on the validation set using both automated metrics (loss, BLEU) and human evaluation of tone, accuracy, and policy compliance.
4. Fine-tuning versus RAG — Fine-tuning teaches the model how to behave — a particular tone, output format, or reasoning chain. RAG teaches the model what to know — injecting up-to-date, domain-specific facts at query time. If your problem is that the model does not know your company's product catalog, use RAG. If the problem is that the model produces verbose, generic responses instead of concise, brand-voice answers, fine-tune. In many production systems you combine both: fine-tune a base model to follow your output conventions and citation style, then use RAG to supply the actual knowledge. This separation keeps the fine-tuned model small and focused while the knowledge layer stays dynamic and auditable.
5. PEFT ecosystem — PEFT (Parameter-Efficient Fine-Tuning) is HuggingFace's library that provides a unified interface for multiple adaptation strategies. LoRA is the most popular, but the library also supports prefix tuning (prepending trainable virtual tokens to every layer's input), prompt tuning (a simpler variant that only adds tokens at the input layer), and adapter layers (small bottleneck MLPs inserted between transformer blocks). LoRA is generally preferred because it adds no inference latency after merging, while prefix tuning and adapters slightly increase the effective sequence length or model depth. The PEFT library integrates with the TRL (Transformer Reinforcement Learning) library for RLHF and DPO workflows, making it possible to do efficient alignment training with minimal code.
System Design Scenario
Your company has a 13B-parameter open-source LLM deployed for internal code review. Developers complain that the model's suggestions are too generic and don't follow the team's coding standards (naming conventions, error-handling patterns, comment style). You have access to 2,000 high-quality code review comments written by senior engineers. Design a fine-tuning pipeline to customize the model.
A strong answer should cover:
- Technique selection — Use QLoRA to fine-tune the 13B model on a single A100 (80 GB) or two A6000 (48 GB) GPUs. Quantize the base model to 4-bit, apply LoRA adapters with rank 16 to the query and value projection matrices, yielding roughly 20M trainable parameters.
- Dataset preparation — Structure each example as a chat turn: system prompt defining the reviewer persona, user message containing the code diff, assistant message containing the senior engineer's review comment. Split 1,800 / 200 for train / validation. Augment with synthetic examples by having the base model generate reviews and having senior engineers rank them (preference data for a potential DPO pass).
- Training and evaluation — Train for 3–5 epochs with the SFTTrainer from TRL, monitoring validation loss for early stopping. Evaluate with a blind test: present 50 code diffs to senior engineers, show the fine-tuned model's review alongside the base model's review (randomized order), and measure win rate. Target at least 70% preference for the fine-tuned model.
- Deployment — Merge the LoRA adapter into the base model for serving (zero latency overhead). Use vLLM or TGI for efficient inference. Keep the adapter weights versioned in a model registry so you can roll back or A/B test different adapter versions without reloading the base model.
- Iteration — Collect ongoing developer feedback on review quality, add highly-rated reviews to the training set, and retrain monthly. Consider a DPO pass using the preference data from engineer rankings to further align the model with team standards.
Common Mistakes
- Fine-tuning for knowledge instead of behavior — Developers often try to "teach" a model new facts through fine-tuning when RAG would be cheaper, more accurate, and easier to update. Fine-tuning bakes information into weights where it cannot be cited, verified, or refreshed without retraining. Use fine-tuning only when you need to change how the model responds, not what it knows.
- Training on too little or low-quality data — Fine-tuning with fewer than 100 examples or with noisy, inconsistent examples leads to overfitting or conflicting learned behaviors. Quality matters far more than quantity: 500 carefully curated examples outperform 5,000 scraped, unchecked ones. Always inspect your data manually before training.
- Ignoring the validation loss curve — Without a validation split, you cannot detect overfitting. Many practitioners train for too many epochs, see training loss decrease, and assume the model is improving — when in reality it is memorizing the training set and degrading on unseen inputs. Always reserve 10–20% of data for validation and stop training when validation loss plateaus or increases.