01
SFT (Supervised Fine-Tuning)
Definition
Continues training on curated (prompt, response) pairs using cross-entropy loss on completion tokens only.
Purpose
The standard first step before preference-based alignment. Teaches the model the instruction-following format.
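A minimal PyTorch sketch of completion-only loss masking, assuming the common Hugging Face convention that label -100 is ignored by cross-entropy (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy on completion tokens only: mask prompt positions."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100              # prompt tokens contribute no loss
    shift_logits = logits[:, :-1, :].contiguous()  # position t predicts token t+1
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```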
02
RLHF
Definition
Trains a reward model on human pairwise preferences, then uses PPO to maximize reward within a KL budget.
Purpose
The recipe behind ChatGPT and Claude's initial alignment. Quality is capped by the accuracy of the reward model.
03
RLAIF
Definition
Replaces human labelers with a stronger AI judge generating preference labels at scale.
Purpose
Scales feedback data collection 100–1000x vs. human labeling. Constitutional AI uses RLAIF with self-critique.
04
PPO
Definition
Clips the policy probability ratio to [1-ε, 1+ε] to prevent catastrophically large policy updates.
Purpose
The stability constraint that makes billion-parameter RL-from-feedback tractable; limits destructive policy drift per update.
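A sketch of the clipped surrogate loss, assuming per-token log-probabilities and advantages have already been computed:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective: bound the probability ratio in [1-eps, 1+eps]."""
    ratio = torch.exp(logp_new - logp_old)              # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()        # maximize -> minimize negative
```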
05
DPO
Definition
Reformulates RLHF as a classification loss on preference pairs: the KL-constrained optimum has a closed form, making the reward model implicit in the policy's log-ratios.
Purpose
No RL loop, no reward model training. Requires only a reference model and preference pairs; simpler pipeline.
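A sketch of the DPO loss, assuming summed per-response log-probabilities from the policy and frozen reference models:

```python
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Logistic loss on the difference of policy-vs-reference log-ratios.
    Inputs are summed log-probs of each full response."""
    chosen_ratio = pol_chosen - ref_chosen
    rejected_ratio = pol_rejected - ref_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```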
06
ORPO
Definition
Adds an odds-ratio penalty to the SFT cross-entropy loss, combining alignment and fine-tuning in one stage.
Purpose
Eliminates the reference model and halves pipeline complexity; competitive with DPO on alignment benchmarks.
07
KTO
Definition
Derived from Kahneman-Tversky prospect theory; treats chosen responses as gains and rejected ones as losses, weighted asymmetrically.
Purpose
Outperforms DPO when only binary good/bad signals are available, not ranked pairs; cheaper data collection.
08
GRPO
Definition
Samples a group of responses per prompt; uses within-group normalized rewards as the advantage signal.
Purpose
Eliminates the critic model entirely. DeepSeek-R1's alignment recipe; reduces memory and training complexity.
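A sketch of the group-normalized advantage computation, assuming a (num_prompts, group_size) matrix of scalar rewards:

```python
import torch

def grpo_advantages(rewards):
    """Advantage = reward standardized within each prompt's group of samples;
    no learned value function is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)
```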
09
SimPO
Definition
Replaces DPO's reference log-ratio with the length-normalized average log-probability plus a target margin; no reference model.
Purpose
Trains 2x faster than standard DPO with equal or better alignment benchmark scores. Simpler to implement.
10
IPO
Definition
Replaces DPO's unbounded sigmoid objective with a squared (L2) loss that regresses the preference log-ratio gap toward a fixed target.
Purpose
Provably prevents over-optimization where the model exploits the preference signal rather than improving.
11
LoRA
Definition
Freezes the base model and injects trainable low-rank matrices A ∈ R^{d×r} and B ∈ R^{r×k} alongside the frozen weights.
Purpose
Only r×(d+k) parameters are trained instead of d×k. r=16 covers most tasks; the standard PEFT method.
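A minimal LoRA layer following this glossary's A ∈ R^{d×r}, B ∈ R^{r×k} convention, with B zero-initialized so the update starts at zero (a sketch, not a library implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear (d -> k) plus trainable low-rank update x @ A @ B."""
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze base weights
        d, k = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # carries signal at init
        self.B = nn.Parameter(torch.zeros(r, k))         # zero: delta starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale
```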
12
QLoRA
Definition
Stores the frozen base in 4-bit NF4 with double quantization while LoRA adapters remain in BF16.
Purpose
Fine-tunes 70B models on a single A100-80GB. The breakthrough that democratized LLM customization.
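A sketch of the 4-bit loading step with Hugging Face transformers and bitsandbytes; the model name is illustrative, and LoRA adapters would be attached separately (e.g., via peft):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 base with double quantization; compute runs in bfloat16.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb
)
```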
13
DoRA
Definition
Decomposes each weight W into magnitude and unit direction, applying LoRA updates to the direction only.
Purpose
Closes the accuracy gap between LoRA and full fine-tuning on most tasks; drop-in replacement for LoRA.
14
LoRA+
Definition
Sets the B matrix learning rate 16x higher than A, reflecting that B starts at zero while A carries signal.
Purpose
A one-line change that consistently improves LoRA convergence speed without any quality trade-off.
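A sketch of the trick via optimizer parameter groups; the ".A"/".B" naming assumes the LoRALinear sketch above:

```python
import torch

base_lr = 2e-4
a_params = [p for n, p in model.named_parameters() if n.endswith(".A")]
b_params = [p for n, p in model.named_parameters() if n.endswith(".B")]
# B matrices get a ~16x higher learning rate than A matrices.
optimizer = torch.optim.AdamW([
    {"params": a_params, "lr": base_lr},
    {"params": b_params, "lr": base_lr * 16},
])
```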
15
VeRA
Definition
Shares a single frozen random Gaussian matrix pair across all layers; trains only per-layer scaling vectors.
Purpose
Under 1M trainable parameters for a 7B fine-tune. Extreme efficiency with competitive downstream results.
16
IA3
Definition
Multiplies key, value, and FFN intermediate vectors by learned scaling vectors l_k, l_v, l_{ff}.
Purpose
Only 0.01% of parameters trained. Inference overhead is zero after folding scaling vectors into weights.
17
Prefix Tuning
Definition
Prepends P trainable soft vectors to each layer's K and V matrices; the base model stays completely frozen.
Purpose
Adapters are tiny P×d tensors swapped per task. No weight modification; clean compositional fine-tuning.
18
Prompt Tuning
Definition
Learns soft token embeddings prepended to the input; every other parameter is frozen.
Purpose
Approaches full fine-tuning quality at model scales above ~10B parameters. Lightest possible PEFT; zero architecture change.
19
P-Tuning v2
Definition
Applies independent trainable prefix vectors at every transformer layer depth, not just the input.
Purpose
Narrows the gap with full fine-tuning for NLU tasks requiring deep contextualization throughout layers.
20
Adapter Tuning
Definition
Inserts small two-layer bottleneck modules (down-project, nonlinearity, up-project) after attention and FFN.
Purpose
Modular: swap adapters at inference to switch tasks without reloading the base. Original PEFT approach.
21
Full Fine-Tuning
Definition
Updates all parameters using the same optimizer and schedule as pre-training, on task-specific data.
Purpose
Sets the performance ceiling. Justified when dataset exceeds 100K examples and compute allows retraining.
22
Layer Freezing
Definition
Keeps the first k layers frozen and fine-tunes only the later layers during adaptation.
Purpose
Cuts GPU memory and time at little accuracy cost: lower layers encode general syntax that rarely needs task-specific adjustment.
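A sketch of freezing the first k blocks; the attribute paths follow a typical Hugging Face LLaMA-style layout and are illustrative:

```python
k = 24  # number of leading transformer blocks to freeze

for p in model.model.embed_tokens.parameters():
    p.requires_grad = False          # embeddings stay frozen
for layer in model.model.layers[:k]:
    for p in layer.parameters():
        p.requires_grad = False      # first k blocks stay frozen
```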
23
Alignment Tax
Definition
The measurable drop in general benchmark performance that occurs after safety and helpfulness training.
Purpose
Treated as a cost to minimize rather than eliminate; a low alignment tax is a key criterion for production models.
24
Instruction Tuning
Definition
Fine-tunes on diverse NLP tasks reframed as natural language instructions and structured completions.
Purpose
The bridge from raw base model to a general-purpose assistant that follows novel zero-shot instructions.
25
Constitutional AI (CAI)
Definition
The model critiques and revises its own outputs against a written constitution; the revised data then trains the model via SFT and RLAIF.
Purpose
Reduces dependence on human red-teamers. Anthropic's primary safety training approach for Claude.
26
SPIN (Self-Play Fine-Tuning)
Definition
Trains the current model to prefer human reference responses over the previous checkpoint's generations, iterating in a self-play loop.
Purpose
Self-improving loop without new human labels. Quality improves each generation of the self-play cycle.
27
Self-Instruct
Definition
Generates instruction-response pairs from a seed set using the model itself, filtering low-quality outputs.
Purpose
Bootstraps instruction datasets at massive scale. Origin of Stanford Alpaca's 52K training examples.
28
Rejection Sampling FT (RFT)
Definition
Samples N responses per prompt, keeps those above a reward threshold, and fine-tunes on accepted outputs.
Purpose
A cheap RL alternative that often matches PPO on reasoning tasks; used in LLaMA-3 instruction tuning.
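A sketch of the sampling-and-filtering loop; generate and reward_fn are assumed callables, and the accepted pairs feed ordinary SFT:

```python
def collect_rft_data(prompts, generate, reward_fn, n=8, threshold=0.8):
    """Sample n responses per prompt, keep those scoring above the threshold."""
    accepted = []
    for prompt in prompts:
        for response in generate(prompt, num_samples=n):   # N candidates
            if reward_fn(prompt, response) >= threshold:   # reward filter
                accepted.append((prompt, response))
    return accepted
```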
29
Reward Model Training
Definition
Trains a scalar head with a pairwise ranking (Bradley-Terry) loss on preference data to score responses given a prompt.
Purpose
The quality of the reward model is the primary bottleneck in RLHF. Reward hacking starts here.
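A sketch of the standard Bradley-Terry pairwise loss over the scalar scores:

```python
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    """Push the chosen response's scalar score above the rejected one's."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```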
30
Safety Training
Definition
Fine-tunes the model to refuse harmful requests, acknowledge uncertainty, and follow behavioral policies.
Purpose
Requires red-teaming datasets and careful evaluation to avoid over-refusal, which cuts helpfulness.
31
Multi-task Fine-Tuning
Definition
Trains on diverse task datasets simultaneously using a mixture of instruction-formatted examples.
Purpose
Improves zero-shot generalization and reduces task-specific over-fitting vs. single-task fine-tuning.
32
Chat Templates
Definition
Structured markup (e.g., [INST]...[/INST] or <|im_start|>/<|im_end|>) separating system, user, and assistant turns.
Purpose
Inconsistent templates between training and inference severely degrade performance. Always match exactly.
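A sketch using the transformers apply_chat_template API, which renders the exact markup the model was trained on (the model name is illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain LoRA in one sentence."},
]
# tokenize=False returns the templated string for inspection.
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```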
33
DARE
Definition
Randomly zeros a fraction p (typically 90%) of the delta weights W_{ft} - W_{base} and rescales survivors by 1/(1-p) before merging.
Purpose
Dropout on the delta space reduces interference between adapters. Prerequisite for clean model merging.
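A sketch of drop-and-rescale on a single delta tensor; the rescaling keeps the expected update unchanged:

```python
import torch

def dare(delta, p=0.9):
    """Drop a fraction p of delta entries, rescale survivors by 1/(1-p)."""
    mask = torch.rand_like(delta) >= p
    return delta * mask / (1 - p)

# merged_weight = base_weight + dare(ft_weight - base_weight)
```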
34
TIES Merging
Definition
Trims low-magnitude deltas, resolves sign conflicts by majority vote, and sums the surviving deltas.
Purpose
Consistently outperforms simple weight averaging when merging 3+ task-specific checkpoints into one.
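A compressed sketch of trim / elect sign / disjoint merge over a list of task-vector tensors (thresholding details simplified):

```python
import torch

def ties_merge(deltas, keep=0.2):
    """Trim small-magnitude entries, elect a sign per coordinate,
    then average the entries that agree with the elected sign."""
    trimmed = []
    for d in deltas:
        k = max(1, int(keep * d.numel()))                # entries to keep
        threshold = d.abs().flatten().kthvalue(d.numel() - k).values
        trimmed.append(torch.where(d.abs() > threshold, d, torch.zeros_like(d)))
    stacked = torch.stack(trimmed)
    sign = torch.sign(stacked.sum(dim=0))                # majority sign by mass
    agree = torch.sign(stacked) == sign                  # entries matching it
    return (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
```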
35
SLERP
Definition
Interpolates two weight vectors along the shortest arc of the high-dimensional unit sphere.
Purpose
Preserves geometric structure that linear interpolation distorts; smooth blending for style mixtures.
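A sketch of SLERP over flattened weight tensors, falling back to linear interpolation when the vectors are nearly parallel:

```python
import torch

def slerp(w0, w1, t, eps=1e-8):
    """Spherical interpolation between two same-shaped weight tensors."""
    v0, v1 = w0.flatten(), w1.flatten()
    cos = torch.clamp(torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps), -1.0, 1.0)
    omega = torch.acos(cos)                      # angle between the two vectors
    if omega.abs() < 1e-4:                       # nearly parallel: plain lerp
        return (1 - t) * w0 + t * w1
    s = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / s) * v0 + (torch.sin(t * omega) / s) * v1
    return out.reshape(w0.shape)
```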
36
Model Soups
Definition
Averages the weights of multiple SFT checkpoints trained from the same base with different hyperparameters.
Purpose
Often beats any individual checkpoint at near-zero marginal cost. Free robustness improvement.
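A sketch of a uniform soup over checkpoint state dicts:

```python
import torch

def uniform_soup(state_dicts):
    """Average corresponding weights across fine-tuned checkpoints."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }
```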
37
Task Arithmetic
Definition
Computes task vectors τ = W_{ft} - W_{base} and adds or subtracts them from the base model.
Purpose
Negating a vector removes a capability; adding two composes skills without any retraining whatsoever.
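A sketch of task-vector extraction and composition over state dicts; coefficient signs control adding versus removing a skill:

```python
def task_vector(ft, base):
    """tau = W_ft - W_base, computed per tensor."""
    return {k: ft[k] - base[k] for k in base}

def apply_vectors(base, vectors, coeffs):
    """W = W_base + sum_i c_i * tau_i; a negative c_i removes a capability."""
    out = dict(base)
    for vec, c in zip(vectors, coeffs):
        for k in vec:
            out[k] = out[k] + c * vec[k]
    return out
```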
38
Continual Learning
Definition
Trains on a sequence of tasks while explicitly preserving performance on all previous tasks.
Purpose
Key techniques: elastic weight consolidation (EWC), experience replay, and adapter-based task isolation.
39
Catastrophic Forgetting
Definition
Previously learned capabilities degrade as fine-tuning overwrites shared weights in the base model.
Purpose
The defining challenge of sequential learning. PEFT methods reduce but do not eliminate this effect.