01
SFT (Supervised Fine-Tuning)
Definition
Continues training on curated (prompt, response) pairs using cross-entropy loss on completion tokens only.
Purpose
The standard first step before preference-based alignment. Teaches the model the instruction-following format.
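A minimal PyTorch sketch of completion-only loss masking, assuming the common Hugging Face convention that label -100 is ignored by cross-entropy (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy on completion tokens only: mask prompt positions."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100              # prompt tokens contribute no loss
    shift_logits = logits[:, :-1, :].contiguous()  # position t predicts token t+1
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```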
02
RLHF
Definition
Trains a reward model on human pairwise preferences, then uses PPO to maximize reward within a KL budget.
Purpose
The recipe behind ChatGPT and Claude's initial alignment. Quality is capped by the accuracy of the reward model.
03
RLAIF
Definition
Replaces human labelers with a stronger AI judge generating preference labels at scale.
Purpose
Scales feedback data collection 100–1000x vs. human labeling. Constitutional AI uses RLAIF with self-critique.
04
PPO
Definition
Clips the policy probability ratio to [1-ε, 1+ε] to prevent catastrophically large policy updates.
Purpose
The stability constraint that makes billion-parameter RL-from-feedback tractable; limits destructive policy drift per update.
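A sketch of the clipped surrogate loss, assuming per-token log-probabilities and advantages have already been computed:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective: bound the probability ratio in [1-eps, 1+eps]."""
    ratio = torch.exp(logp_new - logp_old)              # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()        # maximize -> minimize negative
```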
05
DPO
Definition
Reformulates RLHF as a classification loss on preference pairs: the KL-constrained optimum has a closed form, making the reward model implicit in the policy's log-ratios.
Purpose
No RL loop, no reward model training. Requires only a reference model and preference pairs; simpler pipeline.
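A sketch of the DPO loss, assuming summed per-response log-probabilities from the policy and frozen reference models:

```python
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Logistic loss on the difference of policy-vs-reference log-ratios.
    Inputs are summed log-probs of each full response."""
    chosen_ratio = pol_chosen - ref_chosen
    rejected_ratio = pol_rejected - ref_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```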
06
ORPO
Definition
Adds an odds-ratio penalty to the SFT cross-entropy loss, combining alignment and fine-tuning in one stage.
Purpose
Eliminates the reference model and halves pipeline complexity; competitive with DPO on alignment benchmarks.
07
KTO
Definition
Derived from Kahneman-Tversky prospect theory; treats chosen responses as gains and rejected ones as losses, weighted asymmetrically.
Purpose
Outperforms DPO when only binary good/bad signals are available, not ranked pairs; cheaper data collection.
08
GRPO
Definition
Samples a group of responses per prompt; uses within-group normalized rewards as the advantage signal.
Purpose
Eliminates the critic model entirely. DeepSeek-R1's alignment recipe; reduces memory and training complexity.
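A sketch of the group-normalized advantage computation, assuming a (num_prompts, group_size) matrix of scalar rewards:

```python
import torch

def grpo_advantages(rewards):
    """Advantage = reward standardized within each prompt's group of samples;
    no learned value function is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)
```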
09
SimPO
Definition
Replaces DPO's reference log-ratio with the length-normalized average log-probability plus a target margin; no reference model.
Purpose
Trains 2x faster than standard DPO with equal or better alignment benchmark scores. Simpler to implement.
10
IPO
Definition
Replaces DPO's unbounded sigmoid objective with a squared (L2) loss that regresses the preference log-ratio gap toward a fixed target.
Purpose
Provably prevents over-optimization where the model exploits the preference signal rather than improving.
11
LoRA
Definition
Freezes the base model and injects trainable low-rank matrices A ∈ R^{d×r} and B ∈ R^{r×k} alongside the frozen weights.
Purpose
Only r×(d+k) parameters are trained instead of d×k. r=16 covers most tasks; the standard PEFT method.
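A minimal LoRA layer following this glossary's A ∈ R^{d×r}, B ∈ R^{r×k} convention, with B zero-initialized so the update starts at zero (a sketch, not a library implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear (d -> k) plus trainable low-rank update x @ A @ B."""
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze base weights
        d, k = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # carries signal at init
        self.B = nn.Parameter(torch.zeros(r, k))         # zero: delta starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale
```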
12
QLoRA
Definition
Stores the frozen base in 4-bit NF4 with double quantization while LoRA adapters remain in BF16.
Purpose
Fine-tunes 70B models on a single A100-80GB. The breakthrough that democratized LLM customization.
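A sketch of the 4-bit loading step with Hugging Face transformers and bitsandbytes; the model name is illustrative, and LoRA adapters would be attached separately (e.g., via peft):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 base with double quantization; compute runs in bfloat16.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb
)
```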
13
DoRA
Definition
Decomposes each weight W into magnitude and unit direction, applying LoRA updates to the direction only.
Purpose
Closes the accuracy gap between LoRA and full fine-tuning on most tasks; drop-in replacement for LoRA.
14
LoRA+
Definition
Sets the B matrix learning rate 16x higher than A, reflecting that B starts at zero while A carries signal.
Purpose
A one-line change that consistently improves LoRA convergence speed without any quality trade-off.
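A sketch of the trick via optimizer parameter groups; the ".A"/".B" naming assumes the LoRALinear sketch above:

```python
import torch

base_lr = 2e-4
a_params = [p for n, p in model.named_parameters() if n.endswith(".A")]
b_params = [p for n, p in model.named_parameters() if n.endswith(".B")]
# B matrices get a ~16x higher learning rate than A matrices.
optimizer = torch.optim.AdamW([
    {"params": a_params, "lr": base_lr},
    {"params": b_params, "lr": base_lr * 16},
])
```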
15
VeRA
Definition
Shares a single frozen random Gaussian matrix pair across all layers; trains only per-layer scaling vectors.
Purpose
Under 1M trainable parameters for a 7B fine-tune. Extreme efficiency with competitive downstream results.
16
IA3
Definition
Multiplies key, value, and FFN intermediate vectors by learned scaling vectors l_k, l_v, l_{ff}.
Purpose
Only 0.01% of parameters trained. Inference overhead is zero after folding scaling vectors into weights.
17
Prefix Tuning
Definition
Prepends P trainable soft vectors to each layer's K and V matrices; the base model stays completely frozen.
Purpose
Adapters are tiny P×d tensors swapped per task. No weight modification; clean compositional fine-tuning.
18
Prompt Tuning
Definition
Learns soft token embeddings prepended to the input; every other parameter is frozen.
Purpose
Approaches full fine-tuning quality at model scales above ~10B parameters. Lightest possible PEFT; zero architecture change.
19
P-Tuning v2
Definition
Applies independent trainable prefix vectors at every transformer layer depth, not just the input.
Purpose
Narrows the gap with full fine-tuning for NLU tasks requiring deep contextualization throughout layers.
20
Adapter Tuning
Definition
Inserts small two-layer bottleneck modules (down-project, nonlinearity, up-project) after attention and FFN.
Purpose
Modular: swap adapters at inference to switch tasks without reloading the base. Original PEFT approach.
21
Full Fine-Tuning
Definition
Updates all parameters using the same optimizer and schedule as pre-training, on task-specific data.
Purpose
Sets the performance ceiling. Justified when dataset exceeds 100K examples and compute allows retraining.
22
Layer Freezing
Definition
Keeps the first k layers frozen and fine-tunes only the later layers during adaptation.
Purpose
Cuts GPU memory and time at little accuracy cost: lower layers encode general syntax that rarely needs task-specific adjustment.
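A sketch of freezing the first k blocks; the attribute paths follow a typical Hugging Face LLaMA-style layout and are illustrative:

```python
k = 24  # number of leading transformer blocks to freeze

for p in model.model.embed_tokens.parameters():
    p.requires_grad = False          # embeddings stay frozen
for layer in model.model.layers[:k]:
    for p in layer.parameters():
        p.requires_grad = False      # first k blocks stay frozen
```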
23
Alignment Tax
Definition
The measurable drop in general benchmark performance that occurs after safety and helpfulness training.
Purpose
Treated as a cost to minimize rather than eliminate; a low alignment tax is a key criterion for production models.
24
Instruction Tuning
Definition
Fine-tunes on diverse NLP tasks reframed as natural language instructions and structured completions.
Purpose
The bridge from raw base model to a general-purpose assistant that follows novel zero-shot instructions.
25
Constitutional AI (CAI)
Definition
The model critiques and revises its own outputs against a written constitution; the revised data then trains the model via SFT and RLAIF.
Purpose
Reduces dependence on human red-teamers. Anthropic's primary safety training approach for Claude.
26
SPIN (Self-Play Fine-Tuning)
Definition
Trains the current model to prefer human reference responses over the previous checkpoint's generations, iterating in a self-play loop.
Purpose
Self-improving loop without new human labels. Quality improves each generation of the self-play cycle.
27
Self-Instruct
Definition
Generates instruction-response pairs from a seed set using the model itself, filtering low-quality outputs.
Purpose
Bootstraps instruction datasets at massive scale. Origin of Stanford Alpaca's 52K training examples.
28
Rejection Sampling FT (RFT)
Definition
Samples N responses per prompt, keeps those above a reward threshold, and fine-tunes on accepted outputs.
Purpose
A cheap RL alternative that often matches PPO on reasoning tasks; used in LLaMA-3 instruction tuning.
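A sketch of the sampling-and-filtering loop; generate and reward_fn are assumed callables, and the accepted pairs feed ordinary SFT:

```python
def collect_rft_data(prompts, generate, reward_fn, n=8, threshold=0.8):
    """Sample n responses per prompt, keep those scoring above the threshold."""
    accepted = []
    for prompt in prompts:
        for response in generate(prompt, num_samples=n):   # N candidates
            if reward_fn(prompt, response) >= threshold:   # reward filter
                accepted.append((prompt, response))
    return accepted
```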
29
Reward Model Training
Definition
Trains a scalar head with a pairwise ranking (Bradley-Terry) loss on preference data to score responses given a prompt.
Purpose
The quality of the reward model is the primary bottleneck in RLHF. Reward hacking starts here.
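A sketch of the standard Bradley-Terry pairwise loss over the scalar scores:

```python
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    """Push the chosen response's scalar score above the rejected one's."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```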
30
Safety Training
Definition
Fine-tunes the model to refuse harmful requests, acknowledge uncertainty, and follow behavioral policies.
Purpose
Requires red-teaming datasets and careful evaluation to avoid over-refusal, which cuts helpfulness.
31
Multi-task Fine-Tuning
Definition
Trains on diverse task datasets simultaneously using a mixture of instruction-formatted examples.
Purpose
Improves zero-shot generalization and reduces task-specific over-fitting vs. single-task fine-tuning.
32
Chat Templates
Definition
Structured markup (e.g., [INST]...[/INST] or <|im_start|>/<|im_end|>) separating system, user, and assistant turns.
Purpose
Inconsistent templates between training and inference severely degrade performance. Always match exactly.
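A sketch using the transformers apply_chat_template API, which renders the exact markup the model was trained on (the model name is illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain LoRA in one sentence."},
]
# tokenize=False returns the templated string for inspection.
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```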
33
DARE
Definition
Randomly zeros a fraction p (typically 90%) of the delta weights W_{ft} - W_{base} and rescales survivors by 1/(1-p) before merging.
Purpose
Dropout on the delta space reduces interference between adapters. Prerequisite for clean model merging.
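A sketch of drop-and-rescale on a single delta tensor; the rescaling keeps the expected update unchanged:

```python
import torch

def dare(delta, p=0.9):
    """Drop a fraction p of delta entries, rescale survivors by 1/(1-p)."""
    mask = torch.rand_like(delta) >= p
    return delta * mask / (1 - p)

# merged_weight = base_weight + dare(ft_weight - base_weight)
```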
34
TIES Merging
Definition
Trims low-magnitude deltas, resolves sign conflicts by majority vote, and sums the surviving deltas.
Purpose
Consistently outperforms simple weight averaging when merging 3+ task-specific checkpoints into one.
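A compressed sketch of trim / elect sign / disjoint merge over a list of task-vector tensors (thresholding details simplified):

```python
import torch

def ties_merge(deltas, keep=0.2):
    """Trim small-magnitude entries, elect a sign per coordinate,
    then average the entries that agree with the elected sign."""
    trimmed = []
    for d in deltas:
        k = max(1, int(keep * d.numel()))                # entries to keep
        threshold = d.abs().flatten().kthvalue(d.numel() - k).values
        trimmed.append(torch.where(d.abs() > threshold, d, torch.zeros_like(d)))
    stacked = torch.stack(trimmed)
    sign = torch.sign(stacked.sum(dim=0))                # majority sign by mass
    agree = torch.sign(stacked) == sign                  # entries matching it
    return (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
```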
35
SLERP
Definition
Interpolates two weight vectors along the shortest arc of the high-dimensional unit sphere.
Purpose
Preserves geometric structure that linear interpolation distorts; smooth blending for style mixtures.
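A sketch of SLERP over flattened weight tensors, falling back to linear interpolation when the vectors are nearly parallel:

```python
import torch

def slerp(w0, w1, t, eps=1e-8):
    """Spherical interpolation between two same-shaped weight tensors."""
    v0, v1 = w0.flatten(), w1.flatten()
    cos = torch.clamp(torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps), -1.0, 1.0)
    omega = torch.acos(cos)                      # angle between the two vectors
    if omega.abs() < 1e-4:                       # nearly parallel: plain lerp
        return (1 - t) * w0 + t * w1
    s = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / s) * v0 + (torch.sin(t * omega) / s) * v1
    return out.reshape(w0.shape)
```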
36
Model Soups
Definition
Averages the weights of multiple SFT checkpoints trained from the same base with different hyperparameters.
Purpose
Often beats any individual checkpoint at near-zero marginal cost. Free robustness improvement.
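A sketch of a uniform soup over checkpoint state dicts:

```python
import torch

def uniform_soup(state_dicts):
    """Average corresponding weights across fine-tuned checkpoints."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }
```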
37
Task Arithmetic
Definition
Computes task vectors τ = W_{ft} - W_{base} and adds or subtracts them from the base model.
Purpose
Negating a vector removes a capability; adding two composes skills without any retraining whatsoever.
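A sketch of task-vector extraction and composition over state dicts; coefficient signs control adding versus removing a skill:

```python
def task_vector(ft, base):
    """tau = W_ft - W_base, computed per tensor."""
    return {k: ft[k] - base[k] for k in base}

def apply_vectors(base, vectors, coeffs):
    """W = W_base + sum_i c_i * tau_i; a negative c_i removes a capability."""
    out = dict(base)
    for vec, c in zip(vectors, coeffs):
        for k in vec:
            out[k] = out[k] + c * vec[k]
    return out
```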
38
Continual Learning
Definition
Trains on a sequence of tasks while explicitly preserving performance on all previous tasks.
Purpose
Key techniques: elastic weight consolidation (EWC), experience replay, and adapter-based task isolation.
39
Catastrophic Forgetting
Definition
Previously learned capabilities degrade as fine-tuning overwrites shared weights in the base model.
Purpose
The defining challenge of sequential learning. PEFT methods reduce but do not eliminate this effect.