Mixed Precision Training
Definition
Runs the forward and backward passes in BF16/FP16 while keeping FP32 master weights for the optimizer update.
Purpose
Cuts GPU memory by roughly 40% and roughly doubles throughput on tensor cores with negligible accuracy cost.
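A minimal PyTorch sketch of the idea, assuming a CUDA device: with BF16 autocast the parameters stay in FP32 and only the compute runs in 16-bit (a GradScaler is only needed for FP16, not BF16).

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for _ in range(10):
    x = torch.randn(8, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    # Matmuls inside this context execute in BF16; the parameters remain FP32.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x).pow(2).mean()
    loss.backward()      # gradients w.r.t. the FP32 parameters are FP32
    optimizer.step()     # update applied to the FP32 master weights
```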
Activation Checkpointing
Definition
Discards intermediate activations during forward pass and recomputes them on demand during backprop.
Purpose
Trades 33% extra compute for a 4–10x reduction in activation memory. Essential for large models.
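A minimal sketch with torch.utils.checkpoint: the wrapped block's activations are dropped after the forward pass and recomputed during backward.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(8, 1024, requires_grad=True)

out = checkpoint(block, x, use_reentrant=False)  # intermediates inside `block` are not stored
out.sum().backward()                             # the block re-runs its forward here to get them
```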
Gradient Accumulation
Definition
Defers the optimizer step across N mini-batches, summing gradients before updating weights.
Purpose
Emulates large batch sizes impossible to fit in VRAM without extra hardware or communication overhead.
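A minimal sketch with a toy model: gradients from `accum_steps` micro-batches are summed in `.grad` before a single optimizer step.

```python
import torch

model = torch.nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
accum_steps = 8

optimizer.zero_grad(set_to_none=True)
for i in range(64):
    x = torch.randn(4, 32)                        # one micro-batch
    loss = model(x).pow(2).mean() / accum_steps   # scale so the summed grads match a big-batch mean
    loss.backward()                               # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```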
Gradient Clipping
Definition
Rescales the gradient vector when its L2 norm exceeds a fixed threshold, typically 1.0.
Purpose
The single most reliable guard against loss spikes and NaN explosions in deep transformer training.
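A minimal sketch: clipping is applied between backward() and step(), after any gradient accumulation.

```python
import torch

model = torch.nn.Linear(32, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

loss = model(torch.randn(4, 32)).pow(2).mean()
loss.backward()
# Rescales all gradients in place if their global L2 norm exceeds 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```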
ZeRO (Zero Redundancy Optimizer)
Definition
Partitions optimizer states (Stage 1), gradients (Stage 2), and parameters (Stage 3) across GPU ranks.
Purpose
A 7B model needing 112GB in single-GPU training fits on 4xA100s with ZeRO-3. Core of DeepSpeed.
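A hedged sketch of a ZeRO-3 DeepSpeed configuration; the key names follow the DeepSpeed documentation, and the values are illustrative rather than tuned.

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,              # 1: shard optimizer states, 2: + gradients, 3: + parameters
        "overlap_comm": True,    # overlap collectives with compute
    },
}

# Typical usage (assumes an existing `model` and an installed deepspeed):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```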
Tensor Parallelism
Definition
Splits weight matrices column- and row-wise across devices using Megatron-style sharding.
Purpose
Allows individual layers wider than a single GPU's memory capacity; needs high-bandwidth, low-latency interconnect, so it is usually kept within a single node.
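A single-device toy sketch of the Megatron pattern: split the first matmul by columns and the second by rows; the final sum stands in for the all-reduce that real tensor parallelism performs across GPUs.

```python
import torch

x = torch.randn(2, 8)
W1 = torch.randn(8, 16)               # column-parallel weight
W2 = torch.randn(16, 8)               # row-parallel weight
W1a, W1b = W1[:, :8], W1[:, 8:]       # each "rank" holds half the columns
W2a, W2b = W2[:8, :], W2[8:, :]       # ...and the matching half of the rows

full = x @ W1 @ W2
sharded = (x @ W1a) @ W2a + (x @ W1b) @ W2b   # the "+" is the all-reduce
assert torch.allclose(full, sharded, atol=1e-5)
```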
Pipeline Parallelism
Definition
Slices the layer stack into stages on different GPU groups; micro-batches fill the pipeline.
Purpose
Scales model depth across many nodes. Micro-batch interleaving hides inter-stage bubble latency.
Data Parallelism
Definition
Each GPU holds a full model replica and processes a distinct data shard; gradients are all-reduced.
Purpose
The default distributed training strategy for models that fit on one device. Simplest to implement.
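A hedged sketch of one rank in a torchrun launch; the RANK/LOCAL_RANK/WORLD_SIZE environment variables are assumed to be set by the launcher.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")            # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model, device_ids=[local_rank])        # gradients are all-reduced during backward
# ...train as usual; each rank reads its own data shard, e.g. via a DistributedSampler
```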
FSDP (Fully Sharded Data Parallel)
Definition
Gathers weights for the current layer's forward/backward, then immediately shards and discards them.
Purpose
PyTorch-native ZeRO-3 without DeepSpeed dependency. Production standard for large-model training.
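A hedged sketch of wrapping a module with PyTorch FSDP; it assumes the process group is already initialized, as in the DDP sketch above.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

layer = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()
model = FSDP(
    layer,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)
# Parameters are all-gathered just for each forward/backward pass, then
# re-sharded; gradients and optimizer states stay sharded the whole time.
```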
FlashAttention
Definition
Fuses the Q×K^T matmul, softmax, and attention-weighted value product into tiled kernels that stay in on-chip SRAM, never writing the full N×N matrix to HBM.
Purpose
Makes 100K-token training feasible on A100s. The most impactful single kernel in modern LLM training.
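In PyTorch the same tiled approach is reachable through scaled_dot_product_attention, which dispatches to a FlashAttention backend on supported GPUs; a minimal sketch:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim), BF16 on GPU
q = torch.randn(1, 16, 4096, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 16, 4096, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 16, 4096, 64, device="cuda", dtype=torch.bfloat16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # fused; no N×N matrix in HBM
```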
FlashAttention-2 / FlashAttention-3
Definition
FA2 rewrites inner loops for better warp occupancy; FA3 adds FP8 support and async pipelining for H100s.
Purpose
Pushes training MFU from ~35% to 55%+. The current standard efficient attention implementation.
Ring Attention
Definition
Partitions the sequence dimension across GPUs in a ring; each processes local chunks while KV blocks circulate.
Purpose
Enables million-token training contexts by distributing the attention memory across the full cluster.
Sparse Attention
Definition
Restricts each token's receptive field to structured patterns: local windows, strided columns, or random blocks.
Purpose
Drops attention from O(n^2) to O(n√n) or better; essential for very long sequence pre-training.
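A toy sketch of one structured pattern, a causal sliding window: token i may attend only to the `window` most recent tokens.

```python
import torch

def local_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (i - j < window)       # True = attention allowed

print(local_causal_mask(seq_len=8, window=3).int())
```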
Linear Attention
Definition
Replaces the softmax kernel with a feature map that factorizes attention into sequential key-value products.
Purpose
True O(n) complexity. Expressiveness trade-off vs. softmax; used in hybrid architectures for long context.
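A toy sketch of the non-causal kernelized form with an elu(x)+1 feature map (in the style of Katharopoulos et al.): the d×d key-value summary is built once, so cost is linear in sequence length.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1            # positive feature maps
    kv = phi_k.transpose(-2, -1) @ v                     # (d x d) summary of keys and values
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # normalizer
    return (phi_q @ kv) / (z + eps)

q = k = v = torch.randn(2, 128, 64)                      # (batch, seq, dim)
out = linear_attention(q, k, v)                          # (2, 128, 64); no 128×128 matrix built
```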
RoPE (Rotary Position Embedding)
Definition
Multiplies query and key vectors by rotation matrices whose angle is proportional to token position.
Purpose
Relative positional information emerges naturally from the query-key dot product; the foundation for most context-extension methods.
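A toy sketch using the common "rotate the two halves of the channel dimension" convention; the same rotation is applied to queries and keys, so their dot product depends only on relative position.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    seq_len, dim = x.shape                                   # dim must be even
    half = dim // 2
    freqs = base ** (-torch.arange(half) / half)             # one frequency per channel pair
    angles = torch.arange(seq_len).unsqueeze(1) * freqs      # rotation angle grows with position
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = rope(torch.randn(16, 64))   # apply identically to keys before attention
```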
ALiBi (Attention with Linear Biases)
Definition
Subtracts a fixed linear bias from attention logits based on token distance; no position embedding params.
Purpose
Models trained with ALiBi extrapolate gracefully beyond training context length at zero inference cost.
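A toy sketch of the per-head bias for a causal model; the slope schedule follows the paper's geometric sequence when the head count is a power of two.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    i = torch.arange(seq_len).unsqueeze(1)      # query position
    j = torch.arange(seq_len).unsqueeze(0)      # key position
    dist = (i - j).clamp(min=0)                 # distance to past tokens only
    return -slopes.view(-1, 1, 1) * dist        # (heads, seq, seq); added to attention logits

bias = alibi_bias(n_heads=8, seq_len=16)
```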
AdamW
Definition
Adam with decoupled weight decay, applied directly to the weights so the decay does not interact with the adaptive gradient scaling.
Purpose
The de facto baseline optimizer for LLM training runs; built into PyTorch, Hugging Face Transformers, and JAX/Optax.
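A minimal sketch of a common AdamW setup for transformers: weight decay on 2-D weight matrices, none on biases and norm parameters (a widespread convention, not a requirement).

```python
import torch

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8)
decay, no_decay = [], []
for p in model.parameters():
    (decay if p.ndim >= 2 else no_decay).append(p)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=3e-4, betas=(0.9, 0.95), eps=1e-8,
)
```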
Muon
Definition
Orthogonalizes each weight matrix's momentum-averaged (Nesterov) gradient with a Newton-Schulz iteration before applying the update.
Purpose
Achieves faster loss decrease per step than AdamW on language model pre-training; emerging alternative.
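A toy sketch of the orthogonalization step using the classic cubic Newton-Schulz iteration; Muon itself uses a tuned quintic polynomial and folds this into a momentum update, so treat this as an illustration of the idea, not the exact algorithm.

```python
import torch

def orthogonalize(g: torch.Tensor, iters: int = 5) -> torch.Tensor:
    x = g / (g.norm() + 1e-7)            # keep singular values small enough for convergence
    for _ in range(iters):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # pushes all singular values toward 1
    return x

update = orthogonalize(torch.randn(256, 256))   # used in place of the raw momentum update
```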
Lion
Definition
Updates weights using only the sign of the Adam-style momentum term, discarding magnitude information.
Purpose
Needs half the optimizer-state memory of AdamW (one momentum buffer instead of two) while matching convergence on large-scale runs.
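A toy sketch of the Lion update for a single parameter tensor, following the published rule: the step is the sign of an interpolation of gradient and momentum, and only one state buffer is kept.

```python
import torch

@torch.no_grad()
def lion_step(p, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
    update = torch.sign(beta1 * m + (1 - beta1) * grad)   # direction only, no magnitude
    p.mul_(1 - lr * wd).add_(update, alpha=-lr)           # decoupled weight decay + step
    m.mul_(beta2).add_(grad, alpha=1 - beta2)             # momentum buffer update

p = torch.randn(64, 64)
lion_step(p, grad=torch.randn(64, 64), m=torch.zeros(64, 64))
```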
Sophia
Definition
Estimates the diagonal Hessian using Gauss-Newton-Bartlett samples to scale each parameter's update.
Purpose
Claims 2x wall-clock speedup over AdamW at equivalent loss; better curvature-aware scaling.
Learning Rate Warmup
Definition
Linearly increases the learning rate from near-zero to its target value over the first 1–2% of training steps.
Purpose
Prevents large initial gradient magnitudes from destabilizing embedding and first-layer weights early on.
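A minimal sketch with LambdaLR: the multiplier ramps linearly from near zero to 1 over the first warmup_steps optimizer steps.

```python
import torch

model = torch.nn.Linear(32, 32)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
warmup_steps = 2000

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
# call scheduler.step() once per optimizer step; the LR ramps to 3e-4, then stays flat
```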
Cosine Learning Rate Decay
Definition
Decays the learning rate following a half-cosine curve from peak to a small final value over training.
Purpose
Smooth monotonic decay outperforms step schedules on final perplexity at equivalent compute budget.
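A minimal sketch of the full warmup-plus-cosine multiplier as a plain function; holding the final LR at min_ratio of the peak is a common, illustrative choice.

```python
import math

def cosine_schedule(step, warmup_steps, total_steps, min_ratio=0.1):
    if step < warmup_steps:
        return (step + 1) / warmup_steps                      # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))
```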
Warmup-Stable-Decay (WSD) Schedule
Definition
Warms up, holds constant for most of training, then rapidly decays in the final phase.
Purpose
Mid-training checkpoints can be continued cheaply without re-running the decay schedule from scratch.
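A minimal sketch of the trapezoidal multiplier: linear warmup, a long plateau, then a linear decay over the last decay_frac of training.

```python
def wsd_schedule(step, warmup_steps, total_steps, decay_frac=0.1):
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # warmup
    if step < decay_start:
        return 1.0                                # stable plateau; checkpoints here are reusable
    return max(0.0, (total_steps - step) / (total_steps - decay_start))  # final decay
```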
Curriculum Learning
Definition
Ranks training examples by difficulty and exposes the model to easier data first, harder data later.
Purpose
Reported to improve final performance by 5–15% over random ordering at the same compute budget, with faster early convergence.
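A toy sketch of the ordering step, using length as a stand-in difficulty score; real pipelines use reference-model loss, perplexity, or heuristic quality signals instead.

```python
examples = ["a short easy sentence", "a much longer and more complicated training example", "hi"]
curriculum = sorted(examples, key=len)   # easiest (shortest) first, hardest last
```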
Sequence Packing
Definition
Concatenates multiple documents end-to-end with separator tokens to fill each context window completely.
Purpose
Eliminates padding waste; raises effective token utilization per batch from roughly 60% to nearly 100%. A free throughput gain.
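A toy sketch of greedy packing: documents are concatenated with an EOS separator and cut into fixed-length windows with no padding.

```python
def pack_documents(docs, eos_id, seq_len):
    stream, packed = [], []
    for doc in docs:                       # docs: lists of token ids
        stream.extend(doc + [eos_id])
        while len(stream) >= seq_len:
            packed.append(stream[:seq_len])
            stream = stream[seq_len:]
    return packed

print(pack_documents([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_id=0, seq_len=4))
# [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```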
BF16 (bfloat16)
Definition
16-bit float with the same 8-bit exponent as FP32 but only 7 mantissa bits; rarely overflows.
Purpose
The de-facto training dtype on A100/H100; matches FP32 loss curves without the instability of FP16.
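A quick way to see the trade-off in PyTorch: BF16 keeps FP32's dynamic range, while FP16 trades range for precision.

```python
import torch

for dt in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dt)
    print(dt, "max:", info.max, "eps:", info.eps)
# bfloat16 max ~3.4e38 (same range as fp32); float16 max is 65504, so it overflows easily
```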
Byte-Pair Encoding (BPE)
Definition
Byte-Pair Encoding: iteratively merges the most frequent adjacent byte pairs to build a vocabulary.
Purpose
Balances sequence length against vocabulary coverage. GPT-4 uses a ~100K-token BPE vocabulary; tokenizer choice matters.
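A toy sketch of the training loop on a tiny corpus of words represented as tuples of symbols; each round merges the most frequent adjacent pair (real tokenizers operate over bytes and far larger corpora).

```python
from collections import Counter

def bpe_merges(words, num_merges):
    vocab = Counter(tuple(w) for w in words)          # word -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq                 # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent pair becomes a new symbol
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1]); i += 2
                else:
                    merged.append(word[i]); i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(bpe_merges(["lower", "lowest", "newer", "wider"], num_merges=3))
```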
Data Deduplication
Definition
Removes near-duplicate documents using MinHash LSH or n-gram Bloom filters before any training begins.
Purpose
Prevents memorization of repeated web text; dramatically improves downstream generalization quality.
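A toy sketch of the underlying similarity test using character n-gram Jaccard overlap; production pipelines approximate this at scale with MinHash LSH.

```python
def ngrams(text, n=5):
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def jaccard(a, b):
    sa, sb = ngrams(a), ngrams(b)
    return len(sa & sb) / len(sa | sb)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
is_near_duplicate = jaccard(a, b) > 0.8   # the threshold is an illustrative choice
```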
Chinchilla Scaling Laws
Definition
Hoffmann et al. 2022: compute-optimal training pairs a model of N parameters with roughly 20×N tokens.
Purpose
Overturned the GPT-3 era of over-parameterized undertrained models. Now the baseline for run planning.
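A worked example of the rule of thumb, together with the standard C ≈ 6·N·D estimate of training FLOPs.

```python
n_params = 7e9                    # N: model size
tokens = 20 * n_params            # D: ~1.4e11 tokens (140B) for compute-optimal training
flops = 6 * n_params * tokens     # C: ~5.9e21 training FLOPs
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
```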
μP (Maximal Update Parametrization)
Definition
Rescales LR, weight init, and attention logits to remain stable as model width grows to any scale.
Purpose
Enables accurate hyperparameter transfer from a tiny proxy model to a 70B+ run; saves millions in tuning.
Weight Initialization
Definition
Sets the starting values of model parameters before training; He or scaled normal init is standard.
Purpose
Poor initialization causes gradient explosion in early steps; correct init is a prerequisite for convergence.
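A minimal sketch of a GPT-2-style scaled normal initialization applied with Module.apply (std=0.02 is the conventional value).

```python
import torch

def init_weights(module):
    if isinstance(module, (torch.nn.Linear, torch.nn.Embedding)):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    if isinstance(module, torch.nn.Linear) and module.bias is not None:
        torch.nn.init.zeros_(module.bias)

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64))
model.apply(init_weights)
```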
LayerDrop (Stochastic Depth)
Definition
Randomly drops entire transformer layers during training with probability increasing with layer depth.
Purpose
Acts as structural regularization; improves convergence and enables faster training for very deep models.
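A toy sketch of a residual block with stochastic depth: during training the block is skipped with a probability that grows with its index; at evaluation it always runs.

```python
import torch

class StochasticDepthBlock(torch.nn.Module):
    def __init__(self, block, drop_prob):
        super().__init__()
        self.block, self.drop_prob = block, drop_prob

    def forward(self, x):
        if self.training and torch.rand(()) < self.drop_prob:
            return x                          # skip the block; only the residual path remains
        return x + self.block(x)

depth = 12
blocks = torch.nn.Sequential(*[
    StochasticDepthBlock(torch.nn.Linear(64, 64), drop_prob=0.1 * i / (depth - 1))
    for i in range(depth)                     # deeper blocks are dropped more often
])
out = blocks(torch.randn(4, 64))
```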
Sequence Length Warmup
Definition
Trains on short sequences early in training and increases length progressively as training proceeds.
Purpose
Dramatically reduces early compute cost since attention scales quadratically with sequence length.
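A minimal sketch of a staged length schedule; the stage boundaries and lengths are purely illustrative.

```python
def seq_len_at(step, stages=((0, 512), (10_000, 1024), (50_000, 2048), (150_000, 4096))):
    length = stages[0][1]
    for start, n in stages:
        if step >= start:
            length = n        # use the longest stage whose start has been passed
    return length

print(seq_len_at(30_000))     # -> 1024
```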
Tokenizer Fertility
Definition
Measures the average number of tokens produced per word for a given tokenizer on target-language text.
Purpose
Low fertility means longer sequences and higher inference cost. Critical metric for multilingual models.
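A minimal sketch of the measurement; `tokenize` is a placeholder for whatever tokenizer encode function is being evaluated.

```python
def fertility(texts, tokenize):
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# e.g. fertility(sample_docs, tokenizer.encode) is roughly 1.2-1.5 for English with a
# BPE vocabulary, and often 2-4 for languages the vocabulary covers poorly
```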
Learning Rate Range Test
Definition
Runs a brief range test exponentially increasing LR while recording loss to find the optimal bracket.
Purpose
Identifies the best learning rate in minutes rather than days of expensive trial-and-error training runs.
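A minimal sketch on a toy model: sweep the LR exponentially over a few hundred steps, record (lr, loss), and pick a value just below where the loss starts to diverge.

```python
import torch

model = torch.nn.Linear(64, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-7)
lr_min, lr_max, steps = 1e-7, 1.0, 200
history = []

for step in range(steps):
    lr = lr_min * (lr_max / lr_min) ** (step / (steps - 1))   # exponential LR sweep
    for group in optimizer.param_groups:
        group["lr"] = lr
    loss = model(torch.randn(32, 64)).pow(2).mean()
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    history.append((lr, loss.item()))
# plot history and choose an LR just below the divergence point
```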