LLM Optimization Dictionary

Pre-Training Optimization

Pre-Training Optimization terms and explanations from the LLM Optimization Dictionary.

35 terms in this chapter
01

Mixed Precision Training

Definition

Uses BF16/FP16 for the forward and backward passes while keeping FP32 master weights for the optimizer update.

Purpose

Cuts GPU memory by 40% and doubles throughput on tensor cores with negligible accuracy cost.
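
A minimal PyTorch sketch of the pattern (the tiny Linear model, batch shapes, and hyperparameters are placeholders, and a CUDA GPU is assumed): the forward and backward math runs in BF16 under autocast while the parameters and optimizer state stay in FP32.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()         # stand-in for a transformer block
optimizer = torch.optim.AdamW(model.parameters())  # FP32 master weights and optimizer state

for step in range(10):
    x = torch.randn(8, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    # Matmuls inside autocast execute in BF16 on tensor cores.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()

# With FP16 instead of BF16, wrap backward/step in torch.cuda.amp.GradScaler
# to avoid gradient underflow.
```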

02

Gradient Checkpointing

Definition

Discards intermediate activations during forward pass and recomputes them on demand during backprop.

Purpose

Trades 33% extra compute for a 4–10x reduction in activation memory. Essential for large models.
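
A minimal sketch using torch.utils.checkpoint in recent PyTorch (the toy MLP stack is a placeholder): activations inside each wrapped block are dropped after the forward pass and recomputed during backprop.

```python
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU()) for _ in range(8)]
)

def forward(x: torch.Tensor) -> torch.Tensor:
    for block in blocks:
        # Intermediate activations are not kept; they are recomputed on demand
        # during the backward pass, trading extra compute for memory.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(4, 512, requires_grad=True)
forward(x).sum().backward()
```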

03

Gradient Accumulation

Definition

Defers the optimizer step across N mini-batches, summing gradients before updating weights.

Purpose

Emulates batch sizes too large to fit in VRAM, without extra hardware or communication overhead.
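
A minimal sketch of the update pattern (model, data, and step counts are placeholders): the loss is divided by the accumulation factor so the summed gradient matches the large-batch average.

```python
import torch

model = torch.nn.Linear(256, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
accum_steps = 8  # effective batch = micro-batch size * accum_steps

for step in range(64):
    x, y = torch.randn(4, 256), torch.randint(0, 10, (4,))  # tiny micro-batch
    # Scale the loss so accumulated gradients average over the effective batch.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()                                          # gradients sum into .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                                     # one update per 8 micro-batches
        optimizer.zero_grad(set_to_none=True)
```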

04

Gradient Clipping

Definition

Rescales the gradient vector when its L2 norm exceeds a fixed threshold, typically 1.0.

Purpose

The single most reliable guard against loss spikes and NaN explosions in deep transformer training.
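
In PyTorch this is a one-liner placed between backward() and the optimizer step; the toy model below is a placeholder.

```python
import torch

model = torch.nn.Linear(256, 256)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 256)).pow(2).mean()
loss.backward()
# Rescale all gradients in place if their global L2 norm exceeds 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```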

05

ZeRO Optimizer

Definition

Partitions optimizer states (Stage 1), gradients (Stage 2), and parameters (Stage 3) across GPU ranks.

Purpose

A 7B model needing 112GB in single-GPU training fits on 4xA100s with ZeRO-3. Core of DeepSpeed.
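
A hedged DeepSpeed-style configuration sketch for ZeRO-3 (field names follow DeepSpeed's documented JSON schema, but the values and the placeholder model are illustrative, and the script must be started with the deepspeed or torchrun launcher).

```python
import torch
import deepspeed  # assumes the deepspeed package is installed

model = torch.nn.Linear(4096, 4096)  # placeholder for a real transformer

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                     # shard params, grads, and optimizer state
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```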

06

Tensor Parallelism

Definition

Splits weight matrices column- and row-wise across devices using Megatron-style sharding.

Purpose

Allows individual layers wider than a single GPU's memory capacity; needs high-bandwidth, low-latency interconnects (e.g., NVLink) because all-reduces happen inside every layer.

07

Pipeline Parallelism

Definition

Slices the layer stack into stages on different GPU groups; micro-batches fill the pipeline.

Purpose

Scales model depth across many nodes. Micro-batch interleaving keeps stages busy and shrinks the pipeline "bubble" of idle time.

08

Data Parallelism

Definition

Each GPU holds a full model replica and processes a distinct data shard; gradients are all-reduced.

Purpose

The default distributed training strategy for models that fit on one device. Simplest to implement.
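
A minimal torchrun-launched sketch (model, data, and sizes are placeholders): every rank holds a full replica, and DDP all-reduces gradients during backward().

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    dist.init_process_group("nccl")                  # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(8, 1024, device="cuda")      # each rank reads its own data shard
        loss = model(x).pow(2).mean()
        loss.backward()                               # gradients are all-reduced here
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```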

09

FSDP (Fully Sharded Data Parallel)

Definition

Gathers weights for the current layer's forward/backward, then immediately shards and discards them.

Purpose

PyTorch-native ZeRO-3 without DeepSpeed dependency. Production standard for large-model training.
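
A minimal torchrun-launched sketch (the toy model is a placeholder): wrapping with FSDP shards parameters across ranks; in practice an auto_wrap_policy wraps individual transformer blocks so only one block's full weights are materialized at a time.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main() -> None:
    dist.init_process_group("nccl")                       # launched via torchrun
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = torch.nn.Sequential(
        torch.nn.Linear(2048, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 2048)
    ).cuda()
    model = FSDP(model)                                   # parameters sharded across ranks
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 2048, device="cuda")
    model(x).pow(2).mean().backward()                     # weights gathered per unit, then freed
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```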

10

Flash Attention

Definition

Fuses the Q×Kᵀ product, softmax, and the multiplication by V into tiled kernels that operate in SRAM, never writing the full N×N attention matrix to HBM.

Purpose

Makes 100K-token training feasible on A100s. The most impactful single kernel in modern LLM training.
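
A hedged sketch: recent PyTorch versions expose fused attention through scaled_dot_product_attention, which dispatches to a FlashAttention kernel on supported GPUs (shapes and dtype below are illustrative; the dedicated flash-attn package provides the same kernels as a standalone library).

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); BF16 on a CUDA GPU enables the fused kernel.
q = torch.randn(1, 16, 4096, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Computes softmax(QK^T / sqrt(d)) V in on-chip tiles without materializing
# the 4096x4096 score matrix in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```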

11

Flash Attention 2/3

Definition

FA2 rewrites inner loops for better warp occupancy; FA3 adds FP8 support and async pipelining for H100s.

Purpose

Pushes training MFU from ~35% to 55%+. The current standard efficient attention implementation.

12

Ring Attention

Definition

Partitions the sequence dimension across GPUs in a ring; each processes local chunks while KV blocks circulate.

Purpose

Enables million-token training contexts by distributing the attention memory across the full cluster.

13

Sparse Attention

Definition

Restricts each token's receptive field to structured patterns: local windows, strided columns, or random blocks.

Purpose

Drops attention cost from O(n²) to O(n·√n) or better; essential for very long sequence pre-training.

14

Linear Attention

Definition

Replaces the softmax kernel with a feature map that factorizes attention into sequential key-value products.

Purpose

True O(n) complexity. Expressiveness trade-off vs. softmax; used in hybrid architectures for long context.
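
A minimal non-causal sketch of kernelized linear attention in the style of Katharopoulos et al. (the elu+1 feature map and tensor shapes are illustrative; causal training instead maintains a running prefix sum of the key-value products).

```python
import torch

def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (batch, heads, seq, dim). O(n) in sequence length, non-causal form."""
    phi = lambda t: torch.nn.functional.elu(t) + 1            # positive feature map replacing softmax
    q, k = phi(q), phi(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                # sum of key-value outer products
    z = 1.0 / torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2))  # per-query normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

q = torch.randn(1, 8, 128, 64)
out = linear_attention(q, torch.randn_like(q), torch.randn_like(q))  # (1, 8, 128, 64)
```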

15

RoPE

Definition

Multiplies query and key vectors by rotation matrices whose angle is proportional to token position.

Purpose

Relative positional information emerges naturally from the query-key dot product; the foundation for most context-extension techniques.
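
A minimal sketch of the rotate-by-halves formulation (shapes, base frequency, and naming are illustrative): each pair of channels is rotated by an angle proportional to the token's position, so rotated q·k depends only on the relative offset.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (..., seq_len, head_dim)."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    # One rotation frequency per channel pair, geometrically spaced.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 8, 128, 64)   # (batch, heads, seq, head_dim)
q_rot = rope(q)                   # applied identically to keys before attention
```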

16

ALiBi

Definition

Subtracts a fixed linear bias from attention logits based on token distance; no position embedding params.

Purpose

Models trained with ALiBi extrapolate gracefully beyond training context length at zero inference cost.
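
A minimal sketch of the bias matrix (the head count, length, and geometric slope schedule follow the recipe from the ALiBi paper, but the helper name is ours): the bias is simply added to the attention logits before softmax.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance penalty, shape (heads, seq, seq)."""
    # Geometric slope schedule: head i gets slope 2^(-8(i+1)/n_heads).
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)  # -(i - j) for past keys, 0 elsewhere
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(n_heads=8, seq_len=16)
# scores = q @ k.transpose(-2, -1) / head_dim**0.5 + bias, then causal mask and softmax as usual.
```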

17

AdamW

Definition

Adam with weight decay decoupled from the gradient update, so the decay is applied directly to the weights and is not rescaled by the adaptive learning rates.

Purpose

The universal baseline optimizer for every LLM training run. Default in PyTorch, Transformers, and JAX.

18

Muon Optimizer

Definition

Applies Nesterov momentum to the gradient, then orthogonalizes the resulting update matrix with a Newton-Schulz iteration before applying it.

Purpose

Achieves faster loss decrease per step than AdamW on language model pre-training; emerging alternative.

19

Lion Optimizer

Definition

Updates weights using only the sign of the Adam-style momentum term, discarding magnitude information.

Purpose

Stores a single momentum buffer instead of AdamW's two states, roughly halving optimizer memory while matching convergence on large-scale runs.

20

Sophia Optimizer

Definition

Estimates the diagonal Hessian using Gauss-Newton-Bartlett samples to scale each parameter's update.

Purpose

Claims 2x wall-clock speedup over AdamW at equivalent loss; better curvature-aware scaling.

21

LR Warmup

Definition

Linearly increases the learning rate from near-zero to its target value over the first 1–2% of training steps.

Purpose

Prevents large initial gradient magnitudes from destabilizing embedding and first-layer weights early on.
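
A minimal sketch using LambdaLR (peak LR and step counts are placeholders): the multiplier ramps linearly from near zero to 1 over the warmup window.

```python
import torch

model = torch.nn.Linear(128, 128)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # peak learning rate
warmup_steps = 200

def warmup_lambda(step: int) -> float:
    # Linear ramp from ~0 to the peak LR over the first warmup_steps updates.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)
# Call scheduler.step() once after every optimizer.step().
```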

22

Cosine Annealing

Definition

Decays the learning rate following a half-cosine curve from peak to a small final value over training.

Purpose

Smooth monotonic decay outperforms step schedules on final perplexity at equivalent compute budget.
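
In PyTorch the schedule is a built-in (the horizon and floor values below are placeholders); it is usually chained after a warmup phase.

```python
import torch

model = torch.nn.Linear(128, 128)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# Decay from the peak LR to eta_min along a half cosine over 100k steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000, eta_min=3e-5)
```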

23

Warmup-Stable-Decay (WSD)

Definition

Warms up, holds constant for most of training, then rapidly decays in the final phase.

Purpose

Mid-training checkpoints can be continued cheaply without re-running the decay schedule from scratch.
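
A minimal sketch of the schedule as a pure function (phase lengths and peak LR are placeholders): warmup, a long flat plateau, then a short linear decay at the end.

```python
def wsd_lr(step: int, peak_lr: float = 3e-4, warmup: int = 500,
           total: int = 100_000, decay: int = 10_000) -> float:
    """Warmup-Stable-Decay learning rate at a given step."""
    if step < warmup:                    # warmup: linear ramp to the peak
        return peak_lr * (step + 1) / warmup
    if step < total - decay:             # stable: hold the peak LR
        return peak_lr
    remaining = total - step             # decay: linear ramp down over the final window
    return peak_lr * max(remaining, 0) / decay

print(wsd_lr(100), wsd_lr(50_000), wsd_lr(99_000))
```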

24

Curriculum Learning

Definition

Ranks training examples by difficulty and exposes the model to easier data first, harder data later.

Purpose

Improves final performance 5–15% vs. random order on the same compute budget; faster early convergence.

25

Data Packing

Definition

Concatenates multiple documents end-to-end with separator tokens to fill each context window completely.

Purpose

Eliminates padding waste; raises effective batch token utilization from 60% to 99%. Free throughput gain.
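
A minimal sketch over token-ID lists (the separator ID and toy documents are placeholders): documents are streamed into one buffer and sliced into full-length windows.

```python
from typing import Iterable

def pack(docs: Iterable[list[int]], context_len: int, eos_id: int) -> list[list[int]]:
    """Concatenate tokenized documents with an EOS separator and cut into full windows."""
    stream: list[int] = []
    for tokens in docs:
        stream.extend(tokens)
        stream.append(eos_id)                        # document boundary marker
    # Drop the ragged tail; every emitted sequence is exactly context_len tokens, no padding.
    return [stream[i:i + context_len]
            for i in range(0, len(stream) - context_len + 1, context_len)]

print(pack([[5, 6, 7], [8, 9], [10, 11, 12, 13]], context_len=4, eos_id=0))
# -> [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```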

26

BFloat16 (BF16)

Definition

16-bit float with the same 8-bit exponent as FP32 but only 7 mantissa bits; rarely overflows.

Purpose

The de-facto training dtype on A100/H100; matches FP32 loss curves without the instability of FP16.

27

BPE Tokenizer

Definition

Byte-Pair Encoding: iteratively merges the most frequent adjacent byte pairs to build a vocabulary.

Purpose

Balances sequence length against vocabulary coverage. GPT-4 uses a ~100K-token BPE vocabulary; tokenizer choice matters.

28

Data Deduplication

Definition

Removes near-duplicate documents using MinHash LSH or n-gram Bloom filters before any training begins.

Purpose

Prevents memorization of repeated web text; dramatically improves downstream generalization quality.
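
A toy sketch of the underlying idea using exact n-gram hashing and Jaccard similarity (thresholds and helper names are ours); production pipelines replace the pairwise comparison with MinHash LSH so candidates are found without comparing every pair.

```python
import hashlib

def ngram_hashes(text: str, n: int = 5) -> set[int]:
    """Hashes of all word n-grams in a document."""
    words = text.lower().split()
    return {
        int(hashlib.md5(" ".join(words[i:i + n]).encode()).hexdigest(), 16)
        for i in range(max(len(words) - n + 1, 1))
    }

def near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Jaccard similarity over n-gram hash sets; MinHash approximates this at scale."""
    ha, hb = ngram_hashes(a), ngram_hashes(b)
    return len(ha & hb) / len(ha | hb) >= threshold

print(near_duplicate("the quick brown fox jumps over the lazy dog near the river",
                     "the quick brown fox jumps over the lazy dog near the river bank"))  # True
```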

29

Chinchilla Scaling Laws

Definition

Hoffmann et al. (2022): compute-optimal training pairs a model of N parameters with roughly 20×N tokens.

Purpose

Overturned the GPT-3 era of over-parameterized undertrained models. Now the baseline for run planning.
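
The rule of thumb as arithmetic (a sketch; the 20 tokens-per-parameter ratio is the headline Chinchilla figure, and real run planning also weighs data quality and inference cost):

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal number of training tokens for a model with n_params parameters."""
    return tokens_per_param * n_params

# A 7B-parameter model is compute-optimal at roughly 140B training tokens.
print(f"{chinchilla_tokens(7e9) / 1e9:.0f}B tokens")
```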

30

MuP (Maximal Update Parametrization)

Definition

Rescales LR, weight init, and attention logits to remain stable as model width grows to any scale.

Purpose

Enables accurate hyperparameter transfer from a tiny proxy model to a 70B+ run; saves millions in tuning.

31

Weight Initialization

Definition

Sets the starting values of model parameters before training; He or scaled normal init is standard.

Purpose

Poor initialization causes gradient explosion in early steps; correct init is a prerequisite for convergence.

32

Stochastic Depth

Definition

Randomly drops entire transformer layers during training with probability increasing with layer depth.

Purpose

Acts as structural regularization; improves convergence and enables faster training for very deep models.
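
A minimal sketch of layer dropping around a residual block (the wrapper class and linearly increasing probabilities are illustrative, and the 1/(1-p) rescaling used in some implementations is omitted for brevity):

```python
import torch

class DropPathBlock(torch.nn.Module):
    def __init__(self, inner: torch.nn.Module, drop_prob: float):
        super().__init__()
        self.inner, self.drop_prob = inner, drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # During training, skip the whole residual branch with probability drop_prob.
        if self.training and torch.rand(()) < self.drop_prob:
            return x
        return x + self.inner(x)

# Drop probability grows linearly with depth: later layers are skipped more often.
blocks = torch.nn.Sequential(
    *[DropPathBlock(torch.nn.Linear(256, 256), drop_prob=0.1 * i / 7) for i in range(8)]
)
out = blocks(torch.randn(4, 256))
```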

33

Sequence Length Scheduling

Definition

Trains on short sequences early in training and increases length progressively as training proceeds.

Purpose

Dramatically reduces early compute cost since attention scales quadratically with sequence length.

34

Tokenizer Fertility

Definition

Measures the average number of tokens produced per word for a given tokenizer on target-language text.

Purpose

High fertility means longer sequences and higher inference cost. Critical metric for multilingual models.
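
A minimal sketch of the metric (the encode callable stands in for a real tokenizer's encode method; the character-level stand-in only illustrates the calculation):

```python
def fertility(encode, texts: list[str]) -> float:
    """Average number of tokens per whitespace-separated word over a text sample."""
    n_tokens = sum(len(encode(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Any text -> token list function works; a crude character-level stand-in for illustration:
char_encode = lambda t: list(t)
print(fertility(char_encode, ["tokenizer fertility matters"]))  # 9.0 tokens per word
```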

35

Learning Rate Finder

Definition

Runs a brief range test exponentially increasing LR while recording loss to find the optimal bracket.

Purpose

Identifies the best learning rate in minutes rather than days of expensive trial-and-error training runs.
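
A minimal sketch of the range test on a toy regression problem (model, data, and bounds are placeholders): the LR is swept exponentially while the loss is recorded, and the chosen LR sits a bit below the point where the loss blows up.

```python
import torch

model = torch.nn.Linear(256, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-7)

lr_min, lr_max, n_steps = 1e-7, 1.0, 100
lrs, losses = [], []
for step in range(n_steps):
    # Exponential sweep from lr_min to lr_max.
    lr = lr_min * (lr_max / lr_min) ** (step / (n_steps - 1))
    for group in optimizer.param_groups:
        group["lr"] = lr
    x, y = torch.randn(32, 256), torch.randn(32, 1)
    loss = (model(x) - y).pow(2).mean()
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())
# Plot losses vs. lrs and pick an LR just below where the curve starts rising sharply.
```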
