Pruning
Definition
Systematically removes parameters, heads, or layers with low importance, then optionally fine-tunes.
Purpose
Structured pruning delivers real GPU speedups; unstructured pruning needs sparse kernels to see practical benefit.
Structured Pruning
Definition
Removes entire computational units: attention heads, MLP neurons, or full transformer layers.
Purpose
Reduces the dense computational graph. Achieves real latency improvements without sparse hardware support.
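A minimal sketch of one flavor of structured pruning, assuming PyTorch: the lowest-importance intermediate neurons of an FFN block are sliced out, so the remaining matrices are smaller but still dense. The importance heuristic, layer sizes, and keep ratio here are illustrative.

```python
import torch
import torch.nn as nn

def prune_ffn_neurons(up: nn.Linear, down: nn.Linear, keep_ratio: float = 0.75):
    """Drop the lowest-importance intermediate neurons of an FFN block.

    Importance here is the L2 norm of each neuron's outgoing weights,
    a simple stand-in for activation- or gradient-based scores.
    """
    d_ff = up.out_features
    n_keep = int(d_ff * keep_ratio)

    importance = down.weight.norm(dim=0)               # one score per intermediate neuron
    keep = importance.topk(n_keep).indices.sort().values

    new_up = nn.Linear(up.in_features, n_keep, bias=up.bias is not None)
    new_down = nn.Linear(n_keep, down.out_features, bias=down.bias is not None)
    new_up.weight.data = up.weight.data[keep]          # slice rows of the up-projection
    new_down.weight.data = down.weight.data[:, keep]   # slice columns of the down-projection
    if up.bias is not None:
        new_up.bias.data = up.bias.data[keep]
    if down.bias is not None:
        new_down.bias.data = down.bias.data.clone()
    return new_up, new_down

up, down = nn.Linear(512, 2048), nn.Linear(2048, 512)
up, down = prune_ffn_neurons(up, down, keep_ratio=0.75)   # 2048 -> 1536 neurons, dense matmuls
```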
Unstructured Pruning
Definition
Masks individual weights to zero, creating irregular sparsity in weight matrices.
Purpose
Requires 50–80% sparsity before practical GPU speedup. More effective on CPUs with sparse BLAS support.
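A short sketch using PyTorch's built-in pruning utilities (torch.nn.utils.prune) to mask individual weights; the sparsity amount and layer size are illustrative. Note that the tensor stays dense, which is why sparse kernels are needed for any speedup.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4096, 4096)

# Zero out the 60% of weights with the smallest L1 magnitude (irregular sparsity).
prune.l1_unstructured(layer, name="weight", amount=0.6)

# Make the mask permanent and check the achieved sparsity.
prune.remove(layer, "weight")
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")   # ~60%, but the tensor is still stored densely
```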
Magnitude Pruning
Definition
Removes the weights with the smallest absolute values, either a fixed fraction or all weights below a magnitude threshold.
Purpose
Competitive at 10–30% sparsity; degrades sharply at high sparsity without subsequent retraining.
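A minimal from-scratch sketch of global magnitude pruning on a single tensor, assuming PyTorch; the sparsity level is illustrative.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero the `sparsity` fraction of weights with the smallest absolute value."""
    threshold = weight.abs().quantile(sparsity)
    mask = weight.abs() > threshold
    return weight * mask

w = torch.randn(1024, 1024)
w_pruned = magnitude_prune(w, sparsity=0.3)      # 30% of entries set to zero
print((w_pruned == 0).float().mean())            # ~0.30
```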
SparseGPT
Definition
Frames pruning as sparse regression: selects the weights to zero that change the layer output the least, computed via the inverse Hessian.
Purpose
Achieves 50% unstructured sparsity in LLMs with under 2 PPL increase. No retraining required.
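The actual SparseGPT algorithm prunes column by column and folds the resulting error back into the remaining weights; the heavily simplified sketch below (assumed shapes, dense matrix inverse, no weight update) only shows the inverse-Hessian saliency that drives the selection.

```python
import torch

def obs_saliency(W: torch.Tensor, X: torch.Tensor, damp: float = 1e-2) -> torch.Tensor:
    """Per-weight saliency  w_ij^2 / [H^-1]_jj  with layer Hessian H = X X^T.

    W: (out_features, in_features) layer weights
    X: (in_features, n_samples) calibration inputs
    """
    H = X @ X.T
    H += damp * torch.mean(torch.diag(H)) * torch.eye(H.shape[0])   # damping for invertibility
    H_inv_diag = torch.linalg.inv(H).diag()
    return W.pow(2) / H_inv_diag            # low saliency -> safe to prune

W = torch.randn(512, 512)
X = torch.randn(512, 128)                    # small calibration batch
scores = obs_saliency(W, X)
mask = scores >= scores.median(dim=1, keepdim=True).values   # keep ~top 50% per output row
W_sparse = W * mask
```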
Wanda
Definition
Prunes weights by the score |w_{ij}| \cdot \|x_j\|_2, combining weight magnitude with input activation norms as the importance signal.
Purpose
Rivals SparseGPT quality in a single data pass with no Hessian computation. 10x faster to apply.
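A minimal sketch of the Wanda score and a per-row 50% mask, assuming PyTorch; the calibration batch and sizes are illustrative.

```python
import torch

def wanda_scores(W: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """Wanda importance: |w_ij| * ||x_j||_2, using per-feature input activation norms.

    W: (out_features, in_features), X: (n_tokens, in_features) calibration activations.
    """
    return W.abs() * X.norm(dim=0)            # broadcasts the per-column activation norm

W = torch.randn(1024, 1024)
X = torch.randn(256, 1024)                    # a handful of calibration tokens
scores = wanda_scores(W, X)

# Prune 50% of the weights in every output row (Wanda compares weights per output).
k = W.shape[1] // 2
threshold = scores.kthvalue(k, dim=1, keepdim=True).values
W_sparse = W * (scores > threshold)
```

No Hessian, no inverse, one pass over the calibration data: that is where the speed advantage over SparseGPT comes from.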
Knowledge Distillation
Definition
Minimizes KL divergence between student and teacher output distributions using soft probability targets.
Purpose
Soft labels contain dark knowledge about class similarities that hard one-hot labels completely discard.
Soft-Label Distillation
Definition
The student minimizes cross-entropy against the teacher's full probability distribution, not just the argmax class.
Purpose
Temperature T=4–8 softens the teacher distribution and amplifies the dark knowledge training signal.
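A minimal sketch of the combined distillation objective, assuming PyTorch: temperature-softened KL divergence against the teacher plus ordinary cross-entropy, with the usual T^2 scaling. The alpha weight, temperature, and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.5):
    """Blend the soft-target KL term (the 'dark knowledge' signal) with hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student_logits = torch.randn(8, 32000)   # batch of 8, vocabulary of 32k
teacher_logits = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels, T=4.0)
```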
Feature Distillation
Definition
Adds L2 or cosine similarity loss between student and teacher hidden states at intermediate layers.
Purpose
Provides richer gradients than output-only distillation. Particularly effective for BERT-scale compression.
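A sketch of a hidden-state distillation loss, assuming PyTorch and a linear projection to bridge mismatched widths; the dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenStateLoss(nn.Module):
    """L2 plus cosine loss between student and teacher hidden states at matched layers."""
    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        # Projection maps the (narrower) student width to the teacher width.
        self.proj = nn.Linear(d_student, d_teacher)

    def forward(self, h_student, h_teacher):
        h = self.proj(h_student)
        mse = F.mse_loss(h, h_teacher)
        cos = 1.0 - F.cosine_similarity(h, h_teacher, dim=-1).mean()
        return mse + cos

loss_fn = HiddenStateLoss(d_student=384, d_teacher=768)
h_s = torch.randn(8, 128, 384)     # (batch, seq, hidden) from a student layer
h_t = torch.randn(8, 128, 768)     # matching teacher layer
loss = loss_fn(h_s, h_t)
```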
Layer Skipping
Definition
Skips transformer layers during inference using a static schedule derived from offline sensitivity analysis.
Purpose
Removes 20–30% of layers with a 1–2 point PPL increase; subsequent fine-tuning recovers most of the loss.
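A toy sketch of static layer skipping; the stand-in blocks and the skip set (which in practice would come from offline sensitivity analysis) are illustrative.

```python
import torch
import torch.nn as nn

def run_with_layer_skip(layers: nn.ModuleList, x, skip: set):
    """Run a transformer stack but skip the layer indices in `skip`."""
    for i, layer in enumerate(layers):
        if i in skip:
            continue               # identity: pass the hidden state straight through
        x = layer(x)
    return x

# Toy blocks standing in for transformer layers.
layers = nn.ModuleList(nn.Linear(512, 512) for _ in range(32))
x = torch.randn(4, 512)
out = run_with_layer_skip(layers, x, skip={22, 25, 27, 29, 30, 31})  # drop ~20% of layers
```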
Early Exit
Definition
Attaches an exit head to each layer; a confidence classifier decides whether to generate from that layer.
Purpose
Reduces average compute per token by 30–60% on easy tokens. Hard tokens still use all layers.
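A toy early-exit sketch, assuming PyTorch: each stand-in layer gets its own exit head and a simple max-probability confidence test. The threshold and sizes are illustrative; real systems train the exit heads and often use calibrated confidence estimators.

```python
import torch
import torch.nn as nn

class EarlyExitStack(nn.Module):
    """Toy stack: a token leaves as soon as some layer's exit head is confident enough."""
    def __init__(self, n_layers=12, d=512, vocab=32000, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(d, d) for _ in range(n_layers))
        self.exits = nn.ModuleList(nn.Linear(d, vocab) for _ in range(n_layers))
        self.threshold = threshold

    def forward(self, x):
        for layer, exit_head in zip(self.layers, self.exits):
            x = torch.relu(layer(x))
            probs = exit_head(x).softmax(dim=-1)
            if probs.max().item() >= self.threshold:   # confident: stop early
                return probs
        return probs                                   # hard input: used every layer

model = EarlyExitStack()
probs = model(torch.randn(1, 512))
```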
Attention Head Pruning
Definition
Removes the attention heads with the lowest importance scores (e.g. attention entropy or gradient saliency).
Purpose
30–40% of heads in BERT-scale models can be removed with near-zero downstream performance impact.
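A sketch of head pruning on fused attention projections, assuming a [head0 | head1 | ...] layout along the head dimension; the random scores and sizes are illustrative, and real pipelines would derive the scores from attention entropy or gradients.

```python
import torch

def prune_heads(weights, head_scores, n_heads, keep: int):
    """Slice the `keep` highest-scoring heads out of attention projection weights.

    weights: dict with 'q', 'k', 'v' of shape (d, d), heads laid out along the
             output dim, and 'o' of shape (d, d), heads laid out along the input dim.
    """
    d = weights["q"].shape[1]
    d_head = d // n_heads
    keep_idx = head_scores.topk(keep).indices.sort().values
    cols = torch.cat([torch.arange(h * d_head, (h + 1) * d_head) for h in keep_idx.tolist()])
    return {
        "q": weights["q"][cols, :],    # fewer query/key/value output rows ...
        "k": weights["k"][cols, :],
        "v": weights["v"][cols, :],
        "o": weights["o"][:, cols],    # ... and matching output-projection columns
    }

d, n_heads = 768, 12
weights = {k: torch.randn(d, d) for k in ("q", "k", "v", "o")}
scores = torch.rand(n_heads)                              # stand-in importance scores
pruned = prune_heads(weights, scores, n_heads, keep=8)    # 12 -> 8 heads
```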
FFN Pruning
Definition
Reduces the intermediate dimension d_{ff} of FFN sublayers based on neuron importance scores.
Purpose
The FFN holds roughly 66% of a transformer block's parameters; reducing d_{ff} from 4d to 2d cuts block parameters by about 33%.
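The arithmetic behind those figures, for one standard block with d_{ff} = 4d (biases and layer norms ignored):

```python
# Per-block parameter budget of a standard transformer layer.
d = 4096
attn   = 4 * d * d          # W_q, W_k, W_v, W_o
ffn_4d = 2 * d * (4 * d)    # up- and down-projection with d_ff = 4d
ffn_2d = 2 * d * (2 * d)    # after shrinking d_ff to 2d

total_before = attn + ffn_4d
total_after  = attn + ffn_2d
print(ffn_4d / total_before)             # ~0.67: FFN share of the block
print(1 - total_after / total_before)    # ~0.33: parameters removed by halving d_ff
```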
Low-Rank Factorization (SVD)
Definition
Factorizes W \approx U \Sigma V^T, keeping only the top-r singular values.
Purpose
Smooth accuracy-compression tradeoff; r tuned layer-by-layer based on singular value drop-off curves.
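A minimal sketch of replacing one linear layer with a rank-r factorization via truncated SVD, assuming PyTorch; the rank is illustrative and would normally be chosen from the singular value spectrum.

```python
import torch
import torch.nn as nn

def low_rank_factorize(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace one Linear with two smaller ones: W ~= (U_r S_r) V_r^T."""
    W = layer.weight.data                               # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = nn.Linear(layer.in_features, rank, bias=False)
    B = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    A.weight.data = Vh[:rank]                           # (rank, in_features)
    B.weight.data = U[:, :rank] * S[:rank]              # (out_features, rank)
    if layer.bias is not None:
        B.bias.data = layer.bias.data.clone()
    return nn.Sequential(A, B)

layer = nn.Linear(4096, 4096)
approx = low_rank_factorize(layer, rank=512)            # ~16.8M -> ~4.2M weights
```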
Weight Sharing
Definition
Multiple layers or sub-networks share a single weight tensor; parameters are tied rather than independent.
Purpose
ALBERT ties the weights of all 12 encoder layers, cutting encoder parameters roughly 12x while staying surprisingly competitive on GLUE.
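A toy sketch of cross-layer parameter sharing: one block's weights are reused at every depth, so the parameter count is that of a single block rather than twelve. The stand-in block and sizes are illustrative.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """ALBERT-style sharing: the same block parameters are applied at every depth."""
    def __init__(self, d=768, n_layers=12):
        super().__init__()
        self.block = nn.Linear(d, d)      # stand-in for a full transformer block
        self.n_layers = n_layers

    def forward(self, x):
        for _ in range(self.n_layers):    # same weights applied 12 times
            x = torch.relu(self.block(x))
        return x

model = SharedLayerEncoder()
n_params = sum(p.numel() for p in model.parameters())   # parameters of one block, not twelve
```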
Block Pruning
Definition
Removes entire transformer blocks based on block importance scores such as input-output cosine similarity.
Purpose
ShortGPT shows that later layers are often near-identity functions and can be removed with only modest accuracy loss, even without recovery fine-tuning.
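A sketch of scoring blocks by input-output cosine similarity, assuming PyTorch: near-identity blocks score close to zero and become the first removal candidates. The toy blocks and calibration batch are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def block_importance(layers, x):
    """Score each block by 1 - cos(input, output); near-identity blocks score near 0."""
    scores = []
    for layer in layers:
        y = layer(x)
        cos = F.cosine_similarity(x.flatten(1), y.flatten(1), dim=-1).mean()
        scores.append(1.0 - cos.item())
        x = y
    return scores

layers = nn.ModuleList(nn.Linear(512, 512) for _ in range(8))      # toy blocks
scores = block_importance(layers, torch.randn(16, 512))
drop = sorted(range(len(scores)), key=lambda i: scores[i])[:2]     # two most redundant blocks
```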
Lottery Ticket Hypothesis
Definition
A dense network contains a sparse subnetwork (a winning ticket) that, when trained in isolation from its original initialization, matches the accuracy of the full network.
Purpose
Theoretical foundation for pruning. Finding winning tickets at LLM scale remains an open research challenge.
Optimal Brain Compression (OBC)
Definition
A unifying framework that uses the layer Hessian to find the optimal weight perturbations for compression.
Purpose
GPTQ and SparseGPT are both special cases of OBC. The most principled compression framework available.
Activation Sparsity
Definition
Exploits the observation that ReLU-based models produce many zero activations during the forward pass.
Purpose
Fine-tuning SwiGLU models to use ReLU activations recovers 50%+ activation sparsity with minimal quality loss.
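A tiny sketch measuring activation sparsity in a ReLU FFN, assuming PyTorch; the sizes are illustrative, and for random inputs roughly half the intermediate activations come out exactly zero.

```python
import torch
import torch.nn as nn

# Fraction of intermediate activations that are exactly zero; sparse activations mean
# the corresponding down-projection columns can be skipped at inference time.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU())
x = torch.randn(1024, 512)
acts = ffn(x)
print((acts == 0).float().mean().item())   # roughly 0.5 for random inputs
```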
Definition
Compact distilled models trained on massive compute budgets to match larger teachers.
Purpose
TinyLlama (1.1B parameters, 3T tokens) matches much larger models, showing that compute-optimal small models are powerful.