LLM Optimization Dictionary

Compression Techniques

Compression Techniques terms and explanations from the LLM Optimization Dictionary.

20 terms in this chapter
01

Pruning

Definition

Systematically removes parameters, heads, or layers with low importance, then optionally fine-tunes.

Purpose

Structured pruning delivers real GPU speedups; unstructured needs sparse kernels for practical benefit.

02

Structured Pruning

Definition

Removes entire computational units: attention heads, MLP neurons, or full transformer layers.

Purpose

Shrinks the dense computation itself, so it achieves real latency improvements without sparse hardware support.

03

Unstructured Pruning

Definition

Masks individual weights to zero, creating irregular sparsity in weight matrices.

Purpose

Requires 50–80% sparsity before practical GPU speedup. More effective on CPUs with sparse BLAS support.
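
Example

A minimal sketch using PyTorch's built-in pruning utility; the layer size and the 60% amount are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Mask the 60% of weights with the smallest |w|, leaving an
# irregular (unstructured) sparsity pattern in the weight matrix.
prune.l1_unstructured(layer, name="weight", amount=0.6)
print((layer.weight == 0).float().mean())  # ~0.60

# Fold the mask into the weight tensor permanently.
prune.remove(layer, "weight")
```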

04

Magnitude Pruning

Definition

Removes a chosen fraction of weights with the smallest absolute values, i.e., those falling below a magnitude threshold.

Purpose

Competitive at 10–30% sparsity; degrades sharply at high sparsity without subsequent retraining.
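
Example

A threshold-based sketch in PyTorch; the tensor shape and 30% sparsity level are illustrative.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of weights with the smallest |w|."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    # Threshold at the k-th smallest absolute value in the tensor.
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

w = torch.randn(4096, 4096)
print((magnitude_prune(w, 0.3) == 0).float().mean())  # ~0.30
```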

05

SparseGPT

Definition

Frames pruning as sparse regression: selects weights to zero out and adjusts the survivors to minimize the change in layer output, using an approximate inverse Hessian.

Purpose

Achieves 50% unstructured sparsity in LLMs with under 2 PPL increase. No retraining required.

06

Wanda

Definition

Prunes weights by the score |W_{ij}| \cdot \|X_j\|_2, using input activation norms as the importance signal.

Purpose

Rivals SparseGPT quality in a single data pass with no Hessian computation. 10x faster to apply.
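
Example

A sketch of the scoring and per-row pruning step; `acts` is assumed to be a calibration batch of inputs to this layer.

```python
import torch

def wanda_prune(weight: torch.Tensor, acts: torch.Tensor,
                sparsity: float) -> torch.Tensor:
    """Wanda scoring: S_ij = |W_ij| * ||X_j||_2, pruned per output row.

    weight: (out_features, in_features)
    acts:   (n_samples, in_features) calibration inputs to this layer
    """
    x_norm = acts.norm(p=2, dim=0)           # per-input-channel norm
    scores = weight.abs() * x_norm           # broadcasts across rows
    k = int(weight.shape[1] * sparsity)
    # Zero the k lowest-scoring weights within each output row.
    idx = scores.topk(k, dim=1, largest=False).indices
    pruned = weight.clone()
    pruned.scatter_(1, idx, 0.0)
    return pruned
```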

07

Knowledge Distillation

Definition

Minimizes KL divergence between student and teacher output distributions using soft probability targets.

Purpose

Soft labels contain dark knowledge about class similarities that hard one-hot labels completely discard.

08

Response Distillation

Definition

Student minimizes cross-entropy against the teacher's full probability distribution, not just the argmax.

Purpose

Temperature T=4–8 softens the teacher distribution and amplifies the dark knowledge training signal.
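
Example

A minimal sketch of the temperature-softened KL loss from this entry and the previous one; the T² factor keeps gradient magnitudes comparable across temperatures.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 4.0):
    """KL divergence between temperature-softened distributions."""
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    # batchmean matches the mathematical definition of KL divergence;
    # T^2 compensates for the 1/T^2 gradient scaling from the softmax.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T ** 2)
```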

09

Feature Distillation

Definition

Adds L2 or cosine similarity loss between student and teacher hidden states at intermediate layers.

Purpose

Provides richer gradients than output-only distillation. Particularly effective for BERT-scale compression.
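
Example

A sketch of the two common loss variants; `proj` is a hypothetical learned adapter for when student and teacher hidden sizes differ.

```python
import torch.nn as nn
import torch.nn.functional as F

def feature_loss(h_student, h_teacher, proj: nn.Linear, cosine: bool = False):
    """Match intermediate hidden states between student and teacher."""
    h = proj(h_student)  # align student hidden size to the teacher's
    if cosine:
        # 1 - cosine similarity per token, averaged (DistilBERT-style)
        return (1 - F.cosine_similarity(h, h_teacher, dim=-1)).mean()
    return F.mse_loss(h, h_teacher)
```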

10

Layer Dropping

Definition

Skips transformer layers during inference using a static schedule from sensitivity analysis.

Purpose

Removes 20–30% of layers with 1–2 PPL increase. Combined with fine-tuning recovers most of the loss.
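
Example

A hypothetical wrapper illustrating a static drop schedule; each block is assumed to map hidden states to hidden states.

```python
import torch.nn as nn

class StaticLayerDrop(nn.Module):
    """Applies a stack of blocks, skipping indices in a static drop set
    chosen offline via sensitivity analysis."""
    def __init__(self, layers: nn.ModuleList, drop: set):
        super().__init__()
        self.layers, self.drop = layers, drop

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if i in self.drop:
                continue   # residual stream passes through untouched
            x = layer(x)
        return x
```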

11

Early Exit

Definition

Attaches an exit head to each layer; a confidence classifier decides whether to generate from that layer.

Purpose

Reduces average compute per token by 30–60% on easy tokens. Hard tokens still use all layers.
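
Example

A sketch of confidence-based exiting, assuming batch size 1 and one hypothetical exit head per layer.

```python
import torch

@torch.no_grad()
def early_exit_logits(x, layers, exit_heads, threshold: float = 0.9):
    """Run blocks until an exit head is confident enough.

    exit_heads[i] maps layer-i hidden states to vocab logits; the max
    softmax probability is used as the confidence signal.
    """
    for layer, head in zip(layers, exit_heads):
        x = layer(x)
        logits = head(x[:, -1])                  # last-token logits
        if logits.softmax(-1).max().item() >= threshold:
            return logits                        # easy token: exit early
    return logits                                # hard token: all layers
```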

12

Head Pruning

Definition

Removes the attention heads with lowest importance scores (attention entropy or gradient saliency).

Purpose

30–40% of heads in BERT-scale models can be removed with near-zero downstream performance impact.
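
Example

A sketch of one scoring signal named above, mean attention entropy per head; gradient-saliency scores are a common alternative.

```python
import torch

def head_entropy(attn_probs: torch.Tensor) -> torch.Tensor:
    """One simple importance proxy: mean attention entropy per head.

    attn_probs: (batch, n_heads, q_len, k_len) softmax attention weights.
    """
    ent = -(attn_probs * (attn_probs + 1e-9).log()).sum(dim=-1)
    return ent.mean(dim=(0, 2))   # (n_heads,)
```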

13

FFN Pruning

Definition

Reduces the intermediate dimension d_{ff} of FFN sublayers based on neuron importance scores.

Purpose

FFN contains 66% of transformer parameters. Reducing d_{ff} from 4d to 2d cuts params by 33%.
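
Example

A sketch that shrinks d_ff of a two-layer FFN; the norm-product importance score is one simple proxy among several used in practice.

```python
import torch
import torch.nn as nn

def prune_ffn(fc1: nn.Linear, fc2: nn.Linear, keep: int):
    """Shrink d_ff by keeping the `keep` highest-scoring neurons.
    Score: product of each neuron's input- and output-weight L2 norms."""
    score = fc1.weight.norm(dim=1) * fc2.weight.norm(dim=0)   # (d_ff,)
    idx = score.topk(keep).indices.sort().values
    new_fc1 = nn.Linear(fc1.in_features, keep, bias=fc1.bias is not None)
    new_fc2 = nn.Linear(keep, fc2.out_features, bias=fc2.bias is not None)
    new_fc1.weight.data = fc1.weight.data[idx].clone()
    new_fc2.weight.data = fc2.weight.data[:, idx].clone()
    if fc1.bias is not None:
        new_fc1.bias.data = fc1.bias.data[idx].clone()
    if fc2.bias is not None:
        new_fc2.bias.data = fc2.bias.data.clone()
    return new_fc1, new_fc2
```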

14

Matrix Decomposition (SVD)

Definition

Factorizes W \approx U \Sigma V^T, keeping only the top-r singular values.

Purpose

Smooth accuracy-compression tradeoff; r tuned layer-by-layer based on singular value drop-off curves.
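
Example

A sketch replacing a dense layer with its rank-r factorization; this saves parameters whenever r < d_in·d_out/(d_in + d_out).

```python
import torch
import torch.nn as nn

def low_rank_factorize(linear: nn.Linear, r: int) -> nn.Sequential:
    """Replace one Linear with two thin ones: W ~ (U_r S_r) V_r^T."""
    U, S, Vh = torch.linalg.svd(linear.weight.data, full_matrices=False)
    A = nn.Linear(linear.in_features, r, bias=False)
    B = nn.Linear(r, linear.out_features, bias=linear.bias is not None)
    A.weight.data = Vh[:r].clone()        # (r, in_features)
    B.weight.data = U[:, :r] * S[:r]      # (out_features, r)
    if linear.bias is not None:
        B.bias.data = linear.bias.data.clone()
    return nn.Sequential(A, B)
```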

15

Weight Sharing

Definition

Multiple layers or sub-networks share a single weight tensor; parameters are tied rather than independent.

Purpose

ALBERT ties one set of weights across all 12 transformer layers, cutting encoder parameters ~12x with surprisingly competitive GLUE benchmark scores.
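
Example

A hypothetical wrapper showing cross-layer sharing; `block` stands for any transformer layer mapping hidden states to hidden states.

```python
import torch.nn as nn

class SharedDepthEncoder(nn.Module):
    """ALBERT-style sharing: one block's weights reused at every depth."""
    def __init__(self, block: nn.Module, n_layers: int = 12):
        super().__init__()
        self.block, self.n_layers = block, n_layers

    def forward(self, x):
        for _ in range(self.n_layers):
            x = self.block(x)   # same parameters at every iteration
        return x
```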

16

Depth Pruning

Definition

Removes entire transformer blocks based on block importance scores such as input-output cosine similarity.

Purpose

ShortGPT shows later layers often act as near-identity functions and can be removed with only a minor accuracy drop.
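
Example

A sketch of the input-output cosine score; blocks with scores near zero behave like the identity and are removed first.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def block_importance(layers, x):
    """Score each block by 1 - cos(input, output)."""
    scores = []
    for layer in layers:
        y = layer(x)
        cos = F.cosine_similarity(x.flatten(1), y.flatten(1), dim=1).mean()
        scores.append(1.0 - cos.item())
        x = y
    return scores
```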

17

Lottery Ticket Hypothesis

Definition

A dense network contains a sparse subnetwork (a winning ticket) that, trained in isolation from the original initialization, matches the full network's accuracy.

Purpose

Theoretical foundation for pruning. Finding winning tickets at LLM scale remains an open research challenge.
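
Example

A sketch of iterative magnitude pruning with rewinding, the standard procedure for finding tickets; `train_fn` is a hypothetical masked training loop.

```python
import copy
import torch

def find_ticket(model, train_fn, rounds: int = 5, frac: float = 0.2):
    """Train -> prune 20% of surviving weights -> rewind to init -> repeat."""
    init = copy.deepcopy(model.state_dict())
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model, masks)                    # training with masks applied
        for n, p in model.named_parameters():
            alive = p[masks[n].bool()].abs()
            k = max(1, int(alive.numel() * frac))
            thresh = alive.kthvalue(k).values     # per-tensor threshold
            masks[n] *= (p.abs() > thresh).float()
        model.load_state_dict(init)               # rewind survivors to init
    return masks                                  # candidate winning ticket
```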

18

OBC (Optimal Brain Compression)

Definition

Unifying framework using the layer Hessian to find optimal weight perturbations for compression.

Purpose

GPTQ and SparseGPT are both special cases of OBC, making it one of the most principled post-training compression frameworks available.

19

Activation Sparsity

Definition

Exploits the observation that ReLU-based models have many zero activations during forward pass.

Purpose

Adding ReLU constraints to SwiGLU models via fine-tuning induces 50%+ activation sparsity with minimal quality loss.
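
Example

A diagnostic sketch measuring the zero fraction after each ReLU, the quantity this technique exploits.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def relu_sparsity(model: nn.Module, batch: torch.Tensor) -> float:
    """Average fraction of zeros after each ReLU, via forward hooks."""
    stats = []
    hooks = [m.register_forward_hook(
                 lambda _m, _i, out: stats.append((out == 0).float().mean().item()))
             for m in model.modules() if isinstance(m, nn.ReLU)]
    model(batch)
    for h in hooks:
        h.remove()
    return sum(stats) / len(stats) if stats else 0.0
```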

20

TinyLLaMA / DistilBERT

Definition

Compact models that approach much larger teachers: DistilBERT via distillation, TinyLLaMA via a massive training-token budget.

Purpose

TinyLLaMA (1.1B params, 3T tokens) rivals much larger models, showing that small models trained far beyond compute-optimal token counts are powerful.
