Evaluation & Efficiency Metrics terms and explanations from the LLM Optimization Dictionary.
Perplexity (PPL)
Definition
Exponential of the average negative log-likelihood per token over a held-out test corpus.
Purpose
Lower is better. A 1-point PPL reduction at scale represents a significant quality gain, often worth substantial compute.
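As a minimal sketch, perplexity can be computed directly from per-token log-probabilities (the function name and inputs here are illustrative, not from any particular library):

```python
import math

def perplexity(token_logprobs):
    # PPL = exp(mean negative log-likelihood per token)
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has
# mean NLL = -log(0.25), so PPL = 4 (a uniform 1-in-4 guess each step).
print(perplexity([math.log(0.25)] * 10))
```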
Bits Per Weight (BPW)
Definition
Average bits per model parameter after quantization. FP16 = 16, INT8 = 8, INT4 = 4 BPW.
Purpose
The universal compression metric. Lower BPW means more models fit in fixed VRAM; the quality tradeoff must be tracked.
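Effective BPW is simply total stored bits divided by parameter count. The grouped-quantization scheme below (group size 32, one FP16 scale per group) is a hypothetical illustration of why real INT4 formats land above 4.0 BPW:

```python
def bits_per_weight(total_bits, n_params):
    # Effective BPW = total bits stored / number of parameters
    return total_bits / n_params

# Hypothetical INT4 scheme: 4 bits per weight plus one FP16 scale
# per group of 32 weights -> 4 + 16/32 = 4.5 effective BPW
n_params = 7_000_000_000
total_bits = n_params * 4 + (n_params // 32) * 16
print(bits_per_weight(total_bits, n_params))  # → 4.5
```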
Memory Footprint
Definition
GPU HBM consumed by model weights, activations, KV cache, and optimizer states combined.
Purpose
The hard constraint governing what is deployable. Every optimization technique ultimately targets VRAM.
Roofline Model
Definition
Plots achievable FLOP/s against arithmetic intensity, bounded by the compute peak and memory bandwidth, on a log-log chart.
Purpose
Instantly reveals whether a workload is memory-bound (decode) or compute-bound (prefill). An essential analysis tool.
Arithmetic Intensity
Definition
FLOPs performed per byte of memory traffic for a given GPU operation.
Purpose
GEMM has high intensity; elementwise ops sit near or below 1 FLOP/byte. Determines position on the roofline chart.
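A back-of-the-envelope comparison, assuming BF16 (2 bytes/element) and counting only main-memory traffic (caches ignored for simplicity):

```python
def gemm_intensity(m, n, k, bytes_per_el=2):
    # 2*m*n*k FLOPs (multiply + add); read A (m*k) and B (k*n), write C (m*n)
    flops = 2 * m * n * k
    traffic = (m * k + k * n + m * n) * bytes_per_el
    return flops / traffic

def elementwise_add_intensity(n, bytes_per_el=2):
    # 1 FLOP per element; read two inputs, write one output
    return n / (3 * n * bytes_per_el)

print(gemm_intensity(4096, 4096, 4096))      # ≈ 1365 FLOP/byte: compute-bound
print(elementwise_add_intensity(1_000_000))  # ≈ 0.17 FLOP/byte: memory-bound
```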
Memory-Bound
Definition
GPU operations that stall waiting on HBM data transfers rather than on compute units (e.g., the decode phase).
Purpose
Solutions: quantization (fewer bytes), MQA/GQA (fewer KV bytes), and KV cache compression techniques.
Compute-Bound
Definition
Operations where the GPU's compute units are fully saturated and data arrives from memory faster than it is consumed (e.g., prefill).
Purpose
Solutions: kernel tiling (FlashAttention), kernel fusion, and tensor cores for GEMM acceleration.
Model FLOPs Utilization (MFU)
Definition
Ratio of measured FLOP/s to the theoretical GPU peak FLOP/s during training or inference.
Purpose
A100 training at 50% MFU is good. Below 30% signals a communication or memory bandwidth bottleneck.
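A rough MFU estimate for dense-model training, using the common approximation of ~6 FLOPs per parameter per token for forward plus backward; the throughput figure below is hypothetical:

```python
def training_mfu(tokens_per_sec, n_params, peak_flops):
    # ~6 FLOPs/param/token covers forward + backward for a dense model
    achieved_flops = 6 * n_params * tokens_per_sec
    return achieved_flops / peak_flops

# Hypothetical: 7B dense model at 4,000 tok/s on one A100
# (312 TFLOP/s BF16 peak) -> ~54% MFU
print(training_mfu(4_000, 7e9, 312e12))
```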
Hardware FLOPs Utilization (HFU)
Definition
MFU adjusted to account for the additional FLOPs from gradient recomputation during activation checkpointing.
Purpose
The honest efficiency metric when gradient checkpointing is enabled. Always report HFU for training.
Active Parameters
Definition
Parameters actually executed per token. Critical for MoE: a model can have 141B total parameters but only 39B active per forward pass.
Purpose
Determines true FLOPs and memory bandwidth per token. Total parameter count misleads; active parameters matter.
KV Cache Size
Definition
Grows as 2 × layers × kv_heads × head_dim × seq_len × batch_size × bytes per element (the factor of 2 covers keys and values).
Purpose
A 70B model (80 layers) with GQA (8 KV heads, d=128, BF16) at seq_len=32K uses roughly 10.7 GB per request. Plan capacity accordingly.
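The growth formula as code, checked against a Llama-2-70B-style configuration (80 layers, 8 KV heads under GQA, head_dim 128, BF16):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_el=2):
    # Factor of 2: one tensor each for keys and values per layer
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_el

gb = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=32_768) / 1e9
print(f"{gb:.1f} GB per request")  # → 10.7 GB per request
```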
MMLU
Definition
About 14K multiple-choice questions across 57 academic subjects measuring breadth of world knowledge.
Purpose
GPT-4 scores ~86%; the human expert baseline is ~89%. The standard breadth benchmark; it does not measure reasoning.
HumanEval
Definition
164 Python function-synthesis problems evaluated by executing unit tests against the generated code.
Purpose
Pass@k: the probability that at least one of k samples passes all tests. GPT-4 reaches Pass@1 of roughly 87%.
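Pass@k is usually computed with the unbiased estimator from the HumanEval (Codex) paper, 1 − C(n−c, k)/C(n, k), given c correct completions out of n samples:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimator: probability that at least one of k samples,
    # drawn without replacement from n, is among the c correct ones
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))   # → 0.25
print(pass_at_k(n=20, c=5, k=10))  # → ~0.984
```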
GSM8K
Definition
8,500 grade-school math word problems requiring multi-step arithmetic reasoning to solve.
Purpose
Chain-of-thought prompting dramatically improves scores. A primary benchmark for measuring emergent reasoning.
MT-Bench
Definition
80 multi-turn conversation questions judged by GPT-4 on a 1–10 scale for response quality.
Purpose
Correlates well with human preferences. The standard benchmark for evaluating chat assistant quality.
Elo Rating
Definition
Tournament-style ranking in which models compete pairwise and ratings update based on win/loss outcomes.
Purpose
LMSYS Chatbot Arena computes Elo over millions of human votes. The most reliable quality ranking for chat models.
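A minimal sketch of the standard Elo update rule (the K-factor of 32 is an illustrative choice; Chatbot Arena's actual rating pipeline is more involved):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    # score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Equal ratings, A wins: A gains 16 points, B loses 16
print(elo_update(1000, 1000, 1.0))  # → (1016.0, 984.0)
```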
Win Rate
Definition
Fraction of head-to-head comparisons in which model A is preferred over model B by human or AI judges.
Purpose
The direct metric for model-comparison studies. More actionable than benchmark scores for product decisions.
ROUGE / BLEU
Definition
ROUGE measures n-gram recall against references; BLEU measures n-gram precision, originally for machine translation.
Purpose
Widely used for summarization and machine translation respectively. Both are known to correlate poorly with human judgment at high quality levels.
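A toy illustration of the shared core idea, clipped n-gram overlap, where precision corresponds to BLEU's central count and recall to ROUGE-N (real implementations add brevity penalties, multiple n-gram orders, and smoothing):

```python
from collections import Counter

def ngram_precision_recall(candidate, reference, n=1):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum((cand & ref).values())  # clipped counts
    return overlap / sum(cand.values()), overlap / sum(ref.values())

prec, rec = ngram_precision_recall("the cat sat", "the cat sat on the mat")
print(prec, rec)  # → 1.0 0.5
```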