LLM Optimization Dictionary

Evaluation & Efficiency Metrics

Evaluation & Efficiency Metrics terms and explanations from the LLM Optimization Dictionary.

18 terms in this chapter
01

Perplexity (PPL)

Definition

Exponential of the average negative log-likelihood per token over a held-out test corpus.

Purpose

Lower is better. At scale, even a 1-point PPL reduction reflects a meaningful quality gain and can justify substantial extra compute.
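The definition above is a one-liner in practice. A minimal sketch (the function name is illustrative) that turns per-token log-probabilities into perplexity:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model assigning every token probability 0.25 has PPL 4:
lp = [math.log(0.25)] * 10
print(perplexity(lp))  # → 4.0
```

Intuitively, PPL 4 means the model is as uncertain as if it were choosing uniformly among 4 tokens at each step.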

02

BPW (Bits Per Weight)

Definition

Average bits per model parameter post-quantization. FP16=16, INT8=8, INT4=4 BPW.

Purpose

The universal compression metric. Lower BPW = more models fit in fixed VRAM; quality tradeoff must be tracked.
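The BPW-to-VRAM relationship is simple arithmetic; a small sketch (illustrative helper name) for weight memory alone, ignoring activations and KV cache:

```python
def weight_memory_gb(n_params, bpw):
    """Weight memory in GB for a model stored at `bpw` bits per weight."""
    return n_params * bpw / 8 / 1e9

# A 70B-parameter model at common BPW levels:
for bpw in (16, 8, 4):
    print(f"{bpw} BPW: {weight_memory_gb(70e9, bpw):.0f} GB")
# → 16 BPW: 140 GB, 8 BPW: 70 GB, 4 BPW: 35 GB
```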

03

VRAM Usage

Definition

GPU HBM consumed by model weights, activations, KV cache, and optimizer states combined.

Purpose

The hard constraint governing what is deployable. Most optimization techniques ultimately aim to reduce VRAM pressure.

04

Roofline Model

Definition

Plots FLOP/s vs. arithmetic intensity bounded by compute peak and memory bandwidth on a log-log chart.

Purpose

Instantly reveals whether a workload is memory-bound (decode) or compute-bound (prefill). Essential tool.

05

Arithmetic Intensity

Definition

FLOPs performed per byte of memory traffic for a given GPU operation.

Purpose

GEMM has high intensity; elementwise ops sit far lower, well under 1 FLOP per byte. Intensity determines an operation's position on the roofline chart.
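The roofline classification above reduces to comparing an operation's intensity against the machine balance point (peak FLOP/s divided by bandwidth). A sketch with illustrative A100-class numbers (312 TFLOP/s BF16 peak, ~2 TB/s HBM; function names are assumptions):

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def roofline_bound(intensity, peak_tflops, bandwidth_tb_s):
    """Memory-bound below the machine balance point, compute-bound above it."""
    machine_balance = peak_tflops / bandwidth_tb_s  # FLOPs per byte
    return "memory-bound" if intensity < machine_balance else "compute-bound"

# FP16 elementwise add: 1 FLOP per 6 bytes (read a, read b, write c).
print(roofline_bound(arithmetic_intensity(1, 6), 312, 2.0))  # → memory-bound

# Large square FP16 GEMM (n=4096): 2n^3 FLOPs over 3*n^2*2 bytes ≈ 1365 FLOP/B.
n = 4096
gemm_ai = arithmetic_intensity(2 * n**3, 3 * n * n * 2)
print(roofline_bound(gemm_ai, 312, 2.0))  # → compute-bound
```

The balance point here is 312/2 = 156 FLOP/byte: the elementwise add (0.17 FLOP/byte) is deep in memory-bound territory, while the big GEMM clears it by nearly 10x.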

06

Memory-Bound Operations

Definition

GPU operations that stall waiting on HBM data transfers rather than on compute units (e.g., the decode phase).

Purpose

Solutions: quantization (fewer bytes), MQA/GQA (fewer KV bytes), and KV compression techniques.

07

Compute-Bound Operations

Definition

Operations where the GPU is fully compute-saturated and memory arrives faster than consumed (e.g., prefill).

Purpose

Solutions: kernel tiling (Flash Attention), kernel fusion, and tensor cores for GEMM acceleration.

08

MFU (Model FLOP Utilization)

Definition

Ratio of measured FLOP/s to theoretical GPU peak FLOP/s during training or inference.

Purpose

A100 training at 50% MFU is good. Below 30% signals a communication or memory bandwidth bottleneck.
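A back-of-envelope MFU check uses the standard ~6N FLOPs-per-token training approximation. A sketch with illustrative numbers (function name and the 8-GPU scenario are assumptions):

```python
def mfu(params, tokens_per_sec, n_gpus, peak_tflops_per_gpu):
    """MFU ≈ achieved model FLOP/s over aggregate peak FLOP/s.

    Uses the ~6*N FLOPs-per-token training approximation
    (forward + backward pass, ignoring attention FLOPs).
    """
    achieved = 6 * params * tokens_per_sec
    peak = n_gpus * peak_tflops_per_gpu * 1e12
    return achieved / peak

# Illustrative: 70B model on 8 GPUs (312 TFLOP/s peak each), 3000 tokens/s total.
print(f"{mfu(70e9, 3000, 8, 312):.0%}")  # → 50%
```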

09

HFU (Hardware FLOP Utilization)

Definition

MFU adjusted to account for additional FLOPs from gradient recomputation during checkpointing.

Purpose

Reflects the FLOPs the hardware actually executed when gradient checkpointing is enabled; report it alongside MFU for training runs.

10

Active Parameters

Definition

Parameters actually executed per token. Critical for MoE: Mixtral-8x22B, for example, has 141B total parameters but only 39B active per forward pass.

Purpose

Determines true FLOPs and memory bandwidth per token. Total params misleads; active params matter.
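The distinction shows up directly in the ~2N FLOPs-per-token inference approximation (one multiply-add per active parameter). A small sketch, assuming the 141B/39B MoE figures from the definition above:

```python
def flops_per_token(active_params):
    """Inference FLOPs per token ≈ 2 * active parameters (one multiply-add each)."""
    return 2 * active_params

# MoE with 141B total / 39B active parameters:
naive = flops_per_token(141e9)  # what the total-param count would suggest
actual = flops_per_token(39e9)  # what actually executes per token
print(f"{naive / actual:.1f}x overestimate from using total params")  # → 3.6x
```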

11

KV Cache Size Formula

Definition

Grows as 2 × layers × kv_heads × head_dim × seq_len × batch_size × bytes_per_element, where the leading 2 accounts for storing both K and V.

Purpose

A 70B-class model with GQA (80 layers, 8 KV heads, head_dim=128, BF16) at seq_len=32K uses roughly 11 GB per request. Plan capacity accordingly.
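The formula above translates directly to code. A sketch assuming a Llama-3-70B-like shape (80 layers, 8 KV heads, head_dim=128, BF16 at 2 bytes per element; function name is illustrative):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """2 (K and V) x layers x KV heads x head_dim x seq_len x batch x bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

gb = kv_cache_bytes(80, 8, 128, 32_768, 1) / 1e9
print(f"{gb:.1f} GB per 32K-token request")  # → 10.7 GB
```

Note how GQA drives the number: with full MHA (64 KV heads instead of 8) the same request would need 8x as much cache.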

12

MMLU

Definition

14K multiple-choice questions across 57 academic subjects measuring breadth of world knowledge.

Purpose

GPT-4 scores 86%; human expert baseline 89%. Standard breadth benchmark; does not measure reasoning.

13

HumanEval

Definition

164 Python function synthesis problems evaluated by executing unit tests against generated code.

Purpose

Pass@k: probability that at least one of k samples passes all tests. GPT-4 originally reported ~67% Pass@1; newer frontier models score above 85%.
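Pass@k is usually computed with the unbiased estimator from the HumanEval paper: draw n samples, count c correct, then pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0  # fewer failures than k => some correct sample always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples per problem, 50 pass the unit tests:
print(f"pass@1  = {pass_at_k(200, 50, 1):.2f}")  # → 0.25
print(f"pass@10 = {pass_at_k(200, 50, 10):.2f}")
```

Averaging this estimate over all 164 problems gives the benchmark score; the estimator avoids the high variance of literally sampling k completions.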

14

GSM8K

Definition

8500 grade-school math word problems requiring multi-step arithmetic reasoning to solve.

Purpose

Chain-of-thought prompting dramatically improves scores. Primary benchmark for measuring emergent reasoning.

15

MT-Bench

Definition

80 multi-turn conversation questions judged by GPT-4 on a 1–10 scale for quality.

Purpose

Correlates well with human preferences. The standard benchmark for evaluating chat assistant quality.

16

Elo Rating

Definition

Tournament-style ranking where models compete pairwise and ratings update based on win/loss outcomes.

Purpose

LMSYS Chatbot Arena uses Elo over millions of human votes. The most reliable quality ranking for chat.
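The core Elo update is a few lines: compute A's expected score from the rating gap, then move both ratings by a K-factor times the surprise. A minimal sketch (function name and K=32 are illustrative choices):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """One Elo update: score_a is 1 for an A win, 0 for a loss, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Equal-rated models, A wins: A gains k/2 = 16 points.
a, b = elo_update(1000, 1000, 1)
print(a, b)  # → 1016.0 984.0
```

Upsets move ratings more than expected wins: beating a much higher-rated opponent yields nearly the full K points, while beating a much lower-rated one yields almost nothing.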

17

Win Rate

Definition

Fraction of head-to-head comparisons where model A is preferred over model B by human or AI judges.

Purpose

The direct metric for model comparison studies. More actionable than benchmark scores for product decisions.

18

ROUGE / BLEU

Definition

ROUGE measures n-gram recall vs. references; BLEU measures n-gram precision for translation quality.

Purpose

Widely used for summarization and MT respectively. Known to correlate poorly with human judgment at high quality.
