KV Cache
Definition
Caches key and value projections for all past tokens to avoid recomputing them at each new decode step.
Purpose
Without a KV cache, each decode step recomputes attention over the entire prefix at O(n^2) cost; with it, each step is O(n). The foundational inference optimization.
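A minimal single-head sketch of the idea in NumPy (shapes and weights are illustrative, not from any particular model):

```python
import numpy as np

d = 64                                   # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []                # grows by one entry per decoded token

def decode_step(x_t):
    """x_t: hidden state of the newest token, shape (d,)."""
    q = x_t @ Wq
    k_cache.append(x_t @ Wk)             # project K/V once, reuse at every later step
    v_cache.append(x_t @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)          # O(n) attention work per step
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V                     # attention output for the new token only

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
```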
PagedAttention
Definition
Inspired by OS virtual memory: KV cache is divided into fixed-size pages allocated on demand via a block table.
Purpose
Eliminates memory fragmentation; enables copy-on-write KV sharing in beam search. The engine behind vLLM.
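A toy sketch of the block-table bookkeeping (block size and allocator are illustrative; the real vLLM allocator also handles copy-on-write and eviction):

```python
BLOCK_SIZE = 16                          # tokens per KV block (illustrative)

class BlockTable:
    """Maps each sequence's logical token positions to physical KV cache blocks."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))   # pool of physical blocks
        self.blocks = {}                               # seq_id -> list of physical block ids
        self.length = {}                               # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        n = self.length.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                        # current block full (or first token):
            self.blocks.setdefault(seq_id, []).append(self.free.pop())   # allocate on demand
        self.length[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Physical (block id, offset) holding the KV entry for token `pos`."""
        return self.blocks[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

bt = BlockTable(num_physical_blocks=64)
for _ in range(40):
    bt.append_token(seq_id=0)            # 40 tokens -> only 3 blocks allocated, no big preallocation
print(bt.slot(0, 37))
```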
Multi-Query Attention (MQA)
Definition
Projects queries to H heads but collapses keys and values to a single shared head across all queries.
Purpose
Reduces KV cache by H×. Dramatically speeds up memory-bound decode at a small accuracy cost.
Grouped-Query Attention (GQA)
Definition
Generalizes MQA: G groups of query heads share one K/V head each. LLaMA-3 70B uses G=8.
Purpose
Maintains MHA quality while cutting KV cache 8x vs full MHA. The standard in all modern frontier models.
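A shape-level NumPy sketch (illustrative sizes; causal masking omitted). Setting num_kv_heads = 1 recovers MQA, and num_kv_heads = num_q_heads recovers standard MHA:

```python
import numpy as np

T, d = 8, 64                                # sequence length, head dim (illustrative)
num_q_heads, num_kv_heads = 32, 8           # 32 / 8 = 4 query heads share each KV head
group = num_q_heads // num_kv_heads
rng = np.random.default_rng(0)

q = rng.standard_normal((num_q_heads, T, d))
k = rng.standard_normal((num_kv_heads, T, d))   # KV cache is 4x smaller than full MHA here
v = rng.standard_normal((num_kv_heads, T, d))

# Broadcast each KV head to its group of query heads, then attend as usual.
k_exp = np.repeat(k, group, axis=0)             # (num_q_heads, T, d)
v_exp = np.repeat(v, group, axis=0)
scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(d)
scores -= scores.max(-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_exp                           # (num_q_heads, T, d)
```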
Speculative Decoding
Definition
A small draft model generates K tokens; the large model verifies all K in one parallel forward pass.
Purpose
2–3x speedup with mathematically zero quality change: every drafted token is verified against the target model and corrected on rejection.
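A simplified greedy-acceptance sketch; the full algorithm uses rejection sampling over token probabilities so the output distribution matches the target model exactly. draft_model and target_model are hypothetical callables:

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """Draft k tokens with the cheap model, then verify them with one target forward pass."""
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_model(ctx)                 # cheap autoregressive draft
        drafted.append(tok)
        ctx.append(tok)

    # Hypothetical helper: returns the target model's greedy prediction at each drafted
    # position plus one more, all computed in a single parallel forward pass.
    target_toks = target_model(prefix, drafted)

    accepted = []
    for i, tok in enumerate(drafted):
        if target_toks[i] == tok:              # agreement: keep the "free" token
            accepted.append(tok)
        else:                                  # first mismatch: take the target's token and stop
            accepted.append(target_toks[i])
            return accepted
    accepted.append(target_toks[k])            # bonus token when every draft is accepted
    return accepted
```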
Medusa
Definition
Attaches K additional LM heads to the final layer; the k-th head predicts the token k+1 steps ahead of the current position.
Purpose
Speculative decoding without a separate model. Verified via tree attention over the candidate drafts.
EAGLE
Definition
Draft model predicts the next hidden state (feature vector) rather than the next token directly.
Purpose
Higher acceptance rate than token-level speculation. 3–4x speedup on NVIDIA hardware in practice.
Lookahead Decoding
Definition
Runs Jacobi fixed-point iteration over a 2D lookahead window: drafts n-grams in parallel with normal decoding, then verifies candidate n-grams.
Purpose
No draft model required. Works out-of-the-box on any LLM architecture without any additional training.
Continuous Batching
Definition
Inserts new requests mid-batch at token boundaries, treating the KV cache pool as a dynamic resource.
Purpose
The foundational innovation that made LLM serving economically viable. Default in vLLM and TRT-LLM.
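A schematic of the scheduling loop with a toy stand-in for the model; real schedulers also respect a KV-block budget and support preemption:

```python
from collections import deque

def decode_step(running):
    """Toy stand-in for one batched decode step: each request needs `remaining` more tokens."""
    finished = []
    for req in running:
        req["remaining"] -= 1
        if req["remaining"] == 0:
            finished.append(req)
    return finished

def serve(waiting, max_batch=4):
    running = []
    while waiting or running:
        # Admit queued requests at a token boundary instead of waiting for the batch to drain.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        for req in decode_step(running):   # finished requests leave mid-batch,
            running.remove(req)            # immediately freeing their KV blocks

requests = deque({"id": i, "remaining": n} for i, n in enumerate([3, 7, 2, 9, 4]))
serve(requests)
```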
Definition
Groups concurrent requests with similar lengths to minimize padding waste in GPU kernel calls.
Purpose
Reduces effective wasted compute by 20–40%. Complements continuous batching as a preprocessing step.
Chunked Prefill
Definition
Splits a long prompt prefill into chunks interleaved with decode steps from other requests.
Purpose
Caps memory spikes during long-prompt prefill and reduces TTFT for queued requests simultaneously.
Prefix Caching
Definition
Detects identical prompt prefixes across requests and reuses their computed KV cache blocks.
Purpose
40–80% latency reduction for system-prompt-heavy deployments. Cache hits cost almost nothing; enable it by default.
RadixAttention
Definition
Implements prefix caching via a radix tree where each node stores a KV block; partial prefix matches share.
Purpose
Supports thousands of concurrent requests sharing system prompts with fine-grained block-level reuse.
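A hash-based sketch of block-level prefix reuse; RadixAttention organizes the same bookkeeping as a radix tree so partial matches also share (block size and helpers are illustrative):

```python
BLOCK = 16
prefix_index = {}        # key for a prefix of full blocks -> physical KV block id
next_block_id = 0

def get_or_compute_blocks(tokens):
    """Reuse KV blocks for every full block whose entire prefix has been seen before."""
    global next_block_id
    blocks, reused = [], 0
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        key = tuple(tokens[: i + BLOCK])          # key covers the whole prefix, not just this block
        if key in prefix_index:
            reused += 1                           # cache hit: skip prefill for this block
        else:
            prefix_index[key] = next_block_id     # cache miss: this block would be prefilled
            next_block_id += 1
        blocks.append(prefix_index[key])
    return blocks, reused

system_prompt = list(range(48))                   # three full blocks of shared system prompt
get_or_compute_blocks(system_prompt + [100, 101])
_, reused = get_or_compute_blocks(system_prompt + [200, 201])
print(reused)                                     # 3: the second request reuses the whole prefix
```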
CUDA Graphs
Definition
Captures an entire decode step as a static GPU graph and replays it without per-kernel launch overhead.
Purpose
Reduces per-step latency 10–20% for small batches where kernel launch dominates. Use for real-time chat.
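A minimal PyTorch capture-and-replay sketch (requires a CUDA GPU; the layer and shapes are placeholders for a decode step):

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_x = torch.zeros(1, 4096, device="cuda")        # graphs require fixed shapes and addresses

# PyTorch requires a warm-up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = model(static_x)                        # capture one "decode step"

static_x.copy_(torch.randn(1, 4096, device="cuda"))   # refresh inputs in place...
g.replay()                                            # ...then replay with no per-kernel launch cost
print(static_y.sum().item())
```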
Kernel Fusion
Definition
Combines multiple elementwise operations into a single GPU kernel, avoiding repeated HBM round-trips.
Purpose
Cuts memory traffic from N kernel calls to 1. Essential for memory-bandwidth-limited operations like layernorm.
Triton
Definition
Python-embedded DSL that compiles GPU kernels via LLVM with automatic tile-size tuning.
Purpose
The official Triton tutorials ship a fused Flash-Attention-style kernel. Near-CUDA performance with far simpler code.
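A minimal Triton kernel in the style of the official tutorials, fusing an add and a ReLU into a single pass over HBM; it doubles as a concrete example of the kernel-fusion entry above (block size is illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)   # one HBM round-trip

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
fused_add_relu_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
```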
xFormers
Definition
Meta's library of memory-efficient attention variants and fused transformer primitives for PyTorch.
Purpose
Drop-in replacement for standard attention. Widely used in Stable Diffusion pipelines and LLaMA stacks.
TensorRT-LLM
Definition
NVIDIA's production inference compiler: fuses ops, selects optimal kernel tiles, generates FP8/INT8 engines.
Purpose
Delivers 2–5x throughput vs. naive inference on the same hardware. NVIDIA's official production runtime.
vLLM
Definition
Open-source inference server combining PagedAttention, continuous batching, and prefix caching.
Purpose
Production standard for open-model serving. Supports 40+ architectures out of the box with one command.
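A typical offline-batching call, matching the documented vLLM API at the time of writing (the model id is a placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain the KV cache in one sentence."], params)
print(outputs[0].outputs[0].text)
```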
SGLang
Definition
Structured Generation Language: Python-embedded DSL for branching, multi-call, and constrained generation.
Purpose
RadixAttention makes it the fastest option for complex multi-turn agentic workloads. A popular serving choice among researchers.
Ollama
Definition
Wraps llama.cpp with a Docker-like model registry and a REST API for local model serving.
Purpose
One command pulls a pre-quantized model and serves it locally. The entry point for developer experimentation.
ExLlamaV2
Definition
Hand-tuned CUDA kernels for EXL2-quantized matmuls with fused dequantization on NVIDIA GPUs.
Purpose
Fastest local inference for quantized models on consumer GPUs (RTX 3090/4090). Maximizes VRAM efficiency.
llama.cpp
Definition
C/C++ inference engine with AVX2/NEON SIMD kernels, Apple Metal GPU backend, and CUDA partial offload.
Purpose
Runs on everything from Raspberry Pi to data centers. The universal runtime for quantized LLM deployment.
Time to First Token (TTFT)
Definition
Wall-clock time from HTTP request arrival to the first streamed token byte in the response.
Purpose
The primary latency metric for interactive chat applications. Dominated by prefill compute, which grows with prompt length.
Time Per Output Token (TPOT)
Definition
Average time between successive tokens during the decode phase; determined almost entirely by memory bandwidth for weight and KV cache reads.
Purpose
The metric that determines whether streaming output feels fluid. Improved by MQA, quantization, batching.
Tokens Per Second (TPS)
Definition
Total tokens generated divided by wall time; the aggregate throughput metric for serving systems.
Purpose
Measured at peak batch size for capacity planning. Per-user TPS matters most for latency SLO compliance.
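How the three serving metrics relate for a single streamed response (all numbers illustrative; the aggregate figure assumes the per-user rate holds at that batch size):

```python
ttft = 0.35            # s: time to first token (illustrative)
tpot = 0.025           # s per output token during decode (40 tok/s per stream)
n_out = 200            # tokens generated

total_latency = ttft + (n_out - 1) * tpot       # ~5.3 s end to end
per_user_tps = n_out / total_latency            # ~38 tok/s as experienced by this user
system_tps = per_user_tps * 64                  # aggregate TPS at batch size 64 (assumes the
print(total_latency, per_user_tps, system_tps)  # per-user rate holds at that batch size)
```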
Prefill
Definition
Processes the entire prompt in one parallel forward pass with full causal attention masking.
Purpose
Compute-bound; scales with prompt length squared. Benefits most from Flash Attention and tensor parallelism.
Decode
Definition
Generates tokens one at a time with KV cache growing at each step; fundamentally sequential.
Purpose
Memory-bandwidth-bound. The target of speculative decoding, MQA/GQA, quantization, and KV compression.
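A back-of-the-envelope bound showing why decode is bandwidth-bound (illustrative numbers for a 7B model in FP16 on an H100-class GPU):

```python
params = 7e9
bytes_per_weight = 2                     # FP16/BF16
weight_bytes = params * bytes_per_weight          # ~14 GB read from HBM per decode step at batch 1
hbm_bandwidth = 2.0e12                   # ~2 TB/s, H100-class (illustrative)

max_tokens_per_s = hbm_bandwidth / weight_bytes
print(max_tokens_per_s)                  # ~143 tok/s ceiling for a single stream;
                                         # compute sits mostly idle, hence "bandwidth-bound"
```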
Model FLOPs Utilization (MFU)
Definition
Actual FLOP/s divided by the GPU's theoretical peak FLOP/s; the definitive hardware efficiency KPI.
Purpose
A100 training at 50% MFU is good. Below 30% signals a communication or memory bandwidth bottleneck.
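A worked MFU estimate using the standard rule of thumb of roughly 6 x parameters FLOPs per training token (all numbers illustrative):

```python
params = 70e9
tokens_per_s_per_gpu = 380             # measured training throughput (illustrative)
peak_flops = 312e12                    # A100 BF16 dense peak

achieved_flops = 6 * params * tokens_per_s_per_gpu   # ~1.6e14 FLOP/s actually delivered
mfu = achieved_flops / peak_flops
print(f"{mfu:.0%}")                    # ~51%: healthy for A100 training
```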
Prefill/Decode Disaggregation
Definition
Routes prefill to a compute-optimized GPU pool and decode to a bandwidth-optimized pool separately.
Purpose
Prevents bursty prefill from starving decode capacity. Doubles overall cluster throughput in simulation.
Attention Sink
Definition
First tokens receive disproportionately high attention weights regardless of content (StreamingLLM finding).
Purpose
Keeping the first 4 tokens in the KV cache prevents quality collapse in streaming inference. Critical for unbounded generation.
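A sketch of the StreamingLLM eviction policy: always keep the first few sink positions plus a recent window (sizes illustrative):

```python
NUM_SINK = 4            # first tokens kept forever (the attention sinks)
WINDOW = 1024           # most recent tokens kept

def evict(kv_positions):
    """kv_positions: token positions currently held in the KV cache, oldest first."""
    if len(kv_positions) <= NUM_SINK + WINDOW:
        return kv_positions
    return kv_positions[:NUM_SINK] + kv_positions[-WINDOW:]   # drop only the middle

cache = evict(list(range(5000)))
print(cache[:6], len(cache))   # [0, 1, 2, 3, 3976, 3977] 1028
```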
Prompt Caching
Definition
Vendor-level (Anthropic, OpenAI) feature storing computed KV states server-side; charged at reduced rate.
Purpose
Reduces API costs by up to 80% for long system-prompt workloads. Direct ROI on every repeated call.
KV Cache Sharing
Definition
Multiple concurrent requests with the same prefix share a single KV cache copy via copy-on-write blocks.
Purpose
PagedAttention enables zero-copy sharing at block level. Essential for multi-tenant system-prompt serving.
Max New Tokens
Definition
Maximum number of new tokens the model is allowed to generate per call, set by max_new_tokens.
Purpose
Controls latency and prevents runaway generation. Too low truncates responses; too high inflates serving cost.
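The parameter as it appears in the Hugging Face transformers generate API (the model id is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B-Instruct"            # placeholder model id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Summarize the KV cache in one sentence.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)    # hard cap on generated tokens
print(tok.decode(out[0], skip_special_tokens=True))
```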
Tensor Parallelism
Definition
Shards Q/K/V/O projections and FFN layers across GPUs, all-reducing activations at layer boundaries.
Purpose
Necessary for single-request low-latency serving of models too large for one GPU. Lowest latency at scale.
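A NumPy illustration of 2-way tensor parallelism on a pair of linear layers: columns are split on the way up, rows on the way down, and the partial outputs are summed, which is the all-reduce at the layer boundary (an elementwise activation between the two layers would not change the result, since it acts on disjoint columns):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 8, 16
x = rng.standard_normal((1, d))
W1 = rng.standard_normal((d, d_ff))      # up projection
W2 = rng.standard_normal((d_ff, d))      # down projection

# Rank 0 and rank 1 each hold half of W1's columns and the matching half of W2's rows.
h0 = x @ W1[:, : d_ff // 2]              # rank 0 partial activation
h1 = x @ W1[:, d_ff // 2 :]              # rank 1 partial activation
y0 = h0 @ W2[: d_ff // 2, :]             # rank 0 partial output
y1 = h1 @ W2[d_ff // 2 :, :]             # rank 1 partial output

y_tp = y0 + y1                           # the all-reduce at the layer boundary
assert np.allclose(y_tp, x @ W1 @ W2)
```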
FP8 Inference
Definition
Uses H100 FP8 Tensor Cores for matrix multiplications with BF16 residuals and normalization layers.
Purpose
Doubles effective compute throughput vs BF16. TRT-LLM and vLLM both support FP8 serving on H100.
S-LoRA
Definition
Extends multi-LoRA with a unified paging system for adapter weights and custom batched LoRA kernels.
Purpose
Demonstrated serving 2000+ adapters concurrently on a single A100. The scalable adapter-serving standard.
Multi-LoRA Serving
Definition
Maintains a pool of LoRA delta weights in GPU memory and applies the correct adapter per request.
Purpose
Serves thousands of customers from one base model without weight switching latency. Key SaaS enabler.
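A toy sketch of per-request adapter selection: one frozen base weight shared by all tenants, plus small low-rank deltas chosen by adapter id (shapes illustrative; S-LoRA replaces the Python loop with batched custom kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                                      # hidden size, LoRA rank (illustrative)
W = rng.standard_normal((d, d))                   # frozen base weight shared by every tenant
adapters = {                                      # per-tenant low-rank deltas kept in GPU memory
    "tenant_a": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
    "tenant_b": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
}

def lora_linear(x, adapter_id, scaling=1.0):
    A, B = adapters[adapter_id]
    return x @ W + scaling * (x @ A) @ B          # base path plus the tenant-specific delta

batch = [("tenant_a", rng.standard_normal((1, d))), ("tenant_b", rng.standard_normal((1, d)))]
outputs = [lora_linear(x, tid) for tid, x in batch]
```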
DeepSpeed-Inference
Definition
Adds tensor parallelism, INT8 quantization, and fused attention kernels on top of PyTorch.
Purpose
ZeRO-Inference offloads weights to CPU RAM or NVMe for models too large to fit in available GPU memory.