KV Cache
Definition
Caches key and value projections for all past tokens to avoid recomputing them at each new decode step.
Purpose
Without a KV cache, each decode step recomputes attention over the entire prefix at O(n^2) cost; with it, each step is O(n). The foundational inference optimization.
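A minimal single-head sketch of the idea in NumPy (shapes and weights are illustrative, not from any particular model):

```python
import numpy as np

d = 64                                   # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []                # grows by one entry per decoded token

def decode_step(x_t):
    """x_t: hidden state of the newest token, shape (d,)."""
    q = x_t @ Wq
    k_cache.append(x_t @ Wk)             # project K/V once, reuse at every later step
    v_cache.append(x_t @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)          # O(n) attention work per step
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V                     # attention output for the new token only

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
```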
PagedAttention
Definition
Inspired by OS virtual memory: KV cache is divided into fixed-size pages allocated on demand via a block table.
Purpose
Eliminates memory fragmentation; enables copy-on-write KV sharing in beam search. The engine behind vLLM.
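A toy sketch of the block-table bookkeeping (block size and allocator are illustrative; the real vLLM allocator also handles copy-on-write and eviction):

```python
BLOCK_SIZE = 16                          # tokens per KV block (illustrative)

class BlockTable:
    """Maps each sequence's logical token positions to physical KV cache blocks."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))   # pool of physical blocks
        self.blocks = {}                               # seq_id -> list of physical block ids
        self.length = {}                               # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        n = self.length.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                        # current block full (or first token):
            self.blocks.setdefault(seq_id, []).append(self.free.pop())   # allocate on demand
        self.length[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Physical (block id, offset) holding the KV entry for token `pos`."""
        return self.blocks[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

bt = BlockTable(num_physical_blocks=64)
for _ in range(40):
    bt.append_token(seq_id=0)            # 40 tokens -> only 3 blocks allocated, no big preallocation
print(bt.slot(0, 37))
```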
Multi-Query Attention (MQA)
Definition
Projects queries to H heads but collapses keys and values to a single shared head across all queries.
Purpose
Reduces KV cache by H×. Dramatically speeds up memory-bound decode at a small accuracy cost.
Grouped-Query Attention (GQA)
Definition
Generalizes MQA: G groups of query heads share one K/V head each. LLaMA-3 70B uses G=8.
Purpose
Maintains MHA quality while cutting KV cache 8x vs full MHA. The standard in all modern frontier models.
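A shape-level NumPy sketch (illustrative sizes; causal masking omitted). Setting num_kv_heads = 1 recovers MQA, and num_kv_heads = num_q_heads recovers standard MHA:

```python
import numpy as np

T, d = 8, 64                                # sequence length, head dim (illustrative)
num_q_heads, num_kv_heads = 32, 8           # 32 / 8 = 4 query heads share each KV head
group = num_q_heads // num_kv_heads
rng = np.random.default_rng(0)

q = rng.standard_normal((num_q_heads, T, d))
k = rng.standard_normal((num_kv_heads, T, d))   # KV cache is 4x smaller than full MHA here
v = rng.standard_normal((num_kv_heads, T, d))

# Broadcast each KV head to its group of query heads, then attend as usual.
k_exp = np.repeat(k, group, axis=0)             # (num_q_heads, T, d)
v_exp = np.repeat(v, group, axis=0)
scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(d)
scores -= scores.max(-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_exp                           # (num_q_heads, T, d)
```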
Speculative Decoding
Definition
A small draft model generates K tokens; the large model verifies all K in one parallel forward pass.
Purpose
2–3x speedup with mathematically zero quality change: every drafted token is verified against the target model and corrected on rejection.
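A simplified greedy-acceptance sketch; the full algorithm uses rejection sampling over token probabilities so the output distribution matches the target model exactly. draft_model and target_model are hypothetical callables:

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """Draft k tokens with the cheap model, then verify them with one target forward pass."""
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_model(ctx)                 # cheap autoregressive draft
        drafted.append(tok)
        ctx.append(tok)

    # Hypothetical helper: returns the target model's greedy prediction at each drafted
    # position plus one more, all computed in a single parallel forward pass.
    target_toks = target_model(prefix, drafted)

    accepted = []
    for i, tok in enumerate(drafted):
        if target_toks[i] == tok:              # agreement: keep the "free" token
            accepted.append(tok)
        else:                                  # first mismatch: take the target's token and stop
            accepted.append(target_toks[i])
            return accepted
    accepted.append(target_toks[k])            # bonus token when every draft is accepted
    return accepted
```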
Medusa
Definition
Attaches K additional LM heads to the final layer; the k-th head predicts the token k+1 steps ahead of the current position.
Purpose
Speculative decoding without a separate model. Verified via tree attention over the candidate drafts.
EAGLE
Definition
Draft model predicts the next hidden state (feature vector) rather than the next token directly.
Purpose
Higher acceptance rate than token-level speculation. 3–4x speedup on NVIDIA hardware in practice.
Lookahead Decoding
Definition
Runs Jacobi fixed-point iteration over a 2D lookahead window: drafts n-grams in parallel with normal decoding, then verifies candidate n-grams.
Purpose
No draft model required. Works out-of-the-box on any LLM architecture without any additional training.
Continuous Batching
Definition
Inserts new requests mid-batch at token boundaries, treating the KV cache pool as a dynamic resource.
Purpose
The foundational innovation that made LLM serving economically viable. Default in vLLM and TRT-LLM.
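A schematic of the scheduling loop with a toy stand-in for the model; real schedulers also respect a KV-block budget and support preemption:

```python
from collections import deque

def decode_step(running):
    """Toy stand-in for one batched decode step: each request needs `remaining` more tokens."""
    finished = []
    for req in running:
        req["remaining"] -= 1
        if req["remaining"] == 0:
            finished.append(req)
    return finished

def serve(waiting, max_batch=4):
    running = []
    while waiting or running:
        # Admit queued requests at a token boundary instead of waiting for the batch to drain.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        for req in decode_step(running):   # finished requests leave mid-batch,
            running.remove(req)            # immediately freeing their KV blocks

requests = deque({"id": i, "remaining": n} for i, n in enumerate([3, 7, 2, 9, 4]))
serve(requests)
```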
Definition
Groups concurrent requests with similar lengths to minimize padding waste in GPU kernel calls.
Purpose
Reduces effective wasted compute by 20–40%. Complements continuous batching as a preprocessing step.
Chunked Prefill
Definition
Splits a long prompt prefill into chunks interleaved with decode steps from other requests.
Purpose
Caps memory spikes during long-prompt prefill and reduces TTFT for queued requests simultaneously.
Prefix Caching
Definition
Detects identical prompt prefixes across requests and reuses their computed KV cache blocks.
Purpose
40–80% latency reduction for system-prompt-heavy deployments. Cache hits cost almost nothing; enable it by default.
RadixAttention
Definition
Implements prefix caching via a radix tree where each node stores a KV block; partial prefix matches share.
Purpose
Supports thousands of concurrent requests sharing system prompts with fine-grained block-level reuse.
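A hash-based sketch of block-level prefix reuse; RadixAttention organizes the same bookkeeping as a radix tree so partial matches also share (block size and helpers are illustrative):

```python
BLOCK = 16
prefix_index = {}        # key for a prefix of full blocks -> physical KV block id
next_block_id = 0

def get_or_compute_blocks(tokens):
    """Reuse KV blocks for every full block whose entire prefix has been seen before."""
    global next_block_id
    blocks, reused = [], 0
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        key = tuple(tokens[: i + BLOCK])          # key covers the whole prefix, not just this block
        if key in prefix_index:
            reused += 1                           # cache hit: skip prefill for this block
        else:
            prefix_index[key] = next_block_id     # cache miss: this block would be prefilled
            next_block_id += 1
        blocks.append(prefix_index[key])
    return blocks, reused

system_prompt = list(range(48))                   # three full blocks of shared system prompt
get_or_compute_blocks(system_prompt + [100, 101])
_, reused = get_or_compute_blocks(system_prompt + [200, 201])
print(reused)                                     # 3: the second request reuses the whole prefix
```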
CUDA Graphs
Definition
Captures an entire decode step as a static GPU graph and replays it without per-kernel launch overhead.
Purpose
Reduces per-step latency 10–20% for small batches where kernel launch dominates. Use for real-time chat.
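A minimal PyTorch capture-and-replay sketch (requires a CUDA GPU; the layer and shapes are placeholders for a decode step):

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_x = torch.zeros(1, 4096, device="cuda")        # graphs require fixed shapes and addresses

# PyTorch requires a warm-up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = model(static_x)                        # capture one "decode step"

static_x.copy_(torch.randn(1, 4096, device="cuda"))   # refresh inputs in place...
g.replay()                                            # ...then replay with no per-kernel launch cost
print(static_y.sum().item())
```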
Kernel Fusion
Definition
Combines multiple elementwise operations into a single GPU kernel, avoiding repeated HBM round-trips.
Purpose
Cuts memory traffic from N kernel calls to 1. Essential for memory-bandwidth-limited operations like layernorm.
Triton
Definition
Python-embedded DSL that compiles GPU kernels via LLVM with automatic tile-size tuning.
Purpose
The official Triton tutorials ship a fused Flash-Attention-style kernel. Near-CUDA performance with far simpler code.
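A minimal Triton kernel in the style of the official tutorials, fusing an add and a ReLU into a single pass over HBM; it doubles as a concrete example of the kernel-fusion entry above (block size is illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)   # one HBM round-trip

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
fused_add_relu_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
```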
xFormers
Definition
Meta's library of memory-efficient attention variants and fused transformer primitives for PyTorch.
Purpose
Drop-in replacement for standard attention. Widely used in Stable Diffusion pipelines and LLaMA stacks.
TensorRT-LLM
Definition
NVIDIA's production inference compiler: fuses ops, selects optimal kernel tiles, generates FP8/INT8 engines.
Purpose
Delivers 2–5x throughput vs. naive inference on the same hardware. NVIDIA's official production runtime.
vLLM
Definition
Open-source inference server combining PagedAttention, continuous batching, and prefix caching.
Purpose
Production standard for open-model serving. Supports 40+ architectures out of the box with one command.
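A typical offline-batching call, matching the documented vLLM API at the time of writing (the model id is a placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain the KV cache in one sentence."], params)
print(outputs[0].outputs[0].text)
```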
SGLang
Definition
Structured Generation Language: Python-embedded DSL for branching, multi-call, and constrained generation.
Purpose
RadixAttention makes it the fastest option for complex multi-turn agentic workloads. A popular serving choice among researchers.
Ollama
Definition
Wraps llama.cpp with a Docker-like model registry and a REST API for local model serving.
Purpose
One command pulls a pre-quantized model and serves it locally. The entry point for developer experimentation.
ExLlamaV2
Definition
Hand-tuned CUDA kernels for EXL2-quantized matmuls with fused dequantization on NVIDIA GPUs.
Purpose
Fastest local inference for quantized models on consumer GPUs (RTX 3090/4090). Maximizes VRAM efficiency.
llama.cpp
Definition
C/C++ inference engine with AVX2/NEON SIMD kernels, Apple Metal GPU backend, and CUDA partial offload.
Purpose
Runs on everything from Raspberry Pi to data centers. The universal runtime for quantized LLM deployment.
Time to First Token (TTFT)
Definition
Wall-clock time from HTTP request arrival to the first streamed token byte in the response.
Purpose
The primary latency metric for interactive chat applications. Dominated by prefill compute, which grows with prompt length.
Time Per Output Token (TPOT)
Definition
Average time between successive tokens during the decode phase; determined almost entirely by memory bandwidth for weight and KV cache reads.
Purpose
The metric that determines whether streaming output feels fluid. Improved by MQA, quantization, batching.
Tokens Per Second (TPS)
Definition
Total tokens generated divided by wall time; the aggregate throughput metric for serving systems.
Purpose
Measured at peak batch size for capacity planning. Per-user TPS matters most for latency SLO compliance.
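How the three serving metrics relate for a single streamed response (all numbers illustrative; the aggregate figure assumes the per-user rate holds at that batch size):

```python
ttft = 0.35            # s: time to first token (illustrative)
tpot = 0.025           # s per output token during decode (40 tok/s per stream)
n_out = 200            # tokens generated

total_latency = ttft + (n_out - 1) * tpot       # ~5.3 s end to end
per_user_tps = n_out / total_latency            # ~38 tok/s as experienced by this user
system_tps = per_user_tps * 64                  # aggregate TPS at batch size 64 (assumes the
print(total_latency, per_user_tps, system_tps)  # per-user rate holds at that batch size)
```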
Prefill
Definition
Processes the entire prompt in one parallel forward pass with full causal attention masking.
Purpose
Compute-bound; scales with prompt length squared. Benefits most from Flash Attention and tensor parallelism.
Decode
Definition
Generates tokens one at a time with KV cache growing at each step; fundamentally sequential.
Purpose
Memory-bandwidth-bound. The target of speculative decoding, MQA/GQA, quantization, and KV compression.
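A back-of-the-envelope bound showing why decode is bandwidth-bound (illustrative numbers for a 7B model in FP16 on an H100-class GPU):

```python
params = 7e9
bytes_per_weight = 2                     # FP16/BF16
weight_bytes = params * bytes_per_weight          # ~14 GB read from HBM per decode step at batch 1
hbm_bandwidth = 2.0e12                   # ~2 TB/s, H100-class (illustrative)

max_tokens_per_s = hbm_bandwidth / weight_bytes
print(max_tokens_per_s)                  # ~143 tok/s ceiling for a single stream;
                                         # compute sits mostly idle, hence "bandwidth-bound"
```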
Model FLOPs Utilization (MFU)
Definition
Actual FLOP/s divided by the GPU's theoretical peak FLOP/s; the definitive hardware efficiency KPI.
Purpose
A100 training at 50% MFU is good. Below 30% signals a communication or memory bandwidth bottleneck.
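A worked MFU estimate using the standard rule of thumb of roughly 6 x parameters FLOPs per training token (all numbers illustrative):

```python
params = 70e9
tokens_per_s_per_gpu = 380             # measured training throughput (illustrative)
peak_flops = 312e12                    # A100 BF16 dense peak

achieved_flops = 6 * params * tokens_per_s_per_gpu   # ~1.6e14 FLOP/s actually delivered
mfu = achieved_flops / peak_flops
print(f"{mfu:.0%}")                    # ~51%: healthy for A100 training
```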
Prefill/Decode Disaggregation
Definition
Routes prefill to a compute-optimized GPU pool and decode to a bandwidth-optimized pool separately.
Purpose
Prevents bursty prefill from starving decode capacity. Doubles overall cluster throughput in simulation.
Attention Sink
Definition
First tokens receive disproportionately high attention weights regardless of content (StreamingLLM finding).
Purpose
Keeping the first 4 tokens in the KV cache prevents quality collapse in streaming inference. Critical for unbounded generation.
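A sketch of the StreamingLLM eviction policy: always keep the first few sink positions plus a recent window (sizes illustrative):

```python
NUM_SINK = 4            # first tokens kept forever (the attention sinks)
WINDOW = 1024           # most recent tokens kept

def evict(kv_positions):
    """kv_positions: token positions currently held in the KV cache, oldest first."""
    if len(kv_positions) <= NUM_SINK + WINDOW:
        return kv_positions
    return kv_positions[:NUM_SINK] + kv_positions[-WINDOW:]   # drop only the middle

cache = evict(list(range(5000)))
print(cache[:6], len(cache))   # [0, 1, 2, 3, 3976, 3977] 1028
```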
Prompt Caching
Definition
Vendor-level (Anthropic, OpenAI) feature storing computed KV states server-side; charged at reduced rate.
Purpose
Reduces API costs by up to 80% for long system-prompt workloads. Direct ROI on every repeated call.
KV Cache Sharing
Definition
Multiple concurrent requests with the same prefix share a single KV cache copy via copy-on-write blocks.
Purpose
PagedAttention enables zero-copy sharing at block level. Essential for multi-tenant system-prompt serving.
Max New Tokens
Definition
Maximum number of new tokens the model is allowed to generate per call, set by max_new_tokens.
Purpose
Controls latency and prevents runaway generation. Too low truncates responses; too high inflates serving cost.
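The parameter as it appears in the Hugging Face transformers generate API (the model id is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B-Instruct"            # placeholder model id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Summarize the KV cache in one sentence.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)    # hard cap on generated tokens
print(tok.decode(out[0], skip_special_tokens=True))
```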
Tensor Parallelism
Definition
Shards Q/K/V/O projections and FFN layers across GPUs, all-reducing activations at layer boundaries.
Purpose
Necessary for single-request low-latency serving of models too large for one GPU. Lowest latency at scale.
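A NumPy illustration of 2-way tensor parallelism on a pair of linear layers: columns are split on the way up, rows on the way down, and the partial outputs are summed, which is the all-reduce at the layer boundary (an elementwise activation between the two layers would not change the result, since it acts on disjoint columns):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 8, 16
x = rng.standard_normal((1, d))
W1 = rng.standard_normal((d, d_ff))      # up projection
W2 = rng.standard_normal((d_ff, d))      # down projection

# Rank 0 and rank 1 each hold half of W1's columns and the matching half of W2's rows.
h0 = x @ W1[:, : d_ff // 2]              # rank 0 partial activation
h1 = x @ W1[:, d_ff // 2 :]              # rank 1 partial activation
y0 = h0 @ W2[: d_ff // 2, :]             # rank 0 partial output
y1 = h1 @ W2[d_ff // 2 :, :]             # rank 1 partial output

y_tp = y0 + y1                           # the all-reduce at the layer boundary
assert np.allclose(y_tp, x @ W1 @ W2)
```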
FP8 Inference
Definition
Uses H100 FP8 Tensor Cores for matrix multiplications with BF16 residuals and normalization layers.
Purpose
Doubles effective compute throughput vs BF16. TRT-LLM and vLLM both support FP8 serving on H100.
S-LoRA
Definition
Extends multi-LoRA with a unified paging system for adapter weights and custom batched LoRA kernels.
Purpose
Demonstrated serving 2000+ adapters concurrently on a single A100. The scalable adapter-serving standard.
Multi-LoRA Serving
Definition
Maintains a pool of LoRA delta weights in GPU memory and applies the correct adapter per request.
Purpose
Serves thousands of customers from one base model without weight switching latency. Key SaaS enabler.
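A toy sketch of per-request adapter selection: one frozen base weight shared by all tenants, plus small low-rank deltas chosen by adapter id (shapes illustrative; S-LoRA replaces the Python loop with batched custom kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                                      # hidden size, LoRA rank (illustrative)
W = rng.standard_normal((d, d))                   # frozen base weight shared by every tenant
adapters = {                                      # per-tenant low-rank deltas kept in GPU memory
    "tenant_a": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
    "tenant_b": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
}

def lora_linear(x, adapter_id, scaling=1.0):
    A, B = adapters[adapter_id]
    return x @ W + scaling * (x @ A) @ B          # base path plus the tenant-specific delta

batch = [("tenant_a", rng.standard_normal((1, d))), ("tenant_b", rng.standard_normal((1, d)))]
outputs = [lora_linear(x, tid) for tid, x in batch]
```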
DeepSpeed-Inference
Definition
Adds tensor parallelism, INT8 quantization, and fused attention kernels on top of PyTorch.
Purpose
ZeRO-Inference offloads weights to CPU RAM or NVMe for models too large to fit in available GPU memory.