Serving & Systems Optimization terms and explanations from the LLM Optimization Dictionary.

Throughput
Definition
Total tokens generated per second across all concurrent requests under production load.
Purpose
The primary capacity metric. Maximized by batching, quantization, and caching. Determines cost per token.
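A back-of-the-envelope sketch of how throughput sets cost per token, using assumed GPU pricing and throughput numbers:

```python
# Hypothetical numbers: relate aggregate throughput to serving cost per token.
gpu_cost_per_hour = 2.50      # assumed hourly price of one GPU
throughput_tok_s = 5_000      # aggregate tokens/s across all concurrent requests

cost_per_million_tokens = gpu_cost_per_hour / (throughput_tok_s * 3600) * 1e6
print(f"${cost_per_million_tokens:.3f} per 1M tokens")  # ≈ $0.139
```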
End-to-End Latency
Definition
Total wall-clock time from request arrival to completion, decomposed as TTFT (time to first token) + output_tokens × TBT (time between tokens).
Purpose
The primary quality-of-service metric. Different workloads optimize for TTFT vs. TBT differently.
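The decomposition above translates directly into a one-line formula; a minimal sketch with illustrative numbers:

```python
def e2e_latency(ttft_s: float, output_tokens: int, tbt_s: float) -> float:
    """End-to-end latency = TTFT + output_tokens * TBT."""
    return ttft_s + output_tokens * tbt_s

# e.g. 300 ms TTFT, 500 output tokens at 20 ms/token -> 10.3 s total
print(e2e_latency(0.300, 500, 0.020))
```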
Service Level Objectives (SLOs)
Definition
Contractual targets for latency, throughput, or availability (e.g., P95 TTFT < 500ms).
Purpose
SLO violation triggers autoscaling, preemption, or degraded mode. Defines the serving system's envelope.
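A minimal sketch of checking the example target above (P95 TTFT < 500 ms), assuming TTFT samples collected in milliseconds:

```python
import numpy as np

def violates_slo(ttft_samples_ms, p95_target_ms=500.0) -> bool:
    # A breach of the P95 target is what triggers autoscaling or preemption.
    return np.percentile(ttft_samples_ms, 95) > p95_target_ms

print(violates_slo([120, 180, 240, 310, 620, 410, 290, 350]))  # True: P95 ≈ 546 ms
```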
Continuous Batching
Definition
Injects new requests into the batch at token boundaries rather than waiting for a full batch to complete.
Purpose
The innovation that made LLM serving economically viable. Default in vLLM, TRT-LLM, and SGLang.
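A toy event loop showing the idea: new requests join at token boundaries instead of waiting for the batch to drain. The `model` and request objects are illustrative, not a real server API:

```python
from collections import deque

def continuous_batching_loop(waiting: deque, model, max_batch: int):
    running = []
    while running or waiting:
        # admit queued requests into free slots at every token boundary
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        model.decode_step(running)  # generate one token for each running request
        running = [r for r in running if not r.finished]  # retire finished ones
```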
Iteration-Level Scheduling
Definition
Makes scheduling decisions at every decode step rather than per-request for fine-grained control.
Purpose
Enables preemption, priority, and admission control at microsecond granularity for multi-tenant serving.
Preemption
Definition
Pauses a running request (swapping its KV cache to CPU) to immediately serve a higher-priority request.
Purpose
Prevents SLO violations for premium users while maximizing GPU utilization across the full request pool.
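A sketch of the preemption decision, assuming hypothetical request objects with a `priority` field and a KV-cache manager that can swap blocks to host memory:

```python
def maybe_preempt(running: list, incoming, kv_cache):
    victim = min(running, key=lambda r: r.priority, default=None)
    if victim is not None and incoming.priority > victim.priority:
        kv_cache.swap_out_to_cpu(victim)   # pause: victim's KV blocks leave HBM
        running.remove(victim)
        running.append(incoming)           # higher-priority request starts now
        return victim                      # re-admitted later via a swap-in
    return None
```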
Model Parallelism
Definition
Partitions model weights across GPUs using tensor, pipeline, or expert parallelism strategies.
Purpose
Required for models too large for one device. Tensor parallel is lowest latency for small batch sizes.
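Tensor parallelism at its core is a sharded matmul; a NumPy sketch of a column-parallel linear layer across two "devices":

```python
import numpy as np

x = np.random.randn(4, 8)         # (batch, hidden)
W = np.random.randn(8, 16)        # full weight matrix, too big for one device
W0, W1 = np.split(W, 2, axis=1)   # shard columns across two devices

# each device computes its shard; outputs are concatenated
# (an all-gather collective in a real tensor-parallel implementation)
y = np.concatenate([x @ W0, x @ W1], axis=1)
assert np.allclose(y, x @ W)      # identical to the unsharded matmul
```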
Prefill/Decode Disaggregation
Definition
Routes prefill to a compute-optimized GPU pool and decode to a bandwidth-optimized pool.
Purpose
Prevents bursty prefill from starving decode capacity. Doubles overall cluster throughput in simulation.
Weight Residency
Definition
Keeps model weights permanently resident in GPU HBM across all requests without eviction.
Purpose
Without it, each request reloads weights from CPU/NVMe, adding seconds per call. Always enable in serving.
Length-Aware Load Balancing
Definition
Distributes requests across replicas using request-length-aware routing to minimize queue depth variance.
Purpose
Naive round-robin causes 3x latency variance. Length-aware routing cuts it to under 30%. Critical at scale.
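A minimal sketch of length-aware routing: pick the replica with the least estimated pending work, where `pending_tokens` is an assumed per-replica bookkeeping counter:

```python
def route(replicas: list[dict], est_output_tokens: int) -> dict:
    target = min(replicas, key=lambda r: r["pending_tokens"])
    target["pending_tokens"] += est_output_tokens  # account for the new request
    return target

replicas = [{"id": 0, "pending_tokens": 900}, {"id": 1, "pending_tokens": 250}]
print(route(replicas, est_output_tokens=400)["id"])  # -> 1 (shortest queue)
```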
Autoscaling
Definition
Adds or removes serving replicas based on queue depth, GPU utilization, or predicted traffic load.
Purpose
Horizontal scaling is the escape valve for throughput. Scale-to-zero for idle models eliminates idle GPU cost.
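A toy queue-depth autoscaler, with `per_replica_capacity` as an assumed requests-per-replica budget; `min_replicas=0` gives scale-to-zero:

```python
import math

def desired_replicas(queue_depth: int, per_replica_capacity: int,
                     min_replicas: int = 0, max_replicas: int = 32) -> int:
    want = math.ceil(queue_depth / per_replica_capacity) if queue_depth else 0
    return max(min_replicas, min(max_replicas, want))

print(desired_replicas(queue_depth=270, per_replica_capacity=64))  # -> 5
```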
FP8 Inference
Definition
Routes matmul operations through H100 FP8 Tensor Cores with BF16 accumulation and residuals.
Purpose
Provides 2x throughput over BF16 serving with under 1% benchmark impact. TRT-LLM default for H100.
Speculative Decoding
Definition
A co-located small draft model proposes several tokens; the large verifier model checks them all in one step per request.
Purpose
2–3x lower latency for interactive workloads. Transparent to caller: identical output distribution.
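A greedy-decoding sketch of one draft-and-verify round; the model objects and methods are illustrative. Production systems use rejection sampling, which is what makes the accepted output match the verifier's distribution exactly:

```python
def speculative_step(draft, verifier, prefix: list, k: int = 4) -> list:
    proposal = list(prefix)
    for _ in range(k):                       # cheap draft proposes k tokens
        proposal.append(draft.next_token(proposal))
    drafted = proposal[len(prefix):]

    # one parallel verifier pass: checked[i] is the verifier's greedy token
    # given prefix + drafted[:i]
    checked = verifier.next_tokens(prefix, drafted)

    accepted = []
    for d, v in zip(drafted, checked):
        if d != v:
            accepted.append(v)               # first disagreement: keep the
            break                            # verifier's token and stop
        accepted.append(d)                   # agreement: keep the drafted token
    return prefix + accepted                 # 1..k tokens per verifier call
```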
Request Scheduling
Definition
Prioritizes, batches, and schedules incoming requests by deadline, SLO tier, or estimated generation length.
Purpose
Sophisticated queue management prevents head-of-line blocking and cuts P99 latency significantly.
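A deadline-ordered admission queue is the simplest version of this idea; a minimal sketch using a heap (earliest deadline first):

```python
import heapq
import itertools

class RequestQueue:
    def __init__(self):
        self._heap = []
        self._tie = itertools.count()   # tie-breaker for equal deadlines

    def submit(self, deadline: float, request) -> None:
        heapq.heappush(self._heap, (deadline, next(self._tie), request))

    def next_request(self):
        # tightest deadline first, so one long request can't block the line
        return heapq.heappop(self._heap)[2]
```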
Model Registry
Definition
Maintains model weight snapshots in a registry with metadata: dataset, eval scores, safety checks.
Purpose
Essential for A/B testing, rollback, and compliance auditing. The backbone of responsible production ML.
Canary Deployment
Definition
Routes 1–5% of traffic to a new model version while the old version serves the remainder.
Purpose
Catches regression in production metrics before full rollout. Standard ML deployment practice everywhere.
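A sketch of a deterministic canary split: hashing the user ID keeps each user pinned to one version for the whole rollout (names are illustrative):

```python
import hashlib

def pick_version(user_id: str, canary_pct: float = 5.0) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_pct else "stable"

print(pick_version("user-1234"))  # same user always gets the same version
```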
Shadow Deployment
Definition
Runs a new model in parallel with the production model, comparing outputs without serving them to users.
Purpose
Zero-risk validation against real production traffic before any live exposure to actual end users.
A/B Testing
Definition
Routes traffic splits to different model versions and measures downstream business metrics.
Purpose
The gold standard for production model evaluation. Benchmark scores rarely predict business metric impact.
S-LoRA
Definition
Pages LoRA adapter weights into GPU memory on demand with priority-based eviction across requests.
Purpose
2000+ concurrent adapters on 4×A100s with 4ms swap latency. Enables SaaS-scale personalization.
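A toy version of the paging idea: a fixed-size pool of GPU-resident adapters with least-recently-used eviction (real systems like S-LoRA also weigh request priority; `load` stands in for the host-to-device copy):

```python
from collections import OrderedDict

class AdapterCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = OrderedDict()   # adapter_id -> weights, in recency order

    def get(self, adapter_id: str):
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)    # refresh recency
        else:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)    # evict the coldest adapter
            self.resident[adapter_id] = self.load(adapter_id)
        return self.resident[adapter_id]

    def load(self, adapter_id: str):
        return f"weights:{adapter_id}"  # placeholder for the H2D transfer
```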
Segment-Level BGMV Kernel
Definition
Implements a segment-level BGMV kernel for heterogeneous LoRA batches in one fused kernel call.
Purpose
Processes requests with different adapters at near-native matmul efficiency. The kernel behind S-LoRA.
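The kernel's semantics can be stated in a few lines of NumPy: every row of the batch is multiplied by the low-rank pair selected by its segment ID. A fused GPU kernel handles all segments in one launch; this loop only shows the math:

```python
import numpy as np

def segmented_lora(x, A_list, B_list, segment_ids):
    out = np.empty((x.shape[0], B_list[0].shape[1]))
    for i, seg in enumerate(segment_ids):
        A, B = A_list[seg], B_list[seg]   # this request's low-rank pair
        out[i] = x[i] @ A @ B             # (d,) @ (d,r) @ (r,d_out)
    return out

x = np.random.randn(3, 8)                        # 3 requests, hidden dim 8
A = [np.random.randn(8, 4) for _ in range(2)]    # two adapters, rank 4
B = [np.random.randn(4, 8) for _ in range(2)]
print(segmented_lora(x, A, B, segment_ids=[0, 1, 0]).shape)  # (3, 8)
```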
vLLM & SGLang
Definition
Leading open-source inference servers: vLLM for compatibility breadth, SGLang for structured generation speed.
Purpose
Both are production-grade. Choose based on workload: vLLM for breadth, SGLang for agentic pipelines.
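For reference, a minimal vLLM offline-generation example; the model ID is a placeholder, and the exact API should be checked against current vLLM docs:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # any HF model id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```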
Multi-LoRA Serving
Definition
Serves many distinct LoRA adapter variants from one shared base model using adapter-aware kernels.
Purpose
Per-user model customization at SaaS scale. The enabling technology for personalized AI products.
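In vLLM this looks roughly like the sketch below (recent releases; the adapter name, ID, and path are placeholders, and the API should be verified against current docs):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# one shared base model; each request names its own adapter
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)
out = llm.generate(
    ["Summarize my account activity."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("tenant-42", 1, "/adapters/tenant-42"),
)
print(out[0].outputs[0].text)
```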