Serving & Systems Optimization terms and explanations from the LLM Optimization Dictionary.

Throughput
Definition
Total tokens generated per second across all concurrent requests under production load.
Purpose
The primary capacity metric. Maximized by batching, quantization, and caching. Determines cost per token.
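A back-of-the-envelope sketch of how throughput sets cost per token, using assumed GPU pricing and throughput numbers:

```python
# Hypothetical numbers: relate aggregate throughput to serving cost per token.
gpu_cost_per_hour = 2.50      # assumed hourly price of one GPU
throughput_tok_s = 5_000      # aggregate tokens/s across all concurrent requests

cost_per_million_tokens = gpu_cost_per_hour / (throughput_tok_s * 3600) * 1e6
print(f"${cost_per_million_tokens:.3f} per 1M tokens")  # ≈ $0.139
```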
End-to-End Latency
Definition
Total wall-clock time from request arrival to completion, decomposed as TTFT (time to first token) + output_tokens × TBT (time between tokens).
Purpose
The primary quality-of-service metric. Different workloads optimize for TTFT vs. TBT differently.
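The decomposition above translates directly into a one-line formula; a minimal sketch with illustrative numbers:

```python
def e2e_latency(ttft_s: float, output_tokens: int, tbt_s: float) -> float:
    """End-to-end latency = TTFT + output_tokens * TBT."""
    return ttft_s + output_tokens * tbt_s

# e.g. 300 ms TTFT, 500 output tokens at 20 ms/token -> 10.3 s total
print(e2e_latency(0.300, 500, 0.020))
```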
Service Level Objectives (SLOs)
Definition
Contractual targets for latency, throughput, or availability (e.g., P95 TTFT < 500ms).
Purpose
SLO violation triggers autoscaling, preemption, or degraded mode. Defines the serving system's envelope.
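A minimal sketch of checking the example target above (P95 TTFT < 500 ms), assuming TTFT samples collected in milliseconds:

```python
import numpy as np

def violates_slo(ttft_samples_ms, p95_target_ms=500.0) -> bool:
    # A breach of the P95 target is what triggers autoscaling or preemption.
    return np.percentile(ttft_samples_ms, 95) > p95_target_ms

print(violates_slo([120, 180, 240, 310, 620, 410, 290, 350]))  # True: P95 ≈ 546 ms
```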
Continuous Batching
Definition
Injects new requests into the batch at token boundaries rather than waiting for a full batch to complete.
Purpose
The innovation that made LLM serving economically viable. Default in vLLM, TRT-LLM, and SGLang.
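A toy event loop showing the idea: new requests join at token boundaries instead of waiting for the batch to drain. The `model` and request objects are illustrative, not a real server API:

```python
from collections import deque

def continuous_batching_loop(waiting: deque, model, max_batch: int):
    running = []
    while running or waiting:
        # admit queued requests into free slots at every token boundary
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        model.decode_step(running)  # generate one token for each running request
        running = [r for r in running if not r.finished]  # retire finished ones
```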
Iteration-Level Scheduling
Definition
Makes scheduling decisions at every decode step rather than per-request for fine-grained control.
Purpose
Enables preemption, priority, and admission control at microsecond granularity for multi-tenant serving.
Preemption
Definition
Pauses a running request (swapping its KV cache to CPU) to immediately serve a higher-priority request.
Purpose
Prevents SLO violations for premium users while maximizing GPU utilization across the full request pool.
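A sketch of the preemption decision, assuming hypothetical request objects with a `priority` field and a KV-cache manager that can swap blocks to host memory:

```python
def maybe_preempt(running: list, incoming, kv_cache):
    victim = min(running, key=lambda r: r.priority, default=None)
    if victim is not None and incoming.priority > victim.priority:
        kv_cache.swap_out_to_cpu(victim)   # pause: victim's KV blocks leave HBM
        running.remove(victim)
        running.append(incoming)           # higher-priority request starts now
        return victim                      # re-admitted later via a swap-in
    return None
```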
Model Parallelism
Definition
Partitions model weights across GPUs using tensor, pipeline, or expert parallelism strategies.
Purpose
Required for models too large for one device. Tensor parallel is lowest latency for small batch sizes.
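Tensor parallelism at its core is a sharded matmul; a NumPy sketch of a column-parallel linear layer across two "devices":

```python
import numpy as np

x = np.random.randn(4, 8)         # (batch, hidden)
W = np.random.randn(8, 16)        # full weight matrix, too big for one device
W0, W1 = np.split(W, 2, axis=1)   # shard columns across two devices

# each device computes its shard; outputs are concatenated
# (an all-gather collective in a real tensor-parallel implementation)
y = np.concatenate([x @ W0, x @ W1], axis=1)
assert np.allclose(y, x @ W)      # identical to the unsharded matmul
```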
Prefill/Decode Disaggregation
Definition
Routes prefill to a compute-optimized GPU pool and decode to a bandwidth-optimized pool.
Purpose
Prevents bursty prefill from starving decode capacity. Doubles overall cluster throughput in simulation.
Weight Residency
Definition
Keeps model weights permanently resident in GPU HBM across all requests without eviction.
Purpose
Without it, each request reloads weights from CPU/NVMe, adding seconds per call. Always enable in serving.
Length-Aware Load Balancing
Definition
Distributes requests across replicas using request-length-aware routing to minimize queue depth variance.
Purpose
Naive round-robin causes 3x latency variance. Length-aware routing cuts it to under 30%. Critical at scale.
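A minimal sketch of length-aware routing: pick the replica with the least estimated pending work, where `pending_tokens` is an assumed per-replica bookkeeping counter:

```python
def route(replicas: list[dict], est_output_tokens: int) -> dict:
    target = min(replicas, key=lambda r: r["pending_tokens"])
    target["pending_tokens"] += est_output_tokens  # account for the new request
    return target

replicas = [{"id": 0, "pending_tokens": 900}, {"id": 1, "pending_tokens": 250}]
print(route(replicas, est_output_tokens=400)["id"])  # -> 1 (shortest queue)
```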
Autoscaling
Definition
Adds or removes serving replicas based on queue depth, GPU utilization, or predicted traffic load.
Purpose
Horizontal scaling is the escape valve for throughput. Scale-to-zero for idle models eliminates idle GPU cost.
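A toy queue-depth autoscaler, with `per_replica_capacity` as an assumed requests-per-replica budget; `min_replicas=0` gives scale-to-zero:

```python
import math

def desired_replicas(queue_depth: int, per_replica_capacity: int,
                     min_replicas: int = 0, max_replicas: int = 32) -> int:
    want = math.ceil(queue_depth / per_replica_capacity) if queue_depth else 0
    return max(min_replicas, min(max_replicas, want))

print(desired_replicas(queue_depth=270, per_replica_capacity=64))  # -> 5
```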
FP8 Inference
Definition
Routes matmul operations through H100 FP8 Tensor Cores with BF16 accumulation and residuals.
Purpose
Provides 2x throughput over BF16 serving with under 1% benchmark impact. TRT-LLM default for H100.
Speculative Decoding
Definition
A co-located small draft model proposes several tokens; the large verifier model checks them all in one step per request.
Purpose
2–3x lower latency for interactive workloads. Transparent to caller: identical output distribution.
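A greedy-decoding sketch of one draft-and-verify round; the model objects and methods are illustrative. Production systems use rejection sampling, which is what makes the accepted output match the verifier's distribution exactly:

```python
def speculative_step(draft, verifier, prefix: list, k: int = 4) -> list:
    proposal = list(prefix)
    for _ in range(k):                       # cheap draft proposes k tokens
        proposal.append(draft.next_token(proposal))
    drafted = proposal[len(prefix):]

    # one parallel verifier pass: checked[i] is the verifier's greedy token
    # given prefix + drafted[:i]
    checked = verifier.next_tokens(prefix, drafted)

    accepted = []
    for d, v in zip(drafted, checked):
        if d != v:
            accepted.append(v)               # first disagreement: keep the
            break                            # verifier's token and stop
        accepted.append(d)                   # agreement: keep the drafted token
    return prefix + accepted                 # 1..k tokens per verifier call
```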
Request Scheduling
Definition
Prioritizes, batches, and schedules incoming requests by deadline, SLO tier, or estimated generation length.
Purpose
Sophisticated queue management prevents head-of-line blocking and cuts P99 latency significantly.
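A deadline-ordered admission queue is the simplest version of this idea; a minimal sketch using a heap (earliest deadline first):

```python
import heapq
import itertools

class RequestQueue:
    def __init__(self):
        self._heap = []
        self._tie = itertools.count()   # tie-breaker for equal deadlines

    def submit(self, deadline: float, request) -> None:
        heapq.heappush(self._heap, (deadline, next(self._tie), request))

    def next_request(self):
        # tightest deadline first, so one long request can't block the line
        return heapq.heappop(self._heap)[2]
```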
Model Registry
Definition
Maintains model weight snapshots in a registry with metadata: dataset, eval scores, safety checks.
Purpose
Essential for A/B testing, rollback, and compliance auditing. The backbone of responsible production ML.
Canary Deployment
Definition
Routes 1–5% of traffic to a new model version while the old version serves the remainder.
Purpose
Catches regression in production metrics before full rollout. Standard ML deployment practice everywhere.
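A sketch of a deterministic canary split: hashing the user ID keeps each user pinned to one version for the whole rollout (names are illustrative):

```python
import hashlib

def pick_version(user_id: str, canary_pct: float = 5.0) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_pct else "stable"

print(pick_version("user-1234"))  # same user always gets the same version
```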
Shadow Deployment
Definition
Runs a new model in parallel with the production model, comparing outputs without serving them to users.
Purpose
Zero-risk validation against real production traffic before any live exposure to actual end users.
A/B Testing
Definition
Routes traffic splits to different model versions and measures downstream business metrics.
Purpose
The gold standard for production model evaluation. Benchmark scores rarely predict business metric impact.
S-LoRA
Definition
Pages LoRA adapter weights into GPU memory on demand with priority-based eviction across requests.
Purpose
2000+ concurrent adapters on 4×A100s with 4ms swap latency. Enables SaaS-scale personalization.
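A toy version of the paging idea: a fixed-size pool of GPU-resident adapters with least-recently-used eviction (real systems like S-LoRA also weigh request priority; `load` stands in for the host-to-device copy):

```python
from collections import OrderedDict

class AdapterCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = OrderedDict()   # adapter_id -> weights, in recency order

    def get(self, adapter_id: str):
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)    # refresh recency
        else:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)    # evict the coldest adapter
            self.resident[adapter_id] = self.load(adapter_id)
        return self.resident[adapter_id]

    def load(self, adapter_id: str):
        return f"weights:{adapter_id}"  # placeholder for the H2D transfer
```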
Segment-Level BGMV Kernel
Definition
Implements a segment-level BGMV kernel for heterogeneous LoRA batches in one fused kernel call.
Purpose
Processes requests with different adapters at near-native matmul efficiency. The kernel behind S-LoRA.
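The kernel's semantics can be stated in a few lines of NumPy: every row of the batch is multiplied by the low-rank pair selected by its segment ID. A fused GPU kernel handles all segments in one launch; this loop only shows the math:

```python
import numpy as np

def segmented_lora(x, A_list, B_list, segment_ids):
    out = np.empty((x.shape[0], B_list[0].shape[1]))
    for i, seg in enumerate(segment_ids):
        A, B = A_list[seg], B_list[seg]   # this request's low-rank pair
        out[i] = x[i] @ A @ B             # (d,) @ (d,r) @ (r,d_out)
    return out

x = np.random.randn(3, 8)                        # 3 requests, hidden dim 8
A = [np.random.randn(8, 4) for _ in range(2)]    # two adapters, rank 4
B = [np.random.randn(4, 8) for _ in range(2)]
print(segmented_lora(x, A, B, segment_ids=[0, 1, 0]).shape)  # (3, 8)
```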
vLLM & SGLang
Definition
Leading open-source inference servers: vLLM for compatibility breadth, SGLang for structured generation speed.
Purpose
Both are production-grade. Choose based on workload: vLLM for breadth, SGLang for agentic pipelines.
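For reference, a minimal vLLM offline-generation example; the model ID is a placeholder, and the exact API should be checked against current vLLM docs:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # any HF model id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```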
Multi-LoRA Serving
Definition
Serves many distinct LoRA adapter variants from one shared base model using adapter-aware kernels.
Purpose
Per-user model customization at SaaS scale. The enabling technology for personalized AI products.
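In vLLM this looks roughly like the sketch below (recent releases; the adapter name, ID, and path are placeholders, and the API should be verified against current docs):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# one shared base model; each request names its own adapter
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)
out = llm.generate(
    ["Summarize my account activity."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("tenant-42", 1, "/adapters/tenant-42"),
)
print(out[0].outputs[0].text)
```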