Context & Memory Optimization
Terms and explanations from the LLM Optimization Dictionary.

Context Window Extension
Definition
Enabling a model to process sequences longer than its training context using post-hoc techniques.
Purpose
One of the most requested production capabilities. Every major technique trades some accuracy or compute for length.
YaRN
Definition
Applies NTK-aware interpolation to high-frequency RoPE dims with attention temperature correction.
Purpose
Extends Mistral 7B to 128K tokens with only 1B tokens of fine-tuning. The most adopted method.
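The NTK-by-parts idea behind this scheme can be sketched in plain Python: RoPE dimensions that rotate fast relative to the original context are left alone, slow-rotating dimensions are divided by the scale factor, and those in between are blended linearly. The `alpha`/`beta` ramp bounds and the temperature formula follow the YaRN paper's published defaults; the function names and the flat-list output are illustrative simplifications, not the reference implementation.

```python
import math

def yarn_frequencies(dim=64, base=10000.0, scale=4.0,
                     orig_ctx=4096, alpha=1, beta=32):
    # Per-dimension RoPE frequencies with NTK-by-parts interpolation:
    # high-frequency dims are kept as-is, low-frequency dims are divided
    # by `scale`, with a linear ramp between the two regimes.
    freqs = []
    for i in range(dim // 2):
        freq = base ** (-2 * i / dim)
        wavelength = 2 * math.pi / freq
        # r = how many full rotations this dim completes over the context
        r = orig_ctx / wavelength
        if r > beta:          # fast-rotating dim: leave untouched
            ramp = 0.0
        elif r < alpha:       # slow-rotating dim: fully interpolate
            ramp = 1.0
        else:                 # in between: blend linearly
            ramp = 1 - (r - alpha) / (beta - alpha)
        freqs.append(freq * ((1 - ramp) + ramp / scale))
    return freqs

def attention_temperature(scale):
    # YaRN's logit scaling: softmax temperature grows slowly with scale
    return 0.1 * math.log(scale) + 1.0
```

At scale 1 the temperature is exactly 1 and the frequencies reduce to standard RoPE, which is why the method degrades gracefully when no extension is requested.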
LongLoRA
Definition
Combines shifted sparse attention (S^2-Attn) during fine-tuning with dense attention at inference.
Purpose
Trains 7B models to 100K tokens for the cost of approximately 1000 GPU hours. Highly compute-efficient.
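The shift trick itself is small enough to sketch. `s2_attn_groups` is a hypothetical helper showing only which token indices each half of the heads would attend within (group size `group`); it is not a full attention implementation.

```python
def s2_attn_groups(seq_len, group):
    # Half the heads attend within plain contiguous groups; the other
    # half attend within the same grouping applied to a sequence rolled
    # by half a group, so neighbouring groups overlap and information
    # flows between them.
    tokens = list(range(seq_len))
    plain = [tokens[i:i + group] for i in range(0, seq_len, group)]
    rolled = tokens[group // 2:] + tokens[:group // 2]
    shifted = [rolled[i:i + group] for i in range(0, seq_len, group)]
    return plain, shifted
```

Because each head still attends within fixed-size groups, the training cost stays near-linear in sequence length, while the shifted half stitches the groups together.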
Retrieval-Augmented Generation (RAG)
Definition
Retrieves the top-k most relevant documents from an index and prepends them to the context.
Purpose
Shifts factual knowledge from model weights to a searchable, updateable external store. No retraining needed.
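A toy end-to-end sketch, assuming documents are already embedded as vectors; the function names and prompt template are illustrative.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_and_prepend(query_vec, index, query_text, k=2):
    # index: list of (doc_text, doc_vec) pairs. Rank by similarity,
    # take the top-k documents, and prepend them to the prompt.
    ranked = sorted(index, key=lambda d: cosine(query_vec, d[1]),
                    reverse=True)
    context = "\n".join(doc for doc, _ in ranked[:k])
    return f"{context}\n\nQuestion: {query_text}"
```

Updating knowledge then means updating the index, not the model: swap a `(doc_text, doc_vec)` entry and the next query sees the new fact.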
KV Cache Compression
Definition
Selectively evicts or quantizes KV cache entries to reduce memory footprint during long generation.
Purpose
Enables sequences 4–8x longer than the GPU's memory would otherwise allow. Critical for long documents.
H2O (Heavy-Hitter Oracle)
Definition
Retains KV entries of heavy-hitter tokens (highest cumulative attention scores) and evicts the rest.
Purpose
Preserves 80–90% of generation quality at 50% KV cache reduction. State-of-the-art eviction policy.
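The eviction rule follows directly from the definition: score each key by its cumulative attention column sum, always keep the most recent tokens, and keep the heaviest of the rest. This is a simplified offline illustration; the actual method maintains these scores online during generation.

```python
def h2o_evict(attn_scores, recent, budget):
    # attn_scores[t][s]: attention from query t to key s (causal, so
    # entries with s > t are zero). Column sums identify heavy hitters.
    n = len(attn_scores)
    cum = [sum(attn_scores[t][s] for t in range(s, n)) for s in range(n)]
    recent_keep = set(range(n - recent, n))          # always keep recent
    older = list(range(n - recent))
    heavy = sorted(older, key=lambda s: cum[s],
                   reverse=True)[:budget - recent]   # keep heaviest rest
    return sorted(recent_keep | set(heavy))
```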
SnapKV
Definition
Uses an observation window at the end of the prompt to identify which positions will be attended to most.
Purpose
Consistent quality with H2O but more robust to diverse prompt structures. Evicts before generation starts.
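A simplified sketch of window-based selection, assuming the prompt's attention scores are available as a dense matrix; the real method aggregates votes per head with pooling, while this version just sums them.

```python
def snapkv_select(attn_scores, obs_window, budget):
    # attn_scores[t][s]: attention from prompt position t to key s.
    # The last `obs_window` queries "vote" for which earlier keys
    # matter; keep the top-`budget` voted keys plus the window itself.
    n = len(attn_scores)
    prefix = n - obs_window
    votes = [sum(attn_scores[t][s] for t in range(prefix, n))
             for s in range(prefix)]
    keep = sorted(range(prefix), key=lambda s: votes[s],
                  reverse=True)[:budget]
    return sorted(keep) + list(range(prefix, n))
```

Because selection happens once, on the prompt alone, the compressed cache is fixed before the first output token is generated.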
StreamingLLM
Definition
Retains attention sinks (first 4 tokens) and a rolling window of recent tokens; discards all middle positions.
Purpose
Enables unbounded streaming generation with a fixed memory budget. Perfect for always-on deployments.
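The cache policy is simple enough to sketch exactly: a fixed list of sink entries plus a bounded deque of recent entries. `StreamingKVCache` is an illustrative name, not an API from any library.

```python
from collections import deque

class StreamingKVCache:
    # Keep the first `n_sinks` tokens forever plus a rolling window of
    # the most recent `window` tokens; everything in between is dropped.
    def __init__(self, n_sinks=4, window=8):
        self.n_sinks = n_sinks
        self.sinks = []
        self.recent = deque(maxlen=window)  # old entries fall off

    def append(self, token_kv):
        if len(self.sinks) < self.n_sinks:
            self.sinks.append(token_kv)
        else:
            self.recent.append(token_kv)

    def cached(self):
        return self.sinks + list(self.recent)
```

Memory use is bounded by `n_sinks + window` regardless of how many tokens stream through, which is exactly the fixed-budget property the definition describes.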
Infini-attention
Definition
Splits the sequence into local segments and maintains a compressive memory of past segments via linear attention.
Purpose
Global context without global memory. Scales to arbitrarily long documents at constant memory cost.
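A bare-bones sketch of a compressive memory, using raw keys where the paper applies a nonlinearity and omitting its delta-rule update: the memory is a running sum of key-value outer products, read out linear-attention style. All names are illustrative.

```python
def update_memory(M, z, keys, values):
    # Fold a segment into the running memory: M accumulates key-value
    # outer products, z accumulates keys for normalisation.
    d = len(keys[0])
    for k, v in zip(keys, values):
        for i in range(d):
            z[i] += k[i]
            for j in range(len(v)):
                M[i][j] += k[i] * v[j]
    return M, z

def read_memory(M, z, q):
    # Linear-attention readout: (q @ M) / (q @ z)
    num = [sum(q[i] * M[i][j] for i in range(len(q)))
           for j in range(len(M[0]))]
    den = sum(q[i] * z[i] for i in range(len(q)))
    return [x / den for x in num]
```

The memory's size is `d × d_value` no matter how many segments are folded in, which is where the constant-memory claim comes from.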
Tiled Attention
Definition
Computes attention in SRAM tiles, never writing the full N×N matrix to HBM.
Purpose
The defining technique of Flash Attention and all its variants. The prerequisite for long-context training.
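The core trick, the online softmax, can be shown for a single query row in pure Python: each key/value tile is folded into a running max `m`, normaliser `l`, and output accumulator, so the full score row never exists at once. This is a numerical sketch of the recurrence, not the fused-kernel implementation.

```python
import math

def tiled_attention_row(q, keys, values, tile=2):
    # One query row of attention, computed tile-by-tile.
    m, l = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        ks = keys[start:start + tile]
        vs = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
        m_new = max(m, max(scores))
        # Rescale previous partial results to the new running max.
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        l *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - m_new)
            l += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]
```

The rescaling step is what lets later tiles revise the softmax normalisation of earlier ones, giving exact attention without ever materialising the N×N matrix.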
KV Cache Offloading
Definition
Pages less-recently-used KV blocks or weight shards to CPU DRAM or NVMe SSDs during inference.
Purpose
Extends effective GPU memory by 8–20x at the cost of PCIe bandwidth latency. Trade speed for capacity.
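A toy paging policy, assuming LRU eviction and using plain dicts to stand in for GPU and CPU memory; `PagedKVStore` is an illustrative name.

```python
from collections import OrderedDict

class PagedKVStore:
    # Hot KV blocks live in "GPU" memory (an OrderedDict used as an
    # LRU); on overflow, the least-recently-used block pages to "CPU".
    def __init__(self, gpu_blocks=2):
        self.gpu = OrderedDict()
        self.cpu = {}
        self.capacity = gpu_blocks

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)   # mark as recently used
            return self.gpu[block_id]
        block = self.cpu.pop(block_id)       # page in (the slow path)
        self.put(block_id, block)
        return block

    def put(self, block_id, block):
        self.gpu[block_id] = block
        self.gpu.move_to_end(block_id)
        if len(self.gpu) > self.capacity:
            victim, data = self.gpu.popitem(last=False)  # evict LRU
            self.cpu[victim] = data
```

Every miss in `get` models a PCIe transfer: correctness is unaffected, but latency depends on how often the working set spills past `gpu_blocks`.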
Prefix Caching
Definition
Stores computed KV states for shared prompt prefixes and reuses them across all requests.
Purpose
A 1000-token system prompt cached server-side saves 1000 tokens of prefill per request. ROI is immediate.
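A minimal server-side sketch: key the cache by a hash of the prefix tokens and compute KV states only on a miss. `compute_kv` stands in for the model's prefill pass; all names are illustrative.

```python
import hashlib

class PrefixCache:
    # Maps a hash of a shared prompt prefix to its precomputed KV
    # states, so repeated system prompts skip prefill entirely.
    def __init__(self):
        self.store = {}

    def _key(self, prefix_tokens):
        return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()

    def get_or_compute(self, prefix_tokens, compute_kv):
        k = self._key(prefix_tokens)
        if k not in self.store:
            self.store[k] = compute_kv(prefix_tokens)  # one-time prefill
        return self.store[k]
```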
Recurrent Memory Transformer (RMT)
Definition
Appends trainable memory tokens that persist across segments; model reads and writes them like a soft KV store.
Purpose
Hybrid architecture: transformer within segments, RNN-like memory across segments. Unbounded effective context.
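The segment loop can be sketched abstractly; `step_fn` is a hypothetical stand-in for one transformer call that reads the incoming memory tokens and returns updated ones.

```python
def process_with_memory(segments, mem_size, step_fn):
    # RNN-like loop: memory tokens written by one segment are fed
    # (read) into the next, carrying state across segment boundaries.
    memory = [0.0] * mem_size
    outputs = []
    for seg in segments:
        out, memory = step_fn(memory, seg)  # transformer per segment
        outputs.extend(out)
    return outputs, memory
```

Within a segment the model is a plain transformer; only the small memory vector crosses segments, so effective context grows without the attention window growing.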
External Memory
Definition
Equips the model with an external differentiable key-value store or vector database read via attention.
Purpose
Separates knowledge retrieval from reasoning computation. Enables explicit memory update and inspection.
RoPE Base Scaling
Definition
Multiplies the RoPE base frequency by a scalar >1 to extend context beyond training length.
Purpose
Cheapest context extension: no fine-tuning for modest 2–4x extensions. Quality drops at larger ratios.
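The mechanism fits in a few lines: raising the base uniformly slows every non-constant dimension's rotation, so a larger position range maps into the angle range seen during training. Names are illustrative.

```python
def rope_frequencies(dim, base=10000.0):
    # Standard RoPE frequencies: theta_i = base^(-2i/dim)
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

# Multiplying the base by a factor > 1 lowers every non-constant
# frequency, stretching each dimension's wavelength.
original = rope_frequencies(8, base=10000.0)
extended = rope_frequencies(8, base=10000.0 * 8)
```

Because the change is a single hyperparameter at inference time, it costs nothing to try; the quality cliff at large ratios comes from the fastest dimensions being slowed past what the model saw in training.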
ALiBi Length Extrapolation
Definition
Extends ALiBi's linear bias to lengths beyond training without any modification or fine-tuning needed.
Purpose
Free length generalization: quality degrades gracefully rather than catastrophically past training length.
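The bias is purely a function of query-key distance, which is why no retraining is needed at new lengths. A sketch with the geometric head slopes from the ALiBi paper; masked (future) positions are marked `None` here for clarity.

```python
def alibi_bias(n, slope):
    # Linear penalty -slope * distance added to each attention logit;
    # the same formula applies at any sequence length.
    return [[-slope * (q - k) if k <= q else None for k in range(n)]
            for q in range(n)]

def alibi_slopes(n_heads):
    # Geometric slopes: 2^(-8/n), 2^(-16/n), ..., one per head.
    return [2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)]
```

At positions past the training length the penalty simply keeps growing linearly, which is the "graceful rather than catastrophic" degradation the purpose describes.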