LLM Optimization Dictionary

Context & Memory Optimization

Context & Memory Optimization terms and explanations from the LLM Optimization Dictionary.

16 terms in this chapter
01

Context Window Extension

Definition

Enabling a model to process sequences longer than its training context using post-hoc techniques.

Purpose

The most requested production capability. Every major technique trades some accuracy or compute for length.

02

YaRN

Definition

Applies NTK-by-parts interpolation to RoPE dims, interpolating low-frequency dims while preserving high-frequency ones, with an attention temperature correction.

Purpose

Extends Mistral 7B to 128K tokens with only 1B tokens of fine-tuning. The most adopted method.
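
Example

A minimal NumPy sketch of YaRN's NTK-by-parts frequency rescaling. Hyperparameters (alpha=1, beta=32) follow the paper's LLaMA defaults; function names are illustrative, not any library's API.

    import numpy as np

    def yarn_inv_freq(dim=128, base=10000.0, scale=8.0,
                      train_len=4096, alpha=1.0, beta=32.0):
        # Standard per-dimension RoPE frequencies.
        inv_freq = base ** (-np.arange(0, dim, 2) / dim)
        # How many full rotations each dim completes over the training window.
        rotations = train_len * inv_freq / (2 * np.pi)
        # Ramp: 0 = fully interpolate (slow, low-frequency dims),
        #       1 = leave untouched (fast, high-frequency dims).
        ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
        return inv_freq * ramp + (inv_freq / scale) * (1.0 - ramp)

    def yarn_attention_factor(scale):
        # Temperature correction; common implementations apply it as a
        # multiplier on the rotary embeddings.
        return 0.1 * np.log(scale) + 1.0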

03

LongLoRA

Definition

Combines shifted sparse attention (S^2-Attn) during fine-tuning with dense attention at inference.

Purpose

Trains 7B models to 100K tokens for the cost of approximately 1000 GPU hours. Highly compute-efficient.
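
Example

A toy NumPy sketch of S^2-Attn: half the heads attend within fixed groups, the other half within groups shifted by half the group size, so information crosses group boundaries. Shapes and names are illustrative, not LongLoRA's code.

    import numpy as np

    def s2_attn(q, k, v, group=256):
        # q, k, v: [heads, seq, dim]; seq assumed divisible by group.
        h, n, d = q.shape
        out = np.empty_like(q)
        for i in range(h):
            shift = group // 2 if i >= h // 2 else 0
            qi = np.roll(q[i], -shift, axis=0)   # shift half the heads
            ki = np.roll(k[i], -shift, axis=0)
            vi = np.roll(v[i], -shift, axis=0)
            o = np.empty_like(qi)
            for s in range(0, n, group):         # dense attention per group
                qg, kg, vg = qi[s:s+group], ki[s:s+group], vi[s:s+group]
                a = qg @ kg.T / np.sqrt(d)
                a = np.exp(a - a.max(axis=-1, keepdims=True))
                a /= a.sum(axis=-1, keepdims=True)
                o[s:s+group] = a @ vg
            out[i] = np.roll(o, shift, axis=0)   # undo the shift
        return out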

04

RAG (Retrieval-Augmented Generation)

Definition

Retrieves the top-k most relevant documents from an index and prepends them to the context.

Purpose

Shifts factual knowledge from model weights to a searchable, updateable external store. No retraining needed.
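
Example

A minimal sketch of the retrieve-then-prepend step over a toy in-memory index. The embeddings and prompt template are placeholders; real systems use a vector database and an embedding model.

    import numpy as np

    def rag_prompt(query_emb, doc_embs, docs, question, k=3):
        # Cosine similarity between the query and every indexed document.
        sims = doc_embs @ query_emb / (
            np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
        top = np.argsort(-sims)[:k]              # top-k most relevant docs
        context = "\n\n".join(docs[i] for i in top)
        return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"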

05

KV Cache Compression

Definition

Selectively evicts or quantizes KV cache entries to reduce memory footprint during long generation.

Purpose

Enables sequences 4–8x longer than the GPU's memory would otherwise allow. Critical for long documents.
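
Example

A minimal sketch of the quantization side: storing KV entries as int8 with a per-token scale instead of fp16/fp32, roughly a 2-4x footprint cut. Eviction-based compression appears under H2O and SnapKV below.

    import numpy as np

    def quantize_kv(kv, bits=8):
        # kv: [..., head_dim] float tensor; one scale per token vector.
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(kv).max(axis=-1, keepdims=True) / qmax + 1e-9
        return np.round(kv / scale).astype(np.int8), scale

    def dequantize_kv(q, scale):
        return q.astype(np.float32) * scale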

06

H2O (Heavy-Hitter Oracle)

Definition

Retains KV entries of heavy-hitter tokens (highest cumulative attention scores) and evicts the rest.

Purpose

Preserves 80–90% of generation quality at 50% KV cache reduction. A widely cited eviction policy.
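
Example

A minimal sketch of the eviction rule under assumed inputs: accumulated attention probabilities score each key, and the top scorers plus a recent window survive. Parameter names are ours, not the paper's code.

    import numpy as np

    def h2o_keep_mask(attn, budget, recent=32):
        # attn: [n_queries, n_keys] attention probabilities seen so far.
        cum = attn.sum(axis=0)                   # cumulative score per key
        keep = np.zeros(attn.shape[1], dtype=bool)
        keep[-recent:] = True                    # always keep the local window
        remaining = budget - keep.sum()
        for idx in np.argsort(-cum):             # heavy hitters first
            if remaining <= 0:
                break
            if not keep[idx]:
                keep[idx] = True
                remaining -= 1
        return keep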

07

SnapKV

Definition

Uses an observation window at the end of the prompt to identify which positions will be attended to most.

Purpose

Matches H2O's quality while being more robust to diverse prompt structures. Evicts before generation starts.
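
Example

A minimal sketch of the selection step: the last `window` prompt tokens vote on earlier KV positions, and a small max-pool keeps clusters rather than isolated positions. Pooling width and names are illustrative.

    import numpy as np

    def snapkv_select(attn, window=32, budget=1024, pool=7):
        # attn: [n_queries, n_keys] prompt attention probabilities.
        votes = attn[-window:, :-window].sum(axis=0)    # votes for prefix keys
        pad = pool // 2                                 # 1-D max pooling
        padded = np.pad(votes, pad, mode="edge")
        pooled = np.max(
            np.stack([padded[i:i + votes.size] for i in range(pool)]), axis=0)
        keep_prefix = np.argsort(-pooled)[:budget - window]
        # Keep the selected prefix positions plus the window itself.
        return np.concatenate([np.sort(keep_prefix),
                               np.arange(votes.size, votes.size + window)])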

08

StreamingLLM

Definition

Retains attention sinks (first 4 tokens) and a rolling window of recent tokens; discards all middle positions.

Purpose

Enables unbounded streaming generation with a fixed memory budget. Perfect for always-on deployments.
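
Example

A minimal pure-Python sketch of the cache policy alone. Real implementations also reassign positions inside the rolling cache, which is omitted here; the class name is ours.

    class StreamingKVCache:
        def __init__(self, sinks=4, window=1020):
            self.sinks, self.window = sinks, window
            self.entries = []                    # (key, value) per token

        def append(self, kv):
            self.entries.append(kv)
            if len(self.entries) > self.sinks + self.window:
                del self.entries[self.sinks]     # evict oldest non-sink token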

09

Infini-Attention

Definition

Splits the sequence into local segments and maintains a compressive memory of past segments via linear attention.

Purpose

Global context without global memory. Scales to arbitrarily long documents at constant memory cost.
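
Example

A minimal NumPy sketch of the compressive memory for one head: past segments fold into a d x d matrix via a linear-attention update, and retrieval costs O(d^2) however many segments came before. The learned gate that blends this with local attention is omitted.

    import numpy as np

    def elu1(x):                                 # kernel feature map, ELU + 1
        return np.where(x > 0, x + 1.0, np.exp(x))

    class CompressiveMemory:
        def __init__(self, dim):
            self.M = np.zeros((dim, dim))        # associative memory matrix
            self.z = np.zeros(dim)               # normalization term

        def update(self, k, v):                  # k, v: [seg_len, dim]
            sk = elu1(k)
            self.M += sk.T @ v
            self.z += sk.sum(axis=0)

        def retrieve(self, q):                   # q: [seg_len, dim]
            sq = elu1(q)
            return (sq @ self.M) / (sq @ self.z + 1e-9)[:, None]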

10

Memory-Efficient Attention

Definition

Computes attention in SRAM tiles, never writing the full N×N matrix to HBM.

Purpose

The defining technique of FlashAttention and all its variants. The prerequisite for long-context training.
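
Example

A minimal NumPy sketch of the online-softmax tiling at the heart of FlashAttention: K/V are processed in tiles while running max and sum statistics keep the softmax exact, so the full N×N matrix never materializes. Single head, no masking.

    import numpy as np

    def tiled_attention(q, k, v, tile=128):
        n, d = q.shape
        out = np.zeros((n, d))
        m = np.full(n, -np.inf)                  # running row max
        l = np.zeros(n)                          # running softmax denominator
        for s in range(0, n, tile):
            scores = q @ k[s:s+tile].T / np.sqrt(d)
            m_new = np.maximum(m, scores.max(axis=1))
            p = np.exp(scores - m_new[:, None])
            corr = np.exp(m - m_new)             # rescale old statistics
            l = l * corr + p.sum(axis=1)
            out = out * corr[:, None] + p @ v[s:s+tile]
            m = m_new
        return out / l[:, None]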

11

CPU/NVMe Offloading

Definition

Pages less-recently-used KV blocks or weight shards to CPU DRAM or NVMe SSDs during inference.

Purpose

Extends effective GPU memory by 8–20x at the cost of PCIe transfer latency. Trades speed for capacity.
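
Example

A minimal PyTorch sketch of LRU paging between GPU and pinned CPU memory. NVMe tiering, async prefetch, and block layout are omitted; class and method names are ours.

    import torch
    from collections import OrderedDict

    class OffloadingKVPool:
        def __init__(self, max_gpu_blocks, device="cuda"):
            self.max_gpu_blocks = max_gpu_blocks
            self.device = device
            self.gpu = OrderedDict()             # block_id -> GPU tensor
            self.cpu = {}                        # block_id -> pinned CPU tensor

        def get(self, block_id):
            if block_id in self.gpu:
                self.gpu.move_to_end(block_id)   # mark as recently used
            else:                                # page back in over PCIe
                self._evict_lru()
                self.gpu[block_id] = self.cpu.pop(block_id).to(
                    self.device, non_blocking=True)
            return self.gpu[block_id]

        def put(self, block_id, block):
            self._evict_lru()
            self.gpu[block_id] = block

        def _evict_lru(self):
            while len(self.gpu) >= self.max_gpu_blocks:
                bid, victim = self.gpu.popitem(last=False)
                self.cpu[bid] = victim.cpu().pin_memory()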

12

Prefix Caching

Definition

Stores the computed KV states of a shared prompt prefix and reuses them across every request that begins with it.

Purpose

A 1000-token system prompt cached server-side saves 1000 tokens of prefill per request. ROI is immediate.
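
Example

A minimal sketch of the bookkeeping: hash the prefix tokens, store their KV states on first sight, and run prefill only on the per-request suffix afterwards. `prefill_fn` is a placeholder standing in for the model's actual prefill.

    import hashlib

    class PrefixCache:
        def __init__(self, prefill_fn):
            self.prefill_fn = prefill_fn
            self.store = {}

        def prefill(self, prefix_tokens, suffix_tokens):
            key = hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()
            if key not in self.store:
                # Cache miss: pay the prefix prefill cost exactly once.
                self.store[key] = self.prefill_fn(prefix_tokens, past_kv=None)
            # Every later request only prefills its own suffix.
            return self.prefill_fn(suffix_tokens, past_kv=self.store[key])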

13

Recurrent Memory Transformer

Definition

Appends trainable memory tokens that persist across segments; model reads and writes them like a soft KV store.

Purpose

Hybrid architecture: transformer within segments, RNN-like memory across segments. Unbounded effective context.
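
Example

A minimal sketch of the segment recurrence: memory tokens are prepended (read) and appended (write) around each segment, and the written states roll forward. `transformer` is a placeholder for any layer stack.

    import numpy as np

    def rmt_forward(segments, transformer, mem, n_mem=16):
        # segments: list of [seg_len, dim] arrays; mem: [n_mem, dim].
        outputs = []
        for seg in segments:
            x = np.concatenate([mem, seg, mem])  # [read | segment | write]
            y = transformer(x)
            outputs.append(y[n_mem:-n_mem])      # segment outputs
            mem = y[-n_mem:]                     # carried to the next segment
        return outputs, mem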

14

Memory-Augmented LLMs

Definition

Equips the model with an external differentiable key-value store or vector database read via attention.

Purpose

Separates knowledge retrieval from reasoning computation. Enables explicit memory update and inspection.
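
Example

A minimal kNN-LM-style sketch of one memory read: the k nearest stored hidden states vote for their recorded next tokens, and that distribution is interpolated with the model's own. The weight `lam` and all names are illustrative.

    import numpy as np

    def knn_blend(hidden, model_probs, mem_keys, mem_tokens,
                  vocab_size, k=8, lam=0.25, temp=1.0):
        d = np.linalg.norm(mem_keys - hidden, axis=1)   # distance to memory
        nn = np.argsort(d)[:k]
        w = np.exp(-d[nn] / temp)
        w /= w.sum()
        knn_probs = np.zeros(vocab_size)
        np.add.at(knn_probs, mem_tokens[nn], w)         # scatter neighbor mass
        return lam * knn_probs + (1 - lam) * model_probs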

15

RoPE Frequency Scaling

Definition

Multiplies the RoPE base (e.g., 10,000) by a scalar >1, lowering rotation frequencies to extend context beyond training length.

Purpose

Cheapest context extension: no fine-tuning for modest 2–4x extensions. Quality drops at larger ratios.
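
Example

A minimal sketch of "NTK-aware" base scaling, assuming the commonly used exponent correction; raising the base lowers every dimension's rotation frequency.

    import numpy as np

    def scaled_rope_inv_freq(dim=128, base=10000.0, factor=4.0):
        scaled_base = base * factor ** (dim / (dim - 2))   # NTK-aware exponent
        return scaled_base ** (-np.arange(0, dim, 2) / dim)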

16

ALiBi for Long Context

Definition

Extends ALiBi's linear attention bias to lengths beyond training with no modification or fine-tuning.

Purpose

Free length generalization: quality degrades gracefully rather than catastrophically past training length.
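
Example

A minimal NumPy sketch of the bias itself; the formula has no length-dependent parameter, which is why it applies unchanged past the training length. Slopes assume a power-of-two head count.

    import numpy as np

    def alibi_bias(n_heads, seq_len):
        # Geometric series of head-specific slopes: 2^(-8/n), ..., 2^(-8).
        slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
        dist = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
        # Penalty grows linearly with distance into the past; future
        # positions (handled by the causal mask) get zero bias here.
        return slopes[:, None, None] * np.minimum(dist, 0)[None]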
