MoE (Mixture of Experts)
Definition
Routes each token to a learned subset of expert FFN layers via a trainable router network.
Purpose
Decouples parameter count from compute. A 141B MoE model uses only 39B active parameters per token.
Top-k Routing
Definition
Activates only top-k (typically k=2) of E expert FFNs per token; the rest are skipped.
Purpose
Mixtral 8x7B matches LLaMA-2 70B quality with roughly 6x less compute per token. The modern efficiency architecture.
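A minimal PyTorch sketch of a top-k routed expert layer (the dimensions, k=2, and the softmax-over-selected-logits weighting are illustrative assumptions, not any specific model's implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)     # trainable router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                            # x: (num_tokens, d_model)
        gate_logits = self.router(x)                 # (num_tokens, num_experts)
        weights, idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # renormalize over the k chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e              # tokens sending this slot to expert e
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

The per-expert gather loop is written for clarity; production kernels batch the dispatch instead.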
Shared Experts
Definition
Keeps always-activated experts alongside the routed sparse experts (DeepSeek-V2/V3 design).
Purpose
Provides stable shared representations that prevent over-specialization in the routed expert layers.
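Building on the TopKMoE sketch above, a shared-expert layer adds an always-on FFN to the routed output (the one-shared/k-routed split here is an assumption; actual counts vary by model):

import torch.nn as nn

class SharedExpertMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.routed = TopKMoE(d_model, d_ff, num_experts, k)   # sketched above

    def forward(self, x):
        return self.shared(x) + self.routed(x)   # every token passes the shared expert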
Expert Parallelism
Definition
Places each expert on a different device; tokens are dispatched across the cluster to their assigned expert.
Purpose
The parallelism strategy that makes 1000-expert MoE training feasible at scale.
Router Network
Definition
A linear projection and softmax, followed by top-k selection, determine which expert processes each token.
Purpose
Router quality is critical. Collapsed routing (all tokens to one expert) is a training failure mode.
Load Balancing Loss
Definition
Auxiliary loss penalizing variance in expert utilization across the batch during training.
Purpose
Without it, gradient descent collapses routing to 1–2 experts. The coefficient (typically around 0.01) requires careful tuning.
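A sketch of one common formulation (Switch-Transformer style); the top-1 usage fraction and the 0.01 coefficient are assumptions:

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, coeff=0.01):
    # router_logits: (num_tokens, num_experts)
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    top1 = probs.argmax(dim=-1)
    f = torch.bincount(top1, minlength=num_experts).float() / top1.numel()  # usage fraction
    P = probs.mean(dim=0)                                                   # mean router probability
    return coeff * num_experts * (f * P).sum()   # minimized when both are uniform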
Sliding Window Attention
Definition
Each token attends only to a window of its W nearest neighbors. Mistral 7B uses W=4096 with a rolling-buffer KV cache.
Purpose
Reduces attention from O(n^2) to O(n \cdot W). Local context is usually sufficient for most tokens.
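A sketch of the boolean mask this implies (W=4 purely for display; True = may attend):

import torch

def sliding_window_mask(n, w):
    i = torch.arange(n).unsqueeze(1)    # query positions
    j = torch.arange(n).unsqueeze(0)    # key positions
    return (j <= i) & (j > i - w)       # causal, at most w visible keys per query

mask = sliding_window_mask(8, 4)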
Global + Local Attention (Longformer-style)
Definition
Combines local sliding window with global tokens (CLS, special markers) that attend to all positions.
Purpose
Designed for document-level NLP. O(n) complexity for most tokens with full global coverage for key ones.
SSM (State Space Model)
Definition
Models sequences as linear dynamical systems: h_t = Ah_{t-1} + Bx_t, y_t = Ch_t.
Purpose
Inference is a fixed-size hidden state regardless of sequence length. Training parallelizable via convolutions.
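The recurrence for a single scalar input channel, as a minimal sketch (state size and parameter values below are illustrative):

import torch

def ssm_scan(A, B, C, x):
    # A: (d_state, d_state), B: (d_state,), C: (d_state,), x: (seq_len,)
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:                # O(1) state per step, regardless of sequence length
        h = A @ h + B * x_t      # h_t = A h_{t-1} + B x_t
        ys.append(C @ h)         # y_t = C h_t
    return torch.stack(ys)

y = ssm_scan(0.9 * torch.eye(4), torch.ones(4), torch.ones(4), torch.randn(16))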
Selective SSM (Mamba)
Definition
Adds input-dependent (selective) state transitions to SSMs: A, B, C depend on the current input x_t.
Purpose
Breaks the fixed-transition limitation that prevented SSMs from matching transformers on language tasks.
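A simplified Mamba-style step, assuming a diagonal A with per-channel state and input-dependent B, C, and step size (the shapes and discretization are a loose sketch of the published mechanism):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSMStep(nn.Module):
    def __init__(self, d_model=64, d_state=16):
        super().__init__()
        self.logA = nn.Parameter(torch.zeros(d_model, d_state))  # fixed per-channel decay rates
        self.to_B = nn.Linear(d_model, d_state)                  # B_t depends on x_t
        self.to_C = nn.Linear(d_model, d_state)                  # C_t depends on x_t
        self.to_dt = nn.Linear(d_model, d_model)                 # step size depends on x_t

    def forward(self, h, x_t):
        # h: (d_model, d_state) per-channel state; x_t: (d_model,)
        dt = F.softplus(self.to_dt(x_t)).unsqueeze(-1)           # positive step sizes
        A_bar = torch.exp(-torch.exp(self.logA) * dt)            # input-dependent decay
        h = A_bar * h + dt * self.to_B(x_t) * x_t.unsqueeze(-1)  # selective state update
        y = h @ self.to_C(x_t)                                   # read-out: (d_model,)
        return h, y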
RWKV
Definition
Reformulates attention as a linear RNN: the WKV attention is computed in parallel during training and recurrently during inference.
Purpose
Linear memory at inference; no KV cache growth. Competitive with transformers up to 14B parameters.
Hyena
Definition
Replaces attention with a long implicit convolution defined by a small neural network over positions.
Purpose
Sub-quadratic; reaches GPT-3 quality with 20% fewer FLOPs at sequence length 2048. Emerging alternative.
RetNet (Retentive Network)
Definition
Implements retention in parallel (training), recurrent (inference), and chunkwise (long-context) modes.
Purpose
Claims to unify the best properties of Transformers, RNNs, and SSMs in one elegant formulation.
MLA (Multi-head Latent Attention)
Definition
Compresses KV cache via a low-rank bottleneck projection before caching keys and values.
Purpose
DeepSeek-V2 achieves 93% KV cache reduction with negligible quality loss. The key to its efficiency.
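A sketch of just the low-rank bottleneck idea (MLA's RoPE handling and head structure are omitted, and the d_latent=64 vs. d_model=512 ratio is assumed):

import torch.nn as nn

class LowRankKV(nn.Module):
    def __init__(self, d_model=512, d_latent=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)   # compress before caching
        self.up_k = nn.Linear(d_latent, d_model)   # expand keys at attention time
        self.up_v = nn.Linear(d_latent, d_model)   # expand values at attention time

    def forward(self, x):
        c = self.down(x)             # only this (seq, d_latent) latent is cached
        return self.up_k(c), self.up_v(c)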
YaRN
Definition
Interpolates RoPE non-uniformly ("NTK-by-parts"): high-frequency dims are left nearly unscaled, low-frequency dims are linearly interpolated, plus a temperature correction on attention logits.
Purpose
Extends a 4K-trained model to 128K tokens with only ~1B tokens of continued pre-training. The most widely used method.
LongRoPE
Definition
Assigns different interpolation factors to RoPE dimensions based on empirical token distance distributions.
Purpose
Outperforms uniform interpolation on tasks requiring very long-range retrieval beyond 128K tokens.
NTK-Aware Scaling
Definition
Increases the RoPE base from b to b \cdot s^{d/(d-2)}, where s is the context-extension ratio and d the head dimension, to preserve high-frequency signal.
Purpose
The theoretical foundation for most RoPE extensions; prevents loss of short-range positional information.
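A worked instance of the base adjustment, assuming b=10000, head dimension d=128, and extension ratio s=4:

b, d, s = 10_000, 128, 4
b_scaled = b * s ** (d / (d - 2))
print(round(b_scaled))    # ~40890: the highest-frequency dim stays untouched, low frequencies stretch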
PI (Position Interpolation)
Definition
Maps position indices [0, L_{new}] linearly to [0, L_{train}] for context extension.
Purpose
Simple and effective with brief fine-tuning. The baseline against which all RoPE extension methods are measured.
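The whole method in two lines, assuming a model trained at 4096 being run at 16384:

import torch

L_train, L_new = 4096, 16384
pos = torch.arange(L_new, dtype=torch.float32) * (L_train / L_new)  # squeeze into trained range
# `pos` then feeds the usual RoPE angle computation theta_i * pos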
Weight Tying
Definition
Shares the token embedding matrix E \in \mathbb{R}^{V \times d} with the output projection matrix.
Purpose
Reduces parameters by V×d (e.g., 400M for 100K vocab, d=4096). Improves embedding consistency.
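A two-line sketch using the figures above:

import torch.nn as nn

vocab, d = 100_000, 4096
embed = nn.Embedding(vocab, d)
lm_head = nn.Linear(d, vocab, bias=False)
lm_head.weight = embed.weight    # same tensor: V*d parameters saved, gradients shared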
SwiGLU
Definition
FFN activation: \mathrm{FFN}(x) = (\mathrm{Swish}(xW_1) \odot xV)\,W_2. Input-dependent gating.
Purpose
Consistently outperforms GELU FFNs by 0.5–1 PPL point. Standard in LLaMA, PaLM, and Mistral.
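A minimal module matching the formula (the ~8/3 expansion, which keeps parameter count comparable to a 4x GELU FFN, is an assumed convention):

import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model=512, d_ff=1376):          # ~(8/3) * d_model, rounded
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate branch
        self.v = nn.Linear(d_model, d_ff, bias=False)    # value branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.v(x))   # silu == swish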
GeGLU
Definition
Variant of the gated linear unit using GELU as the gate activation function in the FFN.
Purpose
Marginal differences from SwiGLU in practice. Both vastly outperform plain GELU in transformer FFNs.
RMSNorm
Definition
Normalizes by RMS(x) = \sqrt{\mathrm{mean}(x^2)} without computing or subtracting the mean.
Purpose
About 7% faster than full LayerNorm because it skips the mean computation. Universal in modern LLMs.
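A faithful minimal implementation of the formula above (eps is the usual small constant):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x / rms    # no mean subtraction, no bias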
Pre-LN (Pre-Layer Normalization)
Definition
Applies normalization to the residual stream before feeding into attention and FFN sublayers.
Purpose
More stable training than Post-LN; gradients are better conditioned at initialization. Now universal.
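The whole pattern in one function (attn, ffn, norm1, norm2 stand in for any attention, FFN, and norm modules):

def pre_ln_block(x, attn, ffn, norm1, norm2):
    x = x + attn(norm1(x))    # normalize before the sublayer, add after
    x = x + ffn(norm2(x))     # the residual stream itself is never normalized
    return x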
MoD (Mixture of Depths)
Definition
Routes each token to a subset of transformer layers, skipping the rest via a learned token router.
Purpose
Tokens use only ~50% of layers on average; compute per token is roughly halved with minimal PPL increase.
Cross-Attention
Definition
Attention where queries come from the decoder and keys/values from the encoder output representations.
Purpose
Essential in encoder-decoder architectures like T5 and Whisper for sequence-to-sequence generation tasks.
Encoder-Decoder
Definition
Separate encoder processes input; separate decoder generates output conditioned on encoder representations.
Purpose
Natural fit for translation, summarization, and ASR. T5, BART, and Whisper all use this design.
Decoder-Only
Definition
A single autoregressive transformer stack predicts the next token from all previous tokens.
Purpose
GPT, LLaMA, Claude, Gemini: every frontier model is decoder-only. Scales more efficiently than enc-dec.
Hidden Dimension (d_model)
Definition
The width d of all hidden state vectors throughout the transformer (d_{model}).
Purpose
Scales with model size: 768 for BERT-Base, 8192 for LLaMA-3 70B. Wider = richer representations.
Attention Heads
Definition
Number of parallel attention sub-spaces H, each with dimension d/H operating independently.
Purpose
More heads capture diverse relation types. H=64 at d=8192 gives 128-dim heads, standard for large models.
FFN Expansion Ratio
Definition
Ratio of FFN intermediate dimension to embedding dimension; typically 4x for standard FFNs, ~2.7x (8/3) for GLU variants.
Purpose
Controls what fraction of parameters live in FFN vs. attention layers; at a 4x ratio, about 66% of parameters sit in the FFNs, as the worked count below shows.
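The arithmetic behind the 66% figure (biases ignored; Q, K, V, and output projections are d x d each):

d = 4096
attn_params = 4 * d**2           # Wq, Wk, Wv, Wo
ffn_params = 2 * (4 * d) * d     # up- and down-projection at ratio 4
print(ffn_params / (attn_params + ffn_params))   # 0.666...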
Context Length
Definition
Maximum number of tokens the model processes in a single forward pass.
Purpose
Every doubling quadruples attention memory and compute at training time. The key architectural constraint.
Depth Upscaling
Definition
Initializes a deeper model by duplicating layers from a trained shallower model checkpoint.
Purpose
Solar 10.7B was built by depth-upscaling a 7B checkpoint (Mistral 7B weights on the LLaMA-2 architecture), then fine-tuning. Cuts pre-training cost.
RoPE Variants
Definition
The family of RoPE-derived position encodings: standard, NTK-aware, YaRN, LongRoPE, LlamaRoPE.
Purpose
Choosing the right variant for context extension is one of the highest-ROI architecture decisions in production.
Depth vs. Width
Definition
At fixed parameter count, deep networks favor compositional reasoning; wide ones favor associative recall.
Purpose
Chinchilla-optimal scaling slightly favors depth. Most frontier models use 80+ layers at production scale.