MoE (Mixture of Experts)
Definition
Routes each token to a learned subset of expert FFN layers via a trainable router network.
Purpose
Decouples parameter count from compute. A 141B MoE model uses only 39B active parameters per token.
Top-k Routing
Definition
Activates only top-k (typically k=2) of E expert FFNs per token; the rest are skipped.
Purpose
Mixtral 8x7B matches LLaMA-2 70B quality with roughly 6x less compute per token. The modern efficiency architecture.
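A minimal PyTorch sketch of a top-k routed expert layer (the dimensions, k=2, and the softmax-over-selected-logits weighting are illustrative assumptions, not any specific model's implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)     # trainable router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                            # x: (num_tokens, d_model)
        gate_logits = self.router(x)                 # (num_tokens, num_experts)
        weights, idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # renormalize over the k chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e              # tokens sending this slot to expert e
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

The per-expert gather loop is written for clarity; production kernels batch the dispatch instead.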
Shared Experts
Definition
Keeps always-activated experts alongside the routed sparse experts (DeepSeek-V2/V3 design).
Purpose
Provides stable shared representations that prevent over-specialization in the routed expert layers.
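Building on the TopKMoE sketch above, a shared-expert layer adds an always-on FFN to the routed output (the one-shared/k-routed split here is an assumption; actual counts vary by model):

import torch.nn as nn

class SharedExpertMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.routed = TopKMoE(d_model, d_ff, num_experts, k)   # sketched above

    def forward(self, x):
        return self.shared(x) + self.routed(x)   # every token passes the shared expert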
Expert Parallelism
Definition
Places each expert on a different device; tokens are dispatched across the cluster to their assigned expert.
Purpose
The parallelism strategy that makes 1000-expert MoE training feasible at scale.
Router Network
Definition
A linear projection and softmax, followed by top-k selection, determine which expert processes each token.
Purpose
Router quality is critical. Collapsed routing (all tokens to one expert) is a training failure mode.
Load Balancing Loss
Definition
Auxiliary loss penalizing variance in expert utilization across the batch during training.
Purpose
Without it, gradient descent collapses routing to 1–2 experts. The coefficient (typically around 0.01) requires careful tuning.
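A sketch of one common formulation (Switch-Transformer style); the top-1 usage fraction and the 0.01 coefficient are assumptions:

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, coeff=0.01):
    # router_logits: (num_tokens, num_experts)
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    top1 = probs.argmax(dim=-1)
    f = torch.bincount(top1, minlength=num_experts).float() / top1.numel()  # usage fraction
    P = probs.mean(dim=0)                                                   # mean router probability
    return coeff * num_experts * (f * P).sum()   # minimized when both are uniform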
Sliding Window Attention
Definition
Each token attends only to a window of its W nearest neighbors. Mistral 7B uses W=4096 with a rolling-buffer KV cache.
Purpose
Reduces attention from O(n^2) to O(n \cdot W). Local context is usually sufficient for most tokens.
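A sketch of the boolean mask this implies (W=4 purely for display; True = may attend):

import torch

def sliding_window_mask(n, w):
    i = torch.arange(n).unsqueeze(1)    # query positions
    j = torch.arange(n).unsqueeze(0)    # key positions
    return (j <= i) & (j > i - w)       # causal, at most w visible keys per query

mask = sliding_window_mask(8, 4)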
Global + Local Attention (Longformer-style)
Definition
Combines local sliding window with global tokens (CLS, special markers) that attend to all positions.
Purpose
Designed for document-level NLP. O(n) complexity for most tokens with full global coverage for key ones.
SSM (State Space Model)
Definition
Models sequences as linear dynamical systems: h_t = Ah_{t-1} + Bx_t, y_t = Ch_t.
Purpose
Inference is a fixed-size hidden state regardless of sequence length. Training parallelizable via convolutions.
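The recurrence for a single scalar input channel, as a minimal sketch (state size and parameter values below are illustrative):

import torch

def ssm_scan(A, B, C, x):
    # A: (d_state, d_state), B: (d_state,), C: (d_state,), x: (seq_len,)
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:                # O(1) state per step, regardless of sequence length
        h = A @ h + B * x_t      # h_t = A h_{t-1} + B x_t
        ys.append(C @ h)         # y_t = C h_t
    return torch.stack(ys)

y = ssm_scan(0.9 * torch.eye(4), torch.ones(4), torch.ones(4), torch.randn(16))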
Selective SSM (Mamba)
Definition
Adds input-dependent (selective) state transitions to SSMs: A, B, C depend on the current input x_t.
Purpose
Breaks the fixed-transition limitation that prevented SSMs from matching transformers on language tasks.
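A simplified Mamba-style step, assuming a diagonal A with per-channel state and input-dependent B, C, and step size (the shapes and discretization are a loose sketch of the published mechanism):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSMStep(nn.Module):
    def __init__(self, d_model=64, d_state=16):
        super().__init__()
        self.logA = nn.Parameter(torch.zeros(d_model, d_state))  # fixed per-channel decay rates
        self.to_B = nn.Linear(d_model, d_state)                  # B_t depends on x_t
        self.to_C = nn.Linear(d_model, d_state)                  # C_t depends on x_t
        self.to_dt = nn.Linear(d_model, d_model)                 # step size depends on x_t

    def forward(self, h, x_t):
        # h: (d_model, d_state) per-channel state; x_t: (d_model,)
        dt = F.softplus(self.to_dt(x_t)).unsqueeze(-1)           # positive step sizes
        A_bar = torch.exp(-torch.exp(self.logA) * dt)            # input-dependent decay
        h = A_bar * h + dt * self.to_B(x_t) * x_t.unsqueeze(-1)  # selective state update
        y = h @ self.to_C(x_t)                                   # read-out: (d_model,)
        return h, y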
RWKV
Definition
Reformulates attention as a linear RNN: the WKV attention is computed in parallel during training and recurrently during inference.
Purpose
Linear memory at inference; no KV cache growth. Competitive with transformers up to 14B parameters.
Hyena
Definition
Replaces attention with a long implicit convolution defined by a small neural network over positions.
Purpose
Sub-quadratic; reaches GPT-3 quality with 20% fewer FLOPs at sequence length 2048. Emerging alternative.
RetNet (Retentive Network)
Definition
Implements retention in parallel (training), recurrent (inference), and chunkwise (long-context) modes.
Purpose
Claims to unify the best properties of Transformers, RNNs, and SSMs in one elegant formulation.
MLA (Multi-head Latent Attention)
Definition
Compresses KV cache via a low-rank bottleneck projection before caching keys and values.
Purpose
DeepSeek-V2 achieves 93% KV cache reduction with negligible quality loss. The key to its efficiency.
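A sketch of just the low-rank bottleneck idea (MLA's RoPE handling and head structure are omitted, and the d_latent=64 vs. d_model=512 ratio is assumed):

import torch.nn as nn

class LowRankKV(nn.Module):
    def __init__(self, d_model=512, d_latent=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)   # compress before caching
        self.up_k = nn.Linear(d_latent, d_model)   # expand keys at attention time
        self.up_v = nn.Linear(d_latent, d_model)   # expand values at attention time

    def forward(self, x):
        c = self.down(x)             # only this (seq, d_latent) latent is cached
        return self.up_k(c), self.up_v(c)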
YaRN
Definition
Interpolates RoPE non-uniformly ("NTK-by-parts"): high-frequency dims are left nearly unscaled, low-frequency dims are linearly interpolated, plus a temperature correction on attention logits.
Purpose
Extends a 4K-trained model to 128K tokens with only ~1B tokens of continued pre-training. The most widely used method.
LongRoPE
Definition
Assigns different interpolation factors to RoPE dimensions based on empirical token distance distributions.
Purpose
Outperforms uniform interpolation on tasks requiring very long-range retrieval beyond 128K tokens.
NTK-Aware Scaling
Definition
Increases the RoPE base from b to b \cdot s^{d/(d-2)}, where s is the context-extension ratio and d the head dimension, to preserve high-frequency signal.
Purpose
The theoretical foundation for most RoPE extensions; prevents loss of short-range positional information.
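A worked instance of the base adjustment, assuming b=10000, head dimension d=128, and extension ratio s=4:

b, d, s = 10_000, 128, 4
b_scaled = b * s ** (d / (d - 2))
print(round(b_scaled))    # ~40890: the highest-frequency dim stays untouched, low frequencies stretch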
PI (Position Interpolation)
Definition
Maps position indices [0, L_{new}] linearly to [0, L_{train}] for context extension.
Purpose
Simple and effective with brief fine-tuning. The baseline against which all RoPE extension methods are measured.
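The whole method in two lines, assuming a model trained at 4096 being run at 16384:

import torch

L_train, L_new = 4096, 16384
pos = torch.arange(L_new, dtype=torch.float32) * (L_train / L_new)  # squeeze into trained range
# `pos` then feeds the usual RoPE angle computation theta_i * pos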
Weight Tying
Definition
Shares the token embedding matrix E \in \mathbb{R}^{V \times d} with the output projection matrix.
Purpose
Reduces parameters by V×d (e.g., 400M for 100K vocab, d=4096). Improves embedding consistency.
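A two-line sketch using the figures above:

import torch.nn as nn

vocab, d = 100_000, 4096
embed = nn.Embedding(vocab, d)
lm_head = nn.Linear(d, vocab, bias=False)
lm_head.weight = embed.weight    # same tensor: V*d parameters saved, gradients shared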
SwiGLU
Definition
FFN activation: \mathrm{FFN}(x) = (\mathrm{Swish}(xW_1) \odot xV)\,W_2. Input-dependent gating.
Purpose
Consistently outperforms GELU FFNs by 0.5–1 PPL point. Standard in LLaMA, PaLM, and Mistral.
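A minimal module matching the formula (the ~8/3 expansion, which keeps parameter count comparable to a 4x GELU FFN, is an assumed convention):

import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model=512, d_ff=1376):          # ~(8/3) * d_model, rounded
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate branch
        self.v = nn.Linear(d_model, d_ff, bias=False)    # value branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.v(x))   # silu == swish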
GeGLU
Definition
Variant of the gated linear unit using GELU as the gate activation function in the FFN.
Purpose
Marginal differences from SwiGLU in practice. Both vastly outperform plain GELU in transformer FFNs.
RMSNorm
Definition
Normalizes by RMS(x) = \sqrt{\mathrm{mean}(x^2)} without computing or subtracting the mean.
Purpose
About 7% faster than full LayerNorm because it skips the mean computation. Universal in modern LLMs.
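A faithful minimal implementation of the formula above (eps is the usual small constant):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x / rms    # no mean subtraction, no bias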
Pre-LN (Pre-Layer Normalization)
Definition
Applies normalization to the residual stream before feeding into attention and FFN sublayers.
Purpose
More stable training than Post-LN; gradients are better conditioned at initialization. Now universal.
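The whole pattern in one function (attn, ffn, norm1, norm2 stand in for any attention, FFN, and norm modules):

def pre_ln_block(x, attn, ffn, norm1, norm2):
    x = x + attn(norm1(x))    # normalize before the sublayer, add after
    x = x + ffn(norm2(x))     # the residual stream itself is never normalized
    return x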
MoD (Mixture of Depths)
Definition
Routes each token to a subset of transformer layers, skipping the rest via a learned token router.
Purpose
Tokens use only ~50% of layers on average; compute per token is roughly halved with minimal PPL increase.
Cross-Attention
Definition
Attention where queries come from the decoder and keys/values from the encoder output representations.
Purpose
Essential in encoder-decoder architectures like T5 and Whisper for sequence-to-sequence generation tasks.
Encoder-Decoder
Definition
Separate encoder processes input; separate decoder generates output conditioned on encoder representations.
Purpose
Natural fit for translation, summarization, and ASR. T5, BART, and Whisper all use this design.
Decoder-Only
Definition
A single autoregressive transformer stack predicts the next token from all previous tokens.
Purpose
GPT, LLaMA, Claude, Gemini: every frontier model is decoder-only. Scales more efficiently than enc-dec.
Hidden Dimension (d_model)
Definition
The width d of all hidden state vectors throughout the transformer (d_{model}).
Purpose
Scales with model size: 768 for BERT-Base, 8192 for LLaMA-3 70B. Wider = richer representations.
Attention Heads
Definition
Number of parallel attention sub-spaces H, each with dimension d/H operating independently.
Purpose
More heads capture diverse relation types. H=64 at d=8192 gives 128-dim heads, standard for large models.
FFN Expansion Ratio
Definition
Ratio of FFN intermediate dimension to embedding dimension; typically 4x for standard FFNs, ~2.7x (8/3) for GLU variants.
Purpose
Controls what fraction of parameters live in FFN vs. attention layers; at a 4x ratio, about 66% of parameters sit in the FFNs, as the worked count below shows.
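The arithmetic behind the 66% figure (biases ignored; Q, K, V, and output projections are d x d each):

d = 4096
attn_params = 4 * d**2           # Wq, Wk, Wv, Wo
ffn_params = 2 * (4 * d) * d     # up- and down-projection at ratio 4
print(ffn_params / (attn_params + ffn_params))   # 0.666...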
Context Length
Definition
Maximum number of tokens the model processes in a single forward pass.
Purpose
Every doubling quadruples attention memory and compute at training time. The key architectural constraint.
Depth Upscaling
Definition
Initializes a deeper model by duplicating layers from a trained shallower model checkpoint.
Purpose
Solar 10.7B was built by depth-upscaling a 7B checkpoint (Mistral 7B weights on the LLaMA-2 architecture), then fine-tuning. Cuts pre-training cost.
RoPE Variants
Definition
The family of RoPE-derived position encodings: standard, NTK-aware, YaRN, LongRoPE, LlamaRoPE.
Purpose
Choosing the right variant for context extension is one of the highest-ROI architecture decisions in production.
Depth vs. Width
Definition
At fixed parameter count, deep networks favor compositional reasoning; wide ones favor associative recall.
Purpose
Chinchilla-optimal scaling slightly favors depth. Most frontier models use 80+ layers at production scale.