Quantization terms and explanations from the LLM Optimization Dictionary.
INT8 Quantization
Definition
Maps FP16 weights to 8-bit integers with per-channel or per-tensor scale factors applied.
Purpose
Halves model memory vs FP16 with under 1% accuracy loss when properly calibrated. The first step to try.
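A minimal sketch of the symmetric absmax scheme this definition describes; function names are illustrative, and a real kernel would keep the INT8 matmul on-device.
```python
import numpy as np

def quantize_int8(w: np.ndarray, per_channel: bool = True):
    """Symmetric INT8 quantization with per-channel or per-tensor scales."""
    # absmax per output channel (rows) or over the whole tensor
    absmax = np.abs(w).max(axis=1, keepdims=True) if per_channel else np.abs(w).max()
    scale = np.maximum(absmax, 1e-8) / 127.0   # guard against all-zero channels
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize_int8(q, scale)).max())  # small reconstruction error
```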
INT4 Quantization
Definition
Packs two 4-bit values per byte using group quantization, typically with groups of 128 weights.
Purpose
Fits a 70B model in 35GB vs. 140GB at FP16. Accuracy sensitive to group size and calibration quality.
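A toy sketch of the packing and group scaling described above (assumes the weight count is a multiple of the group size; layout conventions vary between libraries):
```python
import numpy as np

def quantize_int4_grouped(w: np.ndarray, group_size: int = 128):
    """Toy 4-bit group quantization: one scale per group, two values per byte."""
    w = w.reshape(-1, group_size)                     # one row per group
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 7.0
    q = (np.clip(np.round(w / scale), -8, 7) + 8).astype(np.uint8)  # shift to 0..15
    q = q.reshape(-1)
    packed = q[0::2] | (q[1::2] << 4)                 # two nibbles per byte
    return packed, scale

def unpack_int4(packed, scale, group_size=128):
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    q = np.empty(lo.size + hi.size, dtype=np.int8)
    q[0::2], q[1::2] = lo, hi
    return (q.reshape(-1, group_size) * scale).reshape(-1)
```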
NF4 (NormalFloat)
Definition
4-bit data type with bin boundaries spaced to minimize information loss for Gaussian-distributed weights.
Purpose
Achieves lower quantization error than uniform INT4 for normally distributed weights. The storage format inside QLoRA.
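A rough sketch of the quantile idea, not the exact NF4 codebook from the QLoRA paper (the real table is constructed slightly differently and guarantees an exact zero level):
```python
from statistics import NormalDist

# Place 16 levels at evenly spaced quantiles of a standard normal,
# then normalize them into [-1, 1].
nd = NormalDist()
probs = [(i + 0.5) / 16 for i in range(16)]
levels = [nd.inv_cdf(p) for p in probs]
m = max(abs(l) for l in levels)
levels = [l / m for l in levels]

def nf4_encode(x):
    """Map a value in [-1, 1] to the index of the nearest level."""
    return min(range(16), key=lambda i: abs(levels[i] - x))
```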
FP8
Definition
Hardware-native 8-bit float in E4M3 (inference) or E5M2 (training) formats on H100/H200 GPUs.
Purpose
Doubles effective throughput vs BF16 with nearly identical convergence. The future default training dtype.
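A minimal sketch of the two formats, assuming a recent PyTorch build that exposes the float8 dtypes (casting is supported; general float8 arithmetic is not):
```python
import torch

x = torch.randn(4, 4, dtype=torch.bfloat16)
x_e4m3 = x.to(torch.float8_e4m3fn)    # more mantissa bits: favored for forward/inference
x_e5m2 = x.to(torch.float8_e5m2)      # more exponent bits: favored for gradients
print(x_e4m3.to(torch.bfloat16) - x)  # rounding error introduced by the cast
```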
GPTQ
Definition
Layer-wise optimal brain compression: updates remaining weights to compensate for each quantized weight.
Purpose
State-of-the-art accuracy at 3–4 bits per weight. Widely supported by vLLM, TensorRT-LLM, and Hugging Face Transformers.
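A toy sketch of the compensation idea only; real GPTQ adds Cholesky factorization, blocking, and per-group scales:
```python
import numpy as np

def gptq_like(W: np.ndarray, X: np.ndarray, bits: int = 4):
    """W: (out_features, in_features); X: (in_features, n_calibration_samples)."""
    W = W.astype(np.float64).copy()
    n = W.shape[1]
    H = 2 * X @ X.T                               # layer-wise Hessian proxy
    H += 0.01 * np.mean(np.diag(H)) * np.eye(n)   # damping, as is common in practice
    Hinv = np.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax                # single scale to keep the sketch tiny
    Q = np.zeros_like(W)
    for i in range(n):                            # quantize columns left to right
        Q[:, i] = np.clip(np.round(W[:, i] / scale), -qmax - 1, qmax) * scale
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]    # normalized rounding error
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])  # compensate later columns
    return Q
```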
AWQ
Definition
Identifies the ~1% of salient weight channels by activation magnitude and protects them by scaling them up before quantization.
Purpose
Outperforms GPTQ at the same bit budget and quantizes 3x faster. Hardware-efficient; runs on edge GPUs.
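A sketch of the selection criterion only; real AWQ then searches for per-channel scales rather than storing the salient channels at mixed precision. Names are illustrative.
```python
import numpy as np

def find_salient_channels(acts: np.ndarray, frac: float = 0.01):
    """Rank input channels by mean |activation| and return the top ~1%.
    acts: (num_tokens, in_features) calibration activations."""
    importance = np.abs(acts).mean(axis=0)   # per-channel activation magnitude
    k = max(1, int(frac * importance.size))
    return np.argsort(importance)[-k:]       # indices of salient channels
```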
GGUF
Definition
llama.cpp's self-describing binary format encoding quantized weights, tokenizer, and metadata in one file.
Purpose
Supports Q2_K through Q8_0 variants. Runs on CPU, Apple Silicon, and hybrid CPU/GPU setups.
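A minimal header peek, with the field layout assumed from the public GGUF spec (4-byte magic, then little-endian uint32 version, uint64 tensor count, uint64 metadata key/value count):
```python
import struct

def peek_gguf(path: str):
    with open(path, "rb") as f:
        magic = f.read(4)
        assert magic == b"GGUF", f"not a GGUF file: {magic!r}"
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        return version, n_tensors, n_kv
```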
bitsandbytes
Definition
Python library wrapping CUDA kernels for 8-bit matmul and 4-bit storage with BF16 computation.
Purpose
The de facto standard for QLoRA via Hugging Face's load_in_4bit=True. Zero-code quantized fine-tuning.
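A typical QLoRA-style load through the transformers integration (requires a CUDA GPU with bitsandbytes installed; the model name is illustrative):
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 storage format
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for matmuls
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)
```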
SmoothQuant
Definition
Transfers quantization difficulty from outlier-heavy activations to weights by dividing activations by a per-channel scale s and multiplying the matching weight columns by s.
Purpose
Enables hardware-efficient INT8×INT8 GEMM for both weights and activations simultaneously.
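A sketch of the scale migration: the product is mathematically unchanged while activation outliers shrink. The alpha=0.5 balance follows the SmoothQuant paper's default.
```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """X: (tokens, in_features); W: (out_features, in_features)."""
    s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=0) ** (1 - alpha))
    s = np.maximum(s, 1e-5)
    return X / s, W * s   # (X / s) @ (W * s).T == X @ W.T

X = np.random.randn(16, 8) * np.array([1, 1, 1, 50, 1, 1, 1, 1])  # channel 3 is an outlier
W = np.random.randn(4, 8)
Xs, Ws = smooth(X, W)
assert np.allclose(Xs @ Ws.T, X @ W.T)   # output preserved, outlier migrated
```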
QuIP#
Definition
Incoherence processing via random orthogonal (Hadamard) transforms plus vector quantization with lattice codebooks.
Purpose
Near-FP16 quality at 2 bits per weight. The frontier of ultra-low-bit quantization accuracy.
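The full QuIP# pipeline is more involved; this only demonstrates why incoherence processing helps: a random-signed Hadamard rotation spreads a single outlier across all coordinates, making the vector far friendlier to low-bit quantization.
```python
import numpy as np
from scipy.linalg import hadamard

def peak_to_rms(v):
    return np.abs(v).max() / np.sqrt((v ** 2).mean())

n = 256
w = np.random.randn(n) * 0.01
w[7] = 5.0                                   # one large outlier
H = hadamard(n) / np.sqrt(n)                 # orthogonal: H @ H.T == I
S = np.diag(np.random.choice([-1.0, 1.0], n))
w_rot = H @ S @ w                            # exactly invertible rotation
print(peak_to_rms(w), peak_to_rms(w_rot))    # rotation slashes the peak
```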
AQLM
Definition
Additive quantization: decomposes each weight vector as a sum of M learned codebook entries.
Purpose
High accuracy at extreme compression rates; best-in-class at 2 bits where scalar quantization fails.
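A toy decode under assumed shapes; real AQLM learns the codebooks and indices end-to-end, but the storage story is visible here: only M small indices per weight group.
```python
import numpy as np

M, K, g = 2, 256, 8                         # codebooks, entries each, group length
codebooks = np.random.randn(M, K, g)        # learned offline in real AQLM
indices = np.random.randint(0, K, size=M)   # stored per weight group

w_hat = codebooks[np.arange(M), indices].sum(axis=0)  # (g,) reconstructed group
print(w_hat.shape)
```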
EXL2
Definition
Per-layer bit allocation: assigns more bits to sensitive layers and fewer to robust ones at target BPW.
Purpose
Maximizes quality at any given memory budget on NVIDIA hardware. The format powering ExLlamaV2.
Post-Training Quantization (PTQ)
Definition
Quantizes a trained model in minutes using a small unlabeled calibration set of 128–512 samples.
Purpose
The first thing to try before committing to QAT. Zero training cost; production standard approach.
Quantization-Aware Training (QAT)
Definition
Inserts fake quantization operators in the forward pass so gradients account for rounding error.
Purpose
Recovers 0.5–2 PPL points lost by PTQ at the cost of a full fine-tuning run. Worth it below 4 bits.
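A minimal fake-quantization operator with a straight-through estimator; production QAT schemes often also clip gradients outside the representable range.
```python
import torch

class FakeQuant(torch.autograd.Function):
    """Forward pass rounds to INT8 levels; backward pretends rounding was identity."""
    @staticmethod
    def forward(ctx, x, scale):
        return torch.clamp(torch.round(x / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None   # straight-through: gradient skips the rounding

x = torch.randn(8, requires_grad=True)
y = FakeQuant.apply(x, x.abs().max().detach() / 127)
y.sum().backward()              # gradients flow as if no quantization happened
```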
Weight-Only Quantization
Definition
Quantizes only the weight matrices to INT4/INT8 while dequantizing to BF16 before each matmul.
Purpose
The most practical approach for memory-bound LLM decode. No activation calibration overhead at all.
Mixed-Precision Quantization
Definition
Assigns different bit widths layer-by-layer based on per-layer sensitivity analysis scores.
Purpose
A Pareto improvement over uniform quantization. Protects outlier-heavy layers that uniform quant damages.
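A hypothetical greedy allocator showing the budgeted-bits idea; the sensitivity/bits heuristic is purely illustrative, not any particular tool's actual measure.
```python
def allocate_bits(sensitivity: list[float], target_bpw: float, lo: int = 2, hi: int = 8):
    """Start every layer at lo bits, then greedily spend the remaining budget
    on the layer with the best sensitivity-per-bit ratio."""
    n = len(sensitivity)
    bits = [lo] * n
    budget = int(target_bpw * n) - lo * n   # extra bits available
    for _ in range(budget):
        i = max((j for j in range(n) if bits[j] < hi),
                key=lambda j: sensitivity[j] / bits[j])
        bits[i] += 1
    return bits

print(allocate_bits([5.0, 1.0, 0.5, 3.0], target_bpw=4))  # sums to 16 bits over 4 layers
```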
KV Cache Quantization
Definition
Stores KV cache tensors in INT8 or INT4 instead of BF16 during long-context generation.
Purpose
Doubles or quadruples effective context length for a fixed memory budget. Negligible quality impact at INT8.
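A sketch assuming a (batch, heads, seq_len, head_dim) cache layout; production kernels fuse the dequantization into the attention computation.
```python
import torch

def quantize_kv(cache: torch.Tensor):
    """Per-channel INT8 quantization of a KV cache slab."""
    c = cache.float()
    scale = c.abs().amax(dim=2, keepdim=True).clamp(min=1e-6) / 127
    q = torch.clamp(torch.round(c / scale), -127, 127).to(torch.int8)
    return q, scale

kv = torch.randn(1, 8, 1024, 64, dtype=torch.bfloat16)
q, scale = quantize_kv(kv)
# INT8 payload is half the BF16 payload; the scales add only a tiny overhead
print(q.element_size() * q.nelement() / (kv.element_size() * kv.nelement()))  # 0.5
```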
Calibration Dataset
Definition
A small (128–1024 sample) representative dataset used to compute activation statistics for PTQ.
Purpose
The quality and domain of the calibration data significantly impact final quantized-model accuracy. Choose wisely.
Group Quantization
Definition
Divides a weight vector into groups of g (e.g., 64 or 128) and applies independent scale factors per group.
Purpose
Finer granularity than per-tensor quantization. Recovers most of the accuracy lost by coarse global scaling.
Block-wise Quantization
Definition
Extends group quantization to 2D blocks of the weight matrix with local statistics per block.
Purpose
Used in NF4/QLoRA. Allows quantization statistics to adapt to local weight structure for better accuracy.
Dynamic Quantization
Definition
Computes activation scale factors at runtime from the current batch statistics; no pre-calibration needed.
Purpose
Zero calibration overhead. Best for CPU inference where weights are quantized but compute is flexible.
Static Quantization
Definition
Pre-computes activation scale factors offline using a representative calibration dataset.
Purpose
Faster at runtime than dynamic. Standard for GPU inference pipelines with known input distributions.
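A sketch contrasting the two scale computations from the last two entries: dynamic adapts to whatever arrives, while static trusts that calibration saw a representative range.
```python
import numpy as np

def dynamic_scale(x):                     # computed per batch, at runtime
    return np.abs(x).max() / 127

def static_scale(calibration_batches):    # computed once, offline
    return max(np.abs(b).max() for b in calibration_batches) / 127

calib = [np.random.randn(32, 64) for _ in range(8)]
s_static = static_scale(calib)
x = np.random.randn(32, 64) * 3           # hotter than anything in calibration
print(dynamic_scale(x), s_static)         # dynamic > static here, so static would clip
```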