Quantization terms and explanations from the LLM Optimization Dictionary.
INT8 Quantization
Definition
Maps FP16 weights to 8-bit integers with per-channel or per-tensor scale factors applied.
Purpose
Halves model memory vs FP16 with under 1% accuracy loss when properly calibrated. The first step to try.
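A minimal sketch of the symmetric absmax scheme this definition describes; function names are illustrative, and a real kernel would keep the INT8 matmul on-device.
```python
import numpy as np

def quantize_int8(w: np.ndarray, per_channel: bool = True):
    """Symmetric INT8 quantization with per-channel or per-tensor scales."""
    # absmax per output channel (rows) or over the whole tensor
    absmax = np.abs(w).max(axis=1, keepdims=True) if per_channel else np.abs(w).max()
    scale = np.maximum(absmax, 1e-8) / 127.0   # guard against all-zero channels
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize_int8(q, scale)).max())  # small reconstruction error
```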
INT4 Quantization
Definition
Packs two 4-bit values per byte using group quantization, typically with groups of 128 weights.
Purpose
Fits a 70B model in 35GB vs. 140GB at FP16. Accuracy sensitive to group size and calibration quality.
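A toy sketch of the packing and group scaling described above (assumes the weight count is a multiple of the group size; layout conventions vary between libraries):
```python
import numpy as np

def quantize_int4_grouped(w: np.ndarray, group_size: int = 128):
    """Toy 4-bit group quantization: one scale per group, two values per byte."""
    w = w.reshape(-1, group_size)                     # one row per group
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 7.0
    q = (np.clip(np.round(w / scale), -8, 7) + 8).astype(np.uint8)  # shift to 0..15
    q = q.reshape(-1)
    packed = q[0::2] | (q[1::2] << 4)                 # two nibbles per byte
    return packed, scale

def unpack_int4(packed, scale, group_size=128):
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    q = np.empty(lo.size + hi.size, dtype=np.int8)
    q[0::2], q[1::2] = lo, hi
    return (q.reshape(-1, group_size) * scale).reshape(-1)
```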
NF4 (NormalFloat)
Definition
4-bit data type with bin boundaries spaced to minimize information loss for Gaussian-distributed weights.
Purpose
Achieves lower quantization error than uniform INT4 for normally distributed weights. The storage format inside QLoRA.
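A rough sketch of the quantile idea, not the exact NF4 codebook from the QLoRA paper (the real table is constructed slightly differently and guarantees an exact zero level):
```python
from statistics import NormalDist

# Place 16 levels at evenly spaced quantiles of a standard normal,
# then normalize them into [-1, 1].
nd = NormalDist()
probs = [(i + 0.5) / 16 for i in range(16)]
levels = [nd.inv_cdf(p) for p in probs]
m = max(abs(l) for l in levels)
levels = [l / m for l in levels]

def nf4_encode(x):
    """Map a value in [-1, 1] to the index of the nearest level."""
    return min(range(16), key=lambda i: abs(levels[i] - x))
```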
FP8
Definition
Hardware-native 8-bit float in E4M3 (inference) or E5M2 (training) formats on H100/H200 GPUs.
Purpose
Doubles effective throughput vs BF16 with nearly identical convergence. The future default training dtype.
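A minimal sketch of the two formats, assuming a recent PyTorch build that exposes the float8 dtypes (casting is supported; general float8 arithmetic is not):
```python
import torch

x = torch.randn(4, 4, dtype=torch.bfloat16)
x_e4m3 = x.to(torch.float8_e4m3fn)    # more mantissa bits: favored for forward/inference
x_e5m2 = x.to(torch.float8_e5m2)      # more exponent bits: favored for gradients
print(x_e4m3.to(torch.bfloat16) - x)  # rounding error introduced by the cast
```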
GPTQ
Definition
Layer-wise optimal brain compression: updates remaining weights to compensate for each quantized weight.
Purpose
State-of-the-art accuracy at 3–4 bits per weight. Widely supported by vLLM, TensorRT-LLM, and Hugging Face Transformers.
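A toy sketch of the compensation idea only; real GPTQ adds Cholesky factorization, blocking, and per-group scales:
```python
import numpy as np

def gptq_like(W: np.ndarray, X: np.ndarray, bits: int = 4):
    """W: (out_features, in_features); X: (in_features, n_calibration_samples)."""
    W = W.astype(np.float64).copy()
    n = W.shape[1]
    H = 2 * X @ X.T                               # layer-wise Hessian proxy
    H += 0.01 * np.mean(np.diag(H)) * np.eye(n)   # damping, as is common in practice
    Hinv = np.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax                # single scale to keep the sketch tiny
    Q = np.zeros_like(W)
    for i in range(n):                            # quantize columns left to right
        Q[:, i] = np.clip(np.round(W[:, i] / scale), -qmax - 1, qmax) * scale
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]    # normalized rounding error
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])  # compensate later columns
    return Q
```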
AWQ
Definition
Identifies the ~1% of salient weight channels by activation magnitude and protects them by scaling them up before quantization.
Purpose
Outperforms GPTQ at the same bit budget and quantizes 3x faster. Hardware-efficient; runs on edge GPUs.
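A sketch of the selection criterion only; real AWQ then searches for per-channel scales rather than storing the salient channels at mixed precision. Names are illustrative.
```python
import numpy as np

def find_salient_channels(acts: np.ndarray, frac: float = 0.01):
    """Rank input channels by mean |activation| and return the top ~1%.
    acts: (num_tokens, in_features) calibration activations."""
    importance = np.abs(acts).mean(axis=0)   # per-channel activation magnitude
    k = max(1, int(frac * importance.size))
    return np.argsort(importance)[-k:]       # indices of salient channels
```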
GGUF
Definition
llama.cpp's self-describing binary format encoding quantized weights, tokenizer, and metadata in one file.
Purpose
Supports Q2_K through Q8_0 variants. Runs on CPU, Apple Silicon, and hybrid CPU/GPU setups.
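A minimal header peek, with the field layout assumed from the public GGUF spec (4-byte magic, then little-endian uint32 version, uint64 tensor count, uint64 metadata key/value count):
```python
import struct

def peek_gguf(path: str):
    with open(path, "rb") as f:
        magic = f.read(4)
        assert magic == b"GGUF", f"not a GGUF file: {magic!r}"
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        return version, n_tensors, n_kv
```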
bitsandbytes
Definition
Python library wrapping CUDA kernels for 8-bit matmul and 4-bit storage with BF16 computation.
Purpose
The de facto standard for QLoRA via Hugging Face's load_in_4bit=True. Zero-code quantized fine-tuning.
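A typical QLoRA-style load through the transformers integration (requires a CUDA GPU with bitsandbytes installed; the model name is illustrative):
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 storage format
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for matmuls
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)
```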
SmoothQuant
Definition
Transfers quantization difficulty from outlier-heavy activations to weights by dividing activations by a per-channel scale s and multiplying the matching weight columns by s.
Purpose
Enables hardware-efficient INT8×INT8 GEMM for both weights and activations simultaneously.
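A sketch of the scale migration: the product is mathematically unchanged while activation outliers shrink. The alpha=0.5 balance follows the SmoothQuant paper's default.
```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """X: (tokens, in_features); W: (out_features, in_features)."""
    s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=0) ** (1 - alpha))
    s = np.maximum(s, 1e-5)
    return X / s, W * s   # (X / s) @ (W * s).T == X @ W.T

X = np.random.randn(16, 8) * np.array([1, 1, 1, 50, 1, 1, 1, 1])  # channel 3 is an outlier
W = np.random.randn(4, 8)
Xs, Ws = smooth(X, W)
assert np.allclose(Xs @ Ws.T, X @ W.T)   # output preserved, outlier migrated
```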
QuIP#
Definition
Incoherence processing via random orthogonal (Hadamard) transforms plus vector quantization with lattice codebooks.
Purpose
Near-FP16 quality at 2 bits per weight. The frontier of ultra-low-bit quantization accuracy.
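The full QuIP# pipeline is more involved; this only demonstrates why incoherence processing helps: a random-signed Hadamard rotation spreads a single outlier across all coordinates, making the vector far friendlier to low-bit quantization.
```python
import numpy as np
from scipy.linalg import hadamard

def peak_to_rms(v):
    return np.abs(v).max() / np.sqrt((v ** 2).mean())

n = 256
w = np.random.randn(n) * 0.01
w[7] = 5.0                                   # one large outlier
H = hadamard(n) / np.sqrt(n)                 # orthogonal: H @ H.T == I
S = np.diag(np.random.choice([-1.0, 1.0], n))
w_rot = H @ S @ w                            # exactly invertible rotation
print(peak_to_rms(w), peak_to_rms(w_rot))    # rotation slashes the peak
```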
AQLM
Definition
Additive quantization: decomposes each weight vector as a sum of M learned codebook entries.
Purpose
High accuracy at extreme compression rates; best-in-class at 2 bits where scalar quantization fails.
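A toy decode under assumed shapes; real AQLM learns the codebooks and indices end-to-end, but the storage story is visible here: only M small indices per weight group.
```python
import numpy as np

M, K, g = 2, 256, 8                         # codebooks, entries each, group length
codebooks = np.random.randn(M, K, g)        # learned offline in real AQLM
indices = np.random.randint(0, K, size=M)   # stored per weight group

w_hat = codebooks[np.arange(M), indices].sum(axis=0)  # (g,) reconstructed group
print(w_hat.shape)
```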
EXL2
Definition
Per-layer bit allocation: assigns more bits to sensitive layers and fewer to robust ones at target BPW.
Purpose
Maximizes quality at any given memory budget on NVIDIA hardware. The format powering ExLlamaV2.
Post-Training Quantization (PTQ)
Definition
Quantizes a trained model in minutes using a small unlabeled calibration set of 128–512 samples.
Purpose
The first thing to try before committing to QAT. Zero training cost; production standard approach.
Quantization-Aware Training (QAT)
Definition
Inserts fake quantization operators in the forward pass so gradients account for rounding error.
Purpose
Recovers 0.5–2 PPL points lost by PTQ at the cost of a full fine-tuning run. Worth it below 4 bits.
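A minimal fake-quantization operator with a straight-through estimator; production QAT schemes often also clip gradients outside the representable range.
```python
import torch

class FakeQuant(torch.autograd.Function):
    """Forward pass rounds to INT8 levels; backward pretends rounding was identity."""
    @staticmethod
    def forward(ctx, x, scale):
        return torch.clamp(torch.round(x / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None   # straight-through: gradient skips the rounding

x = torch.randn(8, requires_grad=True)
y = FakeQuant.apply(x, x.abs().max().detach() / 127)
y.sum().backward()              # gradients flow as if no quantization happened
```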
Weight-Only Quantization
Definition
Quantizes only the weight matrices to INT4/INT8 while dequantizing to BF16 before each matmul.
Purpose
The most practical approach for memory-bound LLM decode. No activation calibration overhead at all.
Mixed-Precision Quantization
Definition
Assigns different bit widths layer-by-layer based on per-layer sensitivity analysis scores.
Purpose
A Pareto improvement over uniform quantization. Protects outlier-heavy layers that uniform quant damages.
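A hypothetical greedy allocator showing the budgeted-bits idea; the sensitivity/bits heuristic is purely illustrative, not any particular tool's actual measure.
```python
def allocate_bits(sensitivity: list[float], target_bpw: float, lo: int = 2, hi: int = 8):
    """Start every layer at lo bits, then greedily spend the remaining budget
    on the layer with the best sensitivity-per-bit ratio."""
    n = len(sensitivity)
    bits = [lo] * n
    budget = int(target_bpw * n) - lo * n   # extra bits available
    for _ in range(budget):
        i = max((j for j in range(n) if bits[j] < hi),
                key=lambda j: sensitivity[j] / bits[j])
        bits[i] += 1
    return bits

print(allocate_bits([5.0, 1.0, 0.5, 3.0], target_bpw=4))  # sums to 16 bits over 4 layers
```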
KV Cache Quantization
Definition
Stores KV cache tensors in INT8 or INT4 instead of BF16 during long-context generation.
Purpose
Doubles or quadruples effective context length for a fixed memory budget. Negligible quality impact at INT8.
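A sketch assuming a (batch, heads, seq_len, head_dim) cache layout; production kernels fuse the dequantization into the attention computation.
```python
import torch

def quantize_kv(cache: torch.Tensor):
    """Per-channel INT8 quantization of a KV cache slab."""
    c = cache.float()
    scale = c.abs().amax(dim=2, keepdim=True).clamp(min=1e-6) / 127
    q = torch.clamp(torch.round(c / scale), -127, 127).to(torch.int8)
    return q, scale

kv = torch.randn(1, 8, 1024, 64, dtype=torch.bfloat16)
q, scale = quantize_kv(kv)
# INT8 payload is half the BF16 payload; the scales add only a tiny overhead
print(q.element_size() * q.nelement() / (kv.element_size() * kv.nelement()))  # 0.5
```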
Calibration Dataset
Definition
A small (128–1024 sample) representative dataset used to compute activation statistics for PTQ.
Purpose
The quality and domain of the calibration data significantly impact final quantized-model accuracy. Choose wisely.
Group Quantization
Definition
Divides a weight vector into groups of g (e.g., 64 or 128) and applies independent scale factors per group.
Purpose
Finer granularity than per-tensor quantization. Recovers most of the accuracy lost by coarse global scaling.
Block-wise Quantization
Definition
Extends group quantization to 2D blocks of the weight matrix with local statistics per block.
Purpose
Used in NF4/QLoRA. Allows quantization statistics to adapt to local weight structure for better accuracy.
Dynamic Quantization
Definition
Computes activation scale factors at runtime from the current batch statistics; no pre-calibration needed.
Purpose
Zero calibration overhead. Best for CPU inference where weights are quantized but compute is flexible.
Static Quantization
Definition
Pre-computes activation scale factors offline using a representative calibration dataset.
Purpose
Faster at runtime than dynamic. Standard for GPU inference pipelines with known input distributions.
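A sketch contrasting the two scale computations from the last two entries: dynamic adapts to whatever arrives, while static trusts that calibration saw a representative range.
```python
import numpy as np

def dynamic_scale(x):                     # computed per batch, at runtime
    return np.abs(x).max() / 127

def static_scale(calibration_batches):    # computed once, offline
    return max(np.abs(b).max() for b in calibration_batches) / 127

calib = [np.random.randn(32, 64) for _ in range(8)]
s_static = static_scale(calib)
x = np.random.randn(32, 64) * 3           # hotter than anything in calibration
print(dynamic_scale(x), s_static)         # dynamic > static here, so static would clip
```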