Quantization

Theory

Quantization reduces the numerical precision of model weights to lower memory usage and speed up matrix multiplications.
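To make the mechanics concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization: a single scale maps the largest absolute weight onto the signed 8-bit range. Real libraries such as bitsandbytes use finer-grained (per-channel or per-block) schemes, so treat this as an illustration of the idea, not any library's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)

# One scale per tensor: map the largest |weight| onto the INT8 range [-127, 127].
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 1 byte per weight

# Dequantize for compute; weight-only kernels do this on the fly inside the matmul.
w_hat = w_int8.astype(np.float32) * scale

print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {w_int8.nbytes / 2**20:.0f} MiB")
print(f"mean abs error: {np.abs(w - w_hat).mean():.6f}")
```

Per-tensor scaling is the simplest choice, and its weakness is instructive: a few outlier weights inflate the scale and waste precision on everything else, which is exactly the problem per-channel and activation-aware schemes address.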

| Precision   | Bits | VRAM vs FP16 | Quality loss        | Common tool          |
|-------------|------|--------------|---------------------|----------------------|
| FP32        | 32   | +100%        | None (baseline)     | Training only        |
| FP16 / BF16 | 16   | Baseline     | None                | transformers default |
| INT8        | 8    | −50%         | Minimal (under 1%)  | bitsandbytes         |
| GPTQ        | 4    | −75%         | Small               | AutoGPTQ             |
| AWQ         | 4    | −75%         | Very small          | AutoAWQ              |
| GGUF Q4     | 4    | −75%         | Small               | llama.cpp            |
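The table's memory column follows from weights-only arithmetic (bytes = parameters × bits / 8). A quick check for an illustrative 7B-parameter model; real footprints run slightly higher because quantized formats also store scales and zero-points:

```python
def weight_vram_gb(n_params: float, bits: int) -> float:
    """Weights-only footprint: parameters x bits, converted to gigabytes."""
    return n_params * bits / 8 / 1e9

# Illustrative 7B-parameter model; KV cache and activations come on top.
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_vram_gb(7e9, bits):4.1f} GB")
# FP32: 28.0, FP16: 14.0, INT8: 7.0, INT4: 3.5
```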

Two strategies:

  • PTQ (post-training quantization): quantize an already-trained model using a small calibration dataset to choose the scales. Fast, no training cluster needed; the standard route for GPTQ and AWQ (see the sketch after this list).
  • QAT (quantization-aware training): simulate low precision during training so the model adapts to it. Higher INT4 quality, but it costs a full training run.
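As an example of the PTQ path, here is a sketch using the Hugging Face transformers GPTQ integration, which quantizes during from_pretrained by streaming a calibration dataset. It assumes the optimum and auto-gptq backends are installed; the model id is just a small illustrative choice.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small illustrative model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)

# PTQ: samples from the "c4" calibration set are used to pick quantization
# parameters layer by layer; there are no gradient updates.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("opt-125m-gptq-4bit")  # reloadable as a regular checkpoint
tokenizer.save_pretrained("opt-125m-gptq-4bit")
```

AWQ follows the same calibrate-then-save pattern via AutoAWQ; QAT instead wraps the training loop itself and does not fit in a snippet this size.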
[Figure: precision-to-memory trade-off across common formats. FP32: baseline, training. FP16: −50%, serving default. INT8: −75%, minimal loss. INT4: −87%, small drop.]

Choose a format by deployment target (a vLLM example follows the list):

  • GGUF: CPU and edge inference via llama.cpp; partial GPU offload is supported
  • GPTQ: GPU batch serving; slightly lower quality than AWQ at the same bit width
  • AWQ: GPU production serving; activation-aware scaling protects the most important weights, recommended for vLLM
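Serving an AWQ checkpoint with vLLM then comes down to one argument at load time. A sketch, where the model id is an illustrative community AWQ export; substitute any 4-bit AWQ checkpoint:

```python
from vllm import LLM, SamplingParams

# Illustrative community AWQ export; any 4-bit AWQ checkpoint works here.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

outputs = llm.generate(
    ["Explain quantization in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```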

Quantizing weights frees VRAM that the serving stack can reallocate to a larger KV cache, which buys longer contexts or bigger batches.
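To see why that matters, a back-of-envelope KV-cache estimate (cache bytes = 2 for K and V × layers × KV heads × head dim × tokens × element size), with shape defaults assumed to approximate a Llama-2-7B-style model:

```python
def kv_cache_gb(seq_len: int, batch: int, n_layers: int = 32,
                n_kv_heads: int = 32, head_dim: int = 128,
                bytes_per_elem: int = 2) -> float:
    """2 (K and V) x layers x KV heads x head dim x tokens x element size."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Defaults assume a Llama-2-7B-style model with an FP16 cache.
print(f"{kv_cache_gb(seq_len=4096, batch=8):.1f} GB")  # ~17.2 GB
```

Under these assumptions, on a 24 GB card that cache fits next to INT4 weights (about 3.5 GB) but not next to FP16 weights (14 GB): the quantization savings go straight into batch size and context length.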