Act 4: Mastery · 4 / 10

Preference Alignment (DPO)

Theory · ~5 min

Preference alignment teaches a model not just to follow instructions, but to do so well — shaped by human judgments of quality, safety, and tone.

DPO vs. RLHF

| Property | RLHF | DPO |
| --- | --- | --- |
| Reward model | Separate model, trained first | None; the LLM itself is the implicit reward model |
| Algorithm | PPO | Direct gradient update |
| Stability | Can be unstable | Stable, fewer hyperparameters |
| Data sensitivity | Moderate | High; noisy preference pairs hurt |

Dataset format: each sample is a triplet of a prompt, a chosen (preferred) response, and a rejected (dispreferred) response. The rejected response carries as much signal as the chosen one; it defines the behavior to move away from.
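As a minimal sketch, one such triplet could be stored as a JSONL record like the one below. The prompt text and the field names prompt / chosen / rejected are illustrative; most DPO tooling follows this convention, but check the exact schema your training library expects.

```python
import json

# One hypothetical preference triplet in the common prompt/chosen/rejected format.
sample = {
    "prompt": "Explain what a KL divergence penalty does during fine-tuning.",
    "chosen": (
        "It penalizes the fine-tuned model for drifting too far from the "
        "reference model's output distribution, which keeps updates conservative."
    ),  # preferred response
    "rejected": "It makes the model better.",  # dispreferred response
}

# Preference datasets are often stored as JSONL: one record like this per line.
print(json.dumps(sample))
```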

beta controls the strength of KL regularization: how far the aligned model may drift from the SFT reference. Higher beta = more conservative alignment.
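To make beta's role concrete, here is a from-scratch sketch of the DPO objective in PyTorch, following the standard formulation (Rafailov et al., 2023). The function and argument names are assumptions for this example, not part of any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss for a batch of preference pairs.

    Each argument is a tensor of summed per-sequence log-probabilities,
    shape (batch,). beta scales the log-ratio margin: larger values
    penalize deviation from the frozen SFT reference more strongly.
    """
    # Log-ratios between the trained policy and the SFT reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # DPO pushes the chosen log-ratio above the rejected one;
    # -log(sigmoid(.)) is the standard binary-preference loss.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

With a small beta, drifting from the reference is cheap and the policy chases the preference signal aggressively; with a large beta, deviations are penalized sharply and alignment stays conservative.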

SFT Model (instruction-following)
→ Preference Data (prompt, chosen, rejected)
→ DPO Training (log-ratio loss)
→ Aligned Model (preferred outputs)

SFT first, then DPO: the sequence is mandatory.
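Assuming the Hugging Face TRL library, the pipeline above might look roughly like the sketch below. The model name, dataset file, and hyperparameter values are placeholders, and TRL's argument names have shifted across releases, so treat this as an outline rather than a drop-in script.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Assumed: an already SFT-trained, instruction-following checkpoint.
sft_model_name = "my-org/my-sft-model"
model = AutoModelForCausalLM.from_pretrained(sft_model_name)
tokenizer = AutoTokenizer.from_pretrained(sft_model_name)

# Assumed: a JSONL file with prompt / chosen / rejected columns.
pref_data = load_dataset("json", data_files="preferences.jsonl", split="train")

config = DPOConfig(
    output_dir="dpo-aligned-model",
    beta=0.1,                        # KL regularization strength
    per_device_train_batch_size=4,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=pref_data,
    processing_class=tokenizer,      # older TRL releases use tokenizer= instead
)
trainer.train()
```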

When to use DPO: harmful output reduction, style alignment, tone tuning. Choose RLHF for complex multi-dimensional preference hierarchies.