
Supervised Fine-Tuning


Theory

Supervised fine-tuning picks up where pretraining leaves off. A pretrained model predicts the next token over raw text — it has no concept of "task" or "correct answer." SFT introduces that structure by continuing training on labelled (instruction, response) pairs.

Stage        | Data                           | Objective
Pretraining  | Raw text, billions of tokens   | Next-token prediction everywhere
SFT          | (instruction, response) pairs  | Next-token prediction on response tokens only
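
To make the contrast concrete, here is a minimal sketch of one training sample at each stage. The text and field names are illustrative; real SFT datasets vary in schema:

```python
# Pretraining sample: raw text, loss computed on every token.
pretrain_sample = (
    "The mitochondrion is the organelle that produces most of a cell's ATP."
)

# SFT sample: a labelled (instruction, response) pair.
# Loss is computed on the response tokens only (see loss masking below).
sft_sample = {
    "instruction": "In one sentence, what does a mitochondrion do?",
    "response": "It produces most of the cell's ATP through cellular respiration.",
}
```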

Dataset format. Each sample uses a chat template with three roles: system (persona), user (instruction), assistant (target response). Templates differ per model family — always use tokenizer.apply_chat_template rather than hand-rolling separators.
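
A minimal sketch of applying a template with the Hugging Face tokenizers API. The checkpoint name is an assumption; any chat model that ships a template works the same way:

```python
from transformers import AutoTokenizer

# Checkpoint name is illustrative; substitute the model you are fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Explain loss masking in one sentence."},
    {"role": "assistant", "content": "Only response tokens contribute to the loss."},
]

# The tokenizer inserts this model family's separators and special tokens,
# so the same messages render differently for Llama, Qwen, Mistral, etc.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```

Hand-rolled separators tend to drift from the template the model was trained on, which silently degrades quality.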

Loss masking. The cross-entropy gradient is computed only on the assistant response tokens. Instruction tokens are masked to zero gradient — the model is not penalised for failing to predict text it received as input.

[system]              masked (no gradient)
[user instruction]    masked (no gradient)
[assistant response]  loss computed here

SFT loss masking: only the response segment trains the model.
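
A minimal sketch of how that mask is typically implemented in PyTorch: positions set to cross_entropy's ignore_index contribute no loss and therefore no gradient. The token IDs and boundary index below are made up, and the one-position shift for next-token prediction is omitted, as Hugging Face models apply it internally:

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index for torch cross_entropy

def build_labels(input_ids: list[int], response_start: int) -> torch.Tensor:
    """Copy input_ids into labels, masking system + user tokens."""
    labels = torch.tensor(input_ids)
    labels[:response_start] = IGNORE_INDEX  # no gradient on the prompt
    return labels

# Toy sequence: 6 prompt tokens (system + user) then 4 response tokens.
input_ids = [101, 2023, 2003, 1996, 7953, 102, 3437, 2001, 2182, 102]
labels = build_labels(input_ids, response_start=6)

# Only the 4 unmasked positions contribute to the loss.
logits = torch.randn(len(input_ids), 30000)  # (seq_len, vocab_size)
loss = F.cross_entropy(logits, labels, ignore_index=IGNORE_INDEX)
```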

When to SFT: enforcing a consistent output format, teaching domain vocabulary the base model lacks, transferring a style or persona. When not to: injecting facts that change frequently; use RAG for that.

What comes next: LoRA cuts trainable parameters to roughly 1–5% of the model; DPO aligns outputs to human preferences without training a separate reward model.