Act 4

Mastery


Pretraining


Theory

Pretraining is the first and most expensive stage in building a foundation model. The model is trained with a simple self-supervised objective: given a sequence of tokens, predict the next one. Run at sufficient scale, this single task teaches language, world knowledge, and reasoning.
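The next-token objective can be sketched numerically. This is a toy illustration, not a real language model: the vocabulary is tiny and the "model" is just random logits standing in for a network's output.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 8                          # hypothetical tiny vocabulary
tokens = np.array([3, 1, 4, 1, 5])      # a token sequence from the corpus

# The model reads tokens[:-1] and must predict tokens[1:].
inputs, targets = tokens[:-1], tokens[1:]
logits = rng.normal(size=(len(inputs), vocab_size))  # stand-in for model output

# Cross-entropy loss: -log p(target | context), averaged over positions.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(len(targets)), targets].mean()
print(float(loss))
```

Training minimizes this loss over trillions of tokens; everything the base model "knows" is acquired through this one objective.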

Raw text corpus → Pretraining (next-token prediction) → Base model → SFT (instruction pairs) → RLHF / DPO (preferences) → Chat model

Training stages: pretraining builds the base; post-training aligns it for use.

Scaling laws (Chinchilla, 2022): capability scales predictably with three factors — model parameters, training tokens, and compute. Optimal runs scale all three together; undertrained large models waste compute.
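A back-of-envelope sketch of compute-optimal sizing, assuming the commonly cited Chinchilla rule of thumb (~20 training tokens per parameter) and the standard approximation of ~6·N·D FLOPs for a training run:

```python
def chinchilla_optimal_tokens(n_params: float) -> float:
    # Rule-of-thumb assumption: tokens scale ~20x with parameter count.
    return 20.0 * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    # Standard approximation: ~6 FLOPs per parameter per training token.
    return 6.0 * n_params * n_tokens

n = 70e9                            # a 70B-parameter model
d = chinchilla_optimal_tokens(n)    # ~1.4T tokens
print(f"{d:.1e} tokens, {train_flops(n, d):.1e} FLOPs")
# → 1.4e+12 tokens, 5.9e+23 FLOPs
```

This is why "undertrained large models waste compute": a model far above the ~20 tokens/parameter line would have reached lower loss as a smaller model trained on more tokens for the same FLOPs budget.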

Open weights
Llama, Mistral. Self-hosted; fine-tune on private data. Full control, but you carry the full infrastructure cost.
Closed weights
GPT-4, Claude. Managed API; faster to start. No weight access and no customization below the API layer.