Act 1

Foundations


Transformer Architecture


Theory

A transformer is built from one repeating block; understand that block and you understand the whole architecture.

Input (prev block) → Multi-Head Attn (share context) → Add & Norm (residual) → Feed-Forward (per-token) → Add & Norm (residual) → Output (richer reps)
Inside one transformer block: attention mixes context, FFN refines per token, residuals carry signal forward.
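The wiring above translates almost line for line into code. Here is a minimal PyTorch sketch of one post-norm block, following the diagram's order (attention, then Add & Norm, then feed-forward, then Add & Norm); the dimensions, GELU activation, and class name are illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One transformer block: attention mixes context, FFN refines per token."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(            # applied to each token independently
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # x: (batch, seq, d_model) from the previous block.
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)  # share context
        x = self.norm1(x + attn_out)         # Add & Norm: residual carries signal
        x = self.norm2(x + self.ffn(x))      # Add & Norm around the per-token FFN
        return x                             # richer representations out
```

The residual additions (the `x + ...` terms) are what let signal flow unchanged through many stacked blocks; each sub-layer only has to learn a refinement on top of its input.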
Encoder-decoder (T5, BART)
Two separate stacks: one reads the input, one writes the output.
Decoder-only (GPT, Llama)
One stack does both via masked (causal) attention: position i attends only to positions 0 through i, never to anything later.
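The causal mask is easy to build explicitly. The sketch below assumes the boolean `attn_mask` convention of `nn.MultiheadAttention`, where True marks positions a query may not attend to:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # (seq_len, seq_len) boolean mask; True = attention not allowed.
    # Row i is False at columns 0..i, so position i sees itself and
    # everything earlier, but nothing later.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```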
Token IDs → Embed + Pos → Block 1 → Block 2 → … → Block N → Linear → Logits
Full stack: identical blocks repeat N times, each layer building richer representations.
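Putting it together, here is a hypothetical decoder-only stack that reuses the `TransformerBlock` and `causal_mask` sketches above; the vocabulary size, depth, and context length are placeholder numbers, and learned positional embeddings are one of several common choices.

```python
class TinyLM(nn.Module):
    """Full stack: Token IDs -> Embed + Pos -> Block 1..N -> Linear -> Logits."""

    def __init__(self, vocab_size: int = 32_000, d_model: int = 512,
                 n_heads: int = 8, d_ff: int = 2048,
                 n_layers: int = 6, max_len: int = 1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)      # Embed
        self.pos = nn.Embedding(max_len, d_model)         # + Pos (learned)
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )
        self.lm_head = nn.Linear(d_model, vocab_size)     # Linear -> Logits

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq) integer token IDs.
        seq = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(seq, device=ids.device))
        mask = causal_mask(seq).to(ids.device)
        for block in self.blocks:            # identical blocks repeat N times
            x = block(x, attn_mask=mask)
        return self.lm_head(x)               # (batch, seq, vocab_size) logits


logits = TinyLM()(torch.randint(0, 32_000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```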