Transformer Architecture
Theory
A transformer is built from a repeating block. Understanding the block explains the whole architecture.
Input (from the previous block)
→ Multi-Head Attn (tokens share context)
→ Add & Norm (residual connection)
→ Feed-Forward (applied per token)
→ Add & Norm (residual connection)
→ Output (richer representations)
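The flow above can be sketched numerically. This is a minimal single-head, single-sequence version using NumPy, with random weights and a ReLU feed-forward layer; real implementations use multiple heads, batching, and learned LayerNorm gains/biases, all omitted here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x, Wq, Wk, Wv):
    # Every token mixes in context from every other token
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    # Attention sub-layer, then Add & Norm (residual)
    x = layer_norm(x + attention(x, Wq, Wk, Wv))
    # Per-token feed-forward sub-layer, then Add & Norm (residual)
    x = layer_norm(x + np.maximum(x @ W1, 0) @ W2)
    return x

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))                 # 5 tokens, model width d
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, 4 * d)) * 0.1      # FFN expands 4x, then projects back
W2 = rng.normal(size=(4 * d, d)) * 0.1
out = transformer_block(x, Wq, Wk, Wv, W1, W2)
print(out.shape)  # (5, 8)
```

Note that the output has the same shape as the input, which is what lets N identical blocks stack.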
Encoder-decoder (T5, BART)
Two separate stacks: one reads the input, one writes the output.
Decoder-only (GPT, Llama)
One stack does both via masked (causal) attention: position i attends only to positions 0 through i, never to future tokens, so the prediction for the next token cannot peek ahead.
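The causal mask can be written as a lower-triangular boolean matrix; this small illustration (names are mine, not from the source) shows which positions each row may attend to:

```python
import numpy as np

def causal_mask(n):
    # mask[i, j] is True where position i may attend to position j (j <= i)
    return np.tril(np.ones((n, n), dtype=bool))

m = causal_mask(4)
print(m.astype(int))
# Row i has ones in columns 0..i: each position sees itself and everything before it.
```

In attention, positions where the mask is False get their scores set to -inf before the softmax, so they receive zero weight.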
Token IDs → Embed + Pos → Block 1 → Block 2 → … → Block N → Linear → Logits
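The end-to-end pipeline reduces to a few matrix operations. A rough shape-level sketch with random weights, where each block is replaced by a trivial residual layer as a stand-in for the full attention + feed-forward block:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d, n_layers, seq = 100, 16, 2, 6     # toy sizes, chosen arbitrarily

E = rng.normal(size=(vocab, d)) * 0.1       # token embedding table
P = rng.normal(size=(seq, d)) * 0.1         # learned positional embeddings
layers = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_layers)]
W_out = rng.normal(size=(d, vocab)) * 0.1   # final Linear ("unembedding")

token_ids = rng.integers(0, vocab, size=seq)  # Token IDs
x = E[token_ids] + P                          # Embed + Pos
for W in layers:                              # Block 1 ... Block N
    x = x + np.tanh(x @ W)                    # stand-in for a full block
logits = x @ W_out                            # Linear -> Logits
print(logits.shape)  # (6, 100)
```

The logits row at each position scores every vocabulary item; a softmax over that row gives the next-token distribution.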