Act 1

Foundations


Transformer Architecture


Theory

A transformer is built from one repeating block; understand that block and you understand the whole architecture.

Input (prev block) → Multi-Head Attn (share context) → Add & Norm (residual) → Feed-Forward (per-token) → Add & Norm (residual) → Output (richer reps)
Inside one transformer block: attention mixes context, FFN refines per token, residuals carry signal forward.
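The wiring above translates almost line for line into code. Here is a minimal PyTorch sketch of one post-norm block, following the diagram's order (attention, then Add & Norm, then feed-forward, then Add & Norm); the dimensions, GELU activation, and class name are illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One transformer block: attention mixes context, FFN refines per token."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(            # applied to each token independently
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # x: (batch, seq, d_model) from the previous block.
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)  # share context
        x = self.norm1(x + attn_out)         # Add & Norm: residual carries signal
        x = self.norm2(x + self.ffn(x))      # Add & Norm around the per-token FFN
        return x                             # richer representations out
```

The residual additions (the `x + ...` terms) are what let signal flow unchanged through many stacked blocks; each sub-layer only has to learn a refinement on top of its input.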
Encoder-decoder (T5, BART)
Two separate stacks: one reads the input, one writes the output.
Decoder-only (GPT, Llama)
One stack does both via masked (causal) attention: position i attends only to positions 0 through i, never to anything later.
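The causal mask is easy to build explicitly. The sketch below assumes the boolean `attn_mask` convention of `nn.MultiheadAttention`, where True marks positions a query may not attend to:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # (seq_len, seq_len) boolean mask; True = attention not allowed.
    # Row i is False at columns 0..i, so position i sees itself and
    # everything earlier, but nothing later.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```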
Token IDs → Embed + Pos → Block 1 → Block 2 → … → Block N → Linear → Logits
Full stack: identical blocks repeat N times, each layer building richer representations.
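Putting it together, here is a hypothetical decoder-only stack that reuses the `TransformerBlock` and `causal_mask` sketches above; the vocabulary size, depth, and context length are placeholder numbers, and learned positional embeddings are one of several common choices.

```python
class TinyLM(nn.Module):
    """Full stack: Token IDs -> Embed + Pos -> Block 1..N -> Linear -> Logits."""

    def __init__(self, vocab_size: int = 32_000, d_model: int = 512,
                 n_heads: int = 8, d_ff: int = 2048,
                 n_layers: int = 6, max_len: int = 1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)      # Embed
        self.pos = nn.Embedding(max_len, d_model)         # + Pos (learned)
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )
        self.lm_head = nn.Linear(d_model, vocab_size)     # Linear -> Logits

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq) integer token IDs.
        seq = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(seq, device=ids.device))
        mask = causal_mask(seq).to(ids.device)
        for block in self.blocks:            # identical blocks repeat N times
            x = block(x, attn_mask=mask)
        return self.lm_head(x)               # (batch, seq, vocab_size) logits


logits = TinyLM()(torch.randint(0, 32_000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```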