Attention Mechanism
Theory
Self-attention lets every token look at every other token and decide how much information to borrow from each before predicting what comes next.
For the token at position i:
Q_i = embedding_i × W_Q (what am I looking for?)
K_j = embedding_j × W_K (what does token j offer?)
V_j = embedding_j × W_V (what info does token j hold?)
score(i,j) = softmax_j( Q_i · K_j / √d_k )   (normalized over all positions j)
output_i = Σ_j score(i,j) × V_j
The √d_k scaling keeps the dot products, which grow with the key dimension, from saturating the softmax and killing the gradients.
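A minimal numpy sketch of these equations; the toy shapes and random weights below are assumptions for illustration, not trained parameters.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model) token embeddings
    W_Q, W_K, W_V: (d_model, d_k) learned projections
    Returns (seq_len, d_k) outputs and (seq_len, seq_len) attention weights.
    """
    Q = X @ W_Q                      # what each token is looking for
    K = X @ W_K                      # what each token offers
    V = X @ W_V                      # what info each token holds
    d_k = Q.shape[-1]

    scores = Q @ K.T / np.sqrt(d_k)            # all-pairs dot products, scaled
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over j (the keys)

    return weights @ V, weights      # output_i = Σ_j weights[i, j] · V_j

# Tiny demo with random weights (assumed toy sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, w = self_attention(X, W_Q, W_K, W_V)
print(out.shape, w.shape)  # (4, 8) (4, 4); each row of w sums to 1
```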
"bank" near a river
Attends to river, steep, hikers → geographic meaning.
"bank" near a deposit
Attends to deposit, account, teller → financial meaning.
Same token. Different attention weights. Different output vector — no word-sense dictionary required.
RNN
Hidden state passes left-to-right. Position N waits for N−1. Sequential bottleneck.
Attention
All pairs computed at once. Full GPU parallelism. Longer effective context.
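A sketch of the two dependency structures, assuming a simple tanh recurrence and toy shapes (both are illustrative assumptions, not any specific model):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 6, 8
X = rng.normal(size=(seq_len, d))

# RNN: each step must wait for the previous hidden state -> a sequential loop.
W_xh, W_hh = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
rnn_states = []
for x_t in X:                          # cannot be parallelized across positions
    h = np.tanh(x_t @ W_xh + h @ W_hh)
    rnn_states.append(h)

# Attention: every pair (i, j) is scored in one matrix product -> fully parallel.
scores = X @ X.T / np.sqrt(d)          # all seq_len × seq_len scores at once
print(len(rnn_states), scores.shape)   # 6 (6, 6)
```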
Multi-head attention runs the whole computation h times in parallel, each head with its own smaller learned projections, then concatenates the head outputs and applies a final output projection. Different heads specialize on grammar, coreference, or topic.
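A compact numpy sketch of the multi-head variant; the head count, shapes, and random weights are assumptions for illustration.

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """Split d_model into n_heads subspaces, attend in each, concatenate, project.

    X: (seq_len, d_model); W_Q, W_K, W_V, W_O: (d_model, d_model)
    """
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    # Project, then reshape to (n_heads, seq_len, d_head): one subspace per head.
    def split(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(W_Q), split(W_K), split(W_V)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head (seq, seq) scores
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys

    heads = weights @ V                                   # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O                                   # final output projection

# Toy run with random weights (assumed sizes, not a trained model).
rng = np.random.default_rng(2)
seq_len, d_model, n_heads = 4, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads).shape)  # (4, 16)
```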