Attention Mechanism
Theory
Self-attention lets every token look at every other token and decide how much information to borrow from each before predicting what comes next.
For the token at position i:
Q_i = embedding_i × W_Q (what am I looking for?)
K_j = embedding_j × W_K (what does token j offer?)
V_j = embedding_j × W_V (what info does token j hold?)
score(i,j) = softmax_j( Q_i · K_j / √d_k )   (normalized over all positions j)
output_i = Σ_j score(i,j) × V_j
The √d_k scaling keeps the dot products, which grow with the key dimension, from saturating the softmax and killing the gradients.
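A minimal numpy sketch of these equations; the toy shapes and random weights below are assumptions for illustration, not trained parameters.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model) token embeddings
    W_Q, W_K, W_V: (d_model, d_k) learned projections
    Returns (seq_len, d_k) outputs and (seq_len, seq_len) attention weights.
    """
    Q = X @ W_Q                      # what each token is looking for
    K = X @ W_K                      # what each token offers
    V = X @ W_V                      # what info each token holds
    d_k = Q.shape[-1]

    scores = Q @ K.T / np.sqrt(d_k)            # all-pairs dot products, scaled
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over j (the keys)

    return weights @ V, weights      # output_i = Σ_j weights[i, j] · V_j

# Tiny demo with random weights (assumed toy sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, w = self_attention(X, W_Q, W_K, W_V)
print(out.shape, w.shape)  # (4, 8) (4, 4); each row of w sums to 1
```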
"bank" near a river
Attends to river, steep, hikers → geographic meaning.
"bank" near a deposit
Attends to deposit, account, teller → financial meaning.
Same token. Different attention weights. Different output vector — no word-sense dictionary required.
RNN
Hidden state passes left-to-right. Position N waits for N−1. Sequential bottleneck.
Attention
All pairs computed at once. Full GPU parallelism. Longer effective context.
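A sketch of the two dependency structures, assuming a simple tanh recurrence and toy shapes (both are illustrative assumptions, not any specific model):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 6, 8
X = rng.normal(size=(seq_len, d))

# RNN: each step must wait for the previous hidden state -> a sequential loop.
W_xh, W_hh = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
rnn_states = []
for x_t in X:                          # cannot be parallelized across positions
    h = np.tanh(x_t @ W_xh + h @ W_hh)
    rnn_states.append(h)

# Attention: every pair (i, j) is scored in one matrix product -> fully parallel.
scores = X @ X.T / np.sqrt(d)          # all seq_len × seq_len scores at once
print(len(rnn_states), scores.shape)   # 6 (6, 6)
```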
Multi-head attention runs the whole computation h times in parallel, each head with its own smaller learned projections, then concatenates the head outputs and applies a final output projection. Different heads specialize on grammar, coreference, or topic.
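A compact numpy sketch of the multi-head variant; the head count, shapes, and random weights are assumptions for illustration.

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """Split d_model into n_heads subspaces, attend in each, concatenate, project.

    X: (seq_len, d_model); W_Q, W_K, W_V, W_O: (d_model, d_model)
    """
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    # Project, then reshape to (n_heads, seq_len, d_head): one subspace per head.
    def split(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(W_Q), split(W_K), split(W_V)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head (seq, seq) scores
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys

    heads = weights @ V                                   # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O                                   # final output projection

# Toy run with random weights (assumed sizes, not a trained model).
rng = np.random.default_rng(2)
seq_len, d_model, n_heads = 4, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads).shape)  # (4, 16)
```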