Act 1

Foundations


Tokens


Theory

A token is the basic unit of text a language model processes. Tokenizers split your input before the model sees it: one token per common word, more for rare words, punctuation, numbers, or non-Latin script.
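To make the split concrete, here is a toy tokenizer. It is only an illustration, not how production tokenizers work: real tokenizers (BPE and similar) learn subword merges from data, so their splits differ from this simple word-and-punctuation rule. The point it shows is that token count rarely equals word count.

```python
import re

def toy_tokenize(text):
    # Toy rule: a run of word characters is one token,
    # every punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(toy_tokenize("the cat sat"))   # ['the', 'cat', 'sat']
print(toy_tokenize('{"k":1}'))       # ['{', '"', 'k', '"', ':', '1', '}']
```

Plain prose maps roughly one word to one token, while the JSON snippet fragments into many pieces, mirroring the density differences in the table below.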

Token density by content type

Content type         Tokens / word    Example
English prose        ~1.3             "the cat sat" → 3 tokens
Code / JSON          ~2               {"k":1} → 6 tokens
Chinese / Japanese   ~2.5 per char    "你好" → ~3 tokens

Tokens drive two things:

  • Cost — providers bill per million tokens in + out.
  • Capacity — the context window is a fixed token count, shared by prompt, history, and response.
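A quick sketch of the cost side, using made-up prices (real per-token rates vary by provider and model; output tokens typically cost several times more than input tokens):

```python
# Assumed prices for illustration only.
PRICE_IN_PER_M = 3.00    # $ per 1M input tokens
PRICE_OUT_PER_M = 15.00  # $ per 1M output tokens

def request_cost(tokens_in, tokens_out):
    # Bill = input tokens at the input rate + output tokens at the output rate.
    return (tokens_in * PRICE_IN_PER_M + tokens_out * PRICE_OUT_PER_M) / 1_000_000

# A 2,000-token prompt with a 500-token reply:
print(f"${request_cost(2_000, 500):.4f}")  # $0.0135
```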
Prompt text → Tokenizer → Token IDs (integers) → Language model → New token IDs
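The pipeline above can be faked end to end with a made-up five-word vocabulary; a real model would predict the next token ID from the input IDs, so the "continuation" here is hard-coded:

```python
# Made-up vocabulary: word -> token ID.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
inv = {i: w for w, i in vocab.items()}

def encode(text):
    # Tokenizer step: text -> token IDs (integers).
    return [vocab[w] for w in text.split()]

def decode(ids):
    # Inverse step: token IDs -> text.
    return " ".join(inv[i] for i in ids)

ids = encode("the cat sat")   # [0, 1, 2]
# Stand-in for the model: append a fixed continuation.
new_ids = ids + [3, 0, 4]
print(decode(new_ids))        # the cat sat on the mat
```

The model itself never sees text, only the integer IDs, which is why everything upstream and downstream of it hinges on the tokenizer.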