
Tokens Across Languages

Act 1: Foundations · ~4 min

Theory

Token efficiency varies sharply by language because BPE vocabularies are built from training corpora that are not balanced across languages.

| Language | × vs English | 100-word equivalent |
| --- | --- | --- |
| English | 1.0× | ~133 tokens |
| Spanish / Portuguese | 1.1–1.2× | ~150 tokens |
| French | 1.1–1.15× | ~145 tokens |
| Arabic | 1.5–2.0× | ~220 tokens |
| Japanese | 1.8–2.5× | ~260 tokens |
| Chinese (Simplified) | 2.0–2.5× | ~270 tokens |

GPT-family models (tiktoken's cl100k_base encoding) have a vocabulary of roughly 100K token types, weighted heavily toward English subwords. Chinese and Japanese characters are individually less frequent in the tokenizer's training data, so each character typically maps to 1.5–2.5 tokens instead of merging into efficient multi-character subwords.
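
You can check these ratios on your own text with the tiktoken library. The snippet below is a minimal sketch: the parallel sentences are invented for illustration, and the exact counts and ratios will vary with whatever text you encode.

```python
# Compare cl100k_base token counts for roughly parallel sentences.
# Sample sentences are illustrative; real ratios depend on the text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5 / GPT-4

samples = {
    "English":  "The weather is nice today, so let's go for a walk.",
    "Spanish":  "El clima está agradable hoy, así que vamos a caminar.",
    "Japanese": "今日は天気がいいので、散歩に行きましょう。",
    "Chinese":  "今天天气很好，我们去散步吧。",
}

baseline = len(enc.encode(samples["English"]))
for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    ratio = n_tokens / baseline          # expansion factor vs English
    per_char = n_tokens / len(text)      # CJK characters often split into >1 token
    print(f"{lang:<9} {n_tokens:>3} tokens  {ratio:4.2f}x vs English  "
          f"{per_char:.2f} tokens/char")
```

Short, everyday sentences like these will not reproduce the table's averages exactly, but the ordering (English cheapest, CJK most expensive) should hold.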