The Math of Tokenization

Why we don't use Whole-Word tokenization.

To understand exactly why we don't use Whole-Word tokenization, it helps to look at the sheer math of the English language. If you wanted to build a dictionary of "all possible words" to cover general English plus every major industry, your vocabulary size would easily exceed 2,000,000 distinct words.

1. The "Standard" Dictionary

  • The Oxford English Dictionary (OED) contains about 170,000 words in current use.
  • Webster's Third New International Dictionary contains about 470,000 entries.

Note: Dictionaries are curated and conservative; they leave out millions of inflections, compounds, and slang terms.

2. Word Variations (Multiplies Size by ~4x)

Whole-Word tokenizers treat every variation as a completely unique ID.

  • Root word: Compute (1 word)
  • Tokenizer IDs: Computes, Computed, Computing, Precompute, Recompute... (8+ words)

Suddenly, a 500,000-word dictionary balloons past 2,000,000 entries just to handle basic grammar, as the sketch below shows.
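
To make the blow-up concrete, here is a minimal sketch of the whole-word approach. The class and its methods are hypothetical, invented purely for illustration:

```python
# A minimal sketch of whole-word tokenization: every surface form gets
# its own ID. The class is hypothetical, not a real library.

class WholeWordTokenizer:
    def __init__(self):
        self.vocab = {}  # exact word form -> unique ID

    def add(self, word):
        if word not in self.vocab:
            self.vocab[word] = len(self.vocab)

    def encode(self, word):
        # Anything unseen collapses to -1 (an <UNK> placeholder).
        return self.vocab.get(word, -1)

tok = WholeWordTokenizer()

# One root, many IDs: the vocabulary grows with every inflection.
for form in ["compute", "computes", "computed", "computing",
             "precompute", "recompute", "computation", "computable"]:
    tok.add(form)

print(len(tok.vocab))            # 8 IDs for a single root word
print(tok.encode("computes"))    # 1 -- seen, so it has an ID
print(tok.encode("uncomputed"))  # -1 -- unseen, so it is unreadable
```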

3. Industry Jargon & Science

Moving into professional fields causes the vocabulary to explode:

Medicine & Biology

350,000+ clinical terms (SNOMED CT), 300,000+ named plant species, and millions of chemical compounds.

Technology & IT

API, JSON, Kubernetes, Refactoring, Multithreading, Retweeted, Cryptocurrency.

4. Proper Nouns & Brands

If "Starbucks" isn't in your whole-word dictionary, the AI cannot read or write the word.

  • 150,000+ distinct English surnames.
  • Millions of cities, towns, and street names.
  • Brand names: Google, Pfizer, PlayStation.
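
Here's a hedged sketch of what that failure looks like in practice. The five-word vocabulary and the <UNK> convention are assumptions for illustration:

```python
# The out-of-vocabulary (OOV) problem: a whole-word model maps every
# unknown word to a single <UNK> ID, destroying the information.

vocab = {"<UNK>": 0, "i": 1, "bought": 2, "coffee": 3, "at": 4}
rev = {i: w for w, i in vocab.items()}

def encode(sentence):
    return [vocab.get(w, vocab["<UNK>"]) for w in sentence.lower().split()]

def decode(ids):
    return " ".join(rev[i] for i in ids)

ids = encode("I bought coffee at Starbucks")
print(ids)          # [1, 2, 3, 4, 0]
print(decode(ids))  # 'i bought coffee at <UNK>' -- the brand is gone
```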

5. Slang & Formatting (Infinite)

On the internet, language is liquid:

yeet, rizz, hangry, Helloooo, vlog

The "Lego Brick" Solution (BPE)

A whole-word embedding matrix with 3,000,000 rows would be too massive for a GPU to load efficiently. Byte-Pair Encoding (BPE) solves this by learning the most common sub-word chunks and snapping them together like Lego bricks.
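
The training loop behind BPE is short enough to sketch: count adjacent symbol pairs, merge the most frequent pair, repeat. The four-word corpus and the eight merges below are made-up numbers for illustration; a real tokenizer learns ~50,000 merges from billions of words:

```python
# A self-contained toy of BPE merge learning. Words start as tuples of
# single characters; each round fuses the most frequent adjacent pair.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one fused symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny made-up corpus: word -> frequency.
corpus = {tuple("computed"): 5, tuple("computing"): 4,
          tuple("uncomputed"): 2, tuple("recompute"): 3}

for step in range(8):  # learn 8 merges
    best = max(get_pair_counts(corpus), key=get_pair_counts(corpus).get)
    corpus = merge_pair(corpus, best)
    print(f"merge {step + 1}: {best[0]} + {best[1]}")

print(list(corpus))  # words are now split into reusable chunks
```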

How an LLM sees "Uncomputed"

un + comput + ed

With a vocabulary of just ~50,000 tokens, an LLM can assemble virtually any English word, piece of medical jargon, or typo, while keeping the model's "brain" small and lightning fast.
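
One final sketch of the decomposition itself. Real BPE replays its learned merges in order; the greedy longest-match below is a simplification, and the tiny vocabulary is an assumption, but it shows how unseen words snap together from known bricks:

```python
# Greedy longest-match segmentation over a tiny assumed vocabulary.
# Unseen words decompose into known chunks instead of failing as <UNK>.

VOCAB = {"un", "re", "pre", "comput", "ed", "ing",
         # single characters as a fallback, so nothing is unrepresentable
         *"abcdefghijklmnopqrstuvwxyz"}

def segment(word):
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(segment("uncomputed"))   # ['un', 'comput', 'ed']
print(segment("recomputing"))  # ['re', 'comput', 'ing']
print(segment("yeet"))         # falls back to single characters
```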