The Math of Tokenization

Why we don't use Whole-Word tokenization.

To understand exactly why we don't use Whole-Word tokenization, it helps to look at the sheer math of the English language. If you wanted to build a dictionary of "all possible words" to cover general English plus every major industry, your vocabulary size would easily exceed 2,000,000 distinct words.

1. The "Standard" Dictionary

  • The Oxford English Dictionary (OED) contains about 170,000 words in current use.
  • Webster's Third New International Dictionary contains about 470,000 entries.

Note: Dictionaries are curated and conservative; they leave out millions of inflections, compounds, and slang terms.

2. Word Variations (Multiplies Size by ~4x)

Whole-Word tokenizers treat every variation as a completely unique ID.

  • Root word: Compute (1 word)
  • Tokenizer IDs: Computes, Computed, Computing, Precompute, Recompute... (8+ words)

Suddenly, a 500,000-word dictionary balloons past 2,000,000 entries just to handle basic grammar, as the sketch below shows.
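
To make the blow-up concrete, here is a minimal sketch of the whole-word approach. The class and its methods are hypothetical, invented purely for illustration:

```python
# A minimal sketch of whole-word tokenization: every surface form gets
# its own ID. The class is hypothetical, not a real library.

class WholeWordTokenizer:
    def __init__(self):
        self.vocab = {}  # exact word form -> unique ID

    def add(self, word):
        if word not in self.vocab:
            self.vocab[word] = len(self.vocab)

    def encode(self, word):
        # Anything unseen collapses to -1 (an <UNK> placeholder).
        return self.vocab.get(word, -1)

tok = WholeWordTokenizer()

# One root, many IDs: the vocabulary grows with every inflection.
for form in ["compute", "computes", "computed", "computing",
             "precompute", "recompute", "computation", "computable"]:
    tok.add(form)

print(len(tok.vocab))            # 8 IDs for a single root word
print(tok.encode("computes"))    # 1 -- seen, so it has an ID
print(tok.encode("uncomputed"))  # -1 -- unseen, so it is unreadable
```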

3. Industry Jargon & Science

Moving into professional fields causes the vocabulary to explode:

Medicine & Biology

350,000+ clinical terms (SNOMED CT), 300,000+ named plant species, and millions of chemical compounds.

Technology & IT

API, JSON, Kubernetes, Refactoring, Multithreading, Retweeted, Cryptocurrency.

4. Proper Nouns & Brands

If "Starbucks" isn't in your whole-word dictionary, the AI cannot read or write the word.

  • 150,000+ distinct English surnames.
  • Millions of cities, towns, and street names.
  • Brand names: Google, Pfizer, PlayStation.
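
Here's a hedged sketch of what that failure looks like in practice. The five-word vocabulary and the <UNK> convention are assumptions for illustration:

```python
# The out-of-vocabulary (OOV) problem: a whole-word model maps every
# unknown word to a single <UNK> ID, destroying the information.

vocab = {"<UNK>": 0, "i": 1, "bought": 2, "coffee": 3, "at": 4}
rev = {i: w for w, i in vocab.items()}

def encode(sentence):
    return [vocab.get(w, vocab["<UNK>"]) for w in sentence.lower().split()]

def decode(ids):
    return " ".join(rev[i] for i in ids)

ids = encode("I bought coffee at Starbucks")
print(ids)          # [1, 2, 3, 4, 0]
print(decode(ids))  # 'i bought coffee at <UNK>' -- the brand is gone
```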

5. Slang & Formatting (Infinite)

On the internet, language is liquid:

yeet, rizz, hangry, Helloooo, vlog

The "Lego Brick" Solution (BPE)

A whole-word embedding matrix with 3,000,000 rows would be too massive for a GPU to load efficiently. Byte-Pair Encoding (BPE) solves this by learning the most common sub-word chunks and snapping them together like Lego bricks.
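
The training loop behind BPE is short enough to sketch: count adjacent symbol pairs, merge the most frequent pair, repeat. The four-word corpus and the eight merges below are made-up numbers for illustration; a real tokenizer learns ~50,000 merges from billions of words:

```python
# A self-contained toy of BPE merge learning. Words start as tuples of
# single characters; each round fuses the most frequent adjacent pair.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one fused symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny made-up corpus: word -> frequency.
corpus = {tuple("computed"): 5, tuple("computing"): 4,
          tuple("uncomputed"): 2, tuple("recompute"): 3}

for step in range(8):  # learn 8 merges
    best = max(get_pair_counts(corpus), key=get_pair_counts(corpus).get)
    corpus = merge_pair(corpus, best)
    print(f"merge {step + 1}: {best[0]} + {best[1]}")

print(list(corpus))  # words are now split into reusable chunks
```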

How an LLM sees "Uncomputed"

un + comput + ed

With a vocabulary of just ~50,000 tokens, an LLM can assemble virtually any English word, piece of medical jargon, or typo, while keeping the model's "brain" small and lightning fast.
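
One final sketch of the decomposition itself. Real BPE replays its learned merges in order; the greedy longest-match below is a simplification, and the tiny vocabulary is an assumption, but it shows how unseen words snap together from known bricks:

```python
# Greedy longest-match segmentation over a tiny assumed vocabulary.
# Unseen words decompose into known chunks instead of failing as <UNK>.

VOCAB = {"un", "re", "pre", "comput", "ed", "ing",
         # single characters as a fallback, so nothing is unrepresentable
         *"abcdefghijklmnopqrstuvwxyz"}

def segment(word):
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(segment("uncomputed"))   # ['un', 'comput', 'ed']
print(segment("recomputing"))  # ['re', 'comput', 'ing']
print(segment("yeet"))         # falls back to single characters
```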