Why we don't use Whole-Word tokenization
To understand exactly why we don't use Whole-Word tokenization, it helps to look at the sheer scale of the English language. If you wanted to build a dictionary of "all possible words" covering general English plus every major industry, your vocabulary would easily exceed 2,000,000 distinct words.
Note: Dictionaries are conservative; they leave out millions of inflected variations and slang terms.
Whole-Word tokenizers treat every variation as a completely unique ID.
| Root Word | Tokenizer IDs |
| --- | --- |
| Compute (1 word) | Computes, Computed, Computing, Precompute, Recompute... (8+ words) |

Suddenly, 500k words becomes 2,000,000+ words just to handle basic grammar.
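To make the problem concrete, here is a minimal sketch of a whole-word tokenizer in Python (the toy vocabulary is hypothetical): every surface form needs its own entry, and anything the dictionary has never seen collapses into an unknown token.

```python
# Minimal sketch of a whole-word tokenizer (toy vocabulary is hypothetical).
# Every surface form needs its own ID; anything missing falls back to <UNK>.
WHOLE_WORD_VOCAB = {
    "<UNK>": 0,
    "compute": 1,
    "computes": 2,
    "computed": 3,
    "computing": 4,
    # "precompute", "recompute", "uncomputed", ... would all need their own IDs
}

def tokenize(text: str) -> list[int]:
    """Map each whitespace-separated word to an ID, or <UNK> if unknown."""
    return [WHOLE_WORD_VOCAB.get(word.lower(), 0) for word in text.split()]

print(tokenize("compute computed computing"))  # [1, 3, 4] -- three IDs for one root
print(tokenize("precompute Starbucks"))        # [0, 0]    -- both unknown, information lost
```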
Moving into professional fields causes the vocabulary to explode:
- Medicine & Biology: 350,000+ medical terms (SNOMED-CT), 300,000+ plant species, and millions of chemical names.
- Technology & IT: API, JSON, Kubernetes, Refactoring, Multithreading, Retweeted, Cryptocurrency.
If "Starbucks" isn't in your whole-word dictionary, the AI cannot read or write the word.
On the internet, language is liquid:
A 3,000,000-word vocabulary matrix would be enormous, far too large to load and train efficiently on a GPU. Byte-Pair Encoding (BPE) solves this by building a small vocabulary of common sub-word chunks instead.
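As a rough back-of-the-envelope check (the embedding width of 4,096 and fp16 storage are assumptions for illustration, not figures from any specific model), here is how the two vocabulary sizes compare:

```python
# Back-of-the-envelope embedding-matrix sizes (embedding width and fp16 are assumptions).
EMBED_DIM = 4096          # hypothetical embedding width
BYTES_PER_PARAM = 2       # fp16

def embedding_gib(vocab_size: int) -> float:
    """Size of a vocab_size x EMBED_DIM embedding matrix in GiB."""
    return vocab_size * EMBED_DIM * BYTES_PER_PARAM / 1024**3

print(f"3,000,000-word vocabulary: {embedding_gib(3_000_000):.1f} GiB")  # ~22.9 GiB
print(f"50,000-token vocabulary:   {embedding_gib(50_000):.2f} GiB")     # ~0.38 GiB
```

And that is only the input embedding table; a separate output projection over the same vocabulary would roughly double the cost.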
How an LLM sees "Uncomputed"
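A sketch of what that looks like in practice, using the open-source tiktoken library (assumed to be installed; the exact sub-word boundaries depend on which tokenizer and vocabulary a given model uses):

```python
# Sketch: splitting "Uncomputed" into sub-word tokens with a real BPE vocabulary.
# Requires the `tiktoken` package; exact splits vary between tokenizers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Uncomputed")
pieces = [enc.decode([i]) for i in ids]

print(ids)     # a few integer token IDs, not one ID for the whole word
print(pieces)  # familiar chunks along the lines of "Un" + "comput" + "ed"
```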
By using just ~50,000 tokens, an LLM can compose effectively unlimited English words, medical jargon, and even typos, while keeping the AI's brain small and lightning fast.