If you treat every word as a single token (hello = 144), you run into three massive problems:
Imagine you train your dictionary on a million books. You assign an ID to every word.
Then, a user types: "I am coding in C++ and my brain is completely frieddd."
Is "C++" in your dictionary?
Is the typo "frieddd" in your dictionary?
If they aren't, the tokenizer has no ID to give them and has to fall back to an <UNK> (unknown) token. The model literally cannot read or write words it has never seen before.
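Here is a toy sketch of that failure mode; the vocabulary and IDs below are invented purely for illustration:

```python
# Toy word-level tokenizer: any word missing from the training-time
# vocabulary collapses into the single <UNK> ID, so its meaning is lost.
vocab = {"<UNK>": 0, "i": 1, "am": 2, "coding": 3, "in": 4,
         "and": 5, "my": 6, "brain": 7, "is": 8, "completely": 9}

def encode(text: str) -> list[int]:
    return [vocab.get(word, vocab["<UNK>"]) for word in text.lower().split()]

print(encode("I am coding in C++ and my brain is completely frieddd"))
# [1, 2, 3, 4, 0, 5, 6, 7, 8, 9, 0] -- both "C++" and "frieddd" become 0 (<UNK>)
```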
The second problem is redundancy. In English, "run", "runs", "running", and "ran" are fundamentally the same concept.
In Word-Level tokenization, they are four completely separate IDs, and the model has to learn what each of them means independently, from scratch.
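As a tiny sketch of what "independently" means in practice: each form gets its own row in the embedding table, and no parameters are shared between those rows (the IDs and the 8-dimensional toy embeddings below are made up for illustration):

```python
import numpy as np

# Word-level vocabulary: four unrelated IDs for four forms of the same verb.
word_vocab = {"run": 0, "runs": 1, "running": 2, "ran": 3}

# Each ID gets its own independently initialised embedding row;
# nothing links "run" to "running" except whatever training happens to teach.
embeddings = np.random.randn(len(word_vocab), 8)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Before training, "run" and "running" are no more similar to each other
# than to any random word in the vocabulary.
print(cosine(embeddings[word_vocab["run"]], embeddings[word_vocab["running"]]))
```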
The third problem is sheer size. To capture all words, plurals, names, and punctuation, your vocabulary would need somewhere between 1,000,000 and 5,000,000 entries.
In a Large Language Model, the final layer (the classifier that guesses the next word) is a weight matrix of shape [Hidden_Dimension × Vocab_Size].
If your Vocab_Size is 2,000,000, that single matrix becomes astronomically huge.
It will consume all your GPU VRAM just to store the vocabulary, leaving no room for the actual "brain" of the AI.
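A quick back-of-the-envelope makes the point; the 4,096 hidden dimension and 16-bit weights below are illustrative assumptions, not figures from any particular model:

```python
# How big is the output projection alone with a word-level vocabulary?
hidden_dim = 4096          # assumed hidden dimension (illustrative)
vocab_size = 2_000_000     # the word-level vocabulary from above
bytes_per_param = 2        # fp16 / bf16 weights

params = hidden_dim * vocab_size
vram_gb = params * bytes_per_param / 1024**3

print(f"{params:,} parameters")      # 8,192,000,000 parameters
print(f"{vram_gb:.1f} GB of VRAM")   # ~15.3 GB for this single matrix
```

And that is before counting the input embedding table, which (unless it is tied to the output layer) is another matrix of exactly the same shape.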