How LLMs Predict the Next Word

A deep dive into contextual probability and statistical modeling.

The Training Set

The Provided Stories

Story 1 "Focusing on simplicity helps clear the mind and reduces unnecessary stress. When you use fewer words, your message becomes sharper and easier for others to understand. This approach saves time for everyone involved and ensures your main point isn't lost in a crowd of details."
Story 2 "Consistency is the secret to making this habit stick. Try to review your writing and remove anything that doesn't add direct value to the reader. By keeping things brief and clear, you communicate with more impact and style."
Part 1: Small-Scale Prediction

When you ask a model to predict the next word for the prompt "By keeping things brief and" based strictly on the stories above, it follows a deterministic path, sketched in code at the end of this part.

Step 1: Tokenization

The prompt is broken into: [By][keeping][things][brief][and].
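
A minimal sketch of this step, assuming naive whitespace splitting (real LLMs use subword tokenizers such as BPE, so actual token boundaries would differ):

```python
# Naive whitespace tokenizer -- a stand-in for a real subword tokenizer.
def tokenize(text: str) -> list[str]:
    return text.split()

print(tokenize("By keeping things brief and"))
# ['By', 'keeping', 'things', 'brief', 'and']
```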

Step 2: Exact-Match Search

The model scans the dataset (Stories 1 & 2) for this specific token sequence.

Step 3: Calculation

In Story 2, "clear" is the only word that ever follows this sequence: one match out of one occurrence of the context.

Step 4: The Result

Next Word: "clear"

Probabilistic Representation

$$P(\text{"clear"} | \text{"By keeping things brief and"}) = 1.0$$
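
Steps 1 through 4 fit in a few lines of Python. The sketch below is illustrative only: it swaps in a punctuation-stripping tokenizer (so "clear," in Story 2 matches "clear") and does a literal scan over the two stories, which no real LLM does at inference time:

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Strip punctuation so "clear," matches "clear"; real tokenizers differ.
    return re.findall(r"[A-Za-z']+", text)

corpus = (
    "Focusing on simplicity helps clear the mind and reduces unnecessary stress. "
    "When you use fewer words, your message becomes sharper and easier for others "
    "to understand. This approach saves time for everyone involved and ensures your "
    "main point isn't lost in a crowd of details. "
    "Consistency is the secret to making this habit stick. Try to review your "
    "writing and remove anything that doesn't add direct value to the reader. "
    "By keeping things brief and clear, you communicate with more impact and style."
)

context = tokenize("By keeping things brief and")
tokens = tokenize(corpus)
n = len(context)

# Steps 2 + 3: count every word that follows an exact occurrence of the context.
followers = Counter(
    tokens[i + n] for i in range(len(tokens) - n) if tokens[i : i + n] == context
)
print(followers)  # Counter({'clear': 1})

# Step 4: P(word | context) = count(context + word) / count(context)
total = sum(followers.values())
print({word: count / total for word, count in followers.items()})  # {'clear': 1.0}
```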

Part 2: Scaling to 100GB (The "Real" LLM)

In a dataset of 70 billion words, a phrase like "brief and" appears thousands of times. The model no longer finds a single "right" answer; it estimates a probability distribution over possible continuations.

The Probability Distribution

| Candidate Word | Observed Frequency | Probability (P) |
|----------------|--------------------|-----------------|
| clear          | 4,500              | 0.45            |
| concise        | 3,000              | 0.30            |
| to the point   | 1,500              | 0.15            |
| simple         | 1,000              | 0.10            |
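
The probability column is simply each observed count divided by the total (10,000 observations here). A quick check with the table's numbers:

```python
# Counts taken from the table above; "to the point" would really be several
# tokens, but it is treated as one candidate here for simplicity.
counts = {"clear": 4500, "concise": 3000, "to the point": 1500, "simple": 1000}
total = sum(counts.values())  # 10,000
print({word: count / total for word, count in counts.items()})
# {'clear': 0.45, 'concise': 0.3, 'to the point': 0.15, 'simple': 0.1}
```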

Neural Weights vs. Literal Search

LLMs don't store these stories literally; they encode statistical patterns of meaning in their weights. If you prompt a model with Story 2's context, the attention mechanism boosts the influence of those recent context tokens, raising the internal score for "clear" because the model recognizes the pattern from the recent conversation history.
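
For intuition, here is a minimal sketch of scaled dot-product attention, the standard formulation of the mechanism described above; the query, key, and value matrices are toy random values, not weights from any real model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each query mixes the value vectors,
    weighted by how strongly it matches each key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 4))  # query for the position being predicted
K = rng.normal(size=(3, 4))  # keys for three recent context tokens
V = rng.normal(size=(3, 4))  # values carrying those tokens' content

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights)  # one row of attention weights over the context; sums to 1.0
```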

Softmax Selection

The final choice is made by converting the model's raw internal scores (logits) into a probability distribution in which all candidates sum to 1.0.

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$$
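
A sketch of that conversion in NumPy. As an illustration, if the logits happened to equal the log of the observed counts from the Part 2 table, softmax would reproduce exactly the probabilities shown there:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the output always sums to 1.0.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical logits: log-counts from the Part 2 table.
logits = np.log([4500, 3000, 1500, 1000])
print(softmax(logits))  # [0.45 0.3  0.15 0.1 ]
```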