These variables are the fundamental architectural blueprint for building a Transformer-based Large Language Model (LLM). The specific numbers below follow the original GPT-2 Small model.
What it is: The total number of unique "tokens" (whole words, word pieces, or individual characters) the model can recognize and generate; this is the vocab_size.
How it works: A tokenizer chops text into tokens. The model has an internal lookup table with exactly 50,257 slots. This specific number is standard for the Byte-Pair Encoding (BPE) tokenizer used by GPT-2.
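To make the lookup concrete, here is a toy sketch; the token strings and IDs are invented for illustration, and a real GPT-2 tokenizer has 50,257 entries, not 5:

```python
# Toy lookup table for a tokenizer. The token strings and IDs are invented;
# the real GPT-2 BPE table has exactly 50,257 entries.
vocab = {"Hello": 0, ",": 1, " world": 2, " transform": 3, "ers": 4}

def encode(pieces):
    """Map already-split text pieces to their integer token IDs."""
    return [vocab[p] for p in pieces]

print(encode(["Hello", ",", " world"]))  # [0, 1, 2]
```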
What it is: The width of each token's vector, the embed_dim; it defines how much "meaning" the model can attach to a single token.
How it works: The model translates each token into a list of 768 numbers. Imagine trying to describe a car using 3 features (color, speed, price); an LLM uses 768 learned numerical features to capture the deep, nuanced meaning of every token.
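The lookup itself is just indexing a big table of numbers. In this sketch the values are random placeholders for learned weights, and only 5 rows are built instead of the full 50,257:

```python
import random

embed_dim = 768  # GPT-2 Small's embedding width
random.seed(0)

# Toy embedding table: one 768-number vector per token ID. Real models
# learn these values during training; here they are random placeholders.
embedding_table = [[random.gauss(0.0, 0.02) for _ in range(embed_dim)]
                   for _ in range(5)]

vector = embedding_table[2]  # look up token ID 2
print(len(vector))  # 768 "features" describing this one token
```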
What it is: The number of parallel "attention heads" (12 here) in the multi-head attention mechanism.
How it works: Instead of reading a sentence once to find relationships between words, the model reads it 12 times in parallel. One "head" might focus on pronouns, another on verbs, and another on emotional tone; their outputs are combined for a richer understanding.
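Each head gets an equal slice of the embedding: 768 ÷ 12 = 64 dimensions per head. A minimal sketch of that split (just the slicing, not the attention math itself):

```python
embed_dim, num_heads = 768, 12
head_dim = embed_dim // num_heads  # 64 dimensions per head

# One token's 768-number vector, split into 12 per-head slices of 64.
token_vector = list(range(embed_dim))  # dummy values 0..767
head_slices = [token_vector[h * head_dim:(h + 1) * head_dim]
               for h in range(num_heads)]
print(len(head_slices), len(head_slices[0]))  # 12 64
```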
What it is: The size of the hidden layer (the ff_dim) in the Feed-Forward neural network found inside every transformer block.
How it works: After gathering context, the data is passed through this network to "think." It is usually 4 times the embedding dimension (768 × 4 = 3072). It expands data to find complex patterns, then compresses it back to 768.
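A minimal sketch of the expand-then-compress pattern, using tiny stand-in dimensions (4 and 16 instead of 768 and 3072) and constant dummy weights so it runs instantly:

```python
import math

def gelu(x):
    # tanh approximation of GELU, the activation GPT-2 uses in this network
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W_up, W_down):
    """Expand x through W_up, apply GELU, compress back through W_down."""
    hidden = [gelu(sum(xi * w for xi, w in zip(x, row))) for row in W_up]
    return [sum(hi * w for hi, w in zip(hidden, row)) for row in W_down]

embed_dim, ff_dim = 4, 16  # tiny stand-ins for 768 and 3072
W_up = [[0.01] * embed_dim for _ in range(ff_dim)]    # expand: 4 -> 16
W_down = [[0.01] * ff_dim for _ in range(embed_dim)]  # compress: 16 -> 4
y = feed_forward([1.0, 2.0, 3.0, 4.0], W_up, W_down)
print(len(y))  # 4: back to the embedding dimension
```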
What it is: The total number of Transformer blocks stacked on top of each other.
How it works: As data passes through each block, the representation becomes more abstract: early layers pick up basic grammar, middle layers handle sentence structure, and the final layers extract deeper semantic relationships. Deeper models are generally more capable, but slower to run.
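The stacking itself is just a loop: the output of one block is the input to the next. Here transformer_block is a placeholder for the attention and feed-forward math described above:

```python
num_layers = 12

def transformer_block(x):
    # Placeholder: a real block applies multi-head attention and the
    # feed-forward network, reading and writing the same 768 numbers.
    return x

x = [0.0] * 768              # one token's embedding entering the stack
for _ in range(num_layers):  # each block's output feeds the next block
    x = transformer_block(x)
print(len(x))  # 768: the shape is preserved through all 12 blocks
```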
What it is: The maximum number of tokens (512 here) the model can process at one time: its short-term memory, or context window.
How it works: If you give this model a 600-token prompt, it must drop the first 88 tokens. While 512 was GPT-1's context window (GPT-2 itself used 1,024), modern models use massive sequence lengths (e.g., 8K, 128K, or even 1M+) to process whole books.
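That truncation can be sketched as simple list slicing, keeping only the most recent tokens:

```python
context_length = 512
prompt = list(range(600))  # 600 dummy token IDs: 0, 1, ..., 599

# Keep only the most recent 512 tokens; the first 600 - 512 = 88 are dropped.
visible = prompt[-context_length:]
print(len(visible), visible[0])  # 512 88
```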
When you hear a model has "5 Billion Parameters," it means there are 5 billion individual mathematical weights (floating-point numbers) the model uses to make predictions. You can count them by looking at the matrices in the network:
- Embedding matrix: vocab_size × embed_dim. For GPT-2 Small, this alone is ~38 million parameters (50,257 × 768 ≈ 38.6M).
- Feed-forward matrices: the weights that expand each layer's data to the ff_dim (3072) and back down to the embed_dim (768) hold the vast majority of the "knowledge" parameters.
- Total: multiply the two dimensions of each matrix to get its element count, then sum across every matrix in the network. The specific hyperparameters above come out to roughly 117 million total parameters.
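A back-of-the-envelope count under common assumptions (Q/K/V and output projections with biases, two layer norms per block, learned positional embeddings for 512 positions, and a final layer norm). Note that a careful count of this architecture lands near 124M; the widely quoted 117M figure comes from the original GPT-2 release and undercounts slightly:

```python
vocab_size, embed_dim, ff_dim = 50257, 768, 3072
num_layers, context_length = 12, 512

token_emb = vocab_size * embed_dim    # 38,597,376: the ~38M embedding matrix
pos_emb = context_length * embed_dim  # learned positional embeddings

per_block = (
    embed_dim * 3 * embed_dim + 3 * embed_dim  # Q, K, V projections (+ biases)
    + embed_dim * embed_dim + embed_dim        # attention output projection
    + embed_dim * ff_dim + ff_dim              # feed-forward expansion
    + ff_dim * embed_dim + embed_dim           # feed-forward compression
    + 2 * (2 * embed_dim)                      # two layer norms (scale + shift)
)

total = token_emb + pos_emb + num_layers * per_block + 2 * embed_dim  # + final norm
print(f"{total:,}")  # 124,046,592
```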
| Model Name | Release Year | Parameter Count | Scale |
|---|---|---|---|
| GPT-2 (Small) | 2019 | 117 Million | Tiny |
| Llama 3 (8B) | 2024 | 8 Billion | Small / Local |
| Llama 3 (70B) | 2024 | 70 Billion | Medium |
| GPT-3 | 2020 | 175 Billion | Large |
| GPT-4 / Claude 3 Opus | 2023 / 2024 | ~1.5 Trillion+ (Est.) | Massive |
LLM training requires performing trillions upon trillions of floating-point operations, the vast majority of them inside massively parallel matrix multiplications.
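For a sense of scale, a common rule of thumb estimates training compute as roughly 6 × N × D, where N is the parameter count and D the number of training tokens. The token count below is a made-up example, not GPT-2's actual training budget:

```python
# Rough training-compute estimate via the ~6 * N * D rule of thumb
# (N = parameters, D = training tokens). Both numbers are illustrative.
params = 117e6   # GPT-2 Small scale
tokens = 10e9    # a hypothetical 10-billion-token training run
flops = 6 * params * tokens
print(f"{flops:.2e}")  # 7.02e+18 floating-point operations
```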