Transformer Architecture Blueprints

The variables below are the fundamental architectural blueprint for building a Transformer-based Large Language Model (LLM). The specific numbers closely match the original GPT-2 Small model.

The Dictionary

vocab_size = 50257

What it is: The total number of unique "tokens" (whole words, parts of words, or characters) the model can recognize and generate.

How it works: A tokenizer chops text into tokens. The model has an internal lookup table with exactly 50,257 slots. This specific number is standard for the Byte-Pair Encoding (BPE) tokenizer used by GPT-2.
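
A quick sketch of that lookup in practice, assuming the open-source tiktoken package (which ships the same BPE vocabulary GPT-2 uses):

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")   # load GPT-2's Byte-Pair Encoding vocabulary
    print(enc.n_vocab)                    # 50257 -- one slot per known token

    ids = enc.encode("Transformers read text as tokens.")
    print(ids)                            # a short list of integers, each below 50257
    print(enc.decode(ids))                # decodes back to the original string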

The Concept Vector

embed_dim = 768

What it is: The number of dimensions in the vector that represents each token; in other words, how much "meaning" the model can attach to a single token.

How it works: The model translates a word into a list of 768 numbers. Imagine trying to describe a car using 3 features (color, speed, price) — an LLM uses 768 invisible mathematical features to capture the deep, nuanced meaning of every word.
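
A minimal sketch of that translation, assuming PyTorch: the embedding layer is literally a 50,257 x 768 lookup table, and each token ID pulls out one row of 768 numbers.

    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 50257, 768
    embedding = nn.Embedding(vocab_size, embed_dim)   # one 768-number row per token

    token_ids = torch.tensor([464, 2746, 318])        # three example token IDs
    vectors = embedding(token_ids)                    # look up each ID's row
    print(vectors.shape)                              # torch.Size([3, 768])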

The Reading Squad

num_heads = 12

What it is: The number of "attention heads" in the multi-head attention mechanism.

How it works: Instead of reading a sentence once to find relationships between words, it reads it 12 times in parallel. One "head" might focus on pronouns, another on verbs, and another on emotional tone, combining for a richer understanding.
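
A hedged sketch of the bookkeeping, again assuming PyTorch: 768 dimensions split across 12 heads gives each head 64 dimensions to work with. PyTorch's nn.MultiheadAttention handles the splitting and recombining internally (GPT-2 additionally applies a causal mask, omitted here).

    import torch
    import torch.nn as nn

    embed_dim, num_heads = 768, 12
    print(embed_dim // num_heads)        # 64 dimensions per head

    attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    x = torch.randn(1, 10, embed_dim)    # a batch of 1 sequence with 10 token vectors
    out, weights = attn(x, x, x)         # self-attention: query, key and value are all x
    print(out.shape)                     # torch.Size([1, 10, 768])
    print(weights.shape)                 # torch.Size([1, 10, 10]) -- who attends to whom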

The Processing Power

ff_dim = 3072

What it is: The size of the hidden layer in the Feed-Forward neural network found inside every transformer block.

How it works: After gathering context, the data is passed through this network to "think." It is usually 4 times the embedding dimension (768 × 4 = 3072). It expands data to find complex patterns, then compresses it back to 768.
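
As a sketch (assuming PyTorch), the "thinking" step is just expand, activate, compress:

    import torch
    import torch.nn as nn

    embed_dim, ff_dim = 768, 3072

    feed_forward = nn.Sequential(
        nn.Linear(embed_dim, ff_dim),   # expand: 768 -> 3072
        nn.GELU(),                      # the non-linearity GPT-2 uses
        nn.Linear(ff_dim, embed_dim),   # compress: 3072 -> 768
    )

    x = torch.randn(1, 10, embed_dim)
    print(feed_forward(x).shape)        # torch.Size([1, 10, 768]) -- same shape out as in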

The Assembly Line

num_layers = 12

What it is: The total number of Transformer blocks stacked on top of each other.

How it works: As data passes through each layer, understanding becomes more abstract. Early layers figure out basic grammar; middle layers handle structure; the 12th layer extracts deep semantic logic. Deeper models are smarter but slower.
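
A hedged sketch of the stacking, assuming PyTorch. The TransformerBlock class below is an illustrative simplification that reuses the attention and feed-forward pieces shown above, not GPT-2's exact layout.

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        """One block: self-attention plus feed-forward, each wrapped in a residual connection."""
        def __init__(self, embed_dim=768, num_heads=12, ff_dim=3072):
            super().__init__()
            self.norm1 = nn.LayerNorm(embed_dim)
            self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(embed_dim)
            self.ff = nn.Sequential(
                nn.Linear(embed_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, embed_dim)
            )

        def forward(self, x):
            h = self.norm1(x)
            x = x + self.attn(h, h, h)[0]    # residual connection around attention
            x = x + self.ff(self.norm2(x))   # residual connection around feed-forward
            return x

    # The assembly line: 12 identical blocks applied one after another.
    blocks = nn.Sequential(*[TransformerBlock() for _ in range(12)])
    x = torch.randn(1, 10, 768)              # 10 tokens, already embedded
    print(blocks(x).shape)                   # torch.Size([1, 10, 768])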

The Context Window

seq_len = 512

What it is: The maximum number of tokens the model can process at one single time (its short-term memory).

How it works: If you give this model a 600-token prompt, it must cut off the first 88 tokens. 512 was the context length of GPT-1 (GPT-2 itself used 1,024); modern models use massive sequence lengths (e.g., 8k, 128k, or even 1M+) to process whole books.
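
A minimal sketch of that truncation. Keeping the most recent tokens is a common strategy, though the exact policy is an application choice:

    seq_len = 512

    prompt_ids = list(range(600))            # pretend these are 600 token IDs
    if len(prompt_ids) > seq_len:
        prompt_ids = prompt_ids[-seq_len:]   # drop the oldest 88 tokens, keep the newest 512

    print(len(prompt_ids))                   # 512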

How Are "Parameters" Calculated?

When you hear a model has "5 Billion Parameters," it means there are 5 billion individual mathematical weights (floating-point numbers) the model uses to make predictions. You can count them by looking at the weight matrices in the network: the token and position embeddings, the Q, K, V, and output projection matrices inside every attention layer, and the two feed-forward matrices inside every block.

To get the total, you multiply each matrix's dimensions to get its element count, then sum those counts across all the matrices. The specific hyperparameters we used above result in roughly 117-124 million total parameters (GPT-2 Small is most often quoted as 117 million; summing every matrix with these hyperparameters comes out closer to 124 million).
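
A back-of-envelope sketch of that arithmetic in Python. Biases and layer-norm weights are ignored here, and the output projection is assumed to share weights with the token embedding, as GPT-2 does:

    vocab_size, embed_dim, ff_dim, num_layers, seq_len = 50257, 768, 3072, 12, 512

    token_embedding    = vocab_size * embed_dim      # 50257 x 768 lookup table
    position_embedding = seq_len * embed_dim         # one 768-vector per position

    per_block = (
        4 * embed_dim * embed_dim    # attention: Q, K, V and output projection matrices
        + 2 * embed_dim * ff_dim     # feed-forward: expand (768x3072) and compress (3072x768)
    )

    total = token_embedding + position_embedding + num_layers * per_block
    print(f"{total:,}")              # 123,925,248 -- roughly 124 million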

Popular LLMs & Parameter Counts

Model Name              Release Year    Parameter Count           Scale
GPT-2 (Small)           2019            117 Million               Tiny
Llama 3 (8B)            2024            8 Billion                 Small / Local
Llama 3 (70B)           2024            70 Billion                Medium
GPT-3                   2020            175 Billion               Large
GPT-4 / Claude 3 Opus   2023 / 2024     ~1.5 Trillion+ (Est.)     Massive

Hardware Specs: Why You Can't Train on a CPU

LLM training boils down to an astronomical number of matrix multiplications, most of which can be performed in parallel.

Training on a CPU

  • Architecture: CPUs have a small number of very powerful cores (e.g., 8 to 64 cores).
  • Design: Built for fast, sequential logic (running your operating system, databases).
  • The Result: Far too slow. A training run that finishes in a few days on a GPU could literally take years on a CPU, wasting enormous amounts of electricity and time.

Training on a GPU

  • Architecture: GPUs have thousands of smaller, simpler cores (e.g., an NVIDIA H100 has roughly 14,000 to 17,000 CUDA cores, depending on the variant).
  • Design: Built specifically to do millions of math operations in parallel at the exact same time.
  • VRAM Specs: You need enough Video RAM (VRAM) to hold the weights, gradients, and optimizer state. Even memory-efficient fine-tuning of a 7B parameter model typically calls for around 24GB of VRAM; training it in full requires far more, which is why large models are trained on clusters of 80GB GPUs (see the rough estimate below).
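
As a rough, hedged estimate of why the numbers climb so fast: during full training, each parameter typically needs the weight itself, a gradient, an fp32 master copy, and two Adam optimizer moments (roughly 16 bytes per parameter in a common mixed-precision setup), before counting any activations.

    params = 7e9                        # a 7-billion-parameter model

    weights     = params * 2            # fp16/bf16 weights: 2 bytes each
    gradients   = params * 2            # fp16/bf16 gradients: 2 bytes each
    fp32_master = params * 4            # fp32 master copy of the weights
    adam_states = params * 8            # Adam's two fp32 moment estimates

    total_gb = (weights + gradients + fp32_master + adam_states) / 1e9
    print(f"~{total_gb:.0f} GB")        # ~112 GB of VRAM before activations -- hence clusters of 80GB GPUs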