Transformer Architecture Blueprints

The variables below are the fundamental architectural blueprint for building a Transformer-based Large Language Model (LLM). The specific numbers closely match the original GPT-2 Small model.

The Dictionary

vocab_size = 50257

What it is: The total number of unique "tokens" (whole words, parts of words, or characters) the model can recognize and generate.

How it works: A tokenizer chops text into tokens. The model has an internal lookup table with exactly 50,257 slots. This specific number is standard for the Byte-Pair Encoding (BPE) tokenizer used by GPT-2.
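
A quick sketch of that lookup in practice, assuming the open-source tiktoken package (which ships the same BPE vocabulary GPT-2 uses):

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")   # load GPT-2's Byte-Pair Encoding vocabulary
    print(enc.n_vocab)                    # 50257 -- one slot per known token

    ids = enc.encode("Transformers read text as tokens.")
    print(ids)                            # a short list of integers, each below 50257
    print(enc.decode(ids))                # decodes back to the original string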

The Concept Vector

embed_dim = 768

What it is: The number of dimensions in the vector that represents each token; in other words, how much "meaning" the model can attach to a single token.

How it works: The model translates a word into a list of 768 numbers. Imagine trying to describe a car using 3 features (color, speed, price) — an LLM uses 768 invisible mathematical features to capture the deep, nuanced meaning of every word.
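
A minimal sketch of that translation, assuming PyTorch: the embedding layer is literally a 50,257 x 768 lookup table, and each token ID pulls out one row of 768 numbers.

    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 50257, 768
    embedding = nn.Embedding(vocab_size, embed_dim)   # one 768-number row per token

    token_ids = torch.tensor([464, 2746, 318])        # three example token IDs
    vectors = embedding(token_ids)                    # look up each ID's row
    print(vectors.shape)                              # torch.Size([3, 768])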

The Reading Squad

num_heads = 12

What it is: The number of "attention heads" in the multi-head attention mechanism.

How it works: Instead of reading a sentence once to find relationships between words, it reads it 12 times in parallel. One "head" might focus on pronouns, another on verbs, and another on emotional tone, combining for a richer understanding.
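
A hedged sketch of the bookkeeping, again assuming PyTorch: 768 dimensions split across 12 heads gives each head 64 dimensions to work with. PyTorch's nn.MultiheadAttention handles the splitting and recombining internally (GPT-2 additionally applies a causal mask, omitted here).

    import torch
    import torch.nn as nn

    embed_dim, num_heads = 768, 12
    print(embed_dim // num_heads)        # 64 dimensions per head

    attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    x = torch.randn(1, 10, embed_dim)    # a batch of 1 sequence with 10 token vectors
    out, weights = attn(x, x, x)         # self-attention: query, key and value are all x
    print(out.shape)                     # torch.Size([1, 10, 768])
    print(weights.shape)                 # torch.Size([1, 10, 10]) -- who attends to whom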

The Processing Power

ff_dim = 3072

What it is: The size of the hidden layer in the Feed-Forward neural network found inside every transformer block.

How it works: After gathering context, the data is passed through this network to "think." It is usually 4 times the embedding dimension (768 × 4 = 3072). It expands data to find complex patterns, then compresses it back to 768.
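
As a sketch (assuming PyTorch), the "thinking" step is just expand, activate, compress:

    import torch
    import torch.nn as nn

    embed_dim, ff_dim = 768, 3072

    feed_forward = nn.Sequential(
        nn.Linear(embed_dim, ff_dim),   # expand: 768 -> 3072
        nn.GELU(),                      # the non-linearity GPT-2 uses
        nn.Linear(ff_dim, embed_dim),   # compress: 3072 -> 768
    )

    x = torch.randn(1, 10, embed_dim)
    print(feed_forward(x).shape)        # torch.Size([1, 10, 768]) -- same shape out as in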

The Assembly Line

num_layers = 12

What it is: The total number of Transformer blocks stacked on top of each other.

How it works: As data passes through each layer, understanding becomes more abstract. Early layers figure out basic grammar; middle layers handle structure; the 12th layer extracts deep semantic logic. Deeper models are smarter but slower.
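
A hedged sketch of the stacking, assuming PyTorch. The TransformerBlock class below is an illustrative simplification that reuses the attention and feed-forward pieces shown above, not GPT-2's exact layout.

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        """One block: self-attention plus feed-forward, each wrapped in a residual connection."""
        def __init__(self, embed_dim=768, num_heads=12, ff_dim=3072):
            super().__init__()
            self.norm1 = nn.LayerNorm(embed_dim)
            self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(embed_dim)
            self.ff = nn.Sequential(
                nn.Linear(embed_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, embed_dim)
            )

        def forward(self, x):
            h = self.norm1(x)
            x = x + self.attn(h, h, h)[0]    # residual connection around attention
            x = x + self.ff(self.norm2(x))   # residual connection around feed-forward
            return x

    # The assembly line: 12 identical blocks applied one after another.
    blocks = nn.Sequential(*[TransformerBlock() for _ in range(12)])
    x = torch.randn(1, 10, 768)              # 10 tokens, already embedded
    print(blocks(x).shape)                   # torch.Size([1, 10, 768])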

The Context Window

seq_len = 512

What it is: The maximum number of tokens the model can process at one single time (its short-term memory).

How it works: If you give this model a 600-token prompt, it must cut off the first 88 tokens. 512 was the context length of GPT-1 (GPT-2 itself used 1,024); modern models use massive sequence lengths (e.g., 8k, 128k, or even 1M+) to process whole books.
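
A minimal sketch of that truncation. Keeping the most recent tokens is a common strategy, though the exact policy is an application choice:

    seq_len = 512

    prompt_ids = list(range(600))            # pretend these are 600 token IDs
    if len(prompt_ids) > seq_len:
        prompt_ids = prompt_ids[-seq_len:]   # drop the oldest 88 tokens, keep the newest 512

    print(len(prompt_ids))                   # 512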

How Are "Parameters" Calculated?

When you hear a model has "5 Billion Parameters," it means there are 5 billion individual mathematical weights (floating-point numbers) the model uses to make predictions. You can count them by looking at the weight matrices in the network: the token and position embeddings, the Q, K, V, and output projection matrices inside every attention layer, and the two feed-forward matrices inside every block.

To get the total, you multiply each matrix's dimensions to get its element count, then sum those counts across all the matrices. The specific hyperparameters we used above result in roughly 117-124 million total parameters (GPT-2 Small is most often quoted as 117 million; summing every matrix with these hyperparameters comes out closer to 124 million).
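
A back-of-envelope sketch of that arithmetic in Python. Biases and layer-norm weights are ignored here, and the output projection is assumed to share weights with the token embedding, as GPT-2 does:

    vocab_size, embed_dim, ff_dim, num_layers, seq_len = 50257, 768, 3072, 12, 512

    token_embedding    = vocab_size * embed_dim      # 50257 x 768 lookup table
    position_embedding = seq_len * embed_dim         # one 768-vector per position

    per_block = (
        4 * embed_dim * embed_dim    # attention: Q, K, V and output projection matrices
        + 2 * embed_dim * ff_dim     # feed-forward: expand (768x3072) and compress (3072x768)
    )

    total = token_embedding + position_embedding + num_layers * per_block
    print(f"{total:,}")              # 123,925,248 -- roughly 124 million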

Popular LLMs & Parameter Counts

Model Name              Release Year    Parameter Count           Scale
GPT-2 (Small)           2019            117 Million               Tiny
Llama 3 (8B)            2024            8 Billion                 Small / Local
Llama 3 (70B)           2024            70 Billion                Medium
GPT-3                   2020            175 Billion               Large
GPT-4 / Claude 3 Opus   2023 / 2024     ~1.5 Trillion+ (Est.)     Massive

Hardware Specs: Why You Can't Train on a CPU

LLM training boils down to an astronomical number of matrix multiplications, most of which can be performed in parallel.

Training on a CPU

  • Architecture: CPUs have a small number of very powerful cores (e.g., 8 to 64 cores).
  • Design: Built for fast, sequential logic (running your operating system, databases).
  • The Result: Far too slow. A training run that finishes in a few days on a GPU could literally take years on a CPU, wasting enormous amounts of electricity and time.

Training on a GPU

  • Architecture: GPUs have thousands of smaller, simpler cores (e.g., an NVIDIA H100 has roughly 14,000 to 17,000 CUDA cores, depending on the variant).
  • Design: Built specifically to do millions of math operations in parallel at the exact same time.
  • VRAM Specs: You need enough Video RAM (VRAM) to hold the weights, gradients, and optimizer state. Even memory-efficient fine-tuning of a 7B parameter model typically calls for around 24GB of VRAM; training it in full requires far more, which is why large models are trained on clusters of 80GB GPUs (see the rough estimate below).
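
As a rough, hedged estimate of why the numbers climb so fast: during full training, each parameter typically needs the weight itself, a gradient, an fp32 master copy, and two Adam optimizer moments (roughly 16 bytes per parameter in a common mixed-precision setup), before counting any activations.

    params = 7e9                        # a 7-billion-parameter model

    weights     = params * 2            # fp16/bf16 weights: 2 bytes each
    gradients   = params * 2            # fp16/bf16 gradients: 2 bytes each
    fp32_master = params * 4            # fp32 master copy of the weights
    adam_states = params * 8            # Adam's two fp32 moment estimates

    total_gb = (weights + gradients + fp32_master + adam_states) / 1e9
    print(f"~{total_gb:.0f} GB")        # ~112 GB of VRAM before activations -- hence clusters of 80GB GPUs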