GPT-2 BPE Regex Pattern Explanation

The Full Pattern

<\|endoftext\|>|<\|pad\|>|'(?:s|t|re|ve|m|ll|d)| ?\w+| ?\d+| ?[^\s\w]+|\s+(?!\S)|\s+

This is a SINGLE alternation (uses | to try each pattern in order, left to right). The regex engine tries each alternative until one matches, then moves on.
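
Here is a minimal sketch of running the pattern with Python's built-in re module. The name GPT2_SPLIT is just an illustrative label; the pattern as written only uses \w and \d, so plain re is enough (the original GPT-2 code relies on the third-party regex module because its pattern uses \p{L} and \p{N}).

import re

# The full alternation, exactly as shown above
GPT2_SPLIT = (
    r"<\|endoftext\|>|<\|pad\|>|'(?:s|t|re|ve|m|ll|d)"
    r"| ?\w+| ?\d+| ?[^\s\w]+|\s+(?!\S)|\s+"
)

print(re.findall(GPT2_SPLIT, "I don't like 42 cats!!! \n"))
# ['I', ' don', "'t", ' like', ' 42', ' cats', '!!!', ' \n']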


Pattern Breakdown - Left to Right Priority

1. <\|endoftext\|>

Matches: The literal string <|endoftext|>

Input:  "Hello <|endoftext|> World"
Match:  "Hello [<|endoftext|>] World"

Why the escapes? The pipe | is the regex alternation metacharacter, so the literal pipes in <|endoftext|> must be written as \| to be matched as plain characters.

Priority: HIGHEST (checked first)
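
A quick way to produce the escaped form, assuming Python 3.7+ where re.escape only escapes actual regex metacharacters:

import re

print(re.escape("<|endoftext|>"))
# <\|endoftext\|>   (only the pipes need escaping; < and > are ordinary characters)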


2. <\|pad\|>

Matches: The literal string <|pad|>

Input:  "text <|pad|> more"
Match:  "text [<|pad|>] more"

Purpose: Special padding token, treated atomically (never split)


3. '(?:s|t|re|ve|m|ll|d)

Matches: Apostrophe followed by common English contractions

Breaking it down:
'           Literal apostrophe
(?:...)     Non-capturing group (just groups, doesn't save)
s|t|re|...  Alternation of specific suffixes

Examples:

Input:  "don't can't we're I've I'm we'll he'd"
Matches: don['t] can['t] we['re] I['ve] I['m] we['ll] he['d]

Why this matters: the contraction is split off from the base word as its own token, e.g. "don't" → ["don", "'t"].

Why these specific contractions?

's  → possessive or "is":     "John's car", "he's"
't  → not:                    "don't", "can't"
're → are:                    "we're", "they're"  
've → have:                   "I've", "we've"
'm  → am:                     "I'm"
'll → will:                   "I'll", "we'll"
'd  → would/had:              "I'd", "he'd"

These cover the most common English contractions found in ordinary text. Note the alternation is case-sensitive and only matches the straight apostrophe ', so "DON'T" or a curly ’ falls through to the other alternatives instead.
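
A quick, illustrative check of just this alternative with Python's re module:

import re

print(re.findall(r"'(?:s|t|re|ve|m|ll|d)", "don't we're I've he'd John's"))
# ["'t", "'re", "'ve", "'d", "'s"]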


4. ?\w+

Matches: Optional space followed by one or more word characters

Breaking it down:
 ?          Zero or one space (optional)
\w+         One or more word characters [a-zA-Z0-9_]

Examples:

Input:  "hello world test"
Matches: [hello][ world][ test]
         ^^^^^  ^^^^^^  ^^^^^

Input:  "no_space"  
Matches: [no_space]

Input:  " leading"
Matches: [ leading]
         ^^^^^^^^

Critical insight:

This is position-sensitive tokenization: the leading space stays attached to the following word, so "cat" and " cat" become different pre-tokens.

"The cat sat" → ["The", " cat", " sat"]

vs naive split:

"The cat sat" → ["The", "cat", "sat"]  ❌ Loses space info!

5. ?\d+

Matches: Optional space followed by one or more digits

Breaking it down:
 ?          Zero or one space (optional)
\d+         One or more digits [0-9]

Examples:

Input:  "test 123 456"
Matches: test[ 123][ 456]
              ^^^^  ^^^^

Input:  "year2024test"
Matches: year[2024]test
             ^^^^

Why separate from \w+?

Numbers get their own pattern because:

  1. Numbers are often long: 123456789 should be one token candidate
  2. Different distribution: Number patterns differ from word patterns
  3. Efficiency: Large numbers shouldn't be split weirdly

Without number-aware pre-chunking, BPE might end up with arbitrary splits:

"price 999" → might become ["pr", "ice", " ", "9", "9", "9"]  ❌

With this:

"price 999" → ["price", " 999"]  ✓

6. ?[^\s\w]+

Matches: Optional space followed by one or more non-space, non-word characters

Breaking it down:
 ?          Zero or one space (optional)
[^\s\w]+    One or more chars that are NOT space AND NOT word chars

Character class breakdown:

[^...]      Negated character class (anything NOT in the set)
\s          Whitespace characters
\w          Word characters [a-zA-Z0-9_]

So [^\s\w] = NOT (whitespace OR word char)
           = punctuation, symbols, special chars

Examples:

Input:  "hello!!! world???"
Matches: hello[!!!][ world][???]
              ^^^         ^^^

Input:  "test@#$%test"
Matches: test[@#$%]test
             ^^^^^

Input:  "emoji 😀🎉"
Matches: emoji[ 😀🎉]
              ^^^^^^^

What gets matched: runs of punctuation (!!!, ???, ...), symbols (@#$%), emoji, and anything else that is neither whitespace nor a word character.

Why group them together?

"!!!" is more useful as ONE token than three separate "!" tokens
"..." is one token representing an ellipsis

7. \s+(?!\S)

Matches: Whitespace that is NOT followed by a non-whitespace character

Breaking it down:
\s+         One or more whitespace characters
(?!\S)      Negative lookahead: NOT followed by non-whitespace

Lookahead explained:

(?!...)     Negative lookahead (zero-width assertion)
\S          Non-whitespace character

So (?!\S) means "not followed by non-whitespace"
          = "followed by whitespace OR end of string"

Combined with \s+: for an interior run of two or more whitespace characters, this matches all of them except the last one (which is left for the next token); at the end of the string it matches the whole run.

Examples:

Input:  "hello world    "
                    ^^^^  ← Matches trailing spaces
Matches: hello world[    ]

Input:  "line1\n\n"
             ^^  ← Matches trailing newlines
Matches: line1[\n\n]

Input:  "word   middle   end     "
             ^^^            ^^^^^  ← Only trailing chunks
Matches: word[ middle][ end][     ]

Why this pattern?

It handles whitespace that  ?\w+ and  ?\d+ can't: interior runs are consumed up to their last character, which is left behind so it can attach to the next word (as in " word"), and whitespace at the very end of the text (where nothing follows) is captured in full.
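
A sketch of the interaction, using a reduced pattern with only the three relevant alternatives:

import re

print(re.findall(r" ?\w+|\s+(?!\S)|\s+", "word   middle   end     "))
# ['word', '  ', ' middle', '  ', ' end', '     ']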


8. \s+

Matches: Any remaining whitespace (fallback)

Breaking it down:
\s+         One or more whitespace characters

This is the catch-all for whitespace glued directly to text:

Input:  "word\ttab\nline"
Matches: word[\t]tab[\n]line

Why needed?

The previous patterns miss this case:  ?\w+ only allows a literal space (not \t or \n) in front of a word, and \s+(?!\S) refuses to match whitespace that is immediately followed by text. Without the final \s+ fallback, a lone tab or newline stuck to a word would simply be skipped instead of becoming a token.
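
The same reduced pattern shows the fallback firing on a tab and a newline that sit directly in front of text:

import re

print(re.findall(r" ?\w+|\s+(?!\S)|\s+", "word\ttab\nline"))
# ['word', '\t', 'tab', '\n', 'line']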


How They Work Together - Complete Example

Let's tokenize this string step by step:

Input: "I don't like 42 cats!!! \n"

Regex Processing (left to right, try each alternative):

Position 0:
  Try: <\|endoftext\|>  ❌ Doesn't match "I"
  Try: <\|pad\|>        ❌ Doesn't match "I"
  Try: '(?:s|t|...)     ❌ Doesn't start with '
  Try:  ?\w+            ✓ MATCH "I"

Position 1:
  Try: <\|endoftext\|>  ❌
  Try: <\|pad\|>        ❌
  Try: '(?:s|t|...)     ❌
  Try:  ?\w+            ✓ MATCH " don"

Position 5:
  Try: <\|endoftext\|>  ❌
  Try: <\|pad\|>        ❌
  Try: '(?:s|t|...)     ✓ MATCH "'t"    ← Matched!

Position 7:
  Try: <\|endoftext\|>  ❌
  Try: <\|pad\|>        ❌
  Try: '(?:s|t|...)     ❌
  Try:  ?\w+            ✓ MATCH " like"

Position 12:
  Try: <\|endoftext\|>  ❌
  Try: <\|pad\|>        ❌
  Try: '(?:s|t|...)     ❌
  Try:  ?\w+            ✓ MATCH " 42"   ← \w includes digits (see Pitfall 1);
                          in the original GPT-2 pattern, whose word alternative
                          is  ?\p{L}+,  ?\p{N}+ would catch " 42" here instead

Position 15:
  Try: <\|endoftext\|>  ❌
  Try: <\|pad\|>        ❌
  Try: '(?:s|t|...)     ❌
  Try:  ?\w+            ✓ MATCH " cats"

Position 20:
  Try: <\|endoftext\|>  ❌
  Try: <\|pad\|>        ❌
  Try: '(?:s|t|...)     ❌
  Try:  ?\w+            ❌
  Try:  ?\d+            ❌
  Try:  ?[^\s\w]+       ✓ MATCH "!!!"

Position 23:
  Try: <\|endoftext\|>  ❌
  Try: <\|pad\|>        ❌
  Try: '(?:s|t|...)     ❌
  Try:  ?\w+            ❌
  Try:  ?\d+            ❌
  Try:  ?[^\s\w]+       ❌
  Try: \s+(?!\S)        ✓ MATCH " \n"  ← Trailing whitespace

Final Split:

["I", " don", "'t", " like", " 42", " cats", "!!!", " \n"]

Why This Design is Brilliant

1. Position-Aware Tokenization

"cat" vs " cat" are different tokens
Preserves word boundaries without explicit markers

2. Linguistic Intelligence

Contractions split at a meaningful boundary: "don't" → ["don", "'t"]
Not random character splits

3. Number Handling

"123456" stays together, not split into digits
Important for numeric data

4. Punctuation Grouping

"!!!" is one token (emphasis)
"..." is one token (ellipsis)
Better semantic preservation

5. Whitespace Preservation

Trailing spaces captured
Newlines preserved
Important for code and formatting

Common Pitfalls & Edge Cases

Pitfall 1: Regex is Greedy

Input: "testing123"

Try  ?\w+:   Matches "testing123"  ✓ (wins)
Try  ?\d+:   Never gets tried! (already matched)

Result: ["testing123"]  not ["testing", "123"]

Because \w+ includes digits! \w = [a-zA-Z0-9_]
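
Quick confirmation (Python re):

import re

print(re.findall(r" ?\w+| ?\d+", "testing123"))
# ['testing123']   ← the digit alternative never fires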

Pitfall 2: Order Matters

If we swapped the order:

 ?\d+| ?\w+  (digits first)

Input: "test123"
Match:  ?\d+ fails (starts with 't')
Match:  ?\w+ succeeds → "test123"

Same result here, just one wasted attempt. But for input that starts with digits the order changes the split: with digits first, "123abc" becomes ["123", "abc"], while with words first  ?\w+ grabs "123abc" whole.
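
A quick check of both orderings (illustrative, Python re):

import re

print(re.findall(r" ?\w+| ?\d+", "test 123abc"))   # words first
# ['test', ' 123abc']
print(re.findall(r" ?\d+| ?\w+", "test 123abc"))   # digits first
# ['test', ' 123', 'abc']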

Pitfall 3: Special Tokens MUST Be First

If the special-token alternatives came last:

Input: "<|endoftext|>"

Try  ?[^\s\w]+: matches "<|"
Try  ?\w+:      matches "endoftext"
Try  ?[^\s\w]+: matches "|>"

Result: the special token is shattered into ["<|", "endoftext", "|>"] and can never be recognized as a single unit.
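
What the generic alternatives do to a special token when it reaches them (a sketch of the same pattern minus the special-token alternatives):

import re

print(re.findall(r"'(?:s|t|re|ve|m|ll|d)| ?\w+| ?\d+| ?[^\s\w]+|\s+(?!\S)|\s+",
                 "<|endoftext|>"))
# ['<|', 'endoftext', '|>']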

Performance Characteristics

Regex Engine Behavior

For each character position, tries patterns in order:

Worst case:
- Try pattern 1: ❌ (backtrack)
- Try pattern 2: ❌ (backtrack)
- ...
- Try pattern 8: ✓ (match)

Best case:
- Try pattern 1: ✓ (match immediately)

Optimization: After the special tokens (which must stay in front for correctness), the most common patterns should come FIRST

Complexity: O(n)

Where n = length of input string

Each position is visited once, regex tries alternatives in order.

Not O(n²) because we don't backtrack across the whole string.


Testing Each Pattern

Test 1: Special Tokens

Input:  "start <|endoftext|> end"
Expect: ["start", " <|endoftext|>", " end"]  ❌
Actual: ["start", " ", "<|endoftext|>", " ", "end"]  ← WRONG!

WHY? The  ?\w+ captures " " before special token gets tried!

Fix: Special tokens need to be at START of alternation (already are)

Test 2: Contractions

Input:  "don't can't won't"
Expect: ["don", "'t", " can", "'t", " won", "'t"]
Actual: ["don", "'t", " can", "'t", " won", "'t"]  ✓

Test 3: Numbers

Input:  "test123 456"
Expect: ["test123", " 456"]
Actual: ["test123", " 456"]  ✓

Test 4: Punctuation

Input:  "What!? Amazing..."
Expect: ["What", "!?", " Amazing", "..."]
Actual: ["What", "!?", " Amazing", "..."]  ✓

Test 5: Mixed Whitespace

Input:  "line1\n\n  line2  "
Expect: ["line1", "\n\n", "  ", "line2", "  "]  ← Kinda
Actual: ["line1", "\n\n  ", "line2", "  "]     ← Actual

The \s+ patterns merge adjacent whitespace chunks
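
A small harness to rerun the five tests (Python re; the pattern is redefined so the snippet runs standalone, and the commented outputs are what the pattern as written produces):

import re

GPT2_SPLIT = (
    r"<\|endoftext\|>|<\|pad\|>|'(?:s|t|re|ve|m|ll|d)"
    r"| ?\w+| ?\d+| ?[^\s\w]+|\s+(?!\S)|\s+"
)

for text in ["start <|endoftext|> end",
             "don't can't won't",
             "test123 456",
             "What!? Amazing...",
             "line1\n\n  line2  "]:
    print(re.findall(GPT2_SPLIT, text))
# ['start', ' <|', 'endoftext', '|>', ' end']
# ['don', "'t", ' can', "'t", ' won', "'t"]
# ['test123', ' 456']
# ['What', '!?', ' Amazing', '...']
# ['line1', '\n\n ', ' line2', '  ']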

What Happens After This Regex Split?

Step 1: Regex produces initial tokens

"I don't like cats" → ["I", " don", "'t", " like", " cats"]

Step 2: Convert to bytes

"I"      → [73]
" don"   → [32, 100, 111, 110]
"'t"     → [39, 116]
" like"  → [32, 108, 105, 107, 101]
" cats"  → [32, 99, 97, 116, 115]

Step 3: BPE learns merges

Most common byte pair: (32, 100) appears in " don", " do", " data"...
Merge: [32, 100] → new_token_258   (256 and 257 are reserved for the special tokens below)

Next common: (111, 110) in "don", "on", "won"...
Merge: [111, 110] → new_token_259

After 10,000 merges:
" don" might become a SINGLE token
"'t" becomes a single token
Common words become single tokens
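
Finding the "most common byte pair" is a simple count; here is a toy sketch with collections.Counter over three illustrative pieces (not real corpus statistics):

from collections import Counter

pieces = [
    list(" don".encode("utf-8")),   # [32, 100, 111, 110]
    list(" do".encode("utf-8")),    # [32, 100, 111]
    list(" data".encode("utf-8")),  # [32, 100, 97, 116, 97]
]

pair_counts = Counter()
for p in pieces:
    pair_counts.update(zip(p, p[1:]))   # count adjacent byte pairs within each piece

print(pair_counts.most_common(1))
# [((32, 100), 3)]   ← the byte pair for " d" is the most frequent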

Step 4: Final vocabulary

Token 0-255:   Raw bytes
Token 256:     <|endoftext|>
Token 257:     <|pad|>
Token 258-10257: Learned BPE merges
  - Common words: " the", " and", " is"
  - Subwords: "ing", "tion", "er"
  - Rare combos: "xyzzy"

The Genius of This Approach

Without Regex Pre-processing:

"don't" → ['d', 'o', 'n', "'", 't']
BPE needs to learn: (d,o), (do,n), (don,'), (don',t)
Takes many merges to reconstruct linguistic units

With Regex Pre-processing:

"don't" → ["don", "'t"]
BPE starts with linguistic units
Learns semantic merges faster
Better token efficiency

Result: fewer merges are wasted reassembling obvious units, and the learned vocabulary lines up with meaningful pieces of text.


Summary

The regex does FOUR main things:

  1. Protect special tokens (<|endoftext|>, <|pad|>)
  2. Split contractions intelligently (don't → don + 't)
  3. Group characters by type (words, numbers, punctuation, whitespace)
  4. Preserve position info (leading spaces stay with words)

It's pre-chunking the text into sensible units BEFORE BPE training.

BPE then learns to merge these chunks into optimal tokens.

This is part of why GPT-style tokenizers work so well in practice: the pre-tokenization is linguistically informed, not just blind byte-pair statistics.