<\|endoftext\|>|<\|pad\|>|'(?:s|t|re|ve|m|ll|d)| ?\w+| ?\d+| ?[^\s\w]+|\s+(?!\S)|\s+
This is a SINGLE alternation (uses | to try each pattern in order, left to right).
The regex engine tries each alternative until one matches, then moves on.
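To make this concrete, here is a minimal sketch compiling the pattern with Python's built-in `re` module and running it on a sample string (the sample sentence is made up for illustration):

```python
import re

# The full pre-tokenization pattern: one alternation, tried left to right.
PAT = re.compile(
    r"<\|endoftext\|>|<\|pad\|>|'(?:s|t|re|ve|m|ll|d)"
    r"| ?\w+| ?\d+| ?[^\s\w]+|\s+(?!\S)|\s+"
)

# findall returns each non-overlapping match, in order.
print(PAT.findall("Hello, world! It's 2024."))
# → ['Hello', ',', ' world', '!', ' It', "'s", ' 2024', '.']
```

Note how spaces stay attached to the word that follows them, and the contraction "'s" is split off on its own.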
<\|endoftext\|>
Matches: The literal string <|endoftext|>
Input: "Hello <|endoftext|> World"
Match: "Hello [<|endoftext|>] World"
Why the escapes?
\| escapes the pipe character (otherwise it's alternation)
Priority: HIGHEST (checked first)
<\|pad\|>
Matches: The literal string <|pad|>
Input: "text <|pad|> more"
Match: "text [<|pad|>] more"
Purpose: Special padding token, treated atomically (never split)
'(?:s|t|re|ve|m|ll|d)
Matches: Apostrophe followed by common English contractions
Breaking it down:
' Literal apostrophe
(?:...) Non-capturing group (just groups, doesn't save)
s|t|re|... Alternation of specific suffixes
Examples:
Input: "don't can't we're I've I'm we'll he'd"
Matches: don['t] can['t] we['re] I['ve] I['m] we['ll] he['d]
Why this matters: Contractions are treated as separate tokens from the base word:
"don't" → ["don", "'t"] not ["d", "o", "n", "'", "t"]Why these specific contractions?
's → possessive or "is": "John's car", "he's"
't → not: "don't", "can't"
're → are: "we're", "they're"
've → have: "I've", "we've"
'm → am: "I'm"
'll → will: "I'll", "we'll"
'd → would/had: "I'd", "he'd"
Most common English contractions. Covers 95%+ of usage.
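The contraction alternative can be tested in isolation (a quick sketch; the sample words are made up):

```python
import re

# Just the contraction alternative, on its own.
CONTRACTION = re.compile(r"'(?:s|t|re|ve|m|ll|d)")

print(CONTRACTION.findall("don't we're I'll John's"))
# → ["'t", "'re", "'ll", "'s"]
```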
?\w+
Matches: Optional space followed by one or more word characters
Breaking it down:
? Zero or one space (optional)
\w+ One or more word characters [a-zA-Z0-9_]
Examples:
Input: "hello world test"
Matches: [hello][ world][ test]
^^^^^ ^^^^^^ ^^^^^
Input: "no_space"
Matches: [no_space]
Input: " leading"
Matches: [ leading]
^^^^^^^^
Critical insight:
"hello"" world", " test"This is position-sensitive tokenization:
"cat" and " cat" are DIFFERENT tokensWhy?
"The cat sat" → ["The", " cat", " sat"]
vs naive split:
"The cat sat" → ["The", "cat", "sat"] ❌ Loses space info!
?\d+
Matches: Optional space followed by one or more digits
Breaking it down:
? Zero or one space (optional)
\d+ One or more digits [0-9]
Examples:
Input: "test 123 456"
Matches: test[ 123][ 456]
^^^^ ^^^^
Input: "year2024test"
Matches (pattern tested in isolation): year[2024]test
Note: in the full alternation this split never happens; ?\w+ matches "year2024test" as one chunk, because \w includes digits.
Why separate from \w+?
In this simplified pattern, ?\d+ is actually shadowed by ?\w+ (\w = [a-zA-Z0-9_] already covers digits). The real GPT-2 pattern uses ?\p{L}+ for letters and ?\p{N}+ for numbers, so digits genuinely get their own chunk there. The motivation:
123456789 should be one token candidate.
Without chunking digits together:
"price 999" → might become ["pr", "ice", " ", "9", "9", "9"] ❌
With it:
"price 999" → ["price", " 999"] ✓
?[^\s\w]+
Matches: Optional space followed by one or more non-space, non-word characters
Breaking it down:
? Zero or one space (optional)
[^\s\w]+ One or more chars that are NOT space AND NOT word chars
Character class breakdown:
[^...] Negated character class (anything NOT in the set)
\s Whitespace characters
\w Word characters [a-zA-Z0-9_]
So [^\s\w] = NOT (whitespace OR word char)
= punctuation, symbols, special chars
Examples:
Input: "hello!!! world???"
Matches: hello[!!!] world[???]
Input: "test@#$%test"
Matches: test[@#$%]test
^^^^^
Input: "emoji 😀🎉"
Matches: emoji[ 😀🎉]
^^^^^^^
What gets matched:
Punctuation: ! ? . , ; :
Operators: + - * / =
Brackets: ( ) [ ] { }
Symbols: @ # $ % &
Why group them together?
"!!!" is more useful as ONE token than three separate "!" tokens
"..." is one token representing an ellipsis
\s+(?!\S)
Matches: Whitespace that is NOT followed by a non-whitespace character
Breaking it down:
\s+ One or more whitespace characters
(?!\S) Negative lookahead: NOT followed by non-whitespace
Lookahead explained:
(?!...) Negative lookahead (zero-width assertion)
\S Non-whitespace character
So (?!\S) means "the match may not end right before non-whitespace"
= "followed by more whitespace OR end of string"
= trailing whitespace is consumed whole, while a mid-text run backtracks to leave its last whitespace char for the next pattern (so ?\w+ still gets " word")
Examples:
Input: "hello world "
^^^^ ← Matches trailing spaces
Matches: hello world[ ]
Input: "line1\n\n"
^^ ← Matches trailing newlines
Matches: line1[\n\n]
Input: "word middle end "
Matches: word middle end[ ] ← only the trailing chunk
The spaces before "middle" and "end" are picked up by ?\w+, not by this pattern.
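The leave-one-behind behavior of the lookahead can be checked directly (a sketch using only the relevant alternatives):

```python
import re

PAT = re.compile(r" ?\w+|\s+(?!\S)|\s+")

# Mid-text runs shrink by one char so the next word keeps its leading space;
# trailing runs are consumed whole.
print(PAT.findall("hi   there  "))
# → ['hi', '  ', ' there', '  ']
```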
Why this pattern?
This catches trailing whitespace that wasn't captured by ?\w+ or ?\d+:
\s+
Matches: Any remaining whitespace (fallback)
Breaking it down:
\s+ One or more whitespace characters
This is the catch-all for weird whitespace:
Input: "word\t\ttabs\n\nnewlines"
Matches: word[\t][\t]tabs[\n][\n]newlines
The first \t of each run comes from \s+(?!\S), which backtracks to keep one char in reserve; the second \t sits directly before a word, fails the lookahead, and falls through to plain \s+.
Why needed?
?\w+ only allows a literal space, and \s+(?!\S) refuses to end right before non-whitespace, so a lone \t or \n sitting directly before a word would otherwise match nothing. The plain \s+ fallback catches it.
Let's tokenize this string step by step:
Input: "I don't like 42 cats!!! \n"
Position 0:
Try: <\|endoftext\|> ❌ Doesn't match "I"
Try: <\|pad\|> ❌ Doesn't match "I"
Try: '(?:s|t|...) ❌ Doesn't start with '
Try: ?\w+ ✓ MATCH "I"
Position 1:
Try: <\|endoftext\|> ❌
Try: <\|pad\|> ❌
Try: '(?:s|t|...) ❌
Try: ?\w+ ✓ MATCH " don"
Position 5:
Try: <\|endoftext\|> ❌
Try: <\|pad\|> ❌
Try: '(?:s|t|...) ✓ MATCH "'t" ← Matched!
Position 7:
Try: <\|endoftext\|> ❌
Try: <\|pad\|> ❌
Try: '(?:s|t|...) ❌
Try: ?\w+ ✓ MATCH " like"
Position 12:
Try: <\|endoftext\|> ❌
Try: <\|pad\|> ❌
Try: '(?:s|t|...) ❌
Try: ?\w+ ✓ MATCH " 42" (digits are word chars, so ?\d+ is never reached)
Position 15:
Try: <\|endoftext\|> ❌
Try: <\|pad\|> ❌
Try: '(?:s|t|...) ❌
Try: ?\w+ ✓ MATCH " cats"
Position 20:
Try: <\|endoftext\|> ❌
Try: <\|pad\|> ❌
Try: '(?:s|t|...) ❌
Try: ?\w+ ❌
Try: ?\d+ ❌
Try: ?[^\s\w]+ ✓ MATCH "!!!"
Position 23:
Try: <\|endoftext\|> ❌
Try: <\|pad\|> ❌
Try: '(?:s|t|...) ❌
Try: ?\w+ ❌
Try: ?\d+ ❌
Try: ?[^\s\w]+ ❌
Try: \s+(?!\S) ✓ MATCH " \n" ← Trailing whitespace
["I", " don", "'t", " like", " 42", " cats", "!!!", " \n"]
"cat" vs " cat" are different tokens
Preserves word boundaries without explicit markers
Contractions kept together: "don't" → ["don", "'t"]
Not random character splits
"123456" stays together, not split into digits
Important for numeric data
"!!!" is one token (emphasis)
"..." is one token (ellipsis)
Better semantic preservation
Trailing spaces captured
Newlines preserved
Important for code and formatting
Input: "testing123"
Try ?\w+: Matches "testing123" ✓ (wins)
Try ?\d+: Never gets tried! (already matched)
Result: ["testing123"] not ["testing", "123"]
Because \w+ includes digits! \w = [a-zA-Z0-9_]
If we swapped order:
?\d+| ?\w+ (digits first)
Input: "test123"
Match: ?\d+ fails (starts with 't')
Match: ?\w+ succeeds → "test123"
Same result, but we wasted a regex attempt
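Both orderings can be checked side by side (a sketch using only the two alternatives in question):

```python
import re

# \w+ already covers digits, so ?\d+ never fires on alphanumeric runs:
print(re.findall(r" ?\w+| ?\d+", "testing123"))  # → ['testing123']
print(re.findall(r" ?\d+| ?\w+", "testing123"))  # swapped order, same result
```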
If special tokens were last:
Input: "<|endoftext|>"
Try ?\w+: ❌ ("<" is not a word char)
Try ?[^\s\w]+: ✓ matches "<|", then ?\w+ takes "endoftext", then "|>"
Result: ["<|", "endoftext", "|>"] ← broken special token handling
For each character position, tries patterns in order:
Worst case:
- Try pattern 1: ❌ (backtrack)
- Try pattern 2: ❌ (backtrack)
- ...
- Try pattern 8: ✓ (match)
Best case:
- Try pattern 1: ✓ (match immediately)
Optimization: the most common patterns should come first (after the special tokens, which must stay first for correctness).
Overall: roughly O(n), where n = length of the input string.
Each position is visited once; the regex tries alternatives in order.
Not O(n²), because every match advances the position: there is no backtracking across the whole string.
Input: "start <|endoftext|> end"
Expect: ["start", " ", "<|endoftext|>", " end"]
Actual: ["start", " <|", "endoftext", "|>", " end"] ← WRONG!
WHY? The special-token patterns have no optional leading space, so at " <|" they fail and ?[^\s\w]+ grabs " <|" first, shredding the token.
Fix: real implementations split the text on special tokens BEFORE running the regex. Putting them first in the alternation only helps when the special token starts exactly at the current position.
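Running the pattern directly shows what happens (a sketch; output from Python's `re`):

```python
import re

PAT = re.compile(
    r"<\|endoftext\|>|<\|pad\|>|'(?:s|t|re|ve|m|ll|d)"
    r"| ?\w+| ?\d+| ?[^\s\w]+|\s+(?!\S)|\s+"
)

# The leading space keeps the special-token alternative from matching,
# so ?[^\s\w]+ shreds it:
print(PAT.findall("start <|endoftext|> end"))
# → ['start', ' <|', 'endoftext', '|>', ' end']

# With no preceding character, the special token survives intact:
print(PAT.findall("<|endoftext|>end"))
# → ['<|endoftext|>', 'end']
```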
Input: "don't can't won't"
Expect: ["don", "'t", " can", "'t", " won", "'t"]
Actual: ["don", "'t", " can", "'t", " won", "'t"] ✓
Input: "test123 456"
Expect: ["test123", " 456"]
Actual: ["test123", " 456"] ✓
Input: "What!? Amazing..."
Expect: ["What", "!?", " Amazing", "..."]
Actual: ["What", "!?", " Amazing", "..."] ✓
Input: "line1\n\n line2 "
Expect: ["line1", "\n\n", " ", "line2", " "] ← Kinda
Actual: ["line1", "\n\n", " line2", " "]
The (?!\S) lookahead makes \s+ stop one character early, so the space before "line2" is left for ?\w+ to attach to the word.
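These cases can be checked mechanically:

```python
import re

PAT = re.compile(
    r"<\|endoftext\|>|<\|pad\|>|'(?:s|t|re|ve|m|ll|d)"
    r"| ?\w+| ?\d+| ?[^\s\w]+|\s+(?!\S)|\s+"
)

assert PAT.findall("don't can't won't") == ["don", "'t", " can", "'t", " won", "'t"]
assert PAT.findall("test123 456") == ["test123", " 456"]
assert PAT.findall("What!? Amazing...") == ["What", "!?", " Amazing", "..."]
assert PAT.findall("line1\n\n line2 ") == ["line1", "\n\n", " line2", " "]
print("all cases pass")
```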
"I don't like cats" → ["I", " don", "'t", " like", " cats"]
"I" → [73]
" don" → [32, 100, 111, 110]
"'t" → [39, 116]
" like" → [32, 108, 105, 107, 101]
" cats" → [32, 99, 97, 116, 115]
Most common byte pair: (32, 100) appears in " don", " do", " data"...
Merge: [32, 100] → new_token_256
Next common: (111, 110) in "don", "on", "won"...
Merge: [111, 110] → new_token_257
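The pair counting step can be sketched in a few lines (a toy illustration, not GPT-2's actual training code; the chunk list is made up):

```python
from collections import Counter

def pair_counts(chunks):
    """Count adjacent byte pairs within each pre-tokenized chunk."""
    pairs = Counter()
    for chunk in chunks:
        b = chunk.encode("utf-8")
        for left, right in zip(b, b[1:]):
            pairs[(left, right)] += 1
    return pairs

counts = pair_counts([" don", " do", " data"])
print(counts[(32, 100)])  # (" ", "d") appears in all three chunks → 3
```

Because counting happens within chunks, a merge can never cross a chunk boundary — the regex's splits are hard limits on what BPE may join.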
After 10,000 merges:
" don" might become a SINGLE token
"'t" becomes a single token
Common words become single tokens
Token 0-255: Raw bytes
Token 256: <|endoftext|>
Token 257: <|pad|>
Token 258-10257: Learned BPE merges
- Common words: " the", " and", " is"
- Subwords: "ing", "tion", "er"
- Rare combos: "xyzzy"
"don't" → ['d', 'o', 'n', "'", 't']
BPE needs to learn: (d,o), (do,n), (don,'), (don',t)
Takes many merges to reconstruct linguistic units
"don't" → ["don", "'t"]
BPE starts with linguistic units
Learns semantic merges faster
Better token efficiency
The regex does FOUR main things:
1. Protects special tokens (<|endoftext|>, <|pad|>) so they are never split
2. Splits off contractions (don't → don + 't)
3. Chunks words, numbers, and punctuation, keeping the leading space attached
4. Handles whitespace carefully (lookahead plus fallback)
It's pre-chunking the text into sensible units BEFORE BPE training.
BPE then learns to merge these chunks into optimal tokens.
This linguistically informed pre-chunking is part of why GPT-style tokenizers work so well: BPE isn't left to learn from dumb byte-pair statistics alone.