
Byte Pair Encoding: Complete Guide to Subword Tokenization

Michael Brenndoerfer • December 11, 2025 • 27 min read • 6,432 words

Master Byte Pair Encoding (BPE), the subword tokenization algorithm powering GPT and RoBERTa. Learn how BPE bridges character and word-level approaches through iterative merge operations.


Byte Pair Encoding

In the previous chapter, we explored the vocabulary problem: word-level tokenization struggles with out-of-vocabulary words, rare tokens, and the combinatorial explosion of morphologically rich languages. The solution is to work with smaller units that can be composed to represent any word. But what should these units be?

One option is individual characters. Character-level models have a tiny vocabulary and can represent any string, but they create extremely long sequences and lose the linguistic structure that makes words meaningful. At the other extreme, word-level models preserve semantic units but suffer from vocabulary explosion.

Byte Pair Encoding (BPE) finds the sweet spot between these extremes. It starts with characters and iteratively merges the most frequent pairs into new tokens, building a vocabulary of common subword units. The result is a compact vocabulary that can represent any word while preserving meaningful linguistic patterns. BPE powers the tokenizers behind GPT models, RoBERTa, and many other state-of-the-art language models.

In this chapter, we'll understand how BPE works, implement it from scratch, and see how it addresses the challenges we identified earlier.

The Core Insight

BPE originated as a data compression algorithm in the 1990s. The insight is simple: if certain byte sequences appear frequently, replacing them with a single symbol reduces the total size of the data. Applied to text, this means learning which character combinations appear most often and treating them as single units.

Byte Pair Encoding

A subword tokenization algorithm that iteratively replaces the most frequent pair of consecutive tokens with a new token, building a vocabulary of variable-length subword units from character-level representations.

Consider the word "lowest". A character-level approach produces ['l', 'o', 'w', 'e', 's', 't'], which is six tokens. But if we've seen "low" and "est" frequently in our training data, BPE might encode this as ['low', 'est'], capturing meaningful morphological units in just two tokens.

BPE discovers these patterns automatically from data. You don't need linguistic expertise or hand-crafted rules. Given enough text, BPE learns that "ing", "tion", "pre", and "un" are useful units because they appear frequently and help compress the corpus.

The BPE Algorithm

The BPE algorithm has two phases: training (learning merge rules) and encoding (applying those rules to new text). Let's start with training.

Learning Merge Rules

Training BPE proceeds as follows:

  1. Initialize the vocabulary with all individual characters plus a special end-of-word token
  2. Count all adjacent token pairs in the corpus
  3. Find the most frequent pair and merge it into a new token
  4. Add a merge rule recording this transformation
  5. Repeat steps 2-4 until reaching the desired vocabulary size

Each iteration adds one new token to the vocabulary. If you want 10,000 subword tokens and start with 256 characters (bytes), you run 9,744 merge operations.

Let's trace through a small example to see how this works:

In[3]:
def get_word_freqs(corpus):
    """Count word frequencies in the corpus."""
    word_freqs = {}
    for text in corpus:
        words = text.lower().split()
        for word in words:
            # Add end-of-word token
            word_with_end = ' '.join(list(word)) + ' </w>'
            word_freqs[word_with_end] = word_freqs.get(word_with_end, 0) + 1
    return word_freqs

# Small training corpus
corpus = [
    "low lower lowest",
    "new newer newest", 
    "low new low new"
]

word_freqs = get_word_freqs(corpus)
Out[4]:
Initial word frequencies (characters separated by spaces):
-------------------------------------------------------
  l o w </w>                     : 3
  n e w </w>                     : 3
  l o w e r </w>                 : 1
  l o w e s t </w>               : 1
  n e w e r </w>                 : 1
  n e w e s t </w>               : 1

We start with each word split into individual characters, with a special </w> marker indicating word boundaries. This marker is crucial: it lets BPE distinguish between "un" at the start of "unhappy" and "un" at the end of "sun".

Now let's count adjacent pairs and find the most frequent one:

In[5]:
def count_pairs(word_freqs):
    """Count frequency of all adjacent token pairs."""
    pairs = {}
    for word, freq in word_freqs.items():
        tokens = word.split()
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

pairs = count_pairs(word_freqs)
Out[6]:
Adjacent pair frequencies:
----------------------------------------
  w     + </w>  : 6
  l     + o     : 5
  o     + w     : 5
  n     + e     : 5
  e     + w     : 5
  w     + e     : 4
  e     + r     : 2
  r     + </w>  : 2
  e     + s     : 2
  s     + t     : 2

Most frequent pair: ('w', '</w>') (appears 6 times)

Let's visualize this distribution to see the frequency gap between pairs:

Out[7]:
Visualization
Horizontal bar chart showing frequency of the top 10 adjacent character pairs, with 'w + </w>' having the highest count.
Adjacent pair frequencies before the first merge. The most frequent pair ('w', '</w>') appears 6 times, making it the greedy choice for merging. Notice the frequency gap between the top pair and the rest, which is typical in natural language due to Zipf's law.

Before running the full greedy loop, let's walk through a single merge by hand. We'll take the pair ('e', 'w'), which appears 5 times (in "new", "newer", and "newest"), and merge it into a single token:

In[8]:
import re

def merge_pair(word_freqs, pair):
    """Merge all occurrences of a pair into a single token."""
    new_word_freqs = {}
    merged_token = pair[0] + pair[1]
    # Match the pair only at whitespace-delimited token boundaries so we don't
    # accidentally merge inside longer tokens that merely end or start with
    # the same characters (e.g. 'ne s' when merging ('e', 's')).
    pattern = re.compile(r'(?<!\S)' + re.escape(f'{pair[0]} {pair[1]}') + r'(?!\S)')
    
    for word, freq in word_freqs.items():
        # Replace the pair with the merged token
        new_word = pattern.sub(merged_token, word)
        new_word_freqs[new_word] = freq
    
    return new_word_freqs

# Perform the merge
merged_word_freqs = merge_pair(word_freqs, ('e', 'w'))
merge_rules = [('e', 'w')]
Out[9]:
After merging ('e', 'w') → 'ew':
-------------------------------------------------------
  l o w </w>                     : 3
  n ew </w>                      : 3
  l o w e r </w>                 : 1
  l o w e s t </w>               : 1
  n ew e r </w>                  : 1
  n ew e s t </w>                : 1

Notice how "n e w" became "n ew" throughout the vocabulary. The merge is applied everywhere the pair appears. Now let's run the full greedy training loop from the original character-level splits and watch the vocabulary evolve over 10 merges:

In[10]:
def train_bpe(word_freqs, num_merges):
    """Train BPE for a specified number of merge operations."""
    merge_rules = []
    current_freqs = word_freqs.copy()
    
    for i in range(num_merges):
        pairs = count_pairs(current_freqs)
        if not pairs:
            break
            
        # Find most frequent pair
        best_pair = max(pairs.items(), key=lambda x: x[1])
        pair, freq = best_pair
        
        # Record and apply the merge
        merge_rules.append(pair)
        current_freqs = merge_pair(current_freqs, pair)
        
    return merge_rules, current_freqs

# Train for 10 merges
merge_rules, final_vocab = train_bpe(word_freqs, num_merges=10)
Out[11]:
Learned merge rules (in order):
----------------------------------------
   1. w      + </w>   → w</w>
   2. l      + o      → lo
   3. n      + e      → ne
   4. w      + e      → we
   5. lo     + w</w>  → low</w>
   6. ne     + w</w>  → new</w>
   7. lo     + we     → lowe
   8. r      + </w>   → r</w>
   9. s      + t      → st
  10. st     + </w>   → st</w>

Final vocabulary representation:
----------------------------------------
  low</w>
  new</w>
  lowe r</w>
  lowe st</w>
  ne we r</w>
  ne we st</w>

After 10 merges, the algorithm has discovered several meaningful units:

  • lo and low</w> from "low", "lower", and "lowest"
  • ne and new</w> from "new", "newer", and "newest"
  • st and st</w>, capturing the end of the superlative suffix "-est"
  • r</w>, capturing the end of the comparative suffix "-er"

These emerge purely from frequency patterns in the data. The algorithm doesn't know English morphology; it simply notices that certain character sequences co-occur frequently.

Let's visualize how the pair frequencies are distributed and how token counts decrease as merges are applied:

Out[12]:
Visualization
Bar chart showing pair frequencies in descending order, displaying the characteristic long-tail distribution.
Pair frequency distribution at the start of BPE training. The distribution follows a Zipf-like pattern: a few pairs appear very frequently, while most pairs are rare. BPE exploits this by always merging the most frequent pair first.
Line plot showing total token count decreasing from initial character count as merge iterations increase.
Token count reduction during BPE training. Each merge reduces the total number of tokens in the corpus. Early merges have the largest impact because they target the most frequent pairs.

The left plot shows that pair frequencies follow a Zipf-like distribution: a few pairs dominate while most are rare. This explains why greedy selection works so well. By always choosing the most frequent pair, BPE captures the patterns that matter most for compression.

The right plot shows the payoff: each merge reduces the total token count. Early merges have the biggest impact because they target the most frequent pairs. As training progresses, the remaining pairs are less frequent, so each merge contributes less to compression.

Visualizing the Merge Process

Let's visualize how BPE builds up tokens across iterations:

Out[13]:
Visualization
Horizontal bar chart showing token sequences for words low, lower, lowest, new, newer, newest across 10 BPE merge iterations.
BPE token evolution across merge iterations. Each row shows how a word's tokenization changes as merge rules are learned. Early merges capture common character pairs, while later merges combine these into larger units like 'low</w>', 'new</w>', and suffix pieces such as 'st</w>'.

The visualization shows how BPE progressively builds larger tokens. Initially, each word is a sequence of individual characters. Through merges, common patterns like "lo", "ne", "we", and "st" emerge and get combined into larger units. By iteration 10, "lowest" is represented as just two tokens: "lowe" and "st</w>".

Encoding New Text

Once we've learned merge rules, we can encode any new text, including words never seen during training. The encoding algorithm applies merge rules in the order they were learned:

In[14]:
def encode_word(word, merge_rules):
    """Encode a word using learned BPE merge rules."""
    # Start with characters
    tokens = list(word)
    
    # Apply each merge rule in order
    for pair in merge_rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
                tokens = tokens[:i] + [pair[0] + pair[1]] + tokens[i + 2:]
            else:
                i += 1
    
    return tokens

def encode_text(text, merge_rules):
    """Encode a full text string."""
    words = text.lower().split()
    encoded = []
    for word in words:
        encoded.extend(encode_word(word, merge_rules))
    return encoded

Let's test encoding on both known and unknown words:

In[15]:
# Encode words from training
known_words = ["low", "lower", "newest"]

# Encode words that weren't in training data
unknown_words = ["slower", "renew", "unlowest"]
Out[16]:
Encoding known words:
----------------------------------------
  low        → ['lo', 'w']
  lower      → ['lowe', 'r']
  newest     → ['ne', 'we', 'st']

Words from training are encoded using the learned merge rules. Note that this simple encoder starts from list(word) without appending the </w> marker, so word-final merges such as low</w> never fire: "low" comes out as ['lo', 'w'] and "newest" as ['ne', 'we', 'st']. The intermediate merges still apply, so the tokenizations remain compact and reflect the morphological structure.

Out[17]:
Encoding unseen words:
----------------------------------------
  slower     → ['s', 'lowe', 'r']
  renew      → ['r', 'e', 'ne', 'w']
  unlowest   → ['u', 'n', 'lowe', 'st']

BPE handles unknown words gracefully by decomposing them into known subword units. "slower" becomes ['s', 'lowe', 'r'], reusing the "lowe" unit learned from "lower" and "lowest", and "unlowest" recovers "lowe" and "st" around the novel prefix characters. "renew" falls back further, splitting into ['r', 'e', 'ne', 'w']. This is the key advantage over word-level tokenization: BPE can represent any word by falling back to smaller units, ensuring no word ever becomes an unknown token.

Decoding: From Tokens to Text

Decoding is straightforward. Simply concatenate the tokens:

In[18]:
def decode_tokens(tokens):
    """Reconstruct text from BPE tokens."""
    return ''.join(tokens)

# Round-trip encoding and decoding
original = "slower"
encoded = encode_word(original, merge_rules)
decoded = decode_tokens(encoded)
Out[19]:
Original: slower
Encoded:  ['s', 'lowe', 'r']
Decoded:  slower
Match:    True

In practice, tokenizers use special markers (like Ġ or ##) to indicate word boundaries, allowing proper spacing when decoding. We'll see this when working with real tokenizers.
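For intuition, here is a tiny sketch of how such a marker restores spacing at decode time (the tokens below are illustrative, not output from our tokenizer):

tokens = ['the</w>', 'low', 'est</w>', 'cat</w>']
text = ''.join(tokens).replace('</w>', ' ').strip()
print(text)  # "the lowest cat"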

Building a Complete BPE Tokenizer

Let's build a more complete BPE tokenizer class that handles the full pipeline:

In[20]:
class BPETokenizer:
    """A simple BPE tokenizer implementation."""
    
    def __init__(self, vocab_size=1000):
        self.vocab_size = vocab_size
        self.merge_rules = []
        self.vocab = {}
        self.inverse_vocab = {}
    
    def train(self, texts):
        """Learn BPE merge rules from texts."""
        # Build initial character vocabulary
        word_freqs = {}
        for text in texts:
            for word in text.lower().split():
                chars = ' '.join(list(word)) + ' </w>'
                word_freqs[chars] = word_freqs.get(chars, 0) + 1
        
        # Get base characters
        base_vocab = set()
        for word in word_freqs:
            base_vocab.update(word.split())
        
        # Calculate number of merges needed
        num_merges = self.vocab_size - len(base_vocab)
        
        # Learn merge rules
        current = word_freqs.copy()
        for _ in range(num_merges):
            pairs = count_pairs(current)
            if not pairs:
                break
            best = max(pairs.items(), key=lambda x: x[1])[0]
            self.merge_rules.append(best)
            current = merge_pair(current, best)
        
        # Build final vocabulary
        final_vocab = set(base_vocab)
        for a, b in self.merge_rules:
            final_vocab.add(a + b)
        
        self.vocab = {token: idx for idx, token in enumerate(sorted(final_vocab))}
        self.inverse_vocab = {idx: token for token, idx in self.vocab.items()}
        
        return self
    
    def encode(self, text):
        """Encode text to token IDs."""
        tokens = []
        for word in text.lower().split():
            word_tokens = encode_word(word, self.merge_rules)
            tokens.extend(word_tokens)
        return [self.vocab.get(t, self.vocab.get('<unk>', 0)) for t in tokens]
    
    def decode(self, ids):
        """Decode token IDs to text."""
        tokens = [self.inverse_vocab.get(i, '<unk>') for i in ids]
        # Join tokens, handling </w> markers for spacing
        text = ''.join(tokens).replace('</w>', ' ')
        return text.strip()
In[21]:
# Train on a larger corpus
training_texts = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "cats and dogs are pets",
    "the cat chased the dog",
    "dogs like to run and play",
    "cats prefer to sleep all day",
    "the park has many dogs and cats",
    "running is good exercise for dogs",
]

tokenizer = BPETokenizer(vocab_size=100)
tokenizer.train(training_texts)
Out[22]:
Vocabulary size: 100
Number of merge rules: 79

Sample vocabulary tokens:
['</w>', 'a', 'al', 'all', 'all</w>', 'an', 'and</w>', 'ar', 'are</w>', 'at', 'at</w>', 'ay</w>', 'c', 'cat', 'cat</w>', 'cats</w>', 'cha', 'chas', 'chase', 'chased</w>']

The tokenizer learned 79 merge rules on top of the base characters to reach our target vocabulary of 100 tokens. The vocabulary contains a mix of individual characters, common bigrams, and longer subwords. Now let's test encoding:

In[23]:
test_sentences = [
    "the cat ran",
    "dogs chase cats",
    "running dogs"
]
Out[24]:
Encoding test sentences:
--------------------------------------------------
Input:  'the cat ran'
Tokens: ['th', 'e', 'cat', 'r', 'an']
IDs:    [92, 27, 13, 73, 5]

Input:  'dogs chase cats'
Tokens: ['dog', 's', 'chase', 'cat', 's']
IDs:    [24, 83, 18, 13, 83]

Input:  'running dogs'
Tokens: ['running', 'dog', 's']
IDs:    [81, 24, 83]

The tokenizer converts text to both human-readable tokens and numeric IDs. Words like "cat" and "dog" map to single tokens, "dogs" is composed as "dog" + "s", and on this tiny corpus even "running" earned its own token. With a realistic corpus and vocabulary, rarer words would instead be split into subword pieces while only the most frequent words stayed intact.

Controlling Vocabulary Size

The vocabulary size is BPE's main hyperparameter. It controls the granularity of tokenization:

  • Small vocabulary (hundreds of tokens): More character-like, longer sequences, better compression of rare words
  • Large vocabulary (tens of thousands): More word-like, shorter sequences, more tokens to learn embeddings for
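To make this tradeoff concrete, here is a short sketch that trains our BPETokenizer at a few different vocabulary sizes and counts how many tokens the same sentence needs (it reuses training_texts and encode_word from above; the sentence and sizes are arbitrary choices, and the exact counts depend on the corpus):

sentence = "the dogs played in the park"

for size in [30, 60, 100]:
    tok = BPETokenizer(vocab_size=size).train(training_texts)
    pieces = []
    for word in sentence.lower().split():
        # Apply the merge rules learned at this vocabulary size
        pieces.extend(encode_word(word, tok.merge_rules))
    print(f"vocab_size={size:3d} -> {len(pieces):2d} tokens: {pieces}")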

Let's visualize this tradeoff:

Out[25]:
Visualization
Bar chart comparing token counts for sample sentences across vocabulary sizes from 100 to 50000.
Effect of vocabulary size on tokenization granularity. Smaller vocabularies produce more tokens per sentence but handle rare words better. Larger vocabularies create more compact representations but require more parameters. Most production models use 30,000-50,000 tokens as a balance.

Production models typically use vocabularies between 30,000 and 50,000 tokens:

Model      Vocabulary Size
GPT-2      50,257
GPT-3      50,257
GPT-4      100,277
RoBERTa    50,265
BERT       30,522 (WordPiece)
LLaMA      32,000

Let's compare how different tokenization strategies handle the same text to see why BPE offers a compelling middle ground:

Out[26]:
Visualization
Grouped bar chart comparing token counts for character-level, BPE, and word-level tokenization across multiple sample sentences.
Comparison of tokenization approaches on sample sentences. Character-level produces the longest sequences but handles any text. Word-level is most compact but fails on unknown words. BPE finds the balance, producing moderate sequence lengths while handling novel words through subword decomposition.

The comparison shows BPE's advantages clearly. Character-level tokenization creates sequences that are 5-20x longer than necessary. Word-level tokenization fails on novel words like "computational" or "antidisestablishment", mapping them to unknown tokens. BPE balances these tradeoffs: it produces reasonably compact sequences while handling any word through subword decomposition.

Pre-tokenization

Before applying BPE, most implementations pre-tokenize the text. Pre-tokenization splits text into words or word-like units, preventing BPE from merging across word boundaries.

Pre-tokenization

The process of splitting raw text into initial units (typically words) before applying subword tokenization. This prevents merges across word boundaries and often handles whitespace, punctuation, and special characters.

Different models use different pre-tokenization strategies:

In[27]:
import re

def gpt2_pretokenize(text):
    """GPT-2 style pre-tokenization using regex."""
    # Pattern matches contractions, words, numbers, punctuation
    pattern = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?\d+| ?[^\s\w]+|\s+(?!\S)|\s+"""
    return re.findall(pattern, text)

def simple_pretokenize(text):
    """Simple whitespace-based pre-tokenization."""
    return text.split()

sample = "I can't believe it's 2024 already!"
Out[28]:
Original: 'I can't believe it's 2024 already!'

Simple pre-tokenization:
  ['I', "can't", 'believe', "it's", '2024', 'already!']

GPT-2 style pre-tokenization:
  ['I', ' can', "'t", ' believe', ' it', "'s", ' 2024', ' already', '!']

Simple whitespace splitting keeps "can't" as a single unit, while GPT-2's regex-based approach separates contractions into their components: "can" and "'t". GPT-2 also preserves leading spaces attached to words (encoded as Ġ in the final vocabulary), allowing the model to distinguish between a word at the start of a sentence versus the middle.

Using Real BPE Tokenizers

In practice, you'll use optimized implementations from the tokenizers library or model-specific tokenizers from Hugging Face. Let's compare our implementation to a production tokenizer:

In[29]:
from transformers import GPT2Tokenizer

# Load GPT-2's tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

test_text = "The university students studied computational linguistics"
Out[30]:
Input: 'The university students studied computational linguistics'

GPT-2 tokens: ['The', 'Ġuniversity', 'Ġstudents', 'Ġstudied', 'Ġcomputational', 'Ġlingu', 'istics']
Token IDs: [464, 6403, 2444, 9713, 31350, 20280, 3969]
Token count: 7

GPT-2 tokenizes this sentence into 7 tokens. Notice how the tokenizer produces tokens like 'Ġuniversity', where the Ġ character (a special representation of a leading space) marks a word boundary. Common words like "The", "university", and "students" remain intact, and even "computational" gets its own token, while the rarer "linguistics" is split into 'Ġlingu' and 'istics'.

Let's analyze the token length distribution across a sample of English text:

Out[31]:
Visualization
Histogram showing token length distribution with peak at 3-4 characters.
Distribution of token lengths (in characters) from GPT-2 tokenization of sample text. Most tokens are 2-6 characters, reflecting the subword nature of BPE. Single-character tokens typically represent punctuation or rare characters, while longer tokens capture common words and morphemes.

The distribution reveals BPE's subword nature: most tokens are 2-6 characters long, representing common morphemes and short words. Tokens longer than 6 characters are typically complete common words, while single-character tokens often represent punctuation or rare characters.
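If you want to reproduce this kind of analysis, a quick sketch (assuming the gpt2_tokenizer loaded above; the sample sentence is illustrative) is to tokenize some text and count token lengths after stripping the Ġ space marker:

from collections import Counter

sample_text = "Tokenization determines how language models see words, numbers, and code."
token_lengths = [len(tok.lstrip('Ġ')) for tok in gpt2_tokenizer.tokenize(sample_text)]
print(Counter(token_lengths).most_common())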

Let's examine how it handles challenging cases:

In[32]:
challenging_texts = [
    "COVID-19 vaccination rates",
    "supercalifragilisticexpialidocious",
    "def calculate_user_score():",
    "πŸŽ‰ Happy New Year! 🎊"
]
Out[33]:
GPT-2 tokenization of challenging inputs:
------------------------------------------------------------

'COVID-19 vaccination rates'
  → ['CO', 'VID', '-', '19', 'Ġvaccination', 'Ġrates']
  → 6 tokens

'supercalifragilisticexpialidocious'
  → ['super', 'cal', 'if', 'rag', 'il', 'ist', 'ice', 'xp', 'ial', 'id', 'ocious']
  → 11 tokens

'def calculate_user_score():'
  → ['def', 'Ġcalculate', '_', 'user', '_', 'score', '():']
  → 7 tokens

'🎉 Happy New Year! 🎊'
  → ['ðŁ', 'İ', 'ī', 'ĠHappy', 'ĠNew', 'ĠYear', '!', 'ĠðŁ', 'İ', 'Ĭ']
  → 10 tokens

The tokenizer handles:

  • Novel compounds: Breaking "COVID-19" into sensible pieces
  • Long words: Decomposing the famously long word into subwords
  • Code: Splitting camelCase and snake_case identifiers
  • Emojis: Either including them in vocabulary or splitting into bytes

BPE Variations

Several variations of BPE exist, each with different characteristics:

Byte-Level BPE

Standard BPE operates on characters, but byte-level BPE operates directly on UTF-8 bytes. This has a key advantage: the base vocabulary is always exactly 256 tokens (one per byte), regardless of language or script.

In[34]:
# Demonstrate byte-level representation
text = "Hello δΈ–η•Œ 🌍"
utf8_bytes = text.encode('utf-8')
Out[35]:
Text: 'Hello 世界 🌍'
UTF-8 bytes: [72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140, 32, 240, 159, 140, 141]
Number of bytes: 17

Byte breakdown:
  'H' → [72]
  'e' → [101]
  'l' → [108]
  'l' → [108]
  'o' → [111]
  ' ' → [32]
  '世' → [228, 184, 150]
  '界' → [231, 149, 140]
  ' ' → [32]
  '🌍' → [240, 159, 140, 141]

ASCII characters like "H" and "e" require just one byte each, while Chinese characters need three bytes and the emoji needs four. Byte-level BPE treats each of these bytes as base tokens, then learns merges on top. This is why GPT-2 and GPT-3 can handle any Unicode text without special preprocessing: the 256-byte base vocabulary covers all possible inputs.
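As a minimal sketch of what operating on bytes means (this is just the underlying idea, not GPT-2's actual byte-to-unicode remapping), the base tokens are simply the 256 possible byte values:

def byte_level_base_tokens(text):
    """Split text into its raw UTF-8 bytes, the base units of byte-level BPE."""
    return [bytes([b]) for b in text.encode('utf-8')]

print(byte_level_base_tokens("Hi 世界"))
# ASCII letters map to one byte each; each Chinese character expands to three bytes.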

BPE with Dropout

During training, some implementations randomly skip merge rules with a small probability. This creates multiple valid tokenizations for the same word, acting as data augmentation:

In[36]:
import random

def encode_with_dropout(word, merge_rules, dropout=0.1):
    """Encode with random dropout of merge rules."""
    tokens = list(word)
    
    for pair in merge_rules:
        if random.random() < dropout:
            continue  # Skip this merge rule
            
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
                tokens = tokens[:i] + [pair[0] + pair[1]] + tokens[i + 2:]
            else:
                i += 1
    
    return tokens

# Show multiple tokenizations of the same word
word = "lowest"
random.seed(42)
Out[37]:
Multiple tokenizations of 'lowest' with dropout:
  ['l', 'o', 'we', 'st']
  ['lo', 'w', 'e', 'st']
  ['lo', 'w', 'e', 'st']
  ['lowe', 'st']
  ['l', 'o', 'we', 'st']

With 20% dropout, the same word produces different tokenizations across runs. Some runs apply all merges, producing compact tokenizations, while others skip merges and produce more fragmented outputs. This variation during training acts as regularization, helping models become robust to tokenization differences at inference time.

The Mathematical Perspective

At its core, BPE is a compression algorithm. The goal is simple: represent a corpus using fewer tokens. But why does merging frequent pairs achieve this, and what makes the greedy approach work so well in practice?

From Compression to Tokenization

Consider a corpus $C$ represented as a sequence of tokens. Initially, these tokens are individual characters, so a corpus with 1 million characters contains 1 million tokens. Every time we merge two adjacent tokens into one, we reduce the total token count. The question is: which merge gives us the biggest reduction?

Suppose we merge the pair $(a, b)$ into a new token $ab$. Every occurrence of $a$ followed by $b$ becomes a single token instead of two. If this pair appears 10,000 times in the corpus, we replace 20,000 tokens (10,000 pairs of two) with 10,000 tokens (10,000 single merged tokens), for a net reduction of 10,000 tokens.

This insight leads directly to the BPE merge criterion: choose the pair that appears most frequently, because that merge reduces the token count the most.
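We can check this on the toy corpus from earlier: the number of tokens saved by a merge equals the frequency of the merged pair. A quick sketch, reusing word_freqs, count_pairs, and merge_pair from above:

pairs = count_pairs(word_freqs)
best_pair, best_freq = max(pairs.items(), key=lambda x: x[1])

def total_tokens(freqs):
    """Total number of tokens in the corpus representation."""
    return sum(len(word.split()) * freq for word, freq in freqs.items())

saved = total_tokens(word_freqs) - total_tokens(merge_pair(word_freqs, best_pair))
print(best_pair, best_freq)   # ('w', '</w>') with frequency 6
print(saved)                  # savings equal the pair's frequency: 6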

The Greedy Merge Formula

Let $\text{freq}(a, b)$ denote how many times tokens $a$ and $b$ appear consecutively in the corpus. The optimal merge at each iteration is:

$$(a^*, b^*) = \arg\max_{(a,b)} \text{freq}(a, b)$$

where:

  • $(a^*, b^*)$: the pair selected for merging at this iteration
  • $\text{freq}(a, b)$: the count of adjacent occurrences of $a$ followed by $b$

This formula is simple but effective. We don't need to predict future merges or optimize globally. We simply count pairs, pick the winner, merge it, and repeat. Each merge is locally optimal: it provides the maximum possible reduction in token count given the current vocabulary.

Why Greedy Works

You might wonder: is greedy selection actually optimal? Mathematically, no. There exist contrived examples where a different merge sequence would achieve better compression. However, BPE's greedy approach works well in practice for two reasons:

  1. Language has hierarchical structure. Characters form morphemes, morphemes form words, and words form phrases. Frequent character pairs often correspond to meaningful linguistic units. By merging them first, BPE naturally discovers this hierarchy.

  2. Local decisions compound well. Merging "e" + "s" into "es" doesn't prevent us from later merging "es" + "t" into "est". The greedy choice at each step creates building blocks for future merges.

The key insight is that pair frequencies in natural language follow a Zipf-like distribution: a few pairs are extremely common, while most pairs are rare. This heavy-tailed distribution means the greedy choice is often dramatically better than alternatives.
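You can observe this heavy-tailed pattern even on a small English sample. Here is a quick sketch (the text is illustrative) that counts adjacent character pairs and prints both ends of the ranking:

from collections import Counter

sample = ("the cat sat on the mat and the dog ran in the park "
          "while the children played near the old stone house")
pair_counts = Counter(zip(sample, sample[1:]))
ranked = pair_counts.most_common()
print(ranked[:5])    # a handful of very frequent pairs dominate
print(ranked[-5:])   # most pairs appear only once or twice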

Out[38]:
Visualization
Log-log plot showing pair frequency decreasing rapidly with rank, demonstrating Zipf-like distribution of character pairs.
Pair frequency distribution follows a heavy-tailed pattern. A small number of character pairs (like 'th', 'he', 'in') dominate the corpus, while the vast majority of pairs appear rarely. This distribution explains why greedy selection works: the top pair is often significantly more frequent than alternatives.

The left plot shows the stark difference between top pairs and the rest: the most frequent pair appears far more often than typical pairs. The right plot uses log-log scaling to reveal the Zipf-like distribution. When frequencies follow this pattern, greedy selection captures most of the compression benefit in early iterations.

Let's visualize how compression savings relate to pair frequency:

Out[39]:
Visualization
Scatter plot showing merge iteration vs pair frequency on log scale, with exponential decay pattern.
Relationship between pair frequency and compression savings. Each point represents a merge operation; the y-axis shows the frequency of the pair being merged (which equals the token savings). The logarithmic y-scale reveals Zipf-like behavior: a few pairs are very frequent, while most are rare.

This visualization clearly shows the diminishing returns of BPE merging. Early merges capture pairs that appear many times, yielding large compression gains. As training progresses, we run out of high-frequency pairs and resort to merging less common patterns.

Computational Complexity

BPE's efficiency makes it practical for large corpora. Let $n$ be the total number of token occurrences in the corpus and $m$ be the number of merge operations. Each iteration requires:

  • Counting all adjacent pairs: $O(n)$ using a hash table
  • Finding the most frequent pair: $O(p)$, where $p$ is the number of unique pairs
  • Applying the merge: $O(n)$ in the worst case

With careful implementation using hash tables and priority queues, the total complexity is approximately $O(n \cdot m)$. For a corpus of 1 billion tokens and 50,000 merges, this remains tractable on modern hardware. Optimized implementations maintain incremental pair counts, avoiding full rescans after each merge.
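As a rough sketch of that incremental bookkeeping (simplified; production libraries such as Hugging Face's tokenizers add further optimizations), note that only the pairs adjacent to a merge site change, so their counts can be adjusted locally instead of recounting the whole corpus:

def apply_merge_incremental(symbols, pair, pair_counts, freq):
    """Merge `pair` inside one word's token list, updating pair counts in place.

    `symbols` is the word's current token list, `pair_counts` is a
    collections.Counter of pair frequencies across the corpus, and `freq` is
    how many times this word occurs. Only pairs touching a merge site change.
    """
    merged = pair[0] + pair[1]
    i = 0
    while i < len(symbols) - 1:
        if (symbols[i], symbols[i + 1]) == pair:
            # Left and right neighbours lose their old pairings and gain
            # new pairings with the merged token.
            if i > 0:
                pair_counts[(symbols[i - 1], symbols[i])] -= freq
                pair_counts[(symbols[i - 1], merged)] += freq
            if i + 2 < len(symbols):
                pair_counts[(symbols[i + 1], symbols[i + 2])] -= freq
                pair_counts[(merged, symbols[i + 2])] += freq
            pair_counts[pair] -= freq
            symbols[i:i + 2] = [merged]
        else:
            i += 1
    return symbols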

The practical implication is that BPE training scales well. You can train a production-quality tokenizer on massive web corpora in hours rather than days, which is one reason BPE became the dominant approach for large language models.

Limitations and Challenges

Despite its success, BPE has several limitations:

Tokenization inconsistency: The same substring can be tokenized differently depending on its position within a word. For example, "low" as a standalone word maps to a single word-final token, but the same characters inside "fellow" might be split across tokens entirely, for instance as ['fell', 'ow']. This positional sensitivity can affect how models process morphologically related words.

Suboptimal for some languages: BPE was designed for alphabetic languages. For logographic scripts like Chinese, where each character already represents a morpheme, character-level tokenization might be more appropriate.

No linguistic knowledge: BPE discovers patterns purely from frequency. It might merge semantically unrelated characters if they happen to co-occur frequently, while splitting meaningful units that are rare.

Training data sensitivity: The learned vocabulary depends heavily on the training corpus. A tokenizer trained on English text will poorly handle French, even if the languages share an alphabet.

Out[40]:
Visualization
Horizontal bars showing token counts for English, code, and multilingual text samples using GPT-2 tokenizer.
BPE tokenization across different text types. Technical terms, code, and multilingual text may produce suboptimal tokenizations if the training corpus didn't include similar content. This highlights the importance of training tokenizers on diverse, representative corpora.

Key Parameters

When training or using BPE tokenizers, these parameters have the most significant impact:

Parameter            Typical Values    Effect
vocab_size           30,000-50,000     Controls granularity: smaller vocabularies produce more tokens per word but handle rare words better
min_frequency        2-5               Minimum times a pair must appear to be merged; higher values produce more stable vocabularies
dropout              0.0-0.1           Probability of skipping merge rules during encoding; adds regularization but slows training
end_of_word_suffix   </w>, Ġ           Marker distinguishing word-final from word-internal subwords

Vocabulary size is the primary hyperparameter. Production models typically use 30,000-50,000 tokens, balancing compression efficiency against embedding table size. Smaller vocabularies (around 10,000) work well for domain-specific applications where the text is more predictable.

Pre-tokenization strategy significantly affects the learned vocabulary. GPT-2's regex-based pre-tokenization preserves contractions and attaches spaces to tokens, while simpler whitespace splitting may miss these distinctions.

Summary

Byte Pair Encoding solves the vocabulary problem through a simple iterative process. Starting from individual characters, it merges the most frequent adjacent pairs until reaching a target vocabulary size. The result is a set of subword tokens that can represent any word through composition, balancing the vocabulary explosion of word-level tokenization against the long sequences of character-level approaches.

Key takeaways:

  • BPE learns merge rules by iteratively combining the most frequent adjacent pairs
  • The vocabulary size controls granularity: smaller vocabularies produce more tokens per word but handle rare words better
  • Encoding applies merge rules in order, decomposing words into learned subword units
  • Unknown words are handled gracefully by falling back to smaller units or individual characters
  • Pre-tokenization splits text into words before BPE, preventing cross-word merges
  • Byte-level BPE operates on UTF-8 bytes, providing a fixed 256-token base vocabulary for any language

BPE's simplicity and effectiveness have made it the foundation for tokenization in GPT models and many other large language models. In the next chapter, we'll explore WordPiece, a variant that uses a different merging criterion and powers BERT-family models.



About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
