WordPiece Tokenization
Introduction
WordPiece is a subword tokenization algorithm that powers BERT and many other transformer-based models. While it shares the iterative merge approach with Byte Pair Encoding (BPE), WordPiece makes a subtle but important change: instead of merging the most frequent pairs, it merges pairs that maximize the likelihood of the training data.
This difference matters. BPE's frequency-based criterion treats all pairs equally regardless of context. A pair appearing 100 times in rare words contributes the same as one appearing 100 times in common words. WordPiece's likelihood objective weights pairs by how much they improve the overall model, giving more influence to patterns that appear in frequent contexts.
The result is a tokenization scheme that tends to produce slightly different vocabularies than BPE, often capturing more linguistically meaningful units. You'll recognize WordPiece tokens by their distinctive ## prefix, which marks subword units that continue a word rather than starting one.
Technical Deep Dive
To understand what makes WordPiece different from BPE, we need to think carefully about what it means for a tokenization to be "good." BPE takes a straightforward approach: frequently occurring pairs should become single tokens. But frequency alone misses something important. Consider two pairs that each appear 100 times in a corpus:
- Pair A appears in a very common word that occurs 1,000 times
- Pair B appears in a rare word that occurs only 100 times
Which pair has a stronger "bond"? Pair B seems more tightly associated: whenever that rare word appears, the pair appears too. Pair A's 100 occurrences represent only 10% of its opportunities. WordPiece captures this intuition by asking not "how often does this pair occur?" but "how often does this pair occur relative to what we'd expect by chance?"
This leads us to a likelihood-based formulation that we'll build up step by step.
Building the Likelihood Model
The foundation is a simple probabilistic model of text. We treat the tokenized corpus as a sequence of independent tokens, each drawn from a probability distribution over the vocabulary. This is called a unigram language model. The term "unigram" reflects that each token is modeled independently, without considering its neighbors.
Given a vocabulary and a training corpus tokenized into subword units, the likelihood of the corpus under this model is the product of probabilities for each token:

$$\mathcal{L} = \prod_{i=1}^{N} P(t_i)$$

where:

- $\mathcal{L}$: the likelihood of the corpus, indicating how probable the entire tokenized text is under our model
- $N$: the total number of tokens in the tokenized corpus
- $t_i$: the $i$-th token in the tokenized corpus
- $P(t_i)$: the probability of token $t_i$, estimated from the training data

The product arises from the independence assumption: the probability of seeing tokens $t_1, t_2, \ldots, t_N$ in sequence equals $P(t_1) \times P(t_2) \times \cdots \times P(t_N)$.
How do we estimate each token's probability? The simplest approach is maximum likelihood estimation: count how often each token appears and divide by the total:

$$P(t) = \frac{\mathrm{count}(t)}{\sum_{t' \in V} \mathrm{count}(t')}$$

where:

- $P(t)$: the probability assigned to token $t$
- $\mathrm{count}(t)$: the number of times token $t$ appears in the tokenized corpus
- $\sum_{t' \in V} \mathrm{count}(t')$: the total count of all tokens in the vocabulary $V$, which normalizes the probability

This formula simply says: a token's probability equals its count divided by the total token count, which is the fraction of all tokens that are $t$. If "the" appears 50,000 times in a corpus of 1,000,000 tokens, then $P(\text{the}) = 50{,}000 / 1{,}000{,}000 = 0.05$.
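To make this concrete, here is a small sketch (the token list is illustrative) that estimates unigram probabilities from counts and evaluates the corpus log-likelihood; logs are used to avoid underflow when multiplying many small probabilities:

```python
from collections import Counter
import math

# A toy tokenized corpus: a flat list of tokens (illustrative)
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

counts = Counter(tokens)        # count(t) for every token
total = sum(counts.values())    # total number of tokens N

# Maximum likelihood estimate: P(t) = count(t) / total
probs = {t: c / total for t, c in counts.items()}
print(probs["the"])             # 3 / 8 = 0.375

# Corpus log-likelihood under the unigram model: sum of log P(t_i)
log_likelihood = sum(math.log(probs[t]) for t in tokens)
print(log_likelihood)
```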
From Likelihood to Merge Decisions
Now comes the key insight. When we merge a pair $(a, b)$ into a new token $ab$, the likelihood changes. Every occurrence where $a$ and $b$ appeared adjacent is now represented by the single token $ab$ instead of two separate tokens. WordPiece selects the merge that maximizes this likelihood increase.
The key question during training is: which pair of adjacent tokens should we merge next? To answer this, we need to compute how much each possible merge would improve the corpus likelihood. WordPiece does this by computing a score for each candidate pair $(a, b)$:

$$\mathrm{score}(a, b) = \frac{\mathrm{count}(ab)}{\mathrm{count}(a) \times \mathrm{count}(b)}$$

where:

- $\mathrm{score}(a, b)$: the merge priority score for the pair, where higher values mean this pair should be merged sooner
- $\mathrm{count}(ab)$: the number of times tokens $a$ and $b$ appear adjacent to each other in the corpus
- $\mathrm{count}(a)$: the total number of times token $a$ appears in the corpus
- $\mathrm{count}(b)$: the total number of times token $b$ appears in the corpus

This score captures the association strength between tokens $a$ and $b$. To build intuition, consider what the denominator represents. If tokens appeared randomly and independently throughout the corpus, how often would we expect to see $a$ followed by $b$? The product $\mathrm{count}(a) \times \mathrm{count}(b)$ gives a baseline expectation. Dividing the actual co-occurrence count by this baseline measures the "surprise" of seeing them together, quantifying how much more often they appear adjacent than random chance would predict.
A high score means $a$ and $b$ have a strong affinity for each other. They're not just frequent; they're specifically associated. This is exactly what we want in a subword unit.
The WordPiece score is closely related to Pointwise Mutual Information (PMI), a classic measure of association in computational linguistics. PMI measures how much more likely two items appear together compared to if they were independent. The WordPiece score is proportional to the probability ratio that underlies PMI.
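To make the connection concrete, use the unigram estimates from above, with $N$ the total token count and $\mathrm{count}(ab)/N$ taken as a rough estimate of the pair probability:

$$\mathrm{score}(a, b) = \frac{\mathrm{count}(ab)}{\mathrm{count}(a)\,\mathrm{count}(b)} \approx \frac{1}{N} \cdot \frac{P(a, b)}{P(a)\,P(b)}, \qquad \mathrm{PMI}(a, b) = \log\frac{P(a, b)}{P(a)\,P(b)} \approx \log\bigl(N \cdot \mathrm{score}(a, b)\bigr).$$

Because $N$ is the same for every candidate pair, ranking pairs by the WordPiece score is equivalent to ranking them by PMI.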
Deriving the Score from Likelihood
Let's work through why this score maximizes likelihood. Seeing the derivation helps build deeper understanding.
Step 1: Identify what changes during a merge. When we merge $(a, b)$ into $ab$:
- We remove the occurrences of $a$ and $b$ where they appeared as a pair
- We add $\mathrm{count}(ab)$ occurrences of the new token $ab$
- Occurrences of $a$ and $b$ that weren't adjacent remain unchanged
Step 2: Compute the likelihood ratio. The ratio of the new likelihood to the old likelihood captures the improvement from the merge. For the positions where $a\ b$ becomes $ab$, the old likelihood contribution was $P(a) \times P(b)$ for each occurrence, and the new contribution is $P(ab)$. The likelihood ratio for this merge is proportional to:

$$\left(\frac{P(ab)}{P(a)\,P(b)}\right)^{\mathrm{count}(ab)}$$

where:

- $P(ab)$: the probability of the new merged token after the merge
- $P(a)$ and $P(b)$: the probabilities of the original tokens before merging
- $\mathrm{count}(ab)$: the number of merge operations (how many times the pair appears)

The exponent $\mathrm{count}(ab)$ appears because each occurrence of the pair contributes independently to the likelihood.
Step 3: Simplify to the score formula. Taking the logarithm converts products to sums and exponents to multiplications, making the expression easier to work with. After simplifying (and noting that the probabilities are proportional to counts), the pair that maximizes the likelihood improvement is the one that maximizes $\frac{\mathrm{count}(ab)}{\mathrm{count}(a) \times \mathrm{count}(b)}$, which is exactly the WordPiece score.
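Written out, taking the logarithm of the likelihood ratio above gives

$$\Delta \log \mathcal{L} \approx \mathrm{count}(ab)\,\bigl[\log P(ab) - \log P(a) - \log P(b)\bigr],$$

and since each probability is proportional to its count, the bracketed term grows with $\frac{\mathrm{count}(ab)}{\mathrm{count}(a)\,\mathrm{count}(b)}$, the WordPiece score.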
This derivation reveals that WordPiece's merge criterion isn't arbitrary; it emerges directly from the goal of maximizing data likelihood under a unigram model.
The ## Prefix Notation
WordPiece uses a special notation to distinguish word-initial tokens from continuation tokens. Tokens that continue a word (not at the start) are prefixed with ##.
For example, tokenizing "unhappiness":
un ##happi ##ness
This tells us:

- `un` starts the word
- `##happi` continues from the previous token
- `##ness` continues from the previous token
The ## prefix serves two purposes:

- Disambiguation: The token `##ing` (word continuation) is different from `ing` (word start). This matters because "ing" at the start of a word (like "ingot") has different distributional properties than "-ing" as a suffix.
- Reconstruction: During decoding, we can easily reconstruct the original text by removing `##` prefixes and joining tokens, as the short sketch after this list shows.
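Here is a minimal sketch of that reconstruction step (the function name `detokenize` is just illustrative):

```python
def detokenize(tokens):
    """Rebuild text from WordPiece tokens: strip ## and join continuations to the previous piece."""
    words = []
    for token in tokens:
        if token.startswith("##"):
            words[-1] += token[2:]      # continuation: append to the current word
        else:
            words.append(token)         # word-initial token starts a new word
    return " ".join(words)

print(detokenize(["un", "##happi", "##ness"]))   # unhappiness
print(detokenize(["play", "##ing", "games"]))    # playing games
```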
The Greedy Tokenization Algorithm
Once we have a trained WordPiece vocabulary, encoding new text follows a greedy longest-match algorithm:
1. For each word in the input:
   a. Start at the beginning of the word
   b. Find the longest token in the vocabulary that matches the current position
   c. Add that token to the output (with the `##` prefix if not at the word start)
   d. Move past the matched characters
   e. Repeat until the word is consumed
2. If at any point no vocabulary token matches (not even single characters), output the special `[UNK]` token for the entire word
This greedy approach is fast but not guaranteed to find the globally optimal tokenization. However, it works well in practice and is computationally efficient.
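Here is a sketch of that longest-match loop for a single word; the toy vocabulary is illustrative, not BERT's:

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match WordPiece encoding of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until a vocabulary hit
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece    # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]          # nothing matches: the whole word becomes [UNK]
        tokens.append(match)
        start = end
    return tokens

toy_vocab = {"un", "##happi", "##ness", "play", "##ing"}
print(wordpiece_encode("unhappiness", toy_vocab))   # ['un', '##happi', '##ness']
print(wordpiece_encode("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_encode("zzz", toy_vocab))           # ['[UNK]']
```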
Handling Unknown Characters
Unlike BPE, which typically includes all individual characters in its base vocabulary, WordPiece implementations often have a fixed character set. When encountering characters outside this set:
- Character-level fallback: Some implementations add unknown characters to the vocabulary during training
- `[UNK]` replacement: Others map entire words containing unknown characters to `[UNK]`
- Normalization: Pre-processing can normalize or remove problematic characters
BERT's WordPiece vocabulary, for instance, was trained primarily on English and has limited coverage of characters from other scripts.
Worked Example
Let's trace through WordPiece training on a small corpus to see how the likelihood-based scoring works.
Consider this training corpus with word frequencies:
"low" (5), "lower" (2), "newest" (6), "widest" (3)
Step 1: Initialization
We start with individual characters (plus the end-of-word marker if used):
Vocabulary: ['d', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w']
Initial tokenization:
l o w (5)
l o w e r (2)
n e w e s t (6)
w i d e s t (3)
Step 2: Calculate Merge Scores
For each adjacent pair, we calculate the WordPiece score. Recall the formula:

$$\mathrm{score}(a, b) = \frac{\mathrm{count}(ab)}{\mathrm{count}(a) \times \mathrm{count}(b)}$$

where $\mathrm{count}(ab)$ is how often $a$ and $b$ appear adjacent, and $\mathrm{count}(a)$, $\mathrm{count}(b)$ are the total occurrences of each token. Let's compute some scores:
- Pair (e, s): appears in "newest" (6) and "widest" (3) = 9 times
  - count(e) = 2 + 12 + 3 = 17 (one "e" in "lower", two in "newest", one in "widest")
  - count(s) = 6 + 3 = 9
  - score = 9 / (17 × 9) ≈ 0.059
- Pair (s, t): appears in "newest" (6) and "widest" (3) = 9 times
  - count(s) = 9
  - count(t) = 6 + 3 = 9
  - score = 9 / (9 × 9) ≈ 0.111
- Pair (l, o): appears in "low" (5) and "lower" (2) = 7 times
  - count(l) = 5 + 2 = 7
  - count(o) = 5 + 2 = 7
  - score = 7 / (7 × 7) ≈ 0.143
Of the pairs computed above, (l, o) has the highest score, so we merge it first.
Step 3: After First Merge
Vocabulary: ['d', 'e', 'i', 'l', 'lo', 'n', 'o', 'r', 's', 't', 'w']
Updated tokenization:
lo w (5)
lo w e r (2)
n e w e s t (6)
w i d e s t (3)
We continue this process until reaching our target vocabulary size.
Contrast with BPE
With pure frequency counting (BPE), we might merge (e, s) or (s, t) first since they each appear 9 times, compared to (l, o)'s 7 times. The likelihood-based score adjusts for the base frequencies of each token, preferring pairs where the co-occurrence is more "surprising" given the individual token frequencies.
Code Implementation
Let's implement WordPiece from scratch to understand the algorithm deeply. We'll build a simplified version that captures the core concepts.
First, we set up the basic data structures: the toy corpus, the per-word character splits, and the initial character vocabulary.
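To stay consistent with the worked example above, this sketch represents pieces as plain characters and omits the `##` continuation prefix; names like `word_freqs` and `splits` are illustrative.

```python
from collections import defaultdict

# Toy corpus: word -> frequency (the same corpus as the worked example)
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

# Each word starts out split into single characters
# (a full WordPiece trainer would mark non-initial characters with ##)
splits = {word: list(word) for word in word_freqs}

# Initial vocabulary: every character that appears in the corpus
vocab = sorted({ch for word in word_freqs for ch in word})
print(vocab)
```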
Now let's add the scoring and merge selection logic:
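This continues the sketch; `compute_pair_scores` implements the count-based score count(ab) / (count(a) × count(b)) directly.

```python
def compute_pair_scores(splits, word_freqs):
    """Score every adjacent pair with count(ab) / (count(a) * count(b))."""
    token_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for word, freq in word_freqs.items():
        pieces = splits[word]
        for piece in pieces:
            token_counts[piece] += freq
        for a, b in zip(pieces, pieces[1:]):
            pair_counts[(a, b)] += freq
    return {
        pair: count / (token_counts[pair[0]] * token_counts[pair[1]])
        for pair, count in pair_counts.items()
    }

def best_pair(scores):
    """Pick the highest-scoring pair (ties broken arbitrarily)."""
    return max(scores, key=scores.get)

scores = compute_pair_scores(splits, word_freqs)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])
```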
Finally, let's add the training loop:
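Still a sketch: the loop repeatedly merges the best-scoring pair until the vocabulary reaches the target size.

```python
def merge_pair(pair, splits):
    """Replace every adjacent occurrence of the pair with the merged token."""
    a, b = pair
    for word, pieces in splits.items():
        merged = []
        i = 0
        while i < len(pieces):
            if i < len(pieces) - 1 and pieces[i] == a and pieces[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        splits[word] = merged
    return splits

def train_wordpiece(word_freqs, vocab_size=20):
    """Greedy WordPiece training: repeatedly merge the highest-scoring pair."""
    splits = {word: list(word) for word in word_freqs}
    vocab = sorted({ch for word in word_freqs for ch in word})
    while len(vocab) < vocab_size:
        scores = compute_pair_scores(splits, word_freqs)
        if not scores:
            break                       # nothing left to merge
        pair = best_pair(scores)
        splits = merge_pair(pair, splits)
        vocab.append(pair[0] + pair[1])
    return vocab, splits
```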
Let's test our implementation:
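Running the sketch (exact merges, and therefore the tokenizations, depend on how score ties are broken, so your output may differ slightly from the discussion below):

```python
vocab, splits = train_wordpiece(word_freqs, vocab_size=20)
print(vocab)

def encode_word(word, vocab):
    """Greedy longest-match encoding with the learned vocabulary (no ## prefix in this sketch)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["[UNK]"]            # nothing matched, not even a single character
        tokens.append(word[start:end])
        start = end
    return tokens

for word in ["low", "lower", "newest", "widest", "lowest", "newer"]:
    print(word, "->", encode_word(word, vocab))
```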
The trained vocabulary contains 20 tokens, including both word-initial characters and merged subword units. Notice how the algorithm has learned tokens like "lo" (from "low"/"lower"), "est" (from "newest"/"widest"), and other common patterns from our corpus.
Now let's test how the trained tokenizer handles various words, including some it hasn't seen before:
The tokenizer successfully handles both training words and novel words like "lowest" and "newer". Words from training ("low", "lower", "newest", "widest") tokenize efficiently using learned subwords. Novel words are decomposed into known subword units. For example, "lowest" uses the learned "lo" and "est" patterns, while "newer" combines "n", "e", "w", "e", "r" since these character-level tokens are in the vocabulary.
Using Hugging Face Tokenizers
In practice, you'll use optimized implementations like the Hugging Face tokenizers library. Let's see how to train and use a WordPiece tokenizer:
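A sketch along these lines; the training sentences here are illustrative stand-ins for the original corpus:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders

# Illustrative training corpus: a handful of sentences about machine learning
corpus = [
    "machine learning is a subfield of artificial intelligence",
    "deep learning uses neural networks with many layers",
    "transformers have been transformational for machine learning",
    "learning representations from data is the core of machine learning",
    "attention mechanisms let models focus on relevant context",
]

# Build a WordPiece model and train it with a small target vocabulary
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.WordPiece()

trainer = trainers.WordPieceTrainer(
    vocab_size=100,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.get_vocab_size())
```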
With a vocabulary of 100 tokens trained on our small corpus, the tokenizer has learned common words and subword patterns. Let's test it on some example sentences:
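Reusing the tokenizer trained above (the test sentences are illustrative):

```python
for text in [
    "machine learning is transformational",
    "transformers learn representations",
]:
    encoding = tokenizer.encode(text)
    print(text)
    print("  tokens: ", encoding.tokens)
    print("  decoded:", tokenizer.decode(encoding.ids))
```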
Common words like "machine" and "learning" that appeared frequently in training are kept as whole tokens, while less common words like "transformational" are split into subword units. The ## prefix on continuation tokens allows perfect reconstruction of the original text.
The Hugging Face tokenizer provides a production-ready implementation with special token handling, efficient encoding, and seamless integration with transformer models.
Comparing WordPiece and BPE
The table below compares how WordPiece and BPE tokenize the same words. Notice the key differences: WordPiece uses the ## prefix to mark continuation tokens, while BPE treats all tokens uniformly. WordPiece also tends to find more morphologically meaningful boundaries.
| Word | WordPiece Tokens | BPE Tokens |
|---|---|---|
| playing | play ##ing | play ing |
| unhappiness | un ##happi ##ness | un happ iness |
| internationalization | inter ##nation ##al ##ization | inter national ization |
| transformer | transform ##er | trans former |
| embedding | embed ##ding | emb ed ding |
The likelihood-based scoring in WordPiece produces different token boundaries than BPE's frequency-based approach. For example, WordPiece keeps "transform" intact as a meaningful unit, while BPE splits it into "trans" and "former". Similarly, WordPiece isolates the stem "happi" (the spelling of "happy" inside "unhappiness"), while BPE produces "happ", which is less linguistically meaningful.
WordPiece in BERT
WordPiece became famous as the tokenization algorithm behind BERT. Understanding how BERT uses WordPiece helps explain its practical impact.
BERT's Vocabulary
BERT's original English model uses a 30,522-token vocabulary trained on Wikipedia and BookCorpus. The vocabulary includes:

- Special tokens: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- Word-initial tokens: Complete words and word prefixes without `##`
- Continuation tokens: Subword units prefixed with `##`
Let's examine how this vocabulary is structured by counting the different token types:
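One way to inspect the structure, using the `transformers` library (this downloads the `bert-base-uncased` vocabulary on first use; the variable names are illustrative):

```python
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_vocab = bert_tokenizer.get_vocab()     # token string -> id

# Bracketed tokens include [CLS], [SEP], etc. as well as the [unusedNNN] placeholders
special = [t for t in bert_vocab if t.startswith("[") and t.endswith("]")]
continuation = [t for t in bert_vocab if t.startswith("##")]
word_initial = [t for t in bert_vocab if t not in special and not t.startswith("##")]

print("vocabulary size:        ", len(bert_vocab))
print("bracketed/special tokens:", len(special))
print("continuation (##) tokens:", len(continuation))
print("word-initial tokens:     ", len(word_initial))
```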
The vocabulary splits roughly evenly between word-initial tokens and continuation tokens. This balance allows BERT to represent common words as single tokens while still handling rare or complex words through subword decomposition.
Tokenization in Action
Let's see how BERT tokenizes various text examples:
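Using the `bert_tokenizer` loaded above (the outputs are discussed next):

```python
for text in [
    "Hello world!",
    "Machine learning revolutionized natural language processing.",
    "antidisestablishmentarianism",
]:
    print(text)
    print("  ", bert_tokenizer.tokenize(text))
```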
Common words like "hello", "world", and "machine" appear as single tokens because they're frequent enough to have their own vocabulary entries. More complex words get decomposed: "revolutionized" splits into "revolution" + "##ized", preserving the morphological structure. The extremely long word "antidisestablishmentarianism" is broken into many subword pieces, but each piece is a meaningful unit that BERT has learned during pre-training.
Handling Special Cases
WordPiece in BERT includes several practical handling mechanisms:
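A few probes with the same `bert_tokenizer`, covering accents, numbers, punctuation, and a non-Latin script:

```python
for text in [
    "café",                  # accented Latin characters
    "naïve",
    "3.14159 costs $5!",     # numbers and punctuation
    "你好",                   # Chinese characters (see the coverage note below)
]:
    print(repr(text), "->", bert_tokenizer.tokenize(text))
```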
The tokenizer handles accented Latin characters ("café", "naïve") by decomposing them into base characters and accent marks. Numbers and punctuation are also tokenized character-by-character. Chinese characters may produce [UNK] tokens in the English-only BERT model since they weren't in the training vocabulary.
BERT's vocabulary was primarily trained on English text, so it has limited coverage of non-Latin scripts. Multilingual BERT (mBERT) addresses this with a larger vocabulary trained on 104 languages.
Visualization: Token Length Distribution
The token length distribution reveals interesting patterns. Continuation tokens (##) tend to be shorter on average because they capture common suffixes like "##ing", "##ed", "##ly". Word-initial tokens include both short function words and longer content words.
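A quick way to compute the averages behind this pattern, reusing the `bert_vocab` mapping loaded earlier:

```python
from statistics import mean

# Continuation tokens: length without the ## prefix
continuation_lengths = [len(t) - 2 for t in bert_vocab if t.startswith("##")]

# Word-initial tokens: everything else except bracketed special tokens
word_initial_lengths = [
    len(t) for t in bert_vocab
    if not t.startswith("##") and not (t.startswith("[") and t.endswith("]"))
]

print("average continuation token length:", round(mean(continuation_lengths), 2))
print("average word-initial token length:", round(mean(word_initial_lengths), 2))
```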
Key Parameters
The main parameters for WordPiece tokenization are:
- `vocab_size`: The target vocabulary size after training. This controls the trade-off between token granularity and model complexity. BERT uses 30,522 tokens; GPT-2 (a byte-level BPE model) uses 50,257. Larger vocabularies mean more words are represented as single tokens (reducing sequence lengths), but increase embedding table size. For English-only models, 30,000-50,000 works well. Multilingual models typically use 100,000+ tokens to cover multiple scripts.

- `min_frequency`: Minimum occurrence count for a character pair to be considered for merging during training. Setting this to 2-5 filters out rare patterns that might represent noise. Higher values (5-10) create cleaner vocabularies but may miss useful domain-specific patterns. Lower values (1-2) capture more subword patterns at the risk of overfitting to training corpus idiosyncrasies.

- `unk_token`: The token used to represent out-of-vocabulary characters or words (typically `[UNK]`). Unlike byte-level BPE, which can always fall back to bytes, WordPiece may produce `[UNK]` tokens for characters not in its vocabulary. This is more common when using a vocabulary trained on a different language or domain.

- `special_tokens`: Reserved tokens with specific roles in the model. Common choices include `[CLS]` (classification), `[SEP]` (separator), `[PAD]` (padding), `[MASK]` (masked language modeling), and `[UNK]` (unknown). These are added to the vocabulary before training and excluded from the merge process. Design these based on your model's downstream tasks.
Limitations and Impact
Limitations
WordPiece shares some limitations with BPE:
- Context-independent tokenization: The same word always tokenizes the same way regardless of context. "Lead" (the metal) and "lead" (to guide) produce identical tokens, losing the semantic distinction.

- Suboptimal for non-Latin scripts: Vocabularies trained primarily on English have poor coverage of other writing systems. Characters from unsupported scripts often become `[UNK]` tokens.

- Greedy encoding: The longest-match encoding algorithm doesn't guarantee globally optimal tokenization. Alternative segmentations might better capture the intended meaning.

- Fixed vocabulary: Once trained, the vocabulary can't adapt to new domains without retraining. A model trained on news may struggle with medical text that uses unfamiliar terminology.
Impact
Despite these limitations, WordPiece has been influential:
- Enabled BERT's success: The combination of WordPiece tokenization with masked language modeling created one of the most impactful NLP models. BERT's tokenization approach became a template for subsequent models.

- Balanced coverage and efficiency: WordPiece's likelihood-based merging creates vocabularies that handle both common and rare words effectively, finding a practical balance between vocabulary size and coverage.

- Established subword conventions: The `##` prefix notation became widely adopted, providing a clear standard for distinguishing word-initial from continuation tokens.

- Inspired improvements: Later algorithms like SentencePiece and Unigram Language Model tokenization built on WordPiece's foundations while addressing some limitations.
Summary
WordPiece tokenization differs from BPE in one crucial way: it selects merges based on likelihood improvement rather than raw frequency. This means pairs that occur more often than expected given their individual frequencies get priority, leading to vocabularies that better capture meaningful subword patterns.
You now understand the key elements of WordPiece:
- The likelihood-based scoring formula $\mathrm{score}(a, b) = \frac{\mathrm{count}(ab)}{\mathrm{count}(a) \times \mathrm{count}(b)}$, which measures the association strength between adjacent tokens
- The `##` prefix notation that distinguishes word-initial from continuation tokens
- The greedy longest-match encoding algorithm
- How BERT applies WordPiece in practice
WordPiece remains a foundational technique in modern NLP, powering BERT and its many variants. While newer tokenization methods have emerged, understanding WordPiece provides essential context for working with transformer models and designing tokenization pipelines.
Quiz
Ready to test your understanding of WordPiece tokenization? Take this quiz to reinforce the key concepts behind BERT's subword algorithm.



