Master WordPiece tokenization, the algorithm behind BERT that balances vocabulary efficiency with morphological awareness. Learn how likelihood-based merging creates smarter subword units than BPE.

WordPiece Tokenization
Introduction
WordPiece is a subword tokenization algorithm that powers BERT and many other transformer-based models. While it shares the iterative merge approach with Byte Pair Encoding (BPE), WordPiece makes a subtle but important change: instead of merging the most frequent pairs, it merges pairs that maximize the likelihood of the training data.
This difference matters. BPE's frequency-based criterion looks only at raw pair counts: a pair built from two very common tokens can win simply because its parts appear everywhere. WordPiece's likelihood objective normalizes by the frequencies of the individual tokens, so a merge wins only if the pair occurs together more often than its parts' frequencies would predict, which favors genuinely associated subword patterns.
The result is a tokenization scheme that tends to produce slightly different vocabularies than BPE, often capturing more linguistically meaningful units. You'll recognize WordPiece tokens by their distinctive ## prefix, which marks subword units that continue a word rather than starting one.
Technical Deep Dive
The Core Difference: Likelihood vs Frequency
BPE selects the most frequent adjacent token pair at each merge step. WordPiece takes a different approach: it selects the pair that maximizes the likelihood of the training corpus under a unigram language model.
Let's formalize this. Given a vocabulary and a training corpus tokenized into a sequence of subword units $t_1, t_2, \ldots, t_N$, the likelihood of the corpus under a unigram model is:

$$L = \prod_{i=1}^{N} P(t_i)$$

where $t_i$ is the $i$-th token in the tokenized corpus and $P(t_i)$ is estimated from token frequencies:

$$P(t_i) = \frac{\text{count}(t_i)}{N}$$

When we merge a pair $(a, b)$ into a new token $ab$, the likelihood changes. WordPiece selects the merge that maximizes this likelihood increase.
The Merge Score
For a candidate pair $(a, b)$, the WordPiece merge score is:

$$\text{score}(a, b) = \frac{\text{count}(a, b)}{\text{count}(a) \times \text{count}(b)}$$

This score captures the association strength between tokens $a$ and $b$. If they appear together more often than you'd expect from their individual frequencies, the score is high.
The WordPiece score is closely related to Pointwise Mutual Information (PMI), a classic measure of association in computational linguistics. PMI measures how much more likely two items appear together compared to if they were independent.
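Written out, PMI for a token pair is:

$$\text{PMI}(a, b) = \log \frac{P(a, b)}{P(a)\,P(b)}$$

With probabilities estimated from counts, $P(a, b) = \text{count}(a, b)/N$ and $P(a) = \text{count}(a)/N$, so the ratio inside the logarithm equals $N \cdot \text{count}(a, b) / (\text{count}(a)\,\text{count}(b))$. The constant $N$ and the monotonic logarithm do not change which pair ranks highest, so selecting merges by the WordPiece score is equivalent to selecting them by PMI.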
Let's work through why this score maximizes likelihood. When we merge $(a, b)$ into the single token $ab$:
- We remove the occurrences of $a$ and $b$ where they appeared as a pair
- We add $\text{count}(a, b)$ occurrences of the new token $ab$

The likelihood ratio (new/old) for this merge is proportional to:

$$\frac{P(ab)}{P(a)\,P(b)} \;\propto\; \frac{\text{count}(a, b)}{\text{count}(a) \times \text{count}(b)}$$

Taking the logarithm and simplifying gives us the score formula above.
The ## Prefix Notation
WordPiece uses a special notation to distinguish word-initial tokens from continuation tokens. Tokens that continue a word (not at the start) are prefixed with ##.
For example, tokenizing "unhappiness":
un ##happi ##ness
This tells us:
- un starts the word
- ##happi continues from the previous token
- ##ness continues from the previous token
The ## prefix serves two purposes:
- Disambiguation: the token ##ing (word continuation) is different from ing (word start). This matters because "ing" at the start of a word (like "ingot") has different distributional properties than "-ing" as a suffix.
- Reconstruction: during decoding, we can easily reconstruct the original text by removing ## prefixes and joining tokens.
The Greedy Tokenization Algorithm
Once we have a trained WordPiece vocabulary, encoding new text follows a greedy longest-match algorithm:
- For each word in the input:
  a. Start at the beginning of the word
  b. Find the longest token in the vocabulary that matches the current position
  c. Add that token to the output (with a ## prefix if it is not at the word start)
  d. Move past the matched characters
  e. Repeat until the word is consumed
- If at any point no vocabulary token matches (not even a single character), output the special [UNK] token for the entire word
This greedy approach is fast but not guaranteed to find the globally optimal tokenization. However, it works well in practice and is computationally efficient.
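As a concrete illustration, here is a minimal sketch of this greedy procedure in Python; the function name wordpiece_encode and the tiny hand-made vocabulary are our own choices for this example, not part of any particular library:

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match WordPiece encoding of a single word (sketch)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            # Continuation pieces carry the '##' prefix
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # nothing matched, not even a single character
        tokens.append(match)
        start = end
    return tokens

# Tiny hand-made vocabulary, just for illustration
toy_vocab = {"un", "##happi", "##ness"}
print(wordpiece_encode("unhappiness", toy_vocab))  # ['un', '##happi', '##ness']
print(wordpiece_encode("unknown", toy_vocab))      # ['[UNK]']
```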
Handling Unknown Characters
Unlike BPE, which typically includes all individual characters in its base vocabulary, WordPiece implementations often have a fixed character set. When encountering characters outside this set:
- Character-level fallback: some implementations add unknown characters to the vocabulary during training
- UNK replacement: others map entire words containing unknown characters to [UNK]
- Normalization: pre-processing can normalize or remove problematic characters
BERT's WordPiece vocabulary, for instance, was trained primarily on English and has limited coverage of characters from other scripts.
Worked Example
Let's trace through WordPiece training on a small corpus to see how the likelihood-based scoring works.
Consider this training corpus with word frequencies:
"low" (5), "lower" (2), "newest" (6), "widest" (3)
Step 1: Initialization
We start with individual characters (for this hand-worked computation we ignore the ## continuation prefix so the counts are easier to read):
Vocabulary: ['d', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w']
Initial tokenization:
l o w (5)
l o w e r (2)
n e w e s t (6)
w i d e s t (3)
Step 2: Calculate Merge Scores
For each adjacent pair $(a, b)$, we calculate:

$$\text{score}(a, b) = \frac{\text{count}(a, b)}{\text{count}(a) \times \text{count}(b)}$$
Let's compute some scores:
- Pair (e, s): appears in "newest" (6) and "widest" (3), so count(e, s) = 9
  - count(e) = 2 + 12 + 3 = 17 (one e in each "lower", two in each "newest", one in each "widest")
  - count(s) = 6 + 3 = 9
  - score = 9 / (17 × 9) ≈ 0.059
- Pair (s, t): appears in "newest" (6) and "widest" (3), so count(s, t) = 9
  - count(s) = 9
  - count(t) = 6 + 3 = 9
  - score = 9 / (9 × 9) ≈ 0.111
- Pair (l, o): appears in "low" (5) and "lower" (2), so count(l, o) = 7
  - count(l) = 5 + 2 = 7
  - count(o) = 5 + 2 = 7
  - score = 7 / (7 × 7) ≈ 0.143

Of these candidates, (l, o) has the highest score, so we merge it first.
Step 3: After First Merge
Vocabulary: ['d', 'e', 'i', 'l', 'lo', 'n', 'o', 'r', 's', 't', 'w']
Updated tokenization:
lo w (5)
lo w e r (2)
n e w e s t (6)
w i d e s t (3)
We continue this process until reaching our target vocabulary size.
Contrast with BPE
With pure frequency counting (BPE), we might merge (e, s) or (s, t) first since they each appear 9 times, compared to (l, o)'s 7 times. The likelihood-based score adjusts for the base frequencies of each token, preferring pairs where the co-occurrence is more "surprising" given the individual token frequencies.
Code Implementation
Let's implement WordPiece from scratch to understand the algorithm deeply. We'll build a simplified version that captures the core concepts.
First, we set up our basic data structures and the scoring function:
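The code below is a compact sketch rather than the exact implementation that produced the output further down; the names (word_freqs, split_word, count_tokens_and_pairs, score) are our own choices. We represent the corpus as word frequencies, split each word into character-level tokens with ## marking continuations, and count tokens and adjacent pairs:

```python
from collections import defaultdict

# Toy corpus: word -> frequency (same corpus as the worked example)
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

def split_word(word):
    """Split a word into characters, marking non-initial pieces with '##'."""
    return [word[0]] + ["##" + ch for ch in word[1:]]

def count_tokens_and_pairs(splits, word_freqs):
    """Count token and adjacent-pair frequencies, weighted by word frequency."""
    token_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for word, freq in word_freqs.items():
        tokens = splits[word]
        for token in tokens:
            token_counts[token] += freq
        for a, b in zip(tokens, tokens[1:]):
            pair_counts[(a, b)] += freq
    return token_counts, pair_counts

def score(pair, token_counts, pair_counts):
    """WordPiece score: count(a, b) / (count(a) * count(b))."""
    a, b = pair
    return pair_counts[pair] / (token_counts[a] * token_counts[b])
```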
Now let's add the scoring and merge selection logic:
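Building on those helpers, a sketch of the merge selection logic: pick the highest-scoring adjacent pair and join it into a single token, dropping the ## prefix of the right-hand piece:

```python
def best_pair(token_counts, pair_counts):
    """Return the adjacent pair with the highest likelihood-based score."""
    return max(pair_counts, key=lambda p: score(p, token_counts, pair_counts))

def merge_tokens(a, b):
    """Join two tokens, dropping the '##' prefix of the right-hand piece."""
    return a + (b[2:] if b.startswith("##") else b)
```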
Finally, let's add the training loop:
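A simple training loop that recomputes all counts and scores on every iteration (production implementations update them incrementally), stopping once the vocabulary reaches the target size:

```python
def train_wordpiece(word_freqs, vocab_size):
    """Greedily merge the highest-scoring pair until vocab_size is reached."""
    splits = {word: split_word(word) for word in word_freqs}
    # Base vocabulary: every character-level token seen in the corpus
    vocab = sorted({tok for tokens in splits.values() for tok in tokens})

    while len(vocab) < vocab_size:
        token_counts, pair_counts = count_tokens_and_pairs(splits, word_freqs)
        if not pair_counts:
            break  # every word is a single token; nothing left to merge
        a, b = best_pair(token_counts, pair_counts)
        new_token = merge_tokens(a, b)
        vocab.append(new_token)
        # Apply the merge inside every word's current segmentation
        for word, tokens in splits.items():
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                    merged.append(new_token)
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            splits[word] = merged
    return sorted(vocab)
```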
Let's test our implementation:
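The test below trains on the toy corpus and encodes a few words with the wordpiece_encode function sketched earlier. Because several pairs tie on a corpus this small, the exact merge order and vocabulary can differ slightly from the output shown, depending on tie-breaking:

```python
vocab = set(train_wordpiece(word_freqs, vocab_size=20))
print("Vocabulary size:", len(vocab))
print("Vocabulary:", sorted(vocab))

for word in ["low", "lower", "lowest", "newest", "widest", "newer"]:
    tokens = wordpiece_encode(word, vocab)
    decoded = "".join(t[2:] if t.startswith("##") else t for t in tokens)
    print(f"{word} -> {tokens} -> {decoded}")
```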
```
Vocabulary size: 20
Vocabulary: ['##d', '##e', '##er', '##i', '##o', '##r', '##s', '##st', '##t', '##w', 'l', 'lo', 'low', 'lower', 'n', 'ne', 'new', 'w', 'wi', 'wid']

low -> ['low'] -> low
lower -> ['lower'] -> lower
lowest -> ['low', '##e', '##st'] -> lowest
newest -> ['new', '##e', '##st'] -> newest
widest -> ['wid', '##e', '##st'] -> widest
newer -> ['new', '##er'] -> newer
```
Our implementation learns meaningful subword units. Notice how common suffix pieces like ##st and ##er get merged thanks to their high co-occurrence scores, which lets unseen words such as "lowest" and "newer" be segmented into familiar parts.
Using Hugging Face Tokenizers
In practice, you'll use optimized implementations like the Hugging Face tokenizers library. Let's see how to train and use a WordPiece tokenizer:
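A sketch of how such a tokenizer can be trained and used; the short training corpus below is a stand-in (the text used to produce the output that follows is not reproduced here), so with different training data the learned vocabulary and tokenizations will differ:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Stand-in corpus; the original training text is not shown here
corpus = [
    "machine learning is a branch of artificial intelligence",
    "deep learning models learn layered representations",
    "transformers changed natural language processing",
]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=100,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)

print("Vocabulary size:", tokenizer.get_vocab_size())

for text in ["machine learning", "deep learning models", "transformational change"]:
    encoding = tokenizer.encode(text)
    print(f"Input: '{text}'")
    print("Tokens:", encoding.tokens)
    print("Decoded:", f"'{tokenizer.decode(encoding.ids)}'")
```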
```
Vocabulary size: 100

Input: 'machine learning'
Tokens: ['m', '##a', '##c', '##h', '##i', '##n', '##e', 'lear', '##ning']
Decoded: 'm ##a ##c ##h ##i ##n ##e lear ##ning'

Input: 'deep learning models'
Tokens: ['d', '##e', '##e', '##p', 'lear', '##ning', 'm', '##o', '##d', '##e', '##l', '##s']
Decoded: 'd ##e ##e ##p lear ##ning m ##o ##d ##e ##l ##s'

Input: 'transformational change'
Tokens: ['tra', '##ns', '##f', '##or', '##m', '##atio', '##n', '##a', '##l', 'c', '##h', '##a', '##ng', '##e']
Decoded: 'tra ##ns ##f ##or ##m ##atio ##n ##a ##l c ##h ##a ##ng ##e'
```
The Hugging Face tokenizer provides a production-ready implementation with special token handling, efficient encoding, and seamless integration with transformer models.
Visualizing WordPiece vs BPE
Let's compare how WordPiece and BPE tokenize the same text to see their differences in practice:
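One way to set up that comparison with the tokenizers library is sketched below; the corpus, vocabulary size, and example sentence are placeholders, so the exact token boundaries you see will differ from the original comparison:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Placeholder corpus shared by both tokenizers
corpus = [
    "the lowest temperatures were recorded in the newest stations",
    "the widest rivers flow slower than the narrowest streams",
]

def train_tokenizer(model, trainer):
    tok = Tokenizer(model)
    tok.pre_tokenizer = Whitespace()
    tok.train_from_iterator(corpus, trainer)
    return tok

bpe = train_tokenizer(BPE(unk_token="[UNK]"),
                      BpeTrainer(vocab_size=60, special_tokens=["[UNK]"]))
wp = train_tokenizer(WordPiece(unk_token="[UNK]"),
                     WordPieceTrainer(vocab_size=60, special_tokens=["[UNK]"]))

text = "the newest and widest"
print("BPE:      ", bpe.encode(text).tokens)
print("WordPiece:", wp.encode(text).tokens)
```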

The comparison reveals how the likelihood-based scoring in WordPiece produces different token boundaries than BPE's frequency-based approach. WordPiece's ## prefix makes continuation tokens explicit, while BPE treats all tokens uniformly.
WordPiece in BERT
WordPiece became famous as the tokenization algorithm behind BERT. Understanding how BERT uses WordPiece helps explain its practical impact.
BERT's Vocabulary
BERT's original English model uses a 30,522-token vocabulary trained on Wikipedia and BookCorpus. The vocabulary includes:
- Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
- Word-initial tokens: complete words and word prefixes, written without ##
- Continuation tokens: subword units prefixed with ##
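A sketch of how this breakdown can be computed with the transformers library; how you classify the reserved [unused...] entries is a judgment call, so the exact split may vary slightly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
vocab = tokenizer.get_vocab()  # token -> id

continuation = {t for t in vocab if t.startswith("##")}
special = set(tokenizer.all_special_tokens)  # [PAD], [UNK], [CLS], [SEP], [MASK]
word_initial = set(vocab) - continuation - special

print("BERT vocabulary size:", len(vocab))
print("Continuation tokens (##):", len(continuation))
print("Word-initial tokens:", len(word_initial))
print("Special tokens:", len(special))
```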
```
BERT vocabulary size: 30522
Continuation tokens (##): 5828
Word-initial tokens: 24689
Special tokens: 5
```
Tokenization in Action
Let's see how BERT tokenizes various text examples:
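A sketch using the pretrained bert-base-uncased tokenizer from the transformers library, with the example sentences taken from the output below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

examples = [
    "Hello world",
    "I love machine learning",
    "Transformers revolutionized NLP",
    "COVID-19 pandemic",
    "antidisestablishmentarianism",
]

for text in examples:
    tokens = tokenizer.tokenize(text)
    print("Text:", text)
    print("Tokens:", tokens)
    print("IDs:", tokenizer.convert_tokens_to_ids(tokens))
```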
```
Text: Hello world
Tokens: ['hello', 'world']
IDs: [7592, 2088]

Text: I love machine learning
Tokens: ['i', 'love', 'machine', 'learning']
IDs: [1045, 2293, 3698, 4083]

Text: Transformers revolutionized NLP
Tokens: ['transformers', 'revolution', '##ized', 'nl', '##p']
IDs: [19081, 4329, 3550, 17953, 2361]

Text: COVID-19 pandemic
Tokens: ['co', '##vid', '-', '19', 'pan', '##de', '##mic']
IDs: [2522, 17258, 1011, 2539, 6090, 3207, 7712]

Text: antidisestablishmentarianism
Tokens: ['anti', '##dis', '##est', '##ab', '##lish', '##ment', '##arian', '##ism']
IDs: [3424, 10521, 4355, 7875, 13602, 3672, 12199, 2964]
```
Notice how BERT handles unknown or rare words by breaking them into subword pieces. The word "antidisestablishmentarianism" gets tokenized into meaningful morphological units, each of which BERT has seen during pre-training.
Handling Special Cases
WordPiece in BERT includes several practical handling mechanisms:
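Continuing with the same bert-base-uncased tokenizer, the examples below can be reproduced roughly as follows (note that the accent stripping and lowercasing come from BERT's text normalizer, not from the WordPiece algorithm itself):

```python
special_cases = ["café", "naïve", "北京", "123.456", "user@email.com"]
for text in special_cases:
    # repr() shows the original string in quotes, before normalization
    print(f"{text!r} -> {tokenizer.tokenize(text)}")
```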
```
'café' -> ['cafe']
'naïve' -> ['naive']
'北京' -> ['北', '京']
'123.456' -> ['123', '.', '45', '##6']
'user@email.com' -> ['user', '@', 'email', '.', 'com']
```
BERT's vocabulary was primarily trained on English text, so it has limited coverage of non-Latin scripts. Multilingual BERT (mBERT) addresses this with a larger vocabulary trained on 104 languages.
Visualization: Token Length Distribution

The token length distribution reveals interesting patterns. Continuation tokens (##) tend to be shorter on average because they capture common suffixes like "##ing", "##ed", "##ly". Word-initial tokens include both short function words and longer content words.
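The plot itself is not reproduced here, but the underlying statistics can be sketched in a few lines from the vocab mapping loaded above; the grouping choices (dropping [...]-style tokens from the word-initial group) are ours:

```python
import statistics

# Lengths of the text each token contributes, with the '##' prefix stripped
cont_lengths = [len(t) - 2 for t in vocab if t.startswith("##")]
init_lengths = [len(t) for t in vocab
                if not t.startswith("##") and not t.startswith("[")]

print("Mean continuation token length:", round(statistics.mean(cont_lengths), 2))
print("Mean word-initial token length:", round(statistics.mean(init_lengths), 2))
```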
Key Parameters
vocab_size: The target vocabulary size after training. BERT uses 30,522 tokens. Larger vocabularies reduce sequence lengths but increase model parameters. For English-only models, 30,000-50,000 works well. Multilingual models often use 100,000+ tokens.
min_frequency: Minimum occurrence count for a token to be considered for merging. Higher values (5-10) create cleaner vocabularies but may miss useful rare patterns. Lower values (1-2) capture more patterns but may include noise.
special_tokens: Reserved tokens like [CLS], [SEP], [PAD], [MASK], [UNK]. These are added before training and excluded from the merge process. Design these based on your model's needs.
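As a sketch, these parameters map directly onto the trainer configuration in the tokenizers library; the specific values below are illustrative, not prescriptive:

```python
from tokenizers.trainers import WordPieceTrainer

trainer = WordPieceTrainer(
    vocab_size=30_522,   # target vocabulary size (BERT's choice)
    min_frequency=2,     # ignore candidates seen fewer than twice
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
```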
Limitations and Impact
Limitations
WordPiece shares some limitations with BPE:
Context-independent tokenization: The same word always tokenizes the same way regardless of context. "Lead" (the metal) and "lead" (to guide) produce identical tokens, losing the semantic distinction.
Suboptimal for non-Latin scripts: Vocabularies trained primarily on English have poor coverage of other writing systems. Characters from unsupported scripts often become [UNK] tokens.
Greedy encoding: The longest-match encoding algorithm doesn't guarantee globally optimal tokenization. Alternative segmentations might better capture the intended meaning.
Fixed vocabulary: Once trained, the vocabulary can't adapt to new domains without retraining. A model trained on news may struggle with medical text that uses unfamiliar terminology.
Impact
Despite these limitations, WordPiece has been enormously influential:
Enabled BERT's success: The combination of WordPiece tokenization with masked language modeling created one of the most impactful NLP models ever. BERT's tokenization approach became a template for subsequent models.
Balanced coverage and efficiency: WordPiece's likelihood-based merging creates vocabularies that handle both common and rare words effectively, finding a practical balance between vocabulary size and coverage.
Established subword conventions: The ## prefix notation became widely adopted, providing a clear standard for distinguishing word-initial from continuation tokens.
Inspired improvements: Later work such as the SentencePiece toolkit and Unigram Language Model tokenization built on WordPiece's foundations while addressing some of its limitations.
Summary
WordPiece tokenization differs from BPE in one crucial way: it selects merges based on likelihood improvement rather than raw frequency. This means pairs that occur more often than expected given their individual frequencies get priority, leading to vocabularies that better capture meaningful subword patterns.
You now understand the key elements of WordPiece:
- The likelihood-based scoring formula: $\text{score}(a, b) = \dfrac{\text{count}(a, b)}{\text{count}(a) \times \text{count}(b)}$
- The ## prefix notation that distinguishes word-initial from continuation tokens
- The greedy longest-match encoding algorithm
- How BERT applies WordPiece in practice
WordPiece remains a foundational technique in modern NLP, powering BERT and its many variants. While newer tokenization methods have emerged, understanding WordPiece provides essential context for working with transformer models and designing tokenization pipelines.
Quiz
Ready to test your understanding of WordPiece tokenization? Take this quiz to reinforce the key concepts behind BERT's subword algorithm.