SentencePiece
Introduction
SentencePiece represents a fundamental shift in how we approach text tokenization. Instead of starting with preprocessed text and applying rules-based tokenization, SentencePiece treats raw text as a sequence of bytes and learns optimal subword units directly from the data. This approach eliminates the need for language-specific preprocessing and enables truly multilingual tokenization.
The key insight behind SentencePiece is that we can learn meaningful subword units by treating text as raw bytes and using data-driven algorithms to discover optimal token boundaries. This approach has become the foundation for tokenization in many modern language models, including ALBERT, T5, and XLM-RoBERTa.
Technical Deep Dive
SentencePiece operates on the principle that text can be treated as a sequence of Unicode code points, which are then converted to bytes. This byte-level representation allows the algorithm to work across languages without requiring language-specific knowledge.
Byte-Level Processing
At its core, SentencePiece processes text through these steps:
- Unicode normalization: Text is normalized to a consistent Unicode form (typically NFKC)
- Byte encoding: Each Unicode code point is encoded as UTF-8 bytes
- Subword learning: Statistical algorithms learn optimal subword units
- Token mapping: Learned subwords are mapped to token IDs
The key insight is treating whitespace as just another byte pattern. Instead of explicit whitespace tokenization, SentencePiece uses a special prefix character (▁, U+2581) to mark word boundaries during training.
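As a small illustration of this preprocessing (a sketch of the idea, not the library's internal code), we can normalize a string and mark word boundaries ourselves:

```python
import unicodedata

def preprocess(text: str) -> str:
    # Normalize to NFKC, then mark word boundaries with U+2581 ("▁").
    text = unicodedata.normalize("NFKC", text)
    return "▁" + text.strip().replace(" ", "▁")

print(preprocess("Hello world"))  # ▁Hello▁world
```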
Training Algorithms
How should we decide which subword units to include in our vocabulary? This is the central question that SentencePiece's training algorithms address. We need a principled way to discover token boundaries that capture meaningful patterns in language, patterns that will generalize to new text we haven't seen before.
SentencePiece offers two philosophically different approaches to this problem:
- Bottom-up construction (BPE): Start with the smallest possible units and build up by merging frequent patterns
- Top-down pruning (Unigram): Start with many possible tokens and remove the least useful ones
Both approaches aim to find a vocabulary that efficiently represents the training corpus while generalizing well to new text. Let's explore each in detail.
Byte Pair Encoding (BPE)
BPE takes an intuitive, greedy approach to vocabulary construction. The core insight is simple: if two symbols frequently appear next to each other, they probably represent a meaningful unit that should be treated as a single token.
The intuition: Imagine you're reading a corpus and noticing patterns. You see "ing" appearing constantly at the end of words. You see "th" starting many common words. These frequent co-occurrences suggest natural boundaries where characters "belong together." BPE formalizes this intuition into an algorithm.
Building the vocabulary step by step:
The algorithm begins with the smallest possible vocabulary, individual bytes (or characters). From this minimal starting point, it iteratively discovers larger units:
- Initialize: Start with all unique bytes/characters as the vocabulary
- Count: Scan the corpus and count how often each pair of adjacent tokens appears
- Merge: Take the most frequent pair and combine them into a new token
- Repeat: Continue until the vocabulary reaches the target size
At each iteration, BPE identifies the pair of adjacent tokens with the highest frequency across the corpus:

$$(x^*, y^*) = \underset{(x, y)}{\arg\max} \; \text{count}(x, y)$$

where:
- $(x, y)$: a pair of adjacent tokens in the current vocabulary
- $\text{count}(x, y)$: the number of times tokens $x$ and $y$ appear consecutively in the corpus
- $(x^*, y^*)$: the most frequent pair, which will be merged into a new token

The algorithm then creates a new token by concatenating $x^*$ and $y^*$, adds it to the vocabulary, and replaces all occurrences of the pair in the corpus. This process repeats until the vocabulary reaches the target size.
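As a quick illustration of this counting step (a minimal sketch, not the library's implementation), we can tally adjacent pairs with a Counter and pick the most frequent one:

```python
from collections import Counter

# The current corpus representation: a sequence of symbols.
tokens = list("▁natural▁language▁processing")

# Count every adjacent pair of symbols.
pair_counts = Counter(zip(tokens, tokens[1:]))

# The argmax of count(x, y): the pair that would be merged next.
best_pair, count = pair_counts.most_common(1)[0]
print(best_pair, count)
```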
Why does this work? The greedy nature of BPE means it captures the most common patterns first. Early merges tend to be language-universal patterns (common letter combinations), while later merges capture domain-specific vocabulary. This creates a natural hierarchy: high-frequency subwords get their own tokens, while rare words are decomposed into smaller pieces.
Unigram Language Model
While BPE builds vocabulary through local decisions (always merge the most frequent pair), the unigram approach takes a global perspective. It asks: what vocabulary would make the entire corpus most probable under a simple language model?
The intuition: Think of tokenization as a compression problem. Given limited vocabulary slots, which tokens should we include to most efficiently represent our corpus? Tokens that appear frequently and can't be easily composed from other tokens are the most valuable. The unigram model formalizes this by treating tokenization as probabilistic inference.
Framing tokenization as optimization:
Given a text string $X$, there are many possible ways to segment it into tokens. For example, "learning" could be segmented as:
- ["learning"] (one token)
- ["learn", "ing"] (two tokens)
- ["l", "e", "a", "r", "n", "i", "n", "g"] (eight characters)

Which segmentation is "best"? The unigram model answers this by assigning probabilities to each possible segmentation and choosing the most probable one.

The core optimization problem is finding the best segmentation $\mathbf{x}^*$ for input text $X$:

$$\mathbf{x}^* = \underset{\mathbf{x} \in S(X)}{\arg\max} \; P(\mathbf{x})$$

where:
- $\mathbf{x} = (x_1, \ldots, x_M)$: a candidate segmentation consisting of $M$ tokens
- $S(X)$: the set of all possible segmentations of text $X$
- $P(\mathbf{x})$: the probability of segmentation $\mathbf{x}$ under the unigram model
The independence assumption:

Under the unigram assumption, each token is generated independently. The probability of seeing token $x_i$ doesn't depend on what tokens came before or after it. This simplifying assumption transforms the probability of a segmentation into a product:

$$P(\mathbf{x}) = \prod_{i=1}^{M} p(x_i)$$

where:
- $p(x_i)$: the probability of token $x_i$ in the learned vocabulary
- $M$: the number of tokens in segmentation $\mathbf{x}$

This multiplicative form is what makes the unigram model computationally tractable. Despite the exponentially many possible segmentations, we can use dynamic programming (specifically, the Viterbi algorithm) to find the optimal segmentation in time linear in the text length.

In practice, multiplying many small probabilities leads to numerical underflow. We instead work with log-probabilities, converting the product into a sum: $\log P(\mathbf{x}) = \sum_{i=1}^{M} \log p(x_i)$. This is mathematically equivalent but numerically stable.
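To make the dynamic programming concrete, here is a minimal Viterbi-style segmentation sketch in Python. The function name, the probability values, and the maximum token length are illustrative assumptions, not part of SentencePiece's API:

```python
import math

def viterbi_segment(text, log_probs, max_len=10):
    """Find the most probable segmentation of `text` under a unigram model.

    log_probs maps known tokens to their log-probabilities.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i] = best log-prob of text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)           # back[i] = start index of the last token ending at i

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start

    # Walk the backpointers to recover the best segmentation.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return list(reversed(tokens))

# Illustrative token probabilities (hypothetical values).
log_probs = {
    "▁learn": math.log(0.02),
    "ing": math.log(0.05),
    "▁learning": math.log(0.0005),
}
print(viterbi_segment("▁learning", log_probs))  # ['▁learn', 'ing'] under these values
```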
Training the unigram model:
Unlike BPE which builds up from nothing, the unigram model starts with a large initial vocabulary (all substrings up to some maximum length) and prunes it down using Expectation-Maximization (EM):
- Initialize: Create a vocabulary containing all possible subwords up to a maximum length
- E-step: For each possible segmentation of the corpus, compute how likely it is under current token probabilities. This gives us "expected counts" for each token.
- M-step: Re-estimate token probabilities based on these expected counts: $p(x_i) = \dfrac{c_i}{\sum_j c_j}$, where $c_i$ is the expected count of token $x_i$
- Prune: Remove tokens whose deletion would least harm the corpus likelihood
- Repeat steps 2-4 until the vocabulary reaches the desired size
The key insight is that step 4 removes tokens that are "redundant," tokens that can be reconstructed from other tokens without much loss in probability. This naturally preserves the most useful subword units.
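Returning to the M-step in step 3 above: it is just a normalization of the expected counts. A tiny sketch, with hypothetical numbers standing in for the E-step's output:

```python
# Hypothetical expected counts produced by the E-step (illustrative values).
expected_counts = {"▁natural": 4.2, "▁natur": 0.8, "al": 3.1, "ing": 5.7}

# M-step: each token's probability is its share of the total expected count.
total = sum(expected_counts.values())
token_probs = {tok: c / total for tok, c in expected_counts.items()}
print(token_probs)
```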
Whitespace Handling
SentencePiece's approach to whitespace is straightforward. During training, all whitespace characters are replaced with the special character ▁. This allows the algorithm to learn where word boundaries naturally occur based on the training data.
For example, the sentence "Hello world" becomes "▁Hello▁world" during training. The ▁ prefix indicates word boundaries, and the algorithm can learn that spaces between words are meaningful segmentation points.
A Worked Example
To solidify our understanding, let's trace through exactly how SentencePiece processes a simple sentence: "natural language processing". We'll follow the BPE algorithm step by step, watching how the vocabulary evolves and how tokenization decisions emerge from the learned merge rules.
Step 1: Preprocessing
Before any learning happens, SentencePiece transforms the raw text. Spaces are replaced with the special boundary marker ▁, and a boundary is added at the start to indicate the beginning of a word:
Input: "natural language processing"
Output: "▁natural▁language▁processing"
This preprocessing matters because it allows the algorithm to learn that certain patterns (like "▁the") typically start words, while others (like "ing") typically end them. The ▁ character becomes just another symbol that can participate in merges.
Step 2: Initialize with Characters
The BPE algorithm starts with the finest possible granularity: individual characters. Our initial vocabulary contains every unique character that appears in the preprocessed corpus:
Initial vocabulary: {▁, a, c, e, g, i, l, n, o, p, r, s, t, u}
At this point, our text is represented as a sequence of 28 individual tokens:
[▁, n, a, t, u, r, a, l, ▁, l, a, n, g, u, a, g, e, ▁, p, r, o, c, e, s, s, i, n, g]
Step 3: Iterative Merging
Now the algorithm scans the corpus, counting how often each pair of adjacent tokens appears. Suppose the counts look like this (simplified for illustration):
| Pair | Count |
|---|---|
| (a, l) | 3 |
| (a, n) | 2 |
| (n, g) | 2 |
| (i, n) | 1 |
| ... | ... |
In this illustration, the pair (a, l) has the highest count, so we merge it:
Merge 1: (a, l) → al
The vocabulary now includes "al" as a single token. Importantly, only adjacent a + l pairs are merged; other instances of "a" and "l" remain separate:
[▁, n, a, t, u, r, al, ▁, l, a, n, g, u, a, g, e, ▁, p, r, o, c, e, s, s, i, n, g]
The algorithm continues, perhaps next merging (a, n) → an, then (i, n) → in, and so on. After several iterations, we might have:
After 5 merges:
- Vocabulary: {▁, a, c, e, g, i, l, n, o, p, r, s, t, u, al, an, in, ing, ge}
- Merge sequence: [(a,l)→al, (a,n)→an, (i,n)→in, (in,g)→ing, (g,e)→ge]
Notice how "ing" emerges naturally: first "in" is merged, then "in" + "g" becomes "ing". The algorithm discovers morphological patterns without being told about English grammar.
Step 4: Tokenization at Inference Time
Once training is complete, tokenizing new text is straightforward. Given the input "natural language":
- Preprocess: "natural language" → "▁natural▁language"
- Start with characters: [▁, n, a, t, u, r, a, l, ▁, l, a, n, g, u, a, g, e]
- Apply merges in order:
  - After (a,l)→al: [▁, n, a, t, u, r, al, ▁, l, a, n, g, u, a, g, e]
  - After (a,n)→an: [▁, n, a, t, u, r, al, ▁, l, an, g, u, a, g, e]
  - ...and so on until no more merges apply
- Final tokens: Depending on vocabulary size and training corpus, we might get:
  - ["▁natural", "▁language"] if these full words are in the vocabulary
  - ["▁natur", "al", "▁langu", "age"] if the vocabulary is smaller
  - ["▁", "n", "a", "t", "u", "r", "a", "l", "▁", "l", ...] if the vocabulary is very small
The key insight: the merge order matters. BPE applies merges in exactly the order they were learned during training. This deterministic process ensures that the same text always produces the same tokenization.
Comparing BPE and Unigram Tokenization
For the unigram model, the tokenization process is different. Instead of applying merge rules, we solve an optimization problem. Given "▁natural▁language", we consider all possible segmentations and choose the one with highest probability:
Suppose our vocabulary assigns relatively high probabilities to the full words "▁natural" and "▁language", and much smaller probabilities to fragments such as "▁natur", "al", "▁langu", and "age".
The algorithm might find that ["▁natural", "▁language"] gives the highest probability product, making it the optimal segmentation. The Viterbi algorithm efficiently searches through all possibilities without explicitly enumerating them.
Code Implementation
Having walked through the algorithm by hand, let's now implement it in code. Building SentencePiece from scratch will solidify our understanding of each component: the preprocessing that adds word boundaries, the iterative counting and merging that builds the vocabulary, and the deterministic application of merge rules during tokenization.
Defining the SentencePiece Class
Our SimpleSentencePiece class encapsulates the entire BPE pipeline. It maintains two key pieces of state:
- vocab: A dictionary mapping each token (string) to its unique ID (integer)
- merges: An ordered list of merge operations, each recording which pair was merged and what token it produced
The ordering of merges is critical. During tokenization, we must apply them in exactly the same order they were learned:
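The sketch below is a minimal, illustrative reimplementation for learning purposes; it is not the official sentencepiece package, and the class name, method names, and defaults are our own choices:

```python
from collections import Counter

class SimpleSentencePiece:
    """A minimal BPE-style tokenizer sketch (not the official SentencePiece API)."""

    def __init__(self, vocab_size=200):
        self.vocab_size = vocab_size
        self.vocab = {}     # token string -> integer ID
        self.merges = []    # ordered list of ((left, right), merged) operations

    def _preprocess(self, text):
        # Replace spaces with the word-boundary marker and prepend one.
        return "▁" + text.strip().replace(" ", "▁")

    def train(self, corpus):
        # Represent each training sentence as a list of characters.
        sentences = [list(self._preprocess(s)) for s in corpus]
        # Initialize the vocabulary with every unique character.
        chars = sorted({ch for sent in sentences for ch in sent})
        self.vocab = {tok: i for i, tok in enumerate(chars)}

        while len(self.vocab) < self.vocab_size:
            # Count adjacent pairs across the whole corpus.
            pair_counts = Counter()
            for sent in sentences:
                for left, right in zip(sent, sent[1:]):
                    pair_counts[(left, right)] += 1
            if not pair_counts:
                break
            (left, right), count = pair_counts.most_common(1)[0]
            if count < 2:
                break  # nothing frequent enough left to merge
            merged = left + right
            self.merges.append(((left, right), merged))
            self.vocab[merged] = len(self.vocab)
            # Apply the new merge everywhere in the corpus.
            sentences = [self._apply_merge(s, (left, right), merged) for s in sentences]

    @staticmethod
    def _apply_merge(sent, pair, merged):
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(sent[i])
                i += 1
        return out

    def tokenize(self, text):
        # Start from characters and replay the learned merges in order.
        sent = list(self._preprocess(text))
        for pair, merged in self.merges:
            sent = self._apply_merge(sent, pair, merged)
        return sent

    def encode(self, text):
        # Map tokens to IDs; unknown tokens fall back to -1 in this sketch.
        return [self.vocab.get(tok, -1) for tok in self.tokenize(text)]
```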
Let's trace through the key methods:
- train(): Implements the BPE loop we described earlier. It preprocesses the corpus (adding ▁ boundaries), initializes with a character vocabulary, then iteratively finds and merges the most frequent pair until reaching the target vocabulary size.
- tokenize(): Given new text, applies the learned merges in order. This is the deterministic process that ensures consistent tokenization.
- encode(): Converts tokens to their integer IDs, ready for input to a neural network.
Training on a Sample Corpus
Now let's see the algorithm in action. We'll train on a small NLP-themed corpus and observe what patterns it discovers:
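A possible training run might look like the following (the corpus, vocabulary size, and exact output are illustrative choices, so the printed tokens may differ):

```python
corpus = [
    "natural language processing",
    "language models process natural text",
    "processing language is natural for models",
]

tok = SimpleSentencePiece(vocab_size=60)
tok.train(corpus)

print(tok.tokenize("natural language"))  # e.g. ['▁natural', '▁language'] or smaller pieces
print(tok.encode("natural language"))    # the corresponding token IDs
```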
The output reveals the compression at work. We started with individual characters and, through iterative merging, built up a vocabulary of subword units. The number of tokens needed to represent "natural language" is now much smaller than the character count. This compression is what makes subword tokenization efficient.
Examining Learned Merges
The merge sequence tells us exactly what patterns the algorithm discovered in the training corpus. Early merges capture the most frequent character pairs, the "building blocks" that appear across many words:
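Continuing with the tok object trained above, we can print the first few learned merges to see this progression (the exact pairs depend on the toy corpus):

```python
# Inspect the first ten merge operations in the order they were learned.
for i, (pair, merged) in enumerate(tok.merges[:10], start=1):
    print(f"Merge {i:2d}: {pair[0]} + {pair[1]} -> {merged}")
```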
Notice the progression: early merges are simple character pairs, while later merges combine previously-merged tokens into increasingly complex units. This hierarchical structure mirrors the morphological patterns in language. Common suffixes like "ing" or "tion" emerge naturally from the statistics, even though the algorithm knows nothing about linguistics.
Visualizing the BPE Training Process
To better understand how BPE builds its vocabulary, let's visualize the training dynamics. We'll track how the vocabulary grows and how the corpus representation becomes more compressed with each merge operation:
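One way to produce such a visualization (a sketch that assumes the SimpleSentencePiece class and the tok/corpus objects defined earlier, plus matplotlib) is to re-tokenize the corpus with the first k merges and record each merge's pair frequency at the moment it was applied:

```python
import matplotlib.pyplot as plt

text = " ".join(corpus)  # treat the whole toy corpus as one string

token_counts, merge_freqs = [], []
sent = list(tok._preprocess(text))
token_counts.append(len(sent))
for pair, merged in tok.merges:
    # Frequency of this pair just before it is merged.
    merge_freqs.append(sum(1 for a, b in zip(sent, sent[1:]) if (a, b) == pair))
    sent = tok._apply_merge(sent, pair, merged)
    token_counts.append(len(sent))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(range(len(token_counts)), token_counts, marker="o")
ax1.set_xlabel("Merges applied")
ax1.set_ylabel("Tokens in corpus")
ax2.bar(range(1, len(merge_freqs) + 1), merge_freqs)
ax2.set_xlabel("Merge index")
ax2.set_ylabel("Pair frequency at merge time")
plt.tight_layout()
plt.show()
```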
The visualization reveals the fundamental trade-off in BPE training: each merge adds one token to the vocabulary while reducing the number of tokens needed to represent the corpus. Early merges achieve the greatest compression gains because they target the most frequent pairs.
The frequency distribution confirms the greedy selection pattern: the first merge captures a pair appearing many times, while later merges address progressively rarer combinations. This explains why BPE tends to create short, frequent tokens early and longer, domain-specific tokens later.
Tokenizing Unseen Text
The true test of any tokenization algorithm is how it handles text it hasn't seen before. Unlike word-level tokenizers that map unknown words to a single <UNK> token, subword tokenizers gracefully decompose novel words into known pieces:
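Using the toy model from earlier, a quick check on unseen text might look like this (the phrase and the reported ratio are illustrative):

```python
new_text = "deep processing"
pieces = tok.tokenize(new_text)

# Compression ratio: characters (after preprocessing) per produced token.
ratio = len(tok._preprocess(new_text)) / max(len(pieces), 1)
print(pieces)
print(f"compression ratio: {ratio:.2f}x")
```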
Even though "deep processing" as a complete phrase never appeared in training, the model handles it gracefully. It recognizes subword patterns it learned from related words, perhaps "process" from "processing" and "deep" from other contexts. The compression ratio quantifies this efficiency: a ratio of 2x means we need half as many tokens as characters, 3x means one-third, and so on.
This robustness to unseen text is why subword tokenization has become the standard in modern NLP. Rare words, technical terms, and even typos can be decomposed into known subwords, allowing models to make reasonable predictions about their meaning.
The Effect of Vocabulary Size
How does vocabulary size affect tokenization? Let's train multiple models with different vocabulary sizes and compare how they tokenize the same text:
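A simple way to run this comparison with the toy implementation (the vocabulary sizes are arbitrary choices for illustration):

```python
for size in (30, 60, 120):
    t = SimpleSentencePiece(vocab_size=size)
    t.train(corpus)
    pieces = t.tokenize("natural language processing")
    print(f"vocab_size={size:4d}  tokens={len(pieces):2d}  {pieces}")
```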
The plot reveals the diminishing returns of larger vocabularies. Early vocabulary additions (the first 20-30 tokens beyond characters) provide substantial compression improvements. However, as vocabulary size grows further, each additional token contributes less to overall efficiency. This trade-off motivates the choice of vocabulary sizes in practice. Models like BERT use ~30,000 tokens, balancing compression against embedding table size.
Production Usage
SentencePiece is widely used in production systems. Let's see how to use the official implementation:
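A typical workflow with the sentencepiece Python package looks roughly like this (the file paths and hyperparameters are placeholders; recent versions of the package accept these keyword arguments):

```python
import sentencepiece as spm

# Train a model from a plain-text file with one sentence per line.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # path to your training text (placeholder)
    model_prefix="sp_demo",    # writes sp_demo.model and sp_demo.vocab
    vocab_size=8000,
    model_type="bpe",          # or "unigram"
    character_coverage=1.0,
)

# Load the trained model and tokenize text.
sp = spm.SentencePieceProcessor(model_file="sp_demo.model")
print(sp.encode("natural language processing", out_type=str))  # subword pieces
print(sp.encode("natural language processing"))                # token IDs
print(sp.decode(sp.encode("natural language processing")))     # back to text
```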
The official implementation handles edge cases like rare Unicode characters, provides both BPE and unigram modes, and supports multilingual text out of the box.
Key Parameters
When working with SentencePiece, several parameters critically affect tokenization quality and model performance:
- vocab_size: Controls the final vocabulary size. Larger vocabularies (8,000-32,000) capture more subword patterns but increase computational cost. Smaller vocabularies (1,000-4,000) are more efficient but may lose rare patterns. The vocabulary size directly affects the average number of tokens per word; larger vocabularies tend to produce fewer, longer tokens.
- model_type: Choose between 'bpe' for compression-focused tokenization and 'unigram' for language modeling approaches. BPE greedily merges the most frequent pairs at each step, while unigram optimizes the global probability across all possible segmentations. BPE is generally faster to train, while unigram often produces better token boundaries for downstream tasks.
- character_coverage: Set to 1.0 for multilingual support to ensure all Unicode characters are covered. Lower values (e.g., 0.995) can be used for language-specific models, allowing the algorithm to treat the rarest 0.5% of characters as unknown tokens to reduce vocabulary size.
- byte_fallback: Enables handling of rare or unseen characters by falling back to byte-level representation. When a character isn't in the vocabulary, it's encoded as its raw UTF-8 bytes (each prefixed with a special marker). Essential for robust multilingual tokenization.
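Put together, a multilingual-leaning configuration might be specified like this (the path and values are illustrative assumptions, not recommendations for any specific task):

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",             # placeholder path to training text
    model_prefix="sp_multilingual",
    vocab_size=32000,               # larger vocabulary for many languages
    model_type="unigram",           # global, probabilistic segmentation
    character_coverage=0.9995,      # leave the rarest characters to byte fallback
    byte_fallback=True,             # unseen characters become raw UTF-8 byte tokens
)
```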
Limitations & Impact
SentencePiece's byte-level approach changed how tokenization works, but it comes with trade-offs. The lack of explicit whitespace handling can make debugging difficult, and the learned token boundaries may not always align with linguistic intuition. However, these limitations are outweighed by the algorithm's ability to handle any language without preprocessing.
SentencePiece enabled the development of truly multilingual models like mBERT and XLM-R, which can process hundreds of languages with a single vocabulary. This made language models more accessible across different linguistic communities.
Summary
SentencePiece introduced a data-driven approach to tokenization that treats text as raw bytes and learns optimal subword units directly from corpora. By eliminating language-specific preprocessing and handling whitespace through special prefixes, it enables truly multilingual tokenization.
The algorithm's two main variants offer different trade-offs:
- BPE greedily merges the most frequent adjacent pair at each step: $(x^*, y^*) = \arg\max_{(x, y)} \text{count}(x, y)$
- Unigram finds the segmentation that maximizes total probability: $\mathbf{x}^* = \arg\max_{\mathbf{x} \in S(X)} \prod_{i=1}^{M} p(x_i)$
While the approach can produce counterintuitive tokenizations, it enables truly multilingual NLP. Many of today's most capable language models use SentencePiece, including T5, ALBERT, and XLM-RoBERTa.