Unigram Language Model Tokenization: Probabilistic Subword Segmentation

Michael Brenndoerfer · Updated August 6, 2025 · 20 min read

Master probabilistic tokenization with unigram language models. Learn how SentencePiece uses EM algorithms and Viterbi decoding to create linguistically meaningful subword units, outperforming deterministic methods like BPE.


Unigram Language Model Tokenization

Introduction

Subword tokenization methods like Byte Pair Encoding (BPE) work by greedily merging the most frequent character pairs. But what if we could make smarter decisions about which merges to perform? The Unigram Language Model approach treats tokenization as a probabilistic optimization problem, asking: "What's the most likely way to segment this text into tokens?"

At its core, unigram LM tokenization trains a language model on subword units and uses it to find the tokenization that maximizes the probability of the text. This approach, popularized by the SentencePiece library, offers several advantages over deterministic methods like BPE.

Rather than applying fixed merge rules, unigram tokenization learns a vocabulary of subword units where each unit has an associated probability. During tokenization, the algorithm searches for the segmentation that gives the highest total probability under this model.

Technical Deep Dive

This section covers the mathematical foundations of unigram tokenization, including the probabilistic model formulation, the EM algorithm for training, and the Viterbi algorithm for finding optimal segmentations.

Unigram Language Model Formulation

A unigram language model assigns probabilities to individual tokens independently. The probability of one token doesn't depend on what tokens came before or after it. This independence assumption makes the model computationally tractable while still capturing which subword units are common in the language.

For tokenization, we want to segment text into a sequence of subword tokens that maximizes the joint probability. Because tokens are treated as independent under the unigram assumption, the joint probability of a segmentation is simply the product of the individual token probabilities:

P(\text{text}) = \prod_{i=1}^{n} P(\text{token}_i)

where:

  • P(\text{text}): the probability of the entire text under this particular segmentation
  • n: the number of tokens in the segmentation (different segmentations may have different values of n)
  • \text{token}_i: the i-th subword token in the sequence
  • P(\text{token}_i): the probability of the i-th token, learned from the training corpus
  • \prod: the product operator, multiplying together all token probabilities

The key insight is that we have a vocabulary V of learned subword units, each with an associated probability. We can represent any word as a sequence of subword tokens from V. For example, the word "tokenization" might be segmented as ["token", "ization"] or ["tok", "en", "ization"], and we want to choose the segmentation with the highest probability.

Why does this product formula make sense? Intuitively, if we segment a word into common, frequently occurring subwords, each P(\text{token}_i) will be relatively high, giving a high overall probability. If we use rare or arbitrary character sequences, those tokens will have low probabilities, dragging down the product.
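
To make the comparison concrete, here is a tiny Python sketch that scores two candidate segmentations of "tokenization" under a toy vocabulary. The probabilities are invented purely for illustration; a trained unigram model would supply real values.

Code
import math

# Invented unigram probabilities, for illustration only
toy_probs = {"token": 0.010, "ization": 0.005, "tok": 0.002, "en": 0.020}

def segmentation_log_prob(tokens, probs):
    """Log of the product of token probabilities (sum of logs)."""
    return sum(math.log(probs[t]) for t in tokens)

print(segmentation_log_prob(["token", "ization"], toy_probs))      # about -9.9
print(segmentation_log_prob(["tok", "en", "ization"], toy_probs))  # about -15.4

Working in log space turns the product into a sum, which is how these scores are compared in practice (and how SentencePiece reports them later in this article).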

To illustrate this concept, let's visualize different possible segmentations of the word "tokenization" with their associated probabilities. The unigram model learns that certain subword units like "token" and "ization" are more probable than arbitrary splits.

Visualization: Different possible segmentations of 'tokenization' with their log probabilities. The best segmentation (token + ization) uses meaningful subword units that frequently appear in training corpora.

This visualization shows how the unigram model evaluates different ways to split a word. The most probable segmentation combines meaningful subword units ("token" + "ization") that frequently appear in the training corpus. Less natural splits receive lower probabilities, guiding the model toward linguistically sensible token boundaries.

The EM Algorithm for Training

Training a unigram language model for tokenization uses an Expectation-Maximization (EM) algorithm, adapted specifically for the tokenization task. Unlike standard language model training, the EM algorithm here must consider all possible ways to segment each word in the corpus. We start with a large initial vocabulary (typically all possible substrings up to a maximum length) and iteratively refine both the vocabulary and the token probabilities.

E-Step: Compute Tokenization Probabilities

For each word in our training corpus, we compute the probability of all possible ways to segment it using the current vocabulary. This gives us the expected count of each subword token across all possible segmentations.

M-Step: Update Probabilities

Once we have the expected counts from the E-step, we update the probability of each token to be proportional to how often it's expected to appear. This is the standard maximum likelihood estimate for a unigram model:

P(\text{token}) = \frac{\text{expected\_count}(\text{token})}{\sum_{\text{token}' \in V} \text{expected\_count}(\text{token}')}

where:

  • P(\text{token}): the updated probability for a specific token in the vocabulary
  • \text{expected\_count}(\text{token}): the expected frequency of this token, computed by summing its contribution across all possible segmentations of all words in the training corpus (weighted by each segmentation's probability)
  • V: the current vocabulary of subword units
  • \text{token}': a dummy variable iterating over all tokens in the vocabulary
  • \sum_{\text{token}' \in V}: the sum of expected counts across the entire vocabulary, which serves as a normalizing constant to ensure all probabilities sum to 1

This formula is essentially a frequency-based probability estimate: tokens that appear more frequently (across all likely segmentations) receive higher probabilities. The denominator ensures that the probabilities form a valid distribution.

The algorithm also includes a vocabulary pruning step that removes low-probability tokens to keep the vocabulary size manageable. Tokens whose probability falls below a threshold are removed, focusing the model on the most useful subword units.
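
The sketch below illustrates just the M-step update and the pruning idea on hypothetical expected counts. The counts and threshold are made up for illustration; a real implementation such as SentencePiece computes expected counts over all segmentations and prunes based on the loss in corpus likelihood.

Code
# Hypothetical expected counts from one E-step (illustrative values only)
expected_counts = {"low": 5.2, "est": 3.9, "new": 2.8, "lowe": 0.4, "st": 0.3}

# M-step: normalize expected counts into a probability distribution
total = sum(expected_counts.values())
probs = {tok: count / total for tok, count in expected_counts.items()}

# Pruning: drop tokens whose probability falls below a made-up threshold
PRUNE_THRESHOLD = 0.05
vocab = {tok: p for tok, p in probs.items() if p >= PRUNE_THRESHOLD}
print(vocab)  # "lowe" and "st" fall below the threshold and are removed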

To illustrate the EM algorithm's learning process, let's visualize how token probabilities evolve over iterations. The algorithm starts with uniform probabilities and gradually learns which subword units are more frequent in the corpus.

Visualization: EM algorithm convergence showing how token probabilities evolve over iterations. Meaningful tokens ('low', 'est', 'new') gain probability while less useful tokens ('lowe', 'st') are pruned away as their probabilities approach zero.

This visualization demonstrates the EM algorithm's learning dynamics. Initially, all tokens have equal probability. Over iterations, meaningful subword units that frequently appear in the corpus ("low", "est", "new") gain probability, while less useful or nonexistent tokens are gradually pruned away. The algorithm converges when the token probabilities stabilize, resulting in a vocabulary optimized for the corpus's subword patterns.

Unigram LM vs N-gram Models

Unigram models treat tokens as independent, while n-gram models consider dependencies between consecutive tokens. For tokenization, the unigram assumption works well because we're primarily concerned with which subword units are meaningful, not their sequential dependencies within a word.

Viterbi Decoding for Tokenization

Once we have our trained unigram model, tokenization becomes a search problem: find the segmentation of the input text that gives the highest probability. A naive approach would enumerate all possible ways to split the word into tokens, but this grows exponentially with word length. Instead, we use the Viterbi algorithm, a dynamic programming approach that efficiently finds the optimal segmentation.

The key insight of Viterbi is that the best segmentation ending at any position can be computed from the best segmentations ending at earlier positions. We only need to track the single best path to each position, not all possible paths.

For a word w of length L, we want to find positions 0 = p_0 < p_1 < \ldots < p_k = L that define the token boundaries. These positions split the word into k tokens, and we want the split that maximizes the product of token probabilities:

\prod_{i=1}^{k} P(w[p_{i-1}:p_i])

where:

  • w: the input word to tokenize (a string of L characters)
  • L: the length of the word in characters
  • k: the number of tokens in this particular segmentation (unknown in advance, so we search over all valid values)
  • p_0, p_1, \ldots, p_k: token boundary positions, where p_0 = 0 (start of word) and p_k = L (end of word)
  • w[p_{i-1}:p_i]: the substring of w from character position p_{i-1} (inclusive) to p_i (exclusive), representing the i-th token
  • P(w[p_{i-1}:p_i]): the probability of this substring as a token, looked up from our trained vocabulary
  • \prod_{i=1}^{k}: the product of all k token probabilities

The Viterbi algorithm efficiently finds the most probable segmentation by using dynamic programming: for each position j in the word, it computes the highest-probability segmentation of the substring w[0:j]. This is done by considering all possible last tokens ending at position j and taking the one that gives the best total probability. Because we only need to track the single best path to each position (not all paths), the algorithm runs in O(L^2) time rather than exponential time.
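
To make the dynamic programming concrete, here is a minimal Viterbi segmenter over a toy vocabulary of log probabilities. The vocabulary and scores are made up for illustration, and the sketch omits details of a production implementation such as SentencePiece's.

Code
import math

# Hypothetical token log probabilities (illustration only; a trained
# unigram model would provide these scores)
log_probs = {
    "low": -1.6, "est": -2.5, "lowe": -2.0, "st": -4.4,
    "l": -4.4, "o": -4.4, "w": -4.4, "e": -4.4, "s": -4.4, "t": -4.4,
}

def viterbi_segment(word, log_probs):
    """Return the highest-log-probability segmentation of `word`."""
    n = len(word)
    best = [-math.inf] * (n + 1)   # best[j] = best log-prob of word[0:j]
    back = [0] * (n + 1)           # back[j] = start index of the last token
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):          # candidate last token word[i:j]
            piece = word[i:j]
            if piece in log_probs and best[i] + log_probs[piece] > best[j]:
                best[j] = best[i] + log_probs[piece]
                back[j] = i
    tokens, j = [], n               # follow back-pointers to recover the tokens
    while j > 0:
        tokens.append(word[back[j]:j])
        j = back[j]
    return list(reversed(tokens)), best[n]

print(viterbi_segment("lowest", log_probs))  # (['low', 'est'], about -4.1)

Each position's best score is computed once and reused, which is exactly what keeps the search polynomial instead of enumerating every segmentation.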

Sampling Multiple Segmentations

While Viterbi gives us the single best segmentation, we might want to sample from the distribution of possible segmentations. This is useful for data augmentation and creating more robust models.

The sampling algorithm works by:

  1. Computing all possible token boundaries at each position
  2. Sampling tokens according to their probabilities
  3. Building a complete segmentation by sampling sequentially

This approach can generate multiple valid tokenizations for the same text, which helps models learn more robust representations.
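
In SentencePiece, this sampling is exposed directly on the encoder. The sketch below assumes a trained unigram model `sp` like the one built in the code section later in this article; the particular alpha value is an arbitrary choice, where smaller alpha flattens the distribution and yields more varied segmentations.

Code
# Assumes `sp` is a loaded SentencePiece unigram model (see the training code below).
# enable_sampling=True draws a segmentation from the model's distribution;
# nbest_size=-1 samples over all hypotheses rather than a fixed n-best list.
for _ in range(3):
    print(sp.encode("lowest newest", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))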

Subword Regularization

Subword regularization is a technique that randomly samples different tokenizations during training. Instead of always using the same segmentation, we sometimes use alternative segmentations that have lower but still reasonable probabilities.

This regularization helps models become more robust to different tokenization choices and improves generalization, especially for morphologically rich languages.

A Worked Example

Let's work through a detailed example to understand how unigram tokenization processes text. This will show how the algorithm learns from data and makes tokenization decisions.

Suppose we have a small training corpus:

low lower lowest new newest newest

This corpus contains morphological patterns we want the model to learn: "low" appears in "low", "lower", and "lowest"; "est" appears in "lowest" and "newest"; "new" appears in "new" and "newest".

Step 1: Initial Vocabulary Construction

We start with all possible substrings up to a maximum length (typically 4-6 characters for efficiency). For our corpus, this includes:

  • Character-level tokens: "l", "o", "w", "e", "s", "t", "n", "r"
  • Bigram tokens: "lo", "ow", "we", "es", "st", "ne", "ew", "er"
  • Trigram tokens: "low", "owe", "wer", "est", "new", "wes"
  • Longer tokens: "lower", "lowest", "newest"

Step 2: EM Training Process

The EM algorithm iteratively refines token probabilities:

E-Step: For each word, compute the probability of all possible segmentations using current token probabilities. For "lowest":

  • ["low", "est"]: P(low) × P(est) = high probability
  • ["lowe", "st"]: P(lowe) × P(st) = medium probability
  • ["l", "o", "w", "e", "s", "t"]: product of 6 character probabilities = low probability

M-Step: Update token probabilities based on expected frequencies across all segmentations in the corpus.

After several iterations, tokens that appear frequently get higher probabilities:

  • "low": appears in "low", "lower", "lowest" → high probability
  • "est": appears in "lowest", "newest" → high probability
  • "new": appears in "new", "newest" → high probability
  • Rare tokens like "lowest" (appears only once) → lower probability

Step 3: Tokenization of New Text

When tokenizing "lowest" using the trained model, the Viterbi algorithm evaluates possible segmentations:

Viterbi algorithm evaluation of segmentations for "lowest". The best segmentation maximizes joint probability.

Segmentation | Token Sequence | Probability | Calculation
Best | ["low", "est"] | High | P(low) × P(est)
Alternative | ["lowe", "st"] | Medium | P(lowe) × P(st)
Worst | ["l", "o", "w", "e", "s", "t"] | Low | P(l) × P(o) × P(w) × P(e) × P(s) × P(t)

The algorithm chooses ["low", "est"] because it maximizes the joint probability while using the fewest, most meaningful tokens.

This example demonstrates the core principles: unigram tokenization learns which subword units are most probable from the training data, then uses dynamic programming to find optimal segmentations. The probabilistic approach allows for more linguistically sensible token boundaries compared to rule-based methods.

Code Implementation

Now let's see these concepts in action using SentencePiece, a practical implementation of unigram tokenization. The following code demonstrates the complete pipeline from corpus preparation to tokenization.

Training a Unigram Model

First, we'll create a small training corpus and train a unigram tokenizer using SentencePiece. The corpus contains words with shared morphological patterns that the model should learn to recognize.

In[4]:
Code
import tempfile

import sentencepiece as spm

# Create a small training corpus with morphological patterns
corpus = """low lower lowest new newest newest
low lower lowest new newest newest
low lower lowest new newest newest"""

# Write corpus to temporary file
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
    f.write(corpus)
    corpus_file = f.name

# Train unigram model
model_prefix = "unigram_example"
spm.SentencePieceTrainer.train(
    input=corpus_file,
    model_prefix=model_prefix,
    vocab_size=18,
    model_type="unigram",
    character_coverage=1.0,
    unk_id=0,
    unk_piece="<unk>",
    eos_id=1,
    eos_piece="</s>",
    bos_id=2,
    bos_piece="<s>",
    pad_id=-1,
)

# Load the trained model
sp = spm.SentencePieceProcessor()
sp.load(f"{model_prefix}.model")

Examining the Learned Vocabulary

After training, we can inspect the vocabulary to see what subword units the model learned and their associated log probabilities. Higher values (less negative) indicate more frequent tokens.

In[5]:
Code
# Get vocabulary size and examine learned tokens
vocab_size = sp.get_piece_size()
vocab_items = []
for i in range(vocab_size):
    piece = sp.id_to_piece(i)
    score = sp.get_score(i)
    vocab_items.append((piece, score))
Out[6]:
Console
Vocabulary size: 18

Learned vocabulary with log probabilities:
----------------------------------------
  <unk>        : 0.0000
  </s>         : 0.0000
  <s>          : 0.0000
  ▁newest      : -1.5001
  ▁low         : -1.5939
  ▁lowe        : -2.0020
  ▁new         : -2.1615
  r            : -2.2434
  est          : -2.4506
  t            : -4.4123
  s            : -4.4124
  o            : -4.4125
  n            : -4.4126
  l            : -4.4127
  ▁            : -4.4128
  w            : -4.4129
  e            : -4.4130
  st           : -4.4130

The vocabulary contains special tokens (<unk>, </s>, <s>) with score 0, followed by learned subword units. Tokens like "▁low", "est", and "▁new" have higher (less negative) scores because they appear frequently in the training corpus. The ▁ symbol represents a space/word boundary in SentencePiece's encoding.

Tokenizing Text

Now let's tokenize some test words to see how the model segments them. The model uses Viterbi decoding to find the highest-probability segmentation for each word.

In[7]:
Code
# Test words to tokenize
test_words = ["lowest", "newest", "unknownword"]

# Tokenize each word and collect results
tokenization_results = []
for word in test_words:
    pieces = sp.encode_as_pieces(word)
    ids = sp.encode_as_ids(word)
    piece_probs = [sp.get_score(sp.piece_to_id(piece)) for piece in pieces]
    total_log_prob = sum(piece_probs)

    tokenization_results.append(
        {
            "word": word,
            "pieces": pieces,
            "ids": ids,
            "piece_probs": piece_probs,
            "total_log_prob": total_log_prob,
        }
    )
Out[8]:
Console
Tokenization Results:
============================================================

Word: 'lowest'
  Tokens: ['▁low', 'est']
  Token IDs: [4, 8]
  Token log-probs: ['-1.594', '-2.451']
  Total log-prob: -4.044

Word: 'newest'
  Tokens: ['▁newest']
  Token IDs: [3]
  Token log-probs: ['-1.500']
  Total log-prob: -1.500

Word: 'unknownword'
  Tokens: ['▁', 'u', 'n', 'k', 'n', 'o', 'w', 'n', 'w', 'o', 'r', 'd']
  Token IDs: [14, 0, 12, 0, 12, 11, 15, 12, 15, 11, 7, 0]
  Token log-probs: ['-4.413', '0.000', '-4.413', '0.000', '-4.413', '-4.413', '-4.413', '-4.413', '-4.413', '-4.413', '-2.243', '0.000']
  Total log-prob: -37.545

For "lowest" and "newest", the model identifies meaningful subword units that appear frequently in the training corpus. The segmentation into ["▁low", "est"] and ["▁new", "est"] reflects the shared morphological pattern in the training data. The word "unknownword" contains no familiar patterns from training, so the model falls back to character-level tokenization, resulting in many low-probability tokens and a much lower total log-probability.

Comparing Segmentation Probabilities

Let's examine how the model compares different possible segmentations of the same word by looking at their total log probabilities.

In[9]:
Code
# Compare the best segmentation to a character-level fallback
sample_word = "lowest"

# Get the optimal segmentation (Viterbi decoding)
best_pieces = sp.encode_as_pieces(sample_word)
best_probs = [sp.get_score(sp.piece_to_id(piece)) for piece in best_pieces]
best_total = sum(best_probs)

# Character-level segmentation (split into individual characters with word boundary)
char_pieces = ["▁"] + list(sample_word)
char_probs = []
for piece in char_pieces:
    piece_id = sp.piece_to_id(piece)
    if piece_id != sp.unk_id():
        char_probs.append(sp.get_score(piece_id))
    else:
        char_probs.append(float("-inf"))  # Unknown piece
char_total = sum(p for p in char_probs if p != float("-inf"))
Out[10]:
Console
Segmentation comparison for 'lowest':
--------------------------------------------------

Optimal segmentation (Viterbi):
  Tokens: ['▁low', 'est']
  Log-probs: ['-1.594', '-2.451']
  Total log-prob: -4.044

Character-level segmentation:
  Tokens: ['▁', 'l', 'o', 'w', 'e', 's', 't']
  Log-probs: ['-4.413', '-4.413', '-4.413', '-4.413', '-4.413', '-4.412', '-4.412']
  Total log-prob: -30.889

Difference: 26.844 (optimal is 26.8 log units better)

The optimal segmentation has a substantially higher log probability than the character-level alternative. This demonstrates the core principle of unigram tokenization: by learning which subword units are common, the model prefers segmentations that use fewer, more meaningful tokens over arbitrary character sequences.
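
Beyond the single Viterbi path, the SentencePiece processor can also return the n best segmentations it considered, which is a convenient way to inspect the runner-up hypotheses. This sketch assumes the Python wrapper's nbest_encode_as_pieces helper and the `sp` model trained above; the choice of 3 candidates is arbitrary.

Code
# Inspect the top candidate segmentations the model considered for a word
for candidate in sp.nbest_encode_as_pieces("lowest", 3):
    log_prob = sum(sp.get_score(sp.piece_to_id(p)) for p in candidate)
    print(candidate, round(log_prob, 3))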

In[11]:
Code
import os

# Clean up temporary files
os.unlink(corpus_file)
os.unlink(f"{model_prefix}.model")
os.unlink(f"{model_prefix}.vocab")

Key Parameters

The key parameters for SentencePiece unigram training are:

  • vocab_size: Controls the final vocabulary size after training. Larger vocabularies capture more subword patterns but increase memory usage and may lead to overfitting on rare tokens. Typical values range from 8K to 32K for monolingual models, while multilingual models often use substantially larger vocabularies.
  • model_type: Specifies the tokenization algorithm. Use 'unigram' for probabilistic unigram LM tokenization or 'bpe' for byte pair encoding. Unigram is generally preferred for its ability to sample multiple tokenizations.
  • character_coverage: Determines what fraction of characters from the training corpus to include in the vocabulary (0.0-1.0). Use 1.0 for English and other Latin-script languages. For languages with large character sets (Chinese, Japanese), values around 0.9995 are common to exclude extremely rare characters.
  • unk_piece: The token used for unknown characters or subwords not in the vocabulary. Set this to a distinctive string like '<unk>' to easily identify out-of-vocabulary inputs in downstream processing.
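
For reference, here is how these parameters might look in a training call for a larger, realistic corpus. The file names and specific values are hypothetical examples, not settings taken from this article's toy corpus.

Code
# Hypothetical training call for a larger corpus (file names and values are examples)
spm.SentencePieceTrainer.train(
    input="my_corpus.txt",        # one sentence per line
    model_prefix="my_unigram",
    model_type="unigram",         # probabilistic unigram LM tokenization
    vocab_size=32000,             # larger vocabularies capture more subword patterns
    character_coverage=0.9995,    # useful for languages with large character sets
    unk_piece="<unk>",
)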

To illustrate the difference between Unigram LM and BPE tokenization, let's compare how they segment the same text. Unigram LM makes probabilistic decisions based on learned token frequencies, while BPE follows deterministic merge rules.

Comparison of Unigram LM vs BPE tokenization. Unigram LM tends to preserve linguistically meaningful units like "natural" and "process" because they have high probabilities in the learned vocabulary.

Word | Unigram LM | BPE | Difference
tokenization | token + ization | token + ization | Same
is | is | is | Same
essential | ess + ential | ess + ential | Same
for | for | for | Same
natural | natural | nat + ural | Unigram keeps morpheme
language | language | lang + uage | Unigram keeps morpheme
processing | process + ing | proc + ess + ing | Unigram uses fewer tokens

This comparison highlights key differences between the approaches. Unigram LM assigns probabilities to each token based on corpus statistics, enabling more nuanced decisions. BPE follows deterministic merge rules and may create less linguistically meaningful segments. The probabilistic nature of Unigram LM allows for better handling of morphological variations and out-of-vocabulary words.

Limitations & Impact

Unigram tokenization offers probabilistic segmentation but comes with trade-offs. The EM training can be computationally expensive for large vocabularies, though SentencePiece optimizes this well. The independence assumption ignores token co-occurrence patterns that could inform better segmentation.

SentencePiece's unigram mode powers tokenization in models like T5, mBART, and many multilingual systems. By learning from actual text rather than applying heuristic rules, unigram tokenization handles diverse languages and domains more robustly than BPE.

The probabilistic approach also enables advanced features like subword regularization, which has improved the robustness of large language models. While BPE remains popular for its simplicity, unigram tokenization offers a more principled approach to the tokenization problem.

Summary

Unigram language model tokenization represents a significant advancement in subword tokenization by treating segmentation as a probabilistic optimization problem. Instead of applying fixed merge rules like BPE, this approach learns a probability distribution over subword units and uses dynamic programming to find the most likely segmentation.

The method combines several sophisticated techniques: an EM algorithm adapted for tokenization that considers all possible word segmentations, Viterbi decoding for efficient segmentation finding, and optional subword regularization for improved robustness. While more computationally intensive than deterministic approaches, unigram tokenization provides superior handling of morphological variation, out-of-vocabulary words, and multilingual text.

This probabilistic framework underlies many modern tokenization systems and demonstrates how statistical learning can produce more linguistically informed text segmentation than rule-based methods.

Quiz

Ready to test your understanding of unigram language model tokenization? This quiz covers the probabilistic approach to subword segmentation, EM training, and practical implementation.

