Discover why traditional word-level approaches fail with diverse text, from OOV words to morphological complexity. Learn the fundamental challenges that make subword tokenization essential for modern NLP.

The Vocabulary Problem
You've built bag-of-words representations. You've trained word embeddings that capture semantic relationships. But lurking beneath these techniques is a fundamental challenge: what happens when your model encounters a word it has never seen before?
This is the vocabulary problem, and it's more pervasive than you might think. Every time someone coins a new term, makes a typo, uses a technical abbreviation, or writes in a language with rich morphology, traditional word-based models falter. The word "ChatGPT" didn't exist before 2022, yet models trained before then need to process it. The misspelling "reccomend" isn't in any dictionary, yet users type it constantly. The German compound "Bundesausbildungsförderungsgesetz" is a perfectly valid word, yet it will almost certainly be absent from any vocabulary built from a standard corpus.
This chapter explores why word-level tokenization breaks down in practice. We'll examine the explosion of vocabulary sizes, the curse of rare words, and the fundamental tension between coverage and efficiency. By understanding these limitations, you'll see why subword tokenization, which we cover in the following chapters, became essential for modern NLP.
The Out-of-Vocabulary Problem
When Models Meet Unknown Words
Consider a sentiment analysis model trained on movie reviews. It learned representations for words like "excellent," "boring," and "cinematography." Now imagine deploying this model and encountering the review: "This movie was amazeballs! The CGI was unreal."
The word "amazeballs" is almost certainly not in the training vocabulary. What should the model do?
Traditional approaches have three options, none of them good:
- Replace with a special [UNK] token: The model treats all unknown words identically, losing crucial information. "Amazeballs" and "terrible" both become [UNK], erasing the distinction between positive and negative sentiment.
- Skip the word entirely: Now the sentence becomes "This movie was ! The CGI was unreal." We've preserved some structure but lost potentially important content.
- Attempt approximate matching: Maybe "amazeballs" is similar to "amazing"? But this requires additional infrastructure and often fails for truly novel words.
None of these solutions is satisfactory. The [UNK] approach is most common, but it creates a black hole in your model's understanding.
A word is out-of-vocabulary when it doesn't appear in the fixed vocabulary that was constructed during training. OOV words must be handled specially, typically by mapping them to a generic [UNK] token, which discards their unique meaning.
Measuring the OOV Rate
Let's quantify how serious this problem is. We'll train a vocabulary on one text corpus and measure how many words from another corpus are out-of-vocabulary.
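A minimal sketch of that measurement is shown below. The two strings are tiny illustrative stand-ins for the corpora (a classic-literature excerpt for training, a modern product review for testing); any real experiment would load full texts instead.

```python
import re

def tokenize(text):
    # A deliberately simple word-level tokenizer: lowercase, keep alphabetic runs
    return re.findall(r"[a-z]+", text.lower())

# Illustrative stand-ins for the training and test corpora
train_text = "It was the best of times, it was the worst of times, it was the age of wisdom"
test_text = "The smartphone has an OLED screen with HDR, Bluetooth audio, and a fast GPU."

vocab = set(tokenize(train_text))
test_tokens = tokenize(test_text)
oov = [t for t in test_tokens if t not in vocab]

print(f"Vocabulary size: {len(vocab)}")
print(f"OOV rate: {len(oov) / len(test_tokens):.1%}")
print("OOV tokens:", sorted(set(oov)))
```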
Even with this small demonstration, we see a substantial OOV rate. Over half of the test tokens are unknown to our vocabulary built from classic literature. Words like "smartphone," "oled," "hdr," "bluetooth," and "gpu" are completely absent because they represent modern technology concepts that didn't exist when those texts were written. This illustrates how domain shift between training and deployment data creates OOV problems even when vocabulary sizes seem adequate. In real applications with larger vocabularies, the problem persists because language continuously evolves.
The Long Tail of Language
The OOV problem stems from a fundamental property of language: word frequencies follow Zipf's law. A small number of words appear very frequently, while an enormous number of words appear rarely.
The long tail means that no matter how large your training corpus, you'll always encounter new words. Even after seeing billions of words, there will be valid English words, proper nouns, technical terms, and neologisms that never appeared in your training data.
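You can observe this distribution with a few lines of counting. The sketch below assumes a plain-text file is available (the path corpus.txt is a placeholder) and reports how much of the text the most frequent words cover versus how many words appear exactly once.

```python
import re
from collections import Counter

# Placeholder path: any reasonably large plain-text corpus will show the same shape
with open("corpus.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+", f.read().lower())

counts = Counter(tokens)
ranked = counts.most_common()

head_mass = sum(c for _, c in ranked[:100]) / len(tokens)   # share covered by the top 100 words
hapaxes = sum(1 for _, c in ranked if c == 1)               # word types seen exactly once

print(f"{len(ranked):,} distinct word types in {len(tokens):,} tokens")
print(f"Top 100 word types cover {head_mass:.1%} of all tokens")
print(f"{hapaxes:,} types ({hapaxes / len(ranked):.1%}) appear exactly once")
```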
Vocabulary Size Explosion
The Coverage-Size Tradeoff
How large should your vocabulary be? This seems like a simple question, but it reveals a fundamental tension in NLP system design.
A small vocabulary is computationally efficient. The embedding matrix, softmax layer, and any word-based operations scale with vocabulary size. A vocabulary of 10,000 words means 10,000 embeddings to store and 10,000 classes for any word prediction task.
But a small vocabulary means high OOV rates. Users will constantly encounter the [UNK] token, degrading model performance.
A large vocabulary reduces OOV rates but introduces its own problems:
- Memory explosion: Each word needs an embedding vector. With 300-dimensional embeddings, 1 million words requires 1.2 GB just for the embedding matrix.
- Sparse gradients: Rare words appear infrequently during training, so their embeddings receive few gradient updates and remain poorly learned.
- Computational cost: Softmax over millions of classes becomes prohibitively expensive.
Let's visualize this tradeoff by examining how vocabulary size affects corpus coverage.
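Here is one way to compute that curve, again assuming a placeholder corpus.txt; it prints cumulative coverage at a few vocabulary sizes rather than drawing the plot.

```python
import re
from collections import Counter
from itertools import accumulate

with open("corpus.txt", encoding="utf-8") as f:        # placeholder path
    tokens = re.findall(r"[a-z]+", f.read().lower())

freqs = [c for _, c in Counter(tokens).most_common()]  # frequencies, most common first
cumulative = list(accumulate(freqs))                   # cumulative token counts
total = len(tokens)

for vocab_size in (1_000, 10_000, 50_000, 100_000):
    k = min(vocab_size, len(cumulative))
    print(f"vocab size {vocab_size:>7,}: {cumulative[k - 1] / total:.2%} of tokens covered")
```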
The curve reveals a sobering truth: achieving high coverage requires exponentially larger vocabularies. Going from 90% to 99% coverage might require 10x more vocabulary entries. And 100% coverage is essentially impossible because language is infinitely productive.
Real-World Vocabulary Statistics
Let's examine actual vocabulary sizes from popular NLP resources to understand the scale of this problem.
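The vocabulary sizes in the sketch below are approximate, commonly cited figures (exact counts vary by release), and the memory estimate assumes 300-dimensional float32 embeddings.

```python
def embedding_memory_gb(vocab_size, dim=300, bytes_per_float=4):
    # Memory for a dense embedding matrix: one float32 vector per vocabulary entry
    return vocab_size * dim * bytes_per_float / 1e9

# Approximate, commonly cited vocabulary sizes (indicative values, not exact)
resources = {
    "Basic English word list": 850,
    "Typical task vocabulary": 50_000,
    "Word2Vec (Google News)": 3_000_000,
    "Google Web 1T word types": 13_000_000,
}

for name, size in resources.items():
    print(f"{name:<26} {size:>12,} words  ~{embedding_memory_gb(size):8.3f} GB of embeddings")
```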
The contrast in scale is striking. Basic English's 850 words require less than a megabyte of storage for embeddings, while pre-trained word vectors like Word2Vec or GloVe contain millions of entries requiring multiple gigabytes. The Google Web 1T corpus represents an extreme case: over 13 million unique word forms, yet even this massive vocabulary doesn't capture every possible word. The memory requirements grow linearly with vocabulary size, with each additional word adding 1.2 KB (300 dimensions × 4 bytes per float). This creates a practical ceiling on vocabulary size, forcing practitioners to choose between coverage and computational efficiency.
The Curse of Rare Words
Poorly Learned Representations
Even when rare words make it into the vocabulary, they suffer from a different problem: insufficient training data. Consider how word embeddings are learned. Each word's representation is updated based on its context. A word that appears 100,000 times receives 100,000 gradient updates, each refining its embedding. A word that appears 10 times receives only 10 updates.
The result is that rare word embeddings are poorly learned. They might be almost random vectors, barely moved from their initialization.
This creates a vicious cycle. Rare words have poor embeddings, which means they contribute little to downstream task performance, which means there's no pressure to improve their representations.
The Minimum Frequency Cutoff
To avoid poorly learned embeddings, most word embedding methods impose a minimum frequency cutoff. Words appearing fewer than, say, 5 times are excluded from the vocabulary entirely.
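The sketch below applies such a cutoff to hypothetical frequency counts; the words and numbers are invented purely to mirror the table discussed next.

```python
from collections import Counter

# Hypothetical corpus frequencies for 13 model-name tokens (invented, illustrative only)
word_counts = Counter({
    "bert": 5000, "gpt": 4200, "transformer": 3900, "attention": 2500,
    "roberta": 800, "t5": 650, "albert": 400, "distilbert": 150,
    "xlnet": 60, "electra": 45, "bigbird": 4, "reformer": 2, "linformer": 1,
})

for min_count in (1, 5, 100):
    kept = {w for w, c in word_counts.items() if c >= min_count}
    dropped = sorted(set(word_counts) - kept)
    print(f"min_count={min_count:>3}: {len(kept):>2} words kept, dropped: {dropped}")
```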
The table reveals the tradeoff inherent in frequency cutoffs. With min_count=1, all 13 words remain in the vocabulary, but the rarest entries like "linformer" (appearing only once) will have poorly learned embeddings. Raising the cutoff to min_count=5 excludes 3 words, while min_count=100 leaves only 8 words, excluding newer model names like "xlnet" and "electra" entirely. In practice, cutoffs between 5 and 10 are common, balancing embedding quality against coverage. The excluded words become OOV at inference time, replaced with [UNK] and losing their semantic content.
Morphological Productivity
Languages with Rich Morphology
English has relatively simple morphology. Most words have just a few forms: "walk," "walks," "walked," "walking." Other languages aren't so simple.
Consider Finnish, Turkish, or Hungarian. In these agglutinative languages, words are built by stringing together morphemes. A single Finnish word might encode subject, object, tense, aspect, mood, and more. The word "talossanikinko" means roughly "in my house too, I wonder?", all packed into a single token.
German famously allows compound nouns: "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" is a real word meaning a law about beef labeling supervision delegation. While extreme, such compounds are productive and regularly coined.
For these languages, word-level vocabularies explode. Every combination of morphemes creates a new vocabulary entry, even though the underlying meaning is compositional.
These examples demonstrate how agglutinative languages build words by attaching morphemes. In Finnish, the root "talo" (house) gains suffixes for location (-ssa), possession (-ni), emphasis (-kin), and question (-ko). Each combination creates a distinct vocabulary entry, even though a native speaker would instantly parse the components. Similarly, German compounds chain nouns together, and Turkish systematically adds suffixes for plurality, possession, and case. A word-level vocabulary would need separate entries for each form, despite their compositional nature.
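A rough sketch of how those morpheme slots multiply, using a heavily simplified suffix inventory for the root "talo" (real Finnish has many more cases, possessives, and clitics, and not every generated combination is equally natural):

```python
from itertools import product

root = "talo"                                        # "house"
cases       = ["", "ssa", "sta", "lla", "lle"]       # a handful of locative cases
possessives = ["", "ni", "si", "mme"]                # my / your / our
clitics     = ["", "kin", "ko", "kinko"]             # also / question / both

# Finnish order: root + case + possessive + clitic(s), e.g. talo-ssa-ni-kin-ko
forms = {root + c + p + cl for c, p, cl in product(cases, possessives, clitics)}
print(f"{len(forms)} surface forms from a single root")   # 5 * 4 * 4 = 80
print(sorted(forms)[:6])
```

Even with only three small suffix slots, one root yields 80 distinct word forms; a realistic inventory of cases and clitics pushes that into the thousands per root.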
The Vocabulary Explosion in Morphologically Rich Languages
The combinatorial nature of morphology causes vocabulary explosion. Where English might have 50,000 common word forms, Turkish or Finnish might have millions of valid word forms, most of which any individual speaker has never seen but would instantly understand.
Technical and Domain-Specific Text
The Challenge of Specialized Vocabulary
NLP systems increasingly process technical text: code, scientific papers, medical records, legal documents. Each domain brings its own vocabulary challenges.
Code and Programming:
- Variable names: getUserById, XMLHttpRequest, __init__
- Mixed formats: camelCase, snake_case, SCREAMING_SNAKE_CASE
- Special characters: !=, ->, ::, @property
Scientific Text:
- Chemical formulas: CH₃COOH, C₆H₁₂O₆
- Gene names: BRCA1, TP53, CFTR
- Technical terms: "phosphorylation," "eigendecomposition"
Medical Text:
- Drug names: "hydroxychloroquine," "acetaminophen"
- Conditions: "atherosclerosis," "thrombocytopenia"
- Abbreviations: "bid" (twice daily), "prn" (as needed)
The code tokens illustrate how programming conventions pack multiple words into single identifiers. Each camelCase or snake_case token contains meaningful subparts ("get," "user," "by," "id") that would be individually recognized by a general vocabulary, but the combined form is almost certainly OOV. The scientific terms present a different challenge: they're morphologically complex words built from Greek and Latin roots. "Phosphofructokinase" combines "phospho-" (phosphate), "fructo-" (fructose), and "-kinase" (enzyme that transfers phosphate groups). General-purpose vocabularies trained on news or web text have no representation for these specialized terms, forcing domain-specific applications to either expand their vocabularies dramatically or accept high OOV rates.
Code Tokenization: A Special Challenge
Code presents unique tokenization challenges. Unlike natural language, code uses explicit conventions like camelCase and snake_case to pack multiple concepts into single tokens.
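One possible implementation of those splitters is sketched below; real code tokenizers handle digits, Unicode, and mixed conventions with more care.

```python
import re

def split_camel_case(identifier):
    # Acronym runs (e.g. "XML", "HTML") stay together; otherwise split at case boundaries
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]

def split_snake_case(identifier):
    return [p.lower() for p in identifier.split("_") if p]

print(split_camel_case("getUserById"))        # ['get', 'user', 'by', 'id']
print(split_camel_case("XMLHttpRequest"))     # ['xml', 'http', 'request']
print(split_camel_case("parseHTMLResponse"))  # ['parse', 'html', 'response']
print(split_snake_case("get_user_by_id"))     # ['get', 'user', 'by', 'id']
```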
The splitting functions reveal the compositional structure embedded in programming identifiers. Both getUserById (camelCase) and get_user_by_id (snake_case) decompose into the same four meaningful tokens: ["get", "user", "by", "id"]. Notice how the camelCase splitter handles acronyms like "XML" and "HTML": it keeps them together as single units rather than splitting each uppercase letter. Each resulting component is a common English word that likely exists in any general vocabulary, even though the combined identifier would be OOV. This insight, that complex tokens often contain recognizable subparts, motivates the subword tokenization approach we explore in subsequent chapters.
The Case for Subword Units
The challenges we've examined, OOV words, vocabulary explosion, rare word embeddings, and morphological complexity, all point toward a common solution. Rather than treating words as indivisible units, we can decompose them into smaller, reusable pieces.
Breaking Words into Meaningful Pieces
The vocabulary problem has an elegant solution: stop treating words as atomic units. Instead, break words into smaller pieces that can be combined to form any word.
Consider the word "unhappiness":
- As a whole word, it might be rare and poorly represented
- Split into "un" + "happi" + "ness", each piece is common
The prefix "un-" appears in hundreds of words (undo, unfair, unable). The suffix "-ness" appears in thousands (happiness, sadness, kindness). The root "happy" is common. By representing "unhappiness" as a sequence of these pieces, we:
- Eliminate OOV entirely: Any word can be broken into subword units
- Share parameters: "un-" learned from "undo" helps with "unfair"
- Reduce vocabulary size: Thousands of subwords can generate millions of words
- Handle morphology: Compositional words decompose naturally
The decomposition table shows how diverse words break into reusable pieces. Morphologically complex words like "unhappiness" and "internationalization" split at natural boundaries: prefixes, roots, and suffixes. Novel words like "ChatGPT" and "COVID19" decompose into shorter segments that, while perhaps not semantically meaningful individually, are learnable patterns. The reuse statistics reveal the key advantage: the prefix "un" appears in multiple words, so learning its meaning from "unhappiness" transfers to "unbelievable." This parameter sharing dramatically reduces the effective vocabulary size while maintaining complete coverage.
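To see how such a decomposition might be produced, here is a toy greedy longest-match segmenter with a hand-picked vocabulary. It is a simplification of WordPiece-style matching; real tokenizers learn their vocabularies from data rather than having them written by hand.

```python
def greedy_subword_tokenize(word, vocab):
    # Repeatedly take the longest vocabulary piece that matches at the current position;
    # fall back to single characters, so no input is ever out-of-vocabulary.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens

toy_vocab = {"un", "happi", "ness", "do", "fair", "believ", "able",
             "inter", "nation", "al", "iz", "ation"}

print(greedy_subword_tokenize("unhappiness", toy_vocab))           # ['un', 'happi', 'ness']
print(greedy_subword_tokenize("internationalization", toy_vocab))  # ['inter', 'nation', 'al', 'iz', 'ation']
print(greedy_subword_tokenize("unbelievable", toy_vocab))          # ['un', 'believ', 'able']
```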
From Characters to Subwords
At one extreme, we could tokenize at the character level. Every word becomes a sequence of characters, and there's no OOV problem: any text is just a sequence of characters from a fixed alphabet.
But character-level tokenization has severe drawbacks:
- Sequences become very long (a 10-word sentence might expand to 50+ character tokens, giving the model several times as many positions to process)
- The model must learn to compose characters into meaningful units
- Long-range dependencies become harder to capture
Subword tokenization finds the sweet spot. Subwords are longer than characters (capturing more meaning per token) but shorter than words (enabling composition). A typical subword vocabulary might have 30,000-50,000 entries, able to represent any text without OOV.
Looking Ahead: Subword Tokenization Algorithms
The next chapters explore the algorithms that make subword tokenization work. Each takes a different approach to deciding how to split words:
Byte Pair Encoding (BPE): Starts with characters and iteratively merges the most frequent pairs. The vocabulary grows bottom-up, with common sequences becoming single tokens.
WordPiece: Similar to BPE but uses a likelihood-based criterion for merging. Used by BERT and many Google models.
Unigram Language Model: Takes a top-down approach. Starts with a large vocabulary and iteratively removes pieces that contribute least to the language model likelihood.
SentencePiece: A framework that can implement BPE or Unigram, treating the input as a raw character stream (with whitespace as just another symbol) rather than requiring pre-tokenization. Enables truly language-agnostic tokenization.
Each algorithm produces a vocabulary of subword units and a procedure for tokenizing new text. The key insight uniting them all: words are not atoms. They can and should be decomposed into smaller, reusable pieces.
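As a small preview of the bottom-up idea behind BPE (the full algorithm comes in the next chapter), here is a single merge step over hypothetical word counts:

```python
from collections import Counter

# Hypothetical word frequencies, with each word represented as a tuple of symbols
word_freqs = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("h", "u", "g", "s"): 5, ("b", "u", "n"): 4}

def most_frequent_pair(word_freqs):
    # Count every adjacent symbol pair, weighted by word frequency
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0]

def merge_pair(pair, word_freqs):
    # Rewrite every word so the chosen pair becomes a single symbol
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair, count = most_frequent_pair(word_freqs)
print("Most frequent pair:", pair, "seen", count, "times")  # ('u', 'g') seen 20 times
print(list(merge_pair(pair, word_freqs)))                   # 'ug' is now a single symbol
```

Repeating this merge step thousands of times grows the vocabulary bottom-up, exactly as described above.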
Summary
The vocabulary problem arises from a fundamental mismatch between the infinite productivity of language and the finite capacity of NLP models. We've explored several facets of this challenge:
- Out-of-vocabulary words plague any fixed vocabulary system. New words, rare words, typos, and domain-specific terms all become [UNK], losing their meaning entirely.
- Vocabulary size creates a tradeoff: Larger vocabularies reduce OOV rates but consume more memory, slow computation, and suffer from poorly-learned embeddings for rare words.
- Morphologically rich languages make the problem exponentially worse. Agglutinative languages like Finnish and Turkish can form millions of valid word forms from a fixed set of morphemes.
- Domain-specific text including code, scientific writing, and medical text introduces specialized vocabulary that general-purpose models cannot handle.
- Subword tokenization offers an elegant solution by breaking words into reusable pieces. A vocabulary of 30,000 subwords can represent any text without OOV, sharing parameters across morphologically related words.
The vocabulary problem taught NLP an important lesson: the word is not the right unit of meaning. In the following chapters, we'll explore the algorithms that learn optimal subword vocabularies and how to tokenize text using them.
Quiz
Ready to test your understanding of the vocabulary problem? Take this quick quiz to reinforce what you've learned about why word-level tokenization breaks down and why subword approaches are needed.







