The Vocabulary Problem: Why Word-Level Tokenization Breaks Down

Michael Brenndoerfer · December 11, 2025 · 17 min read · 3,986 words

Discover why traditional word-level approaches fail with diverse text, from OOV words to morphological complexity. Learn the fundamental challenges that make subword tokenization essential for modern NLP.

The Vocabulary Problem

You've built bag-of-words representations. You've trained word embeddings that capture semantic relationships. But lurking beneath these techniques is a fundamental challenge: what happens when your model encounters a word it has never seen before?

This is the vocabulary problem, and it's more pervasive than you might think. Every time someone coins a new term, makes a typo, uses a technical abbreviation, or writes in a language with rich morphology, traditional word-based models falter. The word "ChatGPT" didn't exist before 2022, yet models trained before then need to process it. The misspelling "reccomend" isn't in any dictionary, yet users type it constantly. The German compound "Bundesausbildungsförderungsgesetz" is a perfectly valid word, yet it will almost certainly be absent from any vocabulary built from a standard corpus.

This chapter explores why word-level tokenization breaks down in practice. We'll examine the explosion of vocabulary sizes, the curse of rare words, and the fundamental tension between coverage and efficiency. By understanding these limitations, you'll see why subword tokenization, which we cover in the following chapters, became essential for modern NLP.

The Out-of-Vocabulary Problem

When Models Meet Unknown Words

Consider a sentiment analysis model trained on movie reviews. It learned representations for words like "excellent," "boring," and "cinematography." Now imagine deploying this model and encountering the review: "This movie was amazeballs! The CGI was unreal."

The word "amazeballs" is almost certainly not in the training vocabulary. What should the model do?

Traditional approaches have three options, none of them good:

  1. Replace with a special [UNK] token: The model treats all unknown words identically, losing crucial information. "Amazeballs" and "terrible" both become [UNK], erasing the distinction between positive and negative sentiment.

  2. Skip the word entirely: Now the sentence becomes "This movie was ! The CGI was unreal." We've preserved some structure but lost potentially important content.

  3. Attempt approximate matching: Maybe "amazeballs" is similar to "amazing"? But this requires additional infrastructure and often fails for truly novel words.

None of these solutions is satisfactory. The [UNK] approach is most common, but it creates a black hole in your model's understanding.

Out-of-Vocabulary (OOV)

A word is out-of-vocabulary when it doesn't appear in the fixed vocabulary that was constructed during training. OOV words must be handled specially, typically by mapping them to a generic [UNK] token, which discards their unique meaning.
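
To make the [UNK] strategy concrete, here is a minimal sketch. The vocabulary and the example review are made up for illustration; real systems build their vocabulary from a training corpus, as we do in the next section.

# A minimal sketch of the [UNK] strategy (toy vocabulary, for illustration only)
known_words = {"this", "movie", "was", "excellent", "boring", "the", "cgi", "unreal"}
UNK = "[UNK]"

def map_to_vocab(tokens, vocab):
    """Replace any token that is not in the vocabulary with the [UNK] placeholder."""
    return [tok if tok in vocab else UNK for tok in tokens]

review = ["this", "movie", "was", "amazeballs", "the", "cgi", "was", "unreal"]
print(map_to_vocab(review, known_words))
# ['this', 'movie', 'was', '[UNK]', 'the', 'cgi', 'was', 'unreal']
# "amazeballs" and "terrible" would map to the same [UNK], erasing the difference.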

Measuring the OOV Rate

Let's quantify how serious this problem is. We'll train a vocabulary on one text corpus and measure how many words from another corpus are out-of-vocabulary.

In[2]:
import re
from collections import Counter

def tokenize(text):
    """Simple whitespace and punctuation tokenization."""
    return re.findall(r'\b[a-z]+\b', text.lower())

# Training corpus: classic literature
training_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "To be or not to be, that is the question.",
    "It was the best of times, it was the worst of times.",
    "All happy families are alike; each unhappy family is unhappy in its own way.",
    "Call me Ishmael. Some years ago, never mind how long precisely.",
    "In the beginning God created the heaven and the earth.",
    "It is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife.",
    "The man in black fled across the desert, and the gunslinger followed.",
]

# Build vocabulary from training corpus
training_tokens = []
for text in training_texts:
    training_tokens.extend(tokenize(text))

vocabulary = set(training_tokens)

# Test corpus: modern tech reviews
test_texts = [
    "The smartphone's OLED display is absolutely gorgeous with HDR support.",
    "I downloaded the app and it synced with my smartwatch seamlessly.",
    "The AI-powered chatbot provides surprisingly helpful customer service.",
    "This laptop's GPU handles 4K gaming without breaking a sweat.",
    "The Bluetooth connectivity works flawlessly with all my devices.",
]

# Count OOV words
test_tokens = []
for text in test_texts:
    test_tokens.extend(tokenize(text))

oov_words = [word for word in test_tokens if word not in vocabulary]
oov_unique = set(oov_words)
Out[3]:
Vocabulary Analysis
============================================================
Training vocabulary size: 68 unique words
Test corpus tokens: 50 words
OOV tokens: 41 (82.0%)
Unique OOV words: 37

Sample OOV words: ['absolutely', 'ai', 'app', 'bluetooth', 'breaking', 'chatbot', 'connectivity', 'customer', 'devices', 'display', 'downloaded', 'flawlessly', 'gaming', 'gorgeous', 'gpu']

Even with this toy example, we see a substantial OOV rate. Words like "smartphone," "oled," "hdr," "bluetooth," and "gpu" are completely absent from our classic literature vocabulary. In real applications with larger vocabularies, the problem persists because language continuously evolves.

The Long Tail of Language

The OOV problem stems from a fundamental property of language: word frequencies follow Zipf's law. A small number of words appear very frequently, while an enormous number of words appear rarely.
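
You can compute this rank-frequency pattern directly. The sketch below reuses the training_tokens list from the cell above; on a toy corpus of a few sentences the pattern is rough, but on a large corpus the counts fall off roughly as 1/rank, which is what the figure below shows.

from collections import Counter

def rank_frequency(tokens, top_k=10):
    """Return (rank, word, count) tuples sorted by descending frequency."""
    counts = Counter(tokens)
    ranked = counts.most_common(top_k)
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]

# A handful of function words dominate even this tiny sample
for rank, word, count in rank_frequency(training_tokens, top_k=5):
    print(f"{rank:>2}  {word:<10} {count}")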

Out[4]:
Visualization
Log-log plot showing word frequency decreasing as rank increases, with annotation highlighting the long tail of rare words.
Word frequency follows Zipf's law: a few words dominate while most words are rare. The top 100 words account for roughly half of all text, but the long tail of rare words contains most of the vocabulary. This creates a fundamental tension: any fixed vocabulary will either be too small (missing rare words) or too large (inefficient).

The long tail means that no matter how large your training corpus, you'll always encounter new words. Even after seeing billions of words, there will be valid English words, proper nouns, technical terms, and neologisms that never appeared in your training data.
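
One way to see the long tail in action is to track how many distinct words you have seen as you read through a corpus: the count keeps climbing rather than leveling off (an empirical pattern known as Heaps' law). A minimal sketch, reusing the token lists from the earlier cell:

def vocabulary_growth(tokens, step=20):
    """Track the number of distinct words seen after every `step` tokens."""
    seen = set()
    growth = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0:
            growth.append((i, len(seen)))
    return growth

# With a real corpus, the distinct-word count keeps growing no matter
# how many tokens you read; it never saturates.
for tokens_seen, vocab_size in vocabulary_growth(training_tokens + test_tokens, step=20):
    print(f"after {tokens_seen:>4} tokens: {vocab_size} distinct words")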

Vocabulary Size Explosion

The Coverage-Size Tradeoff

How large should your vocabulary be? This seems like a simple question, but it reveals a fundamental tension in NLP system design.

A small vocabulary is computationally efficient. The embedding matrix, softmax layer, and any word-based operations scale with vocabulary size. A vocabulary of 10,000 words means 10,000 embeddings to store and 10,000 classes for any word prediction task.

But a small vocabulary means high OOV rates. Users will constantly encounter the [UNK] token, degrading model performance.

A large vocabulary reduces OOV rates but introduces its own problems:

  • Memory explosion: Each word needs an embedding vector. With 300-dimensional float32 embeddings, a vocabulary of 1 million words requires about 1.2 GB just for the embedding matrix.
  • Sparse gradients: Rare words appear infrequently during training, so their embeddings receive few gradient updates and remain poorly learned.
  • Computational cost: Softmax over millions of classes becomes prohibitively expensive.

Let's visualize this tradeoff by examining how vocabulary size affects corpus coverage.
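
A simple way to approximate this curve, without a real corpus, is to assume idealized Zipfian frequencies proportional to 1/rank and compute what fraction of all token occurrences the top-k words account for. The sketch below does exactly that; the exact percentages depend on the assumed number of word types and the true frequency distribution, so they won't match the figure precisely, but the diminishing-returns shape is the same.

def zipf_coverage(vocab_size, num_word_types=1_000_000):
    """Fraction of token occurrences covered by the `vocab_size` most frequent words,
    assuming idealized Zipfian frequencies proportional to 1/rank."""
    weights = [1 / rank for rank in range(1, num_word_types + 1)]
    return sum(weights[:vocab_size]) / sum(weights)

for size in [1_000, 10_000, 100_000, 500_000]:
    print(f"vocab size {size:>7,}: {zipf_coverage(size):6.1%} of token occurrences covered")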

Out[5]:
Visualization
Line plot showing token coverage percentage increasing rapidly then plateauing as vocabulary size grows from 1000 to 100000.
Vocabulary coverage follows a diminishing returns curve. The first 10,000 words cover roughly 90% of token occurrences, but achieving 99% coverage requires orders of magnitude more vocabulary entries. Perfect coverage is practically impossible due to the infinite productivity of language.

The curve reveals a sobering truth: achieving high coverage requires exponentially larger vocabularies. Going from 90% to 99% coverage might require 10x more vocabulary entries. And 100% coverage is essentially impossible because language is infinitely productive.

Real-World Vocabulary Statistics

Let's examine actual vocabulary sizes from popular NLP resources to understand the scale of this problem.

In[6]:
# Vocabulary sizes from real NLP resources (approximate)
vocabulary_stats = {
    "Basic English": 850,
    "Common English words": 3_000,
    "Average adult vocabulary": 30_000,
    "Shakespeare's works": 31_534,
    "Oxford English Dictionary": 171_476,
    "Google Web 1T 5-gram": 13_588_391,
    "Word2Vec Google News": 3_000_000,
    "FastText English": 2_000_000,
    "GloVe 840B": 2_200_000,
}

# Memory requirements (assuming 300-dim float32 embeddings)
embedding_dim = 300
bytes_per_float = 4

memory_requirements = {}
for name, size in vocabulary_stats.items():
    memory_bytes = size * embedding_dim * bytes_per_float
    memory_mb = memory_bytes / (1024 * 1024)
    memory_requirements[name] = memory_mb
Out[7]:
Vocabulary Sizes and Memory Requirements
======================================================================
Resource                            Vocab Size   Memory (300d)
----------------------------------------------------------------------
Basic English                              850        996.1 KB
Common English words                     3,000          3.4 MB
Average adult vocabulary                30,000         34.3 MB
Shakespeare's works                     31,534         36.1 MB
Oxford English Dictionary              171,476        196.2 MB
Google Web 1T 5-gram                13,588,391        15.19 GB
Word2Vec Google News                 3,000,000         3.35 GB
FastText English                     2,000,000         2.24 GB
GloVe 840B                           2,200,000         2.46 GB

The numbers are striking. While Basic English needs less than a megabyte, real-world corpora yield vocabularies in the millions, requiring gigabytes of memory just for embeddings. And these embeddings still don't cover all possible words.

The Curse of Rare Words

Poorly Learned Representations

Even when rare words make it into the vocabulary, they suffer from a different problem: insufficient training data. Consider how word embeddings are learned. Each word's representation is updated based on its context. A word that appears 100,000 times receives 100,000 gradient updates, each refining its embedding. A word that appears 10 times receives only 10 updates.

The result is that rare word embeddings are poorly learned. They might be almost random vectors, barely moved from their initialization.
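
The effect is easy to simulate. The sketch below treats each occurrence of a word as one noisy update that pulls its vector toward a fixed "true" embedding, then measures how close the learned vector gets. The dimensionality, learning rate, and noise level are made up for illustration and are not taken from any particular embedding method.

import math
import random

def simulate_embedding(num_updates, dim=50, lr=0.1, noise=1.0, seed=0):
    """Pull a randomly initialized vector toward a fixed 'true' vector with
    noisy updates and return the cosine similarity after `num_updates` steps."""
    rng = random.Random(seed)
    true_vec = [rng.gauss(0, 1) for _ in range(dim)]
    learned = [rng.gauss(0, 1) for _ in range(dim)]  # random initialization
    for _ in range(num_updates):
        learned = [w + lr * ((t - w) + rng.gauss(0, noise))
                   for w, t in zip(learned, true_vec)]
    dot = sum(a * b for a, b in zip(learned, true_vec))
    norms = math.sqrt(sum(a * a for a in learned)) * math.sqrt(sum(b * b for b in true_vec))
    return dot / norms

# More "occurrences" (updates) bring the learned vector closer to the true one,
# with quality plateauing once the word is frequent enough.
for frequency in [1, 10, 100, 1_000]:
    print(f"{frequency:>5} updates -> cosine similarity {simulate_embedding(frequency):.2f}")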

Out[8]:
Visualization
Scatter plot showing embedding quality score increasing logarithmically with word frequency, with high variance for rare words.
Embedding quality correlates with word frequency. Frequent words receive many gradient updates during training, producing refined embeddings that capture semantic relationships. Rare words receive few updates, leaving their embeddings close to random initialization and semantically meaningless.

This creates a vicious cycle. Rare words have poor embeddings, which means they contribute little to downstream task performance, which means there's no pressure to improve their representations.

The Minimum Frequency Cutoff

To avoid poorly learned embeddings, most word embedding methods impose a minimum frequency cutoff. Words appearing fewer than, say, 5 times are excluded from the vocabulary entirely.

In[9]:
# Simulated word frequency distribution
word_frequencies = {
    "the": 1_000_000,
    "and": 500_000,
    "learning": 50_000,
    "neural": 25_000,
    "transformer": 5_000,
    "bert": 2_000,
    "roberta": 500,
    "deberta": 100,
    "electra": 50,
    "xlnet": 20,
    "reformer": 5,
    "longformer": 3,
    "linformer": 1,
}

# Apply different cutoffs
cutoffs = [1, 5, 20, 100]
vocab_sizes = {}

for cutoff in cutoffs:
    vocab_sizes[cutoff] = sum(1 for freq in word_frequencies.values() if freq >= cutoff)
Out[10]:
Effect of Minimum Frequency Cutoff
==================================================
Cutoff               Vocab Size  Words Excluded
--------------------------------------------------
min_count=1                  13               0
min_count=5                  11               2
min_count=20                 10               3
min_count=100                 8               5

Higher cutoffs reduce vocabulary size but guarantee that every word has enough training examples. The tradeoff is that more words become OOV.

Morphological Productivity

Languages with Rich Morphology

English has relatively simple morphology. Most words have just a few forms: "walk," "walks," "walked," "walking." Other languages aren't so simple.

Consider Finnish, Turkish, or Hungarian. In these agglutinative languages, words are built by stringing together morphemes. A single Finnish word might encode subject, object, tense, aspect, mood, and more. The word "talossanikinko" asks, roughly, "also in my house?", with location, possession, "also," and a question particle all packed into a single token.

German famously allows compound nouns: "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" is a real word meaning a law about beef labeling supervision delegation. While extreme, such compounds are productive and regularly coined.

For these languages, word-level vocabularies explode. Every combination of morphemes creates a new vocabulary entry, even though the underlying meaning is compositional.

In[11]:
# Examples of morphological productivity
finnish_examples = [
    ("talo", "house"),
    ("talossa", "in a house"),
    ("talossani", "in my house"),
    ("talossanikin", "also in my house"),
    ("talossanikinko", "also in my house?"),
]

german_compounds = [
    ("Hand", "hand"),
    ("Handschuh", "glove (hand-shoe)"),
    ("Handschuhmacher", "glove maker"),
    ("Handschuhmacherei", "glove-making workshop"),
]

turkish_examples = [
    ("ev", "house"),
    ("evler", "houses"),
    ("evlerim", "my houses"),
    ("evlerimde", "in my houses"),
    ("evlerimdeki", "that which is in my houses"),
]
Out[12]:
Morphological Productivity Examples
============================================================

Finnish - Agglutination:
  talo                 → house
  talossa              → in a house
  talossani            → in my house
  talossanikin         → also in my house
  talossanikinko       → also in my house?

German - Compounds:
  Hand                      → hand
  Handschuh                 → glove (hand-shoe)
  Handschuhmacher           → glove maker
  Handschuhmacherei         → glove-making workshop

Turkish - Agglutination:
  ev                   → house
  evler                → houses
  evlerim              → my houses
  evlerimde            → in my houses
  evlerimdeki          → that which is in my houses

The Vocabulary Explosion in Morphologically Rich Languages

The combinatorial nature of morphology causes vocabulary explosion. Where English might have 50,000 common word forms, Turkish or Finnish might have millions of valid word forms, most of which any individual speaker has never seen but would instantly understand.
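
A back-of-the-envelope calculation shows why. If each stem can combine independently with a handful of morphological slots, the number of surface forms multiplies. The slot counts below are rough illustrations, not precise grammar statistics for Turkish:

# Rough, illustrative slot counts for a Turkish-like agglutinative noun
# (made up for illustration): each slot multiplies the number of surface forms.
num_stems = 20_000          # noun stems in a modest lexicon
number_forms = 2            # singular, plural
possessive_forms = 7        # none + six person/number possessives
case_forms = 6              # nominative, accusative, dative, locative, ablative, genitive
relativizer_forms = 2       # with or without -ki ("that which is ...")

forms_per_stem = number_forms * possessive_forms * case_forms * relativizer_forms
total_forms = num_stems * forms_per_stem
print(f"{forms_per_stem} forms per stem -> {total_forms:,} word forms from {num_stems:,} stems")
# 168 forms per stem -> 3,360,000 word forms from 20,000 stems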

Out[13]:
Visualization
Bar chart comparing vocabulary sizes needed for 95% coverage across English, German, Turkish, and Finnish.
Morphologically rich languages require dramatically larger vocabularies for equivalent coverage. The same concepts that English expresses with 50,000 word forms might require 500,000 forms in Turkish or Finnish due to their agglutinative nature, where morphemes combine productively to create new words.

Technical and Domain-Specific Text

The Challenge of Specialized Vocabulary

NLP systems increasingly process technical text: code, scientific papers, medical records, legal documents. Each domain brings its own vocabulary challenges.

Code and Programming:

  • Variable names: getUserById, XMLHttpRequest, __init__
  • Mixed formats: camelCase, snake_case, SCREAMING_SNAKE_CASE
  • Special characters: !=, ->, ::, @property

Scientific Text:

  • Chemical formulas: CH₃COOH, C₆H₁₂O₆
  • Gene names: BRCA1, TP53, CFTR
  • Technical terms: "phosphorylation," "eigendecomposition"

Medical Text:

  • Drug names: "hydroxychloroquine," "acetaminophen"
  • Conditions: "atherosclerosis," "thrombocytopenia"
  • Abbreviations: "bid" (twice daily), "prn" (as needed)
In[14]:
# Examples of domain-specific vocabulary challenges
code_tokens = [
    "getUserByIdFromDatabase",
    "XMLHttpRequest", 
    "addEventListener",
    "parseInt",
    "__init__",
    "self.model.fit()",
    "np.array([[1,2],[3,4]])",
]

scientific_tokens = [
    "methyltransferase",
    "phosphofructokinase",
    "deoxyribonucleic",
    "electroencephalography",
    "spectrophotometer",
    "chromatography",
]

# Check which identifiers would be OOV against a tiny stand-in "general" vocabulary
common_words = {"get", "user", "by", "id", "from", "database", "array", "model", "fit"}
oov_identifiers = [tok for tok in code_tokens if tok.lower() not in common_words]
# Every full identifier is OOV, even though many of its parts are common words
Out[15]:
Domain-Specific Vocabulary Challenges
============================================================

Code tokens (often camelCase or snake_case):
  getUserByIdFromDatabase
  XMLHttpRequest
  addEventListener
  parseInt
  __init__
  self.model.fit()
  np.array([[1,2],[3,4]])

Scientific terms (morphologically complex):
  methyltransferase              (~5 syllables)
  phosphofructokinase            (~6 syllables)
  deoxyribonucleic               (~5 syllables)
  electroencephalography         (~7 syllables)
  spectrophotometer              (~5 syllables)
  chromatography                 (~4 syllables)

General-purpose vocabularies fail catastrophically on domain-specific text. A model trained on news articles has no representation for "phosphofructokinase" or "addEventListener."

Code Tokenization: A Special Challenge

Code presents unique tokenization challenges. Unlike natural language, code uses explicit conventions like camelCase and snake_case to pack multiple concepts into single tokens.

In[16]:
import re

def split_camel_case(identifier):
    """Split camelCase and PascalCase identifiers."""
    # Insert space before uppercase letters that follow lowercase
    result = re.sub(r'([a-z])([A-Z])', r'\1 \2', identifier)
    # Handle consecutive uppercase (acronyms)
    result = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1 \2', result)
    return result.lower().split()

def split_snake_case(identifier):
    """Split snake_case identifiers."""
    return identifier.lower().split('_')

# Examples
camel_examples = [
    "getUserById",
    "XMLHttpRequest",
    "processHTMLDocument",
    "calculateTotalPrice",
]

snake_examples = [
    "get_user_by_id",
    "xml_http_request",
    "process_html_document",
    "calculate_total_price",
]
Out[17]:
Splitting Programming Identifiers
============================================================

CamelCase splitting:
  getUserById               → ['get', 'user', 'by', 'id']
  XMLHttpRequest            → ['xml', 'http', 'request']
  processHTMLDocument       → ['process', 'html', 'document']
  calculateTotalPrice       → ['calculate', 'total', 'price']

snake_case splitting:
  get_user_by_id            → ['get', 'user', 'by', 'id']
  xml_http_request          → ['xml', 'http', 'request']
  process_html_document     → ['process', 'html', 'document']
  calculate_total_price     → ['calculate', 'total', 'price']

This splitting reveals the compositional structure hidden in code identifiers. Each part is a meaningful word that likely exists in a general vocabulary, even if the combined identifier doesn't.

The Case for Subword Units

Breaking Words into Meaningful Pieces

The vocabulary problem has an elegant solution: stop treating words as atomic units. Instead, break words into smaller pieces that can be combined to form any word.

Consider the word "unhappiness":

  • As a whole word, it might be rare and poorly represented
  • Split into "un" + "happi" + "ness", each piece is common

The prefix "un-" appears in hundreds of words (undo, unfair, unable). The suffix "-ness" appears in thousands (happiness, sadness, kindness). The root "happy" is common. By representing "unhappiness" as a sequence of these pieces, we:

  1. Eliminate OOV entirely: Any word can be broken into subword units
  2. Share parameters: "un-" learned from "undo" helps with "unfair"
  3. Reduce vocabulary size: Thousands of subwords can generate millions of words
  4. Handle morphology: Compositional words decompose naturally
In[18]:
# Demonstration of subword decomposition
subword_decompositions = {
    "unhappiness": ["un", "happi", "ness"],
    "unbelievable": ["un", "believ", "able"],
    "transformers": ["transform", "ers"],
    "preprocessing": ["pre", "process", "ing"],
    "internationalization": ["inter", "national", "ization"],
    "ChatGPT": ["Chat", "G", "PT"],  # Handles new words
    "COVID19": ["CO", "VID", "19"],   # Handles alphanumeric
}

# Show how pieces are reused
pieces = []
for word, decomposition in subword_decompositions.items():
    pieces.extend(decomposition)

from collections import Counter
piece_counts = Counter(pieces)
Out[19]:
Subword Decomposition Examples
============================================================
Word                      Subword Pieces                     
------------------------------------------------------------
unhappiness               un + happi + ness                  
unbelievable              un + believ + able                 
transformers              transform + ers                    
preprocessing             pre + process + ing                
internationalization      inter + national + ization         
ChatGPT                   Chat + G + PT                      
COVID19                   CO + VID + 19                      

Most reused subword pieces:
  'un' appears in 2 different words

From Characters to Subwords

At one extreme, we could tokenize at the character level. Every word becomes a sequence of characters, and there's no OOV problem: any text is just a sequence of characters from a fixed alphabet.

But character-level tokenization has severe drawbacks:

  • Sequences become very long (a 10-word sentence expands from 10 tokens to 50 or more character tokens)
  • The model must learn to compose characters into meaningful units
  • Long-range dependencies become harder to capture

Subword tokenization finds the sweet spot. Subwords are longer than characters (capturing more meaning per token) but shorter than words (enabling composition). A typical subword vocabulary might have 30,000-50,000 entries, able to represent any text without OOV.
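
To make the "no OOV" claim concrete, here is a toy greedy longest-match tokenizer with a single-character fallback. The vocabulary is hand-picked for illustration; real tokenizers learn theirs from data using the algorithms previewed in the next section.

def subword_tokenize(word, vocab):
    """Greedy longest-match tokenization with single-character fallback,
    so no input can ever be out-of-vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):       # try the longest piece first
            piece = word[start:end]
            if piece in vocab or end == start + 1:    # fall back to a single character
                pieces.append(piece)
                start = end
                break
    return pieces

# Hand-picked toy vocabulary, for illustration only
toy_vocab = {"un", "happi", "ness", "transform", "ers", "pre", "process", "ing", "chat"}

for word in ["unhappiness", "preprocessing", "chatbot"]:
    print(word, "->", subword_tokenize(word, toy_vocab))
# Known pieces cover most words; unseen fragments like "bot" fall back to characters.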

Out[20]:
Visualization
Horizontal spectrum diagram showing character, subword, and word tokenization with their tradeoffs.
The tokenization spectrum ranges from characters to words, with subwords occupying the optimal middle ground. Character-level tokenization eliminates OOV but creates long sequences. Word-level tokenization is compact but suffers from OOV. Subword tokenization balances both, using a fixed vocabulary that can represent any text through composition.

Looking Ahead: Subword Tokenization Algorithms

The next chapters explore the algorithms that make subword tokenization work. Each takes a different approach to deciding how to split words:

Byte Pair Encoding (BPE): Starts with characters and iteratively merges the most frequent pairs. The vocabulary grows bottom-up, with common sequences becoming single tokens.

WordPiece: Similar to BPE but uses a likelihood-based criterion for merging. Used by BERT and many Google models.

Unigram Language Model: Takes a top-down approach. Starts with a large vocabulary and iteratively removes pieces that contribute least to the language model likelihood.

SentencePiece: A framework that can implement BPE or Unigram, treating the input as a raw character stream (whitespace included) rather than requiring pre-tokenization. Enables truly language-agnostic tokenization.

Each algorithm produces a vocabulary of subword units and a procedure for tokenizing new text. The key insight uniting them all: words are not atoms. They can and should be decomposed into smaller, reusable pieces.
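
As a small preview of the BPE idea, the sketch below performs a few merge steps by hand: it counts adjacent symbol pairs across a toy word-frequency table (the words and counts are made up) and merges the most frequent pair into a single symbol. It is a simplification of the full algorithm covered in the next chapter.

from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top pair."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words as character tuples with made-up frequencies
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}

for step in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair} -> {''.join(pair)}")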

Summary

The vocabulary problem arises from a fundamental mismatch between the infinite productivity of language and the finite capacity of NLP models. We've explored several facets of this challenge:

  • Out-of-vocabulary words plague any fixed vocabulary system. New words, rare words, typos, and domain-specific terms all become [UNK], losing their meaning entirely.

  • Vocabulary size creates a tradeoff: Larger vocabularies reduce OOV rates but consume more memory, slow computation, and suffer from poorly-learned embeddings for rare words.

  • Morphologically rich languages make the problem exponentially worse. Agglutinative languages like Finnish and Turkish can form millions of valid word forms from a fixed set of morphemes.

  • Domain-specific text including code, scientific writing, and medical text introduces specialized vocabulary that general-purpose models cannot handle.

  • Subword tokenization offers an elegant solution by breaking words into reusable pieces. A vocabulary of 30,000 subwords can represent any text without OOV, sharing parameters across morphologically related words.

The vocabulary problem taught NLP an important lesson: the word is not the right unit of meaning. In the following chapters, we'll explore the algorithms that learn optimal subword vocabularies and how to tokenize text using them.
