
Word Tokenization: Breaking Text into Meaningful Units for NLP

Michael Brenndoerfer · December 7, 2025 · 25 min read · 5,952 words · Interactive

Learn how to split text into words and tokens using whitespace, punctuation handling, and linguistic rules. Covers NLTK, spaCy, Penn Treebank conventions, and language-specific challenges.


This article is part of the free-to-read Language AI Handbook


Word Tokenization

Before a computer can understand language, it must break text into meaningful units. Word tokenization is the process of splitting text into individual words or tokens. It sounds trivial: just split on spaces, right? But human language is messy. Contractions like "don't" and "we'll" blur word boundaries. Punctuation clings to words in complex ways. Some languages don't use spaces at all. This chapter explores why tokenization is harder than it looks and how to do it well.

Tokenization is the first step in nearly every NLP pipeline. Get it wrong, and errors cascade through everything downstream. Get it right, and you've built a solid foundation for text analysis, search, translation, and language modeling.

Why Tokenization Matters

Every text processing task begins with tokenization. Consider what happens when you search for a word in a document, count word frequencies, or train a language model. All of these require knowing where one word ends and another begins.

Token

A token is a sequence of characters that forms a meaningful unit for processing. In word tokenization, tokens typically correspond to words, punctuation marks, or other linguistically significant elements.

The definition of "word" varies by application. For some tasks, "New York" should be a single token. For others, "don't" should split into "do" and "n't". There's no universal right answer. The best tokenization depends on what you're trying to accomplish.
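Whether a multiword expression like "New York" stays together is itself a tokenization decision. As a small illustration, NLTK's MWETokenizer can re-merge chosen word sequences after an initial split; the expression list below is just an example, not a standard:

# Merge selected multiword expressions after a whitespace split (illustrative only)
from nltk.tokenize import MWETokenizer

mwe = MWETokenizer([("New", "York"), ("ice", "cream")], separator=" ")
tokens = "I moved to New York for the ice cream".split()
print(mwe.tokenize(tokens))
# ['I', 'moved', 'to', 'New York', 'for', 'the', 'ice cream']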

In[2]:
# A simple example of why tokenization matters
text = "Dr. Smith's patients don't like waiting. It's 3:30pm!"

# Naive whitespace split
naive_tokens = text.split()

# What we might actually want
expected_tokens = ["Dr.", "Smith", "'s", "patients", "do", "n't", "like", 
                   "waiting", ".", "It", "'s", "3:30pm", "!"]
Out[3]:
Text: Dr. Smith's patients don't like waiting. It's 3:30pm!

Naive whitespace split:
  ['Dr.', "Smith's", 'patients', "don't", 'like', 'waiting.', "It's", '3:30pm!']
  Token count: 8

Linguistically-aware tokenization:
  ['Dr.', 'Smith', "'s", 'patients', 'do', "n't", 'like', 'waiting', '.', 'It', "'s", '3:30pm', '!']
  Token count: 13

The naive approach produces 8 tokens, lumping punctuation with words. A more sophisticated tokenizer produces 13 tokens, separating punctuation and splitting contractions. Which is correct? It depends on your task. For bag-of-words models, you might want "don't" as one token. For syntactic parsing, splitting into "do" and "n't" reveals the underlying structure.

Whitespace Tokenization

The simplest tokenization strategy splits text on whitespace characters: spaces, tabs, and newlines. This works surprisingly well for many English texts.

In[4]:
def whitespace_tokenize(text):
    """Split text on whitespace."""
    return text.split()

# Test on clean text
clean_text = "The quick brown fox jumps over the lazy dog"
clean_tokens = whitespace_tokenize(clean_text)

# Test on messy text
messy_text = "Hello,   world!  How   are  you?"
messy_tokens = whitespace_tokenize(messy_text)
Out[5]:
Clean text tokenization:
  Input: 'The quick brown fox jumps over the lazy dog'
  Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
  Count: 9

Messy text tokenization:
  Input: 'Hello,   world!  How   are  you?'
  Tokens: ['Hello,', 'world!', 'How', 'are', 'you?']
  Count: 5

Python's split() method handles multiple consecutive whitespace characters gracefully, collapsing them into a single delimiter. But notice the problem: punctuation stays attached to words. "Hello," and "world!" are not the tokens we want for most applications.
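The difference between split() with no arguments and split(' ') with an explicit space is worth seeing directly:

messy = "Hello,   world!  How   are  you?"

# No argument: split on any run of whitespace and drop empty strings
print(messy.split())     # ['Hello,', 'world!', 'How', 'are', 'you?']

# Explicit single-space delimiter: runs of spaces produce empty strings
print(messy.split(' '))  # ['Hello,', '', '', 'world!', '', 'How', '', '', 'are', '', 'you?']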

Limitations of Whitespace Tokenization

Whitespace tokenization fails in predictable ways:

Punctuation attachment: Commas, periods, and other punctuation stick to adjacent words. "Hello," becomes a different token from "Hello".

Contractions: "don't" stays as one token, hiding its two-word structure.

Hyphenated compounds: "state-of-the-art" becomes one token, though it might be better as four.

Numbers and dates: "3.14" and "2024-01-15" contain delimiters that shouldn't split.

URLs and emails: "user@example.com" and "https://example.com" contain no spaces but have internal structure.

In[6]:
# Examples of whitespace tokenization failures
problem_texts = [
    "I can't believe it's not butter!",
    "The state-of-the-art model achieved 99.5% accuracy.",
    "Contact john.doe@company.com for details.",
    "The meeting is at 3:30pm on 2024-01-15.",
    "Visit https://example.com/path?query=value for more.",
]

problem_tokens = [whitespace_tokenize(t) for t in problem_texts]
Out[7]:
Whitespace Tokenization Problems:
------------------------------------------------------------
Text: I can't believe it's not butter!
Tokens: ['I', "can't", 'believe', "it's", 'not', 'butter!']

Text: The state-of-the-art model achieved 99.5% accuracy.
Tokens: ['The', 'state-of-the-art', 'model', 'achieved', '99.5%', 'accuracy.']

Text: Contact john.doe@company.com for details.
Tokens: ['Contact', 'john.doe@company.com', 'for', 'details.']

Text: The meeting is at 3:30pm on 2024-01-15.
Tokens: ['The', 'meeting', 'is', 'at', '3:30pm', 'on', '2024-01-15.']

Text: Visit https://example.com/path?query=value for more.
Tokens: ['Visit', 'https://example.com/path?query=value', 'for', 'more.']

Each example shows a different failure mode. The contraction "can't" stays fused. The hyphenated compound becomes one long token. The email address is treated as a single word. These aren't bugs in the tokenizer. They're fundamental limitations of using whitespace as the only delimiter.

Punctuation Handling

Most tokenizers treat punctuation as separate tokens. This makes sense linguistically: a period ends a sentence, a comma separates clauses, and quotation marks delimit speech. These are meaningful units that deserve their own tokens.

In[8]:
import re

def punctuation_tokenize(text):
    """Split on whitespace and separate punctuation."""
    # Add spaces around punctuation
    text = re.sub(r'([.,!?;:"\'\(\)\[\]])', r' \1 ', text)
    # Split and filter empty strings
    return [t for t in text.split() if t]

# Test the improved tokenizer
test_text = "Hello, world! How are you?"
punct_tokens = punctuation_tokenize(test_text)
Out[9]:
Text: 'Hello, world! How are you?'
Whitespace tokens: ['Hello,', 'world!', 'How', 'are', 'you?']
Punctuation tokens: ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']

Now punctuation is separated. But we've introduced new problems. What about abbreviations like "Dr." or "U.S.A."? What about decimal numbers like "3.14"? What about contractions?
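To see these new failure modes concretely, here is the same punctuation_tokenize applied to an abbreviation, a decimal number, and a contraction (a quick illustrative check using the function defined above):

# The blanket punctuation rule splits every period and apostrophe,
# so abbreviations, decimals, and contractions all break apart.
for example in ["Dr. Smith arrived.", "Pi is about 3.14.", "I don't know."]:
    print(example, "->", punctuation_tokenize(example))
# e.g. 'Pi is about 3.14.' -> ['Pi', 'is', 'about', '3', '.', '14', '.']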

Handling Abbreviations

Abbreviations pose a challenge: the period is part of the word, not a sentence-ending punctuation mark. A robust tokenizer needs to know which periods to separate and which to keep.

In[10]:
# Common abbreviations that should keep their periods
abbreviations = {"Dr.", "Mr.", "Mrs.", "Ms.", "Jr.", "Sr.", "vs.", "etc.", 
                 "i.e.", "e.g.", "U.S.", "U.K.", "a.m.", "p.m."}

def abbreviation_aware_tokenize(text):
    """Tokenize while preserving common abbreviations."""
    # First, protect abbreviations by replacing periods temporarily
    protected = text
    for abbr in abbreviations:
        protected = protected.replace(abbr, abbr.replace(".", "<DOT>"))
    
    # Now do punctuation tokenization
    protected = re.sub(r'([.,!?;:"\'\(\)\[\]])', r' \1 ', protected)
    tokens = [t for t in protected.split() if t]
    
    # Restore periods in abbreviations
    tokens = [t.replace("<DOT>", ".") for t in tokens]
    return tokens

# Test with abbreviations
abbr_text = "Dr. Smith met Mr. Jones at 3 p.m. in the U.S."
abbr_tokens = abbreviation_aware_tokenize(abbr_text)
Out[11]:
Text: 'Dr. Smith met Mr. Jones at 3 p.m. in the U.S.'
Naive punctuation: ['Dr', '.', 'Smith', 'met', 'Mr', '.', 'Jones', 'at', '3', 'p', '.', 'm', '.', 'in', 'the', 'U', '.', 'S', '.']
Abbreviation-aware: ['Dr.', 'Smith', 'met', 'Mr.', 'Jones', 'at', '3', 'p.m.', 'in', 'the', 'U.S.']

The abbreviation-aware tokenizer keeps "Dr.", "Mr.", "p.m.", and "U.S." intact. Note that the sentence-final period is absorbed into "U.S.", since the protection step cannot tell whether that period also ends the sentence. This pattern of protecting special cases before applying general rules is common in NLP.

Contractions and Clitics

Contractions present a linguistic puzzle. "Don't" is one orthographic word but represents two morphemes: "do" and "not". How should we tokenize it?

Clitic

A clitic is a morpheme that has syntactic characteristics of a word but is phonologically dependent on an adjacent word. English contractions like "'s", "'ll", and "n't" are clitics that attach to their host words.

Different traditions handle contractions differently:

  • Keep together: "don't" → ["don't"]
  • Split at the apostrophe: "don't" → ["don", "'t"]
  • Split off the clitic (Penn Treebank style): "don't" → ["do", "n't"]
  • Expand to full forms: "don't" → ["do", "not"]
In[12]:
# Common English contractions and their expansions
contractions = {
    "n't": " n't",      # don't -> do n't
    "'ll": " 'll",      # I'll -> I 'll
    "'re": " 're",      # we're -> we 're
    "'ve": " 've",      # I've -> I 've
    "'d": " 'd",        # I'd -> I 'd
    "'m": " 'm",        # I'm -> I 'm
    "'s": " 's",        # it's -> it 's (possessive or "is")
}

def contraction_tokenize(text):
    """Tokenize with contraction splitting."""
    # Handle contractions
    for contraction, replacement in contractions.items():
        text = text.replace(contraction, replacement)
    
    # Standard punctuation handling
    text = re.sub(r'([.,!?;:"\(\)\[\]])', r' \1 ', text)
    return [t for t in text.split() if t]

# Test contraction handling
contraction_text = "I can't believe she's already gone. We'll miss her."
contraction_tokens = contraction_tokenize(contraction_text)
Out[13]:
Text: 'I can't believe she's already gone. We'll miss her.'
Tokens: ['I', 'ca', "n't", 'believe', 'she', "'s", 'already', 'gone', '.', 'We', "'ll", 'miss', 'her', '.']

Contraction analysis:
  'n't' <- clitic (attached to previous word)
  ''s' <- clitic (attached to previous word)
  ''ll' <- clitic (attached to previous word)

Splitting contractions reveals the underlying structure. "Can't" becomes "ca" and "n't", exposing the negation; because the replacement consumes the "n", this simple approach happens to match the Penn Treebank convention for this word. "She's" becomes "she" and "'s", which could be either possessive or "is" depending on context.

The Possessive Problem

The possessive "'s" is particularly tricky. In "John's book", the "'s" marks possession. In "John's running", it's a contraction of "is". Both look identical, and only context reveals the difference.

In[14]:
# Ambiguous 's examples
ambiguous_examples = [
    "John's book is on the table.",     # Possessive
    "John's running late again.",        # Contraction of "is"
    "The dog's barking loudly.",         # Contraction of "is"
    "The dog's tail is wagging.",        # Possessive
]

# Tokenize each
ambiguous_tokens = [contraction_tokenize(ex) for ex in ambiguous_examples]
Out[15]:
Ambiguous 's tokenization:
--------------------------------------------------
Text: John's book is on the table.
Tokens: ['John', "'s", 'book', 'is', 'on', 'the', 'table', '.']
Interpretation: Possessive

Text: John's running late again.
Tokens: ['John', "'s", 'running', 'late', 'again', '.']
Interpretation: Contraction

Text: The dog's barking loudly.
Tokens: ['The', 'dog', "'s", 'barking', 'loudly', '.']
Interpretation: Contraction

Text: The dog's tail is wagging.
Tokens: ['The', 'dog', "'s", 'tail', 'is', 'wagging', '.']
Interpretation: Possessive

A tokenizer can't distinguish these cases without understanding grammar. This is a fundamental limitation: tokenization is a preprocessing step that happens before syntactic analysis. We split "'s" consistently and leave disambiguation to later stages.
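Downstream components can often resolve what the tokenizer cannot. As a rough sketch, a part-of-speech tagger assigns different tags to the possessive "'s" and the contracted "is". The snippet below uses spaCy, which is introduced properly later in this chapter; the exact tags depend on the model and version:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

for sentence in ["John's book is on the table.", "John's running late again."]:
    doc = nlp(sentence)
    # Show how the tagger labels the 's token in each context
    print(sentence, [(t.text, t.tag_) for t in doc if t.text == "'s"])
# Typically the possessive 's is tagged POS and the contraction of "is" is tagged VBZ,
# though the exact output varies by model.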

Penn Treebank Tokenization

The Penn Treebank tokenization standard emerged from the Penn Treebank project, a large annotated corpus of English text. It established conventions that many NLP tools follow:

  • Split contractions: "don't" → "do n't"
  • Separate punctuation: "Hello," → "Hello ,"
  • Keep abbreviations: "Dr." stays as "Dr."
  • Handle special cases: "$" and "%" attach to numbers
In[16]:
from nltk.tokenize import TreebankWordTokenizer

# Initialize the tokenizer
treebank = TreebankWordTokenizer()

# Test on various examples
examples = [
    "I can't believe it's not butter!",
    "Dr. Smith's patients don't like waiting.",
    "The stock rose $5.25 (3.5%) today.",
    '"Hello," she said, "how are you?"',
]

treebank_results = [(ex, treebank.tokenize(ex)) for ex in examples]
Out[17]:
Penn Treebank Tokenization:
============================================================
Text: I can't believe it's not butter!
Tokens: ['I', 'ca', "n't", 'believe', 'it', "'s", 'not', 'butter', '!']

Text: Dr. Smith's patients don't like waiting.
Tokens: ['Dr.', 'Smith', "'s", 'patients', 'do', "n't", 'like', 'waiting', '.']

Text: The stock rose $5.25 (3.5%) today.
Tokens: ['The', 'stock', 'rose', '$', '5.25', '(', '3.5', '%', ')', 'today', '.']

Text: "Hello," she said, "how are you?"
Tokens: ['``', 'Hello', ',', "''", 'she', 'said', ',', '``', 'how', 'are', 'you', '?', "''"]

The Treebank tokenizer handles contractions, punctuation, and special cases according to established conventions. Notice how it splits "can't" into "ca" and "n't", keeps "Dr." together, and handles the dollar amount and percentage.

Treebank Conventions

The Penn Treebank standard makes specific choices:

Input      Output             Rationale
don't      do n't             Reveals negation structure
I'm        I 'm               Separates subject from verb
they're    they 're           Consistent clitic handling
John's     John 's            Possessive/contraction split
"Hello"    `` Hello ''        Directional quotes
(test)     -LRB- test -RRB-   Bracket normalization
In[18]:
# Demonstrate specific Treebank conventions
conventions = [
    ("Contractions", "I don't think they're coming."),
    ("Possessives", "John's car and Mary's bike."),
    ("Quotes", '"Hello," she said.'),
    ("Brackets", "The result (see Figure 1) is clear."),
    ("Currency", "It costs $19.99 plus 8% tax."),
]

convention_results = [(name, text, treebank.tokenize(text)) 
                      for name, text in conventions]
Out[19]:
Treebank Convention Examples:
------------------------------------------------------------
Contractions:
  Input:  I don't think they're coming.
  Output: ['I', 'do', "n't", 'think', 'they', "'re", 'coming', '.']

Possessives:
  Input:  John's car and Mary's bike.
  Output: ['John', "'s", 'car', 'and', 'Mary', "'s", 'bike', '.']

Quotes:
  Input:  "Hello," she said.
  Output: ['``', 'Hello', ',', "''", 'she', 'said', '.']

Brackets:
  Input:  The result (see Figure 1) is clear.
  Output: ['The', 'result', '(', 'see', 'Figure', '1', ')', 'is', 'clear', '.']

Currency:
  Input:  It costs $19.99 plus 8% tax.
  Output: ['It', 'costs', '$', '19.99', 'plus', '8', '%', 'tax', '.']

The Treebank tokenizer converts ASCII quotes to directional quote tokens (`` and ''). The original Penn Treebank also normalized brackets to labeled tokens (-LRB- and -RRB-); NLTK's implementation leaves parentheses unchanged by default, as the output above shows, though it provides a convert_parentheses option. These conventions normalize text for consistent downstream processing.

Language-Specific Challenges

English, with its space-separated words, is relatively easy to tokenize. Other languages present unique challenges.

Chinese: No Word Boundaries

Chinese text contains no spaces between words. Characters flow continuously, and determining word boundaries requires linguistic knowledge.

In[20]:
# Chinese text without spaces
chinese_text = "我喜欢学习自然语言处理"
# Translation: "I like studying natural language processing"

# Character-level tokenization (always works)
char_tokens = list(chinese_text)

# Word-level requires a specialized tokenizer
# Using jieba, a popular Chinese tokenizer
import jieba

word_tokens = list(jieba.cut(chinese_text))
Out[21]:
Chinese text: 我喜欢学习自然语言处理
Translation: 'I like studying natural language processing'

Character tokens: ['我', '喜', '欢', '学', '习', '自', '然', '语', '言', '处', '理']
Character count: 11

Word tokens: ['我', '喜欢', '学习', '自然语言', '处理']
Word count: 5

Character tokenization produces 11 tokens, one per character. Word tokenization produces fewer, more meaningful units: the word tokenizer groups characters into words such as "喜欢" (like), "学习" (study), "自然语言" (natural language), and "处理" (processing), which character tokenization would split into individual characters.

Out[22]:
Visualization
Grouped bar chart comparing character and word token counts for English, Chinese, and Japanese text samples.
Token count comparison between character-level and word-level tokenization across different languages. Languages without spaces (Chinese, Japanese) show the largest difference between granularities. English shows minimal difference since whitespace already provides reasonable word boundaries. The choice of granularity significantly impacts vocabulary size and sequence length in downstream models.

Japanese: Mixed Scripts

Japanese uses three writing systems: hiragana, katakana, and kanji (Chinese characters). Words can be written in any combination, and spaces are rarely used.

In[23]:
# Japanese text with mixed scripts
japanese_text = "私はPythonでNLPを勉強しています"
# Translation: "I am studying NLP with Python"

# Character tokenization
jp_char_tokens = list(japanese_text)

# Word-level tokenization requires MeCab or similar
# For demonstration, we'll show the expected output
expected_jp_words = ["私", "は", "Python", "で", "NLP", "を", "勉強", "し", "て", "い", "ます"]
Out[24]:
Japanese text: 私はPythonでNLPを勉強しています
Translation: 'I am studying NLP with Python'

Character tokens: ['私', 'は', 'P', 'y', 't', 'h', 'o', 'n', 'で', 'N', 'L', 'P', 'を', '勉', '強', 'し', 'て', 'い', 'ま', 'す']
Character count: 20

Expected word tokens: ['私', 'は', 'Python', 'で', 'NLP', 'を', '勉強', 'し', 'て', 'い', 'ます']
Word count: 11

Japanese tokenization must handle the mixing of scripts. "Python" and "NLP" are written in Latin characters, while the rest uses Japanese scripts. A good tokenizer recognizes these boundaries and segments appropriately.
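A full Japanese tokenizer such as MeCab relies on a dictionary and a statistical model. As a much cruder illustration of just the script-boundary part of the problem, a regular expression can separate runs of Latin characters from runs of other scripts. This is a toy heuristic, not a substitute for morphological analysis:

import re

def split_by_script(text):
    """Split text into runs of Latin letters/digits versus other characters (toy heuristic)."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)

print(split_by_script("私はPythonでNLPを勉強しています"))
# ['私は', 'Python', 'で', 'NLP', 'を勉強しています']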

German: Compound Words

German creates long compound words by concatenating shorter words without spaces. "Donaudampfschifffahrtsgesellschaftskapitän" (Danube steamship company captain) is a single orthographic word.

In[25]:
# German compound words
german_compounds = [
    ("Handschuh", ["Hand", "Schuh"]),  # glove = hand + shoe
    ("Krankenhaus", ["Kranken", "Haus"]),  # hospital = sick + house
    ("Bundesausbildungsförderungsgesetz", 
     ["Bundes", "Ausbildungs", "Förderungs", "Gesetz"]),  # Federal education support law
]

# Simple whitespace tokenization keeps compounds together
german_text = "Das Krankenhaus hat einen neuen Handschuh."
german_tokens = german_text.split()
Out[26]:
German Compound Words:
--------------------------------------------------
  Handschuh
    = Hand + Schuh
  Krankenhaus
    = Kranken + Haus
  Bundesausbildungsförderungsgesetz
    = Bundes + Ausbildungs + Förderungs + Gesetz

Text: Das Krankenhaus hat einen neuen Handschuh.
Tokens: ['Das', 'Krankenhaus', 'hat', 'einen', 'neuen', 'Handschuh.']

Note: Compound splitting requires morphological analysis

For some applications, you might want to decompose compounds into their constituent parts. This requires morphological analysis beyond simple tokenization.
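A minimal sketch of compound decomposition, assuming we already have a small lexicon of known parts (the word list here is only an example; real systems use large lexicons plus frequency statistics to rank candidate splits):

def split_compound(word, lexicon):
    """Greedy left-to-right compound split against a lexicon (toy sketch)."""
    parts, rest = [], word.lower()
    while rest:
        for end in range(len(rest), 0, -1):  # prefer the longest known prefix
            if rest[:end] in lexicon:
                parts.append(rest[:end])
                rest = rest[end:]
                break
        else:
            return [word]  # no decomposition found; keep the word whole
    return parts

lexicon = {"kranken", "haus", "hand", "schuh"}
print(split_compound("Krankenhaus", lexicon))  # ['kranken', 'haus']
print(split_compound("Handschuh", lexicon))    # ['hand', 'schuh']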

Arabic: Complex Morphology

Arabic presents multiple challenges: right-to-left script, complex morphology with prefixes and suffixes attached to words, and optional vowel diacritics.

In[27]:
# Arabic text example
arabic_text = "أحب تعلم معالجة اللغة الطبيعية"
# Translation: "I love learning natural language processing"

# Basic tokenization (whitespace works for Arabic)
arabic_tokens = arabic_text.split()

# But morphological analysis might split further
# وكتبتها = و + كتبت + ها (and + I wrote + it)
morphological_example = "وكتبتها"
morphological_parts = ["و", "كتبت", "ها"]  # Expected decomposition
Out[28]:
Arabic text: أحب تعلم معالجة اللغة الطبيعية
Translation: 'I love learning natural language processing'

Whitespace tokens: ['أحب', 'تعلم', 'معالجة', 'اللغة', 'الطبيعية']
Token count: 5

Morphological complexity example:
  Word: وكتبتها
  Parts: و + كتبت + ها
  Meaning: 'and' + 'I wrote' + 'it'

Arabic whitespace tokenization works at the surface level, but a single orthographic word often contains multiple morphemes that might be separate tokens in other languages.

Building a Rule-Based Tokenizer

Let's build a more complete tokenizer that handles common English patterns. We'll combine the techniques we've discussed into a coherent system.

In[29]:
import re
from typing import List

class RuleBasedTokenizer:
    """A rule-based English tokenizer."""
    
    def __init__(self):
        # Abbreviations to protect
        self.abbreviations = {
            "dr.", "mr.", "mrs.", "ms.", "jr.", "sr.", 
            "vs.", "etc.", "i.e.", "e.g.", "u.s.", "u.k.",
            "a.m.", "p.m.", "inc.", "ltd.", "co."
        }
        
        # Contractions to split
        self.contractions = [
            (r"n't\b", " n't"),
            (r"'ll\b", " 'll"),
            (r"'re\b", " 're"),
            (r"'ve\b", " 've"),
            (r"'d\b", " 'd"),
            (r"'m\b", " 'm"),
            (r"'s\b", " 's"),
        ]
        
    def tokenize(self, text: str) -> List[str]:
        """Tokenize text into words and punctuation."""
        # Lowercase for abbreviation matching
        text_lower = text.lower()
        
        # Protect abbreviations
        protected = text
        for abbr in self.abbreviations:
            if abbr in text_lower:
                # Find and protect (case-insensitive)
                pattern = re.compile(re.escape(abbr), re.IGNORECASE)
                protected = pattern.sub(
                    lambda m: m.group().replace(".", "<DOT>"), 
                    protected
                )
        
        # Split contractions
        for pattern, replacement in self.contractions:
            protected = re.sub(pattern, replacement, protected, flags=re.IGNORECASE)
        
        # Separate punctuation (except protected dots)
        protected = re.sub(r'([.,!?;:"\'\(\)\[\]{}])', r' \1 ', protected)
        
        # Restore protected dots
        protected = protected.replace("<DOT>", ".")
        
        # Split and clean
        tokens = [t.strip() for t in protected.split() if t.strip()]
        
        return tokens

# Create tokenizer instance
tokenizer = RuleBasedTokenizer()
Out[30]:
Rule-Based Tokenizer initialized with:
  17 protected abbreviations
  7 contraction patterns

Now let's test our tokenizer on various challenging inputs:

In[31]:
# Test cases
test_cases = [
    "Hello, world!",
    "Dr. Smith's patients don't like waiting.",
    "I can't believe it's not butter!",
    "The U.S. economy grew 3.5% in Q4.",
    "She said, \"Hello!\" and left.",
    "We'll meet at 3 p.m. tomorrow.",
]

results = [(text, tokenizer.tokenize(text)) for text in test_cases]
Out[32]:
Tokenization Results:
============================================================
Input:  Hello, world!
Tokens: ['Hello', ',', 'world', '!']
Count:  4

Input:  Dr. Smith's patients don't like waiting.
Tokens: ['Dr.', 'Smith', "'", 's', 'patients', 'do', 'n', "'", 't', 'like', 'waiting', '.']
Count:  12

Input:  I can't believe it's not butter!
Tokens: ['I', 'ca', 'n', "'", 't', 'believe', 'it', "'", 's', 'not', 'butter', '!']
Count:  12

Input:  The U.S. economy grew 3.5% in Q4.
Tokens: ['The', 'U.S.', 'economy', 'grew', '3', '.', '5%', 'in', 'Q4', '.']
Count:  10

Input:  She said, "Hello!" and left.
Tokens: ['She', 'said', ',', '"', 'Hello', '!', '"', 'and', 'left', '.']
Count:  10

Input:  We'll meet at 3 p.m. tomorrow.
Tokens: ['We', "'", 'll', 'meet', 'at', '3', 'p.m.', 'tomorrow', '.']
Count:  9

Our tokenizer handles abbreviations like "Dr.", "U.S.", and "p.m." well, but the punctuation rule undoes the contraction handling: because the apostrophe sits in the punctuation character class, "'s" and "'ll" are broken into "'" plus "s" or "ll", and decimals like 3.5 are split at the period. It's far from perfect, but it demonstrates the rule-based approach, and how easily separate rules interact in unintended ways.

Limitations of Rule-Based Tokenization

Rule-based tokenizers have inherent limitations:

Coverage: You can't anticipate every abbreviation or special case. New abbreviations emerge constantly.

Ambiguity: "St." could be "Street" or "Saint". "Dr." could be "Doctor" or "Drive". Rules can't disambiguate without context.

Language dependence: Rules written for English don't work for other languages. Each language needs its own rule set.

Maintenance burden: As edge cases accumulate, rule sets become complex and brittle.

In[33]:
# Edge cases that break our tokenizer
edge_cases = [
    "I live on Oak St. near Dr. Brown's office.",  # St. = Street
    "St. Patrick's Day is in March.",               # St. = Saint
    "The temp. is 98.6°F.",                         # Unlisted abbreviation
    "Check out the website: https://example.com",   # URL
    "Email me at user@example.com!",                # Email
]

edge_results = [(text, tokenizer.tokenize(text)) for text in edge_cases]
Out[34]:
Edge Cases:
============================================================
Input:  I live on Oak St. near Dr. Brown's office.
Tokens: ['I', 'live', 'on', 'Oak', 'St', '.', 'near', 'Dr.', 'Brown', "'", 's', 'office', '.']

Input:  St. Patrick's Day is in March.
Tokens: ['St', '.', 'Patrick', "'", 's', 'Day', 'is', 'in', 'March', '.']

Input:  The temp. is 98.6°F.
Tokens: ['The', 'temp', '.', 'is', '98', '.', '6°F', '.']

Input:  Check out the website: https://example.com
Tokens: ['Check', 'out', 'the', 'website', ':', 'https', ':', '//example', '.', 'com']

Input:  Email me at user@example.com!
Tokens: ['Email', 'me', 'at', 'user@example', '.', 'com', '!']

The tokenizer handles some cases well but struggles with URLs, emails, and unlisted abbreviations. Production tokenizers address these with more comprehensive rules or statistical methods.
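One common production fix is to recognize URLs and email addresses with dedicated patterns before applying the general rules, much like the abbreviation protection above. A minimal sketch, where the regular expressions are deliberately simplified examples rather than robust production patterns:

import re

# Simplified example patterns; production tokenizers use far more careful ones.
SPECIAL = re.compile(r"https?://\S+|\S+@\S+\.\S+")  # URLs and emails

def protect_and_tokenize(text):
    """Keep URLs and emails whole; apply a simple punctuation split everywhere else."""
    tokens, pos = [], 0
    for match in SPECIAL.finditer(text):
        plain = text[pos:match.start()]
        tokens.extend(re.sub(r"([.,!?;:])", r" \1 ", plain).split())
        tokens.append(match.group())  # the URL or email stays a single token
        pos = match.end()
    tokens.extend(re.sub(r"([.,!?;:])", r" \1 ", text[pos:]).split())
    return tokens

print(protect_and_tokenize("Email user@example.com or visit https://example.com for details."))
# ['Email', 'user@example.com', 'or', 'visit', 'https://example.com', 'for', 'details', '.']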

Using NLTK and spaCy

Production NLP work typically uses established libraries rather than custom tokenizers. NLTK and spaCy are the two most popular choices for English.

NLTK Tokenizers

NLTK provides several tokenization options:

In[35]:
from nltk.tokenize import word_tokenize, wordpunct_tokenize, TreebankWordTokenizer

text = "Dr. Smith's patients don't like waiting. It's 3:30pm!"

# Different NLTK tokenizers
nltk_word = word_tokenize(text)
nltk_wordpunct = wordpunct_tokenize(text)
nltk_treebank = TreebankWordTokenizer().tokenize(text)
Out[36]:
Text: Dr. Smith's patients don't like waiting. It's 3:30pm!

NLTK word_tokenize:
  ['Dr.', 'Smith', "'s", 'patients', 'do', "n't", 'like', 'waiting', '.', 'It', "'s", '3:30pm', '!']

NLTK wordpunct_tokenize:
  ['Dr', '.', 'Smith', "'", 's', 'patients', 'don', "'", 't', 'like', 'waiting', '.', 'It', "'", 's', '3', ':', '30pm', '!']

NLTK TreebankWordTokenizer:
  ['Dr.', 'Smith', "'s", 'patients', 'do', "n't", 'like', 'waiting.', 'It', "'s", '3:30pm', '!']

Each tokenizer makes different choices. word_tokenize first splits the text into sentences with the Punkt sentence tokenizer and then applies a Treebank-style word tokenizer to each sentence, which is why it separates the period after "waiting". wordpunct_tokenize splits on both whitespace and punctuation. TreebankWordTokenizer applied directly follows Penn Treebank conventions but treats the whole string as one sentence, which is why "waiting." stays fused.

spaCy Tokenizer

spaCy's tokenizer is rule-based but highly optimized and configurable:

In[37]:
import spacy

# Load English model (small version for speed)
nlp = spacy.load("en_core_web_sm")

# Tokenize
doc = nlp(text)
spacy_tokens = [token.text for token in doc]

# spaCy also provides additional information
token_info = [(token.text, token.pos_, token.is_punct, token.is_stop) 
              for token in doc]
Out[38]:
Text: Dr. Smith's patients don't like waiting. It's 3:30pm!

spaCy tokens:
  ['Dr.', 'Smith', "'s", 'patients', 'do', "n't", 'like', 'waiting', '.', 'It', "'s", '3:30pm', '!']

Token details:
Token        POS      Punct?   Stop?   
----------------------------------------
Dr.          PROPN    False    False   
Smith        PROPN    False    False   
's           PART     False    True    
patients     NOUN     False    False   
do           AUX      False    True    
n't          PART     False    True    
like         VERB     False    False   
waiting      VERB     False    False   
.            PUNCT    True     False   
It           PRON     False    True    
's           AUX      False    True    
3:30pm       NUM      False    False   
!            PUNCT    True     False   

spaCy provides not just tokens but also part-of-speech tags, punctuation flags, and stop word indicators. This additional information is useful for downstream processing.

Comparing Tokenizers

Let's compare how different tokenizers handle the same challenging text:

In[39]:
challenge_text = "I can't believe she's already gone! We'll miss her. #sad @friends"

# Tokenize with different tools
results = {
    "Whitespace": challenge_text.split(),
    "NLTK word_tokenize": word_tokenize(challenge_text),
    "NLTK Treebank": TreebankWordTokenizer().tokenize(challenge_text),
    "spaCy": [t.text for t in nlp(challenge_text)],
    "Our tokenizer": tokenizer.tokenize(challenge_text),
}
Out[40]:
Text: I can't believe she's already gone! We'll miss her. #sad @friends

Tokenizer Comparison:
============================================================
Whitespace:
  ['I', "can't", 'believe', "she's", 'already', 'gone!', "We'll", 'miss', 'her.', '#sad', '@friends']
  Count: 11

NLTK word_tokenize:
  ['I', 'ca', "n't", 'believe', 'she', "'s", 'already', 'gone', '!', 'We', "'ll", 'miss', 'her', '.', '#', 'sad', '@', 'friends']
  Count: 18

NLTK Treebank:
  ['I', 'ca', "n't", 'believe', 'she', "'s", 'already', 'gone', '!', 'We', "'ll", 'miss', 'her.', '#', 'sad', '@', 'friends']
  Count: 17

spaCy:
  ['I', 'ca', "n't", 'believe', 'she', "'s", 'already', 'gone', '!', 'We', "'ll", 'miss', 'her', '.', '#', 'sad', '@friends']
  Count: 17

Our tokenizer:
  ['I', 'ca', 'n', "'", 't', 'believe', 'she', "'", 's', 'already', 'gone', '!', 'We', "'", 'll', 'miss', 'her', '.', '#sad', '@friends']
  Count: 20

Different tokenizers make different choices about contractions, punctuation, and social media elements like hashtags and mentions. There's no universally "correct" answer. The best choice depends on your application.

Out[41]:
Visualization
Histogram comparing token length distributions for whitespace, NLTK, and spaCy tokenizers.
Token length distribution across different tokenizers. Whitespace tokenization produces longer tokens on average because punctuation remains attached. Linguistic tokenizers like spaCy produce more single-character tokens (punctuation marks) and shorter word tokens due to contraction splitting. Understanding these distributions helps predict vocabulary size and sequence length in downstream models.
Out[42]:
Visualization
Horizontal bar chart comparing token counts from five different tokenizers for the same input text.
Token count comparison across different tokenizers for the same input text. Whitespace tokenization produces the fewest tokens by keeping punctuation attached. Linguistic tokenizers like NLTK and spaCy produce more tokens by separating punctuation and splitting contractions. The variation highlights that tokenization is not a solved problem with a single correct answer.
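For social media text in particular, NLTK also ships a TweetTokenizer designed to keep hashtags and @-mentions intact (a brief sketch; options such as strip_handles and reduce_len change the behavior):

from nltk.tokenize import TweetTokenizer

tweet_tok = TweetTokenizer(preserve_case=True, strip_handles=False)
print(tweet_tok.tokenize("I can't believe she's already gone! We'll miss her. #sad @friends"))
# Keeps '#sad' and '@friends' as single tokens and leaves contractions unsplit.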

Tokenization Evaluation

How do you know if your tokenizer is working correctly? Unlike many NLP tasks where quality is subjective, tokenization can be evaluated objectively: either a token boundary is in the right place, or it isn't. But this requires something to compare against: a gold standard of human-annotated text with correct token boundaries.

The core insight is that tokenization is fundamentally a boundary detection problem. We're not just counting tokens; we're asking: did the tokenizer place boundaries in the correct positions? This framing leads naturally to the standard evaluation metrics.

The Boundary Detection Framework

Think of a text as a sequence of character positions. A tokenizer's job is to decide which positions mark the start of a new token. Consider the sentence "I can't go":

Position:  0 1 2 3 4 5 6 7 8 9
Text:      I   c a n ' t   g o

A tokenizer that keeps contractions together places boundaries at positions 0, 2, and 8, producing ["I", "can't", "go"]. A tokenizer that splits contractions places boundaries at positions 0, 2, 4, and 8, producing ["I", "ca", "n't", "go"]. The evaluation question becomes: which boundaries match the gold standard?
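A small helper makes the boundary view concrete. Here boundaries are character offsets into the original string; this sketch assumes each token appears in the text in order:

def token_start_offsets(text, tokens):
    """Return the character offset where each token starts in the original text."""
    offsets, pos = [], 0
    for token in tokens:
        start = text.index(token, pos)  # locate the token at or after the current position
        offsets.append(start)
        pos = start + len(token)
    return offsets

sample = "I can't go"
print(token_start_offsets(sample, ["I", "can't", "go"]))      # [0, 2, 8]
print(token_start_offsets(sample, ["I", "ca", "n't", "go"]))  # [0, 2, 4, 8]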

Evaluation Metrics

With this boundary-based view, we can apply the classic precision-recall framework from information retrieval.

Precision answers: of all the boundaries the tokenizer predicted, how many were actually correct? A tokenizer with high precision makes few false boundary predictions. It might miss some boundaries (under-tokenizing), but when it does place a boundary, it's usually right.

\text{Precision} = \frac{\text{Correct boundaries predicted}}{\text{Total boundaries predicted}}

Recall answers the complementary question: of all the true boundaries in the gold standard, how many did the tokenizer find? A tokenizer with high recall catches most boundaries. It might predict some spurious boundaries (over-tokenizing), but it rarely misses a real one.

\text{Recall} = \frac{\text{Correct boundaries predicted}}{\text{Total true boundaries}}

Neither metric alone tells the full story. A tokenizer that places a boundary after every character would have perfect recall but terrible precision. A tokenizer that outputs the entire text as one token would have perfect precision on its single boundary but miss all the others.

F1 Score balances both concerns by computing the harmonic mean of precision and recall. The harmonic mean penalizes extreme imbalances: if either precision or recall is low, F1 will be low, even if the other is high.

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Let's implement this evaluation framework and see it in action:

In[43]:
def evaluate_tokenizer(predicted_tokens, gold_tokens):
    """Evaluate tokenizer output against gold standard."""
    # Convert to sets of token boundaries (character positions)
    def get_boundaries(tokens):
        boundaries = set()
        pos = 0
        for token in tokens:
            boundaries.add(pos)
            pos += len(token) + 1  # +1 for space
        return boundaries
    
    pred_boundaries = get_boundaries(predicted_tokens)
    gold_boundaries = get_boundaries(gold_tokens)
    
    # Calculate metrics
    correct = len(pred_boundaries & gold_boundaries)
    precision = correct / len(pred_boundaries) if pred_boundaries else 0
    recall = correct / len(gold_boundaries) if gold_boundaries else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'predicted_count': len(predicted_tokens),
        'gold_count': len(gold_tokens)
    }

# Example evaluation
gold = ["I", "ca", "n't", "believe", "it", "'s", "not", "butter", "!"]
predicted = ["I", "can't", "believe", "it's", "not", "butter", "!"]

metrics = evaluate_tokenizer(predicted, gold)

The get_boundaries function walks through the token list, tracking character positions. Each token's starting position becomes a boundary. We then compute the set intersection between predicted and gold boundaries to count correct predictions.

Out[44]:
Tokenization Evaluation:
  Gold standard: ['I', 'ca', "n't", 'believe', 'it', "'s", 'not', 'butter', '!']
  Predicted:     ['I', "can't", 'believe', "it's", 'not', 'butter', '!']

Metrics:
  Precision: 28.57%
  Recall:    22.22%
  F1 Score:  25.00%

  Gold count:      9
  Predicted count: 7

The predicted tokenization scores below 100% because it didn't split the contractions "can't" and "it's". The gold standard expects boundaries at "ca|n't" and "it|'s", but our tokenizer kept these as single tokens. The scores drop sharply because this implementation computes boundary positions over the space-joined token strings, so a single missed split shifts every later boundary out of alignment. This illustrates an important point: whether this is an "error" depends entirely on the task. For syntactic parsing, the gold standard is correct. For bag-of-words text classification, keeping contractions together might actually be preferable.

Out[45]:
Visualization
Scatter plot showing precision vs recall for different tokenization strategies, with an ideal region marked in the upper right.
The precision-recall tradeoff in tokenization. Aggressive tokenizers (splitting more boundaries) tend to have high recall but lower precision, while conservative tokenizers have high precision but miss boundaries. The ideal tokenizer achieves high scores on both metrics, placing it in the upper-right corner of the plot.

Common Evaluation Corpora

Standard tokenization benchmarks include:

  • Penn Treebank: The original standard for English tokenization
  • Universal Dependencies: Cross-lingual treebanks with consistent tokenization
  • CoNLL shared tasks: Various NLP benchmarks with tokenization components
In[46]:
# Simulated evaluation on multiple examples
test_set = [
    (["I", "'m", "happy", "."], ["I'm", "happy", "."]),
    (["Do", "n't", "worry", "!"], ["Don't", "worry", "!"]),
    (["It", "'s", "fine", "."], ["It's", "fine", "."]),
]

# Evaluate each
evaluations = []
for gold, pred in test_set:
    metrics = evaluate_tokenizer(pred, gold)
    evaluations.append((gold, pred, metrics))

# Average F1
avg_f1 = sum(e[2]['f1'] for e in evaluations) / len(evaluations)
Out[47]:
Evaluation Results:
------------------------------------------------------------
Gold: ['I', "'m", 'happy', '.']
Pred: ["I'm", 'happy', '.']
F1:   28.57%

Gold: ['Do', "n't", 'worry', '!']
Pred: ["Don't", 'worry', '!']
F1:   28.57%

Gold: ['It', "'s", 'fine', '.']
Pred: ["It's", 'fine', '.']
F1:   28.57%

Average F1: 28.57%

Subword Tokenization Preview

Modern neural NLP systems often use subword tokenization instead of word tokenization. Methods like Byte Pair Encoding (BPE), WordPiece, and SentencePiece break words into smaller units.

In[48]:
# Preview of subword tokenization concepts
word = "unhappiness"

# Possible subword decompositions
decompositions = {
    "Character": list(word),
    "Morpheme": ["un", "happi", "ness"],
    "BPE-style": ["un", "happ", "iness"],
    "Word": [word],
}
Out[49]:
Word: 'unhappiness'

Different granularities:
  Character   : ['u', 'n', 'h', 'a', 'p', 'p', 'i', 'n', 'e', 's', 's'] (11 tokens)
  Morpheme    : ['un', 'happi', 'ness'] (3 tokens)
  BPE-style   : ['un', 'happ', 'iness'] (3 tokens)
  Word        : ['unhappiness'] (1 tokens)

Subword tokenization offers advantages: smaller vocabularies, better handling of rare words, and no out-of-vocabulary problem. We'll explore these methods in detail in a later chapter.

Out[50]:
Visualization
Horizontal bar chart showing token counts for the word 'unhappiness' at different granularity levels from character to word.
The tokenization granularity spectrum from characters to words. Finer granularities (left) produce more tokens but smaller vocabularies, while coarser granularities (right) produce fewer tokens but larger vocabularies. Subword methods like BPE and WordPiece occupy a middle ground, balancing vocabulary size with sequence length.
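As a tiny preview of the mechanics behind BPE-style tokenization, the sketch below applies a hand-picked, ordered list of symbol-pair merges to a single word. Real BPE learns its merge list from corpus statistics; everything here, including the merge list, is purely illustrative:

def apply_merges(word, merges):
    """Apply a fixed, ordered list of symbol-pair merges to a word (toy BPE sketch)."""
    symbols = list(word)  # start from individual characters
    for left, right in merges:
        i, merged = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # merge the adjacent pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hand-picked merges for illustration only; real BPE learns these from data.
merges = [("u", "n"), ("n", "e"), ("s", "s"), ("ne", "ss"),
          ("h", "a"), ("p", "p"), ("ha", "pp"), ("i", "ness")]
print(apply_merges("unhappiness", merges))  # ['un', 'happ', 'iness']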

Limitations and Challenges

Word tokenization, despite decades of research, remains imperfect:

No universal definition of "word": Linguistic theories disagree on what constitutes a word. Tokenizers make practical choices that may not align with any particular theory.

Language diversity: Rules that work for English fail for Chinese, Arabic, or German. Each language family requires specialized approaches.

Domain variation: Medical text, legal documents, social media, and code each have unique tokenization challenges. A tokenizer trained on news articles may struggle with tweets.

Evolving language: New words, abbreviations, and conventions emerge constantly. "COVID-19", "blockchain", and "emoji" didn't exist decades ago.

Ambiguity: Many tokenization decisions are genuinely ambiguous. Should "New York" be one token or two? Should "ice-cream" be split?

Impact on Downstream Tasks

Tokenization choices ripple through the entire NLP pipeline:

Vocabulary size: Splitting contractions increases vocabulary slightly but improves coverage. Keeping compounds together reduces vocabulary but may hurt generalization.

Sequence length: More tokens mean longer sequences. This affects model training time and memory usage.

Semantic alignment: Tokens should ideally correspond to meaningful units. Poor tokenization can obscure semantic relationships.

Cross-lingual transfer: Inconsistent tokenization across languages makes multilingual models harder to train.

Out[51]:
Visualization
Flow diagram showing tokenization feeding into vocabulary, sequences, and semantics, which then affect model training and task performance.
The cascading impact of tokenization on downstream NLP tasks. Tokenization decisions affect vocabulary size, sequence length, and semantic alignment. These in turn influence model training, inference speed, and task performance. Poor tokenization at the start propagates errors through the entire pipeline.

Summary

Word tokenization breaks text into meaningful units for processing. While conceptually simple, the task is surprisingly complex due to punctuation, contractions, abbreviations, and language-specific challenges.

Key takeaways:

  • Whitespace tokenization is fast but leaves punctuation attached to words
  • Punctuation separation improves token quality but requires handling abbreviations
  • Contractions can be kept together, split at the apostrophe, or expanded
  • Penn Treebank conventions provide a standard for English tokenization
  • Languages without spaces (Chinese, Japanese) require specialized segmenters
  • Compound words (German) may need morphological decomposition
  • Rule-based tokenizers are interpretable but require extensive rule sets
  • Library tokenizers (NLTK, spaCy) handle common cases well
  • Evaluation uses precision, recall, and F1 against gold standards
  • Tokenization choices affect all downstream NLP tasks

The right tokenization strategy depends on your task, language, and domain. There's no universal best approach. Understanding the tradeoffs helps you make informed decisions for your specific application.

Key Functions and Parameters

When working with tokenization in Python, these are the essential functions and their most important parameters:

str.split(sep=None)

  • sep: The delimiter string. When None (default), splits on any whitespace and removes empty strings from the result. Useful for basic whitespace tokenization.

nltk.tokenize.word_tokenize(text, language='english')

  • text: The input string to tokenize
  • language: Language for the Punkt tokenizer. Affects sentence boundary detection and abbreviation handling. Supports 17 languages including English, German, French, and Spanish.

nltk.tokenize.TreebankWordTokenizer().tokenize(text)

  • text: The input string to tokenize. Returns tokens following Penn Treebank conventions: splits contractions, separates punctuation, and normalizes quotes and brackets.

spacy.load(name)

  • name: Model name (e.g., 'en_core_web_sm', 'en_core_web_lg'). Larger models provide better accuracy but require more memory. The tokenizer is included in all models.

nlp(text) (spaCy Doc object)

  • Returns a Doc object where each Token has attributes: text (string), pos_ (part-of-speech), is_punct (punctuation flag), is_stop (stop word flag), lemma_ (base form).

jieba.cut(text, cut_all=False)

  • text: Chinese text to segment
  • cut_all: When True, uses full mode (all possible segmentations). When False (default), uses accurate mode (most likely segmentation). Accurate mode is preferred for most NLP tasks.

re.sub(pattern, replacement, text)

  • pattern: Regular expression to match
  • replacement: String or function to replace matches. Use r' \1 ' to surround captured groups with spaces, useful for separating punctuation.
  • text: Input string. Essential for building custom tokenizers with punctuation handling.

In the next chapter, we'll explore subword tokenization methods that break words into smaller units, addressing vocabulary limitations and rare word handling.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about word tokenization.



About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
