Tokenization Challenges
You've now learned the major subword tokenization algorithms: BPE, WordPiece, Unigram, and SentencePiece. These techniques have transformed NLP by elegantly solving the vocabulary problem. But tokenization isn't a solved problem. Edge cases lurk everywhere, from numbers that fragment unpredictably to emoji sequences that explode into dozens of tokens.
This chapter examines the practical challenges that arise when tokenizers meet real-world text. We'll explore why "1000" and "1,000" produce different token sequences, how code tokenization creates surprising failure modes, and why multilingual models struggle with fair representation across languages. You'll learn to recognize tokenization artifacts, understand adversarial attacks that exploit tokenizer weaknesses, and measure tokenization quality systematically.
These aren't academic curiosities. When your model fails to count to ten correctly, produces garbled output for certain inputs, or shows unexpected biases across languages, tokenization is often the culprit. Understanding these challenges helps you debug mysterious model behaviors and choose appropriate tokenizers for your applications.
Number Tokenization
Numbers present one of the most frustrating challenges for subword tokenizers. Unlike words, which have stable morphological structure, numbers can appear in countless formats: "42", "42.0", "42,000", "4.2e4", "0x2A". Each format fragments differently during tokenization, creating inconsistent representations that downstream models struggle to interpret.
The Fragmentation Problem
Most tokenizers learn their vocabularies from text corpora where numbers appear less frequently than words. As a result, numbers often get split into seemingly arbitrary chunks. The number "1234567" might become ['12', '34', '567'] or ['1', '234', '567'] depending on what patterns the tokenizer happened to see during training.
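A quick way to see this for yourself is to run a few numbers of increasing size through an off-the-shelf tokenizer. This is a minimal sketch, assuming the HuggingFace transformers library and the pretrained GPT-2 tokenizer; the exact splits you see will vary by tokenizer and version.

```python
from transformers import AutoTokenizer

# A pretrained byte-level BPE tokenizer; requires the transformers
# library and a one-time download of the GPT-2 vocabulary files.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for number in ["42", "100", "1000", "10000", "1234567", "3.14159"]:
    tokens = tokenizer.tokenize(number)
    print(f"{number:>10} -> {len(tokens)} tokens: {tokens}")

# Exact splits depend on the tokenizer's training data; larger and
# less common numbers tend to fragment into more, arbitrary pieces.
```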
The results reveal several concerning patterns. Small numbers like "42" and "100" tokenize efficiently as single tokens because they appear frequently in training data. But larger numbers fragment unpredictably: "1000" might be two tokens while "10000" becomes three. Decimal numbers split at the decimal point, and scientific notation produces even more fragments.
Why Number Fragmentation Matters
This fragmentation creates real problems for language models. Consider arithmetic: to compute "1234 + 5678", the model must somehow understand that ['12', '34'] represents twelve hundred thirty-four, not the numbers twelve and thirty-four. The model has no explicit representation of place value, so it must learn these relationships from context.
The relationship between number magnitude and token count is roughly logarithmic, but with steps rather than a smooth curve. This reflects the tokenizer's vocabulary: it learned tokens for common number patterns (like "000" or "100") but must decompose less common combinations character by character.
Format Sensitivity
The same numeric value can produce wildly different tokenizations depending on its format. This creates unexpected inconsistencies in how models process equivalent quantities.
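To compare formats directly, tokenize several renderings of the same value side by side. A small sketch, again assuming the GPT-2 tokenizer via transformers; your token counts may differ with other tokenizers.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Several renderings of the same quantity: one million.
formats = ["1000000", "1,000,000", "1e6", "1.0e6", "one million"]

for text in formats:
    tokens = tokenizer.tokenize(text)
    print(f"{text:>12} -> {len(tokens)} tokens: {tokens}")
```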
The comparison reveals dramatic variation. "1000000" as a plain number might tokenize into three or four pieces, while "1,000,000" with commas adds tokens for each separator. Scientific notation "1e6" is often more compact but less common in training data. The text representation "one million" uses two intuitive tokens but looks nothing like the numeric forms.
This format sensitivity has practical implications. A model trained primarily on comma-separated numbers might struggle with scientific notation, and vice versa. Financial applications often encounter mixed formats, which models process inconsistently.
Arithmetic Challenges
Number tokenization directly impacts arithmetic performance. When numbers fragment into tokens that don't align with place value, models must learn implicit arithmetic patterns that humans take for granted.
Simple expressions like "5 + 3 = 8" tokenize cleanly, with each number as a single token. But multi-digit arithmetic creates alignment problems: "1234" becomes multiple tokens with different boundaries than "5678", yet the model must learn that carrying happens across these arbitrary splits.
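The misalignment is easy to inspect: tokenize both operands of a sum and check whether the token boundaries line up with place value. A brief sketch, assuming the GPT-2 tokenizer; other tokenizers will split differently.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for expression in ["5 + 3 = 8", "1234 + 5678 = 6912"]:
    print(f"{expression!r} -> {tokenizer.tokenize(expression)}")

# Single-digit operands usually map to one token each, but the
# multi-digit operands may split at different positions, so the model
# must learn carrying across boundaries that ignore place value.
```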
Code Tokenization
Programming languages present unique tokenization challenges. Code mixes natural language elements (variable names, comments) with syntactic structures (operators, brackets, indentation) in ways that confuse tokenizers trained primarily on prose.
Identifier Fragmentation
Variable and function names in code often use conventions like camelCase or snake_case that pack multiple words into single identifiers. Tokenizers must decide whether to split these at boundaries or treat them as atomic units.
The tokenization reveals inconsistent handling of coding conventions. camelCase names like "getUserById" might split at capital letters, but not consistently: "XMLHttpRequest", for example, treats its acronym prefix differently. snake_case names split at the underscores, which are typically handled as separate tokens. Special Python identifiers like __init__ produce surprising fragments.
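You can probe this behavior directly by tokenizing a handful of identifiers. A minimal sketch assuming the GPT-2 tokenizer; the identifier names are illustrative and the splits depend on the training corpus.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

identifiers = ["getUserById", "XMLHttpRequest", "parse_config_file", "__init__"]

for name in identifiers:
    print(f"{name:>18} -> {tokenizer.tokenize(name)}")

# camelCase names may or may not split at capital letters, acronym
# prefixes are handled inconsistently, and dunder names often
# fragment around the underscores.
```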
Operator and Syntax Tokenization
Programming operators and syntactic elements often tokenize inefficiently because they appear infrequently in natural language training data.
Multi-character operators like "==" or "!=" sometimes tokenize as single units if they appeared frequently in the training data, but more exotic operators like "<<<" (the Bash here-string) or "::" (C++ scope resolution) fragment into their component characters. Python decorators and keywords may or may not be recognized depending on the tokenizer's training corpus.
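The same experiment works for operators and keywords. A short sketch with the GPT-2 tokenizer; the list of operators is illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

operators = ["==", "!=", "->", "::", "<<<", "@staticmethod", "lambda"]

for op in operators:
    print(f"{op:>14} -> {tokenizer.tokenize(op)}")

# Common operators often survive as single tokens, while rarer ones
# fall apart into their component characters.
```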
Whitespace and Indentation
Python and other whitespace-sensitive languages pose a particular challenge. Indentation carries semantic meaning, but tokenizers often collapse or normalize whitespace.
The tokenization preserves newlines and some indentation, but the representation is verbose. Each line of code produces many tokens, and the model must learn that four-space indentation has different meaning than two-space. This creates a semantic gap between how programmers think about code structure and how models process it.
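Tokenizing a tiny function makes the overhead concrete. A minimal sketch assuming the GPT-2 tokenizer; token counts for whitespace vary across tokenizers.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

snippet = "def add(a, b):\n    return a + b\n"
tokens = tokenizer.tokenize(snippet)

print(f"{len(snippet)} characters -> {len(tokens)} tokens")
print(tokens)

# The newline and the four leading spaces become their own tokens
# (or merge into whitespace tokens), inflating the sequence length
# relative to the visible code.
```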
The same simple add function produces different token counts depending on the language syntax:
| Language | Tokens | Characters | Chars/Token |
|---|---|---|---|
| Python | 11 | 30 | 2.7 |
| JavaScript | 18 | 42 | 2.3 |
| C | 21 | 44 | 2.1 |
| Rust | 22 | 40 | 1.8 |
Python's minimal syntax produces the fewest tokens, achieving 2.7 characters per token. Rust's explicit type annotations (i32) and return type syntax add overhead, dropping efficiency to 1.8 characters per token. This efficiency gap matters when processing large codebases: a Python project may fit 50% more code into the same context window compared to Rust.
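You can reproduce a comparison like this by tokenizing equivalent snippets yourself. The snippets below are illustrative, and the counts you get with the GPT-2 tokenizer (or any other) may not match the table exactly.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Equivalent "add" functions; the exact snippets are illustrative.
snippets = {
    "Python": "def add(a, b):\n    return a + b",
    "JavaScript": "function add(a, b) {\n  return a + b;\n}",
    "C": "int add(int a, int b) {\n  return a + b;\n}",
    "Rust": "fn add(a: i32, b: i32) -> i32 {\n    a + b\n}",
}

for language, code in snippets.items():
    n_tokens = len(tokenizer.tokenize(code))
    print(f"{language:>10}: {n_tokens:>3} tokens, "
          f"{len(code) / n_tokens:.1f} chars/token")
```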
Multilingual Challenges
Tokenizers trained primarily on English text struggle with other languages, creating systematic biases that affect model performance and fairness across linguistic communities.
Script and Language Coverage
Different writing systems require dramatically different tokenization strategies. Alphabetic languages like English mark word boundaries with whitespace, but Chinese, Japanese, and Thai are written without spaces between words. Arabic and Hebrew are written right to left. Devanagari and other Brahmic scripts combine consonants and vowels into complex glyphs.
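One way to quantify the disparity is to tokenize rough translations of the same sentence and compare characters per token. A minimal sketch assuming the GPT-2 tokenizer; the translations are illustrative and the numbers depend entirely on which tokenizer you use.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Rough translations of the same sentence (illustrative).
samples = {
    "English": "The weather is very nice today.",
    "German": "Das Wetter ist heute sehr schön.",
    "Chinese": "今天天气非常好。",
    "Japanese": "今日はとてもいい天気です。",
    "Hindi": "आज मौसम बहुत अच्छा है।",
}

for language, text in samples.items():
    n_tokens = len(tokenizer.tokenize(text))
    print(f"{language:>9}: {n_tokens:>2} tokens, "
          f"{len(text) / n_tokens:.2f} chars/token")
```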
The comparison reveals striking disparities. English achieves high efficiency with around 4-5 characters per token, while Chinese and Japanese fragment into many more tokens for equivalent semantic content. This isn't just an efficiency problem: models have a fixed context window measured in tokens, so Chinese text effectively gets less context than English text of similar length.
The Cost of Multilingual Text
The efficiency disparity has concrete costs. When you're paying per token for API access or working with limited context windows, users of lower-efficiency languages effectively pay more or get less context.
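Translating token counts into money is straightforward. A back-of-the-envelope sketch: the per-token price below is made up purely for illustration, and the GPT-2 tokenizer stands in for whatever tokenizer your provider actually uses.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

PRICE_PER_1K_TOKENS = 0.01  # made-up price, for illustration only

texts = {
    "English": "The weather is very nice today. " * 40,
    "Chinese": "今天天气非常好。" * 40,
}

for language, text in texts.items():
    n_tokens = len(tokenizer.encode(text))
    cost = n_tokens / 1000 * PRICE_PER_1K_TOKENS
    print(f"{language}: {n_tokens} tokens -> ${cost:.4f} for roughly the "
          f"same content")
```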
Code-Switching and Mixed Language
Real-world text often mixes languages, creating additional challenges for tokenizers trained on monolingual corpora.
When languages mix within a sentence, the tokenizer must handle sudden script changes. The non-English portions typically fragment more heavily than if they appeared in monolingual text, because the surrounding English context doesn't provide helpful merge patterns.
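A code-switched sentence makes the effect visible. A small sketch with the GPT-2 tokenizer; the specific splits depend on the tokenizer and its training data.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

mixed = "Let's grab 寿司 for lunch tomorrow"  # English with embedded Japanese
print(tokenizer.tokenize(mixed))

# The Latin-script words stay compact, while the embedded Japanese
# typically decomposes into byte-level fragments.
```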
Emoji and Unicode Edge Cases
Emoji and special Unicode characters reveal the limits of byte-based tokenization. What appears as a single character on screen might be multiple Unicode code points, each encoded as multiple bytes.
Emoji Tokenization
Modern emoji can be surprisingly complex. A simple smiley face is one code point, but emoji with skin tone modifiers, gender variations, or family compositions are sequences of multiple code points joined by Zero Width Joiners (ZWJ).
A simple smiley might be 1-2 tokens, but a family emoji with multiple people can explode into a dozen or more tokens. Each skin tone modifier, gender indicator, and ZWJ character adds to the byte count and thus the token count.
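Counting code points and tokens side by side shows the blow-up. A minimal sketch with the GPT-2 tokenizer; the emoji are written as Unicode escapes so the ZWJ structure is explicit, and the exact token counts vary by tokenizer.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

emoji = {
    "slightly smiling face": "\U0001F642",
    "thumbs up + skin tone": "\U0001F44D\U0001F3FD",
    "family (ZWJ sequence)": "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",
}

for label, sequence in emoji.items():
    n_tokens = len(tokenizer.tokenize(sequence))
    print(f"{label:>24}: {len(sequence)} code points, {n_tokens} tokens")
```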
Unicode Normalization Issues
Unicode allows multiple ways to represent the same visual character. The letter "é" can be a single code point (U+00E9, "Latin Small Letter E with Acute") or two code points (U+0065 "Latin Small Letter E" + U+0301 "Combining Acute Accent"). These normalize to the same visual appearance but may tokenize differently.
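You can verify this with Python's standard unicodedata module. A short sketch, assuming the GPT-2 tokenizer; whether the two forms tokenize differently depends on the tokenizer, but the code point sequences always differ until you normalize.

```python
import unicodedata

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

composed = "caf\u00e9"     # 'é' as a single code point (NFC form)
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent (NFD form)

print(composed == decomposed)          # False: different code point sequences
print(tokenizer.tokenize(composed))
print(tokenizer.tokenize(decomposed))  # may differ from the line above

# Normalizing to NFC before tokenization removes the discrepancy.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```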
Special Characters and Symbols
Mathematical symbols, currency signs, and other special characters may or may not be in the tokenizer's vocabulary.
Mathematical symbols like π and ∑ may fragment into byte sequences that models must learn to interpret. Greek letters used in scientific text face similar challenges. This creates a disparity between prose and technical writing that affects how well models handle STEM content.
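Checking how a tokenizer handles individual symbols takes one line per symbol. A brief sketch with the GPT-2 tokenizer; results vary by vocabulary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for symbol in ["π", "∑", "√", "€", "±"]:
    print(f"{symbol} -> {tokenizer.tokenize(symbol)}")

# Symbols missing from the vocabulary decompose into byte-level pieces.
```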
Tokenization Artifacts
Tokenization creates artifacts: patterns in the token sequence that don't reflect linguistic structure but arise from the tokenizer's learned merge rules. These artifacts can cause unexpected model behaviors.
Position-Dependent Tokenization
The same substring may tokenize differently depending on its position in a word. This happens because BPE learns different merge rules for word-initial, word-internal, and word-final positions.
Notice how "low" tokenizes differently as a standalone word versus when embedded in "fellow" or "allow". The leading space marker (Ġ in GPT-2) affects which merge rules apply. This position sensitivity means morphologically related words may have different token representations.
Repeated Character Anomalies
Long sequences of repeated characters create unusual tokenization patterns. The tokenizer may have learned specific tokens for common repeats (like "ee" in "feet") but must fall back to character-by-character tokenization for longer sequences.
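Plotting token count against run length is the easiest way to see the irregularity. A minimal sketch with the GPT-2 tokenizer; the specific counts depend on which repeats made it into the vocabulary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for n in [1, 2, 4, 8, 16, 32, 64]:
    tokens = tokenizer.tokenize("a" * n)
    print(f"{n:>3} repeated characters -> {len(tokens)} tokens")

# Token count does not grow linearly with run length: the vocabulary
# contains some common repeats, and longer runs are covered by whatever
# combination of learned chunks happens to fit.
```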
Tokenization Boundary Effects
Slight changes in input text can cause cascading changes in tokenization. Adding or removing a single character might shift token boundaries throughout the sequence.
Notice how adding a single punctuation mark can change the token count. More significantly, changing a word can affect the tokenization of adjacent text, particularly for tokenizers that don't pre-split on whitespace, because the change shifts the byte alignment of everything that follows.
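A handful of near-identical inputs illustrates the sensitivity. A small sketch with the GPT-2 tokenizer; the variant sentences are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

variants = [
    "The model answered correctly",
    "The model answered correctly.",
    "The model answered correctly!",
    "The  model answered correctly",  # extra space
]

for text in variants:
    tokens = tokenizer.tokenize(text)
    print(f"{len(tokens):>2} tokens: {tokens}")
```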
Adversarial Tokenization
Malicious actors can exploit tokenization quirks to confuse language models or bypass content filters. Understanding these attacks helps build more robust systems.
Token Boundary Manipulation
By carefully crafting input text, attackers can create token boundaries that obscure the true meaning of the input. This technique can evade content moderation systems that operate on token-level patterns.
Prompt Injection via Tokenization
Attackers can insert invisible characters or use Unicode lookalikes to inject instructions that appear innocuous to human reviewers but get processed differently by models.
Prompt injection is an attack in which malicious instructions are embedded in user input, designed to override the system's intended behavior. Tokenization artifacts can make these injections harder to detect because the malicious content may not match expected token patterns.
| Word | Script | Tokens | Token Sequence |
|---|---|---|---|
| admin | Latin script | 1 | ['admin'] |
| аdmin | Cyrillic 'а' | 4 | ['Ð', '°', 'dm', 'in'] |
| αdmin | Greek 'α' | 3 | ['α', 'dm', 'in'] |
This attack vector is particularly dangerous for content moderation. A filter searching for the token ['admin'] will miss both homoglyph variants entirely, since they produce completely different token sequences. Robust detection requires Unicode normalization and script analysis before tokenization.
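One pragmatic defense is to flag mixed-script words before the text ever reaches the tokenizer. The sketch below uses Unicode character names from the standard library as a rough script proxy; the function names are illustrative, and production systems usually rely on full Unicode script data or a dedicated confusables library.

```python
import unicodedata

def char_script(ch: str) -> str:
    """Rough script label taken from the Unicode character name."""
    try:
        return unicodedata.name(ch).split()[0]  # e.g. 'LATIN', 'CYRILLIC', 'GREEK'
    except ValueError:
        return "UNKNOWN"

def looks_mixed_script(word: str) -> bool:
    """Flag words whose alphabetic characters come from more than one script."""
    scripts = {char_script(ch) for ch in word if ch.isalpha()}
    return len(scripts) > 1

words = {
    "Latin": "admin",
    "Cyrillic homoglyph": "\u0430dmin",  # Cyrillic small a + 'dmin'
    "Greek homoglyph": "\u03b1dmin",     # Greek small alpha + 'dmin'
}

for label, word in words.items():
    print(f"{label:>20}: mixed script = {looks_mixed_script(word)}")
```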
Measuring Tokenization Quality
How do we evaluate whether one tokenizer is better than another? Several metrics help quantify tokenization quality across different dimensions.
Compression Metrics
The primary goal of subword tokenization is compression: representing text with fewer tokens than characters. Compression ratio and fertility measure this directly.
Fertility measures the average number of tokens produced per word. Lower fertility indicates more efficient tokenization, meaning common words are represented by single tokens rather than being split into subwords.
A compression ratio around 4-5 characters per token is typical for English text with a well-trained tokenizer. Fertility around 1.2-1.5 tokens per word indicates that most words are either single tokens or split into just two pieces.
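Both metrics are easy to compute. A minimal sketch, assuming the GPT-2 tokenizer and crude whitespace word splitting; the sample text is illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = (
    "Subword tokenizers compress text by reusing frequent fragments, "
    "so common words usually survive as single tokens."
)

tokens = tokenizer.tokenize(text)
words = text.split()  # crude whitespace word count

compression_ratio = len(text) / len(tokens)  # characters per token
fertility = len(tokens) / len(words)         # tokens per word

print(f"Compression ratio: {compression_ratio:.2f} chars/token")
print(f"Fertility:         {fertility:.2f} tokens/word")
```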
Cross-Language Fairness
For multilingual applications, tokenization efficiency should be roughly equal across languages. Large disparities indicate unfair representation.
A disparity ratio close to 1.0 indicates fair representation across languages. Ratios above 2.0 suggest significant bias toward some languages. The standard deviation captures how much variation exists in tokenization efficiency.
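The disparity ratio and spread can be computed from per-language efficiency numbers. A sketch assuming the GPT-2 tokenizer; the translations are illustrative stand-ins for a real evaluation set.

```python
from statistics import pstdev

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Illustrative translations of the same sentence.
samples = {
    "English": "I would like to book a table for two people tonight.",
    "Spanish": "Quisiera reservar una mesa para dos personas esta noche.",
    "Chinese": "我想预订今晚两个人的桌子。",
    "Hindi": "मैं आज रात दो लोगों के लिए एक टेबल बुक करना चाहता हूँ।",
}

efficiency = {
    lang: len(text) / len(tokenizer.tokenize(text))  # chars per token
    for lang, text in samples.items()
}

disparity_ratio = max(efficiency.values()) / min(efficiency.values())
print(efficiency)
print(f"Disparity ratio:    {disparity_ratio:.2f}")
print(f"Standard deviation: {pstdev(efficiency.values()):.2f}")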
Downstream Task Correlation
Ultimately, tokenization quality matters because it affects downstream task performance. Better tokenization generally correlates with better model performance, but the relationship isn't always straightforward.
Best Practices and Recommendations
Based on these challenges, several best practices emerge for working with tokenizers effectively.
Choosing the Right Tokenizer
The optimal tokenizer depends on your application:
- English-only applications: GPT-2/GPT-4 tokenizers work well, offering good compression and vocabulary coverage
- Multilingual applications: Consider tokenizers trained on balanced multilingual corpora, like those from mT5 or XLM-RoBERTa
- Code-heavy applications: Look for tokenizers explicitly trained on code, like CodeBERT or StarCoder tokenizers
- Domain-specific: Consider training a custom tokenizer on domain text if standard tokenizers perform poorly
Preprocessing for Better Tokenization
Some preprocessing steps can improve tokenization consistency; a minimal normalization sketch follows this list:
- Unicode normalization: Convert text to a canonical form (typically NFC) so that visually identical characters tokenize identically
- Invisible characters: Strip zero-width spaces and byte-order marks that change tokenization without changing what reviewers see
- Whitespace: Collapse irregular runs of spaces and tabs where indentation doesn't carry meaning
- Number formats: Standardize on one numeric format when exact formatting doesn't matter
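The sketch below implements a conservative version of these steps using only the standard library; the function name is illustrative, and you should tune the rules to your domain (for example, keep whitespace intact for code).

```python
import re
import unicodedata

# Zero-width space and byte-order mark; ZWJ/ZWNJ are kept because they
# carry meaning in emoji sequences and in some scripts.
INVISIBLE_CHARS = re.compile(r"[\u200b\ufeff]")

def normalize_for_tokenization(text: str) -> str:
    """Apply conservative normalization before tokenizing."""
    text = unicodedata.normalize("NFC", text)  # canonical Unicode form
    text = INVISIBLE_CHARS.sub("", text)       # strip invisible characters
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces and tabs
    return text.strip()

print(normalize_for_tokenization("cafe\u0301  \u200bmenu"))  # -> 'café menu'
```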
Monitoring Tokenization in Production
Track tokenization metrics to catch issues early; a minimal per-request logging sketch follows this list:
- Token count per request: Sudden increases may indicate adversarial input or unusual content
- OOV and byte-fallback rates: Track how often unknown tokens or byte-level fallback occur; spikes often indicate content your tokenizer handles poorly
- Language distribution: Monitor whether tokenization efficiency varies across user populations
- Cost per character: Track effective costs across different content types
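A minimal sketch of the per-request metrics above, assuming the GPT-2 tokenizer; tokenization_stats is a hypothetical helper meant to be logged alongside each request.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenization_stats(text: str) -> dict:
    """Per-request tokenization metrics worth logging."""
    n_tokens = len(tokenizer.encode(text))
    return {
        "n_chars": len(text),
        "n_tokens": n_tokens,
        "chars_per_token": len(text) / max(n_tokens, 1),
    }

# Log these alongside each request and alert on sudden shifts.
print(tokenization_stats("Please summarize the attached quarterly report."))
```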
Summary
Tokenization challenges arise at the intersection of linguistic diversity and computational constraints. We've explored several key problem areas:
Number tokenization fragments inconsistently based on magnitude and format, creating challenges for arithmetic and numerical reasoning. The same value tokenizes differently depending on whether it's written as "1000", "1,000", or "1e3".
Code tokenization struggles with programming conventions like camelCase and snake_case, creating verbose representations of identifiers and operators. Whitespace-sensitive languages pose particular challenges.
Multilingual text reveals systematic biases in tokenizers trained primarily on English. Scripts like Chinese and Japanese fragment into many more tokens for equivalent semantic content, creating unfair representation in fixed-context models.
Emoji and Unicode expose the complexity beneath seemingly simple characters. Compound emoji can explode into dozens of tokens, and Unicode normalization affects tokenization consistency.
Tokenization artifacts cause unexpected model behaviors when slight input changes cascade into different token boundaries. Position-dependent tokenization means the same substring can be represented differently in different contexts.
Adversarial attacks exploit these quirks to bypass content filters and inject malicious prompts. Homoglyphs and invisible characters can obscure true content while appearing innocuous to human reviewers.
Quality metrics help evaluate and compare tokenizers: compression ratio, fertility, cross-language fairness, and correlation with downstream performance all provide useful signals.
Understanding these challenges helps you debug mysterious model behaviors, choose appropriate tokenizers for your applications, and build more robust NLP systems. Tokenization may seem like a solved problem, but as we've seen, edge cases lurk everywhere, and careful attention to tokenization quality can significantly impact model performance.