Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP

Michael Brenndoerfer · December 7, 2025 · 21 min read · 4,975 words

Master text normalization techniques including Unicode NFC/NFD/NFKC/NFKD forms, case folding vs lowercasing, diacritic removal, and whitespace handling. Learn to build robust normalization pipelines for search and deduplication.

This article is part of the free-to-read Language AI Handbook


Text Normalization

In the previous chapter, we saw how a single character like "é" can be represented in multiple ways: as a single precomposed code point (U+00E9) or as a base letter plus a combining accent (U+0065 + U+0301). Both look identical on screen, but Python considers them different strings. This seemingly minor issue can break string matching, corrupt search results, and introduce subtle bugs into your NLP pipelines.

Text normalization is the process of transforming text into a consistent, canonical form. It goes beyond encoding to address the fundamental question: when should two different byte sequences be considered the "same" text? This chapter covers Unicode normalization forms, case handling, whitespace cleanup, and building robust normalization pipelines.

Why Normalization Matters

Consider a simple task: searching for the word "café" in a document. Without normalization, your search might miss matches because the document uses a different Unicode representation.

In[2]:
# Two visually identical strings
cafe1 = "café"  # Precomposed: U+00E9
cafe2 = "cafe\u0301"  # Decomposed: e + combining acute

# Visual comparison
looks_same = cafe1 == cafe2
Out[3]:
Two ways to write 'café':
  cafe1 = 'café' (precomposed)
  cafe2 = 'café' (decomposed)

Look identical? Yes, they both display as: café
Are equal in Python? False

Code point breakdown:
  cafe1: ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
  cafe2: ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']

The strings are visually identical but computationally different. This creates problems across NLP:

  • Search: Users searching for "café" won't find documents containing the decomposed form
  • Deduplication: Duplicate detection fails when the same text uses different representations
  • Tokenization: Tokenizers may split decomposed characters incorrectly
  • Embeddings: Identical words may receive different vector representations

Unicode Normalization Forms

The Unicode standard defines four normalization forms to address representation ambiguity. Each form serves different purposes.

Unicode Normalization

Unicode normalization transforms text into a canonical form where equivalent strings have identical code point sequences. The four forms (NFC, NFD, NFKC, NFKD) differ in whether they compose or decompose characters and whether they apply compatibility mappings.

NFC: Canonical Composition

NFC (Normalization Form Canonical Composition) converts text to its shortest representation by combining base characters with their accents into single precomposed characters where possible.

In[4]:
import unicodedata

# Start with decomposed form
decomposed = "cafe\u0301"  # e + combining acute

# Normalize to NFC (composed)
nfc = unicodedata.normalize('NFC', decomposed)

# Compare with precomposed original
precomposed = "café"
Out[5]:
NFC Normalization (Composition):
  Original (decomposed): 'café'
  Code points: ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']
  Length: 5

  After NFC: 'café'
  Code points: ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
  Length: 4

  Matches precomposed 'café'? True

NFC is the most commonly used normalization form. It produces the most compact representation and matches what most users expect when they type accented characters.

NFD: Canonical Decomposition

NFD (Normalization Form Canonical Decomposition) does the opposite: it breaks precomposed characters into their base character plus combining marks.

In[6]:
# Start with precomposed form
composed = "café"

# Normalize to NFD (decomposed)
nfd = unicodedata.normalize('NFD', composed)
Out[7]:
NFD Normalization (Decomposition):
  Original (composed): 'café'
  Code points: ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
  Length: 4

  After NFD: 'café'
  Code points: ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']
  Length: 5

NFD is useful when you need to manipulate accents separately from base characters, such as removing diacritics or analyzing character components.
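
To see what decomposition exposes, the short sketch below prints each code point of an NFD-normalized word along with its Unicode category and name; the combining marks show up as separate characters with category Mn:

import unicodedata

# After NFD, every accent is its own code point with category 'Mn'
for ch in unicodedata.normalize('NFD', 'Ångström'):
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")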

NFKC and NFKD: Compatibility Normalization

The "K" forms apply compatibility decomposition in addition to canonical normalization. This maps characters that are semantically equivalent but visually distinct.

Compatibility Equivalence

Compatibility equivalence groups characters that represent the same abstract character but differ in appearance or formatting. Examples include full-width vs. half-width characters, ligatures vs. separate letters, and superscripts vs. regular digits.

In[8]:
# Characters with compatibility equivalents
test_cases = [
    ("fi", "fi ligature"),
    ("①", "circled digit one"),
    ("Ⅳ", "roman numeral four"),
    ("hello", "full-width hello"),
    ("²", "superscript two"),
    ("㎞", "km symbol"),
]

# Apply NFKC normalization
nfkc_results = [(char, desc, unicodedata.normalize('NFKC', char)) for char, desc in test_cases]
Out[9]:
NFKC Compatibility Normalization:
-------------------------------------------------------
Original     Description               NFKC Result 
-------------------------------------------------------
'ﬁ'          fi ligature               'fi'
'①'        circled digit one         '1'
'Ⅳ'        roman numeral four        'IV'
'ｈｅｌｌｏ'      full-width hello          'hello'
'²'        superscript two           '2'
'㎞'        km symbol                 'km'

NFKC is aggressive. It converts the "ﬁ" ligature to the separate letters "f" and "i", expands the circled digit to just "1", and converts full-width characters to their ASCII equivalents. This is useful for search and comparison but destroys formatting information.

Out[10]:
Visualization
Diagram showing how the four Unicode normalization forms transform example characters.
Comparison of Unicode normalization forms. NFC and NFD preserve character identity while changing representation. NFKC and NFKD additionally apply compatibility mappings, converting visually distinct but semantically equivalent characters to a common form.

Choosing a Normalization Form

The right form depends on your use case:

Use Case                          Recommended Form        Reason
--------------------------------------------------------------------------------------
General text storage              NFC                     Compact, preserves visual appearance
Accent-insensitive search         NFD then strip marks    Easy to remove combining characters
Full-text search                  NFKC                    Matches variant representations
Security (username comparison)    NFKC                    Prevents homograph attacks
Preserving formatting             NFC                     Keeps ligatures and special forms
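
The table translates into very little code. As a quick sketch, the two helpers below (the names normalize_for_storage and normalize_for_search are ours, not a standard API) pair NFC with storage and NFKC plus case folding with search:

import unicodedata

def normalize_for_storage(text):
    # Conservative: canonical composition only; text stays visually unchanged
    return unicodedata.normalize('NFC', text)

def normalize_for_search(text):
    # Aggressive: compatibility mappings plus Unicode case folding for matching
    return unicodedata.normalize('NFKC', text).casefold()

print(normalize_for_storage('Ｃａｆé'))  # full-width letters preserved, é composed
print(normalize_for_search('Ｃａｆé'))   # 'café'
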
In[11]:
def compare_normalization_forms(text):
    """Compare all four normalization forms for a given text."""
    forms = ['NFC', 'NFD', 'NFKC', 'NFKD']
    results = {}
    for form in forms:
        normalized = unicodedata.normalize(form, text)
        results[form] = {
            'text': normalized,
            'length': len(normalized),
            'codepoints': [f'U+{ord(c):04X}' for c in normalized]
        }
    return results

# Test with a complex example
test_text = "financial résumé ①"
comparison = compare_normalization_forms(test_text)
Out[12]:
Normalizing: 'ﬁnancial résumé ①'
======================================================================

NFC:
  Result: 'ﬁnancial résumé ①'
  Length: 17

NFD:
  Result: 'ﬁnancial résumé ①'
  Length: 19

NFKC:
  Result: 'financial résumé 1'
  Length: 18

NFKD:
  Result: 'financial résumé 1'
  Length: 20

The length differences reveal how each form handles the input. NFD produces the longest output because it decomposes accented characters into base letters plus combining marks. NFC produces the most compact result by composing them. NFKC is one character longer than NFC because, in addition to composing, it expands the ligature "ﬁ" into the two letters "f" and "i" (and maps "①" to "1", which doesn't change the length).

Out[13]:
Visualization
Grouped bar chart comparing string lengths across NFC, NFD, NFKC, and NFKD normalization forms for various text samples.
String length comparison across Unicode normalization forms for different text samples. NFD consistently produces longer strings by decomposing characters, while NFC produces the most compact representation. NFKC and NFKD may increase or decrease length depending on whether compatibility mappings expand or simplify characters.

The chart reveals important patterns. Text with combining diacritics (café, naïve, Ångström) shows significant length increase under NFD decomposition. Full-width characters (Ｈｅｌｌｏ) and circled digits (①②③) shrink dramatically under NFKC/NFKD as they're mapped to their ASCII equivalents. The ligature "ﬁ" in "ﬁnance" expands from one character to two under compatibility normalization.

Case Folding vs. Lowercasing

Case-insensitive comparison seems simple: just convert both strings to lowercase. But Unicode makes this surprisingly complex.

The Problem with Simple Lowercasing

In[14]:
# German sharp s (ß) uppercases to SS in standard Python
german_word = "straße"  # street
lowered = german_word.lower()
uppered = german_word.upper()
round_trip = uppered.lower()
Out[15]:
Case conversion with German ß:
  Original:    'straße' (length 6)
  .lower():    'straße' (length 6)
  .upper():    'STRASSE' (length 7)
  Round-trip:  'strasse' (length 7)

  Original == round-trip? False

The German "ß" uppercases to "SS" (two characters), and lowercasing "SS" gives "ss", not "ß". Round-tripping through case conversion changes the string. This is not a bug; it reflects the actual orthographic rules of German.

Case Folding

Case Folding

Case folding is a Unicode operation designed for case-insensitive comparison. Unlike simple lowercasing, case folding handles language-specific mappings and ensures that equivalent strings compare equal regardless of their original case.

Python's str.casefold() method implements Unicode case folding:

In[16]:
# Compare lower() vs casefold()
words = ["Straße", "STRASSE", "straße", "strasse"]

lower_results = [w.lower() for w in words]
casefold_results = [w.casefold() for w in words]
Out[17]:
Comparing lower() vs casefold():
--------------------------------------------------
Word         lower()      casefold()  
--------------------------------------------------
Straße       straße       strasse     
STRASSE      strasse      strasse     
straße       straße       strasse     
strasse      strasse      strasse     

Case-insensitive matches:
  Using lower():    2 distinct values
  Using casefold(): 1 distinct values

With casefold(), all four variations of "street" in German normalize to the same string, enabling correct case-insensitive comparison.

Language-Specific Case Rules

Some case conversions depend on language context:

In[18]:
# Turkish dotted and dotless i
turkish_examples = [
    ("I", "English uppercase I"),
    ("i", "English lowercase i"),
    ("İ", "Turkish uppercase dotted I (U+0130)"),
    ("ı", "Turkish lowercase dotless i (U+0131)"),
]

# Standard Python case conversion (not locale-aware)
case_results = [(char, desc, char.lower(), char.upper(), char.casefold()) 
                for char, desc in turkish_examples]
Out[19]:
Turkish I variants and case conversion:
---------------------------------------------------------------------------
Char   Description                         lower    upper    casefold
---------------------------------------------------------------------------
'I'    English uppercase I                 'i'      'I'      'i'
'i'    English lowercase i                 'i'      'I'      'i'
'İ'    Turkish uppercase dotted I (U+0130) 'i̇'      'İ'      'i̇'
'ı'    Turkish lowercase dotless i (U+0131) 'ı'      'I'      'ı'

In Turkish, "I" lowercases to "ı" (dotless) and "i" uppercases to "İ" (dotted). Python's default case operations follow English rules, which can cause problems with Turkish text. For locale-aware case conversion, you need specialized libraries.
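
If you need Turkish-aware casing, one option is the ICU bindings. The sketch below assumes PyICU is installed (pip install PyICU) and uses UnicodeString.toLower with an explicit locale; treat it as an illustration of the approach rather than the only solution:

import icu  # PyICU, assumed installed

tr = icu.Locale('tr_TR')
text = 'DİYARBAKIR'

# Locale-aware lowercasing applies Turkish rules: İ -> i and I -> ı
print(str(icu.UnicodeString(text).toLower(tr)))  # 'diyarbakır'
print(text.lower())  # Python's default lower() follows English rules instead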

Accent and Diacritic Handling

Many NLP applications benefit from accent-insensitive matching. A user searching for "resume" should probably find "résumé".

Removing Diacritics

The standard approach uses NFD normalization followed by filtering:

In[20]:
import unicodedata

def remove_diacritics(text):
    """Remove diacritical marks from text."""
    # Decompose into base characters and combining marks
    decomposed = unicodedata.normalize('NFD', text)
    
    # Filter out combining marks (category 'Mn' = Mark, Nonspacing)
    filtered = ''.join(c for c in decomposed 
                       if unicodedata.category(c) != 'Mn')
    
    # Recompose any remaining sequences
    return unicodedata.normalize('NFC', filtered)

# Test with various accented text
test_texts = [
    "résumé",
    "naïve",
    "Ñoño",
    "Zürich",
    "Ångström",
]

stripped = [(text, remove_diacritics(text)) for text in test_texts]
Out[21]:
Diacritic Removal:
-----------------------------------
Original        Stripped       
-----------------------------------
résumé          resume         
naïve           naive          
Ñoño            Nono           
Zürich          Zurich         
Ångström        Angstrom       

This technique decomposes accented characters into base letters plus combining marks, removes the marks, and recomposes. For most Latin-script input the result is plain ASCII-compatible text, though characters without decompositions (such as "ß" or "æ") pass through unchanged.

Preserving Semantic Distinctions

Be careful: removing diacritics can change meaning in some languages.

In[22]:
# Diacritics that change meaning
semantic_examples = [
    ("père", "father (French)"),
    ("pêre", "would be meaningless"),
    ("año", "year (Spanish)"),
    ("ano", "anus (Spanish)"),
    ("für", "for (German)"),
    ("fur", "different word"),
]
Out[23]:
When diacritics matter:
--------------------------------------------------
  père     → pere      (father (French))
  pêre     → pere      (would be meaningless)
  año      → ano       (year (Spanish))
  ano      → ano       (anus (Spanish))
  für      → fur       (for (German))
  fur      → fur       (different word)

For search applications, you might want to match both forms. For translation or language understanding, preserving diacritics is essential.
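
One lightweight way to match both forms in search is to index every entry under both an accent-preserving key and an accent-stripped key, so either query spelling finds it. A minimal sketch, reusing the remove_diacritics helper defined above:

from collections import defaultdict

def search_keys(text):
    folded = text.casefold()
    # Index under the folded form and its accent-stripped variant
    return {folded, remove_diacritics(folded)}

index = defaultdict(set)
for doc_id, word in enumerate(["résumé", "resume", "año"]):
    for key in search_keys(word):
        index[key].add(doc_id)

print(sorted(index["resume"]))  # both 'résumé' and 'resume' match -> [0, 1]
print(sorted(index["ano"]))     # accent-stripped query still finds 'año' -> [2]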

Whitespace Normalization

Whitespace seems simple, but Unicode defines many whitespace characters beyond the familiar space and tab.

In[24]:
# Unicode whitespace characters
whitespace_chars = [
    ('\u0020', 'Space'),
    ('\u00A0', 'No-Break Space'),
    ('\u2002', 'En Space'),
    ('\u2003', 'Em Space'),
    ('\u2009', 'Thin Space'),
    ('\u200B', 'Zero Width Space'),
    ('\u3000', 'Ideographic Space'),
    ('\t', 'Tab'),
    ('\n', 'Newline'),
    ('\r', 'Carriage Return'),
]

# Check which are detected by str.isspace()
space_check = [(char, name, char.isspace()) for char, name in whitespace_chars]
Out[25]:
Unicode Whitespace Characters:
-------------------------------------------------------
Char     Code       Name                      isspace() 
-------------------------------------------------------
         U+0020     Space                     True
\xa0     U+00A0     No-Break Space            True
\u2002   U+2002     En Space                  True
\u2003   U+2003     Em Space                  True
\u2009   U+2009     Thin Space                True
\u200b   U+200B     Zero Width Space          False
\u3000   U+3000     Ideographic Space         True
\t       U+0009     Tab                       True
\n       U+000A     Newline                   True
\r       U+000D     Carriage Return           True

Notice that the zero-width space (U+200B) is not considered whitespace by Python's isspace(). These invisible characters can cause subtle bugs.

Out[26]:
Visualization
Horizontal bar chart showing UTF-8 byte sizes for various Unicode whitespace characters, ranging from 1 to 3 bytes.
UTF-8 byte sizes of Unicode whitespace characters. Standard ASCII space uses just 1 byte, while specialized Unicode spaces require 2-3 bytes. Zero-width characters, despite being invisible, still consume bytes in the encoded text. This variation affects file sizes and can cause issues when processing text from different sources.

The byte size variation has practical implications. A document using ideographic spaces (common in CJK text) will be larger than one using standard ASCII spaces. Zero-width characters, despite being invisible, still consume 3 bytes each in UTF-8, and they can accumulate when copying text from web pages or PDFs.
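
You can verify those byte counts directly by encoding individual characters:

# UTF-8 sizes of a few space (and zero-width) characters
for ch, name in [('\u0020', 'Space'), ('\u00A0', 'No-Break Space'),
                 ('\u200B', 'Zero Width Space'), ('\u3000', 'Ideographic Space')]:
    print(f"{name:20} U+{ord(ch):04X}  {len(ch.encode('utf-8'))} byte(s)")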

Normalizing Whitespace

A robust whitespace normalizer should:

  1. Convert all whitespace variants to standard spaces
  2. Collapse multiple spaces into one
  3. Strip leading and trailing whitespace
  4. Optionally handle zero-width characters
In[27]:
import re

def normalize_whitespace(text, collapse=True, strip=True):
    """Normalize various whitespace characters to standard spaces."""
    # Unicode whitespace pattern (broader than \s)
    # Includes all Unicode Zs (space separator) category
    whitespace_pattern = r'[\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]'
    
    # Replace all whitespace variants with standard space
    text = re.sub(whitespace_pattern, ' ', text)
    
    # Handle zero-width characters
    text = re.sub(r'[\u200B\u200C\u200D\uFEFF]', '', text)
    
    # Normalize line endings
    text = re.sub(r'\r\n|\r', '\n', text)
    
    if collapse:
        # Collapse multiple spaces to single space
        text = re.sub(r' +', ' ', text)
        # Collapse multiple newlines to double newline (paragraph break)
        text = re.sub(r'\n{3,}', '\n\n', text)
    
    if strip:
        text = text.strip()
    
    return text

# Test with messy whitespace
messy_text = "Hello\u00A0\u00A0World\u200B!\u3000\u3000Test"
cleaned = normalize_whitespace(messy_text)
Out[28]:
Whitespace Normalization:
  Original: 'Hello\xa0\xa0World\u200b!\u3000\u3000Test'
  Length: 20

  Cleaned: 'Hello World! Test'
  Length: 17

Ligature Expansion

Ligatures are single characters that represent multiple letters joined together. They're common in typeset text and can cause matching problems.

In[29]:
# Common ligatures
ligatures = [
    ('ﬁ', 'fi', 'Latin small ligature fi'),
    ('ﬂ', 'fl', 'Latin small ligature fl'),
    ('ﬀ', 'ff', 'Latin small ligature ff'),
    ('ﬃ', 'ffi', 'Latin small ligature ffi'),
    ('ﬄ', 'ffl', 'Latin small ligature ffl'),
    ('Ꜳ', 'AA', 'Latin capital letter AA'),
    ('œ', 'oe', 'Latin small letter oe'),
    ('æ', 'ae', 'Latin small letter ae'),
]

# NFKC expands most ligatures
expanded = [(lig, exp, unicodedata.normalize('NFKC', lig)) for lig, exp, _ in ligatures]
Out[30]:
Ligature Expansion with NFKC:
---------------------------------------------
Ligature     Expected     NFKC Result 
---------------------------------------------
'ﬁ'          'fi'         'fi' ✓
'ﬂ'          'fl'         'fl' ✓
'ﬀ'          'ff'         'ff' ✓
'ﬃ'          'ffi'        'ffi' ✓
'ﬄ'          'ffl'        'ffl' ✓
'Ꜳ'          'AA'         'Ꜳ' ≠
'œ'          'oe'         'œ' ≠
'æ'          'ae'         'æ' ≠

NFKC handles most Latin ligatures correctly. However, some characters like "æ" and "œ" are considered distinct letters in some languages (Danish, French) rather than ligatures, so NFKC preserves them.
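
If your search use case does want "æ" and "œ" expanded, you have to map them yourself on top of NFKC. A small sketch using an explicit translation table (our own mapping, applied after normalization):

import unicodedata

# NFKC leaves æ/œ untouched, so add an explicit expansion step for search keys
EXTRA_EXPANSIONS = str.maketrans({'æ': 'ae', 'Æ': 'AE', 'œ': 'oe', 'Œ': 'OE'})

def expand_for_search(text):
    return unicodedata.normalize('NFKC', text).translate(EXTRA_EXPANSIONS)

print(expand_for_search('Œuvre cœur ﬁnance'))  # 'OEuvre coeur finance'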

Full-Width to Half-Width Conversion

East Asian text often uses full-width versions of ASCII characters. These take up the same width as CJK characters, creating visual alignment in mixed text.

In[31]:
# Full-width ASCII characters
fullwidth_examples = [
    ('Ａ', 'A', 'Full-width A'),
    ('ａ', 'a', 'Full-width a'),
    ('０', '0', 'Full-width 0'),
    ('！', '!', 'Full-width exclamation'),
    ('　', ' ', 'Ideographic space'),
]

# Full-width to half-width conversion
def fullwidth_to_halfwidth(text):
    """Convert full-width ASCII to half-width."""
    result = []
    for char in text:
        code = ord(char)
        # Full-width ASCII range: U+FF01 to U+FF5E maps to U+0021 to U+007E
        if 0xFF01 <= code <= 0xFF5E:
            result.append(chr(code - 0xFF01 + 0x21))
        # Ideographic space to regular space
        elif code == 0x3000:
            result.append(' ')
        else:
            result.append(char)
    return ''.join(result)

# Test conversion
fullwidth_text = "Hello World! 123"
halfwidth_text = fullwidth_to_halfwidth(fullwidth_text)
Out[32]:
Full-Width to Half-Width Conversion:
  Full-width: 'Ｈｅｌｌｏ Ｗｏｒｌｄ！ １２３'
  Half-width: 'Hello World! 123'

Character-by-character:
  'Ｈ' (U+FF28) → 'H' (U+0048)
  'ｅ' (U+FF45) → 'e' (U+0065)
  'ｌ' (U+FF4C) → 'l' (U+006C)
  'ｌ' (U+FF4C) → 'l' (U+006C)
  'ｏ' (U+FF4F) → 'o' (U+006F)

NFKC normalization also handles full-width to half-width conversion:

In[33]:
# NFKC also converts full-width characters
nfkc_result = unicodedata.normalize('NFKC', fullwidth_text)
manual_match = nfkc_result == halfwidth_text
Out[34]:
NFKC vs manual conversion:
  NFKC result:   'Hello World! 123'
  Manual result: 'Hello World! 123'
  Match: True

Building a Normalization Pipeline

Real-world text normalization combines multiple techniques. The order of operations matters.

In[35]:
import unicodedata
import re

class TextNormalizer:
    """A configurable text normalization pipeline."""
    
    def __init__(self, 
                 unicode_form='NFC',
                 lowercase=False,
                 casefold=False,
                 strip_accents=False,
                 normalize_whitespace=True,
                 strip_control_chars=True):
        """
        Initialize the normalizer with configuration options.
        
        Parameters:
        - unicode_form: 'NFC', 'NFD', 'NFKC', 'NFKD', or None
        - lowercase: Apply str.lower()
        - casefold: Apply str.casefold() (overrides lowercase)
        - strip_accents: Remove diacritical marks
        - normalize_whitespace: Collapse and standardize whitespace
        - strip_control_chars: Remove control characters
        """
        self.unicode_form = unicode_form
        self.lowercase = lowercase
        self.casefold = casefold
        self.strip_accents = strip_accents
        self.normalize_whitespace = normalize_whitespace
        self.strip_control_chars = strip_control_chars
    
    def __call__(self, text):
        """Apply the normalization pipeline to text."""
        # Step 1: Unicode normalization (first pass)
        if self.unicode_form:
            text = unicodedata.normalize(self.unicode_form, text)
        
        # Step 2: Strip accents (requires NFD decomposition)
        if self.strip_accents:
            text = unicodedata.normalize('NFD', text)
            text = ''.join(c for c in text 
                          if unicodedata.category(c) != 'Mn')
            # Recompose after stripping
            if self.unicode_form in ('NFC', 'NFKC'):
                text = unicodedata.normalize('NFC', text)
        
        # Step 3: Case normalization
        if self.casefold:
            text = text.casefold()
        elif self.lowercase:
            text = text.lower()
        
        # Step 4: Control character removal
        if self.strip_control_chars:
            # Remove C0, C1 controls except whitespace
            text = ''.join(c for c in text 
                          if unicodedata.category(c) != 'Cc' 
                          or c in '\t\n\r')
        
        # Step 5: Whitespace normalization
        if self.normalize_whitespace:
            # Convert various spaces to regular space
            text = re.sub(r'[\u00A0\u2000-\u200A\u202F\u205F\u3000]', ' ', text)
            # Remove zero-width characters
            text = re.sub(r'[\u200B-\u200D\uFEFF]', '', text)
            # Collapse multiple spaces
            text = re.sub(r' +', ' ', text)
            # Normalize line endings and collapse multiple newlines
            text = re.sub(r'\r\n|\r', '\n', text)
            text = re.sub(r'\n{3,}', '\n\n', text)
            text = text.strip()
        
        return text

# Create different normalizers for different use cases
search_normalizer = TextNormalizer(
    unicode_form='NFKC',
    casefold=True,
    strip_accents=True,
    normalize_whitespace=True
)

storage_normalizer = TextNormalizer(
    unicode_form='NFC',
    normalize_whitespace=True
)
Out[36]:
Normalization Pipeline Comparison:
============================================================
Original: '  Héllo\xa0\xa0Wörld!  finance  '

Search normalizer (aggressive):
  Result: 'hello world! finance'

Storage normalizer (conservative):
  Result: 'Héllo Wörld! finance'

Pipeline Order Matters

The order of normalization steps can affect results:

In[37]:
# Demonstrate order dependency
text = "CAFÉ"

# Order 1: Lowercase then strip accents
order1 = text.lower()
order1 = ''.join(c for c in unicodedata.normalize('NFD', order1)
                 if unicodedata.category(c) != 'Mn')

# Order 2: Strip accents then lowercase  
order2 = ''.join(c for c in unicodedata.normalize('NFD', text)
                 if unicodedata.category(c) != 'Mn')
order2 = order2.lower()
Out[38]:
Order of Operations:
  Original: 'CAFÉ'

  Lowercase → Strip accents: 'cafe'
  Strip accents → Lowercase: 'cafe'

  Same result? True

In this case, the order doesn't matter. But with more complex transformations involving case-sensitive patterns or locale-specific rules, order can be significant. Always test your pipeline with representative data.

Practical Example: Deduplication

Let's apply normalization to a real task: finding duplicate entries in a dataset.

In[39]:
# Simulated dataset with near-duplicate entries
company_names = [
    "Société Générale",
    "SOCIÉTÉ GÉNÉRALE",
    "Societe Generale",
    "Société  Générale",  # Extra space
    "Societe Generale",  # Full-width
    "Apple Inc.",
    "Apple Inc",
    "APPLE INC.",
    "apple inc",
    "Müller GmbH",
    "Mueller GmbH",
    "MÜLLER GMBH",
    "Muller GmbH",
]

# Create a normalizer for deduplication
dedup_normalizer = TextNormalizer(
    unicode_form='NFKC',
    casefold=True,
    strip_accents=True,
    normalize_whitespace=True
)

# Group by normalized form
from collections import defaultdict
groups = defaultdict(list)
for name in company_names:
    normalized = dedup_normalizer(name)
    groups[normalized].append(name)
Out[40]:
Duplicate Detection Results:
============================================================

Normalized form: 'societe generale'
  Matches (5):
    - Société Générale
    - SOCIÉTÉ GÉNÉRALE
    - Societe Generale
    - Société  Générale
    - Ｓｏｃｉｅｔｅ Ｇｅｎｅｒａｌｅ

Normalized form: 'apple inc.'
  Matches (2):
    - Apple Inc.
    - APPLE INC.

Normalized form: 'apple inc'
  Matches (2):
    - Apple Inc
    - apple inc

Normalized form: 'muller gmbh'
  Matches (3):
    - Müller GmbH
    - MÜLLER GMBH
    - Muller GmbH

The normalizer correctly groups all five variants of "Société Générale", including the full-width and extra-space versions, and merges the case variants of "Apple Inc." Note what it does not merge: stripping accents turns "ü" into "u", so "Müller GmbH" groups with "Muller GmbH", but "Mueller GmbH" (the German "ue" transliteration) stays a separate key, and "Apple Inc." remains distinct from "Apple Inc" because of the trailing period.

Out[41]:
Visualization
Bar chart showing decreasing number of unique strings as more normalization steps are applied, from 13 raw to 3 fully normalized.
Deduplication effectiveness with progressive normalization. Each bar shows how many unique strings remain after applying cumulative normalization steps. Starting with 13 raw entries, aggressive normalization reduces this to just 3 unique canonical forms, successfully identifying all near-duplicates.

The visualization shows how each normalization step progressively reduces the number of unique strings. Raw text shows 13 distinct entries, but after full normalization only 3 unique entities remain: "societe generale", "apple inc", and "muller gmbh". Each step contributes to duplicate detection: NFKC handles full-width characters, whitespace normalization catches extra spaces, case folding unifies capitalization variants, and accent stripping merges "Müller" with "Muller". Collapsing the remaining variants, such as "Mueller" with its "ue" transliteration and "Inc." versus "Inc", takes additional steps like transliteration rules and punctuation stripping beyond the TextNormalizer shown above.

Out[42]:
Visualization
Flowchart showing text normalization stages from raw input to canonical output.
Text normalization pipeline showing the transformation of raw input through multiple stages. Each stage addresses a specific type of variation, progressively reducing the text to a canonical form suitable for comparison and search.

Limitations and Challenges

Text normalization is powerful but not perfect:

Information loss: Aggressive normalization destroys information. Stripping accents loses the distinction between "resume" (to continue) and "résumé" (CV). Case folding loses the distinction between proper nouns and common words.

Language specificity: No single normalization strategy works for all languages. Turkish case rules differ from English. Chinese has no case. Some scripts have no concept of accents.

Context dependence: The right normalization depends on your task. Search benefits from aggressive normalization. Machine translation needs to preserve source text exactly.

Irreversibility: Most normalization operations cannot be undone. Once you've stripped accents or folded case, the original information is gone.

Edge cases: Unicode is vast and complex. New characters are added regularly. Your normalization code may not handle every possible input correctly.

Key Functions and Parameters

When working with text normalization in Python, these are the essential functions and their most important parameters:

unicodedata.normalize(form, text)

  • form: The normalization form to apply. Options are:
    • 'NFC': Canonical composition (default for storage)
    • 'NFD': Canonical decomposition (useful for accent stripping)
    • 'NFKC': Compatibility composition (aggressive, for search)
    • 'NFKD': Compatibility decomposition
  • text: The Unicode string to normalize

unicodedata.category(char)

  • Returns a two-letter category code for a Unicode character
  • 'Mn': Mark, Nonspacing (combining diacritics)
  • 'Cc': Control characters
  • 'Zs': Space separator
  • Useful for filtering specific character types during normalization

str.casefold()

  • Returns a casefolded copy of the string for case-insensitive comparison
  • More aggressive than lower(), handles special cases like German "ß" → "ss"
  • Preferred over lower() for Unicode-aware case-insensitive matching

str.lower() vs str.casefold()

  • lower(): Standard Unicode lowercasing, but does not handle special cases like German "ß"
  • casefold(): Full Unicode case folding, handles language-specific mappings for comparison
  • Use casefold() for comparison, lower() for display

re.sub(pattern, replacement, text)

  • Essential for whitespace normalization patterns
  • Common patterns:
    • r'[\u00A0\u2000-\u200A\u202F\u205F\u3000]': Various Unicode spaces
    • r'[\u200B-\u200D\uFEFF]': Zero-width characters
    • r' +': Multiple consecutive spaces
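
Put together, a typical aggressive comparison key built from these functions looks roughly like the sketch below (the helper name canonical_key is ours):

import re
import unicodedata

def canonical_key(text):
    text = unicodedata.normalize('NFKC', text)  # unify compatibility variants
    text = text.casefold()                      # Unicode-aware case folding
    text = ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')  # strip combining marks
    return re.sub(r'\s+', ' ', text).strip()    # collapse and trim whitespace

print(canonical_key('  Ｃａｆé\u00A0MENU '))  # 'cafe menu'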

Summary

Text normalization transforms text into consistent, comparable forms. We covered:

  • Unicode normalization forms: NFC composes, NFD decomposes, NFKC and NFKD add compatibility mappings
  • Case folding: Use casefold() for case-insensitive comparison, not lower()
  • Diacritic handling: NFD decomposition plus filtering removes accents
  • Whitespace normalization: Unicode has many whitespace characters beyond space and tab
  • Ligature expansion: NFKC expands most typographic ligatures
  • Full-width conversion: NFKC converts full-width ASCII to standard ASCII

Key takeaways:

  • NFC is the default choice for general text storage
  • NFKC with casefold is best for search and comparison
  • Always normalize before comparing strings for equality
  • Normalization order matters: plan your pipeline carefully
  • Test with representative data: edge cases will surprise you
  • Preserve originals: keep unnormalized text when possible

In the next chapter, we'll explore tokenization, the process of breaking text into meaningful units for further processing.

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
