Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP

Michael BrenndoerferUpdated March 19, 202530 min read

Master text normalization techniques including Unicode NFC/NFD/NFKC/NFKD forms, case folding vs lowercasing, diacritic removal, and whitespace handling. Learn to build robust normalization pipelines for search and deduplication.

Reading Level

Choose your expertise level to adjust how many terms are explained. Beginners see more tooltips, experts see fewer to maintain reading flow. Hover over underlined terms for instant definitions.

Text Normalization

In the previous chapter, we saw how a single character like "é" can be represented in multiple ways: as a single precomposed code point (U+00E9) or as a base letter plus a combining accent (U+0065 + U+0301). Both look identical on screen, but Python considers them different strings. This seemingly minor issue can break string matching, corrupt search results, and introduce subtle bugs into your NLP pipelines.

Text normalization is the process of transforming text into a consistent, canonical form. It goes beyond encoding to address the fundamental question: when should two different byte sequences be considered the "same" text? This chapter covers Unicode normalization forms, case handling, whitespace cleanup, and building robust normalization pipelines.

Why Normalization Matters

Consider a simple task: searching for the word "café" in a document. Without normalization, your search might miss matches because the document uses a different Unicode representation.

In[2]:
Code
# Two visually identical strings
cafe1 = "café"  # Precomposed: U+00E9
cafe2 = "cafe\u0301"  # Decomposed: e + combining acute

# Visual comparison
looks_same = cafe1 == cafe2
Out[3]:
Console
Two ways to write 'café':
  cafe1 = 'café' (precomposed)
  cafe2 = 'café' (decomposed)

Look identical? Yes, they both display as: café
Are equal in Python? False

Code point breakdown:
  cafe1: ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
  cafe2: ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']

The strings are visually identical but computationally different. This creates problems across NLP:

  • Search: Users searching for "café" won't find documents containing the decomposed form
  • Deduplication: Duplicate detection fails when the same text uses different representations
  • Tokenization: Tokenizers may split decomposed characters incorrectly
  • Embeddings: Identical words may receive different vector representations

Unicode Normalization Forms

The Unicode standard defines four normalization forms to address representation ambiguity. Each form serves different purposes.

Unicode Normalization

Unicode normalization transforms text into a canonical form where equivalent strings have identical code point sequences. The four forms (NFC, NFD, NFKC, NFKD) differ in whether they compose or decompose characters and whether they apply compatibility mappings.

NFC: Canonical Composition

NFC (Normalization Form Canonical Composition) converts text to its shortest representation by combining base characters with their accents into single precomposed characters where possible.

In[4]:
Code
import unicodedata

# Start with decomposed form
decomposed = "cafe\u0301"  # e + combining acute

# Normalize to NFC (composed)
nfc = unicodedata.normalize("NFC", decomposed)

# Compare with precomposed original
precomposed = "café"
Out[5]:
Console
NFC Normalization (Composition):
  Original (decomposed): 'café'
  Code points: ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']
  Length: 5

  After NFC: 'café'
  Code points: ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
  Length: 4

  Matches precomposed 'café'? True

NFC is the most commonly used normalization form. It produces the most compact representation and matches what most users expect when they type accented characters.

NFD: Canonical Decomposition

NFD (Normalization Form Canonical Decomposition) does the opposite: it breaks precomposed characters into their base character plus combining marks.

In[6]:
Code
# Start with precomposed form
composed = "café"

# Normalize to NFD (decomposed)
nfd = unicodedata.normalize("NFD", composed)
Out[7]:
Console
NFD Normalization (Decomposition):
  Original (composed): 'café'
  Code points: ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
  Length: 4

  After NFD: 'café'
  Code points: ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']
  Length: 5

NFD is useful when you need to manipulate accents separately from base characters, such as removing diacritics or analyzing character components.

NFKC and NFKD: Compatibility Normalization

The "K" forms apply compatibility decomposition in addition to canonical normalization. This maps characters that are semantically equivalent but visually distinct.

Compatibility Equivalence

Compatibility equivalence groups characters that represent the same abstract character but differ in appearance or formatting. Examples include full-width vs. half-width characters, ligatures vs. separate letters, and superscripts vs. regular digits.

In[8]:
Code
# Characters with compatibility equivalents
test_cases = [
    ("fi", "fi ligature"),
    ("①", "circled digit one"),
    ("Ⅳ", "roman numeral four"),
    ("hello", "full-width hello"),
    ("²", "superscript two"),
    ("㎞", "km symbol"),
]

# Apply NFKC normalization
nfkc_results = [
    (char, desc, unicodedata.normalize("NFKC", char))
    for char, desc in test_cases
]
Out[9]:
Console
NFKC Compatibility Normalization:
-------------------------------------------------------
Original     Description               NFKC Result 
-------------------------------------------------------
'fi'        fi ligature               'fi'
'①'        circled digit one         '1'
'Ⅳ'        roman numeral four        'IV'
'hello'        full-width hello          'hello'
'²'        superscript two           '2'
'㎞'        km symbol                 'km'

NFKC is aggressive. It converts the "fi" ligature to separate "f" and "i" characters, expands the circled digit to just "1", and converts full-width characters to their ASCII equivalents. This is useful for search and comparison but destroys formatting information.

The table below shows how each normalization form transforms different input characters. NFC and NFD are canonical forms that preserve character identity, while NFKC and NFKD apply compatibility mappings that may change the character representation.

Unicode normalization forms compared. Canonical forms (NFC, NFD) preserve character identity; compatibility forms (NFKC, NFKD) apply aggressive mappings.
InputNFCNFDNFKCNFKD
café (decomposed)cafécafécafécafé
e + ́ (combining)ée + ́ée + ́
fi (ligature)fifi
① (circled)11
hi (full-width)hihihihi

The canonical forms (NFC, NFD) preserve ligatures and special characters, changing only the internal representation. The compatibility forms (NFKC, NFKD) aggressively normalize to base characters, expanding ligatures and converting full-width to half-width.

Choosing a Normalization Form

The right form depends on your use case:

Normalization form recommendations by use case.
Use CaseRecommended FormReason
General text storageNFCCompact, preserves visual appearance
Accent-insensitive searchNFD then strip marksEasy to remove combining characters
Full-text searchNFKCMatches variant representations
Security (username comparison)NFKCPrevents homograph attacks
Preserving formattingNFCKeeps ligatures and special forms
In[10]:
Code
def compare_normalization_forms(text):
    """Compare all four normalization forms for a given text."""
    forms = ["NFC", "NFD", "NFKC", "NFKD"]
    results = {}
    for form in forms:
        normalized = unicodedata.normalize(form, text)
        results[form] = {
            "text": normalized,
            "length": len(normalized),
            "codepoints": [f"U+{ord(c):04X}" for c in normalized],
        }
    return results


# Test with a complex example
test_text = "financial résumé ①"
comparison = compare_normalization_forms(test_text)
Out[11]:
Console
Normalizing: 'financial résumé ①'
======================================================================

NFC:
  Result: 'financial résumé ①'
  Length: 17

NFD:
  Result: 'financial résumé ①'
  Length: 19

NFKC:
  Result: 'financial résumé 1'
  Length: 18

NFKD:
  Result: 'financial résumé 1'
  Length: 20

The length differences reveal how each form handles the input. NFD produces the longest output because it decomposes characters into base letters plus combining marks. NFC and NFKC produce shorter outputs by composing characters, with NFKC additionally expanding the ligature "fi" into two separate characters.

Out[12]:
Visualization
Grouped bar chart comparing string lengths across NFC, NFD, NFKC, and NFKD normalization forms for various text samples.
String length comparison across Unicode normalization forms for different text samples. NFD consistently produces longer strings by decomposing characters, while NFC produces the most compact representation. NFKC and NFKD may increase or decrease length depending on whether compatibility mappings expand or simplify characters.

The chart reveals important patterns. Text with combining diacritics (café, naïve, Ångström) shows significant length increase under NFD decomposition. Full-width characters (Hello) and circled digits (①②③) shrink dramatically under NFKC/NFKD as they're mapped to their ASCII equivalents. The ligature "fi" in "finance" expands from one character to two under compatibility normalization.

Case Folding vs. Lowercasing

Case-insensitive comparison seems simple: just convert both strings to lowercase. But Unicode makes this surprisingly complex.

The Problem with Simple Lowercasing

In[13]:
Code
# German sharp s (ß) uppercases to SS in standard Python
german_word = "straße"  # street
lowered = german_word.lower()
uppered = german_word.upper()
round_trip = uppered.lower()
Out[14]:
Console
Case conversion with German ß:
  Original:    'straße' (length 6)
  .lower():    'straße' (length 6)
  .upper():    'STRASSE' (length 7)
  Round-trip:  'strasse' (length 7)

  Original == round-trip? False

The German "ß" uppercases to "SS" (two characters), and lowercasing "SS" gives "ss", not "ß". Round-tripping through case conversion changes the string. This is not a bug; it reflects German orthographic rules where "ß" traditionally had no uppercase form. While Unicode 5.1 (2008) added the capital ẞ (U+1E9E), Python's upper() still converts to "SS" for compatibility with the traditional standard.

Case Folding

Case Folding

Case folding is a Unicode operation designed for case-insensitive comparison. Unlike simple lowercasing, case folding handles language-specific mappings and ensures that equivalent strings compare equal regardless of their original case.

Python's str.casefold() method implements Unicode case folding:

In[15]:
Code
# Compare lower() vs casefold()
words = ["Straße", "STRASSE", "straße", "strasse"]

lower_results = [w.lower() for w in words]
casefold_results = [w.casefold() for w in words]
Out[16]:
Console
Comparing lower() vs casefold():
--------------------------------------------------
Word         lower()      casefold()  
--------------------------------------------------
Straße       straße       strasse     
STRASSE      strasse      strasse     
straße       straße       strasse     
strasse      strasse      strasse     

Case-insensitive matches:
  Using lower():    2 distinct values
  Using casefold(): 1 distinct values

With casefold(), all four variations of "street" in German normalize to the same string, enabling correct case-insensitive comparison.

The following table shows how many distinct strings remain after applying lower() versus casefold() to groups of equivalent words. The ideal is 1 (all variants unified to a single canonical form):

Comparison oflower() vs casefold() for case-insensitive string matching across languages.
Word GroupVariantslower() distinctcasefold() distinct
German "street"Straße, STRASSE, straße, strasse, STRAßE21
German "size"Größe, GRÖSSE, größe, groesse, GROESSE32
Greek sigmaσ, ς, Σ21
Mixed caseHello, HELLO, hello, HeLLo11
Turkish IIstanbul, ISTANBUL, istanbul, İstanbul22

For German words with "ß", casefold() correctly unifies all variants to a single canonical form, while lower() leaves two distinct values. Greek sigma variants (σ, ς, Σ) are particularly interesting: casefold() maps them all to the same form, recognizing that they represent the same letter in different positions. Standard English words show identical behavior for both methods, confirming that casefold() is a superset of lower() functionality. Note that Turkish requires locale-aware handling for correct dotted/dotless I normalization, which neither method provides automatically.

Language-Specific Case Rules

Some case conversions depend on language context:

In[17]:
Code
# Turkish dotted and dotless i
turkish_examples = [
    ("I", "English uppercase I"),
    ("i", "English lowercase i"),
    ("İ", "Turkish uppercase dotted I (U+0130)"),
    ("ı", "Turkish lowercase dotless i (U+0131)"),
]

# Standard Python case conversion (not locale-aware)
case_results = [
    (char, desc, char.lower(), char.upper(), char.casefold())
    for char, desc in turkish_examples
]
Out[18]:
Console
Turkish I variants and case conversion:
---------------------------------------------------------------------------
Char   Description                         lower    upper    casefold
---------------------------------------------------------------------------
'I'    English uppercase I                 'i'      'I'      'i'
'i'    English lowercase i                 'i'      'I'      'i'
'İ'    Turkish uppercase dotted I (U+0130) 'i̇'      'İ'      'i̇'
'ı'    Turkish lowercase dotless i (U+0131) 'ı'      'I'      'ı'

In Turkish, "I" lowercases to "ı" (dotless) and "i" uppercases to "İ" (dotted). Python's default case operations follow English rules, which can cause problems with Turkish text. For locale-aware case conversion, you need specialized libraries.

Accent and Diacritic Handling

Many NLP applications benefit from accent-insensitive matching. A user searching for "resume" should probably find "résumé".

Removing Diacritics

The standard approach uses NFD normalization followed by filtering:

In[19]:
Code
import unicodedata


def remove_diacritics(text):
    """Remove diacritical marks from text."""
    # Decompose into base characters and combining marks
    decomposed = unicodedata.normalize("NFD", text)

    # Filter out combining marks (category 'Mn' = Mark, Nonspacing)
    filtered = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

    # Recompose any remaining sequences
    return unicodedata.normalize("NFC", filtered)


# Test with various accented text
test_texts = [
    "résumé",
    "naïve",
    "Ñoño",
    "Zürich",
    "Ångström",
]

stripped = [(text, remove_diacritics(text)) for text in test_texts]
Out[20]:
Console
Diacritic Removal:
-----------------------------------
Original        Stripped       
-----------------------------------
résumé          resume         
naïve           naive          
Ñoño            Nono           
Zürich          Zurich         
Ångström        Angstrom       

This technique decomposes accented characters into base letters plus combining marks, removes the marks, and recomposes. The result is plain ASCII-compatible text.

Preserving Semantic Distinctions

Be careful: removing diacritics can change meaning in some languages.

In[21]:
Code
# Diacritics that change meaning
semantic_examples = [
    ("père", "father (French)"),
    ("pêre", "would be meaningless"),
    ("año", "year (Spanish)"),
    ("ano", "anus (Spanish)"),
    ("für", "for (German)"),
    ("fur", "different word"),
]
Out[22]:
Console
When diacritics matter:
--------------------------------------------------
  père     → pere      (father (French))
  pêre     → pere      (would be meaningless)
  año      → ano       (year (Spanish))
  ano      → ano       (anus (Spanish))
  für      → fur       (for (German))
  fur      → fur       (different word)

For search applications, you might want to match both forms. For translation or language understanding, preserving diacritics is essential.

The collision rate when stripping diacritics varies dramatically by language:

Semantic collision rates when stripping diacritics, by language.
LanguageCollision RateCollisions / Total WordsExample Collisions
Spanish50%10 / 20año/ano (year/anus), sí/si (yes/if)
French47%7 / 15où/ou (where/or), côte/cote (coast/quote)
German50%6 / 12schön/schon (beautiful/already), drücken/drucken (press/print)
Portuguese50%6 / 12pôde/pode (could/can), pôr/por (put/by)
English50%5 / 10résumé/resume, café/cafe

Spanish and French use diacritics extensively to distinguish word meanings. German umlauts often differentiate semantically different words. English treats diacritics largely as optional styling for loanwords, but stripping them still causes collisions with the non-diacritical forms. These numbers underscore why blanket diacritic removal can be problematic for multilingual NLP applications.

Whitespace Normalization

Whitespace seems simple, but Unicode defines many whitespace characters beyond the familiar space and tab.

In[23]:
Code
# Unicode whitespace characters
whitespace_chars = [
    ("\u0020", "Space"),
    ("\u00a0", "No-Break Space"),
    ("\u2002", "En Space"),
    ("\u2003", "Em Space"),
    ("\u2009", "Thin Space"),
    ("\u200b", "Zero Width Space"),
    ("\u3000", "Ideographic Space"),
    ("\t", "Tab"),
    ("\n", "Newline"),
    ("\r", "Carriage Return"),
]

# Check which are detected by str.isspace()
space_check = [(char, name, char.isspace()) for char, name in whitespace_chars]
Out[24]:
Console
Unicode Whitespace Characters:
-------------------------------------------------------
Char     Code       Name                      isspace() 
-------------------------------------------------------
         U+0020     Space                     True
\xa0     U+00A0     No-Break Space            True
\u2002   U+2002     En Space                  True
\u2003   U+2003     Em Space                  True
\u2009   U+2009     Thin Space                True
\u200b   U+200B     Zero Width Space          False
\u3000   U+3000     Ideographic Space         True
\t       U+0009     Tab                       True
\n       U+000A     Newline                   True
\r       U+000D     Carriage Return           True

Notice that the zero-width space (U+200B) is not considered whitespace by Python's isspace(). These invisible characters can cause subtle bugs.

The following table shows the UTF-8 byte sizes and isspace() behavior for common Unicode whitespace characters:

Unicode whitespace characters with UTF-8 byte sizes and Pythonisspace() detection.
CharacterCode PointUTF-8 Bytesisspace()
SpaceU+00201
TabU+00091
NewlineU+000A1
Carriage ReturnU+000D1
No-Break SpaceU+00A02
En SpaceU+20023
Em SpaceU+20033
Thin SpaceU+20093
Zero Width SpaceU+200B3
Ideographic SpaceU+30003

The byte size variation has practical implications. A document using ideographic spaces (common in CJK text) will be larger than one using standard ASCII spaces. Zero-width characters, despite being invisible, still consume 3 bytes each in UTF-8, and they can accumulate when copying text from web pages or PDFs. Note that zero-width space is not detected by isspace(), which can cause subtle matching bugs.

Normalizing Whitespace

A robust whitespace normalizer should:

  1. Convert all whitespace variants to standard spaces
  2. Collapse multiple spaces into one
  3. Strip leading and trailing whitespace
  4. Optionally handle zero-width characters
In[25]:
Code
import re


def normalize_whitespace(text, collapse=True, strip=True):
    """Normalize various whitespace characters to standard spaces."""
    # Unicode whitespace pattern (broader than \s)
    # Includes all Unicode Zs (space separator) category
    whitespace_pattern = r"[\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]"

    # Replace all whitespace variants with standard space
    text = re.sub(whitespace_pattern, " ", text)

    # Handle zero-width characters
    text = re.sub(r"[\u200B\u200C\u200D\uFEFF]", "", text)

    # Normalize line endings
    text = re.sub(r"\r\n|\r", "\n", text)

    if collapse:
        # Collapse multiple spaces to single space
        text = re.sub(r" +", " ", text)
        # Collapse multiple newlines to double newline (paragraph break)
        text = re.sub(r"\n{3,}", "\n\n", text)

    if strip:
        text = text.strip()

    return text


# Test with messy whitespace
messy_text = "Hello\u00a0\u00a0World\u200b!\u3000\u3000Test"
cleaned = normalize_whitespace(messy_text)
Out[26]:
Console
Whitespace Normalization:
  Original: 'Hello\xa0\xa0World\u200b!\u3000\u3000Test'
  Length: 20

  Cleaned: 'Hello World! Test'
  Length: 17

The normalizer reduced the string from 18 characters to 17 by converting the various Unicode spaces (no-break space, ideographic space) to standard spaces, removing the zero-width space entirely, and collapsing consecutive spaces into single spaces. This produces consistent, predictable whitespace that won't cause matching failures.

Ligature Expansion

Ligatures are single characters that represent multiple letters joined together. They're common in typeset text and can cause matching problems.

In[27]:
Code
# Common ligatures
ligatures = [
    ("fi", "fi", "Latin small ligature fi"),
    ("fl", "fl", "Latin small ligature fl"),
    ("ff", "ff", "Latin small ligature ff"),
    ("ffi", "ffi", "Latin small ligature ffi"),
    ("ffl", "ffl", "Latin small ligature ffl"),
    ("Ꜳ", "AA", "Latin capital letter AA"),
    ("œ", "oe", "Latin small letter oe"),
    ("æ", "ae", "Latin small letter ae"),
]

# NFKC expands most ligatures
expanded = [
    (lig, exp, unicodedata.normalize("NFKC", lig)) for lig, exp, _ in ligatures
]
Out[28]:
Console
Ligature Expansion with NFKC:
---------------------------------------------
Ligature     Expected     NFKC Result 
---------------------------------------------
'fi'          'fi'         'fi' ✓
'fl'          'fl'         'fl' ✓
'ff'          'ff'         'ff' ✓
'ffi'          'ffi'         'ffi' ✓
'ffl'          'ffl'         'ffl' ✓
'Ꜳ'          'AA'         'Ꜳ' ≠
'œ'          'oe'         'œ' ≠
'æ'          'ae'         'æ' ≠

NFKC handles most Latin ligatures correctly. However, some characters like "æ" and "œ" are considered distinct letters in some languages (Danish, French) rather than ligatures, so NFKC preserves them.

The following table shows common ligatures, their Unicode code points, and their expanded forms. Note the large code point gap between ligatures in the Alphabetic Presentation Forms block (U+FB00-FB4F) and their ASCII expansions (U+0000-007F). This gap explains why naive string comparison fails without normalization:

Common ligatures and their NFKC expansions. Typographic ligatures expand; linguistic letters are preserved. | Æ | U+00C6 | Æ (preserved) | — | Latin-1 Supplement |
LigatureCode PointExpansionExpansion Code PointsUnicode Block
U+FB01fiU+0066 U+0069Alphabetic Presentation Forms
U+FB02flU+0066 U+006CAlphabetic Presentation Forms
U+FB00ffU+0066 U+0066Alphabetic Presentation Forms
U+FB03ffiU+0066 U+0066 U+0069Alphabetic Presentation Forms
U+FB04fflU+0066 U+0066 U+006CAlphabetic Presentation Forms
U+FB05stU+0073 U+0074Alphabetic Presentation Forms
œU+0153œ (preserved)Latin Extended-A
æU+00E6æ (preserved)Latin-1 Supplement
ŒU+0152Œ (preserved)Latin Extended-A

The Latin f-ligatures (fi, fl, ff, ffi, ffl, ſt) are expanded by NFKC because they're typographic variants. However, "æ" and "œ" are preserved because they function as distinct letters in languages like Danish, Norwegian, and French.

The code point data reveals why ligatures cause string matching problems. The "fi" ligature (U+FB01) and its expansion "fi" (U+0066, U+0069) are separated by over 64,000 code points in Unicode space. Characters like "æ" and "œ" sit in the Latin-1 Supplement block, much closer to their ASCII equivalents, reflecting their status as distinct letters rather than purely typographic ligatures. Without normalization, a search for "find" will never match "find" even though they're semantically identical.

Full-Width to Half-Width Conversion

East Asian text often uses full-width versions of ASCII characters. These take up the same width as CJK characters, creating visual alignment in mixed text.

In[29]:
Code
# Full-width ASCII characters
fullwidth_examples = [
    ("A", "A", "Full-width A"),
    ("a", "a", "Full-width a"),
    ("0", "0", "Full-width 0"),
    ("!", "!", "Full-width exclamation"),
    (" ", " ", "Ideographic space"),
]


# Full-width to half-width conversion
def fullwidth_to_halfwidth(text):
    """Convert full-width ASCII to half-width."""
    result = []
    for char in text:
        code = ord(char)
        # Full-width ASCII range: U+FF01 to U+FF5E maps to U+0021 to U+007E
        if 0xFF01 <= code <= 0xFF5E:
            result.append(chr(code - 0xFF01 + 0x21))
        # Ideographic space to regular space
        elif code == 0x3000:
            result.append(" ")
        else:
            result.append(char)
    return "".join(result)


# Test conversion
fullwidth_text = "Hello World! 123"
halfwidth_text = fullwidth_to_halfwidth(fullwidth_text)
Out[30]:
Console
Full-Width to Half-Width Conversion:
  Full-width: 'Hello World! 123'
  Half-width: 'Hello World! 123'

Character-by-character:
  'H' (U+FF28) → 'H' (U+0048)
  'e' (U+FF45) → 'e' (U+0065)
  'l' (U+FF4C) → 'l' (U+006C)
  'l' (U+FF4C) → 'l' (U+006C)
  'o' (U+FF4F) → 'o' (U+006F)

Each full-width character maps to its ASCII equivalent by subtracting a fixed offset (0xFEE0) from the code point. The ideographic space (U+3000) is a special case that maps to the regular space (U+0020). This conversion is essential when processing East Asian text that mixes CJK characters with Latin letters and digits.

NFKC normalization also handles full-width to half-width conversion:

In[31]:
Code
# NFKC also converts full-width characters
nfkc_result = unicodedata.normalize("NFKC", fullwidth_text)
manual_match = nfkc_result == halfwidth_text
Out[32]:
Console
NFKC vs manual conversion:
  NFKC result:   'Hello World! 123'
  Manual result: 'Hello World! 123'
  Match: True

The NFKC normalization produces identical results to the manual conversion function, confirming that NFKC handles full-width to half-width mapping as part of its compatibility normalization. This means you can use NFKC for comprehensive normalization without implementing character-specific conversion logic, simplifying your normalization pipeline.

Building a Normalization Pipeline

Real-world text normalization combines multiple techniques. The order of operations matters.

In[33]:
Code
import unicodedata
import re


class TextNormalizer:
    """A configurable text normalization pipeline."""

    def __init__(
        self,
        unicode_form="NFC",
        lowercase=False,
        casefold=False,
        strip_accents=False,
        normalize_whitespace=True,
        strip_control_chars=True,
    ):
        """
        Initialize the normalizer with configuration options.

        Parameters:
        - unicode_form: 'NFC', 'NFD', 'NFKC', 'NFKD', or None
        - lowercase: Apply str.lower()
        - casefold: Apply str.casefold() (overrides lowercase)
        - strip_accents: Remove diacritical marks
        - normalize_whitespace: Collapse and standardize whitespace
        - strip_control_chars: Remove control characters
        """
        self.unicode_form = unicode_form
        self.lowercase = lowercase
        self.casefold = casefold
        self.strip_accents = strip_accents
        self.normalize_whitespace = normalize_whitespace
        self.strip_control_chars = strip_control_chars

    def __call__(self, text):
        """Apply the normalization pipeline to text."""
        # Step 1: Unicode normalization (first pass)
        if self.unicode_form:
            text = unicodedata.normalize(self.unicode_form, text)

        # Step 2: Strip accents (requires NFD decomposition)
        if self.strip_accents:
            text = unicodedata.normalize("NFD", text)
            text = "".join(c for c in text if unicodedata.category(c) != "Mn")
            # Recompose after stripping
            if self.unicode_form in ("NFC", "NFKC"):
                text = unicodedata.normalize("NFC", text)

        # Step 3: Case normalization
        if self.casefold:
            text = text.casefold()
        elif self.lowercase:
            text = text.lower()

        # Step 4: Control character removal
        if self.strip_control_chars:
            # Remove C0, C1 controls except whitespace
            text = "".join(
                c
                for c in text
                if unicodedata.category(c) != "Cc" or c in "\t\n\r"
            )

        # Step 5: Whitespace normalization
        if self.normalize_whitespace:
            # Convert various spaces to regular space
            text = re.sub(r"[\u00A0\u2000-\u200A\u202F\u205F\u3000]", " ", text)
            # Remove zero-width characters
            text = re.sub(r"[\u200B-\u200D\uFEFF]", "", text)
            # Collapse multiple spaces
            text = re.sub(r" +", " ", text)
            # Normalize line endings and collapse multiple newlines
            text = re.sub(r"\r\n|\r", "\n", text)
            text = re.sub(r"\n{3,}", "\n\n", text)
            text = text.strip()

        return text


# Create different normalizers for different use cases
search_normalizer = TextNormalizer(
    unicode_form="NFKC",
    casefold=True,
    strip_accents=True,
    normalize_whitespace=True,
)

storage_normalizer = TextNormalizer(
    unicode_form="NFC", normalize_whitespace=True
)
Out[34]:
Console
Normalization Pipeline Comparison:
============================================================
Original: '  Héllo\xa0\xa0Wörld!  finance  '

Search normalizer (aggressive):
  Result: 'hello world! finance'

Storage normalizer (conservative):
  Result: 'Héllo Wörld! finance'

The two normalizers produce notably different outputs from the same input. The search normalizer aggressively transforms the text for maximum matching flexibility: it converts full-width characters to ASCII, strips accents, folds case, expands the ligature "fi" to "fi", and collapses all whitespace variants. The storage normalizer preserves the original character forms while only standardizing whitespace, maintaining the text's visual fidelity for display purposes.

Pipeline Order Matters

The order of normalization steps can affect results:

In[35]:
Code
# Demonstrate order dependency
text = "CAFÉ"

# Order 1: Lowercase then strip accents
order1 = text.lower()
order1 = "".join(
    c
    for c in unicodedata.normalize("NFD", order1)
    if unicodedata.category(c) != "Mn"
)

# Order 2: Strip accents then lowercase
order2 = "".join(
    c
    for c in unicodedata.normalize("NFD", text)
    if unicodedata.category(c) != "Mn"
)
order2 = order2.lower()
Out[36]:
Console
Order of Operations:
  Original: 'CAFÉ'

  Lowercase → Strip accents: 'cafe'
  Strip accents → Lowercase: 'cafe'

  Same result? True

In this case, the order doesn't matter. But with more complex transformations involving case-sensitive patterns or locale-specific rules, order can be significant. Always test your pipeline with representative data.

Practical Example: Deduplication

Let's apply normalization to a real task: finding duplicate entries in a dataset.

In[37]:
Code
# Simulated dataset with near-duplicate entries
company_names = [
    "Société Générale",
    "SOCIÉTÉ GÉNÉRALE",
    "Societe Generale",
    "Société  Générale",  # Extra space
    "Societe Generale",  # Full-width
    "Apple Inc.",
    "Apple Inc",
    "APPLE INC.",
    "apple inc",
    "Müller GmbH",
    "Mueller GmbH",
    "MÜLLER GMBH",
    "Muller GmbH",
]

# Create a normalizer for deduplication
dedup_normalizer = TextNormalizer(
    unicode_form="NFKC",
    casefold=True,
    strip_accents=True,
    normalize_whitespace=True,
)

# Group by normalized form
from collections import defaultdict

groups = defaultdict(list)
for name in company_names:
    normalized = dedup_normalizer(name)
    groups[normalized].append(name)
Out[38]:
Console
Duplicate Detection Results:
============================================================

Normalized form: 'societe generale'
  Matches (5):
    - Société Générale
    - SOCIÉTÉ GÉNÉRALE
    - Societe Generale
    - Société  Générale
    - Societe Generale

Normalized form: 'apple inc.'
  Matches (2):
    - Apple Inc.
    - APPLE INC.

Normalized form: 'apple inc'
  Matches (2):
    - Apple Inc
    - apple inc

Normalized form: 'muller gmbh'
  Matches (3):
    - Müller GmbH
    - MÜLLER GMBH
    - Muller GmbH

The normalizer correctly groups variations of "Société Générale" and "Apple Inc." together. It also groups "Müller" with "Mueller" since stripping accents converts "ü" to "u".

Out[39]:
Visualization
Bar chart showing decreasing number of unique strings as more normalization steps are applied, from 13 raw to 3 fully normalized.
Deduplication effectiveness with progressive normalization. Each bar shows how many unique strings remain after applying cumulative normalization steps. Starting with 13 raw entries, aggressive normalization reduces this to 5 unique canonical forms, identifying most near-duplicates (remaining differences are due to punctuation and ü/ue spelling variants).

The visualization shows how each normalization step progressively reduces the number of unique strings. Raw text shows 13 distinct entries, but after full normalization, only 5 unique strings remain. Each step contributes to duplicate detection: NFKC handles full-width characters, whitespace normalization catches extra spaces, case folding unifies capitalization variants, and accent stripping converts "ü" to "u". The remaining 5 strings differ due to punctuation ("Apple Inc." vs "Apple Inc") and German spelling conventions ("Mueller" as the ue-spelling vs "Müller" which becomes "Muller" after accent stripping).

The table below traces the transformation of "Société Générale" through a complete normalization pipeline:

Progressive normalization pipeline transforming "Société Générale" to a canonical form.
StageOutputEffect
Raw Input"Société Générale"Various encodings and representations
Unicode (NFKC)"Société Générale"Canonical form, ligatures expanded
Whitespace"Société Générale"Spaces collapsed, zero-width removed
Case Fold"société générale"Case-insensitive comparison ready
Strip Accents"societe generale"Accent-insensitive matching ready

Each stage addresses a specific type of variation. The final output is a canonical form suitable for search and comparison. All representations of this company name will normalize to the same string.

Limitations and Challenges

Text normalization is powerful but not perfect. Consider these limitations when designing your pipeline:

  • Information loss: Aggressive normalization destroys information. Stripping accents loses the distinction between "resume" (to continue) and "résumé" (CV). Case folding loses the distinction between proper nouns and common words.
  • Language specificity: No single normalization strategy works for all languages. Turkish case rules differ from English. Chinese has no case. Some scripts have no concept of accents.
  • Context dependence: The right normalization depends on your task. Search benefits from aggressive normalization. Machine translation needs to preserve source text exactly.
  • Irreversibility: Most normalization operations cannot be undone. Once you've stripped accents or folded case, the original information is gone.
  • Edge cases: Unicode is vast and complex. New characters are added regularly. Your normalization code may not handle every possible input correctly.

Key Functions and Parameters

When working with text normalization in Python, these are the essential functions and their most important parameters:

  • unicodedata.normalize(form, text): Applies Unicode normalization to a string. The form parameter specifies the normalization form: 'NFC' (canonical composition, default for storage), 'NFD' (canonical decomposition, useful for accent stripping), 'NFKC' (compatibility composition, aggressive, for search), or 'NFKD' (compatibility decomposition).

  • unicodedata.category(char): Returns a two-letter category code for a Unicode character. Common categories include 'Mn' (Mark, Nonspacing, for combining diacritics), 'Cc' (control characters), and 'Zs' (space separator). Useful for filtering specific character types during normalization.

  • str.casefold(): Returns a casefolded copy of the string for case-insensitive comparison. More aggressive than lower(), handles special cases like German "ß" → "ss". Preferred over lower() for Unicode-aware case-insensitive matching.

  • str.lower() vs str.casefold(): Use lower() for display (standard Unicode lowercasing) and casefold() for comparison (full Unicode case folding that handles language-specific mappings).

  • re.sub(pattern, replacement, text): Essential for whitespace normalization. Common patterns include r'[\u00A0\u2000-\u200A\u202F\u205F\u3000]' for various Unicode spaces, r'[\u200B-\u200D\uFEFF]' for zero-width characters, and r' +' for multiple consecutive spaces.

Summary

Text normalization transforms text into consistent, comparable forms. We covered:

  • Unicode normalization forms: NFC composes, NFD decomposes, NFKC and NFKD add compatibility mappings
  • Case folding: Use casefold() for case-insensitive comparison, not lower()
  • Diacritic handling: NFD decomposition plus filtering removes accents
  • Whitespace normalization: Unicode has many whitespace characters beyond space and tab
  • Ligature expansion: NFKC expands most typographic ligatures
  • Full-width conversion: NFKC converts full-width ASCII to standard ASCII

Key takeaways:

  • NFC is the default choice for general text storage
  • NFKC with casefold is best for search and comparison
  • Always normalize before comparing strings for equality
  • Normalization order matters: plan your pipeline carefully
  • Test with representative data: edge cases will surprise you
  • Preserve originals: keep unnormalized text when possible

In the next chapter, we'll explore tokenization, the process of breaking text into meaningful units for further processing.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about text normalization.

Loading component...

Reference

BIBTEXAcademic
@misc{textnormalizationunicodeformscasefoldingwhitespacehandlingfornlp, author = {Michael Brenndoerfer}, title = {Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP}, year = {2025}, url = {https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-01-01} }
APAAcademic
Michael Brenndoerfer (2025). Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP. Retrieved from https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp
MLAAcademic
Michael Brenndoerfer. "Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP." 2026. Web. today. <https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp>.
CHICAGOAcademic
Michael Brenndoerfer. "Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP." Accessed today. https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp.
HARVARDAcademic
Michael Brenndoerfer (2025) 'Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP'. Available at: https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp (Accessed: today).
SimpleBasic
Michael Brenndoerfer (2025). Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP. https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp