Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP

Michael Brenndoerfer

Data, Analytics & AI Language AI Handbook Machine Learning

Master text normalization techniques including Unicode NFC/NFD/NFKC/NFKD forms, case folding vs lowercasing, diacritic removal, and whitespace handling. Learn to build robust normalization pipelines for search and deduplication.

Reading Level

Choose your expertise level to adjust how many terms are explained. Beginners see more tooltips, experts see fewer to maintain reading flow. Hover over underlined terms for instant definitions.

Text NormalizationLink Copied

In the previous chapter, we saw how a single character like "é" can be represented in multiple ways: as a single precomposed code point (U+00E9) or as a base letter plus a combining accent (U+0065 + U+0301). Both look identical on screen, but Python considers them different strings. This seemingly minor issue can break string matching, corrupt search results, and introduce subtle bugs into your NLP pipelines.

Text normalization is the process of transforming text into a consistent, canonical form. It goes beyond encoding to address the fundamental question: when should two different byte sequences be considered the "same" text? This chapter covers Unicode normalization forms, case handling, whitespace cleanup, and building robust normalization pipelines.

Why Normalization MattersLink Copied

Consider a simple task: searching for the word "café" in a document. Without normalization, your search might miss matches because the document uses a different Unicode representation.

In[2]:

Code

# Two visually identical strings
cafe1 = "café"  # Precomposed: U+00E9
cafe2 = "cafe\u0301"  # Decomposed: e + combining acute

# Visual comparison
looks_same = cafe1 == cafe2

# Two visually identical strings
cafe1 = "café"  # Precomposed: U+00E9
cafe2 = "cafe\u0301"  # Decomposed: e + combining acute

# Visual comparison
looks_same = cafe1 == cafe2

Out[3]:

Console

Two ways to write 'café':
  cafe1 = 'café' (precomposed)
  cafe2 = 'café' (decomposed)

Look identical? Yes, they both display as: café
Are equal in Python? False

Code point breakdown:
  cafe1: ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
  cafe2: ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']

The strings are visually identical but computationally different. This creates problems across NLP:

Search: Users searching for "café" won't find documents containing the decomposed form
Deduplication: Duplicate detection fails when the same text uses different representations
Tokenization: Tokenizers may split decomposed characters incorrectly
Embeddings: Identical words may receive different vector representations

Unicode Normalization FormsLink Copied

The Unicode standard defines four normalization forms to address representation ambiguity. Each form serves different purposes.

Unicode Normalization

Unicode normalization transforms text into a canonical form where equivalent strings have identical code point sequences. The four forms (NFC, NFD, NFKC, NFKD) differ in whether they compose or decompose characters and whether they apply compatibility mappings.

NFC: Canonical CompositionLink Copied

NFC (Normalization Form Canonical Composition) converts text to its shortest representation by combining base characters with their accents into single precomposed characters where possible.

In[4]:

Code

import unicodedata

# Start with decomposed form
decomposed = "cafe\u0301"  # e + combining acute

# Normalize to NFC (composed)
nfc = unicodedata.normalize("NFC", decomposed)

# Compare with precomposed original
precomposed = "café"

import unicodedata

# Start with decomposed form
decomposed = "cafe\u0301"  # e + combining acute

# Normalize to NFC (composed)
nfc = unicodedata.normalize("NFC", decomposed)

# Compare with precomposed original
precomposed = "café"

Out[5]:

Console

NFC Normalization (Composition):
  Original (decomposed): 'café'
  Code points: ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']
  Length: 5

  After NFC: 'café'
  Code points: ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
  Length: 4

  Matches precomposed 'café'? True

NFC is the most commonly used normalization form. It produces the most compact representation and matches what most users expect when they type accented characters.

NFD: Canonical DecompositionLink Copied

NFD (Normalization Form Canonical Decomposition) does the opposite: it breaks precomposed characters into their base character plus combining marks.

In[6]:

Code

# Start with precomposed form
composed = "café"

# Normalize to NFD (decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Start with precomposed form
composed = "café"

# Normalize to NFD (decomposed)
nfd = unicodedata.normalize("NFD", composed)

Out[7]:

Console

NFD Normalization (Decomposition):
  Original (composed): 'café'
  Code points: ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
  Length: 4

  After NFD: 'café'
  Code points: ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']
  Length: 5

NFD is useful when you need to manipulate accents separately from base characters, such as removing diacritics or analyzing character components.

NFKC and NFKD: Compatibility NormalizationLink Copied

The "K" forms apply compatibility decomposition in addition to canonical normalization. This maps characters that are semantically equivalent but visually distinct.

Compatibility Equivalence

Compatibility equivalence groups characters that represent the same abstract character but differ in appearance or formatting. Examples include full-width vs. half-width characters, ligatures vs. separate letters, and superscripts vs. regular digits.

In[8]:

Code

# Characters with compatibility equivalents
test_cases = [
    ("ﬁ", "fi ligature"),
    ("①", "circled digit one"),
    ("Ⅳ", "roman numeral four"),
    ("ｈｅｌｌｏ", "full-width hello"),
    ("²", "superscript two"),
    ("㎞", "km symbol"),
]

# Apply NFKC normalization
nfkc_results = [
    (char, desc, unicodedata.normalize("NFKC", char))
    for char, desc in test_cases
]

# Characters with compatibility equivalents
test_cases = [
    ("ﬁ", "fi ligature"),
    ("①", "circled digit one"),
    ("Ⅳ", "roman numeral four"),
    ("ｈｅｌｌｏ", "full-width hello"),
    ("²", "superscript two"),
    ("㎞", "km symbol"),
]

# Apply NFKC normalization
nfkc_results = [
    (char, desc, unicodedata.normalize("NFKC", char))
    for char, desc in test_cases
]

Out[9]:

Console

NFKC Compatibility Normalization:
-------------------------------------------------------
Original     Description               NFKC Result 
-------------------------------------------------------
'ﬁ'        fi ligature               'fi'
'①'        circled digit one         '1'
'Ⅳ'        roman numeral four        'IV'
'ｈｅｌｌｏ'        full-width hello          'hello'
'²'        superscript two           '2'
'㎞'        km symbol                 'km'

NFKC is aggressive. It converts the "ﬁ" ligature to separate "f" and "i" characters, expands the circled digit to just "1", and converts full-width characters to their ASCII equivalents. This is useful for search and comparison but destroys formatting information.

The table below shows how each normalization form transforms different input characters. NFC and NFD are canonical forms that preserve character identity, while NFKC and NFKD apply compatibility mappings that may change the character representation.

Unicode normalization forms compared. Canonical forms (NFC, NFD) preserve character identity; compatibility forms (NFKC, NFKD) apply aggressive mappings.

Input	NFC	NFD	NFKC	NFKD
café (decomposed)	café	café	café	café
e + ́ (combining)	é	e + ́	é	e + ́
ﬁ (ligature)	ﬁ	ﬁ	fi	fi
① (circled)	①	①	1	1
ｈｉ (full-width)	ｈｉ	ｈｉ	hi	hi

The canonical forms (NFC, NFD) preserve ligatures and special characters, changing only the internal representation. The compatibility forms (NFKC, NFKD) aggressively normalize to base characters, expanding ligatures and converting full-width to half-width.

Choosing a Normalization FormLink Copied

The right form depends on your use case:

Normalization form recommendations by use case.

Use Case	Recommended Form	Reason
General text storage	NFC	Compact, preserves visual appearance
Accent-insensitive search	NFD then strip marks	Easy to remove combining characters
Full-text search	NFKC	Matches variant representations
Security (username comparison)	NFKC	Prevents homograph attacks
Preserving formatting	NFC	Keeps ligatures and special forms

In[10]:

Code

def compare_normalization_forms(text):
    """Compare all four normalization forms for a given text."""
    forms = ["NFC", "NFD", "NFKC", "NFKD"]
    results = {}
    for form in forms:
        normalized = unicodedata.normalize(form, text)
        results[form] = {
            "text": normalized,
            "length": len(normalized),
            "codepoints": [f"U+{ord(c):04X}" for c in normalized],
        }
    return results


# Test with a complex example
test_text = "ﬁnancial résumé ①"
comparison = compare_normalization_forms(test_text)

def compare_normalization_forms(text):
    """Compare all four normalization forms for a given text."""
    forms = ["NFC", "NFD", "NFKC", "NFKD"]
    results = {}
    for form in forms:
        normalized = unicodedata.normalize(form, text)
        results[form] = {
            "text": normalized,
            "length": len(normalized),
            "codepoints": [f"U+{ord(c):04X}" for c in normalized],
        }
    return results


# Test with a complex example
test_text = "ﬁnancial résumé ①"
comparison = compare_normalization_forms(test_text)

Out[11]:

Console

Normalizing: 'ﬁnancial résumé ①'
======================================================================

NFC:
  Result: 'ﬁnancial résumé ①'
  Length: 17

NFD:
  Result: 'ﬁnancial résumé ①'
  Length: 19

NFKC:
  Result: 'financial résumé 1'
  Length: 18

NFKD:
  Result: 'financial résumé 1'
  Length: 20

The length differences reveal how each form handles the input. NFD produces the longest output because it decomposes characters into base letters plus combining marks. NFC and NFKC produce shorter outputs by composing characters, with NFKC additionally expanding the ligature "ﬁ" into two separate characters.

Out[12]:

Visualization

Grouped bar chart comparing string lengths across NFC, NFD, NFKC, and NFKD normalization forms for various text samples. — String length comparison across Unicode normalization forms for different text samples. NFD consistently produces longer strings by decomposing characters, while NFC produces the most compact representation. NFKC and NFKD may increase or decrease length depending on whether compatibility mappings expand or simplify characters.

The chart reveals important patterns. Text with combining diacritics (café, naïve, Ångström) shows significant length increase under NFD decomposition. Full-width characters (Ｈｅｌｌｏ) and circled digits (①②③) shrink dramatically under NFKC/NFKD as they're mapped to their ASCII equivalents. The ligature "ﬁ" in "ﬁnance" expands from one character to two under compatibility normalization.

Case Folding vs. LowercasingLink Copied

Case-insensitive comparison seems simple: just convert both strings to lowercase. But Unicode makes this surprisingly complex.

The Problem with Simple LowercasingLink Copied

In[13]:

Code

# German sharp s (ß) uppercases to SS in standard Python
german_word = "straße"  # street
lowered = german_word.lower()
uppered = german_word.upper()
round_trip = uppered.lower()

# German sharp s (ß) uppercases to SS in standard Python
german_word = "straße"  # street
lowered = german_word.lower()
uppered = german_word.upper()
round_trip = uppered.lower()

Out[14]:

Console

Case conversion with German ß:
  Original:    'straße' (length 6)
  .lower():    'straße' (length 6)
  .upper():    'STRASSE' (length 7)
  Round-trip:  'strasse' (length 7)

  Original == round-trip? False

The German "ß" uppercases to "SS" (two characters), and lowercasing "SS" gives "ss", not "ß". Round-tripping through case conversion changes the string. This is not a bug; it reflects German orthographic rules where "ß" traditionally had no uppercase form. While Unicode 5.1 (2008) added the capital ẞ (U+1E9E), Python's upper() still converts to "SS" for compatibility with the traditional standard.

Case FoldingLink Copied

Case Folding

Case folding is a Unicode operation designed for case-insensitive comparison. Unlike simple lowercasing, case folding handles language-specific mappings and ensures that equivalent strings compare equal regardless of their original case.

Python's str.casefold() method implements Unicode case folding:

In[15]:

Code

# Compare lower() vs casefold()
words = ["Straße", "STRASSE", "straße", "strasse"]

lower_results = [w.lower() for w in words]
casefold_results = [w.casefold() for w in words]

# Compare lower() vs casefold()
words = ["Straße", "STRASSE", "straße", "strasse"]

lower_results = [w.lower() for w in words]
casefold_results = [w.casefold() for w in words]

Out[16]:

Console

Comparing lower() vs casefold():
--------------------------------------------------
Word         lower()      casefold()  
--------------------------------------------------
Straße       straße       strasse     
STRASSE      strasse      strasse     
straße       straße       strasse     
strasse      strasse      strasse     

Case-insensitive matches:
  Using lower():    2 distinct values
  Using casefold(): 1 distinct values

With casefold(), all four variations of "street" in German normalize to the same string, enabling correct case-insensitive comparison.

The following table shows how many distinct strings remain after applying lower() versus casefold() to groups of equivalent words. The ideal is 1 (all variants unified to a single canonical form):

Comparison oflower() vs casefold() for case-insensitive string matching across languages.

Word Group	Variants	`lower()` distinct	`casefold()` distinct
German "street"	Straße, STRASSE, straße, strasse, STRAßE	2	1
German "size"	Größe, GRÖSSE, größe, groesse, GROESSE	3	2
Greek sigma	σ, ς, Σ	2	1
Mixed case	Hello, HELLO, hello, HeLLo	1	1
Turkish I	Istanbul, ISTANBUL, istanbul, İstanbul	2	2

For German words with "ß", casefold() correctly unifies all variants to a single canonical form, while lower() leaves two distinct values. Greek sigma variants (σ, ς, Σ) are particularly interesting: casefold() maps them all to the same form, recognizing that they represent the same letter in different positions. Standard English words show identical behavior for both methods, confirming that casefold() is a superset of lower() functionality. Note that Turkish requires locale-aware handling for correct dotted/dotless I normalization, which neither method provides automatically.

Language-Specific Case RulesLink Copied

Some case conversions depend on language context:

In[17]:

Code

# Turkish dotted and dotless i
turkish_examples = [
    ("I", "English uppercase I"),
    ("i", "English lowercase i"),
    ("İ", "Turkish uppercase dotted I (U+0130)"),
    ("ı", "Turkish lowercase dotless i (U+0131)"),
]

# Standard Python case conversion (not locale-aware)
case_results = [
    (char, desc, char.lower(), char.upper(), char.casefold())
    for char, desc in turkish_examples
]

# Turkish dotted and dotless i
turkish_examples = [
    ("I", "English uppercase I"),
    ("i", "English lowercase i"),
    ("İ", "Turkish uppercase dotted I (U+0130)"),
    ("ı", "Turkish lowercase dotless i (U+0131)"),
]

# Standard Python case conversion (not locale-aware)
case_results = [
    (char, desc, char.lower(), char.upper(), char.casefold())
    for char, desc in turkish_examples
]

Out[18]:

Console

Turkish I variants and case conversion:
---------------------------------------------------------------------------
Char   Description                         lower    upper    casefold
---------------------------------------------------------------------------
'I'    English uppercase I                 'i'      'I'      'i'
'i'    English lowercase i                 'i'      'I'      'i'
'İ'    Turkish uppercase dotted I (U+0130) 'i̇'      'İ'      'i̇'
'ı'    Turkish lowercase dotless i (U+0131) 'ı'      'I'      'ı'

In Turkish, "I" lowercases to "ı" (dotless) and "i" uppercases to "İ" (dotted). Python's default case operations follow English rules, which can cause problems with Turkish text. For locale-aware case conversion, you need specialized libraries.

Accent and Diacritic HandlingLink Copied

Many NLP applications benefit from accent-insensitive matching. A user searching for "resume" should probably find "résumé".

Removing DiacriticsLink Copied

The standard approach uses NFD normalization followed by filtering:

In[19]:

Code

import unicodedata


def remove_diacritics(text):
    """Remove diacritical marks from text."""
    # Decompose into base characters and combining marks
    decomposed = unicodedata.normalize("NFD", text)

    # Filter out combining marks (category 'Mn' = Mark, Nonspacing)
    filtered = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

    # Recompose any remaining sequences
    return unicodedata.normalize("NFC", filtered)


# Test with various accented text
test_texts = [
    "résumé",
    "naïve",
    "Ñoño",
    "Zürich",
    "Ångström",
]

stripped = [(text, remove_diacritics(text)) for text in test_texts]

import unicodedata


def remove_diacritics(text):
    """Remove diacritical marks from text."""
    # Decompose into base characters and combining marks
    decomposed = unicodedata.normalize("NFD", text)

    # Filter out combining marks (category 'Mn' = Mark, Nonspacing)
    filtered = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

    # Recompose any remaining sequences
    return unicodedata.normalize("NFC", filtered)


# Test with various accented text
test_texts = [
    "résumé",
    "naïve",
    "Ñoño",
    "Zürich",
    "Ångström",
]

stripped = [(text, remove_diacritics(text)) for text in test_texts]

Out[20]:

Console

Diacritic Removal:
-----------------------------------
Original        Stripped       
-----------------------------------
résumé          resume         
naïve           naive          
Ñoño            Nono           
Zürich          Zurich         
Ångström        Angstrom

This technique decomposes accented characters into base letters plus combining marks, removes the marks, and recomposes. The result is plain ASCII-compatible text.

Preserving Semantic DistinctionsLink Copied

Be careful: removing diacritics can change meaning in some languages.

In[21]:

Code

# Diacritics that change meaning
semantic_examples = [
    ("père", "father (French)"),
    ("pêre", "would be meaningless"),
    ("año", "year (Spanish)"),
    ("ano", "anus (Spanish)"),
    ("für", "for (German)"),
    ("fur", "different word"),
]

# Diacritics that change meaning
semantic_examples = [
    ("père", "father (French)"),
    ("pêre", "would be meaningless"),
    ("año", "year (Spanish)"),
    ("ano", "anus (Spanish)"),
    ("für", "for (German)"),
    ("fur", "different word"),
]

Out[22]:

Console

When diacritics matter:
--------------------------------------------------
  père     → pere      (father (French))
  pêre     → pere      (would be meaningless)
  año      → ano       (year (Spanish))
  ano      → ano       (anus (Spanish))
  für      → fur       (for (German))
  fur      → fur       (different word)

For search applications, you might want to match both forms. For translation or language understanding, preserving diacritics is essential.

The collision rate when stripping diacritics varies dramatically by language:

Semantic collision rates when stripping diacritics, by language.

Language	Collision Rate	Collisions / Total Words	Example Collisions
Spanish	50%	10 / 20	año/ano (year/anus), sí/si (yes/if)
French	47%	7 / 15	où/ou (where/or), côte/cote (coast/quote)
German	50%	6 / 12	schön/schon (beautiful/already), drücken/drucken (press/print)
Portuguese	50%	6 / 12	pôde/pode (could/can), pôr/por (put/by)
English	50%	5 / 10	résumé/resume, café/cafe

Spanish and French use diacritics extensively to distinguish word meanings. German umlauts often differentiate semantically different words. English treats diacritics largely as optional styling for loanwords, but stripping them still causes collisions with the non-diacritical forms. These numbers underscore why blanket diacritic removal can be problematic for multilingual NLP applications.

Whitespace NormalizationLink Copied

Whitespace seems simple, but Unicode defines many whitespace characters beyond the familiar space and tab.

In[23]:

Code

# Unicode whitespace characters
whitespace_chars = [
    ("\u0020", "Space"),
    ("\u00a0", "No-Break Space"),
    ("\u2002", "En Space"),
    ("\u2003", "Em Space"),
    ("\u2009", "Thin Space"),
    ("\u200b", "Zero Width Space"),
    ("\u3000", "Ideographic Space"),
    ("\t", "Tab"),
    ("\n", "Newline"),
    ("\r", "Carriage Return"),
]

# Check which are detected by str.isspace()
space_check = [(char, name, char.isspace()) for char, name in whitespace_chars]

# Unicode whitespace characters
whitespace_chars = [
    ("\u0020", "Space"),
    ("\u00a0", "No-Break Space"),
    ("\u2002", "En Space"),
    ("\u2003", "Em Space"),
    ("\u2009", "Thin Space"),
    ("\u200b", "Zero Width Space"),
    ("\u3000", "Ideographic Space"),
    ("\t", "Tab"),
    ("\n", "Newline"),
    ("\r", "Carriage Return"),
]

# Check which are detected by str.isspace()
space_check = [(char, name, char.isspace()) for char, name in whitespace_chars]

Out[24]:

Console

Unicode Whitespace Characters:
-------------------------------------------------------
Char     Code       Name                      isspace() 
-------------------------------------------------------
         U+0020     Space                     True
\xa0     U+00A0     No-Break Space            True
\u2002   U+2002     En Space                  True
\u2003   U+2003     Em Space                  True
\u2009   U+2009     Thin Space                True
\u200b   U+200B     Zero Width Space          False
\u3000   U+3000     Ideographic Space         True
\t       U+0009     Tab                       True
\n       U+000A     Newline                   True
\r       U+000D     Carriage Return           True

Notice that the zero-width space (U+200B) is not considered whitespace by Python's isspace(). These invisible characters can cause subtle bugs.

The following table shows the UTF-8 byte sizes and isspace() behavior for common Unicode whitespace characters:

Unicode whitespace characters with UTF-8 byte sizes and Pythonisspace() detection.

Character	Code Point	UTF-8 Bytes	`isspace()`
Space	U+0020	1	✓
Tab	U+0009	1	✓
Newline	U+000A	1	✓
Carriage Return	U+000D	1	✓
No-Break Space	U+00A0	2	✓
En Space	U+2002	3	✓
Em Space	U+2003	3	✓
Thin Space	U+2009	3	✓
Zero Width Space	U+200B	3	✗
Ideographic Space	U+3000	3	✓

The byte size variation has practical implications. A document using ideographic spaces (common in CJK text) will be larger than one using standard ASCII spaces. Zero-width characters, despite being invisible, still consume 3 bytes each in UTF-8, and they can accumulate when copying text from web pages or PDFs. Note that zero-width space is not detected by isspace(), which can cause subtle matching bugs.

Normalizing WhitespaceLink Copied

A robust whitespace normalizer should:

Convert all whitespace variants to standard spaces
Collapse multiple spaces into one
Strip leading and trailing whitespace
Optionally handle zero-width characters

In[25]:

Code

import re


def normalize_whitespace(text, collapse=True, strip=True):
    """Normalize various whitespace characters to standard spaces."""
    # Unicode whitespace pattern (broader than \s)
    # Includes all Unicode Zs (space separator) category
    whitespace_pattern = r"[\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]"

    # Replace all whitespace variants with standard space
    text = re.sub(whitespace_pattern, " ", text)

    # Handle zero-width characters
    text = re.sub(r"[\u200B\u200C\u200D\uFEFF]", "", text)

    # Normalize line endings
    text = re.sub(r"\r\n|\r", "\n", text)

    if collapse:
        # Collapse multiple spaces to single space
        text = re.sub(r" +", " ", text)
        # Collapse multiple newlines to double newline (paragraph break)
        text = re.sub(r"\n{3,}", "\n\n", text)

    if strip:
        text = text.strip()

    return text


# Test with messy whitespace
messy_text = "Hello\u00a0\u00a0World\u200b!\u3000\u3000Test"
cleaned = normalize_whitespace(messy_text)

import re


def normalize_whitespace(text, collapse=True, strip=True):
    """Normalize various whitespace characters to standard spaces."""
    # Unicode whitespace pattern (broader than \s)
    # Includes all Unicode Zs (space separator) category
    whitespace_pattern = r"[\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]"

    # Replace all whitespace variants with standard space
    text = re.sub(whitespace_pattern, " ", text)

    # Handle zero-width characters
    text = re.sub(r"[\u200B\u200C\u200D\uFEFF]", "", text)

    # Normalize line endings
    text = re.sub(r"\r\n|\r", "\n", text)

    if collapse:
        # Collapse multiple spaces to single space
        text = re.sub(r" +", " ", text)
        # Collapse multiple newlines to double newline (paragraph break)
        text = re.sub(r"\n{3,}", "\n\n", text)

    if strip:
        text = text.strip()

    return text


# Test with messy whitespace
messy_text = "Hello\u00a0\u00a0World\u200b!\u3000\u3000Test"
cleaned = normalize_whitespace(messy_text)

Out[26]:

Console

Whitespace Normalization:
  Original: 'Hello\xa0\xa0World\u200b!\u3000\u3000Test'
  Length: 20

  Cleaned: 'Hello World! Test'
  Length: 17

The normalizer reduced the string from 18 characters to 17 by converting the various Unicode spaces (no-break space, ideographic space) to standard spaces, removing the zero-width space entirely, and collapsing consecutive spaces into single spaces. This produces consistent, predictable whitespace that won't cause matching failures.

Ligature ExpansionLink Copied

Ligatures are single characters that represent multiple letters joined together. They're common in typeset text and can cause matching problems.

In[27]:

Code

# Common ligatures
ligatures = [
    ("ﬁ", "fi", "Latin small ligature fi"),
    ("ﬂ", "fl", "Latin small ligature fl"),
    ("ﬀ", "ff", "Latin small ligature ff"),
    ("ﬃ", "ffi", "Latin small ligature ffi"),
    ("ﬄ", "ffl", "Latin small ligature ffl"),
    ("Ꜳ", "AA", "Latin capital letter AA"),
    ("œ", "oe", "Latin small letter oe"),
    ("æ", "ae", "Latin small letter ae"),
]

# NFKC expands most ligatures
expanded = [
    (lig, exp, unicodedata.normalize("NFKC", lig)) for lig, exp, _ in ligatures
]

# Common ligatures
ligatures = [
    ("ﬁ", "fi", "Latin small ligature fi"),
    ("ﬂ", "fl", "Latin small ligature fl"),
    ("ﬀ", "ff", "Latin small ligature ff"),
    ("ﬃ", "ffi", "Latin small ligature ffi"),
    ("ﬄ", "ffl", "Latin small ligature ffl"),
    ("Ꜳ", "AA", "Latin capital letter AA"),
    ("œ", "oe", "Latin small letter oe"),
    ("æ", "ae", "Latin small letter ae"),
]

# NFKC expands most ligatures
expanded = [
    (lig, exp, unicodedata.normalize("NFKC", lig)) for lig, exp, _ in ligatures
]

Out[28]:

Console

Ligature Expansion with NFKC:
---------------------------------------------
Ligature     Expected     NFKC Result 
---------------------------------------------
'ﬁ'          'fi'         'fi' ✓
'ﬂ'          'fl'         'fl' ✓
'ﬀ'          'ff'         'ff' ✓
'ﬃ'          'ffi'         'ffi' ✓
'ﬄ'          'ffl'         'ffl' ✓
'Ꜳ'          'AA'         'Ꜳ' ≠
'œ'          'oe'         'œ' ≠
'æ'          'ae'         'æ' ≠

NFKC handles most Latin ligatures correctly. However, some characters like "æ" and "œ" are considered distinct letters in some languages (Danish, French) rather than ligatures, so NFKC preserves them.

The following table shows common ligatures, their Unicode code points, and their expanded forms. Note the large code point gap between ligatures in the Alphabetic Presentation Forms block (U+FB00-FB4F) and their ASCII expansions (U+0000-007F). This gap explains why naive string comparison fails without normalization:

Common ligatures and their NFKC expansions. Typographic ligatures expand; linguistic letters are preserved. | Æ | U+00C6 | Æ (preserved) | — | Latin-1 Supplement |

Ligature	Code Point	Expansion	Expansion Code Points	Unicode Block
ﬁ	U+FB01	fi	U+0066 U+0069	Alphabetic Presentation Forms
ﬂ	U+FB02	fl	U+0066 U+006C	Alphabetic Presentation Forms
ﬀ	U+FB00	ff	U+0066 U+0066	Alphabetic Presentation Forms
ﬃ	U+FB03	ffi	U+0066 U+0066 U+0069	Alphabetic Presentation Forms
ﬄ	U+FB04	ffl	U+0066 U+0066 U+006C	Alphabetic Presentation Forms
ﬅ	U+FB05	st	U+0073 U+0074	Alphabetic Presentation Forms
œ	U+0153	œ (preserved)	—	Latin Extended-A
æ	U+00E6	æ (preserved)	—	Latin-1 Supplement
Œ	U+0152	Œ (preserved)	—	Latin Extended-A

The Latin f-ligatures (ﬁ, ﬂ, ﬀ, ﬃ, ﬄ, ﬅ) are expanded by NFKC because they're typographic variants. However, "æ" and "œ" are preserved because they function as distinct letters in languages like Danish, Norwegian, and French.

The code point data reveals why ligatures cause string matching problems. The "ﬁ" ligature (U+FB01) and its expansion "fi" (U+0066, U+0069) are separated by over 64,000 code points in Unicode space. Characters like "æ" and "œ" sit in the Latin-1 Supplement block, much closer to their ASCII equivalents, reflecting their status as distinct letters rather than purely typographic ligatures. Without normalization, a search for "find" will never match "ﬁnd" even though they're semantically identical.

Full-Width to Half-Width ConversionLink Copied

East Asian text often uses full-width versions of ASCII characters. These take up the same width as CJK characters, creating visual alignment in mixed text.

In[29]:

Code

# Full-width ASCII characters
fullwidth_examples = [
    ("Ａ", "A", "Full-width A"),
    ("ａ", "a", "Full-width a"),
    ("０", "0", "Full-width 0"),
    ("！", "!", "Full-width exclamation"),
    ("　", " ", "Ideographic space"),
]


# Full-width to half-width conversion
def fullwidth_to_halfwidth(text):
    """Convert full-width ASCII to half-width."""
    result = []
    for char in text:
        code = ord(char)
        # Full-width ASCII range: U+FF01 to U+FF5E maps to U+0021 to U+007E
        if 0xFF01 <= code <= 0xFF5E:
            result.append(chr(code - 0xFF01 + 0x21))
        # Ideographic space to regular space
        elif code == 0x3000:
            result.append(" ")
        else:
            result.append(char)
    return "".join(result)


# Test conversion
fullwidth_text = "Ｈｅｌｌｏ　Ｗｏｒｌｄ！　１２３"
halfwidth_text = fullwidth_to_halfwidth(fullwidth_text)

# Full-width ASCII characters
fullwidth_examples = [
    ("Ａ", "A", "Full-width A"),
    ("ａ", "a", "Full-width a"),
    ("０", "0", "Full-width 0"),
    ("！", "!", "Full-width exclamation"),
    ("　", " ", "Ideographic space"),
]


# Full-width to half-width conversion
def fullwidth_to_halfwidth(text):
    """Convert full-width ASCII to half-width."""
    result = []
    for char in text:
        code = ord(char)
        # Full-width ASCII range: U+FF01 to U+FF5E maps to U+0021 to U+007E
        if 0xFF01 <= code <= 0xFF5E:
            result.append(chr(code - 0xFF01 + 0x21))
        # Ideographic space to regular space
        elif code == 0x3000:
            result.append(" ")
        else:
            result.append(char)
    return "".join(result)


# Test conversion
fullwidth_text = "Ｈｅｌｌｏ　Ｗｏｒｌｄ！　１２３"
halfwidth_text = fullwidth_to_halfwidth(fullwidth_text)

Out[30]:

Console

Full-Width to Half-Width Conversion:
  Full-width: 'Ｈｅｌｌｏ　Ｗｏｒｌｄ！　１２３'
  Half-width: 'Hello World! 123'

Character-by-character:
  'Ｈ' (U+FF28) → 'H' (U+0048)
  'ｅ' (U+FF45) → 'e' (U+0065)
  'ｌ' (U+FF4C) → 'l' (U+006C)
  'ｌ' (U+FF4C) → 'l' (U+006C)
  'ｏ' (U+FF4F) → 'o' (U+006F)

Each full-width character maps to its ASCII equivalent by subtracting a fixed offset (0xFEE0) from the code point. The ideographic space (U+3000) is a special case that maps to the regular space (U+0020). This conversion is essential when processing East Asian text that mixes CJK characters with Latin letters and digits.

NFKC normalization also handles full-width to half-width conversion:

In[31]:

Code

# NFKC also converts full-width characters
nfkc_result = unicodedata.normalize("NFKC", fullwidth_text)
manual_match = nfkc_result == halfwidth_text

# NFKC also converts full-width characters
nfkc_result = unicodedata.normalize("NFKC", fullwidth_text)
manual_match = nfkc_result == halfwidth_text

Out[32]:

Console

NFKC vs manual conversion:
  NFKC result:   'Hello World! 123'
  Manual result: 'Hello World! 123'
  Match: True

The NFKC normalization produces identical results to the manual conversion function, confirming that NFKC handles full-width to half-width mapping as part of its compatibility normalization. This means you can use NFKC for comprehensive normalization without implementing character-specific conversion logic, simplifying your normalization pipeline.

Building a Normalization PipelineLink Copied

Real-world text normalization combines multiple techniques. The order of operations matters.

In[33]:

Code

import unicodedata
import re


class TextNormalizer:
    """A configurable text normalization pipeline."""

    def __init__(
        self,
        unicode_form="NFC",
        lowercase=False,
        casefold=False,
        strip_accents=False,
        normalize_whitespace=True,
        strip_control_chars=True,
    ):
        """
        Initialize the normalizer with configuration options.

        Parameters:
        - unicode_form: 'NFC', 'NFD', 'NFKC', 'NFKD', or None
        - lowercase: Apply str.lower()
        - casefold: Apply str.casefold() (overrides lowercase)
        - strip_accents: Remove diacritical marks
        - normalize_whitespace: Collapse and standardize whitespace
        - strip_control_chars: Remove control characters
        """
        self.unicode_form = unicode_form
        self.lowercase = lowercase
        self.casefold = casefold
        self.strip_accents = strip_accents
        self.normalize_whitespace = normalize_whitespace
        self.strip_control_chars = strip_control_chars

    def __call__(self, text):
        """Apply the normalization pipeline to text."""
        # Step 1: Unicode normalization (first pass)
        if self.unicode_form:
            text = unicodedata.normalize(self.unicode_form, text)

        # Step 2: Strip accents (requires NFD decomposition)
        if self.strip_accents:
            text = unicodedata.normalize("NFD", text)
            text = "".join(c for c in text if unicodedata.category(c) != "Mn")
            # Recompose after stripping
            if self.unicode_form in ("NFC", "NFKC"):
                text = unicodedata.normalize("NFC", text)

        # Step 3: Case normalization
        if self.casefold:
            text = text.casefold()
        elif self.lowercase:
            text = text.lower()

        # Step 4: Control character removal
        if self.strip_control_chars:
            # Remove C0, C1 controls except whitespace
            text = "".join(
                c
                for c in text
                if unicodedata.category(c) != "Cc" or c in "\t\n\r"
            )

        # Step 5: Whitespace normalization
        if self.normalize_whitespace:
            # Convert various spaces to regular space
            text = re.sub(r"[\u00A0\u2000-\u200A\u202F\u205F\u3000]", " ", text)
            # Remove zero-width characters
            text = re.sub(r"[\u200B-\u200D\uFEFF]", "", text)
            # Collapse multiple spaces
            text = re.sub(r" +", " ", text)
            # Normalize line endings and collapse multiple newlines
            text = re.sub(r"\r\n|\r", "\n", text)
            text = re.sub(r"\n{3,}", "\n\n", text)
            text = text.strip()

        return text


# Create different normalizers for different use cases
search_normalizer = TextNormalizer(
    unicode_form="NFKC",
    casefold=True,
    strip_accents=True,
    normalize_whitespace=True,
)

storage_normalizer = TextNormalizer(
    unicode_form="NFC", normalize_whitespace=True
)

import unicodedata
import re


class TextNormalizer:
    """A configurable text normalization pipeline."""

    def __init__(
        self,
        unicode_form="NFC",
        lowercase=False,
        casefold=False,
        strip_accents=False,
        normalize_whitespace=True,
        strip_control_chars=True,
    ):
        """
        Initialize the normalizer with configuration options.

        Parameters:
        - unicode_form: 'NFC', 'NFD', 'NFKC', 'NFKD', or None
        - lowercase: Apply str.lower()
        - casefold: Apply str.casefold() (overrides lowercase)
        - strip_accents: Remove diacritical marks
        - normalize_whitespace: Collapse and standardize whitespace
        - strip_control_chars: Remove control characters
        """
        self.unicode_form = unicode_form
        self.lowercase = lowercase
        self.casefold = casefold
        self.strip_accents = strip_accents
        self.normalize_whitespace = normalize_whitespace
        self.strip_control_chars = strip_control_chars

    def __call__(self, text):
        """Apply the normalization pipeline to text."""
        # Step 1: Unicode normalization (first pass)
        if self.unicode_form:
            text = unicodedata.normalize(self.unicode_form, text)

        # Step 2: Strip accents (requires NFD decomposition)
        if self.strip_accents:
            text = unicodedata.normalize("NFD", text)
            text = "".join(c for c in text if unicodedata.category(c) != "Mn")
            # Recompose after stripping
            if self.unicode_form in ("NFC", "NFKC"):
                text = unicodedata.normalize("NFC", text)

        # Step 3: Case normalization
        if self.casefold:
            text = text.casefold()
        elif self.lowercase:
            text = text.lower()

        # Step 4: Control character removal
        if self.strip_control_chars:
            # Remove C0, C1 controls except whitespace
            text = "".join(
                c
                for c in text
                if unicodedata.category(c) != "Cc" or c in "\t\n\r"
            )

        # Step 5: Whitespace normalization
        if self.normalize_whitespace:
            # Convert various spaces to regular space
            text = re.sub(r"[\u00A0\u2000-\u200A\u202F\u205F\u3000]", " ", text)
            # Remove zero-width characters
            text = re.sub(r"[\u200B-\u200D\uFEFF]", "", text)
            # Collapse multiple spaces
            text = re.sub(r" +", " ", text)
            # Normalize line endings and collapse multiple newlines
            text = re.sub(r"\r\n|\r", "\n", text)
            text = re.sub(r"\n{3,}", "\n\n", text)
            text = text.strip()

        return text


# Create different normalizers for different use cases
search_normalizer = TextNormalizer(
    unicode_form="NFKC",
    casefold=True,
    strip_accents=True,
    normalize_whitespace=True,
)

storage_normalizer = TextNormalizer(
    unicode_form="NFC", normalize_whitespace=True
)

Out[34]:

Console

Normalization Pipeline Comparison:
============================================================
Original: '  Ｈéllo\xa0\xa0Wörld！  ﬁnance  '

Search normalizer (aggressive):
  Result: 'hello world! finance'

Storage normalizer (conservative):
  Result: 'Ｈéllo Wörld！ ﬁnance'

The two normalizers produce notably different outputs from the same input. The search normalizer aggressively transforms the text for maximum matching flexibility: it converts full-width characters to ASCII, strips accents, folds case, expands the ligature "ﬁ" to "fi", and collapses all whitespace variants. The storage normalizer preserves the original character forms while only standardizing whitespace, maintaining the text's visual fidelity for display purposes.

Pipeline Order MattersLink Copied

The order of normalization steps can affect results:

In[35]:

Code

# Demonstrate order dependency
text = "CAFÉ"

# Order 1: Lowercase then strip accents
order1 = text.lower()
order1 = "".join(
    c
    for c in unicodedata.normalize("NFD", order1)
    if unicodedata.category(c) != "Mn"
)

# Order 2: Strip accents then lowercase
order2 = "".join(
    c
    for c in unicodedata.normalize("NFD", text)
    if unicodedata.category(c) != "Mn"
)
order2 = order2.lower()

# Demonstrate order dependency
text = "CAFÉ"

# Order 1: Lowercase then strip accents
order1 = text.lower()
order1 = "".join(
    c
    for c in unicodedata.normalize("NFD", order1)
    if unicodedata.category(c) != "Mn"
)

# Order 2: Strip accents then lowercase
order2 = "".join(
    c
    for c in unicodedata.normalize("NFD", text)
    if unicodedata.category(c) != "Mn"
)
order2 = order2.lower()

Out[36]:

Console

Order of Operations:
  Original: 'CAFÉ'

  Lowercase → Strip accents: 'cafe'
  Strip accents → Lowercase: 'cafe'

  Same result? True

In this case, the order doesn't matter. But with more complex transformations involving case-sensitive patterns or locale-specific rules, order can be significant. Always test your pipeline with representative data.

Practical Example: DeduplicationLink Copied

Let's apply normalization to a real task: finding duplicate entries in a dataset.

In[37]:

Code

# Simulated dataset with near-duplicate entries
company_names = [
    "Société Générale",
    "SOCIÉTÉ GÉNÉRALE",
    "Societe Generale",
    "Société  Générale",  # Extra space
    "Ｓｏｃｉｅｔｅ　Ｇｅｎｅｒａｌｅ",  # Full-width
    "Apple Inc.",
    "Apple Inc",
    "APPLE INC.",
    "apple inc",
    "Müller GmbH",
    "Mueller GmbH",
    "MÜLLER GMBH",
    "Muller GmbH",
]

# Create a normalizer for deduplication
dedup_normalizer = TextNormalizer(
    unicode_form="NFKC",
    casefold=True,
    strip_accents=True,
    normalize_whitespace=True,
)

# Group by normalized form
from collections import defaultdict

groups = defaultdict(list)
for name in company_names:
    normalized = dedup_normalizer(name)
    groups[normalized].append(name)

# Simulated dataset with near-duplicate entries
company_names = [
    "Société Générale",
    "SOCIÉTÉ GÉNÉRALE",
    "Societe Generale",
    "Société  Générale",  # Extra space
    "Ｓｏｃｉｅｔｅ　Ｇｅｎｅｒａｌｅ",  # Full-width
    "Apple Inc.",
    "Apple Inc",
    "APPLE INC.",
    "apple inc",
    "Müller GmbH",
    "Mueller GmbH",
    "MÜLLER GMBH",
    "Muller GmbH",
]

# Create a normalizer for deduplication
dedup_normalizer = TextNormalizer(
    unicode_form="NFKC",
    casefold=True,
    strip_accents=True,
    normalize_whitespace=True,
)

# Group by normalized form
from collections import defaultdict

groups = defaultdict(list)
for name in company_names:
    normalized = dedup_normalizer(name)
    groups[normalized].append(name)

Out[38]:

Console

Duplicate Detection Results:
============================================================

Normalized form: 'societe generale'
  Matches (5):
    - Société Générale
    - SOCIÉTÉ GÉNÉRALE
    - Societe Generale
    - Société  Générale
    - Ｓｏｃｉｅｔｅ　Ｇｅｎｅｒａｌｅ

Normalized form: 'apple inc.'
  Matches (2):
    - Apple Inc.
    - APPLE INC.

Normalized form: 'apple inc'
  Matches (2):
    - Apple Inc
    - apple inc

Normalized form: 'muller gmbh'
  Matches (3):
    - Müller GmbH
    - MÜLLER GMBH
    - Muller GmbH

The normalizer correctly groups variations of "Société Générale" and "Apple Inc." together. It also groups "Müller" with "Mueller" since stripping accents converts "ü" to "u".

Out[39]:

Visualization

Bar chart showing decreasing number of unique strings as more normalization steps are applied, from 13 raw to 3 fully normalized. — Deduplication effectiveness with progressive normalization. Each bar shows how many unique strings remain after applying cumulative normalization steps. Starting with 13 raw entries, aggressive normalization reduces this to 5 unique canonical forms, identifying most near-duplicates (remaining differences are due to punctuation and ü/ue spelling variants).

The visualization shows how each normalization step progressively reduces the number of unique strings. Raw text shows 13 distinct entries, but after full normalization, only 5 unique strings remain. Each step contributes to duplicate detection: NFKC handles full-width characters, whitespace normalization catches extra spaces, case folding unifies capitalization variants, and accent stripping converts "ü" to "u". The remaining 5 strings differ due to punctuation ("Apple Inc." vs "Apple Inc") and German spelling conventions ("Mueller" as the ue-spelling vs "Müller" which becomes "Muller" after accent stripping).

The table below traces the transformation of "Société Générale" through a complete normalization pipeline:

Progressive normalization pipeline transforming "Société Générale" to a canonical form.

Stage	Output	Effect
Raw Input	"Société Générale"	Various encodings and representations
Unicode (NFKC)	"Société Générale"	Canonical form, ligatures expanded
Whitespace	"Société Générale"	Spaces collapsed, zero-width removed
Case Fold	"société générale"	Case-insensitive comparison ready
Strip Accents	"societe generale"	Accent-insensitive matching ready

Each stage addresses a specific type of variation. The final output is a canonical form suitable for search and comparison. All representations of this company name will normalize to the same string.

Limitations and ChallengesLink Copied

Text normalization is powerful but not perfect. Consider these limitations when designing your pipeline:

Information loss: Aggressive normalization destroys information. Stripping accents loses the distinction between "resume" (to continue) and "résumé" (CV). Case folding loses the distinction between proper nouns and common words.
Language specificity: No single normalization strategy works for all languages. Turkish case rules differ from English. Chinese has no case. Some scripts have no concept of accents.
Context dependence: The right normalization depends on your task. Search benefits from aggressive normalization. Machine translation needs to preserve source text exactly.
Irreversibility: Most normalization operations cannot be undone. Once you've stripped accents or folded case, the original information is gone.
Edge cases: Unicode is vast and complex. New characters are added regularly. Your normalization code may not handle every possible input correctly.

Key Functions and ParametersLink Copied

When working with text normalization in Python, these are the essential functions and their most important parameters:

unicodedata.normalize(form, text): Applies Unicode normalization to a string. The form parameter specifies the normalization form: 'NFC' (canonical composition, default for storage), 'NFD' (canonical decomposition, useful for accent stripping), 'NFKC' (compatibility composition, aggressive, for search), or 'NFKD' (compatibility decomposition).
unicodedata.category(char): Returns a two-letter category code for a Unicode character. Common categories include 'Mn' (Mark, Nonspacing, for combining diacritics), 'Cc' (control characters), and 'Zs' (space separator). Useful for filtering specific character types during normalization.
str.casefold(): Returns a casefolded copy of the string for case-insensitive comparison. More aggressive than lower(), handles special cases like German "ß" → "ss". Preferred over lower() for Unicode-aware case-insensitive matching.
str.lower() vs str.casefold(): Use lower() for display (standard Unicode lowercasing) and casefold() for comparison (full Unicode case folding that handles language-specific mappings).
re.sub(pattern, replacement, text): Essential for whitespace normalization. Common patterns include r'[\u00A0\u2000-\u200A\u202F\u205F\u3000]' for various Unicode spaces, r'[\u200B-\u200D\uFEFF]' for zero-width characters, and r' +' for multiple consecutive spaces.

SummaryLink Copied

Text normalization transforms text into consistent, comparable forms. We covered:

Unicode normalization forms: NFC composes, NFD decomposes, NFKC and NFKD add compatibility mappings
Case folding: Use casefold() for case-insensitive comparison, not lower()
Diacritic handling: NFD decomposition plus filtering removes accents
Whitespace normalization: Unicode has many whitespace characters beyond space and tab
Ligature expansion: NFKC expands most typographic ligatures
Full-width conversion: NFKC converts full-width ASCII to standard ASCII

Key takeaways:

NFC is the default choice for general text storage
NFKC with casefold is best for search and comparison
Always normalize before comparing strings for equality
Normalization order matters: plan your pipeline carefully
Test with representative data: edge cases will surprise you
Preserve originals: keep unnormalized text when possible

In the next chapter, we'll explore tokenization, the process of breaking text into meaningful units for further processing.

QuizLink Copied

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about text normalization.

Loading component...

Comments

Back to Language AI Handbook

Reference

BIBTEXAcademic

@misc{textnormalizationunicodeformscasefoldingwhitespacehandlingfornlp, author = {Michael Brenndoerfer}, title = {Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP}, year = {2025}, url = {https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-01-01} }

APAAcademic

Michael Brenndoerfer (2025). Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP. Retrieved from https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp

MLAAcademic

Michael Brenndoerfer. "Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP." 2026. Web. today. <https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp>.

CHICAGOAcademic

Michael Brenndoerfer. "Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP." Accessed today. https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp.

HARVARDAcademic

Michael Brenndoerfer (2025) 'Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP'. Available at: https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp (Accessed: today).

SimpleBasic

Michael Brenndoerfer (2025). Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP. https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp

Direct link:

https://mbrenndoerfer.com/writing/text-normalization-unicode-nlp

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.

View Full Resume Publications Contact Books

Text Normalization: Unicode Forms, Case Folding & Whitespace Handling for NLP

Text NormalizationLink Copied

Why Normalization MattersLink Copied

Unicode Normalization FormsLink Copied

NFC: Canonical CompositionLink Copied

NFD: Canonical DecompositionLink Copied

NFKC and NFKD: Compatibility NormalizationLink Copied

Choosing a Normalization FormLink Copied

Case Folding vs. LowercasingLink Copied

The Problem with Simple LowercasingLink Copied

Case FoldingLink Copied

Language-Specific Case RulesLink Copied

Accent and Diacritic HandlingLink Copied

Removing DiacriticsLink Copied

Preserving Semantic DistinctionsLink Copied

Whitespace NormalizationLink Copied

Normalizing WhitespaceLink Copied

Ligature ExpansionLink Copied

Full-Width to Half-Width ConversionLink Copied

Building a Normalization PipelineLink Copied

Pipeline Order MattersLink Copied

Practical Example: DeduplicationLink Copied

Limitations and ChallengesLink Copied

Key Functions and ParametersLink Copied

SummaryLink Copied

QuizLink Copied

Comments

Reference

About the author: Michael Brenndoerfer

Related Content

Sentence Segmentation: From Period Disambiguation to Punkt Algorithm Implementation

Word Tokenization: Breaking Text into Meaningful Units for NLP

Regular Expressions for NLP: Complete Guide to Pattern Matching in Python

Stay updated

Comments

About the author: Michael Brenndoerfer

Related Content

Sentence Segmentation: From Period Disambiguation to Punkt Algorithm Implementation

Word Tokenization: Breaking Text into Meaningful Units for NLP

Regular Expressions for NLP: Complete Guide to Pattern Matching in Python

Stay updated