Character Encoding: From ASCII to UTF-8 for NLP Practitioners

Michael Brenndoerfer · December 7, 2025

Master character encoding fundamentals including ASCII, Unicode, and UTF-8. Learn to detect, fix, and prevent encoding errors like mojibake in your NLP pipelines.

This article is part of the free-to-read Language AI Handbook.

Character Encoding

Before computers can process language, they must solve a fundamental problem: how do you represent human writing as numbers? Every letter, symbol, and emoji you see on screen is stored as a sequence of bytes. Character encoding is the system that maps between human-readable text and these numerical representations. Understanding encoding is essential for NLP practitioners because encoding errors corrupt your data silently, turning meaningful text into unintelligible garbage.

This chapter traces the evolution from ASCII's humble 7-bit origins through Unicode's ambitious goal of representing every writing system, and finally to UTF-8, the encoding that now dominates the web. You'll learn why encoding matters, how to detect and fix encoding problems, and how to handle text correctly in Python.

The Birth of ASCII

In the early days of computing, there was no universal standard for representing text. Different manufacturers used different codes, making data exchange between systems nearly impossible. In 1963, the American Standard Code for Information Interchange (ASCII) emerged as a solution to this chaos.

ASCII

ASCII (American Standard Code for Information Interchange) is a character encoding standard that uses 7 bits to represent 128 characters, including uppercase and lowercase English letters, digits, punctuation marks, and control characters.

ASCII uses 7 bits per character, allowing for 128 possible values (0 through 127). The designers made clever choices about how to organize these 128 slots:

  • Control characters (0-31, 127): Non-printable characters for device control, like newline, tab, and carriage return
  • Printable characters (32-126): Space, digits, punctuation, uppercase letters, and lowercase letters

Let's explore the ASCII table in Python:

In[2]:
# Examine the structure of ASCII
control_chars = list(range(0, 32)) + [127]
digits = list(range(48, 58))  # '0' to '9'
uppercase = list(range(65, 91))  # 'A' to 'Z'
lowercase = list(range(97, 123))  # 'a' to 'z'

# Show some examples
examples = [
    (65, 'A'),
    (97, 'a'),
    (48, '0'),
    (32, 'space'),
    (10, 'newline')
]
Out[3]:
ASCII Character Ranges:
  Control characters: 33 (codes 0-31 and 127)
  Digits: 10 (codes 48-57)
  Uppercase letters: 26 (codes 65-90)
  Lowercase letters: 26 (codes 97-122)

Key ASCII values:
   65 = A          → 'A'
   97 = a          → 'a'
   48 = 0          → '0'
   32 = space      → ' '
   10 = newline    → '\x0a'

The 33 control characters handle non-printable operations like line breaks and tabs. The remaining 95 printable characters cover everything needed for basic English text.

Out[4]:
Visualization
Heatmap showing the 128 ASCII characters arranged in an 8x16 grid with control characters highlighted in red.
The ASCII character table visualized as a heatmap. Each cell represents one of the 128 ASCII values (0-127), organized in an 8x16 grid. Control characters (codes 0-31 and 127) appear in red, while printable characters are shown in their respective cells. The structured layout reveals ASCII's elegant design: uppercase letters (A-Z) occupy codes 65-90, lowercase letters (a-z) occupy codes 97-122, and digits (0-9) occupy codes 48-57.

Notice something elegant: uppercase 'A' is 65, and lowercase 'a' is 97, exactly 32 positions apart. This wasn't accidental. The designers ensured that converting between cases requires only flipping a single bit (bit 5). This made case conversion trivially efficient on early hardware.

In[5]:
# The elegant relationship between upper and lower case
upper_A = ord('A')  # 65 in decimal
lower_a = ord('a')  # 97 in decimal
difference = lower_a - upper_A
Out[6]:
'A' = 65 = 0b1000001
'a' = 97 = 0b1100001
Difference: 32 (exactly 2^5 = 32)

Binary comparison:
  A: 01000001
  a: 01100001
      ↑
  Only bit 5 differs!

This bit-flipping trick extends to all 26 letters. To convert any uppercase letter to lowercase, you simply set bit 5 to 1 (add 32). To convert lowercase to uppercase, clear bit 5 (subtract 32).
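The trick is easy to sketch in Python (an illustrative helper, not part of the original listing): XOR with 0x20 toggles bit 5, so one function handles both directions.

```python
# Case conversion by flipping bit 5 (0x20); valid only for ASCII letters
def flip_case(ch):
    """Toggle bit 5 of an ASCII letter's code, swapping its case."""
    if not ch.isascii() or not ch.isalpha():
        return ch  # leave digits, punctuation, and non-ASCII untouched
    return chr(ord(ch) ^ 0x20)  # XOR toggles the bit in either direction

print(flip_case('A'))  # 'a'  (65 + 32 = 97)
print(flip_case('q'))  # 'Q'  (113 - 32 = 81)
print(flip_case('!'))  # '!'  (unchanged)
```

This is essentially what `str.upper()` and `str.lower()` reduce to for pure ASCII input, though the real methods handle the full Unicode case-mapping tables.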

The 7-Bit Limitation

ASCII's 7-bit design was both its strength and its fatal flaw. Using only 7 bits meant that ASCII fit comfortably within an 8-bit byte, leaving one bit free for error checking (parity) during transmission over unreliable communication lines. This was critical in an era of noisy telephone connections.

But 128 characters could only represent English. What about French accents? German umlauts? Greek letters? Russian Cyrillic? Chinese characters? ASCII had no answer.

The remaining 128 values in an 8-bit byte (128-255) became a battleground. Different regions created their own "extended ASCII" standards:

  • ISO-8859-1 (Latin-1): Western European languages
  • ISO-8859-5: Cyrillic alphabets
  • Windows-1252: Microsoft's variant of Latin-1
  • Shift JIS: Japanese
  • GB2312: Simplified Chinese

This fragmentation created a nightmare. A document written on a French computer might display as garbage on a Greek computer. The same byte sequence meant different things depending on which encoding you assumed.

In[7]:
# The same byte interpreted differently in different encodings
byte_value = 0xe9  # 233 in decimal

# In Latin-1, this is 'é'
latin1_char = bytes([byte_value]).decode('latin-1')

# In Windows-1252, also 'é' (compatible for this byte)
cp1252_char = bytes([byte_value]).decode('cp1252')

# But in ISO-8859-5 (Cyrillic), it's something else entirely
cyrillic_char = bytes([byte_value]).decode('iso-8859-5')
Out[8]:
Byte 0xe9 (233) decoded as:
  Latin-1 (Western European): 'é'
  Windows-1252:               'é'
  ISO-8859-5 (Cyrillic):      'щ'

Same bytes, completely different meanings!

This is why encoding matters for NLP. If you don't know the encoding of your text data, you might be training your model on corrupted garbage.

Unicode: One Code to Rule Them All

By the late 1980s, the encoding chaos had become untenable. Software companies were spending enormous effort handling multiple encodings, and data exchange remained problematic. The Unicode Consortium was incorporated in 1991, building on work that began in the late 1980s, with an ambitious goal: create a single character set that could represent every writing system ever used by humanity.

Unicode

Unicode is a universal character encoding standard that assigns a unique number (called a code point) to every character across all writing systems, symbols, and emoji. It currently defines over 149,000 characters covering 161 scripts.

Unicode assigns each character a unique code point, written as U+ followed by a hexadecimal number. For example:

  • U+0041 is 'A'
  • U+03B1 is 'α' (Greek alpha)
  • U+4E2D is '中' (Chinese character for "middle")
  • U+1F600 is '😀' (grinning face emoji)
In[9]:
# Exploring Unicode code points
characters = [
    ('A', 'Latin capital A'),
    ('é', 'Latin small e with acute'),
    ('α', 'Greek small alpha'),
    ('中', 'CJK character for middle'),
    ('😀', 'Grinning face emoji'),
    ('𝕳', 'Mathematical double-struck H'),
]

# Get code point information
code_points = [(char, name, ord(char)) for char, name in characters]
Out[10]:
Unicode Code Points:
------------------------------------------------------------
  A     U+0041  (     65)  Latin capital A
  é     U+00E9  (    233)  Latin small e with acute
  α     U+03B1  (    945)  Greek small alpha
  中     U+4E2D  ( 20,013)  CJK character for middle
  😀     U+1F600  (128,512)  Grinning face emoji
  𝕳     U+1D573  (120,179)  Mathematical double-struck H

The code points span a vast range. Basic Latin characters like 'A' occupy low values (under 128), while the emoji sits at over 128,000, far beyond what a single byte could represent.

Unicode Planes

Unicode organizes its vast character space into 17 planes, each containing 65,536 code points (2^16). The first plane is by far the most important:

  • Plane 0 (Basic Multilingual Plane, BMP): U+0000 to U+FFFF. Contains characters for almost all modern languages, common symbols, and punctuation.
  • Plane 1 (Supplementary Multilingual Plane): U+10000 to U+1FFFF. Historic scripts, musical notation, mathematical symbols, and emoji.
  • Plane 2 (Supplementary Ideographic Plane): U+20000 to U+2FFFF. Rare CJK characters.
  • Planes 3-13: Reserved for future use.
  • Planes 14-16: Special purpose and private use.
In[11]:
# Determine which plane a character belongs to
def get_plane(char):
    cp = ord(char)
    plane = cp >> 16  # Divide by 65536
    return plane

test_chars = ['A', 'é', '中', '😀', '𝕳', '🎵']
planes = [(char, get_plane(char), ord(char)) for char in test_chars]
Out[12]:
Character Planes:
  'A' (U+00041) → Plane 0: Basic Multilingual Plane (BMP)
  'é' (U+000E9) → Plane 0: Basic Multilingual Plane (BMP)
  '中' (U+04E2D) → Plane 0: Basic Multilingual Plane (BMP)
  '😀' (U+1F600) → Plane 1: Supplementary Multilingual Plane (SMP)
  '𝕳' (U+1D573) → Plane 1: Supplementary Multilingual Plane (SMP)
  '🎵' (U+1F3B5) → Plane 1: Supplementary Multilingual Plane (SMP)

For most NLP work, you'll primarily encounter characters in the BMP. However, emoji (increasingly common in social media text) and certain mathematical symbols live in Plane 1, so your code must handle characters beyond the BMP correctly.

Out[13]:
Visualization
Horizontal bar chart showing character counts per Unicode plane, with BMP containing the most characters.
Distribution of Unicode characters across the 17 planes. The Basic Multilingual Plane (BMP, Plane 0) contains the vast majority of commonly used characters, including all modern languages. Plane 1 houses emoji and historic scripts, while Plane 2 contains rare CJK ideographs. Planes 3-13 remain largely empty, reserved for future expansion.
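Characters beyond the BMP also change how encodings behave. Python 3 strings always count whole code points, but UTF-16 must split an astral character into a surrogate pair. A quick check (illustrative, not from the original notebook):

```python
emoji = '\U0001F600'  # 😀, Plane 1 (beyond the BMP)

# Python 3 str counts code points, so this astral character has length 1
print(len(emoji))                      # 1

# In UTF-16 it needs a surrogate pair: two 16-bit code units = 4 bytes
utf16 = emoji.encode('utf-16-le')
print(len(utf16))                      # 4
print(utf16.hex(' '))                  # 3d d8 00 de (little-endian pair D83D, DE00)
```

Languages whose strings are UTF-16 code units under the hood (Java, JavaScript) report a length of 2 for this character, a common source of off-by-one bugs when porting text-processing code.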

Code Points vs. Characters

A crucial distinction: Unicode code points don't always correspond one-to-one with what humans perceive as "characters." Some visual characters can be represented multiple ways:

In[14]:
# The letter 'é' can be represented two ways
# Method 1: Single precomposed character
e_acute_composed = '\u00e9'  # U+00E9: Latin small letter e with acute

# Method 2: Base character + combining mark
e_acute_decomposed = 'e\u0301'  # U+0065 (e) + U+0301 (combining acute accent)

# They look identical but are different byte sequences
are_equal = e_acute_composed == e_acute_decomposed
Out[15]:
Two ways to write 'é':
  Composed:   'é' = U+00E9 (1 code point)
  Decomposed: 'é' = U+0065 + U+0301 (2 code points)

Look identical? Yes
Are equal in Python? False

Length comparison:
  len(composed) = 1
  len(decomposed) = 2

Despite looking identical to human eyes, Python considers these two strings different. The composed form has length 1, while the decomposed form has length 2. This has serious implications for text processing. String comparison, length calculation, and search operations can give unexpected results. We'll address this with Unicode normalization in the next chapter.
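As a brief preview of that chapter, the standard library's unicodedata module can already collapse both forms to a common representation:

```python
import unicodedata

composed = '\u00e9'        # é as one precomposed code point
decomposed = 'e\u0301'     # e + combining acute accent (two code points)

# NFC normalization recomposes; afterwards the strings compare equal
nfc_composed = unicodedata.normalize('NFC', composed)
nfc_decomposed = unicodedata.normalize('NFC', decomposed)
print(nfc_composed == nfc_decomposed)  # True
print(len(nfc_decomposed))             # 1
```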

UTF-8: The Encoding That Won

Unicode defines what code points exist, but it doesn't specify how to store them as bytes. That's the job of Unicode Transformation Formats (UTFs). Several exist:

  • UTF-32: Uses exactly 4 bytes per character. Simple but wasteful.
  • UTF-16: Uses 2 or 4 bytes per character. Common in Windows and Java.
  • UTF-8: Uses 1 to 4 bytes per character. Dominant on the web.

UTF-8, invented by Ken Thompson and Rob Pike in 1992, has become the de facto standard for text on the internet. As of 2024, over 98% of websites use UTF-8.

Out[16]:
Visualization
Line chart showing UTF-8 web adoption percentage rising from about 50% in 2010 to over 98% in 2024.
The rise of UTF-8 encoding on the web from 2010 to 2024. UTF-8 adoption grew from approximately 50% in 2010 to over 98% by 2024, effectively becoming the universal standard for web content. This dramatic shift reflects UTF-8's advantages: ASCII compatibility, space efficiency, and no byte-order issues.
UTF-8

UTF-8 (Unicode Transformation Format, 8-bit) is a variable-width encoding that represents Unicode code points using one to four bytes. It is backward-compatible with ASCII, meaning any valid ASCII text is also valid UTF-8.

How UTF-8 Works

UTF-8's genius lies in its variable-width design. Common characters (ASCII) use just 1 byte, while rarer characters use more:

Code Point Range       Bytes   Bit Pattern
U+0000 to U+007F       1       0xxxxxxx
U+0080 to U+07FF       2       110xxxxx 10xxxxxx
U+0800 to U+FFFF       3       1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF    4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The leading bits tell you how many bytes the character uses:

  • If the first bit is 0, it's a 1-byte character (ASCII)
  • If the first bits are 110, it's a 2-byte character
  • If the first bits are 1110, it's a 3-byte character
  • If the first bits are 11110, it's a 4-byte character
  • Continuation bytes always start with 10

Let's see this encoding in action:

In[17]:
def show_utf8_encoding(char):
    """Display how a character is encoded in UTF-8."""
    code_point = ord(char)
    utf8_bytes = char.encode('utf-8')
    
    # Format bytes as binary
    binary = ' '.join(f'{b:08b}' for b in utf8_bytes)
    hex_repr = ' '.join(f'{b:02x}' for b in utf8_bytes)
    
    return {
        'char': char,
        'code_point': code_point,
        'num_bytes': len(utf8_bytes),
        'binary': binary,
        'hex': hex_repr
    }

# Test with characters from different ranges
test_chars = ['A', 'é', '中', '😀']
encodings = [show_utf8_encoding(c) for c in test_chars]
Out[18]:
UTF-8 Encoding Examples:
======================================================================

Character: 'A'
  Code point: U+0041 (65)
  UTF-8 bytes: 1
  Binary: 01000001
  Hex: 41

Character: 'é'
  Code point: U+00E9 (233)
  UTF-8 bytes: 2
  Binary: 11000011 10101001
  Hex: c3 a9

Character: '中'
  Code point: U+4E2D (20013)
  UTF-8 bytes: 3
  Binary: 11100100 10111000 10101101
  Hex: e4 b8 ad

Character: '😀'
  Code point: U+1F600 (128512)
  UTF-8 bytes: 4
  Binary: 11110000 10011111 10011000 10000000
  Hex: f0 9f 98 80

Look at the binary patterns. 'A' (code point 65) fits in 7 bits and uses a single byte starting with 0. The French 'é' needs 2 bytes, starting with 110. The Chinese character '中' needs 3 bytes, starting with 1110. And the emoji needs all 4 bytes, starting with 11110.

Out[19]:
Visualization
Visual diagram showing UTF-8 byte patterns for 1-byte through 4-byte encodings with bit positions highlighted.
UTF-8 encoding patterns visualized for characters requiring 1, 2, 3, and 4 bytes. Each row shows a character with its code point range and the corresponding byte structure. The leading bits (shown in darker shades) indicate the byte count, while 'x' positions hold the actual code point bits. Continuation bytes always begin with '10', enabling self-synchronization.

Why UTF-8 Won

UTF-8's dominance isn't accidental. It has several compelling advantages:

ASCII compatibility: Any ASCII text is valid UTF-8 without modification. This made adoption painless for the English-speaking computing world that had decades of ASCII data.

Self-synchronizing: You can jump into the middle of a UTF-8 stream and find character boundaries. Continuation bytes (starting with 10) are distinct from start bytes, so you can always resynchronize.
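Self-synchronization is simple to demonstrate: continuation bytes match the pattern 10xxxxxx (byte & 0xC0 == 0x80), so from any offset you can scan backward to the start of the current character. A minimal sketch:

```python
def char_start(data: bytes, pos: int) -> int:
    """Scan backward from pos to the first byte of the UTF-8 character containing it."""
    while pos > 0 and (data[pos] & 0xC0) == 0x80:  # skip continuation bytes
        pos -= 1
    return pos

data = '中文 text'.encode('utf-8')  # 中 = e4 b8 ad, 文 = e6 96 87
# Byte 4 (0x96) is a continuation byte inside 文; its character starts at byte 3
start = char_start(data, 4)
print(start)                                  # 3
print(data[start:start + 3].decode('utf-8'))  # 文
```

This property is what lets tools truncate, split, or seek within UTF-8 streams without corrupting characters, something impossible with many legacy multi-byte encodings like Shift JIS.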

No byte-order issues: Unlike UTF-16 and UTF-32, UTF-8 has no endianness problems. The same bytes mean the same thing on any system.

Efficiency for ASCII-heavy text: English text, code, markup, and many data formats are predominantly ASCII. UTF-8 represents these with no overhead.

In[20]:
# Compare encoding sizes for different types of text
texts = {
    'English': "Hello, World!",
    'French': "Héllo, Wörld!",
    'Chinese': "你好,世界!",
    'Mixed': "Hello 世界! 😀",
    'Code': "def hello(): return 42"
}

sizes = {}
for name, text in texts.items():
    utf8_size = len(text.encode('utf-8'))
    utf16_size = len(text.encode('utf-16-le'))  # Without BOM
    utf32_size = len(text.encode('utf-32-le'))  # Without BOM
    sizes[name] = (len(text), utf8_size, utf16_size, utf32_size)
Out[21]:
Encoding Size Comparison (bytes):
-----------------------------------------------------------------
Text Type    Chars    UTF-8      UTF-16     UTF-32    
-----------------------------------------------------------------
English      13       13         26         52        
French       13       15         26         52        
Chinese      6        18         12         24        
Mixed        11       18         24         44        
Code         22       22         44         88        

For English text and code, UTF-8 matches the character count exactly since ASCII characters use just 1 byte each. Chinese text requires 3 bytes per character in UTF-8, making it slightly larger than UTF-16, which uses 2 bytes for BMP characters including most CJK ideographs. UTF-32 consistently uses 4 bytes per character regardless of content, resulting in significant overhead for ASCII-heavy text.

Out[22]:
Visualization
Grouped bar chart comparing encoding sizes in bytes for English, French, Chinese, Mixed, and Code text across UTF-8, UTF-16, and UTF-32.
Byte size comparison across UTF-8, UTF-16, and UTF-32 encodings for different text types. UTF-8 excels with ASCII-heavy content (English, Code), matching the character count exactly. For CJK text, UTF-16 is slightly more compact. UTF-32's fixed 4-byte width creates consistent but significant overhead across all text types.

Even for Chinese text, UTF-8 is competitive with UTF-16. Only UTF-32 maintains constant character width, at the cost of 4x overhead for ASCII.

Byte Order Marks and Endianness

When using multi-byte encodings like UTF-16 or UTF-32, a question arises: which byte comes first? Consider the code point U+FEFF. In UTF-16, this could be stored as either:

  • FE FF (big-endian, most significant byte first)
  • FF FE (little-endian, least significant byte first)
Byte Order Mark (BOM)

A Byte Order Mark is a special Unicode character (U+FEFF) placed at the beginning of a text file to indicate the byte order (endianness) of the encoding. In UTF-8, it serves only as an encoding signature since UTF-8 has no endianness issues.

The BOM character (U+FEFF, "Zero Width No-Break Space") was repurposed to solve this ambiguity. By placing it at the start of a file, readers can determine the byte order:

In[23]:
# Different encodings and their BOMs
text = "Hello"

utf8_bom = text.encode('utf-8-sig')
utf16_le = text.encode('utf-16-le')
utf16_be = text.encode('utf-16-be')
utf16_with_bom = text.encode('utf-16')  # Includes BOM
Out[24]:
Encoding 'Hello' with different byte orders:

UTF-8 with BOM: ef bb bf 48 65 6c 6c 6f
  BOM bytes: ef bb bf (UTF-8 signature)

UTF-16-LE (no BOM): 48 00 65 00 6c 00 6c 00 6f 00
UTF-16-BE (no BOM): 00 48 00 65 00 6c 00 6c 00 6f
UTF-16 (with BOM):  ff fe 48 00 65 00 6c 00 6c 00 6f 00
  BOM bytes: ff fe (little-endian marker)

UTF-8 technically doesn't need a BOM since it has no byte-order ambiguity. However, Microsoft tools often add a UTF-8 BOM (EF BB BF) to indicate the file is UTF-8 rather than some other encoding. This can cause problems with Unix tools that don't expect it.

In[25]:
# The UTF-8 BOM can cause subtle bugs
utf8_with_bom = b'\xef\xbb\xbfHello'
utf8_no_bom = b'Hello'

# Decoding both
decoded_with_bom = utf8_with_bom.decode('utf-8')
decoded_no_bom = utf8_no_bom.decode('utf-8')

# Using utf-8-sig to handle BOM automatically
decoded_sig = utf8_with_bom.decode('utf-8-sig')
Out[26]:
BOM handling in Python:
  With BOM (utf-8):     'Hello' (length 6)
  Without BOM:          'Hello' (length 5)
  With BOM (utf-8-sig): 'Hello' (length 5)

The BOM appears as an invisible character at the start!
First char with BOM: U+FEFF (Zero Width No-Break Space)

When reading files of unknown origin, using utf-8-sig instead of utf-8 handles the BOM gracefully.

Encoding Detection

In an ideal world, all text would be clearly labeled with its encoding. In reality, you'll often encounter files with no encoding metadata. How do you figure out what encoding to use?

Heuristic DetectionLink Copied

Encoding detection relies on statistical patterns. Different encodings have characteristic byte sequences:

  • UTF-8: Has a specific pattern of continuation bytes
  • UTF-16: Often has many null bytes (00) for ASCII text
  • ISO-8859-1: Bytes 0x80-0x9F are control characters, rarely used
  • Windows-1252: Uses 0x80-0x9F for printable characters like curly quotes

The chardet library implements sophisticated heuristics:

In[27]:
import chardet

# Create test data in different encodings
test_texts = {
    'UTF-8': "Héllo, 世界! How are you?".encode('utf-8'),
    'Latin-1': "Héllo, café, naïve".encode('latin-1'),
    'Windows-1252': 'Hello “world” — fancy quotes'.encode('cp1252'),
    'Shift-JIS': "こんにちは世界".encode('shift-jis'),
}

# Detect encoding for each
detections = {name: chardet.detect(data) for name, data in test_texts.items()}
Out[28]:
Encoding Detection Results:
------------------------------------------------------------
UTF-8           → Detected: utf-8           (confidence: 88%)
Latin-1         → Detected: ISO-8859-1      (confidence: 73%)
Windows-1252    → Detected: Windows-1252    (confidence: 73%)
Shift-JIS       → Detected: MacCyrillic     (confidence: 29%)

The detector identifies UTF-8 correctly, though only with moderate confidence on such a short sample. Latin-1 and Windows-1252 show lower confidence because their byte patterns overlap heavily with each other and with related encodings. The short Shift-JIS sample is misidentified outright, at very low confidence. Detection isn't perfect: short texts provide few statistical clues, and some encodings are nearly indistinguishable for certain content. Always verify detected encodings when possible.

Common Detection Pitfalls

Some encoding pairs are particularly tricky to distinguish:

In[29]:
# UTF-8 vs Latin-1: Often confused for ASCII-heavy text
ascii_text = "Hello World 123"
utf8_bytes = ascii_text.encode('utf-8')
latin1_bytes = ascii_text.encode('latin-1')

# They're identical for ASCII!
are_same = utf8_bytes == latin1_bytes

# Detection on pure ASCII is ambiguous
detection = chardet.detect(ascii_text.encode('ascii'))
Out[30]:
ASCII text encoding ambiguity:
  UTF-8 bytes:   b'Hello World 123'
  Latin-1 bytes: b'Hello World 123'
  Identical? True

Detection result: {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

For pure ASCII, any ASCII-compatible encoding works!

When detection fails or is uncertain, domain knowledge helps. Web pages usually declare encoding in headers or meta tags. XML files often have encoding declarations. When all else fails, UTF-8 is the safest modern default.
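For HTML, a lightweight check for an in-document charset declaration before falling back to statistical detection might look like this (a sketch; the regex and helper name are illustrative, and production code should use a real HTML parser):

```python
import re

def sniff_html_charset(raw: bytes):
    """Look for a charset declaration in the first 1024 bytes of an HTML document."""
    head = raw[:1024]
    # Matches both <meta charset="..."> and the older http-equiv Content-Type form
    match = re.search(rb"charset\s*=\s*[\"']?([\w-]+)", head, re.IGNORECASE)
    return match.group(1).decode('ascii') if match else None

html = b'<html><head><meta charset="utf-8"></head><body>...</body></html>'
print(sniff_html_charset(html))  # utf-8
```

Only the document head is scanned because, by convention, encoding declarations must appear early, before any non-ASCII content could be misinterpreted.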

Mojibake: When Encoding Goes Wrong

Mojibake

Mojibake (from Japanese 文字化け, "character transformation") refers to garbled text that results from decoding bytes using the wrong character encoding. The term describes the visual appearance of incorrectly decoded text.

Mojibake is the bane of text processing. It occurs when bytes encoded in one system are decoded using a different, incompatible encoding. The result is nonsensical characters that often follow recognizable patterns.

In[31]:
# Common mojibake patterns
original = "Héllo Wörld"

# Encode as UTF-8, decode as Latin-1 (common web error)
utf8_as_latin1 = original.encode('utf-8').decode('latin-1')

# Encode as Latin-1, decode as UTF-8 (causes errors or replacement)
try:
    latin1_as_utf8 = original.encode('latin-1').decode('utf-8')
except UnicodeDecodeError as e:
    latin1_as_utf8 = f"Error: {e}"
Out[32]:
Mojibake Examples:
--------------------------------------------------
Original:                    'Héllo Wörld'
UTF-8 decoded as Latin-1:    'HÃ©llo WÃ¶rld'
Latin-1 decoded as UTF-8:    Error: 'utf-8' codec can't decode byte 0xe9 in position 1: invalid continuation byte

The first case is particularly insidious. UTF-8's multi-byte sequences are valid Latin-1 byte sequences, so no error occurs. Instead, you get garbage like "HÃ©llo WÃ¶rld" where each accented character becomes two or three strange characters.

Recognizing Mojibake Patterns

Different encoding mismatches produce characteristic garbage:

In[33]:
# Common mojibake signatures
test_chars = ['é', 'ñ', 'ü', '中', '—', '“', '”']

mojibake_examples = []
for char in test_chars:
    try:
        # UTF-8 interpreted as Latin-1
        utf8_bytes = char.encode('utf-8')
        as_latin1 = utf8_bytes.decode('latin-1')
        mojibake_examples.append((char, as_latin1, len(utf8_bytes)))
    except UnicodeDecodeError:
        mojibake_examples.append((char, 'Error', 0))
Out[34]:
UTF-8 Mojibake Patterns (decoded as Latin-1):
---------------------------------------------
Original   Mojibake        UTF-8 bytes
---------------------------------------------
'é'        'Ã©'            2
'ñ'        'Ã±'            2
'ü'        'Ã¼'            2
'中'        'ä¸\xad'        3
'—'        'â\x80\x94'     3
'“'        'â\x80\x9c'     3
'”'        'â\x80\x9d'     3

When you see patterns like "Ã©" where "é" should be, or "â€”" where "—" should be, you're almost certainly looking at UTF-8 text that was incorrectly decoded as Latin-1 or Windows-1252 (Windows-1252 maps bytes 0x80-0x9F to printable characters, which is why the em dash's middle byte shows up as "€").

Fixing Mojibake

Sometimes you can reverse mojibake by re-encoding with the wrong encoding and decoding with the right one:

In[35]:
# Attempt to fix mojibake
mojibake_text = "HÃ©llo WÃ¶rld"

# Reverse the damage: encode as Latin-1, decode as UTF-8
try:
    fixed = mojibake_text.encode('latin-1').decode('utf-8')
except (UnicodeDecodeError, UnicodeEncodeError):
    fixed = "Could not fix"
Out[36]:
Mojibake Repair:
  Corrupted: 'HÃ©llo WÃ¶rld'
  Fixed:     'Héllo Wörld'

This works because the corruption was reversible. However, some mojibake is destructive, especially when multiple encoding conversions have occurred or when the wrong encoding maps bytes to different code points.
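A small helper can apply this repair repeatedly, stopping once the text stabilizes or the round-trip fails (a sketch; the third-party ftfy library does this far more robustly, handling Windows-1252 and multiply-corrupted text):

```python
def undo_mojibake(text, max_rounds=3):
    """Repeatedly reverse Latin-1-as-UTF-8 mojibake until the text stabilizes."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not reversible this way; keep what we have
        if repaired == text:
            break  # round-trip is a no-op (e.g. pure ASCII): done
        text = repaired
    return text

print(undo_mojibake('HÃ©llo WÃ¶rld'))  # Héllo Wörld
print(undo_mojibake('plain ascii'))    # plain ascii (unchanged)
```

The bounded loop handles text that was corrupted more than once, which happens when mojibake is saved and then re-encoded through the same broken pipeline.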

Practical Encoding in Python

Python 3 made a fundamental change from Python 2: strings are Unicode by default. The str type holds Unicode code points, while bytes holds raw byte sequences. Converting between them requires explicit encoding and decoding.

Encoding and Decoding

In[37]:
# String (Unicode) to bytes (encoding)
text = "Hello, 世界!"
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16')

# Bytes to string (decoding)
decoded = utf8_bytes.decode('utf-8')

# Type checking
text_type = type(text)
bytes_type = type(utf8_bytes)
Out[38]:
Python String/Bytes Types:
  text = 'Hello, 世界!'
  type(text) = <class 'str'>

  utf8_bytes = b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'
  type(utf8_bytes) = <class 'bytes'>

  decoded = 'Hello, 世界!'
  text == decoded: True

Error Handling

What happens when encoding or decoding fails? Python offers several error handling strategies:

In[39]:
# Text that can't be encoded in ASCII
text = "Hello, 世界!"

# Different error handling strategies
strategies = {
    'strict': None,  # Raises exception
    'ignore': text.encode('ascii', errors='ignore'),
    'replace': text.encode('ascii', errors='replace'),
    'xmlcharrefreplace': text.encode('ascii', errors='xmlcharrefreplace'),
    'backslashreplace': text.encode('ascii', errors='backslashreplace'),
}

# Try strict mode
try:
    strategies['strict'] = text.encode('ascii', errors='strict')
except UnicodeEncodeError as e:
    strategies['strict'] = f"Error: {e}"
Out[40]:
Encoding 'Hello, 世界!' to ASCII with different error handlers:
------------------------------------------------------------
  strict              : Error: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
  ignore              : b'Hello, !'
  replace             : b'Hello, ??!'
  xmlcharrefreplace   : b'Hello, &#19990;&#30028;!'
  backslashreplace    : b'Hello, \\u4e16\\u754c!'

Each strategy handles unencodable characters differently. The strict mode raises an exception, forcing you to handle the problem explicitly. The ignore mode silently drops characters, which can corrupt your data. The replace mode substitutes question marks, making problems visible. The xmlcharrefreplace and backslashreplace modes preserve information in escaped form, useful for debugging or when round-tripping is needed.

For NLP work, errors='replace' or errors='ignore' can be useful when processing noisy data, but be aware that you're losing information. The surrogateescape error handler is particularly useful for round-tripping binary data that might contain encoding errors.
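Here is surrogateescape in action: invalid bytes are smuggled through decoding as lone surrogate code points and restored exactly on re-encoding.

```python
# Bytes that are not valid UTF-8 (0xe9 is a lone Latin-1 'é')
raw = b'caf\xe9 menu'

# strict decoding would raise; surrogateescape maps the bad byte to U+DCE9
text = raw.decode('utf-8', errors='surrogateescape')
print(repr(text))            # 'caf\udce9 menu'

# Re-encoding with the same handler restores the original bytes exactly
round_tripped = text.encode('utf-8', errors='surrogateescape')
print(round_tripped == raw)  # True
```

This is the mechanism Python itself uses for OS data like filenames that may not be valid UTF-8, so your pipeline can pass such data through without losing a single byte.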

Reading and Writing Files

Always specify encoding when opening text files:

In[41]:
import tempfile
import os

# Create a temporary file to demonstrate
temp_dir = tempfile.mkdtemp()
file_path = os.path.join(temp_dir, 'test.txt')

# Write with explicit encoding
text = "Héllo, 世界! 🌍"
with open(file_path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read with explicit encoding
with open(file_path, 'r', encoding='utf-8') as f:
    read_text = f.read()

# What happens without encoding specification?
# Python uses locale.getpreferredencoding(), which varies by system
import locale
default_encoding = locale.getpreferredencoding()

# Clean up
os.remove(file_path)
os.rmdir(temp_dir)
Out[42]:
File I/O with Encoding:
  Written: 'Héllo, 世界! 🌍'
  Read:    'Héllo, 世界! 🌍'
  Match:   True

System default encoding: UTF-8
Always specify encoding='utf-8' explicitly!

Processing Text Data

When building NLP pipelines, handle encoding at the boundaries:

In[43]:
def safe_read_text(file_path, encodings=('utf-8', 'utf-8-sig', 'latin-1', 'cp1252')):
    """
    Attempt to read a text file, trying multiple encodings.
    Returns (text, encoding_used) or raises an exception.
    """
    for encoding in encodings:
        try:
            with open(file_path, 'r', encoding=encoding) as f:
                text = f.read()
            return text, encoding
        except (UnicodeDecodeError, UnicodeError):
            continue
    
    # If all else fails, use chardet
    with open(file_path, 'rb') as f:
        raw = f.read()
    detected = chardet.detect(raw)
    return raw.decode(detected['encoding']), detected['encoding']

# Example: process a batch of files
def normalize_encoding(text):
    """Ensure text is properly normalized Unicode."""
    import unicodedata
    # Normalize to NFC (composed form)
    return unicodedata.normalize('NFC', text)
Out[44]:
Best practices for text processing:
  1. Always specify encoding when opening files
  2. Use UTF-8 as your default encoding
  3. Handle BOM with 'utf-8-sig' when reading unknown files
  4. Normalize Unicode after reading (covered in next chapter)
  5. Use chardet as a fallback for unknown encodings

Limitations and Challenges

Despite Unicode's success, character encoding still presents challenges:

Legacy data: Vast amounts of text exist in legacy encodings. Converting this data requires knowing the original encoding, which isn't always documented.

Encoding detection uncertainty: Automatic detection is probabilistic, not deterministic. Short texts or texts mixing languages can confuse detection algorithms.

Normalization complexity: The same visual character can have multiple Unicode representations. Without normalization, string comparison and searching become unreliable.

Emoji evolution: New emoji are added regularly, and older systems may not support them. An emoji that renders beautifully on one device might appear as a box or question mark on another.

Security concerns: Unicode includes many look-alike characters (homoglyphs) that can be exploited for phishing. The Latin 'a' (U+0061) looks identical to the Cyrillic 'а' (U+0430).
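The homoglyph problem is easy to demonstrate; the spoofed domain below is purely illustrative:

```python
import unicodedata

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, renders identically in most fonts

print(latin_a == cyrillic_a)         # False
print(unicodedata.name(latin_a))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))  # CYRILLIC SMALL LETTER A

# A URL spoofed with the Cyrillic letter is a different string entirely
spoofed = "p\u0430ypal.com"
print(spoofed == "paypal.com")  # False
```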

Impact on NLP

Character encoding is the foundation upon which all text processing rests. Getting it wrong corrupts your data before any analysis begins. Here's why it matters for NLP:

Data quality: Training data with encoding errors teaches models garbage. A language model trained on mojibake will reproduce mojibake.

Tokenization: Many tokenizers operate on bytes or byte-pairs. Understanding UTF-8 encoding helps you understand why tokenizers make certain decisions.

Multilingual models: Models that handle multiple languages must handle multiple scripts, which means handling Unicode correctly.

Text normalization: Before comparing or searching text, you need consistent Unicode normalization. Encoding is the prerequisite for normalization.

Reproducibility: Explicitly specifying encodings makes your code portable across systems with different default encodings.
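The tokenization point is worth seeing concretely: character count and UTF-8 byte count diverge as soon as text leaves ASCII, which is why byte-level tokenizers treat different scripts so differently.

```python
# Character count vs. UTF-8 byte count across scripts
samples = ["hello", "héllo", "日本語", "🙂"]
for s in samples:
    encoded = s.encode("utf-8")
    print(f"{s!r}: {len(s)} chars -> {len(encoded)} bytes")
# 'hello': 5 chars -> 5 bytes
# 'héllo': 5 chars -> 6 bytes
# '日本語': 3 chars -> 9 bytes
# '🙂': 1 chars -> 4 bytes
```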

Key Functions and Parameters

When working with character encoding in Python, these are the essential functions and their most important parameters:

str.encode(encoding, errors='strict')

  • encoding: The target encoding (e.g., 'utf-8', 'latin-1', 'ascii')
  • errors: How to handle unencodable characters. Options include 'strict' (raise exception), 'ignore' (drop characters), 'replace' (use ?), 'xmlcharrefreplace' (use XML entities), 'backslashreplace' (use Python escape sequences)

bytes.decode(encoding, errors='strict')

  • encoding: The source encoding to interpret the bytes
  • errors: How to handle undecodable bytes. Same options as encode(), plus 'surrogateescape' for round-tripping binary data

open(file, mode, encoding=None, errors=None)

  • encoding: Always specify explicitly for text mode ('r', 'w'). Use 'utf-8' as default, 'utf-8-sig' to handle BOM automatically
  • errors: Same options as encode()/decode()

chardet.detect(byte_string)

  • Returns a dictionary with 'encoding' (detected encoding name), 'confidence' (0.0 to 1.0), and 'language' (detected language if applicable)
  • Higher confidence values indicate more reliable detection
  • Short texts yield lower confidence; prefer explicit encoding when possible

Summary

Character encoding bridges the gap between human writing and computer storage. We traced the evolution from ASCII's 128 characters through the fragmented world of regional encodings to Unicode's universal character set and UTF-8's elegant variable-width encoding.

Key takeaways:

  • ASCII uses 7 bits for 128 characters, covering only English
  • Unicode assigns unique code points to over 149,000 characters across all writing systems
  • UTF-8 encodes Unicode using 1-4 bytes, with ASCII compatibility and no endianness issues
  • Mojibake results from decoding bytes with the wrong encoding
  • Always specify encoding when reading or writing text files in Python
  • Use UTF-8 as your default encoding for new projects
  • Encoding detection is heuristic and imperfect; verify when possible

In the next chapter, we'll build on this foundation to explore text normalization, addressing the challenge of multiple Unicode representations for the same visual character.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about character encoding.



About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
