Regular Expressions for NLP: Complete Guide to Pattern Matching in Python

Michael Brenndoerfer · Updated March 18, 2025 · 31 min read

Master regular expressions for text processing, covering metacharacters, quantifiers, lookarounds, and practical NLP patterns. Learn to extract emails, URLs, and dates while avoiding performance pitfalls.

Regular Expressions

Text data is messy. Emails hide in paragraphs, phone numbers appear in a dozen formats, and dates refuse to follow any single convention. Before you can extract meaning from text, you need to find patterns within it. Regular expressions give you a powerful, compact language for describing these patterns. A single regex can match thousands of variations of an email address, validate input formats, or extract structured data from unstructured text.

This chapter teaches you to read and write regular expressions fluently. You'll learn the syntax that makes regex both powerful and cryptic, understand when to use them versus simpler alternatives, and build practical patterns for common NLP tasks. By the end, you'll wield regex as a precision tool for text manipulation.

What Are Regular Expressions?

A regular expression (regex) is a sequence of characters that defines a search pattern. Think of it as a tiny programming language embedded within Python, specialized for matching and manipulating text.

Regular Expression

A regular expression is a formal language for describing patterns in strings. It uses special characters called metacharacters to represent classes of characters, repetition, position, and grouping, allowing a single pattern to match many different strings.

The power of regex comes from its expressiveness. The pattern \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b looks intimidating, but it matches most email addresses in a single line. Without regex, you'd need dozens of lines of conditional logic to achieve the same result.

Let's start with a simple example:

In[2]:
Code
import re

# A simple pattern: find the word "cat"
text = "The cat sat on the mat. The catalog was nearby."
pattern = r"cat"

# Find all matches
matches = re.findall(pattern, text)
Out[3]:
Console
Text: 'The cat sat on the mat. The catalog was nearby.'
Pattern: 'cat'
Matches: ['cat', 'cat']
Number of matches: 2

The pattern cat matches the literal characters c, a, t in sequence, and it found two matches. Let's verify their positions:

In[4]:
Code
# Find match positions
for match in re.finditer(pattern, text):
    start, end = match.span()
    context = text[max(0, start - 5) : end + 5]
Out[5]:
Console
Match positions:
  Position 4-7: 'cat' in '...The cat sat ...'
  Position 28-31: 'cat' in '... The catalog ...'

The regex found "cat" as a standalone word and "cat" inside "catalog". This illustrates a key point: by default, regex matches anywhere in the text, including inside other words. We'll learn how to match whole words only using word boundaries.

The re Module

Python's re module provides the interface for working with regular expressions. Before diving into pattern syntax, let's understand the main functions you'll use:

In[6]:
Code
import re

text = "Contact us at support@example.com or sales@example.com"

# re.search() - Find first match
first_match = re.search(r"\w+@\w+\.\w+", text)

# re.findall() - Find all matches, return list of strings
all_matches = re.findall(r"\w+@\w+\.\w+", text)

# re.finditer() - Find all matches, return iterator of match objects
match_objects = list(re.finditer(r"\w+@\w+\.\w+", text))

# re.sub() - Replace matches
replaced = re.sub(r"\w+@\w+\.\w+", "[EMAIL]", text)

# re.split() - Split by pattern
parts = re.split(r"\s+", "Hello   world  foo")
Out[7]:
Console
re.search() - First match:
  Match: 'support@example.com' at position (14, 33)

re.findall() - All matches as strings:
  ['support@example.com', 'sales@example.com']

re.finditer() - Match objects with metadata:
  'support@example.com' at (14, 33)
  'sales@example.com' at (37, 54)

re.sub() - Replace matches:
  'Contact us at [EMAIL] or [EMAIL]'

re.split() - Split by pattern:
  ['Hello', 'world', 'foo']

Each function serves a different purpose. Use search() when you only need the first match, findall() when you want a simple list of matched strings, finditer() when you need position information or groups, sub() for replacements, and split() to break text at pattern boundaries.

Raw Strings

Notice the r prefix before pattern strings: r'\w+@\w+'. This creates a raw string where backslashes are treated literally. Without it, Python interprets backslashes as escape sequences before the regex engine sees them.

In[8]:
Code
# Without raw string: \n becomes a newline character
regular_string = "line1\nline2"

# With raw string: \n stays as backslash-n
raw_string = r"line1\nline2"

# This matters for regex patterns
pattern_wrong = "\bword\b"  # \b becomes backspace character!
pattern_right = r"\bword\b"  # \b stays as word boundary
Out[9]:
Console
Regular string 'line1\nline2':
  'line1\nline2'
  Length: 11 characters

Raw string r'line1\nline2':
  'line1\\nline2'
  Length: 12 characters

Pattern comparison:
  Without r: '\x08word\x08' (backspace character!)
  With r:    '\\bword\\b' (word boundary)

Always use raw strings for regex patterns. It's a habit that will save you from subtle bugs.

Metacharacters: The Building Blocks

Regular expressions use special characters called metacharacters to represent patterns. These characters have meaning beyond their literal value. While literal characters like a or 5 match themselves, metacharacters like ., *, and [] define rules for what to match. Mastering these building blocks is the key to writing effective patterns.

The Dot: Match Any Character

The dot . matches any single character except a newline:

In[10]:
Code
text = "cat cot cut c@t c9t c\nt"

# . matches any single character
pattern = r"c.t"
matches = re.findall(pattern, text)
Out[11]:
Console
Text: 'cat cot cut c@t c9t c\nt'
Pattern: c.t (c, any character, t)
Matches: ['cat', 'cot', 'cut', 'c@t', 'c9t']

The dot matched 'a', 'o', 'u', '@', and '9', but not the newline. To match newlines too, use the re.DOTALL flag or the pattern [\s\S].
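
A quick sketch of both workarounds, reusing the same text:

Code
import re

text = "cat cot cut c@t c9t c\nt"

# re.DOTALL lets the dot match the newline as well
print(re.findall(r"c.t", text, re.DOTALL))  # ['cat', 'cot', 'cut', 'c@t', 'c9t', 'c\nt']

# [\s\S] matches any character without needing a flag
print(re.findall(r"c[\s\S]t", text))  # same result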

Character Classes: Matching Sets

Square brackets define a character class, matching any single character from the set:

In[12]:
Code
text = "The gray grey dog played in the fog"

# [ae] matches either 'a' or 'e'
pattern = r"gr[ae]y"
matches = re.findall(pattern, text)

# Ranges: [a-z] matches any lowercase letter
lowercase = re.findall(r"[a-z]+", "Hello World 123")

# Multiple ranges: [a-zA-Z0-9]
alphanumeric = re.findall(r"[a-zA-Z0-9]+", "user@example.com")
Out[13]:
Console
Character class [ae]:
  Pattern: gr[ae]y
  Matches: ['gray', 'grey']

Range [a-z]:
  Text: 'Hello World 123'
  Matches: ['ello', 'orld']

Combined ranges [a-zA-Z0-9]:
  Text: 'user@example.com'
  Matches: ['user', 'example', 'com']

Negated Character Classes

A caret ^ at the start of a character class negates it, matching any character NOT in the set:

In[14]:
Code
text = "abc123xyz"

# [^0-9] matches any non-digit
non_digits = re.findall(r"[^0-9]+", text)

# [^a-z] matches any non-lowercase letter
non_lower = re.findall(r"[^a-z]+", text)
Out[15]:
Console
Text: 'abc123xyz'
Non-digits [^0-9]: ['abc', 'xyz']
Non-lowercase [^a-z]: ['123']

Shorthand Character Classes

Regex provides convenient shortcuts for common character classes:

In[16]:
Code
text = "Call 555-1234 or email bob@mail.com on 2024-01-15"

# \d = digit [0-9]
digits = re.findall(r"\d+", text)

# \w = word character [a-zA-Z0-9_]
words = re.findall(r"\w+", text)

# \s = whitespace [ \t\n\r\f\v]
spaces = re.findall(r"\s+", text)

# Uppercase versions are negations
# \D = non-digit, \W = non-word, \S = non-whitespace
non_digits = re.findall(r"\D+", text)
Out[17]:
Console
Text: 'Call 555-1234 or email bob@mail.com on 2024-01-15'

Shorthand classes:
  \d+ (digits):     ['555', '1234', '2024', '01', '15']
  \w+ (word chars): ['Call', '555', '1234', 'or', 'email', 'bob', 'mail', 'com', 'on', '2024', '01', '15']
  \s+ (whitespace): [' ', ' ', ' ', ' ', ' ', ' ']
  \D+ (non-digit):  ['Call ', '-', ' or email bob@mail.com on ', '-', '-']

The following table summarizes the most common regex character classes:

Shorthand   Equivalent         Description
\d          [0-9]              Any digit
\D          [^0-9]             Any non-digit
\w          [a-zA-Z0-9_]       Word character
\W          [^a-zA-Z0-9_]      Non-word character
\s          [ \t\n\r\f\v]      Whitespace
\S          [^ \t\n\r\f\v]     Non-whitespace
.           (any except \n)    Any character

Note that uppercase versions match the complement (negation) of their lowercase counterparts.

Quantifiers: How Many Times?

Quantifiers specify how many times the preceding element should match. They range from simple repetition (*, +, ?) to precise bounds ({n}, {n,m}). Understanding quantifiers is essential because they determine whether your pattern matches once, multiple times, or not at all.

Basic Quantifiers

In[18]:
Code
text = "a aa aaa aaaa b bb bbb"

# * = zero or more
star = re.findall(r"ba*", "b ba baa baaa")

# + = one or more
plus = re.findall(r"ba+", "b ba baa baaa")

# ? = zero or one (optional)
optional = re.findall(r"colou?r", "color colour")

# {n} = exactly n times
exact = re.findall(r"a{3}", text)

# {n,m} = between n and m times
range_q = re.findall(r"a{2,3}", text)

# {n,} = n or more times
at_least = re.findall(r"a{2,}", text)
Out[19]:
Console
Quantifier examples:

* (zero or more):
  Pattern: ba*  Text: 'b ba baa baaa'
  Matches: ['b', 'ba', 'baa', 'baaa']

+ (one or more):
  Pattern: ba+  Text: 'b ba baa baaa'
  Matches: ['ba', 'baa', 'baaa']

? (zero or one):
  Pattern: colou?r  Text: 'color colour'
  Matches: ['color', 'colour']

{n} (exactly n):
  Pattern: a{3}  Text: 'a aa aaa aaaa b bb bbb'
  Matches: ['aaa', 'aaa']

{n,m} (between n and m):
  Pattern: a{2,3}  Text: 'a aa aaa aaaa b bb bbb'
  Matches: ['aa', 'aaa', 'aaa']
Out[20]:
Visualization
Bar chart comparing match counts for different regex quantifiers applied to the same text.
Comparison of quantifier behavior on the same input text. Each bar shows the number of matches found by different quantifier patterns. The * quantifier matches zero or more (including empty matches), + requires at least one, ? makes the preceding element optional, and {n,m} specifies exact repetition bounds.

The visualization shows how quantifiers dramatically affect matching behavior. Notice that ba* finds four matches because it accepts zero 'a's (matching the lone 'b'), while ba+ finds only three because it requires at least one 'a'. The bounded quantifiers {2}, {2,3}, and {2,} are more selective, matching only strings with specific repetition counts.

Greedy vs. Lazy Matching

By default, quantifiers are greedy: they match as much as possible. Adding ? after a quantifier makes it lazy, matching as little as possible.

In[21]:
Code
html = "<div>Hello</div><div>World</div>"

# Greedy: matches as much as possible
greedy = re.findall(r"<div>.*</div>", html)

# Lazy: matches as little as possible
lazy = re.findall(r"<div>.*?</div>", html)
Out[22]:
Console
HTML: '<div>Hello</div><div>World</div>'

Greedy (.*) - matches maximum:
  Pattern: <div>.*</div>
  Matches: ['<div>Hello</div><div>World</div>']

Lazy (.*?) - matches minimum:
  Pattern: <div>.*?</div>
  Matches: ['<div>Hello</div>', '<div>World</div>']

The greedy pattern matched from the first <div> all the way to the last </div>, consuming both tags. The lazy pattern stopped at the first </div> it found, giving us each tag separately. This distinction is critical when parsing structured text.

Out[23]:
Visualization
Diagram comparing greedy and lazy regex matching on HTML text, showing different match boundaries.
Greedy versus lazy quantifier behavior when matching HTML tags. The greedy pattern .* consumes as much text as possible, matching from the first opening tag to the last closing tag. The lazy pattern .*? stops at the first valid match, correctly identifying individual tag pairs.

Anchors: Position Matching

Anchors match positions in the string rather than characters. Unlike metacharacters that consume text, anchors assert that the current position in the string meets certain criteria. This makes them essential for matching patterns at specific locations, such as the beginning of a line or at word boundaries.

In[24]:
Code
text = "Hello World\nHello Python"

# ^ matches start of string (or line with MULTILINE)
start_matches = re.findall(r"^Hello", text, re.MULTILINE)

# $ matches end of string (or line with MULTILINE)
end_matches = re.findall(r"World$|Python$", text, re.MULTILINE)

# \b matches word boundary
word_boundary = re.findall(r"\bcat\b", "The cat sat on the catalog")

# \B matches non-word boundary
non_boundary = re.findall(r"\Bcat\B", "The cat sat on the catalog")
Out[25]:
Console
Text: 'Hello World\nHello Python'

^ (start of line with MULTILINE):
  Pattern: ^Hello
  Matches: ['Hello', 'Hello']

$ (end of line with MULTILINE):
  Pattern: World$|Python$
  Matches: ['World', 'Python']

\b (word boundary):
  Text: 'The cat sat on the catalog'
  Pattern: \bcat\b
  Matches: ['cat']

\B (non-word boundary):
  Pattern: \Bcat\B
  Matches: []

Word boundaries are essential for matching whole words. The \b anchor matches the position between a word character and a non-word character. In "catalog", there's no word boundary before or after "cat", so \bcat\b doesn't match it.
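
Boundaries also hold against punctuation, since punctuation characters are non-word characters. A quick sketch:

Code
import re

# \b matches wherever a word character meets a non-word character,
# including punctuation and the ends of the string
print(re.findall(r"\bcat\b", "cat, cat! concatenate"))  # ['cat', 'cat']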

Grouping and Capturing

Parentheses serve two purposes in regex: grouping elements together and capturing matched text for later use. Grouping lets you apply quantifiers to multi-character sequences or create alternations. Capturing stores the matched text so you can reference it later in the pattern or in replacement strings.

Basic Groups

In[26]:
Code
# Grouping for repetition
text = "abcabcabc"
pattern = r"(abc)+"
match = re.search(pattern, text)

# Grouping for alternation
colors = "The car is red, the bike is blue, the bus is green"
pattern = r"is (red|blue|green)"
matches = re.findall(pattern, colors)
Out[27]:
Console
Grouping for repetition:
  Text: 'abcabcabc'
  Pattern: (abc)+
  Full match: 'abcabcabc'
  Captured group: 'abc'

Grouping for alternation:
  Pattern: is (red|blue|green)
  Captured colors: ['red', 'blue', 'green']

When you use findall() with groups, it returns only the captured group contents, not the full match. This is often what you want when extracting specific parts of a pattern.
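
When you need both the full match and the captured group, finditer provides a match object with access to each. A small sketch reusing the colors example:

Code
import re

colors = "The car is red, the bike is blue, the bus is green"

# group(0) is the full match; group(1) is the first capture
for m in re.finditer(r"is (red|blue|green)", colors):
    print(m.group(0), "->", m.group(1))
# is red -> red
# is blue -> blue
# is green -> green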

Multiple Groups

In[28]:
Code
# Extract date components
text = "Meeting on 2024-01-15 and 2024-02-20"
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# findall returns tuples when there are multiple groups
dates = re.findall(pattern, text)

# Using match objects for more control
for match in re.finditer(pattern, text):
    full = match.group(0)
    year = match.group(1)
    month = match.group(2)
    day = match.group(3)
Out[29]:
Console
Text: 'Meeting on 2024-01-15 and 2024-02-20'
Pattern: (\d{4})-(\d{2})-(\d{2})

findall with groups returns tuples:
  [('2024', '01', '15'), ('2024', '02', '20')]

Using match objects:
  Full match: '2024-01-15'
    Year:  2024
    Month: 01
    Day:   15

  Full match: '2024-02-20'
    Year:  2024
    Month: 02
    Day:   20

Named Groups

Named groups make patterns more readable and self-documenting:

In[30]:
Code
# Named groups with (?P<name>...)
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
text = "Event date: 2024-03-15"

match = re.search(pattern, text)
Out[31]:
Console
Pattern with named groups:
  (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})

Accessing by name:
  Year:  2024
  Month: 03
  Day:   15

As dictionary:
  {'year': '2024', 'month': '03', 'day': '15'}

Named groups are especially valuable in complex patterns where numbered groups become confusing.
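
Named groups can also be referenced in replacement strings with \g<name>. A small sketch that reorders an ISO date:

Code
import re

# \g<name> refers to a named capture in the replacement string
us_date = re.sub(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    r"\g<month>/\g<day>/\g<year>",
    "2024-03-15",
)
print(us_date)  # 03/15/2024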

Non-Capturing Groups

Sometimes you need grouping for structure but don't want to capture the content. Use (?:...):

In[32]:
Code
# Capturing group - captures "http" or "https"
capturing = re.findall(
    r"(https?)://(\w+\.\w+)", "Visit https://example.com or http://test.org"
)

# Non-capturing group - only captures the domain
non_capturing = re.findall(
    r"(?:https?)://(\w+\.\w+)", "Visit https://example.com or http://test.org"
)
Out[33]:
Console
Capturing group (https?):
  Returns: [('https', 'example.com'), ('http', 'test.org')]

Non-capturing group (?:https?):
  Returns: ['example.com', 'test.org']

Non-capturing groups keep your results clean when you only care about specific parts of the pattern.

Backreferences

Backreferences let you match the same text that was captured by an earlier group:

In[34]:
Code
# Find repeated words
text = "The the quick brown fox jumps over the the lazy dog dog"
pattern = r"\b(\w+)\s+\1\b"
repeated = re.findall(pattern, text, re.IGNORECASE)

# Find matching HTML tags
html = "<div>content</div> <span>text</span> <div>broken</span>"
pattern = r"<(\w+)>.*?</\1>"
valid_tags = re.findall(pattern, html)
Out[35]:
Console
Finding repeated words:
  Text: 'The the quick brown fox jumps over the the lazy dog dog'
  Pattern: \b(\w+)\s+\1\b
  Repeated words: ['The', 'the', 'dog']

Matching HTML tags:
  Text: '<div>content</div> <span>text</span> <div>broken</span>'
  Pattern: <(\w+)>.*?</\1>
  Valid tags: ['div', 'span']

The \1 refers back to whatever was captured by the first group. In the repeated words example, if the first group captures "the", then \1 only matches another "the", not any word.
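
Named groups have a backreference form of their own, (?P=name), which reads more clearly than \1 in long patterns:

Code
import re

# (?P=word) matches exactly what the named group captured
repeated = re.findall(r"\b(?P<word>\w+)\s+(?P=word)\b", "the the cat cat sat")
print(repeated)  # ['the', 'cat']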

Lookahead and Lookbehind

Lookahead and lookbehind assertions match a position based on what comes before or after, without consuming any characters. These are called "zero-width assertions" because they check conditions without advancing the regex engine's position in the string. This makes them powerful for extracting text that appears in specific contexts while excluding the context itself from the match.

Lookahead

In[36]:
Code
text = "100 dollars, 50 euros, 75 pounds"

# Positive lookahead: (?=...)
# Match numbers followed by "dollars"
dollars = re.findall(r"\d+(?= dollars)", text)

# Negative lookahead: (?!...)
# Match numbers NOT followed by "dollars"
not_dollars = re.findall(r"\d+(?! dollars)", text)
Out[37]:
Console
Text: '100 dollars, 50 euros, 75 pounds'

Positive lookahead (?= dollars):
  Pattern: \d+(?= dollars)
  Matches: ['100']

Negative lookahead (?! dollars):
  Pattern: \d+(?! dollars)
  Matches: ['10', '50', '75']

The '10' looks odd at first: '100' is followed by ' dollars', so that match fails, but the engine backtracks to the shorter '10', which is followed by '0' rather than ' dollars' and therefore succeeds. Anchoring with a word boundary, \d+\b(?! dollars), rejects the partial match and returns only ['50', '75'].

Lookbehind

In[38]:
Code
text = "$100 €50 £75"

# Positive lookbehind: (?<=...)
# Match numbers preceded by $
usd = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: (?<!...)
# Match numbers NOT preceded by $
not_usd = re.findall(r"(?<!\$)\d+", text)
Out[39]:
Console
Text: '$100 €50 £75'

Positive lookbehind (?<=\$):
  Pattern: (?<=\$)\d+
  Matches: ['100']

Negative lookbehind (?<!\$):
  Pattern: (?<!\$)\d+
  Matches: ['00', '50', '75']

The stray '00' is the same backtracking trap: the '1' in '$100' is preceded by '$' and rejected, but the next '0' is preceded by '1', so the engine matches the tail of the number. Extending the lookbehind to (?<![\$\d])\d+ also rejects digits preceded by other digits and returns only ['50', '75'].

Lookarounds are powerful for extracting data from structured formats where you want the context to guide matching but don't want the context in your result.

In[40]:
Code
# Practical example: Extract values from key-value pairs
config = "name=Alice, age=30, city=Boston"

# Use lookbehind to find values after specific keys
name = re.search(r"(?<=name=)\w+", config)
age = re.search(r"(?<=age=)\d+", config)
Out[41]:
Console
Config: 'name=Alice, age=30, city=Boston'

Extracting with lookbehind:
  name: Alice
  age: 30
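
One restriction worth knowing: Python's re module requires lookbehind patterns to have a fixed width, so a pattern like (?<=\w+=) is rejected. (The third-party regex module lifts this restriction.) A minimal sketch:

Code
import re

# Fixed width: 'name=' is always five characters, so this compiles
re.compile(r"(?<=name=)\w+")

# Variable width: \w+ can match any length, so re rejects it
try:
    re.compile(r"(?<=\w+=)\w+")
except re.error as e:
    print(e)  # look-behind requires fixed-width pattern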

Flags and Modifiers

Regex flags modify how patterns are interpreted:

In[42]:
Code
text = """Hello World
hello python
HELLO REGEX"""

# re.IGNORECASE (re.I) - case-insensitive matching
case_insensitive = re.findall(r"hello", text, re.IGNORECASE)

# re.MULTILINE (re.M) - ^ and $ match line boundaries
multiline = re.findall(r"^hello", text, re.IGNORECASE | re.MULTILINE)

# re.DOTALL (re.S) - dot matches newline
dotall = re.findall(r"Hello.*REGEX", text, re.DOTALL)

# re.VERBOSE (re.X) - allows comments and whitespace in patterns
pattern = re.compile(
    r"""
    \d{4}    # Year
    -        # Separator
    \d{2}    # Month
    -        # Separator
    \d{2}    # Day
""",
    re.VERBOSE,
)
Out[43]:
Console
Text:
Hello World
hello python
HELLO REGEX

re.IGNORECASE:
  Pattern: hello
  Matches: ['Hello', 'hello', 'HELLO']

re.MULTILINE:
  Pattern: ^hello (with IGNORECASE)
  Matches: ['Hello', 'hello', 'HELLO']

re.DOTALL:
  Pattern: Hello.*REGEX
  Matches: ['Hello World\nhello python\nHELLO REGEX']

re.VERBOSE allows readable patterns:
  Pattern matches: 2024-01-15 → True

The re.VERBOSE flag is particularly valuable for complex patterns. It lets you break patterns across lines and add comments, making them maintainable.

Common NLP Patterns

Let's build patterns for text elements you'll frequently encounter in NLP work. Real-world text contains a mix of entities like emails, URLs, phone numbers, dates, and social media elements. Understanding how to extract these is fundamental to text preprocessing.

Consider this sample social media post containing multiple entity types:

In[44]:
Code
import re

sample_text = """
Hey @john_doe! Check out our new product at https://example.com/product?id=123
Contact us at support@company.com or call (555) 123-4567 for help.
Sale ends 2024-12-31! Use code #SAVE20 for 20% off. 
Also follow @tech_news and @deals_daily for updates.
Visit http://blog.example.org or email sales@example.org
#BlackFriday #CyberMonday #Shopping
Meeting scheduled for 01/15/2024. Call +1-800-555-0199.
"""

# Define extraction patterns
patterns = {
    "Emails": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",
    "URLs": r"https?://\S+",
    "Mentions": r"@\w+",
    "Hashtags": r"#\w+",
    "Phone Numbers": r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}",
    "Dates (ISO)": r"\d{4}-\d{2}-\d{2}",
    "Dates (US)": r"\d{2}/\d{2}/\d{4}",
}

# Extract all entities
for name, pattern in patterns.items():
    matches = re.findall(pattern, sample_text)
Out[45]:
Console
Entity extraction results:

  Emails: 2 matches
    Examples: 'support@company.com', 'sales@example.org'
  URLs: 2 matches
    Examples: 'https://example.com/product?id=123', 'http://blog.example.org'
  Mentions: 5 matches
    Examples: '@john_doe', '@company', ...
  Hashtags: 4 matches
    Examples: '#SAVE20', '#BlackFriday', ...
  Phone Numbers: 2 matches
    Examples: '(555) 123-4567', '+1-800-555-0199'
  Dates (ISO): 1 matches
    Examples: '2024-12-31'
  Dates (US): 1 matches
    Examples: '01/15/2024'

The table below summarizes the extraction results. Notice how social media content tends to have many mentions and hashtags, while business communications include emails and phone numbers. One discrepancy is worth flagging: the naive @\w+ pattern also matched the domain half of the two email addresses (@company, @example), which is why the console reported 5 mentions while only 3 are true mentions:

Entity Type      Count   Example Matches
Emails           2       support@company.com, sales@example.org
URLs             2       https://example.com/product?id=123, http://blog.example.org
Mentions         3       @john_doe, @tech_news, @deals_daily
Hashtags         4       #SAVE20, #BlackFriday, #CyberMonday, #Shopping
Phone Numbers    2       (555) 123-4567, +1-800-555-0199
Dates (ISO)      1       2024-12-31
Dates (US)       1       01/15/2024

Let's examine each pattern in detail.

Email Addresses

In[46]:
Code
# Basic email pattern
email_pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"

text = """
Contact us at support@example.com or sales@company.org.
Invalid emails: @missing.com, nodomain@, spaces in@email.com
Edge cases: user.name+tag@sub.domain.co.uk
"""

emails = re.findall(email_pattern, text)
Out[47]:
Console
Email pattern:
  \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b

Found emails:
  support@example.com
  sales@company.org
  in@email.com
  user.name+tag@sub.domain.co.uk

URLs

In[48]:
Code
# URL pattern
url_pattern = r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)"

text = """
Visit https://www.example.com or http://test.org/page?id=123
Also check https://api.service.io/v2/users#section
Not a URL: ftp://other.com or just example.com
"""

urls = re.findall(url_pattern, text)
Out[49]:
Console
URL pattern (simplified):
  https?://...

Found URLs:
  https://www.example.com
  http://test.org/page?id=123
  https://api.service.io/v2/users#section

Phone Numbers

In[50]:
Code
# US phone number patterns (multiple formats)
phone_pattern = r"""
    (?:
        \+?1[-.\s]?          # Optional country code
    )?
    (?:
        \(?\d{3}\)?          # Area code with optional parens
        [-.\s]?              # Separator
    )
    \d{3}                    # First 3 digits
    [-.\s]?                  # Separator
    \d{4}                    # Last 4 digits
"""

text = """
Call us: (555) 123-4567, 555.123.4567, 555 123 4567
International: +1-555-123-4567, +1 (555) 123-4567
"""

phones = re.findall(phone_pattern, text, re.VERBOSE)
Out[51]:
Console
Phone numbers found:
  '(555) 123-4567'
  '555.123.4567'
  '555 123 4567'
  '+1-555-123-4567'
  '+1 (555) 123-4567'

Dates

In[52]:
Code
text = """
Dates: 2024-01-15, 01/15/2024, January 15, 2024
Also: 15-Jan-2024, 15 January 2024
"""

# ISO format: YYYY-MM-DD
iso_dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# US format: MM/DD/YYYY
us_dates = re.findall(r"\d{2}/\d{2}/\d{4}", text)

# Month name formats
month_names = r"(?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
named_dates = re.findall(
    rf"(?:\d{{1,2}}[-\s])?{month_names}[-\s]?\d{{1,2}}(?:,?\s*\d{{4}})?", text
)
Out[53]:
Console
Text:

Dates: 2024-01-15, 01/15/2024, January 15, 2024
Also: 15-Jan-2024, 15 January 2024


ISO dates (YYYY-MM-DD): ['2024-01-15']
US dates (MM/DD/YYYY): ['01/15/2024']
Named dates: ['January 15, 2024', '15-Jan-20', '15 January 20']

Note the truncated matches: for '15-Jan-2024', the optional day prefix matched '15-', the \d{1,2} after the month name consumed '20', and the optional four-digit year could not match the remaining '24'. Date formats are full of such edge cases, which is why dedicated parsers like dateutil are often the safer choice.

Hashtags and Mentions

In[54]:
Code
tweet = "Just learned about #NLP and #MachineLearning! Thanks @professor_ai for the great tutorial. #AI2024"

# Hashtags: # followed by word characters
hashtags = re.findall(r"#\w+", tweet)

# Mentions: @ followed by word characters
mentions = re.findall(r"@\w+", tweet)
Out[55]:
Console
Tweet: 'Just learned about #NLP and #MachineLearning! Thanks @professor_ai for the great tutorial. #AI2024'

Hashtags: ['#NLP', '#MachineLearning', '#AI2024']
Mentions: ['@professor_ai']
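
One caveat: the bare @\w+ pattern also matches the domain half of email addresses, as we saw in the entity extraction table earlier. A negative lookbehind excludes those false positives. A quick sketch:

Code
import re

text = "Email support@company.com or mention @helpdesk"

print(re.findall(r"@\w+", text))  # ['@company', '@helpdesk']

# (?<!\w) rejects an @ that directly follows a word character
print(re.findall(r"(?<!\w)@\w+", text))  # ['@helpdesk']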

Substitution and Transformation

The re.sub() function replaces matches with new text. You can use backreferences in the replacement string.

In[56]:
Code
text = "Contact john.doe@email.com or jane.smith@company.org"

# Simple replacement
redacted = re.sub(r"\S+@\S+", "[EMAIL REDACTED]", text)

# Using backreferences in replacement
# Swap first and last name in email
swapped = re.sub(r"(\w+)\.(\w+)@", r"\2.\1@", text)


# Using a function for complex replacements
def mask_email(match):
    email = match.group()
    name, domain = email.split("@")
    return f"{name[0]}***@{domain}"


masked = re.sub(r"\S+@\S+", mask_email, text)
Out[57]:
Console
Original: 'Contact john.doe@email.com or jane.smith@company.org'

Simple replacement:
  Contact [EMAIL REDACTED] or [EMAIL REDACTED]

Backreference swap (\1.\2@ → \2.\1@):
  Contact doe.john@email.com or smith.jane@company.org

Function-based masking:
  Contact j***@email.com or j***@company.org

Compiling Patterns

For patterns you use repeatedly, compile them for better performance:

In[58]:
Code
import re

# Compile once, use many times
email_regex = re.compile(
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", re.IGNORECASE
)

texts = [
    "Contact support@example.com",
    "No email here",
    "Multiple: a@b.com, c@d.org",
]

results = [email_regex.findall(text) for text in texts]
Out[59]:
Console
Compiled pattern reuse:
  'Contact support@example.com'
    Found: ['support@example.com']
  'No email here'
    Found: []
  'Multiple: a@b.com, c@d.org'
    Found: ['a@b.com', 'c@d.org']

Compiled patterns also store flags, making your code cleaner when the same flags apply everywhere.
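
Compiled patterns also accept pos and endpos arguments that the module-level functions lack, letting you restrict the search window without slicing the string. A small sketch:

Code
import re

word = re.compile(r"\w+")
text = "alpha beta gamma"

# Start searching at index 6 (skipping 'alpha ')
print(word.search(text, 6).group())  # beta

# Only consider characters up to index 5
print(word.findall(text, 0, 5))  # ['alpha']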

Performance Considerations

Regex engines use backtracking to find matches, which can lead to catastrophic performance on certain patterns.

Catastrophic Backtracking

In[60]:
Code
import time

# Dangerous pattern: nested quantifiers
dangerous_pattern = r"(a+)+"
safe_pattern = r"a+"

# Test string that causes backtracking
test_string = "a" * 25 + "b"

# Time the safe pattern
start = time.perf_counter()
re.search(safe_pattern, test_string)
safe_time = time.perf_counter() - start

# The dangerous pattern would take exponentially longer
# We won't run it with a long string to avoid hanging
short_test = "a" * 10 + "b"
start = time.perf_counter()
re.search(dangerous_pattern, short_test)
dangerous_time = time.perf_counter() - start
Out[61]:
Console
Backtracking performance:
  Safe pattern 'a+' on 25 a's: 0.0412 ms
  Dangerous pattern '(a+)+' on 10 a's: 0.0348 ms

Warning: '(a+)+' on 25+ characters can take minutes or hours!
The time doubles with each additional character.

The exponential growth of backtracking time is one of the most important performance concepts to understand. Let's visualize how execution time explodes as input length increases:

Out[62]:
Visualization
Line plot showing exponential growth of regex execution time for dangerous patterns versus constant time for safe patterns.
Exponential time complexity of catastrophic backtracking. The pattern (a+)+ on a string of n 'a's followed by 'b' requires exploring approximately 2^n combinations before failing. Safe patterns like a+ run in linear time regardless of input length.

The logarithmic scale reveals the exponential nature of the problem. While the safe pattern stays flat (constant time), the dangerous pattern's execution time doubles with each additional character. At 20 characters, matching takes seconds. At 25, it takes minutes. At 30, hours. This is why avoiding nested quantifiers is critical for production code.

Performance Tips

To write efficient regex patterns, follow these guidelines:

  1. Anchor when possible: ^pattern is faster than searching the whole string
  2. Be specific: narrow classes like [0-9] or [a-zA-Z] fail faster and backtrack less than broad patterns like . or \w
  3. Avoid nested quantifiers: (a+)+ is dangerous; use a+ instead
  4. Use non-capturing groups: (?:...) is slightly faster than (...)
  5. Compile patterns: For repeated use, re.compile() avoids re-parsing
  6. Use possessive quantifiers or atomic groups: Python's re supports these from version 3.11 onward; on older versions, the third-party regex module provides them
In[63]:
Code
# The regex module offers more features and better performance
# Install with: pip install regex
try:
    import regex

    # Possessive quantifiers prevent backtracking
    # a++ means "match one or more 'a', and don't give them back"
    possessive = regex.compile(r"a++b")

    # Atomic groups: (?>...)
    atomic = regex.compile(r"(?>a+)b")

    has_regex = True
except ImportError:
    has_regex = False
Out[64]:
Console
The 'regex' module provides:
  - Possessive quantifiers: a++ (no backtracking)
  - Atomic groups: (?>a+)
  - Better Unicode support
  - Fuzzy matching
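
Since Python 3.11, the built-in re module accepts possessive quantifiers and atomic groups as well. A minimal sketch, assuming Python 3.11 or newer:

Code
import re

# Possessive: a++ never gives back characters once matched
print(re.search(r"a++b", "aaab"))   # matches 'aaab'
print(re.search(r"a++ab", "aaab"))  # None: a++ consumed every 'a'

# Atomic group: the engine cannot backtrack into (?>...)
print(re.search(r"(?>a+)b", "aaab"))  # matches 'aaab'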

Building a Text Cleaning Pipeline

Let's combine what we've learned into a practical text preprocessing pipeline:

In[65]:
Code
import re
from typing import List, Tuple


class TextCleaner:
    """A regex-based text cleaning pipeline for NLP preprocessing."""

    def __init__(self):
        # Compile patterns once
        self.url_pattern = re.compile(r"https?://\S+|www\.\S+")
        self.email_pattern = re.compile(r"\S+@\S+\.\S+")
        self.mention_pattern = re.compile(r"@\w+")
        self.hashtag_pattern = re.compile(r"#\w+")
        self.number_pattern = re.compile(r"\b\d+(?:\.\d+)?\b")
        self.whitespace_pattern = re.compile(r"\s+")
        self.punctuation_pattern = re.compile(r"[^\w\s]")

    def remove_urls(self, text: str) -> str:
        return self.url_pattern.sub(" ", text)

    def remove_emails(self, text: str) -> str:
        return self.email_pattern.sub(" ", text)

    def remove_mentions(self, text: str) -> str:
        return self.mention_pattern.sub(" ", text)

    def extract_hashtags(self, text: str) -> Tuple[str, List[str]]:
        hashtags = self.hashtag_pattern.findall(text)
        cleaned = self.hashtag_pattern.sub(" ", text)
        return cleaned, hashtags

    def normalize_whitespace(self, text: str) -> str:
        return self.whitespace_pattern.sub(" ", text).strip()

    def clean(self, text: str) -> str:
        """Apply full cleaning pipeline."""
        text = self.remove_urls(text)
        text = self.remove_emails(text)
        text = self.remove_mentions(text)
        text, _ = self.extract_hashtags(text)
        text = self.normalize_whitespace(text)
        return text


# Test the pipeline
cleaner = TextCleaner()

sample = """
Check out https://example.com for more info! 
Contact support@company.com or @helpdesk
#MachineLearning is amazing! #NLP #AI
"""

cleaned = cleaner.clean(sample)
_, hashtags = cleaner.extract_hashtags(sample)
Out[66]:
Console
Original text:

Check out https://example.com for more info! 
Contact support@company.com or @helpdesk
#MachineLearning is amazing! #NLP #AI


Cleaned text:
  'Check out for more info! Contact or is amazing!'

Extracted hashtags:
  ['#MachineLearning', '#NLP', '#AI']

When Not to Use Regex

Regex is powerful, but it's not always the right tool:

Don't use regex for:

  • HTML/XML parsing: Use BeautifulSoup or lxml. Regex can't handle nested structures properly.
  • JSON/structured data: Use json module. Regex is error-prone for complex formats.
  • Complex grammars: Use a proper parser (like pyparsing or lark) for programming languages or complex formats.
  • Simple string operations: str.split(), str.replace(), in operator are clearer and faster for simple cases.
In[67]:
Code
# Simple cases: prefer string methods
text = "Hello, World!"

# Bad: regex for simple check
uses_regex = bool(re.search(r"World", text))

# Good: simple 'in' operator
uses_in = "World" in text

# Bad: regex for simple replacement
regex_replace = re.sub(r",", ";", text)

# Good: str.replace()
str_replace = text.replace(",", ";")

# Bad: regex for simple split
regex_split = re.split(r",\s*", "a, b, c")

# Good: str.split() with strip
str_split = [x.strip() for x in "a, b, c".split(",")]
Out[68]:
Console
Prefer string methods for simple operations:

Checking substring:
  'in' operator: True (clearer)

Simple replacement:
  str.replace(): 'Hello; World!' (faster)

Simple split:
  str.split(): ['a', 'b', 'c'] (more readable)

Limitations and Challenges

Regular expressions have fundamental limitations that you should understand before relying on them heavily:

  • Beyond regular languages: Regex matches regular languages only, so it cannot handle arbitrarily nested structures like balanced parentheses, which are context-free. No regex can match exactly the strings with balanced parens like ((())).
  • Readability: Complex regex patterns become write-only code. The email pattern we used earlier is already hard to read, and production-grade patterns are worse.
  • Maintenance: Small changes to requirements can require complete pattern rewrites. Adding "support international characters" to an email pattern is non-trivial.
  • Unicode complexity: Python 3's re matches Unicode word characters with \w by default, but finer-grained needs such as Unicode categories (\p{L}), grapheme clusters, or full case folding require the third-party regex module (see the sketch after this list).
  • Performance unpredictability: Backtracking behavior makes it hard to predict execution time. A pattern that works fine on test data might hang on production data.
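
To see the default Unicode behavior and the ASCII opt-out mentioned above, consider this sketch:

Code
import re

# In Python 3, \w is Unicode-aware by default
print(re.findall(r"\w+", "café Москва 東京"))  # ['café', 'Москва', '東京']

# re.ASCII restricts \w to [a-zA-Z0-9_]
print(re.findall(r"\w+", "café", re.ASCII))  # ['caf']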

Key Functions and Parameters

When working with regular expressions in Python, these are the essential functions and their most important parameters:

re.search(pattern, string, flags=0)

  • pattern: The regex pattern to search for
  • string: The text to search within
  • flags: Optional modifiers like re.IGNORECASE, re.MULTILINE
  • Returns: A match object for the first match, or None if no match

re.findall(pattern, string, flags=0)

  • Returns all non-overlapping matches as a list of strings
  • If the pattern has groups, returns a list of tuples containing the groups

re.finditer(pattern, string, flags=0)

  • Returns an iterator of match objects for all matches
  • Use when you need position information or access to groups

re.sub(pattern, repl, string, count=0, flags=0)

  • repl: Replacement string or function
  • count: Maximum number of replacements (0 means all)
  • Backreferences like \1 can be used in the replacement string

re.split(pattern, string, maxsplit=0, flags=0)

  • maxsplit: Maximum number of splits (0 means no limit)
  • Returns a list of strings split at pattern matches

re.compile(pattern, flags=0)

  • Pre-compiles a pattern for repeated use
  • Returns a compiled pattern object with the same methods

Common Flags

  • re.IGNORECASE (or re.I): Case-insensitive matching
  • re.MULTILINE (or re.M): ^ and $ match at line boundaries
  • re.DOTALL (or re.S): . matches newlines
  • re.VERBOSE (or re.X): Allow comments and whitespace in patterns

Summary

Regular expressions provide a compact, powerful language for pattern matching in text. You've learned:

  • Metacharacters: . matches any character, [] defines character classes, ^ and $ anchor to positions
  • Quantifiers: *, +, ?, {n,m} control repetition; add ? for lazy matching
  • Groups: () captures text, (?:) groups without capturing, (?P<name>) names captures
  • Lookarounds: (?=), (?!), (?<=), (?<!) match positions based on context
  • Backreferences: \1, \2 refer back to captured groups
  • Flags: re.IGNORECASE, re.MULTILINE, re.DOTALL, re.VERBOSE modify behavior

Key practical patterns for NLP:

  • Emails: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
  • URLs: https?://\S+
  • Hashtags/Mentions: #\w+, @\w+
  • Word boundaries: \bword\b for whole-word matching

Best practices:

  • Always use raw strings: r"pattern"
  • Compile patterns used repeatedly
  • Avoid nested quantifiers that cause catastrophic backtracking
  • Use string methods for simple operations
  • Use proper parsers for structured formats like HTML or JSON

In the next chapter, we'll explore sentence segmentation, where regex plays a supporting role in identifying sentence boundaries.

