Learn chunking (shallow parsing) to identify noun phrases, verb phrases, and prepositional phrases using IOB tagging, regex patterns, and machine learning with NLTK and spaCy.

This article is part of the free-to-read Language AI Handbook
Chunking
Sentences contain meaningful groups of words that function as units. In "The quick brown fox jumps over the lazy dog," we naturally perceive "the quick brown fox" as a noun phrase describing the subject, "jumps" as the action, and "over the lazy dog" as a prepositional phrase describing where. Chunking, also called shallow parsing, identifies these non-overlapping segments without building a full syntactic tree.
Chunking sits between part-of-speech tagging and full parsing in complexity. POS tagging labels individual words. Full parsing constructs hierarchical tree structures showing how phrases nest within phrases. Chunking finds a middle ground: it groups consecutive words into flat, non-recursive chunks without showing how chunks relate to each other.
This chapter explores chunking from multiple angles. You'll learn the major chunk types, understand how IOB tagging represents chunk boundaries, implement chunkers using both regular expressions and machine learning, and see how chunking serves as a preprocessing step for information extraction and other downstream tasks.
What Is Chunking?
Chunking identifies contiguous spans of tokens that form syntactic units. Unlike full parsing, which produces tree structures with unlimited nesting, chunking produces a flat sequence of labeled segments. Each token belongs to exactly one chunk (or no chunk if it's outside all chunks).
Chunking is the task of grouping consecutive words into non-overlapping, non-recursive phrases such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). It identifies phrase boundaries without building hierarchical parse trees.
Consider the sentence "The black cat sat on the mat." A chunker might produce:
- [The black cat], noun phrase (subject)
- [sat], verb phrase (predicate)
- [on], prepositional phrase start
- [the mat], noun phrase (object of preposition)
The key properties of chunking are:
- Non-overlapping: Each word belongs to at most one chunk
- Non-recursive: Chunks don't contain other chunks of the same type
- Contiguous: Chunks consist of consecutive words
- Partial coverage: Some words may not belong to any chunk
Let's see chunking in action with NLTK:
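A minimal sketch of what this might look like (the example sentence is illustrative, and the commented nltk.download calls fetch the standard resources the pipeline needs):

```python
import nltk

# One-time downloads of the standard NLTK resources used below.
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker")
# nltk.download("words")

sentence = "Barack Obama visited Google headquarters in California."
tokens = nltk.word_tokenize(sentence)   # split the sentence into tokens
tagged = nltk.pos_tag(tokens)           # assign a POS tag to each token
tree = nltk.ne_chunk(tagged)            # group tagged tokens into chunks (a flat Tree)
print(tree)
```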
NLTK's ne_chunk function produces a tree structure, but we can flatten it to see the chunks:
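A small helper can walk the tree and collect each chunk as a (label, text) pair. The function name flatten_chunks is my own, not an NLTK API:

```python
import nltk

def flatten_chunks(tree):
    """Convert a flat NLTK chunk tree into (label, text) pairs."""
    chunks = []
    for node in tree:
        if isinstance(node, nltk.Tree):
            # A subtree is a chunk: join its tokens into one span.
            words = [token for token, tag in node.leaves()]
            chunks.append((node.label(), " ".join(words)))
        else:
            # A bare (token, tag) pair lies outside every chunk.
            token, tag = node
            chunks.append(("O", token))
    return chunks

# Rebuild the tree from the previous snippet and flatten it.
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(
    "Barack Obama visited Google headquarters in California.")))
for label, text in flatten_chunks(tree):
    print(f"{label:14s} {text}")
```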
Chunk Types
Different chunk types capture different syntactic units. The most common types in English are noun phrases, verb phrases, and prepositional phrases, though chunking schemes may include additional categories.
Noun Phrases (NP)
Noun phrases are the most important and well-studied chunk type. They typically consist of an optional determiner, zero or more adjectives, and one or more nouns. NP chunking is sometimes called "base NP chunking" when it identifies only the innermost, non-recursive noun phrases.
The structure of noun phrases follows grammatical patterns. A determiner (the, a, my) often starts the phrase, adjectives modify the head noun, and the head noun itself can be singular, plural, or proper. Understanding these patterns is key to building effective chunkers.
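A few hand-picked examples make the pattern concrete (the phrases and Penn Treebank tag sequences below are illustrative, not library output):

```python
# Hand-picked noun phrases and their Penn Treebank POS-tag patterns.
np_examples = [
    ("the cat",                "DT NN"),           # determiner + noun
    ("a quick brown fox",      "DT JJ JJ NN"),     # determiner + adjectives + noun
    ("my three older sisters", "PRP$ CD JJR NNS"), # possessive + number + comparative + plural noun
    ("President Lincoln",      "NNP NNP"),         # proper nouns, no determiner
]
for phrase, tags in np_examples:
    print(f"{phrase:24s} -> {tags}")
```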
Verb Phrases (VP)
Verb phrases contain the main verb and its auxiliaries. In chunking, we typically identify just the verbal elements, not the objects or complements that would be included in a full verb phrase in traditional grammar.
Prepositional Phrases (PP)
Prepositional phrases begin with a preposition and typically contain a noun phrase. In chunking, the PP chunk often consists of just the preposition, with the following noun phrase chunked as a separate NP, though some schemes group the two into a single chunk.
Other Chunk Types
Depending on the annotation scheme, chunkers may identify additional phrase types:
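In the CoNLL-2000 annotation scheme, for example, the additional types include:
- ADVP: adverb phrases such as "very quickly"
- ADJP: adjective phrases such as "quite happy"
- SBAR: subordinating conjunctions that introduce clauses, such as "because" or "although"
- PRT: verb particles, such as "up" in "give up"
- CONJP, INTJ, LST, and UCP: conjunction phrases, interjections, list markers, and unlike coordinated phrases, all of which are rare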
IOB Tagging for Chunks
Just as BIO tagging encodes entity boundaries for named entity recognition, IOB tagging encodes chunk boundaries. Each token receives a tag indicating whether it begins a chunk (B), continues a chunk (I), or is outside all chunks (O).
IOB (Inside-Outside-Beginning) tagging represents chunk boundaries using per-token labels. B-NP marks the first word of a noun phrase, I-NP marks subsequent words in the same NP, and O marks words outside any chunk. This encoding is identical to BIO tagging for NER.
The terminology varies slightly. Some sources use "BIO" while others use "IOB." Additionally, there are two variants:
- IOB1: B tag is only used when a chunk follows another chunk of the same type
- IOB2: B tag is always used for the first token of a chunk (more common)
Let's see IOB tagging applied to chunks:
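Here is a hand-labeled IOB2 example for the sentence from earlier, following the CoNLL convention of one (token, POS tag, chunk tag) triple per word:

```python
# Hand-labeled IOB2 chunk tags for "The black cat sat on the mat."
iob_tagged = [
    ("The",   "DT",  "B-NP"),
    ("black", "JJ",  "I-NP"),
    ("cat",   "NN",  "I-NP"),
    ("sat",   "VBD", "B-VP"),
    ("on",    "IN",  "B-PP"),
    ("the",   "DT",  "B-NP"),
    ("mat",   "NN",  "I-NP"),
    (".",     ".",   "O"),
]
for token, pos, chunk in iob_tagged:
    print(f"{token:8s} {pos:5s} {chunk}")
```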
Converting Between Chunks and IOB Tags
We need utilities to convert between chunk spans and IOB sequences, just as we did for named entity recognition in the BIO tagging chapter:
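A minimal sketch of both directions, using IOB2 tags and (type, start, end) spans with an exclusive end index; the function names are my own:

```python
def iob_to_chunks(tags):
    """Convert a list of IOB2 tags into (type, start, end) spans (end exclusive)."""
    chunks, start, chunk_type = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and chunk_type != tag[2:]):
            # A new chunk begins; close any open chunk first.
            if chunk_type is not None:
                chunks.append((chunk_type, start, i))
            start, chunk_type = i, tag[2:]
        elif tag == "O":
            if chunk_type is not None:
                chunks.append((chunk_type, start, i))
            start, chunk_type = None, None
    if chunk_type is not None:
        chunks.append((chunk_type, start, len(tags)))
    return chunks


def chunks_to_iob(chunks, length):
    """Convert (type, start, end) spans back into a list of IOB2 tags."""
    tags = ["O"] * length
    for chunk_type, start, end in chunks:
        tags[start] = f"B-{chunk_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{chunk_type}"
    return tags


tags = ["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP", "O"]
spans = iob_to_chunks(tags)
print(spans)   # [('NP', 0, 3), ('VP', 3, 4), ('PP', 4, 5), ('NP', 5, 7)]
print(chunks_to_iob(spans, len(tags)) == tags)   # True
```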
Chunking vs. Full Parsing
Understanding the difference between chunking and full parsing clarifies when to use each approach.
Full syntactic parsing produces a complete tree structure showing how phrases nest within phrases. Consider "I saw the man with the telescope." A full parse would show the PP "with the telescope" attaching either to "saw" (I used the telescope to see) or to "the man" (the man had the telescope). This ambiguity requires semantic knowledge to resolve.
The key tradeoffs are:
- Speed: Chunking is much faster than full parsing
- Accuracy: Chunking achieves higher accuracy on its simpler task
- Information: Full parsing captures more syntactic detail
- Ambiguity: Chunking avoids attachment decisions that require semantics
For many practical applications like information extraction, named entity recognition, and text classification, chunking provides sufficient structure without the complexity and errors of full parsing.
Regex-Based Chunking with NLTK
NLTK provides a RegexpParser that lets you define chunk patterns using regular expressions over POS tags. This approach is intuitive and works well for simple patterns.
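A sketch using a common textbook NP grammar (an optional determiner, any number of adjectives, then one or more nouns):

```python
import nltk

# The pattern is a regex over POS tags; the braces mark what to chunk.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunk_parser = nltk.RegexpParser(grammar)

sentence = "The quick brown fox jumps over the lazy dog."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = chunk_parser.parse(tagged)
print(tree)   # NP chunks wrap "The quick brown fox" and "the lazy dog"
```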
Let's break down the grammar pattern for noun phrases:
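Assuming the grammar sketched above, reading NP: {<DT>?<JJ>*<NN.*>+} left to right:
- <DT>? matches an optional determiner (the, a, an)
- <JJ>* matches zero or more adjectives
- <NN.*>+ matches one or more nouns of any kind (NN, NNS, NNP, NNPS)
- The braces { } mark each matching sequence as a chunk, and the label before the colon (NP) names the chunk type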
Chinking: Defining What's NOT in a Chunk
Sometimes it's easier to define what shouldn't be in a chunk than what should. Chinking removes tokens from chunks:
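A sketch of a chink-based grammar in the style of the NLTK book: first chunk everything, then carve the verbs and prepositions back out:

```python
import nltk

grammar = r"""
  NP:
    {<.*>+}        # start by chunking every token
    }<VB.*|IN>{    # then chink (remove) verbs and prepositions
"""
chink_parser = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize("the little yellow dog barked at the cat"))
print(chink_parser.parse(tagged))
# roughly: (S (NP the little yellow dog) barked at (NP the cat))
```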
The }{ syntax means "end chunk before these tags, start new chunk after." This effectively splits chunks at verbs, prepositions, and other non-NP elements.
Limitations of Regex Chunking
Regex-based chunking is simple and fast but has limitations:
Regex patterns match local sequences without considering broader context. They rely entirely on POS tags, so POS tagging errors propagate. They cannot handle discontinuous constituents or truly recursive structures.
Using the CoNLL-2000 Dataset
The CoNLL-2000 shared task established a standard benchmark for chunking. The dataset provides sentences with POS tags and IOB chunk labels for training and evaluating chunkers.
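NLTK ships a reader for this corpus; a sketch of loading the standard splits:

```python
import nltk
from nltk.corpus import conll2000

# nltk.download("conll2000")   # one-time corpus download

# The corpus provides standard train and test files with chunk annotations.
train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP", "VP", "PP"])
test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP", "VP", "PP"])

print(len(train_sents), "training sentences")
print(train_sents[0])   # a flat chunk Tree with POS tags and chunk labels
```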
Converting CoNLL Data to IOB Format
For machine learning approaches, we need the data in token-level IOB format:
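NLTK's tree2conlltags helper does this conversion; a sketch:

```python
from nltk.chunk import tree2conlltags
from nltk.corpus import conll2000

train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP", "VP", "PP"])

# Each sentence becomes a list of (word, POS tag, IOB chunk tag) triples.
iob_sentence = tree2conlltags(train_sents[0])
for word, pos, chunk in iob_sentence[:8]:
    print(f"{word:12s} {pos:5s} {chunk}")
```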
Evaluating Chunkers
Chunking evaluation uses precision, recall, and F1 score at the chunk level, not the token level. A chunk is correct only if both its boundaries and type match exactly.
Chunking evaluation counts a predicted chunk as correct only if it exactly matches a gold chunk in both boundaries (start and end positions) and type (NP, VP, etc.). Partial matches receive no credit.
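NLTK chunk parsers provide an evaluate method that scores a parser against gold chunk trees. As a sketch, here is the regex NP chunker from earlier scored on the CoNLL-2000 test set (the exact numbers will depend on your NLTK version):

```python
import nltk
from nltk.corpus import conll2000

test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP"])

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunk_parser = nltk.RegexpParser(grammar)

# evaluate() returns a ChunkScore with chunk-level precision, recall, and F-measure.
score = chunk_parser.evaluate(test_sents)
print(score)
print(round(score.f_measure(), 3))
```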
Chunking as Preprocessing
Chunking serves as a preprocessing step for many NLP tasks. By identifying phrase boundaries, it simplifies downstream processing and provides useful features.
Information Extraction
Chunking helps identify entities and relations in text:
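As an illustrative sketch (not a robust extractor), adjacent NP-verb-NP chunk sequences can serve as crude subject-verb-object candidates:

```python
import nltk

parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def extract_svo_candidates(sentence):
    """Very rough (subject NP, verb, object NP) triples from chunk adjacency."""
    tree = parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))
    items = []
    for node in tree:
        if isinstance(node, nltk.Tree):
            items.append(("NP", " ".join(w for w, t in node.leaves())))
        elif node[1].startswith("VB"):
            items.append(("V", node[0]))
    return [(items[i][1], items[i + 1][1], items[i + 2][1])
            for i in range(len(items) - 2)
            if items[i][0] == "NP" and items[i + 1][0] == "V" and items[i + 2][0] == "NP"]

print(extract_svo_candidates("The company acquired a small startup."))
# Something like [('The company', 'acquired', 'a small startup')]
```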
Keyword Extraction
Noun phrases often contain important keywords:
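A sketch that counts NP chunks as candidate keywords (the normalization here is deliberately simple):

```python
from collections import Counter
import nltk

parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

text = ("Machine learning models require training data. "
        "Good training data improves machine learning models.")

counts = Counter()
for sentence in nltk.sent_tokenize(text):
    tree = parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))
    for node in tree:
        if isinstance(node, nltk.Tree):
            # Lowercase and drop determiners to normalize the phrase.
            words = [w.lower() for w, t in node.leaves() if t != "DT"]
            counts[" ".join(words)] += 1

print(counts.most_common(3))
```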
Question Answering Preprocessing
Chunking can identify answer candidates in question answering:
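A sketch: for a who/what/which question, every NP chunk in the passage is a plausible answer candidate (the candidate generation here is intentionally naive):

```python
import nltk

parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

passage = "The Eiffel Tower was designed by Gustave Eiffel and completed in 1889."

# Every NP chunk in the passage becomes a candidate answer span.
tree = parser.parse(nltk.pos_tag(nltk.word_tokenize(passage)))
candidates = [" ".join(w for w, t in node.leaves())
              for node in tree if isinstance(node, nltk.Tree)]
print(candidates)
# Likely candidates: 'The Eiffel Tower', 'Gustave Eiffel', ...
```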
Training a Chunker with Machine Learning
While regex chunkers are simple, machine learning approaches achieve better accuracy by learning patterns from data. Let's train a simple chunker using features.
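A sketch following the approach popularized in the NLTK book: treat the POS tags as the "words" and the IOB chunk tags as the "tags", then train a unigram tagger over them:

```python
import nltk
from nltk.corpus import conll2000
from nltk.chunk import tree2conlltags, conlltags2tree

class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Train on (POS tag, IOB tag) pairs, ignoring the words themselves.
        train_data = [[(pos, iob) for word, pos, iob in tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, tagged_sentence):
        # Predict an IOB tag for each POS tag, then rebuild a chunk tree.
        pos_tags = [pos for word, pos in tagged_sentence]
        iob_tags = [iob for pos, iob in self.tagger.tag(pos_tags)]
        conlltags = [(word, pos, iob)
                     for (word, pos), iob in zip(tagged_sentence, iob_tags)]
        return conlltags2tree(conlltags)

train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP"])
test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP"])

unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))
```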
The unigram chunker learns the most likely IOB tag for each POS tag. It captures patterns like "DT usually starts an NP (B-NP)" and "JJ inside an NP usually continues it (I-NP)."
Using More Context
A bigram chunker adds more context: NLTK's BigramTagger conditions on the previously assigned chunk tag as well as the current POS tag:
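A sketch that reuses the structure of the unigram chunker above, swapping in nltk.BigramTagger; the unigram backoff is my addition so that unseen contexts still receive a tag:

```python
# Continuing from the UnigramChunker example (same imports and data).
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(pos, iob) for word, pos, iob in tree2conlltags(sent)]
                      for sent in train_sents]
        # BigramTagger conditions on (previous chunk tag, current POS tag).
        self.tagger = nltk.BigramTagger(train_data,
                                        backoff=nltk.UnigramTagger(train_data))

    def parse(self, tagged_sentence):
        pos_tags = [pos for word, pos in tagged_sentence]
        iob_tags = [iob for pos, iob in self.tagger.tag(pos_tags)]
        conlltags = [(word, pos, iob)
                     for (word, pos), iob in zip(tagged_sentence, iob_tags)]
        return conlltags2tree(conlltags)

bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))   # usually a little better than the unigram chunker
```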
Chunking with spaCy
spaCy provides noun phrase chunking through its noun_chunks property, which uses dependency parsing under the hood:
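A sketch using the small English model (en_core_web_sm must be downloaded separately with python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

for chunk in doc.noun_chunks:
    print(f"{chunk.text:22s} root={chunk.root.text:6s} dep={chunk.root.dep_}")
# e.g. "The quick brown fox" (root=fox, dep=nsubj) and "the lazy dog" (root=dog, dep=pobj)
```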
spaCy's noun chunks are derived from the dependency parse, so they benefit from syntactic analysis beyond just POS tag patterns. The root of each chunk is its head noun, and the dep_ attribute shows its grammatical function in the sentence.
Limitations and Practical Considerations
Chunking, while useful, has inherent limitations that practitioners should understand.
The flat, non-recursive nature of chunks means they cannot represent certain linguistic phenomena. A sentence like "The student who failed the exam requested a meeting" contains a relative clause embedded within the subject NP. Flat chunking either splits this incorrectly or produces an overly long NP that obscures internal structure. When you need to understand how phrases nest within phrases, full parsing is required.
Chunking accuracy depends heavily on POS tagging accuracy. If "man" is tagged as a noun when it's actually a verb in "The old man the boats," chunking will produce incorrect results. This error propagation is particularly problematic for domain-specific text where POS taggers trained on news data may struggle with unfamiliar vocabulary and constructions.
The definition of chunks can be ambiguous. Should "the very best coffee" be one NP or should "very best" be a separate ADJP? Different annotation guidelines make different choices, and these inconsistencies affect both training data and evaluation. When comparing chunking systems, ensure they use compatible annotation schemes.
For languages with freer word order than English, chunking becomes more challenging. German verb clusters, Japanese postpositions, and Arabic clitic attachment create patterns that simple sequence models may not capture well. Cross-linguistic chunking remains an active research area.
Despite these limitations, chunking provides a practical balance between simplicity and usefulness. For applications that need phrase boundaries without full syntactic analysis, such as information extraction, keyword identification, and text summarization, chunking offers an efficient and reasonably accurate solution.
Summary
Chunking identifies non-overlapping, non-recursive phrases in text, providing a middle ground between POS tagging and full parsing. The key concepts from this chapter:
- Chunk types include noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). Each type captures a different syntactic unit, with NP chunking being the most studied and practically useful.
- IOB tagging encodes chunk boundaries using per-token labels: B marks the beginning of a chunk, I marks continuation, and O marks tokens outside chunks. This encoding allows chunking to be formulated as a sequence labeling task.
- Regex-based chunking uses patterns over POS tags to identify chunks. NLTK's RegexpParser provides an intuitive way to define chunk grammars, though pattern-based approaches have limited accuracy.
- Machine learning chunkers learn patterns from annotated data like the CoNLL-2000 corpus. Even simple unigram and bigram models outperform hand-crafted rules, and more sophisticated approaches using CRFs or neural networks achieve state-of-the-art results.
- Chunking vs. parsing represents a key tradeoff: chunking is faster and more accurate on its simpler task but captures less syntactic information. Full parsing resolves attachment ambiguities and represents hierarchical structure.
- Practical applications include information extraction, keyword identification, and question answering preprocessing. Chunking provides useful features without the complexity of full parsing.
Key Parameters
When working with chunking in NLTK and spaCy:
NLTK RegexpParser:
- grammar: A string defining chunk rules using regex over POS tags
- {<pattern>}: Chunk pattern (include matching tokens)
- }<pattern>{: Chink pattern (exclude matching tokens)
NLTK chunk evaluation:
- chunk_types: List of chunk types to evaluate (e.g., ["NP", "VP", "PP"])
- Evaluation is chunk-level: exact boundary and type match required
spaCy noun_chunks:
- doc.noun_chunks: Iterator over noun phrases in the document
- chunk.root: Head noun of the chunk
- chunk.root.dep_: Syntactic dependency of the head
The next chapters explore Hidden Markov Models and Conditional Random Fields, the probabilistic models that power production-quality chunkers and other sequence labeling systems.