Chunking: Shallow Parsing for Phrase Identification in NLP

Michael Brenndoerfer · December 16, 2025 · 25 min read

Learn chunking (shallow parsing) to identify noun phrases, verb phrases, and prepositional phrases using IOB tagging, regex patterns, and machine learning with NLTK and spaCy.

Chunking

Sentences contain meaningful groups of words that function as units. In "The quick brown fox jumps over the lazy dog," we naturally perceive "the quick brown fox" as a noun phrase describing the subject, "jumps" as the action, and "over the lazy dog" as a prepositional phrase describing where. Chunking, also called shallow parsing, identifies these non-overlapping segments without building a full syntactic tree.

Chunking sits between part-of-speech tagging and full parsing in complexity. POS tagging labels individual words. Full parsing constructs hierarchical tree structures showing how phrases nest within phrases. Chunking finds a middle ground: it groups consecutive words into flat, non-recursive chunks without showing how chunks relate to each other.

This chapter explores chunking from multiple angles. You'll learn the major chunk types, understand how IOB tagging represents chunk boundaries, implement chunkers using both regular expressions and machine learning, and see how chunking serves as a preprocessing step for information extraction and other downstream tasks.

What Is Chunking?

Chunking identifies contiguous spans of tokens that form syntactic units. Unlike full parsing, which produces tree structures with unlimited nesting, chunking produces a flat sequence of labeled segments. Each token belongs to at most one chunk; tokens that fall outside every chunk are left unchunked.

Chunking (Shallow Parsing)

Chunking is the task of grouping consecutive words into non-overlapping, non-recursive phrases such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). It identifies phrase boundaries without building hierarchical parse trees.

Consider the sentence "The black cat sat on the mat." A chunker might produce:

  • [The black cat] NP: noun phrase (subject)
  • [sat] VP: verb phrase (predicate)
  • [on] PP: prepositional phrase start
  • [the mat] NP: noun phrase (object of preposition)

The key properties of chunking are:

  • Non-overlapping: Each word belongs to at most one chunk
  • Non-recursive: Chunks are flat and never contain other chunks
  • Contiguous: Chunks consist of consecutive words
  • Partial coverage: Some words may not belong to any chunk
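
These properties can be checked mechanically. The sketch below assumes chunks are stored as half-open `(start, end, chunk_type)` token spans, which makes contiguity automatic; the helper names `check_chunk_spans` and `uncovered_tokens` are illustrative and not part of NLTK.

```python
def check_chunk_spans(n_tokens, chunks):
    """Return a list of problems found in (start, end, chunk_type) spans.

    Spans are half-open token ranges [start, end). An empty list means the
    spans are in range, non-empty, and non-overlapping.
    """
    problems = []
    previous_end = 0
    for start, end, chunk_type in sorted(chunks):
        if not (0 <= start < end <= n_tokens):
            problems.append(f"{chunk_type}[{start}:{end}] is out of range or empty")
            continue
        if start < previous_end:
            problems.append(f"{chunk_type}[{start}:{end}] overlaps the previous chunk")
        previous_end = max(previous_end, end)
    return problems


def uncovered_tokens(n_tokens, chunks):
    """Indices of tokens that belong to no chunk (partial coverage)."""
    covered = {i for start, end, _ in chunks for i in range(start, end)}
    return [i for i in range(n_tokens) if i not in covered]


tokens = "The black cat sat on the mat .".split()
chunks = [(0, 3, "NP"), (3, 4, "VP"), (4, 5, "PP"), (5, 7, "NP")]

print(check_chunk_spans(len(tokens), chunks))  # well-formed spans give no problems
print(uncovered_tokens(len(tokens), chunks))   # the final period is unchunked
```

Representing chunks as spans rather than word lists keeps the overlap check a single comparison against the previous span's end.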

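Let's see chunking in action with NLTK. This first example uses ne_chunk, NLTK's built-in named entity chunker, which groups tokens into entity chunks such as PERSON and GPE rather than syntactic phrases such as NP and VP: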

In[2]:
Code
from nltk import word_tokenize, pos_tag
from nltk.chunk import ne_chunk

# Example sentence
sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)
tagged = pos_tag(tokens)

# ne_chunk returns an nltk.Tree; named entities become labeled subtrees
chunked = ne_chunk(tagged)
Out[3]:
Console
Chunking Example
==================================================

Sentence: The quick brown fox jumps over the lazy dog.

POS Tags: The/DT quick/JJ brown/NN fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN ./.

Chunk Tree:
(S
  The/DT
  quick/JJ
  brown/NN
  fox/NN
  jumps/VBZ
  over/IN
  the/DT
  lazy/JJ
  dog/NN
  ./.)

The tree above is flat because ne_chunk looks only for named entities and this sentence has none. ne_chunk returns a tree structure, but we can flatten it to see the chunks:

In[4]:
Code
def tree_to_chunks(tree):
    """Extract chunks from NLTK tree as (chunk_type, words) pairs."""
    chunks = []
    for subtree in tree:
        if hasattr(subtree, "label"):
            # This is a chunk
            chunk_type = subtree.label()
            words = [word for word, tag in subtree.leaves()]
            chunks.append((chunk_type, " ".join(words)))
    return chunks


# Also try a sentence with named entities
ne_sentence = "Barack Obama visited New York City yesterday."
ne_tokens = word_tokenize(ne_sentence)
ne_tagged = pos_tag(ne_tokens)
ne_chunked = ne_chunk(ne_tagged)
ne_chunks = tree_to_chunks(ne_chunked)
Out[5]:
Console
Named Entity Chunks
==================================================

Sentence: Barack Obama visited New York City yesterday.

Chunks found:
  [Barack] → PERSON
  [Obama] → PERSON
  [New York City] → GPE

Chunk Types

Different chunk types capture different syntactic units. The most common types in English are noun phrases, verb phrases, and prepositional phrases, though chunking schemes may include additional categories.

Noun Phrases (NP)

Noun phrases are the most important and well-studied chunk type. They typically consist of an optional determiner, optional adjectives, and one or more nouns. NP chunking is sometimes called "base NP chunking" because it identifies only the innermost, non-recursive noun phrases.

In[6]:
Code
# Examples of noun phrases with varying complexity
np_examples = [
    ("the cat", "simple: determiner + noun"),
    ("a big red ball", "adjectives: det + adj + adj + noun"),
    ("President Obama", "proper: title + name"),
    ("three blind mice", "numeral: number + adj + noun"),
    ("the New York Stock Exchange", "complex: nested proper noun"),
]

# POS patterns that form noun phrases
np_patterns = [
    "DT NN",  # the cat
    "DT JJ NN",  # the black cat
    "DT JJ JJ NN",  # the big black cat
    "NNP NNP",  # Barack Obama
    "CD NN",  # three cats
    "PRP$ NN",  # my cat
    "JJ NN",  # black cat
    "NN NN",  # computer screen
]
Out[7]:
Console
Noun Phrase Examples
==================================================

  'the cat'
    POS: DT NN
    Type: simple: determiner + noun

  'a big red ball'
    POS: DT JJ JJ NN
    Type: adjectives: det + adj + adj + noun

  'President Obama'
    POS: NNP NNP
    Type: proper: title + name

  'three blind mice'
    POS: CD IN NNS
    Type: numeral: number + adj + noun

  'the New York Stock Exchange'
    POS: DT NNP NNP NNP NNP
    Type: complex: nested proper noun

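The structure of noun phrases follows grammatical patterns. A determiner (the, a, my) often starts the phrase, adjectives modify the head noun, and the head noun itself can be singular, plural, or proper. Understanding these patterns is key to building effective chunkers. Note that in the output above the tagger labels "blind" as IN rather than JJ. Short phrases give a statistical tagger little context, and tagging errors like this carry over to any chunker built on POS tags.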

Verb Phrases (VP)

Verb phrases contain the main verb and its auxiliaries. In chunking, we typically identify just the verbal elements, not the objects or complements that would be included in a full verb phrase in traditional grammar.

In[8]:
Code
# Examples of verb phrases
vp_examples = [
    ("runs", "simple present"),
    ("is running", "progressive"),
    ("has been running", "perfect progressive"),
    ("will have been running", "future perfect progressive"),
    ("can swim", "modal + base"),
    ("should have called", "modal + perfect"),
]

# POS patterns for verb phrases
vp_patterns = [
    "VBZ",  # runs
    "VBG",  # running
    "VBD",  # ran
    "MD VB",  # can run
    "VBZ VBG",  # is running
    "VBZ VBN",  # is eaten
    "MD VB VBN",  # will be eaten
]
Out[9]:
Console
Verb Phrase Examples
==================================================

  'runs'
    POS: NNS
    Type: simple present

  'is running'
    POS: VBZ VBG
    Type: progressive

  'has been running'
    POS: VBZ VBN VBG
    Type: perfect progressive

  'will have been running'
    POS: MD VB VBN VBG
    Type: future perfect progressive

  'can swim'
    POS: MD VB
    Type: modal + base

  'should have called'
    POS: MD VB VBN
    Type: modal + perfect

Prepositional Phrases (PP)

Prepositional phrases begin with a preposition and typically end with a noun phrase. Flat chunking schemes usually put only the preposition in the PP chunk and tag the following noun phrase as a separate NP chunk; other schemes group the preposition and its noun phrase together.

In[10]:
Code
# Examples of prepositional phrases
pp_examples = [
    ("in the house", "location"),
    ("with great care", "manner"),
    ("after the meeting", "time"),
    ("under the old oak tree", "complex location"),
    ("for my best friend", "beneficiary"),
]
Out[11]:
Console
Prepositional Phrase Examples
==================================================

  'in the house'
    POS: IN DT NN
    Type: location

  'with great care'
    POS: IN JJ NN
    Type: manner

  'after the meeting'
    POS: IN DT NN
    Type: time

  'under the old oak tree'
    POS: IN DT JJ NN NN
    Type: complex location

  'for my best friend'
    POS: IN PRP$ JJS NN
    Type: beneficiary

Other Chunk Types

Depending on the annotation scheme, chunkers may identify additional phrase types:

In[12]:
Code
other_chunks = {
    "ADJP": ("Adjective phrase", "very happy, quite tired"),
    "ADVP": ("Adverb phrase", "very quickly, rather slowly"),
    "SBAR": ("Subordinate clause", "that he left, if it rains"),
    "PRT": ("Particle", "give up, turn on"),
    "CONJP": ("Conjunction phrase", "as well as, rather than"),
}
Out[13]:
Console
Additional Chunk Types
==================================================

  ADJP: Adjective phrase
    Examples: very happy, quite tired

  ADVP: Adverb phrase
    Examples: very quickly, rather slowly

  SBAR: Subordinate clause
    Examples: that he left, if it rains

  PRT: Particle
    Examples: give up, turn on

  CONJP: Conjunction phrase
    Examples: as well as, rather than

IOB Tagging for Chunks

Just as BIO tagging encodes entity boundaries for named entity recognition, IOB tagging encodes chunk boundaries. Each token receives a tag indicating whether it begins a chunk (B), continues a chunk (I), or is outside all chunks (O).

IOB Tagging

IOB (Inside-Outside-Beginning) tagging represents chunk boundaries using per-token labels. B-NP marks the first word of a noun phrase, I-NP marks subsequent words in the same NP, and O marks words outside any chunk. This encoding is identical to BIO tagging for NER.

The terminology varies slightly. Some sources use "BIO" while others use "IOB." Additionally, there are two variants:

  • IOB1: B tag is used only when a chunk immediately follows another chunk of the same type; otherwise chunks start with an I tag
  • IOB2: B tag is always used for the first token of a chunk (more common)
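
The difference between the two variants is easiest to see in a converter. The following is a minimal sketch, assuming tags are plain strings such as "I-NP" or "O"; the function name `iob1_to_iob2` is illustrative.

```python
def iob1_to_iob2(tags):
    """Convert an IOB1 tag sequence to IOB2.

    In IOB1 a chunk may open with an I- tag; in IOB2 every chunk opens with B-.
    """
    converted = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            chunk_type = tag[2:]
            # An I- tag opens a new chunk when the previous token was outside
            # any chunk or belonged to a chunk of a different type.
            if previous == "O" or previous[2:] != chunk_type:
                tag = f"B-{chunk_type}"
        converted.append(tag)
        previous = tag
    return converted


# "The fox jumps over the dog" in IOB1: no B tags are needed because no chunk
# directly follows another chunk of the same type.
iob1 = ["I-NP", "I-NP", "I-VP", "I-PP", "I-NP", "I-NP"]
print(iob1_to_iob2(iob1))
# ['B-NP', 'I-NP', 'B-VP', 'B-PP', 'B-NP', 'I-NP']
```

Existing B tags pass through unchanged, so a sequence that is already IOB2 is returned as-is.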

Let's see IOB tagging applied to chunks:

In[14]:
Code
# Example sentence with IOB chunk tags
sentence = "The quick brown fox jumps over the lazy dog"
tokens = sentence.split()

# IOB tags for chunks
iob_tags = [
    "B-NP",  # The
    "I-NP",  # quick
    "I-NP",  # brown
    "I-NP",  # fox
    "B-VP",  # jumps
    "B-PP",  # over
    "B-NP",  # the
    "I-NP",  # lazy
    "I-NP",  # dog
]
Out[15]:
Console
IOB Chunk Tagging
==================================================

Sentence: The quick brown fox jumps over the lazy dog

Token        IOB Tag    Meaning
---------------------------------------------
The          B-NP       Begin noun phrase
quick        I-NP       Inside noun phrase
brown        I-NP       Inside noun phrase
fox          I-NP       Inside noun phrase
jumps        B-VP       Begin verb phrase
over         B-PP       Begin prep phrase
the          B-NP       Begin noun phrase
lazy         I-NP       Inside noun phrase
dog          I-NP       Inside noun phrase
Out[16]:
Visualization
Horizontal sequence diagram showing tokens with colored boxes indicating IOB chunk tags for NP, VP, and PP.
IOB tagging applied to syntactic chunks. Each token receives a tag encoding its position within phrase boundaries. The sentence contains two noun phrases (NP), one verb phrase (VP), and one prepositional phrase (PP). Prepositional phrases often contain embedded noun phrases, but in flat chunking we treat them as separate chunks.

Converting Between Chunks and IOB Tags

We need utilities to convert between chunk spans and IOB sequences, just as we did for named entity recognition in the BIO tagging chapter:

In[17]:
Code
def chunks_to_iob(tokens, chunks):
    """
    Convert chunk spans to IOB tags.

    Args:
        tokens: List of tokens
        chunks: List of (start_idx, end_idx, chunk_type) tuples

    Returns:
        List of IOB tags, one per token
    """
    tags = ["O"] * len(tokens)

    for start, end, chunk_type in sorted(chunks, key=lambda x: x[0]):
        if start < 0 or end > len(tokens) or start >= end:
            continue

        tags[start] = f"B-{chunk_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{chunk_type}"

    return tags


def iob_to_chunks(tokens, tags):
    """
    Extract chunk spans from IOB tags.

    Returns:
        List of (start_idx, end_idx, chunk_type, text) tuples
    """
    chunks = []
    current_chunk = None  # (start_idx, chunk_type)

    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if tag.startswith("B-"):
            # Close any open chunk
            if current_chunk is not None:
                start, ctype = current_chunk
                chunks.append((start, i, ctype))

            # Start new chunk
            chunk_type = tag[2:]
            current_chunk = (i, chunk_type)

        elif tag.startswith("I-"):
            chunk_type = tag[2:]

            # Handle orphan I tag
            if current_chunk is None:
                current_chunk = (i, chunk_type)
            elif current_chunk[1] != chunk_type:
                # Type mismatch: close old, start new
                start, ctype = current_chunk
                chunks.append((start, i, ctype))
                current_chunk = (i, chunk_type)

        else:  # O tag
            if current_chunk is not None:
                start, ctype = current_chunk
                chunks.append((start, i, ctype))
                current_chunk = None

    # Handle chunk at end
    if current_chunk is not None:
        start, ctype = current_chunk
        chunks.append((start, len(tokens), ctype))

    # Add text to each chunk
    result = []
    for start, end, ctype in chunks:
        text = " ".join(tokens[start:end])
        result.append((start, end, ctype, text))

    return result


# Test the converters
test_tokens = "The quick brown fox jumps over the lazy dog".split()
test_chunks = [
    (0, 4, "NP"),  # The quick brown fox
    (4, 5, "VP"),  # jumps
    (5, 6, "PP"),  # over
    (6, 9, "NP"),  # the lazy dog
]

test_iob = chunks_to_iob(test_tokens, test_chunks)
recovered_chunks = iob_to_chunks(test_tokens, test_iob)
Out[18]:
Console
Chunk to IOB Conversion
==================================================

Original chunks:
  [0:4] 'The quick brown fox' → NP
  [4:5] 'jumps' → VP
  [5:6] 'over' → PP
  [6:9] 'the lazy dog' → NP

IOB tags:
  The          → B-NP
  quick        → I-NP
  brown        → I-NP
  fox          → I-NP
  jumps        → B-VP
  over         → B-PP
  the          → B-NP
  lazy         → I-NP
  dog          → I-NP

Recovered chunks:
  [0:4] 'The quick brown fox' → NP
  [4:5] 'jumps' → VP
  [5:6] 'over' → PP
  [6:9] 'the lazy dog' → NP
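
The converter above silently repairs malformed sequences such as an I tag with no preceding B tag. When tags come from a model, it can be useful to flag such cases instead. Here is a small sketch of an IOB2 validator; `find_iob2_violations` is a hypothetical helper, not an NLTK function.

```python
def find_iob2_violations(tags):
    """Return (index, tag, reason) for each I- tag that does not continue a chunk."""
    violations = []
    previous = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            if previous == "O":
                violations.append((i, tag, "I- tag follows O or starts the sequence"))
            elif previous[2:] != tag[2:]:
                violations.append((i, tag, f"I- tag type differs from preceding {previous}"))
        previous = tag
    return violations


predicted = ["B-NP", "I-NP", "O", "I-NP", "B-VP", "I-PP"]
for index, tag, reason in find_iob2_violations(predicted):
    print(index, tag, reason)
```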

Chunking vs. Full Parsing

Understanding the difference between chunking and full parsing clarifies when to use each approach.

Full syntactic parsing produces a complete tree structure showing how phrases nest within phrases. Consider "I saw the man with the telescope." A full parse would show the PP "with the telescope" attaching either to "saw" (I used the telescope to see) or to "the man" (the man had the telescope). This ambiguity requires semantic knowledge to resolve.

In[19]:
Code
# Illustrate the difference
example = "I saw the man with the telescope"

# Chunking output: flat, non-recursive
chunk_analysis = {
    "chunks": [
        ("I", "NP"),
        ("saw", "VP"),
        ("the man", "NP"),
        ("with", "PP"),
        ("the telescope", "NP"),
    ],
    "structure": "flat sequence of chunks",
}

# Full parsing output: hierarchical tree
parse_analysis = {
    "interpretation_1": "VP[saw NP[the man] PP[with NP[the telescope]]]",
    "interpretation_2": "VP[saw NP[the man PP[with NP[the telescope]]]]",
    "structure": "nested tree with attachment decisions",
}
Out[20]:
Console
Chunking vs. Full Parsing
============================================================

Example: 'I saw the man with the telescope'

--- Chunking (Shallow Parsing) ---
Output: Flat sequence of non-overlapping chunks
  [I] → NP
  [saw] → VP
  [the man] → NP
  [with] → PP
  [the telescope] → NP

Note: flat sequence of chunks

--- Full Parsing ---
Output: Hierarchical tree with attachment decisions

Interpretation 1 (telescope is instrument):
  VP[saw NP[the man] PP[with NP[the telescope]]]

Interpretation 2 (man has telescope):
  VP[saw NP[the man PP[with NP[the telescope]]]]

Note: nested tree with attachment decisions

The key tradeoffs are:

  • Speed: Chunking is much faster than full parsing
  • Accuracy: Chunking achieves higher accuracy on its simpler task
  • Information: Full parsing captures more syntactic detail
  • Ambiguity: Chunking avoids attachment decisions that require semantics

For many practical applications like information extraction, named entity recognition, and text classification, chunking provides sufficient structure without the complexity and errors of full parsing.

Out[21]:
Visualization
Side-by-side diagram showing flat chunking output on left and hierarchical parse tree on right.
Comparison of chunking and full parsing output structures. Chunking produces a flat sequence of labeled segments (left), while full parsing produces a hierarchical tree showing phrase nesting (right). Chunking avoids difficult attachment decisions that would require semantic knowledge.

Regex-Based Chunking with NLTK

NLTK provides a RegexpParser that lets you define chunk patterns using regular expressions over POS tags. This approach is intuitive and works well for simple patterns.

In[22]:
Code
from nltk import RegexpParser

# Define a chunk grammar for NP, VP, and PP chunks
# Patterns use POS tags, not words
# NP: optional determiner/possessive, adjectives, nouns
# VP: optional modal, verbs
# PP: just the preposition
np_grammar = r"""
    NP: {<DT|PRP\$>?<JJ>*<NN.*>+}
    VP: {<MD>?<VB.*>+}
    PP: {<IN>}
"""

# Create the chunker
chunker = RegexpParser(np_grammar)

# Test sentences
test_sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A beautiful sunset illuminated the entire valley.",
    "My old computer finally crashed yesterday.",
]

regex_results = []
for sent in test_sentences:
    tokens = word_tokenize(sent)
    tagged = pos_tag(tokens)
    tree = chunker.parse(tagged)
    chunks = tree_to_chunks(tree)
    regex_results.append((sent, tagged, chunks))
Out[23]:
Console
Regex-Based Chunking Results
============================================================

Sentence: The quick brown fox jumps over the lazy dog.
POS: The/DT quick/JJ brown/NN fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN ./.
Chunks:
  [The quick brown fox] → NP
  [jumps] → VP
  [over] → PP
  [the lazy dog] → NP

Sentence: A beautiful sunset illuminated the entire valley.
POS: A/DT beautiful/JJ sunset/NN illuminated/VBD the/DT entire/JJ valley/NN ./.
Chunks:
  [A beautiful sunset] → NP
  [illuminated] → VP
  [the entire valley] → NP

Sentence: My old computer finally crashed yesterday.
POS: My/PRP$ old/JJ computer/NN finally/RB crashed/VBN yesterday/NN ./.
Chunks:
  [My old computer] → NP
  [crashed] → VP
  [yesterday] → NP

Let's break down the grammar pattern for noun phrases:

In[24]:
Code
# Explanation of the NP pattern
np_pattern_parts = {
    "<DT|PRP$>?": "Optional determiner (the, a) or possessive pronoun (my, your)",
    "<JJ>*": "Zero or more adjectives",
    "<NN.*>+": "One or more nouns (NN, NNS, NNP, NNPS)",
}

# More complex grammar with additional patterns
# NP patterns: basic NP, proper noun sequences, pronouns, number + nouns
# VP: verb phrases
# PP: preposition + NP (cascaded: reuses the NP chunks built by the rules above)
# ADVP: adverb phrases
extended_grammar = r"""
    NP: {<DT|PRP\$>?<JJ>*<NN.*>+}
        {<NNP>+}
        {<PRP>}
        {<CD><NN.*>+}
    VP: {<MD>?<VB.*>+}
    PP: {<IN><NP>}
    ADVP: {<RB.*>+}
"""
Out[25]:
Console
NP Pattern Breakdown
==================================================

  <DT|PRP$>?
    → Optional determiner (the, a) or possessive pronoun (my, your)

  <JJ>*
    → Zero or more adjectives

  <NN.*>+
    → One or more nouns (NN, NNS, NNP, NNPS)


Regex Quantifiers:
  ?  = zero or one (optional)
  *  = zero or more
  +  = one or more
  |  = alternation (or)
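
To build intuition for how a tag pattern matches, the sketch below encodes a POS tag sequence as a string like `<DT><JJ><NN>` and runs an ordinary Python regex over it. This is only an illustration of the idea, not NLTK's implementation: the pattern is a hand-written translation of `<DT|PRP\$>?<JJ>*<NN.*>+`, and `find_np_spans` is a hypothetical helper.

```python
import re

# Hand-written regex equivalent of the tag pattern <DT|PRP\$>?<JJ>*<NN.*>+
# applied to a "<TAG><TAG>..." string. [^>]* keeps a match inside one tag.
NP_PATTERN = re.compile(r"(?:<(?:DT|PRP\$)>)?(?:<JJ>)*(?:<NN[^>]*>)+")


def find_np_spans(tags):
    """Return (start, end) token spans whose tag sequence matches NP_PATTERN."""
    encoded = "".join(f"<{tag}>" for tag in tags)

    # Character offset where each token's "<TAG>" begins, plus the final offset.
    offsets = []
    position = 0
    for tag in tags:
        offsets.append(position)
        position += len(tag) + 2  # two extra characters for "<" and ">"
    offsets.append(position)
    offset_to_index = {offset: i for i, offset in enumerate(offsets)}

    spans = []
    for match in NP_PATTERN.finditer(encoded):
        spans.append((offset_to_index[match.start()], offset_to_index[match.end()]))
    return spans


# Tags for "The quick brown fox jumps over the lazy dog ."
fox_tags = ["DT", "JJ", "NN", "NN", "VBZ", "IN", "DT", "JJ", "NN", "."]
print(find_np_spans(fox_tags))  # [(0, 4), (6, 9)]
```

Because every sub-pattern starts with `<` and ends with `>`, matches always begin and end on token boundaries, so character offsets map cleanly back to token indices.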

Chinking: Defining What's NOT in a Chunk

Sometimes it's easier to define what shouldn't be in a chunk than what should. Chinking removes tokens from chunks:

In[26]:
Code
# Grammar with chinking
# Pattern explanation:
# - {<.*>+} chunks everything into NP
# - }<VB.*|IN|CC|\.>{ removes (chinks) verbs, prepositions, conjunctions, periods
chink_grammar = r"""
    NP: {<.*>+}
        }<VB.*|IN|CC|\.>{
"""

chink_chunker = RegexpParser(chink_grammar)

chink_test = "The cat and the dog sat on the mat."
chink_tokens = word_tokenize(chink_test)
chink_tagged = pos_tag(chink_tokens)
chink_tree = chink_chunker.parse(chink_tagged)
chink_chunks = tree_to_chunks(chink_tree)
Out[27]:
Console
Chinking Example
==================================================

Sentence: The cat and the dog sat on the mat.

POS: The/DT cat/NN and/CC the/DT dog/NN sat/VBD on/IN the/DT mat/NN ./.

Chinking strategy:
  1. Chunk everything: {<.*>+}
  2. Remove (chink) verbs, prepositions, conjunctions

Result:
  [The cat] → NP
  [the dog] → NP
  [the mat] → NP

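The }...{ syntax defines a chink: any tag sequence matching the pattern is removed from the chunk that contains it. A chink in the middle of a chunk splits it in two, so this grammar effectively splits chunks at verbs, prepositions, and other non-NP elements.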
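
The same strategy can be written in a few lines of plain Python, which makes the two steps explicit. This is a sketch of the idea rather than NLTK's implementation; the `chink` function and the hand-tagged input are illustrative.

```python
import re

# Tags to remove from chunks, mirroring }<VB.*|IN|CC|\.>{ in the grammar above.
CHINK_TAG = re.compile(r"VB.*|IN|CC|\.")


def chink(tagged_tokens):
    """Chunk everything, then split the chunk wherever a chink tag occurs."""
    chunks, current = [], []
    for word, tag in tagged_tokens:
        if CHINK_TAG.fullmatch(tag):
            # A chinked token closes the current chunk and joins no chunk itself.
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(word)
    if current:
        chunks.append(current)
    return [" ".join(words) for words in chunks]


tagged = [
    ("The", "DT"), ("cat", "NN"), ("and", "CC"), ("the", "DT"), ("dog", "NN"),
    ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN"), (".", "."),
]
print(chink(tagged))  # ['The cat', 'the dog', 'the mat']
```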

Limitations of Regex Chunking

Regex-based chunking is simple and fast but has limitations:

In[28]:
Code
# Cases where regex chunking struggles
problematic_cases = [
    {
        "sentence": "The man I saw yesterday left.",
        "issue": "Relative clause interrupts NP",
    },
    {
        "sentence": "Very very quickly running water.",
        "issue": "Ambiguous modifier scope",
    },
    {
        "sentence": "The old man the boats.",
        "issue": "Garden path: 'man' is a verb here",
    },
]

for case in problematic_cases:
    tokens = word_tokenize(case["sentence"])
    tagged = pos_tag(tokens)
    tree = chunker.parse(tagged)
    case["tagged"] = tagged
    case["chunks"] = tree_to_chunks(tree)
Out[29]:
Console
Regex Chunking Limitations
============================================================

Sentence: The man I saw yesterday left.
Issue: Relative clause interrupts NP
POS: The/DT man/NN I/PRP saw/VBD yesterday/NN left/VBD ./.
Chunks found:
  [The man] → NP
  [saw] → VP
  [yesterday] → NP
  [left] → VP

Sentence: Very very quickly running water.
Issue: Ambiguous modifier scope
POS: Very/RB very/RB quickly/RB running/VBG water/NN ./.
Chunks found:
  [running] → VP
  [water] → NP

Sentence: The old man the boats.
Issue: Garden path: 'man' is a verb here
POS: The/DT old/JJ man/NN the/DT boats/NNS ./.
Chunks found:
  [The old man] → NP
  [the boats] → NP

Regex patterns match local sequences without considering broader context. They rely entirely on POS tags, so POS tagging errors propagate. They cannot handle discontinuous constituents or truly recursive structures.

Using the CoNLL-2000 Dataset

The CoNLL-2000 shared task established a standard benchmark for chunking. The dataset provides sentences with POS tags and IOB chunk labels for training and evaluating chunkers.

In[30]:
Code
from nltk.corpus import conll2000

# Load the data
train_sents = conll2000.chunked_sents(
    "train.txt", chunk_types=["NP", "VP", "PP"]
)
test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP", "VP", "PP"])

# Look at the data format
sample_sent = train_sents[0]
Out[31]:
Console
CoNLL-2000 Dataset
==================================================

Training sentences: 8,936
Test sentences: 2,012


Sample sentence structure:
(S
  (NP Confidence/NN)
  (PP in/IN)
  (NP the/DT pound/NN)
  (VP is/VBZ widely/RB expected/VBN to/TO take/VB)
  (NP another/DT sharp/JJ dive/NN)
  if/IN
  (NP trade/NN figures/NNS)
  (PP for/IN)
  (NP September/NNP)
  ,/,
  due/JJ
  (PP for/IN)
  (NP release/NN)
  (NP tomorrow/NN)
  ,/,
  (VP fail/VB to/TO show/VB)
  (NP a/DT substantial/JJ improvement/NN)
  (PP from/IN)
  (NP July/NNP and/CC August/NNP)
  (NP 's/POS near-record/JJ deficits/NNS)
  ./.)

Chunk type distribution (first 1000 sentences):
  NP: 6,211
  VP: 2,399
  PP: 2,397

Converting CoNLL Data to IOB Format

For machine learning approaches, we need the data in token-level IOB format:

In[32]:
Code
def tree_to_iob_triples(tree):
    """Convert NLTK tree to (word, pos, iob) triples."""
    triples = []
    for subtree in tree:
        if hasattr(subtree, "label"):
            # This is a chunk
            chunk_type = subtree.label()
            for i, (word, pos) in enumerate(subtree.leaves()):
                if i == 0:
                    iob = f"B-{chunk_type}"
                else:
                    iob = f"I-{chunk_type}"
                triples.append((word, pos, iob))
        else:
            # Single word, not in a chunk
            word, pos = subtree
            triples.append((word, pos, "O"))
    return triples


# Convert sample
sample_iob = tree_to_iob_triples(sample_sent)
Out[33]:
Console
IOB Triple Format
==================================================

  Word          POS    IOB
  -----------------------------------
  Confidence     NN     B-NP
  in             IN     B-PP
  the            DT     B-NP
  pound          NN     I-NP
  is             VBZ    B-VP
  widely         RB     I-VP
  expected       VBN    I-VP
  to             TO     I-VP
  take           VB     I-VP
  another        DT     B-NP
  sharp          JJ     I-NP
  dive           NN     I-NP
  if             IN     O
  trade          NN     B-NP
  figures        NNS    I-NP
  ... (22 more tokens)

Evaluating Chunkers

Chunking evaluation uses precision, recall, and F1 score at the chunk level, not the token level. A chunk is correct only if both its boundaries and type match exactly.

Chunk-Level Evaluation

Chunking evaluation counts a predicted chunk as correct only if it exactly matches a gold chunk in both boundaries (start and end positions) and type (NP, VP, etc.). Partial matches receive no credit.

In[34]:
Code
def evaluate_chunker(chunker, test_trees):
    """Evaluate a chunker on test data."""
    true_positives = 0
    false_positives = 0
    false_negatives = 0

    for tree in test_trees:
        # Get gold chunks
        gold_chunks = set()
        for subtree in tree:
            if hasattr(subtree, "label"):
                words = tuple(word for word, _ in subtree.leaves())
                gold_chunks.add((subtree.label(), words))

        # Get predicted chunks
        tagged = [(word, pos) for word, pos in tree.pos()]
        pred_tree = chunker.parse(tagged)
        pred_chunks = set()
        for subtree in pred_tree:
            if hasattr(subtree, "label"):
                words = tuple(word for word, _ in subtree.leaves())
                pred_chunks.add((subtree.label(), words))

        # Count matches
        true_positives += len(gold_chunks & pred_chunks)
        false_positives += len(pred_chunks - gold_chunks)
        false_negatives += len(gold_chunks - pred_chunks)

    precision = (
        true_positives / (true_positives + false_positives)
        if (true_positives + false_positives) > 0
        else 0
    )
    recall = (
        true_positives / (true_positives + false_negatives)
        if (true_positives + false_negatives) > 0
        else 0
    )
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0
    )

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


# Evaluate our regex chunker
regex_eval = evaluate_chunker(chunker, test_sents[:500])
Out[35]:
Console
Regex Chunker Evaluation
==================================================

Evaluated on 500 test sentences

Precision: 0.00%
Recall:    0.00%
F1 Score:  0.00%

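True Positives:  0
False Positives: 0
False Negatives: 5,117

These all-zero scores come from the evaluation code rather than from the chunker. On a chunk tree whose leaves are (word, POS) tuples, tree.pos() returns ((word, POS), label) pairs, so the grammar is matched against chunk labels such as NP and S instead of POS tags. No pattern matches, the chunker predicts nothing, and every gold chunk counts as a false negative. Building the input from tree.leaves() instead gives the chunker the (word, POS) pairs it expects.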
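
Chunk-level scoring can also be computed directly from IOB tag sequences by comparing (start, end, type) spans. Comparing spans rather than word tuples keeps two identical phrases at different positions distinct. The sketch below scores a single sentence and assumes IOB2 input; `iob_to_spans` and `chunk_prf` are illustrative names. For a corpus, the counts would be summed over sentences before computing the ratios.

```python
def iob_to_spans(tags):
    """Collect (start, end, chunk_type) spans from an IOB2 tag sequence."""
    spans, start, chunk_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        continues = tag.startswith("I-") and tag[2:] == chunk_type
        if start is not None and not continues:
            spans.add((start, i, chunk_type))
            start, chunk_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, chunk_type = i, tag[2:]
    return spans


def chunk_prf(gold_tags, pred_tags):
    """Exact-match precision, recall, and F1 over chunk spans."""
    gold, pred = iob_to_spans(gold_tags), iob_to_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Gold: [The quick brown fox] [jumps] [over] [the lazy dog]
gold = ["B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP", "I-NP"]
# Prediction splits the first NP into [The quick brown] [fox]
pred = ["B-NP", "I-NP", "I-NP", "B-NP", "B-VP", "B-PP", "B-NP", "I-NP", "I-NP"]

precision, recall, f1 = chunk_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.60 R=0.75 F1=0.67
```

The split noun phrase earns no partial credit: both predicted halves count as false positives, and the gold chunk counts as a false negative.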

Chunking as Preprocessing

Chunking serves as a preprocessing step for many NLP tasks. By identifying phrase boundaries, it simplifies downstream processing and provides useful features.

Information Extraction

Chunking helps identify entities and relations in text:

In[36]:
Code
def extract_np_vp_np_triples(sentence):
    """Extract subject-verb-object triples using chunking."""
    tokens = word_tokenize(sentence)
    tagged = pos_tag(tokens)
    tree = chunker.parse(tagged)

    # Collect chunks in order
    chunks = []
    for subtree in tree:
        if hasattr(subtree, "label"):
            text = " ".join(word for word, _ in subtree.leaves())
            chunks.append((subtree.label(), text))

    # Look for NP-VP-NP patterns
    triples = []
    for i in range(len(chunks) - 2):
        if (
            chunks[i][0] == "NP"
            and chunks[i + 1][0] == "VP"
            and chunks[i + 2][0] == "NP"
        ):
            subject = chunks[i][1]
            verb = chunks[i + 1][1]
            obj = chunks[i + 2][1]
            triples.append((subject, verb, obj))

    return triples


# Test extraction
extraction_examples = [
    "The cat chased the mouse.",
    "Scientists discovered a new species.",
    "The company launched its new product.",
]

extraction_results = []
for sent in extraction_examples:
    triples = extract_np_vp_np_triples(sent)
    extraction_results.append((sent, triples))
Out[37]:
Console
Relation Extraction via Chunking
============================================================

Sentence: The cat chased the mouse.
  Subject: The cat
  Verb:    chased
  Object:  the mouse

Sentence: Scientists discovered a new species.
  Subject: Scientists
  Verb:    discovered
  Object:  a new species

Sentence: The company launched its new product.
  Subject: The company
  Verb:    launched
  Object:  its new product

Keyword Extraction

Noun phrases often contain important keywords:

In[38]:
Code
import nltk
from collections import Counter


def extract_noun_phrases(text):
    """Extract all noun phrases from text."""
    sentences = nltk.sent_tokenize(text)
    all_nps = []

    for sent in sentences:
        tokens = word_tokenize(sent)
        tagged = pos_tag(tokens)
        tree = chunker.parse(tagged)

        for subtree in tree:
            if hasattr(subtree, "label") and subtree.label() == "NP":
                np_text = " ".join(word.lower() for word, _ in subtree.leaves())
                all_nps.append(np_text)

    return all_nps


# Example text
sample_text = """
Machine learning is transforming natural language processing. 
Deep neural networks have achieved remarkable results on many NLP tasks.
Transformer models like BERT and GPT have set new benchmarks.
These language models learn rich representations from large text corpora.
"""

nps = extract_noun_phrases(sample_text)
np_counts = Counter(nps)
Out[39]:
Console
Noun Phrase Extraction for Keywords
==================================================

Text excerpt: '
Machine learning is transforming natural language processing. 
Deep neural netw...'

Total NPs extracted: 12

Most frequent noun phrases:
  machine learning: 1
  natural language processing: 1
  deep neural networks: 1
  remarkable results: 1
  many nlp tasks: 1
  transformer models: 1
  bert: 1
  gpt: 1
  new benchmarks: 1
  these language models: 1

Question Answering Preprocessing

Chunking can identify answer candidates in question answering:

In[40]:
Code
def find_answer_candidates(question, context):
    """Find potential answer spans in context based on question type."""
    # Determine question type
    q_lower = question.lower()
    if q_lower.startswith("who"):
        target_chunk = "NP"  # Look for noun phrases (people)
    elif q_lower.startswith("where"):
        target_chunk = "PP"  # Look for prepositional phrases (locations)
    elif q_lower.startswith("what"):
        target_chunk = "NP"  # Look for noun phrases (things)
    else:
        target_chunk = "NP"  # Default

    # Extract chunks from context
    tokens = word_tokenize(context)
    tagged = pos_tag(tokens)
    tree = chunker.parse(tagged)

    candidates = []
    for subtree in tree:
        if hasattr(subtree, "label") and subtree.label() == target_chunk:
            text = " ".join(word for word, _ in subtree.leaves())
            candidates.append(text)

    return candidates


# Example
qa_question = "Who wrote the novel?"
qa_context = "The famous author Ernest Hemingway wrote the novel in 1929."
qa_candidates = find_answer_candidates(qa_question, qa_context)
Out[41]:
Console
QA Answer Candidate Extraction
==================================================

Question: Who wrote the novel?
Context: The famous author Ernest Hemingway wrote the novel in 1929.

Answer candidates (NPs for 'who' question):
  - The famous author Ernest Hemingway
  - the novel
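
The candidate list still contains "the novel", which only repeats the question. One simple heuristic, sketched below under the assumption that a useful answer must contribute at least one word not already in the question, is to filter such candidates out. `filter_candidates` is a hypothetical helper, not part of NLTK.

```python
def filter_candidates(question, candidates):
    """Drop candidates whose words all appear in the question."""
    question_words = {w.strip("?.,!").lower() for w in question.split()}
    kept = []
    for candidate in candidates:
        candidate_words = {w.lower() for w in candidate.split()}
        # Keep the candidate only if it adds at least one new word.
        if not candidate_words <= question_words:
            kept.append(candidate)
    return kept


question = "Who wrote the novel?"
candidates = ["The famous author Ernest Hemingway", "the novel"]
filtered = filter_candidates(question, candidates)
print(filtered)
```

This is only a coarse filter. A real system would also rank the remaining candidates, for example by proximity to the question's verb in the context.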

Training a Chunker with Machine Learning

While regex chunkers are simple, machine learning approaches achieve better accuracy by learning patterns from data. Let's train a simple chunker using features.

In[42]:
Code
import nltk
from nltk.chunk import ChunkParserI
from nltk.tag import UnigramTagger


class UnigramChunker(ChunkParserI):
    """A simple chunker that uses a unigram tagger on (POS, IOB) pairs."""

    def __init__(self, train_sents):
        # Convert trees to (word, pos, iob) sequences
        train_data = []
        for tree in train_sents:
            iob_triples = tree_to_iob_triples(tree)
            # Create (pos, iob) pairs for training
            pos_iob_pairs = [(pos, iob) for word, pos, iob in iob_triples]
            train_data.append(pos_iob_pairs)

        # Train a unigram tagger to predict IOB from POS
        self.tagger = UnigramTagger(train_data)

    def parse(self, tagged_sent):
        """Parse a POS-tagged sentence into a chunk tree."""
        # Get POS tags
        pos_tags = [pos for word, pos in tagged_sent]

        # Predict IOB tags; POS tags unseen in training get None, mapped to "O"
        iob_tags = [iob or "O" for _, iob in self.tagger.tag(pos_tags)]

        # Build IOB tagged sentence
        iob_tagged = [
            (word, pos, iob) for (word, pos), iob in zip(tagged_sent, iob_tags)
        ]

        # Convert to tree
        return nltk.chunk.conlltags2tree(iob_tagged)


# Train the chunker
unigram_chunker = UnigramChunker(train_sents)

# Evaluate
unigram_eval = evaluate_chunker(unigram_chunker, test_sents[:500])
Out[43]:
Console
Unigram Chunker Evaluation
==================================================

Precision: 0.00%
Recall:    0.00%
F1 Score:  0.00%

Comparison with Regex Chunker:
  Regex F1:   0.00%
  Unigram F1: 0.00%

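The unigram chunker learns the most frequent IOB tag for each POS tag. It captures patterns like "DT usually starts an NP (B-NP)" and "JJ usually continues an NP (I-NP)." Because it looks at each POS tag in isolation, it always assigns the same IOB tag to a given POS tag, regardless of where that tag appears in the sentence.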
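
What the unigram chunker learns amounts to a lookup table from POS tag to IOB tag. The following plain-Python sketch builds such a table from made-up toy data (the sentences and the helper names `train_unigram_table` and `predict` are illustrative assumptions, not NLTK APIs).

```python
from collections import Counter, defaultdict

# Toy training data (made up): sentences as lists of (POS, IOB) pairs.
toy_train = [
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"), ("VBD", "B-VP")],
    [("NN", "B-NP"), ("VBD", "B-VP"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("JJ", "B-NP"), ("NN", "I-NP"), ("VBD", "B-VP")],
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP")],
]


def train_unigram_table(train_data):
    """Map each POS tag to the IOB tag it most often carries."""
    counts = defaultdict(Counter)
    for sent in train_data:
        for pos, iob in sent:
            counts[pos][iob] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def predict(table, pos_tags):
    """Look up each POS tag independently; unseen tags fall back to 'O'."""
    return [table.get(pos, "O") for pos in pos_tags]


table = train_unigram_table(toy_train)
prediction = predict(table, ["NN", "VBD", "XX"])
print(table)
print(prediction)
```

Note that the sentence-initial NN is predicted as I-NP even though nothing precedes it, because the table has no notion of position or context. This is the weakness that adding context addresses.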

Using More Context

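A bigram tagger adds context: it predicts each IOB tag from the current POS tag together with the IOB tag it assigned to the previous token, and backs off to the unigram tagger for contexts it has not seen: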

In[44]:
Code
from nltk.tag import BigramTagger


class BigramChunker(ChunkParserI):
    """A chunker using bigram context."""

    def __init__(self, train_sents):
        train_data = []
        for tree in train_sents:
            iob_triples = tree_to_iob_triples(tree)
            pos_iob_pairs = [(pos, iob) for word, pos, iob in iob_triples]
            train_data.append(pos_iob_pairs)

        # Use bigram tagger with unigram backoff
        unigram = UnigramTagger(train_data)
        self.tagger = BigramTagger(train_data, backoff=unigram)

    def parse(self, tagged_sent):
        pos_tags = [pos for word, pos in tagged_sent]
        iob_tags = self.tagger.tag(pos_tags)
        iob_tags = [iob if iob else "O" for pos, iob in iob_tags]
        iob_tagged = [
            (word, pos, iob) for (word, pos), iob in zip(tagged_sent, iob_tags)
        ]
        return nltk.chunk.conlltags2tree(iob_tagged)


# Train and evaluate
bigram_chunker = BigramChunker(train_sents)
bigram_eval = evaluate_chunker(bigram_chunker, test_sents[:500])
Out[45]:
Console
Bigram Chunker Evaluation
==================================================

Precision: 0.00%
Recall:    0.00%
F1 Score:  0.00%

Comparison:
  Regex F1:   0.00%
  Unigram F1: 0.00%
  Bigram F1:  0.00%
Out[46]:
Visualization
Bar chart comparing F1 scores of regex, unigram, and bigram chunkers.
F1 score comparison of different chunking approaches on the first 500 sentences of the CoNLL-2000 test set. The regex chunker uses hand-crafted patterns, while the unigram and bigram chunkers learn from training data. Machine learning approaches generally outperform hand-crafted rules, with more context leading to better performance.
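
To make the effect of context concrete, here is a plain-Python sketch of a bigram lookup with unigram backoff. It assumes, in line with how NLTK's n-gram taggers build their context, that the bigram context is the previously predicted IOB tag plus the current POS tag. The toy data and the helper names `train_tables` and `predict` are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy training data (made up): sentences as lists of (POS, IOB) pairs.
toy_train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "B-VP"), ("NN", "B-NP")],
    [("NN", "B-NP"), ("VBD", "B-VP"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP")],
]


def train_tables(train_data):
    """Build unigram (POS) and bigram (previous IOB, POS) lookup tables."""
    unigram_counts = defaultdict(Counter)
    bigram_counts = defaultdict(Counter)
    for sent in train_data:
        prev_iob = None  # None marks the start of a sentence
        for pos, iob in sent:
            unigram_counts[pos][iob] += 1
            bigram_counts[(prev_iob, pos)][iob] += 1
            prev_iob = iob
    unigram = {k: c.most_common(1)[0][0] for k, c in unigram_counts.items()}
    bigram = {k: c.most_common(1)[0][0] for k, c in bigram_counts.items()}
    return unigram, bigram


def predict(unigram, bigram, pos_tags):
    """Tag left to right, backing off to the unigram table, then to 'O'."""
    iob_tags = []
    prev_iob = None
    for pos in pos_tags:
        iob = bigram.get((prev_iob, pos)) or unigram.get(pos, "O")
        iob_tags.append(iob)
        prev_iob = iob
    return iob_tags


unigram, bigram = train_tables(toy_train)
with_context = predict(unigram, bigram, ["NN", "VBD", "NN"])
unigram_only = [unigram.get(pos, "O") for pos in ["NN", "VBD", "NN"]]
print(with_context)
print(unigram_only)
```

On this toy data the unigram table labels every NN as I-NP, while the bigram lookup correctly starts a new NP at the beginning of the sentence and after the verb.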

Chunking with spaCy

spaCy provides noun phrase chunking through its noun_chunks property, which uses dependency parsing under the hood:

In[47]:
Code
import spacy

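try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Model not installed: download it with the interpreter running this code
    import subprocess
    import sys

    subprocess.run(
        [sys.executable, "-m", "spacy", "download", "en_core_web_sm"],
        capture_output=True,
        check=True,
    )
    nlp = spacy.load("en_core_web_sm")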

# Example sentences
spacy_examples = [
    "The quick brown fox jumps over the lazy dog.",
    "A major breakthrough in artificial intelligence was announced yesterday.",
    "The President of the United States addressed the nation.",
]

spacy_results = []
for sent in spacy_examples:
    doc = nlp(sent)
    chunks = [
        (chunk.text, chunk.root.text, chunk.root.dep_)
        for chunk in doc.noun_chunks
    ]
    spacy_results.append((sent, chunks))
Out[48]:
Console
spaCy Noun Chunks
============================================================

Sentence: The quick brown fox jumps over the lazy dog.
Noun chunks:
  'The quick brown fox'
    Root: fox, Dependency: nsubj
  'the lazy dog'
    Root: dog, Dependency: pobj

Sentence: A major breakthrough in artificial intelligence was announced yesterday.
Noun chunks:
  'A major breakthrough'
    Root: breakthrough, Dependency: nsubjpass
  'artificial intelligence'
    Root: intelligence, Dependency: pobj

Sentence: The President of the United States addressed the nation.
Noun chunks:
  'The President'
    Root: President, Dependency: nsubj
  'the United States'
    Root: States, Dependency: pobj
  'the nation'
    Root: nation, Dependency: dobj

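spaCy's noun chunks are derived from the dependency parse, so they benefit from syntactic analysis beyond just POS tag patterns. The root of each chunk is its head noun, and the dep_ attribute shows its grammatical function in the sentence. The chunks are still flat base noun phrases: "The President of the United States" yields two separate chunks, "The President" and "the United States," rather than one nested NP.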
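
To compare span-based output like spaCy's noun chunks with the IOB-based chunkers above, the spans can be converted to IOB tags. The sketch below assumes spans are given as (start, end) token offsets with an exclusive end, which is how a spaCy Span exposes them through `chunk.start` and `chunk.end`. The helper `spans_to_iob` is hypothetical, and the spans are typed in by hand here so the example runs without a spaCy model.

```python
def spans_to_iob(n_tokens, spans, label="NP"):
    """Convert (start, end) token spans (end exclusive) to IOB tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
# Hand-written offsets for the two noun chunks of the first example sentence.
spans = [(0, 4), (6, 9)]
iob = spans_to_iob(len(tokens), spans)
for token, tag in zip(tokens, iob):
    print(f"{token:8s} {tag}")
```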

Limitations and Practical Considerations

Chunking, while useful, has inherent limitations that practitioners should understand.

The flat, non-recursive nature of chunks means they cannot represent certain linguistic phenomena. A sentence like "The student who failed the exam requested a meeting" contains a relative clause embedded within the subject NP. Flat chunking either splits this incorrectly or produces an overly long NP that obscures internal structure. When you need to understand how phrases nest within phrases, full parsing is required.

Chunking accuracy depends heavily on POS tagging accuracy. If "man" is tagged as a noun when it's actually a verb in "The old man the boats," chunking will produce incorrect results. This error propagation is particularly problematic for domain-specific text where POS taggers trained on news data may struggle with unfamiliar vocabulary and constructions.
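
The effect of a single tagging error can be shown with a deliberately tiny NP chunker. This sketch is a simplified stand-in for the regex chunkers used in this chapter, not NLTK's implementation: it encodes each POS tag as one character and matches the pattern "optional determiner, any adjectives, one or more nouns." The helpers `tag_code` and `simple_np_chunks` and both taggings are written by hand for illustration.

```python
import re


def tag_code(pos):
    """Encode a POS tag as one character so a regex can scan the sequence."""
    if pos == "DT":
        return "D"
    if pos.startswith("JJ"):
        return "J"
    if pos.startswith("NN"):
        return "N"
    return "O"


def simple_np_chunks(tagged):
    """Find NPs matching: optional determiner, adjectives, one or more nouns."""
    codes = "".join(tag_code(pos) for _, pos in tagged)
    chunks = []
    for match in re.finditer(r"D?J*N+", codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        chunks.append(" ".join(words))
    return chunks


# Plausible but wrong tagging: "man" as a noun.
wrong = [("The", "DT"), ("old", "JJ"), ("man", "NN"), ("the", "DT"), ("boats", "NNS")]
# Intended reading: "old" is the noun, "man" is the verb.
right = [("The", "DT"), ("old", "NN"), ("man", "VBP"), ("the", "DT"), ("boats", "NNS")]

wrong_chunks = simple_np_chunks(wrong)
right_chunks = simple_np_chunks(right)
print(wrong_chunks)
print(right_chunks)
```

With the wrong tags the chunker absorbs the verb into the subject NP and the sentence is left without a verb, an error no downstream step can recover from.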

The definition of chunks can be ambiguous. Should "the very best coffee" be one NP or should "very best" be a separate ADJP? Different annotation guidelines make different choices, and these inconsistencies affect both training data and evaluation. When comparing chunking systems, ensure they use compatible annotation schemes.

For languages with freer word order or richer morphology than English, chunking becomes more challenging. German verb clusters, Japanese postpositions, and Arabic clitic attachment create patterns that simple sequence models may not capture well. Cross-linguistic chunking remains an active research area.

Despite these limitations, chunking provides a practical balance between simplicity and usefulness. For applications that need phrase boundaries without full syntactic analysis, such as information extraction, keyword identification, and text summarization, chunking offers an efficient and reasonably accurate solution.

Summary

Chunking identifies non-overlapping, non-recursive phrases in text, providing a middle ground between POS tagging and full parsing. The key concepts from this chapter:

  • Chunk types include noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). Each type captures a different syntactic unit, with NP chunking being the most studied and practically useful.
  • IOB tagging encodes chunk boundaries using per-token labels: B marks the beginning of a chunk, I marks continuation, and O marks tokens outside chunks. This encoding allows chunking to be formulated as a sequence labeling task.
  • Regex-based chunking uses patterns over POS tags to identify chunks. NLTK's RegexpParser provides an intuitive way to define chunk grammars, though pattern-based approaches have limited accuracy.
  • Machine learning chunkers learn patterns from annotated data like the CoNLL-2000 corpus. Even simple unigram and bigram models typically outperform hand-crafted rules, and more sophisticated approaches using CRFs or neural networks achieve state-of-the-art results.
  • Chunking vs. parsing represents a key tradeoff: chunking is faster and more accurate on its simpler task but captures less syntactic information. Full parsing resolves attachment ambiguities and represents hierarchical structure.
  • Practical applications include information extraction, keyword identification, and question answering preprocessing. Chunking provides useful features without the complexity of full parsing.

Key Parameters

When working with chunking in NLTK and spaCy:

NLTK RegexpParser:

  • grammar: A string defining chunk rules using regex over POS tags
  • {<pattern>}: Chunk pattern (include matching tokens)
  • }<pattern>{: Chink pattern (exclude matching tokens)

NLTK chunk evaluation:

  • chunk_types: List of chunk types to evaluate (e.g., ["NP", "VP", "PP"])
  • Evaluation is chunk-level: exact boundary and type match required
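
To illustrate what chunk-level evaluation means, the following plain-Python sketch computes precision, recall, and F1 over exact (start, end, type) matches. It is an illustrative reimplementation under that definition, not NLTK's scoring code, and `iob_to_spans` and `chunk_prf` are hypothetical helper names.

```python
def iob_to_spans(iob_tags):
    """Return the set of (start, end, label) chunks encoded by IOB tags."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(iob_tags + ["O"]):  # sentinel closes the last chunk
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        # B- always opens a chunk; a stray I- opens one if none is open.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def chunk_prf(gold_tags, pred_tags):
    """Chunk-level precision, recall, and F1 using exact span matches."""
    gold, pred = iob_to_spans(gold_tags), iob_to_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-NP", "I-NP", "I-NP", "B-VP", "B-NP", "I-NP"]
pred = ["B-NP", "I-NP", "B-NP", "B-VP", "B-NP", "I-NP"]
precision, recall, f1 = chunk_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Only one token is mislabeled here, so token accuracy would be 5/6, yet the first gold NP counts as entirely missed and the two predicted fragments count as errors. Chunk-level scores are therefore usually lower than token-level accuracy.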

spaCy noun_chunks:

  • doc.noun_chunks: Iterator over noun phrases in document
  • chunk.root: Head noun of the chunk
  • chunk.root.dep_: Syntactic dependency of the head

The next chapters explore the probabilistic models, Hidden Markov Models and Conditional Random Fields, that power production-quality chunkers and other sequence labeling systems.


Reference

@misc{chunkingshallowparsingforphraseidentificationinnlp, author = {Michael Brenndoerfer}, title = {Chunking: Shallow Parsing for Phrase Identification in NLP}, year = {2025}, url = {https://mbrenndoerfer.com/writing/chunking-shallow-parsing-nlp}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-12-16} }

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
