Named Entity Recognition: Extracting People, Places & Organizations

Michael Brenndoerfer · December 15, 2025 · 27 min read

Learn how NER identifies and classifies entities in text using BIO tagging, evaluation metrics, and spaCy implementation.

Named Entity Recognition

Text is full of references to real-world entities: people, organizations, locations, dates, monetary amounts. Extracting these mentions automatically is one of the most practically useful NLP tasks. Named Entity Recognition (NER) identifies spans of text that refer to specific entities and classifies them into predefined categories.

Consider the sentence: "Apple announced that Tim Cook will visit Paris next Monday." A NER system should identify "Apple" as an organization, "Tim Cook" as a person, "Paris" as a location, and "next Monday" as a date. This extracted information powers applications from search engines to question answering systems to business intelligence tools.

NER sits at the intersection of sequence labeling and information extraction. Like POS tagging, it assigns labels to tokens. Unlike POS tagging, NER deals with multi-word spans and faces the challenge of detecting entity boundaries. This chapter covers entity types, the framing of NER as sequence labeling, boundary detection challenges, and evaluation methodologies.

Entity Types

NER systems categorize entities into a taxonomy of types. The specific categories depend on the application domain, but several core types appear across most NER systems.

Named Entity

A named entity is a real-world object that can be denoted with a proper name: a person, organization, location, or other specific entity. The term "named" distinguishes these from generic references: "the company" is not a named entity, but "Apple Inc." is.

Standard Entity Categories

The most common NER taxonomy includes three core categories that appear in virtually every system:

  • PER (Person): Names of individuals, including fictional characters. Examples: "Albert Einstein", "Sherlock Holmes", "Dr. Smith"
  • ORG (Organization): Companies, institutions, agencies, teams. Examples: "Google", "United Nations", "New York Yankees"
  • LOC (Location): Geopolitical entities, physical locations, addresses. Examples: "France", "Mount Everest", "123 Main Street"

Extended taxonomies add additional categories for specific applications:

  • DATE/TIME: Temporal expressions like "January 2024", "next Tuesday", "3:00 PM"
  • MONEY: Monetary values like "$50 million", "€100", "fifty dollars"
  • PERCENT: Percentage expressions like "25%", "a third"
  • MISC (Miscellaneous): Entities that don't fit other categories, often events, products, or works of art

Let's explore entity types using spaCy's NER system:

In[2]:
Code
import spacy

nlp = spacy.load("en_core_web_sm")

text = """
Apple Inc. reported that CEO Tim Cook met with French President Emmanuel Macron 
in Paris on January 15, 2024. The meeting discussed a $2 billion investment 
in European artificial intelligence research. Sources say 75% of the funds 
will go to the Sorbonne University AI lab.
"""

doc = nlp(text)

entities = [
    (ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents
]
Out[3]:
Console
Extracted Named Entities:
------------------------------------------------------------
Entity                         Type       Position       
------------------------------------------------------------
Apple Inc.                     ORG        1-11
Tim Cook                       PERSON     30-38
French                         NORP       48-54
Emmanuel Macron                PERSON     65-80
Paris                          GPE        85-90
January 15, 2024               DATE       94-110
$2 billion                     MONEY      136-146
European                       NORP       162-170
75%                            PERCENT    217-220
Sorbonne University AI         ORG        250-272

The output shows spaCy correctly extracting diverse entity types from the text. Notice that "Apple Inc." is tagged as ORG (organization), while "Tim Cook" and "Emmanuel Macron" are PERSON entities. The model also recognizes "Paris" as a GPE (geo-political entity), extracts the date "January 15, 2024", identifies the monetary value "$2 billion" as MONEY, and captures "75%" as a PERCENT. The position columns show character offsets, which are useful for highlighting entities in the original text.

Entity Type Distributions

Different domains exhibit different entity distributions. News text contains many person and organization mentions. Scientific papers reference organizations and locations. Financial documents are dense with monetary values and percentages.

In[4]:
Code
from collections import Counter

# Different text domains
news_text = """
President Biden met with Chancellor Scholz in Berlin. The White House 
announced new climate policies. Senator Warren criticized Wall Street banks.
"""

financial_text = """
Microsoft's Q3 revenue reached $52.9 billion, up 17% year-over-year. 
The Nasdaq index rose 2.5% following the Fed's interest rate decision. 
Analysts at Goldman Sachs raised their price target to $400.
"""

scientific_text = """
Researchers at MIT and Stanford published findings in Nature. The NIH 
funded study examined samples from Boston, Chicago, and Seattle hospitals.
Dr. Chen and Dr. Patel led the investigation.
"""


def get_entity_distribution(text):
    """Get entity type counts for text"""
    doc = nlp(text)
    return Counter(ent.label_ for ent in doc.ents)


news_ents = get_entity_distribution(news_text)
financial_ents = get_entity_distribution(financial_text)
scientific_ents = get_entity_distribution(scientific_text)
Out[5]:
Console
Entity Type Distribution by Domain:
-------------------------------------------------------
Entity Type        News    Financial   Scientific
-------------------------------------------------------
CARDINAL              0            1            0
DATE                  0            1            0
GPE                   2            0            4
MONEY                 0            2            0
ORG                   1            4            3
PERCENT               0            2            0
PERSON                2            0            2
Out[6]:
Visualization
Grouped bar chart comparing entity type counts across news, financial, and scientific text domains.
Entity type distribution across three text domains. Financial text is dominated by MONEY and PERCENT entities, while news text shows more PERSON and ORG mentions. Scientific text exhibits a balanced mix of institutions (ORG) and locations (GPE).

The distribution reveals clear domain patterns. Financial text is dense with MONEY and PERCENT entities, while news text contains more PERSON mentions (politicians) and ORG references. Scientific text shows a balance of ORG (institutions) and GPE (locations of research sites). These patterns matter for NER system design. Training data should reflect the target domain's entity mix, and evaluation should weight entity types according to their importance in the application.

NER as Sequence Labeling

NER is fundamentally a sequence labeling problem: given a sequence of tokens, assign a label to each token indicating whether it's part of an entity and, if so, which type. This framing connects NER to other sequence labeling tasks like POS tagging, allowing us to apply similar models and techniques.

The key difference from POS tagging is that entities span multiple tokens. "Tim Cook" is a single PERSON entity spanning two tokens. "United States of America" spans four tokens. The sequence labeling formulation must handle these multi-token spans.

Token-Level Labels

The simplest approach assigns each token one of three labels:

  • B-TYPE: Beginning of an entity of TYPE
  • I-TYPE: Inside (continuation of) an entity of TYPE
  • O: Outside any entity

This is the BIO tagging scheme, which we'll explore in depth in the next chapter. For now, let's see how it works:

In[7]:
Code
def tokens_to_bio(doc):
    """Convert spaCy doc to token-level BIO tags"""
    bio_tags = []
    for token in doc:
        if token.ent_iob_ == "O":
            bio_tags.append(("O", token.text))
        elif token.ent_iob_ == "B":
            bio_tags.append((f"B-{token.ent_type_}", token.text))
        elif token.ent_iob_ == "I":
            bio_tags.append((f"I-{token.ent_type_}", token.text))
    return bio_tags


example = "Tim Cook announced that Apple will invest in Paris."
doc = nlp(example)
bio_result = tokens_to_bio(doc)
Out[8]:
Console
BIO Tags for NER:
----------------------------------------
Token           BIO Tag        
----------------------------------------
Tim             B-PERSON       
Cook            I-PERSON       
announced       O              
that            O              
Apple           B-ORG          
will            O              
invest          O              
in              O              
Paris           B-GPE          
.               O              
Out[9]:
Visualization
Horizontal sequence of tokens with color-coded boxes showing BIO tags for named entities.
BIO tagging visualized as a token sequence. Person entities (green) span 'Tim Cook' with B-PERSON and I-PERSON. Single-token entities like 'Apple' (blue, B-ORG) and 'Paris' (orange, B-GPE) only need the B tag. Gray tokens are outside any entity (O).

Notice how "Tim" receives the B-PERSON tag (beginning of a person entity) while "Cook" gets I-PERSON (inside/continuation). This two-token sequence forms a single PERSON entity. Similarly, "Apple" and "Paris" each receive B-ORG and B-GPE tags as single-token entities. All other tokens are tagged O (outside), indicating they're not part of any entity.

The B (beginning) tag marks the first token of an entity, while I (inside) tags mark continuation tokens. The O (outside) tag marks tokens that aren't part of any entity. This scheme handles adjacent entities of the same type by using B tags to signal new entity boundaries.
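
To make the role of the B tag concrete, consider two same-type entities written back to back, as in a caption that simply lists names. The helper below is a minimal sketch (not spaCy's internal decoder) that groups BIO tags back into spans; the second B-PER cleanly starts a new entity where an all-I tagging would merge the two names.

def bio_to_spans(tokens, tags):
    """Group BIO tags back into (text, type) entity spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = ([token], tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0].append(token)
        else:
            # O (or a stray I-) closes any open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(" ".join(words), etype) for words, etype in spans]


# Two adjacent PERSON entities, e.g. names listed side by side in a caption
tokens = ["Angela", "Merkel", "Emmanuel", "Macron"]
tags = ["B-PER", "I-PER", "B-PER", "I-PER"]
print(bio_to_spans(tokens, tags))
# [('Angela Merkel', 'PER'), ('Emmanuel Macron', 'PER')]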

From Classification to Structured Prediction

Token-level classification treats each token independently, but entity spans have structure. Consider a sequence of tokens $[t_1, t_2, \ldots, t_n]$ with corresponding labels $[y_1, y_2, \ldots, y_n]$. If token $t_i$ receives label $y_i = \text{I-PER}$, then the preceding label $y_{i-1}$ must be either B-PER or I-PER. A sequence starting with I-PER violates the tagging constraints.

This structural dependency makes NER a structured prediction problem. Models must consider not just individual token features but also the consistency of the entire label sequence. Approaches like Conditional Random Fields (CRFs) explicitly model these dependencies, which we'll cover in later chapters.

In[10]:
Code
# Demonstrate valid vs invalid BIO sequences
valid_sequences = [
    ["B-PER", "I-PER", "O", "B-ORG", "O"],
    ["O", "B-LOC", "O", "O", "O"],
    ["B-PER", "O", "B-PER", "I-PER", "O"],
]

invalid_sequences = [
    ["I-PER", "I-PER", "O"],  # Can't start with I
    ["B-PER", "I-ORG", "O"],  # Type mismatch
    ["O", "I-LOC", "O"],  # I without preceding B
]


def validate_bio_sequence(sequence):
    """Check if BIO sequence is valid"""
    prev_tag = "O"
    for tag in sequence:
        if tag.startswith("I-"):
            # I must follow B or I of same type
            entity_type = tag[2:]
            if prev_tag == "O":
                return False
            if prev_tag.startswith("B-") and prev_tag[2:] != entity_type:
                return False
            if prev_tag.startswith("I-") and prev_tag[2:] != entity_type:
                return False
        prev_tag = tag
    return True
Out[11]:
Console
BIO Sequence Validation:
--------------------------------------------------

Valid sequences:
  ['B-PER', 'I-PER', 'O', 'B-ORG', 'O']
  ['O', 'B-LOC', 'O', 'O', 'O']
  ['B-PER', 'O', 'B-PER', 'I-PER', 'O']

Invalid sequences:
  ['I-PER', 'I-PER', 'O'] (starts with I)
  ['B-PER', 'I-ORG', 'O'] (type mismatch or orphan I)
  ['O', 'I-LOC', 'O'] (type mismatch or orphan I)

The valid sequences demonstrate proper BIO structure: entities start with B and continue with I of the same type. The invalid sequences show common violations. Starting with I-PER is invalid because there's no preceding B-PER to begin the entity. A B-PER followed by I-ORG is invalid because the entity type must be consistent throughout the span. An orphan I-LOC without a preceding B-LOC is likewise invalid.

Understanding these constraints is crucial for both training NER models and decoding their predictions. Many NER systems add a constraint layer that ensures outputs always form valid BIO sequences.
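
As a simplified illustration of such a constraint layer, the sketch below repairs an invalid sequence after the fact by promoting any orphan I- tag to a B- tag of the same type. Production systems more often enforce validity during decoding, for example with transition constraints in a CRF, but the resulting outputs are similarly well-formed.

def repair_bio_sequence(tags):
    """Fix invalid BIO tags: an I- tag that does not continue an
    entity of the same type is promoted to a B- tag."""
    repaired = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            etype = tag[2:]
            # Valid only if the previous tag opened or continued the same type
            if prev not in (f"B-{etype}", f"I-{etype}"):
                tag = f"B-{etype}"
        repaired.append(tag)
        prev = tag
    return repaired


print(repair_bio_sequence(["I-PER", "I-PER", "O"]))  # ['B-PER', 'I-PER', 'O']
print(repair_bio_sequence(["B-PER", "I-ORG", "O"]))  # ['B-PER', 'B-ORG', 'O']
print(repair_bio_sequence(["O", "I-LOC", "O"]))      # ['O', 'B-LOC', 'O']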

Nested Entity Challenges

Real text often contains nested entities, where one entity is contained within another. Consider "Bank of America headquarters in Charlotte." The full location "Bank of America headquarters in Charlotte" contains the organization "Bank of America" and the location "Charlotte."

Nested Entities

Nested entities occur when one named entity mention is contained within another. Standard sequence labeling with BIO tags cannot represent nested structures since each token receives exactly one label.

Standard BIO tagging forces a choice: you can only assign one label per token. Different strategies handle this limitation.

Flat Annotation

Most NER datasets and systems use flat annotation, where nested entities are resolved by choosing one level. Common strategies include:

  • Outermost entity: Label the largest span
  • Innermost entities: Label only the most specific mentions
  • Head-based: Label based on the syntactic head
In[12]:
Code
nested_examples = [
    {
        "text": "The New York Times reported on the story.",
        "outer": [("New York Times", "ORG")],
        "inner": [("New York", "LOC")],
    },
    {
        "text": "University of California Berkeley scientists won.",
        "outer": [("University of California Berkeley", "ORG")],
        "inner": [("California", "LOC"), ("Berkeley", "LOC")],
    },
    {
        "text": "The European Central Bank president spoke.",
        "outer": [("European Central Bank", "ORG")],
        "inner": [("European", "NORP")],  # Nationality
    },
]
Out[13]:
Console
Nested Entity Examples:
------------------------------------------------------------

Text: "The New York Times reported on the story."
  Outer entities: [('New York Times', 'ORG')]
  Inner entities: [('New York', 'LOC')]
  Standard NER would choose outer: [('New York Times', 'ORG')]

Text: "University of California Berkeley scientists won."
  Outer entities: [('University of California Berkeley', 'ORG')]
  Inner entities: [('California', 'LOC'), ('Berkeley', 'LOC')]
  Standard NER would choose outer: [('University of California Berkeley', 'ORG')]

Text: "The European Central Bank president spoke."
  Outer entities: [('European Central Bank', 'ORG')]
  Inner entities: [('European', 'NORP')]
  Standard NER would choose outer: [('European Central Bank', 'ORG')]

spaCy and most production NER systems use flat annotation. This works well for many applications, but loses information when nesting matters.

Nested NER Approaches

When nested entities are important, specialized approaches can capture them:

  • Multi-layer tagging: Run multiple passes, each extracting one nesting level
  • Span-based models: Score all possible spans rather than labeling tokens
  • Constituency parsing-based: Use tree structures to represent nesting
In[14]:
Code
def extract_all_spans(text, max_span_length=5):
    """Extract all possible spans up to max length"""
    doc = nlp(text)
    tokens = [token.text for token in doc]
    spans = []

    for start in range(len(tokens)):
        for end in range(
            start + 1, min(start + max_span_length + 1, len(tokens) + 1)
        ):
            span_text = " ".join(tokens[start:end])
            spans.append((start, end, span_text))

    return spans


example = "New York Times reporter"
all_spans = extract_all_spans(example)
Out[15]:
Console
All possible spans in 'New York Times reporter':
---------------------------------------------
  [0:1] "New"
  [0:2] "New York"
  [0:3] "New York Times"
  [0:4] "New York Times reporter"
  [1:2] "York"
  [1:3] "York Times"
  [1:4] "York Times reporter"
  [2:3] "Times"
  [2:4] "Times reporter"
  [3:4] "reporter"

Total spans: 10

For just four tokens, we generate 10 candidate spans. Notice this includes both "New York" (a potential location) and "New York Times" (a potential organization), allowing a nested NER system to recognize both. A span-based model would score each of these spans for each entity type, keeping only those with high confidence scores.

The Computational Cost of Span Enumeration

Span-based approaches are elegant conceptually: enumerate all possible spans and classify each one. This naturally handles nesting since overlapping spans can receive different labels. But this flexibility comes with a computational cost that grows rapidly with sentence length.

To understand the cost, let's count how many spans exist in a sentence of $n$ tokens. A span is defined by its starting position (any token from 1 to $n$) and its ending position (any token from the start to $n$). For each starting position $i$, we can form spans ending at positions $i$, $i+1$, $i+2$, ..., up to $n$. This gives us:

  • Starting at position 1: $n$ possible spans
  • Starting at position 2: $n-1$ possible spans
  • Starting at position 3: $n-2$ possible spans
  • ...
  • Starting at position $n$: 1 possible span

The total is $n + (n-1) + (n-2) + \cdots + 1$, which is the sum of the first $n$ integers:

$$\text{Number of spans} = \frac{n(n+1)}{2}$$

This quadratic growth means span enumeration becomes expensive for long sentences. A 10-token sentence has 55 spans. A 20-token sentence has 210 spans. A 50-token sentence has 1,275 spans to classify. For each span, the model must compute features and score it against each entity type, multiplying the computational burden.

Out[16]:
Visualization
Line plot showing quadratic growth of span count versus sentence length, with a second line showing linear growth when span length is limited.
Number of possible spans grows quadratically with sentence length (blue curve). Limiting maximum span length to 5 tokens (orange dashed line) reduces the growth to approximately linear, making span-based NER tractable for longer documents.

In practice, span-based NER models mitigate this cost by limiting the maximum span length. Most named entities are short (1-5 tokens), so ignoring spans longer than some threshold loses little recall while dramatically reducing computation. With a maximum span length of $k$, the number of spans becomes approximately $n \cdot k$ rather than $\frac{n^2}{2}$, making the approach tractable for longer documents.
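
A quick way to see the effect of the cap is to count candidate spans directly. The helper below is a small sketch, not part of any NER library; it compares the full quadratic count with the count under a maximum span length of 5 tokens.

def span_count(n, max_len=None):
    """Count candidate spans in a sentence of n tokens, optionally
    capping the maximum span length."""
    if max_len is None:
        return n * (n + 1) // 2  # all spans: n(n+1)/2
    # Each start position contributes at most max_len spans
    return sum(min(max_len, n - start) for start in range(n))


for n in (10, 20, 50):
    print(n, span_count(n), span_count(n, max_len=5))
# 10 55 40
# 20 210 90
# 50 1275 240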

Entity Boundary Detection

Detecting where entities start and end is often harder than classifying entity types. The word "New" in "New York" is clearly part of a location, but in "New policy announced" it's not part of any entity.

Boundary Ambiguity

Entity boundaries are ambiguous for several reasons. First, modifiers may or may not be included: is it "President Biden" or just "Biden"? Second, compound entities are tricky: does "Apple iPhone 15" contain one entity or two? Third, coordination creates challenges: in "Microsoft and Google", are there two separate ORG entities or one?

In[17]:
Code
boundary_examples = [
    "President Joe Biden signed the bill.",
    "Dr. Sarah Chen published the paper.",
    "Apple iPhone 15 Pro Max was announced.",
    "Microsoft, Google, and Meta formed a partnership.",
    "The United States Department of Defense issued a statement.",
]

boundary_results = []
for text in boundary_examples:
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    boundary_results.append((text, entities))
Out[18]:
Console
Entity Boundary Detection:
----------------------------------------------------------------------

Text: "President Joe Biden signed the bill."
  Entities: [('Joe Biden', 'PERSON')]

Text: "Dr. Sarah Chen published the paper."
  Entities: [('Sarah Chen', 'PERSON')]

Text: "Apple iPhone 15 Pro Max was announced."
  Entities: [('Apple', 'ORG'), ('15', 'CARDINAL')]

Text: "Microsoft, Google, and Meta formed a partnership."
  Entities: [('Microsoft', 'ORG'), ('Google', 'ORG'), ('Meta', 'ORG')]

Text: "The United States Department of Defense issued a statement."
  Entities: [('The United States Department of Defense', 'ORG')]

The results reveal interesting boundary decisions. "Joe Biden" is recognized without "President", while "The United States Department of Defense" is captured as a complete span, determiner included. For the Apple example, the model tags "Apple" as ORG and "15" as CARDINAL, failing to capture "iPhone 15 Pro Max" as a single product entity. The coordination case shows how "Microsoft", "Google", and "Meta" are correctly identified as three separate ORG entities rather than one.

Annotation guidelines must make consistent decisions about these edge cases. Different datasets make different choices, which means NER systems trained on one dataset may produce different boundaries than systems trained on another.

Titles and Honorifics

A common boundary question is whether titles should be included in person names. "Dr. Smith" could be annotated as a single PERSON entity or as a title plus a PERSON.

In[19]:
Code
title_examples = [
    "Dr. Anthony Fauci spoke at the conference.",
    "Professor Stephen Hawking wrote the book.",
    "Senator Elizabeth Warren proposed the bill.",
    "Queen Elizabeth II visited the hospital.",
    "CEO Sundar Pichai announced the product.",
]

TITLES = {
    "Dr.", "Professor", "Senator", "Queen",
    "CEO", "President", "Mr.", "Mrs.", "Ms.",
}

title_results = []
for text in title_examples:
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            # Check whether the token immediately before the entity is a title
            prev_token = doc[ent.start - 1] if ent.start > 0 else None
            preceding = prev_token.text if prev_token is not None else ""
            title_results.append((ent.text, preceding, preceding in TITLES))
Out[20]:
Console
Title Handling in NER:
------------------------------------------------------------
  Entity: "Anthony Fauci" | Preceding word: Dr.
  Entity: "Stephen Hawking" | Preceding word: Professor
  Entity: "Elizabeth Warren" | Preceding word: Senator
  Entity: "Elizabeth II" | Preceding word: Queen
  Entity: "Sundar Pichai" | Preceding word: CEO

The spaCy model typically excludes titles from person names, treating "Dr." as a separate token. This is a design choice that affects downstream applications. If your use case requires titles, you may need to extend entity spans post-hoc.
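
If your application needs titles attached to names, one lightweight option is to widen PERSON spans after prediction. The sketch below assumes the nlp pipeline loaded earlier and a hand-picked set of title tokens; it builds new Span objects rather than modifying the document's entities in place, and the exact output depends on the model's predictions.

from spacy.tokens import Span

TITLES = {"Dr.", "Professor", "Senator", "Queen", "CEO",
          "President", "Mr.", "Mrs.", "Ms."}


def expand_person_titles(doc):
    """Return entity spans with PERSON entities widened to include a
    preceding title token; other entities are returned unchanged."""
    expanded = []
    for ent in doc.ents:
        if (
            ent.label_ == "PERSON"
            and ent.start > 0
            and doc[ent.start - 1].text in TITLES
        ):
            expanded.append(Span(doc, ent.start - 1, ent.end, label=ent.label_))
        else:
            expanded.append(ent)
    return expanded


doc = nlp("Dr. Anthony Fauci spoke at the conference.")
print([(ent.text, ent.label_) for ent in expand_person_titles(doc)])
# Expected with en_core_web_sm: [('Dr. Anthony Fauci', 'PERSON')]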

NER Evaluation

How do we know if a NER system is any good? This question is more subtle than it first appears. Unlike classification tasks where each input has exactly one label, NER involves identifying spans of varying length and assigning types to them. A prediction might get the entity type right but the boundaries wrong, or it might find some entities but miss others entirely. We need evaluation metrics that capture these nuances.

The Challenge of Measuring NER Performance

Consider a sentence where the gold standard annotation marks "New York Times" as an ORG entity. Suppose our NER system predicts "New York" as a LOC entity instead. How should we score this?

  1. The system found something at roughly the right location
  2. But the span is too short (missing "Times")
  3. And the type is wrong (LOC instead of ORG)

Should this receive partial credit? Zero credit? The answer depends on your evaluation paradigm, and understanding these choices is crucial for interpreting NER benchmarks.

Exact Match Evaluation

The standard approach in NER evaluation is exact match: a prediction counts as correct only if it matches a gold entity exactly in both span boundaries and entity type. This binary decision creates a clean framework built on three fundamental counts.

When comparing predicted entities against gold standard annotations, every entity falls into exactly one of three categories:

  • True Positives ($TP$): Predictions that exactly match gold entities. The span boundaries must be identical (same start and end positions), and the entity type must match.

  • False Positives ($FP$): Predictions that don't match any gold entity. This includes entities with wrong boundaries (even by one token), wrong types, or completely spurious predictions.

  • False Negatives ($FN$): Gold entities that the system failed to predict. These represent missed entities that should have been found.

From these three counts, we derive the standard evaluation metrics. Precision answers the question: "Of all the entities we predicted, how many were correct?"

$$\text{Precision} = \frac{TP}{TP + FP}$$

A system with high precision makes few mistakes when it does predict an entity, but it might be overly cautious and miss many entities. Think of a conservative NER system that only predicts entities when it's highly confident.

Recall answers the complementary question: "Of all the entities that exist, how many did we find?"

$$\text{Recall} = \frac{TP}{TP + FN}$$

A system with high recall finds most entities but might also predict many false positives. Think of an aggressive NER system that marks anything that could possibly be an entity.

Neither metric alone tells the full story. A system that predicts nothing has perfect precision (no false positives) but zero recall. A system that marks every token as an entity has perfect recall but terrible precision. We need a metric that balances both concerns.

The F1 score provides this balance as the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

The harmonic mean, rather than the arithmetic mean, ensures that F1 is low when either precision or recall is low. You cannot achieve a high F1 by excelling at one metric while failing at the other. This property makes F1 the standard single-number summary for NER performance.
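
A tiny numeric example makes this concrete. The values below are illustrative, not from any benchmark: a system with perfect precision but 10% recall looks passable under an arithmetic mean yet earns a low F1.

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


p, r = 1.0, 0.1  # near-useless recall despite perfect precision
print(f"arithmetic mean: {(p + r) / 2:.2f}")        # 0.55, deceptively decent
print(f"harmonic mean (F1): {f1_score(p, r):.2f}")  # 0.18, reflects weak recall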

Implementing Exact Match Evaluation

Let's implement these metrics and see how they work on a concrete example. We'll compare a set of predicted entities against gold standard annotations.

In[21]:
Code
def compute_exact_match_metrics(gold_entities, pred_entities):
    """
    Compute precision, recall, F1 for exact entity match.

    Each entity is a tuple of (span_start, span_end, entity_type).
    Exact match requires all three components to be identical.
    """
    gold_set = set(gold_entities)
    pred_set = set(pred_entities)

    # Count the three fundamental categories
    true_positives = len(
        gold_set & pred_set
    )  # Intersection: correct predictions
    false_positives = len(pred_set - gold_set)  # Predicted but not in gold
    false_negatives = len(gold_set - pred_set)  # In gold but not predicted

    # Compute metrics with zero-division protection
    precision = (
        true_positives / (true_positives + false_positives) if pred_set else 0
    )
    recall = (
        true_positives / (true_positives + false_negatives) if gold_set else 0
    )
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0
    )

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }

Now let's create a realistic example. Consider a sentence with three gold entities, and a NER system that makes some correct predictions but also some errors:

In[22]:
Code
# Gold standard: what a human annotator marked
# Format: (span_start, span_end, entity_type)
gold = [
    (0, 2, "PERSON"),  # "Tim Cook" at tokens 0-1
    (4, 5, "ORG"),  # "Apple" at token 4
    (8, 9, "LOC"),  # "Paris" at token 8
]

# System predictions: what our NER model output
predicted = [
    (0, 2, "PERSON"),  # Correct: exact match for "Tim Cook"
    (4, 5, "ORG"),  # Correct: exact match for "Apple"
    (8, 10, "LOC"),  # Wrong boundary: "Paris next" instead of "Paris"
    (12, 13, "DATE"),  # Spurious: predicting an entity that doesn't exist
]

metrics = compute_exact_match_metrics(gold, predicted)
Out[23]:
Console
Exact Match Evaluation Example:
--------------------------------------------------
Gold entities:      3
Predicted entities: 4

Category breakdown:
  True positives:  2 (exact matches)
  False positives: 2 (wrong or spurious)
  False negatives: 1 (missed entities)

Computed metrics:
  Precision: 50.00%
  Recall:    66.67%
  F1 Score:  57.14%

Let's trace through the evaluation logic step by step. Of our 4 predictions, only 2 exactly match gold entities ("Tim Cook" and "Apple"). The prediction for "Paris" fails because the boundary is wrong (tokens 8-10 instead of 8-9), even though we correctly identified the location. The DATE prediction at tokens 12-13 is completely spurious. This gives us:

  • $TP = 2$ (the two exact matches)
  • $FP = 2$ (the boundary error and the spurious prediction)
  • $FN = 1$ (the missed "Paris" entity)

Plugging into our formulas:

  • Precision $= \frac{2}{2+2} = 0.50$ (half our predictions were correct)
  • Recall $= \frac{2}{2+1} = 0.67$ (we found two of three entities)
  • F1 $= \frac{2 \times 0.50 \times 0.67}{0.50 + 0.67} = 0.57$

The 57% F1 score reflects that our system is mediocre: it makes too many errors and misses too many entities. Notice how the boundary error for "Paris" counts as both a false positive (wrong prediction) and contributes to a false negative (missed entity). Exact match evaluation is strict, which encourages models to learn precise boundaries.

Partial Match: An Alternative Paradigm

Exact match can feel harsh. The prediction "New York" for a gold entity "New York Times" receives zero credit, even though two of three tokens overlap. Partial match evaluation schemes give credit for overlapping predictions, typically proportional to the token overlap.

If the system predicts "New York" for a gold entity "New York Times", partial match might give 2/3 credit for the two overlapping tokens. This is more forgiving but complicates interpretation: a system with 80% partial match F1 might still make frequent boundary errors.

Most NER benchmarks use exact match because it provides a clear, unambiguous standard. However, for applications where approximate entity identification is acceptable, partial match metrics can provide additional insight into system behavior.
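
One simple way to realize partial credit, among the several schemes used in practice, is to score each prediction by the fraction of gold tokens it covers. The sketch below applies that idea to the "New York" versus "New York Times" example; it assumes the prediction's type already matches the gold type.

def overlap_credit(gold_span, pred_span):
    """Fraction of gold tokens covered by the prediction.
    Spans are (start, end) token offsets with end exclusive."""
    gold_start, gold_end = gold_span
    pred_start, pred_end = pred_span
    overlap = max(0, min(gold_end, pred_end) - max(gold_start, pred_start))
    return overlap / (gold_end - gold_start)


# Gold: "New York Times" at tokens 0-3; prediction: "New York" at tokens 0-2
print(overlap_credit((0, 3), (0, 2)))  # 0.666..., versus 0 under exact match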

Per-Type Evaluation

Overall metrics can hide disparities across entity types. A system might excel at detecting person names but struggle with organizations. Per-type breakdown reveals these patterns:

In[24]:
Code
def evaluate_by_type(gold_entities, pred_entities):
    """Compute per-type precision, recall, F1"""
    # Group by type
    gold_by_type = {}
    pred_by_type = {}

    for start, end, etype in gold_entities:
        gold_by_type.setdefault(etype, set()).add((start, end))

    for start, end, etype in pred_entities:
        pred_by_type.setdefault(etype, set()).add((start, end))

    all_types = sorted(set(gold_by_type.keys()) | set(pred_by_type.keys()))

    results = {}
    for etype in all_types:
        gold_set = gold_by_type.get(etype, set())
        pred_set = pred_by_type.get(etype, set())

        tp = len(gold_set & pred_set)
        fp = len(pred_set - gold_set)
        fn = len(gold_set - pred_set)

        p = tp / (tp + fp) if (tp + fp) > 0 else 0
        r = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0

        results[etype] = {
            "precision": p,
            "recall": r,
            "f1": f1,
            "support": len(gold_set),
        }

    return results


# Larger example for per-type evaluation
gold_extended = [
    (0, 2, "PERSON"),
    (5, 6, "PERSON"),
    (10, 11, "PERSON"),  # 3 persons
    (15, 17, "ORG"),
    (20, 21, "ORG"),  # 2 orgs
    (25, 26, "LOC"),
    (30, 31, "LOC"),
    (35, 36, "LOC"),
    (40, 41, "LOC"),  # 4 locs
]

pred_extended = [
    (0, 2, "PERSON"),
    (5, 6, "PERSON"),  # 2 correct persons (missed 1)
    (15, 17, "ORG"),
    (20, 21, "ORG"),
    (22, 23, "ORG"),  # 2 correct + 1 spurious org
    (25, 26, "LOC"),
    (30, 31, "LOC"),  # 2 correct locs (missed 2)
]

type_results = evaluate_by_type(gold_extended, pred_extended)

The per-type breakdown reveals distinct performance patterns across entity categories:

Per-type NER evaluation metrics. PERSON entities achieve 100% precision but only 67% recall (1 missed). ORG shows 67% precision due to a spurious prediction. LOC suffers from 50% recall, missing half of all location entities.
Entity Type    Precision    Recall    F1 Score    Support
---------------------------------------------------------
LOC            100%         50%       67%         4
ORG            67%          100%      80%         2
PERSON         100%         67%       80%         3

The system achieves perfect precision on PERSON and LOC entities, meaning every prediction for these types was correct. However, recall gaps tell a different story: the system missed one of three PERSON entities (67% recall) and two of four LOC entities (50% recall). For ORG, the pattern reverses: the system found all gold ORG entities (100% recall) but made one spurious prediction, dropping precision to 67%. This per-type analysis reveals that different entity types may require different improvement strategies.

Standard Benchmarks

NER systems are typically evaluated on standard benchmark datasets:

  • CoNLL-2003: News articles with PER, LOC, ORG, MISC tags. The most widely used benchmark for English NER.
  • OntoNotes 5.0: Diverse genres with 18 entity types including dates, times, quantities.
  • ACE 2005: Focuses on person, organization, location, facility, geo-political entity, vehicle, weapon.
  • WNUT: Social media text, particularly challenging due to informal language.
In[25]:
Code
# Representative benchmark statistics
benchmarks = {
    "CoNLL-2003": {
        "domain": "News (Reuters)",
        "entity_types": 4,
        "train_entities": 23499,
        "test_entities": 5648,
        "sota_f1": 94.6,
    },
    "OntoNotes 5.0": {
        "domain": "Mixed (news, web, broadcast)",
        "entity_types": 18,
        "train_entities": 81828,
        "test_entities": 11257,
        "sota_f1": 92.4,
    },
    "WNUT-17": {
        "domain": "Social media",
        "entity_types": 6,
        "train_entities": 1975,
        "test_entities": 1287,
        "sota_f1": 56.5,
    },
}
Out[26]:
Console
Standard NER Benchmarks:
---------------------------------------------------------------------------
Dataset          Domain                          Types    Train     Test  SOTA F1
---------------------------------------------------------------------------
CoNLL-2003       News (Reuters)                      4    23499     5648    94.6%
OntoNotes 5.0    Mixed (news, web, broadcast)       18    81828    11257    92.4%
WNUT-17          Social media                        6     1975     1287    56.5%

The benchmark statistics reveal a striking performance gap. CoNLL-2003 and OntoNotes achieve state-of-the-art F1 scores above 92%, reflecting the relative consistency of formal news and web text. The WNUT benchmark on social media reaches only 56.5% F1: novel entities, creative spelling, and informal grammar confound models trained on formal text.

Implementing NER with spaCy

Let's build a complete NER pipeline using spaCy, demonstrating entity extraction, visualization, and practical post-processing.

In[27]:
Code
def extract_entities_with_context(text, context_words=3):
    """Extract entities with surrounding context"""
    doc = nlp(text)
    results = []

    for ent in doc.ents:
        # Get context tokens
        start_idx = max(0, ent.start - context_words)
        end_idx = min(len(doc), ent.end + context_words)

        context_before = doc[start_idx : ent.start].text
        context_after = doc[ent.end : end_idx].text

        results.append(
            {
                "entity": ent.text,
                "type": ent.label_,
                "context_before": context_before,
                "context_after": context_after,
                "start": ent.start_char,
                "end": ent.end_char,
            }
        )

    return results


article = """
Elon Musk's Tesla announced record deliveries of 500,000 vehicles in Q4 2023. 
The company's Shanghai factory produced over 200,000 units. Meanwhile, rival 
Ford Motor Company reported strong sales of its F-150 Lightning electric truck.
Analysts at Morgan Stanley raised Tesla's price target to $300 per share.
"""

entities_with_context = extract_entities_with_context(article)
Out[28]:
Console
Entities with Context:
----------------------------------------------------------------------

[WORK_OF_ART] "Elon Musk's Tesla"
  Context: ...
 [Elon Musk's Tesla] announced record deliveries...

[CARDINAL] "500,000"
  Context: ...record deliveries of [500,000] vehicles in Q4...

[DATE] "Q4 2023"
  Context: ...500,000 vehicles in [Q4 2023] . 
The...

[GPE] "Shanghai"
  Context: ...The company's [Shanghai] factory produced over...

[CARDINAL] "over 200,000"
  Context: ...Shanghai factory produced [over 200,000] units. Meanwhile...

[ORG] "Ford Motor Company"
  Context: ..., rival 
 [Ford Motor Company] reported strong sales...

[WORK_OF_ART] "F-150 Lightning"
  Context: ...sales of its [F-150 Lightning] electric truck....

[ORG] "Morgan Stanley"
  Context: ...
Analysts at [Morgan Stanley] raised Tesla's...

[ORG] "Tesla"
  Context: ...Morgan Stanley raised [Tesla] 's price target...

[MONEY] "300"
  Context: ...target to $ [300] per share....

The context reveals how surrounding words help identify and classify entities, and it also aids interpretation and debugging. "Elon Musk's Tesla" shows a possessive construction linking a person to an organization, though the small model mislabels the whole span as WORK_OF_ART rather than separating the PERSON and the ORG. The phrase "rival Ford Motor Company" provides a semantic cue that this is a competitor company. Context is particularly valuable for ambiguous mentions where the same text could refer to different entity types.

Entity Linking and Normalization

Raw entity mentions often need normalization. "Tesla", "Tesla Inc.", and "Tesla Motors" all refer to the same company. Entity linking connects mentions to canonical identifiers in a knowledge base.

In[29]:
Code
# Simple normalization rules
normalization_rules = {
    "ORG": {
        "tesla": "Tesla, Inc.",
        "tesla inc": "Tesla, Inc.",
        "tesla motors": "Tesla, Inc.",
        "ford": "Ford Motor Company",
        "ford motor": "Ford Motor Company",
        "morgan stanley": "Morgan Stanley",
    },
    "PERSON": {
        "elon musk": "Elon Musk",
        "musk": "Elon Musk",
    },
}


def normalize_entity(entity_text, entity_type):
    """Normalize entity to canonical form"""
    key = entity_text.lower().strip()

    if entity_type in normalization_rules:
        if key in normalization_rules[entity_type]:
            return normalization_rules[entity_type][key]

    return entity_text


# Apply normalization
normalized_entities = []
for ent in entities_with_context:
    normalized = normalize_entity(ent["entity"], ent["type"])
    normalized_entities.append(
        {
            "original": ent["entity"],
            "normalized": normalized,
            "type": ent["type"],
            "changed": ent["entity"] != normalized,
        }
    )
Out[30]:
Console
Entity Normalization:
-------------------------------------------------------
  "Elon Musk's Tesla" (unchanged) (WORK_OF_ART)
  "500,000" (unchanged) (CARDINAL)
  "Q4 2023" (unchanged) (DATE)
  "Shanghai" (unchanged) (GPE)
  "over 200,000" (unchanged) (CARDINAL)
  "Ford Motor Company" (unchanged) (ORG)
  "F-150 Lightning" (unchanged) (WORK_OF_ART)
  "Morgan Stanley" (unchanged) (ORG)
  "Tesla" → "Tesla, Inc." (ORG)
  "300" (unchanged) (MONEY)

Production NER systems often include a normalization step that maps mentions to unique identifiers. This enables aggregation across documents and linking to knowledge bases like Wikidata or corporate databases.
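
As a minimal sketch of that linking step, the code below reuses the normalize_entity function from the previous cell and looks the canonical name up in a toy in-memory knowledge base. The identifiers are placeholders; a production system would use Wikidata QIDs or internal database keys instead.

# Toy knowledge base keyed by canonical name; the IDs are placeholders.
knowledge_base = {
    "Tesla, Inc.": {"id": "KB:0001", "ticker": "TSLA"},
    "Ford Motor Company": {"id": "KB:0002", "ticker": "F"},
    "Morgan Stanley": {"id": "KB:0003", "ticker": "MS"},
}


def link_entity(entity_text, entity_type):
    """Normalize a mention, then look it up in the toy knowledge base."""
    canonical = normalize_entity(entity_text, entity_type)
    return {
        "mention": entity_text,
        "canonical": canonical,
        "kb_entry": knowledge_base.get(canonical),  # None if unknown
    }


print(link_entity("Tesla", "ORG"))
# {'mention': 'Tesla', 'canonical': 'Tesla, Inc.',
#  'kb_entry': {'id': 'KB:0001', 'ticker': 'TSLA'}}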

Aggregating Entity Mentions

When processing multiple documents, you often want to count entity occurrences and track co-occurrence patterns:

In[31]:
Code
from collections import defaultdict

documents = [
    "Apple CEO Tim Cook announced the iPhone 15 launch in Cupertino.",
    "Tim Cook met with investors in New York to discuss Apple's strategy.",
    "Google's Sundar Pichai responded to Apple's announcement from Mountain View.",
    "Amazon and Microsoft are also competing in the AI space with Apple.",
]

# Aggregate entities across documents
entity_counts = defaultdict(lambda: {"count": 0, "types": set(), "docs": []})

for doc_idx, text in enumerate(documents):
    doc = nlp(text)
    for ent in doc.ents:
        key = ent.text.lower()
        entity_counts[key]["count"] += 1
        entity_counts[key]["types"].add(ent.label_)
        entity_counts[key]["docs"].append(doc_idx)
        entity_counts[key]["canonical"] = ent.text

# Sort by frequency
sorted_entities = sorted(entity_counts.items(), key=lambda x: -x[1]["count"])
Out[32]:
Console
Entity Frequency Analysis:
------------------------------------------------------------
Entity                Count Type(s)         Documents      
------------------------------------------------------------
Apple                     4 ORG             0, 1, 2, 3     
Tim Cook                  2 PERSON          0, 1           
15                        1 CARDINAL        0              
Cupertino                 1 GPE             0              
New York                  1 GPE             1              
Google                    1 ORG             2              
Sundar Pichai             1 PERSON          2              
Mountain View             1 GPE             2              
Amazon                    1 ORG             3              
Microsoft                 1 ORG             3              

Entity aggregation is essential for applications like entity-based document clustering, co-reference resolution across documents, and building knowledge graphs from text.

Key Parameters

When working with spaCy for NER, several parameters and configuration options affect performance:

  • Model size (en_core_web_sm, en_core_web_md, en_core_web_lg, en_core_web_trf): Larger models generally achieve better accuracy. The transformer-based trf model provides the best performance but requires more memory and computation.

  • Entity labels: spaCy's models recognize a fixed set of entity types (PERSON, ORG, GPE, DATE, MONEY, etc.). Custom entity types require fine-tuning with annotated training data.

  • Context window: Entity recognition depends on surrounding context. Short snippets may lack sufficient context for disambiguation, while very long documents may exceed memory limits.

  • Batch processing: For large document collections, process documents in batches using nlp.pipe() with the n_process parameter for parallel processing.

  • Entity ruler: For domain-specific entities with known patterns, combine statistical NER with rule-based matching using spaCy's EntityRuler component (see the sketch after this list).
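
The sketch below shows how these last two options fit together, assuming the small English pipeline is installed; the PRODUCT patterns are illustrative placeholders rather than a real gazetteer, and the exact entities returned depend on the statistical model.

import spacy

nlp_custom = spacy.load("en_core_web_sm")

# Insert the ruler before the statistical "ner" component so its matches
# take precedence over the model's predictions.
ruler = nlp_custom.add_pipe("entity_ruler", before="ner")
ruler.add_patterns(
    [
        {"label": "PRODUCT", "pattern": "iPhone 15 Pro Max"},
        {"label": "PRODUCT", "pattern": "F-150 Lightning"},
    ]
)

doc = nlp_custom("Apple announced the iPhone 15 Pro Max in Cupertino.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Expected to include ('iPhone 15 Pro Max', 'PRODUCT') alongside the
# model's own ORG/GPE predictions.

# For large collections, nlp.pipe processes documents in batches
# (add n_process for multiprocessing).
texts = ["Tesla opened a factory in Berlin.", "Microsoft acquired a startup."]
for d in nlp_custom.pipe(texts, batch_size=64):
    print([(ent.text, ent.label_) for ent in d.ents])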

Limitations and Challenges

NER has achieved impressive accuracy on benchmark datasets, but significant challenges remain in real-world applications.

The most pervasive issue is domain shift. NER systems trained on news text struggle with biomedical literature, legal documents, social media, and other specialized domains. Domain-specific entities may not exist in training data, and domain-specific patterns may confuse general-purpose models. A model that excels at detecting political figures and companies may fail entirely when confronted with drug names, gene symbols, or legal citations. The practical impact is substantial: deploying a pre-trained NER system on a new domain typically requires at least some domain-specific fine-tuning to achieve acceptable performance.

Rare and emerging entities pose another fundamental challenge. Language evolves continuously, with new people, organizations, products, and concepts entering the discourse. NER systems can only recognize what they've seen patterns for. A model trained before 2023 wouldn't recognize "ChatGPT" as a product or "Anthropic" as an organization. This temporal gap means production NER systems require regular retraining or sophisticated mechanisms for detecting novel entities.

Ambiguous entity types also cause persistent errors. Is "Washington" a person, a location, or an organization? All three are possible depending on context: George Washington (person), Washington D.C. (location), Washington Nationals (organization). Even with context, some cases remain genuinely ambiguous. The same proper noun can refer to different entity types in different sentences, and NER systems must learn subtle contextual cues to disambiguate.

Summary

Named Entity Recognition extracts references to real-world entities from text, classifying them into categories like person, organization, and location. The key concepts covered in this chapter include:

Entity types define the taxonomy of entities a system can recognize. Standard categories include PER, ORG, LOC, and extensions like DATE, MONEY, and PERCENT. Domain-specific applications may require custom entity types.

NER as sequence labeling assigns labels to each token indicating entity membership. The BIO scheme marks entity boundaries with B (beginning), I (inside), and O (outside) tags, enabling multi-token entity spans.

Nested entities occur when entities contain other entities. Standard sequence labeling cannot represent nesting, requiring either flat annotation choices or specialized span-based approaches.

Entity boundaries are often ambiguous, particularly with modifiers, titles, and compound entities. Annotation guidelines must make consistent decisions that models learn to replicate.

Evaluation uses exact match metrics (precision, recall, F1) where predictions must match gold entities exactly in both span and type. Per-type breakdown reveals performance disparities across entity categories.

The next chapter explores BIO tagging in depth, covering scheme variants, conversion algorithms, and implementation details that underpin practical NER systems.
