BIO Tagging: Encoding Entity Boundaries for Sequence Labeling

Michael Brenndoerfer · December 15, 2025 · 26 min read

Learn the BIO tagging scheme for named entity recognition, including BIOES variants, span-to-tag conversion, decoding, and handling malformed sequences.

BIO Tagging

Sequence labeling tasks like named entity recognition face a fundamental challenge: how do you represent multi-word entities using per-token labels? The sentence "New York City is beautiful" contains three tokens that together form a single location entity. Assigning all three the label "LOC" creates ambiguity. Does "New" start a new entity, or does it continue one that began earlier? Are "New," "York," and "City" three separate locations, or one?

BIO tagging solves this problem elegantly. The scheme uses a small set of prefixes to encode entity boundaries directly in the labels. B marks the beginning of an entity, I marks inside (continuation), and O marks outside (no entity). With BIO tags, "New York City" becomes B-LOC I-LOC I-LOC, unambiguously marking a single three-token entity.

This chapter explores BIO tagging from its basic mechanics through practical implementation. You'll learn the standard BIO scheme and its variants, implement converters between span annotations and BIO tags, build decoders that extract entities from tagged sequences, and handle the edge cases that arise in real-world tagging scenarios.

The BIO Scheme

BIO tagging encodes entity boundaries through prefix annotations. Each token receives a label combining a position indicator (B, I, or O) with an entity type. The three positions work together to delimit entity spans without ambiguity.

BIO Tagging Scheme

BIO (Beginning-Inside-Outside) is a tagging scheme for sequence labeling where each token receives a label indicating its position relative to entity spans: B marks the first token of an entity, I marks subsequent tokens within the same entity, and O marks tokens outside any entity.

Let's see how BIO tagging works on a concrete example:

In[2]:
Code
# A sentence with named entities
sentence = ["Barack", "Obama", "visited", "New", "York", "City", "yesterday"]

# Entity spans as (start_idx, end_idx, entity_type)
# Note: end_idx is exclusive (Python convention)
entities = [
    (0, 2, "PER"),  # Barack Obama
    (3, 6, "LOC"),  # New York City
]

# BIO tags for each token
bio_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC", "O"]
Out[3]:
Console
Token-level BIO Annotation:
----------------------------------------
Token        BIO Tag    Meaning
----------------------------------------
Barack       B-PER      Begin person entity
Obama        I-PER      Inside person entity
visited      O          Outside any entity
New          B-LOC      Begin location entity
York         I-LOC      Inside location entity
City         I-LOC      Inside location entity
yesterday    O          Outside any entity
Out[4]:
Visualization
Horizontal sequence diagram showing tokens with colored boxes indicating BIO tags and entity boundaries.
Visual representation of BIO tagging on a sentence. Each token receives a label encoding its position within entity spans. Blue indicates person entities (PER), green indicates location entities (LOC), and gray indicates tokens outside any entity (O). The B prefix marks entity beginnings, while I marks continuation tokens.

The BIO scheme achieves two critical goals. First, it marks entity boundaries explicitly. When you see a B tag, you know a new entity starts at that position. When you see an I tag following a B tag of the same type, you know the entity continues. Second, it handles adjacent entities correctly. If "Barack Obama" and "Michelle Obama" appeared consecutively without a gap, the B prefix on "Michelle" would clearly mark the second entity's start: B-PER I-PER B-PER I-PER.

Why Not Just Use Entity Types?

A simpler approach might label each token with just its entity type: PER, LOC, or O. Let's see why this fails:

In[5]:
Code
# Two consecutive person entities
sentence_adjacent = ["Barack", "Obama", "Michelle", "Obama", "attended"]

# With simple entity-type labels
simple_labels = ["PER", "PER", "PER", "PER", "O"]

# With BIO labels
bio_labels = ["B-PER", "I-PER", "B-PER", "I-PER", "O"]
Out[6]:
Console
Adjacent Entities Problem:
-------------------------------------------------------
Token        Simple     BIO       
-------------------------------------------------------
Barack       PER        B-PER     
Obama        PER        I-PER     
Michelle     PER        B-PER     
Obama        PER        I-PER     
attended     O          O         

Interpretation with simple labels: 1 entity 'Barack Obama Michelle Obama'
Interpretation with BIO labels:    2 entities 'Barack Obama' and 'Michelle Obama'

Without the B prefix, we cannot determine where one entity ends and another begins. The simple scheme makes adjacent same-type entities indistinguishable from single multi-token entities. Real text contains many such cases: lists of names, multiple locations, consecutive organization mentions. BIO tagging handles all of them correctly.

The O Tag

The O tag marks tokens that don't belong to any entity. It carries no suffix because "outside" is the only interpretation. In typical NER datasets, O tokens vastly outnumber entity tokens since most words in a sentence are not named entities:

In[7]:
Code
example_text = [
    "The",
    "president",
    "of",
    "the",
    "United",
    "States",
    "met",
    "with",
    "Angela",
    "Merkel",
    "in",
    "Berlin",
    ".",
]

example_bio = [
    "O",
    "O",
    "O",
    "O",
    "B-LOC",
    "I-LOC",
    "O",
    "O",
    "B-PER",
    "I-PER",
    "O",
    "B-LOC",
    "O",
]

# Count tag distribution
from collections import Counter

tag_counts = Counter(example_bio)
Out[8]:
Console
Tag Distribution in Sample Sentence:
----------------------------------------
O       :  8 tokens ( 61.5%)
B-LOC   :  2 tokens ( 15.4%)
I-LOC   :  1 tokens (  7.7%)
B-PER   :  1 tokens (  7.7%)
I-PER   :  1 tokens (  7.7%)

Entity tokens: 5/13 (38.5%)
Out[9]:
Visualization
Bar chart showing tag frequency distribution with O tag having highest count around 8 tokens and entity tags having 1-2 tokens each.
Tag distribution showing the severe class imbalance typical in NER datasets. The O (outside) tag dominates, comprising over 60% of tokens, while entity tags (B and I prefixes) are relatively rare. This imbalance poses challenges for training and requires strategies like weighted loss functions or focal loss.

This class imbalance, where O tokens dominate, is characteristic of sequence labeling tasks. Training algorithms must account for it, often through weighted loss functions or sampling strategies. The O tag's prevalence also means that a baseline of always predicting O achieves deceptively high accuracy but zero utility.
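
To make the imbalance mitigation concrete, one common approach is to weight each tag's loss contribution by its inverse frequency. The helper below is a minimal sketch that computes such weights from tag counts like those above; the rescaling so that weights average to 1.0 is one reasonable choice, not a prescribed formula.

from collections import Counter

def inverse_frequency_weights(tag_sequences):
    """Compute per-tag weights inversely proportional to tag frequency.

    A minimal sketch: rare entity tags receive larger weights than the
    dominant O tag, so a weighted cross-entropy loss pays more attention
    to them during training.
    """
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    # Inverse frequency, rescaled so the average weight is 1.0
    raw = {tag: total / count for tag, count in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {tag: weight / mean for tag, weight in raw.items()}

# Using the example_bio sequence from above: O receives the smallest
# weight, while the rarer entity tags receive the largest.
weights = inverse_frequency_weights([example_bio])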

Extended Tagging Schemes

The basic BIO scheme is sufficient for many applications, but more complex annotation scenarios have motivated several extensions. These variants add prefixes to capture additional boundary information or handle special cases.

BIOES (BILOU) Scheme

The BIOES scheme adds two more prefixes: E for the end of a multi-token entity and S for single-token entities. Some practitioners call this BILOU, using L (last) instead of E and U (unit) instead of S, but the semantics are identical.

In[10]:
Code
# The five tag types
bioes_tags = {
    "B": "Beginning of multi-token entity",
    "I": "Inside multi-token entity",
    "O": "Outside any entity",
    "E": "End of multi-token entity",
    "S": "Single-token entity",
}

# Example sentences
sentences = [
    (
        ["New", "York", "is", "great"],
        ["B-LOC", "E-LOC", "O", "O"],
        "Two-token entity",
    ),
    (["Paris", "is", "beautiful"], ["S-LOC", "O", "O"], "Single-token entity"),
    (
        ["The", "United", "States", "of", "America"],
        ["O", "B-LOC", "I-LOC", "I-LOC", "E-LOC"],
        "Multi-token entity",
    ),
]
Out[11]:
Console
BIOES Tag Types:
--------------------------------------------------
  B: Beginning of multi-token entity
  I: Inside multi-token entity
  O: Outside any entity
  E: End of multi-token entity
  S: Single-token entity

Examples:
--------------------------------------------------

Two-token entity:
  New          → B-LOC
  York         → E-LOC
  is           → O
  great        → O

Single-token entity:
  Paris        → S-LOC
  is           → O
  beautiful    → O

Multi-token entity:
  The          → O
  United       → B-LOC
  States       → I-LOC
  of           → I-LOC
  America      → E-LOC

Why add more tags? BIOES provides two benefits. First, the model learns to recognize entity endpoints explicitly rather than inferring them from tag transitions. Research has shown modest accuracy improvements from BIOES over BIO, particularly for longer entities where boundary precision matters. Second, BIOES adds structural constraints that rule out certain malformed outputs: in a valid BIOES sequence, every B must eventually be closed by an E of the same type, with only matching I tags in between, and every S must stand alone. These constraints can be enforced during decoding.
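
As a sketch of how these constraints can be checked, the following function (an illustrative helper, not from any particular library) verifies that a BIOES sequence respects the pairing rules described above.

def is_valid_bioes(tags):
    """Check that a BIOES sequence obeys the pairing constraints.

    Rules enforced:
      - B-X opens an entity that must be closed by E-X (only I-X in between)
      - I-X and E-X may only appear while an entity of type X is open
      - O and S-X may only appear when no entity is open
    """
    open_type = None  # type of the currently open multi-token entity, if any
    for tag in tags:
        prefix, _, etype = tag.partition("-")
        if prefix == "B":
            if open_type is not None:
                return False  # previous entity was never closed
            open_type = etype
        elif prefix == "I":
            if open_type != etype:
                return False  # I outside an entity, or type mismatch
        elif prefix == "E":
            if open_type != etype:
                return False
            open_type = None  # entity closed
        elif prefix in ("O", "S"):
            if open_type is not None:
                return False  # O or S inside an unclosed entity
        else:
            return False  # unrecognized prefix
    return open_type is None  # no entity left open at the end


is_valid_bioes(["B-LOC", "E-LOC", "O", "S-PER"])  # True
is_valid_bioes(["B-LOC", "I-LOC", "O"])           # False: B never closed by E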

BMEWO and Other Variants

Researchers have proposed numerous other schemes for specialized scenarios:

In[12]:
Code
# Alternative schemes
schemes = {
    "IO": {
        "description": "Simplest scheme: Inside/Outside only",
        "tags": ["I", "O"],
        "limitation": "Cannot distinguish adjacent entities of same type",
    },
    "BIO": {
        "description": "Standard scheme with begin marker",
        "tags": ["B", "I", "O"],
        "limitation": "End boundary is implicit",
    },
    "BIOES": {
        "description": "Adds explicit end and single markers",
        "tags": ["B", "I", "O", "E", "S"],
        "limitation": "More tags mean more parameters",
    },
    "BMEWO": {
        "description": "Begin/Middle/End/Word/Outside",
        "tags": ["B", "M", "E", "W", "O"],
        "limitation": "Equivalent to BIOES with different naming",
    },
}
Out[13]:
Console
Tagging Scheme Comparison:
============================================================

IO Scheme
  Tags: I, O
  Description: Simplest scheme: Inside/Outside only
  Limitation: Cannot distinguish adjacent entities of same type

BIO Scheme
  Tags: B, I, O
  Description: Standard scheme with begin marker
  Limitation: End boundary is implicit

BIOES Scheme
  Tags: B, I, O, E, S
  Description: Adds explicit end and single markers
  Limitation: More tags mean more parameters

BMEWO Scheme
  Tags: B, M, E, W, O
  Description: Begin/Middle/End/Word/Outside
  Limitation: Equivalent to BIOES with different naming

The choice of scheme involves tradeoffs. More tags provide richer supervision but increase the number of classes the model must predict. For most NER tasks, BIO strikes a good balance, while BIOES offers marginal improvements when maximum boundary precision is critical.

Valid Tag Transitions

Understanding which tag sequences are valid helps when designing decoders or training models with constraints. Not all tag combinations make sense: an I-PER cannot follow a B-LOC, and an I tag cannot directly follow an O. The following heatmap shows which transitions are valid in the BIO scheme:

Out[14]:
Visualization
Heatmap showing valid and invalid BIO tag transitions, with green for valid transitions like B-PER to I-PER and red for invalid ones like B-PER to I-LOC.
Valid tag transitions in the BIO tagging scheme. Green cells indicate allowed transitions, red cells indicate invalid transitions. An I tag can only follow a B or I tag of the same entity type. Models can be constrained during decoding to only produce valid sequences, improving prediction quality.

The key constraint is that I tags must match the type of their preceding B or I tag. An I-PER can follow B-PER or I-PER, but not B-LOC or I-LOC. This constraint can be enforced during inference using constrained beam search or CRF layers, improving the coherence of predicted sequences.
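A small helper makes this rule executable. The function below is an illustrative sketch, not tied to any library: it answers whether one BIO tag may follow another, and can be used to build the kind of boolean transition matrix that constrained decoders or CRF layers consume.

def is_valid_bio_transition(prev_tag, next_tag):
    """Return True if next_tag may directly follow prev_tag in strict BIO.

    The only restricted case is an I tag: it must continue an entity of
    the same type, so it can only follow B-X or I-X with a matching X.
    """
    if next_tag.startswith("I-"):
        return prev_tag[:2] in ("B-", "I-") and prev_tag[2:] == next_tag[2:]
    return True  # O and B-X may follow anything


tagset = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

# Boolean matrix: allowed[i][j] is True if tagset[j] may follow tagset[i]
allowed = [
    [is_valid_bio_transition(prev, nxt) for nxt in tagset]
    for prev in tagset
]

is_valid_bio_transition("B-PER", "I-PER")  # True
is_valid_bio_transition("B-LOC", "I-PER")  # False
is_valid_bio_transition("O", "I-LOC")      # False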

Converting Spans to BIO Tags

Annotation tools often store entity information as character or token spans rather than per-token labels. Converting these span annotations to BIO format is a common preprocessing step. Let's build a robust converter.

In[15]:
Code
def spans_to_bio(tokens, spans):
    """
    Convert span annotations to BIO tags.

    Args:
        tokens: List of tokens
        spans: List of (start_idx, end_idx, entity_type) tuples
               where indices refer to token positions and end_idx is exclusive

    Returns:
        List of BIO tags, one per token
    """
    # Initialize all tokens as Outside
    tags = ["O"] * len(tokens)

    # Sort spans by start position to handle overlaps deterministically
    sorted_spans = sorted(spans, key=lambda x: x[0])

    for start, end, entity_type in sorted_spans:
        # Validate span boundaries
        if start < 0 or end > len(tokens) or start >= end:
            continue

        # Tag first token with B prefix
        tags[start] = f"B-{entity_type}"

        # Tag remaining tokens with I prefix
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"

    return tags


# Test cases
test_tokens = ["John", "Smith", "works", "at", "Google", "Inc", "."]
test_spans = [
    (0, 2, "PER"),  # John Smith
    (4, 6, "ORG"),  # Google Inc
]

bio_result = spans_to_bio(test_tokens, test_spans)
Out[16]:
Console
Span to BIO Conversion:
---------------------------------------------
Token      BIO Tag   
---------------------------------------------
John       B-PER     
Smith      I-PER     
works      O         
at         O         
Google     B-ORG     
Inc        I-ORG     
.          O         

Input spans:
  [0:2] 'John Smith' -> PER
  [4:6] 'Google Inc' -> ORG

The converter handles the common case well. But real-world data presents edge cases: what happens with overlapping spans, single-token entities, or spans at sentence boundaries?

In[17]:
Code
# Edge case tests
edge_cases = [
    {
        "name": "Single-token entity",
        "tokens": ["Paris", "is", "lovely"],
        "spans": [(0, 1, "LOC")],
    },
    {
        "name": "Entity at end",
        "tokens": ["Visit", "New", "York"],
        "spans": [(1, 3, "LOC")],
    },
    {
        "name": "Adjacent entities",
        "tokens": ["Obama", "Biden", "met"],
        "spans": [(0, 1, "PER"), (1, 2, "PER")],
    },
    {
        "name": "All tokens are entities",
        "tokens": ["Barack", "Obama"],
        "spans": [(0, 2, "PER")],
    },
]

edge_results = []
for case in edge_cases:
    tags = spans_to_bio(case["tokens"], case["spans"])
    edge_results.append(
        {"name": case["name"], "tokens": case["tokens"], "tags": tags}
    )
Out[18]:
Console
Edge Case Handling:
=======================================================

Single-token entity:
  Paris        → B-LOC
  is           → O
  lovely       → O

Entity at end:
  Visit        → O
  New          → B-LOC
  York         → I-LOC

Adjacent entities:
  Obama        → B-PER
  Biden        → B-PER
  met          → O

All tokens are entities:
  Barack       → B-PER
  Obama        → I-PER

Single-token entities receive only a B tag since there's no continuation. Adjacent same-type entities each start with B, correctly distinguishing them. The converter handles boundary positions without special-casing.

BIOES Conversion

For applications requiring BIOES format, we extend the converter to track entity boundaries:

In[19]:
Code
def spans_to_bioes(tokens, spans):
    """
    Convert span annotations to BIOES tags.

    Single-token entities get S tag.
    Multi-token entities get B...I...E pattern.
    """
    tags = ["O"] * len(tokens)
    sorted_spans = sorted(spans, key=lambda x: x[0])

    for start, end, entity_type in sorted_spans:
        if start < 0 or end > len(tokens) or start >= end:
            continue

        span_length = end - start

        if span_length == 1:
            # Single-token entity
            tags[start] = f"S-{entity_type}"
        else:
            # Multi-token entity
            tags[start] = f"B-{entity_type}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{entity_type}"
            tags[end - 1] = f"E-{entity_type}"

    return tags


# Compare BIO vs BIOES
comparison_tokens = ["The", "New", "York", "Times", "reported"]
comparison_spans = [(1, 4, "ORG")]

bio_output = spans_to_bio(comparison_tokens, comparison_spans)
bioes_output = spans_to_bioes(comparison_tokens, comparison_spans)
Out[20]:
Console
BIO vs BIOES Comparison:
--------------------------------------------------
Token        BIO          BIOES       
--------------------------------------------------
The          O            O           
New          B-ORG        B-ORG       
York         I-ORG        I-ORG       
Times        I-ORG        E-ORG       
reported     O            O           
Out[21]:
Visualization
Two-row diagram comparing BIO and BIOES tagging for New York Times entity, showing E-ORG marker in BIOES vs I-ORG in BIO.
Side-by-side comparison of BIO and BIOES tagging schemes on the same entity span. BIO uses only B and I prefixes, leaving end boundaries implicit. BIOES adds explicit E (end) markers, making it easier to detect and validate entity boundaries during decoding.

The BIOES output makes the entity endpoint explicit: "Times" receives E-ORG rather than I-ORG, marking it as the final token.

Decoding BIO Tags to Spans

The inverse operation extracts entity spans from a sequence of BIO tags. This is essential for evaluating model predictions and converting output to a usable format. The decoder must handle both well-formed and malformed tag sequences.

In[22]:
Code
def bio_to_spans(tokens, tags):
    """
    Extract entity spans from BIO-tagged sequence.

    Returns list of (start_idx, end_idx, entity_type, text) tuples.
    """
    spans = []
    current_entity = None  # (start_idx, entity_type)

    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if tag.startswith("B-"):
            # Close any open entity
            if current_entity is not None:
                start, etype = current_entity
                spans.append((start, i, etype))

            # Start new entity
            entity_type = tag[2:]  # Remove 'B-' prefix
            current_entity = (i, entity_type)

        elif tag.startswith("I-"):
            entity_type = tag[2:]

            # Validate: I tag should follow B or I of same type
            if current_entity is None:
                # Orphan I tag: treat as beginning
                current_entity = (i, entity_type)
            elif current_entity[1] != entity_type:
                # Type mismatch: close old, start new
                start, etype = current_entity
                spans.append((start, i, etype))
                current_entity = (i, entity_type)
            # Otherwise: continue current entity

        else:  # O tag
            # Close any open entity
            if current_entity is not None:
                start, etype = current_entity
                spans.append((start, i, etype))
                current_entity = None

    # Handle entity at end of sequence
    if current_entity is not None:
        start, etype = current_entity
        spans.append((start, len(tokens), etype))

    # Add text to each span
    spans_with_text = []
    for start, end, etype in spans:
        text = " ".join(tokens[start:end])
        spans_with_text.append((start, end, etype, text))

    return spans_with_text


# Test decoding
test_tokens = ["Barack", "Obama", "visited", "New", "York", "City"]
test_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC"]

decoded_spans = bio_to_spans(test_tokens, test_tags)
Out[23]:
Console
BIO to Span Decoding:
--------------------------------------------------
Input: Barack Obama visited New York City
Tags:  B-PER I-PER O B-LOC I-LOC I-LOC

Extracted entities:
  [0:2] 'Barack Obama' -> PER
  [3:6] 'New York City' -> LOC

The decoder maintains state across tokens, tracking whether we're inside an entity and of what type. Key transitions occur when we encounter a B tag (start new entity), an O tag (end current entity), or reach the sequence end.

Handling Malformed Sequences

Model predictions don't always produce valid BIO sequences. Common errors include I tags without a preceding B tag and type mismatches where I-LOC follows B-PER. A robust decoder must handle these gracefully:

In[24]:
Code
# Malformed sequences that real models produce
malformed_cases = [
    {
        "name": "Orphan I tag (no preceding B)",
        "tokens": ["went", "to", "York", "City"],
        "tags": ["O", "O", "I-LOC", "I-LOC"],
    },
    {
        "name": "Type mismatch in continuation",
        "tokens": ["John", "Smith", "Jr"],
        "tags": ["B-PER", "I-PER", "I-ORG"],  # Error: ORG doesn't match PER
    },
    {
        "name": "Multiple B tags without I",
        "tokens": ["Paris", "London", "Berlin"],
        "tags": ["B-LOC", "B-LOC", "B-LOC"],
    },
]

malformed_results = []
for case in malformed_cases:
    spans = bio_to_spans(case["tokens"], case["tags"])
    malformed_results.append(
        {
            "name": case["name"],
            "tokens": case["tokens"],
            "tags": case["tags"],
            "spans": spans,
        }
    )
Out[25]:
Console
Handling Malformed Sequences:
============================================================

Orphan I tag (no preceding B):
  Tokens: ['went', 'to', 'York', 'City']
  Tags:   ['O', 'O', 'I-LOC', 'I-LOC']
  Decoded entities:
    [2:4] 'York City' -> LOC

Type mismatch in continuation:
  Tokens: ['John', 'Smith', 'Jr']
  Tags:   ['B-PER', 'I-PER', 'I-ORG']
  Decoded entities:
    [0:2] 'John Smith' -> PER
    [2:3] 'Jr' -> ORG

Multiple B tags without I:
  Tokens: ['Paris', 'London', 'Berlin']
  Tags:   ['B-LOC', 'B-LOC', 'B-LOC']
  Decoded entities:
    [0:1] 'Paris' -> LOC
    [1:2] 'London' -> LOC
    [2:3] 'Berlin' -> LOC

Our decoder applies sensible recovery strategies. Orphan I tags are treated as beginning a new entity. Type mismatches close the previous entity and start a fresh one. Consecutive B tags produce separate single-token entities. These choices maximize recall at the cost of some precision, which is often preferable for downstream error analysis.

BIOES Decoding

Decoding BIOES is slightly more complex but follows the same principles. The S and E tags provide additional boundary information:

In[26]:
Code
def bioes_to_spans(tokens, tags):
    """
    Extract entity spans from BIOES-tagged sequence.
    """
    spans = []
    current_entity = None

    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if tag.startswith("S-"):
            # Single-token entity: close any open entity, add this one
            if current_entity is not None:
                start, etype = current_entity
                spans.append((start, i, etype))
                current_entity = None

            entity_type = tag[2:]
            spans.append((i, i + 1, entity_type))

        elif tag.startswith("B-"):
            # Close any open entity, start new
            if current_entity is not None:
                start, etype = current_entity
                spans.append((start, i, etype))

            entity_type = tag[2:]
            current_entity = (i, entity_type)

        elif tag.startswith("I-"):
            # Continue if valid, otherwise treat as B
            entity_type = tag[2:]
            if current_entity is None or current_entity[1] != entity_type:
                if current_entity is not None:
                    start, etype = current_entity
                    spans.append((start, i, etype))
                current_entity = (i, entity_type)

        elif tag.startswith("E-"):
            # End entity
            entity_type = tag[2:]
            if current_entity is not None and current_entity[1] == entity_type:
                start, _ = current_entity
                spans.append((start, i + 1, entity_type))
            else:
                # Orphan E tag: treat as single token
                spans.append((i, i + 1, entity_type))
            current_entity = None

        else:  # O tag
            if current_entity is not None:
                start, etype = current_entity
                spans.append((start, i, etype))
                current_entity = None

    # Handle unclosed entity
    if current_entity is not None:
        start, etype = current_entity
        spans.append((start, len(tokens), etype))

    # Add text
    spans_with_text = []
    for start, end, etype in spans:
        text = " ".join(tokens[start:end])
        spans_with_text.append((start, end, etype, text))

    return spans_with_text


# Test BIOES decoding
bioes_tokens = ["John", "visited", "New", "York", "and", "Paris"]
bioes_tags = ["S-PER", "O", "B-LOC", "E-LOC", "O", "S-LOC"]

bioes_decoded = bioes_to_spans(bioes_tokens, bioes_tags)
Out[27]:
Console
BIOES Decoding:
--------------------------------------------------
Input: John visited New York and Paris
Tags:  S-PER O B-LOC E-LOC O S-LOC

Extracted entities:
  [0:1] 'John' -> PER
  [2:4] 'New York' -> LOC
  [5:6] 'Paris' -> LOC

The S tags directly produce single-token entities, while B-E pairs define multi-token spans. This explicit boundary marking simplifies validation and can catch more prediction errors.

Tag Consistency and Validation

Real-world tagging systems produce inconsistent output. A well-designed pipeline includes validation to detect problems and, where possible, repair them. Let's build a validator and repair function.

In[28]:
Code
def validate_bio_sequence(tags):
    """
    Validate a BIO tag sequence and report errors.

    Returns:
        List of (position, error_type, description) tuples
    """
    errors = []
    prev_tag = "O"

    for i, tag in enumerate(tags):
        if tag == "O":
            prev_tag = tag
            continue

        if not (tag.startswith("B-") or tag.startswith("I-")):
            errors.append((i, "INVALID_TAG", f"Unrecognized tag format: {tag}"))
            continue

        prefix = tag[0]
        entity_type = tag[2:] if len(tag) > 2 else ""

        if not entity_type:
            errors.append(
                (i, "MISSING_TYPE", f"Tag missing entity type: {tag}")
            )

        if prefix == "I":
            # Check for valid predecessor
            if prev_tag == "O":
                errors.append(
                    (i, "ORPHAN_I", f"I tag without preceding B: {tag}")
                )
            elif prev_tag.startswith("B-") or prev_tag.startswith("I-"):
                prev_type = prev_tag[2:]
                if prev_type != entity_type:
                    errors.append(
                        (
                            i,
                            "TYPE_MISMATCH",
                            f"I-{entity_type} follows {prev_tag}",
                        )
                    )

        prev_tag = tag

    return errors


# Test validation
problematic_tags = [
    "O",
    "I-PER",
    "I-PER",  # Orphan I
    "B-LOC",
    "I-ORG",  # Type mismatch
    "O",
    "B-PER",
    "O",
]

validation_errors = validate_bio_sequence(problematic_tags)
Out[29]:
Console
Tag Sequence Validation:
-------------------------------------------------------
Tags: ['O', 'I-PER', 'I-PER', 'B-LOC', 'I-ORG', 'O', 'B-PER', 'O']

Errors found:
  Position 1: [ORPHAN_I] I tag without preceding B: I-PER
  Position 4: [TYPE_MISMATCH] I-ORG follows B-LOC

Once we've identified errors, we can attempt repairs. The repair strategy depends on the application. Conservative approaches leave errors in place for manual review. Aggressive approaches apply heuristics to fix common patterns:

In[30]:
Code
def repair_bio_sequence(tags):
    """
    Attempt to repair common BIO sequence errors.

    Strategies:
    - Convert orphan I tags to B tags
    - Fix type mismatches by starting new entities

    Returns:
        Tuple of (repaired_tags, repair_log)
    """
    repaired = tags.copy()
    repairs = []
    prev_tag = "O"

    for i, tag in enumerate(repaired):
        if tag == "O":
            prev_tag = tag
            continue

        if tag.startswith("I-"):
            entity_type = tag[2:]

            # Check if this is an orphan I
            if prev_tag == "O":
                repaired[i] = f"B-{entity_type}"
                repairs.append((i, tag, repaired[i], "Orphan I -> B"))

            # Check for type mismatch
            elif prev_tag[0] in "BI" and prev_tag[2:] != entity_type:
                repaired[i] = f"B-{entity_type}"
                repairs.append((i, tag, repaired[i], "Type mismatch -> new B"))

        prev_tag = repaired[i]

    return repaired, repairs


repaired_tags, repair_log = repair_bio_sequence(problematic_tags)
Out[31]:
Console
Sequence Repair:
-------------------------------------------------------
Original:  ['O', 'I-PER', 'I-PER', 'B-LOC', 'I-ORG', 'O', 'B-PER', 'O']
Repaired:  ['O', 'B-PER', 'I-PER', 'B-LOC', 'B-ORG', 'O', 'B-PER', 'O']

Repairs applied:
  Position 1: I-PER -> B-PER (Orphan I -> B)
  Position 4: I-ORG -> B-ORG (Type mismatch -> new B)

The repair function transforms orphan I tags into B tags and creates new entity boundaries at type mismatches. These are common patterns in model output, where the model may predict the correct type but miss a boundary.

Multi-Label BIO Tagging

Standard BIO tagging assumes each token belongs to at most one entity. But some applications require overlapping annotations. Consider "Bank of America": the full span is an organization, while the nested token "America" is also a location. Nested and overlapping entities like this cannot be represented with a single flat tag per token.

Several approaches handle multi-label scenarios:

In[32]:
Code
# Approach 1: Multiple tag columns
multi_column_example = {
    "tokens": ["Bank", "of", "America", "CEO"],
    "ORG_tags": ["B-ORG", "I-ORG", "I-ORG", "O"],
    "LOC_tags": ["O", "O", "B-LOC", "O"],
    "PER_tags": ["O", "O", "O", "O"],
}

# Approach 2: Combined tags (for small label sets)
combined_tags_example = {
    "tokens": ["Bank", "of", "America", "CEO"],
    "tags": ["B-ORG", "I-ORG", "B-LOC+I-ORG", "O"],  # Combined label
}

# Approach 3: Separate passes per entity type
separate_passes = {
    "pass_1_ORG": ["B-ORG", "I-ORG", "I-ORG", "O"],
    "pass_2_LOC": ["O", "O", "B-LOC", "O"],
}
Out[33]:
Console
Multi-Label BIO Approaches:
=======================================================

1. Multiple Tag Columns (one per entity type):
-------------------------------------------------------
Token      ORG        LOC       
-------------------------------------------------------
Bank       B-ORG      O         
of         I-ORG      O         
America    I-ORG      B-LOC     
CEO        O          O         

2. Combined Tags (for overlapping spans):
-------------------------------------------------------
Bank       -> B-ORG
of         -> I-ORG
America    -> B-LOC+I-ORG
CEO        -> O

The multiple-column approach is cleanest but requires training separate models or a model with multiple output heads. Combined tags work for small label sets but explode combinatorially with many types. In practice, most NER systems use flat BIO tagging and handle overlaps through post-processing or by defining a type hierarchy.

Nested Entity Encoding

For nested entities like "New York University", where the full span is an ORG and the inner span "New York" is a LOC, specialized schemes exist:

In[34]:
Code
def encode_nested_entities(tokens, entities):
    """
    Encode nested entities using layered BIO tags.

    Each nesting level gets its own tag layer.
    """
    # Find maximum nesting depth
    max_depth = max(
        len([e for e in entities if e[0] <= i < e[1]])
        for i in range(len(tokens))
    )

    # Initialize layers
    layers = [["O"] * len(tokens) for _ in range(max_depth)]

    # Sort entities by span length (longest first) then by start position
    sorted_entities = sorted(entities, key=lambda x: (-(x[1] - x[0]), x[0]))

    # Assign each entity to a layer
    for start, end, etype in sorted_entities:
        # Find first layer where this span is available
        for layer in layers:
            if all(layer[i] == "O" for i in range(start, end)):
                layer[start] = f"B-{etype}"
                for i in range(start + 1, end):
                    layer[i] = f"I-{etype}"
                break

    return layers


# Nested entity example
nested_tokens = ["New", "York", "University", "is", "great"]
nested_entities = [
    (0, 3, "ORG"),  # New York University
    (0, 2, "LOC"),  # New York
]

nested_layers = encode_nested_entities(nested_tokens, nested_entities)
Out[35]:
Console
Nested Entity Encoding:
--------------------------------------------------
Token        Layer 1      Layer 2     
--------------------------------------------------
New          B-ORG        B-LOC       
York         I-ORG        I-LOC       
University   I-ORG        O           
is           O            O           
great        O            O           

Layer 1 captures the larger span (ORG), Layer 2 captures nested span (LOC)

This layered approach preserves all entity information but requires models that can predict multiple layers simultaneously. Modern nested NER systems often use span-based prediction instead, directly outputting all valid spans regardless of nesting.
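
To make the span-based alternative concrete, the sketch below enumerates every candidate span up to a maximum width and hands each one to a scoring function. The score_span callable is a hypothetical placeholder; in a real system it would be a trained classifier over span representations.

def enumerate_spans(tokens, max_width=4):
    """Generate all (start, end) candidate spans up to max_width tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_width, len(tokens)) + 1):
            yield start, end


def predict_spans(tokens, score_span, threshold=0.5):
    """Keep every candidate span whose best entity label clears the threshold.

    score_span(tokens, start, end) is assumed to return a dict mapping
    entity types to scores. Because spans are scored independently, nested
    and overlapping entities can both be returned without any tagging scheme.
    """
    predictions = []
    for start, end in enumerate_spans(tokens):
        scores = score_span(tokens, start, end)
        label, score = max(scores.items(), key=lambda kv: kv[1])
        if score >= threshold:
            predictions.append((start, end, label))
    return predictions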

BIO Utilities in Practice

Let's consolidate our functions into a reusable module and demonstrate end-to-end usage with a real NER library.

In[36]:
Code
class BIOConverter:
    """Utility class for BIO tagging operations."""

    @staticmethod
    def spans_to_bio(tokens, spans):
        """Convert span annotations to BIO tags."""
        tags = ["O"] * len(tokens)
        for start, end, entity_type in sorted(spans, key=lambda x: x[0]):
            if 0 <= start < end <= len(tokens):
                tags[start] = f"B-{entity_type}"
                for i in range(start + 1, end):
                    tags[i] = f"I-{entity_type}"
        return tags

    @staticmethod
    def bio_to_spans(tokens, tags):
        """Extract spans from BIO tags."""
        spans = []
        current = None

        for i, (token, tag) in enumerate(zip(tokens, tags)):
            if tag.startswith("B-"):
                if current:
                    spans.append((*current, i))
                current = (i, tag[2:])
            elif tag.startswith("I-"):
                if current is None or current[1] != tag[2:]:
                    if current:
                        spans.append((*current, i))
                    current = (i, tag[2:])
            else:
                if current:
                    spans.append((*current, i))
                    current = None

        if current:
            spans.append((*current, len(tokens)))

        return [(s, e, t, " ".join(tokens[s:e])) for s, t, e in spans]

    @staticmethod
    def validate(tags):
        """Check if BIO sequence is valid."""
        prev = "O"
        for tag in tags:
            if tag.startswith("I-"):
                if prev == "O":
                    return False
                if prev[0] in "BI" and prev[2:] != tag[2:]:
                    return False
            prev = tag
        return True


# Usage demonstration
demo_tokens = ["Apple", "CEO", "Tim", "Cook", "announced", "iPhone"]
demo_spans = [(0, 1, "ORG"), (2, 4, "PER"), (5, 6, "PRODUCT")]

converter = BIOConverter()
demo_tags = converter.spans_to_bio(demo_tokens, demo_spans)
is_valid = converter.validate(demo_tags)
recovered_spans = converter.bio_to_spans(demo_tokens, demo_tags)
Out[37]:
Console
BIO Converter Demonstration:
=======================================================

Input spans:
  [0:1] 'Apple' -> ORG
  [2:4] 'Tim Cook' -> PER
  [5:6] 'iPhone' -> PRODUCT

Generated BIO tags:
  Apple      -> B-ORG
  CEO        -> O
  Tim        -> B-PER
  Cook       -> I-PER
  announced  -> O
  iPhone     -> B-PRODUCT

Sequence valid: True

Recovered spans (round-trip):
  [0:1] 'Apple' -> ORG
  [2:4] 'Tim Cook' -> PER
  [5:6] 'iPhone' -> PRODUCT

Integration with spaCy

Real NER systems output entity spans that we can convert to BIO format for analysis or evaluation:

In[38]:
Code
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Microsoft announced that Satya Nadella will visit London next week."
doc = nlp(text)

# Extract tokens and entity spans
tokens = [token.text for token in doc]
spans = []

for ent in doc.ents:
    # Find token indices for this entity
    start_idx = None
    end_idx = None
    for i, token in enumerate(doc):
        if token.idx == ent.start_char:
            start_idx = i
        if token.idx + len(token.text) == ent.end_char:
            end_idx = i + 1

    if start_idx is not None and end_idx is not None:
        spans.append((start_idx, end_idx, ent.label_))

# Convert to BIO
bio_tags = BIOConverter.spans_to_bio(tokens, spans)
Out[39]:
Console
spaCy NER to BIO Conversion:
--------------------------------------------------
Text: Microsoft announced that Satya Nadella will visit London next week.

Token        BIO Tag     
--------------------------------------------------
Microsoft    B-ORG       
announced    O           
that         O           
Satya        B-PERSON    
Nadella      I-PERSON    
will         O           
visit        O           
London       B-GPE       
next         B-DATE      
week         I-DATE      
.            O           

Entities detected:
  Microsoft -> ORG
  Satya Nadella -> PERSON
  London -> GPE
  next week -> DATE

The BIO representation enables token-level evaluation metrics, comparison between different taggers, and training data preparation for sequence models.
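
One such evaluation is strict span-level precision, recall, and F1, where a predicted entity counts as correct only if its boundaries and type both match the gold annotation exactly. The sketch below computes these metrics from decoded (start, end, entity_type) triples, such as those produced by bio_to_spans.

def span_f1(gold_spans, pred_spans):
    """Strict span-level precision, recall, and F1.

    Both inputs are collections of (start, end, entity_type) triples; a
    prediction counts as correct only on an exact boundary-and-type match.
    """
    gold = set(gold_spans)
    pred = set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


gold = [(0, 2, "PER"), (3, 6, "LOC")]
pred = [(0, 2, "PER"), (4, 6, "LOC")]  # wrong start for the location
span_f1(gold, pred)  # (0.5, 0.5, 0.5)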

Limitations and Practical Considerations

BIO tagging is the dominant approach for sequence labeling, but it has limitations worth understanding.

The fundamental constraint is that standard BIO assumes non-overlapping entities. Each token receives exactly one tag, so nested or overlapping annotations cannot be represented directly. The workarounds we discussed, including multiple layers, combined tags, and separate passes, add complexity and may not suit all applications. For domains with extensive nesting like biomedical text where gene mentions overlap with protein mentions, span-based or graph-based approaches may be more appropriate.

Boundary precision is another challenge. Models often predict the correct entity type but miss the exact boundaries. For a phrase like "the New York Stock Exchange", annotation guidelines determine whether the entity begins at "the" or at "New", and a model that picks the wrong starting token gets the whole span wrong under strict span matching. BIO's token-level representation means every boundary error affects multiple labels. BIOES mitigates this slightly by making endpoints explicit, but the underlying challenge remains.

Long entities pose particular difficulties for sequence models. An entity spanning ten tokens requires the model to maintain consistent predictions across all ten positions. In BIO, a single mistake, predicting O instead of I in the middle, breaks the entity into two fragments. CRF layers and constrained decoding help by enforcing valid transitions, but very long entities remain error-prone.
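
As an illustration of constrained decoding, the sketch below runs a Viterbi search over per-token scores while masking invalid BIO transitions, so the decoder can never emit an I tag that does not continue a matching entity. The emission scores are assumed inputs (for example, log-probabilities from a classifier head); this is a minimal sketch rather than a full CRF.

import numpy as np


def constrained_viterbi(emissions, tagset):
    """Return the highest-scoring tag sequence that respects BIO constraints.

    emissions: (seq_len, num_tags) array of per-token scores, one column per
               tag in tagset (assumed input, e.g. log-probabilities).
    """
    def may_follow(prev, nxt):
        # I-X may only follow B-X or I-X; all other transitions are allowed
        if nxt.startswith("I-"):
            return prev[:2] in ("B-", "I-") and prev[2:] == nxt[2:]
        return True

    seq_len, num_tags = emissions.shape
    NEG = -1e9
    allowed = np.array([[may_follow(p, n) for n in tagset] for p in tagset])

    score = np.full((seq_len, num_tags), NEG)
    back = np.zeros((seq_len, num_tags), dtype=int)

    # A sequence cannot open with an I tag
    for j, tag in enumerate(tagset):
        if not tag.startswith("I-"):
            score[0, j] = emissions[0, j]

    for t in range(1, seq_len):
        for j in range(num_tags):
            prev_scores = np.where(allowed[:, j], score[t - 1], NEG)
            best = int(np.argmax(prev_scores))
            score[t, j] = prev_scores[best] + emissions[t, j]
            back[t, j] = best

    # Trace back the best path from the final position
    path = [int(np.argmax(score[-1]))]
    for t in range(seq_len - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return [tagset[j] for j in path]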

Despite these limitations, BIO tagging works well in practice. Its simplicity, universal tooling support, and compatibility with sequence models make it the right choice for most NER applications. Understanding when and why it fails helps you design better systems and interpret results more accurately.

Summary

BIO tagging provides a standardized format for representing entity boundaries in sequence labeling tasks. The key concepts from this chapter:

The BIO scheme uses three prefixes: B (beginning) marks the first token of an entity, I (inside) marks continuation tokens, and O (outside) marks non-entity tokens. This encoding unambiguously represents entity boundaries, handling adjacent same-type entities correctly.

Extended schemes like BIOES add explicit end markers (E) and single-token markers (S) for stronger supervision and easier validation. The choice between BIO and BIOES involves a tradeoff between simplicity and boundary precision.

Conversion utilities transform between span annotations and per-token BIO tags. Robust converters handle edge cases like single-token entities, adjacent entities, and sentence boundaries. Decoders must gracefully handle malformed sequences from model predictions.

Validation and repair catch common errors like orphan I tags and type mismatches. Repair strategies can automatically fix many issues, improving downstream usability.

Multi-label scenarios require extensions like multiple tag columns or layered encoding for nested entities. Standard BIO assumes non-overlapping annotations.

Key Function Parameters

When working with BIO tagging utilities, these parameters control the conversion and validation behavior:

  • tokens: List of string tokens representing the input sequence. Must align with span indices for correct conversion.
  • spans: List of tuples containing (start_idx, end_idx, entity_type). Uses Python's exclusive end convention where end_idx points to the position after the last token in the entity.
  • tags: List of BIO tag strings, one per token. Valid formats include B-TYPE, I-TYPE, and O.
  • entity_type: String identifier for the entity category (e.g., PER, LOC, ORG). Appears as the suffix in BIO tags after the hyphen.

For BIOES conversion, two additional prefixes are used:

  • S-TYPE: Marks single-token entities that don't need B/I/E structure
  • E-TYPE: Marks the final token of multi-token entities

The next chapters apply BIO tagging to chunking and introduce the probabilistic models, Hidden Markov Models and Conditional Random Fields, that power production sequence labeling systems.
