Document Chunking: Optimizing RAG Retrieval Pipelines

Michael Brenndoerfer · May 22, 2024 · 46 min read

Master document chunking for RAG systems. Explore fixed-size, recursive, and semantic strategies to balance retrieval precision with context window limits.

Document Chunking

In the previous chapters, we built up the core RAG pipeline: retrieving relevant documents and using them to ground an LLM's responses. But we glossed over a critical question: what exactly constitutes a "document" in retrieval? A single PDF might contain hundreds of pages. A webpage might span thousands of words. If you embed an entire document as one vector, the resulting embedding must compress an enormous amount of information into a single point in vector space. Important details get diluted, and retrieval precision suffers.

Document chunking solves this problem by splitting large documents into smaller, self-contained pieces before embedding and indexing them. The choice of chunking strategy, chunk size, and overlap between chunks has a surprisingly large impact on RAG quality. A poorly chunked document can bury relevant information inside irrelevant context, while a well-chunked document surfaces precisely the passage you need.

This chapter explores the landscape of chunking strategies, from simple fixed-size splits to semantically aware approaches that respect the natural structure of text. We will implement each strategy, examine how chunk size affects retrieval, and build intuition for when to use each approach.

Why Chunking Matters

To understand why chunking is so important, consider what happens during the retrieval stage of RAG. As we discussed in the Dense Retrieval chapter, we encode both queries and documents into dense vectors and retrieve documents whose embeddings are closest to the query embedding. The quality of this retrieval depends directly on how well each embedding captures the meaning of its text. If the embedding is a faithful representation of a focused, coherent passage, it will match queries about that passage with high confidence. If the embedding is a blurred summary of many disparate topics, it becomes a mediocre match for all of them and a strong match for none.

Embedding models have two fundamental constraints that make chunking necessary:

  • Context window limits: Most embedding models have a maximum input length, typically 512 tokens for models in the BERT family and up to 8,192 tokens for newer models. Text beyond this limit is simply truncated and lost. If you feed a 10,000-token document into a 512-token embedding model, nearly 95% of the document is silently discarded: the resulting embedding reflects only the opening passage, leaving the vast majority of the document unrepresented in your index. The short sketch after this list makes the arithmetic concrete.
  • Information density: Even within the context window, longer texts produce embeddings that represent a blend of all topics covered. A 5,000-word article about climate change might discuss ocean temperatures, carbon emissions, policy proposals, and economic impacts. A single embedding for this article would be a vague average of all these topics, matching none of them precisely. Think of this like trying to describe an entire meal with a single adjective: "savory" might loosely apply to the soup, the steak, and the roasted vegetables, but it captures the distinctive character of none of them.
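
To make that truncation arithmetic concrete, here is a minimal sketch using tiktoken (installed in the setup cell below). The repeated sentence is just a synthetic stand-in for a long document, and the 512-token limit is an assumed model constraint.

import tiktoken

# Rough sketch: how much of a long document survives a 512-token input limit.
enc = tiktoken.get_encoding("cl100k_base")
document = "Ocean temperatures continue to rise as greenhouse gas emissions grow. " * 1200
tokens = enc.encode(document)

context_limit = 512  # assumed embedding-model limit
kept = min(len(tokens), context_limit)
print(f"Document length: {len(tokens)} tokens")
print(f"Represented in the embedding: {kept} tokens ({kept / len(tokens):.1%})")
print(f"Silently discarded: {(len(tokens) - kept) / len(tokens):.1%}")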

Chunking addresses both constraints simultaneously. By splitting documents into smaller pieces, each chunk fits within the model's context window and covers a focused topic. When you ask about ocean temperature trends, the chunk specifically discussing that topic will produce a much stronger similarity match than the full article's embedding would. The key insight is that retrieval quality is not just about having the right information somewhere in your index; it is about having that information represented in a way that makes it discoverable. Chunking is the mechanism that creates this discoverability. Each chunk becomes a distinct, findable unit in vector space, carrying a focused semantic signal that can resonate with a matching query.

The Chunking-Retrieval Connection

Chunking is not just a preprocessing step; it defines the fundamental unit of retrieval. When you chunk a document, you are deciding the granularity at which information can be found and returned. Too coarse, and relevant details are buried in noise. Too fine, and context is lost. This decision echoes throughout the entire pipeline: it shapes what the embedding model encodes, what the similarity search can find, and what the LLM ultimately sees in its context window.

Fixed-Size Chunking

The simplest chunking strategy splits text into pieces of a fixed number of characters or tokens, regardless of where sentences or paragraphs begin or end. Despite its simplicity, fixed-size chunking is surprisingly common in production RAG systems because it is fast, predictable, and easy to reason about. When you need to process millions of documents quickly and you want deterministic, reproducible behavior, fixed-size chunking is often the pragmatic choice. Its uniformity also simplifies downstream engineering: every chunk consumes roughly the same amount of storage, embedding compute, and retrieval bandwidth.

Character-Based Splitting

The most basic approach splits text every n characters. This produces chunks of uniform length but can cut words and sentences in the middle. The logic is straightforward: start at position 0, take the next n characters, advance the starting position, and repeat until the entire text is consumed. When overlap is specified, the starting position advances by fewer than n characters, causing adjacent chunks to share some trailing and leading text.

In[2]:
Code
!uv pip install tiktoken spacy sentence-transformers matplotlib numpy
!uv pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl

def chunk_by_characters(text, chunk_size=200, overlap=0):
    """Split text into fixed-size character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks

sample_text = (
    "The Amazon rainforest produces about 20 percent of the world's oxygen. "
    "It spans across nine countries in South America. The forest is home to "
    "approximately 10 percent of all species on Earth. Deforestation has reduced "
    "its area significantly over the past decades. Scientists estimate that about "
    "17 percent of the Amazon has been destroyed in the last 50 years. Conservation "
    "efforts are critical to preserving this vital ecosystem. Many indigenous "
    "communities depend on the rainforest for their livelihoods. The Amazon River, "
    "which flows through the forest, is the largest river by volume in the world."
)
In[3]:
Code
chunks = chunk_by_characters(sample_text, chunk_size=150)
Out[4]:
Console
Chunk 0: [150 chars] 'The Amazon rainforest produces about 20 percent of the world's oxygen. It spans across nine countries in South America. The forest is home to approxim'
Chunk 1: [150 chars] 'ately 10 percent of all species on Earth. Deforestation has reduced its area significantly over the past decades. Scientists estimate that about 17 pe'
Chunk 2: [150 chars] 'rcent of the Amazon has been destroyed in the last 50 years. Conservation efforts are critical to preserving this vital ecosystem. Many indigenous com'
Chunk 3: [150 chars] 'munities depend on the rainforest for their livelihoods. The Amazon River, which flows through the forest, is the largest river by volume in the world'
Chunk 4: [1 chars] '.'

Notice how chunks cut through words and sentences without regard for meaning. Chunk 1 starts in the middle of a word, making it difficult for an embedding model to capture the intended meaning. The phrase "approximately 10 percent of all species" is severed into "approxim" at the end of one chunk and "ately 10 percent of all species" at the beginning of the next, rendering both fragments less semantically coherent than the original sentence. This is the fundamental weakness of character-based splitting: it optimizes for uniform size at the expense of coherence. The embedding model must then try to extract meaning from text that may begin or end mid-thought, producing embeddings that are noisier and less representative of any single idea.

Token-Based Splitting

A better variant splits by token count rather than character count. Since embedding models operate on tokens, not raw characters, this ensures each chunk uses the model's capacity efficiently. A character-based chunk of 200 characters might translate to anywhere from 30 to 60 tokens depending on word length and vocabulary, creating unpredictable utilization of the embedding model's context window. Token-based splitting eliminates this variability by directly controlling the unit that matters. We can use the tiktoken library, which implements the tokenizers used by OpenAI models, to perform this conversion accurately.

In[5]:
Code
import tiktoken


def chunk_by_tokens(
    text, chunk_size=50, overlap=0, encoding_name="cl100k_base"
):
    """Split text into chunks of a fixed number of tokens."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)

    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk_tokens = tokens[start:end]
        chunk_text = enc.decode(chunk_tokens)
        chunks.append(chunk_text)
        start += chunk_size - overlap
    return chunks
In[6]:
Code
token_chunks = chunk_by_tokens(sample_text, chunk_size=40, overlap=0)
Out[7]:
Console
Chunk 0: [40 tokens] 'The Amazon rainforest produces about 20 percent of the world's oxygen. It spans across nine countries in South America. The forest is home to approximately 10 percent of all species on Earth. Def'
Chunk 1: [40 tokens] 'orestation has reduced its area significantly over the past decades. Scientists estimate that about 17 percent of the Amazon has been destroyed in the last 50 years. Conservation efforts are critical to preserving this vital ecosystem'
Chunk 2: [34 tokens] '. Many indigenous communities depend on the rainforest for their livelihoods. The Amazon River, which flows through the forest, is the largest river by volume in the world.'
Out[8]:
Visualization
Distribution of character lengths for 50-token chunks. The variability (from approximately 150 to 250 characters) demonstrates that fixed token counts do not guarantee fixed character lengths, as word complexity varies throughout the text.

Token-based splitting guarantees each chunk uses a predictable number of tokens, but it still does not respect sentence boundaries. A sentence split across two chunks loses coherence in both. The first chunk ends with an incomplete thought, and the second begins without the context established earlier in the sentence. For many retrieval scenarios, this partial-sentence problem degrades embedding quality enough to motivate the sentence-aware approaches we explore next.

Key Parameters

The key parameters for fixed-size chunking are:

  • chunk_size: The target size of each chunk (in characters or tokens). Smaller chunks are more precise, while larger chunks provide more context. This parameter directly controls the trade-off between retrieval granularity and the amount of information each chunk carries.
  • overlap: The number of units (characters or tokens) repeated between adjacent chunks to prevent information loss at boundaries. Overlap acts as a safety net, ensuring that content near a cut point is fully represented in at least one chunk. We explore this concept in depth in the Chunk Overlap section below.
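
To see how these two parameters interact, here is a quick sketch that reuses chunk_by_characters and sample_text from above and compares the boundary between the first two chunks with and without a 30-character overlap.

# Compare adjacent chunk boundaries with and without character overlap.
no_overlap = chunk_by_characters(sample_text, chunk_size=150, overlap=0)
with_overlap = chunk_by_characters(sample_text, chunk_size=150, overlap=30)

print("Without overlap:")
print("  end of chunk 0:   ...", no_overlap[0][-40:])
print("  start of chunk 1:", no_overlap[1][:40], "...")

print("With a 30-character overlap:")
print("  end of chunk 0:   ...", with_overlap[0][-40:])
print("  start of chunk 1:", with_overlap[1][:40], "...")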

Sentence-Based Chunking

A more linguistically motivated approach uses sentence boundaries as the atomic unit. As we covered in the Sentence Segmentation chapter, identifying sentence boundaries is itself a non-trivial task that must handle abbreviations, decimal numbers, and other edge cases. Here, we leverage spaCy's sentence segmentation for robust boundary detection.

The idea is simple but powerful: split the text into individual sentences first, then group consecutive sentences together into chunks until adding the next sentence would exceed the desired size limit. By treating sentences as indivisible building blocks, this approach guarantees that no chunk will ever contain a partial sentence. Every chunk begins at the start of a sentence and ends at the conclusion of one, preserving the grammatical and semantic completeness that embedding models rely on.

In[9]:
Code
import spacy

nlp = spacy.load("en_core_web_sm")


def chunk_by_sentences(text, max_chunk_size=200, overlap_sentences=0):
    """Group sentences into chunks that respect sentence boundaries."""
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]

    chunks = []
    current_chunk = []
    current_length = 0

    for sent in sentences:
        sent_length = len(sent)
        # If adding this sentence would exceed the limit, finalize current chunk
        if current_chunk and current_length + sent_length + 1 > max_chunk_size:
            chunks.append(" ".join(current_chunk))
            # Keep the last `overlap_sentences` for context continuity
            if (
                overlap_sentences > 0
                and len(current_chunk) >= overlap_sentences
            ):
                current_chunk = current_chunk[-overlap_sentences:]
                current_length = (
                    sum(len(s) for s in current_chunk) + len(current_chunk) - 1
                )
            else:
                current_chunk = []
                current_length = 0
        current_chunk.append(sent)
        current_length += sent_length + (1 if current_length > 0 else 0)

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks
In[10]:
Code
sent_chunks = chunk_by_sentences(
    sample_text, max_chunk_size=200, overlap_sentences=0
)
Out[11]:
Console
Chunk 0: [191 chars]
  'The Amazon rainforest produces about 20 percent of the world's oxygen. It spans across nine countries in South America. The forest is home to approximately 10 percent of all species on Earth.'

Chunk 1: [168 chars]
  'Deforestation has reduced its area significantly over the past decades. Scientists estimate that about 17 percent of the Amazon has been destroyed in the last 50 years.'

Chunk 2: [145 chars]
  'Conservation efforts are critical to preserving this vital ecosystem. Many indigenous communities depend on the rainforest for their livelihoods.'

Chunk 3: [94 chars]
  'The Amazon River, which flows through the forest, is the largest river by volume in the world.'

Each chunk now contains complete sentences. The text within each chunk is coherent, making it much easier for an embedding model to capture its meaning. When the model processes a sentence-aligned chunk, it encounters grammatically well-formed text with clear subjects, verbs, and objects, exactly the kind of input it was trained on. This alignment between the structure of training data and the structure of your chunks tends to produce higher-quality embeddings. The chunk sizes are no longer perfectly uniform, since they depend on sentence lengths, but this is a worthwhile trade-off. A chunk that is 180 characters of coherent prose will almost always produce a better embedding than a chunk that is exactly 200 characters but begins with the fragment "tion efforts are critical to preserving."

Key Parameters

The key parameters for sentence-based chunking are:

  • max_chunk_size: The maximum length (in characters or tokens) allowed for a chunk before a split is forced. Sentences are accumulated into the current chunk as long as this limit is not exceeded, so actual chunk sizes will vary depending on where the sentence boundaries fall relative to this ceiling.
  • overlap_sentences: The number of full sentences to repeat at the beginning of the next chunk to preserve context. Unlike character or token overlap, sentence overlap guarantees that the shared content is always a complete, meaningful unit. This is particularly valuable when consecutive sentences contain co-references ("This process..." or "These results...") that would be difficult to interpret without the preceding sentence.
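
The effect of overlap_sentences is easy to see by rerunning the earlier example with a one-sentence overlap. This is a quick sketch reusing chunk_by_sentences and sample_text from above; the last sentence of each chunk reappears at the start of the next.

# Repeat the final sentence of each chunk at the start of the following chunk.
overlap_sent_chunks = chunk_by_sentences(
    sample_text, max_chunk_size=200, overlap_sentences=1
)
for i, chunk in enumerate(overlap_sent_chunks):
    print(f"Chunk {i}: [{len(chunk)} chars] {chunk!r}")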

Recursive Chunking

Real documents have hierarchical structure: sections, subsections, paragraphs, sentences, and words. Recursive chunking exploits this structure by attempting to split at the most meaningful boundary possible. The strategy embodies a simple but effective principle: always prefer splitting at the highest-level boundary that still produces chunks within the size limit. It tries a sequence of separators from coarsest to finest. If splitting by double newlines (paragraph boundaries) produces chunks that are small enough, it stops there. If not, it falls back to single newlines, then sentences, then words, then characters.

This layered approach means the algorithm naturally adapts to different parts of a document. A section with short paragraphs will be split along paragraph boundaries, preserving each paragraph as a self-contained unit. A section containing a single long paragraph will be split at sentence boundaries within that paragraph. Only as a last resort, when individual sentences exceed the chunk size limit, does the algorithm fall back to word-level or character-level splitting.

This approach was popularized by LangChain's RecursiveCharacterTextSplitter and is one of the most widely used strategies in practice. Its popularity stems from the fact that it produces reasonable results across a wide variety of document types without requiring any domain-specific configuration beyond the choice of separators.

In[12]:
Code
def recursive_chunk(text, chunk_size=200, overlap=50, separators=None):
    """Recursively split text using a hierarchy of separators."""
    if separators is None:
        separators = ["\n\n", "\n", ". ", " ", ""]

    # Base case: text fits in one chunk
    if len(text) <= chunk_size:
        return [text]

    # Find the best separator that produces a split
    chosen_sep = separators[-1]  # fallback to character-level
    for sep in separators:
        if sep in text:
            chosen_sep = sep
            break

    # Split text using the chosen separator
    parts = text.split(chosen_sep) if chosen_sep else list(text)

    # Merge parts into chunks that respect the size limit
    chunks = []
    current_chunk = ""

    for part in parts:
        # Reconstruct with separator
        candidate = current_chunk + chosen_sep + part if current_chunk else part

        if len(candidate) <= chunk_size:
            current_chunk = candidate
        else:
            if current_chunk:
                chunks.append(current_chunk)
            # If a single part exceeds chunk_size, recurse with finer separators
            if len(part) > chunk_size:
                remaining_seps = separators[separators.index(chosen_sep) + 1 :]
                if remaining_seps:
                    sub_chunks = recursive_chunk(
                        part, chunk_size, overlap, remaining_seps
                    )
                    chunks.extend(sub_chunks)
                    current_chunk = ""
                else:
                    current_chunk = part
            else:
                current_chunk = part

    if current_chunk:
        chunks.append(current_chunk)

    return chunks

Let's test it on a structured document with paragraphs:

In[13]:
Code
structured_text = """Climate Change Overview

Global temperatures have risen by approximately 1.1 degrees Celsius since the pre-industrial era. This warming is primarily driven by greenhouse gas emissions from human activities, including burning fossil fuels and deforestation.

Impact on Ecosystems

Rising temperatures affect biodiversity across the planet. Coral reefs are bleaching at unprecedented rates. Arctic sea ice is declining, threatening polar bear habitats. Migration patterns of birds and marine species are shifting northward.

Mitigation Strategies

Renewable energy sources like solar and wind power are expanding rapidly. Many countries have committed to net-zero emissions targets by 2050. Carbon capture technology is being developed but remains expensive. Individual actions like reducing meat consumption and flying less also contribute to emission reductions."""
In[14]:
Code
rec_chunks = recursive_chunk(structured_text, chunk_size=250, overlap=0)
Out[15]:
Console
Chunk 0: [23 chars]
  'Climate Change Overview'

Chunk 1: [231 chars]
  'Global temperatures have risen by approximately 1.1 degrees Celsius since the pre-industrial era. This warming is primarily driven by greenhouse gas emissions from human activities, including burning fossil fuels and deforestation.'

Chunk 2: [20 chars]
  'Impact on Ecosystems'

Chunk 3: [241 chars]
  'Rising temperatures affect biodiversity across the planet. Coral reefs are bleaching at unprecedented rates. Arctic sea ice is declining, threatening polar bear habitats. Migration patterns of birds and marine species are shifting northward.'

Chunk 4: [21 chars]
  'Mitigation Strategies'

Chunk 5: [209 chars]
  'Renewable energy sources like solar and wind power are expanding rapidly. Many countries have committed to net-zero emissions targets by 2050. Carbon capture technology is being developed but remains expensive'

Chunk 6: [105 chars]
  'Individual actions like reducing meat consumption and flying less also contribute to emission reductions.'

The recursive approach respects paragraph boundaries when possible, falling back to finer-grained splits only when a paragraph exceeds the size limit. This preserves the document's logical structure in the chunks. The "Climate Change Overview" heading and its accompanying paragraph naturally form one chunk, while the "Impact on Ecosystems" section forms another. The algorithm arrives at these natural divisions not because it understands the content, but because the paragraph boundaries encoded by double newlines happen to align with topical boundaries, a pattern that holds reliably across well-structured documents.

Key Parameters

The key parameters for recursive chunking are:

  • chunk_size: The hard limit on chunk size. The algorithm attempts to keep chunks under this limit while using the largest possible separators. This means the algorithm always prefers a paragraph-level split over a sentence-level split, as long as both produce chunks that fit within this budget.
  • overlap: The number of characters to duplicate between adjacent chunks. In recursive chunking, overlap is typically applied as a post-processing pass, copying content from the end of one chunk into the beginning of the next; our simplified implementation accepts the parameter but does not apply it (see the sketch after this list).
  • separators: An ordered list of strings used to split the text (e.g., ["\n\n", "\n", " ", ""]). The algorithm tries them in sequence to find the best split point. The ordering is crucial: it encodes your preference for which types of boundaries should be preserved. By placing paragraph separators first, you ensure the algorithm only resorts to sentence or word breaks when paragraph-level splits are insufficient.
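
Because our simplified recursive_chunk does not apply the overlap itself, here is a minimal post-processing sketch showing one way to add character overlap after splitting. The helper add_character_overlap is hypothetical, not part of any library.

# Hypothetical helper: prepend the tail of each chunk to the start of the next.
def add_character_overlap(chunks, overlap=50):
    """Duplicate the last `overlap` characters of each chunk into the next one."""
    if not chunks:
        return chunks
    overlapped = [chunks[0]]
    for prev, curr in zip(chunks, chunks[1:]):
        overlapped.append(prev[-overlap:] + " " + curr)
    return overlapped

rec_overlapped = add_character_overlap(rec_chunks, overlap=50)
print(rec_overlapped[1][:120])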

Chunk Overlap

When you split a document into non-overlapping chunks, information that spans a chunk boundary gets split across two chunks. A key fact might have its context in one chunk and its conclusion in the next. Neither chunk alone captures the complete idea, and retrieval may miss it entirely. Consider a passage where one sentence introduces a concept and the very next sentence provides the critical detail you are searching for. If the boundary falls between these two sentences, the first chunk contains a setup without a payoff, and the second chunk contains an answer without its question. An embedding of either chunk alone may fail to match your query.

Chunk overlap solves this by including some text from the end of each chunk at the beginning of the next. If your chunk size is 200 tokens and your overlap is 50 tokens, then the last 50 tokens of chunk k appear again as the first 50 tokens of chunk k+1. This means any passage of 50 or fewer tokens near the boundary is guaranteed to appear in full within at least one chunk. The overlapping region acts as a sliding window that "catches" boundary-spanning information, ensuring it is fully embedded in at least one vector.

Out[16]:
Visualization
Diagram showing three text chunks with overlapping regions.
Schematic of the chunk overlap mechanism. Three sequential chunks (blue, orange, and green) cover the document, with red shaded regions highlighting duplicated text at boundaries. This repetition ensures that semantic context is preserved across splits.

The trade-off with overlap is straightforward:

  • More overlap means better boundary coverage, but it increases the total number of chunks and storage requirements. It also means the same text appears in multiple embeddings, which can inflate retrieval results with near-duplicate chunks. If your query matches the overlapping region, both adjacent chunks will score highly, potentially consuming two of your top-k retrieval slots with content that is largely redundant.
  • Less overlap reduces redundancy, but risks losing context at boundaries. With zero overlap, any information that depends on content from both sides of a boundary is effectively invisible to retrieval.

A common rule of thumb is to set overlap to 10-20% of the chunk size. For a 500-token chunk, an overlap of 50-100 tokens usually provides sufficient boundary coverage without excessive duplication. This range represents a pragmatic balance: enough overlap to catch most boundary-spanning passages, but not so much that your index becomes bloated with near-identical chunks. In practice, if you find that retrieval frequently returns two highly similar chunks from adjacent positions in the same document, you may be using too much overlap. If you find that relevant answers are being missed because critical context lands on the wrong side of a boundary, you may need more.

Let's see overlap in action with our token-based chunker:

In[17]:
Code
overlap_chunks = chunk_by_tokens(sample_text, chunk_size=40, overlap=10)
Out[18]:
Console
Chunk 0: 'The Amazon rainforest produces about 20 percent of the world's oxygen. It spans across nine countries in South America. The forest is home to approximately 10 percent of all species on Earth. Def'

Chunk 1: ' 10 percent of all species on Earth. Deforestation has reduced its area significantly over the past decades. Scientists estimate that about 17 percent of the Amazon has been destroyed in the last 50 years'

Chunk 2: ' Amazon has been destroyed in the last 50 years. Conservation efforts are critical to preserving this vital ecosystem. Many indigenous communities depend on the rainforest for their livelihoods. The Amazon River, which flows'

Chunk 3: ' their livelihoods. The Amazon River, which flows through the forest, is the largest river by volume in the world.'

Compare the end of each chunk with the beginning of the next. You should see shared text that bridges the boundary, ensuring that any information near the cut point is captured in at least one chunk's embedding. This shared text is the overlap in action: it duplicates a small window of content so that the transition zone between chunks is always fully represented somewhere in the index.

Chunk Size Selection

Choosing the right chunk size is one of the most impactful decisions in a RAG pipeline. It involves a fundamental trade-off between precision and context, and the optimal balance depends on the nature of your data, the types of questions you ask, and the capabilities of your embedding model. Getting chunk size right can mean the difference between a system that reliably surfaces the perfect passage and one that returns vaguely related blocks of text.

The Precision-Context Trade-off

The precision-context trade-off is the central tension in chunk size selection. At one extreme, you could make each chunk a single sentence, maximizing precision. At the other, you could embed entire documents, maximizing context. Neither extreme works well in practice, and understanding why illuminates the core challenge.

Small chunks (100-200 tokens) provide high retrieval precision. Each chunk covers a narrow topic, so when it matches a query, the match is likely relevant. The embedding vector represents a focused semantic concept, and the cosine similarity between query and chunk is a reliable indicator of topical alignment. However, small chunks may lack sufficient context for the LLM to generate a good answer. A chunk containing "The temperature was 42°C" is useless without knowing what system or location it refers to. The LLM receives a decontextualized fact and must either hallucinate the missing context or produce an unsatisfying response. Small chunks also increase the total number of chunks in your index, which raises storage and search costs.

Large chunks (500-1,000 tokens) provide rich context. The LLM receives enough surrounding information to understand and synthesize an answer. A paragraph that introduces a concept, provides evidence, and draws a conclusion gives the LLM everything it needs to produce a coherent response. But large chunks produce less precise embeddings because they cover more topics, and they may include irrelevant information that confuses the model. When an 800-token chunk discusses three related but distinct subtopics, its embedding becomes a centroid in the semantic space that lies between all three topics. A query that is specifically about one of those subtopics may find a better cosine similarity match with a more focused chunk from a completely different document. This dilution effect is the primary cost of large chunk sizes. The short sketch after the figure below makes this effect concrete.

Out[19]:
Visualization
The trade-off between retrieval precision and context completeness. Small chunks provide high precision but low context, while large chunks offer rich context at the cost of precision. The optimization zone represents the balance point where chunks are focused enough to match queries but large enough to answer them.
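
To see the dilution effect directly, here is a small sketch that embeds a focused chunk, a broader multi-topic chunk, and a narrow query with the all-MiniLM-L6-v2 SentenceTransformer model (the same model used for semantic chunking later in this chapter) and compares cosine similarities. The example sentences are invented for illustration, and the exact scores will vary by model.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

focused_chunk = (
    "Ocean surface temperatures have risen steadily since the 1970s, "
    "driven by heat absorbed from the warming atmosphere."
)
broad_chunk = (
    "Ocean surface temperatures have risen since the 1970s. "
    "Carbon pricing proposals remain politically contested. "
    "Insurers are repricing coastal real estate to reflect flood risk."
)
query = "How have ocean temperatures changed in recent decades?"


def cosine(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


q_vec, focused_vec, broad_vec = model.encode([query, focused_chunk, broad_chunk])
print(f"query vs focused chunk: {cosine(q_vec, focused_vec):.3f}")
print(f"query vs broad chunk:   {cosine(q_vec, broad_vec):.3f}")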

The optimal chunk size depends on several factors:

  • Query type: Factoid questions ("What is the boiling point of water?") benefit from small chunks, while analytical questions ("Explain the causes of the 2008 financial crisis") need larger chunks with more context. If your application primarily serves one type of query, you can tune chunk size accordingly. If it serves a mix, you may need to compromise or use multiple chunk sizes.
  • Document type: Technical documentation with short, self-contained sections works well with smaller chunks. Narrative text where ideas develop over paragraphs needs larger chunks. A software API reference, where each function's description is independent, naturally lends itself to small chunks. A legal brief, where arguments build across paragraphs, demands larger ones.
  • Embedding model capacity: As we will explore in the next chapter on Embedding Models, different models have different optimal input lengths. Some models are trained on short passages and perform best with 1-2 sentences, while others handle full paragraphs effectively. Feeding a paragraph-optimized model a single sentence wastes its capacity, while feeding a sentence-optimized model a full paragraph may degrade its embedding quality.

Empirical Comparison

Let's compare how different chunk sizes affect the content of chunks from the same document. By viewing the same text through different chunk size lenses, we can build intuition for how this parameter shapes the granularity and coherence of the resulting pieces.

In[20]:
Code
long_text = """
Machine learning is a subset of artificial intelligence that focuses on building systems
that learn from data. Unlike traditional programming where rules are explicitly coded,
machine learning algorithms identify patterns in data and make predictions based on
those patterns. The field has grown rapidly since the 2010s, driven by increases in
computational power and the availability of large datasets.

Supervised learning is the most common paradigm. In supervised learning, the algorithm
is trained on labeled examples, where each input is paired with its correct output.
Common supervised learning tasks include classification, where the goal is to assign
inputs to discrete categories, and regression, where the goal is to predict a continuous
value. Popular algorithms include linear regression, decision trees, and neural networks.

Unsupervised learning works with unlabeled data. The algorithm must discover structure
in the data without guidance. Clustering algorithms like K-means group similar data
points together. Dimensionality reduction techniques like PCA find compact representations
of high-dimensional data. Unsupervised learning is often used for exploratory data
analysis and feature extraction.

Reinforcement learning involves an agent that learns by interacting with an environment.
The agent takes actions and receives rewards or penalties based on the outcomes. Over time,
the agent learns a policy that maximizes cumulative reward. Reinforcement learning has
achieved notable successes in game playing, robotics, and most recently in aligning
large language models through RLHF, as discussed in earlier chapters of this book.
""".strip()
In[21]:
Code
size_results = {}
chunk_sizes = [100, 250, 500]

for size in chunk_sizes:
    size_results[size] = chunk_by_sentences(long_text, max_chunk_size=size)
Out[22]:
Console
=== Chunk size: 100 chars → 16 chunks ===
  Chunk 0: [110 chars] Machine learning is a subset of artificial intelligence that focuses on building...
  Chunk 1: [164 chars] Unlike traditional programming where rules are explicitly coded, machine learnin...
  Chunk 2: [127 chars] The field has grown rapidly since the 2010s, driven by increases in computationa...
  Chunk 3: [48 chars] Supervised learning is the most common paradigm....
  Chunk 4: [121 chars] In supervised learning, the algorithm is trained on labeled examples, where each...
  Chunk 5: [180 chars] Common supervised learning tasks include classification, where the goal is to as...
  Chunk 6: [82 chars] Popular algorithms include linear regression, decision trees, and neural network...
  Chunk 7: [48 chars] Unsupervised learning works with unlabeled data....
  Chunk 8: [67 chars] The algorithm must discover structure in the data without guidance....
  Chunk 9: [70 chars] Clustering algorithms like K-means group similar data points together....
  Chunk 10: [99 chars] Dimensionality reduction techniques like PCA find compact representations of hig...
  Chunk 11: [89 chars] Unsupervised learning is often used for exploratory data analysis and feature ex...
  Chunk 12: [88 chars] Reinforcement learning involves an agent that learns by interacting with an envi...
  Chunk 13: [80 chars] The agent takes actions and receives rewards or penalties based on the outcomes....
  Chunk 14: [70 chars] Over time, the agent learns a policy that maximizes cumulative reward....
  Chunk 15: [193 chars] Reinforcement learning has achieved notable successes in game playing, robotics,...

=== Chunk size: 250 chars → 10 chunks ===
  Chunk 0: [110 chars] Machine learning is a subset of artificial intelligence that focuses on building...
  Chunk 1: [164 chars] Unlike traditional programming where rules are explicitly coded, machine learnin...
  Chunk 2: [176 chars] The field has grown rapidly since the 2010s, driven by increases in computationa...
  Chunk 3: [121 chars] In supervised learning, the algorithm is trained on labeled examples, where each...
  Chunk 4: [180 chars] Common supervised learning tasks include classification, where the goal is to as...
  Chunk 5: [199 chars] Popular algorithms include linear regression, decision trees, and neural network...
  Chunk 6: [170 chars] Clustering algorithms like K-means group similar data points together. Dimension...
  Chunk 7: [178 chars] Unsupervised learning is often used for exploratory data analysis and feature ex...
  Chunk 8: [151 chars] The agent takes actions and receives rewards or penalties based on the outcomes....
  Chunk 9: [193 chars] Reinforcement learning has achieved notable successes in game playing, robotics,...

=== Chunk size: 500 chars → 4 chunks ===
  Chunk 0: [452 chars] Machine learning is a subset of artificial intelligence that focuses on building...
  Chunk 1: [434 chars] In supervised learning, the algorithm is trained on labeled examples, where each...
  Chunk 2: [498 chars] The algorithm must discover structure in the data without guidance. Clustering a...
  Chunk 3: [264 chars] Over time, the agent learns a policy that maximizes cumulative reward. Reinforce...

Smaller chunk sizes produce more chunks, each covering a narrower topic. With a 100-character limit, individual sentences or pairs of short sentences become the unit of retrieval, giving the system laser-like precision but very little surrounding context. Larger chunk sizes produce fewer chunks that span multiple topics: a 500-character chunk might combine the definition of supervised learning with examples of specific algorithms, creating a richer but less focused unit. There is no universally correct size; the right choice depends on your use case and should ideally be determined through evaluation, which we will cover in the RAG Evaluation chapter.

Structural Chunking

Many real-world documents have explicit structural markers: Markdown headers, HTML tags, LaTeX section commands, or table of contents entries. These markers are not arbitrary formatting; they represent deliberate authorial decisions about how information is organized. A section header signals a topical boundary, and the content beneath it forms a coherent unit that the author intended to be read together. Structural chunking uses these markers to split documents along their natural boundaries, ensuring each chunk corresponds to a coherent section or subsection.

This approach is particularly powerful for well-structured documents because it leverages organizational signals that other chunking methods ignore. A fixed-size chunker treats a Markdown header as just another line of text, potentially grouping it with the tail end of the previous section. A structural chunker recognizes it as a boundary, keeping each section intact and associated with its heading.

In[23]:
Code
import re


def chunk_by_markdown_headers(text, max_chunk_size=500):
    """Split a Markdown document by headers, preserving hierarchy."""
    # Split on lines that start with one or more # characters
    header_pattern = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)

    sections = []
    last_end = 0
    headers_stack = []

    for match in header_pattern.finditer(text):
        # Save content before this header
        if last_end < match.start():
            content = text[last_end : match.start()].strip()
            if content and headers_stack:
                sections.append(
                    {"header": " > ".join(headers_stack), "content": content}
                )

        level = len(match.group(1))
        title = match.group(2)

        # Update header stack based on level
        while headers_stack and len(headers_stack) >= level:
            headers_stack.pop()
        headers_stack.append(title)

        last_end = match.end()

    # Capture remaining content
    remaining = text[last_end:].strip()
    if remaining and headers_stack:
        sections.append(
            {"header": " > ".join(headers_stack), "content": remaining}
        )

    # Combine header context with content into chunks
    chunks = []
    for section in sections:
        chunk_text = f"[{section['header']}]\n{section['content']}"
        if len(chunk_text) <= max_chunk_size:
            chunks.append(chunk_text)
        else:
            # Fall back to sentence-level splitting within the section
            sub_chunks = chunk_by_sentences(
                section["content"], max_chunk_size=max_chunk_size
            )
            for sc in sub_chunks:
                chunks.append(f"[{section['header']}]\n{sc}")

    return chunks
In[24]:
Code
markdown_doc = """# Machine Learning

Machine learning enables computers to learn from data without explicit programming.

## Supervised Learning

### Classification

Classification assigns inputs to discrete categories. Common algorithms include
logistic regression, support vector machines, and neural networks. The model learns
a decision boundary that separates different classes in the feature space.

### Regression

Regression predicts continuous values. Linear regression fits a straight line to
the data, while polynomial regression can capture nonlinear relationships. Neural
networks can learn arbitrarily complex regression functions.

## Unsupervised Learning

### Clustering

Clustering groups similar data points without labels. K-means is the most popular
clustering algorithm. It partitions data into K groups by minimizing within-cluster
variance. DBSCAN is an alternative that can find clusters of arbitrary shape.

### Dimensionality Reduction

PCA projects high-dimensional data onto its principal components. t-SNE and UMAP
create nonlinear 2D projections for visualization. Autoencoders learn compressed
representations through neural networks.
"""
In[25]:
Code
md_chunks = chunk_by_markdown_headers(markdown_doc, max_chunk_size=400)
Out[26]:
Console
Chunk 0:
[Machine Learning]
Machine learning enables computers to learn from data without explicit programming.
------------------------------
Chunk 1:
[Machine Learning > Supervised Learning > Classification]
Classification assigns inputs to discrete categories. Common algorithms include
logistic regression, support vector machines, and neural networks. The model learns
a decision boundary that separates different classes in the feature space.
------------------------------
Chunk 2:
[Machine Learning > Supervised Learning > Regression]
Regression predicts continuous values. Linear regression fits a straight line to
the data, while polynomial regression can capture nonlinear relationships. Neural
networks can learn arbitrarily complex regression functions.
------------------------------
Chunk 3:
[Machine Learning > Unsupervised Learning > Clustering]
Clustering groups similar data points without labels. K-means is the most popular
clustering algorithm. It partitions data into K groups by minimizing within-cluster
variance. DBSCAN is an alternative that can find clusters of arbitrary shape.
------------------------------
Chunk 4:
[Machine Learning > Unsupervised Learning > Dimensionality Reduction]
PCA projects high-dimensional data onto its principal components. t-SNE and UMAP
create nonlinear 2D projections for visualization. Autoencoders learn compressed
representations through neural networks.
------------------------------

Notice how each chunk includes its header context (e.g., [Supervised Learning > Classification]). This metadata helps both the embedding model and the LLM understand the chunk's position within the document's hierarchy. A chunk about "Classification" without the context that it falls under "Supervised Learning" within a "Machine Learning" document would be less informative. The header path acts as a breadcrumb trail that disambiguates the content: "Classification" in a machine learning document means something very different from "Classification" in a library science document. By prepending this contextual path, the resulting embedding encodes not just what the chunk says, but where it sits within the larger knowledge structure.

Key Parameters

The key parameters for structural chunking are:

  • max_chunk_size: The maximum size for the content of each structural section. If a section exceeds this, it is further split using a fallback strategy, typically sentence-based chunking within the section. This hybrid behavior is important because some documents contain sections that vary enormously in length: a brief introductory section might be a single sentence, while a detailed methodology section might span several pages. The fallback ensures that even oversized sections are chunked coherently, with each sub-chunk still carrying the parent section's header context.
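
A quick sketch of this fallback in action: shrinking max_chunk_size on the Markdown example above forces the longer sections to be re-split by sentence, with every sub-chunk keeping its header path.

# Force the sentence-level fallback by using a smaller size limit.
small_md_chunks = chunk_by_markdown_headers(markdown_doc, max_chunk_size=150)
for i, chunk in enumerate(small_md_chunks):
    header, _, body = chunk.partition("\n")
    print(f"Chunk {i}: {header} [{len(body)} chars]")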

Semantic Chunking

All the strategies we have discussed so far use surface-level cues: character counts, sentence boundaries, structural markers. None of them consider what the text actually means. A paragraph break might occur in the middle of a sustained argument, or two paragraphs separated by a heading might actually discuss the same topic from different angles. Surface-level chunking strategies are blind to these semantic realities. Semantic chunking addresses this by using embeddings to detect where topics shift within a document, and placing chunk boundaries at those transition points.

The core idea is elegant: if two consecutive sentences have similar embeddings, they likely discuss the same topic and should be in the same chunk. When embeddings diverge significantly between adjacent sentences, that point likely marks a topic shift and is a natural place to split. Rather than relying on formatting conventions that may or may not align with topical structure, semantic chunking directly measures the conceptual continuity of the text and uses that measurement to determine where to place boundaries.

Algorithm

The semantic chunking algorithm proceeds in four steps, each building naturally on the previous one. The first step decomposes the text into its atomic units. The second step captures the meaning of each unit. The third step measures how that meaning flows from one unit to the next. And the fourth step identifies the points where that flow is disrupted, signaling a topic transition.

  1. Segment text into sentences using standard sentence segmentation. Sentences serve as the finest-grained unit we consider, since splitting within a sentence would break grammatical coherence.
  2. Embed each sentence using a sentence embedding model. Each sentence is mapped to a dense vector that captures its semantic content. This gives us a sequence of vectors, one per sentence, representing the "meaning trajectory" of the document.
  3. Compute similarity between consecutive sentences using cosine similarity (as we discussed in the Dense Retrieval chapter). For each pair of adjacent sentences, we measure how closely related they are. A high similarity score means the two sentences discuss related content. A low score means the topic is shifting.
  4. Identify breakpoints where similarity drops below a threshold, or where the largest similarity drops occur. These breakpoints become the chunk boundaries. The text between consecutive breakpoints forms a single chunk, containing all the sentences that belong to one coherent topic.
In[27]:
Code
import numpy as np
from sentence_transformers import SentenceTransformer


def semantic_chunk(
    text,
    model_name="all-MiniLM-L6-v2",
    threshold_percentile=25,
    min_chunk_size=2,
):
    """Split text into chunks based on semantic similarity between sentences."""
    # Step 1: Segment into sentences
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]

    if len(sentences) <= min_chunk_size:
        return [text], sentences, []

    # Step 2: Embed each sentence
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)

    # Step 3: Compute cosine similarity between consecutive sentences
    similarities = []
    for i in range(len(embeddings) - 1):
        sim = np.dot(embeddings[i], embeddings[i + 1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1])
        )
        similarities.append(sim)

    # Step 4: Find breakpoints where similarity drops below threshold
    threshold = np.percentile(similarities, threshold_percentile)
    breakpoints = [
        i + 1 for i, sim in enumerate(similarities) if sim < threshold
    ]

    # Build chunks from breakpoints
    chunks = []
    start = 0
    for bp in breakpoints:
        chunk_sentences = sentences[start:bp]
        if len(chunk_sentences) >= min_chunk_size:
            chunks.append(" ".join(chunk_sentences))
            start = bp

    # Add remaining sentences
    if start < len(sentences):
        remaining = " ".join(sentences[start:])
        if chunks and len(sentences[start:]) < min_chunk_size:
            chunks[-1] += " " + remaining
        else:
            chunks.append(remaining)

    return chunks, sentences, similarities

Let's apply semantic chunking to a document that covers multiple distinct topics:

In[28]:
Code
multi_topic_text = """
The solar system consists of eight planets orbiting the Sun. Mercury is the closest
planet to the Sun and has no atmosphere. Venus is the hottest planet due to its thick
carbon dioxide atmosphere. Earth is the only known planet to support life.

Photosynthesis is the process by which plants convert sunlight into energy. Chlorophyll
in plant cells absorbs light energy. The process produces oxygen as a byproduct.
Plants use carbon dioxide and water as inputs for photosynthesis.

The stock market experienced significant volatility in 2023. Interest rate hikes by
central banks affected investor sentiment. Technology stocks showed strong recovery
in the second half of the year. Cryptocurrency markets also saw renewed interest from
institutional investors.

Returning to astronomy, the James Webb Space Telescope has revealed new details about
distant galaxies. Its infrared capabilities allow it to see through cosmic dust.
Scientists have discovered exoplanets in habitable zones using data from the telescope.
""".strip()
In[29]:
Code
result = semantic_chunk(multi_topic_text, threshold_percentile=30)
sem_chunks, sentences, similarities = result
Out[30]:
Visualization
Line chart of sentence-to-sentence cosine similarity with dips at topic transitions.
Cosine similarity scores between consecutive sentences in a multi-topic document. Deep troughs in the similarity curve (below the dashed red threshold) indicate topic transitions, which the semantic chunker uses as boundaries to split the text.
Out[31]:
Console
Number of semantic chunks: 5

Chunk 0: 'The solar system consists of eight planets orbiting the Sun. Mercury is the closest planet to the Sun and has no atmosph...'

Chunk 1: 'Photosynthesis is the process by which plants convert sunlight into energy. Chlorophyll in plant cells absorbs light ene...'

Chunk 2: 'The stock market experienced significant volatility in 2023. Interest rate hikes by central banks affected investor sent...'

Chunk 3: 'Technology stocks showed strong recovery in the second half of the year. Cryptocurrency markets also saw renewed interes...'

Chunk 4: 'Returning to astronomy, the James Webb Space Telescope has revealed new details about distant galaxies. Its infrared cap...'

The semantic chunker detects topic transitions, placing boundaries between the astronomy, biology, and finance sections. Unlike fixed-size chunking, it produces chunks of varying length, each covering a coherent topic. Notice in particular that the document deliberately returns to astronomy in its final paragraph after the finance section. A structural chunker relying on paragraph breaks alone would separate all four paragraphs into distinct chunks without recognizing the thematic relationship between the first and last paragraphs. The semantic chunker, by contrast, detects the low similarity at the topic transitions, regardless of how the document is formatted.

Choosing the Threshold

The threshold parameter controls how aggressively the chunker splits. It determines the sensitivity of the algorithm to topic transitions: a lenient threshold ignores all but the most dramatic shifts, while a strict threshold reacts to even subtle changes in subject matter. A lower percentile (e.g., 10th) means only the most dramatic topic shifts trigger splits, producing larger, fewer chunks. A higher percentile (e.g., 50th) creates more, smaller chunks at every moderate topic change.

In practice, you can tune this using several approaches:

  • Percentile-based thresholds (as above): Split at the p-th percentile of similarity scores. This adapts to the document's overall coherence. A document where all consecutive sentences are highly related will have a higher threshold than one with naturally varied topics. This adaptivity is the key advantage: the algorithm calibrates itself to the document's baseline level of topical coherence, only splitting where coherence drops significantly relative to the document's own norm.
  • Absolute thresholds: Split whenever similarity drops below a fixed value (e.g., 0.3). This is simpler but does not adapt to documents with generally high or low inter-sentence similarity. A technical paper with consistently high inter-sentence similarity might never trigger splits with an absolute threshold of 0.3, while a casual blog post with naturally varied topics might be split into tiny fragments.
  • Standard deviation-based: Split when similarity drops more than one standard deviation below the mean. This captures statistically significant topic shifts. The logic mirrors outlier detection: if most sentence pairs have similarity around 0.7 with a standard deviation of 0.1, a pair with similarity 0.5 represents a meaningful departure from the norm and likely signals a genuine topic transition.
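
All three approaches can be expressed as a few lines over the similarities array returned by semantic_chunk. The sketch below simply compares the breakpoints each rule would select; the 0.3 cutoff and the one-standard-deviation rule are illustrative assumptions rather than recommended values.

sims = np.array(similarities)

# Percentile-based: split below the document's own 30th-percentile similarity.
percentile_bps = np.where(sims < np.percentile(sims, 30))[0] + 1

# Absolute: split wherever similarity drops below a fixed cutoff (assumed 0.3).
absolute_bps = np.where(sims < 0.3)[0] + 1

# Standard deviation-based: split more than one standard deviation below the mean.
std_bps = np.where(sims < sims.mean() - sims.std())[0] + 1

print("Percentile breakpoints:", percentile_bps.tolist())
print("Absolute breakpoints:  ", absolute_bps.tolist())
print("Std-dev breakpoints:   ", std_bps.tolist())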

Key Parameters

The key parameters for the semantic chunking implementation are:

  • threshold_percentile: Controls the sensitivity of split detection. Lower values (e.g., 10) trigger fewer splits, producing large chunks that may span loosely related subtopics. Higher values (e.g., 50) trigger more splits, producing smaller chunks that isolate finer-grained topics at the risk of fragmenting coherent discussions.
  • min_chunk_size: Minimum number of sentences per chunk. Prevents creating tiny fragments from transient topic shifts. Without this constraint, a single transitional sentence that happens to differ from both its neighbors could be isolated into its own chunk, producing a fragment too small to be useful for retrieval or generation.
  • model_name: The SentenceTransformer model used for embedding. Models with better semantic understanding produce more accurate boundaries. The choice of model matters because the similarity scores that drive splitting are only as good as the embeddings that produce them. A model with weak topical discrimination will produce noisy similarity curves that lead to poorly placed boundaries.

Comparing Chunking Strategies

Let's compare all our strategies on the same document to see how they differ in practice. Applying each method to identical input text makes the differences concrete and allows us to observe how each strategy's assumptions about text structure manifest in the resulting chunks.

In[32]:
Code
comparison_text = structured_text  # Reuse the climate change document

# Unpack tuple from semantic_chunk
sem_chunks, _, _ = semantic_chunk(comparison_text, threshold_percentile=30)

results = {
    "Fixed (200 chars)": chunk_by_characters(comparison_text, chunk_size=200),
    "Token (50 tokens)": chunk_by_tokens(comparison_text, chunk_size=50),
    "Sentence (200 chars)": chunk_by_sentences(
        comparison_text, max_chunk_size=200
    ),
    "Recursive (250 chars)": recursive_chunk(comparison_text, chunk_size=250),
    "Semantic": sem_chunks,
}
Out[33]:
Visualization
Bar chart comparing chunk count and size variability across strategies.
Number of chunks generated by each strategy on the climate change document. Recursive and sentence-based splitting produce the most chunks here, while token-based and semantic chunking yield fewer, larger segments.
Average chunk size and variability by strategy. Character-based splitting shows zero variance, whereas token-based and linguistic methods exhibit size variation to accommodate content structure.
Out[34]:
Console
Fixed (200 chars):
  Chunks: 5, Avg size: 172, Std: 55 chars
  First chunk preview: 'Climate Change Overview

Global temperatures have risen by a...
'
Token (50 tokens):
  Chunks: 3, Avg size: 287, Std: 5 chars
  First chunk preview: 'Climate Change Overview

Global temperatures have risen by a...
'
Sentence (200 chars):
  Chunks: 6, Avg size: 142, Std: 29 chars
  First chunk preview: 'Climate Change Overview

Global temperatures have risen by a...
'
Recursive (250 chars):
  Chunks: 7, Avg size: 121, Std: 96 chars
  First chunk preview: 'Climate Change Overview...
'
Semantic:
  Chunks: 3, Avg size: 286, Std: 74 chars
  First chunk preview: 'Climate Change Overview

Global temperatures have risen by a...
'

The results confirm our expectations. Fixed-size strategies produce uniform chunks but cut through text arbitrarily, while sentence-based and recursive strategies produce chunks of varying size that preserve linguistic coherence. The standard deviation of chunk sizes tells an important story: fixed-size methods keep every chunk at its target length by construction (only the final remainder falls short), while linguistically aware methods trade that uniformity for meaningfulness. In practice, the slight unpredictability of chunk sizes is a small price to pay for chunks that an embedding model can actually make sense of.

Metadata Enrichment

Raw chunks lose their document context. A chunk about "Section 3.2: Results" is much more useful to an LLM if it knows this section comes from "Annual Report 2023" by "Acme Corp." Enriching chunks with metadata improves both retrieval accuracy and the quality of generated answers. Metadata transforms a chunk from an anonymous fragment of text into a situated piece of knowledge with provenance, position, and context.

Common metadata to attach to each chunk includes:

  • Source document: filename, URL, or document ID
  • Position: chunk index, page number, or section path
  • Structural context: parent headers, preceding and following chunk IDs
  • Document metadata: author, date, document type, language
In[35]:
Code
from dataclasses import dataclass, field
import tiktoken


@dataclass
class Chunk:
    text: str
    index: int
    source: str
    section: str = ""
    start_char: int = 0
    end_char: int = 0
    metadata: dict = field(default_factory=dict)

    @property
    def token_count(self):
        # Token count under the cl100k_base encoding used by recent OpenAI models
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(self.text))


def create_enriched_chunks(text, source, chunk_size=200):
    """Create chunks with metadata."""
    raw_chunks = chunk_by_sentences(text, max_chunk_size=chunk_size)

    enriched = []
    char_offset = 0
    for i, chunk_text in enumerate(raw_chunks):
        # Locate each chunk in the original text to record its character span;
        # fall back to a running offset if the exact text is not found verbatim
        start = text.find(chunk_text, char_offset)
        end = (
            start + len(chunk_text)
            if start >= 0
            else char_offset + len(chunk_text)
        )

        chunk = Chunk(
            text=chunk_text,
            index=i,
            source=source,
            start_char=max(start, 0),
            end_char=end,
            metadata={
                "total_chunks": len(raw_chunks),
                "has_previous": i > 0,
                "has_next": i < len(raw_chunks) - 1,
            },
        )
        enriched.append(chunk)
        char_offset = end

    return enriched
In[36]:
Code
enriched = create_enriched_chunks(
    sample_text, source="amazon_facts.txt", chunk_size=200
)
Out[37]:
Console
Chunk 0/3:
  Source: amazon_facts.txt
  Tokens: 39
  Chars [0:191]
  Text: 'The Amazon rainforest produces about 20 percent of the world's oxygen. It spans ...
'
Chunk 1/3:
  Source: amazon_facts.txt
  Tokens: 32
  Chars [192:360]
  Text: 'Deforestation has reduced its area significantly over the past decades. Scientis...
'
Chunk 2/3:
  Source: amazon_facts.txt
  Tokens: 24
  Chars [361:506]
  Text: 'Conservation efforts are critical to preserving this vital ecosystem. Many indig...
'

This metadata becomes invaluable during the retrieval and generation stages. The chunk index lets you retrieve neighboring chunks for additional context: if a retrieved chunk does not contain quite enough information to answer a question, the system can automatically pull in the preceding or following chunks to expand the context window. The source attribution enables citation in the generated response, allowing the LLM to tell you exactly where its information came from. The character offsets make it possible to highlight the relevant passage in the original document, creating a seamless user experience that connects generated answers back to their source material. We will see how this metadata integrates with vector databases in the upcoming chapters on Vector Similarity Search and the HNSW Index.
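
For example, the chunk index together with the has_previous and has_next flags makes neighbor expansion straightforward. The helper below is a minimal sketch (expand_with_neighbors is a hypothetical name, not a library function); it simply slices the ordered list returned by create_enriched_chunks around a retrieved chunk.

def expand_with_neighbors(chunks, hit_index, window=1):
    """Return the retrieved chunk plus up to `window` neighbors on each side."""
    start = max(hit_index - window, 0)
    end = min(hit_index + window + 1, len(chunks))
    return chunks[start:end]


# Expand the middle chunk of the enriched Amazon document with its neighbors
context = expand_with_neighbors(enriched, hit_index=1)
print(" ".join(c.text for c in context))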

Special Document Types

Different document formats require specialized chunking approaches. Here are considerations for the most common types.

  • Code files: Chunk along function or class boundaries rather than by line count. A function split across two chunks will be incomprehensible in both. Parse the abstract syntax tree if possible to identify natural boundaries (see the sketch after this list).
  • Tables: Tables should generally be kept as single chunks, even if they exceed the normal chunk size. Splitting a table across chunks destroys its relational structure. Convert tables to a text representation (like Markdown or CSV format) that preserves row-column relationships.
  • Conversational data: Chat logs and dialogue transcripts should be chunked by conversational turns or topic shifts, not by arbitrary size. Each chunk should contain enough turns to understand the context of the conversation.
  • Legal and regulatory documents: These often have numbered sections and subsections with precise cross-references. Structural chunking that preserves section numbers and hierarchies is essential. A chunk referencing "as defined in Section 2.1(b)" is useless if you (or the LLM) cannot trace that reference.
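
To illustrate the first point, the sketch below splits Python source into one chunk per top-level function or class using the standard-library ast module. It is a simplified illustration: a production chunker would also keep module-level code, decorators, and nested definitions.

import ast


def chunk_python_source(source):
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # ast records 1-based start and end line numbers for each node
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks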

Limitations and Practical Considerations

Despite the variety of strategies available, document chunking remains more art than science. No single strategy works optimally across all document types, query patterns, and embedding models. The fundamental challenge is that chunking decisions must be made at indexing time, before you know what questions you will ask. A chunk boundary that perfectly separates two topics for one query may split the exact passage needed for another.

Semantic chunking, while conceptually appealing, introduces its own challenges. It requires running an embedding model over every sentence during indexing, which significantly increases preprocessing time and cost. The quality of the splits depends heavily on the embedding model's ability to capture topic coherence, and models trained primarily on sentence similarity may not always detect document-level topic transitions accurately. Furthermore, semantic chunking can produce highly variable chunk sizes: a long section on a single topic might produce a single enormous chunk, while a passage that quickly surveys several topics might be split into tiny fragments. This variability can be managed with minimum and maximum size constraints, but adding these constraints partially undermines the semantic purity of the approach.

Another practical limitation is the interaction between chunk size and the downstream LLM's context window. If you retrieve five chunks of 500 tokens each, you consume 2,500 tokens of the LLM's context for retrieved content alone. With larger chunk sizes or more retrieved results, you may exhaust the context window before including your query and system instructions. This creates a system-level constraint that must be considered holistically: chunk size, number of retrieved results, prompt template length, and LLM context window all interact.
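
A back-of-the-envelope budget check makes this interaction explicit. All numbers below are illustrative assumptions rather than recommendations.

context_window = 8_192        # LLM context limit in tokens (assumed)
chunk_size = 500              # tokens per retrieved chunk (assumed)
top_k = 5                     # number of retrieved chunks
prompt_overhead = 400         # system prompt and instructions (assumed)
query_tokens = 60             # user query (assumed)
reserved_for_answer = 1_024   # tokens left for generation (assumed)

used = chunk_size * top_k + prompt_overhead + query_tokens + reserved_for_answer
print(f"{used} of {context_window} tokens committed, {context_window - used} to spare")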

Finally, evaluation of chunking quality is inherently tied to end-to-end RAG performance. You cannot evaluate chunking in isolation because the same chunks might work well with one embedding model and poorly with another, or might retrieve perfectly but confuse a particular LLM. We will address this challenge systematically in the RAG Evaluation chapter later in this part.

Summary

Document chunking transforms large documents into retrieval-friendly pieces that each capture a focused topic. The key takeaways from this chapter are:

  • Fixed-size chunking (by characters or tokens) is simple and fast but ignores text structure, often cutting through sentences and ideas.
  • Sentence-based chunking uses linguistic boundaries to ensure each chunk contains complete sentences, producing more coherent embeddings at the cost of variable chunk sizes.
  • Recursive chunking respects document hierarchy by trying paragraph boundaries first, then falling back to finer-grained splits, combining structural awareness with size control.
  • Semantic chunking uses embeddings to detect topic shifts, placing boundaries where the text's meaning changes most dramatically.
  • Chunk overlap ensures information near boundaries appears in multiple chunks, reducing the risk of losing cross-boundary context.
  • Chunk size involves a precision-context trade-off: smaller chunks give more precise retrieval but less context, while larger chunks provide richer context but less focused embeddings. Typical values range from 200 to 1,000 tokens.
  • Metadata enrichment preserves document context (source, position, section hierarchy) that would otherwise be lost during chunking, enabling better retrieval and citation.

In the next chapter, we will examine the embedding models that convert these chunks into the dense vectors used for retrieval, completing the connection between chunking decisions and retrieval quality.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about document chunking strategies and their impact on retrieval.

