BM25: Complete Guide to the Search Algorithm Behind Elasticsearch

Michael Brenndoerfer · December 8, 2025 · 30 min read · 7,260 words

Learn BM25, the ranking algorithm powering modern search engines. Covers probabilistic foundations, IDF, term saturation, length normalization, BM25L/BM25+/BM25F variants, and Python implementation.

This article is part of the free-to-read Language AI Handbook.

BM25

BM25 (Best Matching 25) is the algorithm that powers search engines like Elasticsearch, Solr, and Lucene. It's been a standard in information retrieval since the 1990s and remains widely used today, even as neural approaches have emerged. Understanding BM25 gives you insight into what makes search actually work.

In this chapter, you'll learn where BM25 comes from (hint: probability theory), why it consistently outperforms TF-IDF, and how to implement it from scratch. We'll also explore its variants and see empirically how it stacks up against simpler approaches.

The Probabilistic Foundation

BM25 didn't emerge from nowhere. It's the result of decades of research in probabilistic information retrieval, a field that asks: given a query, what's the probability that a document is relevant?

The Binary Independence Model

The story begins with the Binary Independence Model (BIM), developed in the 1970s. BIM makes two key assumptions:

  1. Binary: Each term either appears in a document or it doesn't (we ignore term frequency)
  2. Independence: Terms occur independently of each other given the relevance of a document
Binary Independence Model

A probabilistic retrieval model that estimates document relevance by assuming terms occur independently. Documents are ranked by the odds ratio of relevance given the observed query terms.

Under BIM, we want to rank documents by $P(R|D,Q)$, the probability that document $D$ is relevant given query $Q$. Using Bayes' theorem and the independence assumption, this leads to ranking by:

$$\sum_{t \in Q \cap D} \log \frac{P(t|R) \cdot (1 - P(t|\bar{R}))}{P(t|\bar{R}) \cdot (1 - P(t|R))}$$

where:

  • $t$: a term in both the query and document
  • $Q \cap D$: the set of terms appearing in both query $Q$ and document $D$
  • $P(t|R)$: probability of term $t$ appearing in a relevant document
  • $P(t|\bar{R})$: probability of term $t$ appearing in a non-relevant document

The problem? We rarely have relevance judgments to estimate these probabilities. Robertson and Spärck Jones showed that with some simplifying assumptions, this reduces to something familiar: inverse document frequency.
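To see that reduction concretely, here's a small sketch. Assume we have no relevance information, so we fix $P(t|R) = 0.5$, and estimate $P(t|\bar{R})$ from the whole collection as roughly $n(t)/N$ (with 0.5 smoothing). Under those assumptions the relevance terms cancel, and the Robertson-Spärck Jones weight collapses to exactly the IDF formulation BM25 uses (introduced below):

import math

def rsj_weight(N, n, p_rel=0.5):
    """Robertson-Sparck Jones term weight under simplifying assumptions.

    Assumes P(t|R) = p_rel (no relevance information available) and
    estimates P(t|not-R) from the whole collection as (n + 0.5) / (N + 1).
    """
    p_nonrel = (n + 0.5) / (N + 1)
    return math.log((p_rel * (1 - p_nonrel)) / (p_nonrel * (1 - p_rel)))

def bm25_idf(N, n):
    """The IDF formulation used by BM25 (derived in the next section)."""
    return math.log((N - n + 0.5) / (n + 0.5))

# With p_rel fixed at 0.5, the two weights coincide.
N = 10_000
for n in [1, 10, 100, 1_000, 5_000, 9_000]:
    print(f"n(t)={n:>5}: RSJ={rsj_weight(N, n):+.3f}  BM25 IDF={bm25_idf(N, n):+.3f}")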

From BIM to BM25

The Binary Independence Model ignores term frequency entirely, which is clearly a limitation. If a document mentions "machine learning" ten times, that should matter more than mentioning it once.

Robertson and Walker addressed this in the 1990s through a series of papers that culminated in BM25. The key insight was to incorporate term frequency in a principled way while maintaining the probabilistic foundation.

The derivation involves the 2-Poisson model, which assumes that term frequencies in relevant and non-relevant documents follow different Poisson distributions. Working through the math (which we'll spare you), this leads to a term frequency weighting that saturates, meaning additional occurrences of a term provide diminishing returns.

The BM25 Formula

The full BM25 formula looks intimidating at first glance, but every piece exists for a reason. Rather than presenting it all at once, let's build it up from the three core problems any good retrieval function must solve:

  1. Which terms matter? Not all words are equally informative. "The" appears everywhere; "quantum" is rare and specific.
  2. How much does repetition help? If a document mentions your query term once, that's good. But is mentioning it 100 times really 100 times better?
  3. How do we handle document length? A 10,000-word document naturally contains more vocabulary than a 100-word document. Should we penalize it for that?

BM25 addresses each of these problems with a specific component. Let's examine them one by one, then see how they combine into the complete formula.

Problem 1: Which Terms Matter? The IDF Component

Consider searching for "the cat sat on the mat." The word "the" appears in virtually every English document. Finding it tells you almost nothing. But "mat"? That's much more specific.

The Inverse Document Frequency (IDF) quantifies this intuition. A term that appears in many documents gets a low IDF; a rare term gets a high IDF. BM25 uses a specific IDF formulation derived from its probabilistic foundations:

$$\text{IDF}(t) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5}$$

where:

  • $N$: total number of documents in the collection
  • $n(t)$: number of documents containing term $t$
  • The $+0.5$ terms provide smoothing to avoid division by zero and extreme values

This formulation differs from the standard $\log(N/n(t))$ IDF you might have seen elsewhere. The $N - n(t)$ in the numerator comes directly from the probabilistic derivation, representing the number of documents that don't contain the term.

Here's the key insight: for very common terms appearing in more than half the documents, this IDF actually becomes negative. The formula effectively penalizes ubiquitous terms rather than just ignoring them. If a term appears in 80% of documents, matching on it might actually hurt your score slightly, pushing you to focus on more discriminative terms.

Let's visualize how IDF varies with document frequency:

In[2]:
import numpy as np

# Simulate a collection of 10,000 documents
N = 10000

# Document frequencies ranging from 1 to N
doc_freqs = np.arange(1, N + 1)

# Calculate BM25 IDF
bm25_idf = np.log((N - doc_freqs + 0.5) / (doc_freqs + 0.5))

# Standard IDF for comparison
standard_idf = np.log(N / doc_freqs)
Out[3]:
Visualization
Line plot comparing BM25 IDF and standard IDF curves, showing BM25 IDF going negative after 50% document frequency.
BM25 IDF compared to standard IDF across different document frequencies. BM25's IDF becomes negative for terms appearing in more than half the documents, actively penalizing ubiquitous terms. The crossover point at N/2 is where the probabilistic derivation says a term provides no discriminative value.

The plot reveals an important difference: standard IDF is always positive, merely approaching zero for common terms. BM25's IDF crosses zero at exactly 50% document frequency and becomes increasingly negative beyond that. This means stopwords like "the" or "a" that appear in nearly every document will have strongly negative IDF, effectively penalizing documents that match only on these terms.
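To make the crossover concrete, here's a quick numeric check of the IDF formula at a few document frequencies, reusing the same collection size of N = 10,000 as in the plot:

import math

# BM25 IDF is positive below 50% document frequency, zero at roughly N/2,
# and negative above it.
N = 10_000
for n in [100, 2_500, 5_000, 8_000]:
    idf = math.log((N - n + 0.5) / (n + 0.5))
    print(f"term in {n:>5} of {N} docs -> IDF = {idf:+.3f}")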

Problem 2: How Much Does Repetition Help? The Saturation Effect

If a document mentions "machine learning" once, that's evidence of relevance. What about ten times? A hundred times?

Raw term frequency treats each occurrence equally: ten mentions = ten times the score. But this creates problems. A document that repeats a term obsessively (keyword stuffing) would dominate results. More fundamentally, there's a diminishing returns effect: the first mention tells you the document is about the topic; additional mentions provide less and less new information.

BM25 captures this with a saturation function:

$$\frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1}$$

where $f(t, D)$ is the frequency of term $t$ in document $D$, and $k_1$ is a parameter controlling saturation speed. (The full formula includes length normalization in the denominator, which we'll add shortly.)

Term Frequency Saturation

The principle that additional occurrences of a query term in a document provide diminishing returns for relevance. The first few occurrences matter a lot; the hundredth occurrence adds almost nothing.

To understand why this formula creates saturation, consider what happens as $f(t, D)$ grows very large. The numerator grows linearly, but the denominator grows at the same rate. In the limit, the ratio approaches $k_1 + 1$, an asymptote that cannot be exceeded no matter how many times the term appears.
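A quick numeric check confirms the ceiling. With $k_1 = 1.2$, the saturation term climbs toward, but never exceeds, 2.2:

# The saturation term tf * (k1 + 1) / (tf + k1) approaches k1 + 1 = 2.2.
k1 = 1.2
for tf in [1, 2, 5, 20, 100, 10_000]:
    print(f"tf={tf:>6}: {tf * (k1 + 1) / (tf + k1):.4f}")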

The parameter $k_1$ controls how quickly you approach this ceiling:

  • $k_1 = 0$: Instant saturation. Any non-zero term frequency gives the same score (like the Binary Independence Model)
  • $k_1 \to \infty$: No saturation. Score is linear in term frequency (like raw TF-IDF)
  • $k_1 \approx 1.2$ to $2.0$: The empirically validated sweet spot for most collections

Let's visualize this saturation effect:

In[4]:
import numpy as np

# Term frequencies to plot
tf_values = np.arange(0, 21)

# Different k1 values
k1_values = [0.5, 1.2, 2.0, 5.0]

# Calculate saturation curves (without length normalization for clarity)
saturation_curves = {}
for k1 in k1_values:
    scores = (tf_values * (k1 + 1)) / (tf_values + k1)
    saturation_curves[k1] = scores
Out[5]:
Visualization
Line plot showing four saturation curves with different k1 values, all starting at zero and asymptotically approaching their maximum values.
BM25 saturation curves for different $k_1$ values. Lower $k_1$ causes faster saturation, meaning additional term occurrences contribute less to the score. The commonly used value of $k_1 = 1.2$ provides a good balance.

The visualization reveals the saturation behavior clearly. With $k_1 = 0.5$, the curve flattens quickly: by term frequency 5, you've captured most of the possible score. With $k_1 = 5.0$, the curve is nearly linear over this range, meaning repetition continues to help. The default value of $k_1 = 1.2$ provides a principled middle ground: the first few occurrences matter a lot, but the score doesn't keep climbing indefinitely.

Problem 3: How Do We Handle Document Length?

Here's a subtle but important issue. Imagine two documents about machine learning:

  • Document A: A focused 200-word abstract that mentions "neural network" twice
  • Document B: A comprehensive 10,000-word survey that mentions "neural network" twenty times

Which is more relevant to a query about neural networks? Raw term frequency would strongly favor Document B. But that's not necessarily right. Document B mentions the term more often simply because it's longer. The density of the term, its frequency relative to document length, might be a better signal.

BM25 addresses this through length normalization, which adjusts term frequency based on how a document's length compares to the collection average:

$$1 - b + b \cdot \frac{|D|}{\text{avgdl}}$$

where:

  • $b$: length normalization parameter (0 to 1)
  • $|D|$: length of document $D$ (number of terms)
  • $\text{avgdl}$: average document length across the collection

This expression appears in the denominator of the BM25 formula, multiplied by $k_1$. For a document exactly at average length, the expression equals 1 (no adjustment). For longer documents, it's greater than 1, which increases the denominator and reduces the score. For shorter documents, it's less than 1, which decreases the denominator and increases the score.
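A quick check of the factor at a few lengths makes this concrete (using the default $b = 0.75$ and an assumed average length of 100 terms):

# Length factor: 1.0 at average length, above 1 for longer docs, below 1 for shorter.
b, avgdl = 0.75, 100
for doc_len in [50, 100, 200]:
    print(f"doc_len={doc_len:>3}: factor = {1 - b + b * (doc_len / avgdl):.3f}")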

Length Normalization

A technique to prevent longer documents from having an unfair advantage in retrieval. BM25 normalizes term frequency by comparing document length to the collection average.

The parameter $b$ controls how aggressively we normalize:

  • $b = 0$: No length normalization. The expression always equals 1, and document length is ignored entirely.
  • $b = 1$: Full length normalization. A document twice the average length has its term frequencies effectively halved.
  • $b \approx 0.75$: The typical default, providing moderate normalization that accounts for length without completely eliminating its effect.

Let's see how length normalization affects scoring:

In[6]:
# Fixed term frequency
tf = 5
k1 = 1.2

# Document lengths as multiples of average
length_ratios = np.linspace(0.2, 3.0, 50)

# Different b values
b_values = [0.0, 0.25, 0.5, 0.75, 1.0]

# Calculate scores for each b
length_curves = {}
for b in b_values:
    normalization = 1 - b + b * length_ratios
    scores = (tf * (k1 + 1)) / (tf + k1 * normalization)
    length_curves[b] = scores
Out[7]:
Visualization
Line plot showing five curves demonstrating how document length affects BM25 score for different b parameter values.
Effect of length normalization parameter $b$ on BM25 scores. Higher $b$ values penalize longer documents more aggressively. With $b = 0$, document length has no effect on scoring.

The visualization confirms our intuition. When $b = 0$, document length doesn't matter at all: the corresponding curve is flat. With $b = 1$, a document that's three times average length gets significantly penalized. The default $b = 0.75$ provides a reasonable middle ground, acknowledging that longer documents might naturally contain more occurrences without completely ignoring the length difference.

Putting It All Together: The Complete Formula

Now we can assemble the complete BM25 formula. For each query term, we multiply three components:

  1. IDF: How informative is this term?
  2. Saturated TF: How often does it appear (with diminishing returns)?
  3. Length adjustment: Normalized by document length

The final formula sums these contributions across all query terms:

$$\text{BM25}(D, Q) = \sum_{t \in Q} \text{IDF}(t) \cdot \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

where:

  • $D$: the document being scored
  • $Q$: the query (set of query terms)
  • $t$: a term in the query
  • $\text{IDF}(t)$: inverse document frequency of term $t$
  • $f(t, D)$: frequency of term $t$ in document $D$
  • $k_1$: saturation parameter (typically 1.2 to 2.0)
  • $b$: length normalization parameter (typically 0.75)
  • $|D|$: length of document $D$ (number of terms)
  • $\text{avgdl}$: average document length in the collection

Each component addresses one of our original problems. The IDF weights rare terms more heavily. The saturation function ensures diminishing returns from repetition. The length normalization factor in the denominator adjusts for document size. Together, they create a scoring function that has worked well across decades of information retrieval research.
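As a compact restatement, here is the per-term contribution written as a single function. This is just a sketch for illustration, with placeholder example values; the full implementation later in the chapter builds a proper index around the same formula:

import math

def bm25_term_score(tf, df, N, doc_len, avgdl, k1=1.2, b=0.75):
    """Contribution of one query term, following the formula above.

    tf: term frequency in the document, df: number of documents containing
    the term, N: collection size, doc_len / avgdl: document and average length.
    """
    idf = math.log((N - df + 0.5) / (df + 0.5))
    norm = 1 - b + b * (doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# Example: a term in 50 of 10,000 documents, appearing 3 times in a
# document that is 1.5x the average length.
print(bm25_term_score(tf=3, df=50, N=10_000, doc_len=150, avgdl=100))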

Parameter Interaction: A Heatmap View

The parameters $k_1$ and $b$ interact in interesting ways. Let's visualize how different combinations affect the score for a document with a fixed term frequency:

In[8]:
# Create parameter grids
k1_range = np.linspace(0.1, 3.0, 50)
b_range = np.linspace(0.0, 1.0, 50)
K1, B = np.meshgrid(k1_range, b_range)

# Fixed parameters for visualization
tf = 3  # Term frequency
doc_len = 150  # Document length
avgdl = 100  # Average document length
idf = 2.5  # IDF of a moderately rare term

# Calculate BM25 scores for each (k1, b) combination
scores = idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * (doc_len / avgdl)))
Out[9]:
Visualization
Heatmap showing BM25 scores varying with k1 on x-axis and b on y-axis, with warmer colors indicating higher scores.
BM25 score heatmap showing the interaction between $k_1$ and $b$ parameters for a document 1.5x average length with TF=3. Higher $k_1$ increases scores (less saturation), while higher $b$ decreases scores (more length penalty). The default parameters ($k_1$=1.2, $b$=0.75) are marked with a star.

The heatmap reveals the parameter landscape. Moving right (higher $k_1$) increases scores because term frequency contributes more. Moving up (higher $b$) decreases scores because our document is longer than average and gets penalized more heavily. The default parameters sit in a region that balances these effects, but the optimal choice depends on your specific collection and retrieval goals.

A Worked Example

With the formula understood, let's trace through a concrete example to see how all the pieces fit together. We'll score three short documents against a simple query, computing each component step by step.

Our corpus consists of three documents:

  • D1: "the cat sat on the mat" (6 words)
  • D2: "the cat sat on the cat mat" (7 words)
  • D3: "the dog ran in the park" (6 words)

And our query is: "cat mat"

Notice that D1 and D2 both contain our query terms, but D2 mentions "cat" twice. D3 contains neither term. Let's see how BM25 handles these differences.

Step 1: Collection Statistics

Before scoring any document, we need collection-level statistics. These establish the baseline against which individual documents are compared:

In[10]:
documents = [
    "the cat sat on the mat",
    "the cat sat on the cat mat",
    "the dog ran in the park"
]
query = "cat mat"

# Tokenize
doc_tokens = [doc.lower().split() for doc in documents]
query_tokens = query.lower().split()

# Collection statistics
N = len(documents)  # Total documents
doc_lengths = [len(tokens) for tokens in doc_tokens]
avgdl = sum(doc_lengths) / N

# Document frequency for each term
from collections import Counter
doc_freq = Counter()
for tokens in doc_tokens:
    for term in set(tokens):
        doc_freq[term] += 1
Out[11]:
Number of documents (N): 3
Document lengths: [6, 7, 6]
Average document length (avgdl): 6.33

Document frequencies for query terms:
  'cat': appears in 2 document(s)
  'mat': appears in 2 document(s)

We have 3 documents with an average length of about 6.3 words. Both "cat" and "mat" appear in 2 of the 3 documents. These statistics will feed into our IDF and length normalization calculations.

Step 2: Computing IDF

Now let's compute the IDF for each query term. Remember, IDF measures how discriminative a term is. Terms appearing in fewer documents get higher IDF:

In[12]:
import math

def compute_idf(term, N, doc_freq):
    n = doc_freq.get(term, 0)
    return math.log((N - n + 0.5) / (n + 0.5))

idf_scores = {term: compute_idf(term, N, doc_freq) for term in query_tokens}
Out[13]:
IDF scores:
  'cat': log((3 - 2 + 0.5) / (2 + 0.5)) = -0.5108
  'mat': log((3 - 2 + 0.5) / (2 + 0.5)) = -0.5108

Both terms appear in 2 out of 3 documents, giving them identical IDF values of about -0.51. The negative IDF is notable: it means these terms are so common in our tiny corpus that matching them actually counts as negative evidence. In a larger, more realistic corpus, query terms would typically have positive IDF values, but this example illustrates how the probabilistic derivation handles common terms.

Step 3: Scoring Each Document

Now we combine everything: IDF, saturated term frequency, and length normalization. For each query term in each document, we compute the contribution and sum them up:

In[14]:
def bm25_score(doc_tokens, query_tokens, doc_freq, N, avgdl, k1=1.2, b=0.75):
    doc_len = len(doc_tokens)
    term_freq = Counter(doc_tokens)
    
    score = 0
    score_breakdown = []
    
    for term in query_tokens:
        if term not in term_freq:
            continue
            
        tf = term_freq[term]
        idf = compute_idf(term, N, doc_freq)
        
        # BM25 term score
        numerator = tf * (k1 + 1)
        denominator = tf + k1 * (1 - b + b * (doc_len / avgdl))
        term_score = idf * (numerator / denominator)
        
        score += term_score
        score_breakdown.append({
            'term': term,
            'tf': tf,
            'idf': idf,
            'term_score': term_score
        })
    
    return score, score_breakdown

# Score each document
results = []
for i, tokens in enumerate(doc_tokens):
    score, breakdown = bm25_score(tokens, query_tokens, doc_freq, N, avgdl)
    results.append({
        'doc_id': i + 1,
        'score': score,
        'breakdown': breakdown,
        'length': len(tokens)
    })
Out[15]:
BM25 Scores (k1=1.2, b=0.75):

Document 3: "the dog ran in the park"
  Length: 6 (avgdl = 6.33)
  Total Score: 0.0000

Document 1: "the cat sat on the mat"
  Length: 6 (avgdl = 6.33)
  Total Score: -1.0441
    'cat': TF=1, IDF=-0.5108, contribution=-0.5221
    'mat': TF=1, IDF=-0.5108, contribution=-0.5221

Document 2: "the cat sat on the cat mat"
  Length: 7 (avgdl = 6.33)
  Total Score: -1.1719
    'cat': TF=2, IDF=-0.5108, contribution=-0.6822
    'mat': TF=1, IDF=-0.5108, contribution=-0.4897

Let's visualize the score breakdown for each document to see how each query term contributes:

Out[16]:
Visualization
Stacked bar chart showing BM25 score contributions from 'cat' and 'mat' terms for each document.
Score breakdown showing the contribution of each query term to the total BM25 score. Because both query terms have negative IDF in this tiny corpus, every match subtracts from the score: D2's second occurrence of 'cat' makes its total more negative than D1's, while D3 scores zero because it contains neither query term.

The visualization makes the scoring dynamics clear:

  • D3 scores zero because it contains neither query term. With both query terms carrying negative IDF, zero happens to be the highest score in this corpus.
  • D1 comes next, with both query terms appearing once each. It's slightly shorter than average, which slightly amplifies each (negative) contribution.
  • D2 scores lowest even though it matches the query best: because "cat" has a negative IDF, its second occurrence adds more negative evidence, and saturation only dampens, rather than reverses, that penalty.

This counterintuitive ordering is an artifact of our three-document corpus, where both query terms appear in two of the three documents and therefore receive negative IDF. In practice, with larger corpora and more discriminative query terms, IDF values are positive, documents matching more query terms score higher, and D2 would rank first. What matters here is seeing exactly how IDF, saturation, and length normalization combine into each document's score.

Implementing BM25 from Scratch

Understanding the formula is one thing; implementing it efficiently is another. Let's build a complete BM25 retrieval system step by step, seeing how the mathematical concepts translate into code.

We'll use a larger corpus to make the results more interesting and realistic:

In[17]:
# Sample corpus about programming languages
corpus = [
    "Python is a high-level programming language known for its simple syntax and readability",
    "Java is a popular object-oriented programming language used for enterprise applications",
    "JavaScript is the programming language of the web enabling interactive web pages",
    "Python and JavaScript are both interpreted languages with dynamic typing",
    "Java requires compilation to bytecode which runs on the Java Virtual Machine",
    "The Python programming language supports multiple programming paradigms",
    "Web developers often use JavaScript frameworks like React and Vue",
    "Python is widely used for machine learning and data science applications",
    "Java enterprise applications often use Spring framework for dependency injection",
    "JavaScript can run on servers using Node.js runtime environment"
]

Building the Index

A BM25 index needs to store several pieces of information efficiently:

  1. Document frequencies: For each term, how many documents contain it? (Needed for IDF)
  2. Document lengths: How long is each document? (Needed for length normalization)
  3. Average document length: The collection-wide baseline for length comparison
  4. Inverted index: For each term, which documents contain it and with what frequency? (Needed for efficient retrieval)

Let's create a class that builds and stores this index:

In[18]:
import re
from collections import defaultdict

class BM25Index:
    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b
        self.documents = []
        self.doc_lengths = []
        self.avgdl = 0
        self.doc_freq = defaultdict(int)
        self.inverted_index = defaultdict(list)  # term -> [(doc_id, tf), ...]
        
    def tokenize(self, text):
        """Simple tokenization: lowercase and split on non-alphanumeric."""
        return re.findall(r'\b[a-z]+\b', text.lower())
    
    def index(self, documents):
        """Build the BM25 index from a list of documents."""
        self.documents = documents
        
        for doc_id, doc in enumerate(documents):
            tokens = self.tokenize(doc)
            self.doc_lengths.append(len(tokens))
            
            # Count term frequencies in this document
            term_freq = Counter(tokens)
            
            # Update document frequencies and inverted index
            for term, tf in term_freq.items():
                if tf > 0:
                    self.doc_freq[term] += 1
                    self.inverted_index[term].append((doc_id, tf))
        
        self.avgdl = sum(self.doc_lengths) / len(self.doc_lengths)
        self.N = len(documents)

The indexing process makes a single pass through the corpus, computing all the statistics we need. The inverted index is the key data structure for efficient retrieval: instead of scanning every document for each query, we can directly look up which documents contain each query term.
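To make the inverted index structure concrete, here's a tiny standalone sketch, separate from the class above, showing the term-to-postings mapping it stores:

from collections import Counter, defaultdict

toy_docs = ["python is popular", "java is popular too", "python and java"]

inverted = defaultdict(list)  # term -> [(doc_id, term frequency), ...]
for doc_id, text in enumerate(toy_docs):
    for term, tf in Counter(text.split()).items():
        inverted[term].append((doc_id, tf))

print(inverted["python"])   # [(0, 1), (2, 1)]
print(inverted["popular"])  # [(0, 1), (1, 1)]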

Implementing the Scoring Function

Now we add methods to compute BM25 scores. The scoring function directly implements our formula, computing IDF, saturated term frequency, and length normalization for each query term:

In[19]:
def compute_idf(self, term):
    """Compute IDF for a term."""
    n = self.doc_freq.get(term, 0)
    return math.log((self.N - n + 0.5) / (n + 0.5))

def score_document(self, doc_id, query_terms):
    """Compute BM25 score for a single document."""
    doc_len = self.doc_lengths[doc_id]
    tokens = self.tokenize(self.documents[doc_id])
    term_freq = Counter(tokens)
    
    score = 0
    for term in query_terms:
        if term not in term_freq:
            continue
        
        tf = term_freq[term]
        idf = self.compute_idf(term)
        
        numerator = tf * (self.k1 + 1)
        denominator = tf + self.k1 * (1 - self.b + self.b * (doc_len / self.avgdl))
        score += idf * (numerator / denominator)
    
    return score

def search(self, query, top_k=5):
    """Search the index and return top-k results."""
    query_terms = self.tokenize(query)
    
    # Find candidate documents (those containing at least one query term)
    candidates = set()
    for term in query_terms:
        for doc_id, _ in self.inverted_index.get(term, []):
            candidates.add(doc_id)
    
    # Score candidates
    results = []
    for doc_id in candidates:
        score = self.score_document(doc_id, query_terms)
        if score > 0:
            results.append((doc_id, score))
    
    # Sort by score descending
    results.sort(key=lambda x: x[1], reverse=True)
    return results[:top_k]

# Add methods to the class
BM25Index.compute_idf = compute_idf
BM25Index.score_document = score_document
BM25Index.search = search

Notice how the search method uses the inverted index to find candidate documents efficiently. Instead of scoring all documents, we only consider those containing at least one query term. This optimization matters for large corpora where most documents won't match the query at all.

Testing the Implementation

Let's test our BM25 implementation with several queries to see how it ranks documents:

In[20]:
# Build the index
index = BM25Index(k1=1.2, b=0.75)
index.index(corpus)

# Search for different queries
queries = [
    "python programming",
    "web javascript",
    "machine learning",
    "java enterprise"
]

search_results = {}
for query in queries:
    results = index.search(query, top_k=3)
    search_results[query] = results
Out[21]:
Query: "python programming"
--------------------------------------------------
  1. [Score: 0.9592] The Python programming language supports multiple programmin...
  2. [Score: 0.6588] Python is a high-level programming language known for its si...
  3. [Score: 0.3806] Python and JavaScript are both interpreted languages with dy...

Query: "web javascript"
--------------------------------------------------
  1. [Score: 1.9894] JavaScript is the programming language of the web enabling i...
  2. [Score: 1.6471] Web developers often use JavaScript frameworks like React an...
  3. [Score: 0.3806] JavaScript can run on servers using Node.js runtime environm...

Query: "machine learning"
--------------------------------------------------
  1. [Score: 3.0581] Python is widely used for machine learning and data science ...
  2. [Score: 1.1753] Java requires compilation to bytecode which runs on the Java...

Query: "java enterprise"
--------------------------------------------------
  1. [Score: 2.0553] Java enterprise applications often use Spring framework for ...
  2. [Score: 1.9072] Java is a popular object-oriented programming language used ...
  3. [Score: 1.0190] Java requires compilation to bytecode which runs on the Java...

Our BM25 implementation successfully retrieves relevant documents. Several patterns emerge from these results:

  • Specific queries work well: "machine learning" finds the one document explicitly about ML and data science.
  • Multiple matches boost scores: Documents matching both query terms (like "Python programming") score higher than those matching just one.
  • IDF matters: Common terms like "programming" contribute less than rarer terms like "machine" or "enterprise."

The implementation is functionally complete, but production systems like Elasticsearch add many optimizations: compressed inverted indexes, skip lists for faster intersection, and caching of IDF values.

BM25 Variants

The original BM25 formula has some known issues that researchers have addressed with variants.

The Problem with Short Documents

One issue with BM25 is that very short documents can be over-penalized. If a document is much shorter than average, the length normalization term $1 - b + b \cdot \frac{|D|}{\text{avgdl}}$ becomes less than 1, which actually boosts the term frequency component. But this boost might not be enough to compensate for the document simply having fewer terms.

BM25L: Lower Bound on TF

BM25L (BM25 with Lower bound) addresses this by modifying how term frequency is normalized. First, it computes a length-normalized term frequency:

$$c(t, D) = \frac{f(t, D)}{1 - b + b \cdot \frac{|D|}{\text{avgdl}}}$$

Then it adds a lower bound constant $\delta$ to this normalized value:

$$\text{BM25L}(D, Q) = \sum_{t \in Q} \text{IDF}(t) \cdot \frac{(k_1 + 1) \cdot (c(t, D) + \delta)}{k_1 + c(t, D) + \delta}$$

where:

  • $c(t, D)$: length-normalized term frequency
  • $\delta$: lower bound constant (typically 0.5)
  • All other variables are the same as in standard BM25

The addition of $\delta$ to the normalized term frequency ensures that even a single occurrence of a term contributes meaningfully to the score, regardless of document length.

BM25L

A variant of BM25 that adds a small constant $\delta$ to the length-normalized term frequency, preventing short documents from being unfairly penalized and ensuring a lower bound on the contribution of each term occurrence.

BM25+: Additive IDF

BM25+ takes a different approach by adding a constant to the entire term weight:

$$\text{BM25+}(D, Q) = \sum_{t \in Q} \text{IDF}(t) \cdot \left(\frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} + \delta\right)$$

where:

  • $\delta$: additive constant (typically 1.0)
  • All other variables are the same as in standard BM25

This guarantees that a document containing a query term always scores higher than one that doesn't, regardless of length normalization effects.

Let's compare these variants:

In[22]:
def bm25_standard(tf, idf, doc_len, avgdl, k1=1.2, b=0.75):
    """Standard BM25 scoring."""
    numerator = tf * (k1 + 1)
    denominator = tf + k1 * (1 - b + b * (doc_len / avgdl))
    return idf * (numerator / denominator)

def bm25l(tf, idf, doc_len, avgdl, k1=1.2, b=0.75, delta=0.5):
    """BM25L with lower bound on normalized TF."""
    # First normalize TF by document length
    ctd = tf / (1 - b + b * (doc_len / avgdl))
    # Then add delta to the normalized value
    ctd_adjusted = ctd + delta
    numerator = (k1 + 1) * ctd_adjusted
    denominator = k1 + ctd_adjusted
    return idf * (numerator / denominator)

def bm25_plus(tf, idf, doc_len, avgdl, k1=1.2, b=0.75, delta=1.0):
    """BM25+ with additive component."""
    numerator = tf * (k1 + 1)
    denominator = tf + k1 * (1 - b + b * (doc_len / avgdl))
    return idf * ((numerator / denominator) + delta)

# Compare for different document lengths
avgdl = 100
tf = 2
idf = 2.0  # Moderately rare term
doc_lengths = [20, 50, 100, 200, 500]

comparison = {
    'BM25': [],
    'BM25L': [],
    'BM25+': []
}

for doc_len in doc_lengths:
    comparison['BM25'].append(bm25_standard(tf, idf, doc_len, avgdl))
    comparison['BM25L'].append(bm25l(tf, idf, doc_len, avgdl))
    comparison['BM25+'].append(bm25_plus(tf, idf, doc_len, avgdl))
Out[23]:
Visualization
Bar chart comparing BM25, BM25L, and BM25+ scores for documents of varying lengths from 20 to 500 words.
Comparison of BM25 variants across different document lengths. BM25+ maintains higher scores for short documents, while BM25L provides a smoother curve. Standard BM25 shows the most aggressive length normalization.

BM25+ consistently gives higher scores, especially for shorter documents where the additive $\delta$ term has more relative impact. BM25L provides a middle ground with its adjusted term frequency calculation.

Field-Weighted BM25

Real documents often have structure: titles, abstracts, body text, metadata. A match in the title should probably count more than a match buried in the body. Field-weighted BM25 (sometimes called BM25F) addresses this.

BM25F (Field-Weighted BM25)

An extension of BM25 that handles structured documents by computing a weighted combination of term frequencies across different fields, with each field having its own boost weight and length normalization.

The key idea is to compute a "virtual" term frequency that combines contributions from all fields:

$$\tilde{f}(t, D) = \sum_{c \in \text{fields}} w_c \cdot \frac{f(t, D_c)}{1 - b_c + b_c \cdot \frac{|D_c|}{\text{avgdl}_c}}$$

where:

  • $\tilde{f}(t, D)$: virtual (combined) term frequency for term $t$ in document $D$
  • $c$: a field in the document (e.g., title, body, abstract)
  • $w_c$: boost weight for field $c$ (higher values give more importance to matches in that field)
  • $f(t, D_c)$: frequency of term $t$ in field $c$ of document $D$
  • $b_c$: length normalization parameter specific to field $c$
  • $|D_c|$: length of field $c$ in document $D$
  • $\text{avgdl}_c$: average length of field $c$ across all documents

This virtual term frequency then plugs into a simplified BM25 formula (note that length normalization is already applied per-field in $\tilde{f}$):

$$\text{BM25F}(D, Q) = \sum_{t \in Q} \text{IDF}(t) \cdot \frac{\tilde{f}(t, D) \cdot (k_1 + 1)}{\tilde{f}(t, D) + k_1}$$

Let's implement a simple version:

In[24]:
class BM25FIndex:
    def __init__(self, field_weights, k1=1.2, b=0.75):
        self.field_weights = field_weights  # {'title': 3.0, 'body': 1.0}
        self.k1 = k1
        self.b = b
        self.documents = []
        self.field_lengths = defaultdict(list)  # field -> [lengths]
        self.avgdl = {}  # field -> average length
        self.doc_freq = defaultdict(int)
        self.N = 0
        
    def tokenize(self, text):
        return re.findall(r'\b[a-z]+\b', text.lower())
    
    def index(self, documents):
        """Index documents with multiple fields.
        
        Each document should be a dict: {'title': '...', 'body': '...'}
        """
        self.documents = documents
        self.N = len(documents)
        
        # Compute document frequencies and field lengths
        all_terms = set()
        for doc in documents:
            doc_terms = set()
            for field in self.field_weights:
                tokens = self.tokenize(doc.get(field, ''))
                self.field_lengths[field].append(len(tokens))
                doc_terms.update(tokens)
            
            for term in doc_terms:
                self.doc_freq[term] += 1
        
        # Compute average lengths per field
        for field, lengths in self.field_lengths.items():
            self.avgdl[field] = sum(lengths) / len(lengths) if lengths else 1
    
    def compute_virtual_tf(self, doc_id, term):
        """Compute weighted, normalized term frequency across fields."""
        doc = self.documents[doc_id]
        virtual_tf = 0
        
        for field, weight in self.field_weights.items():
            tokens = self.tokenize(doc.get(field, ''))
            tf = tokens.count(term)
            if tf > 0:
                field_len = len(tokens)
                avgdl_field = self.avgdl[field]
                # Normalize by field length, then weight
                normalized_tf = tf / (1 - self.b + self.b * (field_len / avgdl_field))
                virtual_tf += weight * normalized_tf
        
        return virtual_tf
    
    def search(self, query, top_k=5):
        query_terms = self.tokenize(query)
        
        results = []
        for doc_id in range(self.N):
            score = 0
            for term in query_terms:
                if self.doc_freq.get(term, 0) == 0:
                    continue
                    
                vtf = self.compute_virtual_tf(doc_id, term)
                if vtf > 0:
                    idf = math.log((self.N - self.doc_freq[term] + 0.5) / 
                                   (self.doc_freq[term] + 0.5))
                    score += idf * (vtf * (self.k1 + 1)) / (vtf + self.k1)
            
            if score > 0:
                results.append((doc_id, score))
        
        results.sort(key=lambda x: x[1], reverse=True)
        return results[:top_k]

Let's test with structured documents:

In[25]:
# Structured documents with title and body
structured_docs = [
    {
        'title': 'Introduction to Python Programming',
        'body': 'This guide covers the basics of writing code in a popular scripting language.'
    },
    {
        'title': 'Advanced Java Development',
        'body': 'Learn enterprise patterns and best practices for building robust Python applications.'
    },
    {
        'title': 'Web Development Fundamentals',
        'body': 'Build modern websites using Python frameworks like Django and Flask.'
    },
    {
        'title': 'Python for Data Science',
        'body': 'Use Python libraries like pandas and numpy for data analysis and machine learning.'
    }
]

# Index with title weighted 3x more than body
bm25f_index = BM25FIndex(field_weights={'title': 3.0, 'body': 1.0})
bm25f_index.index(structured_docs)

# Search
bm25f_results = bm25f_index.search('python programming', top_k=4)
Out[26]:
Query: "python programming"
Field weights: title=3.0, body=1.0

------------------------------------------------------------

Document 1 ranks highest because "Python" and "Programming" both appear in its title, which has 3x weight. Document 2 mentions "Python" only in the body (despite having "Development" in the title), so it ranks lower even though the body text is about Python.

BM25 vs TF-IDF: An Empirical Comparison

How much does BM25's sophistication actually matter? Let's compare it directly against TF-IDF on a retrieval task.

In[27]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Larger test corpus
test_corpus = [
    "Machine learning algorithms learn patterns from data",
    "Deep learning uses neural networks with many layers",
    "Natural language processing enables computers to understand text",
    "Computer vision allows machines to interpret images",
    "Reinforcement learning trains agents through rewards and penalties",
    "Supervised learning requires labeled training examples",
    "Unsupervised learning finds patterns without labels",
    "Transfer learning reuses knowledge from one task to another",
    "Neural networks are inspired by biological neurons",
    "Gradient descent optimizes model parameters iteratively",
    "Backpropagation computes gradients for neural network training",
    "Convolutional networks excel at image recognition tasks",
    "Recurrent networks process sequential data like text",
    "Attention mechanisms help models focus on relevant inputs",
    "Transformers revolutionized natural language processing",
    "BERT provides contextual word embeddings for NLP tasks",
    "GPT generates coherent text using autoregressive modeling",
    "Word embeddings represent words as dense vectors",
    "Sentiment analysis determines the emotional tone of text",
    "Named entity recognition identifies people places and organizations"
]

# Build BM25 index
bm25_index = BM25Index(k1=1.2, b=0.75)
bm25_index.index(test_corpus)

# Build TF-IDF index
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(test_corpus)

def tfidf_search(query, top_k=5):
    """Search using TF-IDF cosine similarity."""
    query_vec = tfidf_vectorizer.transform([query])
    similarities = cosine_similarity(query_vec, tfidf_matrix)[0]
    top_indices = similarities.argsort()[::-1][:top_k]
    return [(idx, similarities[idx]) for idx in top_indices if similarities[idx] > 0]

Now let's compare rankings for several queries:

In[28]:
comparison_queries = [
    "neural network training",
    "text processing language",
    "learning from examples",
    "image recognition deep"
]

comparison_results = []
for query in comparison_queries:
    bm25_results = bm25_index.search(query, top_k=5)
    tfidf_results = tfidf_search(query, top_k=5)
    
    comparison_results.append({
        'query': query,
        'bm25': bm25_results,
        'tfidf': tfidf_results
    })
Out[29]:
Query: "neural network training"
======================================================================

BM25 Results:
  1. [6.2469] Backpropagation computes gradients for neural network t...
  2. [2.1479] Supervised learning requires labeled training examples...
  3. [1.6279] Neural networks are inspired by biological neurons...

TF-IDF Results:
  1. [0.6237] Backpropagation computes gradients for neural network t...
  2. [0.2194] Supervised learning requires labeled training examples...
  3. [0.1636] Neural networks are inspired by biological neurons...


Query: "text processing language"
======================================================================

BM25 Results:
  1. [5.0717] Natural language processing enables computers to unders...
  2. [4.5748] Transformers revolutionized natural language processing...
  3. [1.3142] Recurrent networks process sequential data like text...

TF-IDF Results:
  1. [0.5659] Natural language processing enables computers to unders...
  2. [0.5165] Transformers revolutionized natural language processing...
  3. [0.1519] Recurrent networks process sequential data like text...


Query: "learning from examples"
======================================================================

BM25 Results:
  1. [3.6137] Supervised learning requires labeled training examples...
  2. [2.8361] Machine learning algorithms learn patterns from data...
  3. [2.5437] Transfer learning reuses knowledge from one task to ano...

TF-IDF Results:
  1. [0.4163] Supervised learning requires labeled training examples...
  2. [0.3313] Machine learning algorithms learn patterns from data...
  3. [0.2836] Transfer learning reuses knowledge from one task to ano...


Query: "image recognition deep"
======================================================================

BM25 Results:
  1. [4.6189] Convolutional networks excel at image recognition tasks...
  2. [2.4534] Deep learning uses neural networks with many layers...
  3. [1.9145] Named entity recognition identifies people places and o...

TF-IDF Results:
  1. [0.4320] Convolutional networks excel at image recognition tasks...
  2. [0.2347] Deep learning uses neural networks with many layers...
  3. [0.1689] Named entity recognition identifies people places and o...


The rankings often agree, but there are subtle differences. BM25's saturation effect means that a document with many occurrences of a query term doesn't dominate as much as it would with raw TF-IDF. The length normalization also helps ensure shorter, focused documents aren't overwhelmed by longer ones.

Let's visualize the score distributions:

In[30]:
# Get all scores for a specific query
query = "neural network learning"

bm25_all_scores = []
tfidf_all_scores = []

query_terms = bm25_index.tokenize(query)
query_vec = tfidf_vectorizer.transform([query])
tfidf_sims = cosine_similarity(query_vec, tfidf_matrix)[0]

for doc_id in range(len(test_corpus)):
    bm25_score = bm25_index.score_document(doc_id, query_terms)
    bm25_all_scores.append(bm25_score)
    tfidf_all_scores.append(tfidf_sims[doc_id])
Out[31]:
Visualization
Histogram showing BM25 scores clustered between 0 and 4, with most documents scoring near zero.
BM25 score distribution for the query 'neural network learning'. The saturation effect creates a more compressed score range, with clearer separation between relevant and non-relevant documents.

BM25 scores tend to have a wider dynamic range with clearer separation between relevant and non-relevant documents. TF-IDF cosine similarities are bounded between 0 and 1, which can make it harder to distinguish between moderately relevant documents.

Using BM25 in Practice

While understanding BM25 internals is valuable, in practice you'll often use existing implementations. The rank_bm25 library provides a simple, efficient implementation:

In[32]:
# Install if needed: uv pip install rank-bm25
from rank_bm25 import BM25Okapi

# Tokenize corpus
tokenized_corpus = [doc.lower().split() for doc in test_corpus]

# Create BM25 index
bm25 = BM25Okapi(tokenized_corpus)

# Search
query = "neural network training optimization"
tokenized_query = query.lower().split()
scores = bm25.get_scores(tokenized_query)
Out[33]:
Query: "neural network training optimization"

Top 5 results using rank_bm25:
------------------------------------------------------------
1. [Score: 6.2540] Backpropagation computes gradients for neural network training
2. [Score: 2.1638] Supervised learning requires labeled training examples
3. [Score: 1.6298] Neural networks are inspired by biological neurons
4. [Score: 1.5328] Deep learning uses neural networks with many layers
5. [Score: 0.0000] Named entity recognition identifies people places and organizations

For production search systems, Elasticsearch and OpenSearch use BM25 as their default scoring algorithm. They also support BM25F for multi-field documents out of the box.
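If you run Elasticsearch or OpenSearch, the $k_1$ and $b$ parameters are exposed through index settings as a custom similarity. Below is a rough sketch using the official Python client; the index name, field names, and parameter values are placeholders, and the exact call signature may vary with your client version:

# A sketch of configuring BM25 parameters via a custom similarity
# (assumes the `elasticsearch` Python client with 8.x-style keyword arguments).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

es.indices.create(
    index="articles",  # placeholder index name
    settings={
        "index": {
            "similarity": {
                "tuned_bm25": {"type": "BM25", "k1": 1.4, "b": 0.6}
            }
        }
    },
    mappings={
        "properties": {
            "title": {"type": "text", "similarity": "tuned_bm25"},
            "body": {"type": "text", "similarity": "tuned_bm25"},
        }
    },
)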

Limitations and Modern Context

BM25 works well in practice, but it has fundamental limitations:

Vocabulary mismatch: BM25 only matches exact terms. If your query says "car" but the document says "automobile," BM25 won't find the connection. Semantic search with embeddings handles this better.

No semantic understanding: "Bank of the river" and "Bank account" would be treated identically. BM25 has no concept of word meaning or context.

Bag of words assumption: Word order doesn't matter. "Dog bites man" scores the same as "Man bites dog" for a query about dogs and men.

Query-document length mismatch: Short queries searching long documents (or vice versa) can produce suboptimal results despite length normalization.

Despite these limitations, BM25 remains widely used because:

  1. Speed: It's extremely fast, requiring only sparse vector operations
  2. Interpretability: You can explain exactly why a document ranked where it did
  3. No training required: It works out of the box on any text collection
  4. Hybrid search: Modern systems often combine BM25 with neural retrievers, getting the best of both worlds
Hybrid Search

A retrieval strategy that combines lexical matching (like BM25) with semantic matching (like dense embeddings). The two approaches complement each other: BM25 handles exact matches and rare terms well, while embeddings capture semantic similarity.
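One simple way to combine the two result lists is reciprocal rank fusion. The sketch below assumes each retriever returns a ranked list of document ids; the document ids and the constant k = 60 are illustrative, not prescribed by BM25 itself:

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: each doc earns 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a BM25 retriever and a dense (embedding) retriever.
bm25_ranking = ["d3", "d1", "d7", "d2"]
dense_ranking = ["d1", "d9", "d3", "d4"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))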

Summary

BM25 represents the high point of classical information retrieval. Its key innovations:

  • Probabilistic foundation: Derived from principled probabilistic models, not ad-hoc heuristics
  • Term frequency saturation: The $k_1$ parameter ensures diminishing returns from repeated terms
  • Length normalization: The $b$ parameter prevents long documents from unfairly dominating
  • IDF weighting: Rare terms receive higher weight, common terms are downweighted

The standard parameters ($k_1 = 1.2$, $b = 0.75$) work well across most collections, though tuning can help for specific domains. Variants like BM25L and BM25+ address edge cases with short documents, while BM25F handles structured documents with multiple fields.

BM25's lasting influence is evident in its continued use in Elasticsearch, Solr, and other search engines. Even as neural retrievers have emerged, BM25 often serves as a first-stage retriever or as part of hybrid systems. Understanding BM25 gives you insight into what makes search work, and why modern approaches needed to go beyond lexical matching to achieve true semantic understanding.

Key Parameters

Understanding BM25's parameters is essential for tuning retrieval performance:

| Parameter | Typical value | Range | Effect |
| --- | --- | --- | --- |
| $k_1$ | 1.2 | 0.0 - 3.0 | Controls term frequency saturation. Lower values cause faster saturation (diminishing returns from repeated terms); higher values make scoring more linear in term frequency. |
| $b$ | 0.75 | 0.0 - 1.0 | Controls length normalization strength. At 0, document length is ignored. At 1, term frequencies are fully normalized by document length relative to the collection average. |
| $\delta$ (BM25L/BM25+) | 0.5 - 1.0 | 0.0 - 2.0 | Additive constant that ensures a minimum contribution from term matches. Higher values boost scores for documents with any occurrence of query terms. |

Tuning guidance:

  • Start with the defaults ($k_1 = 1.2$, $b = 0.75$): These work well for most text collections and are the defaults in Elasticsearch and Lucene.
  • Increase $k_1$ (toward 2.0) when term frequency is highly informative, such as in longer documents where repeated mentions genuinely indicate relevance.
  • Decrease $k_1$ (toward 0.5) for collections where a single mention is as informative as multiple mentions, such as short documents or when keyword stuffing is a concern.
  • Decrease $b$ (toward 0.25) when document length variation is meaningful, for example, when longer documents genuinely contain more relevant information rather than just more filler.
  • Increase $b$ (toward 1.0) when longer documents tend to match queries spuriously due to containing more vocabulary.
  • Use BM25+ or BM25L when your collection has highly variable document lengths and you observe short, relevant documents being ranked too low.
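If you have even a handful of relevance judgments, tuning can be a simple grid search. The sketch below reuses the BM25Index class and the programming-language corpus from earlier; the judgments dictionary and the recall@k metric are illustrative assumptions, not part of the original example:

from itertools import product

# Hypothetical relevance judgments: query -> set of relevant doc ids in `corpus`.
judgments = {
    "python data science": {7},
    "java enterprise applications": {1, 8},
}

def recall_at_k(index, judgments, k=3):
    """Fraction of judged-relevant documents retrieved in the top-k results."""
    found = total = 0
    for query, relevant in judgments.items():
        top = {doc_id for doc_id, _ in index.search(query, top_k=k)}
        found += len(top & relevant)
        total += len(relevant)
    return found / total

best = None
for k1, b in product([0.8, 1.2, 1.6, 2.0], [0.25, 0.5, 0.75, 1.0]):
    index = BM25Index(k1=k1, b=b)
    index.index(corpus)
    score = recall_at_k(index, judgments)
    if best is None or score > best[0]:
        best = (score, k1, b)

print(f"Best recall@3 = {best[0]:.2f} with k1={best[1]}, b={best[2]}")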
