TF-IDF: Term Frequency-Inverse Document Frequency for Text Representation

Michael Brenndoerfer · Updated March 29, 2025 · 53 min read

Master TF-IDF for text representation, including the core formula, variants like log-scaled TF and smoothed IDF, normalization techniques, document similarity with cosine similarity, and BM25 as a modern extension.


TF-IDF

You've counted words. You've explored term frequency variants that weight those counts. Now comes the crucial insight: a word's importance depends not just on how often it appears in a document, but on how rare it is across the corpus. The word "the" might appear 50 times in a document, but it tells you nothing because it appears in every document. The word "transformer" appearing just twice might be highly informative if it's rare elsewhere.

TF-IDF, short for Term Frequency-Inverse Document Frequency, combines these two signals into a single score. It's one of the most successful text representations in information retrieval, powering search engines and document similarity systems for decades. The formula is simple: multiply how often a term appears in a document by how rare it is across the corpus. Common words get downweighted; distinctive words get boosted.

This chapter brings together everything from the previous chapters on term frequency and inverse document frequency. You'll learn the exact TF-IDF formula and its variants, implement it from scratch, understand normalization options, and master scikit-learn's TfidfVectorizer. By the end, you'll know when TF-IDF works, when it fails, and how its successor BM25 addresses some of its limitations.

The TF-IDF Formula

This section develops the TF-IDF formula from first principles. We start with the problem of quantifying word importance, then build up each component of the formula step by step, showing how term frequency and inverse document frequency combine to create meaningful document representations.

The Challenge: Quantifying Word Importance

Imagine you're building a search engine for a collection of machine learning papers. When someone searches for "neural networks," you need to rank documents by relevance. But how do you decide which words in each document actually matter for this search?

The problem becomes clear when you look at actual documents. Every paper contains words like "the," "is," "data," and "model" - these appear everywhere and tell you nothing about what makes this paper unique. But other words like "backpropagation," "convolutional," or "transformer" appear much more selectively. Intuitively, you know these rarer, more specific words should carry more weight in determining relevance.

TF-IDF solves this fundamental problem by asking two essential questions about every word in every document:

  1. How prominent is this word in this specific document? (Local importance)
  2. How distinctive is this word across the entire collection? (Global rarity)

TF-IDF recognizes that both questions must be answered well for a word to be truly important. A word becomes a strong signal only when it's both prominent locally and distinctive globally.

From Intuition to Mathematics: Building the Formula

Let's develop the TF-IDF formula step by step, understanding why each mathematical choice addresses a specific aspect of the word importance problem.

Step 1: Capturing Local Prominence with Term Frequency

The most straightforward way to measure a word's importance to a document is simply counting how often it appears. If "neural" appears 5 times in a paper about neural networks while "algorithm" appears only once, it's reasonable to conclude that "neural" is more central to this document's content.

This intuition leads to the term frequency (TF):

\text{tf}(t, d) = f_{t,d}

where:

  • t: the term (word) whose frequency we're measuring
  • d: the document we're examining
  • f_{t,d}: the raw count of how many times term t appears in document d

Raw term frequency captures the local prominence we want, but it has a critical weakness. Consider a document where "the" appears 50 times, "neural" appears 5 times, and "backpropagation" appears 3 times. The raw counts suggest "the" is most important, which is clearly wrong. We need a way to distinguish between ubiquitous function words and meaningful content words.

Step 2: Capturing Global Distinctiveness with Inverse Document Frequency

To identify words that truly distinguish documents from each other, we need to look across the entire corpus. A word that appears in every document (like "the" or "data") is useless for distinguishing between them. But a word that appears in only a few documents (like "backpropagation" or "transformer") is highly distinctive.

This leads us to document frequency - counting how many documents contain each term:

\text{df}(t) = |\{d \in D : t \in d\}|

where:

  • t: the term whose document frequency we're computing
  • D: the corpus (collection of all documents)
  • d \in D: a document in the corpus
  • t \in d: indicates that term t appears at least once in document d
  • |\cdot|: denotes the count (cardinality) of the set

In plain terms, df(t) counts how many documents in the corpus contain the term t at least once.

The key insight is that we want to reward rarity, not frequency. Rare words should score high, common words should score low. We achieve this by inverting the relationship through the logarithm:

\text{idf}(t, D) = \log\left(\frac{N}{\text{df}(t)}\right)

where:

  • t: the term whose inverse document frequency we're computing
  • D: the corpus (collection of all documents)
  • N: the total number of documents in corpus D (i.e., N = |D|)
  • df(t): the document frequency of term t (how many documents contain t)
  • log: the natural logarithm (base e)

The fraction N/df(t) represents the inverse of the proportion of documents containing term t. Rare terms have small df(t), making this fraction large. Common terms have large df(t), making the fraction approach 1.

Why the logarithm? It serves two crucial purposes:

  • Inversion: Rare words (low document frequency) get high IDF scores, common words (high document frequency) get low scores
  • Scale compression: A word appearing in 1 of 1,000 documents doesn't get 100× the weight of a word appearing in 100 documents; with the logarithm, the ratio shrinks to roughly 3× (6.9 vs 2.3). The logarithmic scale ensures that differences in commonness are meaningful but not overwhelming.

Consider three words in a 1000-document corpus:

  • "the" appears in 1000 documents: IDF = log(1000/1000) = log(1) = 0
  • "neural" appears in 100 documents: IDF = log(1000/100) = log(10) ≈ 2.3
  • "backpropagation" appears in 1 document: IDF = log(1000/1) = log(1000) ≈ 6.9

The progression makes intuitive sense: "the" gets no boost, "neural" gets a moderate boost, and "backpropagation" gets a substantial boost.
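These numbers are easy to reproduce. A quick sketch with numpy, using the document frequencies from the example above (not real corpus statistics):

import numpy as np

N = 1000  # total documents in the corpus

# document frequencies from the example above
example_df = {"the": 1000, "neural": 100, "backpropagation": 1}

for term, df in example_df.items():
    idf = np.log(N / df)  # idf(t) = log(N / df(t)), natural log
    print(f"{term:16s} df={df:5d}  idf={idf:.2f}")
# the              df= 1000  idf=0.00
# neural           df=  100  idf=2.30
# backpropagation  df=    1  idf=6.91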

The logarithmic transformation creates a natural hierarchy that matches our intuition about word distinctiveness. To see this more clearly, let's visualize how IDF scores change as words become more or less common across a corpus.

Out[2]:
Visualization
Line plot showing IDF decreasing logarithmically as document frequency increases, with annotations highlighting how rare vs common terms receive different scores.
The IDF curve reveals how inverse document frequency creates a natural hierarchy of word distinctiveness. Rare terms receive exponentially higher scores, reflecting their greater discriminative power. Key points: unique terms (df=1) get IDF≈6.9, terms in 10% of docs get IDF≈2.3, and terms in all docs get IDF=0.

This curve illustrates why logarithmic scaling is essential. The log responds to ratios rather than raw differences: going from 10 documents to 1 (a 10× change) shifts IDF by about 2.3, while going from 1000 documents to 990 barely changes it at all - exactly the behavior we want.

Step 3: The Crucial Synthesis - Multiplying TF and IDF

Now we have two complementary perspectives on word importance:

  • Term frequency (TF) captures local prominence: how much a word matters within a single document.
  • Inverse document frequency (IDF) captures global distinctiveness: how rare the word is across the corpus.

The question is: how do we combine these signals? The answer reveals the core insight of TF-IDF.

Consider what happens with different combinations of TF and IDF:

  • High TF, Low IDF (like "the" appearing 50 times): This word dominates locally but is ubiquitous globally. We want to heavily penalize it.
  • Low TF, High IDF (like "backpropagation" appearing once): This word is distinctive globally but not prominent locally. It shouldn't dominate the document's representation.
  • High TF, High IDF (like "neural" appearing 5 times in a neural networks paper): This word is both prominent locally and distinctive globally. This is exactly what we want to reward.

Multiplication naturally creates this behavior. When you multiply TF and IDF:

  • High TF × Low IDF = Low score (common words get downweighted despite their frequency)
  • Low TF × High IDF = Moderate score (rare words alone aren't enough)
  • High TF × High IDF = High score (the sweet spot we want to reward)

This multiplicative synthesis leads to the complete TF-IDF formula:

The TF-IDF Formula

TF-IDF combines term frequency and inverse document frequency through multiplication:

\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)

where:

  • t is a term (word)
  • d is a document
  • D is the corpus (collection of all documents)
  • tf(t, d) measures local prominence (how often t appears in d)
  • idf(t, D) measures global distinctiveness (how rare t is across D)

The multiplication ensures that only words that excel at both local prominence and global distinctiveness achieve high TF-IDF scores. This creates a representation where each document is characterized by its most unique and relevant terms.
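To make the formula concrete, plug in the running example: "neural" appears 5 times in a document and in 100 of the corpus's 1,000 documents, so

\text{tf-idf}(\text{neural}, d, D) = 5 \times \log\left(\frac{1000}{100}\right) \approx 5 \times 2.30 \approx 11.5

"The", with an IDF of 0, scores 0 no matter how many times it appears.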

Bringing the Theory to Life: Implementing TF-IDF

Now that we understand the mathematical foundation of TF-IDF, let's see how these concepts work in practice. We'll implement the formula step by step on a real corpus, watching how the theory translates into concrete results that reveal the distinctive character of each document.

Our implementation will mirror the conceptual journey we just took - from measuring local prominence (TF) to assessing global distinctiveness (IDF) to combining them through multiplication. This hands-on approach will help you see why each mathematical choice matters and how the formula solves the word importance problem.

Step 1: Setting Up Our Corpus and Basic Functions

Let's start with a small but diverse corpus that covers different areas of machine learning. This will help us see how TF-IDF distinguishes between documents on different topics.

In[3]:
Code
import re

# A diverse corpus covering different ML subfields
corpus = [
    "Machine learning algorithms learn patterns from data. Learning from data is powerful.",
    "Deep learning uses neural networks. Neural networks learn hierarchical representations.",
    "Natural language processing extracts meaning from text. Text processing is essential for NLP.",
    "Computer vision analyzes images. Image recognition uses deep learning techniques.",
    "Reinforcement learning agents learn through rewards. Learning optimal policies is challenging.",
]


def tokenize(text):
    """Simple tokenization: lowercase and extract words."""
    return re.findall(r"\b[a-z]+\b", text.lower())


# Tokenize all documents
tokenized_docs = [tokenize(doc) for doc in corpus]

Now let's implement the core functions that capture our two fundamental signals: local prominence (TF) and global distinctiveness (IDF).

In[4]:
Code
import numpy as np
from collections import Counter


def compute_tf(doc_tokens):
    """Measure local prominence: how often each term appears in this document."""
    return Counter(doc_tokens)


def compute_df(tokenized_corpus):
    """Measure global distinctiveness: how many documents contain each term."""
    df = Counter()
    for doc in tokenized_corpus:
        unique_terms = set(
            doc
        )  # Each document contributes 1 to terms it contains
        df.update(unique_terms)
    return df


def compute_idf(df, num_docs):
    """Convert document frequency to inverse document frequency using the log formula."""
    idf = {}
    for term, doc_freq in df.items():
        idf[term] = np.log(num_docs / doc_freq)
    return idf


# Compute our two fundamental signals for the entire corpus
tf_docs = [
    compute_tf(doc) for doc in tokenized_docs
]  # Local prominence per document
df = compute_df(tokenized_docs)  # Global distinctiveness across corpus
idf = compute_idf(df, len(corpus))  # Convert to IDF scores

Step 2: Examining Global Distinctiveness Through IDF

Let's examine how IDF scores reflect each term's discriminative power across our corpus. We'll look at terms that span the full spectrum from ubiquitous to unique.

Out[5]:
Console
Global Distinctiveness: Document Frequency vs IDF Scores
=================================================================
Term               Documents       IDF Score
-----------------------------------------------------------------
learning                   4          0.2231
deep                       2          0.9163
neural                     1          1.6094
text                       1          1.6094
images                     1          1.6094
rewards                    1          1.6094

The Spectrum of Distinctiveness

The table reveals how IDF creates a natural hierarchy of word importance based on global rarity:

  • Ubiquitous terms like "learning" (appears in 4/5 documents, IDF = 0.22) get minimal weight. These words appear everywhere and provide no discriminative signal.
  • Moderately distinctive terms like "deep" (2/5 documents, IDF = 0.92) receive moderate amplification. They're informative but not unique.
  • Highly distinctive terms like "neural", "images", and "rewards" (1/5 documents, IDF = 1.61) get the strongest boost. Finding these words in a document immediately tells you something specific about its content and domain.

This solves our original problem: IDF automatically identifies which words distinguish documents from each other, suppressing generic terms while amplifying domain-specific ones.

Step 3: The Synthesis - Computing TF-IDF Scores

Now comes the crucial moment: combining our two signals through multiplication. This is where the theory becomes practice, where local prominence meets global distinctiveness to create a truly informative document representation.

In[6]:
Code
def compute_tfidf(tf, idf):
    """Synthesize TF and IDF: multiply local prominence by global distinctiveness."""
    tfidf = {}
    for term, freq in tf.items():
        tfidf[term] = freq * idf.get(
            term, 0
        )  # Multiplication is the key insight
    return tfidf


# Apply the synthesis to every document in our corpus
tfidf_docs = [compute_tfidf(tf, idf) for tf in tf_docs]

Step 4: Seeing the Results - How TF-IDF Transforms Document Representation

Let's examine how TF-IDF transforms Document 1 (about machine learning algorithms). We'll see the raw components alongside the final scores to understand how the multiplication creates meaningful rankings.

Out[7]:
Console
TF-IDF Synthesis: From Components to Meaningful Scores
======================================================================
Term                TF        IDF       TF×IDF
----------------------------------------------------------------------
data                 2     1.6094       3.2189
from                 2     0.9163       1.8326
machine              1     1.6094       1.6094
algorithms           1     1.6094       1.6094
patterns             1     1.6094       1.6094
powerful             1     1.6094       1.6094
learn                1     0.5108       0.5108
is                   1     0.5108       0.5108
learning             2     0.2231       0.4463

The Transformation in Action

The TF-IDF scores reveal how multiplication balances local and global signals:

  • "Learning" (TF=3, IDF=0.22, TF×IDF=0.67): Despite being the most frequent word in Document 1, its ubiquity across the corpus suppresses its score. This prevents common words from dominating representations.

  • "Data" (TF=2, IDF=0.51, TF×IDF=1.02): A nice balance - moderately frequent locally and moderately distinctive globally. The multiplication finds this sweet spot.

  • "Algorithms" and "Powerful" (TF=1, IDF=1.61, TF×IDF=1.61): These terms appear only once but are unique to this document. Their global rarity compensates for their local scarcity, earning them top scores.

The result is a representation that captures what makes Document 1 distinctive: not just what words it contains, but which words make it different from other documents. TF-IDF has transformed raw frequency counts into meaningful importance scores that reflect both local relevance and global informativeness.

Step 5: Visualizing the Balance - How TF and IDF Work Together

To truly understand how TF-IDF creates meaningful document representations, let's visualize the interplay between local prominence and global distinctiveness. The visualization below decomposes the TF-IDF scores for Document 1, showing how the multiplication of TF and IDF produces rankings that reflect true importance rather than raw frequency.

Out[8]:
Visualization
Term Frequency (TF) measures local prominence, showing how often each term appears in the document. 'Learning' ties for the highest count with TF=2.
Inverse Document Frequency (IDF) measures global rarity. Unique terms like 'algorithms' get higher scores than common terms like 'learning'.
TF×IDF combines both signals. Rare terms like 'data' and 'algorithms' outrank the equally frequent but ubiquitous 'learning'.

The Complete Picture: From Intuition to Implementation

Our journey through TF-IDF implementation has revealed how the formula transforms raw text into meaningful representations. We started with a fundamental insight - that word importance depends on both local prominence and global distinctiveness - and built a complete system that operationalizes this intuition.

The visualization above captures the essence of TF-IDF: it's not about counting words, but about understanding their significance. Terms like "learning" contribute less than their frequency suggests because they're commonplace. Terms like "algorithms" contribute more than their frequency suggests because they're distinctive.

This multiplicative synthesis creates document representations that truly capture what makes each text unique. TF-IDF doesn't just count words - it measures their discriminative power, creating a foundation for all the text analysis techniques that follow.

TF-IDF Variants

The basic TF-IDF formula we've developed works well, but practitioners have discovered that certain modifications can improve performance for specific tasks. These variants address subtle issues with the basic formulation, issues that become apparent when you think carefully about what the formula is measuring.

Log-Scaled TF: Taming Extreme Frequencies

Consider a document where "machine" appears 20 times and "learning" appears twice. Is "machine" really 10× more important to this document? Probably not. The relationship between word frequency and importance isn't linear. The first few occurrences establish a word's relevance, but additional occurrences provide diminishing returns.

Log-scaling addresses this proportionality problem by applying a logarithmic transformation:

\text{tf}_{\log}(t, d) = \begin{cases} 1 + \log(\text{tf}(t, d)) & \text{if } \text{tf}(t, d) > 0 \\ 0 & \text{otherwise} \end{cases}

where:

  • t: the term whose log-scaled frequency we're computing
  • d: the document being analyzed
  • tf(t, d): the raw term frequency (count of t in d)
  • log: the natural logarithm (base e)

The "+1" ensures that a term appearing once gets a score of 1 rather than 0 (since log(1)=0\log(1) = 0). This preserves the distinction between presence and absence while compressing differences among frequently occurring terms.

In[9]:
Code
def compute_log_tf(doc_tokens):
    """Compute log-scaled term frequency."""
    raw_tf = Counter(doc_tokens)
    log_tf = {}
    for term, count in raw_tf.items():
        log_tf[term] = 1 + np.log(count) if count > 0 else 0
    return log_tf


def compute_tfidf_log(doc_tokens, idf):
    """Compute TF-IDF with log-scaled TF."""
    log_tf = compute_log_tf(doc_tokens)
    tfidf = {}
    for term, tf_val in log_tf.items():
        tfidf[term] = tf_val * idf.get(term, 0)
    return tfidf


# Compare raw vs log TF-IDF
tfidf_raw = compute_tfidf(tf_docs[0], idf)
tfidf_log = compute_tfidf_log(tokenized_docs[0], idf)
Out[10]:
Console
Raw TF-IDF vs Log TF-IDF (Document 1):
-----------------------------------------------------------------
Term              Raw TF     Log TF    Raw TFIDF    Log TFIDF
-----------------------------------------------------------------
learning               2       1.69       0.4463       0.3778
data                   2       1.69       3.2189       2.7250
from                   2       1.69       1.8326       1.5514
algorithms             1       1.00       1.6094       1.6094
patterns               1       1.00       1.6094       1.6094

Interpreting the Log-Scaled Results

Log-scaling compresses the TF component, reducing the dominance of high-frequency terms. "Learning" with TF=2 gets a log TF of about 1.69 rather than 2, and the compression grows much stronger for very frequent terms (a count of 10 maps to just 3.3). This is often preferable for document similarity calculations, where you want to recognize that a document mentioning "neural" twice is similar to one mentioning it five times.

Out[11]:
Visualization
Line plot comparing linear raw TF growth with logarithmic log TF compression as term frequency increases.
Log-scaling compresses term frequency, reducing the impact of very frequent terms. Raw TF grows linearly (blue), while log TF (red) shows diminishing returns. At TF=10, raw=10 but log=3.3 (3× compression). At TF=50, raw=50 but log=4.9 (10× compression).
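The figures in the caption are easy to verify directly. A quick check, assuming numpy is imported as np as in the earlier cells:

for tf in [1, 2, 10, 50]:
    log_tf = 1 + np.log(tf)  # log-scaled term frequency
    print(f"raw TF={tf:3d}  log TF={log_tf:.2f}  compression={tf / log_tf:.1f}x")
# raw TF=  1  log TF=1.00  compression=1.0x
# raw TF=  2  log TF=1.69  compression=1.2x
# raw TF= 10  log TF=3.30  compression=3.0x
# raw TF= 50  log TF=4.91  compression=10.2x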

Smoothed IDF: Handling Edge Cases

The basic IDF formula has a subtle problem: what happens when a term appears in every document? The formula gives:

\text{idf}(t) = \log\left(\frac{N}{N}\right) = \log(1) = 0

A term appearing everywhere gets IDF of zero, which means its TF-IDF is also zero. It contributes nothing to the document representation. While this makes sense for truly universal terms like "the", it can be problematic when you want even common terms to contribute something.

Smoothed IDF variants address this by adding constants to both the numerator and denominator:

\text{idf}_{\text{smooth}}(t, D) = \log\left(\frac{N + 1}{\text{df}(t) + 1}\right) + 1

where:

  • t: the term whose smoothed IDF we're computing
  • D: the corpus
  • N: the total number of documents in the corpus
  • df(t): the document frequency of term t

The smoothing modifications serve specific purposes:

  • "+1" in the numerator: Ensures the fraction remains well-defined even for edge cases
  • "+1" in the denominator: Prevents division by zero for unseen terms (where df(t)=0\text{df}(t) = 0)
  • "+1" added at the end: Shifts all IDF values up, ensuring even terms appearing in every document have positive IDF

This ensures all terms have positive IDF, which is important when you want common words to contribute something rather than nothing.

In[12]:
Code
def compute_smoothed_idf(df, num_docs):
    """Compute smoothed IDF (sklearn default)."""
    idf_smooth = {}
    for term, doc_freq in df.items():
        idf_smooth[term] = np.log((num_docs + 1) / (doc_freq + 1)) + 1
    return idf_smooth


idf_smooth = compute_smoothed_idf(df, len(corpus))
Out[13]:
Console
Standard IDF vs Smoothed IDF:
--------------------------------------------------
Term              Doc Freq          IDF     Smoothed
--------------------------------------------------
learning                 4       0.2231       1.1823
deep                     2       0.9163       1.6931
neural                   1       1.6094       2.0986
text                     1       1.6094       2.0986
images                   1       1.6094       2.0986

Why Smoothing Matters

The smoothed version adds a constant offset, ensuring even the most common terms retain some weight. The "+1" in the numerator and denominator prevents division issues with unseen terms, while the final "+1" shifts all IDF values up. This is the default in scikit-learn's TfidfVectorizer, making it important to understand when comparing implementations.
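The edge case is easiest to see with a hypothetical term that appears in every one of our 5 documents. A small sketch using the two formulas above:

N = len(corpus)       # 5 documents
df_everywhere = N     # a hypothetical term appearing in every document

standard_idf = np.log(N / df_everywhere)                  # log(1) = 0
smoothed_idf = np.log((N + 1) / (df_everywhere + 1)) + 1  # log(1) + 1 = 1

print(standard_idf, smoothed_idf)  # 0.0 1.0

Under the standard formula the term vanishes from every TF-IDF vector; under the smoothed formula it still contributes, just with the smallest possible weight.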

Common TF-IDF Schemes: A Notation System

The information retrieval community developed a compact notation for describing TF-IDF variants. Each scheme is specified by three letters indicating the TF variant, IDF variant, and normalization method. For example, ltc means: log TF, standard IDF, cosine normalization.

Here are the most common schemes and when to use them:

Common TF-IDF weighting schemes. Each scheme is specified by three letters indicating the TF variant, IDF variant, and normalization method.

Scheme    TF        IDF         Normalization    Use Case
----------------------------------------------------------------------
nnn       Raw       None        None             Baseline, raw counts
ntc       Raw       Standard    Cosine           Basic TF-IDF
ltc       Log       Standard    Cosine           Balanced weighting
lnc       Log       None        Cosine           TF only, normalized
bnn       Binary    None        None             Presence/absence
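As a sketch of how a scheme label translates into code, here is ltc assembled from pieces already defined in this chapter: log TF and standard IDF via compute_tfidf_log, followed by cosine (L2) normalization of the resulting weights.

def tfidf_ltc(doc_tokens, idf):
    """Approximate the 'ltc' scheme: log TF, standard IDF, cosine normalization."""
    weights = compute_tfidf_log(doc_tokens, idf)  # "l" (log TF) and "t" (standard IDF)
    norm = np.sqrt(sum(w ** 2 for w in weights.values()))  # "c": cosine normalization
    if norm == 0:
        return weights
    return {term: w / norm for term, w in weights.items()}


ltc_doc1 = tfidf_ltc(tokenized_docs[0], idf)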
Out[14]:
Visualization
Raw TF-IDF (ntc) emphasizes high-frequency terms, letting raw counts carry through directly to the final scores.
Log TF-IDF (ltc) compresses the range. The high-frequency advantage is reduced, creating more balanced rankings.
Binary TF-IDF (btc) treats all present terms equally. Rare terms gain importance based purely on IDF.

TF-IDF Vector Computation

To use TF-IDF for machine learning, we need to convert documents into fixed-length vectors. Each dimension corresponds to a vocabulary term, and the value is that term's TF-IDF score.

In[15]:
Code
def build_vocabulary(tokenized_corpus):
    """Build sorted vocabulary from corpus."""
    all_terms = set()
    for doc in tokenized_corpus:
        all_terms.update(doc)
    return sorted(all_terms)


def document_to_tfidf_vector(doc_tokens, vocabulary, idf):
    """Convert document to TF-IDF vector."""
    tf = compute_tf(doc_tokens)
    vector = np.zeros(len(vocabulary))
    for i, term in enumerate(vocabulary):
        if term in tf:
            vector[i] = tf[term] * idf.get(term, 0)
    return vector


# Build vocabulary and compute TF-IDF vectors
vocabulary = build_vocabulary(tokenized_docs)
tfidf_vectors = np.array(
    [document_to_tfidf_vector(doc, vocabulary, idf) for doc in tokenized_docs]
)
Out[16]:
Console
TF-IDF Matrix Shape: (5, 38)
  5 documents × 38 vocabulary terms

Sample TF-IDF values (Document 1, first 10 terms):
--------------------------------------------------
  algorithms      1.6094
  data            3.2189

Sparsity in TF-IDF Matrices

Like raw count matrices, TF-IDF matrices are extremely sparse. Most documents use only a small fraction of the vocabulary.

In[17]:
Code
# Analyze sparsity
total_elements = tfidf_vectors.size
nonzero_elements = np.count_nonzero(tfidf_vectors)
sparsity = (total_elements - nonzero_elements) / total_elements
Out[18]:
Console
TF-IDF Matrix Sparsity:
----------------------------------------
Total elements: 190
Non-zero elements: 48
Sparsity: 74.7%

Average non-zero terms per document: 9.6
Out[19]:
Visualization
Heatmap of TF-IDF matrix with documents as rows and vocabulary terms as columns.
TF-IDF matrix visualization for our sample corpus. Each row is a document, each column a vocabulary term. The sparse pattern shows that each document uses only a subset of the vocabulary. Higher values (darker cells) indicate terms that are both frequent in the document and rare across the corpus.

TF-IDF Normalization

Raw TF-IDF vectors have varying lengths depending on document size and vocabulary overlap. For similarity calculations, we typically normalize vectors so that document length doesn't dominate.

L2 Normalization

L2 normalization (also called Euclidean normalization) divides each vector by its Euclidean length, projecting all documents onto the unit sphere:

\mathbf{v}_{\text{norm}} = \frac{\mathbf{v}}{\|\mathbf{v}\|_2} = \frac{\mathbf{v}}{\sqrt{\sum_i v_i^2}}

where:

  • v: the original TF-IDF vector with components (v_1, v_2, ..., v_n)
  • v_norm: the normalized vector
  • ‖v‖_2: the L2 norm (Euclidean length) of vector v
  • Σ_i v_i²: the sum of squared components, which equals the squared length
  • i: index over all vocabulary terms (dimensions of the vector)

The result is a unit vector: a vector with length exactly 1.0. After L2 normalization, cosine similarity becomes a simple dot product.

In[20]:
Code
def l2_normalize(vectors):
    """L2 normalize each row vector."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Avoid division by zero
    norms[norms == 0] = 1
    return vectors / norms


tfidf_normalized = l2_normalize(tfidf_vectors)

# Verify normalization
norms_before = np.linalg.norm(tfidf_vectors, axis=1)
norms_after = np.linalg.norm(tfidf_normalized, axis=1)
Out[21]:
Console
Vector Norms Before and After L2 Normalization:
--------------------------------------------------
Document              Before           After
--------------------------------------------------
Doc 1                 4.9801          1.0000
Doc 2                 5.2814          1.0000
Doc 3                 6.3210          1.0000
Doc 4                 4.4566          1.0000
Doc 5                 4.3420          1.0000

All normalized vectors now have unit length (1.0), making them directly comparable regardless of original document length.

L1 Normalization

L1 normalization (also called Manhattan normalization) divides by the sum of absolute values, making each vector's components sum to 1:

\mathbf{v}_{\text{L1}} = \frac{\mathbf{v}}{\|\mathbf{v}\|_1} = \frac{\mathbf{v}}{\sum_i |v_i|}

where:

  • v: the original TF-IDF vector with components (v_1, v_2, ..., v_n)
  • v_L1: the L1-normalized vector
  • ‖v‖_1: the L1 norm (sum of absolute values)
  • |v_i|: the absolute value of the i-th component
  • Σ_i |v_i|: the total "mass" of the vector

After L1 normalization, all components sum to 1, creating a probability-like distribution over terms.

In[22]:
Code
def l1_normalize(vectors):
    """L1 normalize each row vector."""
    sums = np.sum(np.abs(vectors), axis=1, keepdims=True)
    sums[sums == 0] = 1
    return vectors / sums


tfidf_l1 = l1_normalize(tfidf_vectors)
Out[23]:
Console
L1 Normalized TF-IDF (Document 1, top terms):
---------------------------------------------
Sum of all values: 1.0000

  data            0.2484 (24.8%)
  from            0.1414 (14.1%)
  machine         0.1242 (12.4%)
  patterns        0.1242 (12.4%)
  powerful        0.1242 (12.4%)
  algorithms      0.1242 (12.4%)
  is              0.0394 (3.9%)
  learn           0.0394 (3.9%)

L1 normalization is useful when you want to interpret TF-IDF scores as term "importance proportions" within a document.

Document Similarity with TF-IDF

TF-IDF's primary application is measuring document similarity. Documents with similar TF-IDF vectors discuss similar topics using similar vocabulary.

Cosine Similarity

Cosine similarity measures the angle between two vectors, producing a score that ranges from 0 (orthogonal, completely dissimilar) to 1 (identical direction, maximally similar):

\text{cosine}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|_2 \, \|\mathbf{b}\|_2}

where:

  • a: the first document's TF-IDF vector
  • b: the second document's TF-IDF vector
  • a · b: the dot product, computed as Σ_i a_i b_i
  • ‖a‖_2: the L2 norm (Euclidean length) of vector a
  • ‖b‖_2: the L2 norm of vector b

The numerator captures how much the vectors "point in the same direction" (shared vocabulary with similar weights), while the denominator normalizes for vector length. For L2-normalized vectors (where ‖a‖_2 = ‖b‖_2 = 1), this simplifies to a dot product: cosine(a, b) = a · b.

In[24]:
Code
def cosine_similarity_matrix(vectors):
    """Compute pairwise cosine similarity."""
    # For normalized vectors, cosine = dot product
    normalized = l2_normalize(vectors)
    return np.dot(normalized, normalized.T)


similarity = cosine_similarity_matrix(tfidf_vectors)
Out[25]:
Console
Document Similarity Matrix (Cosine Similarity):
-------------------------------------------------------
             Doc 1     Doc 2     Doc 3     Doc 4     Doc 5
-------------------------------------------------------
Doc   1     1.000     0.014     0.062     0.004     0.033
Doc   2     0.014     1.000     0.000     0.073     0.016
Doc   3     0.062     0.000     1.000     0.000     0.010
Doc   4     0.004     0.073     0.000     1.000     0.005
Doc   5     0.033     0.016     0.010     0.005     1.000

The off-diagonal similarities are low because the documents share little distinctive vocabulary. Documents 2 and 4 show the highest similarity (0.073) because both use "deep learning" terminology, followed by Documents 1 and 3 (0.062), which share only generic words like "from" and "is". The diagonal shows perfect self-similarity (1.0).

Out[26]:
Visualization
Heatmap showing pairwise cosine similarities between five documents.
Cosine similarity matrix computed from TF-IDF vectors. Higher values indicate more similar vocabulary usage. The diagonal shows perfect self-similarity (1.0). Documents discussing related topics (like Documents 2 and 4, both about deep learning) show higher similarity scores.

Finding Similar Documents

Given a query document, we can rank all corpus documents by similarity:

In[27]:
Code
def find_similar_documents(query_idx, similarity_matrix, top_k=3):
    """Find most similar documents to a query document."""
    similarities = similarity_matrix[query_idx]
    # Exclude the query document itself
    ranked_indices = np.argsort(similarities)[::-1]
    ranked_indices = [i for i in ranked_indices if i != query_idx]
    return ranked_indices[:top_k], similarities[ranked_indices[:top_k]]


# Find documents similar to Document 2 (Deep Learning)
query_idx = 1
similar_docs, scores = find_similar_documents(query_idx, similarity)
Out[28]:
Console
Query: Document 2
  'Deep learning uses neural networks. Neural networks learn hi...'

Most Similar Documents:
------------------------------------------------------------
  Doc 4 (similarity: 0.073)
    'Computer vision analyzes images. Image recognition uses...'

  Doc 5 (similarity: 0.016)
    'Reinforcement learning agents learn through rewards. Le...'

  Doc 1 (similarity: 0.014)
    'Machine learning algorithms learn patterns from data. L...'

Document 4 (Computer Vision) ranks highest because it shares "deep learning" vocabulary with Document 2. Documents 5 and 1 follow with much lower scores, connected mainly through the shared "learning" terminology.

Visualizing Document Similarity in 2D

While the similarity matrix shows pairwise relationships, we can also visualize how documents cluster in a 2D space. Using PCA to reduce our high-dimensional TF-IDF vectors to 2 dimensions reveals the underlying structure:

Out[29]:
Visualization
Scatter plot showing 5 documents positioned in 2D space based on TF-IDF similarity, with connecting lines showing relationships.
TF-IDF vectors projected to 2D using PCA. Documents discussing similar topics cluster together. The 'learning' documents (1, 2, 4, 5) form a loose cluster, while Document 3 (NLP/text processing) sits apart. Lines connect documents with similarity > 0.15, with thickness proportional to similarity strength.

The 2D projection reveals document relationships at a glance. Documents sharing vocabulary (like the "learning"-related documents) appear closer together, while Document 3 (focused on text/NLP) sits farther from the others due to its distinct vocabulary.
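The projection itself isn't shown in code above; a minimal sketch of how it could be produced with scikit-learn's PCA, assuming the tfidf_normalized matrix computed earlier:

from sklearn.decomposition import PCA

# Reduce the L2-normalized TF-IDF vectors (5 documents x 38 terms) to 2 dimensions
pca = PCA(n_components=2)
coords = pca.fit_transform(tfidf_normalized)

for i, (x, y) in enumerate(coords, start=1):
    print(f"Doc {i}: ({x:+.2f}, {y:+.2f})")
print(f"Variance explained by 2 components: {pca.explained_variance_ratio_.sum():.1%}")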

TF-IDF for Feature Extraction

Beyond document similarity, TF-IDF vectors serve as features for machine learning models. Text classification, clustering, and information retrieval all benefit from TF-IDF representations.

Text Classification Example

Let's use TF-IDF features for a simple classification task:

In[30]:
Code
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Extended corpus with labels
labeled_corpus = [
    ("Machine learning models predict outcomes from data.", "ml"),
    ("Deep neural networks learn complex patterns.", "ml"),
    ("Gradient descent optimizes model parameters.", "ml"),
    ("Support vector machines classify data points.", "ml"),
    ("Shakespeare wrote many famous plays.", "literature"),
    ("Poetry expresses emotions through verse.", "literature"),
    ("Novels tell stories through narrative prose.", "literature"),
    ("Drama unfolds through dialogue and action.", "literature"),
]

texts = [text for text, label in labeled_corpus]
labels = [label for text, label in labeled_corpus]

# Create TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train classifier
clf = MultinomialNB()
scores = cross_val_score(clf, X, labels, cv=4)
Out[31]:
Console
Text Classification with TF-IDF Features:
--------------------------------------------------
Number of samples: 8
Number of features: 43
Cross-validation accuracy: 75.0% (±25.0%)

Most Discriminative Terms:
  literature: through, emotions, expresses, verse, poetry
  ml: data, parameters, model, optimizes, descent

TF-IDF features capture the distinctive vocabulary of each class. ML documents contain terms like "data", "model", and "parameters"; literature documents contain "poetry", "verse", and "emotions".

Feature Selection with TF-IDF

High TF-IDF scores identify distinctive terms that can serve as features:

In[32]:
Code
def extract_top_tfidf_terms(tfidf_matrix, feature_names, doc_idx, top_k=10):
    """Extract terms with highest TF-IDF scores for a document."""
    scores = tfidf_matrix[doc_idx].toarray().flatten()
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [(feature_names[i], scores[i]) for i in top_indices if scores[i] > 0]


# Use our original corpus
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out()
Out[33]:
Console
Top TF-IDF Terms by Document:
============================================================

Document 1: Machine learning algorithms learn patterns from da...
  data            0.5597
  from            0.4515
  learning        0.3153
  machine         0.2798
  patterns        0.2798

Document 2: Deep learning uses neural networks. Neural network...
  networks        0.5757
  neural          0.5757
  representations 0.2879
  hierarchical    0.2879
  uses            0.2322

Document 3: Natural language processing extracts meaning from ...
  text            0.4985
  processing      0.4985
  essential       0.2492
  language        0.2492
  meaning         0.2492

Document 4: Computer vision analyzes images. Image recognition...
  vision          0.3406
  analyzes        0.3406
  techniques      0.3406
  images          0.3406
  computer        0.3406

Document 5: Reinforcement learning agents learn through reward...
  learning        0.3722
  agents          0.3303
  challenging     0.3303
  optimal         0.3303
  reinforcement   0.3303

Each document's top TF-IDF terms capture its distinctive content. Document 2's top terms include "neural" and "networks"; Document 3's include "text" and "processing".

sklearn TfidfVectorizer Deep Dive

scikit-learn's TfidfVectorizer is the standard tool for TF-IDF computation. Understanding its parameters helps you tune it for your specific use case.

Basic Usage

In[34]:
Code
from sklearn.feature_extraction.text import TfidfVectorizer

# Default settings
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
Out[35]:
Console
TfidfVectorizer Default Output:
--------------------------------------------------
Matrix shape: (5, 38)
Matrix type: <class 'scipy.sparse._csr.csr_matrix'>
Vocabulary size: 38
Non-zero elements: 48
Sparsity: 74.7%

Key Parameters

TfidfVectorizer combines tokenization, TF-IDF computation, and normalization. Here are the most important parameters:

In[36]:
Code
# Explore different configurations
configs = {
    "default": TfidfVectorizer(),
    "no_idf": TfidfVectorizer(use_idf=False),  # TF only
    "sublinear_tf": TfidfVectorizer(sublinear_tf=True),  # Log TF
    "no_norm": TfidfVectorizer(norm=None),  # No normalization
    "binary": TfidfVectorizer(binary=True),  # Binary TF
    "bigrams": TfidfVectorizer(ngram_range=(1, 2)),  # Include bigrams
}

results = {}
for name, vec in configs.items():
    matrix = vec.fit_transform(corpus)
    results[name] = {
        "vocab_size": len(vec.vocabulary_),
        "nnz": matrix.nnz,
        "mean_norm": np.mean(np.linalg.norm(matrix.toarray(), axis=1)),
    }
Out[37]:
Console
TfidfVectorizer Configuration Comparison:
=================================================================
Config            Vocab Size    Non-zeros    Mean Norm
-----------------------------------------------------------------
default                   38           48       1.0000
no_idf                    38           48       1.0000
sublinear_tf              38           48       1.0000
no_norm                   38           48       7.1451
binary                    38           48       1.0000
bigrams                   86           97       1.0000

Key observations:

  • no_idf: Without IDF, all terms are weighted by frequency alone
  • sublinear_tf: Log-scaling compresses TF values
  • no_norm: Without normalization, vector norms vary by document length
  • bigrams: Including bigrams dramatically increases vocabulary size
Out[38]:
Visualization
Default TF-IDF with IDF weighting. Documents share moderate similarity through common vocabulary.
Without IDF (TF only), common words dominate and all documents appear more similar.
With bigrams, phrase-level patterns like 'deep learning' change which documents appear most related.

The visualization reveals how configuration choices affect similarity calculations. Without IDF, documents appear more similar because common words aren't downweighted. Bigrams can capture different relationships by considering word pairs like "deep learning" or "neural networks" as single features.

The IDF Formula in sklearn

scikit-learn uses a smoothed IDF formula by default, which differs slightly from the standard textbook formula:

\text{idf}(t) = \log\left(\frac{N + 1}{\text{df}(t) + 1}\right) + 1

where:

  • t: the term whose IDF we're computing
  • N: the total number of documents in the corpus
  • df(t): the document frequency of term t (number of documents containing t)

This smoothed version ensures that even terms appearing in every document receive a positive IDF score (rather than zero), and prevents division-by-zero errors for terms not seen during training.

In[39]:
Code
# Access the learned IDF values
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
idf_values = vectorizer.idf_
feature_names = vectorizer.get_feature_names_out()

# Create term -> IDF mapping
term_idf = dict(zip(feature_names, idf_values))
Out[40]:
Console
sklearn IDF Values (smoothed formula):
---------------------------------------------
Term              Doc Freq          IDF
---------------------------------------------
learning                 4       1.1823
is                       3       1.4055
learn                    3       1.4055
deep                     2       1.6931
from                     2       1.6931
uses                     2       1.6931
agents                   1       2.0986
algorithms               1       2.0986
analyzes                 1       2.0986
challenging              1       2.0986

Practical Configuration Patterns

Different tasks call for different configurations:

In[41]:
Code
# For document similarity
similarity_vectorizer = TfidfVectorizer(
    lowercase=True,
    norm="l2",  # L2 normalization for cosine similarity
    use_idf=True,
    sublinear_tf=True,  # Log-scaled TF
    min_df=2,  # Ignore rare terms
    max_df=0.95,  # Ignore very common terms
)

# For text classification
classification_vectorizer = TfidfVectorizer(
    lowercase=True,
    norm="l2",
    use_idf=True,
    ngram_range=(1, 2),  # Include bigrams
    max_features=10000,  # Limit vocabulary
    min_df=2,
)

# For keyword extraction
keyword_vectorizer = TfidfVectorizer(
    lowercase=True,
    norm=None,  # No normalization for interpretable scores
    use_idf=True,
    sublinear_tf=False,  # Raw TF for clearer interpretation
)

Each configuration targets a specific use case:

  • Document Similarity: L2 normalization enables direct cosine similarity computation, while log-scaled TF prevents high-frequency terms from dominating. Filtering via min_df and max_df removes noise from rare typos and ubiquitous stopwords.
  • Text Classification: Bigrams capture phrase-level patterns that unigrams miss (like "not good" vs "good"). Limiting vocabulary with max_features prevents overfitting and speeds up training.
  • Keyword Extraction: Disabling normalization preserves interpretable scores. The highest values directly indicate the most distinctive terms without length adjustment.

Building a Document Search System

Let's build a complete document search system using TF-IDF:

In[42]:
Code
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Larger corpus for search
search_corpus = [
    "Python is a popular programming language for data science and machine learning.",
    "JavaScript powers interactive web applications and runs in browsers.",
    "Machine learning algorithms learn patterns from training data.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing analyzes and generates human text.",
    "Computer vision enables machines to interpret visual information.",
    "Data science combines statistics, programming, and domain expertise.",
    "Web development involves creating websites and web applications.",
    "Neural networks are inspired by biological brain structures.",
    "Text classification assigns categories to documents automatically.",
]

# Build search index
search_vectorizer = TfidfVectorizer(
    lowercase=True,
    norm="l2",
    sublinear_tf=True,
)
search_index = search_vectorizer.fit_transform(search_corpus)


def search(query, vectorizer, index, corpus, top_k=3):
    """Search corpus for documents matching query."""
    # Transform query to TF-IDF vector
    query_vector = vectorizer.transform([query])

    # Compute similarities
    similarities = cosine_similarity(query_vector, index).flatten()

    # Rank results
    ranked_indices = np.argsort(similarities)[::-1][:top_k]

    results = []
    for idx in ranked_indices:
        if similarities[idx] > 0:
            results.append(
                {"doc_id": idx, "score": similarities[idx], "text": corpus[idx]}
            )
    return results
Out[43]:
Console
Document Search Results:
======================================================================

Query: 'machine learning algorithms'
----------------------------------------------------------------------
  1. [Score: 0.577] Machine learning algorithms learn patterns from training dat...
  2. [Score: 0.293] Python is a popular programming language for data science an...
  3. [Score: 0.139] Deep learning uses neural networks with many layers....

Query: 'web development JavaScript'
----------------------------------------------------------------------
  1. [Score: 0.504] Web development involves creating websites and web applicati...
  2. [Score: 0.374] JavaScript powers interactive web applications and runs in b...

Query: 'neural networks deep learning'
----------------------------------------------------------------------
  1. [Score: 0.655] Deep learning uses neural networks with many layers....
  2. [Score: 0.306] Neural networks are inspired by biological brain structures....
  3. [Score: 0.122] Machine learning algorithms learn patterns from training dat...

The search system finds relevant documents by matching query terms against the TF-IDF index. "Machine learning algorithms" matches documents about ML and data science. "Web development JavaScript" finds the JavaScript and web development documents.

Out[44]:
Visualization
Query: 'machine learning algorithms'. ML-related documents score highest, with Document 3 (ML algorithms) and Document 1 (Python/data science) leading.
Query: 'web development JavaScript'. Web-focused documents score highest, with Document 2 (JavaScript) and Document 8 (web development) matching best.
Query: 'neural networks deep learning'. Deep learning documents score highest, with Document 4 and Document 9 (neural networks) leading.

BM25: TF-IDF's Successor

TF-IDF has a limitation: it doesn't handle document length well. A long document naturally has more term occurrences, potentially inflating its relevance scores. BM25 (Best Matching 25) extends TF-IDF with length normalization and saturation.

BM25

BM25 is a ranking function that extends TF-IDF with two key improvements:

  1. Term frequency saturation: Additional occurrences contribute diminishing returns
  2. Document length normalization: Longer documents are penalized

The formula is:

\text{BM25}(t, d, D) = \text{IDF}(t) \times \frac{\text{tf}(t, d) \times (k_1 + 1)}{\text{tf}(t, d) + k_1 \times \left(1 - b + b \times \frac{|d|}{\text{avgdl}}\right)}

where:

  • t: the query term being scored
  • d: the document being evaluated
  • D: the corpus
  • tf(t, d): the raw term frequency of t in document d
  • IDF(t): the inverse document frequency of term t
  • k_1: saturation parameter controlling how quickly TF saturates (typically 1.2-2.0; higher values mean TF matters more)
  • b: length normalization parameter (typically 0.75; 0 = no normalization, 1 = full normalization)
  • |d|: the length of document d (typically word count)
  • avgdl: the average document length across the corpus

The term (1 - b + b × |d|/avgdl) is the length normalization factor. When |d| = avgdl, this factor equals 1. Longer documents get larger factors (reducing their scores), while shorter documents get smaller factors (boosting their scores).

In[45]:
Code
def compute_bm25(tf, doc_length, avg_doc_length, idf, k1=1.5, b=0.75):
    """Compute BM25 score for a term in a document."""
    if tf == 0:
        return 0

    # Length normalization factor
    length_norm = 1 - b + b * (doc_length / avg_doc_length)

    # BM25 TF component with saturation
    tf_component = (tf * (k1 + 1)) / (tf + k1 * length_norm)

    return idf * tf_component


# Compare TF-IDF vs BM25 for varying TF values
doc_length = 100
avg_doc_length = 100
sample_idf = 1.5  # Moderately rare term

tf_values = range(1, 21)
tfidf_scores = [tf * sample_idf for tf in tf_values]
bm25_scores = [
    compute_bm25(tf, doc_length, avg_doc_length, sample_idf) for tf in tf_values
]
Out[46]:
Visualization
Line plot comparing TF-IDF linear growth with BM25 saturation curve as term frequency increases.
TF-IDF vs BM25 scoring as term frequency increases. TF-IDF grows linearly with term frequency, while BM25 saturates around TF=10, with diminishing returns for additional occurrences. This saturation prevents very frequent terms from dominating relevance scores.

BM25 Length Normalization

BM25's length normalization adjusts scores based on document length relative to the corpus average:

In[47]:
Code
# Compare BM25 scores for documents of different lengths
doc_lengths = [50, 100, 200, 400]
avg_doc_length = 100
fixed_tf = 5

bm25_by_length = [
    compute_bm25(fixed_tf, dl, avg_doc_length, sample_idf) for dl in doc_lengths
]
tfidf_fixed = fixed_tf * sample_idf  # TF-IDF doesn't consider length

Running this comparison makes the length normalization effect concrete: with TF=5 and IDF=1.5, BM25 scores roughly 3.16 for a 50-word document, 2.89 at 100 words, 2.46 at 200 words, and 1.90 at 400 words, while the plain TF-IDF score stays fixed at 7.5 regardless of length.

Shorter documents get higher BM25 scores for the same term frequency, reflecting the intuition that finding a term in a short document is more significant than finding it in a long one. A term appearing 5 times in a 50-word document is much more dominant than the same 5 occurrences in a 400-word document.

BM25 Parameter Sensitivity

The two key BM25 parameters, k_1 and b, control different aspects of the scoring:

Out[49]:
Visualization
k1 controls term frequency saturation. Higher k1 values allow TF to have more influence before saturating. The default k1=1.5 balances sensitivity and saturation.
b controls document length normalization. b=0 ignores length entirely, while b=1 applies full normalization. The default b=0.75 provides moderate length penalty.
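A quick sweep with the compute_bm25 helper defined earlier makes both effects concrete. The fixed TF=5, IDF=1.5, and document lengths below are illustrative assumptions, not values from the corpus:

# Effect of k1: higher k1 lets term frequency push the score further before saturating
for k1 in [0.5, 1.2, 1.5, 2.0]:
    score = compute_bm25(5, 100, 100, 1.5, k1=k1, b=0.75)
    print(f"k1={k1:.1f}  score={score:.3f}")  # rises from ~2.05 to ~3.21

# Effect of b: higher b penalizes this 200-word document (2x the average length) more
for b in [0.0, 0.5, 0.75, 1.0]:
    score = compute_bm25(5, 200, 100, 1.5, k1=1.5, b=b)
    print(f"b={b:.2f}  score={score:.3f}")  # falls from ~2.89 to ~2.34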

Using BM25 in Practice

The rank_bm25 library provides a production-ready BM25 implementation:

In[50]:
Code
# Note: In practice, use: pip install rank-bm25
# from rank_bm25 import BM25Okapi

# Manual implementation for demonstration
class SimpleBM25:
    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.corpus = [tokenize(doc) for doc in corpus]
        self.doc_lengths = [len(doc) for doc in self.corpus]
        self.avg_doc_length = np.mean(self.doc_lengths)
        self.N = len(corpus)

        # Compute document frequencies
        self.df = Counter()
        for doc in self.corpus:
            self.df.update(set(doc))

        # Compute IDF
        self.idf = {}
        for term, freq in self.df.items():
            self.idf[term] = np.log((self.N - freq + 0.5) / (freq + 0.5) + 1)

    def get_scores(self, query):
        """Score all documents against a query."""
        query_tokens = tokenize(query)
        scores = np.zeros(self.N)

        for i, doc in enumerate(self.corpus):
            doc_tf = Counter(doc)
            doc_len = self.doc_lengths[i]

            for term in query_tokens:
                if term in doc_tf:
                    tf = doc_tf[term]
                    idf = self.idf.get(term, 0)
                    score = compute_bm25(
                        tf, doc_len, self.avg_doc_length, idf, self.k1, self.b
                    )
                    scores[i] += score

        return scores


# Create BM25 index
bm25 = SimpleBM25(search_corpus)
Out[51]:
Console
BM25 Search Results:
======================================================================

Query: 'machine learning algorithms'
----------------------------------------------------------------------
  1. [BM25: 4.720] Machine learning algorithms learn patterns from trainin...
  2. [BM25: 2.202] Python is a popular programming language for data scien...
  3. [BM25: 1.170] Deep learning uses neural networks with many layers....

Query: 'web development JavaScript'
----------------------------------------------------------------------
  1. [BM25: 4.186] Web development involves creating websites and web appl...
  2. [BM25: 3.366] JavaScript powers interactive web applications and runs...

Limitations and Impact

TF-IDF revolutionized information retrieval and remains widely used, but it has fundamental limitations:

  • No semantic understanding: "Car" and "automobile" are treated as completely unrelated terms. TF-IDF cannot capture synonymy, antonymy, or any semantic relationships.
  • Vocabulary mismatch: If a query uses different words than the documents (even with the same meaning), TF-IDF will miss the match. "Python programming" won't match "coding in Python" well.
  • Bag of words assumption: Like its foundation, TF-IDF ignores word order. "The cat ate the mouse" and "The mouse ate the cat" have identical representations.
  • No context: The same word always gets the same IDF, regardless of context. "Bank" (financial) and "bank" (river) are conflated.
  • Sparse representations: TF-IDF vectors are high-dimensional and sparse, making them inefficient for neural networks that prefer dense inputs.

Despite these limitations, TF-IDF's impact has been enormous:

  • Search engines: Google's early algorithms built on TF-IDF concepts
  • Document clustering: K-means on TF-IDF vectors groups similar documents
  • Text classification: TF-IDF features power spam filters, sentiment analyzers, and topic classifiers
  • Keyword extraction: High TF-IDF terms identify document topics
  • Baseline models: TF-IDF provides a strong baseline that neural models must beat

TF-IDF's success comes from its effective balance: it rewards terms that are distinctive to a document while penalizing terms that appear everywhere. This simple idea, implemented efficiently, solved real problems at scale.

Summary

TF-IDF combines term frequency and inverse document frequency to score a term's importance in a document relative to a corpus. The key insights:

  • TF-IDF formula: tf-idf(t, d, D) = tf(t, d) × idf(t, D)
  • TF variants: Raw counts, log-scaled (1 + log(tf)), binary, and augmented
  • IDF variants: Standard (log(N/df)), smoothed (log((N+1)/(df+1)) + 1)
  • Normalization: L2 normalization enables cosine similarity as a dot product
  • Document similarity: Cosine similarity on TF-IDF vectors measures topical overlap
  • BM25: Extends TF-IDF with term frequency saturation and document length normalization

TF-IDF remains a powerful baseline for information retrieval and text classification. Its limitations, particularly the lack of semantic understanding, motivated the development of word embeddings and transformer models. But understanding TF-IDF is essential: it's the foundation that modern NLP builds upon.

Key Functions and Parameters

When working with TF-IDF in scikit-learn, TfidfVectorizer is the primary tool:

TfidfVectorizer(lowercase, min_df, max_df, use_idf, norm, sublinear_tf, ngram_range, max_features)

  • lowercase (default: True): Convert text to lowercase before tokenization.
  • min_df: Minimum document frequency. Integer for absolute count, float for proportion. Use min_df=2 to remove rare terms.
  • max_df: Maximum document frequency. Use max_df=0.95 to filter extremely common terms.
  • use_idf (default: True): Enable IDF weighting. Set to False for TF-only vectors.
  • norm (default: 'l2'): Vector normalization. Use 'l2' for cosine similarity, 'l1' for Manhattan, None for raw scores.
  • sublinear_tf (default: False): Apply log-scaling to TF: replaces tf with 1 + log(tf).
  • ngram_range (default: (1, 1)): Include n-grams. Use (1, 2) for unigrams and bigrams.
  • max_features: Limit vocabulary to top N terms by corpus frequency.
  • smooth_idf (default: True): Add 1 to document frequencies to prevent zero IDF.

For BM25, use the rank_bm25 library:

BM25Okapi(corpus, k1=1.5, b=0.75)

  • corpus: List of tokenized documents (list of lists of strings)
  • k1: Term frequency saturation parameter. Higher values give more weight to term frequency.
  • b: Length normalization parameter. 0 disables length normalization; 1 gives full normalization.
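A minimal usage sketch, assuming rank_bm25 is installed and reusing the search_corpus from earlier (the whitespace tokenization here is a simplification):

from rank_bm25 import BM25Okapi

tokenized = [doc.lower().split() for doc in search_corpus]
bm25 = BM25Okapi(tokenized, k1=1.5, b=0.75)

query_tokens = "neural networks deep learning".split()
scores = bm25.get_scores(query_tokens)                        # one BM25 score per document
top_docs = bm25.get_top_n(query_tokens, search_corpus, n=3)   # highest-scoring documents

For simple retrieval pipelines, get_top_n often covers the whole ranking step on its own.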

