Master TF-IDF for text representation, including the core formula, variants like log-scaled TF and smoothed IDF, normalization techniques, document similarity with cosine similarity, and BM25 as a modern extension.

TF-IDF
You've counted words. You've explored term frequency variants that weight those counts. Now comes the crucial insight: a word's importance depends not just on how often it appears in a document, but on how rare it is across the corpus. The word "the" might appear 50 times in a document, but it tells you nothing because it appears in every document. The word "transformer" appearing just twice might be highly informative if it's rare elsewhere.
TF-IDF, short for Term Frequency-Inverse Document Frequency, combines these two signals into a single score. It's one of the most successful text representations in information retrieval, powering search engines and document similarity systems for decades. The formula is simple: multiply how often a term appears in a document by how rare it is across the corpus. Common words get downweighted; distinctive words get boosted.
This chapter brings together everything from the previous chapters on term frequency and inverse document frequency. You'll learn the exact TF-IDF formula and its variants, implement it from scratch, understand normalization options, and master scikit-learn's TfidfVectorizer. By the end, you'll know when TF-IDF works, when it fails, and how its successor BM25 addresses some of its limitations.
The TF-IDF Formula
This section develops the TF-IDF formula from first principles. We start with the problem of quantifying word importance, then build up each component of the formula step by step, showing how term frequency and inverse document frequency combine to create meaningful document representations.
The Challenge: Quantifying Word Importance
Imagine you're building a search engine for a collection of machine learning papers. When someone searches for "neural networks," you need to rank documents by relevance. But how do you decide which words in each document actually matter for this search?
The problem becomes clear when you look at actual documents. Every paper contains words like "the," "is," "data," and "model" - these appear everywhere and tell you nothing about what makes this paper unique. But other words like "backpropagation," "convolutional," or "transformer" appear much more selectively. Intuitively, you know these rarer, more specific words should carry more weight in determining relevance.
TF-IDF solves this fundamental problem by asking two essential questions about every word in every document:
- How prominent is this word in this specific document? (Local importance)
- How distinctive is this word across the entire collection? (Global rarity)
TF-IDF recognizes that both questions must be answered well for a word to be truly important. A word becomes a strong signal only when it's both prominent locally and distinctive globally.
From Intuition to Mathematics: Building the Formula
Let's develop the TF-IDF formula step by step, understanding why each mathematical choice addresses a specific aspect of the word importance problem.
Step 1: Capturing Local Prominence with Term Frequency
The most straightforward way to measure a word's importance to a document is simply counting how often it appears. If "neural" appears 5 times in a paper about neural networks while "algorithm" appears only once, it's reasonable to conclude that "neural" is more central to this document's content.
This intuition leads to the term frequency (TF):

$$\mathrm{tf}(t, d) = \text{count of } t \text{ in } d$$

where:
- $t$: the term (word) whose frequency we're measuring
- $d$: the document we're examining
- $\mathrm{tf}(t, d)$: the raw count of how many times term $t$ appears in document $d$
Raw term frequency captures the local prominence we want, but it has a critical weakness. Consider a document where "the" appears 50 times, "neural" appears 5 times, and "backpropagation" appears 3 times. The raw counts suggest "the" is most important, which is clearly wrong. We need a way to distinguish between ubiquitous function words and meaningful content words.
Step 2: Capturing Global Distinctiveness with Inverse Document Frequency
To identify words that truly distinguish documents from each other, we need to look across the entire corpus. A word that appears in every document (like "the" or "data") is useless for distinguishing between them. But a word that appears in only a few documents (like "backpropagation" or "transformer") is highly distinctive.
This leads us to document frequency - counting how many documents contain each term:

$$\mathrm{df}(t) = |\{d \in D : t \in d\}|$$

where:
- $t$: the term whose document frequency we're computing
- $D$: the corpus (collection of all documents)
- $d$: a document in the corpus
- $t \in d$: indicates that term $t$ appears at least once in document $d$
- $|\cdot|$: denotes the count (cardinality) of the set

In plain terms, $\mathrm{df}(t)$ counts how many documents in the corpus contain the term at least once.
The key insight is that we want to reward rarity, not frequency. Rare words should score high, common words should score low. We achieve this by inverting the relationship through the logarithm:

$$\mathrm{idf}(t, D) = \log\frac{N}{\mathrm{df}(t)}$$

where:
- $t$: the term whose inverse document frequency we're computing
- $D$: the corpus (collection of all documents)
- $N$: the total number of documents in corpus $D$ (i.e., $N = |D|$)
- $\mathrm{df}(t)$: the document frequency of term $t$ (how many documents contain $t$)
- $\log$: the natural logarithm (base $e$)

The fraction $N / \mathrm{df}(t)$ represents the inverse of the proportion of documents containing term $t$. Rare terms have small $\mathrm{df}(t)$, making this fraction large. Common terms have large $\mathrm{df}(t)$, making the fraction approach 1.
Why the logarithm? It serves two crucial purposes:
- Inversion: Rare words (low document frequency) get high IDF scores, common words (high document frequency) get low scores
- Scale compression: A word appearing in 1 of 1000 documents doesn't get 1000× the weight of a word appearing in 100 documents. The logarithmic scale ensures that differences in commonness are meaningful but not overwhelming.
Consider three words in a 1000-document corpus:
- "the" appears in 1000 documents: IDF = log(1000/1000) = log(1) = 0
- "neural" appears in 100 documents: IDF = log(1000/100) = log(10) ≈ 2.3
- "backpropagation" appears in 1 document: IDF = log(1000/1) = log(1000) ≈ 6.9
The progression makes intuitive sense: "the" gets no boost, "neural" gets a moderate boost, and "backpropagation" gets a substantial boost.
The logarithmic transformation creates a natural hierarchy that matches our intuition about word distinctiveness. To see this more clearly, let's visualize how IDF scores change as words become more or less common across a corpus.
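A quick way to see this is to plot IDF against document frequency for a fixed corpus size. The following sketch assumes a 1,000-document corpus and uses matplotlib; it simply traces the curve described below.

```python
import numpy as np
import matplotlib.pyplot as plt

N = 1000                      # corpus size
df = np.arange(1, N + 1)      # document frequency from 1 to N
idf = np.log(N / df)          # standard IDF with the natural log

plt.plot(df, idf)
plt.xlabel("Document frequency df(t)")
plt.ylabel("IDF score")
plt.title("IDF vs. document frequency (N = 1000)")
plt.show()
```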
This curve illustrates why logarithmic scaling is essential. Without it, the difference between a word appearing in 1 vs 10 documents would be the same as the difference between appearing in 990 vs 1000 documents - clearly not what we want.
Step 3: The Crucial Synthesis - Multiplying TF and IDF
Now we have two complementary perspectives on word importance:
- Term Frequency (TF): Captures local prominence within a document
- Inverse Document Frequency (IDF): Captures global distinctiveness across the corpus
The question is: how do we combine these signals? The answer reveals the core insight of TF-IDF.
Consider what happens with different combinations of TF and IDF:
- High TF, Low IDF (like "the" appearing 50 times): This word dominates locally but is ubiquitous globally. We want to heavily penalize it.
- Low TF, High IDF (like "backpropagation" appearing once): This word is distinctive globally but not prominent locally. It shouldn't dominate the document's representation.
- High TF, High IDF (like "neural" appearing 5 times in a neural networks paper): This word is both prominent locally and distinctive globally. This is exactly what we want to reward.
Multiplication naturally creates this behavior. When you multiply TF and IDF:
- High TF × Low IDF = Moderate score (common words get downweighted)
- Low TF × High IDF = Moderate score (rare words alone aren't enough)
- High TF × High IDF = High score (the sweet spot we want)
This multiplicative synthesis leads to the complete TF-IDF formula, which combines term frequency and inverse document frequency through multiplication:

$$\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$

where:
- $t$ is a term (word)
- $d$ is a document
- $D$ is the corpus (collection of all documents)
- $\mathrm{tf}(t, d)$ measures local prominence (how often $t$ appears in $d$)
- $\mathrm{idf}(t, D)$ measures global distinctiveness (how rare $t$ is across $D$)
The multiplication ensures that only words that excel at both local prominence and global distinctiveness achieve high TF-IDF scores. This creates a representation where each document is characterized by its most unique and relevant terms.
Bringing the Theory to Life: Implementing TF-IDF
Now that we understand the mathematical foundation of TF-IDF, let's see how these concepts work in practice. We'll implement the formula step by step on a real corpus, watching how the theory translates into concrete results that reveal the distinctive character of each document.
Our implementation will mirror the conceptual journey we just took - from measuring local prominence (TF) to assessing global distinctiveness (IDF) to combining them through multiplication. This hands-on approach will help you see why each mathematical choice matters and how the formula solves the word importance problem.
Step 1: Setting Up Our Corpus and Basic Functions
Let's start with a small but diverse corpus that covers different areas of machine learning. This will help us see how TF-IDF distinguishes between documents on different topics.
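The original corpus isn't reproduced here, so the snippet below defines an illustrative five-document corpus (one short document each on machine learning, neural networks, NLP, computer vision, and reinforcement learning) plus a minimal tokenizer. The exact wording is an assumption, chosen to be consistent with the counts discussed later in this section.

```python
import re

# Illustrative corpus: five short documents on different ML topics.
# (Hypothetical text, chosen to match the counts discussed below.)
corpus = [
    "Machine learning algorithms build powerful models from data. "
    "Learning from data requires careful learning.",                   # Doc 1: machine learning
    "Neural networks power deep learning. Deep neural networks learn "
    "layered feature representations.",                                # Doc 2: neural networks
    "Natural language processing (NLP) analyzes text. Text mining "
    "extracts patterns from unstructured data.",                       # Doc 3: text / NLP
    "Computer vision relies on deep learning. Convolutional neural "
    "networks classify images and detect objects.",                    # Doc 4: computer vision
    "Reinforcement learning agents maximize rewards by exploring an "
    "environment and learning from interaction data.",                 # Doc 5: reinforcement learning
]

def tokenize(text):
    """Lowercase and split on non-alphabetic characters."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]

tokenized_corpus = [tokenize(doc) for doc in corpus]
```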
Now let's implement the core functions that capture our two fundamental signals: local prominence (TF) and global distinctiveness (IDF).
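Here is a minimal sketch of those core functions, following the formulas above (raw counts for TF, natural-log IDF). The function names are our own.

```python
import math
from collections import Counter

def term_frequency(term, tokens):
    """Raw count of a term in a tokenized document: tf(t, d)."""
    return Counter(tokens)[term]

def document_frequency(term, tokenized_corpus):
    """Number of documents containing the term at least once: df(t)."""
    return sum(1 for tokens in tokenized_corpus if term in tokens)

def inverse_document_frequency(term, tokenized_corpus):
    """Standard IDF: log(N / df(t)), using the natural logarithm."""
    n_docs = len(tokenized_corpus)
    df = document_frequency(term, tokenized_corpus)
    return math.log(n_docs / df) if df > 0 else 0.0
```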
Step 2: Examining Global Distinctiveness Through IDF
Let's examine how IDF scores reflect each term's discriminative power across our corpus. We'll look at terms that span the full spectrum from ubiquitous to unique.
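Continuing the sketch above, we can print IDF scores for a handful of terms spanning that spectrum (the selection of terms is ours; the values assume the illustrative corpus).

```python
terms = ["learning", "data", "deep", "neural", "images", "rewards"]

print(f"{'term':<12}{'df':>4}{'idf':>8}")
for term in terms:
    df = document_frequency(term, tokenized_corpus)
    idf = inverse_document_frequency(term, tokenized_corpus)
    print(f"{term:<12}{df:>4}{idf:>8.2f}")
# learning: df=4, idf≈0.22   deep / neural: df=2, idf≈0.92
# images / rewards: df=1, idf≈1.61
```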
The Spectrum of Distinctiveness
The table reveals how IDF creates a natural hierarchy of word importance based on global rarity:
- Ubiquitous terms like "learning" (appears in 4/5 documents, IDF = 0.22) get minimal weight. These words appear everywhere and provide no discriminative signal.
- Moderately distinctive terms like "deep" and "neural" (2/5 documents, IDF = 0.92) receive moderate amplification. They're informative but not unique.
- Highly distinctive terms like "images" and "rewards" (1/5 documents, IDF = 1.61) get the strongest boost. Finding these words in a document immediately tells you something specific about its content and domain.
This solves our original problem: IDF automatically identifies which words distinguish documents from each other, suppressing generic terms while amplifying domain-specific ones.
Step 3: The Synthesis - Computing TF-IDF Scores
Now comes the crucial moment: combining our two signals through multiplication. This is where the theory becomes practice, where local prominence meets global distinctiveness to create a truly informative document representation.
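A minimal TF-IDF function, building on the helpers sketched above:

```python
def tf_idf(term, tokens, tokenized_corpus):
    """tf-idf(t, d, D) = tf(t, d) * idf(t, D)."""
    return (term_frequency(term, tokens)
            * inverse_document_frequency(term, tokenized_corpus))
```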
Step 4: Seeing the Results - How TF-IDF Transforms Document Representation
Let's examine how TF-IDF transforms Document 1 (about machine learning algorithms). We'll see the raw components alongside the final scores to understand how the multiplication creates meaningful rankings.
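With the illustrative corpus above, the breakdown for Document 1 can be printed like this (a sketch; the specific terms depend on the corpus text):

```python
doc1 = tokenized_corpus[0]

print(f"{'term':<12}{'tf':>4}{'idf':>8}{'tf-idf':>9}")
for term in sorted(set(doc1)):
    tf = term_frequency(term, doc1)
    idf = inverse_document_frequency(term, tokenized_corpus)
    print(f"{term:<12}{tf:>4}{idf:>8.2f}{tf * idf:>9.2f}")
```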
The Transformation in Action
The TF-IDF scores reveal how multiplication balances local and global signals:
- "Learning" (TF=3, IDF=0.22, TF×IDF=0.67): Despite being the most frequent word in Document 1, its ubiquity across the corpus suppresses its score. This prevents common words from dominating representations.
- "Data" (TF=2, IDF=0.51, TF×IDF=1.02): A nice balance - moderately frequent locally and moderately distinctive globally. The multiplication finds this sweet spot.
- "Algorithms" and "Powerful" (TF=1, IDF=1.61, TF×IDF=1.61): These terms appear only once but are unique to this document. Their global rarity compensates for their local scarcity, earning them top scores.
The result is a representation that captures what makes Document 1 distinctive: not just what words it contains, but which words make it different from other documents. TF-IDF has transformed raw frequency counts into meaningful importance scores that reflect both local relevance and global informativeness.
Step 5: Visualizing the Balance - How TF and IDF Work Together
To truly understand how TF-IDF creates meaningful document representations, let's visualize the interplay between local prominence and global distinctiveness. The visualization below decomposes the TF-IDF scores for Document 1, showing how the multiplication of TF and IDF produces rankings that reflect true importance rather than raw frequency.
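One simple way to produce such a decomposition is a grouped bar chart of TF, IDF, and TF-IDF for Document 1's terms. This is a matplotlib sketch reusing the helpers defined earlier, not necessarily the original chart.

```python
import numpy as np
import matplotlib.pyplot as plt

terms = sorted(set(doc1))
tf_vals = [term_frequency(t, doc1) for t in terms]
idf_vals = [inverse_document_frequency(t, tokenized_corpus) for t in terms]
tfidf_vals = [tf * idf for tf, idf in zip(tf_vals, idf_vals)]

x = np.arange(len(terms))
width = 0.28
plt.bar(x - width, tf_vals, width, label="TF")
plt.bar(x, idf_vals, width, label="IDF")
plt.bar(x + width, tfidf_vals, width, label="TF-IDF")
plt.xticks(x, terms, rotation=45, ha="right")
plt.legend()
plt.title("Decomposition of TF-IDF scores for Document 1")
plt.tight_layout()
plt.show()
```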
The Complete Picture: From Intuition to Implementation
Our journey through TF-IDF implementation has revealed how the formula transforms raw text into meaningful representations. We started with a fundamental insight - that word importance depends on both local prominence and global distinctiveness - and built a complete system that operationalizes this intuition.
The visualization above captures the essence of TF-IDF: it's not about counting words, but about understanding their significance. Terms like "learning" contribute less than their frequency suggests because they're commonplace. Terms like "algorithms" contribute more than their frequency suggests because they're distinctive.
This multiplicative synthesis creates document representations that truly capture what makes each text unique. TF-IDF doesn't just count words - it measures their discriminative power, creating a foundation for all the text analysis techniques that follow.
TF-IDF Variants
The basic TF-IDF formula we've developed works well, but practitioners have discovered that certain modifications can improve performance for specific tasks. These variants address subtle issues with the basic formulation, issues that become apparent when you think carefully about what the formula is measuring.
Log-Scaled TF: Taming Extreme Frequencies
Consider a document where "machine" appears 20 times and "learning" appears twice. Is "machine" really 10× more important to this document? Probably not. The relationship between word frequency and importance isn't linear. The first few occurrences establish a word's relevance, but additional occurrences provide diminishing returns.
Log-scaling addresses this proportionality problem by applying a logarithmic transformation:

$$\mathrm{tf}_{\log}(t, d) = 1 + \log\big(\mathrm{tf}(t, d)\big) \quad \text{for } \mathrm{tf}(t, d) > 0$$

where:
- $t$: the term whose log-scaled frequency we're computing
- $d$: the document being analyzed
- $\mathrm{tf}(t, d)$: the raw term frequency (count of $t$ in $d$)
- $\log$: the natural logarithm (base $e$)

The "+1" ensures that a term appearing once gets a score of 1 rather than 0 (since $\log(1) = 0$). This preserves the distinction between presence and absence while compressing differences among frequently occurring terms.
Interpreting the Log-Scaled Results
Log-scaling compresses the TF component, reducing the dominance of high-frequency terms. "Learning" with TF=3 gets log TF of 2.1, not 3. The difference between 3 and 1 occurrence shrinks from 3× to about 2×. This compression is often preferable for document similarity calculations, where you want to recognize that a document mentioning "neural" twice is similar to one mentioning it five times.
Smoothed IDF: Handling Edge Cases
The basic IDF formula has a subtle problem: what happens when a term appears in every document? The formula gives:

$$\mathrm{idf}(t, D) = \log\frac{N}{N} = \log(1) = 0$$

A term appearing everywhere gets IDF of zero, which means its TF-IDF is also zero. It contributes nothing to the document representation. While this makes sense for truly universal terms like "the", it can be problematic when you want even common terms to contribute something.

Smoothed IDF variants address this by adding constants to the numerator and denominator, plus a final offset:

$$\mathrm{idf}_{\text{smooth}}(t, D) = \log\frac{1 + N}{1 + \mathrm{df}(t)} + 1$$

where:
- $t$: the term whose smoothed IDF we're computing
- $D$: the corpus
- $N$: the total number of documents in the corpus
- $\mathrm{df}(t)$: the document frequency of term $t$

The smoothing modifications serve specific purposes:
- "+1" in the numerator: Ensures the fraction remains well-defined even for edge cases
- "+1" in the denominator: Prevents division by zero for unseen terms (where $\mathrm{df}(t) = 0$)
- "+1" added at the end: Shifts all IDF values up, ensuring even terms appearing in every document have positive IDF
This ensures all terms have positive IDF, which is important when you want common words to contribute something rather than nothing.
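A sketch comparing the standard and smoothed variants on the illustrative corpus:

```python
def smoothed_idf(term, tokenized_corpus):
    """Smoothed IDF: log((1 + N) / (1 + df(t))) + 1."""
    n_docs = len(tokenized_corpus)
    df = document_frequency(term, tokenized_corpus)
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ["learning", "data", "images"]:
    print(term,
          round(inverse_document_frequency(term, tokenized_corpus), 2),
          round(smoothed_idf(term, tokenized_corpus), 2))
```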
Why Smoothing Matters
The smoothed version adds a constant offset, ensuring even the most common terms retain some weight. The "+1" in the numerator and denominator prevents division issues with unseen terms, while the final "+1" shifts all IDF values up. This is the default in scikit-learn's TfidfVectorizer, making it important to understand when comparing implementations.
Common TF-IDF Schemes: A Notation System
The information retrieval community developed a compact notation for describing TF-IDF variants. Each scheme is specified by three letters indicating the TF variant, IDF variant, and normalization method. For example, ltc means: log TF, standard IDF, cosine normalization.
Here are the most common schemes and when to use them:
| Scheme | TF | IDF | Normalization | Use Case |
|---|---|---|---|---|
| nnn | Raw | None | None | Baseline, raw counts |
| ntc | Raw | Standard | Cosine | Basic TF-IDF |
| ltc | Log | Standard | Cosine | Balanced weighting |
| lnc | Log | None | Cosine | TF only, normalized |
| bnn | Binary | None | None | Presence/absence |
TF-IDF Vector Computation
To use TF-IDF for machine learning, we need to convert documents into fixed-length vectors. Each dimension corresponds to a vocabulary term, and the value is that term's TF-IDF score.
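A sketch that builds a dense TF-IDF matrix from the helpers above (one row per document, one column per vocabulary term); a real implementation would use a sparse format.

```python
import numpy as np

vocabulary = sorted({tok for tokens in tokenized_corpus for tok in tokens})

tfidf_matrix = np.array([
    [tf_idf(term, tokens, tokenized_corpus) for term in vocabulary]
    for tokens in tokenized_corpus
])
print(tfidf_matrix.shape)  # (n_documents, n_vocabulary_terms)
```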
Sparsity in TF-IDF Matrices
Like raw count matrices, TF-IDF matrices are extremely sparse. Most documents use only a small fraction of the vocabulary.
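Measuring sparsity on the small matrix above (real corpora are far sparser):

```python
n_zero = np.sum(tfidf_matrix == 0)
sparsity = n_zero / tfidf_matrix.size
print(f"Sparsity: {sparsity:.1%} of entries are zero")
```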
TF-IDF Normalization
Raw TF-IDF vectors have varying lengths depending on document size and vocabulary overlap. For similarity calculations, we typically normalize vectors so that document length doesn't dominate.
L2 Normalization
L2 normalization (also called Euclidean normalization) divides each vector by its Euclidean length, projecting all documents onto the unit sphere:

$$\hat{v} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{\sum_i v_i^2}}$$

where:
- $v$: the original TF-IDF vector with components $v_i$
- $\hat{v}$: the normalized vector
- $\|v\|_2$: the L2 norm (Euclidean length) of vector $v$
- $\sum_i v_i^2$: the sum of squared components, which equals the squared length
- $i$: index over all vocabulary terms (dimensions of the vector)

The result is a unit vector: a vector with length exactly 1.0. After L2 normalization, cosine similarity becomes a simple dot product.
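A sketch of L2 normalization applied row-wise to the TF-IDF matrix built above:

```python
l2_norms = np.linalg.norm(tfidf_matrix, axis=1, keepdims=True)
tfidf_l2 = tfidf_matrix / l2_norms

print(np.linalg.norm(tfidf_l2, axis=1))  # every row now has length 1.0
```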
All normalized vectors now have unit length (1.0), making them directly comparable regardless of original document length.
L1 Normalization
L1 normalization (also called Manhattan normalization) divides by the sum of absolute values, making each vector's components sum to 1:

$$\hat{v} = \frac{v}{\|v\|_1} = \frac{v}{\sum_i |v_i|}$$

where:
- $v$: the original TF-IDF vector with components $v_i$
- $\hat{v}$: the L1-normalized vector
- $\|v\|_1$: the L1 norm (sum of absolute values)
- $|v_i|$: the absolute value of the $i$-th component
- $\sum_i |v_i|$: the total "mass" of the vector

After L1 normalization, all components sum to 1, creating a probability-like distribution over terms.
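The L1 counterpart, again applied row-wise:

```python
l1_norms = np.abs(tfidf_matrix).sum(axis=1, keepdims=True)
tfidf_l1 = tfidf_matrix / l1_norms

print(tfidf_l1.sum(axis=1))  # every row now sums to 1.0
```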
L1 normalization is useful when you want to interpret TF-IDF scores as term "importance proportions" within a document.
Document Similarity with TF-IDF
TF-IDF's primary application is measuring document similarity. Documents with similar TF-IDF vectors discuss similar topics using similar vocabulary.
Cosine Similarity
Cosine similarity measures the angle between two vectors, producing a score that ranges from 0 (orthogonal, completely dissimilar) to 1 (identical direction, maximally similar):

$$\text{cosine}(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$$

where:
- $a$: the first document's TF-IDF vector
- $b$: the second document's TF-IDF vector
- $a \cdot b$: the dot product, computed as $\sum_i a_i b_i$
- $\|a\|$: the L2 norm (Euclidean length) of vector $a$
- $\|b\|$: the L2 norm of vector $b$

The numerator captures how much the vectors "point in the same direction" (shared vocabulary with similar weights), while the denominator normalizes for vector length. For L2-normalized vectors (where $\|a\| = \|b\| = 1$), this simplifies to a dot product: $\text{cosine}(a, b) = a \cdot b$.
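Because the rows of `tfidf_l2` already have unit length, the full pairwise similarity matrix is a single matrix product. This is a sketch on the illustrative corpus, so the exact values may not match the figures quoted below; scikit-learn's `cosine_similarity` computes the same quantity.

```python
similarity_matrix = tfidf_l2 @ tfidf_l2.T   # (n_docs, n_docs), diagonal = 1.0
print(np.round(similarity_matrix, 2))
```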
Documents 1 and 5 show moderate similarity (0.28) because both discuss "learning". Documents 2 and 4 share "deep learning" terminology. The diagonal shows perfect self-similarity (1.0).
Finding Similar Documents
Given a query document, we can rank all corpus documents by similarity:
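Ranking the other documents against a query document (here Document 2, as an illustrative choice) is then an argsort over one row of the similarity matrix:

```python
query_idx = 1  # Document 2 (neural networks)
scores = similarity_matrix[query_idx]

ranking = np.argsort(scores)[::-1]
for idx in ranking:
    if idx != query_idx:
        print(f"Document {idx + 1}: similarity = {scores[idx]:.2f}")
```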
Document 4 (Computer Vision) ranks highest because it shares "deep learning" vocabulary with Document 2. Document 1 (Machine Learning) comes second due to shared "learning" terminology.
Visualizing Document Similarity in 2D
While the similarity matrix shows pairwise relationships, we can also visualize how documents cluster in a 2D space. Using PCA to reduce our high-dimensional TF-IDF vectors to 2 dimensions reveals the underlying structure:
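A sketch of the 2D projection using scikit-learn's PCA and matplotlib:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

coords = PCA(n_components=2).fit_transform(tfidf_l2)

plt.scatter(coords[:, 0], coords[:, 1])
for i, (x, y) in enumerate(coords):
    plt.annotate(f"Doc {i + 1}", (x, y))
plt.title("TF-IDF vectors projected to 2D with PCA")
plt.show()
```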
The 2D projection reveals document relationships at a glance. Documents sharing vocabulary (like the "learning"-related documents) appear closer together, while Document 3 (focused on text/NLP) sits farther from the others due to its distinct vocabulary.
TF-IDF for Feature Extraction
Beyond document similarity, TF-IDF vectors serve as features for machine learning models. Text classification, clustering, and information retrieval all benefit from TF-IDF representations.
Text Classification Example
Let's use TF-IDF features for a simple classification task:
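The original training data isn't shown, so the sketch below uses a tiny made-up dataset of ML vs. literature snippets, combining TfidfVectorizer with logistic regression to illustrate the pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled snippets (ML vs. literature).
texts = [
    "machine learning models learn patterns from data",
    "deep learning models need large training data",
    "the novel's narrative weaves poetry and stories",
    "her stories blend narrative voice with vivid poetry",
]
labels = ["ml", "ml", "literature", "literature"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["training data for learning models",
                   "a collection of short stories and poetry"]))
# expected: ['ml' 'literature']
```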
TF-IDF features capture the distinctive vocabulary of each class. ML documents contain "learning", "data", "models"; literature documents contain "poetry", "stories", "narrative".
Feature Selection with TF-IDF
High TF-IDF scores identify distinctive terms that can serve as features:
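A sketch that pulls each document's top TF-IDF terms from a fitted TfidfVectorizer, using the illustrative corpus defined earlier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)              # sparse (n_docs, n_terms)
feature_names = vectorizer.get_feature_names_out()

for doc_idx in range(X.shape[0]):
    row = X[doc_idx].toarray().ravel()
    top = np.argsort(row)[::-1][:3]               # three highest-scoring terms
    print(f"Doc {doc_idx + 1}:", [feature_names[i] for i in top])
```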
Each document's top TF-IDF terms capture its distinctive content. Document 2's top terms include "neural" and "networks"; Document 3's include "text" and "nlp".
sklearn TfidfVectorizer Deep Dive
scikit-learn's TfidfVectorizer is the standard tool for TF-IDF computation. Understanding its parameters helps you tune it for your specific use case.
Basic Usage
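A minimal example of the default workflow (fitting on the illustrative corpus from earlier):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()          # defaults: smoothed IDF, L2 normalization
X = vectorizer.fit_transform(corpus)    # sparse TF-IDF matrix

print(X.shape)                                   # (n_documents, n_vocabulary_terms)
print(vectorizer.get_feature_names_out()[:10])   # first few vocabulary terms
```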
Key Parameters
TfidfVectorizer combines tokenization, TF-IDF computation, and normalization. Here are the most important parameters:
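The configuration names below (`default`, `no_idf`, `sublinear_tf`, `no_norm`, `bigrams`) are our own labels for a sketch that compares the effect of individual parameters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

configs = {
    "default":      TfidfVectorizer(),
    "no_idf":       TfidfVectorizer(use_idf=False),
    "sublinear_tf": TfidfVectorizer(sublinear_tf=True),
    "no_norm":      TfidfVectorizer(norm=None),
    "bigrams":      TfidfVectorizer(ngram_range=(1, 2)),
}

for name, vec in configs.items():
    X = vec.fit_transform(corpus)
    print(f"{name:<13} vocab={X.shape[1]:>3}  max value={X.max():.2f}")
```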
Key observations:
- no_idf: Without IDF, all terms are weighted by frequency alone
- sublinear_tf: Log-scaling compresses TF values
- no_norm: Without normalization, vector norms vary by document length
- bigrams: Including bigrams dramatically increases vocabulary size
These configuration choices also affect similarity calculations. Without IDF, documents appear more similar because common words aren't downweighted. Bigrams can capture different relationships by considering word pairs like "deep learning" or "neural networks" as single features.
The IDF Formula in sklearn
scikit-learn uses a smoothed IDF formula by default, which differs slightly from the standard textbook formula:

$$\mathrm{idf}(t) = \log\frac{1 + N}{1 + \mathrm{df}(t)} + 1$$

where:
- $t$: the term whose IDF we're computing
- $N$: the total number of documents in the corpus
- $\mathrm{df}(t)$: the document frequency of term $t$ (number of documents containing $t$)

This smoothed version ensures that even terms appearing in every document receive a positive IDF score (rather than zero), and prevents division-by-zero errors for terms not seen during training.
Practical Configuration Patterns
Different tasks call for different configurations:
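Here is a sketch of three such configurations; the parameter values are illustrative starting points, not prescriptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Document similarity: L2-normalized, log-scaled TF, frequency filtering.
similarity_vec = TfidfVectorizer(sublinear_tf=True, norm="l2",
                                 min_df=2, max_df=0.95)

# Text classification: unigrams + bigrams, capped vocabulary.
classification_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=10_000)

# Keyword extraction: no normalization, so raw scores stay interpretable.
keyword_vec = TfidfVectorizer(norm=None, sublinear_tf=True)
```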
Each configuration targets a specific use case:
- Document Similarity: L2 normalization enables direct cosine similarity computation, while log-scaled TF prevents high-frequency terms from dominating. Filtering via
min_dfandmax_dfremoves noise from rare typos and ubiquitous stopwords. - Text Classification: Bigrams capture phrase-level patterns that unigrams miss (like "not good" vs "good"). Limiting vocabulary with
max_featuresprevents overfitting and speeds up training. - Keyword Extraction: Disabling normalization preserves interpretable scores. The highest values directly indicate the most distinctive terms without length adjustment.
Worked Example: Document Search
Let's build a complete document search system using TF-IDF:
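The original document collection isn't shown, so the sketch below indexes a few made-up snippets and scores queries with cosine similarity. (`linear_kernel` on L2-normalized TF-IDF vectors is equivalent to cosine similarity.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical searchable collection.
docs = [
    "Machine learning algorithms for data science and prediction",
    "Deep learning and neural networks for image recognition",
    "JavaScript frameworks for modern web development",
    "Building responsive web pages with HTML, CSS and JavaScript",
    "Statistical methods and data analysis in Python",
]

vectorizer = TfidfVectorizer(stop_words="english")
index = vectorizer.fit_transform(docs)

def search(query, top_k=3):
    """Rank documents by cosine similarity to the query."""
    query_vec = vectorizer.transform([query])
    scores = linear_kernel(query_vec, index).ravel()
    best = scores.argsort()[::-1][:top_k]
    return [(docs[i], round(float(scores[i]), 2)) for i in best if scores[i] > 0]

print(search("machine learning algorithms"))
print(search("web development JavaScript"))
```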
The search system finds relevant documents by matching query terms against the TF-IDF index. "Machine learning algorithms" matches documents about ML and data science. "Web development JavaScript" finds the JavaScript and web development documents.
BM25: TF-IDF's Successor
TF-IDF has a limitation: it doesn't handle document length well. A long document naturally has more term occurrences, potentially inflating its relevance scores. BM25 (Best Matching 25) extends TF-IDF with length normalization and saturation.
BM25 is a ranking function that extends TF-IDF with two key improvements:
- Term frequency saturation: Additional occurrences contribute diminishing returns
- Document length normalization: Longer documents are penalized
The formula is:

$$\text{BM25}(t, d, D) = \mathrm{idf}(t, D) \cdot \frac{\mathrm{tf}(t, d) \cdot (k_1 + 1)}{\mathrm{tf}(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

where:
- $t$: the query term being scored
- $d$: the document being evaluated
- $D$: the corpus
- $\mathrm{tf}(t, d)$: the raw term frequency of $t$ in document $d$
- $\mathrm{idf}(t, D)$: the inverse document frequency of term $t$
- $k_1$: saturation parameter controlling how quickly TF saturates (typically 1.2-2.0; higher values mean TF matters more)
- $b$: length normalization parameter (typically 0.75; 0 = no normalization, 1 = full normalization)
- $|d|$: the length of document $d$ (typically word count)
- $\mathrm{avgdl}$: the average document length across the corpus

The term $1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}$ is the length normalization factor. When $|d| = \mathrm{avgdl}$, this factor equals 1. Longer documents get larger factors (reducing their scores), while shorter documents get smaller factors (boosting their scores).
BM25 Length Normalization
BM25's length normalization adjusts scores based on document length relative to the corpus average. The example below shows how the BM25 score for a fixed term frequency changes with document length, demonstrating the length normalization effect:
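This is a direct implementation of the formula above, scoring the same term frequency at different document lengths; the lengths, IDF value, and parameters are illustrative.

```python
import math

def bm25_score(tf, idf, doc_len, avgdl, k1=1.5, b=0.75):
    """BM25 contribution of a single term, per the formula above."""
    length_norm = 1 - b + b * (doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

idf, tf, avgdl = 2.0, 5, 200   # illustrative values
for doc_len in (50, 100, 200, 400, 800):
    print(f"doc_len={doc_len:>4}  BM25={bm25_score(tf, idf, doc_len, avgdl):.2f}")
```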
Shorter documents get higher BM25 scores for the same term frequency, reflecting the intuition that finding a term in a short document is more significant than finding it in a long one. A term appearing 5 times in a 50-word document is much more dominant than the same 5 occurrences in a 400-word document.
BM25 Parameter Sensitivity
The two key BM25 parameters, $k_1$ and $b$, control different aspects of the scoring: $k_1$ determines how quickly repeated occurrences of a term stop increasing the score, while $b$ controls how strongly document length is penalized.
Using BM25 in Practice
The rank_bm25 library provides a production-ready BM25 implementation:
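A minimal usage sketch, reusing the tokenizer and tokenized corpus from earlier (install with `pip install rank_bm25` first):

```python
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi(tokenized_corpus, k1=1.5, b=0.75)

query = tokenize("deep neural networks")
print(bm25.get_scores(query))               # one BM25 score per document
print(bm25.get_top_n(query, corpus, n=2))   # two best-matching documents
```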
Limitations and Impact
TF-IDF revolutionized information retrieval and remains widely used, but it has fundamental limitations:
- No semantic understanding: "Car" and "automobile" are treated as completely unrelated terms. TF-IDF cannot capture synonymy, antonymy, or any semantic relationships.
- Vocabulary mismatch: If a query uses different words than the documents (even with the same meaning), TF-IDF will miss the match. "Python programming" won't match "coding in Python" well.
- Bag of words assumption: Like its foundation, TF-IDF ignores word order. "The cat ate the mouse" and "The mouse ate the cat" have identical representations.
- No context: The same word always gets the same IDF, regardless of context. "Bank" (financial) and "bank" (river) are conflated.
- Sparse representations: TF-IDF vectors are high-dimensional and sparse, making them inefficient for neural networks that prefer dense inputs.
Despite these limitations, TF-IDF's impact has been enormous:
- Search engines: Google's early algorithms built on TF-IDF concepts
- Document clustering: K-means on TF-IDF vectors groups similar documents
- Text classification: TF-IDF features power spam filters, sentiment analyzers, and topic classifiers
- Keyword extraction: High TF-IDF terms identify document topics
- Baseline models: TF-IDF provides a strong baseline that neural models must beat
TF-IDF's success comes from its effective balance: it rewards terms that are distinctive to a document while penalizing terms that appear everywhere. This simple idea, implemented efficiently, solved real problems at scale.
Summary
TF-IDF combines term frequency and inverse document frequency to score a term's importance in a document relative to a corpus. The key insights:
- TF-IDF formula: $\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$
- TF variants: Raw counts, log-scaled ($1 + \log \mathrm{tf}(t, d)$), binary, and augmented
- IDF variants: Standard ($\log\frac{N}{\mathrm{df}(t)}$), smoothed ($\log\frac{1 + N}{1 + \mathrm{df}(t)} + 1$)
- Normalization: L2 normalization enables cosine similarity as a dot product
- Document similarity: Cosine similarity on TF-IDF vectors measures topical overlap
- BM25: Extends TF-IDF with term frequency saturation and document length normalization
TF-IDF remains a powerful baseline for information retrieval and text classification. Its limitations, particularly the lack of semantic understanding, motivated the development of word embeddings and transformer models. But understanding TF-IDF is essential: it's the foundation that modern NLP builds upon.
Key Functions and Parameters
When working with TF-IDF in scikit-learn, TfidfVectorizer is the primary tool:
`TfidfVectorizer(lowercase, min_df, max_df, use_idf, norm, sublinear_tf, ngram_range, max_features)`
- `lowercase` (default: `True`): Convert text to lowercase before tokenization.
- `min_df`: Minimum document frequency. Integer for absolute count, float for proportion. Use `min_df=2` to remove rare terms.
- `max_df`: Maximum document frequency. Use `max_df=0.95` to filter extremely common terms.
- `use_idf` (default: `True`): Enable IDF weighting. Set to `False` for TF-only vectors.
- `norm` (default: `'l2'`): Vector normalization. Use `'l2'` for cosine similarity, `'l1'` for Manhattan, `None` for raw scores.
- `sublinear_tf` (default: `False`): Apply log-scaling to TF: replaces tf with $1 + \log(\mathrm{tf})$.
- `ngram_range` (default: `(1, 1)`): Include n-grams. Use `(1, 2)` for unigrams and bigrams.
- `max_features`: Limit vocabulary to top N terms by corpus frequency.
- `smooth_idf` (default: `True`): Add 1 to document frequencies to prevent zero IDF.
For BM25, use the rank_bm25 library:
`BM25Okapi(corpus, k1=1.5, b=0.75)`
- `corpus`: List of tokenized documents (list of lists of strings)
- `k1`: Term frequency saturation parameter. Higher values give more weight to term frequency.
- `b`: Length normalization parameter. 0 disables length normalization; 1 gives full normalization.