
Term Frequency
How many times does a word appear in a document? This simple question leads to surprisingly complex answers. Raw counts tell you something, but a word appearing 10 times in a 100-word email means something different than 10 times in a 10,000-word novel. Term frequency weighting schemes address this by transforming raw counts into more meaningful signals.
In the Bag of Words chapter, we counted words. Now we refine those counts. Term frequency (TF) is the foundation of TF-IDF, one of the most successful text representations in information retrieval. But TF alone comes in many flavors: raw counts, log-scaled frequencies, boolean indicators, and normalized variants. Each captures different aspects of word importance.
This chapter explores these variants systematically. You'll learn when raw counts mislead, why logarithms help, and how normalization enables fair comparison across documents of different lengths. By the end, you'll understand the design choices behind term frequency and be ready to combine TF with inverse document frequency in the next chapter.
Raw Term Frequency
We begin with the most natural question you can ask about a word in a document: how many times does it appear? This count—raw term frequency—forms the foundation for all the variants we'll explore. Despite its simplicity, understanding when counting works and when it fails illuminates the design choices behind more sophisticated weighting schemes.
The intuition is straightforward: if a document mentions "neural" ten times and "cat" once, the document is probably more about neural networks than cats. Counting word occurrences provides a rough proxy for topical emphasis.
Term frequency measures how often a term appears in a document:

$$\mathrm{tf}(t, d) = \text{number of times term } t \text{ occurs in document } d$$

where:
- $\mathrm{tf}(t, d)$: the term frequency of term $t$ in document $d$, a non-negative integer representing how many times the word appears
- $t$: a term (word) in the vocabulary, the specific word we're measuring
- $d$: a document in the corpus, the text unit we're analyzing

For example, if "learning" appears 5 times in a document, then $\mathrm{tf}(\text{learning}, d) = 5$.
This formula is deceptively simple—it's literally just counting. But this simplicity belies important assumptions. By treating all occurrences equally, we're implicitly claiming that the fifth mention of a word carries the same informational weight as the first. As we'll see, this assumption often fails in practice.
Let's implement this from scratch:
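
A minimal from-scratch version uses `collections.Counter`; the three short documents below are illustrative stand-ins for the corpus discussed throughout this chapter:

```python
from collections import Counter

# Illustrative stand-ins for the chapter's corpus: a machine learning document,
# a deep learning document, and an NLP document.
documents = [
    "Machine learning is a powerful tool. Machine learning means learning from data, "
    "and learning improves with more data.",
    "Deep learning is a subfield of machine learning. Deep neural networks learn "
    "from data, and a neural network is built from layers.",
    "Natural language processing is a way to apply machine learning to text. "
    "Text classification and sentiment analysis are common tasks.",
]

def raw_tf(text):
    """Lowercase, strip basic punctuation, and count every token."""
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    return Counter(tokens)

for i, doc in enumerate(documents, start=1):
    print(f"Document {i}:", raw_tf(doc).most_common(5))
```
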
The output shows clear topical signatures for each document. "Learning" dominates Document 1 with 4 occurrences, making it the most characteristic term. Document 2 shows "deep" and "neural" as distinguishing features, reflecting its focus on deep learning. However, the results also reveal a problem: common words like "is" and "a" appear frequently across all documents without carrying meaningful semantic content. These function words inflate raw counts without adding discriminative power.
The Problem with Raw Counts
Raw term frequency has a proportionality problem. If "learning" appears twice as often as "data", does that mean it's twice as important? Probably not. The relationship between word count and importance is sublinear: the difference between 1 and 2 occurrences is more significant than the difference between 10 and 11.
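
To see the problem concretely, here is a small sketch with hypothetical counts chosen to mirror the scenario described next:

```python
# Hypothetical raw counts for a single document about machine learning
raw_counts = {"machine": 6, "learning": 4, "data": 2, "science": 1, "models": 1}

# Raw TF implies a linear notion of importance: each additional occurrence
# counts exactly as much as the first one did.
baseline = raw_counts["science"]
for term, count in raw_counts.items():
    implied = count / baseline
    print(f"{term:10s} count={count}  implied importance = {implied:.0f}x 'science'")
```
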
The results expose the proportionality problem: "machine" appears 6 times, suggesting it's 6× more important than "science" (1 occurrence) and 3× more important than "data" (2 occurrences). But this linear relationship doesn't match intuition—a document mentioning "machine" six times isn't necessarily six times more about machines. After the first few occurrences, additional repetitions add diminishing information. This observation motivates log-scaled term frequency.
Term Frequency Distribution
Before moving to log-scaling, let's examine how term frequencies distribute across a document. Natural language follows Zipf's law: a few words appear very frequently, while most words appear rarely.
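
As a rough sketch, you can build such a rank-frequency plot from any sufficiently long text; this version assumes a local plain-text file named `corpus.txt` (the filename is just a placeholder):

```python
from collections import Counter

import matplotlib.pyplot as plt

# Assumes a reasonably long plain-text file saved locally as 'corpus.txt'
# (any public-domain book works for illustration).
with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

frequencies = sorted(Counter(tokens).values(), reverse=True)
ranks = range(1, len(frequencies) + 1)

plt.loglog(ranks, frequencies)   # an approximately straight line indicates Zipf-like behavior
plt.xlabel("Rank of term")
plt.ylabel("Term frequency")
plt.title("Rank-frequency distribution")
plt.show()
```
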
The rank-frequency plot shows the characteristic "long tail" of natural language: frequency drops rapidly with rank. Log-scaling addresses this by compressing the high-frequency end of this distribution.
Log-Scaled Term Frequency
The proportionality problem reveals that raw counts make a flawed assumption: each additional word occurrence carries equal weight. But information doesn't work this way. The first time a document mentions "quantum," we learn something significant—this document touches on quantum topics. The second mention reinforces this. By the tenth mention, we're no longer learning much new; we already know the document focuses on quantum concepts.
This observation—that information gain diminishes with repetition—leads us to seek a mathematical function that captures this diminishing returns property. We need a transformation that:
- Preserves ordering: More occurrences should still mean higher weight (we don't want to lose ranking information)
- Compresses differences: The gap between 10 and 11 occurrences should be smaller than the gap between 1 and 2
- Grows sublinearly: The function should rise quickly at first, then flatten
The logarithm is precisely such a function. It's the mathematical embodiment of diminishing returns.
Why Logarithms?
The logarithm has a special property: it converts multiplicative relationships into additive ones. If one term appears 10 times more often than another, the log-scaled difference is just $\ln(10) \approx 2.3$, not 10. This compression naturally models how information works—the first mention is highly informative, but repetition adds progressively less.
Mathematically, the derivative of $\ln(x)$ is $1/x$. This means the "rate of information gain" from each additional occurrence is inversely proportional to how many times we've already seen the word. The 10th occurrence adds only 1/10th as much as the first.
Log-scaled term frequency dampens the effect of high counts using the natural logarithm:

$$\mathrm{tf}_{\log}(t, d) = \begin{cases} 1 + \ln\big(\mathrm{tf}(t, d)\big) & \text{if } \mathrm{tf}(t, d) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where:
- $\mathrm{tf}(t, d)$: the raw count of term $t$ in document $d$
- $\ln$: the natural logarithm (base $e$)
- The offset of 1 ensures that a term appearing exactly once receives weight 1 (since $\ln(1) = 0$, we get $1 + 0 = 1$)
- The piecewise definition handles absent terms: if a word doesn't appear ($\mathrm{tf}(t, d) = 0$), its weight is 0, not $1 + \ln(0)$ (which would produce $-\infty$)
Understanding the Formula
The formula is carefully constructed. The offset of 1 serves a critical purpose: since $\ln(1) = 0$, a term appearing exactly once would otherwise get weight 0, making it indistinguishable from absent terms. Adding 1 ensures that a term appearing once gets weight 1.
Let's trace through what happens at different frequency levels:
| Raw Count | Calculation | Log-Scaled Weight |
|---|---|---|
| 1 | $1 + \ln(1) = 1 + 0$ | 1.00 |
| 2 | $1 + \ln(2) \approx 1 + 0.69$ | 1.69 |
| 10 | $1 + \ln(10) \approx 1 + 2.30$ | 3.30 |
| 100 | $1 + \ln(100) \approx 1 + 4.61$ | 5.61 |
The 100x increase in raw count becomes only a 5.6x increase in log-scaled weight. A term appearing 100 times doesn't dominate one appearing once. It's weighted roughly 5-6 times higher, not 100 times.
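
A small sketch makes the compression visible, computing log-scaled weights and a raw-to-log ratio for illustrative counts in which "learning" appears 4 times:

```python
import math

def log_tf(count):
    """Log-scaled term frequency: 1 + ln(count) for positive counts, 0 otherwise."""
    return 1 + math.log(count) if count > 0 else 0.0

# Illustrative counts for one document; "learning" appears 4 times
raw_counts = {"learning": 4, "machine": 2, "is": 2, "powerful": 1, "data": 1}

print(f"{'term':10s} {'raw':>4} {'log TF':>7} {'ratio':>6}")
for term, count in sorted(raw_counts.items(), key=lambda item: -item[1]):
    scaled = log_tf(count)
    # ratio = raw count / log-scaled weight: how strongly the count was compressed
    print(f"{term:10s} {count:4d} {scaled:7.2f} {count / scaled:6.2f}")
```
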
The "Ratio" column reveals the compression effect. Terms with raw count 1 have ratio 1.0 (no change), but "learning" with 4 occurrences has a ratio of about 1.66—meaning log-scaling reduced its relative weight. The log TF of 2.39 for "learning" is only about 2.4× the weight of a single-occurrence term, not 4×. This compression better reflects how the fourth mention of a word adds less information than the first.
Visualizing the Log Transformation
The logarithm's compression effect becomes clearer when we plot raw counts against their log-scaled equivalents:
The curve flattens as raw counts increase. This sublinear relationship captures the intuition that the 10th occurrence of a word adds less information than the first.
Boolean Term Frequency
Log-scaling represents one response to the diminishing returns problem: compress high counts while preserving relative ordering. But what if we take this philosophy to its logical extreme? If the 100th occurrence of a word adds almost no information beyond the first, perhaps we should ignore counts entirely and focus on a simpler question: does this word appear at all?
This reasoning leads to boolean term frequency, the most aggressive form of count compression. It reduces the entire frequency distribution to a binary signal: present or absent, 1 or 0. A document mentioning "neural" once is indistinguishable from one mentioning it a hundred times.
At first glance, this seems to discard valuable information. But boolean TF embodies a specific philosophy about what matters: topic coverage rather than topic emphasis. When classifying documents into categories, knowing that a document discusses machine learning, neural networks, and optimization might matter more than knowing exactly how many times each term appears.
Boolean term frequency reduces frequency to a binary presence indicator:

$$\mathrm{tf}_{\text{bool}}(t, d) = \begin{cases} 1 & \text{if } t \in d \\ 0 & \text{otherwise} \end{cases}$$

where:
- $t$: the term being queried, the word we're checking for
- $d$: the document being analyzed
- $t \in d$: set membership notation indicating that term $t$ appears at least once in document $d$
The output is binary: 1 if the word exists anywhere in the document, 0 if it's completely absent. All frequency information is collapsed into this single bit.
This might seem like throwing away information, but boolean TF excels when:
- You care about topic coverage, not emphasis
- Repeated terms might indicate spam or manipulation
- The task is set-based (does this document mention these concepts?)
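
A short sketch comparing raw, log-scaled, and boolean weights for the same illustrative counts:

```python
import math

# Illustrative counts: "learning" appears 4 times, "powerful" just once
raw_counts = {"learning": 4, "machine": 2, "powerful": 1, "data": 1}

def log_tf(count):
    return 1 + math.log(count) if count > 0 else 0.0

def boolean_tf(count):
    return 1 if count > 0 else 0

print(f"{'term':10s} {'raw':>4} {'log':>6} {'bool':>5}")
for term, count in raw_counts.items():
    print(f"{term:10s} {count:4d} {log_tf(count):6.2f} {boolean_tf(count):5d}")
```
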
The comparison highlights the trade-offs between variants. "Learning" with 4 raw occurrences becomes 2.39 after log-scaling and just 1 with boolean TF. Similarly, "powerful" (1 occurrence) remains 1.00 across log and boolean but with very different implications: log-scaled 1.00 means "appeared once," while boolean 1 means "appeared at all." Boolean TF treats "learning" (4 occurrences) and "powerful" (1 occurrence) identically. This might seem like information loss, but for topic detection tasks, knowing a document mentions "learning" matters more than how often.
Augmented Term Frequency
So far, we've transformed individual term counts—compressing them with logarithms or collapsing them to binary indicators. But we've ignored a confounding factor that affects all these approaches: document length. A 10,000-word document will naturally contain more occurrences of any term than a 100-word document, even if both discuss the same topic with equal emphasis.
This length bias corrupts comparisons. When ranking documents by relevance to a query, we might inadvertently favor long documents simply because they have more words—not because they're more relevant. We need a normalization scheme that makes term frequencies comparable across documents of wildly different lengths.
The Document Length Problem
Consider a concrete example. Two documents about machine learning:
- Document A (100 words): mentions "learning" 5 times
- Document B (1,000 words): mentions "learning" 20 times
Which document is more focused on learning? Raw counts point to Document B (20 > 5). But this conclusion is misleading. Document A dedicates 5% of its words to "learning," while Document B dedicates only 2%. By the proportional measure, the shorter document is more focused on the topic, not less.
Augmented term frequency addresses this by reframing the question. Instead of asking "how many times does this term appear?", it asks: "how important is this term relative to the most important term in this document?"
The insight is that every document has a dominant term—the word that appears most frequently. By measuring other terms against this internal reference point, we create a scale that's consistent across documents regardless of their length. A term with half the count of the dominant term gets weight 0.75, whether that means 5 out of 10 occurrences or 500 out of 1,000.
Augmented term frequency (also called "double normalization" or "augmented normalized term frequency") normalizes each term against the document's maximum term frequency:

$$\mathrm{tf}_{\text{aug}}(t, d) = 0.5 + 0.5 \cdot \frac{\mathrm{tf}(t, d)}{\max_{t' \in d} \mathrm{tf}(t', d)}$$

where:
- $t$: the term being weighted, the specific word we're computing a weight for
- $d$: the document being analyzed
- $\mathrm{tf}(t, d)$: the raw count of term $t$ in document $d$
- $\max_{t' \in d} \mathrm{tf}(t', d)$: the highest term frequency in document $d$; this notation means "find the maximum of $\mathrm{tf}(t', d)$ over all terms $t'$ that appear in $d$"
- The ratio $\mathrm{tf}(t, d) / \max_{t' \in d} \mathrm{tf}(t', d)$ expresses each term's frequency as a fraction of the maximum, yielding values in $[0, 1]$
- The transformation $0.5 + 0.5 \cdot x$ maps this range to $[0.5, 1]$, ensuring all present terms receive at least baseline weight
Deconstructing the Formula
The formula has two components working together:
Step 1: Compute the relative frequency
First, we express each term's count as a fraction of the maximum count in the document:

$$\text{relative frequency} = \frac{\mathrm{tf}(t, d)}{\max_{t' \in d} \mathrm{tf}(t', d)}$$

where:
- $\mathrm{tf}(t, d)$: the raw count of term $t$ in document $d$
- $\max_{t' \in d} \mathrm{tf}(t', d)$: the count of the most frequent term in document $d$ (this notation means "the maximum of $\mathrm{tf}(t', d)$ over all terms $t'$ that appear in $d$")
This ratio normalizes each term against the document's most frequent term, producing values between 0 and 1. The most frequent term gets 1.0; a term appearing half as often gets 0.5.
Step 2: Apply the double normalization
Next, we transform the relative frequency to provide a baseline weight for all terms:

$$\mathrm{tf}_{\text{aug}}(t, d) = 0.5 + 0.5 \cdot \frac{\mathrm{tf}(t, d)}{\max_{t' \in d} \mathrm{tf}(t', d)}$$
The constants 0.5 and 0.5 are design choices that produce the "double normalization" scheme (also called "augmented normalized term frequency"). This transformation maps the [0, 1] range to [0.5, 1.0]. Why not just use the ratio directly? The 0.5 baseline ensures that even rare terms receive meaningful weight, preventing them from being completely overshadowed by the dominant term.
The formula guarantees:
- The most frequent term in any document gets weight 1.0
- All other terms get weights between 0.5 and 1.0, proportional to their relative frequency
- All documents are on a comparable scale regardless of length
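
A minimal sketch of augmented TF over two documents with hypothetical counts (in the first, the most frequent term appears 4 times):

```python
def augmented_tf(counts, k=0.5):
    """Augmented TF: k + (1 - k) * tf / max_tf, with the conventional k = 0.5."""
    max_tf = max(counts.values())
    return {term: k + (1 - k) * tf / max_tf for term, tf in counts.items()}

# Hypothetical raw counts; in doc1 the most frequent term ("learning") appears 4 times
doc_counts = {
    "doc1": {"learning": 4, "machine": 2, "is": 2, "powerful": 1, "data": 1},
    "doc2": {"deep": 2, "neural": 2, "learning": 1, "network": 1, "layers": 1},
}

for name, counts in doc_counts.items():
    weights = {term: round(w, 3) for term, w in augmented_tf(counts).items()}
    print(name, weights)
```

scikit-learn does not expose augmented TF directly, so in practice it is usually computed by hand as above.
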
Each document's most frequent term receives weight 1.0 (the maximum), while other terms scale proportionally into the [0.5, 1.0] range. Notice how Document 1's "learning" (raw=4) gets 1.000, while "machine" (raw=2) gets 0.750, exactly $0.5 + 0.5 \times (2/4)$. The baseline of 0.5 ensures that even terms appearing once receive meaningful weight (0.625 when max_tf=4). This normalization makes cross-document comparison fairer by eliminating the advantage longer documents would otherwise have.
L2-Normalized Frequency Vectors
Augmented TF normalizes against a single reference point—the document's most frequent term. But this approach ignores the broader context of how all terms relate to each other. What if we want a normalization that considers the entire term frequency distribution at once?
This question leads us to think geometrically. Each document's term frequencies can be viewed as coordinates in a high-dimensional space, where every vocabulary word defines an axis. A document with frequencies [4, 0, 2, 1, ...] becomes a point (or equivalently, a vector from the origin to that point) in this space. This geometric perspective reveals a crucial insight: the vector has two distinct properties:
- Direction: which terms the document emphasizes (its topical content)
- Magnitude: how many total word occurrences it contains (related to length)
For comparing document content, only direction matters. Magnitude confounds the comparison by favoring longer documents.
From Counts to Geometry
To build intuition, imagine a vocabulary of just two words: "learning" and "deep". Each document becomes a point in 2D space:
- Document with TF = [4, 0] points along the "learning" axis—purely about learning
- Document with TF = [2, 2] points diagonally at 45°—equal emphasis on both topics
- Document with TF = [0, 3] points along the "deep" axis—purely about deep learning
The angle between two document vectors captures their semantic similarity. Documents pointing in similar directions discuss similar topics; documents at right angles share nothing in common. The length of each vector, however, merely reflects word count—a confounding factor we want to eliminate.
L2 normalization achieves exactly this: it projects every document onto the unit sphere, preserving direction while forcing all vectors to have the same magnitude (length 1). Two documents with identical word proportions but different lengths will map to the same point on the unit sphere.
L2 normalization divides each term frequency by the vector's Euclidean length, projecting the document onto the unit sphere:

$$\mathrm{tf}_{L2}(t, d) = \frac{\mathrm{tf}(t, d)}{\sqrt{\sum_{t' \in d} \mathrm{tf}(t', d)^2}}$$

where:
- $\mathrm{tf}(t, d)$: the raw count of term $t$ in document $d$, the numerator
- $t' \in d$: notation meaning "for all terms $t'$ that appear in document $d$"
- $\mathrm{tf}(t', d)^2$: the squared frequency of each term
- $\sum_{t' \in d} \mathrm{tf}(t', d)^2$: the sum of all squared term frequencies in the document
- $\sqrt{\sum_{t' \in d} \mathrm{tf}(t', d)^2}$: the L2 norm (Euclidean length) of the TF vector, often written compactly as $\lVert \mathbf{tf}_d \rVert_2$

Why squared counts? The L2 norm generalizes the Pythagorean theorem to $n$ dimensions. In 2D, a vector $(x, y)$ has length $\sqrt{x^2 + y^2}$. In $n$ dimensions, we sum all squared components and take the square root. Dividing every component by this length scales the vector to unit length ($\lVert \mathbf{tf}_d \rVert_2 = 1$) while preserving its direction.
Practical benefit: After L2 normalization, cosine similarity between two documents becomes a simple dot product—no need to compute norms during comparison.
Why the L2 Norm?
The L2 norm (also called the Euclidean norm) measures the straight-line distance from the origin to the tip of the vector. For a vector $\mathbf{v} = (v_1, v_2, \ldots, v_n)$, the L2 norm is computed as:

$$\lVert \mathbf{v} \rVert_2 = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$

This formula generalizes the Pythagorean theorem to $n$ dimensions. Dividing each element of the vector by this norm scales the vector to length 1 while preserving its direction. After normalization, every document vector lies on the surface of a unit hypersphere.
This geometric property has a practical benefit: the angle between any two normalized vectors directly measures their content similarity. Documents pointing in similar directions (small angle) have high cosine similarity; documents pointing in different directions (large angle) are dissimilar.
L2 normalization is particularly useful for:
- Efficient similarity computation: Cosine similarity becomes a simple dot product
- Length-invariant comparison: A 100-word and 10,000-word document on the same topic will have similar normalized vectors
- ML compatibility: Many machine learning models assume or benefit from normalized inputs
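
A small NumPy sketch with a toy 3×5 count matrix (illustrative values) shows the mechanics:

```python
import numpy as np

# Toy term-frequency matrix: 3 documents x 5 vocabulary terms (illustrative counts)
tf_matrix = np.array([
    [4.0, 2.0, 2.0, 1.0, 1.0],
    [0.0, 1.0, 2.0, 2.0, 0.0],
    [1.0, 0.0, 0.0, 2.0, 3.0],
])

# L2 norm of each row (document), then divide every row by its own norm
row_norms = np.sqrt((tf_matrix ** 2).sum(axis=1, keepdims=True))
tf_l2 = tf_matrix / row_norms

print("Row norms before:", np.round(row_norms.ravel(), 4))
print("Row norms after: ", np.round(np.linalg.norm(tf_l2, axis=1), 4))  # all 1.0
print("Normalized matrix:\n", np.round(tf_l2, 4))
```
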
The vector norm of 1.0000 for each document confirms successful L2 normalization—all documents now lie on the unit hypersphere. The individual term weights are smaller than other TF variants because they must sum (in squares) to 1. For Document 1, "learning" contributes about 0.53 to the unit vector's direction, making it the dominant component. These normalized values enable efficient similarity computation: the dot product of any two L2-normalized vectors directly gives their cosine similarity.
Geometric Interpretation: Vectors on the Unit Sphere
L2 normalization projects all document vectors onto the surface of a unit hypersphere. In high dimensions this is hard to visualize, but we can illustrate the concept by projecting our documents into 2D using their two most distinctive terms:
On the unit circle, the angle θ between vectors directly measures semantic distance. Documents pointing in similar directions (small θ) have high cosine similarity; documents pointing in different directions (large θ) are dissimilar.
Cosine Similarity with Normalized Vectors
The practical benefit of L2 normalization becomes clear when computing document similarity. Cosine similarity measures the angle between two vectors. Given two document vectors $\mathbf{a}$ and $\mathbf{b}$, cosine similarity is defined as:

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert_2 \, \lVert \mathbf{b} \rVert_2}$$

where:
- $\mathbf{a}, \mathbf{b}$: term frequency vectors for two documents, where each element $a_i$ or $b_i$ represents the frequency of term $i$
- $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i$: the dot product, which sums the element-wise products across all vocabulary terms
- $\lVert \mathbf{a} \rVert_2$: the L2 norm (Euclidean length) of vector $\mathbf{a}$

The numerator captures how much the vectors "agree" (shared terms with high counts contribute positively), while the denominator normalizes by the vectors' magnitudes to ensure the result falls between -1 and 1 (or 0 and 1 for non-negative term frequencies).

When both vectors are already L2-normalized ($\lVert \mathbf{a} \rVert_2 = \lVert \mathbf{b} \rVert_2 = 1$), the denominator becomes 1, and cosine similarity reduces to a simple dot product:

$$\cos(\mathbf{a}, \mathbf{b}) = \mathbf{a} \cdot \mathbf{b}$$
This simplification speeds up computation: comparing millions of documents becomes a matrix multiplication rather than millions of individual normalizations.
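
A sketch of this workflow with scikit-learn, again using short illustrative documents as stand-ins for the corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Illustrative documents standing in for the chapter's corpus
documents = [
    "machine learning is powerful machine learning learns from data",
    "deep learning uses deep neural networks and machine learning ideas",
    "natural language processing applies machine learning to text classification and sentiment",
]

counts = CountVectorizer().fit_transform(documents)   # sparse raw-count matrix
unit_vectors = normalize(counts, norm="l2")           # L2-normalize each row

# For L2-normalized rows, cosine similarity is just the matrix of dot products
similarity = (unit_vectors @ unit_vectors.T).toarray()
print(np.round(similarity, 3))
```
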
The diagonal values of 1.000 confirm that each document is perfectly similar to itself (as expected). The off-diagonal values reveal document relationships: Documents 1 and 2 show the highest similarity (both discuss machine learning and deep learning concepts with shared vocabulary like "learning" and "machine"). Document 3, focused on NLP, shows lower similarity to both—it shares some terms like "machine" and "learning" but introduces distinct vocabulary like "text", "classification", and "sentiment".
Term Frequency Sparsity Patterns
We've explored five ways to weight term frequencies, each with different mathematical properties. But before choosing among them, we need to understand a practical reality that affects all term frequency representations: sparsity.
Real-world term frequency matrices are extremely sparse. A typical English vocabulary contains tens of thousands of words, yet any individual document uses only a few hundred. This means most entries in a document-term matrix are zero. Understanding this sparsity matters for efficient computation and storage.
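
A sketch of how to measure sparsity with `CountVectorizer`; the documents here are placeholders, and the figures quoted in the next paragraph come from the chapter's own corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Replace with your own corpus; these short documents are placeholders
documents = [
    "machine learning is powerful and machine learning learns from data",
    "deep learning uses deep neural networks built from many layers",
    "natural language processing applies machine learning to text classification and sentiment",
]

X = CountVectorizer().fit_transform(documents)   # sparse document-term matrix

n_docs, n_terms = X.shape
nonzero = X.nnz                                  # number of stored non-zero entries
sparsity = 1 - nonzero / (n_docs * n_terms)
docs_per_term = X.getnnz(axis=0).mean()          # average documents per term

print(f"Matrix shape: {n_docs} x {n_terms}")
print(f"Non-zero entries: {nonzero}")
print(f"Sparsity: {sparsity:.1%}")
print(f"Average documents per term: {docs_per_term:.2f}")
```
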
Even this tiny corpus demonstrates high sparsity. With only 6-7 unique terms per document out of 50+ vocabulary words, over 84% of matrix entries are zero. The "average documents per term" metric of ~1.6 indicates most words appear in only one or two documents—a characteristic of natural language where specialized terms have narrow usage.
In production systems with vocabularies of 100,000+ words and documents averaging 200 unique terms, sparsity typically exceeds 99.9%. This extreme sparsity makes dense matrix storage impractical (imagine storing 100 billion zeros for a million-document corpus) and sparse formats essential.
Sparsity Implications
High sparsity has practical consequences:
Memory efficiency: Sparse matrix formats (CSR, CSC) store only non-zero values, reducing memory by orders of magnitude.
Computation speed: Sparse matrix operations skip zero elements, dramatically speeding up matrix multiplication and similarity calculations.
Feature selection: Many terms appear in very few documents, contributing little discriminative power. Pruning rare terms reduces dimensionality without losing much information.
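
To get a feel for the savings, the following sketch compares dense and CSR storage for a synthetic matrix at production-like (99%) sparsity:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Synthetic document-term matrix: 1,000 "documents" x 5,000 "terms", ~1% non-zero
X_sparse = sparse_random(1000, 5000, density=0.01, format="csr", dtype=np.float64)
X_dense = X_sparse.toarray()

dense_bytes = X_dense.nbytes
# CSR stores three arrays: the non-zero values, their column indices, and row pointers
sparse_bytes = X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes

print(f"Dense storage:     {dense_bytes / 1e6:.1f} MB")
print(f"Sparse (CSR):      {sparse_bytes / 1e6:.1f} MB")
print(f"Compression ratio: {dense_bytes / sparse_bytes:.0f}x")
```
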
The sparse CSR format achieves substantial memory savings even on this small matrix. The compression ratio directly reflects sparsity: at 84% sparsity, we save roughly 60% of memory. At production-scale sparsity (99%+), the ratio improves dramatically—a corpus of 1 million documents with 100,000 vocabulary terms could shrink from 800 GB (dense float64) to under 10 GB with sparse storage. This difference often determines whether an application fits in memory at all.
How Sparsity Scales with Vocabulary Size
As vocabulary grows, sparsity increases dramatically. This plot shows the relationship between vocabulary size and matrix sparsity:
The key insight: document length stays roughly constant as vocabulary grows, so the ratio of non-zero to total entries shrinks. This is why sparse matrix formats become essential at scale.
Efficient Term Frequency Computation
For production systems, efficiency matters. Let's compare different approaches to computing term frequency:
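
The benchmark below is a rough sketch on synthetic data; absolute and relative timings will vary with corpus size, document length, tokenization settings, and hardware:

```python
import random
import time
from sklearn.feature_extraction.text import CountVectorizer

# Synthetic corpus: 2,000 documents of 50 words drawn from a 1,000-word vocabulary
random.seed(0)
vocabulary = [f"word{i}" for i in range(1000)]
documents = [" ".join(random.choices(vocabulary, k=50)) for _ in range(2000)]

def manual_tf_matrix(docs):
    """Pure-Python baseline: build a shared vocabulary, then fill a dense count matrix."""
    vocab = {}
    tokenized = [doc.split() for doc in docs]
    for tokens in tokenized:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    matrix = [[0] * len(vocab) for _ in docs]
    for row, tokens in zip(matrix, tokenized):
        for token in tokens:
            row[vocab[token]] += 1
    return matrix

start = time.perf_counter()
manual_tf_matrix(documents)
print(f"Manual dense counting: {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
CountVectorizer().fit_transform(documents)
print(f"CountVectorizer:       {time.perf_counter() - start:.3f} s")
```
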
The benchmark demonstrates scikit-learn's significant performance advantage. The speedup comes from multiple optimizations: compiled Cython/C code paths, efficient sparse matrix construction that avoids intermediate dense allocations, and vectorized string operations. The manual Counter approach requires Python-level iteration and dictionary operations, which carry interpreter overhead.
For production applications processing thousands or millions of documents, this performance gap becomes critical. Always prefer library implementations over custom code unless you have specific requirements they can't meet.
CountVectorizer TF Variants
CountVectorizer supports different term frequency schemes through its parameters:
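
A sketch of the three configurations on one stand-in document in which "learning" appears four times:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# One stand-in document; "learning" appears four times
docs = [
    "Machine learning is the powerful field of study where machine learning is applied. "
    "Because learning from data is the heart of models, learning never stops improving "
    "over time with data."
]

# Raw counts
print("raw:   ", CountVectorizer().fit_transform(docs).toarray()[0])

# Boolean presence: binary=True collapses every positive count to 1
print("binary:", CountVectorizer(binary=True).fit_transform(docs).toarray()[0])

# L2-normalized term frequency: TfidfVectorizer with IDF switched off
l2 = TfidfVectorizer(use_idf=False, norm="l2")
print("l2:    ", l2.fit_transform(docs).toarray()[0].round(4))
print("terms: ", l2.get_feature_names_out())
```
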
The output demonstrates scikit-learn's flexibility in producing different TF representations from the same input. Raw counts preserve exact frequencies (useful for models that can learn their own weighting), binary collapses all counts to 1 (ideal for set-based operations), and L2-normalized values create unit vectors ready for cosine similarity. Note that "learning" with raw count 4 becomes 0.5345 after L2 normalization—this reflects its proportional contribution to the document vector's direction. The L2 values will sum (in squares) to 1.0 across all terms in the document.
Choosing a TF Variant
We've now covered the full spectrum of term frequency weighting schemes, from raw counts to sophisticated normalizations. Each addresses a specific problem:
- Raw TF gives you the basic signal but overweights repetition
- Log-scaled TF compresses high counts, modeling diminishing returns
- Boolean TF ignores frequency entirely, focusing on presence
- Augmented TF normalizes within each document, handling length variation
- L2-normalized TF projects documents onto a unit sphere, enabling efficient similarity computation
Which variant should you use? It depends on your task:
| Variant | Formula | Best For |
|---|---|---|
| Raw | $\mathrm{tf}(t, d)$ | When exact counts matter, baseline models |
| Log-scaled | $1 + \ln(\mathrm{tf}(t, d))$ | General purpose, TF-IDF computation |
| Boolean | $1$ if present, else $0$ | Topic detection, set-based matching |
| Augmented | $0.5 + 0.5 \cdot \mathrm{tf}(t, d) / \max_{t'} \mathrm{tf}(t', d)$ | Cross-document comparison, length normalization |
| L2-normalized | $\mathrm{tf}(t, d) / \lVert \mathbf{tf}_d \rVert_2$ | Cosine similarity, neural network inputs |
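
As a sketch, plotting the first four variants against raw counts makes their shapes easy to compare (L2 normalization is omitted because it depends on the entire document vector; the augmented curve assumes the document's maximum count is 20):

```python
import numpy as np
import matplotlib.pyplot as plt

counts = np.arange(1, 21)

raw = counts
log_scaled = 1 + np.log(counts)
boolean = np.ones_like(counts)
augmented = 0.5 + 0.5 * counts / counts.max()   # assumes max_tf = 20 in this document

plt.plot(counts, raw, label="raw")
plt.plot(counts, log_scaled, label="log-scaled")
plt.plot(counts, boolean, label="boolean")
plt.plot(counts, augmented, label="augmented (max_tf = 20)")
plt.xlabel("Raw term count")
plt.ylabel("Weight")
plt.legend()
plt.show()
```
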
The transformation curves reveal each variant's philosophy: raw TF treats all counts linearly, log-scaled compresses high values, boolean ignores magnitude entirely, and augmented/L2 normalize relative to other terms.
Limitations and Impact
Term frequency, in all its variants, captures only one dimension of word importance: how often a term appears in a document. This ignores a key question: how informative is this term across the entire corpus?
The "the" problem: Common words like "the", "is", and "a" appear frequently in almost every document. High TF doesn't distinguish documents when every document has high TF for the same words.
No corpus context: TF treats each document in isolation. A term appearing 5 times might be significant in a corpus where it's rare, or meaningless in a corpus where every document mentions it.
Length sensitivity: Despite normalization schemes, longer documents naturally contain more term occurrences, potentially biasing similarity calculations.
These limitations motivate Inverse Document Frequency (IDF), which we'll cover in the next chapter. IDF asks: how rare is this term across the corpus? Combining TF with IDF produces TF-IDF, one of the most successful text representations in information retrieval.
Term frequency laid the groundwork for quantifying word importance. The variants we explored (raw counts, log-scaling, boolean, augmented, and L2-normalized) each address different aspects of the counting problem. Understanding these foundations prepares you to appreciate why TF-IDF works and when to use its variants.
Summary
Term frequency transforms word counts into weighted signals of importance. The key variants each serve different purposes:
- Raw term frequency counts occurrences directly, but overweights repeated terms
- Log-scaled TF ($1 + \ln(\mathrm{tf})$) compresses high counts, capturing diminishing returns of repetition
- Boolean TF reduces to presence/absence, useful when topic coverage matters more than emphasis
- Augmented TF normalizes by maximum frequency, enabling fair cross-document comparison
- L2-normalized TF creates unit vectors, making cosine similarity a simple dot product
Term frequency matrices are extremely sparse in practice, with 99%+ zeros for realistic vocabularies. Sparse matrix formats and optimized libraries like scikit-learn's CountVectorizer make efficient computation possible.
The main limitation of TF is its document-centric view. A term appearing frequently might be common across all documents (uninformative) or rare and distinctive. The next chapter introduces Inverse Document Frequency to address this, setting the stage for TF-IDF.
Key Functions and Parameters
When working with term frequency in scikit-learn, two classes handle most use cases:
`CountVectorizer(lowercase, min_df, max_df, binary, ngram_range, max_features)`

- `lowercase` (default: `True`): Convert text to lowercase before tokenization. Disable for case-sensitive applications.
- `min_df`: Minimum document frequency. Integer for absolute count, float for proportion. Use `min_df=2` to remove typos and rare words.
- `max_df`: Maximum document frequency. Use `max_df=0.95` to filter extremely common words.
- `binary` (default: `False`): Set to `True` for boolean term frequency where only presence matters.
- `ngram_range` (default: `(1, 1)`): Tuple of (min_n, max_n). Use `(1, 2)` to include bigrams.
- `max_features`: Limit vocabulary to top N most frequent terms for dimensionality control.
`TfidfVectorizer(use_idf, norm, sublinear_tf)`

- `use_idf` (default: `True`): Set to `False` to compute only term frequency without IDF weighting.
- `norm` (default: `'l2'`): Vector normalization. Use `'l2'` for cosine similarity, `'l1'` for Manhattan distance, or `None` for raw values.
- `sublinear_tf` (default: `False`): Set to `True` to apply log-scaling: replaces tf with $1 + \log(\mathrm{tf})$.











![The augmented TF transformation maps relative frequencies [0,1] to [0.5,1.0]. The 0.5 baseline ensures rare terms maintain meaningful weight.](https://cnassets.uk/notebooks/6_term_frequency_files/augmented-tf-transformation-curve.png)

![Augmented TF normalizes by maximum term frequency, producing values in [0.5, 1.0].](https://cnassets.uk/notebooks/6_term_frequency_files/tf-variants-heatmap-augmented.png)

![Augmented TF normalizes by max frequency. All values fall in [0.5, 1.0].](https://cnassets.uk/notebooks/6_term_frequency_files/tf-variants-comparison-augmented.png)