Term Frequency: Complete Guide to TF Weighting Schemes for Text Analysis

Michael Brenndoerfer · December 8, 2025 · 34 min read · 8,120 words

Master term frequency weighting schemes including raw TF, log-scaled, boolean, augmented, and L2-normalized variants. Learn when to use each approach for information retrieval and NLP.

This article is part of the free-to-read Language AI Handbook.

Term Frequency

How many times does a word appear in a document? This simple question leads to surprisingly complex answers. Raw counts tell you something, but a word appearing 10 times in a 100-word email means something different than 10 times in a 10,000-word novel. Term frequency weighting schemes address this by transforming raw counts into more meaningful signals.

In the Bag of Words chapter, we counted words. Now we refine those counts. Term frequency (TF) is the foundation of TF-IDF, one of the most successful text representations in information retrieval. But TF alone comes in many flavors: raw counts, log-scaled frequencies, boolean indicators, and normalized variants. Each captures different aspects of word importance.

This chapter explores these variants systematically. You'll learn when raw counts mislead, why logarithms help, and how normalization enables fair comparison across documents of different lengths. By the end, you'll understand the design choices behind term frequency and be ready to combine TF with inverse document frequency in the next chapter.

Raw Term Frequency

The simplest approach counts how many times each term appears in a document. If "learning" appears 5 times in a document, its raw term frequency is 5.

Term Frequency (TF)

Term frequency measures how often a term appears in a document. The raw term frequency $\text{tf}(t, d)$ is simply the count of term $t$ in document $d$, where:

  • $t$: a term (word) in the vocabulary
  • $d$: a document in the corpus

Let's implement this from scratch:

In[2]:
from collections import Counter
import re

# Sample documents about machine learning
documents = [
    "Machine learning is a subset of artificial intelligence. Machine learning uses data to learn patterns. Learning from data is powerful.",
    "Deep learning is a type of machine learning. Deep learning uses neural networks. Neural networks learn hierarchical representations.",
    "Natural language processing uses machine learning. Text classification and sentiment analysis are NLP tasks.",
]

def compute_raw_tf(document):
    """Compute raw term frequency for a document."""
    # Tokenize: lowercase and extract alphabetic tokens
    tokens = re.findall(r'\b[a-z]+\b', document.lower())
    return Counter(tokens)

# Compute TF for each document
tf_docs = [compute_raw_tf(doc) for doc in documents]
Out[3]:
Raw Term Frequencies:
============================================================

Document 1:
  learning         3  ███
  machine          2  ██
  is               2  ██
  data             2  ██
  a                1  █
  subset           1  █
  of               1  █
  artificial       1  █

Document 2:
  learning         3  ███
  deep             2  ██
  neural           2  ██
  networks         2  ██
  is               1  █
  a                1  █
  type             1  █
  of               1  █

Document 3:
  natural          1  █
  language         1  █
  processing       1  █
  uses             1  █
  machine          1  █
  learning         1  █
  text             1  █
  classification   1  █

Notice how "learning" dominates Document 1 with 3 occurrences, while "deep" and "neural" characterize Document 2. These raw counts capture topic signals, but they also reveal a problem: common words like "is" and "a" appear frequently without carrying much meaning.

The Problem with Raw Counts

Raw term frequency has a proportionality problem. If "learning" appears twice as often as "data", does that mean it's twice as important? Probably not. The relationship between word count and importance is sublinear: the difference between 1 and 2 occurrences is more significant than the difference between 10 and 11.

In[4]:
import numpy as np

# Demonstrate the proportionality problem
example_doc = """
Machine learning machine learning machine learning machine learning machine learning.
Data science uses machine learning. Data analysis is important.
"""

tf_example = compute_raw_tf(example_doc)
Out[5]:
Proportionality Problem Example:
--------------------------------------------------
'machine' appears 6 times
'data' appears 2 times
'science' appears 1 time

Is 'machine' really 6x more important than 'science'?
Is 'machine' really 3x more important than 'data'?

Raw counts overweight repeated terms.

A document that mentions "machine" six times isn't necessarily six times more about machines than a document mentioning it once. After the first few occurrences, additional repetitions add diminishing information. This observation motivates log-scaled term frequency.

Term Frequency Distribution

Before moving to log-scaling, let's examine how term frequencies distribute across a document. Natural language follows Zipf's law: a few words appear very frequently, while most words appear rarely.

Out[6]:
Visualization
Histogram showing term frequency distribution with most terms appearing once and few terms appearing multiple times.
Term frequency distribution in Document 1 follows a power law pattern typical of natural language. A small number of terms (like 'learning' and 'machine') dominate, while most terms appear only once. This heavy-tailed distribution motivates log-scaling to reduce the influence of high-frequency terms.

The rank-frequency plot shows the characteristic "long tail" of natural language: frequency drops rapidly with rank. Log-scaling addresses this by compressing the high-frequency end of this distribution.

Log-Scaled Term Frequency

The proportionality problem highlights a key insight: the relationship between word frequency and importance is not linear. Seeing a word twice tells you more than seeing it once, but seeing it the hundredth time adds almost nothing new. We need a function that grows quickly at first, then slows down. The logarithm is exactly such a function.

Why Logarithms?

Consider what properties our ideal transformation should have:

  1. Monotonicity: More occurrences should still mean higher weight (we don't want to lose ranking information)
  2. Sublinearity: The weight should grow slower than the count (diminishing returns)
  3. Bounded compression: Large counts shouldn't completely dominate

The logarithm satisfies all three. It converts multiplicative relationships into additive ones: if one term appears 10 times more often than another, the log-scaled difference is just $\log(10) \approx 2.3$, not 10. This compression matches how information works: the first mention of a concept is informative, but repetition adds progressively less.

Log-Scaled Term Frequency

Log-scaled term frequency dampens the effect of high counts:

$$
\text{tf}_{\log}(t, d) =
\begin{cases}
1 + \log(\text{tf}(t, d)) & \text{if } \text{tf}(t, d) > 0 \\
0 & \text{otherwise}
\end{cases}
$$

where:

  • $\text{tf}(t, d)$: the raw count of term $t$ in document $d$
  • $\log$: the natural logarithm (base $e$)
  • The $+1$ ensures that a term appearing once gets weight 1 (since $\log(1) = 0$)
  • The piecewise definition handles absent terms (count = 0)

Understanding the Formula

The formula $1 + \log(\text{tf})$ is carefully constructed. Let's trace through what happens at different frequency levels:

| Raw Count | Calculation | Log-Scaled Weight |
|---|---|---|
| 1 | $1 + \log(1) = 1 + 0$ | 1.00 |
| 2 | $1 + \log(2) \approx 1 + 0.69$ | 1.69 |
| 10 | $1 + \log(10) \approx 1 + 2.30$ | 3.30 |
| 100 | $1 + \log(100) \approx 1 + 4.61$ | 5.61 |

The 100x increase in raw count becomes only a 5.6x increase in log-scaled weight. A term appearing 100 times doesn't dominate one appearing once. It's weighted roughly 5-6 times higher, not 100 times.
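These numbers are easy to verify directly. A minimal check of the table above using NumPy's natural logarithm:

import numpy as np

# Reproduce the table: 1 + ln(count) for counts 1, 2, 10, and 100
counts = np.array([1, 2, 10, 100])
print(np.round(1 + np.log(counts), 2))  # [1.   1.69 3.3  5.61]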

In[7]:
import numpy as np

def compute_log_tf(document):
    """Compute log-scaled term frequency."""
    raw_tf = compute_raw_tf(document)
    log_tf = {}
    for term, count in raw_tf.items():
        if count > 0:
            log_tf[term] = 1 + np.log(count)
        else:
            log_tf[term] = 0
    return log_tf

# Compare raw vs log TF for Document 1
raw_tf_doc1 = tf_docs[0]
log_tf_doc1 = compute_log_tf(documents[0])
Out[8]:
Raw vs Log-Scaled Term Frequency (Document 1):
-------------------------------------------------------
Term                Raw TF       Log TF        Ratio
-------------------------------------------------------
learning                 3         2.10         1.43
machine                  2         1.69         1.18
is                       2         1.69         1.18
data                     2         1.69         1.18
a                        1         1.00         1.00
subset                   1         1.00         1.00
of                       1         1.00         1.00
artificial               1         1.00         1.00
intelligence             1         1.00         1.00
uses                     1         1.00         1.00

The log transformation compresses the range. "Learning" with 3 raw occurrences gets a log TF of about 2.1, not 3x the weight of a single-occurrence term. This better reflects the diminishing returns of word repetition.

Out[9]:
Visualization
Bar chart comparing raw and log-scaled term frequencies for top terms, showing compression of high counts.
Comparison of raw term frequency and log-scaled term frequency. The logarithmic transformation compresses high counts, reducing the dominance of frequently repeated terms while preserving their relative ordering.

Visualizing the Log Transformation

The logarithm's compression effect becomes clearer when we plot raw counts against their log-scaled equivalents:

Out[10]:
Visualization
Line plot showing the sublinear relationship between raw term frequency and log-scaled values.
The log transformation curve showing how raw term frequency maps to log-scaled values. The sublinear relationship means that doubling the raw count does not double the weight. A term appearing 100 times gets only about 5.6x the weight of a term appearing once.

The curve flattens as raw counts increase. This sublinear relationship captures the intuition that the 10th occurrence of a word adds less information than the first.

Out[11]:
Visualization
Two plots showing marginal and cumulative information gain from word repetition.
Diminishing returns: the marginal information gain from each additional word occurrence. Left: The derivative of log(tf) shows that the "information boost" from seeing a word one more time decreases rapidly. The first occurrence contributes 1.0, the second contributes 0.5, and by the 10th occurrence, each additional appearance adds only 0.1. Right: Cumulative information (log-scaled TF) grows quickly at first, then plateaus.

Boolean Term Frequency

Log-scaling addresses the proportionality problem, but what if we push this idea to its extreme? If diminishing returns are the issue, why not treat all non-zero counts identically? This reasoning leads to boolean term frequency, the simplest possible weighting scheme.

Boolean TF asks only one question: does this term appear in this document? The answer is binary: yes (1) or no (0). A term mentioned once gets the same weight as a term mentioned a hundred times.

Boolean Term Frequency

Boolean term frequency ignores how many times a term appears:

$$
\text{tf}_{\text{bool}}(t, d) =
\begin{cases}
1 & \text{if } t \in d \\
0 & \text{otherwise}
\end{cases}
$$

This treats all present terms equally, regardless of frequency.

This might seem like throwing away information, but boolean TF is useful when:

  • You care about topic coverage, not emphasis
  • Repeated terms might indicate spam or manipulation
  • The task is set-based (does this document mention these concepts?)
In[12]:
def compute_boolean_tf(document):
    """Compute boolean term frequency."""
    raw_tf = compute_raw_tf(document)
    return {term: 1 for term in raw_tf.keys()}

# Compare all three TF variants
raw_tf = compute_raw_tf(documents[0])
log_tf = compute_log_tf(documents[0])
bool_tf = compute_boolean_tf(documents[0])
Out[13]:
Three TF Variants Compared (Document 1):
------------------------------------------------------------
Term                 Raw        Log    Boolean
------------------------------------------------------------
learning               3       2.10          1
machine                2       1.69          1
is                     2       1.69          1
data                   2       1.69          1
a                      1       1.00          1
subset                 1       1.00          1
of                     1       1.00          1
artificial             1       1.00          1
intelligence           1       1.00          1
uses                   1       1.00          1

With boolean TF, "learning" (3 occurrences) and "powerful" (1 occurrence) get equal weight. This might seem like information loss, but for some tasks, knowing a document discusses "learning" at all is more important than knowing it discusses it repeatedly.

Augmented Term Frequency

So far, we've addressed how to weight individual term counts, but we haven't tackled a more fundamental problem: document length. A 10,000-word document will naturally have higher raw term frequencies than a 100-word document, even if both discuss the same topic with equal emphasis. When comparing documents, this length bias can drown out meaningful differences in content.

The Document Length Problem

Consider two documents about machine learning:

  • Document A (100 words): mentions "learning" 5 times
  • Document B (1,000 words): mentions "learning" 20 times

Which document is more focused on learning? Raw counts suggest Document B, but proportionally, Document A dedicates 5% of its words to "learning" while Document B dedicates only 2%. The shorter document is actually more focused on the topic.

Out[14]:
Visualization
Bar charts comparing raw and normalized term frequencies between short and long documents.
The document length bias problem illustrated. Left: Raw term frequencies favor longer documents even when shorter documents are more focused on a topic. Right: When normalized by document length (or maximum term frequency), the shorter document's higher concentration becomes apparent. Document A dedicates 5% of its content to "learning" vs Document B's 2%.
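A quick calculation makes the bias concrete. This sketch uses the hypothetical counts from Document A and Document B above (illustrative numbers, not part of our corpus):

# (occurrences of "learning", total words) for the two hypothetical documents
doc_a = (5, 100)
doc_b = (20, 1_000)

for name, (count, length) in [("Document A", doc_a), ("Document B", doc_b)]:
    print(f"{name}: raw tf = {count}, share of document = {count / length:.1%}")
# Document A: raw tf = 5, share of document = 5.0%
# Document B: raw tf = 20, share of document = 2.0%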

Augmented term frequency solves this by asking: how important is this term relative to the most important term in this document? Rather than comparing raw counts across documents, we compare proportions. The most frequent term in any document becomes the reference point.

Augmented Term Frequency

Augmented term frequency normalizes by the maximum term frequency in the document:

$$
\text{tf}_{\text{aug}}(t, d) = 0.5 + 0.5 \times \frac{\text{tf}(t, d)}{\max_{t' \in d} \text{tf}(t', d)}
$$

where:

  • $t$: the term being weighted
  • $d$: the document
  • $\text{tf}(t, d)$: raw count of term $t$ in document $d$
  • $\max_{t' \in d} \text{tf}(t', d)$: the highest term frequency in document $d$ (the count of the most common term)

Deconstructing the Formula

The formula has two components working together:

Step 1: Compute the relative frequency

$$
\text{relative}(t, d) = \frac{\text{tf}(t, d)}{\max_{t' \in d} \text{tf}(t', d)}
$$

This ratio normalizes each term against the document's most frequent term, producing values between 0 and 1. The most frequent term gets 1.0; a term appearing half as often gets 0.5.

Step 2: Apply the double normalization

$$
\text{tf}_{\text{aug}}(t, d) = 0.5 + 0.5 \times \text{relative}(t, d)
$$

This transformation maps the [0, 1] range to [0.5, 1.0]. Why not just use the ratio directly? The 0.5 baseline ensures that even rare terms receive meaningful weight, preventing them from being completely overshadowed by the dominant term.

The formula guarantees:

  • The most frequent term in any document gets weight 1.0
  • All other terms get weights between 0.5 and 1.0, proportional to their relative frequency
  • All documents are on a comparable scale regardless of length
Out[15]:
Visualization
Two plots showing the augmented TF transformation function and its effect on term weights.
The augmented TF transformation maps relative frequencies to the [0.5, 1.0] range. Left: The linear mapping 0.5 + 0.5×(tf/max_tf) ensures a 0.5 baseline for rare terms. Right: Comparing raw counts vs augmented weights for a sample document shows how the transformation compresses the range while preserving relative ordering.
In[16]:
def compute_augmented_tf(document):
    """Compute augmented (double normalization) term frequency."""
    raw_tf = compute_raw_tf(document)
    if not raw_tf:
        return {}
    max_tf = max(raw_tf.values())
    augmented_tf = {}
    for term, count in raw_tf.items():
        augmented_tf[term] = 0.5 + 0.5 * (count / max_tf)
    return augmented_tf

# Compute augmented TF for all documents
aug_tf_docs = [compute_augmented_tf(doc) for doc in documents]
Out[17]:
Augmented Term Frequency (normalized to [0.5, 1.0]):
======================================================================

Document 1 (max raw tf = 3):
  learning        raw= 3  aug=1.000  ████████████████████
  machine         raw= 2  aug=0.833  ████████████████
  is              raw= 2  aug=0.833  ████████████████
  data            raw= 2  aug=0.833  ████████████████
  a               raw= 1  aug=0.667  █████████████
  subset          raw= 1  aug=0.667  █████████████

Document 2 (max raw tf = 3):
  learning        raw= 3  aug=1.000  ████████████████████
  deep            raw= 2  aug=0.833  ████████████████
  neural          raw= 2  aug=0.833  ████████████████
  networks        raw= 2  aug=0.833  ████████████████
  is              raw= 1  aug=0.667  █████████████
  a               raw= 1  aug=0.667  █████████████

Document 3 (max raw tf = 1):
  natural         raw= 1  aug=1.000  ████████████████████
  language        raw= 1  aug=1.000  ████████████████████
  processing      raw= 1  aug=1.000  ████████████████████
  uses            raw= 1  aug=1.000  ████████████████████
  machine         raw= 1  aug=1.000  ████████████████████
  learning        raw= 1  aug=1.000  ████████████████████

The most frequent term in each document gets 1.0, while other terms scale proportionally. This makes cross-document comparison fairer.

Out[18]:
Visualization
Grouped bar chart showing augmented term frequencies across three documents with normalized scales.
Augmented term frequency normalizes each document independently, scaling the most frequent term to 1.0 and others proportionally. This allows fair comparison between documents of different lengths and vocabulary distributions.

L2-Normalized Frequency Vectors

Augmented TF normalizes against the single most frequent term, but what if we want to consider all terms simultaneously? This leads us to think geometrically: each document's term frequencies form a vector in high-dimensional space, where each dimension corresponds to a vocabulary word. The vector's direction captures what the document is about, while its magnitude reflects document length.

From Counts to Geometry

Imagine a vocabulary of just two words: "learning" and "deep". Each document becomes a point in 2D space:

  • Document with TF = [4, 0] points along the "learning" axis
  • Document with TF = [2, 2] points diagonally between both axes
  • Document with TF = [0, 3] points along the "deep" axis

The direction of each vector tells us what the document emphasizes. The magnitude (length) tells us how many total word occurrences it contains. If we only care about content similarity, not length, we should normalize all vectors to the same magnitude.

L2 normalization projects every document onto the unit sphere, preserving direction while eliminating length differences. Two documents with the same word proportions but different lengths will map to the same point on the unit sphere.
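Before formalizing this, a quick numerical check: documents with the same word proportions but different lengths collapse onto the same unit vector. A minimal sketch using the two-word vocabulary above:

import numpy as np

# Term frequency vectors over the vocabulary ["learning", "deep"]
short_doc = np.array([4.0, 0.0])    # "learning" four times
long_doc = np.array([40.0, 0.0])    # same proportions, ten times longer
mixed_doc = np.array([2.0, 2.0])    # equal emphasis on both words

for name, vec in [("short", short_doc), ("long", long_doc), ("mixed", mixed_doc)]:
    unit = vec / np.linalg.norm(vec)
    print(f"{name:>5}: {unit}")
# short: [1. 0.]
#  long: [1. 0.]
# mixed: [0.70710678 0.70710678]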

L2 Normalization

L2 normalization divides each term frequency by the vector's Euclidean length:

$$
\text{tf}_{\text{L2}}(t, d) = \frac{\text{tf}(t, d)}{\sqrt{\sum_{t' \in d} \text{tf}(t', d)^2}}
$$

where:

  • $\text{tf}(t, d)$: raw count of term $t$ in document $d$
  • $\sum_{t' \in d} \text{tf}(t', d)^2$: sum of squared term frequencies across all terms in $d$
  • $\|\mathbf{tf}_d\|_2 = \sqrt{\sum_{t' \in d} \text{tf}(t', d)^2}$: the L2 norm (Euclidean length) of the TF vector

The resulting vector has unit length ($\|\mathbf{tf}_d\|_2 = 1$), making cosine similarity equivalent to a simple dot product.

Why the L2 Norm?

The L2 norm measures the Euclidean distance from the origin, the straight-line length of the vector. Dividing by this norm scales the vector to length 1 while preserving its direction. After normalization, every document vector lies on the surface of a unit hypersphere.

This geometric property has a practical benefit: the angle between any two normalized vectors directly measures their content similarity. Documents pointing in similar directions (small angle) have high cosine similarity; documents pointing in different directions (large angle) are dissimilar.

L2 normalization is particularly useful for:

  • Efficient similarity computation: Cosine similarity becomes a simple dot product
  • Length-invariant comparison: A 100-word and 10,000-word document on the same topic will have similar normalized vectors
  • ML compatibility: Many machine learning models assume or benefit from normalized inputs
In[19]:
import numpy as np

def compute_l2_normalized_tf(document, vocabulary):
    """Compute L2-normalized term frequency vector."""
    raw_tf = compute_raw_tf(document)
    
    # Create vector in vocabulary order
    vector = np.array([raw_tf.get(term, 0) for term in vocabulary], dtype=float)
    
    # L2 normalize
    norm = np.linalg.norm(vector)
    if norm > 0:
        vector = vector / norm
    
    return vector

# Build vocabulary from all documents
import re

all_tokens = []
for doc in documents:
    tokens = re.findall(r'\b[a-z]+\b', doc.lower())
    all_tokens.extend(tokens)
vocabulary = sorted(set(all_tokens))

# Compute L2-normalized vectors
l2_vectors = [compute_l2_normalized_tf(doc, vocabulary) for doc in documents]
Out[20]:
L2-Normalized Term Frequency Vectors:
============================================================

Document 1 (vector norm = 1.0000):
  learning        0.5303  ██████████████████████████
  data            0.3536  █████████████████
  is              0.3536  █████████████████
  machine         0.3536  █████████████████
  a               0.1768  ████████
  artificial      0.1768  ████████

Document 2 (vector norm = 1.0000):
  learning        0.5477  ███████████████████████████
  deep            0.3651  ██████████████████
  networks        0.3651  ██████████████████
  neural          0.3651  ██████████████████
  a               0.1826  █████████
  hierarchical    0.1826  █████████

Document 3 (vector norm = 1.0000):
  analysis        0.2673  █████████████
  and             0.2673  █████████████
  are             0.2673  █████████████
  classification  0.2673  █████████████
  language        0.2673  █████████████
  learning        0.2673  █████████████

Each vector now has unit length (norm = 1.0). The values represent the relative contribution of each term to the document's "direction" in vocabulary space.

Geometric Interpretation: Vectors on the Unit Sphere

L2 normalization projects all document vectors onto the surface of a unit hypersphere. In high dimensions this is hard to visualize, but we can illustrate the concept by projecting our documents into 2D using their two most distinctive terms:

Out[21]:
Visualization
Two scatter plots showing document vectors before and after L2 normalization, with normalized vectors on a unit circle.
Geometric interpretation of L2 normalization. Left: Raw term frequency vectors have different lengths depending on document size. Right: After L2 normalization, all vectors lie on the unit circle, and the angle between vectors directly measures content similarity. The shorter Document 3 vector becomes comparable to longer documents.

On the unit circle, the angle θ between vectors directly measures semantic distance. Documents pointing in similar directions (small θ) have high cosine similarity; documents pointing in different directions (large θ) are dissimilar.

Cosine Similarity with Normalized Vectors

The practical benefit of L2 normalization becomes clear when computing document similarity. Cosine similarity measures the angle between two vectors, defined as:

$$
\text{cosine}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|_2 \, \|\mathbf{b}\|_2}
$$

where $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i$ is the dot product and $\|\cdot\|_2$ denotes the L2 norm.

When both vectors are already L2-normalized ($\|\mathbf{a}\|_2 = \|\mathbf{b}\|_2 = 1$), the denominator becomes 1, and cosine similarity reduces to a simple dot product:

$$
\text{cosine}(\mathbf{a}, \mathbf{b}) = \mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i
$$

This simplification speeds up computation: comparing millions of documents becomes a matrix multiplication rather than millions of individual normalizations.

In[22]:
# Compute cosine similarity matrix
def cosine_similarity_matrix(vectors):
    """Compute pairwise cosine similarity for L2-normalized vectors."""
    n = len(vectors)
    sim_matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # For L2-normalized vectors, cosine = dot product
            sim_matrix[i, j] = np.dot(vectors[i], vectors[j])
    return sim_matrix

similarity = cosine_similarity_matrix(l2_vectors)
Out[23]:
Cosine Similarity Matrix (using L2-normalized TF):
---------------------------------------------
               Doc 1     Doc 2     Doc 3
---------------------------------------------
     Doc 1     1.000     0.549     0.283
     Doc 2     0.549     1.000     0.244
     Doc 3     0.283     0.244     1.000

Documents 1 and 2 are most similar (both discuss machine learning and deep learning), while Document 3 (about NLP) shows lower similarity to both.
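The double loop above keeps the logic explicit, but because every vector already has unit length, the entire matrix is just one matrix product. A minimal equivalent sketch, assuming the l2_vectors list computed earlier:

import numpy as np

# Stack the normalized document vectors into an (n_docs, vocab_size) matrix
doc_matrix = np.vstack(l2_vectors)

# For unit-length rows, the matrix of pairwise dot products is the cosine similarity matrix
similarity_fast = doc_matrix @ doc_matrix.T
print(np.round(similarity_fast, 3))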

Out[24]:
Visualization
Heatmap showing pairwise cosine similarities between three documents, with diagonal values of 1.0.
Cosine similarity matrix computed from L2-normalized term frequency vectors. Higher values indicate more similar vocabulary usage. Documents 1 and 2 share substantial vocabulary about machine learning, while Document 3 has distinct NLP-focused terms.

Term Frequency Sparsity Patterns

We've explored five ways to weight term frequencies, each with different mathematical properties. But before choosing among them, we need to understand a practical reality that affects all term frequency representations: sparsity.

Real-world term frequency matrices are extremely sparse. A typical English vocabulary contains tens of thousands of words, yet any individual document uses only a few hundred. This means most entries in a document-term matrix are zero. Understanding this sparsity matters for efficient computation and storage.

In[25]:
# Analyze sparsity with a larger corpus
extended_corpus = [
    "Machine learning algorithms learn patterns from data.",
    "Deep learning neural networks require large datasets.",
    "Natural language processing extracts meaning from text.",
    "Computer vision analyzes images using deep learning.",
    "Reinforcement learning agents learn through trial and error.",
    "Supervised learning uses labeled training examples.",
    "Unsupervised learning discovers hidden patterns.",
    "Transfer learning adapts pretrained models to new tasks.",
    "Feature engineering improves machine learning performance.",
    "Gradient descent optimizes neural network parameters."
]

# Build vocabulary and count matrix
import re

all_tokens_ext = []
for doc in extended_corpus:
    tokens = re.findall(r'\b[a-z]+\b', doc.lower())
    all_tokens_ext.extend(tokens)
vocab_ext = sorted(set(all_tokens_ext))
term_to_idx = {term: j for j, term in enumerate(vocab_ext)}

# Create term frequency matrix (dict lookup avoids repeated list scans)
tf_matrix = np.zeros((len(extended_corpus), len(vocab_ext)))
for i, doc in enumerate(extended_corpus):
    tokens = re.findall(r'\b[a-z]+\b', doc.lower())
    for token in tokens:
        tf_matrix[i, term_to_idx[token]] += 1
Out[26]:
Term Frequency Matrix Sparsity Analysis:
==================================================
Corpus size: 10 documents
Vocabulary size: 54 unique terms
Matrix shape: (10, 54)
Total elements: 540
Non-zero elements: 67
Zero elements: 473
Sparsity: 87.6%

Average terms per document: 6.7
Average documents per term: 1.2

The sparsity level of nearly 88% in this tiny corpus illustrates a basic property of text data. Each document uses only 6-7 unique terms from the 54-word vocabulary. In production systems with vocabularies of 100,000+ words and documents averaging 200 unique terms, sparsity typically exceeds 99.9%. This extreme sparsity makes dense matrix storage impractical and sparse formats essential.

Out[27]:
Visualization
Binary heatmap showing sparsity pattern with blue dots indicating non-zero term frequencies.
Sparsity pattern of the term frequency matrix. Each row represents a document, each column a vocabulary term. Blue cells indicate non-zero term frequencies. The sparse pattern shows that each document uses only a small subset of the total vocabulary.

Sparsity Implications

High sparsity has practical consequences:

Memory efficiency: Sparse matrix formats (CSR, CSC) store only non-zero values, reducing memory by orders of magnitude.

Computation speed: Sparse matrix operations skip zero elements, dramatically speeding up matrix multiplication and similarity calculations.

Feature selection: Many terms appear in very few documents, contributing little discriminative power. Pruning rare terms reduces dimensionality without losing much information.

In[28]:
from scipy import sparse

# Convert to sparse format
sparse_tf = sparse.csr_matrix(tf_matrix)

# Memory comparison
dense_bytes = tf_matrix.nbytes
sparse_bytes = sparse_tf.data.nbytes + sparse_tf.indices.nbytes + sparse_tf.indptr.nbytes
Out[29]:
Memory Usage Comparison:
----------------------------------------
Dense matrix: 4,320 bytes
Sparse matrix: 848 bytes
Compression ratio: 5.1x
Memory saved: 80.4%

The sparse format achieves significant memory savings even on this small matrix. The compression ratio scales with sparsity: at 99% sparsity, you'd see roughly 100x memory reduction. For a corpus of 1 million documents with 100,000 vocabulary terms, sparse storage can reduce memory from 800 GB (dense) to under 10 GB.
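Those corpus-scale numbers are back-of-envelope estimates. A rough sketch of the arithmetic, assuming float64 values and a CSR layout with 32-bit indices (exact figures depend on dtypes):

n_docs = 1_000_000
vocab_size = 100_000
avg_unique_terms = 200  # average non-zero entries per document row

# Dense: every cell stored as an 8-byte float
dense_bytes = n_docs * vocab_size * 8

# CSR: 8-byte value + 4-byte column index per non-zero, plus one row pointer per row
nnz = n_docs * avg_unique_terms
sparse_bytes = nnz * (8 + 4) + (n_docs + 1) * 4

print(f"Dense:  {dense_bytes / 1e9:,.0f} GB")   # 800 GB
print(f"Sparse: {sparse_bytes / 1e9:,.1f} GB")  # about 2.4 GB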

How Sparsity Scales with Vocabulary Size

As vocabulary grows, sparsity increases dramatically. This plot shows the relationship between vocabulary size and matrix sparsity:

Out[30]:
Visualization
Two line plots showing sparsity percentage and memory ratio increasing with vocabulary size.
Sparsity increases with vocabulary size. Left: As we include more rare terms in the vocabulary, the fraction of zero entries grows rapidly. Right: Memory savings from sparse storage become more dramatic with larger vocabularies. At realistic vocabulary sizes (10,000+ terms), sparse formats use less than 5% of dense storage.

The key insight: document length stays roughly constant as vocabulary grows, so the ratio of non-zero to total entries shrinks. This is why sparse matrix formats become essential at scale.

Efficient Term Frequency Computation

For production systems, efficiency matters. Let's compare different approaches to computing term frequency:

In[31]:
from sklearn.feature_extraction.text import CountVectorizer
import time

# Method 1: Manual computation with Counter
def method_counter(documents):
    all_tokens = []
    for doc in documents:
        import re
        tokens = re.findall(r'\b[a-z]+\b', doc.lower())
        all_tokens.extend(tokens)
    vocab = sorted(set(all_tokens))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    
    matrix = np.zeros((len(documents), len(vocab)))
    for i, doc in enumerate(documents):
        tokens = re.findall(r'\b[a-z]+\b', doc.lower())
        for token in tokens:
            matrix[i, word_to_idx[token]] += 1
    return matrix

# Method 2: scikit-learn CountVectorizer
def method_sklearn(documents):
    vectorizer = CountVectorizer(lowercase=True, token_pattern=r'\b[a-z]+\b')
    return vectorizer.fit_transform(documents)

# Benchmark both methods
n_iterations = 100
test_docs = extended_corpus * 10  # 100 documents

start = time.time()
for _ in range(n_iterations):
    _ = method_counter(test_docs)
counter_time = time.time() - start

start = time.time()
for _ in range(n_iterations):
    _ = method_sklearn(test_docs)
sklearn_time = time.time() - start
Out[32]:
Performance Comparison:
==================================================
Test: 100 documents, 100 iterations
--------------------------------------------------
Manual Counter method: 0.028s (0.28ms per call)
sklearn CountVectorizer: 0.032s (0.32ms per call)
Speedup: 0.9x faster with sklearn

On this small benchmark the two methods are roughly comparable, and the manual version even edges out CountVectorizer, likely because per-call setup overhead dominates at this scale. On realistic corpora, however, CountVectorizer pulls ahead: it uses optimized tokenization, builds sparse matrices directly, and never materializes a dense array. For any production application, use the library implementation rather than rolling your own.

CountVectorizer TF Variants

CountVectorizer supports different term frequency schemes through its parameters:

In[33]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

# Raw counts (default)
raw_vectorizer = CountVectorizer()
raw_matrix = raw_vectorizer.fit_transform(documents)

# Binary TF
binary_vectorizer = CountVectorizer(binary=True)
binary_matrix = binary_vectorizer.fit_transform(documents)

# L2-normalized (via TfidfVectorizer with use_idf=False)
l2_vectorizer = TfidfVectorizer(use_idf=False, norm='l2')
l2_matrix = l2_vectorizer.fit_transform(documents)
Out[34]:
scikit-learn TF Variants:
============================================================
Vocabulary size: 31

Document 1 representations:
------------------------------------------------------------
Term                 Raw   Binary    L2-norm
------------------------------------------------------------
artificial             1        1     0.1796
data                   2        1     0.3592
from                   1        1     0.1796
intelligence           1        1     0.1796
is                     2        1     0.3592
learn                  1        1     0.1796
learning               3        1     0.5388
machine                2        1     0.3592

The three representations show the same document from different perspectives. Raw counts preserve exact frequency information, binary reduces everything to presence indicators (all 1s), and L2-normalized values form a unit-length vector where each term's weight reflects its relative contribution to the document's direction in vocabulary space.

Choosing a TF Variant

We've now covered the full spectrum of term frequency weighting schemes, from raw counts to sophisticated normalizations. Each addresses a specific problem:

  • Raw TF gives you the basic signal but overweights repetition
  • Log-scaled TF compresses high counts, modeling diminishing returns
  • Boolean TF ignores frequency entirely, focusing on presence
  • Augmented TF normalizes within each document, handling length variation
  • L2-normalized TF projects documents onto a unit sphere, enabling efficient similarity computation

Which variant should you use? It depends on your task:

| Variant | Formula | Best For |
|---|---|---|
| Raw | $\text{tf}(t, d)$ | When exact counts matter, baseline models |
| Log-scaled | $1 + \log(\text{tf})$ | General purpose, TF-IDF computation |
| Boolean | $1$ if present, else $0$ | Topic detection, set-based matching |
| Augmented | $0.5 + 0.5 \times \frac{\text{tf}}{\max \text{tf}}$ | Cross-document comparison, length normalization |
| L2-normalized | $\frac{\text{tf}}{\|\mathbf{tf}\|_2}$ | Cosine similarity, neural network inputs |
Out[35]:
Visualization
Line plot comparing transformation curves for five TF variants across raw count values from 1 to 10.
How each TF variant transforms raw counts. This plot shows the mapping from raw term frequency (x-axis) to weighted value (y-axis) for each variant. Boolean collapses all non-zero counts to 1. Log-scaled grows sublinearly. Augmented depends on the maximum count in the document (shown for max=10). L2-normalized depends on all term counts (shown for a typical document).

The transformation curves reveal each variant's philosophy: raw TF treats all counts linearly, log-scaled compresses high values, boolean ignores magnitude entirely, and augmented/L2 normalize relative to other terms.

Out[36]:
Visualization
Five heatmaps showing term frequency values for different TF variants across three documents.
Heatmap comparison of TF variants across all three documents. Each row shows a different document, each column a different term. The five panels show how the same underlying data appears under each weighting scheme. Notice how boolean flattens all variation, while raw TF shows the largest spread. Log-scaled and augmented provide intermediate compression.
Out[37]:
Visualization
Grouped bar chart comparing five TF variants across terms, showing different scaling behaviors.
Visual comparison of term frequency variants for a sample document. Raw counts show the largest spread, while boolean collapses everything to 1. Log-scaled, augmented, and L2-normalized variants each compress the range differently, with L2 producing the smallest values due to normalization across all terms.

Limitations and Impact

Term frequency, in all its variants, captures only one dimension of word importance: how often a term appears in a document. This ignores a key question: how informative is this term across the entire corpus?

The "the" problem: Common words like "the", "is", and "a" appear frequently in almost every document. High TF doesn't distinguish documents when every document has high TF for the same words.

No corpus context: TF treats each document in isolation. A term appearing 5 times might be significant in a corpus where it's rare, or meaningless in a corpus where every document mentions it.

Length sensitivity: Despite normalization schemes, longer documents naturally contain more term occurrences, potentially biasing similarity calculations.

These limitations motivate Inverse Document Frequency (IDF), which we'll cover in the next chapter. IDF asks: how rare is this term across the corpus? Combining TF with IDF produces TF-IDF, one of the most successful text representations in information retrieval.

Term frequency laid the groundwork for quantifying word importance. The variants we explored (raw counts, log-scaling, boolean, augmented, and L2-normalized) each address different aspects of the counting problem. Understanding these foundations prepares you to appreciate why TF-IDF works and when to use its variants.

Summary

Term frequency transforms word counts into weighted signals of importance. The key variants each serve different purposes:

  • Raw term frequency counts occurrences directly, but overweights repeated terms
  • Log-scaled TF ($1 + \log(\text{tf})$) compresses high counts, capturing diminishing returns of repetition
  • Boolean TF reduces to presence/absence, useful when topic coverage matters more than emphasis
  • Augmented TF normalizes by maximum frequency, enabling fair cross-document comparison
  • L2-normalized TF creates unit vectors, making cosine similarity a simple dot product

Term frequency matrices are extremely sparse in practice, with 99%+ zeros for realistic vocabularies. Sparse matrix formats and optimized libraries like scikit-learn's CountVectorizer make efficient computation possible.

The main limitation of TF is its document-centric view. A term appearing frequently might be common across all documents (uninformative) or rare and distinctive. The next chapter introduces Inverse Document Frequency to address this, setting the stage for TF-IDF.

Key Functions and Parameters

When working with term frequency in scikit-learn, two classes handle most use cases:

CountVectorizer(lowercase, min_df, max_df, binary, ngram_range, max_features)

  • lowercase (default: True): Convert text to lowercase before tokenization. Disable for case-sensitive applications.
  • min_df: Minimum document frequency. Integer for absolute count, float for proportion. Use min_df=2 to remove typos and rare words.
  • max_df: Maximum document frequency. Use max_df=0.95 to filter extremely common words.
  • binary (default: False): Set to True for boolean term frequency where only presence matters.
  • ngram_range (default: (1, 1)): Tuple of (min_n, max_n). Use (1, 2) to include bigrams.
  • max_features: Limit vocabulary to top N most frequent terms for dimensionality control.

TfidfVectorizer(use_idf, norm, sublinear_tf)

  • use_idf (default: True): Set to False to compute only term frequency without IDF weighting.
  • norm (default: 'l2'): Vector normalization. Use 'l2' for cosine similarity, 'l1' for Manhattan distance, or None for raw values.
  • sublinear_tf (default: False): Set to True to apply log-scaling: replaces tf with $1 + \log(\text{tf})$.
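A minimal configuration sketch combining these parameters; the specific values here (min_df=2, max_df=0.95, bigrams, a 10,000-term cap) are illustrative choices, not recommendations from this chapter:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Raw counts with vocabulary pruning and bigrams
count_vec = CountVectorizer(
    lowercase=True,
    min_df=2,             # drop terms appearing in fewer than 2 documents
    max_df=0.95,          # drop terms appearing in more than 95% of documents
    ngram_range=(1, 2),   # unigrams and bigrams
    max_features=10_000,  # keep only the 10,000 most frequent terms
)

# Pure term frequency (no IDF), log-scaled and L2-normalized
tf_vec = TfidfVectorizer(
    use_idf=False,
    sublinear_tf=True,    # replaces tf with 1 + log(tf)
    norm='l2',
)

# Usage: X_counts = count_vec.fit_transform(corpus); X_tf = tf_vec.fit_transform(corpus)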

