Master term frequency weighting schemes including raw TF, log-scaled, boolean, augmented, and L2-normalized variants. Learn when to use each approach for information retrieval and NLP.

This article is part of the free-to-read Language AI Handbook
Term Frequency
How many times does a word appear in a document? This simple question leads to surprisingly complex answers. Raw counts tell you something, but a word appearing 10 times in a 100-word email means something different than 10 times in a 10,000-word novel. Term frequency weighting schemes address this by transforming raw counts into more meaningful signals.
In the Bag of Words chapter, we counted words. Now we refine those counts. Term frequency (TF) is the foundation of TF-IDF, one of the most successful text representations in information retrieval. But TF alone comes in many flavors: raw counts, log-scaled frequencies, boolean indicators, and normalized variants. Each captures different aspects of word importance.
This chapter explores these variants systematically. You'll learn when raw counts mislead, why logarithms help, and how normalization enables fair comparison across documents of different lengths. By the end, you'll understand the design choices behind term frequency and be ready to combine TF with inverse document frequency in the next chapter.
Raw Term Frequency
The simplest approach counts how many times each term appears in a document. If "learning" appears 5 times in a document, its raw term frequency is 5.
Term frequency measures how often a term appears in a document. The raw term frequency is simply the count of term $t$ in document $d$:

$$\text{tf}(t, d) = \text{number of occurrences of } t \text{ in } d$$

where:
- $t$: a term (word) in the vocabulary
- $d$: a document in the corpus
Let's implement this from scratch:
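Here is a minimal sketch of raw term-frequency counting using Python's `collections.Counter`. The three documents and the simple whitespace tokenizer are illustrative assumptions, standing in for the chapter's corpus rather than reproducing it exactly.

```python
from collections import Counter

# Illustrative stand-ins for the chapter's corpus (not the exact original texts)
documents = [
    "machine learning is a subset of artificial intelligence "
    "machine learning is data driven and learning uses data",
    "deep learning is a type of machine learning that uses "
    "deep neural networks neural networks learn hierarchies",
    "natural language processing uses machine learning for text classification",
]

def raw_tf(document: str) -> Counter:
    """Raw term frequency: count each whitespace-separated token."""
    return Counter(document.lower().split())

for i, doc in enumerate(documents, start=1):
    counts = raw_tf(doc)
    print(f"Document {i}:")
    for term, count in counts.most_common():
        print(f"  {term:<15} {count:>2} {'█' * count}")
```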
Raw Term Frequencies:
============================================================
Document 1:
  learning        3  ███
  machine         2  ██
  is              2  ██
  data            2  ██
  a               1  █
  subset          1  █
  of              1  █
  artificial      1  █
Document 2:
  learning        3  ███
  deep            2  ██
  neural          2  ██
  networks        2  ██
  is              1  █
  a               1  █
  type            1  █
  of              1  █
Document 3:
  natural         1  █
  language        1  █
  processing      1  █
  uses            1  █
  machine         1  █
  learning        1  █
  text            1  █
  classification  1  █
Notice how "learning" dominates Document 1 with 3 occurrences, while "deep" and "neural" characterize Document 2. These raw counts capture topic signals, but they also reveal a problem: common words like "is" and "a" appear frequently without carrying much meaning.
The Problem with Raw Counts
Raw term frequency has a proportionality problem. If "learning" appears twice as often as "data", does that mean it's twice as important? Probably not. The relationship between word count and importance is sublinear: the difference between 1 and 2 occurrences is more significant than the difference between 10 and 11.
Proportionality Problem Example:
--------------------------------------------------
'machine' appears 6 times
'data' appears 2 times
'science' appears 1 time

Is 'machine' really 6x more important than 'science'?
Is 'machine' really 3x more important than 'data'?
Raw counts overweight repeated terms.
A document that mentions "machine" six times isn't necessarily six times more about machines than a document mentioning it once. After the first few occurrences, additional repetitions add diminishing information. This observation motivates log-scaled term frequency.
Term Frequency Distribution
Before moving to log-scaling, let's examine how term frequencies distribute across a document. Natural language follows Zipf's law: a few words appear very frequently, while most words appear rarely.
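A quick sketch of how you might compute such a rank-frequency profile for a single document (the text and whitespace tokenization here are simplified assumptions):

```python
from collections import Counter

text = ("machine learning is a subset of artificial intelligence "
        "machine learning is data driven and learning uses data")
counts = Counter(text.lower().split())

# Terms sorted by frequency: rank 1 is the most common term
for rank, (term, freq) in enumerate(counts.most_common(), start=1):
    print(f"rank {rank:>2}   freq {freq}   {term}")
```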

The rank-frequency plot shows the characteristic "long tail" of natural language: frequency drops rapidly with rank. Log-scaling addresses this by compressing the high-frequency end of this distribution.
Log-Scaled Term Frequency
The proportionality problem highlights a key insight: the relationship between word frequency and importance is not linear. Seeing a word twice tells you more than seeing it once, but seeing it the hundredth time adds almost nothing new. We need a function that grows quickly at first, then slows down. The logarithm is exactly such a function.
Why Logarithms?
Consider what properties our ideal transformation should have:
- Monotonicity: More occurrences should still mean higher weight (we don't want to lose ranking information)
- Sublinearity: The weight should grow slower than the count (diminishing returns)
- Bounded compression: Large counts shouldn't completely dominate
The logarithm satisfies all three. It converts multiplicative relationships into additive ones: if one term appears 10 times more often than another, the log-scaled difference is just $\ln(10) \approx 2.3$, not 10. This compression matches how information works: the first mention of a concept is informative, but repetition adds progressively less.
Log-scaled term frequency dampens the effect of high counts:

$$\text{tf}_{\log}(t, d) = \begin{cases} 1 + \ln\big(\text{tf}(t, d)\big) & \text{if } \text{tf}(t, d) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where:
- $\text{tf}(t, d)$: the raw count of term $t$ in document $d$
- $\ln$: the natural logarithm (base $e$)
- The $+1$ ensures that a term appearing once gets weight 1 (since $\ln(1) = 0$)
- The piecewise definition handles absent terms ($\text{tf}(t, d) = 0$)
Understanding the Formula
The formula is carefully constructed. Let's trace through what happens at different frequency levels:
| Raw Count | Calculation | Log-Scaled Weight |
|---|---|---|
| 1 | $1 + \ln(1)$ | 1.00 |
| 2 | $1 + \ln(2)$ | 1.69 |
| 10 | $1 + \ln(10)$ | 3.30 |
| 100 | $1 + \ln(100)$ | 5.61 |
The 100x increase in raw count becomes only a 5.6x increase in log-scaled weight. A term appearing 100 times doesn't dominate one appearing once. It's weighted roughly 5-6 times higher, not 100 times.
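A small sketch of this transformation applied to a single document's counts. The helper `log_tf` and the document text are illustrative; the whitespace tokenizer is a simplifying assumption.

```python
import math
from collections import Counter

def log_tf(count: int) -> float:
    """Log-scaled TF: 1 + ln(count) for count > 0, otherwise 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

doc = ("machine learning is a subset of artificial intelligence "
       "machine learning is data driven and learning uses data")
counts = Counter(doc.lower().split())

print(f"{'Term':<15} {'Raw TF':>7} {'Log TF':>7} {'Ratio':>7}")
for term, raw in counts.most_common():
    scaled = log_tf(raw)
    # Ratio of raw count to log-scaled weight shows how much the log compresses
    print(f"{term:<15} {raw:>7} {scaled:>7.2f} {raw / scaled:>7.2f}")
```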
Raw vs Log-Scaled Term Frequency (Document 1):
-------------------------------------------------------
Term            Raw TF   Log TF   Ratio
-------------------------------------------------------
learning             3     2.10    1.43
machine              2     1.69    1.18
is                   2     1.69    1.18
data                 2     1.69    1.18
a                    1     1.00    1.00
subset               1     1.00    1.00
of                   1     1.00    1.00
artificial           1     1.00    1.00
intelligence         1     1.00    1.00
uses                 1     1.00    1.00
The log transformation compresses the range. "Learning" with 3 raw occurrences gets a log TF of about 2.1, not 3x the weight of a single-occurrence term. This better reflects the diminishing returns of word repetition.

Visualizing the Log Transformation
The logarithm's compression effect becomes clearer when we plot raw counts against their log-scaled equivalents:

The curve flattens as raw counts increase. This sublinear relationship captures the intuition that the 10th occurrence of a word adds less information than the first.

Boolean Term Frequency
Log-scaling addresses the proportionality problem, but what if we push this idea to its extreme? If diminishing returns are the issue, why not treat all non-zero counts identically? This reasoning leads to boolean term frequency, the simplest possible weighting scheme.
Boolean TF asks only one question: does this term appear in this document? The answer is binary: yes (1) or no (0). A term mentioned once gets the same weight as a term mentioned a hundred times.
Boolean term frequency ignores how many times a term appears:

$$\text{tf}_{\text{bool}}(t, d) = \begin{cases} 1 & \text{if } \text{tf}(t, d) > 0 \\ 0 & \text{otherwise} \end{cases}$$

This treats all present terms equally, regardless of frequency.
This might seem like throwing away information, but boolean TF is useful when:
- You care about topic coverage, not emphasis
- Repeated terms might indicate spam or manipulation
- The task is set-based (does this document mention these concepts?)
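A minimal sketch comparing the three variants on one document. The text and whitespace tokenization are the same illustrative assumptions as before.

```python
import math
from collections import Counter

doc = ("machine learning is a subset of artificial intelligence "
       "machine learning is data driven and learning uses data")
counts = Counter(doc.lower().split())

print(f"{'Term':<15} {'Raw':>4} {'Log':>6} {'Boolean':>8}")
for term, raw in counts.most_common():
    log_scaled = 1.0 + math.log(raw)   # raw is always > 0 here
    boolean = 1 if raw > 0 else 0      # presence/absence only
    print(f"{term:<15} {raw:>4} {log_scaled:>6.2f} {boolean:>8}")
```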
Three TF Variants Compared (Document 1):
------------------------------------------------------------
Term            Raw    Log    Boolean
------------------------------------------------------------
learning          3   2.10      1
machine           2   1.69      1
is                2   1.69      1
data              2   1.69      1
a                 1   1.00      1
subset            1   1.00      1
of                1   1.00      1
artificial        1   1.00      1
intelligence      1   1.00      1
uses              1   1.00      1
With boolean TF, "learning" (3 occurrences) and "subset" (1 occurrence) get equal weight. This might seem like information loss, but for some tasks, knowing a document discusses "learning" at all is more important than knowing it discusses it repeatedly.
Augmented Term Frequency
So far, we've addressed how to weight individual term counts, but we haven't tackled a more fundamental problem: document length. A 10,000-word document will naturally have higher raw term frequencies than a 100-word document, even if both discuss the same topic with equal emphasis. When comparing documents, this length bias can drown out meaningful differences in content.
The Document Length Problem
Consider two documents about machine learning:
- Document A (100 words): mentions "learning" 5 times
- Document B (1,000 words): mentions "learning" 20 times
Which document is more focused on learning? Raw counts suggest Document B, but proportionally, Document A dedicates 5% of its words to "learning" while Document B dedicates only 2%. The shorter document is actually more focused on the topic.

Augmented term frequency solves this by asking: how important is this term relative to the most important term in this document? Rather than comparing raw counts across documents, we compare proportions. The most frequent term in any document becomes the reference point.
Augmented term frequency normalizes by the maximum term frequency in the document:

$$\text{tf}_{\text{aug}}(t, d) = 0.5 + 0.5 \cdot \frac{\text{tf}(t, d)}{\max_{t' \in d} \text{tf}(t', d)}$$

where:
- $t$: the term being weighted
- $d$: the document
- $\text{tf}(t, d)$: raw count of term $t$ in document $d$
- $\max_{t' \in d} \text{tf}(t', d)$: the highest term frequency in document $d$ (the count of the most common term)
Deconstructing the Formula
The formula has two components working together:
Step 1: Compute the relative frequency

$$\frac{\text{tf}(t, d)}{\max_{t' \in d} \text{tf}(t', d)}$$

This ratio normalizes each term against the document's most frequent term, producing values between 0 and 1. The most frequent term gets 1.0; a term appearing half as often gets 0.5.

Step 2: Apply the double normalization

$$0.5 + 0.5 \cdot \frac{\text{tf}(t, d)}{\max_{t' \in d} \text{tf}(t', d)}$$

This transformation maps the [0, 1] range to [0.5, 1.0]. Why not just use the ratio directly? The 0.5 baseline ensures that even rare terms receive meaningful weight, preventing them from being completely overshadowed by the dominant term.
The formula guarantees:
- The most frequent term in any document gets weight 1.0
- All other terms get weights between 0.5 and 1.0, proportional to their relative frequency
- All documents are on a comparable scale regardless of length
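A sketch of augmented TF following the formula above, normalizing each count by the document's maximum count. The `augmented_tf` helper and the document text are illustrative assumptions.

```python
from collections import Counter

def augmented_tf(document: str) -> dict:
    """Augmented TF: 0.5 + 0.5 * tf / max_tf, so present terms land in [0.5, 1.0]."""
    counts = Counter(document.lower().split())
    max_tf = max(counts.values())
    return {term: 0.5 + 0.5 * count / max_tf for term, count in counts.items()}

doc = ("machine learning is a subset of artificial intelligence "
       "machine learning is data driven and learning uses data")
for term, weight in sorted(augmented_tf(doc).items(), key=lambda kv: -kv[1]):
    print(f"{term:<15} {weight:.3f}")
```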

Augmented Term Frequency (normalized to [0.5, 1.0]):
======================================================================
Document 1 (max raw tf = 3):
  learning    raw= 3  aug=1.000  ████████████████████
  machine     raw= 2  aug=0.833  ████████████████
  is          raw= 2  aug=0.833  ████████████████
  data        raw= 2  aug=0.833  ████████████████
  a           raw= 1  aug=0.667  █████████████
  subset      raw= 1  aug=0.667  █████████████
Document 2 (max raw tf = 3):
  learning    raw= 3  aug=1.000  ████████████████████
  deep        raw= 2  aug=0.833  ████████████████
  neural      raw= 2  aug=0.833  ████████████████
  networks    raw= 2  aug=0.833  ████████████████
  is          raw= 1  aug=0.667  █████████████
  a           raw= 1  aug=0.667  █████████████
Document 3 (max raw tf = 1):
  natural     raw= 1  aug=1.000  ████████████████████
  language    raw= 1  aug=1.000  ████████████████████
  processing  raw= 1  aug=1.000  ████████████████████
  uses        raw= 1  aug=1.000  ████████████████████
  machine     raw= 1  aug=1.000  ████████████████████
  learning    raw= 1  aug=1.000  ████████████████████
The most frequent term in each document gets 1.0, while other terms scale proportionally. This makes cross-document comparison fairer.

L2-Normalized Frequency Vectors
Augmented TF normalizes against the single most frequent term, but what if we want to consider all terms simultaneously? This leads us to think geometrically: each document's term frequencies form a vector in high-dimensional space, where each dimension corresponds to a vocabulary word. The vector's direction captures what the document is about, while its magnitude reflects document length.
From Counts to Geometry
Imagine a vocabulary of just two words: "learning" and "deep". Each document becomes a point in 2D space:
- Document with TF = [4, 0] points along the "learning" axis
- Document with TF = [2, 2] points diagonally between both axes
- Document with TF = [0, 3] points along the "deep" axis
The direction of each vector tells us what the document emphasizes. The magnitude (length) tells us how many total word occurrences it contains. If we only care about content similarity, not length, we should normalize all vectors to the same magnitude.
L2 normalization projects every document onto the unit sphere, preserving direction while eliminating length differences. Two documents with the same word proportions but different lengths will map to the same point on the unit sphere.
L2 normalization divides each term frequency by the vector's Euclidean length:

$$\text{tf}_{\text{L2}}(t, d) = \frac{\text{tf}(t, d)}{\sqrt{\sum_{t' \in d} \text{tf}(t', d)^2}}$$

where:
- $\text{tf}(t, d)$: raw count of term $t$ in document $d$
- $\sum_{t' \in d} \text{tf}(t', d)^2$: sum of squared term frequencies across all terms in $d$
- $\sqrt{\sum_{t' \in d} \text{tf}(t', d)^2}$: the L2 norm (Euclidean length) of the TF vector

The resulting vector has unit length ($\lVert \mathbf{v} \rVert_2 = 1$), making cosine similarity equivalent to a simple dot product.
Why the L2 Norm?
The L2 norm measures the Euclidean distance from the origin, the straight-line length of the vector. Dividing by this norm scales the vector to length 1 while preserving its direction. After normalization, every document vector lies on the surface of a unit hypersphere.
This geometric property has a practical benefit: the angle between any two normalized vectors directly measures their content similarity. Documents pointing in similar directions (small angle) have high cosine similarity; documents pointing in different directions (large angle) are dissimilar.
L2 normalization is particularly useful for:
- Efficient similarity computation: Cosine similarity becomes a simple dot product
- Length-invariant comparison: A 100-word and 10,000-word document on the same topic will have similar normalized vectors
- ML compatibility: Many machine learning models assume or benefit from normalized inputs
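A minimal sketch of L2 normalization over a raw TF vector (illustrative text, whitespace tokenization, and a hypothetical `l2_normalized_tf` helper):

```python
import math
from collections import Counter

def l2_normalized_tf(document: str) -> dict:
    """Divide each raw count by the Euclidean (L2) norm of the TF vector."""
    counts = Counter(document.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: count / norm for term, count in counts.items()}

weights = l2_normalized_tf("deep learning uses deep neural networks for learning deep models")
print(f"vector norm = {math.sqrt(sum(w * w for w in weights.values())):.4f}")  # 1.0000 by construction
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:<10} {w:.4f}")
```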
L2-Normalized Term Frequency Vectors:
============================================================
Document 1 (vector norm = 1.0000):
  learning        0.5303  ██████████████████████████
  data            0.3536  █████████████████
  is              0.3536  █████████████████
  machine         0.3536  █████████████████
  a               0.1768  ████████
  artificial      0.1768  ████████
Document 2 (vector norm = 1.0000):
  learning        0.5477  ███████████████████████████
  deep            0.3651  ██████████████████
  networks        0.3651  ██████████████████
  neural          0.3651  ██████████████████
  a               0.1826  █████████
  hierarchical    0.1826  █████████
Document 3 (vector norm = 1.0000):
  analysis        0.2673  █████████████
  and             0.2673  █████████████
  are             0.2673  █████████████
  classification  0.2673  █████████████
  language        0.2673  █████████████
  learning        0.2673  █████████████
Each vector now has unit length (norm = 1.0). The values represent the relative contribution of each term to the document's "direction" in vocabulary space.
Geometric Interpretation: Vectors on the Unit Sphere
L2 normalization projects all document vectors onto the surface of a unit hypersphere. In high dimensions this is hard to visualize, but we can illustrate the concept by projecting our documents into 2D using their two most distinctive terms:

On the unit circle, the angle θ between vectors directly measures semantic distance. Documents pointing in similar directions (small θ) have high cosine similarity; documents pointing in different directions (large θ) are dissimilar.
Cosine Similarity with Normalized Vectors
The practical benefit of L2 normalization becomes clear when computing document similarity. Cosine similarity measures the angle between two vectors, defined as:

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert_2 \, \lVert \mathbf{b} \rVert_2}$$

where $\mathbf{a} \cdot \mathbf{b}$ is the dot product and $\lVert \cdot \rVert_2$ denotes the L2 norm.

When both vectors are already L2-normalized ($\lVert \mathbf{a} \rVert_2 = \lVert \mathbf{b} \rVert_2 = 1$), the denominator becomes 1, and cosine similarity reduces to a simple dot product:

$$\cos(\mathbf{a}, \mathbf{b}) = \mathbf{a} \cdot \mathbf{b}$$
This simplification speeds up computation: comparing millions of documents becomes a matrix multiplication rather than millions of individual normalizations.
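A sketch of this computation with scikit-learn: build raw counts, L2-normalize the rows, and obtain the full cosine similarity matrix from a single sparse matrix product. The documents are illustrative stand-ins for the chapter's corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

documents = [
    "machine learning is a subset of artificial intelligence and uses data",
    "deep learning is a type of machine learning built on neural networks",
    "natural language processing uses machine learning for text classification",
]

tf = CountVectorizer().fit_transform(documents)   # sparse (n_docs, vocab) count matrix
tf_l2 = normalize(tf, norm="l2")                  # each row now has unit L2 norm
similarity = (tf_l2 @ tf_l2.T).toarray()          # dot products of unit vectors = cosine similarity

print(np.round(similarity, 3))
```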
Cosine Similarity Matrix (using L2-normalized TF):
---------------------------------------------
Doc 1 Doc 2 Doc 3
---------------------------------------------
Doc 1 1.000 0.549 0.283
Doc 2 0.549 1.000 0.244
Doc 3 0.283 0.244 1.000
Documents 1 and 2 are most similar (both discuss machine learning and deep learning), while Document 3 (about NLP) shows lower similarity to both.

Term Frequency Sparsity Patterns
We've explored five ways to weight term frequencies, each with different mathematical properties. But before choosing among them, we need to understand a practical reality that affects all term frequency representations: sparsity.
Real-world term frequency matrices are extremely sparse. A typical English vocabulary contains tens of thousands of words, yet any individual document uses only a few hundred. This means most entries in a document-term matrix are zero. Understanding this sparsity matters for efficient computation and storage.
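One way to measure this, sketched with scikit-learn on a small illustrative corpus (the figures in the output below come from the chapter's own corpus, not from this sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "machine learning is a subset of artificial intelligence",
    "deep learning uses neural networks",
    "natural language processing uses machine learning for text classification",
    "data science combines statistics and machine learning",
]

X = CountVectorizer().fit_transform(documents)    # CSR sparse document-term matrix
n_docs, n_terms = X.shape
total = n_docs * n_terms

print(f"Matrix shape:           {X.shape}")
print(f"Non-zero elements:      {X.nnz}")
print(f"Sparsity:               {1 - X.nnz / total:.1%}")
print(f"Avg terms per document: {X.nnz / n_docs:.1f}")
```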
Term Frequency Matrix Sparsity Analysis:
==================================================
Corpus size: 10 documents
Vocabulary size: 54 unique terms
Matrix shape: (10, 54)
Total elements: 540
Non-zero elements: 67
Zero elements: 473
Sparsity: 87.6%
Average terms per document: 6.7
Average documents per term: 1.2
The sparsity of nearly 88% in this tiny corpus already reflects a fundamental property of text data. Each document uses only 6-7 unique terms from the 50+ word vocabulary. In production systems with vocabularies of 100,000+ words and documents averaging 200 unique terms, sparsity typically exceeds 99.9%. This extreme sparsity makes dense matrix storage impractical and sparse formats essential.

Sparsity Implications
High sparsity has practical consequences:
Memory efficiency: Sparse matrix formats (CSR, CSC) store only non-zero values, reducing memory by orders of magnitude.
Computation speed: Sparse matrix operations skip zero elements, dramatically speeding up matrix multiplication and similarity calculations.
Feature selection: Many terms appear in very few documents, contributing little discriminative power. Pruning rare terms reduces dimensionality without losing much information.
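A sketch of how you might compare dense and sparse memory footprints for the same matrix. CSR storage is the sum of its data, column-index, and row-pointer arrays; the corpus is the same illustrative stand-in as above.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "machine learning is a subset of artificial intelligence",
    "deep learning uses neural networks",
    "natural language processing uses machine learning for text classification",
    "data science combines statistics and machine learning",
]

X_sparse = CountVectorizer().fit_transform(documents)   # CSR sparse matrix
X_dense = X_sparse.toarray()                            # dense numpy array

dense_bytes = X_dense.nbytes
sparse_bytes = X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes

print(f"Dense matrix:      {dense_bytes:,} bytes")
print(f"Sparse matrix:     {sparse_bytes:,} bytes")
print(f"Compression ratio: {dense_bytes / sparse_bytes:.1f}x")
```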
Memory Usage Comparison:
----------------------------------------
Dense matrix: 4,320 bytes
Sparse matrix: 848 bytes
Compression ratio: 5.1x
Memory saved: 80.4%
The sparse format achieves significant memory savings even on this small matrix. The compression ratio scales with sparsity: at 99% sparsity, you'd see roughly 100x memory reduction. For a corpus of 1 million documents with 100,000 vocabulary terms, sparse storage can reduce memory from 800 GB (dense) to under 10 GB.
How Sparsity Scales with Vocabulary Size
As vocabulary grows, sparsity increases dramatically. This plot shows the relationship between vocabulary size and matrix sparsity:

The key insight: document length stays roughly constant as vocabulary grows, so the ratio of non-zero to total entries shrinks. This is why sparse matrix formats become essential at scale.
Efficient Term Frequency Computation
For production systems, efficiency matters. Let's compare different approaches to computing term frequency:
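A rough benchmark sketch comparing a hand-rolled Counter loop against CountVectorizer. Timings vary by machine and corpus; the numbers in the output below come from the chapter's environment, not from this code, and the corpus here is a deliberately simple stand-in.

```python
import time
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

documents = ["machine learning models learn patterns from data"] * 100  # illustrative corpus

def manual_tf(docs):
    """Per-document raw counts using the standard library."""
    return [Counter(d.lower().split()) for d in docs]

def timed(fn, iterations=100):
    """Total wall-clock time for repeated calls to fn."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return time.perf_counter() - start

manual_time = timed(lambda: manual_tf(documents))
sklearn_time = timed(lambda: CountVectorizer().fit_transform(documents))

print(f"Manual Counter method:   {manual_time:.3f}s ({manual_time * 10:.2f}ms per call)")
print(f"sklearn CountVectorizer: {sklearn_time:.3f}s ({sklearn_time * 10:.2f}ms per call)")
```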
Performance Comparison:
==================================================
Test: 100 documents, 100 iterations
--------------------------------------------------
Manual Counter method:   0.028s (0.28ms per call)
sklearn CountVectorizer: 0.032s (0.32ms per call)
Speedup: 0.9x faster with sklearn
On this tiny benchmark, the manual Counter approach actually comes out slightly ahead: for a handful of short documents, CountVectorizer's fixed overhead (vocabulary building, sparse matrix construction) outweighs its optimized internals. At realistic corpus sizes, the library brings efficient sparse matrix construction, consistent vocabulary management, and well-tested tokenization. For production applications, use the library implementation rather than rolling your own.
CountVectorizer TF Variants
CountVectorizer supports different term frequency schemes through its parameters:
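A sketch of producing three variants with scikit-learn: raw counts from `CountVectorizer`, boolean TF via `binary=True`, and L2-normalized TF via `TfidfVectorizer` with IDF switched off. The document text is an illustrative stand-in for the chapter's corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "machine learning is a subset of artificial intelligence machine learning "
    "is data driven and learning systems learn from data",
]

raw_vec = CountVectorizer()                          # raw term counts
binary_vec = CountVectorizer(binary=True)            # boolean TF (presence only)
l2_vec = TfidfVectorizer(use_idf=False, norm="l2")   # L2-normalized TF, no IDF

raw = raw_vec.fit_transform(documents).toarray()[0]
binary = binary_vec.fit_transform(documents).toarray()[0]
l2 = l2_vec.fit_transform(documents).toarray()[0]

# All three vectorizers use the same default tokenizer, so the vocabularies align
for term, idx in sorted(raw_vec.vocabulary_.items()):
    print(f"{term:<14} raw={raw[idx]:>2} binary={binary[idx]} l2={l2[idx]:.4f}")
```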
scikit-learn TF Variants:
============================================================
Vocabulary size: 31
Document 1 representations:
------------------------------------------------------------
Term            Raw   Binary   L2-norm
------------------------------------------------------------
artificial        1      1     0.1796
data              2      1     0.3592
from              1      1     0.1796
intelligence      1      1     0.1796
is                2      1     0.3592
learn             1      1     0.1796
learning          3      1     0.5388
machine           2      1     0.3592
The three representations show the same document from different perspectives. Raw counts preserve exact frequency information, binary reduces everything to presence indicators (all 1s), and L2-normalized values form a unit-length vector where each term's weight reflects its relative contribution to the document's direction in vocabulary space.
Choosing a TF Variant
We've now covered the full spectrum of term frequency weighting schemes, from raw counts to sophisticated normalizations. Each addresses a specific problem:
- Raw TF gives you the basic signal but overweights repetition
- Log-scaled TF compresses high counts, modeling diminishing returns
- Boolean TF ignores frequency entirely, focusing on presence
- Augmented TF normalizes within each document, handling length variation
- L2-normalized TF projects documents onto a unit sphere, enabling efficient similarity computation
Which variant should you use? It depends on your task:
| Variant | Formula | Best For |
|---|---|---|
| Raw | $\text{tf}(t, d)$ | When exact counts matter, baseline models |
| Log-scaled | $1 + \ln(\text{tf}(t, d))$ | General purpose, TF-IDF computation |
| Boolean | $1$ if present, else $0$ | Topic detection, set-based matching |
| Augmented | $0.5 + 0.5 \cdot \frac{\text{tf}(t, d)}{\max_{t'} \text{tf}(t', d)}$ | Cross-document comparison, length normalization |
| L2-normalized | $\frac{\text{tf}(t, d)}{\lVert \mathbf{tf}_d \rVert_2}$ | Cosine similarity, neural network inputs |

The transformation curves reveal each variant's philosophy: raw TF treats all counts linearly, log-scaled compresses high values, boolean ignores magnitude entirely, and augmented/L2 normalize relative to other terms.


Limitations and Impact
Term frequency, in all its variants, captures only one dimension of word importance: how often a term appears in a document. This ignores a key question: how informative is this term across the entire corpus?
The "the" problem: Common words like "the", "is", and "a" appear frequently in almost every document. High TF doesn't distinguish documents when every document has high TF for the same words.
No corpus context: TF treats each document in isolation. A term appearing 5 times might be significant in a corpus where it's rare, or meaningless in a corpus where every document mentions it.
Length sensitivity: Despite normalization schemes, longer documents naturally contain more term occurrences, potentially biasing similarity calculations.
These limitations motivate Inverse Document Frequency (IDF), which we'll cover in the next chapter. IDF asks: how rare is this term across the corpus? Combining TF with IDF produces TF-IDF, one of the most successful text representations in information retrieval.
Term frequency laid the groundwork for quantifying word importance. The variants we explored (raw counts, log-scaling, boolean, augmented, and L2-normalized) each address different aspects of the counting problem. Understanding these foundations prepares you to appreciate why TF-IDF works and when to use its variants.
Summary
Term frequency transforms word counts into weighted signals of importance. The key variants each serve different purposes:
- Raw term frequency counts occurrences directly, but overweights repeated terms
- Log-scaled TF ($1 + \ln(\text{tf})$) compresses high counts, capturing diminishing returns of repetition
- Boolean TF reduces to presence/absence, useful when topic coverage matters more than emphasis
- Augmented TF normalizes by maximum frequency, enabling fair cross-document comparison
- L2-normalized TF creates unit vectors, making cosine similarity a simple dot product
Term frequency matrices are extremely sparse in practice, with 99%+ zeros for realistic vocabularies. Sparse matrix formats and optimized libraries like scikit-learn's CountVectorizer make efficient computation possible.
The main limitation of TF is its document-centric view. A term appearing frequently might be common across all documents (uninformative) or rare and distinctive. The next chapter introduces Inverse Document Frequency to address this, setting the stage for TF-IDF.
Key Functions and Parameters
When working with term frequency in scikit-learn, two classes handle most use cases:
CountVectorizer(lowercase, min_df, max_df, binary, ngram_range, max_features)
- `lowercase` (default: `True`): Convert text to lowercase before tokenization. Disable for case-sensitive applications.
- `min_df`: Minimum document frequency. Integer for absolute count, float for proportion. Use `min_df=2` to remove typos and rare words.
- `max_df`: Maximum document frequency. Use `max_df=0.95` to filter extremely common words.
- `binary` (default: `False`): Set to `True` for boolean term frequency where only presence matters.
- `ngram_range` (default: `(1, 1)`): Tuple of (min_n, max_n). Use `(1, 2)` to include bigrams.
- `max_features`: Limit vocabulary to the top N most frequent terms for dimensionality control.
TfidfVectorizer(use_idf, norm, sublinear_tf)
- `use_idf` (default: `True`): Set to `False` to compute only term frequency without IDF weighting.
- `norm` (default: `'l2'`): Vector normalization. Use `'l2'` for cosine similarity, `'l1'` for Manhattan distance, or `None` for raw values.
- `sublinear_tf` (default: `False`): Set to `True` to apply log-scaling: replaces tf with $1 + \ln(\text{tf})$.
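For example, the configurations below sketch how these parameters combine; the parameter choices and corpus are illustrative, not prescriptive.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Boolean TF over unigrams and bigrams, dropping terms seen in fewer than 2 documents
bool_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2), min_df=2)

# Pure log-scaled, L2-normalized term frequency (no IDF weighting)
log_tf_vectorizer = TfidfVectorizer(use_idf=False, sublinear_tf=True, norm="l2")

corpus = [
    "machine learning uses data",
    "deep learning uses neural networks",
    "machine learning and deep learning both learn from data",
]
print(bool_vectorizer.fit_transform(corpus).toarray())
print(log_tf_vectorizer.fit_transform(corpus).toarray().round(3))
```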
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
Related Content

Inverse Document Frequency: How Rare Words Reveal Document Meaning
Learn how Inverse Document Frequency (IDF) measures word importance across a corpus by weighting rare, discriminative terms higher than common words. Master IDF formula derivation, smoothing variants, and efficient implementation with scikit-learn.

TF-IDF: Term Frequency-Inverse Document Frequency for Text Representation
Master TF-IDF for text representation, including the core formula, variants like log-scaled TF and smoothed IDF, normalization techniques, document similarity with cosine similarity, and BM25 as a modern extension.

Perplexity: The Standard Metric for Evaluating Language Models
Learn how perplexity measures language model quality through cross-entropy and information theory. Understand the branching factor interpretation, implement perplexity for n-gram models, and discover when perplexity predicts downstream performance.