TF-IDF and Bag of Words: Classical text representation methods and their applications
Before neural networks and transformers, how did we convert text into numbers that machines could process? The answer lies in two foundational techniques that remain relevant today: Bag of Words and TF-IDF (Term Frequency-Inverse Document Frequency).
These methods solve a fundamental problem in NLP: text is inherently discrete and symbolic, but most machine learning algorithms require numerical input. Bag of Words provides a simple way to count word occurrences, while TF-IDF adds sophistication by weighting words based on their importance across a document collection.
In this chapter, we'll explore how these classical techniques work, implement them from scratch, and understand both their power and limitations. You'll see how they enable everything from search engines to document classification, and why they're still used in production systems today.
Introduction
Imagine you have a collection of documents, maybe product reviews, news articles, or research papers. You want to find which documents are similar, classify them by topic, or build a search system. The challenge: computers can't directly understand words.
Bag of Words solves this by treating each document as an unordered collection of word counts. It's called a "bag" because word order doesn't matter. Only how many times each word appears matters. This simple idea enables us to represent any document as a fixed-length vector of numbers.
TF-IDF builds on this foundation. It recognizes that not all words are equally informative. The word "the" appears in almost every document, so it's not useful for distinguishing between documents. But a rare word like "quantum" might be highly informative when it appears. TF-IDF automatically downweights common words and emphasizes distinctive ones.
Together, these techniques form the backbone of classical information retrieval and text classification systems. They're fast, interpretable, and surprisingly effective for many tasks.
Bag of Words: A text representation method that converts documents into fixed-length vectors by counting word occurrences. Each dimension in the vector corresponds to a word in the vocabulary, and the value represents how many times that word appears in the document.
TF-IDF: Term Frequency-Inverse Document Frequency. A weighting scheme that multiplies term frequency (how often a word appears in a document) by inverse document frequency (how rare the word is across the collection). This emphasizes words that are frequent in a specific document but rare overall.
Technical Deep Dive
Bag of Words: The Foundation
Let's start with the simplest approach. Given a vocabulary $V$ with $|V|$ unique words, we can represent any document $d$ as a vector of length $|V|$, where each element $i$ counts how many times word $w_i$ appears in document $d$.
The process involves three steps:
- Tokenization: Split documents into individual words (tokens)
- Vocabulary building: Collect all unique words across all documents
- Vectorization: For each document, count occurrences of each vocabulary word
For example, if our vocabulary is $\{\text{cat}, \text{dog}, \text{runs}\}$ and a document contains "cat runs", the vector would be $[1, 0, 1]$: one occurrence of "cat", zero of "dog", and one of "runs".
This representation has several properties:
- Fixed dimensionality: All documents map to vectors of the same length
- Sparsity: Most documents use only a small fraction of the vocabulary, so most vector elements are zero
- Order independence: "cat runs" and "runs cat" produce identical vectors
The sparsity is important. In practice, vocabularies can contain tens of thousands of words, but individual documents might use only hundreds. This makes sparse matrix representations efficient for storage and computation.
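As a rough illustration of the storage savings (the matrix size here is hypothetical, and SciPy's `csr_matrix` is just one common sparse format):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A dense bag-of-words matrix: 100 documents over a 50,000-word vocabulary
dense = np.zeros((100, 50_000))
dense[0, 42] = 3.0  # in practice, only a few hundred entries per row are non-zero

sparse = csr_matrix(dense)
print(dense.nbytes)        # 40,000,000 bytes, almost all of them storing zeros
print(sparse.data.nbytes)  # 8 bytes -- only the non-zero values are stored
```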
Term Frequency (TF)
Term frequency measures how often a word appears in a document. The simplest version is the raw count:

$$\text{tf}(t, d) = f_{t,d}$$

where $f_{t,d}$ is the number of times term $t$ appears in document $d$.
However, longer documents naturally contain more words. To make frequencies comparable across documents of different lengths, we often normalize by document length:

$$\text{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
This gives us a proportion: what fraction of the document consists of this word? Normalized TF values range from 0 to 1, with 1 meaning the entire document consists of that single word.
Another common normalization uses logarithmic scaling to dampen the effect of very frequent words:

$$\text{tf}(t, d) = \log\left(1 + f_{t,d}\right)$$
This formula ensures that doubling the word count doesn't double the TF score, which helps prevent extremely common words from dominating the representation.
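For example, with the natural logarithm, a count of 2 gives $\log(1 + 2) \approx 1.10$, while doubling it to 4 gives only $\log(1 + 4) \approx 1.61$, well short of double.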
Inverse Document Frequency (IDF)
While term frequency tells us how important a word is within a document, inverse document frequency measures how distinctive it is across the entire collection. The key insight: words that appear in many documents are less informative than words that appear in few.
The inverse document frequency is calculated as:

$$\text{idf}(t) = \log\frac{N}{\text{df}(t)}$$

where $N$ is the total number of documents, and $\text{df}(t)$ is the number of documents containing term $t$.
Let's break this down:
- If a word appears in all documents, the denominator equals $N$, so $\text{idf}(t) = \log(N/N) = \log(1) = 0$
- If a word appears in only one document, the denominator is 1, so $\text{idf}(t) = \log(N/1) = \log(N)$
- Words appearing in fewer documents get higher IDF scores
The logarithm serves two purposes: it compresses the scale (so IDF doesn't grow linearly with collection size), and it makes the metric more interpretable. Without the log, a word appearing in 1 out of 1000 documents would have IDF = 1000, while one appearing in 500 documents would have IDF = 2, a 500x difference. The logarithm smooths this out: with the natural log, the scores become roughly 6.9 versus 0.69, only about a 10x gap.
Some implementations add 1 to avoid division by zero and to ensure all terms get at least some weight:

$$\text{idf}(t) = \log\frac{N}{1 + \text{df}(t)} + 1$$
TF-IDF: Combining Both Components
TF-IDF multiplies term frequency by inverse document frequency:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$
This creates a scoring system where:
- High TF-IDF: Words that are frequent in a specific document but rare across the collection
- Low TF-IDF: Words that are either rare in the document or common across many documents
The multiplication is crucial. A word with high TF but low IDF (common everywhere) gets downweighted. A word with low TF but high IDF (rare but distinctive) gets some weight. Only words with both high TF and high IDF get the highest scores.
This weighting scheme automatically identifies the most distinctive words in each document, exactly what we want for tasks like search, classification, and topic modeling.
Worked Example
Let's work through a concrete example with three short documents:
- Document 1: "the cat sat on the mat"
- Document 2: "the dog sat on the log"
- Document 3: "the cat and dog sat"
First, we build our vocabulary by collecting all unique words (sorted alphabetically): {"and", "cat", "dog", "log", "mat", "on", "sat", "the"}. That's 8 words, so our vectors will have 8 dimensions.
Bag of Words vectors:
- Document 1: [0, 1, 0, 0, 1, 1, 1, 2] (2×"the", 1×"cat", 1×"sat", 1×"on", 1×"mat")
- Document 2: [0, 0, 1, 1, 0, 1, 1, 2] (2×"the", 1×"sat", 1×"on", 1×"dog", 1×"log")
- Document 3: [1, 1, 1, 0, 0, 0, 1, 1] (1×"the", 1×"cat", 1×"sat", 1×"dog", 1×"and")
Notice that "the" appears in all three documents, "sat" appears in all three, but "mat", "log", and "and" appear in only one document each.
Calculating IDF (using the natural logarithm):
- "the": appears in 3/3 documents → $\text{idf} = \log(3/3) = 0$
- "sat": appears in 3/3 documents → $\text{idf} = \log(3/3) = 0$
- "cat": appears in 2/3 documents → $\text{idf} = \log(3/2) \approx 0.405$
- "dog": appears in 2/3 documents → $\text{idf} = \log(3/2) \approx 0.405$
- "mat": appears in 1/3 documents → $\text{idf} = \log(3/1) \approx 1.099$
- "log": appears in 1/3 documents → $\text{idf} = \log(3/1) \approx 1.099$
- "and": appears in 1/3 documents → $\text{idf} = \log(3/1) \approx 1.099$
Calculating TF-IDF for Document 1:
Using raw counts for TF:
- "the": , →
- "cat": , →
- "mat": , →
The word "mat" gets the highest TF-IDF score in Document 1 because it's unique to that document. "The" gets zero weight because it appears everywhere. This is exactly the behavior we want: distinctive words are emphasized, common words are suppressed.
Code Implementation
Let's implement Bag of Words and TF-IDF from scratch. We'll build this step by step, focusing on understanding each component.
Step 1: Tokenization and Vocabulary Building
First, we need to split documents into words and build our vocabulary:
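A minimal sketch of the tokenization step; the helper name `tokenize` and the whitespace-splitting strategy are illustrative choices, not the only option:

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace into word tokens."""
    return text.lower().split()

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and dog sat",
]

tokenized_docs = [tokenize(doc) for doc in documents]
print(tokenized_docs[0])
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```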
Now we build the vocabulary by collecting all unique words:
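One straightforward way to build the mapping (sorting the words so the index assignment is deterministic):

```python
def build_vocabulary(tokenized_docs):
    """Collect every unique word and assign it a fixed integer index."""
    unique_words = sorted({word for doc in tokenized_docs for word in doc})
    return {word: index for index, word in enumerate(unique_words)}

vocabulary = build_vocabulary(tokenized_docs)
print(vocabulary)
# {'and': 0, 'cat': 1, 'dog': 2, 'log': 3, 'mat': 4, 'on': 5, 'sat': 6, 'the': 7}
```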
Our vocabulary contains 8 unique words. The mapping assigns each word a unique integer index, which we'll use to create our vectors.
Step 2: Bag of Words Vectorization
Now we'll convert each document into a count vector:
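A simple sketch that reuses the vocabulary mapping from the previous step:

```python
def bag_of_words_vector(tokens, vocabulary):
    """Count how many times each vocabulary word appears in one document."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

bow_vectors = [bag_of_words_vector(doc, vocabulary) for doc in tokenized_docs]
for i, vector in enumerate(bow_vectors, start=1):
    print(f"Document {i}: {vector}")
# Document 1: [0, 1, 0, 0, 1, 1, 1, 2]
# Document 2: [0, 0, 1, 1, 0, 1, 1, 2]
# Document 3: [1, 1, 1, 0, 0, 0, 1, 1]
```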
Each vector shows word counts. Document 1 has 2 occurrences of "the" (index 7), 1 of "cat" (index 1), and so on. Notice how sparse these vectors are: most entries are zero.
Step 3: Calculating Term Frequency
Let's implement normalized term frequency:
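One possible sketch supporting the raw, length-normalized, and logarithmic variants discussed above (the `mode` argument is just an illustrative convenience):

```python
import math

def term_frequency(tokens, vocabulary, mode="normalized"):
    """Compute the term frequency of every vocabulary word in one document."""
    counts = bag_of_words_vector(tokens, vocabulary)
    if mode == "normalized":                       # fraction of the document
        return [count / len(tokens) for count in counts]
    if mode == "log":                              # dampened log scaling
        return [math.log(1 + count) for count in counts]
    return counts                                  # raw counts

tf_doc1 = term_frequency(tokenized_docs[0], vocabulary)
tf_doc3 = term_frequency(tokenized_docs[2], vocabulary)
print(round(tf_doc1[vocabulary["cat"]], 3))  # 0.167
print(round(tf_doc3[vocabulary["cat"]], 3))  # 0.2
```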
Normalized TF gives us proportions: "cat" makes up 16.7% of Document 1 and 20% of Document 3. Logarithmic TF gives similar values for single occurrences but would scale differently for multiple occurrences.
Step 4: Calculating Inverse Document Frequency
Now we'll compute IDF for each word:
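A sketch using the unsmoothed formula $\text{idf}(t) = \log(N / \text{df}(t))$ with the natural logarithm:

```python
def inverse_document_frequency(tokenized_docs, vocabulary):
    """Compute idf(t) = log(N / df(t)) for every word in the vocabulary."""
    n_docs = len(tokenized_docs)
    idf = [0.0] * len(vocabulary)
    for word, index in vocabulary.items():
        doc_count = sum(1 for doc in tokenized_docs if word in doc)
        idf[index] = math.log(n_docs / doc_count)
    return idf

idf = inverse_document_frequency(tokenized_docs, vocabulary)
print(round(idf[vocabulary["the"]], 3))  # 0.0   -- appears in every document
print(round(idf[vocabulary["mat"]], 3))  # 1.099 -- appears in only one document
```

Step 5: Calculating TF-IDF

With both components in place, multiplying them gives the final weights. Again a sketch built on the helpers above, using normalized TF:

```python
def tf_idf(tokens, vocabulary, idf):
    """Multiply each term's frequency by its inverse document frequency."""
    tf = term_frequency(tokens, vocabulary)
    return [tf_value * idf_value for tf_value, idf_value in zip(tf, idf)]

tfidf_vectors = [tf_idf(doc, vocabulary, idf) for doc in tokenized_docs]
print(round(tfidf_vectors[0][vocabulary["mat"]], 3))  # 0.183 -- highest weight in Document 1
print(round(tfidf_vectors[0][vocabulary["the"]], 3))  # 0.0   -- appears everywhere, so zero weight
```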
Perfect! The TF-IDF scores emphasize distinctive words. "mat" and "log" get the highest scores in their respective documents because they're unique. Common words like "the" and "sat" get zero weight. This is exactly what we want for distinguishing between documents.
Step 6: Document Similarity
One powerful application of TF-IDF vectors is measuring document similarity using cosine similarity:
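A small from-scratch sketch; note that the exact scores depend on which TF and IDF variants were used to build the vectors:

```python
def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors: dot product divided by the norms."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(tfidf_vectors[0], tfidf_vectors[2]))  # Documents 1 and 3
print(cosine_similarity(tfidf_vectors[1], tfidf_vectors[2]))  # Documents 2 and 3
print(cosine_similarity(tfidf_vectors[0], tfidf_vectors[1]))  # Documents 1 and 2
```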
Documents 1 and 3 share "cat", while documents 2 and 3 share "dog". Both pairs have the same similarity (0.408) because each shares one distinctive word. Documents 1 and 2 are less similar (0.234) because most of the words they share, such as "the" and "sat", are common words with zero TF-IDF weight.
Using scikit-learn for Production
While implementing from scratch teaches the concepts, in practice you'll use libraries like scikit-learn:
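A minimal example with TfidfVectorizer using its default settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and dog sat",
]

# Fit on the corpus and transform it into a sparse TF-IDF matrix in one step
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print(tfidf_matrix.shape)               # (3, 8): 3 documents, 8 vocabulary words
print(vectorizer.get_feature_names_out())
print(cosine_similarity(tfidf_matrix))  # pairwise document similarities
```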
scikit-learn's TfidfVectorizer handles all the details: tokenization, vocabulary building, TF-IDF calculation, and sparse matrix storage. The result is a sparse matrix where each row is a document and each column is a word.
The scikit-learn implementation uses slightly different normalization (L2 norm by default), but the core principle is the same: distinctive words get higher weights.
Limitations & Impact
Limitations
Bag of Words and TF-IDF have several well-known limitations:
- Loss of word order: "The cat chased the dog" and "The dog chased the cat" produce identical vectors. This discards syntactic and semantic information that word order conveys.
- No semantic understanding: These methods treat words as independent symbols. They can't understand that "car" and "automobile" are synonyms, or that "bank" can mean a financial institution or a river edge.
- Vocabulary explosion: As document collections grow, vocabularies can become extremely large (hundreds of thousands of words), leading to high-dimensional, sparse vectors that are computationally expensive.
- Context insensitivity: The same word always gets the same representation, regardless of context. "Apple" in "Apple stock price" and "apple pie recipe" are treated identically.
- Fixed vocabulary: New words not seen during vocabulary building are ignored. This makes the system brittle when encountering domain-specific terminology or evolving language.
Despite these limitations, Bag of Words and TF-IDF remain valuable tools. They're fast, interpretable, and work well as baselines or feature extractors for downstream models.
Impact and Applications
These classical techniques have had enormous impact and continue to be used in production systems:
- Search engines: Early web search (including early Google) relied heavily on TF-IDF for ranking. Google's early ranking combined PageRank's link analysis with TF-IDF-style content analysis.
- Document classification: Email spam filters, news categorization, and sentiment analysis systems often use TF-IDF features with classifiers like Naive Bayes or Support Vector Machines.
- Information retrieval: Library systems, legal document search, and academic paper search engines use TF-IDF to match queries to relevant documents.
- Feature engineering: Even in the era of neural networks, TF-IDF vectors are often concatenated with learned embeddings as input features, combining classical and modern approaches.
- Baseline comparisons: New NLP methods are typically compared against TF-IDF baselines to demonstrate improvement.
- Interpretability: Unlike black-box neural models, TF-IDF scores are directly interpretable. You can see exactly which words contribute to a document's representation and why.
The simplicity and effectiveness of these methods make them excellent starting points for text analysis. They teach us fundamental concepts about text representation that carry forward to more advanced techniques.
Summary
Bag of Words and TF-IDF provide foundational methods for converting text into numerical representations that machine learning algorithms can process.
Key takeaways:
- Bag of Words represents documents as fixed-length vectors of word counts, discarding word order but enabling mathematical operations on text
- Term Frequency (TF) measures how often a word appears in a document, often normalized by document length
- Inverse Document Frequency (IDF) measures how distinctive a word is across a collection, downweighting common words
- TF-IDF combines both components, emphasizing words that are frequent in specific documents but rare overall
- These methods are fast, interpretable, and effective for many tasks, but lose word order and semantic relationships
When to use:
- Building search systems or information retrieval applications
- Creating baseline models for text classification
- Feature engineering for downstream machine learning models
- Situations where interpretability matters more than state-of-the-art performance
What's next:
While Bag of Words and TF-IDF solve the fundamental problem of text representation, they're just the beginning. In the next chapters, we'll explore word embeddings that capture semantic relationships, sequence models that preserve word order, and transformer architectures that understand context. Each builds on these classical foundations while addressing their limitations.