Learn TF-IDF and Bag of Words, including term frequency, inverse document frequency, vectorization, and text classification. Master classical NLP text representation methods with Python implementation.

This article is part of the free-to-read Language AI Handbook
```python
import re

def tokenize(text):
    """Split text into lowercase words, removing punctuation."""
    # Convert to lowercase and extract runs of word characters
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

# Example documents
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and dog sat"
]

# Tokenize all documents
tokenized_docs = [tokenize(doc) for doc in documents]
print("Tokenized documents:")
for i, tokens in enumerate(tokenized_docs, 1):
    print(f"Doc {i}: {tokens}")
```

```
Tokenized documents:
Doc 1: ['the', 'cat', 'sat', 'on', 'the', 'mat']
Doc 2: ['the', 'dog', 'sat', 'on', 'the', 'log']
Doc 3: ['the', 'cat', 'and', 'dog', 'sat']
```
Now we build the vocabulary by collecting all unique words:
```python
def build_vocabulary(tokenized_documents):
    """Create a sorted vocabulary from all tokenized documents."""
    # Collect all unique words
    all_words = set()
    for tokens in tokenized_documents:
        all_words.update(tokens)

    # Sort for consistent ordering
    vocabulary = sorted(all_words)

    # Create word-to-index mapping
    word_to_idx = {word: idx for idx, word in enumerate(vocabulary)}

    return vocabulary, word_to_idx

vocabulary, word_to_idx = build_vocabulary(tokenized_docs)
print(f"Vocabulary ({len(vocabulary)} words): {vocabulary}")
print("\nWord to index mapping:")
for word, idx in word_to_idx.items():
    print(f"  {word}: {idx}")
```

```
Vocabulary (8 words): ['and', 'cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']

Word to index mapping:
  and: 0
  cat: 1
  dog: 2
  log: 3
  mat: 4
  on: 5
  sat: 6
  the: 7
```
Our vocabulary contains 8 unique words. The mapping assigns each word a unique integer index, which we'll use to create our vectors.
Step 2: Bag of Words Vectorization
Now we'll convert each document into a count vector:
```python
def bag_of_words_vectorize(tokenized_doc, word_to_idx):
    """Convert a tokenized document to a Bag of Words vector."""
    # Initialize vector with zeros
    vector = [0] * len(word_to_idx)

    # Count word occurrences
    for word in tokenized_doc:
        if word in word_to_idx:
            idx = word_to_idx[word]
            vector[idx] += 1

    return vector

# Vectorize all documents
bow_vectors = [bag_of_words_vectorize(tokens, word_to_idx)
               for tokens in tokenized_docs]

print("Bag of Words vectors:")
for i, vector in enumerate(bow_vectors, 1):
    print(f"Doc {i}: {vector}")
```

```
Bag of Words vectors:
Doc 1: [0, 1, 0, 0, 1, 1, 1, 2]
Doc 2: [0, 0, 1, 1, 0, 1, 1, 2]
Doc 3: [1, 1, 1, 0, 0, 0, 1, 1]
```
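Each vector shows word counts. Document 1 has 2 occurrences of "the" (index 7), 1 of "cat" (index 1), and so on. Even in this tiny example, three of the eight entries in each vector are zero. With a realistic vocabulary of tens of thousands of words, almost every entry would be zero, which is why these vectors are described as sparse.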
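Because most entries are zero at realistic vocabulary sizes, it can help to store only the words that actually occur. The sketch below is one way to do that with `collections.Counter`; it assumes the documents are already lowercase and punctuation-free, so a plain `split()` stands in for the tokenizer, and the names `sparse_bow` and `dense_doc1` are illustrative only.

```python
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and dog sat",
]

# Sparse Bag of Words: each document maps only its own words to counts.
sparse_bow = [Counter(doc.split()) for doc in documents]

# A dense vector can still be rebuilt on demand from the sparse form.
# Counter returns 0 for words a document does not contain.
vocabulary = sorted(set().union(*sparse_bow))
dense_doc1 = [sparse_bow[0][word] for word in vocabulary]

print(dict(sparse_bow[0]))
print(dense_doc1)
```

The dense vector rebuilt for Document 1 matches the one computed above, while the sparse form stores five entries instead of eight.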
Step 3: Calculating Term Frequency
Let's implement normalized term frequency:
```python
import math

def term_frequency(word, tokenized_doc, use_log=False):
    """Calculate term frequency for a word in a document."""
    count = tokenized_doc.count(word)

    if count == 0:
        return 0.0

    if use_log:
        # Logarithmic scaling
        return 1 + math.log(count)
    else:
        # Normalized by document length
        return count / len(tokenized_doc)

# Calculate TF for "cat" in each document
print("Term Frequency for 'cat':")
for i, tokens in enumerate(tokenized_docs, 1):
    tf_normalized = term_frequency("cat", tokens, use_log=False)
    tf_log = term_frequency("cat", tokens, use_log=True)
    print(f"Doc {i}: normalized={tf_normalized:.3f}, log={tf_log:.3f}")
```

```
Term Frequency for 'cat':
Doc 1: normalized=0.167, log=1.000
Doc 2: normalized=0.000, log=0.000
Doc 3: normalized=0.200, log=1.000
```
Normalized TF gives us proportions: "cat" makes up 16.7% of Document 1 and 20% of Document 3. Logarithmic TF ignores document length: any single occurrence scores exactly 1.0, and repeated occurrences raise the score only slowly, since each doubling of the count adds about 0.69.
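Since "cat" never appears more than once in our toy documents, the difference between the two scalings is easier to see with made-up counts. The following sketch assumes a hypothetical 100-token document; `normalized_tf` and `log_tf` are illustrative helpers that mirror the two branches of `term_frequency`.

```python
import math

def normalized_tf(count, doc_length):
    """Raw count divided by document length."""
    return count / doc_length

def log_tf(count):
    """1 + ln(count) for words that occur, 0 otherwise."""
    return 1 + math.log(count) if count > 0 else 0.0

doc_length = 100  # hypothetical document length, chosen for illustration
for count in (1, 2, 10, 50):
    print(f"count={count:2d}: "
          f"normalized={normalized_tf(count, doc_length):.3f}, "
          f"log={log_tf(count):.3f}")
```

Going from 1 to 50 occurrences multiplies the normalized score by 50 but raises the logarithmic score only from 1.0 to roughly 4.9, which dampens the influence of heavily repeated words.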
Step 4: Calculating Inverse Document Frequency
Now we'll compute IDF for each word:
```python
def inverse_document_frequency(word, tokenized_documents):
    """Calculate IDF for a word across a document collection."""
    # Count documents containing the word
    doc_count = sum(1 for tokens in tokenized_documents if word in tokens)

    if doc_count == 0:
        return 0.0  # Word doesn't appear anywhere

    total_docs = len(tokenized_documents)
    # Standard IDF formula
    idf = math.log(total_docs / doc_count)
    return idf

# Calculate IDF for each word in vocabulary
print("Inverse Document Frequency (IDF):")
idf_scores = {}
for word in vocabulary:
    idf = inverse_document_frequency(word, tokenized_docs)
    idf_scores[word] = idf
    n_containing = sum(1 for tokens in tokenized_docs if word in tokens)
    print(f"  {word:6s}: {idf:.3f} (appears in {n_containing}/{len(tokenized_docs)} docs)")
```

```
Inverse Document Frequency (IDF):
  and   : 1.099 (appears in 1/3 docs)
  cat   : 0.405 (appears in 2/3 docs)
  dog   : 0.405 (appears in 2/3 docs)
  log   : 1.099 (appears in 1/3 docs)
  mat   : 1.099 (appears in 1/3 docs)
  on    : 0.405 (appears in 2/3 docs)
  sat   : 0.000 (appears in 3/3 docs)
  the   : 0.000 (appears in 3/3 docs)
```
Words that appear in all documents ("the", "sat") get IDF = 0, meaning they provide no discriminative power. Words unique to one document ("and", "log", "mat") get the highest IDF scores.
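The function above rescans the whole collection for every word, which is fine for three documents but slow for large corpora. A possible alternative, sketched here under the same IDF definition, counts document frequencies in a single pass; `document_frequencies` and `idf_table` are hypothetical helper names, not part of the implementation above.

```python
import math
from collections import Counter

tokenized_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "and", "dog", "sat"],
]

def document_frequencies(tokenized_documents):
    """Count, for each word, how many documents contain it at least once."""
    df = Counter()
    for tokens in tokenized_documents:
        df.update(set(tokens))  # set() so repeats within a document count once
    return df

def idf_table(tokenized_documents):
    """Compute ln(N / df) for every word after one pass over the collection."""
    n_docs = len(tokenized_documents)
    df = document_frequencies(tokenized_documents)
    return {word: math.log(n_docs / count) for word, count in df.items()}

idf = idf_table(tokenized_docs)
for word in sorted(idf):
    print(f"{word:4s} {idf[word]:.3f}")
```

The resulting table can be looked up per word instead of being recomputed, and it gives the same scores as the per-word function.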
Step 5: Computing TF-IDF Vectors
Finally, we combine TF and IDF to create TF-IDF vectors:
```python
def tfidf_vectorize(tokenized_doc, word_to_idx, tokenized_documents,
                    use_log_tf=False):
    """Convert a tokenized document to a TF-IDF vector."""
    vector = [0.0] * len(word_to_idx)

    for word, idx in word_to_idx.items():
        # Calculate TF
        tf = term_frequency(word, tokenized_doc, use_log=use_log_tf)

        # Calculate IDF
        idf = inverse_document_frequency(word, tokenized_documents)

        # TF-IDF is the product
        vector[idx] = tf * idf

    return vector

# Compute TF-IDF vectors for all documents
tfidf_vectors = [tfidf_vectorize(tokens, word_to_idx, tokenized_docs,
                                 use_log_tf=False)
                 for tokens in tokenized_docs]

print("TF-IDF vectors (using normalized TF):")
for i, vector in enumerate(tfidf_vectors, 1):
    print(f"\nDoc {i}:")
    # Show non-zero values for clarity
    non_zero = [(vocabulary[j], f"{vector[j]:.3f}")
                for j, val in enumerate(vector) if val > 0]
    for word, score in sorted(non_zero, key=lambda x: float(x[1]), reverse=True):
        print(f"  {word:6s}: {score}")
```

```
TF-IDF vectors (using normalized TF):

Doc 1:
  mat   : 0.183
  cat   : 0.068
  on    : 0.068

Doc 2:
  log   : 0.183
  dog   : 0.068
  on    : 0.068

Doc 3:
  and   : 0.220
  cat   : 0.081
  dog   : 0.081
```
Perfect! The TF-IDF scores emphasize distinctive words. "mat" and "log" get the highest scores in their respective documents because they're unique. Common words like "the" and "sat" get zero weight. This is exactly what we want for distinguishing between documents.
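To make the arithmetic behind these scores explicit, here is the calculation for three words in Document 1, written out with the normalized TF and the IDF values from the previous steps. The notation ($d_1$ for Document 1, $\mathrm{tf}$, $\mathrm{idf}$) is chosen here for illustration.

```latex
\begin{align*}
\text{tf-idf}(\text{mat}, d_1) &= \tfrac{1}{6} \cdot \ln\tfrac{3}{1} \approx 0.167 \times 1.099 \approx 0.183 \\
\text{tf-idf}(\text{cat}, d_1) &= \tfrac{1}{6} \cdot \ln\tfrac{3}{2} \approx 0.167 \times 0.405 \approx 0.068 \\
\text{tf-idf}(\text{the}, d_1) &= \tfrac{2}{6} \cdot \ln\tfrac{3}{3} \approx 0.333 \times 0 = 0
\end{align*}
```

Although "the" is the most frequent word in the document, its IDF of zero removes it from the vector entirely.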
Step 6: Document Similarity
One powerful application of TF-IDF vectors is measuring document similarity using cosine similarity:
```python
import math

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors."""
    # Dot product
    dot_product = sum(a * b for a, b in zip(vec1, vec2))

    # Magnitudes
    magnitude1 = math.sqrt(sum(a * a for a in vec1))
    magnitude2 = math.sqrt(sum(b * b for b in vec2))

    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0

    return dot_product / (magnitude1 * magnitude2)

# Compare all pairs of documents
print("Cosine similarity between documents (using TF-IDF):")
for i in range(len(tfidf_vectors)):
    for j in range(i + 1, len(tfidf_vectors)):
        similarity = cosine_similarity(tfidf_vectors[i], tfidf_vectors[j])
        print(f"Doc {i+1} vs Doc {j+1}: {similarity:.3f}")
```

```
Cosine similarity between documents (using TF-IDF):
Doc 1 vs Doc 2: 0.107
Doc 1 vs Doc 3: 0.107
Doc 2 vs Doc 3: 0.107
```
Documents 1 and 3 share "cat", documents 2 and 3 share "dog", and documents 1 and 2 share "on". Each pair therefore has exactly one word in common that carries non-zero weight, and each of those words appears in two of the three documents, so all three pairs end up with the same similarity (0.107). The pairs also share "the" and "sat", but those words contribute nothing because their TF-IDF weight is zero.
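The same machinery can rank documents against a search query by treating the query as a short document. The sketch below is a self-contained, simplified version of the steps above: it assumes whitespace tokenization is enough for these lowercase documents, and the query string and the helper names `tfidf_vector` and `cosine` are illustrative.

```python
import math
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and dog sat",
]
tokenized_docs = [doc.split() for doc in documents]
n_docs = len(tokenized_docs)

# Document frequencies and IDF, computed once for the collection.
df = Counter()
for tokens in tokenized_docs:
    df.update(set(tokens))
idf = {word: math.log(n_docs / count) for word, count in df.items()}
vocabulary = sorted(idf)

def tfidf_vector(tokens):
    """Normalized TF times IDF; words outside the vocabulary are ignored."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf[w] for w in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc_vectors = [tfidf_vector(tokens) for tokens in tokenized_docs]
query_vector = tfidf_vector("cat on the mat".split())  # hypothetical query

# Sort documents by similarity to the query, best match first.
ranking = sorted(
    ((cosine(query_vector, vec), doc_id)
     for doc_id, vec in enumerate(doc_vectors, 1)),
    reverse=True,
)
for score, doc_id in ranking:
    print(f"Doc {doc_id}: {score:.3f}")
```

Document 1 ranks first because it contains every weighted query word ("cat", "on", "mat"); the other two documents each match only one of them.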
Using scikit-learn for Production
While implementing from scratch teaches the concepts, in practice you'll use libraries like scikit-learn:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize vectorizer
vectorizer = TfidfVectorizer(
    lowercase=True,
    token_pattern=r'\b\w+\b',  # Word boundaries
    max_features=1000,         # Limit vocabulary size
    min_df=2,                  # Ignore words appearing in < 2 documents
    max_df=0.95                # Ignore words appearing in > 95% of documents
)

# Fit and transform
tfidf_matrix = vectorizer.fit_transform(documents)

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"\nSample vocabulary words: {list(vectorizer.vocabulary_.keys())[:10]}")
```

```
TF-IDF matrix shape: (3, 3)
Vocabulary size: 3

Sample vocabulary words: ['cat', 'on', 'dog']
```
scikit-learn's TfidfVectorizer handles all the details: tokenization, vocabulary building, TF-IDF calculation, and sparse matrix storage. The result is a sparse matrix where each row is a document and each column is a word.
```python
# Convert to dense array for display (not recommended for large datasets)
dense_matrix = tfidf_matrix.toarray()

print("TF-IDF matrix (documents × words):")
print(f"\nWords: {vectorizer.get_feature_names_out()}")
print(f"\nMatrix:\n{dense_matrix}")
```

```
TF-IDF matrix (documents × words):

Words: ['cat' 'dog' 'on']

Matrix:
[[0.70710678 0.         0.70710678]
 [0.         0.70710678 0.70710678]
 [0.70710678 0.70710678 0.        ]]
```
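The scikit-learn implementation differs from ours in its defaults: it uses raw counts for TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and it scales each document vector to unit L2 norm. With `min_df=2` and `max_df=0.95`, only "cat", "dog", and "on" survive the filtering, and every row contains two equally weighted terms, which is why each non-zero entry is 0.707. The core principle is the same: distinctive words get higher weights.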
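To check where the 0.707 values come from, the matrix can be reproduced by hand. This is a minimal sketch that assumes scikit-learn's documented defaults (raw-count TF, smoothed IDF, L2 normalization) and hard-codes the three words left after the `min_df` and `max_df` filters; `smoothed_idf` and `l2_normalize` are illustrative helpers, not library functions.

```python
import math

tokenized_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "and", "dog", "sat"],
]
n_docs = len(tokenized_docs)
kept_words = ["cat", "dog", "on"]  # assumed vocabulary after min_df / max_df filtering

def smoothed_idf(word):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for tokens in tokenized_docs if word in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector

rows = []
for tokens in tokenized_docs:
    raw = [tokens.count(word) * smoothed_idf(word) for word in kept_words]
    rows.append(l2_normalize(raw))

for row in rows:
    print([round(x, 4) for x in row])
# [0.7071, 0.0, 0.7071]
# [0.0, 0.7071, 0.7071]
# [0.7071, 0.7071, 0.0]
```

All three kept words appear in two documents, so they share the same IDF, and normalizing a row with two equal entries gives 1/√2 ≈ 0.7071 for each.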
Limitations & Impact
Limitations
Bag of Words and TF-IDF have several well-known limitations:
- Loss of word order: "The cat chased the dog" and "The dog chased the cat" produce identical vectors. This discards syntactic and semantic information that word order conveys.
- No semantic understanding: These methods treat words as independent symbols. They can't understand that "car" and "automobile" are synonyms, or that "bank" can mean a financial institution or a river edge.
- Vocabulary explosion: As document collections grow, vocabularies can become extremely large (hundreds of thousands of words), leading to high-dimensional, sparse vectors that are computationally expensive.
- Context insensitivity: The same word always gets the same representation, regardless of context. "Apple" in "Apple stock price" and "apple pie recipe" are treated identically.
- Fixed vocabulary: New words not seen during vocabulary building are ignored. This makes the system brittle when encountering domain-specific terminology or evolving language.
Despite these limitations, Bag of Words and TF-IDF remain valuable tools. They're fast, interpretable, and work well as baselines or feature extractors for downstream models.
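Two of the limitations listed above, loss of word order and the fixed vocabulary, are easy to reproduce in a few lines. The sketch below uses a small hand-picked vocabulary and a hypothetical `bow` helper; the word "axolotl" is an arbitrary stand-in for any term that was not seen when the vocabulary was built.

```python
from collections import Counter

def bow(text, vocabulary):
    """Count vector over a fixed vocabulary; unknown words are dropped."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["cat", "chased", "dog", "the"]

# Loss of word order: opposite meanings, identical vectors.
v1 = bow("The cat chased the dog", vocabulary)
v2 = bow("The dog chased the cat", vocabulary)
print(v1, v2, v1 == v2)  # [1, 1, 1, 2] [1, 1, 1, 2] True

# Fixed vocabulary: the unseen word leaves no trace in the vector.
v3 = bow("The axolotl chased the cat", vocabulary)
print(v3, sum(v3))  # [1, 1, 0, 2] 4 (five tokens in, four counted)
```

In both cases the vector alone gives no indication that information was lost, which is worth keeping in mind when interpreting downstream results.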
Impact and Applications
These classical techniques have had enormous impact and continue to be used in production systems:
- Search engines: Early web search engines relied heavily on TF-IDF-style term weighting for ranking. Google combined content-based relevance signals of this kind with link analysis from the PageRank algorithm.
- Document classification: Email spam filters, news categorization, and sentiment analysis systems often use TF-IDF features with classifiers like Naive Bayes or Support Vector Machines.
- Information retrieval: Library systems, legal document search, and academic paper search engines use TF-IDF to match queries to relevant documents.
- Feature engineering: Even in the era of neural networks, TF-IDF vectors are often concatenated with learned embeddings as input features, combining classical and modern approaches.
- Baseline comparisons: New NLP methods are typically compared against TF-IDF baselines to demonstrate improvement.
- Interpretability: Unlike black-box neural models, TF-IDF scores are directly interpretable. You can see exactly which words contribute to a document's representation and why.
The simplicity and effectiveness of these methods make them excellent starting points for text analysis. They teach us fundamental concepts about text representation that carry forward to more advanced techniques.
Summary
Bag of Words and TF-IDF provide foundational methods for converting text into numerical representations that machine learning algorithms can process.
Key takeaways:
- Bag of Words represents documents as fixed-length vectors of word counts, discarding word order but enabling mathematical operations on text
- Term Frequency (TF) measures how often a word appears in a document, often normalized by document length
- Inverse Document Frequency (IDF) measures how distinctive a word is across a collection, downweighting common words
- TF-IDF combines both components, emphasizing words that are frequent in specific documents but rare overall
- These methods are fast, interpretable, and effective for many tasks, but lose word order and semantic relationships
When to use:
- Building search systems or information retrieval applications
- Creating baseline models for text classification
- Feature engineering for downstream machine learning models
- Situations where interpretability matters more than state-of-the-art performance
What's next:
While Bag of Words and TF-IDF solve the fundamental problem of text representation, they're just the beginning. In the next chapters, we'll explore word embeddings that capture semantic relationships, sequence models that preserve word order, and transformer architectures that understand context. Each builds on these classical foundations while addressing their limitations.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.