Learn how the Bag of Words model transforms text into numerical vectors through word counting, vocabulary construction, and sparse matrix storage. Master CountVectorizer and understand when this foundational NLP technique works best.

This article is part of the free-to-read Language AI Handbook
Bag of Words
How do you teach a computer to understand text? The first step is deceptively simple: count words. The Bag of Words (BoW) model transforms documents into numerical vectors by tallying how often each word appears. This representation ignores grammar, word order, and context entirely. It treats a document as nothing more than a collection of words tossed into a bag, hence the name.
Despite its simplicity, Bag of Words powered text classification, spam detection, and information retrieval for decades. It remains a surprisingly effective baseline for many NLP tasks. Understanding BoW is essential because it introduces core concepts (vocabulary construction, document-term matrices, and sparse representations) that persist throughout modern NLP.
This chapter walks you through building a Bag of Words representation from scratch. You'll learn how to construct vocabularies, create document-term matrices, handle the explosion of dimensionality with sparse matrices, and understand when this simple approach works and when it fails.
The Core Idea
Consider three short documents:
- "The cat sat on the mat"
- "The dog sat on the log"
- "The cat and the dog"
To represent these numerically, we first build a vocabulary: a list of all unique words across all documents. Then we count how many times each vocabulary word appears in each document.
The Bag of Words model represents text as an unordered collection of words, disregarding grammar and word order but keeping track of word frequency. Each document becomes a vector where each dimension corresponds to a vocabulary word, and the value indicates how often that word appears.
Let's implement this step by step:
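A minimal sketch in plain Python: lowercase and split each document, collect the unique tokens into a sorted vocabulary, and map each token to an index (a real tokenizer would also handle punctuation).

```python
# A minimal sketch of tokenization and vocabulary construction.
documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat and the dog",
]

# Tokenize: lowercase and split on whitespace (real tokenizers do more).
tokenized = [doc.lower().split() for doc in documents]

# Vocabulary: sorted list of unique tokens across all documents.
vocabulary = sorted({token for doc in tokenized for token in doc})

# Map each word to a fixed index, i.e. a dimension in the vector space.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

for i, tokens in enumerate(tokenized, start=1):
    print(f"Doc {i}: {tokens}")
print(f"Vocabulary ({len(vocabulary)} words): {vocabulary}")
print(word_to_index)
```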
Tokenized documents:
  Doc 1: ['the', 'cat', 'sat', 'on', 'the', 'mat']
  Doc 2: ['the', 'dog', 'sat', 'on', 'the', 'log']
  Doc 3: ['the', 'cat', 'and', 'the', 'dog']

Vocabulary (8 words): ['and', 'cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']

Word to index mapping:
  'and' → 0
  'cat' → 1
  'dog' → 2
  'log' → 3
  'mat' → 4
  'on' → 5
  'sat' → 6
  'the' → 7
Our vocabulary contains 8 unique words. Each word maps to a unique index that will become a dimension in our vector representation.
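Counting each token into its vocabulary slot turns every document into one row of the document-term matrix. A sketch, continuing from the variables defined above:

```python
# Build the document-term matrix: one row per document, one column per word.
# Continues from `tokenized`, `vocabulary`, and `word_to_index` above.
doc_term_matrix = []
for tokens in tokenized:
    row = [0] * len(vocabulary)
    for token in tokens:
        row[word_to_index[token]] += 1
    doc_term_matrix.append(row)

for i, row in enumerate(doc_term_matrix, start=1):
    print(f"Doc {i}: {row}")
```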
Document-Term Matrix:
------------------------------------------------------------
Doc    and  cat  dog  log  mat  on   sat  the
------------------------------------------------------------
1      0    1    0    0    1    1    1    2
2      0    0    1    1    0    1    1    2
3      1    1    1    0    0    0    0    2
Each row represents a document, and each column represents a word from our vocabulary. The value at position (i, j) tells us how many times word j appears in document i.

Look at the matrix structure. Document 1 has two occurrences of "the" (the cat... the mat), reflected in the count of 2. Document 3 shares vocabulary with both other documents, which we can see from the overlapping non-zero entries.
Document Similarity from Word Counts
Once documents become vectors, we can measure their similarity. Cosine similarity compares the angle between vectors, ignoring magnitude. Documents with similar word distributions will have high cosine similarity, even if one is much longer than the other.
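As a sketch, cosine similarity is just the dot product of two vectors divided by the product of their lengths. Using the count vectors from the document-term matrix above:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Count vectors for the three documents (rows of the matrix above).
doc1 = [0, 1, 0, 0, 1, 1, 1, 2]
doc2 = [0, 0, 1, 1, 0, 1, 1, 2]
doc3 = [1, 1, 1, 0, 0, 0, 0, 2]

print(f"Doc 1 vs Doc 2: {cosine_similarity(doc1, doc2):.3f}")
print(f"Doc 1 vs Doc 3: {cosine_similarity(doc1, doc3):.3f}")
print(f"Doc 2 vs Doc 3: {cosine_similarity(doc2, doc3):.3f}")
```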

Documents 1 and 2 are the most similar because they share the most words: both follow the frame "The [animal] sat on the [object]", so they have "the", "sat", and "on" in common. Document 3, with its different vocabulary, shows lower similarity to both. This demonstrates how BoW captures topical similarity through shared vocabulary, even though it ignores word order.
Vocabulary Construction
Building a vocabulary seems straightforward, but real-world text introduces complications. How do you handle punctuation? What about rare words that appear only once? What about extremely common words like "the" that appear everywhere?
From Corpus to Vocabulary
A corpus is a collection of documents. The vocabulary is the set of unique terms extracted from this corpus. Let's work with a slightly more realistic example:
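The exact sentences are not reproduced in the text, so treat the corpus in the sketch below as illustrative; it is consistent with the statistics printed afterwards. The counting itself takes one line with collections.Counter:

```python
from collections import Counter

# Illustrative corpus, consistent with the statistics shown below.
corpus = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Natural language processing uses machine learning techniques",
    "AI and machine learning are transforming industries",
    "Neural networks power deep learning systems",
]

tokenized = [doc.lower().split() for doc in corpus]
all_tokens = [token for doc in tokenized for token in doc]

vocabulary = sorted(set(all_tokens))
frequencies = Counter(all_tokens)

print(f"Corpus size: {len(corpus)} documents")
print(f"Total tokens: {len(all_tokens)}")
print(f"Vocabulary size: {len(vocabulary)} unique words")
for word, count in frequencies.most_common():
    print(f"{word:<15} {count}  {'█' * count}")
```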
Corpus size: 5 documents
Total tokens: 36
Vocabulary size: 23 unique words

Vocabulary: a, ai, and, are, artificial, deep, industries, intelligence, is, language, learning, machine, natural, networks, neural, of, power, processing, subset, systems, techniques, transforming, uses
Word Frequencies (descending):
-----------------------------------
learning        6  ██████
machine         4  ████
is              2  ██
a               2  ██
subset          2  ██
of              2  ██
deep            2  ██
artificial      1  █
intelligence    1  █
natural         1  █
language        1  █
processing      1  █
uses            1  █
techniques      1  █
ai              1  █
and             1  █
are             1  █
transforming    1  █
industries      1  █
neural          1  █
networks        1  █
power           1  █
systems         1  █

The word "learning" appears 5 times, "machine" appears 4 times, but many words appear only once. This pattern, a few high-frequency words and many rare words, follows Zipf's Law and is characteristic of natural language.
Vocabulary Pruning
Raw vocabularies from large corpora can contain millions of unique words. Many of these are noise: typos, rare technical terms, or words that appear in only one document. Vocabulary pruning removes uninformative terms to reduce dimensionality and improve model performance.
Minimum Document Frequency
Words that appear in very few documents provide little discriminative power and may represent noise. The min_df parameter sets a threshold: words must appear in at least this many documents (or this fraction of documents) to be included.
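Document frequency is easy to compute by hand: for each word, count how many documents contain it at least once, then keep only the words that clear the threshold. A sketch, continuing from the tokenized corpus above:

```python
# Continues from `tokenized` and `vocabulary` in the previous sketch.
def document_frequency(word, tokenized_docs):
    """Number of documents in which `word` appears at least once."""
    return sum(1 for tokens in tokenized_docs if word in tokens)

min_df = 2
for word in vocabulary:
    df = document_frequency(word, tokenized)
    print(f"{word:<15} {df}   {'keep' if df >= min_df else 'drop'}")

kept = [w for w in vocabulary if document_frequency(w, tokenized) >= min_df]
print(f"Original vocabulary: {len(vocabulary)} words")
print(f"After min_df={min_df}: {len(kept)} words")
```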
Document Frequency Analysis:
---------------------------------------------
Word            Doc Freq    Keep (min_df=2)
---------------------------------------------
a               2           ✓
ai              1           ✗
and             1           ✗
are             1           ✗
artificial      1           ✗
deep            2           ✓
industries      1           ✗
intelligence    1           ✗
is              2           ✓
language        1           ✗
learning        5           ✓
machine         4           ✓
natural         1           ✗
networks        1           ✗
neural          1           ✗
of              2           ✓
power           1           ✗
processing      1           ✗
subset          2           ✓
systems         1           ✗
techniques      1           ✗
transforming    1           ✗
uses            1           ✗
---------------------------------------------
Original vocabulary: 23 words
After min_df=2: 7 words
Words like "artificial", "industries", and "networks" appear in only one document. Removing them reduces our vocabulary while keeping words that appear across multiple documents.
Maximum Document Frequency
At the other extreme, words that appear in almost every document provide no discriminative power. The word "the" might appear in 95% of documents, making it useless for distinguishing between them. The max_df parameter sets an upper threshold.
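A sketch of the upper bound, continuing from the document-frequency helper above: convert the fraction into a document count and drop anything above it.

```python
# Continues from `document_frequency`, `tokenized`, `vocabulary`, and `min_df`.
max_df = 0.8
max_count = int(max_df * len(tokenized))  # 0.8 * 5 documents = 4

too_common = [w for w in vocabulary
              if document_frequency(w, tokenized) > max_count]
print(f"Words appearing in >{max_df:.0%} of documents: {too_common}")

pruned = [w for w in vocabulary
          if min_df <= document_frequency(w, tokenized) <= max_count]
print(f"Vocabulary after min_df={min_df}, max_df={max_df}: {len(pruned)} words")
```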
Total documents: 5
max_df = 0.8 → max document count = 4

Words appearing in >80% of documents:
  'learning' appears in 5/5 documents

Vocabulary after min_df=2, max_df=0.8: 6 words
In our small corpus, "learning" appears in all 5 documents (100%), exceeding our 80% threshold. In real applications, you might filter out words appearing in more than 90% of documents to remove uninformative terms like "the", "is", and "a".
Vocabulary Reduction with min_df
How aggressively should you prune? Let's visualize how vocabulary size changes as we increase the min_df threshold:
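A sweep along these lines produces the kind of curve shown below; a sketch, reusing the document-frequency helper from the previous snippets:

```python
# Sweep the min_df threshold and record the surviving vocabulary size.
# Continues from `document_frequency`, `tokenized`, and `vocabulary`.
for threshold in range(1, 6):
    surviving = [w for w in vocabulary
                 if document_frequency(w, tokenized) >= threshold]
    print(f"min_df={threshold}: {len(surviving)} words")
```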

The steep drop from min_df=1 to min_df=2 is typical. In real corpora, a large fraction of words appear only once (called hapax legomena). Removing these rare words often improves model performance by reducing noise without losing much signal.
Count vs. Binary Representations
So far, we've counted word occurrences. But sometimes presence matters more than frequency. In a binary representation, each cell contains 1 if the word appears in the document and 0 otherwise, regardless of how many times it appears.
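A sketch of both representations for a single document (the period is stripped before splitting; a real tokenizer would handle punctuation more carefully):

```python
from collections import Counter

document = "Deep learning is a subset of machine learning."
tokens = document.lower().replace(".", "").split()

counts = Counter(tokens)                 # count representation
binary = {word: 1 for word in counts}    # binary: presence only

for word in sorted(counts):
    print(f"{word:<10} count={counts[word]}  binary={binary[word]}")
```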
Sample document: 'Deep learning is a subset of machine learning.'
Tokens: ['deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning']

Comparison of representations:
--------------------------------------------------
Word            Count    Binary
--------------------------------------------------
a               1        1
deep            1        1
is              1        1
learning        2        1    ← differs
machine         1        1
of              1        1
subset          1        1
Notice that "learning" appears twice in this document. The count representation records 2, while the binary representation records 1. Which is better? It depends on the task. For document classification, binary representations often work as well as counts. For tasks where word frequency carries meaning (like authorship attribution), counts are more informative.
Sparse Matrix Representation
Real-world vocabularies contain tens of thousands of words, yet most documents use only a small fraction. A news article with 500 words might touch only 200 unique vocabulary terms out of 50,000. Storing all those zeros wastes memory.
A sparse matrix is a matrix where most elements are zero. Sparse matrix formats store only the non-zero values and their positions, dramatically reducing memory usage for high-dimensional, mostly-empty data like document-term matrices.
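A sketch that builds the dense document-term matrix for the example corpus with NumPy and measures how much of it is zero (continuing from the tokenized corpus above):

```python
import numpy as np

# Continues from `tokenized` and `vocabulary` defined earlier.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

dtm = np.zeros((len(tokenized), len(vocabulary)), dtype=np.int64)
for row, tokens in enumerate(tokenized):
    for token in tokens:
        dtm[row, word_to_index[token]] += 1

nonzero = np.count_nonzero(dtm)
print(f"Shape: {dtm.shape}")
print(f"Total elements: {dtm.size}")
print(f"Non-zero elements: {nonzero}")
print(f"Sparsity: {100 * (1 - nonzero / dtm.size):.1f}%")
print(f"Dense memory: {dtm.nbytes} bytes")
```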
Document-Term Matrix Statistics:
  Shape: (5, 23) (documents × vocabulary)
  Total elements: 115
  Non-zero elements: 35
  Zero elements: 80
  Sparsity: 69.6%

Memory comparison:
  Dense matrix: 920 bytes
Even in our tiny example, nearly 70% of the matrix is zeros. In real applications with vocabularies of 100,000+ words and millions of documents, sparsity typically exceeds 99%. Storing a dense matrix would require terabytes of memory for mostly zeros.
CSR Format
The Compressed Sparse Row (CSR) format stores only non-zero values along with their column indices and row boundaries. This is the standard format for document-term matrices because NLP operations typically process one document (row) at a time.
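SciPy's csr_matrix exposes exactly these three arrays. A sketch, converting the dense matrix from the previous snippet:

```python
from scipy.sparse import csr_matrix

# Continues from the dense `dtm` built in the previous sketch.
sparse_dtm = csr_matrix(dtm)

print("data (non-zero values):    ", sparse_dtm.data)
print("indices (column positions):", sparse_dtm.indices)
print("indptr (row boundaries):   ", sparse_dtm.indptr)

# Row i's entries live in data[indptr[i]:indptr[i+1]],
# at the columns listed in indices[indptr[i]:indptr[i+1]].
for row in range(sparse_dtm.shape[0]):
    start, end = sparse_dtm.indptr[row], sparse_dtm.indptr[row + 1]
    print(f"Row {row}: columns {sparse_dtm.indices[start:end]}, "
          f"values {sparse_dtm.data[start:end]}")
```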
CSR Format Internals:
--------------------------------------------------
data (non-zero values): [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
indices (column positions): [ 0 4 7 8 10 11 15 18 0 5 8 10 11 15 18 9 10 11 12 17 20 22 1 2
3 6 10 11 21 5 10 13 14 16 19]
indptr (row boundaries): [ 0 8 15 22 29 35]
How to read this:
Row i's values are data[indptr[i]:indptr[i+1]], stored at the columns given by indices[indptr[i]:indptr[i+1]].
Row 0: indptr[0]=0 and indptr[1]=8, so its entries are data[0:8] at columns indices[0:8]
→ columns [ 0  4  7  8 10 11 15 18], values [1. 1. 1. 1. 1. 1. 1. 1.]
Row 1: columns [ 0 5 8 10 11 15 18], values [1. 1. 1. 2. 1. 1. 1.]
Row 2: columns [ 9 10 11 12 17 20 22], values [1. 1. 1. 1. 1. 1. 1.]

Memory Usage: Dense vs Sparse
======================================================================
Scale                        Dense        Sparse       Savings
----------------------------------------------------------------------
100 × 1,000                  800.0 KB     60.4 KB      92.4%
10,000 × 50,000              4.0 GB       24.0 MB      99.4%
1,000,000 × 100,000          800.0 GB     3.6 GB       99.5%
For a realistic corpus of 1 million documents with a 100,000-word vocabulary, sparse representation uses less than 1% of the memory required by dense storage. This is the difference between fitting in RAM and requiring distributed storage.

Using scikit-learn's CountVectorizer
While understanding the internals is valuable, in practice you'll use scikit-learn's CountVectorizer. It handles tokenization, vocabulary building, and sparse matrix creation in a single, optimized package.
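A sketch of typical usage on the example corpus from earlier in this chapter; fit_transform learns the vocabulary and returns the sparse matrix in one call:

```python
from sklearn.feature_extraction.text import CountVectorizer

# `corpus` is the example corpus defined earlier in this chapter.
vectorizer = CountVectorizer()            # default settings
X = vectorizer.fit_transform(corpus)      # learns vocabulary, returns CSR matrix

print(f"Matrix shape: {X.shape}")
print(f"Matrix type: {type(X)}")
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Feature names: {list(vectorizer.get_feature_names_out())}")
print(f"Non-zero elements: {X.nnz}")
```

Note that CountVectorizer's default token pattern keeps only tokens of two or more characters, so the single-character word "a" is dropped and the vocabulary has 22 entries rather than the 23 we counted by hand.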
CountVectorizer Results:
  Matrix shape: (5, 22)
  Matrix type: <class 'scipy.sparse._csr.csr_matrix'>
  Vocabulary size: 22

Vocabulary (feature names):
  ['ai', 'and', 'are', 'artificial', 'deep', 'industries', 'intelligence', 'is', 'language', 'learning', 'machine', 'natural', 'networks', 'neural', 'of', 'power', 'processing', 'subset', 'systems', 'techniques', 'transforming', 'uses']

Sparse matrix info:
  Non-zero elements: 33
  Sparsity: 70.0%
The result is a sparse CSR matrix ready for machine learning. Let's visualize what CountVectorizer produced:

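The comparison below can be reproduced by fitting several differently configured vectorizers on the same corpus; a sketch (the configuration names are just labels for printing):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sketch: fit differently configured vectorizers on the same example corpus.
configs = {
    "default": CountVectorizer(),
    "binary": CountVectorizer(binary=True),
    "min_df=2": CountVectorizer(min_df=2),
    "max_df=0.8": CountVectorizer(max_df=0.8),
    "bigrams": CountVectorizer(ngram_range=(1, 2)),
}

for name, vec in configs.items():
    X = vec.fit_transform(corpus)
    features = vec.get_feature_names_out()
    print(f"{name}:")
    print(f"  Vocabulary size: {len(features)}")
    print(f"  Non-zero elements: {X.nnz}")
    print(f"  Sample features: {list(features[:5])}")
```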
CountVectorizer Parameter Comparison:
======================================================================
default:
  Vocabulary size: 22
  Non-zero elements: 33
  Sample features: ['ai', 'and', 'are', 'artificial', 'deep']
binary:
  Vocabulary size: 22
  Non-zero elements: 33
  Sample features: ['ai', 'and', 'are', 'artificial', 'deep']
min_df=2:
  Vocabulary size: 6
  Non-zero elements: 17
  Sample features: ['deep', 'is', 'learning', 'machine', 'of']
max_df=0.8:
  Vocabulary size: 21
  Non-zero elements: 28
  Sample features: ['ai', 'and', 'are', 'artificial', 'deep']
bigrams:
  Vocabulary size: 44
  Non-zero elements: 62
  Sample features: ['ai', 'ai and', 'and', 'and machine', 'are']
The ngram_range=(1, 2) setting includes both unigrams and bigrams, capturing two-word phrases like "machine learning" and "deep learning". This dramatically increases vocabulary size but can capture meaningful phrases.
The Loss of Word Order
Bag of Words discards all structural information. "The cat chased the dog" and "The dog chased the cat" produce identical vectors, despite having opposite meanings.
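A sketch that makes this concrete: vectorize the three sentences and compare the resulting rows.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The cat chased the dog",
    "The dog chased the cat",
    "Dog the cat the chased",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences).toarray()

print("Vocabulary:", list(vectorizer.get_feature_names_out()))
for sentence, vector in zip(sentences, X):
    print(f"{sentence!r} -> {vector}")

print("Sentence 1 == Sentence 2:", np.array_equal(X[0], X[1]))
print("Sentence 1 == Sentence 3:", np.array_equal(X[0], X[2]))
```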
Word Order Demonstration:
--------------------------------------------------
Vocabulary: ['cat', 'chased', 'dog', 'the']

'The cat chased the dog'     Vector: [1 1 1 2]
'The dog chased the cat'     Vector: [1 1 1 2]
'Dog the cat the chased'     Vector: [1 1 1 2]

Sentence 1 == Sentence 2: True
Sentence 1 == Sentence 3: True
All three sentences produce IDENTICAL vectors!
This is the fundamental limitation of Bag of Words. It cannot distinguish between:
- Active and passive voice ("John hit Mary" vs "Mary was hit by John")
- Negation scope ("I love this movie" vs "I don't love this movie" have nearly identical vectors)
- Questions and statements ("Is this good?" vs "This is good")
- Any semantic difference that depends on word order

When Bag of Words Works
Despite its limitations, Bag of Words remains useful for many tasks:
Document classification: For categorizing news articles, spam detection, or sentiment analysis on long texts, word presence often matters more than order. A movie review containing "terrible", "boring", and "waste" is likely negative regardless of how those words are arranged.
Information retrieval: Search engines match query terms against document terms. The classic TF-IDF weighting (covered in later chapters) builds directly on the Bag of Words foundation.
Topic modeling: Algorithms like Latent Dirichlet Allocation (LDA) assume documents are mixtures of topics, each characterized by word distributions. The bag-of-words assumption is baked into the model.
Baseline models: Before deploying complex neural networks, a BoW model provides a sanity check. If a simple model achieves 90% accuracy, you know the task is learnable from word frequencies alone.
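As a sketch of how such a baseline can be wired up, the example below uses a hypothetical toy training set with a CountVectorizer plus Multinomial Naive Bayes pipeline, a common minimal choice; the printed results in the text come from a similar but unspecified setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, for illustration only.
train_texts = [
    "brilliant film, fantastic acting, loved every minute",
    "what a wonderful and amazing experience",
    "terrible plot, boring and a waste of time",
    "awful movie, complete waste, very disappointing",
]
train_labels = ["Positive", "Positive", "Negative", "Negative"]

# Bag of Words features feeding a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

for review in ["This was a brilliant and fantastic experience",
               "Complete waste of time, terrible"]:
    prediction = model.predict([review])[0]
    confidence = model.predict_proba([review]).max()
    print(f"Review: {review!r}")
    print(f"  Prediction: {prediction} (confidence: {confidence:.1%})")
```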
Sentiment Classification with Bag of Words:
--------------------------------------------------
Review: 'This was a brilliant and fantastic experience'
  Prediction: Positive (confidence: 97.7%)
Review: 'Complete waste of time, terrible'
  Prediction: Negative (confidence: 97.3%)
Even this trivial example shows BoW capturing sentiment through word presence. Words like "brilliant", "fantastic", and "terrible" carry strong sentiment signals regardless of context.
Limitations and Impact
Bag of Words has fundamental limitations that motivated the development of more sophisticated representations:
No word order: As demonstrated, BoW cannot distinguish sentences with different word arrangements.
No semantics: "Good" and "excellent" are treated as completely unrelated words, even though they're synonyms. Similarly, "bank" (financial institution) and "bank" (river edge) are conflated.
Vocabulary explosion: Adding n-grams helps capture some phrases but causes vocabulary size to explode. Bigrams alone can multiply vocabulary by 10-100x.
Sparsity: High-dimensional sparse vectors are inefficient for neural networks, which prefer dense, lower-dimensional inputs.
Out-of-vocabulary words: Words not seen during training have no representation. A model trained on formal text may fail on social media slang.
These limitations drove the development of word embeddings (Word2Vec, GloVe) and eventually transformer-based models that learn dense, contextual representations. Yet Bag of Words remains the conceptual starting point. Understanding document-term matrices, vocabulary construction, and sparse representations provides the foundation for understanding more advanced techniques.
Key Functions and Parameters
When working with Bag of Words representations, CountVectorizer from scikit-learn is the primary tool. Here are its most important parameters:
CountVectorizer(lowercase, min_df, max_df, binary, ngram_range, stop_words, max_features)
- lowercase (default: True): Convert all text to lowercase before tokenizing. Set to False if case carries meaning (e.g., proper nouns, acronyms).
- min_df: Minimum document frequency threshold. If an integer, the word must appear in at least this many documents. If a float between 0.0 and 1.0, it represents a proportion of documents. Use min_df=2 or higher to remove rare words and typos.
- max_df: Maximum document frequency threshold. Words appearing in more than this fraction of documents are excluded. Use max_df=0.9 to remove extremely common words that provide no discriminative power.
- binary (default: False): If True, all non-zero counts are set to 1. Use binary representation when word presence matters more than frequency.
- ngram_range (default: (1, 1)): Tuple specifying the range of n-gram sizes to include. (1, 2) includes unigrams and bigrams, capturing phrases like "machine learning". Higher values dramatically increase vocabulary size.
- stop_words: Either 'english' for the built-in stop word list, or a custom list of words to exclude. Removes common words like "the", "is", and "and" that typically add noise.
- max_features: Limit the vocabulary to the top N most frequent terms. Useful for controlling dimensionality in very large corpora.
Summary
Bag of Words transforms text into numerical vectors by counting word occurrences, ignoring grammar and word order entirely. Despite this brutal simplification, it powers effective text classification, information retrieval, and topic modeling systems.
Key takeaways:
- Vocabulary construction extracts unique words from a corpus, mapping each to a vector dimension
- Document-term matrices represent documents as rows and vocabulary words as columns, with counts (or binary indicators) as values
- Vocabulary pruning with min_df and max_df removes uninformative rare and common words
- Sparse matrices (CSR format) efficiently store the mostly-zero document-term matrices, reducing memory by 99%+ for realistic corpora
- scikit-learn's CountVectorizer handles tokenization, vocabulary building, and sparse matrix creation in one optimized package
- Word order loss is the fundamental limitation: "The cat chased the dog" and "The dog chased the cat" produce identical vectors
In the next chapters, we'll extend these ideas with n-grams to capture some word sequences, and with TF-IDF weighting to emphasize discriminative terms over common ones.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about Bag of Words representations.