Bag of Words: Document-Term Matrices, Vocabulary Construction & Sparse Representations

Michael Brenndoerfer · Updated March 24, 2025 · 33 min read

Learn how the Bag of Words model transforms text into numerical vectors through word counting, vocabulary construction, and sparse matrix storage. Master CountVectorizer and understand when this foundational NLP technique works best.

Bag of Words

How do you teach a computer to understand text? The first step is deceptively simple: count words. The Bag of Words (BoW) model transforms documents into numerical vectors by tallying how often each word appears. This representation ignores grammar, word order, and context entirely. It treats a document as nothing more than a collection of words tossed into a bag, hence the name.

Despite its simplicity, Bag of Words powered text classification, spam detection, and information retrieval for decades. It remains a surprisingly effective baseline for many NLP tasks. Understanding BoW is essential because it introduces core concepts (vocabulary construction, document-term matrices, and sparse representations) that persist throughout modern NLP.

This chapter walks you through building a Bag of Words representation from scratch. You'll learn how to construct vocabularies, create document-term matrices, handle the explosion of dimensionality with sparse matrices, and understand when this simple approach works and when it fails.

The Core Idea

Consider three short documents:

  1. "The cat sat on the mat"
  2. "The dog sat on the log"
  3. "The cat and the dog"

To represent these numerically, we first build a vocabulary: a list of all unique words across all documents. Then we count how many times each vocabulary word appears in each document.

Bag of Words

The Bag of Words model represents text as an unordered collection of words, disregarding grammar and word order but keeping track of word frequency. Each document becomes a vector where each dimension corresponds to a vocabulary word, and the value indicates how often that word appears.

Let's implement this step by step:

In[2]:
Code
# Our corpus of documents
documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat and the dog",
]

# Step 1: Tokenize and lowercase
tokenized_docs = [doc.lower().split() for doc in documents]

# Step 2: Build vocabulary from all documents
all_words = []
for doc in tokenized_docs:
    all_words.extend(doc)

vocabulary = sorted(set(all_words))
word_to_idx = {word: idx for idx, word in enumerate(vocabulary)}
Out[3]:
Console
Tokenized documents:
  Doc 1: ['the', 'cat', 'sat', 'on', 'the', 'mat']
  Doc 2: ['the', 'dog', 'sat', 'on', 'the', 'log']
  Doc 3: ['the', 'cat', 'and', 'the', 'dog']

Vocabulary (8 words):
  ['and', 'cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']

Word to index mapping:
  'and' → 0
  'cat' → 1
  'dog' → 2
  'log' → 3
  'mat' → 4
  'on' → 5
  'sat' → 6
  'the' → 7

The vocabulary contains all unique words found across the documents. Each word maps to a unique index that will become a dimension in our vector representation.

In[4]:
Code
import numpy as np


# Step 3: Create document vectors by counting word occurrences
def document_to_vector(doc_tokens, word_to_idx):
    """Convert a tokenized document to a count vector."""
    vector = np.zeros(len(word_to_idx))
    for token in doc_tokens:
        if token in word_to_idx:
            vector[word_to_idx[token]] += 1
    return vector


# Create the document-term matrix
doc_term_matrix = np.array(
    [document_to_vector(doc, word_to_idx) for doc in tokenized_docs]
)
Out[5]:
Console
Document-Term Matrix:
------------------------------------------------------------
Doc    and    cat    dog    log    mat     on    sat    the
------------------------------------------------------------
  1      0      1      0      0      1      1      1      2
  2      0      0      1      1      0      1      1      2
  3      1      1      1      0      0      0      0      2

Each row represents a document, and each column represents a word from our vocabulary. Formally, we can express the document-term matrix as $\mathbf{M} \in \mathbb{R}^{m \times n}$, where element $M_{ij}$ is the count of vocabulary word $j$ in document $i$. Here, $m$ is the number of documents and $n$ is the vocabulary size.

Out[6]:
Visualization
Heatmap showing document-term matrix with documents as rows and vocabulary words as columns, cell colors indicating word counts.
Document-term matrix visualization for three sample documents. Each row represents a document, each column a vocabulary word. Cell intensity indicates word frequency, with darker cells showing higher counts. Notice how 'the' appears twice in every document, while most words appear only once or not at all.

Look at the matrix structure. Document 1 has two occurrences of "the" (the cat... the mat), reflected in the count of 2. Document 3 shares vocabulary with both other documents, which we can see from the overlapping non-zero entries.
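Because every word has a fixed column index, individual counts can be read directly out of the matrix. A quick check using the doc_term_matrix and word_to_idx built above:

Code
# Read individual counts via the word-to-index mapping
print(doc_term_matrix.shape)                    # (3, 8): 3 documents, 8 vocabulary words
print(doc_term_matrix[0, word_to_idx["the"]])   # 2.0: "the" appears twice in Document 1
print(doc_term_matrix[2, word_to_idx["sat"]])   # 0.0: "sat" does not appear in Document 3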

Document Similarity from Word Counts

Once documents become vectors, we can measure their similarity using cosine similarity. This metric computes the cosine of the angle between two vectors, effectively measuring how similar their directions are in high-dimensional space, regardless of their magnitudes.

Given two document vectors $\mathbf{a}$ and $\mathbf{b}$, cosine similarity is defined as:

$$\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}}$$

where:

  • $\mathbf{a} \cdot \mathbf{b}$: the dot product of vectors $\mathbf{a}$ and $\mathbf{b}$, computed as $\sum_{i=1}^{n} a_i b_i$
  • $\|\mathbf{a}\|$: the Euclidean norm (magnitude) of vector $\mathbf{a}$, computed as $\sqrt{\sum_{i=1}^{n} a_i^2}$
  • $n$: the vocabulary size (number of dimensions in each vector)
  • $a_i, b_i$: the word counts at position $i$ in vectors $\mathbf{a}$ and $\mathbf{b}$ respectively

The result ranges from 0 (completely different, no shared vocabulary) to 1 (identical word distributions). By normalizing by vector magnitudes, cosine similarity ensures that documents with similar word proportions are considered similar even if one document is much longer than the other.

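Before reaching for scikit-learn, it helps to compute one similarity by hand. The sketch below applies the formula directly to the first two rows of doc_term_matrix, using NumPy's dot product and Euclidean norm:

Code
# Cosine similarity between Document 1 and Document 2, straight from the formula
a = doc_term_matrix[0]  # "The cat sat on the mat"
b = doc_term_matrix[1]  # "The dog sat on the log"

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 0.75: shared counts of "the", "sat", and "on" drive the similarity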
In[7]:
Code
from sklearn.metrics.pairwise import cosine_similarity

# Compute pairwise cosine similarity
similarity_matrix = cosine_similarity(doc_term_matrix)
Out[8]:
Visualization
Heatmap showing pairwise cosine similarity between three documents, with values ranging from 0.67 to 1.0.
Cosine similarity matrix between our three sample documents. Documents 1 and 2 share 'the', 'sat', and 'on', giving them the highest off-diagonal similarity (0.75). Document 3 shares 'the', 'cat', and 'dog' with the others but lacks the verb phrase, resulting in a lower similarity of 0.67 with each. The diagonal shows perfect self-similarity (1.0).

Documents 1 and 2 are most similar because they share the phrase structure "The [animal] sat on the [object]". Document 3, with its different structure, shows lower similarity to both. This demonstrates how BoW captures topical similarity through shared vocabulary, even though it ignores word order.

Vocabulary Construction

Building a vocabulary seems straightforward, but real-world text introduces complications. How do you handle punctuation? What about rare words that appear only once? What about extremely common words like "the" that appear everywhere?

From Corpus to Vocabulary

A corpus is a collection of documents. The vocabulary is the set of unique terms extracted from this corpus. Let's work with a slightly more realistic example:

In[9]:
Code
corpus = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning is a subset of machine learning.",
    "Natural language processing uses machine learning techniques.",
    "AI and machine learning are transforming industries.",
    "Neural networks power deep learning systems.",
]


def tokenize(text):
    """Simple tokenization: lowercase and split on whitespace/punctuation."""
    import re

    # Convert to lowercase and extract word tokens
    tokens = re.findall(r"\b[a-z]+\b", text.lower())
    return tokens


# Tokenize all documents
tokenized_corpus = [tokenize(doc) for doc in corpus]

# Build vocabulary
all_tokens = []
for doc in tokenized_corpus:
    all_tokens.extend(doc)

vocabulary = sorted(set(all_tokens))
vocab_size = len(vocabulary)
Out[10]:
Console
Corpus size: 5 documents
Total tokens: 36
Vocabulary size: 23 unique words

Vocabulary:
  a, ai, and, are, artificial
  deep, industries, intelligence, is, language
  learning, machine, natural, networks, neural
  of, power, processing, subset, systems
  techniques, transforming, uses

The vocabulary is significantly smaller than the total token count because many words repeat across documents. This compression—from raw tokens to unique vocabulary terms—is a key characteristic of text data.

Word Frequency Analysis

Before finalizing the vocabulary, examining word frequencies helps identify potential issues:

In[11]:
Code
from collections import Counter

# Count word frequencies across the corpus
word_counts = Counter(all_tokens)

# Sort by frequency
sorted_counts = word_counts.most_common()
Out[12]:
Console
Word Frequencies (descending):
-----------------------------------
  learning         6  ██████
  machine          4  ████
  is               2  ██
  a                2  ██
  subset           2  ██
  of               2  ██
  deep             2  ██
  artificial       1  █
  intelligence     1  █
  natural          1  █
  language         1  █
  processing       1  █
  uses             1  █
  techniques       1  █
  ai               1  █
  and              1  █
  are              1  █
  transforming     1  █
  industries       1  █
  neural           1  █
  networks         1  █
  power            1  █
  systems          1  █
Out[13]:
Visualization
Horizontal bar chart showing word frequencies, with 'learning' (6) and 'machine' (4) having the highest counts and many words appearing only once.
Word frequency distribution in our sample corpus. Common words like 'learning' and 'machine' dominate, appearing in multiple documents. Many words appear only once (hapax legomena), contributing to vocabulary size without adding much discriminative power.

The word "learning" appears 5 times, "machine" appears 4 times, but many words appear only once. This pattern, a few high-frequency words and many rare words, follows Zipf's Law and is characteristic of natural language.

Vocabulary Pruning

Raw vocabularies from large corpora can contain millions of unique words. Many of these are noise: typos, rare technical terms, or words that appear in only one document. Vocabulary pruning removes uninformative terms to reduce dimensionality and improve model performance.

Minimum Document Frequency

The document frequency of a word $w$, denoted $\text{df}(w)$, is the number of documents in which that word appears at least once:

$$\text{df}(w) = |\{d \in D : w \in d\}|$$

where:

  • $D$: the corpus (collection of all documents)
  • $d$: an individual document in the corpus
  • $w \in d$: indicates that word $w$ appears in document $d$
  • $|\cdot|$: the number of elements in the set

Words that appear in very few documents provide little discriminative power and may represent noise. The min_df parameter sets a threshold: words must appear in at least this many documents (or this fraction of documents) to be included.

In[14]:
Code
def compute_document_frequency(tokenized_corpus):
    """Count how many documents each word appears in."""
    doc_freq = Counter()
    for doc in tokenized_corpus:
        # Count each word once per document
        unique_words = set(doc)
        doc_freq.update(unique_words)
    return doc_freq


doc_freq = compute_document_frequency(tokenized_corpus)

# Apply min_df threshold
min_df = 2  # Word must appear in at least 2 documents
filtered_vocab_min = {word for word, freq in doc_freq.items() if freq >= min_df}
Out[15]:
Console
Document Frequency Analysis:
---------------------------------------------
Word                   Doc Freq Keep (min_df=2)
---------------------------------------------
a                             2               ✓
ai                            1               ✗
and                           1               ✗
are                           1               ✗
artificial                    1               ✗
deep                          2               ✓
industries                    1               ✗
intelligence                  1               ✗
is                            2               ✓
language                      1               ✗
learning                      5               ✓
machine                       4               ✓
natural                       1               ✗
networks                      1               ✗
neural                        1               ✗
of                            2               ✓
power                         1               ✗
processing                    1               ✗
subset                        2               ✓
systems                       1               ✗
techniques                    1               ✗
transforming                  1               ✗
uses                          1               ✗
---------------------------------------------
Original vocabulary: 23 words
After min_df=2:      7 words

Words like "artificial", "industries", and "networks" appear in only one document. Removing them reduces our vocabulary while keeping words that appear across multiple documents.

Maximum Document Frequency

At the other extreme, words that appear in almost every document provide no discriminative power. The word "the" might appear in 95% of documents, making it useless for distinguishing between them. The max_df parameter sets an upper threshold.

In[16]:
Code
# Apply max_df threshold (as a fraction of documents)
max_df = 0.8  # Word must appear in at most 80% of documents
num_docs = len(tokenized_corpus)
max_doc_count = int(max_df * num_docs)

filtered_vocab_max = {
    word for word, freq in doc_freq.items() if freq <= max_doc_count
}

# Combined filtering
filtered_vocab = {
    word
    for word, freq in doc_freq.items()
    if freq >= min_df and freq <= max_doc_count
}
Out[17]:
Console
Total documents: 5
max_df = 0.8 → max document count = 4

Words appearing in >80% of documents:
  'learning' appears in 5/5 documents

Vocabulary after min_df=2, max_df=0.8: 6 words

In our small corpus, "learning" appears in all 5 documents (100%), exceeding our 80% threshold. In real applications, you might filter out words appearing in more than 90% of documents to remove uninformative terms like "the", "is", and "a".

Vocabulary Reduction with min_df

How aggressively should you prune? Let's visualize how vocabulary size changes as we increase the min_df threshold:

In[18]:
Code
# Calculate vocabulary size at different min_df thresholds
min_df_values = range(1, num_docs + 1)
vocab_sizes = []

for threshold in min_df_values:
    remaining = sum(1 for freq in doc_freq.values() if freq >= threshold)
    vocab_sizes.append(remaining)
Out[19]:
Visualization
Line plot showing vocabulary size decreasing as min_df threshold increases, reaching 1 word at min_df=5.
Vocabulary size reduction as minimum document frequency threshold increases. At min_df=1, all unique words are included. Raising the threshold to min_df=2 removes words appearing in only one document, cutting vocabulary nearly in half. By min_df=5 (all documents), only 'learning' remains.

The steep drop from min_df=1 to min_df=2 is typical. In real corpora, a large fraction of words appear only once (called hapax legomena). Removing these rare words often improves model performance by reducing noise without losing much signal.

Count vs. Binary Representations

So far, we've counted word occurrences. But sometimes presence matters more than frequency. In a binary representation, each cell contains 1 if the word appears in the document and 0 otherwise, regardless of how many times it appears.

In[20]:
Code
# Create word-to-index mapping for our vocabulary
word_to_idx = {word: idx for idx, word in enumerate(sorted(vocabulary))}


# Count representation
def to_count_vector(tokens, word_to_idx):
    vector = np.zeros(len(word_to_idx))
    for token in tokens:
        if token in word_to_idx:
            vector[word_to_idx[token]] += 1
    return vector


# Binary representation
def to_binary_vector(tokens, word_to_idx):
    vector = np.zeros(len(word_to_idx))
    for token in tokens:
        if token in word_to_idx:
            vector[word_to_idx[token]] = 1  # Set to 1, don't increment
    return vector


# Compare representations for a sample document
# (Document 2: "Deep learning is a subset of machine learning.")
sample_doc = tokenized_corpus[1]
count_vec = to_count_vector(sample_doc, word_to_idx)
binary_vec = to_binary_vector(sample_doc, word_to_idx)
Out[21]:
Console
Sample document: 'Deep learning is a subset of machine learning.'
Tokens: ['deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning']

Comparison of representations:
--------------------------------------------------
Word                 Count     Binary
--------------------------------------------------
a                        1          1
deep                     1          1
is                       1          1
learning                 2          1 ← differs
machine                  1          1
of                       1          1
subset                   1          1

Notice that "learning" appears twice in this document. The count representation records 2, while the binary representation records 1. Which is better? It depends on the task. For document classification, binary representations often work as well as counts. For tasks where word frequency carries meaning (like authorship attribution), counts are more informative.

Sparse Matrix Representation

Real-world vocabularies contain tens of thousands of words, yet most documents use only a small fraction. A news article with 500 words might touch only 200 unique vocabulary terms out of 50,000. Storing all those zeros wastes memory.

Sparse Matrix

A sparse matrix is a matrix where most elements are zero. Sparse matrix formats store only the non-zero values and their positions, dramatically reducing memory usage for high-dimensional, mostly-empty data like document-term matrices.

The Sparsity Problem

Sparsity measures the proportion of zero elements in a matrix. For a document-term matrix $\mathbf{M}$ with $m$ documents (rows) and $n$ vocabulary words (columns), sparsity is defined as:

$$\text{sparsity} = \frac{\text{number of zero elements}}{\text{total elements}} = \frac{mn - \text{nnz}}{mn} = 1 - \frac{\text{nnz}}{mn}$$

where:

  • $m$: the number of documents (rows in the matrix)
  • $n$: the vocabulary size (columns in the matrix)
  • $\text{nnz}$: the number of non-zero elements in the matrix

A sparsity of 0.99 means 99% of the matrix elements are zeros. In NLP, high sparsity is the norm because each document uses only a tiny fraction of the total vocabulary.

Let's quantify the sparsity in a typical document-term matrix:

In[22]:
Code
# Create full document-term matrix for our corpus
doc_term_matrix = np.array(
    [to_count_vector(doc, word_to_idx) for doc in tokenized_corpus]
)

# Calculate sparsity
total_elements = doc_term_matrix.size
non_zero_elements = np.count_nonzero(doc_term_matrix)
zero_elements = total_elements - non_zero_elements
sparsity = zero_elements / total_elements
Out[23]:
Console
Document-Term Matrix Statistics:
  Shape: (5, 23) (documents × vocabulary)
  Total elements: 115
  Non-zero elements: 35
  Zero elements: 80
  Sparsity: 69.6%

Memory comparison:
  Dense matrix: 920 bytes

Even in this small example, most of the matrix consists of zeros—each document only uses a fraction of the vocabulary. In real applications with vocabularies of 100,000+ words and millions of documents, sparsity typically exceeds 99%. Storing a dense matrix would require terabytes of memory for mostly zeros.

CSR Format

The Compressed Sparse Row (CSR) format stores only non-zero values along with their column indices and row boundaries. This is the standard format for document-term matrices because NLP operations typically process one document (row) at a time.

In[24]:
Code
from scipy import sparse

# Convert to CSR format
sparse_matrix = sparse.csr_matrix(doc_term_matrix)

# Examine the internal structure
data = sparse_matrix.data  # Non-zero values
indices = sparse_matrix.indices  # Column indices of non-zero values
indptr = sparse_matrix.indptr  # Row boundaries
Out[25]:
Console
CSR Format Internals:
--------------------------------------------------
data (non-zero values):     [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
indices (column positions): [ 0  4  7  8 10 11 15 18  0  5  8 10 11 15 18  9 10 11 12 17 20 22  1  2
  3  6 10 11 21  5 10 13 14 16 19]
indptr (row boundaries):    [ 0  8 15 22 29 35]

How to read this:
  Row 0: values at positions indptr[0]:indptr[1] = 0:8 in data and indices
         → columns [ 0  4  7  8 10 11 15 18], values [1. 1. 1. 1. 1. 1. 1. 1.]
  Row 1: columns [ 0  5  8 10 11 15 18], values [1. 1. 1. 2. 1. 1. 1.]
  Row 2: columns [ 9 10 11 12 17 20 22], values [1. 1. 1. 1. 1. 1. 1.]
Out[26]:
Visualization
Diagram showing a sparse matrix and its CSR representation with three arrays: data, indices, and indptr.
CSR (Compressed Sparse Row) format stores a sparse matrix using three arrays: data (non-zero values), indices (column positions), and indptr (row boundaries). This visualization shows how the original matrix maps to these compact arrays, eliminating storage of zero values.

The indptr array is the key to CSR. To find the non-zero values in row $i$, look at positions indptr[i] up to indptr[i+1] in both data and indices, as shown in the sketch below. This makes row slicing extremely efficient.
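The description above translates directly into code. A small sketch that recovers row 1 manually from the three CSR arrays and checks it against ordinary row indexing on sparse_matrix:

Code
# Manually slice row 1 (the second document) out of the CSR arrays
row = 1
start, end = indptr[row], indptr[row + 1]  # 8 and 15 for this matrix

print(indices[start:end])  # column indices of non-zero entries: [ 0  5  8 10 11 15 18]
print(data[start:end])     # the corresponding counts:           [1. 1. 1. 2. 1. 1. 1.]

# Same result via normal row indexing on the sparse matrix
print(sparse_matrix[row].toarray())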

Memory Savings

The memory advantage of sparse matrices grows dramatically with vocabulary size:

In[27]:
Code
# Simulate larger matrices to show memory scaling
def estimate_memory(num_docs, vocab_size, avg_words_per_doc):
    """Estimate memory usage for dense vs sparse representations."""
    # Dense: store every element (8 bytes for float64)
    dense_bytes = num_docs * vocab_size * 8

    # Sparse CSR: store only non-zero elements
    # data array: avg_words_per_doc * num_docs * 8 bytes
    # indices array: avg_words_per_doc * num_docs * 4 bytes (int32)
    # indptr array: (num_docs + 1) * 4 bytes
    nnz = avg_words_per_doc * num_docs
    sparse_bytes = nnz * 8 + nnz * 4 + (num_docs + 1) * 4

    return dense_bytes, sparse_bytes


# Test different scales
scales = [
    (100, 1000, 50),  # Small: 100 docs, 1K vocab
    (10000, 50000, 200),  # Medium: 10K docs, 50K vocab
    (1000000, 100000, 300),  # Large: 1M docs, 100K vocab
]
Out[28]:
Console
Memory Usage: Dense vs Sparse
======================================================================
Scale                          Dense          Sparse         Savings
----------------------------------------------------------------------
100 × 1,000                 800.0 KB         60.4 KB           92.4%
10,000 × 50,000               4.0 GB         24.0 MB           99.4%
1,000,000 × 100,000         800.0 GB          3.6 GB           99.5%

For a realistic corpus of 1 million documents with a 100,000-word vocabulary, sparse representation uses less than 1% of the memory required by dense storage. This is the difference between fitting in RAM and requiring distributed storage.

Out[29]:
Visualization
Line plot showing sparsity percentage increasing from about 75% at 1000 vocabulary words to over 99% at 100000 words.
Sparsity increases dramatically with vocabulary size. Even with 200 unique words per document, a 10,000-word vocabulary yields 98% sparsity. At 100,000 words, sparsity exceeds 99.8%. This explains why sparse matrix formats are essential for real-world NLP applications.

Using scikit-learn's CountVectorizer

While understanding the internals is valuable, in practice you'll use scikit-learn's CountVectorizer. It handles tokenization, vocabulary building, and sparse matrix creation in a single, optimized package.

In[30]:
Code
from sklearn.feature_extraction.text import CountVectorizer

# Create vectorizer with common settings
vectorizer = CountVectorizer(
    lowercase=True,  # Convert to lowercase
    min_df=1,  # Minimum document frequency
    max_df=1.0,  # Maximum document frequency (fraction)
    binary=False,  # Use counts, not binary
    ngram_range=(1, 1),  # Unigrams only (single words)
)

# Fit and transform the corpus
bow_matrix = vectorizer.fit_transform(corpus)

# Get the vocabulary
feature_names = vectorizer.get_feature_names_out()
Out[31]:
Console
CountVectorizer Results:
  Matrix shape: (5, 22)
  Matrix type: <class 'scipy.sparse._csr.csr_matrix'>
  Vocabulary size: 22

Vocabulary (feature names):
  ['ai', 'and', 'are', 'artificial', 'deep', 'industries', 'intelligence', 'is', 'language', 'learning', 'machine', 'natural', 'networks', 'neural', 'of', 'power', 'processing', 'subset', 'systems', 'techniques', 'transforming', 'uses']

Sparse matrix info:
  Non-zero elements: 33
  Sparsity: 70.0%

CountVectorizer automatically handles tokenization, lowercasing, and vocabulary construction. The result is a sparse CSR matrix ready for machine learning, with no manual preprocessing required. Note that the vocabulary has 22 words rather than the 23 we built by hand: the default token pattern only keeps tokens of two or more word characters, so the single-letter word "a" is dropped.
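If you want to keep single-character tokens, you can override the default token pattern. A minimal sketch, reusing the corpus from above:

Code
# Relax the default token pattern so single-character tokens like "a" are kept
vectorizer_all = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow_all = vectorizer_all.fit_transform(corpus)

print(len(vectorizer_all.get_feature_names_out()))  # 23 words, now including "a"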

In[32]:
Code
# Convert to dense for visualization (only for small matrices!)
dense_matrix = bow_matrix.toarray()
Out[33]:
Visualization
Heatmap showing the document-term matrix with 5 documents and approximately 20 vocabulary words, with cell colors indicating word counts.
Document-term matrix produced by scikit-learn's CountVectorizer. Each row represents one of our five sample documents about machine learning, and each column represents a word in the automatically constructed vocabulary. The vectorizer handled tokenization, lowercasing, and vocabulary construction automatically.

Key Parameters

CountVectorizer offers extensive customization:

In[34]:
Code
# Demonstrate key parameters
vectorizers = {
    "default": CountVectorizer(),
    "binary": CountVectorizer(binary=True),
    "min_df=2": CountVectorizer(min_df=2),
    "max_df=0.8": CountVectorizer(max_df=0.8),
    "bigrams": CountVectorizer(ngram_range=(1, 2)),
}

results = {}
for name, vec in vectorizers.items():
    matrix = vec.fit_transform(corpus)
    results[name] = {
        "vocab_size": len(vec.get_feature_names_out()),
        "nnz": matrix.nnz,
        "sample_features": list(vec.get_feature_names_out()[:5]),
    }
Out[35]:
Console
CountVectorizer Parameter Comparison:
======================================================================

default:
  Vocabulary size: 22
  Non-zero elements: 33
  Sample features: ['ai', 'and', 'are', 'artificial', 'deep']

binary:
  Vocabulary size: 22
  Non-zero elements: 33
  Sample features: ['ai', 'and', 'are', 'artificial', 'deep']

min_df=2:
  Vocabulary size: 6
  Non-zero elements: 17
  Sample features: ['deep', 'is', 'learning', 'machine', 'of']

max_df=0.8:
  Vocabulary size: 21
  Non-zero elements: 28
  Sample features: ['ai', 'and', 'are', 'artificial', 'deep']

bigrams:
  Vocabulary size: 44
  Non-zero elements: 62
  Sample features: ['ai', 'ai and', 'and', 'and machine', 'are']

Each parameter setting produces different results. The binary setting keeps the same vocabulary but changes counts to presence indicators. Setting min_df=2 dramatically reduces the vocabulary by removing rare words, while max_df=0.8 drops only the ubiquitous "learning". The ngram_range=(1, 2) setting includes both unigrams and bigrams, doubling the vocabulary here (from 22 to 44 terms) while capturing meaningful two-word phrases like "machine learning" and "deep learning".

The Loss of Word Order

Bag of Words discards all structural information. "The cat chased the dog" and "The dog chased the cat" produce identical vectors, despite having opposite meanings.

In[36]:
Code
# Demonstrate word order loss
sentences = [
    "The cat chased the dog",
    "The dog chased the cat",
    "Dog the cat the chased",  # Nonsense with same words
]

vec = CountVectorizer()
vectors = vec.fit_transform(sentences)
Out[37]:
Console
Word Order Demonstration:
--------------------------------------------------
Vocabulary: ['cat', 'chased', 'dog', 'the']

'The cat chased the dog'
  Vector: [1 1 1 2]

'The dog chased the cat'
  Vector: [1 1 1 2]

'Dog the cat the chased'
  Vector: [1 1 1 2]

Sentence 1 == Sentence 2: True
Sentence 1 == Sentence 3: True

All three sentences produce IDENTICAL vectors!

This is the fundamental limitation of Bag of Words. It cannot distinguish between:

  • Active and passive voice: "John hit Mary" vs "Mary was hit by John"
  • Negation scope: "I love this movie" vs "I don't love this movie" have nearly identical vectors
  • Questions and statements: "Is this good?" vs "This is good"
  • Any semantic difference that depends on word order
Out[38]:
Visualization
Diagram showing three different sentences mapping to the same bag of words vector representation.
Visualization of word order loss in Bag of Words. Three sentences with completely different meanings (or no meaning at all) map to identical vector representations because BoW only counts word occurrences, ignoring sequence and structure.
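The negation case from the list above behaves the same way. With CountVectorizer's default tokenization, "don't" contributes only a "don" token, so the two sentences differ in just one dimension. A minimal sketch:

Code
# Negation barely changes the bag-of-words vector
neg_sentences = [
    "I love this movie",
    "I don't love this movie",
]

neg_vec = CountVectorizer()
neg_matrix = neg_vec.fit_transform(neg_sentences)

print(neg_vec.get_feature_names_out())  # ['don' 'love' 'movie' 'this']
print(neg_matrix.toarray())             # rows differ only in the 'don' column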

When Bag of Words Works

Despite its limitations, Bag of Words remains useful for many tasks:

  • Document classification: For categorizing news articles, spam detection, or sentiment analysis on long texts, word presence often matters more than order. A movie review containing "terrible", "boring", and "waste" is likely negative regardless of how those words are arranged.
  • Information retrieval: Search engines match query terms against document terms. The classic TF-IDF weighting (covered in later chapters) builds directly on the Bag of Words foundation.
  • Topic modeling: Algorithms like Latent Dirichlet Allocation (LDA) assume documents are mixtures of topics, each characterized by word distributions. The bag-of-words assumption is baked into the model.
  • Baseline models: Before deploying complex neural networks, a BoW model provides a sanity check. If a simple model achieves 90% accuracy, you know the task is learnable from word frequencies alone.
In[39]:
Code
# Quick demonstration: sentiment classification with BoW
from sklearn.naive_bayes import MultinomialNB

# Simple sentiment dataset
reviews = [
    "This movie was fantastic and amazing",
    "Absolutely loved this film, brilliant",
    "Great movie, highly recommend",
    "Terrible waste of time, boring",
    "Awful movie, complete disaster",
    "Worst film I have ever seen",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# Vectorize and train
vec = CountVectorizer()
X = vec.fit_transform(reviews)
clf = MultinomialNB()
clf.fit(X, labels)

# Test on new reviews
test_reviews = [
    "This was a brilliant and fantastic experience",
    "Complete waste of time, terrible",
]
test_X = vec.transform(test_reviews)
predictions = clf.predict(test_X)
probabilities = clf.predict_proba(test_X)
Out[40]:
Console
Sentiment Classification with Bag of Words:
--------------------------------------------------
Review: 'This was a brilliant and fantastic experience'
  Prediction: Positive (confidence: 97.7%)

Review: 'Complete waste of time, terrible'
  Prediction: Negative (confidence: 97.3%)

Even this trivial example shows BoW capturing sentiment through word presence. Words like "brilliant", "fantastic", and "terrible" carry strong sentiment signals regardless of context.

Limitations and Impact

Bag of Words has fundamental limitations that motivated the development of more sophisticated representations:

  • No word order: As demonstrated, BoW cannot distinguish sentences with different word arrangements.
  • No semantics: "Good" and "excellent" are treated as completely unrelated words, even though they're synonyms. Similarly, "bank" (financial institution) and "bank" (river edge) are conflated.
  • Vocabulary explosion: Adding n-grams helps capture some phrases but causes vocabulary size to explode. Bigrams alone can multiply vocabulary by 10-100x.
  • Sparsity: High-dimensional sparse vectors are inefficient for neural networks, which prefer dense, lower-dimensional inputs.
  • Out-of-vocabulary words: Words not seen during training have no representation. A model trained on formal text may fail on social media slang.

These limitations drove the development of word embeddings (Word2Vec, GloVe) and eventually transformer-based models that learn dense, contextual representations. Yet Bag of Words remains the conceptual starting point. Understanding document-term matrices, vocabulary construction, and sparse representations provides the foundation for understanding more advanced techniques.

Key Functions and Parameters

When working with Bag of Words representations, CountVectorizer from scikit-learn is the primary tool. Here are its most important parameters:

CountVectorizer(lowercase, min_df, max_df, binary, ngram_range, stop_words, max_features)

  • lowercase (default: True): Convert all text to lowercase before tokenizing. Set to False if case carries meaning (e.g., proper nouns, acronyms).
  • min_df: Minimum document frequency threshold. If an integer, the word must appear in at least this many documents. If a float between 0.0 and 1.0, represents a proportion of documents. Use min_df=2 or higher to remove rare words and typos.
  • max_df: Maximum document frequency threshold. Words appearing in more than this fraction of documents are excluded. Use max_df=0.9 to remove extremely common words that provide no discriminative power.
  • binary (default: False): If True, all non-zero counts are set to 1. Use binary representation when word presence matters more than frequency.
  • ngram_range (default: (1, 1)): Tuple specifying the range of n-gram sizes to include. (1, 2) includes unigrams and bigrams, capturing phrases like "machine learning". Higher values dramatically increase vocabulary size.
  • stop_words: Either 'english' for built-in stop word list, or a custom list of words to exclude. Removes common words like "the", "is", "and" that typically add noise.
  • max_features: Limit vocabulary to the top N most frequent terms. Useful for controlling dimensionality in very large corpora.
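In practice these parameters are combined. The sketch below shows one plausible configuration for a larger corpus; the specific thresholds are illustrative assumptions, and it is applied to this chapter's small corpus only so that it runs end to end:

Code
# One plausible pruning setup (thresholds are illustrative, not prescriptive)
pruned_vectorizer = CountVectorizer(
    lowercase=True,
    stop_words="english",  # drop common English function words
    min_df=2,              # keep words appearing in at least 2 documents
    max_df=0.9,            # drop words appearing in more than 90% of documents
    ngram_range=(1, 2),    # unigrams and bigrams
    max_features=5000,     # cap the vocabulary at the 5,000 most frequent terms
)

X = pruned_vectorizer.fit_transform(corpus)
print(X.shape)
print(pruned_vectorizer.get_feature_names_out())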

Summary

Bag of Words transforms text into numerical vectors by counting word occurrences, ignoring grammar and word order entirely. Despite this brutal simplification, it powers effective text classification, information retrieval, and topic modeling systems.

Key takeaways:

  • Vocabulary construction extracts unique words from a corpus, mapping each to a vector dimension
  • Document-term matrices represent documents as rows and vocabulary words as columns, with counts (or binary indicators) as values
  • Vocabulary pruning with min_df and max_df removes uninformative rare and common words
  • Sparse matrices (CSR format) efficiently store the mostly-zero document-term matrices, reducing memory by 99%+ for realistic corpora
  • scikit-learn's CountVectorizer handles tokenization, vocabulary building, and sparse matrix creation in one optimized package
  • Word order loss is the fundamental limitation: "The cat chased the dog" and "The dog chased the cat" produce identical vectors

In the next chapters, we'll extend these ideas with n-grams to capture some word sequences, and with TF-IDF weighting to emphasize discriminative terms over common ones.
