Bag of Words: Document-Term Matrices, Vocabulary Construction & Sparse Representations

Michael Brenndoerfer · Updated March 24, 2025 · 33 min read

Learn how the Bag of Words model transforms text into numerical vectors through word counting, vocabulary construction, and sparse matrix storage. Master CountVectorizer and understand when this foundational NLP technique works best.

Bag of Words

How do you teach a computer to understand text? The first step is deceptively simple: count words. The Bag of Words (BoW) model transforms documents into numerical vectors by tallying how often each word appears. This representation ignores grammar, word order, and context entirely. It treats a document as nothing more than a collection of words tossed into a bag, hence the name.

Despite its simplicity, Bag of Words powered text classification, spam detection, and information retrieval for decades. It remains a surprisingly effective baseline for many NLP tasks. Understanding BoW is essential because it introduces core concepts (vocabulary construction, document-term matrices, and sparse representations) that persist throughout modern NLP.

This chapter walks you through building a Bag of Words representation from scratch. You'll learn how to construct vocabularies, create document-term matrices, handle the explosion of dimensionality with sparse matrices, and understand when this simple approach works and when it fails.

The Core Idea

Consider three short documents:

  1. "The cat sat on the mat"
  2. "The dog sat on the log"
  3. "The cat and the dog"

To represent these numerically, we first build a vocabulary: a list of all unique words across all documents. Then we count how many times each vocabulary word appears in each document.

Bag of Words

The Bag of Words model represents text as an unordered collection of words, disregarding grammar and word order but keeping track of word frequency. Each document becomes a vector where each dimension corresponds to a vocabulary word, and the value indicates how often that word appears.

Let's implement this step by step:

In[2]:
Code
# Our corpus of documents
documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat and the dog",
]

# Step 1: Tokenize and lowercase
tokenized_docs = [doc.lower().split() for doc in documents]

# Step 2: Build vocabulary from all documents
all_words = []
for doc in tokenized_docs:
    all_words.extend(doc)

vocabulary = sorted(set(all_words))
word_to_idx = {word: idx for idx, word in enumerate(vocabulary)}
Out[3]:
Console
Tokenized documents:
  Doc 1: ['the', 'cat', 'sat', 'on', 'the', 'mat']
  Doc 2: ['the', 'dog', 'sat', 'on', 'the', 'log']
  Doc 3: ['the', 'cat', 'and', 'the', 'dog']

Vocabulary (8 words):
  ['and', 'cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']

Word to index mapping:
  'and' → 0
  'cat' → 1
  'dog' → 2
  'log' → 3
  'mat' → 4
  'on' → 5
  'sat' → 6
  'the' → 7

The vocabulary contains all unique words found across the documents. Each word maps to a unique index that will become a dimension in our vector representation.

In[4]:
Code
import numpy as np


# Step 3: Create document vectors by counting word occurrences
def document_to_vector(doc_tokens, word_to_idx):
    """Convert a tokenized document to a count vector."""
    vector = np.zeros(len(word_to_idx))
    for token in doc_tokens:
        if token in word_to_idx:
            vector[word_to_idx[token]] += 1
    return vector


# Create the document-term matrix
doc_term_matrix = np.array(
    [document_to_vector(doc, word_to_idx) for doc in tokenized_docs]
)
Out[5]:
Console
Document-Term Matrix:
------------------------------------------------------------
Doc    and    cat    dog    log    mat     on    sat    the
------------------------------------------------------------
  1      0      1      0      0      1      1      1      2
  2      0      0      1      1      0      1      1      2
  3      1      1      1      0      0      0      0      2

Each row represents a document, and each column represents a word from our vocabulary. Formally, we can express the document-term matrix as $\mathbf{M} \in \mathbb{R}^{m \times n}$, where element $M_{ij}$ is the count of vocabulary word $j$ in document $i$. Here, $m$ is the number of documents and $n$ is the vocabulary size.

Out[6]:
Visualization
Heatmap showing document-term matrix with documents as rows and vocabulary words as columns, cell colors indicating word counts.
Document-term matrix visualization for three sample documents. Each row represents a document, each column a vocabulary word. Cell intensity indicates word frequency, with darker cells showing higher counts. Notice how 'the' appears twice in every document, while most words appear only once or not at all.

Look at the matrix structure. Document 1 has two occurrences of "the" (the cat... the mat), reflected in the count of 2. Document 3 shares vocabulary with both other documents, which we can see from the overlapping non-zero entries.
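Because every word has a fixed column index, individual counts can be read directly out of the matrix. A quick check using the doc_term_matrix and word_to_idx built above:

Code
# Read individual counts via the word-to-index mapping
print(doc_term_matrix.shape)                    # (3, 8): 3 documents, 8 vocabulary words
print(doc_term_matrix[0, word_to_idx["the"]])   # 2.0: "the" appears twice in Document 1
print(doc_term_matrix[2, word_to_idx["sat"]])   # 0.0: "sat" does not appear in Document 3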

Document Similarity from Word Counts

Once documents become vectors, we can measure their similarity using cosine similarity. This metric computes the cosine of the angle between two vectors, effectively measuring how similar their directions are in high-dimensional space, regardless of their magnitudes.

Given two document vectors $\mathbf{a}$ and $\mathbf{b}$, cosine similarity is defined as:

$$\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}}$$

where:

  • $\mathbf{a} \cdot \mathbf{b}$: the dot product of vectors $\mathbf{a}$ and $\mathbf{b}$, computed as $\sum_{i=1}^{n} a_i b_i$
  • $\|\mathbf{a}\|$: the Euclidean norm (magnitude) of vector $\mathbf{a}$, computed as $\sqrt{\sum_{i=1}^{n} a_i^2}$
  • $n$: the vocabulary size (number of dimensions in each vector)
  • $a_i, b_i$: the word counts at position $i$ in vectors $\mathbf{a}$ and $\mathbf{b}$ respectively

The result ranges from 0 (completely different, no shared vocabulary) to 1 (identical word distributions). By normalizing by vector magnitudes, cosine similarity ensures that documents with similar word proportions are considered similar even if one document is much longer than the other.

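Before reaching for scikit-learn, it helps to compute one similarity by hand. The sketch below applies the formula directly to the first two rows of doc_term_matrix, using NumPy's dot product and Euclidean norm:

Code
# Cosine similarity between Document 1 and Document 2, straight from the formula
a = doc_term_matrix[0]  # "The cat sat on the mat"
b = doc_term_matrix[1]  # "The dog sat on the log"

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 0.75: shared counts of "the", "sat", and "on" drive the similarity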
In[7]:
Code
from sklearn.metrics.pairwise import cosine_similarity

# Compute pairwise cosine similarity
similarity_matrix = cosine_similarity(doc_term_matrix)
Out[8]:
Visualization
Heatmap showing pairwise cosine similarity between three documents, with values ranging from 0.67 to 1.0.
Cosine similarity matrix between our three sample documents. Documents 1 and 2 share 'the', 'sat', and 'on', giving them the highest off-diagonal similarity (0.75). Document 3 shares 'the', 'cat', and 'dog' with the others but lacks the verb phrase, resulting in a lower similarity of 0.67 with each. The diagonal shows perfect self-similarity (1.0).

Documents 1 and 2 are most similar because they share the phrase structure "The [animal] sat on the [object]". Document 3, with its different structure, shows lower similarity to both. This demonstrates how BoW captures topical similarity through shared vocabulary, even though it ignores word order.

Vocabulary Construction

Building a vocabulary seems straightforward, but real-world text introduces complications. How do you handle punctuation? What about rare words that appear only once? What about extremely common words like "the" that appear everywhere?

From Corpus to Vocabulary

A corpus is a collection of documents. The vocabulary is the set of unique terms extracted from this corpus. Let's work with a slightly more realistic example:

In[9]:
Code
corpus = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning is a subset of machine learning.",
    "Natural language processing uses machine learning techniques.",
    "AI and machine learning are transforming industries.",
    "Neural networks power deep learning systems.",
]


def tokenize(text):
    """Simple tokenization: lowercase and split on whitespace/punctuation."""
    import re

    # Convert to lowercase and extract word tokens
    tokens = re.findall(r"\b[a-z]+\b", text.lower())
    return tokens


# Tokenize all documents
tokenized_corpus = [tokenize(doc) for doc in corpus]

# Build vocabulary
all_tokens = []
for doc in tokenized_corpus:
    all_tokens.extend(doc)

vocabulary = sorted(set(all_tokens))
vocab_size = len(vocabulary)
Out[10]:
Console
Corpus size: 5 documents
Total tokens: 36
Vocabulary size: 23 unique words

Vocabulary:
  a, ai, and, are, artificial
  deep, industries, intelligence, is, language
  learning, machine, natural, networks, neural
  of, power, processing, subset, systems
  techniques, transforming, uses

The vocabulary is significantly smaller than the total token count because many words repeat across documents. This compression—from raw tokens to unique vocabulary terms—is a key characteristic of text data.

Word Frequency Analysis

Before finalizing the vocabulary, examining word frequencies helps identify potential issues:

In[11]:
Code
from collections import Counter

# Count word frequencies across the corpus
word_counts = Counter(all_tokens)

# Sort by frequency
sorted_counts = word_counts.most_common()
Out[12]:
Console
Word Frequencies (descending):
-----------------------------------
  learning         6  ██████
  machine          4  ████
  is               2  ██
  a                2  ██
  subset           2  ██
  of               2  ██
  deep             2  ██
  artificial       1  █
  intelligence     1  █
  natural          1  █
  language         1  █
  processing       1  █
  uses             1  █
  techniques       1  █
  ai               1  █
  and              1  █
  are              1  █
  transforming     1  █
  industries       1  █
  neural           1  █
  networks         1  █
  power            1  █
  systems          1  █
Out[13]:
Visualization
Horizontal bar chart showing word frequencies, with 'learning' (6) and 'machine' (4) having the highest counts and many words appearing only once.
Word frequency distribution in our sample corpus. Common words like 'learning' and 'machine' dominate, appearing in multiple documents. Many words appear only once (hapax legomena), contributing to vocabulary size without adding much discriminative power.

The word "learning" appears 5 times, "machine" appears 4 times, but many words appear only once. This pattern, a few high-frequency words and many rare words, follows Zipf's Law and is characteristic of natural language.

Vocabulary Pruning

Raw vocabularies from large corpora can contain millions of unique words. Many of these are noise: typos, rare technical terms, or words that appear in only one document. Vocabulary pruning removes uninformative terms to reduce dimensionality and improve model performance.

Minimum Document Frequency

The document frequency of a word $w$, denoted $\text{df}(w)$, is the number of documents in which that word appears at least once:

$$\text{df}(w) = |\{d \in D : w \in d\}|$$

where:

  • $D$: the corpus (collection of all documents)
  • $d$: an individual document in the corpus
  • $w \in d$: indicates that word $w$ appears in document $d$
  • $|\cdot|$: the number of elements in the set

Words that appear in very few documents provide little discriminative power and may represent noise. The min_df parameter sets a threshold: words must appear in at least this many documents (or this fraction of documents) to be included.

In[14]:
Code
def compute_document_frequency(tokenized_corpus):
    """Count how many documents each word appears in."""
    doc_freq = Counter()
    for doc in tokenized_corpus:
        # Count each word once per document
        unique_words = set(doc)
        doc_freq.update(unique_words)
    return doc_freq


doc_freq = compute_document_frequency(tokenized_corpus)

# Apply min_df threshold
min_df = 2  # Word must appear in at least 2 documents
filtered_vocab_min = {word for word, freq in doc_freq.items() if freq >= min_df}
Out[15]:
Console
Document Frequency Analysis:
---------------------------------------------
Word                   Doc Freq Keep (min_df=2)
---------------------------------------------
a                             2               ✓
ai                            1               ✗
and                           1               ✗
are                           1               ✗
artificial                    1               ✗
deep                          2               ✓
industries                    1               ✗
intelligence                  1               ✗
is                            2               ✓
language                      1               ✗
learning                      5               ✓
machine                       4               ✓
natural                       1               ✗
networks                      1               ✗
neural                        1               ✗
of                            2               ✓
power                         1               ✗
processing                    1               ✗
subset                        2               ✓
systems                       1               ✗
techniques                    1               ✗
transforming                  1               ✗
uses                          1               ✗
---------------------------------------------
Original vocabulary: 23 words
After min_df=2:      7 words

Words like "artificial", "industries", and "networks" appear in only one document. Removing them reduces our vocabulary while keeping words that appear across multiple documents.

Maximum Document Frequency

At the other extreme, words that appear in almost every document provide no discriminative power. The word "the" might appear in 95% of documents, making it useless for distinguishing between them. The max_df parameter sets an upper threshold.

In[16]:
Code
# Apply max_df threshold (as a fraction of documents)
max_df = 0.8  # Word must appear in at most 80% of documents
num_docs = len(tokenized_corpus)
max_doc_count = int(max_df * num_docs)

filtered_vocab_max = {
    word for word, freq in doc_freq.items() if freq <= max_doc_count
}

# Combined filtering
filtered_vocab = {
    word
    for word, freq in doc_freq.items()
    if freq >= min_df and freq <= max_doc_count
}
Out[17]:
Console
Total documents: 5
max_df = 0.8 → max document count = 4

Words appearing in >80% of documents:
  'learning' appears in 5/5 documents

Vocabulary after min_df=2, max_df=0.8: 6 words

In our small corpus, "learning" appears in all 5 documents (100%), exceeding our 80% threshold. In real applications, you might filter out words appearing in more than 90% of documents to remove uninformative terms like "the", "is", and "a".

Vocabulary Reduction with min_df

How aggressively should you prune? Let's visualize how vocabulary size changes as we increase the min_df threshold:

In[18]:
Code
# Calculate vocabulary size at different min_df thresholds
min_df_values = range(1, num_docs + 1)
vocab_sizes = []

for threshold in min_df_values:
    remaining = sum(1 for freq in doc_freq.values() if freq >= threshold)
    vocab_sizes.append(remaining)
Out[19]:
Visualization
Line plot showing vocabulary size decreasing as min_df threshold increases, reaching 1 word at min_df=5.
Vocabulary size reduction as minimum document frequency threshold increases. At min_df=1, all unique words are included. Raising the threshold to min_df=2 removes words appearing in only one document, cutting vocabulary nearly in half. By min_df=5 (all documents), only 'learning' remains.

The steep drop from min_df=1 to min_df=2 is typical. In real corpora, a large fraction of words appear only once (called hapax legomena). Removing these rare words often improves model performance by reducing noise without losing much signal.

Count vs. Binary Representations

So far, we've counted word occurrences. But sometimes presence matters more than frequency. In a binary representation, each cell contains 1 if the word appears in the document and 0 otherwise, regardless of how many times it appears.

In[20]:
Code
# Create word-to-index mapping for our vocabulary
word_to_idx = {word: idx for idx, word in enumerate(sorted(vocabulary))}


# Count representation
def to_count_vector(tokens, word_to_idx):
    vector = np.zeros(len(word_to_idx))
    for token in tokens:
        if token in word_to_idx:
            vector[word_to_idx[token]] += 1
    return vector


# Binary representation
def to_binary_vector(tokens, word_to_idx):
    vector = np.zeros(len(word_to_idx))
    for token in tokens:
        if token in word_to_idx:
            vector[word_to_idx[token]] = 1  # Set to 1, don't increment
    return vector


# Compare representations for a sample document
# (Document 2: "Deep learning is a subset of machine learning.")
sample_doc = tokenized_corpus[1]
count_vec = to_count_vector(sample_doc, word_to_idx)
binary_vec = to_binary_vector(sample_doc, word_to_idx)
Out[21]:
Console
Sample document: 'Deep learning is a subset of machine learning.'
Tokens: ['deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning']

Comparison of representations:
--------------------------------------------------
Word                 Count     Binary
--------------------------------------------------
a                        1          1
deep                     1          1
is                       1          1
learning                 2          1 ← differs
machine                  1          1
of                       1          1
subset                   1          1

Notice that "learning" appears twice in this document. The count representation records 2, while the binary representation records 1. Which is better? It depends on the task. For document classification, binary representations often work as well as counts. For tasks where word frequency carries meaning (like authorship attribution), counts are more informative.

Sparse Matrix Representation

Real-world vocabularies contain tens of thousands of words, yet most documents use only a small fraction. A news article with 500 words might touch only 200 unique vocabulary terms out of 50,000. Storing all those zeros wastes memory.

Sparse Matrix

A sparse matrix is a matrix where most elements are zero. Sparse matrix formats store only the non-zero values and their positions, dramatically reducing memory usage for high-dimensional, mostly-empty data like document-term matrices.

The Sparsity Problem

Sparsity measures the proportion of zero elements in a matrix. For a document-term matrix $\mathbf{M}$ with $m$ documents (rows) and $n$ vocabulary words (columns), sparsity is defined as:

$$\text{sparsity} = \frac{\text{number of zero elements}}{\text{total elements}} = \frac{mn - \text{nnz}}{mn} = 1 - \frac{\text{nnz}}{mn}$$

where:

  • $m$: the number of documents (rows in the matrix)
  • $n$: the vocabulary size (columns in the matrix)
  • $\text{nnz}$: the number of non-zero elements in the matrix

A sparsity of 0.99 means 99% of the matrix elements are zeros. In NLP, high sparsity is the norm because each document uses only a tiny fraction of the total vocabulary.

Let's quantify the sparsity in a typical document-term matrix:

In[22]:
Code
# Create full document-term matrix for our corpus
doc_term_matrix = np.array(
    [to_count_vector(doc, word_to_idx) for doc in tokenized_corpus]
)

# Calculate sparsity
total_elements = doc_term_matrix.size
non_zero_elements = np.count_nonzero(doc_term_matrix)
zero_elements = total_elements - non_zero_elements
sparsity = zero_elements / total_elements
Out[23]:
Console
Document-Term Matrix Statistics:
  Shape: (5, 23) (documents × vocabulary)
  Total elements: 115
  Non-zero elements: 35
  Zero elements: 80
  Sparsity: 69.6%

Memory comparison:
  Dense matrix: 920 bytes

Even in this small example, most of the matrix consists of zeros—each document only uses a fraction of the vocabulary. In real applications with vocabularies of 100,000+ words and millions of documents, sparsity typically exceeds 99%. Storing a dense matrix would require terabytes of memory for mostly zeros.

CSR Format

The Compressed Sparse Row (CSR) format stores only non-zero values along with their column indices and row boundaries. This is the standard format for document-term matrices because NLP operations typically process one document (row) at a time.

In[24]:
Code
from scipy import sparse

# Convert to CSR format
sparse_matrix = sparse.csr_matrix(doc_term_matrix)

# Examine the internal structure
data = sparse_matrix.data  # Non-zero values
indices = sparse_matrix.indices  # Column indices of non-zero values
indptr = sparse_matrix.indptr  # Row boundaries
Out[25]:
Console
CSR Format Internals:
--------------------------------------------------
data (non-zero values):     [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
indices (column positions): [ 0  4  7  8 10 11 15 18  0  5  8 10 11 15 18  9 10 11 12 17 20 22  1  2
  3  6 10 11 21  5 10 13 14 16 19]
indptr (row boundaries):    [ 0  8 15 22 29 35]

How to read this:
  Row 0: values at positions indptr[0]:indptr[1] = 0:8 in data and indices
         → columns [ 0  4  7  8 10 11 15 18], values [1. 1. 1. 1. 1. 1. 1. 1.]
  Row 1: columns [ 0  5  8 10 11 15 18], values [1. 1. 1. 2. 1. 1. 1.]
  Row 2: columns [ 9 10 11 12 17 20 22], values [1. 1. 1. 1. 1. 1. 1.]
Out[26]:
Visualization
Diagram showing a sparse matrix and its CSR representation with three arrays: data, indices, and indptr.
CSR (Compressed Sparse Row) format stores a sparse matrix using three arrays: data (non-zero values), indices (column positions), and indptr (row boundaries). This visualization shows how the original matrix maps to these compact arrays, eliminating storage of zero values.

The indptr array is the key to CSR. To find the non-zero values in row $i$, look at positions indptr[i] up to indptr[i+1] in both data and indices, as shown in the sketch below. This makes row slicing extremely efficient.
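The description above translates directly into code. A small sketch that recovers row 1 manually from the three CSR arrays and checks it against ordinary row indexing on sparse_matrix:

Code
# Manually slice row 1 (the second document) out of the CSR arrays
row = 1
start, end = indptr[row], indptr[row + 1]  # 8 and 15 for this matrix

print(indices[start:end])  # column indices of non-zero entries: [ 0  5  8 10 11 15 18]
print(data[start:end])     # the corresponding counts:           [1. 1. 1. 2. 1. 1. 1.]

# Same result via normal row indexing on the sparse matrix
print(sparse_matrix[row].toarray())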

Memory Savings

The memory advantage of sparse matrices grows dramatically with vocabulary size:

In[27]:
Code
# Simulate larger matrices to show memory scaling
def estimate_memory(num_docs, vocab_size, avg_words_per_doc):
    """Estimate memory usage for dense vs sparse representations."""
    # Dense: store every element (8 bytes for float64)
    dense_bytes = num_docs * vocab_size * 8

    # Sparse CSR: store only non-zero elements
    # data array: avg_words_per_doc * num_docs * 8 bytes
    # indices array: avg_words_per_doc * num_docs * 4 bytes (int32)
    # indptr array: (num_docs + 1) * 4 bytes
    nnz = avg_words_per_doc * num_docs
    sparse_bytes = nnz * 8 + nnz * 4 + (num_docs + 1) * 4

    return dense_bytes, sparse_bytes


# Test different scales
scales = [
    (100, 1000, 50),  # Small: 100 docs, 1K vocab
    (10000, 50000, 200),  # Medium: 10K docs, 50K vocab
    (1000000, 100000, 300),  # Large: 1M docs, 100K vocab
]
Out[28]:
Console
Memory Usage: Dense vs Sparse
======================================================================
Scale                          Dense          Sparse         Savings
----------------------------------------------------------------------
100 × 1,000                 800.0 KB         60.4 KB           92.4%
10,000 × 50,000               4.0 GB         24.0 MB           99.4%
1,000,000 × 100,000         800.0 GB          3.6 GB           99.5%

For a realistic corpus of 1 million documents with a 100,000-word vocabulary, sparse representation uses less than 1% of the memory required by dense storage. This is the difference between fitting in RAM and requiring distributed storage.

Out[29]:
Visualization
Line plot showing sparsity percentage increasing from about 75% at 1000 vocabulary words to over 99% at 100000 words.
Sparsity increases dramatically with vocabulary size. Even with 200 unique words per document, a 10,000-word vocabulary yields 98% sparsity. At 100,000 words, sparsity exceeds 99.8%. This explains why sparse matrix formats are essential for real-world NLP applications.

Using scikit-learn's CountVectorizer

While understanding the internals is valuable, in practice you'll use scikit-learn's CountVectorizer. It handles tokenization, vocabulary building, and sparse matrix creation in a single, optimized package.

In[30]:
Code
from sklearn.feature_extraction.text import CountVectorizer

# Create vectorizer with common settings
vectorizer = CountVectorizer(
    lowercase=True,  # Convert to lowercase
    min_df=1,  # Minimum document frequency
    max_df=1.0,  # Maximum document frequency (fraction)
    binary=False,  # Use counts, not binary
    ngram_range=(1, 1),  # Unigrams only (single words)
)

# Fit and transform the corpus
bow_matrix = vectorizer.fit_transform(corpus)

# Get the vocabulary
feature_names = vectorizer.get_feature_names_out()
Out[31]:
Console
CountVectorizer Results:
  Matrix shape: (5, 22)
  Matrix type: <class 'scipy.sparse._csr.csr_matrix'>
  Vocabulary size: 22

Vocabulary (feature names):
  ['ai', 'and', 'are', 'artificial', 'deep', 'industries', 'intelligence', 'is', 'language', 'learning', 'machine', 'natural', 'networks', 'neural', 'of', 'power', 'processing', 'subset', 'systems', 'techniques', 'transforming', 'uses']

Sparse matrix info:
  Non-zero elements: 33
  Sparsity: 70.0%

CountVectorizer automatically handles tokenization, lowercasing, and vocabulary construction. The result is a sparse CSR matrix ready for machine learning, with no manual preprocessing required. Note that the vocabulary has 22 words rather than the 23 we built by hand: the default token pattern only keeps tokens of two or more word characters, so the single-letter word "a" is dropped.
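If you want to keep single-character tokens, you can override the default token pattern. A minimal sketch, reusing the corpus from above:

Code
# Relax the default token pattern so single-character tokens like "a" are kept
vectorizer_all = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow_all = vectorizer_all.fit_transform(corpus)

print(len(vectorizer_all.get_feature_names_out()))  # 23 words, now including "a"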

In[32]:
Code
# Convert to dense for visualization (only for small matrices!)
dense_matrix = bow_matrix.toarray()
Out[33]:
Visualization
Heatmap showing the document-term matrix with 5 documents and approximately 20 vocabulary words, with cell colors indicating word counts.
Document-term matrix produced by scikit-learn's CountVectorizer. Each row represents one of our five sample documents about machine learning, and each column represents a word in the automatically constructed vocabulary. The vectorizer handled tokenization, lowercasing, and vocabulary construction automatically.

Key Parameters

CountVectorizer offers extensive customization:

In[34]:
Code
# Demonstrate key parameters
vectorizers = {
    "default": CountVectorizer(),
    "binary": CountVectorizer(binary=True),
    "min_df=2": CountVectorizer(min_df=2),
    "max_df=0.8": CountVectorizer(max_df=0.8),
    "bigrams": CountVectorizer(ngram_range=(1, 2)),
}

results = {}
for name, vec in vectorizers.items():
    matrix = vec.fit_transform(corpus)
    results[name] = {
        "vocab_size": len(vec.get_feature_names_out()),
        "nnz": matrix.nnz,
        "sample_features": list(vec.get_feature_names_out()[:5]),
    }
Out[35]:
Console
CountVectorizer Parameter Comparison:
======================================================================

default:
  Vocabulary size: 22
  Non-zero elements: 33
  Sample features: ['ai', 'and', 'are', 'artificial', 'deep']

binary:
  Vocabulary size: 22
  Non-zero elements: 33
  Sample features: ['ai', 'and', 'are', 'artificial', 'deep']

min_df=2:
  Vocabulary size: 6
  Non-zero elements: 17
  Sample features: ['deep', 'is', 'learning', 'machine', 'of']

max_df=0.8:
  Vocabulary size: 21
  Non-zero elements: 28
  Sample features: ['ai', 'and', 'are', 'artificial', 'deep']

bigrams:
  Vocabulary size: 44
  Non-zero elements: 62
  Sample features: ['ai', 'ai and', 'and', 'and machine', 'are']

Each parameter setting produces different results. The binary setting keeps the same vocabulary but changes counts to presence indicators. Setting min_df=2 dramatically reduces the vocabulary by removing rare words, while max_df=0.8 drops only the ubiquitous "learning". The ngram_range=(1, 2) setting includes both unigrams and bigrams, doubling the vocabulary here (from 22 to 44 terms) while capturing meaningful two-word phrases like "machine learning" and "deep learning".

The Loss of Word Order

Bag of Words discards all structural information. "The cat chased the dog" and "The dog chased the cat" produce identical vectors, despite having opposite meanings.

In[36]:
Code
# Demonstrate word order loss
sentences = [
    "The cat chased the dog",
    "The dog chased the cat",
    "Dog the cat the chased",  # Nonsense with same words
]

vec = CountVectorizer()
vectors = vec.fit_transform(sentences)
Out[37]:
Console
Word Order Demonstration:
--------------------------------------------------
Vocabulary: ['cat', 'chased', 'dog', 'the']

'The cat chased the dog'
  Vector: [1 1 1 2]

'The dog chased the cat'
  Vector: [1 1 1 2]

'Dog the cat the chased'
  Vector: [1 1 1 2]

Sentence 1 == Sentence 2: True
Sentence 1 == Sentence 3: True

All three sentences produce IDENTICAL vectors!

This is the fundamental limitation of Bag of Words. It cannot distinguish between:

  • Active and passive voice: "John hit Mary" vs "Mary was hit by John"
  • Negation scope: "I love this movie" vs "I don't love this movie" have nearly identical vectors
  • Questions and statements: "Is this good?" vs "This is good"
  • Any semantic difference that depends on word order
Out[38]:
Visualization
Diagram showing three different sentences mapping to the same bag of words vector representation.
Visualization of word order loss in Bag of Words. Three sentences with completely different meanings (or no meaning at all) map to identical vector representations because BoW only counts word occurrences, ignoring sequence and structure.
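The negation case from the list above behaves the same way. With CountVectorizer's default tokenization, "don't" contributes only a "don" token, so the two sentences differ in just one dimension. A minimal sketch:

Code
# Negation barely changes the bag-of-words vector
neg_sentences = [
    "I love this movie",
    "I don't love this movie",
]

neg_vec = CountVectorizer()
neg_matrix = neg_vec.fit_transform(neg_sentences)

print(neg_vec.get_feature_names_out())  # ['don' 'love' 'movie' 'this']
print(neg_matrix.toarray())             # rows differ only in the 'don' column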

When Bag of Words Works

Despite its limitations, Bag of Words remains useful for many tasks:

  • Document classification: For categorizing news articles, spam detection, or sentiment analysis on long texts, word presence often matters more than order. A movie review containing "terrible", "boring", and "waste" is likely negative regardless of how those words are arranged.
  • Information retrieval: Search engines match query terms against document terms. The classic TF-IDF weighting (covered in later chapters) builds directly on the Bag of Words foundation.
  • Topic modeling: Algorithms like Latent Dirichlet Allocation (LDA) assume documents are mixtures of topics, each characterized by word distributions. The bag-of-words assumption is baked into the model.
  • Baseline models: Before deploying complex neural networks, a BoW model provides a sanity check. If a simple model achieves 90% accuracy, you know the task is learnable from word frequencies alone.
In[39]:
Code
# Quick demonstration: sentiment classification with BoW
from sklearn.naive_bayes import MultinomialNB

# Simple sentiment dataset
reviews = [
    "This movie was fantastic and amazing",
    "Absolutely loved this film, brilliant",
    "Great movie, highly recommend",
    "Terrible waste of time, boring",
    "Awful movie, complete disaster",
    "Worst film I have ever seen",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# Vectorize and train
vec = CountVectorizer()
X = vec.fit_transform(reviews)
clf = MultinomialNB()
clf.fit(X, labels)

# Test on new reviews
test_reviews = [
    "This was a brilliant and fantastic experience",
    "Complete waste of time, terrible",
]
test_X = vec.transform(test_reviews)
predictions = clf.predict(test_X)
probabilities = clf.predict_proba(test_X)
Out[40]:
Console
Sentiment Classification with Bag of Words:
--------------------------------------------------
Review: 'This was a brilliant and fantastic experience'
  Prediction: Positive (confidence: 97.7%)

Review: 'Complete waste of time, terrible'
  Prediction: Negative (confidence: 97.3%)

Even this trivial example shows BoW capturing sentiment through word presence. Words like "brilliant", "fantastic", and "terrible" carry strong sentiment signals regardless of context.

Limitations and Impact

Bag of Words has fundamental limitations that motivated the development of more sophisticated representations:

  • No word order: As demonstrated, BoW cannot distinguish sentences with different word arrangements.
  • No semantics: "Good" and "excellent" are treated as completely unrelated words, even though they're synonyms. Similarly, "bank" (financial institution) and "bank" (river edge) are conflated.
  • Vocabulary explosion: Adding n-grams helps capture some phrases but causes vocabulary size to explode. Bigrams alone can multiply vocabulary by 10-100x.
  • Sparsity: High-dimensional sparse vectors are inefficient for neural networks, which prefer dense, lower-dimensional inputs.
  • Out-of-vocabulary words: Words not seen during training have no representation. A model trained on formal text may fail on social media slang.

These limitations drove the development of word embeddings (Word2Vec, GloVe) and eventually transformer-based models that learn dense, contextual representations. Yet Bag of Words remains the conceptual starting point. Understanding document-term matrices, vocabulary construction, and sparse representations provides the foundation for understanding more advanced techniques.

Key Functions and Parameters

When working with Bag of Words representations, CountVectorizer from scikit-learn is the primary tool. Here are its most important parameters:

CountVectorizer(lowercase, min_df, max_df, binary, ngram_range, stop_words, max_features)

  • lowercase (default: True): Convert all text to lowercase before tokenizing. Set to False if case carries meaning (e.g., proper nouns, acronyms).
  • min_df: Minimum document frequency threshold. If an integer, the word must appear in at least this many documents. If a float between 0.0 and 1.0, represents a proportion of documents. Use min_df=2 or higher to remove rare words and typos.
  • max_df: Maximum document frequency threshold. Words appearing in more than this fraction of documents are excluded. Use max_df=0.9 to remove extremely common words that provide no discriminative power.
  • binary (default: False): If True, all non-zero counts are set to 1. Use binary representation when word presence matters more than frequency.
  • ngram_range (default: (1, 1)): Tuple specifying the range of n-gram sizes to include. (1, 2) includes unigrams and bigrams, capturing phrases like "machine learning". Higher values dramatically increase vocabulary size.
  • stop_words: Either 'english' for built-in stop word list, or a custom list of words to exclude. Removes common words like "the", "is", "and" that typically add noise.
  • max_features: Limit vocabulary to the top N most frequent terms. Useful for controlling dimensionality in very large corpora.
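In practice these parameters are combined. The sketch below shows one plausible configuration for a larger corpus; the specific thresholds are illustrative assumptions, and it is applied to this chapter's small corpus only so that it runs end to end:

Code
# One plausible pruning setup (thresholds are illustrative, not prescriptive)
pruned_vectorizer = CountVectorizer(
    lowercase=True,
    stop_words="english",  # drop common English function words
    min_df=2,              # keep words appearing in at least 2 documents
    max_df=0.9,            # drop words appearing in more than 90% of documents
    ngram_range=(1, 2),    # unigrams and bigrams
    max_features=5000,     # cap the vocabulary at the 5,000 most frequent terms
)

X = pruned_vectorizer.fit_transform(corpus)
print(X.shape)
print(pruned_vectorizer.get_feature_names_out())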

Summary

Bag of Words transforms text into numerical vectors by counting word occurrences, ignoring grammar and word order entirely. Despite this brutal simplification, it powers effective text classification, information retrieval, and topic modeling systems.

Key takeaways:

  • Vocabulary construction extracts unique words from a corpus, mapping each to a vector dimension
  • Document-term matrices represent documents as rows and vocabulary words as columns, with counts (or binary indicators) as values
  • Vocabulary pruning with min_df and max_df removes uninformative rare and common words
  • Sparse matrices (CSR format) efficiently store the mostly-zero document-term matrices, reducing memory by 99%+ for realistic corpora
  • scikit-learn's CountVectorizer handles tokenization, vocabulary building, and sparse matrix creation in one optimized package
  • Word order loss is the fundamental limitation: "The cat chased the dog" and "The dog chased the cat" produce identical vectors

In the next chapters, we'll extend these ideas with n-grams to capture some word sequences, and with TF-IDF weighting to emphasize discriminative terms over common ones.
