Learn how to construct word-word and word-document co-occurrence matrices that capture distributional semantics. Covers context window effects, distance weighting, sparse storage, and efficient construction algorithms.

This article is part of the free-to-read Language AI Handbook
Co-occurrence Matrices
The distributional hypothesis tells us that words appearing in similar contexts have similar meanings. But how do we actually capture and quantify these contextual patterns? The answer lies in co-occurrence matrices, the foundational data structures that transform raw text into numerical representations of word relationships.
In this chapter, you'll learn how to construct matrices that encode which words appear near which other words, and how different design choices affect what patterns these matrices capture. These matrices serve as the raw material for more sophisticated techniques like PMI weighting and dimensionality reduction, which we'll explore in subsequent chapters.
The Core Idea: Counting Context
At its heart, a co-occurrence matrix is simply a way to count how often words appear together. Consider a small corpus:
"The cat sat on the mat. The dog sat on the rug."
If we want to know what contexts the word "sat" appears in, we look at its neighbors. Both "cat" and "dog" appear before "sat," while "on" appears after it. By systematically counting these patterns across a large corpus, we build a picture of each word's distributional profile.
A co-occurrence matrix is a square matrix where rows and columns both represent words from the vocabulary. Each cell contains a count (or weighted value) representing how often word $w_i$ appears in the context of word $w_j$.
This approach is simple. We don't need linguistic rules, parse trees, or semantic annotations. We just count. And from these counts, meaningful patterns emerge.
Two Types of Co-occurrence Matrices
There are two main flavors of co-occurrence matrices, each capturing different aspects of word relationships.
Word-Word Co-occurrence Matrices
A word-word matrix counts how often pairs of words appear near each other within a local context window. If your vocabulary has $V$ words, you get a $V \times V$ matrix.
The key parameter is the context window size, which defines "near." A window of size 2 means we count words that appear within 2 positions to the left or right of the target word.
For our example sentence "The cat sat on the mat," with a window size of 2, the word "sat" has context words {the, cat, on, the}. We increment the co-occurrence counts for each of these pairs.
Word-Document Co-occurrence Matrices
A word-document matrix (also called a term-document matrix) counts how often each word appears in each document. If you have $V$ vocabulary words and $D$ documents, you get a $V \times D$ matrix.
This representation is widely used in information retrieval and topic modeling. Documents with similar word distributions are likely about similar topics. Latent Semantic Analysis (LSA) applies SVD to exactly this type of matrix.
For this chapter, we'll focus primarily on word-word matrices, as they capture the fine-grained local context that's most relevant for learning word meanings.
Building a Word-Word Co-occurrence Matrix
Let's build a co-occurrence matrix from scratch. We'll start with a small corpus to see exactly what's happening, then scale up to real text.
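The vocabulary output below includes "chased," so the corpus evidently extends the two sentences from earlier with a third such as "The cat chased the dog." Here is a minimal sketch of the vocabulary-building step under that assumption; the tokenizer simply lowercases and splits on whitespace.

```python
# Minimal vocabulary-building sketch (assumed corpus; see note above).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",  # assumed third sentence, matching the vocabulary below
]

# Tokenize: lowercase and split on whitespace.
tokens = [sentence.lower().split() for sentence in corpus]

# Sorted vocabulary and word -> index mapping.
vocab = sorted({word for sentence in tokens for word in sentence})
word_to_idx = {word: i for i, word in enumerate(vocab)}

print(f"Vocabulary size: {len(vocab)}")
print(f"Vocabulary: {vocab}")
print("Word to index mapping:")
for word, idx in word_to_idx.items():
    print(f"  {word!r} -> {idx}")
```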
Vocabulary size: 8
Vocabulary: ['cat', 'chased', 'dog', 'mat', 'on', 'rug', 'sat', 'the']
Word to index mapping:
  'cat' -> 0
  'chased' -> 1
  'dog' -> 2
  'mat' -> 3
  'on' -> 4
  'rug' -> 5
  'sat' -> 6
  'the' -> 7
Our vocabulary contains 8 unique words. Each word gets a unique integer index that we'll use to address rows and columns in our matrix.
Now let's build the co-occurrence matrix with a context window of size 1, meaning we only count immediately adjacent words.
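A sketch of the construction loop follows: for each token, look at its neighbors within the window and increment the corresponding cell. The helper name `build_cooccurrence` is illustrative, and the code reuses `tokens` and `word_to_idx` from the sketch above.

```python
import numpy as np

def build_cooccurrence(tokenized_sentences, word_to_idx, window_size=1):
    """Count symmetric co-occurrences within a fixed context window (sketch)."""
    V = len(word_to_idx)
    matrix = np.zeros((V, V), dtype=np.float64)
    for sentence in tokenized_sentences:
        for i, target in enumerate(sentence):
            # Neighbors up to window_size positions away, staying inside the sentence.
            lo = max(0, i - window_size)
            hi = min(len(sentence), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    matrix[word_to_idx[target], word_to_idx[sentence[j]]] += 1
    return matrix

cooc = build_cooccurrence(tokens, word_to_idx, window_size=1)
```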
Co-occurrence matrix (window size = 1):
cat chased dog mat on rug sat the
cat 0 1 0 0 0 0 1 2
chased 1 0 0 0 0 0 0 1
dog 0 0 0 0 0 0 1 2
mat 0 0 0 0 0 0 0 1
on 0 0 0 0 0 0 2 2
rug 0 0 0 0 0 0 0 1
sat 1 0 1 0 2 0 0 0
the 2 1 2 1 2 1 0 0
Each cell shows how many times the row word appeared adjacent to the column word. Notice that "the" co-occurs frequently with many words since it's a common function word. Meanwhile, "cat" and "dog" both co-occur with "the" and "sat," reflecting their similar syntactic roles.
Let's visualize this matrix to see the patterns more clearly.

The heatmap reveals the structure of our tiny corpus. The bright row and column for "the" show that it co-occurs with almost everything, which is typical for function words. Content words like "cat," "dog," "mat," and "rug" show sparser patterns that reflect their actual usage.
Context Window Size Effects
The window size parameter has a major effect on what relationships your matrix captures. Let's explore this with a larger, more realistic corpus.
Vocabulary size: 22
Vocabulary: ['a', 'at', 'barks', 'brown', 'cat', 'chases', 'dog', 'fall', 'fox', 'ground', 'hunts', 'jumps', 'lazy', 'leaves', 'mat', 'mouse', 'on', 'over', 'quick', 'sleeps', 'the', 'warm']
Now let's compare matrices built with different window sizes.

The visual difference is clear. With window size 1, the matrix is sparse, capturing only immediate neighbors. As we increase the window, more cells fill in, and the matrix becomes denser. By window size 5, nearly every word pair has some co-occurrence.
Let's quantify this sparsity difference.
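A sketch of that measurement, reusing the `build_cooccurrence` helper from above. The sentences of the 22-word corpus aren't reproduced here, so `tokenized_corpus` is a stand-in, and the exact percentages depend on it.

```python
# Sparsity (fraction of zero cells) for several window sizes -- sketch.
# `tokenized_corpus` stands in for the larger corpus above, tokenized as before.
vocab = sorted({w for sent in tokenized_corpus for w in sent})
word_to_idx = {w: i for i, w in enumerate(vocab)}

for window in (1, 2, 5):
    m = build_cooccurrence(tokenized_corpus, word_to_idx, window_size=window)
    sparsity = 100.0 * (m == 0).sum() / m.size
    print(f"Window {window}: {sparsity:.1f}% zeros")
```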
Matrix sparsity by window size:
  Window 1: 86.8% zeros
  Window 2: 77.3% zeros
  Window 5: 63.4% zeros

The bar chart makes the trade-off clear: window size 1 produces an extremely sparse matrix (mostly zeros), while window size 5 fills in substantially more entries.
Small windows (1-2) capture syntactic relationships. Words that appear immediately adjacent tend to have grammatical relationships: determiners before nouns, verbs before objects.
Large windows (5-10) capture topical or semantic relationships. Words that appear in the same general context are often about the same topic, even if they're not syntactically related.
The choice of window size depends on your downstream task. For syntax-focused applications like part-of-speech tagging, smaller windows work better. For semantic similarity and topic modeling, larger windows capture more meaningful relationships.
Weighting by Distance
Not all context positions are equally informative. A word immediately adjacent to your target is more relevant than one five positions away. Distance weighting addresses this by giving closer words higher counts.
The most common approach is harmonic weighting, where a context word at distance $d$ contributes $1/d$ instead of 1:

$$\text{weight}(d) = \frac{1}{d}$$

where:
- $d$: the distance (in word positions) between the target word and the context word, always a positive integer
The following plot shows how harmonic weights decay with distance, compared to uniform weighting where all positions contribute equally.

This makes sense: a word right next to your target is highly relevant, while one five positions away provides weaker evidence of association.
Let's implement this.
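A sketch of distance-weighted construction: it is identical to the earlier loop, except each neighbor contributes $1/d$ rather than 1. The function name is illustrative, and `tokenized_corpus` is the same stand-in as before.

```python
def build_weighted_cooccurrence(tokenized_sentences, word_to_idx, window_size=5):
    """Co-occurrence counts with harmonic 1/d distance weighting (sketch)."""
    V = len(word_to_idx)
    matrix = np.zeros((V, V), dtype=np.float64)
    for sentence in tokenized_sentences:
        for i, target in enumerate(sentence):
            lo = max(0, i - window_size)
            hi = min(len(sentence), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    # Closer neighbors contribute more: weight = 1 / distance.
                    matrix[word_to_idx[target], word_to_idx[sentence[j]]] += 1.0 / abs(i - j)
    return matrix

weighted_cooc = build_weighted_cooccurrence(tokenized_corpus, word_to_idx, window_size=5)
```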

The harmonic-weighted matrix has lower overall values because distant context words contribute fractional amounts. This weighting scheme was popularized by GloVe and has become standard practice in modern word embedding methods.
Symmetric vs Directional Contexts
So far, we've treated left and right context symmetrically. If "cat" appears before "sat," we count it the same as if it appeared after. But in some cases, you might want to distinguish direction.
Symmetric contexts treat position as irrelevant. This is the standard approach and what we've been doing. The resulting matrix is symmetric: $M_{ij} = M_{ji}$.
Directional contexts distinguish left from right. You might have separate matrices for "words that appear before" and "words that appear after," or double your vocabulary to include position markers.
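One way to sketch this is to keep two matrices, one for contexts to the left of the target and one for contexts to the right; summing them recovers the symmetric counts. The names below are illustrative.

```python
def build_directional_cooccurrence(tokenized_sentences, word_to_idx, window_size=2):
    """Separate left-context and right-context count matrices (sketch)."""
    V = len(word_to_idx)
    left = np.zeros((V, V))   # context word appears before the target
    right = np.zeros((V, V))  # context word appears after the target
    for sentence in tokenized_sentences:
        for i, target in enumerate(sentence):
            lo = max(0, i - window_size)
            hi = min(len(sentence), i + window_size + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                row, col = word_to_idx[target], word_to_idx[sentence[j]]
                if j < i:
                    left[row, col] += 1
                else:
                    right[row, col] += 1
    return left, right

left, right = build_directional_cooccurrence(tokenized_corpus, word_to_idx, window_size=2)
symmetric = build_cooccurrence(tokenized_corpus, word_to_idx, window_size=2)
print("Combined equals symmetric:", np.allclose(left + right, symmetric))
```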
Directional context analysis:
  Left context non-zero entries: 59
  Right context non-zero entries: 59
  Combined equals symmetric: True
Directional contexts can capture asymmetric relationships. For instance, in English, determiners almost always appear to the left of nouns. A directional matrix would capture this pattern, while a symmetric matrix would lose it.
Matrix Sparsity and Storage
Real-world co-occurrence matrices are enormous and extremely sparse. With a vocabulary of 100,000 words, you'd have a $100{,}000 \times 100{,}000$ matrix, which is 10 billion cells. Most of these cells will be zero since any given word only co-occurs with a tiny fraction of the vocabulary.
Memory requirements for co-occurrence matrices:

| Vocab Size | Sparsity | Dense (GB) | Sparse (GB) |
|---|---|---|---|
| 10,000 | 90.00% | 0.75 | 0.11 |
| 50,000 | 98.00% | 18.63 | 0.56 |
| 100,000 | 99.00% | 74.51 | 1.12 |
| 500,000 | 99.80% | 1862.65 | 5.59 |

The numbers tell the story. A dense 100K vocabulary matrix would require 74 GB of memory. The same information stored sparsely needs only about 1 GB. This is why sparse matrix formats are necessary for practical applications.
COO (Coordinate): Stores (row, column, value) triples. Good for construction but inefficient for arithmetic.
CSR (Compressed Sparse Row): Compresses row indices. Efficient for row slicing and matrix-vector products.
CSC (Compressed Sparse Column): Compresses column indices. Efficient for column slicing.
Let's see how to use sparse matrices in practice with scipy.
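A sketch using scipy's CSR format, converting the window-2 matrix for the larger corpus. Exactly which buffers you count toward "sparse memory" varies, so treat the byte figures in the output below as illustrative rather than something this sketch reproduces to the byte.

```python
import numpy as np
from scipy import sparse

# Build the window-2 matrix for the larger corpus, then convert it to CSR.
cooc_w2 = build_cooccurrence(tokenized_corpus, word_to_idx, window_size=2)
sparse_cooc = sparse.csr_matrix(cooc_w2)

dense_bytes = cooc_w2.nbytes
sparse_bytes = (sparse_cooc.data.nbytes
                + sparse_cooc.indices.nbytes
                + sparse_cooc.indptr.nbytes)

print("Storage comparison:")
print(f"  Dense matrix memory: {dense_bytes / 1024:.2f} KB")
print(f"  Sparse matrix memory: {sparse_bytes} bytes")
print(f"  Non-zero entries: {sparse_cooc.nnz}")
print(f"  Sparsity: {100.0 * (1 - sparse_cooc.nnz / cooc_w2.size):.1f}%")
```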
Storage comparison:
  Dense matrix memory: 3.78 KB
  Sparse matrix memory: 972.00 bytes
  Non-zero entries: 110
  Sparsity: 77.3%
For our tiny example, the overhead of sparse storage actually makes it larger. But as matrices grow, sparse storage becomes necessary.
Efficient Construction at Scale
When processing large corpora, naive implementations become prohibitively slow. Here are key optimizations for building co-occurrence matrices efficiently.
Streaming Construction
Instead of loading the entire corpus into memory, process it in chunks.
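A sketch of streaming construction using a nested dictionary of counts. `read_sentences` and the path `corpus.txt` are hypothetical stand-ins for whatever reads your corpus line by line; the harmonic weighting matches the fractional counts in the sample output below.

```python
from collections import defaultdict

def stream_cooccurrences(sentence_iterator, window_size=2):
    """Accumulate 1/d-weighted co-occurrence counts one sentence at a time (sketch)."""
    counts = defaultdict(lambda: defaultdict(float))
    for sentence in sentence_iterator:
        words = sentence.lower().split()
        for i, target in enumerate(words):
            lo = max(0, i - window_size)
            hi = min(len(words), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][words[j]] += 1.0 / abs(i - j)
    return counts

def read_sentences(path):
    """Yield one sentence per line so the full corpus never sits in memory at once."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

cooc_counts = stream_cooccurrences(read_sentences("corpus.txt"))  # hypothetical file
```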
Streaming co-occurrence counts (sample):
'cat': {'lazy': 2.0, 'the': 1.5, 'sleeps': 1.0, 'quick': 1.0, 'chases': 1.0}
'dog': {'the': 1.5, 'lazy': 1.0, 'barks': 1.0, 'at': 0.5}
'the': {'lazy': 3.0, 'quick': 2.0, 'on': 2.0, 'brown': 1.5, 'dog': 1.5}
Vocabulary Filtering
In practice, you'll want to filter the vocabulary to exclude very rare words (which have unreliable statistics) and very common words (which dominate the matrix without adding much information).
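A sketch of frequency-based filtering. The thresholds are assumptions chosen to match the tiny corpus (each sentence treated as a "document"), where only "the" exceeds a 90% document-frequency cap.

```python
from collections import Counter

def filter_vocabulary(tokenized_sentences, min_count=1, max_doc_freq=0.9):
    """Keep words that are frequent enough but not near-ubiquitous (sketch)."""
    n_docs = len(tokenized_sentences)
    term_counts = Counter(w for sent in tokenized_sentences for w in sent)
    doc_counts = Counter(w for sent in tokenized_sentences for w in set(sent))
    return {
        w for w in term_counts
        if term_counts[w] >= min_count
        and doc_counts[w] / n_docs <= max_doc_freq
    }

filtered_vocab = filter_vocabulary(tokenized_corpus, min_count=1, max_doc_freq=0.9)
all_words = {w for sent in tokenized_corpus for w in sent}
print(f"Original vocabulary size: {len(all_words)}")
print(f"Filtered vocabulary size: {len(filtered_vocab)}")
print(f"Removed words: {all_words - filtered_vocab}")
```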
Original vocabulary size: 22
Filtered vocabulary size: 21
Removed words: {'the'}
The word "the" appears in every sentence, so it exceeds our 90% document frequency threshold and gets filtered out. This is a simple form of stopword removal based purely on statistics.
From Counts to Similarity
Once you have a co-occurrence matrix, you can compute word similarity by comparing rows. Words with similar co-occurrence patterns (those appearing in similar contexts) should have similar meanings.
The simplest approach is cosine similarity between row vectors.
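A sketch of row-vector similarity: normalize each row to unit length, then a dot product gives cosine similarity. The helper name `most_similar` is illustrative, and the printed scores depend on the corpus, window, and weighting, so they may differ from the values shown below.

```python
import numpy as np

def most_similar(word, matrix, word_to_idx, idx_to_word, top_n=3):
    """Rank words by cosine similarity of their co-occurrence rows (sketch)."""
    # Normalize rows; the epsilon guards against all-zero rows.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True) + 1e-10
    unit_rows = matrix / norms
    sims = unit_rows @ unit_rows[word_to_idx[word]]
    sims[word_to_idx[word]] = -1.0  # exclude the query word itself
    top = np.argsort(sims)[::-1][:top_n]
    return [(idx_to_word[i], float(sims[i])) for i in top]

idx_to_word = {i: w for w, i in word_to_idx.items()}
cooc_large = build_weighted_cooccurrence(tokenized_corpus, word_to_idx, window_size=2)
for neighbor, score in most_similar("cat", cooc_large, word_to_idx, idx_to_word):
    print(f"  {neighbor}: {score:.3f}")
```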
Most similar words based on co-occurrence patterns:
'cat':
dog: 0.713
sleeps: 0.707
hunts: 0.674
'dog':
barks: 0.772
at: 0.756
cat: 0.713
'quick':
brown: 0.636
hunts: 0.615
mouse: 0.615
To understand why certain words are similar, we can visualize their distributional profiles (the row vectors from the co-occurrence matrix).

The bar chart reveals why "cat" and "dog" are similar: they share high co-occurrence with the same context words (like "the" and "lazy"). "Fox" has a different profile, co-occurring with different words.
Even with our tiny corpus, the similarity captures some meaningful patterns. "Cat" and "dog" are similar because they both appear near "the" and "lazy," and each co-occurs with a verb ("chases," "barks"). The results would be much more meaningful with a larger corpus.
A Real-World Example
Let's apply everything we've learned to a real text corpus. We'll use a small corpus of sentences about cats, dogs, and other pets to keep things manageable while demonstrating realistic patterns.
Corpus: 20 sentences
Vocabulary (words appearing 2+ times): 12 words
Words: ['a', 'and', 'are', 'at', 'cat', 'cats', 'dog', 'dogs', 'mouse', 'pet', 'runs', 'the']

Now let's find similar words in this more meaningful corpus.
Word similarities in animal corpus:
'cat':
mouse: 0.915
dog: 0.902
at: 0.857
runs: 0.706
'dog':
mouse: 0.931
cat: 0.902
at: 0.770
runs: 0.634
'mouse':
dog: 0.931
cat: 0.915
at: 0.833
runs: 0.686
We can visualize the full similarity structure by computing pairwise cosine similarities between all words.
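Once the rows are normalized, the full pairwise similarity matrix is a single matrix product. A sketch follows, using `cooc_large` from the earlier similarity example; for the heatmap below you would substitute the matrix built from the animal corpus.

```python
# Pairwise cosine similarities between all row vectors (sketch).
norms = np.linalg.norm(cooc_large, axis=1, keepdims=True) + 1e-10
unit_rows = cooc_large / norms
similarity_matrix = unit_rows @ unit_rows.T  # shape (V, V); diagonal is 1.0
```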

The similarity heatmap reveals the structure learned from co-occurrence patterns. The diagonal is always 1.0 (perfect self-similarity), and off-diagonal bright cells show which words have similar distributional profiles.
The results reflect the corpus content. "Cat" is most similar to "mouse" and "dog," which appear in the same animal-and-pet contexts, followed by words like "at" and "runs" that share its neighborhoods. With a larger corpus, these patterns become more reliable and meaningful.
Limitations and What Comes Next
Raw co-occurrence counts have limitations that motivate the techniques in upcoming chapters.
Frequency bias: Common words dominate the matrix. "The" appears everywhere, so it has high co-occurrence with everything, even though this tells us little about word meaning. We need weighting schemes like PMI to address this.
Dimensionality: Even with a modest 50,000-word vocabulary, you have 50,000-dimensional vectors. This is computationally expensive and prone to noise. Dimensionality reduction techniques like SVD compress these vectors while preserving meaningful structure.
Sparsity: Most word pairs never co-occur, leaving the matrix mostly zeros. This sparsity makes similarity computations unreliable for rare words. Dense embeddings learned through neural methods address this limitation.
Despite these limitations, co-occurrence matrices remain the foundation of distributional semantics. Understanding how they work, what design choices matter, and what patterns they capture prepares you for the more sophisticated techniques built on top of them.
Summary
In this chapter, you learned how to transform raw text into numerical representations through co-occurrence matrices.
Key concepts:
- Word-word matrices count how often pairs of words appear near each other, capturing local context patterns
- Context window size controls what relationships you capture: small windows for syntax, large windows for semantics
- Distance weighting gives more importance to closer context words, typically using harmonic weights ($1/d$)
- Symmetric vs directional contexts trade off simplicity against capturing word order information
- Sparse storage is necessary for real-world vocabulary sizes, reducing memory by orders of magnitude
Practical considerations:
- Filter vocabulary by frequency to remove noise from rare words and uninformative function words
- Use streaming construction for large corpora that don't fit in memory
- Cosine similarity between row vectors gives a simple word similarity measure
The raw counts in co-occurrence matrices are just the starting point. In the next chapter, we'll see how Pointwise Mutual Information transforms these counts into more meaningful association scores that better capture word relationships.
Key Parameters
When building co-occurrence matrices, these parameters have the greatest impact on the resulting representations:
| Parameter | Typical Range | Effect |
|---|---|---|
| window_size | 1-10 | Small (1-2): captures syntactic relationships. Large (5-10): captures semantic/topical relationships. |
| weighting | uniform, harmonic, linear | Harmonic ($1/d$) is most common. Gives more weight to closer context words. |
| min_count | 5-100 | Filters rare words with unreliable statistics. Higher values reduce vocabulary size and noise. |
| max_freq | 0.5-0.9 | Filters very common words (stopwords). Lower values are more aggressive. |
Choosing window size: Start with 2-5 for general-purpose representations. Use smaller windows (1-2) if syntactic patterns matter, larger windows (5-10) for topic-level similarity.
Choosing weighting: Harmonic weighting is the default choice for most applications. It's used by GloVe and produces more balanced representations than uniform weighting.
Vocabulary filtering: For large corpora, use min_count=5 and max_freq=0.7 as starting points. Adjust based on corpus size and downstream task requirements.