
Pointwise Mutual Information: Measuring Word Associations in NLP

Michael Brenndoerfer · December 9, 2025 · 31 min read · 7,327 words

Learn how Pointwise Mutual Information (PMI) transforms raw co-occurrence counts into meaningful word association scores by comparing observed frequencies to expected frequencies under independence.

This article is part of the free-to-read Language AI Handbook.

Pointwise Mutual Information

Raw co-occurrence counts tell us how often words appear together, but they have a fundamental problem: frequent words dominate everything. The word "the" co-occurs with nearly every other word, not because it has a special relationship with them, but simply because it appears everywhere. How do we separate meaningful associations from mere frequency effects?

Pointwise Mutual Information (PMI) solves this problem by asking a different question: does this word pair co-occur more than we'd expect by chance? Instead of counting raw co-occurrences, PMI measures the surprise of seeing two words together. When "New" and "York" appear together far more often than their individual frequencies would predict, PMI captures that strong association. When "the" and "dog" appear together only as often as expected, PMI correctly identifies this as an unremarkable pairing.

This chapter derives the PMI formula from probability theory, shows how it transforms co-occurrence matrices into more meaningful representations, and demonstrates its practical applications from collocation extraction to improved word vectors.

The Problem with Raw Counts

Let's start by understanding why raw co-occurrence counts are problematic. Consider a corpus about animals and food:

In[2]:
import numpy as np
from collections import Counter

# A corpus with varying word frequencies
corpus = """
The cat sat on the mat. The dog ran in the park.
The cat chased the mouse. The dog barked at the cat.
A new restaurant opened downtown. The new menu features local food.
The chef prepared the meal. Fresh ingredients make the best food.
The cat sleeps all day. The lazy dog naps too.
The new cafe serves great coffee. The barista makes excellent drinks.
"""

# Tokenize and count word frequencies
words = corpus.lower().replace('.', '').split()
word_counts = Counter(words)
total_words = len(words)
Out[3]:
Word frequencies in corpus:
----------------------------------------
  'the         ':  16 occurrences (24.2%)
  'cat         ':   4 occurrences (6.1%)
  'dog         ':   3 occurrences (4.5%)
  'new         ':   3 occurrences (4.5%)
  'food        ':   2 occurrences (3.0%)
  'sat         ':   1 occurrences (1.5%)
  'on          ':   1 occurrences (1.5%)
  'mat         ':   1 occurrences (1.5%)
  'ran         ':   1 occurrences (1.5%)
  'in          ':   1 occurrences (1.5%)

The word "the" appears 18 times, dwarfing content words like "cat" (5 times) or "new" (3 times). In a raw co-occurrence matrix, "the" will have high counts with almost everything, obscuring the meaningful associations we care about.

In[4]:
def build_cooccurrence_matrix(corpus, window_size=2):
    """Build a word-word co-occurrence matrix."""
    words = corpus.lower().replace('.', '').split()
    vocab = sorted(set(words))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    
    matrix = np.zeros((n, n))
    
    for i, word in enumerate(words):
        word_idx = word_to_idx[word]
        start = max(0, i - window_size)
        end = min(len(words), i + window_size + 1)
        
        for j in range(start, end):
            if j != i:
                context_idx = word_to_idx[words[j]]
                matrix[word_idx, context_idx] += 1
    
    return matrix, vocab, word_to_idx

cooc_matrix, vocab, word_to_idx = build_cooccurrence_matrix(corpus, window_size=2)
Out[5]:
Raw co-occurrence counts:
----------------------------------------
  'the' co-occurs with 62 total words
  'cat' co-occurs with 15 total words
  'new' co-occurs with 12 total words

Top co-occurrences for 'cat':
    the: 5
    a: 1
    all: 1
    at: 1
    chased: 1

The raw counts show "the" as the top co-occurrence for "cat," but this tells us nothing about cats. The word "the" appears near everything. We need a measure that accounts for how often we'd expect words to co-occur given their individual frequencies.

The PMI Formula

The problem we identified, that frequent words dominate co-occurrence counts, stems from a fundamental issue: raw counts conflate two distinct phenomena. When "the" appears near "cat" 10 times, is that because "the" and "cat" have a special relationship, or simply because "the" appears near everything 10 times? To separate genuine associations from frequency effects, we need to ask a different question: how does the observed co-occurrence compare to what we'd expect if the words were unrelated?

Pointwise Mutual Information answers this question directly. Instead of asking "how often do these words appear together?" PMI asks "do these words appear together more or less than chance would predict?"

The Intuition: Observed vs Expected

Consider two words, $w$ and $c$. If they have no special relationship, their co-occurrence should follow a simple pattern: the probability of seeing them together should equal the product of their individual probabilities. This is the definition of statistical independence.

For example, if "cat" appears in 5% of contexts and "mouse" appears in 2% of contexts, and they're independent, we'd expect "cat" and "mouse" to co-occur in roughly 0.05×0.02=0.1%0.05 \times 0.02 = 0.1\% of word pairs. If they actually co-occur in 1% of pairs, that's 10 times more than expected. This excess reveals a genuine association.

PMI formalizes this comparison by taking the ratio of observed to expected co-occurrence:

$$\text{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w) \cdot P(c)}$$
Pointwise Mutual Information

PMI measures the association between two words by comparing their joint probability of co-occurrence to what we'd expect if they were independent. High PMI indicates the words are strongly associated; low or negative PMI indicates they co-occur less than expected by chance.

Let's unpack each component of this formula:

  • Numerator $P(w, c)$: The joint probability, measuring how often words $w$ and $c$ actually co-occur in the corpus.

  • Denominator $P(w) \cdot P(c)$: The expected probability under independence. If the words were unrelated, this is how often they would co-occur by chance.

  • The ratio: When observed equals expected, the ratio is 1. When observed exceeds expected, the ratio is greater than 1. When observed falls short of expected, the ratio is less than 1.

  • The logarithm: Taking $\log_2$ converts multiplicative relationships to additive ones. A ratio of 2 (twice as often as expected) becomes PMI of 1. A ratio of 4 becomes PMI of 2. This logarithmic scale makes PMI values easier to interpret and compare.

Why the Logarithm?

The logarithm serves three purposes. First, it converts ratios to differences: PMI of 2 means "4 times more than expected," while PMI of 1 means "2 times more." The additive scale is more intuitive.

Second, it creates symmetry around zero. Positive PMI indicates attraction (words co-occur more than expected), negative PMI indicates repulsion (words co-occur less than expected), and zero indicates independence.

Third, it connects to information theory. PMI measures how much information observing one word provides about the other. When two words are strongly associated, seeing one tells you a lot about whether you'll see the other.
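The mapping from ratios to PMI values is easy to tabulate; this short sketch simply runs a few observed/expected ratios through the base-2 logarithm:

import numpy as np

# Observed/expected ratios and the PMI values they map to
for ratio in [0.25, 0.5, 1, 2, 4, 8]:
    print(f"observed/expected = {ratio:>5}  ->  PMI = {np.log2(ratio):+.1f}")
# Ratios above 1 give positive PMI (attraction), ratios below 1 give
# negative PMI (repulsion), and a ratio of exactly 1 gives 0 (independence).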

From Probabilities to Counts

In practice, we don't know the true probabilities. We estimate them from corpus counts. Let $\#(w, c)$ denote how many times words $w$ and $c$ co-occur, and let $N$ be the total number of word-context pairs in our co-occurrence matrix.

The probability estimates follow naturally:

$$P(w, c) = \frac{\#(w, c)}{N}$$

This is the fraction of all word pairs that are specifically the $(w, c)$ pair.

$$P(w) = \frac{\sum_c \#(w, c)}{N} = \frac{\#(w, *)}{N}$$

This is the fraction of all pairs where the target word is $w$. The notation $\#(w, *)$ means "sum over all context words," which is simply the row sum for word $w$ in the co-occurrence matrix.

$$P(c) = \frac{\sum_w \#(w, c)}{N} = \frac{\#(*, c)}{N}$$

Similarly, this is the column sum for context word $c$, representing how often $c$ appears as a context for any word.

The Practical Formula

Substituting these estimates into the PMI formula:

$$\text{PMI}(w, c) = \log_2 \frac{\#(w, c) / N}{(\#(w, *) / N) \cdot (\#(*, c) / N)}$$

Notice that $N$ appears in both numerator and denominator. With some algebra, we can simplify:

$$\text{PMI}(w, c) = \log_2 \frac{\#(w, c) / N}{(\#(w, *) \cdot \#(*, c)) / N^2} = \log_2 \frac{\#(w, c) \cdot N}{\#(w, *) \cdot \#(*, c)}$$

This is the formula we implement: multiply the co-occurrence count by the total, divide by the product of marginals, and take the logarithm. Each term has a clear interpretation:

  • $\#(w, c)$: How often $w$ and $c$ actually co-occur
  • $N$: Total co-occurrences in the matrix
  • $\#(w, *)$: How often $w$ appears with any context (row sum)
  • $\#(*, c)$: How often $c$ appears as context for any word (column sum)
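Plugging a few made-up counts into this formula shows how the pieces fit together (the numbers below are illustrative, not taken from the corpus above):

import numpy as np

count_wc = 3    # #(w, c): w and c co-occur 3 times
row_sum = 10    # #(w, *): w appears with any context 10 times
col_sum = 6     # #(*, c): c appears as a context 6 times
N = 100         # total co-occurrences in the matrix

expected = row_sum * col_sum / N                    # 0.6 expected co-occurrences
pmi = np.log2(count_wc * N / (row_sum * col_sum))   # log2(3 / 0.6) = log2(5)

print(f"expected={expected:.2f}, observed={count_wc}, PMI={pmi:.2f}")   # PMI ~ 2.32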

Implementing PMI

Let's translate this formula into code. The implementation follows the mathematical derivation closely, computing each component step by step.

In[6]:
def compute_pmi_matrix(cooc_matrix):
    """
    Compute PMI matrix from co-occurrence counts.
    
    PMI(w, c) = log2(P(w,c) / (P(w) * P(c)))
              = log2((#(w,c) * N) / (#(w,*) * #(*,c)))
    """
    # Total count: N = sum of all co-occurrences
    N = cooc_matrix.sum()
    
    # Row sums: #(w, *) for each word
    # This gives us the marginal counts for target words
    row_sums = cooc_matrix.sum(axis=1, keepdims=True)
    
    # Column sums: #(*, c) for each context
    # This gives us the marginal counts for context words
    col_sums = cooc_matrix.sum(axis=0, keepdims=True)
    
    # Expected counts under independence: E[#(w,c)] = #(w,*) * #(*,c) / N
    # Using matrix multiplication to compute all pairs efficiently
    expected = (row_sums @ col_sums) / N
    
    # PMI = log2(observed / expected)
    # Handle zeros: log(0) is undefined, so we replace with 0
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log2(cooc_matrix / expected)
        pmi[~np.isfinite(pmi)] = 0  # Replace inf/-inf/nan with 0
    
    return pmi

pmi_matrix = compute_pmi_matrix(cooc_matrix)

The key insight in this implementation is computing the expected counts efficiently. Rather than looping through all word pairs, we use matrix multiplication: row_sums @ col_sums produces an outer product where each cell $(i, j)$ contains $\#(w_i, *) \cdot \#(*, c_j)$. Dividing by $N$ gives us the expected count for each pair.
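The outer-product trick is worth seeing on a toy example. In this sketch (a made-up 2x3 count matrix), row_sums @ col_sums reproduces every pairwise product of marginals in one operation:

import numpy as np

toy = np.array([[2.0, 0.0, 1.0],
                [1.0, 3.0, 0.0]])

N = toy.sum()                                  # 7
row_sums = toy.sum(axis=1, keepdims=True)      # shape (2, 1): [[3.], [4.]]
col_sums = toy.sum(axis=0, keepdims=True)      # shape (1, 3): [[3., 3., 1.]]

expected = (row_sums @ col_sums) / N           # each cell: row_sum * col_sum / N
print(expected)
# approximately [[1.29, 1.29, 0.43],
#                [1.71, 1.71, 0.57]]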

Out[7]:
Visualization
Scatter plot showing observed co-occurrence counts on y-axis versus expected counts on x-axis, with diagonal line separating positive and negative PMI regions.
Observed vs expected co-occurrence counts for all word pairs. Points above the diagonal (green) co-occur more than expected, yielding positive PMI. Points below the diagonal (red) co-occur less than expected, yielding negative PMI. Points near the diagonal have PMI close to zero, indicating independence. The logarithmic scale reveals the full range of values.

This scatter plot visualizes the fundamental idea behind PMI: comparing what we observe to what we'd expect. Word pairs above the diagonal have positive PMI (they co-occur more than chance predicts), while pairs below have negative PMI.

Out[8]:
PMI values for 'cat':
----------------------------------------
    a           : PMI = +2.10
    all         : PMI = +2.10
    at          : PMI = +2.10
    chased      : PMI = +2.10
    on          : PMI = +2.10
    park        : PMI = +2.10
    sat         : PMI = +2.10
    sleeps      : PMI = +2.10

The transformation changes the picture. Raw counts ranked "the" as the top co-occurrence for "cat," but under PMI it drops away: "the" appears so often that it co-occurs with "cat" only slightly more than chance would predict, giving it a low score. The words that remain at the top are those that occur specifically in cat-related contexts; in a corpus this small, many of them tie at the same value because each co-occurs with "cat" exactly once and is itself rare.

Interpreting PMI as Association Strength

The logarithmic scale of PMI creates a natural interpretation centered on zero. Because we're measuring $\log_2(\text{observed} / \text{expected})$, the value tells us directly how the actual co-occurrence compares to the baseline of independence (a short sketch after this list converts a few scores back into ratios):

  • PMI > 0: The words co-occur more than expected. A PMI of 1 means twice as often as chance; PMI of 2 means four times as often; PMI of 3 means eight times. The higher the value, the stronger the positive association.

  • PMI = 0: The words co-occur exactly as expected under independence. There's no special relationship, either attractive or repulsive.

  • PMI < 0: The words co-occur less than expected. They tend to avoid each other. A PMI of -1 means half as often as chance would predict.
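Because PMI is a base-2 logarithm, converting a score back into a "times more than expected" ratio is just exponentiation:

# Converting PMI scores back to observed/expected ratios
for pmi in [-1.0, 0.0, 1.0, 2.0, 3.0]:
    ratio = 2 ** pmi
    print(f"PMI = {pmi:+.1f}  ->  co-occurs {ratio:.2f}x as often as expected")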

Out[9]:
Visualization
Number line showing PMI interpretation from negative (avoidance) through zero (independence) to positive (strong association).
Interpretation of PMI values. Positive PMI indicates words co-occur more than chance would predict, suggesting a meaningful association. Zero PMI means the words are independent. Negative PMI means the words avoid each other, appearing together less than expected.

A Worked Example: Computing PMI Step by Step

To solidify understanding, let's walk through a complete PMI calculation by hand. We'll compute the PMI between "cat" and "mouse," two words we'd expect to have a strong association.

The calculation proceeds in three stages:

  1. Gather the raw counts from our co-occurrence matrix
  2. Compute the expected count under the assumption of independence
  3. Calculate PMI as the log ratio of observed to expected
In[10]:
# Get the counts we need for the PMI calculation
cat_idx = word_to_idx.get('cat', -1)
mouse_idx = word_to_idx.get('mouse', -1)

if cat_idx >= 0 and mouse_idx >= 0:
    # Step 1: Observed co-occurrence count #(cat, mouse)
    count_cat_mouse = cooc_matrix[cat_idx, mouse_idx]
    
    # Step 2: Marginal counts (row and column sums)
    row_sum_cat = cooc_matrix[cat_idx].sum()      # #(cat, *)
    col_sum_mouse = cooc_matrix[:, mouse_idx].sum()  # #(*, mouse)
    
    # Step 3: Total count N
    N = cooc_matrix.sum()
    
    # Step 4: Expected count under independence
    expected = (row_sum_cat * col_sum_mouse) / N
    
    # Step 5: PMI = log2(observed / expected)
    pmi_cat_mouse = np.log2(count_cat_mouse / expected) if expected > 0 and count_cat_mouse > 0 else 0
Out[11]:
PMI Calculation: 'cat' and 'mouse'
==================================================

Step 1: Gather counts from the co-occurrence matrix
  #(cat, mouse) = 0  (observed co-occurrences)
  #(cat, *)     = 15  (total contexts for 'cat')
  #(*, mouse)   = 4  (total contexts for 'mouse')
  N             = 258  (total word pairs)

Step 2: Compute expected count under independence
  If 'cat' and 'mouse' were unrelated, we'd expect:
  Expected = (#(cat,*) × #(*,mouse)) / N
           = (15 × 4) / 258
           = 0.23

Step 3: Compute PMI as log ratio
  PMI = log₂(observed / expected)
      = log₂(0 / 0.23)
      = log₂(0.00)
      = 0.00

The numbers tell a more subtle story than we might have hoped. In this tiny corpus, "cat" and "mouse" never fall within two words of each other: in "The cat chased the mouse" they are separated by three tokens, just outside the window. The observed count is therefore zero, the ratio inside the logarithm is zero, and the implementation falls back to a PMI of 0 rather than negative infinity. The intuitive cat-mouse association is real in language at large, but a small corpus with a narrow window simply cannot reveal it.

Even so, the example shows why PMI, rather than raw counts, is the right tool once the data contains the signal. Raw counts would rank "the" far above "mouse" as a neighbor of "cat" simply because "the" is everywhere. PMI asks the better question: not "how often?" but "how much more than expected?"
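As a quick check on this explanation, we can rebuild the matrix with a wider window and recompute the observed and expected counts. This is a small sketch reusing the build_cooccurrence_matrix helper defined earlier; the window size of 4 is an illustrative choice, not a value used elsewhere in the chapter.

wide_cooc, wide_vocab, wide_idx = build_cooccurrence_matrix(corpus, window_size=4)

cat_i, mouse_i = wide_idx['cat'], wide_idx['mouse']
observed = wide_cooc[cat_i, mouse_i]                                  # #(cat, mouse)
expected = wide_cooc[cat_i].sum() * wide_cooc[:, mouse_i].sum() / wide_cooc.sum()

pmi = np.log2(observed / expected) if observed > 0 else float('-inf')
print(f"window=4: observed={observed:.0f}, expected={expected:.2f}, PMI={pmi:.2f}")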

The Problem with Negative PMI

While PMI can be negative (indicating words that avoid each other), negative values cause practical problems. Most word pairs never co-occur at all, giving them PMI of negative infinity. Even pairs that co-occur rarely get large negative values.
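A two-line check makes the infinity problem concrete; this small sketch mirrors the situation compute_pmi_matrix has to guard against:

observed = np.array([0.0, 1.0])   # one pair never co-occurs, one co-occurs once
expected = np.array([2.0, 2.0])   # both expected to co-occur twice under independence

with np.errstate(divide='ignore'):
    print(np.log2(observed / expected))   # [-inf  -1.]
# The zero-count pair gets negative infinity, which is why the implementation
# replaces non-finite values with 0.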

In[12]:
# Analyze the distribution of PMI values
finite_pmis = pmi_matrix[np.isfinite(pmi_matrix) & (pmi_matrix != 0)]
negative_pmis = finite_pmis[finite_pmis < 0]
positive_pmis = finite_pmis[finite_pmis > 0]
Out[13]:
PMI value distribution:
----------------------------------------
  Total non-zero entries: 215
  Negative PMI values:    3 (1.4%)
  Positive PMI values:    212 (98.6%)

  Min PMI: -1.31
  Max PMI: 5.43
  Mean PMI: 2.25
Out[14]:
Visualization
Histogram of finite, non-zero PMI values, with negative values in red and positive values in green.
Distribution of finite, non-zero PMI values in the co-occurrence matrix. Negative values (red) mark word pairs that co-occur less than expected; positive values (green) mark genuine associations. Pairs that never co-occur, whose PMI would be negative infinity, are mapped to zero by the implementation and excluded here.

In this toy corpus, only a handful of finite PMI values are negative, because pairs that never co-occur were mapped to zero rather than negative infinity. On realistic corpora, undefined and negative values make up the overwhelming majority of entries, and that prevalence, together with the issues below, motivates the PPMI transformation.

Negative PMI values are problematic for several reasons:

  1. Unreliable estimates: Low co-occurrence counts produce noisy PMI values. A word pair that co-occurs once when we expected two has PMI of -1, but this could easily be sampling noise.

  2. Asymmetric information: Knowing that words don't co-occur is less informative than knowing they do. The absence of co-occurrence could mean many things.

  3. Computational issues: Large negative values dominate distance calculations and can destabilize downstream algorithms.

Positive PMI (PPMI)

The standard solution is Positive PMI (PPMI), which simply clips negative values to zero:

$$\text{PPMI}(w, c) = \max(0, \text{PMI}(w, c))$$
Positive PMI (PPMI)

PPMI retains only the positive associations from PMI, treating all negative or zero associations as equally uninformative. This produces sparse, non-negative matrices that work well with many machine learning algorithms.

In[15]:
def compute_ppmi_matrix(cooc_matrix):
    """Compute Positive PMI matrix."""
    pmi = compute_pmi_matrix(cooc_matrix)
    ppmi = np.maximum(0, pmi)
    return ppmi

ppmi_matrix = compute_ppmi_matrix(cooc_matrix)
Out[16]:
PPMI matrix properties:
----------------------------------------
  Shape: (43, 43)
  Non-zero entries: 212
  Sparsity: 88.5%
  Max value: 5.43
  Mean (non-zero): 2.29
Out[17]:
Visualization
Heatmap of PMI values ranging from negative (blue) through zero (white) to positive (red).
PMI matrix showing both positive and negative associations. Negative values (blue) indicate word pairs that co-occur less than expected. The wide range of values can cause computational issues.
Heatmap of PPMI values showing only positive associations in warm colors with many zero entries.
PPMI matrix with negative values clipped to zero. Only positive associations remain, producing a sparse matrix that emphasizes meaningful co-occurrences.

On realistic corpora, the PPMI matrix is typically much sparser than the raw co-occurrence matrix, because only word pairs with genuine positive associations retain non-zero values. (In this tiny example the difference is small: almost every observed co-occurrence already exceeds its expected count.) This sparsity is a feature, not a bug: it means we've filtered out the noise of random co-occurrences.

Shifted PPMI Variants

While PPMI works well, researchers have developed variants that address specific issues. The most important is Shifted PPMI, which subtracts a constant before clipping:

$$\text{SPPMI}_k(w, c) = \max(0, \text{PMI}(w, c) - \log_2 k)$$

where $k$ is a shift parameter, typically between 1 and 15.

Why shift? The shift acts as a threshold for what counts as a "meaningful" association. With $k=1$ (no shift), any positive PMI is retained. With $k=5$, only word pairs that co-occur at least 5 times more than expected survive.

Shifted PPMI

Shifted PPMI raises the bar for what counts as a positive association by subtracting $\log_2 k$ before clipping. This filters out weak associations that might be due to noise, keeping only the strongest signals. Higher $k$ values produce sparser, more selective matrices.

In[18]:
def compute_sppmi_matrix(cooc_matrix, k=5):
    """
    Compute Shifted Positive PMI matrix.
    
    SPPMI_k(w, c) = max(0, PMI(w, c) - log2(k))
    """
    pmi = compute_pmi_matrix(cooc_matrix)
    shift = np.log2(k)
    sppmi = np.maximum(0, pmi - shift)
    return sppmi

# Compare different shift values
sppmi_k1 = compute_sppmi_matrix(cooc_matrix, k=1)   # Same as PPMI
sppmi_k2 = compute_sppmi_matrix(cooc_matrix, k=2)
sppmi_k5 = compute_sppmi_matrix(cooc_matrix, k=5)
Out[19]:
Effect of shift parameter k:
--------------------------------------------------
  k=1 (PPMI):   212 non-zero entries
  k=2:          158 non-zero entries
  k=5:          120 non-zero entries

Higher k filters out weaker associations,
keeping only the strongest word relationships.
Out[20]:
Visualization
Two-panel plot showing sparsity percentage increasing and non-zero entry count decreasing as shift parameter k increases from 1 to 10.
Effect of the shift parameter k on SPPMI matrix sparsity. As k increases, the threshold for retaining associations rises, filtering out progressively weaker relationships. The left panel shows how sparsity increases with k, while the right panel shows the corresponding decrease in non-zero entries. Typical values of k range from 1 (standard PPMI) to 15.

The visualization shows the trade-off clearly: higher shift values produce sparser matrices by filtering out weaker associations. A common choice is $k=5$, which corresponds to Word2Vec's default of 5 negative samples.

The connection to Word2Vec is notable: Levy and Goldberg (2014) showed that Word2Vec's skip-gram model with negative sampling implicitly factorizes a shifted PMI matrix with $k$ equal to the number of negative samples. This theoretical connection explains why PMI-based methods and neural embeddings often produce similar results.

PMI vs Raw Counts: A Comparison

Let's directly compare how raw counts and PPMI rank word associations. We'll use a larger corpus to see clearer patterns.

In[21]:
# A larger corpus about technology
tech_corpus = """
Machine learning algorithms process large datasets to find patterns.
Deep learning models use neural networks with many layers.
Neural networks learn representations from training data.
Training data quality affects model performance significantly.
Large language models generate human-like text responses.
Text generation requires understanding context and meaning.
Natural language processing enables computers to understand text.
Computer vision systems recognize objects in images.
Image recognition uses convolutional neural networks effectively.
Convolutional networks excel at processing visual information.
Reinforcement learning agents learn from environmental feedback.
Learning agents optimize their behavior through rewards.
Data scientists analyze complex datasets using statistical methods.
Statistical methods reveal patterns hidden in raw data.
Raw data requires preprocessing before model training begins.
Preprocessing steps include normalization and feature extraction.
Feature extraction transforms raw inputs into useful representations.
Useful representations capture the essential information content.
"""

# Build matrices
tech_cooc, tech_vocab, tech_idx = build_cooccurrence_matrix(tech_corpus, window_size=3)
tech_ppmi = compute_ppmi_matrix(tech_cooc)
In[22]:
def get_top_associations(word, matrix, vocab, word_to_idx, n=5):
    """Get top associated words by matrix values."""
    if word not in word_to_idx:
        return []
    idx = word_to_idx[word]
    scores = [(vocab[i], matrix[idx, i]) for i in range(len(vocab)) 
              if matrix[idx, i] > 0 and vocab[i] != word]
    return sorted(scores, key=lambda x: -x[1])[:n]

query_words = ['learning', 'neural', 'data']
Out[23]:
Comparison: Raw Counts vs PPMI
============================================================

Top associations for 'learning':
--------------------------------------------------
  Raw Counts           | PPMI                
  -------------------- | --------------------
  agents (2)           | machine (3.60)      
  from (2)             | algorithms (2.86)   
  algorithms (1)       | agents (2.60)       
  deep (1)             | deep (2.60)         
  environmental (1)    | environmental (2.60)

Top associations for 'neural':
--------------------------------------------------
  Raw Counts           | PPMI                
  -------------------- | --------------------
  networks (3)         | many (3.89)         
  convolutional (2)    | with (3.89)         
  many (2)             | convolutional (2.89)
  with (2)             | effectively (2.89)  
  effectively (1)      | layers (2.89)       

Top associations for 'data':
--------------------------------------------------
  Raw Counts           | PPMI                
  -------------------- | --------------------
  raw (4)              | quality (3.15)      
  training (4)         | raw (2.56)          
  quality (2)          | training (2.56)     
  requires (2)         | affects (2.15)      
  affects (1)          | analyze (2.15)      

The PPMI rankings are generally more semantically meaningful: for 'learning' and 'data', content words such as 'machine', 'algorithms', and 'quality' rise to the top while ubiquitous words sink. The comparison also hints at a limitation we return to later: for 'neural', the words 'with' and 'many' score highest simply because each appears only once in this corpus, and that single appearance is next to 'neural'. Rare words can earn inflated PMI from very little evidence.

Out[24]:
Visualization
Side-by-side bar charts comparing top word associations from raw counts and PPMI, showing PPMI produces more semantically relevant rankings.
Comparison of word association rankings using raw co-occurrence counts versus PPMI. Raw counts favor globally frequent words, while PPMI re-ranks pairs by how much their co-occurrence exceeds chance. For 'learning', PPMI places 'machine', 'algorithms', and 'deep' among the most strongly associated terms.

PMI Matrix Properties

PPMI matrices have several useful properties that make them well-suited for downstream NLP tasks.

Sparsity

PPMI matrices are highly sparse because most word pairs don't have positive associations. This sparsity enables efficient storage and computation.

In[25]:
def analyze_matrix_properties(raw_matrix, ppmi_matrix, name=""):
    """Compare properties of raw and PPMI matrices."""
    raw_nonzero = np.count_nonzero(raw_matrix)
    ppmi_nonzero = np.count_nonzero(ppmi_matrix)
    
    raw_sparsity = 1 - raw_nonzero / raw_matrix.size
    ppmi_sparsity = 1 - ppmi_nonzero / ppmi_matrix.size
    
    return {
        'raw_nonzero': raw_nonzero,
        'ppmi_nonzero': ppmi_nonzero,
        'raw_sparsity': raw_sparsity,
        'ppmi_sparsity': ppmi_sparsity,
        'raw_mean': raw_matrix[raw_matrix > 0].mean() if raw_nonzero > 0 else 0,
        'ppmi_mean': ppmi_matrix[ppmi_matrix > 0].mean() if ppmi_nonzero > 0 else 0,
    }

props = analyze_matrix_properties(tech_cooc, tech_ppmi)
Out[26]:
Matrix Property Comparison:
--------------------------------------------------
  Property                    Raw Counts         PPMI
  ------------------------- ------------ ------------
  Non-zero entries                   695          695
  Sparsity                        92.0%       92.0%
  Mean (non-zero)                   1.15         3.37
Out[27]:
Visualization
Side-by-side binary matrix visualizations showing the non-zero patterns of the raw co-occurrence and PPMI matrices.
Sparsity patterns in the raw co-occurrence and PPMI matrices. Dark pixels mark non-zero entries. Both matrices are already highly sparse because most word pairs never co-occur. In this small corpus the two patterns are nearly identical, since almost every observed co-occurrence exceeds its expected count; on larger corpora, PPMI additionally zeroes out the many pairs that co-occur no more than chance predicts, making the matrix noticeably sparser.

The table shows that, for this corpus, PPMI removes no entries at all: every pair that co-occurs does so more often than expected, so clipping at zero changes nothing. What changes is the weighting: each raw count is replaced by a measure of how surprising the co-occurrence is, which is why the mean non-zero value jumps from 1.15 to 3.37. On realistic corpora, where frequent function words co-occur with nearly everything at or below chance levels, PPMI also removes a large fraction of entries and the filtering effect becomes visually obvious.

Symmetry

For symmetric context windows (looking the same distance left and right), the co-occurrence matrix is symmetric, and so is the PPMI matrix: $\text{PPMI}(w, c) = \text{PPMI}(c, w)$.

Out[28]:
PPMI matrix is symmetric: True

This symmetry means the association between words is bidirectional: if "neural" has high PPMI with "networks," then "networks" has equally high PPMI with "neural."
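The check that produced the output above is not shown, but it presumably amounts to something like this one-liner (a sketch assuming the tech_ppmi matrix built earlier):

# A symmetric window means cooc[i, j] == cooc[j, i], and PPMI preserves that
print("PPMI matrix is symmetric:", np.allclose(tech_ppmi, tech_ppmi.T))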

Row Vectors as Word Representations

Each row of the PPMI matrix can serve as a word vector. Words with similar PPMI profiles (similar rows) tend to have similar meanings.

In[29]:
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_words_ppmi(word, ppmi_matrix, vocab, word_to_idx, n=5):
    """Find similar words using PPMI vectors."""
    if word not in word_to_idx:
        return []
    
    idx = word_to_idx[word]
    word_vec = ppmi_matrix[idx].reshape(1, -1)
    
    # Compute cosine similarity with all words
    similarities = cosine_similarity(word_vec, ppmi_matrix)[0]
    
    # Get top similar (excluding the word itself)
    results = []
    for i in np.argsort(similarities)[::-1]:
        if vocab[i] != word:
            results.append((vocab[i], similarities[i]))
            if len(results) >= n:
                break
    
    return results
Out[30]:
Word similarity using PPMI vectors:
--------------------------------------------------

'neural':
    networks       : 0.806
    layers         : 0.598
    with           : 0.558
    many           : 0.484

'data':
    training       : 0.529
    scientists     : 0.487
    model          : 0.473
    rewards        : 0.468

'learning':
    agents         : 0.566
    feedback       : 0.476
    learn          : 0.454
    environmental  : 0.416

The similarity scores capture semantic relationships learned purely from co-occurrence patterns. Words appearing in similar contexts cluster together in the PPMI vector space, enabling applications like finding related terms or detecting semantic categories.

Out[31]:
Visualization
Scatter plot showing words positioned in 2D space based on their PPMI vectors, with semantically related words clustering together.
2D visualization of PPMI word vectors using PCA. Words that share similar contexts cluster together. Technical terms like 'neural', 'networks', and 'learning' form a cluster (blue), while data-related terms like 'data', 'datasets', and 'raw' form another (green). This clustering emerges purely from co-occurrence statistics, demonstrating how distributional information captures semantic relationships.

The 2D projection reveals the semantic structure captured by PPMI vectors. Even with our small corpus, related words cluster together: machine learning terms form one group, data-related terms another. This is the distributional hypothesis in action. Words with similar meanings appear in similar contexts, leading to similar PPMI vectors.

Collocation Extraction with PMI

One of PMI's most practical applications is identifying collocations: word combinations that occur together more than chance would predict. Collocations include compound nouns ("ice cream"), phrasal verbs ("give up"), and idiomatic expressions ("kick the bucket").

Collocations

Collocations are word combinations whose meaning or frequency cannot be predicted from the individual words alone. PMI helps identify these by measuring which word pairs co-occur significantly more than their individual frequencies would suggest.

In[32]:
def extract_collocations(corpus, min_count=2, top_n=20):
    """
    Extract collocations using PMI.
    
    Returns word pairs ranked by PMI score.
    """
    # Get bigram counts
    words = corpus.lower().replace('.', ' . ').split()
    bigrams = [(words[i], words[i+1]) for i in range(len(words)-1) 
               if words[i] != '.' and words[i+1] != '.']
    
    bigram_counts = Counter(bigrams)
    word_counts = Counter(words)
    total_bigrams = len(bigrams)
    
    # Compute PMI for each bigram
    collocations = []
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue
        
        # P(w1, w2) estimated from bigram frequency
        p_joint = count / total_bigrams
        
        # P(w1) and P(w2) from unigram frequencies
        p_w1 = word_counts[w1] / sum(word_counts.values())
        p_w2 = word_counts[w2] / sum(word_counts.values())
        
        # PMI
        pmi = np.log2(p_joint / (p_w1 * p_w2))
        
        if pmi > 0:
            collocations.append((w1, w2, pmi, count))
    
    # Sort by PMI
    collocations.sort(key=lambda x: -x[2])
    return collocations[:top_n]

# Extract collocations from tech corpus
collocations = extract_collocations(tech_corpus, min_count=1, top_n=15)
Out[33]:
Top Collocations by PMI:
-------------------------------------------------------
  Bigram                         PMI    Count
  ------------------------- -------- --------
  algorithms process            7.64        1
  with many                     7.64        1
  many layers                   7.64        1
  quality affects               7.64        1
  performance significantly     7.64        1
  generate human-like           7.64        1
  understanding context         7.64        1
  enables computers             7.64        1
  computer vision               7.64        1
  vision systems                7.64        1
  systems recognize             7.64        1
  recognize objects             7.64        1
  image recognition             7.64        1
  recognition uses              7.64        1
  excel at                      7.64        1

These results expose a quirk of running the extractor with min_count=1 on such a small corpus: nearly every bigram occurs exactly once, and any pair of words that appear nowhere else receives the same maximal PMI of about 7.6, meaning they co-occur roughly 200 times more often than their unigram frequencies would predict ($2^{7.64} \approx 200$). Genuine multi-word expressions such as "computer vision" and "image recognition" do score highly, but so do incidental sequences like "with many" and "excel at". This is both the strength and the weakness of PMI for collocation extraction: it reliably surfaces pairs that co-occur far more than chance, but it cannot tell a real collocation from a one-off pairing of rare words. In practice, a minimum count threshold of 2 or more (or simply a larger corpus) filters out the spurious candidates and leaves recurring expressions such as "neural networks" near the top.

Out[34]:
Visualization
Horizontal bar chart showing PMI scores for extracted bigrams.
PMI scores for bigrams extracted from the technology corpus. Higher PMI indicates a stronger association between the two words. Because no minimum count threshold was applied here, many bigrams that occur only once tie at the same maximal score, which is why a count cutoff is normally used.

Implementation: Building a Complete PPMI Pipeline

Let's put everything together into a complete, reusable implementation for computing PPMI matrices from text.

In[35]:
class PPMIVectorizer:
    """
    Transform text into PPMI word vectors.
    
    This class provides a complete pipeline for:
    1. Building vocabulary from corpus
    2. Constructing co-occurrence matrix
    3. Computing PPMI transformation
    4. Finding similar words
    """
    
    def __init__(self, window_size=2, min_count=2, shift_k=1):
        """
        Initialize the vectorizer.
        
        Args:
            window_size: Context window size on each side
            min_count: Minimum word frequency to include in vocabulary
            shift_k: Shift parameter for SPPMI (k=1 gives standard PPMI)
        """
        self.window_size = window_size
        self.min_count = min_count
        self.shift_k = shift_k
        self.vocab = None
        self.word_to_idx = None
        self.ppmi_matrix = None
        
    def fit(self, corpus):
        """
        Build vocabulary and compute PPMI matrix from corpus.
        
        Args:
            corpus: String containing the text corpus
        """
        # Tokenize
        words = corpus.lower().replace('.', '').replace(',', '').split()
        
        # Build vocabulary
        word_counts = Counter(words)
        self.vocab = sorted([w for w, c in word_counts.items() if c >= self.min_count])
        self.word_to_idx = {w: i for i, w in enumerate(self.vocab)}
        
        # Build co-occurrence matrix
        n = len(self.vocab)
        cooc = np.zeros((n, n))
        
        for i, word in enumerate(words):
            if word not in self.word_to_idx:
                continue
            word_idx = self.word_to_idx[word]
            
            start = max(0, i - self.window_size)
            end = min(len(words), i + self.window_size + 1)
            
            for j in range(start, end):
                if j != i and words[j] in self.word_to_idx:
                    context_idx = self.word_to_idx[words[j]]
                    cooc[word_idx, context_idx] += 1
        
        # Compute PPMI
        N = cooc.sum()
        if N == 0:
            self.ppmi_matrix = cooc
            return self
            
        row_sums = cooc.sum(axis=1, keepdims=True)
        col_sums = cooc.sum(axis=0, keepdims=True)
        
        # Avoid division by zero
        row_sums = np.maximum(row_sums, 1e-10)
        col_sums = np.maximum(col_sums, 1e-10)
        
        expected = (row_sums @ col_sums) / N
        
        with np.errstate(divide='ignore', invalid='ignore'):
            pmi = np.log2(cooc / expected)
            pmi[~np.isfinite(pmi)] = 0
        
        # Apply shift and clip
        shift = np.log2(self.shift_k) if self.shift_k > 1 else 0
        self.ppmi_matrix = np.maximum(0, pmi - shift)
        
        return self
    
    def get_vector(self, word):
        """Get PPMI vector for a word."""
        if word not in self.word_to_idx:
            return None
        return self.ppmi_matrix[self.word_to_idx[word]]
    
    def most_similar(self, word, n=5):
        """Find most similar words by cosine similarity."""
        vec = self.get_vector(word)
        if vec is None:
            return []
        
        vec = vec.reshape(1, -1)
        sims = cosine_similarity(vec, self.ppmi_matrix)[0]
        
        results = []
        for i in np.argsort(sims)[::-1]:
            if self.vocab[i] != word:
                results.append((self.vocab[i], sims[i]))
                if len(results) >= n:
                    break
        return results
In[36]:
# Example usage
vectorizer = PPMIVectorizer(window_size=3, min_count=2, shift_k=1)
vectorizer.fit(tech_corpus)
Out[37]:
PPMIVectorizer Results:
--------------------------------------------------
Vocabulary size: 30
Matrix shape: (30, 30)

Similar to 'learning':
    learn          : 0.555
    agents         : 0.548
    from           : 0.390
    information    : 0.318

Similar to 'networks':
    neural         : 0.720
    convolutional  : 0.526
    learn          : 0.348
    language       : 0.334

Similar to 'data':
    preprocessing  : 0.767
    raw            : 0.681
    training       : 0.540
    model          : 0.536

The vectorizer successfully builds PPMI word vectors and finds semantically related words. The similarity scores reflect how often words share the same contexts in the corpus. This encapsulated implementation can be reused across different corpora and easily integrated into larger NLP pipelines.

Limitations and When to Use PMI

While PMI is powerful, it has limitations you should understand.

Sensitivity to Low Counts

PMI estimates are unreliable for rare word pairs. A word pair that co-occurs once when we expected 0.5 co-occurrences gets PMI of 1, but this could easily be noise. The standard solution is to require minimum co-occurrence counts before computing PMI.

In[38]:
def compute_reliable_pmi(cooc_matrix, min_count=5):
    """Compute PMI only for word pairs with sufficient counts."""
    # Mask out low-count pairs
    mask = cooc_matrix >= min_count
    
    # Compute PMI
    N = cooc_matrix.sum()
    row_sums = cooc_matrix.sum(axis=1, keepdims=True)
    col_sums = cooc_matrix.sum(axis=0, keepdims=True)
    
    row_sums = np.maximum(row_sums, 1e-10)
    col_sums = np.maximum(col_sums, 1e-10)
    
    expected = (row_sums @ col_sums) / N
    
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log2(cooc_matrix / expected)
        pmi[~np.isfinite(pmi)] = 0
    
    # Apply mask
    pmi[~mask] = 0
    
    return np.maximum(0, pmi)

# Compare standard PPMI vs reliable PPMI with minimum count threshold
standard_ppmi = compute_ppmi_matrix(tech_cooc)
reliable_ppmi = compute_reliable_pmi(tech_cooc, min_count=2)
Out[39]:
Effect of minimum count threshold:
--------------------------------------------------
  Standard PPMI non-zero entries: 695
  Reliable PPMI (min_count=2):    77
  Entries filtered out:           618

Requiring a minimum co-occurrence count filters out potentially spurious associations that could arise from sampling noise, producing more reliable PMI estimates at the cost of discarding some valid but rare associations.

Bias Toward Rare Words

PMI tends to give high scores to rare word pairs. If a rare word appears only in specific contexts, it gets high PMI with those contexts even if the association is coincidental. Shifted PPMI helps by raising the threshold for positive associations.
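The bias is easy to reproduce with made-up counts (the numbers below are illustrative, not taken from the corpora above): a single, possibly coincidental co-occurrence of a rare word can outscore fifty co-occurrences of a frequent word.

import numpy as np

N = 10_000  # total co-occurrences in some hypothetical matrix

def pmi(count_wc, row_sum, col_sum, n=N):
    return np.log2(count_wc * n / (row_sum * col_sum))

# A rare word (total count 1) whose single co-occurrence happens to land
# on a context seen 50 times: one possibly coincidental observation
print("rare pair:    ", round(pmi(count_wc=1, row_sum=1, col_sum=50), 2))     # ~7.64

# A frequent word (total count 500) that co-occurs 50 times with that
# same context: a well-attested association, yet a lower score
print("frequent pair:", round(pmi(count_wc=50, row_sum=500, col_sum=50), 2))  # ~4.32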

Out[40]:
Visualization
Scatter plot showing word frequency on x-axis (log scale) versus maximum PMI value on y-axis, demonstrating that rare words tend to have higher PMI scores.
Relationship between word frequency and maximum PMI score. Rare words (left side) tend to achieve higher maximum PMI values because their limited contexts create artificially strong associations. Frequent words (right side) have more stable, moderate PMI values. This bias toward rare words is a known limitation of PMI that minimum count thresholds and shifted PPMI help address.

The downward trend confirms the rare word bias: words with fewer total co-occurrences tend to achieve higher maximum PMI scores. This happens because rare words have limited contexts, making each co-occurrence count proportionally more.

Computational Cost

For large vocabularies, PMI matrices become enormous. A 100,000-word vocabulary produces a 10-billion-cell matrix. While the matrix is sparse after PPMI transformation, the intermediate computations can be expensive. Practical implementations use sparse matrix formats and streaming algorithms.
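A sketch of the storage idea using scipy.sparse follows; it is an illustration with assumed sizes, not code from the chapter's pipeline.

import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for a PPMI matrix: mostly zeros after clipping
dense_ppmi = np.maximum(0, np.random.randn(1000, 1000) - 2)

# Compressed sparse row format stores only non-zero entries plus index arrays
sparse_ppmi = csr_matrix(dense_ppmi)

dense_mb = dense_ppmi.nbytes / 1e6
sparse_mb = (sparse_ppmi.data.nbytes + sparse_ppmi.indices.nbytes
             + sparse_ppmi.indptr.nbytes) / 1e6
print(f"dense: {dense_mb:.1f} MB, sparse: {sparse_mb:.1f} MB")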

When to Use PMI

PMI and PPMI are excellent choices when:

  • You need interpretable association scores between words
  • You're extracting collocations or multi-word expressions
  • You want sparse, high-dimensional word vectors as input to other algorithms
  • You need a baseline to compare against neural embeddings
  • Computational resources are limited (no GPU required)

Neural methods like Word2Vec often outperform PPMI for downstream tasks, but the difference is smaller than you might expect. For many applications, PPMI provides a strong, interpretable baseline.

Summary

Pointwise Mutual Information transforms raw co-occurrence counts into meaningful association scores by comparing observed co-occurrence to what we'd expect under independence.

Key concepts:

  • PMI formula: $\text{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w) \cdot P(c)}$ measures how much more (or less) two words co-occur than chance would predict

  • Positive PMI (PPMI): Clips negative values to zero, keeping only positive associations. This produces sparse matrices that work well with machine learning algorithms.

  • Shifted PPMI: Subtracts $\log_2 k$ before clipping, filtering out weak associations. Connected theoretically to Word2Vec's negative sampling.

  • PMI interpretation: Positive PMI means words co-occur more than expected (strong association). Zero means independence. Negative means avoidance.

Practical applications:

  • Collocation extraction: Finding meaningful multi-word expressions
  • Word similarity: Using PPMI vectors with cosine similarity
  • Feature weighting: PPMI as a preprocessing step before dimensionality reduction

Key parameters:

  Parameter     Typical Values   Effect
  -----------   --------------   -------------------------------------------------------------
  window_size   2-5              Larger windows capture broader context but may introduce noise
  min_count     2-10             Higher values filter unreliable associations from rare words
  shift_k       1-15             Higher values keep only the strongest associations

The next chapter shows how to reduce the dimensionality of PPMI matrices using Singular Value Decomposition, producing dense vectors that capture the essential structure in fewer dimensions.
