Learn how Pointwise Mutual Information (PMI) transforms raw co-occurrence counts into meaningful word association scores by comparing observed frequencies to expected frequencies under independence.

This article is part of the free-to-read Language AI Handbook
Pointwise Mutual Information
Raw co-occurrence counts tell us how often words appear together, but they have a fundamental problem: frequent words dominate everything. The word "the" co-occurs with nearly every other word, not because it has a special relationship with them, but simply because it appears everywhere. How do we separate meaningful associations from mere frequency effects?
Pointwise Mutual Information (PMI) solves this problem by asking a different question: does this word pair co-occur more than we'd expect by chance? Instead of counting raw co-occurrences, PMI measures the surprise of seeing two words together. When "New" and "York" appear together far more often than their individual frequencies would predict, PMI captures that strong association. When "the" and "dog" appear together only as often as expected, PMI correctly identifies this as an unremarkable pairing.
This chapter derives the PMI formula from probability theory, shows how it transforms co-occurrence matrices into more meaningful representations, and demonstrates its practical applications from collocation extraction to improved word vectors.
The Problem with Raw Counts
Let's start by understanding why raw co-occurrence counts are problematic. Consider a corpus about animals and food:
Word frequencies in corpus:
----------------------------------------
'the'  : 16 occurrences (24.2%)
'cat'  :  4 occurrences (6.1%)
'dog'  :  3 occurrences (4.5%)
'new'  :  3 occurrences (4.5%)
'food' :  2 occurrences (3.0%)
'sat'  :  1 occurrences (1.5%)
'on'   :  1 occurrences (1.5%)
'mat'  :  1 occurrences (1.5%)
'ran'  :  1 occurrences (1.5%)
'in'   :  1 occurrences (1.5%)
The word "the" appears 18 times, dwarfing content words like "cat" (5 times) or "new" (3 times). In a raw co-occurrence matrix, "the" will have high counts with almost everything, obscuring the meaningful associations we care about.
Raw co-occurrence counts:
----------------------------------------
'the' co-occurs with 62 total words
'cat' co-occurs with 15 total words
'new' co-occurs with 12 total words
Top co-occurrences for 'cat':
the: 5
a: 1
all: 1
at: 1
chased: 1
The raw counts show "the" as the top co-occurrence for "cat," but this tells us nothing about cats. The word "the" appears near everything. We need a measure that accounts for how often we'd expect words to co-occur given their individual frequencies.
The PMI Formula
The problem we identified, that frequent words dominate co-occurrence counts, stems from a fundamental issue: raw counts conflate two distinct phenomena. When "the" appears near "cat" 10 times, is that because "the" and "cat" have a special relationship, or simply because "the" appears near everything 10 times? To separate genuine associations from frequency effects, we need to ask a different question: how does the observed co-occurrence compare to what we'd expect if the words were unrelated?
Pointwise Mutual Information answers this question directly. Instead of asking "how often do these words appear together?" PMI asks "do these words appear together more or less than chance would predict?"
The Intuition: Observed vs Expected
Consider two words, $w$ and $c$. If they have no special relationship, their co-occurrence should follow a simple pattern: the probability of seeing them together should equal the product of their individual probabilities. This is the definition of statistical independence.
For example, if "cat" appears in 5% of contexts and "mouse" appears in 2% of contexts, and they're independent, we'd expect "cat" and "mouse" to co-occur in roughly of word pairs. If they actually co-occur in 1% of pairs, that's 10 times more than expected. This excess reveals a genuine association.
PMI formalizes this comparison by taking the ratio of observed to expected co-occurrence:

$$\text{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}$$
PMI measures the association between two words by comparing their joint probability of co-occurrence to what we'd expect if they were independent. High PMI indicates the words are strongly associated; low or negative PMI indicates they co-occur less than expected by chance.
Let's unpack each component of this formula:
- Numerator $P(w, c)$: The joint probability, measuring how often words $w$ and $c$ actually co-occur in the corpus.
- Denominator $P(w)\,P(c)$: The expected probability under independence. If the words were unrelated, this is how often they would co-occur by chance.
- The ratio: When observed equals expected, the ratio is 1. When observed exceeds expected, the ratio is greater than 1. When observed falls short of expected, the ratio is less than 1.
- The logarithm: Taking $\log_2$ converts multiplicative relationships to additive ones. A ratio of 2 (twice as often as expected) becomes a PMI of 1. A ratio of 4 becomes a PMI of 2. This logarithmic scale makes PMI values easier to interpret and compare.
Why the Logarithm?
The logarithm serves three purposes. First, it converts ratios to differences: PMI of 2 means "4 times more than expected," while PMI of 1 means "2 times more." The additive scale is more intuitive.
Second, it creates symmetry around zero. Positive PMI indicates attraction (words co-occur more than expected), negative PMI indicates repulsion (words co-occur less than expected), and zero indicates independence.
Third, it connects to information theory. PMI measures how much information observing one word provides about the other. When two words are strongly associated, seeing one tells you a lot about whether you'll see the other.
From Probabilities to Counts
In practice, we don't know the true probabilities. We estimate them from corpus counts. Let $\#(w, c)$ denote how many times words $w$ and $c$ co-occur, and let $N$ be the total number of word-context pairs in our co-occurrence matrix.
The probability estimates follow naturally:

$$\hat{P}(w, c) = \frac{\#(w, c)}{N}$$

This is the fraction of all word pairs that are specifically the $(w, c)$ pair.

$$\hat{P}(w) = \frac{\sum_{c'} \#(w, c')}{N} = \frac{\#(w, *)}{N}$$

This is the fraction of all pairs where the target word is $w$. The notation $\sum_{c'} \#(w, c')$ means "sum over all context words," which is simply the row sum for word $w$ in the co-occurrence matrix.

$$\hat{P}(c) = \frac{\#(*, c)}{N}$$

Similarly, this is the column sum for context word $c$, representing how often $c$ appears as a context for any word.
The Practical Formula
Substituting these estimates into the PMI formula:

$$\text{PMI}(w, c) = \log_2 \frac{\#(w, c)/N}{\big(\#(w, *)/N\big)\,\big(\#(*, c)/N\big)}$$
Notice that $N$ appears in both numerator and denominator. With some algebra, we can simplify:

$$\text{PMI}(w, c) = \log_2 \frac{\#(w, c) \cdot N}{\#(w, *)\,\#(*, c)}$$
This is the formula we implement: multiply the co-occurrence count by the total, divide by the product of marginals, and take the logarithm. Each term has a clear interpretation:
- $\#(w, c)$: How often $w$ and $c$ actually co-occur
- $N$: Total co-occurrences in the matrix
- $\#(w, *)$: How often $w$ appears with any context (row sum)
- $\#(*, c)$: How often $c$ appears as context for any word (column sum)
Implementing PMI
Let's translate this formula into code. The implementation follows the mathematical derivation closely, computing each component step by step.
The key insight in this implementation is computing the expected counts efficiently. Rather than looping through all word pairs, we use matrix multiplication: row_sums @ col_sums produces an outer product where each cell contains $\#(w, *)\,\#(*, c)$. Dividing by $N$ gives us the expected count for each pair.
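The chapter's original listing isn't reproduced here, so the following is a minimal NumPy sketch of that computation; the function name and the toy matrix are illustrative assumptions, not the handbook's code:

```python
import numpy as np

def pmi_matrix(C: np.ndarray) -> np.ndarray:
    """Compute PMI(w, c) = log2( #(w,c) * N / (#(w,*) * #(*,c)) ) for a
    dense co-occurrence matrix C. Pairs that never co-occur get -inf."""
    C = C.astype(float)
    N = C.sum()                               # total co-occurrence mass
    row_sums = C.sum(axis=1, keepdims=True)   # #(w, *), shape (V, 1)
    col_sums = C.sum(axis=0, keepdims=True)   # #(*, c), shape (1, V)

    # Expected counts under independence for all pairs at once:
    # the outer product row_sums @ col_sums, divided by N.
    expected = (row_sums @ col_sums) / N

    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(C / expected)           # log2(0 / x) -> -inf, silenced
    return pmi


if __name__ == "__main__":
    # Tiny illustrative matrix; rows/columns stand for ["cat", "dog", "the"].
    C = np.array([[0, 2, 5],
                  [2, 0, 4],
                  [5, 4, 0]])
    print(np.round(pmi_matrix(C), 2))
```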

This scatter plot visualizes the fundamental idea behind PMI: comparing what we observe to what we'd expect. Word pairs above the diagonal have positive PMI (they co-occur more than chance predicts), while pairs below have negative PMI.
PMI values for 'cat':
----------------------------------------
a : PMI = +2.10
all : PMI = +2.10
at : PMI = +2.10
chased : PMI = +2.10
on : PMI = +2.10
park : PMI = +2.10
sat : PMI = +2.10
sleeps : PMI = +2.10
The transformation is dramatic. Raw counts ranked "the" as the top co-occurrence for "cat," but PMI reveals a different picture. Words that specifically associate with "cat," rather than appearing everywhere, now rise to the top. The word "the" has low or negative PMI because its high frequency means it co-occurs with "cat" about as often as we'd expect by chance.
Interpreting PMI as Association Strength
The logarithmic scale of PMI creates a natural interpretation centered on zero. Because we're measuring $\log_2(\text{observed}/\text{expected})$, the value tells us directly how the actual co-occurrence compares to the baseline of independence (the short snippet after this list converts PMI values back into plain ratios):
- PMI > 0: The words co-occur more than expected. A PMI of 1 means twice as often as chance; PMI of 2 means four times as often; PMI of 3 means eight times. The higher the value, the stronger the positive association.
- PMI = 0: The words co-occur exactly as expected under independence. There's no special relationship, either attractive or repulsive.
- PMI < 0: The words co-occur less than expected. They tend to avoid each other. A PMI of -1 means half as often as chance would predict.

A Worked Example: Computing PMI Step by Step
To solidify understanding, let's walk through a complete PMI calculation by hand. We'll compute the PMI between "cat" and "mouse," two words we'd expect to have a strong association.
The calculation proceeds in three stages:
- Gather the raw counts from our co-occurrence matrix
- Compute the expected count under the assumption of independence
- Calculate PMI as the log ratio of observed to expected
PMI Calculation: 'cat' and 'mouse'
==================================================
Step 1: Gather counts from the co-occurrence matrix
#(cat, mouse) = 0 (observed co-occurrences)
#(cat, *) = 15 (total contexts for 'cat')
#(*, mouse) = 4 (total contexts for 'mouse')
N = 258 (total word pairs)
Step 2: Compute expected count under independence
If 'cat' and 'mouse' were unrelated, we'd expect:
Expected = (#(cat,*) × #(*,mouse)) / N
= (15 × 4) / 258
= 0.23
Step 3: Compute PMI as log ratio
PMI = log₂(observed / expected)
= log₂(0 / 0.23)
= log₂(0)
= -∞   (undefined: the pair never co-occurs in this corpus)
In this toy corpus, "cat" and "mouse" happen never to co-occur, so the ratio is zero and the PMI is negative infinity rather than the positive score our linguistic intuition expects. In a realistic corpus the two words appear together, in predator-prey contexts, children's stories, and idiomatic expressions, far more often than their individual frequencies would suggest, and the same calculation yields a strongly positive PMI. The mechanics are identical either way, and the example previews a practical problem, zero counts blowing up to negative infinity, that the next section addresses.
Even so, PMI is asking the right question. Raw counts would have told us that "the" co-occurs with "cat" more often than "mouse" does, simply because "the" is everywhere. PMI cuts through this noise by asking not "how often?" but "how much more than expected?"
The Problem with Negative PMI
While PMI can be negative (indicating words that avoid each other), negative values cause practical problems. Most word pairs never co-occur at all, giving them a PMI of negative infinity, and pairs that co-occur far less often than expected receive large negative values.
PMI value distribution:
----------------------------------------
Total non-zero entries: 215
Negative PMI values: 3 (1.4%)
Positive PMI values: 212 (98.6%)
Min PMI:  -1.31
Max PMI:   5.43
Mean PMI:  2.25

The histogram shows the distribution of defined PMI values. In this corpus most are positive, a few word pairs have negative values (they co-occur less than expected), and the vast majority of word pairs never co-occur at all, leaving their PMI undefined. These issues, detailed below, motivate the PPMI transformation.
Negative PMI values are problematic for several reasons:
- Unreliable estimates: Low co-occurrence counts produce noisy PMI values. A word pair that co-occurs once when we expected two has PMI of -1, but this could easily be sampling noise.
- Asymmetric information: Knowing that words don't co-occur is less informative than knowing they do. The absence of co-occurrence could mean many things.
- Computational issues: Large negative values dominate distance calculations and can destabilize downstream algorithms.
Positive PMI (PPMI)
The standard solution is Positive PMI (PPMI), which simply clips negative values to zero:

$$\text{PPMI}(w, c) = \max\big(\text{PMI}(w, c),\, 0\big)$$
PPMI retains only the positive associations from PMI, treating all negative or zero associations as equally uninformative. This produces sparse, non-negative matrices that work well with many machine learning algorithms.
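In code, the clip is a single element-wise maximum on top of the PMI computation. Here is a small self-contained sketch, mirroring the earlier PMI sketch (the helper name is an assumption):

```python
import numpy as np

def ppmi_matrix(C: np.ndarray) -> np.ndarray:
    """PPMI(w, c) = max(PMI(w, c), 0) for a dense co-occurrence matrix C."""
    C = C.astype(float)
    N = C.sum()
    expected = (C.sum(axis=1, keepdims=True) @ C.sum(axis=0, keepdims=True)) / N

    with np.errstate(divide="ignore", invalid="ignore"):
        # Pairs that never co-occur are treated as -inf PMI ...
        pmi = np.where(C > 0, np.log2(C / expected), -np.inf)

    # ... and the clip maps them, along with all negative PMIs, to zero.
    return np.maximum(pmi, 0.0)
```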
PPMI matrix properties:
----------------------------------------
Shape: (43, 43)
Non-zero entries: 212
Sparsity: 88.5%
Max value: 5.43
Mean (non-zero): 2.29


After the PPMI transformation, only word pairs with genuine positive associations retain non-zero values; pairs that co-occur no more than chance predicts, or never co-occur at all, are zeroed out. This sparsity is a feature, not a bug: it means we've filtered out the noise of random co-occurrences.
Shifted PPMI Variants
While PPMI works well, researchers have developed variants that address specific issues. The most important is Shifted PPMI, which subtracts a constant before clipping:

$$\text{SPPMI}_k(w, c) = \max\big(\text{PMI}(w, c) - \log_2 k,\, 0\big)$$

where $k$ is a shift parameter, typically between 1 and 15.
Why shift? The shift acts as a threshold for what counts as a "meaningful" association. With $k = 1$ (no shift, since $\log_2 1 = 0$), any positive PMI is retained. With $k = 5$, only word pairs that co-occur at least 5 times more than expected survive.
Shifted PPMI raises the bar for what counts as a positive association by subtracting $\log_2 k$ before clipping. This filters out weak associations that might be due to noise, keeping only the strongest signals. Higher values of $k$ produce sparser, more selective matrices.
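A sketch of the shift, applied to an already-computed PMI array (the toy values are made up purely to show the effect of $k$):

```python
import numpy as np

def shifted_ppmi(pmi: np.ndarray, k: float = 5.0) -> np.ndarray:
    """SPPMI_k(w, c) = max(PMI(w, c) - log2(k), 0); k = 1 gives plain PPMI."""
    return np.maximum(pmi - np.log2(k), 0.0)


if __name__ == "__main__":
    pmi = np.array([[3.2, 1.0],
                    [0.5, -0.7]])          # toy PMI values
    for k in (1, 2, 5):
        survivors = np.count_nonzero(shifted_ppmi(pmi, k))
        print(f"k={k}: {survivors} non-zero entries")
```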
Effect of shift parameter k:
--------------------------------------------------
k=1 (PPMI): 212 non-zero entries
k=2:        158 non-zero entries
k=5:        120 non-zero entries
Higher k filters out weaker associations, keeping only the strongest word relationships.

The visualization shows the trade-off clearly: higher shift values produce sparser matrices by filtering out weaker associations. A common choice is $k = 5$, which corresponds to Word2Vec's default of 5 negative samples.
The connection to Word2Vec is notable: Levy and Goldberg (2014) showed that Word2Vec's skip-gram model with negative sampling implicitly factorizes a shifted PMI matrix with $k$ equal to the number of negative samples. This theoretical connection explains why PMI-based methods and neural embeddings often produce similar results.
PMI vs Raw Counts: A Comparison
Let's directly compare how raw counts and PPMI rank word associations. We'll use a larger corpus to see clearer patterns.
Comparison: Raw Counts vs PPMI
============================================================
Top associations for 'learning':
--------------------------------------------------
Raw Counts           | PPMI
-------------------- | --------------------
agents (2)           | machine (3.60)
from (2)             | algorithms (2.86)
algorithms (1)       | agents (2.60)
deep (1)             | deep (2.60)
environmental (1)    | environmental (2.60)

Top associations for 'neural':
--------------------------------------------------
Raw Counts           | PPMI
-------------------- | --------------------
networks (3)         | many (3.89)
convolutional (2)    | with (3.89)
many (2)             | convolutional (2.89)
with (2)             | effectively (2.89)
effectively (1)      | layers (2.89)

Top associations for 'data':
--------------------------------------------------
Raw Counts           | PPMI
-------------------- | --------------------
raw (4)              | quality (3.15)
training (4)         | raw (2.56)
quality (2)          | training (2.56)
requires (2)         | affects (2.15)
affects (1)          | analyze (2.15)
The PPMI rankings are generally more semantically meaningful: raw counts simply reflect frequency, while PPMI promotes words with genuine topical associations, such as "machine" for "learning" and "quality" for "data." With a corpus this small, a few frequent words like "with" and "many" still score highly for "neural," an artifact of the tiny sample.

PMI Matrix Properties
PPMI matrices have several useful properties that make them well-suited for downstream NLP tasks.
Sparsity
PPMI matrices are highly sparse because most word pairs don't have positive associations. This sparsity enables efficient storage and computation.
Matrix Property Comparison:
--------------------------------------------------
Property                   Raw Counts    PPMI
-------------------------  ------------  ------------
Non-zero entries           695           695
Sparsity                   92.0%         92.0%
Mean (non-zero)            1.15          3.37

The sparsity visualization makes the filtering effect of PPMI apparent. The raw matrix has entries wherever words co-occur at all, while the PPMI matrix retains only the genuinely positive associations. In this small corpus nearly every co-occurring pair is already positive, so the two matrices share the same sparsity, but the entries that survive are exactly the pairs that capture meaningful relationships.
Symmetry
For symmetric context windows (looking the same distance left and right), the co-occurrence matrix is symmetric, and so is the PPMI matrix: $\text{PPMI}(w, c) = \text{PPMI}(c, w)$.
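A quick way to verify the property, assuming a square PPMI array built with a symmetric window (the small array here is illustrative):

```python
import numpy as np

# A toy PPMI matrix built with a symmetric context window.
ppmi = np.array([[0.0, 1.2, 0.4],
                 [1.2, 0.0, 0.0],
                 [0.4, 0.0, 0.0]])

print("PPMI matrix is symmetric:", np.allclose(ppmi, ppmi.T))
```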
PPMI matrix is symmetric: True
This symmetry means the association between words is bidirectional: if "neural" has high PPMI with "networks," then "networks" has equally high PPMI with "neural."
Row Vectors as Word Representations
Each row of the PPMI matrix can serve as a word vector. Words with similar PPMI profiles (similar rows) tend to have similar meanings.
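A minimal sketch of that idea is to rank words by the cosine similarity of their PPMI rows. The helper below and its vocabulary argument are illustrative assumptions; the listing that follows shows the chapter's own results.

```python
import numpy as np

def most_similar(ppmi: np.ndarray, vocab: list, word: str, topn: int = 4):
    """Return the topn words whose PPMI row vectors are closest (by cosine
    similarity) to the row vector of `word`."""
    idx = vocab.index(word)
    query = ppmi[idx]

    # Cosine similarity of the query row against every row of the matrix.
    norms = np.linalg.norm(ppmi, axis=1) * np.linalg.norm(query) + 1e-12
    sims = (ppmi @ query) / norms

    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if i != idx][:topn]
```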
Word similarity using PPMI vectors:
--------------------------------------------------
'neural':
networks : 0.806
layers : 0.598
with : 0.558
many : 0.484
'data':
training : 0.529
scientists : 0.487
model : 0.473
rewards : 0.468
'learning':
agents : 0.566
feedback : 0.476
learn : 0.454
environmental : 0.416
The similarity scores capture semantic relationships learned purely from co-occurrence patterns. Words appearing in similar contexts cluster together in the PPMI vector space, enabling applications like finding related terms or detecting semantic categories.

The 2D projection reveals the semantic structure captured by PPMI vectors. Even with our small corpus, related words cluster together: machine learning terms form one group, data-related terms another. This is the distributional hypothesis in action. Words with similar meanings appear in similar contexts, leading to similar PPMI vectors.
Collocation Extraction with PMI
One of PMI's most practical applications is identifying collocations: word combinations that occur together more than chance would predict. Collocations include compound nouns ("ice cream"), phrasal verbs ("give up"), and idiomatic expressions ("kick the bucket").
Collocations are word combinations whose meaning or frequency cannot be predicted from the individual words alone. PMI helps identify these by measuring which word pairs co-occur significantly more than their individual frequencies would suggest.
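One simple way to score adjacent word pairs along these lines is to estimate unigram and bigram probabilities from a token stream and apply the PMI formula directly. The sketch below is illustrative, not the chapter's original listing; the `min_count` guard anticipates the low-count issue discussed later.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=1):
    """Score adjacent word pairs by PMI = log2( P(w1, w2) / (P(w1) P(w2)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue                      # skip unreliable rare pairs
        p_pair = count / n_bi
        p_w1, p_w2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    return sorted(scores.items(), key=lambda item: -item[1])


if __name__ == "__main__":
    tokens = "computer vision systems recognize objects in images".split()
    for (w1, w2), score in bigram_pmi(tokens)[:3]:
        print(f"{w1} {w2}: {score:.2f}")
```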
Top Collocations by PMI:
-------------------------------------------------------
Bigram                      PMI       Count
-------------------------   --------  --------
algorithms process          7.64      1
with many                   7.64      1
many layers                 7.64      1
quality affects             7.64      1
performance significantly   7.64      1
generate human-like         7.64      1
understanding context       7.64      1
enables computers           7.64      1
computer vision             7.64      1
vision systems              7.64      1
systems recognize           7.64      1
recognize objects           7.64      1
image recognition           7.64      1
recognition uses            7.64      1
excel at                    7.64      1
The top collocations include meaningful multi-word expressions such as "computer vision," "image recognition," and "vision systems," pairs that co-occur far more often than chance would predict. The high PMI scores quantify how much more: a score above 4 means at least 16 times more frequent than the individual frequencies suggest, and 7.64 corresponds to roughly 200 times more. Note, however, that every pair here has a raw count of 1, a reminder that PMI estimates from very low counts are noisy; the limitations section below returns to this point.

Implementation: Building a Complete PPMI Pipeline
Let's put everything together into a complete, reusable implementation for computing PPMI matrices from text.
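The handbook's original listing isn't reproduced here; the class below is a compact sketch of what a PPMIVectorizer along these lines might look like. Method names, defaults, and the windowing logic are assumptions, and the results listing that follows comes from the chapter's own implementation.

```python
import numpy as np
from collections import Counter

class PPMIVectorizer:
    """Build PPMI word vectors from tokenized sentences (a minimal sketch)."""

    def __init__(self, window_size: int = 2, min_count: int = 1):
        self.window_size = window_size
        self.min_count = min_count

    def fit(self, sentences):
        # 1. Vocabulary: keep words that occur at least min_count times.
        freqs = Counter(w for sent in sentences for w in sent)
        self.vocab = sorted(w for w, c in freqs.items() if c >= self.min_count)
        self.index = {w: i for i, w in enumerate(self.vocab)}
        V = len(self.vocab)

        # 2. Symmetric-window co-occurrence counts.
        C = np.zeros((V, V))
        for sent in sentences:
            ids = [self.index[w] for w in sent if w in self.index]
            for i, wi in enumerate(ids):
                lo = max(0, i - self.window_size)
                hi = min(len(ids), i + self.window_size + 1)
                for j in range(lo, hi):
                    if j != i:
                        C[wi, ids[j]] += 1

        # 3. PPMI transform: max(log2(observed / expected), 0).
        N = C.sum()
        expected = (C.sum(axis=1, keepdims=True) @ C.sum(axis=0, keepdims=True)) / N
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.where(C > 0, np.log2(C / expected), 0.0)
        self.ppmi = np.maximum(pmi, 0.0)
        return self

    def most_similar(self, word, topn=4):
        query = self.ppmi[self.index[word]]
        norms = np.linalg.norm(self.ppmi, axis=1) * np.linalg.norm(query) + 1e-12
        sims = (self.ppmi @ query) / norms
        order = [i for i in np.argsort(-sims) if self.vocab[i] != word]
        return [(self.vocab[i], float(sims[i])) for i in order[:topn]]
```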
PPMIVectorizer Results:
--------------------------------------------------
Vocabulary size: 30
Matrix shape: (30, 30)
Similar to 'learning':
learn : 0.555
agents : 0.548
from : 0.390
information : 0.318
Similar to 'networks':
neural : 0.720
convolutional : 0.526
learn : 0.348
language : 0.334
Similar to 'data':
preprocessing : 0.767
raw : 0.681
training : 0.540
model : 0.536
The vectorizer successfully builds PPMI word vectors and finds semantically related words. The similarity scores reflect how often words share the same contexts in the corpus. This encapsulated implementation can be reused across different corpora and easily integrated into larger NLP pipelines.
Limitations and When to Use PMI
While PMI is powerful, it has limitations you should understand.
Sensitivity to Low Counts
PMI estimates are unreliable for rare word pairs. A word pair that co-occurs once when we expected 0.5 co-occurrences gets PMI of 1, but this could easily be noise. The standard solution is to require minimum co-occurrence counts before computing PMI.
Effect of minimum count threshold:
--------------------------------------------------
Standard PPMI non-zero entries: 695
Reliable PPMI (min_count=2):    77
Entries filtered out:           618
Requiring a minimum co-occurrence count filters out potentially spurious associations that could arise from sampling noise, producing more reliable PMI estimates at the cost of discarding some valid but rare associations.
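One way to implement this, sketched under the same assumptions as the earlier NumPy code, is to zero out PPMI entries whose underlying counts fall below the threshold:

```python
import numpy as np

def reliable_ppmi(C: np.ndarray, min_count: int = 2) -> np.ndarray:
    """PPMI with low-count filtering: entries whose raw co-occurrence count
    is below min_count are zeroed out as potentially spurious."""
    C = C.astype(float)
    N = C.sum()
    expected = (C.sum(axis=1, keepdims=True) @ C.sum(axis=0, keepdims=True)) / N

    with np.errstate(divide="ignore", invalid="ignore"):
        ppmi = np.maximum(np.where(C > 0, np.log2(C / expected), 0.0), 0.0)

    # Keep a PPMI value only where the observed count clears the threshold.
    return np.where(C >= min_count, ppmi, 0.0)
```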
Bias Toward Rare Words
PMI tends to give high scores to rare word pairs. If a rare word appears only in specific contexts, it gets high PMI with those contexts even if the association is coincidental. Shifted PPMI helps by raising the threshold for positive associations.

The downward trend confirms the rare word bias: words with fewer total co-occurrences tend to achieve higher maximum PMI scores. This happens because rare words have limited contexts, making each co-occurrence count proportionally more.
Computational Cost
For large vocabularies, PMI matrices become enormous. A 100,000-word vocabulary produces a 10-billion-cell matrix. While the matrix is sparse after PPMI transformation, the intermediate computations can be expensive. Practical implementations use sparse matrix formats and streaming algorithms.
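As a rough illustration of the sparse approach, the sketch below builds a PPMI matrix directly from (row, column, count) triples with SciPy, without ever materializing the dense vocabulary-by-vocabulary array. The helper is an assumption, not a canonical implementation, and it expects each (i, j) pair to appear at most once in the triples.

```python
import numpy as np
from scipy.sparse import coo_matrix

def sparse_ppmi(rows, cols, counts, vocab_size):
    """Build a CSR PPMI matrix from unique (i, j, count) co-occurrence triples."""
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    counts = np.asarray(counts, dtype=float)

    C = coo_matrix((counts, (rows, cols)), shape=(vocab_size, vocab_size))
    N = counts.sum()
    row_sums = np.asarray(C.sum(axis=1)).ravel()   # #(w, *)
    col_sums = np.asarray(C.sum(axis=0)).ravel()   # #(*, c)

    # PMI is only defined where counts are non-zero, so compute it per triple.
    pmi = np.log2(counts * N / (row_sums[rows] * col_sums[cols]))
    ppmi = np.maximum(pmi, 0.0)

    return coo_matrix((ppmi, (rows, cols)), shape=(vocab_size, vocab_size)).tocsr()
```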
When to Use PMI
PMI and PPMI are excellent choices when:
- You need interpretable association scores between words
- You're extracting collocations or multi-word expressions
- You want sparse, high-dimensional word vectors as input to other algorithms
- You need a baseline to compare against neural embeddings
- Computational resources are limited (no GPU required)
Neural methods like Word2Vec often outperform PPMI for downstream tasks, but the difference is smaller than you might expect. For many applications, PPMI provides a strong, interpretable baseline.
Summary
Pointwise Mutual Information transforms raw co-occurrence counts into meaningful association scores by comparing observed co-occurrence to what we'd expect under independence.
Key concepts:
- PMI formula: $\text{PMI}(w, c) = \log_2 \dfrac{P(w, c)}{P(w)\,P(c)}$ measures how much more (or less) two words co-occur than chance would predict
- Positive PMI (PPMI): Clips negative values to zero, keeping only positive associations. This produces sparse matrices that work well with machine learning algorithms.
- Shifted PPMI: Subtracts $\log_2 k$ before clipping, filtering out weak associations. Connected theoretically to Word2Vec's negative sampling.
- PMI interpretation: Positive PMI means words co-occur more than expected (strong association). Zero means independence. Negative means avoidance.
Practical applications:
- Collocation extraction: Finding meaningful multi-word expressions
- Word similarity: Using PPMI vectors with cosine similarity
- Feature weighting: PPMI as a preprocessing step before dimensionality reduction
Key parameters:
| Parameter | Typical Values | Effect |
|---|---|---|
| window_size | 2-5 | Larger windows capture broader context but may introduce noise |
| min_count | 2-10 | Higher values filter unreliable associations from rare words |
| shift_k | 1-15 | Higher values keep only the strongest associations |
The next chapter shows how to reduce the dimensionality of PPMI matrices using Singular Value Decomposition, producing dense vectors that capture the essential structure in fewer dimensions.
Reference
Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. Advances in Neural Information Processing Systems 27 (NIPS 2014).