Master SVD for NLP, including truncated SVD for dimensionality reduction, Latent Semantic Analysis, and randomized SVD for large-scale text processing.

This article is part of the free-to-read Language AI Handbook
Singular Value Decomposition
Co-occurrence matrices capture the distributional patterns that reveal word meaning. But these matrices have a problem: they're enormous and mostly empty. A vocabulary of 100,000 words produces a matrix with 10 billion entries, yet most word pairs never co-occur. This sparsity makes storage expensive, computation slow, and similarity estimates unreliable.
Singular Value Decomposition (SVD) solves this by finding a compact representation that preserves the essential structure. Instead of storing billions of sparse counts, we extract a few hundred dense dimensions that capture the underlying semantic patterns. This is the mathematical foundation behind Latent Semantic Analysis (LSA) and a conceptual precursor to modern word embeddings.
The Intuition: Finding Hidden Structure
Consider a term-document matrix where rows are words and columns are documents. Each entry counts how often a word appears in a document. This matrix is high-dimensional (one dimension per document) and sparse (most words don't appear in most documents).
But the underlying semantic structure is much simpler. Documents about "cars" tend to mention "engine," "wheel," "drive," and "road" together. Documents about "cooking" cluster words like "recipe," "ingredient," "stir," and "oven." These latent topics create correlations: knowing a document contains "engine" makes "wheel" more likely.
SVD discovers these hidden patterns. It finds a small number of dimensions that explain most of the variance in the data. Words that co-occur with similar patterns end up close together in this reduced space, even if they never directly co-occurred in the original matrix.
SVD is a matrix factorization that decomposes any matrix into three matrices: $X = U \Sigma V^T$. The columns of $U$ and $V$ are orthonormal bases, and $\Sigma$ is a diagonal matrix of singular values that indicate the importance of each dimension.
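To make the factorization concrete, here is a minimal NumPy sketch (the small matrix is purely illustrative, not drawn from the article's corpus): the decomposition reconstructs the original matrix exactly, and the columns of $U$ are orthonormal.

```python
import numpy as np

# A tiny illustrative matrix: rows could be words, columns documents
X = np.array([
    [2., 1., 0., 0.],
    [1., 2., 0., 0.],
    [0., 0., 3., 1.],
    [0., 0., 1., 2.],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(s)                                         # singular values, sorted in decreasing order
print(np.allclose(X, U @ np.diag(s) @ Vt))       # True: U @ Sigma @ V^T reconstructs X
print(np.allclose(U.T @ U, np.eye(U.shape[1])))  # True: columns of U are orthonormal
```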
The Mathematics of SVD
To understand SVD, we need to answer a fundamental question: what does it mean to decompose a matrix? And more importantly, why would we want to?
The Core Problem: Too Much Information, Poorly Organized
Consider our term-document matrix $X$. Each entry $X_{ij}$ tells us how often word $i$ appears in document $j$. This is raw information, accurate but overwhelming. With 100,000 words and 10,000 documents, we have a billion numbers, most of them zeros. Worse, this representation treats each document as an independent dimension, ignoring the obvious fact that documents about similar topics should be similar.
What we want is a compressed representation that:
- Captures the essential patterns (words that co-occur, documents that are similar)
- Discards the noise (random co-occurrences that don't reflect meaning)
- Organizes information along meaningful dimensions
SVD achieves all three by factoring the matrix into simpler pieces that reveal its underlying structure.
Building Toward the Decomposition
Let's develop the SVD formula by thinking about what we need. Our matrix $X$ has size $m \times n$, where $m$ is the vocabulary size and $n$ is the number of documents (or the vocabulary size, for word-word matrices).
Step 1: Find the principal directions.
The first insight is that not all directions in our data space are equally important. Some directions capture major patterns (the "technology vs. cooking" distinction), while others capture noise. We want to identify these directions and rank them by importance.
Mathematically, we're looking for orthogonal directions, perpendicular axes that don't interfere with each other. These will become our new coordinate system.
Step 2: Measure importance along each direction.
Once we have the directions, we need to know how much the data varies along each one. A direction with high variance captures important structure; one with low variance captures noise.
Step 3: Express everything in terms of these new coordinates.
Finally, we rewrite our original matrix using the new directions and their importances. This is the decomposition.
The SVD Formula
SVD accomplishes all three steps in one elegant factorization:

$$X = U \Sigma V^T$$
Each component has a specific role:
| Component | Size | Role |
|---|---|---|
| $U$ | $m \times r$ | Word directions: orthogonal vectors describing how words relate to latent dimensions |
| $\Sigma$ | $r \times r$ | Importance scores: diagonal matrix of singular values |
| $V^T$ | $r \times n$ | Document directions: orthogonal vectors describing how documents relate to latent dimensions |
The singular values $\sigma_1 \geq \sigma_2 \geq \dots \geq 0$ are always non-negative and sorted in decreasing order. The number of non-zero singular values, $r$, equals the rank of the matrix, the true dimensionality of the information it contains.
Why orthogonality matters. The conditions $U^T U = I$ and $V^T V = I$ ensure that each column of $U$ is perpendicular to every other column (and likewise for $V$). This orthogonality makes the dimensions independent, each capturing a distinct pattern. Without orthogonality, dimensions would overlap and we couldn't cleanly separate signal from noise.
Geometric Interpretation: A New Coordinate System
Think of SVD as discovering the natural coordinate system for your data. The original matrix $X$ transforms vectors from an $n$-dimensional input space to an $m$-dimensional output space. SVD reveals that this transformation has three distinct phases:
- Rotate the input space using $V^T$ to align with the principal directions of variation
- Scale each dimension by its singular value $\sigma_i$, stretching important directions and shrinking unimportant ones
- Rotate again using $U$ to produce the final output
This decomposition is visualized below. The singular values tell us how much the matrix stretches along each principal direction:
- Large $\sigma_i$: The data varies greatly along this direction, capturing important structure
- Small $\sigma_i$: Little variation, likely noise or fine details we can safely ignore

The visualization demonstrates the key insight: most of the data's spread lies along the first principal direction. The second direction captures much less variation. If we kept only the first direction, we'd lose some information, but we'd preserve the dominant pattern. This observation leads directly to the most powerful application of SVD.
Truncated SVD: Keeping What Matters
The Insight: Most Information Lives in Few Dimensions
Here's the remarkable fact that makes SVD useful for NLP: real data is approximately low-rank. A term-document matrix with millions of entries might have its essential structure captured by just a few hundred dimensions. The remaining dimensions contain noise, rare events, and idiosyncratic details.
Why does this happen? Because language has structure:
- Documents about technology share vocabulary, creating correlated columns
- Words with similar meanings appear in similar contexts, creating correlated rows
- Topics, genres, and writing styles impose patterns that span many entries
These correlations mean the matrix isn't truly high-dimensional in an informational sense: its effective rank is far smaller than its size suggests. SVD reveals this hidden simplicity.
The Truncated SVD Formula
If we keep only the $k$ largest singular values and discard the rest, we get the truncated decomposition:

$$X \approx X_k = U_k \Sigma_k V_k^T$$
The components are truncated versions of the full decomposition:
| Component | Original Size | Truncated Size | What's Kept |
|---|---|---|---|
| $U_k$ | $m \times r$ | $m \times k$ | First $k$ columns (most important word directions) |
| $\Sigma_k$ | $r \times r$ | $k \times k$ | Top $k$ singular values (largest importances) |
| $V_k^T$ | $r \times n$ | $k \times n$ | First $k$ rows (most important document directions) |
This isn't just any approximation. It's the best possible rank-$k$ approximation.
Among all matrices with rank at most $k$, the truncated SVD minimizes the reconstruction error (Frobenius norm). No other rank-$k$ matrix can get closer to the original. This mathematical guarantee is why SVD is the gold standard for linear dimensionality reduction.
Quantifying Information Loss
How much information do we lose by keeping only $k$ dimensions? The singular values provide an exact answer. Since each $\sigma_i^2$ represents the variance captured by dimension $i$, the fraction of total variance explained by the top $k$ dimensions is:

$$\text{explained variance}(k) = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{r} \sigma_i^2}$$

where:
- $\sigma_i$: the $i$-th singular value (importance of dimension $i$)
- $k$: the number of dimensions we choose to retain
- $r$: the rank of the original matrix (total number of non-zero singular values)
The key insight: In text data, this ratio typically rises quickly. The first few dimensions might capture 50% of the variance, the first 50 might capture 80%, and the first 200 might capture 95%. The remaining thousands of dimensions contribute almost nothing.
This rapid concentration of variance is why dimensionality reduction works so well for language. The semantic structure we care about lives in a low-dimensional subspace.
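A short sketch of how the cumulative explained variance can be computed with NumPy. This is not the exact script that produced the output below; it simulates a similar "three topics plus noise" matrix purely for illustration.

```python
import numpy as np

# Simulate a 100 x 50 matrix with three latent topics plus noise (illustrative only)
rng = np.random.default_rng(42)
X = rng.random((100, 3)) @ rng.random((3, 50)) * 10 + rng.normal(0, 0.5, size=(100, 50))

s = np.linalg.svd(X, compute_uv=False)     # singular values only, sorted descending
variance = s ** 2
cumulative = np.cumsum(variance) / variance.sum()

for i in range(10):
    print(f"sigma_{i + 1:2d} = {s[i]:8.2f}   (cumulative variance: {cumulative[i]:.1%})")
```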
```
Singular value analysis:
--------------------------------------------------
Matrix shape: (100, 50)
Number of singular values: 50

Top 10 singular values:
  σ_ 1 = 177.86   (cumulative variance: 70.7%)
  σ_ 2 =  75.43   (cumulative variance: 83.4%)
  σ_ 3 =  56.40   (cumulative variance: 90.5%)
  σ_ 4 =  16.70   (cumulative variance: 91.1%)
  σ_ 5 =  15.06   (cumulative variance: 91.7%)
  σ_ 6 =  14.78   (cumulative variance: 92.1%)
  σ_ 7 =  14.35   (cumulative variance: 92.6%)
  σ_ 8 =  14.06   (cumulative variance: 93.0%)
  σ_ 9 =  13.62   (cumulative variance: 93.5%)
  σ_10 =  13.09   (cumulative variance: 93.8%)
```
Notice the dramatic gap between the first three singular values and the rest. This isn't coincidence. We generated this data with exactly three hidden topics. The singular value spectrum has revealed this latent structure: three large values corresponding to the three topics, followed by much smaller values representing noise and fine-grained details.
This is the power of SVD for exploratory analysis: the singular value decay reveals the intrinsic dimensionality of your data. A sharp drop suggests clear underlying structure; a gradual decay suggests more complex, distributed patterns.

The left plot shows the singular value spectrum. Note how the first three values tower over the rest. The right plot translates this into cumulative explained variance: with just 10 dimensions, we've captured over 90% of the matrix's information. The annotation marks where we cross the 90% threshold, a common heuristic for choosing dimensionality.
In real text data, the decay is typically more gradual than this simulated example, but the principle holds: a small number of dimensions captures the semantic structure that matters.
Latent Semantic Analysis (LSA): From Theory to Practice
Now that we understand the mathematics of SVD, let's see how it transforms our understanding of text. Latent Semantic Analysis (LSA), introduced by Deerwester et al. in 1990, was one of the first successful methods for learning word representations, and it's built entirely on the truncated SVD we just developed.
The Central Insight
LSA rests on a powerful observation: the reduced SVD dimensions aren't arbitrary; they correspond to latent semantic concepts.
Consider what happens when we apply truncated SVD to a term-document matrix:
- Words that appear in similar documents get projected to similar locations
- Documents about similar topics get projected to similar locations
- The dimensions themselves often correspond to interpretable "topics" or semantic contrasts
This means two words can be similar in LSA space even if they never directly co-occur, as long as they appear in similar contexts. "Car" and "automobile" might never appear in the same document, but if they appear in documents about the same topics, LSA will discover their similarity.
Building an LSA Model Step by Step
Let's implement LSA from scratch. By building each component ourselves, we'll develop intuition for how the mathematics translates into semantic understanding.
```
Corpus: 10 documents
Vocabulary size: 11 words
Vocabulary: ['algorithms', 'analyze', 'and', 'cooking', 'data', 'fresh', 'software', 'the', 'to', 'using', 'with']
```
Our filtered vocabulary contains words that appear in at least two documents, a simple but effective way to focus on meaningful terms. Notice the mix: technology words ("data," "algorithms"), cooking words ("cooking," "fresh"), and general words ("the," "with"). This diversity will let us observe how LSA discovers semantic structure without being told what the categories are.
```
Term-document matrix shape: (11, 10)
Non-zero entries: 27
Sparsity: 75.5%
```

The heatmap reveals the sparsity pattern visually. Most cells are light (zero counts), with scattered darker cells where words actually appear in documents. This visual makes clear why storing and computing with the raw matrix is inefficient.
The matrix is extremely sparse. Most entries are zero because any given document contains only a tiny fraction of the vocabulary. This sparsity is both a curse and an opportunity:
- The curse: Sparse vectors are inefficient to store and compute with. Similarity calculations are unreliable when most dimensions are zero.
- The opportunity: The sparsity tells us the data is redundant. If most entries are predictable (zero), then the true information content is much smaller than the matrix size suggests.
SVD exploits this redundancy, compressing the sparse matrix into a dense representation that captures the underlying patterns.
Preprocessing: TF-IDF Weighting
Before applying SVD, we apply TF-IDF (Term Frequency-Inverse Document Frequency) weighting. This transformation addresses a subtle problem: raw counts overweight common words.
Words like "the" and "is" appear in nearly every document but carry little semantic information. TF-IDF down-weights these ubiquitous terms and up-weights distinctive words that characterize specific documents. The result is a matrix where the important patterns are more prominent.
Applying Truncated SVD: The Heart of LSA
Now comes the key step: we apply truncated SVD to extract the latent semantic dimensions. The function below computes the full SVD, then truncates to keep only the top $k$ components.
Important detail: Notice how we construct word and document vectors. For word vectors, we multiply $U_k$ by the singular values; for document vectors, we multiply $V_k$ by the singular values. This weighting ensures that more important dimensions contribute more to similarity calculations.
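A minimal sketch of such a function, assuming NumPy and a (words × documents) TF-IDF matrix; the function name `lsa_truncated_svd` is ours, not necessarily the article's.

```python
import numpy as np

def lsa_truncated_svd(X, k):
    """Rank-k LSA model for a (words x documents) matrix X."""
    # Full SVD with singular values sorted in decreasing order
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # Keep only the top-k components
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    # Scale by singular values so more important dimensions dominate similarities
    word_vectors = U_k * s_k       # shape: (n_words, k)
    doc_vectors = Vt_k.T * s_k     # shape: (n_docs, k)
    return word_vectors, doc_vectors, s_k
```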
```
LSA with 3 components:
Word vectors shape: (11, 3)
Document vectors shape: (10, 3)
Singular values: [2.23368488 2.08136469 1.76910355]
```
We've achieved our goal: each word is now represented by a dense 3-dimensional vector instead of a sparse 10-dimensional document vector. The singular values tell us the relative importance of each dimension. The first captures the most variance, the second captures the most of what remains, and so on.

The visualization shows the SVD decomposition in action. The original matrix on the left contains the TF-IDF weighted word counts. The middle panel shows the singular values that weight each dimension. The reconstruction on the right shows what we get when we multiply the truncated components back together. The essential patterns are preserved while noise is smoothed out.
But what do these dimensions mean? This is where LSA becomes fascinating.
Interpreting the Latent Dimensions
Unlike hand-crafted features, SVD dimensions don't have predetermined meanings. But we can interpret them by examining which words load most strongly on each pole. A dimension might separate "technology vs. cooking," or "abstract vs. concrete," or capture some other semantic contrast that emerges from the data.
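One way to inspect the poles, sketched here with NumPy (the helper name and output formatting are ours):

```python
def describe_dimension(dim, vocab, word_vectors, top_n=5):
    """Words loading most positively and most negatively on one LSA dimension."""
    loadings = word_vectors[:, dim]
    order = loadings.argsort()                                  # ascending by loading
    positive = [(vocab[i], loadings[i]) for i in order[::-1][:top_n]]
    negative = [(vocab[i], loadings[i]) for i in order[:top_n]]
    return positive, negative
```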
```
LSA Dimension Analysis:
============================================================

Dimension 1:
  Positive pole:
    data       : +1.967
    analyze    : +0.578
    to         : +0.578
    the        : +0.391
    algorithms : +0.325
  Negative pole:
    cooking    : +0.104
    with       : +0.119
    fresh      : +0.126
    using      : +0.131
    and        : +0.172

Dimension 2:
  Positive pole:
    the        : +1.704
    with       : +0.643
    fresh      : +0.631
    cooking    : +0.443
    and        : +0.419
  Negative pole:
    data       : -0.435
    to         : -0.106
    analyze    : -0.106
    software   : +0.022
    algorithms : +0.022

Dimension 3:
  Positive pole:
    analyze    : +0.883
    to         : +0.883
    algorithms : +0.711
    software   : +0.711
    using      : +0.052
  Negative pole:
    data       : -0.740
    the        : -0.079
    with       : -0.046
    fresh      : -0.038
    cooking    : -0.005
```
The output reveals interpretable structure. Each dimension has two poles, positive and negative loadings, and words cluster according to their semantic properties. Technology terms group together at one pole; cooking terms at another.
This is the magic of LSA: we never told the algorithm about "technology" or "cooking." It discovered these categories by analyzing which words appear in similar documents. The latent dimensions are emergent properties of the corpus structure.
Visualizing the Semantic Space
The scatter plot below shows words positioned by their first two LSA coordinates. Even with just two dimensions, the semantic organization is visible.

Documents can also be visualized in the same LSA space, revealing how they cluster by topic.

This joint visualization reveals a key property of LSA: words and documents live in the same semantic space. A document's position is essentially the weighted average of its words' positions. This shared representation enables powerful applications like finding documents similar to a query word, or finding words that characterize a document.
Computing Semantic Similarity
With words represented as dense vectors, computing similarity becomes straightforward and efficient. We use cosine similarity, the cosine of the angle between two vectors, which measures how aligned two words are in the semantic space, regardless of their vector magnitudes.
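A small sketch of cosine-similarity ranking over the LSA word vectors (the helper name `most_similar` is ours):

```python
import numpy as np

def most_similar(word, vocab, word_vectors, top_n=4):
    """Rank other words by cosine similarity to `word` in LSA space."""
    idx = vocab.index(word)
    query = word_vectors[idx]
    # Cosine similarity: dot products divided by the product of vector norms
    norms = np.linalg.norm(word_vectors, axis=1) * np.linalg.norm(query) + 1e-12
    sims = (word_vectors @ query) / norms
    ranked = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in ranked if i != idx][:top_n]
```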
```
Word similarities in LSA space:
--------------------------------------------------

Similar to 'data':
  using   : 0.285
  to      : 0.233
  analyze : 0.233
  and     : 0.132

Similar to 'cooking':
  the     : 0.999
  fresh   : 0.998
  with    : 0.997
  and     : 0.983

Similar to 'fresh':
  with    : 1.000
  the     : 0.999
  cooking : 0.998
  and     : 0.972
```
The results demonstrate LSA's power: words are similar based on shared context, not just direct co-occurrence. "Data" clusters with "analyze" and "using" because they appear in the same technology-oriented documents; "cooking" clusters with "fresh" and "with" because they characterize the cooking documents. Function words like "the" and "and" also score highly here, a reminder that on a corpus this tiny even TF-IDF cannot fully suppress them.

The similarity heatmap reveals the semantic structure LSA has discovered. Words are reordered using hierarchical clustering to group similar words together. The resulting block structure shows clear semantic clusters: technology terms cluster together, cooking terms form another block, with low similarity (cooler colors) between the domains.
The key insight: LSA discovers similarity through transitive relationships. Even if "data" and "algorithms" never appear in the same sentence, they're similar because they both appear with words like "computer," "software," and "analysis." The SVD compression captures these indirect relationships that raw co-occurrence counts miss.
Choosing the Number of Dimensions
We've seen that truncated SVD works, but we've glossed over a critical decision: how many dimensions should we keep?
This isn't just a technical detail. The choice of fundamentally affects what your model captures:
- Too few dimensions: You lose important distinctions. "Car" and "automobile" might be similar, but so might "car" and "truck" if you've collapsed too much structure.
- Too many dimensions: You keep noise. The model memorizes idiosyncratic patterns that don't generalize to new data.
The sweet spot depends on your data and your task. Here are the main approaches for finding it.
The Elbow Method: Visual Inspection
The simplest approach is to plot the singular values and look for an elbow, a point where the curve bends from steep decline to gradual decay. Dimensions before the elbow capture signal; those after capture noise.

Task-Based Selection: Let Performance Decide
The elbow method gives a starting point, but the best approach is often empirical: evaluate different dimension counts on your actual task.
- For document retrieval: Measure precision and recall at different values
- For word similarity: Compare against human similarity judgments (datasets like SimLex-999)
- For classification: Use cross-validation accuracy as your guide
The "optimal" varies by task. A word similarity task might need 300 dimensions to capture fine distinctions, while a topic classification task might work best with 50 dimensions that capture broad categories.
```
Reconstruction error by number of components:
---------------------------------------------
k = 1: 86.7% relative error
k = 2: 73.3% relative error
k = 3: 61.8% relative error
k = 4: 50.1% relative error
k = 5: 39.2% relative error
```
The reconstruction error drops rapidly with the first few dimensions, then levels off. This pattern of steep initial decline followed by diminishing returns is characteristic of structured data. The first dimensions capture the dominant patterns; additional dimensions add progressively less information, eventually just fitting noise.

The progression from k=1 to k=3 shows how each additional dimension adds information. The rank-1 approximation captures only the most dominant pattern (overall word frequency). Rank-2 begins to distinguish between document types. By rank-3, the reconstruction closely matches the original, capturing the essential semantic structure while smoothing out idiosyncratic noise.
Practical Guidelines
For LSA on typical document collections:
- 50-300 dimensions is a common range for document retrieval and topic modeling
- 100-500 dimensions works well for word similarity tasks
- Start with 100 dimensions as a baseline and tune from there
- More data generally supports more dimensions without overfitting
Rule of thumb: Start around the 100-dimension baseline, scaling $k$ up with the size of your document collection, then tune based on task performance. For word-word matrices, $k$ between 100 and 500 typically works well.
Computational Complexity: The Scaling Challenge
Everything we've discussed works beautifully on small datasets. But real-world NLP deals with massive scale: vocabularies of 100,000+ words, corpora with millions of documents. Here, SVD's computational cost becomes a serious obstacle.
Full SVD has complexity $O(\min(mn^2, m^2n))$, roughly cubic in the matrix dimensions. For a 100,000 × 50,000 matrix, this means on the order of $10^{14}$ operations. Even on fast hardware, computation takes hours or days.
```
SVD Computation Time (full SVD):
---------------------------------------------
Matrix Size        Time (s)
---------------------------------------------
100x50               0.0696
200x100              2.7923
500x200              8.0098
1000x500            37.2111

Estimated time for larger matrices:
  10,000 x 5,000:    ~minutes
  100,000 x 50,000:  ~hours (impractical)
```

Randomized SVD: Making Scale Tractable
The scaling problem seems insurmountable, until you realize we don't actually need the full SVD. We only want the top $k$ components, and we've seen that $k$ is typically small (50-500). Can we compute just those components without computing everything?
Randomized SVD answers yes, achieving $O(mnk)$ complexity, linear in the matrix size and the number of components we want. For a 100,000 × 50,000 matrix with $k = 100$, this is 500× faster than full SVD.
The Key Insight: Random Projection
The algorithm rests on a beautiful idea: random projection preserves structure. If we multiply our matrix by a random matrix, we get a smaller matrix that captures most of the original's important directions. We can then compute SVD on this smaller matrix and recover an approximation to the original SVD.
Why does randomness help? Because the important directions (those with large singular values) are robust and show up even after random projection. The unimportant directions (small singular values) get scrambled, but we didn't want them anyway.
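A compact sketch of the standard randomized SVD recipe (random projection, optional power iterations, then a small exact SVD). This is a textbook-style illustration, not scikit-learn's implementation.

```python
import numpy as np

def randomized_svd(X, k, n_oversamples=10, n_iter=4, seed=0):
    """Approximate top-k SVD of X via random projection."""
    rng = np.random.default_rng(seed)
    m, n = X.shape

    # 1. Project onto a random subspace that captures the range of X
    Omega = rng.normal(size=(n, k + n_oversamples))
    Y = X @ Omega

    # 2. Power iterations sharpen the subspace toward the top singular directions
    for _ in range(n_iter):
        Y = X @ (X.T @ Y)
    Q, _ = np.linalg.qr(Y)            # orthonormal basis for the sampled range

    # 3. Exact SVD of the small projected matrix, then map back to the original space
    B = Q.T @ X                       # (k + oversamples) x n
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k, :]
```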
Comparing Randomized to Full SVD
Let's verify that randomized SVD delivers on its promise: comparable accuracy at a fraction of the cost.
```
Comparison: Full SVD vs Randomized SVD
==================================================
Matrix size: 1000 x 500
Target rank: 50

Full SVD time:         33.6863 seconds
Randomized SVD time:    3.7355 seconds
Speedup: 9.0x

Full SVD reconstruction error:        616.1608
Randomized SVD reconstruction error:  625.9156
Relative difference: 1.58%
```
The results confirm the theory: randomized SVD achieves nearly identical reconstruction error while running significantly faster. The small accuracy gap (typically 1-2%) is a worthwhile trade-off for the dramatic speedup.
Production Implementation: scikit-learn
For production use, scikit-learn's TruncatedSVD provides an optimized implementation with sensible defaults. It uses randomized algorithms automatically for large matrices.
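A typical usage sketch (not the timing script behind the output below); the tiny corpus here is purely illustrative, and real applications would feed in thousands of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Illustrative mini-corpus (not the article's data)
docs = [
    "data algorithms analyze software",
    "cooking fresh ingredients recipe",
    "software analyze data methods",
    "fresh cooking with ingredients",
]

X_tfidf = TfidfVectorizer().fit_transform(docs)            # sparse (n_docs, n_terms)
svd = TruncatedSVD(n_components=2, algorithm="randomized", random_state=42)
doc_vectors = svd.fit_transform(X_tfidf)                   # dense (n_docs, 2) document vectors
word_vectors = svd.components_.T                           # (n_terms, 2) term loadings
print(svd.explained_variance_ratio_.sum())                 # fraction of variance captured
```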
```
Scikit-learn TruncatedSVD time: 1.7948 seconds
Explained variance ratio: 0.236
```
Scikit-learn's implementation is fast, numerically stable, and handles edge cases gracefully. For most applications, this is the recommended approach. There's no need to implement randomized SVD yourself unless you have specialized requirements.
Interpreting SVD Dimensions
One of SVD's advantages over neural embeddings is interpretability. While the dimensions don't have predetermined meanings, we can often understand what they capture by examining which words load most strongly on each pole.
This interpretability has practical value: it helps us understand what our model has learned, diagnose problems, and explain results to stakeholders.
```
Extended corpus: 24 documents, 21 words

Dimension Interpretation:
============================================================

Dimension 1 (σ = 3.822):
  Positive pole: ['to', 'athletes', 'their', 'with', 'the']
  Negative pole: ['algorithms', 'learning', 'using', 'requires', 'cooking']

Dimension 2 (σ = 3.375):
  Positive pole: ['learning', 'algorithms', 'a', 'fresh', 'with']
  Negative pole: ['using', 'data', 'from', 'methods', 'cooking']

Dimension 3 (σ = 3.343):
  Positive pole: ['using', 'data', 'from', 'methods', 'learning']
  Negative pole: ['to', 'athletes', 'their', 'for', 'the']
```

The 3D visualization shows how the three topic clusters separate in LSA space. Technology documents cluster in one region, cooking documents in another, and sports documents in a third. This clear separation emerges purely from analyzing word co-occurrence patterns. We never told the algorithm about these categories.

The bar charts reveal the semantic contrasts each dimension captures. The first dimension might separate technology from cooking; the second might capture a different contrast (perhaps abstract vs. concrete, or formal vs. informal). Later dimensions capture progressively subtler distinctions.
A note on interpretation: Not every dimension will have a clear human-interpretable meaning. Some capture complex combinations of features, or statistical patterns that don't map neatly to concepts. This is fine. The dimensions are optimized for reconstruction accuracy, not human interpretability.
SVD on Word-Word Matrices
So far we've applied SVD to term-document matrices, where similarity reflects shared document membership. But there's another approach: apply SVD to word-word co-occurrence matrices, where similarity reflects shared local context.
These two approaches capture different aspects of meaning:
- Term-document (LSA): Words are similar if they appear in similar documents → captures topical/thematic similarity
- Word-word (SVD on co-occurrence): Words are similar if they appear near similar words → captures syntactic and fine-grained semantic similarity
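Before looking at such a matrix, here is a minimal sketch of how a window-based word-word co-occurrence matrix can be built (naive whitespace tokenization; the helper name and window handling are our illustrative choices, and the article's construction may differ):

```python
import numpy as np

def cooccurrence_matrix(docs, vocab, window=2):
    """Count how often each vocabulary word pair appears within `window` tokens."""
    index = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        tokens = [t for t in doc.lower().split() if t in index]
        for i, t in enumerate(tokens):
            # Look at neighbors within the context window on both sides
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    C[index[t], index[tokens[j]]] += 1
    return C
```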
```
Word-word matrix shape: (21, 21)
Non-zero entries: 49
Sparsity: 88.9%
```
The word-word matrix is even sparser than the term-document matrix. Most word pairs never appear within a few words of each other. This extreme sparsity makes dimensionality reduction essential: the raw co-occurrence counts are too noisy and sparse for reliable similarity computation.
PPMI Transformation: Better Than Raw Counts
Before applying SVD, we transform the co-occurrence counts using Positive Pointwise Mutual Information (PPMI). This transformation converts raw counts into association scores that better reflect semantic relationships.
The intuition: raw counts overweight frequent words. "The" co-occurs with everything, but this tells us nothing about meaning. PPMI asks: "Does this word pair co-occur more than we'd expect by chance?" High PPMI indicates a meaningful association; low PPMI indicates coincidental co-occurrence.
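A dense-matrix PPMI sketch, assuming the co-occurrence counts fit in memory:

```python
import numpy as np

def ppmi(C):
    """Convert a (words x words) co-occurrence count matrix into PPMI scores."""
    total = C.sum()
    p_ij = C / total                             # joint probability of each word pair
    p_i = p_ij.sum(axis=1, keepdims=True)        # marginal probability of the target word
    p_j = p_ij.sum(axis=0, keepdims=True)        # marginal probability of the context word
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0                 # zero counts get PMI = 0
    return np.maximum(pmi, 0.0)                  # keep only positive associations
```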

The comparison reveals PPMI's effect: raw counts show high values wherever frequent words appear, regardless of semantic meaning. PPMI normalizes for frequency, highlighting word pairs that co-occur more than expected by chance. These are the meaningful semantic associations.
```
Word similarities from SVD on word-word matrix:
--------------------------------------------------

Similar to 'learning':
  and          : 0.000
  algorithms   : 0.000
  professional : 0.000
  techniques   : 0.000

Similar to 'cooking':
  using        : 0.283
  and          : 0.145
  for          : 0.012
  fresh        : 0.008

Similar to 'athletes':
  their        : 0.390
  the          : 0.207
  a            : 0.125
  and          : 0.050
```
Compare these results to the term-document LSA similarities earlier. The word-word approach captures different relationships, focusing more on syntactic and local semantic patterns, less on broad topical similarity.
Which approach is better? It depends on your task:
- For document retrieval and topic modeling: Term-document LSA captures the relevant structure
- For word similarity and analogy tasks: Word-word SVD often performs better
- For general-purpose embeddings: Word-word approaches (which inspired Word2Vec and GloVe) have become the standard
Limitations of SVD-Based Methods
While powerful, SVD has limitations that motivated the development of neural word embeddings.
Linear assumption: SVD finds linear combinations of the original dimensions. It cannot capture non-linear relationships in the data.
Static representations: Each word gets a single vector regardless of context. "Bank" (financial) and "bank" (river) have the same representation.
Scalability: Even with randomized algorithms, SVD struggles with web-scale corpora containing billions of words.
Incremental updates: Adding new documents requires recomputing the entire decomposition. There's no efficient way to update the model incrementally.
Out-of-vocabulary words: Words not seen during training have no representation.

Historical Impact and Modern Relevance
Despite its limitations, SVD-based methods had enormous impact on NLP:
Information retrieval: LSA revolutionized document search by enabling semantic matching. Queries could find relevant documents even without exact keyword overlap.
Dimensionality reduction: The principle of finding low-rank approximations influenced all subsequent embedding methods, including Word2Vec and GloVe.
Interpretability: SVD dimensions often correspond to interpretable concepts, unlike the opaque representations of deep neural networks.
Theoretical foundation: SVD provides optimality guarantees that neural methods lack. The Eckart-Young theorem ensures we're finding the best possible low-rank approximation.
Modern neural embeddings like Word2Vec can be understood as implicit matrix factorization. Levy and Goldberg (2014) showed that skip-gram with negative sampling implicitly factorizes a shifted PMI matrix, connecting neural methods back to the SVD tradition.
Summary
Singular Value Decomposition transforms sparse, high-dimensional co-occurrence data into dense, low-dimensional representations that capture semantic structure.
Key concepts:
- SVD factorization decomposes a matrix into $X = U \Sigma V^T$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ contains the singular values
- Truncated SVD keeps only the top $k$ dimensions, providing the optimal rank-$k$ approximation
- Singular values indicate dimension importance; their decay reveals the data's intrinsic dimensionality
- LSA applies SVD to term-document matrices to discover latent semantic structure
- Randomized SVD enables scaling to large matrices with $O(mnk)$ complexity
Practical considerations:
- Choose dimensions using the elbow method or task-based evaluation
- Apply TF-IDF or PPMI weighting before SVD for better results
- Use randomized algorithms (like scikit-learn's TruncatedSVD) for large matrices
- Interpret dimensions by examining top-loading words on each pole
Limitations:
- Linear method cannot capture non-linear relationships
- Static representations don't handle polysemy
- Requires recomputation for new documents
- No representation for out-of-vocabulary words
SVD-based methods remain relevant as baselines and for interpretability. Understanding them provides essential background for the neural embedding methods that followed, which we'll explore in subsequent chapters.
Key Parameters
When applying SVD to text data, these parameters have the greatest impact on the quality and utility of the resulting representations:
| Parameter | Typical Range | Effect |
|---|---|---|
| `n_components` | 50-500 | Number of dimensions to retain. More components capture more variance but increase computation and may include noise. |
| `algorithm` | 'arpack', 'randomized' | 'randomized' is faster for large matrices; 'arpack' is more precise for small matrices. |
| `n_iter` | 2-10 | Power iterations for randomized SVD. More iterations improve accuracy at the cost of speed. |
| `n_oversamples` | 10-20 | Extra dimensions sampled in randomized SVD. Improves approximation quality with minimal overhead. |
Choosing n_components:
- For document retrieval and topic modeling: 50-300 dimensions typically work well
- For word similarity tasks: 100-500 dimensions capture finer distinctions
- Use the elbow method on singular value decay to identify where signal ends and noise begins
- When in doubt, start with 100 and tune based on downstream task performance
Preprocessing choices:
- TF-IDF weighting: Almost always beneficial for term-document matrices. Down-weights frequent terms that appear everywhere.
- PPMI transformation: Recommended for word-word matrices. Converts raw counts to association scores.
- Centering: Optional. Mean-centering rows or columns can improve results for some tasks.
Computational considerations:
- For matrices larger than 10,000 × 10,000, use randomized SVD (`algorithm='randomized'`)
- Sparse matrix formats (CSR, CSC) reduce memory requirements significantly
- Consider incremental/online variants if data arrives in streams
Reference

About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
Related Content

Pointwise Mutual Information: Measuring Word Associations in NLP
Learn how Pointwise Mutual Information (PMI) transforms raw co-occurrence counts into meaningful word association scores by comparing observed frequencies to expected frequencies under independence.

The Distributional Hypothesis: How Context Reveals Word Meaning
Learn how the distributional hypothesis uses word co-occurrence patterns to represent meaning computationally, from Firth's linguistic insight to co-occurrence matrices and cosine similarity.

Co-occurrence Matrices: Building Word Representations from Context
Learn how to construct word-word and word-document co-occurrence matrices that capture distributional semantics. Covers context window effects, distance weighting, sparse storage, and efficient construction algorithms.