Singular Value Decomposition: Matrix Factorization for Word Embeddings & LSA

Michael Brenndoerfer · December 9, 2025

Master SVD for NLP, including truncated SVD for dimensionality reduction, Latent Semantic Analysis, and randomized SVD for large-scale text processing.


Singular Value Decomposition

Co-occurrence matrices capture the distributional patterns that reveal word meaning. But these matrices have a problem: they're enormous and mostly empty. A vocabulary of 100,000 words produces a matrix with 10 billion entries, yet most word pairs never co-occur. This sparsity makes storage expensive, computation slow, and similarity estimates unreliable.

Singular Value Decomposition (SVD) solves this by finding a compact representation that preserves the essential structure. Instead of storing billions of sparse counts, we extract a few hundred dense dimensions that capture the underlying semantic patterns. This is the mathematical foundation behind Latent Semantic Analysis (LSA) and a conceptual precursor to modern word embeddings.

The Intuition: Finding Hidden Structure

Consider a term-document matrix where rows are words and columns are documents. Each entry counts how often a word appears in a document. This matrix is high-dimensional (one dimension per document) and sparse (most words don't appear in most documents).

But the underlying semantic structure is much simpler. Documents about "cars" tend to mention "engine," "wheel," "drive," and "road" together. Documents about "cooking" cluster words like "recipe," "ingredient," "stir," and "oven." These latent topics create correlations: knowing a document contains "engine" makes "wheel" more likely.

SVD discovers these hidden patterns. It finds a small number of dimensions that explain most of the variance in the data. Words that co-occur with similar patterns end up close together in this reduced space, even if they never directly co-occurred in the original matrix.

Singular Value Decomposition

SVD is a matrix factorization that decomposes any matrix $M$ into three matrices: $M = U \Sigma V^T$. The columns of $U$ and $V$ are orthonormal, and $\Sigma$ is a diagonal matrix of singular values that indicate the importance of each dimension.
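
Before diving into the mathematics, it helps to see the factorization as code. The sketch below (using an arbitrary small matrix, not the term-document matrices discussed later) computes the three factors with NumPy and checks that multiplying them back together reproduces the original:

import numpy as np

# A small, arbitrary matrix just to illustrate the factorization
M = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0]])

# Compact SVD: M = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(M, full_matrices=False)

print(U.shape, S.shape, Vt.shape)           # (2, 2) (2,) (2, 3)
print(np.allclose(M, U @ np.diag(S) @ Vt))  # True: the factors reproduce M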

The Mathematics of SVD

To understand SVD, we need to answer a fundamental question: what does it mean to decompose a matrix? And more importantly, why would we want to?

The Core Problem: Too Much Information, Poorly Organized

Consider our term-document matrix $M$. Each entry $M_{ij}$ tells us how often word $i$ appears in document $j$. This is raw information, accurate but overwhelming. With 100,000 words and 10,000 documents, we have a billion numbers, most of them zeros. Worse, this representation treats each document as an independent dimension, ignoring the obvious fact that documents about similar topics should be similar.

What we want is a compressed representation that:

  1. Captures the essential patterns (words that co-occur, documents that are similar)
  2. Discards the noise (random co-occurrences that don't reflect meaning)
  3. Organizes information along meaningful dimensions

SVD achieves all three by factoring the matrix into simpler pieces that reveal its underlying structure.

Building Toward the Decomposition

Let's develop the SVD formula by thinking about what we need. Our matrix $M$ has size $m \times n$, where $m$ is the vocabulary size and $n$ is the number of documents (or the vocabulary size again for word-word matrices).

Step 1: Find the principal directions.

The first insight is that not all directions in our data space are equally important. Some directions capture major patterns (the "technology vs. cooking" distinction), while others capture noise. We want to identify these directions and rank them by importance.

Mathematically, we're looking for orthogonal directions, perpendicular axes that don't interfere with each other. These will become our new coordinate system.

Step 2: Measure importance along each direction.

Once we have the directions, we need to know how much the data varies along each one. A direction with high variance captures important structure; one with low variance captures noise.

Step 3: Express everything in terms of these new coordinates.

Finally, we rewrite our original matrix using the new directions and their importances. This is the decomposition.

The SVD Formula

SVD accomplishes all three steps in one elegant factorization:

$$M = U \Sigma V^T$$

Each component has a specific role:

| Component | Size | Role |
|---|---|---|
| $U$ | $m \times m$ | Word directions: orthogonal vectors describing how words relate to latent dimensions |
| $\Sigma$ | $m \times n$ | Importance scores: diagonal matrix of singular values $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r$ |
| $V^T$ | $n \times n$ | Document directions: orthogonal vectors describing how documents relate to latent dimensions |

The singular values $\sigma_i$ are always non-negative and sorted in decreasing order. The number of non-zero singular values, $r$, equals the rank of the matrix, the true dimensionality of the information it contains.

Why orthogonality matters. The conditions $U^T U = I$ and $V^T V = I$ ensure that each column of $U$ is perpendicular to every other column (and likewise for $V$). This orthogonality makes the dimensions independent, each capturing a distinct pattern. Without orthogonality, dimensions would overlap and we couldn't cleanly separate signal from noise.
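
These properties are easy to check numerically. The following sketch (using a random low-rank matrix purely for illustration) verifies the orthonormality of the factors, the ordering of the singular values, and the link between non-zero singular values and rank:

import numpy as np

rng = np.random.default_rng(0)

# Build a 6 x 5 matrix with rank 3 by multiplying two thin random factors
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Columns of U (and rows of Vt) are orthonormal: U^T U = I, V^T V = I
print(np.allclose(U.T @ U, np.eye(U.shape[1])))    # True
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))) # True

# Singular values are non-negative and sorted in decreasing order
print(np.all(S >= 0), np.all(np.diff(S) <= 0))     # True True

# The number of non-negligible singular values equals the rank (3 here)
print(np.sum(S > 1e-10), np.linalg.matrix_rank(A)) # 3 3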

Geometric Interpretation: A New Coordinate System

Think of SVD as discovering the natural coordinate system for your data. The original matrix $M$ transforms vectors from an $n$-dimensional input space to an $m$-dimensional output space. SVD reveals that this transformation has three distinct phases:

  1. Rotate the input space using $V^T$ to align with the principal directions of variation
  2. Scale each dimension by its singular value $\sigma_i$, stretching important directions and shrinking unimportant ones
  3. Rotate again using $U$ to produce the final output

This decomposition is visualized below. The singular values tell us how much the matrix stretches along each principal direction:

  • Large $\sigma_i$: The data varies greatly along this direction, capturing important structure
  • Small $\sigma_i$: Little variation, likely noise or fine details we can safely ignore
In[2]:
import numpy as np
import matplotlib.pyplot as plt

# Create a simple 2D example to visualize SVD
np.random.seed(42)

# Generate points along an ellipse (correlated data)
theta = np.linspace(0, 2*np.pi, 50)
# Ellipse with major axis at 45 degrees
x = 3 * np.cos(theta) + 0.5 * np.sin(theta)
y = 3 * np.sin(theta) + 0.5 * np.cos(theta)

# Add some noise
x += np.random.randn(50) * 0.3
y += np.random.randn(50) * 0.3

# Stack into matrix (each row is a point)
data = np.column_stack([x, y])

# Compute SVD of the centered data
data_centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(data_centered, full_matrices=False)
Out[3]:
Visualization
Scatter plot of 2D points forming an ellipse with two perpendicular arrows showing principal directions from SVD.
Geometric interpretation of SVD on 2D data. The original data points form an elliptical cloud. SVD finds the principal directions (red and blue arrows) along which the data varies most. The length of each arrow is proportional to the corresponding singular value, showing that the data spreads much more along the first principal direction.

The visualization demonstrates the key insight: most of the data's spread lies along the first principal direction. The second direction captures much less variation. If we kept only the first direction, we'd lose some information, but we'd preserve the dominant pattern. This observation leads directly to the most powerful application of SVD.

Truncated SVD: Keeping What Matters

The Insight: Most Information Lives in Few Dimensions

Here's the remarkable fact that makes SVD useful for NLP: real data is approximately low-rank. A term-document matrix with millions of entries might have its essential structure captured by just a few hundred dimensions. The remaining dimensions contain noise, rare events, and idiosyncratic details.

Why does this happen? Because language has structure:

  • Documents about technology share vocabulary, creating correlated columns
  • Words with similar meanings appear in similar contexts, creating correlated rows
  • Topics, genres, and writing styles impose patterns that span many entries

These correlations mean the matrix isn't truly $m \times n$ dimensional. It's effectively much smaller. SVD reveals this hidden simplicity.

The Truncated SVD Formula

If we keep only the $k$ largest singular values and discard the rest, we get:

$$M \approx M_k = U_k \Sigma_k V_k^T$$

The components are truncated versions of the full decomposition:

| Component | Original Size | Truncated Size | What's Kept |
|---|---|---|---|
| $U_k$ | $m \times m$ | $m \times k$ | First $k$ columns (most important word directions) |
| $\Sigma_k$ | $m \times n$ | $k \times k$ | Top $k$ singular values (largest importances) |
| $V_k^T$ | $n \times n$ | $k \times n$ | First $k$ rows (most important document directions) |

This isn't just any approximation. It's the best possible rank-$k$ approximation.

Eckart-Young-Mirsky Theorem

Among all matrices with rank at most $k$, the truncated SVD $M_k$ minimizes the reconstruction error $\|M - M_k\|_F$ (Frobenius norm). No other rank-$k$ matrix can get closer to the original. This mathematical guarantee is why SVD is the gold standard for linear dimensionality reduction.
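
The theorem is easy to illustrate empirically, though a single example is of course not a proof. The sketch below compares the Frobenius error of the rank-$k$ truncated SVD against another rank-$k$ approximation built from a random projection; the SVD error is never larger:

import numpy as np

rng = np.random.default_rng(42)
M = rng.standard_normal((80, 60))
k = 5

# Best rank-k approximation from truncated SVD
U, S, Vt = np.linalg.svd(M, full_matrices=False)
M_svd = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# A competing rank-k approximation: project M onto a random k-dimensional column space
Q, _ = np.linalg.qr(rng.standard_normal((80, k)))  # orthonormal basis chosen at random
M_rand = Q @ (Q.T @ M)                             # rank at most k by construction

err_svd = np.linalg.norm(M - M_svd, 'fro')
err_rand = np.linalg.norm(M - M_rand, 'fro')
print(err_svd <= err_rand)  # True: no rank-k matrix beats the truncated SVD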

Quantifying Information Loss

How much information do we lose by keeping only $k$ dimensions? The singular values provide an exact answer. Since each $\sigma_i^2$ represents the variance captured by dimension $i$, the fraction of total variance explained by the top $k$ dimensions is:

$$\text{Explained variance ratio} = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{r} \sigma_i^2}$$

where:

  • $\sigma_i$: the $i$-th singular value (importance of dimension $i$)
  • $k$: the number of dimensions we choose to retain
  • $r$: the rank of the original matrix (total non-zero singular values)

The key insight: In text data, this ratio typically rises quickly. The first few dimensions might capture 50% of the variance, the first 50 might capture 80%, and the first 200 might capture 95%. The remaining thousands of dimensions contribute almost nothing.

This rapid concentration of variance is why dimensionality reduction works so well for language. The semantic structure we care about lives in a low-dimensional subspace.

In[4]:
# Create a more realistic example: a term-document matrix
np.random.seed(42)

# Simulate a corpus with 3 hidden topics
n_words = 100
n_docs = 50
n_topics = 3

# Topic-word distributions (which words belong to which topic)
topic_word = np.random.dirichlet(np.ones(n_words) * 0.1, size=n_topics)

# Document-topic distributions (which topics each document covers)
doc_topic = np.random.dirichlet(np.ones(n_topics) * 0.5, size=n_docs)

# Generate term-document matrix
term_doc = np.zeros((n_words, n_docs))
for d in range(n_docs):
    for t in range(n_topics):
        term_doc[:, d] += doc_topic[d, t] * topic_word[t] * 100

# Add noise
term_doc += np.random.poisson(1, size=(n_words, n_docs))

# Compute SVD
U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)

# Calculate explained variance
total_variance = np.sum(S**2)
explained_variance = np.cumsum(S**2) / total_variance
Out[5]:
Singular value analysis:
--------------------------------------------------
Matrix shape: (100, 50)
Number of singular values: 50

Top 10 singular values:
  σ_ 1 =   177.86  (cumulative variance:  70.7%)
  σ_ 2 =    75.43  (cumulative variance:  83.4%)
  σ_ 3 =    56.40  (cumulative variance:  90.5%)
  σ_ 4 =    16.70  (cumulative variance:  91.1%)
  σ_ 5 =    15.06  (cumulative variance:  91.7%)
  σ_ 6 =    14.78  (cumulative variance:  92.1%)
  σ_ 7 =    14.35  (cumulative variance:  92.6%)
  σ_ 8 =    14.06  (cumulative variance:  93.0%)
  σ_ 9 =    13.62  (cumulative variance:  93.5%)
  σ_10 =    13.09  (cumulative variance:  93.8%)

Notice the dramatic gap between the first three singular values and the rest. This isn't coincidence. We generated this data with exactly three hidden topics. The singular value spectrum has revealed this latent structure: three large values corresponding to the three topics, followed by much smaller values representing noise and fine-grained details.

This is the power of SVD for exploratory analysis: the singular value decay reveals the intrinsic dimensionality of your data. A sharp drop suggests clear underlying structure; a gradual decay suggests more complex, distributed patterns.

Out[6]:
Visualization
Line plot showing singular values decreasing rapidly from around 200 to near zero.
Singular values decay rapidly, with the first few capturing most of the matrix's structure. The steep initial drop indicates strong underlying patterns (topics) in the data.

The left plot shows the singular value spectrum. Note how the first three values tower over the rest. The right plot translates this into cumulative explained variance: with just 10 dimensions, we've captured over 90% of the matrix's information. The annotation marks where we cross the 90% threshold, a common heuristic for choosing dimensionality.

In real text data, the decay is typically more gradual than this simulated example, but the principle holds: a small number of dimensions captures the semantic structure that matters.

Latent Semantic Analysis (LSA): From Theory to Practice

Now that we understand the mathematics of SVD, let's see how it transforms our understanding of text. Latent Semantic Analysis (LSA), introduced by Deerwester et al. in 1990, was one of the first successful methods for learning word representations, and it's built entirely on the truncated SVD we just developed.

The Central Insight

LSA rests on a powerful observation: the reduced SVD dimensions aren't arbitrary; they correspond to latent semantic concepts.

Consider what happens when we apply truncated SVD to a term-document matrix:

  1. Words that appear in similar documents get projected to similar locations
  2. Documents about similar topics get projected to similar locations
  3. The dimensions themselves often correspond to interpretable "topics" or semantic contrasts

This means two words can be similar in LSA space even if they never directly co-occur, as long as they appear in similar contexts. "Car" and "automobile" might never appear in the same document, but if they appear in documents about the same topics, LSA will discover their similarity.

Building an LSA Model Step by Step

Let's implement LSA from scratch. By building each component ourselves, we'll develop intuition for how the mathematics translates into semantic understanding.

In[7]:
from collections import Counter
import re

# A small corpus about technology and cooking
documents = [
    "The computer processes data using algorithms and software programs",
    "Machine learning algorithms analyze patterns in large datasets",
    "Software engineers write code to build applications",
    "The chef prepared a delicious meal with fresh ingredients",
    "Cooking requires following recipes and using proper techniques",
    "The restaurant serves gourmet dishes made by skilled chefs",
    "Data scientists use programming to analyze information",
    "The kitchen was equipped with modern cooking appliances",
    "Artificial intelligence systems learn from training data",
    "The recipe called for fresh vegetables and herbs"
]

def tokenize(text):
    """Simple tokenization: lowercase and split on non-alphanumeric."""
    return re.findall(r'\b[a-z]+\b', text.lower())

# Build vocabulary
all_tokens = []
for doc in documents:
    all_tokens.extend(tokenize(doc))

# Filter to words appearing at least twice
word_counts = Counter(all_tokens)
vocab = sorted([w for w, c in word_counts.items() if c >= 2])
word_to_idx = {w: i for i, w in enumerate(vocab)}
Out[8]:
Corpus: 10 documents
Vocabulary size: 11 words

Vocabulary: ['algorithms', 'analyze', 'and', 'cooking', 'data', 'fresh', 'software', 'the', 'to', 'using', 'with']

Our filtered vocabulary contains words that appear in at least two documents, a simple but effective way to focus on meaningful terms. Notice the mix: technology words ("data," "algorithms"), cooking words ("cooking," "fresh"), and general words ("the," "with"). This diversity will let us observe how LSA discovers semantic structure without being told what the categories are.

In[9]:
# Build term-document matrix
def build_term_document_matrix(documents, word_to_idx):
    """Build a term-document matrix from documents."""
    n_words = len(word_to_idx)
    n_docs = len(documents)
    matrix = np.zeros((n_words, n_docs))
    
    for d, doc in enumerate(documents):
        tokens = tokenize(doc)
        for token in tokens:
            if token in word_to_idx:
                matrix[word_to_idx[token], d] += 1
    
    return matrix

term_doc_matrix = build_term_document_matrix(documents, word_to_idx)
Out[10]:
Term-document matrix shape: (11, 10)
Non-zero entries: 27
Sparsity: 75.5%
Out[11]:
Visualization
Heatmap showing a sparse term-document matrix with most cells being zero (light) and occasional non-zero entries (darker).
Heatmap of the term-document matrix. Each row is a word, each column is a document. Darker cells indicate higher word counts. The prevalence of light cells (zeros) illustrates the extreme sparsity typical of text data, where most words don't appear in most documents.

The heatmap reveals the sparsity pattern visually. Most cells are light (zero counts), with scattered darker cells where words actually appear in documents. This visual makes clear why storing and computing with the raw matrix is inefficient.

The matrix is extremely sparse. Most entries are zero because any given document contains only a tiny fraction of the vocabulary. This sparsity is both a curse and an opportunity:

  • The curse: Sparse vectors are inefficient to store and compute with. Similarity calculations are unreliable when most dimensions are zero.
  • The opportunity: The sparsity tells us the data is redundant. If most entries are predictable (zero), then the true information content is much smaller than the matrix size suggests.

SVD exploits this redundancy, compressing the sparse matrix into a dense representation that captures the underlying patterns.

Preprocessing: TF-IDF Weighting

Before applying SVD, we apply TF-IDF (Term Frequency-Inverse Document Frequency) weighting. This transformation addresses a subtle problem: raw counts overweight common words.

Words like "the" and "is" appear in nearly every document but carry little semantic information. TF-IDF down-weights these ubiquitous terms and up-weights distinctive words that characterize specific documents. The result is a matrix where the important patterns are more prominent.

In[12]:
def apply_tfidf(term_doc_matrix):
    """Apply TF-IDF weighting to a term-document matrix."""
    # Term frequency: normalize by document length
    tf = term_doc_matrix / (term_doc_matrix.sum(axis=0, keepdims=True) + 1e-10)
    
    # Inverse document frequency
    n_docs = term_doc_matrix.shape[1]
    doc_freq = np.sum(term_doc_matrix > 0, axis=1, keepdims=True)
    idf = np.log((n_docs + 1) / (doc_freq + 1)) + 1  # Smoothed IDF
    
    return tf * idf

tfidf_matrix = apply_tfidf(term_doc_matrix)

Applying Truncated SVD: The Heart of LSA

Now comes the key step: we apply truncated SVD to extract the latent semantic dimensions. The function below computes the full SVD, then truncates to keep only the top $k$ components.

Important detail: Notice how we construct word and document vectors. For word vectors, we multiply $U_k$ by the singular values; for document vectors, we multiply $V_k^T$ by the singular values. This weighting ensures that more important dimensions contribute more to similarity calculations.

In[13]:
def lsa_transform(matrix, n_components):
    """Apply LSA (truncated SVD) to a term-document matrix."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    
    # Truncate to k components
    U_k = U[:, :n_components]
    S_k = S[:n_components]
    Vt_k = Vt[:n_components, :]
    
    # Word vectors: U_k @ diag(S_k)
    # Document vectors: diag(S_k) @ Vt_k
    word_vectors = U_k * S_k  # Broadcasting multiplies each column by corresponding singular value
    doc_vectors = (S_k.reshape(-1, 1) * Vt_k).T  # Each row is a document
    
    return word_vectors, doc_vectors, S_k

# Apply LSA with 3 components
n_components = 3
word_vectors, doc_vectors, singular_values = lsa_transform(tfidf_matrix, n_components)
Out[14]:
LSA with 3 components:
  Word vectors shape: (11, 3)
  Document vectors shape: (10, 3)
  Singular values: [2.23368488 2.08136469 1.76910355]

We've achieved our goal: each word is now represented by a dense 3-dimensional vector instead of a sparse 10-dimensional document vector. The singular values tell us the relative importance of each dimension. The first captures the most variance, the second captures the most of what remains, and so on.

Out[15]:
Visualization
Side-by-side heatmaps showing the original TF-IDF matrix and its rank-3 SVD reconstruction, demonstrating how truncated SVD captures the essential structure.
Visualization of the SVD decomposition M ≈ U_k Σ_k V_k^T. The original TF-IDF matrix (left) is approximated by the product of three smaller matrices. The reconstruction (right) captures the essential patterns while smoothing out noise. Notice how the rank-3 approximation preserves the main structure.

The visualization shows the SVD decomposition in action. The original matrix on the left contains the TF-IDF weighted word counts. The middle panel shows the singular values that weight each dimension. The reconstruction on the right shows what we get when we multiply the truncated components back together. The essential patterns are preserved while noise is smoothed out.

But what do these dimensions mean? This is where LSA becomes fascinating.

Interpreting the Latent Dimensions

Unlike hand-crafted features, SVD dimensions don't have predetermined meanings. But we can interpret them by examining which words load most strongly on each pole. A dimension might separate "technology vs. cooking," or "abstract vs. concrete," or capture some other semantic contrast that emerges from the data.

In[16]:
def top_words_per_dimension(word_vectors, vocab, n_top=5):
    """Find words with highest and lowest loadings on each dimension."""
    results = []
    for dim in range(word_vectors.shape[1]):
        loadings = word_vectors[:, dim]
        
        # Top positive loadings
        top_pos_idx = np.argsort(loadings)[-n_top:][::-1]
        top_pos = [(vocab[i], loadings[i]) for i in top_pos_idx]
        
        # Top negative loadings
        top_neg_idx = np.argsort(loadings)[:n_top]
        top_neg = [(vocab[i], loadings[i]) for i in top_neg_idx]
        
        results.append({'positive': top_pos, 'negative': top_neg})
    
    return results

dimension_analysis = top_words_per_dimension(word_vectors, vocab)
Out[17]:
LSA Dimension Analysis:
============================================================

Dimension 1:
  Positive pole:
    data           : +1.967
    analyze        : +0.578
    to             : +0.578
    the            : +0.391
    algorithms     : +0.325
  Negative pole:
    cooking        : +0.104
    with           : +0.119
    fresh          : +0.126
    using          : +0.131
    and            : +0.172

Dimension 2:
  Positive pole:
    the            : +1.704
    with           : +0.643
    fresh          : +0.631
    cooking        : +0.443
    and            : +0.419
  Negative pole:
    data           : -0.435
    to             : -0.106
    analyze        : -0.106
    software       : +0.022
    algorithms     : +0.022

Dimension 3:
  Positive pole:
    analyze        : +0.883
    to             : +0.883
    algorithms     : +0.711
    software       : +0.711
    using          : +0.052
  Negative pole:
    data           : -0.740
    the            : -0.079
    with           : -0.046
    fresh          : -0.038
    cooking        : -0.005

The output reveals interpretable structure. Each dimension has two poles, positive and negative loadings, and words cluster according to their semantic properties. Technology terms group together at one pole; cooking terms at another.

This is the magic of LSA: we never told the algorithm about "technology" or "cooking." It discovered these categories by analyzing which words appear in similar documents. The latent dimensions are emergent properties of the corpus structure.

Visualizing the Semantic Space

The scatter plot below shows words positioned by their first two LSA coordinates. Even with just two dimensions, the semantic organization is visible.

Out[18]:
Visualization
2D scatter plot showing words positioned by their LSA coordinates, with technology and cooking words forming distinct clusters.
Words projected into the first two LSA dimensions. Technology-related words (blue) cluster separately from cooking-related words (red). The axes represent abstract semantic dimensions discovered by SVD, not predefined categories.

Documents can also be visualized in the same LSA space, revealing how they cluster by topic.

Out[19]:
Visualization
Scatter plot showing both words and documents positioned in 2D LSA space, with clear clustering by topic.
Documents projected into LSA space alongside words. Technology documents (blue squares) cluster in one region, cooking documents (red squares) in another. This shows that LSA creates a shared semantic space where both words and documents can be compared directly.

This joint visualization reveals a key property of LSA: words and documents live in the same semantic space. A document's position is essentially the weighted average of its words' positions. This shared representation enables powerful applications like finding documents similar to a query word, or finding words that characterize a document.
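
One practical consequence is that new text can be folded into an existing LSA space without recomputing the SVD. The sketch below (reusing tokenize, word_to_idx, and tfidf_matrix from above, and skipping IDF weighting of the query for simplicity) projects a new snippet's term-count vector as $U_k^T q$, which lands in the same space as the document vectors defined earlier:

def fold_in_text(text, tfidf_matrix, word_to_idx, n_components=3):
    """Project new text into an existing LSA space (rough sketch).

    Note: for simplicity the query uses raw term counts rather than the
    training-time TF-IDF weights, so its position is only approximate.
    """
    U, _, _ = np.linalg.svd(tfidf_matrix, full_matrices=False)
    U_k = U[:, :n_components]

    # Term-count vector for the new text over the existing vocabulary
    q = np.zeros(len(word_to_idx))
    for token in tokenize(text):
        if token in word_to_idx:
            q[word_to_idx[token]] += 1

    # Same construction as the document vectors: project onto the latent dimensions
    return U_k.T @ q

query_vec = fold_in_text("fresh ingredients for cooking", tfidf_matrix, word_to_idx)
print(query_vec.shape)  # (3,) -- comparable to rows of doc_vectors via cosine similarity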

Computing Semantic Similarity

With words represented as dense vectors, computing similarity becomes straightforward and efficient. We use cosine similarity, the cosine of the angle between two vectors, which measures how aligned two words are in the semantic space, regardless of their vector magnitudes.

In[20]:
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_words_lsa(word, word_vectors, vocab, word_to_idx, top_n=5):
    """Find most similar words using LSA vectors."""
    if word not in word_to_idx:
        return []
    
    word_idx = word_to_idx[word]
    word_vec = word_vectors[word_idx].reshape(1, -1)
    
    # Compute similarity with all words
    similarities = cosine_similarity(word_vec, word_vectors)[0]
    
    # Get top similar (excluding the word itself)
    similar_indices = np.argsort(similarities)[::-1]
    results = []
    for idx in similar_indices:
        if idx != word_idx:
            results.append((vocab[idx], similarities[idx]))
            if len(results) >= top_n:
                break
    
    return results
Out[21]:
Word similarities in LSA space:
--------------------------------------------------

Similar to 'data':
    using          : 0.285
    to             : 0.233
    analyze        : 0.233
    and            : 0.132

Similar to 'cooking':
    the            : 0.999
    fresh          : 0.998
    with           : 0.997
    and            : 0.983

Similar to 'fresh':
    with           : 1.000
    the            : 0.999
    cooking        : 0.998
    and            : 0.972

The results show the principle at work, even on this tiny corpus: words become similar because they appear in similar documents, not only because they co-occur directly. "Data" is closest to "using," "to," and "analyze" because they share the technology documents, while "cooking" groups with "fresh" (and, noisily, with function words like "the" and "with") through the cooking documents. With only ten documents and a vocabulary that retains common words, the rankings are rough; on a realistic corpus, "data" would cluster with words like "algorithms," and "cooking" with words like "chef" and "kitchen."

Out[22]:
Visualization
Heatmap showing pairwise word similarities, with visible clustering of semantically related words.
Pairwise cosine similarity matrix between words in LSA space. Warmer colors indicate higher similarity. Notice the block structure: technology words (left cluster) are similar to each other, cooking words (right cluster) form another group, with low similarity between the clusters.

The similarity heatmap reveals the semantic structure LSA has discovered. Words are reordered using hierarchical clustering to group similar words together. The resulting block structure shows clear semantic clusters: technology terms cluster together, cooking terms form another block, with low similarity (cooler colors) between the domains.

The key insight: LSA discovers similarity through transitive relationships. Even if "data" and "algorithms" never appear in the same sentence, they're similar because they both appear with words like "computer," "software," and "analysis." The SVD compression captures these indirect relationships that raw co-occurrence counts miss.
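
One way to see this effect directly is to compare a word pair's cosine similarity computed from the raw TF-IDF rows against the similarity in the truncated space. The sketch below reuses tfidf_matrix, word_vectors, and word_to_idx from the LSA example; on larger corpora, pairs that share contexts without co-occurring tend to move closer together after the SVD step:

from sklearn.metrics.pairwise import cosine_similarity

def compare_raw_vs_lsa(w1, w2, tfidf_matrix, word_vectors, word_to_idx):
    """Cosine similarity of a word pair before and after truncated SVD (sketch)."""
    i, j = word_to_idx[w1], word_to_idx[w2]
    raw = cosine_similarity(tfidf_matrix[i:i+1], tfidf_matrix[j:j+1])[0, 0]
    lsa = cosine_similarity(word_vectors[i:i+1], word_vectors[j:j+1])[0, 0]
    return raw, lsa

raw_sim, lsa_sim = compare_raw_vs_lsa("data", "analyze", tfidf_matrix, word_vectors, word_to_idx)
print(f"raw TF-IDF similarity: {raw_sim:.3f}   LSA similarity: {lsa_sim:.3f}")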

Choosing the Number of Dimensions

We've seen that truncated SVD works, but we've glossed over a critical decision: how many dimensions should we keep?

This isn't just a technical detail. The choice of kk fundamentally affects what your model captures:

  • Too few dimensions: You lose important distinctions. "Car" and "automobile" might be similar, but so might "car" and "truck" if you've collapsed too much structure.
  • Too many dimensions: You keep noise. The model memorizes idiosyncratic patterns that don't generalize to new data.

The sweet spot depends on your data and your task. Here are the main approaches for finding it.

The Elbow Method: Visual Inspection

The simplest approach is to plot the singular values and look for an elbow, a point where the curve bends from steep decline to gradual decay. Dimensions before the elbow capture signal; those after capture noise.

In[23]:
# Compute full SVD for analysis
U_full, S_full, Vt_full = np.linalg.svd(tfidf_matrix, full_matrices=False)
explained_var = np.cumsum(S_full**2) / np.sum(S_full**2)
Out[24]:
Visualization
Two plots showing singular value decay and cumulative explained variance with an elbow around 3-4 dimensions.
The elbow method for choosing dimensions. The singular value spectrum (left) shows a clear drop after the first few components. The explained variance curve (right) shows diminishing returns as we add dimensions. The 'elbow' around k=3-4 suggests this is a reasonable number of dimensions for this corpus.

Task-Based Selection: Let Performance Decide

The elbow method gives a starting point, but the best approach is often empirical: evaluate different dimension counts on your actual task.

  • For document retrieval: Measure precision and recall at different $k$ values
  • For word similarity: Compare against human similarity judgments (datasets like SimLex-999)
  • For classification: Use cross-validation accuracy as your guide

The "optimal" kk varies by task. A word similarity task might need 300 dimensions to capture fine distinctions, while a topic classification task might work best with 50 dimensions that capture broad categories.

In[25]:
# Simulate evaluating different dimension counts
def evaluate_reconstruction_error(matrix, k):
    """Compute reconstruction error for rank-k approximation."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    reconstruction = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
    error = np.linalg.norm(matrix - reconstruction, 'fro')
    return error

original_norm = np.linalg.norm(tfidf_matrix, 'fro')
k_values = range(1, min(len(vocab), len(documents)) + 1)
errors = [evaluate_reconstruction_error(tfidf_matrix, k) / original_norm for k in k_values]
Out[26]:
Reconstruction error by number of components:
---------------------------------------------
  k = 1: 86.7% relative error
  k = 2: 73.3% relative error
  k = 3: 61.8% relative error
  k = 4: 50.1% relative error
  k = 5: 39.2% relative error

The reconstruction error drops rapidly with the first few dimensions, then levels off. This pattern of steep initial decline followed by diminishing returns is characteristic of structured data. The first dimensions capture the dominant patterns; additional dimensions add progressively less information, eventually just fitting noise.

Out[27]:
Visualization
Four heatmaps showing the original matrix and its reconstructions at increasing ranks, demonstrating progressive improvement in approximation quality.
Reconstruction quality at different truncation levels. From left to right: original matrix, rank-1, rank-2, and rank-3 approximations. Each additional dimension captures more structure. By rank-3, the essential patterns are preserved while noise is smoothed.

The progression from k=1 to k=3 shows how each additional dimension adds information. The rank-1 approximation captures only the most dominant pattern (overall word frequency). Rank-2 begins to distinguish between document types. By rank-3, the reconstruction closely matches the original, capturing the essential semantic structure while smoothing out idiosyncratic noise.

Practical Guidelines

For LSA on typical document collections:

  • 50-300 dimensions is a common range for document retrieval and topic modeling
  • 100-500 dimensions works well for word similarity tasks
  • Start with 100 dimensions as a baseline and tune from there
  • More data generally supports more dimensions without overfitting
Dimension Selection Heuristics

Rule of thumb: Start with $k \approx \sqrt{n}$ where $n$ is the number of documents, then tune based on task performance. For word-word matrices, $k$ between 100 and 500 typically works well.
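
Both heuristics are one-liners. The snippet below combines the square-root starting point with a 90%-variance threshold, reusing tfidf_matrix and the explained_var array computed in the elbow-method cell above:

# Two quick starting points for k (reusing explained_var from the elbow-method cell)
n_docs = tfidf_matrix.shape[1]
k_sqrt = max(1, int(round(np.sqrt(n_docs))))             # k ~ sqrt(n) rule of thumb
k_var90 = int(np.searchsorted(explained_var, 0.90)) + 1  # smallest k reaching 90% variance

print(f"sqrt heuristic: k = {k_sqrt}, 90% variance threshold: k = {k_var90}")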

Computational Complexity: The Scaling Challenge

Everything we've discussed works beautifully on small datasets. But real-world NLP deals with massive scale: vocabularies of 100,000+ words, corpora with millions of documents. Here, SVD's computational cost becomes a serious obstacle.

Full SVD has complexity $O(\min(m^2 n, m n^2))$, roughly cubic in the matrix dimensions. For a 100,000 × 50,000 matrix, this means hundreds of trillions of operations. Even on fast hardware, computation takes hours or days.

In[28]:
import time

def benchmark_svd(sizes):
    """Benchmark SVD computation time for different matrix sizes."""
    results = []
    
    for m, n in sizes:
        # Create random matrix
        matrix = np.random.randn(m, n)
        
        # Time full SVD
        start = time.time()
        U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
        elapsed = time.time() - start
        
        results.append({
            'size': f'{m}x{n}',
            'm': m,
            'n': n,
            'time': elapsed
        })
    
    return results

# Benchmark on small matrices (full SVD is too slow for large ones)
sizes = [(100, 50), (200, 100), (500, 200), (1000, 500)]
benchmark_results = benchmark_svd(sizes)
Out[29]:
SVD Computation Time (full SVD):
---------------------------------------------
 Matrix Size     Time (s)
---------------------------------------------
      100x50       0.0696
     200x100       2.7923
     500x200       8.0098
    1000x500      37.2111

Estimated time for larger matrices:
  10,000 x 5,000:  ~minutes
  100,000 x 50,000: ~hours (impractical)
Out[30]:
Visualization
Line plot showing SVD time increasing steeply with matrix size on a log scale.
SVD computation time grows rapidly with matrix size. The cubic complexity makes full SVD impractical for large vocabularies. This motivates randomized algorithms that provide approximate solutions much faster.

Randomized SVD: Making Scale Tractable

The scaling problem seems insurmountable, until you realize we don't actually need the full SVD. We only want the top $k$ components, and we've seen that $k$ is typically small (50-500). Can we compute just those components without computing everything?

Randomized SVD answers yes, achieving $O(mnk)$ complexity, linear in the matrix size and the number of components we want. For a 100,000 × 50,000 matrix with $k = 100$, this is 500× faster than full SVD.

The Key Insight: Random Projection

The algorithm rests on a beautiful idea: random projection preserves structure. If we multiply our matrix by a random matrix, we get a smaller matrix that captures most of the original's important directions. We can then compute SVD on this smaller matrix and recover an approximation to the original SVD.

Why does randomness help? Because the important directions (those with large singular values) are robust and show up even after random projection. The unimportant directions (small singular values) get scrambled, but we didn't want them anyway.

In[31]:
def randomized_svd(matrix, n_components, n_oversamples=10, n_iter=2):
    """
    Compute approximate truncated SVD using randomization.
    
    Parameters:
    -----------
    matrix : array of shape (m, n)
    n_components : int, target rank
    n_oversamples : int, extra dimensions for accuracy
    n_iter : int, power iterations for accuracy
    
    Returns:
    --------
    U, S, Vt : truncated SVD components
    """
    m, n = matrix.shape
    k = n_components + n_oversamples
    
    # Step 1: Random projection to find range
    # Generate random matrix
    random_matrix = np.random.randn(n, k)
    
    # Project: Y = A @ random_matrix
    Y = matrix @ random_matrix
    
    # Power iteration for better accuracy
    for _ in range(n_iter):
        Y = matrix @ (matrix.T @ Y)
    
    # Orthonormalize
    Q, _ = np.linalg.qr(Y)
    
    # Step 2: SVD on smaller matrix
    # B = Q^T @ A is k x n (much smaller than m x n)
    B = Q.T @ matrix
    
    # SVD of B
    U_tilde, S, Vt = np.linalg.svd(B, full_matrices=False)
    
    # Recover U
    U = Q @ U_tilde
    
    # Truncate to desired components
    return U[:, :n_components], S[:n_components], Vt[:n_components, :]

Comparing Randomized to Full SVD

Let's verify that randomized SVD delivers on its promise: comparable accuracy at a fraction of the cost.

In[32]:
# Create a larger test matrix
np.random.seed(42)
large_matrix = np.random.randn(1000, 500)

# Time full SVD
start = time.time()
U_full, S_full, Vt_full = np.linalg.svd(large_matrix, full_matrices=False)
full_time = time.time() - start

# Time randomized SVD (k=50)
k = 50
start = time.time()
U_rand, S_rand, Vt_rand = randomized_svd(large_matrix, k)
rand_time = time.time() - start

# Compare accuracy
# Reconstruction error
full_recon = U_full[:, :k] @ np.diag(S_full[:k]) @ Vt_full[:k, :]
rand_recon = U_rand @ np.diag(S_rand) @ Vt_rand

full_error = np.linalg.norm(large_matrix - full_recon, 'fro')
rand_error = np.linalg.norm(large_matrix - rand_recon, 'fro')
Out[33]:
Comparison: Full SVD vs Randomized SVD
==================================================
Matrix size: 1000 x 500
Target rank: 50

Full SVD time:       33.6863 seconds
Randomized SVD time: 3.7355 seconds
Speedup:             9.0x

Full SVD reconstruction error:       616.1608
Randomized SVD reconstruction error: 625.9156
Relative difference:                 1.58%

The results confirm the theory: randomized SVD achieves nearly identical reconstruction error while running significantly faster. The small accuracy gap (typically 1-2%) is a worthwhile trade-off for the dramatic speedup.

Production Implementation: scikit-learn

For production use, scikit-learn's TruncatedSVD provides an optimized implementation with sensible defaults. It uses randomized algorithms automatically for large matrices.

In[34]:
from sklearn.decomposition import TruncatedSVD

# Using scikit-learn's implementation
svd = TruncatedSVD(n_components=50, algorithm='randomized', random_state=42)

start = time.time()
transformed = svd.fit_transform(large_matrix)
sklearn_time = time.time() - start
Out[35]:
Scikit-learn TruncatedSVD time: 1.7948 seconds
Explained variance ratio: 0.236

Scikit-learn's implementation is fast, numerically stable, and handles edge cases gracefully. For most applications, this is the recommended approach. There's no need to implement randomized SVD yourself unless you have specialized requirements.
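
For reference, a typical end-to-end LSA setup in scikit-learn chains TF-IDF, truncated SVD, and length normalization. This is a sketch of common practice rather than the only reasonable configuration; it reuses the small documents list from the earlier LSA example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

# TF-IDF -> truncated SVD -> unit-length vectors (so dot products are cosine similarities)
lsa_pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=5, algorithm="randomized", random_state=42),
    Normalizer(copy=False),
)

doc_embeddings = lsa_pipeline.fit_transform(documents)  # `documents` from the earlier example
print(doc_embeddings.shape)  # (10, 5): one dense vector per document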

Interpreting SVD Dimensions

One of SVD's advantages over neural embeddings is interpretability. While the dimensions don't have predetermined meanings, we can often understand what they capture by examining which words load most strongly on each pole.

This interpretability has practical value: it helps us understand what our model has learned, diagnose problems, and explain results to stakeholders.

In[36]:
# Build a larger corpus for more meaningful dimensions
extended_corpus = [
    # Technology cluster
    "Machine learning algorithms process large datasets efficiently",
    "Neural networks learn patterns from training data",
    "Software engineers develop applications using programming languages",
    "Data scientists analyze information using statistical methods",
    "Artificial intelligence systems make predictions from data",
    "Computer vision algorithms recognize objects in images",
    "Natural language processing enables machines to understand text",
    "Deep learning models require powerful computing hardware",
    
    # Cooking cluster
    "The chef prepared a gourmet meal with fresh ingredients",
    "Cooking techniques vary across different cuisines",
    "The recipe calls for olive oil and fresh herbs",
    "Restaurant kitchens use professional cooking equipment",
    "Baking requires precise measurements of flour and sugar",
    "The menu features seasonal dishes with local produce",
    "Culinary schools teach classical cooking methods",
    "Food critics review restaurants and their signature dishes",
    
    # Sports cluster
    "The team won the championship after a thrilling game",
    "Athletes train daily to improve their performance",
    "The coach developed a winning strategy for the match",
    "Football players practice passing and scoring techniques",
    "The stadium was filled with enthusiastic fans",
    "Professional sports leagues attract millions of viewers",
    "Olympic athletes compete for gold medals",
    "Basketball requires speed agility and teamwork"
]

# Build term-document matrix
all_tokens_ext = []
for doc in extended_corpus:
    all_tokens_ext.extend(tokenize(doc))

word_counts_ext = Counter(all_tokens_ext)
vocab_ext = sorted([w for w, c in word_counts_ext.items() if c >= 2])
word_to_idx_ext = {w: i for i, w in enumerate(vocab_ext)}

term_doc_ext = build_term_document_matrix(extended_corpus, word_to_idx_ext)
tfidf_ext = apply_tfidf(term_doc_ext)

# Apply LSA with more components
n_comp = 5
word_vecs_ext, doc_vecs_ext, sing_vals_ext = lsa_transform(tfidf_ext, n_comp)
Out[37]:
Extended corpus: 24 documents, 21 words

Dimension Interpretation:
============================================================

Dimension 1 (σ = 3.822):
  Positive pole: ['to', 'athletes', 'their', 'with', 'the']
  Negative pole: ['algorithms', 'learning', 'using', 'requires', 'cooking']

Dimension 2 (σ = 3.375):
  Positive pole: ['learning', 'algorithms', 'a', 'fresh', 'with']
  Negative pole: ['using', 'data', 'from', 'methods', 'cooking']

Dimension 3 (σ = 3.343):
  Positive pole: ['using', 'data', 'from', 'methods', 'learning']
  Negative pole: ['to', 'athletes', 'their', 'for', 'the']
Out[38]:
Visualization
3D scatter plot showing documents clustered by topic in LSA space, with technology, cooking, and sports documents forming distinct groups.
Documents from the extended corpus projected into 3D LSA space. The three topic clusters (technology, cooking, sports) separate clearly in this space. Each point is a document, colored by its topic category. The spatial separation demonstrates how LSA discovers topical structure.

The 3D visualization shows how the three topic clusters separate in LSA space. Technology documents cluster in one region, cooking documents in another, and sports documents in a third. This clear separation emerges purely from analyzing word co-occurrence patterns. We never told the algorithm about these categories.

Out[39]:
Visualization
Three horizontal bar charts showing word loadings on each LSA dimension, with positive and negative poles.
Interpretation of the first three LSA dimensions. Each dimension captures a different semantic contrast. Dimension 1 might separate technology from cooking, while subsequent dimensions capture finer distinctions. The bar charts show which words load most strongly on each pole of each dimension.

The bar charts reveal the semantic contrasts each dimension captures. The first dimension might separate technology from cooking; the second might capture a different contrast (perhaps abstract vs. concrete, or formal vs. informal). Later dimensions capture progressively subtler distinctions.

A note on interpretation: Not every dimension will have a clear human-interpretable meaning. Some capture complex combinations of features, or statistical patterns that don't map neatly to concepts. This is fine. The dimensions are optimized for reconstruction accuracy, not human interpretability.

SVD on Word-Word Matrices

So far we've applied SVD to term-document matrices, where similarity reflects shared document membership. But there's another approach: apply SVD to word-word co-occurrence matrices, where similarity reflects shared local context.

These two approaches capture different aspects of meaning:

  • Term-document (LSA): Words are similar if they appear in similar documents → captures topical/thematic similarity
  • Word-word (SVD on co-occurrence): Words are similar if they appear near similar words → captures syntactic and fine-grained semantic similarity
In[40]:
def build_word_word_matrix(documents, word_to_idx, window_size=2):
    """Build a word-word co-occurrence matrix."""
    n_words = len(word_to_idx)
    matrix = np.zeros((n_words, n_words))
    
    for doc in documents:
        tokens = tokenize(doc)
        
        for i, word in enumerate(tokens):
            if word not in word_to_idx:
                continue
            word_idx = word_to_idx[word]
            
            # Count co-occurrences within window
            start = max(0, i - window_size)
            end = min(len(tokens), i + window_size + 1)
            
            for j in range(start, end):
                if i != j and tokens[j] in word_to_idx:
                    context_idx = word_to_idx[tokens[j]]
                    matrix[word_idx, context_idx] += 1
    
    return matrix

# Build word-word matrix from extended corpus
word_word_matrix = build_word_word_matrix(extended_corpus, word_to_idx_ext, window_size=3)
Out[41]:
Word-word matrix shape: (21, 21)
Non-zero entries: 49
Sparsity: 88.9%

The word-word matrix is even sparser than the term-document matrix. Most word pairs never appear within a few words of each other. This extreme sparsity makes dimensionality reduction essential: the raw co-occurrence counts are too noisy and sparse for reliable similarity computation.

PPMI Transformation: Better Than Raw Counts

Before applying SVD, we transform the co-occurrence counts using Positive Pointwise Mutual Information (PPMI). This transformation converts raw counts into association scores that better reflect semantic relationships.

The intuition: raw counts overweight frequent words. "The" co-occurs with everything, but this tells us nothing about meaning. PPMI asks: "Does this word pair co-occur more than we'd expect by chance?" High PPMI indicates a meaningful association; low PPMI indicates coincidental co-occurrence.

In[42]:
# Apply PPMI transformation before SVD (common practice)
def ppmi_transform(matrix, k=1):
    """Apply Positive Pointwise Mutual Information transformation."""
    # Row and column sums
    row_sum = matrix.sum(axis=1, keepdims=True)
    col_sum = matrix.sum(axis=0, keepdims=True)
    total = matrix.sum()
    
    # Expected counts under independence
    expected = (row_sum @ col_sum) / total
    
    # PMI = log2(observed / expected)
    # Small constants avoid division by zero and log(0)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log2(matrix / (expected + 1e-10) + 1e-10)
    
    # Positive PMI (set negative values to 0)
    ppmi = np.maximum(pmi, 0)
    
    # Shifted PPMI (subtract k)
    ppmi = np.maximum(ppmi - np.log2(k), 0)
    
    return ppmi

ppmi_matrix = ppmi_transform(word_word_matrix)
Out[43]:
Visualization
Two heatmaps side by side comparing raw co-occurrence counts with PPMI-transformed values, showing how PPMI reveals more meaningful structure.
Comparison of raw co-occurrence counts (left) versus PPMI-transformed values (right). Raw counts are dominated by frequent words that co-occur with everything. PPMI highlights meaningful associations by asking whether words co-occur more than expected by chance.

The comparison reveals PPMI's effect: raw counts show high values wherever frequent words appear, regardless of semantic meaning. PPMI normalizes for frequency, highlighting word pairs that co-occur more than expected by chance. These are the meaningful semantic associations.

In[44]:
# SVD on PPMI matrix
U_ww, S_ww, Vt_ww = np.linalg.svd(ppmi_matrix, full_matrices=False)

# Word vectors from SVD
n_dims = 20
word_vectors_ww = U_ww[:, :n_dims] * np.sqrt(S_ww[:n_dims])
Out[45]:
Word similarities from SVD on word-word matrix:
--------------------------------------------------

Similar to 'learning':
    and            : 0.000
    algorithms     : 0.000
    professional   : 0.000
    techniques     : 0.000

Similar to 'cooking':
    using          : 0.283
    and            : 0.145
    for            : 0.012
    fresh          : 0.008

Similar to 'athletes':
    their          : 0.390
    the            : 0.207
    a              : 0.125
    and            : 0.050

Compare these results to the term-document LSA similarities earlier. The word-word approach captures different relationships, focusing more on syntactic and local semantic patterns, less on broad topical similarity.

Which approach is better? It depends on your task:

  • For document retrieval and topic modeling: Term-document LSA captures the relevant structure
  • For word similarity and analogy tasks: Word-word SVD often performs better
  • For general-purpose embeddings: Word-word approaches (which inspired Word2Vec and GloVe) have become the standard

Limitations of SVD-Based Methods

While powerful, SVD has limitations that motivated the development of neural word embeddings.

Linear assumption: SVD finds linear combinations of the original dimensions. It cannot capture non-linear relationships in the data.

Static representations: Each word gets a single vector regardless of context. "Bank" (financial) and "bank" (river) have the same representation.

Scalability: Even with randomized algorithms, SVD struggles with web-scale corpora containing billions of words.

Incremental updates: Adding new documents requires recomputing the entire decomposition. There's no efficient way to update the model incrementally.

Out-of-vocabulary words: Words not seen during training have no representation.

Out[46]:
Visualization
Two scatter plots comparing linear SVD separation versus non-linear neural network separation of word clusters.
Conceptual illustration of SVD limitations. Linear methods like SVD can only find hyperplane separations (left), while neural methods can learn non-linear boundaries (right). This limits SVD's ability to capture complex semantic relationships.

Historical Impact and Modern Relevance

Despite its limitations, SVD-based methods had enormous impact on NLP:

Information retrieval: LSA revolutionized document search by enabling semantic matching. Queries could find relevant documents even without exact keyword overlap.

Dimensionality reduction: The principle of finding low-rank approximations influenced all subsequent embedding methods, including Word2Vec and GloVe.

Interpretability: SVD dimensions often correspond to interpretable concepts, unlike the opaque representations of deep neural networks.

Theoretical foundation: SVD provides optimality guarantees that neural methods lack. The Eckart-Young theorem ensures we're finding the best possible low-rank approximation.

Modern neural embeddings like Word2Vec can be understood as implicit matrix factorization. Levy and Goldberg (2014) showed that skip-gram with negative sampling implicitly factorizes a shifted PMI matrix, connecting neural methods back to the SVD tradition.
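
Concretely, their result states that at its optimum, skip-gram with negative sampling learns word vectors $\vec{w}$ and context vectors $\vec{c}$ whose dot products approximate a shifted PMI matrix:

$$\vec{w} \cdot \vec{c} \approx \mathrm{PMI}(w, c) - \log k$$

where $k$ is the number of negative samples. This is exactly the kind of matrix one could also factorize explicitly with SVD, which is why the two families of methods often behave similarly in practice.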

Summary

Singular Value Decomposition transforms sparse, high-dimensional co-occurrence data into dense, low-dimensional representations that capture semantic structure.

Key concepts:

  • SVD factorization decomposes a matrix into $M = U \Sigma V^T$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ contains singular values
  • Truncated SVD keeps only the top $k$ dimensions, providing the optimal rank-$k$ approximation
  • Singular values indicate dimension importance; their decay reveals the data's intrinsic dimensionality
  • LSA applies SVD to term-document matrices to discover latent semantic structure
  • Randomized SVD enables scaling to large matrices with $O(mnk)$ complexity

Practical considerations:

  • Choose dimensions using the elbow method or task-based evaluation
  • Apply TF-IDF or PPMI weighting before SVD for better results
  • Use randomized algorithms (like scikit-learn's TruncatedSVD) for large matrices
  • Interpret dimensions by examining top-loading words on each pole

Limitations:

  • Linear method cannot capture non-linear relationships
  • Static representations don't handle polysemy
  • Requires recomputation for new documents
  • No representation for out-of-vocabulary words

SVD-based methods remain relevant as baselines and for interpretability. Understanding them provides essential background for the neural embedding methods that followed, which we'll explore in subsequent chapters.

Key Parameters

When applying SVD to text data, these parameters have the greatest impact on the quality and utility of the resulting representations:

| Parameter | Typical Range | Effect |
|---|---|---|
| n_components | 50-500 | Number of dimensions to retain. More components capture more variance but increase computation and may include noise. |
| algorithm | 'arpack', 'randomized' | 'randomized' is faster for large matrices; 'arpack' is more precise for small matrices. |
| n_iter | 2-10 | Power iterations for randomized SVD. More iterations improve accuracy at the cost of speed. |
| n_oversamples | 10-20 | Extra dimensions sampled in randomized SVD. Improves approximation quality with minimal overhead. |

Choosing n_components:

  • For document retrieval and topic modeling: 50-300 dimensions typically work well
  • For word similarity tasks: 100-500 dimensions capture finer distinctions
  • Use the elbow method on singular value decay to identify where signal ends and noise begins
  • When in doubt, start with 100 and tune based on downstream task performance

Preprocessing choices:

  • TF-IDF weighting: Almost always beneficial for term-document matrices. Down-weights frequent terms that appear everywhere.
  • PPMI transformation: Recommended for word-word matrices. Converts raw counts to association scores.
  • Centering: Optional. Mean-centering rows or columns can improve results for some tasks.

Computational considerations:

  • For matrices larger than 10,000 x 10,000, use randomized SVD (algorithm='randomized')
  • Sparse matrix formats (CSR, CSC) reduce memory requirements significantly; see the sketch below
  • Consider incremental/online variants if data arrives in streams
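
As a minimal sketch of the sparse-input path (assuming SciPy is installed), TruncatedSVD accepts sparse matrices directly, so a large term-document matrix never needs to be converted to a dense array:

from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# A random sparse matrix standing in for a large term-document matrix (1% non-zero, CSR format)
X = sparse.random(10_000, 2_000, density=0.01, format="csr", random_state=0)

svd = TruncatedSVD(n_components=100, algorithm="randomized", random_state=42)
row_vectors = svd.fit_transform(X)   # dense output: one 100-dimensional vector per row
print(row_vectors.shape)             # (10000, 100)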
