Word Embedding Evaluation: Intrinsic & Extrinsic Methods with Bias Detection

Michael Brenndoerfer · December 11, 2025 · 33 min read · 7,946 words

Learn how to evaluate word embeddings using similarity tests, analogy tasks, downstream evaluation, t-SNE visualization, and bias detection with WEAT.

Embedding Evaluation

You've trained a word embedding model. The loss converged, the code ran without errors, and you have a matrix of 300-dimensional vectors. But how do you know if these embeddings are any good? What makes one set of embeddings better than another?

Evaluating word embeddings is surprisingly nuanced. Unlike classification tasks where accuracy provides a clear metric, embedding quality is multidimensional. Embeddings might excel at capturing analogies but fail at word similarity. They might perform brilliantly on sentiment analysis but contain harmful biases. The right evaluation depends entirely on how you plan to use the embeddings.

This chapter develops a comprehensive evaluation toolkit. We'll explore intrinsic evaluations that probe the embedding space directly, extrinsic evaluations that measure downstream task performance, visualization techniques for qualitative inspection, and methods for detecting embedded biases. By the end, you'll have the tools to rigorously assess any word embedding model.

Intrinsic vs Extrinsic Evaluation

The fundamental split in embedding evaluation is between intrinsic and extrinsic methods. Understanding this distinction is crucial for interpreting evaluation results.

Intrinsic Evaluation

Intrinsic evaluation measures properties of the embedding space directly, without reference to any downstream task. Common intrinsic tests include word similarity correlation, analogy completion, and clustering coherence. These evaluations are fast and interpretable but may not predict real-world performance.

Extrinsic Evaluation

Extrinsic evaluation measures embedding quality through performance on downstream tasks such as sentiment analysis, named entity recognition, or text classification. These evaluations are slower and more complex but directly measure what matters: usefulness for practical applications.

The relationship between intrinsic and extrinsic performance is complex and sometimes counterintuitive. Research has shown that embeddings with higher intrinsic scores don't always perform better on downstream tasks. This disconnect arises because intrinsic tests capture specific linguistic properties, while downstream tasks may require different capabilities.

In[2]:
import numpy as np
from scipy.stats import spearmanr

# Simulated example: two embedding models evaluated on different metrics
model_a_scores = {
    'word_similarity': 0.72,    # Intrinsic: correlation with human similarity judgments
    'analogy_accuracy': 0.65,   # Intrinsic: analogy task accuracy
    'sentiment_f1': 0.83,       # Extrinsic: sentiment classification F1
    'ner_f1': 0.89,             # Extrinsic: named entity recognition F1
}

model_b_scores = {
    'word_similarity': 0.68,    # Lower intrinsic similarity score
    'analogy_accuracy': 0.58,   # Lower analogy score
    'sentiment_f1': 0.86,       # But higher sentiment performance!
    'ner_f1': 0.91,             # And higher NER performance!
}
Out[3]:
Embedding Model Comparison
=======================================================
         Metric             Model A      Model B   
-------------------------------------------------------
   Word Similarity (r)        0.72         0.68    
    Analogy Accuracy          0.65         0.58    
-------------------------------------------------------
      Sentiment F1            0.83         0.86    
         NER F1               0.89         0.91    
=======================================================

Model A wins on intrinsic metrics, but Model B
outperforms on the downstream tasks that matter.
Out[4]:
Visualization
Comparison of intrinsic vs extrinsic evaluation metrics for two embedding models. Model A (blue) achieves higher intrinsic scores, but Model B (orange) outperforms on downstream tasks. This pattern is common and demonstrates why relying solely on intrinsic metrics can be misleading.

This example illustrates a common scenario: Model A achieves higher intrinsic scores, but Model B performs better on downstream tasks. If you're building a sentiment classifier, Model B is clearly the better choice, despite its lower analogy accuracy.

The practical takeaway: always include extrinsic evaluations relevant to your use case. Intrinsic metrics provide useful diagnostic information, but they don't tell the whole story.

Word Similarity Evaluation

The most established intrinsic evaluation is word similarity. Humans rate the semantic similarity of word pairs on a scale (typically 0-10), and we measure how well embedding cosine similarities correlate with these human judgments.

Standard Datasets

Several benchmark datasets have become standard for word similarity evaluation:

WordSim-353 contains 353 word pairs rated by 13-16 human annotators. It mixes similarity (synonymy) and relatedness (associated but not similar). "Cup" and "coffee" are related but not similar. "Car" and "automobile" are both related and similar.

SimLex-999 addresses WordSim's conflation of similarity and relatedness. Its 999 pairs focus specifically on genuine similarity: words that could be substituted for each other. It's harder than WordSim because relatedness doesn't count.

MEN contains 3,000 pairs covering a range of parts of speech. Ratings come from crowd workers rather than expert annotators.

In[5]:
import pandas as pd

# Create a sample of word pairs with human similarity scores
# These values are representative of SimLex-999 style ratings

sample_pairs = [
    # (word1, word2, human_score)
    ('happy', 'cheerful', 9.55),
    ('old', 'new', 1.58),
    ('smart', 'intelligent', 9.20),
    ('car', 'automobile', 8.94),
    ('hot', 'cold', 1.31),
    ('dog', 'cat', 5.95),
    ('king', 'queen', 8.18),
    ('run', 'walk', 6.11),
    ('book', 'paper', 3.89),
    ('love', 'hate', 1.27),
]
Out[6]:
Sample Word Similarity Pairs (SimLex-999 Scale)
==================================================
   Word 1       Word 2      Human Score  
--------------------------------------------------
   happy       cheerful        9.55      
    old          new           1.58      
   smart     intelligent       9.20      
    car       automobile       8.94      
    hot          cold          1.31      
    dog          cat           5.95      
    king        queen          8.18      
    run          walk          6.11      
    book        paper          3.89      
    love         hate          1.27      
==================================================
Scores range from 0 (unrelated) to 10 (identical meaning)

Computing Similarity Correlations

The evaluation process involves two key measurements that work together to tell us how well embeddings capture human intuitions about word similarity.

First, we compute the cosine similarity between each word pair's embeddings. Cosine similarity measures the angle between two vectors, ignoring their magnitudes. Two vectors pointing in similar directions have high cosine similarity (close to 1), while perpendicular vectors have similarity near 0. For word embeddings, high cosine similarity indicates that two words appear in similar contexts and likely share related meanings.

Second, we calculate the Spearman correlation between these embedding similarities and the human ratings. Why Spearman rather than Pearson? Spearman correlation measures whether the rankings match, not whether the actual values align linearly. This is exactly what we want: if humans rank "happy/cheerful" as more similar than "dog/cat", we care whether embeddings agree with that ordering, regardless of the exact numerical values. Spearman correlation ranges from -1 (perfect inverse ranking) through 0 (no relationship) to +1 (perfect agreement).
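
A tiny numeric sketch makes the distinction concrete. The values below are invented for illustration: the embedding similarities follow the human rankings exactly but not linearly, so Spearman reports perfect agreement while Pearson does not.

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical human ratings and embedding similarities that agree on
# ranking but not on a linear scale
human = np.array([9.5, 8.0, 6.0, 3.5, 1.0])
embedding = np.array([0.92, 0.70, 0.55, 0.30, 0.05]) ** 3  # monotone, nonlinear

print(f"Pearson r:    {pearsonr(human, embedding)[0]:.3f}")   # below 1.0
print(f"Spearman rho: {spearmanr(human, embedding)[0]:.3f}")  # exactly 1.0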

In[7]:
import gensim.downloader as api

# Load pre-trained GloVe embeddings
# This downloads ~100MB on first run
glove = api.load('glove-wiki-gigaword-100')
In[8]:
from numpy.linalg import norm

def cosine_similarity(v1, v2):
    """Compute cosine similarity between two vectors."""
    return np.dot(v1, v2) / (norm(v1) * norm(v2))

def evaluate_word_similarity(embeddings, word_pairs):
    """
    Evaluate embeddings on word similarity task.
    
    Returns Spearman correlation and coverage statistics.
    """
    human_scores = []
    embedding_scores = []
    missing_pairs = []
    
    for word1, word2, human_score in word_pairs:
        # Check if both words are in vocabulary
        if word1 in embeddings and word2 in embeddings:
            vec1 = embeddings[word1]
            vec2 = embeddings[word2]
            emb_sim = cosine_similarity(vec1, vec2)
            
            human_scores.append(human_score)
            embedding_scores.append(emb_sim)
        else:
            missing_pairs.append((word1, word2))
    
    # Compute Spearman correlation
    if len(human_scores) >= 2:
        correlation, p_value = spearmanr(human_scores, embedding_scores)
    else:
        correlation, p_value = 0.0, 1.0
    
    return {
        'correlation': correlation,
        'p_value': p_value,
        'coverage': len(human_scores) / len(word_pairs),
        'missing_pairs': missing_pairs
    }

# Evaluate GloVe on our sample pairs
results = evaluate_word_similarity(glove, sample_pairs)
Out[9]:
Word Similarity Evaluation Results
==================================================
Spearman Correlation: 0.1515
P-value:              6.7607e-01
Vocabulary Coverage:  100.0%

Correlation interpretation:
  0.6-0.7: Good
  0.7-0.8: Very good
  0.8+:    Excellent

The Spearman correlation measures how well the embedding-based rankings match human rankings, and the p-value tells us whether that correlation is statistically significant (values below 0.05 indicate significance). On full benchmarks, a correlation above 0.7 indicates that the embeddings preserve human similarity intuitions reasonably well. On our tiny 10-pair sample the correlation is only 0.15 and not significant: with so few pairs the estimate is noisy, and the antonym pairs (hot/cold, love/hate) receive moderate cosine similarities that break the ranking, a pattern we examine next.

Let's examine the individual similarities to understand what the embeddings capture:

In[10]:
# Compute similarities for each pair
pair_analysis = []
for word1, word2, human_score in sample_pairs:
    if word1 in glove and word2 in glove:
        emb_sim = cosine_similarity(glove[word1], glove[word2])
        pair_analysis.append({
            'word1': word1,
            'word2': word2,
            'human': human_score,
            'embedding': emb_sim,
            'error': abs(human_score/10 - emb_sim)  # Normalize human score to 0-1
        })

pair_df = pd.DataFrame(pair_analysis)
pair_df = pair_df.sort_values('human', ascending=False)
Out[11]:
Detailed Similarity Comparison
======================================================================
          Words             Human     Embedding     Error   
----------------------------------------------------------------------
    happy / cheerful         9.55       0.546       0.409   
   smart / intelligent       9.20       0.755       0.165   
    car / automobile         8.94       0.683       0.211   
      king / queen           8.18       0.751       0.067   
       run / walk            6.11       0.668       0.057   
        dog / cat            5.95       0.880       0.285   
      book / paper           3.89       0.623       0.234   
        old / new            1.58       0.643       0.485   
       hot / cold            1.31       0.725       0.594   
       love / hate           1.27       0.570       0.443   
======================================================================
Human scores: 0-10 scale. Embedding: -1 to 1 (cosine similarity).
Out[12]:
Visualization
Correlation between human similarity judgments and GloVe embedding cosine similarities. Each point represents a word pair. A strong positive correlation (points along the diagonal) indicates that the embeddings capture human intuitions about word similarity. Deviations reveal where embedding geometry differs from human perception.

The scatter plot reveals both strengths and limitations. High-similarity pairs like "happy/cheerful" and "smart/intelligent" score high on both scales. Antonyms like "hot/cold" receive low human ratings but may have moderate embedding similarity because they appear in similar contexts (both are temperature words).

Out[13]:
Visualization
Pairwise cosine similarity heatmap for selected words. Darker cells indicate higher similarity. The heatmap reveals interesting patterns: synonyms like "happy/cheerful" and "smart/intelligent" show high similarity, while antonyms like "hot/cold" and "love/hate" also show moderate similarity because they appear in similar contexts.

SimLex vs WordSim: Similarity vs Relatedness

The distinction between similarity and relatedness is crucial. "Coffee" and "cup" are highly related but not similar. You can't substitute one for the other. SimLex-999 was specifically designed to test genuine similarity.

In[14]:
# Examples showing similarity vs relatedness distinction
similarity_vs_relatedness = [
    # word1, word2, type, explanation
    ('car', 'automobile', 'similar', 'Synonyms - can substitute'),
    ('coffee', 'cup', 'related', 'Associated but different concepts'),
    ('doctor', 'hospital', 'related', 'Same domain but different roles'),
    ('big', 'large', 'similar', 'Synonyms - can substitute'),
    ('king', 'crown', 'related', 'Associated but different entities'),
    ('happy', 'joyful', 'similar', 'Synonyms - can substitute'),
]

# Check embeddings for these pairs
relatedness_analysis = []
for w1, w2, rel_type, _ in similarity_vs_relatedness:
    if w1 in glove and w2 in glove:
        sim = cosine_similarity(glove[w1], glove[w2])
        relatedness_analysis.append({
            'pair': f"{w1}/{w2}",
            'type': rel_type,
            'cosine_sim': sim
        })
Out[15]:
Similarity vs Relatedness in Embeddings
============================================================
     Word Pair           Type       Cosine Sim   
------------------------------------------------------------
   car/automobile      similar         0.683     
     coffee/cup        related         0.336     
  doctor/hospital      related         0.690     
     big/large         similar         0.708     
     king/crown        related         0.665     
    happy/joyful       similar         0.526     
============================================================

Note: Embeddings trained on co-occurrence often conflate
similarity and relatedness, scoring both types high.
SimLex-999 specifically tests pure similarity.
Out[16]:
Visualization
Cosine similarities for similar vs related word pairs. Ideally, truly similar pairs (synonyms) should score higher than merely related pairs (associations). GloVe embeddings show overlap between these categories because co-occurrence training conflates similarity and relatedness.

Standard word embeddings trained on co-occurrence tend to conflate similarity and relatedness because related words appear in similar contexts. This is why SimLex-999 is a harder benchmark. If you need embeddings that distinguish these concepts, you may need specialized training objectives.
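
If you want numbers on the full benchmarks rather than our 10-pair sample, gensim's KeyedVectors provides an evaluate_word_pairs helper, and a copy of WordSim-353 ships with gensim's test data. The sketch below assumes your gensim installation includes that file; for SimLex-999 or MEN you would pass a tab-separated file of word pairs and scores downloaded separately.

from gensim.test.utils import datapath

# Run the full WordSim-353 benchmark with gensim's built-in helper.
# Returns Pearson, Spearman, and the out-of-vocabulary ratio.
pearson, spearman, oov_ratio = glove.evaluate_word_pairs(datapath('wordsim353.tsv'))

print(f"Pearson r:     {pearson[0]:.3f}")
print(f"Spearman rho:  {spearman[0]:.3f}")
print(f"OOV ratio:     {oov_ratio:.1f}%")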

Word Analogy Evaluation

The analogy task tests whether embeddings capture semantic relationships through vector arithmetic. The classic example: "king - man + woman ≈ queen." If the embedding space encodes gender as a consistent direction, this arithmetic should work.

The Analogy Task

At first glance, the idea that you can do arithmetic with words seems almost magical. How can you subtract "man" from "king" and add "woman" to get "queen"? The key insight is that word embeddings aren't just arbitrary numbers assigned to words. They're geometric representations where directions carry meaning.

Think of it this way: if the embedding space has learned that "male" and "female" represent opposite ends of a direction (let's call it the gender axis), then moving from "man" to "woman" means traveling along that axis. Similarly, moving from "king" to "queen" should involve the same directional shift. If both relationships encode the same underlying concept (gender), their vector differences should be parallel.

This observation leads to a simple but powerful formula. Given three words A, B, and C, we want to find word D such that "A is to B as C is to D." The relationship between A and B is captured by the vector difference $\mathbf{b} - \mathbf{a}$. If the same relationship holds between C and D, then D should be located at:

$$\mathbf{d} = \mathbf{b} - \mathbf{a} + \mathbf{c}$$

where:

  • $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$: embedding vectors for words A, B, and C
  • $\mathbf{b} - \mathbf{a}$: the relationship vector (what transforms A into B)
  • $\mathbf{d}$: the target vector representing the expected embedding for word D

The formula reads as: "Start at C, then apply the same transformation that takes A to B." We find D by locating the word whose embedding is closest to this computed vector (excluding A, B, and C to prevent trivial solutions).

This geometric property only works if the embedding space has organized itself so that analogous relationships point in consistent directions. When it works, it's evidence that the embeddings have captured genuine semantic structure. When it fails, it often reveals that the relationship isn't as consistent as we assumed, or that the training data didn't provide enough examples for the model to learn it.
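
In practice you rarely have to hand-roll the nearest-neighbor search: gensim's most_similar accepts positive and negative word lists, performs the same arithmetic on normalized vectors, and excludes the query words automatically. The one-liner below should land on "queen" for the classic example; we still implement the search manually in the next cell to make every step explicit.

# Built-in analogy query: king - man + woman ≈ ?
# Positive words are added, negative words are subtracted
glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=5)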

In[17]:
def solve_analogy(embeddings, a, b, c, top_n=5):
    """
    Solve analogy: a is to b as c is to ?
    
    Returns the top_n most likely answers.
    """
    # Check vocabulary coverage
    if not all(w in embeddings for w in [a, b, c]):
        missing = [w for w in [a, b, c] if w not in embeddings]
        return {'error': f"Missing words: {missing}"}
    
    # Compute target vector: b - a + c
    target = embeddings[b] - embeddings[a] + embeddings[c]
    
    # Find nearest neighbors (excluding input words)
    exclude = {a, b, c}
    candidates = []
    
    for word in embeddings.index_to_key[:50000]:  # Check top 50K words
        if word not in exclude:
            sim = cosine_similarity(target, embeddings[word])
            candidates.append((word, sim))
    
    # Sort by similarity
    candidates.sort(key=lambda x: x[1], reverse=True)
    
    return {
        'query': f"{a} : {b} :: {c} : ?",
        'expected_relationship': f"'{b}' is to '{a}' as '?' is to '{c}'",
        'top_answers': candidates[:top_n]
    }

# Test classic analogies
analogies = [
    ('man', 'king', 'woman'),      # Gender analogy
    ('paris', 'france', 'tokyo'),  # Capital-country
    ('slow', 'slower', 'fast'),    # Comparative
    ('walk', 'walked', 'run'),     # Tense
]

analogy_results = []
for a, b, c in analogies:
    result = solve_analogy(glove, a, b, c)
    analogy_results.append(result)
Out[18]:
Word Analogy Results
=================================================================
Query: man : king :: woman : ?
  Top 5 answers:
    queen           (similarity: 0.783)
    monarch         (similarity: 0.693)
    throne          (similarity: 0.683)
    daughter        (similarity: 0.681)
    prince          (similarity: 0.671)

Query: paris : france :: tokyo : ?
  Top 5 answers:
    japan           (similarity: 0.879)
    korea           (similarity: 0.726)
    germany         (similarity: 0.682)
    japanese        (similarity: 0.679)
    china           (similarity: 0.650)

Query: slow : slower :: fast : ?
  Top 5 answers:
    faster          (similarity: 0.803)
    quicker         (similarity: 0.669)
    pace            (similarity: 0.658)
    fastest         (similarity: 0.632)
    speeds          (similarity: 0.599)

Query: walk : walked :: run : ?
  Top 5 answers:
    went            (similarity: 0.734)
    ran             (similarity: 0.728)
    drove           (similarity: 0.724)
    came            (similarity: 0.700)
    out             (similarity: 0.677)

Analogy Categories

The Google Analogy Test Set contains approximately 19,500 analogies spanning 14 categories, grouped into two broad types:

Semantic analogies test factual relationships:

  • Capital-country: Paris : France :: Tokyo : Japan
  • Currency: dollar : USA :: euro : Europe
  • Gender: king : queen :: man : woman

Syntactic analogies test grammatical relationships:

  • Tense: walk : walked :: run : ran
  • Plural: cat : cats :: dog : dogs
  • Comparative: big : bigger :: small : smaller
In[19]:
# Define analogy test categories
analogy_categories = {
    'semantic': {
        'capital-country': [
            ('paris', 'france', 'tokyo', 'japan'),
            ('london', 'england', 'berlin', 'germany'),
            ('rome', 'italy', 'madrid', 'spain'),
        ],
        'gender': [
            ('man', 'woman', 'king', 'queen'),
            ('boy', 'girl', 'brother', 'sister'),
            ('father', 'mother', 'son', 'daughter'),
        ],
    },
    'syntactic': {
        'tense': [
            ('walk', 'walked', 'run', 'ran'),
            ('go', 'went', 'come', 'came'),
            ('see', 'saw', 'hear', 'heard'),
        ],
        'comparative': [
            ('good', 'better', 'bad', 'worse'),
            ('big', 'bigger', 'small', 'smaller'),
            ('fast', 'faster', 'slow', 'slower'),
        ],
    }
}

def evaluate_analogy_accuracy(embeddings, analogies_by_category):
    """Evaluate analogy accuracy by category."""
    results = {}
    
    for cat_type, categories in analogies_by_category.items():
        results[cat_type] = {}
        for cat_name, analogies in categories.items():
            correct = 0
            total = 0
            
            for a, b, c, expected in analogies:
                if all(w in embeddings for w in [a, b, c, expected]):
                    result = solve_analogy(embeddings, a, b, c, top_n=1)
                    if 'top_answers' in result:
                        predicted = result['top_answers'][0][0]
                        if predicted == expected:
                            correct += 1
                    total += 1
            
            accuracy = correct / total if total > 0 else 0
            results[cat_type][cat_name] = {
                'correct': correct,
                'total': total,
                'accuracy': accuracy
            }
    
    return results

category_results = evaluate_analogy_accuracy(glove, analogy_categories)
Out[20]:
Analogy Accuracy by Category
=======================================================

SEMANTIC ANALOGIES
-------------------------------------------------------
  capital-country      3/3 (100.0%)
  gender               2/3 (66.7%)

SYNTACTIC ANALOGIES
-------------------------------------------------------
  tense                2/3 (66.7%)
  comparative          2/3 (66.7%)

=======================================================
OVERALL              9/12 (75.0%)

The accuracy breakdown reveals which relationship types the embeddings capture best. Syntactic analogies often achieve higher accuracy because grammatical patterns appear consistently in text. Semantic analogies like capital-country relationships may vary more depending on how well-represented each entity is in the training corpus.
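
Our hand-picked 12 analogies are only a smoke test. To run the full Google set, gensim ships a copy of questions-words.txt with its test data and provides evaluate_word_analogies, which reports overall accuracy plus per-category breakdowns. The sketch below assumes that file is available; the full run can take several minutes.

from gensim.test.utils import datapath

# Full Google analogy benchmark (~19,500 questions across 14 categories)
accuracy, sections = glove.evaluate_word_analogies(datapath('questions-words.txt'))

print(f"Overall analogy accuracy: {accuracy:.3f}")
for section in sections[:3]:  # peek at the first few categories
    total = len(section['correct']) + len(section['incorrect'])
    if total:
        print(f"  {section['section']}: {len(section['correct'])}/{total}")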

Limitations of Analogy Evaluation

While analogy tasks are popular, they have significant limitations:

  1. Sensitivity to dataset: Performance varies dramatically across different analogy sets
  2. Only tests specific relationships: Good analogy performance doesn't guarantee general embedding quality
  3. Artifacts of training data: Some analogies work because of corpus biases, not linguistic understanding
  4. Unclear relevance: Analogy performance often doesn't correlate with downstream task performance
Out[21]:
Visualization
Geometric interpretation of word analogies. The parallelogram structure shows how consistent relationships manifest as parallel vectors. If 'man→woman' and 'king→queen' are parallel, then b - a + c lands near d. However, this perfect parallelism rarely holds exactly in practice, requiring approximate matching.

Embedding Visualization

Visualization provides qualitative insights that quantitative metrics miss. By projecting high-dimensional embeddings to 2D or 3D, we can observe clustering patterns, outliers, and relationships.

t-SNE Visualization

t-Distributed Stochastic Neighbor Embedding (t-SNE) is the most popular technique for embedding visualization. It preserves local structure: words that are close in high-dimensional space remain close in the projection.

t-SNE

t-SNE is a dimensionality reduction technique that converts high-dimensional similarities into probabilities and finds a low-dimensional representation that preserves these probabilities. It excels at revealing cluster structure but doesn't preserve global distances. Points that appear far apart in t-SNE may or may not be far apart in the original space.

In[22]:
from sklearn.manifold import TSNE

# Select words for visualization
word_categories = {
    'animals': ['dog', 'cat', 'bird', 'fish', 'horse', 'cow', 'pig', 'sheep', 'lion', 'tiger'],
    'colors': ['red', 'blue', 'green', 'yellow', 'black', 'white', 'purple', 'orange', 'pink', 'brown'],
    'numbers': ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'],
    'countries': ['france', 'germany', 'italy', 'spain', 'japan', 'china', 'india', 'brazil', 'canada', 'australia'],
    'verbs': ['run', 'walk', 'jump', 'swim', 'fly', 'eat', 'drink', 'sleep', 'think', 'speak'],
}

# Collect embeddings for visualization
words_for_viz = []
embeddings_for_viz = []
categories_for_viz = []

for category, words in word_categories.items():
    for word in words:
        if word in glove:
            words_for_viz.append(word)
            embeddings_for_viz.append(glove[word])
            categories_for_viz.append(category)

embeddings_matrix = np.array(embeddings_for_viz)

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=min(15, len(words_for_viz)-1))
embeddings_2d = tsne.fit_transform(embeddings_matrix)
Out[23]:
Visualization
t-SNE projection of GloVe embeddings for selected word categories. Words cluster by semantic category: animals group together, colors form a cluster, numbers cluster tightly due to their sequential relationships. The visualization reveals the semantic organization that emerges from training on text co-occurrence.

UMAP as an Alternative

Uniform Manifold Approximation and Projection (UMAP) is a newer alternative to t-SNE. It's faster, better preserves global structure, and produces more reproducible results.

In[24]:
try:
    import umap
    HAS_UMAP = True
except ImportError:
    HAS_UMAP = False

if HAS_UMAP:
    # Apply UMAP
    reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=15, min_dist=0.1)
    embeddings_umap = reducer.fit_transform(embeddings_matrix)
else:
    # Fallback message
    embeddings_umap = None
Out[25]:
Visualization
UMAP projection of the same word embeddings. Compared to t-SNE, UMAP often produces tighter, more separated clusters and runs significantly faster. The global structure is also better preserved, meaning relative positions between clusters are more meaningful.

Visualization Caveats

While visualization is valuable for building intuition, it has important limitations:

  1. Projection distortion: Reducing from 100+ dimensions to 2 inevitably loses information
  2. Non-determinism: t-SNE and UMAP can produce different layouts on different runs
  3. Perplexity/neighbor sensitivity: Results depend heavily on hyperparameter choices
  4. Misleading distances: Distances between clusters in the visualization may not reflect true embedding distances

Use visualization for exploration and communication, but don't make quantitative claims based on 2D projections.
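
The non-determinism and hyperparameter sensitivity are easy to demonstrate. The sketch below reruns t-SNE on the embeddings_matrix built above with different seeds and perplexities, then checks how well the pairwise distances of two runs agree; the modest rank correlation you typically get is a reminder that layouts are not stable.

from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.manifold import TSNE

# Rerun t-SNE on the same data with different seeds and perplexities
layouts = {}
for seed in (0, 1):
    for perplexity in (5, 30):
        layouts[(seed, perplexity)] = TSNE(
            n_components=2, random_state=seed, perplexity=perplexity
        ).fit_transform(embeddings_matrix)

# Compare two runs via the rank correlation of their pairwise distances
d1 = pdist(layouts[(0, 5)])
d2 = pdist(layouts[(1, 30)])
print(f"Distance-rank agreement between runs: {spearmanr(d1, d2)[0]:.2f}")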

Downstream Task Evaluation

The ultimate test of embeddings is performance on real tasks. Let's evaluate embeddings on text classification, a common downstream application.

Topic Classification

We'll use the 20 Newsgroups dataset to evaluate how well embeddings support text classification. This dataset contains posts from different newsgroups, making it ideal for testing whether embeddings capture topical information.

In[26]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import warnings
warnings.filterwarnings('ignore')

# Use 20 newsgroups for a quick demonstration
# Select a subset of categories for binary classification
categories = ['sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    newsgroups.data, newsgroups.target, test_size=0.2, random_state=42
)
In[27]:
def text_to_embedding(text, embeddings, method='mean'):
    """Convert text to a single embedding vector."""
    words = text.lower().split()
    word_vectors = []
    
    for word in words:
        # Clean word
        word = ''.join(c for c in word if c.isalpha())
        if word in embeddings:
            word_vectors.append(embeddings[word])
    
    if not word_vectors:
        # Return zero vector if no words found
        return np.zeros(embeddings.vector_size)
    
    if method == 'mean':
        return np.mean(word_vectors, axis=0)
    elif method == 'max':
        return np.max(word_vectors, axis=0)
    else:
        return np.mean(word_vectors, axis=0)

def evaluate_embeddings_on_classification(embeddings, X_train, X_test, y_train, y_test):
    """Evaluate embeddings on text classification."""
    # Convert texts to embeddings
    X_train_emb = np.array([text_to_embedding(text, embeddings) for text in X_train])
    X_test_emb = np.array([text_to_embedding(text, embeddings) for text in X_test])
    
    # Train classifier
    clf = LogisticRegression(max_iter=1000, random_state=42)
    clf.fit(X_train_emb, y_train)
    
    # Evaluate
    y_pred = clf.predict(X_test_emb)
    
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'f1_macro': f1_score(y_test, y_pred, average='macro'),
    }

# Evaluate GloVe embeddings
classification_results = evaluate_embeddings_on_classification(
    glove, X_train, X_test, y_train, y_test
)
Out[28]:
Downstream Task Evaluation: Text Classification
=======================================================
Dataset: 20 Newsgroups (sci.space vs rec.sport.baseball)
Train size: 1584, Test size: 397

Results with GloVe Embeddings + Logistic Regression:
-------------------------------------------------------
  Accuracy:     93.20%
  F1 (macro):   93.20%

These embeddings successfully capture topic-distinguishing
information, enabling simple classification of documents.

The high accuracy demonstrates that even simple mean-pooled embeddings can capture enough semantic information for topic classification. The key advantage of this evaluation approach is that it directly measures what matters: whether the embeddings help solve real tasks. If your application involves document classification, these results are far more relevant than intrinsic metrics.

Comparison Across Tasks

A thorough extrinsic evaluation tests embeddings across multiple tasks:

In[29]:
# Simulate results for multiple tasks (in practice, you'd run each evaluation)
extrinsic_tasks = {
    'Sentiment Analysis': {'accuracy': 0.88, 'description': 'Binary sentiment on IMDB'},
    'Topic Classification': {'accuracy': 0.92, 'description': '20 Newsgroups'},
    'Named Entity Recognition': {'f1': 0.85, 'description': 'CoNLL-2003'},
    'Question Type': {'accuracy': 0.94, 'description': 'TREC question classification'},
    'Paraphrase Detection': {'f1': 0.78, 'description': 'MRPC paraphrase corpus'},
}
Out[30]:
Visualization
Performance of GloVe embeddings across multiple downstream tasks. The variation in scores reflects how different tasks emphasize different aspects of word meaning. Embeddings may excel at topic classification (which relies on topical word clusters) while struggling with paraphrase detection (which requires fine-grained semantic similarity).

The variation across tasks highlights why no single metric captures embedding quality. Choose evaluation tasks that match your intended application.

Embedding Bias Detection

Word embeddings learn from human-generated text, and human text contains biases. These biases become encoded in the embedding geometry, potentially amplifying harmful stereotypes in downstream applications.

Detecting Gender Bias

The most studied bias in word embeddings is gender bias. The key insight is that we can use the same geometric properties that make embeddings useful for analogies to detect problematic associations.

Consider an occupation word like "engineer." In an ideal world, this word should be equidistant from "he" and "she" since engineering has no inherent gender. But if the training corpus contains more sentences like "He is an engineer" than "She is an engineer," the embedding for "engineer" will drift closer to male-associated terms.

We can measure this drift with a simple bias score: compute the cosine similarity between a word and male terms, subtract the similarity to female terms. A score of zero means perfect balance. Positive scores indicate male association; negative scores indicate female association. By averaging across multiple gendered word pairs (he/she, man/woman, male/female), we reduce noise from any single comparison.

In[31]:
def compute_bias_score(embeddings, word, attribute_pair):
    """
    Compute bias score for a word relative to an attribute pair.
    
    Positive score indicates association with first attribute,
    negative score indicates association with second attribute.
    """
    attr1, attr2 = attribute_pair
    
    if not all(w in embeddings for w in [word, attr1, attr2]):
        return None
    
    vec = embeddings[word]
    vec1 = embeddings[attr1]
    vec2 = embeddings[attr2]
    
    sim1 = cosine_similarity(vec, vec1)
    sim2 = cosine_similarity(vec, vec2)
    
    return sim1 - sim2

# Gender attribute pairs
gender_pairs = [('he', 'she'), ('man', 'woman'), ('male', 'female')]

# Words to test for gender associations
occupation_words = [
    'doctor', 'nurse', 'engineer', 'teacher', 'programmer',
    'scientist', 'secretary', 'ceo', 'receptionist', 'mechanic',
    'lawyer', 'homemaker', 'professor', 'designer', 'architect'
]

# Compute gender bias scores
gender_bias_scores = {}
for word in occupation_words:
    scores = []
    for pair in gender_pairs:
        score = compute_bias_score(glove, word, pair)
        if score is not None:
            scores.append(score)
    if scores:
        gender_bias_scores[word] = np.mean(scores)
Out[32]:
Gender Bias in Occupation Words
=======================================================
Positive = more associated with 'he/man/male'
Negative = more associated with 'she/woman/female'
-------------------------------------------------------
ceo             +0.049 → male     █
architect       +0.039 → male     █
mechanic        +0.038 → male     █
engineer        +0.034 → male     █
programmer      +0.024 → male     
secretary       +0.020 → male     
scientist       +0.015 → male     
lawyer          +0.005 → male     
professor       +0.004 → male     
doctor          -0.026 → female   
designer        -0.040 → female   █
teacher         -0.045 → female   █
receptionist    -0.119 → female   ███
nurse           -0.122 → female   ███
homemaker       -0.135 → female   ████

The bias scores reveal systematic associations between occupations and gender. Occupations like "ceo", "engineer", and "mechanic" show positive scores (male association), while "nurse", "receptionist", and "homemaker" show negative scores (female association). A few results cut against intuition in this particular model ("secretary" comes out slightly male-associated, "doctor" slightly female-associated), a reminder that individual scores are noisy. Overall, these patterns reflect stereotypes present in the training text, not any inherent truth about these professions.

Out[33]:
Visualization
Gender bias scores for occupation words in GloVe embeddings. Bars extending right indicate male association, left indicates female association. The pattern reveals embedded stereotypes: "nurse" and "receptionist" associate with female, while "engineer" and "mechanic" associate with male. These biases reflect and potentially amplify societal stereotypes.

Word Embedding Association Test (WEAT)

While individual bias scores reveal patterns, we need a more rigorous framework to quantify bias in a statistically meaningful way. The Word Embedding Association Test (WEAT) provides exactly this, drawing inspiration from psychology's Implicit Association Test (IAT).

The core idea is elegant: if embeddings are unbiased, two conceptually neutral word sets (like careers and family) should associate equally with two attribute sets (like male and female terms). Any systematic difference indicates bias.

Here's how WEAT works step by step:

  1. Define target word sets: Two sets we want to test for differential association. For example, career words (executive, salary, office) versus family words (home, parents, children).

  2. Define attribute word sets: Two sets representing the dimension we're measuring bias along. For gender bias: male attributes (he, man, boy) versus female attributes (she, woman, girl).

  3. Compute association scores: For each target word, calculate how much more it associates with one attribute set than the other. A career word that's closer to male terms than female terms receives a positive association score.

  4. Compare target sets: The key question is whether one target set (careers) systematically associates more with one attribute set (male) than the other target set (family) does.

  5. Compute effect size: The final WEAT score uses Cohen's d, a standardized measure of the difference between the two target sets' mean associations, divided by the pooled standard deviation. This normalization makes the score interpretable across different embedding models and word sets.

In[34]:
def weat_score(embeddings, target1, target2, attribute1, attribute2):
    """
    Compute WEAT score and effect size.
    
    target1, target2: Two sets of target words
    attribute1, attribute2: Two sets of attribute words
    """
    def mean_association(word, attr1_words, attr2_words):
        """Mean similarity difference for a word."""
        if word not in embeddings:
            return None
        vec = embeddings[word]
        
        sims1 = [cosine_similarity(vec, embeddings[a]) 
                 for a in attr1_words if a in embeddings]
        sims2 = [cosine_similarity(vec, embeddings[a]) 
                 for a in attr2_words if a in embeddings]
        
        if not sims1 or not sims2:
            return None
        return np.mean(sims1) - np.mean(sims2)
    
    # Compute associations for each target set
    assoc1 = [mean_association(w, attribute1, attribute2) for w in target1]
    assoc2 = [mean_association(w, attribute1, attribute2) for w in target2]
    
    # Remove None values
    assoc1 = [a for a in assoc1 if a is not None]
    assoc2 = [a for a in assoc2 if a is not None]
    
    if not assoc1 or not assoc2:
        return None
    
    # Effect size (Cohen's d)
    diff = np.mean(assoc1) - np.mean(assoc2)
    pooled_std = np.std(assoc1 + assoc2)
    
    if pooled_std == 0:
        return None
    
    effect_size = diff / pooled_std
    
    return {
        'effect_size': effect_size,
        'mean_target1': np.mean(assoc1),
        'mean_target2': np.mean(assoc2),
    }

# WEAT test: Career vs Family with Male vs Female attributes
career_words = ['executive', 'management', 'professional', 'corporation', 'salary', 'office', 'business', 'career']
family_words = ['home', 'parents', 'children', 'family', 'cousins', 'marriage', 'wedding', 'relatives']
male_attrs = ['male', 'man', 'boy', 'brother', 'he', 'him', 'his', 'son']
female_attrs = ['female', 'woman', 'girl', 'sister', 'she', 'her', 'hers', 'daughter']

weat_result = weat_score(glove, career_words, family_words, male_attrs, female_attrs)
Out[35]:
Word Embedding Association Test (WEAT)
=======================================================
Test: Career/Family words × Male/Female attributes
-------------------------------------------------------
Effect Size (d): 1.576

Interpretation:
  Strong association: Career → Male, Family → Female

Effect size benchmarks:
  |d| < 0.2: negligible
  |d| 0.2-0.5: small
  |d| 0.5-0.8: medium
  |d| > 0.8: large
Out[36]:
Visualization
WEAT association scores for career and family words. Each bar shows how much more a word associates with male attributes than female attributes. Career words systematically skew toward male associations (positive), while family words skew toward female associations (negative), revealing the gender-career stereotype encoded in the embeddings.

Implications of Embedded Bias

Bias in embeddings has real-world consequences:

  1. Resume screening: Systems using biased embeddings may rank male candidates higher for technical roles
  2. Search engines: Queries for "CEO" might surface more male images
  3. Machine translation: Gender-neutral terms might be translated with stereotypical gender
  4. Sentiment analysis: Texts about certain demographic groups might receive biased sentiment scores

Bias detection should be part of any responsible embedding evaluation pipeline. Debiasing techniques exist but have limitations, so awareness and mitigation strategies are essential.
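
As a rough illustration of what mitigation can look like, the sketch below applies the simplest form of the "hard debiasing" idea from Bolukbasi et al. (2016): estimate a gender direction from a few definitional pairs, then project that direction out of an occupation vector. This is a minimal sketch using the helpers defined earlier, not the full published algorithm, and later work has shown that removing a single direction leaves substantial residual bias.

import numpy as np
from numpy.linalg import norm

def gender_direction(embeddings, pairs):
    """Average of the difference vectors over definitional pairs (he/she, ...)."""
    diffs = [embeddings[a] - embeddings[b] for a, b in pairs
             if a in embeddings and b in embeddings]
    direction = np.mean(diffs, axis=0)
    return direction / norm(direction)

def remove_component(vec, direction):
    """Remove the component of vec that lies along direction."""
    return vec - np.dot(vec, direction) * direction

g_dir = gender_direction(glove, [('he', 'she'), ('man', 'woman'), ('male', 'female')])

word = 'engineer'
before = compute_bias_score(glove, word, ('he', 'she'))
debiased = remove_component(glove[word], g_dir)
after = cosine_similarity(debiased, glove['he']) - cosine_similarity(debiased, glove['she'])

# The he/she gap typically shrinks toward zero after the projection
print(f"{word}: bias before = {before:+.3f}, after projection = {after:+.3f}")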

Evaluation Pitfalls

Even with the right metrics, embedding evaluation can go wrong. Here are common pitfalls to avoid:

1. Vocabulary Coverage Issues

Many evaluation datasets contain rare or archaic words missing from embedding vocabularies. Simply skipping these words can inflate scores.

In[37]:
# Example: checking vocabulary coverage
test_words = ['synecdoche', 'serendipity', 'defenestration', 'obsequious', 
              'pulchritudinous', 'dog', 'cat', 'happy', 'run', 'think']

coverage = {word: word in glove for word in test_words}
coverage_rate = sum(coverage.values()) / len(coverage)
Out[38]:
Vocabulary Coverage Check
=============================================
  synecdoche           ✓
  serendipity          ✓
  defenestration       ✓
  obsequious           ✓
  pulchritudinous      ✗
  dog                  ✓
  cat                  ✓
  happy                ✓
  run                  ✓
  think                ✓
---------------------------------------------
Coverage: 90%

Warning: Low coverage inflates metrics if
missing words are simply excluded.

Rare or specialized words often missing from embedding vocabularies can skew evaluation results. If your evaluation set contains many such words and you simply exclude them, you're only testing on common words where embeddings typically perform better. Always report coverage alongside performance metrics.

2. Dataset Contamination

If your embeddings were trained on text that includes the evaluation data, results are misleadingly optimistic.

3. Hyperparameter Sensitivity

Results can vary significantly with hyperparameters like the number of neighbors for nearest neighbor searches, or thresholds for similarity judgments.
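
For instance, our analogy search only considered the 50,000 most frequent words. gensim exposes this cutoff as restrict_vocab in most_similar, and varying it changes the candidate pool, the scores, and occasionally the top answer; treat the exact numbers as run- and model-dependent.

# How the analogy answer depends on the vocabulary cutoff
for cutoff in (10_000, 50_000, None):  # None searches the full vocabulary
    answer, score = glove.most_similar(positive=['king', 'woman'],
                                       negative=['man'],
                                       topn=1, restrict_vocab=cutoff)[0]
    label = cutoff if cutoff is not None else 'all'
    print(f"restrict_vocab={label}: {answer} ({score:.3f})")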

4. Cherry-Picking Categories

Reporting only the best-performing analogy or similarity categories creates a misleading picture. Always report aggregate scores.

5. Ignoring Statistical Significance

Small test sets can produce unreliable results. An accuracy of 85% on 100 test cases doesn't mean your model is exactly 85% accurate on all possible inputs. It's an estimate with uncertainty, and that uncertainty shrinks as you test on more examples.

Bootstrap confidence intervals offer a practical way to quantify this uncertainty. The idea is simple: resample your test results with replacement many times, compute the mean each time, and observe the distribution. The range containing 95% of these bootstrap means gives you a 95% confidence interval. If two models' confidence intervals don't overlap, you have evidence of a real difference.

In[39]:
def bootstrap_confidence_interval(scores, n_bootstrap=1000, confidence=0.95):
    """Compute bootstrap confidence interval for a metric."""
    bootstrap_means = []
    for _ in range(n_bootstrap):
        sample = np.random.choice(scores, size=len(scores), replace=True)
        bootstrap_means.append(np.mean(sample))
    
    lower = np.percentile(bootstrap_means, (1 - confidence) / 2 * 100)
    upper = np.percentile(bootstrap_means, (1 + confidence) / 2 * 100)
    
    return lower, upper

# Example: confidence interval for accuracy
simulated_correct = [1] * 85 + [0] * 15  # 85% accuracy on 100 samples
lower, upper = bootstrap_confidence_interval(simulated_correct)
Out[40]:
Statistical Significance Example
=============================================
Observed accuracy: 85.0%
95% CI: [79.0%, 91.0%]

With only 100 test samples, a model with 85%
accuracy might actually be anywhere from
79% to 91% on the full distribution.
Out[41]:
Visualization
Bootstrap distribution of accuracy estimates from 1000 resamples. The observed accuracy of 85% is our point estimate, but the true population accuracy could be anywhere in the shaded region (95% confidence interval). This visualization shows why small test sets produce uncertain conclusions.

Building an Evaluation Pipeline

Let's bring everything together into a reusable evaluation framework.

In[42]:
class EmbeddingEvaluator:
    """Comprehensive embedding evaluation pipeline."""
    
    def __init__(self, embeddings):
        self.embeddings = embeddings
        self.results = {}
    
    def evaluate_word_similarity(self, word_pairs):
        """Evaluate on word similarity dataset."""
        human_scores = []
        embedding_scores = []
        
        for w1, w2, score in word_pairs:
            if w1 in self.embeddings and w2 in self.embeddings:
                sim = cosine_similarity(self.embeddings[w1], self.embeddings[w2])
                human_scores.append(score)
                embedding_scores.append(sim)
        
        if len(human_scores) >= 2:
            corr, p = spearmanr(human_scores, embedding_scores)
        else:
            corr, p = 0, 1
        
        self.results['word_similarity'] = {
            'spearman': corr,
            'p_value': p,
            'n_pairs': len(human_scores)
        }
        return self.results['word_similarity']
    
    def evaluate_analogies(self, analogies):
        """Evaluate on analogy dataset."""
        correct = 0
        total = 0
        
        for a, b, c, expected in analogies:
            if all(w in self.embeddings for w in [a, b, c, expected]):
                target = self.embeddings[b] - self.embeddings[a] + self.embeddings[c]
                
                best_word = None
                best_sim = -1
                exclude = {a, b, c}
                
                for word in self.embeddings.index_to_key[:50000]:
                    if word not in exclude:
                        sim = cosine_similarity(target, self.embeddings[word])
                        if sim > best_sim:
                            best_sim = sim
                            best_word = word
                
                if best_word == expected:
                    correct += 1
                total += 1
        
        self.results['analogies'] = {
            'accuracy': correct / total if total > 0 else 0,
            'correct': correct,
            'total': total
        }
        return self.results['analogies']
    
    def compute_bias(self, target_words, attr1_words, attr2_words):
        """Compute bias scores for target words."""
        bias_scores = {}
        
        for word in target_words:
            if word not in self.embeddings:
                continue
                
            vec = self.embeddings[word]
            
            sims1 = [cosine_similarity(vec, self.embeddings[a]) 
                     for a in attr1_words if a in self.embeddings]
            sims2 = [cosine_similarity(vec, self.embeddings[a]) 
                     for a in attr2_words if a in self.embeddings]
            
            if sims1 and sims2:
                bias_scores[word] = np.mean(sims1) - np.mean(sims2)
        
        self.results['bias'] = bias_scores
        return bias_scores
    
    def summary(self):
        """Print summary of all evaluations."""
        print("=" * 55)
        print("EMBEDDING EVALUATION SUMMARY")
        print("=" * 55)
        
        if 'word_similarity' in self.results:
            ws = self.results['word_similarity']
            print(f"\nWord Similarity:")
            print(f"  Spearman r = {ws['spearman']:.3f} (n = {ws['n_pairs']})")
        
        if 'analogies' in self.results:
            an = self.results['analogies']
            print(f"\nAnalogies:")
            print(f"  Accuracy = {an['accuracy']*100:.1f}% ({an['correct']}/{an['total']})")
        
        if 'bias' in self.results:
            bias = self.results['bias']
            print(f"\nBias Analysis:")
            print(f"  Analyzed {len(bias)} words")
            mean_abs_bias = np.mean([abs(v) for v in bias.values()])
            print(f"  Mean absolute bias = {mean_abs_bias:.3f}")

# Usage example
evaluator = EmbeddingEvaluator(glove)
evaluator.evaluate_word_similarity(sample_pairs)
evaluator.evaluate_analogies([('man', 'king', 'woman', 'queen'), ('paris', 'france', 'tokyo', 'japan')])
evaluator.compute_bias(occupation_words, male_attrs, female_attrs)
Out[43]:
=======================================================
EMBEDDING EVALUATION SUMMARY
=======================================================

Word Similarity:
  Spearman r = 0.152 (n = 10)

Analogies:
  Accuracy = 100.0% (2/2)

Bias Analysis:
  Analyzed 15 words
  Mean absolute bias = 0.080

The evaluation pipeline produces a consolidated view of embedding performance across all dimensions. This modular approach allows you to add new evaluation methods as needed while maintaining a consistent reporting format.
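
As an example of extending the pipeline, the sketch below adds the clustering-coherence test mentioned at the start of the chapter, reusing the word_categories dictionary from the visualization section. The adjusted Rand index compares KMeans cluster assignments against the known categories; exact values depend on the KMeans initialization.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

class ClusteringEvaluator(EmbeddingEvaluator):
    """EmbeddingEvaluator extended with a clustering-coherence check."""

    def evaluate_clustering(self, categories):
        words, true_labels = [], []
        for label, cat_words in enumerate(categories.values()):
            for w in cat_words:
                if w in self.embeddings:
                    words.append(w)
                    true_labels.append(label)

        X = np.array([self.embeddings[w] for w in words])
        predicted = KMeans(n_clusters=len(categories), n_init=10,
                           random_state=42).fit_predict(X)

        self.results['clustering'] = {
            'adjusted_rand': adjusted_rand_score(true_labels, predicted),
            'n_words': len(words),
        }
        return self.results['clustering']

clustering_eval = ClusteringEvaluator(glove)
print(clustering_eval.evaluate_clustering(word_categories))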

Key Parameters

When evaluating word embeddings, several parameters and choices significantly impact your results:

Word Similarity Evaluation

  • Correlation metric: Spearman correlation (rank-based) is preferred over Pearson because it doesn't assume linearity between human scores and cosine similarities
  • Dataset choice: SimLex-999 for pure similarity, WordSim-353 for similarity + relatedness, MEN for broader coverage

Analogy Evaluation

  • Vocabulary search space: Limiting search to top-N frequent words (e.g., 50,000) balances accuracy with computation time
  • Exclusion set: Always exclude the input words (a, b, c) from candidate answers to avoid trivial solutions

Visualization (t-SNE)

  • perplexity: Controls the balance between local and global structure. Typical values: 5-50. Lower values emphasize local clusters, higher values show more global structure
  • n_iter (renamed max_iter in recent scikit-learn releases): Number of optimization iterations. The default of 1,000 is usually sufficient, but complex datasets may need more
  • random_state: Set for reproducibility, as t-SNE is non-deterministic

Visualization (UMAP)

  • n_neighbors: Number of neighbors for local structure. Higher values (15-50) preserve more global structure
  • min_dist: Controls how tightly points cluster. Lower values (0.0-0.1) create denser clusters

Bias Detection

  • Attribute word sets: Use multiple word pairs per concept (e.g., he/she, man/woman, male/female) to reduce noise from individual word idiosyncrasies
  • Effect size thresholds: Cohen's d benchmarks: < 0.2 negligible, 0.2-0.5 small, 0.5-0.8 medium, > 0.8 large

Downstream Evaluation

  • Aggregation method: Mean pooling is standard, but max pooling sometimes works better for sentiment tasks; the sketch after this list compares the two
  • Classifier choice: Logistic regression provides a clean baseline; more complex models may overfit to artifacts rather than embedding quality
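
The pooling choice is easy to test directly. This sketch reuses text_to_embedding and the newsgroup split from the downstream section to compare mean and max pooling with the same logistic regression classifier; which pooling wins is dataset-dependent.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Compare mean vs max pooling on the same train/test split
for method in ('mean', 'max'):
    X_tr = np.array([text_to_embedding(t, glove, method=method) for t in X_train])
    X_te = np.array([text_to_embedding(t, glove, method=method) for t in X_test])

    clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_train)
    print(f"{method} pooling accuracy: {accuracy_score(y_test, clf.predict(X_te)):.3f}")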

Summary

Evaluating word embeddings requires a multi-faceted approach. No single metric captures embedding quality completely.

Key takeaways:

  1. Intrinsic vs extrinsic: Intrinsic evaluations (similarity, analogies) are fast and interpretable but may not predict downstream performance. Always include task-specific extrinsic evaluations.

  2. Word similarity: Spearman correlation with human similarity judgments remains the standard intrinsic test. SimLex-999 tests genuine similarity, while WordSim-353 conflates similarity with relatedness.

  3. Analogies: Vector arithmetic captures some semantic relationships, but analogy accuracy has limited correlation with real-world usefulness.

  4. Visualization: t-SNE and UMAP reveal clustering structure but introduce projection distortions. Use for exploration, not quantitative claims.

  5. Downstream tasks: The ultimate test is performance on your intended application. Classification, NER, and other tasks provide direct measures of utility.

  6. Bias detection: Embeddings encode societal biases. WEAT and association tests can quantify these biases, which is essential for responsible deployment.

  7. Pitfalls: Watch for vocabulary coverage issues, dataset contamination, hyperparameter sensitivity, and statistical significance. Report aggregate results, not cherry-picked categories.

The goal isn't perfect scores on every metric but rather understanding what your embeddings capture and whether it matches your needs. A model with lower intrinsic scores might be the right choice if it excels at your specific task. Evaluation is ultimately about making informed decisions.

