Learn how to evaluate word embeddings using similarity tests, analogy tasks, downstream evaluation, t-SNE visualization, and bias detection with WEAT.

This article is part of the free-to-read Language AI Handbook
Embedding Evaluation
You've trained a word embedding model. The loss converged, the code ran without errors, and you have a matrix of 300-dimensional vectors. But how do you know if these embeddings are any good? What makes one set of embeddings better than another?
Evaluating word embeddings is surprisingly nuanced. Unlike classification tasks where accuracy provides a clear metric, embedding quality is multidimensional. Embeddings might excel at capturing analogies but fail at word similarity. They might perform brilliantly on sentiment analysis but contain harmful biases. The right evaluation depends entirely on how you plan to use the embeddings.
This chapter develops a comprehensive evaluation toolkit. We'll explore intrinsic evaluations that probe the embedding space directly, extrinsic evaluations that measure downstream task performance, visualization techniques for qualitative inspection, and methods for detecting embedded biases. By the end, you'll have the tools to rigorously assess any word embedding model.
Intrinsic vs Extrinsic Evaluation
The fundamental split in embedding evaluation is between intrinsic and extrinsic methods. Understanding this distinction is crucial for interpreting evaluation results.
Intrinsic evaluation measures properties of the embedding space directly, without reference to any downstream task. Common intrinsic tests include word similarity correlation, analogy completion, and clustering coherence. These evaluations are fast and interpretable but may not predict real-world performance.
Extrinsic evaluation measures embedding quality through performance on downstream tasks such as sentiment analysis, named entity recognition, or text classification. These evaluations are slower and more complex but directly measure what matters: usefulness for practical applications.
The relationship between intrinsic and extrinsic performance is complex and sometimes counterintuitive. Research has shown that embeddings with higher intrinsic scores don't always perform better on downstream tasks. This disconnect arises because intrinsic tests capture specific linguistic properties, while downstream tasks may require different capabilities.
Embedding Model Comparison
=======================================================
Metric                      Model A    Model B
-------------------------------------------------------
Word Similarity (r)            0.72       0.68
Analogy Accuracy               0.65       0.58
-------------------------------------------------------
Sentiment F1                   0.83       0.86
NER F1                         0.89       0.91
=======================================================
Model A wins on intrinsic metrics, but Model B
outperforms on the downstream tasks that matter.
This example illustrates a common scenario: Model A achieves higher intrinsic scores, but Model B performs better on downstream tasks. If you're building a sentiment classifier, Model B is clearly the better choice, despite its lower analogy accuracy.
The practical takeaway: always include extrinsic evaluations relevant to your use case. Intrinsic metrics provide useful diagnostic information, but they don't tell the whole story.
Word Similarity Evaluation
The most established intrinsic evaluation is word similarity. Humans rate the semantic similarity of word pairs on a scale (typically 0-10), and we measure how well embedding cosine similarities correlate with these human judgments.
Standard Datasets
Several benchmark datasets have become standard for word similarity evaluation:
WordSim-353 contains 353 word pairs rated by 13-16 human annotators. It mixes similarity (synonymy) and relatedness (associated but not similar). "Cup" and "coffee" are related but not similar. "Car" and "automobile" are both related and similar.
SimLex-999 addresses WordSim's conflation of similarity and relatedness. Its 999 pairs focus specifically on genuine similarity: words that could be substituted for each other. It's harder than WordSim because relatedness doesn't count.
MEN contains 3,000 pairs covering a range of parts of speech. Ratings come from crowd workers rather than expert annotators.
Sample Word Similarity Pairs (SimLex-999 Scale)
==================================================
Word 1        Word 2          Human Score
--------------------------------------------------
happy         cheerful               9.55
old           new                    1.58
smart         intelligent            9.20
car           automobile             8.94
hot           cold                   1.31
dog           cat                    5.95
king          queen                  8.18
run           walk                   6.11
book          paper                  3.89
love          hate                   1.27
==================================================
Scores range from 0 (unrelated) to 10 (identical meaning)
Computing Similarity Correlations
The evaluation process involves two key measurements that work together to tell us how well embeddings capture human intuitions about word similarity.
First, we compute the cosine similarity between each word pair's embeddings. Cosine similarity measures the angle between two vectors, ignoring their magnitudes. Two vectors pointing in similar directions have high cosine similarity (close to 1), while perpendicular vectors have similarity near 0. For word embeddings, high cosine similarity indicates that two words appear in similar contexts and likely share related meanings.
Second, we calculate the Spearman correlation between these embedding similarities and the human ratings. Why Spearman rather than Pearson? Spearman correlation measures whether the rankings match, not whether the actual values align linearly. This is exactly what we want: if humans rank "happy/cheerful" as more similar than "dog/cat", we care whether embeddings agree with that ordering, regardless of the exact numerical values. Spearman correlation ranges from -1 (perfect inverse ranking) through 0 (no relationship) to +1 (perfect agreement).
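The computation itself takes only a few lines. Here is a minimal sketch, assuming `embeddings` is a dictionary mapping words to NumPy vectors and `pairs` is a list of (word1, word2, human_score) tuples; both names are illustrative rather than objects defined earlier in the chapter.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate_similarity(embeddings, pairs):
    """Correlate embedding cosine similarities with human ratings.

    embeddings: dict mapping word -> np.ndarray
    pairs: list of (word1, word2, human_score) tuples
    Returns (spearman_rho, p_value, vocabulary_coverage).
    """
    human, model, covered = [], [], 0
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            human.append(score)
            model.append(cosine_similarity(embeddings[w1], embeddings[w2]))
            covered += 1
    rho, p_value = spearmanr(human, model)
    return rho, p_value, covered / len(pairs)
```

Tracking coverage alongside the correlation matters because skipped out-of-vocabulary pairs silently shrink the test set, a pitfall discussed later in this chapter.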
Word Similarity Evaluation Results
==================================================
Spearman Correlation: 0.1515
P-value: 6.7607e-01
Vocabulary Coverage: 100.0%

Correlation interpretation:
  0.6-0.7: Good
  0.7-0.8: Very good
  0.8+:    Excellent
The Spearman correlation measures how well the embedding-based rankings match human rankings. A correlation above 0.7 indicates that the embeddings preserve human similarity intuitions reasonably well. The p-value tells us whether this correlation is statistically significant (values below 0.05 indicate significance). The toy example above, computed on just ten word pairs, yields a weak and non-significant correlation (0.15, p = 0.68), which illustrates why small samples cannot support firm conclusions.
Let's examine the individual similarities to understand what the embeddings capture:
Detailed Similarity Comparison
======================================================================
Words                     Human    Embedding    Error
----------------------------------------------------------------------
happy / cheerful           9.55        0.546    0.409
smart / intelligent        9.20        0.755    0.165
car / automobile           8.94        0.683    0.211
king / queen               8.18        0.751    0.067
run / walk                 6.11        0.668    0.057
dog / cat                  5.95        0.880    0.285
book / paper               3.89        0.623    0.234
old / new                  1.58        0.643    0.485
hot / cold                 1.31        0.725    0.594
love / hate                1.27        0.570    0.443
======================================================================
Human scores: 0-10 scale. Embedding: -1 to 1 (cosine similarity).
This comparison reveals both strengths and limitations. High-similarity pairs like "happy/cheerful" and "smart/intelligent" score high on both scales. Antonyms like "hot/cold" receive low human ratings but may have moderate embedding similarity because they appear in similar contexts (both are temperature words).
SimLex vs WordSim: Similarity vs Relatedness
The distinction between similarity and relatedness is crucial. "Coffee" and "cup" are highly related but not similar. You can't substitute one for the other. SimLex-999 was specifically designed to test genuine similarity.
Similarity vs Relatedness in Embeddings
============================================================
Word Pair            Type         Cosine Sim
------------------------------------------------------------
car/automobile       similar           0.683
coffee/cup           related           0.336
doctor/hospital      related           0.690
big/large            similar           0.708
king/crown           related           0.665
happy/joyful         similar           0.526
============================================================
Note: Embeddings trained on co-occurrence often conflate
similarity and relatedness, scoring both types high.
SimLex-999 specifically tests pure similarity.
Standard word embeddings trained on co-occurrence tend to conflate similarity and relatedness because related words appear in similar contexts. This is why SimLex-999 is a harder benchmark. If you need embeddings that distinguish these concepts, you may need specialized training objectives.
Word Analogy Evaluation
The analogy task tests whether embeddings capture semantic relationships through vector arithmetic. The classic example: "king - man + woman ≈ queen." If the embedding space encodes gender as a consistent direction, this arithmetic should work.
The Analogy Task
At first glance, the idea that you can do arithmetic with words seems almost magical. How can you subtract "man" from "king" and add "woman" to get "queen"? The key insight is that word embeddings aren't just arbitrary numbers assigned to words. They're geometric representations where directions carry meaning.
Think of it this way: if the embedding space has learned that "male" and "female" represent opposite ends of a direction (let's call it the gender axis), then moving from "man" to "woman" means traveling along that axis. Similarly, moving from "king" to "queen" should involve the same directional shift. If both relationships encode the same underlying concept (gender), their vector differences should be parallel.
This observation leads to a simple but powerful formula. Given three words A, B, and C, we want to find word D such that "A is to B as C is to D." The relationship between A and B is captured by the vector difference $\vec{v}_B - \vec{v}_A$. If the same relationship holds between C and D, then D should be located at:

$$\vec{v}_D \approx \vec{v}_C + (\vec{v}_B - \vec{v}_A)$$

where:
- $\vec{v}_A$, $\vec{v}_B$, $\vec{v}_C$: embedding vectors for words A, B, and C
- $\vec{v}_B - \vec{v}_A$: the relationship vector (what transforms A into B)
- $\vec{v}_D$: target vector representing the expected embedding for word D
The formula reads as: "Start at C, then apply the same transformation that takes A to B." We find D by locating the word whose embedding is closest to this computed vector (excluding A, B, and C to prevent trivial solutions).
This geometric property only works if the embedding space has organized itself so that analogous relationships point in consistent directions. When it works, it's evidence that the embeddings have captured genuine semantic structure. When it fails, it often reveals that the relationship isn't as consistent as we assumed, or that the training data didn't provide enough examples for the model to learn it.
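The nearest-neighbor search itself is straightforward to sketch. The following illustrative function assumes `embeddings` is a dictionary of NumPy word vectors; the names and setup are assumptions for this example, not code from the chapter.

```python
import numpy as np

def analogy(embeddings, a, b, c, top_k=5):
    """Return the top_k words closest to vec(c) + (vec(b) - vec(a)),
    excluding the query words themselves."""
    target = embeddings[c] + (embeddings[b] - embeddings[a])
    target = target / np.linalg.norm(target)
    scores = []
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue  # exclude inputs to avoid trivial answers
        sim = np.dot(target, vec) / np.linalg.norm(vec)
        scores.append((word, sim))
    return sorted(scores, key=lambda x: -x[1])[:top_k]

# Usage (illustrative): analogy(embeddings, "man", "king", "woman")
```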
Word Analogy Results
=================================================================
Query: man : king :: woman : ?
Top 5 answers:
queen (similarity: 0.783)
monarch (similarity: 0.693)
throne (similarity: 0.683)
daughter (similarity: 0.681)
prince (similarity: 0.671)
Query: paris : france :: tokyo : ?
Top 5 answers:
japan (similarity: 0.879)
korea (similarity: 0.726)
germany (similarity: 0.682)
japanese (similarity: 0.679)
china (similarity: 0.650)
Query: slow : slower :: fast : ?
Top 5 answers:
faster (similarity: 0.803)
quicker (similarity: 0.669)
pace (similarity: 0.658)
fastest (similarity: 0.632)
speeds (similarity: 0.599)
Query: walk : walked :: run : ?
Top 5 answers:
went (similarity: 0.734)
ran (similarity: 0.728)
drove (similarity: 0.724)
came (similarity: 0.700)
out (similarity: 0.677)
Analogy Categories
The Google Analogy Test Set contains approximately 19,500 analogies across two categories:
Semantic analogies test factual relationships:
- Capital-country: Paris : France :: Tokyo : Japan
- Currency: dollar : USA :: euro : Europe
- Gender: king : queen :: man : woman
Syntactic analogies test grammatical relationships:
- Tense: walk : walked :: run : ran
- Plural: cat : cats :: dog : dogs
- Comparative: big : bigger :: small : smaller
Analogy Accuracy by Category
=======================================================
SEMANTIC ANALOGIES
-------------------------------------------------------
capital-country           3/3   (100.0%)
gender                    2/3   (66.7%)

SYNTACTIC ANALOGIES
-------------------------------------------------------
tense                     2/3   (66.7%)
comparative               2/3   (66.7%)
=======================================================
OVERALL                   9/12  (75.0%)
The accuracy breakdown reveals which relationship types the embeddings capture best. Syntactic analogies often achieve higher accuracy because grammatical patterns appear consistently in text. Semantic analogies like capital-country relationships may vary more depending on how well-represented each entity is in the training corpus.
Limitations of Analogy Evaluation
While analogy tasks are popular, they have significant limitations:
- Sensitivity to dataset: Performance varies dramatically across different analogy sets
- Only tests specific relationships: Good analogy performance doesn't guarantee general embedding quality
- Artifacts of training data: Some analogies work because of corpus biases, not linguistic understanding
- Unclear relevance: Analogy performance often doesn't correlate with downstream task performance
Embedding Visualization
Visualization provides qualitative insights that quantitative metrics miss. By projecting high-dimensional embeddings to 2D or 3D, we can observe clustering patterns, outliers, and relationships.
t-SNE Visualization
t-Distributed Stochastic Neighbor Embedding (t-SNE) is the most popular technique for embedding visualization. It preserves local structure: words that are close in high-dimensional space remain close in the projection.
t-SNE is a dimensionality reduction technique that converts high-dimensional similarities into probabilities and finds a low-dimensional representation that preserves these probabilities. It excels at revealing cluster structure but doesn't preserve global distances. Points that appear far apart in t-SNE may or may not be far apart in the original space.
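As a rough illustration, a 2D projection with scikit-learn's `TSNE` might look like the sketch below; `embeddings` and `words` are assumed inputs rather than objects defined above.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_tsne(embeddings, words, perplexity=30, random_state=42):
    """Project selected word vectors to 2D for plotting.

    Note: perplexity must be smaller than the number of words projected.
    """
    X = np.stack([embeddings[w] for w in words])
    tsne = TSNE(n_components=2, perplexity=perplexity,
                random_state=random_state, init="pca")
    return tsne.fit_transform(X)  # shape: (len(words), 2)
```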
UMAP as an Alternative
Uniform Manifold Approximation and Projection (UMAP) is a newer alternative to t-SNE. It's faster, better preserves global structure, and produces more reproducible results.
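Assuming the `umap-learn` package is installed, an equivalent sketch differs only in the reducer and its hyperparameters.

```python
import numpy as np
import umap

def project_umap(embeddings, words, n_neighbors=15, min_dist=0.1):
    """Project selected word vectors to 2D with UMAP."""
    X = np.stack([embeddings[w] for w in words])
    reducer = umap.UMAP(n_components=2, n_neighbors=n_neighbors,
                        min_dist=min_dist, random_state=42)
    return reducer.fit_transform(X)
```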
Visualization Caveats
While visualization is valuable for building intuition, it has important limitations:
- Projection distortion: Reducing from 100+ dimensions to 2 inevitably loses information
- Non-determinism: t-SNE and UMAP can produce different layouts on different runs
- Perplexity/neighbor sensitivity: Results depend heavily on hyperparameter choices
- Misleading distances: Distances between clusters in the visualization may not reflect true embedding distances
Use visualization for exploration and communication, but don't make quantitative claims based on 2D projections.
Downstream Task Evaluation
The ultimate test of embeddings is performance on real tasks. Let's evaluate embeddings on text classification, a common downstream application.
Topic Classification
We'll use the 20 Newsgroups dataset to evaluate how well embeddings support text classification. This dataset contains posts from different newsgroups, making it ideal for testing whether embeddings capture topical information.
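The evaluation recipe is simple: mean-pool word vectors into document vectors, then fit a linear classifier. The sketch below assumes a preloaded `embeddings` dictionary (for example, GloVe vectors); it illustrates the approach rather than reproducing the exact code behind the numbers that follow.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def mean_pool(text, embeddings, dim=300):
    """Average the vectors of all in-vocabulary tokens in a document."""
    vecs = [embeddings[t] for t in text.lower().split() if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def evaluate_classification(embeddings, dim=300):
    """Binary topic classification with mean-pooled embeddings."""
    cats = ["sci.space", "rec.sport.baseball"]
    train = fetch_20newsgroups(subset="train", categories=cats)
    test = fetch_20newsgroups(subset="test", categories=cats)
    X_train = np.stack([mean_pool(d, embeddings, dim) for d in train.data])
    X_test = np.stack([mean_pool(d, embeddings, dim) for d in test.data])
    clf = LogisticRegression(max_iter=1000).fit(X_train, train.target)
    preds = clf.predict(X_test)
    return (accuracy_score(test.target, preds),
            f1_score(test.target, preds, average="macro"))
```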
Downstream Task Evaluation: Text Classification
=======================================================
Dataset: 20 Newsgroups (sci.space vs rec.sport.baseball)
Train size: 1584, Test size: 397

Results with GloVe Embeddings + Logistic Regression:
-------------------------------------------------------
Accuracy:   93.20%
F1 (macro): 93.20%

These embeddings successfully capture topic-distinguishing
information, enabling simple classification of documents.
The high accuracy demonstrates that even simple mean-pooled embeddings can capture enough semantic information for topic classification. The key advantage of this evaluation approach is that it directly measures what matters: whether the embeddings help solve real tasks. If your application involves document classification, these results are far more relevant than intrinsic metrics.
Comparison Across Tasks
A thorough extrinsic evaluation tests embeddings across multiple tasks, such as text classification, sentiment analysis, and named entity recognition.
The variation across tasks highlights why no single metric captures embedding quality. Choose evaluation tasks that match your intended application.
Embedding Bias Detection
Word embeddings learn from human-generated text, and human text contains biases. These biases become encoded in the embedding geometry, potentially amplifying harmful stereotypes in downstream applications.
Detecting Gender Bias
The most studied bias in word embeddings is gender bias. The key insight is that we can use the same geometric properties that make embeddings useful for analogies to detect problematic associations.
Consider an occupation word like "engineer." In an ideal world, this word should be equidistant from "he" and "she" since engineering has no inherent gender. But if the training corpus contains more sentences like "He is an engineer" than "She is an engineer," the embedding for "engineer" will drift closer to male-associated terms.
We can measure this drift with a simple bias score: compute the cosine similarity between a word and male terms, subtract the similarity to female terms. A score of zero means perfect balance. Positive scores indicate male association; negative scores indicate female association. By averaging across multiple gendered word pairs (he/she, man/woman, male/female), we reduce noise from any single comparison.
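Here is a sketch of that bias score, assuming `embeddings` is a dictionary of NumPy word vectors; the word lists and function name are illustrative choices for this example.

```python
import numpy as np

MALE_TERMS = ["he", "man", "male"]
FEMALE_TERMS = ["she", "woman", "female"]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def gender_bias(word, embeddings):
    """Positive = closer to male terms, negative = closer to female terms."""
    male_sim = np.mean([cosine(embeddings[word], embeddings[m]) for m in MALE_TERMS])
    female_sim = np.mean([cosine(embeddings[word], embeddings[f]) for f in FEMALE_TERMS])
    return male_sim - female_sim
```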
Gender Bias in Occupation Words
=======================================================
Positive = more associated with 'he/man/male'
Negative = more associated with 'she/woman/female'
-------------------------------------------------------
ceo            +0.049  → male    █
architect      +0.039  → male    █
mechanic       +0.038  → male    █
engineer       +0.034  → male    █
programmer     +0.024  → male
secretary      +0.020  → male
scientist      +0.015  → male
lawyer         +0.005  → male
professor      +0.004  → male
doctor         -0.026  → female
designer       -0.040  → female  █
teacher        -0.045  → female  █
receptionist   -0.119  → female  ███
nurse          -0.122  → female  ███
homemaker      -0.135  → female  ████
The bias scores reveal systematic associations between occupations and gender. Occupations like "engineer" and "programmer" show positive scores (male association), while "nurse" and "receptionist" show negative scores (female association). These patterns reflect stereotypes present in the training text, not any inherent truth about these professions.
Word Embedding Association Test (WEAT)
While individual bias scores reveal patterns, we need a more rigorous framework to quantify bias in a statistically meaningful way. The Word Embedding Association Test (WEAT) provides exactly this, drawing inspiration from psychology's Implicit Association Test (IAT).
The core idea is elegant: if embeddings are unbiased, two conceptually neutral word sets (like careers and family) should associate equally with two attribute sets (like male and female terms). Any systematic difference indicates bias.
Here's how WEAT works step by step:
1. Define target word sets: Two sets we want to test for differential association. For example, career words (executive, salary, office) versus family words (home, parents, children).
2. Define attribute word sets: Two sets representing the dimension we're measuring bias along. For gender bias: male attributes (he, man, boy) versus female attributes (she, woman, girl).
3. Compute association scores: For each target word, calculate how much more it associates with one attribute set than the other. A career word that's closer to male terms than female terms receives a positive association score.
4. Compare target sets: The key question is whether one target set (careers) systematically associates more with one attribute set (male) than the other target set (family) does.
5. Compute effect size: The final WEAT score uses Cohen's d, a standardized measure of the difference between the two target sets' mean associations, divided by the pooled standard deviation. This normalization makes the score interpretable across different embedding models and word sets. A minimal code sketch of this computation follows the list.
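The sketch below implements these steps, assuming `emb` is a dictionary of NumPy word vectors and that all words in the four sets are in vocabulary; the function and variable names are illustrative, not a published API.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B, emb):
    """s(w, A, B): mean similarity of w to attribute set A minus set B."""
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    """Cohen's d comparing target sets X and Y on attributes A vs B."""
    x_assoc = np.array([association(x, A, B, emb) for x in X])
    y_assoc = np.array([association(y, A, B, emb) for y in Y])
    pooled_std = np.std(np.concatenate([x_assoc, y_assoc]), ddof=1)
    return (x_assoc.mean() - y_assoc.mean()) / pooled_std

# Example word sets (illustrative):
# career = ["executive", "salary", "office", "business", "career"]
# family = ["home", "parents", "children", "family", "wedding"]
# male, female = ["he", "man", "boy"], ["she", "woman", "girl"]
# d = weat_effect_size(career, family, male, female, emb)
```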
Word Embedding Association Test (WEAT)
=======================================================
Test: Career/Family words × Male/Female attributes
-------------------------------------------------------
Effect Size (d): 1.576
Interpretation: Strong association: Career → Male, Family → Female

Effect size benchmarks:
  |d| < 0.2:   negligible
  |d| 0.2-0.5: small
  |d| 0.5-0.8: medium
  |d| > 0.8:   large
Implications of Embedded Bias
Bias in embeddings has real-world consequences:
- Resume screening: Systems using biased embeddings may rank male candidates higher for technical roles
- Search engines: Queries for "CEO" might surface more male images
- Machine translation: Gender-neutral terms might be translated with stereotypical gender
- Sentiment analysis: Texts about certain demographic groups might receive biased sentiment scores
Bias detection should be part of any responsible embedding evaluation pipeline. Debiasing techniques exist but have limitations, so awareness and mitigation strategies are essential.
Evaluation Pitfalls
Even with the right metrics, embedding evaluation can go wrong. Here are common pitfalls to avoid:
1. Vocabulary Coverage Issues
Many evaluation datasets contain rare or archaic words missing from embedding vocabularies. Simply skipping these words can inflate scores.
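A coverage check is cheap to run before any scoring. Here is a minimal sketch; the names are illustrative.

```python
def vocabulary_coverage(eval_words, embeddings):
    """Return the fraction of evaluation words present in the vocabulary,
    along with the list of missing words."""
    missing = [w for w in eval_words if w not in embeddings]
    coverage = 1 - len(missing) / len(eval_words)
    return coverage, missing
```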
Vocabulary Coverage Check
=============================================
synecdoche        ✓
serendipity       ✓
defenestration    ✓
obsequious        ✓
pulchritudinous   ✗
dog               ✓
cat               ✓
happy             ✓
run               ✓
think             ✓
---------------------------------------------
Coverage: 90%
Warning: Low coverage inflates metrics if
missing words are simply excluded.
Rare or specialized words often missing from embedding vocabularies can skew evaluation results. If your evaluation set contains many such words and you simply exclude them, you're only testing on common words where embeddings typically perform better. Always report coverage alongside performance metrics.
2. Dataset Contamination
If your embeddings were trained on text that includes the evaluation data, results are misleadingly optimistic.
3. Hyperparameter Sensitivity
Results can vary significantly with hyperparameters like the number of neighbors for nearest neighbor searches, or thresholds for similarity judgments.
4. Cherry-Picking Categories
Reporting only the best-performing analogy or similarity categories creates a misleading picture. Always report aggregate scores.
5. Ignoring Statistical Significance
Small test sets can produce unreliable results. An accuracy of 85% on 100 test cases doesn't mean your model is exactly 85% accurate on all possible inputs. It's an estimate with uncertainty, and that uncertainty shrinks as you test on more examples.
Bootstrap confidence intervals offer a practical way to quantify this uncertainty. The idea is simple: resample your test results with replacement many times, compute the mean each time, and observe the distribution. The range containing 95% of these bootstrap means gives you a 95% confidence interval. If two models' confidence intervals don't overlap, you have evidence of a real difference.
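A percentile-bootstrap sketch of that procedure, assuming `correct` is a list of per-example outcomes (1 = correct, 0 = incorrect):

```python
import numpy as np

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    means = [rng.choice(correct, size=len(correct), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return correct.mean(), (lo, hi)
```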
Statistical Significance Example
=============================================
Observed accuracy: 85.0%
95% CI: [79.0%, 91.0%]

With only 100 test samples, a model with 85%
accuracy might actually be anywhere from 79%
to 91% on the full distribution.
Building an Evaluation Pipeline
Let's bring everything together into a reusable evaluation framework.
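One way to structure this, sketched below: a small class that reuses the helper functions sketched earlier in this chapter (`evaluate_similarity`, `analogy`, `gender_bias`), which are assumed to be defined in the same module; none of these names come from a published library.

```python
import numpy as np

class EmbeddingEvaluator:
    """Consolidates intrinsic and bias evaluations into one report."""

    def __init__(self, embeddings):
        self.embeddings = embeddings

    def run(self, similarity_pairs, analogy_queries, bias_words):
        report = {}
        # Word similarity: Spearman correlation with human ratings
        rho, p_value, coverage = evaluate_similarity(self.embeddings, similarity_pairs)
        report["similarity"] = {"spearman": rho, "coverage": coverage}
        # Analogies: top-1 accuracy on (a, b, c, expected_d) queries
        correct = sum(
            analogy(self.embeddings, a, b, c, top_k=1)[0][0] == d
            for a, b, c, d in analogy_queries
        )
        report["analogy_accuracy"] = correct / len(analogy_queries)
        # Bias: mean absolute gender-association score over selected words
        biases = [gender_bias(w, self.embeddings) for w in bias_words]
        report["mean_abs_bias"] = float(np.mean(np.abs(biases)))
        return report
```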
=======================================================
EMBEDDING EVALUATION SUMMARY
=======================================================
Word Similarity:  Spearman r = 0.152 (n = 10)
Analogies:        Accuracy = 100.0% (2/2)
Bias Analysis:    Analyzed 15 words
                  Mean absolute bias = 0.080
The evaluation pipeline produces a consolidated view of embedding performance across all dimensions. This modular approach allows you to add new evaluation methods as needed while maintaining a consistent reporting format.
Key Parameters
When evaluating word embeddings, several parameters and choices significantly impact your results:
Word Similarity Evaluation
- Correlation metric: Spearman correlation (rank-based) is preferred over Pearson because it doesn't assume linearity between human scores and cosine similarities
- Dataset choice: SimLex-999 for pure similarity, WordSim-353 for similarity + relatedness, MEN for broader coverage
Analogy Evaluation
- Vocabulary search space: Limiting search to top-N frequent words (e.g., 50,000) balances accuracy with computation time
- Exclusion set: Always exclude the input words (a, b, c) from candidate answers to avoid trivial solutions
Visualization (t-SNE)
- perplexity: Controls the balance between local and global structure. Typical values: 5-50. Lower values emphasize local clusters, higher values show more global structure
- n_iter: Number of optimization iterations. Default 1000 is usually sufficient, but complex datasets may need more
- random_state: Set for reproducibility, as t-SNE is non-deterministic
Visualization (UMAP)
- n_neighbors: Number of neighbors for local structure. Higher values (15-50) preserve more global structure
- min_dist: Controls how tightly points cluster. Lower values (0.0-0.1) create denser clusters
Bias Detection
- Attribute word sets: Use multiple word pairs per concept (e.g., he/she, man/woman, male/female) to reduce noise from individual word idiosyncrasies
- Effect size thresholds: Cohen's d benchmarks: < 0.2 negligible, 0.2-0.5 small, 0.5-0.8 medium, > 0.8 large
Downstream Evaluation
- Aggregation method: Mean pooling is standard, but max pooling sometimes works better for sentiment tasks
- Classifier choice: Logistic regression provides a clean baseline; more complex models may overfit to artifacts rather than embedding quality
Summary
Evaluating word embeddings requires a multi-faceted approach. No single metric captures embedding quality completely.
Key takeaways:
- Intrinsic vs extrinsic: Intrinsic evaluations (similarity, analogies) are fast and interpretable but may not predict downstream performance. Always include task-specific extrinsic evaluations.
- Word similarity: Spearman correlation with human similarity judgments remains the standard intrinsic test. SimLex-999 tests genuine similarity, while WordSim-353 conflates similarity with relatedness.
- Analogies: Vector arithmetic captures some semantic relationships, but analogy accuracy has limited correlation with real-world usefulness.
- Visualization: t-SNE and UMAP reveal clustering structure but introduce projection distortions. Use for exploration, not quantitative claims.
- Downstream tasks: The ultimate test is performance on your intended application. Classification, NER, and other tasks provide direct measures of utility.
- Bias detection: Embeddings encode societal biases. WEAT and association tests can quantify these biases, which is essential for responsible deployment.
- Pitfalls: Watch for vocabulary coverage issues, dataset contamination, hyperparameter sensitivity, and statistical significance. Report aggregate results, not cherry-picked categories.
The goal isn't perfect scores on every metric but rather understanding what your embeddings capture and whether it matches your needs. A model with lower intrinsic scores might be the right choice if it excels at your specific task. Evaluation is ultimately about making informed decisions.