Master word analogy evaluation using 3CosAdd and 3CosMul methods. Learn the parallelogram model, evaluation datasets, and what analogies reveal about embedding quality.

Word Analogy
Something remarkable happens when you train Skip-gram or CBOW on billions of words: the resulting embeddings exhibit structured geometric relationships that mirror semantic relationships. The famous example: $\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$. This isn't a quirk. The vector from "man" to "woman" captures a gender relationship, and adding it to "king" lands near "queen." The embedding space has learned that kingship relates to queenship the same way manhood relates to womanhood.
This phenomenon, known as word analogy, became a central way to evaluate and understand word embeddings. If an embedding space correctly solves "man is to woman as king is to ?", it suggests the model has captured meaningful semantic structure. But analogies also reveal limitations: they work well for certain relationship types and fail for others, and the popular evaluation metrics can be misleading.
This chapter explores the parallelogram model underlying word analogies, develops the mathematical methods for solving them (3CosAdd and 3CosMul), examines standard evaluation datasets, and discusses what analogies actually tell us about embedding quality.
The Parallelogram Model
Word analogies rely on a geometric intuition: if two word pairs share the same relationship, the vectors connecting them should be parallel. Consider the relationship "capital of":
- Paris is the capital of France
- Tokyo is the capital of Japan
- London is the capital of England
If embeddings capture this relationship consistently, then $\vec{\text{France}} - \vec{\text{Paris}}$, $\vec{\text{Japan}} - \vec{\text{Tokyo}}$, and $\vec{\text{England}} - \vec{\text{London}}$ should all point in roughly the same direction. This directional consistency allows us to complete analogies.
The parallelogram model assumes that semantic relationships are encoded as consistent vector offsets. If words $a$ and $b$ share a relationship (e.g., "man" and "woman" share a gender relationship), and words $c$ and $d$ share the same relationship (e.g., "king" and "queen"), then the offset vectors should be approximately equal: $\vec{b} - \vec{a} \approx \vec{d} - \vec{c}$. Geometrically, this means the four word vectors form a parallelogram in embedding space, where the two relationship vectors are parallel sides.
The parallelogram structure is idealized. Real embeddings have noise, and the offsets aren't perfectly parallel. But when embeddings are trained well, the relationship vectors are similar enough that analogy completion works.
Vector Arithmetic for Analogies
The classic analogy task asks: "a is to b as c is to ?" We need to find a word that completes the analogy. The key insight is that if the relationship between $a$ and $b$ is the same as the relationship between $c$ and $d$, then the vector offsets should be equal: $\vec{b} - \vec{a} \approx \vec{d} - \vec{c}$. Rearranging this gives us the target vector:

$$\vec{d} \approx \vec{b} - \vec{a} + \vec{c}$$

where:
- $\vec{a}$, $\vec{b}$, $\vec{c}$, $\vec{d}$: embedding vectors for words $a$, $b$, $c$, and $d$
- $\vec{b} - \vec{a}$: the relationship vector capturing how $b$ differs from $a$
- $\vec{c}$: the starting point for our new word pair
The intuition behind this formula:
- Compute the relationship vector: $\vec{b} - \vec{a}$ captures how $b$ differs from $a$
- Apply this offset to $\vec{c}$: adding $\vec{b} - \vec{a}$ to $\vec{c}$ should land near $\vec{d}$
- Find the nearest word: search the vocabulary for the word closest to the computed vector
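To make this concrete, here is a minimal sketch of the arithmetic on hand-built synthetic embeddings. The vocabulary and vector values below are illustrative choices, constructed so that each dimension encodes an interpretable property; they stand in for whatever synthetic embeddings you work with.

```python
import numpy as np

# Synthetic embeddings with interpretable dimensions:
# [person, female, royal, young, family-relation]
embeddings = {
    "man":      np.array([1.0, 0.0, 0.0, 0.0, 0.0]),
    "woman":    np.array([1.0, 1.0, 0.0, 0.0, 0.0]),
    "king":     np.array([1.0, 0.0, 1.0, 0.0, 0.0]),
    "queen":    np.array([1.0, 1.0, 1.0, 0.0, 0.0]),
    "prince":   np.array([1.0, 0.0, 1.0, 1.0, 0.0]),
    "princess": np.array([1.0, 1.0, 1.0, 1.0, 0.0]),
    "boy":      np.array([1.0, 0.0, 0.0, 1.0, 0.0]),
    "girl":     np.array([1.0, 1.0, 0.0, 1.0, 0.0]),
    "uncle":    np.array([1.0, 0.0, 0.0, 0.0, 1.0]),
    "aunt":     np.array([1.0, 1.0, 0.0, 0.0, 1.0]),
}

# Apply the parallelogram arithmetic: king - man + woman
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]

# Measure how far the computed target lands from "queen"
distance = np.linalg.norm(target - embeddings["queen"])
print("target vector:", target)
print("distance to 'queen':", distance)  # 0.0 for these synthetic vectors
```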
The distance of zero confirms that our synthetic embeddings exhibit perfect parallelogram structure: the computed target vector exactly matches the embedding for "queen." In real embeddings, there's noise, so we need to find the nearest vocabulary word.
The 3CosAdd Method
We've established the intuition: semantic relationships create parallel vectors, and we can complete analogies by adding relationship offsets. But how do we actually find the answer word? In our 2D visualization, we could simply look at where the target vector lands. In 50 or 300 dimensions, we need a principled search procedure.
The challenge is that the target vector almost never exactly equals any word's embedding. Real embeddings have noise, relationships aren't perfectly parallel, and we're working in high-dimensional spaces where geometric intuitions can mislead. What we need is a way to find the closest word to our computed target.
Why Cosine Similarity, Not Euclidean Distance?
You might think: just find the word with minimum Euclidean distance to the target vector. But consider what happens when embeddings have different magnitudes. A frequent word like "the" might have a large embedding norm, while a rare word like "czar" has a small norm. Euclidean distance would favor words whose norms happen to match the target, regardless of directional similarity.
Cosine similarity solves this by focusing purely on direction. It measures the cosine of the angle between two vectors:
$$\cos(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\|\,\|\vec{v}\|}$$

where:
- $\vec{u}$, $\vec{v}$: the two vectors being compared
- $\vec{u} \cdot \vec{v}$: the dot product of the vectors (sum of element-wise products)
- $\|\vec{u}\|$, $\|\vec{v}\|$: the Euclidean norms (lengths) of the vectors
The denominator normalizes the dot product by the vector lengths, making the result independent of magnitude. Two vectors pointing in the same direction have cosine similarity 1, regardless of their lengths. Perpendicular vectors have similarity 0. Opposite directions yield -1. For analogy completion, we care about direction (does this word's embedding point the same way as our target?), not magnitude.
The 3CosAdd Formula
Given the analogy "a is to b as c is to ?", the 3CosAdd method searches the entire vocabulary to find the word whose embedding is most similar (by cosine) to our target vector. The method gets its name from combining three terms additively: we add $\vec{b}$ and $\vec{c}$ while subtracting $\vec{a}$:

$$\hat{d} = \arg\max_{w \in V \setminus \{a, b, c\}} \cos(\vec{w},\, \vec{b} - \vec{a} + \vec{c})$$

where:
- $\hat{d}$: the optimal answer word (the word that maximizes the score)
- $\arg\max$: returns the word that achieves the maximum value
- $V$: vocabulary (the set of all words with embeddings)
- $V \setminus \{a, b, c\}$: vocabulary excluding the query words
- $a$, $b$, $c$: the query words in the analogy "$a$ is to $b$ as $c$ is to ?"
- $\vec{w}$: embedding vector for candidate word $w$
- $\vec{b} - \vec{a} + \vec{c}$: the target vector computed by vector arithmetic
- $\cos(\cdot, \cdot)$: cosine similarity function
Notice the exclusion of query words ($w \in V \setminus \{a, b, c\}$): without this, "woman" might rank highest for "man:woman::king:?" because it's already part of the target vector computation. We want to discover the answer, not echo back the inputs.
The name "3CosAdd" comes from viewing the formula as three additive cosine operations. Conceptually, we want a word that is:
- Similar to $\vec{b}$ (shares properties with the "answer" exemplar)
- Similar to $\vec{c}$ (shares properties with our new starting point)
- Dissimilar to $\vec{a}$ (doesn't share properties we're subtracting away)
The vector arithmetic combines these requirements into a single target.
Implementing 3CosAdd Step by Step
Let's build the algorithm from scratch. First, we need a function to compute cosine similarity between any two vectors:
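A minimal sketch using NumPy follows; the function name and signature are our own choices rather than a fixed API.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v, in [-1, 1]."""
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        # Guard against zero-magnitude vectors: define similarity as 0
        return 0.0
    return float(np.dot(u, v) / (norm_u * norm_v))
```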
The implementation handles the edge case of zero-magnitude vectors (which shouldn't occur in practice but guards against numerical issues). Now the main algorithm:
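Here is a sketch of the search, reusing `cosine_similarity` and the synthetic `embeddings` dictionary from the snippets above (the function name and return format are assumptions for illustration):

```python
def three_cos_add(a, b, c, embeddings, top_k=5):
    """Solve 'a is to b as c is to ?' and return the top_k candidates."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    scores = []
    for word, vector in embeddings.items():
        if word in (a, b, c):          # exclude the query words
            continue
        scores.append((word, cosine_similarity(vector, target)))
    # Rank every remaining vocabulary word by similarity to the target
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

print(three_cos_add("man", "woman", "king", embeddings))
# 'queen' should rank first for these synthetic vectors;
# other female terms such as 'princess' and 'woman' also score high
```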
The algorithm is straightforward: compute the target, score every word in the vocabulary, and return a ranked list. The computational cost is $O(|V| \cdot d)$, where $|V|$ is the vocabulary size and $d$ is the embedding dimension, since we must check every word. For large vocabularies, this can be accelerated using approximate nearest neighbor search, but the exact search remains common for evaluation.
The 3CosAdd method correctly identifies "queen" as the best answer with the highest similarity score. Notice that other female terms (woman, princess, aunt) also score relatively high because they share the "female" component with the target vector, even though they lack the "royalty" component that specifically makes "queen" the best match.
This reveals something important: 3CosAdd doesn't require a perfect match. It finds the best available word, which works well when the vocabulary contains the expected answer. When the vocabulary lacks the expected word (perhaps the analogy expects "empress" but only "queen" is available), the method gracefully returns the closest alternative.
Scaling to Real Embeddings
Let's load pre-trained GloVe embeddings and test analogies on real data:
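One way to do this is with the gensim downloader; this sketch assumes gensim is installed and a network connection is available, and the model name and the tiny question set are our own choices. gensim's `most_similar` performs the same additive, 3CosAdd-style search described above.

```python
import gensim.downloader as api

# Download / load 100-dimensional GloVe vectors (cached after the first run)
glove = api.load("glove-wiki-gigaword-100")

# A few analogy questions: (a, b, c, expected)
questions = [
    ("man", "woman", "king", "queen"),
    ("paris", "france", "tokyo", "japan"),
    ("walk", "walked", "go", "went"),
    ("big", "bigger", "small", "smaller"),
]

correct = 0
for a, b, c, expected in questions:
    # positive=[b, c], negative=[a] corresponds to the target b - a + c
    candidates = glove.most_similar(positive=[b, c], negative=[a], topn=5)
    top_word = candidates[0][0]
    rank = next((i + 1 for i, (w, _) in enumerate(candidates) if w == expected), None)
    correct += (top_word == expected)
    print(f"{a}:{b} :: {c}:? -> {top_word} (expected '{expected}', rank {rank})")

print(f"accuracy on this small set: {correct / len(questions):.2f}")
```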
The results show how well 3CosAdd recovers these relationships in real pre-trained embeddings. Analogies where the expected answer appears at rank 1 are correct, while higher ranks indicate that other words scored higher than the expected answer. The overall accuracy indicates how consistently the embedding space encodes these relationship patterns.
The 3CosMul Method
3CosAdd works well, but it has a subtle weakness. Consider what happens when one of the similarity terms is unusually large. If a candidate word happens to be extremely similar to $\vec{b}$, this high similarity can dominate the combined score, potentially overriding the contributions from $\vec{c}$ and $\vec{a}$. The additive combination means one term can "swamp" the others.
Levy and Goldberg (2014) proposed an elegant alternative: instead of adding the effects of each query word into a single target vector, treat them separately and combine with multiplication. This is 3CosMul, the multiplicative analogy method.
The Intuition Behind Multiplication
Think about what we want in an answer word $w$:
- It should be similar to $\vec{b}$ (the "answer" in our template pair)
- It should be similar to $\vec{c}$ (our new starting point)
- It should be dissimilar to $\vec{a}$ (the word we're "subtracting")
With multiplication, all three conditions must be satisfied simultaneously. A word that's perfect for two conditions but terrible for the third gets a low score because multiplication drives the result toward zero. With addition, a single excellent match could compensate for poor matches elsewhere.
Consider a concrete example: for "man:woman::king:?", suppose there's a word "she" that's extremely similar to "woman" (similarity 0.95) but not particularly related to royalty. Under 3CosAdd, this high similarity to "woman" might boost "she" above "queen." Under 3CosMul, "she" would need to also score well on similarity to "king" and dissimilarity to "man", requirements it's unlikely to satisfy.
The 3CosMul Formula
The 3CosMul method treats the three similarity requirements separately and combines them multiplicatively. Instead of computing a single target vector and measuring distance to it, 3CosMul computes three separate similarity scores for each candidate word:
$$\hat{d} = \arg\max_{w \in V \setminus \{a, b, c\}} \frac{\cos'(\vec{w}, \vec{b}) \cdot \cos'(\vec{w}, \vec{c})}{\cos'(\vec{w}, \vec{a}) + \epsilon}$$

where:
- $\hat{d}$: the optimal answer word (the word that maximizes the score)
- $\arg\max$: returns the word that achieves the maximum value
- $V \setminus \{a, b, c\}$: vocabulary excluding the query words
- $\cos'(\cdot, \cdot)$: shifted cosine similarity mapped to $[0, 1]$
- $\epsilon$: small constant (typically 0.001) to prevent division by zero
- Numerator: rewards words similar to both $b$ and $c$
- Denominator: penalizes words similar to $a$

The structure mirrors our intuition: multiply the "want" terms (similarity to $b$ and $c$), divide by the "don't want" term (similarity to $a$). The $\epsilon$ prevents division by zero when the shifted similarity to $a$ is exactly zero, which happens only when a candidate points in the opposite direction from $\vec{a}$.
Handling Negative Similarities
There's a technical subtlety: cosine similarity ranges from -1 to +1. Multiplying two negative numbers gives a positive result, which would incorrectly reward candidates that are dissimilar to both and .
The solution is to shift similarities to a positive range before multiplying. Instead of using raw cosine similarities in the 3CosMul score, we apply a linear transformation to map them to $[0, 1]$:

$$\cos'(\vec{x}, \vec{y}) = \frac{\cos(\vec{x}, \vec{y}) + 1}{2}$$

where:
- $\cos(\vec{x}, \vec{y})$: the original cosine similarity in range $[-1, 1]$
- $+1$: shifts the range from $[-1, 1]$ to $[0, 2]$
- $\div\, 2$: scales the range from $[0, 2]$ to $[0, 1]$

This transformation preserves the ordering of similarities while ensuring all values are positive. Now a cosine similarity of $1$ becomes $1$, a similarity of $0$ becomes $0.5$, and a similarity of $-1$ becomes $0$. Multiplication works correctly: high positive similarities contribute large factors, while negative similarities (now mapped to small positive values) appropriately dampen the score.
Implementing 3CosMul
The implementation follows our formula, with care taken to shift similarities to the positive range:
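A sketch, again building on `cosine_similarity` and the synthetic `embeddings` from earlier; the epsilon default follows the value quoted above, and the function names are our own.

```python
def shifted_cosine(u, v):
    """Map cosine similarity from [-1, 1] to [0, 1]."""
    return (cosine_similarity(u, v) + 1.0) / 2.0

def three_cos_mul(a, b, c, embeddings, epsilon=0.001, top_k=5):
    """Solve 'a is to b as c is to ?' with the multiplicative objective."""
    scores = []
    for word, vector in embeddings.items():
        if word in (a, b, c):
            continue
        sim_b = shifted_cosine(vector, embeddings[b])   # want: similar to b
        sim_c = shifted_cosine(vector, embeddings[c])   # want: similar to c
        sim_a = shifted_cosine(vector, embeddings[a])   # don't want: similar to a
        scores.append((word, (sim_b * sim_c) / (sim_a + epsilon)))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

print(three_cos_mul("man", "woman", "king", embeddings))
```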
Let's visualize how 3CosMul scores each candidate word by examining the component similarities:
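Rather than a full chart, here is a minimal sketch that prints the three component similarities and the resulting score for each candidate, using the synthetic embeddings and `shifted_cosine` defined above:

```python
a, b, c = "man", "woman", "king"
print(f"{'word':<10} {'sim_b':>6} {'sim_c':>6} {'sim_a':>6} {'score':>7}")
for word, vector in embeddings.items():
    if word in (a, b, c):
        continue
    sim_b = shifted_cosine(vector, embeddings[b])
    sim_c = shifted_cosine(vector, embeddings[c])
    sim_a = shifted_cosine(vector, embeddings[a])
    score = (sim_b * sim_c) / (sim_a + 0.001)
    print(f"{word:<10} {sim_b:6.3f} {sim_c:6.3f} {sim_a:6.3f} {score:7.3f}")
```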
Now let's compare the two methods head-to-head on our test analogies:
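A small sketch of such a comparison, running both functions over a handful of analogy questions built from the synthetic vocabulary (the question list is illustrative):

```python
test_analogies = [
    ("man", "woman", "king", "queen"),
    ("man", "woman", "prince", "princess"),
    ("man", "woman", "boy", "girl"),
    ("man", "woman", "uncle", "aunt"),
]

for a, b, c, expected in test_analogies:
    add_answer = three_cos_add(a, b, c, embeddings, top_k=1)[0][0]
    mul_answer = three_cos_mul(a, b, c, embeddings, top_k=1)[0][0]
    agree = "same" if add_answer == mul_answer else "DIFFERENT"
    print(f"{a}:{b} :: {c}:{expected:<9} 3CosAdd={add_answer:<9} "
          f"3CosMul={mul_answer:<9} ({agree})")
```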
On our synthetic embeddings with well-structured relationships, both methods typically produce similar results. The real differences between 3CosAdd and 3CosMul emerge in noisier, real-world embeddings where one query word's similarity might otherwise dominate the score.
When Do the Methods Differ?
On our synthetic embeddings with clean structure, both methods typically agree. But with real embeddings trained on noisy text, differences emerge:
- Dominating similarities: When one query word is unusually similar to many candidates, 3CosAdd can be swayed by that single term. 3CosMul requires balanced similarity across all query words, making it more robust to outliers.
- Syntactic relationships: Levy and Goldberg found that 3CosMul outperforms 3CosAdd specifically on syntactic analogies (verb tenses, plurals, comparatives). These relationships may have less consistent offsets than semantic relationships, making the multiplicative balancing more valuable.
- Rare words: When one of the query words is rare and has a noisy embedding, its contribution to 3CosAdd's vector arithmetic might be unreliable. 3CosMul's separate treatment of each word can isolate this noise.
In practice, 3CosMul provides a small but consistent improvement on standard benchmarks. The Google analogy dataset shows roughly 2-3% better accuracy for 3CosMul, with gains concentrated in syntactic categories.
Analogy Evaluation Datasets
The NLP community has developed standard datasets for evaluating word embedding quality through analogies. These datasets contain thousands of analogy questions across various relationship categories.
The Google Analogy Dataset
The most famous benchmark, released with the original Word2Vec paper, contains 19,544 analogy questions in two categories:
Semantic analogies (8,869 questions):
- Capital-world: Athens is to Greece as Baghdad is to Iraq
- Capital-common: Beijing is to China as Berlin is to Germany
- Currency: Algeria is to dinar as Angola is to kwanza
- Family: boy is to girl as brother is to sister
- City-in-state: Chicago is to Illinois as Houston is to Texas
Syntactic analogies (10,675 questions):
- Adjective-to-adverb: apparent is to apparently as rapid is to rapidly
- Opposite: aware is to unaware as certain is to uncertain
- Comparative: bad is to worse as big is to bigger
- Superlative: bad is to worst as big is to biggest
- Present-participle: code is to coding as dance is to dancing
- Nationality-adjective: Albania is to Albanian as Argentina is to Argentinean
- Past-tense: dancing is to danced as decreasing is to decreased
- Plural: banana is to bananas as bird is to birds
- Plural-verbs: decrease is to decreases as describe is to describes
Performance varies across categories, reflecting how consistently each relationship type is encoded in the embedding space. Categories with higher accuracy indicate relationships that the embeddings capture more reliably. In real embeddings, semantic categories (like family relationships) often outperform syntactic ones (like verb tenses), though this varies by embedding method and training corpus.
The MSR Analogy Dataset
Microsoft Research released another widely used dataset focusing on syntactic relationships:
- Adjectives: base, comparative, superlative forms
- Nouns: singular, plural forms
- Verbs: tense variations
BATS: Balanced Analogy Test Set
BATS addresses limitations of earlier datasets by including:
- More diverse relationships (40 categories)
- Multiple valid answers per analogy
- Better balance between frequency levels
Analogy Accuracy: Metrics and Interpretation
The standard metric for analogy evaluation is accuracy: the percentage of analogies where the top-ranked word (after excluding query words) matches the expected answer. This is a simple and intuitive metric that captures how often the model gets the analogy exactly right:
$$\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}}$$

where:
- $\text{Accuracy}$: fraction of analogies answered correctly (ranges from $0$ to $1$, often expressed as a percentage)
- $N_{\text{correct}}$: number of analogies where the top-ranked prediction matches the expected answer
- $N_{\text{total}}$: total number of analogy questions evaluated

An accuracy of $0.75$ (or 75%) means the model correctly answered three out of every four analogies. However, this metric has important limitations.
The Top-1 Problem
Accuracy only considers the top-ranked answer. If the correct word is ranked second, it counts as wrong. This is particularly harsh for:
- Synonyms: "quick" and "fast" might both be valid
- Near-synonyms in context: "happy" and "joyful"
- Equally valid alternatives: both "woman" and "lady" might complete an analogy
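Relaxed metrics quantify how much the strict criterion hides. As a sketch, Top-k accuracy and MRR are easy to compute once you know the rank of the expected answer for each question; the ranks list below is purely illustrative, standing in for ranks produced by the analogy search on a real dataset. (MRR is defined formally in the next subsection.)

```python
def relaxed_metrics(ranks, ks=(1, 3, 5)):
    """Top-k accuracy and MRR from the rank of the correct answer per analogy."""
    n = len(ranks)
    topk = {k: sum(r <= k for r in ranks) / n for k in ks}
    mrr = sum(1.0 / r for r in ranks) / n
    return topk, mrr

# Hypothetical ranks of the expected answer for ten analogy questions
ranks = [1, 1, 2, 1, 3, 1, 5, 2, 1, 4]
topk, mrr = relaxed_metrics(ranks)
print(topk)               # {1: 0.5, 3: 0.8, 5: 1.0}
print(f"MRR = {mrr:.3f}") # MRR = 0.678 for these ranks
```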
The relaxed metrics show that many "incorrect" answers under strict Top-1 evaluation are actually ranked highly. Top-3 and Top-5 accuracy are often substantially higher than Top-1, indicating that the model often places the correct answer near the top even when it's not first. The MRR provides a single number that balances these considerations: higher MRR indicates correct answers consistently appear near the top of rankings.
Mean Reciprocal Rank (MRR)
MRR provides a more nuanced view by considering not just whether the correct answer is ranked first, but how high it appears in the ranked list. Instead of treating all non-first-place answers as failures, MRR gives partial credit based on rank position:
$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$

where:
- $\text{MRR}$: Mean Reciprocal Rank (ranges from $0$ to $1$)
- $N$: total number of analogy questions
- $\text{rank}_i$: position of the correct answer in the ranked list for the $i$-th analogy (1 if first, 2 if second, etc.)
- $\frac{1}{\text{rank}_i}$: reciprocal rank for analogy $i$ ($1$ if the correct answer is first, $\frac{1}{2}$ if second, $\frac{1}{3}$ if third, etc.)
- $\sum_{i=1}^{N}$: sum over all analogies

The reciprocal rank gives high scores to correct answers near the top of the list and diminishing scores as rank increases. An MRR of $1.0$ means all answers were ranked first. An MRR of $0.5$ would occur if answers were typically ranked second.
What Analogies Reveal About Embeddings
Analogy performance tells us something about embedding quality, but the relationship is nuanced. Here's what good analogy performance does and doesn't imply.
What Analogies Do Show
Consistent relationship encoding: If "king:queen::man:woman" works, the model has learned that gender is encoded consistently across different word pairs. The relationship vector is approximately invariant.
Geometric structure: High analogy accuracy indicates the embedding space has meaningful geometric properties. Similar relationships create parallel vectors.
Distributional pattern learning: Since analogies emerge from distributional training (Skip-gram, GloVe), good performance confirms the model captured co-occurrence patterns that reflect semantic relationships.
What Analogies Don't Show
General NLP performance: Analogy accuracy doesn't strongly predict performance on downstream tasks like sentiment analysis, named entity recognition, or question answering. Models can excel at analogies but underperform on practical applications, and vice versa.
Handling of rare words: Analogy datasets focus on common words. Performance on frequent words doesn't guarantee quality representations for the long tail of rare vocabulary.
Contextual understanding: Static embeddings give one vector per word. "Bank" in "river bank" and "bank account" gets the same embedding, yet analogy tests can't detect this limitation.
Compositionality: Analogies test individual word relationships, not how well embeddings combine into phrase or sentence representations.
Visualizing Relationship Vectors
To understand why some analogies work better than others, we can examine the relationship vectors directly. If a relationship is encoded consistently, all instances should produce similar offset vectors.
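For example, here is a minimal sketch that measures consistency as the average pairwise cosine similarity between offset vectors, using the GloVe vectors loaded earlier; the pair lists are our own small samples.

```python
from itertools import combinations
import numpy as np

def relationship_consistency(pairs, vectors):
    """Average pairwise cosine similarity between offset vectors (b - a)."""
    offsets = [vectors[b] - vectors[a] for a, b in pairs]
    sims = []
    for u, v in combinations(offsets, 2):
        sims.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return float(np.mean(sims))

capital_pairs = [("paris", "france"), ("tokyo", "japan"),
                 ("london", "england"), ("berlin", "germany")]
gender_pairs = [("man", "woman"), ("king", "queen"),
                ("brother", "sister"), ("uncle", "aunt")]

print("capital-of consistency:", relationship_consistency(capital_pairs, glove))
print("gender consistency:    ", relationship_consistency(gender_pairs, glove))
```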
Relationship consistency measures how parallel the offset vectors are across different word pairs. Values closer to 1.0 indicate that all pairs in that relationship category share nearly identical offset directions, which strongly predicts good analogy performance. Lower consistency suggests the relationship is encoded differently for different word pairs, making analogies less reliable.
Limitations of Analogy Evaluation
Despite their popularity, analogy tests have significant limitations as embedding quality metrics.
The Hubness Problem
High-dimensional spaces suffer from "hubness": some words become nearest neighbors of many other words. These hubs can dominate analogy results, appearing as incorrect answers for many different queries.
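A rough sketch of how you might probe this with the GloVe vectors loaded earlier: sample some query words, collect each word's nearest neighbors, and count how often each neighbor appears across the lists. The sample size and neighbor count are arbitrary choices.

```python
from collections import Counter
import random

random.seed(0)
# Sample query words from the frequent part of the vocabulary
queries = random.sample(glove.index_to_key[:5000], 200)

neighbor_counts = Counter()
for word in queries:
    for neighbor, _ in glove.most_similar(word, topn=10):
        neighbor_counts[neighbor] += 1

# Words appearing in many different neighbor lists are potential "hubs"
print(neighbor_counts.most_common(10))
```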
Words that appear frequently as nearest neighbors ("hubs") can distort analogy results. If a hub word happens to score highly for many different analogies, it may incorrectly outrank the true answer. The disparity between central and peripheral words reveals the hubness structure of the embedding space, a potential source of systematic errors in analogy evaluation.
Frequency Confounds
Common words have more training examples, leading to better-tuned embeddings. Analogy datasets often focus on frequent words, potentially overstating embedding quality for typical vocabulary.
Dataset Artifacts
Analogy datasets may contain patterns that don't generalize:
- Specific naming conventions (e.g., country capitals all having similar suffixes)
- Cultural biases (Western-centric knowledge)
- Temporal changes (city names, currencies that have changed)
Beyond Simple Analogies: Extensions and Alternatives
Researchers have proposed several extensions to address limitations of the basic analogy framework.
Relational Similarity
Instead of requiring exact vector arithmetic to find an unknown word, relational similarity directly measures how well a model captures that two word pairs share the same relationship. Given two complete word pairs, it asks: "Do these pairs exhibit the same relationship?"
$$\text{RelSim}\big((a, b), (c, d)\big) = \cos(\vec{b} - \vec{a},\; \vec{d} - \vec{c})$$

where:
- $\text{RelSim}\big((a, b), (c, d)\big)$: relational similarity between word pairs $(a, b)$ and $(c, d)$ (ranges from $-1$ to $1$)
- $\vec{b} - \vec{a}$: the relationship vector from word $a$ to word $b$
- $\vec{d} - \vec{c}$: the relationship vector from word $c$ to word $d$
- $\cos(\cdot, \cdot)$: cosine similarity function

A high relational similarity (close to $1$) indicates that both pairs exhibit the same relationship: the vectors $\vec{b} - \vec{a}$ and $\vec{d} - \vec{c}$ point in the same direction. This directly measures whether the relationship vectors are parallel, without requiring us to find the correct $d$.
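A minimal sketch of relational similarity, computed over the GloVe vectors loaded earlier (the word choices are illustrative):

```python
import numpy as np

def relational_similarity(pair1, pair2, vectors):
    """Cosine similarity between the offset vectors of two word pairs."""
    a, b = pair1
    c, d = pair2
    offset1 = vectors[b] - vectors[a]
    offset2 = vectors[d] - vectors[c]
    return float(np.dot(offset1, offset2) /
                 (np.linalg.norm(offset1) * np.linalg.norm(offset2)))

# Same relationship type (gender) vs. a different relationship (capital-of)
print(relational_similarity(("man", "woman"), ("king", "queen"), glove))
print(relational_similarity(("man", "woman"), ("paris", "france"), glove))
```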
Word pairs that share the same relationship type show high relational similarity (close to 1.0), while pairs with different relationships show lower similarity. This confirms that the embedding space encodes distinct relationship types as separate directions. When relational similarity is high for pairs we expect to match, it validates that the parallelogram model holds for that relationship category.
Multiple Valid Answers
The BATS dataset and newer benchmarks allow multiple correct answers per analogy. This better reflects linguistic reality where synonyms exist.
Word Similarity as Complement
Word similarity tasks (predicting human similarity judgments for word pairs) provide a complementary view of embedding quality. High correlation with human ratings suggests the embedding space reflects human semantic intuitions.
Practical Implementation
Here's a complete, reusable implementation for analogy evaluation:
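One possible shape for such an evaluator, sketched against a plain dictionary of NumPy vectors; the class name, method names, and question format are assumptions rather than a fixed API. It vectorizes the 3CosAdd search with a single matrix multiplication and reports both Top-1 accuracy and MRR.

```python
import numpy as np

class AnalogyEvaluator:
    """Evaluate analogies of the form (a, b, c, expected) over an embedding dict."""

    def __init__(self, embeddings):
        self.words = list(embeddings.keys())
        self.index = {w: i for i, w in enumerate(self.words)}
        self.raw = np.stack([embeddings[w] for w in self.words]).astype(float)
        # Pre-normalized copy so that a dot product gives cosine similarity
        self.unit = self.raw / np.linalg.norm(self.raw, axis=1, keepdims=True)

    def rank_of_expected(self, a, b, c, expected):
        """Rank of `expected` after excluding the query words (1 = best)."""
        target = (self.raw[self.index[b]]
                  - self.raw[self.index[a]]
                  + self.raw[self.index[c]])
        scores = self.unit @ (target / np.linalg.norm(target))
        for w in (a, b, c):
            scores[self.index[w]] = -np.inf       # exclude query words
        order = np.argsort(-scores)
        return int(np.where(order == self.index[expected])[0][0]) + 1

    def evaluate(self, questions):
        """Return Top-1 accuracy and MRR over (a, b, c, expected) tuples."""
        ranks = [self.rank_of_expected(*q) for q in questions
                 if all(w in self.index for w in q)]
        accuracy = sum(r == 1 for r in ranks) / len(ranks)
        mrr = sum(1.0 / r for r in ranks) / len(ranks)
        return {"accuracy": accuracy, "mrr": mrr, "evaluated": len(ranks)}

evaluator = AnalogyEvaluator(embeddings)   # the synthetic embeddings from earlier
print(evaluator.evaluate([("man", "woman", "king", "queen"),
                          ("man", "woman", "boy", "girl")]))
```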
The evaluator provides a comprehensive view of analogy performance. Accuracy tells us the fraction of analogies answered correctly at rank 1, while MRR gives credit for answers that appear lower in the ranking. Together, these metrics reveal both the precision of the embedding space and the reliability of its semantic structure.
Key Parameters
When implementing word analogy methods, several parameters influence performance and should be selected based on your use case:
| Parameter | Typical Values | Description |
|---|---|---|
| embedding_dim | 50, 100, 300 | Higher dimensions capture more nuanced relationships but require more data and computation |
| epsilon (3CosMul) | 0.001 | Small constant to prevent division by zero; larger values reduce sensitivity to dissimilarity from $a$ |
| exclude_query | True | Whether to exclude query words $a$, $b$, $c$ from candidates; always True for evaluation |
| top_k | 1, 5, 10 | Number of top candidates to return; Top-1 for strict accuracy, higher for relaxed metrics |
Method Selection: Use 3CosAdd for simplicity and interpretability. Switch to 3CosMul when dealing with noisy embeddings or syntactic relationships, where the multiplicative balancing provides more robust scoring.
Evaluation Considerations: Top-1 accuracy is the standard benchmark metric but can be harsh. Report MRR or Top-5 accuracy alongside Top-1 for a more complete picture. When comparing embeddings, ensure consistent vocabulary coverage, as missing words in the test set will affect accuracy calculations.
Relationship Types: Semantic analogies (capitals, family) typically achieve higher accuracy than syntactic ones (verb tenses, plurals). Performance on one category doesn't predict performance on others, so evaluate across multiple relationship types when assessing embedding quality.
Summary
Word analogies provide a window into the geometric structure of embedding spaces. When Skip-gram or GloVe learns that "king" relates to "queen" the same way "man" relates to "woman," it encodes this as parallel vectors in high-dimensional space. The parallelogram model formalizes this insight.
Key takeaways:
- Vector arithmetic works: The formula $\vec{b} - \vec{a} + \vec{c} \approx \vec{d}$ successfully solves many analogies because semantic relationships are encoded as consistent vector offsets
- 3CosAdd vs 3CosMul: Both methods rank every vocabulary word as a candidate completion, but they use different scoring functions. 3CosMul often performs slightly better by balancing contributions from each query term
- Evaluation datasets: The Google analogy dataset and BATS provide standardized benchmarks, but performance varies significantly by relationship type
- Accuracy is limited: Top-1 accuracy is harsh; MRR and top-k metrics provide more nuanced evaluation
- Relationship consistency matters: Analogies work best when the relationship vector is consistent across word pairs
- Analogies have limitations: High analogy accuracy doesn't guarantee good downstream performance, and datasets have various biases and artifacts
Word analogies revealed something profound about distributional semantics: meaning, or at least certain aspects of meaning, can be captured geometrically. But analogies are just one lens. The next chapter explores GloVe, which takes a different approach to learning embeddings by directly factorizing co-occurrence matrices.