CBOW Model: Learning Word Embeddings by Predicting Center Words

Michael Brenndoerfer · December 11, 2025 · 38 min read · 9,174 words

A comprehensive guide to the Continuous Bag of Words (CBOW) model from Word2Vec, covering context averaging, architecture, objective function, gradient derivation, and comparison with Skip-gram.

CBOW Model

In the previous chapter, we explored Skip-gram, which learns word embeddings by predicting context words from a center word. The Continuous Bag of Words (CBOW) model takes the opposite approach: given the surrounding context words, predict the center word. This architectural inversion leads to different learning dynamics and computational trade-offs.

CBOW was introduced alongside Skip-gram in the original Word2Vec paper by Mikolov et al. (2013). While Skip-gram treats each context position independently, CBOW averages context word embeddings together to make a single prediction per training example. This design choice has practical implications: CBOW trains faster than Skip-gram and performs better on frequent words, while Skip-gram excels at representing rare words.

This chapter covers the CBOW architecture from the ground up. We'll work through the mathematics, implement the model from scratch, and understand when to choose CBOW over Skip-gram.

The Core Idea: Predicting Words from Context

Imagine reading a sentence with a word missing: "The quick brown ___ jumps over the lazy dog." Given the surrounding context ("quick," "brown," "jumps," "over"), you can likely guess the missing word is "fox." This fill-in-the-blank task is exactly what CBOW learns to do.

CBOW Model

The Continuous Bag of Words (CBOW) model learns word representations by training a neural network to predict a center word given its surrounding context words. The context word embeddings are averaged together before making the prediction.

The name "Continuous Bag of Words" comes from two properties:

  1. Continuous: The model uses continuous-valued vectors (embeddings) rather than discrete word representations
  2. Bag of Words: The context words are treated as an unordered set, averaging their embeddings together regardless of position
In[2]:
# Demonstrating CBOW's prediction task
sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.lower().split()

def get_cbow_example(words, target_idx, window_size=2):
    """Generate a CBOW training example: (context_words, target_word)."""
    target = words[target_idx]
    context = []
    
    for offset in range(-window_size, window_size + 1):
        if offset == 0:  # Skip the target word itself
            continue
        context_idx = target_idx + offset
        if 0 <= context_idx < len(words):
            context.append(words[context_idx])
    
    return context, target

# Get CBOW example for predicting "fox" (index 3)
context, target = get_cbow_example(words, target_idx=3, window_size=2)
Out[3]:
CBOW Training Example:
---------------------------------------------
Sentence: 'The quick brown fox jumps over the lazy dog'
Target word: 'fox' (position 3)
Window size: 2

Context words: ['quick', 'brown', 'jumps', 'over']
Prediction task: ['quick', 'brown', 'jumps', 'over'] → 'fox'

This single training example captures the essence of CBOW: given the surrounding words, predict what goes in the middle. Compare this to Skip-gram, which generates four separate training pairs from the same position: (fox → quick), (fox → brown), (fox → jumps), and (fox → over). CBOW instead creates one training example where all four context words together predict "fox."
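
To make the contrast concrete, here is a small sketch that generates both kinds of training data for the same position, reusing the words list and get_cbow_example helper from above (get_skipgram_pairs is a helper added here just for this comparison):

def get_skipgram_pairs(words, center_idx, window_size=2):
    """Generate Skip-gram pairs (center, context) for one position."""
    pairs = []
    for offset in range(-window_size, window_size + 1):
        j = center_idx + offset
        if offset != 0 and 0 <= j < len(words):
            pairs.append((words[center_idx], words[j]))
    return pairs

# Same position (index 3, "fox"), two very different training structures
print(get_skipgram_pairs(words, 3))   # 4 pairs: ('fox', 'quick'), ('fox', 'brown'), ...
print(get_cbow_example(words, 3))     # 1 example: (['quick', 'brown', 'jumps', 'over'], 'fox')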

Architecture: Context Averaging

CBOW's architecture is similar to Skip-gram but with a crucial difference in the input layer. Instead of taking a single word as input, CBOW takes multiple context words and averages their embeddings.

Out[4]:
CBOW architecture. Multiple context words are each embedded using the shared embedding matrix W, then averaged to produce a single hidden vector. This averaged representation is projected through the context matrix W' to produce scores for every word in the vocabulary. Softmax converts these scores to a probability distribution, with the goal of assigning high probability to the true center word.

The network has two weight matrices, identical in purpose to Skip-gram:

  1. Embedding matrix $\mathbf{W}$ (size $V \times d$): Maps input words to dense vectors. Each row is the embedding for one vocabulary word. For CBOW, we look up multiple rows (one per context word) and average them.

  2. Context matrix $\mathbf{W}'$ (size $d \times V$): Maps the hidden representation to output scores over the vocabulary.

The key difference from Skip-gram is what happens between the embedding lookup and the output layer: CBOW averages the context embeddings, while Skip-gram uses the single center word embedding directly.

Context Word Averaging

How should a model combine information from multiple context words? Skip-gram sidesteps this question entirely, processing each context word independently. But CBOW must somehow merge four separate embeddings into a single representation that captures the collective meaning of the context.

The simplest approach is also highly effective: take the average. If each context word embedding captures something about that word's meaning, then their average should capture what these words have in common. When the context words are semantically coherent (as they typically are around a meaningful center word), this average points toward the semantic region where the center word belongs.

This leads us to the core formula of CBOW:

$$\bar{\mathbf{h}} = \frac{1}{C} \sum_{i=1}^{C} \mathbf{w}_{c_i}$$

where:

  • $\bar{\mathbf{h}}$: the averaged hidden representation (the "centroid" of the context)
  • $C$: the number of context words (typically $2m$ for window size $m$)
  • $\mathbf{w}_{c_i}$: the embedding vector of the $i$-th context word (row $c_i$ of $\mathbf{W}$)

The averaging operation is simple: just an element-wise mean across all context embeddings. Yet this simplicity carries important implications for what CBOW can and cannot learn.

In[5]:
import numpy as np

# Small vocabulary for demonstration
small_vocab = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
word_to_idx = {word: i for i, word in enumerate(small_vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

# Initialize small embedding matrix
embedding_dim = 4
np.random.seed(42)
W = np.random.randn(len(small_vocab), embedding_dim) * 0.5

def average_context_embeddings(context_words, W, word_to_idx):
    """Average the embeddings of context words."""
    embeddings = [W[word_to_idx[word]] for word in context_words]
    return np.mean(embeddings, axis=0)

# Example: average embeddings for context of "fox"
context = ['quick', 'brown', 'jumps', 'over']
avg_embedding = average_context_embeddings(context, W, word_to_idx)
Out[6]:
Context Word Averaging:
--------------------------------------------------
Context words: ['quick', 'brown', 'jumps', 'over']

Individual embeddings:
  quick   : [-0.117, -0.117, +0.790, +0.384]
  brown   : [-0.235, +0.271, -0.232, -0.233]
  jumps   : [-0.506, +0.157, -0.454, -0.706]
  over    : [+0.733, -0.113, +0.034, -0.712]

Averaged embedding:
  h_bar   : [-0.031, +0.050, +0.034, -0.317]

Each dimension of the averaged embedding is simply the mean of the corresponding dimensions from the individual word embeddings. Notice how the averaged vector smooths out the individual variations, producing values that lie between the extremes of the component vectors.

Out[7]:
Heatmap visualization of context word averaging. Each row shows a context word's embedding values across four dimensions. The bottom row shows the averaged embedding, where each cell is the mean of the column above it. The color intensity reveals how extreme values in individual embeddings are smoothed toward the center in the average.
Why Averaging?

Averaging treats context words as an unordered set, a "bag of words." This simplification ignores word order (e.g., "dog bites man" averages the same as "man bites dog"), but works surprisingly well in practice. The key insight is that the set of nearby words, regardless of order, provides strong signal about a word's meaning.

Visualizing Context Averaging

Let's visualize how averaging combines multiple context vectors into a single representation:

Out[8]:
Context word averaging in CBOW. Each context word has its own embedding vector (colored arrows). The averaged vector (black arrow) represents the centroid of these context embeddings. This averaged representation is what CBOW uses to predict the center word. The averaging operation smooths out individual word idiosyncrasies and captures the shared semantic context.

The averaged vector represents the semantic centroid of the context. When context words are semantically coherent (as they typically are around a meaningful center word), this centroid points toward the region of embedding space where the center word should reside.

The CBOW Objective Function

With the averaged context representation $\bar{\mathbf{h}}$ in hand, CBOW faces a classification problem: which of the $V$ vocabulary words is most likely to appear in the center position? This is where the second weight matrix, $\mathbf{W}'$, comes into play.

From Context to Prediction

The intuition behind the prediction step is geometric. Each word in the vocabulary has a corresponding vector in the context matrix $\mathbf{W}'$. To predict the center word, CBOW computes how well the averaged context $\bar{\mathbf{h}}$ aligns with each vocabulary word's context vector. The word whose context vector points most in the same direction as $\bar{\mathbf{h}}$ receives the highest probability.

This alignment is measured by the dot product. For the target word $w_t$, we compute $\mathbf{w}'_t \cdot \bar{\mathbf{h}}$, which yields a large positive value when the vectors point in similar directions. But we need probabilities, not raw scores, so we apply softmax normalization:

$$P(w_t \mid w_{c_1}, \ldots, w_{c_C}) = \frac{\exp(\mathbf{w}'_t \cdot \bar{\mathbf{h}})}{\sum_{k=1}^{V} \exp(\mathbf{w}'_k \cdot \bar{\mathbf{h}})}$$

where:

  • $w_t$: the target center word we're trying to predict
  • $\bar{\mathbf{h}}$: the averaged context embedding (computed from the input layer)
  • $\mathbf{w}'_t$: the context vector for word $w_t$ (column $t$ of $\mathbf{W}'$)
  • $V$: vocabulary size

The softmax denominator sums over all vocabulary words, ensuring the probabilities sum to 1. This is the same formulation as Skip-gram, but with the averaged context vector $\bar{\mathbf{h}}$ replacing the single center word embedding.

From Probability to Loss

Training requires a loss function that measures prediction quality. The natural choice is cross-entropy loss, which penalizes the model for assigning low probability to the correct center word. Taking the negative logarithm of the probability gives us:

$$\mathcal{L} = -\log P(w_t \mid \text{context}) = -\mathbf{w}'_t \cdot \bar{\mathbf{h}} + \log \sum_{k=1}^{V} \exp(\mathbf{w}'_k \cdot \bar{\mathbf{h}})$$

This loss has an intuitive interpretation. The first term, $-\mathbf{w}'_t \cdot \bar{\mathbf{h}}$, wants to maximize the dot product between the context representation and the correct word, pushing them to align in embedding space. The second term, the log-sum-exp, acts as a normalizing penalty that prevents all dot products from growing unboundedly large.

The Corpus Objective

For a complete training corpus, we average the loss over all word positions:

$$J = \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_n$$

where $N$ is the total number of training positions in the corpus and $\mathcal{L}_n$ is the loss at position $n$. Minimizing this objective encourages the model to correctly predict center words from their contexts across the entire corpus.

In[9]:
def softmax(z):
    """Compute softmax probabilities with numerical stability."""
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def cbow_forward(context_words, W, W_prime, word_to_idx):
    """
    Forward pass through CBOW model.
    
    Returns:
        h_bar: averaged context embedding
        z: output scores (logits)
        probs: softmax probabilities
    """
    # Step 1: Average context embeddings
    h_bar = average_context_embeddings(context_words, W, word_to_idx)
    
    # Step 2: Compute output scores
    z = W_prime.T @ h_bar
    
    # Step 3: Apply softmax
    probs = softmax(z)
    
    return h_bar, z, probs

# Initialize context matrix
W_prime = np.random.randn(embedding_dim, len(small_vocab)) * 0.5

# Forward pass
h_bar, z, probs = cbow_forward(context, W, W_prime, word_to_idx)
Out[10]:
CBOW Forward Pass:
--------------------------------------------------
Context: ['quick', 'brown', 'jumps', 'over']
Target: 'fox'

Output probabilities P(word | context):
  the     : 0.1366 
  quick   : 0.1214 
  brown   : 0.1096 
  fox     : 0.1005 ←
  jumps   : 0.1195 
  over    : 0.1230 
  lazy    : 0.1428 
  dog     : 0.1466 

Predicted word: dog

With random weights, the model has no preference for the correct word. Training adjusts the embeddings so that the averaged context vector produces high probability for the actual center word.
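
To quantify how far the model is from a good prediction, we can compute the cross-entropy loss for this single example directly from the probabilities above (a quick sketch reusing probs and word_to_idx from the forward pass):

# Cross-entropy loss for this example: -log P(fox | context)
target_idx = word_to_idx['fox']
loss = -np.log(probs[target_idx])
print(f"P(fox | context) = {probs[target_idx]:.4f}")
print(f"Loss = {loss:.4f}  (compare with log V = {np.log(len(small_vocab)):.4f} for a uniform prediction)")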

Out[11]:
From dot products to probabilities. Left: raw dot products (logits) between the averaged context vector and each vocabulary word's output vector. Right: after softmax transformation, these scores become a valid probability distribution. The softmax exponentiates and normalizes, amplifying differences between scores.

CBOW vs Skip-gram: A Detailed Comparison

While CBOW and Skip-gram share the same embedding matrices, they differ fundamentally in how they use training data.

Training Signal Density

Consider the sentence "The quick brown fox jumps over the lazy dog" with window size 2. At position 3 (word "fox"):

Skip-gram generates 4 training pairs:

  • (fox → quick)
  • (fox → brown)
  • (fox → jumps)
  • (fox → over)

Each pair updates the "fox" embedding based on predicting one context word.

CBOW generates 1 training example:

  • (quick, brown, jumps, over → fox)

This single example updates all four context word embeddings based on predicting "fox."

Out[12]:
Training signal comparison between Skip-gram and CBOW. Skip-gram (left) creates multiple training examples from each center word position, each predicting one context word. CBOW (right) creates a single training example that uses all context words to predict the center word. Skip-gram provides more gradient updates for rare words, while CBOW provides a stronger, averaged signal for each prediction.

Implications for Rare vs. Frequent Words

This difference in training structure has important consequences:

Rare words benefit more from Skip-gram:

  • Each occurrence of a rare word in Skip-gram generates multiple training examples (one per context word)
  • In CBOW, a rare word appearing as center word creates only one training example
  • A rare word appearing in context contributes only a fraction (1/C) of the gradient

Frequent words can benefit from CBOW:

  • CBOW's averaging smooths out noisy context patterns
  • For frequent words with many training examples, the averaging helps learn stable representations
  • The single prediction per position makes training faster
In[13]:
def count_training_signal(corpus, word, window_size=2, model='skipgram'):
    """
    Count how many gradient updates a word receives during training.
    
    For Skip-gram: word as center creates 2*window_size examples
    For CBOW: word as center creates 1 example; word in context contributes 1/C
    """
    words = corpus.lower().split()
    word_positions = [i for i, w in enumerate(words) if w == word]
    
    if model == 'skipgram':
        # Each occurrence as center creates multiple examples
        total_signal = 0
        for pos in word_positions:
            context_count = 0
            for offset in range(-window_size, window_size + 1):
                if offset != 0 and 0 <= pos + offset < len(words):
                    context_count += 1
            total_signal += context_count
        return total_signal, 'training pairs'
    
    else:  # CBOW
        # As center: 1 example with full gradient
        center_examples = len(word_positions)
        
        # As context: contributes fractional gradient
        context_contributions = 0
        for i, w in enumerate(words):
            if w != word:
                # Count if target word appears in this position's context
                context_words = []
                for offset in range(-window_size, window_size + 1):
                    if offset != 0 and 0 <= i + offset < len(words):
                        context_words.append(words[i + offset])
                if word in context_words:
                    context_contributions += 1 / len(context_words)
        
        return center_examples, context_contributions

# Example corpus with varying word frequencies
corpus = """
the king sat on the throne in the royal palace
the queen wore the crown at the royal ceremony
the prince and princess walked through the palace gates
"""

rare_word = 'ceremony'
frequent_word = 'the'
Out[14]:
Training Signal Analysis:
-------------------------------------------------------
Corpus: 28 words

Word: 'the' (frequency: 8)
  Skip-gram: 30 training pairs
  CBOW:      8 as center + 5.2 as context (fractional)

Word: 'king' (frequency: 1)
  Skip-gram: 3 training pairs
  CBOW:      1 as center + 1.0 as context (fractional)

Word: 'ceremony' (frequency: 1)
  Skip-gram: 4 training pairs
  CBOW:      1 as center + 1.0 as context (fractional)

Skip-gram provides proportionally more training signal per word occurrence. For a rare word like "ceremony" (appearing once), Skip-gram creates four training pairs, while CBOW creates only one (plus fractional context contributions). This difference explains why Skip-gram typically produces better representations for rare words.

Out[15]:
Training signal comparison between Skip-gram and CBOW for words of different frequencies. Skip-gram generates more training pairs overall, with the advantage being proportionally larger for rare words. CBOW's gradient division means rare words in context receive only fractional updates.

Training Speed

CBOW is faster to train than Skip-gram for two reasons:

  1. Fewer forward/backward passes: CBOW makes one prediction per position; Skip-gram makes $2m$ predictions
  2. Shared computation: The context averaging in CBOW reuses embeddings; Skip-gram computes independently
In[16]:
import time

def benchmark_training_step(vocab_size, embedding_dim, context_size, model='skipgram', iterations=1000):
    """Benchmark forward+backward pass for one training example."""
    W = np.random.randn(vocab_size, embedding_dim) * 0.01
    W_prime = np.random.randn(embedding_dim, vocab_size) * 0.01
    
    if model == 'skipgram':
        # Skip-gram: predict each context word separately
        center_idx = np.random.randint(vocab_size)
        context_indices = np.random.randint(vocab_size, size=context_size)
        
        start = time.time()
        for _ in range(iterations):
            h = W[center_idx]
            for ctx_idx in context_indices:
                z = W_prime.T @ h
                exp_z = np.exp(z - np.max(z))
                probs = exp_z / np.sum(exp_z)
                # Gradient computation
                dz = probs.copy()
                dz[ctx_idx] -= 1
                dW_prime = np.outer(h, dz)
                dh = W_prime @ dz
        elapsed = time.time() - start
        
    else:  # CBOW
        # CBOW: predict center word from averaged context
        context_indices = np.random.randint(vocab_size, size=context_size)
        center_idx = np.random.randint(vocab_size)
        
        start = time.time()
        for _ in range(iterations):
            h_bar = np.mean(W[context_indices], axis=0)
            z = W_prime.T @ h_bar
            exp_z = np.exp(z - np.max(z))
            probs = exp_z / np.sum(exp_z)
            # Gradient computation
            dz = probs.copy()
            dz[center_idx] -= 1
            dW_prime = np.outer(h_bar, dz)
            dh = W_prime @ dz
        elapsed = time.time() - start
    
    return elapsed / iterations * 1000  # ms per iteration

# Benchmark
vocab_size = 10000
embedding_dim = 100
context_size = 4

skipgram_time = benchmark_training_step(vocab_size, embedding_dim, context_size, 'skipgram')
cbow_time = benchmark_training_step(vocab_size, embedding_dim, context_size, 'cbow')
Out[17]:
Training Speed Comparison:
---------------------------------------------
Vocabulary size: 10,000
Embedding dimension: 100
Context size: 4

Skip-gram: 3.085 ms per position
CBOW:      0.832 ms per position
Speedup:   3.7x faster with CBOW

The benchmark confirms the theoretical speedup. CBOW processes each corpus position faster because it makes one prediction rather than multiple predictions. For corpora with billions of words, this speedup translates to hours or even days of saved training time.

Summary of Trade-offs

| Aspect | Skip-gram | CBOW |
|---|---|---|
| Training examples per position | $2m$ (one per context word) | 1 (all context words together) |
| Training speed | Slower | Faster (3-4x) |
| Rare words | Better (more training signal) | Worse (diluted in average) |
| Frequent words | Good | Better (averaging smooths noise) |
| Memory usage | Same | Same |
| Final embedding quality | Slightly better overall | Competitive, especially for frequent words |

The Forward Pass

Let's implement the complete CBOW forward pass. Given context words, we compute the probability distribution over vocabulary words:

In[18]:
class CBOW:
    """
    Continuous Bag of Words implementation.
    
    Uses full softmax for clarity. Production implementations
    use negative sampling or hierarchical softmax.
    """
    
    def __init__(self, vocab_size, embedding_dim):
        """Initialize embedding and context matrices."""
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        
        # Embedding matrix: each row is a word's embedding
        self.W = np.random.randn(vocab_size, embedding_dim) * 0.01
        
        # Context matrix: each column is a word's output representation
        self.W_prime = np.random.randn(embedding_dim, vocab_size) * 0.01
    
    def forward(self, context_indices):
        """
        Forward pass: context indices → probability distribution.
        
        Args:
            context_indices: List of word indices in the context
            
        Returns:
            h_bar: Averaged context embedding
            probs: Softmax probability distribution
        """
        # Step 1: Look up context embeddings and average
        context_embeddings = self.W[context_indices]
        h_bar = np.mean(context_embeddings, axis=0)
        
        # Step 2: Compute output scores (dot products with all vocabulary vectors)
        z = self.W_prime.T @ h_bar
        
        # Step 3: Apply softmax
        exp_z = np.exp(z - np.max(z))
        probs = exp_z / np.sum(exp_z)
        
        return h_bar, probs
    
    def predict(self, context_indices, top_k=5):
        """Predict most likely center words given context."""
        _, probs = self.forward(context_indices)
        top_indices = np.argsort(probs)[::-1][:top_k]
        return [(idx, probs[idx]) for idx in top_indices]

# Create model and test
model = CBOW(vocab_size=len(small_vocab), embedding_dim=4)

# Test prediction
context_indices = [word_to_idx[w] for w in ['quick', 'brown', 'jumps', 'over']]
predictions = model.predict(context_indices, top_k=3)
Out[19]:
CBOW Prediction (untrained model):
---------------------------------------------
Context: ['quick', 'brown', 'jumps', 'over']

Top predictions:
  the     : 0.1250
  fox     : 0.1250
  brown   : 0.1250

(Random predictions expected from untrained model)

The untrained model assigns roughly uniform probabilities across vocabulary words, reflecting its lack of learned associations. After training, the model will concentrate probability mass on the actual center word.

Gradient Derivation

With the forward pass defined, we now derive the gradients needed to train CBOW. Understanding these gradients reveals why CBOW behaves differently from Skip-gram during training, particularly the crucial insight that context words share the gradient signal.

Recall the loss for a single training example:

$$\mathcal{L} = -\log P(w_t \mid \text{context}) = -\mathbf{w}'_t \cdot \bar{\mathbf{h}} + \log \sum_{k=1}^{V} \exp(\mathbf{w}'_k \cdot \bar{\mathbf{h}})$$

Training requires computing how this loss changes with respect to each weight in the model. We'll work backward through the network, starting at the output and flowing gradients back to the input embeddings.

Gradient for the Context Matrix $\mathbf{W}'$

The context matrix $\mathbf{W}'$ directly produces the output scores. Each column $\mathbf{w}'_j$ represents word $j$ in the output layer. How should we adjust these vectors to reduce the loss?

Taking the derivative with respect to the output vector $\mathbf{w}'_j$ for any word $j$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}'_j} = -\mathbb{1}_{j=t} \cdot \bar{\mathbf{h}} + \frac{\exp(\mathbf{w}'_j \cdot \bar{\mathbf{h}})}{\sum_k \exp(\mathbf{w}'_k \cdot \bar{\mathbf{h}})} \cdot \bar{\mathbf{h}}$$

where $\mathbb{1}_{j=t}$ is the indicator function: 1 if $j$ is the target word, 0 otherwise.

The first term pushes the target word's context vector toward the averaged context $\bar{\mathbf{h}}$, since we want a high dot product for the correct answer. The second term pulls every word away from $\bar{\mathbf{h}}$ in proportion to its predicted probability, correcting the model wherever it is too confident about an incorrect word.

Recognizing that the second term is simply the softmax probability $P(w_j)$, we can simplify:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}'_j} = (P(w_j) - \mathbb{1}_{j=t}) \cdot \bar{\mathbf{h}}$$

This formula has an intuitive interpretation: the gradient is proportional to the "prediction error." For the target word, the error is $P(w_t) - 1$ (negative, so we push toward $\bar{\mathbf{h}}$). For all other words, the error is $P(w_j) - 0 = P(w_j)$ (positive, so we push away from $\bar{\mathbf{h}}$).

In matrix form, we can write this compactly as:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}'} = \bar{\mathbf{h}} \otimes (\mathbf{p} - \mathbf{y})$$

where:

  • $\otimes$: the outer product operation
  • $\mathbf{p}$: the softmax output probability vector (dimension $V$)
  • $\mathbf{y}$: the one-hot target vector (1 at position $t$, 0 elsewhere)
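
As a sanity check, the analytic gradient can be compared against a finite-difference estimate. This is a minimal sketch reusing the small-vocabulary W, W_prime, context, and cbow_forward objects defined earlier; it is for verification only, not part of the training loop.

def loss_for(W_prime_candidate):
    """Cross-entropy loss of predicting 'fox' from the current context."""
    _, _, p = cbow_forward(context, W, W_prime_candidate, word_to_idx)
    return -np.log(p[word_to_idx['fox']])

h_bar, _, probs = cbow_forward(context, W, W_prime, word_to_idx)
y = np.zeros(len(small_vocab))
y[word_to_idx['fox']] = 1

# Analytic gradient: outer product of h_bar and (p - y), shape (d, V)
analytic = np.outer(h_bar, probs - y)

# Central-difference estimate of the same gradient
numeric = np.zeros_like(W_prime)
eps = 1e-5
for d in range(W_prime.shape[0]):
    for j in range(W_prime.shape[1]):
        W_plus, W_minus = W_prime.copy(), W_prime.copy()
        W_plus[d, j] += eps
        W_minus[d, j] -= eps
        numeric[d, j] = (loss_for(W_plus) - loss_for(W_minus)) / (2 * eps)

# Should print True if the analytic formula matches the numerical estimate
print(np.allclose(analytic, numeric, atol=1e-6))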

Gradient for the Embedding Matrix $\mathbf{W}$

The gradient must flow backward through the averaging operation to reach the input embeddings. This is where CBOW fundamentally differs from Skip-gram.

First, we compute the gradient with respect to the averaged representation $\bar{\mathbf{h}}$. This requires multiplying by the context matrix, which maps from the hidden dimension back to the vocabulary:

$$\frac{\partial \mathcal{L}}{\partial \bar{\mathbf{h}}} = \mathbf{W}' (\mathbf{p} - \mathbf{y})$$

This gradient tells us how the loss would change if we nudged $\bar{\mathbf{h}}$ in any direction. But $\bar{\mathbf{h}}$ is not a learnable parameter. It's computed as the average of the context word embeddings, so we must distribute this gradient to the actual parameters: the embedding vectors $\mathbf{w}_{c_1}, \mathbf{w}_{c_2}, \ldots, \mathbf{w}_{c_C}$.

Here's the critical insight: since $\bar{\mathbf{h}} = \frac{1}{C} \sum_{i=1}^{C} \mathbf{w}_{c_i}$, each context word contributes equally to the average. By the chain rule, the gradient splits equally among them:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_{c_i}} = \frac{1}{C} \frac{\partial \mathcal{L}}{\partial \bar{\mathbf{h}}}$$

Each context word embedding receives exactly $1/C$ of the gradient signal. This division has important implications:

  1. Diluted learning for rare words: A rare word appearing in a context of four words receives only 25% of the gradient that a center word would receive in Skip-gram.

  2. Smoothed updates: The averaging acts as implicit regularization, preventing any single context word from dominating the update.

  3. Faster training: Despite the dilution, CBOW processes each corpus position with a single forward-backward pass, making it faster overall.
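
To make this division concrete, the sketch below (again reusing the toy W, W_prime, context, and cbow_forward from earlier) computes the gradient with respect to the averaged representation and shows the identical 1/C share each context embedding receives.

C = len(context)
h_bar, _, probs = cbow_forward(context, W, W_prime, word_to_idx)
y = np.zeros(len(small_vocab))
y[word_to_idx['fox']] = 1

# Gradient with respect to the averaged representation h_bar
dh_bar = W_prime @ (probs - y)

# Chain rule through the mean: each context word's embedding gets 1/C of dh_bar
for word in context:
    grad = dh_bar / C
    print(f"{word:>6}: gradient norm = {np.linalg.norm(grad):.4f} "
          f"(1/{C} of {np.linalg.norm(dh_bar):.4f})")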

Out[20]:
Gradient flow in CBOW. The output gradient (based on prediction error) flows back through the context matrix W' to the averaged representation. The gradient then divides equally among all context words, each receiving 1/C of the total gradient. This division explains why rare words appearing in context receive weaker training signal in CBOW compared to Skip-gram.

Complete Implementation with Training

The mathematical framework is now complete: we know how to compute predictions (forward pass) and how to compute gradients for learning (backward pass). Let's bring these pieces together into a working implementation.

The implementation follows the gradient equations exactly. In the backward pass, notice how the gradient first flows to $\bar{\mathbf{h}}$ through the context matrix, then divides by $C$ before reaching each context word embedding. This is the $1/C$ gradient division we derived earlier.

In[21]:
class CBOW:
    """
    Complete CBOW implementation with training.
    """
    
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        
        # Initialize weights with small random values
        self.W = np.random.randn(vocab_size, embedding_dim) * 0.01
        self.W_prime = np.random.randn(embedding_dim, vocab_size) * 0.01
    
    def forward(self, context_indices):
        """Forward pass: context → probabilities."""
        # Average context embeddings
        context_embeddings = self.W[context_indices]
        h_bar = np.mean(context_embeddings, axis=0)
        
        # Compute output scores and softmax
        z = self.W_prime.T @ h_bar
        exp_z = np.exp(z - np.max(z))
        probs = exp_z / np.sum(exp_z)
        
        return h_bar, probs
    
    def compute_loss(self, context_indices, target_idx):
        """Compute cross-entropy loss for one training example."""
        h_bar, probs = self.forward(context_indices)
        loss = -np.log(probs[target_idx] + 1e-10)
        return loss, h_bar, probs
    
    def backward(self, context_indices, target_idx, h_bar, probs, learning_rate=0.01):
        """Backward pass: compute gradients and update weights."""
        # Gradient of loss w.r.t. output scores: (p - y)
        dz = probs.copy()
        dz[target_idx] -= 1
        
        # Gradient for W': outer product of h_bar and dz
        dW_prime = np.outer(h_bar, dz)
        
        # Gradient for h_bar: W' @ dz
        dh_bar = self.W_prime @ dz
        
        # Gradient for each context word embedding: 1/C of dh_bar
        C = len(context_indices)
        dW_context = dh_bar / C
        
        # Update weights
        self.W_prime -= learning_rate * dW_prime
        for idx in context_indices:
            self.W[idx] -= learning_rate * dW_context
    
    def train_example(self, context_indices, target_idx, learning_rate=0.01):
        """Train on a single example."""
        loss, h_bar, probs = self.compute_loss(context_indices, target_idx)
        self.backward(context_indices, target_idx, h_bar, probs, learning_rate)
        return loss
    
    def get_embedding(self, word_idx):
        """Get embedding for a word."""
        return self.W[word_idx]
    
    def most_similar(self, word_idx, top_n=5):
        """Find most similar words by cosine similarity."""
        word_vec = self.W[word_idx]
        similarities = []
        
        for i in range(self.vocab_size):
            if i != word_idx:
                other_vec = self.W[i]
                cos_sim = np.dot(word_vec, other_vec) / (
                    np.linalg.norm(word_vec) * np.linalg.norm(other_vec) + 1e-10
                )
                similarities.append((i, cos_sim))
        
        return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_n]

Training on a Toy Corpus

In[22]:
# Training corpus with semantic groups
training_corpus = """
king queen prince princess royal throne crown palace
man woman boy girl child adult person human
cat dog pet animal fur paw tail whisker
happy sad angry joyful emotion feeling mood cheerful
run walk jump sprint move fast slow quick
"""

# Build vocabulary
words = training_corpus.lower().split()
vocab = sorted(set(words))
word_to_idx = {w: i for i, w in enumerate(vocab)}
idx_to_word = {i: w for w, i in word_to_idx.items()}

def generate_cbow_examples(words, window_size=2):
    """Generate CBOW training examples: (context_indices, target_idx)."""
    examples = []
    for i, target in enumerate(words):
        context_indices = []
        for j in range(max(0, i - window_size), min(len(words), i + window_size + 1)):
            if j != i:
                context_indices.append(word_to_idx[words[j]])
        if context_indices:  # Only add if we have context
            examples.append((context_indices, word_to_idx[target]))
    return examples

# Generate training data
training_examples = generate_cbow_examples(words, window_size=2)
In[23]:
# Train the model
np.random.seed(42)
model = CBOW(vocab_size=len(vocab), embedding_dim=20)

epochs = 100
losses = []

for epoch in range(epochs):
    epoch_loss = 0
    np.random.shuffle(training_examples)
    
    for context_indices, target_idx in training_examples:
        loss = model.train_example(context_indices, target_idx, learning_rate=0.1)
        epoch_loss += loss
    
    avg_loss = epoch_loss / len(training_examples)
    losses.append(avg_loss)
Out[24]:
CBOW Training Complete:
---------------------------------------------
Vocabulary size: 40
Embedding dimension: 20
Training examples: 40
Epochs: 100

Initial loss: 3.6888
Final loss: 0.4261
Loss reduction: 88.4%

The loss reduction indicates the model has learned to predict center words from their context. The initial loss reflects random guessing across the vocabulary, while the final loss shows the model assigns higher probability to actual center words.

Out[25]:
Probability distribution for predicting 'queen' from context ['king', 'prince', 'throne', 'crown'] before and after training. The untrained model assigns near-uniform probabilities across the vocabulary. After training, the model concentrates probability mass on semantically related words like 'queen', 'princess', and 'royal'.
Out[26]:
CBOW training loss over 100 epochs. The loss decreases rapidly in early epochs as the model learns basic word associations, then converges to a stable value. The final loss indicates how well the model predicts center words from their context.

Examining Learned Embeddings

In[27]:
# Find similar words
test_words = ['king', 'man', 'cat', 'happy', 'run']
similarity_results = {}

for word in test_words:
    if word in word_to_idx:
        similar = model.most_similar(word_to_idx[word], top_n=5)
        similarity_results[word] = [(idx_to_word[idx], sim) for idx, sim in similar]
Out[28]:
CBOW Learned Word Similarities:
--------------------------------------------------

Most similar to 'king':
  princess    : +0.869 █████████████████
  prince      : +0.592 ███████████
  queen       : +0.377 ███████
  royal       : +0.271 █████
  sad         : +0.199 ███

Most similar to 'man':
  throne      : +0.711 ██████████████
  girl        : +0.625 ████████████
  palace      : +0.604 ████████████
  woman       : +0.579 ███████████
  crown       : +0.383 ███████

Most similar to 'cat':
  human       : +0.628 ████████████
  animal      : +0.616 ████████████
  adult       : +0.604 ████████████
  dog         : +0.589 ███████████
  person      : +0.404 ████████

Most similar to 'happy':
  paw         : +0.617 ████████████
  joyful      : +0.602 ████████████
  sad         : +0.592 ███████████
  whisker     : +0.556 ███████████
  tail        : +0.363 ███████

Most similar to 'run':
  sprint      : +0.648 ████████████
  walk        : +0.630 ████████████
  cheerful    : +0.618 ████████████
  feeling     : +0.572 ███████████
  jump        : +0.442 ████████

The model has learned that words appearing in similar contexts have similar embeddings. Words from the same semantic category cluster together in embedding space, demonstrating that CBOW successfully captures distributional semantics even on this small corpus.

Out[29]:
Cosine similarity heatmap for selected words from each semantic category. Darker colors indicate higher similarity. The block-diagonal structure reveals that words within the same category (royalty, people, animals, emotions, movement) are more similar to each other than to words from different categories.
Out[30]:
2D PCA projection of CBOW embeddings. Words are colored by semantic category. Despite the small training corpus, semantic clusters are visible, with royalty terms, animal terms, emotion terms, and movement terms forming distinct groups.

When to Choose CBOW

Given the trade-offs between CBOW and Skip-gram, when should you reach for CBOW? The model excels in specific scenarios:

Large corpora with common words: When you have billions of words and care most about representing frequent vocabulary, CBOW's faster training and averaging-based smoothing are advantageous.

Time-constrained training: If computational resources are limited, CBOW's 3-4x speedup over Skip-gram can make training feasible where it otherwise wouldn't be.

Syntactic tasks: Some research suggests CBOW performs slightly better on syntactic analogy tasks (e.g., "big : bigger :: small : ?"), possibly because averaging captures grammatical patterns effectively.

Downstream averaging: If your application averages word embeddings to create sentence or document representations, CBOW's training objective aligns naturally with this use case.
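
For example, a crude sentence vector can be built by averaging the learned input embeddings, mirroring what CBOW already does internally. This is a small sketch using the toy model and word_to_idx trained above; sentence_embedding is an illustrative helper, and out-of-vocabulary words are simply skipped.

def sentence_embedding(sentence, cbow_model, vocab_map):
    """Average the learned input embeddings of the in-vocabulary words."""
    indices = [vocab_map[w] for w in sentence.lower().split() if w in vocab_map]
    if not indices:
        return np.zeros(cbow_model.embedding_dim)
    return np.mean(cbow_model.W[indices], axis=0)

# 'the', 'and', 'sat', 'on' are not in the toy vocabulary and are skipped
vec = sentence_embedding("the king and queen sat on the throne", model, word_to_idx)
print(vec.shape)  # (20,) for the toy model trained above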

Limitations of Context Averaging

While CBOW's averaging approach enables faster training and smoother representations, the bag-of-words assumption introduces inherent limitations that affect what the model can learn:

Order insensitivity: "dog bites man" and "man bites dog" produce identical averaged context representations, even though they mean very different things.

Dilution of rare context words: A rare but informative context word contributes only $1/C$ of the gradient, potentially getting drowned out by more common words in the same context window.

Position blindness: A word immediately adjacent to the target is treated the same as a word at the edge of the window. Some extensions weight by distance to address this.

In[31]:
# Demonstrate order insensitivity using words from the training vocabulary
context_a = ['king', 'queen', 'prince']
context_b = ['prince', 'queen', 'king']  # same words, different order

# Averaging ignores order, so both contexts produce identical representations
emb_a = average_context_embeddings(context_a, model.W, word_to_idx)
emb_b = average_context_embeddings(context_b, model.W, word_to_idx)
Out[32]:
Order Insensitivity Demonstration:
--------------------------------------------------
Context A: ['king', 'queen', 'prince']
Context B: ['prince', 'queen', 'king'] (same words, different order)

Averaged embedding A: [-0.360, +0.396, ...]
Averaged embedding B: [-0.360, +0.396, ...]

Are they identical? True

CBOW treats these as identical contexts!

Position-Weighted Variants

The position blindness limitation motivates an extension to standard CBOW. Some implementations weight context words by their distance from the target, giving nearby words more influence on the final representation:

$$\bar{\mathbf{h}} = \frac{\sum_{i=1}^{C} \alpha_i \cdot \mathbf{w}_{c_i}}{\sum_{i=1}^{C} \alpha_i}$$

where:

  • $\alpha_i$: the weight for the $i$-th context word (decreases with distance from center)
  • $\mathbf{w}_{c_i}$: the embedding vector of the $i$-th context word
  • $C$: the number of context words

Common weighting schemes include:

  • Linear decay: $\alpha_i = 1 - \frac{|d_i|}{m+1}$, where $d_i$ is the position offset from the center word and $m$ is the window size
  • Inverse distance: $\alpha_i = \frac{1}{|d_i|}$, where $d_i$ is the position offset
  • Exponential decay: $\alpha_i = \exp(-\lambda |d_i|)$, where $\lambda$ controls the decay rate
Out[33]:
Context word weighting schemes by distance from center word. Uniform weighting (used in standard CBOW) gives equal importance to all context words. Alternative schemes give higher weights to closer words, potentially capturing more relevant contextual information.

Implementing Position Weighting

Let's implement a weighted averaging function that supports different distance-based weighting schemes:

In[34]:
def weighted_average_embeddings(context_words, offsets, W, word_to_idx, scheme='linear'):
    """
    Compute weighted average of context embeddings based on distance.
    
    Args:
        context_words: List of context word strings
        offsets: List of position offsets from center word
        W: Embedding matrix
        word_to_idx: Vocabulary mapping
        scheme: 'uniform', 'linear', or 'inverse'
    
    Returns:
        Weighted average embedding vector
    """
    embeddings = np.array([W[word_to_idx[word]] for word in context_words])
    
    if scheme == 'uniform':
        weights = np.ones(len(offsets))
    elif scheme == 'linear':
        max_offset = max(abs(o) for o in offsets)
        weights = np.array([1 - abs(o) / (max_offset + 1) for o in offsets])
    elif scheme == 'inverse':
        weights = np.array([1 / abs(o) for o in offsets])
    else:
        raise ValueError(f"Unknown weighting scheme: {scheme}")
    
    # Weighted average
    weighted_sum = np.sum(embeddings * weights[:, np.newaxis], axis=0)
    return weighted_sum / np.sum(weights)

# Example with different weighting schemes (using words from training corpus)
context = ['king', 'queen', 'prince', 'princess']  # Words that exist in vocabulary
offsets = [-2, -1, 1, 2]
Out[35]:
Weighted Averaging Comparison:
-------------------------------------------------------
Context: ['king', 'queen', 'prince', 'princess']
Offsets: [-2, -1, 1, 2]

uniform   : [-0.069, +0.295, +0.113, ...]
linear    : [-0.294, +0.389, +0.221, ...]
inverse   : [-0.294, +0.389, +0.221, ...]

The distance-weighted schemes produce a different averaged embedding from uniform weighting: the words at positions -1 and +1 (closest to the center) contribute more to the final representation than those at -2 and +2. For this symmetric window, linear decay and inverse distance happen to normalize to the same weights, which is why their results coincide above. Distance weighting can improve embedding quality when nearby context words carry more semantic relevance than distant ones.

The Softmax Bottleneck (Again)

Like Skip-gram, CBOW faces the softmax computational bottleneck. Each training step requires computing:

$$P(w_t \mid \text{context}) = \frac{\exp(\mathbf{w}'_t \cdot \bar{\mathbf{h}})}{\sum_{k=1}^{V} \exp(\mathbf{w}'_k \cdot \bar{\mathbf{h}})}$$

The denominator sums over all $V$ vocabulary words. For a vocabulary of 100,000 words, this means computing 100,000 dot products and exponentials per training step, a significant computational burden. The same approximation techniques that accelerate Skip-gram also apply to CBOW:

  • Negative sampling: Sample a small number of negative examples instead of computing the full softmax
  • Hierarchical softmax: Use a binary tree structure to reduce complexity from $O(V)$ to $O(\log V)$

We'll cover these techniques in detail in subsequent chapters.
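
To get a rough sense of the cost, the sketch below times one full-softmax prediction for a hypothetical vocabulary of 100,000 words (the sizes are illustrative and unrelated to the toy models above):

import time
import numpy as np

V, d = 100_000, 100                        # illustrative sizes only
rng = np.random.default_rng(0)
W_prime_big = rng.standard_normal((d, V)) * 0.01
h_bar_big = rng.standard_normal(d)

start = time.time()
z = W_prime_big.T @ h_bar_big              # V dot products
z -= z.max()                               # numerical stability
probs_big = np.exp(z) / np.exp(z).sum()    # V exponentials plus normalization
elapsed_ms = (time.time() - start) * 1000

print(f"Full softmax over V = {V:,} words: {elapsed_ms:.2f} ms per training step")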

Key Takeaways

CBOW and Skip-gram are complementary approaches to learning word embeddings from context:

  • Architectural difference: CBOW averages context embeddings to predict the center word; Skip-gram uses the center word to predict each context word
  • Training dynamics: CBOW creates one training example per position, Skip-gram creates $2m$ examples. CBOW is faster but provides less training signal for rare words
  • Context averaging: The "bag of words" assumption treats context as an unordered set, losing word order information but capturing overall semantic context
  • Gradient distribution: Each context word receives $1/C$ of the gradient, diluting the signal for any single word. This explains why rare words benefit more from Skip-gram
  • Practical trade-offs: Choose CBOW when training speed matters and vocabulary is dominated by frequent words. Choose Skip-gram when rare word quality matters

The next chapter explores negative sampling, which accelerates both CBOW and Skip-gram by replacing the expensive softmax with a simpler binary classification objective.

Key Parameters

When training CBOW models, several hyperparameters impact embedding quality:

embedding_dim (typical range: 50-300): The dimensionality of word vectors. Lower values (50-100) provide faster training and smaller memory footprint, sufficient for many tasks. Higher values (200-300) capture more nuanced relationships but require more data. A common choice is 100-200 for most applications.

window_size (typical range: 2-10): Number of context words on each side of the center word. Small windows (2-3) emphasize syntactic relationships, while large windows (5-10) capture broader topical similarity. A common choice is 5 for balanced representations.

min_count (typical range: 1-100): Minimum word frequency to include in vocabulary. Lower values include rare words but with potentially unreliable embeddings. Higher values produce more robust embeddings for included words. A common choice is 5-10 for large corpora.

learning_rate (typical range: 0.01-0.1): Step size for gradient descent updates. Higher values enable faster convergence but may overshoot. Lower values are more stable but slower. A common choice is 0.025-0.05 with linear decay.

epochs (typical range: 1-20): Number of passes through the training corpus. Fewer epochs mean faster training but may underfit. More epochs provide better convergence, with diminishing returns after 5-10. A common choice is 5 epochs for large corpora.

negative_samples (when using negative sampling, typical range: 5-20): Number of negative examples per positive example. Fewer negatives (5) enable faster training. More negatives (15-20) provide better discrimination. A common choice is 5-10 for large corpora.
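
In practice, these hyperparameters are usually set through an off-the-shelf implementation rather than from-scratch code. As a minimal sketch, assuming the gensim library (version 4.x) is installed, CBOW is selected in its Word2Vec API with sg=0:

from gensim.models import Word2Vec

# Tokenized corpus: a list of sentences, each a list of lowercase tokens
sentences = [
    ["the", "king", "sat", "on", "the", "throne"],
    ["the", "queen", "wore", "the", "crown"],
]

cbow = Word2Vec(
    sentences,
    sg=0,             # 0 = CBOW, 1 = Skip-gram
    vector_size=100,  # embedding_dim
    window=5,         # window_size
    min_count=1,      # keep every word in this tiny example
    negative=5,       # negative samples (covered in the next chapter)
    alpha=0.025,      # initial learning rate
    epochs=5,
)

print(cbow.wv["king"].shape)  # (100,)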

Summary

CBOW learns word embeddings by predicting center words from their surrounding context. The model averages context word embeddings into a single representation, then uses this averaged vector to predict the center word via softmax. This averaging operation, the defining feature of CBOW, treats context as an unordered "bag of words," ignoring position information but capturing overall semantic context effectively.

Compared to Skip-gram, CBOW trains 3-4x faster due to making a single prediction per position rather than one per context word. However, the gradient division inherent in averaging means each context word receives only a fraction of the training signal, which can weaken representations for rare words. CBOW tends to perform better on frequent words and syntactic tasks, while Skip-gram excels at rare word representation. The choice between them depends on your corpus characteristics and downstream application requirements.

Both CBOW and Skip-gram share a common computational challenge: the softmax normalization requires summing over the entire vocabulary for each training step. The next chapter introduces negative sampling, an approximation technique that dramatically reduces this cost while maintaining embedding quality.

