Masked Language Modeling: Bidirectional Understanding in BERT

Michael Brenndoerfer · Updated July 10, 2025 · 31 min read

Learn how masked language modeling enables bidirectional context understanding. Covers the MLM objective, 15% masking rate, 80-10-10 strategy, training dynamics, and the pretrain-finetune paradigm.

Masked Language Modeling

What if a model could see the future? Causal language modeling enforces a strict left-to-right constraint: each prediction depends only on preceding tokens. But natural language understanding often requires context from both directions. The word "bank" in "I deposited money at the bank" means something different than in "I sat by the river bank." Resolving such ambiguities requires seeing the full sentence.

Masked Language Modeling (MLM) removes the unidirectional constraint by hiding random tokens and asking the model to reconstruct them from surrounding context. This bidirectional approach, introduced with BERT in 2018, produces representations that capture meaning more effectively than left-to-right models for many understanding tasks. The trade-off is that MLM models cannot generate text autoregressively, making them specialists in comprehension rather than production.

In this chapter, we'll explore the MLM objective, understand the masking strategies that make it work, implement the training procedure, and examine when bidirectional context matters most.

The Bidirectional Advantage

The core insight behind MLM is that understanding a word often requires seeing what comes after it, not just what came before. Consider the sentence:

The scientist studied the cell under a microscope.

When predicting "cell," a left-to-right model sees only "The scientist studied the." This provides some signal, but "cell" could still mean a prison cell, a biological cell, or a spreadsheet cell. The word "microscope" appearing later disambiguates completely, pointing to the biological meaning.

MLM allows the model to use this future context. By masking "cell" and asking the model to predict it, we force the model to integrate information from both "scientist studied" and "under a microscope" to make the prediction. The result is representations that encode richer semantic relationships.

Out[3]:
Visualization
Diagram showing tokens with arrows pointing right toward the masked position.
Causal LM can only use preceding tokens when predicting 'cell'.
Diagram showing tokens with arrows pointing from both sides toward the masked position.
Masked LM uses full bidirectional context, including 'under a microscope'.

This bidirectional context is the key advantage of MLM. For classification, entailment, question answering, and other understanding tasks, seeing the full context produces better representations than the partial view available to autoregressive models.
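As a quick, optional illustration, the snippet below queries a pretrained BERT model through the Hugging Face transformers fill-mask pipeline. This is a sketch that assumes the transformers library is installed and the bert-base-uncased weights can be downloaded; it is separate from the chapter's own training code, and the example sentences are placeholders. The point is that the context to the right of the mask pushes the prediction toward either the financial or the riverside sense of "bank."

Code
# Sketch: bidirectional disambiguation with a pretrained masked LM.
# Assumes `pip install transformers` and network access for bert-base-uncased.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The context to the *right* of [MASK] differs between the two sentences,
# steering the model toward different completions.
for sentence in [
    "I deposited money at the [MASK] to earn interest.",
    "I sat by the river [MASK] and watched the water.",
]:
    print(sentence)
    for candidate in unmasker(sentence)[:3]:
        print(f"  {candidate['token_str']!r}: {candidate['score']:.3f}")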

The MLM Objective

How do we translate the intuition of "hide and predict" into a training objective? The answer involves three connected ideas: selecting which tokens to hide, defining what the model should predict, and measuring how well it succeeds. Let's build up the formalism step by step.

Masked Language Modeling

A pretraining objective where a fraction of input tokens are replaced with a special [MASK] token, and the model learns to predict the original tokens from the surrounding bidirectional context. Unlike causal LM, the model sees both left and right context when making predictions.

From Intuition to Formalism

Consider a sentence like "The cat sat on the mat." We want to train a model that can recover hidden words from context. The training procedure works as follows:

  1. Start with a complete sequence: We have $x = (x_1, x_2, \ldots, x_n)$, a sequence of $n$ tokens
  2. Select positions to mask: We randomly choose a subset of positions $\mathcal{M} \subset \{1, \ldots, n\}$
  3. Corrupt the input: We create $\tilde{x}$ by replacing tokens at masked positions with [MASK]
  4. Predict the originals: The model must recover the original tokens at masked positions using the remaining context

The key insight is that step 4 requires the model to understand language deeply. To predict a masked word, the model must integrate syntactic constraints (what part of speech fits here?), semantic relationships (what meaning makes sense?), and world knowledge (what's plausible in this context?).
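To make the four steps concrete before formalizing them, here is a tiny schematic in plain Python. The word-level tokens and the particular masked positions are illustrative only; real MLM training operates on subword IDs and samples the positions randomly.

Code
# Schematic of steps 1-4 on a toy word-level sequence (illustrative only).
x = ["The", "cat", "sat", "on", "the", "mat"]        # step 1: complete sequence
masked_positions = {1, 4}                            # step 2: M = {1, 4}

# Step 3: corrupt the input by replacing tokens at masked positions with [MASK]
x_tilde = [
    "[MASK]" if i in masked_positions else tok for i, tok in enumerate(x)
]
print(x_tilde)   # ['The', '[MASK]', 'sat', 'on', '[MASK]', 'mat']

# Step 4: the model's job is to recover the originals at those positions
targets = {i: x[i] for i in masked_positions}
print(targets)   # {1: 'cat', 4: 'the'}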

The Loss Function

We formalize "predict the originals" as maximizing the probability the model assigns to the correct tokens. Given the corrupted sequence $\tilde{x}$, for each masked position $i \in \mathcal{M}$, we want:

$$P_\theta(x_i \mid \tilde{x}) \to 1$$

where $P_\theta(x_i \mid \tilde{x})$ is the probability the model with parameters $\theta$ assigns to the original token $x_i$, conditioned on seeing the entire corrupted sequence $\tilde{x}$. Note the conditioning: the model sees all of $\tilde{x}$, including tokens both before and after position $i$. This is the bidirectional context that distinguishes MLM from causal LM.

To combine predictions across all masked positions into a single training signal, we sum their log-probabilities:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P_\theta(x_i \mid \tilde{x})$$

where:

  • $\mathcal{L}_{\text{MLM}}$: the masked language modeling loss we want to minimize
  • $\mathcal{M}$: the set of masked position indices (typically $|\mathcal{M}| \approx 0.15n$, about 15% of positions)
  • $x_i$: the original token at position $i$ that we want to recover
  • $\tilde{x}$: the corrupted input sequence where tokens at positions in $\mathcal{M}$ have been replaced
  • $P_\theta(x_i \mid \tilde{x})$: the probability the model assigns to the correct token, given bidirectional context
  • $\log P_\theta(x_i \mid \tilde{x})$: the log-probability, which is negative since probabilities lie in $(0, 1)$
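The formula maps directly onto code. The sketch below, using hypothetical logits and token IDs, computes the sum of negative log-probabilities at the masked positions by hand and checks that it matches what F.cross_entropy produces when non-masked positions are labeled -100 and ignored, the same convention the chapter's masking code uses later.

Code
# Sketch: the MLM loss formula versus cross_entropy with ignore_index=-100.
# Logits and token IDs here are random placeholders, not real model outputs.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 10, 6
logits = torch.randn(seq_len, vocab_size)        # hypothetical model outputs
original = torch.tensor([3, 7, 1, 4, 4, 9])      # x_i: original tokens
masked_positions = [1, 4]                        # M: the corrupted positions

# Direct translation of the formula: -sum over i in M of log P(x_i | x_tilde)
log_probs = F.log_softmax(logits, dim=-1)
loss_manual = -sum(log_probs[i, original[i]] for i in masked_positions)

# Same quantity via cross_entropy: labels are -100 everywhere except M
labels = torch.full((seq_len,), -100)
labels[masked_positions] = original[masked_positions]
loss_ce = F.cross_entropy(logits, labels, ignore_index=-100, reduction="sum")

print(loss_manual.item(), loss_ce.item())        # equal up to floating-point error

With the default reduction="mean", as in the training loop later in the chapter, the loss is instead averaged over the masked positions, which only rescales the gradient.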

Why Logarithms? Why Negative?

The formula uses logarithms for two reasons. First, products of probabilities become sums of log-probabilities, which are more numerically stable. Second, the logarithm creates a useful asymmetry in the loss signal.

When the model is confident and correct ($P \approx 1$), we have $\log(1) = 0$, contributing zero loss. When the model is uncertain ($P \approx 0.5$), we have $\log(0.5) \approx -0.69$, contributing moderate loss. When the model is wrong ($P \approx 0.01$), we have $\log(0.01) \approx -4.6$, contributing large loss.

Out[4]:
Visualization
Line plot showing loss increasing exponentially as probability approaches zero.
Negative log-likelihood loss as a function of predicted probability. When the model assigns high probability to the correct token, loss is near zero. As probability decreases, loss increases sharply, creating strong gradients for incorrect predictions.

The negative sign in front of the sum flips these negative log-probabilities into positive loss values. Minimizing this loss pushes the model toward assigning high probability to the correct tokens.

What Makes MLM Different from CLM

The summation in the MLM loss iterates only over masked positions $i \in \mathcal{M}$, not all positions. This is fundamentally different from causal LM, where every position contributes to the loss. In CLM, predicting position 5 uses only positions 1-4. In MLM, predicting position 5 uses positions 1-4 and positions 6 onward, but only if position 5 is masked.

This trade-off has practical consequences: MLM is less sample-efficient per token (only 15% of positions contribute gradients), but each prediction benefits from richer context. The bidirectional signal compensates for the sparsity, producing representations that excel at understanding tasks.

The 15% Masking Rate

The original BERT paper established 15% as the masking rate: for each training example, approximately 15% of tokens are selected for prediction. This choice balances two competing concerns.

Masking too few tokens wastes compute. If only 1% of tokens are masked, 99% of the forward pass contributes nothing to the loss. The model processes the full sequence but learns from almost none of it.

Masking too many tokens destroys context. If 50% of tokens are masked, the model must predict half the sequence from the other half. With so much information missing, predictions become guesses rather than informed inferences.

The 15% rate emerged from empirical tuning. It provides enough masked tokens to learn efficiently while preserving enough context for accurate predictions. Later work has explored dynamic masking rates, but 15% remains the default for most MLM training.

Out[5]:
Visualization
Line plot showing U-shaped curve with minimum around 15% masking rate.
Trade-off between masking rate and learning efficiency. Lower rates waste compute on unmasked tokens, while higher rates destroy too much context, making predictions unreliable. The 15% rate (dashed line) balances these concerns.

The 80-10-10 Masking Strategy

Simply replacing all selected tokens with [MASK] creates a mismatch between training and inference. During training, the model sees [MASK] tokens everywhere. During fine-tuning and inference, it never sees them. This discrepancy can hurt transfer performance.

BERT addresses this with the 80-10-10 rule. For tokens selected for prediction:

  • 80% are replaced with [MASK]
  • 10% are replaced with a random token
  • 10% are kept unchanged
Out[6]:
Visualization
Horizontal bar chart showing 80% MASK, 10% random, and 10% unchanged.
The 80-10-10 masking strategy distributes selected tokens across three replacement types. The majority receive [MASK], but 20% use alternative strategies to reduce the train-inference mismatch.

All three cases contribute to the loss: the model must predict the original token regardless of what replacement strategy was applied.

In[7]:
Code
import torch


def apply_mlm_masking(token_ids, vocab_size, mask_token_id, mask_prob=0.15):
    """
    Apply BERT-style MLM masking with 80-10-10 strategy.

    Args:
        token_ids: Original token IDs (batch_size, seq_len)
        vocab_size: Size of vocabulary for random replacement
        mask_token_id: ID of the [MASK] token
        mask_prob: Fraction of tokens to mask (default: 15%)

    Returns:
        masked_ids: Token IDs with masking applied
        labels: Original token IDs at masked positions, -100 elsewhere
    """
    labels = token_ids.clone()
    masked_ids = token_ids.clone()

    # Create probability matrix for masking
    probability_matrix = torch.full(token_ids.shape, mask_prob)

    # Sample which tokens to mask
    masked_indices = torch.bernoulli(probability_matrix).bool()

    # Labels are -100 for non-masked tokens (ignored in loss)
    labels[~masked_indices] = -100

    # 80% of masked tokens -> [MASK]
    indices_replaced = (
        torch.bernoulli(torch.full(token_ids.shape, 0.8)).bool()
        & masked_indices
    )
    masked_ids[indices_replaced] = mask_token_id

    # 10% of masked tokens -> random token
    indices_random = (
        torch.bernoulli(torch.full(token_ids.shape, 0.5)).bool()
        & masked_indices
        & ~indices_replaced
    )
    random_tokens = torch.randint(
        vocab_size, token_ids.shape, dtype=token_ids.dtype
    )
    masked_ids[indices_random] = random_tokens[indices_random]

    # Remaining 10% stay unchanged (but still contribute to loss)
    return masked_ids, labels

Let's see this masking function in action with a sample sequence:

In[8]:
Code
# Demonstrate the masking on a sample sequence
torch.manual_seed(42)
demo_vocab_size = 30522  # BERT vocab size
demo_mask_token_id = 103  # [MASK] in BERT

# Example sentence (token IDs representing: [CLS] This is a test sentence [SEP])
original = torch.tensor([[101, 2023, 2003, 1037, 3231, 6251, 102]])
masked, labels = apply_mlm_masking(
    original, demo_vocab_size, demo_mask_token_id, mask_prob=0.5
)
Out[9]:
Console
Original tokens:  [101, 2023, 2003, 1037, 3231, 6251, 102]
Masked tokens:    [101, 2023, 103, 1037, 103, 6251, 103]
Labels:           [-100, -100, 2003, -100, 3231, -100, 102]

The output shows how masking transforms the input. Positions where the label is -100 are not masked and won't contribute to the loss. Positions with non-negative labels are masked positions where the model must predict the original token. In this particular sample, all three selected positions happened to receive the [MASK] token ID (103); across many samples, roughly 10% would instead receive a random token and 10% would remain unchanged, following the 80-10-10 strategy described above.

The 80-10-10 strategy forces the model to:

  1. Learn to use context to recover masked tokens (the 80% case)
  2. Learn robust representations even when input contains noise (the 10% random case)
  3. Learn that unchanged tokens might still need prediction (the 10% unchanged case)

This last point is subtle but important. By sometimes requiring predictions on unchanged tokens, the model cannot simply "copy" visible tokens to the output. It must genuinely understand context, even for tokens that appear unmodified.

Understanding vs. Generation

MLM and CLM produce fundamentally different models suited for different tasks. The choice between them defines the model's capabilities.

Causal LM excels at generation because it models the natural process of producing text token by token. Each prediction extends the sequence, and the model can generate indefinitely by sampling from its predictions. GPT, LLaMA, and most chatbots use CLM.

Masked LM excels at understanding because it captures relationships in both directions. For classification, the model can integrate information from the entire input before making a decision. For question answering, it can match question words with answer words regardless of their positions. BERT, RoBERTa, and most embedding models use MLM.

The table below summarizes the key differences:

Comparison of Causal Language Modeling and Masked Language Modeling approaches.

| Aspect         | Causal LM      | Masked LM          |
|----------------|----------------|--------------------|
| Context        | Left only      | Bidirectional      |
| Primary use    | Generation     | Understanding      |
| Loss positions | All positions  | Masked only (~15%) |
| Inference      | Autoregressive | Single pass        |
| Example models | GPT, LLaMA     | BERT, RoBERTa      |
Out[10]:
Visualization
Lower triangular attention matrix for causal language model.
Causal LM attention: Each position can only attend to itself and earlier positions, enforcing left-to-right information flow.
Full attention matrix for masked language model.
Masked LM attention: All positions can attend to all other positions, enabling bidirectional context integration.

Neither approach is universally better. They're different tools optimized for different jobs. In practice, many systems combine both: an MLM encoder for understanding input, and a CLM decoder for generating output.
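The two attention patterns shown above correspond to a one-line difference in code: causal LM applies an upper-triangular mask of negative infinity so each position can only attend backward, while masked LM applies no mask at all. The sketch below simply builds the two mask tensors for a 5-token sequence, following PyTorch's convention that -inf entries block attention.

Code
# Sketch: the attention masks behind CLM and MLM (PyTorch convention:
# additive mask, -inf blocks attention, 0 allows it).
import torch

seq_len = 5

# Causal LM: position i may attend only to positions <= i.
causal_mask = torch.triu(
    torch.full((seq_len, seq_len), float("-inf")), diagonal=1
)
print(causal_mask)

# Masked LM: no restriction; every position attends to every other position.
# Passing no mask to a TransformerEncoder (as TinyMLM does below) has this effect.
bidirectional_mask = torch.zeros(seq_len, seq_len)
print(bidirectional_mask)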

Implementing MLM Training

Let's implement a complete MLM training loop. We'll use a small transformer and train on sample text to see the dynamics in action.

In[11]:
Code
import torch.nn as nn
import torch.nn.functional as F


class MLMHead(nn.Module):
    """Prediction head for masked language modeling."""

    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.dense = nn.Linear(d_model, d_model)
        self.activation = nn.GELU()
        self.layer_norm = nn.LayerNorm(d_model)
        self.decoder = nn.Linear(d_model, vocab_size)

    def forward(self, hidden_states):
        # Transform hidden states
        x = self.dense(hidden_states)
        x = self.activation(x)
        x = self.layer_norm(x)
        # Project to vocabulary
        logits = self.decoder(x)
        return logits


class TinyMLM(nn.Module):
    """Minimal MLM model for demonstration."""

    def __init__(
        self, vocab_size, d_model=128, n_heads=4, n_layers=2, max_len=128
    ):
        super().__init__()

        # Embeddings
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.layer_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)

        # Transformer encoder (bidirectional - no causal mask)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_model * 4,
            dropout=0.1,
            batch_first=True,
            activation="gelu",
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

        # MLM prediction head
        self.mlm_head = MLMHead(d_model, vocab_size)

    def forward(self, input_ids):
        batch_size, seq_len = input_ids.shape

        # Get embeddings
        positions = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        x = self.token_emb(input_ids) + self.pos_emb(positions)
        x = self.layer_norm(x)
        x = self.dropout(x)

        # Apply transformer (no mask = bidirectional attention)
        x = self.encoder(x)

        # Predict masked tokens
        logits = self.mlm_head(x)
        return logits
In[12]:
Code
# Create model and count parameters
demo_model_vocab_size = 1000
model = TinyMLM(demo_model_vocab_size)
total_params = sum(p.numel() for p in model.parameters())
Out[13]:
Console
Model parameters: 686,952

With roughly 690,000 parameters, this is a tiny model by modern standards. BERT-base has 110 million parameters, and BERT-large has 340 million. Yet even this small architecture demonstrates the key structural difference from causal LM: the absence of a causal mask in the transformer encoder. Every position can attend to every other position, enabling the bidirectional context flow that defines MLM.

Now let's train on a simple corpus:

In[14]:
Code
# Simple tokenization for demonstration
text = """The quick brown fox jumps over the lazy dog.
A journey of a thousand miles begins with a single step.
To be or not to be that is the question.
All that glitters is not gold.
Knowledge is power."""

# Character-level for simplicity
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
vocab_size = len(chars)

# Reserve index 0 for [MASK]
mask_token_id = 0
char_to_idx = {c: i + 1 for i, c in enumerate(chars)}
idx_to_char = {i + 1: c for c, i in char_to_idx.items()}
idx_to_char[0] = "[MASK]"
vocab_size = len(chars) + 1

# Encode text
encoded = torch.tensor([char_to_idx[c] for c in text])
Out[15]:
Console
Vocabulary size: 33
Text length: 193 characters
Sample encoding: 'The quick brown fox ' -> [6, 14, 11, 2, 23, 27, 15, 9, 17, 2, 8, 24, 21, 29, 20, 2, 12, 21, 30, 2]

Our corpus contains 32 unique characters (upper- and lowercase letters, space, newline, and period), giving us a vocabulary of 33 after adding the [MASK] token. This small vocabulary makes training feasible even on fewer than 200 characters of text.

In[16]:
Code
def get_mlm_batch(data, batch_size=16, seq_len=32, mask_prob=0.15):
    """Create a batch for MLM training."""
    # Sample random starting positions
    starts = torch.randint(0, len(data) - seq_len, (batch_size,))
    sequences = torch.stack([data[s : s + seq_len] for s in starts])

    # Apply masking
    masked_ids, labels = apply_mlm_masking(
        sequences, vocab_size, mask_token_id, mask_prob
    )

    return masked_ids, labels


# Training loop
model = TinyMLM(vocab_size, d_model=64, n_heads=4, n_layers=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

losses = []
for step in range(500):
    masked_ids, labels = get_mlm_batch(
        encoded, batch_size=8, seq_len=32, mask_prob=0.15
    )

    # Forward pass
    logits = model(masked_ids)

    # Compute loss only on masked positions
    loss = F.cross_entropy(
        logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100
    )

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    losses.append(loss.item())
Out[17]:
Console
Initial loss: 3.4266
Final loss: 2.6551
Random baseline (vocab=33): 3.4965
Loss reduction: 22.5%

The loss dropped significantly from near the random baseline. A random model would assign equal probability $1/V$ to each token, yielding loss $\log(V) \approx 3.5$ for $V = 33$. Our trained model achieves a noticeably lower loss, indicating it has learned to predict masked characters using bidirectional context.

Out[18]:
Visualization
Line plot showing MLM training loss decreasing over 500 training steps.
Training loss for the character-level MLM model. The loss drops rapidly as the model learns character-level patterns and common word structures. Note that loss is computed only on masked tokens (15% of positions).

Let's see what the model predicts for masked tokens:

In[19]:
Code
def predict_masked(model, text, mask_positions):
    """Predict tokens at specified mask positions."""
    model.eval()

    # Encode and mask
    tokens = torch.tensor([[char_to_idx[c] for c in text]])
    original_tokens = tokens.clone()

    for pos in mask_positions:
        tokens[0, pos] = mask_token_id

    with torch.no_grad():
        logits = model(tokens)

    predictions = []
    for pos in mask_positions:
        probs = F.softmax(logits[0, pos], dim=-1)
        top_k = torch.topk(probs, k=5)

        original_char = text[pos]
        predicted_chars = [
            idx_to_char.get(idx.item(), "?") for idx in top_k.indices
        ]
        predicted_probs = top_k.values.tolist()

        predictions.append(
            {
                "position": pos,
                "original": original_char,
                "top_predictions": list(zip(predicted_chars, predicted_probs)),
            }
        )

    return predictions
In[20]:
Code
# Test prediction on a sample phrase
sample_text = "The quick brown fox"
mask_positions = [4, 10]  # Mask 'q' and 'b'
predictions = predict_masked(model, sample_text, mask_positions)
Out[21]:
Console
Text: 'The quick brown fox'
Masked positions: [4, 10]

Position 4: original='q'
  Top 3 predictions: [('\n', '0.330'), ('m', '0.075'), ('n', '0.065')]
Position 10: original='b'
  Top 3 predictions: [('\n', '0.345'), ('m', '0.088'), ('n', '0.065')]

The model uses bidirectional context to inform its predictions. At position 4, it sees "The " before and "uick brown fox" after. At position 10, it sees "The quick " before and "rown fox" after. With so little training data and a character-level vocabulary, the predictions lean toward frequent characters rather than reliably recovering the originals, but the probability mass is already far from uniform.

Out[22]:
Visualization
Bar chart showing probability distribution with a few high bars and many low bars.
Probability distribution over vocabulary for a masked position. The model concentrates probability mass on a few likely characters while assigning near-zero probability to implausible ones. This peaked distribution is the goal of MLM training.

MLM Training Dynamics

MLM training differs from CLM in several important ways that affect how models learn.

Sparse Gradients

Because only 15% of tokens are masked, only 15% of the output positions contribute to the gradient. This is less sample-efficient than CLM, where every position provides signal. To compensate, MLM models typically train for more steps or on more data.
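A quick empirical check, reusing the apply_mlm_masking function defined earlier on a batch of hypothetical random token IDs, confirms that only around 15% of label positions carry any loss signal:

Code
# Sketch: measure gradient sparsity. Token IDs are random placeholders.
torch.manual_seed(0)
batch = torch.randint(5, 1000, (32, 128))   # hypothetical batch of token IDs
_, sparsity_labels = apply_mlm_masking(
    batch, vocab_size=1000, mask_token_id=0, mask_prob=0.15
)

supervised_fraction = (sparsity_labels != -100).float().mean().item()
print(f"Positions contributing to the loss: {supervised_fraction:.1%}")  # ~15%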

Out[23]:
Visualization
Bar chart showing gradient signal at all positions for causal LM.
Causal LM provides gradient signal at every position (100% of tokens).
Bar chart showing sparse gradient signal only at masked positions for MLM.
Masked LM only provides signal at masked positions (about 15% of tokens, shown in red).

No Exposure Bias

CLM suffers from "exposure bias": during training, the model always sees ground truth previous tokens, but during generation, it sees its own predictions. This mismatch can cause errors to compound.

MLM doesn't have this problem because it doesn't generate autoregressively. The model always conditions on the full (corrupted) input, both during training and inference. This makes MLM representations more robust for understanding tasks.

Independent Predictions

In MLM, predictions at different masked positions are made independently, in parallel. The model predicts all masked tokens simultaneously, not sequentially. This is efficient for training but means the model doesn't capture dependencies between masked tokens.

Consider masking both "New" and "York" in "I visited New York." CLM would predict "York" conditional on having already predicted "New." MLM predicts both independently: the first position might be filled as if the phrase were "New Orleans" while the second is filled as if it were "Los Angeles," yielding an incoherent combination such as "New Angeles."
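A minimal sketch using the character-level model trained above makes this independence visible: a single forward pass produces logits for every position, and each masked position gets its own argmax with no conditioning on what the other masked position was predicted to be. The positions chosen here are illustrative.

Code
# Sketch: predictions at multiple masked positions are made independently.
model.eval()
sample = "The quick brown fox"
tokens = torch.tensor([[char_to_idx[c] for c in sample]])
positions = [4, 5]                  # mask two adjacent characters ('q' and 'u')
for pos in positions:
    tokens[0, pos] = mask_token_id

with torch.no_grad():
    logits = model(tokens)          # one forward pass, all predictions at once

# Each position's prediction is its own argmax; nothing ties them together.
for pos in positions:
    pred_id = logits[0, pos].argmax().item()
    print(pos, repr(idx_to_char.get(pred_id, "?")))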

Dynamic Masking

The original BERT used static masking: each training example had the same tokens masked throughout training. RoBERTa introduced dynamic masking, where masking is applied fresh for each epoch.

In[24]:
Code
def static_masking(data, mask_prob=0.15):
    """Apply masking once, reuse throughout training."""
    masked_ids, labels = apply_mlm_masking(
        data.unsqueeze(0), vocab_size, mask_token_id, mask_prob
    )
    return masked_ids.squeeze(0), labels.squeeze(0)


def dynamic_masking(data, mask_prob=0.15):
    """Apply fresh masking each time."""
    # This is called each time we need a batch
    # Different tokens are masked each call
    return static_masking(data, mask_prob)
In[25]:
Code
# Compare static vs dynamic masking on the same sequence
sample = encoded[:20]
static_masked, _ = static_masking(sample)
Out[26]:
Console
Static masking (same pattern reused):
  [6, 14, 11, 2, 23, 27, 15, 9, 17, 2, 8, 24, 21, 29, 20, 2, 12, 0, 30, 0]
  [6, 14, 11, 2, 23, 27, 15, 9, 17, 2, 8, 24, 21, 29, 20, 2, 12, 0, 30, 0]
  [6, 14, 11, 2, 23, 27, 15, 9, 17, 2, 8, 24, 21, 29, 20, 2, 12, 0, 30, 0]

Dynamic masking (fresh pattern each time):
  [6, 14, 0, 3, 23, 27, 15, 9, 17, 2, 8, 24, 0, 29, 20, 2, 12, 21, 30, 2]
  [6, 14, 11, 2, 0, 27, 15, 9, 17, 2, 8, 24, 21, 29, 20, 2, 12, 21, 30, 2]
  [6, 14, 11, 2, 23, 27, 15, 0, 17, 0, 8, 24, 21, 29, 20, 2, 12, 21, 30, 2]

With static masking, the same tokens are masked every time the model sees this sequence. With dynamic masking, different tokens are masked on each pass, so the model extracts more varied training signal from the same underlying data. RoBERTa showed this simple change improves downstream performance, especially with longer training.

MLM for Representation Learning

The primary use of MLM is learning representations that transfer to downstream tasks. After pretraining, the model's hidden states capture rich semantic information that can be fine-tuned for classification, question answering, named entity recognition, and other tasks.

The typical workflow is:

  1. Pretrain on large unlabeled corpus with MLM objective
  2. Fine-tune on labeled data for specific task
  3. Infer using the fine-tuned model

During fine-tuning, the [MASK] token is never used. The model processes normal text and uses its pretrained representations as a starting point. Fine-tuning updates all weights to adapt to the specific task.
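As a hedged sketch of what step 2 looks like with a real pretrained encoder, the snippet below uses Hugging Face transformers to put a fresh classification head on bert-base-uncased and run one fine-tuning step. The example texts and labels are placeholders, and the snippet assumes the library is installed and the weights can be downloaded.

Code
# Sketch: fine-tuning a pretrained MLM encoder for classification.
# Texts and labels are placeholders; no [MASK] tokens appear anywhere.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)  # the MLM head is dropped; a new classification head sits on the encoder

batch = tokenizer(
    ["a delightful little film", "a tedious mess"],
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])       # placeholder sentiment labels

outputs = model(**batch, labels=labels)
outputs.loss.backward()             # gradients update all pretrained weights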

Out[27]:
Visualization
Flow diagram showing pretraining on unlabeled data, then fine-tuning on labeled data for classification.
Typical MLM workflow: pretraining learns general representations from unlabeled text using the masking objective, then fine-tuning adapts these representations to specific downstream tasks using labeled data.

Limitations and Impact

Masked language modeling has transformed NLP, but it comes with fundamental limitations that shape its applications.

The inability to generate text is the most significant constraint. MLM models cannot produce coherent sequences token by token. They can fill in blanks and score existing text, but they cannot write. This limitation means MLM is unsuitable for chatbots, story generation, code completion, and other generative applications. The distinction between understanding and generation has driven the field toward hybrid architectures that combine MLM-style encoders with CLM-style decoders, as seen in T5 and BART.

The masking mismatch between pretraining and fine-tuning creates subtle issues. During pretraining, 15% of tokens are corrupted. During fine-tuning, no tokens are masked. The 80-10-10 strategy mitigates this by keeping some tokens unchanged, but the model still never sees the [MASK] token after pretraining. Research on continuous masking and better pretraining objectives continues to address this gap.

Sample efficiency is another concern. With only 15% of tokens contributing to the loss, MLM requires more compute than CLM to see the same amount of training signal. RoBERTa compensated by training longer and on more data, but this increases cost. Recent work on efficient pretraining explores higher masking rates and alternative objectives.

Despite these limitations, MLM unlocked capabilities that seemed impossible before. BERT's bidirectional representations set new state-of-the-art results on eleven NLP benchmarks when released. The pretrain-then-fine-tune paradigm it established remains the dominant approach for understanding tasks. Sentence embeddings from MLM models power semantic search, document clustering, and similarity computations across the industry. The insight that bidirectional context improves understanding has influenced the design of virtually every model since.

Key Parameters

When training MLM models, several parameters significantly impact performance:

  • mask_prob (default: 0.15): Fraction of tokens to mask per sequence. Higher values provide more training signal but destroy more context. The 15% rate from BERT remains standard, though some work explores 40% or higher with adjusted strategies.
  • d_model: Hidden dimension of the transformer. BERT-base uses 768, BERT-large uses 1024. Larger values increase model capacity but require proportionally more compute and memory.
  • n_heads: Number of attention heads. Should divide d_model evenly. BERT-base uses 12 heads (64 dimensions each), BERT-large uses 16 heads.
  • n_layers: Number of transformer layers. BERT-base uses 12, BERT-large uses 24. Deeper models capture more complex patterns but are slower to train and infer.
  • max_len: Maximum sequence length the model can process. BERT uses 512 tokens. Longer contexts require quadratically more memory for attention but capture more context.
  • learning_rate: Typically 1e-4 to 5e-4 for MLM pretraining. BERT used 1e-4 with warmup. Higher rates speed training but risk instability.
  • batch_size: Larger batches provide more stable gradients. BERT used effective batch sizes of 256 sequences. MLM benefits from large batches since only 15% of tokens contribute to each gradient.
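For reference, here is how these knobs line up with the chapter's TinyMLM constructor, using BERT-base-like values for comparison. This is a sketch of the mapping only; instantiating a model at this size would be far heavier than the demo model trained above.

Code
# Sketch: BERT-base-like settings expressed as TinyMLM constructor arguments.
bert_base_like = dict(
    vocab_size=30522,   # BERT's WordPiece vocabulary size
    d_model=768,        # hidden dimension (BERT-base)
    n_heads=12,         # 768 / 12 = 64 dimensions per head
    n_layers=12,        # encoder depth (BERT-base)
    max_len=512,        # maximum sequence length
)
# model = TinyMLM(**bert_base_like)   # on the order of 100M+ parameters

# Training-loop settings discussed above:
mask_prob = 0.15        # fraction of tokens selected for prediction
learning_rate = 1e-4    # BERT's pretraining learning rate (with warmup)
batch_size = 256        # BERT's effective batch size in sequences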

Summary

Masked language modeling trains models to predict randomly masked tokens from bidirectional context. This chapter covered the key concepts:

  • Bidirectional context allows MLM models to use information from both before and after each position, producing richer representations than unidirectional models
  • The 15% masking rate balances sample efficiency against context preservation, providing enough training signal while keeping most context visible
  • The 80-10-10 strategy (80% [MASK], 10% random, 10% unchanged) reduces the mismatch between pretraining and fine-tuning
  • MLM vs. CLM represents a fundamental trade-off: MLM excels at understanding tasks while CLM excels at generation
  • Dynamic masking applies fresh masks each epoch, increasing training signal diversity
  • Sparse gradients from masking only 15% of positions make MLM less sample-efficient than CLM

The next chapter explores whole word masking, a refinement that improves MLM by masking entire words rather than individual subword tokens.


About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
