Whole Word Masking: Eliminating Information Leakage in BERT Pre-training

Michael Brenndoerfer · Updated July 10, 2025 · 30 min read

Learn how Whole Word Masking improves BERT pre-training by masking complete words instead of subword tokens, eliminating information leakage and strengthening the learning signal.


Whole Word Masking

When BERT tokenizes "undeniably" into ["und", "##enia", "##bly"] and then masks only "##bly", something goes wrong. The model sees "und" and "##enia" in the clear, giving it strong hints about the masked portion. The prediction task becomes almost trivial: what word starts with "undenia-" and ends with a common suffix? This partial visibility undermines the learning objective.

Whole Word Masking (WWM) fixes this by treating subword tokens as parts of atomic units. If any subword of a word is selected for masking, all subwords of that word are masked together. The model must now predict the entire word from surrounding context alone, without peeking at sibling subwords. This seemingly simple change produces meaningful improvements in downstream task performance.

In this chapter, we'll examine why subword masking creates problems, how whole word masking works algorithmically, and how to implement it for different tokenizers. We'll also compare WWM against random subword masking to see the empirical differences.

The Subword Masking Problem

Modern language models use subword tokenization to handle vocabulary efficiently. Algorithms like WordPiece, BPE, and SentencePiece break words into smaller units based on frequency or likelihood. Common words remain intact while rare words decompose into recognizable pieces.

This decomposition creates an asymmetry in the masking process. When we randomly select 15% of tokens for masking, we're selecting subword tokens, not words. A multi-token word might have some subwords masked while others remain visible. The visible subwords leak information about the masked ones.

Why Partial Masking Weakens Learning

Consider the sentence "The transformation was remarkable." With a WordPiece vocabulary that splits these words into subwords, the tokens would be:

["The", "transform", "##ation", "was", "remark", "##able", "."]

If random masking selects only "##ation", the input becomes:

["The", "transform", "[MASK]", "was", "remark", "##able", "."]

The model sees "transform" immediately adjacent to the mask. How many English words start with "transform-"? Only a handful: transformation, transformed, transforming, transformer. The prediction task collapses from choosing among 30,000 vocabulary items to distinguishing between 3-4 suffixes.

This is problematic for several reasons:

  • Weak learning signal: The model learns to pattern-match subword combinations rather than understand context deeply. It doesn't need to reason about meaning, just morphology.

  • Uneven difficulty: Single-token words face the full prediction challenge while multi-token words get easy hints. The model develops uneven representations.

  • Distributional mismatch: During fine-tuning, the model sees complete words. Pre-training on partial words creates a distribution shift.

Quantifying Information Leakage

To understand exactly how much information leaks when sibling subwords are visible, we need a way to measure uncertainty. How hard is the prediction task? Information theory gives us a precise tool: entropy.

The intuition behind entropy

Imagine you're playing a guessing game. If someone picks a number between 1 and 1,000, you have high uncertainty. If they pick between 1 and 2, you have low uncertainty. Entropy quantifies this: it measures how many "bits" of information you need to identify the answer. More possible answers means higher entropy; fewer means lower.

For language model predictions, entropy captures how "spread out" the model's probability distribution is. When the model is confident about one token, entropy is low. When it's unsure between many tokens, entropy is high. A good training signal comes from high entropy: the model must work hard to make the right prediction.

The entropy formula

Given a masked position, the model outputs a probability distribution over all possible tokens. The conditional entropy of this prediction is:

$$H(x_{\text{mask}} \mid x_{\text{context}}) = -\sum_{v \in V} P(v \mid x_{\text{context}}) \log P(v \mid x_{\text{context}})$$

where:

  • $H(x_{\text{mask}} \mid x_{\text{context}})$: the conditional entropy of the masked token given the visible context, measured in bits (if using $\log_2$) or nats (if using the natural log)
  • $x_{\text{mask}}$: the masked token the model must predict
  • $x_{\text{context}}$: the visible tokens surrounding the masked position
  • $V$: the vocabulary of all possible tokens
  • $P(v \mid x_{\text{context}})$: the probability of token $v$ being the correct prediction given the context

The formula works by taking each possible token, weighting its log-probability by the probability itself, and summing. When one token has probability near 1, its $\log P$ is near zero and every other token has $P$ near zero, so all of the $P \log P$ contributions are small and the total entropy is low. When probability is spread uniformly across many tokens, there are many moderate contributions, yielding high entropy.

Entropy bounds tell us about task difficulty

Two extreme cases illuminate the formula's behavior:

  1. Maximum entropy: When all $|V|$ tokens are equally likely, each has probability $1/|V|$, and entropy reaches $\log |V|$. For a vocabulary of 30,000 tokens, that's about 15 bits: a genuinely challenging prediction.

  2. Minimum entropy: When one token has probability 1.0 and all others have probability 0, entropy equals 0. The prediction is trivial.

Out[3]:
Visualization
Line plot showing entropy increasing logarithmically with vocabulary size, with annotations marking partial masking around 6 tokens and whole word masking around 30000 tokens.
Entropy as a function of effective vocabulary size. When all tokens in a vocabulary are equally likely, entropy equals log(n). Partial masking collapses the effective vocabulary from 30,000 tokens to just a handful of valid suffixes, dramatically reducing entropy and weakening the training signal.

The visualization shows how dramatically entropy drops when the effective vocabulary shrinks. Partial masking constrains predictions to just a few valid suffixes, collapsing entropy from nearly 15 bits to around 2.6 bits. This 5-6x reduction in uncertainty means the model can achieve low loss without deep contextual understanding.

Why partial masking destroys the training signal

Now we can quantify the information leakage problem. When "transform" is visible and we mask only "##ation", what happens to the entropy?

The visible prefix constrains the possibilities dramatically. The only valid continuations are tokens that can follow "transform-" in English words: {##ation, ##ed, ##ing, ##er, ##s, ##able}. The effective vocabulary shrinks from 30,000 to perhaps 6 options. Even with a uniform distribution over these 6, entropy drops from 15 bits to about 2.6 bits.

In practice, it's even worse. The model learns that ##ation is by far the most common suffix after "transform-", so $P(\texttt{\#\#ation})$ might be 0.7 or higher. The entropy collapses further, perhaps to 1-2 bits.

This explains why partial masking produces weak gradients. The model achieves low loss not by understanding context, but by memorizing subword co-occurrence patterns. It learns morphology instead of semantics.
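To make these numbers concrete, here is a small sketch (not from the original notebook; the helper entropy_bits is ours) that computes the entropy values quoted above: a uniform distribution over a 30,000-token vocabulary, a uniform distribution over 6 plausible suffixes, and a skewed distribution that puts 0.7 on a single suffix.

import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Uniform over the full 30,000-token vocabulary
print(f"Full vocabulary:       {entropy_bits([1 / 30_000] * 30_000):.1f} bits")  # ~14.9

# Uniform over 6 plausible suffixes after a visible prefix
print(f"6 possible suffixes:   {entropy_bits([1 / 6] * 6):.1f} bits")  # ~2.6

# Skewed: 0.7 on '##ation', the rest split over 5 alternatives
print(f"Skewed toward ##ation: {entropy_bits([0.7] + [0.3 / 5] * 5):.1f} bits")  # ~1.6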

The Whole Word Masking Procedure

Whole Word Masking preserves word boundaries during the masking process. The algorithm requires knowing which subword tokens belong to the same original word. Different tokenizers signal this differently: WordPiece uses ## prefixes for continuation tokens, SentencePiece uses special Unicode characters, and BPE can use space prefixes.

The Core Algorithm

The WWM procedure works in three steps:

  1. Group subwords into words: Traverse the token sequence and collect consecutive subword tokens that belong to the same word. A new word starts when a token lacks the continuation marker.

  2. Select words for masking: Choose words (not tokens) to mask based on the masking probability. The selection targets approximately 15% of the total tokens while operating at the word level.

  3. Mask all subwords together: For each selected word, replace all its constituent subword tokens with [MASK].

The key insight is that we're changing the unit of selection from subword tokens to whole words while maintaining approximately the same masking ratio in terms of total tokens masked.

Handling the 15% Target

Standard MLM masks 15% of tokens. With WWM, we shift from selecting individual tokens to selecting whole words, but we still want approximately 15% of tokens to end up masked. This creates an interesting problem: words have different lengths in subword tokens.

The variable-length complication

Consider a concrete example. You have a sentence with 16 tokens forming 10 words:

  • 6 words are single tokens (like "the", "is", "a")
  • 3 words are two tokens each (like "trans##form", "learn##ing")
  • 1 word is four tokens (like "un##believ##ab##ly")

If you randomly select 15% of words (1-2 words), you might mask anywhere from 1 token (if you pick "the") to 4 tokens (if you pick the long word). The actual token masking ratio becomes unpredictable: sometimes 6%, sometimes 25%, as the quick enumeration below shows.
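A quick enumeration (illustrative, not from the original notebook) makes the variance visible for exactly this word-length mix:

from itertools import combinations

# Word lengths from the example: 6 one-token words, 3 two-token words,
# and 1 four-token word, for 16 subword tokens in total.
word_lengths = [1] * 6 + [2] * 3 + [4]
total_tokens = sum(word_lengths)  # 16

# "15% of 10 words" means picking 1 or 2 words; enumerate every possible pick.
for k in (1, 2):
    ratios = {sum(pick) / total_tokens for pick in combinations(word_lengths, k)}
    print(f"Masking {k} word(s): token ratio from {min(ratios):.0%} to {max(ratios):.0%}")
# Roughly 6-25% for one word and 12-38% for two, instead of a steady 15%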

A length-weighted selection approach

One solution weights word selection by length. The probability of masking each word becomes proportional to how many tokens it contributes:

$$P(\text{mask word } w) = \frac{0.15 \times |w|}{\sum_{i} |w_i|}$$

where:

  • $P(\text{mask word } w)$: the probability of selecting word $w$ for masking
  • $|w|$: the number of subword tokens in word $w$
  • $\sum_{i} |w_i|$: the total number of subword tokens in the sentence (summed across all words)
  • $0.15$: the target masking ratio

The formula gives each word a selection probability proportional to its token count: a 4-token word is four times more likely to be selected than a 1-token word, reflecting the fact that masking it contributes four times as many tokens toward the masking budget.

Why this approach has drawbacks

The length-weighted approach aims to keep the token masking ratio near the 15% target on average, but it introduces a bias: longer words get masked more often. Since longer words tend to be rarer and more complex (like "internationalization" or "photosynthesis"), the model sees these words masked disproportionately often. It might develop weaker representations for common short words.

The practical solution

Production implementations typically use a simpler greedy approach:

  1. Shuffle the list of words randomly
  2. Add words to the masking set one by one
  3. Stop when the total masked tokens reaches approximately 15%

This method is unbiased across word lengths and easy to implement. The masking ratio varies slightly per sentence (sometimes 12%, sometimes 18%), but these variations average out across a training batch. The model's performance is robust to this variance.

The 80-10-10 Rule Still Applies

BERT's original masking strategy applies a probabilistic treatment to selected tokens:

  • 80%: Replace with [MASK]
  • 10%: Replace with a random token
  • 10%: Keep the original token

With WWM, this rule extends to word boundaries. If a word is selected for the "random replacement" 10%, all its subwords are replaced with random tokens. If selected for the "keep original" 10%, all subwords remain unchanged. This maintains consistency within word boundaries.

Implementation

Let's implement whole word masking step by step. We'll start with the core logic for identifying word boundaries, then build the complete masking function.

Identifying Word Boundaries

The first task is grouping subword tokens into words. WordPiece tokenizers mark continuation tokens with a ## prefix:

In[4]:
Code
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


def identify_word_groups(tokens):
    """
    Group tokens into words based on WordPiece conventions.
    Returns list of (start_idx, end_idx) tuples for each word.
    """
    word_groups = []
    current_start = 0

    for i, token in enumerate(tokens):
        # New word starts when token doesn't have ## prefix
        # (and isn't the first token)
        if i > 0 and not token.startswith("##"):
            word_groups.append((current_start, i))
            current_start = i

    # Don't forget the last word
    if current_start < len(tokens):
        word_groups.append((current_start, len(tokens)))

    return word_groups
Out[5]:
Console
Tokens: ['the', 'transformation', 'was', 'und', '##enia', '##bly', 'remarkable', '.']
Word: ['the']
Word: ['transformation']
Word: ['was']
Word: ['und', '##enia', '##bly']
Word: ['remarkable']
Word: ['.']

The function correctly identifies word boundaries by detecting the ## prefix. In this example "the", "transformation", "was", and "remarkable" each remain single-token words, while "undeniably" is grouped as ["und", "##enia", "##bly"]. Each tuple represents the start and end indices in the token list, making it easy to mask all subwords of a word together.
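The driver cell that produced this console output is hidden in the rendered article; a minimal equivalent, reusing the tokenizer and function defined above, would be:

sentence = "The transformation was undeniably remarkable."
tokens = tokenizer.tokenize(sentence)
print("Tokens:", tokens)

for start, end in identify_word_groups(tokens):
    print("Word:", tokens[start:end])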

Building the Masking Function

Now we implement the complete WWM function. We'll select words for masking and apply the 80-10-10 rule:

In[6]:
Code
import random


def whole_word_masking(tokens, tokenizer, mask_prob=0.15, mask_token="[MASK]"):
    """
    Apply whole word masking to a token sequence.

    Args:
        tokens: List of subword tokens
        tokenizer: Tokenizer for vocabulary access
        mask_prob: Probability of masking (default 15%)
        mask_token: Token to use for masking

    Returns:
        masked_tokens: Tokens with WWM applied
        labels: Original tokens for masked positions, -100 elsewhere
    """
    # Get word boundaries
    word_groups = identify_word_groups(tokens)

    # Calculate target number of tokens to mask
    num_tokens = len(tokens)
    target_masked = int(num_tokens * mask_prob)

    # Shuffle word groups and select until we hit target
    shuffled_groups = word_groups.copy()
    random.shuffle(shuffled_groups)

    selected_groups = []
    tokens_selected = 0

    for group in shuffled_groups:
        group_size = group[1] - group[0]
        # Allow a small overshoot so a multi-token word near the budget can still fit
        if tokens_selected + group_size <= target_masked + 2:
            selected_groups.append(group)
            tokens_selected += group_size
        if tokens_selected >= target_masked:
            break

    # Apply masking with 80-10-10 rule
    masked_tokens = tokens.copy()
    labels = [-100] * num_tokens  # -100 = ignore in loss
    vocab = list(tokenizer.get_vocab().keys())

    for start, end in selected_groups:
        # Draw once per word so every subword of the word receives the same treatment
        rand = random.random()
        for i in range(start, end):
            labels[i] = tokenizer.convert_tokens_to_ids([tokens[i]])[0]

            if rand < 0.8:
                masked_tokens[i] = mask_token
            elif rand < 0.9:
                masked_tokens[i] = random.choice(vocab)
            # else: keep original (10% case)

    return masked_tokens, labels
Out[7]:
Console
Original:  ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
Masked:    ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', '[MASK]', 'dog', '.']
Label IDs: ['_', '_', '_', '_', '_', '_', '_', 13971, '_', '_']

The function masks entire words together. In this sentence every word happens to be a single token, so only "lazy" is replaced with [MASK] and its original token ID is stored at that position; for a multi-token word such as "undeniably", all of its subwords would receive the same treatment. The labels array stores the original token IDs for computing the loss, with -100 marking positions to ignore.
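Again, the driver cell is hidden; a sketch along these lines reproduces the output format (which word ends up masked depends on the random seed):

tokens = tokenizer.tokenize("The quick brown fox jumps over the lazy dog.")
masked, labels = whole_word_masking(tokens, tokenizer)

print("Original: ", tokens)
print("Masked:   ", masked)
print("Label IDs:", ["_" if label == -100 else label for label in labels])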

Comparing with Random Subword Masking

Let's visualize the difference between random subword masking and WWM on a longer example:

In[8]:
Code
def random_subword_masking(tokens, tokenizer, mask_prob=0.15):
    """Standard random subword masking (original BERT)."""
    masked_tokens = tokens.copy()
    labels = [-100] * len(tokens)
    vocab = list(tokenizer.get_vocab().keys())

    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tokenizer.convert_tokens_to_ids([token])[0]
            rand = random.random()
            if rand < 0.8:
                masked_tokens[i] = "[MASK]"
            elif rand < 0.9:
                masked_tokens[i] = random.choice(vocab)

    return masked_tokens, labels
Out[9]:
Console
Original tokens: ['transformation', '##al', 'leadership', 'inspire', '##s', 'organizational', 'change', '.']

Random Subword Masking:
  Masked: ['[MASK]', '##al', 'leadership', '[MASK]', '##s', 'organizational', 'change', '.']
  Info leak: Visible subwords may hint at masked siblings

Whole Word Masking:
  Masked: ['transformation', '##al', 'leadership', 'inspire', '##s', '[MASK]', 'change', '.']
  No leak: Complete words are masked together

In random subword masking, parts of a word remain visible: here "##al" stays in the clear while "transformation" is masked, and "##s" while "inspire" is masked, revealing the word structure. In WWM, the entire word is hidden.
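A hypothetical driver for this comparison, reusing both masking functions defined above (the exact tokens that end up masked vary with the random seed):

sentence = "Transformational leadership inspires organizational change."
tokens = tokenizer.tokenize(sentence)
print("Original tokens:", tokens)

rsm_tokens, _ = random_subword_masking(tokens, tokenizer)
wwm_tokens, _ = whole_word_masking(tokens, tokenizer)
print("Random subword masking:", rsm_tokens)
print("Whole word masking:    ", wwm_tokens)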

Visualizing the Difference

Let's create a visualization comparing how both masking strategies affect a set of sentences:

Out[10]:
Visualization
Grid showing sentences with color-coded tokens comparing masking strategies, demonstrating WWM keeps word boundaries intact.
Comparison of random subword masking vs whole word masking on sample sentences. Green tokens are visible, red tokens are masked. Random subword masking often leaves partial words visible (information leakage), while WWM consistently masks complete words.

The visualization highlights how random subword masking creates fragmented masking patterns. In "transformation", you might see "transform" while "##ation" is masked. WWM eliminates this by treating words as atomic units.

WWM for Different Tokenizers

Different tokenizers use different conventions for marking subword boundaries. Implementing WWM correctly requires adapting to each convention.

WordPiece (BERT)

WordPiece uses ## to prefix continuation tokens. A word starts with a token lacking ##, and subsequent ##-prefixed tokens continue it:

In[11]:
Code
def is_wordpiece_continuation(token):
    """Check if token continues a previous word (WordPiece)."""
    return token.startswith("##")
Out[12]:
Console
Tokens: ['un', '##ha', '##pp', '##iness']
Continuations: [False, True, True, True]

SentencePiece (T5, LLaMA)

SentencePiece uses the special character ▁ (U+2581) to mark word boundaries. Unlike WordPiece, this character appears at the start of words, not as a continuation marker:

In[13]:
Code
from transformers import T5Tokenizer

t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")


def is_sentencepiece_word_start(token):
    """Check if token starts a new word (SentencePiece)."""
    return token.startswith("▁")


def identify_word_groups_sentencepiece(tokens):
    """Group tokens into words for SentencePiece tokenizers."""
    word_groups = []
    current_start = 0

    for i, token in enumerate(tokens):
        if i > 0 and is_sentencepiece_word_start(token):
            word_groups.append((current_start, i))
            current_start = i

    if current_start < len(tokens):
        word_groups.append((current_start, len(tokens)))

    return word_groups
Out[14]:
Console
T5 tokens: ['▁Transformation', 'al', '▁change']
Word: ['▁Transformation', 'al']
Word: ['▁change']

BPE with GPT-2 Style

GPT-2's BPE tokenizer uses a different convention: it adds Ġ (representing a space) at the start of tokens that begin a new word:

In[15]:
Code
from transformers import GPT2Tokenizer

gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")


def identify_word_groups_gpt2(tokens):
    """Group tokens into words for GPT-2 BPE tokenizer."""
    word_groups = []
    current_start = 0

    for i, token in enumerate(tokens):
        # Ġ marks word boundary (space before token)
        if i > 0 and token.startswith("Ġ"):
            word_groups.append((current_start, i))
            current_start = i

    if current_start < len(tokens):
        word_groups.append((current_start, len(tokens)))

    return word_groups
Out[16]:
Console
GPT-2 tokens: ['Transform', 'ational', 'Ġchange']
Word: ['Transform', 'ational']
Word: ['Ġchange']

Universal WWM Function

We can create a universal function that detects the tokenizer type and applies the appropriate grouping:

In[17]:
Code
def identify_word_groups_universal(tokens, tokenizer):
    """
    Identify word groups for any common tokenizer type.
    Automatically detects WordPiece, SentencePiece, or BPE conventions.
    """
    if hasattr(tokenizer, "wordpiece_tokenizer"):
        # WordPiece (BERT-style): ## prefix for continuation
        return identify_word_groups(tokens)

    # Check first few tokens for convention clues
    sample = " ".join(tokens[: min(5, len(tokens))])

    if "▁" in sample:
        # SentencePiece: ▁ prefix for word start
        return identify_word_groups_sentencepiece(tokens)
    elif "Ġ" in sample:
        # GPT-2 BPE: Ġ prefix for word start
        return identify_word_groups_gpt2(tokens)
    else:
        # Fallback: treat each token as separate word
        return [(i, i + 1) for i in range(len(tokens))]
Out[18]:
Console
BERT: ['un', '##ha', '##pp', '##iness']
  Groups: [(0, 4)]

T5: ['▁un', 'h', 'app', 'iness']
  Groups: [(0, 4)]

GPT-2: ['un', 'h', 'appiness']
  Groups: [(0, 1), (1, 2), (2, 3)]

Each tokenizer produces a different subword split. For BERT and T5, the universal function groups all four tokens into a single word, returning [(0, 4)]: BERT marks continuations with ## prefixes, and T5 marks word starts with ▁. The GPT-2 case exposes a limitation of the fallback: because the word was tokenized without a leading space, none of its tokens carry the Ġ marker, so each token is treated as its own word.
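The hidden driver behind this output can be approximated as follows, assuming the three tokenizers loaded earlier in the chapter:

examples = [
    ("BERT", tokenizer, tokenizer.tokenize("unhappiness")),
    ("T5", t5_tokenizer, t5_tokenizer.tokenize("unhappiness")),
    ("GPT-2", gpt2_tokenizer, gpt2_tokenizer.tokenize("unhappiness")),
]

for name, tok, tokens in examples:
    print(f"{name}: {tokens}")
    print(f"  Groups: {identify_word_groups_universal(tokens, tok)}")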

Empirical Comparison: WWM vs Random Masking

Let's compare the prediction difficulty under both masking strategies. We'll measure how much information leakage helps the model make correct predictions:

In[19]:
Code
from transformers import BertForMaskedLM
import torch

# Load a pre-trained BERT model
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


def get_mask_predictions(sentence, mask_positions, tokenizer, model):
    """Get model predictions for masked positions."""
    tokens = tokenizer.tokenize(sentence)
    tokens_with_special = ["[CLS]"] + tokens + ["[SEP]"]

    # Adjust mask positions for [CLS]
    adjusted_positions = [p + 1 for p in mask_positions]

    # Apply masking
    masked_tokens = tokens_with_special.copy()
    for pos in adjusted_positions:
        masked_tokens[pos] = "[MASK]"

    # Get predictions
    input_ids = tokenizer.convert_tokens_to_ids(masked_tokens)
    input_tensor = torch.tensor([input_ids])

    with torch.no_grad():
        outputs = model(input_tensor)
        predictions = outputs.logits[0]

    results = []
    for orig_pos, adj_pos in zip(mask_positions, adjusted_positions):
        probs = torch.softmax(predictions[adj_pos], dim=0)
        true_token_id = tokenizer.convert_tokens_to_ids([tokens[orig_pos]])[0]
        true_token_prob = probs[true_token_id].item()

        top5_ids = torch.topk(probs, 5).indices.tolist()
        top5_tokens = tokenizer.convert_ids_to_tokens(top5_ids)

        results.append(
            {
                "position": orig_pos,
                "true_token": tokens[orig_pos],
                "true_prob": true_token_prob,
                "top5": top5_tokens,
            }
        )

    return results
Out[20]:
Console
Tokens: ['the', 'transformation', 'was', 'remarkable']

Scenario 1: Partial masking (only ##ation masked)
  Visible context: 'transform' is visible
  True: 'was', Prob: 0.6490
  Top 5 predictions: ['was', 'is', 'became', 'proved', 'felt']

Scenario 2: Whole word masking (transform + ##ation masked)
  No partial word visible
  True: 'transformation', Prob: 0.0002
  Top 5: ['result', 'church', 'building', 'place', 'site']
  True: 'was', Prob: 0.1639
  Top 5: ['is', 'was', 'are', 'were', 'remains']

The results demonstrate the information leakage problem. When "transform" is visible, predicting "##ation" is trivial since the model essentially pattern-matches to complete the word. When the entire word is masked, the model must genuinely reason about context to make predictions.

Out[21]:
Visualization
Horizontal bar chart showing top-10 token predictions with high probability concentrated on ##ation.
Partial masking: When ''transform'' is visible, probability concentrates heavily on ##ation and related suffixes.
Horizontal bar chart showing top-10 token predictions with more uniform distribution across diverse tokens.
Whole word masking: When the entire word is masked, probability distributes across many plausible tokens.

The contrast is striking. With partial masking, the model assigns overwhelming probability to ##ation because it simply completes the visible prefix. With whole word masking, the model must consider what noun could fit the context "The [MASK] was remarkable," leading to a more diverse and uncertain prediction. This uncertainty creates stronger gradients during training.

Measuring Prediction Entropy

We can quantify the difficulty difference using prediction entropy:
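The cell that computes and plots these entropies is hidden; a minimal sketch of the underlying computation, reusing the model and tokenizer loaded earlier, could look like this (the helper prediction_entropy_bits and the [MASK] placements are illustrative, not part of the article's code):

import torch


def prediction_entropy_bits(masked_tokens, position, model, tokenizer):
    """Entropy (in bits) of the model's distribution at one masked position."""
    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(masked_tokens)])
    with torch.no_grad():
        logits = model(input_ids).logits[0, position]
    probs = torch.softmax(logits, dim=-1)
    return -(probs * torch.log2(probs.clamp_min(1e-12))).sum().item()


# Mask only a suffix (prefix visible) vs. mask the whole word
partial = ["[CLS]", "the", "transform", "[MASK]", "was", "remarkable", "[SEP]"]
whole = ["[CLS]", "the", "[MASK]", "[MASK]", "was", "remarkable", "[SEP]"]

print("Partial masking entropy:   ", prediction_entropy_bits(partial, 3, model, tokenizer))
print("Whole word masking entropy:", prediction_entropy_bits(whole, 2, model, tokenizer))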

Out[22]:
Visualization
Bar chart showing prediction entropy values for different masking scenarios, with WWM showing higher values.
Prediction entropy comparison between partial masking (only suffix masked) and whole word masking. Higher entropy indicates a more challenging prediction task. WWM creates consistently higher entropy, forcing deeper contextual reasoning.

The entropy measurements confirm our intuition. Partial masking produces lower entropy predictions because the visible subword tokens constrain the answer. Whole word masking forces higher entropy, indicating a more challenging prediction task that requires deeper contextual understanding.

Using WWM with Hugging Face

The Hugging Face transformers library provides built-in support for whole word masking through the DataCollatorForWholeWordMask class:

In[23]:
Code
from transformers import DataCollatorForWholeWordMask

# Create WWM data collator
wwm_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
Out[24]:
Console
Input IDs shape: torch.Size([2, 12])
Labels shape: torch.Size([2, 12])

First sentence after WWM:
  Input:  ['[CLS]', 'the', 'quick', 'brown', '[MASK]', 'jumps', 'over', 'the', 'lazy', 'dog', '.', '[SEP]']
  Labels: ['_', '_', '_', '_', 'fox', '_', '_', '_', 'lazy', '_', '_', '_']

The output shows tensors with matching shapes for inputs and labels. The [MASK] token appears in the input where "fox" was, and the labels tensor stores the original token at that position. Note that "lazy" also carries a label even though it remains unchanged in the input: that is the 10% "keep original" case from the 80-10-10 rule. Positions with -100 (shown as _) are ignored during loss computation, focusing training only on predicting the selected words.
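The call that produced this batch isn't shown; a hypothetical equivalent (the second sentence here is a stand-in) looks like:

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Another short example sentence.",
]

examples = [{"input_ids": tokenizer(t)["input_ids"]} for t in texts]
batch = wwm_collator(examples)

print("Input IDs shape:", batch["input_ids"].shape)
print("Labels shape:", batch["labels"].shape)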

Training with WWM

To train a model with WWM, simply use the collator in your training pipeline:

In[39]:
Code
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert-wwm",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # a tokenized dataset of input_ids prepared beforehand
    data_collator=wwm_collator,  # WWM applied here
)

trainer.train()

The training loop applies WWM dynamically to each batch, ensuring different masking patterns on each epoch.

Limitations and Impact

Limitations

Whole word masking introduces trade-offs that practitioners should consider:

  • Language-dependent effectiveness: WWM assumes that word boundaries carry semantic significance. This works well for English and other space-delimited languages, but languages without clear word boundaries (like Chinese or Japanese) require additional segmentation tools. For Chinese BERT models, word segmentation must happen before tokenization, adding complexity and potential error propagation.

  • Inconsistent masking ratio: Because words have variable lengths in subword tokens, the actual percentage of tokens masked varies by sentence. A sentence with many long words might see 20% masking while one with short words sees 12%. This variance doesn't significantly harm training but differs from the precise 15% in standard MLM.

  • Computational overhead: Identifying word boundaries adds preprocessing cost. For production training at scale, this overhead is negligible compared to model forward passes, but it complicates the data pipeline.

  • Morphology complications: Some morphological processes don't respect word boundaries. In German, compound words like "Donaudampfschifffahrtsgesellschaft" might be tokenized into many subwords that all belong together semantically. WWM keeps them together, which might mask too large a chunk and provide insufficient learning signal.

Impact

Despite these limitations, WWM has proven valuable:

  • Improved downstream performance: Google released BERT-wwm models showing improvements on SQuAD and other benchmarks. The gains are modest but consistent: typically 0.5-1.0 F1 points on reading comprehension tasks. For Chinese, the improvements are larger because character-level masking in standard BERT created severe information leakage.

  • Better morphological understanding: WWM forces models to learn word-level rather than subword-level patterns. This produces representations that better capture morphological relationships. Words with shared roots cluster more meaningfully in the embedding space.

  • Standard practice for non-English models: WWM has become the default for training BERT models in morphologically rich languages. German, Arabic, and especially Chinese BERT variants almost universally use WWM because the alternative loses too much training signal to information leakage.

  • Foundation for span masking: WWM paved the way for more sophisticated masking strategies like span corruption in T5. The insight that masking units should respect linguistic boundaries generalizes beyond single words to phrases and sentences.

Key Parameters

When implementing whole word masking, several parameters control the masking behavior:

  • mlm_probability (default: 0.15): The target fraction of tokens to mask. WWM aims to achieve this ratio at the token level while selecting at the word level. Values between 0.1 and 0.2 work well; higher values provide more training signal per example but may make the task too difficult.

  • mask_token: The special token used for masking (e.g., [MASK] for BERT). This must match the tokenizer's mask token exactly for the model to recognize masked positions.

  • 80-10-10 distribution: Controls how selected tokens are modified. The standard split (80% [MASK], 10% random, 10% unchanged) helps bridge the gap between pre-training and fine-tuning, where [MASK] never appears.

  • Word boundary detection: Different tokenizers require different detection logic: ## continuation prefixes for WordPiece, ▁ word-start markers for SentencePiece, and Ġ space markers for GPT-2-style BPE.

  • random_state / seed: Setting a seed ensures reproducible masking patterns during evaluation or debugging. During training, avoid fixed seeds to maximize data diversity across epochs.

Summary

Whole Word Masking addresses a fundamental flaw in applying masked language modeling to subword-tokenized text. When subword tokens are masked independently, visible sibling tokens leak information about masked positions, reducing the prediction task to pattern matching rather than contextual reasoning.

WWM's key contributions:

  • Preserves word boundaries: All subwords of a word are masked together, eliminating within-word information leakage.

  • Strengthens learning signal: The model must predict complete words from context alone, developing deeper semantic understanding.

  • Adapts to tokenizer conventions: Different tokenizers mark word boundaries differently (##, ▁, Ġ), and WWM implementations must detect and handle each convention.

  • Integrates with existing frameworks: Libraries like Hugging Face provide ready-to-use WWM data collators that handle all implementation details.

The technique is particularly important for languages with rich morphology and for any application where word-level understanding matters more than subword pattern matching. While the improvements over standard MLM are modest for English, they compound with other techniques and become essential for many non-English languages.

The next chapter explores span corruption, which extends the WWM insight further by masking contiguous spans of multiple words, creating even more challenging prediction tasks that encourage models to learn longer-range dependencies.

