Master attention masking techniques including padding masks, causal masks, and sparse patterns. Learn how masking enables autoregressive generation and efficient batch processing.

Attention Masking
Self-attention computes pairwise interactions between all positions in a sequence. But sometimes we need to block certain interactions. A model generating text should not peek at future tokens. Sequences padded to equal length should not attend to padding tokens. These constraints are implemented through attention masking, a technique that selectively prevents certain positions from influencing others.
Masking modifies the attention computation by adding large negative values to specific positions before the softmax. Recall that attention weights are computed as:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}$$

where:
- $\alpha_{ij}$: the attention weight from query position $i$ to key position $j$
- $e_{ij}$: the raw attention score between positions $i$ and $j$
- $n$: the sequence length
When we add a large negative value (like $-10^9$) to a score $e_{ij}$, the exponential becomes vanishingly small, driving $\alpha_{ij}$ to near-zero. These masked positions effectively disappear from the weighted sum. This simple mechanism enables causal language modeling, efficient batch processing, and custom attention patterns.
Why Masking Matters
Consider training a language model to predict the next word. Given "The cat sat on the," the model should predict "mat" based only on the preceding words. But standard self-attention allows every position to see every other position, including future tokens. Without intervention, the model could cheat by looking ahead.
An attention mask is a matrix of values that modifies attention scores before softmax. Positions marked for masking receive large negative values, causing the softmax to assign them near-zero weights. This effectively prevents the query from attending to those positions.
Masking solves this by blocking attention to future positions during training. The model learns to predict each token using only past context, matching the autoregressive setup it will encounter during generation.
Padding presents a similar challenge. When batching sequences of different lengths, we pad shorter sequences to match the longest. But these padding tokens carry no meaning. Allowing real tokens to attend to padding would corrupt their representations with noise. Masking removes padding from the attention computation entirely.
Padding Masks
Real-world text comes in varying lengths. "Hello" has one token while "The quick brown fox jumps over the lazy dog" has nine. To process multiple sequences efficiently in a batch, we pad shorter sequences to a common length using a special padding token.
Padding ensures uniform tensor shapes, but it introduces tokens that should be invisible to the attention mechanism. A padding mask marks which positions contain real tokens versus padding.
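As a minimal NumPy sketch of how such a mask can be built, assuming a hypothetical batch of token IDs and a padding ID of 0 (both illustrative, not taken from the original code):

```python
import numpy as np

# Two sequences padded to length 4; pad_token_id = 0 is an assumption.
token_ids = np.array([
    [7, 12, 5, 0],   # 3 real tokens + 1 padding token
    [7,  9, 0, 0],   # 2 real tokens + 2 padding tokens
])
pad_token_id = 0

# Boolean padding mask: True marks real tokens, False marks padding.
padding_mask = token_ids != pad_token_id
print(padding_mask)
# [[ True  True  True False]
#  [ True  True False False]]
```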
The padding mask is a boolean array where True marks real tokens and False marks padding. To apply this mask to attention scores, we need to convert it into an additive mask.
Converting to Attention Mask
We have a boolean padding mask, but attention needs numerical scores. How do we bridge this gap? The answer lies in understanding how softmax behaves with extreme values.
Softmax converts a vector of arbitrary real numbers into a probability distribution. The key insight is that softmax is sensitive to relative differences between values, not their absolute magnitudes. When computing softmax, each score $x_i$ is exponentiated, then divided by the sum of all exponentials:

$$\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}$$

If one score is extremely negative while others are moderate, its exponential becomes vanishingly small. To illustrate, consider applying softmax to a vector with one large negative value:

$$\text{softmax}([2.0,\; 1.0,\; -10^9]) \approx [0.731,\; 0.269,\; 0.000]$$

The third element receives essentially zero weight. Why? Because $e^{-10^9}$ is astronomically small compared to $e^{2}$ and $e^{1}$. In fact, $e^{-10^9}$ is so close to zero that floating-point arithmetic rounds it to exactly zero.
This gives us our masking strategy: add a large negative value to positions we want to block. We typically use $-10^9$ rather than true infinity to avoid numerical edge cases, though both work in practice.
The plot demonstrates why we use large negative values like $-10^9$. Even a moderately negative mask value reduces the masked position's weight to near zero, while $-10^9$ makes it essentially invisible. The unmasked positions automatically absorb the redistributed attention.
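One way to perform this conversion in NumPy, continuing the boolean mask from the sketch above (the $-10^9$ constant and the broadcast to a full (batch, seq_len, seq_len) tensor are assumptions consistent with the description that follows):

```python
import numpy as np

padding_mask = np.array([
    [True, True, True, False],   # sequence 0: position 3 is padding
    [True, True, False, False],  # sequence 1: positions 2 and 3 are padding
])
batch, seq_len = padding_mask.shape

# 0 where the key is a real token, -1e9 where the key is padding.
additive = np.where(padding_mask[:, None, :], 0.0, -1e9)        # (batch, 1, seq_len)
attention_mask = np.broadcast_to(additive, (batch, seq_len, seq_len))

print(attention_mask.shape)   # (2, 4, 4)
print(attention_mask[0, 0])   # [ 0.  0.  0. -1e+09]
```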
The attention mask has shape (batch, seq_len, seq_len). For sequence 0, the last column contains large negative values because position 3 is padding. Every query position will receive near-zero attention weight for position 3 after softmax.
Visualizing Padding Mask Effects
Let's see how padding masks affect attention weights. We'll create random attention scores and compare the distributions with and without masking.
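A hedged sketch of that comparison (the random scores, sequence length, and seed are all illustrative choices):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, n_real = 4, 3                        # position 3 is padding
scores = rng.normal(size=(seq_len, seq_len))  # stand-in for raw attention scores

# Additive padding mask: block every query's view of the padding key.
mask = np.zeros((seq_len, seq_len))
mask[:, n_real:] = -1e9

print(softmax(scores).round(2))         # padding column receives nonzero weight
print(softmax(scores + mask).round(2))  # padding column is exactly 0.00
```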
The contrast is stark. Without masking, attention flows to all positions including padding. The padding tokens absorb 10-30% of the attention weight at each query position. With the mask applied, padding positions receive exactly 0.00 weight, and the remaining attention redistributes entirely to real tokens. This ensures padding never contaminates token representations.
Causal Masks
Causal masking, also called look-ahead masking, prevents positions from attending to future positions. This is essential for autoregressive language models that generate text one token at a time.
During training, we process entire sequences in parallel for efficiency. But the model must learn to predict each position using only past context. Causal masking enforces this constraint: position $i$ can only attend to positions $j \le i$, where $i$ is the current position index (0-indexed).
Causal attention restricts each position to attend only to itself and previous positions. This creates a left-to-right information flow where future tokens cannot influence past representations.
The Causal Mask Formula
How do we formalize "only attend to past positions"? We need a mask that blocks any attention where the key position comes after the query position. Mathematically, the causal mask is defined as:

$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$

where:
- $i$: the query position (row index), representing the token that is "asking"
- $j$: the key position (column index), representing the token being attended to
- $M_{ij}$: the mask value added to the attention score at position $(i, j)$
The condition $j \le i$ means "key position is at or before query position." When this holds, we add 0 (no masking). When $j > i$, the key is in the future relative to the query, so we add $-\infty$ to block attention.
This creates a lower triangular pattern of zeros and an upper triangular pattern of infinities:
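A minimal sketch of building this 5x5 mask with NumPy (np.triu is one convenient way to do it; a large negative constant works just as well in place of -inf):

```python
import numpy as np

seq_len = 5
# -inf above the diagonal (future positions), 0 on and below it.
causal_mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
print(causal_mask)
# [[  0. -inf -inf -inf -inf]
#  [  0.   0. -inf -inf -inf]
#  [  0.   0.   0. -inf -inf]
#  [  0.   0.   0.   0. -inf]
#  [  0.   0.   0.   0.   0.]]
```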
Position 0 can only attend to itself (only position 0 has value 0). Position 1 can attend to positions 0 and 1. Position 4 can attend to all positions. The upper triangle is filled with large negative values, blocking all look-ahead.
Visualizing Causal Attention
Let's trace through how causal masking affects attention during sequence processing.
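The sketch below applies the causal mask to a toy 5-token example; the random scores stand in for learned query-key products, so only the qualitative pattern (not the exact weights described next) carries over:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

tokens = ["The", "cat", "sat", "on", "mat"]
seq_len = len(tokens)

rng = np.random.default_rng(1)
scores = rng.normal(size=(seq_len, seq_len))   # stand-in for QK^T / sqrt(d_k)
causal_mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

weights = softmax(scores + causal_mask)
for token, row in zip(tokens, weights):
    print(f"{token:>4}: {np.round(row, 2)}")
# "The" always gets [1. 0. 0. 0. 0.]: with no past context it attends only to itself.
```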
The triangular structure is clear. "The" at position 0 attends entirely to itself (weight 1.00) because it has no past context. "cat" can attend to "The" or itself. By position 4, "mat" distributes attention across all five positions based on learned relevance patterns.
This structure ensures that when training on a sequence like "The cat sat on mat," each position learns to predict the next token using only information from positions to its left.
Why Causal Masking Enables Parallel Training
Without causal masking, training autoregressive models would be painfully slow. We would need to generate one token at a time, feeding each output back as input for the next. With causal masking, we can process the entire training sequence in parallel.
Each row of the attention matrix corresponds to one prediction task. Row 0 uses context $\{x_0\}$, row 1 uses context $\{x_0, x_1\}$, and so on. The causal mask automatically provides the correct context for each prediction position without needing separate forward passes.
Combining Multiple Masks
Real models often need both padding and causal masks simultaneously. A decoder processing batched sequences must handle variable lengths (requiring padding masks) while maintaining autoregressive constraints (requiring causal masks).
The solution is straightforward: combine masks by addition. If we have a causal mask $M^{\text{causal}}$ and a padding mask $M^{\text{pad}}$, the combined mask is:

$$M^{\text{combined}} = M^{\text{causal}} + M^{\text{pad}}$$

Since both masks use large negative values for blocked positions and 0 for allowed positions, the sum is large and negative wherever either mask blocks attention. A position is only allowed (value 0) when both masks allow it.
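A short sketch of this combination for the example discussed next (the sequence lengths of 2 and 3 real tokens are assumptions matching that description):

```python
import numpy as np

seq_len = 4
lengths = np.array([2, 3])                  # real tokens per sequence in the batch

# Causal mask shared across the batch: (1, seq_len, seq_len).
causal = np.triu(np.full((seq_len, seq_len), -1e9), k=1)[None, :, :]

# Key-side padding mask per sequence: (batch, 1, seq_len).
positions = np.arange(seq_len)
padding = np.where(positions[None, :] < lengths[:, None], 0.0, -1e9)[:, None, :]

# Addition combines them; broadcasting expands to (batch, seq_len, seq_len).
combined = causal + padding
print((combined[0] == 0).astype(int))       # 1 = allowed positions for sequence 0
print((combined[1] == 0).astype(int))       # 1 = allowed positions for sequence 1
```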
For sequence 0 with 2 real tokens, the combined mask blocks:
- Upper triangle (causal constraint)
- Columns 2 and 3 (padding positions)
The resulting mask allows each position to attend only to past, non-padding positions.
The green cells marked "Y" show where attention is allowed. For sequence 0, only the 2x2 lower-left block is active. For sequence 1, the 3x3 lower-left block allows attention. Both patterns combine causal and padding constraints in a single mask.
Mask Shapes and Broadcasting
Attention masks can have different shapes depending on the use case. Understanding these shapes and how they broadcast is crucial for correct implementation.
The attention score matrix has shape (batch, num_heads, seq_len, seq_len) in a full transformer with multi-head attention. Masks can be:
- Full shape (batch, num_heads, seq_len, seq_len): fully specified mask for every batch item and head
- Per-batch (batch, 1, seq_len, seq_len): the same mask across all heads, different per batch item
- Per-batch compact (batch, 1, 1, seq_len): key-only masking (like padding masks)
- Global (1, 1, seq_len, seq_len): the same mask for the entire batch (like causal masks)
NumPy and PyTorch broadcast smaller masks to match the score tensor shape.
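A sketch of this broadcasting with the shapes used in the example below (a batch of 2, 4 heads, sequence length 8; the real-token counts are illustrative):

```python
import numpy as np

batch, num_heads, seq_len = 2, 4, 8

# Global causal mask: (1, 1, seq_len, seq_len), shared by all batch items and heads.
causal = np.triu(np.full((seq_len, seq_len), -1e9), k=1)[None, None, :, :]

# Per-batch key padding mask: (batch, 1, 1, seq_len).
lengths = np.array([8, 5])                   # assumed real-token counts
keep = np.arange(seq_len)[None, :] < lengths[:, None]
padding = np.where(keep, 0.0, -1e9)[:, None, None, :]

scores = np.zeros((batch, num_heads, seq_len, seq_len))
masked_scores = scores + causal + padding    # broadcasts to the full shape
print(masked_scores.shape)                   # (2, 4, 8, 8)
```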
Broadcasting rules make mask combination elegant. The causal mask (1, 1, 8, 8) applies identically to all batch items and heads. The padding mask (2, 1, 1, 8) applies per batch item across all heads, masking keys based on which positions are padding. The final result (2, 4, 8, 8) has the correct combined mask for each batch item and attention head.
Memory-Efficient Masking
For very long sequences, storing full (seq_len, seq_len) masks becomes expensive. A 10,000-token sequence requires 100 million entries per mask. Several strategies reduce this cost.
Lazy mask generation: Instead of precomputing the full mask, generate it during the attention computation.
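One simple way to sketch this idea in NumPy is to write the mask values straight into the score matrix, row by row, instead of allocating a separate mask tensor (the function name is hypothetical):

```python
import numpy as np

def apply_causal_mask_inplace(scores, mask_value=-1e9):
    """Write causal mask values directly into the score matrix.

    Avoids materializing a separate (seq_len, seq_len) float mask.
    """
    seq_len = scores.shape[-1]
    for i in range(seq_len):
        scores[..., i, i + 1:] = mask_value   # block future keys for query i
    return scores

scores = np.random.default_rng(0).normal(size=(1000, 1000)).astype(np.float32)
apply_causal_mask_inplace(scores)             # no extra 4 MB mask tensor allocated
```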
A 1000-token sequence requires 4 MB just for the mask tensor. The lazy approach avoids this allocation by setting mask values directly on the score matrix during computation.
Boolean masks: Store masks as boolean arrays (1 byte per element) rather than float32 (4 bytes), converting to float only when needed.
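A small sketch of the savings (a causal pattern stored as booleans, converted to floats only at the point of use):

```python
import numpy as np

seq_len = 1000
bool_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)    # True = blocked
float_mask = np.where(bool_mask, np.float32(-1e9), np.float32(0.0))  # built on demand

print(bool_mask.nbytes)    # 1,000,000 bytes (~1 MB)
print(float_mask.nbytes)   # 4,000,000 bytes (~4 MB)
```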
For sequences of length 1000, boolean masks use 75% less memory. This adds up when processing large batches or very long documents.
Efficient Masking Implementation
With our understanding of padding and causal masks in place, we can now build a complete masked attention function. The key insight is that masking integrates seamlessly into the standard attention formula through a single addition operation.
The Masked Attention Formula
Recall that standard scaled dot-product attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

To add masking, we simply insert the mask matrix $M$ before the softmax:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

where:
- $Q$: the query matrix of shape $(n, d_k)$, representing what each position is "looking for"
- $K$: the key matrix of shape $(n, d_k)$, representing what each position "offers"
- $V$: the value matrix of shape $(n, d_v)$, containing the information to aggregate
- $M$: the mask matrix of shape $(n, n)$, with 0 for allowed and $-\infty$ for blocked positions
- $d_k$: the dimension of queries and keys, used for scaling to prevent vanishing gradients
- $n$: the sequence length
The formula unfolds in three steps:
1. Compute raw scores: $QK^\top$ produces an $n \times n$ matrix where entry $(i, j)$ measures the similarity between query $i$ and key $j$.
2. Scale and mask: Divide by $\sqrt{d_k}$ to stabilize gradients, then add the mask $M$. Masked positions receive $-\infty$, which the softmax will convert to near-zero weights.
3. Normalize and aggregate: Softmax converts scores to weights summing to 1 (per row), then these weights select and combine values from $V$.
This formulation is elegant because the mask doesn't change the computational structure. We still compute all pairwise scores, but the masking happens through simple addition before softmax. The exponential function in softmax then "erases" the masked positions.
Implementing Masked Attention
Let's translate the formula into code. The implementation follows the three-step structure exactly:
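Here is one possible sketch of such a function in NumPy; the function name, argument names, and the $-10^9$ convention for blocked positions are assumptions rather than the original code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, mask=None):
    """Scaled dot-product attention with an optional additive mask.

    Q, K: (..., seq_len, d_k); V: (..., seq_len, d_v)
    mask: broadcastable to (..., seq_len, seq_len); 0 = allowed, -1e9 = blocked.
    """
    d_k = Q.shape[-1]

    # Step 1: raw pairwise scores.
    scores = Q @ np.swapaxes(K, -1, -2)

    # Step 2: scale, then apply the mask with a single addition.
    scores = scores / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask

    # Step 3: normalize to weights and aggregate the values.
    weights = softmax(scores, axis=-1)
    return weights @ V, weights
```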
Notice how the mask application is just a single line: scores = scores + mask. This simplicity is intentional. The mask contains 0 for allowed positions (no effect on scores) and large negative values for blocked positions (which softmax converts to near-zero weights). No branching, no special cases, just addition.
Testing the Implementation
Let's verify that our implementation correctly handles both padding and causal masks:
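A hedged test sketch, reusing the masked_attention function above; the batch size, sequence length, and real-token counts mirror the description that follows, while the random seed and head dimension are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
batch, seq_len, d_k = 2, 6, 8
Q = rng.normal(size=(batch, seq_len, d_k))
K = rng.normal(size=(batch, seq_len, d_k))
V = rng.normal(size=(batch, seq_len, d_k))

# Causal mask shared across the batch: (1, seq_len, seq_len).
causal = np.triu(np.full((seq_len, seq_len), -1e9), k=1)[None, :, :]

# Padding mask: batch 0 has 6 real tokens, batch 1 has 4 real + 2 padding.
lengths = np.array([6, 4])
keep = np.arange(seq_len)[None, :] < lengths[:, None]
padding = np.where(keep, 0.0, -1e9)[:, None, :]             # (batch, 1, seq_len)

output, weights = masked_attention(Q, K, V, mask=causal + padding)
print(np.round(weights[1], 2))   # columns 4-5 and the upper triangle are ~0
```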
The attention weight matrix confirms that both masks work together correctly. Looking at the output for batch 1 (which has 4 real tokens and 2 padding tokens):
- Columns 4 and 5 are nearly zero: The padding mask successfully blocks attention to padding positions. No real token wastes attention on meaningless padding.
- Upper triangle is nearly zero: The causal mask blocks attention to future positions. Position 0 cannot see positions 1-5, position 1 cannot see positions 2-5, and so on.
- Lower-left region has non-zero weights: The intersection of "past positions" and "real tokens" receives all the attention. These are exactly the positions each query should attend to.
The combined mask creates a triangular pattern truncated by the padding boundary. This is precisely what a causal language model needs when processing variable-length batched sequences.
Custom Attention Patterns
Beyond standard padding and causal masks, researchers have explored various attention patterns for efficiency and modeling goals.
Local Attention
Restrict attention to a fixed window around each position. This reduces complexity from $O(n^2)$ to $O(n \cdot w)$, where:
- $n$: the sequence length (total number of tokens)
- $w$: the window size (number of positions each token can attend to)
Local attention is used in models like Longformer to handle very long sequences. The window captures local context efficiently, and special "global" tokens can still attend to all positions for long-range information.
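A sketch of building such a mask (the function name and the convention of counting `window` positions on either side are assumptions):

```python
import numpy as np

def local_attention_mask(seq_len, window):
    """Additive mask allowing each position to attend only to keys within
    `window` positions on either side of itself."""
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])
    return np.where(distance <= window, 0.0, -1e9)

print((local_attention_mask(8, window=2) == 0).astype(int))   # banded pattern
```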
Strided Attention
Attend to every $k$-th position, spreading attention across the sequence with fixed intervals. Here, $k$ is the stride parameter that controls how far apart attended positions are.
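One possible reading of this pattern as an additive mask (each position attends to keys whose distance from it is a multiple of the stride; other variants exist, such as causal strided attention):

```python
import numpy as np

def strided_attention_mask(seq_len, stride):
    """Additive mask where each position attends to keys whose distance
    from it is a multiple of `stride` (including itself)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return np.where(np.abs(i - j) % stride == 0, 0.0, -1e9)

print((strided_attention_mask(8, stride=3) == 0).astype(int))
```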
Combined Patterns
Real efficient attention mechanisms often combine multiple patterns. Sparse Transformer uses local + strided attention to cover both nearby and distant positions.
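Unlike padding and causal masks, which are intersected by addition, covering both nearby and distant positions means taking the union of the two patterns. A simplified sketch reusing the two helpers above (np.maximum keeps a position allowed if either mask allows it):

```python
import numpy as np

# Union of the local and strided patterns: 0 (allowed) wins over -1e9 (blocked).
local = local_attention_mask(16, window=2)
strided = strided_attention_mask(16, stride=4)
sparse = np.maximum(local, strided)

print(f"fraction of allowed pairs: {(sparse == 0).mean():.2f}")
```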
Each pattern represents a different trade-off. Full attention has maximum expressiveness but $O(n^2)$ cost, where $n$ is the sequence length. Sparse patterns reduce complexity at the cost of some long-range interactions. The choice depends on sequence length, available compute, and task requirements.
The efficiency gains are dramatic. Full attention always uses 100% of possible pairs. Causal attention uses ~50% (the lower triangle). But local and sparse patterns become increasingly efficient as sequences grow longer: at 1024 tokens, local attention with window 32 uses only ~6% of pairs, and sparse patterns even less. This is why efficient attention variants are essential for processing long documents.
Limitations and Impact
Attention masking is essential infrastructure for modern transformers, but it introduces its own complexities and constraints.
The most significant limitation is computational overhead. While masking itself is cheap (just addition before softmax), it doesn't reduce the fundamental $O(n^2)$ cost of computing all pairwise attention scores, where $n$ is the sequence length. Even with half the positions masked, we still compute the full score matrix. Truly sparse attention requires specialized implementations that skip masked computations entirely, which is harder to optimize on GPUs that prefer dense, regular operations. Libraries like FlashAttention address this by fusing the entire attention computation, mask included, into a single kernel.
Masking also constrains what the model can learn. Causal masking prevents bidirectional context, which is suboptimal for tasks like fill-in-the-blank or sentence classification. This is why BERT uses bidirectional attention (no causal mask) while GPT uses causal attention. The mask choice becomes an architectural decision that shapes what the model can and cannot do.
Despite these limitations, masking unlocked critical capabilities. Causal masking enabled efficient parallel training of autoregressive models, which would otherwise require sequential generation during training. Padding masks allowed practical batching of variable-length sequences. Custom patterns like local and strided attention extended transformers to sequence lengths previously impossible. Without masking, the transformer architecture would be far less practical and flexible.
Key Parameters
When implementing attention masking, several parameters control the mask behavior:
- mask_value: The large negative value used for masked positions (typically $-10^9$ or -float('inf')). Using $-10^9$ rather than true infinity avoids numerical issues with some operations while still producing near-zero attention weights after softmax.
- window_size (local attention): Controls how many positions on each side a token can attend to. Smaller windows reduce computation but may miss important long-range dependencies. Common values range from 128 to 512 tokens.
- stride (strided attention): Determines the spacing between attended positions in sparse patterns. A stride of $k$ means attending to every $k$-th position, reducing complexity while maintaining some global connectivity.
- pad_token_id: The token ID used for padding in tokenized sequences. This must match the padding token used during tokenization to correctly identify positions to mask.
Summary
Attention masking controls which positions can attend to which others, enabling autoregressive generation and efficient batch processing.
The key concepts from this chapter:
- Additive masking: Adding large negative values before softmax drives attention weights to near-zero, effectively blocking those positions
- Padding masks: Prevent attention to padding tokens when batching sequences of different lengths. Real tokens should not be influenced by meaningless padding values
- Causal masks: Block attention to future positions, enforcing left-to-right information flow for autoregressive language models. This enables parallel training while maintaining proper context constraints
- Mask combination: Multiple masks combine by addition. Any position blocked by any mask receives near-zero attention
- Broadcasting: Masks can have shapes like (1, 1, seq_len, seq_len) for global patterns or (batch, 1, 1, seq_len) for per-sequence patterns, with broadcasting handling dimension expansion
- Custom patterns: Local, strided, and sparse attention patterns reduce complexity for long sequences by limiting which positions can interact
In the next chapter, we'll explore multi-head attention, which runs multiple attention operations in parallel with different learned projections. This allows the model to attend to information from different representation subspaces at different positions.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about attention masking.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.