Learn how sliding window attention reduces transformer complexity from quadratic to linear by restricting attention to local neighborhoods, enabling efficient processing of long documents.

Sliding Window Attention
Standard attention computes all pairwise interactions between tokens, but most of that computation may be wasted. In natural language, context is often local: the meaning of a word depends heavily on its immediate neighbors. Consider the phrase "the quick brown fox jumps over the lazy dog." When determining what "jumps" means, the model needs "fox" (the subject) and "over" (what follows), but distant tokens like "the" or "dog" contribute less to understanding this particular word. Sliding window attention exploits this locality by restricting each token to attend only to tokens within a fixed-size window around it.
This chapter explores how sliding window attention works, why it reduces complexity from quadratic to linear, and how to implement it effectively. We will also examine the trade-offs: what do we lose when tokens cannot see beyond their local window, and how can we address these limitations?
The Locality Assumption
Before diving into the mechanics, let's understand why local attention makes sense. Natural language exhibits strong locality: adjacent words form phrases, phrases form clauses, and meaning often crystallizes within a narrow span.
Sliding window attention restricts each token to attending only to tokens within a fixed distance (the window size). Instead of $n^2$ attention weights, we compute approximately $n \cdot w$ weights, achieving linear complexity in sequence length.
Consider how you read this sentence. When your eye lands on "read," you immediately connect it to "you" (subject) and "this sentence" (object), all within a few tokens. You do not need to re-read the beginning of the paragraph to understand this verb. Sliding window attention captures this intuition by limiting attention to a local neighborhood.
The key insight is that attention patterns in trained transformers often concentrate locally. Research analyzing attention weights in models like BERT and GPT-2 found that many heads allocate most attention mass to nearby tokens. By building this locality bias into the architecture, we can dramatically reduce computation while preserving much of the model's representational power.
How Sliding Window Attention Works
To understand sliding window attention, we need to answer three questions: What does "local" mean mathematically? How do we enforce locality in the attention computation? And how does this change the algorithm's behavior? Let's build up the formalism step by step, starting from the intuition and arriving at a complete mathematical specification.
From Intuition to Neighborhoods
In standard attention, every token can attend to every other token. This creates a fully connected graph where information flows freely between all positions. But we observed that most meaningful connections occur between nearby tokens. The first step toward efficiency is formalizing what "nearby" means.
Consider token $i$ in a sequence. Which tokens should it attend to? The natural answer is: tokens within some fixed distance. If we set a window size $w$, we want token $i$ to attend only to tokens that are close enough. We measure closeness by position distance:

$$|i - j| \le \left\lfloor \frac{w}{2} \right\rfloor$$

where:
- $i$: the position of the query token (the one computing its representation)
- $j$: the position of the key token (a candidate for attention)
- $w$: the window size (total number of positions each token can attend to)
- $|i - j|$: the absolute distance between the two positions
- $\lfloor w/2 \rfloor$: half the window size (rounded down), giving the maximum distance in each direction

This formula defines a neighborhood around each token. With $w = 5$, token at position 10 can attend to positions 8, 9, 10, 11, and 12. The neighborhood extends $\lfloor w/2 \rfloor = 2$ positions in each direction, plus the token itself. Token at position 3 can attend to positions 1, 2, 3, 4, and 5. Every token has the same sized window, centered on itself.
Why floor instead of exact division? Window sizes are typically odd (like 3, 5, or 257) so the token sits exactly in the middle. For even windows, the floor gives a slight asymmetry, attending to one more position on one side than the other. This rarely matters in practice.
Translating Neighborhoods into Masks
Now we face a practical question: how do we tell the attention mechanism which positions each token should attend to? Standard attention computes a score between every query-key pair. We need a way to "zero out" the scores for positions outside the neighborhood.
The solution is an attention mask: a matrix that modifies the raw attention scores before softmax. The mask acts like a filter, allowing some connections and blocking others. For each pair of positions $(i, j)$, we assign a mask value:

$$M_{ij} = \begin{cases} 0 & \text{if } |i - j| \le \lfloor w/2 \rfloor \\ -\infty & \text{otherwise} \end{cases}$$

where:
- $M_{ij}$: the mask value at position $(i, j)$
- $0$: no modification (allow attention)
- $-\infty$: block this connection entirely

Why these particular values? The mask gets added to the raw attention scores before softmax. Adding 0 leaves the score unchanged, preserving normal attention behavior for positions within the window. Adding $-\infty$ ensures that after the exponential in softmax, the weight becomes exactly 0:

$$e^{-\infty} = 0$$

The blocked position contributes nothing to the output. In implementation, we use a large negative number such as $-10^9$ rather than true infinity to avoid numerical issues, but the effect is identical.
The Complete Attention Formula
With the mask defined, we can write the full sliding window attention computation. Start with the standard attention formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

This computes scores between all query-key pairs ($QK^\top$), normalizes them into weights (softmax), and uses those weights to aggregate values. Sliding window attention inserts the mask before softmax:

$$\text{SlidingWindowAttention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$
where:
- $Q$: the query matrix, with one row per token
- $K$: the key matrix, with one row per token
- $V$: the value matrix, containing the information to aggregate
- $M$: the sliding window mask
- $QK^\top$: the raw attention scores before masking
- $d_k$: the dimension of query and key vectors
- $\frac{1}{\sqrt{d_k}}$: the scaling factor that prevents dot products from growing too large
The computation proceeds in four steps:
- Compute raw scores: Multiply queries by keys to get $QK^\top$
- Apply mask: Add $M$ to block positions outside each token's window
- Normalize: Apply softmax row-wise to convert scores to weights
- Aggregate: Multiply weights by values to get the output
After masking and softmax, each row of the weight matrix sums to 1, but the non-zero weights are concentrated in a band around the diagonal. Token 5 has non-zero weights only for tokens 3-7 (with $w = 5$). Token 100 has non-zero weights only for tokens 98-102. The sparse pattern emerges from the mask, even though we computed all scores.
Window Patterns
Several window configurations are possible. The most common is the symmetric window, where each token looks an equal distance forward and backward: token $i$ attends to positions $i - \lfloor w/2 \rfloor$ through $i + \lfloor w/2 \rfloor$.
The symmetric window suits bidirectional models like BERT where context flows in both directions. The causal window is essential for autoregressive models like GPT, where tokens must not attend to future positions. Asymmetric windows can be useful when past context is more informative than future context, or vice versa.
Complexity Analysis
Understanding the complexity reduction is key to appreciating why sliding window attention matters. Standard attention computes $n^2$ attention scores because every token attends to every other token. With a sliding window of size $w$, each token attends to at most $w$ positions, yielding approximately $n \cdot w$ attention scores instead.
The critical insight is that $w$ is a fixed constant (typically 128-512), independent of sequence length. This transforms the complexity:

$$O(n^2 \cdot d) \;\longrightarrow\; O(n \cdot w \cdot d)$$

where:
- $n$: sequence length (the variable that grows with input size)
- $w$: window size (a fixed hyperparameter, constant with respect to $n$)
- $d$: model dimension (also fixed for a given model architecture)
Since both $w$ and $d$ are constants for a given model, the complexity is linear in $n$. This linear scaling means we can process sequences of arbitrary length with proportional, rather than quadratic, cost. A 10x longer sequence requires only 10x more computation, not 100x.
The speedup grows linearly with sequence length. At 512 tokens, sliding window provides a 2x speedup. At 16K tokens, the speedup reaches 64x. This difference becomes crucial for processing long documents, where standard attention would be prohibitively expensive.
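These figures follow from the ratio of score computations, $n^2 / (n \cdot w) = n / w$. The quick check below assumes a 256-token window, which is consistent with the speedups quoted above:

```python
# Speedup of sliding window attention over full attention, measured as the
# ratio of attention scores computed: n^2 versus n * w (assumed w = 256).
window = 256
for n in [512, 2_048, 16_384]:
    print(f"{n:>6} tokens: {n / window:5.1f}x fewer attention scores")
```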
Window Size Selection
Choosing the right window size involves balancing efficiency against the ability to capture dependencies. Smaller windows are faster but may miss important long-range relationships. Larger windows capture more context but reduce the computational advantage.
Factors Influencing Window Size
Several considerations guide window size selection:
-
Task requirements: Classification tasks often need only local features, while question answering may require longer-range reasoning. Summarization tasks might need to connect information across paragraphs.
-
Model depth: With multiple layers, information can propagate across the sequence. In a model with $L$ layers and window size $w$, information can theoretically travel $L \times \lfloor w/2 \rfloor$ positions in each direction through the network. This is because each layer allows a token to gather information from $\lfloor w/2 \rfloor$ positions on each side, and stacking layers compounds this reach. A 12-layer model with window size 128 can propagate information across $12 \times 64 = 768$ positions in each direction.
-
Computational budget: Larger windows cost more. If latency is critical, smaller windows enable faster inference.
-
Sequence length: The ratio of window size to sequence length matters. A 256-token window covers an 8K-token document poorly but might suffice for 1K-token inputs.
Empirical Guidance
Research on efficient transformers has explored various window sizes:
- Longformer: Uses window sizes of 256-512 tokens, combined with global attention on specific positions
- BigBird: Employs block sizes of 64 tokens with random and global attention patterns
- Mistral 7B: Uses a 4,096-token sliding window, covering substantial context for most applications
A useful heuristic is to start with a window size that captures typical sentence or paragraph lengths (128-512 tokens) and increase it if downstream task performance suffers.
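Before comparing specific configurations, the depth-times-half-window relationship can be tabulated with a small helper (a sketch; the name `receptive_field` is illustrative):

```python
def receptive_field(num_layers, window_size):
    """Approximate one-sided reach of stacked sliding-window layers."""
    return num_layers * (window_size // 2)

for layers in (6, 12, 24):
    for window in (64, 128, 256):
        print(f"{layers:>2} layers, window {window:>3}: "
              f"~{receptive_field(layers, window):>5} positions per side")
```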
The receptive field table shows how layer depth compensates for narrow windows. A 64-token window in a 24-layer model allows information to propagate 768 positions, while the same window in a 6-layer model reaches only 192 positions. This explains why deeper models can often use smaller windows without sacrificing performance.
Dilated Sliding Windows
Standard sliding windows have a limitation: the receptive field grows slowly with window size. To see positions 100 tokens away, you need either a window of 200 tokens (expensive) or many layers for information to propagate (slow). Dilated sliding windows offer a third option: spread the attention across a wider range by skipping intermediate positions.
Dilated sliding window attention attends to positions at regular intervals (the dilation rate) rather than consecutive positions. This expands the effective receptive field while keeping the number of attended positions constant.
The intuition comes from signal processing and computer vision. A dilated convolution samples every $r$-th position instead of consecutive positions, covering more ground with the same number of parameters. Dilated attention applies the same principle: instead of attending to your 5 nearest neighbors, attend to 5 positions spaced 2 apart.
How Dilation Works
Let's formalize this idea. In standard attention, the attended positions form a consecutive sequence centered on the query. With dilation, we introduce gaps between attended positions.
With dilation rate $r$ and $k$ attended positions, token $i$ attends to the following set of positions:

$$\left\{\, i + m \cdot r \;:\; m = -\left\lfloor \tfrac{k}{2} \right\rfloor, \ldots, -1, 0, 1, \ldots, \left\lfloor \tfrac{k}{2} \right\rfloor \,\right\}$$

where:
- $i$: the query position (the token computing its attention)
- $r$: the dilation rate (the gap between consecutive attended positions)
- $k$: the number of positions to attend to (playing the same role as $w$ in non-dilated attention)
- $\lfloor k/2 \rfloor$: the number of positions on each side of the center
The dilation rate acts as a multiplier on the offset from the center position. With $r = 1$ (no dilation), we get the standard consecutive window. With $r = 2$, we skip every other position, doubling the effective range while attending to the same number of positions.
For example, with $r = 2$ and $k = 5$, token at position 10 attends to positions $\{6, 8, 10, 12, 14\}$ rather than the consecutive window $\{8, 9, 10, 11, 12\}$. Both patterns attend to exactly 5 positions, but the dilated version covers a range of 8 positions instead of 4.
Multi-Scale Attention with Varying Dilation
A single dilation rate creates gaps in coverage: with $r = 2$, we miss all odd-positioned tokens. The solution is to use different dilation rates across attention heads. Some heads use $r = 1$ for fine-grained local attention, others use $r = 2$ for medium-range context, and still others use $r = 4$ or $r = 8$ for long-range dependencies.
This multi-scale approach, used in models like Longformer, creates a powerful combination. The union of all heads covers a broad range of positions, each head specializes in a particular scale, and the total computation remains sparse. Think of it like having multiple photographers at an event: one captures close-ups, another medium shots, and a third wide angles. Together, they document everything without any single photographer doing all the work.
The multi-scale approach extends the effective receptive field while maintaining sparse attention. With dilation rates of 1, 2, 4, and 8 and 5 positions per head, the model covers a range of 32 positions with 11 unique attention connections, achieving both broad coverage and computational efficiency.
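We can verify that coverage count with a few lines of Python (the dilation rates and the five offsets per head are the assumptions stated above):

```python
# Union of relative positions attended across heads with different dilation
# rates, assuming 5 attended positions per head (offsets -2..2 times the rate).
dilation_rates = [1, 2, 4, 8]
offsets = {m * r for r in dilation_rates for m in range(-2, 3)}

print(sorted(offsets))                                 # 11 unique relative offsets
print("unique connections:", len(offsets))             # 11
print("range covered:", max(offsets) - min(offsets))   # 32 positions
```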
Implementation
With the theory established, let's translate these ideas into working code. We'll build sliding window attention step by step, starting with the most straightforward approach and progressively optimizing. Along the way, you'll see how the mathematical formulas map directly to implementation choices.
Mask-Based Implementation
The most direct implementation follows the formula exactly: compute all attention scores, apply the mask, run softmax, and aggregate values. This approach prioritizes clarity over efficiency, making it ideal for understanding the mechanism.
First, we need helper functions. The softmax function normalizes scores into probability distributions. We implement a numerically stable version that subtracts the maximum before exponentiating to prevent overflow:
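The listing below is a minimal NumPy sketch of such a function; the name `softmax` and its signature are illustrative rather than taken from a library:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax: subtract the max before exponentiating."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
```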
The mask creation function iterates through each position and sets the mask to 0 for positions within the window and $-\infty$ (a large negative number in practice) for positions outside. Notice how the code directly implements our mathematical definition: for symmetric windows, we check if the distance is at most half_window; for causal windows, we also require that the key position is not in the future.
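A sketch of such a mask builder follows. The name `create_sliding_window_mask` is illustrative, and one detail is an assumption: the causal branch here keeps the previous `window_size` tokens (including the current one), matching the convention used later in the chapter.

```python
def create_sliding_window_mask(seq_len, window_size, causal=False):
    """Additive attention mask: 0 inside each token's window, -1e9 outside."""
    mask = np.full((seq_len, seq_len), -1e9)
    half_window = window_size // 2
    for i in range(seq_len):        # query position
        for j in range(seq_len):    # key position
            if causal:
                # Attend to the previous window_size tokens, never the future.
                if 0 <= i - j < window_size:
                    mask[i, j] = 0.0
            elif abs(i - j) <= half_window:
                # Symmetric window: within half_window positions on either side.
                mask[i, j] = 0.0
    return mask
```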
This implementation explicitly constructs the full mask, which works well for moderate sequence lengths. For very long sequences, we'll need a more efficient approach, but this version is perfect for understanding the pattern.
With the mask function ready, we can implement the complete attention computation:
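A minimal sketch of the mask-based computation, building on the helpers above (the function name and signature are illustrative):

```python
def sliding_window_attention(Q, K, V, window_size, causal=False):
    """Mask-based sliding window attention; materializes the full score matrix."""
    seq_len, d_k = Q.shape
    # 1. Raw scores between all query-key pairs, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # 2. Apply the mask: 0 inside each window, a large negative value outside.
    scores = scores + create_sliding_window_mask(seq_len, window_size, causal)
    # 3. Row-wise softmax turns masked scores into attention weights.
    weights = softmax(scores, axis=-1)
    # 4. Aggregate values with the weights.
    return weights @ V, weights
```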
This function follows our four-step process exactly: compute scores ($QK^\top$), divide by $\sqrt{d_k}$, add the mask, apply softmax, and multiply by values. The function returns both the output and the attention weights, which we'll inspect to verify the windowed pattern.
Let's test this implementation with a concrete example. We'll create random queries, keys, and values, then examine the attention pattern:
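A small driver along these lines (sizes chosen purely for illustration) exercises the function and inspects one row of the weight matrix:

```python
rng = np.random.default_rng(0)
seq_len, d_k = 16, 32
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

output, weights = sliding_window_attention(Q, K, V, window_size=5)

# Token 8 should place all of its attention mass on positions 6-10.
print("non-zero positions for token 8:", np.nonzero(weights[8] > 1e-6)[0])
print("weights for token 8:", np.round(weights[8, 6:11], 3))
```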
The attention weights confirm that token 8 attends only to positions 6-10 (a window of 5 centered on position 8), with all other weights effectively zero.
Efficient Block-Based Implementation
The mask-based implementation has a problem: it still computes and stores the full attention matrix, even though most entries are zero. For a 10,000-token sequence, we'd allocate 100 million floats just for the attention weights. This defeats the purpose of sliding window attention.
The solution is to never materialize the full matrix. Instead, we process each position independently, computing attention only over its local window. This insight transforms memory complexity from $O(n^2)$ to $O(n \cdot w)$, where $w$ is the window size:
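A per-position sketch of this idea, reusing the helpers above (the function name is again illustrative), looks like this:

```python
def sliding_window_attention_efficient(Q, K, V, window_size, causal=False):
    """Sliding window attention without ever building the n-by-n matrix."""
    seq_len, d_k = Q.shape
    half_window = window_size // 2
    output = np.zeros_like(V)
    for i in range(seq_len):
        # Window boundaries for this position (clipped at the sequence edges).
        if causal:
            start, end = max(0, i - window_size + 1), i + 1
        else:
            start, end = max(0, i - half_window), min(seq_len, i + half_window + 1)
        # Scores over at most window_size positions, scaled by sqrt(d_k).
        scores = Q[i] @ K[start:end].T / np.sqrt(d_k)
        weights = softmax(scores)
        # Aggregate only the values inside the window.
        output[i] = weights @ V[start:end]
    return output
```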
The key insight is in the loop body. For each position $i$, we compute the window boundaries, extract only the relevant keys and values, compute attention scores over just $w$ positions, and aggregate. Each iteration uses $O(w)$ memory instead of $O(n)$, and we never store the full attention matrix.
This approach trades sequential processing for memory efficiency. In practice, production implementations use batched operations and GPU-optimized kernels to process multiple positions in parallel while still avoiding the full matrix.
Let's verify that both implementations produce identical results:
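Reusing the Q, K, and V from the example above, a quick comparison might look like this:

```python
out_masked, _ = sliding_window_attention(Q, K, V, window_size=5)
out_blocked = sliding_window_attention_efficient(Q, K, V, window_size=5)

# Both paths should agree up to floating-point round-off.
print("max absolute difference:", np.max(np.abs(out_masked - out_blocked)))
```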
The negligible difference (at the level of floating-point round-off) confirms that both implementations compute identical results. The efficient version achieves the same output while using significantly less memory for long sequences.
Causal Sliding Window for Language Models
For autoregressive language models like GPT, we need an additional constraint: tokens must not attend to future positions. If token 10 could see token 11, it would be "cheating" during text generation, looking at words it's supposed to predict.
Combining causality with sliding windows creates a pattern where each token sees at most $w$ previous tokens, including itself. The window is asymmetric: all context comes from the past. Token 10 with $w = 4$ attends to tokens 7, 8, 9, and 10, never tokens 11 and beyond:
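Using the mask builder sketched earlier with `causal=True` (the sizes here are illustrative), we can confirm the pattern:

```python
# Causal sliding window: each token sees at most window_size previous tokens.
causal_mask = create_sliding_window_mask(seq_len=16, window_size=4, causal=True)

# Row 10 of the mask is 0 where attention is allowed and -1e9 where blocked.
allowed = np.nonzero(causal_mask[10] == 0)[0]
print("token 10 attends to:", allowed)   # expected: [ 7  8  9 10]
```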
Mistral-Style Windowed Attention
Mistral 7B popularized sliding window attention in modern large language models. Their approach uses a 4,096-token window size, which covers most practical context lengths while enabling efficient inference.
Key Design Choices
Mistral's implementation incorporates several practical considerations:
- Large window size (4096): Covers entire documents in many use cases, reducing the need for global tokens
- Rolling buffer KV cache: During generation, only the most recent $w$ key-value pairs are stored
- Pre-fill and chunking: Long prompts are processed in chunks to manage memory
The rolling buffer approach is particularly elegant. Traditional transformers store all past key-value pairs, requiring memory that grows with sequence length:

$$\text{Memory}_{\text{full}} = O(n \cdot L \cdot d)$$

where:
- $\text{Memory}_{\text{full}}$: memory required for the full KV cache
- $n$: current sequence length (grows during generation)
- $L$: number of transformer layers
- $d$: model dimension
With sliding window attention, we only store the most recent $w$ key-value pairs, giving constant memory usage:

$$\text{Memory}_{\text{rolling}} = O(w \cdot L \cdot d)$$

where $w$ is the window size. Since $w$, $L$, and $d$ are all constants for a given model, memory usage becomes independent of sequence length. This is the key insight that enables generating arbitrarily long sequences without running out of memory.
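A toy rolling buffer makes the eviction behavior concrete. The class below is a sketch for illustration, not Mistral's implementation; the name `RollingKVCache` and its methods are assumptions:

```python
class RollingKVCache:
    """Minimal sketch of a circular KV cache holding at most window_size entries."""

    def __init__(self, window_size, d_model):
        self.window_size = window_size
        self.keys = np.zeros((window_size, d_model))
        self.values = np.zeros((window_size, d_model))
        self.slot_position = [None] * window_size  # which token each slot holds
        self.num_seen = 0                          # total tokens added so far

    def add(self, key, value):
        slot = self.num_seen % self.window_size    # overwrite the oldest slot
        self.keys[slot] = key
        self.values[slot] = value
        self.slot_position[slot] = self.num_seen
        self.num_seen += 1

    def cached_positions(self):
        return sorted(p for p in self.slot_position if p is not None)


cache = RollingKVCache(window_size=4, d_model=8)
for t in range(6):                                 # add 6 tokens to a 4-slot cache
    cache.add(np.full(8, float(t)), np.full(8, float(t)))
print(cache.cached_positions())                    # [2, 3, 4, 5]
```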
After adding 6 tokens to a cache with window size 4, tokens 0 and 1 have been evicted. The cache now contains only tokens 2, 3, 4, and 5, demonstrating how the circular buffer automatically discards the oldest entries when it reaches capacity. This behavior mirrors what happens during autoregressive generation in models like Mistral.
Memory Comparison
The rolling buffer approach provides substantial memory savings for long sequences:
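The sketch below reproduces the comparison. The layer count and model dimension are assumptions chosen for illustration (not Mistral's exact KV dimensions, which grouped-query attention reduces further); the relative savings depend only on the ratio of sequence length to window size:

```python
def kv_cache_bytes(seq_len, num_layers, d_model, window_size=None, dtype_bytes=2):
    """Approximate KV cache size: one key and one value tensor per layer."""
    cached = seq_len if window_size is None else min(seq_len, window_size)
    return 2 * num_layers * cached * d_model * dtype_bytes

num_layers, d_model, window = 32, 4096, 4096       # illustrative settings
for n in [1_024, 4_096, 16_384, 65_536]:
    full = kv_cache_bytes(n, num_layers, d_model)
    rolling = kv_cache_bytes(n, num_layers, d_model, window_size=window)
    savings = 1 - rolling / full
    print(f"{n:>6} tokens: full {full / 2**30:5.2f} GiB, "
          f"rolling {rolling / 2**30:5.2f} GiB, savings {savings:6.2%}")
```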
The memory comparison reveals the scaling advantage of the rolling buffer. At 1,024 tokens (below the window size), both approaches use the same memory. At 4,096 tokens (exactly the window size), they remain equivalent. Beyond that point, the savings become dramatic: at 16K tokens, the rolling buffer uses 75% less memory, and at 64K tokens, savings reach over 93%.
This enables processing long documents on hardware that could not otherwise accommodate them.
Limitations and Impact
Sliding window attention achieves linear complexity by making a fundamental trade-off: tokens beyond the window cannot directly attend to each other. This creates important limitations that you must consider when deploying these models.
The most significant challenge is capturing long-range dependencies. Consider a question-answering task where the answer appears at the beginning of a long document and the question at the end. With a window of 512 tokens and a document of 10,000 tokens, direct attention between question and answer is impossible. Information must propagate through intermediate layers, potentially degrading or losing signal along the way. Experiments have shown that sliding window models perform well on tasks where relevant context is local but struggle when important information spans large distances.
Another limitation relates to positional awareness. Standard transformers with full attention can learn to recognize absolute positions through attention patterns. With sliding window attention, each token sees the same local structure regardless of its absolute position in the sequence. While positional embeddings help, some tasks that require precise long-range position tracking may suffer.
Despite these limitations, sliding window attention has proven remarkably effective in practice. Modern language models like Mistral combine it with large windows (4,096 tokens) that cover most practical contexts. For document understanding, the combination of sliding windows with global attention tokens (covered in the next chapter) addresses most long-range needs while maintaining efficiency. The approach has enabled scaling to context lengths that would be impractical with full attention, opening new applications in document processing, code analysis, and long-form text generation.
The practical impact extends beyond computational savings. Standard attention requires storing the full attention matrix, giving $O(n^2)$ memory complexity. Sliding window attention stores only the non-zero entries within each window, reducing memory to $O(n \cdot w)$. Since $w$ is constant, this is effectively linear in sequence length. A model with full attention at 16K tokens might require a high-end GPU, while the same model with sliding window attention can run on modest hardware. This democratization of long-context processing has accelerated both research and deployment of language models for real-world applications.
Summary
Sliding window attention addresses the quadratic complexity of standard attention by restricting each token to attend only within a local window. The core ideas and takeaways from this chapter:
-
Linear complexity: By attending to a fixed window of $w$ positions, complexity drops from $O(n^2 \cdot d)$ to $O(n \cdot w \cdot d)$, enabling efficient processing of long sequences.
-
Window patterns: Symmetric windows work for bidirectional models, while causal windows suit autoregressive generation. Asymmetric and dilated windows offer additional flexibility.
-
Effective receptive field: Through multiple layers, information propagates beyond the immediate window. A model with $L$ layers and window $w$ has an effective receptive field of approximately $L \times \lfloor w/2 \rfloor$ positions in each direction, allowing distant tokens to influence each other indirectly.
-
Implementation efficiency: Memory-efficient implementations avoid materializing the full attention matrix, instead computing attention block by block.
-
Rolling KV cache: For autoregressive generation, a circular buffer stores only the most recent key-value pairs, providing constant memory usage regardless of sequence length.
-
Trade-offs: The locality assumption sacrifices direct long-range attention. Tasks requiring connections across distant positions may need additional mechanisms like global attention tokens.
-
Practical success: Models like Mistral demonstrate that sliding window attention with large windows (4,096+ tokens) provides excellent performance on most tasks while enabling efficient inference.
The next chapter explores global attention tokens, which complement sliding window attention by providing a mechanism for information to flow across the entire sequence, combining the efficiency of local attention with the expressiveness of global connectivity.
Key Parameters
When implementing or configuring sliding window attention, several parameters directly impact both performance and computational efficiency:
-
window_size: The number of positions each token can attend to. Larger windows capture more context but increase computation. Common values range from 128-512 for efficient models to 4,096 for models like Mistral. Choose based on the expected range of relevant dependencies in your task.
-
causal: Whether to use causal masking (True for autoregressive language models, False for bidirectional models like BERT). Causal masking ensures tokens cannot attend to future positions, which is required for text generation.
-
dilation_rate: For dilated sliding windows, controls the gap between attended positions. A rate of 1 gives standard consecutive attention; rate 2 skips every other position, doubling the effective range. Use multiple dilation rates across heads for multi-scale attention.
-
num_layers: The number of transformer layers affects the effective receptive field. With $L$ layers and window $w$, information can propagate approximately $L \times \lfloor w/2 \rfloor$ positions in each direction. Deeper models can use smaller windows.
-
d_k / d_head: The dimension of query and key vectors. Standard values follow $d_k = d_{\text{model}} / h$, where $h$ is the number of heads. This affects the $\frac{1}{\sqrt{d_k}}$ scaling factor in attention score computation.
-
dtype_bytes: Memory precision for KV cache storage. Use 2 bytes for fp16/bf16 (recommended) or 4 bytes for fp32. Half precision halves memory usage with negligible quality impact.