Sliding Window Attention: Linear Complexity for Long Sequences

Michael Brenndoerfer · Updated June 26, 2025 · 39 min read

Learn how sliding window attention reduces transformer complexity from quadratic to linear by restricting attention to local neighborhoods, enabling efficient processing of long documents.


Sliding Window Attention

Standard attention computes all $n^2$ pairwise interactions between tokens, but most of that computation may be wasted. In natural language, context is often local: the meaning of a word depends heavily on its immediate neighbors. Consider the phrase "the quick brown fox jumps over the lazy dog." When determining what "jumps" means, the model needs "fox" (the subject) and "over" (what follows), but distant tokens like "the" or "dog" contribute less to understanding this particular word. Sliding window attention exploits this locality by restricting each token to attend only to tokens within a fixed-size window around it.

This chapter explores how sliding window attention works, why it reduces complexity from quadratic to linear, and how to implement it effectively. We will also examine the trade-offs: what do we lose when tokens cannot see beyond their local window, and how can we address these limitations?

The Locality Assumption

Before diving into the mechanics, let's understand why local attention makes sense. Natural language exhibits strong locality: adjacent words form phrases, phrases form clauses, and meaning often crystallizes within a narrow span.

Sliding Window Attention

Sliding window attention restricts each token to attending only to tokens within a fixed distance $w$ (the window size). Instead of $n^2$ attention weights, we compute approximately $n \cdot w$ weights, achieving linear complexity in sequence length.

Consider how you read this sentence. When your eye lands on "read," you immediately connect it to "you" (subject) and "this sentence" (object), all within a few tokens. You do not need to re-read the beginning of the paragraph to understand this verb. Sliding window attention captures this intuition by limiting attention to a local neighborhood.

The key insight is that attention patterns in trained transformers often concentrate locally. Research analyzing attention weights in models like BERT and GPT-2 found that many heads allocate most attention mass to nearby tokens. By building this locality bias into the architecture, we can dramatically reduce computation while preserving much of the model's representational power.

How Sliding Window Attention Works

To understand sliding window attention, we need to answer three questions: What does "local" mean mathematically? How do we enforce locality in the attention computation? And how does this change the algorithm's behavior? Let's build up the formalism step by step, starting from the intuition and arriving at a complete mathematical specification.

From Intuition to Neighborhoods

In standard attention, every token can attend to every other token. This creates a fully connected graph where information flows freely between all positions. But we observed that most meaningful connections occur between nearby tokens. The first step toward efficiency is formalizing what "nearby" means.

Consider token ii in a sequence. Which tokens should it attend to? The natural answer is: tokens within some fixed distance. If we set a window size ww, we want token ii to attend only to tokens jj that are close enough. We measure closeness by position distance:

$$|i - j| \leq \lfloor w/2 \rfloor$$

where:

  • $i$: the position of the query token (the one computing its representation)
  • $j$: the position of the key token (a candidate for attention)
  • $w$: the window size (total number of positions each token can attend to)
  • $|i - j|$: the absolute distance between the two positions
  • $\lfloor w/2 \rfloor$: half the window size (rounded down), giving the maximum distance in each direction

This formula defines a neighborhood around each token. With $w = 5$, the token at position 10 can attend to positions 8, 9, 10, 11, and 12. The neighborhood extends $\lfloor 5/2 \rfloor = 2$ positions in each direction, plus the token itself. The token at position 3 can attend to positions 1, 2, 3, 4, and 5. Every token has the same-sized window, centered on itself.

Why floor instead of exact division? Window sizes are typically odd (like 3, 5, or 257) so the token sits exactly in the middle. For even windows, the floor gives a slight asymmetry, attending to one more position on one side than the other. This rarely matters in practice.
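
As a quick check of this definition, here is a minimal sketch (plain Python; the function name is illustrative) that lists the positions inside a symmetric window:

Code
def window_positions(i, w, seq_len):
    """Positions token i may attend to with a symmetric window of size w."""
    half = w // 2
    return [j for j in range(seq_len) if abs(i - j) <= half]


print(window_positions(10, 5, 20))  # [8, 9, 10, 11, 12]
print(window_positions(3, 5, 20))   # [1, 2, 3, 4, 5]
print(window_positions(0, 5, 20))   # [0, 1, 2] -- truncated at the sequence boundary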

Translating Neighborhoods into Masks

Now we face a practical question: how do we tell the attention mechanism which positions each token should attend to? Standard attention computes a score between every query-key pair. We need a way to "zero out" the scores for positions outside the neighborhood.

The solution is an attention mask: a matrix that modifies the raw attention scores before softmax. The mask acts like a filter, allowing some connections and blocking others. For each pair of positions $(i, j)$, we assign a mask value:

$$M_{ij} = \begin{cases} 0 & \text{if } |i - j| \leq \lfloor w/2 \rfloor \\ -\infty & \text{otherwise} \end{cases}$$

where:

  • $M_{ij}$: the mask value at position $(i, j)$
  • $0$: no modification (allow attention)
  • $-\infty$: block this connection entirely

Why these particular values? The mask gets added to the raw attention scores before softmax. Adding 0 leaves the score unchanged, preserving normal attention behavior for positions within the window. Adding $-\infty$ ensures that after the exponential in softmax, the weight becomes exactly 0:

$$\text{softmax}(\ldots + (-\infty)) = \frac{e^{-\infty}}{\sum} = \frac{0}{\sum} = 0$$

The blocked position contributes nothing to the output. In implementation, we use a large negative number like $-10^9$ rather than true infinity to avoid numerical issues, but the effect is identical.
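
A tiny numerical check (a NumPy sketch; the scores are made up) shows the same effect with the $-10^9$ stand-in: blocked positions receive essentially zero weight after softmax.

Code
import numpy as np

scores = np.array([2.0, 1.0, 0.5, -1.0])  # raw attention scores for one query
mask = np.array([0.0, 0.0, -1e9, -1e9])   # block the last two positions

masked = scores + mask
weights = np.exp(masked - masked.max())
weights /= weights.sum()
print(weights.round(3))  # [0.731 0.269 0.    0.   ] -- blocked positions get zero weight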

The Complete Attention Formula

With the mask defined, we can write the full sliding window attention computation. Start with the standard attention formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This computes scores between all query-key pairs ($QK^T$), normalizes them into weights (softmax), and uses those weights to aggregate values. Sliding window attention inserts the mask before softmax:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T + M}{\sqrt{d_k}}\right)V$$

where:

  • $Q \in \mathbb{R}^{n \times d_k}$: the query matrix, with one row per token
  • $K \in \mathbb{R}^{n \times d_k}$: the key matrix, with one row per token
  • $V \in \mathbb{R}^{n \times d_v}$: the value matrix, containing information to aggregate
  • $M \in \mathbb{R}^{n \times n}$: the sliding window mask
  • $QK^T \in \mathbb{R}^{n \times n}$: the raw attention scores before masking
  • $d_k$: the dimension of query and key vectors
  • $\sqrt{d_k}$: scaling factor to prevent dot products from growing too large

The computation proceeds in four steps:

  1. Compute raw scores: Multiply queries by keys to get $n \times n$ scores
  2. Apply mask: Add $M$ to block positions outside each token's window
  3. Normalize: Apply softmax row-wise to convert scores to weights
  4. Aggregate: Multiply weights by values to get the output

After masking and softmax, each row of the weight matrix sums to 1, but the non-zero weights are concentrated in a band around the diagonal. Token 5 has non-zero weights only for tokens 3-7 (with $w=5$). Token 100 has non-zero weights only for tokens 98-102. The sparse pattern emerges from the mask, even though we computed all $n^2$ scores.

Out[2]:
Visualization
Dense heatmap showing all positions have non-zero attention weights.
Full attention: Every token attends to every other token, creating a dense n x n matrix.
Banded heatmap showing attention concentrated along the diagonal.
Sliding window attention (w=5): Each token attends only to its local neighborhood, creating a sparse banded pattern.

Window Patterns

Several window configurations are possible. The most common is the symmetric window where each token looks an equal distance forward and backward:

Out[3]:
Visualization
Matrix showing diagonal band of attention weights centered on the main diagonal.
Symmetric window (w=5): Each token attends to 2 tokens on each side plus itself.
Matrix showing lower triangular band below and including the diagonal.
Causal window (w=3): Each token attends only to itself and 2 previous tokens.
Matrix showing asymmetric band with more left context than right.
Asymmetric window: Different left and right context sizes.

The symmetric window suits bidirectional models like BERT where context flows in both directions. The causal window is essential for autoregressive models like GPT, where tokens must not attend to future positions. Asymmetric windows can be useful when past context is more informative than future context, or vice versa.
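
The symmetric and causal variants are implemented later in this chapter. For the asymmetric case, a mask with separate left and right context sizes can be built the same way; the sketch below (illustrative function and parameter names) is one way to express it:

Code
import numpy as np


def asymmetric_window_mask(seq_len, left, right):
    """0 where attention is allowed, -1e9 elsewhere.

    left/right are how many past/future positions each token may see;
    left == right gives a symmetric window, right == 0 gives a causal one.
    """
    i = np.arange(seq_len)[:, None]  # query positions (rows)
    j = np.arange(seq_len)[None, :]  # key positions (columns)
    allowed = (j >= i - left) & (j <= i + right)
    return np.where(allowed, 0.0, -1e9)


mask = asymmetric_window_mask(8, left=3, right=1)
print(np.nonzero(mask[4] == 0)[0])  # token 4 sees positions [1 2 3 4 5]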

Complexity Analysis

Understanding the complexity reduction is key to appreciating why sliding window attention matters. Standard attention computes $n \times n$ attention scores because every token attends to every other token. With a sliding window of size $w$, each token attends to at most $w$ positions, yielding approximately $n \cdot w$ attention scores instead.

The critical insight is that $w$ is a fixed constant (typically 128-512), independent of sequence length. This transforms the complexity:

$$O(n^2 d) \to O(n \cdot w \cdot d) = O(n \cdot d)$$

where:

  • $n$: sequence length (the variable that grows with input size)
  • $w$: window size (a fixed hyperparameter, constant with respect to $n$)
  • $d$: model dimension (also fixed for a given model architecture)

Since both $w$ and $d$ are constants for a given model, the complexity is linear in $n$. This linear scaling means we can process sequences of arbitrary length with proportional, rather than quadratic, cost. A 10x longer sequence requires only 10x more computation, not 100x.

In[4]:
Code
def attention_complexity(n, d, window_size=None):
    """
    Compute attention complexity for standard vs windowed attention.

    Args:
        n: Sequence length
        d: Model dimension
        window_size: If provided, use sliding window; otherwise full attention

    Returns:
        Number of operations (approximately)
    """
    if window_size is None:
        # Standard attention: O(n^2 * d)
        return n * n * d
    else:
        # Sliding window: O(n * w * d)
        return n * window_size * d
In[5]:
Code
# Compare complexity across sequence lengths
d = 768  # GPT-2 small dimension
window = 256

seq_lengths = [512, 1024, 2048, 4096, 8192, 16384]
results = []

for n in seq_lengths:
    standard = attention_complexity(n, d)
    windowed = attention_complexity(n, d, window)
    speedup = standard / windowed
    results.append(
        {
            "n": n,
            "standard_gflops": standard / 1e9,
            "windowed_gflops": windowed / 1e9,
            "speedup": speedup,
        }
    )
Out[6]:
Console
Complexity Comparison (d=768, window=256):

   Seq Len        Standard        Windowed    Speedup
----------------------------------------------------
       512           0.20G           0.10G        2.0x
     1,024           0.81G           0.20G        4.0x
     2,048           3.22G           0.40G        8.0x
     4,096          12.88G           0.81G       16.0x
     8,192          51.54G           1.61G       32.0x
    16,384         206.16G           3.22G       64.0x

The speedup grows linearly with sequence length. At 512 tokens, sliding window provides a 2x speedup. At 16K tokens, the speedup reaches 64x. This difference becomes crucial for processing long documents, where standard attention would be prohibitively expensive.

Out[7]:
Visualization
Log-scale line plot showing quadratic growth of standard attention versus linear growth of sliding window attention.
Complexity comparison between standard attention and sliding window attention (window=256). Standard attention grows quadratically while sliding window grows linearly, with the gap widening as sequence length increases.

Window Size Selection

Choosing the right window size involves balancing efficiency against the ability to capture dependencies. Smaller windows are faster but may miss important long-range relationships. Larger windows capture more context but reduce the computational advantage.

Factors Influencing Window Size

Several considerations guide window size selection:

  • Task requirements: Classification tasks often need only local features, while question answering may require longer-range reasoning. Summarization tasks might need to connect information across paragraphs.

  • Model depth: With multiple layers, information can propagate across the sequence. In a model with $L$ layers and window size $w$, information can theoretically travel $L \cdot (w/2)$ positions in each direction through the network. This is because each layer allows a token to gather information from $w/2$ positions on each side, and stacking $L$ layers compounds this reach. A 12-layer model with window size 128 can propagate information across $12 \times 64 = 768$ positions in each direction.

  • Computational budget: Larger windows cost more. If latency is critical, smaller windows enable faster inference.

  • Sequence length: The ratio of window size to sequence length matters. A 256-token window covers an 8K-token document poorly but might suffice for 1K-token inputs.

Empirical Guidance

Research on efficient transformers has explored various window sizes:

  • Longformer: Uses window sizes of 256-512 tokens, combined with global attention on specific positions
  • BigBird: Employs block sizes of 64 tokens with random and global attention patterns
  • Mistral 7B: Uses a 4,096-token sliding window, covering substantial context for most applications

A useful heuristic is to start with a window size that captures typical sentence or paragraph lengths (128-512 tokens) and increase it if downstream task performance suffers.

In[8]:
Code
def effective_receptive_field(num_layers, window_size):
    """
    Calculate the maximum distance information can travel.

    With L layers and window w, token i can receive information
    from token j if |i - j| <= L * (w // 2) for symmetric windows.
    """
    return num_layers * (window_size // 2)


# Example: how far can information travel?
window_sizes = [64, 128, 256, 512]
layer_counts = [6, 12, 24]
Out[9]:
Console
Effective Receptive Field (positions):

    Window        6 layers       12 layers       24 layers
-----------------------------------------------------------
        64             192             384             768
       128             384             768           1,536
       256             768           1,536           3,072
       512           1,536           3,072           6,144

The receptive field table shows how layer depth compensates for narrow windows. A 64-token window in a 24-layer model allows information to propagate 768 positions, while the same window in a 6-layer model reaches only 192 positions. This explains why deeper models can often use smaller windows without sacrificing performance.

Out[10]:
Visualization
Line plot showing linear growth of receptive field with layer count for different window sizes.
Effective receptive field growth across layers. With a window size of 128, each layer expands the receptive field by 64 positions in each direction. Deeper models achieve broader information propagation without increasing per-layer computation.

Dilated Sliding Windows

Standard sliding windows have a limitation: the receptive field grows slowly with window size. To see positions 100 tokens away, you need either a window of 200 tokens (expensive) or many layers for information to propagate (slow). Dilated sliding windows offer a third option: spread the attention across a wider range by skipping intermediate positions.

Dilated Attention

Dilated sliding window attention attends to positions at regular intervals (the dilation rate) rather than consecutive positions. This expands the effective receptive field while keeping the number of attended positions constant.

The intuition comes from signal processing and computer vision. A dilated convolution samples every rrth position instead of consecutive positions, covering more ground with the same number of parameters. Dilated attention applies the same principle: instead of attending to your 5 nearest neighbors, attend to 5 positions spaced 2 apart.

How Dilation Works

Let's formalize this idea. In standard attention, the attended positions form a consecutive sequence centered on the query. With dilation, we introduce gaps between attended positions.

With dilation rate $r$ and $k$ attended positions, token $i$ attends to the following set of positions:

$$\{i - r \cdot k/2, \, i - r \cdot (k/2 - 1), \, \ldots, \, i, \, \ldots, \, i + r \cdot (k/2 - 1), \, i + r \cdot k/2\}$$

where:

  • $i$: the query position (the token computing its attention)
  • $r$: the dilation rate (the gap between consecutive attended positions)
  • $k$: the number of positions to attend to (same as in non-dilated attention)
  • $k/2$: the number of positions on each side of the center

The dilation rate $r$ acts as a multiplier on the offset from the center position. With $r = 1$ (no dilation), we get the standard consecutive window. With $r = 2$, we skip every other position, doubling the effective range while attending to the same number of positions.

For example, with $k = 5$ and $r = 2$, the token at position 10 attends to positions $\{6, 8, 10, 12, 14\}$ rather than the consecutive window $\{8, 9, 10, 11, 12\}$. Both patterns attend to exactly 5 positions, but the dilated version covers a range of 8 positions instead of 4.

Out[11]:
Visualization
Attention matrix showing narrow diagonal band of consecutive positions.
Standard window (dilation=1): Attends to 5 consecutive positions.
Attention matrix showing wider diagonal pattern with gaps between attended positions.
Dilated window (dilation=2): Attends to 5 positions with gaps, covering twice the range.
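
The figure above shows both patterns. A short sketch (the function name is illustrative) reproduces the attended positions for this example:

Code
def dilated_positions(i, k, r, seq_len):
    """Positions attended by token i with k attended positions and dilation rate r."""
    half = k // 2
    candidates = [i + r * offset for offset in range(-half, half + 1)]
    return [p for p in candidates if 0 <= p < seq_len]


print(dilated_positions(10, k=5, r=1, seq_len=32))  # [8, 9, 10, 11, 12]
print(dilated_positions(10, k=5, r=2, seq_len=32))  # [6, 8, 10, 12, 14]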

Multi-Scale Attention with Varying Dilation

A single dilation rate creates gaps in coverage: with $r=2$, each query skips every other position in its neighborhood. The solution is to use different dilation rates across attention heads. Some heads use $r=1$ for fine-grained local attention, others use $r=2$ for medium-range context, and still others use $r=4$ or $r=8$ for long-range dependencies.

This multi-scale approach, used in models like Longformer, creates a powerful combination. The union of all heads covers a broad range of positions, each head specializes in a particular scale, and the total computation remains sparse. Think of it like having multiple photographers at an event: one captures close-ups, another medium shots, and a third wide angles. Together, they document everything without any single photographer doing all the work.

In[12]:
Code
def multi_scale_receptive_field(dilations, k):
    """
    Calculate coverage of multi-scale dilated attention.

    Args:
        dilations: List of dilation rates for different heads
        k: Number of positions each head attends to

    Returns:
        Total range covered by union of all heads
    """
    all_positions = set()
    center = 1000  # Arbitrary center position
    half_k = k // 2

    for d in dilations:
        for offset in range(-half_k, half_k + 1):
            all_positions.add(center + offset * d)

    min_pos = min(all_positions)
    max_pos = max(all_positions)
    return max_pos - min_pos, len(all_positions)


# Example: 4 heads with different dilations
dilations = [1, 2, 4, 8]
k = 5
range_covered, unique_positions = multi_scale_receptive_field(dilations, k)
Out[13]:
Console
Multi-scale dilated attention (k=5 positions per head):

Dilation rates: [1, 2, 4, 8]
Total range covered: 32 positions
Unique positions attended: 11
Coverage density: 33.3%

The multi-scale approach extends the effective receptive field while maintaining sparse attention. With dilation rates of $\{1, 2, 4, 8\}$ and 5 positions per head, the model covers a range of 32 positions with 11 unique attention connections, achieving both broad coverage and computational efficiency.

Out[14]:
Visualization
Horizontal bar showing attended positions at different dilation rates, with overlapping coverage at the center.
Multi-scale dilated attention coverage from a single query position (center). Each head attends to 5 positions at different dilation rates. The bottom row shows combined coverage where darker colors indicate positions attended by multiple heads.

Implementation

With the theory established, let's translate these ideas into working code. We'll build sliding window attention step by step, starting with the most straightforward approach and progressively optimizing. Along the way, you'll see how the mathematical formulas map directly to implementation choices.

Mask-Based Implementation

The most direct implementation follows the formula exactly: compute all attention scores, apply the mask, run softmax, and aggregate values. This approach prioritizes clarity over efficiency, making it ideal for understanding the mechanism.

First, we need helper functions. The softmax function normalizes scores into probability distributions. We implement a numerically stable version that subtracts the maximum before exponentiating to prevent overflow:

In[15]:
Code
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


def create_sliding_window_mask(seq_len, window_size, causal=False):
    """
    Create a sliding window attention mask.

    Args:
        seq_len: Length of the sequence
        window_size: Size of the attention window
        causal: If True, mask future positions (for autoregressive models)

    Returns:
        Mask of shape (seq_len, seq_len) with 0 for allowed positions and
        a large negative value (-1e9, effectively -inf) for masked positions
    """
    mask = np.full((seq_len, seq_len), -1e9)
    half_window = window_size // 2

    for i in range(seq_len):
        if causal:
            # Only attend to current and previous positions within window
            start = max(0, i - window_size + 1)
            end = i + 1
        else:
            # Symmetric window
            start = max(0, i - half_window)
            end = min(seq_len, i + half_window + 1)

        mask[i, start:end] = 0

    return mask

The mask creation function iterates through each position and sets the mask to 0 for positions within the window and $-10^9$ for positions outside. Notice how the code directly implements our mathematical definition: for symmetric windows, we check if the distance is at most half_window; for causal windows, we also require that the key position is not in the future.

This implementation explicitly constructs the full $n \times n$ mask, which works well for moderate sequence lengths. For very long sequences, we'll need a more efficient approach, but this version is perfect for understanding the pattern.

With the mask function ready, we can implement the complete attention computation:

In[16]:
Code
def sliding_window_attention(Q, K, V, window_size, causal=False):
    """
    Compute sliding window attention.

    Args:
        Q: Query matrix of shape (seq_len, d_k)
        K: Key matrix of shape (seq_len, d_k)
        V: Value matrix of shape (seq_len, d_v)
        window_size: Size of the attention window
        causal: If True, use causal masking

    Returns:
        Output of shape (seq_len, d_v)
    """
    seq_len, d_k = Q.shape

    # Compute attention scores
    scores = Q @ K.T / np.sqrt(d_k)

    # Apply sliding window mask
    mask = create_sliding_window_mask(seq_len, window_size, causal)
    masked_scores = scores + mask

    # Apply softmax
    weights = softmax(masked_scores, axis=-1)

    # Compute weighted sum of values
    output = weights @ V

    return output, weights

This function follows our four-step process exactly: compute the scaled scores (Q @ K.T divided by $\sqrt{d_k}$), add the mask, apply softmax, and multiply by values. The function returns both the output and the attention weights, which we'll visualize to verify the windowed pattern.

Let's test this implementation with a concrete example. We'll create random queries, keys, and values, then examine the attention pattern:

In[17]:
Code
# Create sample input
np.random.seed(42)
seq_len = 16
d_k = 32

Q = np.random.randn(seq_len, d_k) * 0.1
K = np.random.randn(seq_len, d_k) * 0.1
V = np.random.randn(seq_len, d_k) * 0.1

# Compute attention with window size 5
output, weights = sliding_window_attention(Q, K, V, window_size=5, causal=False)
Out[18]:
Console
Input shapes: Q=(16, 32), K=(16, 32), V=(16, 32)
Output shape: (16, 32)
Attention weights shape: (16, 16)

Attention pattern for token 8:
  Positions attended: [ 6  7  8  9 10]
  Weights: [0.20061626 0.20531482 0.1960464  0.20224883 0.19577369]

The attention weights confirm that token 8 attends only to positions 6-10 (a window of 5 centered on position 8), with all other weights effectively zero.

Out[19]:
Visualization
Heatmap showing diagonal band of attention weights with most mass concentrated near the diagonal.
Attention weight matrix from sliding window attention (window=5). Each row shows which positions a token attends to, forming a characteristic band pattern along the diagonal.

Efficient Block-Based Implementation

The mask-based implementation has a problem: it still computes and stores the full $n \times n$ attention matrix, even though most entries are zero. For a 10,000-token sequence, we'd allocate 100 million floats just for the attention weights. This defeats the purpose of sliding window attention.

The solution is to never materialize the full matrix. Instead, we process each position independently, computing attention only over its local window. This insight transforms memory complexity from $O(n^2)$ to $O(n \cdot w)$, where $w$ is the window size:

In[20]:
Code
def efficient_sliding_window_attention(Q, K, V, window_size):
    """
    Memory-efficient sliding window attention.

    Instead of creating the full n x n attention matrix, we process
    each position individually, computing only the relevant window.

    Args:
        Q, K, V: Query, Key, Value matrices of shape (seq_len, d)
        window_size: Size of the attention window

    Returns:
        Output of shape (seq_len, d)
    """
    seq_len, d_k = Q.shape
    d_v = V.shape[1]
    half_window = window_size // 2

    output = np.zeros((seq_len, d_v))

    for i in range(seq_len):
        # Determine window boundaries
        start = max(0, i - half_window)
        end = min(seq_len, i + half_window + 1)

        # Extract local keys and values
        K_local = K[start:end]  # shape: (window, d_k)
        V_local = V[start:end]  # shape: (window, d_v)

        # Compute attention scores for this position
        scores = Q[i : i + 1] @ K_local.T / np.sqrt(d_k)  # shape: (1, window)
        weights = softmax(scores, axis=-1)

        # Compute output for this position
        output[i] = (weights @ V_local).squeeze()

    return output

The key insight is in the loop body. For each position $i$, we compute the window boundaries, extract only the relevant keys and values, compute attention scores over just $w$ positions, and aggregate. Each iteration uses $O(w \cdot d)$ memory instead of $O(n \cdot d)$, and we never store the full attention matrix.

This approach trades sequential processing for memory efficiency. In practice, production implementations use batched operations and GPU-optimized kernels to process multiple positions in parallel while still avoiding the full matrix.

Let's verify that both implementations produce identical results:

In[21]:
Code
# Verify both implementations produce the same result
output_mask_based, _ = sliding_window_attention(Q, K, V, window_size=5)
output_efficient = efficient_sliding_window_attention(Q, K, V, window_size=5)

difference = np.max(np.abs(output_mask_based - output_efficient))
Out[22]:
Console
Maximum difference between implementations: 0.00e+00
Implementations match: True

The maximum difference reported is exactly zero here (at most it would be floating-point round-off, on the order of $10^{-16}$), confirming that both implementations compute identical results. The efficient version achieves the same output while using significantly less memory for long sequences.
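
The loop above processes one query at a time. As a middle ground between that loop and GPU-optimized kernels, here is a vectorized NumPy sketch (not a production implementation) of the same idea: gather each token's window of keys and values into a (seq_len, window) block, mask the padded boundary positions, and run a single batched softmax. It reuses the softmax helper defined earlier and agrees with the loop version up to floating-point round-off.

Code
def batched_sliding_window_attention(Q, K, V, window_size):
    """Vectorized sliding window attention that never builds the n x n matrix."""
    seq_len, d_k = Q.shape
    half = window_size // 2

    offsets = np.arange(-half, half + 1)                  # (w,)
    idx = np.arange(seq_len)[:, None] + offsets[None, :]  # (n, w) absolute positions
    valid = (idx >= 0) & (idx < seq_len)                  # mark out-of-range positions
    idx = np.clip(idx, 0, seq_len - 1)

    K_win = K[idx]  # (n, w, d_k): each token's window of keys
    V_win = V[idx]  # (n, w, d_v): each token's window of values

    scores = np.einsum("nd,nwd->nw", Q, K_win) / np.sqrt(d_k)
    scores = np.where(valid, scores, -1e9)                # block padded positions
    weights = softmax(scores, axis=-1)                    # (n, w)
    return np.einsum("nw,nwd->nd", weights, V_win)


out_batched = batched_sliding_window_attention(Q, K, V, window_size=5)
print(np.max(np.abs(out_batched - output_efficient)))  # ~0: matches the loop version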

Causal Sliding Window for Language Models

For autoregressive language models like GPT, we need an additional constraint: tokens must not attend to future positions. If token 10 could see token 11, it would be "cheating" during text generation, looking at words it's supposed to predict.

Combining causality with sliding windows creates a pattern where each token sees at most $w$ tokens: itself and up to $w - 1$ previous ones. The window is asymmetric: all context comes from the past. Token 10 with $w=4$ attends to tokens 7, 8, 9, and 10, never token 11 or beyond:

In[23]:
Code
def causal_sliding_window_attention(Q, K, V, window_size):
    """
    Causal sliding window attention for autoregressive models.

    Each position attends only to itself and the previous (window_size - 1) positions.
    """
    seq_len, d_k = Q.shape
    d_v = V.shape[1]

    output = np.zeros((seq_len, d_v))

    for i in range(seq_len):
        # Only attend to current and previous positions within window
        start = max(0, i - window_size + 1)
        end = i + 1

        K_local = K[start:end]
        V_local = V[start:end]

        scores = Q[i : i + 1] @ K_local.T / np.sqrt(d_k)
        weights = softmax(scores, axis=-1)
        output[i] = (weights @ V_local).squeeze()

    return output
Out[24]:
Visualization
Lower triangular band pattern showing causal sliding window mask.
Causal sliding window attention pattern (window=4). The mask is lower triangular within each token's window, ensuring no information flows from future positions.

Mistral-Style Windowed Attention

Mistral 7B popularized sliding window attention in modern large language models. Their approach uses a 4,096-token window size, which covers most practical context lengths while enabling efficient inference.

Key Design Choices

Mistral's implementation incorporates several practical considerations:

  • Large window size (4096): Covers entire documents in many use cases, reducing the need for global tokens
  • Rolling buffer KV cache: During generation, only the most recent $w$ key-value pairs are stored
  • Pre-fill and chunking: Long prompts are processed in chunks to manage memory

The rolling buffer approach is particularly elegant. Traditional transformers store all past key-value pairs, requiring memory that grows with sequence length:

$$M_{\text{full}} = O(n \cdot L \cdot d)$$

where:

  • $M_{\text{full}}$: memory required for the full KV cache
  • $n$: current sequence length (grows during generation)
  • $L$: number of transformer layers
  • $d$: model dimension

With sliding window attention, we only store the most recent $w$ key-value pairs, giving constant memory usage:

$$M_{\text{window}} = O(w \cdot L \cdot d)$$

where $w$ is the window size. Since $w$, $L$, and $d$ are all constants for a given model, memory usage becomes independent of sequence length. This is the key insight that enables generating arbitrarily long sequences without running out of memory.

In[25]:
Code
class RollingKVCache:
    """
    Rolling key-value cache for efficient autoregressive generation.

    Maintains only the most recent 'window_size' key-value pairs,
    enabling constant memory usage regardless of sequence length.
    """

    def __init__(self, window_size, num_heads, d_k, d_v):
        self.window_size = window_size
        self.num_heads = num_heads

        # Pre-allocate buffers
        self.k_cache = np.zeros((window_size, num_heads, d_k))
        self.v_cache = np.zeros((window_size, num_heads, d_v))

        # Current position in the circular buffer
        self.position = 0
        self.length = 0

    def update(self, k, v):
        """
        Add new key-value pair to the cache.

        Args:
            k: New key of shape (num_heads, d_k)
            v: New value of shape (num_heads, d_v)
        """
        # Write to current position (wraps around)
        idx = self.position % self.window_size
        self.k_cache[idx] = k
        self.v_cache[idx] = v

        self.position += 1
        self.length = min(self.length + 1, self.window_size)

    def get_kv(self):
        """Get current keys and values in correct order."""
        if self.length < self.window_size:
            # Haven't filled the buffer yet
            return self.k_cache[: self.length], self.v_cache[: self.length]
        else:
            # Buffer is full, need to reorder
            start = self.position % self.window_size
            indices = np.concatenate(
                [np.arange(start, self.window_size), np.arange(0, start)]
            )
            return self.k_cache[indices], self.v_cache[indices]
In[26]:
Code
# Demonstrate rolling cache behavior
cache = RollingKVCache(window_size=4, num_heads=2, d_k=8, d_v=8)

# Simulate adding tokens
for i in range(6):
    k = np.full((2, 8), i)  # Simple keys to track order
    v = np.full((2, 8), i)
    cache.update(k, v)

    K, V = cache.get_kv()
    first_values = K[:, 0, 0]  # First element of each cached key
Out[27]:
Console
Rolling KV Cache Demo (window=4):
After 6 updates, cache contains: 4 entries
Cached positions (by value): [2. 3. 4. 5.]
Only most recent 4 tokens are stored

After adding 6 tokens to a cache with window size 4, tokens 0 and 1 have been evicted. The cache now contains only tokens 2, 3, 4, and 5, demonstrating how the circular buffer automatically discards the oldest entries when it reaches capacity. This behavior mirrors what happens during autoregressive generation in models like Mistral.
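
To connect the cache to attention, here is a hedged sketch of a decoding loop (single head, randomly initialized stand-in projection matrices W_q, W_k, W_v; a real model would run this inside every layer with trained weights). Each new token appends its key and value, and attention runs only over the cached window:

Code
np.random.seed(0)
d_model = d_k = 16
W_q, W_k, W_v = (np.random.randn(d_model, d_k) * 0.1 for _ in range(3))

cache = RollingKVCache(window_size=4, num_heads=1, d_k=d_k, d_v=d_k)

for step in range(10):
    x = np.random.randn(d_model)                    # hidden state of the newest token
    q = (x @ W_q)[None, :]                          # (1, d_k)
    cache.update((x @ W_k)[None, :], (x @ W_v)[None, :])

    K_win, V_win = cache.get_kv()                   # at most window_size entries
    scores = q @ K_win[:, 0, :].T / np.sqrt(d_k)    # (1, cached_len)
    weights = softmax(scores, axis=-1)
    out = weights @ V_win[:, 0, :]                  # (1, d_v): attention over the window only

print(K_win.shape)  # (4, 1, 16): the cache never grows beyond the window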

Memory Comparison

The rolling buffer approach provides substantial memory savings for long sequences:

In[28]:
Code
def kv_cache_memory_mb(
    seq_len, num_layers, num_heads, d_head, window_size=None, dtype_bytes=2
):
    """
    Calculate KV cache memory usage.

    Args:
        seq_len: Current sequence length
        num_layers: Number of transformer layers
        num_heads: Number of attention heads
        d_head: Dimension per head
        window_size: If provided, use rolling buffer; otherwise full cache
        dtype_bytes: 2 for fp16, 4 for fp32
    """
    # Full cache: 2 (K+V) * layers * heads * seq_len * d_head * dtype
    if window_size is None:
        elements = 2 * num_layers * num_heads * seq_len * d_head
    else:
        # Rolling buffer: only window_size positions
        effective_len = min(seq_len, window_size)
        elements = 2 * num_layers * num_heads * effective_len * d_head

    return elements * dtype_bytes / (1024**2)  # Convert to MB


# Mistral 7B config (approximate)
config = {"num_layers": 32, "num_heads": 32, "d_head": 128, "window_size": 4096}

seq_lengths = [1024, 4096, 16384, 65536]
Out[29]:
Console
KV Cache Memory (fp16, Mistral 7B config):

   Seq Len      Full Cache   Rolling (w=4096)    Savings
--------------------------------------------------------
     1,024          512.0 MB             512.0 MB       0.0%
     4,096         2048.0 MB            2048.0 MB       0.0%
    16,384         8192.0 MB            2048.0 MB      75.0%
    65,536        32768.0 MB            2048.0 MB      93.8%

The memory comparison reveals the scaling advantage of the rolling buffer. At 1,024 tokens (below the window size), both approaches use the same memory. At 4,096 tokens (exactly the window size), they remain equivalent. Beyond that point, the savings become dramatic: at 16K tokens, the rolling buffer uses 75% less memory, and at 64K tokens, savings reach over 93%.

Out[30]:
Visualization
Line plot comparing full cache linear memory growth versus rolling buffer constant memory usage.
KV cache memory comparison between full cache and rolling buffer (window=4096) for Mistral 7B configuration. The rolling buffer provides constant memory usage regardless of sequence length, enabling processing of arbitrarily long sequences.

This enables processing long documents on hardware that could not otherwise accommodate them.

Limitations and Impact

Sliding window attention achieves linear complexity by making a fundamental trade-off: tokens beyond the window cannot directly attend to each other. This creates important limitations that you must consider when deploying these models.

The most significant challenge is capturing long-range dependencies. Consider a question-answering task where the answer appears at the beginning of a long document and the question at the end. With a window of 512 tokens and a document of 10,000 tokens, direct attention between question and answer is impossible. Information must propagate through intermediate layers, potentially degrading or losing signal along the way. Experiments have shown that sliding window models perform well on tasks where relevant context is local but struggle when important information spans large distances.

Another limitation relates to positional awareness. Standard transformers with full attention can learn to recognize absolute positions through attention patterns. With sliding window attention, each token sees the same local structure regardless of its absolute position in the sequence. While positional embeddings help, some tasks that require precise long-range position tracking may suffer.

Despite these limitations, sliding window attention has proven remarkably effective in practice. Modern language models like Mistral combine it with large windows (4,096 tokens) that cover most practical contexts. For document understanding, the combination of sliding windows with global attention tokens (covered in the next chapter) addresses most long-range needs while maintaining efficiency. The approach has enabled scaling to context lengths that would be impractical with full attention, opening new applications in document processing, code analysis, and long-form text generation.

The practical impact extends beyond computational savings. Standard attention requires storing the full $n \times n$ attention matrix, giving $O(n^2)$ memory complexity. Sliding window attention stores only the non-zero entries within each window, reducing memory to $O(n \cdot w)$. Since $w$ is constant, this is effectively linear in sequence length. A model with full attention at 16K tokens might require a high-end GPU, while the same model with sliding window attention can run on modest hardware. This democratization of long-context processing has accelerated both research and deployment of language models for real-world applications.

Summary

Sliding window attention addresses the quadratic complexity of standard attention by restricting each token to attend only within a local window. The core ideas and takeaways from this chapter:

  • Linear complexity: By attending to a fixed window of $w$ positions, complexity drops from $O(n^2 d)$ to $O(n \cdot w \cdot d)$, enabling efficient processing of long sequences.

  • Window patterns: Symmetric windows work for bidirectional models, while causal windows suit autoregressive generation. Asymmetric and dilated windows offer additional flexibility.

  • Effective receptive field: Through multiple layers, information propagates beyond the immediate window. A model with $L$ layers and window $w$ has an effective receptive field of approximately $L \cdot (w/2)$ positions in each direction, allowing distant tokens to influence each other indirectly.

  • Implementation efficiency: Memory-efficient implementations avoid materializing the full attention matrix, instead computing attention block by block.

  • Rolling KV cache: For autoregressive generation, a circular buffer stores only the most recent $w$ key-value pairs, providing constant memory usage regardless of sequence length.

  • Trade-offs: The locality assumption sacrifices direct long-range attention. Tasks requiring connections across distant positions may need additional mechanisms like global attention tokens.

  • Practical success: Models like Mistral demonstrate that sliding window attention with large windows (4,096+ tokens) provides excellent performance on most tasks while enabling efficient inference.

The next chapter explores global attention tokens, which complement sliding window attention by providing a mechanism for information to flow across the entire sequence, combining the efficiency of local attention with the expressiveness of global connectivity.

Key Parameters

When implementing or configuring sliding window attention, several parameters directly impact both performance and computational efficiency (a small configuration sketch follows the list):

  • window_size: The number of positions each token can attend to. Larger windows capture more context but increase computation. Common values range from 128-512 for efficient models to 4,096 for models like Mistral. Choose based on the expected range of relevant dependencies in your task.

  • causal: Whether to use causal masking (True for autoregressive language models, False for bidirectional models like BERT). Causal masking ensures tokens cannot attend to future positions, which is required for text generation.

  • dilation_rate: For dilated sliding windows, controls the gap between attended positions. A rate of 1 gives standard consecutive attention; rate 2 skips every other position, doubling the effective range. Use multiple dilation rates across heads for multi-scale attention.

  • num_layers: The number of transformer layers affects the effective receptive field. With $L$ layers and window $w$, information can propagate approximately $L \cdot (w/2)$ positions in each direction. Deeper models can use smaller windows.

  • d_k / d_head: The dimension of query and key vectors. Standard values follow $d_k = d_{\text{model}} / h$ where $h$ is the number of heads. This affects the scaling factor $\sqrt{d_k}$ in attention score computation.

  • dtype_bytes: Memory precision for KV cache storage. Use 2 bytes for fp16/bf16 (recommended) or 4 bytes for fp32. Half precision halves memory usage with negligible quality impact.
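
As a compact reference, the sketch below bundles these knobs into a single configuration object (purely illustrative, not a real library API):

Code
from dataclasses import dataclass


@dataclass
class SlidingWindowConfig:
    """Illustrative bundle of the parameters discussed above."""

    window_size: int = 512   # positions each token can attend to
    causal: bool = True      # True for autoregressive models, False for bidirectional
    dilation_rate: int = 1   # 1 = consecutive window, >1 = dilated
    num_layers: int = 12     # depth; receptive field ~ num_layers * window_size // 2
    d_head: int = 64         # per-head dimension, used in the sqrt(d_k) scaling
    dtype_bytes: int = 2     # 2 for fp16/bf16 KV cache entries, 4 for fp32


config = SlidingWindowConfig(window_size=4096, num_layers=32, d_head=128)
print(config.num_layers * (config.window_size // 2))  # effective receptive field: 65536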




About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
