Longformer: Efficient Attention for Long Documents with Linear Complexity

Michael Brenndoerfer · Updated June 25, 2025 · 34 min read

Learn how Longformer combines sliding window and global attention to process documents of 4,096+ tokens with O(n) complexity instead of O(n²).

Longformer

Standard transformer attention scales quadratically with sequence length, making it prohibitively expensive for documents that span thousands of tokens. If you want to process an entire research paper, a legal contract, or a book chapter, vanilla attention becomes a memory and compute bottleneck long before you reach the interesting parts of the text.

Longformer addresses this by combining two complementary attention patterns: local sliding window attention for capturing nearby context, and global attention for aggregating information across the entire sequence. This hybrid approach reduces complexity from $O(n^2)$ to $O(n)$, where $n$ is the sequence length. In practical terms, doubling the sequence length with standard attention quadruples the cost, but with Longformer it only doubles the cost. The result is a transformer that can handle sequences of 4,096 tokens or more without the memory explosion that would occur with full attention.

The Core Insight: Local + Global Attention

Longformer's key innovation is recognizing that not every token needs to attend to every other token. Most language understanding happens locally. When reading a sentence, the meaning of a word primarily depends on its immediate neighbors. The subject of a sentence matters for its verb, but a word in paragraph 15 rarely affects the interpretation of a word in paragraph 3.

However, some tokens genuinely need global reach. A question token in question answering must see the entire context to find the answer. A classification token needs to aggregate information from the whole document. Longformer solves this by treating these global tokens differently from the rest of the sequence.

Longformer

Longformer is a transformer model designed for long documents that combines sliding window attention (local context) with global attention (full sequence access) to achieve linear complexity in sequence length while maintaining the ability to model long-range dependencies.

The architecture defines two types of attention:

  • Sliding window attention: Each token attends only to tokens within a fixed window around it. With window size $w$, token $i$ attends to positions $[i - w/2, i + w/2]$
  • Global attention: Selected tokens attend to all tokens in the sequence, and all tokens attend back to them

This combination allows Longformer to handle documents that would overwhelm standard transformers, while still capturing the relationships that matter for understanding.

Mathematical Formulation

To understand why Longformer works, we need to think carefully about what attention actually computes and why the standard approach becomes expensive. This journey from intuition to formal mathematics will reveal not just the mechanics of Longformer, but the deeper insight that makes efficient attention possible.

The Problem: Why Standard Attention Breaks Down

Recall that standard transformer attention computes, for each token, a weighted average of all other tokens in the sequence. For a token at position $i$, we ask: "How relevant is every other position to understanding this token?" The answer comes as a set of attention weights that sum to 1.

The issue is that this question scales poorly. If you have $n$ tokens, each token must compute $n$ attention scores, giving us $n^2$ total computations. For a 512-token sequence, that's about 260,000 attention weights per layer. For a 4,096-token document, it explodes to over 16 million. This quadratic scaling is what makes long documents prohibitively expensive.

Out[2]:
Visualization
Line plot showing quadratic and linear scaling curves diverging dramatically as sequence length increases from 512 to 8192 tokens.
Comparison of quadratic (O(n²)) vs linear (O(n)) scaling as sequence length increases. Full attention costs grow explosively while Longformer costs grow proportionally. At 8,192 tokens, full attention requires 67 million operations while Longformer needs only 4 million.
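
To make these counts concrete, here is a minimal sketch that prints how many attention scores each scheme computes. The window size of 512 and two global tokens are illustrative assumptions, matching the values used elsewhere in this article.

```python
# Back-of-the-envelope attention-score counts (illustrative assumptions:
# window size 512, 2 global tokens).
window_size = 512
num_global = 2

for n in [512, 4096, 8192]:
    full_ops = n * n                                  # every token scores every token
    longformer_ops = n * (window_size + num_global)   # each token scores ~w + |G| positions
    print(f"n={n:>5}:  full={full_ops:>12,}   longformer={longformer_ops:>10,}")
```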

But here's the key insight: most of these computations are wasted. When you read the word "cat" in a sentence, you don't need to simultaneously consider a word from three paragraphs away to understand its meaning. Most language understanding happens locally. Longformer exploits this observation by restricting which tokens each position can attend to.

Dividing the Sequence: Local and Global Tokens

Longformer partitions the sequence into two distinct sets:

  • Local tokens $\mathcal{L}$: The vast majority of tokens, which only need to see their immediate neighborhood
  • Global tokens $\mathcal{G}$: A small number of special tokens that need to aggregate information from the entire sequence

For a sequence of length $n$, we typically have $|\mathcal{L}| \approx n$ local tokens and $|\mathcal{G}| \leq 10$ global tokens. This asymmetry is the source of Longformer's efficiency: we do expensive full-sequence attention for only a handful of tokens, while the rest use a much cheaper local pattern.

Sliding Window Attention: The Local Pattern

For local tokens, we define a window of size $w$ centered on each position. Instead of attending to all $n$ tokens, position $i$ only attends to positions within the range $[i - w/2, i + w/2]$. Think of it as a spotlight that follows each token, illuminating only its immediate neighborhood.

This restriction is not just a computational trick. It reflects a genuine property of language: most syntactic and semantic relationships are local. Subjects tend to be near their verbs. Adjectives modify nearby nouns. Pronouns typically refer to recently mentioned entities. By focusing attention on a fixed-size window, we capture the dependencies that matter most while ignoring distant positions that rarely contribute meaningful information.

The formal computation for a local token at position $i \in \mathcal{L}$ is:

$$\text{Attention}_i = \text{softmax}\left(\frac{Q_i K_{[i-w/2:i+w/2]}^\top}{\sqrt{d_k}}\right) V_{[i-w/2:i+w/2]}$$

Let's unpack each component:

  • $Q_i = x_i W_Q$: The query vector for position $i$. This is what the token is "asking for" from other positions. We compute it by projecting the input embedding $x_i$ through a learned weight matrix $W_Q$.

  • $K_{[i-w/2:i+w/2]}$: The key matrix containing only keys for tokens in the local window. This is a matrix of shape $(w, d_k)$ where each row is the key vector for one position in the window. Keys represent what each position "offers" to other tokens.

  • $V_{[i-w/2:i+w/2]}$: The value matrix for the same window, also of shape $(w, d_k)$. Values contain the actual information that gets aggregated. The attention weights determine how much of each value to include in the output.

  • $d_k$: The dimension of key and query vectors, typically 64 in base transformer models.

  • $\sqrt{d_k}$: A scaling factor that prevents attention scores from growing too large. Without this, dot products between high-dimensional vectors can become very large, causing softmax to produce nearly one-hot distributions that are hard to learn from.

The computation proceeds through three carefully designed steps:

  1. Score computation: We compute $Q_i K_{[i-w/2:i+w/2]}^\top$, which produces a vector of $w$ scores. Each score measures how well the query at position $i$ matches each key in the window. Geometrically, this is a dot product: vectors pointing in similar directions produce high scores.

  2. Normalization: We divide by $\sqrt{d_k}$ to stabilize the magnitude, then apply softmax to convert raw scores into a probability distribution. The $w$ weights now sum to 1, telling us what fraction of attention to allocate to each position.

  3. Aggregation: We compute a weighted average of the value vectors using these probabilities. Positions with high attention weights contribute more to the output; positions with low weights contribute less.

The critical complexity reduction comes from step 1: instead of computing $n$ scores, we compute only $w$ scores. Since $w$ is a constant (typically 256 or 512), the cost per token is $O(w)$ instead of $O(n)$. Across all $n$ tokens, the total complexity is $O(nw)$, which is linear in sequence length.
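
The following sketch walks through these three steps for a single position. It is a minimal NumPy illustration: random matrices stand in for the projected queries, keys, and values ($x W_Q$, $x W_K$, $x W_V$), and the sizes are arbitrary.

```python
import numpy as np

# Windowed attention for one local token, following the formula above.
# Random matrices stand in for the projected queries, keys, and values.
rng = np.random.default_rng(0)
n, w, d_k = 64, 8, 16                      # sequence length, window size, head dimension

Q = rng.normal(size=(n, d_k))              # stand-in for x W_Q
K = rng.normal(size=(n, d_k))              # stand-in for x W_K
V = rng.normal(size=(n, d_k))              # stand-in for x W_V

i = 20                                     # an interior position
start, end = max(0, i - w // 2), min(n, i + w // 2 + 1)

scores = Q[i] @ K[start:end].T / np.sqrt(d_k)    # step 1: a window's worth of scaled scores
weights = np.exp(scores) / np.exp(scores).sum()  # step 2: softmax over the window
output = weights @ V[start:end]                  # step 3: weighted average of the values

print(weights.shape, output.shape)               # (9,) (16,): window-sized, not n-sized
```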

Global Attention: Bridging Distant Positions

Sliding window attention solves the efficiency problem, but it creates a new one: how does information travel between distant parts of the sequence? If token 1 can only see tokens 1-256, and token 4000 can only see tokens 3744-4000, how do they ever communicate?

The answer is global tokens. These are special positions that break the local-only rule. A global token attends to every position in the sequence, and every position attends back to it. Think of global tokens as relay stations: information from any part of the sequence can flow through them to reach any other part.

For a global token at position $g \in \mathcal{G}$, the attention computation spans the entire sequence:

$$\text{Attention}_g = \text{softmax}\left(\frac{Q_g K^\top}{\sqrt{d_k}}\right) V$$

where:

  • $Q_g = x_g W_Q^g$: The query vector for the global token, computed using a separate projection matrix $W_Q^g$
  • $K$: The key matrix for all $n$ tokens, shape $(n, d_k)$
  • $V$: The value matrix for all $n$ tokens, shape $(n, d_k)$

Notice the crucial detail: global tokens use different projection matrices ($W_Q^g$, $W_K^g$, $W_V^g$) than local tokens ($W_Q$, $W_K$, $W_V$). Why? Because local and global attention serve fundamentally different purposes.

When a local token computes attention, it's asking: "Which of my neighbors help me understand my immediate context?" When a global token computes attention, it's asking: "What are the most important pieces of information across the entire document?" These are different questions, and the model benefits from learning different representations to answer them.

The separation also matters for the tokens attending to global positions. When a local token attends to a global token, it should receive a representation tailored for aggregation, not one optimized for local context understanding. The separate projection matrices allow this flexibility.
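
As a rough illustration of the separate projections, global attention for a single [CLS]-style token might look like the sketch below. The weights and sizes here are random placeholders, not the pretrained model's parameters; a full implementation follows later in this article.

```python
import torch

# One global token attending to the whole sequence with its own projection matrices.
torch.manual_seed(0)
n, d_model, d_k = 32, 64, 64

x = torch.randn(n, d_model)                          # token embeddings

W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))      # local projections
W_Qg, W_Kg, W_Vg = (torch.randn(d_model, d_k) for _ in range(3))   # global projections

g = 0                                                # a [CLS]-style global position
Q_g = x[g] @ W_Qg                                    # query from the global projection
K_all, V_all = x @ W_Kg, x @ W_Vg                    # keys/values for all n tokens

scores = Q_g @ K_all.T / d_k**0.5
attn = torch.softmax(scores, dim=-1)                 # n weights: the global token sees everything
out_g = attn @ V_all
print(attn.shape, out_g.shape)                       # torch.Size([32]) torch.Size([64])
```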

The Symmetry Property

Global attention is bidirectional: global tokens attend to all positions, and all positions attend back to global tokens. This symmetry is essential for information flow.

Out[3]:
Visualization
Line plot showing receptive field size increasing linearly with the number of transformer layers for sliding window attention.
Effective receptive field growth across transformer layers with sliding window attention. Each layer expands the range of positions that can influence a given token. With window size 512 and 8 layers, a token can receive information from positions up to 4,096 steps away, but global tokens provide instant full-sequence connectivity.

Consider document classification with a [CLS] token at position 0. The [CLS] token needs to see the entire document to make a classification decision, so it receives global attention. But equally important, every token in the document can "see" the [CLS] token and incorporate that global context into its own representation. This bidirectional flow ensures that even local tokens have access to document-level information, just mediated through the global tokens rather than computed directly.

Combining the Patterns: The Complete Picture

Now we can see how Longformer attention works as a unified system:

  1. Most tokens use sliding window attention, computing relevance only within their local neighborhood. This keeps the bulk of computation cheap.

  2. A few tokens (typically [CLS], question tokens, or task-specific markers) use global attention, seeing the entire sequence. This maintains the ability to aggregate long-range information.

  3. All tokens can attend to global tokens, even if they're outside the local window. This ensures information can flow from any position to any other position through the global intermediaries.

The result is a sparse attention pattern that covers the essential dependencies while skipping the redundant ones. Local relationships are captured directly. Global relationships are captured through designated relay tokens.

Out[4]:
Visualization
Three attention-pattern heatmaps:
  • Full attention: every token attends to every other, producing a dense n×n matrix with 100% density.
  • Sliding window attention: attention is restricted to local neighborhoods, producing a diagonal band with only 28% density.
  • Longformer attention: both patterns combined, adding cross-shaped global connections to the sliding window base for 34% density.

Complexity Analysis: Why It Works

Let's verify that this design actually achieves linear complexity. The total attention computation involves three components:

  1. Local attention: Each of the $n$ tokens computes attention over a window of size $w$. This contributes $O(nw)$ to the total cost.

  2. Global tokens attending to all: Each of the $|\mathcal{G}|$ global tokens attends to all $n$ positions. This contributes $O(|\mathcal{G}| \cdot n)$.

  3. All tokens attending to global tokens: Each token includes global positions in its attention set. This cost is already absorbed into the local window computation (global tokens simply become part of the attended set).

The total complexity is:

$$O(nw + |\mathcal{G}| \cdot n) = O(n(w + |\mathcal{G}|))$$

where:

  • $n$: the sequence length, the variable that grows with document size
  • $w$: the sliding window size, a constant chosen at model design time (typically 512)
  • $|\mathcal{G}|$: the number of global tokens, a constant chosen per task (typically 1 to 10)

The key observation is that $(w + |\mathcal{G}|)$ is a constant. It doesn't grow with sequence length. This means the overall complexity is $O(n)$, linear in sequence length.

Compare this to standard attention's $O(n^2)$ complexity. When you double the sequence length:

  • Standard attention: cost quadruples ($2n \times 2n = 4n^2$)
  • Longformer attention: cost only doubles ($2n \times (w + |\mathcal{G}|) = 2 \times n(w + |\mathcal{G}|)$)

This difference is what enables Longformer to process documents of 4,096 tokens or more on hardware that would run out of memory with standard attention at just 512 tokens.
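
A two-line check of the doubling behavior, again assuming $w = 512$ and two global tokens:

```python
# Cost growth when the sequence length doubles (w = 512, |G| = 2 assumed).
w, g = 512, 2

for n in [1024, 2048, 4096]:
    full_ratio = (2 * n) ** 2 / n**2                      # always 4: cost quadruples
    longformer_ratio = (2 * n) * (w + g) / (n * (w + g))  # always 2: cost doubles
    print(f"n={n:>4} -> 2n:  full x{full_ratio:.0f},  longformer x{longformer_ratio:.0f}")
```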

Visualizing the Attention Pattern

The Longformer attention pattern creates a distinctive sparse structure when visualized as a matrix. Let's build an interactive visualization to understand this pattern.

In[5]:
Code
import numpy as np


def create_longformer_attention_mask(seq_len, window_size, global_positions):
    """
    Create a Longformer-style attention mask.

    Args:
        seq_len: Total sequence length
        window_size: Size of sliding window (must be even)
        global_positions: List of positions that have global attention

    Returns:
        Attention mask of shape (seq_len, seq_len)
    """
    mask = np.zeros((seq_len, seq_len))
    half_window = window_size // 2

    # Add sliding window attention for all positions
    for i in range(seq_len):
        start = max(0, i - half_window)
        end = min(seq_len, i + half_window + 1)
        mask[i, start:end] = 1

    # Add global attention: global tokens attend to all, all attend to global
    for g in global_positions:
        mask[g, :] = 1  # Global token attends to all
        mask[:, g] = 1  # All tokens attend to global token

    return mask
Out[6]:
Visualization
Heatmap showing Longformer attention mask with diagonal sliding window band and cross-shaped global attention stripes.
Longformer attention pattern with window size 8 and global attention at positions 0 and 32. The diagonal band represents sliding window attention, while the vertical and horizontal stripes show global token connectivity. This sparse pattern reduces complexity from O(n²) to O(n) while maintaining long-range information flow.

The visualization reveals the structure of Longformer attention. The diagonal band represents sliding window attention: each position attends to its local neighborhood. The vertical and horizontal stripes at positions 0 and 32 show global attention: these tokens have bidirectional access to the entire sequence.

Notice how sparse this pattern is compared to full attention. Full attention would fill the entire square, requiring storage for $n^2 = 64^2 = 4{,}096$ attention weights (where $n = 64$ is our sequence length in this example). The Longformer pattern only needs storage for the non-zero entries, which scales linearly with sequence length.
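
We can check that directly with the mask helper defined above, using the same parameters as the figure (window size 8, global tokens at positions 0 and 32):

```python
# Count non-zero entries in the 64-token example mask from the figure above.
mask = create_longformer_attention_mask(
    seq_len=64, window_size=8, global_positions=[0, 32]
)

nonzero = int(mask.sum())
print(f"non-zero attention entries: {nonzero} of {64 * 64} ({nonzero / (64 * 64):.0%})")
```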

Comparing Attention Sparsity

Let's quantify the sparsity achieved by Longformer compared to full attention.

In[7]:
Code
def compute_attention_density(seq_len, window_size, num_global):
    """Compute the fraction of non-zero attention weights."""
    # Sliding window: each of n tokens attends to w tokens
    sliding_window_entries = seq_len * window_size

    # Global attention: each global token adds roughly 2 * (n - w) connections
    # beyond the window entries we already counted
    global_entries = num_global * (seq_len - window_size) * 2

    # Total possible entries
    total_possible = seq_len**2

    # Approximate density (some overlap, but gives the right idea)
    density = min(
        1.0, (sliding_window_entries + global_entries) / total_possible
    )
    return density


# Compare densities at different sequence lengths
seq_lengths = [512, 1024, 2048, 4096, 8192, 16384]
window_size = 512
num_global = 2

densities = [
    compute_attention_density(n, window_size, num_global) for n in seq_lengths
]
full_attention_memory = [n**2 for n in seq_lengths]
longformer_memory = [n * (window_size + 2 * num_global) for n in seq_lengths]
Out[8]:
Console
Sequence Length | Attention Density | Memory Ratio (Longformer/Full)
-----------------------------------------------------------------
           512 |           100.0% |      100.78%
          1024 |            50.2% |       50.39%
          2048 |            25.1% |       25.20%
          4096 |            12.6% |       12.60%
          8192 |             6.3% |        6.30%
         16384 |             3.1% |        3.15%

The numbers tell a compelling story. At 512 tokens, Longformer uses about the same memory as full attention. But as sequence length grows, the savings become dramatic. At 4,096 tokens, Longformer uses only about 12% of the memory. At 16,384 tokens, it drops to around 3%. This is the power of linear versus quadratic scaling.

Implementing Longformer Attention

With the mathematical foundation in place, let's translate these ideas into working code. Building the implementation from scratch solidifies understanding and reveals the practical choices that make Longformer work. We'll start with the simpler sliding window mechanism, then layer on global attention support.

In[9]:
Code
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class LongformerSlidingWindowAttention(nn.Module):
    """
    Sliding window attention for Longformer.
    Each position attends only to tokens within a fixed window.
    """

    def __init__(self, embed_dim, num_heads, window_size):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.window_size = window_size

        assert embed_dim % num_heads == 0, (
            "embed_dim must be divisible by num_heads"
        )

        # Projections for local attention
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, attention_mask=None):
        """
        Args:
            x: Input tensor of shape (batch, seq_len, embed_dim)
            attention_mask: Optional mask for padding tokens

        Returns:
            Output tensor of shape (batch, seq_len, embed_dim)
        """
        batch_size, seq_len, _ = x.shape

        # Project queries, keys, and values
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        # Reshape for multi-head attention
        q = q.view(
            batch_size, seq_len, self.num_heads, self.head_dim
        ).transpose(1, 2)
        k = k.view(
            batch_size, seq_len, self.num_heads, self.head_dim
        ).transpose(1, 2)
        v = v.view(
            batch_size, seq_len, self.num_heads, self.head_dim
        ).transpose(1, 2)

        # For simplicity, we'll use a loop-based implementation
        # Production code would use more efficient sparse operations
        half_window = self.window_size // 2
        scale = 1.0 / math.sqrt(self.head_dim)

        outputs = []
        for i in range(seq_len):
            start = max(0, i - half_window)
            end = min(seq_len, i + half_window + 1)

            # Get local keys and values
            local_k = k[:, :, start:end, :]  # (batch, heads, window, head_dim)
            local_v = v[:, :, start:end, :]

            # Compute attention scores
            query = q[:, :, i : i + 1, :]  # (batch, heads, 1, head_dim)
            scores = torch.matmul(query, local_k.transpose(-2, -1)) * scale

            # Apply softmax and compute weighted sum
            attn_weights = F.softmax(scores, dim=-1)
            output = torch.matmul(attn_weights, local_v)
            outputs.append(output)

        # Concatenate outputs
        output = torch.cat(outputs, dim=2)  # (batch, heads, seq_len, head_dim)
        output = (
            output.transpose(1, 2)
            .contiguous()
            .view(batch_size, seq_len, self.embed_dim)
        )

        return self.out_proj(output)

This implementation demonstrates the core sliding window mechanism. Notice how the loop over positions extracts only the local window of keys and values for each query. In production code, this loop would be replaced with efficient sparse matrix operations, but the conceptual structure remains the same: each position computes attention only over its local window, keeping memory usage bounded regardless of sequence length.

The key insight from the code is that half_window, half the window size, determines how far each token can "see" in each direction. Positions near the edges of the sequence have smaller effective windows (they can't attend to positions that don't exist), but this is handled gracefully by the max and min bounds.
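
A quick sanity check of the module (with illustrative sizes) confirms that the output keeps the input's shape:

```python
# Shape check for the sliding-window module defined above.
torch.manual_seed(0)
attn = LongformerSlidingWindowAttention(embed_dim=64, num_heads=4, window_size=8)
x = torch.randn(2, 32, 64)        # (batch, seq_len, embed_dim)
y = attn(x)
print(y.shape)                    # torch.Size([2, 32, 64]): shape preserved
```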

Now let's extend this to the full Longformer attention with global token support:

In[10]:
Code
class LongformerAttention(nn.Module):
    """
    Full Longformer attention with both sliding window and global attention.
    """

    def __init__(self, embed_dim, num_heads, window_size):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.window_size = window_size

        # Local attention projections
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)

        # Separate projections for global attention
        self.q_proj_global = nn.Linear(embed_dim, embed_dim)
        self.k_proj_global = nn.Linear(embed_dim, embed_dim)
        self.v_proj_global = nn.Linear(embed_dim, embed_dim)

        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, global_attention_mask=None):
        """
        Args:
            x: Input tensor of shape (batch, seq_len, embed_dim)
            global_attention_mask: Boolean mask where True indicates global attention
                                   Shape: (batch, seq_len)

        Returns:
            Output tensor of shape (batch, seq_len, embed_dim)
        """
        batch_size, seq_len, _ = x.shape

        # Compute local projections
        q_local = self.q_proj(x)
        k_local = self.k_proj(x)
        v_local = self.v_proj(x)

        # Compute global projections
        q_global = self.q_proj_global(x)
        k_global = self.k_proj_global(x)
        v_global = self.v_proj_global(x)

        # Reshape for multi-head attention
        def reshape(t):
            return t.view(
                batch_size, seq_len, self.num_heads, self.head_dim
            ).transpose(1, 2)

        q_local = reshape(q_local)
        k_local = reshape(k_local)
        v_local = reshape(v_local)
        q_global = reshape(q_global)
        k_global = reshape(k_global)
        v_global = reshape(v_global)

        scale = 1.0 / math.sqrt(self.head_dim)
        half_window = self.window_size // 2

        # Determine which positions have global attention
        if global_attention_mask is None:
            global_positions = []
        else:
            global_positions = (
                global_attention_mask[0].nonzero(as_tuple=True)[0].tolist()
            )

        outputs = torch.zeros_like(q_local)

        for i in range(seq_len):
            is_global = i in global_positions

            if is_global:
                # Global token: attend to entire sequence using global projections
                query = q_global[:, :, i : i + 1, :]
                scores = torch.matmul(query, k_global.transpose(-2, -1)) * scale
                attn_weights = F.softmax(scores, dim=-1)
                output = torch.matmul(attn_weights, v_global)
            else:
                # Local token: sliding window + attend to global tokens
                start = max(0, i - half_window)
                end = min(seq_len, i + half_window + 1)

                # Gather local and global keys/values
                local_indices = list(range(start, end))
                attend_indices = list(set(local_indices + global_positions))
                attend_indices.sort()

                local_k = k_local[:, :, attend_indices, :]
                local_v = v_local[:, :, attend_indices, :]

                # Use global k,v for global positions within the attended set
                for g in global_positions:
                    if g in attend_indices:
                        idx = attend_indices.index(g)
                        local_k[:, :, idx, :] = k_global[:, :, g, :]
                        local_v[:, :, idx, :] = v_global[:, :, g, :]

                query = q_local[:, :, i : i + 1, :]
                scores = torch.matmul(query, local_k.transpose(-2, -1)) * scale
                attn_weights = F.softmax(scores, dim=-1)
                output = torch.matmul(attn_weights, local_v)

            outputs[:, :, i : i + 1, :] = output

        # Reshape and project output
        outputs = (
            outputs.transpose(1, 2)
            .contiguous()
            .view(batch_size, seq_len, self.embed_dim)
        )
        return self.out_proj(outputs)

The implementation reveals several important design decisions:

  1. Separate projection matrices: We maintain two complete sets of Q, K, V projections. The local projections (q_proj, k_proj, v_proj) are used for sliding window attention, while the global projections (q_proj_global, k_proj_global, v_proj_global) are used when global tokens are involved.

  2. Asymmetric handling: Global tokens use global projections for everything. Local tokens use local projections for local attention, but when attending to global tokens, they use the global key and value vectors. This asymmetry ensures consistent representations regardless of which token is doing the attending.

  3. Dynamic attention sets: For local tokens, we compute the union of the local window and global positions. This ensures that even if a global token is outside the local window, every token can still attend to it.

Let's verify our implementation works correctly:

In[11]:
Code
# Create a small test
torch.manual_seed(42)

batch_size = 2
seq_len = 32
embed_dim = 64
num_heads = 4
window_size = 8

model = LongformerAttention(embed_dim, num_heads, window_size)
x = torch.randn(batch_size, seq_len, embed_dim)

# Mark positions 0 and 16 as global
global_mask = torch.zeros(batch_size, seq_len, dtype=torch.bool)
global_mask[:, 0] = True
global_mask[:, 16] = True

output = model(x, global_attention_mask=global_mask)
Out[12]:
Console
Input shape: (2, 32, 64)
Output shape: (2, 32, 64)
Global positions: [0, 16]
Window size: 8
Embedding dimension: 64
Number of attention heads: 4

The implementation correctly handles both local sliding window attention and global attention for designated tokens. The output maintains the same shape as the input (batch size 2, sequence length 32, embedding dimension 64), confirming that our attention mechanism preserves dimensionality while routing attention based on the global attention mask. The two global positions (0 and 16) receive full sequence attention, while all other positions use the sliding window of size 8.

Using Longformer from Hugging Face

For practical applications, you'll want to use the optimized Longformer implementation from Hugging Face. Let's see how to apply it to a document classification task.

In[13]:
Code
from transformers import LongformerTokenizer, LongformerModel

# Load pre-trained Longformer
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Sample long document (concatenated for demonstration)
document = (
    """
Machine learning has transformed how we process and understand text. 
Traditional approaches relied on hand-crafted features and statistical methods.
Modern neural networks learn representations directly from data, enabling
unprecedented performance on tasks from translation to summarization.

The transformer architecture, introduced in 2017, revolutionized the field.
Self-attention allows models to capture relationships between any positions
in a sequence, overcoming the sequential limitations of recurrent networks.
However, the quadratic complexity of attention limited sequence lengths.

Longformer addresses this limitation through a hybrid attention pattern.
By combining local sliding window attention with strategic global attention,
it achieves linear complexity while maintaining the ability to model
long-range dependencies essential for document understanding.
"""
    * 10
)  # Repeat to create a longer document
In[14]:
Code
# Tokenize the document
inputs = tokenizer(
    document,
    return_tensors="pt",
    max_length=4096,
    truncation=True,
    padding="max_length",
)

# Create global attention mask - set first token ([CLS]) to have global attention
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # [CLS] token has global attention

# Forward pass
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        global_attention_mask=global_attention_mask,
    )

# Get the [CLS] token representation for classification
cls_representation = outputs.last_hidden_state[:, 0, :]
Out[15]:
Console
Document tokens: 1,522
Max sequence length: 4,096
[CLS] representation dimension: 768
Global attention positions: 1 token(s)

The document contains over 1,000 tokens, padded to Longformer's maximum sequence length of 4,096. Despite processing this long sequence, only a single token (the [CLS] token at position 0) has global attention. This token's 768-dimensional representation now contains information aggregated from the entire document, enabling downstream tasks like classification or similarity computation without the quadratic memory cost of full attention.

Configuring Global Attention for Different Tasks

The power of Longformer comes from its flexibility in configuring global attention. Different tasks benefit from different global attention patterns.

Document Classification

For classification, you typically only need global attention on the [CLS] token:

In[16]:
Code
def create_classification_global_mask(input_ids, cls_token_id):
    """Global attention only on [CLS] token."""
    global_mask = torch.zeros_like(input_ids)
    global_mask[:, 0] = 1  # [CLS] is always at position 0
    return global_mask


# Example usage
cls_mask = create_classification_global_mask(
    inputs["input_ids"], tokenizer.cls_token_id
)

Question Answering

For question answering, the question tokens need to see the entire context to find the answer:

In[17]:
Code
def create_qa_global_mask(input_ids, question_end_position):
    """Global attention on question tokens (positions 0 to question_end)."""
    global_mask = torch.zeros_like(input_ids)
    global_mask[:, :question_end_position] = 1
    return global_mask


# Example: question ends at position 20
qa_mask = create_qa_global_mask(inputs["input_ids"], question_end_position=20)
Out[18]:
Console
Global attention configurations:
  Classification: 1 global token ([CLS])
  Question Answering: 20 global tokens (question)

Memory comparison at 4,096 tokens:
  Full attention: 16,777,216 attention weights
  Classification Longformer: ~2,105,344 weights (12.5%)
  QA Longformer: ~2,260,992 weights (13.5%)

The difference in global token count between classification and QA tasks has minimal impact on memory efficiency. Classification uses just 12.5% of full attention memory, while QA with 20 global question tokens uses only 13.5%. Both represent massive savings compared to the 16.7 million attention weights required by full attention at 4,096 tokens.
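
These figures follow from the same $n(w + 2|\mathcal{G}|)$ approximation used earlier; a short sketch reproduces them:

```python
# Approximate non-zero attention weights at 4,096 tokens with window size 512.
n, w = 4096, 512

print(f"Full attention: {n * n:,} weights")
for task, num_global in [("Classification", 1), ("Question answering", 20)]:
    sparse = n * (w + 2 * num_global)
    print(f"{task}: ~{sparse:,} weights ({sparse / (n * n):.1%} of full attention)")
```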

Named Entity Recognition

For token-level tasks like NER, you might want global attention on sentence boundaries or special delimiter tokens:

In[19]:
Code
def create_ner_global_mask(input_ids, sep_token_id):
    """Global attention on [SEP] tokens (sentence boundaries)."""
    global_mask = (input_ids == sep_token_id).long()
    # Also include [CLS]
    global_mask[:, 0] = 1
    return global_mask

The flexibility to configure global attention per task is one of Longformer's key advantages. You control exactly which tokens have global reach, optimizing the trade-off between computational cost and model capability.

Memory and Speed Analysis

Let's measure the actual memory savings achieved by Longformer compared to full attention.

In[20]:
Code
def measure_attention_memory(
    seq_lengths, embed_dim=768, num_heads=12, window_size=512
):
    """Estimate memory usage for attention matrices."""
    results = []

    for seq_len in seq_lengths:
        # Full attention: n x n attention matrix per head per layer
        full_attention_size = seq_len * seq_len * num_heads

        # Longformer: n x w for local + n x g for global (approximate)
        # Assuming 1 global token
        num_global = 1
        longformer_size = seq_len * (window_size + 2 * num_global) * num_heads

        # Convert to MB (assuming float32)
        bytes_per_float = 4
        full_mb = full_attention_size * bytes_per_float / (1024**2)
        longformer_mb = longformer_size * bytes_per_float / (1024**2)

        results.append(
            {
                "seq_len": seq_len,
                "full_attention_mb": full_mb,
                "longformer_mb": longformer_mb,
                "savings_ratio": longformer_mb / full_mb,
            }
        )

    return results


seq_lengths = [512, 1024, 2048, 4096, 8192]
memory_results = measure_attention_memory(seq_lengths)
Out[21]:
Visualization
Log-log line plot comparing memory usage of full attention versus Longformer.
Memory usage comparison between full attention and Longformer. Full attention grows quadratically while Longformer grows linearly, shown on log-log scale.
Bar chart showing Longformer memory as percentage of full attention at each sequence length.
Memory ratio showing Longformer memory as percentage of full attention. At 8,192 tokens, Longformer uses only 6% of full attention memory.

The memory comparison reveals the dramatic difference between full attention and Longformer. At 512 tokens, Longformer uses nearly as much memory as full attention because the window size is comparable to the sequence length. But as sequences grow, the gap widens rapidly: full attention's memory grows quadratically while Longformer's grows only linearly. At 8,192 tokens, Longformer uses only about 6% of the memory that full attention would require.

This difference is why Longformer can process 4,096-token documents on hardware that would run out of memory with standard BERT-style attention at just 512 tokens.

Practical Applications

Longformer shines on tasks that require understanding long documents where context from distant parts matters.

Scientific Paper Analysis

Research papers often span 3,000 to 5,000 tokens. A model analyzing citations needs to connect references in the text to the bibliography at the end. The introduction summarizes findings that appear in detail in the results section. Longformer's global attention on section headers and key sentences enables these long-range connections.

Legal Document Analysis

Contracts frequently cross-reference clauses defined elsewhere in the document. A clause on page 15 might modify conditions stated on page 3. With Longformer, you can designate clause markers as global tokens, allowing the model to track these dependencies across the entire document.

Book Summarization

Summarizing a book chapter requires understanding themes that develop over thousands of words. Characters introduced early affect events that happen much later. Longformer's ability to process long sequences while maintaining global tokens for key narrative elements enables coherent summarization.

Multi-Document Question Answering

When answering questions that require reasoning across multiple documents, you can concatenate them and use global attention on the question tokens. The model can then search across all provided evidence to find the answer.

Limitations and Trade-offs

Longformer represents a significant advance for long document processing, but it comes with trade-offs you should understand before adopting it.

The sliding window creates an information bottleneck for tokens far from any global token. If important information lies 1,000 positions away from the nearest global token and outside any token's window, the model must rely on the stacking of attention layers to propagate that information. Deep transformer stacks help mitigate this through repeated local interactions that gradually spread information, but the effective receptive field is still limited compared to full attention. For tasks where any token might need to attend to any other token equally, Longformer's sparse pattern might lose important connections.
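
To get a feel for how slowly information propagates through stacked windows alone, the sketch below assumes the receptive field grows by roughly one window per layer, consistent with the receptive-field figure earlier (window size 512 is an illustrative choice):

```python
# Approximate receptive field of stacked sliding-window layers (window size 512 assumed).
window_size = 512
for num_layers in [1, 2, 4, 8, 12]:
    span = num_layers * window_size       # grows by roughly one window per layer
    print(f"{num_layers:>2} layers -> receptive field spans roughly {span:,} positions")
```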

The separate projection matrices for global attention roughly double the query, key, and value parameters in each attention layer. For models with billions of parameters, this overhead is significant. You're trading memory during inference for additional model weights that must be stored and loaded. The global projections also add complexity to fine-tuning: you need to carefully initialize them, and learning dynamics can differ between local and global components.

Choosing which tokens should have global attention requires task-specific knowledge. For classification, the [CLS] token is an obvious choice. For question answering, the question tokens work well. But for open-ended tasks like summarization or dialogue, the optimal global attention configuration is less clear. Getting this wrong can significantly hurt performance, and finding the right configuration often requires experimentation.

Finally, Longformer's linear complexity assumes the window size and number of global tokens remain constant as sequence length grows. If your task requires global attention on a growing fraction of tokens (say, one global token per paragraph), complexity can approach $O(n^2)$ again. The linear scaling benefit only materializes when global attention is genuinely sparse.

Key Parameters

When working with Longformer, the following parameters have the greatest impact on model behavior and performance:

  • window_size: Controls how many neighboring tokens each position can attend to. Larger windows capture more context but increase memory usage linearly. The default of 512 works well for most document tasks, but you might reduce it to 256 for memory-constrained environments or increase it for tasks requiring broader local context.

  • global_attention_mask: A binary tensor indicating which tokens have global attention. Set to 1 for tokens that need to see the entire sequence (e.g., [CLS] for classification, question tokens for QA). Keep the number of global tokens small to maintain linear complexity.

  • max_length: Maximum sequence length the model can process. Longformer-base supports 4,096 tokens by default. Longer sequences require more memory and compute, but the linear scaling makes lengths up to 16,384 feasible.

  • attention_mode (in custom implementations): Determines whether to use sliding window only, global only, or the hybrid pattern. The hybrid pattern is standard for most tasks.

When fine-tuning Longformer, pay special attention to the global attention configuration. The pretrained model learns both local and global projection matrices, so changing which tokens receive global attention during fine-tuning is straightforward. However, adding too many global tokens can negate the efficiency benefits that make Longformer attractive for long documents.

Summary

Longformer tackles the quadratic attention bottleneck through an elegant combination of local and global attention patterns. The key ideas are:

  • Sliding window attention reduces per-token complexity from $O(n)$ to $O(w)$, where $w$ is the window size. Most tokens only need local context to understand their meaning.

  • Global attention preserves the ability to aggregate information across the entire sequence. By designating specific tokens as global, you maintain long-range dependencies where they matter most.

  • Separate projections for local and global attention allow the model to learn distinct representations for different types of context aggregation.

  • Linear complexity enables processing sequences of 4,096 tokens or more on hardware that would fail with standard attention at 512 tokens.

The practical impact is substantial. Research papers, legal documents, and book chapters that were previously too long for transformer processing are now accessible. Tasks requiring reasoning across thousands of tokens become feasible without the memory explosion of full attention.

Longformer represents a broader trend in efficient attention: recognizing that the dense attention pattern of standard transformers is often more than necessary. By designing sparse patterns that match the actual information flow needed for a task, we can dramatically reduce computational costs while maintaining model quality.
