Understand why self-attention has O(n²) complexity, how memory and compute scale quadratically with sequence length, and why this creates hard limits on context windows.

Quadratic Attention Bottleneck
Transformers have revolutionized natural language processing, powering everything from chatbots to code assistants. At the heart of their success lies self-attention, a mechanism that allows every token to directly interact with every other token. This all-pairs computation is both the transformer's greatest strength and its Achilles' heel. As sequence length grows, the computational and memory demands explode quadratically, creating a fundamental barrier to processing long documents, extended conversations, and book-length texts.
In this chapter, we analyze why this quadratic bottleneck exists, quantify its impact on real systems, and visualize how quickly resources become exhausted. Understanding this limitation sets the stage for the efficient attention variants covered in subsequent chapters.
The All-Pairs Problem
Self-attention computes relationships between every pair of tokens in a sequence. For a sequence of $n$ tokens, this means $n^2$ pairwise interactions. The complete attention formula is:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where:
- $Q$: query matrix with one row per token
- $K$: key matrix with one row per token
- $V$: value matrix containing information to aggregate
- $d_k$: dimension of queries and keys
- $n$: sequence length
The critical operation is $QK^\top$, which produces an $n \times n$ attention score matrix. Every element $(i, j)$ in this matrix represents how much token $i$ should attend to token $j$. Computing all $n^2$ entries is what gives attention its power: any token can directly influence any other, regardless of distance. But this same property creates the quadratic scaling problem.
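To make the all-pairs structure concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (the function name and shapes are illustrative, not taken from any particular library). The intermediate `scores` array is exactly the $n \times n$ matrix discussed above.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head scaled dot-product attention that materializes the full n x n score matrix."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                   # shape (n, n): one score per token pair
    scores -= scores.max(axis=-1, keepdims=True)      # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # shape (n, d_v): weighted blend of all values

rng = np.random.default_rng(0)
n, d_k = 1024, 64
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_k))
out = naive_attention(Q, K, V)    # the scores array held n * n = 1,048,576 entries for 1,024 tokens
```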
Consider the contrast with sequential processing. In a recurrent neural network, information from token 1 must propagate through tokens 2, 3, 4, and so on to reach token 100. In attention, token 1 directly connects to token 100 in a single step. This short path length is why transformers excel at capturing long-range dependencies. However, the price is computing all pairwise interactions up front.
| Sequence length $n$ | Pairwise interactions $n^2$ | Growth Factor |
|---|---|---|
| 128 | 16,384 | |
| 256 | 65,536 | 4x |
| 512 | 262,144 | 4x |
| 1,024 | 1,048,576 | 4x |
| 2,048 | 4,194,304 | 4x |
| 4,096 | 16,777,216 | 4x |

The "Growth Factor" column reveals the quadratic pattern: doubling the sequence length quadruples the number of interactions. At 128 tokens, we compute roughly 16,000 pairs. At 4096 tokens, this explodes to over 16 million pairs. The 4x growth for each 2x increase in sequence length is the signature of $O(n^2)$ complexity.
The first panel shows the raw explosion in pair count, rising from near zero at short sequences to over 16 million at 4096 tokens. The second panel, using logarithmic scales on both axes, reveals the underlying structure: a straight line with slope 2, confirming the $n^2$ relationship. Each doubling of sequence length produces exactly a 4x increase in pairs, marked along the curve.
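The growth pattern is easy to reproduce with a few lines of Python (a throwaway sketch, not tied to any framework):

```python
prev = None
for n in (128, 256, 512, 1024, 2048, 4096):
    pairs = n * n
    growth = f"{pairs / prev:.0f}x" if prev else "-"
    print(f"n = {n:>5}   pairs = {pairs:>12,}   growth = {growth}")
    prev = pairs
# Every doubling of n multiplies the number of pairs by 4.
```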
Memory Analysis: The Attention Matrix
The attention score matrix consumes memory proportional to $n^2$. Each attention head computes its own $n \times n$ matrix of attention weights, and these matrices must be stored during both forward and backward passes. For a model with $h$ attention heads, the total memory required for attention matrices in a single layer is:

$$M_{\text{attn}} = h \cdot n^2 \cdot \text{dtype\_bytes}$$

where:
- $M_{\text{attn}}$: memory in bytes for attention matrices
- $h$: number of attention heads (each head computes an independent attention pattern)
- $n$: sequence length (number of tokens)
- $n^2$: the number of entries in each attention matrix (one score per query-key pair)
- $\text{dtype\_bytes}$: bytes per element (4 for fp32, 2 for fp16/bf16)
The formula reflects a simple structure: heads times entries per head times bytes per entry. The $n^2$ term is the source of quadratic scaling.
Worked Example: With 12 attention heads, fp16 precision, and a sequence length of 8192 tokens, the attention matrices for a single layer consume:

$$M_{\text{attn}} = 12 \times 8192^2 \times 2 \text{ bytes} \approx 1.6 \text{ GB}$$
For a 12-layer transformer, this becomes approximately 19 GB just for attention matrices, before accounting for model weights, activations, or gradients.
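The arithmetic is easy to script. The helper below is an illustrative sketch (the function name and defaults are not from any library); it reproduces the single-layer and 12-layer figures above.

```python
def attention_matrix_memory(n, heads, layers=1, dtype_bytes=2):
    """Bytes needed to store the h x n x n attention weight matrices across `layers` layers."""
    return heads * n * n * dtype_bytes * layers

single_layer = attention_matrix_memory(n=8192, heads=12, dtype_bytes=2)
twelve_layers = attention_matrix_memory(n=8192, heads=12, layers=12, dtype_bytes=2)
print(f"one layer:  {single_layer / 1e9:.1f} GB")   # ~1.6 GB
print(f"12 layers:  {twelve_layers / 1e9:.1f} GB")  # ~19 GB
```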
The memory requirements reveal why long-context models are challenging. A GPT-2 Small model at 512 tokens uses only about 0.07 GB for attention matrices. At 8192 tokens, this grows to roughly 19 GB. At 32K tokens, a modest 12-layer model requires over 280 GB just for attention, exceeding the memory of any single GPU. Larger models with more heads and layers face even steeper memory walls.
The plot shows that even the smallest model configuration exceeds a 24 GB GPU around 16K tokens, and an 80 GB GPU around 32K tokens. Larger models hit these limits much sooner. This memory wall is often the first constraint encountered when extending context length.
Compute Analysis: The $O(n^2 \cdot d)$ Operations
We've seen that attention requires storing an $n \times n$ matrix of weights. But how much computation does it take to produce those weights and use them? Understanding the computational cost helps us predict training times, estimate inference latency, and appreciate why efficient attention variants are so valuable.
The answer turns out to be $O(n^2 \cdot d)$ floating-point operations per attention layer. To see where this comes from, let's trace through exactly what happens when attention runs.
Building Intuition: What Must Be Computed?
Before diving into the math, consider what attention needs to accomplish for each token in the sequence:
- Compare this token to every other token to determine relevance scores
- Normalize those scores into a probability distribution
- Blend information from all tokens according to those probabilities
For a sequence of $n$ tokens, step 1 alone requires $n$ comparisons per token, and we have $n$ tokens, giving us $n^2$ comparisons. This is the fundamental source of quadratic scaling: the all-pairs comparison cannot be avoided in standard attention.
Now let's quantify each step precisely.
Stage 1: Computing Attention Scores
The first operation computes how strongly each query should attend to each key. Mathematically, this is the matrix multiplication $QK^\top$, where $Q$ contains all query vectors (one per token) and $K$ contains all key vectors.
When we multiply a matrix of shape $n \times d_k$ by a transposed matrix of shape $d_k \times n$, we get an $n \times n$ output. Each entry in this output matrix is a dot product between one query and one key, requiring $d_k$ multiplications and $d_k$ additions. With $n^2$ such entries, the total operation count is:

$$\text{FLOPs}_{\text{scores}} = 2 \cdot n^2 \cdot d_k$$
where:
- $n^2$: the number of query-key pairs (one score per pair)
- $d_k$: the dimension of each query and key vector
- $2$: accounts for both multiplications and additions in each dot product
The $n^2$ appears because every token must be compared against every other token. There's no shortcut here: we don't know which pairs will have high attention until we compute all the scores.
Stage 2: Softmax Normalization
Raw dot products can be any real number, positive or negative. To use them as weights in a weighted average, we need to convert each row of scores into a probability distribution that sums to 1. The softmax function accomplishes this:

$$\alpha_{ij} = \frac{e^{s_{ij}}}{\sum_{k=1}^{n} e^{s_{ik}}}$$

where $s_{ij}$ is the $j$-th attention score in row $i$, and the sum runs over all $n$ scores in that row.
For each of the $n$ rows in the attention matrix, softmax requires:
- Exponentiating $n$ scores
- Summing those $n$ exponentials
- Dividing each of the $n$ exponentials by the sum
This contributes roughly $3n$ operations per row (exponentiate, sum, divide, with some overhead). With $n$ rows:

$$\text{FLOPs}_{\text{softmax}} \approx 3n^2$$
While still quadratic in $n$, the softmax cost lacks the $d_k$ factor, making it smaller than score computation when $d_k$ is large. For typical transformers, where $d_k$ ranges from 64 to 128, softmax contributes less than 10% of the total attention FLOPs.
Stage 3: Computing the Output
The final step uses the normalized attention weights to compute a weighted combination of value vectors. Each token's output is a blend of all tokens' values, weighted by attention:

$$\text{output}_i = \sum_{j=1}^{n} \alpha_{ij} \, v_j$$

This is another matrix multiplication: the $n \times n$ attention weight matrix multiplied by the $n \times d_v$ value matrix produces an $n \times d_v$ output. Each of the $n \cdot d_v$ output elements requires summing across $n$ weighted values:

$$\text{FLOPs}_{\text{output}} = 2 \cdot n^2 \cdot d_v$$

This matches the cost of score computation when $d_v = d_k$. We've now encountered two operations that each cost $O(n^2 d)$, and together they dominate the total cost.
The visualization confirms our analysis: score computation and output computation each consume roughly half the total FLOPs (about 49.8% each), while softmax contributes less than 0.4%. The two matrix multiplications involving the attention matrix dominate entirely.
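The per-stage counts derived above can be tallied in a few lines. This sketch treats attention as a single head of width $d$ and charges softmax three operations per score; the exact percentages shift slightly depending on how softmax operations are counted.

```python
def attention_flops_breakdown(n, d, softmax_ops_per_score=3):
    """Prints the share of attention FLOPs spent on scores, softmax, and output."""
    stages = {
        "scores (QK^T)": 2 * n * n * d,               # d multiplications + d additions per pair
        "softmax": softmax_ops_per_score * n * n,     # exponentiate, sum, divide each score
        "output (AV)": 2 * n * n * d,                 # weighted sum over n values per output element
    }
    total = sum(stages.values())
    for name, flops in stages.items():
        print(f"{name:>14}: {flops:.2e} FLOPs ({100 * flops / total:.2f}%)")

attention_flops_breakdown(n=4096, d=768)   # the two matrix multiplications dominate
```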
The Complete Complexity Formula
Combining all three stages, the total computational cost of self-attention is:

$$\text{FLOPs}_{\text{attn}} = 2n^2 d_k + 3n^2 + 2n^2 d_v = O(n^2 \cdot d_{\text{model}})$$

where:
- $n$: sequence length (number of tokens)
- $d_k$: dimension of query and key vectors (typically $d_{\text{model}} / h$)
- $d_v$: dimension of value vectors (typically equal to $d_k$)
- $d_{\text{model}}$: model dimension (the full embedding size before splitting into heads)
- $h$: number of attention heads
The simplification to $O(n^2 \cdot d_{\text{model}})$ holds because $d_k$ and $d_v$ are both proportional to $d_{\text{model}}$. In standard transformers, each head operates on $d_{\text{model}} / h$ dimensions, and since we sum across all $h$ heads, the total work scales with the full model dimension.
Why the Quadratic Term Dominates
The formula reveals an asymmetry in how the two factors, $n$ and $d_{\text{model}}$, affect complexity:
- Doubling $n$ quadruples the computation because $n$ appears squared
- Doubling $d_{\text{model}}$ only doubles the computation because $d_{\text{model}}$ appears linearly
This asymmetry has profound implications. For a model with $d_{\text{model}} = 768$ processing 4096 tokens, the $n^2$ factor (about 16.8 million) dwarfs the $d_{\text{model}}$ factor (768) by a ratio of over 20,000 to 1. The quadratic term completely dominates, which is why sequence length is the bottleneck rather than model size.
Putting Numbers to the Formula
Let's make this concrete by computing the actual FLOPs for a GPT-2 style model with $d_{\text{model}} = 768$.
The numbers tell the story of quadratic growth. At 512 tokens, a single attention layer performs about 1.6 billion operations, which modern GPUs handle in microseconds. At 2048 tokens (4x longer), the count grows to roughly 26 billion operations (16x more, confirming the quadratic relationship). By 32,768 tokens, we're approaching 7 trillion operations per layer.
For a complete transformer, multiply these numbers by the layer count. A 12-layer GPT-2 at 4096 tokens performs over 1.2 trillion attention operations. During training, backpropagation roughly triples this cost, since the backward pass must propagate gradients through both matrix multiplications and costs roughly twice as much as the forward pass. The computational demands compound rapidly, and attention becomes the dominant bottleneck for long sequences.
Visualizing the Attention Matrix
The attention matrix lies at the center of the bottleneck. Visualizing its structure helps build intuition for why it's both powerful and expensive.
As the sequence length grows, the attention matrix balloons in size. At small sequence lengths, the matrix is compact and easy to visualize. Once $n$ reaches the hundreds, individual cells become indistinguishable. At $n = 8192$ (a common context length for modern LLMs), the matrix would contain over 67 million entries, impossible to render at a reasonable resolution.
The structured patterns visible in these matrices hint at why sparse attention methods work. Much of the attention mass concentrates on the diagonal (local context), certain global tokens, and specific positions. The full matrix contains much redundancy that efficient methods can exploit.
Practical Sequence Length Limits
Given the quadratic scaling, what are the practical limits for standard attention? The answer depends on hardware constraints and whether you're training or doing inference.
These estimates reveal a stark reality. Even on an 80 GB GPU, training a standard transformer is limited to roughly 16K-23K tokens before memory runs out. Inference allows longer sequences since gradients aren't stored, but even then, sequences beyond 50K tokens become challenging. Production models that claim 100K+ context windows use specialized techniques covered in later chapters.
The shaded region represents sequence lengths that exceed GPU memory for training. While compute could theoretically handle longer sequences given enough time, memory is a hard constraint that cannot be overcome without specialized techniques. This asymmetry explains why memory-efficient attention methods like FlashAttention have had such impact: they shift the memory limit rightward without changing the compute requirements.
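A back-of-the-envelope way to locate this memory wall is to solve $h \cdot n^2 \cdot \text{dtype\_bytes} \le \text{budget}$ for $n$, where the budget is the slice of GPU memory left for one layer's attention matrices (the overhead_factor listed under Key Parameters below). The sketch assumes only one layer's attention matrices are live at a time and ignores KV caches and allocator overhead, so treat the outputs as rough estimates rather than predictions for any specific framework.

```python
import math

def max_attention_seq_len(gpu_bytes, heads, dtype_bytes=2, overhead_factor=0.2):
    """Largest n whose single-layer attention matrices fit in the memory reserved for them."""
    budget = gpu_bytes * overhead_factor
    return int(math.sqrt(budget / (heads * dtype_bytes)))

for gpu_gb, overhead in [(24, 0.2), (80, 0.2), (80, 0.5)]:
    n_max = max_attention_seq_len(gpu_gb * 1e9, heads=12, dtype_bytes=2, overhead_factor=overhead)
    mode = "training" if overhead == 0.2 else "inference"
    print(f"{gpu_gb} GB GPU, {mode}: roughly {n_max:,} tokens")
```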
The Bottleneck Visualization
To fully appreciate the quadratic bottleneck, we can visualize how attention costs compare to other transformer components as sequence length grows.
At short sequence lengths (128-256 tokens), the feed-forward network and the linear projections dominate, and attention contributes less than 10% of total computation. As sequences grow, the quadratic term takes over. By 4K tokens, attention accounts for nearly half of all operations. At 16K tokens and beyond, attention consumes the vast majority of the compute budget, squeezing out everything else.
This is the attention bottleneck in visual form. Optimizing transformers for long sequences means addressing the red region in this chart.
Motivation for Efficient Attention
The quadratic bottleneck motivates several lines of research, each tackling the problem from a different angle:
Instead of computing all $n^2$ attention scores, sparse attention methods compute only a subset. Local windows, strided patterns, and learned sparsity can reduce complexity to $O(n \cdot k)$, where $n$ is the sequence length and $k$ is the number of positions each token attends to. For example, with a local window of 256 tokens, $k = 256$ regardless of how long the sequence is, making the total cost linear in $n$.
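As a concrete illustration, here is a minimal NumPy sketch of one such pattern, a causal sliding window in which each token attends only to the most recent `window` positions. It is written for clarity rather than speed (real implementations batch the windows), and the names are illustrative.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=256):
    """Each query attends only to itself and the `window - 1` tokens before it."""
    n, d_k = Q.shape
    out = np.empty((n, V.shape[1]))
    for i in range(n):
        lo = max(0, i - window + 1)                       # local causal window [lo, i]
        scores = Q[i] @ K[lo:i + 1].T / np.sqrt(d_k)      # at most `window` scores instead of n
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ V[lo:i + 1]
    return out
# Total work is O(n * window * d) rather than O(n^2 * d).
```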
Linear attention approximates the softmax attention mechanism with kernel functions that allow associativity. The key insight is changing the order of matrix operations: instead of computing $(QK^\top)V$, which requires the $n \times n$ intermediate matrix, linear attention computes $Q(K^\top V)$. Since $K^\top V$ has shape $d \times d$ rather than $n \times n$, the $n^2$ term disappears, achieving $O(n d^2)$ complexity where $d$ is the model dimension.
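The reordering is easy to show in code. The sketch below uses the $\mathrm{elu}(x) + 1$ feature map popularized by kernelized linear attention; it is a non-causal toy version, and the helper names are illustrative.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention: apply a feature map, then compute Q (K^T V) without an n x n matrix."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1 keeps features positive
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                        # shape (d_k, d_v): independent of sequence length
    normalizer = Qf @ Kf.sum(axis=0)     # shape (n,): per-query normalization term
    return (Qf @ KV) / (normalizer[:, None] + eps)
# Cost is O(n * d_k * d_v) instead of O(n^2 * d).
```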
FlashAttention and similar methods don't reduce asymptotic complexity but dramatically reduce memory consumption by restructuring the computation. Rather than materializing the full attention matrix in GPU memory, they compute attention in small blocks, keeping only the current block in fast SRAM. This trades some recomputation for massive memory savings, enabling longer sequences within the same memory budget.
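The sketch below illustrates only the core trick: streaming over key/value blocks with a running ("online") softmax so the full $n \times n$ matrix is never materialized. Real FlashAttention is a fused GPU kernel that also tiles over queries and manages SRAM explicitly; this NumPy version is just a readable approximation of the idea.

```python
import numpy as np

def blocked_attention(Q, K, V, block=512):
    """Exact attention computed in key/value blocks with running softmax statistics."""
    n, d_k = Q.shape
    scale = 1.0 / np.sqrt(d_k)
    row_max = np.full(n, -np.inf)            # running maximum score per query row
    denom = np.zeros(n)                      # running softmax denominator per row
    acc = np.zeros((n, V.shape[1]))          # running unnormalized output
    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale               # only an (n, block) slice of scores exists at once
        new_max = np.maximum(row_max, S.max(axis=1))
        P = np.exp(S - new_max[:, None])
        rescale = np.exp(row_max - new_max)  # correct old statistics for the updated maximum
        denom = denom * rescale + P.sum(axis=1)
        acc = acc * rescale[:, None] + P @ Vb
        row_max = new_max
    return acc / denom[:, None]
```

The result matches the naive computation up to floating-point error, while the peak memory for scores drops from $O(n^2)$ to $O(n \cdot \text{block})$.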
Each approach involves trade-offs. Sparse attention sacrifices some global interactions. Linear attention may lose expressiveness on certain tasks. Memory-efficient methods require careful implementation and may have higher constant factors. The following chapters explore each in detail.
The gap between standard attention (red) and efficient variants grows dramatically at long sequences. At 64K tokens, sparse attention requires roughly 250x fewer operations than standard attention. This difference is what enables modern long-context models to process book-length documents.
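That gap can be sanity-checked with simple arithmetic (assuming a 256-token window, as in the sparse sketch above):

```python
n, window = 65_536, 256
standard_pairs = n * n          # full attention: every token against every token
sparse_pairs = n * window       # sliding window: each token against 256 neighbors
print(f"standard / sparse = {standard_pairs / sparse_pairs:.0f}x")   # 256x, in line with the ~250x gap
```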
Summary
The quadratic attention bottleneck is a fundamental constraint that shapes transformer design and deployment. In this chapter, we quantified the $O(n^2)$ scaling in both memory and compute, visualized how attention costs grow to dominate transformer computation, and previewed the efficient attention variants that address this limitation.
Key takeaways:
- Quadratic memory: Storing attention matrices requires $O(h \cdot n^2)$ memory, where $h$ is the number of attention heads and $n$ is the sequence length. This often becomes the first constraint hit when extending context length.
- Quadratic compute: The attention core requires $O(n^2 \cdot d)$ operations, where $n$ is sequence length and $d$ is model dimension. Doubling sequence length quadruples attention computation.
- The crossover point: At short sequences, attention is a minor cost. At long sequences (roughly 1K-4K tokens for typical models), attention becomes the dominant expense.
- Practical limits: Standard attention on current GPUs is limited to roughly 16K-32K tokens for training and 50K-100K tokens for inference, depending on model size and hardware.
- Motivation for efficiency: The quadratic bottleneck drives research into sparse attention, linear attention, and memory-efficient implementations, each trading different aspects to break the $O(n^2)$ barrier.
Understanding this bottleneck is essential for working with long-context models and for appreciating why the efficient attention techniques covered in the following chapters are so important. The wall is real, but it can be scaled with the right approaches.
Key Parameters
When analyzing attention complexity or estimating resource requirements, these parameters directly determine memory and compute costs:
- n (sequence length): The number of tokens in the input. This is the most critical parameter since both memory and compute scale with $n^2$. Typical values range from 512 (early BERT) to 128K+ (modern long-context models).
- d (model dimension): The embedding dimension of the model. Common values include 768 (GPT-2 Small), 4096 (LLaMA 7B), and 8192 (larger models). Complexity scales linearly with $d$, making it less impactful than sequence length.
- h (number of heads): The number of parallel attention heads. Each head stores an independent $n \times n$ attention matrix, so memory scales linearly with $h$. Typical values range from 12 to 128.
- L (number of layers): The depth of the transformer. Memory for attention matrices scales linearly with layer count. A 24-layer model uses twice the attention memory of a 12-layer model.
- dtype_bytes: Precision of the floating-point representation. Using fp16 (2 bytes) instead of fp32 (4 bytes) halves memory requirements and often doubles throughput on modern GPUs.
- overhead_factor: When estimating maximum sequence length, this represents the fraction of GPU memory available for attention matrices after accounting for model weights, activations, and gradients. Typical values are 0.5 for inference and 0.2 for training.