Attention Complexity: Quadratic Scaling, Memory Limits & Efficient Alternatives

Michael Brenndoerfer · Updated May 29, 2025 · 37 min read

Understand why self-attention has O(n²d) complexity, how memory scales quadratically, and when to use efficient attention variants like sparse and linear attention.

Attention Complexity

The power of self-attention comes with a price. Every token attends to every other token, creating $n^2$ pairwise interactions for a sequence of $n$ tokens. This quadratic scaling is both the source of attention's strength and its primary limitation. Understanding the computational and memory costs of attention is essential for working with transformers at scale, choosing appropriate model configurations, and knowing when alternative architectures might be necessary.

In this chapter, we analyze the complexity of attention from multiple angles: time complexity measured in floating-point operations (FLOPs), memory requirements for storing attention weights, and practical scaling limits. We compare attention to recurrent models and explore why the quadratic cost becomes prohibitive for long sequences.

The Quadratic Bottleneck

To understand why attention becomes expensive, we need to trace through exactly what happens when self-attention processes a sequence. The mechanism is elegant but computationally demanding, and the cost comes from a single, fundamental fact: every token must interact with every other token.

Imagine you're in a room with $n$ people, and everyone needs to have a brief conversation with everyone else. With 10 people, that's $10 \times 10 = 100$ conversations. With 100 people, it's $100 \times 100 = 10{,}000$ conversations. The number of interactions grows as the square of the group size. Self-attention faces exactly this scaling challenge.

Out[2]:
Visualization: attention matrices for n = 8, 16, and 32, showing 64, 256, and 1,024 pairwise interactions respectively.

The Three-Stage Computation Pipeline

Self-attention proceeds through three distinct operations, each contributing to the overall computational cost:

  1. Computing attention scores: Each token asks "how relevant is every other token to me?" This requires comparing all $n$ tokens against all $n$ tokens.

  2. Applying softmax: The raw scores are converted into proper attention weights that sum to 1, allowing us to interpret them as a probability distribution.

  3. Computing outputs: Each token gathers information from all other tokens, weighted by the attention scores, to produce its contextual representation.

The first and third steps are where the quadratic cost emerges. Let's examine each one carefully.

Stage 1: Computing Attention Scores

The heart of self-attention is measuring pairwise relevance between tokens. Given a sequence of $n$ tokens, we need to compute how strongly each token should attend to every other token. This is done by comparing query vectors against key vectors using the dot product.

Mathematically, we arrange all queries into a matrix $Q$ and all keys into a matrix $K$, then compute:

$$S = QK^T$$

where:

  • $S$: the $n \times n$ attention score matrix, where $S_{ij}$ measures how much token $i$ should attend to token $j$
  • $Q$: the query matrix with shape $n \times d_k$, containing one query vector per token
  • $K^T$: the transposed key matrix with shape $d_k \times n$, rearranging the key vectors for efficient matrix multiplication
  • $n$: the sequence length (number of tokens)
  • $d_k$: the dimension of each query/key vector (typically 64 in standard transformers)

Why does this produce $n^2$ values? Each row of $Q$ (one token's query) gets multiplied against every column of $K^T$ (every token's key), producing one score per pair. With $n$ rows and $n$ columns, we get an $n \times n$ grid of scores.

The computational cost of this matrix multiplication is substantial. Each entry in the output matrix requires $d_k$ multiplications and $d_k - 1$ additions. With $n^2$ entries, the total operation count is $n^2 \cdot d_k$.

Multiply-Accumulate Operations

A multiply-accumulate (MAC) operation computes $a + b \times c$ and is the fundamental unit of computation in matrix multiplication. When counting FLOPs, we typically count multiplications and additions separately, giving 2 FLOPs per MAC. For simplicity, we often count just multiplications, knowing the total FLOPs is roughly double.

The critical observation is that this score computation already scales as $O(n^2 d_k)$. The $n^2$ term is unavoidable: we cannot know which token pairs are important without examining all of them.
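
To make the shapes concrete, here is a minimal NumPy sketch of the score computation. The sequence length, dimension, and random inputs are toy values chosen purely for illustration:

import numpy as np

n, d_k = 8, 64  # toy sequence length and query/key dimension
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d_k))  # one query vector per token
K = rng.standard_normal((n, d_k))  # one key vector per token

S = Q @ K.T  # (n x d_k) @ (d_k x n) -> (n x n) pairwise scores

print(S.shape)      # (8, 8): one score per token pair
print(n * n * d_k)  # 4096 multiplications, even for this tiny example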

Stage 2: Normalizing with Softmax

Raw dot products can be any real number, positive or negative, with no upper bound. To use them as weights for a weighted average, we need to transform them into a valid probability distribution. The softmax function accomplishes this for each row of the score matrix:

For each token $i$, softmax converts its $n$ raw scores into $n$ attention weights that sum to 1. This involves:

  1. Computing $\exp(S_{ij})$ for each of the $n$ scores (making them all positive)
  2. Summing these exponentials across the row
  3. Dividing each exponential by the sum

Each row requires approximately $3n$ operations (exponentiate, sum, divide), and with $n$ rows, the total is $O(n^2)$. While still quadratic, this is dominated by the $O(n^2 d_k)$ cost of score computation when $d_k$ is large.
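
As a minimal sketch of this normalization step (using a random stand-in score matrix, plus the standard max-subtraction for numerical stability, which does not change the operation count in any meaningful way):

import numpy as np

def softmax_rows(S):
    """Convert each row of raw scores into attention weights that sum to 1."""
    S = S - S.max(axis=-1, keepdims=True)             # stabilize the exponentials
    exp_S = np.exp(S)                                 # n^2 exponentiations
    return exp_S / exp_S.sum(axis=-1, keepdims=True)  # n row sums, n^2 divisions

S = np.random.default_rng(0).standard_normal((8, 8))  # stand-in score matrix
A = softmax_rows(S)
print(A.sum(axis=-1))  # every row sums to 1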

Stage 3: Computing the Weighted Output

The final step uses the attention weights to compute a weighted combination of value vectors. Each token's output is a blend of all tokens' values, weighted by attention:

$$\text{Output} = \text{softmax}(S) \cdot V$$

where:

  • $\text{softmax}(S)$: the $n \times n$ attention weight matrix, with each row summing to 1
  • $V$: the value matrix with shape $n \times d_v$, containing one value vector per token
  • $d_v$: the dimension of each value vector (typically equal to $d_k$)

This is another matrix multiplication: $(n \times n) \cdot (n \times d_v)$ produces an $n \times d_v$ output. Each of the $n \cdot d_v$ output elements requires $n$ multiply-accumulate operations (summing across the $n$ positions), giving $n^2 \cdot d_v$ total operations.

This step is just as expensive as score computation. We've now encountered two $O(n^2)$ matrix multiplications.
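
The output step has the same shape bookkeeping; this small sketch (toy dimensions, random rows standing in for real softmax weights) shows the second quadratic matrix multiplication:

import numpy as np

n, d_v = 16, 64
rng = np.random.default_rng(0)

A = rng.random((n, n))
A = A / A.sum(axis=-1, keepdims=True)  # stand-in attention weights, rows sum to 1
V = rng.standard_normal((n, d_v))      # one value vector per token

output = A @ V                         # (n x n) @ (n x d_v) -> (n x d_v)
print(output.shape)                    # (16, 64)
print(n * n * d_v)                     # 16,384 multiply-accumulates for this step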

Putting It Together: Total Complexity

Summing the costs from all three stages:

$$\text{Total FLOPs} = \underbrace{O(n^2 d_k)}_{\text{scores}} + \underbrace{O(n^2)}_{\text{softmax}} + \underbrace{O(n^2 d_v)}_{\text{output}} = O(n^2(d_k + d_v)) = O(n^2 d)$$

where:

  • $n$: sequence length
  • $d_k$: query/key dimension
  • $d_v$: value dimension
  • $d$: model dimension (typically $d_k = d_v = d/h$ for $h$ attention heads)

The simplification to $O(n^2 d)$ reflects that $d_k$ and $d_v$ are both proportional to the model dimension $d$. In standard transformers, each head operates on $d/h$ dimensions, but since we have $h$ heads processing in parallel and concatenating their outputs, the total complexity across all heads scales with the full model dimension.
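
A quick sanity check on that claim, using the chapter's simplified multiplication-only counting (the specific n, d, and h values are just illustrative):

# Summing per-head costs recovers the full-dimension count.
n, d, h = 1024, 768, 12
d_head = d // h

per_head = 2 * n * n * d_head  # scores + output for one head
all_heads = h * per_head       # h heads running in parallel
full_dim = 2 * n * n * d       # the O(n^2 d) expression at the full model dimension

print(all_heads == full_dim)   # True: splitting into heads does not change the total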

The Scaling Insight

This analysis reveals the fundamental scaling behavior of self-attention:

  • Quadratic in sequence length: Doubling $n$ quadruples the computation. This is the defining characteristic that limits context length.
  • Linear in model dimension: Doubling $d$ only doubles the computation. Making models wider is relatively cheap.

The quadratic term dominates for long sequences. At $n = 1000$ tokens with $d = 768$ (GPT-2 dimensions), the $n^2 = 1{,}000{,}000$ factor far exceeds the $d = 768$ factor. Understanding this asymmetry is crucial for reasoning about transformer scalability.

Out[3]:
Visualization: attention complexity as a function of both sequence length (n) and model dimension (d). The contour lines show equal-computation curves. Moving horizontally (increasing n) raises complexity much faster than moving vertically (increasing d), visualizing the quadratic vs. linear scaling asymmetry.

Implementation: Counting FLOPs

Let's translate this analysis into code. We'll build a function that calculates the exact number of floating-point operations for self-attention, making the complexity formula concrete and verifiable.

The function mirrors our three-stage analysis: score computation contributes $n^2 \cdot d$ operations, softmax contributes $3n^2$ operations, and output computation contributes another $n^2 \cdot d$ operations.

In[4]:
Code
def attention_flops(n, d, include_projections=False):
    """
    Calculate FLOPs for self-attention.

    Args:
        n: Sequence length
        d: Model dimension
        include_projections: Whether to include Q, K, V projections

    Returns:
        Total FLOPs (counting multiplications only)
    """
    # Stage 1: Score computation (Q @ K^T)
    # Matrix multiplication: (n x d) @ (d x n) = n^2 * d multiplications
    score_flops = n * n * d

    # Stage 2: Softmax normalization
    # Per row: exp (n ops), sum (n ops), divide (n ops) = 3n per row
    # Total: 3n * n rows = 3n^2
    softmax_flops = 3 * n * n

    # Stage 3: Output computation (attention @ V)
    # Matrix multiplication: (n x n) @ (n x d) = n^2 * d multiplications
    output_flops = n * n * d

    total = score_flops + softmax_flops + output_flops

    # Optional: include the linear projections that create Q, K, V
    if include_projections:
        # Q, K, V projections: each is (n x d) @ (d x d) = n * d^2
        projection_flops = 3 * n * d * d
        # Output projection: (n x d) @ (d x d) = n * d^2
        output_proj_flops = n * d * d
        total += projection_flops + output_proj_flops

    return total

With this function, we can explore how computational cost scales across different sequence lengths. We'll use GPT-2's model dimension ($d = 768$) to ground our analysis in real-world numbers.

In[5]:
Code
# Example: GPT-2 style model
d = 768  # Model dimension
seq_lengths = [128, 512, 1024, 2048, 4096, 8192]

flops_data = []
for n in seq_lengths:
    flops = attention_flops(n, d)
    flops_data.append(
        {"seq_len": n, "flops": flops, "flops_billions": flops / 1e9}
    )
Out[6]:
Console
Attention FLOPs for d=768 (model dimension):

Sequence Length           FLOPs     Billions
---------------------------------------------
            128      25,214,976        0.025
            512     403,439,616        0.403
          1,024   1,613,758,464        1.614
          2,048   6,455,033,856        6.455
          4,096  25,820,135,424       25.820
          8,192 103,280,541,696      103.281

The table shows the quadratic explosion in computational cost. At sequence length 128, attention requires roughly 25 million operations. By 1024 tokens (a typical context window), we're performing over 1.6 billion operations. At 8192 tokens, this jumps to over 100 billion operations. These numbers are per layer only, so a 12-layer transformer multiplies each value by 12.

Out[7]:
Visualization: FLOPs scaling for self-attention with model dimension d=768. The quadratic growth means that doubling sequence length quadruples the computational cost (log scale on the y-axis).

The plot confirms the quadratic relationship. Moving from 512 to 1024 tokens quadruples the computation, and moving from 1024 to 2048 quadruples it again. At the right edge of the plot, we're computing over 100 billion operations for a single attention layer.

Memory Requirements

Computation is only half the story. Memory consumption often becomes the limiting factor before compute does. Self-attention requires storing several large tensors.

Attention Matrix Storage

The attention weight matrix has shape $n \times n$, where $n$ is the sequence length. For a single attention head, this matrix contains $n^2$ elements. With $h$ heads, we store $h \cdot n^2$ values. Using 32-bit floats (4 bytes each), the memory requirement is:

$$M_{\text{attn}} = 4 \cdot h \cdot n^2 \text{ bytes}$$

where:

  • $M_{\text{attn}}$: memory required for attention matrices
  • $h$: number of attention heads
  • $n$: sequence length
  • $4$: bytes per element for 32-bit floats (fp32)

For 16-bit floats (common in modern training with fp16 or bf16), this halves to $2 \cdot h \cdot n^2$ bytes. The quadratic dependence on $n$ means that doubling the sequence length quadruples the memory requirement.
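
Plugging numbers into this formula makes the growth tangible. The helper below is a small illustration introduced here (12 heads, fp16, single layer), not part of any library:

def attn_matrix_gb(n, n_heads=12, bytes_per_element=2):
    """Memory for one layer's attention matrices: n_heads * n^2 elements."""
    return n_heads * n * n * bytes_per_element / 1024**3

for n in [512, 2048, 8192]:
    print(f"n={n:5d}: {attn_matrix_gb(n):.3f} GB per layer")
# n=  512: 0.006 GB,  n= 2048: 0.094 GB,  n= 8192: 1.500 GB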

Total Memory Breakdown

A complete forward pass through attention requires storing:

  • Query, Key, Value matrices: Each is $n \times d$, totaling $3nd$ elements
  • Attention scores/weights: $n^2$ per head, $h n^2$ total
  • Output before projection: $n \times d$ elements
  • Intermediate activations for backpropagation: Often requires keeping the attention matrix and pre-softmax scores

During training, gradient computation doubles many of these requirements.

In[8]:
Code
def attention_memory_gb(n, d, n_heads, n_layers=1, dtype_bytes=2):
    """
    Estimate memory for attention computation.

    Args:
        n: Sequence length
        d: Model dimension
        n_heads: Number of attention heads
        n_layers: Number of transformer layers
        dtype_bytes: 2 for fp16/bf16, 4 for fp32

    Returns:
        Memory in gigabytes
    """
    # Per layer memory
    # Q, K, V: 3 * n * d
    qkv_mem = 3 * n * d * dtype_bytes

    # Attention matrices (per head): n * n, total: n_heads * n * n
    attn_mem = n_heads * n * n * dtype_bytes

    # Output: n * d
    output_mem = n * d * dtype_bytes

    # For training, need to store attention weights for backward pass
    # This is often the dominant term
    layer_mem = qkv_mem + attn_mem + output_mem

    total_bytes = layer_mem * n_layers
    return total_bytes / (1024**3)


# Typical model configurations
configs = [
    {"name": "GPT-2 Small", "d": 768, "heads": 12, "layers": 12},
    {"name": "GPT-2 Medium", "d": 1024, "heads": 16, "layers": 24},
    {"name": "GPT-2 Large", "d": 1280, "heads": 20, "layers": 36},
    {"name": "13B Model", "d": 5120, "heads": 40, "layers": 40},
]

seq_lengths_mem = [512, 2048, 8192, 32768]
Out[9]:
Console
Memory Requirements for Attention (fp16, per sample):

Model              Seq Len  Memory (GB)
----------------------------------------
GPT-2 Small            512         0.11
GPT-2 Small          2,048         1.27
GPT-2 Small          8,192        18.56

GPT-2 Medium           512         0.28
GPT-2 Medium         2,048         3.38
GPT-2 Medium         8,192        49.50

The memory requirements reveal the quadratic scaling in action. For GPT-2 Small, moving from 512 to 2048 tokens increases memory by roughly 12x (from about 0.11 GB to 1.27 GB). At 8192 tokens, even the smaller model requires substantial memory. For GPT-2 Medium with its additional layers and heads, the numbers are even larger. These estimates cover only attention activations and exclude model weights, optimizer states, and gradient storage needed during training.

Out[10]:
Visualization: memory requirements for attention matrices across different sequence lengths and model sizes. The quadratic scaling in sequence length creates a steep cliff where memory becomes prohibitive.

The heatmap reveals a crucial insight: memory grows quadratically with sequence length but only linearly with model size. Doubling from 2048 to 4096 tokens has a larger impact than moving from GPT-2 Small to GPT-2 Large. For truly long sequences, even small models hit memory limits.

Attention vs. RNN Complexity

To appreciate why the quadratic cost matters, let's compare attention to recurrent neural networks. Both architectures process sequences, but their computational patterns differ fundamentally.

RNN Complexity Analysis

An RNN with hidden dimension $d$ processes each of $n$ tokens sequentially. At each timestep $t$, the hidden state update follows:

$$h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$$

where:

  • $h_t$: hidden state at timestep $t$ (dimension $d$)
  • $W_h$: hidden-to-hidden weight matrix (shape $d \times d$)
  • $h_{t-1}$: previous hidden state (dimension $d$)
  • $W_x$: input-to-hidden weight matrix (shape $d \times d$, assuming input dimension matches $d$)
  • $x_t$: input at timestep $t$
  • $\sigma$: activation function (e.g., tanh)

The computational cost per timestep is:

  • Matrix-vector multiplication $W_h h_{t-1}$: $O(d^2)$ operations
  • Matrix-vector multiplication $W_x x_t$: $O(d^2)$ operations
  • Activation function: $O(d)$ operations

Total per step: $O(d^2)$. For $n$ tokens: $O(nd^2)$.

RNN complexity is linear in sequence length and quadratic in hidden dimension.
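
A minimal sketch of a single recurrent update (toy dimensions, small random weights, tanh activation) shows where the $d^2$ cost per token comes from:

import numpy as np

d = 512
rng = np.random.default_rng(0)
W_h = rng.standard_normal((d, d)) * 0.01  # hidden-to-hidden weights
W_x = rng.standard_normal((d, d)) * 0.01  # input-to-hidden weights
b = np.zeros(d)

def rnn_step(h_prev, x_t):
    """One update: two d x d matrix-vector products, so O(d^2) work per token."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

h = np.zeros(d)
for x_t in rng.standard_normal((10, d)):  # process 10 tokens sequentially
    h = rnn_step(h, x_t)

print(h.shape)    # (512,)
print(2 * d * d)  # ~524,288 multiplications per timestep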

Comparative Analysis

Summarizing the complexity of each architecture:

  • Attention: $O(n^2 d)$
  • RNN: $O(nd^2)$

where $n$ is sequence length and $d$ is the hidden/model dimension.

To find when these costs are equal, we set $n^2 d = nd^2$ and solve for $n$:

$$n^2 d = nd^2$$

Dividing both sides by $nd$ (assuming $n, d > 0$):

$$n = d$$

This crossover point has a clear interpretation: when sequence length exceeds model dimension ($n > d$), attention becomes more expensive. When model dimension exceeds sequence length ($n < d$), the RNN becomes more expensive.

In[11]:
Code
def compare_complexity(n_values, d):
    """Compare attention vs RNN complexity."""
    results = []
    for n in n_values:
        attention = n * n * d  # O(n^2 d)
        rnn = n * d * d  # O(n d^2)
        results.append(
            {
                "n": n,
                "attention": attention,
                "rnn": rnn,
                "ratio": attention / rnn,
            }
        )
    return results


# Compare for typical model dimension
d = 512
n_values = [64, 128, 256, 512, 1024, 2048, 4096]
comparison = compare_complexity(n_values, d)
Out[12]:
Console
Complexity Comparison (d = 512):

 Seq Len (n)       Attention             RNN   Attn/RNN
-------------------------------------------------------
          64       2,097,152      16,777,216       0.12x
         128       8,388,608      33,554,432       0.25x
         256      33,554,432      67,108,864       0.50x
         512     134,217,728     134,217,728       1.00x
        1024     536,870,912     268,435,456       2.00x
        2048   2,147,483,648     536,870,912       4.00x
        4096   8,589,934,592   1,073,741,824       8.00x

Crossover point: n = d = 512

The ratio column reveals the crossover behavior. At $n = 64$ and $n = 128$, attention is only 0.12x and 0.25x the cost of an RNN, respectively, making attention the more efficient choice. At $n = 512$ (exactly equal to $d$), both architectures have identical cost (1.00x ratio). Beyond this point, attention becomes increasingly expensive: at $n = 4096$, attention requires 8x more operations than an equivalent RNN.

Out[13]:
Visualization: computational complexity comparison between self-attention and RNN as sequence length varies. The curves cross at n = d (512 in this example), after which attention's quadratic cost dominates.

The crossover behavior has practical implications. For short sequences (a few hundred tokens), attention's overhead is minimal. For long documents, books, or conversation histories spanning thousands of tokens, the quadratic cost becomes the dominant bottleneck.

Path Length vs. Computation Trade-off

Why use attention despite its higher cost? The answer lies in path length. In an RNN, information from token $i$ to token $j$ must traverse $|j - i|$ sequential steps. In attention, any two tokens connect in a single step.

In[14]:
Code
# Trade-off analysis: what do we gain for the extra compute?
def analyze_tradeoff(n, d):
    """Analyze the compute/path-length trade-off."""
    # Maximum path length
    rnn_max_path = n - 1  # First to last token
    attention_max_path = 1  # Always direct

    # Average path length (uniform distribution of source/target)
    rnn_avg_path = (n - 1) / 2
    attention_avg_path = 1

    # Compute ratio
    compute_ratio = (n * n * d) / (n * d * d)

    # Path length improvement
    path_improvement = rnn_avg_path / attention_avg_path

    return {
        "compute_ratio": compute_ratio,
        "path_improvement": path_improvement,
        "efficiency": path_improvement / compute_ratio,
    }


tradeoff_results = []
for n in [128, 256, 512, 1024, 2048]:
    result = analyze_tradeoff(n, d=512)
    result["n"] = n
    tradeoff_results.append(result)
Out[15]:
Console
Compute vs Path Length Trade-off (d = 512):

   Seq Len   Compute Ratio    Path Improve   Efficiency
-------------------------------------------------------
       128            0.25x            63.5x      254.000
       256            0.50x           127.5x      255.000
       512            1.00x           255.5x      255.500
      1024            2.00x           511.5x      255.750
      2048            4.00x          1023.5x      255.875

The "Efficiency" column (path improvement divided by compute ratio) quantifies the value proposition of attention. At $n = 128$, we pay only 0.25x the compute of an RNN while gaining 63.5x shorter average paths, yielding an efficiency of 254. At $n = 512$ (the crossover point), we pay 1x compute for roughly 255x path improvement. At $n = 2048$, we pay 4x more compute but get roughly 1024x shorter paths. Notice that efficiency remains nearly constant around $d/2 = 256$. This reveals an important insight: as sequence length grows, both the path improvement and the compute penalty scale proportionally with $n$. Attention's value proposition stays roughly constant regardless of sequence length. The short path lengths are always worth the extra compute.

Practical Scaling Limits

Given the quadratic complexity, what are the practical limits for attention-based models? The answer depends on hardware constraints, training budget, and inference requirements.

GPU Memory Constraints

Modern GPUs have 16GB to 80GB of memory. With fp16 precision (2 bytes per element), the attention matrix for a single layer with 12 heads at sequence length 8192 requires:

$$M = h \times n \times n \times b = 12 \times 8192 \times 8192 \times 2 \approx 1.5 \text{ GB}$$

where:

  • $M$: memory in bytes
  • $h = 12$: number of attention heads
  • $n = 8192$: sequence length
  • $b = 2$: bytes per element (fp16)

For a 12-layer model, that's approximately 18GB just for attention matrices, before accounting for model weights, other activations, or gradients. Training requires additional memory for optimizer states (often 2x the model size for Adam) and gradient accumulation.

In[16]:
Code
import numpy as np


def estimate_max_seq_len(
    gpu_memory_gb, d, n_heads, n_layers, batch_size=1, safety_factor=0.7
):
    """
    Estimate maximum sequence length that fits in GPU memory.

    Args:
        gpu_memory_gb: Available GPU memory in GB
        d: Model dimension
        n_heads: Number of attention heads
        n_layers: Number of transformer layers
        batch_size: Batch size
        safety_factor: Fraction of memory to use (leave room for overhead)

    Returns:
        Maximum sequence length
    """
    available_bytes = gpu_memory_gb * 1e9 * safety_factor

    # Attention matrices dominate for long sequences
    # Memory per layer: n_heads * n^2 * 2 bytes (fp16)
    # Also need Q, K, V, output: 4 * n * d * 2 bytes
    # Solving for n given memory constraint

    # Simplified: attention matrices = n_heads * n^2 * 2 * n_layers * batch_size
    # n^2 = available_bytes / (n_heads * 2 * n_layers * batch_size)
    n_squared = available_bytes / (n_heads * 2 * n_layers * batch_size)
    max_n = int(np.sqrt(n_squared))

    return max_n


# Common GPU configurations
gpus = [
    {"name": "RTX 3090", "memory": 24},
    {"name": "A100 40GB", "memory": 40},
    {"name": "A100 80GB", "memory": 80},
    {"name": "H100 80GB", "memory": 80},
]

# GPT-2 style model
model_config = {"d": 768, "heads": 12, "layers": 12}
Out[17]:
Console
Maximum Sequence Length Estimates (batch_size=1):

            GPU     Memory     Max Seq Len
------------------------------------------
       RTX 3090         24 GB           7,637
      A100 40GB         40 GB           9,860
      A100 80GB         80 GB          13,944
      H100 80GB         80 GB          13,944

These estimates assume 70% of GPU memory is available for attention matrices (the remaining 30% accounts for model weights and other overhead). An RTX 3090 with 24GB can handle roughly 7,600 tokens, while an A100 80GB extends this to around 14,000 tokens. These numbers represent theoretical upper bounds for a single sample with batch size 1. In practice, larger batch sizes for training efficiency, gradient storage for backpropagation, and optimizer states further reduce these limits. Production systems address these constraints through gradient checkpointing, FlashAttention, and model parallelism.

Time Constraints

Beyond memory, compute time imposes practical limits. Processing long sequences is slow.

In[18]:
Code
def estimate_attention_time_ms(n, d, n_heads, n_layers, tflops=100):
    """
    Estimate attention computation time.

    Args:
        n: Sequence length
        d: Model dimension
        n_heads: Number of heads
        n_layers: Number of layers
        tflops: GPU throughput in TFLOPS (100 typical for A100)

    Returns:
        Time in milliseconds
    """
    # Total FLOPs for attention (simplified)
    flops_per_layer = 2 * n * n * d  # Q@K^T and attention@V
    total_flops = flops_per_layer * n_layers

    # Convert TFLOPS to FLOPS
    flops_per_second = tflops * 1e12

    # Time in seconds, then milliseconds
    time_seconds = total_flops / flops_per_second
    time_ms = time_seconds * 1000

    return time_ms


# Timing estimates for various sequence lengths
seq_lens_timing = [512, 1024, 2048, 4096, 8192, 16384, 32768]
timing_results = []
for n in seq_lens_timing:
    time_ms = estimate_attention_time_ms(
        n, d=768, n_heads=12, n_layers=12, tflops=150
    )
    timing_results.append({"n": n, "time_ms": time_ms})
Out[19]:
Console
Estimated Attention Time (GPT-2 Small, A100 GPU):

 Sequence Length    Time (ms)      Tokens/sec
---------------------------------------------
             512         0.03      15,894,572
           1,024         0.13       7,947,286
           2,048         0.52       3,973,643
           4,096         2.06       1,986,821
           8,192         8.25         993,411
          16,384        32.99         496,705
          32,768       131.94         248,353

At 512 tokens, attention completes in a fraction of a millisecond. By 4096 tokens, latency grows but remains acceptable for most applications. At 32K tokens, attention alone exceeds 100 milliseconds per forward pass. The "Tokens/sec" column shows throughput in terms of context size processed: higher values indicate more efficient processing for that context length. During training with backpropagation, multiply these times by roughly 3x. For autoregressive generation where the full context is reprocessed for each new token, long contexts compound these costs significantly.

Out[20]:
Visualization: attention computation time scaling with sequence length. The quadratic relationship means processing time grows rapidly as context length increases, creating practical limits for interactive applications.

The timing curve shows why context length is such a hot topic in LLM development. Crossing the 100ms threshold makes interactive applications sluggish. Crossing 1 second makes them impractical for real-time use.

The Attention Bottleneck

The combination of quadratic time and space complexity creates what's known as the attention bottleneck. As we push for longer context windows, attention becomes the dominant cost, often exceeding the cost of feed-forward layers, embeddings, and everything else combined.

In[21]:
Code
def layer_component_flops(n, d, ffn_mult=4):
    """
    Break down FLOPs by transformer layer component.

    Args:
        n: Sequence length
        d: Model dimension
        ffn_mult: FFN hidden dimension multiplier (typically 4)

    Returns:
        Dict of FLOPs per component
    """
    # Attention
    attention_flops = 2 * n * n * d  # Q@K^T and attention@V
    qkv_proj_flops = 3 * n * d * d  # Q, K, V projections
    output_proj_flops = n * d * d  # Output projection

    # Feed-forward network
    ffn_hidden = d * ffn_mult
    ffn_flops = 2 * n * d * ffn_hidden  # Two linear layers

    return {
        "attention_core": attention_flops,
        "qkv_projection": qkv_proj_flops,
        "output_projection": output_proj_flops,
        "feed_forward": ffn_flops,
    }


# Analyze at different sequence lengths
d = 768
analysis_results = []
for n in [256, 512, 1024, 2048, 4096]:
    flops = layer_component_flops(n, d)
    total = sum(flops.values())
    analysis_results.append(
        {
            "n": n,
            "attention_pct": (flops["attention_core"] / total) * 100,
            "projections_pct": (
                (flops["qkv_projection"] + flops["output_projection"]) / total
            )
            * 100,
            "ffn_pct": (flops["feed_forward"] / total) * 100,
        }
    )
Out[22]:
Console
Transformer Layer FLOP Distribution (d=768):

   Seq Len    Attention    Projections        FFN
--------------------------------------------------
       256         5.3%          31.6%      63.2%
       512        10.0%          30.0%      60.0%
      1024        18.2%          27.3%      54.5%
      2048        30.8%          23.1%      46.2%
      4096        47.1%          17.6%      35.3%

The percentages reveal how computational burden shifts with sequence length. At 256 tokens, attention accounts for only about 5% of layer computation, with the feed-forward network dominating at around 63%. By 1024 tokens, attention grows to roughly 18%, and at 4096 tokens it reaches approximately 47%, surpassing the FFN's 35%. The projection layers (QKV and output) remain relatively stable since they scale linearly with $n$. This distribution explains why optimizing attention becomes critical for long-context applications while being less important for short sequences.

Out[23]:
Visualization: distribution of FLOPs across transformer layer components as sequence length increases. The attention core (the $n^2$ operation) grows to dominate at longer sequences, while feed-forward layers shrink proportionally.

This visualization shows the attention bottleneck in action. The attention-core share expands as sequence length grows, squeezing out other components. This is why efficient attention mechanisms are such an active research area.

Efficient Attention Variants

The quadratic complexity of standard attention has motivated numerous alternatives. While a detailed treatment of each variant deserves its own chapter, understanding the landscape helps contextualize the complexity analysis.

Linear Attention Approximations

Several methods approximate attention with linear complexity by avoiding the explicit $n \times n$ matrix:

  • Linformer: Projects keys and values to lower dimensions before computing attention
  • Performer: Uses random feature maps to approximate softmax attention in $O(n)$
  • Linear Transformers: Replace softmax with kernel functions that allow associative computation

These methods trade exactness for speed and often work well in practice despite being approximations.
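
The key idea behind the kernel-based variants is associativity: once softmax is replaced by a feature map $\phi$, the product $\phi(Q)\big(\phi(K)^T V\big)$ can be evaluated right-to-left, so the $n \times n$ matrix is never materialized. A minimal non-causal sketch using the elu+1 feature map, one common choice in this family (dimensions and inputs are illustrative):

import numpy as np

def feature_map(x):
    """A positive feature map (elu(x) + 1), one choice used by linear-attention variants."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    """Kernelized attention computed in O(n * d^2) rather than O(n^2 * d)."""
    Qf, Kf = feature_map(Q), feature_map(K)
    KV = Kf.T @ V                  # (d_k, d_v): summed key-value outer products
    Z = Qf @ Kf.sum(axis=0)        # (n,): per-token normalizer
    return (Qf @ KV) / Z[:, None]  # (n, d_v); no (n, n) matrix is ever formed

n, d = 1024, 64
rng = np.random.default_rng(0)
out = linear_attention(rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)))
print(out.shape)  # (1024, 64)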

Sparse Attention Patterns

Instead of attending to all positions, sparse methods attend to a subset:

  • Longformer: Combines local windowed attention with global attention on special tokens
  • BigBird: Uses a combination of random, window, and global attention patterns
  • Sparse Transformers: Learn which positions to attend to

Sparse attention reduces complexity to $O(n \cdot k)$, where:

  • $n$: sequence length
  • $k$: number of positions each token attends to (typically $k \ll n$)

For example, with a local window of 256 tokens, $k = 256$ regardless of sequence length, making the complexity effectively linear in $n$.
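
The arithmetic behind that claim is easy to check. The helper below (an illustration, not any particular library's masking code) counts how many query-key pairs survive a fixed local window, clipping at the sequence boundaries:

import numpy as np

def local_window_pairs(n, window=256):
    """Query-key pairs when each token attends only to ~window neighbors."""
    half = window // 2
    pos = np.arange(n)
    per_token = np.minimum(pos + half, n - 1) - np.maximum(pos - half, 0) + 1
    return int(per_token.sum())

for n in [1024, 4096, 16384]:
    dense, windowed = n * n, local_window_pairs(n)
    print(f"n={n:6d}  dense={dense:>12,}  windowed={windowed:>10,}  savings={dense / windowed:5.1f}x")
# the windowed count grows linearly with n, so the savings factor keeps improving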

Memory and Caching Approaches

Some methods restructure the computation to reduce memory:

  • Flash Attention: Reorders operations to minimize memory transfers, achieving the same result with less memory
  • KV Caching: During generation, caches key-value pairs to avoid recomputation (see the sketch after this list)
  • Gradient Checkpointing: Trades computation for memory by recomputing activations during backward pass
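
To illustrate KV caching specifically, here is a minimal single-head decoding sketch (toy dimensions, NumPy, no batching) in which each new token attends against keys and values stored from earlier steps instead of recomputing them:

import numpy as np

d_k, d_v = 64, 64
rng = np.random.default_rng(0)
k_cache, v_cache = [], []            # grows by one entry per generated token

def decode_step(q_t, k_t, v_t):
    """Attend the newest query against all cached keys/values: O(t * d) per new token."""
    k_cache.append(k_t)
    v_cache.append(v_t)
    K = np.stack(k_cache)            # (t, d_k), reused rather than recomputed
    V = np.stack(v_cache)            # (t, d_v)
    scores = K @ q_t / np.sqrt(d_k)  # (t,) scores for the current position
    scores -= scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ V               # (d_v,) contextual output for the new token

for _ in range(5):                   # five decoding steps
    out = decode_step(rng.standard_normal(d_k),
                      rng.standard_normal(d_k),
                      rng.standard_normal(d_v))

print(out.shape, len(k_cache))       # (64,) 5
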
In[24]:
Code
def compare_attention_variants(n, d, k=None):
    """
    Compare complexity of attention variants.

    Args:
        n: Sequence length
        d: Model dimension
        k: Sparsity parameter (positions attended to)
    """
    if k is None:
        k = int(np.sqrt(n))  # Common choice: sqrt(n) sparse positions

    return {
        "Standard": n * n * d,
        "Linear": n * d * d,
        "Sparse": n * k * d,
        "Window (w=256)": n * 256 * d,
    }


# Compare at long sequence
n = 16384
d = 768
variants = compare_attention_variants(n, d)
Out[25]:
Console
Attention Variant Complexity (n=16,384, d=768):

Variant                        FLOPs    Speedup
------------------------------------------------
Standard             206,158,430,208        1.0x
Linear                 9,663,676,416       21.3x
Sparse                 1,610,612,736      128.0x
Window (w=256)         3,221,225,472       64.0x

The speedup factors demonstrate why efficient attention variants are essential for long sequences. Standard attention at 16K tokens requires over 200 billion operations. Linear attention, which trades the $n^2$ term for $d^2$, achieves roughly a 21x speedup by avoiding the explicit attention matrix. Sparse attention with $k = \sqrt{n} \approx 128$ positions goes further, reaching a 128x speedup. Windowed attention with a fixed 256-token window provides approximately 64x speedup, making it highly effective for tasks where local context dominates. These variants enable practical processing of long documents, book-length texts, and extended conversations that would be prohibitively expensive with standard attention.

Out[26]:
Visualization: complexity scaling of different attention mechanisms across sequence lengths. Standard attention shows quadratic growth, while linear, sparse, and windowed variants remain subquadratic, enabling much longer context windows.

Limitations and Impact

The quadratic complexity of attention is both its greatest strength and most significant limitation. The all-pairs computation enables transformers to capture any dependency regardless of distance, but it also creates hard limits on context length that no amount of hardware can fully overcome.

For practitioners, understanding these complexity bounds is essential for making informed architecture decisions. When processing short sequences (a few hundred tokens), attention overhead is negligible compared to feed-forward layers. In this regime, you can use standard attention without concern. However, when context length grows into the thousands or tens of thousands, attention becomes the dominant cost. You need to consider sparse attention variants, linear approximations, or hierarchical processing strategies. The crossover point depends on your model size, but for typical configurations, it occurs around 512-1024 tokens.

Memory constraints often bite before compute constraints do. During training, storing attention matrices for backpropagation requires $O(n^2)$ memory per layer. This quadratic memory scaling is why long-context training requires techniques like gradient checkpointing (trading compute for memory), FlashAttention (reordering operations to reduce memory), or model parallelism (distributing across multiple GPUs). The memory bottleneck also affects inference, especially when serving many concurrent requests with long contexts.

The complexity analysis also explains why efficient attention is such an active research area. Linearizing attention, sparsifying attention patterns, and approximating attention with lower-rank structures are all attempts to break the quadratic barrier while preserving the benefits of all-pairs interaction. These techniques have enabled models with 100K+ token context windows, but they often involve trade-offs in accuracy or increased implementation complexity. Understanding when these trade-offs are acceptable requires understanding the underlying complexity.

Summary

Attention's quadratic complexity is a fundamental characteristic that shapes how we design, train, and deploy transformer models. In this chapter, we analyzed this complexity from multiple angles and explored its practical implications.

Key takeaways:

  • Time complexity: Standard self-attention requires $O(n^2 d)$ operations, where $n$ is sequence length and $d$ is model dimension. The $n^2$ term means doubling sequence length quadruples computation.

  • Memory complexity: Storing attention matrices requires $O(h \cdot n^2)$ memory, where $h$ is the number of heads. This quadratic memory scaling often becomes the limiting factor before compute does.

  • Crossover with RNN: Setting $n^2 d = nd^2$ shows that attention is cheaper than an RNN when $n < d$, but more expensive when $n > d$. For typical model dimensions (512-4096), attention becomes more expensive beyond a few hundred tokens.

  • Practical limits: GPU memory and compute time impose hard constraints on context length. Even with 80GB GPUs, sequence lengths beyond tens of thousands of tokens require specialized techniques.

  • The attention bottleneck: As sequence length grows, attention dominates transformer computation, consuming 50% or more of total FLOPs at long contexts.

  • Efficient alternatives: Linear attention, sparse attention, and memory-efficient implementations can reduce complexity from $O(n^2)$ to $O(n)$ or $O(n \cdot k)$, where $k$ is the number of attended positions, enabling much longer contexts.

Understanding these complexity characteristics is essential for choosing appropriate architectures, estimating computational requirements, and knowing when to apply optimization techniques. The quadratic wall is real, but it can be navigated with the right tools and understanding.

Key Parameters for Complexity Analysis

When analyzing or estimating attention complexity, several parameters directly impact computational and memory requirements:

  • n (sequence length): The number of tokens in the input sequence. This is the most critical parameter since complexity scales quadratically with it. Typical values range from 512 to 128K+ in modern models.

  • d (model dimension): The embedding dimension of the model. Common values include 768 (GPT-2 Small), 1024 (GPT-2 Medium), 4096 (LLaMA 7B), and larger for bigger models. Complexity scales linearly with this parameter.

  • h (number of heads): The number of attention heads. Each head operates on $d/h$ dimensions. More heads mean more parallel attention patterns but also more attention matrices to store.

  • L (number of layers): Total transformer layers. Memory and compute scale linearly with layer count, so a 24-layer model requires twice the resources of a 12-layer model.

  • dtype_bytes: Precision of floating-point representation. fp32 uses 4 bytes, fp16/bf16 use 2 bytes. Half precision halves memory requirements and often doubles throughput on modern GPUs.

  • batch_size: Number of sequences processed simultaneously. Memory scales linearly with batch size, often requiring batch size reduction for long sequences.

