Understand why self-attention has O(n²d) complexity, how memory scales quadratically, and when to use efficient attention variants like sparse and linear attention.

This article is part of the free-to-read Language AI Handbook
Attention Complexity
The power of self-attention comes with a price. Every token attends to every other token, creating n² pairwise interactions for a sequence of n tokens. This quadratic scaling is both the source of attention's strength and its primary limitation. Understanding the computational and memory costs of attention is essential for working with transformers at scale, choosing appropriate model configurations, and knowing when alternative architectures might be necessary.
In this chapter, we analyze the complexity of attention from multiple angles: time complexity measured in floating-point operations (FLOPs), memory requirements for storing attention weights, and practical scaling limits. We compare attention to recurrent models and explore why the quadratic cost becomes prohibitive for long sequences.
The Quadratic Bottleneck
To understand why attention becomes expensive, we need to trace through exactly what happens when self-attention processes a sequence. The mechanism is elegant but computationally demanding, and the cost comes from a single, fundamental fact: every token must interact with every other token.
Imagine you're in a room full of people, and everyone needs to have a brief conversation with everyone else. With 10 people, that's 45 conversations. With 100 people, it's 4,950 conversations. The number of interactions grows as the square of the group size. Self-attention faces exactly this scaling challenge.
The Three-Stage Computation Pipeline
Self-attention proceeds through three distinct operations, each contributing to the overall computational cost:
1. Computing attention scores: Each token asks "how relevant is every other token to me?" This requires comparing all tokens against all tokens.
2. Applying softmax: The raw scores are converted into proper attention weights that sum to 1, allowing us to interpret them as a probability distribution.
3. Computing outputs: Each token gathers information from all other tokens, weighted by the attention scores, to produce its contextual representation.
The first and third steps are where the quadratic cost emerges. Let's examine each one carefully.
Stage 1: Computing Attention Scores
The heart of self-attention is measuring pairwise relevance between tokens. Given a sequence of n tokens, we need to compute how strongly each token should attend to every other token. This is done by comparing query vectors against key vectors using the dot product.
Mathematically, we arrange all queries into a matrix Q and all keys into a matrix K, then compute:

S = QKᵀ

where:
- S: the n × n attention score matrix, where S_ij measures how much token i should attend to token j
- Q: the query matrix with shape n × d_k, containing one query vector per token
- Kᵀ: the transposed key matrix with shape d_k × n, rearranging the key vectors for efficient matrix multiplication
- n: the sequence length (number of tokens)
- d_k: the dimension of each query/key vector (typically 64 in standard transformers)
Why does this produce n² values? Each row of Q (one token's query) gets multiplied against every column of Kᵀ (every token's key), producing one score per pair. With n rows and n columns, we get an n × n grid of scores.
The computational cost of this matrix multiplication is substantial. Each entry in the output matrix requires d_k multiplications and d_k − 1 additions. With n² entries, the total operation count is on the order of n² · d_k.
A multiply-accumulate (MAC) operation computes a ← a + x · y and is the fundamental unit of computation in matrix multiplication. When counting FLOPs, we typically count multiplications and additions separately, giving 2 FLOPs per MAC. For simplicity, we often count just multiplications, knowing the total FLOPs is roughly double.
The critical observation is that this score computation already scales as n² · d_k. The n² term is unavoidable: we cannot know which token pairs are important without examining all of them.
Stage 2: Normalizing with Softmax
Raw dot products can be any real number, positive or negative, with no upper bound. To use them as weights for a weighted average, we need to transform them into a valid probability distribution. The softmax function accomplishes this for each row of the score matrix:

A_ij = exp(S_ij) / Σ_m exp(S_im)

For each token i, softmax converts its n raw scores into n attention weights that sum to 1. This involves:
- Computing exp(S_ij) for each of the n scores (making them all positive)
- Summing these exponentials across the row
- Dividing each exponential by the sum
Each row requires approximately 3n operations (exponentiate, sum, divide), and with n rows, the total is roughly 3n². While still quadratic, this is dominated by the n² · d_k cost of score computation when d_k is large.
Stage 3: Computing the Weighted Output
The final step uses the attention weights to compute a weighted combination of value vectors. Each token's output is a blend of all tokens' values, weighted by attention:

Output = A · V

where:
- A: the n × n attention weight matrix, with each row summing to 1
- V: the value matrix with shape n × d_v, containing one value vector per token
- d_v: the dimension of each value vector (typically equal to d_k)
This is another matrix multiplication: an n × n matrix times an n × d_v matrix produces an n × d_v output. Each of the n · d_v output elements requires n multiply-accumulate operations (summing across the n positions), giving n² · d_v total operations.
This step is just as expensive as score computation. We've now encountered two matrix multiplications, each with quadratic cost in the sequence length.
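To make the shapes concrete before totaling the cost, here is a minimal NumPy sketch of the three stages for a single attention head. The dimensions and array names are illustrative, not taken from the article's own code:

```python
import numpy as np

n, d_k, d_v = 1024, 64, 64            # sequence length, key dim, value dim (illustrative)
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d_k))     # one query vector per token
K = rng.standard_normal((n, d_k))     # one key vector per token
V = rng.standard_normal((n, d_v))     # one value vector per token

# Stage 1: pairwise relevance scores -> (n, n) matrix, ~n^2 * d_k MACs
S = Q @ K.T

# Stage 2: row-wise softmax -> weights that sum to 1 per row, ~3n^2 ops
S_shifted = S - S.max(axis=1, keepdims=True)   # subtract row max for numerical stability
expS = np.exp(S_shifted)
A = expS / expS.sum(axis=1, keepdims=True)

# Stage 3: weighted combination of values -> (n, d_v) output, ~n^2 * d_v MACs
output = A @ V

print(S.shape, A.shape, output.shape)  # (1024, 1024) (1024, 1024) (1024, 64)
```

The two matrix multiplications (`Q @ K.T` and `A @ V`) are where the quadratic cost lives; the softmax only touches each of the n² scores a constant number of times.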
Putting It Together: Total Complexity
Summing the costs from all three stages:

Total ≈ n² · d_k + 3n² + n² · d_v ≈ O(n² · d)

where:
- n: sequence length
- d_k: query/key dimension
- d_v: value dimension
- d: model dimension (typically d_k = d_v = d/h for h attention heads)
The simplification to O(n² · d) reflects that d_k and d_v are both proportional to the model dimension d. In standard transformers, each head operates on d/h dimensions, but since we have h heads processing in parallel and concatenating their outputs, the total complexity across all heads scales with the full model dimension d.
The Scaling Insight
This analysis reveals the fundamental scaling behavior of self-attention:
- Quadratic in sequence length: Doubling n quadruples the computation. This is the defining characteristic that limits context length.
- Linear in model dimension: Doubling d only doubles the computation. Making models wider is relatively cheap.
The quadratic term dominates for long sequences. At n = 1024 tokens with d = 768 (GPT-2 dimensions), the n² factor (over one million) far exceeds the d factor of 768. Understanding this asymmetry is crucial for reasoning about transformer scalability.
Implementation: Counting FLOPs
Let's translate this analysis into code. We'll build a function that calculates the exact number of floating-point operations for self-attention, making the complexity formula concrete and verifiable.
The function mirrors our three-stage analysis: score computation contributes n² · d operations, softmax contributes roughly 3n² operations, and output computation contributes another n² · d operations.
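The original code listing is not reproduced here, so the following is a minimal sketch of such a function. The name `attention_flops` and the choice to count one operation per multiply-accumulate (with d_k = d_v = d across all heads) are assumptions:

```python
def attention_flops(n: int, d: int) -> int:
    """Approximate operation count for one self-attention layer.

    n: sequence length, d: model dimension.
    Counts one operation per multiply-accumulate; counting multiplies
    and adds separately would roughly double the result.
    """
    scores = n * n * d     # Stage 1: Q @ K^T
    softmax = 3 * n * n    # Stage 2: exponentiate, sum, divide per row
    output = n * n * d     # Stage 3: A @ V
    return scores + softmax + output
```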
With this function, we can explore how computational cost scales across different sequence lengths. We'll use GPT-2's model dimension (d = 768) to ground our analysis in real-world numbers.
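Sweeping the sequence length with d = 768 reproduces the scaling discussed below (a usage sketch under the same assumptions):

```python
d_model = 768  # GPT-2 Small
for n in [128, 256, 512, 1024, 2048, 4096, 8192]:
    ops = attention_flops(n, d_model)
    print(f"n = {n:5d}  ->  {ops / 1e9:7.2f} billion operations")
# n = 128  gives ~0.025 billion (25 million) operations,
# n = 1024 gives ~1.6 billion, and n = 8192 gives ~100 billion.
```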
The table shows the quadratic explosion in computational cost. At sequence length 128, attention requires roughly 25 million operations. By 1024 tokens (a typical context window), we're performing over 1.6 billion operations. At 8192 tokens, this jumps to nearly 100 billion operations. These numbers are per layer only, so a 12-layer transformer multiplies each value by 12.
The plot confirms the quadratic relationship. Moving from 512 to 1024 tokens quadruples the computation, and moving from 1024 to 2048 quadruples it again. At the right edge of the plot, we're computing nearly 100 billion operations for a single attention layer.
Memory Requirements
Computation is only half the story. Memory consumption often becomes the limiting factor before compute does. Self-attention requires storing several large tensors.
Attention Matrix Storage
The attention weight matrix has shape n × n, where n is the sequence length. For a single attention head, this matrix contains n² elements. With h heads, we store h · n² values. Using 32-bit floats (4 bytes each), the memory requirement is:

M_attention = h · n² · 4 bytes

where:
- M_attention: memory required for attention matrices
- h: number of attention heads
- n: sequence length
- 4: bytes per element for 32-bit floats (fp32)
For 16-bit floats (common in modern training with fp16 or bf16), this halves to h · n² · 2 bytes. The quadratic dependence on n means that doubling the sequence length quadruples the memory requirement.
Total Memory Breakdown
A complete forward pass through attention requires storing:
- Query, Key, Value matrices: Each is n × d, totaling 3 · n · d elements
- Attention scores/weights: n² per head, h · n² total
- Output before projection: n · d elements
- Intermediate activations for backpropagation: Often requires keeping the attention matrix and pre-softmax scores
During training, gradient computation doubles many of these requirements.
The memory requirements reveal the quadratic scaling in action. For GPT-2 Small, moving from 512 to 2048 tokens increases memory by roughly 16x (from ~0.2GB to ~3GB). At 8192 tokens, even the smaller model requires substantial memory. For GPT-2 Medium with its additional layers and heads, the numbers are even larger. These estimates cover only attention matrices and exclude model weights, optimizer states, and gradient storage needed during training.
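These figures can be roughly reproduced with a small estimator. The function below is a sketch under the assumptions stated above (attention-related activations only, fp32 by default); the function name and breakdown are illustrative:

```python
def attention_memory_gb(n: int, d: int, h: int, n_layers: int,
                        dtype_bytes: int = 4) -> float:
    """Rough memory estimate (GB) for attention activations across all layers.

    Counts Q/K/V, the h attention matrices, and the pre-projection output
    per layer; excludes model weights, gradients, and optimizer states.
    dtype_bytes: 4 for fp32, 2 for fp16/bf16.
    """
    qkv = 3 * n * d * dtype_bytes       # Q, K, V activations
    attn = h * n * n * dtype_bytes      # n x n attention weights per head
    out = n * d * dtype_bytes           # output before the final projection
    return n_layers * (qkv + attn + out) / 1e9

# GPT-2 Small-like configuration: 12 layers, 12 heads, d = 768
print(attention_memory_gb(n=512, d=768, h=12, n_layers=12))    # ~0.2 GB
print(attention_memory_gb(n=2048, d=768, h=12, n_layers=12))   # ~2.7 GB
```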
The heatmap reveals a crucial insight: memory grows quadratically with sequence length but only linearly with model size. Doubling from 2048 to 4096 tokens has a larger impact than moving from GPT-2 Small to GPT-2 Large. For truly long sequences, even small models hit memory limits.
Attention vs. RNN Complexity
To appreciate why the quadratic cost matters, let's compare attention to recurrent neural networks. Both architectures process sequences, but their computational patterns differ fundamentally.
RNN Complexity Analysis
An RNN with hidden dimension d processes each of n tokens sequentially. At each timestep t, the hidden state update follows:

h_t = σ(W_hh · h_{t−1} + W_xh · x_t)

where:
- h_t: hidden state at timestep t (dimension d)
- W_hh: hidden-to-hidden weight matrix (shape d × d)
- h_{t−1}: previous hidden state (dimension d)
- W_xh: input-to-hidden weight matrix (shape d × d, assuming the input dimension matches d)
- x_t: input at timestep t
- σ: activation function (e.g., tanh)
The computational cost per timestep is:
- Matrix-vector multiplication W_hh · h_{t−1}: d² operations
- Matrix-vector multiplication W_xh · x_t: d² operations
- Activation function: d operations
Total per step: 2d² + d, or O(d²). For n tokens: O(n · d²).
RNN complexity is linear in sequence length and quadratic in hidden dimension.
Comparative Analysis
Summarizing the complexity of each architecture:
- Attention: O(n² · d)
- RNN: O(n · d²)
where n is sequence length and d is the hidden/model dimension.
To find when these costs are equal, we set n² · d = n · d² and solve for n.
Dividing both sides by n · d (assuming n, d > 0) gives n = d.
This crossover point has a clear interpretation: when sequence length exceeds model dimension (n > d), attention becomes more expensive. When model dimension exceeds sequence length (d > n), the RNN becomes more expensive.
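A small sketch makes the crossover visible. The per-step RNN cost follows the 2d² + d derivation above, and d = 512 is an illustrative choice:

```python
def attention_ops(n: int, d: int) -> int:
    # 2 * n^2 * d for scores and output, plus ~3n^2 for the softmax
    return 2 * n * n * d + 3 * n * n

def rnn_ops(n: int, d: int) -> int:
    # (2d^2 + d) per timestep, n timesteps
    return n * (2 * d * d + d)

d = 512
for n in [64, 128, 512, 2048, 4096]:
    ratio = attention_ops(n, d) / rnn_ops(n, d)
    print(f"n = {n:5d}  attention/RNN cost ratio = {ratio:5.2f}")
# The ratio is roughly n/d: well below 1 for short sequences,
# about 1 at n = d, and growing linearly once n exceeds d.
```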
The ratio column reveals the crossover behavior. At n = d/8 and n = d/4, attention is actually 0.13x and 0.25x the cost of an RNN, respectively, making attention the more efficient choice. At n = d, both architectures have identical cost (1.00x ratio). Beyond this point, attention becomes increasingly expensive: at n = 8d, attention requires 8x more operations than an equivalent RNN.
The crossover behavior has practical implications. For short sequences (a few hundred tokens), attention's overhead is minimal. For long documents, books, or conversation histories spanning thousands of tokens, the quadratic cost becomes the dominant bottleneck.
Path Length vs. Computation Trade-off
Why use attention despite its higher cost? The answer lies in path length. In an RNN, information from token i to token j must traverse |i − j| sequential steps. In attention, any two tokens connect in a single step.
The "Efficiency" column (path improvement divided by compute ratio) quantifies the value proposition of attention. At , we pay only 0.25x the compute of an RNN while gaining 63x shorter average paths, yielding an efficiency of 252. At (the crossover point), we pay 1x compute for 255x path improvement. At , we pay 4x more compute but get 1023x shorter paths. Notice that efficiency remains nearly constant around . This reveals an important insight: as sequence length grows, both the path improvement and the compute penalty scale proportionally with . Attention's value proposition stays roughly constant regardless of sequence length. The short path lengths are always worth the extra compute.
Practical Scaling Limits
Given the quadratic complexity, what are the practical limits for attention-based models? The answer depends on hardware constraints, training budget, and inference requirements.
GPU Memory Constraints
Modern GPUs have 16GB to 80GB of memory. With fp16 precision (2 bytes per element), the attention matrices for a single layer with 12 heads at sequence length 8192 require:

M = h · n² · 2 bytes = 12 · 8192² · 2 bytes ≈ 1.5 GB

where:
- M: memory in bytes
- h = 12: number of attention heads
- n = 8192: sequence length
- 2: bytes per element (fp16)
For a 12-layer model, that's approximately 18GB just for attention matrices, before accounting for model weights, other activations, or gradients. Training requires additional memory for optimizer states (often 2x the model size for Adam) and gradient accumulation.
These estimates assume 70% of GPU memory is available for attention matrices (the remaining 30% accounts for model weights and other overhead). An RTX 3090 with 24GB can handle approximately 27K tokens, while an A100 80GB extends this to around 50K tokens. These numbers represent theoretical upper bounds for a single sample with batch size 1. In practice, larger batch sizes for training efficiency, gradient storage for backpropagation, and optimizer states further reduce these limits. Production systems address these constraints through gradient checkpointing, FlashAttention, and model parallelism.
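The per-GPU limits quoted above follow from a quick calculation. The sketch below assumes, as stated, that 70% of memory is available for attention matrices, and additionally that only a single layer's attention matrices need to be materialized at once:

```python
import math

def max_context_tokens(gpu_bytes: float, heads: int = 12,
                       dtype_bytes: int = 2, usable_fraction: float = 0.7) -> int:
    """Rough upper bound on sequence length from attention-matrix memory alone.

    Solves usable_fraction * gpu_bytes = heads * n^2 * dtype_bytes for n,
    i.e., one layer's worth of fp16 attention matrices at batch size 1.
    """
    budget = usable_fraction * gpu_bytes
    return int(math.sqrt(budget / (heads * dtype_bytes)))

print(max_context_tokens(24 * 2**30))   # RTX 3090, 24 GB: ~27K tokens
print(max_context_tokens(80 * 2**30))   # A100, 80 GB: ~50K tokens
```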
Time Constraints
Beyond memory, compute time imposes practical limits. Processing long sequences is slow.
At 512 tokens, attention completes in a fraction of a millisecond. By 4096 tokens, latency grows but remains acceptable for most applications. At 32K tokens, attention alone approaches double-digit milliseconds per forward pass. The "Tokens/sec" column shows throughput in terms of context size processed: higher values indicate more efficient processing for that context length. During training with backpropagation, multiply these times by roughly 3x. For autoregressive generation where the full context is reprocessed for each new token, long contexts compound these costs significantly.
The timing curve shows why context length is such a hot topic in LLM development. Crossing the 100ms threshold makes interactive applications sluggish. Crossing 1 second makes them impractical for real-time use.
The Attention Bottleneck
The combination of quadratic time and space complexity creates what's known as the attention bottleneck. As we push for longer context windows, attention becomes the dominant cost, often exceeding the cost of feed-forward layers, embeddings, and everything else combined.
The percentages reveal how computational burden shifts with sequence length. At 256 tokens, attention accounts for only about 5% of layer computation, with the feed-forward network dominating at around 69%. By 1024 tokens, attention grows to roughly 17%, and at 4096 tokens it reaches approximately 45%, nearly matching the FFN. The projection layers (QKV and output) remain relatively stable since they scale linearly with . This distribution explains why optimizing attention becomes critical for long-context applications while being less important for short sequences.
This visualization shows the attention bottleneck in action. The red region (attention core) expands as sequence length grows, squeezing out other components. This is why efficient attention mechanisms are such an active research area.
Efficient Attention Variants
The quadratic complexity of standard attention has motivated numerous alternatives. While a detailed treatment of each variant deserves its own chapter, understanding the landscape helps contextualize the complexity analysis.
Linear Attention Approximations
Several methods approximate attention with linear complexity by avoiding the explicit n × n attention matrix:
- Linformer: Projects keys and values to lower dimensions before computing attention
- Performer: Uses random feature maps to approximate softmax attention in O(n) time
- Linear Transformers: Replace softmax with kernel functions that allow associative computation
These methods trade accuracy for speed, and they often work well in practice despite their approximate nature.
Sparse Attention Patterns
Instead of attending to all positions, sparse methods attend to a subset:
- Longformer: Combines local windowed attention with global attention on special tokens
- BigBird: Uses a combination of random, window, and global attention patterns
- Sparse Transformers: Use fixed, factorized sparse attention patterns (such as strided and local patterns)
Sparse attention reduces complexity to O(n · k · d), where:
- n: sequence length
- k: number of positions each token attends to (typically k ≪ n)
For example, with a local window of 256 tokens, k = 256 regardless of sequence length, making the complexity effectively linear in n.
Memory and Caching Approaches
Some methods restructure the computation to reduce memory:
- Flash Attention: Reorders operations to minimize memory transfers, achieving the same result with less memory
- KV Caching: During generation, caches key-value pairs to avoid recomputation
- Gradient Checkpointing: Trades computation for memory by recomputing activations during backward pass
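As a side note on KV caching, the cache grows only linearly with context length, which is why it is so effective during generation. A quick sketch of the arithmetic, with an illustrative GPT-2 Small-like configuration:

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, d: int, dtype_bytes: int = 2) -> int:
    """Size of the key-value cache during autoregressive generation.

    Each position stores one key and one value vector (d values each)
    per layer, so the cache scales with n, not n^2.
    """
    return 2 * n_layers * d * dtype_bytes * n_tokens

# 12 layers, d = 768, fp16: roughly 0.3 GB for an 8K-token context
print(kv_cache_bytes(8192, n_layers=12, d=768) / 1e9)
```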
The speedup factors demonstrate why efficient attention variants are essential for long sequences. Standard attention at 16K tokens requires over 200 billion operations. Linear attention, which trades the n² term for n, achieves roughly a 21x speedup by avoiding the explicit attention matrix. Sparse attention with a suitably small number of attended positions per token achieves similar gains. Windowed attention with a fixed 256-token window provides approximately 64x speedup, making it highly effective for tasks where local context dominates. These variants enable practical processing of long documents, book-length texts, and extended conversations that would be prohibitively expensive with standard attention.
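The rough magnitudes can be checked with a back-of-the-envelope comparison; the sketch counts only the dominant matrix-multiplication terms:

```python
d = 768        # model dimension (GPT-2 Small)
n = 16_384     # sequence length

standard = n * n * d       # explicit n x n attention scores
linear = n * d * d         # kernelized attention: build (K^T V), then multiply by Q
windowed = n * 256 * d     # each token attends only to a 256-token window

print(f"standard: {standard / 1e9:6.1f} B ops")
print(f"linear:   {linear / 1e9:6.1f} B ops  (~{standard / linear:.0f}x fewer)")
print(f"windowed: {windowed / 1e9:6.1f} B ops  (~{standard / windowed:.0f}x fewer)")
```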
Limitations and Impact
The quadratic complexity of attention is both its greatest strength and most significant limitation. The all-pairs computation enables transformers to capture any dependency regardless of distance, but it also creates hard limits on context length that no amount of hardware can fully overcome.
For practitioners, understanding these complexity bounds is essential for making informed architecture decisions. When processing short sequences (a few hundred tokens), attention overhead is negligible compared to feed-forward layers. In this regime, you can use standard attention without concern. However, when context length grows into the thousands or tens of thousands, attention becomes the dominant cost. You need to consider sparse attention variants, linear approximations, or hierarchical processing strategies. The crossover point depends on your model size, but for typical configurations, it occurs around 512-1024 tokens.
Memory constraints often bite before compute constraints do. During training, storing attention matrices for backpropagation requires on the order of h · n² values per layer. This quadratic memory scaling is why long-context training requires techniques like gradient checkpointing (trading compute for memory), FlashAttention (reordering operations to reduce memory), or model parallelism (distributing across multiple GPUs). The memory bottleneck also affects inference, especially when serving many concurrent requests with long contexts.
The complexity analysis also explains why efficient attention is such an active research area. Linearizing attention, sparsifying attention patterns, and approximating attention with lower-rank structures are all attempts to break the quadratic barrier while preserving the benefits of all-pairs interaction. These techniques have enabled models with 100K+ token context windows, but they often involve trade-offs in accuracy or increased implementation complexity. Understanding when these trade-offs are acceptable requires understanding the underlying complexity.
Summary
Attention's quadratic complexity is a fundamental characteristic that shapes how we design, train, and deploy transformer models. In this chapter, we analyzed this complexity from multiple angles and explored its practical implications.
Key takeaways:
- Time complexity: Standard self-attention requires O(n² · d) operations, where n is sequence length and d is model dimension. The n² term means doubling the sequence length quadruples computation.
- Memory complexity: Storing attention matrices requires O(h · n²) memory, where h is the number of heads. This quadratic memory scaling often becomes the limiting factor before compute does.
- Crossover with RNN: Setting n² · d = n · d² shows that attention is cheaper than an RNN when n < d, but more expensive when n > d. For typical model dimensions (512-4096), attention becomes more expensive beyond a few hundred to a few thousand tokens.
- Practical limits: GPU memory and compute time impose hard constraints on context length. Even with 80GB GPUs, sequence lengths beyond tens of thousands of tokens require specialized techniques.
- The attention bottleneck: As sequence length grows, attention dominates transformer computation, consuming 50% or more of total FLOPs at long contexts.
- Efficient alternatives: Linear attention, sparse attention, and memory-efficient implementations can reduce complexity from O(n²) to O(n) or O(n · k), where k is the number of attended positions, enabling much longer contexts.
Understanding these complexity characteristics is essential for choosing appropriate architectures, estimating computational requirements, and knowing when to apply optimization techniques. The quadratic wall is real, but it can be navigated with the right tools and understanding.
Key Parameters for Complexity Analysis
When analyzing or estimating attention complexity, several parameters directly impact computational and memory requirements:
- n (sequence length): The number of tokens in the input sequence. This is the most critical parameter since complexity scales quadratically with it. Typical values range from 512 to 128K+ in modern models.
- d (model dimension): The embedding dimension of the model. Common values include 768 (GPT-2 Small), 1024 (GPT-2 Medium), 4096 (LLaMA 7B), and larger for bigger models. Complexity scales linearly with this parameter.
- h (number of heads): The number of attention heads. Each head operates on d/h dimensions. More heads mean more parallel attention patterns but also more attention matrices to store.
- L (number of layers): Total transformer layers. Memory and compute scale linearly with layer count, so a 24-layer model requires twice the resources of a 12-layer model.
- dtype_bytes: Precision of the floating-point representation. fp32 uses 4 bytes, fp16/bf16 use 2 bytes. Half precision halves memory requirements and often doubles throughput on modern GPUs.
- batch_size: Number of sequences processed simultaneously. Memory scales linearly with batch size, often requiring batch size reduction for long sequences.