Learn how linear attention achieves O(nd²) complexity by replacing softmax with kernel functions, enabling transformers to scale to extremely long sequences through clever matrix reordering.

Linear Attention
Standard self-attention computes a weighted sum of value vectors, where the weights come from a softmax over query-key similarities. This produces an attention matrix that must be explicitly computed and stored, leading to the quadratic bottleneck that limits transformers to modest sequence lengths. Linear attention takes a radically different approach: by replacing the softmax with a decomposable kernel function, it enables a clever reordering of matrix operations that eliminates the quadratic term entirely.
The key insight is mathematical: if we can factor the attention weights into a product of separate functions applied to queries and keys, we can change the order of operations from $(QK^\top)V$ to $Q(K^\top V)$. The first form requires computing the full $n \times n$ attention matrix. The second form computes $K^\top V$ first, which has shape $d \times d$ (model dimension squared), independent of sequence length. This simple reordering transforms $O(n^2 d)$ complexity into $O(n d^2)$, making attention linear in sequence length.
In this chapter, we explore how linear attention achieves this transformation, examine the kernel feature maps that make it possible, implement the mechanism from scratch, and analyze both its theoretical elegance and practical limitations. We also survey variants like Performer that address some of softmax attention's properties that linear attention initially sacrifices.
The Associative Property Trick
To understand linear attention, we need to examine why standard softmax attention requires the full $n \times n$ attention matrix in the first place. The culprit is the softmax normalization, which creates dependencies between all positions.
Standard attention computes:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where:
- $Q \in \mathbb{R}^{n \times d_k}$: the query matrix containing one row per token
- $K \in \mathbb{R}^{n \times d_k}$: the key matrix containing one row per token
- $V \in \mathbb{R}^{n \times d_v}$: the value matrix containing the information to aggregate
- $d_k$: the dimension of query and key vectors
- $n$: the sequence length (number of tokens)
- The softmax is applied row-wise, normalizing each query's attention distribution
For each query position $i$, the attention output is:
$$\text{output}_i = \frac{\sum_{j=1}^{n} \exp\!\left(q_i^\top k_j / \sqrt{d_k}\right) v_j}{\sum_{j=1}^{n} \exp\!\left(q_i^\top k_j / \sqrt{d_k}\right)}$$
where:
- $q_i$: the query vector at position $i$
- $k_j$: the key vector at position $j$
- $v_j$: the value vector at position $j$
- $q_i^\top k_j$: the dot product measuring similarity between query and key
- $\exp(\cdot)$: the exponential function ensuring positive weights
- The denominator sums over all positions, creating the normalization constant
The denominator is the normalization term: to compute the attention weight for any single key position $j$, we need to know the scores for all $n$ positions, because the softmax divides by their sum. This creates a global dependency that forces us to compute all query-key products before we can produce any output.
Breaking the Global Dependency
Linear attention's solution is to replace the exponential-and-normalize pattern with a kernel function that decomposes into separate feature maps for queries and keys. Instead of:
$$\text{sim}(q_i, k_j) = \exp\!\left(\frac{q_i^\top k_j}{\sqrt{d_k}}\right)$$
We use:
$$\text{sim}(q_i, k_j) = \phi(q_i)^\top \phi(k_j)$$
where:
- $\phi(\cdot)$: a feature map that transforms vectors from the original $d$-dimensional space to a $D$-dimensional feature space
- $\phi(q_i)$: the transformed query vector
- $\phi(k_j)$: the transformed key vector
- $D$: the dimension of the feature space (often equal to $d$ for simple maps, or larger for random feature methods)
The key property is that the similarity decomposes into an inner product of transformed vectors, allowing us to reorder operations using matrix associativity.
With this formulation, the attention output becomes:
$$\text{output}_i = \frac{\sum_{j=1}^{n} \phi(q_i)^\top \phi(k_j)\, v_j}{\sum_{j=1}^{n} \phi(q_i)^\top \phi(k_j)}$$
Since $\phi(q_i)$ does not depend on $j$, we can factor it out of the sum:
$$\text{output}_i = \frac{\phi(q_i)^\top \sum_{j=1}^{n} \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_{j=1}^{n} \phi(k_j)}$$
This factorization is the key to linear attention. The numerator contains $\phi(q_i)^\top$ multiplied by a sum that doesn't depend on $i$. The denominator has the same structure. Both sums can be computed once and reused for every query position.
The terms $\sum_{j} \phi(k_j)\, v_j^\top$ and $\sum_{j} \phi(k_j)$ can be computed once and reused for all query positions. This eliminates the need to compute pairwise query-key interactions.
Let's define:
- $S = \sum_{j=1}^{n} \phi(k_j)\, v_j^\top$: a matrix of shape $D \times d_v$
- $z = \sum_{j=1}^{n} \phi(k_j)$: a vector of shape $D$
Then the output for each position simplifies to:
$$\text{output}_i = \frac{\phi(q_i)^\top S}{\phi(q_i)^\top z}$$
where:
- $\phi(q_i)^\top S$: a vector-matrix product of size $d_v$, giving the unnormalized output
- $\phi(q_i)^\top z$: a dot product giving the normalization scalar
Computing $S$ requires $O(n D d_v)$ operations (summing $n$ outer products of size $D \times d_v$). Computing each output requires $O(D d_v)$ operations. With $n$ outputs, the total complexity is $O(n d^2)$, assuming $D$ is proportional to $d$.
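Before formalizing this further, a quick NumPy check makes the reordering concrete. This is a minimal sketch with illustrative sizes ($n = 1000$, $d = 64$) that omits the feature map and normalization, demonstrating only the associativity and the intermediate shapes:

```python
import numpy as np

n, d = 1000, 64
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))

left = (Q @ K.T) @ V   # intermediate (n, n): 1,000,000 elements
right = Q @ (K.T @ V)  # intermediate (d, d): 4,096 elements

print(np.allclose(left, right))  # True: both orders give the same result
```

Both orders produce the same output (up to floating-point rounding), but the second never materializes anything larger than $d \times d$.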
The Linear Attention Formula
Having established the mathematical foundation, we can now synthesize everything into the complete linear attention formula. The journey from standard attention to linear attention follows a clear logical path: we identified the softmax as the source of the quadratic bottleneck, replaced it with a decomposable kernel function, and discovered that the resulting expression can be rearranged to avoid computing the full attention matrix. Now let's see how these insights combine into a practical algorithm.
From Element-wise to Matrix Form
We derived earlier that each position's output can be written as:
$$\text{output}_i = \frac{\phi(q_i)^\top S}{\phi(q_i)^\top z}$$
To express this for all positions simultaneously, we stack the feature-mapped queries into a matrix and apply the same operations in parallel. For a sequence of $n$ tokens with queries $Q$, keys $K$, and values $V$, the complete formula is:
$$\text{LinAttention}(Q, K, V) = \frac{\phi(Q)\left(\phi(K)^\top V\right)}{\phi(Q)\left(\phi(K)^\top \mathbf{1}_n\right)}$$
Let's unpack each component to understand how the pieces fit together:
- $\phi(Q)$: The feature map applied row-wise to the query matrix. Each of the $n$ query vectors is transformed from $d_k$ dimensions to $D$ dimensions.
- $\phi(K)$: Similarly, the feature map applied to each key vector. After transformation, we transpose this to get $\phi(K)^\top$.
- $\phi(K)^\top V$: This is the key-value aggregate from our derivation. The matrix product has shape $D \times d_v$. Notice that $n$ disappears from the result shape, and this is precisely why linear attention avoids the quadratic bottleneck.
- $\phi(Q)\left(\phi(K)^\top V\right)$: Multiplying the queries by the aggregate produces the numerator. Shape: $n \times d_v$.
- $\phi(K)^\top \mathbf{1}_n$: Summing the transformed keys (equivalent to $z$ in our derivation). The vector of ones sums across all key positions, producing a $D$-dimensional vector.
- The division is performed element-wise per row, normalizing each output position by its corresponding scalar denominator.
The Order of Operations Matters
The formula highlights the critical insight: the order in which we perform matrix multiplications determines the complexity. Consider the numerator $\phi(Q)\,\phi(K)^\top V$:
- If we compute $\phi(Q)\,\phi(K)^\top$ first (as in standard attention), we get an $n \times n$ matrix. Multiplying by $V$ then costs $O(n^2 d_v)$.
- If we compute $\phi(K)^\top V$ first (the linear attention approach), we get a $D \times d_v$ matrix. Multiplying $\phi(Q)$ by this costs only $O(n D d_v)$.
The second approach avoids ever creating the $n \times n$ matrix. This is matrix associativity in action: $(AB)C = A(BC)$, but the intermediate sizes differ dramatically.
| Operation | Standard Attention | Linear Attention |
|---|---|---|
| Core computation | $\text{softmax}(QK^\top)\,V$ | $\phi(Q)\left(\phi(K)^\top V\right)$ |
| Intermediate matrix | $n \times n$ | $D \times d_v$ |
| Complexity | $O(n^2 d)$ | $O(n d^2)$ or $O(n D d)$ |
When the feature dimension $D$ is proportional to the model dimension $d$ (and $d \ll n$ for long sequences), linear attention provides substantial savings. For a 10,000-token sequence with $d = 64$, standard attention requires a 100-million-element intermediate matrix, while linear attention needs only 4,096 elements.
Implementation: Translating Math to Code
Let's implement linear attention step by step, mapping each line of code to the mathematical operations we've discussed.
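Below is a minimal NumPy sketch of non-causal linear attention with an ELU+1 feature map. The function and variable names (`linear_attention`, `elu_plus_one`, `Q_prime`, `K_prime`, `KV`, `K_sum`, `eps`) mirror the derivation above but are illustrative rather than taken from any particular library:

```python
import numpy as np

def elu_plus_one(x):
    # ELU(x) + 1: x + 1 for positive inputs, exp(x) otherwise.
    # Strictly positive, so the normalization denominator stays positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """Non-causal linear attention. Q, K: (n, d_k); V: (n, d_v)."""
    Q_prime = elu_plus_one(Q)        # phi(Q), shape (n, D)
    K_prime = elu_plus_one(K)        # phi(K), shape (n, D)

    KV = K_prime.T @ V               # S = phi(K)^T V, shape (D, d_v)
    K_sum = K_prime.sum(axis=0)      # z = phi(K)^T 1, shape (D,)

    numerator = Q_prime @ KV         # shape (n, d_v)
    denominator = Q_prime @ K_sum    # shape (n,)
    return numerator / (denominator[:, None] + eps)
```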
The implementation follows our derivation exactly:
- Apply the feature map: Transform $Q$ and $K$ to get $\phi(Q)$ and $\phi(K)$. We use ELU+1 as a default because it ensures all features are positive, which is necessary for the normalization to produce valid weights.
- Compute the key-value aggregate: `K_prime.T @ V` computes $S = \phi(K)^\top V$, the matrix that summarizes how each feature dimension relates to the values. This is computed once and reused for all queries.
- Compute the normalization aggregate: `K_prime.sum(axis=0)` computes $\phi(K)^\top \mathbf{1}_n$, equivalent to $z$. This is the denominator's aggregate.
- Query the aggregates: Each query retrieves its output by multiplying against the precomputed aggregates. The numerator `Q_prime @ KV` and denominator `Q_prime @ K_sum` are both $O(n D d)$ operations.
- Normalize: Element-wise division produces the final attention output.
Seeing the Efficiency Gain
Let's run this implementation and verify that it produces the expected shapes and efficiency characteristics.
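Continuing from the sketch above, a quick run with illustrative sizes ($n = 128$, $d = 64$) confirms the output shape and the size of the intermediate state:

```python
n, d = 128, 64
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))

out = linear_attention(Q, K, V)
KV = elu_plus_one(K).T @ V

print(out.shape)                                           # (128, 64)
print(f"attention matrix would hold {n * n} elements")     # 16,384
print(f"key-value aggregate holds   {KV.size} elements")   # 4,096
```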
The output confirms the efficiency gain: instead of a 128×128 attention matrix (16,384 elements), linear attention uses a 64×64 key-value aggregate (4,096 elements). This 4x reduction might seem modest, but remember that the ratio scales as $(n/d)^2$. At 1,024 tokens with $d = 64$, the reduction is 256x. At 8,192 tokens, it's over 16,000x. The quadratic-to-linear improvement becomes increasingly dramatic as sequences grow longer.
Kernel Feature Maps
The choice of feature map determines the properties of linear attention. Different feature maps approximate different similarity functions, each with trade-offs between computational efficiency and how well they mimic softmax behavior.
The Kernel Interpretation
The connection between feature maps and attention kernels comes from the theory of reproducing kernel Hilbert spaces (RKHS). A kernel function measures similarity between two vectors, and Mercer's theorem tells us that any positive semi-definite kernel can be written as:
$$k(x, y) = \langle \phi(x), \phi(y) \rangle$$
where:
- $k(x, y)$: a kernel function that takes two vectors and returns a scalar similarity score
- $x, y$: input vectors to compare
- $\phi$: a feature map that lifts inputs to a (possibly infinite-dimensional) feature space
- $\langle \cdot, \cdot \rangle$: the inner product in the feature space
In other words, for any valid kernel function there exists a corresponding feature map. Choosing a feature map implicitly defines a similarity kernel, and vice versa.
Standard softmax attention uses an exponential kernel (ignoring the scaling factor):
$$k_{\exp}(q, k) = \exp\!\left(q^\top k\right)$$
where $q^\top k$ is the dot product between query and key vectors. The exponential function ensures the similarity is always positive and amplifies differences between scores.
To approximate this with linear attention, we need a feature map $\phi$ such that:
$$\phi(q)^\top \phi(k) \approx \exp\!\left(q^\top k\right)$$
Finding such a feature map is the central challenge of linear attention design. The closer the approximation, the more closely linear attention mimics softmax behavior.
Common Feature Maps
Several feature maps have been proposed, each with different properties:
1. ELU + 1 (Exponential Linear Unit)
$$\phi(x) = \text{ELU}(x) + 1 = \begin{cases} x + 1 & \text{if } x > 0 \\ \exp(x) & \text{if } x \le 0 \end{cases}$$
This simple feature map ensures all features are positive (required for the normalization to work properly). It doesn't approximate the exponential kernel well but is computationally cheap.
2. ReLU
$$\phi(x) = \max(0, x)$$
Even simpler than ELU, but can produce zero features, potentially causing numerical issues in the denominator.
3. Random Fourier Features (Performer)
$$\phi(x) = \frac{1}{\sqrt{m}}\left[\sin(\omega_1^\top x), \cos(\omega_1^\top x), \ldots, \sin(\omega_m^\top x), \cos(\omega_m^\top x)\right]$$
where:
- $x$: the input vector (query or key)
- $m$: the number of random frequency vectors
- $\omega_1, \ldots, \omega_m$: random frequency vectors sampled from a standard normal distribution
- $\omega_i^\top x$: the projection of the input onto the $i$-th random direction
- $\frac{1}{\sqrt{m}}$: normalization factor to control the variance of the approximation
The output has dimension $2m$ (sine and cosine for each frequency). This provides an unbiased approximation to the Gaussian kernel. With additional scaling, it can approximate the softmax kernel.
4. Positive Random Features (FAVOR+)
To ensure non-negativity while approximating the exponential kernel, Performer uses:
$$\phi(x) = \frac{\exp\!\left(-\frac{\|x\|^2}{2}\right)}{\sqrt{m}}\left[\exp(\omega_1^\top x), \ldots, \exp(\omega_m^\top x)\right]$$
where:
- $x$: the input vector
- $\|x\|^2$: the squared Euclidean norm of the input
- $\exp\!\left(-\frac{\|x\|^2}{2}\right)$: a normalization factor that depends on the input magnitude
- $\omega_1, \ldots, \omega_m$: random projection vectors sampled from a standard normal distribution
- $\exp(\omega_i^\top x)$: ensures all features are strictly positive
- $m$: the number of random features
This construction gives $\mathbb{E}\!\left[\phi(q)^\top \phi(k)\right] = \exp(q^\top k)$, an unbiased estimator of the softmax kernel. The key insight is that by using exponentials instead of sines and cosines, all features remain positive, which is necessary for the normalization in attention to work correctly.
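For reference, here is a minimal sketch of the ReLU and FAVOR+ feature maps (ELU+1 was sketched in the implementation section). The function names and the `omega` argument are illustrative, and no numerical stabilization is included:

```python
import numpy as np

def relu_feature(x):
    # phi(x) = max(0, x): cheap, but can zero out entire feature vectors
    return np.maximum(x, 0.0)

def positive_random_features(x, omega):
    """FAVOR+-style positive random features.

    x: (..., d) input vectors; omega: (d, m) projections with entries ~ N(0, 1).
    """
    m = omega.shape[1]
    proj = x @ omega                                             # (..., m)
    scale = np.exp(-0.5 * np.sum(x**2, axis=-1, keepdims=True))  # exp(-||x||^2 / 2)
    return scale * np.exp(proj) / np.sqrt(m)                     # strictly positive
```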
The visualizations reveal key properties of the feature maps. ELU+1 produces strictly positive outputs for all inputs, which is essential for attention normalization. ReLU is simpler but zeros out negative inputs, losing information and potentially causing numerical issues. Neither simple feature map accurately approximates the exponential kernel, especially for larger dot products, which explains why they produce different attention patterns than softmax.
Comparing Standard and Linear Attention
Let's implement both attention mechanisms and compare their outputs to understand what linear attention preserves and what it changes.
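A minimal comparison sketch, reusing `linear_attention` from earlier; the `softmax_attention` helper, the similarity metrics, and the sizes are illustrative choices:

```python
import numpy as np

def softmax_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) attention scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 128, 64
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))

std_out = softmax_attention(Q, K, V)
lin_out = linear_attention(Q, K, V)               # from the earlier sketch

cos = np.sum(std_out * lin_out) / (np.linalg.norm(std_out) * np.linalg.norm(lin_out))
mse = np.mean((std_out - lin_out) ** 2)
print(f"cosine similarity: {cos:.3f}   MSE: {mse:.4f}")
```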
The cosine similarity shows that linear attention with a simple ELU feature map produces outputs that point in a similar direction to standard attention but aren't identical. The MSE quantifies the actual difference in output values.
The visualizations reveal a fundamental difference in attention behavior. Softmax attention concentrates weight on a few key positions, creating the "sharp" attention patterns that transformers are known for. Linear attention with simple feature maps spreads weight more uniformly, producing softer patterns that may miss fine-grained dependencies.
Computational Complexity Analysis
Let's rigorously compare the computational requirements of standard and linear attention across different sequence lengths.
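As a rough sketch, we can count only the dominant matrix-multiply operations (with $d = 64$, ignoring constants, feature-map cost, and memory effects), so the printed ratios are idealized rather than measured speedups:

```python
d = 64

for n in (128, 512, 2048, 8192, 16384):
    standard = 2 * n * n * d   # QK^T scores plus weights @ V
    linear = 2 * n * d * d     # K^T V aggregate plus Q @ (K^T V)
    print(f"n={n:6d}  standard={standard:.2e}  linear={linear:.2e}  "
          f"ratio={standard / linear:6.1f}x")
```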
The speedup grows linearly with sequence length. At 128 tokens, linear attention is only marginally faster. At 16,384 tokens, it's over 100x faster. This is the power of eliminating the quadratic term: the longer the sequence, the greater the benefit.
The log-scale plot reveals the essential difference: standard attention curves have slope 2 (quadratic), while linear attention curves have slope 1 (linear). All curves for the same algorithm are parallel, shifted only by the constant factor from the model dimension.
Memory Efficiency
Beyond compute savings, linear attention also dramatically reduces memory requirements. Standard attention must store the full attention matrix during the forward pass (for use in backpropagation). Linear attention only stores the key-value aggregate.
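A quick arithmetic sketch of the forward-pass memory (assuming fp32 and $d = D = 64$; activation storage for other layers is ignored):

```python
BYTES_FP32 = 4
d = 64

for n in (128, 1024, 16384, 65536):
    attn_matrix = n * n * BYTES_FP32   # standard: full n x n attention matrix
    kv_aggregate = d * d * BYTES_FP32  # linear: D x d_v key-value aggregate
    print(f"n={n:6d}  standard={attn_matrix / 2**20:10.1f} MiB  "
          f"linear={kv_aggregate / 2**10:.1f} KiB")
```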
At 16,384 tokens, standard attention requires over 1 GB just for the attention matrix, while linear attention uses only about 16 KB. This memory efficiency is what enables linear attention to scale to extremely long sequences that would be impossible with standard attention.
The log-log visualization makes the scaling difference stark. Standard attention memory grows as a straight line with slope 2 (quadratic), while linear attention remains essentially flat at around 16 KB regardless of sequence length. At 65,536 tokens, standard attention would require over 16 GB just for the attention matrix, exceeding even high-end GPU memory, while linear attention uses the same 16 KB it needed at 128 tokens.
Causal Linear Attention
For autoregressive language models, we need causal attention: each position can only attend to previous positions. Standard causal attention applies a triangular mask before softmax. Linear attention requires a different approach.
The key insight is that the key-value aggregate can be computed incrementally. Instead of summing over all positions, we maintain a running sum up to the current position:
$$S_i = \sum_{j=1}^{i} \phi(k_j)\, v_j^\top, \qquad z_i = \sum_{j=1}^{i} \phi(k_j), \qquad \text{output}_i = \frac{\phi(q_i)^\top S_i}{\phi(q_i)^\top z_i}$$
where:
- $S_i$: the cumulative key-value aggregate up to position $i$, computed by summing the outer products $\phi(k_j)\, v_j^\top$ for all positions $j \le i$
- $z_i$: the cumulative key sum up to position $i$, used for normalization
- $\phi(q_i)^\top S_i$: retrieves a weighted combination of values based on the query at position $i$
- $\phi(q_i)^\top z_i$: computes the normalization factor
This formulation naturally respects causality: position $i$ only accesses information from positions $1$ through $i$. The running sum structure enables efficient incremental computation during inference.
The causal formulation is particularly elegant because the running sum structure means that during inference, we can process one token at a time with constant memory overhead per step (just updating the aggregate). This makes linear attention ideal for autoregressive generation.
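A minimal sketch of this recurrent form, processing one token per step; the names `S` and `z` follow the running-sum derivation, and the `elu_plus_one` definition is repeated so the sketch runs standalone:

```python
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V, eps=1e-6):
    """Causal linear attention via running sums. Q, K: (n, d_k); V: (n, d_v)."""
    n, d_v = V.shape
    D = Q.shape[-1]
    S = np.zeros((D, d_v))   # running key-value aggregate S_i
    z = np.zeros(D)          # running key sum z_i
    out = np.zeros((n, d_v))
    for i in range(n):
        phi_k = elu_plus_one(K[i])
        phi_q = elu_plus_one(Q[i])
        S += np.outer(phi_k, V[i])    # S_i = S_{i-1} + phi(k_i) v_i^T
        z += phi_k                    # z_i = z_{i-1} + phi(k_i)
        out[i] = (phi_q @ S) / (phi_q @ z + eps)
    return out
```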
Performer: Unbiased Softmax Approximation
The simple feature maps we've examined (ELU, ReLU) don't approximate softmax attention well. Performer, introduced by Choromanski et al. in 2020, addresses this with a clever approach called FAVOR+ (Fast Attention Via positive Orthogonal Random features).
The goal is to find a feature map $\phi$ such that:
$$\mathbb{E}\!\left[\phi(q)^\top \phi(k)\right] = \exp\!\left(q^\top k\right)$$
where:
- $\mathbb{E}[\cdot]$: the expectation operator, averaging over the random projection vectors
- $\phi(q)^\top \phi(k)$: the inner product of feature-mapped query and key
- $\exp\!\left(q^\top k\right)$: the true softmax kernel (without the scaling factor)
When this equality holds in expectation, we say the approximation is unbiased: on average, linear attention produces the same result as softmax attention. Performer achieves this by using positive random features:
$$\phi(x) = \frac{\exp\!\left(-\frac{\|x\|^2}{2}\right)}{\sqrt{m}}\left[\exp(\omega_1^\top x), \ldots, \exp(\omega_m^\top x)\right]$$
where $\omega_1, \ldots, \omega_m$ are random projection vectors drawn from a standard normal distribution, and $m$ is the number of random features that controls the trade-off between approximation quality and computational cost.
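Continuing from the `positive_random_features` sketch in the feature-map section, a quick check (with small-norm inputs so the exponentials stay well-behaved) illustrates how the estimate tightens, on average, as $m$ grows:

```python
rng = np.random.default_rng(0)
d = 16
q, k = 0.3 * rng.normal(size=(2, d))
exact = np.exp(q @ k)                 # the softmax kernel value to approximate

for m in (64, 256, 1024, 4096):
    omega = rng.normal(size=(d, m))   # fresh random projections for each m
    approx = positive_random_features(q, omega) @ positive_random_features(k, omega)
    print(f"m={m:5d}  approx={approx:.4f}  exact={exact:.4f}")
```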
The results show that Performer's approximation improves with more random features. With 512 features, the cosine similarity approaches 1.0 and MSE drops significantly. This trade-off between approximation quality and feature dimension is central to Performer's design: more features mean better approximation but higher compute cost.
Performer's key innovation is that by using positive features with the right normalization, the approximation is unbiased: the expected value equals true softmax attention. The variance of the approximation decreases as $1/m$, where $m$ is the number of random features. This means doubling the number of features halves the variance, providing a controllable trade-off between computational cost and approximation fidelity.
Other Linear Attention Variants
The linear attention family has grown to include many variants, each addressing different aspects of the softmax-to-kernel trade-off:
- Linear Transformer (Katharopoulos et al., 2020): The foundational work that established the kernel view and demonstrated causal linear attention for autoregressive models.
- Random Feature Attention (RFA): Uses random Fourier features to approximate the Gaussian kernel, providing a middle ground between simple feature maps and Performer's FAVOR+.
- Nyströmformer (Xiong et al., 2021): Uses the Nyström method to approximate the attention matrix with landmark points, achieving $O(nm)$ complexity where $m$ is the number of landmarks.
- FNet (Lee-Thorp et al., 2021): Replaces attention entirely with Fourier transforms, achieving $O(n \log n)$ complexity with no learnable attention parameters.
- RetNet (Sun et al., 2023): Combines linear attention with exponential decay, creating a recurrent formulation that supports both parallel training and efficient incremental inference.
Linear attention with ELU features achieves approximately 64x speedup over standard softmax attention at this sequence length, while Performer with 256 random features achieves around 60x. The simpler variants (Nyströmformer, FNet) offer even greater speedups but with different quality trade-offs.
Limitations of Linear Attention
Linear attention's efficiency comes with trade-offs. Understanding these limitations is essential for choosing when to use linear attention versus other approaches.
Reduced Expressiveness: Softmax attention can create arbitrarily sharp attention patterns, focusing nearly all weight on a single position. The kernel decomposition in linear attention constrains the types of attention patterns that can be expressed. With simple feature maps like ELU, linear attention tends toward smoother, more diffuse patterns that may not capture fine-grained dependencies as precisely. This is particularly problematic for tasks requiring exact matching or retrieval from context.
Approximation Error: Random feature methods like Performer introduce variance in the attention computation. While the expected value matches softmax attention, individual forward passes have stochastic deviations. For tasks with low tolerance for noise (like copying or exact arithmetic), this variance can hurt performance. Increasing the number of random features reduces variance but increases compute cost, partially eroding the efficiency gains.
Challenges with Causal Modeling: While causal linear attention is theoretically elegant, the running sum formulation has practical issues. The cumulative nature means errors can compound over long sequences. Additionally, the lack of true position-wise softmax normalization means that earlier positions in the sequence may receive disproportionate total attention weight, creating position biases that don't exist in standard causal attention.
Training Stability: Some linear attention variants, particularly those with simple feature maps, can exhibit training instabilities. The normalization in linear attention (dividing by the sum of key features) can have high variance when key features are poorly conditioned. This often requires careful initialization and potentially learning rate adjustments.
Despite these limitations, linear attention remains valuable for specific use cases. Long-document understanding, where the quadratic bottleneck is most severe, benefits enormously from linear complexity. Streaming applications that process tokens incrementally can leverage the recurrent formulation. And hybrid architectures that combine sparse attention layers with occasional linear attention layers get the best of both worlds.
Implementation Considerations
When implementing linear attention in practice, several engineering details matter:
Numerical Stability: The division by the sum of key features can be unstable when the denominator is small. Always add a small epsilon (typically 1e-6) to prevent division by zero. For extreme cases, clipping the denominator to a minimum value may be necessary.
Feature Map Selection: For production systems, the choice of feature map significantly impacts both quality and speed. ELU+1 is fast but produces poor approximations. FAVOR+ provides better approximation but requires generating and storing random projection matrices. A common compromise is to use a deterministic feature map based on learned projections.
Chunked Computation: For very long sequences, even linear attention's memory can be substantial. Computing the KV aggregate in chunks and accumulating helps manage memory, especially on GPUs with limited VRAM.
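A minimal sketch of chunked accumulation (the `chunk_size` value and the function name are illustrative; it assumes NumPy and the `elu_plus_one` feature map from the implementation sketch):

```python
def chunked_kv_aggregate(K, V, chunk_size=512):
    """Accumulate S = phi(K)^T V and z = phi(K)^T 1 one chunk at a time."""
    D, d_v = K.shape[-1], V.shape[-1]
    S = np.zeros((D, d_v))
    z = np.zeros(D)
    for start in range(0, K.shape[0], chunk_size):
        K_chunk = elu_plus_one(K[start:start + chunk_size])  # (chunk, D)
        S += K_chunk.T @ V[start:start + chunk_size]         # partial key-value aggregate
        z += K_chunk.sum(axis=0)                             # partial key sum
    return S, z
```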
Summary
Linear attention transforms the attention mechanism from quadratic to linear complexity by exploiting the associative property of matrix multiplication. Instead of computing the full attention matrix, it precomputes a key-value aggregate that can be queried by each position independently.
The key concepts covered in this chapter:
- Kernel decomposition: Replacing softmax with a kernel function $\phi(q)^\top \phi(k)$ enables reordering $\left(\phi(Q)\,\phi(K)^\top\right)V$ to $\phi(Q)\left(\phi(K)^\top V\right)$, eliminating the quadratic dependency on sequence length.
- Feature maps: Different choices of $\phi$ trade off between computational efficiency and how well they approximate softmax attention. Simple maps like ELU are fast but imprecise; random feature methods like FAVOR+ provide unbiased approximation with controllable variance.
- Causal formulation: Linear attention naturally supports autoregressive modeling through a running sum formulation that incrementally accumulates key-value aggregates, enabling efficient per-token generation.
- Performer and variants: The FAVOR+ approach uses positive random features to provide an unbiased estimator of softmax attention, with approximation quality improving as the number of random features increases.
- Trade-offs: Linear attention sacrifices the ability to express arbitrarily sharp attention patterns, introduces approximation variance with random features, and may exhibit position biases in causal settings. These limitations make it best suited for long-document tasks where the quadratic bottleneck is the primary constraint.
Linear attention represents a fundamental shift in how we think about the attention mechanism: from an interaction matrix to a fixed-size state that summarizes the key-value relationships. This perspective has influenced subsequent architectures like state space models and linear recurrent units, which further develop the idea of maintaining a compact state that captures sequence information.
For practitioners, linear attention is most valuable when sequence length is the binding constraint. At shorter lengths, where $n$ is not much larger than $d$, the overhead of feature transformation may not be worthwhile. But for tasks involving documents, books, or streaming data where sequences stretch to tens of thousands of tokens, linear attention's $O(n d^2)$ complexity makes previously intractable problems accessible.
Key Parameters
When implementing or tuning linear attention, these parameters most directly affect performance and quality:
- `feature_map`: The function that transforms queries and keys. ELU+1 is fast and simple but produces poor softmax approximation. FAVOR+ with positive random features provides unbiased approximation but requires more computation. The choice depends on whether you prioritize speed or fidelity to softmax behavior.
- `num_features` ($m$): For random feature methods like Performer, this controls the trade-off between approximation quality and computational cost. Higher values reduce variance (proportional to $1/m$) but increase the feature dimension. Typical values range from 64 to 512, with 256 offering a reasonable balance for most applications.
- `eps`: A small constant (typically 1e-6) added to the denominator during normalization to prevent division by zero. Critical for numerical stability, especially when feature maps can produce near-zero values.
- `min_denom`: An optional floor for the denominator (e.g., 1e-4) that clips extreme values. Useful when the sum of key features is poorly conditioned, which can occur with certain feature maps or input distributions.
- `chunk_size`: For memory-efficient implementations, this controls how many positions are processed together when accumulating the key-value aggregate. Smaller chunks reduce peak memory usage but may incur overhead from loop iterations. Values between 256 and 1024 work well in practice.
- `D` (feature dimension): The output dimension of the feature map. For simple maps like ELU+1, this equals the input dimension $d$. For random feature methods, it can be larger than $d$ to improve approximation quality, with typical ratios of 2x to 4x the original dimension.