ALiBi: Attention with Linear Biases for Position Encoding

Michael Brenndoerfer · Updated May 30, 2025 · 31 min read

Learn how ALiBi encodes position through linear attention biases instead of embeddings. Master head-specific slopes, extrapolation properties, and when to choose ALiBi over RoPE for length generalization.


ALiBi: Attention with Linear Biases

In the previous chapters, we explored various approaches to encoding position information: sinusoidal encodings that add fixed patterns to embeddings, learned position embeddings that train position vectors from scratch, relative position encodings that capture pairwise distances, and RoPE that rotates embeddings based on position. Each method has trade-offs involving complexity, performance, and the ability to handle sequences longer than those seen during training.

ALiBi (Attention with Linear Biases) takes a radically different approach. Instead of modifying embeddings or inventing clever rotation schemes, ALiBi simply subtracts a value from attention scores based on the distance between tokens. The farther apart two tokens are, the larger the penalty. This remarkably simple idea, introduced by Press et al. in 2022, achieves strong performance while enabling something the other methods struggle with: extrapolation to sequences far longer than anything seen during training.

The Extrapolation Problem

Before understanding ALiBi's solution, we need to appreciate the problem it solves. Transformers trained on sequences of length $L$ often fail dramatically when given sequences of length $2L$ or beyond. The sinusoidal encodings produce positions the model has never seen. Learned position embeddings simply don't exist for positions beyond the training range. Even RoPE, despite its theoretical elegance, can struggle with extreme extrapolation.

Length Extrapolation

Length extrapolation refers to a model's ability to process sequences longer than those encountered during training while maintaining reasonable performance. Many position encoding schemes fail this test because they produce position representations the model has never learned to interpret.

Why does this matter? Training on very long sequences is computationally expensive due to attention's quadratic complexity. If we could train on shorter sequences and deploy on longer ones, we'd save enormous resources. More practically, real-world applications often encounter documents, conversations, or code files that exceed training lengths. A model that degrades gracefully on longer inputs is far more useful than one that fails catastrophically.

The Core Idea: Penalizing Distance

ALiBi's insight is elegant in its simplicity: don't encode position in the embeddings at all. Instead, modify the attention mechanism itself to prefer nearby tokens over distant ones.

Consider what attention scores represent: they measure compatibility between a query and a key. After softmax, these scores determine how much each position contributes to the output. ALiBi introduces a simple bias that subtracts from the score based on distance:

$$\text{score}_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j - m \cdot |i - j|$$

where:

  • $\text{score}_{ij}$: the modified attention score between positions $i$ and $j$
  • $\mathbf{q}_i \cdot \mathbf{k}_j$: the original dot product between query $i$ and key $j$
  • $m$: a slope parameter that controls how aggressively to penalize distance
  • $|i - j|$: the absolute distance between the two positions

The subtraction means distant tokens receive lower scores. A token 10 positions away gets penalized by $10m$, while an adjacent token gets penalized by just $m$. After softmax normalization, this translates to nearby tokens receiving higher attention weights.
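To see the arithmetic in one place, here is a minimal sketch with made-up scores and an assumed slope of $m = 0.5$ (neither is from the text): even when content similarity is identical, the penalty separates nearby keys from distant ones.

import numpy as np

# Hypothetical values for illustration: one query at position 4 attending to five keys,
# all with identical content-based scores, penalized with an assumed slope m = 0.5.
raw_scores = np.array([2.0, 2.0, 2.0, 2.0, 2.0])
key_positions = np.arange(5)
query_position = 4
m = 0.5

biased_scores = raw_scores - m * np.abs(query_position - key_positions)
print(biased_scores)  # [0.  0.5 1.  1.5 2. ] -- the nearest key keeps the highest score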

ALiBi Bias

ALiBi adds a linear bias to attention scores based on the distance between query and key positions. The bias is always negative, penalizing distant positions. The penalty grows linearly with distance, controlled by a slope parameter $m$.

This is position encoding without position embeddings. The model learns nothing about position during pretraining of the embeddings themselves. Position information enters only through the attention bias, and only at the moment of computing attention scores.

The Mathematical Formulation

Now that we understand ALiBi's core insight, let's develop the complete mathematical framework. We'll build from standard attention to ALiBi-augmented attention, showing exactly where and how the position bias enters the computation.

Starting Point: Standard Scaled Dot-Product Attention

Recall that standard self-attention operates on three matrices derived from the input sequence. For a sequence of $n$ tokens:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where:

  • $Q \in \mathbb{R}^{n \times d_k}$: the query matrix containing query vectors for all $n$ positions
  • $K \in \mathbb{R}^{n \times d_k}$: the key matrix containing key vectors for all positions
  • $V \in \mathbb{R}^{n \times d_v}$: the value matrix containing value vectors for all positions
  • $d_k$: the dimension of queries and keys
  • $QK^T \in \mathbb{R}^{n \times n}$: the raw attention score matrix where entry $(i, j)$ is the dot product between query $i$ and key $j$
  • $\sqrt{d_k}$: scaling factor that prevents dot products from growing too large in high dimensions

The matrix $QK^T$ captures content-based similarity: how much each query "wants" to attend to each key based purely on their learned representations. But this matrix is blind to position. A query attending to a key produces the same score whether the two tokens are adjacent or separated by 500 positions. This is where ALiBi intervenes.

Injecting Position: The Bias Matrix

ALiBi's modification is surgical. Rather than changing how queries, keys, or values are computed, it adds a single term to the attention scores:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + B\right)V$$

The bias matrix $B \in \mathbb{R}^{n \times n}$ is added directly to the scaled scores before softmax. This is the only change to standard attention. The matrix $B$ encodes position information through a remarkably simple rule:

$$B_{ij} = -m \cdot |i - j|$$

where:

  • $B_{ij}$: the bias added to the attention score between query position $i$ and key position $j$
  • $m$: the slope parameter for this attention head (controls penalty strength)
  • $|i - j|$: the absolute distance between positions $i$ and $j$

Let's unpack why this formula works. The absolute distance $|i - j|$ measures how far apart two positions are in the sequence. Adjacent tokens have distance 1, tokens separated by 10 positions have distance 10. Multiplying by a positive slope $m$ and negating creates a penalty that grows linearly with distance.

Consider what happens for a query at position 5 attending to various keys:

  • Attending to position 5 (itself): $B_{55} = -m \cdot 0 = 0$ (no penalty)
  • Attending to position 4 (adjacent): $B_{54} = -m \cdot 1 = -m$ (small penalty)
  • Attending to position 0 (distant): $B_{50} = -m \cdot 5 = -5m$ (large penalty)

The negative bias reduces the attention score, and larger distances produce more negative biases. After softmax normalization, this translates to lower attention weights for distant tokens. The content-based similarity in $QK^T$ still matters, but now it competes against a distance penalty.
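A quick numerical sketch makes this competition concrete (the scores, distances, and slope are assumed for illustration, not taken from the text): a distant key with a stronger content match can still lose to a nearby key once the distance penalty is applied.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical scores: the key at distance 8 matches the query's content better (3.0)
# than the key at distance 1 (2.0). An assumed slope of m = 0.5 reverses the outcome.
content_scores = np.array([3.0, 2.0])
distances = np.array([8, 1])
m = 0.5

print(softmax(content_scores))                  # ~[0.73 0.27]: the distant key wins on content alone
print(softmax(content_scores - m * distances))  # ~[0.08 0.92]: with ALiBi, the nearby key wins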

The Causal Case: Masking the Future

For decoder-style models that generate text left-to-right, we need causal masking: position $i$ can only attend to positions $j \le i$. The combined bias matrix looks like:

$$B = \begin{bmatrix} 0 & -\infty & -\infty & -\infty & \cdots \\ -m & 0 & -\infty & -\infty & \cdots \\ -2m & -m & 0 & -\infty & \cdots \\ -3m & -2m & -m & 0 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$

The structure reveals two distinct components:

  1. Lower triangle (including diagonal): Contains the ALiBi distance penalties. The diagonal is 0 (self-attention carries no penalty), and the penalties grow in magnitude as you move left, farther back from the query position.

  2. Upper triangle: Contains $-\infty$ values from the causal mask. These become 0 after softmax, preventing any attention to future positions.

The $-\infty$ entries vanish under the softmax's exponentiation ($\exp(-\infty) = 0$), so future positions contribute nothing to the weighted sum. The linear penalties in the lower triangle shape how attention flows among the allowed past positions.

Building the Bias Matrix: Step-by-Step Implementation

Let's translate this mathematics into code, building up the bias matrix piece by piece.

In[2]:
Code
import numpy as np


def create_alibi_bias(seq_len, slope):
    """
    Create the ALiBi bias matrix for a single attention head.

    Args:
        seq_len: Length of the sequence
        slope: The slope parameter m for this head

    Returns:
        Bias matrix of shape (seq_len, seq_len)
    """
    # Create position indices
    positions = np.arange(seq_len)

    # Compute pairwise distances: |i - j|
    # positions[:, None] is (seq_len, 1), positions[None, :] is (1, seq_len)
    # Broadcasting gives (seq_len, seq_len)
    distances = np.abs(positions[:, None] - positions[None, :])

    # Apply linear penalty
    bias = -slope * distances

    return bias

The key insight is using NumPy broadcasting to compute all pairwise distances at once. By reshaping the position array into a column vector positions[:, None] and a row vector positions[None, :], subtraction produces an $n \times n$ matrix where entry $(i, j)$ is $i - j$. Taking the absolute value gives us the distance matrix.

Let's see what this produces for a small sequence:
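The cell that prints the matrix is not shown above; a call along the following lines (reusing the create_alibi_bias function just defined) would produce the output below:

bias = create_alibi_bias(seq_len=6, slope=0.5)
print("ALiBi bias matrix for sequence length 6, slope 0.5:")
print(bias)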

Out[3]:
Console
ALiBi bias matrix for sequence length 6, slope 0.5:
[[-0.  -0.5 -1.  -1.5 -2.  -2.5]
 [-0.5 -0.  -0.5 -1.  -1.5 -2. ]
 [-1.  -0.5 -0.  -0.5 -1.  -1.5]
 [-1.5 -1.  -0.5 -0.  -0.5 -1. ]
 [-2.  -1.5 -1.  -0.5 -0.  -0.5]
 [-2.5 -2.  -1.5 -1.  -0.5 -0. ]]

Reading this matrix: the diagonal is zero because a token attending to itself has distance zero. Moving away from the diagonal in either direction, penalties grow linearly. Position 5 (row 5) attending to position 0 (column 0) shows $-2.5$, exactly $-0.5 \times 5$ as expected. The matrix is symmetric because distance is symmetric: $|i - j| = |j - i|$.

For decoder models, we overlay the causal mask:

In[4]:
Code
def create_causal_alibi_bias(seq_len, slope):
    """
    Create ALiBi bias with causal masking.

    Args:
        seq_len: Length of the sequence
        slope: The slope parameter m

    Returns:
        Bias matrix with -inf for future positions
    """
    # Start with the distance-based bias
    bias = create_alibi_bias(seq_len, slope)

    # Create causal mask (upper triangle should be -inf)
    causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)

    # Apply causal mask: future positions get -inf
    bias = np.where(causal_mask == 1, -np.inf, bias)

    return bias

The np.triu function creates an upper triangular matrix of ones, which we use to identify positions that should be masked. The k=1 argument excludes the diagonal, since a token should be able to attend to itself.
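As before, the printing cell is omitted; a call like this one would produce the output below:

causal_bias = create_causal_alibi_bias(seq_len=6, slope=0.5)
print("Causal ALiBi bias matrix:")
print(causal_bias)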

Out[5]:
Console
Causal ALiBi bias matrix:
[[-0.  -inf -inf -inf -inf -inf]
 [-0.5 -0.  -inf -inf -inf -inf]
 [-1.  -0.5 -0.  -inf -inf -inf]
 [-1.5 -1.  -0.5 -0.  -inf -inf]
 [-2.  -1.5 -1.  -0.5 -0.  -inf]
 [-2.5 -2.  -1.5 -1.  -0.5 -0. ]]

Now the upper triangle shows -inf (NumPy's display for $-\infty$), while the lower triangle retains the distance penalties. Row 5 can attend to all previous positions with penalties $-2.5, -2.0, -1.5, -1.0, -0.5, 0.0$ for positions 0 through 5 respectively. Row 0 can only attend to itself with penalty 0.

Let's visualize both the distance matrix and the resulting causal ALiBi bias matrix side by side:

Out[6]:
Visualization
Heatmap of distance matrix with values increasing symmetrically away from the diagonal.
Distance matrix showing absolute distances between all position pairs. Entry (i, j) contains |i - j|. The matrix is symmetric with zeros on the diagonal.
Heatmap of causal ALiBi bias matrix showing negative values in the lower triangle and masked upper triangle.
Causal ALiBi bias matrix with slope 0.5. The distance matrix is negated and scaled, then the upper triangle is masked with negative infinity to prevent attending to future positions.

The left matrix shows raw distances: the diagonal is 0 (same position), adjacent cells are 1, and values grow as positions diverge. The right matrix shows what happens after applying ALiBi: distances become negative penalties (scaled by the slope), and the upper triangle is masked to enforce causality.

Head-Specific Slopes

ALiBi uses different slope values for different attention heads. This is crucial: a single slope would force all heads to have the same locality preference, but different heads benefit from different perspectives. Some heads might focus on very local context (steep slope, harsh penalties for distance), while others maintain broader receptive fields (gentle slope, mild penalties).

The slopes are not learned but fixed according to a geometric sequence. For a model with hh attention heads, the slope for head ii is:

$$m_i = \frac{1}{2^{\frac{8}{h} \cdot i}} \quad \text{for } i = 1, 2, \ldots, h$$

where:

  • $m_i$: the slope parameter for attention head $i$
  • $h$: the total number of attention heads in the model
  • $i$: the head index, ranging from 1 to $h$
  • $\frac{8}{h}$: a scaling factor that ensures slopes span a consistent range regardless of head count
  • $2^{\frac{8}{h} \cdot i}$: the denominator, which grows exponentially with the head index

The base of 8 in the exponent was chosen empirically by the ALiBi authors. It ensures that the slopes span roughly eight powers of two (a factor of about 256) regardless of how many heads share that range. With 8 heads, the exponent simplifies to just $i$, giving slopes of $\frac{1}{2^1}, \frac{1}{2^2}, \ldots, \frac{1}{2^8}$, which equals $0.5, 0.25, 0.125, \ldots, 0.00390625$.

In[7]:
Code
def get_alibi_slopes(num_heads):
    """
    Compute ALiBi slopes for each attention head.

    The slopes follow a geometric sequence, with the first head
    having the steepest slope (most local attention) and the last
    head having the gentlest slope (broadest attention).

    Args:
        num_heads: Number of attention heads

    Returns:
        Array of slopes, one per head
    """
    # Compute the ratio for the geometric sequence
    ratio = 2 ** (8 / num_heads)

    # Generate slopes: 1/ratio, 1/ratio^2, ..., 1/ratio^num_heads
    slopes = 1.0 / (ratio ** np.arange(1, num_heads + 1))

    return slopes
Out[8]:
Console
4 heads: slopes = [0.25     0.0625   0.015625 0.003906]
8 heads: slopes = [0.5      0.25     0.125    0.0625   0.03125  0.015625 0.007812 0.003906]
16 heads: slopes = [0.707107 0.5      0.353553 0.25     0.176777 0.125    0.088388 0.0625
 0.044194 0.03125  0.022097 0.015625 0.011049 0.007812 0.005524 0.003906]

The geometric progression spreads the slopes across a wide range. With 8 heads, the steepest slope (0.5) penalizes a distance of 10 by 5 logits, effectively eliminating distant tokens from consideration. The gentlest slope (roughly 0.004) penalizes the same distance by only about 0.04 logits, allowing the head to attend broadly.
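To put rough numbers on this, the following sketch holds content scores constant so that only the bias shapes the weights, then compares the steepest and gentlest of the 8-head slopes over 64 past positions. The setup is illustrative, not taken from the article.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# With content scores held constant, the bias alone determines the attention weights.
# Compare the steepest (0.5) and gentlest (1/2^8 ~ 0.0039) of the 8-head slopes.
num_positions = 64
distances = np.arange(num_positions)  # distance from the query to each past key

for slope in (0.5, 1 / 2**8):
    weights = softmax(-slope * distances)
    print(f"slope {slope:.6f}: nearest key {weights[0]:.3f}, "
          f"10 back {weights[10]:.4f}, 50 back {weights[50]:.2e}")
# The steep head collapses onto the last few positions; the gentle head stays near-uniform.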

Out[9]:
Visualization
Bar chart showing ALiBi slopes for 8 attention heads, with values decreasing geometrically from 0.5 for head 1 to near zero for head 8.
ALiBi slopes across attention heads. Steeper slopes create stronger locality bias, restricting attention to nearby tokens. Gentler slopes allow broader attention patterns. The geometric progression ensures heads cover a wide range of receptive field sizes.

Visualizing the Attention Bias

Let's visualize how ALiBi biases affect attention patterns for different heads:

Out[10]:
Visualization
Heatmap showing ALiBi bias for head 1 with steep slope, displaying strong negative values far from the diagonal.
Head 1 (steep slope): Strong locality bias creates a narrow attention window. Tokens beyond a few positions receive heavily penalized scores, forcing attention to focus locally.
Heatmap showing ALiBi bias for head 8 with gentle slope, displaying mild penalties across all distances.
Head 8 (gentle slope): Weak locality bias allows broad attention. Even distant tokens receive only mild penalties, enabling this head to capture long-range dependencies.

The contrast is stark. Head 1's bias matrix shows deep blue (strongly negative) values just a few positions from the diagonal. By position 10, the penalty exceeds -5 logits, making those tokens nearly invisible after softmax. Head 8, in contrast, shows mild penalties throughout. Even at distance 19, the penalty is less than 0.1 logits, allowing meaningful attention to distant tokens.

This division of labor is intentional. Linguistic phenomena operate at different scales: adjacent tokens matter for syntax and local coherence, while distant tokens matter for coreference, topic consistency, and long-range dependencies. By giving different heads different locality preferences, ALiBi enables the model to capture phenomena at multiple scales simultaneously.

ALiBi in Attention: Complete Implementation

Now let's implement ALiBi-augmented attention from scratch:

In[11]:
Code
def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


def alibi_attention(Q, K, V, slopes, causal=True):
    """
    Compute attention with ALiBi position encoding.

    Args:
        Q: Query matrix of shape (num_heads, seq_len, d_k)
        K: Key matrix of shape (num_heads, seq_len, d_k)
        V: Value matrix of shape (num_heads, seq_len, d_v)
        slopes: ALiBi slopes, one per head, shape (num_heads,)
        causal: Whether to apply causal masking

    Returns:
        Output of shape (num_heads, seq_len, d_v)
        Attention weights of shape (num_heads, seq_len, seq_len)
    """
    num_heads, seq_len, d_k = Q.shape

    # Compute raw attention scores: Q @ K^T
    # Shape: (num_heads, seq_len, seq_len)
    scores = np.matmul(Q, K.transpose(0, 2, 1))

    # Scale by sqrt(d_k)
    scores = scores / np.sqrt(d_k)

    # Create ALiBi biases for each head
    # Shape: (num_heads, seq_len, seq_len)
    positions = np.arange(seq_len)
    distances = np.abs(positions[:, None] - positions[None, :])

    # Broadcast slopes: (num_heads, 1, 1) * (seq_len, seq_len)
    alibi_bias = -slopes[:, None, None] * distances[None, :, :]

    # Apply causal mask if needed
    if causal:
        causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)
        alibi_bias = np.where(causal_mask == 1, -np.inf, alibi_bias)

    # Add ALiBi bias to scores
    scores = scores + alibi_bias

    # Apply softmax to get attention weights
    attention_weights = softmax(scores, axis=-1)

    # Handle NaN from -inf (all masked positions)
    attention_weights = np.nan_to_num(attention_weights, nan=0.0)

    # Compute output: weighted sum of values
    output = np.matmul(attention_weights, V)

    return output, attention_weights

Let's test this implementation with a simple example:

In[12]:
Code
# Create a simple example
np.random.seed(42)

num_heads = 4
seq_len = 8
d_k = 16
d_v = 16

# Random Q, K, V matrices
Q = np.random.randn(num_heads, seq_len, d_k) * 0.5
K = np.random.randn(num_heads, seq_len, d_k) * 0.5
V = np.random.randn(num_heads, seq_len, d_v) * 0.5

# Get ALiBi slopes
slopes = get_alibi_slopes(num_heads)

# Compute ALiBi attention
output, attention_weights = alibi_attention(Q, K, V, slopes, causal=True)
Out[13]:
Console
Output shape: (4, 8, 16)
Attention weights shape: (4, 8, 8)

Attention weights for head 1 (steep slope = 0.250):
[[1.    0.    0.    0.    0.    0.    0.    0.   ]
 [0.45  0.55  0.    0.    0.    0.    0.    0.   ]
 [0.233 0.37  0.397 0.    0.    0.    0.    0.   ]
 [0.227 0.205 0.262 0.306 0.    0.    0.    0.   ]
 [0.118 0.068 0.121 0.279 0.414 0.    0.    0.   ]
 [0.083 0.086 0.13  0.201 0.176 0.324 0.    0.   ]
 [0.065 0.089 0.092 0.127 0.137 0.272 0.218 0.   ]
 [0.025 0.038 0.057 0.073 0.136 0.214 0.233 0.224]]

Attention weights for head 4 (gentle slope = 0.003906):
[[1.    0.    0.    0.    0.    0.    0.    0.   ]
 [0.562 0.438 0.    0.    0.    0.    0.    0.   ]
 [0.36  0.453 0.187 0.    0.    0.    0.    0.   ]
 [0.344 0.23  0.245 0.181 0.    0.    0.    0.   ]
 [0.184 0.232 0.181 0.169 0.233 0.    0.    0.   ]
 [0.121 0.125 0.286 0.214 0.096 0.158 0.    0.   ]
 [0.104 0.124 0.171 0.176 0.08  0.175 0.169 0.   ]
 [0.109 0.137 0.063 0.124 0.158 0.147 0.163 0.099]]

Compare the attention patterns between heads. Head 1 with its steep slope concentrates attention heavily on recent positions, with weights dropping rapidly as distance increases. Head 4 with its gentle slope distributes attention more evenly across the available context.

To understand the impact of ALiBi more directly, let's compare attention patterns with and without the position bias. We'll compute attention for the same queries and keys, once with standard attention (no position encoding) and once with ALiBi:
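The plotting cell for this comparison is not shown; a sketch of the underlying computation, reusing the arrays and helpers defined above (Q, K, d_k, seq_len, slopes, softmax, create_alibi_bias), might look like this:

# Score head 0's queries against its keys once without any position bias
# and once with the ALiBi penalty, applying the same causal mask to both.
q, k = Q[0], K[0]
scores = (q @ k.T) / np.sqrt(d_k)

causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)
masked = np.where(causal_mask == 1, -np.inf, scores)

weights_plain = softmax(masked, axis=-1)                                           # content only
weights_alibi = softmax(masked + create_alibi_bias(seq_len, slopes[0]), axis=-1)   # content + ALiBi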

Out[14]:
Visualization
Heatmap showing attention weights without position encoding, with relatively uniform distribution.
Standard attention (no position encoding). Attention patterns depend only on content similarity between queries and keys. The model has no notion of which tokens are nearby vs. distant.
Heatmap showing attention weights with ALiBi, showing stronger diagonal pattern due to locality bias.
ALiBi attention (with position bias). The same content-based scores are modified by distance penalties. Nearby tokens receive higher weights, creating a locality preference.

The difference is striking. Standard attention distributes weight based purely on content similarity, sometimes attending strongly to distant positions. ALiBi reshapes this pattern, pulling attention toward recent tokens while still allowing content to influence the final distribution. This locality bias emerges from a single matrix addition, requiring no learned position embeddings.

Out[15]:
Visualization
Heatmap of attention weights for head 1 showing strong diagonal pattern with rapid falloff.
Attention weights for head 1 (steep slope). Each row shows how one query position attends to key positions. Attention concentrates sharply on nearby tokens, with distant positions receiving negligible weight.
Heatmap of attention weights for head 4 showing broader attention distribution.
Attention weights for head 4 (gentle slope). Attention spreads more broadly across available positions, though still respecting the causal mask. This head can capture longer-range dependencies.

Why ALiBi Extrapolates

The key to ALiBi's extrapolation ability lies in what the model learns during training. With other position encoding schemes, the model learns to interpret specific position representations. A sinusoidal encoding for position 512 produces a particular pattern that the model has seen and learned to use. A learned position embedding for position 512 is a specific trained vector. When you encounter position 513 or 1024, these are novel representations the model hasn't seen.

ALiBi sidesteps this problem entirely. The model never learns position representations because there are none. Instead, it learns to work with relative attention patterns shaped by the linear bias. During training on sequences of length 1024, the model sees attention patterns where nearby tokens are favored and distant tokens are penalized. This is true at position 10, position 500, and position 1000.

When inference extends to position 2048, the same principle applies. The local neighborhood still receives favorable bias. Tokens 10 positions away still get penalized by the same amount. The absolute positions are larger, but the relative structure is unchanged. The model has learned to extract information from attention patterns that favor locality, and those patterns remain consistent regardless of sequence length.

Extrapolation Mechanism

ALiBi extrapolates because it encodes relative distance, not absolute position. The linear penalty for distance 10 is the same whether you're at position 50 or position 5000. The model learns to work with distance-biased attention patterns, which remain consistent across sequence lengths.

Let's visualize this consistency:
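A tiny check (slope 0.5 assumed, matching the figure below) confirms that the bias a query applies to a key ten positions back never changes with the query's absolute position:

slope = 0.5
for query_pos in (20, 100, 500):
    key_pos = query_pos - 10
    bias = -slope * abs(query_pos - key_pos)
    print(f"position {query_pos} -> position {key_pos}: bias = {bias}")
# Every line prints -5.0; only the distance of 10 enters the computation.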

Out[16]:
Visualization
Line plot showing ALiBi bias as a function of distance for three different query positions, all following the same linear penalty curve.
ALiBi attention bias remains consistent as sequence length grows. The bias from position i to position i-10 is always -5 (with slope 0.5), regardless of whether i is 20, 100, or 500. This structural consistency enables extrapolation.

The curves overlap perfectly because ALiBi's bias depends only on distance, never on absolute position. This is the mathematical basis for length extrapolation.

ALiBi vs RoPE: A Comparison

Both ALiBi and RoPE are widely used in modern language models, and both handle relative position. But they take fundamentally different approaches:

ALiBi vs RoPE comparison. ALiBi adds position as an attention bias while RoPE rotates embeddings.
| Aspect | ALiBi | RoPE |
| --- | --- | --- |
| Position encoding location | Attention bias | Query/key embeddings |
| Mechanism | Subtracts linear penalty from scores | Rotates Q and K vectors |
| Parameters | Fixed slopes (no training) | No additional parameters |
| Computation | Simple matrix addition | Complex-number arithmetic or rotation matrices |
| Extrapolation | Strong out of the box | Requires additional techniques (e.g., scaling) |
| Relative position | Implicit through distance penalty | Explicit through rotation angle differences |

RoPE encodes position by rotating query and key vectors. When computing dot products, the rotation angles combine such that the result depends on relative position. This is elegant and theoretically motivated, but the rotation patterns at unseen positions can behave unexpectedly.

ALiBi's approach is more brutalist: just penalize distance. There's no rotation, no complex number interpretation, no interplay between embedding dimensions. The simplicity has practical benefits. ALiBi requires no changes to the embedding pipeline and adds minimal computational overhead.

Let's compare implementation complexity:

In[17]:
Code
def rope_attention(Q, K, V, seq_len, d_k, causal=True):
    """
    Simplified RoPE attention for comparison.
    This is a sketch showing the additional complexity.
    """
    # RoPE requires computing rotation matrices or using complex numbers
    # For each position, apply rotation to Q and K before computing attention

    # Step 1: Compute rotation angles for each position and dimension
    positions = np.arange(seq_len)
    dim_indices = np.arange(d_k // 2)

    # Frequency for each dimension pair (simplified)
    freqs = 1.0 / (10000 ** (2 * dim_indices / d_k))

    # Angle matrix: (seq_len, d_k/2)
    angles = positions[:, None] * freqs[None, :]

    # Step 2: Apply rotation to Q and K (complex number approach)
    # This involves reshaping Q and K, applying cos/sin transformations...
    # (full implementation omitted for brevity)

    # The key point: RoPE modifies the embeddings themselves
    # before attention computation
    pass


def alibi_attention_simple(Q, K, V, slopes, causal=True):
    """
    ALiBi attention for comparison.
    """
    # Step 1: Standard attention scores
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(Q.shape[-1])

    # Step 2: Add distance-based bias (one line!)
    seq_len = Q.shape[1]
    distances = np.abs(
        np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    )
    scores = scores + (-slopes[:, None, None] * distances)

    # That's it. Apply softmax and compute output.
    return scores

The ALiBi bias is a single line of code added to standard attention. RoPE requires restructuring how queries and keys are computed. Both work, but ALiBi's simplicity is a genuine advantage for implementation and debugging.

Practical Considerations

When should you choose ALiBi? Consider these factors:

Training efficiency. ALiBi adds minimal overhead. The bias computation is a simple matrix operation that's negligible compared to the attention computation itself. There are no additional parameters to train.

Length flexibility. If your deployment might encounter sequences longer than training, ALiBi provides a safety net. Models like BLOOM and MPT use ALiBi partly for this reason.

Simplicity. ALiBi is easy to implement correctly. The fixed slopes mean no hyperparameter tuning for the position encoding itself. Debugging attention patterns is straightforward because the bias is directly interpretable.

Limitations. ALiBi's linear penalty may not capture all positional relationships. Some tasks might benefit from RoPE's more nuanced encoding. Extremely local heads (with steep slopes) can struggle if a task genuinely requires attending to distant tokens with equal strength.

In practice, many production models use either ALiBi or RoPE, with the choice often depending on the team's preferences and empirical results on target tasks. Both represent significant advances over sinusoidal encodings and learned position embeddings.

Out[18]:
Visualization
Line plot showing attention weight decay with distance for different ALiBi slopes, demonstrating varying effective context windows.
Effective context window for different ALiBi heads. The steep-slope head effectively ignores tokens beyond about 20 positions, while the gentle-slope head maintains meaningful attention weights out to hundreds of positions. This creates a multi-scale attention mechanism.

Limitations and Impact

ALiBi's simplicity is both its strength and its limitation. The linear penalty assumes that relevance decreases monotonically with distance, which is generally true for language but not universally. Some linguistic phenomena, like matching parentheses in code or tracking long-distance agreement in legal documents, might benefit from attention patterns that don't decay linearly.

The fixed slopes, while eliminating hyperparameters, also remove flexibility. A model cannot learn task-specific locality preferences through the position encoding itself. If a particular downstream task requires unusual attention patterns, the model must learn to overcome the ALiBi bias through the content-based attention scores.

Despite these limitations, ALiBi has proven remarkably effective. The BLOOM family of models, including the 176-billion parameter BLOOM-176B, uses ALiBi. So does MPT (MosaicML Pretrained Transformer). These models demonstrate that ALiBi scales to the largest models and handles diverse tasks well.

The impact of ALiBi extends beyond its direct use. It demonstrated that position encoding can be far simpler than previously thought. The original Transformer's sinusoidal encodings were ingenious but perhaps overengineered for the task. ALiBi showed that a linear penalty on distance, applied at attention time, is sufficient for strong performance. This insight has influenced subsequent work on efficient transformers and position encoding schemes.

Key Parameters

When implementing ALiBi in your own models, these parameters determine behavior:

  • num_heads: The number of attention heads in your model. ALiBi automatically computes slopes for each head using the geometric sequence formula. More heads create finer granularity in locality preferences.

  • slope (m): The penalty strength for each head. Steeper slopes (larger values like 0.5) create strong locality bias where attention concentrates on nearby tokens. Gentler slopes (smaller values like 0.004) allow broader attention across the sequence. These are fixed by the formula, not tuned.

  • causal: Whether to apply causal masking. Set to True for decoder-style autoregressive models where tokens can only attend to previous positions. Set to False for encoder-style bidirectional attention.

  • Base value (8): The constant in the slope formula $m_i = 1/2^{(8/h) \cdot i}$ that controls the range of slopes. The original ALiBi paper uses 8, which spreads the slopes across a wide range. This value is typically not modified.

Summary

ALiBi offers a refreshingly simple approach to position encoding:

  • No position embeddings. Position enters only through attention biases, not through modifications to token representations.

  • Linear distance penalty. Attention scores are reduced by $m \cdot |i - j|$, where $m$ is a head-specific slope and $|i - j|$ is the distance between query position $i$ and key position $j$. Nearby tokens are favored, distant tokens are penalized.

  • Geometric slopes. Different attention heads use different slopes, creating multi-scale attention. Some heads focus locally, others attend broadly.

  • Strong extrapolation. Because only relative distance matters (not absolute position), models trained on short sequences can process longer sequences at inference time.

  • Minimal overhead. ALiBi adds one matrix addition to attention computation. No additional parameters, no complex rotation arithmetic.

The next chapter will compare all position encoding methods we've covered: sinusoidal, learned, relative, RoPE, and ALiBi. You'll see how each handles key challenges like extrapolation, computational cost, and representational power.

