
Relative Position Encoding: Distance-Based Attention for Transformers

Michael Brenndoerfer · Updated June 2, 2025 · 34 min read

Learn how relative position encoding improves transformer generalization by encoding token distances rather than absolute positions, with Shaw et al.'s influential formulation.


Relative Position Encoding

Previous chapters explored absolute position encoding: each position receives a fixed representation based on its index in the sequence. Whether through sinusoidal functions or learned embeddings, the model learns that "position 5 means something specific." But language doesn't work this way. When you read "the cat sat on the mat," the relationship between "cat" and "sat" depends on their relative distance (one word apart), not on whether they appear at positions 2 and 3 versus positions 102 and 103.

Relative position encoding shifts the focus from "where am I?" to "how far apart are we?" This chapter explores why this matters, how to formulate relative attention, and how Shaw et al.'s influential approach brought relative positions into transformer self-attention.

Why Relative Position Matters

Consider the sentence "The cat that I saw yesterday sat on the mat." The verb "sat" needs to find its subject "cat." With absolute position encoding, the model must learn that "position 5 attending to position 2" captures a subject-verb relationship. But if we move the same phrase later in the document, it becomes "position 105 attending to position 102." The model must learn these patterns for every possible absolute position pair.

The Generalization Problem

Absolute position encodings create position-specific patterns. A model trained on short sequences may never see "position 500 attending to position 497," even though the relationship is identical to "position 5 attending to position 2": a distance of 3 positions.

Relative position encoding solves this by focusing on the offset between positions. The offset from position 5 to position 2 is $-3$ (looking backward three positions). The offset from position 105 to position 102 is also $-3$. By encoding this offset rather than absolute positions, the model learns a single pattern for "three positions back" that generalizes across the entire sequence.

This shift has practical consequences. Language has many distance-dependent patterns: adjectives typically appear one position before their nouns, determiners precede noun phrases by a few positions, and verbs often follow their subjects within a local window. Relative encoding captures these patterns directly.

In[2]:
Code
# Demonstrate the generalization problem with absolute positions
def count_position_pairs(max_len_train, max_len_test):
    """Count how many position pairs in test were never seen during training."""
    seen_pairs = set()
    for i in range(max_len_train):
        for j in range(max_len_train):
            seen_pairs.add((i, j))

    unseen = 0
    total = 0
    for i in range(max_len_test):
        for j in range(max_len_test):
            total += 1
            if (i, j) not in seen_pairs:
                unseen += 1

    return unseen, total


# Training on sequences up to length 128, testing on 256
unseen_abs, total_abs = count_position_pairs(128, 256)
Out[3]:
Console
Absolute Position Generalization Problem
=============================================
Training sequence length: 128
Test sequence length:     256
Total position pairs at test time: 65,536
Unseen position pairs:            49,152
Percentage unseen:                 75.0%

With relative encoding, a distance of -3 seen during training
applies equally to positions (5, 2) and (105, 102).
Out[4]:
Visualization
Heatmap showing seen vs unseen position pairs, with a dark square in the bottom-left corner and light L-shaped region indicating unseen pairs.
Visualization of the absolute position generalization problem. The dark region (bottom-left) represents position pairs seen during training on 128-token sequences. The light region shows unseen pairs when testing on 256 tokens. Nearly 75% of test pairs were never seen during training.

Nearly 75% of position pairs at test time were never seen during training. A model relying on absolute positions must somehow generalize to these unseen combinations. With relative encoding, every distance seen during training transfers to all positions at that distance.

From Absolute to Relative: The Key Insight

Recall the standard self-attention formula. Given queries $\mathbf{Q}$, keys $\mathbf{K}$, and values $\mathbf{V}$, we compute:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

where:

  • $\mathbf{Q} \in \mathbb{R}^{n \times d_k}$: query matrix containing all query vectors
  • $\mathbf{K} \in \mathbb{R}^{n \times d_k}$: key matrix containing all key vectors
  • $\mathbf{V} \in \mathbb{R}^{n \times d_v}$: value matrix containing all value vectors
  • $d_k$: dimension of queries and keys
  • $\sqrt{d_k}$: scaling factor to prevent dot products from growing too large
  • $n$: sequence length

With absolute position encoding, each token's embedding includes position information before the QKV projections. Position is baked into the input: $\mathbf{x}_i = \mathbf{e}_i + \mathbf{p}_i$, where $\mathbf{e}_i$ is the token embedding and $\mathbf{p}_i$ is the position encoding. The attention scores then implicitly depend on absolute positions through the projected queries and keys.

Relative position encoding takes a different approach: inject position information directly into the attention computation. Instead of adding position to the input, we modify how attention scores are calculated to explicitly account for the offset $j - i$ between positions $i$ and $j$.

The core idea is to add a position-dependent term to the attention score:

$$\text{score}_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j + \text{relative\_bias}(i, j)$$

where:

  • $\mathbf{q}_i$: the query vector at position $i$
  • $\mathbf{k}_j$: the key vector at position $j$
  • $\text{relative\_bias}(i, j)$: a term that depends only on the offset $j - i$

This formulation separates content matching ($\mathbf{q}_i \cdot \mathbf{k}_j$) from position matching (the relative bias). The model can learn that certain offsets are generally more or less important, regardless of absolute position.
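To make the separation concrete, here is a minimal sketch with made-up vectors and a hypothetical bias function (not the learned embeddings introduced below). Because the bias depends only on the offset, the pair (5, 2) and the pair (105, 102) receive exactly the same score.

import numpy as np

# Hypothetical distance-based bias: depends only on the offset j - i.
def toy_relative_bias(i, j):
    offset = j - i
    return 1.0 / (1.0 + abs(offset))

q = np.array([0.5, -0.2, 0.1])  # example query vector
k = np.array([0.3, 0.4, -0.1])  # example key vector
content = float(q @ k)

print(content + toy_relative_bias(5, 2))      # positions (5, 2), offset -3
print(content + toy_relative_bias(105, 102))  # same offset, identical score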

Shaw et al. Relative Position Representations

The most influential formulation of relative positions in self-attention comes from Shaw et al. (2018). Their approach asks a fundamental question: if we want attention to be distance-aware, where exactly should we inject position information? Rather than modifying the input embeddings, Shaw et al. discovered that injecting relative positions directly into the attention computation gives the model finer control over how distance affects both which tokens get attention and what information flows.

To understand their approach, we need to build up three connected ideas: what we want the model to learn about distance, how to represent those distances as learnable parameters, and how to integrate those representations into the attention mechanism itself.

The Design Question: What Should Distance Encode?

Consider what happens when position $i$ attends to position $j$ in standard self-attention. The model computes a score based on content similarity, then uses that score to weight how much of $j$'s information flows to $i$. But we want distance to influence both of these operations:

  1. Which positions get attention: A verb might prefer to attend 1 to 3 positions back (where subjects typically appear), regardless of content similarity.

  2. What information flows: When aggregating information, we might want "nearby context" to carry different weight than "distant context," even when attention weights are equal.

Shaw et al. address both needs by introducing two sets of learnable embeddings: one that modifies attention scores, and one that modifies value aggregation. This separation gives the model independent control over "who to attend to" versus "what to gather."

Relative Position Embeddings

For each possible offset $r = j - i$ between positions, Shaw et al. define learnable vectors:

  • $\mathbf{a}^K_r \in \mathbb{R}^{d_k}$: a relative position embedding that influences attention scores
  • $\mathbf{a}^V_r \in \mathbb{R}^{d_v}$: a relative position embedding that modifies aggregated values

where:

  • $r = j - i$: the relative offset (positive means looking forward, negative means looking backward)
  • $d_k$: dimension of query and key vectors
  • $d_v$: dimension of value vectors

These embeddings are learned during training, just like the QKV projection matrices. The critical insight is that the number of distinct offsets is bounded by the sequence length (from $-(n-1)$ to $+(n-1)$), not by the number of position pairs (which is $n^2$). This means the model learns a single pattern for "three positions back" that applies universally, rather than learning separate patterns for every pair of absolute positions.
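A quick back-of-the-envelope comparison (a sketch with illustrative numbers, not figures from the paper) shows how much smaller the space of offsets is than the space of position pairs:

n = 512  # example sequence length
k = 64   # example clipping distance (introduced in the next section)

position_pairs = n * n        # patterns an absolute scheme must cover
distinct_offsets = 2 * n - 1  # offsets from -(n-1) to +(n-1)
clipped_offsets = 2 * k + 1   # offsets kept after clipping to [-k, k]

print(position_pairs, distinct_offsets, clipped_offsets)
# 262144 1023 129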

Modifying Attention Scores

Let's trace through how relative positions enter the attention computation. In standard self-attention, the score measuring how much position $i$ should attend to position $j$ is:

$$e_{ij} = \frac{\mathbf{q}_i \cdot \mathbf{k}_j}{\sqrt{d_k}}$$

where:

  • $e_{ij}$: the raw attention score from query position $i$ to key position $j$
  • $\mathbf{q}_i \in \mathbb{R}^{d_k}$: the query vector at position $i$
  • $\mathbf{k}_j \in \mathbb{R}^{d_k}$: the key vector at position $j$
  • $d_k$: the dimension of query and key vectors

This score depends only on the content at each position. Two identical words at different positions produce identical scores, regardless of their distance.

Shaw et al. modify this by adding the relative position embedding to the key before computing the dot product:

$$e_{ij} = \frac{\mathbf{q}_i \cdot (\mathbf{k}_j + \mathbf{a}^K_{j-i})}{\sqrt{d_k}}$$

Why add to the key rather than the query? The key represents "what this position offers," so adding position information to it means the model can learn that positions at certain distances offer something different, even if their content is identical. The query remains focused on "what I'm looking for."

Expanding the dot product reveals the two-component structure:

$$e_{ij} = \frac{\mathbf{q}_i \cdot \mathbf{k}_j + \mathbf{q}_i \cdot \mathbf{a}^K_{j-i}}{\sqrt{d_k}}$$

where:

  • $\mathbf{q}_i \cdot \mathbf{k}_j$: the content term, measuring semantic compatibility between positions
  • $\mathbf{q}_i \cdot \mathbf{a}^K_{j-i}$: the position term, encoding how relative distance affects the score
  • $\mathbf{a}^K_{j-i} \in \mathbb{R}^{d_k}$: the learned relative position embedding for offset $(j-i)$

The position term is a dot product between the query and the relative position embedding. This means different queries can respond differently to the same distance. A query learned to find subjects might have a high dot product with $\mathbf{a}^K_{-2}$ (two positions back), while a query learned to find objects might prefer $\mathbf{a}^K_{+1}$ (one position forward). The model learns these patterns from data.
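As a small illustration (random vectors standing in for trained parameters, so the specific values mean nothing), two different queries produce different position terms for the same pair of offsets:

import numpy as np

rng = np.random.default_rng(0)
d_k_demo = 4

a_back2 = rng.normal(size=d_k_demo)  # stands in for a^K_{-2} (two back)
a_fwd1 = rng.normal(size=d_k_demo)   # stands in for a^K_{+1} (one forward)

q_a = rng.normal(size=d_k_demo)  # one query vector
q_b = rng.normal(size=d_k_demo)  # a different query vector

# Each query scores the same offsets differently; training shapes these
# dot products so that useful distances receive higher scores.
print(q_a @ a_back2, q_a @ a_fwd1)
print(q_b @ a_back2, q_b @ a_fwd1)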

Modifying Value Aggregation

Shaw et al. apply the same principle to value aggregation. After computing attention weights $\alpha_{ij} = \text{softmax}(e_{ij})$, standard attention aggregates values as a weighted sum. The modified version adds relative position embeddings to the values:

$$\mathbf{o}_i = \sum_{j=1}^{n} \alpha_{ij} (\mathbf{v}_j + \mathbf{a}^V_{j-i})$$

where:

  • $\mathbf{o}_i \in \mathbb{R}^{d_v}$: output vector at position $i$, now enriched with position-aware context
  • $\alpha_{ij}$: attention weight from position $i$ to position $j$ (sums to 1 across $j$)
  • $\mathbf{v}_j \in \mathbb{R}^{d_v}$: value vector at position $j$, carrying the content to aggregate
  • $\mathbf{a}^V_{j-i} \in \mathbb{R}^{d_v}$: relative position embedding for values at offset $(j-i)$
  • $n$: sequence length

This modification means the output at each position incorporates not just the weighted content from other positions, but also position-dependent information about where that content came from. Even if two positions have identical values and identical attention weights, their contributions differ based on their relative distances.

The two sets of embeddings ($\mathbf{a}^K$ and $\mathbf{a}^V$) work together but serve distinct roles. The key embeddings affect which positions receive attention (the "routing" decision). The value embeddings affect what information flows once routing is decided (the "content" decision). This separation allows the model to learn, for example, that adjacent positions should receive high attention (key embedding effect) while also learning that adjacent context should contribute a specific "locality" signal to the output (value embedding effect).

Clipping Relative Positions

The formulation above assumes we have a separate embedding for every possible relative position. But this creates a practical problem: for a sequence of length $n$, relative positions range from $-(n-1)$ to $(n-1)$, requiring $2n-1$ embeddings. For a sequence of length 512, that's 1023 different embedding vectors for the keys and another 1023 for the values.

This approach has two flaws. First, memory grows linearly with sequence length. Second, distant positions appear infrequently during training. In a 128-token sequence, only one position pair has offset $+127$ (position 0 to position 127), while 127 pairs have offset $+1$. The embedding for offset $+127$ would be severely undertrained.

Shaw et al. address both problems with a simple insight: fine-grained distance matters more locally than globally. The difference between "3 positions back" and "4 positions back" may be linguistically significant, as it could distinguish an adjective-noun relationship from a determiner-noun relationship. But the difference between "100 positions back" and "101 positions back" rarely carries meaning. Both are simply "far away."

This motivates clipping: relative positions beyond a maximum distance $k$ are clamped to $\pm k$:

$$\text{clip}(j - i, -k, k)$$

where:

  • $j - i$: the raw relative position (offset from position $i$ to position $j$)
  • $k$: the maximum relative position to distinguish (a hyperparameter)
  • $\text{clip}(x, a, b)$: clamps $x$ to the range $[a, b]$, returning $a$ if $x < a$, $b$ if $x > b$, and $x$ otherwise

With clipping, we need only $2k + 1$ relative position embeddings regardless of sequence length. Position pairs further than $k$ apart share the same "far away" embedding, either $\mathbf{a}^K_{-k}$ for "far backward" or $\mathbf{a}^K_{+k}$ for "far forward."

The choice of $k$ reflects a trade-off between expressiveness and learnability. Typical values range from 16 to 128, depending on the expected locality of relevant patterns in the data.

In[5]:
Code
def clip_relative_position(j, i, max_dist):
    """
    Compute clipped relative position.

    Args:
        j: Key/value position
        i: Query position
        max_dist: Maximum relative distance to distinguish

    Returns:
        Clipped relative position in range [-max_dist, max_dist]
    """
    rel_pos = j - i
    return max(-max_dist, min(max_dist, rel_pos))


# Demonstrate clipping
max_k = 4
positions = list(range(10))
query_pos = 5
Out[6]:
Console
Query position: 5
Maximum relative distance (k): 4

Position | Raw Offset | Clipped Offset
------------------------------------------
    0    |     -5     |      -4
    1    |     -4     |      -4
    2    |     -3     |      -3
    3    |     -2     |      -2
    4    |     -1     |      -1
    5    |     +0     |      +0
    6    |     +1     |      +1
    7    |     +2     |      +2
    8    |     +3     |      +3
    9    |     +4     |      +4

The table reveals how clipping works in practice. Positions 0 and 1 have raw offsets of -5 and -4, but both clip to -4 because they exceed our maximum distance of 4. Position 9 has raw offset +4, which equals the maximum, so no clipping occurs. All positions beyond the clipping threshold share the same "far away" embedding, which captures the intuition that distinguishing between "very far" and "extremely far" rarely matters for language understanding.

Out[7]:
Visualization
Heatmap showing clipped relative positions for a 10x10 matrix with k=4, displaying values from -4 to +4.
Effect of clipping on the relative position matrix. Each cell shows the clipped relative position used to index the embedding table. Distant positions (beyond k=4) share the same embedding index, shown in darker shades at the corners.

Notice the diagonal band structure: each anti-diagonal contains the same relative position. The corners, where positions are far apart, are clipped to the boundary values $\pm k$. This Toeplitz-like structure is key to efficient implementation.
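The clipped offset matrix shown in the heatmap can be built in one vectorized step. Here is a small sketch using NumPy broadcasting rather than the element-wise clipping function above:

import numpy as np

def clipped_offset_matrix(seq_len, max_dist):
    """Matrix whose entry [i, j] is the clipped offset j - i."""
    positions = np.arange(seq_len)
    offsets = positions[None, :] - positions[:, None]
    return np.clip(offsets, -max_dist, max_dist)

print(clipped_offset_matrix(10, 4))  # same 10x10, k=4 case as the heatmap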

Implementation: Relative Position in Self-Attention

Now that we understand the mathematical formulation, let's translate it into code. We'll build up the implementation in stages: first the relative position embeddings, then the modified attention score computation, then the modified value aggregation, and finally the complete layer.

Building the Relative Position Embedding Table

The foundation of Shaw-style attention is a table of learned embeddings indexed by relative position. Since relative positions range from $-k$ to $+k$ after clipping, we need $2k + 1$ embeddings. The key implementation detail is converting from relative positions (which can be negative) to array indices (which must be non-negative).

In[8]:
Code
import numpy as np


class RelativePositionEmbedding:
    """
    Learnable embeddings for relative positions.
    """

    def __init__(self, max_distance, d_k, d_v, seed=None):
        """
        Initialize relative position embeddings.

        Args:
            max_distance: Maximum relative position to distinguish (k)
            d_k: Dimension for key-side embeddings
            d_v: Dimension for value-side embeddings
            seed: Random seed for reproducibility
        """
        if seed is not None:
            np.random.seed(seed)

        # Number of unique relative positions: from -k to +k
        n_positions = 2 * max_distance + 1

        # Initialize embeddings with small random values
        scale_k = np.sqrt(2.0 / (n_positions + d_k))
        scale_v = np.sqrt(2.0 / (n_positions + d_v))

        self.a_K = np.random.randn(n_positions, d_k) * scale_k
        self.a_V = np.random.randn(n_positions, d_v) * scale_v
        self.max_distance = max_distance

    def get_key_embedding(self, rel_pos):
        """Get the key-side relative position embedding for an offset."""
        clipped = max(-self.max_distance, min(self.max_distance, rel_pos))
        # Convert from [-k, k] to [0, 2k] for indexing
        idx = clipped + self.max_distance
        return self.a_K[idx]

    def get_value_embedding(self, rel_pos):
        """Get the value-side relative position embedding for an offset."""
        clipped = max(-self.max_distance, min(self.max_distance, rel_pos))
        idx = clipped + self.max_distance
        return self.a_V[idx]

The index conversion is the subtle but essential detail. Relative positions range from $-k$ to $+k$, but array indices must be non-negative. Adding $k$ shifts the range to $[0, 2k]$:

  • Offset $-k$ (maximum backward distance) maps to index 0
  • Offset 0 (same position) maps to index $k$
  • Offset $+k$ (maximum forward distance) maps to index $2k$

This simple arithmetic lets us use relative positions as array indices without any conditional logic.
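A quick sanity check of this mapping, reusing the RelativePositionEmbedding class defined above (the chosen offsets are just illustrative):

rel = RelativePositionEmbedding(max_distance=3, d_k=4, d_v=4, seed=0)

for offset in [-5, -3, 0, 3, 5]:
    clipped = max(-rel.max_distance, min(rel.max_distance, offset))
    idx = clipped + rel.max_distance
    print(f"offset {offset:+d} -> clipped {clipped:+d} -> table row {idx}")
# Offsets beyond ±3 reuse the boundary rows 0 and 6 of the embedding table.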

Computing Attention Scores with Relative Position

With the embedding table in place, we can implement the modified attention score computation. Recall the formula: $e_{ij} = (\mathbf{q}_i \cdot \mathbf{k}_j + \mathbf{q}_i \cdot \mathbf{a}^K_{j-i}) / \sqrt{d_k}$. For each query-key pair, we compute both the content term and the position term, then sum them before scaling.

In[9]:
Code
def relative_attention_scores(Q, K, rel_embed):
    """
    Compute attention scores with relative position information.

    Args:
        Q: Query matrix of shape (seq_len, d_k)
        K: Key matrix of shape (seq_len, d_k)
        rel_embed: RelativePositionEmbedding instance

    Returns:
        Attention scores of shape (seq_len, seq_len)
    """
    seq_len, d_k = Q.shape
    scores = np.zeros((seq_len, seq_len))

    for i in range(seq_len):  # Query position
        for j in range(seq_len):  # Key position
            # Content term: q_i · k_j
            content_score = np.dot(Q[i], K[j])

            # Position term: q_i · a^K_{j-i}
            rel_pos = j - i
            a_K = rel_embed.get_key_embedding(rel_pos)
            position_score = np.dot(Q[i], a_K)

            # Combined score
            scores[i, j] = content_score + position_score

    # Scale by sqrt(d_k)
    scores = scores / np.sqrt(d_k)

    return scores

This implementation iterates over all position pairs, making the logic transparent. For each pair $(i, j)$, we compute the content score ($\mathbf{q}_i \cdot \mathbf{k}_j$), look up the relative position embedding for offset $j - i$, compute the position score ($\mathbf{q}_i \cdot \mathbf{a}^K_{j-i}$), and sum them. The final division by $\sqrt{d_k}$ applies the same scaling as standard attention to prevent score magnitudes from growing with dimension.

Computing Output with Relative Position in Values

The value aggregation follows the same pattern. For each output position $i$, we aggregate contributions from all positions $j$, but we add the relative position embedding to each value before weighting:

In[10]:
Code
def relative_attention_output(attention_weights, V, rel_embed):
    """
    Compute attention output with relative position in values.

    Args:
        attention_weights: Softmax attention weights (seq_len, seq_len)
        V: Value matrix of shape (seq_len, d_v)
        rel_embed: RelativePositionEmbedding instance

    Returns:
        Output of shape (seq_len, d_v)
    """
    seq_len, d_v = V.shape
    output = np.zeros((seq_len, d_v))

    for i in range(seq_len):  # Output position
        for j in range(seq_len):  # Value position
            # Get value with relative position embedding
            rel_pos = j - i
            a_V = rel_embed.get_value_embedding(rel_pos)
            value_with_pos = V[j] + a_V

            # Weighted contribution
            output[i] += attention_weights[i, j] * value_with_pos

    return output

The value modification works similarly to the key modification. For each contributing position $j$, we add its relative position embedding $\mathbf{a}^V_{j-i}$ to its value vector before applying the attention weight. This means the output at position $i$ contains not just weighted content, but also weighted position information about where that content originated.
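To see this effect in isolation, here is a small sketch (reusing the functions defined earlier) where every value vector is identical and the attention weights are uniform; the output rows still differ across positions because of the value-side offset embeddings:

seq_len_demo, d_v_demo = 4, 4
V_same = np.ones((seq_len_demo, d_v_demo))                    # identical values
uniform = np.full((seq_len_demo, seq_len_demo), 1.0 / seq_len_demo)

rel_demo = RelativePositionEmbedding(max_distance=2, d_k=4, d_v=d_v_demo, seed=7)
out_demo = relative_attention_output(uniform, V_same, rel_demo)
print(out_demo)  # rows differ only because of the a^V offset embeddings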

The Complete Relative Self-Attention Layer

With both components in place, we can assemble the complete layer. This class wraps the QKV projections, relative position embeddings, and the modified attention computation into a single forward pass:

In[11]:
Code
class RelativeSelfAttention:
    """
    Self-attention with relative position representations (Shaw et al.).
    """

    def __init__(self, embed_dim, d_k, d_v, max_distance=16, seed=None):
        """
        Initialize relative self-attention layer.

        Args:
            embed_dim: Dimension of input embeddings
            d_k: Query/key dimension
            d_v: Value dimension
            max_distance: Maximum relative position to distinguish
            seed: Random seed
        """
        if seed is not None:
            np.random.seed(seed)

        # QKV projection matrices
        scale_qk = np.sqrt(2.0 / (embed_dim + d_k))
        scale_v = np.sqrt(2.0 / (embed_dim + d_v))

        self.W_q = np.random.randn(embed_dim, d_k) * scale_qk
        self.W_k = np.random.randn(embed_dim, d_k) * scale_qk
        self.W_v = np.random.randn(embed_dim, d_v) * scale_v

        # Relative position embeddings
        self.rel_embed = RelativePositionEmbedding(max_distance, d_k, d_v, seed)

        self.d_k = d_k

    def forward(self, X):
        """
        Compute relative self-attention.

        Args:
            X: Input embeddings of shape (seq_len, embed_dim)

        Returns:
            output: Attention output of shape (seq_len, d_v)
            attention_weights: Weights of shape (seq_len, seq_len)
        """
        # Project to Q, K, V
        Q = X @ self.W_q
        K = X @ self.W_k
        V = X @ self.W_v

        # Compute attention scores with relative positions
        scores = relative_attention_scores(Q, K, self.rel_embed)

        # Softmax
        scores_stable = scores - scores.max(axis=1, keepdims=True)
        exp_scores = np.exp(scores_stable)
        attention_weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)

        # Compute output with relative positions in values
        output = relative_attention_output(attention_weights, V, self.rel_embed)

        return output, attention_weights

The forward pass follows the standard self-attention pattern with two modifications. After projecting input embeddings to Q, K, and V, we use our modified relative_attention_scores function instead of the standard dot product, and we use relative_attention_output instead of the standard weighted sum. The softmax step remains unchanged, as the relative position terms are already incorporated into the scores.

Testing the Implementation

Let's verify our implementation produces sensible results:

In[12]:
Code
# Create a test sequence
np.random.seed(42)
seq_len = 6
embed_dim = 8
d_k = d_v = 4
max_dist = 3

# Random input embeddings
X = np.random.randn(seq_len, embed_dim)

# Create relative attention layer
rel_attn = RelativeSelfAttention(
    embed_dim, d_k, d_v, max_distance=max_dist, seed=123
)
output, weights = rel_attn.forward(X)
Out[13]:
Console
Relative Self-Attention Test
=============================================
Sequence length: 6
Embedding dim:   8
Q/K dimension:   4
V dimension:     4
Max distance:    3

Input shape:     (6, 8)
Output shape:    (6, 4)
Weights shape:   (6, 6)

Attention weights (rows sum to 1):
[[0.008 0.028 0.001 0.12  0.62  0.223]
 [0.26  0.098 0.35  0.157 0.052 0.083]
 [0.794 0.002 0.077 0.122 0.002 0.002]
 [0.016 0.394 0.025 0.108 0.356 0.101]
 [0.475 0.023 0.002 0.13  0.069 0.301]
 [0.002 0.227 0.001 0.014 0.66  0.097]]

The attention weights look similar to standard self-attention, but they now incorporate relative position information. Each query-key compatibility is biased by the relative distance between positions.

Out[14]:
Visualization
Heatmap showing relative position embedding values with offsets on y-axis and dimensions on x-axis.
Visualization of the learned relative position embedding matrix (key-side). Each row corresponds to a relative offset from -3 to +3, and each column is an embedding dimension. The patterns vary across offsets, allowing the model to encode different distance-dependent behaviors.

The embedding matrix shows how each relative offset is encoded as a different vector. Offset 0 (same position) has a distinct pattern from offset -3 (three positions back) or offset +3 (three positions forward). These patterns are learned during training to capture which distances are relevant for the task.

Visualizing Relative Position Effects

Let's visualize how relative position embeddings influence attention. We'll examine the position term $\mathbf{q}_i \cdot \mathbf{a}^K_{j-i}$ separately from the content term.

In[15]:
Code
# Extract position-only attention biases
def compute_position_biases(Q, rel_embed):
    """
    Compute attention score contributions from relative positions only.
    """
    seq_len, d_k = Q.shape
    biases = np.zeros((seq_len, seq_len))

    for i in range(seq_len):
        for j in range(seq_len):
            rel_pos = j - i
            a_K = rel_embed.get_key_embedding(rel_pos)
            biases[i, j] = np.dot(Q[i], a_K) / np.sqrt(d_k)

    return biases


# Compute position biases for our test case
Q_test = X @ rel_attn.W_q
position_biases = compute_position_biases(Q_test, rel_attn.rel_embed)

# Also compute content-only scores for comparison
K_test = X @ rel_attn.W_k
content_scores = (Q_test @ K_test.T) / np.sqrt(d_k)
Out[16]:
Visualization
Heatmap of content-based attention scores between six positions showing irregular patterns based on embedding similarity.
Content scores: Query-key dot products measure semantic compatibility. These patterns depend on the input embeddings, not position.
Heatmap of position-based attention biases showing diagonal band structure reflecting relative distances.
Position biases: Relative position contribution to attention. Each diagonal line corresponds to a fixed offset, showing consistent distance-based patterns.

The content scores show irregular patterns depending on the input embeddings. The position biases show a distinctive diagonal structure: each anti-diagonal corresponds to positions at a fixed relative distance. For example, the main diagonal (offset 0) shows self-attention bias, while diagonals above and below show biases for looking forward and backward.

Let's examine how the combined scores compare to content-only attention:

In[17]:
Code
# Compute attention weights with and without position
def softmax_rows(x):
    exp_x = np.exp(x - x.max(axis=1, keepdims=True))
    return exp_x / exp_x.sum(axis=1, keepdims=True)


content_only_weights = softmax_rows(content_scores)
combined_scores = content_scores + position_biases
combined_weights = softmax_rows(combined_scores)
Out[18]:
Visualization
Heatmap of attention weights using only content similarity between positions.
Content-only attention: Weights based purely on query-key similarity. Patterns reflect embedding relationships without distance information.
Heatmap of attention weights combining content and relative position, showing modified attention patterns.
With relative position: Position biases modify attention patterns. Notice how relative distance influences which positions receive more weight.

The position biases subtly shift attention patterns. In this random example, the differences are modest because the embeddings are random. In trained models, relative position embeddings learn linguistically meaningful patterns: verbs attending strongly to positions 1-2 back (typical subject distance), determiners attending 1-2 forward (to their nouns), and so on.

Efficient Matrix Implementation

The loop-based implementation above is clear but slow. In practice, we compute relative attention efficiently using matrix operations. The key insight is that position biases form a Toeplitz-like structure: all entries on each diagonal share the same relative distance.

In[19]:
Code
def efficient_relative_scores(Q, K, rel_embed):
    """
    Compute relative attention scores using efficient matrix operations.

    Args:
        Q: Query matrix of shape (seq_len, d_k)
        K: Key matrix of shape (seq_len, d_k)
        rel_embed: RelativePositionEmbedding instance

    Returns:
        Attention scores of shape (seq_len, seq_len)
    """
    seq_len, d_k = Q.shape

    # Content scores: standard Q @ K.T
    content_scores = Q @ K.T

    # Build relative position bias matrix
    # Each entry [i, j] needs Q[i] · a^K_{j-i}

    # Step 1: Collect all relative position embeddings into a matrix
    # Shape: (2*max_dist + 1, d_k)
    all_a_K = rel_embed.a_K
    max_dist = rel_embed.max_distance

    # Step 2: Compute Q @ all_a_K.T to get all possible position scores
    # Shape: (seq_len, 2*max_dist + 1)
    all_position_scores = Q @ all_a_K.T

    # Step 3: Index to build the (seq_len, seq_len) position bias matrix
    position_biases = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        for j in range(seq_len):
            rel_pos = j - i
            clipped = max(-max_dist, min(max_dist, rel_pos))
            idx = clipped + max_dist
            position_biases[i, j] = all_position_scores[i, idx]

    # Combined scores
    scores = (content_scores + position_biases) / np.sqrt(d_k)

    return scores


# Verify the efficient implementation matches the naive one
scores_naive = relative_attention_scores(Q_test, K_test, rel_attn.rel_embed)
scores_efficient = efficient_relative_scores(Q_test, K_test, rel_attn.rel_embed)
Out[20]:
Console
Maximum difference between naive and efficient: 8.88e-16
Implementations are equivalent!

The efficient version first computes all possible query-position dot products, then indexes into this precomputed matrix. This reduces the inner loop computation from $O(d_k)$ per entry to $O(1)$, with the upfront cost of a single $(n, 2k+1)$ matrix multiplication.
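The remaining double loop over the index matrix can also be removed with NumPy fancy indexing over the clipped offset matrix. A sketch, assuming the same RelativePositionEmbedding interface and the test tensors defined above:

def fully_vectorized_relative_scores(Q, K, rel_embed):
    """Relative attention scores with no Python loops (a sketch)."""
    seq_len, d_k = Q.shape
    max_dist = rel_embed.max_distance

    # Content term: standard Q @ K.T
    content_scores = Q @ K.T

    # All query-vs-offset dot products, shape (seq_len, 2k + 1)
    all_position_scores = Q @ rel_embed.a_K.T

    # Clipped offset index matrix, shape (seq_len, seq_len)
    positions = np.arange(seq_len)
    idx = np.clip(positions[None, :] - positions[:, None], -max_dist, max_dist) + max_dist

    # Gather the position score for each (i, j) pair in one shot
    position_biases = np.take_along_axis(all_position_scores, idx, axis=1)

    return (content_scores + position_biases) / np.sqrt(d_k)


diff = np.abs(fully_vectorized_relative_scores(Q_test, K_test, rel_attn.rel_embed)
              - scores_efficient).max()
print(diff)  # should be ~0, matching the loop-based versions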

Comparison with Absolute Position Encoding

Let's directly compare relative and absolute position encodings on a simple task: detecting whether a token is attending to its immediate neighbors versus distant positions.

In[21]:
Code
def analyze_position_bias(attention_weights, max_offset=5):
    """
    Analyze average attention by relative offset.
    """
    seq_len = attention_weights.shape[0]
    offset_weights = {}

    for offset in range(-max_offset, max_offset + 1):
        weights_at_offset = []
        for i in range(seq_len):
            j = i + offset
            if 0 <= j < seq_len:
                weights_at_offset.append(attention_weights[i, j])
        if weights_at_offset:
            offset_weights[offset] = np.mean(weights_at_offset)

    return offset_weights


# Analyze position patterns in our relative attention
offset_analysis = analyze_position_bias(combined_weights)
Out[22]:
Visualization
Bar chart showing average attention weight for relative offsets from -5 to +5, with tallest bar at offset 0.
Average attention weight by relative offset. This profile shows how much attention typically flows to positions at each distance. Peaks at offset 0 indicate self-attention; asymmetric patterns suggest directional dependencies (e.g., attending more to past than future).

The profile shows how attention is distributed by relative distance. In this random initialization, patterns are noisy, but trained models develop clear preferences: immediate neighbors typically receive more attention, with the pattern decaying for distant positions.

Learned Position Patterns

To illustrate what relative position embeddings might learn, let's simulate embeddings that encode common linguistic patterns:

In[23]:
Code
# Simulate learned relative position embeddings
# Pattern: slight preference for immediately adjacent positions

d_k_sim = 4
max_dist_sim = 5

# Create embeddings that encode locality preference
np.random.seed(42)
a_K_sim = np.random.randn(2 * max_dist_sim + 1, d_k_sim) * 0.1

# Add locality bias: positions closer to 0 have higher dot product potential
for idx in range(2 * max_dist_sim + 1):
    offset = idx - max_dist_sim  # Convert back to relative position
    locality_factor = np.exp(-0.5 * (offset**2))  # Gaussian decay
    a_K_sim[idx, 0] += locality_factor  # Boost first dimension

# Create a uniform query that will reveal the position pattern
uniform_query = np.ones(d_k_sim) / np.sqrt(d_k_sim)
position_profile = uniform_query @ a_K_sim.T
Out[24]:
Visualization
Line plot showing position score peaking at offset 0 and decaying symmetrically for more distant positions.
Simulated learned relative position pattern. The embeddings encode a locality preference: nearby positions have higher compatibility scores. This pattern approximates what models often learn, where local context is prioritized.

This simulated pattern shows a Gaussian-like preference for nearby positions. Real models learn more complex patterns that capture linguistic structure: determiners attending to following nouns, verbs attending to preceding subjects, and so on.

Key Parameters

When implementing relative position encoding, several parameters control the behavior and efficiency of the mechanism:

  • max_distance (k): Maximum relative position to distinguish. Positions beyond this distance share the same "far away" embedding. Typical values range from 16 to 128. Smaller values reduce memory but lose fine-grained distance information for distant positions. Larger values preserve more detail but require more parameters and training data.
  • d_k: Dimension of query and key vectors (and key-side relative position embeddings). Must match between queries and keys for dot product compatibility. Common values are 64 for single-head attention or embed_dim / num_heads for multi-head attention.
  • d_v: Dimension of value vectors (and value-side relative position embeddings). Determines the output dimension of each attention head. Often set equal to d_k for simplicity.
  • Embedding initialization scale: Relative position embeddings are typically initialized with small random values. Xavier/Glorot initialization scales by $\sqrt{2/(n_{\text{positions}} + d)}$ to maintain stable gradient magnitudes during training.

The total number of relative position parameters is $(2k + 1) \times (d_k + d_v)$. With $k = 64$ and $d_k = d_v = 64$, this adds 16,512 parameters per attention head, which is modest compared to the QKV projection matrices.
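A quick arithmetic check of the count quoted above:

k_rel, d_k_head, d_v_head = 64, 64, 64
print((2 * k_rel + 1) * (d_k_head + d_v_head))  # 16512 extra parameters per head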

Limitations and Practical Considerations

Relative position encoding solves the generalization problem but introduces its own challenges.

The most significant limitation is computational complexity. Standard absolute position encoding adds to embeddings once at the input, with no additional cost during attention. Relative position encoding modifies every attention score computation. For a sequence of length $n$, we compute $n^2$ relative position biases per attention layer. With multiple layers and attention heads, this overhead accumulates. Shaw et al. note that their approach adds roughly 5-10% to training time.

Memory usage also increases. We store $2k+1$ relative position embedding vectors for the keys and another $2k+1$ for the values. With $d_k = d_v = 64$ and $k = 64$, this adds about 16K parameters per attention head. This is modest compared to the projection matrices, but still non-trivial for models with many heads.

The clipping distance $k$ introduces a hyperparameter choice. Too small, and the model loses fine-grained distance information. Too large, and distant embeddings have insufficient training signal. In practice, $k$ is often set to match the typical context length during training, such as 64 or 128 for sentence-level tasks.

Despite these limitations, relative position encoding's impact on transformer research has been substantial. It demonstrated that position information can be injected at the attention level rather than the embedding level, opening the door to more sophisticated schemes like Rotary Position Embeddings and ALiBi that we'll explore in subsequent chapters.

Summary

Relative position encoding shifts the focus from "where am I?" to "how far apart are we?" By encoding the offset between positions rather than absolute indices, models learn patterns that generalize across all positions at the same distance.

Key takeaways from this chapter:

  • Absolute positions limit generalization: A model trained on short sequences may never see position pairs that appear in longer sequences. Relative encoding ensures that distance patterns transfer regardless of absolute position.

  • Shaw et al. formulation: Modify attention scores by adding $\mathbf{q}_i \cdot \mathbf{a}^K_{j-i}$ to the content term. This separates semantic compatibility from distance-based compatibility.

  • Clipping bounds memory: Relative positions beyond distance $k$ share the same embedding. This limits the number of distinct relative position embeddings to $2k + 1$ regardless of sequence length.

  • Two components to modify: Shaw et al. add relative embeddings to both keys (affecting which positions get attention) and values (affecting what content flows). Both contribute to the learned distance patterns.

  • Diagonal structure in attention: Position biases form a Toeplitz-like pattern where each anti-diagonal corresponds to a fixed relative distance. This structure enables efficient computation.

  • Trade-off between complexity and generalization: Relative position encoding adds computational overhead but provides better generalization to unseen position pairs and longer sequences.

In the next chapter, we'll explore Rotary Position Embeddings (RoPE), a method that encodes relative positions through rotation rather than additive biases. RoPE achieves similar generalization benefits with a more elegant mathematical formulation that integrates naturally with the attention computation.

