Rotary Position Embedding (RoPE): Encoding Position Through Rotation

Michael Brenndoerfer · Updated June 5, 2025 · 38 min read

Learn how RoPE encodes position through vector rotation, making attention scores depend on relative position. Includes mathematical derivation and implementation.

Rotary Position Embedding (RoPE)

The transformer attention mechanism, as we've seen, is inherently position-blind. Sinusoidal encodings and learned position embeddings address this by adding position information to token embeddings before attention. But these approaches encode absolute position. A token at position 5 always receives the same positional signal, regardless of context. What if the relationship between positions 5 and 7 matters more than the absolute locations? Relative position encoding tackles this, but earlier methods required modifying the attention architecture or adding explicit bias terms.

Rotary Position Embedding, or RoPE, takes an elegant geometric approach. Instead of adding position to embeddings, it rotates them. Each position corresponds to a rotation angle, and the rotation is applied directly to query and key vectors. The ingenious part: when you compute the dot product between a rotated query at position $m$ and a rotated key at position $n$, the result depends only on their relative distance $m - n$. Absolute positions vanish, leaving only the relationship between tokens.

This chapter develops RoPE from first principles. We'll start with rotations in 2D, extend to higher dimensions through paired rotations, derive why the mechanism captures relative position, and implement it in code. By the end, you'll understand why RoPE has become the dominant position encoding in modern language models like LLaMA, PaLM, and many others.

Why Rotation?

Consider what we want from a position encoding. When a query at position $m$ attends to a key at position $n$, the attention score should somehow reflect their relative distance $m - n$. If token 3 attends to token 1, the model should "know" they're 2 positions apart, exactly as if token 8 attends to token 6.

Rotations have a beautiful property that accomplishes this. If you rotate vector $\mathbf{a}$ by angle $\theta_m$ and vector $\mathbf{b}$ by angle $\theta_n$, then compute their dot product, the result depends on the angle difference $\theta_m - \theta_n$. The absolute angles cancel out.

Rotation Invariance of Dot Products

For two vectors $\mathbf{a}$ and $\mathbf{b}$, rotating both by the same angle $\theta$ preserves their dot product: $R_\theta(\mathbf{a}) \cdot R_\theta(\mathbf{b}) = \mathbf{a} \cdot \mathbf{b}$. This is because rotations preserve lengths and angles between vectors.

Now imagine we associate each position with an angle: position $m$ gets angle $m \cdot \theta$ for some base angle $\theta$. If we rotate the query vector at position $m$ by $m\theta$ and the key vector at position $n$ by $n\theta$, their dot product will involve the angle $(m - n)\theta$. That's exactly the relative position information we want.

This is the core insight of RoPE: encode position through rotation, and let the geometry of dot products naturally extract relative position.

Rotation Matrices in 2D

Let's build up the mechanics. In two dimensions, rotating a vector $(x, y)$ by angle $\theta$ counterclockwise uses the rotation matrix:

$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

where:

  • $R(\theta)$: the 2D rotation matrix that rotates vectors by angle $\theta$
  • $\theta$: the rotation angle in radians (counterclockwise is positive)
  • $\cos\theta$, $\sin\theta$: trigonometric functions evaluated at angle $\theta$

Applying this rotation matrix to a 2D vector transforms its coordinates:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = R(\theta) \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x\cos\theta - y\sin\theta \\ x\sin\theta + y\cos\theta \end{pmatrix}$$

where:

  • $(x, y)$: the original vector coordinates
  • $(x', y')$: the rotated vector coordinates
  • The first row computes the new $x$-coordinate by combining the original coordinates with $\cos\theta$ and $-\sin\theta$
  • The second row computes the new $y$-coordinate using $\sin\theta$ and $\cos\theta$

The rotated vector $(x', y')$ has the same length as $(x, y)$ but points in a direction shifted by $\theta$. This is because rotation matrices are orthogonal, meaning they preserve vector lengths (norms) and angles between vectors.

In[3]:
Code
import numpy as np


def rotate_2d(vector, theta):
    """Rotate a 2D vector by angle theta (in radians)."""
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    rotation_matrix = np.array([[cos_t, -sin_t], [sin_t, cos_t]])
    return rotation_matrix @ vector

Let's visualize how rotation transforms a vector:

In[4]:
Code
# Original vector
original = np.array([1.0, 0.5])

# Rotate by several angles
angles = [0, np.pi / 6, np.pi / 3, np.pi / 2]
rotated_vectors = [rotate_2d(original, theta) for theta in angles]
Out[5]:
Visualization
Four arrows from origin showing a vector rotated by 0, 30, 60, and 90 degrees.
A 2D vector rotated by increasing angles. The vector maintains its length (distance from origin) while its direction changes. Each rotation of π/6 radians (30°) shifts the vector counterclockwise.

The dashed circle shows the path traced by the vector tip as it rotates. Crucially, the length (magnitude) never changes. Rotation is an isometry, a transformation that preserves distances.
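
We can sanity-check both the isometry claim and the dot-product invariance from the callout above in a few lines, reusing the rotate_2d helper (the vectors and angle here are arbitrary choices for illustration):

a = np.array([1.0, 0.5])
b = np.array([-0.3, 2.0])
theta = 0.8

# Rotation preserves length (isometry)
print(np.isclose(np.linalg.norm(rotate_2d(a, theta)), np.linalg.norm(a)))  # True

# Rotating both vectors by the same angle preserves their dot product
print(np.isclose(np.dot(rotate_2d(a, theta), rotate_2d(b, theta)), np.dot(a, b)))  # True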

From Rotation to Relative Position

Now let's see how rotation makes dot products depend on relative position. Take two 2D vectors $\mathbf{q}$ (query) and $\mathbf{k}$ (key). Rotate $\mathbf{q}$ by angle $m\theta$ (for position $m$) and $\mathbf{k}$ by angle $n\theta$ (for position $n$).

The dot product of the rotated vectors is:

$$R(m\theta)\mathbf{q} \cdot R(n\theta)\mathbf{k}$$

where:

  • $\mathbf{q}$: the query vector at position $m$
  • $\mathbf{k}$: the key vector at position $n$
  • $R(m\theta)$: rotation matrix that rotates by angle $m\theta$ (position times base angle)
  • $R(n\theta)$: rotation matrix that rotates by angle $n\theta$

Using properties of rotation matrices, we can simplify this expression step by step:

$$\mathbf{q}^T R(m\theta)^T R(n\theta) \mathbf{k} = \mathbf{q}^T R(-m\theta) R(n\theta) \mathbf{k} = \mathbf{q}^T R((n-m)\theta) \mathbf{k}$$

The derivation proceeds as follows:

  1. Rewrite the dot product as matrix multiplication: $R(m\theta)\mathbf{q} \cdot R(n\theta)\mathbf{k} = (R(m\theta)\mathbf{q})^T (R(n\theta)\mathbf{k}) = \mathbf{q}^T R(m\theta)^T R(n\theta) \mathbf{k}$
  2. Apply the transpose-inverse property: The transpose of a rotation matrix equals its inverse, so $R(\theta)^T = R(-\theta)$. This gives us $R(m\theta)^T = R(-m\theta)$.
  3. Apply the composition property: Multiplying two rotation matrices adds their angles, so $R(\alpha) R(\beta) = R(\alpha + \beta)$. Therefore $R(-m\theta) R(n\theta) = R((n-m)\theta)$.

The final result $\mathbf{q}^T R((n-m)\theta) \mathbf{k}$ depends only on the difference $(n-m)$, not on the absolute values of $m$ and $n$ separately.
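
The two matrix facts used in steps 2 and 3 are easy to confirm numerically with the rotate_2d helper from earlier (the angles below are arbitrary):

v = np.array([0.3, -1.2])
alpha, beta = 0.7, 1.9

# Transpose-inverse property: rotating by alpha and then by -alpha restores v
print(np.allclose(rotate_2d(rotate_2d(v, alpha), -alpha), v))  # True

# Composition property: successive rotations add their angles
print(np.allclose(rotate_2d(rotate_2d(v, alpha), beta), rotate_2d(v, alpha + beta)))  # True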

Key Insight: Relative Position Emerges

When we rotate the query by $m\theta$ and the key by $n\theta$, their dot product depends only on $(n - m)\theta$, the relative position. The absolute positions $m$ and $n$ vanish, replaced by their difference.

This is remarkable. We encode absolute position through rotation angle, but the attention mechanism, which uses dot products, automatically extracts relative position. No architectural changes needed. No explicit bias terms. Just geometry.

Let's verify this numerically:

In[6]:
Code
def verify_relative_position():
    """Verify that rotated dot products depend only on relative position."""
    np.random.seed(42)

    # Random query and key vectors
    q = np.random.randn(2)
    k = np.random.randn(2)

    theta = 0.5  # Base rotation angle

    # Different absolute positions, same relative distance
    results = []
    for m, n in [(1, 3), (5, 7), (10, 12), (100, 102)]:
        q_rotated = rotate_2d(q, m * theta)
        k_rotated = rotate_2d(k, n * theta)
        dot_product = np.dot(q_rotated, k_rotated)
        results.append((m, n, n - m, dot_product))

    return results


position_results = verify_relative_position()
Out[7]:
Console
Verifying relative position property:
Position m   Position n   n - m      Dot Product    
--------------------------------------------------
1            3            2          -0.651890
5            7            2          -0.651890
10           12           2          -0.651890
100          102          2          -0.651890

All four pairs have the same relative distance (2 positions apart), and their dot products are identical despite wildly different absolute positions. This confirms the relative position property holds numerically.

Extending to Higher Dimensions

We've established that 2D rotation elegantly encodes relative position. But real transformer embeddings have hundreds or thousands of dimensions, not just 2. How do we extend this geometric insight to high-dimensional space?

The challenge is that rotations in high dimensions are more complex than in 2D. A naive approach might try to define a single rotation that affects all dimensions simultaneously, but this would be computationally expensive and wouldn't preserve the relative position property we just derived.

RoPE's solution is both clever and efficient: treat the $d$-dimensional embedding as $d/2$ independent pairs. A $d$-dimensional embedding is split into pairs: $(x_1, x_2)$, $(x_3, x_4)$, ..., $(x_{d-1}, x_d)$. Each pair is rotated independently as a 2D vector, and since the pairs don't interact, the relative position property holds for each pair separately. When we sum up the contributions from all pairs in a dot product, the overall score still depends only on relative position.
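
The reason the pairwise treatment suffices is that a dot product in $d$ dimensions decomposes into a sum of per-pair dot products, so the 2D argument above applies to each term of the sum. A minimal check (the 8-dimensional vectors are arbitrary):

rng = np.random.default_rng(0)
q8, k8 = rng.standard_normal(8), rng.standard_normal(8)

# The full dot product equals the sum of the dot products of the 2D pairs
per_pair = sum(np.dot(q8[2 * i : 2 * i + 2], k8[2 * i : 2 * i + 2]) for i in range(4))
print(np.isclose(np.dot(q8, k8), per_pair))  # True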

But here's where RoPE becomes truly expressive: each pair rotates at a different frequency. The first pair might rotate by $\theta$ per position, the second by $\theta/2$, the third by $\theta/4$, and so on. Think of it like the hour, minute, and second hands of a clock: each moves at a different rate, and together they can represent any time uniquely. Similarly, by using multiple frequencies, RoPE creates a rich encoding where different dimension pairs capture position information at different scales.

The rotation angle for dimension pair $i$ at position $m$ is:

$$\theta_i(m) = m \cdot \theta_i = m \cdot \frac{1}{10000^{2i/d}}$$

where:

  • $\theta_i(m)$: the rotation angle (in radians) for dimension pair $i$ at sequence position $m$
  • $m$: the position in the sequence (0, 1, 2, ..., $n-1$ for a sequence of length $n$)
  • $i$: the dimension pair index (0, 1, 2, ..., $d/2 - 1$)
  • $d$: the total embedding dimension (must be even)
  • $\theta_i = 1/10000^{2i/d}$: the base frequency for dimension pair $i$, which decreases exponentially as $i$ increases
  • 10000: the base constant (same as in sinusoidal position encodings), chosen empirically for good performance

To understand why this formula creates a multi-scale representation, consider the exponent $2i/d$:

  • When $i = 0$: $\theta_0 = 1/10000^0 = 1$ (fastest rotation, one radian per position)
  • When $i = d/4$: $\theta_{d/4} = 1/10000^{0.5} = 1/100 = 0.01$ (slower rotation)
  • When $i = d/2 - 1$: $\theta_{d/2-1} \approx 1/10000$ (slowest rotation)

This exponential decay means early dimension pairs (small ii) rotate quickly, capturing fine-grained position differences, while later dimension pairs (large ii) rotate slowly, capturing longer-range relationships.

In[8]:
Code
def compute_rope_frequencies(d_model, base=10000):
    """Compute rotation frequencies for each dimension pair."""
    # Number of dimension pairs
    num_pairs = d_model // 2

    # Frequency for each pair: 1 / base^(2i/d)
    i = np.arange(num_pairs)
    frequencies = 1.0 / (base ** (2 * i / d_model))

    return frequencies


# Example with 8 dimensions (4 pairs)
d_model = 8
freqs = compute_rope_frequencies(d_model)
Out[9]:
Console
RoPE frequencies for 8-dimensional embeddings:
Pair     Frequency       Wavelength (positions)
---------------------------------------------
0        1.000000        6.3
1        0.100000        62.8
2        0.010000        628.3
3        0.001000        6283.2

The table shows the exponential decay: pair 0 completes a full cycle in about 6 positions (high frequency), while pair 3 takes over 6,000 positions (low frequency). This roughly 1000× spread in wavelength is what allows RoPE to encode positions at multiple scales simultaneously.

Let's visualize this frequency spectrum to see the exponential decay more clearly:

In[10]:
Code
# Visualize frequency decay across dimension pairs
d_model_viz = 64  # Typical small model dimension
freqs_viz = compute_rope_frequencies(d_model_viz)
wavelengths = 2 * np.pi / freqs_viz
Out[11]:
Visualization
Line plot showing exponential decay of frequency across dimension pairs on log scale.
Frequencies decay exponentially from 1.0 (pair 0) to near zero (pair 31). Early dimension pairs rotate rapidly, later pairs rotate slowly.
Line plot showing exponential growth of wavelength across dimension pairs on log scale.
Wavelengths grow exponentially, spanning from ~6 positions to tens of thousands of positions. This multi-scale structure enables RoPE to encode both local and global position information.

The wavelengths tell us how many positions pass before a dimension pair completes a full rotation (360°). In the 8-dimensional example, pair 0 completes a cycle in about 6 positions, while pair 3 takes over 6,000 positions. This exponential spread ensures RoPE can distinguish positions both locally and globally.

The Complete RoPE Formula

Now that we understand the individual components, let's bring everything together into the complete RoPE transformation. We've established three key ideas:

  1. Rotation encodes position: Each position $m$ corresponds to a rotation angle $m\theta$
  2. Dot products extract relative position: When query and key are rotated by different amounts, their dot product depends only on the angle difference
  3. Multiple frequencies create richness: Different dimension pairs rotate at different rates, capturing both local and global position information

The complete RoPE formula combines these insights into a single elegant operation. Given a query or key vector $\mathbf{x} \in \mathbb{R}^d$ at position $m$, we apply RoPE as follows:

$$\text{RoPE}(\mathbf{x}, m) = \begin{pmatrix} R(\theta_0 \cdot m) & & & \\ & R(\theta_1 \cdot m) & & \\ & & \ddots & \\ & & & R(\theta_{d/2-1} \cdot m) \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix}$$

where:

  • $\text{RoPE}(\mathbf{x}, m)$: the rotated vector, a function of both the input vector and position
  • $\mathbf{x} \in \mathbb{R}^d$: the input query or key vector with $d$ dimensions
  • $m$: the sequence position (integer index)
  • $R(\theta_i \cdot m)$: the 2×2 rotation matrix for angle $\theta_i \cdot m$, applied to dimension pair $i$
  • $\theta_i = 1/10000^{2i/d}$: the base frequency for dimension pair $i$ (decreases exponentially with $i$)
  • The block-diagonal structure means each 2×2 rotation block operates independently on its corresponding dimension pair
  • Empty off-diagonal blocks are zeros, so dimensions in different pairs don't interact during rotation

The large block-diagonal matrix applies different rotations to each dimension pair simultaneously. This is efficient because each 2D rotation is independent of the others, allowing for parallel computation.

Expanding the rotation for a single dimension pair $(x_{2i+1}, x_{2i+2})$:

$$\begin{pmatrix} x'_{2i+1} \\ x'_{2i+2} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i+1} \\ x_{2i+2} \end{pmatrix}$$

where:

  • $x_{2i+1}$, $x_{2i+2}$: the $(2i+1)$-th and $(2i+2)$-th components of the input vector (using 1-based indexing)
  • $x'_{2i+1}$, $x'_{2i+2}$: the corresponding components after rotation
  • $m\theta_i$: the rotation angle, which increases linearly with position $m$ at a rate determined by frequency $\theta_i$

Written out element-wise, the transformation is:

  • $x'_{2i+1} = x_{2i+1} \cos(m\theta_i) - x_{2i+2} \sin(m\theta_i)$
  • $x'_{2i+2} = x_{2i+1} \sin(m\theta_i) + x_{2i+2} \cos(m\theta_i)$
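
To make the block-diagonal picture concrete, here is a short sketch that assembles the full matrix for one position and checks it against the element-wise formulas above. It reuses compute_rope_frequencies from earlier; the dimension and position are arbitrary:

d_bd, m_bd = 8, 3
freqs_bd = compute_rope_frequencies(d_bd)

# Assemble the block-diagonal rotation matrix: one 2x2 block per dimension pair
R_full = np.zeros((d_bd, d_bd))
for i in range(d_bd // 2):
    c, s = np.cos(m_bd * freqs_bd[i]), np.sin(m_bd * freqs_bd[i])
    R_full[2 * i : 2 * i + 2, 2 * i : 2 * i + 2] = [[c, -s], [s, c]]

rng = np.random.default_rng(1)
x_bd = rng.standard_normal(d_bd)

# Element-wise version of the same rotation, pair by pair
x_pairwise = np.empty(d_bd)
for i in range(d_bd // 2):
    c, s = np.cos(m_bd * freqs_bd[i]), np.sin(m_bd * freqs_bd[i])
    x_pairwise[2 * i] = x_bd[2 * i] * c - x_bd[2 * i + 1] * s
    x_pairwise[2 * i + 1] = x_bd[2 * i] * s + x_bd[2 * i + 1] * c

print(np.allclose(R_full @ x_bd, x_pairwise))  # True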

Complex Number Perspective

The matrix formulation we've developed is mathematically complete, but there's an even more elegant way to express RoPE using complex numbers. This isn't just a notational convenience: the complex perspective reveals the deep connection between rotations and exponentials, and leads to more efficient implementations.

The key insight is that a 2D rotation is equivalent to multiplication by a complex exponential. Every 2D vector $(x, y)$ can be viewed as a complex number $z = x + jy$, where $j$ is the imaginary unit. In this representation, rotating the vector by angle $\theta$ is simply multiplying by $e^{j\theta}$.

For the dimension pair $(x_{2i+1}, x_{2i+2})$, interpret them as the real and imaginary parts of a complex number:

$$z_i = x_{2i+1} + j \cdot x_{2i+2}$$

where:

  • $z_i$: a complex number representing dimension pair $i$
  • $x_{2i+1}$: the real part (first element of the pair)
  • $x_{2i+2}$: the imaginary part (second element of the pair)
  • $j$: the imaginary unit (satisfying $j^2 = -1$)

Rotation by angle $\theta$ in the complex plane is achieved by multiplication:

$$z'_i = z_i \cdot e^{j\theta} = z_i \cdot (\cos\theta + j\sin\theta)$$

where:

  • $z'_i$: the rotated complex number
  • $e^{j\theta}$: the complex exponential, a point on the unit circle at angle $\theta$
  • $\cos\theta + j\sin\theta$: the expanded form via Euler's formula

Euler's Formula

Euler's formula states that $e^{j\theta} = \cos\theta + j\sin\theta$. Geometrically, $e^{j\theta}$ represents a point on the unit circle at angle $\theta$ from the positive real axis. Multiplying any complex number by $e^{j\theta}$ rotates it by angle $\theta$ counterclockwise in the complex plane, preserving its magnitude.

For RoPE at position $m$, we apply this rotation with the position-dependent angle:

$$z'_i = z_i \cdot e^{j \cdot m \cdot \theta_i}$$

where:

  • $m$: the sequence position
  • $\theta_i$: the base frequency for dimension pair $i$
  • $m \cdot \theta_i$: the total rotation angle (increases linearly with position)

This formulation is mathematically equivalent to the rotation matrix approach. The complex perspective leads to more concise code and can be more efficient on hardware with optimized complex number operations.

In[12]:
Code
def rope_complex(x, position, frequencies):
    """Apply RoPE using complex number formulation.

    Args:
        x: Input vector of shape (d,) where d is even
        position: Position index (integer)
        frequencies: Pre-computed frequencies of shape (d/2,)

    Returns:
        Rotated vector of shape (d,)
    """
    # Reshape to pairs and view as complex numbers
    d = len(x)
    x_pairs = x.reshape(-1, 2)
    x_complex = x_pairs[:, 0] + 1j * x_pairs[:, 1]

    # Rotation angles for this position
    angles = position * frequencies

    # Apply rotation via complex multiplication
    rotated_complex = x_complex * np.exp(1j * angles)

    # Convert back to real pairs
    rotated_pairs = np.stack(
        [rotated_complex.real, rotated_complex.imag], axis=1
    )
    return rotated_pairs.flatten()

Let's verify that the complex formulation gives the same result as explicit rotation matrices:

In[13]:
Code
def rope_matrix(x, position, frequencies):
    """Apply RoPE using explicit rotation matrices."""
    d = len(x)
    result = np.zeros_like(x)

    for i in range(d // 2):
        theta = position * frequencies[i]
        cos_t, sin_t = np.cos(theta), np.sin(theta)

        # Apply 2D rotation to dimension pair (2i, 2i+1)
        x1, x2 = x[2 * i], x[2 * i + 1]
        result[2 * i] = x1 * cos_t - x2 * sin_t
        result[2 * i + 1] = x1 * sin_t + x2 * cos_t

    return result


# Compare both implementations
np.random.seed(42)
test_vector = np.random.randn(8)
test_freqs = compute_rope_frequencies(8)
position = 5

result_complex = rope_complex(test_vector, position, test_freqs)
result_matrix = rope_matrix(test_vector, position, test_freqs)
Out[14]:
Console
Comparing RoPE implementations:
Original vector:     [ 0.4967 -0.1383  0.6477  1.523  -0.2342 -0.2341  1.5792  0.7674]
Complex formulation: [ 0.0083 -0.5155 -0.1618  1.6471 -0.2222 -0.2455  1.5754  0.7753]
Matrix formulation:  [ 0.0083 -0.5155 -0.1618  1.6471 -0.2222 -0.2455  1.5754  0.7753]

Max difference: 2.78e-17

The two implementations produce identical results (up to floating-point precision). The complex formulation is often preferred in practice because it's more concise and can leverage optimized complex number operations.

Visualizing RoPE Patterns

With both the matrix and complex formulations implemented, let's build intuition by visualizing how RoPE actually transforms embeddings. Understanding these patterns helps explain why RoPE is so effective at encoding position information.

We'll plot the rotation patterns for each dimension pair, tracking how a unit vector moves as position increases:

In[15]:
Code
# Generate RoPE patterns across positions
d_model = 8
max_position = 50
frequencies = compute_rope_frequencies(d_model)

# Track how a unit vector in each pair rotates with position
positions = np.arange(max_position)
rotation_patterns = np.zeros((max_position, d_model // 2, 2))

for pos in positions:
    for i in range(d_model // 2):
        angle = pos * frequencies[i]
        rotation_patterns[pos, i, 0] = np.cos(angle)
        rotation_patterns[pos, i, 1] = np.sin(angle)
Out[16]:
Visualization
Scatter plot showing points tracing multiple full circles as position increases.
Pair 0: Fastest rotation (freq=1.0). Completes multiple full cycles within 50 positions, distinguishing nearby tokens.
Scatter plot showing points tracing partial circles with medium speed.
Pair 1: Medium-fast rotation. Slower than pair 0 but still captures local position differences.
Scatter plot showing points tracing a small arc slowly.
Pair 2: Medium-slow rotation. Provides mid-range position information.
Scatter plot showing points barely moving along the circle.
Pair 3: Slowest rotation. Barely completes an arc in 50 positions, encoding global structure.

The color gradient (dark to light) shows increasing position. Pair 0 makes multiple full rotations within 50 positions, while Pair 3 barely completes an arc. This multi-frequency structure is what gives RoPE its expressiveness.

Relative Position Through Dot Products

We've derived mathematically that RoPE should make attention scores depend only on relative position. Now let's verify this core property empirically and see what it looks like in practice.

The test is straightforward: create identical query and key vectors at different positions, apply RoPE, and compute attention scores. If RoPE works as intended, the scores should form a Toeplitz matrix, where each diagonal contains identical values. This structure proves that scores depend only on relative position (the difference between query and key positions), not on absolute positions.

In[17]:
Code
def compute_rope_attention_scores(queries, keys, frequencies):
    """Compute attention scores with RoPE applied.

    Args:
        queries: Query vectors, shape (seq_len, d)
        keys: Key vectors, shape (seq_len, d)
        frequencies: RoPE frequencies, shape (d/2,)

    Returns:
        Attention scores, shape (seq_len, seq_len)
    """
    seq_len, d = queries.shape
    scores = np.zeros((seq_len, seq_len))

    for m in range(seq_len):
        # Rotate query at position m
        q_rotated = rope_complex(queries[m], m, frequencies)

        for n in range(seq_len):
            # Rotate key at position n
            k_rotated = rope_complex(keys[n], n, frequencies)

            # Compute dot product
            scores[m, n] = np.dot(q_rotated, k_rotated)

    return scores


# Create test queries and keys
np.random.seed(42)
seq_len = 6
d_model = 8

# All queries are identical, all keys are identical
# This isolates the effect of position
q_template = np.random.randn(d_model)
k_template = np.random.randn(d_model)

queries = np.tile(q_template, (seq_len, 1))
keys = np.tile(k_template, (seq_len, 1))

frequencies = compute_rope_frequencies(d_model)
scores = compute_rope_attention_scores(queries, keys, frequencies)
Out[18]:
Visualization
Heatmap showing attention scores with identical values along each diagonal.
Attention scores with RoPE show Toeplitz structure: each diagonal has the same value. This proves that scores depend only on relative position (column index minus row index), not absolute positions. The pattern emerges purely from the geometry of rotation.

The Toeplitz structure is clear: all entries along each diagonal are identical. Position (0,0), (1,1), (2,2) all have the same score (relative distance 0). Position (0,1), (1,2), (2,3) all match (relative distance 1). This is the relative position property in action.

Let's verify numerically by extracting scores for each relative distance:

In[19]:
Code
# Extract scores by relative distance
relative_scores = {}
for m in range(seq_len):
    for n in range(seq_len):
        rel_dist = n - m
        if rel_dist not in relative_scores:
            relative_scores[rel_dist] = []
        relative_scores[rel_dist].append(scores[m, n])
Out[20]:
Console
Scores grouped by relative position (n - m):
Relative Position    Scores                                   Std Dev        
---------------------------------------------------------------------------
-5                   -3.7130                                  0.00e+00
-4                   -3.4684, -3.4684                         0.00e+00
-3                   -3.2589, -3.2589, -3.2589                2.56e-16
-2                   -3.3481, -3.3481, -3.3481, -3.3481       5.44e-16
-1                   -3.7172, -3.7172, -3.7172, -3.7172, ...  1.99e-16
0                    -4.0819, -4.0819, -4.0819, -4.0819, ...  3.63e-16
1                    -4.1532, -4.1532, -4.1532, -4.1532, ...  7.94e-16
2                    -3.9027, -3.9027, -3.9027, -3.9027       2.22e-16
3                    -3.5884, -3.5884, -3.5884                3.63e-16
4                    -3.5173, -3.5173                         3.14e-16
5                    -3.7630                                  0.00e+00

All standard deviations are effectively zero (within floating-point precision), confirming that scores at each relative distance are identical.

How Dot Products Vary with Relative Distance

The Toeplitz structure tells us scores depend only on relative position, but how do they vary? Let's trace how the dot product changes as we increase the relative distance between query and key:

In[21]:
Code
# Analyze how dot product varies with relative distance
np.random.seed(42)
d_model = 32
frequencies = compute_rope_frequencies(d_model)

# Create random query and key vectors
q = np.random.randn(d_model)
k = np.random.randn(d_model)

# Compute dot products at various relative distances
max_rel_dist = 50
relative_distances = np.arange(-max_rel_dist, max_rel_dist + 1)
dot_products = []

# Fix query at position 50 (middle of range)
query_pos = 50
q_rotated = rope_complex(q, query_pos, frequencies)

for rel_dist in relative_distances:
    key_pos = query_pos + rel_dist
    k_rotated = rope_complex(k, key_pos, frequencies)
    dot_products.append(np.dot(q_rotated, k_rotated))

dot_products = np.array(dot_products)
Out[22]:
Visualization
Line plot showing oscillating dot product values across relative distances from -50 to +50.
Dot product between RoPE-rotated query and key vectors as a function of relative distance. The oscillating pattern emerges from the interference of multiple rotation frequencies. Peak at distance 0 indicates highest similarity when query and key are at the same position.

The oscillating pattern is characteristic of RoPE. The multiple frequencies create a complex interference pattern where some relative distances produce higher scores than others. This structure allows the model to learn position-dependent attention patterns during training.

Efficient Implementation

The implementations we've shown so far process one token at a time, which is clear for understanding but inefficient in practice. Modern deep learning frameworks excel at vectorized operations, so we want to apply RoPE to all tokens in a sequence simultaneously.

The key insight is that we can precompute all rotation angles as a matrix and apply them through broadcasting. Instead of looping over positions and dimension pairs, we compute everything in parallel:

In[23]:
Code
def apply_rope_batch(x, frequencies):
    """Apply RoPE to a batch of vectors at consecutive positions.

    Args:
        x: Input tensor of shape (seq_len, d)
        frequencies: Pre-computed frequencies of shape (d/2,)

    Returns:
        Rotated tensor of shape (seq_len, d)
    """
    seq_len, d = x.shape
    positions = np.arange(seq_len)

    # Compute all rotation angles: (seq_len, d/2)
    angles = np.outer(positions, frequencies)

    # Compute cos and sin for all positions and frequencies
    cos_angles = np.cos(angles)
    sin_angles = np.sin(angles)

    # Reshape input to pairs: (seq_len, d/2, 2)
    x_pairs = x.reshape(seq_len, -1, 2)

    # Apply rotation to each pair
    # x_new[0] = x[0] * cos - x[1] * sin
    # x_new[1] = x[0] * sin + x[1] * cos
    x_rotated = np.stack(
        [
            x_pairs[:, :, 0] * cos_angles - x_pairs[:, :, 1] * sin_angles,
            x_pairs[:, :, 0] * sin_angles + x_pairs[:, :, 1] * cos_angles,
        ],
        axis=-1,
    )

    return x_rotated.reshape(seq_len, d)

Let's verify this batch implementation matches the per-token version:

In[24]:
Code
# Test batch vs individual application
np.random.seed(42)
seq_len = 10
d_model = 16

test_embeddings = np.random.randn(seq_len, d_model)
frequencies = compute_rope_frequencies(d_model)

# Batch application
batch_result = apply_rope_batch(test_embeddings, frequencies)

# Individual application
individual_result = np.zeros_like(test_embeddings)
for pos in range(seq_len):
    individual_result[pos] = rope_complex(
        test_embeddings[pos], pos, frequencies
    )
Out[25]:
Console
Batch vs individual implementation:
Maximum difference: 2.22e-16
Implementations match: True

The maximum difference between implementations is on the order of $10^{-16}$, which is essentially machine epsilon for 64-bit floating point. This confirms that the batch implementation produces numerically identical results while being much more efficient through vectorization.
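
As a rough illustration of the vectorization benefit, here is a quick timing sketch that reuses the two implementations above (absolute numbers will vary with hardware and sequence length):

import timeit

seq_len_t, d_model_t = 512, 64
x_t = np.random.randn(seq_len_t, d_model_t)
freqs_t = compute_rope_frequencies(d_model_t)

t_batch = timeit.timeit(lambda: apply_rope_batch(x_t, freqs_t), number=20)
t_loop = timeit.timeit(
    lambda: [rope_complex(x_t[p], p, freqs_t) for p in range(seq_len_t)], number=20
)
print(f"Batched: {t_batch:.3f}s   Per-token loop: {t_loop:.3f}s")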

Integration with Self-Attention

With efficient RoPE implementation in hand, let's see how it fits into a complete self-attention layer. The integration is remarkably clean: RoPE slots in between the QKV projections and the attention computation, requiring no architectural changes to the transformer.

Here's the complete flow:

  1. Project input embeddings to Q, K, V using learned weight matrices
  2. Apply RoPE to Q and K (but not to V)
  3. Compute scaled dot-product attention as usual
  4. Return the attention output

The critical detail is step 2: we rotate queries and keys but leave values untouched. Let's implement this:

In[26]:
Code
class RoPEAttention:
    """Self-attention with Rotary Position Embedding."""

    def __init__(self, d_model, d_k, base=10000, seed=None):
        """Initialize RoPE attention layer.

        Args:
            d_model: Input embedding dimension
            d_k: Query/Key/Value dimension (must be even for RoPE)
            base: Base for frequency computation
            seed: Random seed for weight initialization
        """
        if d_k % 2 != 0:
            raise ValueError("d_k must be even for RoPE")

        if seed is not None:
            np.random.seed(seed)

        # Projection matrices
        scale = np.sqrt(2.0 / (d_model + d_k))
        self.W_q = np.random.randn(d_model, d_k) * scale
        self.W_k = np.random.randn(d_model, d_k) * scale
        self.W_v = np.random.randn(d_model, d_k) * scale

        # RoPE frequencies
        self.frequencies = compute_rope_frequencies(d_k, base)
        self.d_k = d_k

    def forward(self, x):
        """Compute RoPE attention.

        Args:
            x: Input embeddings of shape (seq_len, d_model)

        Returns:
            output: Attention output of shape (seq_len, d_k)
            attention_weights: Weights of shape (seq_len, seq_len)
        """
        # Project to Q, K, V
        Q = x @ self.W_q
        K = x @ self.W_k
        V = x @ self.W_v

        # Apply RoPE to Q and K
        Q_rope = apply_rope_batch(Q, self.frequencies)
        K_rope = apply_rope_batch(K, self.frequencies)

        # Scaled dot-product attention
        scores = Q_rope @ K_rope.T / np.sqrt(self.d_k)

        # Softmax
        scores_stable = scores - scores.max(axis=1, keepdims=True)
        attention_weights = np.exp(scores_stable) / np.exp(scores_stable).sum(
            axis=1, keepdims=True
        )

        # Aggregate values (V is NOT rotated)
        output = attention_weights @ V

        return output, attention_weights

Note that RoPE is applied only to queries and keys, not to values. This is because:

  1. Queries and keys determine attention patterns: The dot product between them computes compatibility. RoPE makes this compatibility position-aware.
  2. Values carry content: They should not be position-encoded because the content itself doesn't depend on position, only how much weight it receives.
In[27]:
Code
# Test the RoPE attention layer
np.random.seed(42)
seq_len = 8
d_model = 16
d_k = 8

x = np.random.randn(seq_len, d_model)
attention = RoPEAttention(d_model, d_k, seed=123)
output, weights = attention.forward(x)
Out[28]:
Console
RoPE Attention Layer Test:
Input shape:      (8, 16)
Output shape:     (8, 8)
Weights shape:    (8, 8)

Row sums (should be 1.0): [1. 1. 1. 1. 1. 1. 1. 1.]

The output confirms correct behavior: input of shape (8, 16) produces output of shape (8, 8) after projection to the query/key/value dimension. The attention weights form an 8×8 matrix where each row sums to exactly 1.0, confirming proper softmax normalization. The RoPE transformations are applied internally to queries and keys, making attention position-aware without changing the external interface.

RoPE Frequency Patterns

The choice of frequencies is crucial to RoPE's effectiveness. Let's visualize how the multi-frequency structure creates unique position signatures:

In[29]:
Code
# Visualize how position affects embedding components
d_model = 64
max_pos = 100
frequencies = compute_rope_frequencies(d_model)

# For each position, compute the rotation for each dimension pair
# We'll track cos(pos * freq) and sin(pos * freq)
cos_components = np.zeros((max_pos, d_model // 2))
sin_components = np.zeros((max_pos, d_model // 2))

for pos in range(max_pos):
    cos_components[pos] = np.cos(pos * frequencies)
    sin_components[pos] = np.sin(pos * frequencies)
Out[30]:
Visualization
Heatmap showing cosine values with rapid oscillation at top, slow at bottom.
Cosine components of RoPE across positions and dimension pairs. Early pairs (top) oscillate rapidly, creating high-frequency position signals. Later pairs oscillate slowly, encoding longer-range position information.
Heatmap showing sine values with similar frequency patterns to cosine.
Sine components of RoPE. Together with cosine, they form the complete rotation state for each dimension pair. The combination of multiple frequencies creates a unique signature for each position.

The heatmaps reveal the multi-scale nature of RoPE. Low-index dimension pairs (top rows) cycle rapidly, distinguishing nearby positions. High-index pairs (bottom rows) change slowly, providing a coarse position signal. This structure resembles sinusoidal position encodings since both use similar frequency patterns. The key difference is that sinusoidal encodings add position information to embeddings, while RoPE rotates the embeddings themselves.

Why RoPE Works So Well

Several properties make RoPE particularly effective:

Relative position by design. Unlike additive position encodings that must learn to extract relative position, RoPE provides it automatically through the geometry of rotations. The model doesn't need to learn that positions 5 and 7 are "2 apart"; the attention scores inherently reflect this.

Length generalization. Because RoPE encodes relative rather than absolute position, models can often generalize to longer sequences than seen during training. Position 1000 rotating relative to position 1002 works the same as position 0 rotating relative to position 2.

Computational efficiency. RoPE requires no additional parameters beyond the pre-computed frequencies. The rotation can be implemented as element-wise operations, making it very fast.

Compatibility with linear attention. Some efficient attention approximations rely on the inner product structure of attention. RoPE preserves this structure (rotation is a linear transformation), making it compatible with these methods.
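
The linearity behind that last point is easy to check: applying RoPE to a linear combination of vectors gives the same linear combination of their rotated versions. A quick sketch using the helpers defined above (the coefficients and position are arbitrary):

rng = np.random.default_rng(0)
d_lin = 16
freqs_lin = compute_rope_frequencies(d_lin)
u, w = rng.standard_normal(d_lin), rng.standard_normal(d_lin)
a_coef, b_coef, pos = 2.5, -0.7, 9

# RoPE(a*u + b*w, pos) == a*RoPE(u, pos) + b*RoPE(w, pos)
lhs = rope_complex(a_coef * u + b_coef * w, pos, freqs_lin)
rhs = a_coef * rope_complex(u, pos, freqs_lin) + b_coef * rope_complex(w, pos, freqs_lin)
print(np.allclose(lhs, rhs))  # True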

In[31]:
Code
# Demonstrate length generalization
np.random.seed(42)

# Train on short sequences (conceptually)
train_seq_len = 20

# Test on longer sequence
test_seq_len = 100
d_model = 16

frequencies = compute_rope_frequencies(d_model)

# Check that relative distances produce consistent scores
# even at positions far beyond "training"
q = np.random.randn(d_model)
k = np.random.randn(d_model)

# Near the start (like training)
q_rot_5 = rope_complex(q, 5, frequencies)
k_rot_7 = rope_complex(k, 7, frequencies)
score_near = np.dot(q_rot_5, k_rot_7)

# Far beyond (test generalization)
q_rot_85 = rope_complex(q, 85, frequencies)
k_rot_87 = rope_complex(k, 87, frequencies)
score_far = np.dot(q_rot_85, k_rot_87)
Out[32]:
Console
Length generalization test:
Score at positions 5, 7 (relative distance 2): -2.388206
Score at positions 85, 87 (relative distance 2): -2.388206
Difference: 2.22e-15

Identical scores at identical relative distances, regardless of absolute position. This is why RoPE-based models can extrapolate to longer contexts more gracefully than models with absolute position encodings.

Let's visualize this length generalization property more comprehensively by testing many absolute position pairs:

In[33]:
Code
# Comprehensive length generalization test
np.random.seed(42)
d_model = 32
frequencies = compute_rope_frequencies(d_model)

q = np.random.randn(d_model)
k = np.random.randn(d_model)

# Test relative distance of 2 at many different absolute positions
relative_dist = 2
absolute_positions = [0, 10, 50, 100, 500, 1000, 5000, 10000]
scores_at_positions = []

for pos in absolute_positions:
    q_rot = rope_complex(q, pos, frequencies)
    k_rot = rope_complex(k, pos + relative_dist, frequencies)
    scores_at_positions.append(np.dot(q_rot, k_rot))

scores_at_positions = np.array(scores_at_positions)
Out[34]:
Visualization
Bar chart showing nearly identical dot product scores across different absolute positions.
Length generalization in RoPE: the same relative distance (2) produces nearly identical attention scores regardless of absolute position, from position 0 to position 10,000. The flat profile demonstrates that RoPE's relative position property holds even at positions far beyond typical training lengths.

The scores agree to within floating-point precision, confirming that RoPE preserves relative position information regardless of where in the sequence we look. This is the mathematical foundation for length generalization in RoPE-based models.

Comparing RoPE to Other Position Encodings

Let's position RoPE within the broader landscape of position encodings:

Comparison of position encoding methods. RoPE achieves relative position encoding with zero parameters and no attention modification.
| Property | Sinusoidal | Learned | Relative (Shaw) | RoPE |
| --- | --- | --- | --- | --- |
| Parameters | 0 | $O(L \times d)$ | $O(L \times d)$ | 0 |
| Position type | Absolute | Absolute | Relative | Relative |
| Attention modified | No | No | Yes | No (uses rotation) |
| Length extrapolation | Moderate | Poor | Moderate | Good |
| Computational cost | Low | Low | Higher | Low |

RoPE combines the parameter efficiency of sinusoidal encodings with the relative position benefits of learned relative encodings, without the architectural complexity. This balance explains its widespread adoption.

To make this comparison concrete, let's visualize how position similarity decays with distance for different encoding schemes:

In[35]:
Code
# Compare position similarity decay across encoding methods
d_model_cmp = 64
max_distance = 50

# RoPE: dot product between rotated vectors at different distances
frequencies_cmp = compute_rope_frequencies(d_model_cmp)
rope_similarities = []
base_vec = np.ones(d_model_cmp) / np.sqrt(d_model_cmp)  # Unit vector

base_rotated = rope_complex(base_vec, 0, frequencies_cmp)
for dist in range(max_distance):
    other_rotated = rope_complex(base_vec, dist, frequencies_cmp)
    rope_similarities.append(np.dot(base_rotated, other_rotated))


# Sinusoidal: dot product between position encodings
def sinusoidal_encoding(position, d_model):
    """Generate sinusoidal position encoding."""
    pe = np.zeros(d_model)
    for i in range(0, d_model, 2):
        div_term = 10000 ** (i / d_model)
        pe[i] = np.sin(position / div_term)
        if i + 1 < d_model:
            pe[i + 1] = np.cos(position / div_term)
    return pe


sinusoidal_similarities = []
base_sin = sinusoidal_encoding(0, d_model_cmp)
for dist in range(max_distance):
    other_sin = sinusoidal_encoding(dist, d_model_cmp)
    sinusoidal_similarities.append(np.dot(base_sin, other_sin))

# Convert to arrays
rope_similarities = np.array(rope_similarities)
sinusoidal_similarities = np.array(sinusoidal_similarities)
Out[36]:
Visualization
Two line plots comparing how position similarity decays with distance for RoPE and sinusoidal encodings.
Position similarity decay comparison. RoPE similarity (computed on rotated vectors) oscillates more dramatically than sinusoidal similarity (computed on position encodings). Both show periodic structure from multi-frequency components, but RoPE's oscillation pattern emerges from the dot product of rotated content vectors, not the encodings themselves.

Both methods show oscillating similarity patterns due to their multi-frequency structure. The key difference: sinusoidal encodings add this pattern to the input, while RoPE modulates the attention computation directly through rotation.

Limitations and Considerations

Despite its elegance, RoPE has limitations worth understanding.

Frequency base sensitivity. The base (typically 10000) determines the frequency range. Models trained with one base may not transfer well to contexts requiring different frequency patterns. Recent work like YaRN and NTK-aware scaling addresses this by adjusting frequencies for longer contexts.

High-frequency aliasing. At very long positions, high-frequency dimension pairs may "wrap around" multiple times, potentially creating aliasing where distant positions appear similar. In practice, this is rarely problematic within reasonable context lengths, but it's a theoretical limitation.
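
To see what this wrap-around looks like, consider the fastest pair of the 8-dimensional example: its angle advances one radian per position, so two positions separated by a near-multiple of $2\pi$ are almost indistinguishable to that pair alone, and only the slower pairs keep them apart. A small sketch (position 710 is roughly 113 full turns of pair 0):

freqs_alias = compute_rope_frequencies(8)

# Pair 0 has frequency 1, so position 710 ≈ 113 * 2π lands almost exactly where
# position 0 does; the slowest pair (index 3) still separates the two positions.
for pos in (0, 710):
    fast = np.round([np.cos(pos * freqs_alias[0]), np.sin(pos * freqs_alias[0])], 4)
    slow = np.round([np.cos(pos * freqs_alias[-1]), np.sin(pos * freqs_alias[-1])], 4)
    print(f"position {pos}: pair 0 -> {fast}, pair 3 -> {slow}")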

Dimension divisibility. RoPE requires even-dimensional queries and keys since it operates on pairs. This is a minor constraint but must be considered in architecture design.

Training distribution effects. While RoPE theoretically supports any position, the model's other components (feed-forward networks, layer norms) are trained on a specific position distribution. Significant extrapolation may still degrade performance due to these other components, not RoPE itself.

These limitations are generally manageable. The community has developed extensions like Position Interpolation and NTK-aware RoPE that modify the frequency computation for better long-context performance. The core rotation mechanism remains unchanged.
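
As a rough sketch of the position-interpolation idea (a simplified illustration, not the exact recipe of any particular paper): positions beyond the trained range are linearly compressed back into it before the rotation angles are computed, so the angles never exceed those seen during training. The helper name and lengths below are hypothetical:

def rope_with_interpolation(x, position, frequencies, train_len, target_len):
    """Hypothetical helper: linearly compress positions before applying RoPE."""
    scale = train_len / target_len  # e.g. 2048 / 8192 compresses positions 4x
    return rope_complex(x, position * scale, frequencies)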

Key Parameters

When implementing RoPE in your models, these parameters control its behavior:

  • d_model (embedding dimension): The total dimension of query and key vectors. Must be even since RoPE operates on dimension pairs. In practice this is the per-head dimension (the model's hidden dimension divided by the number of attention heads), with common values between 64 and 256.

  • base (frequency base): Controls the range of rotation frequencies. The default value of 10000 provides a good balance between local and global position sensitivity. Larger values (e.g., 100000) extend the effective context length by slowing the rotation of all but the fastest pairs; smaller values make the encoding more sensitive to nearby positions. The short sketch after this list illustrates the effect on the slowest wavelength.

  • theta_i (per-dimension frequency): Computed as $1/\text{base}^{2i/d}$ for dimension pair $i$. Not typically set directly, but understanding it helps diagnose behavior: the first pair advances one radian per position, while the last pair completes a full rotation only after roughly $2\pi \times \text{base}$ positions.

  • Position offset: Some implementations support a starting position offset for key-value caching during inference. This allows continuing generation from a specific position without recomputing RoPE for all previous positions.
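
The sketch below, referenced from the base bullet above, shows how raising the base stretches the slowest wavelength while leaving the fastest pair (whose frequency is always 1) untouched; the alternative base value is just an example:

for base in (10_000, 500_000):
    freqs_base = compute_rope_frequencies(64, base=base)
    slowest_wavelength = 2 * np.pi / freqs_base[-1]
    print(
        f"base={base}: fastest freq={freqs_base[0]:.1f}, "
        f"slowest wavelength ≈ {slowest_wavelength:,.0f} positions"
    )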

Summary

Rotary Position Embedding encodes position through geometric rotation rather than additive signals. This approach elegantly captures relative position through the natural properties of dot products between rotated vectors.

Key takeaways:

  • Rotation as position encoding. Each position corresponds to a rotation angle. Rotating query and key vectors embeds position information directly into their geometric relationship.

  • Relative position emerges. When a rotated query at position $m$ attends to a rotated key at position $n$, the dot product depends only on $m - n$. Absolute positions cancel out through rotation mathematics.

  • Multi-frequency structure. Different dimension pairs rotate at different frequencies, creating a rich position representation. High frequencies capture local position differences; low frequencies capture global structure.

  • No additional parameters. Like sinusoidal encodings, RoPE uses deterministic frequencies based on dimension index. The only computation is the rotation itself.

  • Applied to Q and K only. Values are not rotated because they carry content, not position information. Rotation affects attention patterns, not the content that flows through them.

  • Good extrapolation. Because relative position is baked into the mechanism, models can often generalize to longer sequences than seen during training, though other model components may still limit this.

RoPE has become the dominant position encoding in modern large language models. Its combination of theoretical elegance, computational efficiency, and practical effectiveness makes it a foundational technique for transformer architectures. In the next chapter, we'll explore ALiBi, an alternative approach that adds relative position bias directly to attention scores.

