Learn how RoPE encodes position through vector rotation, making attention scores depend on relative position. Includes mathematical derivation and implementation.

Rotary Position Embedding (RoPE)
The transformer attention mechanism, as we've seen, is inherently position-blind. Sinusoidal encodings and learned position embeddings address this by adding position information to token embeddings before attention. But these approaches encode absolute position: a token at position 5 always receives the same positional signal, regardless of context. What if the relationship between positions 5 and 7 matters more than the absolute locations? Relative position encoding tackles this, but earlier methods required modifying the attention architecture or adding explicit bias terms.
Rotary Position Embedding, or RoPE, takes an elegant geometric approach. Instead of adding position to embeddings, it rotates them. Each position corresponds to a rotation angle, and the rotation is applied directly to query and key vectors. The ingenious part: when you compute the dot product between a rotated query at position $m$ and a rotated key at position $n$, the result depends only on their relative distance $m - n$. Absolute positions vanish, leaving only the relationship between tokens.
This chapter develops RoPE from first principles. We'll start with rotations in 2D, extend to higher dimensions through paired rotations, derive why the mechanism captures relative position, and implement it in code. By the end, you'll understand why RoPE has become the dominant position encoding in modern language models like LLaMA, PaLM, and many others.
Why Rotation?
Consider what we want from a position encoding. When a query at position $m$ attends to a key at position $n$, the attention score should somehow reflect their relative distance $m - n$. If token 3 attends to token 1, the model should "know" they're 2 positions apart, exactly as if token 8 attends to token 6.
Rotations have a beautiful property that accomplishes this. If you rotate vector $\mathbf{q}$ by angle $\alpha$ and vector $\mathbf{k}$ by angle $\beta$, then compute their dot product, the result depends only on the angle difference $\alpha - \beta$. The absolute angles cancel out.
For two vectors $\mathbf{u}$ and $\mathbf{v}$, rotating both by the same angle $\theta$ preserves their dot product: $(R(\theta)\mathbf{u}) \cdot (R(\theta)\mathbf{v}) = \mathbf{u} \cdot \mathbf{v}$. This is because rotations preserve lengths and angles between vectors.
Now imagine we associate each position with an angle: position $m$ gets angle $m\theta$ for some base angle $\theta$. If we rotate the query vector at position $m$ by $m\theta$ and the key vector at position $n$ by $n\theta$, their dot product will involve the angle $(m - n)\theta$. That's exactly the relative position information we want.
This is the core insight of RoPE: encode position through rotation, and let the geometry of dot products naturally extract relative position.
Rotation Matrices in 2D
Let's build up the mechanics. In two dimensions, rotating a vector by angle $\theta$ counterclockwise uses the rotation matrix:

$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

where:
- $R(\theta)$: the 2D rotation matrix that rotates vectors by angle $\theta$
- $\theta$: the rotation angle in radians (counterclockwise is positive)
- $\cos\theta$, $\sin\theta$: trigonometric functions evaluated at angle $\theta$
Applying this rotation matrix to a 2D vector $(x, y)$ transforms its coordinates:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x\cos\theta - y\sin\theta \\ x\sin\theta + y\cos\theta \end{pmatrix}$$

where:
- $(x, y)$: the original vector coordinates
- $(x', y')$: the rotated vector coordinates
- The first row computes the new $x$-coordinate by combining the original coordinates with $\cos\theta$ and $\sin\theta$
- The second row computes the new $y$-coordinate using $\sin\theta$ and $\cos\theta$
The rotated vector has the same length as the original but points in a direction shifted by $\theta$. This is because rotation matrices are orthogonal, meaning they preserve vector lengths (norms) and angles between vectors.
Let's visualize how rotation transforms a vector:
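A rough sketch of how such a figure could be generated with NumPy and Matplotlib (the starting vector and the set of angles here are arbitrary choices for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

def rotation_matrix(theta):
    """2D counterclockwise rotation matrix R(theta)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

v = np.array([1.0, 0.5])                                 # arbitrary starting vector
angles = np.linspace(0, 2 * np.pi, 12, endpoint=False)   # a dozen rotation angles

fig, ax = plt.subplots(figsize=(5, 5))
circle = plt.Circle((0, 0), np.linalg.norm(v), fill=False, linestyle="--")
ax.add_patch(circle)                                     # path traced by the vector tip
for theta in angles:
    rv = rotation_matrix(theta) @ v
    ax.arrow(0, 0, rv[0], rv[1], head_width=0.03, length_includes_head=True)
ax.set_xlim(-1.3, 1.3)
ax.set_ylim(-1.3, 1.3)
ax.set_aspect("equal")
ax.set_title("Rotating a vector: length preserved, direction changes")
plt.show()
```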
The dashed circle shows the path traced by the vector tip as it rotates. Crucially, the length (magnitude) never changes. Rotation is an isometry, a transformation that preserves distances.
From Rotation to Relative Position
Now let's see how rotation makes dot products depend on relative position. Take two 2D vectors $\mathbf{q}$ (query) and $\mathbf{k}$ (key). Rotate $\mathbf{q}$ by angle $m\theta$ (for position $m$) and $\mathbf{k}$ by angle $n\theta$ (for position $n$).
The dot product of the rotated vectors is:

$$\big(R(m\theta)\,\mathbf{q}\big) \cdot \big(R(n\theta)\,\mathbf{k}\big)$$

where:
- $\mathbf{q}$: the query vector at position $m$
- $\mathbf{k}$: the key vector at position $n$
- $R(m\theta)$: rotation matrix that rotates by angle $m\theta$ (position times base angle)
- $R(n\theta)$: rotation matrix that rotates by angle $n\theta$
Using properties of rotation matrices, we can simplify this expression step by step:

$$\big(R(m\theta)\,\mathbf{q}\big) \cdot \big(R(n\theta)\,\mathbf{k}\big) = \mathbf{q}^\top R(m\theta)^\top R(n\theta)\,\mathbf{k} = \mathbf{q}^\top R\big((n - m)\theta\big)\,\mathbf{k}$$

The derivation proceeds as follows:
- Rewrite the dot product as matrix multiplication: $\big(R(m\theta)\,\mathbf{q}\big) \cdot \big(R(n\theta)\,\mathbf{k}\big) = \mathbf{q}^\top R(m\theta)^\top R(n\theta)\,\mathbf{k}$
- Apply the transpose-inverse property: The transpose of a rotation matrix equals its inverse, so $R(m\theta)^\top = R(-m\theta)$. This gives us $\mathbf{q}^\top R(-m\theta)\,R(n\theta)\,\mathbf{k}$.
- Apply the composition property: Multiplying two rotation matrices adds their angles, so $R(-m\theta)\,R(n\theta) = R\big((n - m)\theta\big)$. Therefore the dot product equals $\mathbf{q}^\top R\big((n - m)\theta\big)\,\mathbf{k}$.
The final result depends only on the difference $n - m$ (equivalently $m - n$), not on the absolute values of $m$ and $n$ separately.
When we rotate the query by $m\theta$ and the key by $n\theta$, their dot product depends only on $m - n$, the relative position. The absolute positions $m$ and $n$ vanish, replaced by their difference.
This is remarkable. We encode absolute position through rotation angle, but the attention mechanism, which uses dot products, automatically extracts relative position. No architectural changes needed. No explicit bias terms. Just geometry.
Let's verify this numerically:
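Here is a minimal NumPy sketch of that check (the base angle $\theta = 0.5$ and the specific query and key vectors are arbitrary choices):

```python
import numpy as np

def rotation_matrix(angle):
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

theta = 0.5                                  # base angle per position (arbitrary)
q = np.array([1.0, 0.3])                     # query content
k = np.array([0.7, -0.2])                    # key content

# Four (query position, key position) pairs, all exactly 2 positions apart.
for m, n in [(3, 1), (8, 6), (12, 10), (100, 98)]:
    q_rot = rotation_matrix(m * theta) @ q   # rotate query by m * theta
    k_rot = rotation_matrix(n * theta) @ k   # rotate key by n * theta
    print(f"m={m:3d}, n={n:3d}, m-n={m-n}: dot = {q_rot @ k_rot:.6f}")
```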
All four pairs have the same relative distance (2 positions apart), and their dot products are identical despite wildly different absolute positions. This confirms the relative position property holds numerically.
Extending to Higher Dimensions
We've established that 2D rotation elegantly encodes relative position. But real transformer embeddings have hundreds or thousands of dimensions, not just 2. How do we extend this geometric insight to high-dimensional space?
The challenge is that rotations in high dimensions are more complex than in 2D. A naive approach might try to define a single rotation that affects all dimensions simultaneously, but this would be computationally expensive and wouldn't preserve the relative position property we just derived.
RoPE's solution is both clever and efficient: treat the $d$-dimensional embedding as $d/2$ independent 2D pairs. A $d$-dimensional embedding is split into pairs $(x_1, x_2)$, $(x_3, x_4)$, ..., $(x_{d-1}, x_d)$. Each pair is rotated independently as a 2D vector, and since the pairs don't interact, the relative position property holds for each pair separately. When we sum up the contributions from all pairs in a dot product, the overall score still depends only on relative position.
But here's where RoPE becomes truly expressive: each pair rotates at a different frequency. The first pair might rotate by 1 radian per position, the second by 0.1 radians, the third by 0.01, and so on. Think of it like the hour, minute, and second hands of a clock: each moves at a different rate, and together they can represent any time uniquely. Similarly, by using multiple frequencies, RoPE creates a rich encoding where different dimension pairs capture position information at different scales.
The rotation angle for dimension pair $i$ at position $m$ is:

$$\theta_{m,i} = m \cdot \theta_i, \qquad \theta_i = 10000^{-2i/d}$$

where:
- $\theta_{m,i}$: the rotation angle (in radians) for dimension pair $i$ at sequence position $m$
- $m$: the position in the sequence ($0, 1, 2, \ldots, L - 1$ for a sequence of length $L$)
- $i$: the dimension pair index ($0, 1, 2, \ldots, d/2 - 1$)
- $d$: the total embedding dimension (must be even)
- $\theta_i = 10000^{-2i/d}$: the base frequency for dimension pair $i$, which decreases exponentially as $i$ increases
- $10000$: the base constant (same as in sinusoidal position encodings), chosen empirically for good performance
To understand why this formula creates a multi-scale representation, consider the exponent $-2i/d$:
- When $i = 0$: $\theta_0 = 10000^{0} = 1$ (fastest rotation, one radian per position)
- When $i = d/4$: $\theta_i = 10000^{-1/2} = 0.01$ (slower rotation)
- When $i = d/2 - 1$: $\theta_i \approx 10000^{-1} = 0.0001$ (slowest rotation)
This exponential decay means early dimension pairs (small ) rotate quickly, capturing fine-grained position differences, while later dimension pairs (large ) rotate slowly, capturing longer-range relationships.
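A small sketch of these numbers (the embedding dimension $d = 12$ is an assumption chosen for illustration so that the printed wavelengths line up with the figures quoted below):

```python
import numpy as np

d = 12                                   # small embedding dimension (assumed for illustration)
base = 10000.0
pairs = np.arange(d // 2)
theta = base ** (-2.0 * pairs / d)       # per-pair frequency theta_i
wavelength = 2 * np.pi / theta           # positions needed for one full rotation

print(f"{'pair':>4} {'theta_i':>10} {'wavelength':>12}")
for i in range(4):                       # show the first four dimension pairs
    print(f"{i:>4} {theta[i]:>10.5f} {wavelength[i]:>12.1f}")
```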
The table shows the exponential decay: pair 0 completes a full cycle in about 6 positions (high frequency), while pair 3 takes over 600 positions (low frequency). This 100× difference in wavelength is what allows RoPE to encode positions at multiple scales simultaneously.
Let's visualize this frequency spectrum to see the exponential decay more clearly:
The wavelength tells us how many positions it takes for a dimension pair to complete a full rotation (360°). Pair 0 completes a cycle in about 6 positions, while pair 3 takes over 600 positions. This exponential spread ensures RoPE can distinguish positions both locally and globally.
The Complete RoPE Formula
Now that we understand the individual components, let's bring everything together into the complete RoPE transformation. We've established three key ideas:
- Rotation encodes position: Each position corresponds to a rotation angle
- Dot products extract relative position: When query and key are rotated by different amounts, their dot product depends only on the angle difference
- Multiple frequencies create richness: Different dimension pairs rotate at different rates, capturing both local and global position information
The complete RoPE formula combines these insights into a single elegant operation. Given a query or key vector $\mathbf{x} \in \mathbb{R}^d$ at position $m$, we apply RoPE as follows:

$$\mathrm{RoPE}(\mathbf{x}, m) = \begin{pmatrix} R(m\theta_0) & & & \\ & R(m\theta_1) & & \\ & & \ddots & \\ & & & R(m\theta_{d/2-1}) \end{pmatrix} \mathbf{x}$$

where:
- $\mathrm{RoPE}(\mathbf{x}, m)$: the rotated vector, a function of both the input vector and position
- $\mathbf{x}$: the input query or key vector with $d$ dimensions
- $m$: the sequence position (integer index)
- $R(m\theta_i)$: the 2×2 rotation matrix for angle $m\theta_i$, applied to dimension pair $i$
- $\theta_i = 10000^{-2i/d}$: the base frequency for dimension pair $i$ (decreases exponentially with $i$)
- The block-diagonal structure means each 2×2 rotation block operates independently on its corresponding dimension pair
- Empty off-diagonal blocks are zeros, so dimensions in different pairs don't interact during rotation
The large block-diagonal matrix applies different rotations to each dimension pair simultaneously. This is efficient because each 2D rotation is independent of the others, allowing for parallel computation.
Expanding the rotation for a single dimension pair $i$:

$$\begin{pmatrix} x'_{2i+1} \\ x'_{2i+2} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i+1} \\ x_{2i+2} \end{pmatrix}$$

where:
- $x_{2i+1}$, $x_{2i+2}$: the two components of the input vector belonging to pair $i$ (using 1-based indexing for components)
- $x'_{2i+1}$, $x'_{2i+2}$: the corresponding components after rotation
- $m\theta_i$: the rotation angle, which increases linearly with position $m$ at a rate determined by frequency $\theta_i$
Written out element-wise, the transformation is:

$$x'_{2i+1} = x_{2i+1}\cos(m\theta_i) - x_{2i+2}\sin(m\theta_i), \qquad x'_{2i+2} = x_{2i+1}\sin(m\theta_i) + x_{2i+2}\cos(m\theta_i)$$
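A minimal NumPy sketch of this per-pair rotation (the function name `rope_rotate` and the even/odd interleaving of pair components are implementation choices for this sketch, not mandated by the formula):

```python
import numpy as np

def rope_rotate(x, m, base=10000.0):
    """Apply RoPE to a single vector x (even dimension d) at position m."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)        # base frequency theta_i per pair
    angles = m * theta                    # rotation angle m * theta_i
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]             # first and second element of each pair
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin       # x'_{2i+1}
    out[1::2] = x1 * sin + x2 * cos       # x'_{2i+2}
    return out

x = np.random.default_rng(0).normal(size=8)
# Rotation is orthogonal, so the norm is unchanged:
print(np.linalg.norm(x), np.linalg.norm(rope_rotate(x, m=5)))
```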
Complex Number Perspective
The matrix formulation we've developed is mathematically complete, but there's an even more elegant way to express RoPE using complex numbers. This isn't just a notational convenience: the complex perspective reveals the deep connection between rotations and exponentials, and leads to more efficient implementations.
The key insight is that a 2D rotation is equivalent to multiplication by a complex exponential. Every 2D vector $(x, y)$ can be viewed as a complex number $z = x + \mathrm{i}y$, where $\mathrm{i}$ is the imaginary unit. In this representation, rotating the vector by angle $\theta$ is simply multiplying by $e^{\mathrm{i}\theta}$.
For the dimension pair $(x_{2i+1}, x_{2i+2})$, interpret the two components as the real and imaginary parts of a complex number:

$$z_i = x_{2i+1} + \mathrm{i}\,x_{2i+2}$$

where:
- $z_i$: a complex number representing dimension pair $i$
- $x_{2i+1}$: the real part (first element of the pair)
- $x_{2i+2}$: the imaginary part (second element of the pair)
- $\mathrm{i}$: the imaginary unit (satisfying $\mathrm{i}^2 = -1$)
Rotation by angle $\theta$ in the complex plane is achieved by multiplication:

$$z' = e^{\mathrm{i}\theta} z = (\cos\theta + \mathrm{i}\sin\theta)\,z$$

where:
- $z'$: the rotated complex number
- $e^{\mathrm{i}\theta}$: the complex exponential, a point on the unit circle at angle $\theta$
- $\cos\theta + \mathrm{i}\sin\theta$: the expanded form via Euler's formula
Euler's formula states that $e^{\mathrm{i}\theta} = \cos\theta + \mathrm{i}\sin\theta$. Geometrically, $e^{\mathrm{i}\theta}$ represents a point on the unit circle at angle $\theta$ from the positive real axis. Multiplying any complex number by $e^{\mathrm{i}\theta}$ rotates it by angle $\theta$ counterclockwise in the complex plane, preserving its magnitude.
For RoPE at position $m$, we apply this rotation with the position-dependent angle $m\theta_i$:

$$z'_i = e^{\mathrm{i}\,m\theta_i}\, z_i$$

where:
- $m$: the sequence position
- $\theta_i$: the base frequency for dimension pair $i$
- $m\theta_i$: the total rotation angle (increases linearly with position)
This formulation is mathematically equivalent to the rotation matrix approach. The complex perspective leads to more concise code and can be more efficient on hardware with optimized complex number operations.
Let's verify that the complex formulation gives the same result as explicit rotation matrices:
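A sketch of that comparison, with both variants written side by side in NumPy (function names and the test vector are arbitrary choices for this sketch):

```python
import numpy as np

def rope_matrix(x, m, base=10000.0):
    """RoPE via explicit 2D rotations applied to each dimension pair."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def rope_complex(x, m, base=10000.0):
    """RoPE via complex multiplication z' = exp(i * m * theta) * z."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    z = x[0::2] + 1j * x[1::2]            # pairs viewed as complex numbers
    z_rot = z * np.exp(1j * m * theta)    # rotate each pair
    out = np.empty_like(x)
    out[0::2], out[1::2] = z_rot.real, z_rot.imag
    return out

x = np.random.default_rng(1).normal(size=16)
print(np.max(np.abs(rope_matrix(x, 7) - rope_complex(x, 7))))  # on the order of 1e-16
```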
The two implementations produce identical results (up to floating-point precision). The complex formulation is often preferred in practice because it's more concise and can leverage optimized complex number operations.
Visualizing RoPE Patterns
With both the matrix and complex formulations implemented, let's build intuition by visualizing how RoPE actually transforms embeddings. Understanding these patterns helps explain why RoPE is so effective at encoding position information.
We'll plot the rotation patterns for each dimension pair, tracking how a unit vector moves as position increases:
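One way such a plot could be produced (assuming $d = 8$, 50 positions, and Matplotlib; the original figure's exact settings aren't shown):

```python
import numpy as np
import matplotlib.pyplot as plt

d, n_positions, base = 8, 50, 10000.0
theta = base ** (-2.0 * np.arange(d // 2) / d)      # one frequency per pair
positions = np.arange(n_positions)

fig, axes = plt.subplots(1, d // 2, figsize=(12, 3))
for i, ax in enumerate(axes):
    angles = positions * theta[i]                   # rotation angle at each position
    # Track where a unit vector starting at (1, 0) lands at each position.
    ax.scatter(np.cos(angles), np.sin(angles), c=positions, cmap="viridis", s=15)
    ax.set_title(f"Pair {i} (theta={theta[i]:.4f})")
    ax.set_aspect("equal")
plt.tight_layout()
plt.show()
```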
The color gradient (dark to light) shows increasing position. Pair 0 makes multiple full rotations within 50 positions, while Pair 3 barely completes an arc. This multi-frequency structure is what gives RoPE its expressiveness.
Relative Position Through Dot Products
We've derived mathematically that RoPE should make attention scores depend only on relative position. Now let's verify this core property empirically and see what it looks like in practice.
The test is straightforward: create identical query and key vectors at different positions, apply RoPE, and compute attention scores. If RoPE works as intended, the scores should form a Toeplitz matrix, where each diagonal contains identical values. This structure proves that scores depend only on relative position (the difference between query and key positions), not on absolute positions.
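A minimal sketch of this test (the dimension $d = 8$, the sequence length, and the random content vector are arbitrary choices):

```python
import numpy as np

def rope_rotate(x, m, base=10000.0):
    """Apply RoPE to a single vector x at position m."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

d, seq_len = 8, 6
x = np.random.default_rng(0).normal(size=d)           # identical query and key content
scores = np.array([[rope_rotate(x, m) @ rope_rotate(x, n) for n in range(seq_len)]
                   for m in range(seq_len)])
print(np.round(scores, 4))                             # equal values along each diagonal
```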
The Toeplitz structure is clear: all entries along each diagonal are identical. Position (0,0), (1,1), (2,2) all have the same score (relative distance 0). Position (0,1), (1,2), (2,3) all match (relative distance 1). This is the relative position property in action.
Let's verify numerically by extracting scores for each relative distance:
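A sketch of that check, grouping the score matrix by diagonal (same kind of arbitrary dimension and content vector as above, repeated so the snippet runs on its own):

```python
import numpy as np

def rope_rotate(x, m, base=10000.0):
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

d, seq_len = 8, 12
x = np.random.default_rng(0).normal(size=d)
scores = np.array([[rope_rotate(x, m) @ rope_rotate(x, n) for n in range(seq_len)]
                   for m in range(seq_len)])

for dist in range(5):                            # relative distances 0..4
    diag = np.diagonal(scores, offset=dist)      # all scores at this relative distance
    print(f"distance {dist}: mean={diag.mean():.6f}, std={diag.std():.2e}")
```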
All standard deviations are effectively zero (within floating-point precision), confirming that scores at each relative distance are identical.
How Dot Products Vary with Relative Distance
The Toeplitz structure tells us scores depend only on relative position, but how do they vary? Let's trace how the dot product changes as we increase the relative distance between query and key:
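A sketch of that trace (the dimension $d = 64$ and the random query and key content are arbitrary; Matplotlib is assumed for the plot):

```python
import numpy as np
import matplotlib.pyplot as plt

def rope_rotate(x, m, base=10000.0):
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

d = 64
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

# Fix the query at position 0 and move the key further and further away.
distances = np.arange(100)
dots = [rope_rotate(q, 0) @ rope_rotate(k, dist) for dist in distances]

plt.plot(distances, dots)
plt.xlabel("relative distance")
plt.ylabel("dot product")
plt.title("RoPE dot product vs. relative distance")
plt.show()
```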
The oscillating pattern is characteristic of RoPE. The multiple frequencies create a complex interference pattern where some relative distances produce higher scores than others. This structure allows the model to learn position-dependent attention patterns during training.
Efficient Implementation
The implementations we've shown so far process one token at a time, which is clear for understanding but inefficient in practice. Modern deep learning frameworks excel at vectorized operations, so we want to apply RoPE to all tokens in a sequence simultaneously.
The key insight is that we can precompute all rotation angles as a matrix and apply them through broadcasting. Instead of looping over positions and dimension pairs, we compute everything in parallel:
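A vectorized sketch in NumPy (the function name `rope_batch` and the `(seq_len, d)` shape convention are choices made for this sketch):

```python
import numpy as np

def rope_batch(x, base=10000.0):
    """Apply RoPE to a whole sequence at once. x has shape (seq_len, d), d even."""
    seq_len, d = x.shape
    theta = base ** (-2.0 * np.arange(d // 2) / d)    # (d/2,) per-pair frequencies
    angles = np.arange(seq_len)[:, None] * theta       # (seq_len, d/2) all angles at once
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                    # split every vector into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```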
Let's verify this batch implementation matches the per-token version:
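A sketch of that comparison, running a per-token loop against the vectorized version (random test data; both helpers are the sketched versions from above, repeated here so the snippet runs on its own):

```python
import numpy as np

def rope_rotate(x, m, base=10000.0):
    """Per-token RoPE (one vector at a time)."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def rope_batch(x, base=10000.0):
    """Vectorized RoPE over a whole (seq_len, d) sequence."""
    seq_len, d = x.shape
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angles = np.arange(seq_len)[:, None] * theta
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

x = np.random.default_rng(0).normal(size=(32, 64))
per_token = np.stack([rope_rotate(x[m], m) for m in range(x.shape[0])])
print(np.max(np.abs(per_token - rope_batch(x))))      # maximum absolute difference
```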
The maximum difference between implementations is on the order of $10^{-16}$, which is essentially machine epsilon for 64-bit floating point. This confirms that the batch implementation produces numerically identical results while being much more efficient through vectorization.
Integration with Self-Attention
With efficient RoPE implementation in hand, let's see how it fits into a complete self-attention layer. The integration is remarkably clean: RoPE slots in between the QKV projections and the attention computation, requiring no architectural changes to the transformer.
Here's the complete flow:
1. Project input embeddings to Q, K, V using learned weight matrices
2. Apply RoPE to Q and K (but not to V)
3. Compute scaled dot-product attention as usual
4. Return the attention output
The critical detail is step 2: we rotate queries and keys but leave values untouched. Let's implement this:
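A minimal sketch of such a layer in NumPy with random weights (the shapes `seq_len = 8`, `d_model = 16`, and a head dimension of 8 are chosen to match the output described below; a real implementation would use trained weights and batched tensors):

```python
import numpy as np

def rope_batch(x, base=10000.0):
    """Vectorized RoPE over a (seq_len, d) array."""
    seq_len, d = x.shape
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angles = np.arange(seq_len)[:, None] * theta
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def rope_attention(x, w_q, w_k, w_v):
    """Self-attention with RoPE applied to queries and keys only."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # 1. project to Q, K, V
    q, k = rope_batch(q), rope_batch(k)             # 2. rotate Q and K (not V)
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)              # 3. scaled dot-product attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v, weights                     # 4. attention output

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
out, weights = rope_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape, weights.sum(axis=-1))   # (8, 8), (8, 8), rows sum to 1
```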
Note that RoPE is applied only to queries and keys, not to values. This is because:
- Queries and keys determine attention patterns: The dot product between them computes compatibility. RoPE makes this compatibility position-aware.
- Values carry content: They should not be position-encoded because the content itself doesn't depend on position, only how much weight it receives.
The output confirms correct behavior: input of shape (8, 16) produces output of shape (8, 8) after projection to the query/key/value dimension. The attention weights form an 8×8 matrix where each row sums to exactly 1.0, confirming proper softmax normalization. The RoPE transformations are applied internally to queries and keys, making attention position-aware without changing the external interface.
RoPE Frequency Patterns
The choice of frequencies is crucial to RoPE's effectiveness. Let's visualize how the multi-frequency structure creates unique position signatures:
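One way to produce such heatmaps (assuming $d = 16$, 64 positions, and Matplotlib; the exact settings of the original figure aren't shown):

```python
import numpy as np
import matplotlib.pyplot as plt

d, n_positions, base = 16, 64, 10000.0
theta = base ** (-2.0 * np.arange(d // 2) / d)
angles = np.arange(n_positions)[None, :] * theta[:, None]   # (pairs, positions)

fig, axes = plt.subplots(1, 2, figsize=(10, 3), sharey=True)
for ax, values, name in zip(axes, [np.cos(angles), np.sin(angles)], ["cos", "sin"]):
    ax.imshow(values, aspect="auto", cmap="RdBu")           # one row per dimension pair
    ax.set_title(f"{name}(m * theta_i)")
    ax.set_xlabel("position m")
axes[0].set_ylabel("dimension pair i")
plt.tight_layout()
plt.show()
```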
The heatmaps reveal the multi-scale nature of RoPE. Low-index dimension pairs (top rows) cycle rapidly, distinguishing nearby positions. High-index pairs (bottom rows) change slowly, providing a coarse position signal. This structure resembles sinusoidal position encodings since both use similar frequency patterns. The key difference is that sinusoidal encodings add position information to embeddings, while RoPE rotates the embeddings themselves.
Why RoPE Works So Well
Several properties make RoPE particularly effective:
Relative position by design. Unlike additive position encodings that must learn to extract relative position, RoPE provides it automatically through the geometry of rotations. The model doesn't need to learn that positions 5 and 7 are "2 apart"; the attention scores inherently reflect this.
Length generalization. Because RoPE encodes relative rather than absolute position, models can often generalize to longer sequences than seen during training. Position 1000 rotating relative to position 1002 works the same as position 0 rotating relative to position 2.
Computational efficiency. RoPE requires no additional parameters beyond the pre-computed frequencies. The rotation can be implemented as element-wise operations, making it very fast.
Compatibility with linear attention. Some efficient attention approximations rely on the inner product structure of attention. RoPE preserves this structure (rotation is a linear transformation), making it compatible with these methods.
The result is identical scores at identical relative distances, regardless of absolute position. This is why RoPE-based models can extrapolate to longer contexts more gracefully than models with absolute position encodings.
Let's visualize this length generalization property more comprehensively by testing many absolute position pairs:
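A sketch of that test, sweeping many absolute starting positions for a few fixed relative distances (the dimension, offsets, and distances are arbitrary choices):

```python
import numpy as np

def rope_rotate(x, m, base=10000.0):
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

d = 64
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

offsets = np.arange(0, 2000, 50)                     # many different absolute starting positions
for dist in [1, 4, 16, 64]:                          # a few fixed relative distances
    scores = [rope_rotate(q, off) @ rope_rotate(k, off + dist) for off in offsets]
    print(f"relative distance {dist:3d}: std across absolute positions = {np.std(scores):.2e}")
```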
The nearly zero standard deviation confirms that RoPE perfectly preserves relative position information regardless of where in the sequence we look. This is the mathematical foundation for length generalization in RoPE-based models.
Comparing RoPE to Other Position Encodings
Let's position RoPE within the broader landscape of position encodings:
| Property | Sinusoidal | Learned | Relative (Shaw) | RoPE |
|---|---|---|---|---|
| Parameters | 0 | $L_{\max} \times d$ | $O(L \times d)$ | 0 |
| Position type | Absolute | Absolute | Relative | Relative |
| Attention modified | No | No | Yes | No (uses rotation) |
| Length extrapolation | Moderate | Poor | Moderate | Good |
| Computational cost | Low | Low | Higher | Low |
RoPE combines the parameter efficiency of sinusoidal encodings with the relative position benefits of learned relative encodings, without the architectural complexity. This balance explains its widespread adoption.
To make this comparison concrete, let's visualize how position similarity decays with distance for different encoding schemes:
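A sketch of one way to draw that comparison, measuring sinusoidal-encoding dot products and RoPE rotated-vector dot products as a function of distance (the interleaved sin/cos convention for the sinusoidal encoding and the fixed content vector used for the RoPE curve are assumptions of this sketch):

```python
import numpy as np
import matplotlib.pyplot as plt

d, base, max_dist = 64, 10000.0, 128
theta = base ** (-2.0 * np.arange(d // 2) / d)

def sinusoidal(pos):
    """Sinusoidal position encoding vector for one position (interleaved sin/cos)."""
    enc = np.empty(d)
    enc[0::2], enc[1::2] = np.sin(pos * theta), np.cos(pos * theta)
    return enc

def rope_rotate(x, m):
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

x = np.ones(d) / np.sqrt(d)                          # fixed content vector for the RoPE curve
dists = np.arange(max_dist)
sin_sim = [sinusoidal(0) @ sinusoidal(k) for k in dists]
rope_sim = [rope_rotate(x, 0) @ rope_rotate(x, k) for k in dists]

plt.plot(dists, sin_sim, label="sinusoidal encoding dot product")
plt.plot(dists, rope_sim, label="RoPE rotated-vector dot product")
plt.xlabel("relative distance")
plt.legend()
plt.show()
```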
Both methods show oscillating similarity patterns due to their multi-frequency structure. The key difference: sinusoidal encodings add this pattern to the input, while RoPE modulates the attention computation directly through rotation.
Limitations and Considerations
Despite its elegance, RoPE has limitations worth understanding.
Frequency base sensitivity. The base (typically 10000) determines the frequency range. Models trained with one base may not transfer well to contexts requiring different frequency patterns. Recent work like YaRN and NTK-aware scaling addresses this by adjusting frequencies for longer contexts.
High-frequency aliasing. At very long positions, high-frequency dimension pairs may "wrap around" multiple times, potentially creating aliasing where distant positions appear similar. In practice, this is rarely problematic within reasonable context lengths, but it's a theoretical limitation.
Dimension divisibility. RoPE requires even-dimensional queries and keys since it operates on pairs. This is a minor constraint but must be considered in architecture design.
Training distribution effects. While RoPE theoretically supports any position, the model's other components (feed-forward networks, layer norms) are trained on a specific position distribution. Significant extrapolation may still degrade performance due to these other components, not RoPE itself.
These limitations are generally manageable. The community has developed extensions like Position Interpolation and NTK-aware RoPE that modify the frequency computation for better long-context performance. The core rotation mechanism remains unchanged.
Key Parameters
When implementing RoPE in your models, these parameters control its behavior:
- `d_model` (embedding dimension): The total dimension of query and key vectors. Must be even since RoPE operates on dimension pairs. Common values range from 64 to 4096, typically matching the model's hidden dimension divided by the number of attention heads.
- `base` (frequency base): Controls the range of rotation frequencies. The default value of 10000 provides a good balance between local and global position sensitivity. Larger values (e.g., 100000) extend the effective context length by slowing all rotations; smaller values make the encoding more sensitive to nearby positions.
- `theta_i` (per-dimension frequency): Computed as $\theta_i = \text{base}^{-2i/d}$ for dimension pair $i$. Not typically set directly, but understanding it helps diagnose behavior: the first pair rotates one radian per position, while the last pair completes a full rotation only after approximately $2\pi \cdot \text{base}$ positions.
- Position offset: Some implementations support a starting position offset for key-value caching during inference. This allows continuing generation from a specific position without recomputing RoPE for all previous positions.
Summary
Rotary Position Embedding encodes position through geometric rotation rather than additive signals. This approach elegantly captures relative position through the natural properties of dot products between rotated vectors.
Key takeaways:
- Rotation as position encoding. Each position corresponds to a rotation angle. Rotating query and key vectors embeds position information directly into their geometric relationship.
- Relative position emerges. When a rotated query at position $m$ attends to a rotated key at position $n$, the dot product depends only on $m - n$. Absolute positions cancel out through rotation mathematics.
- Multi-frequency structure. Different dimension pairs rotate at different frequencies, creating a rich position representation. High frequencies capture local position differences; low frequencies capture global structure.
- No additional parameters. Like sinusoidal encodings, RoPE uses deterministic frequencies based on dimension index. The only computation is the rotation itself.
- Applied to Q and K only. Values are not rotated because they carry content, not position information. Rotation affects attention patterns, not the content that flows through them.
- Good extrapolation. Because relative position is baked into the mechanism, models can often generalize to longer sequences than seen during training, though other model components may still limit this.
RoPE has become the dominant position encoding in modern large language models. Its combination of theoretical elegance, computational efficiency, and practical effectiveness makes it a foundational technique for transformer architectures. In the next chapter, we'll explore ALiBi, an alternative approach that adds relative position bias directly to attention scores.