Learn how Position Interpolation extends transformer context windows by scaling position indices to stay within training distributions, enabling longer sequences with minimal fine-tuning.

This article is part of the free-to-read Language AI Handbook
Position Interpolation
Modern language models face a fundamental tension: training on long sequences is expensive, but real-world applications demand them. A model trained on 2,048 tokens might encounter documents with 8,000 tokens, conversations spanning 16,000 tokens, or codebases requiring even longer context. When RoPE-based models try to process positions beyond their training range, performance degrades rapidly. The rotation angles reach values the model has never seen, producing attention patterns that bear no resemblance to what was learned.
Position Interpolation, introduced by Chen et al. in 2023, offers an elegant solution: instead of extrapolating to unseen positions, we interpolate within the familiar range. Rather than assigning position 4,096 a rotation angle the model has never encountered, we scale all positions down so they fit within the original training range. Position 4,096 becomes position 2,048 after scaling, and the model sees familiar rotation angles even as it processes longer sequences.
This chapter develops Position Interpolation from first principles. We'll start by understanding why RoPE fails at extrapolation, derive the interpolation formula, implement it in code, and explore its limitations. By the end, you'll understand both the elegance of this approach and why subsequent methods like NTK-aware scaling were developed to address its shortcomings.
The Extrapolation Problem
RoPE encodes position through rotation. At position $m$, each dimension pair of the query and key vectors is rotated by an angle proportional to $m$. The key idea is that each dimension pair rotates at a different frequency, creating a unique positional signature. The rotation angle for dimension pair $i$ at position $m$ is computed as:

$$\theta_{m,i} = m \cdot \omega_i, \qquad \omega_i = \text{base}^{-2i/d}$$

where:
- $\theta_{m,i}$: the rotation angle (in radians) applied to dimension pair $i$ at sequence position $m$
- $m$: the position index in the sequence ($0, 1, 2, \ldots, L - 1$ for a sequence of length $L$)
- $i$: the dimension pair index ($0, 1, 2, \ldots, d/2 - 1$)
- $d$: the total embedding dimension (typically 64, 128, or larger)
- $\text{base}$: a constant that controls the range of frequencies (typically 10000)
- $\omega_i$: the base frequency for dimension pair $i$, which decreases exponentially as $i$ increases
The exponential decay in the base frequency means that dimension pair 0 rotates fastest (with $\omega_0 = \text{base}^0 = 1$ radian per position), while higher-indexed pairs rotate progressively slower. This creates a multi-scale representation where different dimension pairs capture positional information at different granularities.
During training on sequences of length $L_{\text{train}}$, the model sees positions $0, 1, \ldots, L_{\text{train}} - 1$. The rotation angles range from $0$ to $(L_{\text{train}} - 1) \cdot \omega_i$ for each dimension pair $i$.
What happens when we present the model with position $m \geq L_{\text{train}}$? The rotation angle $m \cdot \omega_i$ exceeds anything seen during training. For the fastest-rotating dimensions (where $\omega_i$ is large), these new angles produce embedding rotations the attention mechanism has never learned to interpret.
Let's visualize this problem by examining the rotation angles at different positions.
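The following is a minimal NumPy sketch of that comparison. The function name `rope_angles`, the 64-dimensional embedding, and the 2,048/8,192 lengths are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def rope_angles(positions, d_model=64, base=10000.0):
    """RoPE rotation angles: theta[m, i] = m * base^(-2i / d_model)."""
    pair_indices = np.arange(d_model // 2)           # i = 0, 1, ..., d/2 - 1
    freqs = base ** (-2.0 * pair_indices / d_model)  # omega_i, decays with i
    return np.outer(positions, freqs)                # (len(positions), d/2)

train_len, extended_len = 2048, 8192
angles_train = rope_angles(np.arange(train_len))
angles_ext = rope_angles(np.arange(extended_len))

# Fastest pair (i = 0) rotates 1 radian per position:
print(f"max training angle, pair 0: {angles_train[-1, 0]:.0f} rad")  # 2047
print(f"max extended angle, pair 0: {angles_ext[-1, 0]:.0f} rad")    # 8191
# The slowest pair advances only about a radian over the whole extended range:
print(f"max extended angle, last pair: {angles_ext[-1, -1]:.2f} rad")

# Plot angle growth for a fast, a medium, and a slow dimension pair.
for i in [0, 8, 16]:
    plt.plot(np.arange(extended_len), angles_ext[:, i], label=f"dimension pair {i}")
plt.axvline(train_len - 1, color="gray", ls="--", label="training maximum")
plt.xlabel("Position")
plt.ylabel("Rotation angle (radians)")
plt.legend()
plt.show()
```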
For the fastest-rotating dimension pair (pair 0), the model sees angles up to about 2,047 radians during training. Extending to 4x the context pushes this to over 8,000 radians. While both values wrap around the unit circle many times, the relative patterns between dimension pairs change in ways the model hasn't learned.
The real problem becomes clear when we examine what happens to attention patterns. During training, the model learns that certain rotation angle combinations correspond to meaningful relative positions. When extrapolated angles produce unfamiliar combinations, the learned attention patterns break down.
The plot reveals the core issue. Fast-rotating dimensions (low indices) experience dramatic angle increases during extrapolation. The angle at position 8191 for dimension pair 0 is four times larger than at position 2047. Meanwhile, slow-rotating dimensions (high indices) barely change. This asymmetric behavior disrupts the carefully balanced patterns the model learned during training.
The Interpolation Insight
We've seen that extrapolation fails because the model encounters rotation angles outside its training distribution. But what if we could ensure that every position, no matter how far into the extended sequence, produces angles the model has already seen? This is the core intuition behind Position Interpolation.
From Intuition to Formulation
Think of the training range as a ruler. During training, the model learned to interpret positions from $0$ to $L_{\text{train}} - 1$, each marking a specific location on this ruler. Now we need to fit a longer sequence onto the same ruler. The solution? We don't extend the ruler; we compress the new positions to fit within the existing marks.
If we want to process a sequence of length $L_{\text{target}}$ where $L_{\text{target}} > L_{\text{train}}$, we define a scale factor $s$ that maps the extended range back to the familiar one:

$$s = \frac{L_{\text{train}}}{L_{\text{target}}}$$

where:
- $s$: the scale factor, always between 0 and 1 when extending context
- $L_{\text{train}}$: the maximum sequence length seen during training (e.g., 2048)
- $L_{\text{target}}$: the target extended sequence length (e.g., 8192)
This scale factor answers the question: "How much do we need to shrink the extended positions to fit them within the training range?" For a 4x extension (2K to 8K), $s = 2048 / 8192 = 0.25$, meaning every position is compressed to one-quarter of its original value.
Position Mapping in Action
Let's trace through concrete examples to see how this mapping works. For a model trained on 2,048 positions processing an 8,192-token sequence:
| Actual Position $m$ | Scaled Position ($s \cdot m$) | Interpretation |
|---|---|---|
| 0 | 0 | Start of sequence, unchanged |
| 2,048 | 512 | Maps to quarter of training range |
| 4,096 | 1,024 | Maps to midpoint of training range |
| 8,191 | 2,047.75 | Maps to end of training range |
Every position in the extended sequence, no matter how large, maps to a value the model encountered during training. Position 8,191 in the extended sequence produces the same rotation pattern as position 2,047 would in the original system.
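This mapping is easy to reproduce in a few lines (a quick sketch; the variable names are our own):

```python
import numpy as np

train_len, target_len = 2048, 8192
scale = train_len / target_len  # 0.25 for a 4x extension

actual_positions = np.array([0, 2048, 4096, 8191])
print(actual_positions * scale)  # [0., 512., 1024., 2047.75]
```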
Position Interpolation scales position indices by the factor $s = L_{\text{train}} / L_{\text{target}}$ before computing RoPE rotation angles. This keeps all rotation angles within the range seen during training, trading extrapolation for interpolation.
Deriving the Modified RoPE Formula
Now we can formalize this intuition mathematically. The derivation proceeds in three steps, each building naturally on the previous.
Step 1: Recall the original RoPE rotation angle.
In standard RoPE, the rotation angle for dimension pair $i$ at position $m$ is simply the product of position and base frequency:

$$\theta_{m,i} = m \cdot \omega_i$$

where $\omega_i = \text{base}^{-2i/d}$ is the base frequency for dimension pair $i$. This formula produces angles that grow linearly with position, which is exactly the behavior that causes extrapolation to fail when $m$ exceeds the training range.
Step 2: Apply position scaling to compress the range.
Position Interpolation modifies this formula by scaling the position index before computing the rotation. Instead of using $m$ directly, we use $s \cdot m$:

$$\theta^{\text{PI}}_{m,i} = (s \cdot m) \cdot \omega_i$$

We can rearrange this expression to reveal an important insight:

$$\theta^{\text{PI}}_{m,i} = m \cdot (s \cdot \omega_i)$$

where:
- $\theta^{\text{PI}}_{m,i}$: the position-interpolated rotation angle for dimension pair $i$ at position $m$
- $m$: the actual position in the extended sequence
- $s$: the scale factor ($s = L_{\text{train}} / L_{\text{target}}$)
- $\omega_i$: the original base frequency for dimension pair $i$
The rearrangement shows two equivalent interpretations of Position Interpolation:
- Scale the position: compute angles for position $s \cdot m$ using the original frequencies $\omega_i$
- Scale the frequency: compute angles for position $m$ using the reduced frequencies $s \cdot \omega_i$
Both perspectives lead to the same result, but the second interpretation proves useful for implementation.
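A few lines of NumPy confirm the equivalence numerically (a sketch with illustrative values):

```python
import numpy as np

d_model, base, scale = 64, 10000.0, 0.25
m = 5000  # a position in the extended sequence
freqs = base ** (-2.0 * np.arange(d_model // 2) / d_model)  # omega_i

angles_scaled_position = (scale * m) * freqs   # interpretation 1: shrink the position
angles_scaled_frequency = m * (scale * freqs)  # interpretation 2: shrink the frequencies

print(np.allclose(angles_scaled_position, angles_scaled_frequency))  # True
```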
Step 3: Express as modified base frequencies.

Let's push the algebraic manipulation further to see what Position Interpolation does to the effective RoPE frequencies:

$$\theta^{\text{PI}}_{m,i} = m \cdot \omega^{\text{PI}}_i, \qquad \omega^{\text{PI}}_i = s \cdot \omega_i = s \cdot \text{base}^{-2i/d}$$

We can express the same scaling in terms of wavelengths, the number of positions a dimension pair needs for one full rotation:

$$\lambda^{\text{PI}}_i = \frac{2\pi}{\omega^{\text{PI}}_i} = \frac{1}{s} \cdot \frac{2\pi}{\omega_i} = \frac{\lambda_i}{s}$$

This reveals that Position Interpolation is mathematically equivalent to running standard RoPE with every frequency reduced, and every wavelength stretched, by the same factor. For extending from 2K to 8K context ($s = 0.25$), every wavelength grows by a factor of 4. Why do slower rotations help? Smaller frequencies mean each position step advances every rotation by less, so more positions fit into the same angular range before exceeding the training maximum.
Connecting Math to Mechanism
The mathematical derivation reveals something elegant: Position Interpolation doesn't change the fundamental structure of RoPE. It doesn't add new components or modify the attention mechanism. It simply asks: "What if every dimension pair had rotated more slowly from the start?" The answer is that lower frequencies would have allowed longer sequences all along, but at the cost of reduced angular resolution between nearby positions.
This insight also explains why fine-tuning is necessary. The model learned to associate certain rotation patterns with certain relative distances. When we compress positions, those associations break. A distance of 100 positions now produces the same rotation pattern as 25 positions would have during training. Fine-tuning recalibrates these associations.
Implementing Position Interpolation
With the mathematical foundation in place, let's translate Position Interpolation into code. The implementation is remarkably simple: we compute the scale factor, multiply positions by that factor, and then proceed with standard RoPE angle computation.
Computing Interpolated Angles
The core function takes positions along with both the training and target lengths. It computes the scale factor internally, scales all positions, and returns angles that stay within the training range.
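A minimal NumPy sketch of such a function might look like this (the name `interpolated_rope_angles` and the default arguments are illustrative assumptions):

```python
import numpy as np

def interpolated_rope_angles(positions, train_length, target_length,
                             d_model=64, base=10000.0):
    """RoPE angles with Position Interpolation.

    Positions are scaled by train_length / target_length before the
    standard RoPE angle computation, keeping every angle within the
    range seen during training.
    """
    assert target_length >= train_length, "interpolation only compresses"
    scale = train_length / target_length
    freqs = base ** (-2.0 * np.arange(d_model // 2) / d_model)  # omega_i
    scaled_positions = np.asarray(positions, dtype=np.float64) * scale
    return np.outer(scaled_positions, freqs)  # (num_positions, d/2)

angles = interpolated_rope_angles(range(8192), train_length=2048, target_length=8192)
print(angles.max())  # ~2047.75, inside the 0-2048 training window
```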
Verifying the Angle Bounds
The critical test: do the interpolated angles at the maximum extended position match the training maximum? Let's compare the angles at position 8191 across three methods.
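Here's a sketch of that comparison (illustrative values: 64 dimensions, base 10000):

```python
import numpy as np

d_model, base, scale = 64, 10000.0, 2048 / 8192
freqs = base ** (-2.0 * np.arange(d_model // 2) / d_model)

original_at_8191 = 8191 * freqs              # standard RoPE, extrapolated
interpolated_at_8191 = 8191 * scale * freqs  # Position Interpolation
training_max = 2047 * freqs                  # largest angles seen in training

for name, angles in [("Original RoPE @ 8191 ", original_at_8191),
                     ("Interpolated @ 8191  ", interpolated_at_8191),
                     ("Training max @ 2047  ", training_max)]:
    print(f"{name}: pair 0 = {angles[0]:8.2f} rad, last pair = {angles[-1]:.4f} rad")
# Interpolated angles land essentially on top of the training maximum
# (8191 * 0.25 = 2047.75 vs. 2047), while original RoPE overshoots by ~4x.
```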
The results confirm our mathematical derivation. With Position Interpolation, the maximum angles at position 8191 match the training maximum at position 2047 across all dimension pairs. Original RoPE would produce angles 4x larger at the extended position, but Position Interpolation compresses them back into the familiar range.
Let's visualize this comparison as a heatmap to see the pattern across all dimension pairs simultaneously.
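One way to sketch such a heatmap with matplotlib (the layout details are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

d_model, base, scale = 64, 10000.0, 0.25
freqs = base ** (-2.0 * np.arange(d_model // 2) / d_model)

rows = np.stack([8191 * freqs,          # original RoPE at position 8191
                 8191 * scale * freqs,  # Position Interpolation at 8191
                 2047 * freqs])         # training maximum at 2047

fig, ax = plt.subplots(figsize=(10, 2.5))
im = ax.imshow(rows, aspect="auto", cmap="hot")
ax.set_yticks([0, 1, 2])
ax.set_yticklabels(["Original @ 8191", "Interpolated @ 8191", "Training max @ 2047"])
ax.set_xlabel("Dimension pair index")
fig.colorbar(im, ax=ax, label="Rotation angle (radians)")
plt.tight_layout()
plt.show()
```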
The heatmap makes the difference visually striking. The top row shows the intense "heat" of original RoPE's extrapolated angles, especially in the fast-rotating dimensions on the left. Position Interpolation (middle row) shows the same pattern as the training maximum (bottom row), confirming that we've successfully mapped extended positions back into familiar territory.
Visualizing the Position Mapping
Let's visualize this compression graphically. The plot below shows how actual positions in the extended sequence map to effective positions for RoPE computation.
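A minimal matplotlib sketch of that mapping (assuming a 2,048 to 8,192 extension):

```python
import numpy as np
import matplotlib.pyplot as plt

train_len, target_len = 2048, 8192
positions = np.arange(target_len)

plt.plot(positions, positions, "--", label="Identity (standard RoPE)")
plt.plot(positions, positions * train_len / target_len,
         label="Position Interpolation (scale = 0.25)")
plt.axhline(train_len - 1, color="gray", lw=0.8, label="Training maximum")
plt.xlabel("Actual position in extended sequence")
plt.ylabel("Effective position for RoPE")
plt.legend()
plt.show()
```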
Interpolation vs Extrapolation: A Closer Look
Why does interpolation work better than extrapolation? The answer lies in how neural networks generalize. During training, the model learns attention patterns for rotation angles in a specific range. These learned patterns form a continuous function over that range.
When we extrapolate, we ask the model to generalize this function to inputs it has never seen. Neural networks are notoriously poor at extrapolation; they often produce arbitrary outputs outside their training distribution. When we interpolate, we stay within the training distribution but query it at finer-grained positions. The model can leverage its learned continuous representations to handle intermediate values.
Consider an analogy. Imagine training someone to recognize temperatures between 0°C and 100°C. If you then ask them about 200°C, they must extrapolate beyond their experience, and their predictions become unreliable. But if you ask about 37.5°C when they've only seen integer values, they can interpolate from their knowledge of 37°C and 38°C.
Let's quantify this by examining how the angle differences (which determine attention scores) change under interpolation.
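A two-line computation makes the difference concrete (a sketch for dimension pair 0):

```python
import numpy as np

d_model, base, scale = 64, 10000.0, 0.25
freqs = base ** (-2.0 * np.arange(d_model // 2) / d_model)

step_standard = 1 * freqs[0]          # rotation per position step, standard RoPE
step_interpolated = scale * freqs[0]  # rotation per step under 4x interpolation

print(f"Per-step rotation, standard RoPE:    {step_standard:.2f} rad")   # 1.00
print(f"Per-step rotation, 4x interpolation: {step_interpolated:.2f} rad")  # 0.25
```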
The relative angle per position step shrinks with interpolation. In standard RoPE, moving one position rotates dimension pair 0 by 1 radian. With 4x interpolation, the same step rotates by only 0.25 radians. This compression is the trade-off at the heart of Position Interpolation: we maintain familiar absolute angles but reduce the angular resolution between nearby positions.
Fine-tuning for Extended Context
Position Interpolation alone doesn't magically enable long context. While the rotation angles stay within the training distribution, the model still encounters unfamiliar situations. Two tokens that were 100 positions apart during training now produce the same relative rotation as tokens 400 positions apart in the extended sequence. The model must learn to interpret these compressed position signals.
This is where fine-tuning comes in. After applying Position Interpolation, models typically undergo a short fine-tuning phase on long-context data. The good news: this fine-tuning is remarkably efficient. Chen et al. found that only about 1,000 fine-tuning steps were needed to adapt a 2K-context LLaMA model to 8K context, compared to the billions of tokens used in original pretraining.
Position Interpolation requires fine-tuning to achieve good performance. Without fine-tuning, the model may produce coherent outputs at the new context length, but perplexity typically increases. The fine-tuning phase teaches the model to interpret the compressed position signals correctly.
Let's simulate what fine-tuning might need to correct by examining how attention patterns change under interpolation.
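Because RoPE attention depends only on relative rotations, the effect can be checked directly. A sketch (the distances follow the example discussed below):

```python
import numpy as np
import matplotlib.pyplot as plt

d_model, base, scale = 64, 10000.0, 0.25
pairs = np.arange(d_model // 2)
freqs = base ** (-2.0 * pairs / d_model)

# RoPE attention depends on the *relative* rotation between two tokens.
delta_pi = scale * 4096 * freqs  # tokens 4,096 apart under interpolation
delta_std = 1024 * freqs         # tokens 1,024 apart under standard RoPE

fig, axes = plt.subplots(1, 2, figsize=(10, 3.5), sharey=True)
axes[0].stem(pairs, delta_pi)
axes[0].set_title("Interpolated, distance 4096")
axes[1].stem(pairs, delta_std)
axes[1].set_title("Standard RoPE, distance 1024")
for ax in axes:
    ax.set_xlabel("Dimension pair index")
axes[0].set_ylabel("Relative rotation (radians)")
plt.tight_layout()
plt.show()

print(np.allclose(delta_pi, delta_std))  # True: identical patterns
```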
The plots illustrate the core transformation. With Position Interpolation, a distance of 4,096 positions produces the same rotation angle pattern as a distance of 1,024 positions in standard RoPE. Fine-tuning teaches the model that this compressed pattern now represents the longer distance.
A Complete Implementation
Let's put everything together into a complete Position Interpolation implementation that can be applied to RoPE.
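Below is one self-contained sketch of such an implementation; the class name `PositionInterpolationRoPE` and its interface are illustrative, not a canonical API:

```python
import numpy as np

class PositionInterpolationRoPE:
    """RoPE with Position Interpolation: positions are scaled by
    train_length / target_length before rotation angles are computed."""

    def __init__(self, d_model, train_length, target_length, base=10000.0):
        assert d_model % 2 == 0, "RoPE operates on dimension pairs"
        self.scale = train_length / target_length
        self.freqs = base ** (-2.0 * np.arange(d_model // 2) / d_model)

    def rotate(self, x, positions):
        """Rotate vectors x of shape (num_positions, d_model) by their
        (scaled) positions, pair by pair."""
        x = np.asarray(x, dtype=np.float64)
        angles = np.outer(np.asarray(positions) * self.scale, self.freqs)
        cos, sin = np.cos(angles), np.sin(angles)
        x_even, x_odd = x[:, 0::2], x[:, 1::2]     # split into dimension pairs
        out = np.empty_like(x)
        out[:, 0::2] = x_even * cos - x_odd * sin  # standard 2D rotation
        out[:, 1::2] = x_even * sin + x_odd * cos
        return out
```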
Let's verify that our implementation produces the expected behavior.
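A sketch of that check, reusing the class above (with scale 0.25, position 6,000 should match standard position 1,500):

```python
import numpy as np

# Assumes the PositionInterpolationRoPE class from the previous listing.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64))

pi_rope = PositionInterpolationRoPE(d_model=64, train_length=2048, target_length=8192)
standard_rope = PositionInterpolationRoPE(d_model=64, train_length=2048,
                                          target_length=2048)  # scale = 1

rotated_pi = pi_rope.rotate(x, positions=[6000])         # interpolated position 6000
rotated_std = standard_rope.rotate(x, positions=[1500])  # standard position 1500

print(np.allclose(rotated_pi, rotated_std))  # True: 6000 * 0.25 == 1500
```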
The interpolated rotation matrix at position 6,000 matches the standard rotation matrix at position 1,500, confirming that our implementation correctly maps extended positions back to the training range.
Limitations of Position Interpolation
Position Interpolation enables longer context, but it comes with trade-offs. Understanding these limitations helps explain why subsequent methods like NTK-aware scaling were developed.
Reduced positional resolution. The most significant limitation is reduced angular resolution between nearby positions. When we compress 8,192 positions into the range of 2,048, each position step produces 1/4 the rotation of the original. Two tokens that are adjacent in the extended sequence differ by the same rotation as tokens 0.25 positions apart in the original. This compression can make it harder for the model to distinguish nearby positions, potentially affecting tasks requiring fine-grained positional awareness.
Non-uniform frequency scaling. Position Interpolation applies the same scale factor to all frequency components. However, different frequencies may require different treatment. High-frequency components (fast-rotating dimensions) are most affected by the reduced resolution because they distinguish local positions. Low-frequency components (slow-rotating dimensions) were already coarse and are less impacted. This uniform scaling is suboptimal, which motivated the development of NTK-aware scaling that treats frequencies differently.
Fine-tuning requirement. While Position Interpolation requires far less fine-tuning than training from scratch, it still requires some adaptation. This limits scenarios where you need to extend context on the fly without access to fine-tuning data or compute.
Perplexity increase. Even after fine-tuning, models with Position Interpolation often show slightly higher perplexity compared to models trained directly on longer sequences. The compression introduces information loss that fine-tuning can mitigate but not fully eliminate.
Let's quantify how different dimension pairs are affected by the uniform scaling.
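A sketch of that analysis in terms of wavelengths, the number of positions per full rotation (illustrative 64-dimensional setup):

```python
import numpy as np

d_model, base, scale = 64, 10000.0, 0.25
freqs = base ** (-2.0 * np.arange(d_model // 2) / d_model)

wavelengths = 2 * np.pi / freqs       # positions per full rotation, original
wavelengths_pi = wavelengths / scale  # stretched 4x under interpolation

for i in [0, 4, 8, 16, 31]:           # a few representative dimension pairs
    print(f"pair {i:2d}: {wavelengths[i]:10.1f} -> {wavelengths_pi[i]:12.1f} positions")
```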
The analysis reveals the asymmetric impact. High-frequency dimension pairs, which originally completed a rotation every 6-50 positions, now require 4x as many positions. This affects local position discrimination most severely. Low-frequency dimension pairs, already operating at wavelengths of thousands of positions, remain capable of distinguishing positions at the extended range.
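A plot along the lines described below might be sketched as follows (log-scaled y-axis, 64 dimensions assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

d_model, base, scale = 64, 10000.0, 0.25
pairs = np.arange(d_model // 2)
freqs = base ** (-2.0 * pairs / d_model)
freqs_pi = scale * freqs  # every frequency reduced by the same factor

plt.semilogy(pairs, freqs, "o", label="Original frequencies")
plt.semilogy(pairs, freqs_pi, "s", label="Interpolated (scale = 0.25)")
plt.vlines(pairs, freqs_pi, freqs, color="gray", lw=0.5)  # uniform 4x gap
plt.xlabel("Dimension pair index")
plt.ylabel("Frequency (radians per position)")
plt.legend()
plt.show()
```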
The visualization shows how Position Interpolation uniformly scales all frequencies by the same factor (4x reduction). The vertical lines connecting original (circles) to scaled (squares) values have the same proportional length across all dimension pairs. This uniform treatment is the core limitation that NTK-aware scaling addresses.
Key Parameters
When implementing Position Interpolation, the following parameters control the behavior:
- `train_length`: The maximum sequence length the model was originally trained on (e.g., 2048 for LLaMA). This defines the upper bound of positions the model has learned to interpret during pretraining.
- `target_length`: The extended context length you want to support (e.g., 8192). Must be greater than `train_length` for interpolation to apply.
- `scale`: Computed as `train_length / target_length`. This determines how much to compress positions. A scale of 0.25 means 4x context extension. Smaller scale values enable longer contexts but reduce positional resolution proportionally.
- `base`: The RoPE base constant (typically 10000). Position Interpolation leaves the base itself unchanged; instead, every frequency derived from it is multiplied by `scale`, slowing all rotations uniformly.
- `d_model`: The embedding dimension. Must be even since RoPE operates on dimension pairs. Each pair rotates at a different frequency determined by its index.
When selecting parameters, consider the trade-off between context length and positional resolution. Extending by 2x is generally safe with minimal fine-tuning. Extensions of 4x-8x work well but require more fine-tuning to recover performance. Extensions beyond 8x may significantly degrade the model's ability to distinguish nearby positions.
Summary
Position Interpolation provides an elegant solution to the context length extension problem. By scaling position indices rather than extrapolating to unseen values, it keeps all rotation angles within the training distribution. The key insights include:
- Interpolation over extrapolation. Neural networks generalize poorly outside their training distribution. By scaling positions down instead of letting them grow, Position Interpolation stays within familiar territory.
- Simple implementation. The technique requires only a scale factor applied to position indices before computing RoPE. No architectural changes are needed.
- Efficient fine-tuning. Adapting a model to extended context requires only about 1,000 fine-tuning steps, orders of magnitude less than original pretraining.
- Uniform scaling limitation. All frequency components receive the same scaling treatment, which is suboptimal. High-frequency dimensions, crucial for local position discrimination, lose the most resolution.
The limitations of Position Interpolation, particularly its uniform treatment of frequencies, motivated the development of NTK-aware scaling, which we'll explore in the next chapter. That technique applies frequency-dependent scaling, preserving high-frequency components while adjusting low-frequency ones, achieving better performance without sacrificing local positional awareness.
Reference

Chen, S., Wong, S., Chen, L., & Tian, Y. (2023). Extending Context Window of Large Language Models via Positional Interpolation. arXiv:2306.15595.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.