YaRN: Extending Context Length with Selective Interpolation and Temperature Scaling

Michael Brenndoerfer · Updated June 28, 2025 · 33 min read

Learn how YaRN extends LLM context length through wavelength-based frequency interpolation and attention temperature correction. Includes mathematical formulation and implementation.


YaRN: Yet another RoPE extensioN

Extending context length in large language models has become a central challenge. Models trained on sequences of 2,048 or 4,096 tokens struggle when asked to process documents spanning tens of thousands of positions. Position Interpolation offered one solution: scale down position indices to fit within the trained range. NTK-aware scaling improved upon this by preserving high-frequency components that capture local relationships. But both methods still leave performance on the table, particularly as the extension factor grows large.

YaRN, which stands for "Yet another RoPE extensioN," addresses limitations in both approaches. The key insight is that extending context length doesn't just require fixing position encodings. It also requires compensating for a subtle but significant change in attention score distributions. When we stretch positions across longer sequences, the entropy of attention distributions shifts in ways that degrade model behavior. YaRN tackles this through a combination of targeted frequency interpolation and an attention temperature correction.

This chapter develops YaRN from its motivation through its complete formulation. We'll see why existing methods fall short at large extension factors, derive the attention scaling mechanism, and implement YaRN step by step. By the end, you'll understand how this technique enables models to maintain quality at context lengths 16x or 32x beyond their training distribution.

The Problem with Existing Methods

Before diving into YaRN, let's understand why Position Interpolation and NTK-aware scaling leave room for improvement. Both methods successfully enable RoPE-based models to process longer sequences, but they introduce subtle distortions that accumulate as the extension factor increases.

Position Interpolation works by scaling all position indices by a factor $s$: a position $m$ becomes $m/s$. Because the rotation angle for a dimension pair is $\theta_i \cdot m$, scaling positions by $1/s$ is equivalent to scaling every frequency by $1/s$, which is how the code below implements it. This keeps all positions within the trained range, but it compresses the entire frequency spectrum uniformly. High-frequency dimensions that previously distinguished positions 1 apart now need to distinguish positions $s$ apart after compression. The model loses fine-grained position discrimination.

NTK-aware scaling addresses this by adjusting the base frequency rather than scaling positions directly. This preserves high-frequency components while stretching low-frequency ones. The approach works well for moderate extensions (2x to 4x), but at larger factors, even NTK-aware scaling begins to degrade.

In[2]:
Code
import numpy as np


def compute_rope_frequencies(d_model, base=10000):
    """Compute standard RoPE rotation frequencies."""
    num_pairs = d_model // 2
    i = np.arange(num_pairs)
    frequencies = 1.0 / (base ** (2 * i / d_model))
    return frequencies


def compute_pi_frequencies(d_model, base=10000, scale=4.0):
    """Position Interpolation: scale all positions by 1/scale."""
    frequencies = compute_rope_frequencies(d_model, base)
    return frequencies / scale


def compute_ntk_frequencies(d_model, base=10000, scale=4.0):
    """NTK-aware scaling: adjust base instead of positions.

    The new base is computed as: base * scale^(d / (d - 2))
    This increases the base, which decreases all frequencies,
    but preserves the high-frequency components better than PI.
    """
    new_base = base * scale ** (d_model / (d_model - 2))
    return compute_rope_frequencies(d_model, new_base)

Let's visualize how these methods transform the frequency spectrum:

In[3]:
Code
d_model = 64
scale = 4.0

original_freqs = compute_rope_frequencies(d_model)
pi_freqs = compute_pi_frequencies(d_model, scale=scale)
ntk_freqs = compute_ntk_frequencies(d_model, scale=scale)
Out[4]:
Visualization
Line plot comparing three frequency spectra across dimension pairs on a log scale.
Frequency spectra under different extension methods. Position Interpolation uniformly divides all frequencies by 4. NTK-aware scaling reduces frequencies more gradually, preserving high-frequency components (small indices) while stretching low-frequency ones (large indices). The original spectrum shows the exponential decay characteristic of standard RoPE.

The visualization reveals a fundamental trade-off. Position Interpolation maintains the relative spacing between frequencies (the lines are parallel on a log scale) but shifts everything down. NTK-aware scaling bends the curve, keeping high frequencies close to original while aggressively stretching low frequencies. Neither approach is optimal across all dimension pairs.

Attention Entropy and the Temperature Problem

Beyond frequency adjustments, there's a more subtle issue that neither PI nor NTK-aware scaling addresses: attention entropy shifts. When we modify RoPE frequencies, we change the distribution of attention scores in ways that affect model behavior.

Recall that attention scores are computed as the scaled dot product between query and key vectors:

$$\text{score}(m, n) = \frac{\mathbf{q}_m^\top \mathbf{k}_n}{\sqrt{d_k}}$$

where:

  • $\text{score}(m, n)$: the attention score between a query at position $m$ and a key at position $n$
  • $\mathbf{q}_m$: the RoPE-rotated query vector at sequence position $m$
  • $\mathbf{k}_n$: the RoPE-rotated key vector at sequence position $n$
  • $d_k$: the dimension of the key vectors, used for scaling to prevent dot products from growing too large
  • $\mathbf{q}_m^\top \mathbf{k}_n$: the dot product between query and key, measuring their similarity

When we interpolate positions, the effective distances between tokens change. Positions that were far apart now appear closer in the rotated space. This compresses the range of attention scores, making them more uniform. Higher uniformity means higher entropy: the model attends more diffusely rather than focusing on specific positions.

Attention Entropy

The entropy of an attention distribution measures how spread out the attention weights are. Low entropy means the model focuses sharply on a few positions. High entropy means attention is distributed more evenly across many positions. Changes to position encoding can inadvertently shift this entropy, degrading the model's ability to focus.
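For reference, the entropy described here is the standard Shannon entropy of one attention row. If $p_1, \dots, p_n$ are the softmax weights a query assigns to $n$ keys, then

$$H = -\sum_{i=1}^{n} p_i \log p_i$$

A row that spreads its weight uniformly has entropy $\ln n$, while a row that puts all its weight on a single key has entropy 0. The `compute_entropy` helper below computes exactly this quantity, with a small epsilon for numerical safety.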

Let's quantify this effect by computing attention entropy under different interpolation schemes:

In[5]:
Code
def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


def compute_entropy(attention_weights):
    """Compute entropy of attention distribution."""
    # Avoid log(0) by adding small epsilon
    eps = 1e-10
    log_weights = np.log(attention_weights + eps)
    entropy = -np.sum(attention_weights * log_weights, axis=-1)
    return entropy


def apply_rope(x, position, frequencies):
    """Apply RoPE to a vector at given position."""
    d = len(x)
    x_pairs = x.reshape(-1, 2)
    angles = position * frequencies

    cos_angles = np.cos(angles)
    sin_angles = np.sin(angles)

    x_rotated = np.stack(
        [
            x_pairs[:, 0] * cos_angles - x_pairs[:, 1] * sin_angles,
            x_pairs[:, 0] * sin_angles + x_pairs[:, 1] * cos_angles,
        ],
        axis=-1,
    )

    return x_rotated.flatten()


def compute_attention_scores(seq_len, d_model, frequencies, seed=42):
    """Compute attention score matrix for random Q, K vectors."""
    np.random.seed(seed)

    # Generate random query and key vectors
    Q = np.random.randn(seq_len, d_model)
    K = np.random.randn(seq_len, d_model)

    # Apply RoPE
    Q_rotated = np.array(
        [apply_rope(Q[i], i, frequencies) for i in range(seq_len)]
    )
    K_rotated = np.array(
        [apply_rope(K[i], i, frequencies) for i in range(seq_len)]
    )

    # Compute scaled dot-product scores
    scores = Q_rotated @ K_rotated.T / np.sqrt(d_model)

    return scores
In[6]:
Code
# Compare entropy across methods
seq_len = 64
d_model = 64
scale = 4.0

# Original context (positions 0 to seq_len-1)
original_scores = compute_attention_scores(seq_len, d_model, original_freqs)
original_weights = softmax(original_scores)
original_entropy = compute_entropy(original_weights)

# Position interpolation (pretend we're at 4x length, but scale down)
pi_scores = compute_attention_scores(seq_len, d_model, pi_freqs)
pi_weights = softmax(pi_scores)
pi_entropy = compute_entropy(pi_weights)

# NTK-aware
ntk_scores = compute_attention_scores(seq_len, d_model, ntk_freqs)
ntk_weights = softmax(ntk_scores)
ntk_entropy = compute_entropy(ntk_weights)
Out[7]:
Console
Mean attention entropy across query positions:
Method                    Mean Entropy    Std Dev        
-------------------------------------------------------
Original RoPE             3.6652           0.1447
Position Interpolation    3.6772           0.1517
NTK-aware                 3.6721           0.1737

The entropy values reveal the effect of position interpolation on attention patterns. While the differences might seem small in absolute terms, they compound across layers and affect the model's ability to retrieve and aggregate information from specific positions. YaRN addresses this by introducing an attention temperature correction.

The YaRN Solution

YaRN combines two complementary mechanisms to enable high-quality context extension:

  1. Ramp-based frequency interpolation: Rather than treating all dimension pairs equally (like PI) or smoothly transitioning across all pairs (like NTK), YaRN uses a ramp function that applies no interpolation to high-frequency pairs, full interpolation to low-frequency pairs, and a smooth transition in between.

  2. Attention temperature scaling: YaRN introduces a scaling factor $\sqrt{t}$ to attention scores, where $t \geq 1$ is derived to compensate for the entropy increase caused by interpolation. When $s = 1$ (no extension), $t = 1$ and no scaling is applied.

The combined approach modifies both the rotation frequencies and the attention computation. For frequencies, YaRN applies a dimension-specific adjustment:

$$\theta'_i = \theta_i \cdot \gamma(r_i)$$

where:

  • $\theta'_i$: the adjusted frequency for dimension pair $i$ after YaRN modification
  • $\theta_i$: the original RoPE frequency for dimension pair $i$, computed as $1/10000^{2i/d}$
  • $\gamma(r_i)$: the interpolation factor, a value between $1/s$ and $1$ that depends on the wavelength ratio
  • $r_i = \lambda_i / L$: the wavelength-to-context ratio for dimension pair $i$
  • $\lambda_i$: the wavelength for dimension pair $i$ (positions per full rotation)
  • $L$: the original training context length

For attention scores, YaRN introduces a temperature correction:

$$\text{score}'(m, n) = \sqrt{t} \cdot \frac{\mathbf{q}_m^\top \mathbf{k}_n}{\sqrt{d_k}}$$

where:

  • $\text{score}'(m, n)$: the temperature-adjusted attention score
  • $\sqrt{t}$: the temperature scaling factor, where $t > 1$ increases score magnitude
  • $t$: the temperature parameter, computed from the extension factor $s$

Let's develop each component in detail.

Wavelength Analysis

To understand why some dimension pairs need interpolation while others don't, we need to think about rotation in terms of wavelength rather than frequency. While frequency tells us how fast something rotates (radians per position), wavelength tells us how far we must travel for one complete cycle (positions per full rotation). Wavelength provides a more intuitive picture because we can directly compare it to the context length.

Think of it this way: if a dimension pair completes 100 full rotations during the training context, it has learned to use those rotations to distinguish positions at a fine-grained level. Extending the context by 4x means it now completes 400 rotations. No problem. The pair can still distinguish positions just as well as before. But if a dimension pair completes only 0.1 rotations during training (barely moving at all), extending by 4x means it now needs to cover 0.4 rotations. That's still not a complete cycle, and the model has never seen rotation patterns beyond what it learned during training.

The wavelength $\lambda_i$ of dimension pair $i$ is the inverse of the frequency, scaled by $2\pi$:

$$\lambda_i = \frac{2\pi}{\theta_i} = 2\pi \cdot 10000^{2i/d}$$

where:

  • $\lambda_i$: the wavelength (in positions) for dimension pair $i$, representing how many sequence positions correspond to one complete rotation
  • $\theta_i = 1/10000^{2i/d}$: the base frequency for dimension pair $i$, measured in radians per position
  • $d$: the total embedding dimension (must be even)
  • $i$: the dimension pair index ($0, 1, \dots, d/2 - 1$)
  • $2\pi$: the number of radians in a complete rotation (one full cycle)
  • $10000$: the RoPE base constant, which controls the frequency range

The second equality follows by substituting the definition of $\theta_i$ and simplifying: $2\pi / (1/10000^{2i/d}) = 2\pi \cdot 10000^{2i/d}$.
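As a quick numerical check of this identity (a minimal sketch reusing the `compute_rope_frequencies` helper defined earlier), the two forms give the same wavelengths:

# Wavelengths computed from the frequencies vs. directly from the base
d_check = 64
base_check = 10000
pair_idx = np.arange(d_check // 2)

from_freqs = 2 * np.pi / compute_rope_frequencies(d_check, base_check)
direct = 2 * np.pi * base_check ** (2 * pair_idx / d_check)

print(np.allclose(from_freqs, direct))  # expected: True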

The critical question becomes: how does each wavelength compare to the training context length $L$? If $\lambda_i < L$, the pair completes at least one full rotation during training, meaning the model has seen all possible rotation states. If $\lambda_i > L$, the pair completes less than one rotation, meaning the model has only seen a fraction of the rotation cycle. This ratio $r_i = \lambda_i / L$ will be central to YaRN's design.

In[8]:
Code
def compute_wavelengths(d_model, base=10000):
    """Compute wavelengths for each dimension pair."""
    frequencies = compute_rope_frequencies(d_model, base)
    wavelengths = 2 * np.pi / frequencies
    return wavelengths


# Analyze wavelengths relative to context length
d_model = 64
base = 10000
original_context = 2048  # Original training context
extension_factor = 4.0
extended_context = original_context * extension_factor

wavelengths = compute_wavelengths(d_model, base)
Out[9]:
Console
Original context: 2048 positions
Extended context: 8192 positions

Pair     Wavelength      vs Original L        vs Extended L       
---------------------------------------------------------------
0        6.3             0.0031               0.0008              
4        19.9            0.0097               0.0024              
8        62.8            0.0307               0.0077              
16       628.3           0.3068               0.0767              
24       6283.2          3.0680               0.7670              
31       47117.2         23.0065              5.7516              

The "vs Original L" column shows the wavelength ratio ri=λi/Lr_i = \lambda_i / L. This ratio is the key to understanding which dimension pairs need interpolation:

  • When $r_i < 1$, the wavelength is shorter than the context, meaning the pair completes at least one full rotation during training. These pairs have learned the full rotation cycle.
  • When $r_i > 1$, the wavelength exceeds the context, meaning the pair doesn't complete even one rotation during training. These pairs operate in a limited portion of the rotation cycle.

Let's trace through the table to build intuition. Dimension pair 0 has a wavelength of about 6 positions and $r_0 \approx 0.003$. It completes roughly 325 full rotations within the training context, so it has thoroughly learned how to use rotation for position encoding. When we extend to 8,192 positions, pair 0 still completes over 1,300 rotations. No problem here.

At the other extreme, dimension pair 31 has a wavelength of around 47,000 positions and $r_{31} \approx 23$. During training, it completes only about 4% of a single rotation. The model has learned to encode position using just this small arc of the rotation cycle. If we extend the context by 4x without interpolation, we'd ask pair 31 to cover roughly 17% of its rotation cycle, using rotation states it has never encountered. This is extrapolation, and it degrades model quality.
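To make the two extremes concrete, the arithmetic behind them is a short loop (a small sketch using the `wavelengths`, `original_context`, and `extended_context` values from the cell above):

# Number of full rotations a pair completes = context length / wavelength
for pair in [0, 31]:
    train_rotations = original_context / wavelengths[pair]
    extended_rotations = extended_context / wavelengths[pair]
    print(
        f"pair {pair:2d}: {train_rotations:9.3f} rotations in the training context, "
        f"{extended_rotations:9.3f} in the extended context"
    )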

The pattern is clear: pairs with small $r_i$ don't need interpolation; pairs with large $r_i$ do. The question is where to draw the line, and whether the transition should be abrupt or smooth. YaRN's ramp function provides the answer.

The YaRN Ramp Function

Now we can design a function that decides how much interpolation each dimension pair receives. We want three behaviors:

  1. For pairs with small $r_i$ (short wavelengths, many rotations during training): apply no interpolation, leaving them at their original frequencies.
  2. For pairs with large $r_i$ (long wavelengths, few rotations during training): apply full interpolation, scaling them by $1/s$ just like Position Interpolation would.
  3. For pairs with intermediate $r_i$: apply partial interpolation, blending smoothly between the two extremes.

This is exactly what a ramp function achieves. YaRN defines $\gamma(r)$ as a piecewise function:

$$\gamma(r) = \begin{cases} 1 & \text{if } r < \alpha \\ \frac{1}{s} & \text{if } r > \beta \\ (1 - w) + w \cdot \frac{1}{s} & \text{otherwise} \end{cases}$$

where:

  • $r = \lambda_i / L$: the wavelength-to-context ratio for dimension pair $i$, comparing the rotation period to the training context
  • $\lambda_i$: the wavelength for dimension pair $i$
  • $L$: the original training context length (e.g., 2048 or 4096 tokens)
  • $s$: the extension scale factor (e.g., 4 for extending from 2048 to 8192)
  • $\alpha$: the lower threshold (typically 1.0), below which no interpolation is applied
  • $\beta$: the upper threshold (typically 32.0), above which full interpolation is applied
  • $w = (r - \alpha) / (\beta - \alpha)$: the interpolation weight in the ramp region, ranging from 0 to 1
  • $\gamma(r)$: the output interpolation factor, ranging from 1 (no interpolation) to $1/s$ (full interpolation)

Let's unpack the middle case, which is the most interesting. The weight $w$ measures where $r$ falls within the transition region $[\alpha, \beta]$. At the left edge ($r = \alpha$), we have $w = 0$, so the expression becomes $(1 - 0) + 0 \cdot (1/s) = 1$. At the right edge ($r = \beta$), we have $w = 1$, giving $(1 - 1) + 1 \cdot (1/s) = 1/s$. For values between the edges, we get a linear blend. This creates a smooth ramp rather than an abrupt step, which helps the model adapt more gracefully.

Why α = 1 and β = 32?

The default thresholds have intuitive interpretations. When $\alpha = 1$, dimension pairs with wavelength less than the context length (completing at least one full rotation during training) don't need interpolation. When $\beta = 32$, dimension pairs with wavelength more than 32 times the context length (completing less than about 3% of a rotation) need full interpolation. These values were empirically validated across multiple model architectures, striking a balance between preserving high-frequency information and ensuring stable extrapolation.
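It's worth checking where the pairs of our running example actually fall (a quick sketch with the default thresholds and the `wavelengths` array computed earlier):

alpha, beta = 1.0, 32.0
ratios = wavelengths / original_context  # r_i = lambda_i / L

print("no interpolation (r < alpha):", int(np.sum(ratios < alpha)))
print("ramp region (alpha <= r <= beta):", int(np.sum((ratios >= alpha) & (ratios <= beta))))
print("full interpolation (r > beta):", int(np.sum(ratios > beta)))

For this particular configuration ($d = 64$, $L = 2048$), no pair actually exceeds $\beta$: the largest ratio is only about 23 (pair 31 in the table above), so every slow pair lands in the ramp region rather than receiving the full $1/s$ factor. Whether any pair crosses $\beta$ depends on the RoPE base and the training length.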

To summarize, the ramp function creates three distinct regions:

  1. Short wavelengths ($r < \alpha$): These pairs already rotate many times within the original context. They don't need interpolation because they can naturally handle the extended positions.

  2. Long wavelengths ($r > \beta$): These pairs rotate slowly and need full Position Interpolation treatment to avoid extrapolation.

  3. Middle wavelengths ($\alpha \leq r \leq \beta$): These pairs receive partial interpolation, smoothly blending between the two extremes.

In[10]:
Code
def yarn_gamma(wavelength, context_length, scale, alpha=1.0, beta=32.0):
    """Compute YaRN interpolation factor for a given wavelength."""
    r = wavelength / context_length

    if r < alpha:
        # Short wavelength: no interpolation
        return 1.0
    elif r > beta:
        # Long wavelength: full interpolation
        return 1.0 / scale
    else:
        # Ramp region: smooth transition
        w = (r - alpha) / (beta - alpha)
        # Linear blend between 1 (no interpolation) and 1/scale (full interpolation)
        return (1 - w) * 1.0 + w * (1.0 / scale)


def compute_yarn_frequencies(
    d_model, base=10000, context_length=2048, scale=4.0, alpha=1.0, beta=32.0
):
    """Compute YaRN-adjusted frequencies."""
    original_freqs = compute_rope_frequencies(d_model, base)
    wavelengths = compute_wavelengths(d_model, base)

    gamma_values = np.array(
        [yarn_gamma(w, context_length, scale, alpha, beta) for w in wavelengths]
    )

    # Apply gamma as a frequency multiplier
    # gamma < 1 means we slow down the rotation (interpolation)
    yarn_freqs = original_freqs * gamma_values

    return yarn_freqs, gamma_values

Let's visualize the ramp function and its effect on frequencies:

In[11]:
Code
# Compute YaRN frequencies
d_model = 64
context_length = 2048
scale = 4.0

yarn_freqs, gamma_values = compute_yarn_frequencies(
    d_model, context_length=context_length, scale=scale
)
original_freqs = compute_rope_frequencies(d_model)
wavelengths = compute_wavelengths(d_model)

# Compute wavelength ratios
r_values = wavelengths / context_length
Out[12]:
Visualization
Line plot showing step-like gamma function with smooth ramp transition.
The YaRN gamma function versus wavelength ratio. Pairs with short wavelengths (r < 1) receive no interpolation (gamma = 1). Pairs with long wavelengths (r > 32) receive full interpolation (gamma = 0.25 for 4x extension). The ramp region smoothly connects these extremes.
Log-scale plot comparing YaRN, PI, and NTK frequency spectra.
Effect on frequencies. YaRN preserves high frequencies (left side) while interpolating low frequencies (right side). Compare to Position Interpolation, which uniformly scales all frequencies, and NTK-aware scaling, which bends the entire curve.

The ramp function creates a piecewise linear transition on the log-wavelength scale. The left panel shows how $\gamma$ drops from 1 to $1/s$ as wavelength increases. The right panel shows the resulting frequency spectrum: YaRN preserves high frequencies (matching the original) while smoothly transitioning to interpolated low frequencies.

Attention Temperature Scaling

We've now addressed the frequency problem with the ramp function. But recall that we identified two problems at the start: frequency distortion and entropy shift. Even with perfect frequency adjustments, interpolation changes something fundamental about attention patterns.

To understand why, think about what interpolation does geometrically. When we slow down rotations (by multiplying frequencies by $\gamma < 1$), positions that were previously "far apart" in the rotated embedding space now appear "closer." Consider two tokens at positions 0 and 100. In the original RoPE, pair 0 might rotate these to be 100 radians apart (before wrapping). With interpolation, the same pair rotates them to only 25 radians apart (for 4x extension). This compression happens across all interpolated dimension pairs.

This compression has a direct consequence for attention scores. The dot product between query and key vectors measures their similarity. When positions appear closer together in the rotated space, the dot products become more similar to each other. The range of attention scores shrinks.

Now recall how softmax works. Given a vector of scores $\mathbf{z}$:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

When input scores are spread out (high variance), exponentiating amplifies the differences, producing a peaked distribution that focuses on the highest scores. When input scores are compressed (low variance), the exponentials are more similar, producing a flatter, more uniform distribution. This increased uniformity is what we measure as higher entropy.

The solution is conceptually simple: if scores are too compressed, stretch them back out. We do this by multiplying the scores by a factor greater than 1 before applying softmax. In temperature scaling terminology, we reduce the "temperature" (making the distribution sharper) by dividing by a temperature $T < 1$, or equivalently, multiplying by $1/T$. YaRN frames this as multiplying by $\sqrt{t}$ where $t > 1$.
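The effect is easy to see on a single score vector. The sketch below (reusing the `softmax` and `compute_entropy` helpers from earlier) compresses random scores, which raises entropy, and then applies a modest stretch of the kind YaRN uses; the stretch pulls entropy back down, though only partially:

np.random.seed(0)
scores = np.random.randn(64)                                # one row of attention scores
compressed = scores / 2.0                                   # interpolation-style compression
corrected = compressed * np.sqrt(0.1 * np.log(4.0) + 1.0)   # sqrt(t) for s = 4, per the formula below

for label, z in [("original", scores), ("compressed", compressed), ("corrected", corrected)]:
    print(f"{label:>10s} entropy: {compute_entropy(softmax(z)):.4f}")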

The temperature factor is computed as:

$$t = 0.1 \cdot \ln(s) + 1$$

where:

  • $t$: the attention temperature scaling factor (always $\geq 1$)
  • $s$: the context extension scale factor (e.g., 4 for 4x extension)
  • $\ln(s)$: the natural logarithm of the extension factor
  • $0.1$: an empirically determined coefficient that controls the rate of temperature increase
  • $1$: the base value ensuring $t = 1$ when $s = 1$ (no extension)

Why $\sqrt{t}$ rather than $t$ directly? This relates to how variance scales under multiplication. If we multiply scores by a constant $c$, the variance of the scores is multiplied by $c^2$. So to increase variance by a factor of $t$, we need to multiply the scores by $\sqrt{t}$.
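A two-line check makes the variance argument concrete (this is just the general scaling property, not anything specific to YaRN):

rng_scores = np.random.default_rng(1).normal(size=10_000)
c = 1.5

print(np.var(c * rng_scores))       # equals c**2 * np.var(rng_scores), up to floating point
print(c**2 * np.var(rng_scores))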

The logarithmic form of $t$ captures an empirical observation: the entropy shift doesn't grow linearly with the extension factor. Larger extensions need progressively less aggressive correction. Doubling the extension factor from 4x to 8x doesn't require doubling the temperature adjustment. Instead, it adds a constant amount: $\ln(8) - \ln(4) = \ln(2) \approx 0.69$, contributing an additional $0.069$ to $t$.

In[13]:
Code
def compute_yarn_temperature(scale):
    """Compute YaRN attention temperature scaling factor."""
    return 0.1 * np.log(scale) + 1.0


# Compute temperature for various extension factors
extension_factors = [2, 4, 8, 16, 32, 64, 128]
temperatures = [compute_yarn_temperature(s) for s in extension_factors]
Out[14]:
Console
YaRN temperature scaling factors:
Extension Factor     Temperature (t)      sqrt(t)             
------------------------------------------------------------
2                    1.0693               1.0341
4                    1.1386               1.0671
8                    1.2079               1.0991
16                   1.2773               1.1302
32                   1.3466               1.1604
64                   1.4159               1.1899
128                  1.4852               1.2187

The $\sqrt{t}$ column shows the actual multiplier applied to attention scores. Even for very large extension factors (128x), the multiplier stays around 1.2, indicating that temperature correction is a subtle adjustment rather than a dramatic rescaling.

The temperature factor grows slowly with extension factor. For a 4x extension, we multiply attention scores by $\sqrt{1.139} \approx 1.067$. For 32x extension, the multiplier is $\sqrt{1.347} \approx 1.161$. These modest corrections help maintain attention sharpness.

Let's verify the effect on entropy:

In[15]:
Code
def apply_yarn_rope(
    x, position, frequencies, scale, context_length, alpha=1.0, beta=32.0
):
    """Apply YaRN-adjusted RoPE."""
    d = len(x)
    wavelengths = compute_wavelengths(d)

    # Compute gamma for each dimension pair
    gamma_vals = np.array(
        [yarn_gamma(w, context_length, scale, alpha, beta) for w in wavelengths]
    )

    # Adjust frequencies
    adjusted_freqs = frequencies * gamma_vals

    return apply_rope(x, position, adjusted_freqs)


def compute_yarn_attention_scores(
    seq_len, d_model, scale, context_length, apply_temperature=True, seed=42
):
    """Compute attention scores with YaRN adjustments."""
    np.random.seed(seed)

    original_freqs = compute_rope_frequencies(d_model)

    Q = np.random.randn(seq_len, d_model)
    K = np.random.randn(seq_len, d_model)

    Q_rotated = np.array(
        [
            apply_yarn_rope(Q[i], i, original_freqs, scale, context_length)
            for i in range(seq_len)
        ]
    )
    K_rotated = np.array(
        [
            apply_yarn_rope(K[i], i, original_freqs, scale, context_length)
            for i in range(seq_len)
        ]
    )

    scores = Q_rotated @ K_rotated.T / np.sqrt(d_model)

    if apply_temperature:
        t = compute_yarn_temperature(scale)
        scores = scores * np.sqrt(t)

    return scores
In[16]:
Code
# Compare entropy with and without temperature scaling
seq_len = 64
d_model = 64
scale = 4.0
context_length = 2048

# YaRN without temperature
yarn_scores_no_temp = compute_yarn_attention_scores(
    seq_len, d_model, scale, context_length, apply_temperature=False
)
yarn_weights_no_temp = softmax(yarn_scores_no_temp)
yarn_entropy_no_temp = compute_entropy(yarn_weights_no_temp)

# YaRN with temperature
yarn_scores_with_temp = compute_yarn_attention_scores(
    seq_len, d_model, scale, context_length, apply_temperature=True
)
yarn_weights_with_temp = softmax(yarn_scores_with_temp)
yarn_entropy_with_temp = compute_entropy(yarn_weights_with_temp)
Out[17]:
Console
Effect of YaRN temperature scaling on attention entropy:
Method                         Mean Entropy    Std Dev        
------------------------------------------------------------
Original RoPE                  3.6652           0.1447
YaRN (no temperature)          3.6652           0.1447
YaRN (with temperature)        3.6008           0.1666

Temperature scaling helps bring the entropy closer to the original distribution. While the match isn't perfect (which is expected given the fundamental changes to the attention geometry), the correction prevents the excessive entropy increase that would otherwise occur.

The Complete YaRN Formula

Now we can put together the complete YaRN transformation. Given a model trained on context length $L$ that we want to extend by factor $s$, the process involves three steps.

Step 1: Compute adjusted frequencies

For each dimension pair $i$ with base frequency $\theta_i$ and wavelength $\lambda_i = 2\pi/\theta_i$, compute the adjusted frequency:

$$\theta'_i = \theta_i \cdot \gamma\left(\frac{\lambda_i}{L}\right)$$

where:

  • $\theta'_i$: the adjusted frequency for dimension pair $i$
  • $\theta_i$: the original RoPE frequency
  • $\gamma(\cdot)$: the ramp function defined earlier
  • $\lambda_i / L$: the wavelength ratio that determines the interpolation amount

Step 2: Apply RoPE with adjusted frequencies

For query or key vector $\mathbf{x}$ at position $m$, apply the block-diagonal rotation matrix:

$$\text{YaRN-RoPE}(\mathbf{x}, m) = \begin{pmatrix} R(\theta'_0 \cdot m) & & \\ & \ddots & \\ & & R(\theta'_{d/2-1} \cdot m) \end{pmatrix} \mathbf{x}$$

where:

  • $\text{YaRN-RoPE}(\mathbf{x}, m)$: the rotated vector, now incorporating YaRN's frequency adjustments
  • $R(\theta'_i \cdot m)$: a $2 \times 2$ rotation matrix for angle $\theta'_i \cdot m$, applied to dimension pair $i$
  • $\theta'_i \cdot m$: the rotation angle, which increases linearly with position $m$ at the adjusted rate
  • The block-diagonal structure means each dimension pair is rotated independently

Step 3: Apply temperature scaling to attention scores

After computing the raw attention scores, apply the temperature correction:

$$\text{score}(m, n) = \sqrt{t} \cdot \frac{\mathbf{q}_m^\top \mathbf{k}_n}{\sqrt{d_k}}$$

where:

  • $\text{score}(m, n)$: the scaled attention score between positions $m$ and $n$
  • $\mathbf{q}_m$: the YaRN-rotated query vector at position $m$
  • $\mathbf{k}_n$: the YaRN-rotated key vector at position $n$
  • $d_k$: the dimension of key vectors
  • $t = 0.1 \ln(s) + 1$: the temperature factor, computed from the extension scale
  • $\sqrt{t}$: the scaling multiplier applied to raw scores (increases score magnitude to sharpen attention)
  • The scaling occurs before softmax normalization
In[18]:
Code
class YaRNRoPE:
    """YaRN-adjusted Rotary Position Embedding."""

    def __init__(
        self,
        d_model,
        original_context=2048,
        scale=4.0,
        base=10000,
        alpha=1.0,
        beta=32.0,
    ):
        """Initialize YaRN RoPE.

        Args:
            d_model: Embedding dimension (must be even)
            original_context: Original training context length
            scale: Context extension factor
            base: RoPE frequency base
            alpha: Lower wavelength threshold
            beta: Upper wavelength threshold
        """
        self.d_model = d_model
        self.original_context = original_context
        self.scale = scale
        self.alpha = alpha
        self.beta = beta

        # Compute adjusted frequencies
        self.original_freqs = compute_rope_frequencies(d_model, base)
        self.wavelengths = compute_wavelengths(d_model, base)

        self.gamma_values = np.array(
            [
                yarn_gamma(w, original_context, scale, alpha, beta)
                for w in self.wavelengths
            ]
        )

        self.adjusted_freqs = self.original_freqs * self.gamma_values

        # Compute temperature factor
        self.temperature = compute_yarn_temperature(scale)

    def apply(self, x, position):
        """Apply YaRN-adjusted RoPE to a vector.

        Args:
            x: Input vector of shape (d_model,)
            position: Position index

        Returns:
            Rotated vector of shape (d_model,)
        """
        return apply_rope(x, position, self.adjusted_freqs)

    def apply_batch(self, x):
        """Apply YaRN-adjusted RoPE to a batch of vectors.

        Args:
            x: Input tensor of shape (seq_len, d_model)

        Returns:
            Rotated tensor of shape (seq_len, d_model)
        """
        seq_len, d = x.shape
        positions = np.arange(seq_len)

        # Compute all rotation angles
        angles = np.outer(positions, self.adjusted_freqs)

        cos_angles = np.cos(angles)
        sin_angles = np.sin(angles)

        x_pairs = x.reshape(seq_len, -1, 2)

        x_rotated = np.stack(
            [
                x_pairs[:, :, 0] * cos_angles - x_pairs[:, :, 1] * sin_angles,
                x_pairs[:, :, 0] * sin_angles + x_pairs[:, :, 1] * cos_angles,
            ],
            axis=-1,
        )

        return x_rotated.reshape(seq_len, d)

    def scale_attention(self, scores):
        """Apply temperature scaling to attention scores.

        Args:
            scores: Attention scores of shape (seq_len, seq_len)

        Returns:
            Scaled scores
        """
        return scores * np.sqrt(self.temperature)

Let's test the complete YaRN implementation:

In[19]:
Code
# Create YaRN module
yarn = YaRNRoPE(d_model=64, original_context=2048, scale=4.0)

# Test with random input
np.random.seed(42)
seq_len = 32
test_input = np.random.randn(seq_len, 64)

# Apply YaRN RoPE
rotated = yarn.apply_batch(test_input)
Out[20]:
Console
YaRN Configuration:
  Original context: 2048
  Extension scale: 4.0x
  Extended context: 8192
  Temperature: 1.1386
  Attention scale: sqrt(t) = 1.0671

Input shape: (32, 64)
Output shape: (32, 64)
Magnitude preserved: True

The implementation confirms that YaRN preserves vector magnitudes (rotation is an isometry) while applying the adjusted frequencies.
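The cell that produced the output above isn't shown in full; a minimal version of the magnitude check (assuming the `test_input` and `rotated` arrays from the previous cell) could look like this:

input_norms = np.linalg.norm(test_input, axis=-1)
output_norms = np.linalg.norm(rotated, axis=-1)

print("Magnitude preserved:", bool(np.allclose(input_norms, output_norms)))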

Visualizing YaRN's Effect

Let's visualize how YaRN affects the attention patterns compared to other methods:

In[21]:
Code
def compute_attention_matrix(
    seq_len, d_model, freqs, scale_attention=1.0, seed=42
):
    """Compute full attention weight matrix."""
    np.random.seed(seed)

    Q = np.random.randn(seq_len, d_model)
    K = np.random.randn(seq_len, d_model)

    Q_rotated = np.array([apply_rope(Q[i], i, freqs) for i in range(seq_len)])
    K_rotated = np.array([apply_rope(K[i], i, freqs) for i in range(seq_len)])

    scores = Q_rotated @ K_rotated.T / np.sqrt(d_model)
    scores = scores * scale_attention

    weights = softmax(scores)
    return weights


# Compute attention patterns for each method
seq_len = 32
d_model = 64
scale = 4.0

# YaRN with temperature
yarn_module = YaRNRoPE(d_model=d_model, original_context=2048, scale=scale)
yarn_attn = compute_attention_matrix(
    seq_len,
    d_model,
    yarn_module.adjusted_freqs,
    scale_attention=np.sqrt(yarn_module.temperature),
)

# Original
original_attn = compute_attention_matrix(seq_len, d_model, original_freqs)

# Position Interpolation
pi_attn = compute_attention_matrix(seq_len, d_model, pi_freqs)

# NTK-aware
ntk_attn = compute_attention_matrix(seq_len, d_model, ntk_freqs)
Out[22]:
Visualization
Heatmap showing attention weights with diagonal emphasis.
Original RoPE attention pattern. This is the baseline distribution the model learned during training.
Heatmap showing more uniform attention distribution.
Position Interpolation attention. The pattern becomes more diffuse due to compressed position distances.
Heatmap showing moderate attention pattern preservation.
NTK-aware attention. Preserves more structure than PI but still shows increased entropy.
Heatmap showing attention pattern similar to original.
YaRN attention. The combination of selective interpolation and temperature scaling produces a pattern closer to the original.

The attention heatmaps reveal the differences between methods. YaRN produces patterns that more closely resemble the original RoPE attention, with sharper focus on relevant positions rather than the diffuse patterns seen with Position Interpolation.

YaRN Training Requirements

An important practical consideration is whether YaRN requires fine-tuning. The answer depends on the extension factor:

For modest extensions (2x to 4x): YaRN can often be applied without any fine-tuning, leveraging the model's existing weights. The frequency adjustments and temperature scaling are designed to minimize the distribution shift.

For larger extensions (8x to 32x): Brief fine-tuning significantly improves quality. The YaRN paper recommends 200 to 400 training steps on long-context data, much less than training from scratch.

For extreme extensions (64x+): Longer fine-tuning becomes necessary, though still much shorter than pre-training from scratch.

The key advantage of YaRN over naive approaches is the efficiency of this fine-tuning. Because the method preserves the essential structure of the position encoding while making targeted adjustments, the model needs to learn only minor adaptations rather than fundamentally new position representations.

Out[23]:
Console
Recommended YaRN fine-tuning budget:

Extension Factor     Recommended Steps         Notes                         
---------------------------------------------------------------------------
2x - 4x              0 (optional)              Works zero-shot               
4x - 8x              100 - 200                 Brief warmup helps            
8x - 16x             200 - 400                 Recommended by paper          
16x - 32x            400 - 1000                More adaptation needed        
32x+                 1000+                     Extended fine-tuning          

YaRN vs Alternatives

Let's summarize how YaRN compares to other context extension methods:

Comparison of RoPE extension methods. YaRN's combination of targeted frequency adjustment and attention temperature scaling addresses limitations of simpler approaches.
| Method | Key Mechanism | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Position Interpolation | Uniform position scaling | Simple, no new parameters | Loses high-frequency information |
| NTK-aware | Base frequency adjustment | Better frequency preservation | Still uniform across dimensions |
| Dynamic NTK | Position-dependent scaling | Adapts to sequence length | Complex, training instability |
| YaRN | Ramp function + temperature | Selective interpolation, entropy control | Two hyperparameters (α, β) |

The main advantages of YaRN are:

  1. Selective preservation: High-frequency dimensions that don't need interpolation are left unchanged, maintaining fine-grained position discrimination.

  2. Entropy control: Temperature scaling prevents the attention distribution from becoming too diffuse, preserving the model's ability to focus.

  3. Efficient adaptation: The targeted adjustments require less fine-tuning to adapt to than more disruptive methods.

  4. Predictable behavior: The ramp function provides intuitive control over which dimensions are interpolated and by how much.

In[24]:
Code
# Quantitative comparison across methods
def evaluate_method(name, freqs, scale_factor=1.0, n_trials=10):
    """Evaluate position encoding method on multiple metrics."""
    entropies = []
    relative_position_errors = []

    for trial in range(n_trials):
        # Compute entropy
        scores = compute_attention_matrix(
            64, 64, freqs, scale_attention=scale_factor, seed=trial
        )
        entropy = compute_entropy(scores)
        entropies.append(np.mean(entropy))

        # Check relative position consistency
        np.random.seed(trial)
        q = np.random.randn(64)
        k = np.random.randn(64)

        # Compare scores at same relative distance but different absolute positions
        q0 = apply_rope(q, 0, freqs)
        k2 = apply_rope(k, 2, freqs)
        score_near = np.dot(q0, k2)

        q50 = apply_rope(q, 50, freqs)
        k52 = apply_rope(k, 52, freqs)
        score_far = np.dot(q50, k52)

        relative_position_errors.append(abs(score_near - score_far))

    return {
        "mean_entropy": np.mean(entropies),
        "std_entropy": np.std(entropies),
        "mean_rel_error": np.mean(relative_position_errors),
    }


# Evaluate all methods
yarn_module = YaRNRoPE(d_model=64, original_context=2048, scale=4.0)

methods = {
    "Original": (original_freqs, 1.0),
    "Position Interpolation": (pi_freqs, 1.0),
    "NTK-aware": (ntk_freqs, 1.0),
    "YaRN": (yarn_module.adjusted_freqs, np.sqrt(yarn_module.temperature)),
}

results = {
    name: evaluate_method(name, freqs, scale)
    for name, (freqs, scale) in methods.items()
}
Out[25]:
Console
Method Comparison (4x extension):

Method                    Mean Entropy       Entropy Std        Rel. Pos. Error   
-------------------------------------------------------------------------------
Original                  3.6870             0.0206             4.56e-15
Position Interpolation    3.6997             0.0231             2.66e-15
NTK-aware                 3.6896             0.0194             7.42e-15
YaRN                      3.6253             0.0234             4.12e-15

The "Mean Entropy" column shows how diffuse the attention distribution is, with lower values indicating sharper focus. The "Rel. Pos. Error" measures whether the same relative distance produces the same attention score at different absolute positions. Values near zero indicate perfect relative position encoding.

The comparison shows that YaRN achieves entropy closer to the original while maintaining the relative position property (low position error). The combination of targeted interpolation and temperature scaling addresses both the frequency and entropy aspects of context extension.

Limitations and Considerations

Despite its effectiveness, YaRN has limitations worth understanding.

Hyperparameter sensitivity: The $\alpha$ and $\beta$ parameters control the ramp function's transition region. While the defaults (1 and 32) work well for many models, some architectures may benefit from tuning. Models with different RoPE bases or dimension sizes may require adjustment.

Temperature approximation: The temperature formula $t = 0.1 \ln(s) + 1$ is empirically derived rather than theoretically optimal. It works well in practice but may not perfectly compensate for entropy shifts in all scenarios.

Integration complexity: Unlike Position Interpolation (which only modifies position indices) or NTK-aware scaling (which only modifies the base), YaRN requires two separate modifications. Both the frequency adjustment and the attention scaling must be implemented correctly.

Interaction with other optimizations: When combined with techniques like FlashAttention or grouped-query attention, care must be taken to apply the temperature scaling at the correct point in the computation. The scaling should occur before the softmax, not after.
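One practical way to keep a fused attention kernel untouched is to fold the factor into the queries before calling it: since $\sqrt{t} \cdot (\mathbf{q}_m^\top \mathbf{k}_n) = (\sqrt{t}\,\mathbf{q}_m)^\top \mathbf{k}_n$, scaling the rotated queries by $\sqrt{t}$ yields exactly the same pre-softmax scores. A hedged sketch using the `YaRNRoPE` class from earlier:

def yarn_rotate_with_temperature(yarn, Q, K):
    """Rotate Q and K with YaRN and fold sqrt(t) into the queries.

    The returned tensors can be passed to a standard attention
    implementation without further score scaling, because
    sqrt(t) * (q . k) == (sqrt(t) * q) . k.
    """
    Q_rotated = yarn.apply_batch(Q) * np.sqrt(yarn.temperature)
    K_rotated = yarn.apply_batch(K)
    return Q_rotated, K_rotated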

These limitations are manageable in practice. YaRN has been successfully integrated into many open-source models, and the community has developed reference implementations that handle the integration details correctly.

Key Parameters

When implementing YaRN, several parameters control its behavior:

  • scale: The context extension factor (e.g., 4.0 for extending from 2048 to 8192 tokens). Larger values enable longer contexts but may require more fine-tuning.

  • alpha: The lower wavelength threshold (default: 1.0). Dimension pairs with wavelength-to-context ratio below this value receive no interpolation. Decreasing $\alpha$ applies interpolation to more dimension pairs.

  • beta: The upper wavelength threshold (default: 32.0). Dimension pairs with wavelength-to-context ratio above this value receive full interpolation. Increasing $\beta$ delays full interpolation to higher wavelength pairs.

  • base: The RoPE frequency base (default: 10000). This should match the base used in the original model's RoPE implementation.

  • original_context: The context length the model was originally trained on (e.g., 2048 or 4096). This determines the wavelength ratios used in the ramp function.

For most applications, the default values of $\alpha = 1$ and $\beta = 32$ work well. The primary parameter to adjust is scale, which should be set to the desired extension factor. If the model uses a non-standard RoPE base, ensure base matches the original configuration.
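As a usage sketch, here is how these parameters map onto the `YaRNRoPE` class built earlier. The specific numbers are illustrative rather than tied to any particular model:

# Hypothetical example: extend a 4096-token model to 16384 tokens (scale = 4)
yarn_config = YaRNRoPE(
    d_model=128,             # embedding dimension per head, must be even
    original_context=4096,   # context length the model was trained on
    scale=4.0,               # desired extension factor
    base=10000,              # must match the model's original RoPE base
    alpha=1.0,               # default lower wavelength threshold
    beta=32.0,               # default upper wavelength threshold
)

print("temperature t:", round(yarn_config.temperature, 4))
print("attention multiplier sqrt(t):", round(np.sqrt(yarn_config.temperature), 4))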

Summary

YaRN provides a principled approach to extending context length in RoPE-based language models. By combining selective frequency interpolation with attention temperature scaling, it addresses limitations in both Position Interpolation and NTK-aware methods.

Key takeaways:

  • Wavelength-based interpolation: YaRN uses a ramp function to apply no interpolation to high-frequency dimension pairs, full interpolation to low-frequency pairs, and a smooth transition in between. This preserves fine-grained position information where it matters.

  • Attention temperature correction: Context extension changes attention score distributions, increasing entropy. YaRN compensates with a temperature scaling factor $\sqrt{t}$ where $t = 0.1 \ln(s) + 1$.

  • Efficient fine-tuning: The targeted adjustments minimize distribution shift, enabling effective context extension with as few as 200 to 400 fine-tuning steps for moderate extension factors.

  • Complementary to other methods: YaRN can be seen as a refinement that combines insights from Position Interpolation (full interpolation for long wavelengths) and NTK-aware scaling (preservation of short wavelengths), while adding the novel temperature correction.

  • Practical deployment: YaRN has been integrated into many open-source models and inference frameworks, demonstrating its practical viability for real-world context extension.

The next chapter explores attention sinks, a phenomenon where transformer models allocate disproportionate attention to initial tokens regardless of their semantic relevance, and how this affects long-context processing.
