YaRN: Extending Context Length with Selective Interpolation and Temperature Scaling

Michael Brenndoerfer · Updated June 28, 2025 · 33 min read

Learn how YaRN extends LLM context length through wavelength-based frequency interpolation and attention temperature correction. Includes mathematical formulation and implementation.


YaRN: Yet another RoPE extensioN

Extending context length in large language models has become a central challenge. Models trained on sequences of 2,048 or 4,096 tokens struggle when asked to process documents spanning tens of thousands of positions. Position Interpolation offered one solution: scale down position indices to fit within the trained range. NTK-aware scaling improved upon this by preserving high-frequency components that capture local relationships. But both methods still leave performance on the table, particularly as the extension factor grows large.

YaRN, which stands for "Yet another RoPE extensioN," addresses limitations in both approaches. The key insight is that extending context length doesn't just require fixing position encodings. It also requires compensating for a subtle but significant change in attention score distributions. When we stretch positions across longer sequences, the entropy of attention distributions shifts in ways that degrade model behavior. YaRN tackles this through a combination of targeted frequency interpolation and an attention temperature correction.

This chapter develops YaRN from its motivation through its complete formulation. We'll see why existing methods fall short at large extension factors, derive the attention scaling mechanism, and implement YaRN step by step. By the end, you'll understand how this technique enables models to maintain quality at context lengths 16x or 32x beyond their training distribution.

The Problem with Existing Methods

Before diving into YaRN, let's understand why Position Interpolation and NTK-aware scaling leave room for improvement. Both methods successfully enable RoPE-based models to process longer sequences, but they introduce subtle distortions that accumulate as the extension factor increases.

Position Interpolation works by scaling all position indices by a factor $s$: a position $m$ becomes $m/s$. Because the rotation angle for a dimension pair is $\theta_i \cdot m$, scaling positions by $1/s$ is equivalent to scaling every frequency by $1/s$, which is how the code below implements it. This keeps all positions within the trained range, but it compresses the entire frequency spectrum uniformly. High-frequency dimensions that previously distinguished positions 1 apart now need to distinguish positions $s$ apart after compression. The model loses fine-grained position discrimination.

NTK-aware scaling addresses this by adjusting the base frequency rather than scaling positions directly. This preserves high-frequency components while stretching low-frequency ones. The approach works well for moderate extensions (2x to 4x), but at larger factors, even NTK-aware scaling begins to degrade.

In[2]:
Code
import numpy as np


def compute_rope_frequencies(d_model, base=10000):
    """Compute standard RoPE rotation frequencies."""
    num_pairs = d_model // 2
    i = np.arange(num_pairs)
    frequencies = 1.0 / (base ** (2 * i / d_model))
    return frequencies


def compute_pi_frequencies(d_model, base=10000, scale=4.0):
    """Position Interpolation: scale all positions by 1/scale."""
    frequencies = compute_rope_frequencies(d_model, base)
    return frequencies / scale


def compute_ntk_frequencies(d_model, base=10000, scale=4.0):
    """NTK-aware scaling: adjust base instead of positions.

    The new base is computed as: base * scale^(d / (d - 2))
    This increases the base, which decreases all frequencies,
    but preserves the high-frequency components better than PI.
    """
    new_base = base * scale ** (d_model / (d_model - 2))
    return compute_rope_frequencies(d_model, new_base)

Let's visualize how these methods transform the frequency spectrum:

In[3]:
Code
d_model = 64
scale = 4.0

original_freqs = compute_rope_frequencies(d_model)
pi_freqs = compute_pi_frequencies(d_model, scale=scale)
ntk_freqs = compute_ntk_frequencies(d_model, scale=scale)
Out[4]:
Visualization
Line plot comparing three frequency spectra across dimension pairs on a log scale.
Frequency spectra under different extension methods. Position Interpolation uniformly divides all frequencies by 4. NTK-aware scaling reduces frequencies more gradually, preserving high-frequency components (small indices) while stretching low-frequency ones (large indices). The original spectrum shows the exponential decay characteristic of standard RoPE.

The visualization reveals a fundamental trade-off. Position Interpolation maintains the relative spacing between frequencies (the lines are parallel on a log scale) but shifts everything down. NTK-aware scaling bends the curve, keeping high frequencies close to original while aggressively stretching low frequencies. Neither approach is optimal across all dimension pairs.

Attention Entropy and the Temperature Problem

Beyond frequency adjustments, there's a more subtle issue that neither PI nor NTK-aware scaling addresses: attention entropy shifts. When we modify RoPE frequencies, we change the distribution of attention scores in ways that affect model behavior.

Recall that attention scores are computed as the scaled dot product between query and key vectors:

$$\text{score}(m, n) = \frac{\mathbf{q}_m^\top \mathbf{k}_n}{\sqrt{d_k}}$$

where:

  • $\text{score}(m, n)$: the attention score between a query at position $m$ and a key at position $n$
  • $\mathbf{q}_m$: the RoPE-rotated query vector at sequence position $m$
  • $\mathbf{k}_n$: the RoPE-rotated key vector at sequence position $n$
  • $d_k$: the dimension of the key vectors, used for scaling to prevent dot products from growing too large
  • $\mathbf{q}_m^\top \mathbf{k}_n$: the dot product between query and key, measuring their similarity

When we interpolate positions, the effective distances between tokens change. Positions that were far apart now appear closer in the rotated space. This compresses the range of attention scores, making them more uniform. Higher uniformity means higher entropy: the model attends more diffusely rather than focusing on specific positions.

Attention Entropy

The entropy of an attention distribution measures how spread out the attention weights are. Low entropy means the model focuses sharply on a few positions. High entropy means attention is distributed more evenly across many positions. Changes to position encoding can inadvertently shift this entropy, degrading the model's ability to focus.
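For reference, the entropy described here is the standard Shannon entropy of one attention row. If $p_1, \dots, p_n$ are the softmax weights a query assigns to $n$ keys, then

$$H = -\sum_{i=1}^{n} p_i \log p_i$$

A row that spreads its weight uniformly has entropy $\ln n$, while a row that puts all its weight on a single key has entropy 0. The `compute_entropy` helper below computes exactly this quantity, with a small epsilon for numerical safety.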

Let's quantify this effect by computing attention entropy under different interpolation schemes:

In[5]:
Code
def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


def compute_entropy(attention_weights):
    """Compute entropy of attention distribution."""
    # Avoid log(0) by adding small epsilon
    eps = 1e-10
    log_weights = np.log(attention_weights + eps)
    entropy = -np.sum(attention_weights * log_weights, axis=-1)
    return entropy


def apply_rope(x, position, frequencies):
    """Apply RoPE to a vector at given position."""
    d = len(x)
    x_pairs = x.reshape(-1, 2)
    angles = position * frequencies

    cos_angles = np.cos(angles)
    sin_angles = np.sin(angles)

    x_rotated = np.stack(
        [
            x_pairs[:, 0] * cos_angles - x_pairs[:, 1] * sin_angles,
            x_pairs[:, 0] * sin_angles + x_pairs[:, 1] * cos_angles,
        ],
        axis=-1,
    )

    return x_rotated.flatten()


def compute_attention_scores(seq_len, d_model, frequencies, seed=42):
    """Compute attention score matrix for random Q, K vectors."""
    np.random.seed(seed)

    # Generate random query and key vectors
    Q = np.random.randn(seq_len, d_model)
    K = np.random.randn(seq_len, d_model)

    # Apply RoPE
    Q_rotated = np.array(
        [apply_rope(Q[i], i, frequencies) for i in range(seq_len)]
    )
    K_rotated = np.array(
        [apply_rope(K[i], i, frequencies) for i in range(seq_len)]
    )

    # Compute scaled dot-product scores
    scores = Q_rotated @ K_rotated.T / np.sqrt(d_model)

    return scores
In[6]:
Code
# Compare entropy across methods
seq_len = 64
d_model = 64
scale = 4.0

# Original context (positions 0 to seq_len-1)
original_scores = compute_attention_scores(seq_len, d_model, original_freqs)
original_weights = softmax(original_scores)
original_entropy = compute_entropy(original_weights)

# Position interpolation (pretend we're at 4x length, but scale down)
pi_scores = compute_attention_scores(seq_len, d_model, pi_freqs)
pi_weights = softmax(pi_scores)
pi_entropy = compute_entropy(pi_weights)

# NTK-aware
ntk_scores = compute_attention_scores(seq_len, d_model, ntk_freqs)
ntk_weights = softmax(ntk_scores)
ntk_entropy = compute_entropy(ntk_weights)
Out[7]:
Console
Mean attention entropy across query positions:
Method                    Mean Entropy    Std Dev        
-------------------------------------------------------
Original RoPE             3.6652           0.1447
Position Interpolation    3.6772           0.1517
NTK-aware                 3.6721           0.1737

The entropy values reveal the effect of position interpolation on attention patterns. While the differences might seem small in absolute terms, they compound across layers and affect the model's ability to retrieve and aggregate information from specific positions. YaRN addresses this by introducing an attention temperature correction.

The YaRN Solution

YaRN combines two complementary mechanisms to enable high-quality context extension:

  1. Ramp-based frequency interpolation: Rather than treating all dimension pairs equally (like PI) or smoothly transitioning across all pairs (like NTK), YaRN uses a ramp function that applies no interpolation to high-frequency pairs, full interpolation to low-frequency pairs, and a smooth transition in between.

  2. Attention temperature scaling: YaRN introduces a scaling factor $\sqrt{t}$ to attention scores, where $t \geq 1$ is derived to compensate for the entropy increase caused by interpolation. When $s = 1$ (no extension), $t = 1$ and no scaling is applied.

The combined approach modifies both the rotation frequencies and the attention computation. For frequencies, YaRN applies a dimension-specific adjustment:

$$\theta'_i = \theta_i \cdot \gamma(r_i)$$

where:

  • $\theta'_i$: the adjusted frequency for dimension pair $i$ after YaRN modification
  • $\theta_i$: the original RoPE frequency for dimension pair $i$, computed as $1/10000^{2i/d}$
  • $\gamma(r_i)$: the interpolation factor, a value between $1/s$ and $1$ that depends on the wavelength ratio
  • $r_i = \lambda_i / L$: the wavelength-to-context ratio for dimension pair $i$
  • $\lambda_i$: the wavelength for dimension pair $i$ (positions per full rotation)
  • $L$: the original training context length

For attention scores, YaRN introduces a temperature correction:

$$\text{score}'(m, n) = \sqrt{t} \cdot \frac{\mathbf{q}_m^\top \mathbf{k}_n}{\sqrt{d_k}}$$

where:

  • $\text{score}'(m, n)$: the temperature-adjusted attention score
  • $\sqrt{t}$: the temperature scaling factor, where $t > 1$ increases score magnitude
  • $t$: the temperature parameter, computed from the extension factor $s$

Let's develop each component in detail.

Wavelength Analysis

To understand why some dimension pairs need interpolation while others don't, we need to think about rotation in terms of wavelength rather than frequency. While frequency tells us how fast something rotates (radians per position), wavelength tells us how far we must travel for one complete cycle (positions per full rotation). Wavelength provides a more intuitive picture because we can directly compare it to the context length.

Think of it this way: if a dimension pair completes 100 full rotations during the training context, it has learned to use those rotations to distinguish positions at a fine-grained level. Extending the context by 4x means it now completes 400 rotations. No problem. The pair can still distinguish positions just as well as before. But if a dimension pair completes only 0.1 rotations during training (barely moving at all), extending by 4x means it now needs to cover 0.4 rotations. That's still not a complete cycle, and the model has never seen rotation patterns beyond what it learned during training.

The wavelength $\lambda_i$ of dimension pair $i$ is the inverse of the frequency, scaled by $2\pi$:

$$\lambda_i = \frac{2\pi}{\theta_i} = 2\pi \cdot 10000^{2i/d}$$

where:

  • $\lambda_i$: the wavelength (in positions) for dimension pair $i$, representing how many sequence positions correspond to one complete rotation
  • $\theta_i = 1/10000^{2i/d}$: the base frequency for dimension pair $i$, measured in radians per position
  • $d$: the total embedding dimension (must be even)
  • $i$: the dimension pair index ($0, 1, \dots, d/2 - 1$)
  • $2\pi$: the number of radians in a complete rotation (one full cycle)
  • $10000$: the RoPE base constant, which controls the frequency range

The second equality follows by substituting the definition of $\theta_i$ and simplifying: $2\pi / (1/10000^{2i/d}) = 2\pi \cdot 10000^{2i/d}$.
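As a quick numerical check of this identity (a minimal sketch reusing the `compute_rope_frequencies` helper defined earlier), the two forms give the same wavelengths:

# Wavelengths computed from the frequencies vs. directly from the base
d_check = 64
base_check = 10000
pair_idx = np.arange(d_check // 2)

from_freqs = 2 * np.pi / compute_rope_frequencies(d_check, base_check)
direct = 2 * np.pi * base_check ** (2 * pair_idx / d_check)

print(np.allclose(from_freqs, direct))  # expected: True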

The critical question becomes: how does each wavelength compare to the training context length $L$? If $\lambda_i < L$, the pair completes at least one full rotation during training, meaning the model has seen all possible rotation states. If $\lambda_i > L$, the pair completes less than one rotation, meaning the model has only seen a fraction of the rotation cycle. This ratio $r_i = \lambda_i / L$ will be central to YaRN's design.

In[8]:
Code
def compute_wavelengths(d_model, base=10000):
    """Compute wavelengths for each dimension pair."""
    frequencies = compute_rope_frequencies(d_model, base)
    wavelengths = 2 * np.pi / frequencies
    return wavelengths


# Analyze wavelengths relative to context length
d_model = 64
base = 10000
original_context = 2048  # Original training context
extension_factor = 4.0
extended_context = original_context * extension_factor

wavelengths = compute_wavelengths(d_model, base)
Out[9]:
Console
Original context: 2048 positions
Extended context: 8192 positions

Pair     Wavelength      vs Original L        vs Extended L       
---------------------------------------------------------------
0        6.3             0.0031               0.0008              
4        19.9            0.0097               0.0024              
8        62.8            0.0307               0.0077              
16       628.3           0.3068               0.0767              
24       6283.2          3.0680               0.7670              
31       47117.2         23.0065              5.7516              

The "vs Original L" column shows the wavelength ratio ri=λi/Lr_i = \lambda_i / L. This ratio is the key to understanding which dimension pairs need interpolation:

  • When $r_i < 1$, the wavelength is shorter than the context, meaning the pair completes at least one full rotation during training. These pairs have learned the full rotation cycle.
  • When $r_i > 1$, the wavelength exceeds the context, meaning the pair doesn't complete even one rotation during training. These pairs operate in a limited portion of the rotation cycle.

Let's trace through the table to build intuition. Dimension pair 0 has a wavelength of about 6 positions and $r_0 \approx 0.003$. It completes roughly 325 full rotations within the training context, so it has thoroughly learned how to use rotation for position encoding. When we extend to 8,192 positions, pair 0 still completes over 1,300 rotations. No problem here.

At the other extreme, dimension pair 31 has a wavelength of around 47,000 positions and $r_{31} \approx 23$. During training, it completes only about 4% of a single rotation. The model has learned to encode position using just this small arc of the rotation cycle. If we extend the context by 4x without interpolation, we'd ask pair 31 to cover roughly 17% of its rotation cycle, using rotation states it has never encountered. This is extrapolation, and it degrades model quality.
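To make the two extremes concrete, the arithmetic behind them is a short loop (a small sketch using the `wavelengths`, `original_context`, and `extended_context` values from the cell above):

# Number of full rotations a pair completes = context length / wavelength
for pair in [0, 31]:
    train_rotations = original_context / wavelengths[pair]
    extended_rotations = extended_context / wavelengths[pair]
    print(
        f"pair {pair:2d}: {train_rotations:9.3f} rotations in the training context, "
        f"{extended_rotations:9.3f} in the extended context"
    )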

The pattern is clear: pairs with small $r_i$ don't need interpolation; pairs with large $r_i$ do. The question is where to draw the line, and whether the transition should be abrupt or smooth. YaRN's ramp function provides the answer.

The YaRN Ramp Function

Now we can design a function that decides how much interpolation each dimension pair receives. We want three behaviors:

  1. For pairs with small $r_i$ (short wavelengths, many rotations during training): apply no interpolation, leaving them at their original frequencies.
  2. For pairs with large $r_i$ (long wavelengths, few rotations during training): apply full interpolation, scaling them by $1/s$ just like Position Interpolation would.
  3. For pairs with intermediate $r_i$: apply partial interpolation, blending smoothly between the two extremes.

This is exactly what a ramp function achieves. YaRN defines $\gamma(r)$ as a piecewise function:

$$\gamma(r) = \begin{cases} 1 & \text{if } r < \alpha \\ \frac{1}{s} & \text{if } r > \beta \\ (1 - w) + w \cdot \frac{1}{s} & \text{otherwise} \end{cases}$$

where:

  • $r = \lambda_i / L$: the wavelength-to-context ratio for dimension pair $i$, comparing the rotation period to the training context
  • $\lambda_i$: the wavelength for dimension pair $i$
  • $L$: the original training context length (e.g., 2048 or 4096 tokens)
  • $s$: the extension scale factor (e.g., 4 for extending from 2048 to 8192)
  • $\alpha$: the lower threshold (typically 1.0), below which no interpolation is applied
  • $\beta$: the upper threshold (typically 32.0), above which full interpolation is applied
  • $w = (r - \alpha) / (\beta - \alpha)$: the interpolation weight in the ramp region, ranging from 0 to 1
  • $\gamma(r)$: the output interpolation factor, ranging from 1 (no interpolation) to $1/s$ (full interpolation)

Let's unpack the middle case, which is the most interesting. The weight $w$ measures where $r$ falls within the transition region $[\alpha, \beta]$. At the left edge ($r = \alpha$), we have $w = 0$, so the expression becomes $(1 - 0) + 0 \cdot (1/s) = 1$. At the right edge ($r = \beta$), we have $w = 1$, giving $(1 - 1) + 1 \cdot (1/s) = 1/s$. For values between the edges, we get a linear blend. This creates a smooth ramp rather than an abrupt step, which helps the model adapt more gracefully.

Why α = 1 and β = 32?

The default thresholds have intuitive interpretations. When $\alpha = 1$, dimension pairs with wavelength less than the context length (completing at least one full rotation during training) don't need interpolation. When $\beta = 32$, dimension pairs with wavelength more than 32 times the context length (completing less than about 3% of a rotation) need full interpolation. These values were empirically validated across multiple model architectures, striking a balance between preserving high-frequency information and ensuring stable extrapolation.
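It's worth checking where the pairs of our running example actually fall (a quick sketch with the default thresholds and the `wavelengths` array computed earlier):

alpha, beta = 1.0, 32.0
ratios = wavelengths / original_context  # r_i = lambda_i / L

print("no interpolation (r < alpha):", int(np.sum(ratios < alpha)))
print("ramp region (alpha <= r <= beta):", int(np.sum((ratios >= alpha) & (ratios <= beta))))
print("full interpolation (r > beta):", int(np.sum(ratios > beta)))

For this particular configuration ($d = 64$, $L = 2048$), no pair actually exceeds $\beta$: the largest ratio is only about 23 (pair 31 in the table above), so every slow pair lands in the ramp region rather than receiving the full $1/s$ factor. Whether any pair crosses $\beta$ depends on the RoPE base and the training length.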

To summarize, the ramp function creates three distinct regions:

  1. Short wavelengths ($r < \alpha$): These pairs already rotate many times within the original context. They don't need interpolation because they can naturally handle the extended positions.

  2. Long wavelengths ($r > \beta$): These pairs rotate slowly and need full Position Interpolation treatment to avoid extrapolation.

  3. Middle wavelengths ($\alpha \leq r \leq \beta$): These pairs receive partial interpolation, smoothly blending between the two extremes.

In[10]:
Code
def yarn_gamma(wavelength, context_length, scale, alpha=1.0, beta=32.0):
    """Compute YaRN interpolation factor for a given wavelength."""
    r = wavelength / context_length

    if r < alpha:
        # Short wavelength: no interpolation
        return 1.0
    elif r > beta:
        # Long wavelength: full interpolation
        return 1.0 / scale
    else:
        # Ramp region: smooth transition
        w = (r - alpha) / (beta - alpha)
        # Linear blend between 1 (no interpolation) and 1/scale (full interpolation)
        return (1 - w) * 1.0 + w * (1.0 / scale)


def compute_yarn_frequencies(
    d_model, base=10000, context_length=2048, scale=4.0, alpha=1.0, beta=32.0
):
    """Compute YaRN-adjusted frequencies."""
    original_freqs = compute_rope_frequencies(d_model, base)
    wavelengths = compute_wavelengths(d_model, base)

    gamma_values = np.array(
        [yarn_gamma(w, context_length, scale, alpha, beta) for w in wavelengths]
    )

    # Apply gamma as a frequency multiplier
    # gamma < 1 means we slow down the rotation (interpolation)
    yarn_freqs = original_freqs * gamma_values

    return yarn_freqs, gamma_values

Let's visualize the ramp function and its effect on frequencies:

In[11]:
Code
# Compute YaRN frequencies
d_model = 64
context_length = 2048
scale = 4.0

yarn_freqs, gamma_values = compute_yarn_frequencies(
    d_model, context_length=context_length, scale=scale
)
original_freqs = compute_rope_frequencies(d_model)
wavelengths = compute_wavelengths(d_model)

# Compute wavelength ratios
r_values = wavelengths / context_length
Out[12]:
Visualization
Line plot showing step-like gamma function with smooth ramp transition.
The YaRN gamma function versus wavelength ratio. Pairs with short wavelengths (r < 1) receive no interpolation (gamma = 1). Pairs with long wavelengths (r > 32) receive full interpolation (gamma = 0.25 for 4x extension). The ramp region smoothly connects these extremes.
Log-scale plot comparing YaRN, PI, and NTK frequency spectra.
Effect on frequencies. YaRN preserves high frequencies (left side) while interpolating low frequencies (right side). Compare to Position Interpolation, which uniformly scales all frequencies, and NTK-aware scaling, which bends the entire curve.

The ramp function creates a piecewise linear transition on the log-wavelength scale. The left panel shows how $\gamma$ drops from 1 to $1/s$ as wavelength increases. The right panel shows the resulting frequency spectrum: YaRN preserves high frequencies (matching the original) while smoothly transitioning to interpolated low frequencies.

Attention Temperature Scaling

We've now addressed the frequency problem with the ramp function. But recall that we identified two problems at the start: frequency distortion and entropy shift. Even with perfect frequency adjustments, interpolation changes something fundamental about attention patterns.

To understand why, think about what interpolation does geometrically. When we slow down rotations (by multiplying frequencies by $\gamma < 1$), positions that were previously "far apart" in the rotated embedding space now appear "closer." Consider two tokens at positions 0 and 100. In the original RoPE, pair 0 might rotate these to be 100 radians apart (before wrapping). With interpolation, the same pair rotates them to only 25 radians apart (for 4x extension). This compression happens across all interpolated dimension pairs.

This compression has a direct consequence for attention scores. The dot product between query and key vectors measures their similarity. When positions appear closer together in the rotated space, the dot products become more similar to each other. The range of attention scores shrinks.

Now recall how softmax works. Given a vector of scores $\mathbf{z}$:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

When input scores are spread out (high variance), exponentiating amplifies the differences, producing a peaked distribution that focuses on the highest scores. When input scores are compressed (low variance), the exponentials are more similar, producing a flatter, more uniform distribution. This increased uniformity is what we measure as higher entropy.

The solution is conceptually simple: if scores are too compressed, stretch them back out. We do this by multiplying the scores by a factor greater than 1 before applying softmax. In temperature scaling terminology, we reduce the "temperature" (making the distribution sharper) by dividing by a temperature $T < 1$, or equivalently, multiplying by $1/T$. YaRN frames this as multiplying by $\sqrt{t}$ where $t > 1$.
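The effect is easy to see on a single score vector. The sketch below (reusing the `softmax` and `compute_entropy` helpers from earlier) compresses random scores, which raises entropy, and then applies a modest stretch of the kind YaRN uses; the stretch pulls entropy back down, though only partially:

np.random.seed(0)
scores = np.random.randn(64)                                # one row of attention scores
compressed = scores / 2.0                                   # interpolation-style compression
corrected = compressed * np.sqrt(0.1 * np.log(4.0) + 1.0)   # sqrt(t) for s = 4, per the formula below

for label, z in [("original", scores), ("compressed", compressed), ("corrected", corrected)]:
    print(f"{label:>10s} entropy: {compute_entropy(softmax(z)):.4f}")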

The temperature factor is computed as:

$$t = 0.1 \cdot \ln(s) + 1$$

where:

  • $t$: the attention temperature scaling factor (always $\geq 1$)
  • $s$: the context extension scale factor (e.g., 4 for 4x extension)
  • $\ln(s)$: the natural logarithm of the extension factor
  • $0.1$: an empirically determined coefficient that controls the rate of temperature increase
  • $1$: the base value ensuring $t = 1$ when $s = 1$ (no extension)

Why $\sqrt{t}$ rather than $t$ directly? This relates to how variance scales under multiplication. If we multiply scores by a constant $c$, the variance of the scores is multiplied by $c^2$. So to increase variance by a factor of $t$, we need to multiply the scores by $\sqrt{t}$.
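A two-line check makes the variance argument concrete (this is just the general scaling property, not anything specific to YaRN):

rng_scores = np.random.default_rng(1).normal(size=10_000)
c = 1.5

print(np.var(c * rng_scores))       # equals c**2 * np.var(rng_scores), up to floating point
print(c**2 * np.var(rng_scores))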

The logarithmic form of $t$ captures an empirical observation: the entropy shift doesn't grow linearly with the extension factor. Larger extensions need progressively less aggressive correction. Doubling the extension factor from 4x to 8x doesn't require doubling the temperature adjustment. Instead, it adds a constant amount: $\ln(8) - \ln(4) = \ln(2) \approx 0.69$, contributing an additional $0.069$ to $t$.

In[13]:
Code
def compute_yarn_temperature(scale):
    """Compute YaRN attention temperature scaling factor."""
    return 0.1 * np.log(scale) + 1.0


# Compute temperature for various extension factors
extension_factors = [2, 4, 8, 16, 32, 64, 128]
temperatures = [compute_yarn_temperature(s) for s in extension_factors]
Out[14]:
Console
YaRN temperature scaling factors:
Extension Factor     Temperature (t)      sqrt(t)             
------------------------------------------------------------
2                    1.0693               1.0341
4                    1.1386               1.0671
8                    1.2079               1.0991
16                   1.2773               1.1302
32                   1.3466               1.1604
64                   1.4159               1.1899
128                  1.4852               1.2187

The $\sqrt{t}$ column shows the actual multiplier applied to attention scores. Even for very large extension factors (128x), the multiplier stays around 1.2, indicating that temperature correction is a subtle adjustment rather than a dramatic rescaling.

The temperature factor grows slowly with extension factor. For a 4x extension, we multiply attention scores by $\sqrt{1.139} \approx 1.067$. For 32x extension, the multiplier is $\sqrt{1.347} \approx 1.161$. These modest corrections help maintain attention sharpness.

Let's verify the effect on entropy:

In[15]:
Code
def apply_yarn_rope(
    x, position, frequencies, scale, context_length, alpha=1.0, beta=32.0
):
    """Apply YaRN-adjusted RoPE."""
    d = len(x)
    wavelengths = compute_wavelengths(d)

    # Compute gamma for each dimension pair
    gamma_vals = np.array(
        [yarn_gamma(w, context_length, scale, alpha, beta) for w in wavelengths]
    )

    # Adjust frequencies
    adjusted_freqs = frequencies * gamma_vals

    return apply_rope(x, position, adjusted_freqs)


def compute_yarn_attention_scores(
    seq_len, d_model, scale, context_length, apply_temperature=True, seed=42
):
    """Compute attention scores with YaRN adjustments."""
    np.random.seed(seed)

    original_freqs = compute_rope_frequencies(d_model)

    Q = np.random.randn(seq_len, d_model)
    K = np.random.randn(seq_len, d_model)

    Q_rotated = np.array(
        [
            apply_yarn_rope(Q[i], i, original_freqs, scale, context_length)
            for i in range(seq_len)
        ]
    )
    K_rotated = np.array(
        [
            apply_yarn_rope(K[i], i, original_freqs, scale, context_length)
            for i in range(seq_len)
        ]
    )

    scores = Q_rotated @ K_rotated.T / np.sqrt(d_model)

    if apply_temperature:
        t = compute_yarn_temperature(scale)
        scores = scores * np.sqrt(t)

    return scores
In[16]:
Code
# Compare entropy with and without temperature scaling
seq_len = 64
d_model = 64
scale = 4.0
context_length = 2048

# YaRN without temperature
yarn_scores_no_temp = compute_yarn_attention_scores(
    seq_len, d_model, scale, context_length, apply_temperature=False
)
yarn_weights_no_temp = softmax(yarn_scores_no_temp)
yarn_entropy_no_temp = compute_entropy(yarn_weights_no_temp)

# YaRN with temperature
yarn_scores_with_temp = compute_yarn_attention_scores(
    seq_len, d_model, scale, context_length, apply_temperature=True
)
yarn_weights_with_temp = softmax(yarn_scores_with_temp)
yarn_entropy_with_temp = compute_entropy(yarn_weights_with_temp)
Out[17]:
Console
Effect of YaRN temperature scaling on attention entropy:
Method                         Mean Entropy    Std Dev        
------------------------------------------------------------
Original RoPE                  3.6652           0.1447
YaRN (no temperature)          3.6652           0.1447
YaRN (with temperature)        3.6008           0.1666

Temperature scaling helps bring the entropy closer to the original distribution. While the match isn't perfect (which is expected given the fundamental changes to the attention geometry), the correction prevents the excessive entropy increase that would otherwise occur.

The Complete YaRN Formula

Now we can put together the complete YaRN transformation. Given a model trained on context length $L$ that we want to extend by factor $s$, the process involves three steps.

Step 1: Compute adjusted frequencies

For each dimension pair $i$ with base frequency $\theta_i$ and wavelength $\lambda_i = 2\pi/\theta_i$, compute the adjusted frequency:

$$\theta'_i = \theta_i \cdot \gamma\left(\frac{\lambda_i}{L}\right)$$

where:

  • $\theta'_i$: the adjusted frequency for dimension pair $i$
  • $\theta_i$: the original RoPE frequency
  • $\gamma(\cdot)$: the ramp function defined earlier
  • $\lambda_i / L$: the wavelength ratio that determines the interpolation amount

Step 2: Apply RoPE with adjusted frequencies

For query or key vector $\mathbf{x}$ at position $m$, apply the block-diagonal rotation matrix:

$$\text{YaRN-RoPE}(\mathbf{x}, m) = \begin{pmatrix} R(\theta'_0 \cdot m) & & \\ & \ddots & \\ & & R(\theta'_{d/2-1} \cdot m) \end{pmatrix} \mathbf{x}$$

where:

  • $\text{YaRN-RoPE}(\mathbf{x}, m)$: the rotated vector, now incorporating YaRN's frequency adjustments
  • $R(\theta'_i \cdot m)$: a $2 \times 2$ rotation matrix for angle $\theta'_i \cdot m$, applied to dimension pair $i$
  • $\theta'_i \cdot m$: the rotation angle, which increases linearly with position $m$ at the adjusted rate
  • The block-diagonal structure means each dimension pair is rotated independently

Step 3: Apply temperature scaling to attention scores

After computing the raw attention scores, apply the temperature correction:

$$\text{score}(m, n) = \sqrt{t} \cdot \frac{\mathbf{q}_m^\top \mathbf{k}_n}{\sqrt{d_k}}$$

where:

  • $\text{score}(m, n)$: the scaled attention score between positions $m$ and $n$
  • $\mathbf{q}_m$: the YaRN-rotated query vector at position $m$
  • $\mathbf{k}_n$: the YaRN-rotated key vector at position $n$
  • $d_k$: the dimension of key vectors
  • $t = 0.1 \ln(s) + 1$: the temperature factor, computed from the extension scale
  • $\sqrt{t}$: the scaling multiplier applied to raw scores (increases score magnitude to sharpen attention)
  • The scaling occurs before softmax normalization
In[18]:
Code
class YaRNRoPE:
    """YaRN-adjusted Rotary Position Embedding."""

    def __init__(
        self,
        d_model,
        original_context=2048,
        scale=4.0,
        base=10000,
        alpha=1.0,
        beta=32.0,
    ):
        """Initialize YaRN RoPE.

        Args:
            d_model: Embedding dimension (must be even)
            original_context: Original training context length
            scale: Context extension factor
            base: RoPE frequency base
            alpha: Lower wavelength threshold
            beta: Upper wavelength threshold
        """
        self.d_model = d_model
        self.original_context = original_context
        self.scale = scale
        self.alpha = alpha
        self.beta = beta

        # Compute adjusted frequencies
        self.original_freqs = compute_rope_frequencies(d_model, base)
        self.wavelengths = compute_wavelengths(d_model, base)

        self.gamma_values = np.array(
            [
                yarn_gamma(w, original_context, scale, alpha, beta)
                for w in self.wavelengths
            ]
        )

        self.adjusted_freqs = self.original_freqs * self.gamma_values

        # Compute temperature factor
        self.temperature = compute_yarn_temperature(scale)

    def apply(self, x, position):
        """Apply YaRN-adjusted RoPE to a vector.

        Args:
            x: Input vector of shape (d_model,)
            position: Position index

        Returns:
            Rotated vector of shape (d_model,)
        """
        return apply_rope(x, position, self.adjusted_freqs)

    def apply_batch(self, x):
        """Apply YaRN-adjusted RoPE to a batch of vectors.

        Args:
            x: Input tensor of shape (seq_len, d_model)

        Returns:
            Rotated tensor of shape (seq_len, d_model)
        """
        seq_len, d = x.shape
        positions = np.arange(seq_len)

        # Compute all rotation angles
        angles = np.outer(positions, self.adjusted_freqs)

        cos_angles = np.cos(angles)
        sin_angles = np.sin(angles)

        x_pairs = x.reshape(seq_len, -1, 2)

        x_rotated = np.stack(
            [
                x_pairs[:, :, 0] * cos_angles - x_pairs[:, :, 1] * sin_angles,
                x_pairs[:, :, 0] * sin_angles + x_pairs[:, :, 1] * cos_angles,
            ],
            axis=-1,
        )

        return x_rotated.reshape(seq_len, d)

    def scale_attention(self, scores):
        """Apply temperature scaling to attention scores.

        Args:
            scores: Attention scores of shape (seq_len, seq_len)

        Returns:
            Scaled scores
        """
        return scores * np.sqrt(self.temperature)

Let's test the complete YaRN implementation:

In[19]:
Code
# Create YaRN module
yarn = YaRNRoPE(d_model=64, original_context=2048, scale=4.0)

# Test with random input
np.random.seed(42)
seq_len = 32
test_input = np.random.randn(seq_len, 64)

# Apply YaRN RoPE
rotated = yarn.apply_batch(test_input)
Out[20]:
Console
YaRN Configuration:
  Original context: 2048
  Extension scale: 4.0x
  Extended context: 8192
  Temperature: 1.1386
  Attention scale: sqrt(t) = 1.0671

Input shape: (32, 64)
Output shape: (32, 64)
Magnitude preserved: True

The implementation confirms that YaRN preserves vector magnitudes (rotation is an isometry) while applying the adjusted frequencies.
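The cell that produced the output above isn't shown in full; a minimal version of the magnitude check (assuming the `test_input` and `rotated` arrays from the previous cell) could look like this:

input_norms = np.linalg.norm(test_input, axis=-1)
output_norms = np.linalg.norm(rotated, axis=-1)

print("Magnitude preserved:", bool(np.allclose(input_norms, output_norms)))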

Visualizing YaRN's Effect

Let's visualize how YaRN affects the attention patterns compared to other methods:

In[21]:
Code
def compute_attention_matrix(
    seq_len, d_model, freqs, scale_attention=1.0, seed=42
):
    """Compute full attention weight matrix."""
    np.random.seed(seed)

    Q = np.random.randn(seq_len, d_model)
    K = np.random.randn(seq_len, d_model)

    Q_rotated = np.array([apply_rope(Q[i], i, freqs) for i in range(seq_len)])
    K_rotated = np.array([apply_rope(K[i], i, freqs) for i in range(seq_len)])

    scores = Q_rotated @ K_rotated.T / np.sqrt(d_model)
    scores = scores * scale_attention

    weights = softmax(scores)
    return weights


# Compute attention patterns for each method
seq_len = 32
d_model = 64
scale = 4.0

# YaRN with temperature
yarn_module = YaRNRoPE(d_model=d_model, original_context=2048, scale=scale)
yarn_attn = compute_attention_matrix(
    seq_len,
    d_model,
    yarn_module.adjusted_freqs,
    scale_attention=np.sqrt(yarn_module.temperature),
)

# Original
original_attn = compute_attention_matrix(seq_len, d_model, original_freqs)

# Position Interpolation
pi_attn = compute_attention_matrix(seq_len, d_model, pi_freqs)

# NTK-aware
ntk_attn = compute_attention_matrix(seq_len, d_model, ntk_freqs)
Out[22]:
Visualization
Heatmap showing attention weights with diagonal emphasis.
Original RoPE attention pattern. This is the baseline distribution the model learned during training.
Heatmap showing more uniform attention distribution.
Position Interpolation attention. The pattern becomes more diffuse due to compressed position distances.
Heatmap showing moderate attention pattern preservation.
NTK-aware attention. Preserves more structure than PI but still shows increased entropy.
Heatmap showing attention pattern similar to original.
YaRN attention. The combination of selective interpolation and temperature scaling produces a pattern closer to the original.

The attention heatmaps reveal the differences between methods. YaRN produces patterns that more closely resemble the original RoPE attention, with sharper focus on relevant positions rather than the diffuse patterns seen with Position Interpolation.

YaRN Training Requirements

An important practical consideration is whether YaRN requires fine-tuning. The answer depends on the extension factor:

For modest extensions (2x to 4x): YaRN can often be applied without any fine-tuning, leveraging the model's existing weights. The frequency adjustments and temperature scaling are designed to minimize the distribution shift.

For larger extensions (8x to 32x): Brief fine-tuning significantly improves quality. The YaRN paper recommends 200 to 400 training steps on long-context data, much less than training from scratch.

For extreme extensions (64x+): Longer fine-tuning becomes necessary, though still much shorter than pre-training from scratch.

The key advantage of YaRN over naive approaches is the efficiency of this fine-tuning. Because the method preserves the essential structure of the position encoding while making targeted adjustments, the model needs to learn only minor adaptations rather than fundamentally new position representations.

Out[23]:
Console
Recommended YaRN fine-tuning budget:

Extension Factor     Recommended Steps         Notes                         
---------------------------------------------------------------------------
2x - 4x              0 (optional)              Works zero-shot               
4x - 8x              100 - 200                 Brief warmup helps            
8x - 16x             200 - 400                 Recommended by paper          
16x - 32x            400 - 1000                More adaptation needed        
32x+                 1000+                     Extended fine-tuning          

YaRN vs Alternatives

Let's summarize how YaRN compares to other context extension methods:

Comparison of RoPE extension methods. YaRN's combination of targeted frequency adjustment and attention temperature scaling addresses limitations of simpler approaches.
| Method | Key Mechanism | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Position Interpolation | Uniform position scaling | Simple, no new parameters | Loses high-frequency information |
| NTK-aware | Base frequency adjustment | Better frequency preservation | Still uniform across dimensions |
| Dynamic NTK | Position-dependent scaling | Adapts to sequence length | Complex, training instability |
| YaRN | Ramp function + temperature | Selective interpolation, entropy control | Two hyperparameters (α, β) |

The main advantages of YaRN are:

  1. Selective preservation: High-frequency dimensions that don't need interpolation are left unchanged, maintaining fine-grained position discrimination.

  2. Entropy control: Temperature scaling prevents the attention distribution from becoming too diffuse, preserving the model's ability to focus.

  3. Efficient adaptation: The targeted adjustments require less fine-tuning to adapt to than more disruptive methods.

  4. Predictable behavior: The ramp function provides intuitive control over which dimensions are interpolated and by how much.

In[24]:
Code
# Quantitative comparison across methods
def evaluate_method(name, freqs, scale_factor=1.0, n_trials=10):
    """Evaluate position encoding method on multiple metrics."""
    entropies = []
    relative_position_errors = []

    for trial in range(n_trials):
        # Compute entropy
        scores = compute_attention_matrix(
            64, 64, freqs, scale_attention=scale_factor, seed=trial
        )
        entropy = compute_entropy(scores)
        entropies.append(np.mean(entropy))

        # Check relative position consistency
        np.random.seed(trial)
        q = np.random.randn(64)
        k = np.random.randn(64)

        # Compare scores at same relative distance but different absolute positions
        q0 = apply_rope(q, 0, freqs)
        k2 = apply_rope(k, 2, freqs)
        score_near = np.dot(q0, k2)

        q50 = apply_rope(q, 50, freqs)
        k52 = apply_rope(k, 52, freqs)
        score_far = np.dot(q50, k52)

        relative_position_errors.append(abs(score_near - score_far))

    return {
        "mean_entropy": np.mean(entropies),
        "std_entropy": np.std(entropies),
        "mean_rel_error": np.mean(relative_position_errors),
    }


# Evaluate all methods
yarn_module = YaRNRoPE(d_model=64, original_context=2048, scale=4.0)

methods = {
    "Original": (original_freqs, 1.0),
    "Position Interpolation": (pi_freqs, 1.0),
    "NTK-aware": (ntk_freqs, 1.0),
    "YaRN": (yarn_module.adjusted_freqs, np.sqrt(yarn_module.temperature)),
}

results = {
    name: evaluate_method(name, freqs, scale)
    for name, (freqs, scale) in methods.items()
}
Out[25]:
Console
Method Comparison (4x extension):

Method                    Mean Entropy       Entropy Std        Rel. Pos. Error   
-------------------------------------------------------------------------------
Original                  3.6870             0.0206             4.56e-15
Position Interpolation    3.6997             0.0231             2.66e-15
NTK-aware                 3.6896             0.0194             7.42e-15
YaRN                      3.6253             0.0234             4.12e-15

The "Mean Entropy" column shows how diffuse the attention distribution is, with lower values indicating sharper focus. The "Rel. Pos. Error" measures whether the same relative distance produces the same attention score at different absolute positions. Values near zero indicate perfect relative position encoding.

The comparison shows that YaRN achieves entropy closer to the original while maintaining the relative position property (low position error). The combination of targeted interpolation and temperature scaling addresses both the frequency and entropy aspects of context extension.

Limitations and Considerations

Despite its effectiveness, YaRN has limitations worth understanding.

Hyperparameter sensitivity: The $\alpha$ and $\beta$ parameters control the ramp function's transition region. While the defaults (1 and 32) work well for many models, some architectures may benefit from tuning. Models with different RoPE bases or dimension sizes may require adjustment.

Temperature approximation: The temperature formula $t = 0.1 \ln(s) + 1$ is empirically derived rather than theoretically optimal. It works well in practice but may not perfectly compensate for entropy shifts in all scenarios.

Integration complexity: Unlike Position Interpolation (which only modifies position indices) or NTK-aware scaling (which only modifies the base), YaRN requires two separate modifications. Both the frequency adjustment and the attention scaling must be implemented correctly.

Interaction with other optimizations: When combined with techniques like FlashAttention or grouped-query attention, care must be taken to apply the temperature scaling at the correct point in the computation. The scaling should occur before the softmax, not after.
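One practical way to keep a fused attention kernel untouched is to fold the factor into the queries before calling it: since $\sqrt{t} \cdot (\mathbf{q}_m^\top \mathbf{k}_n) = (\sqrt{t}\,\mathbf{q}_m)^\top \mathbf{k}_n$, scaling the rotated queries by $\sqrt{t}$ yields exactly the same pre-softmax scores. A hedged sketch using the `YaRNRoPE` class from earlier:

def yarn_rotate_with_temperature(yarn, Q, K):
    """Rotate Q and K with YaRN and fold sqrt(t) into the queries.

    The returned tensors can be passed to a standard attention
    implementation without further score scaling, because
    sqrt(t) * (q . k) == (sqrt(t) * q) . k.
    """
    Q_rotated = yarn.apply_batch(Q) * np.sqrt(yarn.temperature)
    K_rotated = yarn.apply_batch(K)
    return Q_rotated, K_rotated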

These limitations are manageable in practice. YaRN has been successfully integrated into many open-source models, and the community has developed reference implementations that handle the integration details correctly.

Key Parameters

When implementing YaRN, several parameters control its behavior:

  • scale: The context extension factor (e.g., 4.0 for extending from 2048 to 8192 tokens). Larger values enable longer contexts but may require more fine-tuning.

  • alpha: The lower wavelength threshold (default: 1.0). Dimension pairs with wavelength-to-context ratio below this value receive no interpolation. Decreasing $\alpha$ applies interpolation to more dimension pairs.

  • beta: The upper wavelength threshold (default: 32.0). Dimension pairs with wavelength-to-context ratio above this value receive full interpolation. Increasing $\beta$ delays full interpolation to higher wavelength pairs.

  • base: The RoPE frequency base (default: 10000). This should match the base used in the original model's RoPE implementation.

  • original_context: The context length the model was originally trained on (e.g., 2048 or 4096). This determines the wavelength ratios used in the ramp function.

For most applications, the default values of $\alpha = 1$ and $\beta = 32$ work well. The primary parameter to adjust is scale, which should be set to the desired extension factor. If the model uses a non-standard RoPE base, ensure base matches the original configuration.
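As a usage sketch, here is how these parameters map onto the `YaRNRoPE` class built earlier. The specific numbers are illustrative rather than tied to any particular model:

# Hypothetical example: extend a 4096-token model to 16384 tokens (scale = 4)
yarn_config = YaRNRoPE(
    d_model=128,             # embedding dimension per head, must be even
    original_context=4096,   # context length the model was trained on
    scale=4.0,               # desired extension factor
    base=10000,              # must match the model's original RoPE base
    alpha=1.0,               # default lower wavelength threshold
    beta=32.0,               # default upper wavelength threshold
)

print("temperature t:", round(yarn_config.temperature, 4))
print("attention multiplier sqrt(t):", round(np.sqrt(yarn_config.temperature), 4))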

Summary

YaRN provides a principled approach to extending context length in RoPE-based language models. By combining selective frequency interpolation with attention temperature scaling, it addresses limitations in both Position Interpolation and NTK-aware methods.

Key takeaways:

  • Wavelength-based interpolation: YaRN uses a ramp function to apply no interpolation to high-frequency dimension pairs, full interpolation to low-frequency pairs, and a smooth transition in between. This preserves fine-grained position information where it matters.

  • Attention temperature correction: Context extension changes attention score distributions, increasing entropy. YaRN compensates with a temperature scaling factor $\sqrt{t}$ where $t = 0.1 \ln(s) + 1$.

  • Efficient fine-tuning: The targeted adjustments minimize distribution shift, enabling effective context extension with as few as 200 to 400 fine-tuning steps for moderate extension factors.

  • Complementary to other methods: YaRN can be seen as a refinement that combines insights from Position Interpolation (full interpolation for long wavelengths) and NTK-aware scaling (preservation of short wavelengths), while adding the novel temperature correction.

  • Practical deployment: YaRN has been integrated into many open-source models and inference frameworks, demonstrating its practical viability for real-world context extension.

The next chapter explores attention sinks, a phenomenon where transformer models allocate disproportionate attention to initial tokens regardless of their semantic relevance, and how this affects long-context processing.
