Sinusoidal Position Encoding: How Transformers Know Word Order

Michael Brenndoerfer · Updated June 6, 2025 · 32 min read

Master sinusoidal position encoding, the deterministic method that gives transformers positional awareness. Learn the mathematics behind sine/cosine waves and the elegant relative position property.

Sinusoidal Position Encoding

The transformer's self-attention mechanism is permutation invariant: it produces the same output regardless of token order. To capture sequential structure, we need to inject position information. The original "Attention Is All You Need" paper introduced sinusoidal position encoding, an elegant solution that encodes each position as a unique pattern of sine and cosine waves at different frequencies.

This approach offers several compelling properties. Each position receives a deterministic, unique encoding without any learned parameters. The encoding can represent positions the model has never seen during training. And remarkably, relative positions can be computed through simple linear transformations of absolute positions. Let's understand how this works.

The Position Encoding Formula

Before diving into the mathematics, let's consider what properties we want from a position encoding. We need a function that takes a position index and produces a vector, and this function must satisfy several constraints:

  1. Uniqueness: Each position must map to a distinct vector. If positions 5 and 17 produce the same encoding, the model cannot distinguish them.

  2. Bounded values: The encoding should not grow unboundedly with position. If position 1000 produces values 1000 times larger than position 1, the position signal would overwhelm the semantic content of the embeddings.

  3. Smooth progression: Nearby positions should have similar encodings. Position 50 should be more similar to position 51 than to position 500, giving the model useful gradient information.

  4. Deterministic: The same position should always produce the same encoding, without requiring any learned parameters.

Sinusoidal functions satisfy all these requirements elegantly. Sine and cosine oscillate smoothly between -1 and 1, ensuring bounded values. Different frequencies distinguish positions at different scales. And the encoding is purely deterministic, computed from a fixed formula.

Building the Encoding Step by Step

The core idea is to assign each position a unique "fingerprint" using waves of different frequencies. Think of how you might describe your location in a building: you could give the floor number (coarse scale), the room number (medium scale), and your position within the room (fine scale). Together, these scales uniquely identify any location.

For position encoding, we use sine and cosine waves at different frequencies to achieve the same multi-scale identification. Each position $pos$ in the sequence receives a $d$-dimensional vector, where consecutive pairs of dimensions use sine and cosine at the same frequency:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

where:

  • $pos$: the position index in the sequence (0, 1, 2, ...)
  • $i$: the dimension pair index (0, 1, 2, ..., $d/2 - 1$)
  • $d$: the total embedding dimension
  • $PE_{(pos, 2i)}$: the encoding value at position $pos$, even dimension $2i$
  • $PE_{(pos, 2i+1)}$: the encoding value at position $pos$, odd dimension $2i+1$
  • $10000$: a base constant that controls the frequency range
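To make the formula concrete, here is a tiny direct evaluation for a single position (a minimal sketch using an arbitrarily small dimension $d = 8$; the full implementation appears later in the chapter):

import numpy as np

d = 8    # small embedding dimension, just for illustration
pos = 3  # an arbitrary position

pe = np.zeros(d)
for i in range(d // 2):
    angle = pos / (10000 ** (2 * i / d))
    pe[2 * i] = np.sin(angle)      # even dimension 2i: sine
    pe[2 * i + 1] = np.cos(angle)  # odd dimension 2i+1: cosine

print(pe.round(3))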

Understanding the Frequency Term

The key to the formula is the denominator $10000^{2i/d}$. This term controls how fast each dimension pair oscillates as position increases. Let's unpack what happens at different dimension indices:

  • When $i = 0$ (the first dimension pair): The denominator is $10000^0 = 1$, so we compute $\sin(pos)$ and $\cos(pos)$. This oscillates rapidly: moving from position 0 to position 6 covers roughly one full cycle.

  • When $i = d/2 - 1$ (the last dimension pair): The denominator is approximately $10000^1 = 10000$, so we compute $\sin(pos/10000)$ and $\cos(pos/10000)$. This oscillates extremely slowly: you need about 62,832 positions to complete one full cycle.

The exponent $2i/d$ creates a geometric progression of frequencies. As $i$ increases from 0 to $d/2 - 1$, the exponent increases from 0 to approximately 1, and the denominator grows from 1 to roughly 10000. This exponential scaling ensures that each dimension pair captures position information at a different resolution.
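A few lines make this geometric growth tangible (a small sketch; $d = 512$ matches the dimension used in the wavelength plot below):

import numpy as np

# Denominator 10000^(2i/d) for a few dimension pairs, together with how many
# positions each pair needs to complete one full cycle
d = 512
for i in [0, 1, d // 4, d // 2 - 1]:
    denom = 10000 ** (2 * i / d)
    print(f"i={i:3d}: denominator = {denom:10.2f}, "
          f"positions per cycle = {2 * np.pi * denom:10.1f}")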

Why Pair Sine and Cosine?

Each dimension pair uses both sine and cosine at the same frequency. This pairing is not arbitrary; it serves two purposes:

  1. Unique identification within a cycle: Sine alone is ambiguous: it cannot distinguish positions that differ by multiples of $2\pi$, nor angles that are reflections of one another (such as $x$ and $\pi - x$). But the (sine, cosine) pair at any frequency uniquely identifies a phase angle. Geometrically, as position increases, the (sin, cos) pair traces a circle in 2D space, and every point on that circle corresponds to a unique position within one cycle (a short sketch after this list makes this concrete).

  2. Enabling relative position computation: The sine/cosine pairing allows relative positions to be computed through rotation matrices, a property we'll explore in detail later. This mathematical structure means the model can potentially learn to attend to relative positions using simple linear operations.
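To make point 1 concrete, here is a small check (a sketch with arbitrary angles) showing that sine alone is ambiguous while the (sine, cosine) pair pins down the phase uniquely:

import numpy as np

a, b = 0.5, np.pi - 0.5  # two different angles with identical sine

print(np.sin(a), np.sin(b))  # same value: sine alone cannot tell them apart
print(np.cos(a), np.cos(b))  # different values: the pair disambiguates

# arctan2 recovers the original angle from the (sin, cos) pair
print(np.arctan2(np.sin(a), np.cos(a)), np.arctan2(np.sin(b), np.cos(b)))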

Sinusoidal Position Encoding

A deterministic method for representing token positions using sine and cosine functions at geometrically increasing wavelengths. Each position maps to a unique point in a $d$-dimensional space without requiring any learned parameters.

Wavelength Intuition

Now that we have the formula, let's build deeper intuition for why this multi-frequency approach works so well. The key insight is that different dimensions encode position at different scales, much like how we represent time using multiple units.

Consider a clock with both a second hand and an hour hand. The second hand rotates rapidly, completing one cycle per minute. If you only had the second hand, you could tell the difference between 3:00:15 and 3:00:45, but you couldn't distinguish 3:00:15 from 4:00:15, since both would show the second hand at the same position. The hour hand solves this problem: it moves slowly, completing one cycle every 12 hours, so it can distinguish times that the second hand cannot.

Sinusoidal position encoding applies the same principle. The first dimension pairs oscillate rapidly (like the second hand), distinguishing nearby positions with high precision. The later dimension pairs oscillate slowly (like the hour hand), distinguishing distant positions that the fast oscillators cannot. Together, they create a multi-resolution representation where any two positions, no matter how close or far apart, can be distinguished by at least one dimension pair.

The wavelength formula makes this precise. For dimension pair $i$, the wavelength (the number of positions needed for one complete oscillation cycle) is:

$$\lambda_i = 2\pi \cdot 10000^{2i/d}$$

where:

  • $\lambda_i$: the wavelength for dimension pair $i$ (measured in positions per cycle)
  • $10000^{2i/d}$: the frequency denominator that grows geometrically with $i$
  • $2\pi$: the angular measure of one complete cycle (in radians)

This formula reveals the geometric progression at the heart of sinusoidal encoding:

  • First dimension pair ($i = 0$): Wavelength is $2\pi \approx 6.28$ positions. Positions 0 through 6 span roughly one full cycle. This fast oscillation distinguishes positions that differ by just 1 or 2.

  • Middle dimension pairs: Wavelengths grow exponentially. By the time we reach the middle dimensions, wavelengths might be in the hundreds, suitable for distinguishing positions that differ by tens or hundreds.

  • Last dimension pair ($i = d/2 - 1$): Wavelength is approximately $2\pi \cdot 10000$, or about 62,832 positions. Even a sequence of 10,000 positions covers only a small fraction of one cycle. This slow oscillation can distinguish positions separated by thousands.

The geometric progression is deliberate and essential. If wavelengths grew linearly, nearby dimension pairs would be redundant, both distinguishing roughly the same positional differences. The exponential growth ensures each dimension pair contributes unique positional information at its own characteristic scale, creating a compact representation that efficiently covers all possible position differences.

In[2]:
Code
import matplotlib.pyplot as plt  # noqa: F401
import numpy as np

# Calculate wavelengths for different dimensions
d = 512  # Embedding dimension
dimension_pairs = np.arange(d // 2)

# Wavelength formula: 2π × 10000^(2i/d)
wavelengths = 2 * np.pi * (10000 ** (2 * dimension_pairs / d))
Out[3]:
Visualization
Log-scale line plot showing wavelength increasing exponentially from about 6 to 63000 across 256 dimension pairs.
Wavelengths grow geometrically across dimension pairs. The first dimensions have wavelengths around 6 (one cycle every ~6 positions), while the last dimensions have wavelengths around 63,000 (completing less than one cycle even for very long sequences).


Visualizing Position Encodings

With the formula and wavelength intuition in place, let's see what sinusoidal position encodings actually look like. We'll implement the encoding from scratch and visualize the resulting patterns to verify that our intuitions match reality.

In[4]:
Code
def sinusoidal_position_encoding(max_len, d_model):
    """
    Generate sinusoidal position encodings.

    Args:
        max_len: Maximum sequence length to encode
        d_model: Dimension of the encoding vectors

    Returns:
        PE: Position encoding matrix of shape (max_len, d_model)
    """
    # Create position indices: (max_len, 1)
    positions = np.arange(max_len)[:, np.newaxis]

    # Create dimension indices for pairs: (d_model/2,)
    dim_pairs = np.arange(0, d_model, 2)

    # Compute the frequency denominator: 10000^(2i/d)
    div_term = 10000 ** (dim_pairs / d_model)

    # Initialize encoding matrix
    PE = np.zeros((max_len, d_model))

    # Even dimensions: sine
    PE[:, 0::2] = np.sin(positions / div_term)

    # Odd dimensions: cosine
    PE[:, 1::2] = np.cos(positions / div_term)

    return PE


# Generate encodings for 100 positions with 64 dimensions
max_len = 100
d_model = 64
PE = sinusoidal_position_encoding(max_len, d_model)
Out[5]:
Console
Position encoding matrix shape: (100, 64)

Encoding for position 0 (first 8 dims):
  [0. 1. 0. 1. 0. 1. 0. 1.]

Encoding for position 1 (first 8 dims):
  [0.841 0.54  0.682 0.732 0.533 0.846 0.409 0.912]

Encoding for position 50 (first 8 dims):
  [-0.262  0.965 -0.203  0.979  0.157 -0.988  0.787 -0.617]

Position 0 has sine values of 0 and cosine values of 1 in every dimension pair, since all of its arguments are zero. As position increases, the high-frequency dimensions (small indices) change rapidly while low-frequency dimensions (large indices) change slowly.

Let's visualize the encoding as a heatmap to see the wave patterns:

Out[6]:
Visualization
Heatmap showing alternating light and dark bands that oscillate rapidly on the left side and slowly on the right side.
Heatmap of sinusoidal position encodings. Each row is a position (0-99), each column is a dimension (0-63). High-frequency oscillations on the left create fine-grained position discrimination, while low-frequency patterns on the right capture coarse positional information. The combination creates a unique fingerprint for each position.

The heatmap reveals the core structure. On the left side (low dimension indices), we see rapid oscillations: positions 0 and 3 might look similar here, but positions 0 and 1 are clearly different. On the right side (high dimension indices), the oscillations are so slow that the entire 100-position range barely covers a fraction of one cycle. The combination ensures every position has a unique encoding.
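We can quantify this contrast directly from the PE matrix computed above (a quick check; not part of the original figure code):

# How much each dimension actually varies across the 100 positions: fast
# (low-index) dimensions sweep nearly their full [-1, 1] range, while the
# slowest (high-index) dimensions barely move
per_dim_range = PE.max(axis=0) - PE.min(axis=0)
print("Range of first 4 dimensions:", per_dim_range[:4].round(3))
print("Range of last 4 dimensions: ", per_dim_range[-4:].round(5))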

Let's examine specific dimension pairs to see the sine/cosine relationship:

Out[7]:
Visualization
Line plot showing sine and cosine waves completing about 15 cycles over 100 positions.
First dimension pair (i=0): High-frequency oscillations complete multiple cycles within 100 positions. Each position has a distinct (sin, cos) pair.
Line plot showing sine and cosine waves completing about 2 cycles over 100 positions.
Middle dimension pair (i=16): Lower frequency means slower oscillation. Positions are still distinguishable but with coarser resolution.

Each dimension pair contributes a (sine, cosine) tuple that traces a circle in 2D space as position increases. The sine and cosine are 90° out of phase, so within one cycle every position gets a unique combination from a single dimension pair; positions separated by a full wavelength are then told apart by the other, lower-frequency pairs.

To make this geometric interpretation concrete, let's plot the trajectory of (sin, cos) pairs as position increases:

Out[8]:
Visualization
Scatter plot showing points tracing multiple circular loops, with color gradient indicating position.
First dimension pair (i=0): As position increases from 0 to 50, the encoding traces a circle. Each position corresponds to a unique point. The fast oscillation means multiple loops within 50 positions.
Scatter plot showing points tracing a partial arc of a circle, with color gradient indicating position.
Middle dimension pair (i=16): The slower frequency means positions are spread across a smaller arc. You'd need many more positions to complete a full circle at this frequency.

The circular trajectories reveal why sine/cosine pairing works so well. In the first dimension pair (left), the fast oscillation means positions loop around the circle multiple times. Even if two positions land at similar angles on the circle, they'll be distinguished by other dimension pairs with different frequencies. In the middle dimension pair (right), positions are spread across a smaller arc, providing coarse-grained discrimination.

Uniqueness of Position Encodings

Why does this encoding give each position a unique vector? Consider two positions $pos_1$ and $pos_2$. For them to have identical encodings, they would need to be indistinguishable across all dimension pairs. But with the geometric progression of wavelengths, this is virtually impossible.

If two positions differ by 1, the first dimension pair (wavelength ~6) will clearly distinguish them. If they differ by 100, middle dimension pairs will distinguish them. If they differ by 10,000, the later dimension pairs will distinguish them. The multi-scale representation captures position differences at any granularity.

Let's verify this by computing distances between position encodings:

In[9]:
Code
def encoding_distance(PE, pos1, pos2):
    """Compute Euclidean distance between two position encodings."""
    return np.linalg.norm(PE[pos1] - PE[pos2])


# Compute pairwise distances for first 50 positions
max_pos = 50
distances = np.zeros((max_pos, max_pos))
for i in range(max_pos):
    for j in range(max_pos):
        distances[i, j] = encoding_distance(PE, i, j)
Out[10]:
Visualization
Heatmap showing pairwise distances between position encodings, with zero on the diagonal and increasing values off-diagonal.
Euclidean distances between position encodings. The diagonal is zero (each position is identical to itself). Distance generally increases with positional separation, but the sinusoidal structure creates a smooth, non-linear relationship rather than strictly monotonic growth.

The distance matrix confirms that no two positions have identical encodings (no zeros off the diagonal). The banded structure shows that nearby positions have smaller distances, while distant positions have larger distances. This smooth distance gradient helps the model learn position-dependent patterns.

Let's examine how distance varies with positional separation more precisely:
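The quantity in the next figure can be computed in a couple of lines from the PE matrix built earlier (a sketch; the plotting code itself is omitted):

# Distance between position 0 and every later position, i.e. encoding distance
# as a function of positional separation
separations = np.arange(1, max_len)
dist_from_zero = np.linalg.norm(PE[separations] - PE[0], axis=1)
print("Distance at separations 1, 5, 10, 50:",
      dist_from_zero[[0, 4, 9, 49]].round(3))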

Out[11]:
Visualization
Line plot showing encoding distance on y-axis versus positional separation on x-axis, with smooth growth at small separations and oscillatory plateau at larger separations.
Encoding distance as a function of positional separation. Distance grows smoothly for small separations, then oscillates and plateaus for larger separations. This non-monotonic behavior arises from the sinusoidal structure: at certain separations, the fast-oscillating dimensions happen to align, reducing the overall distance.

The plot reveals an important property: distance grows quickly for small separations (positions 1-10 are clearly distinguishable from position 0) but then oscillates around a plateau for larger separations. The oscillation comes from the sinusoidal structure: at certain separations, the high-frequency dimensions happen to cycle back to similar values, temporarily reducing the distance. However, the low-frequency dimensions ensure that even these "aliased" positions remain distinguishable.

The Relative Position Property

One of the most elegant properties of sinusoidal encoding is that relative positions can be expressed as linear transformations. For any fixed offset $k$, there exists a matrix $M_k$ such that:

$$PE_{pos+k} = PE_{pos} \cdot M_k$$

where:

  • $PE_{pos}$: the position encoding vector at position $pos$ (a $d$-dimensional row vector)
  • $PE_{pos+k}$: the position encoding vector at position $pos + k$
  • $M_k$: a $d \times d$ transformation matrix that depends only on the offset $k$, not on the absolute position

This means the model can learn to attend to relative positions (e.g., "the word 3 positions back") using simple linear operations.

To understand why this works, we need to derive the relationship step by step using trigonometric identities.

Step 1: Recall the angle addition formulas. For any angles $a$ and $b$, trigonometry gives us:

$$\sin(a + b) = \sin(a)\cos(b) + \cos(a)\sin(b)$$
$$\cos(a + b) = \cos(a)\cos(b) - \sin(a)\sin(b)$$

where $a$ and $b$ are any angles (in radians). These identities let us express the sine or cosine of a sum in terms of the sines and cosines of the individual angles.

Step 2: Define the angular frequency. For dimension pair $i$, we define the angular frequency as:

$$\omega_i = \frac{1}{10000^{2i/d}}$$

where:

  • $\omega_i$: the angular frequency for dimension pair $i$ (determines how fast this dimension oscillates)
  • $i$: the dimension pair index (0, 1, 2, ..., $d/2 - 1$)
  • $d$: the total embedding dimension
  • $10000^{2i/d}$: the denominator that grows geometrically with $i$

This means the encoding at position $pos$ in dimension pair $i$ uses the argument $\omega_i \cdot pos$.

Step 3: Apply the addition formulas. To find the encoding at position $pos + k$, we substitute $a = \omega_i \cdot pos$ and $b = \omega_i \cdot k$ into the angle addition formulas:

$$\sin(\omega_i(pos + k)) = \sin(\omega_i \cdot pos)\cos(\omega_i k) + \cos(\omega_i \cdot pos)\sin(\omega_i k)$$
$$\cos(\omega_i(pos + k)) = \cos(\omega_i \cdot pos)\cos(\omega_i k) - \sin(\omega_i \cdot pos)\sin(\omega_i k)$$

where:

  • $\omega_i = 1/10000^{2i/d}$: the angular frequency for dimension pair $i$
  • $pos$: the current position in the sequence
  • $k$: the position offset we want to shift by
  • $\sin(\omega_i \cdot pos)$ and $\cos(\omega_i \cdot pos)$: the original encoding values at position $pos$ (these are $PE_{(pos, 2i)}$ and $PE_{(pos, 2i+1)}$)
  • $\sin(\omega_i k)$ and $\cos(\omega_i k)$: constants that depend only on the offset $k$, not on the absolute position

Step 4: Recognize the matrix structure. The key insight is that the right-hand sides of both equations are linear combinations of $\sin(\omega_i \cdot pos)$ and $\cos(\omega_i \cdot pos)$. This is exactly what matrix multiplication does! We can write:

$$\begin{bmatrix} \sin(\omega_i(pos + k)) \\ \cos(\omega_i(pos + k)) \end{bmatrix} = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix} \begin{bmatrix} \sin(\omega_i \cdot pos) \\ \cos(\omega_i \cdot pos) \end{bmatrix}$$

This is a rotation in 2D! For each dimension pair, the encoding at $pos + k$ is the encoding at $pos$ rotated by angle $\omega_i k$. The rotation matrix for offset $k$ in dimension pair $i$ is:

$$R_k^{(i)} = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix}$$

where:

  • $R_k^{(i)}$: the 2×2 rotation matrix for dimension pair $i$ with offset $k$
  • $\omega_i = 1/10000^{2i/d}$: the angular frequency for dimension pair $i$
  • $k$: the position offset (how many positions to shift)
  • $\omega_i k$: the rotation angle, which depends on both the offset and the dimension's frequency
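A quick numeric check of this single-pair rotation (a sketch; the choices of $d$, $i$, $pos$, and $k$ are arbitrary):

import numpy as np

# Applying R_k to the (sin, cos) pair at position pos should give the pair at pos + k
d, i, pos, k = 64, 3, 7, 5
omega = 1.0 / 10000 ** (2 * i / d)

R = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

pair_at_pos = np.array([np.sin(omega * pos), np.cos(omega * pos)])
pair_at_pos_plus_k = np.array([np.sin(omega * (pos + k)), np.cos(omega * (pos + k))])

print(np.allclose(R @ pair_at_pos, pair_at_pos_plus_k))  # True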

Step 5: Construct the full transformation matrix. The rotation above acts on the column vector $[\sin, \cos]^\top$ of each dimension pair. Because we wrote $PE_{pos}$ as a row vector in $PE_{pos+k} = PE_{pos} \cdot M_k$, the full transformation matrix $M_k$ is block-diagonal, with each 2×2 block being the transpose of the corresponding rotation matrix:

$$M_k = \begin{bmatrix} R_k^{(0)\top} & 0 & \cdots & 0 \\ 0 & R_k^{(1)\top} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_k^{(d/2-1)\top} \end{bmatrix}$$

where:

  • $M_k$: the full $d \times d$ transformation matrix for offset $k$
  • $R_k^{(i)\top}$: the transpose of the 2×2 rotation matrix for dimension pair $i$ (still a rotation by the angle $\omega_i k$; the transpose merely accounts for the row-vector convention)
  • $0$: 2×2 zero matrices (indicating no interaction between dimension pairs)
  • The matrix has $d/2$ blocks along the diagonal

This block-diagonal structure means relative position shifts act independently on each dimension pair, rotating the (sine, cosine) pair by an amount proportional to the offset. The independence is crucial: each dimension pair encodes position at its own frequency, and shifting by $k$ positions rotates each pair by its own characteristic angle $\omega_i k$.

Let's verify this property numerically:

In[12]:
Code
def relative_position_transform(PE, d_model, offset):
    """
    Compute the transformation matrix for a relative position offset.

    Args:
        PE: Position encoding matrix
        d_model: Embedding dimension
        offset: Position offset k

    Returns:
        M: Transformation matrix of shape (d_model, d_model)
    """
    M = np.zeros((d_model, d_model))

    for i in range(d_model // 2):
        # Frequency for this dimension pair
        omega = 1.0 / (10000 ** (2 * i / d_model))
        angle = omega * offset

        # 2x2 rotation block
        cos_angle = np.cos(angle)
        sin_angle = np.sin(angle)

        # Place the transposed 2x2 rotation block in the full matrix.
        # The transpose is needed because PE[pos] is used as a row vector
        # (PE[pos + k] = PE[pos] @ M_k).
        idx = 2 * i
        M[idx, idx] = cos_angle
        M[idx, idx + 1] = -sin_angle
        M[idx + 1, idx] = sin_angle
        M[idx + 1, idx + 1] = cos_angle

    return M


# Test: PE[pos + k] should equal PE[pos] @ M_k
offset = 5
M_5 = relative_position_transform(PE, d_model, offset)

# Check for several positions
test_positions = [0, 10, 20, 30]
Out[13]:
Console
Verifying relative position property: PE[pos + k] ≈ PE[pos] @ M_k
Offset k = 5
--------------------------------------------------
Position 0: max error ≈ 1e-16
Position 10: max error ≈ 1e-16
Position 20: max error ≈ 1e-16
Position 30: max error ≈ 1e-16

The tiny errors (on the order of 10^-16) are floating-point precision limits.

The errors are at machine precision, confirming that the relative position property holds exactly. This mathematical structure is what allows transformers to potentially learn relative position relationships through their linear attention projections.

Let's visualize this rotation property for a single dimension pair. We'll show how applying the rotation matrix to an encoding at position $pos$ produces the encoding at position $pos + k$:

Out[14]:
Visualization
Plot showing position encodings as points on a circle with arrows indicating rotation from each position to its offset position.
The relative position property as rotation. Each arrow shows how applying the rotation matrix transforms the encoding at position pos (tail) to the encoding at position pos+5 (head). All arrows rotate by the same angle, demonstrating that the transformation depends only on the offset k=5, not on the absolute position.

The visualization makes the rotation property tangible. Each colored arrow shows the transformation from position $pos$ (circle) to position $pos + 5$ (square). Notice that all arrows rotate by the same angle, confirming that the transformation depends only on the offset $k$, not on the starting position. This is the geometric essence of how sinusoidal encodings enable learning of relative positions.

Extrapolation Beyond Training Length

A significant advantage of sinusoidal encodings is their ability to represent positions never seen during training. Unlike learned position embeddings that require a fixed vocabulary of positions, sinusoidal encodings are computed from a deterministic formula that works for any position.

Let's examine how encodings behave beyond typical training lengths:

In[15]:
Code
# Generate encodings for much longer sequences
extended_max_len = 10000
PE_extended = sinusoidal_position_encoding(extended_max_len, d_model)

# Check that encodings remain bounded
pos_samples = [0, 100, 1000, 5000, 9999]
encoding_norms = [np.linalg.norm(PE_extended[p]) for p in pos_samples]
Out[16]:
Console
Encoding statistics for extended positions:
--------------------------------------------------
Position     0: L2 norm = 5.6569
Position   100: L2 norm = 5.6569
Position  1000: L2 norm = 5.6569
Position  5000: L2 norm = 5.6569
Position  9999: L2 norm = 5.6569

All values remain in [-1, 1] by construction.
L2 norms are identical because each sine/cosine pair contributes sin^2 + cos^2 = 1, giving a norm of sqrt(d/2).

The encodings remain well-behaved even at position 9,999. Each dimension independently oscillates between -1 and 1, so the encoding never explodes or vanishes. This stability makes sinusoidal encodings suitable for tasks requiring longer contexts than seen during training.

However, extrapolation has a subtle limitation. While the encodings themselves are mathematically valid for any position, the model's attention patterns are learned on sequences of a particular length distribution. If the model trains on sequences of length 512, it has never seen the specific encoding patterns that occur at position 5000. The attention mechanism might not generalize well to these unseen patterns, even though the encodings are perfectly valid.

Out[17]:
Visualization
Line plot showing sinusoidal encoding values for a single dimension across positions 0 to 10000, with smooth continuous oscillation.
Position encodings remain stable and unique even at positions far beyond typical training lengths. The oscillation patterns continue predictably, but the model may not have learned to interpret these patterns correctly if it only trained on shorter sequences.

Complete Implementation

Here's a complete, production-ready implementation of sinusoidal position encoding that handles batched inputs:

In[18]:
Code
class SinusoidalPositionEncoding:
    """
    Sinusoidal position encoding as introduced in 'Attention Is All You Need'.

    Generates deterministic position encodings using sine and cosine functions
    at geometrically increasing wavelengths.
    """

    def __init__(self, d_model, max_len=5000):
        """
        Initialize the position encoding.

        Args:
            d_model: Dimension of the model (embedding size)
            max_len: Maximum sequence length to pre-compute
        """
        self.d_model = d_model
        self.max_len = max_len

        # Pre-compute position encodings
        self.encoding = self._create_encoding(max_len, d_model)

    def _create_encoding(self, max_len, d_model):
        """Generate the position encoding matrix."""
        # Position indices: (max_len, 1)
        position = np.arange(max_len)[:, np.newaxis]

        # Dimension indices for pairs: (d_model/2,)
        div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)

        # Compute encodings
        encoding = np.zeros((max_len, d_model))
        encoding[:, 0::2] = np.sin(position / div_term)
        encoding[:, 1::2] = np.cos(position / div_term)

        return encoding

    def __call__(self, seq_len):
        """
        Get position encodings for a sequence.

        Args:
            seq_len: Length of the sequence

        Returns:
            Position encodings of shape (seq_len, d_model)
        """
        if seq_len > self.max_len:
            # Extend encoding if needed
            self.encoding = self._create_encoding(seq_len, self.d_model)
            self.max_len = seq_len

        return self.encoding[:seq_len]

    def add_to_embeddings(self, embeddings):
        """
        Add position encodings to token embeddings.

        Args:
            embeddings: Token embeddings of shape (seq_len, d_model)
                       or (batch_size, seq_len, d_model)

        Returns:
            Position-enhanced embeddings of the same shape
        """
        if embeddings.ndim == 2:
            seq_len = embeddings.shape[0]
            return embeddings + self(seq_len)
        elif embeddings.ndim == 3:
            seq_len = embeddings.shape[1]
            return embeddings + self(seq_len)[np.newaxis, :, :]
        else:
            raise ValueError(f"Expected 2D or 3D input, got {embeddings.ndim}D")

Let's test the implementation:

In[19]:
Code
# Create position encoder
pos_encoder = SinusoidalPositionEncoding(d_model=64, max_len=1000)

# Simulate token embeddings (batch of 2 sequences, length 10)
np.random.seed(42)
batch_embeddings = np.random.randn(2, 10, 64) * 0.1

# Add position information
positioned_embeddings = pos_encoder.add_to_embeddings(batch_embeddings)
Out[20]:
Console
Position Encoding Integration Test
==================================================
Input embeddings shape:    (2, 10, 64)
Output embeddings shape:   (2, 10, 64)

Position encoding magnitude (L2 norm):
  Position 0: 5.6569
  Position 5: 5.6569
  Position 9: 5.6569

Token embedding magnitude (sample):
  Before: 0.7284
  After:  5.5914

The position encoding norm (about 5.7 for 64 dimensions) dwarfs these toy embeddings, which were deliberately scaled down to 0.1. In a real transformer the token embeddings are trained, and in the original architecture they are multiplied by $\sqrt{d_{\text{model}}}$ before the encodings are added, so the two signals end up at comparable magnitude: position information is meaningful without overwhelming the semantic content.
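A minimal sketch of that scaling convention applied to the toy batch above (the factor $\sqrt{d_{\text{model}}}$ comes from the original paper; the random embeddings are only stand-ins for trained ones):

# Scale the toy token embeddings by sqrt(d_model) = 8, as in the original
# transformer, before adding the position encodings
scaled_embeddings = batch_embeddings * np.sqrt(64)
positioned_scaled = pos_encoder.add_to_embeddings(scaled_embeddings)

print("Scaled token norm (sample):", np.linalg.norm(scaled_embeddings[0, 0]).round(4))
print("Position encoding norm:    ", np.linalg.norm(pos_encoder(10)[0]).round(4))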

Learned vs Sinusoidal: Trade-offs

The choice between sinusoidal and learned position embeddings involves several trade-offs:

Sinusoidal advantages. No parameters to learn means faster training and no risk of overfitting position patterns. The deterministic formula works for any position, enabling extrapolation to longer sequences. The mathematical structure (relative positions as rotations) provides an inductive bias that may help the model learn position-aware patterns.

Sinusoidal disadvantages. The fixed formula may not capture task-specific positional patterns. Some tasks might benefit from non-linear position relationships that sinusoidal encoding cannot express. The extrapolation guarantee is mathematical, not practical: the model still needs to learn how to use positions, and unseen position ranges may not work well.

Learned embedding advantages. Full flexibility to represent arbitrary position patterns. Can learn task-specific positional biases directly from data. Simple to implement: just another embedding table.

Learned embedding disadvantages. Adds parameters proportional to maximum sequence length times embedding dimension. Cannot extrapolate beyond the trained position vocabulary. May overfit to position patterns in the training data.

In[21]:
Code
def learned_position_embedding(max_len, d_model, seed=42):
    """
    Create learned position embeddings (simulated as random initialization).

    In practice, these would be trained end-to-end with the model.
    """
    np.random.seed(seed)
    # Xavier initialization
    scale = np.sqrt(2.0 / (max_len + d_model))
    return np.random.randn(max_len, d_model) * scale


# Compare parameter counts
max_len = 512
d_model = 768

learned_params = max_len * d_model
sinusoidal_params = 0
Out[22]:
Console
Parameter Comparison (max_len=512, d_model=768)
==================================================
Learned embeddings:    393,216 parameters
Sinusoidal encodings:  0 parameters

For max_len=4096:
Learned embeddings:    3,145,728 parameters
Sinusoidal encodings:  0 parameters

The parameter savings are significant for long sequences. At 4096 positions with 768 dimensions, learned embeddings require over 3 million parameters just for position. Sinusoidal encoding requires none.

Interestingly, the original transformer paper found that both approaches performed similarly on machine translation. Modern practice varies: BERT and GPT-2 both use learned embeddings, while many newer architectures explore alternatives like relative position encodings (covered in later chapters) that build on the insights from the sinusoidal design.

Limitations and Impact

Sinusoidal position encoding introduced key concepts that influence modern position encoding research. The insight that positions should be represented as continuous signals rather than discrete indices opened the door to smoother, more generalizable position representations. The relative position property, where positional offsets correspond to linear transformations, directly inspired later developments like Rotary Position Embedding (RoPE).

The primary limitation is the disconnect between the encoding's mathematical properties and the model's learned behavior. While sinusoidal encodings can represent arbitrary positions, the transformer must still learn to use this information. If training data only contains sequences up to length 512, the model's attention patterns are calibrated for that range. Extrapolating to length 2048 provides valid encodings but potentially invalid learned behavior.

Another limitation is the absolute nature of the encoding. Each position has a fixed representation regardless of context. The word at position 50 has the same positional encoding whether it's in a 100-token sequence or a 1000-token sequence. This can make it harder for the model to learn purely relative patterns like "attend to the previous word" without reference to absolute position.

Despite these limitations, sinusoidal encoding established foundational principles. The use of multiple frequencies to capture position at different scales, the sine/cosine pairing for unique identification, and the geometric wavelength progression all appear in various forms in modern position encoding schemes.

Key Parameters

When implementing sinusoidal position encoding, the following parameters control the encoding behavior:

  • d_model: The dimension of the position encoding vectors, which must match the token embedding dimension. Larger values provide finer-grained positional discrimination but increase computation. Common values range from 256 to 1024.

  • max_len: The maximum sequence length to pre-compute encodings for. Setting this higher than your longest expected sequence avoids runtime recomputation, but increases memory usage. Typical values range from 512 to 8192 depending on the task.

  • Base constant (10000): The frequency scaling constant in the denominator. This value controls the range of wavelengths, from $2\pi$ up to roughly $2\pi \times 10000$. The original transformer paper uses 10000, but some implementations experiment with different values to adjust the frequency distribution, as the short sketch below illustrates.
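A minimal sketch of how the base constant stretches or compresses the wavelength range (the alternative base values here are arbitrary; only 10000 comes from the original paper):

import numpy as np

# The slowest dimension pair has wavelength ~ 2*pi*base, so the base constant
# sets how long a sequence the coarsest dimensions can span within one cycle
d = 512
for base in [100, 10000, 1000000]:
    longest = 2 * np.pi * base ** ((d - 2) / d)
    print(f"base={base:>9,}: shortest wavelength ≈ {2 * np.pi:.1f} positions, "
          f"longest ≈ {longest:,.0f} positions")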

Summary

Sinusoidal position encoding provides a parameter-free method for injecting positional information into transformer models. By encoding each position as a unique pattern of sine and cosine values at geometrically spaced frequencies, it creates distinguishable representations for any sequence position.

Key takeaways from this chapter:

  • Multi-scale representation: Different dimension pairs capture position at different resolutions. High-frequency pairs distinguish nearby positions, low-frequency pairs distinguish distant positions.

  • Mathematical structure: The sine/cosine pairing enables relative positions to be computed as rotations. For any fixed offset $k$, the encoding at position $pos + k$ is a linear transformation of the encoding at position $pos$.

  • No learned parameters: The encoding is computed from a deterministic formula, eliminating position-related parameters and enabling representation of any position.

  • Bounded values: All encoding values lie in $[-1, 1]$, ensuring numerical stability regardless of position.

  • Extrapolation caveat: While encodings are valid for any position, the model's learned attention patterns may not generalize to positions unseen during training.

  • Trade-offs with learned embeddings: Sinusoidal encoding saves parameters and enables extrapolation but lacks flexibility to learn task-specific position patterns.

In the next chapter, we'll explore learned position embeddings in detail: how they're implemented, when they outperform sinusoidal encodings, and the design considerations that affect their performance.



About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
