How GPT and BERT encode position through learnable parameters. Understand embedding tables, position similarity, interpolation techniques, and trade-offs versus sinusoidal encoding.

Learned Position Embeddings
The previous chapter introduced sinusoidal position encoding, a fixed mathematical function that assigns each position a unique pattern. The approach is elegant and principled, with wavelengths designed to enable relative position detection. But there's an alternative philosophy: instead of designing position representations by hand, why not learn them from data?
Learned position embeddings take a different stance. Rather than encoding position through a predetermined formula, they treat position representations as trainable parameters. Just as word embeddings are learned vectors that capture semantic meaning, position embeddings can be learned vectors that capture positional meaning. The model discovers what aspects of position matter for the task at hand.
This approach powers many influential models, including GPT-2, GPT-3, and BERT. Understanding learned position embeddings is essential for working with these architectures and for making informed choices about position encoding in new models.
The Position Embedding Table
The core idea is simple: maintain a lookup table of position vectors, one for each position in the sequence. When processing a sequence, look up the embedding for each position and add it to the corresponding token embedding.
A position embedding table is a trainable matrix of shape $L_{\max} \times d$, where $L_{\max}$ is the maximum sequence length and $d$ is the embedding dimension. Position $i$ is represented by the $i$-th row of this matrix. These vectors are learned during training, not computed from a formula.
Mathematically, if $P$ is the position embedding table, then the position embedding for position $i$ is simply:
$$p_i = P[i]$$
where:
- $p_i$: the position embedding vector for position $i$
- $P$: the position embedding table (a learnable parameter matrix of shape $L_{\max} \times d$)
- $L_{\max}$: the maximum sequence length the model can handle
- $d$: the embedding dimension (same as token embeddings)
- $i$: the position index (0-indexed, so $0 \le i \le L_{\max} - 1$)
This is identical to how word embeddings work: just as we look up a word's embedding from a vocabulary table, we look up a position's embedding from a position table.
The implementation mirrors how word embedding layers work in deep learning frameworks. In PyTorch, you would use nn.Embedding(max_seq_len, embed_dim), which handles the lookup and gradient computation automatically.
Let's create a position embedding table and examine what the initial (random) embeddings look like:
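A minimal sketch of this step in PyTorch (the sizes of 64 positions and 32 dimensions are illustrative, chosen small enough to plot):

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(0)

max_seq_len = 64   # number of rows in the position table (illustrative)
embed_dim = 32     # dimension of each position vector (illustrative)

# The position embedding table: one learnable, randomly initialized row
# per position, exactly like a word embedding table.
pos_table = nn.Embedding(max_seq_len, embed_dim)

with torch.no_grad():
    embeddings = pos_table(torch.arange(max_seq_len))   # (max_seq_len, embed_dim)

print(embeddings.shape)   # torch.Size([64, 32])

# Heatmap with positions along the x-axis and dimensions along the y-axis.
plt.imshow(embeddings.numpy().T, aspect="auto", cmap="RdBu")
plt.xlabel("position")
plt.ylabel("embedding dimension")
plt.title("Position embeddings at initialization (random)")
plt.show()
```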
At initialization, the position embeddings are random vectors with no meaningful structure. The magic happens during training, when gradients flow back through these embeddings and reshape them to capture positional patterns useful for the task.
The heatmap reveals the chaotic structure of random initialization. Each column (position) has values that appear unrelated to neighboring columns. There's no smooth gradient, no pattern that would help the model understand that position 5 is closer to position 6 than to position 50. This randomness is the starting point; training will transform this noise into meaningful structure.
Combining Token and Position Embeddings
Just as with sinusoidal encoding, learned position embeddings are typically added to token embeddings:
$$x_i = t_i + p_i$$
where:
- $x_i$: the input representation for position $i$, combining token and position information
- $t_i$: the token embedding for the word at position $i$
- $p_i$: the position embedding for position $i$
This additive combination means position information is blended into the token representation from the very first layer. The model sees each token as existing at a particular position, not as a position-agnostic entity.
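As a rough sketch (PyTorch, with illustrative sizes), the combination itself is a single addition, broadcast across the batch:

```python
import torch
import torch.nn as nn

vocab_size, max_seq_len, embed_dim = 1000, 64, 32   # illustrative sizes
tok_table = nn.Embedding(vocab_size, embed_dim)
pos_table = nn.Embedding(max_seq_len, embed_dim)

token_ids = torch.randint(0, vocab_size, (2, 10))   # batch of 2 sequences, length 10
positions = torch.arange(token_ids.size(1))         # 0, 1, ..., 9

tok_emb = tok_table(token_ids)    # (2, 10, 32): what each token is
pos_emb = pos_table(positions)    # (10, 32):    where each token sits
x = tok_emb + pos_emb             # (2, 10, 32): broadcast over the batch dimension
print(x.shape)
```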
The combined embeddings carry information about both what token appears at each position and where that position is in the sequence. This is the representation that attention layers will operate on.
How Position Embeddings Learn
Position embeddings learn through the same backpropagation process as all other parameters. When the model makes a prediction, gradients flow backward through the network, and the position embeddings receive updates that push them toward configurations that reduce the loss.
But what patterns do position embeddings actually learn? Research has revealed several consistent findings.
Position embeddings typically learn to encode absolute position information directly. Nearby positions tend to have similar embeddings, creating a smooth gradient across the sequence. This makes intuitive sense: positions 5 and 6 are more similar (in terms of their role in the sequence) than positions 5 and 50.
Let's simulate what trained position embeddings might look like by creating a toy example with smoothly varying patterns:
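One possible toy construction along these lines (simulated data for illustration, not weights from a trained model; the frequencies and noise level are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

num_positions, embed_dim = 64, 32
rng = np.random.default_rng(0)

# Each dimension varies smoothly with position at its own frequency, plus a
# little noise: slow dimensions track global location, faster ones track
# finer-grained local structure.
freqs = np.linspace(0.02, 0.5, embed_dim)              # one frequency per dimension
toy_emb = np.sin(np.outer(np.arange(num_positions), freqs))
toy_emb += 0.05 * rng.standard_normal(toy_emb.shape)   # (num_positions, embed_dim)

plt.imshow(toy_emb.T, aspect="auto", cmap="RdBu")
plt.xlabel("position")
plt.ylabel("embedding dimension")
plt.title("Toy 'trained' position embeddings with smooth structure")
plt.show()
```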
The visualization shows the characteristic structure of learned position embeddings. Different dimensions capture different aspects of position: some vary slowly across the entire sequence (low-frequency components), while others vary more rapidly (high-frequency components). This multi-scale representation allows the model to detect both global position ("near the start vs. near the end") and local structure ("two positions apart").
Position Similarity Analysis
One way to understand what position embeddings have learned is to compute similarities between positions. If the model uses position for word order and syntax, nearby positions should be more similar than distant ones.
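Reusing the toy smoothly varying embeddings from above as a stand-in for trained ones, a sketch of that analysis:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy stand-in for trained position embeddings (same construction as above).
num_positions, embed_dim = 64, 32
rng = np.random.default_rng(0)
freqs = np.linspace(0.02, 0.5, embed_dim)
emb = np.sin(np.outer(np.arange(num_positions), freqs))
emb += 0.05 * rng.standard_normal(emb.shape)

# Cosine similarity between every pair of positions.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = unit @ unit.T                       # (num_positions, num_positions)

plt.imshow(sim, cmap="viridis")
plt.xlabel("position")
plt.ylabel("position")
plt.title("Cosine similarity between position embeddings")
plt.colorbar()
plt.show()
```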
The similarity matrix reveals the structure learned by position embeddings. The bright diagonal indicates that each position is most similar to itself (similarity = 1). The gradual darkening as we move away from the diagonal shows that nearby positions are more similar than distant ones. This band structure is characteristic of well-trained position embeddings.
Let's quantify how similarity decreases with distance:
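Continuing with the same toy similarity matrix, one way to do this is to average the similarity over all pairs at each separation (np.diagonal with an offset picks out exactly those pairs):

```python
import numpy as np
import matplotlib.pyplot as plt

# Rebuild the toy embeddings and similarity matrix from the previous sketch.
num_positions, embed_dim = 64, 32
rng = np.random.default_rng(0)
freqs = np.linspace(0.02, 0.5, embed_dim)
emb = np.sin(np.outer(np.arange(num_positions), freqs))
emb += 0.05 * rng.standard_normal(emb.shape)
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = unit @ unit.T

# Mean similarity over all position pairs separated by each distance.
distances = np.arange(1, num_positions)
decay = [np.diagonal(sim, offset=d).mean() for d in distances]

plt.plot(distances, decay)
plt.xlabel("distance between positions")
plt.ylabel("mean cosine similarity")
plt.title("Similarity decay with distance (toy embeddings)")
plt.show()
```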
The decay curve shows that similarity drops off smoothly with distance. Positions 1 apart have high similarity (around 0.9), while positions 20+ apart have near-zero similarity. This pattern allows the model to detect relative position through embedding similarity: if two position embeddings are very similar, the positions are likely close together.
The Maximum Sequence Length Constraint
Unlike sinusoidal encodings, learned position embeddings have a hard constraint: the model can only handle sequences up to length $L_{\max}$. If you train with $L_{\max} = 512$, you have 512 position embeddings. Position 513 simply doesn't exist in the table.
This creates practical challenges.
During training, you choose $L_{\max}$ based on your computational budget and data characteristics. Larger $L_{\max}$ means more parameters (the position table has $L_{\max} \times d$ values) and higher memory usage during training.
During inference, sequences longer than $L_{\max}$ cannot be processed directly. You must either truncate the sequence, use sliding windows, or fine-tune with a larger $L_{\max}$.
The constraint is fundamental to the approach. Sinusoidal encodings can generate a representation for any position using the formula, but learned embeddings must be stored explicitly. This is a key trade-off between the two approaches.
Extrapolation: Beyond Training Length
What happens if you need to process sequences longer than $L_{\max}$? With learned embeddings, you have several options, none of which are ideal.
Option 1: Extend and fine-tune. Add new position embeddings for positions beyond $L_{\max}$, initialize them (perhaps by interpolating from existing embeddings), and fine-tune on longer sequences. This works but requires additional training.
Option 2: Position interpolation. If you trained with $L_{\max} = 512$ but need to handle 1024 tokens, you can interpolate the position indices. Position 512 in the long sequence uses the embedding for position 256 in the original table. This trick, called position interpolation, works surprisingly well for moderate extensions.
Position interpolation effectively "stretches" the position embedding table. The model sees positions at half the resolution, but the relative ordering is preserved. This works because the model primarily cares about relative positions, and those relationships are maintained under scaling.
Let's visualize how position interpolation affects the embedding structure:
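One way to sketch the stretching step (a random old_table stands in for trained weights; in practice you would load the model's actual position matrix before comparing the two similarity structures side by side):

```python
import torch

torch.manual_seed(0)
old_len, new_len, dim = 512, 1024, 64
old_table = torch.randn(old_len, dim)    # stand-in for a trained position table

# Map each new position j to a fractional index in the original table,
# then linearly interpolate between the two nearest original rows.
scale = (old_len - 1) / (new_len - 1)
frac = torch.arange(new_len) * scale     # new position 1023 maps back to 511.0
lo = frac.floor().long()
hi = frac.ceil().long()
w = (frac - lo).unsqueeze(1)             # weight on the upper neighbor
new_table = (1 - w) * old_table[lo] + w * old_table[hi]

print(old_table.shape, "->", new_table.shape)   # (512, 64) -> (1024, 64)
# Adjacent new positions now share most of their values with a common original
# row, so the band structure of the similarity matrix is preserved.
```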
The side-by-side comparison shows that position interpolation preserves the essential structure: nearby positions remain similar, and similarity decays with distance. The interpolated version has twice as many positions, but the relative relationships are maintained. This explains why interpolation works reasonably well for moderate length extensions.
Option 3: Sliding window. Process long sequences in overlapping chunks of length $L_{\max}$. Each chunk gets proper position embeddings, but you need a strategy for combining predictions across chunks. This is common for very long documents.
The extrapolation problem is a significant limitation of learned position embeddings. Models trained with shorter contexts may struggle when forced to process longer sequences, even with interpolation tricks. This has motivated research into position encodings that generalize better, which we'll explore in later chapters.
GPT-Style Position Embeddings
GPT-2 and GPT-3 use learned position embeddings in a straightforward way. The architecture adds token and position embeddings at the input layer, then processes the combined representation through transformer blocks.
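A quick back-of-the-envelope comparison, using GPT-2's published sizes (50,257-token vocabulary, 1,024 positions, 768-dimensional embeddings):

```python
vocab_size = 50_257    # GPT-2 BPE vocabulary
max_seq_len = 1_024    # GPT-2 context length
embed_dim = 768        # GPT-2 (small) embedding dimension

token_params = vocab_size * embed_dim       # 38,597,376  (~38.6M)
position_params = max_seq_len * embed_dim   # 786,432     (~0.79M)

print(f"token embeddings:    {token_params / 1e6:5.2f}M parameters")
print(f"position embeddings: {position_params / 1e6:5.2f}M parameters")
print(f"position share of embedding layer: "
      f"{position_params / (token_params + position_params):.1%}")
```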
The parameter count reveals an important insight: position embeddings are relatively cheap. In GPT-2 with its 50K vocabulary and 1024 positions, the position embedding table has only 0.79M parameters versus 38.6M for token embeddings. The position table is about 2% of the embedding layer. This makes increasing $L_{\max}$ computationally inexpensive in terms of parameters, though longer sequences still pay the quadratic attention cost.
BERT-Style Position Embeddings
BERT also uses learned position embeddings, but with a few differences. BERT includes segment embeddings (to distinguish between sentence pairs) and uses a different initialization scheme.
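A rough sketch of the BERT-style combination (illustrative sizes based on BERT-base; real BERT also applies layer normalization and dropout after the sum):

```python
import torch
import torch.nn as nn

vocab_size, max_seq_len, num_segments, embed_dim = 30_522, 512, 2, 768

tok_emb = nn.Embedding(vocab_size, embed_dim)
pos_emb = nn.Embedding(max_seq_len, embed_dim)
seg_emb = nn.Embedding(num_segments, embed_dim)   # segment A vs. segment B

# A toy sentence pair packed into one sequence of length 12.
token_ids = torch.randint(0, vocab_size, (1, 12))
segment_ids = torch.tensor([[0] * 7 + [1] * 5])   # first 7 tokens belong to sentence A
positions = torch.arange(12)

# BERT sums all three embedding types before the first transformer layer.
x = tok_emb(token_ids) + pos_emb(positions) + seg_emb(segment_ids)
print(x.shape)   # torch.Size([1, 12, 768])
```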
The addition of segment embeddings allows BERT to understand sentence structure in tasks like next sentence prediction and question answering. The position embeddings work identically to GPT-style: a simple lookup and addition.
Analyzing Real Position Embeddings
When researchers analyze trained position embeddings from models like GPT-2 and BERT, several patterns emerge consistently.
Low-rank structure. The 768-dimensional position embeddings can often be approximated well by a much lower-dimensional subspace. The first 50-100 principal components typically capture most of the variance.
Smooth interpolation. Adjacent positions have similar embeddings, and this similarity decreases smoothly with distance. The embeddings form a continuous manifold in the embedding space.
Boundary effects. The first few positions (0, 1, 2) and positions near $L_{\max}$ sometimes show different patterns, possibly because they're encountered in distinct contexts during training.
Let's visualize the low-rank structure:
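A sketch of that analysis on the toy embeddings from earlier (a real study would load a trained table, such as GPT-2's wpe matrix, in place of the toy stand-in):

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy stand-in for a trained position table (swap in real weights if available).
num_positions, embed_dim = 512, 128
rng = np.random.default_rng(0)
freqs = np.linspace(0.005, 0.3, embed_dim)
emb = np.sin(np.outer(np.arange(num_positions), freqs))
emb += 0.05 * rng.standard_normal(emb.shape)

# Singular values of the (centered) table show how concentrated the
# positional information is in a few directions.
singular_values = np.linalg.svd(emb - emb.mean(axis=0), compute_uv=False)
explained = np.cumsum(singular_values**2) / np.sum(singular_values**2)

for k in (5, 10, 20, 50):
    print(f"top {k:2d} components explain {explained[k - 1]:.1%} of the variance")

plt.plot(singular_values, marker=".")
plt.xlabel("component index")
plt.ylabel("singular value")
plt.title("Singular value spectrum of the position table")
plt.show()
```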
The rapid decay of singular values confirms the low-rank structure. Most of the information in position embeddings can be captured by a handful of principal components. This suggests that positions are fundamentally simple, even though we represent them in high-dimensional space. The extra dimensions provide capacity for task-specific adjustments during training.
Trade-offs: Learned vs. Sinusoidal
The choice between learned and sinusoidal position encodings involves several trade-offs.
Flexibility. Learned embeddings can capture any pattern the data requires, including task-specific positional biases. Sinusoidal encodings impose a fixed mathematical structure that may not match the task's needs. For most NLP tasks, learned embeddings perform as well as or better than sinusoidal encodings when the model is trained on sufficient data.
Generalization to longer sequences. Sinusoidal encodings can generate representations for any position, including those never seen during training. Learned embeddings are limited to $L_{\max}$ and may degrade for positions near the boundary where less training signal exists. For applications requiring length generalization, sinusoidal or other fixed encodings have an advantage.
Parameter count. Learned embeddings add $L_{\max} \times d$ parameters. For typical transformer sizes, this is a small fraction of total parameters. Sinusoidal encodings add zero parameters since they're computed from a formula.
Interpretability. Sinusoidal encodings have clear mathematical properties (the dot product encodes relative position, different frequencies capture different scales). Learned embeddings are opaque; their properties must be discovered empirically through analysis.
Training efficiency. Learned embeddings must be trained, which requires gradients to flow through positions encountered during training. Rare positions (e.g., near $L_{\max}$ when most training sequences are short) may receive insufficient updates. Sinusoidal encodings work correctly immediately, with no training needed for the position component.
In practice, most modern language models use learned position embeddings. The flexibility to adapt to task-specific positional patterns outweighs the generalization advantages of fixed encodings for most applications. The sequence length limit is addressed by choosing $L_{\max}$ large enough for the target use case or by using techniques like position interpolation.
Implementation in PyTorch
In practice, you would implement learned position embeddings using PyTorch's nn.Embedding layer. Here's what a real implementation looks like:
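A sketch of such a module (class and argument names are illustrative; the 0.02 initialization and 0.1 dropout follow the GPT-2-style conventions discussed below):

```python
import torch
import torch.nn as nn


class LearnedPositionEmbedding(nn.Module):
    """Token embeddings plus learned position embeddings, GPT-style."""

    def __init__(self, vocab_size: int, max_seq_len: int, embed_dim: int,
                 dropout: float = 0.1):
        super().__init__()
        self.max_seq_len = max_seq_len
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding = nn.Embedding(max_seq_len, embed_dim)
        self.dropout = nn.Dropout(dropout)
        # Small-scale initialization (GPT-2 uses a standard deviation of 0.02).
        nn.init.normal_(self.token_embedding.weight, std=0.02)
        nn.init.normal_(self.position_embedding.weight, std=0.02)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch_size, seq_len) tensor of token indices
        seq_len = token_ids.size(1)
        if seq_len > self.max_seq_len:
            raise ValueError(f"sequence length {seq_len} exceeds max {self.max_seq_len}")

        # Position indices are created on the fly and shared across the batch.
        positions = torch.arange(seq_len, device=token_ids.device)

        x = self.token_embedding(token_ids) + self.position_embedding(positions)
        return self.dropout(x)


# Usage with GPT-2-sized dimensions:
embedder = LearnedPositionEmbedding(vocab_size=50_257, max_seq_len=1_024, embed_dim=768)
tokens = torch.randint(0, 50_257, (2, 16))
print(embedder(tokens).shape)   # torch.Size([2, 16, 768])
```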
The PyTorch implementation is clean and efficient. The nn.Embedding layer handles the lookup table and gradient computation automatically. Position indices are created on-the-fly based on sequence length, and the same position embeddings are shared across all examples in a batch.
Limitations and Impact
Learned position embeddings represent a pragmatic approach to position encoding. By treating positions as learnable parameters rather than fixed functions, they allow models to discover optimal position representations for their training data. This flexibility has proven valuable across many tasks, from language modeling to machine translation.
The primary limitation is the fixed sequence length. Models cannot process sequences longer than their training length without additional techniques like position interpolation or fine-tuning with extended context. This constraint has driven research into position encodings that generalize better to unseen lengths, including relative position encodings and rotary position embeddings.
Another limitation is the lack of theoretical guarantees about what the embeddings learn. Unlike sinusoidal encodings, which have clear mathematical properties, learned embeddings are empirical objects whose properties must be discovered through analysis. This makes debugging and interpretation more challenging.
Despite these limitations, learned position embeddings remain a dominant choice for transformer architectures. Their simplicity, effectiveness, and ability to adapt to task-specific patterns have made them the default in models like GPT-2, GPT-3, BERT, and many others. The technique demonstrates a broader principle in deep learning: when you have enough data, learned representations often outperform hand-designed ones.
Key Parameters
When implementing learned position embeddings, these parameters determine the capacity and behavior of the position encoding:
- max_seq_len ($L_{\max}$): Maximum sequence length the model can handle. This determines the number of rows in the position embedding table. Common values range from 512 (BERT) to 2048+ (modern GPT variants). Larger values increase memory usage linearly but allow processing longer documents without truncation.
- embed_dim ($d$): Dimension of each position embedding vector. Must match the token embedding dimension for additive combination. Typical values range from 256 to 4096 depending on model size. Higher dimensions provide more capacity but increase parameter count proportionally.
- Initialization scale: Position embeddings are typically initialized with small random values, either scaled relative to the embedding dimension or drawn with a fixed standard deviation (e.g., 0.02 in GPT-2). Smaller initialization helps training stability by keeping initial representations in a reasonable range.
- dropout: Dropout rate applied after combining token and position embeddings. Values of 0.1 are common. This regularizes the model by randomly zeroing embedding dimensions during training.
The position table accounts for a small share of the model's parameters (about 2% of GPT-2's embedding layer), making $L_{\max}$ relatively cheap to increase from a parameter perspective. However, attention complexity scales quadratically with sequence length, which is often the practical bottleneck.
Summary
Learned position embeddings treat position representations as trainable parameters rather than fixed formulas. This simple idea has proven remarkably effective across many language modeling tasks.
Key takeaways from this chapter:
- Position embedding table: A learnable matrix of shape $L_{\max} \times d$ stores one embedding vector per position. Position $i$ is represented by the $i$-th row of this table.
- Additive combination: Position embeddings are added to token embeddings at the input layer, creating representations that encode both what token appears and where it appears.
- Training discovers structure: Through backpropagation, position embeddings learn to encode positional information useful for the task. Nearby positions typically develop similar embeddings.
- Maximum sequence length constraint: Unlike sinusoidal encodings, learned embeddings cannot represent positions beyond $L_{\max}$. Processing longer sequences requires techniques like position interpolation or fine-tuning.
- Low effective dimensionality: Despite being stored in high-dimensional space, position embeddings often lie in a low-rank subspace. A few principal components capture most of the positional information.
- GPT and BERT style: Major models like GPT-2 and BERT use learned position embeddings with simple additive combination. The approach is straightforward to implement and works well in practice.
- Trade-offs vs. sinusoidal: Learned embeddings offer more flexibility but limited length generalization. Sinusoidal encodings generalize to any length but impose fixed structure. The choice depends on the application's requirements.
In the next chapter, we'll explore relative position encodings, which address the extrapolation problem by encoding the distance between positions rather than their absolute locations.