The Position Problem: Why Transformers Can't Tell Order Without Help

Michael Brenndoerfer · Updated June 5, 2025 · 24 min read

Explore why self-attention is blind to word order and what properties positional encodings need. Learn about permutation equivariance and position encoding requirements.


The Position Problem

In the previous chapters, we explored self-attention and the query-key-value mechanism. We saw how each token can attend to every other token in a sequence, computing contextual representations that capture relationships regardless of distance. But there's a fundamental limitation hiding in plain sight: self-attention has no notion of order.

Shuffle the words in a sentence, and self-attention produces the same attention weights (just reordered). Feed it "The cat sat on the mat" or "mat the on sat cat The," and the pairwise relationships between tokens remain identical. This is a problem because language is inherently sequential. "Dog bites man" means something entirely different from "man bites dog," yet pure self-attention treats these as equivalent permutations.

This chapter explores why position blindness emerges from the mathematics of attention, why word order matters so critically for language understanding, and what properties any solution must have. We'll set the stage for the various positional encoding schemes covered in subsequent chapters.

Attention Is Permutation Equivariant

To understand why self-attention ignores position, we need to examine its mathematical structure. Let's build up the key insight step by step, starting with precise definitions and culminating in a proof that reveals where position information is missing.

Permutation Invariance vs. Equivariance

First, let's clarify two related but distinct properties. A function is permutation invariant if reordering its inputs doesn't change its output at all. Think of computing the sum or average of a set of numbers: the result is the same regardless of order. A function is permutation equivariant if reordering the inputs causes the outputs to be reordered in exactly the same way. Self-attention falls into the second category.

Permutation Equivariance

A function $f$ is permutation equivariant if for any permutation $\pi$ applied to the input sequence, the output is permuted in the same way: $f(\pi(X)) = \pi(f(X))$. Self-attention has this property because it processes all positions identically and independently.

Why does this matter? If we shuffle a sentence before feeding it to self-attention, the outputs get shuffled in exactly the same way. The model produces the same representations, just in a different order. This means self-attention cannot distinguish "dog bites man" from "man bites dog" based on structure alone.
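To make the invariance/equivariance distinction concrete before the full attention demo later in this chapter, here is a minimal NumPy sketch (the array and matrix here are illustrative placeholders, not part of the chapter's numbered cells):

import numpy as np

np.random.seed(0)

x = np.random.randn(5, 3)       # 5 "tokens", 3 features each
perm = [3, 0, 4, 1, 2]          # an arbitrary reordering

# Permutation INVARIANT: the mean over tokens is unchanged by reordering
print(np.allclose(x.mean(axis=0), x[perm].mean(axis=0)))   # True

# Permutation EQUIVARIANT: a shared per-token map produces outputs that are
# reordered in exactly the same way as the inputs
W = np.random.randn(3, 3)
print(np.allclose((x @ W)[perm], x[perm] @ W))              # True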

Tracing the Mathematics

To see why permutation equivariance emerges, let's trace through the self-attention computation and identify where position could enter, but doesn't.

Self-attention produces an output vector for each position by aggregating information from all positions. For position $i$, we compute a weighted sum where each position $j$ contributes its value vector, scaled by how relevant it is to position $i$:

$$\mathbf{y}_i = \sum_{j=1}^{n} \alpha_{ij} \mathbf{v}_j$$

where:

  • $\mathbf{y}_i \in \mathbb{R}^{d_v}$: the output representation for position $i$
  • $\alpha_{ij}$: the attention weight from position $i$ to position $j$ (determines how much $j$ contributes to $i$'s output)
  • $\mathbf{v}_j \in \mathbb{R}^{d_v}$: the value vector at position $j$ (the information that position $j$ contributes)
  • $n$: the sequence length
  • $\sum_{j=1}^{n}$: summation over all positions in the sequence

This formula shows that the output at position $i$ depends on all value vectors $\mathbf{v}_j$ and all attention weights $\alpha_{ij}$. But notice: the values come from the token content, and the weights come from content comparisons. Where does position enter?

The attention weights themselves come from applying softmax to scaled dot products between query and key vectors. The softmax function converts raw similarity scores into a probability distribution, ensuring all weights are positive and sum to 1:

$$\alpha_{ij} = \frac{\exp(\mathbf{q}_i \cdot \mathbf{k}_j / \sqrt{d_k})}{\sum_{l=1}^{n} \exp(\mathbf{q}_i \cdot \mathbf{k}_l / \sqrt{d_k})}$$

where:

  • $\alpha_{ij}$: the attention weight from position $i$ to position $j$ (how much position $i$ attends to position $j$)
  • $\mathbf{q}_i \in \mathbb{R}^{d_k}$: the query vector for position $i$
  • $\mathbf{k}_j \in \mathbb{R}^{d_k}$: the key vector for position $j$
  • $d_k$: the dimension of query and key vectors
  • $\exp(\cdot)$: the exponential function, ensuring all values are positive
  • $\sum_{l=1}^{n}$: summation over all $n$ positions, serving as the normalizing constant

The Missing Position Signal

Here's the critical observation: examine both equations carefully and notice what's absent. The indices $i$ and $j$ appear only as subscripts identifying which vectors to use. They never appear as values in the computation itself. The attention weight $\alpha_{ij}$ depends entirely on:

  1. The query vector $\mathbf{q}_i$, which is derived from the content at position $i$
  2. The key vector $\mathbf{k}_j$, which is derived from the content at position $j$
  3. The normalization over all keys, which also depends only on content

Position indices serve as labels telling us which output corresponds to which input. They're bookkeeping, not input features. If we swap the tokens at positions 2 and 5, the query and key vectors simply swap, and the computation produces swapped outputs. The attention mechanism treats the sequence as an unordered bag of tokens.

Demonstrating Permutation Equivariance

The mathematical argument is compelling, but seeing the effect in code makes it concrete. Let's implement self-attention and verify that permuting the input produces identically permuted output.

In[2]:
Code
import numpy as np
import matplotlib.pyplot as plt  # noqa: F401

np.random.seed(42)


def softmax(x, axis=-1):
    """Compute softmax along the specified axis."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


def self_attention(X, W_Q, W_K, W_V):
    """
    Compute self-attention output.

    Args:
        X: Input embeddings of shape (seq_len, embed_dim)
        W_Q, W_K, W_V: Projection matrices

    Returns:
        A tuple (output, weights): output representations of shape
        (seq_len, d_v) and attention weights of shape (seq_len, seq_len).
    """
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V

    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    output = weights @ V

    return output, weights

We'll create a simple 4-token sequence, apply a permutation that swaps two positions, and compare the outputs.

In[3]:
Code
# Create embeddings for a 4-token sequence
seq_len = 4
embed_dim = 8
d_k = d_v = 6

# Random embeddings representing different tokens
X = np.random.randn(seq_len, embed_dim)

# Initialize projection matrices
W_Q = np.random.randn(embed_dim, d_k) * 0.1
W_K = np.random.randn(embed_dim, d_k) * 0.1
W_V = np.random.randn(embed_dim, d_v) * 0.1

# Compute attention on original sequence
output_original, weights_original = self_attention(X, W_Q, W_K, W_V)

# Create a permutation: swap positions 1 and 2
permutation = [0, 2, 1, 3]
X_permuted = X[permutation]

# Compute attention on permuted sequence
output_permuted, weights_permuted = self_attention(X_permuted, W_Q, W_K, W_V)

# Apply the same permutation to the original output for comparison
output_original_reordered = output_original[permutation]
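The comparison printed below isn't produced by a cell shown above; a minimal version of that check, using the variables just computed, would be:

max_diff = np.max(np.abs(output_permuted - output_original_reordered))

print(f"Permutation applied: {permutation} (swap positions 1 and 2)\n")
print(f"Original output shape: {output_original.shape}")
print(f"Permuted output shape: {output_permuted.shape}\n")
print(f"Max difference between π(output) and output(π(input)): {max_diff:.2e}")
if max_diff < 1e-12:
    print("Outputs are identical (within numerical precision)")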
Out[4]:
Console
Permutation applied: [0, 2, 1, 3] (swap positions 1 and 2)

Original output shape: (4, 6)
Permuted output shape: (4, 6)

Max difference between π(output) and output(π(input)): 6.94e-18
Outputs are identical (within numerical precision)

The outputs match exactly (up to floating point precision). This confirms permutation equivariance: permuting the input permutes the output in the same way. Self-attention doesn't "know" that we reordered the tokens; it just computes the same pairwise relationships in a different order.

The Root Cause: Content-Only Dependencies

This experiment confirms what the mathematics predicted. Self-attention treats the input as an unordered set of tokens, using only their content to determine relationships. Each token's output depends on three things, none of which involve position:

  1. Its own content (through its query vector, which asks "what am I looking for?")
  2. The content of all other tokens (through their key vectors, which answer "what do I offer?")
  3. The pairwise content relationships (through dot products that measure compatibility)

This stands in stark contrast to recurrent networks. In an RNN or LSTM, position is implicitly encoded through sequential processing: the hidden state at step 5 depends on the hidden state at step 4, which depends on step 3, and so on. The temporal chain of computation bakes position into the representation. Self-attention discards this chain entirely, processing all positions in parallel but losing positional awareness in the process.
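To see the contrast concretely, here is a small sketch (not part of the chapter's numbered cells) of a toy recurrence: unlike the attention outputs above, its final hidden state changes when the inputs are reordered, because each step depends on the one before it.

import numpy as np

np.random.seed(0)


def toy_rnn(X, W_h, W_x):
    """Run a minimal tanh recurrence and return the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in X:                        # iteration order matters
        h = np.tanh(W_h @ h + W_x @ x_t)
    return h


X = np.random.randn(4, 8)                # 4 tokens, 8-dim embeddings
W_h = np.random.randn(16, 16) * 0.1
W_x = np.random.randn(16, 8) * 0.1

h_original = toy_rnn(X, W_h, W_x)
h_permuted = toy_rnn(X[[0, 2, 1, 3]], W_h, W_x)

# The recurrence is order-sensitive: swapping two tokens changes the result,
# so position is implicitly baked into the representation.
print(np.allclose(h_original, h_permuted))   # False (with overwhelming probability)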

Why Position Matters for Language

This position blindness might seem like a minor technical detail, but it's actually catastrophic for language understanding. Human language is deeply structured around word order. Let's explore several dimensions of why position matters.

Grammatical Roles

Consider these sentences:

  • "The dog chased the cat."
  • "The cat chased the dog."

The same words appear in both sentences, but they play opposite grammatical roles. In the first sentence, "dog" is the subject (the chaser) and "cat" is the object (the chased). In the second, these roles reverse. Pure self-attention, seeing only that "dog," "chased," and "cat" co-occur, cannot distinguish which is agent and which is patient.

Negation Scope

Word order determines what negation applies to:

  • "I did not say he stole the money." (I said nothing about theft)
  • "I did say he did not steal the money." (I claimed his innocence)

The positioning of "not" relative to "say" and "steal" completely changes the meaning. Moving a single word transforms an accusation of silence into a defense of character.

Modifier Attachment

Consider the ambiguity in:

  • "I saw the man with the telescope."

Does "with the telescope" modify "saw" (I used a telescope to see) or "man" (the man was carrying a telescope)? In written text, position and punctuation often disambiguate. More importantly, our mental parsing depends heavily on positional expectations about where modifiers attach.

Temporal and Causal Ordering

Sequence implies temporal or causal relationships:

  • "She got her degree, then found a job."
  • "She found a job, then got her degree."

The order of clauses suggests different life trajectories. Even though both sentences contain the same events, their relative positions encode temporal precedence.

Semantic Composition

Languages compose meaning hierarchically, and order often determines the composition structure:

  • "Hot dog" (a type of food)
  • "Dog hot" (describing a dog's temperature)

English compounds typically have the modifier before the head noun. Without positional information, "hot" and "dog" are just two tokens that appear together, with no way to determine which modifies which.

In[5]:
Code
# Simulate the attention pattern for "dog bites man" vs "man bites dog"
# We'll use random embeddings but give them distinct identities

np.random.seed(123)

# Create distinct embeddings for each word
word_embeddings = {
    "dog": np.random.randn(embed_dim),
    "bites": np.random.randn(embed_dim),
    "man": np.random.randn(embed_dim),
}

# Construct the two sequences
sentence1 = np.stack(
    [word_embeddings["dog"], word_embeddings["bites"], word_embeddings["man"]]
)
sentence2 = np.stack(
    [word_embeddings["man"], word_embeddings["bites"], word_embeddings["dog"]]
)

# Compute attention weights for both
_, weights1 = self_attention(sentence1, W_Q, W_K, W_V)
_, weights2 = self_attention(sentence2, W_Q, W_K, W_V)
Out[6]:
Visualization
3x3 heatmap showing attention weights for dog bites man sentence.
Attention weights for 'dog bites man'. Each cell shows how much the row token attends to the column token.
3x3 heatmap showing attention weights for man bites dog sentence.
Attention weights for 'man bites dog'. The weights between corresponding word pairs are identical to the left panel.
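The key observation printed below can be reproduced by indexing the two weight matrices computed above ('dog' is token 0 in the first sentence and token 2 in the second; 'bites' is token 1 in both). The original printing cell isn't shown, but it amounts to:

print(f"Attention from 'dog' to 'bites': {weights1[0, 1]:.4f} (sentence 1, dog at position 0)")
print(f"Attention from 'dog' to 'bites': {weights2[2, 1]:.4f} (sentence 2, dog at position 2)")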
Out[7]:
Console
--- Key Observation ---
Attention from 'dog' to 'bites': 0.3439 (sentence 1, dog at position 0)
Attention from 'dog' to 'bites': 0.3439 (sentence 2, dog at position 2)

These are identical because attention only depends on content, not position!

The heatmaps reveal the core problem. Despite "dog bites man" and "man bites dog" having opposite meanings, the attention weight between "dog" and "bites" is identical in both sentences (as shown by matching cell values). Self-attention cannot tell that in one sentence the dog is biting, while in the other the dog is being bitten. The pairwise content relationships are the same; only the positions differ.

What We Need from Positional Information

Understanding why position matters helps us identify what properties any positional encoding scheme should have. The ideal solution must satisfy several requirements:

  • Unique position identification: Each position in a sequence needs a distinct representation. If positions 3 and 7 have identical positional signals, the model cannot distinguish them. This seems obvious, but it rules out trivial solutions like using the same constant everywhere.
  • Bounded and stable values: Whatever numerical representation we use for position, it should have bounded, well-behaved values. If position 1 is encoded as 1, position 1000 as 1000, and position 1 million as 1 million, the scale differences will overwhelm the semantic content. Neural networks prefer inputs with similar magnitudes.
  • Consistent across sequence lengths: The encoding for position 5 should be similar whether it appears in a 10-token sequence or a 1000-token sequence. If positional representations change dramatically based on sequence length, the model must re-learn positional patterns for every possible length.
  • Relative position accessibility: Many linguistic relationships depend on relative rather than absolute position. A verb typically appears near its subject and object, regardless of where they fall in the sentence. Ideally, the positional encoding makes it easy to determine that position 7 is "2 steps after" position 5, not just that they're both somewhere in the sequence.
  • Generalization beyond training: Models trained on sequences of length 512 may encounter sequences of length 1024 at inference time. The positional encoding should degrade gracefully (or not at all) when extrapolating to positions not seen during training.
  • Compatibility with attention: The positional information must integrate with the attention mechanism in a way that allows position to influence attention patterns. If a query can't distinguish nearby keys from distant keys, positional information isn't flowing into the core computation.
In[8]:
Code
# Illustrate the problem with naive positional representations

positions = np.arange(1, 513)

# Naive approach 1: Raw position index
naive_index = positions

# Naive approach 2: Normalized position (0 to 1)
naive_normalized = positions / 512

# Naive approach 3: Log-scaled position
naive_log = np.log(positions)
Out[9]:
Visualization
Line plot showing raw position indices growing linearly from 1 to 512.
Raw position indices grow unboundedly, overwhelming semantic content at large positions.
Line plot showing normalized positions from 0 to 1.
Normalized positions change meaning with sequence length, making patterns non-transferable.
Line plot showing log-scaled positions with diminishing growth.
Log scaling compresses distant positions, making them hard to distinguish.

None of these naive approaches satisfy our requirements. Raw indices are unbounded. Normalized positions change meaning with sequence length (position 0.5 is location 256 in a 512-token sequence but location 512 in a 1024-token sequence). Log scaling compresses distant positions so tightly that the model can barely distinguish position 400 from position 500.
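A quick numeric check against the arrays from the cell above makes these failure modes concrete (the 1024-token comparison is plain arithmetic, since only the 512-length arrays were built):

# Raw index: magnitudes grow without bound relative to typical embedding values
print(naive_index[0], naive_index[-1])                    # 1 vs 512

# Normalized position: the same absolute position (100) gets a different
# coordinate under a different sequence length, so patterns don't transfer
print(naive_normalized[99], 100 / 1024)                   # ~0.195 vs ~0.098

# Log scaling: early neighbors are far apart, distant neighbors nearly collide
print(naive_log[1] - naive_log[0])                        # ln(2) - ln(1) ≈ 0.69
print(naive_log[499] - naive_log[399])                    # ln(500) - ln(400) ≈ 0.22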

Position Encoding vs. Position Embedding

Before diving into specific solutions in subsequent chapters, let's clarify an important terminological distinction that often causes confusion.

Encoding vs. Embedding

Positional encoding refers to a fixed, deterministic function that maps position indices to vectors. Positional embedding refers to learned vectors stored in a lookup table, one per position. Both inject positional information, but through different mechanisms.

Positional encoding uses a predetermined formula to compute a position vector. Given position $p$, the encoding function $f(p)$ returns a fixed vector that never changes during training. The classic example is the sinusoidal encoding from the original Transformer paper, which uses sine and cosine functions at different frequencies. The advantage is that the formula generalizes to any position, including positions beyond those seen during training.
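As a preview of the next chapter, a minimal sketch of that sinusoidal scheme looks like this (the constant 10000 and the sine/cosine interleaving follow the original paper; the helper name is ours):

import numpy as np


def sinusoidal_encoding(seq_len, d_model):
    """Fixed positional encoding: sines and cosines at geometrically spaced frequencies."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model // 2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe


pe = sinusoidal_encoding(seq_len=8, d_model=16)
print(pe.shape)                  # (8, 16)
print(pe.min(), pe.max())        # bounded within [-1, 1] for any position

Because every value is a sine or cosine, the encoding stays bounded in $[-1, 1]$ no matter how large the position index grows.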

Positional embedding treats positions like vocabulary tokens. We create an embedding matrix $P \in \mathbb{R}^{L \times d}$, where $L$ is the maximum sequence length and $d$ is the embedding dimension. Position $p$ simply looks up row $p$ of this matrix, retrieving a $d$-dimensional vector $\mathbf{p}_p$. These embeddings are learned during training, just like word embeddings. The advantage is flexibility: the model can learn whatever positional patterns are useful. The disadvantage is that positions beyond $L$ have no representation.
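In code, a positional embedding is nothing more than an index lookup into a trainable matrix. A minimal sketch, with a randomly initialized table standing in for learned weights:

import numpy as np

np.random.seed(0)

L, d = 512, 8                                   # max sequence length, embedding dim
position_table = np.random.randn(L, d) * 0.02   # learned during training in practice


def lookup_positions(seq_len):
    """Retrieve one vector per position, exactly like a vocabulary lookup."""
    if seq_len > L:
        raise ValueError("no learned representation for positions beyond L")
    return position_table[np.arange(seq_len)]


print(lookup_positions(4).shape)    # (4, 8)
# lookup_positions(1024) would raise: learned tables don't extrapolate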

Well-known models such as BERT and GPT-2 use learned positional embeddings (lookup tables) rather than fixed encodings, while many recent architectures have moved to rotary or relative schemes, and the terminology often gets mixed. We'll explore both approaches in detail in the following chapters.

How Position Information Enters the Model

Now that we understand the two mechanisms for generating positional vectors, the remaining question is: how do we inject this information into the attention computation? Recall from our earlier analysis that attention weights depend only on query and key vectors, which are projections of the input embeddings. To make attention position-aware, we need to modify what goes into those projections.

The standard approach is elegantly simple: add the positional vector directly to the token embedding, creating a single combined representation that carries both semantic meaning and positional information:

$$\mathbf{x}'_i = \mathbf{x}_i + \mathbf{p}_i$$

where:

  • $\mathbf{x}_i \in \mathbb{R}^d$: the token embedding at position $i$ (carries semantic content)
  • $\mathbf{p}_i \in \mathbb{R}^d$: the positional vector for position $i$ (carries positional information)
  • $\mathbf{x}'_i \in \mathbb{R}^d$: the combined representation fed to attention
  • $d$: the embedding dimension (must be the same for both token and position vectors)

Why addition rather than concatenation or some other combination? Addition preserves the dimensionality: the combined vector has the same size as the original embedding, so existing projection matrices work unchanged. More subtly, addition allows the model to learn how much weight to give positional versus semantic information. If certain dimensions of the positional encoding are consistently ignored by the learned projections, the model effectively down-weights position for those aspects of the representation.

In[10]:
Code
def add_positional_info(token_embeddings, position_vectors):
    """
    Combine token embeddings with positional information.

    Args:
        token_embeddings: Shape (seq_len, embed_dim)
        position_vectors: Shape (seq_len, embed_dim)

    Returns:
        Combined representations of shape (seq_len, embed_dim)
    """
    return token_embeddings + position_vectors


# Example: simple positional vectors (not a real encoding scheme)
seq_len = 4
embed_dim = 8

# Token embeddings
tokens = np.random.randn(seq_len, embed_dim)

# Placeholder positional vectors
positions = np.random.randn(seq_len, embed_dim) * 0.1

# Combined representation
combined = add_positional_info(tokens, positions)
Out[11]:
Console
Shapes:
  Token embeddings:     (4, 8)
  Position vectors:     (4, 8)
  Combined:             (4, 8)

Example: Position 0
  Token:    [[-1.254 -0.638  0.907 -1.429]...]
  Position: [[0.089 0.175 0.15  0.107]...]
  Combined: [[-1.165 -0.462  1.057 -1.322]...]

The shapes confirm that adding positional vectors preserves the embedding dimension. Looking at position 0, you can see how each dimension of the combined vector is simply the sum of the corresponding token and position dimensions. In this toy example, the positional vectors are scaled by 0.1 to keep them small relative to the token embeddings, so positional information doesn't drown out semantic content. The combined representation now carries both semantic content (from the token embedding) and positional context (from the positional vector). When this combined representation is projected into queries, keys, and values, both types of information can influence the attention pattern.

Absolute vs. Relative Position

A final conceptual distinction shapes the design space for positional representations: should we encode where each token is in absolute terms, or should we encode how tokens relate to each other in relative terms?

Absolute positional encoding assigns each position a fixed vector. Position 0 always gets the same encoding, position 1 always gets the same encoding, and so on. The attention mechanism must then learn to extract relative information by comparing absolute positions.

Relative positional encoding directly encodes the distance or relationship between pairs of positions. Instead of saying "this token is at position 5," relative encoding says "this token is 3 positions before the query." The attention mechanism directly incorporates these relative distances.

Out[12]:
Visualization
Diagram showing five word boxes with fixed position indices 0 through 4.
Absolute position assigns a fixed index to each token regardless of context.
Diagram showing relative distances from the query token sat to other tokens.
Relative position encodes the distance from a query token, capturing how far apart tokens are.

Each approach has trade-offs. Absolute encoding is simpler to implement: just add a position vector to each token embedding. But linguistic relationships often depend on relative distance. A verb looks for its subject within a few words, regardless of whether they're at positions 3-5 or positions 103-105 in a document. Absolute encoding forces the model to learn this generalization across all possible absolute position pairs.

Relative encoding directly captures these distance-based relationships but requires modifying the attention mechanism itself, not just the input embeddings. The attention score between positions $i$ and $j$ must incorporate information about the distance $i - j$, which complicates the efficient matrix operations we rely on.

Modern architectures increasingly favor relative or hybrid approaches. RoPE (Rotary Position Embedding) cleverly encodes relative positions through rotation operations that integrate naturally with attention. ALiBi (Attention with Linear Biases) adds a simple bias to attention scores based on distance. We'll explore these in dedicated chapters.
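To make "a simple bias based on distance" concrete, here is a rough sketch of an ALiBi-style penalty added to the attention scores before the softmax. The slope here is arbitrary; the actual method uses head-specific slopes together with causal masking.

import numpy as np

seq_len = 6
i = np.arange(seq_len)[:, None]      # query positions (rows)
j = np.arange(seq_len)[None, :]      # key positions (columns)

distance = np.abs(j - i)             # relative distance between every pair
bias = -0.5 * distance               # farther keys receive a larger penalty

# In attention, this bias would simply be added to Q @ K.T / sqrt(d_k)
# before the softmax, so relative position directly shapes the weights.
print(bias)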

Limitations of Any Positional Encoding

No positional encoding scheme is perfect. Every approach makes trade-offs, and understanding these limitations helps choose the right scheme for a given application:

  • Fixed encodings may not capture learned patterns. Sinusoidal encoding uses mathematically determined frequencies. These frequencies might not align with the positional patterns most useful for a specific task. A learned embedding can adapt to the training data, potentially discovering better positional representations.
  • Learned embeddings don't generalize to longer sequences. If you train with a maximum sequence length of 512, positions 513 and beyond have no learned representation. Some models handle this with interpolation or extrapolation heuristics, but performance often degrades for unseen positions.
  • Additive combination limits expressiveness. Adding positional vectors to token embeddings means the combined representation must encode both content and position in a shared space. The model may struggle when positional and semantic information conflict or when fine-grained positional distinctions are needed.
  • Relative positions require architectural changes. Directly encoding relative positions requires modifying the attention computation, not just the input. This can complicate implementation and sometimes reduce computational efficiency.
  • Very long sequences remain challenging. Even with sophisticated positional encoding, transformers struggle with extremely long sequences (tens of thousands of tokens). The attention mechanism still has quadratic complexity, and positional patterns learned on shorter sequences may not transfer.

Summary

Self-attention is blind to position because its computation depends only on content, not on where tokens appear in the sequence. This permutation equivariance is a fundamental property of the attention mechanism, not a bug in any particular implementation.

Key takeaways from this chapter:

  • Permutation equivariance: Self-attention produces the same outputs (reordered) regardless of input order. Shuffling the input shuffles the output identically, because attention weights depend only on query-key compatibility, not on position indices.
  • Language requires order: Grammatical roles, negation scope, modifier attachment, temporal relationships, and semantic composition all depend on word position. "Dog bites man" and "man bites dog" must be distinguishable.
  • Requirements for positional information: Any solution must provide unique position identification, bounded values, consistency across sequence lengths, accessibility of relative positions, generalization beyond training, and compatibility with attention.
  • Encoding vs. embedding: Positional encoding uses fixed formulas (like sinusoids) that generalize to any position. Positional embedding uses learned lookup tables that adapt to training data but have fixed maximum length.
  • Absolute vs. relative: Absolute encoding assigns fixed vectors to positions. Relative encoding captures distances between positions. Modern architectures increasingly favor relative approaches for better generalization.

In the next chapter, we'll examine the sinusoidal positional encoding introduced in the original Transformer paper. This elegant mathematical construction uses sine and cosine functions at different frequencies to create unique position vectors that support relative position computation through simple linear operations.




About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
