Query, Key, Value: The Foundation of Transformer Attention

Michael Brenndoerfer · Updated June 4, 2025 · 40 min read

Learn how QKV projections enable transformers to learn flexible attention patterns through specialized query, key, and value representations.

Query, Key, Value

In the previous chapter, we implemented self-attention using raw embeddings: each token's embedding served directly as the basis for computing similarities and aggregating context. This approach works, but it has a fundamental limitation. When a token computes dot products with other tokens using its own embedding, it's asking: "Which tokens are similar to me?" But similarity isn't the same as relevance. A pronoun like "it" isn't similar to "cat" in embedding space, yet "cat" is highly relevant for understanding what "it" refers to.

The solution is to give tokens different representations for different roles. When a token is looking for context (querying), it should express what information it needs. When a token is being looked at (acting as a key), it should advertise what information it offers. And when a token contributes to another token's representation (providing a value), it should supply the actual content to be aggregated. This separation of concerns is the Query, Key, Value (QKV) framework.

The Database Lookup Analogy

The QKV mechanism mirrors how databases retrieve information. Consider a library catalog: you search with a query (perhaps "books about neural networks"), the system compares your query against keys (book metadata, titles, descriptions), and returns values (the actual book content or references). The query expresses what you want, keys describe what's available, and values provide the substance.

Query, Key, Value

In self-attention, each token is projected into three different representations: a query (what information am I looking for?), a key (what information do I contain?), and a value (what information should I contribute?). Attention weights are computed by matching queries to keys, then used to aggregate values.

This analogy clarifies why raw embedding similarity falls short. In a database, your search query isn't compared directly against the books themselves; it's compared against structured metadata designed to facilitate matching. Similarly, in self-attention, we don't want tokens to match based on their raw meanings. We want them to match based on learned patterns that capture functional relationships: subjects matching predicates, pronouns matching antecedents, modifiers matching their targets.
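
To make the analogy concrete, here is a minimal sketch of a "soft" lookup in Python. The catalog entries and vectors are invented purely for illustration (they are not from this article's code): instead of returning the single best-matching entry, we score a query against every key and weight each value by how well its key matches, which is exactly the pattern attention will formalize.

Code
import numpy as np

# A toy "catalog": keys describe entries, values carry their content.
# All vectors and entries are made up for illustration.
keys = {
    "neural networks book": np.array([1.0, 0.2, 0.0]),
    "cooking book":         np.array([0.0, 1.0, 0.1]),
    "gardening book":       np.array([0.1, 0.0, 1.0]),
}
values = {
    "neural networks book": "content about backprop and attention",
    "cooking book":         "content about sauces and baking",
    "gardening book":       "content about soil and pruning",
}

query = np.array([0.9, 0.1, 0.0])  # roughly: "books about neural networks"

# Score the query against every key (dot product), then normalize with softmax.
names = list(keys)
scores = np.array([query @ keys[name] for name in names])
weights = np.exp(scores - scores.max())
weights /= weights.sum()

for name, w in zip(names, weights):
    print(f"{name:>22}: weight {w:.2f} -> {values[name]}")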

The QKV Mechanism: From Intuition to Formula

Now that we understand why we need separate representations for querying, being queried, and contributing content, let's build up the mathematical machinery that makes this work. We'll develop each component step by step, showing how the formulas directly address the challenge of learning flexible attention patterns.

The journey from intuition to formula follows a natural progression:

  1. Project tokens into specialized query, key, and value spaces
  2. Measure compatibility between queries and keys using dot products
  3. Scale the scores to prevent numerical instability
  4. Aggregate values according to the attention weights

Each step solves a specific problem that arises from the previous one. By the end, you'll see how these pieces combine into a single elegant formula that powers modern transformers.

Step 1: Projecting Into Specialized Spaces

The core insight is simple: if we want tokens to play different roles, we should give them different representations for each role. We accomplish this through learned linear projections, three separate transformations that map each token's embedding into query, key, and value spaces.

Why three separate projections? Consider what happens when a pronoun like "it" needs to find its antecedent. The pronoun's embedding encodes "third-person singular neuter pronoun," but that's not what it should search for. It should search for "noun phrase that could be referred to." Meanwhile, a noun like "cat" shouldn't advertise "I'm a cat" but rather "I'm a noun phrase available for reference." And when "cat" contributes information, it should provide its semantic content (the concept of a cat), not its grammatical role.

These are three fundamentally different jobs:

  • Querying: What information am I looking for?
  • Being queried: What information do I have to offer?
  • Contributing: What content should I transmit?

A single embedding can't optimally serve all three purposes. So we learn three different transformations, each specialized for its role.

Given an input embedding $\mathbf{x} \in \mathbb{R}^d$ for a single token, we compute:

$$\mathbf{q} = \mathbf{x} \mathbf{W}_Q, \quad \mathbf{k} = \mathbf{x} \mathbf{W}_K, \quad \mathbf{v} = \mathbf{x} \mathbf{W}_V$$

where:

  • $\mathbf{x} \in \mathbb{R}^d$: the input embedding vector (a row vector with $d$ dimensions)
  • $\mathbf{q} \in \mathbb{R}^{d_k}$: the resulting query vector, expressing what this token is looking for
  • $\mathbf{k} \in \mathbb{R}^{d_k}$: the resulting key vector, advertising what this token contains
  • $\mathbf{v} \in \mathbb{R}^{d_v}$: the resulting value vector, carrying the content to be aggregated
  • $\mathbf{W}_Q \in \mathbb{R}^{d \times d_k}$: the query projection matrix (learned during training)
  • $\mathbf{W}_K \in \mathbb{R}^{d \times d_k}$: the key projection matrix (learned during training)
  • $\mathbf{W}_V \in \mathbb{R}^{d \times d_v}$: the value projection matrix (learned during training)
  • $d$: the input embedding dimension
  • $d_k$: the dimension of queries and keys (must match for dot product compatibility)
  • $d_v$: the dimension of values (can differ from $d_k$)

Each projection is a matrix multiplication, a linear transformation where each output dimension is a weighted combination of input dimensions. The crucial point is that these weights are learned during training. The model doesn't know in advance which embedding features matter for querying versus being queried versus contributing content. Instead, gradient descent discovers these patterns from data:

  • $\mathbf{W}_Q$ learns which combinations of embedding features make effective queries (what patterns indicate "I need information about X")
  • $\mathbf{W}_K$ learns which combinations make effective keys (what patterns indicate "I have information about X")
  • $\mathbf{W}_V$ learns which content is worth transmitting when attention flows

This learning happens implicitly through the training objective. If attending from a verb to its subject improves next-word prediction, the projection matrices gradually adjust to make verb queries match subject keys.

Scaling to sequences. For a complete sequence of $n$ tokens, we stack all embeddings into a matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ (where each row is one token) and project all tokens simultaneously:

$$\mathbf{Q} = \mathbf{X} \mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X} \mathbf{W}_K, \quad \mathbf{V} = \mathbf{X} \mathbf{W}_V$$

The resulting matrices have shapes $\mathbf{Q} \in \mathbb{R}^{n \times d_k}$, $\mathbf{K} \in \mathbb{R}^{n \times d_k}$, and $\mathbf{V} \in \mathbb{R}^{n \times d_v}$. Row $i$ of each matrix is the query, key, or value vector for token $i$.

In[3]:
Code
def project_qkv(embeddings, W_q, W_k, W_v):
    """
    Project input embeddings into query, key, and value spaces.

    Args:
        embeddings: Input matrix of shape (seq_len, embed_dim)
        W_q: Query projection matrix of shape (embed_dim, d_k)
        W_k: Key projection matrix of shape (embed_dim, d_k)
        W_v: Value projection matrix of shape (embed_dim, d_v)

    Returns:
        Q, K, V matrices
    """
    Q = embeddings @ W_q  # (seq_len, d_k)
    K = embeddings @ W_k  # (seq_len, d_k)
    V = embeddings @ W_v  # (seq_len, d_v)
    return Q, K, V

Why might these projections learn different things? Consider an embedding that encodes both syntactic information (part of speech, grammatical role) and semantic information (meaning, topic). The query projection $\mathbf{W}_Q$ might learn to emphasize syntactic features, helping verbs find their subjects by grammatical role rather than semantic similarity. Meanwhile, the value projection $\mathbf{W}_V$ might preserve semantic content, so that when attention flows, it transfers meaning rather than grammatical markers.

The separation between keys and values is particularly powerful. A word's key determines which queries it matches (what attention it receives), while its value determines what information it contributes. This decoupling means a word can attract attention for one reason (syntactic role) while contributing entirely different information (semantic content).
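
The small sketch below illustrates this decoupling with hand-picked toy numbers (not from the article's examples): two tokens are given identical keys, so any query attends to them equally, yet their different values contribute entirely different content to the output.

Code
import numpy as np

# Two tokens with identical keys but different values (toy numbers).
K = np.array([[1.0, 0.0],
              [1.0, 0.0]])   # same key: both attract the same attention
V = np.array([[5.0, 0.0],
              [0.0, 5.0]])   # different values: different content on offer

q = np.array([2.0, 0.0])     # a query aligned with the shared key direction

scores = K @ q / np.sqrt(len(q))
weights = np.exp(scores - scores.max())
weights /= weights.sum()

print("attention weights:", weights)      # equal: [0.5, 0.5]
print("aggregated output:", weights @ V)  # a blend of both values: [2.5, 2.5]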

Step 2: Measuring Compatibility with Dot Products

With queries and keys in hand, we face the next challenge: how do we measure whether a query matches a key? We need a scoring function that takes two vectors and returns a single number indicating compatibility. Higher numbers should mean better matches.

Several options exist, including additive attention, multiplicative attention, and others, but the dot product has become the standard choice for its simplicity and efficiency. The geometric interpretation is intuitive:

  • Vectors pointing in the same direction → large positive dot product (strong match)
  • Orthogonal vectors → zero dot product (no relationship)
  • Vectors pointing in opposite directions → negative dot product (poor match)

This gives us exactly the ranking behavior we want: the most compatible keys get the highest scores, and training can learn arbitrary matching patterns by adjusting the projection matrices.
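
A quick numerical check of this geometric picture, using toy vectors chosen for illustration:

Code
import numpy as np

query = np.array([1.0, 2.0, 0.0])

aligned    = np.array([ 2.0,  4.0,  0.0])  # same direction as the query
orthogonal = np.array([ 0.0,  0.0,  3.0])  # perpendicular to the query
opposite   = np.array([-1.0, -2.0,  0.0])  # opposite direction

for name, key in [("aligned", aligned), ("orthogonal", orthogonal), ("opposite", opposite)]:
    print(f"{name:>10}: q · k = {query @ key:+.1f}")
# aligned:    +10.0  (strong match)
# orthogonal:  +0.0  (no relationship)
# opposite:    -5.0  (poor match)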

For positions $i$ and $j$, the compatibility score is:

$$\text{score}_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j = \sum_{m=1}^{d_k} q_{im} \, k_{jm}$$

where:

  • $\text{score}_{ij}$: the raw attention score measuring how well position $i$'s query matches position $j$'s key
  • $\mathbf{q}_i \in \mathbb{R}^{d_k}$: the query vector for position $i$
  • $\mathbf{k}_j \in \mathbb{R}^{d_k}$: the key vector for position $j$
  • $q_{im}$, $k_{jm}$: the $m$-th components of these vectors
  • $d_k$: the dimension of query and key vectors (must match for the dot product to be defined)

The beauty of the dot product is that it's differentiable, fast to compute, and captures alignment in learned representation space. Because $\mathbf{q}_i$ and $\mathbf{k}_j$ come from learned projections, the model can discover arbitrary matching patterns during training.

To compute all $n^2$ scores at once (every query against every key), we use matrix multiplication:

$$\mathbf{S} = \mathbf{Q} \mathbf{K}^T$$

where:

  • $\mathbf{S} \in \mathbb{R}^{n \times n}$: the score matrix containing all pairwise attention scores
  • $\mathbf{Q} \in \mathbb{R}^{n \times d_k}$: the query matrix (row $i$ is the query for position $i$)
  • $\mathbf{K}^T \in \mathbb{R}^{d_k \times n}$: the transposed key matrix (column $j$ is the key for position $j$)

Entry $S_{ij}$ is the dot product of row $i$ of $\mathbf{Q}$ with column $j$ of $\mathbf{K}^T$, which equals $\mathbf{q}_i \cdot \mathbf{k}_j$. This single matrix multiplication gives us all $n^2$ pairwise scores in one highly optimized operation, a key reason why attention scales well on modern hardware.

In[4]:
Code
def compute_attention_scores(Q, K):
    """
    Compute raw attention scores from queries and keys.

    Args:
        Q: Query matrix of shape (seq_len, d_k)
        K: Key matrix of shape (seq_len, d_k)

    Returns:
        Scores matrix of shape (seq_len, seq_len)
    """
    # Matrix multiplication: Q @ K.T gives all pairwise dot products
    scores = Q @ K.T
    return scores

The power of learned projections. This is where QKV attention differs fundamentally from raw embedding similarity. In the previous chapter, tokens matched based on how similar their embeddings were, a fixed computation that couldn't adapt to context. Now they match based on how well their learned projections align.

Consider the implications: a verb's query can learn to match the key of a noun playing the subject role, even if "run" and "dog" have very different embeddings. The word "it" can learn a query that matches noun phrase keys, even though pronouns and nouns occupy different regions of embedding space. The projections transform the matching problem from "what words are similar?" to "what words are relevant?", and relevance is learned from data.

Step 3: The Scaling Problem and Its Solution

We have scores, but there's a subtle problem lurking in the mathematics. As the dimension $d_k$ grows, dot products tend to become larger in magnitude, not because the vectors are more aligned, but simply because we're summing more terms. This creates numerical instability that can derail training. To understand why and how we fix it, let's work through the statistics step by step.

Setting up the problem. Assume each component of $\mathbf{q}$ and $\mathbf{k}$ has zero mean and unit variance (a reasonable assumption for normalized embeddings). We want to understand how the variance of the dot product depends on dimension.

Step 1: Write the dot product as a sum. The dot product is:

$$\mathbf{q} \cdot \mathbf{k} = \sum_{m=1}^{d_k} q_m \, k_m$$

where:

  • $q_m$: the $m$-th component of the query vector $\mathbf{q}$
  • $k_m$: the $m$-th component of the key vector $\mathbf{k}$
  • $d_k$: the dimension of both vectors

Step 2: Compute the variance of each term. Each product $q_m k_m$ is the product of two independent random variables with zero mean and unit variance. For independent random variables $X$ and $Y$ with $\mathbb{E}[X] = \mathbb{E}[Y] = 0$ and $\text{Var}(X) = \text{Var}(Y) = 1$, independence gives $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y] = 0$, so:

$$\text{Var}(XY) = \mathbb{E}[(XY)^2] - (\mathbb{E}[XY])^2 = \mathbb{E}[X^2]\,\mathbb{E}[Y^2] = 1 \cdot 1 = 1$$

So each term $q_m k_m$ has variance 1.

Step 3: Sum the variances. For independent random variables, the variance of a sum equals the sum of variances:

$$\text{Var}\left(\sum_{m=1}^{d_k} q_m k_m\right) = \sum_{m=1}^{d_k} \text{Var}(q_m k_m) = \sum_{m=1}^{d_k} 1 = d_k$$

Conclusion. The dot product has variance $d_k$, so its standard deviation is $\sqrt{d_k}$. At $d_k = 64$ (common in practice), scores have standard deviation 8. At $d_k = 512$, it's about 22.6. This growth in magnitude isn't a feature; it's a bug that we need to fix.

In[5]:
Code
import numpy as np

# Demonstrate how dot product variance grows with dimension
def measure_dot_product_variance(d_k, n_samples=10000):
    """Compute empirical variance of dot products for random unit-variance vectors."""
    q = np.random.randn(n_samples, d_k)
    k = np.random.randn(n_samples, d_k)
    dot_products = np.sum(q * k, axis=1)
    return dot_products.var()


dimensions = [4, 8, 16, 32, 64, 128, 256, 512]
np.random.seed(42)
empirical_variances = [measure_dot_product_variance(d) for d in dimensions]
theoretical_variances = dimensions  # Variance equals d_k
Out[6]:
Visualization
Line plot showing empirical and theoretical dot product variance increasing linearly with dimension from 4 to 512.
Dot product variance grows linearly with dimension. For random vectors with unit-variance components, the variance of their dot product equals the dimension. This means larger dimensions produce larger score magnitudes, which can saturate softmax.

The plot confirms our theoretical prediction: variance grows linearly with dimension. The empirical measurements (red squares) lie almost perfectly on the theoretical line (blue circles), validating our statistical analysis.

Why does this matter? The problem becomes clear when we consider what happens next: we apply softmax to convert scores into probability-like attention weights.

The softmax function converts a vector of real numbers into a probability distribution. For a vector of scores $\mathbf{s} = [s_1, s_2, \ldots, s_n]$, softmax computes:

$$\text{softmax}(s_i) = \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}$$

where:

  • $s_i$: the $i$-th score (the raw attention score for position $i$)
  • $e^{s_i}$: the exponential of $s_i$, ensuring all values become positive
  • $\sum_{j=1}^{n} e^{s_j}$: the sum of exponentials across all $n$ positions, serving as a normalizing constant
  • $n$: the number of positions (sequence length)

The exponential function amplifies differences between inputs: larger scores get exponentially larger outputs. When one score is much larger than others, softmax assigns nearly all probability mass to that element. The attention becomes "hard," focusing on essentially one position while ignoring all others.

More critically, this creates gradient problems. In the softmax function, elements with very low probability receive vanishingly small gradients. If attention is sharply focused due to large score magnitudes, the model struggles to learn that other positions might also be relevant. Training becomes slow and unstable.

In[7]:
Code
# Demonstrate softmax saturation with different score magnitudes
def softmax(x):
    exp_x = np.exp(x - x.max())
    return exp_x / exp_x.sum()


# Same relative differences, different scales
scores_small = np.array([1.0, 0.8, 0.5, 0.3])
scores_medium = scores_small * 5
scores_large = scores_small * 20

weights_small = softmax(scores_small)
weights_medium = softmax(scores_medium)
weights_large = softmax(scores_large)
Out[8]:
Visualization
Bar chart showing relatively uniform softmax weights across four positions.
Small scores (scale=1): Softmax produces a soft distribution where all positions receive meaningful weight. Gradients flow to all elements.
Bar chart showing softmax weights increasingly concentrated on the maximum position.
Medium scores (scale=5): Distribution sharpens. The maximum element dominates, while others receive less weight.
Bar chart showing nearly all softmax weight on a single position.
Large scores (scale=20): Near-hard attention. Almost all weight goes to one element. Gradients for other elements nearly vanish.

The visualization makes the problem clear. All three plots use scores with the same relative differences, where position 1's score is always 25% higher than position 2's, and so on. But scaling up the magnitudes completely changes the output distribution. In the rightmost plot, position 1 receives roughly 98% of the weight, and the model can barely learn to attend elsewhere.
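
We can also quantify the gradient problem directly. For a softmax output, the sensitivity of weight $p_i$ to its own score is $p_i(1 - p_i)$, which collapses toward zero as $p_i$ saturates at 0 or 1. The sketch below (a rough illustration of these diagonal Jacobian terms, not a full backward pass) uses the same score pattern as above:

Code
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([1.0, 0.8, 0.5, 0.3])

for scale in [1, 5, 20]:
    p = softmax(scores * scale)
    # Diagonal of the softmax Jacobian: d p_i / d s_i = p_i * (1 - p_i).
    # When p saturates, these factors (and hence the gradients) shrink toward zero.
    grad_diag = p * (1 - p)
    print(f"scale={scale:>2}  weights={np.round(p, 3)}  dp_i/ds_i={np.round(grad_diag, 4)}")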

The elegant solution. Since the problem is that score variance grows with $d_k$, we fix it by dividing scores by $\sqrt{d_k}$. This simple scaling operation normalizes the variance back to approximately 1, regardless of dimension:

  • Original variance: $d_k$
  • After dividing by $\sqrt{d_k}$: dividing a random variable by $\sqrt{d_k}$ divides its variance by $d_k$, so the variance becomes $d_k / d_k = 1$

The result is the complete scaled dot-product attention formula:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

where:

  • $\mathbf{Q} \in \mathbb{R}^{n \times d_k}$: the query matrix, with row $i$ containing the query vector for position $i$
  • $\mathbf{K} \in \mathbb{R}^{n \times d_k}$: the key matrix, with row $j$ containing the key vector for position $j$
  • $\mathbf{V} \in \mathbb{R}^{n \times d_v}$: the value matrix, with row $j$ containing the value vector for position $j$
  • $\mathbf{Q}\mathbf{K}^T \in \mathbb{R}^{n \times n}$: the raw score matrix, where entry $(i, j)$ is the dot product $\mathbf{q}_i \cdot \mathbf{k}_j$
  • $\sqrt{d_k}$: the scaling factor that stabilizes score magnitudes
  • $\text{softmax}(\cdot)$: applied row-wise, converts each row of scaled scores into a probability distribution (attention weights)
  • $n$: the sequence length (number of tokens)
  • $d_k$: the query/key dimension
  • $d_v$: the value dimension (determines the output dimension)

The formula works in three stages. First, $\mathbf{Q}\mathbf{K}^T$ computes all pairwise query-key dot products. Second, dividing by $\sqrt{d_k}$ and applying softmax converts these scores into attention weights. Third, multiplying by $\mathbf{V}$ aggregates value vectors according to these weights, producing context-enriched output representations.

After scaling, scores have controlled variance regardless of $d_k$, and softmax operates in a regime where gradients flow to all positions.

In[9]:
Code
# Compare attention weights with and without scaling
def compute_attention_comparison(d_k, n_positions=8):
    """Compare attention distributions with and without scaling."""
    np.random.seed(42)
    Q = np.random.randn(1, d_k)
    K = np.random.randn(n_positions, d_k)

    # Unscaled scores
    unscaled = (Q @ K.T).flatten()

    # Scaled scores
    scaled = unscaled / np.sqrt(d_k)

    # Softmax
    def softmax(x):
        exp_x = np.exp(x - x.max())
        return exp_x / exp_x.sum()

    return unscaled, scaled, softmax(unscaled), softmax(scaled)


# High dimension where scaling matters most
d_k_demo = 64
unscaled_scores, scaled_scores, unscaled_weights, scaled_weights = (
    compute_attention_comparison(d_k_demo)
)
Out[10]:
Visualization
Bar chart showing highly concentrated attention weights without scaling.
Without scaling: The large score variance pushes softmax toward hard attention. Position 3 receives 91% of the weight, leaving little gradient for learning to attend elsewhere.
Bar chart showing more distributed attention weights with scaling.
With scaling: Dividing by sqrt(64)=8 compresses the score range. The attention distribution is softer, allowing gradients to flow to multiple positions during training.

The side-by-side comparison demonstrates scaling's importance. Without it (left), attention collapses to a near-hard distribution where one position dominates. With scaling (right), the model retains flexibility to attend broadly. Crucially, gradients flow to all positions during training, allowing the model to learn nuanced attention patterns.

In[11]:
Code
def scaled_dot_product_attention(Q, K, V):
    """
    Compute scaled dot-product attention.

    Args:
        Q: Query matrix of shape (seq_len, d_k)
        K: Key matrix of shape (seq_len, d_k)
        V: Value matrix of shape (seq_len, d_v)

    Returns:
        output: Attention output of shape (seq_len, d_v)
        attention_weights: Attention weights of shape (seq_len, seq_len)
    """
    d_k = Q.shape[-1]

    # Compute scaled scores
    scores = Q @ K.T / np.sqrt(d_k)

    # Apply softmax (row-wise)
    # Subtract max for numerical stability (doesn't change the result)
    scores_stable = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(scores_stable)
    attention_weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)

    # Aggregate values
    output = attention_weights @ V

    return output, attention_weights

Let's verify that the dimensions work out correctly by tracing through a concrete example:

In[12]:
Code
# Example dimensions
seq_len = 4  # 4 tokens in the sequence
embed_dim = 8  # each token has 8-dimensional embedding
d_k = 6  # query and key dimension
d_v = 6  # value dimension

# Initialize random embeddings and projection matrices
np.random.seed(42)
X = np.random.randn(seq_len, embed_dim)
W_q = np.random.randn(embed_dim, d_k) * 0.1
W_k = np.random.randn(embed_dim, d_k) * 0.1
W_v = np.random.randn(embed_dim, d_v) * 0.1

# Project to Q, K, V
Q, K, V = project_qkv(X, W_q, W_k, W_v)

# Compute attention
output, attention_weights = scaled_dot_product_attention(Q, K, V)
Out[13]:
Console
Shape analysis:
  Input X:            (4, 8)  (seq_len × embed_dim)
  Query projection:   (8, 6)  (embed_dim × d_k)
  Key projection:     (8, 6)  (embed_dim × d_k)
  Value projection:   (8, 6)  (embed_dim × d_v)

  Queries Q:          (4, 6)  (seq_len × d_k)
  Keys K:             (4, 6)  (seq_len × d_k)
  Values V:           (4, 6)  (seq_len × d_v)

  Attention weights:  (4, 4)  (seq_len × seq_len)
  Output:             (4, 6)  (seq_len × d_v)

The shapes confirm our understanding: input embeddings flow through projections to create Q, K, and V. The attention weight matrix is always $n \times n$ because we compute all pairwise interactions. The output has the same sequence length as the input but takes on the value dimension $d_v$. In practice, $d_v$ is often chosen to equal $d_k$ and the original embedding dimension, enabling residual connections.
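
To see why the residual connection constrains these shapes, here is a brief sketch that sets $d_v$ equal to the embedding dimension so the elementwise addition is valid. It reuses project_qkv and scaled_dot_product_attention from the earlier cells (assumed to have been run), with hypothetical dimensions.

Code
# Assumes project_qkv, scaled_dot_product_attention, and np are available
# from the earlier cells in this chapter.
seq_len, embed_dim = 4, 8

np.random.seed(0)
X = np.random.randn(seq_len, embed_dim)

# Choose d_v equal to the embedding dimension so the residual addition works.
d_k, d_v = 6, embed_dim
W_q = np.random.randn(embed_dim, d_k) * 0.1
W_k = np.random.randn(embed_dim, d_k) * 0.1
W_v = np.random.randn(embed_dim, d_v) * 0.1

Q, K, V = project_qkv(X, W_q, W_k, W_v)
output, _ = scaled_dot_product_attention(Q, K, V)

residual = X + output  # only defined because output.shape == X.shape
print(X.shape, output.shape, residual.shape)  # (4, 8) (4, 8) (4, 8)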

Step 4: Aggregating Values

We've now computed attention weights, a probability distribution over positions for each query. But weights alone don't produce useful outputs. The final step uses these weights to aggregate value vectors, producing new representations enriched with contextual information.

This is where the separation of keys and values proves its worth. Keys determined which positions receive attention (through query-key matching). Values determine what content actually flows. A word might attract attention because of its syntactic role (captured in its key) while contributing semantic information (carried in its value).

For position $i$, the output is a weighted sum of all value vectors:

$$\mathbf{o}_i = \sum_{j=1}^{n} \alpha_{ij} \mathbf{v}_j$$

where:

  • $\mathbf{o}_i \in \mathbb{R}^{d_v}$: the output vector for position $i$, now enriched with contextual information
  • $\alpha_{ij} \in [0, 1]$: the attention weight from position $i$ to position $j$, indicating how much position $i$ attends to position $j$
  • $\mathbf{v}_j \in \mathbb{R}^{d_v}$: the value vector at position $j$, carrying the content that can be transferred
  • $n$: the sequence length (total number of positions)
  • $d_v$: the value dimension

The attention weights $\alpha_{ij}$ come from applying softmax to the scaled scores, so they satisfy $\sum_{j=1}^{n} \alpha_{ij} = 1$ for each position $i$. This constraint ensures the output is a proper weighted average.

Geometrically, this weighted average has an elegant interpretation. Position $i$'s output is a blend of all value vectors, with weights determined by query-key compatibility. Positions with keys that matched $i$'s query (high $\alpha_{ij}$) contribute more; positions with mismatched keys (low $\alpha_{ij}$) contribute less. Because the weights sum to 1, the output $\mathbf{o}_i$ lies within the convex hull of the value vectors, a point "inside" the space spanned by all possible values.
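
As a quick sanity check on this interpretation, the snippet below verifies that each output row really is $\sum_j \alpha_{ij} \mathbf{v}_j$ with weights summing to 1. It assumes the shape example above (which defined attention_weights, V, and output) has been run.

Code
# Verify the weighted-average view using attention_weights, V, and output
# from the shape example above.
i = 0  # check the first position

manual = sum(attention_weights[i, j] * V[j] for j in range(len(V)))

print("weights sum to 1:   ", np.isclose(attention_weights[i].sum(), 1.0))
print("matches matrix form:", np.allclose(manual, output[i]))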

The information flow paradigm. Self-attention can be understood as a message-passing system. Each position broadcasts its value vector, and each position receives a custom blend of all broadcasts, with the blend weights determined by query-key compatibility. Unlike recurrent networks that pass information sequentially, or convolutional networks that only see local neighborhoods, attention allows any position to directly receive information from any other position in a single step.

This direct, content-addressable communication is what makes transformers so effective. A pronoun can directly access its antecedent, regardless of distance. A verb can simultaneously gather information from its subject, object, and modifiers. The network doesn't need to learn complex routing through intermediate positions; it learns which positions are relevant and attends to them directly.

Let's visualize the complete attention flow for a simple sentence:

In[14]:
Code
# Create example with interpretable structure
words = ["The", "quick", "fox", "jumps"]
n_words = len(words)

# Create embeddings that capture some semantic structure
# (In practice these would come from an embedding layer)
np.random.seed(123)
word_embeddings = np.random.randn(n_words, embed_dim)

# Initialize projection matrices
W_q = np.random.randn(embed_dim, d_k) * 0.2
W_k = np.random.randn(embed_dim, d_k) * 0.2
W_v = np.random.randn(embed_dim, d_v) * 0.2

# Compute QKV attention
Q, K, V = project_qkv(word_embeddings, W_q, W_k, W_v)
output, attn_weights = scaled_dot_product_attention(Q, K, V)
Out[15]:
Visualization
Heatmap of attention weights for four words showing query-key matching patterns.
Attention weight heatmap showing query-key matching for a four-word sequence. Each row shows how one word's query matches against all words' keys. These weights determine how much each word's value contributes to the output at each position.

The heatmap shows how each word's query matches against all keys. Each row sums to 1.0 because softmax normalizes the weights. In this random initialization, the patterns aren't meaningful, but after training, we would see linguistically interpretable patterns: articles attending to their nouns, verbs attending to subjects and objects, and so on.

Dimensions and Shapes

Understanding the tensor shapes in self-attention is crucial for implementation and debugging. Let's summarize the key dimensions:

Tensor shapes in self-attention. Query and key dimensions must match for dot product compatibility.

| Tensor | Shape | Description |
| --- | --- | --- |
| Input $\mathbf{X}$ | $(n, d)$ | $n$ tokens, each with a $d$-dimensional embedding |
| $\mathbf{W}_Q$ | $(d, d_k)$ | Query projection, maps $d \to d_k$ |
| $\mathbf{W}_K$ | $(d, d_k)$ | Key projection, maps $d \to d_k$ |
| $\mathbf{W}_V$ | $(d, d_v)$ | Value projection, maps $d \to d_v$ |
| Queries $\mathbf{Q}$ | $(n, d_k)$ | Query vectors for all tokens |
| Keys $\mathbf{K}$ | $(n, d_k)$ | Key vectors for all tokens |
| Values $\mathbf{V}$ | $(n, d_v)$ | Value vectors for all tokens |
| Scores $\mathbf{Q}\mathbf{K}^T$ | $(n, n)$ | All pairwise attention scores |
| Weights $\boldsymbol{\alpha}$ | $(n, n)$ | Softmax-normalized attention |
| Output | $(n, d_v)$ | Context-enriched representations |

A few important constraints:

  • Query and key dimensions must match ($d_k$) because we compute their dot product
  • Value dimension can differ ($d_v$) since values are aggregated, not compared
  • Output dimension equals value dimension ($d_v$) because the output is a weighted sum of values

In practice, most implementations set $d_k = d_v = d / h$, where $d$ is the embedding dimension and $h$ is the number of attention heads. This choice ensures that the total computation across all heads remains comparable to single-head attention with the full dimension. We'll explore multi-head attention in a later chapter.
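
For a sense of the typical numbers, here is a tiny sketch of that sizing rule applied to a few illustrative configurations (the dimensions are representative, not tied to any specific model in this chapter):

Code
# Illustrative configurations; d_k = d_v = d / h is the usual multi-head sizing rule.
configs = [
    {"embed_dim": 512,  "num_heads": 8},   # similar to the original Transformer base
    {"embed_dim": 768,  "num_heads": 12},  # similar to BERT-base sized models
    {"embed_dim": 1024, "num_heads": 16},
]

for cfg in configs:
    d, h = cfg["embed_dim"], cfg["num_heads"]
    d_k = d_v = d // h
    print(f"embed_dim={d:>4}, heads={h:>2} -> d_k = d_v = {d_k}")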

QKV as Learned Transformations

Why do learned projections help? Consider what happens without them. If we use raw embeddings, a word can only attend to other words that happen to have similar embeddings. The word "it" might attend to "this" and "that" (similar pronouns) but struggle to attend to "cat" (semantically relevant but embedding-distant).

With learned projections, the model can discover arbitrary matching patterns. During training:

  • $\mathbf{W}_Q$ learns to transform embeddings into representations that express what each word is looking for
  • $\mathbf{W}_K$ learns to transform embeddings into representations that advertise what each word offers
  • $\mathbf{W}_V$ learns what content should actually flow when attention is paid

These projections operate independently, so a word's query doesn't need to resemble its key or value. A pronoun's query might encode "seeking a noun phrase antecedent" while its key might encode "available for coreference" and its value might encode its referent-neutral semantic content.

Let's visualize how the same embeddings project differently into query, key, and value spaces:

In[16]:
Code
# Create a 3-word example with 2D projections for visualization
np.random.seed(42)
words_viz = ["cat", "sat", "mat"]

# Original embeddings (3D for variety, project to 2D)
embed_dim_viz = 4
d_proj = 2  # Project to 2D for visualization

X_viz = np.random.randn(3, embed_dim_viz)

# Different projection matrices for Q, K, V
W_q_viz = np.random.randn(embed_dim_viz, d_proj) * 0.5
W_k_viz = np.random.randn(embed_dim_viz, d_proj) * 0.5
W_v_viz = np.random.randn(embed_dim_viz, d_proj) * 0.5

# Project
Q_viz = X_viz @ W_q_viz
K_viz = X_viz @ W_k_viz
V_viz = X_viz @ W_v_viz
Out[17]:
Visualization
2D scatter plot showing three words projected into query space with different positions.
Query space: Where words look for information. The relative positions determine which queries match which keys.
2D scatter plot showing the same three words projected into key space with different arrangement.
Key space: How words advertise themselves. Words with keys near a query receive high attention from that query.
2D scatter plot showing the same three words projected into value space with yet another arrangement.
Value space: What content flows during attention. The geometric arrangement here determines what information gets aggregated.

The three projections arrange the same words differently. In query space, the positions reflect what each word is searching for. In key space, positions reflect how words present themselves to queries. In value space, positions determine what gets aggregated. A word might be close to another in query space (they look for similar things) but far in value space (they contribute different content).

Putting It All Together

We've now built up all the components of QKV attention: projections that create specialized representations, dot products that measure compatibility, scaling that ensures stable gradients, and value aggregation that transmits information. Let's combine these into a complete self-attention layer:

In[18]:
Code
class SelfAttention:
    """
    Complete self-attention layer with QKV projections.
    """

    def __init__(self, embed_dim, d_k, d_v, seed=None):
        """
        Initialize projection matrices.

        Args:
            embed_dim: Dimension of input embeddings
            d_k: Dimension of queries and keys
            d_v: Dimension of values
            seed: Random seed for reproducibility
        """
        if seed is not None:
            np.random.seed(seed)

        # Xavier/Glorot initialization for stable gradients
        scale_qk = np.sqrt(2.0 / (embed_dim + d_k))
        scale_v = np.sqrt(2.0 / (embed_dim + d_v))

        self.W_q = np.random.randn(embed_dim, d_k) * scale_qk
        self.W_k = np.random.randn(embed_dim, d_k) * scale_qk
        self.W_v = np.random.randn(embed_dim, d_v) * scale_v
        self.d_k = d_k

    def forward(self, X):
        """
        Compute self-attention output.

        Args:
            X: Input embeddings of shape (seq_len, embed_dim)

        Returns:
            output: Attention output of shape (seq_len, d_v)
            attention_weights: Weights of shape (seq_len, seq_len)
        """
        # Project to Q, K, V
        Q = X @ self.W_q
        K = X @ self.W_k
        V = X @ self.W_v

        # Scaled dot-product attention
        scores = Q @ K.T / np.sqrt(self.d_k)

        # Softmax
        scores_stable = scores - scores.max(axis=1, keepdims=True)
        exp_scores = np.exp(scores_stable)
        attention_weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)

        # Aggregate values
        output = attention_weights @ V

        return output, attention_weights
In[19]:
Code
# Test the complete layer
embed_dim = 8
d_k = d_v = 6
seq_len = 5

np.random.seed(42)
X = np.random.randn(seq_len, embed_dim)

# Create and apply self-attention layer
attention = SelfAttention(embed_dim, d_k, d_v, seed=123)
output, weights = attention.forward(X)
Out[20]:
Console
Self-Attention Layer Test
========================================
Input shape:            (5, 8)
Output shape:           (5, 6)
Attention weights shape: (5, 5)

Attention weight matrix (rows sum to 1):
[[0.068 0.446 0.094 0.171 0.221]
 [0.012 0.476 0.085 0.148 0.28 ]
 [0.046 0.24  0.251 0.096 0.367]
 [0.177 0.324 0.158 0.208 0.132]
 [0.457 0.169 0.077 0.125 0.172]]

Row sums: [1. 1. 1. 1. 1.]

The attention weights matrix reveals the core of what self-attention computes: a soft routing table that determines how information flows between positions. Each row sums to exactly 1.0 (confirming proper softmax normalization), and each entry indicates how much one position attends to another.

With random initialization, these patterns are meaningless, just noise from the random projection matrices. But after training on language data, we would see interpretable patterns emerge: determiners attending to nouns, verbs gathering information from subjects and objects, pronouns reaching back to their antecedents.

A Worked Example with Real Words

To make these abstractions concrete, let's trace through QKV attention step by step with a meaningful sentence. We'll use deliberately small embeddings (4 dimensions) and projection matrices (projecting to 3 dimensions) so we can follow every number through the computation.

In[21]:
Code
# Sentence: "The cat sat"
words_example = ["The", "cat", "sat"]
n = len(words_example)

# Create simple 4D embeddings (in practice, these come from an embedding layer)
# We'll make them somewhat interpretable:
# - Dimension 0: Determiner-ness
# - Dimension 1: Noun-ness
# - Dimension 2: Verb-ness
# - Dimension 3: Animacy
embeddings_example = np.array(
    [
        [1.0, 0.0, 0.0, 0.0],  # "The" - determiner
        [0.1, 1.0, 0.0, 0.8],  # "cat" - animate noun
        [0.0, 0.1, 1.0, 0.0],  # "sat" - verb
    ]
)

# Small projection matrices (4D -> 3D)
np.random.seed(42)
d_k_ex = d_v_ex = 3
W_q_ex = np.random.randn(4, d_k_ex) * 0.5
W_k_ex = np.random.randn(4, d_k_ex) * 0.5
W_v_ex = np.random.randn(4, d_v_ex) * 0.5
Out[22]:
Console
Original Embeddings:
Word     Embed                         
----------------------------------------
The      [1. 0. 0. 0.]
cat      [0.1 1.  0.  0.8]
sat      [0.  0.1 1.  0. ]

We've designed these embeddings to be interpretable: each dimension corresponds to a linguistic feature. "The" is a pure determiner (1.0 only in the determiner dimension). "cat" is primarily a noun with high animacy. "sat" is a pure verb. In real systems, embeddings would be dense vectors where meaning is distributed across dimensions, but our hand-crafted features help us trace what the projections do.

Step 1: Computing QKV Projections. The projection matrices transform these 4D embeddings into 3D query, key, and value vectors:

In[23]:
Code
# Step 1: Project to Q, K, V
Q_ex = embeddings_example @ W_q_ex
K_ex = embeddings_example @ W_k_ex
V_ex = embeddings_example @ W_v_ex
Out[24]:
Console

Step 1: QKV Projections
==================================================

Query vectors (what each word is looking for):
  The      Q = [ 0.248 -0.069  0.324]
  cat      Q = [ 1.003 -0.309 -0.271]
  sat      Q = [ 0.866  0.372 -0.246]

Key vectors (how each word presents itself):
  The      K = [ 0.121 -0.957 -0.862]
  cat      K = [-0.359 -0.575 -0.499]
  sat      K = [-0.482 -0.757  0.749]

Value vectors (what each word contributes):
  The      V = [-0.272  0.055 -0.575]
  cat      V = [-0.262  0.034 -0.692]
  sat      V = [-0.282  0.896 -0.021]

Observe that each word now has three different representations. The same embedding for "cat" becomes different vectors in query, key, and value spaces. This is the core of the QKV framework: specialized representations for specialized roles.

With random projection matrices (as we have here), the vectors don't encode anything linguistically meaningful. But the structure is in place: if we trained these projections on language data, "cat"'s query might learn to seek modifiers and related nouns, its key might learn to signal "available as subject/object," and its value might carry semantic features worth transmitting.

Step 2: Computing Scaled Attention Scores. Now we compute how well each query matches each key:

In[25]:
Code
# Step 2: Compute scaled attention scores
raw_scores = Q_ex @ K_ex.T
scaled_scores = raw_scores / np.sqrt(d_k_ex)
Out[26]:
Console

Step 2: Attention Scores
==================================================

Raw scores (Q @ K^T):
                 The       cat       sat
The           -0.183    -0.211     0.175
cat            0.651    -0.047    -0.452
sat           -0.039    -0.402    -0.883

Scaled scores (÷ √3 = ÷ 1.732):
                 The       cat       sat
The           -0.106    -0.122     0.101
cat            0.376    -0.027    -0.261
sat           -0.022    -0.232    -0.510

The raw scores are dot products between query and key vectors. Each entry measures how well one word's query aligns with another word's key. Scaling by $\sqrt{3} \approx 1.732$ compresses the score range, preventing the softmax saturation we discussed earlier.

Notice the variety in scores: some query-key pairs produce positive scores (the vectors point in similar directions), while others produce negative scores (opposing directions). This variety is what allows attention to be selective, with some positions receiving high attention while others receive little.

Step 3: Converting Scores to Attention Weights. Softmax transforms these real-valued scores into a probability distribution:

In[27]:
Code
# Step 3: Softmax to get attention weights
def softmax_rows(x):
    exp_x = np.exp(x - x.max(axis=1, keepdims=True))
    return exp_x / exp_x.sum(axis=1, keepdims=True)


attention_ex = softmax_rows(scaled_scores)
Out[28]:
Console

Step 3: Attention Weights (after softmax)
==================================================
                 The       cat       sat
The            0.311     0.306     0.383  (sum: 1.000)
cat            0.455     0.304     0.241  (sum: 1.000)
sat            0.412     0.334     0.253  (sum: 1.000)

Each row now sums to exactly 1.0, making these valid probability distributions. The exponential in softmax turns score differences into weight ratios: two positions whose scaled scores differ by $\Delta$ receive weights in the ratio $e^{\Delta}$, so larger gaps produce increasingly lopsided weights. This creates soft but focused attention patterns.

Reading the table: "The" attends to all three words with weights shown in its row. "cat" distributes its attention differently. "sat" has its own pattern. These weights will determine how value vectors are blended to produce each word's output.

Step 4: Aggregating Values into Outputs. Finally, we use the attention weights to compute weighted sums of value vectors:

In[29]:
Code
# Step 4: Weighted sum of values
output_ex = attention_ex @ V_ex
Out[30]:
Console

Step 4: Output (weighted sum of values)
==================================================

Each output is a blend of all value vectors:
  The      output = [-0.273  0.371 -0.399]
           ≈ 0.31×V_The + 0.31×V_cat + 0.38×V_sat

  cat      output = [-0.272  0.251 -0.477]
           ≈ 0.46×V_The + 0.30×V_cat + 0.24×V_sat

  sat      output = [-0.271  0.261 -0.474]
           ≈ 0.41×V_The + 0.33×V_cat + 0.25×V_sat

Each word's output is now a blend of all value vectors, weighted by the attention distribution. The equations show which contributions dominate, displaying only weights above 0.1 to focus on the meaningful contributions.

This is the essence of self-attention: each position receives a custom mixture of information from the entire sequence. The original embedding for "sat" knew nothing about what subject performed the action. After attention, "sat"'s representation incorporates information from "cat" (and "The"), creating a contextualized representation that encodes not just "sat" but "sat with cat as subject in a determined noun phrase."

In trained models, these patterns become linguistically meaningful: verbs attend to their arguments, modifiers attend to what they modify, and pronouns attend to their antecedents, all learned automatically from data without explicit linguistic supervision.

Limitations and Impact

The QKV framework transformed how attention mechanisms are designed. By separating the roles of querying, being queried, and contributing content, it provides flexibility that raw embedding comparisons cannot match. Modern transformers rely entirely on QKV attention, with the projection matrices learning task-specific patterns.

The key limitations stem from the mechanism's simplicity. The projections are linear transformations, meaning the model can only learn linear relationships between embedding dimensions and QKV roles. Deep transformer networks address this by stacking multiple attention layers with nonlinear feed-forward networks between them, allowing the composition of linear attention operations to approximate complex functions.
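
To make the stacking idea concrete, here is a minimal sketch (not this article's implementation) of a transformer-style block: the SelfAttention layer defined above followed by a small ReLU feed-forward network, with residual additions. It assumes d_k = d_v = embed_dim so the residuals line up, and it omits layer normalization for brevity.

Code
class MiniTransformerBlock:
    """Sketch: self-attention followed by a position-wise feed-forward network."""

    def __init__(self, embed_dim, hidden_dim, seed=None):
        if seed is not None:
            np.random.seed(seed)
        # Reuse the SelfAttention class defined earlier; d_k = d_v = embed_dim
        # so the residual additions below are dimensionally valid.
        self.attn = SelfAttention(embed_dim, embed_dim, embed_dim)
        scale1 = np.sqrt(2.0 / (embed_dim + hidden_dim))
        scale2 = np.sqrt(2.0 / (hidden_dim + embed_dim))
        self.W1 = np.random.randn(embed_dim, hidden_dim) * scale1
        self.W2 = np.random.randn(hidden_dim, embed_dim) * scale2

    def forward(self, X):
        attn_out, _ = self.attn.forward(X)
        X = X + attn_out                                 # residual around attention
        ffn_out = np.maximum(0, X @ self.W1) @ self.W2   # ReLU feed-forward (the nonlinearity)
        return X + ffn_out                               # residual around the FFN


block = MiniTransformerBlock(embed_dim=8, hidden_dim=32, seed=0)
print(block.forward(np.random.randn(5, 8)).shape)  # (5, 8)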

Another limitation is the lack of explicit structure. QKV attention treats all positions symmetrically, with no built-in notion of syntax, hierarchy, or compositional structure. The model must learn these patterns from data, which requires substantial training examples. This data hunger is both a limitation (large datasets required) and a strength (the model isn't constrained by human-designed rules).

The impact of QKV attention extends beyond NLP. The same mechanism powers vision transformers (ViT), which treat image patches as "tokens." It underlies multimodal models that attend across text, images, and other modalities. The generality of query-key-value matching as a computational primitive has made it foundational across modern AI.

Key Parameters

When implementing QKV self-attention, these parameters control the mechanism's capacity and computational cost:

  • embed_dim (d): The dimension of input token embeddings. This determines the input size to the projection matrices. Common values range from 256 to 4096 in production transformers.

  • d_k (query/key dimension): The dimension of query and key vectors after projection. Smaller values reduce computation (since attention scores require $O(n^2 \cdot d_k)$ operations) but limit the expressiveness of query-key matching. Typically set to embed_dim / num_heads in multi-head attention.

  • d_v (value dimension): The dimension of value vectors after projection. This determines the output dimension of the attention layer. Usually equals d_k for simplicity, but can differ when the output needs a different size than the matching space.

  • Initialization scale: The projection matrices are typically initialized with small random values. Xavier/Glorot initialization scales weights by $\sqrt{2/(d_{\text{in}} + d_{\text{out}})}$, where $d_{\text{in}}$ is the input dimension and $d_{\text{out}}$ is the output dimension. This scaling helps maintain stable gradient magnitudes during training by keeping the variance of activations approximately constant across layers.

The ratio of d_k to embed_dim represents a compression factor. Setting d_k < embed_dim forces the model to learn a compressed representation for matching, which can act as a form of regularization. Setting d_k = embed_dim preserves full expressiveness but increases memory and computation.
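
As a rough illustration of that trade-off, the sketch below counts the projection weights for a compressed versus a full-dimension choice of d_k (the numbers are hypothetical and cover a single head, with no bias terms):

Code
def qkv_param_count(embed_dim, d_k, d_v):
    """Number of weights in the three projection matrices (no biases)."""
    return embed_dim * d_k + embed_dim * d_k + embed_dim * d_v

d = 768
compressed = qkv_param_count(d, d_k=64, d_v=64)  # d_k well below embed_dim
full       = qkv_param_count(d, d_k=d,  d_v=d)   # d_k equal to embed_dim

print(f"compressed (d_k=64): {compressed:,} parameters")
print(f"full (d_k=768):      {full:,} parameters")
print(f"ratio: {full / compressed:.1f}x")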

Summary

Query, Key, Value projections transform self-attention from a fixed similarity computation into a learned matching mechanism. By giving each token separate representations for different roles, the model can learn arbitrary attention patterns that capture linguistic relationships.

Key takeaways from this chapter:

  • Three roles, one mechanism: Queries express what a token seeks, keys advertise what a token offers, and values provide what gets aggregated. This separation enables flexible, learned attention patterns.

  • Projection matrices are learned: $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$ are the trainable parameters. During training, they discover which aspects of embeddings matter for matching and aggregation.

  • Query-key matching determines attention: The dot product $\mathbf{Q}\mathbf{K}^T$ computes compatibility scores. High scores mean a query matches a key well, leading to high attention weight.

  • Scaling prevents gradient issues: Dividing by $\sqrt{d_k}$ keeps score magnitudes stable regardless of dimension, ensuring softmax operates in a regime with healthy gradients.

  • Values carry the content: Attention weights determine how much each position contributes, but value vectors determine what content flows. A token can have a key that attracts attention while its value transmits different information.

  • Dimensions have meaning: Query and key dimensions must match ($d_k$) for dot product compatibility. Value dimension ($d_v$) determines output size. In practice, $d_k = d_v$ for simplicity.

In the next chapter, we'll explore multi-head attention, which runs multiple QKV attention operations in parallel. This allows the model to capture different types of relationships simultaneously, dramatically increasing the expressiveness of the attention mechanism.
