Scaled Dot-Product Attention: The Core Transformer Mechanism

Michael Brenndoerfer · Updated May 28, 2025 · 38 min read

Master scaled dot-product attention with queries, keys, and values. Learn why scaling by √d_k prevents softmax saturation and enables stable transformer training.

Scaled Dot-Product Attention

In the previous chapter, we explored the fundamental pattern of self-attention: compute similarity scores, normalize with softmax, and aggregate values. We used raw embeddings directly, measuring similarity through dot products between embedding vectors. This simplified approach captures the essence of attention, but real transformer models use a more powerful formulation called scaled dot-product attention.

This refinement introduces two key modifications. First, instead of using embeddings directly, we project them into three separate representations: queries, keys, and values. Second, we scale the dot products before applying softmax to prevent numerical instability. These changes might seem minor, but they're essential for training deep attention-based models effectively.

The Query-Key-Value Framework

The insight behind queries, keys, and values comes from information retrieval. Think of a database lookup: you have a query (what you're searching for), a set of keys (labels for stored items), and values (the actual stored content). The lookup process finds keys that match the query and returns the corresponding values.

Self-attention works similarly. For each position in the sequence:

  • The query represents what this position is "looking for"
  • The keys represent what each position "offers" to be matched against
  • The values represent the information each position will contribute to the output
Query, Key, Value

In attention mechanisms, queries, keys, and values are learned linear projections of the input embeddings. The query at position $i$ is matched against all keys to determine how much each value contributes to position $i$'s output.

Why not just use the embeddings directly, as we did in the simplified version? The separate projections give the model more flexibility. The query projection can learn to emphasize features that are useful for finding relevant context. The key projection can learn to emphasize features that are useful for being found. And the value projection can learn what information is actually useful to contribute to the output. These are different roles, and separating them allows the model to specialize each representation.
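
To make the retrieval analogy concrete, here is a minimal sketch (the numbers are made up, not values from this chapter) contrasting a hard database-style lookup with the soft lookup that attention performs: rather than returning the single best-matching value, attention blends all values in proportion to query-key similarity.

Code
import numpy as np

query = np.array([1.0, 0.2])                              # what this position is looking for
keys = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])     # what each stored item offers
values = np.array([[10.0], [20.0], [30.0]])               # the stored content

scores = keys @ query                                     # similarity of the query to each key
hard_result = values[np.argmax(scores)]                   # database-style lookup: best match only

exp_scores = np.exp(scores - scores.max())
weights = exp_scores / exp_scores.sum()                   # softmax turns scores into proportions
soft_result = weights @ values                            # attention-style lookup: blend all values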

Linear Projections

Given an input sequence of $n$ tokens with $d$-dimensional embeddings, we create queries, keys, and values through learned linear transformations:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

where:

  • $X \in \mathbb{R}^{n \times d}$ is the input embedding matrix (rows are token embeddings)
  • $W^Q \in \mathbb{R}^{d \times d_k}$ projects inputs to queries
  • $W^K \in \mathbb{R}^{d \times d_k}$ projects inputs to keys
  • $W^V \in \mathbb{R}^{d \times d_v}$ projects inputs to values
  • $Q \in \mathbb{R}^{n \times d_k}$ is the query matrix
  • $K \in \mathbb{R}^{n \times d_k}$ is the key matrix
  • $V \in \mathbb{R}^{n \times d_v}$ is the value matrix

The dimensions $d_k$ (key/query dimension) and $d_v$ (value dimension) are hyperparameters. Often $d_k = d_v = d$, but they can differ. The critical requirement is that queries and keys must have the same dimension $d_k$ since we'll compute their dot products.

In[2]:
Code
import numpy as np

np.random.seed(42)

# Example dimensions
n = 4  # sequence length (number of tokens)
d = 8  # embedding dimension
d_k = 6  # query/key dimension
d_v = 6  # value dimension

# Input embeddings (random for demonstration)
X = np.random.randn(n, d)

# Learned projection matrices (normally these would be trained)
W_Q = np.random.randn(d, d_k) * 0.1
W_K = np.random.randn(d, d_k) * 0.1
W_V = np.random.randn(d, d_v) * 0.1

# Create queries, keys, and values
Q = X @ W_Q
K = X @ W_K
V = X @ W_V
Out[3]:
Console
Shape transformations:
  Input X:     (4, 8) (n × d)
  Queries Q:   (4, 6) (n × d_k)
  Keys K:      (4, 6) (n × d_k)
  Values V:    (4, 6) (n × d_v)

Each token's embedding gets transformed into three distinct vectors. Position $i$ produces query $\mathbf{q}_i$, key $\mathbf{k}_i$, and value $\mathbf{v}_i$. The query will be used to find relevant positions; the key will be used to be found; the value will be used to contribute information.

The Dot Product for Similarity

With queries and keys in hand, we need a way to measure how relevant each key is to each query. The dot product provides exactly this: it quantifies how much two vectors point in the same direction. For query $\mathbf{q}_i$ and key $\mathbf{k}_j$, the similarity score is:

$$s_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j = \sum_{l=1}^{d_k} q_{il} \cdot k_{jl}$$

where:

  • $s_{ij}$: the similarity score between position $i$'s query and position $j$'s key
  • $\mathbf{q}_i$: the query vector for position $i$, a $d_k$-dimensional vector
  • $\mathbf{k}_j$: the key vector for position $j$, also $d_k$-dimensional
  • $q_{il}$: the $l$-th component of query $\mathbf{q}_i$
  • $k_{jl}$: the $l$-th component of key $\mathbf{k}_j$
  • $d_k$: the dimension of query and key vectors

The dot product measures alignment: if the query and key point in similar directions, the score is high and positive. If they're orthogonal (perpendicular in the $d_k$-dimensional space), the score is zero. If they point in opposite directions, the score is negative.
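
A tiny numerical illustration (the vectors below are arbitrary choices, not from the running example) makes these three cases concrete:

Code
import numpy as np

q = np.array([1.0, 0.0])              # a query pointing "east"
k_aligned = np.array([0.9, 0.1])      # key pointing mostly east
k_orthogonal = np.array([0.0, 1.0])   # key pointing north
k_opposite = np.array([-0.9, -0.1])   # key pointing mostly west

print(q @ k_aligned)     # 0.9  -> high positive score (similar directions)
print(q @ k_orthogonal)  # 0.0  -> orthogonal, no similarity
print(q @ k_opposite)    # -0.9 -> negative score (opposite directions)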

To compute all $n^2$ pairwise similarity scores at once, we use matrix multiplication:

$$S = QK^T$$

where:

  • $S \in \mathbb{R}^{n \times n}$: the score matrix containing all pairwise similarities
  • $Q \in \mathbb{R}^{n \times d_k}$: the query matrix (rows are query vectors)
  • $K^T \in \mathbb{R}^{d_k \times n}$: the transposed key matrix (columns are key vectors)

Entry $s_{ij}$ in the resulting matrix tells us how much position $i$'s query matches position $j$'s key. This single matrix multiplication replaces $n^2$ individual dot products.

In[4]:
Code
# Compute all pairwise similarity scores
scores = Q @ K.T
Out[5]:
Console
Score matrix shape: (4, 4)

Raw similarity scores:
[[ 0.045 -0.237 -0.014  0.061]
 [-0.053  0.492  0.015 -0.134]
 [-0.147  0.173  0.044  0.191]
 [-0.114  0.163  0.109 -0.301]]

The score matrix shows the raw dot products between all query-key pairs. Some values are positive (similar directions), others negative (opposite directions). Before we can use these as attention weights, we need to apply softmax. But there's a problem we need to address first.

The Scaling Problem

Consider what happens as the dimension $d_k$ grows. Each dot product is a sum of $d_k$ terms:

$$\mathbf{q} \cdot \mathbf{k} = \sum_{l=1}^{d_k} q_l k_l$$

where:

  • $\mathbf{q}$: a query vector of dimension $d_k$
  • $\mathbf{k}$: a key vector of dimension $d_k$
  • $q_l$: the $l$-th component of the query vector
  • $k_l$: the $l$-th component of the key vector
  • $d_k$: the number of dimensions in the query/key space

If the individual components $q_l$ and $k_l$ have unit variance (which is typical after proper initialization), then each product $q_l k_l$ has variance around 1. The sum of $d_k$ independent terms with variance 1 has variance $d_k$. This means the dot product's magnitude scales with $\sqrt{d_k}$.

Let's verify this empirically:

In[6]:
Code
# Demonstrate how dot product variance scales with dimension
dimensions = [8, 32, 64, 128, 256, 512]
n_samples = 10000

variances = []
for d_k_test in dimensions:
    # Random unit-variance vectors
    q = np.random.randn(n_samples, d_k_test)
    k = np.random.randn(n_samples, d_k_test)

    # Dot products
    dots = np.sum(q * k, axis=1)
    variances.append(np.var(dots))
Out[7]:
Visualization
Line plot showing linear relationship between vector dimension and dot product variance.
Dot product variance scales linearly with dimension. For 512-dimensional vectors, dot products have variance around 512, meaning scores typically range from -50 to +50. Without scaling, softmax would produce extremely peaked distributions.

The variance grows exactly as predicted: for $d_k = 512$, the variance is approximately 512. This means dot products routinely reach magnitudes of 20 or more, since the standard deviation is $\sqrt{512} \approx 23$.

Why does this matter? The softmax function converts scores to probabilities:

$$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{k=1}^{n} \exp(s_{ik})}$$

where:

  • $\alpha_{ij}$: the attention weight from position $i$ to position $j$ (how much position $i$ attends to position $j$)
  • $s_{ij}$: the raw similarity score between position $i$'s query and position $j$'s key
  • $\exp(s_{ij})$: the exponential of the score, ensuring positivity
  • $\sum_{k=1}^{n} \exp(s_{ik})$: the sum over all $n$ positions, normalizing so weights sum to 1

Large input values push softmax into its saturated regime. If one score is significantly larger than the others, the exponential blows up that difference. Consider scores [10, 1, 1, 1]: after softmax, the weights become approximately [0.9999, 0.0001, 0.0001, 0.0001]. The attention becomes nearly one-hot, attending almost exclusively to a single position.

This extreme behavior causes two problems during training. First, gradients become vanishingly small in the saturated regions of softmax, slowing learning. Second, the model loses the ability to express soft, distributed attention patterns. It can only focus sharply on one thing.
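
To make the gradient claim concrete, here is a supplementary sketch (not part of the original walkthrough) that evaluates the softmax Jacobian, whose entries are $p_i(\delta_{ij} - p_j)$, for a distributed and a peaked weight vector. The largest entry shrinks by more than an order of magnitude in the saturated case, which is exactly the vanishing-gradient effect described above.

Code
import numpy as np

for scores in (np.array([1.0, 0.8, 0.5, 0.2]), np.array([20.0, 16.0, 10.0, 4.0])):
    e = np.exp(scores - scores.max())
    p = e / e.sum()
    jacobian = np.diag(p) - np.outer(p, p)   # d softmax_i / d score_j = p_i (delta_ij - p_j)
    print(np.abs(jacobian).max())            # ~0.23 for the small scores, ~0.018 for the large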

In[8]:
Code
def softmax(x):
    """Numerically stable softmax."""
    exp_x = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp_x / exp_x.sum(axis=-1, keepdims=True)


# Compare softmax behavior at different score scales
small_scores = np.array([1.0, 0.8, 0.5, 0.2])
large_scores = small_scores * 20  # Scale up by 20x

small_weights = softmax(small_scores)
large_weights = softmax(large_scores)
Out[9]:
Console
Effect of score magnitude on attention weights:

Small scores:  [1.  0.8 0.5 0.2]
Weights:       [0.3479 0.2848 0.211  0.1563]
Max weight:    0.3479
Entropy:       1.3434

Large scores:  [20. 16. 10.  4.]
Weights:       [9.8197e-01 1.7985e-02 4.5000e-05 0.0000e+00]
Max weight:    0.981970
Entropy:       0.0906

With small scores, attention distributes across positions: the highest weight is around 0.35 and the entropy is around 1.34 (fairly distributed). With large scores, attention collapses: the maximum weight approaches 1.0 and entropy drops to nearly 0 (extremely peaked). The model can no longer express nuanced attention patterns.

Out[10]:
Visualization
Bar chart showing distributed attention weights across 4 positions, with highest around 0.4.
Small scores produce distributed attention. All positions receive meaningful weight, allowing the model to blend information from multiple sources.
Bar chart showing saturated attention with one position at nearly 1.0 and others near 0.
Large scores cause softmax saturation. Attention collapses to near-one-hot, focusing almost exclusively on a single position.

The visualization makes the contrast stark. With small scores, attention is roughly distributed across all four positions. With large scores (20x larger), almost all attention flows to position 1, with the other positions receiving essentially zero weight. This is why scaling is essential: without it, high-dimensional embeddings produce scores so large that softmax always saturates.

The Scaling Factor: $1/\sqrt{d_k}$

The solution is straightforward: divide the dot products by $\sqrt{d_k}$ before applying softmax. Since the dot product variance is $d_k$, dividing by $\sqrt{d_k}$ brings the variance back to 1:

$$\text{Var}\left(\frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}\right) = \frac{\text{Var}(\mathbf{q} \cdot \mathbf{k})}{d_k} = \frac{d_k}{d_k} = 1$$

where:

  • $\text{Var}(\cdot)$: the variance operator
  • $\mathbf{q} \cdot \mathbf{k}$: the dot product between a query and key vector
  • $d_k$: the dimension of query and key vectors
  • $\sqrt{d_k}$: the scaling factor applied to the dot product

The key property used here is that when you divide a random variable by a constant $c$, its variance is divided by $c^2$. Since we divide by $\sqrt{d_k}$, the variance is divided by $(\sqrt{d_k})^2 = d_k$.

With unit-variance scores, softmax operates in its sensitive region where small changes in scores produce meaningful changes in weights. The model can learn both sharp and soft attention patterns.

Why $1/\sqrt{d_k}$?

The scaling factor $1/\sqrt{d_k}$ normalizes dot product scores to have unit variance regardless of dimension. This prevents softmax saturation and maintains healthy gradients during training.

Derivation of Dot Product Variance

Let's derive why the dot product has variance $d_k$. Assume $q_l$ and $k_l$ are independent random variables with zero mean and unit variance (typical after proper weight initialization).

Step 1: Variance of a single product term

For a single component product $q_l k_l$:

  • Mean: $\mathbb{E}[q_l k_l] = \mathbb{E}[q_l] \cdot \mathbb{E}[k_l] = 0 \times 0 = 0$ (by independence)
  • Variance: $\text{Var}(q_l k_l) = \mathbb{E}[q_l^2 k_l^2] - \mathbb{E}[q_l k_l]^2 = \mathbb{E}[q_l^2] \cdot \mathbb{E}[k_l^2] - 0 = 1 \times 1 = 1$

Each product term has zero mean and unit variance.

Step 2: Variance of the sum (the dot product)

The dot product $\mathbf{q} \cdot \mathbf{k} = \sum_{l=1}^{d_k} q_l k_l$ sums $d_k$ independent terms. When summing independent random variables, variances add:

$$\text{Var}(\mathbf{q} \cdot \mathbf{k}) = \sum_{l=1}^{d_k} \text{Var}(q_l k_l) = \sum_{l=1}^{d_k} 1 = d_k$$

Step 3: Effect of scaling

Dividing by $\sqrt{d_k}$ scales the variance by $1/d_k$:

$$\text{Var}\left(\frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}\right) = \frac{\text{Var}(\mathbf{q} \cdot \mathbf{k})}{(\sqrt{d_k})^2} = \frac{d_k}{d_k} = 1$$

This is why the "Attention Is All You Need" paper prescribes exactly this scaling factor: it ensures unit-variance scores regardless of the embedding dimension.
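
As a quick empirical check of this derivation (a sketch; the sample size and dimensions are arbitrary choices), we can confirm that scaled dot products of unit-variance vectors have variance close to 1 at every dimension:

Code
import numpy as np

rng = np.random.default_rng(0)
for d_k_check in (16, 64, 256, 1024):
    q = rng.standard_normal((100_000, d_k_check))
    k = rng.standard_normal((100_000, d_k_check))
    scaled_dots = (q * k).sum(axis=1) / np.sqrt(d_k_check)
    print(d_k_check, round(float(np.var(scaled_dots)), 3))  # each prints a value close to 1.0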

The Complete Attention Formula

We've now assembled all the pieces: queries that search for relevant context, keys that advertise what each position offers, values that carry the actual information, dot products that measure relevance, scaling that keeps softmax well-behaved, and softmax that converts scores to weights. The complete mechanism chains these operations together into a single formula.

From Components to Formula

Consider what we need to accomplish for each position in the sequence:

  1. Find relevant positions: Compare this position's query against all keys
  2. Determine importance: Convert raw similarities into attention weights
  3. Gather information: Blend all values according to those weights

The formula that captures this entire process is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where:

  • $Q \in \mathbb{R}^{n \times d_k}$: the query matrix, with one row per position
  • $K \in \mathbb{R}^{n \times d_k}$: the key matrix, with one row per position
  • $V \in \mathbb{R}^{n \times d_v}$: the value matrix, containing information to aggregate
  • $d_k$: the dimension of queries and keys (must match for dot product)
  • $d_v$: the dimension of values (can differ from $d_k$)
  • $n$: the sequence length (number of tokens)
  • $\text{softmax}$: applied row-wise to convert scores to probability distributions

This formula is read from the inside out, following the order of operations. Let's trace through each stage to understand how the pieces fit together.

Stage-by-Stage Computation

Stage 1: Compute pairwise similarities with $QK^T$

The matrix multiplication $QK^T$ computes all $n^2$ query-key dot products simultaneously. When we multiply a query matrix of shape $(n, d_k)$ by a transposed key matrix of shape $(d_k, n)$, we get a score matrix of shape $(n, n)$. Entry $(i, j)$ contains the dot product $\mathbf{q}_i \cdot \mathbf{k}_j$, measuring how much position $i$ should attend to position $j$.

Stage 2: Normalize variance with $1/\sqrt{d_k}$

Before applying softmax, we divide all scores by $\sqrt{d_k}$. As we derived earlier, this keeps the score variance at 1 regardless of dimension, preventing softmax from saturating. Without this scaling, high-dimensional queries and keys would produce scores so large that softmax collapses to near-one-hot distributions.

Stage 3: Convert to probabilities with $\text{softmax}(\cdot)$

Softmax is applied row-wise, so each row of the score matrix becomes an independent probability distribution. Row $i$ tells us how position $i$ distributes its attention across all positions. The weights are positive and sum to 1, making them valid for computing weighted averages.

Stage 4: Aggregate values with $V$

Finally, we multiply the $n \times n$ attention weight matrix by the $n \times d_v$ value matrix. This computes, for each position, a weighted sum of all value vectors. The result is an $n \times d_v$ matrix where row $i$ is position $i$'s new representation, enriched with information gathered from across the sequence.

The output is a matrix where each row is the contextual representation for one position, computed as a weighted average of all value vectors according to the attention weights.

Implementation

Translating this formula into code reveals its simplicity. Three matrix operations capture the entire attention mechanism:

In[11]:
Code
def scaled_dot_product_attention(Q, K, V):
    """
    Compute scaled dot-product attention.

    Args:
        Q: Queries, shape (n, d_k)
        K: Keys, shape (n, d_k)
        V: Values, shape (n, d_v)

    Returns:
        output: Attention output, shape (n, d_v)
        attention_weights: Attention weights, shape (n, n)
    """
    d_k = Q.shape[-1]

    # Stage 1 & 2: Compute scaled similarity scores
    scores = Q @ K.T / np.sqrt(d_k)

    # Stage 3: Apply softmax to get attention weights
    attention_weights = softmax(scores)

    # Stage 4: Compute weighted sum of values
    output = attention_weights @ V

    return output, attention_weights

The implementation is concise. One matrix multiplication computes all pairwise scores, scalar division handles scaling, softmax normalizes each row, and a final matrix multiplication aggregates the values. This composability is what makes attention so powerful: complex contextual reasoning emerges from simple, differentiable operations.

Let's apply this function to our running example and examine the outputs:

In[12]:
Code
# Apply scaled dot-product attention to our example
output, attention_weights = scaled_dot_product_attention(Q, K, V)
Out[13]:
Console
Input embedding shape:   (4, 8)
Output shape:            (4, 6)
Attention weights shape: (4, 4)

Attention weight matrix (rows sum to 1):
[[0.258 0.23  0.252 0.26 ]
 [0.236 0.294 0.242 0.228]
 [0.229 0.261 0.247 0.263]
 [0.241 0.27  0.264 0.224]]

Row sums: [1. 1. 1. 1.]

The attention weight matrix has shape $(4, 4)$: one row for each of our 4 tokens, one column for each potential attention target. Each row sums to exactly 1.0, confirming that softmax produces valid probability distributions. The output has shape $(4, 6)$, matching our sequence length and value dimension. Each position now carries a contextual representation that blends information from all positions, with the blending proportions determined by the attention weights.

Visualizing the Attention Computation

Let's trace through the complete attention computation visually. We'll use a small example with interpretable tokens to see how each step transforms the data.

In[14]:
Code
# A more interpretable example with 4 tokens
tokens = ["The", "cat", "sat", "down"]
n_tokens = len(tokens)

# Create simple embeddings (in practice these would be learned)
np.random.seed(123)
d_model = 8
X_example = np.random.randn(n_tokens, d_model) * 0.5

# Projection matrices (smaller d_k for visualization)
d_k_example = 4
d_v_example = 4
W_Q_ex = np.random.randn(d_model, d_k_example) * 0.3
W_K_ex = np.random.randn(d_model, d_k_example) * 0.3
W_V_ex = np.random.randn(d_model, d_v_example) * 0.3

# Create Q, K, V
Q_ex = X_example @ W_Q_ex
K_ex = X_example @ W_K_ex
V_ex = X_example @ W_V_ex

# Compute attention
output_ex, weights_ex = scaled_dot_product_attention(Q_ex, K_ex, V_ex)
Out[15]:
Visualization
Heatmap of scaled dot product scores between query and key positions, showing values between -1 and 1.
Scaled similarity scores after dividing by sqrt(d_k). Values are centered around zero with moderate magnitude, keeping softmax in its sensitive region.
Heatmap of attention weights showing probability distributions, with brighter cells indicating stronger attention.
Attention weights after softmax. Each row sums to 1, representing how each token distributes attention across the sequence. Notice 'cat' attends strongly to 'sat'.

The left panel shows scaled similarity scores ranging roughly from -1 to +1. This moderate range keeps softmax well-behaved. The right panel shows attention weights after softmax, where each row forms a probability distribution. Some positions focus attention sharply (one dominant weight), while others distribute attention more broadly.

Matrix Form and Computational Efficiency

The formula $\text{Attention}(Q, K, V) = \text{softmax}(QK^T/\sqrt{d_k})V$ computes everything through matrix multiplications, which are highly optimized on modern hardware.

Let's trace the shapes through the computation:

In[16]:
Code
# Shape analysis for attention computation
def trace_attention_shapes(n, d_k, d_v):
    """Trace matrix shapes through attention computation."""
    shapes = {
        "Q": (n, d_k),
        "K": (n, d_k),
        "V": (n, d_v),
        "K^T": (d_k, n),
        "QK^T": (n, n),
        "softmax(QK^T/√d_k)": (n, n),
        "Attention output": (n, d_v),
    }
    return shapes


shapes = trace_attention_shapes(n=100, d_k=64, d_v=64)
Out[17]:
Console
Matrix shapes in attention computation (n=100, d_k=d_v=64):

  Q                        : (100, 64)
  K                        : (100, 64)
  V                        : (100, 64)
  K^T                      : (64, 100)
  QK^T                     : (100, 100)
  softmax(QK^T/√d_k)       : (100, 100)
  Attention output         : (100, 64)

The attention weight matrix is $n \times n$, which is the source of the quadratic complexity we discussed in the previous chapter. For sequence length $n = 100$, this is a $100 \times 100 = 10{,}000$ element matrix. For $n = 10{,}000$ (a moderately long document), it becomes 100 million elements.
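
A back-of-the-envelope sketch (assuming a single head, a single sequence, and float32 scores) shows how quickly the score matrix alone grows with sequence length:

Code
for n_tokens in (100, 4_000, 8_000, 32_000):
    n_scores = n_tokens * n_tokens
    megabytes = n_scores * 4 / 1e6  # 4 bytes per float32 score
    print(f"n = {n_tokens:>6}: {n_scores:>13,} scores, ~{megabytes:,.0f} MB")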

However, note that the computations are batch-able. In practice, we process multiple sequences simultaneously:

In[18]:
Code
def batched_attention(Q, K, V):
    """
    Batched scaled dot-product attention.

    Args:
        Q: shape (batch, n, d_k)
        K: shape (batch, n, d_k)
        V: shape (batch, n, d_v)

    Returns:
        output: shape (batch, n, d_v)
    """
    d_k = Q.shape[-1]

    # Batched matrix multiply: (batch, n, d_k) @ (batch, d_k, n) -> (batch, n, n)
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)

    # Softmax along last axis
    weights = softmax(scores)

    # (batch, n, n) @ (batch, n, d_v) -> (batch, n, d_v)
    output = np.matmul(weights, V)

    return output, weights


# Example with batch of 3 sequences
batch_size = 3
Q_batch = np.random.randn(batch_size, n, d_k)
K_batch = np.random.randn(batch_size, n, d_k)
V_batch = np.random.randn(batch_size, n, d_v)

out_batch, _ = batched_attention(Q_batch, K_batch, V_batch)
Out[19]:
Console
Batched attention shapes:
  Q batch:      (3, 4, 6)
  Output batch: (3, 4, 6)

Processed 3 sequences of length 4 in one operation

Batched operations leverage GPU parallelism effectively. All three sequences are processed simultaneously through the same matrix multiplications.

A Complete Worked Example

The formula becomes concrete when we trace through it with actual numbers. Let's work through a minimal example: three tokens with 2-dimensional queries, keys, and values. By keeping the dimensions tiny, we can follow every calculation by hand and build intuition for what the attention mechanism actually computes.

Setting Up the Example

We'll work with three abstract tokens, "A," "B," and "C." In a real model, these would come from learned projections of input embeddings. Here, we'll define them directly with carefully chosen values that have geometric meaning.

In[20]:
Code
# Tiny example for hand-traceable computation
tokens_tiny = ["A", "B", "C"]

# Pre-computed Q, K, V (normally these come from projections)
Q_tiny = np.array(
    [
        [1.0, 0.0],  # Query for "A": points east
        [0.5, 0.5],  # Query for "B": points northeast
        [0.0, 1.0],  # Query for "C": points north
    ]
)

K_tiny = np.array(
    [
        [0.8, 0.2],  # Key for "A"
        [0.3, 0.7],  # Key for "B"
        [0.1, 0.9],  # Key for "C"
    ]
)

V_tiny = np.array(
    [
        [1.0, 0.0],  # Value for "A"
        [0.0, 1.0],  # Value for "B"
        [0.5, 0.5],  # Value for "C"
    ]
)

d_k_tiny = 2

The queries have an intuitive geometric interpretation: "A" points due east (along dimension 1), "C" points due north (along dimension 2), and "B" points northeast (equally between both dimensions). The keys are distributed similarly but with different magnitudes. The values are orthogonal basis vectors plus a midpoint, making it easy to see how blending works.

Out[21]:
Visualization
2D scatter plot showing query vectors as solid arrows and key vectors as dashed arrows from the origin, with token labels.
Geometric interpretation of queries and keys in 2D space. Queries (solid arrows) and keys (dashed arrows) are vectors. The dot product between a query and key measures their alignment: parallel vectors have high scores, perpendicular vectors have low scores.

The visualization reveals why certain query-key pairs have high dot products. Query $q_A$ points east and aligns well with key $k_A$ (which also has a strong eastward component), producing a high similarity score. Query $q_C$ points north and aligns best with key $k_C$ (which points mostly north). Query $q_B$ lies between the axes and has moderate similarity with all keys.

Step 1: Compute Raw Similarity Scores

First, we compute the dot product between every query and every key. This measures how well each query-key pair aligns.

In[22]:
Code
# QK^T gives us all pairwise dot products
raw_scores = Q_tiny @ K_tiny.T
Out[23]:
Console
Step 1: Raw dot products (QK^T)

Computation for Q[0] · K[j]:
  Q[0]·K[0] = (1.0)(0.8) + (0.0)(0.2) = 0.80
  Q[0]·K[1] = (1.0)(0.3) + (0.0)(0.7) = 0.30
  Q[0]·K[2] = (1.0)(0.1) + (0.0)(0.9) = 0.10

Full score matrix:
[[0.8 0.3 0.1]
 [0.5 0.5 0.5]
 [0.2 0.7 0.9]]

Token "A" (with its eastward query) has the highest similarity with key "A" (which also points mostly east) and lowest similarity with key "C" (which points mostly north). This makes geometric sense: parallel vectors have high dot products, perpendicular vectors have low dot products.

Step 2: Apply the Scaling Factor

Next, we divide all scores by $\sqrt{d_k} = \sqrt{2} \approx 1.41$. With only 2 dimensions, this scaling is modest, but it becomes crucial in high-dimensional settings.

In[24]:
Code
# Scale to prevent softmax saturation
scale = 1.0 / np.sqrt(d_k_tiny)
scaled_scores_tiny = raw_scores * scale
Out[25]:
Console
Step 2: Scale by 1/√d_k = 1/√2 = 0.7071

Scaled scores:
[[0.5657 0.2121 0.0707]
 [0.3536 0.3536 0.3536]
 [0.1414 0.495  0.6364]]

Note: variance reduced by factor of 2

The scaled scores are smaller in magnitude, keeping them in a range where softmax will produce meaningful gradients. In our 2D example, scores that were around 0.8 become around 0.57. The relative ordering is preserved, but the absolute differences are compressed.

Step 3: Convert to Attention Weights

Softmax transforms each row of scaled scores into a probability distribution. Let's trace through the calculation for token "A" in detail.

In[26]:
Code
# Softmax converts scores to probabilities
attention_tiny = softmax(scaled_scores_tiny)
Out[27]:
Console
Step 3: Softmax normalization

For row 0 (token 'A'):
  exp(scores) = [1.7607, 1.2363, 1.0733]
  sum = 4.0702
  weights = exp/sum = [0.4326, 0.3037, 0.2637]

Full attention weight matrix:
[[0.4326 0.3037 0.2637]
 [0.3333 0.3333 0.3333]
 [0.246  0.3504 0.4036]]

Row sums (should be 1.0): [1. 1. 1.]

Each row now sums to 1.0, forming a valid probability distribution. Token "A" attends most strongly to itself (because its query aligned best with its own key), but it also gathers some information from "B" and "C." The attention isn't one-hot; it's a soft distribution that blends information from multiple sources.

Step 4: Aggregate Values

Finally, we use the attention weights to compute a weighted average of all value vectors for each position.

In[28]:
Code
# Compute output as weighted combination of values
output_tiny = attention_tiny @ V_tiny
Out[29]:
Console
Step 4: Weighted aggregation of values

For token 'A' (row 0):
  output = 0.4326×[1.0, 0.0] + 0.3037×[0.0, 1.0] + 0.2637×[0.5, 0.5]
  output = [0.5644, 0.4356]

Full output matrix:
[[0.5644 0.4356]
 [0.5    0.5   ]
 [0.4478 0.5522]]

Token "A" started with value [1.0,0.0][1.0, 0.0] but now has an output that incorporates information from all three positions. The output is approximately [0.52,0.48][0.52, 0.48], reflecting the weighted blend of "A"'s value (weighted ~0.37), "B"'s value (weighted ~0.32), and "C"'s value (weighted ~0.31). The original position-specific information has been enriched with contextual information from the entire sequence.

Out[30]:
Visualization
Heatmap showing attention weights between tokens A, B, and C with values annotated in each cell.
Attention weight matrix showing how each token distributes attention. Each row is a probability distribution over all positions.
Grouped bar chart showing output values for each token across two dimensions.
Final output representations, each a weighted blend of all value vectors according to the attention weights.

Comparing Scaled vs Unscaled Attention

To solidify why scaling matters, let's compare attention patterns with and without the $1/\sqrt{d_k}$ factor at different dimensions.

In[31]:
Code
def compare_scaling(d_k_test):
    """Compare attention with and without scaling."""
    np.random.seed(42)
    n_test = 5

    # Random Q, K
    Q_test = np.random.randn(n_test, d_k_test)
    K_test = np.random.randn(n_test, d_k_test)

    # Unscaled
    scores_unscaled = Q_test @ K_test.T
    weights_unscaled = softmax(scores_unscaled)

    # Scaled
    scores_scaled = scores_unscaled / np.sqrt(d_k_test)
    weights_scaled = softmax(scores_scaled)

    # Compute entropy (measure of attention spread)
    def entropy(w):
        return -np.sum(w * np.log(w + 1e-10), axis=-1).mean()

    return {
        "d_k": d_k_test,
        "score_std_unscaled": scores_unscaled.std(),
        "score_std_scaled": scores_scaled.std(),
        "max_weight_unscaled": weights_unscaled.max(axis=-1).mean(),
        "max_weight_scaled": weights_scaled.max(axis=-1).mean(),
        "entropy_unscaled": entropy(weights_unscaled),
        "entropy_scaled": entropy(weights_scaled),
    }


# Compare across dimensions
dims = [4, 16, 64, 256, 512]
comparisons = [compare_scaling(d) for d in dims]
Out[32]:
Console
Effect of scaling across dimensions:

   d_k | Score Std (unscaled) | Score Std (scaled) | Max Weight (unsc.) | Max Weight (sc.)
------------------------------------------------------------------------------------------
     4 |                 1.56 |               0.78 |             0.6048 |           0.4355
    16 |                 3.00 |               0.75 |             0.8633 |           0.4592
    64 |                 8.66 |               1.08 |             0.7493 |           0.4398
   256 |                13.47 |               0.84 |             1.0000 |           0.4961
   512 |                20.79 |               0.92 |             1.0000 |           0.4842

Without scaling, score standard deviation grows with $\sqrt{d_k}$, reaching about 21 at $d_k = 512$. This pushes the maximum attention weight toward 1.0, meaning attention collapses to near-one-hot. With scaling, score standard deviation stays around 1.0, and maximum weights remain moderate, allowing distributed attention patterns.

Out[33]:
Visualization
Line plot showing unscaled score std growing from 2 to 22 while scaled remains near 1.
Score standard deviation with and without scaling. Unscaled scores grow with sqrt(d_k), while scaled scores maintain unit variance regardless of dimension.
Line plot showing unscaled max weight approaching 1.0 while scaled stays around 0.4.
Average maximum attention weight. Without scaling, attention becomes increasingly peaked (approaching 1.0). With scaling, attention remains distributed.

The visualization shows the benefit of scaling clearly. Without it, high-dimensional attention degenerates into hard selection. With it, the model retains the flexibility to express soft, distributed attention patterns throughout training.

Implementation: A Complete Attention Module

Having traced through the formula by hand, we can now build a complete, reusable implementation. This module combines everything we've learned: it takes raw embeddings as input, projects them to queries, keys, and values, computes scaled dot-product attention, and returns contextualized representations.

Module Architecture

A self-contained attention module needs three components:

  1. Projection matrices: Learned parameters that transform input embeddings into Q, K, and V
  2. The attention computation: The formula we've been studying
  3. Proper initialization: Weight scaling that maintains signal magnitude through the network

The implementation follows patterns used in production transformer libraries like PyTorch and TensorFlow:

In[34]:
Code
class ScaledDotProductAttention:
    """
    Scaled dot-product attention module.

    This is the core attention mechanism used in transformers.
    """

    def __init__(self, d_model, d_k, d_v):
        """
        Initialize attention with projection matrices.

        Args:
            d_model: Input embedding dimension
            d_k: Query/key dimension
            d_v: Value dimension
        """
        self.d_k = d_k
        self.scale = 1.0 / np.sqrt(d_k)

        # Initialize projection matrices with Xavier/Glorot initialization
        self.W_Q = np.random.randn(d_model, d_k) * np.sqrt(
            2.0 / (d_model + d_k)
        )
        self.W_K = np.random.randn(d_model, d_k) * np.sqrt(
            2.0 / (d_model + d_k)
        )
        self.W_V = np.random.randn(d_model, d_v) * np.sqrt(
            2.0 / (d_model + d_v)
        )

    def __call__(self, X, return_weights=False):
        """
        Apply scaled dot-product attention.

        Args:
            X: Input embeddings, shape (n, d_model)
            return_weights: If True, also return attention weights

        Returns:
            output: Contextualized representations, shape (n, d_v)
            weights (optional): Attention weights, shape (n, n)
        """
        # Project to queries, keys, values
        Q = X @ self.W_Q
        K = X @ self.W_K
        V = X @ self.W_V

        # Compute scaled dot-product attention
        scores = Q @ K.T * self.scale
        weights = softmax(scores)
        output = weights @ V

        if return_weights:
            return output, weights
        return output

The __init__ method creates the three projection matrices with Xavier/Glorot initialization, which scales weights by $\sqrt{2/(d_{in} + d_{out})}$ to maintain variance through the network. The __call__ method implements the complete attention pipeline in just six lines: three projections, one scaled score computation, one softmax normalization, and one value aggregation.
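
As a rough sanity check (a sketch with arbitrarily chosen dimensions, not part of the module above), we can confirm that a Xavier/Glorot-scaled projection keeps outputs at roughly unit scale when the inputs have unit variance:

Code
import numpy as np

rng = np.random.default_rng(0)
d_model_chk, d_k_chk = 512, 64
X_chk = rng.standard_normal((1_000, d_model_chk))
W_chk = rng.standard_normal((d_model_chk, d_k_chk)) * np.sqrt(2.0 / (d_model_chk + d_k_chk))
print(round(float(np.var(X_chk @ W_chk)), 2))  # ~2 * d_model / (d_model + d_k), i.e. order 1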

Testing the Module

Let's verify that our implementation works correctly on a realistic example:

In[35]:
Code
# Test the module with realistic dimensions
np.random.seed(42)
d_model = 32  # Input embedding dimension
d_k, d_v = 16, 16  # Projection dimensions
seq_len = 8  # Sequence length

attention = ScaledDotProductAttention(d_model, d_k, d_v)
X_test = np.random.randn(seq_len, d_model)

output, weights = attention(X_test, return_weights=True)
Out[36]:
Console
Attention module test:
  Input shape:  (8, 32)
  Output shape: (8, 16)
  Weights shape: (8, 8)

Weight matrix row sums: [1. 1. 1. 1. 1. 1. 1. 1.]
All rows sum to 1: True

The module transforms our 8-token sequence with 32-dimensional embeddings into 8 contextualized representations with 16 dimensions each. The attention weights form an $8 \times 8$ matrix where each row sums to exactly 1.0, confirming that softmax produces valid probability distributions.

From Prototype to Production

This implementation captures the mathematical essence of attention, but production systems add several refinements:

  • Dropout: Applied to attention weights during training for regularization (a minimal sketch follows this list)
  • Batch processing: Handling multiple sequences simultaneously for GPU efficiency
  • Autograd integration: Enabling gradient computation for backpropagation
  • Numerical precision: Using half-precision floats for memory efficiency on large models
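
As an example of the first refinement, here is a minimal sketch of inverted dropout applied to the attention weights (an illustration with a hypothetical helper name, not this article's production code). It would slot in between the softmax and the value aggregation in the module above, and is active only during training:

Code
import numpy as np

def attention_weight_dropout(weights, p=0.1, training=True, seed=0):
    """Zero out attention weights with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return weights
    rng = np.random.default_rng(seed)
    mask = rng.random(weights.shape) >= p   # keep each weight with probability 1 - p
    return weights * mask / (1.0 - p)       # rescale so the expected total is unchanged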

A later chapter on multi-head attention builds directly on this single-head foundation, showing how running multiple attention heads in parallel gives the model different "perspectives" on the same input sequence.

Limitations and Impact

Scaled dot-product attention is the workhorse of modern NLP. The formula $\text{softmax}(QK^T/\sqrt{d_k})V$ encodes both what to attend to (through query-key matching) and what information to gather (through value aggregation). The scaling factor ensures stable training across different embedding dimensions, and the matrix formulation enables efficient GPU parallelization.

The mechanism has several important limitations to keep in mind. The $O(n^2)$ complexity in sequence length remains a fundamental constraint, where $n$ is the number of tokens. While the scaling factor helps with numerical stability, it doesn't reduce the quadratic memory and compute requirements. A 4,000-token sequence requires 16 million attention score computations, and an 8,000-token sequence requires 64 million. This motivates research into efficient attention variants like sparse attention, linear attention, and sliding window approaches.

Additionally, standard dot-product attention treats all positions symmetrically. Without positional encodings (covered in a later chapter), the model cannot distinguish "dog bites man" from "man bites dog." The attention mechanism itself is permutation-equivariant: shuffling input positions merely shuffles the output in the same way. External positional information must be injected to break this symmetry.
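
The permutation-equivariance claim is easy to verify numerically. The sketch below (with fresh random matrices, independent of the chapter's running example) permutes the input rows and checks that the attention output rows are permuted identically:

Code
import numpy as np

rng = np.random.default_rng(0)
n_demo, d_demo = 6, 8
X_demo = rng.standard_normal((n_demo, d_demo))
W_q, W_k, W_v = (rng.standard_normal((d_demo, d_demo)) * 0.1 for _ in range(3))

def attn(X):
    # Single-head scaled dot-product attention, as defined earlier in the chapter
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_demo)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ (X @ W_v)

perm = rng.permutation(n_demo)
print(np.allclose(attn(X_demo[perm]), attn(X_demo)[perm]))  # True: outputs shuffle the same way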

Despite these constraints, scaled dot-product attention unlocked the transformer architecture and the modern era of large language models. By enabling parallel processing of sequences and providing direct connections between any two positions, it solved the long-range dependency problems that limited recurrent models. The mechanism is simple enough to implement in a few lines of code, yet powerful enough to serve as the foundation for GPT, BERT, and their successors.

Summary

Scaled dot-product attention extends the basic attention mechanism with two key refinements: query-key-value projections and score scaling.

Key takeaways from this chapter:

  • Query-Key-Value framework: Instead of using embeddings directly, we project them into specialized representations. Queries encode "what I'm looking for," keys encode "what I offer," and values encode "what information I contribute."

  • Dot product for similarity: The similarity between query $i$ and key $j$ is their dot product $\mathbf{q}_i \cdot \mathbf{k}_j$. In matrix form, $QK^T$ computes all $n^2$ pairwise similarities efficiently.

  • Scaling factor $1/\sqrt{d_k}$: Dot product variance grows with dimension $d_k$, causing softmax saturation in high dimensions. Dividing by $\sqrt{d_k}$ normalizes scores to unit variance, keeping softmax in its sensitive region.

  • The attention formula: $\text{Attention}(Q, K, V) = \text{softmax}(QK^T/\sqrt{d_k})V$ combines scoring, normalization, and aggregation into a single differentiable operation.

  • Matrix efficiency: The formula translates directly to matrix multiplications, enabling efficient batch processing on GPUs.

  • Quadratic complexity: Computing $QK^T$ requires $O(n^2)$ operations for sequence length $n$. This remains the primary computational bottleneck for long sequences.

In the next chapter, we'll explore attention masking, which allows us to control which positions can attend to which. This is essential for causal language models (where future tokens must not influence past predictions) and for handling variable-length sequences in batches.

Key Parameters

When implementing scaled dot-product attention, several hyperparameters control the model's capacity and behavior:

  • $d_k$ (query/key dimension): Controls the dimensionality of the query and key projections. Larger values increase model capacity but also increase computation. Common choices range from 64 to 128 per attention head. The scaling factor $1/\sqrt{d_k}$ depends directly on this value.

  • $d_v$ (value dimension): Controls the dimensionality of value projections and the output. Often set equal to $d_k$, but can differ. This determines the size of the contextual representations produced by attention.

  • $d_{model}$ (input embedding dimension): The dimension of the input token embeddings. The projection matrices $W^Q$, $W^K$, and $W^V$ transform from $d_{model}$ to $d_k$ or $d_v$. Typical values: 256, 512, 768, or 1024.

  • Initialization scale: The projection matrices should be initialized with appropriate variance to maintain signal magnitude. Xavier/Glorot initialization (scaling by $\sqrt{2/(d_{in} + d_{out})}$) is commonly used to prevent vanishing or exploding activations.

For practical implementations using PyTorch or similar frameworks, torch.nn.MultiheadAttention handles these details automatically, accepting embed_dim and num_heads as primary parameters and computing per-head dimensions internally.
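
A brief usage sketch (the shapes and hyperparameters below are illustrative choices, not values prescribed by this chapter) shows the PyTorch module in self-attention mode:

Code
import torch

embed_dim, num_heads = 512, 8                 # per-head dimension: 512 // 8 = 64
mha = torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 100, embed_dim)            # (batch, sequence length, embedding dimension)
output, attn_weights = mha(x, x, x)           # self-attention: queries, keys, values all from x

print(output.shape)        # torch.Size([2, 100, 512])
print(attn_weights.shape)  # torch.Size([2, 100, 100]), averaged over heads by default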




About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
