Master scaled dot-product attention with queries, keys, and values. Learn why scaling by √d_k prevents softmax saturation and enables stable transformer training.

Scaled Dot-Product Attention
In the previous chapter, we explored the fundamental pattern of self-attention: compute similarity scores, normalize with softmax, and aggregate values. We used raw embeddings directly, measuring similarity through dot products between embedding vectors. This simplified approach captures the essence of attention, but real transformer models use a more powerful formulation called scaled dot-product attention.
This refinement introduces two key modifications. First, instead of using embeddings directly, we project them into three separate representations: queries, keys, and values. Second, we scale the dot products before applying softmax to prevent numerical instability. These changes might seem minor, but they're essential for training deep attention-based models effectively.
The Query-Key-Value Framework
The insight behind queries, keys, and values comes from information retrieval. Think of a database lookup: you have a query (what you're searching for), a set of keys (labels for stored items), and values (the actual stored content). The lookup process finds keys that match the query and returns the corresponding values.
Self-attention works similarly. For each position in the sequence:
- The query represents what this position is "looking for"
- The keys represent what each position "offers" to be matched against
- The values represent the information each position will contribute to the output
In attention mechanisms, queries, keys, and values are learned linear projections of the input embeddings. The query at position $i$ is matched against all keys to determine how much each value contributes to position $i$'s output.
Why not just use the embeddings directly, as we did in the simplified version? The separate projections give the model more flexibility. The query projection can learn to emphasize features that are useful for finding relevant context. The key projection can learn to emphasize features that are useful for being found. And the value projection can learn what information is actually useful to contribute to the output. These are different roles, and separating them allows the model to specialize each representation.
Linear Projections
Given an input sequence of $n$ tokens with $d_{\text{model}}$-dimensional embeddings, we create queries, keys, and values through learned linear transformations:

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$

where:
- $X \in \mathbb{R}^{n \times d_{\text{model}}}$ is the input embedding matrix (rows are token embeddings)
- $W_Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$ projects inputs to queries
- $W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ projects inputs to keys
- $W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ projects inputs to values
- $Q \in \mathbb{R}^{n \times d_k}$ is the query matrix
- $K \in \mathbb{R}^{n \times d_k}$ is the key matrix
- $V \in \mathbb{R}^{n \times d_v}$ is the value matrix
The dimensions $d_k$ (key/query dimension) and $d_v$ (value dimension) are hyperparameters. Often $d_k = d_v$, but they can differ. The critical requirement is that queries and keys must have the same dimension since we'll compute their dot products.
Each token's embedding gets transformed into three distinct vectors. Position $i$ produces query $q_i$, key $k_i$, and value $v_i$. The query will be used to find relevant positions; the key will be used to be found; the value will be used to contribute information.
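As a concrete sketch in NumPy (the dimensions and random projection matrices here are illustrative assumptions, not values from a trained model), the three projections are just matrix multiplications:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 4, 8, 4, 4            # illustrative sizes

X = rng.normal(size=(n, d_model))            # input embeddings, one row per token
W_Q = rng.normal(size=(d_model, d_k))        # query projection
W_K = rng.normal(size=(d_model, d_k))        # key projection
W_V = rng.normal(size=(d_model, d_v))        # value projection

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # queries, keys, values
print(Q.shape, K.shape, V.shape)             # (4, 4) (4, 4) (4, 4)
```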
The Dot Product for Similarity
With queries and keys in hand, we need a way to measure how relevant each key is to each query. The dot product provides exactly this: it quantifies how much two vectors point in the same direction. For query $q_i$ and key $k_j$, the similarity score is:

$$\text{score}_{ij} = q_i \cdot k_j = \sum_{l=1}^{d_k} q_{i,l}\, k_{j,l}$$
where:
- $\text{score}_{ij}$: the similarity score between position $i$'s query and position $j$'s key
- $q_i$: the query vector for position $i$, a $d_k$-dimensional vector
- $k_j$: the key vector for position $j$, also $d_k$-dimensional
- $q_{i,l}$: the $l$-th component of query $q_i$
- $k_{j,l}$: the $l$-th component of key $k_j$
- $d_k$: the dimension of query and key vectors
The dot product measures alignment: if the query and key point in similar directions, the score is high and positive. If they're orthogonal (perpendicular in the $d_k$-dimensional space), the score is zero. If they point in opposite directions, the score is negative.
To compute all pairwise similarity scores at once, we use matrix multiplication:

$$S = QK^\top$$
where:
- $S \in \mathbb{R}^{n \times n}$: the score matrix containing all pairwise similarities
- $Q \in \mathbb{R}^{n \times d_k}$: the query matrix (rows are query vectors)
- $K^\top \in \mathbb{R}^{d_k \times n}$: the transposed key matrix (columns are key vectors)
Entry $S_{ij}$ in the resulting matrix tells us how much position $i$'s query matches position $j$'s key. This single matrix multiplication replaces $n^2$ individual dot products.
The score matrix shows the raw dot products between all query-key pairs. Some values are positive (similar directions), others negative (opposite directions). Before we can use these as attention weights, we need to apply softmax. But there's a problem we need to address first.
The Scaling Problem
Consider what happens as the dimension $d_k$ grows. Each dot product is a sum of $d_k$ terms:

$$q \cdot k = \sum_{l=1}^{d_k} q_l\, k_l$$
where:
- $q$: a query vector of dimension $d_k$
- $k$: a key vector of dimension $d_k$
- $q_l$: the $l$-th component of the query vector
- $k_l$: the $l$-th component of the key vector
- $d_k$: the number of dimensions in the query/key space
If the individual components $q_l$ and $k_l$ have unit variance (which is typical after proper initialization), then each product $q_l k_l$ has variance around 1. The sum of $d_k$ independent terms with variance 1 has variance $d_k$. This means the dot product's magnitude scales with $\sqrt{d_k}$.
Let's verify this empirically:
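A minimal sketch of such a check in NumPy (the sample count and dimensions are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
num_samples = 100_000

for d_k in [16, 64, 256, 512]:
    q = rng.normal(size=(num_samples, d_k))   # unit-variance query components
    k = rng.normal(size=(num_samples, d_k))   # unit-variance key components
    dots = np.sum(q * k, axis=1)              # one dot product per sample
    print(f"d_k={d_k:4d}  variance ≈ {dots.var():7.1f}  (expected {d_k})")
```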
The variance grows exactly as predicted: for $d_k = 512$, the variance is approximately 512. This means dot products can easily reach magnitudes of 30 or more (since the standard deviation is $\sqrt{512} \approx 22.6$).
Why does this matter? The softmax function converts scores to probabilities:

$$\alpha_{ij} = \frac{\exp(\text{score}_{ij})}{\sum_{l=1}^{n} \exp(\text{score}_{il})}$$
where:
- $\alpha_{ij}$: the attention weight from position $i$ to position $j$ (how much position $i$ attends to position $j$)
- $\text{score}_{ij}$: the raw similarity score between position $i$'s query and position $j$'s key
- $\exp(\text{score}_{ij})$: the exponential of the score, ensuring positivity
- $\sum_{l=1}^{n} \exp(\text{score}_{il})$: the sum over all positions, normalizing so weights sum to 1
Large input values push softmax into its saturated regime. If one score is significantly larger than the others, the exponential blows up that difference. Consider scores [10, 1, 1, 1]: after softmax, the weights become approximately [0.9996, 0.0001, 0.0001, 0.0001]. The attention becomes nearly one-hot, attending almost exclusively to a single position.
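A quick sketch in NumPy makes the saturation visible (the small-magnitude scores here are illustrative; the exact weights depend on the values chosen):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract the max for numerical stability
    return e / e.sum()

moderate = np.array([1.0, 0.1, 0.1, 0.1])    # small-magnitude scores
large = np.array([10.0, 1.0, 1.0, 1.0])      # large-magnitude scores

print(softmax(moderate).round(2))   # fairly distributed weights
print(softmax(large).round(4))      # ≈ [0.9996, 0.0001, 0.0001, 0.0001] — nearly one-hot
```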
This extreme behavior causes two problems during training. First, gradients become vanishingly small in the saturated regions of softmax, slowing learning. Second, the model loses the ability to express soft, distributed attention patterns. It can only focus sharply on one thing.
With small scores, attention distributes across positions with the highest weight around 0.41 and entropy around 1.26 (fairly distributed). With large scores, attention collapses: the maximum weight approaches 1.0 and entropy drops to nearly 0 (extremely peaked). The model can no longer express nuanced attention patterns.
The visualization makes the contrast stark. With small scores, attention is roughly distributed across all four positions. With large scores (20x larger), almost all attention flows to position 1, with the other positions receiving essentially zero weight. This is why scaling is essential: without it, high-dimensional embeddings produce scores so large that softmax always saturates.
The Scaling Factor: $1/\sqrt{d_k}$
The solution is straightforward: divide the dot products by $\sqrt{d_k}$ before applying softmax. Since the dot product variance is $d_k$, dividing by $\sqrt{d_k}$ brings the variance back to 1:

$$\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{\mathrm{Var}(q \cdot k)}{d_k} = \frac{d_k}{d_k} = 1$$
where:
- $\mathrm{Var}(\cdot)$: the variance operator
- $q \cdot k$: the dot product between a query and key vector
- $d_k$: the dimension of query and key vectors
- $\sqrt{d_k}$: the scaling factor applied to the dot product
The key property used here is that when you divide a random variable by a constant $c$, its variance is divided by $c^2$. Since we divide by $\sqrt{d_k}$, the variance is divided by $d_k$.
With unit-variance scores, softmax operates in its sensitive region where small changes in scores produce meaningful changes in weights. The model can learn both sharp and soft attention patterns.
The scaling factor $1/\sqrt{d_k}$ normalizes dot product scores to have unit variance regardless of dimension. This prevents softmax saturation and maintains healthy gradients during training.
Derivation of Dot Product Variance
Let's derive why the dot product has variance $d_k$. Assume the components $q_l$ and $k_l$ are independent random variables with zero mean and unit variance (typical after proper weight initialization).
Step 1: Variance of a single product term
For a single component product $q_l k_l$:
- Mean: $\mathbb{E}[q_l k_l] = \mathbb{E}[q_l]\,\mathbb{E}[k_l] = 0$ (by independence)
- Variance: $\mathrm{Var}(q_l k_l) = \mathbb{E}[q_l^2 k_l^2] - (\mathbb{E}[q_l k_l])^2 = \mathbb{E}[q_l^2]\,\mathbb{E}[k_l^2] = 1 \cdot 1 = 1$
Each product term has zero mean and unit variance.
Step 2: Variance of the sum (the dot product)
The dot product sums $d_k$ independent terms. When summing independent random variables, variances add:

$$\mathrm{Var}(q \cdot k) = \mathrm{Var}\!\left(\sum_{l=1}^{d_k} q_l k_l\right) = \sum_{l=1}^{d_k} \mathrm{Var}(q_l k_l) = d_k$$
Step 3: Effect of scaling
Dividing by $\sqrt{d_k}$ scales the variance by $1/d_k$:

$$\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1$$
This is why the "Attention Is All You Need" paper prescribes exactly this scaling factor: it ensures unit-variance scores regardless of the embedding dimension.
The Complete Attention Formula
We've now assembled all the pieces: queries that search for relevant context, keys that advertise what each position offers, values that carry the actual information, dot products that measure relevance, scaling that keeps softmax well-behaved, and softmax that converts scores to weights. The complete mechanism chains these operations together into a single formula.
From Components to Formula
Consider what we need to accomplish for each position in the sequence:
- Find relevant positions: Compare this position's query against all keys
- Determine importance: Convert raw similarities into attention weights
- Gather information: Blend all values according to those weights
The formula that captures this entire process is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
where:
- $Q \in \mathbb{R}^{n \times d_k}$: the query matrix, with one row per position
- $K \in \mathbb{R}^{n \times d_k}$: the key matrix, with one row per position
- $V \in \mathbb{R}^{n \times d_v}$: the value matrix, containing information to aggregate
- $d_k$: the dimension of queries and keys (must match for the dot product)
- $d_v$: the dimension of values (can differ from $d_k$)
- $n$: the sequence length (number of tokens)
- $\mathrm{softmax}$: applied row-wise to convert scores to probability distributions
This formula is read from the inside out, following the order of operations. Let's trace through each stage to understand how the pieces fit together.
Stage-by-Stage Computation
Stage 1: Compute pairwise similarities with $QK^\top$
The matrix multiplication $QK^\top$ computes all query-key dot products simultaneously. When we multiply a query matrix of shape $n \times d_k$ by a transposed key matrix of shape $d_k \times n$, we get a score matrix of shape $n \times n$. Entry $(i, j)$ contains the dot product $q_i \cdot k_j$, measuring how much position $i$ should attend to position $j$.
Stage 2: Normalize variance with $1/\sqrt{d_k}$
Before applying softmax, we divide all scores by $\sqrt{d_k}$. As we derived earlier, this keeps the score variance at 1 regardless of dimension, preventing softmax from saturating. Without this scaling, high-dimensional queries and keys would produce scores so large that softmax collapses to near-one-hot distributions.
Stage 3: Convert to probabilities with softmax
Softmax is applied row-wise, so each row of the score matrix becomes an independent probability distribution. Row $i$ tells us how position $i$ distributes its attention across all positions. The weights are positive and sum to 1, making them valid for computing weighted averages.
Stage 4: Aggregate values with $V$
Finally, we multiply the attention weight matrix by the value matrix. This computes, for each position, a weighted sum of all value vectors. The result is an $n \times d_v$ matrix where row $i$ is position $i$'s new representation, enriched with information gathered from across the sequence.
The output is a matrix where each row is the contextual representation for one position, computed as a weighted average of all value vectors according to the attention weights.
Implementation
Translating this formula into code reveals its simplicity. Three matrix operations capture the entire attention mechanism:
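A minimal sketch of such a function in NumPy (the function and helper names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax with max-subtraction for numerical stability.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values.
    Returns the (n, d_v) output and the (n, n) attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # all pairwise similarities, scaled
    weights = softmax(scores, axis=-1)   # each row becomes a probability distribution
    return weights @ V, weights          # weighted sum of values, plus the weights
```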
The implementation is concise. One matrix multiplication computes all pairwise scores, scalar division handles scaling, softmax normalizes each row, and a final matrix multiplication aggregates the values. This composability is what makes attention so powerful: complex contextual reasoning emerges from simple, differentiable operations.
Let's apply this function to our running example and examine the outputs:
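Continuing the sketch, here is a run with four random stand-in tokens (illustrative values in place of the running example's projections):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_k, d_v = 4, 8, 8                    # 4 tokens, illustrative dimensions
Q = rng.normal(size=(n, d_k))            # stand-in queries
K = rng.normal(size=(n, d_k))            # stand-in keys
V = rng.normal(size=(n, d_v))            # stand-in values

output, weights = scaled_dot_product_attention(Q, K, V)

print(weights.shape)                     # (4, 4) — one attention distribution per token
print(weights.sum(axis=-1))              # each row sums to 1.0
print(output.shape)                      # (4, 8) — one contextual vector per token
```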
The attention weight matrix has shape $4 \times 4$: one row for each of our 4 tokens, one column for each potential attention target. Each row sums to exactly 1.0, confirming that softmax produces valid probability distributions. The output has shape $4 \times d_v$, matching our sequence length and value dimension. Each position now carries a contextual representation that blends information from all positions, with the blending proportions determined by the attention weights.
Visualizing the Attention Computation
Let's trace through the complete attention computation visually. We'll use a small example with interpretable tokens to see how each step transforms the data.
The left panel shows scaled similarity scores ranging roughly from -1 to +1. This moderate range keeps softmax well-behaved. The right panel shows attention weights after softmax, where each row forms a probability distribution. Some positions focus attention sharply (one dominant weight), while others distribute attention more broadly.
Matrix Form and Computational Efficiency
The formula computes everything through matrix multiplications, which are highly optimized on modern hardware.
Let's trace the shapes through the computation:
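A shape-only sketch in NumPy (sizes are illustrative):

```python
import numpy as np

n, d_k, d_v = 6, 64, 64                  # illustrative sequence length and dimensions
rng = np.random.default_rng(1)

Q = rng.normal(size=(n, d_k))            # (n, d_k)
K = rng.normal(size=(n, d_k))            # (n, d_k)
V = rng.normal(size=(n, d_v))            # (n, d_v)

scores = Q @ K.T / np.sqrt(d_k)          # (n, d_k) @ (d_k, n) -> (n, n)

# Softmax preserves the (n, n) shape; each row becomes a distribution.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V                     # (n, n) @ (n, d_v) -> (n, d_v)
print(scores.shape, weights.shape, output.shape)   # (6, 6) (6, 6) (6, 64)
```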
The attention weight matrix is $n \times n$, which is the source of the quadratic complexity we discussed in the previous chapter. For sequence length $n$, this is an $n^2$-element matrix. For $n = 10{,}000$ (a moderately long document), it becomes 100 million elements.
However, note that the computations are batch-able. In practice, we process multiple sequences simultaneously:
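A sketch of the batched version in NumPy (the batched matrix multiply broadcasts over the leading batch dimension):

```python
import numpy as np

batch_size, n, d_k, d_v = 3, 6, 64, 64   # three sequences processed at once
rng = np.random.default_rng(2)

Q = rng.normal(size=(batch_size, n, d_k))
K = rng.normal(size=(batch_size, n, d_k))
V = rng.normal(size=(batch_size, n, d_v))

# Batched scores: (batch, n, d_k) @ (batch, d_k, n) -> (batch, n, n)
scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax per sequence

output = weights @ V                               # (batch, n, d_v)
print(weights.shape, output.shape)                 # (3, 6, 6) (3, 6, 64)
```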
Batched operations leverage GPU parallelism effectively. All three sequences are processed simultaneously through the same matrix multiplications.
A Complete Worked Example
The formula becomes concrete when we trace through it with actual numbers. Let's work through a minimal example: three tokens with 2-dimensional queries, keys, and values. By keeping the dimensions tiny, we can follow every calculation by hand and build intuition for what the attention mechanism actually computes.
Setting Up the Example
We'll work with three abstract tokens, "A," "B," and "C." In a real model, these would come from learned projections of input embeddings. Here, we'll define them directly with carefully chosen values that have geometric meaning.
The queries have an intuitive geometric interpretation: "A" points due east (along dimension 1), "C" points due north (along dimension 2), and "B" points northeast (equally between both dimensions). The keys are distributed similarly but with different magnitudes. The values are orthogonal basis vectors plus a midpoint, making it easy to see how blending works.
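Here is a sketch of one such setup (the specific numbers are illustrative assumptions chosen to match the geometric description):

```python
import numpy as np

tokens = ["A", "B", "C"]

# Queries: "A" points east, "B" northeast, "C" north.
Q = np.array([[1.0, 0.0],
              [0.7, 0.7],
              [0.0, 1.0]])

# Keys: distributed similarly, with different magnitudes.
K = np.array([[0.80, 0.20],    # mostly east
              [0.60, 0.60],    # northeast
              [0.55, 0.90]])   # mostly north

# Values: orthogonal basis vectors plus a midpoint.
V = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
```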
The visualization reveals why certain query-key pairs have high dot products. Query points east and aligns well with key (which also has a strong eastward component), producing a high similarity score. Query points north and aligns best with key (which points mostly north). Query lies between the axes and has moderate similarity with all keys.
Step 1: Compute Raw Similarity Scores
First, we compute the dot product between every query and every key. This measures how well each query-key pair aligns.
Token "A" (with its eastward query) has the highest similarity with key "A" (which also points mostly east) and lowest similarity with key "C" (which points mostly north). This makes geometric sense: parallel vectors have high dot products, perpendicular vectors have low dot products.
Step 2: Apply the Scaling Factor
Next, we divide all scores by $\sqrt{d_k} = \sqrt{2} \approx 1.41$. With only 2 dimensions, this scaling is modest, but it becomes crucial in high-dimensional settings.
The scaled scores are smaller in magnitude, keeping them in a range where softmax will produce meaningful gradients. In our 2D example, scores that were around 0.8 become around 0.57. The relative ordering is preserved, but the absolute differences are compressed.
Step 3: Convert to Attention Weights
Softmax transforms each row of scaled scores into a probability distribution. Let's trace through the calculation for token "A" in detail.
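Using the illustrative vectors from the sketch above, the calculation for token "A" looks like this:

```python
import numpy as np

scores_A = np.array([0.80, 0.60, 0.55])   # token "A"'s raw scores against keys A, B, C

scaled_A = scores_A / np.sqrt(2)          # divide by sqrt(d_k), with d_k = 2
exp_A = np.exp(scaled_A)                  # exponentiate each scaled score
weights_A = exp_A / exp_A.sum()           # normalize so the weights sum to 1

print(scaled_A.round(3))                  # [0.566 0.424 0.389]
print(weights_A.round(2))                 # [0.37 0.32 0.31]
```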
Each row now sums to 1.0, forming a valid probability distribution. Token "A" attends most strongly to itself (because its query aligned best with its own key), but it also gathers some information from "B" and "C." The attention isn't one-hot; it's a soft distribution that blends information from multiple sources.
Step 4: Aggregate Values
Finally, we use the attention weights to compute a weighted average of all value vectors for each position.
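Continuing the sketch for token "A" (with the values assumed above):

```python
import numpy as np

V = np.array([[1.0, 0.0],                       # value for "A"
              [0.5, 0.5],                       # value for "B"
              [0.0, 1.0]])                      # value for "C"

weights_A = np.array([0.37, 0.32, 0.31])        # attention weights for token "A"

output_A = weights_A @ V                        # weighted average of all value vectors
print(output_A.round(2))                        # [0.53 0.47] with these illustrative values
```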
Token "A" started with value but now has an output that incorporates information from all three positions. The output is approximately , reflecting the weighted blend of "A"'s value (weighted ~0.37), "B"'s value (weighted ~0.32), and "C"'s value (weighted ~0.31). The original position-specific information has been enriched with contextual information from the entire sequence.
Comparing Scaled vs Unscaled Attention
To solidify why scaling matters, let's compare attention patterns with and without the $1/\sqrt{d_k}$ factor at different dimensions.
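A sketch of such a comparison in NumPy (random unit-variance queries and keys; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 64                                    # sequence length for the comparison

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

for d_k in [4, 64, 512]:
    Q = rng.normal(size=(n, d_k))
    K = rng.normal(size=(n, d_k))

    raw = Q @ K.T                         # unscaled scores
    scaled = raw / np.sqrt(d_k)           # scaled scores

    print(f"d_k={d_k:4d}  "
          f"unscaled: std={raw.std():5.1f}, max weight={softmax(raw).max(axis=-1).mean():.2f}  |  "
          f"scaled: std={scaled.std():4.2f}, max weight={softmax(scaled).max(axis=-1).mean():.2f}")
```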
Without scaling, score standard deviation grows with $\sqrt{d_k}$, reaching over 22 at $d_k = 512$. This pushes the maximum attention weight toward 1.0, meaning attention collapses to near-one-hot. With scaling, score standard deviation stays around 1.0, and maximum weights remain moderate, allowing distributed attention patterns.
The visualization shows the benefit of scaling clearly. Without it, high-dimensional attention degenerates into hard selection. With it, the model retains the flexibility to express soft, distributed attention patterns throughout training.
Implementation: A Complete Attention Module
Having traced through the formula by hand, we can now build a complete, reusable implementation. This module combines everything we've learned: it takes raw embeddings as input, projects them to queries, keys, and values, computes scaled dot-product attention, and returns contextualized representations.
Module Architecture
A self-contained attention module needs three components:
- Projection matrices: Learned parameters that transform input embeddings into Q, K, and V
- The attention computation: The formula we've been studying
- Proper initialization: Weight scaling that maintains signal magnitude through the network
The implementation follows patterns used in production transformer libraries like PyTorch and TensorFlow:
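A minimal single-head sketch in NumPy (the class and variable names are illustrative, not a specific library's API):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ScaledDotProductAttention:
    """Single-head scaled dot-product attention over raw embeddings."""

    def __init__(self, d_model, d_k, d_v, seed=0):
        rng = np.random.default_rng(seed)
        # Xavier/Glorot-style initialization: std = sqrt(2 / (fan_in + fan_out)).
        self.W_q = rng.normal(0.0, np.sqrt(2 / (d_model + d_k)), size=(d_model, d_k))
        self.W_k = rng.normal(0.0, np.sqrt(2 / (d_model + d_k)), size=(d_model, d_k))
        self.W_v = rng.normal(0.0, np.sqrt(2 / (d_model + d_v)), size=(d_model, d_v))

    def __call__(self, X):
        """X: (n, d_model) embeddings -> ((n, d_v) output, (n, n) attention weights)."""
        Q = X @ self.W_q                                    # project to queries
        K = X @ self.W_k                                    # project to keys
        V = X @ self.W_v                                    # project to values
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # scaled scores -> probabilities
        return weights @ V, weights                         # aggregate values
```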
The __init__ method creates the three projection matrices with Xavier/Glorot initialization, which scales weights by $\sqrt{2/(n_{\text{in}} + n_{\text{out}})}$ to maintain variance through the network. The __call__ method implements the complete attention pipeline in just five lines: three projections, one scored dot product with scaling, and one value aggregation.
Testing the Module
Let's verify that our implementation works correctly on a realistic example:
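Continuing the sketch, here is a quick check with an 8-token sequence of 32-dimensional embeddings (random stand-in values):

```python
import numpy as np

rng = np.random.default_rng(3)

X = rng.normal(size=(8, 32))                             # 8 tokens, 32-dim embeddings
attention = ScaledDotProductAttention(d_model=32, d_k=16, d_v=16)

output, weights = attention(X)

print(output.shape)                # (8, 16) — contextualized representations
print(weights.shape)               # (8, 8)  — one attention row per token
print(weights.sum(axis=-1))        # each row sums to 1.0
```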
The module transforms our 8-token sequence with 32-dimensional embeddings into 8 contextualized representations with 16 dimensions each. The attention weights form an $8 \times 8$ matrix where each row sums to exactly 1.0, confirming that softmax produces valid probability distributions.
From Prototype to Production
This implementation captures the mathematical essence of attention, but production systems add several refinements:
- Dropout: Applied to attention weights during training for regularization
- Batch processing: Handling multiple sequences simultaneously for GPU efficiency
- Autograd integration: Enabling gradient computation for backpropagation
- Numerical precision: Using half-precision floats for memory efficiency on large models
A later chapter on multi-head attention builds directly on this single-head foundation, showing how running multiple attention heads in parallel gives the model different "perspectives" on the same input sequence.
Limitations and Impact
Scaled dot-product attention is the workhorse of modern NLP. The formula encodes both what to attend to (through query-key matching) and what information to gather (through value aggregation). The scaling factor ensures stable training across different embedding dimensions, and the matrix formulation enables efficient GPU parallelization.
The mechanism has several important limitations to keep in mind. The $O(n^2)$ complexity in sequence length remains a fundamental constraint, where $n$ is the number of tokens. While the scaling factor helps with numerical stability, it doesn't reduce the quadratic memory and compute requirements. A 4,000-token sequence requires 16 million attention score computations, and an 8,000-token sequence requires 64 million. This motivates research into efficient attention variants like sparse attention, linear attention, and sliding window approaches.
Additionally, standard dot-product attention treats all positions symmetrically. Without positional encodings (covered in a later chapter), the model cannot distinguish "dog bites man" from "man bites dog." The attention mechanism itself is permutation-equivariant: shuffling input positions merely shuffles the output in the same way. External positional information must be injected to break this symmetry.
Despite these constraints, scaled dot-product attention unlocked the transformer architecture and the modern era of large language models. By enabling parallel processing of sequences and providing direct connections between any two positions, it solved the long-range dependency problems that limited recurrent models. The mechanism is simple enough to implement in a few lines of code, yet powerful enough to serve as the foundation for GPT, BERT, and their successors.
Summary
Scaled dot-product attention extends the basic attention mechanism with two key refinements: query-key-value projections and score scaling.
Key takeaways from this chapter:
- Query-Key-Value framework: Instead of using embeddings directly, we project them into specialized representations. Queries encode "what I'm looking for," keys encode "what I offer," and values encode "what information I contribute."
- Dot product for similarity: The similarity between query $q_i$ and key $k_j$ is their dot product $q_i \cdot k_j$. In matrix form, $QK^\top$ computes all pairwise similarities efficiently.
- Scaling factor $1/\sqrt{d_k}$: Dot product variance grows with dimension $d_k$, causing softmax saturation in high dimensions. Dividing by $\sqrt{d_k}$ normalizes scores to unit variance, keeping softmax in its sensitive region.
- The attention formula: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d_k}\right) V$ combines scoring, normalization, and aggregation into a single differentiable operation.
- Matrix efficiency: The formula translates directly to matrix multiplications, enabling efficient batch processing on GPUs.
- Quadratic complexity: Computing $QK^\top$ requires $O(n^2)$ operations for sequence length $n$. This remains the primary computational bottleneck for long sequences.
In the next chapter, we'll explore attention masking, which allows us to control which positions can attend to which. This is essential for causal language models (where future tokens must not influence past predictions) and for handling variable-length sequences in batches.
Key Parameters
When implementing scaled dot-product attention, several hyperparameters control the model's capacity and behavior:
- $d_k$ (query/key dimension): Controls the dimensionality of the query and key projections. Larger values increase model capacity but also increase computation. Common choices range from 64 to 128 per attention head. The scaling factor $1/\sqrt{d_k}$ depends directly on this value.
- $d_v$ (value dimension): Controls the dimensionality of value projections and the output. Often set equal to $d_k$, but can differ. This determines the size of the contextual representations produced by attention.
- $d_{\text{model}}$ (input embedding dimension): The dimension of the input token embeddings. The projection matrices $W_Q$, $W_K$, and $W_V$ transform from $d_{\text{model}}$ to $d_k$ or $d_v$. Typical values: 256, 512, 768, or 1024.
- Initialization scale: The projection matrices should be initialized with appropriate variance to maintain signal magnitude. Xavier/Glorot initialization (scaling by $\sqrt{2/(n_{\text{in}} + n_{\text{out}})}$) is commonly used to prevent vanishing or exploding activations.
For practical implementations using PyTorch or similar frameworks, torch.nn.MultiheadAttention handles these details automatically, accepting embed_dim and num_heads as primary parameters and computing per-head dimensions internally.
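For example, a minimal PyTorch sketch (a single head of size embed_dim, so the per-head computation reduces to the scaled dot-product attention described here, plus the module's input and output projections):

```python
import torch

mha = torch.nn.MultiheadAttention(embed_dim=512, num_heads=1, batch_first=True)

x = torch.randn(2, 10, 512)        # (batch, sequence length, embed_dim)
output, weights = mha(x, x, x)     # self-attention: queries = keys = values = x

print(output.shape)                # torch.Size([2, 10, 512])
print(weights.shape)               # torch.Size([2, 10, 10]) — one attention matrix per sequence
```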