Learn how attention mechanisms solve the information bottleneck in encoder-decoder models through soft lookup, alignment scores, and dynamic context vectors.

Attention Intuition
In the encoder-decoder framework, we compress an entire input sequence into a single fixed-length vector. This works reasonably well for short sequences, but as inputs grow longer, that single vector becomes an information bottleneck. How can we expect one vector to faithfully represent a 50-word sentence, let alone an entire paragraph?
Attention solves this by allowing the decoder to look back at all encoder states and decide which parts of the input are most relevant at each decoding step. Instead of relying on a compressed summary, the model learns to focus on different parts of the input as it generates each output token. This idea improved sequence-to-sequence models and became the foundation for the transformer architecture used in modern NLP.
Attention as Soft Lookup
Think of attention as a soft, differentiable dictionary lookup. In a traditional dictionary, you provide a key and get back exactly one value. Attention works similarly, but instead of retrieving a single value, it retrieves a weighted combination of all values based on how well each key matches your query.
Consider a translation task where you're generating the French word for "cat" from an English sentence. Rather than searching through the entire compressed representation, attention lets the decoder ask: "Which parts of the input are most relevant right now?" The answer comes as a probability distribution over all input positions.
Hard attention selects exactly one input position (like a traditional lookup), while soft attention computes a weighted average over all positions. Soft attention is differentiable and can be trained with backpropagation, making it the standard choice for neural networks.
The mechanics involve three components: a query (what we're looking for), keys (what we're comparing against), and values (what we retrieve). In the context of encoder-decoder models:
- The query comes from the decoder's current state
- The keys are the encoder's hidden states
- The values are also the encoder's hidden states (often the same as keys)
The attention mechanism computes similarity scores between the query and all keys, normalizes these scores into a probability distribution, and then returns a weighted sum of the values. This weighted sum is the context vector that the decoder uses alongside its own state to generate the next token.
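To make the query/key/value picture concrete before the math, here is a minimal NumPy sketch of a soft lookup. The vectors and names are toy choices for illustration, not part of the original example:

```python
import numpy as np

def soft_lookup(query, keys, values):
    """Soft dictionary lookup: return a weighted blend of values,
    weighted by how well each key matches the query."""
    scores = keys @ query                    # one similarity score per key
    weights = np.exp(scores - scores.max())  # softmax (shifted for stability)
    weights /= weights.sum()
    return weights @ values, weights         # weighted sum of all values

# Toy example: 3 key/value pairs, 4-dimensional vectors
keys = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
values = keys.copy()                         # values are often the same as keys
query = np.array([0.1, 2.0, 0.1, 0.0])       # most similar to the second key

blended, weights = soft_lookup(query, keys, values)
print(weights.round(3))   # most of the weight lands on the second key
```

Unlike a hard lookup, no single entry is selected: every value contributes in proportion to how well its key matches the query, which is exactly what makes the operation differentiable.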
The Mathematics of Attention
Now that we have the intuition, let's translate it into precise mathematics. The goal is to build a mechanism that answers a simple question at each decoding step: "Which parts of the input should I focus on right now?"
Imagine you're the decoder, trying to generate the next word in a translation. You have your current state $s_t$, which encodes what you've generated so far and what you're trying to produce next. Meanwhile, the encoder has processed the entire input sentence and produced a sequence of hidden states $h_1, h_2, \ldots, h_n$, one for each of the $n$ input words. Each encoder state $h_i$ captures the meaning of word $i$ in context.
The challenge is clear: you need to selectively combine information from all these encoder states, giving more weight to positions that are relevant to your current generation task. Attention solves this in three steps.
Step 1: Measuring Relevance with Alignment Scores
The first question attention must answer is: "How relevant is each input position to what I'm currently generating?" We need a way to compare the decoder's current state with each encoder state and produce a relevance score.
This comparison happens through a scoring function:

$$e_{t,i} = \text{score}(s_t, h_i)$$

where:
- $e_{t,i}$: the alignment score, a single number indicating how relevant encoder position $i$ is when generating at decoder position $t$
- $s_t$: the decoder's hidden state at step $t$, encoding the generation context
- $h_i$: the encoder's hidden state at position $i$, encoding information about input word $i$
- $\text{score}(\cdot, \cdot)$: any function that takes two vectors and returns a scalar measuring their compatibility
Think of this as the decoder asking each encoder position: "How useful are you for what I'm trying to do right now?" The score function quantifies the answer. A high score means "very useful," while a low or negative score means "not relevant."
What should this scoring function look like? The simplest choice is the dot product, $\text{score}(s_t, h_i) = s_t^\top h_i$, which measures how aligned the two vectors are in the embedding space. Vectors pointing in similar directions yield high scores; orthogonal vectors yield zero. We'll explore more sophisticated scoring functions in subsequent chapters, but the core idea remains: compare the query (decoder state) against each key (encoder state) to measure relevance.
Step 2: Converting Scores to a Probability Distribution
Raw alignment scores present a problem: they can be any real number, positive or negative, large or small. We need to convert them into something more interpretable and usable, specifically, a probability distribution over input positions.
The softmax function accomplishes this transformation:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})}$$

where:
- $\alpha_{t,i}$: the attention weight for position $i$, now guaranteed to be between 0 and 1
- $\exp(e_{t,i})$: the exponential function applied to the alignment score, converting any real number to a positive value
- $\sum_{j=1}^{n} \exp(e_{t,j})$: the sum of all exponentiated scores, serving as a normalizing constant
Why softmax? It has exactly the properties we need. First, the exponential function maps any real number to a positive value, ensuring all weights are non-negative. Second, dividing by the sum of all exponentials guarantees that $\sum_{i=1}^{n} \alpha_{t,i} = 1$. The result is a valid probability distribution.
Softmax also has a useful amplification property. If one score is much larger than the others, its corresponding weight will dominate. For example, if $e_{t,3} = 5$ and all other scores are near 0, then $\alpha_{t,3}$ will be close to 1 while the other weights approach 0. This allows the model to focus sharply on a single position when appropriate, or spread attention across multiple positions when the relevance is more evenly distributed.
This amplification effect is easy to see with concrete numbers. With uniform scores, softmax produces uniform weights (0.20 each across five positions). Small score differences get amplified into clearer preferences, and large differences produce nearly one-hot attention, where almost all weight concentrates on the highest-scoring position, as the short demo below shows.
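A quick sketch of the effect; the scores are chosen purely for illustration:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 1.0, 1.0, 1.0, 1.0])))  # uniform: 0.20 each
print(softmax(np.array([2.0, 1.0, 1.0, 1.0, 1.0])))  # mild preference: ~0.41 vs ~0.15
print(softmax(np.array([5.0, 0.0, 0.0, 0.0, 0.0])))  # nearly one-hot: ~0.97 on the first
```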
Step 3: Computing the Context Vector
With attention weights in hand, we can finally answer our original question: "What information from the input should I use?" The answer is a weighted combination of all encoder states:

$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i$$

where:
- $c_t$: the context vector, a single vector summarizing the relevant input information
- $\alpha_{t,i}$: how much weight to give to position $i$ (from step 2)
- $h_i$: the information available at position $i$ (the encoder hidden state)
This weighted sum is the heart of attention. Each encoder state contributes to the context vector in proportion to its attention weight. If $\alpha_{t,3} = 0.9$ and all other weights are small, then $c_t$ will be dominated by $h_3$, the information at position 3. If weights are more evenly spread, the context vector blends information from multiple positions.
The context vector has the same dimension as the encoder states, making it easy to combine with other parts of the model. The decoder uses $c_t$ alongside its own state $s_t$ to predict the next output token.
The Complete Picture
Let's trace through a concrete example. Suppose we're translating "The cat sat" and currently generating the French word "chat" (cat). The decoder state $s_t$ encodes that we've generated "Le" and are now producing the noun.
- Alignment scores: The scoring function compares $s_t$ with each encoder state. It finds high compatibility with $h_2$ (the representation of "cat") and lower compatibility with $h_1$ ("The") and $h_3$ ("sat").
- Attention weights: Softmax converts these scores into probabilities, placing most of the weight on the "cat" position (perhaps $\alpha_{t,2} = 0.9$) and only small weights on "The" and "sat".
- Context vector: The weighted sum $c_t = \sum_i \alpha_{t,i} h_i$ produces a context vector dominated by the representation of "cat."
The decoder combines $c_t$ with $s_t$ and predicts "chat" as the next word. At the next step, when generating "noir" (black), the attention mechanism will shift focus to a different part of the input.
This dynamic, step-by-step focus is what makes attention effective. Rather than relying on a single compressed representation of the entire input, the model can look back at the original encoder states and select the information most relevant to each generation step.
Attention Weight Interpretation
Attention weights offer something rare in deep learning: interpretability. By examining which input positions receive high attention weights, we can understand what the model is "looking at" when making predictions.
In machine translation, attention weights often reveal meaningful alignments between source and target languages. When translating "The black cat sat on the mat" to French "Le chat noir était assis sur le tapis", the model typically attends to:
- "cat" when generating "chat"
- "black" when generating "noir"
- "mat" when generating "tapis"
This alignment emerges automatically from training. The model learns that certain source words are most relevant for generating certain target words, without explicit supervision about word correspondences.
However, attention weights require careful interpretation:
- Weights show correlation, not causation: High attention on a position doesn't mean that position caused the output
- Distributed information: Important information might be spread across multiple positions with moderate weights
- Layer effects: In multi-layer models, different layers may attend to different aspects of the input
Despite these caveats, attention visualization remains one of the most valuable tools for understanding model behavior.
Handling Variable-Length Inputs
One of attention's most practical benefits is graceful handling of variable-length sequences. Traditional encoder-decoder models compress inputs of any length into a fixed-size vector, creating a fundamental mismatch: longer inputs must squeeze more information into the same space.
Attention sidesteps this entirely. The context vector is always a weighted average of encoder states, regardless of sequence length. For a 10-word sentence, we average over 10 states. For a 100-word paragraph, we average over 100 states. The mechanism scales naturally.
This property is crucial for tasks with highly variable input lengths:
- Document summarization: Articles range from a few sentences to thousands of words
- Question answering: Questions are short, but context passages can be lengthy
- Code generation: Function descriptions vary from one-liners to detailed specifications
Without attention, longer inputs would require either truncation (losing information) or larger hidden states (increasing memory and computation). Attention avoids both problems by dynamically selecting relevant information at each step.
Attention vs Pooling
Before attention became widespread, sequence models often used pooling operations to aggregate information. Mean pooling averages all hidden states, while max pooling takes the element-wise maximum. How does attention compare?
Mean pooling treats all positions equally by computing a simple average:

$$c = \frac{1}{n} \sum_{i=1}^{n} h_i$$

where:
- $c$: the aggregated context vector (same dimension as hidden states)
- $n$: the number of positions in the sequence
- $h_i$: the hidden state at position $i$
Each position receives weight $1/n$, regardless of its content. This works when all parts of the input contribute equally to the output, but fails when relevance varies. In sentiment analysis, the phrase "not good" carries more weight than "the movie was", yet mean pooling gives them equal importance.
Max pooling extracts the strongest signal at each dimension independently:

$$c_j = \max_{i} h_{i,j}$$

where:
- $c_j$: the $j$-th dimension of the aggregated vector
- $h_{i,j}$: the $j$-th dimension of the hidden state at position $i$
- $\max_i$: the maximum value across all positions $i$
This captures salient features by selecting the most activated value for each dimension. However, it loses information about which positions contributed and cannot combine information from multiple positions in a nuanced way.
Attention provides learned, context-dependent weighting:

$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i$$

where:
- $c_t$: the context vector
- $\alpha_{t,i}$: the learned attention weight for position $i$ (computed dynamically)
- $h_i$: the hidden state at position $i$
The weights $\alpha_{t,i}$ are computed dynamically based on what the model needs at each step. Unlike mean pooling's uniform weights or max pooling's binary selection, attention learns which positions matter most for the current prediction. This flexibility makes attention strictly more expressive than fixed pooling strategies.
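The contrast is easy to see in a few lines of NumPy; the random states below are stand-ins for real encoder outputs, so the exact numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5, 8))   # 5 positions, hidden size 8
query = rng.normal(size=8)                # stand-in for the decoder state

# Mean pooling: every position gets weight 1/n
mean_pooled = hidden_states.mean(axis=0)

# Max pooling: element-wise max over positions (loses position identity)
max_pooled = hidden_states.max(axis=0)

# Attention: weights depend on the query, so they change at every decoding step
scores = hidden_states @ query
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attended = weights @ hidden_states

print("mean pooling weights:", np.full(5, 1 / 5))
print("attention weights:   ", weights.round(3))  # non-uniform, query-dependent
```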
Consider the sentence "The movie was absolutely terrible" for sentiment classification. The table below compares how mean pooling and attention weight each word:
| Word | Mean Pooling | Attention |
|---|---|---|
| The | 0.20 | 0.02 |
| movie | 0.20 | 0.08 |
| was | 0.20 | 0.03 |
| absolutely | 0.20 | 0.12 |
| terrible | 0.20 | 0.75 |
The contrast is stark. Mean pooling treats "The" and "terrible" as equally important, diluting the sentiment signal across all words. Attention learns to focus on "terrible" (and to a lesser extent "absolutely"), producing a context vector that emphasizes what actually matters for sentiment classification.
| Method | Weights | Context-dependent | Interpretable |
|---|---|---|---|
| Mean pooling | Uniform ($1/n$) | No | No |
| Max pooling | Binary (0 or 1) | No | Partial |
| Attention | Learned | Yes | Yes |
Building Intuition with Code
The mathematics of attention translates directly into code. Let's implement the three-step process we just described: compute alignment scores, apply softmax to get attention weights, and produce a context vector through weighted summation. Working through a concrete example will solidify these concepts and reveal what happens inside the attention mechanism.
We have 5 encoder states representing the words in "The cat sat on mat". Each state is a 4-dimensional vector containing the hidden representation learned by the encoder. The decoder state, also 4-dimensional, represents what the model is currently trying to generate. In practice, these dimensions would be much larger (256 to 1024), but the small size here makes the computation easy to follow.
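As a minimal sketch, here is one way to set this up in NumPy. The values are randomly generated, so they are illustrative rather than meaningful representations:

```python
import numpy as np

rng = np.random.default_rng(42)

words = ["The", "cat", "sat", "on", "mat"]
encoder_states = rng.normal(size=(5, 4))   # one 4-dimensional hidden state per word
decoder_state = rng.normal(size=4)         # the decoder's current state (the query)

print(encoder_states.shape)  # (5, 4)
print(decoder_state.shape)   # (4,)
```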
Now let's implement the attention mechanism. The function below follows our three-step formula exactly: compute dot product scores, apply softmax normalization, and return the weighted sum:
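A minimal version of such a function, continuing from the setup above, might look like this:

```python
def dot_product_attention(decoder_state, encoder_states):
    """Three-step attention: alignment scores -> softmax weights -> weighted sum."""
    # Step 1: alignment scores via dot product (one score per input position)
    scores = encoder_states @ decoder_state

    # Step 2: softmax turns raw scores into a probability distribution
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()

    # Step 3: context vector is the weighted sum of encoder states
    context = weights @ encoder_states
    return context, weights

context, weights = dot_product_attention(decoder_state, encoder_states)
for word, w in zip(words, weights):
    print(f"{word:>4}: {w:.3f}")
print("weights sum to:", weights.sum())   # 1.0
print("context shape:", context.shape)    # (4,) -- same as the encoder hidden size
```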
The attention weights show how much the model focuses on each input position. Notice that the weights sum to exactly 1.0, forming a valid probability distribution over input positions. In this random example, the weights are distributed based on how similar each encoder state is to the decoder state (measured by dot product). The context vector is a weighted combination of all encoder states, with dimensions matching the encoder hidden size. In a trained model, these similarities would reflect learned relevance patterns rather than random correlations.
Visualizing Attention Patterns
Attention weights are typically visualized as heatmaps, with rows representing decoder steps (outputs) and columns representing encoder positions (inputs). Let's create a visualization for a hypothetical translation task.
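One way to draw such a heatmap uses matplotlib. The weight matrix below is hand-crafted to mimic a trained translation model rather than produced by one, so treat it as illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

source = ["The", "black", "cat", "sat", "on", "the", "mat"]
target = ["Le", "chat", "noir", "était", "assis", "sur", "le", "tapis"]

# Hand-crafted attention weights (each row sums to 1)
attn = np.array([
    [0.80, 0.05, 0.05, 0.02, 0.02, 0.04, 0.02],  # Le     -> The
    [0.05, 0.10, 0.75, 0.04, 0.02, 0.02, 0.02],  # chat   -> cat
    [0.03, 0.80, 0.10, 0.03, 0.02, 0.01, 0.01],  # noir   -> black
    [0.02, 0.03, 0.10, 0.75, 0.05, 0.03, 0.02],  # était  -> sat
    [0.02, 0.02, 0.08, 0.78, 0.05, 0.03, 0.02],  # assis  -> sat
    [0.02, 0.02, 0.03, 0.08, 0.75, 0.05, 0.05],  # sur    -> on
    [0.05, 0.02, 0.02, 0.02, 0.04, 0.80, 0.05],  # le     -> the
    [0.02, 0.02, 0.03, 0.03, 0.05, 0.05, 0.80],  # tapis  -> mat
])

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(attn, cmap="Blues")
ax.set_xticks(range(len(source)))
ax.set_xticklabels(source)
ax.set_yticks(range(len(target)))
ax.set_yticklabels(target)
ax.set_xlabel("Encoder positions (English input)")
ax.set_ylabel("Decoder steps (French output)")
fig.colorbar(im)
plt.tight_layout()
plt.show()
```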
Several patterns emerge from this visualization:
- Diagonal tendency: Many languages share similar word order, so attention often follows a rough diagonal
- Reordering: French adjectives often follow nouns ("chat noir" vs "black cat"), visible in the swapped attention for "noir" and "chat"
- Many-to-one mapping: Both "était" and "assis" attend primarily to "sat", reflecting how French uses two words where English uses one
- Article alignment: Function words like "Le" and "le" align with their English counterparts
Attention in Practice: Sentiment Analysis
Let's see attention in a more complete example. We'll build a simple attention-based classifier for sentiment analysis, showing how attention helps identify which words drive the prediction.
This model uses a bidirectional LSTM to encode the input, then applies attention to create a single context vector for classification. The attention weights tell us which words the model considers most important.
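A sketch of what such a model could look like in PyTorch; the class and parameter names (e.g. `AttentionSentimentClassifier`) are illustrative choices rather than a reference implementation:

```python
import torch
import torch.nn as nn

class AttentionSentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM -> hidden states of size 2 * hidden_dim per position
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # Additive-style attention with a learned scoring head
        self.attn_proj = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        self.attn_score = nn.Linear(2 * hidden_dim, 1, bias=False)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)            # (batch, seq, embed_dim)
        states, _ = self.encoder(embedded)              # (batch, seq, 2*hidden_dim)

        # Alignment scores -> softmax weights -> context vector
        scores = self.attn_score(torch.tanh(self.attn_proj(states)))  # (batch, seq, 1)
        weights = torch.softmax(scores, dim=1)                        # sum to 1 over seq
        context = (weights * states).sum(dim=1)                       # (batch, 2*hidden_dim)

        logits = self.classifier(context)
        return logits, weights.squeeze(-1)              # weights: (batch, seq)
```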
Let's create a simple vocabulary and test the model:
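Continuing the sketch above, with a toy vocabulary of my own choosing:

```python
# Toy vocabulary and a single test sentence (weights are untrained, so the
# resulting attention pattern will look arbitrary)
vocab = {"<pad>": 0, "the": 1, "movie": 2, "was": 3, "great": 4,
         "terrible": 5, "really": 6, "loved": 7, "plot": 8, "acting": 9}

model = AttentionSentimentClassifier(vocab_size=len(vocab))
sentence = ["the", "movie", "was", "great"]
token_ids = torch.tensor([[vocab[w] for w in sentence]])

logits, attn_weights = model(token_ids)
for word, weight in zip(sentence, attn_weights[0].tolist()):
    print(f"{word:>8}: {weight:.3f}")
```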
Since this is an untrained model with randomly initialized weights, the attention distribution appears arbitrary. The model hasn't learned which words are semantically important. After training on labeled sentiment data, we would expect sentiment-bearing words like "great", "terrible", "loved", and "hated" to receive substantially higher attention weights (0.5-0.8), while function words like "the" and "was" would receive minimal attention (0.02-0.10).
Learned Attention Patterns
What would attention look like in a trained model? The table below shows realistic attention patterns for four sentiment sentences, with the highest-weighted word in each sentence highlighted:
| Sentence | the | movie/acting/plot | was | sentiment word | other |
|---|---|---|---|---|---|
| "the movie was great" | 0.05 | 0.15 | 0.10 | 0.70 (great) | — |
| "the movie was terrible" | 0.05 | 0.12 | 0.08 | 0.75 (terrible) | — |
| "the acting was not great" | 0.03 | 0.12 | 0.05 | 0.35 (great) | 0.45 (not) |
| "really loved the plot" | 0.05 | 0.25 (plot) | — | 0.55 (loved) | 0.15 (really) |
These patterns reveal what we would expect from a trained model:
- Sentiment words dominate: "great", "terrible", and "loved" receive the highest weights
- Negation matters: In "not great", both "not" and "great" receive significant attention, as the model must combine them to understand the negated sentiment
- Function words ignored: Words like "the" and "was" consistently receive low attention
The Attention Computation Pipeline
Let's trace through the complete attention computation step by step. This pipeline applies regardless of the specific scoring function used. The five stages are:
- Encoder states ($h_1, \ldots, h_n$): The LSTM or other encoder produces one hidden state per input position
- Decoder state ($s_t$): The decoder's current hidden state serves as the query
- Alignment scores ($e_{t,i}$): A scoring function computes relevance between the query and each key
- Attention weights ($\alpha_{t,i}$): Softmax normalizes scores into a probability distribution
- Context vector ($c_t$): Weighted sum of encoder states produces the final output
This context vector is then combined with the decoder state to predict the next token. Different attention variants (Bahdanau, Luong) differ primarily in how they compute the alignment scores in step 3.
Comparing Attention Mechanisms
We've established that attention requires a scoring function to measure relevance between decoder and encoder states. But what should this function look like? Different choices lead to different attention mechanisms, each with distinct trade-offs. Understanding these variants prepares you for the detailed treatments in upcoming chapters.
The fundamental question each scoring function answers is the same: "Given what I'm trying to generate (the decoder state) and what information is available (an encoder state), how compatible are they?" The answer is always a single number, the alignment score. But how we compute that number varies significantly.
Dot Product Attention (Luong)
The most direct approach treats compatibility as geometric alignment. Two vectors that point in similar directions should have high compatibility; vectors that are orthogonal should have zero compatibility. The dot product captures exactly this intuition:

$$\text{score}(s_t, h_i) = s_t^\top h_i$$

where:
- $s_t$: the decoder state at step $t$, a vector of dimension $d$
- $h_i$: the encoder state at position $i$, also of dimension $d$
- $s_t^\top h_i$: the inner product, computed as $\sum_{k=1}^{d} s_{t,k} \, h_{i,k}$
Geometrically, the dot product equals $\|s_t\| \, \|h_i\| \cos\theta$, where $\theta$ is the angle between the vectors. When both vectors point in the same direction ($\theta = 0°$), the score is maximized. When they're perpendicular ($\theta = 90°$), the score is zero. When they point in opposite directions ($\theta = 180°$), the score is negative.
This simplicity is both a strength and a limitation. The dot product requires no learnable parameters, making it computationally efficient and easy to implement. However, it imposes a constraint: the decoder and encoder must have the same hidden dimension. More subtly, it assumes that compatibility can be measured purely through vector alignment in the existing embedding space, without any learned transformation.
Additive Attention (Bahdanau)
What if simple geometric alignment isn't expressive enough? Perhaps compatibility depends on complex, non-linear relationships between the decoder and encoder states. Additive attention addresses this by introducing a small neural network to compute scores:

$$\text{score}(s_t, h_i) = v_a^\top \tanh(W_a s_t + U_a h_i)$$

where:
- $W_a$: a weight matrix of dimension $d_a \times d_s$ that projects the decoder state
- $U_a$: a weight matrix of dimension $d_a \times d_h$ that projects the encoder state
- $v_a$: a weight vector of dimension $d_a$ that produces the final scalar
- $\tanh$: the hyperbolic tangent, introducing nonlinearity
- $d_a$: the attention hidden dimension, a hyperparameter
Let's trace through what this formula does. First, $W_a s_t$ projects the decoder state into a new space of dimension $d_a$. Similarly, $U_a h_i$ projects the encoder state into the same space. Adding these projections combines information from both states. The $\tanh$ activation introduces nonlinearity, allowing the model to capture complex interactions. Finally, $v_a^\top$ projects the result to a scalar score.
This approach has two key advantages. First, the learnable parameters ($W_a$, $U_a$, $v_a$) allow the model to discover what "compatibility" means for the specific task, rather than relying on pre-existing geometric relationships. Second, since we project both states into a common space, the encoder and decoder can have different dimensions. This flexibility is valuable when using different architectures for encoding and decoding.
The cost is additional computation and more parameters to learn. For each encoder position, we must perform two matrix multiplications and a nonlinear activation. In practice, this overhead is manageable, and the increased expressiveness often justifies the cost.
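As a rough sketch, additive scoring takes only a few lines of NumPy. The dimensions below are arbitrary, chosen to show that the encoder and decoder sizes may differ:

```python
import numpy as np

def additive_score(s_t, h_i, W_a, U_a, v_a):
    """Bahdanau-style additive score: v_a^T tanh(W_a s_t + U_a h_i)."""
    return v_a @ np.tanh(W_a @ s_t + U_a @ h_i)

rng = np.random.default_rng(0)
d_s, d_h, d_a = 6, 4, 8           # decoder, encoder, and attention dims can all differ
W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_h))
v_a = rng.normal(size=d_a)

s_t = rng.normal(size=d_s)        # decoder state
h_i = rng.normal(size=d_h)        # one encoder state
print(additive_score(s_t, h_i, W_a, U_a, v_a))   # a single scalar score
```

In a trained model, $W_a$, $U_a$, and $v_a$ would be learned parameters rather than random matrices.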
Scaled Dot Product Attention (Transformer)
The transformer architecture revived dot product attention but added a crucial refinement. The problem with vanilla dot products becomes apparent in high dimensions: scores can grow very large in magnitude, causing softmax to produce extremely peaked distributions.
To understand why, consider what happens when we compute $s^\top h$ for vectors with $d$ dimensions. If each component is independently drawn from a distribution with zero mean and unit variance, the expected value of the dot product is zero, but its variance is approximately $d$. For $d = 512$ (a typical value), the standard deviation of scores is about 22. Scores this large cause softmax to assign nearly all probability mass to a single position, producing gradients close to zero for all other positions.
Scaled dot product attention fixes this by normalizing:

$$\text{score}(s_t, h_i) = \frac{s_t^\top h_i}{\sqrt{d_k}}$$

where:
- $d_k$: the dimension of the key vectors
- $\sqrt{d_k}$: the scaling factor that stabilizes variance
Dividing by $\sqrt{d_k}$ ensures the variance of scores remains approximately 1, regardless of dimension. This keeps softmax operating in a regime where gradients flow healthily during training. The fix is simple but essential for making attention work at scale.
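A short sketch of the scaled score, plus an empirical check of the variance argument above (random vectors and illustrative dimensions of my own choosing):

```python
import numpy as np

def scaled_dot_score(s_t, h_i):
    d_k = h_i.shape[-1]
    return (s_t @ h_i) / np.sqrt(d_k)

# Empirical check: unscaled score spread grows with dimension,
# while scaled scores stay near unit variance.
rng = np.random.default_rng(0)
for d in (16, 64, 512):
    s = rng.normal(size=(10_000, d))
    h = rng.normal(size=(10_000, d))
    raw = (s * h).sum(axis=1)                       # raw dot products
    scaled = raw / np.sqrt(d)                       # scaled scores
    print(f"d={d:4d}  std(raw)={raw.std():6.2f}  std(scaled)={scaled.std():.2f}")
```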
This empirical check makes the scaling problem concrete. As the dimension increases, the spread of raw dot product scores grows like $\sqrt{d}$, so at $d = 512$ scores with magnitudes above 20 are routine, which would cause softmax to produce extremely peaked distributions. Scaling by $\sqrt{d_k}$ normalizes this spread back to a manageable range.
Choosing a Scoring Function
Each approach represents a different trade-off:
| Scoring Function | Parameters | Computational Cost | Flexibility |
|---|---|---|---|
| Dot Product | None | Low | Low (same dimension required) |
| Additive | $W_a$, $U_a$, $v_a$ | Higher | High (different dimensions OK) |
| Scaled Dot Product | None | Low | Low (same dimension required) |
Dot product attention is fastest but least flexible. Additive attention is most flexible but requires more computation. Scaled dot product combines the efficiency of dot products with numerical stability for high dimensions.
In modern practice, scaled dot product attention dominates. Its efficiency allows transformers to use multiple attention heads in parallel, each learning different aspects of relevance. The lack of learnable parameters in the score function is offset by the projection matrices that create queries, keys, and values, which we'll explore in the transformer chapters.
Limitations and Impact
Attention improved sequence modeling significantly, but it comes with trade-offs worth understanding.
The most significant limitation is computational cost. Standard attention computes pairwise interactions between all query and key positions, resulting in $O(n^2)$ complexity, where $n$ is the sequence length. For a 1000-token document, this means a million attention computations per layer. This quadratic scaling motivates ongoing research into efficient attention variants like sparse attention, linear attention, and the various approximations used in models like Longformer and BigBird.
Another consideration is that attention weights, while interpretable, don't always tell the complete story. Research has shown that attention patterns can be manipulated without changing model predictions, and that high attention doesn't necessarily mean high importance for the final output. Gradient-based attribution methods sometimes provide more reliable explanations. Still, attention visualization remains valuable for debugging and building intuition about model behavior.
Despite these limitations, attention had a major impact on NLP. Before attention, sequence-to-sequence models struggled with long sequences and provided little insight into their decision-making. Attention enabled:
- State-of-the-art machine translation: The Bahdanau attention paper (2014) dramatically improved translation quality
- Interpretable models: Practitioners could finally see what their models were "looking at"
- Variable-length handling: Models could process inputs of any length without architectural changes
- The transformer architecture: Self-attention, where a sequence attends to itself, became the foundation of BERT, GPT, and virtually all modern language models
The attention mechanism went from a technique for improving translation to a core building block of modern deep learning for language.
Key Parameters
When implementing attention mechanisms, several parameters significantly impact model behavior:
- Hidden dimension (`hidden_dim`): The size of encoder and decoder hidden states. Larger values (256-1024) capture more nuanced representations but increase memory and computation. For attention to work with dot product scoring, encoder and decoder dimensions must match.
- Attention dimension (`d_a`): For additive attention, this controls the size of the intermediate projection space. Typical values range from 64 to 512. Smaller values reduce parameters but may limit the model's ability to learn complex alignment patterns.
- Number of attention heads: In multi-head attention (covered in transformer chapters), this parameter controls how many parallel attention computations run simultaneously. Common values are 4, 8, or 16 heads, with the hidden dimension divided equally among heads.
- Dropout rate: Applied to attention weights during training to prevent the model from relying too heavily on specific positions. Values of 0.1-0.3 are typical. Higher dropout encourages more distributed attention patterns.
- Temperature scaling: An optional parameter that divides attention scores before softmax. Values less than 1.0 sharpen the distribution (more focused attention), while values greater than 1.0 flatten it (more uniform attention). Scaled dot product attention uses $\sqrt{d_k}$ as an automatic temperature based on dimension; a small sketch of temperature-scaled softmax follows this list.
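As a minimal illustration of temperature scaling (the scores here are made up):

```python
import numpy as np

def softmax_with_temperature(scores, temperature=1.0):
    """Divide scores by a temperature before softmax.
    T < 1 sharpens the distribution; T > 1 flattens it."""
    scaled = np.asarray(scores) / temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

scores = [2.0, 1.0, 0.5]
print(softmax_with_temperature(scores, 0.5))   # sharper, more focused
print(softmax_with_temperature(scores, 1.0))   # standard softmax
print(softmax_with_temperature(scores, 2.0))   # flatter, more uniform
```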
Summary
Attention solves the information bottleneck in encoder-decoder models. Rather than compressing an entire input sequence into a single vector, attention allows the decoder to dynamically focus on relevant parts of the input at each generation step.
Key takeaways from this chapter:
- Soft lookup: Attention functions as a differentiable dictionary lookup, computing weighted combinations of values based on query-key similarity
- Three components: Every attention mechanism involves queries (what we're looking for), keys (what we compare against), and values (what we retrieve)
- Interpretability: Attention weights reveal which input positions the model considers relevant, enabling visualization and debugging
- Variable-length handling: Attention scales naturally to any input length, avoiding the fixed-size bottleneck of traditional encoders
- Beyond pooling: Unlike mean or max pooling, attention provides learned, context-dependent weighting that adapts to each prediction step
- Computational trade-off: The flexibility of attention comes at quadratic cost in sequence length, motivating efficient variants
In the following chapters, we'll examine specific attention mechanisms in detail. Bahdanau attention introduced the additive scoring function that made attention practical for machine translation. Luong attention explored simpler alternatives including dot product scoring. Understanding these foundations prepares you for the self-attention mechanism at the heart of transformers, where sequences attend to themselves to build rich contextual representations.