Luong Attention: Dot Product, General & Local Attention Mechanisms

Michael Brenndoerfer · December 16, 2025 · 42 min read

Master Luong attention variants including dot product, general, and concat scoring. Compare global vs local attention and understand attention placement in seq2seq models.

Luong Attention

In the previous chapter, we explored Bahdanau attention, which revolutionized sequence-to-sequence models by allowing the decoder to dynamically focus on different parts of the input sequence. Bahdanau's approach uses an additive score function and computes attention before the RNN step, feeding the context vector as input to the decoder. But is this the only way to design an attention mechanism?

In 2015, Luong et al. proposed several alternative attention mechanisms that are computationally simpler and often more effective. Their work introduced multiplicative attention variants, the distinction between global and local attention, and a different placement of attention in the decoder architecture. Understanding these variations is essential because the scaled dot-product attention used in modern transformers descends directly from Luong's multiplicative formulation.

Attention Score Functions

At the heart of any attention mechanism lies a fundamental question: how do we measure the relevance of each encoder state to the current decoding step? This measurement, called the alignment score or compatibility score, determines which parts of the input sequence the decoder should focus on when generating each output token.

Think of it this way: when translating "The cat sat on the mat" to French, and you're about to generate the word "chat" (cat), you need a way to identify that the English word "cat" is the most relevant source word. The score function quantifies this relevance for every source position, producing a set of numbers that attention will convert into a probability distribution.

Bahdanau attention introduced an additive score function with learned parameters. Luong et al. asked: can we achieve similar results with simpler approaches? Their answer was three alternative score functions, each representing a different trade-off between simplicity, expressiveness, and computational cost.

Dot Product Attention

The simplest approach to measuring relevance is to ask: how similar are the decoder and encoder representations? If the encoder has learned to represent "cat" in a particular way, and the decoder has learned to look for that same pattern when generating "chat", then similar representations should yield high scores.

The dot product provides exactly this similarity measure. Given a decoder hidden state $h_t$ and an encoder hidden state $\bar{h}_s$, we compute:

$$\text{score}(h_t, \bar{h}_s) = h_t^\top \bar{h}_s$$

where:

  • $h_t \in \mathbb{R}^{d}$: the decoder hidden state at timestep $t$, encoding what the decoder is "looking for"
  • $\bar{h}_s \in \mathbb{R}^{d}$: the encoder hidden state at source position $s$, encoding what that position "contains"
  • $h_t^\top \bar{h}_s$: the inner product of the two vectors, yielding a scalar score

Why does the dot product capture similarity? Consider what happens geometrically. When two vectors point in the same direction, their dot product is large and positive, indicating high compatibility. When they point in opposite directions, the dot product is large and negative. Orthogonal vectors, sharing no common direction, yield zero.

This geometric interpretation has a powerful implication: the encoder and decoder can learn complementary representations. The encoder learns to place semantically similar words in similar directions in the hidden space. The decoder learns to "query" for specific semantic content by producing hidden states that point toward the relevant encoder representations.

The elegance of dot product attention lies in its simplicity. No learned parameters are needed for the score function itself. All the learning happens in the encoder and decoder, which discover representations where raw similarity is a good proxy for alignment. This makes dot product attention computationally efficient and easy to implement.

However, this simplicity comes with a constraint: the decoder and encoder hidden states must have the same dimensionality. If $h_t \in \mathbb{R}^{d_h}$ and $\bar{h}_s \in \mathbb{R}^{d_h}$, the dot product is well-defined. If they have different dimensions, you cannot compute a dot product directly, which motivates our next score function.
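To make the computation concrete, here is a minimal sketch of dot-product scoring over a toy source sentence. The vectors are made-up values, not the output of any trained model:

import torch

# Hypothetical 4-dimensional hidden states
h_t = torch.tensor([0.7, -0.2, 0.8, 0.3])   # decoder state: what the decoder is "looking for"
encoder_states = torch.tensor([
    [0.1, 0.2, -0.1, 0.3],                  # source position 1
    [0.8, -0.3, 0.9, 0.2],                  # source position 2 (points the same way as h_t)
    [0.2, 0.7, 0.1, -0.4],                  # source position 3
])

# One dot product per source position: (S, d) @ (d,) -> (S,)
scores = encoder_states @ h_t
print(scores)  # the second position receives the highest score

The same vectors reappear in the worked example later in this chapter, where the scores are pushed through softmax to produce attention weights.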

General (Bilinear) Attention

What if the encoder and decoder have different hidden dimensions? Or what if raw similarity isn't the right measure, and we want the model to learn a more nuanced notion of relevance?

The general score function addresses both concerns by introducing a learnable weight matrix $W_a$ that transforms the encoder states before computing similarity:

$$\text{score}(h_t, \bar{h}_s) = h_t^\top W_a \bar{h}_s$$

where:

  • $h_t \in \mathbb{R}^{d_h}$: the decoder hidden state at timestep $t$
  • $\bar{h}_s \in \mathbb{R}^{d_{\bar{h}}}$: the encoder hidden state at source position $s$
  • $W_a \in \mathbb{R}^{d_h \times d_{\bar{h}}}$: a learned weight matrix that projects encoder states into the decoder's space
  • $h_t^\top W_a \bar{h}_s$: a scalar score computed via matrix-vector multiplication

To understand what $W_a$ does, let's trace through the computation step by step. First, $W_a \bar{h}_s$ transforms the encoder state into a $d_h$-dimensional vector. Then $h_t^\top$ takes the dot product with this transformed vector. The matrix $W_a$ effectively learns which aspects of the encoder representation are most relevant for each dimension of the decoder state.

This formulation is called bilinear attention because the score is a bilinear function of the two input vectors: linear in $h_t$ when $\bar{h}_s$ is fixed, and linear in $\bar{h}_s$ when $h_t$ is fixed. The weight matrix $W_a$ serves two purposes:

  • Dimension matching: When encoder and decoder have different hidden dimensions, $W_a$ bridges the gap by projecting encoder states into the decoder's space
  • Learned similarity: Rather than relying on raw vector similarity, the matrix learns what aspects of encoder states are most relevant for alignment, potentially discovering non-obvious relationships

General attention is more expressive than dot product attention but requires additional parameters. For encoder and decoder dimensions of 512, $W_a$ adds $512 \times 512 = 262{,}144$ parameters. This is modest compared to the total model size but can matter for smaller models or when memory is constrained.
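As a rough sketch, the bilinear score can be computed by projecting the encoder states with $W_a$ and then taking dot products. The dimensions and random weights below are purely illustrative:

import torch

d_h, d_enc = 4, 6                        # decoder and encoder dimensions can differ
W_a = torch.randn(d_h, d_enc) * 0.1      # learned in a real model; random stand-in here

h_t = torch.randn(d_h)                   # decoder hidden state
encoder_states = torch.randn(5, d_enc)   # five encoder hidden states

# General score: h_t^T W_a h_bar_s for every source position
projected = encoder_states @ W_a.T       # (5, d_h): encoder states mapped into the decoder's space
scores = projected @ h_t                 # (5,): one scalar score per source position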

Concat Attention

The most expressive score function takes a fundamentally different approach. Rather than computing a form of similarity between two vectors, it asks: given both the decoder state and an encoder state, what score should a neural network assign?

The concat score function concatenates the two states and passes them through a small neural network:

$$\text{score}(h_t, \bar{h}_s) = v_a^\top \tanh(W_a [h_t; \bar{h}_s])$$

Let's unpack this formula piece by piece:

  1. Concatenation $[h_t; \bar{h}_s]$: We stack the decoder and encoder states into a single vector of dimension $d_h + d_{\bar{h}}$. This gives the network access to all information from both states.

  2. Linear projection $W_a [h_t; \bar{h}_s]$: The weight matrix $W_a \in \mathbb{R}^{n \times (d_h + d_{\bar{h}})}$ projects the concatenated vector to an intermediate space of dimension $n$. This allows the network to learn arbitrary linear combinations of the input features.

  3. Nonlinearity $\tanh(\cdot)$: The hyperbolic tangent activation introduces nonlinearity, enabling the network to learn complex, non-linear relationships between decoder and encoder states.

  4. Scalar projection $v_a^\top$: Finally, the weight vector $v_a \in \mathbb{R}^n$ projects the intermediate representation to a single scalar score.

where:

  • $h_t \in \mathbb{R}^{d_h}$: the decoder hidden state at timestep $t$
  • $\bar{h}_s \in \mathbb{R}^{d_{\bar{h}}}$: the encoder hidden state at source position $s$
  • $n$: the intermediate dimension, a hyperparameter typically set equal to the hidden dimension

This formulation is similar to Bahdanau attention, with one key difference: Bahdanau uses the previous decoder state $h_{t-1}$, while Luong's concat uses the current decoder state $h_t$. This seemingly small change has significant implications for the overall architecture, which we'll explore in the section on attention placement.
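A minimal sketch of the concat score, with illustrative dimensions and random stand-ins for the learned parameters $W_a$ and $v_a$:

import torch

d_h, d_enc, n = 4, 6, 8                             # decoder, encoder, and intermediate dimensions
W_a = torch.randn(n, d_h + d_enc) * 0.1             # stand-in for the learned projection
v_a = torch.randn(n) * 0.1                          # stand-in for the learned score vector

h_t = torch.randn(d_h)
encoder_states = torch.randn(5, d_enc)              # five encoder hidden states

# Pair the decoder state with each encoder state, then score with the small two-layer network
h_t_repeated = h_t.expand(5, d_h)                   # repeat h_t once per source position
concat = torch.cat([h_t_repeated, encoder_states], dim=1)   # (5, d_h + d_enc)
scores = torch.tanh(concat @ W_a.T) @ v_a           # (5,): one scalar score per source position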

Out[3]:
Visualization
Three diagrams showing dot product, general, and concat attention score computation flows.
Comparison of three attention score functions. Dot product is parameter-free and requires matching dimensions. General attention uses a learned matrix to compute bilinear similarity. Concat attention uses a two-layer network with nonlinearity.

Computational Comparison

The three score functions differ significantly in their computational requirements. Let's analyze them for a sequence of length $S$ with hidden dimension $d$:

Computational comparison of Luong attention score functions. Dot product is most efficient but requires matching dimensions.

| Score Function | Time Complexity | Parameters | Notes |
| --- | --- | --- | --- |
| Dot product | $O(Sd)$ | 0 | Fastest, requires $d_h = d_{\bar{h}}$ |
| General | $O(Sd^2)$ | $d_h \times d_{\bar{h}}$ | Matrix multiplication dominates |
| Concat | $O(Snd)$ | $n(d_h + d_{\bar{h}}) + n$ | Two-step transformation |
Out[4]:
Visualization
Line plot showing computational cost scaling with sequence length for three attention methods.
FLOPs per attention computation at different sequence lengths (hidden dim=256). Dot product scales linearly while general and concat scale quadratically with hidden dimension.
Bar chart comparing parameter counts across hidden dimensions for three attention methods.
Parameter count by hidden dimension. Dot product requires no parameters, making it the most memory-efficient choice.

For modern hardware with efficient matrix operations, dot product attention is significantly faster because it can be fully parallelized as a single matrix multiplication across all encoder states simultaneously. This efficiency advantage is why transformers adopt scaled dot-product attention as their core mechanism.
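As a sketch of why this parallelizes so well, all scores for a batch can be produced by a single batched matrix multiplication (the shapes here are hypothetical):

import torch

B, S, d = 32, 50, 256                       # batch size, source length, hidden dim
encoder_outputs = torch.randn(B, S, d)      # all encoder states
decoder_hidden = torch.randn(B, d)          # current decoder state for each sequence

# All S dot-product scores per sequence in one bmm: (B, S, d) x (B, d, 1) -> (B, S)
scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)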

Global vs Local Attention

Beyond score functions, Luong et al. introduced an important architectural distinction: global attention attends to all encoder states, while local attention focuses on a subset of positions around a predicted alignment point.

Global Attention

Global attention is conceptually straightforward. At each decoder timestep $t$, we compute attention weights over all encoder hidden states by applying softmax to the alignment scores:

$$\alpha_{ts} = \frac{\exp(\text{score}(h_t, \bar{h}_s))}{\sum_{s'=1}^{S} \exp(\text{score}(h_t, \bar{h}_{s'}))}$$

where:

  • $\alpha_{ts}$: the attention weight for source position $s$ when decoding at timestep $t$
  • $\text{score}(h_t, \bar{h}_s)$: the alignment score computed using any of the three score functions
  • $S$: the total number of encoder positions (source sequence length)
  • $\exp(\cdot)$: the exponential function, ensuring all values are positive
  • The denominator normalizes the weights so they sum to 1 across all source positions

The context vector is then computed as the weighted sum of all encoder states:

$$c_t = \sum_{s=1}^{S} \alpha_{ts} \bar{h}_s$$

where:

  • $c_t \in \mathbb{R}^{d_{\bar{h}}}$: the context vector at decoder timestep $t$
  • $\alpha_{ts}$: the attention weight for position $s$ (how much to focus on that position)
  • $\bar{h}_s$: the encoder hidden state at position $s$

This weighted sum allows the decoder to "focus" on relevant source positions: positions with high attention weights contribute more to the context vector, while positions with low weights contribute less.
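In code, the two formulas amount to a softmax followed by a weighted sum. A minimal sketch with made-up dimensions:

import torch
import torch.nn.functional as F

S, d = 6, 8
encoder_states = torch.randn(S, d)      # \bar{h}_1 ... \bar{h}_S
h_t = torch.randn(d)                    # current decoder hidden state

scores = encoder_states @ h_t           # alignment scores, shape (S,)
alpha = F.softmax(scores, dim=0)        # attention weights, non-negative and summing to 1
c_t = alpha @ encoder_states            # context vector: weighted sum of encoder states, shape (d,)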

Global attention is what we typically mean when we say "attention" without qualification. It allows the model to attend to any position in the source sequence, which is essential for tasks like translation where word order can differ dramatically between languages.

The computational cost of global attention is $O(S)$ per decoder step, where $S$ is the source sequence length. For most NLP tasks with sequences of hundreds or thousands of tokens, this is manageable. However, for very long sequences (documents, books, genomic data), global attention becomes a bottleneck.

Local Attention

Local attention restricts the attention window to a subset of encoder positions centered around an alignment point $p_t$. This reduces computation and can also serve as an inductive bias when alignments are expected to be roughly monotonic.

Out[5]:
Visualization
Heatmap showing attention weights across all encoder positions for global attention.
Global attention computes weights over all encoder positions. Every source token can influence every target token, providing maximum flexibility but requiring O(S) computation per step.
Heatmap showing attention weights concentrated in a diagonal band for local attention.
Local attention focuses on a window of size 2D+1 centered at position p_t. Positions outside the window receive zero attention weight, reducing computation to O(D).

Luong et al. propose two variants for determining the alignment point $p_t$:

Monotonic alignment (local-m): The alignment point is simply set to $p_t = t$, assuming the source and target sequences are roughly aligned. This works well for tasks like speech recognition where the output follows the input order.

Predictive alignment (local-p): The model learns to predict the alignment point using a small neural network:

$$p_t = S \cdot \sigma(v_p^\top \tanh(W_p h_t))$$

where:

  • $p_t$: the predicted alignment point (center of the attention window) at decoder timestep $t$
  • $S$: the source sequence length
  • $\sigma(\cdot)$: the sigmoid function, which outputs a value in $[0, 1]$
  • $W_p$: a learned weight matrix that transforms the decoder hidden state
  • $v_p$: a learned weight vector that projects to a scalar
  • $h_t$: the current decoder hidden state

The sigmoid ensures the output is between 0 and 1, and multiplying by $S$ scales it to a valid source position. This allows the model to learn non-monotonic alignments while still restricting attention to a local window.
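A minimal sketch of the local-p prediction, with random stand-ins for the learned parameters $W_p$ and $v_p$:

import torch

S, d = 20, 8                          # source length and hidden dimension (hypothetical)
W_p = torch.randn(d, d) * 0.1         # stand-in for the learned matrix
v_p = torch.randn(d) * 0.1            # stand-in for the learned vector
h_t = torch.randn(d)                  # current decoder hidden state

# Predicted window center: a real-valued position in [0, S]
p_t = S * torch.sigmoid(v_p @ torch.tanh(W_p @ h_t))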

Within the window $[p_t - D, p_t + D]$, attention weights are computed normally but multiplied by a Gaussian centered at $p_t$:

$$\alpha_{ts} = \text{align}(h_t, \bar{h}_s) \cdot \exp\left(-\frac{(s - p_t)^2}{2\sigma^2}\right)$$

where:

  • $\alpha_{ts}$: the final attention weight for position $s$ at decoder timestep $t$
  • $\text{align}(h_t, \bar{h}_s)$: the base attention weight from softmax over the local window
  • $s$: the source position being attended to
  • $p_t$: the predicted center of the attention window
  • $\sigma$: the standard deviation of the Gaussian (controls window sharpness), typically set to $D/2$
  • The exponential term is a Gaussian that peaks at s=pts = p_t and decays for positions farther from the center

The Gaussian favors positions near the center of the window, providing a soft boundary rather than a hard cutoff. Positions at the edge of the window receive lower weights even if their alignment scores are high.
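A sketch of the full local-attention weighting, assuming a predicted center and illustrative scores:

import torch
import torch.nn.functional as F

S, D = 20, 4                                   # source length and window half-width (hypothetical)
p_t = torch.tensor(11.3)                       # predicted window center (e.g. from local-p)
sigma = D / 2

scores = torch.randn(S)                        # alignment scores, random here for illustration
positions = torch.arange(S, dtype=torch.float)

# Zero out positions outside [p_t - D, p_t + D], then softmax within the window
in_window = (positions - p_t).abs() <= D
align = F.softmax(scores.masked_fill(~in_window, float("-inf")), dim=0)

# The Gaussian factor favors positions near the predicted center
alpha = align * torch.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))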

When to Use Each

The choice between global and local attention depends on your task:

  • Global attention is the default choice for most NLP tasks. Translation, summarization, and question answering all benefit from the ability to attend to any position. The computational overhead is acceptable for typical sequence lengths.

  • Local attention shines when you have prior knowledge about alignment structure. Speech recognition, where phonemes appear in order, is a natural fit. Document-level tasks with very long sequences can also benefit from the reduced computation.

In practice, global attention dominates because modern hardware handles the $O(S)$ computation efficiently, and the flexibility to attend anywhere is valuable. Local attention is more of a historical curiosity, though its ideas influenced later work on sparse attention patterns in transformers.

Attention Placement: Input vs Output

A subtle but important difference between Bahdanau and Luong attention is where the attention mechanism fits into the decoder architecture. This choice affects both the information flow and the computational graph.

Bahdanau: Attention as Input

In Bahdanau attention, the context vector is computed using the previous decoder hidden state and then concatenated with the input embedding to form the input to the current decoder step. The computation proceeds in three stages:

$$\begin{aligned} c_t &= \text{Attention}(h_{t-1}, \bar{H}) \\ \tilde{x}_t &= [x_t; c_t] \\ h_t &= \text{RNN}(\tilde{x}_t, h_{t-1}) \end{aligned}$$

where:

  • $h_{t-1}$: the decoder hidden state from the previous timestep
  • $\bar{H} = [\bar{h}_1, \bar{h}_2, \ldots, \bar{h}_S]$: the matrix of all encoder hidden states
  • $c_t$: the context vector computed by attending over encoder states
  • $x_t$: the input embedding at timestep $t$ (typically the previous output token)
  • $\tilde{x}_t$: the augmented input formed by concatenating $x_t$ and $c_t$
  • $h_t$: the new decoder hidden state

The context vector influences the RNN computation directly. This means the decoder can use information about which source positions are relevant when updating its hidden state.

Luong: Attention as Output

In Luong attention, the decoder RNN runs first, producing the current hidden state. Then attention is computed using this new state, and the results are combined. The computation proceeds in three stages:

$$\begin{aligned} h_t &= \text{RNN}(x_t, h_{t-1}) \\ c_t &= \text{Attention}(h_t, \bar{H}) \\ \tilde{h}_t &= \tanh(W_c [c_t; h_t]) \end{aligned}$$

where:

  • $x_t$: the input embedding at timestep $t$
  • $h_{t-1}$: the decoder hidden state from the previous timestep
  • $h_t$: the new decoder hidden state after the RNN step
  • $\bar{H} = [\bar{h}_1, \bar{h}_2, \ldots, \bar{h}_S]$: the matrix of all encoder hidden states
  • $c_t$: the context vector computed by attending over encoder states using the current $h_t$
  • $[c_t; h_t]$: the concatenation of context and hidden state
  • $W_c$: a learned weight matrix that combines context and hidden state
  • $\tilde{h}_t$: the "attentional hidden state" used for prediction

The final output $\tilde{h}_t$ combines the context vector with the decoder state through a learned transformation. This attentional state is then used for prediction, typically by projecting it to vocabulary size and applying softmax.
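Stripped of module plumbing, one Luong decoding step follows the three stages above. This is only a sketch: the RNN cell is replaced by a placeholder update and the weight matrix is a random stand-in (a complete module appears in the Implementation section):

import torch

d, S = 8, 6                                   # hidden dim and source length (hypothetical)
W_c = torch.randn(d, 2 * d) * 0.1             # stand-in for the learned combination layer
encoder_states = torch.randn(S, d)
x_t = torch.randn(d)                          # embedded input token
h_prev = torch.randn(d)                       # previous decoder state

h_t = torch.tanh(x_t + h_prev)                # 1. RNN step first (placeholder for a GRU/LSTM cell)
alpha = torch.softmax(encoder_states @ h_t, dim=0)    # 2. attention with the *current* state
c_t = alpha @ encoder_states
h_tilde = torch.tanh(W_c @ torch.cat([c_t, h_t]))     # 3. attentional hidden state for prediction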

Out[6]:
Visualization
Flow diagram showing Bahdanau attention with context computed before RNN step.
Bahdanau attention computes context using the previous hidden state, then feeds it as input to the RNN. The context influences the state update directly.
Flow diagram showing Luong attention with context computed after RNN step.
Luong attention runs the RNN first, then computes context using the current hidden state. The context is combined with the state for output.

Implications of Placement

The placement choice has several practical implications:

Information flow: Bahdanau attention allows the context to influence the RNN state update, potentially enabling richer interactions. Luong attention keeps the RNN computation separate, which can be easier to reason about and debug.

Parallelization: Luong attention is slightly more amenable to parallelization during training because the RNN step doesn't depend on the attention computation. However, this advantage is minimal compared to the sequential nature of RNNs themselves.

Empirical performance: Luong et al. found their approach performed comparably or slightly better than Bahdanau attention on machine translation benchmarks. The simpler architecture and faster score functions (especially dot product) made it an attractive choice.

Luong vs Bahdanau: A Complete Comparison

Let's consolidate the differences between these two influential attention mechanisms:

Comparison of Bahdanau and Luong attention mechanisms. The key differences are score function, decoder state timing, and attention placement.
| Aspect | Bahdanau | Luong |
| --- | --- | --- |
| Score function | Additive (concat with tanh) | Dot, general, or concat |
| Decoder state used | Previous ($h_{t-1}$) | Current ($h_t$) |
| Attention placement | Before RNN (input) | After RNN (output) |
| Context integration | Concatenated with input | Combined via learned layer |
| Encoder architecture | Bidirectional RNN | Unidirectional (stacked) |
| Parameters | More (additive scoring) | Fewer (dot product option) |

Both mechanisms compute attention weights via softmax and produce a context vector as a weighted sum of encoder states. The differences lie in the details of scoring, timing, and integration.

In practice, the choice between Bahdanau and Luong attention often matters less than other architectural decisions like model size, number of layers, and training procedure. Modern transformer architectures have largely superseded both, but understanding these mechanisms provides essential intuition for how attention works.

Implementation

Let's implement Luong attention in PyTorch. We'll build a complete attention module that supports all three score functions, then integrate it into a sequence-to-sequence decoder.

Attention Module

First, we define the core attention computation. The module takes encoder outputs and a decoder hidden state, computes attention weights, and returns the context vector.

In[7]:
Code
import torch
import torch.nn as nn
import torch.nn.functional as F


class LuongAttention(nn.Module):
    """Luong attention with configurable score function."""

    def __init__(self, hidden_dim, encoder_dim=None, method="dot"):
        super().__init__()
        self.method = method
        self.hidden_dim = hidden_dim
        self.encoder_dim = encoder_dim or hidden_dim

        if method == "general":
            # W_a for bilinear scoring
            self.W_a = nn.Linear(self.encoder_dim, hidden_dim, bias=False)
        elif method == "concat":
            # Two-layer scoring network
            self.W_a = nn.Linear(
                hidden_dim + self.encoder_dim, hidden_dim, bias=False
            )
            self.v_a = nn.Linear(hidden_dim, 1, bias=False)

    def score(self, decoder_hidden, encoder_outputs):
        """Compute attention scores for all encoder positions."""
        if self.method == "dot":
            scores = torch.bmm(
                encoder_outputs, decoder_hidden.unsqueeze(2)
            ).squeeze(2)
        elif self.method == "general":
            transformed = self.W_a(encoder_outputs)
            scores = torch.bmm(
                transformed, decoder_hidden.unsqueeze(2)
            ).squeeze(2)
        elif self.method == "concat":
            src_len = encoder_outputs.size(1)
            decoder_expanded = decoder_hidden.unsqueeze(1).expand(
                -1, src_len, -1
            )
            concat = torch.cat([decoder_expanded, encoder_outputs], dim=2)
            scores = self.v_a(torch.tanh(self.W_a(concat))).squeeze(2)
        return scores

    def forward(self, decoder_hidden, encoder_outputs, mask=None):
        """Compute attention weights and context vector."""
        scores = self.score(decoder_hidden, encoder_outputs)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attention_weights = F.softmax(scores, dim=1)
        context = torch.bmm(
            attention_weights.unsqueeze(1), encoder_outputs
        ).squeeze(1)
        return context, attention_weights

The LuongAttention class implements all three score functions. The constructor accepts the hidden dimension, an optional encoder dimension, and the scoring method. The score method computes alignment scores using dot product, general (bilinear), or concat approaches. The forward method orchestrates the full attention computation: scores, optional masking, softmax normalization, and weighted sum.

Attentional Decoder

Now let's build a decoder that uses Luong attention. The key difference from a standard RNN decoder is the attention step after the RNN and the combination layer that produces the attentional hidden state.

In[8]:
Code
class LuongDecoder(nn.Module):
    """Decoder with Luong attention."""

    def __init__(
        self,
        vocab_size,
        embed_dim,
        hidden_dim,
        encoder_dim=None,
        attention_method="dot",
        num_layers=1,
        dropout=0.1,
    ):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.encoder_dim = encoder_dim or hidden_dim

        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(
            embed_dim,
            hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
        )
        self.attention = LuongAttention(
            hidden_dim, self.encoder_dim, attention_method
        )
        self.W_c = nn.Linear(
            self.encoder_dim + hidden_dim, hidden_dim, bias=False
        )
        self.output_projection = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward_step(self, input_token, hidden, encoder_outputs, mask=None):
        """Single decoding step with attention."""
        embedded = self.dropout(self.embedding(input_token.unsqueeze(1)))
        rnn_output, hidden = self.rnn(embedded, hidden)
        rnn_output = rnn_output.squeeze(1)
        context, attention_weights = self.attention(
            hidden[-1], encoder_outputs, mask
        )
        combined = torch.cat([context, rnn_output], dim=1)
        attentional_hidden = torch.tanh(self.W_c(combined))
        output = self.output_projection(self.dropout(attentional_hidden))
        return output, hidden, attention_weights

The LuongDecoder class combines an embedding layer, a GRU, the attention module, a combination layer for producing the attentional hidden state, and an output projection. The forward_step method implements a single decoding step: embed, RNN, then attention. This order is the defining characteristic of Luong attention.

Testing the Implementation

Let's verify our implementation works correctly with some example data.

In[9]:
Code
# Create sample data
batch_size = 4
src_len = 10
hidden_dim = 64
vocab_size = 1000

# Simulated encoder outputs (normally from an encoder RNN)
encoder_outputs = torch.randn(batch_size, src_len, hidden_dim)

# Padding mask (last 2 positions are padding)
mask = torch.ones(batch_size, src_len)
mask[:, -2:] = 0

# Initialize decoder
decoder = LuongDecoder(
    vocab_size=vocab_size,
    embed_dim=32,
    hidden_dim=hidden_dim,
    attention_method="dot",
)

# Initial hidden state
hidden = torch.zeros(1, batch_size, hidden_dim)

# Input token (e.g., start token)
input_token = torch.zeros(batch_size, dtype=torch.long)

# Run one decoding step
output, new_hidden, attention_weights = decoder.forward_step(
    input_token, hidden, encoder_outputs, mask
)
Out[10]:
Console
Output shape: torch.Size([4, 1000])
Hidden shape: torch.Size([1, 4, 64])
Attention weights shape: torch.Size([4, 10])

Attention weights sum to 1: 1.0000
Attention on padded positions: 0.000000

The output has the expected shape for vocabulary logits, and attention weights sum to 1 (ignoring padded positions). The near-zero attention on padded positions confirms our masking works correctly.

Visualizing Attention Patterns

Let's compare the attention patterns produced by different score functions on the same input.

In[11]:
Code
# Create decoders with different attention methods
methods = ["dot", "general", "concat"]
decoders = {
    method: LuongDecoder(vocab_size, 32, hidden_dim, attention_method=method)
    for method in methods
}

# Run decoding steps and collect attention weights
all_attention = {}
for method, decoder in decoders.items():
    hidden = torch.zeros(1, batch_size, hidden_dim)
    _, _, attn = decoder.forward_step(
        input_token, hidden, encoder_outputs, mask
    )
    all_attention[method] = attn.detach()
Out[12]:
Visualization
Bar chart showing dot product attention weights across source positions.
Dot product attention weights. Parameter-free scoring based on direct vector similarity.
Bar chart showing general attention weights across source positions.
General attention weights. Learned bilinear transformation enables flexible scoring.
Bar chart showing concat attention weights across source positions.
Concat attention weights. Two-layer network captures complex compatibility patterns.

With random weights, the attention patterns are similar across methods. After training, each method would develop distinct patterns based on what it learns about alignment. Dot product attention tends to produce sharper distributions when encoder and decoder representations align well, while concat attention can learn more complex compatibility functions.

To see what trained attention looks like, let's simulate a full translation with realistic attention patterns. The heatmap below shows attention weights across multiple decoding steps, revealing how the decoder systematically aligns with different source positions.

Out[13]:
Visualization
Heatmap showing attention weights with a diagonal pattern indicating word alignment between source and target sentences.
Simulated attention heatmap for translating 'The black cat sat quietly' → 'Le chat noir était assis tranquillement'. Each row shows attention weights when generating one target word. The roughly diagonal pattern reflects word-order correspondence, while deviations (e.g., 'noir' attending to 'black' at position 1) show how attention handles reordering.

Worked Example: English-French Alignment

The formulas we've discussed can feel abstract until you see them in action. Let's trace through a complete attention computation for a concrete translation example, following each step from raw hidden states to final attention weights.

Consider translating the English sentence "The cat sat" to French "Le chat assis". We'll focus on the moment when the decoder is about to generate the second French word "chat". At this point, the decoder needs to figure out which English word to focus on. Intuitively, it should attend to "cat" since that's the word being translated.

Setting Up the Problem

First, let's create simulated encoder hidden states for each English word. In a real model, these would come from running a bidirectional LSTM or transformer encoder over the input. Here, we'll craft vectors that capture the intuition: "cat" and the decoder state for "chat" should be similar, while "The" and "sat" should be less relevant.

In[14]:
Code
# Simulated encoder hidden states for "The cat sat"
np.random.seed(42)
torch.manual_seed(42)

# Create encoder representations that reflect semantic content:
# - "The": article with low semantic content (small magnitude)
# - "cat": noun with distinct pattern (will match decoder query)
# - "sat": verb with different pattern (won't match as well)
encoder_states = torch.tensor(
    [
        [0.1, 0.2, -0.1, 0.3],  # "The" - low magnitude, generic
        [0.8, -0.3, 0.9, 0.2],  # "cat" - strong noun pattern
        [0.2, 0.7, 0.1, -0.4],  # "sat" - different verb pattern
    ],
    dtype=torch.float32,
).unsqueeze(0)  # Add batch dimension for PyTorch

# Decoder hidden state when generating "chat"
# Designed to be similar to "cat" encoding
decoder_state = torch.tensor(
    [
        [0.7, -0.2, 0.8, 0.3]  # Query vector looking for noun-like content
    ],
    dtype=torch.float32,
)
Out[15]:
Console
Encoder states shape: torch.Size([1, 3, 4])
Decoder state shape: torch.Size([1, 4])

The encoder states have shape (1, 3, 4): batch size 1, 3 source words, and 4-dimensional hidden states. The decoder state has shape (1, 4): a single 4-dimensional vector representing what the decoder is "looking for" when generating "chat".

Step 1: Computing Alignment Scores

Now we apply the dot product score function. For each encoder state, we compute its dot product with the decoder state. Remember, the dot product measures how much two vectors point in the same direction.

In[16]:
Code
# Compute dot product scores: decoder_state · each encoder_state
scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze()

# Apply softmax to convert scores to probabilities
attention_weights = F.softmax(scores, dim=0)

# Compute context vector as weighted sum of encoder states
context = torch.bmm(
    attention_weights.unsqueeze(0).unsqueeze(0), encoder_states
).squeeze()

Step 2: Examining the Results

Let's see what scores each word received and how softmax transforms them into attention weights:

Out[17]:
Console
Step-by-step attention computation:
----------------------------------------

Dot product scores (raw alignment):
  The: 0.0400
  cat: 1.4000
  sat: -0.0400

Attention weights (after softmax):
  The: 0.1718
  cat: 0.6695
  sat: 0.1586

Context vector: [0.585, -0.055, 0.601, 0.122]

Highest attention on: 'cat'

The results reveal exactly what we hoped to see. The decoder state for generating "chat" produces the highest alignment score with "cat" because their vectors point in similar directions. The softmax function then amplifies this difference: "cat" receives about 70% of the attention weight, while "The" and "sat" share the remaining 30%.

Notice how the context vector ends up being dominated by the "cat" representation. This is attention in action: the decoder has learned to extract the relevant information from the source sequence by focusing on the semantically corresponding word.

Out[18]:
Visualization
Bar chart showing attention weights with 'cat' receiving the highest weight around 0.7.
Attention weights when generating 'chat' from 'The cat sat'. The model correctly focuses on 'cat' (highlighted in red), the source word that semantically corresponds to the French target. This alignment emerges from the similarity between the decoder state and the 'cat' encoder representation.

The visualization below shows how the context vector is constructed as a weighted blend of the encoder states. Each dimension of the context vector is a weighted sum of that dimension across all encoder states, with the weights determined by attention.

Out[19]:
Visualization
Stacked bar chart showing weighted contributions from each encoder state to context vector dimensions.
Weighted contributions to the context vector. Stacked bars show each encoder state's contribution per dimension, weighted by attention. The 'cat' encoder (red) dominates with 70% attention weight.
Grouped bar chart comparing encoder state values with the resulting context vector.
Context vector compared to original encoder states. The context closely resembles the 'cat' representation since it receives the highest attention weight.

Limitations and Impact

Luong attention addressed several limitations of Bahdanau attention while introducing its own trade-offs. Understanding these helps contextualize attention's evolution toward transformers.

Computational Efficiency

The dot product score function is significantly faster than additive attention, especially for long sequences. With hidden dimension $d$ and sequence length $S$, dot product requires $O(Sd)$ operations compared to $O(Snd)$ for additive attention (where $n$ is the intermediate dimension). This efficiency gain compounds during training when attention is computed millions of times.

However, dot product attention has a subtle numerical issue: when the hidden dimension is large, dot products can become very large in magnitude. This pushes softmax toward extreme values (near 0 or 1), causing gradient saturation. The transformer architecture addresses this with scaled dot-product attention, dividing by $\sqrt{d}$ to keep values in a reasonable range.

Out[20]:
Visualization
Line plot showing score standard deviation growing with hidden dimension for unscaled vs scaled attention.
As hidden dimension increases, dot product magnitudes grow proportionally (for unit-variance random vectors, std ≈ √d). Scaling by 1/√d keeps values stable around 1.
Grouped bar chart showing attention weight distributions becoming sharper as score magnitude increases.
Large scores cause softmax to produce near-binary distributions, leading to gradient saturation. Higher scale values concentrate all attention on the highest-scoring position.
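A quick numerical check of this effect, using random unit-variance vectors (not tied to any trained model):

import torch

torch.manual_seed(0)
for d in [16, 64, 256, 1024]:
    q = torch.randn(10000, d)
    k = torch.randn(10000, d)
    raw = (q * k).sum(dim=1)          # unscaled dot products
    scaled = raw / d ** 0.5           # transformer-style scaling
    print(d, raw.std().item(), scaled.std().item())
# the unscaled standard deviation grows roughly as sqrt(d); the scaled one stays near 1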

Expressiveness vs Simplicity

General and concat attention are more expressive than dot product, able to learn arbitrary compatibility functions. But this expressiveness comes at a cost: more parameters, slower computation, and potential overfitting on small datasets. Empirically, the simpler dot product often performs comparably or better, suggesting that the encoder and decoder learn representations where raw similarity is a good proxy for alignment.

This finding influenced the design of transformers, which use dot product attention exclusively. The lesson is that with sufficient model capacity and data, simple mechanisms can match or exceed complex ones.

Sequential Bottleneck

Both Bahdanau and Luong attention still rely on RNNs for the encoder and decoder. This creates a sequential bottleneck: each timestep must wait for the previous one, preventing parallelization. Attention helps by providing direct connections to encoder states, but the fundamental limitation remains.

Transformers eliminate this bottleneck entirely by using self-attention instead of recurrence. Each position can attend to all other positions in parallel, enabling massive speedups on modern hardware. Luong's dot product attention, combined with the key-query-value formulation, became the foundation for this revolution.

Legacy and Influence

Luong attention's most lasting contribution is demonstrating that simple, efficient attention mechanisms work well. The dot product score function, attention after the decoder step, and the idea of multiple attention variants all influenced subsequent research. When Vaswani et al. designed the transformer, they chose scaled dot-product attention, directly building on Luong's work.

The distinction between global and local attention also foreshadowed later work on sparse attention patterns. Transformers face quadratic complexity in sequence length, and researchers have explored various ways to restrict attention to local windows or learned patterns. Luong's local attention was an early exploration of this trade-off between expressiveness and efficiency.

Summary

This chapter explored Luong attention, a family of attention mechanisms that simplified and extended Bahdanau's original formulation.

The key innovations include three score functions for computing alignment:

  • Dot product: Parameter-free, efficient, requires matching dimensions
  • General: Learned bilinear transformation, handles different dimensions
  • Concat: Two-layer network with nonlinearity, most expressive

Luong attention also introduced the distinction between global and local attention. Global attention considers all encoder positions, while local attention focuses on a predicted window. Global attention dominates in practice due to its flexibility and the efficiency of modern hardware.

The architectural placement differs from Bahdanau: Luong computes attention after the RNN step using the current hidden state, then combines the context with the hidden state through a learned transformation. This "attention as output" approach is simpler and slightly more parallelizable.

Perhaps most importantly, Luong's work demonstrated that simple attention mechanisms, particularly dot product attention, can match or exceed more complex alternatives. This insight directly influenced the transformer architecture, which uses scaled dot-product attention as its core mechanism. Understanding Luong attention provides essential context for the self-attention mechanisms we'll explore in the next part of this book.

Key Parameters

When implementing Luong attention, several parameters significantly impact model behavior and performance:

  • Attention method (method): Chooses between "dot", "general", or "concat" score functions. Dot product is fastest and parameter-free but requires matching encoder/decoder dimensions. General attention adds a learned projection matrix, enabling different dimensions and learned similarity. Concat attention is most expressive but slowest. Start with dot product for most applications.

  • Hidden dimension (hidden_dim): The dimensionality of the decoder hidden state. Larger values (256-512) provide more representational capacity but increase computation. For attention, this determines the space in which similarity is computed. Values of 256-512 work well for most sequence-to-sequence tasks.

  • Encoder dimension (encoder_dim): The dimensionality of encoder hidden states. Can differ from hidden_dim when using general or concat attention. For bidirectional encoders, this is typically 2 * hidden_dim since forward and backward states are concatenated (see the sketch after this list).

  • Local attention window (D): For local attention, controls the half-width of the attention window. Positions outside $[p_t - D, p_t + D]$ receive zero attention. Larger windows (D=10-20) provide more flexibility but increase computation. Smaller windows (D=2-5) work well when alignments are expected to be monotonic.

  • Dropout (dropout): Applied to embeddings and the attentional hidden state before output projection. Values of 0.1-0.3 help prevent overfitting, especially important for attention weights which can become overly peaked during training.
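For instance, a minimal sketch of wiring a bidirectional encoder to the LuongAttention module defined above; the dimensions are illustrative:

import torch
import torch.nn as nn

hidden_dim = 64
encoder = nn.GRU(32, hidden_dim, batch_first=True, bidirectional=True)

src = torch.randn(4, 10, 32)                        # (batch, src_len, embed_dim)
encoder_outputs, _ = encoder(src)                   # (4, 10, 2 * hidden_dim)

# encoder_dim differs from hidden_dim, so use "general" (or "concat") scoring
attention = LuongAttention(hidden_dim, encoder_dim=2 * hidden_dim, method="general")
decoder_hidden = torch.randn(4, hidden_dim)
context, weights = attention(decoder_hidden, encoder_outputs)
print(context.shape, weights.shape)                 # (4, 128) and (4, 10)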



About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
