Luong Attention: Dot Product, General & Local Attention Mechanisms

Michael Brenndoerfer · December 16, 2025 · 42 min read

Master Luong attention variants including dot product, general, and concat scoring. Compare global vs local attention and understand attention placement in seq2seq models.

Luong Attention

In the previous chapter, we explored Bahdanau attention, which revolutionized sequence-to-sequence models by allowing the decoder to dynamically focus on different parts of the input sequence. Bahdanau's approach uses an additive score function and computes attention before the RNN step, feeding the context vector as input to the decoder. But is this the only way to design an attention mechanism?

In 2015, Luong et al. proposed several alternative attention mechanisms that are computationally simpler and often more effective. Their work introduced multiplicative attention variants, the distinction between global and local attention, and a different placement of attention in the decoder architecture. Understanding these variations is essential because the scaled dot-product attention used in modern transformers descends directly from Luong's multiplicative formulation.

Attention Score Functions

At the heart of any attention mechanism lies a fundamental question: how do we measure the relevance of each encoder state to the current decoding step? This measurement, called the alignment score or compatibility score, determines which parts of the input sequence the decoder should focus on when generating each output token.

Think of it this way: when translating "The cat sat on the mat" to French, and you're about to generate the word "chat" (cat), you need a way to identify that the English word "cat" is the most relevant source word. The score function quantifies this relevance for every source position, producing a set of numbers that attention will convert into a probability distribution.

Bahdanau attention introduced an additive score function with learned parameters. Luong et al. asked: can we achieve similar results with simpler approaches? Their answer was three alternative score functions, each representing a different trade-off between simplicity, expressiveness, and computational cost.

Dot Product Attention

The simplest approach to measuring relevance is to ask: how similar are the decoder and encoder representations? If the encoder has learned to represent "cat" in a particular way, and the decoder has learned to look for that same pattern when generating "chat", then similar representations should yield high scores.

The dot product provides exactly this similarity measure. Given a decoder hidden state $h_t$ and an encoder hidden state $\bar{h}_s$, we compute:

$$\text{score}(h_t, \bar{h}_s) = h_t^\top \bar{h}_s$$

where:

  • $h_t \in \mathbb{R}^{d}$: the decoder hidden state at timestep $t$, encoding what the decoder is "looking for"
  • $\bar{h}_s \in \mathbb{R}^{d}$: the encoder hidden state at source position $s$, encoding what that position "contains"
  • $h_t^\top \bar{h}_s$: the inner product of the two vectors, yielding a scalar score

Why does the dot product capture similarity? Consider what happens geometrically. When two vectors point in the same direction, their dot product is large and positive, indicating high compatibility. When they point in opposite directions, the dot product is large and negative. Orthogonal vectors, sharing no common direction, yield zero.

This geometric interpretation has a powerful implication: the encoder and decoder can learn complementary representations. The encoder learns to place semantically similar words in similar directions in the hidden space. The decoder learns to "query" for specific semantic content by producing hidden states that point toward the relevant encoder representations.

The elegance of dot product attention lies in its simplicity. No learned parameters are needed for the score function itself. All the learning happens in the encoder and decoder, which discover representations where raw similarity is a good proxy for alignment. This makes dot product attention computationally efficient and easy to implement.

However, this simplicity comes with a constraint: the decoder and encoder hidden states must have the same dimensionality. If $h_t \in \mathbb{R}^{d_h}$ and $\bar{h}_s \in \mathbb{R}^{d_h}$, the dot product is well-defined. If they have different dimensions, you cannot compute a dot product directly, which motivates our next score function.
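To make the computation concrete, here is a minimal sketch of dot-product scoring over a toy source sentence. The vectors are made-up values, not the output of any trained model:

import torch

# Hypothetical 4-dimensional hidden states
h_t = torch.tensor([0.7, -0.2, 0.8, 0.3])   # decoder state: what the decoder is "looking for"
encoder_states = torch.tensor([
    [0.1, 0.2, -0.1, 0.3],                  # source position 1
    [0.8, -0.3, 0.9, 0.2],                  # source position 2 (points the same way as h_t)
    [0.2, 0.7, 0.1, -0.4],                  # source position 3
])

# One dot product per source position: (S, d) @ (d,) -> (S,)
scores = encoder_states @ h_t
print(scores)  # the second position receives the highest score

The same vectors reappear in the worked example later in this chapter, where the scores are pushed through softmax to produce attention weights.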

General (Bilinear) Attention

What if the encoder and decoder have different hidden dimensions? Or what if raw similarity isn't the right measure, and we want the model to learn a more nuanced notion of relevance?

The general score function addresses both concerns by introducing a learnable weight matrix $W_a$ that transforms the encoder states before computing similarity:

$$\text{score}(h_t, \bar{h}_s) = h_t^\top W_a \bar{h}_s$$

where:

  • $h_t \in \mathbb{R}^{d_h}$: the decoder hidden state at timestep $t$
  • $\bar{h}_s \in \mathbb{R}^{d_{\bar{h}}}$: the encoder hidden state at source position $s$
  • $W_a \in \mathbb{R}^{d_h \times d_{\bar{h}}}$: a learned weight matrix that projects encoder states into the decoder's space
  • $h_t^\top W_a \bar{h}_s$: a scalar score computed via matrix-vector multiplication

To understand what $W_a$ does, let's trace through the computation step by step. First, $W_a \bar{h}_s$ transforms the encoder state into a $d_h$-dimensional vector. Then $h_t^\top$ takes the dot product with this transformed vector. The matrix $W_a$ effectively learns which aspects of the encoder representation are most relevant for each dimension of the decoder state.

This formulation is called bilinear attention because the score is a bilinear function of the two input vectors: linear in $h_t$ when $\bar{h}_s$ is fixed, and linear in $\bar{h}_s$ when $h_t$ is fixed. The weight matrix $W_a$ serves two purposes:

  • Dimension matching: When encoder and decoder have different hidden dimensions, $W_a$ bridges the gap by projecting encoder states into the decoder's space
  • Learned similarity: Rather than relying on raw vector similarity, the matrix learns what aspects of encoder states are most relevant for alignment, potentially discovering non-obvious relationships

General attention is more expressive than dot product attention but requires additional parameters. For encoder and decoder dimensions of 512, $W_a$ adds $512 \times 512 = 262{,}144$ parameters. This is modest compared to the total model size but can matter for smaller models or when memory is constrained.
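As a rough sketch, the bilinear score can be computed by projecting the encoder states with $W_a$ and then taking dot products. The dimensions and random weights below are purely illustrative:

import torch

d_h, d_enc = 4, 6                        # decoder and encoder dimensions can differ
W_a = torch.randn(d_h, d_enc) * 0.1      # learned in a real model; random stand-in here

h_t = torch.randn(d_h)                   # decoder hidden state
encoder_states = torch.randn(5, d_enc)   # five encoder hidden states

# General score: h_t^T W_a h_bar_s for every source position
projected = encoder_states @ W_a.T       # (5, d_h): encoder states mapped into the decoder's space
scores = projected @ h_t                 # (5,): one scalar score per source position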

Concat Attention

The most expressive score function takes a fundamentally different approach. Rather than computing a form of similarity between two vectors, it asks: given both the decoder state and an encoder state, what score should a neural network assign?

The concat score function concatenates the two states and passes them through a small neural network:

$$\text{score}(h_t, \bar{h}_s) = v_a^\top \tanh(W_a [h_t; \bar{h}_s])$$

Let's unpack this formula piece by piece:

  1. Concatenation $[h_t; \bar{h}_s]$: We stack the decoder and encoder states into a single vector of dimension $d_h + d_{\bar{h}}$. This gives the network access to all information from both states.

  2. Linear projection $W_a [h_t; \bar{h}_s]$: The weight matrix $W_a \in \mathbb{R}^{n \times (d_h + d_{\bar{h}})}$ projects the concatenated vector to an intermediate space of dimension $n$. This allows the network to learn arbitrary linear combinations of the input features.

  3. Nonlinearity $\tanh(\cdot)$: The hyperbolic tangent activation introduces nonlinearity, enabling the network to learn complex, non-linear relationships between decoder and encoder states.

  4. Scalar projection $v_a^\top$: Finally, the weight vector $v_a \in \mathbb{R}^n$ projects the intermediate representation to a single scalar score.

where:

  • $h_t \in \mathbb{R}^{d_h}$: the decoder hidden state at timestep $t$
  • $\bar{h}_s \in \mathbb{R}^{d_{\bar{h}}}$: the encoder hidden state at source position $s$
  • $n$: the intermediate dimension, a hyperparameter typically set equal to the hidden dimension

This formulation is similar to Bahdanau attention, with one key difference: Bahdanau uses the previous decoder state $h_{t-1}$, while Luong's concat uses the current decoder state $h_t$. This seemingly small change has significant implications for the overall architecture, which we'll explore in the section on attention placement.
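A minimal sketch of the concat score, with illustrative dimensions and random stand-ins for the learned parameters $W_a$ and $v_a$:

import torch

d_h, d_enc, n = 4, 6, 8                             # decoder, encoder, and intermediate dimensions
W_a = torch.randn(n, d_h + d_enc) * 0.1             # stand-in for the learned projection
v_a = torch.randn(n) * 0.1                          # stand-in for the learned score vector

h_t = torch.randn(d_h)
encoder_states = torch.randn(5, d_enc)              # five encoder hidden states

# Pair the decoder state with each encoder state, then score with the small two-layer network
h_t_repeated = h_t.expand(5, d_h)                   # repeat h_t once per source position
concat = torch.cat([h_t_repeated, encoder_states], dim=1)   # (5, d_h + d_enc)
scores = torch.tanh(concat @ W_a.T) @ v_a           # (5,): one scalar score per source position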

Out[3]:
Visualization
Three diagrams showing dot product, general, and concat attention score computation flows.
Comparison of three attention score functions. Dot product is parameter-free and requires matching dimensions. General attention uses a learned matrix to compute bilinear similarity. Concat attention uses a two-layer network with nonlinearity.

Computational Comparison

The three score functions differ significantly in their computational requirements. Let's analyze them for a sequence of length $S$ with hidden dimension $d$:

Computational comparison of Luong attention score functions. Dot product is most efficient but requires matching dimensions.

| Score Function | Time Complexity | Parameters | Notes |
| --- | --- | --- | --- |
| Dot product | $O(Sd)$ | 0 | Fastest, requires $d_h = d_{\bar{h}}$ |
| General | $O(Sd^2)$ | $d_h \times d_{\bar{h}}$ | Matrix multiplication dominates |
| Concat | $O(Snd)$ | $n(d_h + d_{\bar{h}}) + n$ | Two-step transformation |
Out[4]:
Visualization
Line plot showing computational cost scaling with sequence length for three attention methods.
FLOPs per attention computation at different sequence lengths (hidden dim=256). Dot product scales linearly while general and concat scale quadratically with hidden dimension.
Bar chart comparing parameter counts across hidden dimensions for three attention methods.
Parameter count by hidden dimension. Dot product requires no parameters, making it the most memory-efficient choice.

For modern hardware with efficient matrix operations, dot product attention is significantly faster because it can be fully parallelized as a single matrix multiplication across all encoder states simultaneously. This efficiency advantage is why transformers adopt scaled dot-product attention as their core mechanism.
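As a sketch of why this parallelizes so well, all scores for a batch can be produced by a single batched matrix multiplication (the shapes here are hypothetical):

import torch

B, S, d = 32, 50, 256                       # batch size, source length, hidden dim
encoder_outputs = torch.randn(B, S, d)      # all encoder states
decoder_hidden = torch.randn(B, d)          # current decoder state for each sequence

# All S dot-product scores per sequence in one bmm: (B, S, d) x (B, d, 1) -> (B, S)
scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)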

Global vs Local Attention

Beyond score functions, Luong et al. introduced an important architectural distinction: global attention attends to all encoder states, while local attention focuses on a subset of positions around a predicted alignment point.

Global Attention

Global attention is conceptually straightforward. At each decoder timestep $t$, we compute attention weights over all encoder hidden states by applying softmax to the alignment scores:

$$\alpha_{ts} = \frac{\exp(\text{score}(h_t, \bar{h}_s))}{\sum_{s'=1}^{S} \exp(\text{score}(h_t, \bar{h}_{s'}))}$$

where:

  • $\alpha_{ts}$: the attention weight for source position $s$ when decoding at timestep $t$
  • $\text{score}(h_t, \bar{h}_s)$: the alignment score computed using any of the three score functions
  • $S$: the total number of encoder positions (source sequence length)
  • $\exp(\cdot)$: the exponential function, ensuring all values are positive
  • The denominator normalizes the weights so they sum to 1 across all source positions

The context vector is then computed as the weighted sum of all encoder states:

$$c_t = \sum_{s=1}^{S} \alpha_{ts} \bar{h}_s$$

where:

  • $c_t \in \mathbb{R}^{d_{\bar{h}}}$: the context vector at decoder timestep $t$
  • $\alpha_{ts}$: the attention weight for position $s$ (how much to focus on that position)
  • $\bar{h}_s$: the encoder hidden state at position $s$

This weighted sum allows the decoder to "focus" on relevant source positions: positions with high attention weights contribute more to the context vector, while positions with low weights contribute less.
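In code, the two formulas amount to a softmax followed by a weighted sum. A minimal sketch with made-up dimensions:

import torch
import torch.nn.functional as F

S, d = 6, 8
encoder_states = torch.randn(S, d)      # \bar{h}_1 ... \bar{h}_S
h_t = torch.randn(d)                    # current decoder hidden state

scores = encoder_states @ h_t           # alignment scores, shape (S,)
alpha = F.softmax(scores, dim=0)        # attention weights, non-negative and summing to 1
c_t = alpha @ encoder_states            # context vector: weighted sum of encoder states, shape (d,)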

Global attention is what we typically mean when we say "attention" without qualification. It allows the model to attend to any position in the source sequence, which is essential for tasks like translation where word order can differ dramatically between languages.

The computational cost of global attention is $O(S)$ per decoder step, where $S$ is the source sequence length. For most NLP tasks with sequences of hundreds or thousands of tokens, this is manageable. However, for very long sequences (documents, books, genomic data), global attention becomes a bottleneck.

Local Attention

Local attention restricts the attention window to a subset of encoder positions centered around an alignment point $p_t$. This reduces computation and can also serve as an inductive bias when alignments are expected to be roughly monotonic.

Out[5]:
Visualization
Heatmap showing attention weights across all encoder positions for global attention.
Global attention computes weights over all encoder positions. Every source token can influence every target token, providing maximum flexibility but requiring O(S) computation per step.
Heatmap showing attention weights concentrated in a diagonal band for local attention.
Local attention focuses on a window of size 2D+1 centered at position p_t. Positions outside the window receive zero attention weight, reducing computation to O(D).

Luong et al. propose two variants for determining the alignment point $p_t$:

Monotonic alignment (local-m): The alignment point is simply set to $p_t = t$, assuming the source and target sequences are roughly aligned. This works well for tasks like speech recognition where the output follows the input order.

Predictive alignment (local-p): The model learns to predict the alignment point using a small neural network:

$$p_t = S \cdot \sigma(v_p^\top \tanh(W_p h_t))$$

where:

  • $p_t$: the predicted alignment point (center of the attention window) at decoder timestep $t$
  • $S$: the source sequence length
  • $\sigma(\cdot)$: the sigmoid function, which outputs a value in $[0, 1]$
  • $W_p$: a learned weight matrix that transforms the decoder hidden state
  • $v_p$: a learned weight vector that projects to a scalar
  • $h_t$: the current decoder hidden state

The sigmoid ensures the output is between 0 and 1, and multiplying by $S$ scales it to a valid source position. This allows the model to learn non-monotonic alignments while still restricting attention to a local window.
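A minimal sketch of the local-p prediction, with random stand-ins for the learned parameters $W_p$ and $v_p$:

import torch

S, d = 20, 8                          # source length and hidden dimension (hypothetical)
W_p = torch.randn(d, d) * 0.1         # stand-in for the learned matrix
v_p = torch.randn(d) * 0.1            # stand-in for the learned vector
h_t = torch.randn(d)                  # current decoder hidden state

# Predicted window center: a real-valued position in [0, S]
p_t = S * torch.sigmoid(v_p @ torch.tanh(W_p @ h_t))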

Within the window $[p_t - D, p_t + D]$, attention weights are computed normally but multiplied by a Gaussian centered at $p_t$:

$$\alpha_{ts} = \text{align}(h_t, \bar{h}_s) \cdot \exp\left(-\frac{(s - p_t)^2}{2\sigma^2}\right)$$

where:

  • $\alpha_{ts}$: the final attention weight for position $s$ at decoder timestep $t$
  • $\text{align}(h_t, \bar{h}_s)$: the base attention weight from softmax over the local window
  • $s$: the source position being attended to
  • $p_t$: the predicted center of the attention window
  • $\sigma$: the standard deviation of the Gaussian (controls window sharpness), typically set to $D/2$
  • The exponential term is a Gaussian that peaks at s=pts = p_t and decays for positions farther from the center

The Gaussian favors positions near the center of the window, providing a soft boundary rather than a hard cutoff. Positions at the edge of the window receive lower weights even if their alignment scores are high.
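A sketch of the full local-attention weighting, assuming a predicted center and illustrative scores:

import torch
import torch.nn.functional as F

S, D = 20, 4                                   # source length and window half-width (hypothetical)
p_t = torch.tensor(11.3)                       # predicted window center (e.g. from local-p)
sigma = D / 2

scores = torch.randn(S)                        # alignment scores, random here for illustration
positions = torch.arange(S, dtype=torch.float)

# Zero out positions outside [p_t - D, p_t + D], then softmax within the window
in_window = (positions - p_t).abs() <= D
align = F.softmax(scores.masked_fill(~in_window, float("-inf")), dim=0)

# The Gaussian factor favors positions near the predicted center
alpha = align * torch.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))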

When to Use Each

The choice between global and local attention depends on your task:

  • Global attention is the default choice for most NLP tasks. Translation, summarization, and question answering all benefit from the ability to attend to any position. The computational overhead is acceptable for typical sequence lengths.

  • Local attention shines when you have prior knowledge about alignment structure. Speech recognition, where phonemes appear in order, is a natural fit. Document-level tasks with very long sequences can also benefit from the reduced computation.

In practice, global attention dominates because modern hardware handles the $O(S)$ computation efficiently, and the flexibility to attend anywhere is valuable. Local attention is more of a historical curiosity, though its ideas influenced later work on sparse attention patterns in transformers.

Attention Placement: Input vs Output

A subtle but important difference between Bahdanau and Luong attention is where the attention mechanism fits into the decoder architecture. This choice affects both the information flow and the computational graph.

Bahdanau: Attention as Input

In Bahdanau attention, the context vector is computed using the previous decoder hidden state and then concatenated with the input embedding to form the input to the current decoder step. The computation proceeds in three stages:

$$\begin{aligned} c_t &= \text{Attention}(h_{t-1}, \bar{H}) \\ \tilde{x}_t &= [x_t; c_t] \\ h_t &= \text{RNN}(\tilde{x}_t, h_{t-1}) \end{aligned}$$

where:

  • $h_{t-1}$: the decoder hidden state from the previous timestep
  • $\bar{H} = [\bar{h}_1, \bar{h}_2, \ldots, \bar{h}_S]$: the matrix of all encoder hidden states
  • $c_t$: the context vector computed by attending over encoder states
  • $x_t$: the input embedding at timestep $t$ (typically the previous output token)
  • $\tilde{x}_t$: the augmented input formed by concatenating $x_t$ and $c_t$
  • $h_t$: the new decoder hidden state

The context vector influences the RNN computation directly. This means the decoder can use information about which source positions are relevant when updating its hidden state.

Luong: Attention as Output

In Luong attention, the decoder RNN runs first, producing the current hidden state. Then attention is computed using this new state, and the results are combined. The computation proceeds in three stages:

$$\begin{aligned} h_t &= \text{RNN}(x_t, h_{t-1}) \\ c_t &= \text{Attention}(h_t, \bar{H}) \\ \tilde{h}_t &= \tanh(W_c [c_t; h_t]) \end{aligned}$$

where:

  • $x_t$: the input embedding at timestep $t$
  • $h_{t-1}$: the decoder hidden state from the previous timestep
  • $h_t$: the new decoder hidden state after the RNN step
  • $\bar{H} = [\bar{h}_1, \bar{h}_2, \ldots, \bar{h}_S]$: the matrix of all encoder hidden states
  • $c_t$: the context vector computed by attending over encoder states using the current $h_t$
  • $[c_t; h_t]$: the concatenation of context and hidden state
  • $W_c$: a learned weight matrix that combines context and hidden state
  • $\tilde{h}_t$: the "attentional hidden state" used for prediction

The final output $\tilde{h}_t$ combines the context vector with the decoder state through a learned transformation. This attentional state is then used for prediction, typically by projecting it to vocabulary size and applying softmax.
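Stripped of module plumbing, one Luong decoding step follows the three stages above. This is only a sketch: the RNN cell is replaced by a placeholder update and the weight matrix is a random stand-in (a complete module appears in the Implementation section):

import torch

d, S = 8, 6                                   # hidden dim and source length (hypothetical)
W_c = torch.randn(d, 2 * d) * 0.1             # stand-in for the learned combination layer
encoder_states = torch.randn(S, d)
x_t = torch.randn(d)                          # embedded input token
h_prev = torch.randn(d)                       # previous decoder state

h_t = torch.tanh(x_t + h_prev)                # 1. RNN step first (placeholder for a GRU/LSTM cell)
alpha = torch.softmax(encoder_states @ h_t, dim=0)    # 2. attention with the *current* state
c_t = alpha @ encoder_states
h_tilde = torch.tanh(W_c @ torch.cat([c_t, h_t]))     # 3. attentional hidden state for prediction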

Out[6]:
Visualization
Flow diagram showing Bahdanau attention with context computed before RNN step.
Bahdanau attention computes context using the previous hidden state, then feeds it as input to the RNN. The context influences the state update directly.
Flow diagram showing Luong attention with context computed after RNN step.
Luong attention runs the RNN first, then computes context using the current hidden state. The context is combined with the state for output.

Implications of Placement

The placement choice has several practical implications:

Information flow: Bahdanau attention allows the context to influence the RNN state update, potentially enabling richer interactions. Luong attention keeps the RNN computation separate, which can be easier to reason about and debug.

Parallelization: Luong attention is slightly more amenable to parallelization during training because the RNN step doesn't depend on the attention computation. However, this advantage is minimal compared to the sequential nature of RNNs themselves.

Empirical performance: Luong et al. found their approach performed comparably or slightly better than Bahdanau attention on machine translation benchmarks. The simpler architecture and faster score functions (especially dot product) made it an attractive choice.

Luong vs Bahdanau: A Complete Comparison

Let's consolidate the differences between these two influential attention mechanisms:

Comparison of Bahdanau and Luong attention mechanisms. The key differences are score function, decoder state timing, and attention placement.
| Aspect | Bahdanau | Luong |
| --- | --- | --- |
| Score function | Additive (concat with tanh) | Dot, general, or concat |
| Decoder state used | Previous ($h_{t-1}$) | Current ($h_t$) |
| Attention placement | Before RNN (input) | After RNN (output) |
| Context integration | Concatenated with input | Combined via learned layer |
| Encoder architecture | Bidirectional RNN | Unidirectional (stacked) |
| Parameters | More (additive scoring) | Fewer (dot product option) |

Both mechanisms compute attention weights via softmax and produce a context vector as a weighted sum of encoder states. The differences lie in the details of scoring, timing, and integration.

In practice, the choice between Bahdanau and Luong attention often matters less than other architectural decisions like model size, number of layers, and training procedure. Modern transformer architectures have largely superseded both, but understanding these mechanisms provides essential intuition for how attention works.

Implementation

Let's implement Luong attention in PyTorch. We'll build a complete attention module that supports all three score functions, then integrate it into a sequence-to-sequence decoder.

Attention Module

First, we define the core attention computation. The module takes encoder outputs and a decoder hidden state, computes attention weights, and returns the context vector.

In[7]:
Code
import torch
import torch.nn as nn
import torch.nn.functional as F


class LuongAttention(nn.Module):
    """Luong attention with configurable score function."""

    def __init__(self, hidden_dim, encoder_dim=None, method="dot"):
        super().__init__()
        self.method = method
        self.hidden_dim = hidden_dim
        self.encoder_dim = encoder_dim or hidden_dim

        if method == "general":
            # W_a for bilinear scoring
            self.W_a = nn.Linear(self.encoder_dim, hidden_dim, bias=False)
        elif method == "concat":
            # Two-layer scoring network
            self.W_a = nn.Linear(
                hidden_dim + self.encoder_dim, hidden_dim, bias=False
            )
            self.v_a = nn.Linear(hidden_dim, 1, bias=False)

    def score(self, decoder_hidden, encoder_outputs):
        """Compute attention scores for all encoder positions."""
        if self.method == "dot":
            scores = torch.bmm(
                encoder_outputs, decoder_hidden.unsqueeze(2)
            ).squeeze(2)
        elif self.method == "general":
            transformed = self.W_a(encoder_outputs)
            scores = torch.bmm(
                transformed, decoder_hidden.unsqueeze(2)
            ).squeeze(2)
        elif self.method == "concat":
            src_len = encoder_outputs.size(1)
            decoder_expanded = decoder_hidden.unsqueeze(1).expand(
                -1, src_len, -1
            )
            concat = torch.cat([decoder_expanded, encoder_outputs], dim=2)
            scores = self.v_a(torch.tanh(self.W_a(concat))).squeeze(2)
        return scores

    def forward(self, decoder_hidden, encoder_outputs, mask=None):
        """Compute attention weights and context vector."""
        scores = self.score(decoder_hidden, encoder_outputs)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attention_weights = F.softmax(scores, dim=1)
        context = torch.bmm(
            attention_weights.unsqueeze(1), encoder_outputs
        ).squeeze(1)
        return context, attention_weights

The LuongAttention class implements all three score functions. The constructor accepts the hidden dimension, an optional encoder dimension, and the scoring method. The score method computes alignment scores using dot product, general (bilinear), or concat approaches. The forward method orchestrates the full attention computation: scores, optional masking, softmax normalization, and weighted sum.

Attentional Decoder

Now let's build a decoder that uses Luong attention. The key difference from a standard RNN decoder is the attention step after the RNN and the combination layer that produces the attentional hidden state.

In[8]:
Code
class LuongDecoder(nn.Module):
    """Decoder with Luong attention."""

    def __init__(
        self,
        vocab_size,
        embed_dim,
        hidden_dim,
        encoder_dim=None,
        attention_method="dot",
        num_layers=1,
        dropout=0.1,
    ):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.encoder_dim = encoder_dim or hidden_dim

        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(
            embed_dim,
            hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
        )
        self.attention = LuongAttention(
            hidden_dim, self.encoder_dim, attention_method
        )
        self.W_c = nn.Linear(
            self.encoder_dim + hidden_dim, hidden_dim, bias=False
        )
        self.output_projection = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward_step(self, input_token, hidden, encoder_outputs, mask=None):
        """Single decoding step with attention."""
        embedded = self.dropout(self.embedding(input_token.unsqueeze(1)))
        rnn_output, hidden = self.rnn(embedded, hidden)
        rnn_output = rnn_output.squeeze(1)
        context, attention_weights = self.attention(
            hidden[-1], encoder_outputs, mask
        )
        combined = torch.cat([context, rnn_output], dim=1)
        attentional_hidden = torch.tanh(self.W_c(combined))
        output = self.output_projection(self.dropout(attentional_hidden))
        return output, hidden, attention_weights

The LuongDecoder class combines an embedding layer, a GRU, the attention module, a combination layer for producing the attentional hidden state, and an output projection. The forward_step method implements a single decoding step: embed, RNN, then attention. This order is the defining characteristic of Luong attention.

Testing the Implementation

Let's verify our implementation works correctly with some example data.

In[9]:
Code
# Create sample data
batch_size = 4
src_len = 10
hidden_dim = 64
vocab_size = 1000

# Simulated encoder outputs (normally from an encoder RNN)
encoder_outputs = torch.randn(batch_size, src_len, hidden_dim)

# Padding mask (last 2 positions are padding)
mask = torch.ones(batch_size, src_len)
mask[:, -2:] = 0

# Initialize decoder
decoder = LuongDecoder(
    vocab_size=vocab_size,
    embed_dim=32,
    hidden_dim=hidden_dim,
    attention_method="dot",
)

# Initial hidden state
hidden = torch.zeros(1, batch_size, hidden_dim)

# Input token (e.g., start token)
input_token = torch.zeros(batch_size, dtype=torch.long)

# Run one decoding step
output, new_hidden, attention_weights = decoder.forward_step(
    input_token, hidden, encoder_outputs, mask
)
Out[10]:
Console
Output shape: torch.Size([4, 1000])
Hidden shape: torch.Size([1, 4, 64])
Attention weights shape: torch.Size([4, 10])

Attention weights sum to 1: 1.0000
Attention on padded positions: 0.000000

The output has the expected shape for vocabulary logits, and attention weights sum to 1 (ignoring padded positions). The near-zero attention on padded positions confirms our masking works correctly.

Visualizing Attention Patterns

Let's compare the attention patterns produced by different score functions on the same input.

In[11]:
Code
# Create decoders with different attention methods
methods = ["dot", "general", "concat"]
decoders = {
    method: LuongDecoder(vocab_size, 32, hidden_dim, attention_method=method)
    for method in methods
}

# Run decoding steps and collect attention weights
all_attention = {}
for method, decoder in decoders.items():
    hidden = torch.zeros(1, batch_size, hidden_dim)
    _, _, attn = decoder.forward_step(
        input_token, hidden, encoder_outputs, mask
    )
    all_attention[method] = attn.detach()
Out[12]:
Visualization
Bar chart showing dot product attention weights across source positions.
Dot product attention weights. Parameter-free scoring based on direct vector similarity.
Bar chart showing general attention weights across source positions.
General attention weights. Learned bilinear transformation enables flexible scoring.
Bar chart showing concat attention weights across source positions.
Concat attention weights. Two-layer network captures complex compatibility patterns.

With random weights, the attention patterns are similar across methods. After training, each method would develop distinct patterns based on what it learns about alignment. Dot product attention tends to produce sharper distributions when encoder and decoder representations align well, while concat attention can learn more complex compatibility functions.

To see what trained attention looks like, let's simulate a full translation with realistic attention patterns. The heatmap below shows attention weights across multiple decoding steps, revealing how the decoder systematically aligns with different source positions.

Out[13]:
Visualization
Heatmap showing attention weights with a diagonal pattern indicating word alignment between source and target sentences.
Simulated attention heatmap for translating 'The black cat sat quietly' → 'Le chat noir était assis tranquillement'. Each row shows attention weights when generating one target word. The roughly diagonal pattern reflects word-order correspondence, while deviations (e.g., 'noir' attending to 'black' at position 1) show how attention handles reordering.

Worked Example: English-French Alignment

The formulas we've discussed can feel abstract until you see them in action. Let's trace through a complete attention computation for a concrete translation example, following each step from raw hidden states to final attention weights.

Consider translating the English sentence "The cat sat" to French "Le chat assis". We'll focus on the moment when the decoder is about to generate the second French word "chat". At this point, the decoder needs to figure out which English word to focus on. Intuitively, it should attend to "cat" since that's the word being translated.

Setting Up the Problem

First, let's create simulated encoder hidden states for each English word. In a real model, these would come from running a bidirectional LSTM or transformer encoder over the input. Here, we'll craft vectors that capture the intuition: "cat" and the decoder state for "chat" should be similar, while "The" and "sat" should be less relevant.

In[14]:
Code
# Simulated encoder hidden states for "The cat sat"
np.random.seed(42)
torch.manual_seed(42)

# Create encoder representations that reflect semantic content:
# - "The": article with low semantic content (small magnitude)
# - "cat": noun with distinct pattern (will match decoder query)
# - "sat": verb with different pattern (won't match as well)
encoder_states = torch.tensor(
    [
        [0.1, 0.2, -0.1, 0.3],  # "The" - low magnitude, generic
        [0.8, -0.3, 0.9, 0.2],  # "cat" - strong noun pattern
        [0.2, 0.7, 0.1, -0.4],  # "sat" - different verb pattern
    ],
    dtype=torch.float32,
).unsqueeze(0)  # Add batch dimension for PyTorch

# Decoder hidden state when generating "chat"
# Designed to be similar to "cat" encoding
decoder_state = torch.tensor(
    [
        [0.7, -0.2, 0.8, 0.3]  # Query vector looking for noun-like content
    ],
    dtype=torch.float32,
)
Out[15]:
Console
Encoder states shape: torch.Size([1, 3, 4])
Decoder state shape: torch.Size([1, 4])

The encoder states have shape (1, 3, 4): batch size 1, 3 source words, and 4-dimensional hidden states. The decoder state has shape (1, 4): a single 4-dimensional vector representing what the decoder is "looking for" when generating "chat".

Step 1: Computing Alignment Scores

Now we apply the dot product score function. For each encoder state, we compute its dot product with the decoder state. Remember, the dot product measures how much two vectors point in the same direction.

In[16]:
Code
# Compute dot product scores: decoder_state · each encoder_state
scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze()

# Apply softmax to convert scores to probabilities
attention_weights = F.softmax(scores, dim=0)

# Compute context vector as weighted sum of encoder states
context = torch.bmm(
    attention_weights.unsqueeze(0).unsqueeze(0), encoder_states
).squeeze()

Step 2: Examining the Results

Let's see what scores each word received and how softmax transforms them into attention weights:

Out[17]:
Console
Step-by-step attention computation:
----------------------------------------

Dot product scores (raw alignment):
  The: 0.0400
  cat: 1.4000
  sat: -0.0400

Attention weights (after softmax):
  The: 0.1718
  cat: 0.6695
  sat: 0.1586

Context vector: [0.585, -0.055, 0.601, 0.122]

Highest attention on: 'cat'

The results reveal exactly what we hoped to see. The decoder state for generating "chat" produces the highest alignment score with "cat" because their vectors point in similar directions. The softmax function then amplifies this difference: "cat" receives about 70% of the attention weight, while "The" and "sat" share the remaining 30%.

Notice how the context vector ends up being dominated by the "cat" representation. This is attention in action: the decoder has learned to extract the relevant information from the source sequence by focusing on the semantically corresponding word.

Out[18]:
Visualization
Bar chart showing attention weights with 'cat' receiving the highest weight around 0.7.
Attention weights when generating 'chat' from 'The cat sat'. The model correctly focuses on 'cat' (highlighted in red), the source word that semantically corresponds to the French target. This alignment emerges from the similarity between the decoder state and the 'cat' encoder representation.

The visualization below shows how the context vector is constructed as a weighted blend of the encoder states. Each dimension of the context vector is a weighted sum of that dimension across all encoder states, with the weights determined by attention.

Out[19]:
Visualization
Stacked bar chart showing weighted contributions from each encoder state to context vector dimensions.
Weighted contributions to the context vector. Stacked bars show each encoder state's contribution per dimension, weighted by attention. The 'cat' encoder (red) dominates with 70% attention weight.
Grouped bar chart comparing encoder state values with the resulting context vector.
Context vector compared to original encoder states. The context closely resembles the 'cat' representation since it receives the highest attention weight.

Limitations and Impact

Luong attention addressed several limitations of Bahdanau attention while introducing its own trade-offs. Understanding these helps contextualize attention's evolution toward transformers.

Computational Efficiency

The dot product score function is significantly faster than additive attention, especially for long sequences. With hidden dimension $d$ and sequence length $S$, dot product requires $O(Sd)$ operations compared to $O(Snd)$ for additive attention (where $n$ is the intermediate dimension). This efficiency gain compounds during training when attention is computed millions of times.

However, dot product attention has a subtle numerical issue: when the hidden dimension is large, dot products can become very large in magnitude. This pushes softmax toward extreme values (near 0 or 1), causing gradient saturation. The transformer architecture addresses this with scaled dot-product attention, dividing by $\sqrt{d}$ to keep values in a reasonable range.

Out[20]:
Visualization
Line plot showing score standard deviation growing with hidden dimension for unscaled vs scaled attention.
As hidden dimension increases, dot product magnitudes grow proportionally (for unit-variance random vectors, std ≈ √d). Scaling by 1/√d keeps values stable around 1.
Grouped bar chart showing attention weight distributions becoming sharper as score magnitude increases.
Large scores cause softmax to produce near-binary distributions, leading to gradient saturation. Higher scale values concentrate all attention on the highest-scoring position.
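A quick numerical check of this effect, using random unit-variance vectors (not tied to any trained model):

import torch

torch.manual_seed(0)
for d in [16, 64, 256, 1024]:
    q = torch.randn(10000, d)
    k = torch.randn(10000, d)
    raw = (q * k).sum(dim=1)          # unscaled dot products
    scaled = raw / d ** 0.5           # transformer-style scaling
    print(d, raw.std().item(), scaled.std().item())
# the unscaled standard deviation grows roughly as sqrt(d); the scaled one stays near 1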

Expressiveness vs Simplicity

General and concat attention are more expressive than dot product, able to learn arbitrary compatibility functions. But this expressiveness comes at a cost: more parameters, slower computation, and potential overfitting on small datasets. Empirically, the simpler dot product often performs comparably or better, suggesting that the encoder and decoder learn representations where raw similarity is a good proxy for alignment.

This finding influenced the design of transformers, which use dot product attention exclusively. The lesson is that with sufficient model capacity and data, simple mechanisms can match or exceed complex ones.

Sequential Bottleneck

Both Bahdanau and Luong attention still rely on RNNs for the encoder and decoder. This creates a sequential bottleneck: each timestep must wait for the previous one, preventing parallelization. Attention helps by providing direct connections to encoder states, but the fundamental limitation remains.

Transformers eliminate this bottleneck entirely by using self-attention instead of recurrence. Each position can attend to all other positions in parallel, enabling massive speedups on modern hardware. Luong's dot product attention, combined with the key-query-value formulation, became the foundation for this revolution.

Legacy and Influence

Luong attention's most lasting contribution is demonstrating that simple, efficient attention mechanisms work well. The dot product score function, attention after the decoder step, and the idea of multiple attention variants all influenced subsequent research. When Vaswani et al. designed the transformer, they chose scaled dot-product attention, directly building on Luong's work.

The distinction between global and local attention also foreshadowed later work on sparse attention patterns. Transformers face quadratic complexity in sequence length, and researchers have explored various ways to restrict attention to local windows or learned patterns. Luong's local attention was an early exploration of this trade-off between expressiveness and efficiency.

Summary

This chapter explored Luong attention, a family of attention mechanisms that simplified and extended Bahdanau's original formulation.

The key innovations include three score functions for computing alignment:

  • Dot product: Parameter-free, efficient, requires matching dimensions
  • General: Learned bilinear transformation, handles different dimensions
  • Concat: Two-layer network with nonlinearity, most expressive

Luong attention also introduced the distinction between global and local attention. Global attention considers all encoder positions, while local attention focuses on a predicted window. Global attention dominates in practice due to its flexibility and the efficiency of modern hardware.

The architectural placement differs from Bahdanau: Luong computes attention after the RNN step using the current hidden state, then combines the context with the hidden state through a learned transformation. This "attention as output" approach is simpler and slightly more parallelizable.

Perhaps most importantly, Luong's work demonstrated that simple attention mechanisms, particularly dot product attention, can match or exceed more complex alternatives. This insight directly influenced the transformer architecture, which uses scaled dot-product attention as its core mechanism. Understanding Luong attention provides essential context for the self-attention mechanisms we'll explore in the next part of this book.

Key Parameters

When implementing Luong attention, several parameters significantly impact model behavior and performance:

  • Attention method (method): Chooses between "dot", "general", or "concat" score functions. Dot product is fastest and parameter-free but requires matching encoder/decoder dimensions. General attention adds a learned projection matrix, enabling different dimensions and learned similarity. Concat attention is most expressive but slowest. Start with dot product for most applications.

  • Hidden dimension (hidden_dim): The dimensionality of the decoder hidden state. Larger values (256-512) provide more representational capacity but increase computation. For attention, this determines the space in which similarity is computed. Values of 256-512 work well for most sequence-to-sequence tasks.

  • Encoder dimension (encoder_dim): The dimensionality of encoder hidden states. Can differ from hidden_dim when using general or concat attention. For bidirectional encoders, this is typically 2 * hidden_dim since forward and backward states are concatenated (see the sketch after this list).

  • Local attention window (D): For local attention, controls the half-width of the attention window. Positions outside $[p_t - D, p_t + D]$ receive zero attention. Larger windows (D=10-20) provide more flexibility but increase computation. Smaller windows (D=2-5) work well when alignments are expected to be monotonic.

  • Dropout (dropout): Applied to embeddings and the attentional hidden state before output projection. Values of 0.1-0.3 help prevent overfitting, especially important for attention weights which can become overly peaked during training.
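For instance, a minimal sketch of wiring a bidirectional encoder to the LuongAttention module defined above; the dimensions are illustrative:

import torch
import torch.nn as nn

hidden_dim = 64
encoder = nn.GRU(32, hidden_dim, batch_first=True, bidirectional=True)

src = torch.randn(4, 10, 32)                        # (batch, src_len, embed_dim)
encoder_outputs, _ = encoder(src)                   # (4, 10, 2 * hidden_dim)

# encoder_dim differs from hidden_dim, so use "general" (or "concat") scoring
attention = LuongAttention(hidden_dim, encoder_dim=2 * hidden_dim, method="general")
decoder_hidden = torch.randn(4, hidden_dim)
context, weights = attention(decoder_hidden, encoder_outputs)
print(context.shape, weights.shape)                 # (4, 128) and (4, 10)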



About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
