Encoder-Decoder Framework: Seq2Seq Architecture for Machine Translation

Michael Brenndoerfer · December 16, 2025 · 43 min read

Learn the encoder-decoder framework for sequence-to-sequence learning, including context vectors, LSTM implementations, and the bottleneck problem that motivated attention mechanisms.

Encoder-Decoder Framework

In the previous part, we explored how recurrent neural networks process sequences, capturing temporal dependencies through hidden states that accumulate information over time. But there's a fundamental limitation we haven't addressed: what if the input and output sequences have different lengths? How do you translate "The cat sat on the mat" (six words) into "Le chat était assis sur le tapis" (seven words)? How do you summarize a 500-word article into a 50-word abstract?

These variable-length sequence-to-sequence problems require a new architectural paradigm. The encoder-decoder framework, introduced by Sutskever et al. in 2014, provides an elegant solution: use one RNN to compress the input sequence into a fixed-size representation, then use another RNN to generate the output sequence from that representation. This simple idea unlocked machine translation, text summarization, and countless other applications that transform one sequence into another.

The Core Insight: Separate Reading from Writing

Before diving into architecture details, let's understand the fundamental insight behind encoder-decoder models. Consider how a human translator works. First, they read the entire source sentence, building a mental understanding of its meaning. Then, they produce the translation word by word, consulting their mental representation of the original. They don't translate word by word as they read, because that would fail to capture context and produce awkward, literal translations.

The encoder-decoder framework mirrors this process. The encoder "reads" the input sequence, compressing it into a dense vector representation called the context vector. The decoder then "writes" the output sequence, using the context vector as its guide. This separation allows each component to specialize: the encoder focuses on understanding, while the decoder focuses on generation.

Out[3]:
Visualization
Diagram showing encoder processing input sequence into context vector, which feeds into decoder generating output sequence.
High-level view of the encoder-decoder architecture. The encoder processes the input sequence left-to-right, compressing it into a context vector. The decoder then generates the output sequence one token at a time, conditioned on the context vector.

The figure illustrates the basic flow. The encoder processes "The cat sat" sequentially, with each hidden state h_t incorporating information from all previous words. After processing the final word, the encoder's hidden state becomes the context vector c. The decoder then generates the translation, starting with a special start token <s> and producing one word at a time until it outputs an end token.

The Encoder: Compressing Input into Meaning

The encoder's job is straightforward: process the input sequence and produce a fixed-size representation that captures its meaning. We can use any RNN architecture for this purpose, whether vanilla RNN, LSTM, or GRU. In practice, LSTMs and GRUs dominate due to their ability to capture long-range dependencies.

Encoder Architecture

For an input sequence x_1, x_2, \ldots, x_T of length T, the encoder computes a sequence of hidden states:

h_t = \text{RNN}_{\text{enc}}(x_t, h_{t-1})

where:

  • x_t: the input token at timestep t, typically represented as an embedding vector
  • h_{t-1}: the previous hidden state, carrying information from tokens x_1, \ldots, x_{t-1}
  • h_t: the new hidden state, now incorporating information from x_1, \ldots, x_t
  • \text{RNN}_{\text{enc}}: the encoder's recurrent function (LSTM, GRU, etc.)

The final hidden state h_T serves as the context vector c, summarizing the entire input sequence in a single vector:

c = h_T

where:

  • c: the context vector that will be passed to the decoder
  • h_T: the encoder's hidden state after processing all T input tokens

This is remarkably simple, but there's a subtle point worth emphasizing. The context vector must encode everything the decoder needs to know about the input. For a 100-word input sentence, all the semantic content, syntactic structure, and nuance must be compressed into a vector of perhaps 512 or 1024 dimensions. This compression is both the power and the limitation of the basic encoder-decoder framework.

Out[4]:
Visualization
Diagram showing LSTM cells processing word embeddings sequentially with hidden state connections.
Detailed view of the encoder processing a sequence. Each LSTM cell receives the current word embedding and the previous hidden state, producing a new hidden state. The final hidden state becomes the context vector that summarizes the entire input.

Implementing the Encoder

Let's implement a basic LSTM encoder in PyTorch. The implementation is straightforward because PyTorch's nn.LSTM handles the sequential processing internally:

In[5]:
Code
class Encoder(nn.Module):
    def __init__(
        self, vocab_size, embed_size, hidden_size, num_layers=1, dropout=0.1
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Embedding layer converts token indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embed_size)

        # LSTM processes the sequence of embeddings
        self.lstm = nn.LSTM(
            embed_size,
            hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
        )

    def forward(self, src):
        # src shape: (batch_size, seq_len)

        # Convert tokens to embeddings
        embedded = self.embedding(src)  # (batch_size, seq_len, embed_size)

        # Process through LSTM
        outputs, (hidden, cell) = self.lstm(embedded)
        # outputs: (batch_size, seq_len, hidden_size) - all hidden states
        # hidden: (num_layers, batch_size, hidden_size) - final hidden state per layer
        # cell: (num_layers, batch_size, hidden_size) - final cell state per layer

        return hidden, cell

The encoder returns both the hidden state and cell state (for LSTM). These together form the context that initializes the decoder. Notice that we don't return the outputs tensor containing all hidden states. In the basic encoder-decoder framework, only the final hidden state matters. Later, when we add attention, we'll need those intermediate states.

In[6]:
Code
# Test the encoder
vocab_size = 10000
embed_size = 256
hidden_size = 512
num_layers = 2

encoder = Encoder(vocab_size, embed_size, hidden_size, num_layers)

# Simulate a batch of 4 sentences, each 20 tokens long
batch_size, seq_len = 4, 20
src = torch.randint(0, vocab_size, (batch_size, seq_len))

hidden, cell = encoder(src)
Out[7]:
Console
Encoder Output Shapes:
  Input: torch.Size([4, 20])
  Hidden state: torch.Size([2, 4, 512])
  Cell state: torch.Size([2, 4, 512])

The context vector has 1024 total dimensions
(2 layers × 512 hidden units per layer)

The hidden state has shape (num_layers, batch_size, hidden_size). For a 2-layer LSTM, this means we have two hidden vectors per sequence: one from the first layer and one from the second. Both contribute to the context that initializes the decoder.
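
If you need a single vector per sequence, for instance for visualization, one common choice (an assumption here, not something the decoder below requires) is the top layer's final state; another is to concatenate the layers:

# hidden has shape (num_layers, batch_size, hidden_size)
top_layer_state = hidden[-1]  # (batch_size, hidden_size): final state of the top LSTM layer

# Or flatten all layers into one vector per sequence
flat_context = hidden.permute(1, 0, 2).reshape(hidden.shape[1], -1)
print(top_layer_state.shape, flat_context.shape)  # torch.Size([4, 512]) torch.Size([4, 1024])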

To visualize how information accumulates in the encoder, let's examine the hidden state activations as the encoder processes a sequence:

Out[8]:
Visualization
Heatmap showing encoder hidden state values evolving across sequence positions.
Encoder hidden state activations across sequence positions (first 32 dimensions shown). Each column represents the hidden state after processing one token. Early positions show sparse activation patterns, while later positions show denser patterns as the encoder accumulates information from the entire sequence. The final column becomes the context vector.

The heatmap reveals how the encoder builds up its representation. Each column shows the hidden state after processing one more token. Notice how the activation patterns change across positions: some dimensions respond strongly to specific words, while others accumulate information gradually. The rightmost column is the context vector c that gets passed to the decoder. It must encode everything about the input sequence that the decoder needs for translation.

The Decoder: Generating Output from Context

The decoder's job is more complex than the encoder's. It must generate the output sequence one token at a time, where each token depends on the context vector and all previously generated tokens. This autoregressive generation creates a dependency chain: to generate token t, you need tokens 1, 2, \ldots, t-1.

Decoder Architecture

The decoder is also an RNN, but with a crucial difference: its initial hidden state comes from the encoder's context vector rather than being initialized to zeros. At each timestep t, the decoder:

  1. Takes the previous output token y_{t-1} as input
  2. Updates its hidden state using the RNN
  3. Produces a probability distribution over the vocabulary
  4. Samples or selects the next token y_t

Mathematically, the decoder performs two operations at each timestep. First, it updates its hidden state by combining the previous token with its memory of what it has generated so far:

s_t = \text{RNN}_{\text{dec}}(y_{t-1}, s_{t-1})

where:

  • s_t: the decoder's hidden state at timestep t, encoding information about all previously generated tokens
  • y_{t-1}: the embedding of the previous output token (or the start token <s> when t=1)
  • s_{t-1}: the decoder's hidden state from the previous timestep
  • \text{RNN}_{\text{dec}}: the decoder's recurrent function (LSTM, GRU, etc.)

The crucial initialization is s_0 = c, meaning the decoder starts with the context vector from the encoder as its initial hidden state. This is how information flows from the encoder to the decoder.

Second, the decoder converts its hidden state into a probability distribution over the vocabulary to predict the next token:

P(y_t | y_{<t}, c) = \text{softmax}(W_o s_t + b_o)

where:

  • P(y_t | y_{<t}, c): probability distribution over all vocabulary tokens for position t
  • W_o: output projection weight matrix of shape (vocab_size, hidden_size)
  • s_t: the current decoder hidden state
  • b_o: output projection bias vector of shape (vocab_size,)
  • y_{<t}: all previously generated tokens y_1, \ldots, y_{t-1}
  • c: the context vector (implicitly encoded in the hidden states through initialization)

The softmax function converts the raw scores (logits) into a valid probability distribution that sums to 1, allowing us to either sample from this distribution or take the most probable token.
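
As a minimal sketch of that last step, using a hypothetical logits tensor rather than the model defined later in this chapter:

import torch
import torch.nn.functional as F

logits = torch.randn(1, 12000)  # hypothetical raw decoder scores over a 12,000-token vocabulary
probs = F.softmax(logits, dim=-1)  # valid probability distribution (sums to 1)

greedy_token = probs.argmax(dim=-1)  # most probable token (greedy decoding)
sampled_token = torch.multinomial(probs, num_samples=1)  # or draw a token from the distribution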

To see this concretely, let's visualize what a typical decoder output looks like. The decoder produces a probability distribution over the entire vocabulary at each timestep:

Out[9]:
Visualization
Bar chart showing probability distribution over vocabulary tokens with 'chat' having highest probability.
Example decoder output showing the probability distribution over vocabulary tokens at a single timestep. The model assigns high probability to a few likely candidates ('chat', 'chien') while spreading small probabilities across thousands of other tokens. Greedy decoding selects 'chat' (0.72), but beam search might explore 'chien' (0.15) as well.

This visualization shows a typical decoder output. The model has learned that "chat" (cat) is the most likely next word given the context, assigning it 72% probability. Alternative translations like "chien" (dog) receive smaller but non-negligible probability. The long tail of the distribution spreads tiny probabilities across thousands of other vocabulary tokens.

Out[10]:
Visualization
Diagram showing decoder LSTM cells generating tokens sequentially with softmax output layers.
Detailed view of the decoder generating output tokens. The decoder is initialized with the context vector from the encoder. At each step, it takes the previous output token, updates its hidden state, and predicts the next token through a softmax layer.

Implementing the Decoder

The decoder implementation requires careful handling of the autoregressive generation process:

In[11]:
Code
class Decoder(nn.Module):
    def __init__(
        self, vocab_size, embed_size, hidden_size, num_layers=1, dropout=0.1
    ):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Embedding for target tokens
        self.embedding = nn.Embedding(vocab_size, embed_size)

        # LSTM for sequential generation
        self.lstm = nn.LSTM(
            embed_size,
            hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
        )

        # Output projection to vocabulary
        self.fc_out = nn.Linear(hidden_size, vocab_size)

    def forward(self, trg, hidden, cell):
        # trg shape: (batch_size, trg_len)
        # hidden, cell: from encoder, shape (num_layers, batch_size, hidden_size)

        # Embed target tokens
        embedded = self.embedding(trg)  # (batch_size, trg_len, embed_size)

        # Process through LSTM, initialized with encoder states
        outputs, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        # outputs: (batch_size, trg_len, hidden_size)

        # Project to vocabulary
        predictions = self.fc_out(outputs)  # (batch_size, trg_len, vocab_size)

        return predictions, hidden, cell

During training, we feed the entire target sequence to the decoder at once. This is called teacher forcing, which we'll cover in detail in the next chapter. The key insight is that during training, we know the correct output sequence, so we can compute all timesteps in parallel rather than generating one token at a time.

Connecting Encoder and Decoder: The Seq2Seq Model

Now let's combine the encoder and decoder into a complete sequence-to-sequence model:

In[12]:
Code
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

        # Ensure encoder and decoder have compatible dimensions
        assert encoder.hidden_size == decoder.hidden_size
        assert encoder.num_layers == decoder.num_layers

    def forward(self, src, trg):
        # src: (batch_size, src_len) - source sequence
        # trg: (batch_size, trg_len) - target sequence (for teacher forcing)

        # Encode the source sequence
        hidden, cell = self.encoder(src)

        # Decode using encoder's final state as initial context
        outputs, _, _ = self.decoder(trg, hidden, cell)

        return outputs
In[13]:
Code
# Create the complete model
src_vocab_size = 10000  # Source language vocabulary
trg_vocab_size = 12000  # Target language vocabulary
embed_size = 256
hidden_size = 512
num_layers = 2

encoder = Encoder(src_vocab_size, embed_size, hidden_size, num_layers)
decoder = Decoder(trg_vocab_size, embed_size, hidden_size, num_layers)
model = Seq2Seq(encoder, decoder)

# Test forward pass
src = torch.randint(
    0, src_vocab_size, (4, 20)
)  # 4 source sentences, 20 tokens each
trg = torch.randint(
    0, trg_vocab_size, (4, 25)
)  # 4 target sentences, 25 tokens each

outputs = model(src, trg)
Out[14]:
Console
Seq2Seq Model Test:
  Source shape: torch.Size([4, 20])
  Target shape: torch.Size([4, 25])
  Output shape: torch.Size([4, 25, 12000])

Output is logits over 12000 target vocabulary tokens
for each of 25 positions in the target sequence

The output has shape (batch_size, trg_len, vocab_size), containing unnormalized log-probabilities (logits) for each position in the target sequence. During training, we compute the cross-entropy loss between these predictions and the actual target tokens.

The Context Vector Bottleneck

The basic encoder-decoder architecture has a fundamental limitation: all information about the source sequence must pass through a single fixed-size context vector. This creates an information bottleneck that becomes increasingly problematic as sequences grow longer.

Out[15]:
Visualization
Diagram showing long input sequence being compressed through narrow bottleneck into context vector.
The context vector bottleneck problem. A 50-word sentence must be compressed into the same 512-dimensional vector as a 5-word sentence. As input length increases, the context vector becomes increasingly overloaded, losing fine-grained information.

Consider what happens when translating a long sentence. The encoder must compress all the nuances, word relationships, and semantic content into perhaps 512 or 1024 numbers. Early words in the sequence are processed many timesteps before the context vector is formed, so their information must survive through many LSTM updates. Despite the LSTM's gating mechanisms, some information inevitably degrades.
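
We can see the fixed-size constraint directly with the encoder instance built earlier: a 5-token input and a 50-token input produce context states of exactly the same shape.

short_src = torch.randint(0, vocab_size, (1, 5))   # 5-token input
long_src = torch.randint(0, vocab_size, (1, 50))   # 50-token input

short_hidden, _ = encoder(short_src)
long_hidden, _ = encoder(long_src)

# Both are (num_layers, 1, hidden_size): the context has the same capacity
# no matter how much information the input contains.
print(short_hidden.shape, long_hidden.shape)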

This bottleneck manifests in several ways:

  • Translation quality degrades for long sentences: Research showed that basic seq2seq models performed well on sentences under 20 words but quality dropped sharply for longer inputs
  • The decoder lacks access to specific source positions: When generating a word, the decoder can't "look back" at a specific part of the input
  • Information about word order can be lost: The context vector may capture what concepts are present but lose their precise arrangement

We can visualize this degradation by examining how well the decoder can reconstruct different parts of the input:

Out[16]:
Visualization
Line plot showing reconstruction accuracy decreasing for earlier positions in the sequence.
Simulated reconstruction accuracy as a function of position in the source sequence. Words near the end of the input (processed just before the context vector is formed) are reconstructed more accurately than words at the beginning, which must survive through many LSTM updates.

The attention mechanism, which we'll cover later in this part, directly addresses the bottleneck problem by allowing the decoder to access all encoder hidden states, not just the final one. But understanding the bottleneck is crucial for appreciating why attention was such an important breakthrough.

The bottleneck's impact on real-world performance was documented in the original seq2seq papers. Translation quality, measured by BLEU score, degrades systematically as source sentences get longer:

Out[17]:
Visualization
Line plot showing BLEU score decreasing with sentence length for basic seq2seq but remaining stable for attention models.
Translation quality (BLEU score) as a function of source sentence length for basic seq2seq models. Performance is strong for short sentences but degrades significantly beyond 20-30 words, motivating the development of attention mechanisms. The dashed line shows how attention-based models maintain quality across all lengths.

This empirical observation was a key motivation for developing attention. Basic seq2seq models achieve competitive BLEU scores on short sentences (under 20 words) but performance drops sharply for longer inputs. The attention mechanism, shown as the dashed line, maintains quality regardless of length by allowing the decoder to directly access relevant parts of the source sequence rather than relying solely on the compressed context vector.

Seq2Seq for Machine Translation

Machine translation was the driving application for encoder-decoder models. Let's walk through a complete example of how the model processes a translation task.

The Translation Pipeline

Consider translating "The cat sat on the mat" to French. The pipeline proceeds as follows:

  1. Tokenization: Convert the English sentence to token indices using a source vocabulary
  2. Encoding: Process tokens through the encoder to get the context vector
  3. Decoding: Generate French tokens one at a time, starting with a start token
  4. Detokenization: Convert output indices back to French words
In[18]:
Code
# Simulate a simple translation example
# In practice, you'd use real tokenizers and vocabularies

# Simulated vocabularies
src_vocab = {
    "<pad>": 0,
    "<s>": 1,
    "</s>": 2,
    "the": 3,
    "cat": 4,
    "sat": 5,
    "on": 6,
    "mat": 7,
}
trg_vocab = {
    "<pad>": 0,
    "<s>": 1,
    "</s>": 2,
    "le": 3,
    "chat": 4,
    "était": 5,
    "assis": 6,
    "sur": 7,
    "tapis": 8,
}

# Reverse mapping for decoding
idx_to_trg = {v: k for k, v in trg_vocab.items()}


def encode_sentence(sentence, vocab):
    """Convert sentence to tensor of indices."""
    tokens = sentence.lower().split()
    indices = [vocab.get(t, 0) for t in tokens]
    return torch.tensor([indices])  # Add batch dimension


def decode_indices(indices, idx_to_word):
    """Convert indices back to words."""
    return " ".join([idx_to_word.get(i, "<unk>") for i in indices])


# Encode source sentence
src_sentence = "the cat sat on the mat"
src_tensor = encode_sentence(src_sentence, src_vocab)

# For training, we also have the target (with start token prepended)
trg_sentence = "<s> le chat était assis sur le tapis"
trg_tensor = encode_sentence(trg_sentence, trg_vocab)
Out[19]:
Console
Translation Example:
  Source: 'the cat sat on the mat'
  Source indices: [3, 4, 5, 6, 3, 7]

  Target: '<s> le chat était assis sur le tapis'
  Target indices: [1, 3, 4, 5, 6, 7, 3, 8]

The source sentence maps to indices [3, 4, 5, 6, 3, 7], where repeated words like "the" map to the same index (3). The target includes the start token <s> (index 1) prepended, which tells the decoder to begin generating. Notice that the source has 6 tokens while the target has 8, demonstrating how seq2seq handles variable-length mappings.

Training the Translation Model

During training, we use teacher forcing: the decoder receives the correct previous token at each step, not its own predictions. This allows parallel computation and stable training:

In[20]:
Code
def train_step(model, src, trg, criterion, optimizer):
    """
    Single training step for seq2seq model.

    Args:
        model: Seq2Seq model
        src: Source sequence (batch_size, src_len)
        trg: Target sequence (batch_size, trg_len), including start token
        criterion: Loss function (CrossEntropyLoss)
        optimizer: Optimizer

    Returns:
        loss value
    """
    optimizer.zero_grad()

    # Forward pass
    # Input to decoder: all tokens except the last (which has no next token to predict)
    # Target for loss: all tokens except the first (the start token)
    output = model(src, trg[:, :-1])  # (batch, trg_len-1, vocab_size)

    # Reshape for loss computation
    output_dim = output.shape[-1]
    output = output.contiguous().view(
        -1, output_dim
    )  # (batch * (trg_len-1), vocab_size)
    trg_flat = trg[:, 1:].contiguous().view(-1)  # (batch * (trg_len-1),)

    # Compute loss
    loss = criterion(output, trg_flat)

    # Backward pass
    loss.backward()

    # Gradient clipping to prevent exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()

    return loss.item()

The loss computation deserves a closer look. We compare the model's predictions (excluding the last position, which would predict beyond the end of the sequence) against the target tokens (excluding the start token, which is input, not output). This offset alignment is crucial for correct training.
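
A concrete toy example of the offset, using hypothetical indices where 1 is <s> and 2 is </s>:

# Hypothetical target: <s>(1) le(3) chat(4) </s>(2)
trg_example = torch.tensor([[1, 3, 4, 2]])

decoder_input = trg_example[:, :-1]  # [[1, 3, 4]] -> <s>, le, chat (fed to the decoder)
loss_target = trg_example[:, 1:]     # [[3, 4, 2]] -> le, chat, </s> (what it must predict)
print(decoder_input.tolist(), loss_target.tolist())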

Inference: Generating Translations

At inference time, we don't have the target sequence. We must generate tokens autoregressively, feeding each prediction back as input for the next step:

In[21]:
Code
def translate(model, src, max_len=50, start_token=1, end_token=2):
    """
    Generate translation for a source sequence.

    Args:
        model: Trained Seq2Seq model
        src: Source sequence (1, src_len)
        max_len: Maximum output length
        start_token: Index of <s> token
        end_token: Index of </s> token

    Returns:
        List of generated token indices
    """
    model.eval()

    with torch.no_grad():
        # Encode source
        hidden, cell = model.encoder(src)

        # Start with start token
        current_token = torch.tensor([[start_token]])
        generated = [start_token]

        for _ in range(max_len):
            # Decode one step
            output, hidden, cell = model.decoder(current_token, hidden, cell)

            # Get most likely next token
            next_token = output.argmax(dim=-1).item()
            generated.append(next_token)

            # Stop if end token generated
            if next_token == end_token:
                break

            # Prepare input for next step
            current_token = torch.tensor([[next_token]])

    return generated

This greedy decoding always selects the most probable next token. In practice, beam search (covered in a later chapter) often produces better results by exploring multiple hypotheses simultaneously.
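
As a quick usage sketch with the toy example from earlier (the model is untrained, so the generated tokens are meaningless and mostly map to <unk>):

generated = translate(model, src_tensor)      # src_tensor built in the toy example above
print(decode_indices(generated, idx_to_trg))  # untrained model: output is essentially random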

Seq2Seq for Text Summarization

Machine translation maps sequences of similar lengths, but the encoder-decoder framework handles arbitrary length ratios. Text summarization compresses long documents into short summaries, making it another natural application.

Out[22]:
Visualization
Diagram showing long document being encoded and decoded into short summary.
Seq2seq for summarization. A long document (many encoder steps) is compressed into a context vector, which the decoder expands into a short summary (few decoder steps). The extreme compression ratio makes the bottleneck problem especially severe for summarization.

Summarization presents unique challenges compared to translation:

  • Extreme compression ratios: A 500-word article might become a 50-word summary, requiring 10:1 compression
  • Content selection: The model must decide what information is important enough to include
  • Abstraction vs extraction: Should the summary use words from the source or generate new phrasings?

The basic encoder-decoder model struggles with these challenges. The bottleneck problem is especially severe when compressing long documents. Later innovations like attention and copy mechanisms significantly improved summarization quality.

Training Setup and Considerations

Training seq2seq models requires careful attention to several practical details: choosing an appropriate loss function, handling variable-length sequences through padding and masking, preventing gradient explosions, and tuning the learning rate schedule. This section covers each of these considerations with practical code examples.

Loss Function

We use cross-entropy loss to train the model. At each position in the output sequence, the model predicts a probability distribution over the vocabulary, and we penalize it based on how much probability it assigns to the correct token. Summing these penalties across all positions gives us the total loss for a sequence:

\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t^* | y_{<t}, c)

where:

  • \mathcal{L}: the total loss for the sequence (lower is better)
  • T: the length of the target sequence
  • y_t^*: the correct (ground truth) target token at position t
  • y_{<t}: all target tokens before position t, i.e., y_1, y_2, \ldots, y_{t-1}
  • c: the context vector from the encoder
  • P(y_t^* | y_{<t}, c): the probability the model assigns to the correct token y_t^*, given the context and previous tokens

The negative log transforms probabilities into losses: when the model assigns probability 1.0 to the correct token, -\log(1.0) = 0 (no loss). When it assigns probability 0.01, -\log(0.01) \approx 4.6 (high loss). This encourages the model to assign high probability to the correct next token at every position.

Let's visualize this relationship to build intuition for how cross-entropy loss penalizes predictions:

Out[23]:
Visualization
Line plot showing negative log function with loss on y-axis and probability on x-axis.
The negative log loss function used in cross-entropy. When the model assigns high probability to the correct token (right side), loss is low. When it assigns low probability (left side), loss increases sharply. This steep penalty for confident wrong predictions drives the model to be well-calibrated.

The curve shows why cross-entropy is effective for training: it penalizes confidently wrong predictions much more severely than uncertain ones. A model that assigns only 1% probability to the correct token incurs more than six times the loss of one assigning 50%. This steep gradient in the low-probability region provides a strong learning signal when the model makes mistakes.
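
A quick numerical check of this penalty, using Python's standard math module:

import math

for p in [1.0, 0.5, 0.1, 0.01]:
    print(f"p = {p:5.2f}  ->  loss = {-math.log(p):.2f}")
# p = 1.00 -> 0.00, p = 0.50 -> 0.69, p = 0.10 -> 2.30, p = 0.01 -> 4.61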

Handling Variable-Length Sequences

Real data contains sequences of varying lengths. We handle this through padding and masking:

In[24]:
Code
def create_mask(seq, pad_idx=0):
    """
    Create a padding mask for a batch of sequences.
    Returns True for real tokens, False for padding.
    """
    return seq != pad_idx


# Example: batch with different length sequences
sequences = [
    [1, 5, 8, 3, 2],  # 5 real tokens
    [1, 7, 4, 2, 0],  # 4 real tokens + 1 pad
    [1, 6, 9, 3, 8, 2],  # 6 real tokens (longest, sets the padded length)
]

# In practice, pad to max length in batch
max_len = 6
padded = torch.tensor(
    [
        [1, 5, 8, 3, 2, 0],
        [1, 7, 4, 2, 0, 0],
        [1, 6, 9, 3, 8, 2],
    ]
)

mask = create_mask(padded)
Out[25]:
Console
Padded sequences:
tensor([[1, 5, 8, 3, 2, 0],
        [1, 7, 4, 2, 0, 0],
        [1, 6, 9, 3, 8, 2]])

Mask (True = real token):
tensor([[ True,  True,  True,  True,  True, False],
        [ True,  True,  True,  True, False, False],
        [ True,  True,  True,  True,  True,  True]])

The padded tensor shows zeros appended to shorter sequences to match the maximum length of 6. The mask tensor marks which positions contain real tokens (True) versus padding (False). During loss computation, we use this mask to ensure the model isn't penalized for predictions at padded positions, which would distort the training signal.

The mask is used during loss computation to ignore predictions at padded positions. PyTorch's CrossEntropyLoss supports an ignore_index parameter for this purpose.
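
For example, with the padding index of 0 used throughout this chapter, the loss can be configured so that padded positions contribute nothing:

criterion = nn.CrossEntropyLoss(ignore_index=0)  # positions whose target index is 0 (padding) add no loss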

Gradient Clipping

Seq2seq models are prone to exploding gradients due to the long computational graphs created by unrolling through time. Gradient clipping limits the gradient magnitude:

In[26]:
Code
# Gradient clipping is essential for stable training.
# Call it after loss.backward() and before optimizer.step().
max_grad_norm = 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

A typical value is 1.0 or 5.0. Without clipping, training often diverges with NaN losses.

Learning Rate and Optimization

Adam optimizer with learning rate scheduling works well for seq2seq models:

In[27]:
Code
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Learning rate decay: reduce by factor of 0.5 when validation loss plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

Starting with a learning rate around 10^{-3} and reducing it when progress stalls typically works well.

A typical training run shows characteristic loss curves as the model learns to translate:

Out[28]:
Visualization
Line plot showing training and validation loss over epochs with learning rate reduction markers.
Typical training and validation loss curves for a seq2seq translation model. Training loss decreases steadily, while validation loss initially follows but eventually plateaus or increases slightly, indicating the onset of overfitting. The learning rate is reduced at epochs 15 and 25 (vertical dashed lines) when validation loss stalls.

This visualization shows several important training dynamics. Early epochs show rapid loss reduction as the model learns basic translation patterns. The gap between training and validation loss indicates generalization: a small gap means the model generalizes well, while a growing gap signals overfitting. Learning rate reductions (marked by vertical lines) help the model escape local minima and continue improving. The best model checkpoint is typically saved when validation loss is lowest, around epoch 18-20 in this example.

Putting It Together: A Complete Training Loop

Let's implement a complete training loop that incorporates all these considerations:

In[29]:
Code
def train_epoch(model, data_loader, optimizer, criterion, clip_grad=1.0):
    """
    Train for one epoch.

    Args:
        model: Seq2Seq model
        data_loader: DataLoader yielding (src, trg) batches
        optimizer: Optimizer
        criterion: Loss function
        clip_grad: Maximum gradient norm

    Returns:
        Average loss for the epoch
    """
    model.train()
    total_loss = 0

    for src, trg in data_loader:
        optimizer.zero_grad()

        # Forward pass (teacher forcing)
        output = model(src, trg[:, :-1])

        # Compute loss
        output_dim = output.shape[-1]
        output = output.contiguous().view(-1, output_dim)
        trg_flat = trg[:, 1:].contiguous().view(-1)

        loss = criterion(output, trg_flat)

        # Backward pass
        loss.backward()

        # Clip gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)

        # Update weights
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(data_loader)


def evaluate(model, data_loader, criterion):
    """
    Evaluate model on validation/test data.

    Returns:
        Average loss
    """
    model.eval()
    total_loss = 0

    with torch.no_grad():
        for src, trg in data_loader:
            output = model(src, trg[:, :-1])

            output_dim = output.shape[-1]
            output = output.contiguous().view(-1, output_dim)
            trg_flat = trg[:, 1:].contiguous().view(-1)

            loss = criterion(output, trg_flat)
            total_loss += loss.item()

    return total_loss / len(data_loader)
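
A minimal outer loop tying these pieces together might look like the following sketch; train_loader, val_loader, the epoch count, and the checkpoint path are assumptions for illustration, not part of the code above.

num_epochs = 30
best_val_loss = float("inf")

criterion = nn.CrossEntropyLoss(ignore_index=0)  # 0 is the padding index
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)  # train_loader: assumed DataLoader
    val_loss = evaluate(model, val_loader, criterion)                    # val_loader: assumed DataLoader
    scheduler.step(val_loss)  # reduce LR if validation loss has plateaued

    if val_loss < best_val_loss:  # checkpoint the best model seen so far
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_seq2seq.pt")

    print(f"Epoch {epoch + 1:02d}: train loss {train_loss:.3f}, val loss {val_loss:.3f}")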

Limitations and Impact

The encoder-decoder framework was a breakthrough that enabled end-to-end learning for sequence-to-sequence tasks. Before this architecture, machine translation relied on complex pipelines with separate components for alignment, phrase extraction, and language modeling. The seq2seq approach replaced all of this with a single neural network trained end-to-end.

However, the basic architecture has significant limitations that motivated subsequent research. The context vector bottleneck forces all source information through a fixed-size vector, causing information loss for long sequences. Sutskever et al.'s original paper showed that reversing the source sequence improved results: reversal leaves the average distance between corresponding words unchanged, but it places the first source words much closer to the first target words, which the decoder must generate first. This trick highlighted the underlying problem: information about tokens processed early by the encoder must survive the longest chain of updates before the decoder can use it.

The rigid encoder-decoder separation also limits flexibility. The encoder must finish processing before the decoder can start, preventing any interaction between reading and writing. Human translators don't work this way. They might read part of a sentence, start translating, then look back at the source for clarification. The attention mechanism, which we'll cover in subsequent chapters, addresses this by allowing the decoder to "look back" at any part of the encoded sequence.

Despite these limitations, the encoder-decoder framework established several principles that remain central to modern sequence modeling. The idea of encoding variable-length input into a fixed representation, then decoding back to variable-length output, appears in countless architectures. The separation of understanding (encoding) from generation (decoding) provides a clean abstraction that simplifies model design. And the end-to-end training paradigm, where the entire system is optimized jointly for the final task, has become the dominant approach in NLP.

The seq2seq architecture also demonstrated the power of recurrent networks for complex language tasks. While transformers have since surpassed RNN-based models on most benchmarks, the conceptual framework of encoder-decoder remains. Modern transformer models like T5 and BART use the same high-level architecture: encode the input, then decode the output. The attention mechanism that made transformers possible was first developed to address the bottleneck problem in RNN-based seq2seq models.

Summary

This chapter introduced the encoder-decoder framework, the foundational architecture for sequence-to-sequence learning. We covered how this paradigm separates the tasks of understanding input sequences and generating output sequences, enabling applications like machine translation and text summarization.

The encoder processes the input sequence through an RNN, compressing it into a fixed-size context vector that represents the input's meaning. The decoder, initialized with this context vector, generates the output sequence one token at a time, using each prediction as input for the next step.

The context vector bottleneck is the key limitation of basic seq2seq models. All information about the input must pass through this single vector, causing information loss for long sequences. This bottleneck motivated the development of attention mechanisms, which we'll explore in upcoming chapters.

Key implementation details include:

  • Use LSTM or GRU cells for both encoder and decoder to capture long-range dependencies
  • Initialize the decoder's hidden state with the encoder's final hidden state
  • Apply teacher forcing during training, feeding correct tokens rather than predictions
  • Use cross-entropy loss with masking to handle variable-length sequences
  • Clip gradients to prevent exploding gradients during backpropagation

The encoder-decoder framework established the paradigm for sequence-to-sequence learning that persists in modern architectures. While attention and transformers have improved upon the basic design, the core insight of separating encoding from decoding remains central to how we approach sequence transformation tasks.

In the next chapter, we'll examine teacher forcing in detail, understanding both its benefits for training efficiency and its drawbacks in terms of exposure bias.

Key Parameters

When building encoder-decoder models with PyTorch's nn.LSTM or nn.GRU, these parameters have the most significant impact on model behavior:

Model architecture parameters for encoder-decoder models.
Parameter | Typical Values | Description
hidden_size | 256-1024 | Dimensionality of the hidden state and context vector. For translation, 512 is a common starting point. The context vector bottleneck makes this choice critical: too small and information is lost, too large and training becomes slow.
num_layers | 2-4 | Number of stacked RNN layers in both encoder and decoder. Deeper networks capture more complex patterns but require careful initialization and may need residual connections for stable training.
embed_size | 256-512 | Dimensionality of token embeddings. Should be large enough to capture semantic distinctions but not so large that it dominates the parameter count.
dropout | 0.1-0.3 | Probability of dropping connections between LSTM layers (only active when num_layers > 1). Applied between layers, not within recurrent connections.
batch_first | True | When True, input tensors have shape (batch, seq_len, features). Using batch_first=True aligns with common data loading patterns and makes debugging easier.

For training, the following parameters control optimization behavior:

Training parameters for seq2seq optimization.
Parameter | Typical Values | Description
learning_rate | 0.001 | Initial learning rate for the Adam optimizer. Reduce on plateau. Too high causes instability, too low causes slow convergence.
clip_grad | 1.0-5.0 | Maximum gradient norm for clipping. Prevents exploding gradients. Essential for stable training of deep seq2seq models.
ignore_index | 0 (pad token) | Index to ignore in cross-entropy loss (typically the padding token index). Ensures padded positions don't contribute to the loss.

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
