Encoder-Decoder Framework: Seq2Seq Architecture for Machine Translation

Michael Brenndoerfer · December 16, 2025 · 43 min read

Learn the encoder-decoder framework for sequence-to-sequence learning, including context vectors, LSTM implementations, and the bottleneck problem that motivated attention mechanisms.

Encoder-Decoder Framework

In the previous part, we explored how recurrent neural networks process sequences, capturing temporal dependencies through hidden states that accumulate information over time. But there's a fundamental limitation we haven't addressed: what if the input and output sequences have different lengths? How do you translate "The cat sat on the mat" (six words) into "Le chat était assis sur le tapis" (seven words)? How do you summarize a 500-word article into a 50-word abstract?

These variable-length sequence-to-sequence problems require a new architectural paradigm. The encoder-decoder framework, introduced by Sutskever et al. in 2014, provides an elegant solution: use one RNN to compress the input sequence into a fixed-size representation, then use another RNN to generate the output sequence from that representation. This simple idea unlocked machine translation, text summarization, and countless other applications that transform one sequence into another.

The Core Insight: Separate Reading from Writing

Before diving into architecture details, let's understand the fundamental insight behind encoder-decoder models. Consider how a human translator works. First, they read the entire source sentence, building a mental understanding of its meaning. Then, they produce the translation word by word, consulting their mental representation of the original. They don't translate word by word as they read, because that would fail to capture context and produce awkward, literal translations.

The encoder-decoder framework mirrors this process. The encoder "reads" the input sequence, compressing it into a dense vector representation called the context vector. The decoder then "writes" the output sequence, using the context vector as its guide. This separation allows each component to specialize: the encoder focuses on understanding, while the decoder focuses on generation.

Out[3]:
Visualization
Diagram showing encoder processing input sequence into context vector, which feeds into decoder generating output sequence.
High-level view of the encoder-decoder architecture. The encoder processes the input sequence left-to-right, compressing it into a context vector. The decoder then generates the output sequence one token at a time, conditioned on the context vector.

The figure illustrates the basic flow. The encoder processes "The cat sat" sequentially, with each hidden state h_t incorporating information from all previous words. After processing the final word, the encoder's hidden state becomes the context vector c. The decoder then generates the translation, starting with a special start token <s> and producing one word at a time until it outputs an end token.

The Encoder: Compressing Input into Meaning

The encoder's job is straightforward: process the input sequence and produce a fixed-size representation that captures its meaning. We can use any RNN architecture for this purpose, whether vanilla RNN, LSTM, or GRU. In practice, LSTMs and GRUs dominate due to their ability to capture long-range dependencies.

Encoder Architecture

For an input sequence x_1, x_2, \ldots, x_T of length T, the encoder computes a sequence of hidden states:

h_t = \text{RNN}_{\text{enc}}(x_t, h_{t-1})

where:

  • x_t: the input token at timestep t, typically represented as an embedding vector
  • h_{t-1}: the previous hidden state, carrying information from tokens x_1, \ldots, x_{t-1}
  • h_t: the new hidden state, now incorporating information from x_1, \ldots, x_t
  • \text{RNN}_{\text{enc}}: the encoder's recurrent function (LSTM, GRU, etc.)

The final hidden state h_T serves as the context vector c, summarizing the entire input sequence in a single vector:

c = h_T

where:

  • c: the context vector that will be passed to the decoder
  • h_T: the encoder's hidden state after processing all T input tokens

This is remarkably simple, but there's a subtle point worth emphasizing. The context vector must encode everything the decoder needs to know about the input. For a 100-word input sentence, all the semantic content, syntactic structure, and nuance must be compressed into a vector of perhaps 512 or 1024 dimensions. This compression is both the power and the limitation of the basic encoder-decoder framework.

Out[4]:
Visualization
Diagram showing LSTM cells processing word embeddings sequentially with hidden state connections.
Detailed view of the encoder processing a sequence. Each LSTM cell receives the current word embedding and the previous hidden state, producing a new hidden state. The final hidden state becomes the context vector that summarizes the entire input.

Implementing the Encoder

Let's implement a basic LSTM encoder in PyTorch. The implementation is straightforward because PyTorch's nn.LSTM handles the sequential processing internally:

In[5]:
Code
class Encoder(nn.Module):
    def __init__(
        self, vocab_size, embed_size, hidden_size, num_layers=1, dropout=0.1
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Embedding layer converts token indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embed_size)

        # LSTM processes the sequence of embeddings
        self.lstm = nn.LSTM(
            embed_size,
            hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
        )

    def forward(self, src):
        # src shape: (batch_size, seq_len)

        # Convert tokens to embeddings
        embedded = self.embedding(src)  # (batch_size, seq_len, embed_size)

        # Process through LSTM
        outputs, (hidden, cell) = self.lstm(embedded)
        # outputs: (batch_size, seq_len, hidden_size) - all hidden states
        # hidden: (num_layers, batch_size, hidden_size) - final hidden state per layer
        # cell: (num_layers, batch_size, hidden_size) - final cell state per layer

        return hidden, cell

The encoder returns both the hidden state and cell state (for LSTM). These together form the context that initializes the decoder. Notice that we don't return the outputs tensor containing all hidden states. In the basic encoder-decoder framework, only the final hidden state matters. Later, when we add attention, we'll need those intermediate states.

In[6]:
Code
# Test the encoder
vocab_size = 10000
embed_size = 256
hidden_size = 512
num_layers = 2

encoder = Encoder(vocab_size, embed_size, hidden_size, num_layers)

# Simulate a batch of 4 sentences, each 20 tokens long
batch_size, seq_len = 4, 20
src = torch.randint(0, vocab_size, (batch_size, seq_len))

hidden, cell = encoder(src)
Out[7]:
Console
Encoder Output Shapes:
  Input: torch.Size([4, 20])
  Hidden state: torch.Size([2, 4, 512])
  Cell state: torch.Size([2, 4, 512])

The context vector has 1024 total dimensions
(2 layers × 512 hidden units per layer)

The hidden state has shape (num_layers, batch_size, hidden_size). For a 2-layer LSTM, this means we have two hidden vectors per sequence: one from the first layer and one from the second. Both contribute to the context that initializes the decoder.
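
If you need a single vector per sequence, for instance for visualization, one common choice (an assumption here, not something the decoder below requires) is the top layer's final state; another is to concatenate the layers:

# hidden has shape (num_layers, batch_size, hidden_size)
top_layer_state = hidden[-1]  # (batch_size, hidden_size): final state of the top LSTM layer

# Or flatten all layers into one vector per sequence
flat_context = hidden.permute(1, 0, 2).reshape(hidden.shape[1], -1)
print(top_layer_state.shape, flat_context.shape)  # torch.Size([4, 512]) torch.Size([4, 1024])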

To visualize how information accumulates in the encoder, let's examine the hidden state activations as the encoder processes a sequence:

Out[8]:
Visualization
Heatmap showing encoder hidden state values evolving across sequence positions.
Encoder hidden state activations across sequence positions (first 32 dimensions shown). Each column represents the hidden state after processing one token. Early positions show sparse activation patterns, while later positions show denser patterns as the encoder accumulates information from the entire sequence. The final column becomes the context vector.

The heatmap reveals how the encoder builds up its representation. Each column shows the hidden state after processing one more token. Notice how the activation patterns change across positions: some dimensions respond strongly to specific words, while others accumulate information gradually. The rightmost column is the context vector c that gets passed to the decoder. It must encode everything about the input sequence that the decoder needs for translation.

The Decoder: Generating Output from Context

The decoder's job is more complex than the encoder's. It must generate the output sequence one token at a time, where each token depends on the context vector and all previously generated tokens. This autoregressive generation creates a dependency chain: to generate token t, you need tokens 1, 2, \ldots, t-1.

Decoder Architecture

The decoder is also an RNN, but with a crucial difference: its initial hidden state comes from the encoder's context vector rather than being initialized to zeros. At each timestep t, the decoder:

  1. Takes the previous output token y_{t-1} as input
  2. Updates its hidden state using the RNN
  3. Produces a probability distribution over the vocabulary
  4. Samples or selects the next token y_t

Mathematically, the decoder performs two operations at each timestep. First, it updates its hidden state by combining the previous token with its memory of what it has generated so far:

s_t = \text{RNN}_{\text{dec}}(y_{t-1}, s_{t-1})

where:

  • s_t: the decoder's hidden state at timestep t, encoding information about all previously generated tokens
  • y_{t-1}: the embedding of the previous output token (or the start token <s> when t=1)
  • s_{t-1}: the decoder's hidden state from the previous timestep
  • \text{RNN}_{\text{dec}}: the decoder's recurrent function (LSTM, GRU, etc.)

The crucial initialization is s_0 = c, meaning the decoder starts with the context vector from the encoder as its initial hidden state. This is how information flows from the encoder to the decoder.

Second, the decoder converts its hidden state into a probability distribution over the vocabulary to predict the next token:

P(y_t | y_{<t}, c) = \text{softmax}(W_o s_t + b_o)

where:

  • P(y_t | y_{<t}, c): probability distribution over all vocabulary tokens for position t
  • W_o: output projection weight matrix of shape (vocab_size, hidden_size)
  • s_t: the current decoder hidden state
  • b_o: output projection bias vector of shape (vocab_size,)
  • y_{<t}: all previously generated tokens y_1, \ldots, y_{t-1}
  • c: the context vector (implicitly encoded in the hidden states through initialization)

The softmax function converts the raw scores (logits) into a valid probability distribution that sums to 1, allowing us to either sample from this distribution or take the most probable token.
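
As a minimal sketch of that last step, using a hypothetical logits tensor rather than the model defined later in this chapter:

import torch
import torch.nn.functional as F

logits = torch.randn(1, 12000)  # hypothetical raw decoder scores over a 12,000-token vocabulary
probs = F.softmax(logits, dim=-1)  # valid probability distribution (sums to 1)

greedy_token = probs.argmax(dim=-1)  # most probable token (greedy decoding)
sampled_token = torch.multinomial(probs, num_samples=1)  # or draw a token from the distribution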

To see this concretely, let's visualize what a typical decoder output looks like. The decoder produces a probability distribution over the entire vocabulary at each timestep:

Out[9]:
Visualization
Bar chart showing probability distribution over vocabulary tokens with 'chat' having highest probability.
Example decoder output showing the probability distribution over vocabulary tokens at a single timestep. The model assigns high probability to a few likely candidates ('chat', 'chien') while spreading small probabilities across thousands of other tokens. Greedy decoding selects 'chat' (0.72), but beam search might explore 'chien' (0.15) as well.

This visualization shows a typical decoder output. The model has learned that "chat" (cat) is the most likely next word given the context, assigning it 72% probability. Alternative translations like "chien" (dog) receive smaller but non-negligible probability. The long tail of the distribution spreads tiny probabilities across thousands of other vocabulary tokens.

Out[10]:
Visualization
Diagram showing decoder LSTM cells generating tokens sequentially with softmax output layers.
Detailed view of the decoder generating output tokens. The decoder is initialized with the context vector from the encoder. At each step, it takes the previous output token, updates its hidden state, and predicts the next token through a softmax layer.

Implementing the Decoder

The decoder implementation requires careful handling of the autoregressive generation process:

In[11]:
Code
class Decoder(nn.Module):
    def __init__(
        self, vocab_size, embed_size, hidden_size, num_layers=1, dropout=0.1
    ):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Embedding for target tokens
        self.embedding = nn.Embedding(vocab_size, embed_size)

        # LSTM for sequential generation
        self.lstm = nn.LSTM(
            embed_size,
            hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
        )

        # Output projection to vocabulary
        self.fc_out = nn.Linear(hidden_size, vocab_size)

    def forward(self, trg, hidden, cell):
        # trg shape: (batch_size, trg_len)
        # hidden, cell: from encoder, shape (num_layers, batch_size, hidden_size)

        # Embed target tokens
        embedded = self.embedding(trg)  # (batch_size, trg_len, embed_size)

        # Process through LSTM, initialized with encoder states
        outputs, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        # outputs: (batch_size, trg_len, hidden_size)

        # Project to vocabulary
        predictions = self.fc_out(outputs)  # (batch_size, trg_len, vocab_size)

        return predictions, hidden, cell

During training, we feed the entire target sequence to the decoder at once. This is called teacher forcing, which we'll cover in detail in the next chapter. The key insight is that during training, we know the correct output sequence, so we can compute all timesteps in parallel rather than generating one token at a time.

Connecting Encoder and Decoder: The Seq2Seq Model

Now let's combine the encoder and decoder into a complete sequence-to-sequence model:

In[12]:
Code
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

        # Ensure encoder and decoder have compatible dimensions
        assert encoder.hidden_size == decoder.hidden_size
        assert encoder.num_layers == decoder.num_layers

    def forward(self, src, trg):
        # src: (batch_size, src_len) - source sequence
        # trg: (batch_size, trg_len) - target sequence (for teacher forcing)

        # Encode the source sequence
        hidden, cell = self.encoder(src)

        # Decode using encoder's final state as initial context
        outputs, _, _ = self.decoder(trg, hidden, cell)

        return outputs
In[13]:
Code
# Create the complete model
src_vocab_size = 10000  # Source language vocabulary
trg_vocab_size = 12000  # Target language vocabulary
embed_size = 256
hidden_size = 512
num_layers = 2

encoder = Encoder(src_vocab_size, embed_size, hidden_size, num_layers)
decoder = Decoder(trg_vocab_size, embed_size, hidden_size, num_layers)
model = Seq2Seq(encoder, decoder)

# Test forward pass
src = torch.randint(
    0, src_vocab_size, (4, 20)
)  # 4 source sentences, 20 tokens each
trg = torch.randint(
    0, trg_vocab_size, (4, 25)
)  # 4 target sentences, 25 tokens each

outputs = model(src, trg)
Out[14]:
Console
Seq2Seq Model Test:
  Source shape: torch.Size([4, 20])
  Target shape: torch.Size([4, 25])
  Output shape: torch.Size([4, 25, 12000])

Output is logits over 12000 target vocabulary tokens
for each of 25 positions in the target sequence

The output has shape (batch_size, trg_len, vocab_size), containing unnormalized log-probabilities (logits) for each position in the target sequence. During training, we compute the cross-entropy loss between these predictions and the actual target tokens.

The Context Vector Bottleneck

The basic encoder-decoder architecture has a fundamental limitation: all information about the source sequence must pass through a single fixed-size context vector. This creates an information bottleneck that becomes increasingly problematic as sequences grow longer.

Out[15]:
Visualization
Diagram showing long input sequence being compressed through narrow bottleneck into context vector.
The context vector bottleneck problem. A 50-word sentence must be compressed into the same 512-dimensional vector as a 5-word sentence. As input length increases, the context vector becomes increasingly overloaded, losing fine-grained information.

Consider what happens when translating a long sentence. The encoder must compress all the nuances, word relationships, and semantic content into perhaps 512 or 1024 numbers. Early words in the sequence are processed many timesteps before the context vector is formed, so their information must survive through many LSTM updates. Despite the LSTM's gating mechanisms, some information inevitably degrades.
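
We can see the fixed-size constraint directly with the encoder instance built earlier: a 5-token input and a 50-token input produce context states of exactly the same shape.

short_src = torch.randint(0, vocab_size, (1, 5))   # 5-token input
long_src = torch.randint(0, vocab_size, (1, 50))   # 50-token input

short_hidden, _ = encoder(short_src)
long_hidden, _ = encoder(long_src)

# Both are (num_layers, 1, hidden_size): the context has the same capacity
# no matter how much information the input contains.
print(short_hidden.shape, long_hidden.shape)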

This bottleneck manifests in several ways:

  • Translation quality degrades for long sentences: Research showed that basic seq2seq models performed well on sentences under 20 words but quality dropped sharply for longer inputs
  • The decoder lacks access to specific source positions: When generating a word, the decoder can't "look back" at a specific part of the input
  • Information about word order can be lost: The context vector may capture what concepts are present but lose their precise arrangement

We can visualize this degradation by examining how well the decoder can reconstruct different parts of the input:

Out[16]:
Visualization
Line plot showing reconstruction accuracy decreasing for earlier positions in the sequence.
Simulated reconstruction accuracy as a function of position in the source sequence. Words near the end of the input (processed just before the context vector is formed) are reconstructed more accurately than words at the beginning, which must survive through many LSTM updates.

The attention mechanism, which we'll cover later in this part, directly addresses the bottleneck problem by allowing the decoder to access all encoder hidden states, not just the final one. But understanding the bottleneck is crucial for appreciating why attention was such an important breakthrough.

The bottleneck's impact on real-world performance was documented in the original seq2seq papers. Translation quality, measured by BLEU score, degrades systematically as source sentences get longer:

Out[17]:
Visualization
Line plot showing BLEU score decreasing with sentence length for basic seq2seq but remaining stable for attention models.
Translation quality (BLEU score) as a function of source sentence length for basic seq2seq models. Performance is strong for short sentences but degrades significantly beyond 20-30 words, motivating the development of attention mechanisms. The dashed line shows how attention-based models maintain quality across all lengths.

This empirical observation was a key motivation for developing attention. Basic seq2seq models achieve competitive BLEU scores on short sentences (under 20 words) but performance drops sharply for longer inputs. The attention mechanism, shown as the dashed line, maintains quality regardless of length by allowing the decoder to directly access relevant parts of the source sequence rather than relying solely on the compressed context vector.

Seq2Seq for Machine Translation

Machine translation was the driving application for encoder-decoder models. Let's walk through a complete example of how the model processes a translation task.

The Translation Pipeline

Consider translating "The cat sat on the mat" to French. The pipeline proceeds as follows:

  1. Tokenization: Convert the English sentence to token indices using a source vocabulary
  2. Encoding: Process tokens through the encoder to get the context vector
  3. Decoding: Generate French tokens one at a time, starting with a start token
  4. Detokenization: Convert output indices back to French words
In[18]:
Code
# Simulate a simple translation example
# In practice, you'd use real tokenizers and vocabularies

# Simulated vocabularies
src_vocab = {
    "<pad>": 0,
    "<s>": 1,
    "</s>": 2,
    "the": 3,
    "cat": 4,
    "sat": 5,
    "on": 6,
    "mat": 7,
}
trg_vocab = {
    "<pad>": 0,
    "<s>": 1,
    "</s>": 2,
    "le": 3,
    "chat": 4,
    "était": 5,
    "assis": 6,
    "sur": 7,
    "tapis": 8,
}

# Reverse mapping for decoding
idx_to_trg = {v: k for k, v in trg_vocab.items()}


def encode_sentence(sentence, vocab):
    """Convert sentence to tensor of indices."""
    tokens = sentence.lower().split()
    indices = [vocab.get(t, 0) for t in tokens]
    return torch.tensor([indices])  # Add batch dimension


def decode_indices(indices, idx_to_word):
    """Convert indices back to words."""
    return " ".join([idx_to_word.get(i, "<unk>") for i in indices])


# Encode source sentence
src_sentence = "the cat sat on the mat"
src_tensor = encode_sentence(src_sentence, src_vocab)

# For training, we also have the target (with start token prepended)
trg_sentence = "<s> le chat était assis sur le tapis"
trg_tensor = encode_sentence(trg_sentence, trg_vocab)
Out[19]:
Console
Translation Example:
  Source: 'the cat sat on the mat'
  Source indices: [3, 4, 5, 6, 3, 7]

  Target: '<s> le chat était assis sur le tapis'
  Target indices: [1, 3, 4, 5, 6, 7, 3, 8]

The source sentence maps to indices [3, 4, 5, 6, 3, 7], where repeated words like "the" map to the same index (3). The target includes the start token <s> (index 1) prepended, which tells the decoder to begin generating. Notice that the source has 6 tokens while the target has 8, demonstrating how seq2seq handles variable-length mappings.

Training the Translation Model

During training, we use teacher forcing: the decoder receives the correct previous token at each step, not its own predictions. This allows parallel computation and stable training:

In[20]:
Code
def train_step(model, src, trg, criterion, optimizer):
    """
    Single training step for seq2seq model.

    Args:
        model: Seq2Seq model
        src: Source sequence (batch_size, src_len)
        trg: Target sequence (batch_size, trg_len), including start token
        criterion: Loss function (CrossEntropyLoss)
        optimizer: Optimizer

    Returns:
        loss value
    """
    optimizer.zero_grad()

    # Forward pass
    # Input to decoder: all tokens except the last (which has no next token to predict)
    # Target for loss: all tokens except the first (the start token)
    output = model(src, trg[:, :-1])  # (batch, trg_len-1, vocab_size)

    # Reshape for loss computation
    output_dim = output.shape[-1]
    output = output.contiguous().view(
        -1, output_dim
    )  # (batch * (trg_len-1), vocab_size)
    trg_flat = trg[:, 1:].contiguous().view(-1)  # (batch * (trg_len-1),)

    # Compute loss
    loss = criterion(output, trg_flat)

    # Backward pass
    loss.backward()

    # Gradient clipping to prevent exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()

    return loss.item()

The loss computation deserves a closer look. We compare the model's predictions (excluding the last position, which would predict beyond the end of the sequence) against the target tokens (excluding the start token, which is input, not output). This offset alignment is crucial for correct training.
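
A concrete toy example of the offset, using hypothetical indices where 1 is <s> and 2 is </s>:

# Hypothetical target: <s>(1) le(3) chat(4) </s>(2)
trg_example = torch.tensor([[1, 3, 4, 2]])

decoder_input = trg_example[:, :-1]  # [[1, 3, 4]] -> <s>, le, chat (fed to the decoder)
loss_target = trg_example[:, 1:]     # [[3, 4, 2]] -> le, chat, </s> (what it must predict)
print(decoder_input.tolist(), loss_target.tolist())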

Inference: Generating Translations

At inference time, we don't have the target sequence. We must generate tokens autoregressively, feeding each prediction back as input for the next step:

In[21]:
Code
def translate(model, src, max_len=50, start_token=1, end_token=2):
    """
    Generate translation for a source sequence.

    Args:
        model: Trained Seq2Seq model
        src: Source sequence (1, src_len)
        max_len: Maximum output length
        start_token: Index of <s> token
        end_token: Index of </s> token

    Returns:
        List of generated token indices
    """
    model.eval()

    with torch.no_grad():
        # Encode source
        hidden, cell = model.encoder(src)

        # Start with start token
        current_token = torch.tensor([[start_token]])
        generated = [start_token]

        for _ in range(max_len):
            # Decode one step
            output, hidden, cell = model.decoder(current_token, hidden, cell)

            # Get most likely next token
            next_token = output.argmax(dim=-1).item()
            generated.append(next_token)

            # Stop if end token generated
            if next_token == end_token:
                break

            # Prepare input for next step
            current_token = torch.tensor([[next_token]])

    return generated

This greedy decoding always selects the most probable next token. In practice, beam search (covered in a later chapter) often produces better results by exploring multiple hypotheses simultaneously.
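
As a quick usage sketch with the toy example from earlier (the model is untrained, so the generated tokens are meaningless and mostly map to <unk>):

generated = translate(model, src_tensor)      # src_tensor built in the toy example above
print(decode_indices(generated, idx_to_trg))  # untrained model: output is essentially random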

Seq2Seq for Text Summarization

Machine translation maps sequences of similar lengths, but the encoder-decoder framework handles arbitrary length ratios. Text summarization compresses long documents into short summaries, making it another natural application.

Out[22]:
Visualization
Diagram showing long document being encoded and decoded into short summary.
Seq2seq for summarization. A long document (many encoder steps) is compressed into a context vector, which the decoder expands into a short summary (few decoder steps). The extreme compression ratio makes the bottleneck problem especially severe for summarization.

Summarization presents unique challenges compared to translation:

  • Extreme compression ratios: A 500-word article might become a 50-word summary, requiring 10:1 compression
  • Content selection: The model must decide what information is important enough to include
  • Abstraction vs extraction: Should the summary use words from the source or generate new phrasings?

The basic encoder-decoder model struggles with these challenges. The bottleneck problem is especially severe when compressing long documents. Later innovations like attention and copy mechanisms significantly improved summarization quality.

Training Setup and Considerations

Training seq2seq models requires careful attention to several practical details: choosing an appropriate loss function, handling variable-length sequences through padding and masking, preventing gradient explosions, and tuning the learning rate schedule. This section covers each of these considerations with practical code examples.

Loss Function

We use cross-entropy loss to train the model. At each position in the output sequence, the model predicts a probability distribution over the vocabulary, and we penalize it based on how much probability it assigns to the correct token. Summing these penalties across all positions gives us the total loss for a sequence:

\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t^* | y_{<t}, c)

where:

  • \mathcal{L}: the total loss for the sequence (lower is better)
  • T: the length of the target sequence
  • y_t^*: the correct (ground truth) target token at position t
  • y_{<t}: all target tokens before position t, i.e., y_1, y_2, \ldots, y_{t-1}
  • c: the context vector from the encoder
  • P(y_t^* | y_{<t}, c): the probability the model assigns to the correct token y_t^*, given the context and previous tokens

The negative log transforms probabilities into losses: when the model assigns probability 1.0 to the correct token, -\log(1.0) = 0 (no loss). When it assigns probability 0.01, -\log(0.01) \approx 4.6 (high loss). This encourages the model to assign high probability to the correct next token at every position.

Let's visualize this relationship to build intuition for how cross-entropy loss penalizes predictions:

Out[23]:
Visualization
Line plot showing negative log function with loss on y-axis and probability on x-axis.
The negative log loss function used in cross-entropy. When the model assigns high probability to the correct token (right side), loss is low. When it assigns low probability (left side), loss increases sharply. This steep penalty for confident wrong predictions drives the model to be well-calibrated.

The curve shows why cross-entropy is effective for training: it penalizes confidently wrong predictions much more severely than uncertain ones. A model that assigns only 1% probability to the correct token incurs more than six times the loss of one assigning 50%. This steep gradient in the low-probability region provides a strong learning signal when the model makes mistakes.
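
A quick numerical check of this penalty, using Python's standard math module:

import math

for p in [1.0, 0.5, 0.1, 0.01]:
    print(f"p = {p:5.2f}  ->  loss = {-math.log(p):.2f}")
# p = 1.00 -> 0.00, p = 0.50 -> 0.69, p = 0.10 -> 2.30, p = 0.01 -> 4.61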

Handling Variable-Length Sequences

Real data contains sequences of varying lengths. We handle this through padding and masking:

In[24]:
Code
def create_mask(seq, pad_idx=0):
    """
    Create a padding mask for a batch of sequences.
    Returns True for real tokens, False for padding.
    """
    return seq != pad_idx


# Example: batch with different length sequences
sequences = [
    [1, 5, 8, 3, 2],  # 5 real tokens
    [1, 7, 4, 2, 0],  # 4 real tokens + 1 pad
    [1, 6, 9, 3, 8, 2],  # 6 real tokens (longest, sets the padded length)
]

# In practice, pad to max length in batch
max_len = 6
padded = torch.tensor(
    [
        [1, 5, 8, 3, 2, 0],
        [1, 7, 4, 2, 0, 0],
        [1, 6, 9, 3, 8, 2],
    ]
)

mask = create_mask(padded)
Out[25]:
Console
Padded sequences:
tensor([[1, 5, 8, 3, 2, 0],
        [1, 7, 4, 2, 0, 0],
        [1, 6, 9, 3, 8, 2]])

Mask (True = real token):
tensor([[ True,  True,  True,  True,  True, False],
        [ True,  True,  True,  True, False, False],
        [ True,  True,  True,  True,  True,  True]])

The padded tensor shows zeros appended to shorter sequences to match the maximum length of 6. The mask tensor marks which positions contain real tokens (True) versus padding (False). During loss computation, we use this mask to ensure the model isn't penalized for predictions at padded positions, which would distort the training signal.

The mask is used during loss computation to ignore predictions at padded positions. PyTorch's CrossEntropyLoss supports an ignore_index parameter for this purpose.
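
For example, with the padding index of 0 used throughout this chapter, the loss can be configured so that padded positions contribute nothing:

criterion = nn.CrossEntropyLoss(ignore_index=0)  # positions whose target index is 0 (padding) add no loss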

Gradient Clipping

Seq2seq models are prone to exploding gradients due to the long computational graphs created by unrolling through time. Gradient clipping limits the gradient magnitude:

In[26]:
Code
# Gradient clipping is essential for stable training.
# Call it after loss.backward() and before optimizer.step().
max_grad_norm = 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

A typical value is 1.0 or 5.0. Without clipping, training often diverges with NaN losses.

Learning Rate and Optimization

Adam optimizer with learning rate scheduling works well for seq2seq models:

In[27]:
Code
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Learning rate decay: reduce by factor of 0.5 when validation loss plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

Starting with a learning rate around 10^{-3} and reducing it when progress stalls typically works well.

A typical training run shows characteristic loss curves as the model learns to translate:

Out[28]:
Visualization
Line plot showing training and validation loss over epochs with learning rate reduction markers.
Typical training and validation loss curves for a seq2seq translation model. Training loss decreases steadily, while validation loss initially follows but eventually plateaus or increases slightly, indicating the onset of overfitting. The learning rate is reduced at epochs 15 and 25 (vertical dashed lines) when validation loss stalls.

This visualization shows several important training dynamics. Early epochs show rapid loss reduction as the model learns basic translation patterns. The gap between training and validation loss indicates generalization: a small gap means the model generalizes well, while a growing gap signals overfitting. Learning rate reductions (marked by vertical lines) help the model escape local minima and continue improving. The best model checkpoint is typically saved when validation loss is lowest, around epoch 18-20 in this example.

Putting It Together: A Complete Training Loop

Let's implement a complete training loop that incorporates all these considerations:

In[29]:
Code
def train_epoch(model, data_loader, optimizer, criterion, clip_grad=1.0):
    """
    Train for one epoch.

    Args:
        model: Seq2Seq model
        data_loader: DataLoader yielding (src, trg) batches
        optimizer: Optimizer
        criterion: Loss function
        clip_grad: Maximum gradient norm

    Returns:
        Average loss for the epoch
    """
    model.train()
    total_loss = 0

    for src, trg in data_loader:
        optimizer.zero_grad()

        # Forward pass (teacher forcing)
        output = model(src, trg[:, :-1])

        # Compute loss
        output_dim = output.shape[-1]
        output = output.contiguous().view(-1, output_dim)
        trg_flat = trg[:, 1:].contiguous().view(-1)

        loss = criterion(output, trg_flat)

        # Backward pass
        loss.backward()

        # Clip gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)

        # Update weights
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(data_loader)


def evaluate(model, data_loader, criterion):
    """
    Evaluate model on validation/test data.

    Returns:
        Average loss
    """
    model.eval()
    total_loss = 0

    with torch.no_grad():
        for src, trg in data_loader:
            output = model(src, trg[:, :-1])

            output_dim = output.shape[-1]
            output = output.contiguous().view(-1, output_dim)
            trg_flat = trg[:, 1:].contiguous().view(-1)

            loss = criterion(output, trg_flat)
            total_loss += loss.item()

    return total_loss / len(data_loader)
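
A minimal outer loop tying these pieces together might look like the following sketch; train_loader, val_loader, the epoch count, and the checkpoint path are assumptions for illustration, not part of the code above.

num_epochs = 30
best_val_loss = float("inf")

criterion = nn.CrossEntropyLoss(ignore_index=0)  # 0 is the padding index
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)  # train_loader: assumed DataLoader
    val_loss = evaluate(model, val_loader, criterion)                    # val_loader: assumed DataLoader
    scheduler.step(val_loss)  # reduce LR if validation loss has plateaued

    if val_loss < best_val_loss:  # checkpoint the best model seen so far
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_seq2seq.pt")

    print(f"Epoch {epoch + 1:02d}: train loss {train_loss:.3f}, val loss {val_loss:.3f}")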

Limitations and Impact

The encoder-decoder framework was a breakthrough that enabled end-to-end learning for sequence-to-sequence tasks. Before this architecture, machine translation relied on complex pipelines with separate components for alignment, phrase extraction, and language modeling. The seq2seq approach replaced all of this with a single neural network trained end-to-end.

However, the basic architecture has significant limitations that motivated subsequent research. The context vector bottleneck forces all source information through a fixed-size vector, causing information loss for long sequences. Sutskever et al.'s original paper showed that reversing the source sequence improved results: reversal leaves the average distance between corresponding words unchanged, but it places the first source words much closer to the first target words, which the decoder must generate first. This trick highlighted the underlying problem: information about tokens processed early by the encoder must survive the longest chain of updates before the decoder can use it.

The rigid encoder-decoder separation also limits flexibility. The encoder must finish processing before the decoder can start, preventing any interaction between reading and writing. Human translators don't work this way. They might read part of a sentence, start translating, then look back at the source for clarification. The attention mechanism, which we'll cover in subsequent chapters, addresses this by allowing the decoder to "look back" at any part of the encoded sequence.

Despite these limitations, the encoder-decoder framework established several principles that remain central to modern sequence modeling. The idea of encoding variable-length input into a fixed representation, then decoding back to variable-length output, appears in countless architectures. The separation of understanding (encoding) from generation (decoding) provides a clean abstraction that simplifies model design. And the end-to-end training paradigm, where the entire system is optimized jointly for the final task, has become the dominant approach in NLP.

The seq2seq architecture also demonstrated the power of recurrent networks for complex language tasks. While transformers have since surpassed RNN-based models on most benchmarks, the conceptual framework of encoder-decoder remains. Modern transformer models like T5 and BART use the same high-level architecture: encode the input, then decode the output. The attention mechanism that made transformers possible was first developed to address the bottleneck problem in RNN-based seq2seq models.

Summary

This chapter introduced the encoder-decoder framework, the foundational architecture for sequence-to-sequence learning. We covered how this paradigm separates the tasks of understanding input sequences and generating output sequences, enabling applications like machine translation and text summarization.

The encoder processes the input sequence through an RNN, compressing it into a fixed-size context vector that represents the input's meaning. The decoder, initialized with this context vector, generates the output sequence one token at a time, using each prediction as input for the next step.

The context vector bottleneck is the key limitation of basic seq2seq models. All information about the input must pass through this single vector, causing information loss for long sequences. This bottleneck motivated the development of attention mechanisms, which we'll explore in upcoming chapters.

Key implementation details include:

  • Use LSTM or GRU cells for both encoder and decoder to capture long-range dependencies
  • Initialize the decoder's hidden state with the encoder's final hidden state
  • Apply teacher forcing during training, feeding correct tokens rather than predictions
  • Use cross-entropy loss with masking to handle variable-length sequences
  • Clip gradients to prevent exploding gradients during backpropagation

The encoder-decoder framework established the paradigm for sequence-to-sequence learning that persists in modern architectures. While attention and transformers have improved upon the basic design, the core insight of separating encoding from decoding remains central to how we approach sequence transformation tasks.

In the next chapter, we'll examine teacher forcing in detail, understanding both its benefits for training efficiency and its drawbacks in terms of exposure bias.

Key Parameters

When building encoder-decoder models with PyTorch's nn.LSTM or nn.GRU, these parameters have the most significant impact on model behavior:

Model architecture parameters for encoder-decoder models.
Parameter | Typical Values | Description
hidden_size | 256-1024 | Dimensionality of the hidden state and context vector. For translation, 512 is a common starting point. The context vector bottleneck makes this choice critical: too small and information is lost, too large and training becomes slow.
num_layers | 2-4 | Number of stacked RNN layers in both encoder and decoder. Deeper networks capture more complex patterns but require careful initialization and may need residual connections for stable training.
embed_size | 256-512 | Dimensionality of token embeddings. Should be large enough to capture semantic distinctions but not so large that it dominates the parameter count.
dropout | 0.1-0.3 | Probability of dropping connections between LSTM layers (only active when num_layers > 1). Applied between layers, not within recurrent connections.
batch_first | True | When True, input tensors have shape (batch, seq_len, features). Using batch_first=True aligns with common data loading patterns and makes debugging easier.

For training, the following parameters control optimization behavior:

Training parameters for seq2seq optimization.
Parameter | Typical Values | Description
learning_rate | 0.001 | Initial learning rate for the Adam optimizer. Reduce on plateau. Too high causes instability, too low causes slow convergence.
clip_grad | 1.0-5.0 | Maximum gradient norm for clipping. Prevents exploding gradients. Essential for stable training of deep seq2seq models.
ignore_index | 0 (pad token) | Index to ignore in cross-entropy loss (typically the padding token index). Ensures padded positions don't contribute to the loss.

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
