BART Architecture: Encoder-Decoder Design for NLP

Michael Brenndoerfer · January 13, 2025 · 31 min read

Learn BART's encoder-decoder architecture combining BERT and GPT designs. Explore attention patterns, model configurations, and implementation details.

BART Architecture

BART (Bidirectional and Auto-Regressive Transformers) combines ideas from two dominant paradigms in language modeling. When Facebook AI Research (now Meta AI) introduced BART in 2019, the NLP landscape was divided between encoder-only models like BERT, which excelled at understanding tasks, and decoder-only models like GPT, which dominated generation. BART asked a simple question: what if we combined the best of both worlds?

The answer turned out to be remarkably effective. BART uses a standard encoder-decoder architecture built on a key insight: you can pre-train such a model by corrupting text in arbitrary ways and then training it to reconstruct the original. This denoising approach proved especially powerful for tasks that require both understanding input and generating coherent output, such as summarization, translation, and question answering.

Where T5, which we covered in previous chapters, takes an encoder-decoder approach with a specific span corruption objective, BART explores a more flexible design space. It uses the same architectural building blocks but makes different choices about normalization, activation functions, and how pre-training objectives are structured. These differences matter in practice, leading to distinct strengths for different applications.

The BART Encoder-Decoder Design

BART follows the encoder-decoder framework we discussed in Part IX and Part XIII, but its design philosophy draws explicitly from BERT and GPT. To appreciate this design choice, consider the fundamental tension in language modeling: understanding requires seeing the full context (including what comes before and after a word), while generation must proceed sequentially since you cannot use words you have not yet produced. These two requirements seem contradictory, yet both are essential for tasks like summarization where you must deeply understand a document before producing a coherent condensed version.

BART resolves this tension through architectural separation. The encoder is essentially a BERT-style transformer, using bidirectional self-attention over the input sequence and allowing each token to attend to all other tokens. This bidirectional view means that when the encoder processes the word "bank" in a sentence, it can simultaneously consider both the preceding context ("walked along the") and the following context ("of the river") to determine that we're discussing a riverbank rather than a financial institution. The decoder, in contrast, is essentially a GPT-style transformer, using causal (left-to-right) self-attention to ensure the model can only use previously generated tokens when predicting the next one. This constraint is not a limitation, but a necessity. During generation, future tokens simply don't exist yet.

The architecture can be summarized as:

$$\text{BART} = \text{BERT Encoder} + \text{GPT Decoder}$$

where:

  • BERT Encoder: a stack of transformer encoder blocks with bidirectional self-attention, following BERT's design
  • GPT Decoder: a stack of transformer decoder blocks with causal self-attention, following GPT's design

This conceptual equation expresses that BART's architecture combines two components: an encoder following BERT's bidirectional design, and a decoder following GPT's autoregressive design. The "+" here represents architectural composition rather than mathematical addition—the encoder and decoder are connected through cross-attention, a mechanism that allows the decoder to query the encoder's representations as it generates each token.

This formulation isn't just a metaphor. The BART authors explicitly designed the encoder to match BERT's architecture and the decoder to match GPT's, then connected them with cross-attention. This means BART inherits the bidirectional contextual understanding that made BERT successful for classification and extraction tasks, while also gaining the autoregressive generation capabilities that made GPT successful for text generation. The result is a model that can first build a rich, context-aware representation of the input, then leverage that representation to generate fluent, coherent output.

Encoder Structure

The BART encoder processes the input sequence using standard transformer encoder blocks, transforming raw token embeddings into contextualized representations. These representations capture the meaning of each token within its full surrounding context. Each block contains:

  1. Multi-head self-attention with bidirectional (non-causal) masking. This is the mechanism that allows each token to "see" every other token in the input, gathering information from both directions to build context-aware representations.

  2. Feed-forward network with GeLU activation. After attention aggregates information across positions, this two-layer neural network transforms each position's representation independently, allowing the model to compute complex non-linear functions of the attended information.

  3. Residual connections around both sublayers. These skip connections add the input of each sublayer to its output, creating direct gradient pathways that facilitate training deep networks and allowing the model to learn incremental refinements rather than complete transformations.

  4. Layer normalization applied after each sublayer (post-norm). This normalizes the activations to have zero mean and unit variance, stabilizing training by preventing the hidden representations from growing too large or too small as they pass through many layers.

The encoder produces a sequence of hidden states, one for each input token. These states capture rich bidirectional context, as each encoder representation can incorporate information from the entire input sequence. When you pass a sentence through BART's encoder, the representation for each word encodes not just that word's meaning in isolation, but its meaning within the specific context of that particular sentence.
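
To make this block structure concrete, here is a minimal PyTorch sketch of a BART-style encoder layer. The class and argument names are illustrative, not the Hugging Face implementation; the sketch simply wires the four components above together in post-norm order.

```python
import torch.nn as nn

class EncoderBlockSketch(nn.Module):
    """Illustrative BART-style encoder block: bidirectional self-attention,
    GeLU feed-forward network, residual connections, post-layer normalization."""

    def __init__(self, d_model=768, n_heads=12, d_ffn=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model)
        )
        self.ffn_norm = nn.LayerNorm(d_model)

    def forward(self, x, padding_mask=None):
        # Bidirectional self-attention: no causal mask, every token attends to all others
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=padding_mask)
        x = self.attn_norm(x + attn_out)    # residual connection, then post-norm
        x = self.ffn_norm(x + self.ffn(x))  # residual connection, then post-norm
        return x
```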

Decoder Structure

The decoder generates output tokens autoregressively, one at a time, using the encoder's representations to understand the input. This process mirrors how a human might write a summary. First, you read and understand the source document, which corresponds to the encoder. Then you compose the summary word by word while referring back to the original. This corresponds to the decoder with cross-attention. Each decoder block contains:

  1. Causal self-attention over previously generated tokens. This allows each position to attend only to earlier positions in the output sequence, building up a representation of what has been generated so far. The causal constraint ensures the model cannot "cheat" by looking at tokens it hasn't yet produced.

  2. Cross-attention over encoder hidden states. This is the bridge between understanding and generation. It allows the decoder to query the encoder's representations, focusing on different parts of the input as needed for generating each output token.

  3. Feed-forward network with GeLU activation. Just as in the encoder, this transforms the combined self-attention and cross-attention information through a non-linear function.

  4. Residual connections around all three sublayers. With three sublayers instead of two, the decoder has even more opportunity to benefit from these direct pathways that preserve information and facilitate gradient flow.

  5. Layer normalization after each sublayer (post-norm). This maintains training stability across the decoder's deeper structure.

The causal masking in self-attention prevents the decoder from "cheating" by looking at future tokens during training. When training on a target sequence like "The study found significant results," the decoder predicting "found" can see "The" and "study" but not "significant" or "results." Cross-attention allows each decoder position to attend to all encoder positions, enabling the decoder to ground its generation in the input context—when generating "found," it can look back at the relevant parts of the source document.
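
A matching sketch for a decoder layer adds the causal mask and the cross-attention sublayer. Again, the names are illustrative rather than the actual Hugging Face modules, and the block assumes the same post-norm ordering described above.

```python
import torch
import torch.nn as nn

class DecoderBlockSketch(nn.Module):
    """Illustrative BART-style decoder block: causal self-attention,
    cross-attention over encoder states, GeLU feed-forward, post-norm."""

    def __init__(self, d_model=768, n_heads=12, d_ffn=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model)
        )
        self.ffn_norm = nn.LayerNorm(d_model)

    def forward(self, y, encoder_states):
        m = y.size(1)
        # Causal mask: True above the diagonal blocks attention to future positions
        causal = torch.triu(torch.ones(m, m, dtype=torch.bool), diagonal=1)
        self_out, _ = self.self_attn(y, y, y, attn_mask=causal)
        y = self.self_norm(y + self_out)
        # Cross-attention: queries from the decoder, keys/values from the encoder
        cross_out, _ = self.cross_attn(y, encoder_states, encoder_states)
        y = self.cross_norm(y + cross_out)
        y = self.ffn_norm(y + self.ffn(y))
        return y
```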

Key Architectural Choices

BART makes several design decisions that distinguish it from other encoder-decoder models. These choices reflect its heritage from BERT and GPT and affect training dynamics and model behavior.

  • GeLU activation: Unlike T5, which uses ReLU, BART follows BERT and GPT-2 in using the GeLU (Gaussian Error Linear Unit) activation function in feed-forward layers. GeLU provides smoother gradients than ReLU because it doesn't have a sharp transition at zero. Instead, it smoothly interpolates between passing and blocking signals. This smoothness can lead to more stable optimization and has become the preferred activation for many modern language models.

  • Post-layer normalization: BART applies layer normalization after residual connections (known as post-norm), following the original transformer design. T5 uses pre-norm, applying layer normalization before each sublayer. The placement of normalization affects gradient flow. Pre-norm tends to produce more stable gradients at initialization, making it easier to train very deep models. Post-norm can achieve slightly better final performance when training succeeds. This trade-off explains why both conventions persist in practice.

  • Learned positional embeddings: BART uses learned absolute position embeddings, similar to BERT and GPT-2, rather than the relative position encodings used in T5. With learned embeddings, the model maintains a separate embedding vector for each position (1, 2, 3, and so on up to some maximum), and these vectors are learned during training. This approach is simple and effective but creates a hard limit on sequence length. The model has no embedding for position 1025 if it was trained with a maximum of 1024.

  • No parameter sharing: Unlike T5, BART does not tie encoder and decoder embeddings by default. The encoder and decoder maintain separate embedding matrices. This increases parameter count but allows the encoder and decoder to learn specialized representations suited to their different roles (understanding versus generation).

Attention Configuration

Understanding BART's attention patterns is essential for understanding how information flows through the model. Attention is the mechanism by which transformers route information between positions, and the pattern of allowed attention—which positions can attend to which other positions—fundamentally shapes what the model can learn and compute. Let's examine each attention mechanism in detail, building intuition for why each pattern exists and how it serves the model's goals.

Encoder Self-Attention

In the encoder, every token can attend to every other token. This bidirectional attention gives the encoder the power to build contextual representations, because a word's meaning often depends on context that appears both before and after it. Consider disambiguating "The bank was eroding" versus "The bank was closing"; you need both the subject and the verb to understand which sense of "bank" is intended.

For an input sequence of length $n$, the attention mask is an $n \times n$ matrix of ones:

$$\mathbf{M}_{\text{encoder}} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}$$

where:

  • $\mathbf{M}_{\text{encoder}}$: the attention mask matrix of shape $n \times n$
  • $n$: the length of the input sequence
  • Each entry of 1 indicates that attention is permitted between that query-key pair

Reading this matrix, row $i$ describes which positions token $i$ can attend to: a 1 in column $j$ means token $i$ can attend to token $j$. Since every entry is 1, every token can attend to every other token, including itself. This complete connectivity allows information to flow freely through the sequence, enabling the encoder to build representations that incorporate arbitrarily distant context.

This bidirectional attention is what gives encoder-only models like BERT their power for understanding tasks. Each token's representation is informed by the complete context, not just preceding tokens. A word at the beginning of a sentence can be influenced by words at the end, and vice versa, enabling the rich contextual representations that make BERT effective for tasks like sentiment analysis and named entity recognition.
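
In code, this mask is simply a square matrix of ones. A tiny sketch (the sequence length is arbitrary):

```python
import torch

n = 6  # example input length
encoder_mask = torch.ones(n, n)  # entry (i, j) = 1: token i may attend to token j
print(encoder_mask)
```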

Decoder Causal Self-Attention

The decoder uses causal (or autoregressive) masking in its self-attention layers. The word "causal" refers to the structure of language generation. Each token is caused by (depends on) the tokens that came before it, not those that come after. This constraint isn't artificial. During generation, future tokens don't yet exist.

For a sequence of length $m$, the attention mask is a lower-triangular matrix:

$$\mathbf{M}_{\text{decoder}} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \cdots & 1 \end{bmatrix}$$

where:

  • $\mathbf{M}_{\text{decoder}}$: the causal attention mask matrix of shape $m \times m$
  • $m$: the length of the decoder sequence (tokens generated so far)
  • Entry $(i, j) = 1$ if $j \leq i$, meaning position $i$ can attend to position $j$, and 0 otherwise

The lower-triangular structure emerges directly from the causality constraint. Position 1 can only attend to position 1 (itself), so the first row has a single 1. Position 2 can attend to positions 1 and 2, giving two 1s. Position $i$ can attend to all positions from 1 through $i$, creating the triangular pattern. The zeros above the diagonal represent the blocked attention to future positions, which is information that the model must not use because it won't be available at generation time.

This ensures that when generating token $t$, the model can only attend to tokens at positions $1, 2, \ldots, t$. This causal structure makes autoregressive generation possible. During training, we can compute attention for all positions in parallel (the mask handles the constraints), but the model learns to predict each token using only its predecessors, exactly as it will during generation.
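
The lower-triangular mask is equally easy to construct; a small sketch using torch.tril:

```python
import torch

m = 5  # example decoder length
causal_mask = torch.tril(torch.ones(m, m))  # 1 where j <= i, 0 above the diagonal
print(causal_mask)
```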

Cross-Attention

Cross-attention connects the decoder to the encoder, bridging the understanding and generation phases. Without cross-attention, the decoder would not know what input to generate output for. It would become an unconditional language model, generating plausible text without grounding in any specific input.

For each decoder position, queries come from the decoder hidden states while keys and values come from the encoder hidden states. This asymmetry is meaningful. The decoder is asking questions (represented as queries) about the input, and the encoder provides the answers (as values) along with ways to determine relevance (through keys). There is no masking in cross-attention; every decoder position can attend to every encoder position:

$$\text{CrossAttn}(\mathbf{Q}_{\text{dec}}, \mathbf{K}_{\text{enc}}, \mathbf{V}_{\text{enc}}) = \text{softmax}\left(\frac{\mathbf{Q}_{\text{dec}} \mathbf{K}_{\text{enc}}^T}{\sqrt{d_k}}\right) \mathbf{V}_{\text{enc}}$$

where:

  • $\mathbf{Q}_{\text{dec}}$: query vectors derived from decoder hidden states, shape $(m, d_k)$. These represent "what the decoder is looking for" at each position.
  • $\mathbf{K}_{\text{enc}}$: key vectors derived from encoder hidden states, shape $(n, d_k)$. These represent "what information each encoder position offers" and are compared against queries.
  • $\mathbf{V}_{\text{enc}}$: value vectors derived from encoder hidden states, shape $(n, d_v)$. These contain the actual information that will be retrieved and combined.
  • $d_k$: the dimension of query and key vectors (used for scaling)
  • $d_v$: the dimension of value vectors
  • $\sqrt{d_k}$: scaling factor that prevents dot products from growing too large, which would push softmax into regions with vanishing gradients

The computation proceeds in three stages. First, the dot product $\mathbf{Q}_{\text{dec}} \mathbf{K}_{\text{enc}}^T$ computes a compatibility score between each decoder position and each encoder position, resulting in an $m \times n$ matrix of raw attention scores. Second, the softmax operation converts these scaled scores into attention weights that sum to 1 across encoder positions. This is a proper probability distribution over the input, indicating how much attention each output position pays to each input position. Third, these weights are used to compute a weighted combination of encoder values, where positions with higher attention weights contribute more to the final representation.

This allows the decoder to dynamically "look at" different parts of the input when generating each output token. When summarizing a document, the decoder might focus on the introduction when generating the first sentence of the summary, then shift attention to specific details when describing particular findings, and attend to the conclusion when wrapping up. This dynamic alignment is what makes encoder-decoder models so effective for conditional generation tasks.

This dynamic focusing is the same mechanism we explored thoroughly in our discussion of Bahdanau and Luong attention in Part IX.
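
The cross-attention formula translates almost directly into code. Here is a minimal single-head sketch, with random tensors standing in for real decoder and encoder hidden states:

```python
import torch
import torch.nn.functional as F

m, n, d_k = 4, 7, 64           # decoder length, encoder length, key dimension
Q_dec = torch.randn(m, d_k)    # queries derived from decoder hidden states
K_enc = torch.randn(n, d_k)    # keys derived from encoder hidden states
V_enc = torch.randn(n, d_k)    # values derived from encoder hidden states

scores = Q_dec @ K_enc.T / d_k**0.5   # (m, n) compatibility scores, scaled
weights = F.softmax(scores, dim=-1)   # each row sums to 1 over encoder positions
output = weights @ V_enc              # (m, d_k) weighted combination of values
```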

Attention Flow Visualization

To understand how information flows through BART, consider processing an input sequence and generating an output. The flow follows a clear two-phase structure that separates understanding from generation while connecting them through cross-attention.

  1. Encoder phase: Input tokens are processed through $L$ encoder layers. At each layer, bidirectional self-attention allows information to flow freely between all positions. The input might be a long document. By the end of the encoder, each token's representation includes context from the entire document. This phase processes the entire input once, creating a fixed set of representations that the decoder will query.

  2. Decoder phase: For each output token, the decoder performs a sequence of operations that combine the output generated so far with the encoded input:

    • Causal self-attention integrates information from previously generated tokens, building a representation of the output generated so far
    • Cross-attention retrieves relevant information from the encoder, grounding the generation in the input
    • The feed-forward network transforms the combined representation, computing complex functions of the attended information
    • The output projection produces a probability distribution over the vocabulary, from which the next token is selected

This two-phase structure is efficient for tasks with long inputs and shorter outputs. The expensive encoder computation happens once, and the decoder reuses those representations for every generated token.

Out[2]:
Visualization
Heatmap showing encoder bidirectional attention pattern with all ones.
Encoder bidirectional attention mask where every token attends to all others.
Heatmap showing decoder causal attention with lower-triangular pattern.
Decoder causal self-attention mask with lower-triangular structure.
Heatmap showing cross-attention with full connectivity.
Cross-attention mask showing full connectivity from decoder to encoder.

BART vs T5 Comparison

BART and T5 were developed around the same time and share the encoder-decoder architecture, but they differ in several important ways. Understanding these differences helps you choose the right model for your application and shows the design space of encoder-decoder models more broadly.

Architectural Differences

The table below summarizes the key architectural distinctions:

Architectural comparison between BART and T5.
Component            BART                   T5
Activation function  GeLU                   ReLU (original) / GeGLU (v1.1)
Normalization        Post-norm              Pre-norm
Position encoding    Learned absolute       Relative (bucketed)
Embedding sharing    Separate               Tied (encoder-decoder)
Vocabulary           BPE (GPT-2 tokenizer)  SentencePiece

The choice of pre-norm versus post-norm affects training stability. As we discussed in Part XII, pre-norm tends to produce more stable gradients at initialization, which can enable training larger models more easily. The key difference is where normalization occurs in the residual pathway. Pre-norm normalizes the input to each sublayer, ensuring that the residual connection adds a well-scaled update. Post-norm normalizes the output, which can lead to larger variations in the residual pathway. However, post-norm models often achieve slightly better final performance when training is successful, creating a trade-off between training ease and final quality.

BART's use of learned absolute position embeddings means it has a fixed maximum sequence length (typically 1024 tokens), while T5's relative position encoding theoretically generalizes to longer sequences. In practice, both models require additional techniques for handling very long contexts, as we covered in Part XV.

Input-Output Formatting

The most noticeable difference is how input and output are structured:

T5 uses a text-to-text format where every task is framed as mapping an input text to an output text. Tasks are specified through prefixes:

summarize: The researchers conducted experiments... → The study found...
translate English to German: Hello world → Hallo Welt

BART treats tasks more naturally. For sequence-to-sequence tasks like summarization, the input goes to the encoder and the output comes from the decoder. For classification tasks, a special token's representation from the decoder is used for prediction. This is closer to how BERT handles classification, where the [CLS] token representation is used.
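
The Hugging Face library exposes this setup through BartForSequenceClassification, which attaches a classification head to the decoder's representation of the final end-of-sequence token. A brief sketch; the label count and example sentence are arbitrary, and since the base checkpoint has no fine-tuned classification head, the predictions are meaningless until the model is fine-tuned:

```python
import torch
from transformers import BartForSequenceClassification, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
classifier = BartForSequenceClassification.from_pretrained(
    "facebook/bart-base", num_labels=2  # e.g. a binary sentiment task
)
classifier.eval()

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = classifier(**inputs).logits  # shape (1, num_labels)
print(logits.softmax(dim=-1))
```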

Pre-training Objectives

While we'll cover BART's pre-training in detail in the next chapter, here are the high-level differences.

  • T5 uses span corruption, replacing random spans of tokens with sentinel tokens and training the model to generate the missing spans.

  • BART explores multiple noise functions: token masking, token deletion, text infilling, sentence permutation, and document rotation. The full document is reconstructed as output.

T5's span corruption is more computationally efficient because the target sequence is much shorter than the input. BART's document reconstruction means the target is the same length as the uncorrupted input, requiring more computation but providing a stronger training signal for generation tasks.

Performance Trade-offs

In benchmarks, both models show strong performance with different strengths:

  • Summarization: BART tends to perform better, likely because its pre-training requires generating coherent, full-length text rather than filling in short spans.
  • Translation: Performance is similar, with both models achieving strong results when fine-tuned on parallel data.
  • Question answering: Both perform well, with specific results depending on the dataset and fine-tuning setup.
  • Classification: BART can perform classification by using the decoder's representation of a special token, though encoder-only models remain competitive for pure classification.

The practical difference often comes down to implementation details and fine-tuning approach rather than architecture.

BART Model Sizes

BART was released in two primary sizes, following the naming convention established by BERT:

BART-base

The base model provides a good balance between capability and computational requirements.

  • Encoder layers: 6
  • Decoder layers: 6
  • Hidden dimension: 768
  • Attention heads: 12
  • Feed-forward dimension: 3072
  • Parameters: ~140 million

BART-base works well when computational resources are limited or fast inference is required. It can run on consumer GPUs and provides strong performance on tasks such as summarization and question answering.

BART-large

The large model increases capacity significantly.

  • Encoder layers: 12
  • Decoder layers: 12
  • Hidden dimension: 1024
  • Attention heads: 16
  • Feed-forward dimension: 4096
  • Parameters: ~400 million

BART-large performs better on most benchmarks, particularly for complex generation tasks. It requires more memory and computation, typically requiring a GPU with at least 16GB of memory for fine-tuning.

Comparison with Other Models

The following table contextualizes BART's sizes relative to related models:

Parameter counts and layer configurations for BART and related models.
Model         Parameters  Encoder Layers  Decoder Layers
BERT-base     110M        12              -
BERT-large    340M        24              -
GPT-2 Small   124M        -               12
GPT-2 Medium  355M        -               24
BART-base     140M        6               6
BART-large    400M        12              12
T5-base       220M        12              12
T5-large      770M        24              24

BART's parameter count is somewhat lower than T5's for the same size label because T5 uses more layers. BART-large, with 12+12 layers, is closer in depth to T5-base, which also has 12+12 layers, though T5 uses parameter-efficient relative position encodings while BART maintains a separate learned embedding for each position.

Out[3]:
Visualization
Horizontal bar chart comparing parameter counts of BERT, GPT-2, BART, and T5 models.
Parameter counts for encoder-decoder and related models. BART-base and BART-large sit between BERT and T5 in terms of total parameters, reflecting their different layer configurations.

Code Implementation

Let's explore BART's architecture using the Hugging Face Transformers library. We'll examine the model structure, inspect attention patterns, and see how the encoder and decoder interact.

In[4]:
Code
from transformers import BartModel, BartTokenizer, BartConfig

## Load BART-base model and tokenizer with attention output enabled in config
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
config = BartConfig.from_pretrained(
    "facebook/bart-base", output_attentions=True, output_hidden_states=True
)
model = BartModel.from_pretrained("facebook/bart-base", config=config)
model.eval()

First, let's examine the model configuration to confirm the architectural details we discussed:

In[5]:
Code
## Store configuration for inspection
model_config = model.config
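## Presumed (hidden) print statements behind the console output below,
## using standard BartConfig attribute names.
print("BART-base Configuration:")
print(f"  Encoder layers: {model_config.encoder_layers}")
print(f"  Decoder layers: {model_config.decoder_layers}")
print(f"  Hidden dimension: {model_config.d_model}")
print(f"  Attention heads: {model_config.encoder_attention_heads}")
print(f"  FFN dimension: {model_config.encoder_ffn_dim}")
print(f"  Vocabulary size: {model_config.vocab_size}")
print(f"  Max position embeddings: {model_config.max_position_embeddings}")
print(f"  Activation function: {model_config.activation_function}")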
Out[6]:
Console
BART-base Configuration:
  Encoder layers: 6
  Decoder layers: 6
  Hidden dimension: 768
  Attention heads: 12
  FFN dimension: 3072
  Vocabulary size: 50265
  Max position embeddings: 1024
  Activation function: gelu

The configuration confirms our discussion: 6 encoder and decoder layers, 768-dimensional hidden states, 12 attention heads, and GeLU activation. The maximum of 1024 position embeddings reflects BART's use of learned absolute position encoding, matching the architectural specifications from the BART paper.

Now let's pass an example through the model and examine the outputs:

In[7]:
Code
import torch

## Prepare input
input_text = "BART is a denoising autoencoder for pretraining sequence-to-sequence models."
inputs = tokenizer(input_text, return_tensors="pt")

## Create decoder input (shifted right, starting with BOS token)
decoder_input_ids = torch.tensor([[tokenizer.bos_token_id]])
In[8]:
Code
## Forward pass through the full model
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        decoder_input_ids=decoder_input_ids,
        output_attentions=True,
        output_hidden_states=True,
    )
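## Presumed (hidden) print statements behind the console output below,
## reading standard fields of the returned Seq2SeqModelOutput.
print("Model Outputs:")
print(f"  Encoder last hidden state shape: {outputs.encoder_last_hidden_state.shape}")
print(f"  Decoder last hidden state shape: {outputs.last_hidden_state.shape}")
print(f"  Number of encoder attention outputs: {len(outputs.encoder_attentions)}")
print(f"  Number of decoder attention outputs: {len(outputs.decoder_attentions)}")
print(f"  Number of cross-attention outputs: {len(outputs.cross_attentions)}")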
Out[9]:
Console
Model Outputs:
  Encoder last hidden state shape: torch.Size([1, 23, 768])
  Decoder last hidden state shape: torch.Size([1, 1, 768])
  Number of encoder attention outputs: 6
  Number of decoder attention outputs: 6
  Number of cross-attention outputs: 6

The output shapes reveal the flow of information. The encoder produces hidden states for each input token, while the decoder produces hidden states for each position in the output sequence (currently just one, for the BOS token).

Let's visualize the attention patterns from the last layer of each component:

In[10]:
Code
## Extract attention weights (last layer, first head)
encoder_attn = (
    outputs.encoder_attentions[-1][0, 0].detach().numpy()
)  # [seq_len, seq_len]
decoder_self_attn = (
    outputs.decoder_attentions[-1][0, 0].detach().numpy()
)  # [1, 1]
cross_attn = outputs.cross_attentions[-1][0, 0].detach().numpy()  # [1, seq_len]

## Get tokens for labeling
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
Out[11]:
Visualization
Heatmap of encoder self-attention weights showing bidirectional attention pattern.
Encoder self-attention weights from the last layer of BART-base. Each cell shows how much attention token (row) pays to token (column). The bidirectional pattern allows every token to attend to all others.

The encoder attention pattern shows the bidirectional nature of BART's encoder. Each token can attend to all other tokens, with the model learning to focus on contextually relevant positions.

Now let's examine cross-attention by generating a longer output sequence:

In[12]:
Code
import torch
from transformers import BartForConditionalGeneration, BartConfig

## Load model with config that enables attention outputs
config_gen = BartConfig.from_pretrained(
    "facebook/bart-base", output_attentions=True
)
model_gen = BartForConditionalGeneration.from_pretrained(
    "facebook/bart-base", config=config_gen
)
model_gen.eval()

## We'll manually decode a few steps to capture attention
decoder_ids = [tokenizer.bos_token_id]
cross_attentions_per_step = []

with torch.no_grad():
    for step in range(5):
        decoder_input = torch.tensor([decoder_ids])
        gen_outputs = model_gen(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            decoder_input_ids=decoder_input,
            output_attentions=True,
        )

        # Get cross-attention from last layer, first head
        cross_attn_step = (
            gen_outputs.cross_attentions[-1][0, 0, -1, :].detach().numpy()
        )
        cross_attentions_per_step.append(cross_attn_step)

        # Greedy decoding for next token
        next_token = gen_outputs.logits[0, -1, :].argmax().item()
        decoder_ids.append(next_token)

## Stack cross-attentions
import numpy as np

cross_attn_matrix = np.stack(cross_attentions_per_step)
generated_tokens = tokenizer.convert_ids_to_tokens(decoder_ids[1:])  # Skip BOS
Out[13]:
Visualization
Heatmap of cross-attention from decoder to encoder positions.
Cross-attention weights showing how each generated token attends to the encoder output. The model focuses on different parts of the input as it generates each output token.

The cross-attention visualization reveals how the decoder grounds its generation in the input. Each row shows which encoder positions a generated token attended to, demonstrating the dynamic alignment between input and output.

Let's also count parameters to verify the model sizes we discussed:

In[14]:
Code
def count_parameters(model):
    """Count trainable parameters in model components."""
    encoder_params = sum(p.numel() for p in model.model.encoder.parameters())
    decoder_params = sum(p.numel() for p in model.model.decoder.parameters())
    embed_params = sum(p.numel() for p in model.model.shared.parameters())
    lm_head_params = sum(p.numel() for p in model.lm_head.parameters())
    total_params = sum(p.numel() for p in model.parameters())
    return (
        encoder_params,
        decoder_params,
        embed_params,
        lm_head_params,
        total_params,
    )
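## Presumed (hidden) call behind the console output below, applied to the
## conditional-generation model, which has the lm_head this helper expects.
enc_p, dec_p, emb_p, head_p, total_p = count_parameters(model_gen)
print("BART-base Parameter Counts:")
print(f"  Encoder parameters: {enc_p:,}")
print(f"  Decoder parameters: {dec_p:,}")
print(f"  Shared embeddings: {emb_p:,}")
print(f"  Total parameters: {total_p:,}")
print(f"  Total (millions): {total_p / 1e6:.1f}M")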
Out[15]:
Console
BART-base Parameter Counts:
  Encoder parameters: 81,920,256
  Decoder parameters: 96,103,680
  Shared embeddings: 38,603,520
  Total parameters: 139,420,416
  Total (millions): 139.4M

The parameter count confirms that BART-base has approximately 140 million parameters, split roughly evenly between the encoder and decoder with significant additional parameters in the embedding matrices. This aligns with our earlier discussion of BART model sizes and demonstrates how the encoder-decoder architecture distributes capacity across both components.

Out[16]:
Visualization
Pie chart showing BART-base parameter distribution across components.
Distribution of parameters across BART-base components. The encoder and decoder contain similar parameter counts, with embeddings contributing a substantial portion due to the large vocabulary size.

Key Parameters

The key parameters for BART's architecture are listed below, followed by a short configuration sketch:

  • d_model: The hidden dimension size, which is 768 for base and 1024 for large. Controls the representation capacity of the model.
  • encoder_layers / decoder_layers: Number of transformer blocks in each component. More layers increase model capacity but also computational cost.
  • encoder_attention_heads: Number of attention heads (12 for base, 16 for large). Multiple heads allow the model to attend to different aspects of the input simultaneously.
  • encoder_ffn_dim: Dimension of the feed-forward network, typically 4 times the hidden dimension. Controls the capacity of the non-linear transformations.
  • max_position_embeddings: Maximum sequence length the model can process (1024 for BART). Limited by learned absolute position embeddings.
  • activation_function: The non-linearity used in feed-forward layers (GeLU for BART).
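
These parameters map directly onto BartConfig fields. As a sketch, here is how you might instantiate a scaled-down BART for experimentation; the values are arbitrary, chosen only to illustrate the knobs:

```python
from transformers import BartConfig, BartForConditionalGeneration

tiny_config = BartConfig(
    d_model=256,                  # hidden dimension
    encoder_layers=3,             # encoder depth
    decoder_layers=3,             # decoder depth
    encoder_attention_heads=4,
    decoder_attention_heads=4,
    encoder_ffn_dim=1024,         # roughly 4x the hidden dimension
    decoder_ffn_dim=1024,
    max_position_embeddings=512,  # hard limit from learned absolute positions
    activation_function="gelu",
)
tiny_model = BartForConditionalGeneration(tiny_config)  # randomly initialized
print(f"{sum(p.numel() for p in tiny_model.parameters()) / 1e6:.1f}M parameters")
```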

Limitations and Impact

BART's architecture has trade-offs worth understanding.

The post-norm design that BART inherited from the original transformer makes training less stable at large scales compared to pre-norm architectures. This is why T5 and most modern LLMs use pre-norm. When scaling BART beyond its original sizes, learning rate schedules and initialization require careful tuning.

The use of learned absolute position embeddings limits BART's ability to generalize to sequence lengths beyond those seen during training. While the model can process longer sequences by extending position embeddings, performance typically degrades. This contrasts with relative position encoding approaches such as those in T5 or the RoPE embeddings we covered in Part XI, which offer better length generalization.

Computationally, BART's pre-training objective requires reconstructing the entire input document, making pre-training more expensive than T5's span corruption approach. For downstream applications, however, this difference disappears since both models fine-tune and generate output tokens autoregressively.

Despite these limitations, BART had a major impact. It demonstrated that combining BERT-style encoding with GPT-style decoding produces a model that handles both understanding and generation well. The denoising pre-training framework opened new directions for exploring different corruption schemes, which we'll examine in the next chapter.

BART also showed that encoder-decoder architectures work well for conditional generation tasks. While decoder-only models like GPT have since dominated many applications due to their simplicity and scalability, encoder-decoder models like BART remain competitive for tasks where the input and output differ structurally, such as in document summarization or data-to-text generation.

Summary

BART combines a BERT-style bidirectional encoder with a GPT-style autoregressive decoder, creating a model for both understanding and generation. The encoder processes input with full bidirectional self-attention, while the decoder uses causal self-attention for generation and cross-attention to condition on the encoder's representations.

Compared to T5, BART makes different architectural choices. These include GeLU activation instead of ReLU, post-norm instead of pre-norm, learned absolute positions instead of relative positions, and separate embedding matrices for encoder and decoder. These differences reflect BART's design philosophy of directly combining proven components from BERT and GPT rather than exploring novel architectural variations.

BART comes in base (140M parameters) and large (400M parameters) configurations. Both follow the same overall structure, differing in depth (6 versus 12 layers per stack), hidden dimension (768 versus 1024), and number of attention heads (12 versus 16).

The next chapter explores BART's pre-training in detail, examining the various noising functions that teach the model to reconstruct corrupted text and how these objectives shape the model's capabilities for downstream tasks.
