BART Architecture: Encoder-Decoder Design for NLP

Michael Brenndoerfer · January 13, 2025 · 31 min read

Learn BART's encoder-decoder architecture combining BERT and GPT designs. Explore attention patterns, model configurations, and implementation details.

BART Architecture

BART (Bidirectional and Auto-Regressive Transformers) combines ideas from two dominant paradigms in language modeling. When Facebook AI Research (now Meta AI) introduced BART in 2019, the NLP landscape was divided between encoder-only models like BERT, which excelled at understanding tasks, and decoder-only models like GPT, which dominated generation. BART asked a simple question: what if we combined the best of both worlds?

The answer turned out to be remarkably effective. BART uses a standard encoder-decoder architecture built on a key insight: you can pre-train such a model by corrupting text in arbitrary ways and then training it to reconstruct the original. This denoising approach proved especially powerful for tasks that require both understanding input and generating coherent output, such as summarization, translation, and question answering.

Where T5, which we covered in previous chapters, takes an encoder-decoder approach with a specific span corruption objective, BART explores a more flexible design space. It uses the same architectural building blocks but makes different choices about normalization, activation functions, and how pre-training objectives are structured. These differences matter in practice, leading to distinct strengths for different applications.

The BART Encoder-Decoder Design

BART follows the encoder-decoder framework we discussed in Part IX and Part XIII, but its design philosophy draws explicitly from BERT and GPT. To appreciate this design choice, consider the fundamental tension in language modeling: understanding requires seeing the full context (including what comes before and after a word), while generation must proceed sequentially since you cannot use words you have not yet produced. These two requirements seem contradictory, yet both are essential for tasks like summarization where you must deeply understand a document before producing a coherent condensed version.

BART resolves this tension through architectural separation. The encoder is essentially a BERT-style transformer, using bidirectional self-attention over the input sequence and allowing each token to attend to all other tokens. This bidirectional view means that when the encoder processes the word "bank" in a sentence, it can simultaneously consider both the preceding context ("walked along the") and the following context ("of the river") to determine that we're discussing a riverbank rather than a financial institution. The decoder, in contrast, is essentially a GPT-style transformer, using causal (left-to-right) self-attention to ensure the model can only use previously generated tokens when predicting the next one. This constraint is not a limitation, but a necessity. During generation, future tokens simply don't exist yet.

The architecture can be summarized as:

$$\text{BART} = \text{BERT Encoder} + \text{GPT Decoder}$$

where:

  • BERT Encoder: a stack of transformer encoder blocks with bidirectional self-attention, following BERT's design
  • GPT Decoder: a stack of transformer decoder blocks with causal self-attention, following GPT's design

This conceptual equation expresses that BART's architecture combines two components: an encoder following BERT's bidirectional design, and a decoder following GPT's autoregressive design. The "+" here represents architectural composition rather than mathematical addition—the encoder and decoder are connected through cross-attention, a mechanism that allows the decoder to query the encoder's representations as it generates each token.

This formulation isn't just a metaphor. The BART authors explicitly designed the encoder to match BERT's architecture and the decoder to match GPT's, then connected them with cross-attention. This means BART inherits the bidirectional contextual understanding that made BERT successful for classification and extraction tasks, while also gaining the autoregressive generation capabilities that made GPT successful for text generation. The result is a model that can first build a rich, context-aware representation of the input, then leverage that representation to generate fluent, coherent output.

Encoder Structure

The BART encoder processes the input sequence using standard transformer encoder blocks, transforming raw token embeddings into contextualized representations. These representations capture the meaning of each token within its full surrounding context. Each block contains:

  1. Multi-head self-attention with bidirectional (non-causal) masking. This is the mechanism that allows each token to "see" every other token in the input, gathering information from both directions to build context-aware representations.

  2. Feed-forward network with GeLU activation. After attention aggregates information across positions, this two-layer neural network transforms each position's representation independently, allowing the model to compute complex non-linear functions of the attended information.

  3. Residual connections around both sublayers. These skip connections add the input of each sublayer to its output, creating direct gradient pathways that facilitate training deep networks and allowing the model to learn incremental refinements rather than complete transformations.

  4. Layer normalization applied after each sublayer (post-norm). This normalizes the activations to have zero mean and unit variance, stabilizing training by preventing the hidden representations from growing too large or too small as they pass through many layers.

The encoder produces a sequence of hidden states, one for each input token. These states capture rich bidirectional context, as each encoder representation can incorporate information from the entire input sequence. When you pass a sentence through BART's encoder, the representation for each word encodes not just that word's meaning in isolation, but its meaning within the specific context of that particular sentence.
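
To make this block structure concrete, here is a minimal PyTorch sketch of a BART-style encoder layer. The class and argument names are illustrative, not the Hugging Face implementation; the sketch simply wires the four components above together in post-norm order.

```python
import torch.nn as nn

class EncoderBlockSketch(nn.Module):
    """Illustrative BART-style encoder block: bidirectional self-attention,
    GeLU feed-forward network, residual connections, post-layer normalization."""

    def __init__(self, d_model=768, n_heads=12, d_ffn=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model)
        )
        self.ffn_norm = nn.LayerNorm(d_model)

    def forward(self, x, padding_mask=None):
        # Bidirectional self-attention: no causal mask, every token attends to all others
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=padding_mask)
        x = self.attn_norm(x + attn_out)    # residual connection, then post-norm
        x = self.ffn_norm(x + self.ffn(x))  # residual connection, then post-norm
        return x
```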

Decoder Structure

The decoder generates output tokens autoregressively, one at a time, using the encoder's representations to understand the input. This process mirrors how a human might write a summary. First, you read and understand the source document, which corresponds to the encoder. Then you compose the summary word by word while referring back to the original. This corresponds to the decoder with cross-attention. Each decoder block contains:

  1. Causal self-attention over previously generated tokens. This allows each position to attend only to earlier positions in the output sequence, building up a representation of what has been generated so far. The causal constraint ensures the model cannot "cheat" by looking at tokens it hasn't yet produced.

  2. Cross-attention over encoder hidden states. This is the bridge between understanding and generation. It allows the decoder to query the encoder's representations, focusing on different parts of the input as needed for generating each output token.

  3. Feed-forward network with GeLU activation. Just as in the encoder, this transforms the combined self-attention and cross-attention information through a non-linear function.

  4. Residual connections around all three sublayers. With three sublayers instead of two, the decoder has even more opportunity to benefit from these direct pathways that preserve information and facilitate gradient flow.

  5. Layer normalization after each sublayer (post-norm). This maintains training stability across the decoder's deeper structure.

The causal masking in self-attention prevents the decoder from "cheating" by looking at future tokens during training. When training on a target sequence like "The study found significant results," the decoder predicting "found" can see "The" and "study" but not "significant" or "results." Cross-attention allows each decoder position to attend to all encoder positions, enabling the decoder to ground its generation in the input context—when generating "found," it can look back at the relevant parts of the source document.
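
A matching sketch for a decoder layer adds the causal mask and the cross-attention sublayer. Again, the names are illustrative rather than the actual Hugging Face modules, and the block assumes the same post-norm ordering described above.

```python
import torch
import torch.nn as nn

class DecoderBlockSketch(nn.Module):
    """Illustrative BART-style decoder block: causal self-attention,
    cross-attention over encoder states, GeLU feed-forward, post-norm."""

    def __init__(self, d_model=768, n_heads=12, d_ffn=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model)
        )
        self.ffn_norm = nn.LayerNorm(d_model)

    def forward(self, y, encoder_states):
        m = y.size(1)
        # Causal mask: True above the diagonal blocks attention to future positions
        causal = torch.triu(torch.ones(m, m, dtype=torch.bool), diagonal=1)
        self_out, _ = self.self_attn(y, y, y, attn_mask=causal)
        y = self.self_norm(y + self_out)
        # Cross-attention: queries from the decoder, keys/values from the encoder
        cross_out, _ = self.cross_attn(y, encoder_states, encoder_states)
        y = self.cross_norm(y + cross_out)
        y = self.ffn_norm(y + self.ffn(y))
        return y
```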

Key Architectural Choices

BART makes several design decisions that distinguish it from other encoder-decoder models. These choices reflect its heritage from BERT and GPT and affect training dynamics and model behavior.

  • GeLU activation: Unlike T5, which uses ReLU, BART follows BERT and GPT-2 in using the GeLU (Gaussian Error Linear Unit) activation function in feed-forward layers. GeLU provides smoother gradients than ReLU because it doesn't have a sharp transition at zero. Instead, it smoothly interpolates between passing and blocking signals. This smoothness can lead to more stable optimization and has become the preferred activation for many modern language models.

  • Post-layer normalization: BART applies layer normalization after residual connections (known as post-norm), following the original transformer design. T5 uses pre-norm, applying layer normalization before each sublayer. The placement of normalization affects gradient flow. Pre-norm tends to produce more stable gradients at initialization, making it easier to train very deep models. Post-norm can achieve slightly better final performance when training succeeds. This trade-off explains why both conventions persist in practice.

  • Learned positional embeddings: BART uses learned absolute position embeddings, similar to BERT and GPT-2, rather than the relative position encodings used in T5. With learned embeddings, the model maintains a separate embedding vector for each position (1, 2, 3, and so on up to some maximum), and these vectors are learned during training. This approach is simple and effective but creates a hard limit on sequence length. The model has no embedding for position 1025 if it was trained with a maximum of 1024.

  • No parameter sharing: Unlike T5, BART does not tie encoder and decoder embeddings by default. The encoder and decoder maintain separate embedding matrices. This increases parameter count but allows the encoder and decoder to learn specialized representations suited to their different roles (understanding versus generation).

Attention Configuration

Understanding BART's attention patterns is essential for understanding how information flows through the model. Attention is the mechanism by which transformers route information between positions, and the pattern of allowed attention—which positions can attend to which other positions—fundamentally shapes what the model can learn and compute. Let's examine each attention mechanism in detail, building intuition for why each pattern exists and how it serves the model's goals.

Encoder Self-Attention

In the encoder, every token can attend to every other token. This bidirectional attention gives the encoder the power to build contextual representations, because a word's meaning often depends on context that appears both before and after it. Consider disambiguating "The bank was eroding" versus "The bank was closing"; you need both the subject and the verb to understand which sense of "bank" is intended.

For an input sequence of length $n$, the attention mask is an $n \times n$ matrix of ones:

$$\mathbf{M}_{\text{encoder}} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}$$

where:

  • $\mathbf{M}_{\text{encoder}}$: the attention mask matrix of shape $n \times n$
  • $n$: the length of the input sequence
  • Each entry of 1 indicates that attention is permitted between that query-key pair

Reading this matrix, row $i$ describes which positions token $i$ can attend to: a 1 in column $j$ means token $i$ can attend to token $j$. Since every entry is 1, every token can attend to every other token, including itself. This complete connectivity allows information to flow freely through the sequence, enabling the encoder to build representations that incorporate arbitrarily distant context.

This bidirectional attention is what gives encoder-only models like BERT their power for understanding tasks. Each token's representation is informed by the complete context, not just preceding tokens. A word at the beginning of a sentence can be influenced by words at the end, and vice versa, enabling the rich contextual representations that make BERT effective for tasks like sentiment analysis and named entity recognition.
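
In code, this mask is simply a square matrix of ones. A tiny sketch (the sequence length is arbitrary):

```python
import torch

n = 6  # example input length
encoder_mask = torch.ones(n, n)  # entry (i, j) = 1: token i may attend to token j
print(encoder_mask)
```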

Decoder Causal Self-Attention

The decoder uses causal (or autoregressive) masking in its self-attention layers. The word "causal" refers to the structure of language generation. Each token is caused by (depends on) the tokens that came before it, not those that come after. This constraint isn't artificial. During generation, future tokens don't yet exist.

For a sequence of length $m$, the attention mask is a lower-triangular matrix:

$$\mathbf{M}_{\text{decoder}} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \cdots & 1 \end{bmatrix}$$

where:

  • $\mathbf{M}_{\text{decoder}}$: the causal attention mask matrix of shape $m \times m$
  • $m$: the length of the decoder sequence (tokens generated so far)
  • Entry $(i, j) = 1$ if $j \leq i$, meaning position $i$ can attend to position $j$, and 0 otherwise

The lower-triangular structure emerges directly from the causality constraint. Position 1 can only attend to position 1 (itself), so the first row has a single 1. Position 2 can attend to positions 1 and 2, giving two 1s. Position $i$ can attend to all positions from 1 through $i$, creating the triangular pattern. The zeros above the diagonal represent the blocked attention to future positions, which is information that the model must not use because it won't be available at generation time.

This ensures that when generating token $t$, the model can only attend to tokens at positions $1, 2, \ldots, t$. This causal structure makes autoregressive generation possible. During training, we can compute attention for all positions in parallel (the mask handles the constraints), but the model learns to predict each token using only its predecessors, exactly as it will during generation.
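
The lower-triangular mask is equally easy to construct; a small sketch using torch.tril:

```python
import torch

m = 5  # example decoder length
causal_mask = torch.tril(torch.ones(m, m))  # 1 where j <= i, 0 above the diagonal
print(causal_mask)
```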

Cross-Attention

Cross-attention connects the decoder to the encoder, bridging the understanding and generation phases. Without cross-attention, the decoder would not know what input to generate output for. It would become an unconditional language model, generating plausible text without grounding in any specific input.

For each decoder position, queries come from the decoder hidden states while keys and values come from the encoder hidden states. This asymmetry is meaningful. The decoder is asking questions (represented as queries) about the input, and the encoder provides the answers (as values) along with ways to determine relevance (through keys). There is no masking in cross-attention; every decoder position can attend to every encoder position:

$$\text{CrossAttn}(\mathbf{Q}_{\text{dec}}, \mathbf{K}_{\text{enc}}, \mathbf{V}_{\text{enc}}) = \text{softmax}\left(\frac{\mathbf{Q}_{\text{dec}} \mathbf{K}_{\text{enc}}^T}{\sqrt{d_k}}\right) \mathbf{V}_{\text{enc}}$$

where:

  • $\mathbf{Q}_{\text{dec}}$: query vectors derived from decoder hidden states, shape $(m, d_k)$. These represent "what the decoder is looking for" at each position.
  • $\mathbf{K}_{\text{enc}}$: key vectors derived from encoder hidden states, shape $(n, d_k)$. These represent "what information each encoder position offers" and are compared against queries.
  • $\mathbf{V}_{\text{enc}}$: value vectors derived from encoder hidden states, shape $(n, d_v)$. These contain the actual information that will be retrieved and combined.
  • $d_k$: the dimension of query and key vectors (used for scaling)
  • $d_v$: the dimension of value vectors
  • $\sqrt{d_k}$: scaling factor that prevents dot products from growing too large, which would push softmax into regions with vanishing gradients

The computation proceeds in three stages. First, the dot product $\mathbf{Q}_{\text{dec}} \mathbf{K}_{\text{enc}}^T$ computes a compatibility score between each decoder position and each encoder position, resulting in an $m \times n$ matrix of raw attention scores. Second, the softmax operation converts these scaled scores into attention weights that sum to 1 across encoder positions. This is a proper probability distribution over the input, indicating how much attention each output position pays to each input position. Third, these weights are used to compute a weighted combination of encoder values, where positions with higher attention weights contribute more to the final representation.

This allows the decoder to dynamically "look at" different parts of the input when generating each output token. When summarizing a document, the decoder might focus on the introduction when generating the first sentence of the summary, then shift attention to specific details when describing particular findings, and attend to the conclusion when wrapping up. This dynamic alignment is what makes encoder-decoder models so effective for conditional generation tasks.

This dynamic focusing is the same mechanism we explored thoroughly in our discussion of Bahdanau and Luong attention in Part IX.
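
The cross-attention formula translates almost directly into code. Here is a minimal single-head sketch, with random tensors standing in for real decoder and encoder hidden states:

```python
import torch
import torch.nn.functional as F

m, n, d_k = 4, 7, 64           # decoder length, encoder length, key dimension
Q_dec = torch.randn(m, d_k)    # queries derived from decoder hidden states
K_enc = torch.randn(n, d_k)    # keys derived from encoder hidden states
V_enc = torch.randn(n, d_k)    # values derived from encoder hidden states

scores = Q_dec @ K_enc.T / d_k**0.5   # (m, n) compatibility scores, scaled
weights = F.softmax(scores, dim=-1)   # each row sums to 1 over encoder positions
output = weights @ V_enc              # (m, d_k) weighted combination of values
```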

Attention Flow Visualization

To understand how information flows through BART, consider processing an input sequence and generating an output. The flow follows a clear two-phase structure that separates understanding from generation while connecting them through cross-attention.

  1. Encoder phase: Input tokens are processed through $L$ encoder layers. At each layer, bidirectional self-attention allows information to flow freely between all positions. The input might be a long document. By the end of the encoder, each token's representation includes context from the entire document. This phase processes the entire input once, creating a fixed set of representations that the decoder will query.

  2. Decoder phase: For each output token, the decoder performs a sequence of operations that combine the output generated so far with the encoded input:

    • Causal self-attention integrates information from previously generated tokens, building a representation of the output generated so far
    • Cross-attention retrieves relevant information from the encoder, grounding the generation in the input
    • The feed-forward network transforms the combined representation, computing complex functions of the attended information
    • The output projection produces a probability distribution over the vocabulary, from which the next token is selected

This two-phase structure is efficient for tasks with long inputs and shorter outputs. The expensive encoder computation happens once, and the decoder reuses those representations for every generated token.

Out[2]:
Visualization
Heatmap showing encoder bidirectional attention pattern with all ones.
Encoder bidirectional attention mask where every token attends to all others.
Heatmap showing decoder causal attention with lower-triangular pattern.
Decoder causal self-attention mask with lower-triangular structure.
Heatmap showing cross-attention with full connectivity.
Cross-attention mask showing full connectivity from decoder to encoder.

BART vs T5 Comparison

BART and T5 were developed around the same time and share the encoder-decoder architecture, but they differ in several important ways. Understanding these differences helps you choose the right model for your application and shows the design space of encoder-decoder models more broadly.

Architectural Differences

The table below summarizes the key architectural distinctions:

Architectural comparison between BART and T5.
Component            BART                   T5
Activation function  GeLU                   ReLU (original) / GeGLU (v1.1)
Normalization        Post-norm              Pre-norm
Position encoding    Learned absolute       Relative (bucketed)
Embedding sharing    Separate               Tied (encoder-decoder)
Vocabulary           BPE (GPT-2 tokenizer)  SentencePiece

The choice of pre-norm versus post-norm affects training stability. As we discussed in Part XII, pre-norm tends to produce more stable gradients at initialization, which can enable training larger models more easily. The key difference is where normalization occurs in the residual pathway. Pre-norm normalizes the input to each sublayer, ensuring that the residual connection adds a well-scaled update. Post-norm normalizes the output, which can lead to larger variations in the residual pathway. However, post-norm models often achieve slightly better final performance when training is successful, creating a trade-off between training ease and final quality.

BART's use of learned absolute position embeddings means it has a fixed maximum sequence length (typically 1024 tokens), while T5's relative position encoding theoretically generalizes to longer sequences. In practice, both models require additional techniques for handling very long contexts, as we covered in Part XV.

Input-Output Formatting

The most noticeable difference is how input and output are structured:

T5 uses a text-to-text format where every task is framed as mapping an input text to an output text. Tasks are specified through prefixes:

summarize: The researchers conducted experiments... → The study found...
translate English to German: Hello world → Hallo Welt

BART treats tasks more naturally. For sequence-to-sequence tasks like summarization, the input goes to the encoder and the output comes from the decoder. For classification tasks, a special token's representation from the decoder is used for prediction. This is closer to how BERT handles classification, where the [CLS] token representation is used.
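
The Hugging Face library exposes this setup through BartForSequenceClassification, which attaches a classification head to the decoder's representation of the final end-of-sequence token. A brief sketch; the label count and example sentence are arbitrary, and since the base checkpoint has no fine-tuned classification head, the predictions are meaningless until the model is fine-tuned:

```python
import torch
from transformers import BartForSequenceClassification, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
classifier = BartForSequenceClassification.from_pretrained(
    "facebook/bart-base", num_labels=2  # e.g. a binary sentiment task
)
classifier.eval()

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = classifier(**inputs).logits  # shape (1, num_labels)
print(logits.softmax(dim=-1))
```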

Pre-training Objectives

While we'll cover BART's pre-training in detail in the next chapter, here are the high-level differences.

  • T5 uses span corruption, replacing random spans of tokens with sentinel tokens and training the model to generate the missing spans.

  • BART explores multiple noise functions: token masking, token deletion, text infilling, sentence permutation, and document rotation. The full document is reconstructed as output.

T5's span corruption is more computationally efficient because the target sequence is much shorter than the input. BART's document reconstruction means the target is the same length as the uncorrupted input, requiring more computation but providing a stronger training signal for generation tasks.

Performance Trade-offs

In benchmarks, both models show strong performance with different strengths:

  • Summarization: BART tends to perform better, likely because its pre-training requires generating coherent, full-length text rather than filling in short spans.
  • Translation: Performance is similar, with both models achieving strong results when fine-tuned on parallel data.
  • Question answering: Both perform well, with specific results depending on the dataset and fine-tuning setup.
  • Classification: BART can perform classification by using the decoder's representation of a special token, though encoder-only models remain competitive for pure classification.

The practical difference often comes down to implementation details and fine-tuning approach rather than architecture.

BART Model Sizes

BART was released in two primary sizes, following the naming convention established by BERT:

BART-base

The base model provides a good balance between capability and computational requirements.

  • Encoder layers: 6
  • Decoder layers: 6
  • Hidden dimension: 768
  • Attention heads: 12
  • Feed-forward dimension: 3072
  • Parameters: ~140 million

BART-base works well when computational resources are limited or fast inference is required. It can run on consumer GPUs and provides strong performance on tasks such as summarization and question answering.

BART-large

The large model increases capacity significantly.

  • Encoder layers: 12
  • Decoder layers: 12
  • Hidden dimension: 1024
  • Attention heads: 16
  • Feed-forward dimension: 4096
  • Parameters: ~400 million

BART-large performs better on most benchmarks, particularly for complex generation tasks. It requires more memory and computation, typically requiring a GPU with at least 16GB of memory for fine-tuning.

Comparison with Other Models

The following table contextualizes BART's sizes relative to related models:

Parameter counts and layer configurations for BART and related models.
Model         Parameters  Encoder Layers  Decoder Layers
BERT-base     110M        12              -
BERT-large    340M        24              -
GPT-2 Small   124M        -               12
GPT-2 Medium  355M        -               24
BART-base     140M        6               6
BART-large    400M        12              12
T5-base       220M        12              12
T5-large      770M        24              24

BART's parameter count is somewhat lower than T5's for the same size label because T5 uses more layers. BART-large, with 12+12 layers, is closer in depth to T5-base, which also has 12+12 layers, though T5 uses parameter-efficient relative position encodings while BART maintains a separate learned embedding for each position.

Out[3]:
Visualization
Horizontal bar chart comparing parameter counts of BERT, GPT-2, BART, and T5 models.
Parameter counts for encoder-decoder and related models. BART-base and BART-large sit between BERT and T5 in terms of total parameters, reflecting their different layer configurations.

Code Implementation

Let's explore BART's architecture using the Hugging Face Transformers library. We'll examine the model structure, inspect attention patterns, and see how the encoder and decoder interact.

In[4]:
Code
from transformers import BartModel, BartTokenizer, BartConfig

## Load BART-base model and tokenizer with attention output enabled in config
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
config = BartConfig.from_pretrained(
    "facebook/bart-base", output_attentions=True, output_hidden_states=True
)
model = BartModel.from_pretrained("facebook/bart-base", config=config)
model.eval()

First, let's examine the model configuration to confirm the architectural details we discussed:

In[5]:
Code
## Store configuration for inspection
model_config = model.config
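## Presumed (hidden) print statements behind the console output below,
## using standard BartConfig attribute names.
print("BART-base Configuration:")
print(f"  Encoder layers: {model_config.encoder_layers}")
print(f"  Decoder layers: {model_config.decoder_layers}")
print(f"  Hidden dimension: {model_config.d_model}")
print(f"  Attention heads: {model_config.encoder_attention_heads}")
print(f"  FFN dimension: {model_config.encoder_ffn_dim}")
print(f"  Vocabulary size: {model_config.vocab_size}")
print(f"  Max position embeddings: {model_config.max_position_embeddings}")
print(f"  Activation function: {model_config.activation_function}")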
Out[6]:
Console
BART-base Configuration:
  Encoder layers: 6
  Decoder layers: 6
  Hidden dimension: 768
  Attention heads: 12
  FFN dimension: 3072
  Vocabulary size: 50265
  Max position embeddings: 1024
  Activation function: gelu

The configuration confirms our discussion: 6 encoder and decoder layers, 768-dimensional hidden states, 12 attention heads, and GeLU activation. The maximum of 1024 position embeddings reflects BART's use of learned absolute position encoding, matching the architectural specifications from the BART paper.

Now let's pass an example through the model and examine the outputs:

In[7]:
Code
import torch

## Prepare input
input_text = "BART is a denoising autoencoder for pretraining sequence-to-sequence models."
inputs = tokenizer(input_text, return_tensors="pt")

## Create decoder input (shifted right, starting with BOS token)
decoder_input_ids = torch.tensor([[tokenizer.bos_token_id]])
In[8]:
Code
## Forward pass through the full model
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        decoder_input_ids=decoder_input_ids,
        output_attentions=True,
        output_hidden_states=True,
    )
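## Presumed (hidden) print statements behind the console output below,
## reading standard fields of the returned Seq2SeqModelOutput.
print("Model Outputs:")
print(f"  Encoder last hidden state shape: {outputs.encoder_last_hidden_state.shape}")
print(f"  Decoder last hidden state shape: {outputs.last_hidden_state.shape}")
print(f"  Number of encoder attention outputs: {len(outputs.encoder_attentions)}")
print(f"  Number of decoder attention outputs: {len(outputs.decoder_attentions)}")
print(f"  Number of cross-attention outputs: {len(outputs.cross_attentions)}")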
Out[9]:
Console
Model Outputs:
  Encoder last hidden state shape: torch.Size([1, 23, 768])
  Decoder last hidden state shape: torch.Size([1, 1, 768])
  Number of encoder attention outputs: 6
  Number of decoder attention outputs: 6
  Number of cross-attention outputs: 6

The output shapes reveal the flow of information. The encoder produces hidden states for each input token, while the decoder produces hidden states for each position in the output sequence (currently just one, for the BOS token).

Let's visualize the attention patterns from the last layer of each component:

In[10]:
Code
## Extract attention weights (last layer, first head)
encoder_attn = (
    outputs.encoder_attentions[-1][0, 0].detach().numpy()
)  # [seq_len, seq_len]
decoder_self_attn = (
    outputs.decoder_attentions[-1][0, 0].detach().numpy()
)  # [1, 1]
cross_attn = outputs.cross_attentions[-1][0, 0].detach().numpy()  # [1, seq_len]

## Get tokens for labeling
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
Out[11]:
Visualization
Heatmap of encoder self-attention weights showing bidirectional attention pattern.
Encoder self-attention weights from the last layer of BART-base. Each cell shows how much attention token (row) pays to token (column). The bidirectional pattern allows every token to attend to all others.

The encoder attention pattern shows the bidirectional nature of BART's encoder. Each token can attend to all other tokens, with the model learning to focus on contextually relevant positions.

Now let's examine cross-attention by generating a longer output sequence:

In[12]:
Code
import torch
from transformers import BartForConditionalGeneration, BartConfig

## Load model with config that enables attention outputs
config_gen = BartConfig.from_pretrained(
    "facebook/bart-base", output_attentions=True
)
model_gen = BartForConditionalGeneration.from_pretrained(
    "facebook/bart-base", config=config_gen
)
model_gen.eval()

## We'll manually decode a few steps to capture attention
decoder_ids = [tokenizer.bos_token_id]
cross_attentions_per_step = []

with torch.no_grad():
    for step in range(5):
        decoder_input = torch.tensor([decoder_ids])
        gen_outputs = model_gen(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            decoder_input_ids=decoder_input,
            output_attentions=True,
        )

        # Get cross-attention from last layer, first head
        cross_attn_step = (
            gen_outputs.cross_attentions[-1][0, 0, -1, :].detach().numpy()
        )
        cross_attentions_per_step.append(cross_attn_step)

        # Greedy decoding for next token
        next_token = gen_outputs.logits[0, -1, :].argmax().item()
        decoder_ids.append(next_token)

## Stack cross-attentions
import numpy as np

cross_attn_matrix = np.stack(cross_attentions_per_step)
generated_tokens = tokenizer.convert_ids_to_tokens(decoder_ids[1:])  # Skip BOS
Out[13]:
Visualization
Heatmap of cross-attention from decoder to encoder positions.
Cross-attention weights showing how each generated token attends to the encoder output. The model focuses on different parts of the input as it generates each output token.

The cross-attention visualization reveals how the decoder grounds its generation in the input. Each row shows which encoder positions a generated token attended to, demonstrating the dynamic alignment between input and output.

Let's also count parameters to verify the model sizes we discussed:

In[14]:
Code
def count_parameters(model):
    """Count trainable parameters in model components."""
    encoder_params = sum(p.numel() for p in model.model.encoder.parameters())
    decoder_params = sum(p.numel() for p in model.model.decoder.parameters())
    embed_params = sum(p.numel() for p in model.model.shared.parameters())
    lm_head_params = sum(p.numel() for p in model.lm_head.parameters())
    total_params = sum(p.numel() for p in model.parameters())
    return (
        encoder_params,
        decoder_params,
        embed_params,
        lm_head_params,
        total_params,
    )
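## Presumed (hidden) call behind the console output below, applied to the
## conditional-generation model, which has the lm_head this helper expects.
enc_p, dec_p, emb_p, head_p, total_p = count_parameters(model_gen)
print("BART-base Parameter Counts:")
print(f"  Encoder parameters: {enc_p:,}")
print(f"  Decoder parameters: {dec_p:,}")
print(f"  Shared embeddings: {emb_p:,}")
print(f"  Total parameters: {total_p:,}")
print(f"  Total (millions): {total_p / 1e6:.1f}M")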
Out[15]:
Console
BART-base Parameter Counts:
  Encoder parameters: 81,920,256
  Decoder parameters: 96,103,680
  Shared embeddings: 38,603,520
  Total parameters: 139,420,416
  Total (millions): 139.4M

The parameter count confirms that BART-base has approximately 140 million parameters, split roughly evenly between the encoder and decoder with significant additional parameters in the embedding matrices. This aligns with our earlier discussion of BART model sizes and demonstrates how the encoder-decoder architecture distributes capacity across both components.

Out[16]:
Visualization
Pie chart showing BART-base parameter distribution across components.
Distribution of parameters across BART-base components. The encoder and decoder contain similar parameter counts, with embeddings contributing a substantial portion due to the large vocabulary size.

Key Parameters

The key parameters for BART's architecture are listed below, followed by a short configuration sketch:

  • d_model: The hidden dimension size, which is 768 for base and 1024 for large. Controls the representation capacity of the model.
  • encoder_layers / decoder_layers: Number of transformer blocks in each component. More layers increase model capacity but also computational cost.
  • encoder_attention_heads: Number of attention heads (12 for base, 16 for large). Multiple heads allow the model to attend to different aspects of the input simultaneously.
  • encoder_ffn_dim: Dimension of the feed-forward network, typically 4 times the hidden dimension. Controls the capacity of the non-linear transformations.
  • max_position_embeddings: Maximum sequence length the model can process (1024 for BART). Limited by learned absolute position embeddings.
  • activation_function: The non-linearity used in feed-forward layers (GeLU for BART).
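
These parameters map directly onto BartConfig fields. As a sketch, here is how you might instantiate a scaled-down BART for experimentation; the values are arbitrary, chosen only to illustrate the knobs:

```python
from transformers import BartConfig, BartForConditionalGeneration

tiny_config = BartConfig(
    d_model=256,                  # hidden dimension
    encoder_layers=3,             # encoder depth
    decoder_layers=3,             # decoder depth
    encoder_attention_heads=4,
    decoder_attention_heads=4,
    encoder_ffn_dim=1024,         # roughly 4x the hidden dimension
    decoder_ffn_dim=1024,
    max_position_embeddings=512,  # hard limit from learned absolute positions
    activation_function="gelu",
)
tiny_model = BartForConditionalGeneration(tiny_config)  # randomly initialized
print(f"{sum(p.numel() for p in tiny_model.parameters()) / 1e6:.1f}M parameters")
```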

Limitations and Impact

BART's architecture has trade-offs worth understanding.

The post-norm design that BART inherited from the original transformer makes training less stable at large scales compared to pre-norm architectures. This is why T5 and most modern LLMs use pre-norm. When scaling BART beyond its original sizes, learning rate schedules and initialization require careful tuning.

The use of learned absolute position embeddings limits BART's ability to generalize to sequence lengths beyond those seen during training. While the model can process longer sequences by extending position embeddings, performance typically degrades. This contrasts with relative position encoding approaches such as those in T5 or the RoPE embeddings we covered in Part XI, which offer better length generalization.

Computationally, BART's pre-training objective requires reconstructing the entire input document, making pre-training more expensive than T5's span corruption approach. For downstream applications, however, this difference disappears since both models fine-tune and generate output tokens autoregressively.

Despite these limitations, BART had a major impact. It demonstrated that combining BERT-style encoding with GPT-style decoding produces a model that handles both understanding and generation well. The denoising pre-training framework opened new directions for exploring different corruption schemes, which we'll examine in the next chapter.

BART also showed that encoder-decoder architectures work well for conditional generation tasks. While decoder-only models like GPT have since dominated many applications due to their simplicity and scalability, encoder-decoder models like BART remain competitive for tasks where the input and output differ structurally, such as in document summarization or data-to-text generation.

Summary

BART combines a BERT-style bidirectional encoder with a GPT-style autoregressive decoder, creating a model for both understanding and generation. The encoder processes input with full bidirectional self-attention, while the decoder uses causal self-attention for generation and cross-attention to condition on the encoder's representations.

Compared to T5, BART makes different architectural choices. These include GeLU activation instead of ReLU, post-norm instead of pre-norm, learned absolute positions instead of relative positions, and separate embedding matrices for encoder and decoder. These differences reflect BART's design philosophy of directly combining proven components from BERT and GPT rather than exploring novel architectural variations.

BART comes in base (140M parameters) and large (400M parameters) configurations. Both follow the same overall structure, differing in depth (6 versus 12 layers per stack), hidden dimension (768 versus 1024), and number of attention heads (12 versus 16).

The next chapter explores BART's pre-training in detail, examining the various noising functions that teach the model to reconstruct corrupted text and how these objectives shape the model's capabilities for downstream tasks.
