Learn how encoder-only transformers like BERT use bidirectional self-attention for text understanding. Covers encoder design, layer stacking, output usage for classification and extraction, and BERT-style configurations.

This article is part of the free-to-read Language AI Handbook
Encoder Architecture
The transformer, introduced in "Attention Is All You Need," combined an encoder and a decoder to handle sequence-to-sequence tasks like machine translation. But a striking realization emerged shortly after: you don't always need both components. For tasks that require understanding rather than generating text, the encoder alone suffices. This insight led to BERT and a family of encoder-only models that dominated NLP benchmarks for years.
Encoder-only transformers process an entire input sequence and produce rich, contextualized representations for each token. Unlike decoders, which generate text one token at a time, encoders see the full context at once. This bidirectional nature makes them ideal for classification, named entity recognition, question answering, and semantic similarity tasks.
This chapter explores the encoder architecture in depth. We'll examine how bidirectional self-attention differs from the causal attention used in decoders, understand why encoders excel at understanding tasks, and implement a complete encoder from scratch. By the end, you'll understand the design principles that made BERT and its descendants so successful.
The Encoder-Only Paradigm
The original transformer uses both an encoder to process input and a decoder to generate output. For translation, the encoder reads the source sentence while the decoder produces the target sentence token by token. This separation makes sense when input and output differ in length and structure.
But many NLP tasks don't require generation at all. Sentiment analysis asks: is this review positive or negative? Named entity recognition asks: which words are person names, locations, or organizations? Question answering asks: which span in the passage answers this question? These are all understanding tasks. You need to comprehend the input text, not produce new text.
Encoder-only transformers process complete input sequences to produce contextualized representations. They're designed for understanding tasks where the goal is to analyze or classify text rather than generate new sequences.
For understanding tasks, an encoder-only architecture offers three key advantages:
- Bidirectional context: Each token can attend to tokens both before and after it, capturing richer context than left-to-right processing allows.
- Computational efficiency: Without a decoder, you eliminate cross-attention and the sequential generation loop, making inference faster for classification tasks.
- Simpler training: You can train on masked language modeling without needing parallel text pairs (source and target sentences).
The encoder produces one representation vector per input token. Depending on the task, you might use the first token's representation for classification, all representations for sequence labeling, or specific span representations for extraction tasks.
Bidirectional Self-Attention
The defining feature of encoders is bidirectional self-attention. Every token can attend to every other token in the sequence, including tokens that appear after it. This contrasts sharply with the causal (masked) attention used in decoders, where each token can only attend to previous tokens.
Consider processing the sentence "The bank approved the loan quickly." When computing the representation for "bank," bidirectional attention sees:
- Before: "The"
- After: "approved the loan quickly"
The word "loan" in the future context helps disambiguate that "bank" refers to a financial institution rather than a river bank. Causal attention would miss this crucial signal, seeing only "The" before "bank."
The visualizations reveal the fundamental difference. In bidirectional attention (left), the "bank" row shows non-zero weights for all six positions, including "loan" and "quickly" that appear later in the sequence. In causal attention (right), the "bank" row shows weights only for "The" and "bank" itself. The future context that would help disambiguate "bank" is completely invisible.
The Attention Mechanism Without Masking
To understand how bidirectional attention works mathematically, we need to build up from a simple question: how does a token decide which other tokens to pay attention to?
The answer involves three learned projections that give each token different "roles" in the attention computation. Think of it as an information retrieval system where tokens can both ask questions and provide answers:
- Queries represent what a token is looking for. When computing the representation for "bank," its query vector encodes the question "what information do I need to understand my meaning?"
- Keys represent what a token offers. Each token advertises its content through a key vector, saying "this is what I can tell you about."
- Values contain the actual information to be gathered. Once attention decides which tokens matter, their value vectors provide the content.
Starting from the input sequence $X$ containing $n$ tokens with $d_{\text{model}}$-dimensional embeddings, we project into these three spaces:

$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V$$

where:
- $X \in \mathbb{R}^{n \times d_{\text{model}}}$: the input matrix with one row per token
- $W^Q, W^K, W^V$: learned projection matrices
- $Q$, $K$, $V$: the resulting query, key, and value matrices
With these projections in hand, we measure relevance by computing dot products between queries and keys. A high dot product between query $q_i$ and key $k_j$ means token $i$ should attend strongly to token $j$. We arrange all these dot products into a single matrix multiplication:

$$S = QK^\top$$

This produces an $n \times n$ matrix where entry $S_{ij}$ measures how much token $i$ should attend to token $j$. The key insight for encoders: we compute all $n^2$ entries, not just the lower triangle. Position 2 can attend to position 5, and position 5 can attend to position 2.

Raw dot products can grow large as the dimension $d_k$ increases, which would push softmax into regions with vanishing gradients. To stabilize training, we scale by $\sqrt{d_k}$:

$$S_{\text{scaled}} = \frac{QK^\top}{\sqrt{d_k}}$$

The softmax function converts these scores into a probability distribution. For each token $i$ (each row), softmax ensures the attention weights across all positions sum to 1:

$$A = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)$$

Finally, we use these weights to compute a weighted average of value vectors. If token $i$ assigns weight 0.4 to token $j$, then 40% of $j$'s value vector contributes to $i$'s output:

$$\text{output}_i = \sum_{j} A_{ij}\, v_j$$

Putting it all together, the complete attention formula is:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where:
- $Q$: query matrix, each row is a token asking "what should I pay attention to?"
- $K$: key matrix, each row is a token advertising "this is what I contain"
- $V$: value matrix, each row is a token's actual content to be gathered
- $d_k$: dimension of query/key vectors, used for scaling to prevent gradient issues
- The softmax normalizes attention weights to sum to 1 for each query position
The crucial difference from decoder attention: no mask is applied before the softmax. Every query-key pair contributes to the attention weights, allowing information to flow in both directions. This is the simplest form of self-attention, with no architectural constraints on which positions can interact.
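To make this concrete, here is a minimal NumPy sketch of unmasked scaled dot-product attention. The toy dimensions and the randomly initialized projection matrices `W_q`, `W_k`, `W_v` are illustrative placeholders, not trained weights:

```python
import numpy as np

def bidirectional_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention with no causal mask.

    X:    (n, d_model) token embeddings
    W_*:  (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # project into query/key/value spaces
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n): every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                     # weighted sum of values + attention map

# Toy example: 6 tokens, d_model = 8, d_k = 4 (arbitrary sizes for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, attn = bidirectional_attention(X, W_q, W_k, W_v)
print(out.shape, attn.shape)   # (6, 4) (6, 6) -- full n x n attention, no lower-triangle mask
```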
The visualization quantifies what bidirectional attention enables. With bidirectional attention, "bank" draws information from all tokens, including future tokens like "loan" and "quickly" that appear later in the sequence. With causal attention, these future tokens contribute zero weight. The representation of "bank" differs fundamentally depending on which attention pattern is used.
Encoder for Understanding Tasks
Bidirectional attention makes encoders powerful tools for tasks that require comprehending input text. Let's examine the major categories of tasks where encoder-only models excel.
Text Classification
Classification maps an entire input sequence to a single label: sentiment (positive/negative), topic (sports/politics/technology), or intent (question/command/statement). The encoder processes the full sequence, then a classification head converts the representation to class probabilities.
BERT and similar models prepend a special [CLS] token to every input. After passing through all encoder layers, this token's representation aggregates information from the entire sequence and serves as the sequence-level representation for classification.
Why use [CLS] rather than averaging all token representations? The [CLS] token starts with no inherent meaning. Through self-attention, it learns to collect task-relevant information from the actual content tokens. This learned aggregation often outperforms simple averaging, especially for tasks requiring holistic understanding.
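A sequence classification head can be sketched as follows, assuming a PyTorch-style encoder whose output has shape `(batch, seq_len, d_model)` with `[CLS]` at position 0; the class name and label count are illustrative, not BERT's exact implementation:

```python
import torch
import torch.nn as nn

class SequenceClassificationHead(nn.Module):
    """Maps the [CLS] representation to class logits (illustrative sketch)."""
    def __init__(self, d_model: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(d_model, num_labels)

    def forward(self, encoder_output: torch.Tensor) -> torch.Tensor:
        # encoder_output: (batch, seq_len, d_model); position 0 is the [CLS] token
        cls_vector = encoder_output[:, 0, :]
        return self.classifier(self.dropout(cls_vector))   # (batch, num_labels)

head = SequenceClassificationHead(d_model=768, num_labels=2)
logits = head(torch.randn(4, 128, 768))   # fake encoder output for 4 sequences
print(logits.shape)                       # torch.Size([4, 2])
```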
Token Classification (Sequence Labeling)
Named entity recognition, part-of-speech tagging, and similar tasks require a label for each token, not just one label for the whole sequence. Here, every token's encoder representation passes through a classification head:
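A minimal sketch of such a head, again assuming encoder output of shape `(batch, seq_len, d_model)`; the tag count is a hypothetical example:

```python
import torch
import torch.nn as nn

class TokenClassificationHead(nn.Module):
    """One label per token, e.g. NER tags (illustrative sketch)."""
    def __init__(self, d_model: int, num_tags: int):
        super().__init__()
        self.classifier = nn.Linear(d_model, num_tags)

    def forward(self, encoder_output: torch.Tensor) -> torch.Tensor:
        # encoder_output: (batch, seq_len, d_model) -> logits: (batch, seq_len, num_tags)
        return self.classifier(encoder_output)

head = TokenClassificationHead(d_model=768, num_tags=9)   # e.g. BIO tags for 4 entity types + O
logits = head(torch.randn(2, 32, 768))
print(logits.shape)   # torch.Size([2, 32, 9])
```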
Each token gets its own contextualized representation that incorporates information from the entire sentence. When classifying "Google," the encoder representation already knows that "John works at" precedes it and "in California" follows it, helping distinguish company names from other uses of the word.
Extractive Question Answering
Given a question and a passage, extractive QA identifies the span in the passage that answers the question. The encoder processes the concatenated question and passage, then two classification heads predict the start and end positions of the answer span:
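One common way to sketch these two heads is a single linear layer that produces a start logit and an end logit per token; the module name and dimensions here are illustrative, not BERT's exact implementation:

```python
import torch
import torch.nn as nn

class SpanExtractionHead(nn.Module):
    """Predicts start and end positions of the answer span (sketch)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.span_logits = nn.Linear(d_model, 2)   # one logit each for start and end

    def forward(self, encoder_output: torch.Tensor):
        # encoder_output: (batch, seq_len, d_model) over "[CLS] question [SEP] passage [SEP]"
        logits = self.span_logits(encoder_output)          # (batch, seq_len, 2)
        start_logits, end_logits = logits.unbind(dim=-1)   # each (batch, seq_len)
        return start_logits, end_logits

head = SpanExtractionHead(d_model=768)
start, end = head(torch.randn(1, 64, 768))
answer_start = start.argmax(dim=-1)   # most likely start position
answer_end = end.argmax(dim=-1)       # most likely end position
print(answer_start.item(), answer_end.item())
```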
The bidirectional nature is crucial here. When evaluating whether "Google" is the answer, the model simultaneously considers the question "Where does John work?" and the surrounding context "works at ... headquarters." This holistic view is only possible with bidirectional attention.
Building an Encoder Layer
Having understood the attention mechanism, we can now assemble a complete encoder layer. But attention alone isn't enough. Deep networks need architectural support to train successfully, and encoder layers combine several components that work together to enable learning.
An encoder layer has two major sublayers, each solving a different problem:
- Multi-head self-attention enables tokens to gather information from each other. This is where bidirectional context mixing happens.
- Feed-forward network (FFN) processes each token independently through a nonlinear transformation. This adds capacity that pure attention lacks.
Both sublayers are wrapped with two additional mechanisms that make deep stacking possible:
- Residual connections add the input directly to the output of each sublayer. This creates "gradient highways" that allow learning signals to flow through many layers without vanishing.
- Layer normalization stabilizes the distribution of activations, preventing them from exploding or collapsing as they pass through many transformations.
The modern pre-norm configuration places normalization before each sublayer rather than after. This ordering provides cleaner gradient flow and has become the standard for deep transformers. The complete computation unfolds in two stages:
Stage 1: Attention sublayer

$$H = X + \text{MultiHeadAttention}(\text{LayerNorm}(X))$$

First, we normalize the input $X$. Then attention computes context-aware representations. Finally, we add the original input back (the residual connection). The result blends the original representation with information gathered from other positions.

Stage 2: Feed-forward sublayer

$$Y = H + \text{FFN}(\text{LayerNorm}(H))$$

The same pattern repeats: normalize, transform, add residual. The FFN applies the same transformation independently to each token position, adding nonlinearity and additional learnable capacity.

Together, these stages form the encoder layer's output:

$$\begin{aligned} H &= X + \text{MultiHeadAttention}(\text{LayerNorm}(X)) \\ Y &= H + \text{FFN}(\text{LayerNorm}(H)) \end{aligned}$$

where:
- $X$: input to the encoder layer, containing $n$ token representations
- $\text{LayerNorm}$: normalizes activations to have zero mean and unit variance per token
- $\text{MultiHeadAttention}$: bidirectional multi-head self-attention (no causal mask)
- $\text{FFN}$: position-wise feed-forward network with nonlinear activation
- $H$: intermediate representation after the attention sublayer
- $Y$: final output of the encoder layer, same shape as input

Notice that the output $Y$ has the same shape as the input $X$. This dimensional consistency is what allows us to stack encoder layers: the output of one layer becomes the input to the next, with each layer progressively refining the representations.
Let's implement each component, starting with the building blocks:
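The sketch below implements bidirectional multi-head self-attention in PyTorch. It follows the formulas above but fuses the Q, K, V projections into one linear layer for convenience; class and variable names are our own illustrative choices, not a particular library's API:

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Bidirectional multi-head self-attention (no causal mask). Illustrative sketch."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (B, n_heads, T, d_head)
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (B, heads, T, T)
        weights = scores.softmax(dim=-1)                            # no mask: fully bidirectional
        context = weights @ v                                       # (B, heads, T, d_head)
        context = context.transpose(1, 2).contiguous().view(B, T, D)
        return self.out(context), weights
```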
The feed-forward network uses GELU (Gaussian Error Linear Unit) activation, which has become the standard for transformer models. Unlike ReLU which hard-thresholds at zero, GELU provides smooth gating that depends on the input's magnitude:
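A minimal FFN sketch with the conventional 4x expansion; the `FeedForward` name and defaults are illustrative:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN with GELU and the conventional 4x expansion (sketch)."""
    def __init__(self, d_model: int, d_ff: int = None):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand
            nn.GELU(),                  # smooth gating instead of ReLU's hard threshold
            nn.Linear(d_ff, d_model),   # contract back to d_model
        )

    def forward(self, x):
        return self.net(x)   # applied independently at every token position
```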
Let's visualize how layer normalization transforms the activation distributions:
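Since the plot isn't reproduced here, a small numerical sketch makes the same point: after `nn.LayerNorm`, every token's activations have roughly zero mean and unit variance, regardless of the input's scale and offset:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
x = torch.randn(1, 4, d_model) * 5 + 3          # activations with large scale and offset
ln = nn.LayerNorm(d_model)

print(x.mean(dim=-1).squeeze())                 # per-token means, far from 0
print(x.std(dim=-1, unbiased=False).squeeze())  # per-token stds, far from 1
y = ln(x)
print(y.mean(dim=-1).squeeze())                 # ~0 for every token
print(y.std(dim=-1, unbiased=False).squeeze())  # ~1 for every token
```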
Now we assemble these components into a complete encoder layer:
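A pre-norm encoder layer that reuses the `MultiHeadSelfAttention` and `FeedForward` modules sketched above; this is an illustrative composition, not a verbatim BERT layer:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Pre-norm transformer encoder layer (sketch), reusing the modules defined above."""
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadSelfAttention(d_model, n_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor):
        # Stage 1: H = X + Attention(LayerNorm(X))
        attn_out, attn_weights = self.attn(self.norm1(x))
        x = x + self.dropout(attn_out)
        # Stage 2: Y = H + FFN(LayerNorm(H))
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x, attn_weights

layer = EncoderLayer(d_model=128, n_heads=8)
out, attn = layer(torch.randn(1, 10, 128))
print(out.shape, attn.shape)   # torch.Size([1, 10, 128]) torch.Size([1, 8, 10, 10])
```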
The encoder layer preserves the input shape: a sequence of $n$ tokens with $d_{\text{model}}$-dimensional representations goes in, and an identically shaped sequence comes out. The attention weights have shape (n_heads, seq_len, seq_len), showing that each head computes its own full attention matrix over all position pairs.
Visualizing Multi-Head Attention Patterns
Different attention heads learn to focus on different aspects of the input. Some might attend to adjacent tokens (capturing local syntax), while others span long distances (capturing semantic relationships). Let's visualize how the 8 attention heads in our encoder layer distribute their attention differently:
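A sketch of how such a per-head visualization might be produced with matplotlib, using the toy `EncoderLayer` from above. With randomly initialized weights the patterns are arbitrary; head specialization only emerges after training:

```python
import matplotlib.pyplot as plt
import torch

# Run the toy encoder layer on a random 10-token sequence (shapes are illustrative).
layer = EncoderLayer(d_model=128, n_heads=8)
_, attn = layer(torch.randn(1, 10, 128))      # attn: (1, 8, 10, 10)

fig, axes = plt.subplots(2, 4, figsize=(12, 6))
for head, ax in enumerate(axes.flat):
    ax.imshow(attn[0, head].detach().numpy(), cmap="viridis")   # full 10x10 map per head
    ax.set_title(f"head {head}")
    ax.set_xticks([])
    ax.set_yticks([])
fig.suptitle("Per-head bidirectional attention (untrained weights)")
plt.tight_layout()
plt.show()
```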
The variation across heads illustrates why multi-head attention is so powerful. No single head needs to capture all relationships. The ensemble of heads provides coverage across local patterns, position-specific attention, and long-range dependencies. During training, each head specializes for patterns that help the overall task.
Stacking Encoder Layers
A single encoder layer provides limited representational power. Real models stack many layers, allowing each successive layer to refine the representations further. BERT-Base uses 12 layers, BERT-Large uses 24, and some models go even deeper.
Let's visualize how representations evolve through the layers:
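In place of the plot, here is a small sketch that stacks several toy layers and tracks the mean token-vector norm after each one; layer count and dimensions are arbitrary:

```python
import torch
import torch.nn as nn

# A toy stack of encoder layers; we track the mean token-vector norm after each layer.
n_layers, d_model = 6, 128
layers = nn.ModuleList(EncoderLayer(d_model, n_heads=8) for _ in range(n_layers))

x = torch.randn(1, 16, d_model)
norms = [x.norm(dim=-1).mean().item()]
with torch.no_grad():
    for layer in layers:
        x, _ = layer(x)
        norms.append(x.norm(dim=-1).mean().item())

for i, n in enumerate(norms):
    print(f"after layer {i}: mean token norm = {n:.2f}")   # layer 0 = the raw input
```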
The visualization shows how representations develop across layers. Early layers might have more variable norms as they begin processing the input, while deeper layers tend to stabilize as the residual connections accumulate refined representations.
Attention Entropy Across Layers
A useful diagnostic for understanding encoder behavior is attention entropy, a measure of how concentrated or diffuse each token's attention is. High entropy means attention is spread broadly; low entropy means it focuses on specific positions.
Entropy close to the maximum (dotted line) indicates uniform attention, meaning every token attends roughly equally to all positions. Lower entropy indicates more selective attention. Tracking entropy across layers reveals how the encoder progressively refines its attention patterns from broad context gathering to more task-specific focus.
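A sketch of how attention entropy can be computed from the attention weights returned by the toy layer above; values from random weights are only illustrative:

```python
import math
import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Per-token entropy of attention distributions.

    attn: (..., seq_len, seq_len), where each row is a probability distribution
    over positions. Returns entropy in nats with shape (..., seq_len).
    """
    eps = 1e-9
    return -(attn * (attn + eps).log()).sum(dim=-1)

# Example with the toy encoder layer from above (random weights)
layer = EncoderLayer(d_model=128, n_heads=8)
_, attn = layer(torch.randn(1, 12, 128))   # (1, 8, 12, 12)
entropy = attention_entropy(attn)          # (1, 8, 12)
print(entropy.mean(dim=-1).squeeze(0))     # mean entropy per head
print(f"maximum possible entropy: {math.log(12):.3f} nats")  # uniform over 12 positions
```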
Encoder Output Usage
The encoder produces one contextualized representation per input token. How you use these representations depends on your downstream task.
Sequence-Level Tasks
For classification, sentiment analysis, or any task requiring a single output for the whole sequence, use the [CLS] token representation:
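A sketch of the three pooling strategies discussed next, operating on a fake encoder output of shape `(batch, seq_len, d_model)`; the function name is a hypothetical helper:

```python
import torch

def pool_sequence(encoder_output: torch.Tensor, strategy: str = "cls") -> torch.Tensor:
    """Collapse (batch, seq_len, d_model) into one vector per sequence (sketch)."""
    if strategy == "cls":
        return encoder_output[:, 0, :]            # representation of the [CLS] token
    if strategy == "mean":
        return encoder_output.mean(dim=1)         # average over all token positions
    if strategy == "max":
        return encoder_output.max(dim=1).values   # element-wise max over positions
    raise ValueError(f"unknown pooling strategy: {strategy}")

hidden = torch.randn(4, 128, 768)                 # fake encoder output
for strategy in ("cls", "mean", "max"):
    print(strategy, pool_sequence(hidden, strategy).shape)   # each: torch.Size([4, 768])
```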
Mean pooling often works well when all tokens contribute equally (like sentence similarity). CLS pooling is preferred when the model was pre-trained to aggregate into the first position. Max pooling can capture salient features but may be sensitive to outliers.
Token-Level Tasks
For NER, POS tagging, or other sequence labeling tasks, use every token's representation:
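A minimal sketch: the tagging head is applied at every position, and padding positions are masked out afterwards; shapes and tag count are illustrative:

```python
import torch
import torch.nn as nn

# Every position's vector feeds the tagging head; padding positions are ignored via a mask.
hidden = torch.randn(2, 32, 768)                 # fake encoder output: (batch, seq_len, d_model)
attention_mask = torch.ones(2, 32, dtype=torch.bool)
attention_mask[:, 20:] = False                   # pretend the last 12 positions are padding

tagger = nn.Linear(768, 9)                       # e.g. 9 BIO tags
logits = tagger(hidden)                          # (2, 32, 9): one prediction per token
predictions = logits.argmax(dim=-1)
predictions = predictions.masked_fill(~attention_mask, -1)   # mark padding as "no tag"
print(predictions.shape)                         # torch.Size([2, 32])
```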
Span-Level Tasks
For question answering, relation extraction, or any task involving text spans, you might need to construct span representations from multiple tokens:
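One common recipe, sketched below, concatenates the span's boundary tokens with the mean over the span; this is an illustrative choice, not the only construction:

```python
import torch

def span_representation(encoder_output: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Build a span vector from token representations (one common recipe, sketch).

    encoder_output: (seq_len, d_model) for a single sequence; span is inclusive.
    """
    span_tokens = encoder_output[start : end + 1]
    return torch.cat([
        encoder_output[start],        # boundary information from the first token
        encoder_output[end],          # boundary information from the last token
        span_tokens.mean(dim=0),      # content summary of everything in between
    ])                                # shape: (3 * d_model,)

hidden = torch.randn(64, 768)         # fake single-sequence encoder output
vec = span_representation(hidden, start=10, end=14)
print(vec.shape)                      # torch.Size([2304])
```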
BERT-Style Encoder Configuration
BERT established conventions that influenced nearly all subsequent encoder models. Let's examine the specific architectural choices that defined BERT and its variants.
Key Design Patterns
Several architectural choices recur across encoder models.
Head dimension: BERT maintains 64 dimensions per attention head, computed as:

$$d_{\text{head}} = \frac{d_{\text{model}}}{n_{\text{heads}}}$$

where:
- $d_{\text{head}}$: the dimension of each attention head's query and key vectors
- $d_{\text{model}}$: the model's hidden dimension (768 for BERT-Base)
- $n_{\text{heads}}$: the number of parallel attention heads (12 for BERT-Base)

This gives $d_{\text{head}} = 768 / 12 = 64$ for BERT-Base. The choice balances expressiveness per head with having multiple heads for diverse attention patterns.
FFN expansion: The feed-forward network uses a 4x expansion factor. With $d_{\text{model}} = 768$, the FFN hidden dimension is $d_{\text{ff}} = 4 \times 768 = 3072$. This ratio has become a standard convention; the expand-and-contract FFN block is where most of each layer's parameters reside.
Vocabulary and positions: BERT uses WordPiece tokenization with a 30K vocabulary and supports sequences up to 512 tokens. Later models like RoBERTa expand the vocabulary and some variants extend to longer sequences.
Complete Encoder Implementation
Let's bring everything together into a complete, production-style encoder implementation:
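The sketch below assembles token and position embeddings with a stack of the pre-norm `EncoderLayer` modules from earlier. Defaults mirror BERT-Base hyperparameters, but the class itself is an illustrative composition (it omits BERT's segment embeddings and pooler):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """BERT-style encoder stack (sketch): embeddings + positions + N pre-norm layers."""
    def __init__(self, vocab_size=30522, max_position=512,
                 d_model=768, n_layers=12, n_heads=12, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_position, d_model)
        self.emb_dropout = nn.Dropout(dropout)
        self.layers = nn.ModuleList(
            EncoderLayer(d_model, n_heads, dropout) for _ in range(n_layers)
        )
        self.final_norm = nn.LayerNorm(d_model)   # pre-norm stacks end with a final norm

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq_len) of token indices
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.token_emb(input_ids) + self.pos_emb(positions)
        x = self.emb_dropout(x)
        for layer in self.layers:
            x, _ = layer(x)                        # discard per-layer attention maps here
        return self.final_norm(x)                  # (batch, seq_len, d_model)

encoder = Encoder()
hidden = encoder(torch.randint(0, 30522, (2, 16)))
print(hidden.shape)                                             # torch.Size([2, 16, 768])
print(sum(p.numel() for p in encoder.parameters()) / 1e6, "M parameters")
```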
Limitations and Trade-offs
Encoder-only models have proven remarkably effective for understanding tasks, but they come with inherent limitations that inform when to use them and when to consider alternatives.
No Generative Capability
The most fundamental limitation is that encoders cannot generate text. They produce representations, not sequences. For tasks like translation, summarization, or dialogue, you need a decoder or encoder-decoder architecture. Attempting to force an encoder to generate by iteratively predicting tokens is inefficient and poorly suited to the architecture's design.
This limitation shapes the NLP landscape. BERT excels at classification, extraction, and similarity tasks, but GPT and its descendants dominate text generation. Understanding which architecture fits your task is crucial.
Bidirectional Training Constraints
The bidirectional nature that makes encoders powerful also constrains how they can be trained. You cannot simply predict the next token because the model can already see it. BERT's masked language modeling (MLM) objective works around this by hiding some tokens and predicting them, but this introduces a train-test mismatch: during training, the model sees [MASK] tokens that never appear during inference.
This mismatch can affect fine-tuning, especially for tasks sensitive to exact input format. Various successors like ELECTRA addressed this by using different pre-training objectives that avoid artificial [MASK] tokens.
Fixed Sequence Length
Encoders typically have a maximum sequence length set during pre-training, often 512 tokens for BERT-family models. Handling longer documents requires strategies like truncation, chunking, or using long-context variants like Longformer. Each approach has trade-offs between context coverage, computational cost, and handling of cross-chunk dependencies.
Computational Cost
Bidirectional attention has $O(n^2)$ complexity in sequence length, where $n$ is the number of tokens. This quadratic scaling arises because every token computes attention scores against every other token, requiring $n^2$ score computations per attention head. For a 512-token sequence, this means computing $512^2 = 262{,}144$ attention scores per layer per head. While manageable for moderate sequences, this cost limits scaling to longer contexts without architectural modifications like sparse attention or linear attention approximations.
Despite these limitations, encoder-only models remain the go-to choice for many production NLP systems. Their efficiency at inference time (single forward pass rather than autoregressive generation) and strong performance on classification tasks make them invaluable. The key is matching the architecture to the task.
Summary
Encoder-only transformers represent a powerful paradigm for natural language understanding. By removing the decoder and enabling bidirectional attention, they excel at tasks requiring comprehension rather than generation.
Key takeaways from this chapter:
- Bidirectional attention allows each token to attend to all other tokens, including future context. This enables richer representations than causal attention but precludes autoregressive generation.
- Encoder layers stack multi-head attention and feed-forward networks with residual connections and normalization. The pre-norm configuration places normalization before each sublayer for training stability.
- Output usage depends on the task: use [CLS] for sequence classification, all tokens for sequence labeling, and span representations for extraction tasks.
- BERT established conventions that persist today: 12 or 24 layers, hidden dimensions of 768 or 1024, attention heads with 64 dimensions, and 4x FFN expansion.
- Understanding tasks like classification, NER, and extractive QA are natural fits for encoders, while generation tasks require decoders.
The encoder architecture laid the groundwork for understanding how transformers can be decomposed and specialized. In the next chapter, we'll explore the decoder architecture and see how causal masking enables the text generation capabilities that power modern language models.
Key Parameters
When implementing or configuring encoder-only transformers, several parameters directly impact model capacity, computational cost, and downstream performance:
- d_model (hidden dimension): The dimensionality of token representations throughout the encoder. BERT-Base uses 768, BERT-Large uses 1024. Larger values increase capacity but quadratically increase attention computation.
- n_layers (depth): Number of stacked encoder layers. BERT-Base uses 12, BERT-Large uses 24. More layers enable learning more complex feature hierarchies but increase memory and compute requirements linearly.
- n_heads (attention heads): Number of parallel attention heads per layer. Typically chosen so that $d_{\text{model}} / n_{\text{heads}} = 64$. More heads allow learning diverse attention patterns, but the per-head dimension decreases.
- d_ff (feed-forward dimension): Hidden dimension of the position-wise FFN. Convention is $d_{\text{ff}} = 4 \times d_{\text{model}}$. This expansion-contraction pattern provides the majority of the model's parameters.
- max_position (sequence length): Maximum number of tokens the encoder can process. BERT uses 512. Longer sequences increase memory quadratically due to attention's $O(n^2)$ complexity.
- vocab_size: Size of the tokenizer vocabulary. BERT uses ~30K WordPiece tokens, RoBERTa uses ~50K BPE tokens. Larger vocabularies reduce unknown tokens but increase embedding table size.
For fine-tuning pre-trained encoders, the key hyperparameters are learning rate (typically 1e-5 to 5e-5), batch size (16-32 for most tasks), and number of epochs (2-4 for most classification tasks).