Bidirectional RNNs: Capturing Full Sequence Context

Michael Brenndoerfer · December 16, 2025 · 52 min read

Learn how bidirectional RNNs process sequences in both directions to capture past and future context. Covers architecture, LSTMs, implementation, and when to use them.


Bidirectional RNNs

Standard RNNs process sequences in one direction: from the first token to the last. At each timestep, the hidden state captures information about everything that came before. But for many tasks, the future matters just as much as the past. When you're trying to understand the meaning of a word in a sentence, you naturally consider both what came before and what comes after.

Consider the sentence: "The bank by the river was steep." To understand that "bank" refers to a riverbank rather than a financial institution, you need to see "river" which appears after "bank." A forward-only RNN processing "bank" has no access to this disambiguating context. Bidirectional RNNs solve this problem by running two separate RNNs: one forward through time, one backward. At each position, you get a representation informed by the entire sequence.

The Bidirectional Intuition

Think about how you actually read text. When you encounter an ambiguous word, you don't commit to an interpretation immediately. You keep reading, gathering more context, and then your understanding of earlier words crystallizes. You're implicitly using bidirectional information: past context and future context together inform your understanding of each position.

Bidirectional RNNs formalize this intuition with a simple architectural change. Instead of one RNN, we use two:

  • A forward RNN that processes the sequence from position 1 to position $T$, producing hidden states $\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_T$
  • A backward RNN that processes the sequence from position $T$ to position 1, producing hidden states $\overset{\leftarrow}{h}_T, \overset{\leftarrow}{h}_{T-1}, \ldots, \overset{\leftarrow}{h}_1$

At each position $t$, we concatenate these two hidden states to get a combined representation that captures both past and future context:

$$h_t = [\overrightarrow{h}_t; \overset{\leftarrow}{h}_t]$$

where:

  • $h_t \in \mathbb{R}^{2h}$: the combined bidirectional hidden state at position $t$
  • $\overrightarrow{h}_t \in \mathbb{R}^h$: the forward hidden state, encoding information from positions $1, 2, \ldots, t$
  • $\overset{\leftarrow}{h}_t \in \mathbb{R}^h$: the backward hidden state, encoding information from positions $T, T-1, \ldots, t$
  • $[\,;\,]$: the concatenation operator, which stacks two vectors end-to-end
  • $h$: the hidden dimension of each directional RNN

The concatenation doubles the dimensionality: if each directional RNN has hidden dimension $h$, the combined representation has dimension $2h$.
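To make the shape bookkeeping concrete, here is a minimal NumPy sketch (toy vectors, hidden size 4) of the concatenation step; the values are random and simply stand in for whatever the two directional RNNs would compute:

```python
import numpy as np

h = 4                                  # hidden dimension of each directional RNN (toy size)
h_forward = np.random.randn(h)         # forward hidden state at some position t
h_backward = np.random.randn(h)        # backward hidden state at the same position

# Concatenation stacks the two vectors end-to-end: dimension h + h = 2h
h_combined = np.concatenate([h_forward, h_backward])
print(h_forward.shape, h_backward.shape, h_combined.shape)  # (4,) (4,) (8,)
```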

Figure: A bidirectional RNN processes the input sequence in both directions. The forward RNN (blue, top) reads left-to-right, while the backward RNN (orange, bottom) reads right-to-left. At each position, the two hidden states are concatenated to form a representation informed by the entire sequence.
Bidirectional RNN

A bidirectional RNN processes a sequence in both directions simultaneously, using a forward RNN and a backward RNN. The hidden states from both directions are combined (typically concatenated) at each position to create representations that capture context from the entire sequence.

Forward and Backward Passes

Now that we understand why bidirectional processing helps, let's formalize how information flows through the architecture. The elegance of bidirectional RNNs lies in their simplicity: we run two completely independent RNNs on the same input, one reading left-to-right and one reading right-to-left. Each builds its own summary of the sequence from its respective direction, and we combine these complementary views at the end.

Consider an input sequence $x_1, x_2, \ldots, x_T$ where each $x_t \in \mathbb{R}^d$ is a $d$-dimensional vector, typically a word embedding in NLP or acoustic features in speech recognition. Our goal is to produce a representation at each position that captures context from the entire sequence, not just what came before.

The Forward Pass: Accumulating Left Context

The forward RNN processes the sequence in natural reading order, from position 1 to position $T$. At each step, it faces the same question every recurrent network must answer: how do I combine what I'm seeing right now with everything I've accumulated so far?

The answer is the standard RNN update equation, applied in the forward direction:

$$\overrightarrow{h}_t = f(\overrightarrow{W}_x x_t + \overrightarrow{W}_h \overrightarrow{h}_{t-1} + \overrightarrow{b})$$

Let's unpack each component to understand its role:

  • $x_t \in \mathbb{R}^d$: the input at the current position, the raw information we're processing
  • $\overrightarrow{h}_{t-1} \in \mathbb{R}^h$: the previous hidden state, carrying a compressed summary of everything from positions $1$ through $t-1$
  • $\overrightarrow{W}_x \in \mathbb{R}^{h \times d}$: the input weight matrix, which learns to extract relevant features from the current input
  • $\overrightarrow{W}_h \in \mathbb{R}^{h \times h}$: the recurrent weight matrix, which learns how to integrate new information with accumulated history
  • $\overrightarrow{b} \in \mathbb{R}^h$: a bias vector that shifts the activation
  • $f$: an activation function (tanh for vanilla RNNs, or the full gating mechanism for LSTMs and GRUs)
  • $\overrightarrow{h}_t \in \mathbb{R}^h$: the output, a new hidden state that now summarizes positions $1$ through $t$

The computation starts with $\overrightarrow{h}_0 = \mathbf{0}$, a zero vector representing "no prior context." As the forward pass proceeds through $t = 1, 2, \ldots, T$, each hidden state accumulates progressively more information about the sequence's left context. By the time we reach position $T$, the final forward state $\overrightarrow{h}_T$ has seen the entire sequence, but only from the left-to-right perspective.

The Backward Pass: Accumulating Right Context

The backward RNN performs the mirror-image computation. It starts at the end of the sequence and works its way back to the beginning, building up a summary of what comes after each position:

$$\overset{\leftarrow}{h}_t = f(\overset{\leftarrow}{W}_x x_t + \overset{\leftarrow}{W}_h \overset{\leftarrow}{h}_{t+1} + \overset{\leftarrow}{b})$$

The structure is identical to the forward pass, but notice the crucial difference in the subscript: the backward hidden state at position $t$ depends on $\overset{\leftarrow}{h}_{t+1}$, not $\overset{\leftarrow}{h}_{t-1}$. This reflects the reversed processing order: the backward RNN has already processed positions $T, T-1, \ldots, t+1$ before it reaches position $t$.

Each component plays the same role as in the forward pass, but now oriented toward capturing right context:

  • $\overset{\leftarrow}{h}_{t+1} \in \mathbb{R}^h$: the hidden state from the next position (which the backward RNN has already processed)
  • $\overset{\leftarrow}{W}_x, \overset{\leftarrow}{W}_h, \overset{\leftarrow}{b}$: the backward network's own weight matrices and bias, completely separate from the forward parameters
  • $\overset{\leftarrow}{h}_t \in \mathbb{R}^h$: the output, summarizing positions $t$ through $T$

The backward pass begins with $\overset{\leftarrow}{h}_{T+1} = \mathbf{0}$ and proceeds through $t = T, T-1, \ldots, 1$. By the time it reaches position 1, the state $\overset{\leftarrow}{h}_1$ has processed the entire sequence from right to left.
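The two recurrences translate directly into two loops over the sequence. The sketch below uses a vanilla tanh RNN with small, randomly initialized (untrained) weights purely to show the index pattern: the forward loop carries the state from position $t-1$, the backward loop carries it from position $t+1$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, h = 6, 3, 4                        # sequence length, input dim, hidden dim (toy sizes)
x = rng.standard_normal((T, d))          # toy input sequence

# Independent (untrained) parameters for each direction
Wx_f, Wh_f, b_f = rng.standard_normal((h, d)), rng.standard_normal((h, h)), np.zeros(h)
Wx_b, Wh_b, b_b = rng.standard_normal((h, d)), rng.standard_normal((h, h)), np.zeros(h)

# Forward pass: t = 0, 1, ..., T-1, starting from a zero state
h_fwd = np.zeros((T, h))
prev = np.zeros(h)
for t in range(T):
    prev = np.tanh(Wx_f @ x[t] + Wh_f @ prev + b_f)
    h_fwd[t] = prev

# Backward pass: t = T-1, T-2, ..., 0, also starting from a zero state
h_bwd = np.zeros((T, h))
prev = np.zeros(h)
for t in range(T - 1, -1, -1):
    prev = np.tanh(Wx_b @ x[t] + Wh_b @ prev + b_b)
    h_bwd[t] = prev

combined = np.concatenate([h_fwd, h_bwd], axis=1)
print(combined.shape)                    # (6, 8): T positions, 2h features each
```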

Two Independent Networks with Complementary Views

A crucial architectural point: the forward and backward RNNs are completely separate networks with independent parameters. The forward weights $\overrightarrow{W}_x$, $\overrightarrow{W}_h$, $\overrightarrow{b}$ share nothing with the backward weights $\overset{\leftarrow}{W}_x$, $\overset{\leftarrow}{W}_h$, $\overset{\leftarrow}{b}$. Each network learns its own way of processing the sequence, free to specialize in capturing different types of patterns.

This independence has a direct consequence for model size: a bidirectional RNN has exactly twice the parameters of a unidirectional RNN with the same hidden dimension. If a unidirectional LSTM with hidden size 256 has 1 million parameters, the bidirectional version has 2 million. The doubled capacity is the price we pay for bidirectional context.

Concrete Example: Disambiguating "Bank"

To see how these two passes complement each other, let's trace through our example sentence: "The bank by the river was steep."

When the forward RNN processes "bank" at position 2, its hidden state $\overrightarrow{h}_2$ has only seen "The bank." Based on this limited context, it might tentatively encode "bank" as a financial institution, a reasonable guess given word frequencies. The forward hidden state captures everything to the left, but that's not enough to disambiguate.

When the backward RNN processes "bank" at position 2, the situation is different. It has already processed "steep," "was," "river," "the," and "by" (in that order). Its hidden state $\overset{\leftarrow}{h}_2$ carries this future context, strongly suggesting a geographical feature rather than a financial one. The word "river" appearing later in the sentence is the key disambiguating signal.

By concatenating $\overrightarrow{h}_2$ and $\overset{\leftarrow}{h}_2$, we get a representation that knows both what came before ("The") and what comes after ("by the river was steep"). This combined representation has all the information needed to correctly interpret "bank" as a riverbank.

Figure: Information available at position 3 ('by') for forward, backward, and bidirectional representations. The forward RNN only sees past context (positions 1-3). The backward RNN only sees future context (positions 3-5). The bidirectional combination captures the full sequence.

Hidden State Concatenation

At this point, we have two hidden states at each position: $\overrightarrow{h}_t$ capturing left context and $\overset{\leftarrow}{h}_t$ capturing right context. The final step is combining these complementary views into a single representation. This combination operation is where the bidirectional magic happens: we're fusing two partial pictures of the sequence into one complete view.

The Standard Approach: Concatenation

The most common and effective strategy is simple concatenation. We stack the two hidden vectors end-to-end, creating a longer vector that preserves all information from both directions:

ht=[ht;ht]R2hh_t = [\overrightarrow{h}_t; \overset{\leftarrow}{h}_t] \in \mathbb{R}^{2h}

where:

  • $h_t$: the combined bidirectional representation at position $t$
  • $\overrightarrow{h}_t \in \mathbb{R}^h$: the forward hidden state, occupying the first $h$ elements
  • $\overset{\leftarrow}{h}_t \in \mathbb{R}^h$: the backward hidden state, occupying the last $h$ elements

Why is concatenation the default choice? It's lossless: no information from either direction is discarded or compressed. The forward and backward components remain distinct, allowing downstream layers to learn how to weight them appropriately for the task at hand. The cost is that the representation doubles in size: if each directional RNN uses hidden dimension 256, the combined output has dimension 512.

Alternative Combination Strategies

While concatenation dominates in practice, other approaches exist. Each makes different trade-offs between information preservation and computational efficiency:

Summation adds the two hidden states element-wise:

ht=ht+htRhh_t = \overrightarrow{h}_t + \overset{\leftarrow}{h}_t \in \mathbb{R}^h

This preserves the original dimensionality, which can be convenient when you need to match a specific output size or maintain compatibility with existing architecture. However, summation is lossy. Consider what happens when the forward state has value 0.5 in some dimension and the backward state has -0.5: they cancel out to zero, erasing the information that both directions had strong but opposite signals. The sum can't distinguish "both directions agree on zero" from "both directions disagree strongly."

Element-wise product multiplies corresponding elements:

ht=hthtRhh_t = \overrightarrow{h}_t \odot \overset{\leftarrow}{h}_t \in \mathbb{R}^h

This captures multiplicative interactions between the directions, which can be powerful when you want to detect patterns that require agreement from both sides. However, it's numerically unstable: if either direction has values near zero, the product collapses regardless of what the other direction says. It also fails when only one direction has relevant information, since a strong signal from one side gets suppressed by a weak signal from the other.

Learned projection concatenates first, then applies a weight matrix:

ht=Wc[ht;ht]Rhh_t = W_c[\overrightarrow{h}_t; \overset{\leftarrow}{h}_t] \in \mathbb{R}^{h'}

where $W_c \in \mathbb{R}^{h' \times 2h}$ is a learned projection matrix. This lets the network learn the optimal way to combine the directions, potentially compressing back to the original dimension ($h' = h$) or any other size. The trade-off is additional parameters and computation, plus the risk that the projection might discard useful information during training.

Concatenation remains the standard choice because it's simple, lossless, and numerically stable. The doubled dimensionality is usually acceptable, and downstream layers can learn to weight the forward and backward components appropriately for each task.
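The four strategies are easy to compare side by side. A quick sketch on toy, randomly generated vectors (nothing here is trained; it only shows the output dimension and the kind of operation each strategy performs):

```python
import numpy as np

rng = np.random.default_rng(1)
h = 4
h_fwd = rng.standard_normal(h)                  # forward hidden state (toy)
h_bwd = rng.standard_normal(h)                  # backward hidden state (toy)

concat = np.concatenate([h_fwd, h_bwd])         # (2h,)  lossless, doubles the dimension
summed = h_fwd + h_bwd                          # (h,)   lossy: opposite signals can cancel
product = h_fwd * h_bwd                         # (h,)   collapses when either side is near zero
W_c = rng.standard_normal((h, 2 * h)) * 0.1     # learned projection back to h dimensions
projected = W_c @ concat                        # (h,)   learned mixing of both directions

for name, v in [("concat", concat), ("sum", summed), ("product", product), ("projection", projected)]:
    print(f"{name:>10}: dimension {v.shape[0]}")
```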

Practical Implications of Doubled Dimensions

The doubled output dimension has concrete implications for your network architecture. Any layer that receives the bidirectional output must be sized accordingly:

  • A classification layer needs $2h$ input features instead of $h$
  • Attention mechanisms must account for the larger key/value dimensions
  • Memory requirements increase proportionally

In PyTorch and similar frameworks, specifying bidirectional=True automatically handles the output dimensionality: the output tensor's last dimension becomes 2 × hidden_size. But when designing custom architectures or debugging dimension mismatches, understanding this doubling is essential.

For sequence-to-sequence models, the decoder typically needs to initialize from the encoder's final state. With a bidirectional encoder, you have two "final" states: $\overrightarrow{h}_T$ from the forward pass (which has seen the whole sequence left-to-right) and $\overset{\leftarrow}{h}_1$ from the backward pass (which has seen it right-to-left). Common approaches include concatenating them, summing them, or using a learned projection to combine them into the decoder's initial state.
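As a sketch of that last point, here is one hedged way to build a decoder's initial state from the two encoder final states (toy dimensions, randomly initialized projection; a real system would learn the projection along with the rest of the model):

```python
import numpy as np

rng = np.random.default_rng(2)
h_enc, h_dec = 4, 4                        # encoder (per-direction) and decoder hidden sizes
h_T_forward = rng.standard_normal(h_enc)   # final forward state: has read the whole input
h_1_backward = rng.standard_normal(h_enc)  # final backward state: has also read the whole input

# Option 1: concatenate, then project to the decoder size with a (learned) matrix
W_init = rng.standard_normal((h_dec, 2 * h_enc)) * 0.1
decoder_h0 = np.tanh(W_init @ np.concatenate([h_T_forward, h_1_backward]))

# Option 2: simply sum the two final states (keeps dimension h_enc)
decoder_h0_sum = h_T_forward + h_1_backward

print(decoder_h0.shape, decoder_h0_sum.shape)   # (4,) (4,)
```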

Bidirectional LSTMs and GRUs

The bidirectional concept we've developed, running two RNNs in opposite directions and concatenating their outputs, is architecture-agnostic. It works identically whether the underlying RNN is a vanilla RNN, an LSTM, or a GRU. In practice, bidirectional LSTMs dominate because they combine bidirectional context with LSTM's ability to capture long-range dependencies. The gating mechanisms that make LSTMs powerful for long sequences work just as well in both directions.

A bidirectional LSTM consists of two independent LSTMs: one processing left-to-right, one processing right-to-left. Each maintains its own hidden state $h$ and cell state $c$. The cell state, LSTM's key innovation for preserving information over long distances, operates independently in each direction. Each direction has its own memory pathway, learning to remember and forget different aspects of the sequence.

Forward LSTM: Gated Memory from the Left

At each position $t$, the forward LSTM computes four gate values based on the current input and the previous hidden state. These gates control how information flows into, out of, and through the cell state:

$$\overrightarrow{f}_t, \overrightarrow{i}_t, \overrightarrow{o}_t, \overrightarrow{\tilde{c}}_t = \text{gates}(\overrightarrow{W}, x_t, \overrightarrow{h}_{t-1})$$

Each gate serves a specific purpose in managing the LSTM's memory:

  • $\overrightarrow{f}_t \in \mathbb{R}^h$: the forget gate, with values between 0 and 1 that decide how much of the previous cell state to retain. A value near 1 means "keep this memory," while near 0 means "discard it."
  • $\overrightarrow{i}_t \in \mathbb{R}^h$: the input gate, which controls how much of the new candidate information to write into memory
  • $\overrightarrow{o}_t \in \mathbb{R}^h$: the output gate, which controls how much of the cell state to expose as the hidden state
  • $\overrightarrow{\tilde{c}}_t \in \mathbb{R}^h$: the candidate cell state, which proposes new values that could be added to memory
  • $\overrightarrow{W}$: the collection of forward LSTM weight matrices (one set for each gate)

The cell state update is where LSTM's memory mechanism operates. Old information is selectively forgotten, and new information is selectively added:

$$\overrightarrow{c}_t = \overrightarrow{f}_t \odot \overrightarrow{c}_{t-1} + \overrightarrow{i}_t \odot \overrightarrow{\tilde{c}}_t$$

where $\odot$ denotes element-wise multiplication. The first term $\overrightarrow{f}_t \odot \overrightarrow{c}_{t-1}$ preserves parts of the old cell state that the forget gate allows through. The second term $\overrightarrow{i}_t \odot \overrightarrow{\tilde{c}}_t$ adds parts of the candidate that the input gate admits. This additive structure is what allows LSTMs to maintain information over long distances without the vanishing gradient problems that plague vanilla RNNs.

Finally, the hidden state is computed by filtering the cell state through the output gate:

$$\overrightarrow{h}_t = \overrightarrow{o}_t \odot \tanh(\overrightarrow{c}_t)$$

The tanh squashes the cell state to the range $[-1, 1]$, and the output gate selects which dimensions to expose. This hidden state $\overrightarrow{h}_t$ serves two purposes: it gets passed to the next timestep as input, and it's what we'll eventually concatenate with the backward direction.

Backward LSTM: Gated Memory from the Right

The backward LSTM follows exactly the same gating structure, but with two key differences: it has its own independent set of weights, and it processes in the reverse direction. At position $t$, it uses the hidden state from position $t+1$, which it has already computed since it started at the end of the sequence:

$$\overset{\leftarrow}{f}_t, \overset{\leftarrow}{i}_t, \overset{\leftarrow}{o}_t, \overset{\leftarrow}{\tilde{c}}_t = \text{gates}(\overset{\leftarrow}{W}, x_t, \overset{\leftarrow}{h}_{t+1})$$
$$\overset{\leftarrow}{c}_t = \overset{\leftarrow}{f}_t \odot \overset{\leftarrow}{c}_{t+1} + \overset{\leftarrow}{i}_t \odot \overset{\leftarrow}{\tilde{c}}_t$$
$$\overset{\leftarrow}{h}_t = \overset{\leftarrow}{o}_t \odot \tanh(\overset{\leftarrow}{c}_t)$$

Notice the subscript pattern: the cell state update uses $\overset{\leftarrow}{c}_{t+1}$ rather than $\overset{\leftarrow}{c}_{t-1}$. The backward LSTM's "previous" state is the state from the next position in the sequence, reflecting its reversed processing order. The backward forget gate decides what to remember from the future (positions $t+1$ through $T$), while the backward input gate decides what new information from position $t$ to add to that future-oriented memory.

Combining the Directions: Hidden States Only

After both passes complete, we have two hidden states at each position: $\overrightarrow{h}_t$ from the forward LSTM and $\overset{\leftarrow}{h}_t$ from the backward LSTM. These are concatenated exactly as before:

ht=[ht;ht]R2hh_t = [\overrightarrow{h}_t; \overset{\leftarrow}{h}_t] \in \mathbb{R}^{2h}

An important architectural detail: only the hidden states are concatenated. The cell states $\overrightarrow{c}_t$ and $\overset{\leftarrow}{c}_t$ remain internal to their respective LSTMs and are never exposed or combined. This makes sense when you consider their roles: the cell state's job is to maintain long-term memory within each direction, serving as a private scratchpad for the LSTM's internal computations. The hidden state is what carries information to the outside world, and that's what we want to combine for downstream tasks.

Implementing Bidirectional RNNs

The mathematics we've developed translates directly into code. Building a bidirectional RNN from scratch reveals the elegant simplicity of the architecture: we literally run two independent RNNs and concatenate their outputs. There's no magic, just two passes over the data in opposite directions.

Let's implement this step by step, starting with the building blocks and working up to a complete bidirectional LSTM.

Building Blocks: Activation Functions and LSTM Cells

Before we can build the bidirectional layer, we need the components that make up each directional LSTM. The sigmoid function squashes values to the range $(0, 1)$, making it perfect for gates that control information flow (0 = block everything, 1 = let everything through). The tanh function squashes to $(-1, 1)$, used for the candidate cell state and output normalization.

In[5]:
Code
import numpy as np


def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0, 1 / (1 + np.exp(-x)), np.exp(x) / (1 + np.exp(x)))


def tanh(x):
    """Hyperbolic tangent."""
    return np.tanh(x)

Next, we implement a single LSTM cell. This is the building block that gets replicated at each timestep. The cell takes an input vector and the previous states, computes the four gates, updates the cell state, and produces a new hidden state.

In[6]:
Code
class LSTMCell:
    """Single LSTM cell for use in bidirectional network."""

    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        scale = np.sqrt(2.0 / (input_dim + hidden_dim))

        # Combined weights for efficiency: all four gates in one matrix
        self.W = np.random.randn(4 * hidden_dim, input_dim) * scale
        self.U = np.random.randn(4 * hidden_dim, hidden_dim) * scale
        self.b = np.zeros(4 * hidden_dim)
        self.b[hidden_dim : 2 * hidden_dim] = (
            1.0  # Forget gate bias initialized to 1
        )

    def forward(self, x, h_prev, c_prev):
        """Single step forward pass."""
        h = self.hidden_dim
        gates = self.W @ x + self.U @ h_prev + self.b

        # Split the gate computation into four parts
        i = sigmoid(gates[0:h])  # Input gate
        f = sigmoid(gates[h : 2 * h])  # Forget gate
        c_tilde = tanh(gates[2 * h : 3 * h])  # Candidate cell state
        o = sigmoid(gates[3 * h : 4 * h])  # Output gate

        # Cell state update: forget old, add new
        c = f * c_prev + i * c_tilde

        # Hidden state: filtered cell state
        h_new = o * tanh(c)

        return h_new, c

A few implementation details worth noting:

  1. Combined weight matrices: Rather than four separate matrix multiplications (one per gate), we concatenate all weights into single large matrices and do one multiplication, then split the result. This is a common optimization that improves computational efficiency.

  2. Forget gate bias initialization: The forget gate bias is initialized to 1 rather than 0. This encourages the network to preserve information by default early in training, preventing the common failure mode where the network learns to forget everything before it learns what to remember.

  3. Xavier/He initialization: The weight scale np.sqrt(2.0 / (input_dim + hidden_dim)) follows best practices for initializing neural network weights, preventing exploding or vanishing activations at the start of training.
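As a quick sanity check of the cell (not part of the original walkthrough, just a shape test with toy dimensions), we can run a single step and confirm that the hidden and cell states come out with the expected size:

```python
# Single-step shape check of the LSTM cell with toy dimensions
cell = LSTMCell(input_dim=4, hidden_dim=8)
x_t = np.random.randn(4)      # one input vector
h0 = np.zeros(8)              # initial hidden state
c0 = np.zeros(8)              # initial cell state

h1, c1 = cell.forward(x_t, h0, c0)
print(h1.shape, c1.shape)     # (8,) (8,)
```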

The Bidirectional Layer

Now we build the bidirectional layer itself. The architecture is remarkably simple: create two completely independent LSTM cells (one for each direction) and run them on the same input sequence in opposite orders. The forward LSTM sees positions 0, 1, 2, ..., T-1 in that order; the backward LSTM sees T-1, T-2, ..., 0.

In[7]:
Code
class BidirectionalLSTM:
    """Bidirectional LSTM that processes sequences in both directions."""

    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        self.forward_lstm = LSTMCell(input_dim, hidden_dim)
        self.backward_lstm = LSTMCell(input_dim, hidden_dim)

    def forward(self, x_sequence):
        """
        Process sequence bidirectionally.

        Args:
            x_sequence: Input sequence, shape (seq_len, input_dim)

        Returns:
            combined: Concatenated hidden states, shape (seq_len, 2 * hidden_dim)
            forward_states: Forward hidden states, shape (seq_len, hidden_dim)
            backward_states: Backward hidden states, shape (seq_len, hidden_dim)
        """
        seq_len = x_sequence.shape[0]
        h = self.hidden_dim

        # Forward pass: process positions 0, 1, 2, ..., T-1
        forward_states = []
        h_fwd = np.zeros(h)  # Initial hidden state
        c_fwd = np.zeros(h)  # Initial cell state

        for t in range(seq_len):
            h_fwd, c_fwd = self.forward_lstm.forward(
                x_sequence[t], h_fwd, c_fwd
            )
            forward_states.append(h_fwd)

        # Backward pass: process positions T-1, T-2, ..., 0
        backward_states = []
        h_bwd = np.zeros(h)
        c_bwd = np.zeros(h)

        for t in range(seq_len - 1, -1, -1):
            h_bwd, c_bwd = self.backward_lstm.forward(
                x_sequence[t], h_bwd, c_bwd
            )
            backward_states.insert(
                0, h_bwd
            )  # Insert at front to maintain position alignment

        # Concatenate at each position
        forward_states = np.array(forward_states)
        backward_states = np.array(backward_states)
        combined = np.concatenate([forward_states, backward_states], axis=1)

        return combined, forward_states, backward_states

The implementation reveals the core simplicity of bidirectional RNNs:

  1. Forward pass: Iterate through positions 0 to $T-1$ in order, updating hidden and cell states at each step. This is identical to a standard LSTM.

  2. Backward pass: Iterate in reverse from $T-1$ down to 0. The only tricky part is bookkeeping: we use backward_states.insert(0, h_bwd) to insert at the front of the list, ensuring that backward_states[t] corresponds to position $t$ in the original sequence. This alignment is crucial because we need the forward and backward states to match up position-by-position for concatenation.

  3. Concatenation: Stack the forward and backward states along the feature dimension. At each position $t$, the combined representation has the forward context (from positions 0 through $t$) in the first half and backward context (from positions $t$ through $T-1$) in the second half.

Testing the Implementation

Let's verify our implementation produces the expected output shapes and confirm the dimensionality doubling:

In[8]:
Code
# Create bidirectional LSTM
np.random.seed(42)
input_dim = 10
hidden_dim = 16
seq_len = 5

bilstm = BidirectionalLSTM(input_dim, hidden_dim)

# Sample input sequence
x_sequence = np.random.randn(seq_len, input_dim)

# Process bidirectionally
combined, forward_states, backward_states = bilstm.forward(x_sequence)
Out[9]:
Console
Bidirectional LSTM Output Shapes:
  Input sequence: (5, 10)
  Forward states: (5, 16)
  Backward states: (5, 16)
  Combined output: (5, 32)

Output dimension is 2x hidden dimension: 32 = 2 × 16

The output shapes confirm our implementation is working correctly:

  • Input: 5 timesteps × 10 dimensions per timestep
  • Forward states: 5 timesteps × 16 hidden units (left context at each position)
  • Backward states: 5 timesteps × 16 hidden units (right context at each position)
  • Combined output: 5 timesteps × 32 features (full bidirectional context)

The doubling from 16 to 32 is the characteristic signature of bidirectional processing. Every downstream layer that consumes this output must account for this doubled dimensionality.

Visualizing Hidden State Evolution

To build intuition for what the bidirectional LSTM computes, let's visualize how the forward and backward hidden states evolve across the sequence. This reveals the complementary nature of the two directions:

Figure: Directional hidden state magnitudes (L2 norm) across sequence positions. The forward pass (blue) and backward pass (orange) each accumulate context from their respective directions. A second panel compares the combined bidirectional representation (purple) with the individual directions; the combined representation captures information from both passes at each position.

The hidden state magnitudes reveal how each direction builds up its representation over the sequence. With random inputs, the specific patterns depend on the input values, but the key insight is structural: both directions contribute meaningfully at every position. The combined representation (purple diamonds) is always larger than either individual direction because concatenation preserves all information from both passes. We're not losing anything by combining them.

Information Flow: What Each Position "Knows"

The power of bidirectional processing becomes clear when we visualize what information is available at each position. Let's trace how information propagates through the network:

Figure: Information availability at each position. Forward pass: the hidden state at position t contains information from inputs 1 through t (lower triangle). Backward pass: the hidden state at position t contains information from inputs t through T (upper triangle). Combined bidirectional: each position has access to the entire sequence context.

These information flow diagrams illustrate the fundamental difference between unidirectional and bidirectional processing:

  • Forward pass (lower triangle): The hidden state at position $t$ contains information from inputs 1 through $t$. Early positions have limited context; later positions have seen more of the sequence.
  • Backward pass (upper triangle): The hidden state at position $t$ contains information from inputs $t$ through $T$. The pattern is reversed: early positions have seen more (from the backward perspective).
  • Combined (full matrix): Every position has access to the entire sequence. This is the key advantage of bidirectional processing.

Hidden State Activation Patterns

Let's examine the actual hidden state values from our bidirectional LSTM to see what internal representations emerge:

Figure: Forward hidden state activations across sequence positions (rows) and hidden units (columns); each cell shows the activation value for that unit at that position. A second heatmap shows the backward hidden state activations, which develop different patterns that capture complementary right-to-left context.

The heatmaps reveal the internal representations computed by each direction. With randomly initialized weights and random inputs, the patterns are essentially arbitrary, but the structure is informative. Notice how different hidden units show different activation patterns across positions, and how the forward and backward passes develop distinct representations.

In a trained model on real data, you would see meaningful patterns emerge: hidden units that activate for specific linguistic features, with forward units capturing left-context patterns (like "the word following 'the' is likely a noun") and backward units capturing right-context patterns (like "the word before 'said' is likely a person"). The bidirectional combination gives downstream layers access to both types of patterns simultaneously.

How Different Are the Two Directions?

A natural question: do the forward and backward passes learn redundant or complementary representations? Let's measure this by computing the correlation between corresponding hidden units in each direction:

Figure: Correlation matrix between forward and backward hidden units, alongside the distribution of those correlations. Low correlations indicate the two directions capture different information, making their combination more valuable; a spread around zero suggests the directions learn complementary rather than redundant features.

The correlation analysis reveals how independent the two directions are. With random initialization, we expect correlations near zero because the two LSTMs haven't learned to coordinate. In a trained model, you might see some structure emerge: certain forward units might correlate with backward units that capture related (but temporally reversed) patterns. However, the key insight is that low correlation means the directions provide complementary information. If they were highly correlated, concatenation would be wasteful since we'd be duplicating information. The spread of correlations around zero confirms that bidirectional processing genuinely expands the representational capacity.
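A rough way to reproduce this measurement yourself, assuming the forward_states and backward_states arrays from the BidirectionalLSTM run above, is to correlate corresponding hidden units across positions (with only 5 positions the estimates are noisy, but the mechanics are the same):

```python
# Correlation between corresponding forward and backward hidden units,
# measured across sequence positions of the untrained network above
corrs = []
for unit in range(hidden_dim):
    r = np.corrcoef(forward_states[:, unit], backward_states[:, unit])[0, 1]
    corrs.append(r)

print(f"Mean |correlation| across units: {np.mean(np.abs(corrs)):.3f}")
```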

Bidirectionality for Classification

Bidirectional RNNs excel at classification tasks where you need to make a decision based on the entire sequence. Sentiment analysis, named entity recognition, and part-of-speech tagging all benefit from bidirectional context.

Sequence Classification

For classifying an entire sequence (e.g., sentiment of a review), you typically use the final hidden states from both directions:

In[14]:
Code
class BidirectionalClassifier:
    """Bidirectional LSTM for sequence classification."""

    def __init__(self, input_dim, hidden_dim, num_classes):
        self.bilstm = BidirectionalLSTM(input_dim, hidden_dim)
        self.hidden_dim = hidden_dim

        # Classification layer takes concatenated final states
        # Final forward state + Final backward state = 2 * hidden_dim
        self.W_out = np.random.randn(num_classes, 2 * hidden_dim) * 0.1
        self.b_out = np.zeros(num_classes)

    def forward(self, x_sequence):
        """
        Classify a sequence.

        Returns logits for each class.
        """
        combined, forward_states, backward_states = self.bilstm.forward(
            x_sequence
        )

        # Use final states from both directions
        final_forward = forward_states[-1]  # Last position of forward pass
        final_backward = backward_states[
            0
        ]  # First position of backward pass (processed last)

        # Concatenate final states
        final_repr = np.concatenate([final_forward, final_backward])

        # Classification
        logits = self.W_out @ final_repr + self.b_out
        return logits
In[15]:
Code
# Example: binary sentiment classification
classifier = BidirectionalClassifier(input_dim=10, hidden_dim=16, num_classes=2)
x_sequence = np.random.randn(8, 10)  # 8 tokens, 10-dim embeddings

logits = classifier.forward(x_sequence)
probs = np.exp(logits) / np.sum(np.exp(logits))  # Softmax
Out[16]:
Console
Sequence Classification Example:
  Sequence length: 8 tokens
  Input dimension: 10
  Hidden dimension: 16
  Output logits: [-0.0210323   0.07967426]
  Predicted probabilities: [0.47484462 0.52515538]
  Predicted class: 1

The classifier processes an 8-token sequence and produces logits for each class. The softmax converts these to probabilities that sum to 1. With randomly initialized weights, the predictions are essentially random, but the architecture correctly combines bidirectional context: the final forward state captures information from all 8 tokens read left-to-right, while the final backward state captures information from all 8 tokens read right-to-left.

Token-Level Classification

For tasks like named entity recognition where you classify each token, you use the combined representation at each position:

In[17]:
Code
class BidirectionalTagger:
    """Bidirectional LSTM for token-level classification (e.g., NER, POS tagging)."""

    def __init__(self, input_dim, hidden_dim, num_tags):
        self.bilstm = BidirectionalLSTM(input_dim, hidden_dim)

        # Output layer applied at each position
        self.W_out = np.random.randn(num_tags, 2 * hidden_dim) * 0.1
        self.b_out = np.zeros(num_tags)

    def forward(self, x_sequence):
        """
        Tag each token in the sequence.

        Returns logits for each position and tag.
        """
        combined, _, _ = self.bilstm.forward(x_sequence)

        # Apply classification at each position
        seq_len = combined.shape[0]
        logits = np.zeros((seq_len, self.W_out.shape[0]))

        for t in range(seq_len):
            logits[t] = self.W_out @ combined[t] + self.b_out

        return logits
In[18]:
Code
# Example: NER with 5 tags (O, B-PER, I-PER, B-ORG, I-ORG)
tagger = BidirectionalTagger(input_dim=10, hidden_dim=16, num_tags=5)
x_sequence = np.random.randn(6, 10)  # 6 tokens

tag_logits = tagger.forward(x_sequence)
predicted_tags = np.argmax(tag_logits, axis=1)
Out[19]:
Console
Token-Level Classification Example:
  Sequence length: 6 tokens
  Tag logits shape: (6, 5)
  Predicted tags: ['I-ORG', 'O', 'I-ORG', 'I-ORG', 'O', 'B-ORG']

The tagger produces a 5-dimensional logit vector at each of the 6 positions, one score per possible tag. The argmax selects the highest-scoring tag for each token. With random weights, the predictions are meaningless, but the key point is that each token's prediction uses the full bidirectional context: the combined hidden state at position $t$ incorporates information from all tokens before and after it.

The bidirectional context is particularly valuable for NER because entity boundaries often depend on both preceding and following words. "Bank of America" needs the following "of America" to recognize "Bank" as part of an organization name.

Limitations for Generation

Bidirectional RNNs have a fundamental limitation: they cannot be used for autoregressive generation. When generating text one token at a time, you don't have access to future tokens because they haven't been generated yet.

Consider a language model trying to predict the next word. At position $t$, an autoregressive model can only use information from positions $1, 2, \ldots, t-1$. The backward pass of a bidirectional RNN would require information from positions $t+1, t+2, \ldots, T$, which don't exist during generation.
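To see the constraint in code, here is a minimal sketch of an autoregressive generation loop using an untrained, unidirectional LSTM over a toy vocabulary (the specific model is illustrative, not from this article): at every step, only the prefix generated so far exists, so there is nothing for a backward pass to read.

```python
import torch
import torch.nn as nn

# Toy autoregressive generator: untrained weights, so the tokens are arbitrary;
# the point is the information flow, not the output quality.
vocab_size, embed_dim, hidden_dim = 12, 8, 16
embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # forward-only by necessity
head = nn.Linear(hidden_dim, vocab_size)

generated = [0]          # assume token 0 is a start symbol
state = None             # LSTM hidden/cell state carried forward step by step
with torch.no_grad():
    for _ in range(5):
        # Only the prefix generated so far is available; positions t+1, ..., T
        # haven't been produced yet, so a backward pass cannot be computed.
        x = embed(torch.tensor([[generated[-1]]]))          # shape (1, 1, embed_dim)
        out, state = lstm(x, state)
        next_token = head(out[:, -1]).argmax(dim=-1).item()
        generated.append(next_token)

print(generated)
```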

Figure: During autoregressive generation, future tokens don't exist yet, making the backward pass impossible. At each step, the model must predict the next token using only past context. This is why language models and decoders use unidirectional (forward-only) architectures.

This limitation means bidirectional RNNs are used for:

  • Encoding in sequence-to-sequence models (the encoder sees the full input)
  • Classification tasks where the full sequence is available
  • Feature extraction for downstream tasks

But not for:

  • Language modeling (predicting next tokens)
  • Text generation (writing stories, completing sentences)
  • Decoders in seq2seq models (which generate one token at a time)

PyTorch Implementation

PyTorch provides built-in support for bidirectional RNNs. Let's see how to use it and verify our understanding:

In[21]:
Code
import torch
import torch.nn as nn

# Create bidirectional LSTM
input_dim = 10
hidden_dim = 16
seq_len = 5
batch_size = 1

torch_bilstm = nn.LSTM(
    input_size=input_dim,
    hidden_size=hidden_dim,
    num_layers=1,
    batch_first=True,
    bidirectional=True,  # This is the key parameter
)

# Sample input
x = torch.randn(batch_size, seq_len, input_dim)

# Forward pass
output, (h_n, c_n) = torch_bilstm(x)
Out[22]:
Console
PyTorch Bidirectional LSTM:
  Input shape: (1, 5, 10)
  Output shape: (1, 5, 32)
  Final hidden shape: (2, 1, 16)
  Final cell shape: (2, 1, 16)

Output dimension: 32 = 2 × 16 (bidirectional)
Final hidden: 2 directions × 16 hidden units

PyTorch's bidirectional LSTM automatically handles the forward and backward passes and concatenates the results. The output shape confirms the doubled dimensionality: 32 features instead of 16. The final hidden state tensor has shape (2, 1, 16), where the first dimension indexes direction (0 = forward, 1 = backward), the second is batch size, and the third is hidden dimension. This matches our from-scratch implementation.

Extracting Directional States

To access the forward and backward components separately:

In[23]:
Code
# Split output into forward and backward components
forward_output = output[:, :, :hidden_dim]  # First half of features
backward_output = output[:, :, hidden_dim:]  # Second half of features

# Final states by direction
final_forward_h = h_n[0]  # Shape: (batch, hidden_dim)
final_backward_h = h_n[1]  # Shape: (batch, hidden_dim)
Out[24]:
Console
Extracting Directional Components:
  Forward output shape: (1, 5, 16)
  Backward output shape: (1, 5, 16)
  Final forward hidden: (1, 16)
  Final backward hidden: (1, 16)

Slicing the output tensor along the feature dimension separates the forward and backward components. Each has shape (1, 5, 16), corresponding to batch size, sequence length, and hidden dimension. The final hidden states for each direction have shape (1, 16), representing the last state computed by each directional LSTM. These can be concatenated for sequence classification or used separately depending on your task.
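A useful sanity check ties the two views together. Using the output, h_n, and hidden_dim from the cells above: the forward direction finishes at the last timestep, so its final state matches the forward slice there, while the backward direction finishes at the first timestep, so its final state matches the backward slice at position 0.

```python
# The forward LSTM's final state lives at the last timestep of the forward slice;
# the backward LSTM's final state lives at the first timestep of the backward slice.
print(torch.allclose(h_n[0], output[:, -1, :hidden_dim]))   # expected: True
print(torch.allclose(h_n[1], output[:, 0, hidden_dim:]))    # expected: True
```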

Visualizing the Concatenated Output Structure

To make the structure of bidirectional output concrete, let's visualize how the forward and backward components are arranged in the concatenated tensor:

Figure: Structure of the concatenated bidirectional output. Each row is a sequence position, and the columns show hidden dimensions. The first half (blue) contains forward hidden states, the second half (orange) contains backward hidden states. The color intensity shows activation magnitude.

The visualization makes the concatenation structure explicit: at each position $t$, the output vector has $2h$ dimensions. The first $h$ dimensions (indices 0 to $h-1$) contain the forward hidden state $\overrightarrow{h}_t$, encoding left context. The second $h$ dimensions (indices $h$ to $2h-1$) contain the backward hidden state $\overset{\leftarrow}{h}_t$, encoding right context. When you slice output[:, :, :hidden_dim], you get the forward component; output[:, :, hidden_dim:] gives you the backward component.

Comparing Unidirectional and Bidirectional Performance

Let's design an experiment that demonstrates the advantage of bidirectional processing. We'll create a task where context from both directions is necessary for correct classification.

The Bracket Matching Task

We'll create sequences with matching brackets where the model must classify whether each position is inside or outside a bracket pair. This requires seeing both the opening bracket (past) and closing bracket (future).

In[26]:
Code
def generate_bracket_data(num_samples, seq_len):
    """
    Generate sequences with bracket pairs.
    Label each position as inside (1) or outside (0) brackets.
    """
    X = np.zeros((num_samples, seq_len, 3))  # 3 features: open, close, other
    y = np.zeros((num_samples, seq_len), dtype=int)

    for i in range(num_samples):
        # Randomly place a bracket pair
        open_pos = np.random.randint(0, seq_len - 2)
        close_pos = np.random.randint(open_pos + 2, seq_len)

        # Fill with "other" tokens
        X[i, :, 2] = 1

        # Place brackets
        X[i, open_pos, :] = [1, 0, 0]  # Open bracket
        X[i, close_pos, :] = [0, 1, 0]  # Close bracket

        # Label positions inside brackets
        y[i, open_pos + 1 : close_pos] = 1

    return X, y


# Generate data
np.random.seed(42)
X_train, y_train = generate_bracket_data(500, 10)
X_test, y_test = generate_bracket_data(100, 10)

Now let's train both unidirectional and bidirectional models:

In[27]:
Code
import torch.optim as optim


class UnidirectionalTagger(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(
            input_dim, hidden_dim, batch_first=True, bidirectional=False
        )
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        output, _ = self.lstm(x)
        return self.fc(output)


class BidirectionalTaggerPyTorch(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(
            input_dim, hidden_dim, batch_first=True, bidirectional=True
        )
        self.fc = nn.Linear(2 * hidden_dim, 2)

    def forward(self, x):
        output, _ = self.lstm(x)
        return self.fc(output)


def train_and_evaluate(model, X_train, y_train, X_test, y_test, epochs=50):
    """Train model and return test accuracy."""
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    X_train_t = torch.FloatTensor(X_train)
    y_train_t = torch.LongTensor(y_train)
    X_test_t = torch.FloatTensor(X_test)
    y_test_t = torch.LongTensor(y_test)

    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        output = model(X_train_t)
        loss = criterion(output.view(-1, 2), y_train_t.view(-1))
        loss.backward()
        optimizer.step()

    # Evaluate
    model.eval()
    with torch.no_grad():
        output = model(X_test_t)
        predictions = output.argmax(dim=2)
        accuracy = (predictions == y_test_t).float().mean().item()

    return accuracy


# Modified training function that tracks history
def train_and_evaluate_with_history(
    model, X_train, y_train, X_test, y_test, epochs=50
):
    """Train model and return test accuracy plus training history."""
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    X_train_t = torch.FloatTensor(X_train)
    y_train_t = torch.LongTensor(y_train)
    X_test_t = torch.FloatTensor(X_test)
    y_test_t = torch.LongTensor(y_test)

    train_losses = []
    test_accs = []

    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        output = model(X_train_t)
        loss = criterion(output.view(-1, 2), y_train_t.view(-1))
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())

        # Track test accuracy
        model.eval()
        with torch.no_grad():
            test_output = model(X_test_t)
            predictions = test_output.argmax(dim=2)
            acc = (predictions == y_test_t).float().mean().item()
            test_accs.append(acc)

    return acc, train_losses, test_accs


# Train both models with history tracking
torch.manual_seed(42)
uni_model = UnidirectionalTagger(input_dim=3, hidden_dim=16)
bi_model = BidirectionalTaggerPyTorch(input_dim=3, hidden_dim=16)

uni_acc, uni_losses, uni_accs = train_and_evaluate_with_history(
    uni_model, X_train, y_train, X_test, y_test
)

torch.manual_seed(42)  # Reset seed for fair comparison
bi_model = BidirectionalTaggerPyTorch(input_dim=3, hidden_dim=16)
bi_acc, bi_losses, bi_accs = train_and_evaluate_with_history(
    bi_model, X_train, y_train, X_test, y_test
)
Out[28]:
Console
Bracket Matching Task Results:
  Unidirectional LSTM accuracy: 98.1%
  Bidirectional LSTM accuracy: 100.0%
  Improvement: 1.9%

The bidirectional model achieves near-perfect accuracy because it can see both the opening and closing brackets when classifying each position. The unidirectional model struggles because when it processes a position, it doesn't know if a closing bracket will appear later. This task is specifically designed to require future context: determining whether a position is inside brackets needs information about both the opening bracket (past) and closing bracket (future).

Figure: Training loss and test accuracy over epochs. The bidirectional model (purple) achieves lower loss faster than the unidirectional model (blue), finding a better solution, and reaches near-perfect test accuracy while the unidirectional model plateaus below it, unable to solve the task without future context.

The learning curves reveal the dynamics of training. The bidirectional model not only achieves higher final accuracy but also learns faster, reaching good performance within the first few epochs. The unidirectional model's loss decreases but plateaus at a higher value, reflecting its fundamental inability to solve the task perfectly without future context.

Where Does the Unidirectional Model Fail?

To understand why the unidirectional model struggles, let's analyze accuracy by position within the sequence. The bracket matching task has a specific structure: positions before the opening bracket should be labeled "outside," positions between brackets should be "inside," and positions after the closing bracket should be "outside."

Figure: Per-position accuracy for unidirectional vs. bidirectional models. The unidirectional model performs well at early positions (before brackets) but struggles at later positions where it needs future context to know if a closing bracket will appear. The bidirectional model maintains high accuracy throughout.

The per-position analysis reveals the fundamental limitation of unidirectional processing. Early positions (near the start of the sequence) tend to have higher accuracy because they're more likely to be before any bracket, so the model can correctly predict "outside" without needing future context. However, middle positions are problematic: the unidirectional model has seen the opening bracket but doesn't know if or when a closing bracket will appear. It must guess whether the current position is inside or outside brackets based only on past context.

The bidirectional model maintains high accuracy at all positions because it has access to the full sequence. When classifying position 5, it knows both that an opening bracket appeared at position 2 and that a closing bracket appears at position 7. This complete picture enables correct classification regardless of position.

Practical Considerations

When deciding whether to use bidirectional RNNs, consider these factors:

When to Use Bidirectional RNNs

Bidirectional architectures are well-suited for tasks where:

  • The entire input sequence is available before making predictions
  • Context from both directions improves understanding (most NLP classification tasks)
  • You're building an encoder that will feed into a decoder
  • Token-level predictions depend on surrounding context (NER, POS tagging)

When to Avoid Bidirectional RNNs

Stick with unidirectional architectures when:

  • You need to generate sequences autoregressively
  • You're processing streaming data where future inputs aren't available
  • Latency is critical and you can't wait for the full sequence
  • You're building a decoder in a seq2seq model

Parameter and Computation Cost

Bidirectional RNNs double both parameters and computation compared to unidirectional versions with the same hidden size. If you need a bidirectional model with the same total capacity as a unidirectional one, use half the hidden size in each direction.

In[31]:
Code
def count_lstm_params(input_dim, hidden_dim, bidirectional=False):
    """Count parameters in an LSTM layer."""
    # Each direction: 4 * hidden_dim * (input_dim + hidden_dim + 1)
    params_per_direction = 4 * hidden_dim * (input_dim + hidden_dim + 1)
    multiplier = 2 if bidirectional else 1
    return params_per_direction * multiplier


input_dim = 100
hidden_dim = 256

uni_params = count_lstm_params(input_dim, hidden_dim, bidirectional=False)
bi_params = count_lstm_params(input_dim, hidden_dim, bidirectional=True)
bi_half_params = count_lstm_params(
    input_dim, hidden_dim // 2, bidirectional=True
)
Out[32]:
Console
Parameter Comparison (input_dim=100):
  Unidirectional (hidden=256): 365,568 parameters
  Bidirectional (hidden=256):  731,136 parameters
  Bidirectional (hidden=128):  234,496 parameters

Bidirectional with hidden=128 has similar capacity to unidirectional with hidden=256

The parameter counts reveal the cost of bidirectionality:

LSTM parameter comparison with input dimension 100. Bidirectional models double the parameters compared to unidirectional ones with the same hidden size. Using half the hidden size in each direction achieves bidirectional context with similar total parameters.

| Configuration | Hidden Size | Parameters | Output Dim | Notes |
| --- | --- | --- | --- | --- |
| Unidirectional | 256 | 366,592 | 256 | Baseline |
| Bidirectional | 256 | 733,184 | 512 | 2× parameters |
| Bidirectional | 128 | 188,416 | 256 | Similar to unidirectional |

A bidirectional LSTM with hidden size 256 has exactly twice the parameters of a unidirectional one, since it maintains two complete sets of weights. If you need to match parameter budgets, using hidden size 128 in each direction gives you bidirectional context with roughly the same total parameters as a unidirectional model with hidden size 256. The trade-off is that each direction has less capacity, but the combined representation still benefits from seeing the full sequence.
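If you want to check the counts against an actual PyTorch layer, a quick sketch like the one below works; note that the exact totals depend on PyTorch's parameterization (it keeps two bias vectors per gate), so they will differ slightly from the simple formula used earlier.

```python
import torch.nn as nn

def torch_lstm_params(input_dim, hidden_dim, bidirectional):
    """Count the parameters PyTorch actually allocates for an LSTM layer."""
    lstm = nn.LSTM(input_dim, hidden_dim, bidirectional=bidirectional)
    return sum(p.numel() for p in lstm.parameters())

print(torch_lstm_params(100, 256, bidirectional=False))   # unidirectional baseline
print(torch_lstm_params(100, 256, bidirectional=True))    # exactly double the baseline
print(torch_lstm_params(100, 128, bidirectional=True))    # half hidden size per direction
```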

Limitations and Impact

Bidirectional RNNs solved a critical limitation of unidirectional sequence models: the inability to incorporate future context. This architectural innovation had significant impact across NLP, enabling substantial improvements on tasks from named entity recognition to machine translation encoders.

The most significant practical limitation remains the incompatibility with autoregressive generation. You cannot use bidirectional models for language modeling, text generation, or any task requiring sequential token-by-token output. This fundamental constraint means that even as bidirectional encoders became standard, unidirectional decoders remained necessary for generation tasks.

Computational cost presents another consideration. Processing sequences in both directions doubles the computation and memory requirements compared to unidirectional models. For very long sequences or resource-constrained environments, this overhead may be prohibitive. The sequential nature of RNNs compounds this issue: you cannot parallelize across time steps, so processing time scales linearly with sequence length regardless of available hardware.

Despite these constraints, bidirectional RNNs became a foundational component of modern NLP. The ELMo model, which achieved state-of-the-art results across many benchmarks in 2018, used deep bidirectional LSTMs. BERT and subsequent transformer models adopted the bidirectional principle, though they achieved bidirectionality through attention mechanisms rather than separate forward and backward passes. The insight that both past and future context matter for understanding language proved more durable than any specific architectural implementation.

Summary

Bidirectional RNNs process sequences in both directions simultaneously, combining forward and backward hidden states to create representations informed by the entire sequence context.

The key architectural components are:

  • A forward RNN processing from position 1 to $T$, producing states $\overrightarrow{h}_1, \ldots, \overrightarrow{h}_T$
  • A backward RNN processing from position $T$ to 1, producing states $\overset{\leftarrow}{h}_T, \ldots, \overset{\leftarrow}{h}_1$
  • Concatenation at each position: $h_t = [\overrightarrow{h}_t; \overset{\leftarrow}{h}_t]$

This architecture excels at classification tasks where full sequence context improves predictions. Named entity recognition, sentiment analysis, and part-of-speech tagging all benefit from bidirectional processing. The encoder in sequence-to-sequence models typically uses bidirectional RNNs to capture complete input context.

The fundamental limitation is incompatibility with autoregressive generation. Since future tokens don't exist during generation, the backward pass cannot be computed. Language models, text generators, and decoders must use unidirectional architectures.

Bidirectional RNNs double both parameters and computation compared to unidirectional versions. When computational budget is fixed, using half the hidden size in each direction maintains similar total capacity while gaining bidirectional context.

Key Parameters

When working with bidirectional RNNs in PyTorch (nn.LSTM, nn.GRU, nn.RNN), the following parameters are most relevant:

  • bidirectional: Set to True to enable bidirectional processing. This doubles the output dimension and the number of parameters.

  • hidden_size: The hidden dimension for each direction. With bidirectional=True, the output dimension becomes 2 * hidden_size.

  • num_layers: Number of stacked RNN layers. Each layer can be bidirectional independently, though typically all layers share the same directionality.

  • batch_first: When True, input and output tensors have shape (batch, seq, features). The bidirectional output features are concatenated along the last dimension.

  • dropout: Applied between layers (when num_layers > 1). Does not affect bidirectionality but helps regularize deeper bidirectional stacks.

The output tensor shape is (batch, seq_len, num_directions * hidden_size) where num_directions is 2 for bidirectional models. The final hidden state has shape (num_layers * num_directions, batch, hidden_size), with forward and backward states interleaved by layer.
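To keep the layer and direction axes straight, you can reshape the final hidden state into explicit (layer, direction) dimensions, as described in PyTorch's documentation for bidirectional RNNs. A small sketch with a 2-layer bidirectional LSTM:

```python
import torch
import torch.nn as nn

num_layers, hidden_size, batch = 2, 16, 3
lstm = nn.LSTM(input_size=10, hidden_size=hidden_size, num_layers=num_layers,
               batch_first=True, bidirectional=True)

x = torch.randn(batch, 7, 10)                       # (batch, seq_len, features)
output, (h_n, c_n) = lstm(x)

# h_n has shape (num_layers * num_directions, batch, hidden_size), interleaved by layer.
h_n_view = h_n.view(num_layers, 2, batch, hidden_size)
last_layer_forward = h_n_view[-1, 0]                # (batch, hidden_size)
last_layer_backward = h_n_view[-1, 1]               # (batch, hidden_size)

print(output.shape)                                 # torch.Size([3, 7, 32])
print(h_n_view.shape)                               # torch.Size([2, 2, 3, 16])
```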

