Bidirectional RNNs: Capturing Full Sequence Context

Michael Brenndoerfer · December 16, 2025 · 52 min read

Learn how bidirectional RNNs process sequences in both directions to capture past and future context. Covers architecture, LSTMs, implementation, and when to use them.


Bidirectional RNNs

Standard RNNs process sequences in one direction: from the first token to the last. At each timestep, the hidden state captures information about everything that came before. But for many tasks, the future matters just as much as the past. When you're trying to understand the meaning of a word in a sentence, you naturally consider both what came before and what comes after.

Consider the sentence: "The bank by the river was steep." To understand that "bank" refers to a riverbank rather than a financial institution, you need to see "river" which appears after "bank." A forward-only RNN processing "bank" has no access to this disambiguating context. Bidirectional RNNs solve this problem by running two separate RNNs: one forward through time, one backward. At each position, you get a representation informed by the entire sequence.

The Bidirectional Intuition

Think about how you actually read text. When you encounter an ambiguous word, you don't commit to an interpretation immediately. You keep reading, gathering more context, and then your understanding of earlier words crystallizes. You're implicitly using bidirectional information: past context and future context together inform your understanding of each position.

Bidirectional RNNs formalize this intuition with a simple architectural change. Instead of one RNN, we use two:

  • A forward RNN that processes the sequence from position 1 to position $T$, producing hidden states $\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_T$
  • A backward RNN that processes the sequence from position $T$ to position 1, producing hidden states $\overset{\leftarrow}{h}_T, \overset{\leftarrow}{h}_{T-1}, \ldots, \overset{\leftarrow}{h}_1$

At each position $t$, we concatenate these two hidden states to get a combined representation that captures both past and future context:

$$h_t = [\overrightarrow{h}_t; \overset{\leftarrow}{h}_t]$$

where:

  • $h_t \in \mathbb{R}^{2h}$: the combined bidirectional hidden state at position $t$
  • $\overrightarrow{h}_t \in \mathbb{R}^h$: the forward hidden state, encoding information from positions $1, 2, \ldots, t$
  • $\overset{\leftarrow}{h}_t \in \mathbb{R}^h$: the backward hidden state, encoding information from positions $T, T-1, \ldots, t$
  • $[\,;\,]$: the concatenation operator, which stacks two vectors end-to-end
  • $h$: the hidden dimension of each directional RNN

The concatenation doubles the dimensionality: if each directional RNN has hidden dimension $h$, the combined representation has dimension $2h$.
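To make the shape bookkeeping concrete, here is a minimal NumPy sketch (toy vectors, hidden size 4) of the concatenation step; the values are random and simply stand in for whatever the two directional RNNs would compute:

```python
import numpy as np

h = 4                                  # hidden dimension of each directional RNN (toy size)
h_forward = np.random.randn(h)         # forward hidden state at some position t
h_backward = np.random.randn(h)        # backward hidden state at the same position

# Concatenation stacks the two vectors end-to-end: dimension h + h = 2h
h_combined = np.concatenate([h_forward, h_backward])
print(h_forward.shape, h_backward.shape, h_combined.shape)  # (4,) (4,) (8,)
```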

Figure: A bidirectional RNN processes the input sequence in both directions. The forward RNN (blue, top) reads left-to-right, while the backward RNN (orange, bottom) reads right-to-left. At each position, the two hidden states are concatenated to form a representation informed by the entire sequence.
Bidirectional RNN

A bidirectional RNN processes a sequence in both directions simultaneously, using a forward RNN and a backward RNN. The hidden states from both directions are combined (typically concatenated) at each position to create representations that capture context from the entire sequence.

Forward and Backward Passes

Now that we understand why bidirectional processing helps, let's formalize how information flows through the architecture. The elegance of bidirectional RNNs lies in their simplicity: we run two completely independent RNNs on the same input, one reading left-to-right and one reading right-to-left. Each builds its own summary of the sequence from its respective direction, and we combine these complementary views at the end.

Consider an input sequence $x_1, x_2, \ldots, x_T$ where each $x_t \in \mathbb{R}^d$ is a $d$-dimensional vector, typically a word embedding in NLP or acoustic features in speech recognition. Our goal is to produce a representation at each position that captures context from the entire sequence, not just what came before.

The Forward Pass: Accumulating Left Context

The forward RNN processes the sequence in natural reading order, from position 1 to position $T$. At each step, it faces the same question every recurrent network must answer: how do I combine what I'm seeing right now with everything I've accumulated so far?

The answer is the standard RNN update equation, applied in the forward direction:

$$\overrightarrow{h}_t = f(\overrightarrow{W}_x x_t + \overrightarrow{W}_h \overrightarrow{h}_{t-1} + \overrightarrow{b})$$

Let's unpack each component to understand its role:

  • $x_t \in \mathbb{R}^d$: the input at the current position, the raw information we're processing
  • $\overrightarrow{h}_{t-1} \in \mathbb{R}^h$: the previous hidden state, carrying a compressed summary of everything from positions $1$ through $t-1$
  • $\overrightarrow{W}_x \in \mathbb{R}^{h \times d}$: the input weight matrix, which learns to extract relevant features from the current input
  • $\overrightarrow{W}_h \in \mathbb{R}^{h \times h}$: the recurrent weight matrix, which learns how to integrate new information with accumulated history
  • $\overrightarrow{b} \in \mathbb{R}^h$: a bias vector that shifts the activation
  • $f$: an activation function (tanh for vanilla RNNs, or the full gating mechanism for LSTMs and GRUs)
  • $\overrightarrow{h}_t \in \mathbb{R}^h$: the output, a new hidden state that now summarizes positions $1$ through $t$

The computation starts with $\overrightarrow{h}_0 = \mathbf{0}$, a zero vector representing "no prior context." As the forward pass proceeds through $t = 1, 2, \ldots, T$, each hidden state accumulates progressively more information about the sequence's left context. By the time we reach position $T$, the final forward state $\overrightarrow{h}_T$ has seen the entire sequence, but only from the left-to-right perspective.

The Backward Pass: Accumulating Right Context

The backward RNN performs the mirror-image computation. It starts at the end of the sequence and works its way back to the beginning, building up a summary of what comes after each position:

$$\overset{\leftarrow}{h}_t = f(\overset{\leftarrow}{W}_x x_t + \overset{\leftarrow}{W}_h \overset{\leftarrow}{h}_{t+1} + \overset{\leftarrow}{b})$$

The structure is identical to the forward pass, but notice the crucial difference in the subscript: the backward hidden state at position $t$ depends on $\overset{\leftarrow}{h}_{t+1}$, not $\overset{\leftarrow}{h}_{t-1}$. This reflects the reversed processing order: the backward RNN has already processed positions $T, T-1, \ldots, t+1$ before it reaches position $t$.

Each component plays the same role as in the forward pass, but now oriented toward capturing right context:

  • $\overset{\leftarrow}{h}_{t+1} \in \mathbb{R}^h$: the hidden state from the next position (which the backward RNN has already processed)
  • $\overset{\leftarrow}{W}_x, \overset{\leftarrow}{W}_h, \overset{\leftarrow}{b}$: the backward network's own weight matrices and bias, completely separate from the forward parameters
  • $\overset{\leftarrow}{h}_t \in \mathbb{R}^h$: the output, summarizing positions $t$ through $T$

The backward pass begins with $\overset{\leftarrow}{h}_{T+1} = \mathbf{0}$ and proceeds through $t = T, T-1, \ldots, 1$. By the time it reaches position 1, the state $\overset{\leftarrow}{h}_1$ has processed the entire sequence from right to left.
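The two recurrences translate directly into two loops over the sequence. The sketch below uses a vanilla tanh RNN with small, randomly initialized (untrained) weights purely to show the index pattern: the forward loop carries the state from position $t-1$, the backward loop carries it from position $t+1$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, h = 6, 3, 4                        # sequence length, input dim, hidden dim (toy sizes)
x = rng.standard_normal((T, d))          # toy input sequence

# Independent (untrained) parameters for each direction
Wx_f, Wh_f, b_f = rng.standard_normal((h, d)), rng.standard_normal((h, h)), np.zeros(h)
Wx_b, Wh_b, b_b = rng.standard_normal((h, d)), rng.standard_normal((h, h)), np.zeros(h)

# Forward pass: t = 0, 1, ..., T-1, starting from a zero state
h_fwd = np.zeros((T, h))
prev = np.zeros(h)
for t in range(T):
    prev = np.tanh(Wx_f @ x[t] + Wh_f @ prev + b_f)
    h_fwd[t] = prev

# Backward pass: t = T-1, T-2, ..., 0, also starting from a zero state
h_bwd = np.zeros((T, h))
prev = np.zeros(h)
for t in range(T - 1, -1, -1):
    prev = np.tanh(Wx_b @ x[t] + Wh_b @ prev + b_b)
    h_bwd[t] = prev

combined = np.concatenate([h_fwd, h_bwd], axis=1)
print(combined.shape)                    # (6, 8): T positions, 2h features each
```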

Two Independent Networks with Complementary Views

A crucial architectural point: the forward and backward RNNs are completely separate networks with independent parameters. The forward weights $\overrightarrow{W}_x$, $\overrightarrow{W}_h$, $\overrightarrow{b}$ share nothing with the backward weights $\overset{\leftarrow}{W}_x$, $\overset{\leftarrow}{W}_h$, $\overset{\leftarrow}{b}$. Each network learns its own way of processing the sequence, free to specialize in capturing different types of patterns.

This independence has a direct consequence for model size: a bidirectional RNN has exactly twice the parameters of a unidirectional RNN with the same hidden dimension. If a unidirectional LSTM with hidden size 256 has 1 million parameters, the bidirectional version has 2 million. The doubled capacity is the price we pay for bidirectional context.

Concrete Example: Disambiguating "Bank"

To see how these two passes complement each other, let's trace through our example sentence: "The bank by the river was steep."

When the forward RNN processes "bank" at position 2, its hidden state $\overrightarrow{h}_2$ has only seen "The bank." Based on this limited context, it might tentatively encode "bank" as a financial institution, a reasonable guess given word frequencies. The forward hidden state captures everything to the left, but that's not enough to disambiguate.

When the backward RNN processes "bank" at position 2, the situation is different. It has already processed "steep," "was," "river," "the," and "by" (in that order). Its hidden state $\overset{\leftarrow}{h}_2$ carries this future context, strongly suggesting a geographical feature rather than a financial one. The word "river" appearing later in the sentence is the key disambiguating signal.

By concatenating $\overrightarrow{h}_2$ and $\overset{\leftarrow}{h}_2$, we get a representation that knows both what came before ("The") and what comes after ("by the river was steep"). This combined representation has all the information needed to correctly interpret "bank" as a riverbank.

Figure: Information available at position 3 ('by') for forward, backward, and bidirectional representations. The forward RNN only sees past context (positions 1-3). The backward RNN only sees future context (positions 3-5). The bidirectional combination captures the full sequence.

Hidden State Concatenation

At this point, we have two hidden states at each position: $\overrightarrow{h}_t$ capturing left context and $\overset{\leftarrow}{h}_t$ capturing right context. The final step is combining these complementary views into a single representation. This combination operation is where the bidirectional magic happens: we're fusing two partial pictures of the sequence into one complete view.

The Standard Approach: Concatenation

The most common and effective strategy is simple concatenation. We stack the two hidden vectors end-to-end, creating a longer vector that preserves all information from both directions:

ht=[ht;ht]R2hh_t = [\overrightarrow{h}_t; \overset{\leftarrow}{h}_t] \in \mathbb{R}^{2h}

where:

  • $h_t$: the combined bidirectional representation at position $t$
  • $\overrightarrow{h}_t \in \mathbb{R}^h$: the forward hidden state, occupying the first $h$ elements
  • $\overset{\leftarrow}{h}_t \in \mathbb{R}^h$: the backward hidden state, occupying the last $h$ elements

Why is concatenation the default choice? It's lossless: no information from either direction is discarded or compressed. The forward and backward components remain distinct, allowing downstream layers to learn how to weight them appropriately for the task at hand. The cost is that the representation doubles in size: if each directional RNN uses hidden dimension 256, the combined output has dimension 512.

Alternative Combination Strategies

While concatenation dominates in practice, other approaches exist. Each makes different trade-offs between information preservation and computational efficiency:

Summation adds the two hidden states element-wise:

ht=ht+htRhh_t = \overrightarrow{h}_t + \overset{\leftarrow}{h}_t \in \mathbb{R}^h

This preserves the original dimensionality, which can be convenient when you need to match a specific output size or maintain compatibility with existing architecture. However, summation is lossy. Consider what happens when the forward state has value 0.5 in some dimension and the backward state has -0.5: they cancel out to zero, erasing the information that both directions had strong but opposite signals. The sum can't distinguish "both directions agree on zero" from "both directions disagree strongly."

Element-wise product multiplies corresponding elements:

ht=hthtRhh_t = \overrightarrow{h}_t \odot \overset{\leftarrow}{h}_t \in \mathbb{R}^h

This captures multiplicative interactions between the directions, which can be powerful when you want to detect patterns that require agreement from both sides. However, it's numerically unstable: if either direction has values near zero, the product collapses regardless of what the other direction says. It also fails when only one direction has relevant information, since a strong signal from one side gets suppressed by a weak signal from the other.

Learned projection concatenates first, then applies a weight matrix:

ht=Wc[ht;ht]Rhh_t = W_c[\overrightarrow{h}_t; \overset{\leftarrow}{h}_t] \in \mathbb{R}^{h'}

where $W_c \in \mathbb{R}^{h' \times 2h}$ is a learned projection matrix. This lets the network learn the optimal way to combine the directions, potentially compressing back to the original dimension ($h' = h$) or any other size. The trade-off is additional parameters and computation, plus the risk that the projection might discard useful information during training.

Concatenation remains the standard choice because it's simple, lossless, and numerically stable. The doubled dimensionality is usually acceptable, and downstream layers can learn to weight the forward and backward components appropriately for each task.
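The four strategies are easy to compare side by side. A quick sketch on toy, randomly generated vectors (nothing here is trained; it only shows the output dimension and the kind of operation each strategy performs):

```python
import numpy as np

rng = np.random.default_rng(1)
h = 4
h_fwd = rng.standard_normal(h)                  # forward hidden state (toy)
h_bwd = rng.standard_normal(h)                  # backward hidden state (toy)

concat = np.concatenate([h_fwd, h_bwd])         # (2h,)  lossless, doubles the dimension
summed = h_fwd + h_bwd                          # (h,)   lossy: opposite signals can cancel
product = h_fwd * h_bwd                         # (h,)   collapses when either side is near zero
W_c = rng.standard_normal((h, 2 * h)) * 0.1     # learned projection back to h dimensions
projected = W_c @ concat                        # (h,)   learned mixing of both directions

for name, v in [("concat", concat), ("sum", summed), ("product", product), ("projection", projected)]:
    print(f"{name:>10}: dimension {v.shape[0]}")
```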

Practical Implications of Doubled Dimensions

The doubled output dimension has concrete implications for your network architecture. Any layer that receives the bidirectional output must be sized accordingly:

  • A classification layer needs $2h$ input features instead of $h$
  • Attention mechanisms must account for the larger key/value dimensions
  • Memory requirements increase proportionally

In PyTorch and similar frameworks, specifying bidirectional=True automatically handles the output dimensionality: the output tensor's last dimension becomes 2 × hidden_size. But when designing custom architectures or debugging dimension mismatches, understanding this doubling is essential.

For sequence-to-sequence models, the decoder typically needs to initialize from the encoder's final state. With a bidirectional encoder, you have two "final" states: $\overrightarrow{h}_T$ from the forward pass (which has seen the whole sequence left-to-right) and $\overset{\leftarrow}{h}_1$ from the backward pass (which has seen it right-to-left). Common approaches include concatenating them, summing them, or using a learned projection to combine them into the decoder's initial state.
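As a sketch of that last point, here is one hedged way to build a decoder's initial state from the two encoder final states (toy dimensions, randomly initialized projection; a real system would learn the projection along with the rest of the model):

```python
import numpy as np

rng = np.random.default_rng(2)
h_enc, h_dec = 4, 4                        # encoder (per-direction) and decoder hidden sizes
h_T_forward = rng.standard_normal(h_enc)   # final forward state: has read the whole input
h_1_backward = rng.standard_normal(h_enc)  # final backward state: has also read the whole input

# Option 1: concatenate, then project to the decoder size with a (learned) matrix
W_init = rng.standard_normal((h_dec, 2 * h_enc)) * 0.1
decoder_h0 = np.tanh(W_init @ np.concatenate([h_T_forward, h_1_backward]))

# Option 2: simply sum the two final states (keeps dimension h_enc)
decoder_h0_sum = h_T_forward + h_1_backward

print(decoder_h0.shape, decoder_h0_sum.shape)   # (4,) (4,)
```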

Bidirectional LSTMs and GRUs

The bidirectional concept we've developed, running two RNNs in opposite directions and concatenating their outputs, is architecture-agnostic. It works identically whether the underlying RNN is a vanilla RNN, an LSTM, or a GRU. In practice, bidirectional LSTMs dominate because they combine bidirectional context with LSTM's ability to capture long-range dependencies. The gating mechanisms that make LSTMs powerful for long sequences work just as well in both directions.

A bidirectional LSTM consists of two independent LSTMs: one processing left-to-right, one processing right-to-left. Each maintains its own hidden state $h$ and cell state $c$. The cell state, LSTM's key innovation for preserving information over long distances, operates independently in each direction. Each direction has its own memory pathway, learning to remember and forget different aspects of the sequence.

Forward LSTM: Gated Memory from the Left

At each position $t$, the forward LSTM computes four gate values based on the current input and the previous hidden state. These gates control how information flows into, out of, and through the cell state:

$$\overrightarrow{f}_t, \overrightarrow{i}_t, \overrightarrow{o}_t, \overrightarrow{\tilde{c}}_t = \text{gates}(\overrightarrow{W}, x_t, \overrightarrow{h}_{t-1})$$

Each gate serves a specific purpose in managing the LSTM's memory:

  • $\overrightarrow{f}_t \in \mathbb{R}^h$: the forget gate, with values between 0 and 1 that decide how much of the previous cell state to retain. A value near 1 means "keep this memory," while near 0 means "discard it."
  • $\overrightarrow{i}_t \in \mathbb{R}^h$: the input gate, which controls how much of the new candidate information to write into memory
  • $\overrightarrow{o}_t \in \mathbb{R}^h$: the output gate, which controls how much of the cell state to expose as the hidden state
  • $\overrightarrow{\tilde{c}}_t \in \mathbb{R}^h$: the candidate cell state, which proposes new values that could be added to memory
  • $\overrightarrow{W}$: the collection of forward LSTM weight matrices (one set for each gate)

The cell state update is where LSTM's memory mechanism operates. Old information is selectively forgotten, and new information is selectively added:

$$\overrightarrow{c}_t = \overrightarrow{f}_t \odot \overrightarrow{c}_{t-1} + \overrightarrow{i}_t \odot \overrightarrow{\tilde{c}}_t$$

where $\odot$ denotes element-wise multiplication. The first term $\overrightarrow{f}_t \odot \overrightarrow{c}_{t-1}$ preserves parts of the old cell state that the forget gate allows through. The second term $\overrightarrow{i}_t \odot \overrightarrow{\tilde{c}}_t$ adds parts of the candidate that the input gate admits. This additive structure is what allows LSTMs to maintain information over long distances without the vanishing gradient problems that plague vanilla RNNs.

Finally, the hidden state is computed by filtering the cell state through the output gate:

$$\overrightarrow{h}_t = \overrightarrow{o}_t \odot \tanh(\overrightarrow{c}_t)$$

The tanh squashes the cell state to the range $[-1, 1]$, and the output gate selects which dimensions to expose. This hidden state $\overrightarrow{h}_t$ serves two purposes: it gets passed to the next timestep as input, and it's what we'll eventually concatenate with the backward direction.

Backward LSTM: Gated Memory from the Right

The backward LSTM follows exactly the same gating structure, but with two key differences: it has its own independent set of weights, and it processes in the reverse direction. At position $t$, it uses the hidden state from position $t+1$, which it has already computed since it started at the end of the sequence:

$$\overset{\leftarrow}{f}_t, \overset{\leftarrow}{i}_t, \overset{\leftarrow}{o}_t, \overset{\leftarrow}{\tilde{c}}_t = \text{gates}(\overset{\leftarrow}{W}, x_t, \overset{\leftarrow}{h}_{t+1})$$
$$\overset{\leftarrow}{c}_t = \overset{\leftarrow}{f}_t \odot \overset{\leftarrow}{c}_{t+1} + \overset{\leftarrow}{i}_t \odot \overset{\leftarrow}{\tilde{c}}_t$$
$$\overset{\leftarrow}{h}_t = \overset{\leftarrow}{o}_t \odot \tanh(\overset{\leftarrow}{c}_t)$$

Notice the subscript pattern: the cell state update uses $\overset{\leftarrow}{c}_{t+1}$ rather than $\overset{\leftarrow}{c}_{t-1}$. The backward LSTM's "previous" state is the state from the next position in the sequence, reflecting its reversed processing order. The backward forget gate decides what to remember from the future (positions $t+1$ through $T$), while the backward input gate decides what new information from position $t$ to add to that future-oriented memory.

Combining the Directions: Hidden States Only

After both passes complete, we have two hidden states at each position: $\overrightarrow{h}_t$ from the forward LSTM and $\overset{\leftarrow}{h}_t$ from the backward LSTM. These are concatenated exactly as before:

ht=[ht;ht]R2hh_t = [\overrightarrow{h}_t; \overset{\leftarrow}{h}_t] \in \mathbb{R}^{2h}

An important architectural detail: only the hidden states are concatenated. The cell states $\overrightarrow{c}_t$ and $\overset{\leftarrow}{c}_t$ remain internal to their respective LSTMs and are never exposed or combined. This makes sense when you consider their roles: the cell state's job is to maintain long-term memory within each direction, serving as a private scratchpad for the LSTM's internal computations. The hidden state is what carries information to the outside world, and that's what we want to combine for downstream tasks.

Implementing Bidirectional RNNs

The mathematics we've developed translates directly into code. Building a bidirectional RNN from scratch reveals the elegant simplicity of the architecture: we literally run two independent RNNs and concatenate their outputs. There's no magic, just two passes over the data in opposite directions.

Let's implement this step by step, starting with the building blocks and working up to a complete bidirectional LSTM.

Building Blocks: Activation Functions and LSTM Cells

Before we can build the bidirectional layer, we need the components that make up each directional LSTM. The sigmoid function squashes values to the range $(0, 1)$, making it perfect for gates that control information flow (0 = block everything, 1 = let everything through). The tanh function squashes to $(-1, 1)$, used for the candidate cell state and output normalization.

In[5]:
Code
import numpy as np


def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0, 1 / (1 + np.exp(-x)), np.exp(x) / (1 + np.exp(x)))


def tanh(x):
    """Hyperbolic tangent."""
    return np.tanh(x)

Next, we implement a single LSTM cell. This is the building block that gets replicated at each timestep. The cell takes an input vector and the previous states, computes the four gates, updates the cell state, and produces a new hidden state.

In[6]:
Code
class LSTMCell:
    """Single LSTM cell for use in bidirectional network."""

    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        scale = np.sqrt(2.0 / (input_dim + hidden_dim))

        # Combined weights for efficiency: all four gates in one matrix
        self.W = np.random.randn(4 * hidden_dim, input_dim) * scale
        self.U = np.random.randn(4 * hidden_dim, hidden_dim) * scale
        self.b = np.zeros(4 * hidden_dim)
        self.b[hidden_dim : 2 * hidden_dim] = (
            1.0  # Forget gate bias initialized to 1
        )

    def forward(self, x, h_prev, c_prev):
        """Single step forward pass."""
        h = self.hidden_dim
        gates = self.W @ x + self.U @ h_prev + self.b

        # Split the gate computation into four parts
        i = sigmoid(gates[0:h])  # Input gate
        f = sigmoid(gates[h : 2 * h])  # Forget gate
        c_tilde = tanh(gates[2 * h : 3 * h])  # Candidate cell state
        o = sigmoid(gates[3 * h : 4 * h])  # Output gate

        # Cell state update: forget old, add new
        c = f * c_prev + i * c_tilde

        # Hidden state: filtered cell state
        h_new = o * tanh(c)

        return h_new, c

A few implementation details worth noting:

  1. Combined weight matrices: Rather than four separate matrix multiplications (one per gate), we concatenate all weights into single large matrices and do one multiplication, then split the result. This is a common optimization that improves computational efficiency.

  2. Forget gate bias initialization: The forget gate bias is initialized to 1 rather than 0. This encourages the network to preserve information by default early in training, preventing the common failure mode where the network learns to forget everything before it learns what to remember.

  3. Xavier/He initialization: The weight scale np.sqrt(2.0 / (input_dim + hidden_dim)) follows best practices for initializing neural network weights, preventing exploding or vanishing activations at the start of training.
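As a quick sanity check of the cell (not part of the original walkthrough, just a shape test with toy dimensions), we can run a single step and confirm that the hidden and cell states come out with the expected size:

```python
# Single-step shape check of the LSTM cell with toy dimensions
cell = LSTMCell(input_dim=4, hidden_dim=8)
x_t = np.random.randn(4)      # one input vector
h0 = np.zeros(8)              # initial hidden state
c0 = np.zeros(8)              # initial cell state

h1, c1 = cell.forward(x_t, h0, c0)
print(h1.shape, c1.shape)     # (8,) (8,)
```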

The Bidirectional Layer

Now we build the bidirectional layer itself. The architecture is remarkably simple: create two completely independent LSTM cells (one for each direction) and run them on the same input sequence in opposite orders. The forward LSTM sees positions 0, 1, 2, ..., T-1 in that order; the backward LSTM sees T-1, T-2, ..., 0.

In[7]:
Code
class BidirectionalLSTM:
    """Bidirectional LSTM that processes sequences in both directions."""

    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        self.forward_lstm = LSTMCell(input_dim, hidden_dim)
        self.backward_lstm = LSTMCell(input_dim, hidden_dim)

    def forward(self, x_sequence):
        """
        Process sequence bidirectionally.

        Args:
            x_sequence: Input sequence, shape (seq_len, input_dim)

        Returns:
            combined: Concatenated hidden states, shape (seq_len, 2 * hidden_dim)
            forward_states: Forward hidden states, shape (seq_len, hidden_dim)
            backward_states: Backward hidden states, shape (seq_len, hidden_dim)
        """
        seq_len = x_sequence.shape[0]
        h = self.hidden_dim

        # Forward pass: process positions 0, 1, 2, ..., T-1
        forward_states = []
        h_fwd = np.zeros(h)  # Initial hidden state
        c_fwd = np.zeros(h)  # Initial cell state

        for t in range(seq_len):
            h_fwd, c_fwd = self.forward_lstm.forward(
                x_sequence[t], h_fwd, c_fwd
            )
            forward_states.append(h_fwd)

        # Backward pass: process positions T-1, T-2, ..., 0
        backward_states = []
        h_bwd = np.zeros(h)
        c_bwd = np.zeros(h)

        for t in range(seq_len - 1, -1, -1):
            h_bwd, c_bwd = self.backward_lstm.forward(
                x_sequence[t], h_bwd, c_bwd
            )
            backward_states.insert(
                0, h_bwd
            )  # Insert at front to maintain position alignment

        # Concatenate at each position
        forward_states = np.array(forward_states)
        backward_states = np.array(backward_states)
        combined = np.concatenate([forward_states, backward_states], axis=1)

        return combined, forward_states, backward_states

The implementation reveals the core simplicity of bidirectional RNNs:

  1. Forward pass: Iterate through positions 0 to $T-1$ in order, updating hidden and cell states at each step. This is identical to a standard LSTM.

  2. Backward pass: Iterate in reverse from $T-1$ down to 0. The only tricky part is bookkeeping: we use backward_states.insert(0, h_bwd) to insert at the front of the list, ensuring that backward_states[t] corresponds to position $t$ in the original sequence. This alignment is crucial because we need the forward and backward states to match up position-by-position for concatenation.

  3. Concatenation: Stack the forward and backward states along the feature dimension. At each position $t$, the combined representation has the forward context (from positions 0 through $t$) in the first half and backward context (from positions $t$ through $T-1$) in the second half.

Testing the Implementation

Let's verify our implementation produces the expected output shapes and confirm the dimensionality doubling:

In[8]:
Code
# Create bidirectional LSTM
np.random.seed(42)
input_dim = 10
hidden_dim = 16
seq_len = 5

bilstm = BidirectionalLSTM(input_dim, hidden_dim)

# Sample input sequence
x_sequence = np.random.randn(seq_len, input_dim)

# Process bidirectionally
combined, forward_states, backward_states = bilstm.forward(x_sequence)
Out[9]:
Console
Bidirectional LSTM Output Shapes:
  Input sequence: (5, 10)
  Forward states: (5, 16)
  Backward states: (5, 16)
  Combined output: (5, 32)

Output dimension is 2x hidden dimension: 32 = 2 × 16

The output shapes confirm our implementation is working correctly:

  • Input: 5 timesteps × 10 dimensions per timestep
  • Forward states: 5 timesteps × 16 hidden units (left context at each position)
  • Backward states: 5 timesteps × 16 hidden units (right context at each position)
  • Combined output: 5 timesteps × 32 features (full bidirectional context)

The doubling from 16 to 32 is the characteristic signature of bidirectional processing. Every downstream layer that consumes this output must account for this doubled dimensionality.

Visualizing Hidden State Evolution

To build intuition for what the bidirectional LSTM computes, let's visualize how the forward and backward hidden states evolve across the sequence. This reveals the complementary nature of the two directions:

Figure: Directional hidden state magnitudes (L2 norm) across sequence positions. The forward pass (blue) and backward pass (orange) each accumulate context from their respective directions. A second panel compares the combined bidirectional representation (purple) with the individual directions; the combined representation captures information from both passes at each position.

The hidden state magnitudes reveal how each direction builds up its representation over the sequence. With random inputs, the specific patterns depend on the input values, but the key insight is structural: both directions contribute meaningfully at every position. The combined representation (purple diamonds) is always larger than either individual direction because concatenation preserves all information from both passes. We're not losing anything by combining them.

Information Flow: What Each Position "Knows"

The power of bidirectional processing becomes clear when we visualize what information is available at each position. Let's trace how information propagates through the network:

Figure: Information availability at each position. Forward pass: the hidden state at position t contains information from inputs 1 through t (lower triangle). Backward pass: the hidden state at position t contains information from inputs t through T (upper triangle). Combined bidirectional: each position has access to the entire sequence context.

These information flow diagrams illustrate the fundamental difference between unidirectional and bidirectional processing:

  • Forward pass (lower triangle): The hidden state at position $t$ contains information from inputs 1 through $t$. Early positions have limited context; later positions have seen more of the sequence.
  • Backward pass (upper triangle): The hidden state at position $t$ contains information from inputs $t$ through $T$. The pattern is reversed: early positions have seen more (from the backward perspective).
  • Combined (full matrix): Every position has access to the entire sequence. This is the key advantage of bidirectional processing.

Hidden State Activation Patterns

Let's examine the actual hidden state values from our bidirectional LSTM to see what internal representations emerge:

Figure: Forward hidden state activations across sequence positions (rows) and hidden units (columns); each cell shows the activation value for that unit at that position. A second heatmap shows the backward hidden state activations, which develop different patterns that capture complementary right-to-left context.

The heatmaps reveal the internal representations computed by each direction. With randomly initialized weights and random inputs, the patterns are essentially arbitrary, but the structure is informative. Notice how different hidden units show different activation patterns across positions, and how the forward and backward passes develop distinct representations.

In a trained model on real data, you would see meaningful patterns emerge: hidden units that activate for specific linguistic features, with forward units capturing left-context patterns (like "the word following 'the' is likely a noun") and backward units capturing right-context patterns (like "the word before 'said' is likely a person"). The bidirectional combination gives downstream layers access to both types of patterns simultaneously.

How Different Are the Two Directions?

A natural question: do the forward and backward passes learn redundant or complementary representations? Let's measure this by computing the correlation between corresponding hidden units in each direction:

Figure: Correlation matrix between forward and backward hidden units, alongside the distribution of those correlations. Low correlations indicate the two directions capture different information, making their combination more valuable; a spread around zero suggests the directions learn complementary rather than redundant features.

The correlation analysis reveals how independent the two directions are. With random initialization, we expect correlations near zero because the two LSTMs haven't learned to coordinate. In a trained model, you might see some structure emerge: certain forward units might correlate with backward units that capture related (but temporally reversed) patterns. However, the key insight is that low correlation means the directions provide complementary information. If they were highly correlated, concatenation would be wasteful since we'd be duplicating information. The spread of correlations around zero confirms that bidirectional processing genuinely expands the representational capacity.
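A rough way to reproduce this measurement yourself, assuming the forward_states and backward_states arrays from the BidirectionalLSTM run above, is to correlate corresponding hidden units across positions (with only 5 positions the estimates are noisy, but the mechanics are the same):

```python
# Correlation between corresponding forward and backward hidden units,
# measured across sequence positions of the untrained network above
corrs = []
for unit in range(hidden_dim):
    r = np.corrcoef(forward_states[:, unit], backward_states[:, unit])[0, 1]
    corrs.append(r)

print(f"Mean |correlation| across units: {np.mean(np.abs(corrs)):.3f}")
```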

Bidirectionality for Classification

Bidirectional RNNs excel at classification tasks where you need to make a decision based on the entire sequence. Sentiment analysis, named entity recognition, and part-of-speech tagging all benefit from bidirectional context.

Sequence Classification

For classifying an entire sequence (e.g., sentiment of a review), you typically use the final hidden states from both directions:

In[14]:
Code
class BidirectionalClassifier:
    """Bidirectional LSTM for sequence classification."""

    def __init__(self, input_dim, hidden_dim, num_classes):
        self.bilstm = BidirectionalLSTM(input_dim, hidden_dim)
        self.hidden_dim = hidden_dim

        # Classification layer takes concatenated final states
        # Final forward state + Final backward state = 2 * hidden_dim
        self.W_out = np.random.randn(num_classes, 2 * hidden_dim) * 0.1
        self.b_out = np.zeros(num_classes)

    def forward(self, x_sequence):
        """
        Classify a sequence.

        Returns logits for each class.
        """
        combined, forward_states, backward_states = self.bilstm.forward(
            x_sequence
        )

        # Use final states from both directions
        final_forward = forward_states[-1]  # Last position of forward pass
        final_backward = backward_states[
            0
        ]  # First position of backward pass (processed last)

        # Concatenate final states
        final_repr = np.concatenate([final_forward, final_backward])

        # Classification
        logits = self.W_out @ final_repr + self.b_out
        return logits
In[15]:
Code
# Example: binary sentiment classification
classifier = BidirectionalClassifier(input_dim=10, hidden_dim=16, num_classes=2)
x_sequence = np.random.randn(8, 10)  # 8 tokens, 10-dim embeddings

logits = classifier.forward(x_sequence)
probs = np.exp(logits) / np.sum(np.exp(logits))  # Softmax
Out[16]:
Console
Sequence Classification Example:
  Sequence length: 8 tokens
  Input dimension: 10
  Hidden dimension: 16
  Output logits: [-0.0210323   0.07967426]
  Predicted probabilities: [0.47484462 0.52515538]
  Predicted class: 1

The classifier processes an 8-token sequence and produces logits for each class. The softmax converts these to probabilities that sum to 1. With randomly initialized weights, the predictions are essentially random, but the architecture correctly combines bidirectional context: the final forward state captures information from all 8 tokens read left-to-right, while the final backward state captures information from all 8 tokens read right-to-left.

Token-Level Classification

For tasks like named entity recognition where you classify each token, you use the combined representation at each position:

In[17]:
Code
class BidirectionalTagger:
    """Bidirectional LSTM for token-level classification (e.g., NER, POS tagging)."""

    def __init__(self, input_dim, hidden_dim, num_tags):
        self.bilstm = BidirectionalLSTM(input_dim, hidden_dim)

        # Output layer applied at each position
        self.W_out = np.random.randn(num_tags, 2 * hidden_dim) * 0.1
        self.b_out = np.zeros(num_tags)

    def forward(self, x_sequence):
        """
        Tag each token in the sequence.

        Returns logits for each position and tag.
        """
        combined, _, _ = self.bilstm.forward(x_sequence)

        # Apply classification at each position
        seq_len = combined.shape[0]
        logits = np.zeros((seq_len, self.W_out.shape[0]))

        for t in range(seq_len):
            logits[t] = self.W_out @ combined[t] + self.b_out

        return logits
In[18]:
Code
# Example: NER with 5 tags (O, B-PER, I-PER, B-ORG, I-ORG)
tagger = BidirectionalTagger(input_dim=10, hidden_dim=16, num_tags=5)
x_sequence = np.random.randn(6, 10)  # 6 tokens

tag_logits = tagger.forward(x_sequence)
predicted_tags = np.argmax(tag_logits, axis=1)
Out[19]:
Console
Token-Level Classification Example:
  Sequence length: 6 tokens
  Tag logits shape: (6, 5)
  Predicted tags: ['I-ORG', 'O', 'I-ORG', 'I-ORG', 'O', 'B-ORG']

The tagger produces a 5-dimensional logit vector at each of the 6 positions, one score per possible tag. The argmax selects the highest-scoring tag for each token. With random weights, the predictions are meaningless, but the key point is that each token's prediction uses the full bidirectional context: the combined hidden state at position $t$ incorporates information from all tokens before and after it.

The bidirectional context is particularly valuable for NER because entity boundaries often depend on both preceding and following words. "Bank of America" needs the following "of America" to recognize "Bank" as part of an organization name.

Limitations for Generation

Bidirectional RNNs have a fundamental limitation: they cannot be used for autoregressive generation. When generating text one token at a time, you don't have access to future tokens because they haven't been generated yet.

Consider a language model trying to predict the next word. At position $t$, an autoregressive model can only use information from positions $1, 2, \ldots, t-1$. The backward pass of a bidirectional RNN would require information from positions $t+1, t+2, \ldots, T$, which don't exist during generation.
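To see the constraint in code, here is a minimal sketch of an autoregressive generation loop using an untrained, unidirectional LSTM over a toy vocabulary (the specific model is illustrative, not from this article): at every step, only the prefix generated so far exists, so there is nothing for a backward pass to read.

```python
import torch
import torch.nn as nn

# Toy autoregressive generator: untrained weights, so the tokens are arbitrary;
# the point is the information flow, not the output quality.
vocab_size, embed_dim, hidden_dim = 12, 8, 16
embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # forward-only by necessity
head = nn.Linear(hidden_dim, vocab_size)

generated = [0]          # assume token 0 is a start symbol
state = None             # LSTM hidden/cell state carried forward step by step
with torch.no_grad():
    for _ in range(5):
        # Only the prefix generated so far is available; positions t+1, ..., T
        # haven't been produced yet, so a backward pass cannot be computed.
        x = embed(torch.tensor([[generated[-1]]]))          # shape (1, 1, embed_dim)
        out, state = lstm(x, state)
        next_token = head(out[:, -1]).argmax(dim=-1).item()
        generated.append(next_token)

print(generated)
```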

Figure: During autoregressive generation, future tokens don't exist yet, making the backward pass impossible. At each step, the model must predict the next token using only past context. This is why language models and decoders use unidirectional (forward-only) architectures.

This limitation means bidirectional RNNs are used for:

  • Encoding in sequence-to-sequence models (the encoder sees the full input)
  • Classification tasks where the full sequence is available
  • Feature extraction for downstream tasks

But not for:

  • Language modeling (predicting next tokens)
  • Text generation (writing stories, completing sentences)
  • Decoders in seq2seq models (which generate one token at a time)

PyTorch Implementation

PyTorch provides built-in support for bidirectional RNNs. Let's see how to use it and verify our understanding:

In[21]:
Code
import torch
import torch.nn as nn

# Create bidirectional LSTM
input_dim = 10
hidden_dim = 16
seq_len = 5
batch_size = 1

torch_bilstm = nn.LSTM(
    input_size=input_dim,
    hidden_size=hidden_dim,
    num_layers=1,
    batch_first=True,
    bidirectional=True,  # This is the key parameter
)

# Sample input
x = torch.randn(batch_size, seq_len, input_dim)

# Forward pass
output, (h_n, c_n) = torch_bilstm(x)
Out[22]:
Console
PyTorch Bidirectional LSTM:
  Input shape: (1, 5, 10)
  Output shape: (1, 5, 32)
  Final hidden shape: (2, 1, 16)
  Final cell shape: (2, 1, 16)

Output dimension: 32 = 2 × 16 (bidirectional)
Final hidden: 2 directions × 16 hidden units

PyTorch's bidirectional LSTM automatically handles the forward and backward passes and concatenates the results. The output shape confirms the doubled dimensionality: 32 features instead of 16. The final hidden state tensor has shape (2, 1, 16), where the first dimension indexes direction (0 = forward, 1 = backward), the second is batch size, and the third is hidden dimension. This matches our from-scratch implementation.

Extracting Directional States

To access the forward and backward components separately:

In[23]:
Code
# Split output into forward and backward components
forward_output = output[:, :, :hidden_dim]  # First half of features
backward_output = output[:, :, hidden_dim:]  # Second half of features

# Final states by direction
final_forward_h = h_n[0]  # Shape: (batch, hidden_dim)
final_backward_h = h_n[1]  # Shape: (batch, hidden_dim)
Out[24]:
Console
Extracting Directional Components:
  Forward output shape: (1, 5, 16)
  Backward output shape: (1, 5, 16)
  Final forward hidden: (1, 16)
  Final backward hidden: (1, 16)

Slicing the output tensor along the feature dimension separates the forward and backward components. Each has shape (1, 5, 16), corresponding to batch size, sequence length, and hidden dimension. The final hidden states for each direction have shape (1, 16), representing the last state computed by each directional LSTM. These can be concatenated for sequence classification or used separately depending on your task.
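A useful sanity check ties the two views together. Using the output, h_n, and hidden_dim from the cells above: the forward direction finishes at the last timestep, so its final state matches the forward slice there, while the backward direction finishes at the first timestep, so its final state matches the backward slice at position 0.

```python
# The forward LSTM's final state lives at the last timestep of the forward slice;
# the backward LSTM's final state lives at the first timestep of the backward slice.
print(torch.allclose(h_n[0], output[:, -1, :hidden_dim]))   # expected: True
print(torch.allclose(h_n[1], output[:, 0, hidden_dim:]))    # expected: True
```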

Visualizing the Concatenated Output Structure

To make the structure of bidirectional output concrete, let's visualize how the forward and backward components are arranged in the concatenated tensor:

Figure: Structure of the concatenated bidirectional output. Each row is a sequence position, and the columns show hidden dimensions. The first half (blue) contains forward hidden states, the second half (orange) contains backward hidden states. The color intensity shows activation magnitude.

The visualization makes the concatenation structure explicit: at each position $t$, the output vector has $2h$ dimensions. The first $h$ dimensions (indices 0 to $h-1$) contain the forward hidden state $\overrightarrow{h}_t$, encoding left context. The second $h$ dimensions (indices $h$ to $2h-1$) contain the backward hidden state $\overset{\leftarrow}{h}_t$, encoding right context. When you slice output[:, :, :hidden_dim], you get the forward component; output[:, :, hidden_dim:] gives you the backward component.

Comparing Unidirectional and Bidirectional Performance

Let's design an experiment that demonstrates the advantage of bidirectional processing. We'll create a task where context from both directions is necessary for correct classification.

The Bracket Matching Task

We'll create sequences with matching brackets where the model must classify whether each position is inside or outside a bracket pair. This requires seeing both the opening bracket (past) and closing bracket (future).

In[26]:
Code
def generate_bracket_data(num_samples, seq_len):
    """
    Generate sequences with bracket pairs.
    Label each position as inside (1) or outside (0) brackets.
    """
    X = np.zeros((num_samples, seq_len, 3))  # 3 features: open, close, other
    y = np.zeros((num_samples, seq_len), dtype=int)

    for i in range(num_samples):
        # Randomly place a bracket pair
        open_pos = np.random.randint(0, seq_len - 2)
        close_pos = np.random.randint(open_pos + 2, seq_len)

        # Fill with "other" tokens
        X[i, :, 2] = 1

        # Place brackets
        X[i, open_pos, :] = [1, 0, 0]  # Open bracket
        X[i, close_pos, :] = [0, 1, 0]  # Close bracket

        # Label positions inside brackets
        y[i, open_pos + 1 : close_pos] = 1

    return X, y


# Generate data
np.random.seed(42)
X_train, y_train = generate_bracket_data(500, 10)
X_test, y_test = generate_bracket_data(100, 10)

Now let's train both unidirectional and bidirectional models:

In[27]:
Code
import torch.optim as optim


class UnidirectionalTagger(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(
            input_dim, hidden_dim, batch_first=True, bidirectional=False
        )
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        output, _ = self.lstm(x)
        return self.fc(output)


class BidirectionalTaggerPyTorch(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(
            input_dim, hidden_dim, batch_first=True, bidirectional=True
        )
        self.fc = nn.Linear(2 * hidden_dim, 2)

    def forward(self, x):
        output, _ = self.lstm(x)
        return self.fc(output)


def train_and_evaluate(model, X_train, y_train, X_test, y_test, epochs=50):
    """Train model and return test accuracy."""
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    X_train_t = torch.FloatTensor(X_train)
    y_train_t = torch.LongTensor(y_train)
    X_test_t = torch.FloatTensor(X_test)
    y_test_t = torch.LongTensor(y_test)

    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        output = model(X_train_t)
        loss = criterion(output.view(-1, 2), y_train_t.view(-1))
        loss.backward()
        optimizer.step()

    # Evaluate
    model.eval()
    with torch.no_grad():
        output = model(X_test_t)
        predictions = output.argmax(dim=2)
        accuracy = (predictions == y_test_t).float().mean().item()

    return accuracy


# Modified training function that tracks history
def train_and_evaluate_with_history(
    model, X_train, y_train, X_test, y_test, epochs=50
):
    """Train model and return test accuracy plus training history."""
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    X_train_t = torch.FloatTensor(X_train)
    y_train_t = torch.LongTensor(y_train)
    X_test_t = torch.FloatTensor(X_test)
    y_test_t = torch.LongTensor(y_test)

    train_losses = []
    test_accs = []

    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        output = model(X_train_t)
        loss = criterion(output.view(-1, 2), y_train_t.view(-1))
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())

        # Track test accuracy
        model.eval()
        with torch.no_grad():
            test_output = model(X_test_t)
            predictions = test_output.argmax(dim=2)
            acc = (predictions == y_test_t).float().mean().item()
            test_accs.append(acc)

    return acc, train_losses, test_accs


# Train both models with history tracking
torch.manual_seed(42)
uni_model = UnidirectionalTagger(input_dim=3, hidden_dim=16)
bi_model = BidirectionalTaggerPyTorch(input_dim=3, hidden_dim=16)

uni_acc, uni_losses, uni_accs = train_and_evaluate_with_history(
    uni_model, X_train, y_train, X_test, y_test
)

torch.manual_seed(42)  # Reset seed for fair comparison
bi_model = BidirectionalTaggerPyTorch(input_dim=3, hidden_dim=16)
bi_acc, bi_losses, bi_accs = train_and_evaluate_with_history(
    bi_model, X_train, y_train, X_test, y_test
)
Out[28]:
Console
Bracket Matching Task Results:
  Unidirectional LSTM accuracy: 98.1%
  Bidirectional LSTM accuracy: 100.0%
  Improvement: 1.9%

The bidirectional model achieves near-perfect accuracy because it can see both the opening and closing brackets when classifying each position. The unidirectional model struggles because when it processes a position, it doesn't know if a closing bracket will appear later. This task is specifically designed to require future context: determining whether a position is inside brackets needs information about both the opening bracket (past) and closing bracket (future).

Figure: Training loss and test accuracy over epochs. The bidirectional model (purple) achieves lower loss faster than the unidirectional model (blue), finding a better solution, and reaches near-perfect test accuracy while the unidirectional model plateaus below it, unable to solve the task without future context.

The learning curves reveal the dynamics of training. The bidirectional model not only achieves higher final accuracy but also learns faster, reaching good performance within the first few epochs. The unidirectional model's loss decreases but plateaus at a higher value, reflecting its fundamental inability to solve the task perfectly without future context.

Where Does the Unidirectional Model Fail?

To understand why the unidirectional model struggles, let's analyze accuracy by position within the sequence. The bracket matching task has a specific structure: positions before the opening bracket should be labeled "outside," positions between brackets should be "inside," and positions after the closing bracket should be "outside."

Figure: Per-position accuracy for unidirectional vs. bidirectional models. The unidirectional model performs well at early positions (before brackets) but struggles at later positions where it needs future context to know if a closing bracket will appear. The bidirectional model maintains high accuracy throughout.

The per-position analysis reveals the fundamental limitation of unidirectional processing. Early positions (near the start of the sequence) tend to have higher accuracy because they're more likely to be before any bracket, so the model can correctly predict "outside" without needing future context. However, middle positions are problematic: the unidirectional model has seen the opening bracket but doesn't know if or when a closing bracket will appear. It must guess whether the current position is inside or outside brackets based only on past context.

The bidirectional model maintains high accuracy at all positions because it has access to the full sequence. When classifying position 5, it knows both that an opening bracket appeared at position 2 and that a closing bracket appears at position 7. This complete picture enables correct classification regardless of position.

Practical Considerations

When deciding whether to use bidirectional RNNs, consider these factors:

When to Use Bidirectional RNNs

Bidirectional architectures are well-suited for tasks where:

  • The entire input sequence is available before making predictions
  • Context from both directions improves understanding (most NLP classification tasks)
  • You're building an encoder that will feed into a decoder
  • Token-level predictions depend on surrounding context (NER, POS tagging)

When to Avoid Bidirectional RNNs

Stick with unidirectional architectures when:

  • You need to generate sequences autoregressively
  • You're processing streaming data where future inputs aren't available
  • Latency is critical and you can't wait for the full sequence
  • You're building a decoder in a seq2seq model

Parameter and Computation Cost

Bidirectional RNNs double both parameters and computation compared to unidirectional versions with the same hidden size. If you need a bidirectional model with the same total capacity as a unidirectional one, use half the hidden size in each direction.

In[31]:
Code
def count_lstm_params(input_dim, hidden_dim, bidirectional=False):
    """Count parameters in an LSTM layer."""
    # Each direction: 4 * hidden_dim * (input_dim + hidden_dim + 1)
    params_per_direction = 4 * hidden_dim * (input_dim + hidden_dim + 1)
    multiplier = 2 if bidirectional else 1
    return params_per_direction * multiplier


input_dim = 100
hidden_dim = 256

uni_params = count_lstm_params(input_dim, hidden_dim, bidirectional=False)
bi_params = count_lstm_params(input_dim, hidden_dim, bidirectional=True)
bi_half_params = count_lstm_params(
    input_dim, hidden_dim // 2, bidirectional=True
)
Out[32]:
Console
Parameter Comparison (input_dim=100):
  Unidirectional (hidden=256): 365,568 parameters
  Bidirectional (hidden=256):  731,136 parameters
  Bidirectional (hidden=128):  234,496 parameters

Bidirectional with hidden=128 has similar capacity to unidirectional with hidden=256

The parameter counts reveal the cost of bidirectionality:

LSTM parameter comparison with input dimension 100. Bidirectional models double the parameters compared to unidirectional ones with the same hidden size. Using half the hidden size in each direction achieves bidirectional context with similar total parameters.

| Configuration | Hidden Size | Parameters | Output Dim | Notes |
| --- | --- | --- | --- | --- |
| Unidirectional | 256 | 366,592 | 256 | Baseline |
| Bidirectional | 256 | 733,184 | 512 | 2× parameters |
| Bidirectional | 128 | 188,416 | 256 | Similar to unidirectional |

A bidirectional LSTM with hidden size 256 has exactly twice the parameters of a unidirectional one, since it maintains two complete sets of weights. If you need to match parameter budgets, using hidden size 128 in each direction gives you bidirectional context with roughly the same total parameters as a unidirectional model with hidden size 256. The trade-off is that each direction has less capacity, but the combined representation still benefits from seeing the full sequence.
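If you want to check the counts against an actual PyTorch layer, a quick sketch like the one below works; note that the exact totals depend on PyTorch's parameterization (it keeps two bias vectors per gate), so they will differ slightly from the simple formula used earlier.

```python
import torch.nn as nn

def torch_lstm_params(input_dim, hidden_dim, bidirectional):
    """Count the parameters PyTorch actually allocates for an LSTM layer."""
    lstm = nn.LSTM(input_dim, hidden_dim, bidirectional=bidirectional)
    return sum(p.numel() for p in lstm.parameters())

print(torch_lstm_params(100, 256, bidirectional=False))   # unidirectional baseline
print(torch_lstm_params(100, 256, bidirectional=True))    # exactly double the baseline
print(torch_lstm_params(100, 128, bidirectional=True))    # half hidden size per direction
```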

Limitations and Impact

Bidirectional RNNs solved a critical limitation of unidirectional sequence models: the inability to incorporate future context. This architectural innovation had significant impact across NLP, enabling substantial improvements on tasks from named entity recognition to machine translation encoders.

The most significant practical limitation remains the incompatibility with autoregressive generation. You cannot use bidirectional models for language modeling, text generation, or any task requiring sequential token-by-token output. This fundamental constraint means that even as bidirectional encoders became standard, unidirectional decoders remained necessary for generation tasks.

Computational cost presents another consideration. Processing sequences in both directions doubles the computation and memory requirements compared to unidirectional models. For very long sequences or resource-constrained environments, this overhead may be prohibitive. The sequential nature of RNNs compounds this issue: you cannot parallelize across time steps, so processing time scales linearly with sequence length regardless of available hardware.

Despite these constraints, bidirectional RNNs became a foundational component of modern NLP. The ELMo model, which achieved state-of-the-art results across many benchmarks in 2018, used deep bidirectional LSTMs. BERT and subsequent transformer models adopted the bidirectional principle, though they achieved bidirectionality through attention mechanisms rather than separate forward and backward passes. The insight that both past and future context matter for understanding language proved more durable than any specific architectural implementation.

Summary

Bidirectional RNNs process sequences in both directions simultaneously, combining forward and backward hidden states to create representations informed by the entire sequence context.

The key architectural components are:

  • A forward RNN processing from position 1 to $T$, producing states $\overrightarrow{h}_1, \ldots, \overrightarrow{h}_T$
  • A backward RNN processing from position $T$ to 1, producing states $\overset{\leftarrow}{h}_T, \ldots, \overset{\leftarrow}{h}_1$
  • Concatenation at each position: $h_t = [\overrightarrow{h}_t; \overset{\leftarrow}{h}_t]$

This architecture excels at classification tasks where full sequence context improves predictions. Named entity recognition, sentiment analysis, and part-of-speech tagging all benefit from bidirectional processing. The encoder in sequence-to-sequence models typically uses bidirectional RNNs to capture complete input context.

The fundamental limitation is incompatibility with autoregressive generation. Since future tokens don't exist during generation, the backward pass cannot be computed. Language models, text generators, and decoders must use unidirectional architectures.

Bidirectional RNNs double both parameters and computation compared to unidirectional versions. When computational budget is fixed, using half the hidden size in each direction maintains similar total capacity while gaining bidirectional context.

Key Parameters

When working with bidirectional RNNs in PyTorch (nn.LSTM, nn.GRU, nn.RNN), the following parameters are most relevant:

  • bidirectional: Set to True to enable bidirectional processing. This doubles the output dimension and the number of parameters.

  • hidden_size: The hidden dimension for each direction. With bidirectional=True, the output dimension becomes 2 * hidden_size.

  • num_layers: Number of stacked RNN layers. Each layer can be bidirectional independently, though typically all layers share the same directionality.

  • batch_first: When True, input and output tensors have shape (batch, seq, features). The bidirectional output features are concatenated along the last dimension.

  • dropout: Applied between layers (when num_layers > 1). Does not affect bidirectionality but helps regularize deeper bidirectional stacks.

The output tensor shape is (batch, seq_len, num_directions * hidden_size) where num_directions is 2 for bidirectional models. The final hidden state has shape (num_layers * num_directions, batch, hidden_size), with forward and backward states interleaved by layer.
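To keep the layer and direction axes straight, you can reshape the final hidden state into explicit (layer, direction) dimensions, as described in PyTorch's documentation for bidirectional RNNs. A small sketch with a 2-layer bidirectional LSTM:

```python
import torch
import torch.nn as nn

num_layers, hidden_size, batch = 2, 16, 3
lstm = nn.LSTM(input_size=10, hidden_size=hidden_size, num_layers=num_layers,
               batch_first=True, bidirectional=True)

x = torch.randn(batch, 7, 10)                       # (batch, seq_len, features)
output, (h_n, c_n) = lstm(x)

# h_n has shape (num_layers * num_directions, batch, hidden_size), interleaved by layer.
h_n_view = h_n.view(num_layers, 2, batch, hidden_size)
last_layer_forward = h_n_view[-1, 0]                # (batch, hidden_size)
last_layer_backward = h_n_view[-1, 1]               # (batch, hidden_size)

print(output.shape)                                 # torch.Size([3, 7, 32])
print(h_n_view.shape)                               # torch.Size([2, 2, 3, 16])
```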

