Master RNN architecture from recurrent connections to hidden state dynamics. Learn parameter sharing, sequence classification, generation, and implement an RNN from scratch.

RNN Architecture
How do you process a sentence, a time series, or any sequence where context matters? Feedforward neural networks treat each input independently, but language and many real-world problems have temporal structure. The meaning of "bank" depends on whether we're talking about rivers or money. The next word in a sentence depends on all the words that came before.
Recurrent Neural Networks (RNNs) solve this problem by introducing memory. Unlike feedforward networks that process inputs in isolation, RNNs maintain a hidden state that accumulates information as they read through a sequence. This hidden state acts as a compressed summary of everything the network has seen so far, allowing it to make predictions that depend on context.
This chapter builds your understanding of RNN architecture from the ground up. We'll start with the intuition behind recurrent connections, then formalize the mathematics of hidden state updates. You'll see how to "unroll" an RNN into a computational graph, understand parameter sharing across time steps, and explore how RNNs handle both sequence classification and sequence generation. By the end, you'll have implemented a working RNN from scratch.
The Need for Sequential Memory
Consider the task of sentiment analysis. Given the sentence "The movie was not good," a feedforward network processing each word independently would see "good" and might predict positive sentiment. But the word "not" completely reverses the meaning. To understand this sentence, we need to remember that "not" appeared before "good."
This is the fundamental limitation of feedforward networks for sequential data: they have no mechanism to carry information from one input to the next. Each input is processed in isolation, as if the network has amnesia between time steps.
A Recurrent Neural Network (RNN) is a neural network architecture designed for sequential data. It processes inputs one at a time while maintaining a hidden state that carries information across time steps, enabling the network to model temporal dependencies.
RNNs address this limitation by introducing a feedback loop. The network's output at each time step depends not only on the current input but also on what the network "remembers" from previous inputs. This memory is encoded in a vector called the hidden state.
Recurrent Connections: The Key Insight
The defining feature of an RNN is the recurrent connection, a loop that feeds the network's hidden state back into itself. This creates a form of short-term memory that persists across time steps.
In a standard feedforward layer, information flows in one direction. The hidden state is computed solely from the current input:

$$h_t = f(W_{xh} x_t + b_h)$$

where $x_t$ is the input, $W_{xh}$ is the weight matrix, $b_h$ is the bias, and $f$ is an activation function.
In a recurrent layer, the hidden state from the previous time step feeds back as an additional input. This is the key difference that enables memory:

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

where:
- $x_t$: the input vector at time step $t$
- $h_t$: the hidden state at time step $t$ (what the network "remembers")
- $h_{t-1}$: the hidden state from the previous time step
- $W_{xh}$: weight matrix connecting input to hidden state
- $W_{hh}$: weight matrix connecting previous hidden state to current hidden state (the recurrent weights)
- $b_h$: bias vector
- $f$: activation function (typically $\tanh$)
The recurrent weight matrix $W_{hh}$ is what gives RNNs their memory. It determines how the previous hidden state influences the current one.
The self-loop in the diagram represents the recurrent connection. At each time step, the hidden state $h_t$ is computed using both the current input $x_t$ and the previous hidden state $h_{t-1}$. This creates a chain of dependencies that allows information to flow through time.
Hidden State as Memory
The hidden state $h_t$ is the RNN's memory. It's a fixed-size vector that encodes a summary of all the inputs the network has seen up to time $t$. Think of it as a compressed representation of the sequence history.
At time step 0, before the RNN has seen any inputs, we typically initialize the hidden state to a zero vector:

$$h_0 = \mathbf{0}$$

where $\mathbf{0}$ is a vector of zeros with the same dimension as the hidden state.
As the RNN processes each input, the hidden state evolves to incorporate new information:
- $h_1$ encodes information about $x_1$
- $h_2$ encodes information about $x_1$ and $x_2$
- $h_t$ encodes information about $x_1, x_2, \ldots, x_t$
The hidden state doesn't store the raw inputs. Instead, it learns to extract and compress the features that are relevant for the task. For language modeling, this might include syntactic structure, semantic meaning, and discourse context. For time series prediction, it might capture trends, seasonality, and recent fluctuations.
The key insight is that the hidden state size is fixed regardless of sequence length. Whether processing a 5-word sentence or a 500-word document, the RNN compresses all relevant information into a vector of the same dimension. This is both a strength (constant memory usage) and a limitation (finite capacity to store long-range dependencies).
This distribution reveals that the $\tanh$ activation keeps hidden state values well-behaved. Most values cluster near zero, with the distribution tapering off toward the bounds at $\pm 1$. This zero-centered property is important for gradient flow during training.
Unrolling the Computation Graph
To understand how gradients flow through an RNN during training, we "unroll" the recurrent loop across time. This transforms the compact diagram with a self-loop into an explicit computational graph where each time step is a separate node.
The unrolled view reveals that an RNN is essentially a very deep feedforward network, with one layer per time step. The crucial difference is that the same weights are shared across all time steps.
The unrolled graph makes several things clear:
- Depth: An RNN processing a sequence of length $T$ is effectively a $T$-layer deep network, where $T$ is the number of time steps. This depth is what makes training RNNs challenging, as we'll see in later chapters on vanishing gradients.
- Weight sharing: The same $W_{xh}$, $W_{hh}$, and $W_{hy}$ matrices appear at every time step. This dramatically reduces the number of parameters compared to having separate weights for each position.
- Gradient flow: During backpropagation, gradients must flow backward through all time steps. This is called Backpropagation Through Time (BPTT), which we'll cover in the next chapter.
Parameter Sharing Across Time
Parameter sharing is a fundamental design choice in RNNs. The same weights are used to process the input at every time step, regardless of position in the sequence.
This has several important implications:
Generalization to variable lengths: Because the weights don't depend on position, an RNN trained on sequences of length 10 can process sequences of length 100 or 1000 without any modification. The network learns transformations that apply universally across time steps.
Statistical strength: If a pattern (like "not" negating sentiment) can appear anywhere in a sequence, parameter sharing means the network learns this pattern once and applies it everywhere. Without sharing, the network would need to learn the same pattern separately for each position.
- Reduced parameters: A feedforward network processing a sequence of length $T$ with hidden size $d_h$ would need on the order of $T \cdot d_h^2$ parameters if each position had separate weights. An RNN needs only on the order of $d_h^2$ parameters, independent of sequence length. Here, $T$ represents the number of time steps and $d_h$ is the hidden state dimension.
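To make this concrete, here is a minimal sketch that counts the recurrent parameters ($W_{xh}$, $W_{hh}$, $b_h$) of a vanilla RNN cell. The input dimension of 32 and hidden dimension of 64 are assumptions chosen for illustration; note that the count never involves the sequence length.

```python
def count_rnn_parameters(input_dim: int, hidden_dim: int) -> int:
    """Count the parameters of a vanilla RNN cell: W_xh, W_hh, and b_h."""
    w_xh = input_dim * hidden_dim   # input-to-hidden weights
    w_hh = hidden_dim * hidden_dim  # hidden-to-hidden (recurrent) weights
    b_h = hidden_dim                # hidden bias
    return w_xh + w_hh + b_h

# Assumed dimensions for illustration
total = count_rnn_parameters(input_dim=32, hidden_dim=64)
print(f"RNN cell parameters: {total}")  # 2048 + 4096 + 64 = 6208

# The count does not depend on how many time steps we unroll
for seq_len in (10, 100, 10_000):
    print(f"sequence length {seq_len:>6}: {total} parameters")
```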
The RNN has just over 6,000 parameters, and the count is independent of sequence length: whether we process 10 tokens or 10,000 tokens, the RNN uses exactly the same 6,208 parameters. This efficiency is crucial for processing long sequences like documents or time series.
This visualization makes the efficiency of parameter sharing concrete. The RNN's flat line represents constant memory footprint regardless of sequence length, while the feedforward approach's linear growth quickly becomes impractical for long sequences.
The RNN Equations
Now that we understand the intuition behind recurrent connections and hidden states, let's formalize the mathematics that make RNNs work. The goal is to derive equations that capture two essential operations: updating the network's memory based on new input, and producing outputs based on that memory.
The Core Challenge: Combining Past and Present
At each time step, an RNN faces a fundamental question: how should it blend information from the current input with what it already knows from the past? This isn't trivial. The network needs to:
- Extract relevant features from the current input
- Retain useful information from the previous hidden state
- Combine these sources in a way that produces a useful new representation
The elegant solution is to use separate learned transformations for each information source, then add them together before applying a nonlinearity.
The Hidden State Update Equation
The hidden state update is the heart of the RNN. It determines how the network's memory evolves as it processes each new input:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
Let's unpack this equation piece by piece to understand why each component is necessary.
The input transformation $W_{xh} x_t$ projects the current input into the hidden space. The weight matrix $W_{xh}$ learns which aspects of the input are relevant for the task. For example, in language modeling, it might learn to emphasize semantic features of word embeddings while downweighting noise.
The recurrent transformation $W_{hh} h_{t-1}$ determines how past information influences the present. This is the "memory" mechanism. The matrix $W_{hh}$ learns which aspects of the previous hidden state should persist and how they should be transformed. Some dimensions might carry long-term information (like sentence topic), while others track short-term patterns (like recent words).
The bias term $b_h$ provides a learnable offset, allowing the network to shift its activation patterns. This is particularly useful for controlling the default behavior when inputs are near zero.
The addition of these three terms is crucial. By summing the contributions, the network can learn to weight past and present information appropriately. If $W_{hh}$ has small values, the network emphasizes the current input. If $W_{xh}$ is small, it relies more on memory.
The $\tanh$ activation serves two purposes. First, it introduces nonlinearity, allowing the network to learn complex patterns. Without it, stacking multiple time steps would collapse to a single linear transformation. Second, it bounds the output to $[-1, 1]$, preventing the hidden state from growing without limit as the sequence progresses.
The Output Equation
Once we have the updated hidden state, we need to produce an output. This is typically a simple linear transformation:

$$y_t = W_{hy} h_t + b_y$$

The output layer maps from the hidden representation to the task-specific output space. For language modeling, $y_t$ might be logits over the vocabulary. For sentiment analysis, it might be scores for positive and negative classes.
Complete Variable Reference
To implement these equations correctly, you need to understand the shape and role of each variable:
- $x_t \in \mathbb{R}^{d_x}$: input vector at time $t$ (e.g., a word embedding of dimension $d_x$)
- $h_t \in \mathbb{R}^{d_h}$: hidden state at time $t$, the network's memory
- $h_{t-1} \in \mathbb{R}^{d_h}$: hidden state from the previous time step
- $y_t \in \mathbb{R}^{d_y}$: output at time $t$ (e.g., logits over a vocabulary of size $V$)
- $W_{xh} \in \mathbb{R}^{d_h \times d_x}$: input-to-hidden weight matrix, transforms $d_x$-dimensional inputs to $d_h$-dimensional hidden space
- $W_{hh} \in \mathbb{R}^{d_h \times d_h}$: hidden-to-hidden (recurrent) weight matrix, determines how past information influences the present
- $W_{hy} \in \mathbb{R}^{d_y \times d_h}$: hidden-to-output weight matrix, maps hidden state to output space
- $b_h \in \mathbb{R}^{d_h}$: hidden bias vector
- $b_y \in \mathbb{R}^{d_y}$: output bias vector
The $\tanh$ activation squashes values to $[-1, 1]$, centering the hidden state around zero. This zero-centered property helps with gradient flow compared to sigmoid (which outputs only positive values). However, $\tanh$ still suffers from vanishing gradients for large inputs, a problem we'll address with LSTM and GRU architectures.
Dimension Analysis
Understanding the dimensions at each step is crucial for implementing RNNs correctly. Let's trace through a concrete example:
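As a sketch (the dimensions here are assumed for illustration), take $d_x = 8$ and $d_h = 16$ and trace one hidden state update in NumPy, following the shapes in the variable reference above:

```python
import numpy as np

d_x, d_h = 8, 16                       # assumed input and hidden dimensions

x_t = np.random.randn(d_x)             # input vector        (8,)
h_prev = np.zeros(d_h)                 # previous hidden     (16,)
W_xh = np.random.randn(d_h, d_x)       # input-to-hidden     (16, 8)
W_hh = np.random.randn(d_h, d_h)       # hidden-to-hidden    (16, 16)
b_h = np.zeros(d_h)                    # hidden bias         (16,)

# (16, 8) @ (8,) -> (16,)  and  (16, 16) @ (16,) -> (16,)
h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
print(h_t.shape)                       # (16,)
```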
Implementing an RNN from Scratch
With the mathematical foundations in place, let's translate the equations into working code. We'll build the implementation incrementally, starting with the simplest possible unit (a single time step) and progressively adding complexity until we have a complete, reusable RNN.
This bottom-up approach serves two purposes: it reinforces the mathematical concepts by showing exactly how each equation maps to code, and it produces modular components that are easy to test and debug.
Single Time Step
The atomic unit of an RNN is the hidden state update for a single time step. This function directly implements the equation $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$:
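A minimal NumPy sketch of this step follows; the exact implementation may differ from the original. For batched inputs of shape (batch_size, input_dim), the weights are stored transposed relative to the equations, so the product is written x_t @ W_xh rather than $W_{xh} x_t$.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """Compute one RNN time step: h_t = tanh(x_t @ W_xh + h_prev @ W_hh + b_h).

    x_t:    (batch_size, input_dim)   current input
    h_prev: (batch_size, hidden_dim)  previous hidden state
    W_xh:   (input_dim, hidden_dim)   input-to-hidden weights
    W_hh:   (hidden_dim, hidden_dim)  recurrent weights
    b_h:    (hidden_dim,)             hidden bias
    """
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Quick shape check with small random tensors (dimensions chosen arbitrarily)
rng = np.random.default_rng(0)
batch_size, input_dim, hidden_dim = 4, 8, 16
x_t = rng.normal(size=(batch_size, input_dim))
h_prev = np.zeros((batch_size, hidden_dim))
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h_t = rnn_step(x_t, h_prev, W_xh, W_hh, b_h)
print(h_t.shape)  # (4, 16)
```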
The output shape confirms that our rnn_step function correctly transforms inputs of shape (batch_size, input_dim) to hidden states of shape (batch_size, hidden_dim). The hidden state values fall within the expected range $(-1, 1)$ due to the $\tanh$ activation.
Full Sequence Processing
A single time step is useful, but real applications require processing entire sequences. The key insight is that we simply loop through time, feeding each step's output hidden state as the next step's input. We also collect all hidden states, which proves useful for tasks like attention mechanisms or when we need outputs at every position:
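A sketch of the full forward pass, reusing the rnn_step function and weights from the previous snippet; the example dimensions (a batch of 4 sequences, 10 time steps, 8 input features, hidden size 16) are assumptions that mirror the output described next.

```python
def rnn_forward(x, h_0, W_xh, W_hh, b_h):
    """Run the RNN over a full sequence.

    x:   (batch_size, seq_len, input_dim)  input sequence
    h_0: (batch_size, hidden_dim)          initial hidden state
    Returns the final hidden state and all hidden states stacked as
    (batch_size, seq_len, hidden_dim).
    """
    h_t = h_0
    hidden_states = []
    for t in range(x.shape[1]):
        # Feed each step's output hidden state in as the next step's input
        h_t = rnn_step(x[:, t, :], h_t, W_xh, W_hh, b_h)
        hidden_states.append(h_t)
    return h_t, np.stack(hidden_states, axis=1)

# Example: batch of 4 sequences, 10 time steps, 8 input features, hidden size 16
x = rng.normal(size=(4, 10, 8))
h_0 = np.zeros((4, 16))
h_final, all_h = rnn_forward(x, h_0, W_xh, W_hh, b_h)
print(all_h.shape)  # (4, 10, 16)
```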
The output shows that our RNN processes a batch of 4 sequences, each with 10 time steps and 8 input features, producing hidden states of dimension 16. The hidden state evolution demonstrates how values change at each time step as the network incorporates new information. Notice that the values remain bounded within $[-1, 1]$ due to the $\tanh$ activation, and different dimensions evolve differently based on the weights.
Complete RNN Class
For practical use, we want to encapsulate the RNN's parameters and operations into a single object. The class below combines weight initialization, the forward pass, and utility methods. Notice how the forward pass now includes the output computation , making this a complete sequence-to-sequence model:
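A sketch of such a class, assuming NumPy and the same weight conventions as above; the original implementation details may differ, but the forward pass follows the two equations $h_t = \tanh(x_t W_{xh} + h_{t-1} W_{hh} + b_h)$ and $y_t = h_t W_{hy} + b_y$.

```python
import numpy as np

class SimpleRNN:
    """A vanilla RNN with an output layer, implemented with NumPy."""

    def __init__(self, input_dim, hidden_dim, output_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1  # small random initialization
        self.W_xh = rng.normal(scale=scale, size=(input_dim, hidden_dim))
        self.W_hh = rng.normal(scale=scale, size=(hidden_dim, hidden_dim))
        self.W_hy = rng.normal(scale=scale, size=(hidden_dim, output_dim))
        self.b_h = np.zeros(hidden_dim)
        self.b_y = np.zeros(output_dim)

    def num_parameters(self):
        return sum(p.size for p in
                   (self.W_xh, self.W_hh, self.W_hy, self.b_h, self.b_y))

    def forward(self, x, h_0=None):
        """x: (batch_size, seq_len, input_dim).
        Returns per-step outputs, all hidden states, and the final hidden state."""
        batch_size, seq_len, _ = x.shape
        h_t = np.zeros((batch_size, self.b_h.size)) if h_0 is None else h_0
        hidden_states, outputs = [], []
        for t in range(seq_len):
            # Hidden state update: combine current input and previous state
            h_t = np.tanh(x[:, t, :] @ self.W_xh + h_t @ self.W_hh + self.b_h)
            # Output projection at this time step
            y_t = h_t @ self.W_hy + self.b_y
            hidden_states.append(h_t)
            outputs.append(y_t)
        return (np.stack(outputs, axis=1),
                np.stack(hidden_states, axis=1),
                h_t)

# Example: 64 input dims, 128 hidden units, 100 output classes, 15 time steps
rnn = SimpleRNN(input_dim=64, hidden_dim=128, output_dim=100)
x = np.random.default_rng(1).normal(size=(2, 15, 64))
outputs, all_h, h_final = rnn.forward(x)
print(rnn.num_parameters())          # 37,604 parameters
print(outputs.shape, h_final.shape)  # (2, 15, 100) (2, 128)
```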
The SimpleRNN class encapsulates the full RNN computation. With an architecture of 64 input dimensions, 128 hidden units, and 100 output classes, the model has roughly 38,000 parameters. The forward pass processes all 15 time steps and returns outputs at each position, all hidden states for potential use in attention mechanisms, and the final hidden state for classification tasks.
RNN for Sequence Classification
One common application of RNNs is sequence classification, where we want to assign a single label to an entire sequence. Examples include sentiment analysis (positive/negative), spam detection, and topic classification.
For sequence classification, we typically use only the final hidden state $h_T$ to make the prediction, where $T$ is the sequence length. This final state has "seen" the entire sequence and should encode all the information needed for classification.
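A sketch of binary sentiment classification on top of the SimpleRNN class above; the two-class setup, sequence length, and random inputs are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    """Convert logits to probabilities, row-wise, in a numerically stable way."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Assumed setup: embedded "reviews" of 20 tokens each, 2 classes (neg, pos)
rng = np.random.default_rng(2)
classifier = SimpleRNN(input_dim=64, hidden_dim=128, output_dim=2)
x = rng.normal(size=(3, 20, 64))        # batch of 3 sequences

_, _, h_final = classifier.forward(x)   # only the final hidden state is used
logits = h_final @ classifier.W_hy + classifier.b_y
probs = softmax(logits)
print(probs)  # essentially random until the weights are trained
```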
The classification output shows logits for each sequence in the batch. After applying softmax to convert logits to probabilities, we see predictions for each sequence. With randomly initialized weights, the predictions are essentially random. After training on labeled sentiment data, the model would learn to produce high positive probabilities for positive reviews and high negative probabilities for negative reviews.
RNN for Sequence Generation
Another powerful application is sequence generation, where the RNN produces an output at each time step. This is used for language modeling, machine translation, and text generation.
In generation mode, the RNN's output at time $t$ becomes (part of) the input at time $t+1$. This creates an autoregressive loop where the model generates one token at a time, conditioning on all previously generated tokens.
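A sketch of this autoregressive loop, reusing the SimpleRNN class and the softmax helper above and assuming a small vocabulary whose sampled tokens are fed back in as one-hot vectors; the vocabulary size, sequence length, and temperature are illustrative choices.

```python
import numpy as np

def generate(rnn, vocab_size, num_steps=10, temperature=1.0, seed=3):
    """Autoregressively sample token indices from an RNN language model."""
    rng = np.random.default_rng(seed)
    h_t = np.zeros((1, rnn.b_h.size))
    x_t = np.zeros((1, vocab_size))          # start from an all-zero input
    generated = []
    for _ in range(num_steps):
        # One recurrent step followed by the output projection
        h_t = np.tanh(x_t @ rnn.W_xh + h_t @ rnn.W_hh + rnn.b_h)
        logits = h_t @ rnn.W_hy + rnn.b_y
        # Temperature scaling: <1 sharpens the distribution, >1 flattens it
        probs = softmax(logits / temperature)[0]
        token = rng.choice(vocab_size, p=probs)
        generated.append(int(token))
        # Feed the sampled token back in as the next input (one-hot encoding)
        x_t = np.zeros((1, vocab_size))
        x_t[0, token] = 1.0
    return generated

vocab_size = 50
lm = SimpleRNN(input_dim=vocab_size, hidden_dim=64, output_dim=vocab_size)
print(generate(lm, vocab_size, num_steps=10, temperature=1.0))
```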
The generated sequence is a list of token indices sampled from the model's output distribution. With untrained, random weights, these indices have no semantic meaning. However, after training on a text corpus, the same generation procedure would produce coherent sequences. The temperature parameter controls randomness: lower values (e.g., 0.5) make the model more confident and deterministic, while higher values (e.g., 1.5) increase diversity but may reduce coherence.
The entropy values (H) quantify the randomness of each distribution. At temperature 0.3, the model is nearly deterministic with 75% probability on the top token. At temperature 2.5, the distribution is much flatter, giving even unlikely tokens a reasonable chance of being sampled. Choosing the right temperature is a key decision when deploying generative models.
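To see the effect directly, the snippet below rescales a set of illustrative logits (an assumption, not the exact values behind the figures above) at several temperatures and reports the top probability and entropy of each resulting distribution.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; measures how spread out a distribution is."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

logits = np.array([2.0, 1.0, 0.5, 0.0, -1.0])  # illustrative logits
for temperature in (0.3, 1.0, 2.5):
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    print(f"T={temperature}: top prob={probs.max():.2f}, H={entropy(probs):.2f}")
```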
Visualizing Hidden State Dynamics
To build intuition about how RNNs process sequences, let's visualize how the hidden state evolves over time for different input patterns.
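A sketch of one way to produce such a visualization, assuming matplotlib and a small randomly initialized RNN; the exact input patterns and plot styling may differ from the original.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
input_dim, hidden_dim, seq_len = 4, 16, 50
W_xh = rng.normal(scale=0.3, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.3, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

t = np.arange(seq_len)
patterns = {
    "constant": np.ones((seq_len, input_dim)),
    "oscillating": np.sin(0.5 * t)[:, None] * np.ones((seq_len, input_dim)),
    "random": rng.normal(size=(seq_len, input_dim)),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, (name, x) in zip(axes, patterns.items()):
    h_t = np.zeros(hidden_dim)
    trajectory = []
    for step in range(seq_len):
        h_t = np.tanh(x[step] @ W_xh + h_t @ W_hh + b_h)  # hidden state update
        trajectory.append(h_t.copy())
    ax.plot(np.array(trajectory)[:, :6])  # plot the first few hidden dimensions
    ax.set_title(f"{name} input")
    ax.set_xlabel("time step")
axes[0].set_ylabel("hidden state value")
plt.tight_layout()
plt.show()
```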
The visualization reveals several important properties of RNN dynamics:
- Constant input: The hidden state quickly converges to a fixed point. Once the network has "absorbed" the constant signal, the hidden state stops changing.
- Oscillating input: The hidden state tracks the oscillations, with different dimensions responding at different phases and amplitudes.
- Random input: The hidden state shows complex, chaotic dynamics as it tries to encode the unpredictable input stream.
These dynamics are determined by the recurrent weights $W_{hh}$. The eigenvalues of this matrix, which characterize how the matrix scales vectors in different directions, control whether the hidden state explodes, vanishes, or maintains stable dynamics over time. This is a topic we'll explore in depth in the chapter on vanishing gradients.
The heatmap reveals how different hidden dimensions specialize in tracking different aspects of the input. Some dimensions (horizontal bands of consistent color) act as "memory cells" that maintain their state over time. Others show rapid fluctuations in response to input changes. The spike at time step 25 creates a visible perturbation across many dimensions, demonstrating how the hidden state responds to sudden input changes.
Limitations and Impact
RNNs represented a major breakthrough in sequence modeling, but they come with significant limitations that motivated the development of more advanced architectures.
The Vanishing Gradient Problem
The most critical limitation of vanilla RNNs is the vanishing gradient problem. When training on long sequences, gradients must flow backward through many time steps. At each step, the gradient is multiplied by the recurrent weight matrix $W_{hh}$. If the largest eigenvalue of $W_{hh}$ is less than 1, these repeated multiplications cause the gradient to shrink exponentially, effectively preventing the network from learning long-range dependencies.
Consider a sequence where a word at position 5 is crucial for predicting the word at position 50. The gradient signal from position 50 must traverse 45 time steps to influence the weights that process position 5. With vanishing gradients, this signal becomes negligibly small, and the network fails to learn the dependency. This is why vanilla RNNs struggle with tasks requiring memory over more than 10-20 time steps.
This visualization shows why vanilla RNNs struggle with long sequences. With a spectral radius of 0.9, the gradient is reduced to about 1% of its original magnitude after just 45 time steps. With 0.7, it becomes numerically negligible after about 30 steps. The narrow band near a spectral radius of 1 where gradients remain useful is difficult to maintain during training, motivating the gated architectures (LSTM, GRU) that explicitly address this problem.
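The decay described above can be reproduced with a few lines: a sketch that rescales a random recurrent matrix to a chosen spectral radius and repeatedly multiplies a gradient-like vector by it. The exact numbers depend on the random draw, and a full BPTT Jacobian would also include the $\tanh$ derivative, so real gradients shrink even faster.

```python
import numpy as np

def gradient_decay(spectral_radius, hidden_dim=64, steps=45, seed=5):
    """Relative norm of a gradient-like vector after `steps` backward steps."""
    rng = np.random.default_rng(seed)
    W_hh = rng.normal(size=(hidden_dim, hidden_dim))
    # Rescale W_hh so its largest absolute eigenvalue equals spectral_radius
    W_hh *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W_hh)))
    grad = rng.normal(size=hidden_dim)
    initial_norm = np.linalg.norm(grad)
    for _ in range(steps):
        grad = W_hh.T @ grad  # one step backward through time
    return np.linalg.norm(grad) / initial_norm

for rho in (0.7, 0.9, 1.0):
    print(f"spectral radius {rho}: relative gradient norm "
          f"after 45 steps = {gradient_decay(rho):.2e}")
```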
Sequential Computation Bottleneck
RNNs process sequences one step at a time, with each step depending on the previous one. This sequential dependency prevents parallelization during training. While a feedforward network can process all positions simultaneously, an RNN must wait for $h_{t-1}$ before computing $h_t$. On modern GPUs designed for parallel computation, this sequential bottleneck significantly slows training on long sequences.
Historical Impact
Despite these limitations, RNNs were transformative for NLP and sequence modeling:
- Language modeling: RNNs enabled the first neural language models that could generate coherent text, moving beyond n-gram models.
- Machine translation: Sequence-to-sequence RNNs with attention (which we'll cover later) achieved state-of-the-art translation quality before transformers.
- Speech recognition: RNNs powered major advances in speech-to-text systems.
- Time series: RNNs remain useful for many time series forecasting applications where sequences are short enough to avoid gradient issues.
The limitations of vanilla RNNs directly motivated the development of LSTM and GRU architectures, which use gating mechanisms to create "gradient highways" that allow information to flow over long distances. These architectures dominated NLP from roughly 2014-2017, until transformers provided an even more effective solution to the long-range dependency problem.
Summary
This chapter introduced the fundamental architecture of Recurrent Neural Networks:
- Recurrent connections create memory by feeding the hidden state back into itself, allowing RNNs to process sequential data with temporal dependencies.
- Hidden state acts as a compressed summary of all previous inputs, evolving at each time step through the equation $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$.
- Parameter sharing across time steps means the same weights process every position, enabling RNNs to handle variable-length sequences with a fixed number of parameters.
- Unrolling the computation graph reveals that an RNN is effectively a deep network with one layer per time step, which is crucial for understanding gradient flow during training.
- Sequence classification uses the final hidden state to make predictions about entire sequences, while sequence generation produces outputs at each step in an autoregressive loop.
- Limitations include vanishing gradients that prevent learning long-range dependencies and sequential computation that prevents parallelization.
In the next chapter, we'll examine how to train RNNs using Backpropagation Through Time (BPTT), understanding exactly how gradients flow backward through the unrolled computation graph.
Key Parameters
When implementing RNNs, several hyperparameters significantly impact model performance:
- hidden_dim: The dimension of the hidden state vector. Larger values increase the model's capacity to store information but also increase computation and the risk of overfitting. Typical values range from 64 to 512 for most NLP tasks.
- input_dim: The dimension of input vectors, often determined by the embedding size. Common choices are 100, 200, or 300 for word embeddings.
- num_layers: The number of stacked RNN layers. Deeper networks can learn more complex patterns but are harder to train. Most applications use 1-3 layers.
- activation: The activation function applied to the hidden state. The default $\tanh$ works well for most cases, keeping values bounded in $[-1, 1]$.
- initial_hidden_state: How to initialize $h_0$. Zero initialization is standard, though learned initial states can sometimes improve performance.
- temperature (for generation): Controls the randomness of sampling during sequence generation. Values below 1.0 make outputs more deterministic; values above 1.0 increase diversity.