LSTM Gate Equations: Complete Mathematical Guide with NumPy Implementation

Michael Brenndoerfer · December 16, 2025 · 32 min read

Master the mathematics behind LSTM gates including forget, input, output gates, and cell state updates. Includes from-scratch NumPy implementation and PyTorch comparison.

LSTM Gate Equations

In the previous chapter, we explored the intuition behind LSTMs: how the cell state acts as an information highway and how gates selectively control information flow. Now we translate that intuition into precise mathematics. Understanding the exact equations is essential for implementing LSTMs from scratch, debugging unexpected behavior, and reasoning about parameter counts and computational costs.

This chapter presents each gate equation in detail, shows how they combine to update the cell state and hidden state, and walks through a complete implementation. By the end, you'll be able to trace exactly how information flows through an LSTM cell and count every parameter in the architecture.

The LSTM Cell at a Glance

Before diving into individual equations, let's establish the notation and see all the components together. An LSTM cell at time step $t$ receives three inputs and produces two outputs.

Inputs:

  • $\mathbf{x}_t \in \mathbb{R}^d$: the input vector at time $t$ (e.g., a word embedding)
  • $\mathbf{h}_{t-1} \in \mathbb{R}^h$: the hidden state from the previous time step
  • $\mathbf{c}_{t-1} \in \mathbb{R}^h$: the cell state from the previous time step

Outputs:

  • $\mathbf{h}_t \in \mathbb{R}^h$: the new hidden state
  • $\mathbf{c}_t \in \mathbb{R}^h$: the new cell state

The hidden dimension $h$ is a hyperparameter you choose. A larger $h$ gives the model more capacity but increases computation and memory requirements.

The LSTM computes these outputs through four interacting components: the forget gate, input gate, candidate cell state, and output gate. Each gate is a small neural network that produces values between 0 and 1, acting as a soft switch that controls information flow.

The Forget Gate

Imagine you're reading a novel and encounter a new chapter. Some details from previous chapters remain crucial: the protagonist's name, the central conflict, key relationships. Other details have served their purpose and can fade: the color of a minor character's shirt, the exact wording of a throwaway line. Your brain naturally performs this filtering, retaining what matters while letting irrelevant details slip away.

The forget gate gives an LSTM this same capability. At each time step, it examines the current input and the network's recent processing history, then decides, for each piece of stored information, how much to retain. This isn't a binary keep-or-discard decision but a continuous scaling: some memories are preserved almost entirely, others are dimmed, and some are nearly erased.

Forget Gate

The forget gate computes a forgetting coefficient for each dimension of the cell state, determining how much of the previous memory to retain at the current time step.

The Core Mechanism

The forget gate needs to make its retention decisions based on context. What should be forgotten depends entirely on what the network is currently seeing and what it has recently processed. To capture this context-dependence, the gate combines two sources of information:

  1. The current input $\mathbf{x}_t$: What new information is arriving right now?
  2. The previous hidden state $\mathbf{h}_{t-1}$: What was the network's recent output, summarizing its processing up to this point?

These two vectors are transformed through learned weight matrices and combined:

$$\mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{x}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f)$$

where:

  • $\mathbf{f}_t \in \mathbb{R}^h$: the forget gate activation vector, one value per cell state dimension
  • $\mathbf{W}_f \in \mathbb{R}^{h \times d}$: weight matrix that learns which input features signal "time to forget"
  • $\mathbf{U}_f \in \mathbb{R}^{h \times h}$: weight matrix that learns which hidden state patterns signal "time to forget"
  • $\mathbf{b}_f \in \mathbb{R}^h$: bias vector that sets the default forgetting behavior
  • $\sigma$: the sigmoid activation function
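
To make the shapes concrete before any training enters the picture, here is a minimal NumPy sketch of the forget gate computation. The small dimensions and random weights are illustrative assumptions, not values from the implementation later in this chapter:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

d, h = 3, 4                         # toy input and hidden dimensions
x_t = np.random.randn(d)            # current input
h_prev = np.random.randn(h)         # previous hidden state

W_f = np.random.randn(h, d) * 0.1   # input-to-gate weights
U_f = np.random.randn(h, h) * 0.1   # hidden-to-gate weights
b_f = np.ones(h)                    # positive bias: "remember by default"

f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)
print(f_t.shape)   # (4,) -- one retention coefficient per cell state dimension
print(f_t)         # every entry lies strictly between 0 and 1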

Why Sigmoid?

The sigmoid function is the mathematical heart of gating. It transforms any real number into a value between 0 and 1, perfect for representing "how much to keep":

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where:

  • $z$: the input value (can be any real number)
  • $e^{-z}$: the exponential function, which ensures the output is always positive
  • The denominator $1 + e^{-z}$ normalizes the result to fall between 0 and 1

The sigmoid has a characteristic S-shape that makes it ideal for soft decisions. When the pre-activation $z$ is large and positive, $e^{-z}$ approaches 0, so $\sigma(z) \approx 1$, meaning "keep this memory." When $z$ is large and negative, $e^{-z}$ becomes very large, so $\sigma(z) \approx 0$, meaning "forget this." For intermediate values, the sigmoid produces intermediate retention levels, allowing the network to partially dim memories rather than making hard binary choices.

Out[2]:
Visualization
Line plot comparing sigmoid and tanh functions over the range -6 to 6, showing sigmoid bounded between 0 and 1, and tanh bounded between -1 and 1.
Comparison of sigmoid and tanh activation functions. Sigmoid (blue) outputs values in (0, 1), perfect for gating decisions. Tanh (orange) outputs values in (-1, 1), allowing both positive and negative updates. The shaded regions highlight the saturation zones where gradients become small.
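
The saturation behavior is easy to verify numerically. This small sketch (not one of the article's notebook cells) prints both activations at a few representative points:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for z in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    print(f"z = {z:+.1f}   sigmoid(z) = {sigmoid(z):.4f}   tanh(z) = {np.tanh(z):+.4f}")
# Large positive z pushes sigmoid toward 1 ("keep") and tanh toward +1;
# large negative z pushes sigmoid toward 0 ("forget") and tanh toward -1.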

Understanding the Dimensions

Let's trace through the matrix dimensions carefully. Suppose we have an input dimension $d = 100$ (e.g., 100-dimensional word embeddings) and hidden dimension $h = 256$.

The input transformation $\mathbf{W}_f \mathbf{x}_t$ multiplies a $(256 \times 100)$ matrix by a $(100 \times 1)$ vector, producing a $(256 \times 1)$ vector. The hidden state transformation $\mathbf{U}_f \mathbf{h}_{t-1}$ multiplies a $(256 \times 256)$ matrix by a $(256 \times 1)$ vector, also producing a $(256 \times 1)$ vector. These two vectors are added element-wise along with the bias, then sigmoid is applied element-wise to produce the final $(256 \times 1)$ forget gate vector.
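
The same bookkeeping can be checked directly in NumPy. The zero-valued weights below are placeholders; only the shapes matter here:

import numpy as np

d, h = 100, 256
x_t = np.random.randn(d)        # (100,)
h_prev = np.random.randn(h)     # (256,)

W_f = np.zeros((h, d))          # (256, 100)
U_f = np.zeros((h, h))          # (256, 256)
b_f = np.zeros(h)               # (256,)

pre_activation = W_f @ x_t + U_f @ h_prev + b_f
print(pre_activation.shape)     # (256,) -- one pre-activation per cell state dimension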

Initialization Matters

The forget gate bias $\mathbf{b}_f$ is often initialized to a positive value (commonly 1 or 2) rather than zero. This ensures that early in training, the forget gate outputs values close to 1, meaning the network starts by remembering everything. Without this initialization trick, the network might learn to forget too aggressively before it has learned what information is worth keeping.

The Input Gate

While the forget gate decides what to discard, the input gate addresses the complementary question: what new information deserves to be written into memory? Not every input is equally important. When reading text, some words carry crucial meaning ("The murderer was...") while others are grammatical scaffolding ("the", "was"). The input gate learns to recognize which incoming information warrants permanent storage.

The input gate works in tandem with a candidate cell state (which we'll examine next). Think of it as a two-stage process: the candidate proposes "here's what we could store," and the input gate decides "here's how much of that proposal to actually write." This separation of concerns allows the network to generate rich candidate updates while maintaining fine-grained control over what actually enters long-term memory.

The Gating Mechanism

Like the forget gate, the input gate examines the current input and previous hidden state to make context-dependent decisions:

$$\mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{x}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i)$$

where:

  • $\mathbf{i}_t \in \mathbb{R}^h$: the input gate activation vector, controlling write intensity per dimension
  • $\mathbf{W}_i \in \mathbb{R}^{h \times d}$: weight matrix that learns which input features signal "important, store this"
  • $\mathbf{U}_i \in \mathbb{R}^{h \times h}$: weight matrix that learns which processing contexts warrant new storage
  • $\mathbf{b}_i \in \mathbb{R}^h$: bias vector setting the default write behavior

The structure is identical to the forget gate: a linear combination of the input and hidden state, followed by sigmoid. The difference lies entirely in the learned parameters. During training, the network discovers that certain input patterns should trigger forgetting while different patterns should trigger storage. The same architectural template serves opposite purposes through different learned weights.

The Candidate Cell State

The input gate decides how much to write, but something must decide what to write. This is the role of the candidate cell state: it proposes the actual content that might be added to memory. Think of it as drafting a memo that may or may not be filed, depending on the input gate's judgment.

The candidate needs to be expressive. It should be able to propose increases to certain memory dimensions, decreases to others, and leave some unchanged. This flexibility is crucial because the cell state encodes information through both the magnitude and sign of its values. A sentiment analysis model, for instance, might use positive cell state values to encode positive sentiment and negative values for negative sentiment. The candidate must be able to push the cell state in either direction.

The Candidate Formula

Like the gates, the candidate examines the current input and previous hidden state:

$$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c \mathbf{x}_t + \mathbf{U}_c \mathbf{h}_{t-1} + \mathbf{b}_c)$$

where:

  • $\tilde{\mathbf{c}}_t \in \mathbb{R}^h$: the candidate cell state vector, proposing potential updates
  • $\mathbf{W}_c \in \mathbb{R}^{h \times d}$: weight matrix that learns what content to extract from the input
  • $\mathbf{U}_c \in \mathbb{R}^{h \times h}$: weight matrix that learns how recent processing should shape the update
  • $\mathbf{b}_c \in \mathbb{R}^h$: bias vector

Why tanh Instead of Sigmoid?

Notice the crucial difference: the candidate uses $\tanh$ instead of sigmoid. The hyperbolic tangent function is defined as:

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

where $z$ is the input value. This function outputs values in the range $(-1, 1)$, centered around zero. The centering is key: unlike sigmoid, which only produces positive values between 0 and 1, $\tanh$ can output negative values. This allows the candidate to propose both positive and negative updates to the cell state.

Consider what would happen if the candidate used sigmoid instead. The cell state could only ever increase (since you'd only be adding positive values). Information could accumulate but never be actively reversed. With $\tanh$, the candidate can propose "increase this dimension" (positive values) or "decrease this dimension" (negative values), giving the network the full expressiveness it needs.

Why tanh for Candidates?

The candidate cell state uses $\tanh$ because it needs to propose values that can add to or subtract from the existing cell state. Sigmoid would only allow positive additions, severely limiting the network's representational power.
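
A small simulation makes the point concrete. In the sketch below (an illustration with random candidate pre-activations and fixed gate values, not a trained model), a single cell state dimension is updated for 50 steps. With a sigmoid candidate the state can only drift upward; with a tanh candidate it can move in both directions:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
pre_activations = rng.standard_normal(50)   # candidate pre-activations over 50 steps

f, i = 0.9, 0.5                 # forget and input gates held constant for illustration
c_sigmoid, c_tanh = 0.0, 0.0
for z in pre_activations:
    c_sigmoid = f * c_sigmoid + i * sigmoid(z)   # candidate in (0, 1): every update adds
    c_tanh = f * c_tanh + i * np.tanh(z)         # candidate in (-1, 1): updates can cancel

print(f"cell state with sigmoid candidate: {c_sigmoid:+.2f}")   # typically well above zero
print(f"cell state with tanh candidate:    {c_tanh:+.2f}")      # typically much smaller in magnitude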

The Cell State Update

We've now assembled all the ingredients: a forget gate that knows what to discard, an input gate that knows how much to write, and a candidate that proposes what to write. The cell state update combines these three components into a single equation that forms the heart of the LSTM.

This is where the "information highway" metaphor becomes concrete. The cell state flows through time, modified at each step by two operations: selective forgetting and selective addition. The key insight is that information can flow unchanged for many time steps if the forget gate stays near 1 and the input gate stays near 0. This enables LSTMs to capture long-range dependencies.

The Update Equation

The new cell state is computed as:

$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$

where:

  • $\mathbf{c}_t \in \mathbb{R}^h$: the new cell state
  • $\mathbf{c}_{t-1} \in \mathbb{R}^h$: the previous cell state (the memory we're updating)
  • $\mathbf{f}_t \in \mathbb{R}^h$: the forget gate activation (values between 0 and 1)
  • $\mathbf{i}_t \in \mathbb{R}^h$: the input gate activation (values between 0 and 1)
  • $\tilde{\mathbf{c}}_t \in \mathbb{R}^h$: the candidate cell state (values between -1 and 1)
  • $\odot$: the Hadamard (element-wise) product, where $[\mathbf{a} \odot \mathbf{b}]_j = a_j \cdot b_j$

Understanding the Two Terms

The equation has two terms, each serving a distinct purpose:

Term 1: Filtered Memory ($\mathbf{f}_t \odot \mathbf{c}_{t-1}$)

This term takes the previous cell state and scales each dimension by the corresponding forget gate value. If $f_t^{(j)} = 0.9$ for dimension $j$, then 90% of that dimension's previous value is retained. If $f_t^{(j)} = 0.1$, only 10% survives. This element-wise multiplication allows the network to selectively preserve some memories while erasing others, all within the same time step.

Term 2: New Information ($\mathbf{i}_t \odot \tilde{\mathbf{c}}_t$)

This term takes the candidate's proposal and scales it by the input gate. If $i_t^{(j)} = 0.8$ for dimension $j$, then 80% of the candidate's proposed value for that dimension gets written. If $i_t^{(j)} = 0.1$, the candidate's proposal is largely ignored. This gating prevents irrelevant information from contaminating the cell state.

The Sum: Combining Old and New

Adding these terms creates the new cell state. Each dimension independently combines its filtered old value with its gated new value. This additive structure is crucial for gradient flow: during backpropagation, gradients can flow directly through the addition operation without being squashed by nonlinearities.

A Concrete Example: Tracing the Update Step by Step

Let's walk through a concrete example to see exactly how the gates interact. Suppose the cell state has dimension $h = 4$ and at time $t$ we have:

  • Previous cell state: $\mathbf{c}_{t-1} = [0.5, -0.3, 0.8, 0.1]$
  • Forget gate: $\mathbf{f}_t = [0.9, 0.1, 0.7, 0.5]$
  • Input gate: $\mathbf{i}_t = [0.2, 0.8, 0.3, 0.6]$
  • Candidate: $\tilde{\mathbf{c}}_t = [0.4, 0.9, -0.2, 0.7]$

Step 1: Apply the forget gate to the previous cell state

$$\mathbf{f}_t \odot \mathbf{c}_{t-1} = [0.9, 0.1, 0.7, 0.5] \odot [0.5, -0.3, 0.8, 0.1] = [0.45, -0.03, 0.56, 0.05]$$

Step 2: Apply the input gate to the candidate

$$\mathbf{i}_t \odot \tilde{\mathbf{c}}_t = [0.2, 0.8, 0.3, 0.6] \odot [0.4, 0.9, -0.2, 0.7] = [0.08, 0.72, -0.06, 0.42]$$

Step 3: Sum to get the new cell state

$$\mathbf{c}_t = [0.45, -0.03, 0.56, 0.05] + [0.08, 0.72, -0.06, 0.42] = [0.53, 0.69, 0.50, 0.47]$$

Interpreting Each Dimension:

  • Dimension 1 ($0.5 \to 0.53$): The forget gate was high (0.9), preserving most of the old value. The input gate was low (0.2), adding little new information. Result: minimal change, memory preserved.

  • Dimension 2 ($-0.3 \to 0.69$): The forget gate was low (0.1), nearly erasing the old value. The input gate was high (0.8), strongly writing the candidate's positive value. Result: dramatic reversal, old memory replaced with new information.

  • Dimension 3 ($0.8 \to 0.50$): Moderate forget gate (0.7) kept most of the old value. Low input gate (0.3) added a small negative candidate contribution. Result: gradual decrease.

  • Dimension 4 ($0.1 \to 0.47$): Balanced gates (0.5 and 0.6) created a blend of old and new. Result: moderate change incorporating both sources.

This example illustrates the LSTM's power: each dimension of the cell state can be updated independently, with different balances of remembering and writing, all controlled by the learned gate values.
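
You can reproduce the worked example with a few element-wise NumPy operations; the arrays are exactly the values used in the text:

import numpy as np

c_prev = np.array([0.5, -0.3, 0.8, 0.1])     # previous cell state
f_t = np.array([0.9, 0.1, 0.7, 0.5])         # forget gate
i_t = np.array([0.2, 0.8, 0.3, 0.6])         # input gate
c_tilde = np.array([0.4, 0.9, -0.2, 0.7])    # candidate

filtered_memory = f_t * c_prev               # [0.45, -0.03, 0.56, 0.05]
new_information = i_t * c_tilde              # [0.08, 0.72, -0.06, 0.42]
c_t = filtered_memory + new_information

print(c_t)   # [0.53 0.69 0.5  0.47] (up to floating-point rounding)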

Out[3]:
Visualization
Grouped bar chart showing previous cell state, filtered memory, and new information across 4 dimensions.
Cell state update components showing the previous state (blue), filtered memory after forget gate (green), and new information from input gate (red). Gate values annotated above each dimension.
Grouped bar chart comparing cell state before and after update with change arrows.
Before and after comparison of the cell state, with change indicators showing how each dimension was modified by the update operation.

The Output Gate

The cell state now contains the LSTM's updated memory, but not all of this memory is relevant for the current moment. Consider a language model that has stored information about the subject of a sentence ("The cat"), the ongoing action ("is chasing"), and various contextual details. When predicting the next word, only some of this stored information is immediately relevant. The output gate acts as a filter, selecting which parts of the internal memory to expose for the current output.

This separation between internal memory (cell state) and external output (hidden state) is a key architectural insight. The cell state can accumulate and preserve information over long time spans, while the hidden state provides a task-relevant summary at each step. The output gate bridges these two representations.

The Output Gate Equation

Following the same pattern as the other gates, the output gate examines the current context:

$$\mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{x}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o)$$

where:

  • $\mathbf{o}_t \in \mathbb{R}^h$: the output gate activation vector, controlling what to expose
  • $\mathbf{W}_o \in \mathbb{R}^{h \times d}$: weight matrix that learns which input features signal "this memory is now relevant"
  • $\mathbf{U}_o \in \mathbb{R}^{h \times h}$: weight matrix that learns which processing contexts warrant exposing certain memories
  • $\mathbf{b}_o \in \mathbb{R}^h$: bias vector

The structure matches the other gates: linear combination followed by sigmoid. But the purpose is different. While the forget and input gates control what goes into memory, the output gate controls what comes out. This asymmetry allows the LSTM to store information that isn't immediately useful but may become relevant later.

The Hidden State Computation

The final step transforms the internal cell state into the external hidden state. This hidden state serves dual purposes: it becomes the LSTM's output for downstream processing (feeding into the next layer or making predictions), and it provides the recurrent context for the next time step.

The Hidden State Formula

The hidden state is computed by filtering the normalized cell state through the output gate:

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

where:

  • $\mathbf{h}_t \in \mathbb{R}^h$: the new hidden state, the LSTM's output at this time step
  • $\mathbf{o}_t \in \mathbb{R}^h$: the output gate activation (values between 0 and 1)
  • $\tanh(\mathbf{c}_t)$: the cell state squashed to the range $(-1, 1)$

Why Apply tanh to the Cell State?

Notice that we apply $\tanh$ to the cell state before gating. This serves two important purposes:

1. Normalization: The cell state can accumulate values well outside $[-1, 1]$ through repeated additions. If the forget gate stays near 1 and the input gate keeps adding positive values, the cell state could grow to 10, 100, or larger. Applying $\tanh$ compresses these potentially large values back into a bounded range, preventing the hidden state from exploding.

2. Consistency: The hidden state feeds into subsequent layers and provides recurrent context. Having it bounded in $(-1, 1)$ makes the network's behavior more predictable and stable. Downstream layers can rely on receiving inputs within a consistent range.

Why Apply tanh to the Cell State?

The cell state can accumulate values outside $[-1, 1]$ through repeated updates. Applying $\tanh$ before the output gate ensures the hidden state remains bounded, improving numerical stability and gradient flow.
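
A tiny sketch shows the effect: even when the cell state has drifted far outside $[-1, 1]$, the hidden state stays bounded. The numbers here are arbitrary illustrations:

import numpy as np

c_t = np.array([8.0, -15.0, 0.3, 40.0])   # a cell state that has accumulated large values
o_t = np.array([0.9, 0.5, 1.0, 0.2])      # output gate activations in (0, 1)

h_t = o_t * np.tanh(c_t)
print(np.tanh(c_t))   # approximately [ 1.  -1.   0.29  1. ] -- squashed into (-1, 1)
print(h_t)            # every entry has magnitude below 1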

The Complete Picture

With the hidden state computed, we've completed one full LSTM step. The hidden state $\mathbf{h}_t$ now flows in two directions:

  1. Forward to the next layer (or to the output): providing the LSTM's representation of the sequence up to time $t$
  2. Recurrently to the next time step: serving as $\mathbf{h}_{t-1}$ when processing $\mathbf{x}_{t+1}$

Meanwhile, the cell state $\mathbf{c}_t$ flows only recurrently, carrying the LSTM's long-term memory forward without being directly exposed to downstream processing. This dual-track architecture, with the cell state as a protected memory channel and the hidden state as a filtered output, is what gives LSTMs their ability to capture long-range dependencies.

Complete LSTM Equations

Let's collect all the equations in one place for reference. Given inputs $\mathbf{x}_t$, $\mathbf{h}_{t-1}$, and $\mathbf{c}_{t-1}$, the LSTM computes:

$$\begin{aligned} \mathbf{f}_t &= \sigma(\mathbf{W}_f \mathbf{x}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f) & \text{(forget gate)} \\ \mathbf{i}_t &= \sigma(\mathbf{W}_i \mathbf{x}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i) & \text{(input gate)} \\ \tilde{\mathbf{c}}_t &= \tanh(\mathbf{W}_c \mathbf{x}_t + \mathbf{U}_c \mathbf{h}_{t-1} + \mathbf{b}_c) & \text{(candidate)} \\ \mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t & \text{(cell state update)} \\ \mathbf{o}_t &= \sigma(\mathbf{W}_o \mathbf{x}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o) & \text{(output gate)} \\ \mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t) & \text{(hidden state)} \end{aligned}$$

These six equations fully specify the LSTM cell. Notice the symmetry: three gates (forget, input, output) all have the same structure, differing only in their learned parameters. The candidate has the same structure but uses $\tanh$ instead of sigmoid. The cell state and hidden state updates are simple element-wise operations.

LSTM Parameter Count

Understanding the parameter count is essential for estimating memory requirements and comparing architectures. Let's count every learnable parameter in an LSTM cell.

Each gate (forget, input, output) and the candidate cell state has:

  • One weight matrix $\mathbf{W} \in \mathbb{R}^{h \times d}$ for the input: $h \times d$ parameters
  • One weight matrix $\mathbf{U} \in \mathbb{R}^{h \times h}$ for the hidden state: $h \times h$ parameters
  • One bias vector $\mathbf{b} \in \mathbb{R}^h$: $h$ parameters

Since we have four such components (forget gate, input gate, candidate, output gate), the total parameter count is:

$$\text{Parameters} = 4 \times (h \times d + h \times h + h) = 4h(d + h + 1)$$

where:

  • $h$: the hidden dimension (size of the hidden and cell states)
  • $d$: the input dimension (size of the input vector at each time step)
  • $h \times d$: parameters in each input weight matrix $\mathbf{W}$
  • $h \times h$: parameters in each recurrent weight matrix $\mathbf{U}$
  • $h$: parameters in each bias vector $\mathbf{b}$
  • The factor of 4 accounts for the four components (forget gate, input gate, candidate, output gate)

For a more compact form, we can factor this as:

$$\text{Parameters} = 4h(d + h) + 4h$$

where $4h(d + h)$ counts all weight matrix parameters and $4h$ counts all bias parameters.

Example Calculation

Let's compute the parameter count for a typical configuration with input dimension $d = 300$ (common for word embeddings) and hidden dimension $h = 512$.

In[4]:
Code
d = 300  # input dimension (e.g., word embedding size)
h = 512  # hidden dimension

# Parameters per gate/candidate
params_W = h * d  # input weight matrix
params_U = h * h  # hidden state weight matrix
params_b = h  # bias vector
params_per_component = params_W + params_U + params_b

# Total parameters (4 components: forget, input, candidate, output)
total_params = 4 * params_per_component
Out[5]:
Console
Input dimension (d): 300
Hidden dimension (h): 512

Parameters per component:
  W matrix: 153,600 (512 × 300)
  U matrix: 262,144 (512 × 512)
  Bias: 512
  Subtotal: 416,256

Total LSTM parameters: 1,665,024

Using formula 4h(d + h + 1): 1,665,024

With these dimensions, a single LSTM layer has over 1.6 million parameters. For comparison, a vanilla RNN with the same dimensions would have:

$$\text{RNN Parameters} = h(d + h + 1) = 512 \times (300 + 512 + 1) = 416{,}256$$

where the single factor (instead of 4) reflects that a vanilla RNN has only one set of weights for its single transformation. The LSTM's four-fold increase in parameters comes from its four separate components (forget gate, input gate, candidate, output gate), each with their own weight matrices and biases.
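
The four-fold relationship is easy to confirm with the two formulas. This small sketch introduces an rnn_params helper purely for the comparison:

def rnn_params(d, h):
    """Vanilla RNN: one input weight matrix, one recurrent weight matrix, one bias."""
    return h * (d + h + 1)

def lstm_params(d, h):
    """LSTM: four components, each with its own W, U, and b."""
    return 4 * h * (d + h + 1)

d, h = 300, 512
print(rnn_params(d, h))                       # 416256
print(lstm_params(d, h))                      # 1665024
print(lstm_params(d, h) // rnn_params(d, h))  # 4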

Stacked LSTMs

When you stack multiple LSTM layers, the parameter count increases. The first layer takes the input dimension $d$, but subsequent layers take the hidden dimension $h$ as their input (since they receive the previous layer's hidden state).

In[6]:
Code
def lstm_params(d, h):
    """Calculate parameters for one LSTM layer."""
    return 4 * h * (d + h + 1)


# Stack of 3 LSTM layers
d = 300
h = 512
num_layers = 3

# First layer: input is d
first_layer_params = lstm_params(d, h)

# Subsequent layers: input is h
subsequent_layer_params = lstm_params(h, h)

total_stacked_params = (
    first_layer_params + (num_layers - 1) * subsequent_layer_params
)
Out[7]:
Console
3-layer stacked LSTM (d=300, h=512):
  Layer 1: 1,665,024 parameters
  Layer 2: 2,099,200 parameters
  Layer 3: 2,099,200 parameters
  Total: 5,863,424 parameters

A 3-layer LSTM with these dimensions has nearly 5.9 million parameters. Deep language models often use even larger hidden dimensions and more layers, quickly reaching hundreds of millions of parameters.

Out[8]:
Visualization
Line plot showing total LSTM parameters growing quadratically from 64 to 1024 hidden dimensions.
LSTM parameter count scaling with hidden dimension. The quadratic growth comes from the h×h recurrent weight matrices. Key parameter counts annotated for common hidden sizes.
Stacked area chart showing parameter breakdown into input weights, recurrent weights, and biases.
Parameter breakdown by type showing how recurrent weights (U matrices) dominate for larger hidden dimensions, while biases contribute negligibly.

Implementing LSTM from Scratch

Now let's implement an LSTM cell from scratch using only NumPy. This exercise solidifies understanding and reveals the computational structure hidden behind framework abstractions.

Activation Functions

First, we need the sigmoid and tanh activation functions:

In[9]:
Code
import numpy as np


def sigmoid(x):
    """Numerically stable sigmoid function."""
    return np.where(x >= 0, 1 / (1 + np.exp(-x)), np.exp(x) / (1 + np.exp(x)))


def tanh(x):
    """Hyperbolic tangent function."""
    return np.tanh(x)

The sigmoid implementation handles numerical stability by using different formulas for positive and negative inputs, so the exponential in the branch that is actually selected never overflows. (Because np.where evaluates both branches, NumPy may still emit harmless warnings for extreme inputs, but the values that are returned are correct.)
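
A quick check at extreme inputs (a standalone sketch, not one of the notebook cells) confirms the returned values stay finite and correct:

import numpy as np

def sigmoid(x):
    """Same piecewise-stable formulation as above."""
    return np.where(x >= 0, 1 / (1 + np.exp(-x)), np.exp(x) / (1 + np.exp(x)))

z = np.array([-1000.0, 0.0, 1000.0])
with np.errstate(over="ignore", invalid="ignore"):  # np.where evaluates both branches
    print(sigmoid(z))                               # [0.  0.5 1. ] -- no NaNs or infs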

LSTM Cell Class

Now we implement the LSTM cell with explicit weight matrices for each component:

In[10]:
Code
class LSTMCell:
    """A single LSTM cell implemented from scratch."""

    def __init__(self, input_dim, hidden_dim):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim

        # Initialize weights using Xavier initialization
        scale_x = np.sqrt(2.0 / (input_dim + hidden_dim))
        scale_h = np.sqrt(2.0 / (hidden_dim + hidden_dim))

        # Forget gate parameters
        self.W_f = np.random.randn(hidden_dim, input_dim) * scale_x
        self.U_f = np.random.randn(hidden_dim, hidden_dim) * scale_h
        self.b_f = np.ones(
            hidden_dim
        )  # Initialize to 1 for better gradient flow

        # Input gate parameters
        self.W_i = np.random.randn(hidden_dim, input_dim) * scale_x
        self.U_i = np.random.randn(hidden_dim, hidden_dim) * scale_h
        self.b_i = np.zeros(hidden_dim)

        # Candidate cell state parameters
        self.W_c = np.random.randn(hidden_dim, input_dim) * scale_x
        self.U_c = np.random.randn(hidden_dim, hidden_dim) * scale_h
        self.b_c = np.zeros(hidden_dim)

        # Output gate parameters
        self.W_o = np.random.randn(hidden_dim, input_dim) * scale_x
        self.U_o = np.random.randn(hidden_dim, hidden_dim) * scale_h
        self.b_o = np.zeros(hidden_dim)

    def forward(self, x_t, h_prev, c_prev):
        """
        Forward pass through the LSTM cell.

        Args:
            x_t: Input at time t, shape (input_dim,)
            h_prev: Previous hidden state, shape (hidden_dim,)
            c_prev: Previous cell state, shape (hidden_dim,)

        Returns:
            h_t: New hidden state, shape (hidden_dim,)
            c_t: New cell state, shape (hidden_dim,)
        """
        # Forget gate
        f_t = sigmoid(self.W_f @ x_t + self.U_f @ h_prev + self.b_f)

        # Input gate
        i_t = sigmoid(self.W_i @ x_t + self.U_i @ h_prev + self.b_i)

        # Candidate cell state
        c_tilde = tanh(self.W_c @ x_t + self.U_c @ h_prev + self.b_c)

        # New cell state
        c_t = f_t * c_prev + i_t * c_tilde

        # Output gate
        o_t = sigmoid(self.W_o @ x_t + self.U_o @ h_prev + self.b_o)

        # New hidden state
        h_t = o_t * tanh(c_t)

        return h_t, c_t

    def count_parameters(self):
        """Count total learnable parameters."""
        return 4 * self.hidden_dim * (self.input_dim + self.hidden_dim + 1)

Let's verify our implementation produces outputs with the correct shapes:

In[11]:
Code
# Create an LSTM cell
input_dim = 10
hidden_dim = 20
lstm_cell = LSTMCell(input_dim, hidden_dim)

# Create sample inputs
x_t = np.random.randn(input_dim)
h_prev = np.zeros(hidden_dim)
c_prev = np.zeros(hidden_dim)

# Forward pass
h_t, c_t = lstm_cell.forward(x_t, h_prev, c_prev)
Out[12]:
Console
LSTM Cell Configuration:
  Input dimension: 10
  Hidden dimension: 20
  Total parameters: 2,480

Input shapes:
  x_t: (10,)
  h_prev: (20,)
  c_prev: (20,)

Output shapes:
  h_t: (20,)
  c_t: (20,)

Sample output values:
  h_t[:5]: [-0.1581875   0.05509321  0.11139839 -0.10368865 -0.01566213]
  c_t[:5]: [-0.37815718  0.08858582  0.37376202 -0.32877161 -0.02321892]

The hidden state values are bounded between $-1$ and $1$ because they're computed as the product of a sigmoid (range $(0, 1)$) and a tanh (range $(-1, 1)$). The cell state values can exceed this range since they accumulate through addition.

Processing a Sequence

An LSTM processes sequences by applying the cell repeatedly, passing the hidden and cell states from one step to the next:

In[13]:
Code
class LSTM:
    """LSTM layer that processes sequences."""

    def __init__(self, input_dim, hidden_dim):
        self.cell = LSTMCell(input_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, x_sequence):
        """
        Process a sequence through the LSTM.

        Args:
            x_sequence: Input sequence, shape (seq_len, input_dim)

        Returns:
            hidden_states: All hidden states, shape (seq_len, hidden_dim)
            final_cell_state: Final cell state, shape (hidden_dim,)
        """
        seq_len = x_sequence.shape[0]

        # Initialize states
        h_t = np.zeros(self.hidden_dim)
        c_t = np.zeros(self.hidden_dim)

        # Store all hidden states
        hidden_states = []

        # Process each time step
        for t in range(seq_len):
            h_t, c_t = self.cell.forward(x_sequence[t], h_t, c_t)
            hidden_states.append(h_t)

        return np.array(hidden_states), c_t

Let's process a sample sequence and examine how the hidden state evolves:

In[14]:
Code
# Create LSTM and sample sequence
lstm = LSTM(input_dim=10, hidden_dim=20)
sequence = np.random.randn(15, 10)  # 15 time steps, 10-dimensional input

# Process the sequence
hidden_states, final_cell = lstm.forward(sequence)
Out[15]:
Console
Sequence length: 15
Input dimension: 10
Hidden dimension: 20

Hidden states shape: (15, 20)
Final cell state shape: (20,)

The LSTM produces a hidden state at each time step. For sequence classification, you typically use the final hidden state. For sequence-to-sequence tasks, you use all hidden states.
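
As an example of the first case, a sequence classifier might project the final hidden state to class scores. The projection weights W_out and b_out below are hypothetical, introduced only to illustrate the pattern; hidden_states comes from the cell above:

import numpy as np

num_classes = 3
W_out = np.random.randn(num_classes, 20) * 0.1   # 20 matches the hidden_dim used above
b_out = np.zeros(num_classes)

final_hidden = hidden_states[-1]        # shape (20,), summary of the whole sequence
logits = W_out @ final_hidden + b_out   # unnormalized class scores, shape (3,)
print(logits.shape)                     # (3,)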

Visualizing Gate Activations

Let's visualize how the gates respond to a sequence, which reveals how the LSTM decides what to remember and forget:

In[16]:
Code
class LSTMWithGateHistory(LSTM):
    """LSTM that records gate activations for analysis."""

    def forward_with_gates(self, x_sequence):
        """Process sequence and return gate activations."""
        seq_len = x_sequence.shape[0]

        h_t = np.zeros(self.hidden_dim)
        c_t = np.zeros(self.hidden_dim)

        # Store gate activations
        forget_gates = []
        input_gates = []
        output_gates = []
        cell_states = []

        for t in range(seq_len):
            x_t = x_sequence[t]
            cell = self.cell

            # Compute gates explicitly
            f_t = sigmoid(cell.W_f @ x_t + cell.U_f @ h_t + cell.b_f)
            i_t = sigmoid(cell.W_i @ x_t + cell.U_i @ h_t + cell.b_i)
            c_tilde = tanh(cell.W_c @ x_t + cell.U_c @ h_t + cell.b_c)
            c_t = f_t * c_t + i_t * c_tilde
            o_t = sigmoid(cell.W_o @ x_t + cell.U_o @ h_t + cell.b_o)
            h_t = o_t * tanh(c_t)

            forget_gates.append(f_t.mean())
            input_gates.append(i_t.mean())
            output_gates.append(o_t.mean())
            cell_states.append(np.abs(c_t).mean())

        return {
            "forget": np.array(forget_gates),
            "input": np.array(input_gates),
            "output": np.array(output_gates),
            "cell_magnitude": np.array(cell_states),
        }
Out[17]:
Visualization
Line plot showing four curves representing forget, input, and output gate activations plus cell state magnitude over 50 time steps.
LSTM gate activations over a 50-step sequence. The forget gate (blue) stays high due to bias initialization, preserving memory. The input gate (orange) shows varying write intensity. The output gate (green) controls how much cell state is exposed. Cell state magnitude (red, dashed) grows as information accumulates.

The visualization reveals several key behaviors. The forget gate starts high (near 1) because we initialized its bias to 1, causing the LSTM to preserve information by default. The input and output gates show more variation as they learn to selectively write and read. The cell state magnitude tends to grow over time as information accumulates, though the gates prevent unbounded growth.

Comparing with PyTorch

Let's verify our implementation matches PyTorch's LSTM by comparing outputs:

In[18]:
Code
import torch
import torch.nn as nn

# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Dimensions
input_dim = 8
hidden_dim = 16
seq_len = 5

# Create our LSTM
our_lstm = LSTMCell(input_dim, hidden_dim)

# Create PyTorch LSTM and copy weights
torch_lstm = nn.LSTMCell(input_dim, hidden_dim)

# PyTorch stores weights concatenated: [W_i, W_f, W_c, W_o]
# We need to reorganize our weights to match
with torch.no_grad():
    # Input-hidden weights (ih): concatenated [i, f, c, o]
    torch_lstm.weight_ih.copy_(
        torch.tensor(
            np.vstack([our_lstm.W_i, our_lstm.W_f, our_lstm.W_c, our_lstm.W_o]),
            dtype=torch.float32,
        )
    )

    # Hidden-hidden weights (hh): concatenated [i, f, c, o]
    torch_lstm.weight_hh.copy_(
        torch.tensor(
            np.vstack([our_lstm.U_i, our_lstm.U_f, our_lstm.U_c, our_lstm.U_o]),
            dtype=torch.float32,
        )
    )

    # Biases
    torch_lstm.bias_ih.copy_(
        torch.tensor(
            np.concatenate(
                [our_lstm.b_i, our_lstm.b_f, our_lstm.b_c, our_lstm.b_o]
            ),
            dtype=torch.float32,
        )
    )

    torch_lstm.bias_hh.copy_(torch.zeros(4 * hidden_dim))

# Test input
x = np.random.randn(input_dim).astype(np.float32)
h_prev = np.zeros(hidden_dim, dtype=np.float32)
c_prev = np.zeros(hidden_dim, dtype=np.float32)

# Our forward pass
our_h, our_c = our_lstm.forward(x, h_prev, c_prev)

# PyTorch forward pass
x_torch = torch.tensor(x).unsqueeze(0)  # Add batch dimension
h_torch = torch.tensor(h_prev).unsqueeze(0)
c_torch = torch.tensor(c_prev).unsqueeze(0)
torch_h, torch_c = torch_lstm(x_torch, (h_torch, c_torch))
Out[19]:
Console
Comparison between our LSTM and PyTorch LSTM:

Max absolute difference in hidden state: 3.22e-08
Max absolute difference in cell state: 6.41e-08

✓ Implementations match!

Our h_t[:5]: [ 0.0283901  -0.36721011 -0.10207159  0.0654979   0.1483921 ]
PyTorch h[:5]: [ 0.02839009 -0.36721012 -0.10207159  0.06549793  0.1483921 ]

The outputs match to numerical precision, confirming our implementation is correct. The small differences (on the order of $10^{-8}$) come from floating-point arithmetic variations between NumPy and PyTorch.

The Combined Weight Matrix Formulation

In practice, frameworks like PyTorch don't compute each gate separately. Instead, they concatenate all weight matrices and compute all gates in a single matrix multiplication, which is more efficient on GPUs.

The combined formulation stacks the four weight matrices vertically into single large matrices:

$$\mathbf{W} = \begin{bmatrix} \mathbf{W}_i \\ \mathbf{W}_f \\ \mathbf{W}_c \\ \mathbf{W}_o \end{bmatrix} \in \mathbb{R}^{4h \times d} \quad\quad \mathbf{U} = \begin{bmatrix} \mathbf{U}_i \\ \mathbf{U}_f \\ \mathbf{U}_c \\ \mathbf{U}_o \end{bmatrix} \in \mathbb{R}^{4h \times h} \quad\quad \mathbf{b} = \begin{bmatrix} \mathbf{b}_i \\ \mathbf{b}_f \\ \mathbf{b}_c \\ \mathbf{b}_o \end{bmatrix} \in \mathbb{R}^{4h}$$

where:

  • $\mathbf{W} \in \mathbb{R}^{4h \times d}$: combined input weight matrix, stacking all four gate input weights
  • $\mathbf{U} \in \mathbb{R}^{4h \times h}$: combined recurrent weight matrix, stacking all four gate hidden weights
  • $\mathbf{b} \in \mathbb{R}^{4h}$: combined bias vector, concatenating all four gate biases
  • The subscripts $i$, $f$, $c$, $o$ denote input gate, forget gate, candidate, and output gate respectively

Then all gates are computed in one step:

$$\begin{bmatrix} \mathbf{i}_t \\ \mathbf{f}_t \\ \tilde{\mathbf{c}}_t \\ \mathbf{o}_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \tanh \\ \sigma \end{bmatrix} \left( \mathbf{W} \mathbf{x}_t + \mathbf{U} \mathbf{h}_{t-1} + \mathbf{b} \right)$$

The notation $\begin{bmatrix} \sigma \\ \sigma \\ \tanh \\ \sigma \end{bmatrix}$ indicates that different activation functions are applied to different portions of the result: sigmoid to the first $h$ elements (input gate), sigmoid to the next $h$ elements (forget gate), tanh to the next $h$ elements (candidate), and sigmoid to the final $h$ elements (output gate).

This produces a $4h$-dimensional vector that is split into four chunks of size $h$ each. The efficiency gain comes from performing two large matrix multiplications instead of eight smaller ones.

In[20]:
Code
class EfficientLSTMCell:
    """LSTM cell using combined weight matrices for efficiency."""

    def __init__(self, input_dim, hidden_dim):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim

        # Combined weight matrices
        scale_x = np.sqrt(2.0 / (input_dim + hidden_dim))
        scale_h = np.sqrt(2.0 / (hidden_dim + hidden_dim))

        self.W = np.random.randn(4 * hidden_dim, input_dim) * scale_x
        self.U = np.random.randn(4 * hidden_dim, hidden_dim) * scale_h
        self.b = np.zeros(4 * hidden_dim)

        # Initialize forget gate bias to 1
        self.b[hidden_dim : 2 * hidden_dim] = 1.0

    def forward(self, x_t, h_prev, c_prev):
        """Forward pass using combined matrices."""
        h = self.hidden_dim

        # Single matrix multiplication for all gates
        gates = self.W @ x_t + self.U @ h_prev + self.b

        # Split and apply activations
        i_t = sigmoid(gates[0:h])
        f_t = sigmoid(gates[h : 2 * h])
        c_tilde = tanh(gates[2 * h : 3 * h])
        o_t = sigmoid(gates[3 * h : 4 * h])

        # Cell state and hidden state updates
        c_t = f_t * c_prev + i_t * c_tilde
        h_t = o_t * tanh(c_t)

        return h_t, c_t
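
As a sanity check, we can copy the per-component weights from an LSTMCell into the combined layout and confirm that both cells produce identical outputs (a short sketch reusing the classes defined above):

np.random.seed(0)
standard = LSTMCell(input_dim=10, hidden_dim=20)
combined = EfficientLSTMCell(input_dim=10, hidden_dim=20)

# Pack the per-component weights in the [i, f, c, o] order that EfficientLSTMCell expects
combined.W = np.vstack([standard.W_i, standard.W_f, standard.W_c, standard.W_o])
combined.U = np.vstack([standard.U_i, standard.U_f, standard.U_c, standard.U_o])
combined.b = np.concatenate([standard.b_i, standard.b_f, standard.b_c, standard.b_o])

x = np.random.randn(10)
h0 = np.zeros(20)
c0 = np.zeros(20)

h_std, c_std = standard.forward(x, h0, c0)
h_eff, c_eff = combined.forward(x, h0, c0)
print(np.allclose(h_std, h_eff), np.allclose(c_std, c_eff))   # True True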

Let's compare the performance of both implementations:

In[21]:
Code
import time

efficient_lstm = EfficientLSTMCell(input_dim=100, hidden_dim=256)
standard_lstm = LSTMCell(input_dim=100, hidden_dim=256)

x = np.random.randn(100)
h = np.zeros(256)
c = np.zeros(256)

# Time standard implementation
start = time.time()
for _ in range(1000):
    standard_lstm.forward(x, h, c)
standard_time = time.time() - start

# Time efficient implementation
start = time.time()
for _ in range(1000):
    efficient_lstm.forward(x, h, c)
efficient_time = time.time() - start

speedup = standard_time / efficient_time
Out[22]:
Console
Performance comparison (1000 forward passes):
  Standard implementation: 4334.7 ms
  Efficient implementation: 855.6 ms
  Speedup: 5.07x

The efficient implementation is faster because it performs fewer matrix multiplications. The standard implementation does 8 matrix multiplications (2 per component: one for the input, one for the hidden state), while the efficient version does only 2 (one for inputs, one for hidden states). On GPUs, this difference is even more pronounced because large matrix multiplications are highly optimized.

Limitations and Practical Considerations

LSTMs represented a major breakthrough in sequence modeling, but they come with important limitations that practitioners should understand.

Computational Cost: Each time step requires sequential computation because the hidden state at time $t$ depends on the hidden state at time $t-1$. This sequential dependency prevents parallelization across time steps, making LSTMs slow to train on long sequences. A 1000-step sequence requires 1000 sequential operations, regardless of how many GPUs you have. This is the fundamental limitation that motivated the development of Transformers, which can process all positions in parallel.

Memory Requirements: LSTMs must store hidden and cell states for every time step during training to enable backpropagation through time. For a sequence of length $T$ with hidden dimension $h$, this requires $O(T \cdot h)$ memory per layer, where $T$ is the number of time steps and $h$ is the hidden state size. With long sequences and deep networks, memory can become a bottleneck. Techniques like truncated backpropagation through time can help, but they sacrifice gradient accuracy for long-range dependencies.
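
A rough back-of-the-envelope estimate (a sketch with assumed values) shows how quickly this adds up:

T = 1000               # sequence length
h = 512                # hidden dimension
batch = 32             # batch size
bytes_per_float = 4    # float32

# Hidden state and cell state cached at every time step, per layer
activation_bytes = 2 * T * h * batch * bytes_per_float
print(f"{activation_bytes / 1e6:.0f} MB per layer just for h_t and c_t")   # ~131 MB
# In practice frameworks also cache the gate activations, so the true figure is larger.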

Gradient Flow: Despite the cell state's "information highway," gradients still face challenges over very long sequences. The forget gate must maintain values very close to 1 for hundreds of time steps to preserve gradients, which is difficult to learn. In practice, LSTMs struggle with dependencies beyond a few hundred time steps.

Hyperparameter Sensitivity: The hidden dimension, number of layers, learning rate, and gradient clipping threshold all interact in complex ways. LSTMs can be finicky to train, often requiring careful tuning. The forget gate bias initialization (setting it to 1 or 2) is one example of a trick that significantly affects training stability.

Despite these limitations, LSTMs remain valuable in many contexts. They are well-suited for streaming applications where you process data one element at a time. They work well for moderate-length sequences (up to a few hundred tokens). They also require less data to train than Transformers for smaller-scale problems. Understanding their equations deeply, as we've done in this chapter, helps you recognize when they're the right tool and when to reach for alternatives.

Key Parameters

When implementing or using LSTMs, several parameters significantly impact model behavior and performance (a short PyTorch instantiation sketch follows this list):

  • hidden_dim (hidden_size in PyTorch): The dimensionality of the hidden and cell states. Larger values increase model capacity but also increase computation and memory. Common values range from 128 to 1024, with 256-512 being typical for many NLP tasks.

  • input_dim (input_size in PyTorch): The dimensionality of the input vectors at each time step. This is typically determined by your embedding layer or feature representation.

  • num_layers: The number of stacked LSTM layers. Deeper networks can learn more complex patterns but are harder to train. Most applications use 1-3 layers; beyond 4-5 layers, gradient flow becomes problematic without techniques like residual connections.

  • dropout: Regularization applied between LSTM layers (not within a single layer). Values of 0.2-0.5 are common. Note that dropout is only applied during training and when num_layers > 1.

  • bidirectional: Whether to process sequences in both directions. Bidirectional LSTMs double the effective hidden size but cannot be used for autoregressive generation tasks.

  • forget_bias (initialization): The initial value for the forget gate bias. Setting this to 1.0 or higher helps the network learn to preserve information early in training. This is often set implicitly by the framework but can be overridden.

  • batch_first: In PyTorch, determines whether input tensors have shape (batch, seq, features) or (seq, batch, features). The former is more intuitive but the latter is the default for historical reasons.
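
To see how these map onto PyTorch, here is a minimal instantiation sketch. The dimensions are arbitrary examples, and the forget gate bias is set manually because nn.LSTM does not expose it as a constructor argument:

import torch.nn as nn

lstm = nn.LSTM(
    input_size=300,       # d: dimensionality of each input vector
    hidden_size=512,      # h: dimensionality of the hidden and cell states
    num_layers=2,         # stacked LSTM layers
    dropout=0.3,          # applied between layers (only active when num_layers > 1)
    bidirectional=False,  # unidirectional, suitable for autoregressive generation
    batch_first=True,     # inputs shaped (batch, seq, features)
)

# PyTorch orders the gates as [input, forget, cell, output], so the forget gate
# bias occupies the second quarter of each bias vector.
hidden = lstm.hidden_size
for name, param in lstm.named_parameters():
    if name.startswith("bias"):
        param.data[hidden:2 * hidden].fill_(1.0)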

Summary

This chapter translated the intuition of LSTM gates into precise mathematics and working code. The key equations form a system where three gates control information flow through a memory cell.

The Gate Equations:

  • Forget gate: $\mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{x}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f)$ decides what to discard from memory
  • Input gate: $\mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{x}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i)$ decides how much new information to write
  • Candidate: $\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c \mathbf{x}_t + \mathbf{U}_c \mathbf{h}_{t-1} + \mathbf{b}_c)$ proposes new values to store
  • Output gate: $\mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{x}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o)$ decides what to expose as output

The State Updates:

  • Cell state: $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$ combines filtered memory with gated new information
  • Hidden state: $\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$ exposes filtered cell state as output

Parameter Count:

An LSTM cell has $4h(d + h + 1)$ parameters, where $d$ is the input dimension (e.g., word embedding size), $h$ is the hidden dimension, and the factor of 4 accounts for the four components (forget gate, input gate, candidate, output gate). This is four times the parameters of a vanilla RNN, reflecting the four separate learnable transformations.

Implementation Insights:

  • Initializing the forget gate bias to 1 improves training by starting with full memory retention
  • Combining weight matrices into single operations improves computational efficiency
  • The cell state can grow unbounded, so applying tanh before the output gate maintains numerical stability

With these equations internalized, you can now trace exactly how information flows through an LSTM, debug unexpected behaviors, and make informed decisions about when to use this architecture. The next chapter examines how these equations enable better gradient flow compared to vanilla RNNs.
