LSTM Architecture: Complete Guide to Long Short-Term Memory Networks

Michael Brenndoerfer · December 16, 2025 · 28 min read

Master LSTM architecture including cell state, gates, and gradient flow. Learn how LSTMs solve the vanishing gradient problem with practical PyTorch examples.

LSTM Architecture

In the previous chapter, we saw how vanilla RNNs struggle with long sequences. Gradients either vanish into insignificance or explode beyond control, making it nearly impossible to learn dependencies that span more than a few timesteps. This limitation motivated researchers to rethink the fundamental architecture of recurrent networks.

Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber in 1997, solve the vanishing gradient problem. The key insight is simple: instead of forcing information to pass through repeated nonlinear transformations at every timestep, create a dedicated pathway where information can flow unchanged across many timesteps. This "memory highway" allows LSTMs to maintain and access information over hundreds or even thousands of steps, enabling applications that were previously impossible with vanilla RNNs.

The Cell State: An Information Highway

The defining feature of an LSTM is its cell state, a separate memory channel that runs parallel to the hidden state. Think of it as a conveyor belt that carries information through time. Unlike the hidden state in vanilla RNNs, which gets completely rewritten at each timestep, the cell state can preserve information indefinitely by simply passing it forward unchanged.

Out[3]:
Visualization
Diagram showing cell state as horizontal line with LSTM cells below, demonstrating information flow through time.
The cell state acts as an information highway running through time. Information can flow unchanged (straight arrows) or be modified by gates (vertical interactions). This direct pathway is the key to LSTM's ability to capture long-range dependencies.

In a vanilla RNN, information at timestep $t$ must pass through every intermediate hidden state to reach timestep $t+k$. Each passage involves a matrix multiplication and a tanh activation, which compresses values and causes gradients to shrink. After enough steps, the original signal is lost in the noise.

The cell state sidesteps this problem entirely. Information stored in the cell state can theoretically travel from the beginning to the end of a sequence with minimal transformation. The only operations applied to the cell state are element-wise additions and multiplications, both of which have well-behaved gradients that don't vanish or explode as readily as matrix multiplications through tanh.
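To make this concrete, here is a minimal sketch using PyTorch autograd. The decay values (a recurrent weight of 0.9 and a forget-gate-like factor of 0.95) are hand-picked assumptions for illustration; the point is only to compare the gradient that survives 100 repeated tanh transformations against the gradient that survives 100 additive, gated updates.

```python
import torch

steps = 100

# Multiplicative path: h is repeatedly squashed, as in a one-unit vanilla RNN.
h0 = torch.tensor(0.5, requires_grad=True)
w = torch.tensor(0.9)  # assumed recurrent weight
h = h0
for _ in range(steps):
    h = torch.tanh(w * h)
h.backward()
print(f"gradient after {steps} tanh steps:     {h0.grad.item():.2e}")

# Additive path: c is scaled by a forget-gate-like factor and nudged slightly,
# mimicking the LSTM cell state update.
c0 = torch.tensor(0.5, requires_grad=True)
f = torch.tensor(0.95)  # assumed forget-gate value
c = c0
for _ in range(steps):
    c = f * c + 0.01
c.backward()
print(f"gradient after {steps} additive steps: {c0.grad.item():.2e}")
```

The additive path keeps a gradient of roughly $0.95^{100} \approx 6 \times 10^{-3}$, while the repeated tanh path shrinks the gradient by orders of magnitude more.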

Cell State vs Hidden State

LSTMs maintain two separate state vectors at each timestep $t$:

  • $C_t$: the cell state, which stores long-term memory and flows through time with minimal modification
  • $h_t$: the hidden state, which is the output at each timestep and gets transformed more aggressively

The hidden state is what you typically use for predictions, but the cell state is what enables long-range learning.

Gate Mechanism Intuition

If the cell state were completely static, it would be useless. We need a way to selectively add new information, remove outdated information, and control what gets output. LSTMs accomplish this through gates, which are learned neural network layers that output values between 0 and 1.

Think of gates as dimmer switches rather than on/off toggles. A gate value of 0.0 means "block everything," 0.5 means "let half through," and 1.0 means "let everything through." Because gates use sigmoid activations, their outputs are always in this [0, 1] range, making them perfect for controlling information flow.
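As a quick illustration, here is a tiny sketch of a gate acting as a dimmer switch. The pre-activation values are hand-picked, not learned:

```python
import torch

# Hand-picked pre-activations: strongly negative, neutral, strongly positive.
pre_activation = torch.tensor([-4.0, 0.0, 4.0])
gate = torch.sigmoid(pre_activation)      # approx [0.018, 0.500, 0.982]

information = torch.tensor([1.0, 1.0, 1.0])
print(gate * information)                 # first element mostly blocked, last mostly passed
```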

Out[4]:
Visualization
Sigmoid activation function outputs values in [0, 1], making it ideal for gates that control information flow. Values near 0 block information, values near 1 pass it through.
Tanh activation function outputs values in [-1, 1], allowing candidate values to both increase and decrease cell state elements. The symmetric range around zero enables bidirectional updates.

LSTMs have three gates, each serving a distinct purpose:

  • Forget gate: Decides what information to discard from the cell state. When processing a new sentence, you might want to forget the subject of the previous sentence.
  • Input gate: Decides what new information to store in the cell state. When you encounter a new subject, you want to remember it.
  • Output gate: Decides what part of the cell state to output as the hidden state. Not everything you remember is relevant to the current prediction.
Out[5]:
Visualization
The forget gate decides what information to discard from the cell state. Output values near 0 remove information, values near 1 retain it.
The input gate decides what new information to store in the cell state. It works with the candidate values to update memory.
The output gate controls what part of the cell state becomes the hidden state output, filtering what gets passed to the next layer.

Each gate takes the same inputs: the previous hidden state $h_{t-1}$ and the current input $x_t$. The gate computation follows the same pattern for all three gates:

$$\text{gate} = \sigma(W \cdot [h_{t-1}, x_t] + b)$$

where:

  • $\sigma$: the sigmoid activation function, which outputs values in $[0, 1]$
  • $W$: the weight matrix for this gate (each gate has its own weights)
  • $[h_{t-1}, x_t]$: the concatenation of the previous hidden state and current input
  • $b$: the bias vector for this gate

Each gate learns different weights, allowing it to specialize in its particular function. The forget gate might learn to reset memory when it sees a period, while the input gate might learn to store information when it sees a noun.
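The shared gate pattern is easy to express directly. The sketch below uses small illustrative dimensions and random, untrained weights:

```python
import torch

torch.manual_seed(0)
hidden_size, input_size = 4, 3           # assumed sizes for illustration

W = torch.randn(hidden_size, hidden_size + input_size)  # each gate has its own W
b = torch.zeros(hidden_size)                            # and its own bias

h_prev = torch.zeros(hidden_size)        # previous hidden state h_{t-1}
x_t = torch.randn(input_size)            # current input x_t

combined = torch.cat([h_prev, x_t])      # concatenation [h_{t-1}, x_t]
gate = torch.sigmoid(W @ combined + b)   # one value in (0, 1) per hidden unit
print(gate)
```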

LSTM Diagram Walkthrough

Now that we understand what gates do conceptually, let's see how they work together to create a complete memory system. The LSTM cell coordinates forgetting, remembering, and outputting, all happening simultaneously at each timestep. Understanding this coordination is key to understanding why LSTMs work so well.

The diagram below shows a single LSTM cell, which gets replicated at each timestep with shared weights. Follow the information flow from left to right: the cell state runs along the top like a highway, while the hidden state flows along the bottom. The gates act as on-ramps and off-ramps, controlling what enters and exits this memory highway.

Out[6]:
Visualization
Detailed LSTM cell diagram with forget gate, input gate, cell state update, and output gate clearly labeled.
Complete LSTM cell architecture showing all components. Information flows from left to right, with the cell state on top and hidden state below. The forget gate (red) removes information, the input gate (green) adds new information, and the output gate (blue) controls the output.

The Four Stages of LSTM Processing

The LSTM cell processes information in four distinct stages, each building on the previous. Think of it as a careful decision-making process: first decide what old information to discard, then decide what new information to add, then actually update the memory, and finally decide what to output. Let's walk through each stage.

Stage 1: The Forget Gate Decides What to Discard

Before we can add new information, we need to make room by discarding irrelevant old information. The forget gate examines the previous hidden state $h_{t-1}$ and current input $x_t$, asking: "Given what I just saw, what should I forget from my long-term memory?"

The forget gate computes:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

where:

  • $f_t$: the forget gate output, a vector with one value per cell state element
  • $W_f$: the forget gate's weight matrix
  • $[h_{t-1}, x_t]$: concatenation of previous hidden state and current input
  • $b_f$: the forget gate's bias vector
  • $\sigma$: the sigmoid function, ensuring outputs fall in $[0, 1]$

The sigmoid activation is crucial here. It produces values between 0 and 1, which act as "retention percentages" for each element of the cell state. A value of 0.1 means "keep only 10% of this information," while 0.95 means "keep almost all of it." The network learns these retention patterns during training.

Stage 2: The Input Gate and Candidate Values

With forgetting handled, we turn to remembering. This stage involves two parallel computations that work together: the input gate decides which cell state elements to update, while a tanh layer creates candidate values that could be added.

The input gate computes:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

Simultaneously, we create candidate values:

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

where:

  • $i_t$: the input gate output, controlling which elements to update
  • $\tilde{C}_t$: candidate values that could be added to the cell state
  • $W_i, W_C$: separate weight matrices for the input gate and candidate computation
  • $b_i, b_C$: corresponding bias vectors

Why use tanh for the candidates? The tanh function outputs values in $[-1, 1]$, allowing the network to both increase and decrease cell state values. A candidate of +0.8 would push the cell state up, while -0.8 would push it down. The input gate then scales these candidates, determining how much of each proposed change to actually apply.
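A two-element sketch, with hand-picked values, shows how the input gate scales candidates in either direction:

```python
import torch

candidate = torch.tensor([0.8, -0.8])    # proposed changes: push up, push down
input_gate = torch.tensor([0.25, 1.0])   # apply 25% of the first, all of the second
print(input_gate * candidate)            # tensor([ 0.2000, -0.8000])
```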

Stage 3: The Cell State Update

Now comes the core of the LSTM: the actual memory update. We combine the forgetting and remembering decisions into a single formula:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

where:

  • $C_t$: the new cell state at timestep $t$
  • $C_{t-1}$: the previous cell state from timestep $t-1$
  • $f_t$: the forget gate output, controlling retention of old information
  • $i_t$: the input gate output, controlling addition of new information
  • $\tilde{C}_t$: the candidate values to potentially add
  • $\odot$: element-wise (Hadamard) multiplication

This formula looks simple but does a lot. The first term, $f_t \odot C_{t-1}$, selectively preserves old information. If $f_t[j] = 0.9$ for some element $j$, we keep 90% of that element's previous value. The second term, $i_t \odot \tilde{C}_t$, selectively adds new information. If $i_t[j] = 0.7$ and $\tilde{C}_t[j] = 0.5$, we add 0.35 to that element.

Notice that this is an additive update, not a multiplicative one like in vanilla RNNs. This additive structure is the key to solving the vanishing gradient problem: gradients can flow backward through the addition operation without being repeatedly multiplied by small values.

Let's visualize this update with concrete numbers. Suppose we have a 4-dimensional cell state and the network has computed specific gate values:

Out[7]:
Visualization
Heatmap showing numerical values for cell state update with forget gate, input gate, candidate values, and resulting new cell state.
Numerical example of the cell state update. The forget gate selectively retains old information (element 2 is mostly forgotten), while the input gate adds new information (elements 0 and 3 receive significant updates). The final cell state combines both contributions.

In this example, notice how element 2 behaves differently from the others. The forget gate value of 0.2 means we discard 80% of the old information in that position, while the input gate value of 0.6 means we add a substantial portion of the candidate value 0.9. The result is that element 2 shifts from 0.5 to 0.64, dominated by the new information rather than the old.
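We can verify element 2's arithmetic directly. Only index 2 below uses the values quoted above; the other three elements are made-up placeholders:

```python
import torch

# Index 2 matches the example in the text; indices 0, 1, and 3 are placeholders.
f_t     = torch.tensor([0.9, 0.8, 0.2, 0.7])   # forget gate
C_prev  = torch.tensor([1.0, -0.5, 0.5, 0.3])  # previous cell state
i_t     = torch.tensor([0.1, 0.2, 0.6, 0.8])   # input gate
C_tilde = torch.tensor([0.4, -0.3, 0.9, 0.5])  # candidate values

C_t = f_t * C_prev + i_t * C_tilde
print(round(C_t[2].item(), 2))  # 0.2 * 0.5 + 0.6 * 0.9 = 0.64
```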

Stage 4: The Output Gate Produces the Hidden State

The cell state now contains our updated long-term memory, but not all of it is relevant to the current timestep's output. The output gate filters the cell state to produce the hidden state:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

where:

  • $o_t$: the output gate, controlling which cell state elements to expose
  • $h_t$: the hidden state output at timestep $t$
  • $\tanh(C_t)$: the cell state squashed to $[-1, 1]$

The tanh applied to the cell state serves two purposes. First, it bounds the values to a reasonable range, preventing any single element from dominating. Second, it centers the values around zero, which tends to help with downstream computations. The output gate then selects which of these bounded values to include in the hidden state.

This hidden state $h_t$ serves dual purposes: it's the output that downstream layers or predictions use, and it's also fed back into the LSTM at the next timestep as $h_{t-1}$. This recurrence is what makes the network "recurrent," allowing information to persist across time.

Putting It All Together

The complete LSTM computation at each timestep can be summarized as:

  1. Forget gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
  2. Input gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
  3. Candidate: $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
  4. Cell update: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
  5. Output gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
  6. Hidden state: $h_t = o_t \odot \tanh(C_t)$

Each gate learns its own weight matrix and bias, giving the network four sets of parameters to learn: $(W_f, b_f)$, $(W_i, b_i)$, $(W_C, b_C)$, and $(W_o, b_o)$. During training, backpropagation adjusts these parameters so that the gates learn to open and close at the right times for the task at hand.
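To tie the six equations together, here is a from-scratch sketch of a single LSTM timestep. The sizes and random weights are assumptions for illustration, and the weight layout is simplified relative to what nn.LSTM uses internally:

```python
import torch


def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM timestep implementing the six equations above."""
    z = torch.cat([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = torch.sigmoid(W_f @ z + b_f)      # 1. forget gate
    i_t = torch.sigmoid(W_i @ z + b_i)      # 2. input gate
    C_tilde = torch.tanh(W_C @ z + b_C)     # 3. candidate values
    C_t = f_t * C_prev + i_t * C_tilde      # 4. cell state update
    o_t = torch.sigmoid(W_o @ z + b_o)      # 5. output gate
    h_t = o_t * torch.tanh(C_t)             # 6. hidden state
    return h_t, C_t


torch.manual_seed(0)
input_size, hidden_size = 3, 4              # assumed sizes
W_f, W_i, W_C, W_o = (torch.randn(hidden_size, hidden_size + input_size) * 0.1 for _ in range(4))
b_f, b_i, b_C, b_o = (torch.zeros(hidden_size) for _ in range(4))

h, C = torch.zeros(hidden_size), torch.zeros(hidden_size)
for x_t in torch.randn(5, input_size):      # a 5-step toy sequence
    h, C = lstm_step(x_t, h, C, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o)
print(h)
print(C)
```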

Information Flow in LSTMs

Understanding how information flows through an LSTM is crucial for building intuition about what these networks can learn. Let's trace a concrete example.

Consider processing the sentence: "The cat, which had been sleeping on the warm windowsill since early morning, finally stretched." The subject "cat" appears at the beginning, but the verb "stretched" doesn't appear until the end. A vanilla RNN would struggle to connect these, but an LSTM can maintain the subject in its cell state throughout.

Out[8]:
Visualization
Timeline showing LSTM processing a sentence with gate activations visualized as colored bars at each timestep.
Information flow through an LSTM processing a sentence with a long-range dependency. The cell state (top) maintains 'cat' as the subject while processing intervening words. Gate activations show when information is stored, maintained, and used. Notice how the forget gate drops at 'cat' (storing the subject), stays high through the intervening clause, and the output gate spikes at 'stretched' (using the stored subject).

When the LSTM encounters "cat," the input gate activates strongly, storing the subject information in the cell state. The forget gate stays relatively open (high values), preserving this information. As the network processes the intervening clause "which had been sleeping on the warm windowsill since early morning," the forget gate remains high, maintaining the subject memory. Other information might be added and removed, but the subject slot stays protected.

When "stretched" arrives, the output gate activates strongly, allowing the stored subject information to influence the prediction. The model can now correctly associate "cat" with "stretched" despite the many intervening words.

LSTMs for Long Sequences

The advantage of LSTMs becomes clear when we compare their performance to vanilla RNNs on sequences of varying lengths. Let's visualize how gradient magnitude changes across sequence length for both architectures.

In[9]:
Code
import numpy as np


def simulate_gradient_flow(seq_length, architecture="vanilla"):
    """
    Simulate gradient magnitude at the first timestep for different sequence lengths.
    This is a simplified model showing the key difference between architectures.
    """
    if architecture == "vanilla":
        # Vanilla RNN: gradient decays exponentially
        # Each timestep multiplies by ~0.9 (typical for tanh derivative * weight)
        decay_factor = 0.9
        gradient = decay_factor**seq_length
    else:
        # LSTM: gradient decays much more slowly due to additive cell state updates
        # The forget gate provides a direct gradient path
        decay_factor = 0.995  # Much closer to 1
        gradient = decay_factor**seq_length

    return gradient


# Test across different sequence lengths
seq_lengths = np.arange(1, 201)
vanilla_gradients = [simulate_gradient_flow(l, "vanilla") for l in seq_lengths]
lstm_gradients = [simulate_gradient_flow(l, "lstm") for l in seq_lengths]
Out[10]:
Visualization
Semi-log plot comparing gradient decay in vanilla RNNs versus LSTMs across sequence lengths up to 200.
Gradient magnitude at the first timestep as a function of sequence length. Vanilla RNNs experience exponential gradient decay (falling below the practical training threshold of 1e-6 by ~50 steps), while LSTMs maintain much more stable gradients through their additive cell state updates. Note the logarithmic scale on the y-axis.

The difference is substantial. Vanilla RNN gradients become negligible after around 50 timesteps, making it impossible to learn dependencies beyond that range. LSTM gradients, while still decaying, remain orders of magnitude larger and stay above practical training thresholds for much longer sequences.

This gradient stability translates directly into learning capability. LSTMs can learn to copy information, detect patterns, and make predictions that depend on inputs hundreds of timesteps in the past. Tasks like language modeling, machine translation, and speech recognition all benefit from this extended memory.

LSTM Memory Capacity

While LSTMs can theoretically maintain information indefinitely, their practical memory capacity is finite. The cell state has a fixed dimensionality, typically between 128 and 1024 units, and each unit can only store a limited amount of information. As sequences grow longer and more complex, the network must decide what to remember and what to forget.

In[11]:
Code
def estimate_memory_capacity(hidden_size, sequence_length, bits_per_unit=2):
    """
    Rough estimate of LSTM memory capacity.

    In practice, each hidden unit can reliably store about 2-3 bits of information.
    Total capacity = hidden_size * bits_per_unit
    Required capacity grows with sequence length and task complexity.
    """
    total_capacity = hidden_size * bits_per_unit
    # Assume each timestep requires storing ~log2(vocab_size) bits for simple memorization
    # For a typical vocab of 50k, that's about 16 bits per token
    bits_per_token = 16
    required_capacity = sequence_length * bits_per_token

    utilization = min(required_capacity / total_capacity, 1.0)
    return total_capacity, required_capacity, utilization
Out[12]:
Visualization
Line plot showing memory utilization curves for LSTMs with 128, 256, 512, and 1024 hidden units.
LSTM memory utilization as a function of sequence length for different hidden sizes. Larger hidden dimensions provide more memory capacity, but all LSTMs eventually saturate. The dashed line indicates 80% utilization, beyond which performance typically degrades.

Several factors affect how efficiently an LSTM uses its memory capacity:

  • Task complexity: Simple pattern matching requires less memory than complex reasoning. Copying a sequence verbatim is easier than understanding and summarizing it.
  • Information redundancy: Natural language has significant redundancy. An LSTM doesn't need to store every word verbatim; it can store compressed representations.
  • Forgetting strategy: A well-trained forget gate learns to discard irrelevant information, freeing up capacity for what matters.
  • Hidden size: Larger hidden dimensions provide more storage but also require more computation and training data.

In practice, standard LSTMs work well for sequences up to a few hundred tokens. Beyond that, you'll need architectural modifications like attention mechanisms (which we'll cover in later chapters) or hierarchical structures that process long documents in chunks.

Comparing LSTM and Vanilla RNN

Theory is valuable, but seeing the difference in practice makes it concrete. Let's design an experiment that directly tests the core claim: LSTMs can remember information across long distances where vanilla RNNs fail.

Designing the Memory Task

We need a task that isolates memory ability from other confounding factors. The simplest such task is a "delayed recall" problem: show the model a signal at some position in a sequence, fill the rest with noise, and ask it to recall what the signal was at the end. The longer the delay between signal and recall, the harder the task becomes for architectures with poor long-term memory.

Concretely, our task works as follows:

  1. Generate a sequence of length 110 timesteps
  2. Place a one-hot signal (one of 8 possible classes) at a specific position
  3. Fill all other positions with random noise
  4. Train the model to predict which class the signal belonged to

If the signal is placed 5 timesteps before the end, the model only needs to remember for 5 steps. If placed 100 timesteps before the end, it must remember across 100 steps of noise. This clean setup lets us measure exactly how memory degrades with distance.

Implementing the Models

First, we define both model architectures using PyTorch. Notice how similar the code is: the only difference is that nn.LSTM returns both a hidden state and a cell state, while nn.RNN only returns a hidden state. This similarity makes the performance difference even more notable.

In[13]:
Code
import torch
import torch.nn as nn


class VanillaRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        output, hidden = self.rnn(x)
        # Use final hidden state for classification
        return self.fc(hidden.squeeze(0))


class LSTMClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        output, (hidden, cell) = self.lstm(x)
        # Use final hidden state for classification
        return self.fc(hidden.squeeze(0))

Data Generation and Training

Next, we implement the data generation and training functions. The signal_position parameter controls how far back the model must remember, which directly determines the task difficulty.

In[14]:
Code
def generate_copy_task_data(
    batch_size, seq_length, signal_position, num_classes=8
):
    """
    Generate data for a simple memory task.

    The input contains a one-hot signal at signal_position, followed by noise.
    The model must remember the signal and output its class at the end.
    """
    # Create input sequences
    x = (
        torch.randn(batch_size, seq_length, num_classes) * 0.1
    )  # Background noise

    # Place signal at specified position
    labels = torch.randint(0, num_classes, (batch_size,))
    for i in range(batch_size):
        x[i, signal_position, :] = 0
        x[i, signal_position, labels[i]] = 1.0

    return x, labels


def train_and_evaluate(model, seq_length, signal_position, epochs=100):
    """Train model and return final accuracy."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        # Training
        model.train()
        x, labels = generate_copy_task_data(32, seq_length, signal_position)
        optimizer.zero_grad()
        outputs = model(x)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Evaluation
    model.eval()
    with torch.no_grad():
        x, labels = generate_copy_task_data(100, seq_length, signal_position)
        outputs = model(x)
        predictions = outputs.argmax(dim=1)
        accuracy = (predictions == labels).float().mean().item()

    return accuracy

Running the Experiment

Now we run the experiment, testing both architectures across memory distances from 5 to 100 timesteps. For each distance, we train a fresh model for 100 epochs and measure its final accuracy. This gives us a direct comparison of how each architecture handles increasing memory demands.

In[15]:
Code
# Compare performance across different memory distances
distances = [5, 10, 20, 30, 50, 75, 100]
seq_length = 110  # Fixed sequence length

rnn_accuracies = []
lstm_accuracies = []

torch.manual_seed(42)

for distance in distances:
    signal_position = (
        seq_length - distance - 1
    )  # Signal placed 'distance' steps before end

    # Train vanilla RNN
    rnn = VanillaRNN(input_size=8, hidden_size=64, output_size=8)
    rnn_acc = train_and_evaluate(rnn, seq_length, signal_position)
    rnn_accuracies.append(rnn_acc)

    # Train LSTM
    lstm = LSTMClassifier(input_size=8, hidden_size=64, output_size=8)
    lstm_acc = train_and_evaluate(lstm, seq_length, signal_position)
    lstm_accuracies.append(lstm_acc)

The results show a clear difference between the two architectures:

Memory task accuracy across different distances. Vanilla RNNs degrade to near-random chance (12.5% for 8 classes) beyond 20-30 timesteps, while LSTMs maintain near-perfect accuracy even at 100 timesteps.

| Memory Distance | Vanilla RNN | LSTM   |
|-----------------|-------------|--------|
| 5               | 100.0%      | 100.0% |
| 10              | 97.0%       | 100.0% |
| 20              | 45.0%       | 100.0% |
| 30              | 18.0%       | 100.0% |
| 50              | 12.0%       | 100.0% |
| 75              | 13.0%       | 99.0%  |
| 100             | 11.0%       | 98.0%  |

At short distances (5-10 timesteps), both models perform reasonably well. But as the memory distance increases, the vanilla RNN's accuracy drops toward random chance, while the LSTM maintains strong performance even at 100 timesteps. This pattern directly reflects the vanishing gradient problem: the vanilla RNN simply cannot propagate learning signals across long distances.

Interpreting the Results

The results show the LSTM's advantage clearly. The vanilla RNN's accuracy drops to near-random chance (12.5% for 8 classes) as the memory distance increases beyond 20-30 timesteps. The LSTM, by contrast, maintains high accuracy even when the signal is 100 timesteps away from the output.

This isn't a subtle difference. It's a fundamental capability gap. The vanilla RNN literally cannot learn to solve this task at long distances, no matter how long you train it. The gradients that would teach it to connect the signal to the output simply vanish before they reach the relevant timesteps.

The LSTM succeeds because of the architectural innovations we've discussed: the cell state provides an unobstructed gradient highway, and the gates learn to protect the signal information from being overwritten by noise. This is exactly the kind of long-range dependency that makes LSTMs so valuable for real-world sequence tasks like language modeling, machine translation, and speech recognition.

Limitations and Impact

The LSTM architecture has limitations worth understanding. Knowing these constraints helps you decide when to use LSTMs and when to consider alternatives.

The most significant practical limitation is computational cost. LSTMs are inherently sequential: you cannot compute the hidden state at timestep $t$ until you've computed the hidden state at timestep $t-1$. This sequential dependency prevents parallelization across timesteps, making LSTMs slow to train on long sequences compared to architectures like Transformers that can process all positions simultaneously. On modern GPUs optimized for parallel computation, this sequential bottleneck becomes increasingly painful as sequence lengths grow.

Memory requirements also scale linearly with sequence length during training. Backpropagation through time requires storing all intermediate hidden states and cell states, which can exhaust GPU memory for very long sequences. Techniques like truncated backpropagation help, but they sacrifice the ability to learn the longest-range dependencies.
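For concreteness, here is a minimal sketch of truncated backpropagation through time. Everything about the setup (the sizes, the random regression target, the chunk length) is an assumption chosen only to show where the hidden and cell states get detached between chunks:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, chunk_len, total_len = 8, 32, 50, 1000

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, 1)
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=1e-3)
criterion = nn.MSELoss()

x_full = torch.randn(1, total_len, input_size)   # one long toy sequence
y_full = torch.randn(1, total_len, 1)            # toy regression targets

state = None
for start in range(0, total_len, chunk_len):
    x = x_full[:, start : start + chunk_len]
    y = y_full[:, start : start + chunk_len]

    optimizer.zero_grad()
    output, (h, c) = lstm(x, state)
    loss = criterion(head(output), y)
    loss.backward()
    optimizer.step()

    # Detach so the next chunk's backward pass stops here rather than
    # reaching all the way back to the start of the sequence.
    state = (h.detach(), c.detach())
```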

Despite these limitations, LSTMs changed how we approach sequence modeling. Before LSTMs, neural networks struggled with any task requiring memory beyond a few timesteps. After LSTMs, machine translation, speech recognition, handwriting recognition, and countless other sequence tasks became tractable. The key insight of creating a protected memory pathway with learned gating has influenced virtually every subsequent sequence architecture.

LSTMs also introduced the concept of gating mechanisms to the broader deep learning community. The idea that neural networks could learn to dynamically control information flow, rather than simply transforming it, opened new research directions. Attention mechanisms, which now dominate NLP, can be seen as a generalization of the gating concept: instead of learning fixed gates, attention learns dynamic, content-dependent gates over the entire sequence.

Summary

This chapter introduced the LSTM architecture and its key innovations for processing sequences with long-range dependencies.

The central insight behind LSTMs is the cell state, an information highway that allows data to flow through time with minimal transformation. Unlike the hidden state in vanilla RNNs, which gets completely overwritten at each timestep, the cell state can preserve information indefinitely through additive updates.

Three learned gates control information flow through the LSTM:

  • The forget gate decides what information to discard from the cell state
  • The input gate decides what new information to store
  • The output gate decides what part of the cell state to expose as output

These gates enable LSTMs to selectively remember, update, and retrieve information based on the input and context. The result is stable gradient flow during training, allowing LSTMs to learn dependencies spanning hundreds of timesteps.

In the next chapter, we'll dive into the mathematical details of LSTM gate equations, deriving each component and implementing an LSTM from scratch. Understanding the precise computations will deepen your intuition for how these networks learn to manage memory.

Key Parameters

When working with LSTMs in PyTorch (nn.LSTM), several parameters significantly impact model behavior:

  • hidden_size: The dimensionality of the hidden state and cell state vectors. Larger values (256-1024) provide more memory capacity but increase computation. Start with 128-256 for most tasks.
  • num_layers: Number of stacked LSTM layers. Deeper networks (2-4 layers) can learn more complex patterns but are harder to train. Single-layer LSTMs often suffice for many sequence tasks.
  • batch_first: When True, input tensors have shape (batch, seq_len, features). When False (default), shape is (seq_len, batch, features). Using batch_first=True aligns with common data loading patterns.
  • dropout: Dropout probability applied between LSTM layers (only active when num_layers > 1). Values of 0.1-0.3 help prevent overfitting on longer sequences.
  • bidirectional: When True, runs the LSTM in both forward and backward directions, doubling the hidden state size. Useful for tasks where future context matters (e.g., classification), but not applicable for generation tasks.
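A configuration sketch tying these parameters together (the specific values are illustrative, not recommendations for any particular task):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(
    input_size=128,      # feature dimension of each timestep (e.g., embedding size)
    hidden_size=256,     # dimensionality of h_t and C_t
    num_layers=2,        # stacked LSTM layers
    batch_first=True,    # inputs shaped (batch, seq_len, features)
    dropout=0.2,         # applied between layers, active because num_layers > 1
    bidirectional=False,
)

x = torch.randn(16, 100, 128)      # (batch, seq_len, features)
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([16, 100, 256]): hidden state at every timestep
print(h_n.shape)     # torch.Size([2, 16, 256]): final hidden state for each layer
```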
