A comprehensive guide to residual connections, the architectural innovation that solved the vanishing gradient problem in deep networks. Learn how skip connections enabled training of networks with 100+ layers and became fundamental to modern language models and transformers.

This article is part of the free-to-read History of Language AI book
2015: Residual Connections
By 2015, deep convolutional neural networks had demonstrated remarkable success in computer vision tasks. Networks like VGG and GoogLeNet had achieved impressive results on ImageNet, showing that depth was crucial for learning complex visual representations. However, researchers were encountering a counterintuitive problem: adding more layers to networks wasn't always beneficial, and in many cases, deeper networks performed worse than their shallower counterparts. This observation challenged the intuition that more depth should always enable more sophisticated feature learning.
Researchers at Microsoft Research Asia, led by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, investigated this phenomenon and discovered that the problem wasn't that deeper networks couldn't learn useful representations, but rather that training them had become increasingly difficult. The vanishing gradient problem, first identified in the context of recurrent networks, was also affecting very deep convolutional networks. As networks grew deeper, gradients would diminish as they propagated backward through many layers, making it difficult or impossible for early layers to learn effectively.
The solution they developed, residual connections, would prove to be one of the most important architectural innovations in deep learning. The idea was elegantly simple: instead of forcing layers to learn the complete transformation from input to output, allow them to learn the residual, or difference, between input and output. This approach enables networks to learn identity mappings when necessary, effectively allowing the network to skip layers that don't contribute to the task. The resulting architecture, ResNet, could be trained with networks over 100 layers deep, achieving dramatic improvements in accuracy while actually being easier to train than much shallower networks.
The impact of residual connections extended far beyond computer vision. The technique became fundamental to training deep networks across domains, from natural language processing to reinforcement learning. When transformer architectures emerged in 2017, they incorporated residual connections extensively throughout their structure. Modern language models rely on residual connections to enable training of networks with dozens or hundreds of layers. The technique has become so standard that it's now considered an essential component of nearly any deep neural network architecture.
The Problem
The problem of training very deep networks had been recognized for years, but its severity became clear as researchers attempted to scale convolutional networks to greater depths. Early deep networks like AlexNet had eight layers, and by 2014, networks like VGG had grown to 19 layers and GoogLeNet used a complex architecture with 22 layers. Researchers attempted to go even deeper, expecting that additional depth would capture more abstract and sophisticated features. However, experiments revealed a surprising and frustrating result: adding more layers often led to higher training error, not lower.
This observation was puzzling because in theory, a deeper network should always be able to match or exceed the performance of a shallower network. A deeper network could simply set its additional layers to perform identity transformations, copying inputs to outputs, and then continue learning from there. If a network with $n$ layers could achieve a certain error rate, a network with $n + k$ layers should be able to achieve at least that same error rate by learning the identity function in the extra $k$ layers and then performing the same computation as the original $n$-layer network. Yet in practice, this wasn't happening.
The root cause was the vanishing gradient problem. When training deep networks with backpropagation, gradients must flow from the output layer back to the input layer. At each layer, the gradient is multiplied by that layer's weight matrix and by the derivative of its activation function. If the weights are small or the activations sit in saturation regions, these repeated multiplications cause gradients to shrink exponentially as they propagate backward. By the time they reach the early layers, the gradients may be so small that they provide no useful learning signal, effectively freezing those layers early in training.
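To make the scale of this effect concrete, here is a small illustrative sketch (a toy Python example of my own, not from the original work) that multiplies a gradient by a hypothetical per-layer shrinkage factor across a 50-layer stack:

```python
# Illustrative sketch: how repeated multiplication by per-layer factors below 1
# collapses the gradient signal in a deep plain network. The factors are
# hypothetical stand-ins for the effect of small weights or saturated activations.
import numpy as np

rng = np.random.default_rng(0)
num_layers = 50
grad = 1.0
for _ in range(num_layers):
    grad *= rng.uniform(0.5, 0.9)  # hypothetical per-layer shrinkage

print(f"Gradient magnitude after {num_layers} layers: {grad:.2e}")
# Typically on the order of 1e-8 or smaller: far too weak to update the earliest layers.
```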
Batch normalization, introduced in 2015, helped address this problem by normalizing activations and providing more stable gradients. However, even with batch normalization, training networks with more than about 20 layers remained challenging. The gradients were still decaying, and the optimization landscape for very deep networks appeared to have many local minima and flat regions that made convergence difficult. Researchers needed a more fundamental solution that would change how information flowed through the network.
Another related problem was the difficulty of learning identity mappings. In theory, a network should be able to learn to pass inputs through unchanged, but in practice, this proved difficult. Each layer applies a nonlinear transformation, and even if the goal is to preserve the input, the network must learn to counteract its own nonlinearities to approximate the identity function. For deep networks, this requires coordinating many layers to collectively produce an identity mapping, which is a complex optimization problem that can be difficult to solve.
The computational cost of training deep networks was also significant. When training failed, it often failed after many hours or days of computation, wasting computational resources. Researchers needed a way to ensure that deeper networks would train successfully and converge to good solutions, not just sometimes but reliably. The field needed an architectural innovation that would make depth beneficial rather than problematic.
The Solution
Residual connections solve these problems by providing a direct pathway for information to flow through the network. Instead of requiring each layer to learn the complete transformation from input to output, residual connections allow layers to learn the residual, or difference, between input and desired output. This architectural pattern makes it trivial for the network to learn identity mappings: if the residual is zero, the output equals the input.
The core idea is captured in the residual block. A traditional layer computes $y = F(x)$, where $F(x)$ represents the transformation learned by the layer. A residual block computes $y = F(x) + x$, adding the input directly to the output of the transformation. This simple addition creates a shortcut connection that bypasses the layer, allowing information to flow directly from input to output if the learned transformation is close to zero.
The mathematical formulation is straightforward. Given an input $x$ and a learned transformation $F(x)$ represented by one or more layers, the output of a residual block is:

$$y = F(x) + x$$

This formulation assumes that the input $x$ and the output of $F(x)$ have the same dimensions. When dimensions don't match, as they might when changing the number of channels in a convolutional network or when downsampling, a linear projection $W_s$ can be applied to the shortcut connection:

$$y = F(x) + W_s x$$

The projection matrix $W_s$ is learned during training, but it's typically implemented as a simple $1 \times 1$ convolutional layer that matches dimensions while preserving spatial information when necessary.
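As a concrete illustration, the sketch below (assuming PyTorch; the function name residual_output and the layer sizes are my own, not from the ResNet code) computes $y = F(x) + x$, falling back to $y = F(x) + W_s x$ when a projection is needed:

```python
# Minimal sketch of the residual computation. F is any learned transformation;
# W_s is a learned projection used only when input and output shapes differ.
import torch
import torch.nn as nn

def residual_output(x, F, W_s=None):
    """Return y = F(x) + x, or y = F(x) + W_s(x) when dimensions differ."""
    shortcut = x if W_s is None else W_s(x)
    return F(x) + shortcut

x = torch.randn(4, 16)

# Matching dimensions: the shortcut is the identity.
F_same = nn.Linear(16, 16)
y = residual_output(x, F_same)               # y = F(x) + x

# Dimension change: a learned linear projection stands in for W_s.
F_wide = nn.Linear(16, 32)
W_s = nn.Linear(16, 32, bias=False)
y_wide = residual_output(x, F_wide, W_s)     # y = F(x) + W_s x
print(y.shape, y_wide.shape)                 # torch.Size([4, 16]) torch.Size([4, 32])
```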
The power of this approach becomes clear when considering what the network needs to learn. In a traditional deep network, a layer must learn the complete transformation from its input to a desired output. In a network with residual connections, a layer only needs to learn the difference, or residual, between input and output. If the desired output is very similar to the input, the layer can learn to produce a small residual, making the learning task easier. If the desired transformation is significant, the layer can learn a large residual. In the extreme case where the identity mapping is optimal, the layer can learn to produce zero, and the shortcut connection handles the identity mapping automatically.
This makes training much more stable. Even if a layer's learned transformation is initially poor or produces gradients that are too small, the shortcut connection ensures that the input still reaches the output with a strong gradient signal. The network can learn incrementally: early in training, the shortcut connections dominate, providing a baseline performance. As training progresses, the learned transformations can refine this baseline, adding nuanced adjustments that improve performance.
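One way to see why the shortcut preserves the gradient signal is to differentiate the residual form directly; the following is a standard back-of-the-envelope argument in the notation used above, not a derivation from the original paper:

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F(x)}{\partial x}\right) = \frac{\partial \mathcal{L}}{\partial y} + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F(x)}{\partial x}$$

The first term carries the upstream gradient to earlier layers unchanged, no matter how small the second term becomes, which is exactly why even a poorly trained residual block does not block the learning signal.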
The ResNet architecture applies this residual block pattern throughout a deep convolutional network. The network is organized into stages, each containing multiple residual blocks. Within each block, typically two or three convolutional layers are applied, each followed by batch normalization, with ReLU activations in between. The input to the block is added to the output of these transformations, and a final ReLU is applied after the addition. This pattern repeats throughout the network, creating a deep architecture where each stage can incrementally refine features while maintaining gradient flow.
The architecture also incorporates downsampling layers that reduce spatial dimensions and increase the number of channels as the network processes higher-level features. At these transition points, the shortcut connections use $1 \times 1$ convolutions with stride 2 to match dimensions. This allows the network to progressively extract features at multiple scales while maintaining the benefits of residual connections throughout.
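Putting these pieces together, a ResNet-style basic block might look roughly like the following sketch (assuming PyTorch; ResidualBlock and its exact layer choices are an illustrative reconstruction of the pattern described above, not the reference implementation):

```python
# Sketch of a ResNet-style basic block: two 3x3 convolutions as F(x), a shortcut
# that is either the identity or a strided 1x1 projection, and a ReLU after the addition.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # F(x): two 3x3 convolutions, each followed by batch normalization,
        # with a ReLU between them.
        self.f = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # Shortcut: identity when shapes match, otherwise a 1x1 projection
        # (the W_s above) with the same stride to match dimensions.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # y = F(x) + shortcut(x), with the final ReLU applied after the addition.
        return self.relu(self.f(x) + self.shortcut(x))

# Example: a downsampling block that halves spatial resolution and doubles channels.
block = ResidualBlock(64, 128, stride=2)
out = block(torch.randn(1, 64, 56, 56))
print(out.shape)  # torch.Size([1, 128, 28, 28])
```

Stacking such blocks, with a downsampling block at the start of each stage, yields the staged architecture described above.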
Applications and Impact
Residual connections achieved immediate and dramatic success in computer vision. ResNet won the ImageNet classification challenge in 2015, achieving a top-5 error rate of 3.57%, a significant improvement over previous approaches. More importantly, ResNet demonstrated that networks with 50, 101, or even 152 layers could be trained effectively, something that had seemed impossible just months earlier. The deeper networks consistently outperformed their shallower counterparts, finally validating the intuition that depth should be beneficial.
The technique quickly became standard in computer vision research and applications. Beyond classification, residual connections proved valuable for object detection, semantic segmentation, and other vision tasks. Networks like ResNeXt extended the ResNet architecture with grouped convolutions, and DenseNet incorporated dense connections inspired by residual connections. The pattern of using skip connections became so common that it's now unusual to see a deep convolutional network without them.
Perhaps more significant than the computer vision applications was the technique's adoption in other domains. When researchers began applying deep learning to natural language processing, they found that residual connections were equally valuable for training deep networks on text. Sequence-to-sequence models for machine translation incorporated residual connections to enable deeper encoder and decoder networks. Language models with many layers used residual connections to maintain gradient flow through long computation graphs.
The technique became absolutely fundamental with the introduction of transformer architectures in 2017. The transformer architecture uses residual connections extensively: each sub-layer (multi-head self-attention and position-wise feed-forward networks) is wrapped with a residual connection, and layer normalization is applied after the residual addition. This design, the "Add & Norm" step of the original transformer, has become the standard pattern for building deep transformer networks.
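A rough sketch of this pattern (assuming PyTorch; the class name AddNorm and the layer sizes are illustrative assumptions, not the transformer reference code) might look like:

```python
# Sketch of the post-layer-norm residual pattern around transformer sub-layers:
# output = LayerNorm(x + SubLayer(x)), applied once for self-attention and once
# for the feed-forward network.
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        # Residual addition around the sub-layer, then layer normalization
        # (the post-LN arrangement of the original transformer).
        return self.norm(x + sublayer(x))

d_model = 512
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
add_norm_attn, add_norm_ffn = AddNorm(d_model), AddNorm(d_model)

x = torch.randn(2, 10, d_model)                                       # (batch, seq, features)
x = add_norm_attn(x, lambda t: attn(t, t, t, need_weights=False)[0])  # attention sub-layer
x = add_norm_ffn(x, ffn)                                              # feed-forward sub-layer
print(x.shape)  # torch.Size([2, 10, 512])
```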
Every major language model released since 2017 uses residual connections throughout. GPT models, BERT, T5, and countless other architectures rely on residual connections to enable training of networks with dozens of layers. The technique has been essential for scaling to very large models: GPT-3 with 96 layers, PaLM with 118 layers, and other massive models would be impossible to train without residual connections. The gradient flow they provide is crucial for these extremely deep networks.
Residual connections have also been adopted in other domains. Reinforcement learning agents use residual connections in their policy and value networks. Generative models like generative adversarial networks and variational autoencoders incorporate residual connections. Graph neural networks use skip connections inspired by residual connections to propagate information across graph structures. The pattern has proven broadly applicable wherever deep networks need to be trained.
The impact extends to practical applications as well. Computer vision systems in autonomous vehicles, medical imaging, and manufacturing rely on ResNet-based architectures. Natural language processing systems for translation, summarization, and question answering use transformer models built on residual connections. The technique has enabled applications that simply weren't possible with shallower networks, providing the depth needed to capture complex patterns in data.
Limitations
While residual connections solved critical problems with training deep networks, they also introduced some limitations and considerations. One issue is that residual connections can create a tendency for networks to rely too heavily on the shortcut path, potentially underutilizing the learned transformations. If the shortcut connection always provides a strong signal, the learned layers might not develop as sophisticated representations as they could in a network without shortcuts. However, in practice, this hasn't proven to be a major problem, as the learned transformations do contribute significantly to performance.
The addition operation in residual connections requires that inputs and outputs have compatible dimensions. While this is straightforward when dimensions match, handling dimension mismatches requires additional layers (the projection matrices $W_s$), which adds parameters and computational cost. In some architectures, particularly those with frequent dimension changes, managing these projections can become cumbersome. However, the benefits of residual connections typically outweigh these costs.
Another consideration is that residual connections increase memory usage during training. Because the input must be stored to be added to the output, residual blocks require more memory than traditional layers. This can be a constraint when training very large models or when working with limited computational resources. The memory overhead is usually manageable but becomes more significant as model size increases.
Some research has questioned whether residual connections are always necessary or optimal. In certain settings, particularly with careful initialization and normalization, very deep networks can sometimes be trained without residual connections. However, these cases are exceptions, and residual connections remain the standard approach for training deep networks reliably. The technique provides such significant benefits for training stability and convergence that it's generally worth the small overhead.
Residual connections also don't solve all problems with deep networks. Issues like overfitting, catastrophic forgetting in continual learning scenarios, and computational efficiency remain challenges that residual connections don't directly address. Networks with residual connections can still overfit if not properly regularized, and they still require significant computational resources for training and inference.
The technique works best when the transformations being learned are incremental refinements rather than complete reworkings of the representation. In cases where each layer needs to make substantial changes to the representation, the shortcut connection might interfere with learning. However, for most practical applications, the incremental refinement that residual connections enable is exactly what's needed for effective feature learning.
Legacy and Looking Forward
Residual connections have become one of the most fundamental architectural patterns in deep learning. The technique is now so standard that it's rare to see a deep network architecture that doesn't incorporate some form of skip connection. The innovation demonstrated that architectural choices can fundamentally change the optimization landscape, making previously intractable problems solvable through thoughtful design rather than just algorithmic improvements.
The impact on the field has been profound. Residual connections enabled the training of networks with hundreds of layers, unlocking capabilities that were previously impossible. This depth has been crucial for the success of modern language models, computer vision systems, and other AI applications. The technique has become so integrated into standard practice that it's often taken for granted, but its contribution to the field's progress cannot be overstated.
Modern research continues to explore variations and improvements on residual connections. DenseNet introduced dense connections where each layer receives inputs from all previous layers, creating an even richer connectivity pattern. Highway networks explored learnable gating mechanisms for skip connections. Some architectures use residual connections in more sophisticated ways, such as applying them across multiple scales or incorporating them into attention mechanisms.
The relationship between residual connections and other techniques like batch normalization and layer normalization has also been extensively studied. The combination of residual connections with normalization techniques has proven particularly effective, with each technique addressing different aspects of training deep networks. Modern architectures typically use both residual connections and normalization, recognizing that they complement each other rather than competing.
Looking forward, residual connections remain essential for training very deep networks. As models continue to grow in size and complexity, the gradient flow that residual connections provide becomes even more critical. The technique has enabled the development of models with trillions of parameters, and it will likely continue to be fundamental as models scale even further.
Research also continues into understanding why residual connections work so well. Some theoretical work suggests that residual connections create smoother optimization landscapes or enable better gradient flow, but a complete understanding of their benefits remains an active area of investigation. This research may lead to even more effective architectural patterns in the future.
The introduction of residual connections in 2015 represents a pivotal moment in deep learning history. By solving the problem of training very deep networks, this work unlocked a new regime of model capabilities. The technique's simplicity and effectiveness have made it a cornerstone of modern neural network design, and its influence continues to shape how deep networks are built and trained across all domains of AI.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.