Backpropagation - Training Deep Neural Networks


Michael Brenndoerfer · October 1, 2025 · 20 min read · 4,978 words

In the 1980s, neural networks hit a wall—nobody knew how to train deep models. That changed when Rumelhart, Hinton, and Williams introduced backpropagation in 1986. Their clever use of the chain rule finally let researchers figure out which parts of a network deserved credit or blame, making deep learning work in practice. Thanks to this breakthrough, we now have everything from word embeddings to powerful language models like transformers.

1986: Backpropagation

In the mid-1980s, neural networks were stuck in a frustrating paradox. Multi-layer models looked promising on paper, offering the theoretical capacity to learn complex patterns that single-layer networks couldn't handle. Researchers could build networks with many hidden layers, architecting elaborate structures that should, in principle, be capable of sophisticated computation. But there was a critical problem: no one knew how to train them effectively. These deeper architectures simply wouldn't learn, no matter how carefully they were constructed or how much data they were fed.

The field was at an impasse. Single-layer perceptrons had clear limitations, unable to solve even simple problems like XOR. Everyone knew that deeper networks were the answer, but without a way to train them, that knowledge was useless. It was like having the blueprints for a revolutionary machine but no instructions for how to actually build it.

Then, in 1986, three researchers changed everything. David Rumelhart, Geoffrey Hinton, and Ronald Williams published their landmark paper "Learning representations by back-propagating errors," which popularized backpropagation and demonstrated its power for training multi-layer neural networks. While the mathematical foundations had been developed earlier by researchers like Seppo Linnainmaa in 1970 and Paul Werbos in the 1970s and early 1980s, it was the 1986 paper that brought the algorithm to widespread attention and showed its practical effectiveness. This wasn't just an incremental improvement or a minor technical refinement. It was the moment when deep learning became practical. Backpropagation didn't just solve a technical problem, it laid the foundation for every language AI system we use today, from the autocomplete on your phone to the most sophisticated large language models.

The Credit Assignment Problem

The puzzle researchers faced seemed deceptively simple on the surface: how do you know which connections in a network are responsible for errors in the output? Consider a concrete example. Imagine you've built a neural network to predict whether movie reviews express positive or negative sentiment. You feed it the clearly positive review "I love this movie" and it confidently predicts negative sentiment. Something has obviously gone wrong, but where exactly? Which of the thousands of connections between neurons should you adjust? By how much? In which direction? And how can you be sure your adjustments will actually improve things rather than making them worse?

This fundamental challenge became known as the credit assignment problem. The task was figuring out which connections in the network deserve "credit" or "blame" for the network's performance. It's analogous to trying to identify which player on a football team is responsible for a loss. The outcome depends on hundreds of interconnected decisions and actions, and it's not always obvious which ones mattered most.

For single-layer networks, the perceptron learning rule could handle this elegantly enough. When an error occurred, you knew exactly which weights to adjust because there was only one layer of connections between input and output. But multi-layer networks remained largely untrainable. There was no efficient way to send error signals backward through multiple layers to fix the weights that actually caused the problem. The deeper the network, the more opaque the relationship between any particular weight and the final output. Researchers had tried various approaches, including methods that essentially treated each layer independently or used random search strategies, but none proved both efficient and effective enough to make deep networks practical.

What is Backpropagation?

Backpropagation solved the credit assignment problem with a surprisingly elegant idea: work backward through the network, calculating exactly how much each weight contributed to the error. The insight was that you could trace the error signal backward from the output layer through each hidden layer, decomposing the total error into contributions from individual connections. At each layer, you could determine precisely how changing a particular weight would affect the final output.

Technically speaking, backpropagation is an algorithm that efficiently computes the gradients of the loss function with respect to each weight in a neural network. The loss function measures how wrong the network's predictions are, and the gradient tells you how to adjust each weight to reduce that wrongness. In more practical terms, backpropagation figures out which weights to adjust, by how much, and in which direction to reduce errors. It transforms the opaque, seemingly intractable problem of training deep networks into a systematic, computational process that a computer can execute efficiently.


The mathematical trick that makes this possible is the chain rule from calculus, a principle you might remember from introductory calculus courses. The chain rule provides a way to decompose complicated derivatives into simpler pieces that can be computed step by step. In the context of neural networks, it lets you efficiently compute how each weight, no matter how deep in the network, affects the final output. You start with the error at the output layer and systematically work backward, using the chain rule to propagate that error signal through each layer. What might seem like it would require exponentially many calculations can actually be done in linear time, making deep networks practical to train. That computational efficiency was the insight that unlocked deep learning.
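
To see the chain rule in action at the smallest possible scale, here is a brief sketch in Python (with NumPy) of a single sigmoid neuron trained with squared-error loss. The input, target, and weight values are arbitrary, chosen only for illustration, and the analytically computed gradient is compared against a finite-difference approximation to show that the decomposition really does give the right answer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary toy values for a single neuron: loss L = (sigmoid(w * x) - y)^2.
x, y, w = 2.0, 1.0, 0.5

# Forward pass: compute each intermediate quantity.
z = w * x            # weighted input
a = sigmoid(z)       # activation
L = (a - y) ** 2     # squared-error loss

# Backward pass: multiply three simple local derivatives (the chain rule).
dL_da = 2 * (a - y)      # how the loss changes with the activation
da_dz = a * (1 - a)      # how the activation changes with the weighted input
dz_dw = x                # how the weighted input changes with the weight
dL_dw = dL_da * da_dz * dz_dw

# Check against a finite-difference approximation of the same derivative.
eps = 1e-6
L_nudged = (sigmoid((w + eps) * x) - y) ** 2
print(dL_dw, (L_nudged - L) / eps)   # the two values agree closely
```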

How Backpropagation Works

Let's make this concrete by walking through a complete example. We'll examine how backpropagation trains a sentiment analysis system using a simple three-layer neural network. This example will illustrate both the forward pass, where the network makes a prediction, and the backward pass, where it learns from its mistakes.

Forward Pass

Say we feed the sentence "I love this movie" into our network. The network processes this input through three stages, transforming the words into a final prediction.

First, the input layer converts the sentence into numerical form. Each word gets represented as a word embedding, a vector of numbers that captures something about the word's meaning. The words "I", "love", "this", and "movie" each become vectors of perhaps 50 or 100 numbers. These vectors might encode semantic properties like whether a word is positive or negative, whether it relates to emotions, whether it's a noun or verb, and so on.

Next, the hidden layer takes these numerical representations and computes weighted combinations of them. Each neuron in this layer combines the input values using learned weights, adds them up, and then applies an activation function. The activation function introduces non-linearity, allowing the network to capture complex patterns that simple linear combinations couldn't represent. You might think of this layer as detecting higher-level features in the text, like whether emotional language is present or whether the sentence structure suggests a positive or negative tone.

Finally, the output layer produces a single number between 0 and 1, which we interpret as a probability. A value near 0 means "definitely negative sentiment," while a value near 1 means "definitely positive sentiment." For our sentence "I love this movie," we'd hope the network outputs something close to 1. But imagine it's still early in training, and the weights aren't well tuned yet. The network might output 0.3, suggesting it thinks this positive review is actually negative.
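
The sketch below shows what such a forward pass might look like in code. It is only a toy under stated assumptions: the word embeddings are random numbers standing in for learned vectors, the four embeddings are averaged into a single input vector, the layer sizes are arbitrary, and sigmoid is just one possible activation function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy stand-ins for the embeddings of "I", "love", "this", "movie" (4 words, 5 dims each).
# Real embeddings would be learned; these random numbers are placeholders.
embeddings = rng.normal(size=(4, 5))
x = embeddings.mean(axis=0)          # input layer: pool the word vectors into one feature vector

# Hidden layer: weighted combinations of the inputs followed by a non-linearity.
W1 = rng.normal(size=(5, 3)) * 0.5
b1 = np.zeros(3)
h = sigmoid(x @ W1 + b1)

# Output layer: a single number between 0 and 1, read as P(positive sentiment).
W2 = rng.normal(size=(3, 1)) * 0.5
b2 = np.zeros(1)
y_hat = sigmoid(h @ W2 + b2)
print(y_hat.item())                  # early in training, this can be far from the correct 1.0
```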

Backward Pass

The network predicted 0.3, but the review is actually positive, which we represent as 1.0. Something has clearly gone wrong, and this is where backpropagation does its work. The algorithm proceeds through three systematic steps to fix the problem.

First, we compute the error using a loss function. A common choice is squared error, which in this case gives us Loss = (1.0 - 0.3)² = 0.49. This is a substantial error. The loss function quantifies just how wrong the prediction was, giving us a single number that represents the network's failure. Different loss functions emphasize different aspects of being wrong. Squared error, for example, penalizes large errors much more heavily than small ones.

Next, we calculate gradients by working backward through the network. This is the heart of backpropagation and where the chain rule earns its keep. Starting from the output layer, we ask: how much did the output neuron's weights contribute to this error? We can calculate this directly because we know the output and the loss. Then we move to the hidden layer and ask: how much did each of these neurons' weights contribute to the output layer's error? Using the chain rule, we can decompose the overall error into contributions from each layer, then from each neuron, and finally from each individual weight. The result is a gradient for every single weight in the network, telling us exactly how much that particular weight contributed to the prediction being wrong.

Finally, we update the weights. For each weight, we nudge it in the direction that reduces the error. The size of the nudge is proportional to the gradient, so weights that contributed more to the error get adjusted more. We also multiply by a learning rate, a small number that controls how aggressive our updates are. After this update, if we were to run the same input through the network again, it would predict something slightly closer to 1.0, perhaps 0.35 instead of 0.3. Repeat this process thousands or millions of times with different examples, and the network gradually learns to make accurate predictions.
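
Here is a minimal end-to-end sketch of those three steps for a toy two-layer sigmoid network. The input vector, layer sizes, number of steps, and learning rate are all arbitrary placeholders; the point is only to show the error being measured, the gradients flowing backward layer by layer via the chain rule, and each weight being nudged against its gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

x = rng.normal(size=5)                         # toy input features for one training example
W1, b1 = rng.normal(size=(5, 3)) * 0.5, np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)) * 0.5, np.zeros(1)
y = 1.0                                        # the review is positive
lr = 0.5                                       # learning rate (an arbitrary choice here)

for step in range(3):
    # Forward pass.
    h = sigmoid(x @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    loss = (y_hat - y) ** 2

    # Backward pass: chain rule applied layer by layer, output first.
    d_yhat = 2 * (y_hat - y)                   # dL/da at the output
    d_z2 = d_yhat * y_hat * (1 - y_hat)        # through the output sigmoid
    dW2, db2 = np.outer(h, d_z2), d_z2
    d_h = d_z2 @ W2.T                          # error signal sent back to the hidden layer
    d_z1 = d_h * h * (1 - h)                   # through the hidden sigmoid
    dW1, db1 = np.outer(x, d_z1), d_z1

    # Update: nudge every weight against its gradient.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
    print(step, loss.item(), y_hat.item())     # the prediction moves toward 1.0 over the steps
```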

The Math Behind It

For those interested in the mathematical details, let's look at what's happening under the hood. If you prefer to skip the equations and move on to the broader implications, feel free to jump to the next section. The high-level understanding you've gained so far is sufficient for following the rest of the article.


At its core, backpropagation uses the chain rule from calculus to efficiently compute how each weight affects the final loss. For any weight w_{ij} connecting neuron i to neuron j, we need to calculate the gradient:

\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ij}}

Here, L is the loss function, a_j is the activation of neuron j (the output after applying the activation function), and z_j is the weighted input to neuron j (before the activation function). This formula breaks down the complex question "how does this weight affect the loss?" into three simpler questions we can answer sequentially: How does the loss change with the activation? How does the activation change with the weighted input? And how does the weighted input change with the weight? Each of these partial derivatives is straightforward to compute, and multiplying them together gives us the gradient we need.

For the output layer, calculating how the loss changes with the activation is relatively straightforward. If we're using squared error loss, the derivative is:

\frac{\partial L}{\partial a_j} = 2(a_j - y_j)

This simply tells us that the rate of change of the loss is proportional to how far off our prediction a_j is from the true value y_j. The further off we are, the steeper the gradient and the larger the correction we'll make.

For hidden layers, things get more interesting. We use the chain rule to propagate gradients backward from the output. The gradient for a hidden layer activation is:

\frac{\partial L}{\partial a_i} = \sum_j \frac{\partial L}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} \cdot w_{ij}

This equation expresses a key insight: the gradient for a hidden neuron depends on the gradients of all the neurons it connects to in the next layer, weighted by the connection strengths. This is what allows error signals to flow backward through the network. Each layer's gradients are computed using the gradients from the layer ahead of it, creating a chain of computations that efficiently calculates all the gradients in a single backward pass through the network.
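
As a sanity check on these two formulas, the sketch below computes the gradient for one hidden-layer weight of a toy network using exactly this backward decomposition, then compares it to a finite-difference estimate obtained by perturbing that weight directly. The network sizes, inputs, and target are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=5)
W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(3, 1))
y = 1.0

def loss(W1, W2):
    h = sigmoid(x @ W1)
    return ((sigmoid(h @ W2) - y) ** 2).item()

# Backpropagated gradient for the single weight W1[0, 0].
h = sigmoid(x @ W1)
y_hat = sigmoid(h @ W2)
d_z2 = 2 * (y_hat - y) * y_hat * (1 - y_hat)   # output-layer term
d_a1 = d_z2 @ W2.T                             # hidden gradient: sum over next-layer neurons, weighted by w_ij
d_z1 = d_a1 * h * (1 - h)
backprop_grad = x[0] * d_z1[0]                 # dL/dW1[0, 0]

# Finite-difference check: perturb the same weight and measure the change in loss.
eps = 1e-6
W1_nudged = W1.copy()
W1_nudged[0, 0] += eps
numerical_grad = (loss(W1_nudged, W2) - loss(W1, W2)) / eps
print(backprop_grad, numerical_grad)           # the two estimates agree closely
```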

Weight Update Rule

Once we have computed all the gradients, actually updating the weights is straightforward. We use gradient descent, one of the fundamental optimization algorithms in machine learning:

w_{ij}^{new} = w_{ij}^{old} - \alpha \frac{\partial L}{\partial w_{ij}}

The term \alpha is the learning rate, a hyperparameter that controls how big of a step we take in the direction indicated by the gradient. This parameter is crucial to successful training. Set it too large and the network makes wild adjustments that can cause training to diverge, with the loss bouncing around or even increasing over time. Set it too small and the network learns painfully slowly, requiring far more training examples and computation time to reach good performance. In practice, researchers often start with a moderate learning rate and decrease it over time, allowing the network to make large adjustments early in training when it's far from a good solution, and smaller, more refined adjustments later when it's getting close.

The negative sign in the update rule is critical. We subtract the gradient because the gradient points in the direction of increasing loss. We want to move in the opposite direction, toward decreasing loss. This is why it's called gradient descent: we're descending down the loss landscape, seeking the lowest point we can find.
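
The effect of the learning rate is easiest to see on a one-dimensional toy problem. The sketch below applies the update rule to the loss L(w) = (w - 3)^2, whose gradient is 2(w - 3); the three values of \alpha are arbitrary, chosen to show a rate that is too small, one that works well, and one large enough to diverge.

```python
def descend(alpha, steps=20, w=0.0):
    """Run plain gradient descent on the toy loss L(w) = (w - 3)^2."""
    for _ in range(steps):
        grad = 2 * (w - 3)        # dL/dw
        w = w - alpha * grad      # the update rule: step against the gradient
    return w

print(descend(alpha=0.01))   # too small: after 20 steps, w is still far from the minimum at 3
print(descend(alpha=0.4))    # moderate: w lands essentially on the minimum
print(descend(alpha=1.1))    # too large: each step overshoots and w diverges
```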

What Backpropagation Made Possible

With backpropagation, neural network research exploded. The algorithm transformed a field that had been stagnant for decades into one of the most active areas of artificial intelligence research. Suddenly, researchers could do things that had seemed fundamentally impossible just years before. The impact was immediate and profound, touching every aspect of how neural networks were designed, trained, and applied.

Multi-Layer Networks

For the first time, researchers could reliably train networks with many hidden layers. Deep architectures actually worked, consistently and reproducibly. This unlocked complex pattern recognition capabilities that shallow networks simply couldn't achieve. The key insight was that networks learned hierarchical representations, building up understanding in stages. Early layers detected simple, low-level features, while deeper layers combined these into increasingly sophisticated concepts.

Consider a computer vision network analyzing images. The first layer might detect basic edges and corners, responding to simple contrasts between light and dark pixels. The second layer could combine these edges into more complex shapes like circles, rectangles, or curves. The third layer might recognize parts of objects, like wheels, windows, or faces. Finally, the output layer would identify complete objects, like cars, houses, or people. Each layer builds on the representations learned by the previous layer, creating a hierarchy of increasingly abstract and powerful features.

What made this truly revolutionary was that networks discovered these features automatically. Researchers didn't need to hand-craft feature detectors for every new task, laboriously encoding their intuitions about what patterns might be important. The network figured out what to look for on its own, often discovering features that human designers would never have thought to look for. This automatic feature learning became one of the defining characteristics of deep learning and a major reason for its success across diverse domains.

Practical Applications

The real-world impact was immediate and transformative across multiple domains. Speech recognition systems, which had struggled with the variability and complexity of human speech, suddenly became practical. Neural networks could learn to handle different accents, speaking rates, and background noise, adapting to patterns that would have been impossibly difficult to capture with hand-written rules. The systems were no longer brittle programs that broke when encountering unexpected inputs, but rather robust learners that generalized from examples.

Computer vision experienced perhaps the most dramatic transformation. Networks learned to recognize objects, faces, and entire scenes in images with accuracy that began to approach and then exceed human performance on specific tasks. They could handle variations in lighting, angle, and occlusion that would have required enormous amounts of specialized code in traditional computer vision systems. The network simply learned what mattered and what didn't from the training data.

Natural language processing also advanced significantly. Networks began to understand word relationships, capturing semantic similarities and syntactic patterns. They could learn that "king" relates to "queen" in the same way that "man" relates to "woman," without being explicitly taught these relationships. They started to handle syntax, learning grammatical structures from examples rather than requiring linguists to encode rules. While early applications were relatively simple compared to modern language models, the foundation was laid for everything that would come later. Backpropagation didn't just make these applications better, it made them possible in the first place.

Research Acceleration

Backpropagation fundamentally changed how researchers approached neural network research. The experimental cycle became dramatically faster. Before backpropagation, training a network meant using slow, inefficient methods that might take days or weeks to produce mediocre results, if they converged at all. With backpropagation, researchers could quickly test different architectures and see what worked. They could try adding more layers, changing activation functions, or adjusting network topology, getting feedback in hours or days instead of weeks or months.

This speed enabled a more empirical, experimental approach to research. Rather than spending months developing elaborate theories about what should work, researchers could simply try things and see. The field became more data-driven and less reliant on pure intuition. Networks could grow larger and more complex because the training algorithm scaled efficiently. Optimization techniques that had been purely theoretical, developed by mathematicians without practical applications, could now be tested and refined on real problems.

The pace of progress accelerated dramatically. Papers built on each other more quickly. Innovations spread through the research community faster because other researchers could actually implement and test new ideas. The field entered a positive feedback loop where better training methods enabled larger networks, which solved harder problems, which attracted more researchers and funding, which led to even better methods. This acceleration, which started with backpropagation in 1986, continues to this day.

The Limitations

Of course, backpropagation wasn't a silver bullet that solved all problems in neural network training. While it made deep learning possible, it also revealed new challenges and limitations. Some of these were inherent to the algorithm itself, while others emerged only as researchers tried to push networks deeper and tackle more complex problems. Many of these challenges remain active areas of research today, nearly four decades after backpropagation's introduction.

Vanishing Gradients

The first major problem showed up when researchers tried to build truly deep networks with many layers. They discovered that gradients could become exponentially smaller as they propagated backward through the layers. Each layer multiplies the gradient by various weights and derivatives, and when these numbers are small (less than one), repeated multiplication causes the gradient to shrink rapidly. By the time the error signal reaches the early layers, it might be so tiny as to be effectively zero.

This meant that early layers in deep networks learned extremely slowly or stopped learning entirely. The network could adjust its final layers reasonably well, but the crucial early layers that should be learning fundamental features remained stuck with nearly random weights. This severely limited the practical depth of networks for many years, creating a frustrating situation where deeper architectures seemed theoretically better but couldn't actually be trained effectively. The problem was eventually addressed through innovations like better activation functions (ReLU instead of sigmoid), careful weight initialization schemes, and architectural changes like residual connections, but it remains a consideration in network design today.
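
A back-of-the-envelope calculation makes the effect vivid. The sigmoid's derivative a(1 - a) is at most 0.25, so even in the best case, and ignoring the weight factors entirely (an assumption made only to isolate the activation function's contribution), the error signal shrinks by at least a factor of four at every layer it passes through:

```python
# Best-case shrinkage of the backward error signal through stacked sigmoid layers.
# The sigmoid derivative a * (1 - a) never exceeds 0.25; weights are ignored here.
grad = 1.0
for layer in range(1, 31):
    grad *= 0.25
    if layer in (5, 15, 30):
        print(f"after {layer} layers, the gradient factor is {grad:.1e}")
# Roughly 1e-03 after 5 layers, 1e-09 after 15, and 1e-18 after 30:
# the early layers of a deep sigmoid network receive almost no learning signal.
```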


Local Optima

Then there's the problem of getting stuck in suboptimal solutions. The loss landscape, which you can visualize as a hilly terrain where you're trying to find the lowest valley, contains many local minima. These are points where the gradient is zero, so gradient descent stops, but they're not the best possible solution, just better than their immediate surroundings. Imagine standing in a small depression on the side of a mountain. You're at a local low point, but the actual valley floor is much further down. Gradient descent can settle into these mediocre solutions instead of finding the global minimum.

This led to inconsistent and frustrating results, especially in the early days. Train the same network architecture twice with different random starting weights, and you might get wildly different performance. One run might find a great solution, while another gets stuck in a poor local minimum. This made research difficult because you couldn't be sure whether a new idea actually helped or whether you just got lucky with the initialization.

Interestingly, this problem turned out to be less severe than initially feared, at least for large networks. Modern research suggests that many local minima in high-dimensional spaces are actually pretty good solutions, nearly as good as the global minimum. Additionally, techniques like momentum-based optimization, which we'll discuss in later chapters, help the optimizer escape shallow local minima. Still, the stochastic and somewhat unpredictable nature of neural network training traces back to this fundamental issue.

Computational Cost

Backpropagation, while efficient compared to alternatives, is still computationally expensive in absolute terms. You need to compute gradients for every weight in every layer, and as networks grew larger, this became a serious bottleneck. A modern large language model might have billions or even trillions of parameters, and backpropagation must compute a gradient for each one, for each training example.

Early researchers were severely constrained by computational limitations. They were stuck with relatively small models, not because they didn't want bigger ones or because small models were sufficient, but because they simply couldn't afford to train larger networks. The computers of the late 1980s and early 1990s were orders of magnitude slower than today's hardware. Training a moderately sized network on a reasonable dataset might take days or weeks. Every experiment required careful consideration of computational budgets, and researchers had to make difficult tradeoffs between network size, dataset size, and the number of experiments they could run.

This computational barrier limited the practical application of neural networks for many years. While there had been earlier AI winters in the 1970s and late 1980s, neural networks continued to make progress through the 1990s and early 2000s, though progress was slower than it would later become. It was only with the advent of GPU acceleration in the late 2000s, particularly demonstrated by breakthroughs like AlexNet in 2012, that training truly large and deep networks became practical. The algorithmic breakthrough of backpropagation had arrived decades before the hardware could fully exploit it.

Overfitting

Finally, neural networks trained with backpropagation had a troubling tendency to memorize their training data rather than learning general patterns. They'd achieve excellent performance on the training set but fail miserably when confronted with new, unseen examples. This overfitting problem is a fundamental challenge in machine learning, but it became particularly acute with neural networks because of their enormous capacity to memorize.

Think of it like a student who memorizes every practice problem verbatim but doesn't understand the underlying concepts. When exam day arrives with slightly different problems, they're lost. Neural networks can do something similar, learning to recognize specific training examples perfectly without extracting the generalizable patterns that would let them handle new data.

The problem gets worse with larger networks and smaller datasets. A network with millions of parameters trained on just thousands of examples has more than enough capacity to simply memorize every training example. Yet we need large networks to learn complex patterns, and we can't always get massive datasets. This creates a difficult tension.

The solution involved developing sophisticated regularization techniques. Dropout randomly turns off neurons during training, preventing the network from relying too heavily on any particular path through the network. Weight decay penalizes large weights, encouraging the network to use simpler patterns. Early stopping halts training before the network has fully memorized the training set. Batch normalization, skip connections, and data augmentation all help in various ways. But getting these techniques right required expertise and extensive experimentation. Even today, preventing overfitting while maintaining the network's ability to learn complex patterns remains a delicate balancing act.
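
As a small illustration of two of these techniques, the sketch below applies weight decay to a single gradient-descent update and an inverted-dropout mask to a vector of hidden activations. The shapes, rates, and "gradient" values are placeholders rather than the output of a real training run.

```python
import numpy as np

rng = np.random.default_rng(0)
lr, weight_decay, p_drop = 0.1, 0.01, 0.5   # arbitrary hyperparameters for illustration

# Weight decay: penalize large weights by adding a term proportional to the weight
# itself to the update, nudging the network toward smaller, simpler solutions.
W = rng.normal(size=(5, 3))
grad = rng.normal(size=(5, 3))              # stand-in for a backpropagated gradient
W -= lr * (grad + weight_decay * W)

# Dropout: during training, randomly zero each hidden activation with probability
# p_drop and rescale the survivors, so no single path through the network becomes
# indispensable. At test time the layer is used as-is.
h = rng.normal(size=8)                      # stand-in for hidden-layer activations
mask = rng.random(size=h.shape) > p_drop
h_dropped = h * mask / (1 - p_drop)         # "inverted" dropout rescales at training time
print(h_dropped)
```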

The Legacy for Language AI

Every language AI system you use today, from the autocomplete on your phone to sophisticated conversational agents like ChatGPT, traces its lineage directly back to backpropagation. Without this 1986 breakthrough, modern natural language processing simply wouldn't exist. The algorithm's impact on the field has been profound and enduring, shaping not just how we train models but how we think about language understanding itself.

Foundation for Modern Models

Backpropagation made word embeddings possible, and these became the foundation of modern NLP. Before backpropagation, words were typically represented as one-hot vectors, sparse representations that treated every word as completely distinct from every other word. With backpropagation, networks could learn dense, distributed representations that captured semantic relationships. These embeddings encoded meaning in a geometric space where semantically similar words ended up close together. The famous example where "king" minus "man" plus "woman" equals "queen" demonstrates how these learned representations captured conceptual relationships that no human had explicitly programmed.

Backpropagation enabled recurrent neural networks, architectures specifically designed to process sequences. Unlike feedforward networks that treat each input independently, recurrent networks maintain an internal state that gets updated as they process a sequence. This made them naturally suited for language, where the meaning of a word depends on the words that came before it. Suddenly, machine translation became feasible without the elaborate manual feature engineering that statistical methods required. Text generation, speech recognition, and sentiment analysis all became tractable problems.

Most importantly, backpropagation provided the training mechanism for transformers and attention-based models, the architectures that power modern language AI. Transformers are massive networks with billions of parameters, trained on enormous corpora of text. Every one of those parameters gets updated using backpropagation. Without this efficient training algorithm, none of it would work. There would be no GPT, no BERT, no Claude, no modern language AI at all. The entire edifice of contemporary NLP rests on this 1986 foundation.

Training Paradigms

Backpropagation established supervised learning as the standard training approach for neural networks. The paradigm is straightforward: provide the network with input-output pairs, have it make predictions, calculate the error between its predictions and the correct answers, and use backpropagation to adjust the weights. This approach allowed researchers to train on massive amounts of labeled text data, learning patterns from millions or billions of examples.

More subtly, backpropagation made transfer learning practical. The idea is simple but powerful: train a large network on a general task with abundant data, then adapt it to a specific task with limited data. For language models, this typically means pre-training on vast amounts of text to learn general language understanding, then fine-tuning on a specific task like sentiment analysis or question answering. Backpropagation makes both phases work. The initial training learns general features, and fine-tuning adjusts those features for the target task, typically with a much smaller learning rate that makes small adjustments rather than wholesale changes.

This concept of pre-training and fine-tuning became absolutely central to modern NLP. Why train from scratch when you can start with a model that already understands language, that has already learned about grammar, semantics, world knowledge, and reasoning patterns? The approach dramatically reduced the data requirements for specific tasks and improved performance across the board. It's why we can build effective systems for specialized domains without needing millions of domain-specific training examples. Backpropagation's flexibility, its ability to take a trained network and continue training it on new data, made this entire paradigm possible.

Research Methodology

Backpropagation fundamentally changed how NLP researchers approached problems. The field shifted toward end-to-end learning, where a single neural network learns to map inputs directly to outputs without intermediate hand-crafted representations. Before this shift, building an NLP system meant designing elaborate pipelines. You'd have separate components for tokenization, part-of-speech tagging, parsing, named entity recognition, and so on, each requiring carefully engineered features and rules. With backpropagation, you could train a single network to learn all these steps jointly, optimizing the entire system for the final task.

This enabled a thoroughly data-driven approach. You no longer needed teams of linguists to encode explicit rules about grammar, syntax, and semantics. The network figured these patterns out on its own from examples. This was both liberating and democratizing. Suddenly, you could build effective NLP systems for languages and domains where detailed linguistic resources didn't exist. You just needed text data and computational resources.

The scalability of backpropagation was equally important. As computational power grew, networks could grow larger and larger. The algorithm's efficiency meant that the cost of computing gradients stayed roughly constant per parameter, so total training cost grew in proportion to model size rather than exploding with it. This favorable scaling paved the way for the massive models we have today. A GPT-3 scale model with 175 billion parameters would be completely untrainable without backpropagation's efficient gradient computation. The algorithm's impact on research methodology was profound: it made empirical, data-driven approaches dominant and enabled the scaling laws that continue to drive progress today.

Current Applications

Look at any modern language model, whether it's GPT, BERT, Claude, or whatever comes next, and you'll find that they all use backpropagation for training. This nearly 40-year-old algorithm remains the fundamental mechanism by which these systems learn. The models have grown exponentially larger, the architectures have become vastly more sophisticated, the training datasets have expanded to encompass much of the internet, but the core training algorithm remains backpropagation.

Machine translation has been completely revolutionized by neural approaches trained with backpropagation. The statistical methods that dominated the field for decades have been almost entirely replaced by neural models that learn translation end-to-end. The improvement in quality has been dramatic, with neural machine translation producing more fluent, more accurate translations across a wider range of language pairs.

Text generation that seems almost magical, producing coherent paragraphs or even entire articles on arbitrary topics, is built on this same fundamental training method. These models use backpropagation to learn the statistical patterns of language from massive text corpora, developing an implicit understanding of grammar, facts, reasoning, and even writing style.

Backpropagation isn't just part of the historical foundation of language AI. It's still the essential workhorse powering every advance in the field. When researchers develop new architectures, new training techniques, or new applications, they're still fundamentally relying on backpropagation to adjust the billions of parameters in their models. The algorithm has proven remarkably durable, remaining central to the field nearly four decades after its introduction.

