The Mathematics Behind LLM Fine-Tuning: A Beginner's Guide to How and Why Fine-Tuning Works

Michael Brenndoerfer · July 28, 2025 · 9 min read · 2,508 words

Understand the mathematical foundations of LLM fine-tuning with clear explanations and minimal prerequisites. Learn how gradient descent, weight updates, and Transformer architectures work together to adapt pre-trained models to new tasks.

Fine-tuning adapts a pre-trained LLM by making small adjustments to its weight matrices through gradient descent. Starting with weights $\theta_0$, we find minimal changes $\Delta\theta$ that reduce errors on our specific data. The process works because:

  • Neural networks use distributed representations where small weight changes can unlock new behaviors
  • Gradient descent guides us toward better performance by iteratively adjusting weights in the direction that reduces loss
  • We preserve the model's general knowledge while adding task-specific capabilities
  • Even adjusting 0.1% of billions of parameters can significantly change model behavior

The math is straightforward: minimize loss, compute gradients, update weights, repeat. The power comes from the distributed nature of neural representations.

Fine-tuning large language models (LLMs) has become a cornerstone of modern AI applications, but what actually happens when we adapt a pre-trained model to our specific needs? While most discussions focus on practical aspects such as choosing datasets, setting hyperparameters, or measuring performance, few explain the underlying mathematics that makes fine-tuning work.

This article demystifies the core math behind fine-tuning: what exactly changes inside a model, how those changes are guided, and why the process works. We'll break down the essential equations and concepts step by step, with no advanced mathematics required. Whether you're a data scientist, engineer, or technical leader curious about the fundamentals, this guide will help you understand what happens "under the hood" when you fine-tune an LLM.

Fine-tuning is adjusting a foundation model's internal parameters

Prompt engineering and RAG work outside the network. Fine-tuning rewires it from the inside.

Fine-tuning starts with a pre-trained model whose parameters (weights) were learned on massive datasets containing trillions of tokens. Our task is to slightly adjust those weights so the model behaves as if it had specialized knowledge of our domain-specific data, which we'll call $D$.

Mathematically, we're solving an optimization problem:

$$\min_{\Delta\theta} L(D;\, \theta_0 + \Delta\theta)$$

In plain English: Start with the model's existing weights ($\theta_0$) and look for the smallest adjustment ($\Delta\theta$) that minimizes the model's errors on your specific data ($D$).

Let's break this down step by step:

  1. Start with a pre-trained model ($\theta_0$)
    $\theta_0$ represents all the weights the base model learned during its original training. Think of these as billions of dials already set to useful positions.

  2. Define what we want to change ($\Delta\theta$)
    Instead of starting from scratch, we only look for a small adjustment $\Delta\theta$ to those original weights. This is like fine-tuning a few dials rather than resetting everything.

  3. Measure performance with a loss function ($L$)
    We run the adjusted model (with weights $\theta_0 + \Delta\theta$) on our fine-tuning data $D$ and compute the loss $L$, which measures how wrong the model's predictions are.

  4. Find the best adjustment
    Our goal is to find the specific $\Delta\theta$ that makes the loss as small as possible: $\min_{\Delta\theta} L(D;\, \theta_0 + \Delta\theta)$

Why this approach works:

  • We preserve everything the base model already knows (encoded in $\theta_0$)
  • We only learn the additional changes needed for the new task ($\Delta\theta$)
  • By keeping $\Delta\theta$ small, we avoid overwriting the model's general capabilities
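To make the objective concrete, here is a minimal sketch with a toy linear model standing in for an LLM; the data, dimensions, and mean-squared-error loss are illustrative assumptions, not the actual fine-tuning setup:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_0 = rng.normal(size=(8,))          # frozen pre-trained weights (the "dials")
delta = np.zeros_like(theta_0)           # the small adjustment we want to learn

# Stand-in for the domain-specific data D: inputs X and targets y
X = rng.normal(size=(32, 8))
y = X @ rng.normal(size=(8,)) + 0.1 * rng.normal(size=(32,))

def loss(delta):
    """Error of the adjusted model theta_0 + delta on D (mean squared error here)."""
    preds = X @ (theta_0 + delta)
    return np.mean((preds - y) ** 2)

print(loss(delta))  # loss of the unmodified base model (delta = 0)
```

Fine-tuning is the search for a small `delta` that drives this loss down while leaving `theta_0` itself untouched.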

How gradient descent guides the weight updates

Updates are computed using gradient descent:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L$$

In plain English: Check how wrong we are, determine which weight adjustments would reduce that error, take a carefully sized step in that direction, and repeat until the model performs well on the new task.

Here's what each part means:

  1. Current weights ($\theta_t$)
    Think of $\theta$ as all the adjustable parameters inside the model. At iteration $t$, these parameters have values $\theta_t$.

  2. The error we're trying to minimize ($L$)
    We run the model on our fine-tuning data and measure how incorrect its predictions are. This error is the loss $L$.

  3. Direction of steepest increase ($\nabla_\theta L$)
    The gradient $\nabla_\theta L$ tells us, for each parameter, which direction would increase the loss most rapidly. It's like a compass pointing uphill on an error landscape.

  4. Step size ($\eta$, the learning rate)
    $\eta$ is a small positive number (often between 0.00001 and 0.001) that controls how much we adjust the weights in each iteration. Too large, and we might overshoot; too small, and training takes forever.

  5. The update rule
    We subtract $\eta \nabla_\theta L$ because we want to move in the opposite direction of the gradient, moving downhill toward lower loss. The new weights become: $\theta_{t+1} = \theta_t - \eta \nabla_\theta L$

  6. Iterative improvement
    We repeat this process hundreds or thousands of times. Each iteration slightly improves the model's performance on our specific data.

Even a few thousand gradient steps on a modest dataset (hundreds to thousands of examples) can shift the model into a new region of parameter space where it exhibits domain-specific knowledge, specialized vocabulary, or particular reasoning patterns that weren't prominent in the original training data.
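The loop below sketches this update rule in PyTorch on a toy model; the linear layer, random data, and learning rate are placeholders standing in for the LLM, the fine-tuning set $D$, and a tuned $\eta$:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)                  # stands in for the model's weights theta
X, y = torch.randn(64, 8), torch.randn(64, 1)  # stands in for the fine-tuning data D
eta = 1e-2                                     # learning rate (toy value)

for step in range(200):
    loss = torch.nn.functional.mse_loss(model(X), y)  # forward pass + loss L
    loss.backward()                                    # backward pass: nabla_theta L
    with torch.no_grad():
        for p in model.parameters():
            p -= eta * p.grad                          # theta_{t+1} = theta_t - eta * grad
            p.grad = None                              # reset gradients for the next step
```

Real fine-tuning swaps the mean-squared error for cross-entropy over tokens and usually uses an optimizer such as Adam, but the core update is this same subtraction.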

Understanding model weights and how gradient descent updates them

Now that we understand gradient descent, you might wonder: "What exactly is $\theta$ in the context of a transformer model?" Let's demystify what "weights" actually are and how gradient descent operates on them during fine-tuning.

In plain English: Imagine a neural network as a massive switchboard with billions of adjustable connections. Each connection has a "weight" that determines how strongly signals flow through it. In Transformers, these weights are organized into large tables of numbers called matrices. Each matrix is like a giant spreadsheet where every cell contains a number that the model adjusts during training.

When we write $\theta$ in our gradient descent equation, we're referring to all the learnable parameters - every single number in every weight matrix and bias vector throughout the model. This includes weights in attention layers, feed-forward networks, layer normalization, and embeddings. For a 7B parameter model, that's 7 billion individual numbers!
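One way to see this concretely is to count the learnable tensors in a single Transformer layer; the sketch below uses PyTorch's built-in encoder layer as a stand-in, with dimensions chosen for illustration rather than matching any particular LLM:

```python
import torch

# One Transformer layer as a stand-in for a full model; theta is the union of all such tensors.
layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)

n_params = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"{n_params:,} learnable parameters in a single layer")  # already in the millions
```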

The key weight matrices in transformers

Each layer in a transformer contains several types of weight matrices, and understanding their roles helps us appreciate how fine-tuning changes model behavior:

1. Attention Mechanism

The attention mechanism helps the model understand relationships between words. It uses three types of weight matrices:

  • Query matrix ($W_Q$): Helps ask "What information am I looking for?"
  • Key matrix ($W_K$): Helps identify "What information is available?"
  • Value matrix ($W_V$): Contains the actual information to be shared

Given an input $X$ (where each row represents a word as numbers), the attention mechanism computes:

$$Q = X W_Q \quad \text{(Queries)}$$
$$K = X W_K \quad \text{(Keys)}$$
$$V = X W_V \quad \text{(Values)}$$

The attention scores are then calculated as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $d_k$ is the dimension of the key vectors (this scaling keeps the dot products from becoming too large).
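Here is the same computation as a short NumPy sketch; the tiny dimensions and random matrices are illustrative assumptions, and a real model applies this once per attention head:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 4, 8, 8               # toy sizes; real models are far larger
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))       # one row of numbers per token
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # project the input into queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)               # scaled dot-product similarities
output = softmax(scores) @ V                  # attention-weighted mix of values, shape (seq_len, d_k)
```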

2. Feed-Forward Network (FFN)

After attention, each position's representation passes through a feed-forward network, which is essentially a two-layer neural network with an activation function:

$$\text{FFN}(x) = W_2 \cdot \sigma(W_1 \cdot x + b_1) + b_2$$

Breaking this down:

  • $W_1$ expands the representation to a higher dimension (typically 4x larger)
  • $\sigma$ is a non-linear activation function (like GELU or ReLU) that adds flexibility
  • $W_2$ compresses back to the original dimension
  • $b_1$ and $b_2$ are bias terms that shift the outputs

Think of it as: $W_1$ spreads the information out to see it from many angles, the activation function adds the ability to learn complex patterns, and $W_2$ combines everything back into a useful representation.
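A compact NumPy version of this block, using the row-vector convention ($x W_1$ rather than $W_1 x$) and toy dimensions chosen purely for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                    # the non-linear activation sigma (ReLU here)

d_model, d_ff = 8, 32                            # real models use e.g. 4096 and 16384
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)     # expand
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)  # compress

def ffn(x):
    return relu(x @ W1 + b1) @ W2 + b2           # expand, apply non-linearity, compress

x = rng.normal(size=(d_model,))
print(ffn(x).shape)                              # (8,): back to the original dimension
```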

How gradient descent operates on these weights

Here's how gradient descent operates on these weights through a four-step cycle:

1. Forward Pass: Computing Predictions

The model processes your input text by passing it through dozens of transformer layers. At each layer:

  • Attention mechanisms use weight matrices ($W_Q$, $W_K$, $W_V$) to determine which parts of the input to focus on
  • Feed-forward networks apply transformations using their weight matrices ($W_1$, $W_2$)
  • The transformed data flows to the next layer, eventually producing a prediction

Think of this as the model using its current knowledge (weights) to make its best guess at the answer.

2. Loss Calculation: Measuring Error

The model's prediction is compared to the correct answer using a loss function (typically cross-entropy for language models). This produces a single number $L$ that quantifies how wrong the prediction was.

In plain English: If the model predicted "cat" but the correct next word was "dog", the loss function assigns a high error value. The worse the prediction, the higher the loss.
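The cat/dog example can be written out directly; the four-word vocabulary and the logit values below are made up purely for illustration:

```python
import torch
import torch.nn.functional as F

vocab = ["cat", "dog", "the", "ran"]
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1]])  # the model's scores: it strongly favors "cat"
target = torch.tensor([1])                       # but the correct next word is "dog" (index 1)

loss = F.cross_entropy(logits, target)
print(f"correct word: {vocab[target.item()]}, loss: {loss.item():.2f}")  # ~1.9, high because the model put little probability on "dog"
```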

3. Backward Pass: Computing Gradients

This is where the magic happens. Using backpropagation (which applies the chain rule from calculus), we compute the gradient of the loss with respect to every single parameter:

  • For attention weights: $\nabla_{W_Q} L$, $\nabla_{W_K} L$, $\nabla_{W_V} L$
  • For feed-forward weights: $\nabla_{W_1} L$, $\nabla_{W_2} L$
  • For all other parameters throughout the model

What gradients tell us: Each gradient entry answers the question "If I increase this specific weight by a tiny amount, how much will the loss change?" A negative gradient means increasing that weight would reduce the loss; a positive gradient means decreasing it would.

Crucially, each gradient matrix has the same shape as its corresponding weight matrix. If $W_Q$ is 4096×4096, then $\nabla_{W_Q} L$ is also 4096×4096, providing update instructions for each of the ~16.8 million parameters in that matrix alone.
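Autograd frameworks make this shape correspondence easy to see; the sketch below uses a 256×256 matrix and a made-up loss purely to keep the example cheap:

```python
import torch

W_Q = torch.randn(256, 256, requires_grad=True)  # small stand-in for a 4096x4096 attention matrix
x = torch.randn(4, 256)                           # a tiny batch of token representations

loss = (x @ W_Q).pow(2).mean()                    # stand-in loss; real training uses cross-entropy
loss.backward()                                   # backpropagation fills in W_Q.grad

print(W_Q.shape, W_Q.grad.shape)                  # both torch.Size([256, 256]): one gradient per weight
```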

4. Weight Update: Applying the Gradients

Finally, we update each weight by taking a small step in the direction that reduces the loss:

$$W^{\text{new}} = W^{\text{old}} - \eta \nabla_W L$$

For example:

  • $W_Q^{\text{new}} = W_Q^{\text{old}} - \eta \nabla_{W_Q} L$
  • $W_1^{\text{new}} = W_1^{\text{old}} - \eta \nabla_{W_1} L$

The learning rate $\eta$ controls the step size - typically a tiny value like 0.00001 to ensure we don't overshoot.
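Continuing the gradient sketch above, the update itself is a single in-place subtraction per matrix; the learning rate and sizes are again illustrative:

```python
import torch

eta = 1e-5                                        # a typical fine-tuning learning rate
W_Q = torch.randn(256, 256, requires_grad=True)
loss = (torch.randn(4, 256) @ W_Q).pow(2).mean()  # stand-in loss, as in the previous sketch
loss.backward()

with torch.no_grad():
    W_Q -= eta * W_Q.grad                         # W_new = W_old - eta * grad
W_Q.grad = None                                   # clear the gradient before the next iteration
```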

The complete cycle: Forward pass → Loss calculation → Backward pass → Weight update. This repeats thousands of times during fine-tuning, with each iteration making tiny adjustments across billions of parameters. These small changes accumulate, gradually shifting the model's behavior to excel at your specific task.

Why this matters for fine-tuning

When you fine-tune for a specific domain (say, medical text), the gradients guide the model to:

  • Adjust attention weights to focus more on medical terminology patterns
  • Modify feed-forward weights to better process clinical concepts
  • Update embedding weights to capture specialized vocabulary

The beauty is that we're not starting from scratch: we're making targeted adjustments to weights that already encode vast general knowledge, adding just the specialized capabilities we need.

During fine-tuning, gradient descent computes separate gradients for each of these weight matrices:

  • For attention weights: The gradients $\nabla_{W_Q} L$, $\nabla_{W_K} L$, and $\nabla_{W_V} L$ tell us how to adjust each attention matrix to better capture relationships relevant to our task
  • For FFN weights: The gradients $\nabla_{W_1} L$ and $\nabla_{W_2} L$ indicate how to modify the feed-forward transformations

Each gradient step makes tiny adjustments across millions of parameters, gradually shifting the model's behavior while preserving its general capabilities.

Scale: How large are these weight matrices?

For a 7 billion parameter model, the numbers are staggering:

  • A single attention matrix might be 4096 × 4096 = ~16.8 million parameters
  • The feed-forward layers might expand to 16384 dimensions, creating matrices with ~67 million parameters
  • With dozens of layers, each containing multiple such matrices, we quickly reach billions of parameters

All these weights together encode everything the model "knows": grammar, facts, reasoning patterns, and more.
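The arithmetic behind these numbers is straightforward; the dimensions below are the commonly quoted values for a 7B-class model and are meant as a rough illustration, since exact sizes vary by architecture:

```python
d_model, d_ff, n_layers = 4096, 16384, 32

attn_params = 4 * d_model * d_model   # W_Q, W_K, W_V plus an output projection, ~16.8 million each
ffn_params = 2 * d_model * d_ff       # W_1 (expand) and W_2 (compress), ~67 million each
per_layer = attn_params + ffn_params

print(f"{attn_params:,} attention params per layer")   # 67,108,864
print(f"{ffn_params:,} FFN params per layer")          # 134,217,728
print(f"{n_layers * per_layer:,} across all layers")   # ~6.4 billion, before embeddings
```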

Fine-tuning: Making targeted adjustments

When you fine-tune a model, you're making small, precise updates to these weight matrices:

Mathematically: Starting with original weights $\theta_0$, fine-tuning finds a small change $\Delta\theta$ such that the new weights become $\theta_0 + \Delta\theta$.

In practice: You might update:

  • Only certain layers (e.g., the final few layers)
  • Only certain types of weights (e.g., just attention matrices)
  • Or use techniques like LoRA that add small, trainable matrices alongside frozen weights (see the sketch below)
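As a rough sketch of the LoRA idea (an illustration of the general approach, not any specific library's API): the pre-trained weight stays frozen and a low-rank product $B A$ is learned on top of it.

```python
import torch

d, r = 512, 8                                     # hidden size and a small rank (illustrative)
W = torch.randn(d, d)                             # frozen pre-trained weight; never updated
A = torch.nn.Parameter(0.01 * torch.randn(r, d))  # small trainable matrix
B = torch.nn.Parameter(torch.zeros(d, r))         # starts at zero, so the model is unchanged at step 0

def adapted_forward(x):
    # Effective weight is W + Delta, where Delta = B @ A has rank at most r
    return x @ (W + B @ A).T

x = torch.randn(4, d)
print(adapted_forward(x).shape)                   # only A and B are trained
```

Only 2·d·r = 8,192 numbers are trainable here, versus d² = 262,144 in the full matrix, which is one way adapter methods keep $\Delta\theta$ small by construction.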

The gradient descent connection: Remember our update equation $\theta_{t+1} = \theta_t - \eta \nabla_\theta L$? During fine-tuning:

  1. $\theta$ represents all these weight matrices collectively ($W_Q$, $W_K$, $W_V$, $W_1$, $W_2$, etc.)
  2. The gradient $\nabla_\theta L$ contains individual gradients for each matrix
  3. Each iteration updates thousands of matrices simultaneously, each by a tiny amount $\eta$ times its gradient
  4. Over many iterations, these small updates accumulate to create domain-specific behavior

The key insight: Fine-tuning is like adjusting a few critical dials in a massive control panel. These small changes can significantly alter the model's behavior for specific tasks without erasing its general knowledge.

Why small changes have large effects

Transformer models use distributed representations, where no single weight or neuron represents a specific concept like "cat" or "legal terminology." Instead, knowledge is spread across thousands of neurons and millions of weights, with each neuron participating in representing many different concepts.

This distributed nature makes the model surprisingly sensitive to small adjustments. Even changing just 0.1% of the weights can have substantial effects because:

  1. Network effects: A small change in one part of the network can alter information flow throughout the model, like adding a new road that opens up entirely new routes in a transportation network.

  2. Emergent behaviors: Small weight adjustments can cause the model to suddenly exhibit new capabilities, such as understanding domain-specific jargon, adopting a consistent tone, or following specialized reasoning patterns.

  3. Compositional power: Because the model combines many small pieces of information to generate outputs, tweaking how these pieces interact can lead to dramatically different results.

This is the power of fine-tuning: by making targeted updates to a small fraction of the model's parameters, you can adapt a general-purpose language model to highly specialized tasks. The model retains its broad knowledge while gaining new, task-specific capabilities, all through careful mathematical optimization of its weight matrices.

Conclusion

Understanding the mathematics behind fine-tuning reveals why this technique is so effective for adapting large language models. At its core, fine-tuning is an optimization problem where we search for minimal weight adjustments that improve model performance on specific data. Through gradient descent, we iteratively update the model's parameters, which are the massive matrices that encode its knowledge and capabilities.

The distributed nature of neural representations means that even small changes to these weights can unlock new behaviors and domain expertise. This mathematical foundation explains why fine-tuning often outperforms prompt engineering or retrieval methods for tasks requiring consistent behavior, specialized knowledge, or particular reasoning styles.

Armed with this understanding, you can make more informed decisions about when and how to fine-tune models for your applications. The math may seem complex at first, but the core principle is elegantly simple: careful, targeted adjustments to a pre-trained model's parameters can transform a generalist into a specialist, all while preserving the vast knowledge encoded in the original weights.

For those interested in diving deeper into the mathematical foundations of transformers and attention mechanisms, Chapter 9 of Jurafsky and Martin's "Speech and Language Processing" provides an excellent, detailed exploration of the transformer architecture, including the multi-head attention mechanism and feed-forward components discussed in this article.


About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
