LoRA Mathematics: Low-Rank Adaptation Formulas & Gradients

Michael Brenndoerfer · December 1, 2025 · 46 min read

Master LoRA's mathematical foundations including low-rank decomposition, gradient computation, rank selection, and initialization schemes for efficient fine-tuning.


LoRA Mathematics

In the previous chapter, we introduced LoRA as a parameter-efficient fine-tuning method that adapts large language models by learning low-rank updates to weight matrices. Instead of modifying the full weight matrix W \in \mathbb{R}^{d \times k}, LoRA learns two smaller matrices whose product represents the weight change. Now we turn to the mathematical foundations that make this approach work. Understanding LoRA's mathematics reveals why the method is both theoretically grounded and practically effective. The formulation connects to fundamental concepts from linear algebra, specifically the matrix decomposition techniques from Part III. By examining the initialization scheme, gradient flow, and rank selection criteria, you will gain the intuition needed to apply LoRA effectively and understand its variants covered in upcoming chapters.

The LoRA Formulation

LoRA reparameterizes weight updates using a factorized form that captures the essence of efficient adaptation. Rather than learning an arbitrary update to each weight matrix, LoRA constrains the update to live in a low-dimensional subspace. This constraint reflects the empirical observation that useful fine-tuning updates often have simple structure, revealing important patterns in how models adapt to new tasks.

For a pre-trained weight matrix W_0 \in \mathbb{R}^{d \times k}, LoRA expresses the adapted weight as:

W = W_0 + \Delta W = W_0 + BA

where:

  • W: the adapted weight matrix (final parameters)
  • W_0: the pre-trained weight matrix (frozen)
  • \Delta W: the weight update matrix (learned)
  • B \in \mathbb{R}^{d \times r}: the up-projection matrix
  • A \in \mathbb{R}^{r \times k}: the down-projection matrix
  • r \ll \min(d, k): the rank of the adaptation

The key insight is that \Delta W = BA has rank at most r. To understand why, recall that the rank of a matrix product cannot exceed the rank of either factor. Since A has r rows and B has r columns, the product BA can have rank at most r. Even though \Delta W has the same dimensions as W_0, potentially millions of entries, it lives in a much lower-dimensional subspace determined by the choice of r. This constraint is what enables LoRA's parameter efficiency: instead of learning d \times k independent parameters, we learn only r \times (d + k) parameters while still producing updates that span the full weight matrix dimensions.
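
To make the rank bound concrete, here is a minimal sketch (the dimensions are arbitrary, chosen only for illustration) that builds random factors and checks the rank of their product:

Code
import torch

torch.manual_seed(0)
d, k, r = 32, 24, 4

B = torch.randn(d, r)   # up-projection factor
A = torch.randn(r, k)   # down-projection factor
delta_W = B @ A         # full d x k matrix, but constrained in rank

# matrix_rank counts singular values above a numerical tolerance
print(delta_W.shape)                      # torch.Size([32, 24])
print(torch.linalg.matrix_rank(delta_W))  # tensor(4): rank is at most r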

Forward Pass Computation

During the forward pass, the network must compute the output of each adapted layer. This computation reveals opportunities for efficient implementation and clarifies each matrix's role in the decomposition.

For an input x \in \mathbb{R}^k, the output h \in \mathbb{R}^d is computed as:

\begin{aligned} h &= Wx \\ &= (W_0 + BA)x && \text{(substitute } W \text{)} \\ &= W_0 x + BA x && \text{(distribute)} \\ &= W_0 x + B(Ax) && \text{(associativity)} \end{aligned}

where:

  • h: the output vector of the layer
  • W: the full adapted weight matrix
  • x: the input vector to the layer
  • W_0: the frozen pre-trained weight matrix
  • B: the trainable up-projection matrix
  • A: the trainable down-projection matrix

The final line reveals LoRA's computational structure. The parenthesization in B(Ax) is critical for computational efficiency. The naive approach would form the full d \times k matrix BA explicitly and then multiply by x. This defeats the purpose of the low-rank decomposition because we would need to compute and store a matrix with dk entries.

Instead, we exploit the associativity of matrix multiplication to compute the result sequentially through a bottleneck of dimension r. The computation proceeds in three stages:

  1. First, we compute z = Ax, yielding z \in \mathbb{R}^r. This operation involves an r \times k matrix multiplying a k-dimensional vector, requiring O(rk) operations.
  2. Next, we compute \Delta h = Bz, yielding \Delta h \in \mathbb{R}^d. This involves a d \times r matrix multiplying an r-dimensional vector, requiring O(dr) operations.
  3. Finally, we add the base model's output: h = W_0 x + \Delta h.

This sequential computation requires O(rk + dr) = O(r(d+k)) operations for the LoRA path, compared to O(dk) if we materialized the full update matrix. When r \ll \min(d, k), which is the regime where LoRA operates, this provides significant savings during both training and inference. For a typical transformer layer where d = k = 4096 and r = 16, the LoRA path requires roughly 0.8% of the operations needed to compute with a full update matrix.
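
A small sketch makes the difference tangible. The dimensions match the example above; the operation counts are rough multiply-add estimates, not measured timings, and both parenthesizations produce the same output:

Code
import torch

torch.manual_seed(0)
d, k, r = 4096, 4096, 16
W0 = torch.randn(d, k) * 0.02
B = torch.randn(d, r) * 0.02
A = torch.randn(r, k) * 0.02
x = torch.randn(k)

# Efficient path: project down through the rank-r bottleneck, then back up
z = A @ x               # ~r*k multiply-adds
delta_h = B @ z         # ~d*r multiply-adds
h = W0 @ x + delta_h

# Naive path: materialize the full d x k update BA before applying it
# (forming BA alone costs ~d*r*k, applying it another ~d*k)
h_naive = W0 @ x + (B @ A) @ x

print(torch.allclose(h, h_naive, atol=1e-4))   # True: identical result
print(f"LoRA path: ~{r * (d + k):,} ops vs full update: ~{d * k:,} ops")
# 131,072 vs 16,777,216 -> roughly 0.8% of the work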

Out[2]:
Visualization
The LoRA bottleneck compresses 64-dimensional inputs through an 8-dimensional intermediate representation (z = Ax), then expands back to 64-dimensional outputs (Δh = Bz). This forced compression through just 8 dimensions represents an 8x reduction that demonstrates LoRA's dramatic parameter efficiency. The bottleneck acts as an information filter, learning which features matter most for task adaptation while discarding task-irrelevant dimensions.

Scaling Factor

The original LoRA paper introduces a scaling factor \alpha to control the magnitude of the low-rank update. This scaling factor plays a crucial role in making LoRA practical across different settings and ranks.

The scaled formulation is:

h = W_0 x + \frac{\alpha}{r} BA x

where:

  • h: the output vector
  • W_0: the pre-trained weight matrix
  • x: the input vector
  • \alpha: a constant scaling factor (hyperparameter)
  • r: the rank of the LoRA matrices
  • B: the up-projection matrix
  • A: the down-projection matrix
  • BA: the low-rank update matrix product

The ratio \frac{\alpha}{r} serves several important purposes.

  • Learning rate stability: Without the scaling factor, changing the rank r would fundamentally alter the magnitude of updates produced by the LoRA path. A higher rank means more terms contributing to the product BA, which, all else being equal, would produce larger outputs. By dividing by r, we normalize this effect, allowing the same learning rate to produce updates of similar magnitude regardless of rank choice.
  • Initialization compatibility: As we'll see in the initialization section below, the scaling interacts with how we initialize A and B to ensure consistent behavior. The combination of proper initialization and scaling means that the model's behavior at the start of training doesn't depend sensitively on the rank choice.
  • Hyperparameter transfer: Because the scaling normalizes the effect of rank, a good value of \alpha often transfers across different rank choices. This simplifies hyperparameter search considerably: you can tune \alpha at a lower rank where experiments are cheaper, then scale up the rank while keeping \alpha fixed.

In practice, many implementations set \alpha = r, making the scaling factor equal to 1 and effectively removing the normalization. Others treat \alpha as a tunable hyperparameter, typically in the range [8, 64]. The choice depends on whether you want rank-independent behavior (use \alpha/r scaling) or prefer to absorb any magnitude adjustments into the learning rate (set \alpha = r).

Low-Rank Decomposition Analysis

The mathematical foundation of LoRA rests on the hypothesis that weight updates during fine-tuning have low intrinsic rank. This is a strong claim that deserves careful examination. Why should task-specific adaptations live in a low-dimensional subspace? To understand why this might be true, we can connect LoRA to concepts from matrix approximation theory and explore what empirical evidence tells us about the structure of fine-tuning updates.

Connection to Singular Value Decomposition

Recall from our discussion of SVD in Part III that any matrix can be decomposed into a sum of rank-1 components, ordered by their importance. This decomposition shows when and why low-rank approximations work well.

Any matrix M \in \mathbb{R}^{d \times k} can be decomposed as:

M = U \Sigma V^T = \sum_{i=1}^{\min(d,k)} \sigma_i u_i v_i^T

where:

  • M: the matrix to be decomposed
  • U, V: orthogonal matrices containing the singular vectors
  • \Sigma: diagonal matrix containing the singular values \sigma_i
  • \sigma_i: the singular values, ordered such that \sigma_1 \geq \sigma_2 \geq \ldots \geq 0
  • u_i, v_i: the left and right singular vectors corresponding to \sigma_i

The singular values tell us how much each rank-1 component contributes to the matrix. The largest singular value \sigma_1 corresponds to the most important direction, the direction along which the matrix stretches inputs the most. Subsequent singular values capture progressively less important directions.

The Eckart-Young-Mirsky theorem tells us that the best rank-r approximation to M in the Frobenius norm is obtained by keeping only the first r terms:

M_r = \sum_{i=1}^{r} \sigma_i u_i v_i^T

where:

  • M_r: the rank-r matrix that best approximates M
  • r: the target rank (number of components retained)
  • \sigma_i: the singular values
  • u_i, v_i: the singular vectors

Among all possible rank-r matrices, the truncated SVD provides the one closest to M. The approximation error is precisely:

\|M - M_r\|_F^2 = \sum_{i=r+1}^{\min(d,k)} \sigma_i^2

where:

  • \|M - M_r\|_F^2: the squared Frobenius norm of the approximation error
  • \|\cdot\|_F: the Frobenius norm (square root of the sum of squared elements)
  • \sigma_i: the singular values corresponding to the discarded directions

This formula reveals when low-rank approximations work well: if the singular values decay rapidly, then the terms we discard when truncating contribute little to the total, and a low-rank approximation captures most of the matrix's structure. Conversely, if all singular values are similar in magnitude, truncation loses substantial information.
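
The error identity is easy to verify numerically. The sketch below (a minimal example with arbitrary dimensions) truncates the SVD of a random matrix at rank r and compares the squared Frobenius error against the sum of the discarded squared singular values:

Code
import torch

torch.manual_seed(0)
d, k, r = 50, 40, 10

M = torch.randn(d, k)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)

# Best rank-r approximation: keep the r largest singular triplets
M_r = (U[:, :r] * S[:r]) @ Vh[:r, :]

error_sq = torch.norm(M - M_r) ** 2   # squared Frobenius norm of the residual
discarded_sq = (S[r:] ** 2).sum()     # sum of the discarded sigma_i^2

print(f"||M - M_r||_F^2        = {error_sq.item():.4f}")
print(f"sum of discarded s_i^2 = {discarded_sq.item():.4f}")  # the two match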

Empirical studies of fine-tuned models provide encouraging evidence for LoRA's approach. When researchers compute the weight changes \Delta W_{\text{full}} = W_{\text{finetuned}} - W_{\text{pretrained}} from full fine-tuning runs and examine their singular value spectra, they find rapid decay. A small number of singular values dominate while most are near zero. This suggests that the actual weight changes needed for successful fine-tuning can be well approximated by low-rank matrices.

Out[3]:
Visualization
Singular value decay on a logarithmic scale reveals how different matrix types compress information. Fine-tuning weight updates (fast decay) capture 95 percent of information by rank 16, while medium and slow decay matrices require progressively higher ranks. The shaded region marks the effective LoRA operating range (r = 1 to 16) where compression and adaptation quality balance optimally.
Cumulative singular value energy versus rank reveals how rapidly matrices compress. Fine-tuning updates with fast decay reach 95 percent information content at rank 8, while medium decay requires rank 16. Practical LoRA uses r = 8 to r = 16 to capture nearly all useful information in typical fine-tuning updates while achieving strong parameter compression.

Intrinsic Dimensionality Hypothesis

The effectiveness of LoRA connects to the broader concept of intrinsic dimensionality in neural network training. Despite having millions or billions of parameters, models need far fewer effective degrees of freedom for successful training. Related findings, such as the lottery ticket hypothesis and measurements of intrinsic dimensionality, suggest that neural networks are vastly overparameterized for their tasks.

For fine-tuning specifically, the intuition supporting low-rank updates emerges from understanding what fine-tuning actually accomplishes:

  1. Pre-training captures general structure: The pre-trained weights W_0 already encode rich representations of language. These weights have been shaped by exposure to vast amounts of text, learning general patterns about syntax, semantics, world knowledge, and reasoning. These representations are highly expressive and broadly useful.
  2. Fine-tuning makes targeted adjustments: Adapting to a specific task requires modifying only certain aspects of these representations. A sentiment classifier needs only to adjust how certain features map to sentiment labels, not reorganize the model's understanding of language. A summarization model doesn't need new linguistic knowledge; it needs to learn task-specific patterns about what information to preserve and condense.
  3. Targeted adjustments are low-rank: These task-specific modifications can often be expressed as linear combinations of a small number of directions in weight space. If fine-tuning primarily adjusts how the model uses existing features rather than learning entirely new ones, the weight changes will have structure that low-rank matrices can capture.

Full unconstrained fine-tuning doesn't produce exactly low-rank updates, but low-rank approximations of those updates suffice for most downstream tasks. Full-rank updates might capture noise or provide marginal gains on very demanding tasks; for typical applications, the low-rank constraint sacrifices little while gaining substantial efficiency.

Rank Constraint Interpretation

The rank constraint \text{rank}(\Delta W) \leq r has a geometric interpretation that shows what LoRA learns during training. The matrix \Delta W = BA maps inputs through an r-dimensional bottleneck, forcing all information about the input to pass through a low-dimensional representation before influencing the output.

Information flows through the LoRA path in two stages. First, the input vector x is projected down to an r-dimensional representation z by the matrix A. Then, this compressed representation is expanded back to the full output dimension by the matrix B. Mathematically:

x \xrightarrow{A} z \in \mathbb{R}^r \xrightarrow{B} \Delta h \in \mathbb{R}^d

where:

  • x: the input vector
  • A: the down-projection matrix
  • z: the intermediate low-rank representation
  • B: the up-projection matrix
  • \Delta h: the update vector in the output space
  • r: the rank of the adaptation
  • d: the output dimension

This bottleneck structure implies several important properties. First, the update \Delta h necessarily lies in the column space of B, which has dimension at most r. No matter what input x we provide, the LoRA path can only produce outputs in this restricted subspace. Second, different inputs can only produce updates in this same r-dimensional subspace. The LoRA update cannot independently adjust every dimension of the output; it must work within the constraints of the learned subspace. Third, and importantly, the subspace is learned during training, not predetermined. The optimization process discovers which directions in weight space are most useful for the task at hand.

The training process simultaneously learns two complementary aspects of the adaptation. The matrix B encodes which r-dimensional subspace of the output space should receive updates, effectively selecting the "directions" in which the model's behavior should change. The matrix A encodes how to project inputs onto this subspace, determining which aspects of the input should influence these changes and by how much. Together, A and B learn both the structure of the adaptation and how to apply it based on the input.
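
The column-space restriction can also be checked directly. In the sketch below (arbitrary sizes), many different inputs are pushed through the LoRA path, yet the stacked update vectors never span more than r dimensions:

Code
import torch

torch.manual_seed(0)
d, k, r = 64, 48, 4

B = torch.randn(d, r)
A = torch.randn(r, k)

# Push many random inputs through the LoRA path and stack the updates
X = torch.randn(k, 100)          # 100 different inputs as columns
delta_H = B @ (A @ X)            # each column is one update vector in R^d

# Even with 100 outputs in a 64-dimensional space, their span has rank <= r
print(torch.linalg.matrix_rank(delta_H))  # tensor(4)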

Rank Selection

Choosing the rank r involves balancing expressiveness against efficiency. Rank is a primary hyperparameter in LoRA, and understanding its trade-offs is essential for practical application. Lower ranks use fewer parameters and less computation but constrain the adaptation's capacity to represent complex changes to the model's behavior.

Parameter Count Analysis

To understand the efficiency gains from LoRA, we need to compare the number of trainable parameters against full fine-tuning. For a weight matrix W_0 \in \mathbb{R}^{d \times k}, full fine-tuning requires learning dk parameters. LoRA instead introduces:

\text{LoRA parameters} = dr + rk = r(d + k)

where:

  • d: output dimension of the layer
  • k: input dimension of the layer
  • r: rank of the adaptation

The parameter ratio compared to full fine-tuning reveals the compression achieved:

\frac{r(d + k)}{dk} = \frac{r}{d} + \frac{r}{k}

where:

  • dk: total parameters in the full weight matrix
  • r(d+k): total parameters in the LoRA adapters
  • r: rank of the adaptation
  • d: output dimension
  • k: input dimension

This ratio decreases as the layer dimensions grow, meaning LoRA becomes proportionally more efficient for larger models. For a transformer with d = k = 4096, which is typical of modern large language models, and r = 16:

\frac{16(4096 + 4096)}{4096 \times 4096} = \frac{131,072}{16,777,216} \approx 0.78\%

where:

  • 16: the rank r used in this example
  • 4096: the input and output dimensions (d and k)
  • 0.78\%: the resulting parameter efficiency ratio

This calculation shows that LoRA with rank 16 uses less than 1 percent of the parameters that full fine-tuning would require for this single layer. The dramatic reduction holds across the entire model. If we apply LoRA to all attention projections (the query, key, value, and output matrices W_Q, W_K, W_V, W_O) in each layer, the total trainable parameters remain a small fraction of the original model while still adapting the most important components of the transformer architecture.

Out[4]:
Visualization
LoRA parameter efficiency for a 4096-dimensional transformer layer demonstrates compression ratios from 30x (r=4) to 500x (r=64). Ranks r=8 to r=16 use only 0.2 to 1 percent of full fine-tuning parameters, representing the practical sweet spot where efficiency gains are substantial while maintaining adaptation capacity for typical NLP fine-tuning tasks.
LoRA parameter efficiency improves with model size across ranks 8, 16, and 64 (shown on logarithmic axes). At dimension d=8192 (typical for very large models), rank-16 adapters use under 0.4 percent of full fine-tuning parameters. This scaling benefit makes it possible to fine-tune massive language models on consumer hardware with limited memory.

Expressiveness vs. Efficiency Trade-off

The rank determines the capacity of the adaptation, controlling how expressive the learned weight changes can be. Different ranks are appropriate for different scenarios:

  • r = 1: The update is a rank-1 matrix \Delta W = ba^T (the outer product of two vectors). This is highly constrained, representing the simplest possible non-trivial update. Despite this severe limitation, rank-1 adaptations sometimes suffice for simple tasks like binary classification, where the model primarily needs to adjust a single decision boundary.
  • r = 4 to r = 16: These are common choices that balance efficiency and expressiveness for most NLP tasks. Many benchmarks show that ranks in this range achieve performance comparable to full fine-tuning while maintaining substantial parameter efficiency. This range represents the "sweet spot" for typical applications.
  • r = 64 to r = 256: Higher capacity for complex adaptations, multi-task scenarios, or cases where lower ranks demonstrably underperform. These ranks sacrifice some efficiency for increased expressiveness, appropriate when the task demands more complex adaptations than lower ranks can represent.
  • r = \min(d, k): Full rank, equivalent to unconstrained fine-tuning of the layer. At this extreme, LoRA provides no parameter reduction but still maintains the structure of learning updates as a product of two matrices. This is primarily useful as a theoretical comparison point.

Empirically, rank selection depends on several factors that practitioners should consider. Task complexity plays a significant role. Simple classification may need only r = 8, while complex generation might benefit from r = 64 or higher. Dataset size interacts with rank in interesting ways. Smaller datasets benefit from lower ranks because the rank constraint provides implicit regularization, preventing overfitting. Larger datasets can exploit higher ranks without overfitting and may show continued improvement as rank increases. Finally, target modules often have different optimal ranks. Attention projections frequently need lower ranks than feed-forward layers, possibly because attention primarily routes information while feed-forward layers perform more complex transformations.

Effective Rank During Training

Trained LoRA adapters show an interesting pattern: even when r is set relatively high, the effective rank of the learned BA is often lower than the maximum possible. The optimization process tends to concentrate the adaptation into fewer dimensions than the allocated rank would allow.

The effective rank can be measured using the singular values of BA. One natural measure based on information theory follows.

\text{effective rank}(BA) = \exp\left(-\sum_{i} \tilde{\sigma}_i \log \tilde{\sigma}_i\right)

where:

  • \tilde{\sigma}_i: the normalized singular values, such that \sum_i \tilde{\sigma}_i = 1
  • \sigma_i: the i-th singular value of the matrix BA
  • \exp(\cdot): the exponential function (computes the perplexity of the singular value distribution)

This entropy-based measure, sometimes called the spectral entropy or perplexity of the singular value distribution, indicates how many singular values contribute meaningfully to the matrix. If all r singular values are equal, the effective rank equals r. If one singular value dominates, the effective rank approaches 1. Values between these extremes indicate intermediate concentration.

Studies examining trained LoRA adapters show that they often converge to solutions where only a few singular values dominate, even when the specified r is much larger. This observation suggests that the actual intrinsic rank needed for the adaptation is lower than the specified r. It also suggests potential improvements: methods like AdaLoRA, covered in a later chapter, exploit this observation by adaptively adjusting the rank during training rather than fixing it in advance.
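
A small helper makes the measure concrete. The sketch below is illustrative only: instead of a trained adapter, it builds a simulated BA in which a couple of bottleneck directions are scaled up to dominate, mimicking the concentration pattern described above, and then computes the entropy-based effective rank:

Code
import torch

def effective_rank(M: torch.Tensor) -> float:
    """Exponential of the entropy of the normalized singular value distribution."""
    s = torch.linalg.svdvals(M)
    s = s / s.sum()                      # normalize so the values sum to 1
    s = s[s > 0]                         # guard against log(0)
    entropy = -(s * torch.log(s)).sum()
    return torch.exp(entropy).item()

torch.manual_seed(0)
d, k, r = 64, 64, 16
B = torch.randn(d, r)
A = torch.randn(r, k)

# Simulate concentration: shrink all but the first two bottleneck directions
scale = torch.tensor([1.0, 0.8] + [0.05] * (r - 2))
BA_concentrated = (B * scale) @ A

print(f"allocated rank: {r}")
print(f"effective rank: {effective_rank(BA_concentrated):.2f}")  # far below 16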

Initialization Scheme

LoRA's initialization is crucial for training stability and convergence. How you initialize the matrices A and B determines the model's starting point and influences the entire training trajectory. The standard scheme is:

  • Matrix A: initialized from \mathcal{N}(0, \sigma^2) (Gaussian) or with Kaiming initialization
  • Matrix B: initialized to zero

This asymmetric initialization, with A random and B zero, ensures that the product BA = 0 at the start of training. This means the model begins with its pre-trained behavior completely intact: the LoRA path contributes nothing to the output until training begins to modify the parameters.

Rationale for Zero Initialization of B

Starting with \Delta W = BA = 0 has several important advantages:

  1. Continuity from pretraining: The model's initial behavior exactly matches the pretrained model. This is valuable because the pretrained model already performs well on many tasks. No "warm-up" period is needed for the model to recover from a random perturbation. From the first training step, we are refining good behavior rather than recovering from a disrupted starting point.

  2. Stable training dynamics: Large random initializations of both A and B could produce \|BA\| values that significantly perturb the pre-trained representations. If the initial perturbation is large, early training might focus on undoing this damage rather than learning the task. By starting at zero, we ensure that the first training steps are devoted entirely to task-relevant adaptation.

  3. Gradient flow: A natural concern with zero initialization is whether gradients will flow properly. If B = 0 initially, won't the gradient with respect to B also be zero, preventing any learning? Fortunately, this is not the case. The gradient with respect to B depends on the input to the LoRA layer and the upstream gradient, not on the current value of B. As we'll derive in the gradient section below, B receives non-zero gradients even when it equals zero.

Variance Analysis

Analyzing the initialization variance explains why the standard scheme works better than initializing both matrices randomly. If we initialize both A and B with entries drawn from \mathcal{N}(0, \sigma^2), what magnitude would we expect for the entries of the product BA?

For a single entry of the product:

\begin{aligned} \mathbb{E}[(BA)_{ij}^2] &= \mathbb{E}\left[\left(\sum_{l=1}^{r} B_{il} A_{lj}\right)^2\right] \\ &= \sum_{l=1}^{r} \mathbb{E}[B_{il}^2 A_{lj}^2] + \sum_{l \neq m} \mathbb{E}[B_{il} A_{lj} B_{im} A_{mj}] && \text{(expand square)} \\ &= \sum_{l=1}^{r} \mathbb{E}[B_{il}^2] \, \mathbb{E}[A_{lj}^2] + 0 && \text{(independence and zero mean)} \\ &= \sum_{l=1}^{r} \sigma^2 \sigma^2 && \text{(variance def.)} \\ &= r\sigma^4 \end{aligned}

where:

  • (BA)_{ij}: the element at row i, column j of the product
  • B_{il}, A_{lj}: elements of the random matrices
  • \sigma^2: the variance of the initialization distribution
  • r: the rank (number of terms in the sum)

The key observation is that this expected squared magnitude grows linearly with r. This rank dependence creates a problem: if we use the same initialization variance and learning rate for different ranks, higher ranks will produce larger initial perturbations and larger gradients. This would require rank-dependent learning rate adjustment to maintain consistent training dynamics. Zero-initializing B sidesteps this issue entirely by ensuring the initial perturbation is exactly zero regardless of rank.
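
A quick Monte Carlo check (a sketch with arbitrary sizes and σ chosen only for illustration) confirms the rσ⁴ prediction: when both factors are drawn from N(0, σ²), the mean squared entry of BA grows linearly with the rank:

Code
import torch

torch.manual_seed(0)
d, k, sigma = 64, 64, 0.1

for r in [1, 4, 16, 64]:
    B = torch.randn(d, r) * sigma
    A = torch.randn(r, k) * sigma
    mean_sq = ((B @ A) ** 2).mean().item()      # empirical E[(BA)_ij^2]
    predicted = r * sigma**4                    # theoretical value
    print(f"r={r:3d}  empirical {mean_sq:.2e}   predicted r*sigma^4 {predicted:.2e}")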

Out[5]:
Visualization
Random initialization of both matrices A and B for a 64x64 layer creates rank-dependent perturbations. The Frobenius norm of BA grows linearly with rank. Perturbations range from 0.5 at rank 1 to 4 at rank 64, demonstrating that higher ranks produce larger initial disruptions to pretrained weights, requiring rank-specific learning rate adjustments.
Standard LoRA initialization with B equal to zero ensures the Frobenius norm of BA is exactly zero across all ranks, eliminating rank-dependent perturbations. BA grows smoothly and consistently after the first gradient step. This scheme preserves pretrained model behavior at initialization while providing stable learning dynamics across any rank choice.

For matrix A, the standard initialization uses a variance that scales inversely with the rank:

A_{lj} \sim \mathcal{N}\left(0, \frac{1}{r}\right)

where:

  • A_{lj}: the element of matrix A at row l, column j
  • \mathcal{N}(\mu, \sigma^2): the normal distribution with mean \mu and variance \sigma^2
  • r: the rank of the adaptation
  • 1/r: the variance, chosen to scale inversely with the rank

This scaling ensures that when gradients begin flowing and B starts to move away from zero, the updates to B have reasonable magnitude regardless of r. The variance is chosen so that the projection Ax has an expected squared norm that doesn't grow with r. Some implementations use Kaiming initialization instead, setting:

A_{lj} \sim \mathcal{N}\left(0, \frac{2}{k}\right)

where:

  • A_{lj}: element of matrix A
  • k: the input dimension of the layer

Kaiming initialization is designed to preserve the variance of activations through deep networks and is a reasonable alternative choice.
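
The sketch below (arbitrary dimensions) compares the two choices empirically, checking that with variance 1/r the squared norm of the projection Ax stays roughly constant as the rank grows, while the 2/k variant grows with r:

Code
import math
import torch

torch.manual_seed(0)
k = 512
x = torch.randn(k)

for r in [4, 16, 64]:
    A_lora = torch.randn(r, k) / math.sqrt(r)         # variance 1/r
    A_kaiming = torch.randn(r, k) * math.sqrt(2 / k)  # Kaiming-style, variance 2/k
    n_lora = (A_lora @ x).pow(2).sum().item()
    n_kaiming = (A_kaiming @ x).pow(2).sum().item()
    print(f"r={r:3d}  ||Ax||^2 with 1/r init: {n_lora:7.1f}   with 2/k init: {n_kaiming:7.1f}")
# With variance 1/r, E||Ax||^2 ~ ||x||^2 ~ k regardless of r;
# with variance 2/k it grows roughly linearly with r.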

Alternative Initializations

While zero-initialization of B is standard and works well in most cases, researchers have explored alternatives that may offer advantages in certain scenarios:

  • SVD initialization: Initialize BA as a low-rank approximation of W_{\text{full-finetune}} - W_0 from a reference full fine-tuning run. This requires running full fine-tuning once, but if you need to train many LoRA variants, initializing from the SVD of a reference solution can accelerate convergence significantly.
  • Symmetric initialization: Initialize both A and B with small random values, accepting the initial perturbation to pre-trained behavior. This may help when the pre-trained model's initial behavior is far from desired, though it requires careful tuning of the initialization scale.
  • Task-informed initialization: Use task-specific heuristics based on prior knowledge about the adaptation needed. For example, if adapting to a specific domain, initialize using the SVD of domain-specific text representations.

We'll explore adaptive initialization strategies in the AdaLoRA chapter, where the initialization interacts with mechanisms that adjust rank during training.

LoRA Gradient Computation

Gradient flow through LoRA reveals its training dynamics and connects to the optimization concepts from Part VII. The gradient formulas show why the zero initialization of B doesn't prevent learning and how A and B co-evolve during training.

Gradient Derivation

Consider the forward pass for a single layer with LoRA adaptation:

h = W_0 x + \frac{\alpha}{r} BA x

where:

  • h: the output vector
  • W_0: the pre-trained weight matrix
  • x: the input vector
  • \alpha: the scaling factor
  • r: the rank
  • B, A: the LoRA up-projection and down-projection matrices

Let \mathcal{L} be the loss function we're minimizing. To update A and B via gradient descent, we compute \frac{\partial \mathcal{L}}{\partial A} and \frac{\partial \mathcal{L}}{\partial B}. The chain rule traces how parameter changes propagate through the computation to affect the loss.

First, let's define the upstream gradient. Let \frac{\partial \mathcal{L}}{\partial h} = g \in \mathbb{R}^d be the gradient of the loss with respect to the layer's output. This gradient comes from the layers above in the network and tells us how changes in h affect the loss.

For the gradient with respect to B, we apply the chain rule. The LoRA contribution to h is \frac{\alpha}{r} B(Ax). Differentiating this with respect to B, and then multiplying by how changes in h affect the loss:

\begin{aligned} \frac{\partial \mathcal{L}}{\partial B} &= \frac{\alpha}{r} \frac{\partial \mathcal{L}}{\partial h} \frac{\partial h}{\partial B} && \text{(chain rule)} \\ &= \frac{\alpha}{r} g (Ax)^T && \text{(substitute derivatives)} \end{aligned}

where:

  • \mathcal{L}: the loss function
  • g: the upstream gradient vector \frac{\partial \mathcal{L}}{\partial h}
  • Ax: the intermediate vector (the input projected down to rank r)
  • \alpha: the scaling factor
  • r: the rank

The gradient of B is an outer product between the upstream gradient g and the intermediate representation Ax. The (i, l) entry of this gradient measures how much increasing B_{il} would affect the loss, which depends on how much gradient signal arrives at output dimension i (captured by g_i) and how active the corresponding bottleneck dimension l was (captured by (Ax)_l).

For processing batches efficiently, we extend this to matrix form:

\frac{\partial \mathcal{L}}{\partial B} = \frac{\alpha}{r} G (AX)^T

where:

  • \mathcal{L}: the loss function
  • \alpha: the scaling factor
  • r: the rank
  • G \in \mathbb{R}^{d \times n}: the matrix of upstream gradients for the batch
  • X \in \mathbb{R}^{k \times n}: the matrix of input vectors for the batch
  • A: the down-projection matrix

For the gradient with respect to A, we again apply the chain rule. Now we need to trace how changes in A affect h through the intermediate computation Ax:

\begin{aligned} \frac{\partial \mathcal{L}}{\partial A} &= \frac{\alpha}{r} B^T \frac{\partial \mathcal{L}}{\partial h} x^T && \text{(chain rule)} \\ &= \frac{\alpha}{r} B^T g x^T && \text{(substitute } g \text{)} \end{aligned}

where:

  • B^T: the transpose of the up-projection matrix
  • \mathcal{L}: the loss function
  • \alpha: the scaling factor
  • r: the rank
  • g: the upstream gradient vector
  • x: the input vector

This formula shows that the gradient with respect to A involves projecting the upstream gradient back through B (using B^T g) and then forming an outer product with the input x. In matrix form for batch processing:

\frac{\partial \mathcal{L}}{\partial A} = \frac{\alpha}{r} B^T G X^T

where:

  • \mathcal{L}: the loss function
  • \alpha: the scaling factor
  • r: the rank
  • B^T: the transpose of the up-projection matrix
  • G: the upstream gradient matrix
  • X^T: the transpose of the input batch matrix
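
These closed-form expressions can be checked against autograd. The sketch below (small, arbitrary dimensions, with B deliberately nonzero so both gradients are nonzero) runs a forward pass, backpropagates a batch of upstream gradients G, and compares PyTorch's gradients for A and B with the batch formulas above:

Code
import torch

torch.manual_seed(0)
d, k, r, n = 6, 5, 3, 4
alpha = 2.0

W0 = torch.randn(d, k)
A = torch.randn(r, k, requires_grad=True)
B = torch.randn(d, r, requires_grad=True)
X = torch.randn(k, n)     # batch of n inputs as columns
G = torch.randn(d, n)     # upstream gradients dL/dh for the batch

H = W0 @ X + (alpha / r) * B @ (A @ X)
H.backward(G)             # inject G as dL/dH

with torch.no_grad():
    grad_B_manual = (alpha / r) * G @ (A @ X).T   # (alpha/r) G (AX)^T
    grad_A_manual = (alpha / r) * B.T @ G @ X.T   # (alpha/r) B^T G X^T

print(torch.allclose(B.grad, grad_B_manual, atol=1e-5))  # True
print(torch.allclose(A.grad, grad_A_manual, atol=1e-5))  # True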

Gradient Flow Properties

Several properties of these gradients are noteworthy and reveal important aspects of LoRA's training dynamics.

Independence from W_0: The gradients \frac{\partial \mathcal{L}}{\partial A} and \frac{\partial \mathcal{L}}{\partial B} do not depend explicitly on W_0. The frozen weights affect the upstream gradient g but never appear directly in the LoRA gradient formulas. This separation enables efficient training where W_0 is never updated: you don't need to compute or store gradients for the much larger frozen weight matrices.

Coupled dynamics: Although A and B are separate parameters, their gradients are intimately coupled:

  • \frac{\partial \mathcal{L}}{\partial B} depends on A through the term Ax
  • \frac{\partial \mathcal{L}}{\partial A} depends on B through the term B^T g

This coupling means A and B co-evolve during training, with A's current value determining B's gradient and vice versa. This is similar to the dynamics in other factorized parameterizations, leading to interesting optimization behavior.

Zero initialization dynamics: At initialization (B = 0):

  • \frac{\partial \mathcal{L}}{\partial B} = \frac{\alpha}{r} g (Ax)^T \neq 0 (generally non-zero)
  • \frac{\partial \mathcal{L}}{\partial A} = \frac{\alpha}{r} B^T g x^T = 0

This asymmetry is important: B updates immediately from the first gradient step, while A has zero gradient initially. However, after the first update step, B \neq 0, and both matrices begin receiving non-zero gradients. The initial zero-gradient phase for A is brief, lasting only a single step in theory, though in practice the gradients for A remain small until B has moved appreciably away from zero.

Computational Cost of Gradient Computation

LoRA's gradient computations are efficient, requiring operations proportional to the parameter count rather than the full weight matrix size.

Computing \frac{\partial \mathcal{L}}{\partial B} = \frac{\alpha}{r} g z^T, where z = Ax, has the following costs: during the forward pass, we must store z \in \mathbb{R}^r, adding minimal memory overhead. The gradient computation itself is an outer product between vectors of dimension d and r, requiring O(dr) operations.

Computing \frac{\partial \mathcal{L}}{\partial A} = \frac{\alpha}{r} B^T g x^T requires first computing B^T g, which is a matrix-vector product costing O(dr) operations, followed by forming the outer product with x^T, which costs O(rk) operations.

The total gradient computation is therefore O(r(d + k)), matching the parameter count. This is optimal: we need at least one operation per parameter to compute a gradient, and LoRA achieves this lower bound up to constant factors. This efficiency extends to the memory required for gradient storage, which also scales with r(d+k) rather than dk.

Worked Example

Concrete numbers solidify the mathematical concepts. This example connects abstract formulas to actual numerical operations.

Consider a small weight matrix W_0 \in \mathbb{R}^{4 \times 3} with a rank r = 2 adaptation.

W_0 = \begin{bmatrix} 1.0 & 0.5 & -0.3 \\ 0.2 & 1.0 & 0.4 \\ -0.1 & 0.3 & 1.0 \\ 0.5 & -0.2 & 0.1 \end{bmatrix}

where:

  • W_0: the pre-trained weight matrix

Initialize LoRA matrices following the standard scheme:

A = \begin{bmatrix} 0.3 & -0.5 & 0.2 \\ 0.4 & 0.1 & -0.3 \end{bmatrix}, \quad B = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}

where:

  • A: the down-projection matrix (randomly initialized)
  • B: the up-projection matrix (initialized to zero)

For the input x = [1.0, 0.5, -0.2]^T with scaling factor \alpha = 2:

Forward pass:

Tracing through each step of the forward computation:

  1. Pre-trained output: We compute W_0 x by multiplying the pre-trained weight matrix with the input. For the first element: 1.0 \cdot 1.0 + 0.5 \cdot 0.5 + (-0.3)(-0.2) = 1.0 + 0.25 + 0.06 = 1.31. Continuing for all elements: W_0 x = [1.31, 0.62, -0.15, 0.38]^T

  2. LoRA intermediate: We compute z = Ax, projecting the input down to the rank-2 bottleneck. First element: 0.3 \cdot 1.0 + (-0.5) \cdot 0.5 + 0.2 \cdot (-0.2) = 0.3 - 0.25 - 0.04 = 0.01. Second element: 0.4 \cdot 1.0 + 0.1 \cdot 0.5 + (-0.3) \cdot (-0.2) = 0.4 + 0.05 + 0.06 = 0.51. So z = [0.01, 0.51]^T

  3. LoRA update: We compute Bz, projecting back up to the output dimension. Since B = 0, this gives Bz = [0, 0, 0, 0]^T

  4. Final output: h = W_0 x + \frac{\alpha}{r} Bz = [1.31, 0.62, -0.15, 0.38]^T + \frac{2}{2}[0, 0, 0, 0]^T = [1.31, 0.62, -0.15, 0.38]^T

At initialization, the output equals the pre-trained output exactly. This confirms that zero-initializing B preserves the pre-trained model's behavior.

Gradient computation:

With the upstream gradient g = [0.1, -0.2, 0.3, 0.1]^T from the loss and the layers above, we compute gradients for both LoRA matrices.

For B:

\begin{aligned} \frac{\partial \mathcal{L}}{\partial B} &= \frac{\alpha}{r} g z^T \\ &= \frac{2}{2} \begin{bmatrix} 0.1 \\ -0.2 \\ 0.3 \\ 0.1 \end{bmatrix} \begin{bmatrix} 0.01 & 0.51 \end{bmatrix} \\ &= \begin{bmatrix} 0.001 & 0.051 \\ -0.002 & -0.102 \\ 0.003 & 0.153 \\ 0.001 & 0.051 \end{bmatrix} \end{aligned}

where:

  • \mathcal{L}: the loss function
  • B: the up-projection matrix
  • \alpha: the scaling factor (\alpha = 2)
  • r: the rank (r = 2)
  • g: the upstream gradient vector
  • z: the intermediate activation vector computed in the forward pass

For A:

\begin{aligned} \frac{\partial \mathcal{L}}{\partial A} &= \frac{\alpha}{r} B^T g x^T \\ &= \frac{2}{2} \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} g x^T \\ &= \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \end{aligned}

where:

  • \mathcal{L}: the loss function
  • A: the down-projection matrix
  • B^T: the transpose of the up-projection matrix (currently zero)
  • g: the upstream gradient vector
  • x: the input vector

As expected from our theoretical analysis, B receives non-zero gradients while A's gradient is zero at initialization. After one gradient descent step updates B, subsequent forward passes will produce non-zero B^T g terms, and A will begin receiving gradients as well.
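
To tie the hand calculation back to code, this short sketch reproduces the worked example with autograd and checks the forward output and both gradient matrices against the values computed above:

Code
import torch

W0 = torch.tensor([[1.0, 0.5, -0.3],
                   [0.2, 1.0, 0.4],
                   [-0.1, 0.3, 1.0],
                   [0.5, -0.2, 0.1]])
A = torch.tensor([[0.3, -0.5, 0.2],
                  [0.4, 0.1, -0.3]], requires_grad=True)
B = torch.zeros(4, 2, requires_grad=True)
x = torch.tensor([1.0, 0.5, -0.2])
g = torch.tensor([0.1, -0.2, 0.3, 0.1])   # upstream gradient dL/dh
alpha, r = 2.0, 2

h = W0 @ x + (alpha / r) * B @ (A @ x)
print(h)          # [1.31, 0.62, -0.15, 0.38]: matches the hand calculation

h.backward(g)     # inject the upstream gradient
print(B.grad)     # outer product g z^T with z = [0.01, 0.51]
print(A.grad)     # all zeros, because B = 0 at initialization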

Code Implementation

We'll implement the LoRA mathematics in PyTorch, demonstrating the forward pass, gradient computation, and initialization. The implementation verifies our theoretical predictions empirically.

In[6]:
Code
import torch

torch.manual_seed(42)
In[7]:
Code
import math

import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Linear layer with LoRA adaptation"""

    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 4,
        alpha: float = 1.0,
        pretrained_weight: torch.Tensor = None,
    ):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.scaling = alpha / rank

        # Frozen pre-trained weight
        if pretrained_weight is None:
            pretrained_weight = torch.randn(out_features, in_features) * 0.02
        self.register_buffer("W0", pretrained_weight)

        # LoRA matrices (trainable)
        self.A = nn.Parameter(torch.randn(rank, in_features) / math.sqrt(rank))
        self.B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass with LoRA adaptation."""
        # Original forward pass
        h0 = F.linear(x, self.W0)

        # LoRA path: x -> A -> B (with scaling)
        z = F.linear(x, self.A)  # Shape: (batch, rank)
        delta_h = F.linear(z, self.B)  # Shape: (batch, out_features)

        return h0 + self.scaling * delta_h

    def get_merged_weight(self) -> "torch.Tensor":
        """Return W0 + (alpha/r) * BA for inference."""
        return self.W0 + self.scaling * (self.B @ self.A)

Let's verify the initialization produces zero updates:

In[8]:
Code
# Create a LoRA layer
layer = LoRALinear(in_features=64, out_features=128, rank=8, alpha=8.0)

# Check that BA = 0 at initialization
BA = layer.B @ layer.A
shape_a = layer.A.shape
shape_b = layer.B.shape
shape_ba = BA.shape
max_ba = BA.abs().max().item()
Out[9]:
Console
Shape of A: torch.Size([8, 64])
Shape of B: torch.Size([128, 8])
Shape of BA: torch.Size([128, 64])
Max absolute value in BA: 0.000e+00

The product BA is exactly zero because B is initialized to zero, confirming that our model starts with the pretrained behavior. Matrix A has dimensions (rank, in_features) for the down-projection, while B has dimensions (out_features, rank) for the up-projection. Their product BA maintains the full weight matrix dimensions but contains only zeros at initialization, meaning the LoRA path contributes nothing to the output until training begins.

Verifying Gradient Flow

Let's trace gradients through a forward-backward pass:

In[10]:
Code
# Create layer and input
layer = LoRALinear(in_features=8, out_features=4, rank=2, alpha=2.0)
x = torch.randn(1, 8, requires_grad=True)

# Forward pass
h = layer(x)
input_shape = x.shape
output_shape = h.shape

# Backward pass with a simple loss
loss = h.sum()
loss.backward()

grad_a = layer.A.grad
grad_b = layer.B.grad
Out[11]:
Console
Input shape: torch.Size([1, 8])
Output shape: torch.Size([1, 4])
Gradient norm of A: 0.000e+00
Gradient norm of B: 7.864e+00

As predicted by the analysis, matrix A's gradient is zero at initialization because B = 0 enters its gradient formula, while B receives non-zero gradients. This asymmetry demonstrates the coupled dynamics derived mathematically: matrix A begins learning only after B has moved away from zero, creating a brief initial phase where B adapts alone before both matrices co-evolve.

Parameter Efficiency Calculation

Let's compute the parameter savings for a realistic scenario:

In[12]:
Code
def calculate_lora_efficiency(d: int, k: int, r: int, num_layers: int) -> dict:
    """Calculate parameter counts and savings."""
    full_params = d * k * num_layers
    lora_params = r * (d + k) * num_layers
    ratio = lora_params / full_params

    return {
        "full_params": full_params,
        "lora_params": lora_params,
        "ratio_percent": ratio * 100,
        "compression": 1 / ratio,
    }


# LLaMA-7B attention projections: d=k=4096, 32 layers, 4 matrices per layer
results = calculate_lora_efficiency(d=4096, k=4096, r=16, num_layers=32 * 4)
Out[13]:
Console
LLaMA-7B Attention LoRA Analysis (r=16):
  Full fine-tuning params: 2,147,483,648
  LoRA params: 16,777,216
  Percentage of full: 0.781%
  Compression ratio: 128x

This 128x compression achieves a dramatic parameter reduction while maintaining strong adaptation capability. Full fine-tuning of these attention layers requires 2.1 billion parameters, while LoRA needs only 16.8 million trainable parameters, making fine-tuning practical on consumer GPUs with 16-24GB of memory. The reduction also enables efficient multi-task serving, where different LoRA adapters can be swapped without duplicating the base model weights.

Visualizing Rank Effects

Let's visualize how different ranks affect the approximation capacity:

In[14]:
Code
# Create a target matrix that simulates a fine-tuning update
# (low intrinsic rank with some noise)
d, k = 64, 64
true_rank = 8
U_true = torch.randn(d, true_rank)
V_true = torch.randn(true_rank, k)
target_delta_W = U_true @ V_true + 0.1 * torch.randn(d, k)

# Try different LoRA ranks
ranks = [1, 2, 4, 8, 16, 32, 64]
errors = []

for r in ranks:
    # Best rank-r approximation via SVD
    U, S, Vh = torch.linalg.svd(target_delta_W, full_matrices=False)
    approx = (U[:, :r] * S[:r]) @ Vh[:r, :]
    error = torch.norm(target_delta_W - approx).item()
    errors.append(error)

# Store true_rank for use in visualization
true_rank_value = true_rank
Out[15]:
Visualization
Reconstruction error for a 64x64 fine-tuning update matrix with intrinsic rank 8, computed via SVD truncation. Error decreases sharply until rank 8, then levels off at higher ranks. This pattern demonstrates the matrix's true intrinsic rank and explains why LoRA with r=8 to r=16 captures nearly all useful information while achieving strong parameter compression for typical fine-tuning tasks.

The error drops sharply until we reach the true intrinsic rank of the target matrix (8 in this case), then decreases more slowly. This shows why moderate ranks like 8 to 16 often suffice in practice.

Gradient Magnitude Analysis

Let's examine how gradients evolve during the first few training steps:

In[16]:
Code
# Track gradient magnitudes over steps
layer = LoRALinear(in_features=32, out_features=32, rank=4, alpha=4.0)
optimizer = torch.optim.Adam([layer.A, layer.B], lr=0.01)

grad_A_norms = []
grad_B_norms = []
B_norms = []

for step in range(20):
    optimizer.zero_grad()

    x = torch.randn(16, 32)
    target = torch.randn(16, 32)

    output = layer(x)
    loss = F.mse_loss(output, target)
    loss.backward()

    grad_A_norms.append(layer.A.grad.norm().item())
    grad_B_norms.append(layer.B.grad.norm().item())
    B_norms.append(layer.B.data.norm().item())

    optimizer.step()
Out[17]:
Visualization
Gradient magnitudes for matrices A and B during 20 training steps reveal asymmetric learning dynamics. Matrix B receives non-zero gradients from step 1, while matrix A's gradient is zero initially. As B evolves, gradient flow to A enables both matrices to co-develop. This coupled dynamic explains why B must adapt first before joint optimization begins.
Matrix B Frobenius norm grows smoothly from zero to approximately 0.6 over 20 training steps, enabling gradient flow to A through coupled dynamics. This stable, predictable growth shows how zero initialization preserves pretrained behavior while transitioning to full two-matrix adaptation.

The plot confirms the mathematical analysis. Matrix A's gradient starts at zero and grows as B moves away from zero. The coupled dynamics quickly bring both matrices into active training.

Weight Merging for Inference

LoRA's trained adapters can be merged into the base weights for inference:

In[18]:
Code
# After training, we can merge LoRA into the base weights
layer = LoRALinear(in_features=64, out_features=64, rank=8, alpha=8.0)

# Simulate some training
for _ in range(10):
    x = torch.randn(16, 64)
    loss = layer(x).sum()
    loss.backward()
    with torch.no_grad():
        layer.A -= 0.01 * layer.A.grad
        layer.B -= 0.01 * layer.B.grad
        layer.A.grad.zero_()
        layer.B.grad.zero_()

# Verify equivalence
x_test = torch.randn(4, 64)
output_separate = layer(x_test)
merged_weight = layer.get_merged_weight()
output_merged = F.linear(x_test, merged_weight)

max_diff = (output_separate - output_merged).abs().max().item()
are_equivalent = torch.allclose(output_separate, output_merged)
Out[19]:
Console
Max difference: 2.432e-05
Outputs are equivalent: True

After merging, the LoRA matrices can be discarded and replaced with a single weight matrix that has zero inference overhead compared to the original model. The negligible difference confirms that merging is mathematically exact within floating-point accuracy. This property matters for deployment. During development, separate A and B matrices provide flexibility, while production uses merged weights to eliminate computational overhead from the low-rank decomposition.

Key Parameters

The key parameters for LoRA are:

  • rank (r): The bottleneck dimension of the low-rank decomposition. Lower ranks use fewer parameters but constrain adaptation capacity. Typical values range from 4 to 64.
  • alpha: Scaling factor that controls the magnitude of LoRA updates. Often set equal to rank or tuned in the range 8 to 64.
  • target_modules: Which weight matrices to apply LoRA to (e.g., query, key, value, output projections in attention layers).
  • initialization: Matrix A typically uses Gaussian or Kaiming initialization, while matrix B is initialized to zero to preserve pretrained behavior.
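
For reference, here is how these parameters typically map onto a configuration object in the Hugging Face peft library. This is a hypothetical sketch: the exact argument values, the LLaMA-style module names, and the surrounding setup are assumptions, and the library's behavior may differ across versions. The next chapter covers implementation in detail.

Code
from peft import LoraConfig

# Assumed peft-style configuration; module names follow LLaMA-style attention layers
config = LoraConfig(
    r=16,                          # rank of the low-rank decomposition
    lora_alpha=32,                 # scaling factor alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # where to attach adapters
    lora_dropout=0.05,             # optional dropout on the LoRA path
)
# get_peft_model(base_model, config) would then wrap a loaded base model
# with trainable A/B matrices while freezing the original weights.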

Conclusion

LoRA's mathematical elegance stems from the simplicity of the factorized decomposition combined with its effectiveness in practice. The low-rank constraint is not arbitrary but grounded in the empirical observation that fine-tuning updates concentrate in low-dimensional subspaces. Understanding the mathematics—from the basic formulation through initialization, gradient flow, and rank selection—reveals why LoRA works and how to apply it effectively across diverse applications and architectures.

Limitations and Impact

LoRA achieves remarkable parameter efficiency, but its limitations matter for practical application.

The low-rank constraint limits adaptation expressiveness. Empirical evidence shows this rarely hurts performance on common NLP benchmarks, though certain tasks require higher-rank updates that LoRA cannot efficiently represent. Tasks requiring significant architectural changes, such as cross-domain or cross-modality adaptation, may need ranks higher than LoRA typically provides. The intrinsic dimensionality hypothesis provides theoretical grounding but cannot guarantee that low-rank approaches suffice for all adaptations.

The initialization scheme creates a particular training dynamic that may not be optimal for all scenarios. Zero-initializing B means early training updates only affect B, potentially slowing convergence compared to methods that update all parameters immediately. The scaling factor \frac{\alpha}{r} introduces hyperparameters that interact with learning rates in non-obvious ways, requiring careful tuning.

Despite these limitations, LoRA's mathematical formulation has significantly influenced the field. The decomposition W = W₀ + BA provides a clean interface for modular adaptation, where different LoRA matrices can be trained for different tasks and swapped at inference without modifying the base model. This enables multi-tenant serving where a single base model serves many specialized applications. The formulation also inspired numerous extensions, including QLoRA (quantization with LoRA), AdaLoRA (dynamic rank adaptation), and structured approaches that exploit domain-specific priors about where low-rank updates should be applied.

Summary

This chapter covered the mathematical foundations of LoRA:

  • Core formulation: W = W_0 + \frac{\alpha}{r}BA expresses weight updates as a product of two smaller matrices, constraining updates to rank r.

  • Low-rank approximation: SVD theory and the intrinsic dimensionality hypothesis show that fine-tuning updates live in low-dimensional subspaces of weight space.

  • Rank selection: Balances expressiveness against efficiency. Ranks r from 4 to 64 achieve over 100x compression while maintaining adaptation quality.

  • Initialization: B starts at zero while A is randomly initialized, ensuring training begins from pre-trained behavior with stable gradient dynamics.

  • Gradient flow: A and B co-evolve during training. A receives zero gradients initially but learns as B moves away from zero.

The next chapter covers practical implementation patterns for applying LoRA to transformer architectures and integrating it with existing training pipelines.

