LoRA Mathematics: Low-Rank Adaptation Formulas & Gradients

Michael Brenndoerfer · December 1, 2025 · 46 min read

Master LoRA's mathematical foundations including low-rank decomposition, gradient computation, rank selection, and initialization schemes for efficient fine-tuning.


LoRA Mathematics

In the previous chapter, we introduced LoRA as a parameter-efficient fine-tuning method that adapts large language models by learning low-rank updates to weight matrices. Instead of modifying the full weight matrix W \in \mathbb{R}^{d \times k}, LoRA learns two smaller matrices whose product represents the weight change. Now we turn to the mathematical foundations that make this approach work. Understanding LoRA's mathematics reveals why the method is both theoretically grounded and practically effective. The formulation connects to fundamental concepts from linear algebra, specifically the matrix decomposition techniques from Part III. By examining the initialization scheme, gradient flow, and rank selection criteria, you will gain the intuition needed to apply LoRA effectively and understand its variants covered in upcoming chapters.

The LoRA Formulation

LoRA reparameterizes weight updates using a factorized form that captures the essence of efficient adaptation. Rather than learning an arbitrary update to each weight matrix, LoRA constrains the update to live in a low-dimensional subspace. This constraint reflects the empirical observation that useful fine-tuning updates often have simple structure, revealing important patterns in how models adapt to new tasks.

For a pre-trained weight matrix W_0 \in \mathbb{R}^{d \times k}, LoRA expresses the adapted weight as:

W = W_0 + \Delta W = W_0 + BA

where:

  • W: the adapted weight matrix (final parameters)
  • W_0: the pre-trained weight matrix (frozen)
  • \Delta W: the weight update matrix (learned)
  • B \in \mathbb{R}^{d \times r}: the up-projection matrix
  • A \in \mathbb{R}^{r \times k}: the down-projection matrix
  • r \ll \min(d, k): the rank of the adaptation

The key insight is that \Delta W = BA has rank at most r. To understand why, recall that the rank of a matrix product cannot exceed the rank of either factor. Since A has r rows and B has r columns, the product BA can have rank at most r. Even though \Delta W has the same dimensions as W_0, potentially millions of entries, it lives in a much lower-dimensional subspace determined by the choice of r. This constraint is what enables LoRA's parameter efficiency: instead of learning d \times k independent parameters, we learn only r \times (d + k) parameters while still producing updates that span the full weight matrix dimensions.
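
To make the rank bound concrete, here is a minimal sketch (the dimensions are arbitrary, chosen only for illustration) that builds random factors and checks the rank of their product:

Code
import torch

torch.manual_seed(0)
d, k, r = 32, 24, 4

B = torch.randn(d, r)   # up-projection factor
A = torch.randn(r, k)   # down-projection factor
delta_W = B @ A         # full d x k matrix, but constrained in rank

# matrix_rank counts singular values above a numerical tolerance
print(delta_W.shape)                      # torch.Size([32, 24])
print(torch.linalg.matrix_rank(delta_W))  # tensor(4): rank is at most r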

Forward Pass Computation

During the forward pass, the network must compute the output of each adapted layer. This computation reveals opportunities for efficient implementation and clarifies each matrix's role in the decomposition.

For an input x \in \mathbb{R}^k, the output h \in \mathbb{R}^d is computed as:

\begin{aligned} h &= Wx \\ &= (W_0 + BA)x && \text{(substitute } W \text{)} \\ &= W_0 x + BA x && \text{(distribute)} \\ &= W_0 x + B(Ax) && \text{(associativity)} \end{aligned}

where:

  • h: the output vector of the layer
  • W: the full adapted weight matrix
  • x: the input vector to the layer
  • W_0: the frozen pre-trained weight matrix
  • B: the trainable up-projection matrix
  • A: the trainable down-projection matrix

The final line reveals LoRA's computational structure. The parenthesization in B(Ax) is critical for computational efficiency. The naive approach would form the full d \times k matrix BA explicitly and then multiply by x. This defeats the purpose of the low-rank decomposition because we would need to compute and store a matrix with dk entries.

Instead, we exploit the associativity of matrix multiplication to compute the result sequentially through a bottleneck of dimension r. The computation proceeds in three stages:

  1. First, we compute z = Ax, yielding z \in \mathbb{R}^r. This operation involves an r \times k matrix multiplying a k-dimensional vector, requiring O(rk) operations.
  2. Next, we compute \Delta h = Bz, yielding \Delta h \in \mathbb{R}^d. This involves a d \times r matrix multiplying an r-dimensional vector, requiring O(dr) operations.
  3. Finally, we add the base model's output: h = W_0 x + \Delta h.

This sequential computation requires O(rk + dr) = O(r(d+k)) operations for the LoRA path, compared to O(dk) if we materialized the full update matrix. When r \ll \min(d, k), which is the regime where LoRA operates, this provides significant savings during both training and inference. For a typical transformer layer where d = k = 4096 and r = 16, the LoRA path requires roughly 0.8% of the operations needed to compute with a full update matrix.
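
A small sketch makes the difference tangible. The dimensions match the example above; the operation counts are rough multiply-add estimates, not measured timings, and both parenthesizations produce the same output:

Code
import torch

torch.manual_seed(0)
d, k, r = 4096, 4096, 16
W0 = torch.randn(d, k) * 0.02
B = torch.randn(d, r) * 0.02
A = torch.randn(r, k) * 0.02
x = torch.randn(k)

# Efficient path: project down through the rank-r bottleneck, then back up
z = A @ x               # ~r*k multiply-adds
delta_h = B @ z         # ~d*r multiply-adds
h = W0 @ x + delta_h

# Naive path: materialize the full d x k update BA before applying it
# (forming BA alone costs ~d*r*k, applying it another ~d*k)
h_naive = W0 @ x + (B @ A) @ x

print(torch.allclose(h, h_naive, atol=1e-4))   # True: identical result
print(f"LoRA path: ~{r * (d + k):,} ops vs full update: ~{d * k:,} ops")
# 131,072 vs 16,777,216 -> roughly 0.8% of the work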

Out[2]:
Visualization
The LoRA bottleneck compresses 64-dimensional inputs through an 8-dimensional intermediate representation (z = Ax), then expands back to 64-dimensional outputs (Δh = Bz). This forced compression through just 8 dimensions represents an 8x reduction that demonstrates LoRA's dramatic parameter efficiency. The bottleneck acts as an information filter, learning which features matter most for task adaptation while discarding task-irrelevant dimensions.

Scaling Factor

The original LoRA paper introduces a scaling factor \alpha to control the magnitude of the low-rank update. This scaling factor plays a crucial role in making LoRA practical across different settings and ranks.

The scaled formulation is:

h = W_0 x + \frac{\alpha}{r} BA x

where:

  • h: the output vector
  • W_0: the pre-trained weight matrix
  • x: the input vector
  • \alpha: a constant scaling factor (hyperparameter)
  • r: the rank of the LoRA matrices
  • B: the up-projection matrix
  • A: the down-projection matrix
  • BA: the low-rank update matrix product

The ratio \frac{\alpha}{r} serves several important purposes.

  • Learning rate stability: Without the scaling factor, changing the rank r would fundamentally alter the magnitude of updates produced by the LoRA path. A higher rank means more terms contributing to the product BA, which, all else being equal, would produce larger outputs. By dividing by r, we normalize this effect, allowing the same learning rate to produce updates of similar magnitude regardless of rank choice.
  • Initialization compatibility: As we'll see in the initialization section below, the scaling interacts with how we initialize A and B to ensure consistent behavior. The combination of proper initialization and scaling means that the model's behavior at the start of training doesn't depend sensitively on the rank choice.
  • Hyperparameter transfer: Because the scaling normalizes the effect of rank, a good value of \alpha often transfers across different rank choices. This simplifies hyperparameter search considerably: you can tune \alpha at a lower rank where experiments are cheaper, then scale up the rank while keeping \alpha fixed.

In practice, many implementations set \alpha = r, making the scaling factor equal to 1 and effectively removing the normalization. Others treat \alpha as a tunable hyperparameter, typically in the range [8, 64]. The choice depends on whether you want rank-independent behavior (use \alpha/r scaling) or prefer to absorb any magnitude adjustments into the learning rate (set \alpha = r).

Low-Rank Decomposition Analysis

The mathematical foundation of LoRA rests on the hypothesis that weight updates during fine-tuning have low intrinsic rank. This is a strong claim that deserves careful examination. Why should task-specific adaptations live in a low-dimensional subspace? To understand why this might be true, we can connect LoRA to concepts from matrix approximation theory and explore what empirical evidence tells us about the structure of fine-tuning updates.

Connection to Singular Value Decomposition

Recall from our discussion of SVD in Part III that any matrix can be decomposed into a sum of rank-1 components, ordered by their importance. This decomposition shows when and why low-rank approximations work well.

Any matrix M \in \mathbb{R}^{d \times k} can be decomposed as:

M = U \Sigma V^T = \sum_{i=1}^{\min(d,k)} \sigma_i u_i v_i^T

where:

  • M: the matrix to be decomposed
  • U, V: orthogonal matrices containing the singular vectors
  • \Sigma: diagonal matrix containing the singular values \sigma_i
  • \sigma_i: the singular values, ordered such that \sigma_1 \geq \sigma_2 \geq \ldots \geq 0
  • u_i, v_i: the left and right singular vectors corresponding to \sigma_i

The singular values tell us how much each rank-1 component contributes to the matrix. The largest singular value \sigma_1 corresponds to the most important direction, the direction along which the matrix stretches inputs the most. Subsequent singular values capture progressively less important directions.

The Eckart-Young-Mirsky theorem tells us that the best rank-r approximation to M in the Frobenius norm is obtained by keeping only the first r terms:

M_r = \sum_{i=1}^{r} \sigma_i u_i v_i^T

where:

  • M_r: the rank-r matrix that best approximates M
  • r: the target rank (number of components retained)
  • \sigma_i: the singular values
  • u_i, v_i: the singular vectors

Among all possible rank-r matrices, the truncated SVD provides the one closest to M. The approximation error is precisely:

\|M - M_r\|_F^2 = \sum_{i=r+1}^{\min(d,k)} \sigma_i^2

where:

  • \|M - M_r\|_F^2: the squared Frobenius norm of the approximation error
  • \|\cdot\|_F: the Frobenius norm (square root of the sum of squared elements)
  • \sigma_i: the singular values corresponding to the discarded directions

This formula reveals when low-rank approximations work well: if the singular values decay rapidly, then the terms we discard when truncating contribute little to the total, and a low-rank approximation captures most of the matrix's structure. Conversely, if all singular values are similar in magnitude, truncation loses substantial information.
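
The error identity is easy to verify numerically. The sketch below (a minimal example with arbitrary dimensions) truncates the SVD of a random matrix at rank r and compares the squared Frobenius error against the sum of the discarded squared singular values:

Code
import torch

torch.manual_seed(0)
d, k, r = 50, 40, 10

M = torch.randn(d, k)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)

# Best rank-r approximation: keep the r largest singular triplets
M_r = (U[:, :r] * S[:r]) @ Vh[:r, :]

error_sq = torch.norm(M - M_r) ** 2   # squared Frobenius norm of the residual
discarded_sq = (S[r:] ** 2).sum()     # sum of the discarded sigma_i^2

print(f"||M - M_r||_F^2        = {error_sq.item():.4f}")
print(f"sum of discarded s_i^2 = {discarded_sq.item():.4f}")  # the two match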

Empirical studies of fine-tuned models provide encouraging evidence for LoRA's approach. When researchers compute the weight changes \Delta W_{\text{full}} = W_{\text{finetuned}} - W_{\text{pretrained}} from full fine-tuning runs and examine their singular value spectra, they find rapid decay. A small number of singular values dominate while most are near zero. This suggests that the actual weight changes needed for successful fine-tuning can be well approximated by low-rank matrices.

Out[3]:
Visualization
Singular value decay on a logarithmic scale reveals how different matrix types compress information. Fine-tuning weight updates (fast decay) capture 95 percent of information by rank 16, while medium and slow decay matrices require progressively higher ranks. The shaded region marks the effective LoRA operating range (r = 1 to 16) where compression and adaptation quality balance optimally.
Cumulative singular value energy versus rank reveals how rapidly matrices compress. Fine-tuning updates with fast decay reach 95 percent information content at rank 8, while medium decay requires rank 16. Practical LoRA uses r = 8 to r = 16 to capture nearly all useful information in typical fine-tuning updates while achieving strong parameter compression.

Intrinsic Dimensionality Hypothesis

The effectiveness of LoRA connects to the broader concept of intrinsic dimensionality in neural network training. Despite having millions or billions of parameters, models need far fewer effective degrees of freedom for successful training. Related findings, such as the lottery ticket hypothesis and measurements of intrinsic dimensionality, suggest that neural networks are vastly overparameterized for their tasks.

For fine-tuning specifically, the intuition supporting low-rank updates emerges from understanding what fine-tuning actually accomplishes:

  1. Pre-training captures general structure: The pre-trained weights W_0 already encode rich representations of language. These weights have been shaped by exposure to vast amounts of text, learning general patterns about syntax, semantics, world knowledge, and reasoning. These representations are highly expressive and broadly useful.
  2. Fine-tuning makes targeted adjustments: Adapting to a specific task requires modifying only certain aspects of these representations. A sentiment classifier needs only to adjust how certain features map to sentiment labels, not reorganize the model's understanding of language. A summarization model doesn't need new linguistic knowledge; it needs to learn task-specific patterns about what information to preserve and condense.
  3. Targeted adjustments are low-rank: These task-specific modifications can often be expressed as linear combinations of a small number of directions in weight space. If fine-tuning primarily adjusts how the model uses existing features rather than learning entirely new ones, the weight changes will have structure that low-rank matrices can capture.

Full unconstrained fine-tuning doesn't produce exactly low-rank updates, but low-rank approximations of those updates suffice for most downstream tasks. Full-rank updates might capture noise or provide marginal gains on very demanding tasks; for typical applications, the low-rank constraint sacrifices little while gaining substantial efficiency.

Rank Constraint Interpretation

The rank constraint \text{rank}(\Delta W) \leq r has a geometric interpretation that shows what LoRA learns during training. The matrix \Delta W = BA maps inputs through an r-dimensional bottleneck, forcing all information about the input to pass through a low-dimensional representation before influencing the output.

Information flows through the LoRA path in two stages. First, the input vector x is projected down to an r-dimensional representation z by the matrix A. Then, this compressed representation is expanded back to the full output dimension by the matrix B. Mathematically:

x \xrightarrow{A} z \in \mathbb{R}^r \xrightarrow{B} \Delta h \in \mathbb{R}^d

where:

  • x: the input vector
  • A: the down-projection matrix
  • z: the intermediate low-rank representation
  • B: the up-projection matrix
  • \Delta h: the update vector in the output space
  • r: the rank of the adaptation
  • d: the output dimension

This bottleneck structure implies several important properties. First, the update \Delta h necessarily lies in the column space of B, which has dimension at most r. No matter what input x we provide, the LoRA path can only produce outputs in this restricted subspace. Second, different inputs can only produce updates in this same r-dimensional subspace. The LoRA update cannot independently adjust every dimension of the output; it must work within the constraints of the learned subspace. Third, and importantly, the subspace is learned during training, not predetermined. The optimization process discovers which directions in weight space are most useful for the task at hand.

The training process simultaneously learns two complementary aspects of the adaptation. The matrix B encodes which r-dimensional subspace of the output space should receive updates, effectively selecting the "directions" in which the model's behavior should change. The matrix A encodes how to project inputs onto this subspace, determining which aspects of the input should influence these changes and by how much. Together, A and B learn both the structure of the adaptation and how to apply it based on the input.
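
The column-space restriction can also be checked directly. In the sketch below (arbitrary sizes), many different inputs are pushed through the LoRA path, yet the stacked update vectors never span more than r dimensions:

Code
import torch

torch.manual_seed(0)
d, k, r = 64, 48, 4

B = torch.randn(d, r)
A = torch.randn(r, k)

# Push many random inputs through the LoRA path and stack the updates
X = torch.randn(k, 100)          # 100 different inputs as columns
delta_H = B @ (A @ X)            # each column is one update vector in R^d

# Even with 100 outputs in a 64-dimensional space, their span has rank <= r
print(torch.linalg.matrix_rank(delta_H))  # tensor(4)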

Rank Selection

Choosing the rank r involves balancing expressiveness against efficiency. Rank is a primary hyperparameter in LoRA, and understanding its trade-offs is essential for practical application. Lower ranks use fewer parameters and less computation but constrain the adaptation's capacity to represent complex changes to the model's behavior.

Parameter Count Analysis

To understand the efficiency gains from LoRA, we need to compare the number of trainable parameters against full fine-tuning. For a weight matrix W_0 \in \mathbb{R}^{d \times k}, full fine-tuning requires learning dk parameters. LoRA instead introduces:

\text{LoRA parameters} = dr + rk = r(d + k)

where:

  • d: output dimension of the layer
  • k: input dimension of the layer
  • r: rank of the adaptation

The parameter ratio compared to full fine-tuning reveals the compression achieved:

\frac{r(d + k)}{dk} = \frac{r}{d} + \frac{r}{k}

where:

  • dk: total parameters in the full weight matrix
  • r(d+k): total parameters in the LoRA adapters
  • r: rank of the adaptation
  • d: output dimension
  • k: input dimension

This ratio decreases as the layer dimensions grow, meaning LoRA becomes proportionally more efficient for larger models. For a transformer with d = k = 4096, which is typical of modern large language models, and r = 16:

\frac{16(4096 + 4096)}{4096 \times 4096} = \frac{131,072}{16,777,216} \approx 0.78\%

where:

  • 16: the rank r used in this example
  • 4096: the input and output dimensions (d and k)
  • 0.78\%: the resulting parameter efficiency ratio

This calculation shows that LoRA with rank 16 uses less than 1 percent of the parameters that full fine-tuning would require for this single layer. The dramatic reduction holds across the entire model. If we apply LoRA to all attention projections (the query, key, value, and output matrices W_Q, W_K, W_V, W_O) in each layer, the total trainable parameters remain a small fraction of the original model while still adapting the most important components of the transformer architecture.

Out[4]:
Visualization
LoRA parameter efficiency for a 4096-dimensional transformer layer demonstrates compression ratios from 30x (r=4) to 500x (r=64). Ranks r=8 to r=16 use only 0.2 to 1 percent of full fine-tuning parameters, representing the practical sweet spot where efficiency gains are substantial while maintaining adaptation capacity for typical NLP fine-tuning tasks.
LoRA parameter efficiency improves with model size across ranks 8, 16, and 64 (shown on logarithmic axes). At dimension d=8192 (typical for very large models), rank-16 adapters use under 0.4 percent of full fine-tuning parameters. This scaling benefit makes it possible to fine-tune massive language models on consumer hardware with limited memory.

Expressiveness vs. Efficiency Trade-off

The rank determines the capacity of the adaptation, controlling how expressive the learned weight changes can be. Different ranks are appropriate for different scenarios:

  • r = 1: The update is a rank-1 matrix \Delta W = ba^T (the outer product of two vectors). This is highly constrained, representing the simplest possible non-trivial update. Despite this severe limitation, rank-1 adaptations sometimes suffice for simple tasks like binary classification, where the model primarily needs to adjust a single decision boundary.
  • r = 4 to r = 16: These are common choices that balance efficiency and expressiveness for most NLP tasks. Many benchmarks show that ranks in this range achieve performance comparable to full fine-tuning while maintaining substantial parameter efficiency. This range represents the "sweet spot" for typical applications.
  • r = 64 to r = 256: Higher capacity for complex adaptations, multi-task scenarios, or cases where lower ranks demonstrably underperform. These ranks sacrifice some efficiency for increased expressiveness, appropriate when the task demands more complex adaptations than lower ranks can represent.
  • r = \min(d, k): Full rank, equivalent to unconstrained fine-tuning of the layer. At this extreme, LoRA provides no parameter reduction but still maintains the structure of learning updates as a product of two matrices. This is primarily useful as a theoretical comparison point.

Empirically, rank selection depends on several factors that practitioners should consider. Task complexity plays a significant role. Simple classification may need only r = 8, while complex generation might benefit from r = 64 or higher. Dataset size interacts with rank in interesting ways. Smaller datasets benefit from lower ranks because the rank constraint provides implicit regularization, preventing overfitting. Larger datasets can exploit higher ranks without overfitting and may show continued improvement as rank increases. Finally, target modules often have different optimal ranks. Attention projections frequently need lower ranks than feed-forward layers, possibly because attention primarily routes information while feed-forward layers perform more complex transformations.

Effective Rank During Training

Trained LoRA adapters show an interesting pattern: even when r is set relatively high, the effective rank of the learned BA is often lower than the maximum possible. The optimization process tends to concentrate the adaptation into fewer dimensions than the allocated rank would allow.

The effective rank can be measured using the singular values of BA. One natural measure based on information theory follows.

\text{effective rank}(BA) = \exp\left(-\sum_{i} \tilde{\sigma}_i \log \tilde{\sigma}_i\right)

where:

  • \tilde{\sigma}_i: the normalized singular values, such that \sum_i \tilde{\sigma}_i = 1
  • \sigma_i: the i-th singular value of the matrix BA
  • \exp(\cdot): the exponential function (computes the perplexity of the singular value distribution)

This entropy-based measure, sometimes called the spectral entropy or perplexity of the singular value distribution, indicates how many singular values contribute meaningfully to the matrix. If all r singular values are equal, the effective rank equals r. If one singular value dominates, the effective rank approaches 1. Values between these extremes indicate intermediate concentration.

Studies examining trained LoRA adapters show that they often converge to solutions where only a few singular values dominate, even when the specified r is much larger. This observation suggests that the actual intrinsic rank needed for the adaptation is lower than the specified r. It also suggests potential improvements: methods like AdaLoRA, covered in a later chapter, exploit this observation by adaptively adjusting the rank during training rather than fixing it in advance.
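
A small helper makes the measure concrete. The sketch below is illustrative only: instead of a trained adapter, it builds a simulated BA in which a couple of bottleneck directions are scaled up to dominate, mimicking the concentration pattern described above, and then computes the entropy-based effective rank:

Code
import torch

def effective_rank(M: torch.Tensor) -> float:
    """Exponential of the entropy of the normalized singular value distribution."""
    s = torch.linalg.svdvals(M)
    s = s / s.sum()                      # normalize so the values sum to 1
    s = s[s > 0]                         # guard against log(0)
    entropy = -(s * torch.log(s)).sum()
    return torch.exp(entropy).item()

torch.manual_seed(0)
d, k, r = 64, 64, 16
B = torch.randn(d, r)
A = torch.randn(r, k)

# Simulate concentration: shrink all but the first two bottleneck directions
scale = torch.tensor([1.0, 0.8] + [0.05] * (r - 2))
BA_concentrated = (B * scale) @ A

print(f"allocated rank: {r}")
print(f"effective rank: {effective_rank(BA_concentrated):.2f}")  # far below 16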

Initialization Scheme

LoRA's initialization is crucial for training stability and convergence. How you initialize the matrices A and B determines the model's starting point and influences the entire training trajectory. The standard scheme is:

  • Matrix A: initialized from \mathcal{N}(0, \sigma^2) (Gaussian) or with Kaiming initialization
  • Matrix B: initialized to zero

This asymmetric initialization, with A random and B zero, ensures that the product BA = 0 at the start of training. This means the model begins with its pre-trained behavior completely intact: the LoRA path contributes nothing to the output until training begins to modify the parameters.

Rationale for Zero Initialization of B

Starting with \Delta W = BA = 0 has several important advantages:

  1. Continuity from pretraining: The model's initial behavior exactly matches the pretrained model. This is valuable because the pretrained model already performs well on many tasks. No "warm-up" period is needed for the model to recover from a random perturbation. From the first training step, we are refining good behavior rather than recovering from a disrupted starting point.

  2. Stable training dynamics: Large random initializations of both A and B could produce \|BA\| values that significantly perturb the pre-trained representations. If the initial perturbation is large, early training might focus on undoing this damage rather than learning the task. By starting at zero, we ensure that the first training steps are devoted entirely to task-relevant adaptation.

  3. Gradient flow: A natural concern with zero initialization is whether gradients will flow properly. If B = 0 initially, won't the gradient with respect to B also be zero, preventing any learning? Fortunately, this is not the case. The gradient with respect to B depends on the input to the LoRA layer and the upstream gradient, not on the current value of B. As we'll derive in the gradient section below, B receives non-zero gradients even when it equals zero.

Variance Analysis

Analyzing the initialization variance explains why the standard scheme works better than initializing both matrices randomly. If we initialize both A and B with entries drawn from \mathcal{N}(0, \sigma^2), what magnitude would we expect for the entries of the product BA?

For a single entry of the product:

\begin{aligned} \mathbb{E}[(BA)_{ij}^2] &= \mathbb{E}\left[\left(\sum_{l=1}^{r} B_{il} A_{lj}\right)^2\right] \\ &= \sum_{l=1}^{r} \mathbb{E}[B_{il}^2 A_{lj}^2] + \sum_{l \neq m} \mathbb{E}[B_{il} A_{lj} B_{im} A_{mj}] && \text{(expand square)} \\ &= \sum_{l=1}^{r} \mathbb{E}[B_{il}^2] \, \mathbb{E}[A_{lj}^2] + 0 && \text{(independence and zero mean)} \\ &= \sum_{l=1}^{r} \sigma^2 \sigma^2 && \text{(variance def.)} \\ &= r\sigma^4 \end{aligned}

where:

  • (BA)_{ij}: the element at row i, column j of the product
  • B_{il}, A_{lj}: elements of the random matrices
  • \sigma^2: the variance of the initialization distribution
  • r: the rank (number of terms in the sum)

The key observation is that this expected squared magnitude grows linearly with r. This rank dependence creates a problem: if we use the same initialization variance and learning rate for different ranks, higher ranks will produce larger initial perturbations and larger gradients. This would require rank-dependent learning rate adjustment to maintain consistent training dynamics. Zero-initializing B sidesteps this issue entirely by ensuring the initial perturbation is exactly zero regardless of rank.
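
A quick Monte Carlo check (a sketch with arbitrary sizes and σ chosen only for illustration) confirms the rσ⁴ prediction: when both factors are drawn from N(0, σ²), the mean squared entry of BA grows linearly with the rank:

Code
import torch

torch.manual_seed(0)
d, k, sigma = 64, 64, 0.1

for r in [1, 4, 16, 64]:
    B = torch.randn(d, r) * sigma
    A = torch.randn(r, k) * sigma
    mean_sq = ((B @ A) ** 2).mean().item()      # empirical E[(BA)_ij^2]
    predicted = r * sigma**4                    # theoretical value
    print(f"r={r:3d}  empirical {mean_sq:.2e}   predicted r*sigma^4 {predicted:.2e}")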

Out[5]:
Visualization
Random initialization of both matrices A and B for a 64x64 layer creates rank-dependent perturbations. The Frobenius norm of BA grows linearly with rank. Perturbations range from 0.5 at rank 1 to 4 at rank 64, demonstrating that higher ranks produce larger initial disruptions to pretrained weights, requiring rank-specific learning rate adjustments.
Standard LoRA initialization with B equal to zero ensures the Frobenius norm of BA is exactly zero across all ranks, eliminating rank-dependent perturbations. BA grows smoothly and consistently after the first gradient step. This scheme preserves pretrained model behavior at initialization while providing stable learning dynamics across any rank choice.

For matrix A, the standard initialization uses a variance that scales inversely with the rank:

A_{lj} \sim \mathcal{N}\left(0, \frac{1}{r}\right)

where:

  • A_{lj}: the element of matrix A at row l, column j
  • \mathcal{N}(\mu, \sigma^2): the normal distribution with mean \mu and variance \sigma^2
  • r: the rank of the adaptation
  • 1/r: the variance, chosen to scale inversely with the rank

This scaling ensures that when gradients begin flowing and B starts to move away from zero, the updates to B have reasonable magnitude regardless of r. The variance is chosen so that the projection Ax has an expected squared norm that doesn't grow with r. Some implementations use Kaiming initialization instead, setting:

A_{lj} \sim \mathcal{N}\left(0, \frac{2}{k}\right)

where:

  • A_{lj}: element of matrix A
  • k: the input dimension of the layer

Kaiming initialization is designed to preserve the variance of activations through deep networks and is a reasonable alternative choice.
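
The sketch below (arbitrary dimensions) compares the two choices empirically, checking that with variance 1/r the squared norm of the projection Ax stays roughly constant as the rank grows, while the 2/k variant grows with r:

Code
import math
import torch

torch.manual_seed(0)
k = 512
x = torch.randn(k)

for r in [4, 16, 64]:
    A_lora = torch.randn(r, k) / math.sqrt(r)         # variance 1/r
    A_kaiming = torch.randn(r, k) * math.sqrt(2 / k)  # Kaiming-style, variance 2/k
    n_lora = (A_lora @ x).pow(2).sum().item()
    n_kaiming = (A_kaiming @ x).pow(2).sum().item()
    print(f"r={r:3d}  ||Ax||^2 with 1/r init: {n_lora:7.1f}   with 2/k init: {n_kaiming:7.1f}")
# With variance 1/r, E||Ax||^2 ~ ||x||^2 ~ k regardless of r;
# with variance 2/k it grows roughly linearly with r.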

Alternative Initializations

While zero-initialization of B is standard and works well in most cases, researchers have explored alternatives that may offer advantages in certain scenarios:

  • SVD initialization: Initialize BA as a low-rank approximation of W_{\text{full-finetune}} - W_0 from a reference full fine-tuning run. This requires running full fine-tuning once, but if you need to train many LoRA variants, initializing from the SVD of a reference solution can accelerate convergence significantly.
  • Symmetric initialization: Initialize both A and B with small random values, accepting the initial perturbation to pre-trained behavior. This may help when the pre-trained model's initial behavior is far from desired, though it requires careful tuning of the initialization scale.
  • Task-informed initialization: Use task-specific heuristics based on prior knowledge about the adaptation needed. For example, if adapting to a specific domain, initialize using the SVD of domain-specific text representations.

We'll explore adaptive initialization strategies in the AdaLoRA chapter, where the initialization interacts with mechanisms that adjust rank during training.

LoRA Gradient Computation

Gradient flow through LoRA reveals its training dynamics and connects to the optimization concepts from Part VII. The gradient formulas show why the zero initialization of B doesn't prevent learning and how A and B co-evolve during training.

Gradient Derivation

Consider the forward pass for a single layer with LoRA adaptation:

h = W_0 x + \frac{\alpha}{r} BA x

where:

  • h: the output vector
  • W_0: the pre-trained weight matrix
  • x: the input vector
  • \alpha: the scaling factor
  • r: the rank
  • B, A: the LoRA up-projection and down-projection matrices

Let \mathcal{L} be the loss function we're minimizing. To update A and B via gradient descent, we compute \frac{\partial \mathcal{L}}{\partial A} and \frac{\partial \mathcal{L}}{\partial B}. The chain rule traces how parameter changes propagate through the computation to affect the loss.

First, let's define the upstream gradient. Let \frac{\partial \mathcal{L}}{\partial h} = g \in \mathbb{R}^d be the gradient of the loss with respect to the layer's output. This gradient comes from the layers above in the network and tells us how changes in h affect the loss.

For the gradient with respect to B, we apply the chain rule. The LoRA contribution to h is \frac{\alpha}{r} B(Ax). Differentiating this with respect to B, and then multiplying by how changes in h affect the loss:

\begin{aligned} \frac{\partial \mathcal{L}}{\partial B} &= \frac{\alpha}{r} \frac{\partial \mathcal{L}}{\partial h} \frac{\partial h}{\partial B} && \text{(chain rule)} \\ &= \frac{\alpha}{r} g (Ax)^T && \text{(substitute derivatives)} \end{aligned}

where:

  • \mathcal{L}: the loss function
  • g: the upstream gradient vector \frac{\partial \mathcal{L}}{\partial h}
  • Ax: the intermediate vector (the input projected down to rank r)
  • \alpha: the scaling factor
  • r: the rank

The gradient of B is an outer product between the upstream gradient g and the intermediate representation Ax. The (i, l) entry of this gradient measures how much increasing B_{il} would affect the loss, which depends on how much gradient signal arrives at output dimension i (captured by g_i) and how active the corresponding bottleneck dimension l was (captured by (Ax)_l).

For processing batches efficiently, we extend this to matrix form:

\frac{\partial \mathcal{L}}{\partial B} = \frac{\alpha}{r} G (AX)^T

where:

  • \mathcal{L}: the loss function
  • \alpha: the scaling factor
  • r: the rank
  • G \in \mathbb{R}^{d \times n}: the matrix of upstream gradients for the batch
  • X \in \mathbb{R}^{k \times n}: the matrix of input vectors for the batch
  • A: the down-projection matrix

For the gradient with respect to A, we again apply the chain rule. Now we need to trace how changes in A affect h through the intermediate computation Ax:

\begin{aligned} \frac{\partial \mathcal{L}}{\partial A} &= \frac{\alpha}{r} B^T \frac{\partial \mathcal{L}}{\partial h} x^T && \text{(chain rule)} \\ &= \frac{\alpha}{r} B^T g x^T && \text{(substitute } g \text{)} \end{aligned}

where:

  • B^T: the transpose of the up-projection matrix
  • \mathcal{L}: the loss function
  • \alpha: the scaling factor
  • r: the rank
  • g: the upstream gradient vector
  • x: the input vector

This formula shows that the gradient with respect to A involves projecting the upstream gradient back through B (using B^T g) and then forming an outer product with the input x. In matrix form for batch processing:

\frac{\partial \mathcal{L}}{\partial A} = \frac{\alpha}{r} B^T G X^T

where:

  • \mathcal{L}: the loss function
  • \alpha: the scaling factor
  • r: the rank
  • B^T: the transpose of the up-projection matrix
  • G: the upstream gradient matrix
  • X^T: the transpose of the input batch matrix
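
These closed-form expressions can be checked against autograd. The sketch below (small, arbitrary dimensions, with B deliberately nonzero so both gradients are nonzero) runs a forward pass, backpropagates a batch of upstream gradients G, and compares PyTorch's gradients for A and B with the batch formulas above:

Code
import torch

torch.manual_seed(0)
d, k, r, n = 6, 5, 3, 4
alpha = 2.0

W0 = torch.randn(d, k)
A = torch.randn(r, k, requires_grad=True)
B = torch.randn(d, r, requires_grad=True)
X = torch.randn(k, n)     # batch of n inputs as columns
G = torch.randn(d, n)     # upstream gradients dL/dh for the batch

H = W0 @ X + (alpha / r) * B @ (A @ X)
H.backward(G)             # inject G as dL/dH

with torch.no_grad():
    grad_B_manual = (alpha / r) * G @ (A @ X).T   # (alpha/r) G (AX)^T
    grad_A_manual = (alpha / r) * B.T @ G @ X.T   # (alpha/r) B^T G X^T

print(torch.allclose(B.grad, grad_B_manual, atol=1e-5))  # True
print(torch.allclose(A.grad, grad_A_manual, atol=1e-5))  # True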

Gradient Flow Properties

Several properties of these gradients are noteworthy and reveal important aspects of LoRA's training dynamics.

Independence from W_0: The gradients \frac{\partial \mathcal{L}}{\partial A} and \frac{\partial \mathcal{L}}{\partial B} do not depend explicitly on W_0. The frozen weights affect the upstream gradient g but never appear directly in the LoRA gradient formulas. This separation enables efficient training where W_0 is never updated: you don't need to compute or store gradients for the much larger frozen weight matrices.

Coupled dynamics: Although A and B are separate parameters, their gradients are intimately coupled:

  • \frac{\partial \mathcal{L}}{\partial B} depends on A through the term Ax
  • \frac{\partial \mathcal{L}}{\partial A} depends on B through the term B^T g

This coupling means A and B co-evolve during training, with A's current value determining B's gradient and vice versa. This is similar to the dynamics in other factorized parameterizations, leading to interesting optimization behavior.

Zero initialization dynamics: At initialization (B = 0):

  • \frac{\partial \mathcal{L}}{\partial B} = \frac{\alpha}{r} g (Ax)^T \neq 0 (generally non-zero)
  • \frac{\partial \mathcal{L}}{\partial A} = \frac{\alpha}{r} B^T g x^T = 0

This asymmetry is important: B updates immediately from the first gradient step, while A has zero gradient initially. However, after the first update step, B \neq 0, and both matrices begin receiving non-zero gradients. The initial zero-gradient phase for A is brief, lasting only a single step in theory, though in practice the gradients for A remain small until B has moved appreciably away from zero.

Computational Cost of Gradient Computation

LoRA's gradient computations are efficient, requiring operations proportional to the parameter count rather than the full weight matrix size.

Computing \frac{\partial \mathcal{L}}{\partial B} = \frac{\alpha}{r} g z^T, where z = Ax, has the following costs: during the forward pass, we must store z \in \mathbb{R}^r, adding minimal memory overhead. The gradient computation itself is an outer product between vectors of dimension d and r, requiring O(dr) operations.

Computing \frac{\partial \mathcal{L}}{\partial A} = \frac{\alpha}{r} B^T g x^T requires first computing B^T g, which is a matrix-vector product costing O(dr) operations, followed by forming the outer product with x^T, which costs O(rk) operations.

The total gradient computation is therefore O(r(d + k)), matching the parameter count. This is optimal: we need at least one operation per parameter to compute a gradient, and LoRA achieves this lower bound up to constant factors. This efficiency extends to the memory required for gradient storage, which also scales with r(d+k) rather than dk.

Worked Example

Concrete numbers solidify the mathematical concepts. This example connects abstract formulas to actual numerical operations.

Consider a small weight matrix W_0 \in \mathbb{R}^{4 \times 3} with a rank r = 2 adaptation.

W_0 = \begin{bmatrix} 1.0 & 0.5 & -0.3 \\ 0.2 & 1.0 & 0.4 \\ -0.1 & 0.3 & 1.0 \\ 0.5 & -0.2 & 0.1 \end{bmatrix}

where:

  • W_0: the pre-trained weight matrix

Initialize LoRA matrices following the standard scheme:

A = \begin{bmatrix} 0.3 & -0.5 & 0.2 \\ 0.4 & 0.1 & -0.3 \end{bmatrix}, \quad B = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}

where:

  • A: the down-projection matrix (randomly initialized)
  • B: the up-projection matrix (initialized to zero)

For the input x = [1.0, 0.5, -0.2]^T with scaling factor \alpha = 2:

Forward pass:

Tracing through each step of the forward computation:

  1. Pre-trained output: We compute W_0 x by multiplying the pre-trained weight matrix with the input. For the first element: 1.0 \cdot 1.0 + 0.5 \cdot 0.5 + (-0.3)(-0.2) = 1.0 + 0.25 + 0.06 = 1.31. Continuing for all elements: W_0 x = [1.31, 0.62, -0.15, 0.38]^T

  2. LoRA intermediate: We compute z = Ax, projecting the input down to the rank-2 bottleneck. First element: 0.3 \cdot 1.0 + (-0.5) \cdot 0.5 + 0.2 \cdot (-0.2) = 0.3 - 0.25 - 0.04 = 0.01. Second element: 0.4 \cdot 1.0 + 0.1 \cdot 0.5 + (-0.3) \cdot (-0.2) = 0.4 + 0.05 + 0.06 = 0.51. So z = [0.01, 0.51]^T

  3. LoRA update: We compute Bz, projecting back up to the output dimension. Since B = 0, this gives Bz = [0, 0, 0, 0]^T

  4. Final output: h = W_0 x + \frac{\alpha}{r} Bz = [1.31, 0.62, -0.15, 0.38]^T + \frac{2}{2}[0, 0, 0, 0]^T = [1.31, 0.62, -0.15, 0.38]^T

At initialization, the output equals the pre-trained output exactly. This confirms that zero-initializing B preserves the pre-trained model's behavior.

Gradient computation:

With the upstream gradient g = [0.1, -0.2, 0.3, 0.1]^T from the loss and the layers above, we compute gradients for both LoRA matrices.

For B:

\begin{aligned} \frac{\partial \mathcal{L}}{\partial B} &= \frac{\alpha}{r} g z^T \\ &= \frac{2}{2} \begin{bmatrix} 0.1 \\ -0.2 \\ 0.3 \\ 0.1 \end{bmatrix} \begin{bmatrix} 0.01 & 0.51 \end{bmatrix} \\ &= \begin{bmatrix} 0.001 & 0.051 \\ -0.002 & -0.102 \\ 0.003 & 0.153 \\ 0.001 & 0.051 \end{bmatrix} \end{aligned}

where:

  • \mathcal{L}: the loss function
  • B: the up-projection matrix
  • \alpha: the scaling factor (\alpha = 2)
  • r: the rank (r = 2)
  • g: the upstream gradient vector
  • z: the intermediate activation vector computed in the forward pass

For A:

\begin{aligned} \frac{\partial \mathcal{L}}{\partial A} &= \frac{\alpha}{r} B^T g x^T \\ &= \frac{2}{2} \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} g x^T \\ &= \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \end{aligned}

where:

  • \mathcal{L}: the loss function
  • A: the down-projection matrix
  • B^T: the transpose of the up-projection matrix (currently zero)
  • g: the upstream gradient vector
  • x: the input vector

As expected from our theoretical analysis, B receives non-zero gradients while A's gradient is zero at initialization. After one gradient descent step updates B, subsequent forward passes will produce non-zero B^T g terms, and A will begin receiving gradients as well.
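
To tie the hand calculation back to code, this short sketch reproduces the worked example with autograd and checks the forward output and both gradient matrices against the values computed above:

Code
import torch

W0 = torch.tensor([[1.0, 0.5, -0.3],
                   [0.2, 1.0, 0.4],
                   [-0.1, 0.3, 1.0],
                   [0.5, -0.2, 0.1]])
A = torch.tensor([[0.3, -0.5, 0.2],
                  [0.4, 0.1, -0.3]], requires_grad=True)
B = torch.zeros(4, 2, requires_grad=True)
x = torch.tensor([1.0, 0.5, -0.2])
g = torch.tensor([0.1, -0.2, 0.3, 0.1])   # upstream gradient dL/dh
alpha, r = 2.0, 2

h = W0 @ x + (alpha / r) * B @ (A @ x)
print(h)          # [1.31, 0.62, -0.15, 0.38]: matches the hand calculation

h.backward(g)     # inject the upstream gradient
print(B.grad)     # outer product g z^T with z = [0.01, 0.51]
print(A.grad)     # all zeros, because B = 0 at initialization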

Code Implementation

We'll implement the LoRA mathematics in PyTorch, demonstrating the forward pass, gradient computation, and initialization. The implementation verifies our theoretical predictions empirically.

In[6]:
Code
import torch

torch.manual_seed(42)
In[7]:
Code
import math

import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Linear layer with LoRA adaptation"""

    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 4,
        alpha: float = 1.0,
        pretrained_weight: torch.Tensor = None,
    ):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.scaling = alpha / rank

        # Frozen pre-trained weight
        if pretrained_weight is None:
            pretrained_weight = torch.randn(out_features, in_features) * 0.02
        self.register_buffer("W0", pretrained_weight)

        # LoRA matrices (trainable)
        self.A = nn.Parameter(torch.randn(rank, in_features) / math.sqrt(rank))
        self.B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass with LoRA adaptation."""
        # Original forward pass
        h0 = F.linear(x, self.W0)

        # LoRA path: x -> A -> B (with scaling)
        z = F.linear(x, self.A)  # Shape: (batch, rank)
        delta_h = F.linear(z, self.B)  # Shape: (batch, out_features)

        return h0 + self.scaling * delta_h

    def get_merged_weight(self) -> "torch.Tensor":
        """Return W0 + (alpha/r) * BA for inference."""
        return self.W0 + self.scaling * (self.B @ self.A)

Let's verify the initialization produces zero updates:

In[8]:
Code
# Create a LoRA layer
layer = LoRALinear(in_features=64, out_features=128, rank=8, alpha=8.0)

# Check that BA = 0 at initialization
BA = layer.B @ layer.A
shape_a = layer.A.shape
shape_b = layer.B.shape
shape_ba = BA.shape
max_ba = BA.abs().max().item()
Out[9]:
Console
Shape of A: torch.Size([8, 64])
Shape of B: torch.Size([128, 8])
Shape of BA: torch.Size([128, 64])
Max absolute value in BA: 0.000e+00

The product BA is exactly zero because B is initialized to zero, confirming that our model starts with the pretrained behavior. Matrix A has dimensions (rank, in_features) for the down-projection, while B has dimensions (out_features, rank) for the up-projection. Their product BA maintains the full weight matrix dimensions but contains only zeros at initialization, meaning the LoRA path contributes nothing to the output until training begins.

Verifying Gradient Flow

Let's trace gradients through a forward-backward pass:

In[10]:
Code
# Create layer and input
layer = LoRALinear(in_features=8, out_features=4, rank=2, alpha=2.0)
x = torch.randn(1, 8, requires_grad=True)

# Forward pass
h = layer(x)
input_shape = x.shape
output_shape = h.shape

# Backward pass with a simple loss
loss = h.sum()
loss.backward()

grad_a = layer.A.grad
grad_b = layer.B.grad
Out[11]:
Console
Input shape: torch.Size([1, 8])
Output shape: torch.Size([1, 4])
Gradient norm of A: 0.000e+00
Gradient norm of B: 7.864e+00

As predicted by the analysis, matrix A's gradient is zero at initialization because B = 0 enters its gradient formula, while B receives non-zero gradients. This asymmetry demonstrates the coupled dynamics derived mathematically: matrix A begins learning only after B has moved away from zero, creating a brief initial phase where B adapts alone before both matrices co-evolve.

Parameter Efficiency Calculation

Let's compute the parameter savings for a realistic scenario:

In[12]:
Code
def calculate_lora_efficiency(d: int, k: int, r: int, num_layers: int) -> dict:
    """Calculate parameter counts and savings."""
    full_params = d * k * num_layers
    lora_params = r * (d + k) * num_layers
    ratio = lora_params / full_params

    return {
        "full_params": full_params,
        "lora_params": lora_params,
        "ratio_percent": ratio * 100,
        "compression": 1 / ratio,
    }


# LLaMA-7B attention projections: d=k=4096, 32 layers, 4 matrices per layer
results = calculate_lora_efficiency(d=4096, k=4096, r=16, num_layers=32 * 4)
Out[13]:
Console
LLaMA-7B Attention LoRA Analysis (r=16):
  Full fine-tuning params: 2,147,483,648
  LoRA params: 16,777,216
  Percentage of full: 0.781%
  Compression ratio: 128x

This 128x compression achieves a dramatic parameter reduction while maintaining strong adaptation capability. Full fine-tuning of these attention layers requires 2.1 billion parameters, while LoRA needs only 16.8 million trainable parameters, making fine-tuning practical on consumer GPUs with 16-24GB of memory. The reduction also enables efficient multi-task serving, where different LoRA adapters can be swapped without duplicating the base model weights.

Visualizing Rank Effects

Let's visualize how different ranks affect the approximation capacity:

In[14]:
Code
# Create a target matrix that simulates a fine-tuning update
# (low intrinsic rank with some noise)
d, k = 64, 64
true_rank = 8
U_true = torch.randn(d, true_rank)
V_true = torch.randn(true_rank, k)
target_delta_W = U_true @ V_true + 0.1 * torch.randn(d, k)

# Try different LoRA ranks
ranks = [1, 2, 4, 8, 16, 32, 64]
errors = []

for r in ranks:
    # Best rank-r approximation via SVD
    U, S, Vh = torch.linalg.svd(target_delta_W, full_matrices=False)
    approx = (U[:, :r] * S[:r]) @ Vh[:r, :]
    error = torch.norm(target_delta_W - approx).item()
    errors.append(error)

# Store true_rank for use in visualization
true_rank_value = true_rank
Out[15]:
Visualization
Reconstruction error for a 64x64 fine-tuning update matrix with intrinsic rank 8, computed via SVD truncation. Error decreases sharply until rank 8, then levels off at higher ranks. This pattern demonstrates the matrix's true intrinsic rank and explains why LoRA with r=8 to r=16 captures nearly all useful information while achieving strong parameter compression for typical fine-tuning tasks.

The error drops sharply until we reach the true intrinsic rank of the target matrix (8 in this case), then decreases more slowly. This shows why moderate ranks like 8 to 16 often suffice in practice.

Gradient Magnitude Analysis

Let's examine how gradients evolve during the first few training steps:

In[16]:
Code
# Track gradient magnitudes over steps
layer = LoRALinear(in_features=32, out_features=32, rank=4, alpha=4.0)
optimizer = torch.optim.Adam([layer.A, layer.B], lr=0.01)

grad_A_norms = []
grad_B_norms = []
B_norms = []

for step in range(20):
    optimizer.zero_grad()

    x = torch.randn(16, 32)
    target = torch.randn(16, 32)

    output = layer(x)
    loss = F.mse_loss(output, target)
    loss.backward()

    grad_A_norms.append(layer.A.grad.norm().item())
    grad_B_norms.append(layer.B.grad.norm().item())
    B_norms.append(layer.B.data.norm().item())

    optimizer.step()
Out[17]:
Visualization
Gradient magnitudes for matrices A and B during 20 training steps reveal asymmetric learning dynamics. Matrix B receives non-zero gradients from step 1, while matrix A's gradient is zero initially. As B evolves, gradient flow to A enables both matrices to co-develop. This coupled dynamic explains why B must adapt first before joint optimization begins.
Matrix B Frobenius norm grows smoothly from zero to approximately 0.6 over 20 training steps, enabling gradient flow to A through coupled dynamics. This stable, predictable growth shows how zero initialization preserves pretrained behavior while transitioning to full two-matrix adaptation.

The plot confirms the mathematical analysis. Matrix A's gradient starts at zero and grows as B moves away from zero. The coupled dynamics quickly bring both matrices into active training.

Weight Merging for Inference

LoRA's trained adapters can be merged into the base weights for inference:

In[18]:
Code
# After training, we can merge LoRA into the base weights
layer = LoRALinear(in_features=64, out_features=64, rank=8, alpha=8.0)

# Simulate some training
for _ in range(10):
    x = torch.randn(16, 64)
    loss = layer(x).sum()
    loss.backward()
    with torch.no_grad():
        layer.A -= 0.01 * layer.A.grad
        layer.B -= 0.01 * layer.B.grad
        layer.A.grad.zero_()
        layer.B.grad.zero_()

# Verify equivalence
x_test = torch.randn(4, 64)
output_separate = layer(x_test)
merged_weight = layer.get_merged_weight()
output_merged = F.linear(x_test, merged_weight)

max_diff = (output_separate - output_merged).abs().max().item()
are_equivalent = torch.allclose(output_separate, output_merged)
Out[19]:
Console
Max difference: 2.432e-05
Outputs are equivalent: True

After merging, the LoRA matrices can be discarded and replaced with a single weight matrix that has zero inference overhead compared to the original model. The negligible difference confirms that merging is mathematically exact within floating-point accuracy. This property matters for deployment. During development, separate A and B matrices provide flexibility, while production uses merged weights to eliminate computational overhead from the low-rank decomposition.

Key Parameters

The key parameters for LoRA are:

  • rank (r): The bottleneck dimension of the low-rank decomposition. Lower ranks use fewer parameters but constrain adaptation capacity. Typical values range from 4 to 64.
  • alpha: Scaling factor that controls the magnitude of LoRA updates. Often set equal to rank or tuned in the range 8 to 64.
  • target_modules: Which weight matrices to apply LoRA to (e.g., query, key, value, output projections in attention layers).
  • initialization: Matrix A typically uses Gaussian or Kaiming initialization, while matrix B is initialized to zero to preserve pretrained behavior.
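
For reference, here is how these parameters typically map onto a configuration object in the Hugging Face peft library. This is a hypothetical sketch: the exact argument values, the LLaMA-style module names, and the surrounding setup are assumptions, and the library's behavior may differ across versions. The next chapter covers implementation in detail.

Code
from peft import LoraConfig

# Assumed peft-style configuration; module names follow LLaMA-style attention layers
config = LoraConfig(
    r=16,                          # rank of the low-rank decomposition
    lora_alpha=32,                 # scaling factor alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # where to attach adapters
    lora_dropout=0.05,             # optional dropout on the LoRA path
)
# get_peft_model(base_model, config) would then wrap a loaded base model
# with trainable A/B matrices while freezing the original weights.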

Conclusion

LoRA's mathematical elegance stems from the simplicity of the factorized decomposition combined with its effectiveness in practice. The low-rank constraint is not arbitrary but grounded in the empirical observation that fine-tuning updates concentrate in low-dimensional subspaces. Understanding the mathematics—from the basic formulation through initialization, gradient flow, and rank selection—reveals why LoRA works and how to apply it effectively across diverse applications and architectures.

Limitations and Impact

LoRA achieves remarkable parameter efficiency, but its limitations matter for practical application.

The low-rank constraint limits adaptation expressiveness. Empirical evidence shows this rarely hurts performance on common NLP benchmarks, though certain tasks require higher-rank updates that LoRA cannot efficiently represent. Tasks requiring significant architectural changes, such as cross-domain or cross-modality adaptation, may need ranks higher than LoRA typically provides. The intrinsic dimensionality hypothesis provides theoretical grounding but cannot guarantee that low-rank approaches suffice for all adaptations.

The initialization scheme creates a particular training dynamic that may not be optimal for all scenarios. Zero-initializing B means early training updates only affect B, potentially slowing convergence compared to methods that update all parameters immediately. The scaling factor \frac{\alpha}{r} introduces hyperparameters that interact with learning rates in non-obvious ways, requiring careful tuning.

Despite these limitations, LoRA's mathematical formulation has significantly influenced the field. The decomposition W = W₀ + BA provides a clean interface for modular adaptation, where different LoRA matrices can be trained for different tasks and swapped at inference without modifying the base model. This enables multi-tenant serving where a single base model serves many specialized applications. The formulation also inspired numerous extensions, including QLoRA (quantization with LoRA), AdaLoRA (dynamic rank adaptation), and structured approaches that exploit domain-specific priors about where low-rank updates should be applied.

Summary

This chapter covered the mathematical foundations of LoRA:

  • Core formulation: W = W_0 + \frac{\alpha}{r}BA expresses weight updates as a product of two smaller matrices, constraining updates to rank r.

  • Low-rank approximation: SVD theory and the intrinsic dimensionality hypothesis show that fine-tuning updates live in low-dimensional subspaces of weight space.

  • Rank selection: Balances expressiveness against efficiency. Ranks r from 4 to 64 achieve over 100x compression while maintaining adaptation quality.

  • Initialization: B starts at zero while A is randomly initialized, ensuring training begins from pre-trained behavior with stable gradient dynamics.

  • Gradient flow: A and B co-evolve during training. A receives zero gradients initially but learns as B moves away from zero.

The next chapter covers practical implementation patterns for applying LoRA to transformer architectures and integrating it with existing training pipelines.

