PEFT Comparison: Choosing the Right Fine-Tuning Method

Michael Brenndoerfer · December 14, 2025 · 46 min read

Compare LoRA, QLoRA, Adapters, IA³, Prefix Tuning, and Prompt Tuning across efficiency, performance, and memory. Practical guide for choosing PEFT methods.


PEFT Comparison

Throughout this part, we've explored six distinct approaches to parameter-efficient fine-tuning: LoRA, QLoRA, Adapters, IA³, Prefix Tuning, and Prompt Tuning. Each method introduces a different inductive bias about where task-specific knowledge should reside in a model. But when facing a new task, how do you choose which method to use?

This chapter synthesizes what we've learned into a practical comparison framework. We'll examine each method along multiple dimensions: parameter efficiency, task performance, computational overhead, and implementation complexity. By the end, you'll have concrete guidelines for selecting the right PEFT method for your specific use case.

Parameter Efficiency Comparison

PEFT methods adapt large models by updating only a small fraction of parameters. To compare these methods, we must establish concrete metrics. Examining the parameter counts reveals how methods differ in memory and storage requirements.

Absolute Parameter Counts

Let's establish concrete numbers for a typical 7B parameter model like LLaMA-2-7B with hidden dimension $d = 4096$, 32 layers, and 32 attention heads. Calculating the parameters for each method shows the tradeoffs and suitability for different hardware constraints.

LoRA approximates weight updates by decomposing them into two low-rank matrices, $A$ and $B$, such that $\Delta W = BA$. This design relies on the hypothesis that weight updates have a low "intrinsic rank," meaning the necessary adaptations exist in a low-dimensional subspace. The key insight motivating this approach is that while the original weight matrices are enormous, the changes needed to adapt a model to a new task may be far simpler in structure. If we can capture these changes using matrices with reduced dimensionality, we achieve massive parameter savings without sacrificing the model's ability to learn the task. For a layer with hidden dimension $d$ and rank $r$, the number of parameters is $d \times r + r \times d = 2dr$. When applied to all four attention projections ($W_Q, W_K, W_V, W_O$):

The total number of trainable parameters for LoRA is calculated by summing the parameters of the low-rank matrices $A$ and $B$ across all target modules and layers. This calculation reveals how the rank parameter $r$ serves as a direct control knob for the tradeoff between model capacity and parameter count:

$$\text{LoRA params} = 4 \times (2 \times d \times r) \times L$$

where:

  • $4$: the number of target modules per layer (Query, Key, Value, Output)
  • $2 \times d \times r$: parameters for matrices $A$ and $B$ in one module (since each has size $d \times r$ or $r \times d$)
  • $d$: hidden dimension ($4096$)
  • $r$: rank of the low-rank adapters ($16$)
  • $L$: number of layers ($32$)

For our LLaMA-2-7B configuration, this yields:

$$\begin{aligned} \text{LoRA params} &= 4 \times 2 \times 4096 \times 16 \times 32 \\ &\approx 16.8\text{M} \end{aligned}$$

This result is striking: we can adapt a 7 billion parameter model by training only 16.8 million parameters, a reduction of over 400 times. The rank of 16 is deliberately chosen to be small enough to provide substantial parameter savings while still offering sufficient capacity for most adaptation tasks.
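
To make the arithmetic concrete, here is a minimal, illustrative LoRA layer in PyTorch. This is a sketch rather than the PEFT library's implementation; the class name, initialization scale, and the choice to wrap `nn.Linear` are our own assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight and bias
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection A
        self.B = nn.Parameter(torch.zeros(d_out, r))         # up-projection B, zero-init
        self.scaling = alpha / r

    def forward(self, x):
        # y = W x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4096 * 16 = 131,072 per adapted projection
```

With four such projections per layer and 32 layers, this per-projection count reproduces the roughly 16.8M total above.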

Adapters insert bottleneck modules (consisting of a down-projection and an up-projection) after both the attention and feed-forward (FFN) layers. This bottleneck architecture compresses inputs into a lower-dimensional representation before expanding them back, reducing parameter count while enabling non-linear adaptation. The fundamental idea behind adapters is that task-specific information can be "filtered" through a narrow bottleneck, forcing the model to learn only the most essential transformations. This compression-expansion pattern also introduces non-linearity through activation functions, giving adapters additional representational power compared to linear methods like LoRA. With bottleneck dimension $d_{\text{bottleneck}} = 64$:

We calculate the parameter count by summing the weights of the down-projection and up-projection matrices for each adapter. The symmetry of this calculation, with equal contributions from each projection direction, reflects the balanced nature of the bottleneck design:

$$\text{Adapter params} = 2 \times (d \times d_{\text{bottleneck}} + d_{\text{bottleneck}} \times d) \times L$$

where:

  • $2$: the number of adapters per layer (one inserted after Attention, one after FFN)
  • $d$: the hidden dimension of the model ($4096$)
  • $d_{\text{bottleneck}}$: the size of the bottleneck dimension ($64$)
  • $d \times d_{\text{bottleneck}}$: the number of parameters in the down-projection matrix
  • $d_{\text{bottleneck}} \times d$: the number of parameters in the up-projection matrix
  • $L$: the number of layers in the model ($32$)

Plugging in the values:

$$\begin{aligned} \text{Adapter params} &= 2 \times (4096 \times 64 + 64 \times 4096) \times 32 \\ &\approx 33.6\text{M} \end{aligned}$$

Adapters require roughly twice as many parameters as LoRA with rank 16. This increased parameter count buys additional expressiveness through the non-linear activation function between the down and up projections, but comes at the cost of both storage and, as we'll see later, inference overhead.
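
For comparison, a bottleneck adapter block in PyTorch might look like the following sketch; the residual placement and GELU activation are common choices here, not the only possible ones:

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, d_model: int = 4096, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden):
        # The residual connection preserves the frozen model's original pathway
        return hidden + self.up(self.act(self.down(hidden)))

adapter = BottleneckAdapter()
print(sum(p.numel() for p in adapter.parameters()))
# ≈ 2 * 4096 * 64 weights, plus 64 + 4096 bias terms
```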

Prefix Tuning prepends learnable vectors to the keys and values at every layer. These virtual tokens effectively "steer" the attention mechanism at every depth, guiding the model's internal processing. Rather than modifying the model's weights directly, prefix tuning modifies the context that the model attends to. Optimizing key-value pairs at every layer influences attention flow without changing the transformation matrices. With prefix length $l = 20$:

The trainable parameters consist of the virtual token embeddings added to the Key and Value matrices at each layer. This formulation makes clear why prefix tuning affects only the attention mechanism: we're adding content for the model to attend to, not modifying how it processes that content:

$$\text{Prefix params} = l \times d \times 2 \times L$$

where:

  • $l$: the prefix length (number of virtual tokens, set to $20$)
  • $d$: the hidden dimension ($4096$)
  • $2$: a factor accounting for application to both Key and Value matrices
  • $L$: the number of layers ($32$)

Substituting the values:

$$\begin{aligned} \text{Prefix params} &= 20 \times 4096 \times 2 \times 32 \\ &\approx 5.2\text{M} \end{aligned}$$

The parameter count for Prefix Tuning is lower than LoRA despite operating at every layer. This efficiency comes from the method's focused scope: it only influences the attention mechanism through the key-value channel, leaving all other computations unchanged.
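
Conceptually, the per-layer prefix amounts to concatenating learned key and value rows in front of the computed ones. A rough sketch of that operation (attention internals such as head splitting are omitted, and the variable names are illustrative):

```python
import torch
import torch.nn as nn

l, d, num_layers = 20, 4096, 32
prefix_k = nn.Parameter(torch.randn(l, d) * 0.02)  # one such pair per layer
prefix_v = nn.Parameter(torch.randn(l, d) * 0.02)

def extend_kv(keys, values):
    """keys, values: (batch, seq_len, d) -> (batch, l + seq_len, d)."""
    batch = keys.size(0)
    k = torch.cat([prefix_k.unsqueeze(0).expand(batch, -1, -1), keys], dim=1)
    v = torch.cat([prefix_v.unsqueeze(0).expand(batch, -1, -1), values], dim=1)
    return k, v

per_layer = prefix_k.numel() + prefix_v.numel()  # 2 * l * d per layer
print(per_layer * num_layers)                    # 5,242,880 ≈ 5.2M
```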

Prompt Tuning adds learnable embeddings only at the input layer. Unlike discrete text prompts, these continuous embeddings are optimized directly via backpropagation to find the most effective task trigger. This approach represents perhaps the simplest possible form of model adaptation: we're essentially learning a better way to "ask" the model to perform our task. The continuous nature of these embeddings allows optimization to find representations that may have no natural language equivalent, potentially discovering more effective task specifications than any human-written prompt could provide. With prompt length $p = 100$:

The parameters come solely from the learnable embeddings prepended to the input layer. This single-layer application is what gives Prompt Tuning its remarkable parameter efficiency, though it also limits the method's ability to influence deep model computations:

$$\text{Prompt params} = p \times d$$

where:

  • $p$: the prompt length (number of virtual tokens, set to $100$)
  • $d$: the embedding dimension ($4096$)

Calculating the total:

$$\begin{aligned} \text{Prompt params} &= 100 \times 4096 \\ &\approx 0.4\text{M} \end{aligned}$$

With fewer than half a million parameters, Prompt Tuning achieves the smallest footprint of any method we've examined. This extreme efficiency makes it ideal for scenarios where storage or memory is at an absolute premium, though as we'll see in the performance analysis, this efficiency comes with tradeoffs.
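
Mechanically, prompt tuning just concatenates learned vectors in front of the token embeddings before the first transformer layer. A minimal sketch, with illustrative variable names:

```python
import torch
import torch.nn as nn

p, d = 100, 4096
soft_prompt = nn.Parameter(torch.randn(p, d) * 0.02)  # the only trainable tensor

def prepend_prompt(token_embeds):
    """token_embeds: (batch, seq_len, d) -> (batch, p + seq_len, d)."""
    batch = token_embeds.size(0)
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([prompt, token_embeds], dim=1)

print(soft_prompt.numel())  # 100 * 4096 = 409,600 trainable parameters
```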

IA³ introduces learned vectors that scale the activations element-wise. This acts like a feature equalizer, selectively amplifying or suppressing specific activation channels. IA³ assumes the base model computes useful features and only needs to adjust their relative importance for different tasks. Rather than learning new transformations, IA³ learns which existing computations to emphasize and which to diminish. IA³ is effective because it leverages pre-trained representations. We calculate the parameter count by summing the scaling vectors for the Key, Value, and FFN intermediate activations:

$$\text{IA}^3 \text{ params} = (d_k + d_v + d_{\text{ff}}) \times L$$

where:

  • $d_k, d_v$: the dimension of scaling vectors for attention keys and values ($4096$, matching the hidden dimension)
  • $d_{\text{ff}}$: the dimension of the scaling vector for the FFN expansion layer ($11008$)
  • $L$: the number of layers ($32$)

This results in a very small number of parameters:

$$\begin{aligned} \text{IA}^3 \text{ params} &= (4096 + 4096 + 11008) \times 32 \\ &\approx 0.61\text{M} \end{aligned}$$

The parameter count for IA³ is remarkably low, just over half a million for a 7B model. This efficiency stems from the element-wise nature of the scaling operation: we need only one scalar per activation dimension, rather than full transformation matrices. The tradeoff is that IA³ can only rescale existing activations, not compute fundamentally new features.
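
A sketch of the scaling operation, using ones-initialization so the adapted model starts out identical to the frozen one; the names and exact placement are illustrative rather than the exact IA³ implementation:

```python
import torch
import torch.nn as nn

d_model, d_ff, num_layers = 4096, 11008, 32
l_k = nn.Parameter(torch.ones(d_model))   # rescales attention keys
l_v = nn.Parameter(torch.ones(d_model))   # rescales attention values
l_ff = nn.Parameter(torch.ones(d_ff))     # rescales FFN intermediate activations

def scale_keys(keys):
    """keys: (batch, seq_len, d_model); broadcasted element-wise rescaling."""
    return keys * l_k

per_layer = l_k.numel() + l_v.numel() + l_ff.numel()
print(per_layer * num_layers)  # (4096 + 4096 + 11008) * 32 = 614,400 ≈ 0.61M
```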

In[2]:
Code
!uv pip install pandas matplotlib numpy

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Model configuration (LLaMA-2-7B-like)
hidden_dim = 4096
num_layers = 32
num_heads = 32
head_dim = hidden_dim // num_heads
ffn_dim = 11008  # LLaMA uses ~2.7x expansion
total_params = 7e9

# PEFT parameter calculations
methods = {
    'Prompt Tuning': 100 * hidden_dim,  # 100 soft tokens
    'IA³': (hidden_dim + hidden_dim + ffn_dim) * num_layers,  # k, v, ffn scaling
    'Prefix Tuning': 20 * hidden_dim * 2 * num_layers,  # 20 prefix tokens, K and V
    'LoRA (r=16)': 2 * 16 * hidden_dim * 4 * num_layers,  # A and B for Q,K,V,O
    'LoRA (r=64)': 2 * 64 * hidden_dim * 4 * num_layers,
    'Adapters': 2 * hidden_dim * 64 * 2 * num_layers,  # bottleneck=64, 2 adapters per layer
    'Full Fine-tuning': total_params
}

# Calculate percentages
percentages = {k: (v / total_params) * 100 for k, v in methods.items()}
Out[3]:
Console
Parameter Efficiency Comparison for 7B Model
============================================================
          Method Trainable Parameters % of Model
   Prompt Tuning                0.41M    0.0059%
             IA³                0.61M    0.0088%
   Prefix Tuning                5.24M    0.0749%
     LoRA (r=16)               16.78M    0.2397%
     LoRA (r=64)               67.11M    0.9587%
        Adapters               33.55M    0.4793%
Full Fine-tuning                 7.0B     100.0%

The differences are dramatic. Prompt Tuning and IA³ train fewer than 1 million parameters, less than 0.01% of the model. LoRA at typical ranks trains 0.2-0.9% of parameters. Even the "heaviest" PEFT method (Adapters) trains less than 0.5% of the model. These numbers illustrate a fundamental principle: the vast majority of a pre-trained model's knowledge is encoded in its frozen weights, and task-specific adaptation can be achieved with remarkably few additional parameters.

Out[4]:
Visualization
Bar chart showing parameter efficiency of PEFT methods on logarithmic scale.
Trainable parameters as a percentage of total model parameters for various PEFT methods. The logarithmic scale reveals that the difference between Prompt Tuning and full fine-tuning spans four orders of magnitude.
Out[5]:
Visualization
Bar chart showing absolute parameter counts for each PEFT method.
Absolute trainable parameter counts for PEFT methods on a 7B model. While Prompt Tuning and IA³ require fewer than 1 million parameters, LoRA at rank 64 exceeds 67 million, demonstrating the wide range of storage requirements across different methods.

Memory Footprint Analysis

Parameter count alone doesn't tell the full story. During training, memory consumption includes model weights, optimizer states, gradients, and activations. Understanding the complete memory picture is essential for determining which methods will actually fit on your hardware. A method might train fewer parameters but still require substantial memory for other components of the training process.

For full fine-tuning with AdamW, memory is consumed by weights, optimizer states, gradients, and activations. Assuming mixed precision training:

We calculate the memory requirements by summing the components needed for model states and training overhead. This includes the static footprint of the model weights and the dynamic memory required for optimizer states, gradients, and forward-pass activations. Each component plays a distinct role in the training process, and understanding their individual contributions helps explain why full fine-tuning is so memory-intensive:

$$\text{Memory}_{\text{full}} = \text{weights} + \text{optimizer} + \text{gradients} + \text{activations}$$

where:

  • weights: memory for the model parameters (usually FP16)
  • optimizer: memory for optimizer states (like momentum and variance in AdamW)
  • gradients: memory for storing gradients during backpropagation
  • activations: memory for intermediate outputs needed for the backward pass

Using a simplified estimation model:

Memoryfull2P+4P+4P+2P=12P bytes\begin{aligned} \text{Memory}_{\text{full}} &\approx 2P + 4P + 4P + 2P \\ &= 12P \text{ bytes} \end{aligned}

where:

  • $P$: the total number of model parameters
  • $2P$: FP16 model weights ($2$ bytes per parameter)
  • $4P$: optimizer states (simplified estimate)
  • $4P$: gradients (FP32 or mixed)
  • $2P$: activations (approximate average)

A 7B model requires approximately 84 GB of VRAM. Full fine-tuning is memory-intensive because requirements scale linearly with parameters, and each parameter needs storage for multiple values.
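
A quick back-of-the-envelope check of this estimate; the 12-bytes-per-parameter factor is the simplification introduced above, not a precise accounting:

```python
P = 7e9                        # base model parameters
full_ft_gb = 12 * P / 1e9      # weights + optimizer + gradients + activations
print(f"{full_ft_gb:.0f} GB")  # 84 GB
```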

PEFT methods drastically reduce memory by freezing the base model and training only the adapter parameters. The key insight is that frozen parameters need only their weights stored, not optimizer states or gradients. With LoRA (rank 16):

For LoRA, we only need to store the optimizer states and gradients for the small set of trainable parameters, while the base model remains frozen in low precision. This separation between frozen and trainable parameters is the fundamental mechanism through which all PEFT methods achieve their memory efficiency:

MemoryLoRA2Pfrozen+12Ptrainable\text{Memory}_{\text{LoRA}} \approx 2P_{\text{frozen}} + 12P_{\text{trainable}}

where:

  • $P_{\text{frozen}}$: the number of frozen base model parameters ($7$ billion)
  • $P_{\text{trainable}}$: the number of trainable LoRA parameters ($16.8$ million)
  • $2$: bytes per frozen parameter (FP16 weights only)
  • $12$: the memory factor for trainable parameters (weights + optimizer + gradients + activations)

Applying this to our example:

MemoryLoRA(2×7×109)+(12×16.8×106)14 GB+0.2 GB=14.2 GB\begin{aligned} \text{Memory}_{\text{LoRA}} &\approx (2 \times 7 \times 10^9) + (12 \times 16.8 \times 10^6) \\ &\approx 14 \text{ GB} + 0.2 \text{ GB} \\ &= 14.2 \text{ GB} \end{aligned}

This is why a 7B model that requires 80+ GB for full fine-tuning can be fine-tuned with LoRA on a single 24 GB GPU. This shift in memory requirements makes fine-tuning accessible: you can adapt large models to your specific needs on hardware that could never hold the full fine-tuning footprint, without expensive compute clusters.

In[6]:
Code
def estimate_training_memory(
    total_params,
    trainable_params,
    batch_size=4,
    seq_len=512,
    hidden_dim=4096,
    num_layers=32,
    precision="fp16",
):
    """Estimate GPU memory requirements for training."""
    bytes_per_param = 2 if precision == "fp16" else 4

    # Model weights (frozen in FP16)
    frozen_params = total_params - trainable_params
    weight_memory = frozen_params * 2  # FP16 frozen
    weight_memory += trainable_params * bytes_per_param  # Trainable weights

    # Optimizer states (FP32 for stability)
    optimizer_memory = trainable_params * 4 * 2  # Adam has 2 states

    # Gradients for trainable params only
    gradient_memory = trainable_params * 4  # FP32 gradients

    # Activation memory (rough estimate)
    # Each layer stores attention patterns and intermediate activations
    activation_per_layer = (
        batch_size * seq_len * hidden_dim * bytes_per_param * 4
    )
    activation_memory = activation_per_layer * num_layers

    total_gb = (
        weight_memory + optimizer_memory + gradient_memory + activation_memory
    ) / 1e9

    return {
        "weights_gb": weight_memory / 1e9,
        "optimizer_gb": optimizer_memory / 1e9,
        "gradients_gb": gradient_memory / 1e9,
        "activations_gb": activation_memory / 1e9,
        "total_gb": total_gb,
    }
In[7]:
Code
batch_size = 4
seq_len = 512

## Calculate memory for each method
memory_estimates = {}
for method, trainable in methods.items():
    if method == "Full Fine-tuning":
        trainable = total_params
    memory_estimates[method] = estimate_training_memory(
        total_params=total_params,
        trainable_params=trainable,
        batch_size=batch_size,
        seq_len=seq_len,
    )
Out[8]:
Console
Memory Estimates (Batch Size=4, Seq Len=512)
======================================================================
Method               Weights      Optimizer    Total       
----------------------------------------------------------------------
Prompt Tuning        14.0 GB      0.0 GB      16.2 GB
IA³                  14.0 GB      0.0 GB      16.2 GB
Prefix Tuning        14.0 GB      0.0 GB      16.2 GB
LoRA (r=16)          14.0 GB      0.1 GB      16.3 GB
LoRA (r=64)          14.0 GB      0.5 GB      17.0 GB
Adapters             14.0 GB      0.3 GB      16.6 GB
Full Fine-tuning     14.0 GB      56.0 GB      100.1 GB

The estimated values demonstrate the massive efficiency gains of PEFT. While full fine-tuning demands roughly 100 GB of VRAM in this estimate, requiring high-end data center GPUs, all PEFT methods fit comfortably on consumer hardware, needing only 16-17 GB, most of it the frozen FP16 weights. This roughly 6x reduction in memory footprint is what makes fine-tuning more accessible. The practical implication is transformative: techniques that were once exclusive to well-funded research labs are now available to individuals, small teams, and educational institutions.

Out[9]:
Visualization
Stacked bar chart showing memory components for different training approaches.
Memory breakdown comparison between full fine-tuning and PEFT methods. The dramatic reduction in optimizer and gradient memory requirements is the key driver of PEFT efficiency.

Performance Comparison Across Tasks

Parameter efficiency means nothing if the method doesn't work. Let's examine how these methods perform across different task categories and training data sizes to inform the selection guidelines later in this chapter.

Natural Language Understanding

NLU tasks like sentiment analysis, natural language inference, and question answering have been extensively benchmarked. Classification and extraction tasks are the most studied PEFT domains and provide robust comparative data. On the GLUE benchmark, the relative ordering tends to be:

Full Fine-tuning ≥ LoRA ≥ Adapters ≥ IA³ > Prefix Tuning > Prompt Tuning

However, the gaps are often smaller than you might expect. The methods that train more parameters (LoRA, Adapters) achieve performance closer to full fine-tuning, while the ultra-efficient methods (Prompt Tuning, IA³) show somewhat larger gaps. Yet even the "weakest" PEFT methods often achieve respectable performance. On BERT-base with the standard GLUE tasks:

In[10]:
Code
# Representative benchmark results (aggregated from literature)
# Note: Actual values vary by implementation and hyperparameters
glue_results = {
    "Method": ["Full FT", "LoRA", "Adapters", "Prefix", "Prompt", "IA³"],
    "MNLI": [84.6, 84.3, 84.1, 83.5, 82.1, 83.8],
    "QQP": [91.2, 91.0, 90.8, 90.2, 89.4, 90.6],
    "SST-2": [93.5, 93.2, 93.0, 92.4, 91.2, 92.8],
    "Average": [89.8, 89.5, 89.3, 88.7, 87.6, 89.1],
}

glue_df = pd.DataFrame(glue_results)
Out[11]:
Console
GLUE Benchmark Results (Accuracy %)
=================================================================
  Method  MNLI  QQP  SST-2  Average
 Full FT  84.6 91.2   93.5     89.8
    LoRA  84.3 91.0   93.2     89.5
Adapters  84.1 90.8   93.0     89.3
  Prefix  83.5 90.2   92.4     88.7
  Prompt  82.1 89.4   91.2     87.6
     IA³  83.8 90.6   92.8     89.1

Note: Results aggregated from multiple papers; actual performance varies
Out[12]:
Visualization
Grouped bar chart comparing PEFT methods on GLUE benchmark tasks.
GLUE benchmark performance across PEFT methods. The performance gaps between methods are relatively small, with LoRA achieving near-parity with full fine-tuning across all tasks.

The key insight: LoRA typically achieves 95-99% of full fine-tuning performance while training less than 1% of parameters. This remarkable efficiency ratio is what makes PEFT methods so attractive for practical applications. You can often get nearly all the benefit of full fine-tuning at a tiny fraction of the cost. The performance gap between LoRA and full fine-tuning on these NLU tasks is often within the noise of hyperparameter selection, suggesting that for many classification tasks, LoRA represents a near-optimal tradeoff between efficiency and capability.

Natural Language Generation

Generation tasks, such as summarization, translation, and dialogue, show different patterns. The quality of generated text depends heavily on how well the method can modify the model's output distribution. Generation tasks are sensitive to adaptation quality because they require sequential predictions where small errors compound.

In[13]:
Code
# Representative generation task results
gen_results = {
    "Method": ["Full FT", "LoRA", "Adapters", "Prefix", "IA³"],
    "Summarization (ROUGE-L)": [42.1, 41.8, 41.2, 40.5, 40.9],
    "Translation (BLEU)": [27.3, 27.1, 26.4, 25.8, 26.2],
    "Dialogue (BLEU)": [18.4, 18.2, 17.6, 17.1, 17.8],
}

gen_df = pd.DataFrame(gen_results)
Out[14]:
Console
Generation Task Results
=================================================================
  Method  Summarization (ROUGE-L)  Translation (BLEU)  Dialogue (BLEU)
 Full FT                     42.1                27.3             18.4
    LoRA                     41.8                27.1             18.2
Adapters                     41.2                26.4             17.6
  Prefix                     40.5                25.8             17.1
     IA³                     40.9                26.2             17.8
Out[15]:
Visualization
Grouped bar chart comparing PEFT methods on generation tasks.
Generation task performance comparison. Prefix Tuning shows a larger gap on generation tasks compared to NLU, reflecting its inability to modify feed-forward computations.

Prefix Tuning shows a larger gap on generation tasks. This makes intuitive sense: as we discussed in the Prefix Tuning chapter, the method modifies attention patterns through prefix keys and values but doesn't directly alter how the model generates output tokens. The feed-forward networks, which play a crucial role in transforming attended information into next-token predictions, remain entirely unchanged by prefix tuning. LoRA, by contrast, can modify both the attention mechanism and (when applied appropriately) the FFN layers, giving it more direct control over the generation process.

Instruction Following and Chat

The most practical application of PEFT today is adapting base models to follow instructions. This presents a unique challenge: the model must learn both the format of instruction-response pairs and diverse task knowledge. Instruction following requires the model to understand when it should generate output, what style that output should take, and how to appropriately respond to the vast diversity of user requests. This multi-faceted learning objective places significant demands on the adaptation method's capacity.

On instruction-following benchmarks, LoRA-tuned models often match fully fine-tuned versions:

In[16]:
Code
instruction_results = {
    "Method": [
        "Full FT",
        "LoRA (r=64)",
        "LoRA (r=16)",
        "QLoRA (r=64)",
        "Prefix",
        "IA³",
    ],
    "AlpacaEval Win Rate": [
        "89.2%",
        "88.7%",
        "86.3%",
        "87.9%",
        "78.4%",
        "81.2%",
    ],
    "MT-Bench Score": [7.4, 7.3, 7.0, 7.2, 6.1, 6.5],
}

inst_df = pd.DataFrame(instruction_results)
Out[17]:
Console
Instruction-Following Performance (LLaMA-7B)
=================================================================
      Method AlpacaEval Win Rate  MT-Bench Score
     Full FT               89.2%             7.4
 LoRA (r=64)               88.7%             7.3
 LoRA (r=16)               86.3%             7.0
QLoRA (r=64)               87.9%             7.2
      Prefix               78.4%             6.1
         IA³               81.2%             6.5

QLoRA's strong performance here is remarkable: it achieves near-LoRA quality while enabling fine-tuning of 7B models on consumer GPUs with just 24GB of memory. The combination of 4-bit quantization for the base model weights and full-precision LoRA adapters proves to be an excellent practical compromise. We'll explore instruction tuning in depth in the next part.

Few-Shot and Low-Resource Settings

When training data is scarce (hundreds rather than thousands of examples), the relative performance of methods shifts. Data efficiency becomes a critical concern when labeled examples are expensive to obtain, when working with specialized domains, or when rapidly prototyping new applications. Understanding how each method behaves in low-data regimes helps inform decisions about whether to invest in data collection or accept the performance tradeoffs of a more efficient method:

Out[18]:
Visualization
Line chart showing PEFT method performance versus training data size.
Performance of PEFT methods across different training data sizes. Prompt Tuning requires substantial data to work well, while LoRA maintains strong performance even with limited examples.

The pattern is clear: methods that train more parameters (LoRA, Adapters) are more data-hungry but achieve higher peak performance. Prompt Tuning struggles with limited data because it must learn rich representations from scratch in a very low-dimensional space. The soft prompt embeddings start as random vectors with no task-relevant structure, and with only a few hundred examples, there simply isn't enough signal to learn effective representations. IA³, while also parameter-efficient, fares better in low-data settings because its scaling approach builds on the model's existing feature representations rather than learning new ones from scratch.

Computational Overhead Analysis

Beyond memory, PEFT methods differ in their computational costs during training and inference. These differences can significantly impact both development iteration speed and production serving costs. A complete understanding of PEFT tradeoffs requires examining not just parameter counts and performance, but also the time and compute required to train and deploy adapted models.

Training Speed

Training throughput depends on several factors that interact in complex ways:

  1. Forward pass overhead: Added computation from adapter layers or modified attention. Methods that introduce new layers (Adapters) or extend sequence length (Prefix Tuning) increase forward pass time.
  2. Backward pass scope: How many parameters receive gradients. Methods with fewer trainable parameters compute fewer gradients, reducing backward pass time significantly.
  3. Memory access patterns: Sequential access (Adapters) vs. broadcasted scaling (IA³). Memory-efficient operations can provide substantial speedups even when parameter counts are similar.
In[19]:
Code
# Relative training speed comparison
training_speed = {
    "Method": ["Full FT", "LoRA", "Adapters", "Prefix", "Prompt", "IA³"],
    "Forward Overhead": ["1.0x", "1.0x", "1.1x", "1.05x", "1.0x", "1.0x"],
    "Backward Speedup": ["1.0x", "1.3x", "1.2x", "1.5x", "2.0x", "1.8x"],
    "Overall Throughput": ["1.0x", "1.2x", "1.1x", "1.3x", "1.5x", "1.4x"],
}
Out[20]:
Console
Relative Training Speed (vs. Full Fine-tuning)
============================================================
  Method Forward Overhead Backward Speedup Overall Throughput
 Full FT             1.0x             1.0x               1.0x
    LoRA             1.0x             1.3x               1.2x
Adapters             1.1x             1.2x               1.1x
  Prefix            1.05x             1.5x               1.3x
  Prompt             1.0x             2.0x               1.5x
     IA³             1.0x             1.8x               1.4x

Note: Actual speedups depend heavily on hardware and implementation
Out[21]:
Visualization
Grouped bar chart showing forward overhead and backward speedup for PEFT methods.
Training throughput comparison showing forward overhead and backward speedup for each PEFT method. The overall throughput improvement comes primarily from reduced gradient computation in the backward pass.

Throughput improves primarily because the backward pass gets cheaper: weight gradients and optimizer updates are needed only for the small set of trainable parameters, while the frozen weights receive neither. Prompt Tuning is fastest because only 100 embedding vectors are updated, even though activation gradients must still flow through the full network to reach them.

Inference Impact

A critical distinction: some PEFT methods add inference overhead while others are "free" at inference time. This distinction has major implications for production deployments, where inference costs often dominate the total cost of ownership for ML systems.

Zero inference overhead:

  • LoRA: Weights can be merged ($W + BA$) after training. The low-rank decomposition is purely a training-time construct; once the adapter is trained, we can precompute the full weight update and fold it into the base model weights (a minimal merging sketch follows these lists).
  • IA³: Scaling can be folded into weights. Since IA³ applies element-wise multiplication, the scaling factors can be absorbed into the preceding linear transformation's weights.
  • Prompt Tuning: Only adds constant prefix tokens (minimal impact). While there is a small increase in sequence length, this typically adds less than 5% to inference time.

Persistent inference overhead:

  • Adapters: Must execute additional layers at every forward pass. The bottleneck computations cannot be removed because they involve non-linear activations that cannot be merged with surrounding linear operations.
  • Prefix Tuning: Increases effective sequence length by prefix size. The prefix tokens add to the key-value cache and increase attention computation proportional to the prefix length.
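
The zero-overhead claim for LoRA is easy to verify numerically. With the PEFT library the merge is typically a one-liner (`merged = peft_model.merge_and_unload()`); the sketch below shows the underlying arithmetic with made-up placeholder tensors:

```python
import torch

d, r, alpha = 4096, 16, 32
W = torch.randn(d, d)                  # frozen pretrained weight
A = torch.randn(r, d) * 0.01           # "trained" LoRA factors (placeholder values)
B = torch.randn(d, r) * 0.01
W_merged = W + (alpha / r) * (B @ A)   # same shape as W, so no extra latency

x = torch.randn(2, d)
y_adapter = x @ W.T + (alpha / r) * (x @ A.T @ B.T)   # adapter kept separate
y_merged = x @ W_merged.T                              # adapter folded into W
print(torch.allclose(y_adapter, y_merged, atol=1e-3))  # True
```
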
Out[22]:
Visualization
Bar chart comparing inference latency overhead of PEFT methods.
Inference latency comparison showing per-token generation time relative to the base model. LoRA and IA³ add no overhead when weights are merged, while Adapters add consistent overhead.

For production deployments where inference cost dominates, LoRA's ability to merge weights is a significant advantage. You get the benefits of task-specific adaptation with identical serving costs to the base model. This property makes LoRA the preferred choice for applications where models must serve millions of requests, as even small percentage increases in latency translate to substantial infrastructure costs at scale.

Task Suitability Analysis

Different PEFT methods suit different scenarios. Let's map methods to use cases based on the technical properties we've analyzed. The goal is to provide actionable guidance that accounts for your specific constraints and requirements. No single method is universally best; the right choice depends on your particular combination of performance needs, resource constraints, and deployment requirements.

When to Use LoRA

LoRA is the default choice for most scenarios and excels when:

  • You need maximum performance: LoRA consistently achieves the smallest gap to full fine-tuning across diverse task types and model scales
  • Inference cost matters: Merged weights mean zero serving overhead, making LoRA ideal for high-traffic production deployments
  • You're fine-tuning on GPUs with 24-48GB VRAM: Standard LoRA (r=16-64) fits comfortably on professional-grade consumer hardware
  • You want to stack multiple adapters: Different LoRA weights can be loaded dynamically for multi-task serving, enabling efficient model customization without maintaining separate model copies

Avoid LoRA when:

  • The task requires changes beyond what a low-rank update can express, such as learning a new language or deep domain expertise (full fine-tuning may be necessary, as discussed in the limitations section below)
  • Even the frozen FP16 base model exceeds your GPU memory (QLoRA, covered next, is the natural fallback)

When to Use QLoRA

QLoRA extends LoRA's applicability to memory-constrained settings by combining 4-bit quantization of base model weights with standard LoRA adapters:

  • Consumer hardware: Fine-tune 7B models on 24GB GPUs, 13B on 48GB, making large model adaptation accessible without enterprise-grade hardware
  • Large models on limited budgets: Access to 70B-scale models without A100 clusters, democratizing research on frontier-scale models
  • Research and experimentation: Lower barrier to entry for exploring large models, enabling faster iteration during the exploratory phase of projects

Avoid QLoRA when:

  • Training speed is critical (quantization adds ~10-20% overhead due to the need to dequantize weights during computation)
  • You have access to more memory (standard LoRA trains faster and avoids any potential quality degradation from quantization)

When to Use Adapters

Adapters are well-suited for scenarios that benefit from their modular, plug-and-play architecture:

  • Multi-task learning: Each task gets its own adapter module, enabling clean separation of task-specific knowledge
  • Modular architectures: Adapters compose naturally (task adapter + language adapter), allowing hierarchical organization of model capabilities
  • Interpretability research: Adapter outputs can be probed to understand task-specific processing, as the discrete adapter boundaries provide natural intervention points

Avoid Adapters when:

  • Inference latency is critical (adapters add permanent overhead that cannot be eliminated through weight merging)
  • Maximum parameter efficiency is required (adapters train more parameters than LoRA at comparable performance levels)

When to Use Prefix Tuning

Prefix Tuning works well for scenarios that align with its attention-based modification approach:

  • NLG tasks: Especially controllable generation where prefix vectors can encode style/tone attributes without modifying the core generation mechanism
  • Interpretability: Prefix attention patterns reveal what the model "looks for," as you can analyze which positions in the prefix receive attention for different inputs
  • Moderate data regimes: Better data efficiency than Prompt Tuning due to per-layer parameterization, while still more efficient than LoRA or Adapters

Avoid Prefix Tuning when:

  • You're working with very long contexts (prefix reduces effective context length, potentially cutting into your input budget)
  • Tasks require modifying feed-forward computation (prefix only affects attention, leaving FFN layers entirely unchanged)

When to Use Prompt Tuning

Prompt Tuning is the simplest PEFT method and fits scenarios where extreme efficiency is paramount:

  • Extremely large models: When you can only access models through APIs or when even inference-only deployment is challenging
  • Maximum parameter efficiency: Less than 0.01% trainable parameters, enabling storage of thousands of task-specific adaptations
  • Large-scale multi-tenancy: Each tuned prompt is tiny to store, making personalized model adaptations economically feasible at scale

Avoid Prompt Tuning when:

  • Training data is limited (Prompt Tuning is data-hungry, requiring thousands of examples to achieve good performance)
  • Tasks require significant behavioral change from base model (the limited parameter budget constrains adaptation capacity)
  • You need interpretable adaptations (soft prompts are not human-readable and resist straightforward analysis)

When to Use IA³

IA³ offers a unique balance between efficiency and capability through its activation scaling approach:

  • Few-shot settings: Better data efficiency than Prompt Tuning because it builds on existing representations rather than learning from scratch
  • Minimal overhead: Scales apply through element-wise multiplication, which is computationally trivial and adds no inference latency when folded into weights
  • Quick experiments: Fastest to train among methods that modify attention, enabling rapid prototyping and ablation studies

Avoid IA³ when:

  • Maximum performance is required (typically 1-2% below LoRA on most benchmarks)
  • The task requires learning entirely new capabilities (limited expressiveness constrains what the model can learn)
Out[23]:
Visualization
Flowchart showing PEFT method selection criteria.
Decision flowchart for selecting a PEFT method based on key constraints and requirements. The decision path prioritizes hardware constraints (memory) and deployment needs (inference latency) before considering task-specific properties like modularity or data availability.
Out[24]:
Visualization
Radar chart comparing PEFT methods across parameter efficiency, performance, training speed, and inference speed.
Summary of PEFT method tradeoffs across four key dimensions. LoRA achieves the highest balance across performance and speed, while Prompt Tuning and IA³ maximize parameter efficiency at the cost of task-specific accuracy.

Practical Recommendations

Based on the analyses above, here are concrete recommendations for common scenarios:

General-Purpose Fine-tuning

If you are fine-tuning LLMs for downstream tasks:

  1. Start with LoRA ($r=16$, $\alpha=32$) applied to all attention projections (a configuration sketch follows this list)
  2. Use 4-bit QLoRA if you're GPU-memory constrained
  3. Increase rank to 32-64 if performance is insufficient
  4. Add FFN adaptation (LoRA on up/down projections) for tasks requiring significant behavioral change
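
A starting configuration matching recommendation 1 might look like the sketch below; the module names assume a LLaMA-style model and differ for other architectures, and the dropout value is our own assumption:

```python
from peft import LoraConfig, TaskType

baseline_lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # all attention projections
)
```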

Production Deployment

When serving fine-tuned models in production:

  1. Merge LoRA weights into base model for zero inference overhead (a loading-and-merging sketch follows this list)
  2. Avoid Adapters unless you need runtime adapter switching
  3. For multi-tenant serving: Keep LoRA weights separate and apply at runtime with specialized serving infrastructure
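
A sketch of the merge-for-serving workflow using the PEFT API; the model name and adapter paths are placeholders:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder
peft_model = PeftModel.from_pretrained(base, "path/to/lora-adapter")     # placeholder
merged = peft_model.merge_and_unload()  # plain transformers model, no adapter overhead
merged.save_pretrained("path/to/merged-model")
```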

Research and Experimentation

When exploring new tasks or model capabilities:

  1. Start with Prompt Tuning for fastest iteration
  2. Move to IA³ if Prompt Tuning underperforms
  3. Use LoRA for final experiments requiring best performance

Extreme Resource Constraints

When working with very limited compute:

  1. QLoRA with $r=8$ provides excellent efficiency (see the sketch after this list)
  2. Gradient checkpointing trades compute for memory
  3. Consider API-based fine-tuning if available for your model
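
A sketch combining recommendations 1 and 2: 4-bit quantization of the base model, a small-rank LoRA, and gradient checkpointing. The model name and target modules are placeholder assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()  # trade recomputation for activation memory

lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```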

Implementation: Comparative Analysis

Let's implement a framework for comparing PEFT methods on a real task.

In[25]:
Code
!uv pip install transformers peft torch accelerate

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType, PrefixTuningConfig
from peft import PromptTuningConfig, PromptTuningInit
import torch
import pandas as pd
import time

def count_parameters(model):
    """Count total and trainable parameters."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

def get_model_with_peft(model_name, peft_method, num_labels=2):
    """Load a model with specified PEFT method."""
    
    # Load base model
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    
    if peft_method == 'none':
        return model
    
    elif peft_method == 'lora':
        config = LoraConfig(
            task_type=TaskType.SEQ_CLS,
            r=16,
            lora_alpha=32,
            lora_dropout=0.1,
            target_modules=["q_lin", "v_lin"]
        )
        
    elif peft_method == 'prefix':
        # Get num_layers and token_dim from the model config
        num_transformer_layers = model.config.num_hidden_layers
        token_dim = model.config.hidden_size
        num_attention_heads = model.config.num_attention_heads
        config = PrefixTuningConfig(
            task_type=TaskType.SEQ_CLS,
            num_virtual_tokens=20,
            prefix_projection=True,
            num_layers=num_transformer_layers,
            token_dim=token_dim,
            num_attention_heads=num_attention_heads,
            encoder_hidden_size=token_dim,
        )
        
    elif peft_method == 'prompt':
        # Get num_layers, token_dim, and num_attention_heads from the model config
        num_transformer_layers = model.config.num_hidden_layers
        token_dim = model.config.hidden_size
        num_attention_heads = model.config.num_attention_heads
        config = PromptTuningConfig(
            task_type=TaskType.SEQ_CLS,
            num_virtual_tokens=20,
            prompt_tuning_init=PromptTuningInit.RANDOM,
            num_layers=num_transformer_layers,
            token_dim=token_dim,
            num_attention_heads=num_attention_heads,
        )
    
    else:
        raise ValueError(f"Unknown PEFT method: {peft_method}")
    
    return get_peft_model(model, config)
In[26]:
Code
def compare_peft_methods(model_name="distilbert-base-uncased"):
    """Compare parameter counts across PEFT methods."""

    methods = ["none", "lora", "prefix", "prompt"]
    results = []

    for method in methods:
        model = get_model_with_peft(model_name, method)
        total, trainable = count_parameters(model)

        results.append(
            {
                "method": method,
                "total_params": total,
                "trainable_params": trainable,
                "trainable_pct": 100 * trainable / total,
            }
        )

        # Clean up
        del model
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    return pd.DataFrame(results)
In[27]:
Code
## Run comparison
model_name = "distilbert-base-uncased"
comparison = compare_peft_methods(model_name)
Out[28]:
Console
PEFT Method Comparison (distilbert-base-uncased)
=================================================================
Full FT      Total: 67.0M  Trainable: 66.955M  (100.000%)
LORA         Total: 67.8M  Trainable: 0.887M  (1.308%)
PREFIX       Total: 75.2M  Trainable: 8.285M  (11.012%)
PROMPT       Total: 67.6M  Trainable: 0.607M  (0.899%)

These results confirm the efficiency of PEFT, though the percentages are higher on a small model like DistilBERT than on a 7B model: Prompt Tuning trains under 1% of parameters (most of that being the classification head, which remains trainable) and LoRA about 1.3%. The absolute counts stay below a million parameters for both, which is the footprint that enables the memory savings we calculated from first principles.

Now let's measure training throughput:

In[29]:
Code
def measure_training_step(model, batch_size=8, seq_len=128, num_steps=10):
    """Measure average time for a training step."""

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    model.train()

    # Create dummy batch
    input_ids = torch.randint(0, 30522, (batch_size, seq_len)).to(device)
    attention_mask = torch.ones(batch_size, seq_len).to(device)
    labels = torch.randint(0, 2, (batch_size,)).to(device)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Warmup
    for _ in range(3):
        outputs = model(
            input_ids=input_ids, attention_mask=attention_mask, labels=labels
        )
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Measure
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    start = time.time()
    for _ in range(num_steps):
        outputs = model(
            input_ids=input_ids, attention_mask=attention_mask, labels=labels
        )
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    if torch.cuda.is_available():
        torch.cuda.synchronize()

    elapsed = time.time() - start
    return elapsed / num_steps
In[30]:
Code
# Measure throughput for each method
throughput_results = []

for method in [
    "none",
    "lora",
    "prompt",
]:  # prefix excluded: DistilBERT lacks past_key_values
    model = get_model_with_peft("distilbert-base-uncased", method)
    avg_time = measure_training_step(
        model, batch_size=8, seq_len=128, num_steps=10
    )

    throughput_results.append(
        {"method": method, "step_time_ms": avg_time * 1000}
    )

    del model
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

throughput_df = pd.DataFrame(throughput_results)

# Calculate relative speedup
base_time = throughput_df[throughput_df["method"] == "none"][
    "step_time_ms"
].values[0]
throughput_df["relative_speed"] = base_time / throughput_df["step_time_ms"]
Out[31]:
Console
Training Throughput Comparison
==================================================
Full FT      290.4 ms/step  (1.00x)
LORA         231.1 ms/step  (1.26x)
PROMPT       219.8 ms/step  (1.32x)

Prompt Tuning offers the highest training throughput in this measurement, and the earlier estimates suggest IA³ behaves similarly, since both update only a small number of parameters. LoRA also provides a clear speedup over full fine-tuning, with the exact gain depending on the rank and number of adapted layers. These speed improvements, while less dramatic than the memory savings, significantly accelerate the iterative development cycle. Faster training means more experiments per day, faster hyperparameter searches, and quicker iteration toward optimal configurations.

Combining PEFT Methods

An emerging research direction explores combining multiple PEFT methods for improved performance. The intuition is that different methods modify different aspects of model behavior, and their complementary strengths might compound when combined:

LoRA + Adapters: Apply LoRA to attention layers and adapters after FFN layers. This captures both low-rank attention modifications and complex post-FFN transformations.

Prompt Tuning + LoRA: Use soft prompts for task-level conditioning while LoRA handles fine-grained adaptation. Particularly useful for instruction-following with multiple task types.

In[32]:
Code
# Example: Combining LoRA with different target modules
from peft import LoraConfig

# Comprehensive LoRA configuration
comprehensive_lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",  # Attention
        "gate_proj",
        "up_proj",
        "down_proj",  # FFN (for LLaMA-style models)
    ],
    lora_dropout=0.05,
    bias="none",
)

# Inspect the configuration (produces the summary shown below)
print(f"Target modules: {set(comprehensive_lora.target_modules)}")
print(f"Total rank budget: {comprehensive_lora.r} per module")
Out[33]:
Console
Target modules: {'q_proj', 'gate_proj', 'up_proj', 'k_proj', 'v_proj', 'o_proj', 'down_proj'}
Total rank budget: 16 per module

Key Parameters

The key configuration classes and parameters for PEFT methods are:

  • LoraConfig:

    • r: Rank of the update matrices. Lower means fewer parameters and less memory.
    • lora_alpha: Scaling factor for updates. The weight scaling is $\alpha/r$.
    • target_modules: List of module names (e.g., "query", "value") to apply LoRA to.
    • lora_dropout: Dropout probability for LoRA layers, aiding in regularization.
  • PrefixTuningConfig / PromptTuningConfig:

    • num_virtual_tokens: Number of learnable tokens to prepend (prefix or prompt length).
    • prompt_tuning_init: Initialization method for soft prompts (e.g., random or from text).

Limitations and When to Use Full Fine-tuning

Despite their efficiency, PEFT methods have inherent limitations:

Constrained adaptation capacity: By design, PEFT methods limit what the model can learn. For tasks requiring substantial behavioral change, such as learning a new language, acquiring domain expertise, or fundamentally shifting the model's capabilities, full fine-tuning may be necessary. LoRA's low-rank constraint assumes task-specific modifications live in a low-dimensional subspace, which may not hold for all tasks.

Degradation at scale: While PEFT methods work well for single-task adaptation, repeatedly applying multiple LoRA adapters or stacking adapters can lead to interference. Research on LoRA merging and multi-task adapter composition remains active.

Sensitivity to hyperparameters: PEFT methods introduce new hyperparameters (rank, prefix length, bottleneck dimension) that require tuning. Poor choices can significantly impact performance, sometimes more dramatically than learning rate selection in full fine-tuning.

Not a substitute for pre-training: PEFT methods adapt existing capabilities; they don't create new ones. A model that doesn't "know" something from pre-training generally won't learn it from PEFT alone, regardless of how much task-specific data you provide.

When should you opt for full fine-tuning despite the efficiency benefits of PEFT?

  • Maximum performance is non-negotiable: In high-stakes applications, the 1-3% gap to full fine-tuning may matter
  • Sufficient compute is available: If you have the resources, full fine-tuning remains the gold standard
  • Task requires deep behavioral change: Domain adaptation, safety training, or capability unlocking may exceed PEFT's adaptation budget
  • You're creating a foundation for further PEFT: Full fine-tuning a domain-specific base model, then using PEFT for downstream tasks

Summary

This chapter synthesized our exploration of parameter-efficient fine-tuning methods into practical guidance for method selection:

Parameter efficiency varies by orders of magnitude. Prompt Tuning and IA³ train fewer than 0.01% of parameters, while LoRA and Adapters train 0.2-0.5%. This translates directly to memory savings: a 7B model requiring 84GB for full fine-tuning can be adapted with LoRA using 14GB.

Performance gaps are smaller than efficiency gains. LoRA typically achieves 95-99% of full fine-tuning performance across NLU, NLG, and instruction-following tasks. The gap narrows further with larger models and more training data.

Method choice depends on constraints. LoRA is the default recommendation for most scenarios due to its strong performance and zero inference overhead when weights are merged. QLoRA extends this to memory-constrained settings. Adapters suit multi-task modular architectures. Prefix and Prompt Tuning offer maximum efficiency at some performance cost.

Inference impact matters for production. LoRA and IA³ can be merged into base weights for identical serving costs. Adapters add permanent inference overhead but enable runtime task switching.

As we move into Part XXVI on instruction tuning, these PEFT methods, especially LoRA and QLoRA, will become essential tools. Instruction tuning transforms base models into capable assistants, and doing so efficiently is often the difference between practical application and theoretical interest.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about comparing and selecting PEFT methods.

