Adapter Layers: Bottleneck Modules for Efficient Fine-Tuning

Michael Brenndoerfer · December 13, 2025 · 46 min read

Learn how adapter layers insert trainable bottleneck modules into transformers for parameter-efficient fine-tuning. Covers architecture, placement, and fusion.

Adapter Layers

While LoRA modifies attention weights through low-rank decomposition and prefix tuning prepends learnable tokens to the input, adapter layers take a different architectural approach: they insert small, trainable neural network modules directly into the transformer's forward pass. These compact bottleneck modules learn task-specific transformations while leaving the original model weights completely frozen.

Adapters emerged from a simple observation: if we want to add new capabilities to a pretrained model without disrupting what it already knows, why not add new components rather than modify existing ones? This additive approach provides a clean separation between pretrained knowledge and task-specific adaptations, making it possible to store multiple specialized versions of a model using only a fraction of the original parameters.

In this chapter, we'll explore the adapter architecture in detail, examining how bottleneck modules compress and expand hidden representations, where to place adapters within transformer blocks for maximum effectiveness, and how to choose the right dimensionality for your task. We'll also cover adapter fusion, a technique for combining multiple task-specific adapters to enable flexible multi-task learning.

Adapter Architecture

The core idea behind adapters is straightforward: insert small bottleneck neural networks into each transformer layer that can learn task-specific transformations. To understand why this approach works, consider what happens when you fine-tune a pretrained model on a new task. The model needs to adjust how it processes information, but most of what it learned during pretraining remains relevant. Rather than modifying the carefully tuned weights that encode this valuable knowledge, adapters provide dedicated pathways for learning new behaviors.

These modules are called "bottleneck" architectures because they first compress the hidden representation to a much smaller dimension, apply a nonlinearity, and then project back to the original dimension. This compression serves a dual purpose: it dramatically reduces the number of parameters needed for adaptation, and it acts as a regularizer that forces the adapter to capture only the most essential task-specific information.

Bottleneck Architecture

A bottleneck architecture compresses data through a narrow intermediate representation before expanding it back. This forces the network to learn a compact encoding of the relevant information, acting as an information bottleneck. The concept draws from information theory, where limiting the bandwidth of a channel forces the sender to prioritize the most important signals. In the context of neural networks, this compression encourages the adapter to identify and preserve only those features of the input that are most relevant for the downstream task.

The Adapter Module

An adapter module consists of three components arranged sequentially, each serving a specific purpose in the transformation pipeline. The architecture mirrors the structure of an autoencoder, where information is first compressed and then reconstructed, but here the goal is not reconstruction fidelity but rather learning a useful transformation of the input representation.

The three components work together as follows:

  1. Down-projection: A linear layer that compresses the hidden dimension $d$ to a smaller bottleneck dimension $r$. This projection acts as a learned feature selector, identifying which aspects of the high-dimensional hidden state are most relevant for the task at hand.
  2. Nonlinearity: An activation function (typically ReLU or GELU) that enables nonlinear transformations. Without this component, the composition of two linear layers would simply be another linear transformation, severely limiting the adapter's expressive power.
  3. Up-projection: A linear layer that expands from dimension $r$ back to $d$. This projection reconstructs the full hidden dimension, translating the compressed task-specific representation back into a form compatible with subsequent transformer layers.

Given an input hidden state $\mathbf{h} \in \mathbb{R}^d$, the adapter computes its output through the sequential application of these three components:

$$\text{Adapter}(\mathbf{h}) = \mathbf{W}_{up} \cdot f(\mathbf{W}_{down} \cdot \mathbf{h} + \mathbf{b}_{down}) + \mathbf{b}_{up}$$

To understand this formula, let's examine each component:

  • $\text{Adapter}(\mathbf{h})$: the output of the adapter module, which represents the task-specific transformation that will be added to the original hidden state
  • $\mathbf{h}$: the input hidden state, a vector of dimension $d$ that encodes the current representation at this point in the transformer
  • $\mathbf{W}_{down}$: the down-projection weight matrix with shape $r \times d$, which compresses the input from dimension $d$ to $r$. Each row of this matrix can be thought of as a learned detector for a particular pattern in the input
  • $\mathbf{b}_{down}$: the bias vector for the down-projection layer, with dimension $r$. This allows the adapter to shift the compressed representation as needed
  • $f(\cdot)$: the nonlinear activation function (e.g., ReLU or GELU) that enables the network to learn complex patterns. The choice of activation function affects both the expressiveness of the adapter and its training dynamics
  • $\mathbf{W}_{up}$: the up-projection weight matrix with shape $d \times r$, which expands the representation back from $r$ to $d$. This matrix translates the bottleneck representation into modifications for the full hidden space
  • $\mathbf{b}_{up}$: the bias vector for the up-projection layer, with dimension $d$. Often initialized to zero to ensure near-identity behavior at the start of training

Information flows through this formula from the inside out. First, the input hidden state is multiplied by the down-projection matrix and shifted by the bias, yielding a compressed representation. Next, the activation function introduces nonlinearity, allowing the adapter to model complex relationships. Finally, the up-projection expands this compressed, transformed representation back to the original dimension.
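To connect the symbols to code, here is a minimal functional sketch of this computation in PyTorch. It is not the module implementation that appears later in the chapter, just a direct translation of the formula; the weights below are random placeholders rather than trained parameters.

Code
import torch

def adapter_forward(h, W_down, b_down, W_up, b_up, f=torch.relu):
    """Compute Adapter(h) = W_up · f(W_down · h + b_down) + b_up for one vector."""
    z = f(W_down @ h + b_down)   # down-projection and nonlinearity: shape (r,)
    return W_up @ z + b_up       # up-projection back to the hidden size: shape (d,)

# Shapes for the BERT-base example: d = 768, r = 64
d, r = 768, 64
h = torch.randn(d)
W_down, b_down = 0.01 * torch.randn(r, d), torch.zeros(r)
W_up, b_up = 0.01 * torch.randn(d, r), torch.zeros(d)
print(adapter_forward(h, W_down, b_down, W_up, b_up).shape)  # torch.Size([768])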

Residual Connection

A critical design choice is wrapping the adapter in a residual connection. This architectural decision fundamentally shapes how the adapter integrates with the transformer and has significant implications for both training dynamics and the nature of the learned transformation.

$$\mathbf{h}' = \mathbf{h} + \text{Adapter}(\mathbf{h})$$

where:

  • $\mathbf{h}'$: the output hidden state after applying the adapter, which will be passed to subsequent operations in the transformer
  • $\mathbf{h}$: the input hidden state from the previous layer or sublayer, preserved unchanged and added to the adapter output
  • $\text{Adapter}(\mathbf{h})$: the task-specific transformation computed by the bottleneck module, which represents the "delta" or modification to apply

This residual structure serves multiple purposes that together make adapters both effective and practical. First, it enables the adapter to learn only the delta (change) needed for the new task, rather than having to reproduce the entire transformation. This is conceptually powerful because the adapter need not learn to pass through information that should remain unchanged; instead, it can focus entirely on what needs to be different for the new task. The original representation flows through the skip connection unchanged, and the adapter simply adds whatever modifications are necessary.

Second, the residual structure allows for near-identity initialization, where adapters initialized to output near-zero values will pass through the original representations unchanged. This property is crucial for stable training because it means the model starts from a state equivalent to the pretrained model. Early in training, before the adapters have learned meaningful transformations, the model's behavior remains close to the original, avoiding the disruption that could occur if adapters initially produced large, random modifications.

Third, it provides stable gradient flow during training, as discussed in the residual connections chapter from Part XII. Gradients can flow directly through the skip connection without passing through the adapter's nonlinearity, ensuring that learning signals reach all parts of the network even when the adapter produces small outputs.
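This near-identity behavior is easy to verify directly: if the up-projection weights and bias are zero, the adapter's output vanishes and the residual sum returns the input unchanged. A small sketch with arbitrary down-projection values:

Code
import torch

d, r = 768, 64
h = torch.randn(d)

W_down, b_down = 0.01 * torch.randn(r, d), torch.zeros(r)
W_up, b_up = torch.zeros(d, r), torch.zeros(d)  # zero-initialized up-projection

adapter_out = W_up @ torch.relu(W_down @ h + b_down) + b_up
h_prime = h + adapter_out  # residual connection

print(torch.allclose(h_prime, h))  # True: the adapted layer starts as an identity map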

Parameter Count

Understanding the parameter count helps you appreciate adapter efficiency. For a single adapter module, the total number of trainable parameters is the sum of all weights and biases in the two projection layers:

$$\text{Parameters} = \underbrace{r \times d}_{\mathbf{W}_{down}} + \underbrace{r}_{\mathbf{b}_{down}} + \underbrace{d \times r}_{\mathbf{W}_{up}} + \underbrace{d}_{\mathbf{b}_{up}} = 2rd + r + d$$

Let's break down each term to see where the parameters come from:

  • $r$: the bottleneck dimension, typically chosen to be much smaller than the hidden dimension (e.g., 64)
  • $d$: the model's hidden dimension, determined by the base model architecture (e.g., 768 for BERT-base)
  • $r \times d$: the number of parameters in the down-projection weight matrix, which must map from $d$ dimensions to $r$ dimensions
  • $r$: the number of parameters in the down-projection bias, one for each output dimension
  • $d \times r$: the number of parameters in the up-projection weight matrix, which maps from $r$ back to $d$
  • $d$: the number of parameters in the up-projection bias
  • $2rd$: the approximate parameter count when $r \ll d$, showing that the two projection matrices dominate the size while the bias vectors contribute relatively little

When $r \ll d$, this simplifies to approximately $2rd$ parameters per adapter, since the bias terms become negligible. For a model with hidden dimension $d = 768$ and bottleneck dimension $r = 64$, each adapter contains roughly $2 \times 64 \times 768 = 98{,}304$ parameters, which is tiny compared to the millions of parameters in each transformer layer. To put this in perspective, a single self-attention layer in BERT-base has over 2 million parameters, so an adapter adds less than 5% overhead to each layer.
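These counts are easy to check. The snippet below evaluates the exact formula $2rd + r + d$ and the $2rd$ approximation for the BERT-base example; the helper function is ours, written just for this calculation.

Code
def adapter_params(d, r):
    """Exact trainable parameters in one bottleneck adapter: 2rd + r + d."""
    return (r * d + r) + (d * r + d)

d, r = 768, 64
print(adapter_params(d, r))  # 99136 exact (matches the module built later in this chapter)
print(2 * r * d)             # 98304 approximation when r << d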

Out[2]:
Visualization
Adapter parameter count scaling with bottleneck dimension for different model sizes. The parameter count grows linearly with both the bottleneck dimension $r$ and the hidden dimension $d$, following the formula $2rd$. Even at larger bottleneck sizes, adapters remain a small fraction of the base model's parameters.

Adapter Placement

Where you insert adapters within the transformer block significantly impacts both performance and parameter efficiency. The transformer layer consists of multiple sublayers, each performing a distinct function, and the choice of where to place adapters determines which computations the adapter can influence. Researchers have explored several placement strategies, each with different trade-offs between expressiveness, efficiency, and training stability.

Original Placement (Houlsby et al.)

The original adapter paper by Houlsby et al. (2019) proposed inserting two adapter modules per transformer layer. This comprehensive approach ensures that adapters can modify representations both after contextual mixing occurs in the attention sublayer and after feature transformation in the feed-forward sublayer. The two insertion points are:

  1. After the multi-head attention sublayer: Between the attention output and the first layer normalization. This position allows the adapter to modify how attended information is integrated.
  2. After the feed-forward network sublayer: Between the FFN output and the second layer normalization. This position enables the adapter to adjust the feature transformations applied by the FFN.

This placement follows the existing residual connection pattern established in the transformer architecture. To understand how adapters integrate with the standard transformer computation, recall that the original transformer layer computation proceeds as follows:

$$\begin{aligned} \mathbf{h}_1 &= \text{LayerNorm}(\mathbf{x} + \text{MHA}(\mathbf{x})) \\ \mathbf{h}_2 &= \text{LayerNorm}(\mathbf{h}_1 + \text{FFN}(\mathbf{h}_1)) \end{aligned}$$

In these equations:

  • $\mathbf{h}_1$: the intermediate representation after the attention sublayer, which captures contextually mixed information
  • $\text{LayerNorm}$: the layer normalization operation that stabilizes training by normalizing activations across the feature dimension
  • $\mathbf{x}$: the input to the transformer layer, which may come from the embedding layer or a previous transformer layer
  • $\text{MHA}$: the Multi-Head Attention mechanism that computes contextual relationships between tokens
  • $\mathbf{h}_2$: the final output of the transformer layer, ready to pass to the next layer
  • $\text{FFN}$: the Feed-Forward Network sublayer that applies position-wise transformations to each token independently

With adapters inserted according to the Houlsby placement strategy, the adapted layer becomes:

$$\begin{aligned} \mathbf{h}_1 &= \text{LayerNorm}(\mathbf{x} + \text{Adapter}_1(\text{MHA}(\mathbf{x}))) \\ \mathbf{h}_2 &= \text{LayerNorm}(\mathbf{h}_1 + \text{Adapter}_2(\text{FFN}(\mathbf{h}_1))) \end{aligned}$$

Here, each component plays the following role:

  • $\mathbf{h}_1$: the intermediate representation after the adapted attention sublayer, now incorporating task-specific modifications to the attention output
  • $\text{LayerNorm}$: the layer normalization operation, unchanged from the original architecture
  • $\mathbf{x}$: the input to the transformer layer
  • $\text{Adapter}_1$: the first adapter module inserted after attention, which learns to modify how attention information should be adjusted for the specific task
  • $\text{MHA}$: the Multi-Head Attention mechanism, with frozen weights
  • $\mathbf{h}_2$: the final output of the adapted transformer layer
  • $\text{Adapter}_2$: the second adapter module inserted after the feed-forward network, which learns task-specific adjustments to the feature transformations
  • $\text{FFN}$: the Feed-Forward Network sublayer, with frozen weights

Note that the adapter is applied to the sublayer output before the residual connection with the input. This ordering is deliberate: it allows the adapter to transform the sublayer's contribution while still preserving the original skip connection's ability to pass information through unchanged.

Efficient Placement (Pfeiffer et al.)

Later work by Pfeiffer et al. (2021) made a surprising discovery: using a single adapter per layer, placed only after the feed-forward network, achieves comparable performance with half the parameters. This finding challenged the assumption that both insertion points were necessary and led to more efficient adapter configurations becoming the standard practice.

The intuition behind this finding connects to what we learned about transformer components in Part XII. Feed-forward networks act as key-value memories that store factual knowledge, while attention layers handle contextual mixing. Task adaptation often requires modifying how the model processes and retrieves information, which aligns more closely with the FFN's role. When adapting a model for sentiment analysis, for example, the model primarily needs to adjust which features are emphasized and how they are transformed, rather than fundamentally changing how tokens attend to each other.

Furthermore, the attention mechanism is already quite flexible due to the query-key-value decomposition, which allows different relationships to be captured without weight modifications. The FFN, in contrast, applies fixed transformations that may need task-specific adjustment. By placing adapters after the FFN, we directly address the component most likely to need modification.

Parallel Adapters

An alternative to sequential placement is parallel adapters, where the adapter computation runs alongside the original sublayer rather than after it. This design represents a departure from the standard "intercept and modify" approach, instead treating the adapter as an independent contribution to the layer output.

$$\mathbf{h}' = \mathbf{x} + \text{MHA}(\mathbf{x}) + s \cdot \text{Adapter}(\mathbf{x})$$

In this formulation:

  • $\mathbf{h}'$: the output of the layer, combining contributions from the input, the attention mechanism, and the adapter
  • $\mathbf{x}$: the input to the layer, which serves as the common source for all three pathways
  • $\text{MHA}(\mathbf{x})$: the attention sublayer output, computed from the input as in the standard transformer
  • $s$: a scaling factor that controls the adapter's contribution to the final representation. This hyperparameter allows fine-grained control over how much the adapter influences the output
  • $\text{Adapter}(\mathbf{x})$: the adapter computation running in parallel with the attention mechanism, operating directly on the layer input rather than on the attention output

Parallel placement can offer computational advantages on hardware that supports concurrent operations, as the adapter and original sublayer can be computed simultaneously. On modern GPUs and TPUs, this parallelism can translate to reduced wall-clock time even though the total computation is the same. Additionally, parallel placement changes the gradient flow: the adapter receives gradients directly from the layer output without having to backpropagate through the attention computation, which can lead to different optimization dynamics.
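As an illustration of this design, the sketch below wires a parallel adapter branch alongside a standard attention sublayer. The class and its arguments (including the scaling factor `s`) are a simplified example of the formulation above rather than an implementation from the original papers, and the bottleneck omits an internal residual because the skip connection is applied once at the layer level.

Code
import torch
import torch.nn as nn

class ParallelAdapterLayer(nn.Module):
    """Sketch of h' = x + MHA(x) + s * Adapter(x), with the adapter branch
    computed from the layer input rather than from the attention output."""

    def __init__(self, hidden_dim, num_heads, bottleneck_dim, s=1.0):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Bottleneck without its own residual; the skip connection is added below.
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.s = s

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        adapter_out = self.up(torch.relu(self.down(x)))  # parallel branch on x
        return x + attn_out + self.s * adapter_out

layer = ParallelAdapterLayer(hidden_dim=768, num_heads=12, bottleneck_dim=64, s=0.5)
x = torch.randn(2, 16, 768)
print(layer(x).shape)  # torch.Size([2, 16, 768])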

Placement Comparison

The choice of adapter placement involves trade-offs between several competing factors:

  • Parameter efficiency: Single FFN-only adapters use half the parameters of dual placement, which translates directly to storage savings when maintaining many task-specific adapters
  • Performance: Dual placement often achieves slightly better results on diverse tasks, particularly when tasks require modifying both contextual mixing and feature transformation
  • Training stability: The original dual placement provides more gradient pathways, potentially making optimization easier for difficult tasks
  • Inference overhead: More adapters mean more sequential computations, increasing latency during inference

For most practical applications, the single adapter placement after the FFN provides an excellent balance of efficiency and performance. The savings in parameters and inference time typically outweigh the marginal performance improvements of dual placement, especially when deploying multiple adapters for different tasks.

Out[3]:
Visualization
Comparison of adapter placement strategies showing parameter counts and simulated performance. The Houlsby placement (left) requires more parameters by inserting adapters after both attention and FFN sublayers, while the Pfeiffer placement (right) achieves comparable accuracy using only FFN adapters with half the parameter overhead.

Adapter Dimensionality

The bottleneck dimension $r$ is the primary hyperparameter controlling the trade-off between adapter capacity and parameter efficiency. Choosing the right value requires understanding how information flows through the bottleneck and how compression affects the adapter's ability to learn useful transformations.

The Information Bottleneck Perspective

When we compress a hidden representation from dimension $d$ to dimension $r$, we're forcing the adapter to identify and preserve the most task-relevant information. This compression acts as a form of regularization, preventing the adapter from memorizing noise in the training data. The principle at work here connects to information theory: by limiting the capacity of the channel through which task information must flow, we force the adapter to be selective about what it encodes.

Consider a hidden representation $\mathbf{h} \in \mathbb{R}^{768}$ compressed to $r = 64$ dimensions. The adapter must encode whatever task-specific transformation it needs to make using only 64 degrees of freedom. This constraint encourages the adapter to learn generalizable patterns rather than fitting to idiosyncrasies in the training set. If a particular feature of the training data is not consistently useful for the task, the limited bottleneck capacity will prevent the adapter from dedicating parameters to capturing it.

The compression also has implications for what kinds of transformations adapters can represent. With a 12:1 compression ratio, the adapter cannot simply learn an arbitrary linear transformation of the input. Instead, it must identify a low-dimensional subspace that captures the essential variation needed for task adaptation. This constraint is often beneficial, as it biases the adapter toward learning transformations that align with the structure already present in the pretrained representations.

Dimensionality Guidelines

Research and practical experience suggest the following guidelines for choosing adapter dimensions, with the appropriate choice depending on task complexity, data availability, and computational constraints:

  • Very small ($r = 8$ to $16$): Suitable for tasks closely related to pretraining or when data is extremely limited. Provides strong regularization but limited expressiveness. These minimal adapters work best when the task primarily requires recombining existing features rather than learning fundamentally new representations.
  • Small ($r = 32$ to $64$): A good default for most NLP tasks. Balances efficiency with sufficient capacity for meaningful adaptation. This range provides enough flexibility to learn task-specific transformations while maintaining strong regularization.
  • Medium ($r = 128$ to $256$): Appropriate for complex tasks or when full fine-tuning achieves significantly better results. Approaches the capacity of larger modifications. Consider this range when standard adapter sizes underperform or when the task requires substantial deviation from pretrained behavior.
  • Large ($r \geq 512$): Rarely necessary; at this point, the parameter savings over full fine-tuning diminish substantially. If this much capacity is needed, alternative methods or full fine-tuning may be more appropriate.

Scaling with Model Size

You should consider how to scale adapter dimensions as model size increases. Larger models have larger hidden dimensions, so maintaining the same absolute bottleneck size results in increasingly severe compression. A useful heuristic is to maintain a consistent compression ratio $d/r$. For example:

Recommended adapter dimensions for different model sizes maintaining a consistent 12:1 compression ratio.

| Model Hidden Dim | Adapter Dim | Ratio |
| --- | --- | --- |
| 768 (BERT-base) | 64 | 12:1 |
| 1024 (BERT-large) | 85 | 12:1 |
| 4096 (LLaMA-7B) | 341 | 12:1 |

However, empirical tuning often reveals that smaller models need proportionally larger adapters than bigger models. This may relate to the observation that larger models have more redundant capacity and can express task-specific information more compactly. A 7-billion parameter model has learned richer internal representations that can more easily be combined and adjusted through a small adapter, while a smaller model may require more adapter capacity to achieve the same effect.
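A small helper makes the heuristic concrete; it simply rounds $d$ divided by the target ratio and reproduces the values in the table above. This is an illustrative sketch, not a prescription from the literature.

Code
def suggest_bottleneck(hidden_dim, compression_ratio=12):
    """Suggest an adapter bottleneck dimension for a target compression ratio."""
    return max(1, round(hidden_dim / compression_ratio))

for name, d in [("BERT-base", 768), ("BERT-large", 1024), ("LLaMA-7B", 4096)]:
    r = suggest_bottleneck(d)
    print(f"{name}: d={d} -> r={r} (ratio {d / r:.1f}:1)")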

Adapter Efficiency Analysis

Let's compare the parameter overhead of adapters at different bottleneck dimensions for a BERT-base model with 12 layers and hidden dimension 768:

Adapter parameter overhead at different bottleneck dimensions for a 12-layer BERT-base model.

| Bottleneck Dim | Params per Adapter | Total Adapter Params | % of BERT-base |
| --- | --- | --- | --- |
| 8 | ~12K | ~288K | 0.26% |
| 32 | ~49K | ~1.2M | 1.1% |
| 64 | ~98K | ~2.4M | 2.2% |
| 256 | ~393K | ~9.4M | 8.6% |

Even at $r = 256$, adapters add less than 10% of the base model's parameters. This efficiency enables practical storage of dozens of task-specific adapters alongside a single frozen base model. For organizations deploying models across many tasks, this represents substantial savings in storage, memory, and model management overhead compared to maintaining separate fully fine-tuned models for each task.
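The overhead figures above can be approximately reproduced in a few lines. The sketch assumes the dual (Houlsby) placement of two adapters per layer, 12 layers, the $2rd$ approximation, and a round 110M parameters for BERT-base, so the printed percentages may differ from the table by small rounding amounts.

Code
def total_adapter_params(d, r, num_layers=12, adapters_per_layer=2):
    """Approximate total adapter parameters using the 2rd-per-adapter rule."""
    return 2 * r * d * num_layers * adapters_per_layer

bert_base_params = 110_000_000  # rough parameter count of BERT-base
for r in [8, 32, 64, 256]:
    total = total_adapter_params(768, r)
    print(f"r={r:>3}: {total:>10,} adapter params "
          f"({total / bert_base_params:.2%} of BERT-base)")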

Out[4]:
Visualization
Trade-off between adapter bottleneck dimension and model capacity. Left: As bottleneck dimension increases, the compression ratio decreases, allowing more information to flow through. Right: Simulated performance curves showing diminishing returns at higher bottleneck dimensions, with the optimal range typically between 32 and 128 for most tasks.

Worked Example

Let's trace through an adapter computation manually to build intuition for how the bottleneck transformation works. Following the computation step by step reveals how each component contributes to the final output and helps clarify the mathematical operations involved.

Consider a simplified scenario with hidden dimension $d = 4$ and bottleneck dimension $r = 2$. While real adapters operate in much higher dimensions, this small example allows us to see every number in the computation. We have an input hidden state:

$$\mathbf{h} = \begin{bmatrix} 0.5 \\ -0.3 \\ 0.8 \\ -0.1 \end{bmatrix}$$

Our adapter has the following weights (after training). In practice, these would be learned through gradient descent, but here we use fixed values to illustrate the computation:

$$\mathbf{W}_{down} = \begin{bmatrix} 0.4 & 0.2 & -0.1 & 0.5 \\ -0.2 & 0.5 & 0.3 & 0.1 \end{bmatrix}, \quad \mathbf{b}_{down} = \begin{bmatrix} 0.1 \\ -0.1 \end{bmatrix}$$

$$\mathbf{W}_{up} = \begin{bmatrix} 0.3 & -0.2 \\ 0.1 & 0.4 \\ -0.2 & 0.2 \\ 0.5 & -0.1 \end{bmatrix}, \quad \mathbf{b}_{up} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

Step 1: Down-projection

The first step compresses the 4-dimensional input to a 2-dimensional bottleneck representation. This linear transformation identifies which combinations of input features are relevant for the task:

$$\mathbf{z} = \mathbf{W}_{down} \cdot \mathbf{h} + \mathbf{b}_{down}$$

where:

  • $\mathbf{z}$: the compressed bottleneck representation, which must encode all task-relevant information in just 2 dimensions
  • $\mathbf{W}_{down}$: the down-projection weight matrix, whose rows define the feature combinations to extract
  • $\mathbf{h}$: the input hidden state from the transformer
  • $\mathbf{b}_{down}$: the bias vector that shifts the compressed representation

Computing the matrix-vector product by taking the dot product of each row of the weight matrix with the input vector:

$$\mathbf{W}_{down} \cdot \mathbf{h} = \begin{bmatrix} 0.4(0.5) + 0.2(-0.3) + (-0.1)(0.8) + 0.5(-0.1) \\ (-0.2)(0.5) + 0.5(-0.3) + 0.3(0.8) + 0.1(-0.1) \end{bmatrix} = \begin{bmatrix} 0.01 \\ -0.02 \end{bmatrix}$$

Adding the bias to shift the representation:

$$\mathbf{z} = \begin{bmatrix} 0.01 + 0.1 \\ -0.02 - 0.1 \end{bmatrix} = \begin{bmatrix} 0.11 \\ -0.12 \end{bmatrix}$$

Step 2: Nonlinearity (ReLU)

The activation function introduces nonlinearity, allowing the adapter to model complex patterns. ReLU zeroes out negative values while passing positive values unchanged:

$$f(\mathbf{z}) = \text{ReLU}(\mathbf{z}) = \begin{bmatrix} \max(0, 0.11) \\ \max(0, -0.12) \end{bmatrix} = \begin{bmatrix} 0.11 \\ 0 \end{bmatrix}$$

Notice that the second dimension is zeroed by ReLU, demonstrating how the activation function introduces sparsity. In this case, only one of the two bottleneck dimensions contributes to the output. This sparsity is a common feature of ReLU networks and can be seen as implicit feature selection within the bottleneck.

Step 3: Up-projection

The final transformation expands the activated bottleneck representation back to the original hidden dimension, producing the adapter's contribution to the final output:

$$\text{Adapter}(\mathbf{h}) = \mathbf{W}_{up} \cdot f(\mathbf{z}) + \mathbf{b}_{up}$$

where:

  • $\text{Adapter}(\mathbf{h})$: the output of the adapter, a 4-dimensional vector that will be added to the input
  • $\mathbf{W}_{up}$: the up-projection weight matrix, whose columns define how each bottleneck dimension contributes to each output dimension
  • $f(\mathbf{z})$: the activation of the bottleneck representation, with the second dimension zeroed by ReLU
  • $\mathbf{b}_{up}$: the up-projection bias vector, set to zero in this example

Since the second dimension of $f(\mathbf{z})$ is zero, only the first column of $\mathbf{W}_{up}$ contributes to the output:

$$\text{Adapter}(\mathbf{h}) = \begin{bmatrix} 0.3(0.11) + (-0.2)(0) \\ 0.1(0.11) + 0.4(0) \\ (-0.2)(0.11) + 0.2(0) \\ 0.5(0.11) + (-0.1)(0) \end{bmatrix} = \begin{bmatrix} 0.033 \\ 0.011 \\ -0.022 \\ 0.055 \end{bmatrix}$$

Step 4: Residual connection

Finally, the adapter output is added to the original input through the residual connection. This addition combines the pretrained representation with the task-specific modification:

$$\mathbf{h}' = \mathbf{h} + \text{Adapter}(\mathbf{h}) = \begin{bmatrix} 0.5 \\ -0.3 \\ 0.8 \\ -0.1 \end{bmatrix} + \begin{bmatrix} 0.033 \\ 0.011 \\ -0.022 \\ 0.055 \end{bmatrix} = \begin{bmatrix} 0.533 \\ -0.289 \\ 0.778 \\ -0.045 \end{bmatrix}$$

The adapter has made small, targeted adjustments to each dimension of the hidden state. The residual connection ensures these changes are additive modifications rather than complete replacements. Notice that the modifications are relatively small compared to the original values, which is typical for well-trained adapters that make surgical adjustments rather than wholesale changes to the representation.
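Because the example is so small, it is easy to verify numerically. The following sketch reruns the four steps with the same weights and reproduces the vectors above up to floating-point rounding:

Code
import torch

h = torch.tensor([0.5, -0.3, 0.8, -0.1])
W_down = torch.tensor([[0.4, 0.2, -0.1, 0.5],
                       [-0.2, 0.5, 0.3, 0.1]])
b_down = torch.tensor([0.1, -0.1])
W_up = torch.tensor([[0.3, -0.2],
                     [0.1, 0.4],
                     [-0.2, 0.2],
                     [0.5, -0.1]])
b_up = torch.zeros(4)

z = W_down @ h + b_down        # Step 1, down-projection: [0.11, -0.12]
a = torch.relu(z)              # Step 2, ReLU zeroes the second dimension
adapter_out = W_up @ a + b_up  # Step 3, up-projection: [0.033, 0.011, -0.022, 0.055]
h_prime = h + adapter_out      # Step 4, residual: [0.533, -0.289, 0.778, -0.045]
print(z, a, adapter_out, h_prime, sep="\n")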

Out[5]:
Visualization
Step-by-step visualization of the worked example showing how the adapter transforms the input hidden state. The bottleneck compression (d=4 to r=2) followed by ReLU activation creates sparsity, and the up-projection produces small task-specific modifications that are added to the original representation.

Code Implementation

Let's implement adapter layers from scratch and then integrate them into a transformer layer, mirroring how they would be inserted into a pretrained model.

In[6]:
Code
import torch.nn as nn  # noqa: F401
import torch.nn.functional as F  # noqa: F401

Building the Adapter Module

We'll start by implementing a single adapter module as a PyTorch module:

In[7]:
Code
class AdapterModule(nn.Module):
    """
    Bottleneck adapter module with residual connection.

    Args:
        hidden_dim: The transformer's hidden dimension
        bottleneck_dim: The adapter's bottleneck dimension
        activation: Activation function ('relu' or 'gelu')
    """

    def __init__(self, hidden_dim, bottleneck_dim, activation="relu"):
        super().__init__()

        # Down-projection: d -> r
        self.down_proj = nn.Linear(hidden_dim, bottleneck_dim)

        # Up-projection: r -> d
        self.up_proj = nn.Linear(bottleneck_dim, hidden_dim)

        # Activation function
        if activation == "relu":
            self.activation = nn.ReLU()
        elif activation == "gelu":
            self.activation = nn.GELU()
        else:
            raise ValueError(f"Unknown activation: {activation}")

        # Initialize for near-identity at start
        self._init_weights()

    def _init_weights(self):
        # Initialize down-projection with small random values
        nn.init.normal_(self.down_proj.weight, std=0.01)
        nn.init.zeros_(self.down_proj.bias)

        # Initialize up-projection to near-zero for identity initialization
        nn.init.normal_(self.up_proj.weight, std=0.01)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, x):
        # Bottleneck transformation
        residual = x
        x = self.down_proj(x)
        x = self.activation(x)
        x = self.up_proj(x)

        # Residual connection
        return residual + x

Let's verify our adapter module works correctly and count its parameters:

In[8]:
Code
import torch

# Create an adapter for BERT-base dimensions
hidden_dim = 768
bottleneck_dim = 64

adapter = AdapterModule(hidden_dim, bottleneck_dim)

# Count parameters
total_params = sum(p.numel() for p in adapter.parameters())
trainable_params = sum(
    p.numel() for p in adapter.parameters() if p.requires_grad
)

# Test forward pass
batch_size, seq_len = 4, 128
x = torch.randn(batch_size, seq_len, hidden_dim)
output = adapter(x)
Out[9]:
Console
Adapter configuration:
  Hidden dimension: 768
  Bottleneck dimension: 64
  Total parameters: 99,136

Expected parameters:
  Down projection: 49,216
  Up projection: 49,920

Input shape: torch.Size([4, 128, 768])
Output shape: torch.Size([4, 128, 768])

The adapter adds approximately 99,000 parameters, which is only about 0.09% of BERT-base's 110 million parameters. Notice that the output shape matches the input shape, preserving the hidden dimension.

Integrating Adapters into a Transformer Layer

Now let's create a modified transformer layer that includes adapters:

In[10]:
Code
class AdaptedTransformerLayer(nn.Module):
    """
    Transformer layer with adapter modules inserted after attention and FFN.

    This implements the original Houlsby et al. placement strategy.
    """

    def __init__(
        self,
        hidden_dim,
        num_heads,
        ffn_dim,
        bottleneck_dim,
        dropout=0.1,
        adapter_placement="both",
    ):
        super().__init__()

        # Standard transformer components
        self.self_attn = nn.MultiheadAttention(
            hidden_dim, num_heads, dropout=dropout, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, ffn_dim),
            nn.GELU(),
            nn.Linear(ffn_dim, hidden_dim),
            nn.Dropout(dropout),
        )

        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.dropout = nn.Dropout(dropout)

        # Adapter modules
        self.adapter_placement = adapter_placement

        if adapter_placement in ["both", "attention"]:
            self.adapter_attn = AdapterModule(hidden_dim, bottleneck_dim)

        if adapter_placement in ["both", "ffn"]:
            self.adapter_ffn = AdapterModule(hidden_dim, bottleneck_dim)

    def forward(self, x, attention_mask=None):
        # Self-attention sublayer
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=attention_mask)
        attn_out = self.dropout(attn_out)

        # Apply adapter after attention (if configured)
        if self.adapter_placement in ["both", "attention"]:
            attn_out = self.adapter_attn(attn_out)

        x = self.norm1(x + attn_out)

        # Feed-forward sublayer
        ffn_out = self.ffn(x)

        # Apply adapter after FFN (if configured)
        if self.adapter_placement in ["both", "ffn"]:
            ffn_out = self.adapter_ffn(ffn_out)

        x = self.norm2(x + ffn_out)

        return x

Let's compare parameter counts for different adapter placements:

In[11]:
Code
# Configuration
hidden_dim = 768
num_heads = 12
ffn_dim = 3072
bottleneck_dim = 64

# Create layers with different adapter placements
layer_both = AdaptedTransformerLayer(
    hidden_dim, num_heads, ffn_dim, bottleneck_dim, adapter_placement="both"
)
layer_ffn_only = AdaptedTransformerLayer(
    hidden_dim, num_heads, ffn_dim, bottleneck_dim, adapter_placement="ffn"
)
layer_no_adapter = AdaptedTransformerLayer(
    hidden_dim, num_heads, ffn_dim, bottleneck_dim, adapter_placement="none"
)


# Count parameters
def count_adapter_params(layer):
    adapter_params = 0
    for name, module in layer.named_modules():
        if isinstance(module, AdapterModule):
            adapter_params += sum(p.numel() for p in module.parameters())
    return adapter_params


def count_total_params(layer):
    return sum(p.numel() for p in layer.parameters())
Out[12]:
Console
Parameter comparison for different adapter placements:

Placement               Total Params  Adapter Params   Overhead
-----------------------------------------------------------------
Both (Houlsby)             7,286,144         198,272       2.8%
FFN only (Pfeiffer)        7,187,008          99,136       1.4%
No adapters                7,087,872               0       0.0%

The FFN-only placement uses half the adapter parameters while often achieving comparable performance to the original dual placement.

Freezing Pretrained Weights

A key aspect of adapter-based fine-tuning is keeping the pretrained weights frozen. Let's implement a utility to prepare a model for adapter training:

In[13]:
Code
def prepare_model_for_adapter_training(model, adapter_names=["adapter"]):
    """
    Freeze all parameters except adapter modules.

    Args:
        model: The model to prepare
        adapter_names: List of substrings to identify adapter parameters
    """
    # First, freeze all parameters
    for param in model.parameters():
        param.requires_grad = False

    # Then, unfreeze adapter parameters
    unfrozen_count = 0
    for name, param in model.named_parameters():
        if any(adapter_name in name for adapter_name in adapter_names):
            param.requires_grad = True
            unfrozen_count += 1

    # Count trainable vs total parameters
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(
        p.numel() for p in model.parameters() if p.requires_grad
    )

    return {
        "total_params": total_params,
        "trainable_params": trainable_params,
        "trainable_pct": trainable_params / total_params * 100,
        "unfrozen_modules": unfrozen_count,
    }
In[14]:
Code
# Demonstrate on our adapted layer
stats = prepare_model_for_adapter_training(layer_both)
Out[15]:
Console
Model preparation for adapter training:

Total parameters: 7,286,144
Trainable parameters: 198,272
Trainable percentage: 2.72%

Parameter status by component:
  self_attn.in_proj_weight                 ✗ Frozen
  self_attn.in_proj_bias                   ✗ Frozen
  self_attn.out_proj.weight                ✗ Frozen
  self_attn.out_proj.bias                  ✗ Frozen
  ffn.0.weight                             ✗ Frozen
  ffn.0.bias                               ✗ Frozen
  ffn.2.weight                             ✗ Frozen
  ffn.2.bias                               ✗ Frozen
  norm1.weight                             ✗ Frozen
  norm1.bias                               ✗ Frozen
  norm2.weight                             ✗ Frozen
  norm2.bias                               ✗ Frozen
  adapter_attn.down_proj.weight            ✓ Trainable
  adapter_attn.down_proj.bias              ✓ Trainable
  adapter_attn.up_proj.weight              ✓ Trainable
  adapter_attn.up_proj.bias                ✓ Trainable
  adapter_ffn.down_proj.weight             ✓ Trainable
  adapter_ffn.down_proj.bias               ✓ Trainable
  adapter_ffn.up_proj.weight               ✓ Trainable
  adapter_ffn.up_proj.bias                 ✓ Trainable

Only the adapter modules remain trainable, while all original transformer weights are frozen. This is the fundamental mechanism that enables parameter-efficient fine-tuning.

Visualizing Adapter Behavior

Let's visualize how an adapter transforms hidden representations during training:

Out[16]:
Visualization
Visualization of adapter bottleneck compression. The input dimension (d=768) is projected down to a bottleneck dimension (r=64), resulting in a 12:1 compression ratio that forces the adapter to learn efficient encodings.
Out[17]:
Visualization
Adapter output magnitude during training. The average norm of the adapter output starts near zero due to initialization and gradually increases as the model learns task-specific transformations.

The visualization shows two key aspects of adapter behavior. The left panel illustrates the dramatic 12:1 compression ratio in the bottleneck, forcing the adapter to learn efficient encodings. The right panel shows how adapter outputs start near zero (due to our initialization strategy) and gradually increase as the adapter learns task-specific transformations.

Adapter Fusion

When you have multiple tasks and want a model that can handle all of them, you face a choice: train a single multi-task model, or train separate adapters for each task and combine them at inference time. Adapter fusion takes the second approach, providing a learned mechanism for combining multiple task-specific adapters. This technique enables flexible composition of specialized knowledge without the interference problems that often plague multi-task training.

The Multi-Adapter Challenge

Imagine you've trained separate adapters for sentiment analysis, named entity recognition, and question answering. Each adapter has learned to transform the base model's representations for its specific task, encoding knowledge about sentiment lexicons, entity boundaries, or answer extraction strategies. Now you encounter a task that requires reasoning about both sentiment and entities, such as analyzing customer reviews to identify which product features receive positive vs. negative sentiment.

Simple approaches like averaging adapter outputs often fail because different tasks may require emphasizing different adapters for different inputs. A token like "Apple" might need strong NER adapter influence to recognize it as an entity, while a token like "amazing" should draw primarily from the sentiment adapter. What you need is a learned attention mechanism that can dynamically weight adapter contributions based on the input context.

AdapterFusion Architecture

AdapterFusion introduces a fusion layer that combines the outputs of multiple pretrained adapters using attention. This approach treats the combination of adapters as itself a learning problem, where the model must discover how to best leverage each adapter's expertise for a new task.

Given $N$ task-specific adapters with outputs $\{\mathbf{a}_1, \mathbf{a}_2, ..., \mathbf{a}_N\}$ for input $\mathbf{h}$, the fusion layer computes a weighted combination:

$$\text{AdapterFusion}(\mathbf{h}) = \sum_{i=1}^{N} \alpha_i \cdot \mathbf{v}_i$$

Each term in this formula plays a specific role:

  • $\text{AdapterFusion}(\mathbf{h})$: the combined output for the task, representing a learned mixture of all adapter contributions
  • $\mathbf{h}$: the input hidden state, which provides context for determining how to weight the adapters
  • $\alpha_i$: the attention weight for the $i$-th adapter, indicating how much that adapter should contribute to this particular input
  • $\mathbf{v}_i$: the value vector for the $i$-th adapter, projected from the adapter output by $\mathbf{v}_i = \mathbf{a}_i \cdot \mathbf{W}_V$. This projection allows the fusion layer to transform adapter outputs before combining them
  • $N$: the total number of task adapters being fused together

The attention weights $\alpha_i$ are computed via a query-key-value attention mechanism, following the same pattern as the attention we studied in transformer architectures:

$$\alpha = \text{softmax}\left(\frac{\mathbf{Q} \cdot \mathbf{K}^T}{\sqrt{d_k}}\right)$$

Each element of this attention computation serves a specific purpose:

  • $\alpha$: the vector of attention weights across all adapters, summing to 1 due to the softmax normalization
  • $\text{softmax}$: the function normalizing the weights to sum to 1, ensuring the fusion produces a proper weighted average
  • $\mathbf{Q}$: the query derived from the hidden state, calculated as $\mathbf{h} \cdot \mathbf{W}_Q$. The query encodes what kind of adapter output is needed for this input
  • $\mathbf{K}$: the keys derived from the adapter outputs, calculated as $[\mathbf{a}_i] \cdot \mathbf{W}_K$. Each key encodes what kind of information its adapter provides
  • $\sqrt{d_k}$: the scaling factor based on the key dimension $d_k$, which stabilizes gradients and prevents the softmax from producing overly peaked distributions

The fusion weights $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$ are the only new parameters trained during fusion. All adapter weights remain frozen, preserving the task-specific knowledge each adapter has learned while allowing the fusion mechanism to discover how to combine them effectively.

Implementation

Let's implement a basic adapter fusion layer:
In[18]:
Code
class AdapterFusion(nn.Module):
    """
    Combines multiple adapter outputs using learned attention weights.

    Args:
        hidden_dim: Transformer hidden dimension
        num_adapters: Number of adapters to fuse
    """

    def __init__(self, hidden_dim, num_adapters):
        super().__init__()

        self.hidden_dim = hidden_dim
        self.num_adapters = num_adapters

        # Attention parameters for fusion
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)
        self.key_proj = nn.Linear(hidden_dim, hidden_dim)
        self.value_proj = nn.Linear(hidden_dim, hidden_dim)

        self.scale = hidden_dim**-0.5

    def forward(self, hidden_state, adapter_outputs):
        """
        Args:
            hidden_state: Current hidden state [batch, seq_len, hidden_dim]
            adapter_outputs: List of N adapter outputs, each [batch, seq_len, hidden_dim]

        Returns:
            Tuple of (fused output [batch, seq_len, hidden_dim],
            attention weights [batch, seq_len, num_adapters])
        """
        batch_size, seq_len, _ = hidden_state.shape

        # Stack adapter outputs: [batch, seq_len, num_adapters, hidden_dim]
        stacked = torch.stack(adapter_outputs, dim=2)

        # Query from hidden state: [batch, seq_len, 1, hidden_dim]
        query = self.query_proj(hidden_state).unsqueeze(2)

        # Keys and values from adapter outputs
        # [batch, seq_len, num_adapters, hidden_dim]
        keys = self.key_proj(stacked)
        values = self.value_proj(stacked)

        # Compute attention scores: [batch, seq_len, 1, num_adapters]
        scores = torch.matmul(query, keys.transpose(-2, -1)) * self.scale
        weights = F.softmax(scores, dim=-1)

        # Apply attention: [batch, seq_len, 1, hidden_dim]
        fused = torch.matmul(weights, values)

        # Remove singleton dimension: [batch, seq_len, hidden_dim]
        return fused.squeeze(2), weights.squeeze(2)

Let's test the fusion layer with simulated adapter outputs:

In[19]:
Code
# Create fusion layer
hidden_dim = 768
num_adapters = 3
fusion = AdapterFusion(hidden_dim, num_adapters)

# Simulate adapter outputs for different tasks
batch_size, seq_len = 2, 16
hidden_state = torch.randn(batch_size, seq_len, hidden_dim)

# Each adapter produces different outputs
adapter_outputs = [
    torch.randn(batch_size, seq_len, hidden_dim) * 0.1,  # Sentiment adapter
    torch.randn(batch_size, seq_len, hidden_dim) * 0.1,  # NER adapter
    torch.randn(batch_size, seq_len, hidden_dim) * 0.1,  # QA adapter
]

# Compute fused output
fused_output, attention_weights = fusion(hidden_state, adapter_outputs)
Out[20]:
Console
AdapterFusion demonstration:

Input hidden state shape: torch.Size([2, 16, 768])
Number of adapters: 3
Fused output shape: torch.Size([2, 16, 768])
Attention weights shape: torch.Size([2, 16, 3])

Sample attention weights (first token):
  Adapter 1 (Sentiment): 0.321
  Adapter 2 (NER):       0.341
  Adapter 3 (QA):        0.338
  Sum: 1.000

The fusion layer produces soft attention weights that sum to 1.0, allowing different adapters to contribute different amounts to the final output. These weights are learned during training on the target task.

Training Adapter Fusion

Adapter fusion uses a two-stage training process that preserves the knowledge in each individual adapter while learning how to combine them:

  1. Stage 1: Train individual adapters on their respective tasks with the base model frozen
  2. Stage 2: Train fusion layers with both the base model and individual adapters frozen

This separation ensures that adapter knowledge is preserved during fusion training. Only the small fusion parameters are updated in stage 2.

In[21]:
Code
def prepare_for_fusion_training(model, adapters, fusion_layers):
    """
    Prepare model for AdapterFusion training.
    Freezes base model and adapters, leaves only fusion layers trainable.
    """
    # Freeze base model
    for param in model.parameters():
        param.requires_grad = False

    # Freeze all adapters
    for adapter in adapters:
        for param in adapter.parameters():
            param.requires_grad = False

    # Keep fusion layers trainable
    for fusion in fusion_layers:
        for param in fusion.parameters():
            param.requires_grad = True

    # Count parameters
    fusion_params = sum(
        sum(p.numel() for p in f.parameters()) for f in fusion_layers
    )

    return fusion_params
In[22]:
Code
# Demonstrate parameter counting
dummy_model = nn.Linear(768, 768)  # Simplified base model
adapters = [AdapterModule(768, 64) for _ in range(3)]
fusion_layers = [AdapterFusion(768, 3)]

fusion_param_count = prepare_for_fusion_training(
    dummy_model, adapters, fusion_layers
)
Out[23]:
Console
AdapterFusion training setup:

Fusion layer parameters: 1,771,776
(3 projection matrices × 768 × 768 = 1,769,472)

Trainable parameter status:
  Base model: Frozen
  Adapters: Frozen
  Fusion layers: Trainable

Visualizing Adapter Fusion

Let's visualize how fusion weights might vary across different input tokens:

Out[24]:
Visualization
Adapter fusion attention weights across tokens in a sample sentence. The model learns to emphasize different task adapters for different parts of the input. Entity mentions trigger higher NER adapter weights, while opinion words activate the sentiment adapter.

The visualization shows how adapter fusion can learn meaningful attention patterns. Entity tokens like "Apple," "Tim," and "Cook" receive higher weights from the NER adapter, while opinion words like "amazing" activate the sentiment adapter more strongly.

Limitations and Impact

Adapter layers represented a significant advancement in parameter-efficient fine-tuning when introduced, establishing many concepts that influenced later methods like LoRA and prefix tuning. However, adapters come with their own set of trade-offs and limitations.

Inference Overhead

Unlike LoRA, which can merge its modifications into the base model weights after training, adapters must remain as separate modules during inference. Each forward pass through a transformer layer must also pass through the adapter modules, adding computational overhead. For a 12-layer model with dual adapters (Houlsby placement), this means 24 additional sequential operations per inference. While each adapter operation is small, the cumulative latency can become noticeable in latency-sensitive applications. This is particularly relevant when compared to LoRA, which we covered in earlier chapters of this part, where the low-rank modifications can be absorbed into the weight matrices for zero additional inference cost.
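To get a rough sense of this overhead, you can time a forward pass with and without adapters. The sketch below reuses the AdaptedTransformerLayer defined earlier in this chapter and measures CPU wall-clock time, so the absolute numbers are illustrative only and will vary with hardware and batch size.

Code
import time
import torch

# Rough, illustrative timing of adapter overhead for a single layer on CPU.
layer_plain = AdaptedTransformerLayer(768, 12, 3072, 64, adapter_placement="none")
layer_dual = AdaptedTransformerLayer(768, 12, 3072, 64, adapter_placement="both")
x = torch.randn(8, 128, 768)

def time_forward(layer, n=20):
    layer.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n):
            layer(x)
    return (time.perf_counter() - start) / n

print(f"No adapters:   {time_forward(layer_plain) * 1e3:.2f} ms per forward pass")
print(f"Dual adapters: {time_forward(layer_dual) * 1e3:.2f} ms per forward pass")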

Sequential Dependencies

The sequential placement of adapters within the transformer forward pass limits parallelization opportunities. Modern hardware accelerators achieve peak efficiency when they can batch many operations together, but adapters introduce serial dependencies that can leave computational resources underutilized. Parallel adapter variants address this partially, but the fundamental architectural choice of inserting modules into the forward pass creates constraints that additive methods like LoRA avoid.

Activation Memory

During training, adapters require storing additional activations for backpropagation through their bottleneck layers. For large batch sizes or long sequences, this memory overhead can become significant. The bottleneck architecture mitigates this somewhat since the intermediate representation is small, but the input to each adapter (the full hidden state) must still be cached.

Historical Impact

Despite these limitations, adapters proved several important concepts that shaped the field:

  • Additive fine-tuning works: The success of adapters demonstrated that pretrained models contain sufficient knowledge to adapt to new tasks with minimal additional parameters
  • Bottleneck architecture: The compression-expansion pattern has been adopted and refined in many subsequent methods
  • Modular adaptation: The idea of swappable, task-specific modules enabled new deployment patterns and influenced the development of mixture-of-experts approaches
  • Adapter libraries: The need to manage multiple adapters catalyzed the development of tools like the Hugging Face PEFT library, which now supports many parameter-efficient methods

Adapter fusion particularly influenced the multi-task learning literature, showing that separately trained adaptations could be effectively combined without catastrophic interference between tasks.

Summary

This chapter explored adapter layers as a parameter-efficient approach to fine-tuning pretrained language models. We covered several key concepts:

Adapter architecture uses a bottleneck design where a down-projection compresses the hidden representation, a nonlinearity enables complex transformations, and an up-projection restores the original dimension. The residual connection around this bottleneck allows adapters to learn task-specific modifications while preserving the base model's representations.

Adapter placement can follow the original Houlsby strategy (adapters after both attention and FFN sublayers) or the more efficient Pfeiffer approach (adapters only after the FFN). The single-adapter placement often achieves comparable performance with half the parameters.

Adapter dimensionality controls the trade-off between capacity and efficiency. Bottleneck dimensions of 32 to 64 provide a good default for most tasks, with the compression ratio forcing adapters to learn generalizable transformations.

Adapter fusion enables combining multiple task-specific adapters through learned attention weights. This two-stage approach first trains individual adapters, then trains small fusion layers to dynamically weight their contributions.

Adapters add sequential inference overhead compared to methods like LoRA that can merge into base weights, but they offer clean modularity and established tooling. In the upcoming chapter on PEFT comparison, we'll examine how adapters stack up against the other methods we've covered, including LoRA variants, prefix tuning, and prompt tuning, across different tasks and model scales.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about adapter layers and their role in parameter-efficient fine-tuning.

