IA3: Parameter-Efficient Fine-Tuning with Rescaling Vectors

Michael Brenndoerfer · December 6, 2025 · 32 min read

Learn how IA3 adapts large language models by rescaling activations with minimal parameters. Compare IA3 vs LoRA for efficient fine-tuning strategies.


IA3

While LoRA achieves impressive parameter efficiency by learning low-rank updates to weight matrices, the researchers behind IA3 asked a simpler question: what if you could adapt the model by learning to rescale existing activations instead of modifying the weights? The result is IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations), which achieves even greater parameter efficiency than LoRA by learning element-wise rescaling vectors instead of low-rank matrix decompositions.

The core idea is simple: instead of learning $\Delta W = BA$ like LoRA, IA3 learns a vector $\ell$ that scales activations through element-wise multiplication. Where LoRA might require thousands of parameters per adapted weight matrix, IA3 needs only $d$ parameters per rescaling vector, where $d$ is the hidden dimension. For a 7-billion parameter model, you can adapt with just 0.01% of the original parameters while still achieving competitive performance on many tasks.

IA3 was introduced by Liu et al. in their 2022 paper "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning." The method emerged from studying how to make parameter-efficient fine-tuning work well in few-shot scenarios where minimal training data is available. The authors found that the multiplicative nature of IA3's rescaling vectors provides better inductive biases for few-shot learning than additive methods like LoRA.

The IA3 Formulation

To understand IA3, let's first establish how standard transformer layers compute their outputs. This foundation shows where IA3 intervenes and why those points were chosen. In multi-head attention, the model computes queries, keys, and values through linear projections:

$$\begin{aligned} Q &= XW_Q \\ K &= XW_K \\ V &= XW_V \end{aligned}$$

where:

  • $Q$: the query matrix
  • $K$: the key matrix
  • $V$: the value matrix
  • $X$: the input sequence matrix ($n \times d$)
  • $W_Q$: the query projection matrix ($d \times d$)
  • $W_K$: the key projection matrix ($d \times d$)
  • $W_V$: the value projection matrix ($d \times d$)
  • $n$: the sequence length
  • $d$: the model hidden dimension

The attention output then feeds into a feed-forward network with projections $W_1$ and $W_2$.

The key question: where should we insert our adaptation mechanism and why? IA3 targets keys, values, and feed-forward intermediate activations. Queries determine what information a position is looking for. Keys and values determine what information is available to be found and retrieved. By rescaling keys, you can change which tokens appear relevant to a query. By rescaling values, you can modify what information gets aggregated when attention weights are applied. By rescaling feed-forward activations, you can adjust which learned features are emphasized in the transformation pipeline.

IA3 modifies these computations by introducing learned rescaling vectors that multiply the keys, values, and feed-forward intermediate activations. Specifically, IA3 learns three vectors:

  • $\ell_K \in \mathbb{R}^{d}$ for rescaling keys
  • $\ell_V \in \mathbb{R}^{d}$ for rescaling values
  • $\ell_{ff} \in \mathbb{R}^{d_{ff}}$ for rescaling feed-forward activations

The modified computations become:

$$\begin{aligned} K' &= \ell_K \odot (XW_K) \\ V' &= \ell_V \odot (XW_V) \end{aligned}$$

where:

  • $K'$: the rescaled key matrix
  • $V'$: the rescaled value matrix
  • $\ell_K$: the learned rescaling vector for keys
  • $\ell_V$: the learned rescaling vector for values
  • $XW_K$: the standard key projection of the input
  • $XW_V$: the standard value projection of the input
  • $\odot$: element-wise multiplication (broadcast across the sequence dimension)

Notice how simple this formulation is. The original projection $XW_K$ produces a matrix where each row represents a token's key vector. The rescaling vector $\ell_K$ then multiplies each dimension of every key vector by the same factor. If $\ell_K[i] = 2.0$, then the $i$-th dimension of every key in the sequence gets doubled. If $\ell_K[j] = 0.5$, the $j$-th dimension gets halved. This uniform rescaling across all positions allows IA3 to learn which aspects of the key representation matter most for your downstream task.
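
To make the broadcasting concrete, here is a minimal PyTorch sketch (the dimensions and values are illustrative, not taken from the paper) showing that a single rescaling vector multiplies the same dimension of every token's key:

import torch

n, d = 4, 8                       # sequence length, hidden dimension (illustrative)
X = torch.randn(n, d)             # token representations
W_K = torch.randn(d, d)           # frozen key projection
l_K = torch.ones(d)               # IA3 key rescaling vector, initialized to ones
l_K[2] = 2.0                      # amplify dimension 2 for every token
l_K[5] = 0.5                      # inhibit dimension 5 for every token

K = X @ W_K                       # standard keys, shape (n, d)
K_rescaled = l_K * K              # broadcast over the sequence dimension

print(torch.allclose(K_rescaled[:, 2], 2.0 * K[:, 2]))  # True
print(torch.allclose(K_rescaled[:, 5], 0.5 * K[:, 5]))  # True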

For feed-forward layers with intermediate activation $h = \gamma(XW_1)$, where $\gamma$ is the activation function, IA3 applies:

$$h' = \ell_{ff} \odot h$$

where:

  • $h'$: the rescaled intermediate activation
  • $\ell_{ff}$: the learned rescaling vector
  • $h$: the original intermediate activation $\gamma(XW_1)$
  • $\odot$: element-wise multiplication
  • $\gamma$: the activation function (e.g., GELU)
  • $W_1$: the first feed-forward projection matrix

The feed-forward rescaling operates on a particularly interesting point in the computation. After the input passes through the first projection and nonlinearity, the intermediate representation $h$ exists in a higher-dimensional space (typically 4 times the hidden dimension). This expanded space contains specialized features that the model learned during pretraining. By rescaling these intermediate features, IA3 can effectively turn specific computational pathways on or off, amplifying features relevant to the current task while suppressing those that might introduce noise.

Hadamard Product

The Hadamard product, $\odot$, performs element-wise multiplication between vectors or matrices of the same shape. For vectors $a, b \in \mathbb{R}^d$, the result is $(a \odot b)_i = a_i \cdot b_i$ for each element $i$.

The key insight is that these rescaling operations don't change the weight matrices at all. Instead, they learn to amplify or inhibit specific dimensions of the activation space. When $\ell_i > 1$, the $i$-th dimension is amplified. When $\ell_i < 1$, it's inhibited. When $\ell_i = 1$, the original behavior is preserved (no rescaling). This creates a natural interpretation: the model learns a "volume control" for each dimension of its internal representations, turning up dimensions that help with the task and turning down dimensions that don't contribute or that introduce interference.
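
A short numeric example (values chosen arbitrarily) makes the volume-control picture concrete: one entry amplifies, one inhibits, and one preserves its dimension.

import torch

h = torch.tensor([0.5, -1.0, 2.0])   # an activation vector
l = torch.tensor([2.0, 0.5, 1.0])    # amplify, inhibit, preserve
print(l * h)                          # tensor([ 1.0000, -0.5000,  2.0000])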

Understanding Learned Rescaling Vectors

Why does rescaling activations work for task adaptation? The intuition comes from understanding what pretrained models have already learned. During pretraining, models develop rich representations where different dimensions encode different types of information: some capture syntactic relationships, others capture semantic meanings, and still others encode task-specific patterns. These dimensions emerge organically from the pretraining objective, with the model learning to distribute information across available capacity in ways that best predict the next token.

When fine-tuning for a specific downstream task, you often just need to reweight which aspects of the existing representations matter most rather than learning entirely new computations. IA3's rescaling vectors learn exactly this reweighting. Think of it as a spotlight: the pretrained model has learned to notice many different features of the input, and IA3 learns which of those features deserve attention for the current task.

Consider a concrete example, adapting a pretrained model for sentiment analysis. The model already understands language and has dimensions that respond to emotional words, negation patterns, and intensifiers. IA3 learns which of these existing capabilities to emphasize. It might learn to amplify dimensions sensitive to sentiment-bearing words while dampening dimensions focused on factual content. A dimension that fires strongly when processing numerical quantities might be suppressed because such information rarely matters for sentiment. Meanwhile, a dimension that activates for words like "wonderful," "terrible," or "disappointing" might be amplified because these words carry crucial sentiment signals.

Out[3]:
Visualization
Original activation vectors in 2D space form an isotropic circle, showing how pretrained representations distribute information equally across dimensions before adaptation.
After IA3 diagonal scaling with factors 1.8 and 0.5, the circle transforms into an axis-aligned ellipse. Unlike LoRA which can rotate the representation space, IA3 only stretches or compresses along existing axes, revealing its fundamental constraint: it reweights existing features while preserving the geometric structure learned during pretraining.

Mathematically, we can understand IA3 as learning a diagonal matrix transformation. If we define $L_K = \text{diag}(\ell_K)$ as the diagonal matrix with $\ell_K$ on the diagonal, then:

$$\begin{aligned} K' &= \ell_K \odot (XW_K) && \text{(element-wise multiplication)} \\ &= (X W_K) L_K && \text{(rewrite as diagonal matrix multiplication)} \end{aligned}$$

where:

  • $K'$: the rescaled key matrix
  • $\ell_K$: the rescaling vector
  • $X$: the input sequence
  • $W_K$: the pretrained projection matrix
  • $L_K$: the diagonal matrix formed from $\ell_K$ ($L_K = \text{diag}(\ell_K)$)
  • $\odot$: element-wise multiplication

This reformulation shows that IA3 is equivalent to right-multiplying the key projection by a diagonal matrix, which reveals something important about its expressivity. A diagonal matrix can only stretch or compress coordinates independently; it cannot rotate the space or create new directions. Compared to LoRA's full low-rank update $W + BA$, IA3's diagonal update is far more constrained but requires dramatically fewer parameters. The low-rank matrices in LoRA can express rotations, projections onto subspaces, and more general linear transformations. IA3 sacrifices this flexibility for extreme efficiency, betting that the existing coordinate system is already well-suited to the task and only needs adjustment in magnitude.
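
The equivalence is easy to check numerically. A quick sketch (shapes are arbitrary) verifying that element-wise rescaling matches right-multiplication by the diagonal matrix:

import torch

d = 6
X = torch.randn(3, d)
W_K = torch.randn(d, d)
l_K = torch.rand(d) + 0.5

elementwise = l_K * (X @ W_K)               # IA3's formulation
diagonal = (X @ W_K) @ torch.diag(l_K)      # the diagonal-matrix view
print(torch.allclose(elementwise, diagonal))  # True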

Initialization Strategy

IA3 initializes all rescaling vectors to ones: $\ell_K = \ell_V = \ell_{ff} = \mathbf{1}$. This initialization is crucial because it means the model starts with exactly the same behavior as the pretrained model. On the very first forward pass before any training occurs, every activation passes through unchanged. During training, the vectors gradually shift away from unity to adapt the model. Each dimension can move in its own direction based on task-specific gradients.

This contrasts with LoRA's initialization, where the $A$ matrix is initialized with small random values and $B$ with zeros. Both approaches ensure your adapted model starts equivalent to the pretrained model, but they arrive there through different means. LoRA achieves equivalence by making the adaptation term zero (since $B \cdot A = 0 \cdot A = 0$), while IA3 achieves it by making the multiplication an identity operation (since $1 \cdot x = x$). The philosophical difference is subtle but meaningful: LoRA starts from "no change" while IA3 starts from "change by factor of one."
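
Both starting points can be checked in a few lines. A small sketch (dimensions are arbitrary) showing that each initialization leaves the pretrained computation untouched:

import torch

d, r = 8, 4
x = torch.randn(d)

# IA3 at initialization: rescaling by ones is the identity
l = torch.ones(d)
print(torch.allclose(l * x, x))                        # True

# LoRA at initialization: B is zero, so the update BA contributes nothing
A = torch.randn(r, d) * 0.01
B = torch.zeros(d, r)
print(torch.allclose(B @ (A @ x), torch.zeros(d)))     # True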

Parameter Count Analysis

Let's quantify exactly how parameter-efficient IA3 is. Understanding these numbers helps you decide which method to use in resource-constrained scenarios. For a standard transformer layer with hidden dimension $d$ and feed-forward dimension $d_{ff}$ (typically $4d$), IA3 introduces:

  • $d$ parameters for $\ell_K$ (key rescaling)
  • $d$ parameters for $\ell_V$ (value rescaling)
  • $d_{ff}$ parameters for $\ell_{ff}$ (feed-forward rescaling)

The total per layer is $2d + d_{ff}$. For a typical configuration where $d_{ff} = 4d$:

$$\begin{aligned} \text{IA3 params per layer} &= 2d + d_{ff} \\ &= 2d + 4d && \text{(substitute $d_{ff} = 4d$)} \\ &= 6d \end{aligned}$$

where:

  • $d$: the hidden dimension of the model
  • $d_{ff}$: the feed-forward dimension (typically $4d$)
  • $2d$: parameters for key and value rescaling vectors ($\ell_K$, $\ell_V$)
  • $4d$: parameters for the feed-forward rescaling vector ($\ell_{ff}$), assuming $d_{ff} = 4d$

This linear scaling with hidden dimension is remarkably gentle. As models grow larger, their hidden dimensions increase, but IA3's parameter count grows only proportionally, not quadratically as the original weights do.

Compare this to LoRA. For LoRA with the same components (keys, values, and feed-forward layers) and rank $r$:

$$\begin{aligned} \text{LoRA params per layer} &= 2 \times 2dr + 2 \times (d + d_{ff})r \\ &= 4dr + 2 \times (d + 4d)r && \text{(substitute $d_{ff} = 4d$)} \\ &= 4dr + 2 \times (5d)r \\ &= 4dr + 10dr \\ &= 14dr \end{aligned}$$

where:

  • $r$: the rank of the adaptation
  • $d$: the hidden dimension
  • $d_{ff}$: the feed-forward dimension ($4d$)
  • $2dr$: parameters per adapted attention matrix (keys, values)
  • $(d + d_{ff})r$: parameters per adapted feed-forward matrix

The crucial difference is the factor of rr in LoRA's count. Even at rank 1, LoRA requires more parameters than IA3. At typical ranks used in practice, the gap becomes substantial.

For LoRA with typical rank $r = 8$ and hidden dimension $d = 4096$, here's the comparison:

  • LoRA: $14 \times 8 \times 4096 = 458,752$ parameters per layer
  • IA3: $6 \times 4096 = 24,576$ parameters per layer

IA3 requires roughly 2.3r times fewer parameters than LoRA in general (approximately 19 times fewer at rank 8).

In[4]:
Code
def count_parameters(
    d_model: int, d_ff: int, num_layers: int, lora_rank: int = 8
):
    """Compare parameter counts for IA3 vs LoRA."""

    # IA3: rescaling vectors for keys, values, and feed-forward
    ia3_per_layer = 2 * d_model + d_ff  # l_k, l_v, l_ff
    ia3_total = ia3_per_layer * num_layers

    # LoRA: low-rank matrices for same components
    # Each adapted weight needs r*(d_in + d_out) parameters
    lora_k = lora_rank * (d_model + d_model)  # Key projection adaptation
    lora_v = lora_rank * (d_model + d_model)  # Value projection adaptation
    lora_ff1 = lora_rank * (d_model + d_ff)  # First FFN projection adaptation
    lora_ff2 = lora_rank * (d_ff + d_model)  # Second FFN projection adaptation
    lora_per_layer = lora_k + lora_v + lora_ff1 + lora_ff2
    lora_total = lora_per_layer * num_layers

    return {
        "ia3_per_layer": ia3_per_layer,
        "ia3_total": ia3_total,
        "lora_per_layer": lora_per_layer,
        "lora_total": lora_total,
        "ratio": lora_total / ia3_total,
    }


# LLaMA-7B configuration
llama_7b = count_parameters(d_model=4096, d_ff=11008, num_layers=32)

# GPT-2 configuration
gpt2 = count_parameters(d_model=768, d_ff=3072, num_layers=12)
Out[5]:
Console
Parameter Comparison: IA3 vs LoRA (rank=8)
==================================================

LLaMA-7B (4096 hidden, 32 layers):
  IA3 total:       614,400 params
  LoRA total:   11,927,552 params
  LoRA/IA3 ratio: 19.4x more params for LoRA

GPT-2 (768 hidden, 12 layers):
  IA3 total:        55,296 params
  LoRA total:    1,032,192 params
  LoRA/IA3 ratio: 18.7x more params for LoRA

For LLaMA-7B, IA3 requires just over 600,000 trainable parameters, while LoRA with rank 8 needs approximately 12 million. This extreme efficiency makes IA3 particularly attractive for scenarios where memory is severely constrained or when you need to store many task-specific adaptations.

Out[7]:
Visualization
Parameter scaling across six model sizes from GPT-2 to LLaMA-13B. IA3 maintains consistent efficiency below 1M parameters regardless of model size, while LoRA's parameter count grows multiplicatively with rank. At rank 8, LoRA requires 10 to 20 times more parameters than IA3, and the gap grows linearly with rank, illustrating the fundamental efficiency advantage of diagonal scaling.

Implementation

Let's implement IA3 to see how the rescaling mechanism works in practice. We'll create IA3-adapted versions of attention and feed-forward layers.

In[8]:
Code
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class IA3Attention(nn.Module):
    """Multi-head attention with IA3 rescaling on keys and values."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        # Standard attention projections (frozen during IA3 training)
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

        # IA3 rescaling vectors (these are the trainable parameters)
        # Initialize to ones so model starts with original behavior
        self.l_k = nn.Parameter(torch.ones(d_model))
        self.l_v = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        batch_size, seq_len, _ = x.shape

        # Compute Q, K, V projections
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)

        # Apply IA3 rescaling to keys and values
        # Rescaling is applied before reshaping to multi-head format
        K = (
            self.l_k * K
        )  # Element-wise multiplication, broadcast over batch and seq
        V = self.l_v * V

        # Reshape for multi-head attention
        Q = Q.view(
            batch_size, seq_len, self.num_heads, self.head_dim
        ).transpose(1, 2)
        K = K.view(
            batch_size, seq_len, self.num_heads, self.head_dim
        ).transpose(1, 2)
        V = V.view(
            batch_size, seq_len, self.num_heads, self.head_dim
        ).transpose(1, 2)

        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))

        attn_weights = F.softmax(scores, dim=-1)
        attn_output = torch.matmul(attn_weights, V)

        # Reshape and project output
        attn_output = (
            attn_output.transpose(1, 2)
            .contiguous()
            .view(batch_size, seq_len, self.d_model)
        )
        return self.W_o(attn_output)

The key lines are K = self.l_k * K and V = self.l_v * V, which perform the simple element-wise multiplication that distinguishes IA3-adapted attention from standard attention. The rescaling vectors l_k and l_v start at ones, preserving the original model behavior, then learn to amplify or suppress specific dimensions during training.
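
A quick usage check (the shapes are illustrative): at initialization the rescaling vectors are all ones, so the layer reproduces exactly what standard attention with the same frozen weights would compute.

attn = IA3Attention(d_model=64, num_heads=4)
x = torch.randn(2, 10, 64)    # (batch, seq_len, d_model)
print(attn(x).shape)          # torch.Size([2, 10, 64])
# l_k and l_v are all ones here, so this output matches unadapted attention with the same weights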

In[9]:
Code
class IA3FeedForward(nn.Module):
    """Feed-forward network with IA3 rescaling on intermediate activations."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()

        # Standard FFN layers (frozen during IA3 training)
        self.W_1 = nn.Linear(d_model, d_ff, bias=False)
        self.W_2 = nn.Linear(d_ff, d_model, bias=False)

        # IA3 rescaling vector for intermediate activations
        self.l_ff = nn.Parameter(torch.ones(d_ff))

    def forward(self, x: torch.Tensor):
        # First projection and activation
        h = F.gelu(self.W_1(x))

        # IA3 rescaling of intermediate activations
        h = self.l_ff * h

        # Second projection
        return self.W_2(h)

The feed-forward implementation follows the same pattern. The rescaling happens after the first linear projection and activation function, allowing IA3 to learn which intermediate features are most relevant for the downstream task.

In[10]:
Code
class IA3TransformerBlock(nn.Module):
    """Complete transformer block with IA3 adaptation."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()

        self.attention = IA3Attention(d_model, num_heads)
        self.ffn = IA3FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        # Pre-norm architecture
        x = x + self.attention(self.norm1(x), mask)
        x = x + self.ffn(self.norm2(x))
        return x

    def ia3_parameters(self):
        """Return only the IA3 rescaling parameters."""
        return [self.attention.l_k, self.attention.l_v, self.ffn.l_ff]

    def freeze_pretrained(self):
        """Freeze all parameters except IA3 vectors."""
        for name, param in self.named_parameters():
            if "l_k" not in name and "l_v" not in name and "l_ff" not in name:
                param.requires_grad = False

The ia3_parameters method provides convenient access to just the trainable rescaling vectors, making it easy to set up optimizers that only train these parameters.

In[11]:
Code
# Create a model and examine IA3 parameters
d_model = 512
num_heads = 8
d_ff = 2048

block = IA3TransformerBlock(d_model, num_heads, d_ff)
block.freeze_pretrained()

# Count trainable vs frozen parameters
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in block.parameters() if not p.requires_grad)
Out[12]:
Console
IA3 Transformer Block Parameter Analysis
=============================================
Hidden dimension: 512
Feed-forward dimension: 2048

Trainable (IA3 vectors): 3,072
Frozen (pretrained):     3,147,776
Trainable percentage:    0.0975%

IA3 vector breakdown:
  l_k (key rescaling):   512 params
  l_v (value rescaling): 512 params
  l_ff (FFN rescaling):  2048 params

With a 512-dimensional model and 2048-dimensional feed-forward, IA3 adds just 3,072 trainable parameters to a block that contains over 3 million frozen parameters. This demonstrates the dramatic efficiency gains you can achieve.
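
Training then only needs an optimizer over the rescaling vectors. A minimal sketch (the learning rate, dummy batch, and mean-squared-error objective are purely illustrative), continuing from the block defined above:

optimizer = torch.optim.AdamW(block.ia3_parameters(), lr=3e-3)

x = torch.randn(2, 16, d_model)        # dummy batch: (batch, seq_len, d_model)
target = torch.randn(2, 16, d_model)   # dummy target for illustration

output = block(x)
loss = F.mse_loss(output, target)
loss.backward()
optimizer.step()                        # only l_k, l_v, and l_ff receive updates
optimizer.zero_grad()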

Visualizing Learned Rescaling

Let's visualize how rescaling vectors change from their initial uniform values to build intuition about what IA3 learns.

In[13]:
Code
import numpy as np


# Simulate learned IA3 vectors (in practice these would come from training)
np.random.seed(42)

# Simulate l_k: some dimensions amplified, others suppressed
l_k_learned = np.ones(128)
# Amplify certain "sentiment-related" dimensions
l_k_learned[10:20] = np.random.uniform(1.5, 2.5, 10)
# Suppress "syntactic" dimensions
l_k_learned[50:70] = np.random.uniform(0.3, 0.7, 20)
# Add some noise to others
l_k_learned += np.random.normal(0, 0.1, 128)

# Simulate l_ff: stronger adaptation in feed-forward
l_ff_learned = np.ones(512)
l_ff_learned[100:150] = np.random.uniform(2.0, 3.0, 50)  # Strongly amplified
l_ff_learned[300:400] = np.random.uniform(0.1, 0.5, 100)  # Strongly suppressed
l_ff_learned += np.random.normal(0, 0.15, 512)
Out[14]:
Visualization
Key rescaling factors across 128 dimensions cluster tightly around 1.0, demonstrating modest reweighting during adaptation. Most dimensions remain between 0.8 and 1.2, indicating that key representation adaptation relies on subtle adjustments rather than dramatic amplification or suppression.
Feed-forward rescaling factors across 512 dimensions show amplified (2.0 to 3.0) and suppressed (0.1 to 0.5) dimensions flanking near-identity values (around 1.0). The broader distribution demonstrates more dramatic feature activation and suppression than key layers, revealing that task adaptation relies heavily on selectively activating feed-forward pathways. This pattern suggests the model uses feed-forward dimensions as primary controls for task-specific feature emphasis.

The histograms show how learned rescaling vectors deviate from their initial uniform values of 1.0. Some dimensions are amplified well above 1.0, while others are suppressed toward 0. This selective amplification and inhibition is how IA3 adapts the pretrained model's behavior without changing any of the original weights.

Out[15]:
Visualization
Per-dimension rescaling factors for 512 feed-forward dimensions as deviations from 1.0. Coral bars indicate amplified dimensions (values above 1.0) and blue bars indicate suppressed dimensions (values below 1.0). Contiguous clusters of similar colors reveal that the model learns to activate or suppress groups of related features together, suggesting that task adaptation exploits organized feature structure in the learned representations rather than treating dimensions independently.

This dimension-wise view reveals that IA3 learns structured patterns. Consecutive dimensions often show similar rescaling behavior, suggesting the model discovers meaningful groups of features to amplify or suppress together.

Out[17]:
Visualization
Attention weights from a pretrained model when processing the word 'terrible' in a sentiment context. The nearly uniform distribution reflects general pretraining objectives without sentiment specialization, allocating roughly equal attention to all tokens.

IA3 vs LoRA: A Detailed Comparison

Having covered LoRA in previous chapters and now IA3, let's compare these two PEFT methods across several dimensions.

Mathematical Formulation

The fundamental difference lies in how each method parameterizes the adaptation. Understanding this difference mathematically helps clarify when each approach is most appropriate.

LoRA learns an additive low-rank update:

$$W' = W + \frac{\alpha}{r} BA$$

where:

  • $W'$: the adapted weight matrix
  • $W$: the frozen pretrained weights
  • $\alpha$: the scaling factor
  • $r$: the rank
  • $B$: the low-rank output matrix ($d_{out} \times r$)
  • $A$: the low-rank input matrix ($r \times d_{in}$)
  • $d_{in}$: the input dimension
  • $d_{out}$: the output dimension

The additive nature of LoRA means it learns to supplement the original weight matrix. The product $BA$ represents a new linear transformation added to what the model originally computed. This addition happens in weight space, before any activations are computed.

IA3 learns a multiplicative diagonal scaling:

$$h' = \ell \odot h$$

where:

  • $h'$: the scaled activation
  • $\ell$: the learnable rescaling vector ($\ell \in \mathbb{R}^{d}$)
  • $h$: the input activation
  • $\odot$: element-wise multiplication

IA3's multiplicative intervention happens in activation space, after the original computation but before the result is used downstream. Rather than changing what gets computed, IA3 changes how much of each computed quantity flows forward in the network.

LoRA modifies what the model computes by adding new terms to weight matrices. IA3 modifies how much the model uses what it already computes by rescaling activations. Choose between them based on your constraints and task requirements.
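
The contrast fits in a few lines of code. A hypothetical sketch (dimensions, rank, and values are arbitrary) of both update styles applied to a single frozen projection:

import torch

d, r = 8, 2
W = torch.randn(d, d)             # frozen pretrained projection
x = torch.randn(d)

# LoRA: additive low-rank update in weight space
A = torch.randn(r, d) * 0.01
B = torch.randn(d, r) * 0.01
y_lora = (W + B @ A) @ x          # new terms are added to the computation

# IA3: multiplicative rescaling in activation space
l = torch.rand(d) + 0.5
y_ia3 = l * (W @ x)               # the original computation is reweighted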

Parameter Efficiency

In[18]:
Code
def compare_efficiency(d_model: int, d_ff: int, num_layers: int):
    """Compare IA3 vs LoRA across different ranks."""

    results = []
    ranks = [1, 2, 4, 8, 16, 32, 64]

    # IA3 is fixed (no rank hyperparameter)
    ia3_total = (2 * d_model + d_ff) * num_layers

    for r in ranks:
        # LoRA applied to K, V, and both FFN projections
        lora_total = (
            2 * 2 * d_model * r + 2 * (d_model + d_ff) * r
        ) * num_layers
        results.append(
            {
                "rank": r,
                "lora_params": lora_total,
                "ia3_params": ia3_total,
                "ratio": lora_total / ia3_total,
            }
        )

    return results


# Compare for LLaMA-7B scale
comparisons = compare_efficiency(d_model=4096, d_ff=11008, num_layers=32)
Out[19]:
Visualization
Trainable parameter counts for LLaMA-7B comparing IA3 and LoRA across ranks 1 to 64 (logarithmic scale). IA3 stays constant at roughly 0.6M parameters while LoRA grows from roughly 1.5M (rank 1) to 95M (rank 64). At the commonly used rank 8, IA3 requires approximately 19 times fewer parameters than LoRA, illustrating the efficiency-expressivity tradeoff. IA3's diagonal scaling requires minimal parameters but cannot learn rotations, while LoRA's flexible low-rank updates provide greater expressivity at the cost of significantly more parameters.

Even at rank 1, LoRA requires more than twice as many parameters as IA3. At the commonly used rank 8, LoRA needs roughly 19 times more. This dramatic difference stems from IA3 needing only $O(d)$ parameters per adapted component versus LoRA's $O(rd)$ parameters.

Expressivity vs Efficiency Tradeoff

IA3's extreme efficiency comes with reduced expressivity. LoRA can learn arbitrary low-rank updates to weight matrices, while IA3 can only learn diagonal scaling. This limitation means IA3 cannot change the direction of computations, only their magnitude.

Consider what each method can express geometrically. LoRA's update $BA$ can rotate, scale, and project the input in any direction within a rank-$r$ subspace. IA3's diagonal scaling can only stretch or compress along the existing coordinate axes. If you imagine the model's representations as points in a high-dimensional space, LoRA can learn to rotate that space into a new orientation. IA3 can only stretch or squeeze it along the original axes.

Out[21]:
Visualization
Original isotropic circle representation showing how both dimensions carry equal information in the pretrained model before task-specific adaptation.

Despite this limitation, IA3 often performs surprisingly well, particularly in few-shot scenarios. The authors of the original paper hypothesize that pretrained models already contain most of the computational machinery needed for downstream tasks. What fine-tuning needs to learn is mainly which existing computations to emphasize, which is exactly what IA3 provides. This hypothesis suggests that the geometry learned during pretraining is already well-suited to many tasks, and adaptation is more about emphasis than restructuring.

Training Dynamics

The two methods also differ in their training dynamics. LoRA uses standard weight decay regularization on the $A$ and $B$ matrices. IA3 applies no explicit regularization to the rescaling vectors, but the multiplicative nature of the adaptation provides implicit regularization.

When IA3's rescaling values approach zero, they effectively disable entire dimensions. This creates a natural form of feature selection where unimportant dimensions get suppressed. LoRA, being additive, doesn't have this same dynamic. Setting $BA$ to zero means no adaptation, not suppression of existing computations.

Merging Behavior

Both LoRA and IA3 can merge their adaptations into the base model weights for efficient inference. For LoRA, merging means adding the low-rank product to the original weights: $W_{merged} = W + BA$. For IA3, the merging is multiplicative.

When the rescaling is applied after a linear projection, IA3 can merge by right-multiplying the weight matrix:

$$W_{merged} = W \, \text{diag}(\ell)$$

where:

  • $W_{merged}$: the merged weight matrix
  • $W$: the original weight matrix
  • $\text{diag}(\ell)$: the diagonal matrix of rescaling factors
  • $\ell$: the vector of learned rescaling values

This scales each column of $W$ by its corresponding rescaling factor. The merged model then runs with no additional overhead, just like LoRA.
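
The equivalence between merging and applying the rescaling at runtime is easy to check directly. A small sketch (shapes are arbitrary) using the document's $K = XW$ convention, where multiplying by $\text{diag}(\ell)$ on the right scales the columns of $W$:

import torch

d = 6
W = torch.randn(d, d)             # frozen projection, applied as x @ W
l = torch.rand(d) + 0.5           # learned IA3 rescaling vector
x = torch.randn(3, d)

runtime = l * (x @ W)                     # rescaling applied at inference time
merged = x @ (W @ torch.diag(l))          # rescaling folded into the weights
print(torch.allclose(runtime, merged))    # True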

When to Choose Each Method

Based on this analysis, use these guidelines for choosing between IA3 and LoRA:

Choose IA3 when:

  • Memory is extremely constrained
  • You must store many task-specific adaptations
  • Working in few-shot scenarios (10-1000 examples)
  • The downstream task is similar to pretraining
  • Training speed is critical (fewer parameters means faster updates)

Choose LoRA when:

  • You need higher model expressivity
  • The downstream task differs significantly from pretraining
  • You have sufficient training data to learn more complex adaptations
  • You need fine-grained control over adaptation rank per layer

IA3 with PEFT Library

The Hugging Face PEFT library provides a convenient implementation of IA3 that integrates seamlessly with transformers models.

In[22]:
Code
from peft import IA3Config, get_peft_model, TaskType
from transformers import AutoConfig, AutoModelForSequenceClassification

# Load a base model configuration (avoids downloading large weights)
model_name = "distilbert-base-uncased"
config = AutoConfig.from_pretrained(model_name, num_labels=2)
model = AutoModelForSequenceClassification.from_config(config)

# Configure IA3 (feedforward_modules must be subset of target_modules)
ia3_config = IA3Config(
    task_type=TaskType.SEQ_CLS,
    target_modules=["k_lin", "v_lin", "lin1"],
    feedforward_modules=["lin1"],
    modules_to_save=["classifier"],
    init_ia3_weights=True,
)

# Apply IA3 to the model
ia3_model = get_peft_model(model, ia3_config)
Out[23]:
Console
trainable params: 605,954 || all params: 67,560,964 || trainable%: 0.8969

The PEFT library handles all the complexity of identifying which modules to adapt and wrapping them with IA3 rescaling. The target_modules parameter lists every module that receives a rescaling vector, while feedforward_modules marks which of those are feed-forward layers and therefore get the $\ell_{ff}$-style rescaling.

In[24]:
Code
# Examine the IA3 parameters
ia3_params = {
    name: param.shape
    for name, param in ia3_model.named_parameters()
    if param.requires_grad
}
Out[25]:
Console
Trainable IA3 Parameters:
--------------------------------------------------
base_model.model.distilbert.transformer.layer.0.attention.k_lin.ia3_l.default: torch.Size([768, 1])
base_model.model.distilbert.transformer.layer.0.attention.v_lin.ia3_l.default: torch.Size([768, 1])
base_model.model.distilbert.transformer.layer.0.ffn.lin1.ia3_l.default: torch.Size([1, 768])
base_model.model.distilbert.transformer.layer.1.attention.k_lin.ia3_l.default: torch.Size([768, 1])
base_model.model.distilbert.transformer.layer.1.attention.v_lin.ia3_l.default: torch.Size([768, 1])
base_model.model.distilbert.transformer.layer.1.ffn.lin1.ia3_l.default: torch.Size([1, 768])
base_model.model.distilbert.transformer.layer.2.attention.k_lin.ia3_l.default: torch.Size([768, 1])
base_model.model.distilbert.transformer.layer.2.attention.v_lin.ia3_l.default: torch.Size([768, 1])
base_model.model.distilbert.transformer.layer.2.ffn.lin1.ia3_l.default: torch.Size([1, 768])
base_model.model.distilbert.transformer.layer.3.attention.k_lin.ia3_l.default: torch.Size([768, 1])
... and 12 more parameters

Each adapted layer receives its own set of rescaling vectors. The IA3 vectors are named with ia3_l in their parameter names, making them easy to identify.
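
Because only the rescaling vectors (plus any modules listed in modules_to_save) are trainable, the resulting adapter checkpoint is tiny and cheap to store per task. A sketch of saving and re-loading it (the directory name is arbitrary):

# Save only the IA3 vectors and the classifier head marked in modules_to_save
ia3_model.save_pretrained("ia3-sentiment-adapter")

# Later: rebuild the base model and attach the saved adapter
from peft import PeftModel

base = AutoModelForSequenceClassification.from_config(config)
restored = PeftModel.from_pretrained(base, "ia3-sentiment-adapter")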

Limitations and Impact

IA3's extreme parameter efficiency comes with meaningful limitations. The most significant is expressivity: IA3 can only rescale existing activations and cannot learn fundamentally new computations. If your downstream task requires capabilities not present in the pretrained model, IA3 may underperform compared to LoRA, which can add new computational pathways. This limitation manifests most clearly when the downstream task distribution differs substantially from the pretraining data.

Another limitation is the lack of a tunable capacity parameter. LoRA provides the rank hyperparameter $r$ that allows you to trade off between efficiency and expressivity. IA3 has no such tuning option. You get exactly $d$ parameters per rescaling vector, no more, no less. While this simplifies hyperparameter search, it also means you cannot easily scale up IA3's capacity when your task demands it.

The multiplicative nature of IA3 also creates potential training instabilities. If rescaling values grow too large during training, they can cause exploding activations. Conversely, values approaching zero can create vanishing gradient problems. You can manage these issues with careful learning rate selection and gradient clipping, but IA3 typically demands more attention to training stability than LoRA.
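
In practice this usually amounts to a standard guard in the training loop. A one-line sketch, where trainable_ia3_vectors is a hypothetical name for the list of rescaling parameters and the clipping threshold is illustrative:

# After loss.backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(trainable_ia3_vectors, max_norm=1.0)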

Despite these limitations, IA3 has had a significant impact on the PEFT landscape. It demonstrated that the field's assumptions about necessary adaptation capacity were often too conservative. Many tasks that practitioners thought required LoRA or full fine-tuning actually work well with simple rescaling. This insight has informed the development of subsequent methods and encouraged exploration of other minimal-parameter approaches.

IA3 has proven particularly valuable for few-shot learning scenarios. The original paper showed that IA3 outperformed in-context learning (using examples in the prompt) while using orders of magnitude less inference compute. This positioned IA3 as an efficient alternative to few-shot prompting when you need to perform the same task repeatedly.

The method has also influenced thinking about what fine-tuning actually learns. IA3's success suggests that pretrained language models contain most needed capabilities and adaptation primarily reweights existing features rather than learning new ones. This perspective has implications for how we design both pretraining objectives and fine-tuning strategies.

Summary

IA3 offers a simple approach to parameter-efficient fine-tuning: instead of learning weight updates like LoRA, it learns to rescale existing activations. By learning vectors $\ell_K$, $\ell_V$, and $\ell_{ff}$ that multiply keys, values, and feed-forward activations respectively, IA3 adapts pretrained models with dramatically fewer parameters than other PEFT methods.

Key insights from IA3 include:

  • Rescaling as adaptation: Learned element-wise multiplication can effectively adapt pretrained models by amplifying relevant dimensions and suppressing irrelevant ones
  • Extreme efficiency: IA3 requires roughly $6d$ parameters per transformer layer compared to LoRA's $14dr$ for the same components, roughly a $2.3r$-fold reduction (about 19x at rank 8)
  • Multiplicative dynamics: Starting from uniform rescaling (all ones) and learning deviations provides strong inductive biases for few-shot learning scenarios
  • Limited expressivity: IA3's diagonal adaptation cannot learn rotations or projections, only magnitude changes along existing axes

In the next chapter on Prefix Tuning, we explore a different approach: prepending learned embeddings to the input sequence. This contrasts with IA3, which modifies how activations flow through the model. This technique offers yet another perspective on how minimal interventions can effectively adapt large language models.

