Prompt Tuning: Parameter-Efficient Fine-Tuning with Soft Prompts

Michael BrenndoerferDecember 8, 202538 min read

Learn prompt tuning for efficient LLM adaptation. Prepend trainable soft prompts to inputs while keeping models frozen. Scales to match full fine-tuning.

Reading Level

Choose your expertise level to adjust how many terms are explained. Beginners see more tooltips, experts see fewer to maintain reading flow. Hover over underlined terms for instant definitions.

Prompt Tuning

In the previous chapter, we explored prefix tuning, which prepends trainable continuous vectors to the keys and values at every layer of a transformer. While effective, this approach has drawbacks. Soft prompts must be maintained across all attention layers, and the reparameterization MLP used during training adds additional overhead. This naturally raises a key question: Can we achieve similar parameter efficiency with an even simpler approach?

Prompt tuning, introduced by Lester et al. in 2021, answers this question decisively. Instead of modifying every layer of the transformer, prompt tuning prepends trainable "soft prompt" embeddings only at the input layer. The frozen model processes these soft prompts alongside the actual input tokens through all its layers normally. This radical simplification reduces the number of trainable parameters, producing a remarkable finding: as model scale increases, prompt tuning closes the gap with full fine-tuning and eventually matches its performance at sufficient scale.

This chapter examines the prompt tuning formulation, explores how initialization strategies affect performance, investigates the scaling behavior that makes prompt tuning compelling, and analyzes how prompt length influences results. Understanding these dynamics reveals why prompt tuning became a foundational technique in parameter-efficient adaptation.

From Prefix Tuning to Prompt Tuning

As we discussed in the previous chapter, prefix tuning achieves parameter efficiency by prepending trainable key-value pairs to each transformer layer. This design reflects the observation that attention patterns throughout the network could be steered by modifying what the model attends to at every layer. However, this approach has a cost: the number of parameters scales with both the prompt length and the number of layers.

Prompt tuning takes a different approach. Instead of intervening at every layer, the frozen transformer propagates task-specific information from the input embeddings through all subsequent layers. The soft prompt tokens are processed by the same attention mechanisms and feed-forward networks as regular tokens, letting the model's pretrained computations handle task adaptation internally.

This design choice has several implications. First, the parameter count becomes independent of model depth. A 12-layer or 96-layer model requires the same parameters, determined only by prompt length and embedding dimension. Second, the approach is architecturally simpler because there are no modifications to any transformer layer. The soft prompts are simply concatenated with the embedded input and processed normally.

The key insight driving prompt tuning is that large language models have learned such rich internal representations that a small modification at the input level can cascade through the network to produce task-appropriate outputs. The success of this approach depends critically on model scale, as we will see when examining the scaling behavior later in this chapter.

Prompt Tuning Formulation

Soft Prompts

Soft prompts are learnable embedding vectors that act as virtual tokens prepended to the input. Unlike discrete text prompts ("Classify this review:"), soft prompts exist only in the embedding space and have no corresponding vocabulary tokens.

To understand prompt tuning mathematically, let's examine how soft prompts integrate with the standard transformer input pipeline. In a typical transformer model, input text first passes through tokenization, then each token is mapped to a dense vector through the embedding layer. Prompt tuning intervenes at this point by inserting learnable vectors. The frozen model treats these as embeddings of additional tokens.

Let XRn×d\mathbf{X} \in \mathbb{R}^{n \times d} represent the embedded input sequence, where nn is the sequence length and dd is the embedding dimension. This matrix contains one row for each input token, with each row being a dd-dimensional vector that encodes both the token's meaning and its position in the sequence. Prompt tuning introduces a trainable prompt matrix PRp×d\mathbf{P} \in \mathbb{R}^{p \times d}, where pp is the prompt length. Each row of this matrix represents one learnable "virtual token" that exists purely in the continuous embedding space.

We form the augmented input by prepending the soft prompt matrix to the original input sequence:

X=[P;X]R(p+n)×d\mathbf{X}' = [\mathbf{P}; \mathbf{X}] \in \mathbb{R}^{(p+n) \times d}

where:

  • X\mathbf{X}': the augmented input matrix containing both prompt and data
  • P\mathbf{P}: the matrix of learnable soft prompt parameters
  • X\mathbf{X}: the original input embedding matrix
  • [;][\cdot; \cdot]: the concatenation operation along the sequence length dimension

The concatenation operation stacks the prompt matrix on top of the input matrix, creating a new sequence that begins with pp soft prompt tokens followed by the nn actual input tokens. From your transformer's perspective, this augmented sequence looks just like any other embedded input, except that the first pp positions contain learned vectors rather than embeddings looked up from the vocabulary. This is key to prompt tuning's simplicity: no architectural modifications are needed.

This concatenated sequence is then processed by the frozen transformer normally. The frozen model applies its full sequence of attention layers and feed-forward networks without any awareness that the first few tokens are special. If we denote the frozen language model as fθf_\theta, the output is computed as:

Y=fθ([P;X])\mathbf{Y} = f_\theta([\mathbf{P}; \mathbf{X}])

where:

  • Y\mathbf{Y}: the output representations from the transformer
  • fθf_\theta: the frozen transformer function parameterized by weights θ\theta
  • [P;X][\mathbf{P}; \mathbf{X}]: the concatenated sequence of prompt and input embeddings

During training, only P\mathbf{P} receives gradient updates. The entire transformer with parameters θ\theta remains frozen, meaning that backpropagation computes gradients for the prompt parameters but treats the model weights as constants. The training objective is the standard task loss, whether cross-entropy for classification, language modeling loss for generation, or any other task-appropriate objective. The gradients flow backward through the frozen model, ultimately reaching and updating only the soft prompt embeddings.

Parameter Count

Prompt tuning's simplicity is most evident when we examine its parameter count. Since soft prompts exist only at the input level, trainable parameters depend solely on the number of virtual tokens and embedding dimension. The number of trainable parameters in prompt tuning is:

Nparams=p×dN_{\text{params}} = p \times d

where:

  • N_params: the total number of trainable parameters
  • pp: the length of the soft prompt (number of virtual tokens)
  • dd: the embedding dimension of the model

To make this concrete, consider a model with embedding dimension d=1024d = 1024 and prompt length p=100p = 100. This configuration yields exactly 102,400 parameters, approximately 0.01% of a 1 billion parameter model. This is a dramatic reduction compared to full fine-tuning, where all billion parameters receive gradient updates.

The comparison with prefix tuning is equally striking. Prefix tuning requires parameters proportional to p×L×2dp \times L \times 2d, where LL is the number of layers and the factor of 2 accounts for both keys and values at each layer. For a 24-layer model with the same embedding dimension and prompt length, prefix tuning would require 48 times more parameters than prompt tuning. Prompt tuning's parameter count remains constant regardless of model depth, making it more attractive as models grow deeper.

Out[2]:
Visualization
Comparison of trainable parameter counts across transformer depths (12 to 96 layers). Prompt tuning maintains constant parameters (0.1M) regardless of depth, while prefix tuning scales linearly with depth, reaching 19.7M at 96 layers. At 96 layers, this creates a 197x difference in trainable parameters. The chart reveals why input-level modifications are more parameter-efficient: adding trainable parameters to every transformer layer multiplies the cost by the number of layers, whereas input-level prompts remain constant regardless of model depth.
Comparison of trainable parameter counts across transformer depths (12 to 96 layers). Prompt tuning maintains constant parameters (0.1M) regardless of depth, while prefix tuning scales linearly with depth, reaching 19.7M at 96 layers. At 96 layers, this creates a 197x difference in trainable parameters. The chart reveals why input-level modifications are more parameter-efficient: adding trainable parameters to every transformer layer multiplies the cost by the number of layers, whereas input-level prompts remain constant regardless of model depth.

Comparison with Prefix Tuning

The formulation reveals a key difference from prefix tuning that has important implications for how each method steers the model's behavior. In prefix tuning, the soft prefixes directly modify the attention computation at every layer by adding to the keys and values:

Attention(Q,[PK;K],[PV;V])\text{Attention}(\mathbf{Q}, [\mathbf{P}_K; \mathbf{K}], [\mathbf{P}_V; \mathbf{V}])

where:

  • Q\mathbf{Q}: the query matrix derived from the current layer's input
  • PK,PV\mathbf{P}_K, \mathbf{P}_V: the trainable prefix matrices for keys and values
  • K,V\mathbf{K}, \mathbf{V}: the key and value matrices derived from the input
  • [;][\cdot; \cdot]: the concatenation operation extending the keys and values

This direct injection means that every attention head at every layer has access to learned key-value pairs that can guide its computations. The prefixes effectively provide "steering information" directly where the model makes attention decisions. In contrast, soft prompts in prompt tuning only appear as additional "tokens" in the input sequence. They do not directly modify any attention matrices or feed-forward computations. Instead, the model's own attention mechanism decides how much to attend to these soft prompts versus the actual input tokens at each layer. This represents a more indirect form of intervention because prompt tuning relies on the model learning to use soft prompts through its natural attention patterns rather than injecting information directly into the attention computation. The soft prompts must earn their influence by being useful for the task, as determined by the attention weights the model computes.

This architectural difference explains why prompt tuning requires larger model scale to match prefix tuning's performance. Smaller models lack the capacity to propagate task-relevant information from input soft prompts through many layers. Each layer's computations dilute and transform the representations. Smaller models with fewer parameters per layer struggle to preserve task-specific signals across this deep processing. Larger models, by contrast, have richer internal representations and more parameters to capture subtle patterns, allowing them to more effectively leverage input-level modifications by learning attention patterns that extract and preserve relevant information from soft prompts.

Prompt Initialization Strategies

How should you initialize the soft prompt matrix P\mathbf{P}? This practical question directly affects whether your training succeeds. Unlike weight matrices in neural networks, where random initialization with appropriate scaling typically works well, prompt initialization significantly impacts prompt tuning performance, especially for smaller models. The reason is rooted in the embedding space. Transformers are trained to process vocabulary token embeddings, a specific vector distribution. Soft prompts that start far from this distribution may create representations that the model struggles to interpret meaningfully. Several strategies have been explored to address this challenge.

Random Initialization

The simplest approach initializes each prompt embedding from a random distribution:

PijU(a,a)orPijN(0,σ2)\mathbf{P}_{ij} \sim \mathcal{U}(-a, a) \quad \text{or} \quad \mathbf{P}_{ij} \sim \mathcal{N}(0, \sigma^2)

where:

  • Pij\mathbf{P}_{ij}: the element at index (i,j)(i,j) of the prompt matrix
  • U(a,a)\mathcal{U}(-a, a): a uniform distribution bounded by aa
  • N(0,σ2)\mathcal{N}(0, \sigma^2): a normal distribution with mean 00 and variance σ2\sigma^2

The hyperparameters aa or σ\sigma are typically chosen to match the model's existing embedding scale, often determined empirically or set to small values like 0.02. Random initialization provides no task-related inductive bias. The optimization process must discover useful prompt embeddings entirely from the training signal, starting from a point that may lie outside the manifold of meaningful token representations. This approach works reasonably well for large models with sufficient capacity to correct for poor starting points, but it can lead to suboptimal results if you're working with smaller models or when you have limited training data.

Vocabulary Initialization

A more informed strategy initializes prompt embeddings using existing token embeddings from the model's vocabulary. The intuition is straightforward: the frozen model has learned to process vocabulary embeddings effectively during pretraining. By starting in the same region of embedding space, you give the soft prompts a head start. For each prompt position ii, you sample a random token tit_i from the vocabulary and use its embedding:

Pi=Eti\mathbf{P}_i = \mathbf{E}_{t_i}

where:

  • Pi\mathbf{P}_i: the initialization vector for the ii-th prompt token
  • E\mathbf{E}: the model's embedding matrix
  • tit_i: the index of the sampled vocabulary token

This ensures soft prompts start in embedding space the model already processes meaningfully. Since attention mechanisms are designed to work with vectors from this distribution, starting nearby may accelerate optimization. During training, the prompts can then drift away from their initial positions to better serve your task, but they begin in territory the model understands. The particular tokens you choose for initialization matter less than the fact that we are sampling from the correct distribution.

Class-Label Initialization

For classification tasks, you can enhance vocabulary initialization by selecting tokens related to task or class labels. For sentiment classification, you might initialize using the embeddings of tokens like "positive", "negative", "sentiment", "review":

Pi=Eclass_tokeni\mathbf{P}_i = \mathbf{E}_{\text{class\_token}_i}

where:

  • Pi\mathbf{P}_i: the initialization vector for the ii-th prompt token
  • E\mathbf{E}: the pretrained embedding matrix
  • class_tokeni\text{class\_token}_i: the token index corresponding to a class label or task-relevant word

This provides a strong inductive bias, giving the optimizer a starting point that already encodes task-relevant semantics. The training process then refines these embeddings to better capture the specific classification boundary. In effect, you're telling the model to "start by thinking about these concepts" and letting gradient descent figure out exactly how to think about them. For tasks where the class labels have clear lexical representations, this initialization can significantly reduce the number of training steps required to achieve good performance.

Impact of Initialization

The original prompt tuning paper by Lester et al. found that initialization strategy matters most at smaller model scales. For T5-Small (60 million parameters), class-label initialization significantly outperformed random initialization by several percentage points on benchmark tasks. The gap was substantial enough to change conclusions about whether prompt tuning was viable for smaller models. As model scale increased to T5-XXL with 11 billion parameters, the gap closed and all strategies converged to similar final performance. This convergence suggests that larger models are more robust to initialization choices.

This pattern shows that larger models can recover from poor initialization during training, while smaller models with limited capacity benefit from starting closer to a good solution. Smaller models have more challenging optimization landscapes where poor initialization leads to suboptimal local minima. In practice, vocabulary or class-label initialization adds minimal implementation complexity and provides a safety margin against poor convergence, making vocabulary or class-label initialization the recommended default regardless of model scale.

Out[3]:
Visualization
Performance of three soft prompt initialization strategies (random, vocabulary, class-label) across T5 model scales (60M to 11B parameters) on SuperGLUE benchmark tasks. Class-label initialization achieves 70% performance on T5-Small while random initialization reaches only 58%, a 12 percentage point advantage. This gap narrows substantially with scale, disappearing entirely at T5-XXL where all three methods converge near 98%, demonstrating that larger models are robust to initialization choices and can adapt effectively regardless of starting point.
Performance of three soft prompt initialization strategies (random, vocabulary, class-label) across T5 model scales (60M to 11B parameters) on SuperGLUE benchmark tasks. Class-label initialization achieves 70% performance on T5-Small while random initialization reaches only 58%, a 12 percentage point advantage. This gap narrows substantially with scale, disappearing entirely at T5-XXL where all three methods converge near 98%, demonstrating that larger models are robust to initialization choices and can adapt effectively regardless of starting point.

Scaling Behavior

The most compelling finding from the prompt tuning research is its scaling behavior. This sets prompt tuning apart from many parameter-efficient methods that trade performance for efficiency at any scale. As model scale increases, prompt tuning's performance gap with full fine-tuning shrinks until they achieve equivalent performance.

The Scaling Curve

Lester et al. evaluated prompt tuning across the T5 model family on the SuperGLUE benchmark, a challenging suite of natural language understanding tasks. The results showed a clear and consistent trend that held across multiple tasks. At T5-Small (60M parameters), prompt tuning achieves only 65% of full fine-tuning performance, a gap that makes prompt tuning impractical at this scale. This gap motivated research into how to improve prompt tuning's effectiveness. For T5-Base with 220 million parameters, this improved to around 80%, a meaningful improvement but still a notable gap. By T5-Large with 770 million parameters, the gap narrowed further to about 90%, entering territory where prompt tuning became a viable option for many applications. At T5-XL (3B) and T5-XXL (11B), prompt tuning matches full fine-tuning performance, eliminating any efficiency penalty.

Out[4]:
Visualization
Performance convergence between prompt tuning (soft prompts only) and full fine-tuning across T5 model scales on SuperGLUE. At T5-Small (60M parameters), prompt tuning achieves only 65% of full fine-tuning performance, a 35 percentage point gap. This gap narrows progressively with scale, reaching parity at T5-XXL (11B parameters). The pattern reveals that larger models extract maximum utility from input-level modifications, while smaller models benefit more from layer-wise interventions like prefix tuning.
Performance convergence between prompt tuning (soft prompts only) and full fine-tuning across T5 model scales on SuperGLUE. At T5-Small (60M parameters), prompt tuning achieves only 65% of full fine-tuning performance, a 35 percentage point gap. This gap narrows progressively with scale, reaching parity at T5-XXL (11B parameters). The pattern reveals that larger models extract maximum utility from input-level modifications, while smaller models benefit more from layer-wise interventions like prefix tuning.

This scaling behavior can be understood through the lens of model capacity and representational richness. Smaller models have limited representational power distributed across their layers. When only the input embeddings are modified, the model has limited flexibility to adapt its behavior for a new task. The internal computations are constrained by the model's size, and the soft prompt signal may be diluted or lost as it propagates through the network. Larger models, with richer internal representations and more expressive attention patterns, can more flexibly adapt their computations based on the soft prompt context. They have enough parameters at each layer to learn how to extract and use the task-specific information encoded in the soft prompts.

Implications for Practice

This scaling behavior has important implications for when to use prompt tuning. For very large models with billions of parameters, prompt tuning offers comparable performance to full fine-tuning while training orders of magnitude fewer parameters. This makes many adaptation scenarios practical that would be impossible with full fine-tuning due to memory constraints. A 10B model typically requires hundreds of gigabytes of GPU memory for full fine-tuning, but prompt tuning needs only enough for gradients of ~100K parameters.

However, smaller models often show larger performance gaps that may disqualify prompt tuning for accuracy-critical applications. If you're adapting a 100 million parameter model, prompt tuning alone might not achieve acceptable performance for your use case. For these situations, consider prefix tuning (which intervenes at every layer) or LoRA (which modifies weight matrices through low-rank updates). An upcoming chapter compares PEFT methods systematically to guide your choices.

Why Does Scale Help?

Several hypotheses explain why larger models benefit more from prompt tuning. Understanding these helps determine when prompt tuning will - Larger models have more attention heads, allowing different heads to specialize in attending to soft prompts versus input tokens. Some heads learn to extract task instructions from the soft prompts while others focus on processing the input, enabling a division of labor that smaller models cannot achieve.

  • Larger models have more layers, providing more opportunities for the soft prompt information to be integrated and refined. Each layer progressively transforms the representations to make the task-relevant signal clearer.
  • Larger models have learned richer representations during pretraining that can more flexibly recombine based on context. Their embedding spaces capture finer distinctions between concepts, making it easier for small modifications to trigger appropriate behaviors.

Prompt tuning resembles in-context learning (discussed earlier) because it learns an optimal 'soft context' that activates the model's task-solving capabilities. Soft prompts function as compressed instructions that define the task, similar to how few-shot examples work. Larger models, which exhibit stronger in-context learning abilities with discrete prompts, can more effectively leverage this learned continuous context. The properties that make large models effective at following natural language instructions also help them follow learned soft prompt instructions.

Prompt Length Effects

The prompt length pp is the primary hyperparameter in prompt tuning, and understanding its effects is crucial for achieving good results. Longer prompts provide more learnable parameters to encode task-specific information. However, longer prompts also consume more of the model's context window, leaving less room for actual input tokens. Finding the right balance means understanding how prompt length interacts with task complexity, model capacity, and practical constraints.

Performance vs Prompt Length

Empirical studies show that performance improves with prompt length initially, then plateaus or declines. Early tokens provide substantial benefit, while additional tokens show diminishing returns. The optimal length depends on the task and model:

  • Very short prompts (1-5 tokens): Often lack capacity to encode task requirements, resulting in underperformance. A single token may not have enough dimensions to capture the nuances of a classification task.
  • Short prompts (10-20 tokens): Can work well for simple tasks like binary classification where the model needs to know it's handling a sentiment task or similar basic instruction.
  • Medium prompts (20-100 tokens): The sweet spot for most tasks, balancing expressiveness with efficiency. This range provides enough capacity for complex task definitions while maintaining manageable parameter counts. Good performance can be achieved without excessive computational overhead.
  • Long prompts (100+ tokens): Diminishing returns; additional parameters don't translate to better performance and may harm performance through optimization difficulties or by encouraging overreliance on soft prompts instead of input.

Task Complexity Interaction

More complex tasks generally benefit from longer prompts, though the relationship depends on the nature of the task's complexity. For simple sentiment classification with two clear categories, you might use a 20-token prompt to encode the task definition. For complex reasoning tasks or multi-class classification with many categories, you should consider prompts of 50-100 tokens to capture necessary nuances.

The intuition is clear: soft prompts must encode both the task definition (expected outputs) and task-specific patterns (relevant input features). More complex tasks require more "virtual instructions" encoded in the prompt embeddings. A sentiment classifier needs only to encode "map positive language to class 1 and negative language to class 0," while a topic classifier with 20 categories must represent all distinctions in the soft prompt representations.

Context Budget Trade-offs

A practical consideration is the context window trade-off. If a model has a 512 token context window and uses a 100 token prompt, only 412 tokens remain for the actual input. For tasks with long documents, this may force truncation of important content, potentially harming performance more than the longer prompt helps.

This trade-off is more acute for prompt tuning than prefix tuning. Prefix tuning adds key-value pairs to the attention computation rather than the sequence, so they don't consume input token space. Prompt tuning's soft prompts are true virtual tokens that compete with input tokens for the attention window. When designing a prompt tuning solution, you should consider typical input length and ensure the prompt doesn't consume so much context that important input is lost. For applications with highly variable input lengths, this may require prompt lengths that work well across the distribution rather than optimizing for average-length inputs.

Out[5]:
Visualization
Context window allocation in a 512-token model shows the trade-off between soft prompt length and available input space. A 20-token prompt uses 4% of context (leaving 492 tokens for input), while a 200-token prompt uses 39% (leaving only 312 tokens). This trade-off demonstrates a fundamental constraint: designers must balance prompt expressiveness for task encoding against preserving sufficient context for meaningful input processing.
Context window allocation in a 512-token model shows the trade-off between soft prompt length and available input space. A 20-token prompt uses 4% of context (leaving 492 tokens for input), while a 200-token prompt uses 39% (leaving only 312 tokens). This trade-off demonstrates a fundamental constraint: designers must balance prompt expressiveness for task encoding against preserving sufficient context for meaningful input processing.

Code Implementation

Let's implement prompt tuning for a text classification task to see how it works in practice. We'll create a prompt-tunable wrapper around a frozen transformer model.

In[6]:
Code
import torch
import torch.nn as nn  # noqa: F401
import torch.nn.functional as F  # noqa: F401
import numpy as np
from transformers import AutoModel, AutoTokenizer  # noqa: F401

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

Prompt Tuning Module

The core of prompt tuning is a learnable embedding matrix that gets prepended to the input. We create a module that handles the soft prompt generation and concatenation.

In[7]:
Code
class PromptTuningEmbedding(nn.Module):
    """
    Learnable soft prompts for prompt tuning.

    This module creates and manages trainable soft prompt embeddings that are prepended to the input sequence. These embeddings serve as virtual tokens that the frozen model processes normally.
    """

    def __init__(
        self,
        num_prompt_tokens: int,
        embedding_dim: int,
        init_from_vocab: bool = False,
        vocab_embeddings: torch.Tensor = None,
    ):
        super().__init__()
        self.num_prompt_tokens = num_prompt_tokens
        self.embedding_dim = embedding_dim

        # Initialize the soft prompt embeddings
        if init_from_vocab and vocab_embeddings is not None:
            # Sample random tokens from vocabulary for initialization
            vocab_size = vocab_embeddings.size(0)
            random_indices = torch.randint(0, vocab_size, (num_prompt_tokens,))
            init_embeddings = vocab_embeddings[random_indices].detach().clone()
            self.prompt_embeddings = nn.Parameter(init_embeddings)
        else:
            # Random initialization
            self.prompt_embeddings = nn.Parameter(
                torch.randn(num_prompt_tokens, embedding_dim) * 0.02
            )

    def forward(self, batch_size: int):
        """
        Generate soft prompts for a batch.

        Returns:
            Tensor of shape (batch_size, num_prompt_tokens, embedding_dim)
        """
        # Expand prompts for the batch
        return self.prompt_embeddings.unsqueeze(0).expand(batch_size, -1, -1)

Full Prompt Tuning Classifier

Now we'll create a complete classifier that combines the soft prompts with a frozen pretrained model.

In[8]:
Code
class PromptTuningClassifier(nn.Module):
    """
    Text classifier using prompt tuning.

    Prepends learnable soft prompts to input embeddings while keeping the pretrained transformer frozen. Only the soft prompt parameters receive gradient updates during training.
    """

    def __init__(
        self,
        pretrained_model: nn.Module,
        tokenizer,
        num_prompt_tokens: int = 20,
        num_classes: int = 2,
        init_from_vocab: bool = True,
    ):
        super().__init__()

        # Get model configuration
        self.embedding_dim = pretrained_model.config.hidden_size
        self.num_prompt_tokens = num_prompt_tokens

        # Freeze the pretrained model
        self.pretrained_model = pretrained_model
        for param in self.pretrained_model.parameters():
            param.requires_grad = False

        # Get the embedding layer for vocabulary initialization
        vocab_embeddings = None
        if init_from_vocab:
            vocab_embeddings = (
                self.pretrained_model.embeddings.word_embeddings.weight.data
            )

        # Create learnable soft prompts
        self.soft_prompts = PromptTuningEmbedding(
            num_prompt_tokens=num_prompt_tokens,
            embedding_dim=self.embedding_dim,
            init_from_vocab=init_from_vocab,
            vocab_embeddings=vocab_embeddings,
        )

        # Classification head
        self.classifier = nn.Linear(self.embedding_dim, num_classes)

        # Store tokenizer for reference
        self.tokenizer = tokenizer

    def forward(self, input_ids, attention_mask):
        batch_size = input_ids.size(0)
        seq_len = input_ids.size(1)

        # Get input embeddings from the frozen model
        input_embeddings = self.pretrained_model.embeddings.word_embeddings(
            input_ids
        )

        # Get soft prompts for this batch
        prompt_embeddings = self.soft_prompts(batch_size)

        # Concatenate: [soft_prompts; input_embeddings]
        combined_embeddings = torch.cat(
            [prompt_embeddings, input_embeddings], dim=1
        )

        # Extend attention mask for soft prompt tokens (always attend to them)
        prompt_attention = torch.ones(
            batch_size,
            self.num_prompt_tokens,
            device=attention_mask.device,
            dtype=attention_mask.dtype,
        )
        extended_attention_mask = torch.cat(
            [prompt_attention, attention_mask], dim=1
        )

        # Forward through the frozen transformer using inputs_embeds
        outputs = self.pretrained_model(
            inputs_embeds=combined_embeddings,
            attention_mask=extended_attention_mask,
        )

        # Use [CLS] token (or pooled output) for classification
        # Account for the soft prompt offset
        pooled_output = outputs.last_hidden_state[:, 0, :]

        # Classify
        logits = self.classifier(pooled_output)

        return logits

    def count_trainable_parameters(self):
        """Count parameters that require gradients."""
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def count_total_parameters(self):
        """Count all parameters including frozen ones."""
        return sum(p.numel() for p in self.parameters())

Creating the Model

Let's instantiate the model and examine the parameter efficiency.

In[9]:
Code
# Load a pretrained model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModel.from_pretrained(model_name)

# Create prompt tuning classifier
prompt_tuning_model = PromptTuningClassifier(
    pretrained_model=base_model,
    tokenizer=tokenizer,
    num_prompt_tokens=20,  # 20 soft prompt tokens
    num_classes=2,  # Binary classification
    init_from_vocab=True,
)

trainable_params = prompt_tuning_model.count_trainable_parameters()
total_params = prompt_tuning_model.count_total_parameters()
param_percentage = 100 * trainable_params / total_params

# Verify the model is created successfully
print(f"Model created with {trainable_params:,} trainable parameters")
Out[10]:
Console
Trainable parameters: 16,898
Total parameters: 66,379,778
Trainable percentage: 0.0255%

With just 20 soft prompt tokens, the model updates approximately 0.02% of the total parameters. The soft prompt embeddings contribute 20×768=15,36020 \times 768 = 15,360 parameters (mostly from the soft prompts), and the classification head adds 768×2+2=1,538768 \times 2 + 2 = 1,538 parameters.

Training Loop

Let's create a simple training example to demonstrate how prompt tuning works in practice.

In[11]:
Code
# Create synthetic training data for demonstration
texts = [
    "This movie was absolutely wonderful and entertaining!",
    "Terrible film, complete waste of time and money.",
    "I loved every moment of this beautiful story.",
    "Boring and predictable, would not recommend.",
    "A masterpiece of modern cinema, truly inspiring.",
    "Awful acting and terrible script throughout.",
    "Heartwarming and delightful from start to finish.",
    "One of the worst movies I have ever seen.",
] * 10  # Repeat for more training data

labels = [1, 0, 1, 0, 1, 0, 1, 0] * 10  # 1 = positive, 0 = negative

# Tokenize the texts
encodings = tokenizer(
    texts, padding=True, truncation=True, max_length=64, return_tensors="pt"
)

input_ids = encodings["input_ids"]
attention_mask = encodings["attention_mask"]
labels_tensor = torch.tensor(labels)
In[12]:
Code
# Training setup
optimizer = torch.optim.AdamW(
    prompt_tuning_model.parameters(),  # Only trainable params get gradients
    lr=3e-4,
    weight_decay=0.01,
)

num_epochs = 50
batch_size = 16
losses = []

prompt_tuning_model.train()

for epoch in range(num_epochs):
    # Simple full-batch training for demonstration
    for i in range(0, len(texts), batch_size):
        batch_input_ids = input_ids[i : i + batch_size]
        batch_attention_mask = attention_mask[i : i + batch_size]
        batch_labels = labels_tensor[i : i + batch_size]

        optimizer.zero_grad()

        logits = prompt_tuning_model(batch_input_ids, batch_attention_mask)
        loss = F.cross_entropy(logits, batch_labels)

        loss.backward()
        optimizer.step()

    losses.append(loss.item())
Out[13]:
Visualization
Training loss progression for binary sentiment classification using prompt tuning with 20 soft prompt tokens. Loss decreases from 0.7 to below 0.1 by epoch 30 and remains stable through epoch 50, while only 0.02% of total model parameters receive gradient updates. This example shows that task-specific adaptation is possible with minimal trainable parameters when embeddings are properly initialized from vocabulary.
Training loss progression for binary sentiment classification using prompt tuning with 20 soft prompt tokens. Loss decreases from 0.7 to below 0.1 by epoch 30 and remains stable through epoch 50, while only 0.02% of total model parameters receive gradient updates. This example shows that task-specific adaptation is possible with minimal trainable parameters when embeddings are properly initialized from vocabulary.

The training loss decreases smoothly, demonstrating that the model can learn the classification task by updating only the soft prompt embeddings. Despite training just 0.02% of parameters, the model successfully adapts to the sentiment classification task.

Inference and Evaluation

In[14]:
Code
# Evaluation
prompt_tuning_model.eval()

test_texts = [
    "An incredible journey that touched my heart.",
    "Disappointing and forgettable experience.",
    "Best movie I've watched this year!",
    "Poorly executed with no redeeming qualities.",
]

test_encodings = tokenizer(
    test_texts,
    padding=True,
    truncation=True,
    max_length=64,
    return_tensors="pt",
)

with torch.no_grad():
    test_logits = prompt_tuning_model(
        test_encodings["input_ids"], test_encodings["attention_mask"]
    )
    predictions = torch.argmax(test_logits, dim=-1)
Out[15]:
Console
Predictions:
  'An incredible journey that touched my heart....' → Positive
  'Disappointing and forgettable experience....' → Negative
  'Best movie I've watched this year!...' → Negative
  'Poorly executed with no redeeming qualities....' → Negative

The model correctly classifies the test examples, identifying positive and negative sentiments accurately. This demonstrates that the small set of trainable soft prompt parameters is sufficient to steer the frozen model's behavior for this task.

Comparing Initialization Strategies

Let's compare random versus vocabulary initialization to see the effect on early training dynamics.

In[16]:
Code
def train_and_record(model, num_epochs=30):
    """Train model and return loss history."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    history = []

    model.train()
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        logits = model(input_ids[:32], attention_mask[:32])
        loss = F.cross_entropy(logits, labels_tensor[:32])
        loss.backward()
        optimizer.step()
        history.append(loss.item())

    return history


# Random initialization
base_model_rand = AutoModel.from_pretrained(model_name)
model_random = PromptTuningClassifier(
    pretrained_model=base_model_rand,
    tokenizer=tokenizer,
    num_prompt_tokens=20,
    num_classes=2,
    init_from_vocab=False,  # Random initialization
)

# Vocabulary initialization
base_model_vocab = AutoModel.from_pretrained(model_name)
model_vocab = PromptTuningClassifier(
    pretrained_model=base_model_vocab,
    tokenizer=tokenizer,
    num_prompt_tokens=20,
    num_classes=2,
    init_from_vocab=True,  # Vocabulary initialization
)

# Train both
torch.manual_seed(42)
history_random = train_and_record(model_random)
torch.manual_seed(42)
history_vocab = train_and_record(model_vocab)
Out[17]:
Visualization
Initialization strategy comparison for 20-token soft prompts on sentiment classification over 30 epochs. Vocabulary initialization (dashed line) converges faster, reaching loss = 0.12 by epoch 15, while random initialization requires approximately 20 epochs to match this performance. Both achieve similar final loss (approximately 0.10), demonstrating vocabulary initialization's advantage in early training.
Initialization strategy comparison for 20-token soft prompts on sentiment classification over 30 epochs. Vocabulary initialization (dashed line) converges faster, reaching loss = 0.12 by epoch 15, while random initialization requires approximately 20 epochs to match this performance. Both achieve similar final loss (approximately 0.10), demonstrating vocabulary initialization's advantage in early training.

The vocabulary-initialized model often shows slightly faster initial convergence, though both approaches eventually reach similar performance. This difference becomes more pronounced with smaller models or limited training data.

Examining Learned Prompts

One interesting property of prompt tuning is that we can analyze the learned soft prompts to understand what the model has learned.

In[18]:
Code
def find_nearest_tokens(prompt_embedding, vocab_embeddings, tokenizer, top_k=5):
    """Find vocabulary tokens nearest to a soft prompt embedding."""
    # Compute cosine similarity
    prompt_norm = prompt_embedding / prompt_embedding.norm()
    vocab_norm = vocab_embeddings / vocab_embeddings.norm(dim=1, keepdim=True)
    similarities = torch.matmul(vocab_norm, prompt_norm)

    # Get top-k
    top_values, top_indices = torch.topk(similarities, top_k)

    tokens = [tokenizer.decode([idx.item()]) for idx in top_indices]
    scores = top_values.tolist()

    return list(zip(tokens, scores))


# Get the learned soft prompts
learned_prompts = prompt_tuning_model.soft_prompts.prompt_embeddings.data
vocab_embeds = base_model.embeddings.word_embeddings.weight.data

nearest_tokens_list = []
for i in range(min(5, learned_prompts.size(0))):  # First 5 prompts
    tokens_scores = find_nearest_tokens(
        learned_prompts[i], vocab_embeds, tokenizer, top_k=3
    )
    nearest_tokens_list.append((i, tokens_scores))
Out[19]:
Console
Nearest vocabulary tokens to learned soft prompts:

Prompt token 0: [('glance', 0.9519124627113342), ('glances', 0.7681980729103088), ('glancing', 0.7599889039993286)]
Prompt token 1: [('transition', 0.957211971282959), ('transitions', 0.7650735974311829), ('transitioned', 0.7035655975341797)]
Prompt token 2: [('##dit', 0.9801098704338074), ('344', 0.6925738453865051), ('288', 0.6913914084434509)]
Prompt token 3: [('utilized', 0.9606842398643494), ('utilize', 0.8766920566558838), ('utilizing', 0.8532299399375916)]
Prompt token 4: [('良', 0.9699046611785889), ('保', 0.8295561075210571), ('昭', 0.827731192111969)]

The nearest vocabulary tokens to each soft prompt provide insight into what semantic concepts the model has encoded. For a sentiment classification task, we might expect to see tokens related to sentiment, evaluation, or opinion.

Prompt Length Analysis

Let's empirically examine how prompt length affects model performance by training models with different prompt lengths.

In[21]:
Code
prompt_lengths = [1, 5, 10, 20, 50, 100]
final_losses = []
convergence_speeds = []  # Epochs to reach loss < 0.3

for p_len in prompt_lengths:
    # Create model with this prompt length
    base_model_temp = AutoModel.from_pretrained(model_name)
    model_temp = PromptTuningClassifier(
        pretrained_model=base_model_temp,
        tokenizer=tokenizer,
        num_prompt_tokens=p_len,
        num_classes=2,
        init_from_vocab=True,
    )

    # Train
    torch.manual_seed(42)
    optimizer = torch.optim.AdamW(model_temp.parameters(), lr=3e-4)

    model_temp.train()
    convergence_epoch = None

    for epoch in range(50):
        optimizer.zero_grad()
        logits = model_temp(input_ids[:32], attention_mask[:32])
        loss = F.cross_entropy(logits, labels_tensor[:32])
        loss.backward()
        optimizer.step()

        if convergence_epoch is None and loss.item() < 0.3:
            convergence_epoch = epoch

    final_losses.append(loss.item())
    convergence_speeds.append(convergence_epoch if convergence_epoch else 50)

plt.rcParams.update(
    {
        "figure.figsize": (3.0, 2.5),
        "figure.dpi": 300,
        "figure.constrained_layout.use": True,
        "font.size": 10,
        "axes.titlesize": 11,
        "axes.labelsize": 10,
        "xtick.labelsize": 9,
        "ytick.labelsize": 9,
        "legend.fontsize": 9,
    }
)

plt.figure()
colors1 = ["#e74c3c" if l > 0.1 else "#2ecc71" for l in final_losses]
plt.bar(
    [str(p) for p in prompt_lengths],
    final_losses,
    color=colors1,
    edgecolor="black",
    linewidth=0.5,
)
plt.xlabel("Prompt Length")
plt.ylabel("Final Loss")
plt.grid(True, alpha=0.3, axis="y")
plt.show()

plt.figure()
colors2 = ["#e74c3c" if s > 30 else "#2ecc71" for s in convergence_speeds]
plt.bar(
    [str(p) for p in prompt_lengths],
    convergence_speeds,
    color=colors2,
    edgecolor="black",
    linewidth=0.5,
)
plt.xlabel("Prompt Length")
plt.ylabel("Epochs to Loss < 0.3")
plt.grid(True, alpha=0.3, axis="y")
plt.show()

The results demonstrate the trade-off with prompt length. Very short prompts may lack the capacity to encode task requirements, while very long prompts provide diminishing returns and may even slightly hurt performance due to optimization challenges. The sweet spot typically lies in the 10-50 token range for most classification tasks.

Parameter Efficiency Across Prompt Lengths

In[22]:
Code
embedding_dim = 768
classifier_params = 768 * 2 + 2
# Use the base model loaded earlier to get exact count
base_model_params = sum(p.numel() for p in base_model.parameters())

efficiency_results = []
for p_len in prompt_lengths:
    prompt_params = p_len * embedding_dim
    total_trainable = prompt_params + classifier_params
    efficiency = (total_trainable / base_model_params) * 100
    efficiency_results.append((p_len, total_trainable, efficiency))
Out[23]:
Console
Parameter count by prompt length:

    1 tokens: 2,306 params (0.0035% of model)
    5 tokens: 5,378 params (0.0081% of model)
   10 tokens: 9,218 params (0.0139% of model)
   20 tokens: 16,898 params (0.0255% of model)
   50 tokens: 39,938 params (0.0602% of model)
  100 tokens: 78,338 params (0.1180% of model)
Out[24]:
Visualization
Trainable parameter count by soft prompt length for DistilBERT (268M total parameters). A 1-token prompt requires 768 trainable parameters (0.003% of model), while a 100-token prompt requires 76,800 trainable parameters (0.119% of model). The linear scaling relationship allows hundreds of task-specific soft prompts to fit in storage that would hold only one full model copy, making multi-task adaptation practical at massive scale compared to full fine-tuning.
Trainable parameter count by soft prompt length for DistilBERT (268M total parameters). A 1-token prompt requires 768 trainable parameters (0.003% of model), while a 100-token prompt requires 76,800 trainable parameters (0.119% of model). The linear scaling relationship allows hundreds of task-specific soft prompts to fit in storage that would hold only one full model copy, making multi-task adaptation practical at massive scale compared to full fine-tuning.

Even with 100 prompt tokens, we're training less than 0.12% of the model's parameters. This extreme parameter efficiency is what makes prompt tuning attractive for scenarios with limited compute or when maintaining many task-specific adaptations.

Key Parameters

Key parameters:

  • num_prompt_tokens: Number of soft prompt tokens prepended to the input
  • init_from_vocab: Whether to initialize prompts from the model's vocabulary
  • embedding_dim: Dimension of the prompt embeddings (must match model)

Limitations and Impact

Prompt tuning represented a significant advance in parameter-efficient fine-tuning, demonstrating that remarkably few parameters could adapt large models to new tasks. However, understanding its limitations is crucial for effective use. The most significant limitation is the scale dependence discussed earlier. While prompt tuning matches full fine-tuning at 10B+ parameter scales, smaller models exhibit a meaningful performance gap. This creates a practical challenge. Large models benefit most from prompt tuning, while smaller models may benefit more from alternatives such as LoRA or prefix tuning that directly modify model computation. This scale requirement partially offsets the democratization benefits of parameter-efficient methods. A second limitation involves the interpretability of learned prompts. Unlike discrete prompts, which are human-readable, soft prompts exist only in embedding space. Finding nearest vocabulary neighbors provides only rough insight into the learned representations. For applications requiring transparency or auditing of the adaptation mechanism, soft prompts present challenges that discrete prompting avoids.

The context window trade-off deserves consideration. Every soft prompt token consumes one position in the model's context window, which can be significant for tasks involving long documents or conversations. A 100-token soft prompt in a 512-token context model leaves only 412 tokens for actual input, potentially forcing truncation of important content. Prefix tuning, while requiring more parameters, does not consume input tokens' context space in the same way.

Despite these limitations, prompt tuning's impact on the field has been substantial. It demonstrated the power of scale for parameter-efficient methods, showing that simpler approaches can match complex ones given sufficient model capacity. This finding influenced subsequent research into understanding why large models are so adaptable and how to leverage this adaptability efficiently. The simplicity of prompt tuning also made it accessible: unlike methods requiring modifications to transformer architectures, prompt tuning works with any model that accepts embedding inputs.

Prompt tuning also became an important benchmark for comparing PEFT methods. The scaling curves from the original paper became a reference point, with subsequent methods often demonstrating their value by showing better performance than prompt tuning at smaller scales while maintaining efficiency at larger scales.

Summary

Prompt tuning simplifies parameter-efficient fine-tuning by prepending learnable soft prompt embeddings to the input. Unlike prefix tuning, which modifies attention computations at every layer, prompt tuning operates only at the input level, trusting the frozen model to propagate task-relevant information through its layers.

Key concepts:

  • Formulation: Soft prompts PRp×d\mathbf{P} \in \mathbb{R}^{p \times d} are concatenated with input embeddings and processed by the frozen model normally, with only P\mathbf{P} receiving gradient updates during training.

  • Initialization strategies: Vocabulary initialization uses existing token embeddings as starting points and generally outperforms random initialization, especially for smaller models. Class-label initialization provides even stronger inductive bias for classification tasks.

  • Scaling behavior: The performance gap between prompt tuning and full fine-tuning shrinks as model scale increases. At 10B+ parameters, prompt tuning matches full fine-tuning, making it increasingly attractive for very large models.

  • Prompt length effects: Moderate prompt lengths (10-100 tokens) typically work best. Very short prompts lack capacity while very long prompts provide diminishing returns and consume context window space.

The parameter efficiency of prompt tuning is remarkable: adapting a 10B parameter model might require training only 0.01% of parameters. This enables you to maintain hundreds of task-specific adaptations for a single base model, dramatically reducing storage and deployment costs.

The next chapter explores adapter layers, a different approach that inserts small trainable modules between transformer layers, providing an alternative intervention point for efficient adaptation. Subsequent chapters compare all PEFT methods systematically to help you choose when to use each approach.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about prompt tuning and its parameter-efficient approach to model adaptation.

Loading component...

Reference

BIBTEXAcademic
@misc{prompttuningparameterefficientfinetuningwithsoftprompts, author = {Michael Brenndoerfer}, title = {Prompt Tuning: Parameter-Efficient Fine-Tuning with Soft Prompts}, year = {2025}, url = {https://mbrenndoerfer.com/writing/prompt-tuning-parameter-efficient-soft-prompts-llm}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-01-01} }
APAAcademic
Michael Brenndoerfer (2025). Prompt Tuning: Parameter-Efficient Fine-Tuning with Soft Prompts. Retrieved from https://mbrenndoerfer.com/writing/prompt-tuning-parameter-efficient-soft-prompts-llm
MLAAcademic
Michael Brenndoerfer. "Prompt Tuning: Parameter-Efficient Fine-Tuning with Soft Prompts." 2026. Web. today. <https://mbrenndoerfer.com/writing/prompt-tuning-parameter-efficient-soft-prompts-llm>.
CHICAGOAcademic
Michael Brenndoerfer. "Prompt Tuning: Parameter-Efficient Fine-Tuning with Soft Prompts." Accessed today. https://mbrenndoerfer.com/writing/prompt-tuning-parameter-efficient-soft-prompts-llm.
HARVARDAcademic
Michael Brenndoerfer (2025) 'Prompt Tuning: Parameter-Efficient Fine-Tuning with Soft Prompts'. Available at: https://mbrenndoerfer.com/writing/prompt-tuning-parameter-efficient-soft-prompts-llm (Accessed: today).
SimpleBasic
Michael Brenndoerfer (2025). Prompt Tuning: Parameter-Efficient Fine-Tuning with Soft Prompts. https://mbrenndoerfer.com/writing/prompt-tuning-parameter-efficient-soft-prompts-llm