Decoding Temperature: Controlling Randomness in Language Model Generation

Michael Brenndoerfer · Updated July 27, 2025 · 33 min read

Learn how temperature scaling reshapes probability distributions during text generation, with mathematical foundations, implementation details, and practical guidelines for selecting optimal temperature values.


Decoding Temperature

When a language model predicts the next token, it outputs a probability distribution over its entire vocabulary. Given the prompt "The capital of France is," the model might assign 0.85 to "Paris," 0.05 to "Lyon," 0.02 to "Marseille," and tiny probabilities to thousands of other tokens. Temperature is the parameter that controls how we interpret and sample from this distribution.

At its core, temperature answers a fundamental question: how much should we trust the model's probability rankings? A temperature of 1.0 preserves the learned distribution exactly. Lower temperatures sharpen the distribution, making high-probability tokens even more likely and pushing the model toward deterministic, predictable output. Higher temperatures flatten the distribution, giving lower-probability tokens a fighting chance and introducing creative variability.

This chapter explores temperature from first principles. You'll learn the mathematical mechanics of temperature scaling, visualize how it reshapes probability distributions, implement temperature-controlled sampling, and develop intuition for selecting appropriate values across different generation tasks.

How Temperature Scaling Works

To understand temperature, we need to start with what language models actually output. When you feed a prompt into a model, the final layer doesn't produce probabilities directly. Instead, it outputs a vector of logits: raw, unbounded scores for every token in the vocabulary. A logit of 5.0 doesn't mean "5% probability." It's just a score indicating relative preference. The token with the highest logit is the model's top choice, but we need a way to convert these scores into actual probabilities we can sample from.

This is where the softmax function enters the picture. Softmax takes a vector of arbitrary real numbers and transforms them into a valid probability distribution: all values become positive and sum to 1. For a vocabulary of $V$ tokens with logits $z_1, z_2, \ldots, z_V$, the standard softmax computes:

$$P(i) = \frac{\exp(z_i)}{\sum_{j=1}^{V} \exp(z_j)}$$

where:

  • $z_i$: the logit for token $i$, which can be any real number
  • $\exp(z_i)$: the exponential function applied to the logit, ensuring the result is positive
  • $\sum_{j=1}^{V} \exp(z_j)$: the sum of all exponentiated logits, serving as a normalizing constant

The exponential function is crucial here. It preserves the ordering of logits (higher logits still get higher probabilities) while ensuring all outputs are positive. The denominator normalizes everything to sum to 1.
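
As a quick sanity check, here is a minimal sketch (with made-up logits) showing that computing softmax by hand matches PyTorch's built-in:

import torch
import torch.nn.functional as F

logits = torch.tensor([3.2, 1.0, -0.5])  # arbitrary example scores

# Manual softmax: exponentiate, then normalize
manual = torch.exp(logits) / torch.exp(logits).sum()
print(manual)                    # all positive, sums to 1
print(F.softmax(logits, dim=0))  # identical result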

Out[3]:
Figure: Raw logits from the model's output layer for ten weather-related tokens (higher values indicate stronger preference), alongside the corresponding probabilities after the softmax transformation, which amplifies differences between the logits.

But here's the problem: the standard softmax gives us exactly one distribution. What if we want more control? What if the model's top choice is good but we want to explore alternatives? Or conversely, what if we want to make the model more decisive, committing more strongly to its best guess?

Introducing the Temperature Parameter

Temperature gives us that control. The idea is elegantly simple: before applying softmax, we divide all logits by a temperature parameter $T$. This single modification lets us reshape the entire probability distribution.

Temperature Scaling

Temperature scaling divides the logits by a temperature parameter $T$ before computing the softmax. Given logit $z_i$ for vocabulary token $i$, the temperature-scaled probability is:

$$P(i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where:

  • $z_i$: the logit (raw score) for token $i$ output by the model's final layer
  • $T$: the temperature parameter, a positive scalar that controls distribution sharpness
  • $\exp(z_i / T)$: the exponential of the scaled logit, which ensures positive values
  • $\sum_j \exp(z_j / T)$: the sum over all tokens in the vocabulary, serving as a normalizing constant

When $T = 1$, this reduces to the standard softmax. When $T < 1$, dividing by a fraction amplifies the logit differences. When $T > 1$, dividing by a larger number compresses the differences.

The name "temperature" comes from statistical mechanics, where a similar parameter controls the randomness of particle states in a physical system. At low temperature, particles settle into their lowest-energy states (highly ordered). At high temperature, particles explore many states more freely (more disordered). The analogy carries over perfectly to language models: low temperature means the model commits to its top choices; high temperature means it explores more freely across the vocabulary.

A Concrete Example: Three Candidate Tokens

Let's make this concrete. Suppose a language model is completing the prompt "The capital of France is" and has narrowed down to three plausible tokens: "Paris," "Lyon," and "Marseille." The model outputs logits of 2.0, 1.0, and 0.5 respectively. Paris has the highest logit, so it's the model's top choice, but how do the probabilities change as we vary temperature?

In[4]:
Code
import torch
import torch.nn.functional as F

# Example logits for three tokens
logits = torch.tensor([2.0, 1.0, 0.5])
tokens = ["Paris", "Lyon", "Marseille"]


def temperature_softmax(
    logits: torch.Tensor, temperature: float
) -> torch.Tensor:
    """Apply temperature scaling and compute softmax."""
    scaled_logits = logits / temperature
    return F.softmax(scaled_logits, dim=0)
Out[5]:
Console
Token probabilities at different temperatures:

Token        T=0.5   T=1.0   T=2.0 
---------------------------------------------
Paris        0.844  0.629  0.481
Lyon         0.114  0.231  0.292
Marseille    0.042  0.140  0.227

The results reveal temperature's effect clearly. At $T = 0.5$, "Paris" dominates with 84.4% probability, leaving only scraps for the alternatives. The model is highly confident, almost deterministic. At $T = 1.0$ (standard softmax), we get the baseline distribution: Paris at 63%, Lyon at 23%, Marseille at 14%. The model still prefers Paris but gives meaningful weight to alternatives. At $T = 2.0$, the distribution flattens dramatically: Paris drops to 48%, and even Marseille climbs to 23%. The model is now much more willing to sample less-preferred tokens.

The Mathematics of Sharpening and Flattening

Why does this simple division produce such dramatic effects? The key insight comes from examining how temperature affects the probability ratio between any two tokens. Understanding this ratio reveals the core mechanism.

Consider two tokens with logits $z_1$ and $z_2$, where $z_1 > z_2$ (token 1 is preferred). Under temperature scaling, their probabilities are:

$$P(1) = \frac{\exp(z_1/T)}{\sum_j \exp(z_j/T)}, \quad P(2) = \frac{\exp(z_2/T)}{\sum_j \exp(z_j/T)}$$

When we compute the ratio $P(1)/P(2)$, something beautiful happens: the normalization constants cancel out completely:

$$\frac{P(1)}{P(2)} = \frac{\exp(z_1/T)}{\exp(z_2/T)} = \exp\left(\frac{z_1 - z_2}{T}\right)$$

where:

  • $z_1 - z_2$: the logit gap between the two tokens, representing how much more the model prefers token 1
  • $T$: temperature, which scales how strongly this preference translates to probability
  • $\exp(\cdot)$: the exponential function, which converts the scaled difference to a multiplicative ratio

This formula is the key to understanding temperature. The effective logit gap becomes $(z_1 - z_2)/T$. Temperature acts as a divisor on the gap itself, not on the probabilities directly.

When $T < 1$, we divide the gap by a fraction, which amplifies it. A logit difference of 1.0 at $T = 0.5$ becomes an effective difference of $1.0 / 0.5 = 2.0$. The exponential of 2.0 is about 7.4, so the preferred token becomes 7.4 times more likely than its competitor. The distribution sharpens.

When $T > 1$, we divide the gap by a number greater than 1, which compresses it. The same logit difference of 1.0 at $T = 2.0$ becomes an effective difference of $1.0 / 2.0 = 0.5$. The exponential of 0.5 is about 1.65, so the preferred token is only 1.65 times more likely. The distribution flattens.

Let's verify this with actual calculations:

In[6]:
Code
import numpy as np


def probability_ratio(z1: float, z2: float, temperature: float) -> float:
    """Compute probability ratio between two tokens given their logits."""
    return np.exp((z1 - z2) / temperature)


# Logit gap of 1.0 between tokens
z1, z2 = 2.0, 1.0
Out[7]:
Console
Probability ratio (P(token1) / P(token2)) at different temperatures:

T = 0.25: ratio =    54.60
T = 0.5 : ratio =     7.39
T = 1.0 : ratio =     2.72
T = 2.0 : ratio =     1.65
T = 4.0 : ratio =     1.28
Out[8]:
Figure: Probability ratio between two tokens as temperature varies. For a fixed logit gap of 1.0, low temperatures amplify the gap exponentially, while high temperatures compress it toward 1:1 (equal probability).

The numbers tell the story. At $T = 0.25$, the higher-probability token is 55 times more likely than its competitor, an overwhelming advantage. At $T = 1.0$ (baseline), the ratio is 2.72 (which is just $e^1$). At $T = 4.0$, that ratio shrinks to just 1.28, meaning the two tokens are nearly equally likely despite the original logit gap. Temperature truly acts as a dial between certainty and randomness.

Temperature Extremes: The Limiting Cases

To build complete intuition, consider what happens at the mathematical extremes:

As $T \to 0$ (approaching zero):

The effective logit gap $(z_1 - z_2)/T$ grows without bound for any non-zero gap, so the probability ratio between the top token and any other token diverges to infinity. In practice, this means all probability mass concentrates on the single highest-logit token. This limiting case is equivalent to greedy decoding or argmax selection: always pick the most likely token, with zero randomness.

As $T \to \infty$ (approaching infinity):

The effective logit gap $(z_1 - z_2)/T$ shrinks toward zero for any finite gap. The exponential of zero is 1, so the probability ratio between any two tokens approaches 1:1. All tokens become equally likely regardless of their original logits. The distribution approaches uniform, and generation becomes pure random sampling from the vocabulary.

Neither extreme is useful for text generation. $T \approx 0$ produces repetitive, predictable text that lacks nuance. The model keeps selecting the same high-probability continuations, often getting stuck in loops. $T \approx \infty$ produces incoherent gibberish since token selection ignores the model's learned preferences entirely. Practical values fall between these extremes, typically in the range 0.1 to 2.0.
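
A tiny experiment makes the limits concrete. The sketch below uses the same three illustrative logits as above, with extreme temperature values chosen purely for demonstration:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])  # Paris, Lyon, Marseille

for t in [0.01, 1.0, 100.0]:
    probs = F.softmax(logits / t, dim=0)
    print(f"T={t:>6}: {[round(p, 3) for p in probs.tolist()]}")

# T=  0.01 -> essentially all mass on the argmax (greedy behavior)
# T= 100.0 -> close to uniform (about 1/3 each)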

Out[9]:
Figure: The same three-token distribution at three temperatures. At T=0.1, nearly all probability concentrates on the top token (greedy behavior); at T=1.0 (standard softmax), the model's learned preferences are preserved; at T=10.0, the distribution approaches uniformity across all tokens.

The visualization makes the extremes visceral. At $T = 0.1$, Paris captures virtually all of the probability mass, leaving essentially nothing for alternatives. At $T = 10.0$, the three tokens are nearly equally likely, approaching the uniform distribution of 33.3% each. The standard softmax at $T = 1.0$ sits in between, reflecting the model's learned preferences while still allowing meaningful sampling diversity.

Visualizing Temperature Effects

To develop intuition for temperature, let's visualize how it reshapes a realistic probability distribution. We'll simulate logits from a language model predicting the next token after "The weather today is" and examine distributions across a range of temperatures.

In[10]:
Code
# Simulate logits for a vocabulary of plausible next tokens
vocab = [
    "sunny",
    "cloudy",
    "rainy",
    "nice",
    "terrible",
    "cold",
    "warm",
    "hot",
    "perfect",
    "unpredictable",
]

# Logits reflecting different likelihoods
logits_weather = torch.tensor(
    [
        3.2,  # sunny - most likely
        2.8,  # cloudy
        2.1,  # rainy
        1.9,  # nice
        0.5,  # terrible
        1.5,  # cold
        1.8,  # warm
        1.2,  # hot
        2.4,  # perfect
        0.3,  # unpredictable
    ]
)
Out[11]:
Figure: The weather-token distribution at four temperatures. At T=0.3, "sunny" dominates; at T=0.7, the distribution begins to spread across alternatives; at T=1.0 (standard softmax), the learned distribution is preserved; at T=2.0, multiple completions have meaningful probability.

At $T = 0.3$, "sunny" captures over 60% of the probability, leaving little room for alternatives. This produces predictable text but misses the natural variation in language. At $T = 2.0$, the top five tokens each hold roughly 10% to 19% probability. Sampling here introduces meaningful diversity while still favoring contextually appropriate completions.

A heatmap reveals the continuous transformation more clearly. Each row shows one token's probability as temperature varies from left (low T, sharp distribution) to right (high T, flat distribution):
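
The matrix behind such a heatmap is straightforward to build. Here is a minimal sketch reusing logits_weather and the temperature_softmax helper from above; the temperature range is an arbitrary choice for illustration:

import torch

# Columns: temperatures from sharp to flat; rows: the ten weather tokens
temperatures = torch.linspace(0.2, 3.0, steps=15)
prob_matrix = torch.stack(
    [temperature_softmax(logits_weather, t.item()) for t in temperatures],
    dim=1,
)
print(prob_matrix.shape)  # torch.Size([10, 15])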

Out[12]:
Figure: Heatmap showing how each token's probability changes continuously with temperature. At low temperature (left), probability concentrates on high-logit tokens; as temperature increases (right), the distribution flattens toward uniformity.

The heatmap shows the redistribution of probability mass. At low temperature, the top token ("sunny") absorbs most of the probability. As temperature increases, probability spreads more evenly across all tokens. The transition isn't abrupt but smooth and continuous.

Entropy and Temperature

Entropy quantifies the uncertainty or "spread" of a probability distribution. For a discrete distribution over $n$ tokens, Shannon entropy is defined as:

$$H(P) = -\sum_{i=1}^{n} P(i) \log_2 P(i)$$

where:

  • $H(P)$: the entropy of distribution $P$, measured in bits
  • $P(i)$: the probability of token $i$
  • $\log_2$: logarithm base 2, so entropy is measured in bits
  • The negative sign ensures entropy is positive (since the $\log$ of a probability is negative)

Entropy has intuitive extremes. When one token has probability 1.0 and all others have 0, entropy equals 0 bits, meaning there's no uncertainty. When all $n$ tokens are equally likely with probability $1/n$, entropy reaches its maximum of $\log_2(n)$ bits.

Temperature directly controls entropy. Low temperature concentrates probability mass on high-logit tokens, reducing entropy. High temperature spreads probability more evenly, increasing entropy toward the maximum.

In[13]:
Code
def compute_entropy(probs: torch.Tensor) -> float:
    """Compute Shannon entropy of a probability distribution."""
    # Filter out zero probabilities to avoid log(0)
    probs = probs[probs > 0]
    return -(probs * torch.log2(probs)).sum().item()
Out[14]:
Console
Temperature vs Entropy:

Temperature  Entropy (bits) 
---------------------------
0.3          0.693          
0.5          1.733          
1.0          2.877          
1.5          3.127          
2.0          3.214          
3.0          3.262          

Maximum possible entropy (uniform): 3.322 bits

With 10 tokens in our vocabulary, maximum entropy (achieved when the distribution is uniform) is $\log_2(10) \approx 3.32$ bits. At $T = 0.3$, entropy falls below 1 bit, indicating the distribution is concentrated on just a few tokens. At $T = 3.0$, entropy exceeds 3.2 bits, approaching the uniform distribution's maximum uncertainty.
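
The entropy extremes discussed earlier are easy to verify with this helper. A quick sketch using a toy four-token distribution chosen purely for illustration:

one_hot = torch.tensor([1.0, 0.0, 0.0, 0.0])  # all probability on one token
uniform = torch.full((4,), 0.25)              # four equally likely tokens

print(compute_entropy(one_hot))  # 0.0 bits: no uncertainty (may print as -0.0)
print(compute_entropy(uniform))  # 2.0 bits = log2(4): maximum uncertainty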

Out[15]:
Figure: Entropy increases monotonically with temperature, approaching the maximum entropy of the uniform distribution (dashed line). Practical temperatures balance between focused (low entropy) and exploratory (high entropy) generation.

Implementing Temperature-Controlled Sampling

Let's build a complete temperature sampling implementation. We'll start with a function that samples from temperature-scaled logits, then extend it to generate token sequences.

In[16]:
Code
def sample_with_temperature(
    logits: torch.Tensor, temperature: float = 1.0
) -> int:
    """Sample a token index from logits with temperature scaling.

    Args:
        logits: Raw model output scores, shape (vocab_size,)
        temperature: Scaling parameter. 1.0 = unchanged, <1 = sharper, >1 = flatter

    Returns:
        Sampled token index
    """
    if temperature <= 0:
        # Greedy selection for T <= 0
        return logits.argmax().item()

    # Apply temperature scaling
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=0)

    # Sample from the distribution
    return torch.multinomial(probs, num_samples=1).item()

The function handles the edge case of $T \leq 0$ by returning the argmax (greedy decoding). For positive temperatures, it scales logits, converts to probabilities, and samples using PyTorch's multinomial function.

In[17]:
Code
# Demonstrate sampling behavior at different temperatures
def sample_distribution(
    logits: torch.Tensor, temp: float, n_samples: int = 1000
) -> dict:
    """Sample many times and return frequency distribution."""
    counts = {}
    for _ in range(n_samples):
        idx = sample_with_temperature(logits, temp)
        counts[idx] = counts.get(idx, 0) + 1
    return {k: v / n_samples for k, v in counts.items()}
Out[18]:
Console
Empirical sampling frequencies (1000 samples each):

Token          T=0.5      T=1.0      T=2.0     
--------------------------------------------
sunny          0.518      0.290      0.179     
cloudy         0.209      0.184      0.163     
rainy          0.056      0.092      0.119     
nice           0.045      0.092      0.089     
terrible       0.004      0.018      0.046     
cold           0.016      0.059      0.078     
warm           0.033      0.085      0.092     
hot            0.014      0.041      0.068     
perfect        0.101      0.127      0.129     
unpredictable  0.004      0.012      0.037     

The empirical frequencies closely match the theoretical probabilities computed earlier. At $T = 0.5$, samples cluster heavily on "sunny" and "cloudy." At $T = 2.0$, we see meaningful representation from lower-probability tokens like "warm," "nice," and "cold."

Batch Sampling for Efficiency

In practice, we often want to generate multiple completions or compare outputs across temperatures. Here's a vectorized implementation:

In[19]:
Code
def batch_sample_with_temperature(
    logits: torch.Tensor, temperature: float = 1.0, num_samples: int = 1
) -> torch.Tensor:
    """Sample multiple tokens efficiently.

    Args:
        logits: Raw scores, shape (vocab_size,) or (batch, vocab_size)
        temperature: Scaling parameter
        num_samples: Number of samples to draw

    Returns:
        Tensor of sampled indices
    """
    if temperature <= 0:
        if logits.dim() == 1:
            return logits.argmax().unsqueeze(0).expand(num_samples)
        return logits.argmax(dim=-1).unsqueeze(-1).expand(-1, num_samples)

    scaled = logits / temperature
    probs = F.softmax(scaled, dim=-1)

    # torch.multinomial handles both (vocab_size,) and (batch, vocab_size) inputs
    return torch.multinomial(probs, num_samples=num_samples, replacement=True)
Out[20]:
Console
5 samples at each temperature:

T = 0.3: rainy, sunny, sunny, sunny, rainy
T = 1.0: sunny, warm, cloudy, hot, perfect
T = 2.0: perfect, cloudy, rainy, sunny, warm

At low temperature, we see repeated "sunny" selections. Higher temperatures introduce variety, sometimes surfacing less expected but still contextually reasonable tokens.

Temperature Selection Guidelines

Choosing the right temperature depends on your application. The key insight is that temperature controls the trade-off between coherence and creativity. Lower temperatures produce safer, more predictable text. Higher temperatures introduce novelty but risk incoherence.

Task-Based Recommendations

Different generation tasks call for different temperature settings:

Temperature guidelines by generation task. These are starting points; optimal values depend on the specific model and use case.

Task                  Recommended T   Rationale
------------------------------------------------------------------------------
Code generation       0.0 - 0.3       Correctness matters; creativity can introduce bugs
Factual Q&A           0.0 - 0.5       Accuracy over variety; want the most likely correct answer
Translation           0.3 - 0.7       Balance fluency with fidelity to source meaning
Creative writing      0.7 - 1.2       Encourage unexpected but coherent word choices
Brainstorming         1.0 - 1.5       Explore diverse ideas; some randomness is beneficial
Poetry/experimental   1.2 - 2.0       Prioritize novelty and surprise

For code and factual tasks, you often want temperature near zero. A slight temperature (0.1-0.2) can prevent the model from getting stuck in repetitive loops while still strongly favoring high-probability outputs. For creative tasks, temperatures between 0.7 and 1.2 typically produce the best balance of quality and variety.
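
In application code, these guidelines often end up as a small lookup of per-task defaults. The values below are illustrative starting points drawn from the table above, not fixed constants:

# Illustrative per-task defaults; tune for your model and use case
TEMPERATURE_PRESETS: dict[str, float] = {
    "code_generation": 0.2,
    "factual_qa": 0.3,
    "translation": 0.5,
    "creative_writing": 0.9,
    "brainstorming": 1.2,
    "poetry": 1.5,
}


def pick_temperature(task: str, default: float = 0.7) -> float:
    """Return a starting temperature for a task, falling back to a neutral default."""
    return TEMPERATURE_PRESETS.get(task, default)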

The Quality-Diversity Trade-off

Temperature creates an inherent trade-off. As you increase temperature, you gain:

  • Lexical diversity: More varied word choices, less repetition
  • Idea exploration: Access to less probable but potentially interesting continuations
  • Reduced mode collapse: Less tendency to repeat the same phrases

But you also risk:

  • Coherence degradation: Sentences that don't follow logically
  • Factual errors: Lower-probability (and potentially wrong) claims
  • Grammatical mistakes: Unusual token sequences that violate syntax

The sweet spot depends on how much you value diversity versus correctness. For a customer service chatbot, coherence and accuracy dominate: use low temperature. For a creative writing assistant, moderate temperature encourages the unexpected turns that make prose interesting.

Out[21]:
Figure: Conceptual illustration of the quality-diversity trade-off. Low temperature maximizes quality but limits diversity; high temperature enables exploration at the cost of coherence.

Dynamic Temperature

Some applications benefit from varying temperature during generation. You might start with low temperature to establish a coherent beginning, then increase temperature to introduce variation, then decrease again to conclude coherently. This technique requires careful tuning but can produce text that is both well-structured and creatively varied.

In[22]:
Code
def dynamic_temperature(
    position: int,
    total_length: int,
    t_start: float = 0.5,
    t_peak: float = 1.2,
    t_end: float = 0.6,
) -> float:
    """Compute temperature that varies across generation.

    Uses a simple curve: starts low, peaks in the middle,
    then decreases toward the end.
    """
    # Normalized position [0, 1]
    p = position / total_length

    # Piecewise-linear ramp peaking at p=0.5
    if p < 0.5:
        # Rise from t_start to t_peak
        return t_start + (t_peak - t_start) * (2 * p)
    else:
        # Fall from t_peak to t_end
        return t_peak - (t_peak - t_end) * (2 * (p - 0.5))
Out[23]:
Console
Dynamic temperature across 100-token generation:

Position 0 (start):   T = 0.50
Position 25:          T = 0.85
Position 50 (middle): T = 1.20
Position 75:          T = 0.90
Position 99 (end):    T = 0.61
Out[24]:
Figure: Dynamic temperature schedule across a 100-token generation. Temperature starts low for coherent openings, peaks in the middle for creative exploration, then decreases for stable conclusions.
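
To use such a schedule, temperature is recomputed at every step of an autoregressive loop. The sketch below wires dynamic_temperature and sample_with_temperature together; next_token_logits is a hypothetical stand-in for a model forward pass that returns next-token logits:

def generate_with_schedule(
    next_token_logits,        # callable: list[int] -> torch.Tensor of shape (vocab_size,)
    prompt_ids: list[int],
    total_new_tokens: int = 100,
) -> list[int]:
    """Sample token by token, recomputing temperature at each position."""
    ids = list(prompt_ids)
    for position in range(total_new_tokens):
        t = dynamic_temperature(position, total_new_tokens)  # schedule defined above
        logits = next_token_logits(ids)                       # model forward pass (stand-in)
        ids.append(sample_with_temperature(logits, t))        # sampler defined earlier
    return ids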

Worked Example: Generating Text Completions

Let's put temperature into practice with a complete text generation example. We'll use a small GPT-2 model to generate completions at different temperatures and observe the output characteristics.

In[25]:
Code
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

# Set pad token to eos token (GPT-2 doesn't have a pad token by default)
tokenizer.pad_token = tokenizer.eos_token
In[26]:
Code
def generate_with_temperature(
    prompt: str,
    temperature: float,
    max_new_tokens: int = 30,
    model=model,
    tokenizer=tokenizer,
) -> str:
    """Generate text completion with specified temperature."""
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True if temperature > 0 else False,
            pad_token_id=tokenizer.eos_token_id,
            top_k=0,  # Disable top-k to isolate temperature effect
            top_p=1.0,  # Disable nucleus sampling
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Now let's generate completions for a prompt at different temperatures:

Out[27]:
Console
Prompt: "The future of artificial intelligence is"

============================================================

Temperature = 0.3:
----------------------------------------
in the hands of the next generation of AI.

The future of artificial intelligence is in the hands of the next generation of AI.

The future of artificial intelligence is in the hands of

Temperature = 0.7:
----------------------------------------
in the hands of the world's most intelligent minds.

The Human Mind

An artificial intelligence is a variety of organisms or machines that are programmed to perform tasks and perform actions based on information

Temperature = 1.0:
----------------------------------------
in scope and such problems will require continual creativity and creativity in every new medium, no matter its size. And if the current schemes understand the abundance of motivation to succeed no amount of INTLEH,

Temperature = 1.5:
----------------------------------------
just set to increase with advancing attention Cell HE warned sectors Warner vice chair/LIL chairman LA Chase LeraccessKES agreed Internet could sap 'Better Bing's Rob Carlward realism Throughout tour Sense noun

At $T = 0.3$, the model produces focused but highly repetitive continuations, looping on the same sentence. At $T = 0.7$, the output stays broadly coherent while the vocabulary becomes more varied. At $T = 1.0$, coherence begins to slip, and by $T = 1.5$ the sampled tokens veer into near-random territory.

Comparing Multiple Samples

One sample doesn't reveal the full picture. Let's generate multiple completions at each temperature to see the distribution of outputs:

In[28]:
Code
def generate_multiple(
    prompt: str,
    temperature: float,
    n_samples: int = 5,
    max_new_tokens: int = 20,
) -> list[str]:
    """Generate multiple completions to observe variety."""
    completions = []
    for _ in range(n_samples):
        text = generate_with_temperature(prompt, temperature, max_new_tokens)
        # Extract just the generated portion
        generated = text[len(prompt) :].strip()
        completions.append(generated)
    return completions
Out[29]:
Console
Prompt: "In a world where robots"

Temperature = 0.5:
--------------------------------------------------
  1. are just as important as doctors, we need to understand how they work."

The pap...
  2. are the future, it's hard to imagine that the future is not one where robots wil...
  3. are quick to react to human needs, the world of robots is a very different world...
  4. are not just a toy, but a way to save lives, it's important to remember that rob...

Temperature = 1.0:
--------------------------------------------------
  1. would have just about been indistinguishable from the human body, robots who cou...
  2. dominate home care, products are sometimes even far more commonplace and cost-ef...
  3. are all piers and masterpieces who have yet to attain much of their day, it feel...
  4. can do complete traffic crimes, law enforcement (AoSons Symphony) can be tricky ...

At $T = 0.5$, the four samples share similar themes and phrasing. At $T = 1.0$, the narratives and word choices diverge much more, and some completions drift toward incoherence.

Measuring Output Diversity

We can quantify diversity by measuring how different the generated samples are from each other. One simple metric is the number of unique n-grams across samples:

In[30]:
Code
def measure_diversity(samples: list[str], n: int = 2) -> dict:
    """Measure n-gram diversity across samples."""
    all_ngrams = []
    for sample in samples:
        words = sample.lower().split()
        ngrams = [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]
        all_ngrams.extend(ngrams)

    total = len(all_ngrams)
    unique = len(set(all_ngrams))

    return {
        "total_ngrams": total,
        "unique_ngrams": unique,
        "diversity_ratio": unique / total if total > 0 else 0,
    }
Out[31]:
Console
Diversity comparison (10 samples, bigrams):

Temperature  Total      Unique     Diversity 
------------------------------------------
0.3          237        168        0.709     
0.7          237        210        0.886     
1.0          240        237        0.988     
1.5          215        215        1.000     

Higher temperatures produce higher diversity ratios, confirming that the outputs explore a broader range of vocabulary and phrases.

Limitations and Impact

Temperature is powerful but imperfect. It operates on the entire vocabulary uniformly, which can cause problems when you want to encourage diversity in some dimensions but not others.

The fundamental limitation is that temperature is a single scalar applied to all logits equally. Consider a medical question-answering system where you want diverse phrasing (how the answer is expressed) but not diverse facts (what claims are made). Temperature cannot distinguish between these: raising temperature increases variety in both phrasing and factual content. This creates a real tension in domains where accuracy matters but repetitive outputs frustrate users.

Temperature also interacts poorly with very long generation. Early tokens sampled at high temperature can push the model into unfamiliar territory, leading to compounding errors as generation proceeds. A single unusual word choice in token 5 might make token 50 completely incoherent. This is why many practitioners combine temperature with other techniques like top-k or nucleus sampling (covered in the next chapters) to constrain the damage from high-temperature sampling.

Despite these limitations, temperature fundamentally shaped how we interact with language models. Before temperature scaling became standard, language model outputs felt robotic and predictable. Temperature gave users a dial to explore the space of possible outputs, making language models feel more creative and less deterministic. The concept transfers beyond language: temperature-like parameters appear in image generation, music synthesis, and other generative AI systems. The intuition that "higher temperature means more randomness" has become part of the basic vocabulary of generative AI.

The success of temperature scaling also revealed something important about language model training. The fact that simply rescaling logits produces coherent but varied outputs suggests that models learn meaningful probability distributions over vocabulary. The relative ordering of token probabilities carries semantic information: "sunny" really is more appropriate than "elephant" after "The weather today is," and temperature preserves this ordering while adjusting the degree of concentration.

Summary

Temperature controls the sharpness of the probability distribution during language model sampling. By dividing logits by a temperature parameter $T$ before softmax, we can make the distribution more peaked (low $T$) or more uniform (high $T$).

Key takeaways:

  • $T = 1.0$ preserves the learned distribution. Lower values sharpen it; higher values flatten it.
  • $T \to 0$ approaches greedy decoding (argmax). $T \to \infty$ approaches uniform random sampling.
  • Practical ranges typically fall between 0.1 and 2.0, with most applications using 0.3 to 1.2.
  • Task matters: Factual and code generation prefer low temperature. Creative writing benefits from higher values.
  • Temperature controls the quality-diversity trade-off: more diversity comes at the cost of coherence.
  • Temperature affects all tokens uniformly, which can be limiting when you want selective diversity.

Temperature is often combined with top-k and nucleus (top-p) sampling to get finer control over the output distribution. These techniques, covered in the following chapters, truncate the distribution before sampling, preventing temperature from giving meaningful probability to highly unlikely tokens.

Key Parameters

When implementing temperature-controlled sampling, these parameters determine behavior:

  • temperature (float, typically 0.1-2.0): The scaling factor applied to logits before softmax. Values below 1.0 sharpen the distribution, making high-probability tokens more dominant. Values above 1.0 flatten the distribution, giving lower-probability tokens more chance. A value of 1.0 preserves the original learned distribution.

  • do_sample (bool): Whether to sample from the distribution (True) or use greedy decoding (False). Temperature only affects output when do_sample=True. With do_sample=False, the model always selects the highest-probability token regardless of temperature.

  • top_k (int, 0 to disable): When combined with temperature, restricts sampling to the top k most probable tokens. Setting top_k=0 disables this constraint, allowing temperature to affect the full vocabulary distribution.

  • top_p (float, 0.0-1.0): Nucleus sampling threshold, often used alongside temperature. Setting top_p=1.0 disables nucleus sampling, isolating the temperature effect. Lower values restrict sampling to tokens whose cumulative probability reaches the threshold.

  • max_new_tokens (int): Maximum number of tokens to generate. Longer sequences at high temperature tend to accumulate errors, so consider lower temperatures for longer outputs.

For most applications, start with temperature=0.7 and adjust based on output quality. Decrease if outputs are too random or incoherent; increase if outputs are too repetitive or predictable.
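
Putting these parameters together, a typical Hugging Face generate call might look like the following sketch, reusing the model and tokenizer from the worked example; the specific values are the starting points suggested above:

prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        do_sample=True,      # sampling must be enabled for temperature to matter
        temperature=0.7,     # starting point; lower for factual tasks, higher for creative ones
        top_k=0,             # disable top-k to isolate the temperature effect
        top_p=1.0,           # disable nucleus sampling
        max_new_tokens=50,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))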

