
Why Temperature=0 Doesn't Guarantee Determinism in LLMs


TL;DR (Too Long; Didn't Read)

Temperature = 0 in language models aims to eliminate randomness by always selecting the most probable next token (i.e., greedy decoding). Despite this, real-world LLMs still produce varying outputs due to factors like:

  • Floating-point precision limitations
  • Hardware and parallel processing variability
  • Decoding tie-breakers
  • Other sampling parameters like top-k or top-p that may still introduce variation

Additional nondeterminism comes from model and infrastructure complexity, including:

  • Mixture-of-Experts (MoE) architectures
  • Nondeterministic operations in frameworks like PyTorch or TensorFlow
  • Multi-GPU sharding
  • Batching, load balancing, and cross-server variability

While temperature=0 increases consistency, true repeatability requires strict control over hardware, software, and decoding settings. Ultimately, these nuances explain why outputs can still differ, even when no randomness is explicitly introduced.

By now, most of us know that LLMs are not entirely deterministic: given the same input, they will not always produce the same output. This is the case even when we set the temperature to 0. Temperature is usually a setting only available to engineers, but conceptually it means dialing the randomness down to zero. But if there's no randomness anymore, why is the output still not perfectly deterministic? I've asked this question of multiple experts and never gotten a compelling or clear answer, so I did some research to answer it.

To recap, in theory, temperature 0 means the model always picks the single most probable next token at each step. This is a greedy strategy: in the limit, the top-scoring token effectively gets a probability of 1 (100%) and all others get 0, so there is only one choice at each step. In an ideal world, that would guarantee the exact same completion every time. In practice, however, LLM outputs can still vary slightly even at temperature 0, due to several technical factors.

Temperature = 0: Theoretical Determinism vs. Reality

What does “temperature 0” mean? In text generation, temperature is a parameter that controls randomness. A higher temperature (e.g., 0.8 or 1.0) makes the model more creative or random, while a lower temperature (e.g., 0.2) makes it more focused and conservative. Mathematically, temperature divides the model's raw token scores (logits) before the softmax, reshaping the resulting probability distribution.

As temperature approaches 0, the softmax “peaks” more sharply on the highest-probability token, essentially making that token the only viable choice. In fact, setting T = 0 is treated as a special case - it's like telling the model to always pick the top-scoring token at each step (often called greedy decoding). In theory, this should remove randomness entirely: if there is only one choice at each step, the sequence of choices should be the same every time.
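To make this concrete, here is a minimal numpy sketch of temperature scaling and the greedy special case. The logits are made up for illustration, not taken from any real model:

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Divide the logits by T before the softmax; lower T sharpens the distribution."""
    z = np.array(logits, dtype=np.float64) / T
    z -= z.max()                       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]               # made-up scores for three candidate tokens

print(softmax_with_temperature(logits, 1.0))   # roughly [0.63 0.23 0.14] -- some spread
print(softmax_with_temperature(logits, 0.1))   # essentially [1. 0. 0.] -- one clear winner

# T = 0 cannot be plugged into the formula (division by zero), so implementations
# special-case it as greedy decoding: simply take the argmax of the logits.
print(int(np.argmax(logits)))                  # 0
```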

Why do we still see different outputs sometimes? The answer is that the real-world implementation of LLMs involves many subtle complexities. Setting temperature to 0 does increase determinism, but it does not guarantee perfect repeatability.

Imagine two people performing very long and complex calculations by hand. Even if they follow the same steps, tiny differences (like rounding a number differently) could lead to a different final result. Similarly, inside an LLM's generation process, there are tiny numerical and procedural variations that can lead to divergent outputs. Below, we break down the key factors that contribute to non-deterministic behavior even when no randomness is intentionally introduced.

Floating-Point Precision Limitations

One fundamental issue is that computers do not have infinite precision when handling numbers. LLMs rely on many mathematical operations with probabilities (or logits, the raw scores for tokens) represented in floating-point format.

Floating-point numbers are an approximation; they have limited decimal precision. This means rounding errors can occur at many steps, and these tiny errors can cascade through the model's calculations. Over the course of generating a long piece of text, minute differences in arithmetic can add up to a different outcome.
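A small Python illustration of both effects (generic numeric examples, not taken from the article's sources): increments below the format's resolution vanish, and rounding errors accumulate over many operations.

```python
import numpy as np

# Below a certain magnitude, an increment simply disappears in float32.
x = np.float32(1.0)
print(x + np.float32(1e-8) == x)      # True: 1e-8 is below float32's resolution near 1.0

# Rounding errors accumulate: adding 0.0001 ten thousand times "should" give exactly 1.0.
total = np.float32(0.0)
for _ in range(10_000):
    total += np.float32(0.0001)
print(total)                          # close to 1.0, but not exactly 1.0
```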

Parallelism and Hardware Variability

Large language models are usually run on GPUs or TPUs that perform many operations in parallel for speed. Parallel computation can introduce nondeterminism because the order in which operations occur can vary. Unlike a simple calculator that adds numbers in a fixed sequence, a GPU might add up many numbers simultaneously in different chunks, then combine the results. Because floating-point addition is not perfectly associative (i.e., (a + b) + c may not equal a + (b + c) once rounding is involved), performing the operations in a different order can lead to slight differences in the final sum. The model's probability calculations might therefore differ slightly from run to run depending on how the work is partitioned and parallelized.
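The non-associativity is easy to verify in plain Python, without any GPU involved; this snippet just regroups the same three additions:

```python
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c    # 0.6000000000000001
regrouped     = a + (b + c)    # 0.6

print(left_to_right == regrouped)   # False: same numbers, different rounding path
```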

Hardware differences also play a role. If you run the same model on two different GPU models (say an Nvidia Tesla T4 vs. an A100), you might not get byte-for-byte identical results. Different processors have different architectures and optimization routines for math, which can produce minutely different results. One experimental report showed that the exact same prompt on two GPU types yielded slightly different token probabilities, causing the text generation to diverge after a few words. In most cases, these differences are tiny and don't change the highest-ranked token, so you wouldn't notice any change, but once in a while, especially in a long output, the difference can affect which token is deemed “most likely” at a given step. This is why LLM outputs can differ across runs or machines purely due to low-level hardware and parallelism details.

Decoding Strategies and Tie-Breaking

Even when no randomness is intended, the decoding algorithm (the procedure that picks tokens) can introduce variability. Temperature is one factor in decoding, but it's often used alongside others like top-k or top-p (nucleus) sampling, or beam search. At temperature 0, most applications switch to a greedy selection (always pick the top token). However, if any other sampling parameters are still in play, they can affect determinism.

For example, if top_k is set greater than 1 or top_p is less than 1, the model is still allowed to consider multiple tokens for each position, which can reintroduce randomness. To truly force determinism, you typically set top_k = 1 (only one option considered) and top_p = 1 (no probability mass cutoff) - or, equivalently, top_p = 0 in some implementations, to only allow the single top token. If these aren't set strictly, the model might still do a form of sampling among the top candidates, causing variations in output. In short, using temperature=0 alone isn't always sufficient if other decoding parameters allow choice.
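As a sketch in plain numpy (not any particular provider's API; the logits are invented), this is the difference between a strict greedy pick and sampling that still has more than one token in play:

```python
import numpy as np

logits = np.array([2.0, 1.9, 0.5])                  # made-up scores for three tokens
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Strict "temperature 0" / top_k = 1: always the single most probable token.
greedy_choice = int(np.argmax(probs))               # always index 0

# top_k = 2 with sampling: two tokens stay in play, so repeated calls can differ.
rng = np.random.default_rng()
top2 = np.argsort(probs)[-2:]                       # indices of the two best tokens
renormalized = probs[top2] / probs[top2].sum()
sampled_choice = int(rng.choice(top2, p=renormalized))

print(greedy_choice, sampled_choice)
```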

Another issue arises when two or more tokens have nearly equal probability (a tie, or close to one). In theory, if one token's probability is 0.5000 and another's is 0.4999, greedy decoding will pick the 0.5000 token. But imagine that, due to rounding or other minor differences, the model sometimes computes the second token at 0.5001 and the first at 0.4999. Now the “winner” flips. Even at temperature 0, the model needs a way to break ties or decide between probabilities that are equal or extremely close.
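Building on the associativity example above, here is a toy illustration of how a last-digit difference can flip the greedy winner; numpy's argmax (which breaks exact ties by taking the lower index) stands in for whatever tie-break rule a real implementation uses:

```python
import numpy as np

competitor = 0.6                           # score of token 0
same_value_two_ways = [
    (0.1 + 0.2) + 0.3,                     # 0.6000000000000001
    0.1 + (0.2 + 0.3),                     # 0.6 exactly
]

for token1_score in same_value_two_ways:
    scores = np.array([competitor, token1_score])
    print(repr(token1_score), "-> winner: token", int(np.argmax(scores)))

# First run: token 1 wins (0.6000000000000001 > 0.6).
# Second run: an exact tie, and argmax falls back to the lower index (token 0).
```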

Different implementations might handle this differently: one might consistently pick the token that comes first alphabetically or by ID number; another might introduce a tiny random jitter to break ties. If the tie-breaking rule isn't consistent, you could occasionally get a different outcome. For example, OpenAI's documentation for some models notes that if you set temperature to 0, the system will “automatically increase the temperature until certain thresholds are hit”.

This suggests that the API may raise the temperature slightly in cases where it needs to make a decision, effectively introducing a bit of randomness to avoid deadlocks or repetitive loops. So, what you thought was strict greedy mode might, under the hood, allow a tiny amount of flexibility, just to ensure the model continues generating.

Additionally, if using beam search (a deterministic search algorithm that finds a high-probability completion), you might expect it to be reproducible. Beam search is mostly deterministic, but if multiple beams have exactly the same score, the tie might be broken arbitrarily (or by nondeterministic ordering of floating-point sums, as discussed). This is usually rare, but it's another edge case where nondeterminism can slip in.

Internal Model Architecture and Framework Factors

Mixture-of-Experts (MoE) Layers

Some large models (most likely GPT-4) use a Mixture of Experts architecture, where the model has multiple expert sub-networks and a routing mechanism decides which “expert” handles a given token. This can introduce nondeterminism in an interesting way.

For efficiency, the model might process multiple input queries together in a batch. When it does, tokens from different sequences could compete for the same expert resources (because typically only a limited number of tokens can go to each expert).

If two requests are handled together, they might interfere with each other's expert routing. As one analysis of sparse MoE systems noted, routing decisions are effectively made per batch, so the sequences grouped together compete for each expert's limited capacity. This means the output for a single query might differ depending on what other queries were processed alongside it.

If you send the same prompt twice but behind the scenes it was batched with different other user requests each time, an MoE-based model could give different answers. This is a unique source of nondeterminism that comes purely from the model's architecture and how the service batches workloads, rather than from the usual sampling logic.
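A deliberately simplified toy router (invented scores and capacity, not any provider's actual implementation) shows how the same token can land on a different expert purely because of what it happens to be batched with:

```python
import numpy as np

def route(batch_scores, capacity):
    """Toy top-1 router: tokens are routed in order to their best-scoring expert,
    overflowing to the next-best expert once an expert hits its capacity."""
    num_experts = batch_scores.shape[1]
    load = [0] * num_experts
    assignments = []
    for scores in batch_scores:
        for expert in np.argsort(scores)[::-1]:      # try the best expert first
            if load[expert] < capacity:
                load[expert] += 1
                assignments.append(int(expert))
                break
    return assignments

our_token = np.array([0.9, 0.1])                      # strongly prefers expert 0

batch_a = np.stack([np.array([0.8, 0.2]), our_token]) # batched after a similar token
batch_b = np.stack([np.array([0.3, 0.7]), our_token]) # batched after a different token

print(route(batch_a, capacity=1))   # [0, 1]: expert 0 is already full, our token overflows
print(route(batch_b, capacity=1))   # [1, 0]: our token gets its preferred expert
```

Real routers use learned gating, top-2 routing, and more graceful overflow handling, but the batch dependence is the same basic effect.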

Non-deterministic Operations in Frameworks

The libraries and frameworks used to run LLMs (such as TensorFlow, PyTorch, CUDA libraries, etc.) sometimes use optimized algorithms that are nondeterministic.

For example, certain GPU operations (like some matrix decompositions, parallel reductions, or convolution algorithms) might trade a bit of determinism for speed. These ops might spawn threads or use atomic operations where the order of execution isn't fixed, leading to slight run-to-run differences.

The PyTorch documentation, for instance, has a whole reproducibility guide detailing which operations are not deterministic by default and how to avoid them. If an LLM uses any of these under-the-hood operations, you could see variations unless special care is taken.

Generally, deep learning frameworks do offer settings to enforce determinism (e.g., flags to use deterministic algorithms for certain operations), but these often come at a performance cost and are not always enabled in production systems.
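For completeness, this is roughly what enforcing determinism looks like in PyTorch, following its reproducibility guide; which of these flags actually matter depends on your PyTorch version, hardware, and which operations your model uses:

```python
import os
import random

import numpy as np
import torch

# Required by cuBLAS for deterministic matmuls on CUDA >= 10.2;
# must be set before any CUDA work starts.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Seed every RNG the stack might touch.
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

# Ask PyTorch to error out if a nondeterministic kernel would be used,
# and pin cuDNN to deterministic (and non-autotuned) algorithms.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```

Even with all of this in place, reproducibility is only promised within the same hardware, driver, and library versions, which is exactly the broader caveat of this article.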

Software Bugs or Undefined Behavior

It's also worth acknowledging that sometimes nondeterminism can come from simple bugs or from parts of the code that aren't intended to be random but lack guarantees. For example, if the model relies on some external state or if there is a race condition in how outputs are assembled, it could manifest as unpredictable output differences.

In early discussions, even OpenAI engineers were puzzled by temperature-0 variability and wondered if it was a bug or floating-point issue. Over time, the community has largely converged on the understanding that it's mostly due to the factors listed (floating point and parallelism), but the point is that the entire software stack needs to be deterministic to get identical outputs - and that's a hard thing to achieve in distributed, optimized systems.

Multi-Server and Deployment Factors

Different Instances or Shards

Typically, cloud providers run LLMs on clusters of machines. Your request might go to one of many servers. If not every server is exactly identical (same hardware, same software version, same numeric libraries), their outputs could vary slightly.

Even if the provider tries to keep them consistent, subtle differences (one machine has a newer GPU driver, or a slightly different GPU model, etc.) can introduce variation. For example, one server might process using an A100 GPU and another an A10 GPU - as discussed earlier, those can produce slight differences in results. So if you call the API at two different times, behind the scenes you might be hitting different hardware or a different partition of the model.

Model Updates and Configuration Changes

AI providers frequently update models or adjust their settings, and OpenAI has acknowledged that changes on their end can impact outputs even when you try to make your API calls deterministic. Even with the same seed and fixed parameters that were meant to give repeatable results across calls, they caution that the results are only “mostly” deterministic.
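As a sketch (assuming the official openai Python client; the model name and prompt are only illustrative), this is the kind of call OpenAI describes as “mostly” deterministic, and the system_fingerprint field is how you detect that the backend configuration changed between calls:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",                  # illustrative model name
    messages=[{"role": "user", "content": "Give me one fun fact about otters."}],
    temperature=0,                        # greedy-style decoding
    seed=42,                              # request reproducible sampling
)

print(response.choices[0].message.content)
print(response.system_fingerprint)        # if this changes, the backend changed too
```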

Load Balancing and Batching

As touched on earlier, if the service batches multiple user requests together to maximize throughput, it can lead to cross-influence (especially in MoE models). Even without MoE, batching can affect low-level numerical behavior.

For example, frameworks might pad sequences in a batch or handle memory differently when batch size changes, which can shift the arithmetic slightly. If on one API call your prompt is processed alone, and on another it happens to be processed alongside 3 other prompts (to better utilize the GPU), the internal execution differs. This can contribute to output differences.

Sharding Large Models

Very large models (like GPT-4) might be too big for one GPU, so they are sharded across multiple GPUs or machines. This means each neural net forward pass involves communication between parts of the model running on different devices. If timing or communication patterns vary, it could introduce slight nondeterminism as well.

For example, if two GPUs return results in a different order or with minor timing differences, the aggregator might produce non-bit-identical outcomes. These effects are usually small, but over many tokens, again, could show up as a difference in one token choice.

Conclusion

Large language models do have inherent sources of non-determinism even when we try to eliminate randomness. Floating-point arithmetic, parallel computation, tie-breaking, complex model architectures, and multi-server deployment issues all contribute to this.

Temperature=0 greatly reduces randomness by always choosing the most likely token, so it's the first step toward determinism, but it doesn't magically make the entire system deterministic.

Both non-technical and technical users should be aware that some variance is normal. By understanding these contributing factors, we can manage and mitigate unexpected variations. And if absolute consistency is needed, we must employ careful controls (seeds, fixed settings, consistent environment) to narrow down the sources of nondeterminism as much as possible.

Will this change how I interact with LLMs? Probably not, but now I do have a better understanding of why the output is different, even with the temperature set to 0. I hope you do, too.