Same question, yet still different?

Sometimes AI discussions sound “too technical”. Yet they are often about very practical, everyday questions—for example, why an AI does not always give the same answer for the same input. This exact phenomenon is called nondeterminism. And no: you do not need to be a math expert for this. It is enough to understand where the differences come from, why they matter in practice, and how to get them under control in critical applications.

Imagine an AI-powered control platform. You ask the same question about how to post a transaction. One time the answer says to post it as an operating expense, another time as depreciation—even though you have not changed anything and the “temperature” (the creativity slider) is set to 0. In regulated areas such as tax, medicine, law, or finance, this is more than annoying: tests become unreliable, analyses fluctuate, and user trust suffers.

This is exactly where an initial public technical paper from the new lab Thinking Machines comes in (founded by Mira Murati, former CTO of OpenAI; you may have seen the spectacular early valuation in the media). The authors debunk a common assumption: that temperature 0 automatically means identical answers. Their counterexample is simple and striking: 1,000 runs of the same request at temperature 0 still produced 80 different complete answers. The key point: the cause is not (only) in the training, not in ambiguous data, and not merely in server “concurrency”. It arises during execution—precisely when the already trained model answers your request.

Why are there different answers?

The most important building block is floating-point arithmetic—calculations with limited precision, as chips and GPUs inevitably do. In practice, it makes a difference in which order numbers are added together.

A practical analogy is rounding to cents. Imagine your system is allowed to round to two decimal places after each step. If you add €0.004 to €1,000.00, it stays €1,000.00 because €0.004 is rounded to €0.00. If you do that twice in a row, you still end up at €1,000.00. But if you first combine the two tiny amounts (€0.004 + €0.004 = €0.008) and then round, it becomes €0.01, and you end up at €1,000.01. Same numbers, different grouping, different result. Floating-point operations on chips behave the same way—just with many more decimal places and much, much faster.
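The cent analogy maps directly onto machine floats. A minimal sketch in plain Python (standard 64-bit floats, no libraries) shows the grouping effect:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # group the first two numbers first
right = a + (b + c)  # group the last two numbers first

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False: same numbers, different grouping
```

Mathematically, addition is associative; in floating point it is not. That tiny wiggle room is exactly what different execution orders exploit.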

Why does this affect AI answers?

Because modern models compute a probability distribution for each word choice and then select the “best” next word. If two candidates are extremely close, these tiny rounding differences can flip the order: sometimes “Queens, New York” is just barely ahead, sometimes “New York City”. From that point on, the text takes a different path, and a small deviation becomes a visibly different answer over many words.
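A toy illustration of the flip (illustrative numbers only, not the paper's actual computation): candidate A's score is a sum of many small contributions, and merely changing the summation order decides whether A or B wins the greedy pick at temperature 0:

```python
import math

# Toy scores for two next-token candidates.
# Candidate A's score is the sum of 1,000 small contributions.
contribs_a = [0.1] * 1000
score_b = 99.99999999999999   # candidate B: a fixed, nearly identical score

score_a_seq = sum(contribs_a)        # left-to-right accumulation
score_a_acc = math.fsum(contribs_a)  # exactly rounded summation

print(score_a_seq > score_b)  # False: in this order, B is just barely ahead
print(score_a_acc > score_b)  # True: in this order, A wins instead
```

Both sums add exactly the same numbers; only the accumulation strategy differs, yet the greedy choice flips, and from that token on the text takes a different path.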

The second building block is batch processing.

To be fast, the server groups many requests into packages (batches). Which request is processed together with which “neighboring requests” depends on the current load. Sometimes your request ends up in a batch of five, sometimes in a batch of fifty. And that is exactly what changes the internal order of certain computation steps. Depending on batch size, GPUs even use different, especially fast shortcuts (kernel optimizations). Overall, this means: different batch neighbors → different computation order → slightly different numbers → potentially a different answer, even at temperature 0.
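The batch effect can be mimicked as a toy model in plain Python (a stand-in for real GPU kernels): summing the same 10,000 numbers with two different chunk sizes, as a proxy for two different batch-dependent reduction trees, lands on slightly different results:

```python
import random

random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

def chunked_sum(xs, chunk):
    # Reduce in chunks of `chunk`, then combine the partial sums --
    # a stand-in for a reduction tree whose shape depends on batch size.
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sum(partials)

small_batch = chunked_sum(values, 5)   # "batch of five"
large_batch = chunked_sum(values, 50)  # "batch of fifty"

print(small_batch == large_batch)  # typically False: tiny, order-dependent drift
```

The drift is minuscule in absolute terms, but as shown above, a minuscule difference at one word choice is all it takes for the answer to diverge.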

Why is this important?

First, for testing & integration: if a system answers the same input differently from one run to the next, end-to-end tests are difficult to rely on. Second, for evaluation & tuning: anyone who wants to compare quality, speed, and cost properly needs reproducible measurements—otherwise you are optimizing against noise. Third, for reinforcement learning (RL): many modern AI systems improve through feedback. For that, the system must respond as consistently as possible; otherwise it unintentionally learns from behavior it does not actually exhibit stably. And fourth, for production use in regulated environments: identical cases should be treated identically, otherwise audits, traceability, and trust quickly fall apart.

What does Thinking Machines propose?

A technical but well-explainable idea: “batch-invariant kernels”. Put simply, the computations are rewritten so that the result for a given input is always the same—regardless of which other requests it is processed with. This removes the “random effect” of batching on the computation order. The trade-off: performance. Determinism currently comes at a noticeable cost in throughput and compute time. For many teams, it may still be worth it on critical paths, because clean testing, reliable analyses, and more stable RL can more than offset the additional effort.
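The idea can be sketched in miniature (a toy model in Python, not the actual kernels): fix each request's reduction order once and for all, and its result no longer depends on who shares the batch:

```python
def invariant_sum(xs, chunk=8):
    # Always reduce a request's numbers in the same fixed chunked order,
    # no matter how many other requests are in the batch.
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sum(partials)

def serve(batch):
    # Each request is reduced independently, in its own fixed order.
    return [invariant_sum(request) for request in batch]

request = [0.1] * 1000
alone = serve([request])[0]                          # batch of 1
crowded = serve([request] + [[0.2] * 500] * 49)[0]   # batch of 50

print(alone == crowded)  # True: bit-identical, regardless of neighbors
```

The real engineering challenge is doing this inside highly optimized GPU kernels without giving up too much speed, which is where the throughput cost mentioned above comes from.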

What does this mean in practice?

First: deliberately separate two operating modes. For creative tasks (ideas, texts, variants), some variance is often even desirable—AI can be allowed to “breathe”. For critical tasks (posting transactions, medical guidance, legal assessment, compliance texts), however, you need as much determinism as possible. Architecturally, this means providing a “determinism mode” for mandatory paths, asking providers specifically whether they offer batch-invariant inference or equivalent guarantees, and measuring how much performance this mode costs. This includes clean logs (model version, settings, data snapshot, seeds), gold prompts with expected outputs, load tests with varying batch sizes, and canary checks in the pipeline that alert early when deviations occur.
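The "clean logs and canary checks" can be made concrete with a minimal sketch (hypothetical field and function names, not a specific product's API):

```python
import hashlib

def log_record(model_version: str, params: dict, prompt: str, answer: str) -> dict:
    # Capture everything needed to reproduce and compare a critical-path call.
    return {
        "model_version": model_version,
        "params": params,  # temperature, seed, determinism mode, ...
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "answer_sha256": hashlib.sha256(answer.encode()).hexdigest(),
    }

def canary_check(baseline: dict, rerun: dict) -> bool:
    # Alert when the same gold prompt under the same settings drifts.
    same_setup = (baseline["model_version"] == rerun["model_version"]
                  and baseline["params"] == rerun["params"]
                  and baseline["prompt_sha256"] == rerun["prompt_sha256"])
    return same_setup and baseline["answer_sha256"] == rerun["answer_sha256"]

settings = {"temperature": 0, "seed": 42}
gold = log_record("model-v1", settings, "How do I post X?", "As an operating expense.")
rerun = log_record("model-v1", settings, "How do I post X?", "As an operating expense.")

print(canary_check(gold, rerun))  # True while the answer stays identical
```

Hashing prompt and answer keeps the log compact and comparison trivial; a failing canary is the early-warning signal that a gold prompt has started to drift.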

It is also interesting to look ahead: in addition to choosing between reasoning models and “standard” models, or between large and small variants, a new selection axis could emerge—deterministic for testing, evaluation, RL training, and regulated applications; probabilistic for everything where diversity helps. Temperature would then primarily control creativity within the chosen mode, not the fundamental question of “stable vs. variable”.

And what about us humans?

Even experts are not always consistent: ask three tax advisors and you will often get three nuances. That is normal. But standard processes should be consistent. No one will trust a platform long term if it changes its “opinion” every day for the same input. That is why determinism is not a nice-to-have, but a must-have wherever traceability, fairness, and repeatability matter. The good news: the problem is solvable—already today, with deliberate architecture and the right questions for providers. For creative work, we keep the desired variance. For critical paths, we ensure true reproducibility. This is how we get the best of both worlds.