A systematic study published on arXiv has found that the temperature setting used when querying a large language model can alter the effectiveness of different prompting strategies — with the benefit of extended reasoning rising from 6x at T=0.0 to 14.3x at T=1.0 — challenging a widespread assumption in AI deployment.
Most practitioners treat temperature as a secondary dial, often defaulting to zero for tasks requiring logical precision. This new research directly challenges that convention by demonstrating that temperature and prompting strategy interact in ways that significantly affect output quality on hard mathematical reasoning problems.
The Experiment: Olympiad-Level Maths as a Stress Test
The researchers evaluated two common prompting approaches — chain-of-thought (CoT) prompting, which instructs a model to reason step by step, and zero-shot prompting, which provides no examples or reasoning scaffolding — across four temperature values: 0.0, 0.4, 0.7, and 1.0. All tests used Grok-4.1 with extended reasoning enabled, running on 39 problems drawn from AMO-Bench, a benchmark built around International Mathematical Olympiad-level questions.
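The design amounts to a small configuration grid: two prompt styles crossed with four temperatures, scored over the same problem set. A minimal sketch of how such a grid could be run is below; the `ask` callable standing in for the model API, the prompt templates, and the exact-match scoring are illustrative assumptions, not the authors' actual harness.

```python
import itertools

# Illustrative grid matching the study's design: 2 prompt styles x 4 temperatures.
TEMPERATURES = [0.0, 0.4, 0.7, 1.0]

PROMPTS = {
    "zero_shot": "Answer the following problem:\n{problem}",
    "cot": ("Answer the following problem. Think step by step "
            "before giving the final answer:\n{problem}"),
}

def run_grid(ask, problems):
    """ask(prompt, temperature) -> answer string (the model call, stubbed here).

    problems is a list of (problem_text, expected_answer) pairs.
    Returns accuracy per (strategy, temperature) cell.
    """
    results = {}
    for temp, (name, template) in itertools.product(TEMPERATURES, PROMPTS.items()):
        correct = 0
        for problem, expected in problems:
            answer = ask(template.format(problem=problem), temperature=temp)
            correct += (answer.strip() == expected)
        results[(name, temp)] = correct / len(problems)
    return results
```

With 39 AMO-Bench problems, this yields eight accuracy cells — one per (strategy, temperature) pairing — which is the grain at which the study reports its results.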
AMO-Bench is considered one of the more demanding evaluations available for mathematical reasoning, making it a meaningful setting in which to study configuration effects. The benchmark choice matters: easier datasets may not expose the performance differences that emerge under real pressure.
The benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0 — a finding that reframes how developers should think about model configuration.
Zero-Shot Peaks in the Middle, Chain-of-Thought at the Edges
The results show a clear and asymmetric pattern. Zero-shot prompting achieved its highest accuracy of 59% at moderate temperatures — specifically T=0.4 and T=0.7 — and performed worse at both extremes. Chain-of-thought prompting, by contrast, performed best at the temperature extremes, T=0.0 and T=1.0, suggesting it benefits either from maximum determinism or maximum diversity in token sampling.
This divergence is practically significant. A developer who selects CoT prompting and sets temperature to 0.7 — a common default in many production environments — may be operating in a configuration that underperforms both CoT at the temperature extremes and zero-shot prompting at the same temperature. The interaction effect, not either variable alone, drives the outcome.
Why Extended Reasoning Amplifies the Effect
Extended reasoning models, sometimes called "thinking" models, allocate additional computation at inference time to work through problems before producing a final answer. This test-time compute approach has become a prominent design direction across the industry, with OpenAI, Google DeepMind, and xAI all releasing models with similar capabilities.
The study's finding that extended reasoning's benefit scales substantially with temperature — from a 6x improvement at T=0.0 to a 14.3x improvement at T=1.0 over baseline — suggests that the stochastic diversity introduced by higher temperatures may help the model's internal reasoning process explore a wider solution space. At low temperatures, the model produces more deterministic outputs, potentially converging on the same reasoning path repeatedly even across extended computation steps.
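The mechanism at work is standard temperature-scaled softmax sampling: logits are divided by the temperature before being normalised into a probability distribution over next tokens. The sketch below (not from the paper) shows how T=0 collapses to greedy decoding, which explains why a low-temperature model can converge on the same reasoning path repeatedly, while higher T spreads probability mass across alternatives.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to token probabilities. Lower temperature sharpens
    the distribution; higher temperature flattens it."""
    if temperature == 0.0:
        # Greedy limit: all probability mass on the argmax token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.0))  # greedy: [1.0, 0.0, 0.0]
print(softmax_with_temperature(logits, 1.0))  # flatter: mass spread over all tokens
```

At T=0 the second- and third-ranked continuations are never sampled, so repeated reasoning attempts explore essentially one path; at T=1 they retain meaningful probability, which is consistent with the paper's suggested explanation for the larger extended-reasoning gains.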
This interpretation aligns with existing research on sampling diversity in multi-step reasoning, though the magnitude of the effect observed here is substantial. These findings are based on a single model and a relatively small problem set of 39 questions, which limits the generalisability of the conclusions.
A Direct Challenge to the T=0 Default
The convention of setting temperature to zero for reasoning and mathematics tasks is widespread. The logic is intuitive: lower randomness should mean fewer errors in logical chains. The study's results complicate that picture considerably.
According to the paper, the findings suggest that temperature should be optimised jointly with prompting strategy, rather than treated as a fixed parameter set independently before prompting decisions are made. This has direct implications for anyone building pipelines on top of reasoning-capable models — from researchers using them for formal problem solving to engineers deploying them in technical support or code generation contexts.
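In practice, joint optimisation can be as simple as scoring every (strategy, temperature) pair on a held-out validation set and selecting the best cell rather than fixing T upfront. The sketch below does exactly that; the accuracy numbers are illustrative placeholders loosely echoing the reported pattern (zero-shot peaking mid-range, CoT at the extremes) — only the 59% zero-shot figures come from the paper.

```python
def select_config(accuracy_by_config):
    """Pick the (strategy, temperature) pair with the highest validation accuracy.

    accuracy_by_config: {(strategy, temperature): accuracy}
    """
    return max(accuracy_by_config, key=accuracy_by_config.get)

# Placeholder validation scores; 0.59 zero-shot cells echo the paper,
# the remaining values are invented for illustration only.
scores = {
    ("zero_shot", 0.0): 0.49, ("zero_shot", 0.4): 0.59,
    ("zero_shot", 0.7): 0.59, ("zero_shot", 1.0): 0.51,
    ("cot", 0.0): 0.56, ("cot", 0.4): 0.50,
    ("cot", 0.7): 0.48, ("cot", 1.0): 0.57,
}
print(select_config(scores))  # -> ('zero_shot', 0.4)
```

The point is procedural, not numerical: the temperature decision falls out of the same search as the prompting decision, instead of being set once and inherited by every pipeline downstream.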
The researchers stop short of prescribing a single optimal configuration, noting that the right pairing depends on which prompting approach is in use. What the data does indicate is that T=0 is not a safe universal default for extended reasoning systems.
Limitations and What Comes Next
The study's scope is narrow by design: one model, one benchmark, 39 problems. AMO-Bench problems are also highly specialised, and it is not established whether the same temperature-prompting interactions hold across domains such as coding, scientific reasoning, or multi-step planning. Replication on other extended reasoning models — including OpenAI's o-series and Google's Gemini Thinking variants — would be needed to determine whether these patterns generalise.
The benchmark results are drawn from the researchers' own evaluation runs and have not been independently verified at this stage, as the paper is a preprint that has not yet undergone peer review.
What This Means
Developers and researchers using extended reasoning models should treat temperature as an active variable to be tuned in combination with their chosen prompting strategy — defaulting to T=0 may suppress the extended reasoning capability those models were chosen to provide.