Large language models systematically avoid challenging their own hypotheses, a behaviour that mirrors a well-documented human cognitive flaw — and a new study shows that simple prompting strategies can meaningfully reduce it.
Researchers tested eleven LLMs across multiple model families and scales using a classic psychology experiment called the rule-discovery task. In this test, a model is shown a sequence of three numbers (a "triple") and must infer the hidden rule that such triples obey by proposing new triples of its own and receiving yes-or-no feedback. A rational agent would propose triples designed to disprove its current best guess, a strategy known as falsification. Instead, the researchers found that models overwhelmingly proposed triples designed to confirm what they already believed.
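The feedback loop above can be sketched in a few lines. The hidden rule used here ("strictly ascending numbers") and the candidate triples are illustrative assumptions chosen to mirror the classic version of the task, not details taken from the paper.

```python
def hidden_rule(triple):
    """The experimenter's secret rule: strictly ascending numbers."""
    a, b, c = triple
    return a < b < c

def run_trials(proposals):
    """Feed each proposed triple to the rule and collect yes/no feedback."""
    return [(t, hidden_rule(t)) for t in proposals]

# A confirmation-biased searcher who believes the rule is "evens increasing
# by 2" proposes only triples that fit that guess...
confirmatory = [(2, 4, 6), (8, 10, 12), (20, 22, 24)]
# ...while a falsifying searcher also proposes triples that would break
# the guess, which is where the informative feedback comes from.
falsifying = [(2, 4, 6), (1, 2, 3), (3, 2, 1)]

print(run_trials(confirmatory))  # every answer is "yes": uninformative
print(run_trials(falsifying))    # the "yes" on (1, 2, 3) refutes the guess
```

The confirmatory proposals all come back "yes", which feels like progress but never distinguishes the searcher's guess from the true rule; the "yes" on (1, 2, 3) immediately proves the "evens increasing by 2" hypothesis wrong.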
Why Confirmation Bias Matters in AI Systems
Confirmation bias is one of the most studied failures in human reasoning. It describes the tendency to seek out, interpret, and remember information in ways that validate existing beliefs rather than test them. In humans, it distorts everything from medical diagnosis to legal judgement. In language models, the same pattern creates a subtler but consequential problem: a model that cannot effectively falsify its own hypotheses will reach wrong conclusions more often, and reach correct ones more slowly.
The researchers found that confirmation bias led to slower and less frequent discovery of hidden rules — a direct measure of degraded reasoning performance.
The rule-discovery framework, adapted from decades of human psychology research, gives researchers a clean, quantifiable way to observe this. Because the task has a single correct answer and a structured feedback loop, it is possible to track not just whether a model gets the right answer, but how it searches for it. Models exhibiting confirmation bias tended to propose variations of triples that fit their current hypothesis, generating confirmatory data rather than the disconfirmatory data that would be most informative.
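One simple way to quantify the search behaviour described above is to classify each proposed triple by whether it conforms to the model's current hypothesis. The hypothesis and proposals below are illustrative assumptions, not data from the study.

```python
def current_hypothesis(triple):
    """The model's working guess: even numbers increasing by 2."""
    a, b, c = triple
    return a % 2 == 0 and b == a + 2 and c == b + 2

def confirmation_rate(proposals):
    """Fraction of proposals that fit the current hypothesis."""
    fits = sum(current_hypothesis(t) for t in proposals)
    return fits / len(proposals)

proposals = [(2, 4, 6), (10, 12, 14), (1, 3, 5), (4, 6, 8)]
print(confirmation_rate(proposals))  # 0.75: a mostly confirmatory search
```

A score near 1.0 indicates a searcher generating almost exclusively confirmatory evidence; a falsification-oriented searcher would deliberately drive this number down.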
Eleven Models Tested, One Consistent Pattern
The study tested models across different sizes and families, though the paper does not publicly name every model evaluated. The consistency of the finding across this breadth of systems is significant: this is not an idiosyncratic quirk of one architecture or training approach, but a pattern that appears to emerge from the way current LLMs are built and trained.
The practical consequence the researchers measured was a lower rate of successful rule discovery. Under baseline settings, models correctly identified the hidden rule 42% of the time on average. That figure is the anchor against which every intervention is measured.
How Prompting Interventions Reduced the Bias
The researchers drew on intervention strategies originally designed for human participants — techniques developed in psychology to help people reason more objectively. Translated into prompting instructions, these strategies encouraged the model to actively consider counterexamples: triples that would not fit its current hypothesis, and which could therefore reveal whether that hypothesis was wrong.
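As a rough illustration of how such an instruction might be framed, the sketch below contrasts a baseline task prompt with a counterexample-seeking variant. The wording is entirely assumed; the paper's actual prompts are not reproduced in this article.

```python
# Illustrative prompt wording only; not the study's actual instructions.
BASELINE = (
    "You are playing a rule-discovery game. Propose a triple of numbers "
    "and you will be told whether it satisfies the hidden rule."
)

COUNTEREXAMPLE_INSTRUCTION = BASELINE + (
    " Before each proposal, state your current best hypothesis, then "
    "propose a triple that would NOT satisfy it. A 'yes' on such a triple "
    "proves your hypothesis wrong and is the most informative outcome."
)
```

The intervention changes nothing about the task itself; it only redirects the model's search toward triples that could falsify, rather than confirm, its working hypothesis.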
The results were consistent across models. Adding these instructions raised the average rule-discovery rate from 42% to 56% — a 14 percentage point improvement from a relatively simple change in how the task was framed. According to the researchers, this reduction in confirmation bias was measurable and reliable, not a marginal or noisy effect.
It is worth noting that these benchmark results are self-reported by the research team: the paper was posted to arXiv as a preprint and has not yet undergone peer review.
Distilling Better Reasoning into Model Weights
Perhaps the most technically significant element of the work is what comes after the prompting experiments. The researchers did not stop at showing that prompting helps — they used a process called behavioural distillation to embed the intervention-induced reasoning patterns directly into the model itself. Rather than requiring a special prompt at inference time, the goal is for the model to exhibit less confirmation bias by default.
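A plausible data-construction step for this kind of behavioural distillation is sketched below: transcripts are generated with the anti-bias instruction in place, then the instruction is stripped before fine-tuning, so the model learns to produce the improved reasoning without the special prompt. The function and field names are assumptions for illustration, not the paper's pipeline.

```python
def build_distillation_examples(transcripts, intervention_text):
    """Turn intervention-prompted transcripts into plain fine-tuning pairs."""
    examples = []
    for t in transcripts:
        # Remove the special instruction so the target behaviour is paired
        # with an ordinary, unaugmented prompt.
        prompt = t["prompt"].replace(intervention_text, "").strip()
        examples.append({"prompt": prompt, "completion": t["response"]})
    return examples

transcripts = [{
    "prompt": "Play the rule-discovery game. Try to falsify your hypothesis.",
    "response": "Hypothesis: ascending evens. Testing (3, 2, 1) to probe it.",
}]
data = build_distillation_examples(
    transcripts, "Try to falsify your hypothesis."
)
print(data[0]["prompt"])  # the instruction is gone; the behaviour remains
```

Fine-tuning on pairs like these is what lets the distilled model exhibit the falsification-oriented behaviour by default, with no intervention text at inference time.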
The distilled models were then tested on a separate task — the Blicket test, a causal reasoning challenge also borrowed from developmental psychology — and showed promising generalization. This matters because it suggests the improvement is not purely a surface-level response to specific prompt wording, but reflects something closer to a genuine shift in how the model approaches hypothesis exploration.
Generalization across tasks is always the harder test. The fact that distilled behaviour transferred at all is an encouraging sign, though the researchers describe these results as "promising" rather than definitive — an appropriate level of caution for early-stage findings.
What Remains to Be Done
The study opens several questions it does not fully resolve. The rule-discovery task, while well-validated in psychology, is a controlled and artificial environment. How confirmation bias manifests in open-ended reasoning tasks — scientific analysis, legal argument, multi-step planning — is a harder problem to measure and one the current framework does not directly address.
The gap between a 14-point improvement in a structured task and robust falsificatory reasoning across real-world applications is substantial. Prompting interventions also introduce overhead: every deployed system that relies on anti-bias instructions must include and maintain those instructions reliably. Distillation offers a cleaner path, but requires retraining, which is expensive and not universally accessible.
What This Means
For anyone building or relying on LLMs for analytical tasks — research assistance, decision support, diagnostic reasoning — this study is a concrete reminder that models are not neutral hypothesis-testers by default, and that prompting design can meaningfully change how reliably they reason.