Large language models give contradictory medical answers depending on how a question is worded, even when both versions are supported by the same underlying clinical evidence, according to a new study from researchers who tested eight LLMs across 6,614 question pairs.
The research, posted to arXiv in April 2025, examines a specific and practical risk in AI-assisted medical question answering: that the same patient, asking the same question in different ways, may receive meaningfully different — and sometimes opposing — guidance. The authors used a controlled retrieval-augmented generation (RAG) setup, where documents were selected by experts rather than retrieved automatically, isolating phrasing as the variable under study.
How Framing a Question Can Flip an AI's Answer
The study tested two dimensions of query variation. The first was question framing — whether a question presupposed a positive answer ("This treatment works, right?") or a negative one ("This treatment doesn't work, right?"). The second was language style — whether the question used technical medical terminology or plain everyday language.
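The two-dimensional design can be sketched as a small cross-product of framing and style templates. The wording below is illustrative only — these are not the study's actual prompts, and the drug/condition pair is a made-up placeholder:

```python
from itertools import product

# Hypothetical templates for the two variation dimensions described above.
FRAMINGS = {
    "positive": "Does {intervention} improve outcomes for {condition}?",
    "negative": "Does {intervention} fail to improve outcomes for {condition}?",
}
STYLES = {
    "technical": {"intervention": "metformin monotherapy",
                  "condition": "type 2 diabetes mellitus"},
    "plain": {"intervention": "the diabetes pill metformin",
              "condition": "type 2 diabetes"},
}

def build_query_variants():
    """Return the four framing x style variants of one clinical question."""
    return {
        (framing, style): template.format(**fields)
        for (framing, template), (style, fields)
        in product(FRAMINGS.items(), STYLES.items())
    }

variants = build_query_variants()
for key, question in sorted(variants.items()):
    print(key, "->", question)
```

Each of the four variants would then be posed to the model alongside the same fixed, expert-curated evidence, so any divergence in the answers is attributable to phrasing alone.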
The framing effect was pronounced. Question pairs with opposing framings were significantly more likely to produce contradictory conclusions than pairs where both questions shared the same framing. In other words, an AI might affirm that a treatment is effective when asked one way, then cast doubt on that same treatment when asked differently — despite reading from the same clinical trial abstract both times.
LLM responses in medical QA can be systematically influenced through query phrasing alone, even when grounded in the same evidence.
Language style — technical versus plain English — showed no significant interaction with the framing effect, suggesting the problem is not about vocabulary complexity but about the implied assumption baked into a question's structure.
The Multi-Turn Problem: Persuasion Amplifies Inconsistency
The framing effect becomes more pronounced in multi-turn conversations — the back-and-forth dialogue format increasingly common in consumer AI health tools. According to the study, sustained persuasion across multiple exchanges increases inconsistency further, meaning a user who pushes back on an initial answer is more likely to get the model to reverse its position.
This has direct implications for how people actually use AI health assistants. Patients who are anxious, hopeful, or simply rephrasing a concern they didn't feel was properly addressed may inadvertently — or deliberately — lead a model toward a preferred conclusion. The model, rather than holding firm to the evidence, accommodates the framing.
The researchers note this is a problem even within RAG-based systems, which are often considered more reliable than standard LLM responses because they anchor answers to specific retrieved documents. The study's controlled design — using expert-curated clinical trial abstracts as the grounding evidence — means the inconsistency cannot be attributed to the model retrieving different or poor-quality sources. The evidence was fixed. Only the question changed.
Eight Models Tested, Systemic Weakness Found
The study evaluated eight LLMs, though the abstract does not name specific models. All showed susceptibility to framing effects to varying degrees, suggesting this is a systemic characteristic of current language model architectures rather than a flaw isolated to one product or provider.
The dataset of 6,614 query pairs was grounded in clinical trial abstracts — a rigorous and clinically relevant evidence base. The scale of the evaluation strengthens the generalisability of the findings across different medical topics and question types.
The authors frame phrasing robustness as an underappreciated evaluation criterion for medical AI systems. Most benchmarks for medical LLMs assess accuracy — whether the model gives the right answer to a well-formed question. Far fewer assess consistency — whether the model gives the same answer regardless of how the question is posed. This study argues the latter matters just as much, particularly in high-stakes settings.
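A consistency check of this kind can be sketched as a simple metric over paired answers. The stance-extraction heuristic below is purely illustrative — a real evaluation would use a calibrated classifier or human judges rather than keyword matching:

```python
def conclusion(answer: str) -> str:
    """Map a model answer to a coarse verdict via a naive keyword heuristic.
    Negative phrases are checked first because they contain positive
    substrings ("not effective" contains "effective")."""
    text = answer.lower()
    if any(w in text for w in ("not effective", "no benefit", "ineffective")):
        return "negative"
    if any(w in text for w in ("effective", "improves", "benefit")):
        return "positive"
    return "unclear"

def contradiction_rate(answer_pairs):
    """Fraction of paired answers whose verdicts directly oppose each other."""
    opposed = {("positive", "negative"), ("negative", "positive")}
    flags = [(conclusion(a), conclusion(b)) in opposed for a, b in answer_pairs]
    return sum(flags) / len(flags)

pairs = [
    ("The trial shows the drug is effective.", "Evidence suggests no benefit."),
    ("The drug improves outcomes.", "The drug is effective."),
]
print(contradiction_rate(pairs))  # 0.5
```

An accuracy benchmark would score each answer against ground truth in isolation; a consistency benchmark like this one scores whether paired answers to rephrased versions of the same question agree with each other.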
What the Medical AI Industry Needs to Do Next
The study points toward a gap in how medical AI systems are currently validated before deployment. Developers and regulators typically assess whether a model can answer medical questions correctly. This research suggests an additional standard is needed: whether a model answers consistently across the range of ways real patients might phrase the same concern.
Patients do not arrive at AI health tools with perfectly neutral, objectively framed questions. They arrive worried, hopeful, confused, or already leaning toward a conclusion. A system that reflects those leanings back at them — rather than grounding its responses in evidence regardless of framing — is not functioning as a reliable health information source.
The multi-turn finding is especially relevant given the current commercial push toward conversational AI health assistants, where extended dialogue is a feature, not a bug. If each additional exchange increases the risk of a model contradicting its earlier evidence-based position, that conversational depth becomes a liability in medical contexts.
What This Means
Developers and evaluators of AI medical tools need to add phrasing robustness as a standard benchmark criterion — a model that gives contradictory answers based on question framing cannot be considered suitable for patient-facing health applications.