A new academic survey and benchmark have found that large language models, despite strong scores on medical licensing-style exams, fall significantly short when tested on authentic clinical decision tasks drawn from real hospital data.
The work, published on arXiv in April 2025, combines a structured review of existing medical reasoning methods with the introduction of MR-Bench, a novel evaluation framework built from real-world clinical cases. The authors argue that exam-style benchmarks — which have driven much of the optimism around AI in medicine — do not adequately capture the complexity of how doctors actually reason through patient care.
Evaluations on MR-Bench expose a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks.
Why Medical Exams Are the Wrong Yardstick
Medical licensing exams test factual recall and pattern recognition in controlled, well-posed scenarios. Clinical reasoning, by contrast, is what happens when a doctor faces an ambiguous presentation, incomplete data, and a patient whose condition may be changing by the hour. The survey's authors ground their analysis in cognitive theories of clinical reasoning, framing the process as an iterative cycle of abduction (generating candidate diagnoses from symptoms), deduction (predicting what should follow if a hypothesis is correct), and induction (updating beliefs based on new evidence).
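The abduction–deduction–induction cycle can be made concrete with a toy sketch (not from the paper): candidate diagnoses are generated with prior beliefs, each hypothesis predicts how likely a finding would be if it were true, and beliefs are renormalized as each new piece of evidence arrives. All diagnoses, findings, and probabilities below are invented for illustration.

```python
def normalize(beliefs):
    """Rescale beliefs so they sum to 1."""
    total = sum(beliefs.values())
    return {d: p / total for d, p in beliefs.items()}

# Abduction: generate candidate diagnoses with prior probabilities (invented).
beliefs = {"pneumonia": 0.4, "pulmonary_embolism": 0.3, "heart_failure": 0.3}

# Deduction: each hypothesis predicts how likely a finding is if it holds.
# These likelihoods are illustrative, not clinical data.
likelihood = {
    "elevated_d_dimer": {"pneumonia": 0.3, "pulmonary_embolism": 0.9, "heart_failure": 0.4},
    "fever":            {"pneumonia": 0.8, "pulmonary_embolism": 0.2, "heart_failure": 0.1},
}

# Induction: update beliefs as each new test result comes in.
for finding in ["elevated_d_dimer", "fever"]:
    beliefs = normalize({d: p * likelihood[finding][d] for d, p in beliefs.items()})

top_diagnosis = max(beliefs, key=beliefs.get)
```

Note how the leading hypothesis can flip as evidence accumulates: after the d-dimer result alone, pulmonary embolism ranks highest, but adding the fever finding moves pneumonia to the top. That order-dependence is exactly the "evolving evidence" problem the survey highlights.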
This framing matters because it sets a higher bar than most current benchmarks demand. Passing a multiple-choice question about drug interactions is not the same as deciding, in real time, whether a patient's deteriorating labs warrant an immediate intervention or watchful waiting.
Seven Technical Routes, One Unified Evaluation
The survey organizes existing approaches to medical reasoning into seven major technical routes, spanning both training-based methods — such as fine-tuning on clinical text — and training-free approaches, including chain-of-thought prompting and retrieval-augmented generation. According to the authors, prior work has evaluated these methods across inconsistent experimental settings, making it difficult to draw reliable comparisons.
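As a rough illustration of the training-free routes mentioned above, the sketch below combines retrieval-augmented generation with a chain-of-thought style prompt: fetch the passages most relevant to a question, then prepend them before asking the model to reason step by step. The corpus, question, and keyword-overlap scoring are toy stand-ins (real systems typically use embedding similarity), and no model is actually called.

```python
def tokens(text):
    """Lowercase words with trailing punctuation stripped."""
    return set(w.strip(".,?") for w in text.lower().split())

def retrieve(query, corpus, k=2):
    # Rank passages by keyword overlap with the query -- a crude stand-in
    # for the embedding similarity used in real RAG pipelines.
    q = tokens(query)
    return sorted(corpus, key=lambda p: len(q & tokens(p)), reverse=True)[:k]

# Toy reference corpus (illustrative statements, not clinical guidance).
corpus = [
    "Warfarin interacts with many antibiotics, raising bleeding risk.",
    "Metformin is first-line therapy for type 2 diabetes.",
    "ACE inhibitors can cause a persistent dry cough.",
]

question = "Which antibiotics raise bleeding risk with warfarin?"
context = "\n".join(retrieve(question, corpus))

# Chain-of-thought style prompt: retrieved context first, then an
# instruction to reason step by step before answering.
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer step by step:"
```

The design point is that both techniques leave the model's weights untouched: all of the "method" lives in how the input is assembled, which is what distinguishes these training-free routes from fine-tuning on clinical text.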
To address this, the team conducted a unified cross-benchmark evaluation of representative models under consistent conditions. This kind of methodological standardization is relatively rare in the field and allows the survey to offer something more than a catalogue of results: it provides a systematic picture of where different methods actually stand relative to one another.
MR-Bench: Grounding AI Evaluation in Hospital Reality
MR-Bench is the survey's most concrete contribution. Unlike benchmarks assembled from textbooks or curated exam databases, MR-Bench is derived from real-world hospital data, placing models in scenarios that reflect the messiness and stakes of actual clinical environments. The benchmark is designed to assess clinically grounded reasoning rather than surface-level recall.
The results bear this out. Models that perform competitively on established medical benchmarks show a pronounced drop in accuracy on MR-Bench's clinical decision tasks. The abstract does not give exact figures for every model, but the authors describe the gap as significant: large enough to matter in practice, not just statistically.
This finding aligns with a broader pattern in AI research: performance on curated benchmarks can overstate real-world capability, sometimes dramatically. In medical AI, that gap carries direct consequences for patient safety.
What the Field Has Been Missing
The survey also highlights what it describes as key gaps between current model performance and the requirements of real-world clinical reasoning. These include handling evolving evidence — where the right answer may change as new test results arrive — and operating reliably in context-dependent situations where patient history, comorbidities, and clinical setting all shape the appropriate response.
Safety is an explicit concern throughout. Clinical decision-making is described as inherently safety-critical, and the authors make clear that reliable performance in this domain requires more than factual knowledge. A model that confidently recalls the standard dosing for a medication is not necessarily equipped to reason about whether that medication is appropriate for a specific patient with a complex history.
The survey does not advocate for or against clinical deployment of LLMs. Instead, it provides a framework for understanding what robust medical reasoning actually requires and where current methods fall short — a more useful contribution than either uncritical enthusiasm or blanket caution.
What This Means
For researchers, clinicians, and health system decision-makers evaluating AI tools, this survey is a practical reminder that exam-score benchmarks are insufficient proxies for clinical readiness — and that MR-Bench offers a more demanding, realistic standard against which to measure progress.