Vision-language models deployed in medical settings are overconfident in their predictions, and neither scaling up model size nor using advanced prompting techniques resolves the problem, according to a new empirical study posted to arXiv.

The paper, titled Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation, examines a critical but underexplored dimension of clinical AI: not just whether a model gives the right answer, but whether its stated confidence in that answer can actually be trusted. In high-stakes medical settings, an AI that is wrong but certain can be more dangerous than one that flags its own uncertainty.

Why Confidence Calibration Matters in Clinical AI

Calibration refers to how well a model's expressed confidence aligns with its actual accuracy. A well-calibrated model that says it is 90% confident should be correct approximately 90% of the time. A poorly calibrated — or overconfident — model might claim 90% certainty while only being correct 60% of the time. In medical imaging, where clinicians may use AI outputs to support diagnoses, miscalibrated confidence could directly influence patient care decisions.
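The standard way to quantify this gap is expected calibration error (ECE): predictions are grouped into confidence bins, and the metric averages the difference between each bin's stated confidence and its actual accuracy. The paper does not publish its evaluation code, so the following is only an illustrative sketch of the metric, using a toy "overconfident" model that claims 90% confidence but is right 60% of the time:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; ECE is the size-weighted
    average gap between each bin's confidence and its accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Toy overconfident model: always ~90% confident, correct only 60% of the time.
confs = [0.9] * 10
hits = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(round(expected_calibration_error(confs, hits), 2))  # → 0.3
```

A perfectly calibrated model would score an ECE of zero; the 0.3 here is exactly the 90%-versus-60% gap described above.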

The researchers tested three model families — Qwen3-VL, InternVL3, and LLaVA-NeXT — across model sizes ranging from 2 billion to 38 billion parameters, using three separate medical visual question answering (VQA) benchmarks. They also evaluated multiple prompting strategies, including chain-of-thought reasoning and asking models to verbally state their own confidence levels.

Overconfidence persists across model families and is not resolved by scaling or prompting.

Bigger Models, Same Problem

The study's first key finding is that overconfidence is not a quirk of any single model or size — it is a consistent pattern. Scaling a model from 2B to 38B parameters did not meaningfully improve calibration. Prompting techniques that encourage models to reason step-by-step or self-report uncertainty also failed to produce reliable confidence estimates. This challenges a common assumption that larger, more capable models will naturally become more self-aware about what they do and do not know.

The second finding offers a more practical solution. Simple post-hoc calibration methods — statistical corrections applied after a model generates its output, rather than changes to the model itself — consistently reduced calibration error across all tested models. The researchers specifically highlight Platt scaling, a well-established technique that fits a logistic regression model on top of raw model confidence scores to adjust them toward reality. These post-hoc methods outperformed every prompt-based strategy tested.
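In essence, Platt scaling learns a sigmoid mapping from raw confidence scores to corrected probabilities, fitted on held-out examples where the model's correctness is known. The paper's exact fitting procedure is not reproduced here; the sketch below implements the textbook version with a simple gradient-descent fit, applied to the same toy overconfident model as before:

```python
import math

def platt_fit(scores, labels, lr=0.1, epochs=2000):
    """Fit sigmoid(a*s + b) to (raw score, correctness) pairs
    by gradient descent on the logistic (cross-entropy) loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s / n
            grad_b += (p - y) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def platt_apply(score, a, b):
    """Map a raw confidence score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Raw model claims 0.9 confidence but is right 6 times out of 10.
raw = [0.9] * 10
hit = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
a, b = platt_fit(raw, hit)
print(round(platt_apply(0.9, a, b), 2))  # calibrated toward the true 60% rate
```

Because the correction is just two fitted scalars, it needs no access to model weights and no retraining — which is what makes it attractive as a deployment-time fix.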

The Ceiling That Post-Hoc Calibration Cannot Break

However, the study identifies an important limitation of post-hoc approaches. Because methods like Platt scaling are mathematically monotonic — meaning they preserve the original ranking of predictions even while adjusting their scores — they cannot improve a model's ability to distinguish correct answers from incorrect ones. The metric used to measure this discriminative quality, AUROC (Area Under the Receiver Operating Characteristic curve), remained unchanged after post-hoc calibration. The model became better at expressing how confident it should be, but not better at actually knowing what it got right.
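The AUROC ceiling follows directly from the math: AUROC depends only on how predictions are ranked against each other, and a monotonic map never reorders them. A small sketch (not from the paper) makes this concrete by computing AUROC before and after a sigmoid rescaling:

```python
import math

def auroc(scores, labels):
    """AUROC via its rank interpretation: the probability that a
    randomly chosen correct answer outscores an incorrect one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

raw = [0.9, 0.8, 0.7, 0.6, 0.55]
hit = [1, 0, 1, 1, 0]

# A monotonic recalibration (any sigmoid(a*s + b) with a > 0 works):
calibrated = [1.0 / (1.0 + math.exp(-(2.0 * s - 1.5))) for s in raw]

print(auroc(raw, hit) == auroc(calibrated, hit))  # → True
```

The calibrated scores may be far closer to the true accuracy, yet the model is no better at sorting its right answers from its wrong ones — exactly the ceiling the study describes.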

This distinction matters in practice. A clinician using an AI tool wants not just accurate confidence scores, but a system that reliably ranks its own answers — surfacing the cases it is most likely to have gotten right, and flagging those it may have gotten wrong.

Hallucination Signals as a Calibration Boost

To address this ceiling, the researchers investigated what they call hallucination-aware calibration (HAC). Hallucinations in vision-language models refer to instances where a model generates plausible-sounding but factually incorrect content — a well-documented and serious problem in medical AI, where fabricated clinical details could be harmful.

HAC works by incorporating signals from vision-grounded hallucination detection — tools that check whether a model's textual output is actually supported by the image it was given — as additional inputs when estimating confidence. Feeding these hallucination signals into the calibration process improved both calibration error and AUROC scores. The largest gains appeared on open-ended questions, where models are asked to generate descriptive answers rather than choose from fixed options — arguably the more realistic and demanding clinical scenario.
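The key idea is that the hallucination signal is a second input feature, so the calibrator is no longer a monotonic function of raw confidence alone — which is what lets it move AUROC, not just calibration error. The paper does not specify its calibrator's exact form, so the sketch below is a hypothetical minimal version: a two-feature logistic fit where the second feature is a binary flag from some vision-grounded hallucination detector (the detector itself is assumed, not implemented):

```python
import math

def fit_logistic(features, labels, lr=0.1, epochs=3000):
    """Fit sigmoid(w·x + b) by gradient descent.
    Each feature vector here is (raw_confidence, hallucination_flag)."""
    w = [0.0] * len(features[0])
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for i, xi in enumerate(x):
                grad_w[i] += (p - y) * xi / n
            grad_b += (p - y) / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy data: confidence is uninformative (always 0.9), but an assumed
# hallucination detector flags exactly the incorrect answers.
feats = [(0.9, 0)] * 6 + [(0.9, 1)] * 4
hit = [1] * 6 + [0] * 4
w, b = fit_logistic(feats, hit)
# The fitted weight on the hallucination flag is negative: a flagged
# answer gets its calibrated confidence pushed down.
print(w[1] < 0)  # → True
```

In this toy setup raw confidence alone cannot separate right from wrong answers at all, while the hallucination feature can — a stylized version of why the study saw AUROC gains, especially on open-ended questions.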

All benchmark results and comparisons in this study are based on the researchers' own evaluations, as noted in the arXiv preprint, and have not yet been independently peer-reviewed.

A Practical Roadmap for Deploying Medical AI

The paper's recommendations are direct. Post-hoc calibration should become standard practice when deploying vision-language models in medical VQA settings, replacing reliance on raw model confidence outputs. For teams seeking stronger performance, integrating hallucination detection signals into the confidence pipeline offers measurable additional gains without requiring model retraining.

The study also implicitly raises questions for model developers. If overconfidence is structural — baked into how these models are trained rather than addressable through prompting — then calibration and hallucination detection may need to be treated as essential infrastructure rather than optional add-ons.

What This Means

Clinicians and healthcare AI developers should not rely on the confidence scores that vision-language models produce by default — post-hoc recalibration combined with hallucination detection currently represents the most evidence-backed approach to making medical AI outputs genuinely trustworthy.