A new AI metric called VERT outperforms existing methods for automatically evaluating radiology reports, improving agreement with expert radiologists by up to 11.7% relative to the current leading approach; a fine-tuned version achieves gains of up to 25% while running 37 times faster, according to researchers publishing on arXiv.

Automated evaluation of radiology reports has become an active research area as AI-generated clinical text becomes more common. But most existing tools were designed and tested almost exclusively on chest X-ray reports, leaving their reliability for other imaging modalities (such as MRI and CT) and other anatomical regions largely unproven. VERT directly targets this gap.

Why Existing Radiology AI Evaluation Methods Have Limitations

The study benchmarks VERT against three established LLM-as-a-judge metrics: RadFact, GREEN, and FineRadScore. The researchers tested these systems across two expert-annotated datasets, RadEval and RaTE-Eval, which together span multiple imaging modalities and anatomies — a significantly broader evaluation scope than prior work.

The core problem the researchers identified is generalisability. Models fine-tuned narrowly on chest X-ray data may learn dataset-specific patterns rather than clinically meaningful evaluation principles, making them unreliable when applied to, say, a brain MRI report or an abdominal CT finding.

Fine-tuning Qwen3 30B on just 1,300 training samples yields gains of up to 25% and cuts inference time by a factor of up to 37.2.

The team ran a systematic error detection and categorisation study to understand precisely where existing metrics agree or diverge from radiologist judgments. This kind of failure-mode analysis is relatively rare in the literature and gives the findings additional credibility beyond headline correlation numbers.

How VERT Works and What Makes It Different

VERT is designed as a flexible LLM-based evaluation framework tested with both open- and closed-source models, including reasoning and non-reasoning variants of different parameter sizes. Rather than committing to a single model, the researchers explored which configurations — prompt design, model type, model scale — best predict expert radiologist ratings.
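The paper does not publish its exact prompt template, but the general LLM-as-a-judge pattern it builds on can be sketched as follows. The prompt wording, the 1-to-5 scale, and the `Score:` reply format here are illustrative assumptions, not VERT's actual design; the point is only to show the two moving parts such a framework tunes: how the judge is prompted, and how its reply is turned into a number.

```python
import re

def build_judge_prompt(candidate: str, reference: str) -> str:
    """Assemble an LLM-as-a-judge prompt (illustrative template, not the paper's)."""
    return (
        "You are an expert radiologist. Compare the candidate report against "
        "the reference report and rate their clinical agreement from 1 "
        "(contradictory) to 5 (equivalent). Reply with 'Score: <n>'.\n\n"
        f"Reference report:\n{reference}\n\n"
        f"Candidate report:\n{candidate}\n"
    )

def parse_score(model_output: str):
    """Extract the 1-5 rating from the judge model's reply, or None if absent."""
    match = re.search(r"Score:\s*([1-5])", model_output)
    return int(match.group(1)) if match else None
```

In a real pipeline the prompt would be sent to whichever judge model is under evaluation (open- or closed-source, reasoning or not), and the parsed scores would then be correlated against radiologist ratings to compare configurations.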

The study also evaluated few-shot prompting, ensemble methods (combining outputs from multiple models), and parameter-efficient fine-tuning as strategies to boost performance without building an entirely new system from scratch. The most striking result came from fine-tuning: adapting Qwen3 30B on a modest dataset of 1,300 labelled examples delivered the largest accuracy gains while dramatically shrinking the time needed to generate an evaluation.
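Parameter-efficient fine-tuning of the kind used here typically means adapter methods such as LoRA, which freeze the pretrained weights and train only a small low-rank update. The paper does not specify its exact adapter configuration, so the sketch below is a minimal NumPy illustration of the core idea (with made-up dimensions), not the authors' setup: the effective weight is the frozen matrix plus a scaled product of two small trainable matrices, and because one of them starts at zero the adapter is initially a no-op.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16          # hidden size, low rank, scaling (illustrative values)

W = rng.normal(size=(d, d))     # frozen pretrained weight (never updated)
A = rng.normal(size=(r, d)) * 0.01  # small trainable down-projection
B = np.zeros((d, r))            # trainable up-projection, zero-initialised

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Frozen path plus scaled low-rank adapter path."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T
```

Only `A` and `B` (2 * r * d parameters instead of d * d) would receive gradient updates during fine-tuning, which is what makes adapting a 30B-parameter model on 1,300 examples tractable.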

A 37.2-fold reduction in inference time is not a marginal efficiency improvement — it is the difference between a tool that might run overnight on a research server and one that could realistically integrate into a clinical or development workflow.

Benchmarks Are Self-Reported: Context Matters

It is worth noting that all performance figures in this study are self-reported by the paper's authors, as is standard for arXiv preprints that have not yet completed peer review. The correlation improvements are measured against radiologist annotations in RadEval and RaTE-Eval, both established datasets, which lends credibility, but independent replication on other datasets and in clinical environments has not yet been conducted.

The choice of correlation with expert judgment as the primary metric is also meaningful. Radiology report evaluation is notoriously subjective; two experienced radiologists often disagree on borderline findings. A metric that better tracks the centre of expert opinion represents progress, but does not eliminate the underlying ambiguity of the task.
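To make the "correlation with expert judgment" criterion concrete, here is a self-contained sketch of how such agreement is computed. The report scores and radiologist ratings below are hypothetical, and the paper may use a different correlation statistic (e.g. a rank correlation); this just shows the shape of the comparison.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / norm

# Hypothetical data: one automated metric vs. two radiologists on five reports.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.8]
radiologist_a = [5, 2, 4, 1, 4]
radiologist_b = [4, 2, 3, 2, 5]

metric_vs_expert = pearson(metric_scores, radiologist_a)
inter_rater      = pearson(radiologist_a, radiologist_b)
```

The inter-rater figure matters because it is effectively a ceiling: an automated metric cannot be expected to agree with radiologists more consistently than radiologists agree with each other.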

Practical Path to Deployment

The combination of results points toward a practical deployment pathway that prior work had not clearly established. Using a capable open-source model — Qwen3 30B is publicly available — and fine-tuning it on a relatively small, domain-specific dataset appears sufficient to build a radiology report evaluator that is both more accurate than existing tools and fast enough for real-world use.

This matters for AI developers building and validating radiology models, who currently lack reliable automated evaluation tools for anything beyond chest X-rays. It also matters for clinical AI vendors who need scalable quality-assurance mechanisms as regulators increasingly expect demonstration of model performance across diverse patient populations and imaging contexts.

The researchers' systematic error analysis further identified specific categories of clinical finding where LLM-based metrics consistently underperform — information that can guide the next generation of training data collection and prompt engineering.

What This Means

VERT provides AI developers and clinical researchers with a faster, more accurate, and more generalisable tool for automatically evaluating radiology reports across imaging modalities — reducing dependence on scarce expert annotation time while raising performance standards for what automated evaluation can reliably achieve.