A new benchmark called XpertBench shows that the most advanced large language models available today top out at roughly 66% success on authentic expert-level professional tasks, with an average score of around 55% — exposing what its authors describe as a significant "expert gap" in current AI systems.

The research, published on arXiv in April 2025, arrives at a moment when many widely used AI benchmarks are showing signs of saturation. Leading models have begun achieving near-perfect scores on tests like MMLU and GSM8K, making it increasingly difficult for developers and researchers to distinguish capable systems from those that have overfitted to familiar question formats. XpertBench is designed to push past that ceiling.

1,346 Tasks, 80 Categories, Zero Easy Answers

XpertBench consists of 1,346 tasks spanning 80 categories across five professional domains: finance, healthcare, legal services, education, and dual-track research covering both STEM and the humanities. Crucially, these tasks were not assembled by generalist crowdworkers. They were derived from more than 1,000 submissions by domain experts — including researchers from elite academic institutions and practitioners with extensive clinical or industrial experience — a design choice the authors say ensures what they call "superior ecological validity."

Each task is evaluated against a detailed rubric; most rubrics contain between 15 and 40 weighted checkpoints that assess professional rigour rather than simple factual accuracy. The difference matters: a task about a medical diagnosis, for example, is not graded as right or wrong but scored across dimensions such as clinical reasoning, appropriate caveats, and communication clarity.
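The paper does not reproduce its scoring code here, but the mechanics of a weighted-checkpoint rubric are straightforward. The Python sketch below, with hypothetical checkpoint names and weights rather than the benchmark's actual rubrics, shows how a single task score can be computed as a weighted fraction of satisfied criteria:

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """One weighted rubric criterion (field names are illustrative, not the paper's)."""
    description: str
    weight: float
    satisfied: bool  # whether the judge found this criterion met

def rubric_score(checkpoints: list[Checkpoint]) -> float:
    """Return the weighted fraction of satisfied checkpoints, in [0, 1]."""
    total = sum(c.weight for c in checkpoints)
    earned = sum(c.weight for c in checkpoints if c.satisfied)
    return earned / total if total else 0.0

# A toy medical-diagnosis rubric: partial credit for rigour, not right/wrong grading.
rubric = [
    Checkpoint("Identifies the correct differential diagnosis", 3.0, True),
    Checkpoint("Shows a sound clinical reasoning chain", 2.0, True),
    Checkpoint("Includes appropriate caveats and escalation advice", 1.5, False),
    Checkpoint("Communicates clearly for the intended audience", 1.0, True),
]
print(f"Task score: {rubric_score(rubric):.2f}")  # 6.0 / 7.5 -> 0.80
```

Under this kind of scheme, a response can earn most of its points while still losing credit on a safety-relevant dimension, which is exactly the granularity a binary right/wrong grade cannot capture.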

Scored against these rubrics, even the leading models peak at roughly 66% success, with a mean score near 55%, underscoring a pronounced performance ceiling in current AI systems.

A New Evaluation Method to Counter AI Self-Scoring Bias

One persistent problem in AI evaluation is self-rewarding bias: when an LLM is used to judge another LLM's output, it tends to favour responses that resemble its own style, inflating scores. To address this, the XpertBench team introduces a new evaluation approach called ShotJudge.

ShotJudge uses LLM judges that are first calibrated using expert-written few-shot exemplars — real examples of what a high-quality professional response looks like, provided by human domain experts. This calibration step grounds the judge's assessments in human-aligned standards rather than the model's own preferences. According to the authors, the result is an evaluation pipeline that scales without sacrificing alignment to genuine expert expectations.
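The paper's exact prompt format is not detailed here, but the core idea can be sketched as prompt construction: show the judge expert-written, expert-scored exemplars before it sees the candidate response. In the hypothetical Python sketch below, `build_judge_prompt`, the exemplar fields, and `call_llm` are all illustrative names under assumed formats, not the authors' implementation:

```python
# A minimal sketch of few-shot judge calibration in the spirit of ShotJudge.
# `call_llm` stands in for any chat-completion API; the exemplar format,
# rubric text, and scoring scale are assumptions, not the paper's spec.

def build_judge_prompt(rubric: str, exemplars: list[dict], candidate: str) -> str:
    """Prepend expert-graded exemplars so the judge anchors on human standards
    rather than favouring responses that resemble its own style."""
    parts = [f"Score the response below against this rubric:\n{rubric}"]
    for ex in exemplars:  # each exemplar is written and scored by a domain expert
        parts.append(
            f"Example response:\n{ex['response']}\n"
            f"Expert score: {ex['score']}\n"
            f"Expert rationale: {ex['rationale']}"
        )
    parts.append(f"Now score this response in the same way:\n{candidate}")
    return "\n\n---\n\n".join(parts)

# judgment = call_llm(build_judge_prompt(rubric_text, expert_exemplars, model_output))
```

The calibration lives entirely in the exemplars: because the reference scores and rationales come from humans, the judge's notion of "good" is anchored to expert standards rather than to its own stylistic preferences.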

The approach is notable because scalable human evaluation remains one of the hardest problems in AI benchmarking. Hiring domain experts to score thousands of model outputs is expensive and slow; using another AI to score them is fast but unreliable. ShotJudge attempts to thread that needle, though the benchmark's claims about its effectiveness are self-reported by the research team and have not yet been independently validated.

Models Show Divergent Strengths Across Domains

Beyond the headline scores, the empirical results reveal a more nuanced picture of where current models succeed and where they fail. According to the paper, models show non-overlapping strengths depending on task type: some perform relatively well on quantitative reasoning tasks in STEM or finance, while others show comparative strength in linguistic synthesis tasks common to legal writing or humanities research.

This domain-specific divergence has practical implications. An organisation evaluating an AI model for deployment in legal services, for instance, cannot rely on a model's strong performance in coding or mathematics as a proxy for its suitability in that context. XpertBench's breadth is intended to surface these distinctions.
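To make that implication concrete, consider a deliberately simple Python sketch. The per-domain scores below are hypothetical, not figures from the paper; the point is only that selecting a model for one domain can yield a different answer than ranking by overall average:

```python
# Illustrative only: these per-domain scores are invented, not from XpertBench.
scores = {
    "model_a": {"finance": 0.61, "legal": 0.48, "healthcare": 0.57},
    "model_b": {"finance": 0.52, "legal": 0.59, "healthcare": 0.55},
}

def best_for_domain(scores: dict, domain: str) -> str:
    """Pick the model with the highest score in one domain, ignoring overall averages."""
    return max(scores, key=lambda m: scores[m][domain])

print(best_for_domain(scores, "legal"))  # model_b, despite model_a's higher average
```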

The 66% ceiling is particularly striking given that these are the best-performing systems currently available. It suggests that professional domains present structural challenges — ambiguity, contextual judgement, awareness of professional norms — that current model architectures do not reliably handle, even at scale.

Filling a Structural Gap in AI Evaluation

The broader motivation behind XpertBench is a structural problem the AI research community has been grappling with for several years. Standard benchmarks were designed when AI capabilities were far more limited; as models have improved, those benchmarks have lost their ability to differentiate. The community has responded with a wave of harder evaluation sets, but many still rely on multiple-choice formats or tasks that skilled generalists — rather than domain experts — could complete.

XpertBench's authors argue that existing frameworks suffer from three specific weaknesses: narrow domain coverage, over-reliance on generalist tasks, and self-evaluation biases. Their benchmark is positioned as a direct response to all three. Whether it achieves that goal at scale will depend in part on how the research community adopts and independently stress-tests it.

The use of rubric-based, multi-checkpoint scoring also pushes the field toward a more granular understanding of model capability — not just whether a model gets the right answer, but whether it demonstrates the kind of structured professional reasoning that domain experts actually use.

What This Means

For organisations considering AI deployment in high-stakes professional settings, XpertBench offers a sobering calibration: even the most capable models today operate well below expert-level proficiency on authentic professional tasks, and no single model leads across all domains — making careful, domain-specific evaluation essential before deployment.