A new position paper published on arXiv argues that AI benchmarks — the primary tool used to justify deploying generative AI in high-stakes domains — are riddled with systemic validity failures that current evaluation practices cannot fix.

The paper, titled "Science of AI Evaluation Requires Item-level Benchmark Data", contends that without a principled framework for gathering validity evidence and conducting granular diagnostic analysis, the field is building consequential deployment decisions on uncertain foundations. The authors draw on both computer science and psychometrics — the discipline that studies psychological measurement — to make their case.

Why Today's AI Benchmarks Are Structurally Flawed

Current AI evaluations typically report aggregate scores: a model achieves 87% accuracy on a reasoning benchmark, or performs comparably to competitors on a standardised test. The problem, according to the paper, is that these headline numbers obscure what is actually being measured. The authors identify a range of issues they describe as "unjustified design choices" and "misaligned metrics" — problems that remain invisible when only summary statistics are reported.

Item-level analysis enables fine-grained diagnostics and principled validation of benchmarks — something aggregate scores simply cannot provide.

The core argument is that researchers and developers need access to item-level data: the individual questions, prompts, and tasks that make up a benchmark, along with model responses to each one. This granularity allows analysts to identify, for example, whether a model is performing well overall but systematically failing on a specific type of reasoning, or whether certain benchmark questions are poorly constructed and inflate or deflate scores in misleading ways.
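
As a minimal sketch of the kind of diagnostic this enables, the example below uses hypothetical item-level results (invented data and column names, not figures from the paper) to show how a respectable aggregate score can mask a systematic failure on one category of questions:

```python
import pandas as pd

# Hypothetical item-level results: one row per benchmark question,
# recording the model's correctness and the question's category.
results = pd.DataFrame({
    "item_id":  [1, 2, 3, 4, 5, 6, 7, 8],
    "category": ["arithmetic", "arithmetic", "arithmetic", "arithmetic",
                 "multi-step", "multi-step", "multi-step", "multi-step"],
    "correct":  [1, 1, 1, 1, 1, 0, 0, 0],
})

# The headline number looks reasonable...
print(f"overall accuracy: {results['correct'].mean():.0%}")  # 62%

# ...but per-category accuracy exposes a systematic failure mode
# that the aggregate score hides entirely.
print(results.groupby("category")["correct"].mean())
# arithmetic    1.00
# multi-step    0.25
```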

What Psychometrics Teaches AI About Testing

The paper's methodological contribution is its explicit borrowing from psychometrics, a field with decades of experience designing and validating tests used in education, clinical assessment, and employment screening. Psychometricians have long studied concepts like construct validity — whether a test actually measures what it claims to measure — and item response theory, which models the relationship between a test-taker's ability and their probability of answering a given question correctly.
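
To make the item response theory idea concrete, here is a minimal sketch of the two-parameter logistic model commonly used in psychometrics; this is a textbook formulation, not one taken from the paper:

```python
import math

def irt_2pl(ability: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Two-parameter logistic IRT model: the probability that a test-taker
    with the given ability answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# For a test-taker of average ability (0 on the latent scale), an easy item
# (difficulty -1) is answered correctly far more often than a hard one (+2).
print(round(irt_2pl(0.0, -1.0), 2))  # ~0.73
print(round(irt_2pl(0.0, 2.0), 2))   # ~0.12
```

Estimating parameters like difficulty and discrimination requires exactly the kind of per-item response data the authors are calling for; they cannot be recovered from an aggregate accuracy figure.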

The authors argue that AI evaluation has largely ignored these frameworks, and that this neglect explains many of the benchmark problems the research community keeps rediscovering. Benchmarks saturate, with models scoring near ceiling, yet researchers are never confident the benchmark measured the intended capability in the first place. New benchmarks are created, and the cycle repeats.

By applying item-level analysis, researchers can examine latent constructs: the underlying capabilities a benchmark is implicitly testing, which may differ substantially from what the benchmark designers intended. The paper includes illustrative analyses showing how item-level data reveals these hidden structures in ways aggregate scores cannot.
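
One classical way to probe for latent structure, sketched below on synthetic data, is to examine how item responses covary: if a benchmark really measures a single capability, one dominant dimension should explain most of the shared variance. This is a generic illustration of the technique, not one of the paper's analyses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic item-level responses for 200 models on 10 items: the first five
# items are driven by one latent skill, the last five by a different one.
skill_a = rng.normal(size=(200, 1))
skill_b = rng.normal(size=(200, 1))
items = np.hstack([
    (skill_a + 0.5 * rng.normal(size=(200, 5)) > 0).astype(float),
    (skill_b + 0.5 * rng.normal(size=(200, 5)) > 0).astype(float),
])

# Eigenvalues of the inter-item correlation matrix: two large eigenvalues
# suggest the benchmark taps two distinct constructs rather than one.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(items, rowvar=False))
print(np.round(np.sort(eigenvalues)[::-1], 2))
```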

OpenEval: A Repository Built for Rigorous Analysis

To move from argument to action, the authors introduce OpenEval, described as a growing repository of item-level benchmark data. The platform is designed to support what the paper calls "evidence-centered AI evaluation" — a structured approach in which the evidence gathered from a benchmark can be traced back to specific design choices and validated against clear standards.
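
Since the abstract does not describe OpenEval's data format, the sketch below is purely hypothetical: it illustrates the sort of fields an item-level record would need for evidence to be traceable from a score back to an individual item, the construct it was meant to probe, and the scoring rule applied.

```python
from dataclasses import dataclass

@dataclass
class ItemRecord:
    """Hypothetical item-level record; not OpenEval's actual schema."""
    benchmark: str            # which benchmark the item belongs to
    item_id: str              # stable identifier for the question or task
    prompt: str               # the text presented to the model
    intended_construct: str   # the capability the item is meant to measure
    model: str                # which system produced the response
    response: str             # raw model output
    correct: bool             # outcome under the benchmark's scoring rule
    scoring_rule: str         # how correctness was decided (exact match, judge, ...)
```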

The repository is positioned as a community resource, with the authors explicitly framing it as a catalyst for broader adoption of item-level practices. Details about current coverage — which benchmarks are included, how many items are available — are not specified in the abstract, and the full scope of OpenEval would require review of the complete paper.

It is worth noting that this is a position paper, meaning it presents an argued stance rather than reporting experimental results. The claims about validity failures in current benchmarks are substantiated through analysis and illustration rather than a systematic empirical study of benchmark quality across the field.

A Credibility Problem With Real Consequences

The stakes the authors identify are concrete. Generative AI systems are being deployed in healthcare, legal services, education, and other high-stakes domains, and benchmark performance is frequently cited as justification for those deployments. If the benchmarks themselves are poorly validated, the evidential chain supporting real-world deployment decisions becomes unreliable.

This concern sits within a broader, ongoing debate in AI research about benchmark integrity. Issues such as data contamination (where training data overlaps with test sets), benchmark overfitting, and the gap between benchmark performance and real-world utility have all attracted significant research attention in recent years. The position paper adds a more foundational critique: even setting aside contamination, the way benchmarks are designed and reported is not fit for purpose.

The psychometrics framing is notable because it suggests the solution already exists in an adjacent discipline. Rather than calling for entirely new methodology, the authors are advocating for cross-disciplinary borrowing — applying tools that have been stress-tested in high-stakes human assessment contexts to the problem of AI evaluation.

What This Means

For researchers, developers, and policymakers relying on benchmark results to make decisions about AI systems, this paper is a direct challenge to treat evaluation science with the same rigour applied to the systems being evaluated — and OpenEval offers a concrete starting point for doing so.