The best-performing frontier AI model scores just 20% accuracy on DRBENCHER, a newly published benchmark that tests whether AI agents can combine web research and mathematical computation in a single task, a capability gap that existing evaluations have largely ignored.

The researchers behind the project, whose paper was posted to arXiv in April 2025, argue that today's AI benchmarks paint a false picture of agent performance by testing browsing and calculation separately. In practice, real-world research tasks require both skills together: finding the right entity, retrieving specific properties, and then doing the math.

That 20% ceiling exposes how poorly current benchmarks reflect what agents actually need to do.

What DRBENCHER Actually Tests

The benchmark is not a static dataset but a synthetic benchmark generator — a system that creates new questions on demand, making it harder for models to game through memorisation. Each question is built around four strict criteria.

Verifiability means every answer is computed by executing code against a knowledge graph, so there is always a definitive correct answer. Complexity requires multi-hop reasoning: a model must first identify an entity, retrieve one or more of its properties, and then perform domain-specific computation. Difficulty is enforced through a two-stage filter that removes any question the generating model itself can already answer. Diversity is maintained through a greedy max-min embedding filter designed to maximise coverage across topics.
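The diversity criterion, greedy max-min selection over question embeddings, can be illustrated with a minimal sketch. The paper's actual implementation is not described here, so the function name, the distance metric, and the centroid-based seeding are all assumptions for illustration:

```python
import numpy as np

def greedy_max_min_select(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k questions whose minimum pairwise distance is large,
    maximising spread across the embedding space. Returns selected indices."""
    # Seed with the item farthest from the centroid (one plausible heuristic).
    centroid = embeddings.mean(axis=0)
    first = int(np.argmax(np.linalg.norm(embeddings - centroid, axis=1)))
    selected = [first]
    # min_dist[i] = distance from item i to its nearest already-selected item.
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))  # the candidate farthest from the set
        selected.append(nxt)
        d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
    return selected
```

Each round adds the question most distant from everything already chosen, which is what pushes coverage outward rather than clustering around popular topics.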

These criteria apply across five domains: biochemistry, financial analysis, geophysics, cybersecurity, and history.

How the Pipeline Works

The research team built what they call an "answer-first" pipeline. Rather than writing a question and hoping for a valid answer, the system starts by computing a ground-truth answer from structured knowledge-graph data, then generates a question that leads to it. This approach sidesteps a common problem in benchmark design: questions that sound reasonable but turn out to have ambiguous or unverifiable answers.
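A toy version of the answer-first idea looks like the following. The knowledge graph, entity, and figures are invented for illustration; nothing here comes from the paper's actual graph or prompt templates:

```python
from dataclasses import dataclass

# Illustrative knowledge graph: entity -> property -> value.
KG = {
    "AcmeCorp": {"revenue_2023": 1200.0, "revenue_2022": 1000.0},
}

@dataclass
class Item:
    question: str
    answer: float

def answer_first_generate(entity: str) -> Item:
    """Compute the ground-truth answer by executing code against the graph
    first, then phrase a question guaranteed to have exactly that answer."""
    props = KG[entity]
    new, old = props["revenue_2023"], props["revenue_2022"]
    answer = round((new - old) / old * 100, 2)  # executed, never guessed
    question = f"By what percentage did {entity}'s revenue grow from 2022 to 2023?"
    return Item(question=question, answer=answer)
```

Because the answer is computed before the question is written, an ambiguous or unverifiable question can never enter the benchmark in the first place.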

Human evaluators assessed 76% of generated questions as valid, rising to 84% when questions involving outdated data are excluded. The remaining errors followed a clear pattern: 35% of failures stemmed from stale knowledge-graph entries, where real-world facts had changed after the graph was compiled. The researchers flag this as an inherent limitation of any system reasoning over time-sensitive data, not a flaw unique to their design.

Compared to Existing Benchmarks

The paper compares DRBENCHER's semantic diversity with three well-known evaluations: BrowseComp+, MATH-500, and GPQA. According to the authors, DRBENCHER achieves the highest semantic diversity of the group, meaning its questions cover a broader and less repetitive range of concepts.

That distinction matters because narrow benchmarks can flatter models that have seen similar questions during training. A benchmark with high semantic diversity is harder to overfit and more likely to reflect genuine capability.
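The article does not specify how the authors measure semantic diversity; one common choice is the mean pairwise cosine distance between question embeddings, sketched here as an assumed stand-in:

```python
import numpy as np

def mean_pairwise_cosine_distance(embeddings: np.ndarray) -> float:
    """Average cosine distance over all unordered question pairs.
    Higher values mean broader, less repetitive topical coverage."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T                      # cosine similarity matrix
    iu = np.triu_indices(len(x), k=1)   # each unordered pair exactly once
    return float(np.mean(1.0 - sims[iu]))
```

Under this metric, a benchmark whose questions all paraphrase each other scores near 0, while one whose questions point in unrelated directions scores near 1.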

The 20% accuracy figure for the best-performing frontier model is self-reported by the research team based on their own evaluation runs. Independent replication has not yet been published.

The Blind Spot in Agent Evaluation

The core problem DRBENCHER addresses is not new, but it has become more urgent. AI agents — systems that autonomously browse the web, write and execute code, and chain together multi-step reasoning — are being deployed in research, finance, and scientific workflows. Yet the benchmarks used to evaluate them typically test one skill at a time.

A model can score highly on a browsing benchmark by retrieving facts without computing anything, and score highly on a maths benchmark by calculating without needing to look anything up. Neither score tells developers how well the agent performs when both are required simultaneously — which is most of the time in applied settings.

DRBENCHER is designed to close that gap by making the combination mandatory. A model cannot succeed by retrieving the right page if it cannot then compute the correct answer, and it cannot succeed by calculating correctly if it retrieves the wrong property from the wrong entity.

Synthetic Generation as a Scalability Strategy

One practical advantage of a generator over a fixed dataset is longevity. Static benchmarks tend to become obsolete as models train on their contents — a problem sometimes called "benchmark contamination." Because DRBENCHER generates questions procedurally from a knowledge graph, new questions can be produced continuously, and the difficulty filter ensures the generating model cannot trivially solve them.
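The difficulty filter described above can be sketched in miniature. The paper describes a two-stage filter without detailing its internals, so this single-pass version with hypothetical `model_answer` and `true_answer` callables is only illustrative:

```python
from typing import Callable

def difficulty_filter(
    candidates: list[str],
    model_answer: Callable[[str], float],
    true_answer: Callable[[str], float],
    tol: float = 1e-6,
) -> list[str]:
    """Keep only questions the generating model cannot already solve,
    so the surviving set stays hard for the model that produced it."""
    kept = []
    for q in candidates:
        try:
            solved = abs(model_answer(q) - true_answer(q)) <= tol
        except Exception:
            solved = False  # a failed or malformed attempt counts as unsolved
        if not solved:
            kept.append(q)
    return kept
```

Because the filter is re-run against the current generating model each time questions are produced, the benchmark's difficulty floor moves with the model rather than being fixed at publication time.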

The stale-data problem identified by the researchers cuts the other way, however. Knowledge graphs are snapshots; the real world keeps moving. Questions about financial figures, geopolitical boundaries, or security vulnerability databases can become unanswerable — or have their answers change — as underlying facts shift. The 35% error rate attributable to outdated entries suggests this is a non-trivial operational challenge for anyone deploying the benchmark at scale.

What This Means

For developers building and evaluating AI agents, DRBENCHER establishes that combining web retrieval with computation remains a largely unsolved problem — and that a 20% ceiling on current frontier models should prompt serious scrutiny of how agent capabilities are being measured and marketed.