A large-scale empirical study posted to arXiv challenges the prevailing assumption that equipping AI web agents with specialised tools reliably improves their performance, finding that gains are inconsistent and that tool use can introduce measurable side effects.

Web agents — AI systems designed to navigate browsers, fill forms, and complete tasks on the internet autonomously — have become a focal point of applied AI research. Over the past two years, a growing number of papers have moved beyond basic browser interactions toward "tool use," a paradigm in which agents call discrete, higher-level functions rather than simulating individual clicks and keystrokes. The intuition is straightforward: purpose-built tools should make agents faster, more reliable, and less error-prone. This new study suggests the reality is considerably more complicated.
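To make the distinction concrete, here is a minimal illustrative sketch of the two styles. All function and selector names are hypothetical, invented for illustration; they do not come from the paper or any particular agent framework.

```python
# Atomic-action style: the agent emits low-level browser operations,
# one click or keystroke at a time.
atomic_trace = [
    ("click", "#departure-city"),
    ("type", "#departure-city", "London"),
    ("click", "#arrival-city"),
    ("type", "#arrival-city", "Paris"),
    ("click", "#search-button"),
]

# Tool-use style: the same intent expressed as a single higher-level call.
def search_flights(origin: str, destination: str) -> dict:
    """Hypothetical tool that would wrap the click/type sequence above."""
    return {"tool": "search_flights", "origin": origin, "destination": destination}

call = search_flights("London", "Paris")
print(len(atomic_trace), "atomic steps vs one call to", call["tool"])
```

The appeal of the second style is obvious: five brittle DOM interactions collapse into one named function. The study's point is that this collapse is not free, which is where the side effects discussed below come in.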

Prior Research Drew on Too-Small, Non-Comparable Experiments

The researchers identify a structural problem with the existing literature: most studies demonstrating the value of tool use were conducted at limited experimental scales, often with a single backbone model or a narrow set of benchmarks, and in settings that resist direct comparison. This makes it difficult to know whether positive results reflect genuine capability gains or the particular conditions of each experiment.


The new study attempts to correct for this by running a controlled comparison across diverse tool sources, multiple backbone language models, different tool-use frameworks, and several evaluation benchmarks. The authors do not name a single winning configuration. Instead, they report that the picture is messier than the field has generally acknowledged.

Three Questions the Field Has Left Unanswered

The paper frames its investigation around three core questions that, the authors argue, remain genuinely unresolved despite significant prior work. First, do tools provide consistent performance gains for web agents, or do results vary enough across settings to undermine broad claims? Second, what practical design principles distinguish tools that actually help from those that do not? Third, what side effects does tool use introduce — and how significant are they?

On the first question, the study's findings are cautionary. According to the authors, tools do not reliably improve agent performance across all tested conditions. Some configurations benefit, others do not, and the variance appears to depend heavily on which model is serving as the agent's backbone and which benchmark is used for evaluation. This represents a different framing from more optimistic conclusions that have circulated in recent work.

On side effects, the paper flags an underexplored risk. When agents operate through higher-level tool abstractions rather than atomic browser actions, they may lose visibility into certain page states, make harder-to-diagnose errors, or behave in ways that are less predictable. The study does not quantify a single failure rate but identifies the existence and variety of these effects as a reason for caution.

What Makes a Tool Actually Useful

The study's second contribution is a set of practical design principles for effective tools, derived from the controlled experiments rather than stated as prior assumptions. The paper does not provide a simple checklist, and readers should note that these findings come from the authors' own experimental setup and have not yet undergone peer review. The thrust, however, is that tool design cannot be treated as an afterthought: a tool that performs well with one backbone model may perform poorly with another, so developers cannot assume transferability.

This finding has direct practical relevance. Many teams building web agents adopt tool libraries developed by others, assuming the published performance numbers will translate to their own deployment context. The study's evidence suggests that assumption deserves more scrutiny.

Benchmarks and the Measurement Problem

The choice of evaluation benchmark also emerges as a significant variable in the study's findings. Different benchmarks reward different agent behaviours, and a tool-use strategy that scores well on one may underperform on another. This is not a new problem in AI research, but the paper's multi-benchmark design makes the scale of the variation visible in a way that single-benchmark studies cannot.

It is worth noting that all results reported in this paper are the authors' own experimental findings, not independently verified by a third party. The study has been posted as a preprint and has not yet been peer-reviewed. That said, the methodological approach — controlling for model, framework, and benchmark simultaneously — represents a more rigorous experimental design than much of the prior work it critiques.

What This Means

Developers and researchers building or evaluating web agents should treat published tool-use performance claims with greater scepticism, particularly when those claims come from studies using a single model or benchmark. Before drawing conclusions about what works, they should test tool configurations directly against their own target conditions.
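One way to act on that recommendation is to evaluate every (model, tool configuration, benchmark) cell of a small grid rather than a single headline number. The sketch below is hypothetical: the model, configuration, and benchmark names are placeholders, and `run_episode` is a deterministic stub where a real harness would execute an agent rollout.

```python
from itertools import product

# Placeholder grid: substitute your own models, configurations, and benchmarks.
MODELS = ["model-a", "model-b"]
TOOL_CONFIGS = ["no-tools", "with-tools"]
BENCHMARKS = ["bench-x", "bench-y"]

def run_episode(model: str, tools: str, benchmark: str, task_id: int) -> bool:
    # Deterministic stub standing in for a real agent rollout; returns
    # whether the task succeeded.
    return (task_id + len(model + tools + benchmark)) % 3 != 0

def success_rate(model: str, tools: str, benchmark: str, n_tasks: int = 50) -> float:
    wins = sum(run_episode(model, tools, benchmark, t) for t in range(n_tasks))
    return wins / n_tasks

# Evaluate every cell of the grid.
results = {
    (m, c, b): success_rate(m, c, b)
    for m, c, b in product(MODELS, TOOL_CONFIGS, BENCHMARKS)
}

# Report the tool-use delta per (model, benchmark) cell instead of one average,
# since the study's finding is precisely that this delta varies by cell.
for m, b in product(MODELS, BENCHMARKS):
    delta = results[(m, "with-tools", b)] - results[(m, "no-tools", b)]
    print(f"{m} on {b}: tool-use delta = {delta:+.2f}")
```

Reporting per-cell deltas rather than a single averaged score makes exactly the kind of variance the study describes visible: a configuration that helps one model on one benchmark can be neutral or harmful elsewhere.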