A preprint posted to arXiv argues that single large language models consistently match or outperform multi-agent AI systems on complex reasoning tasks when both are given equal computational resources, directly challenging a widely held assumption in AI development.

The AI research community has invested heavily in multi-agent systems (MAS), architectures in which multiple AI models collaborate, debate, or specialise to solve problems together. The expectation has been that coordination produces reasoning gains greater than any single model could achieve alone. The new paper, posted to arXiv's cs.CL category, argues that this belief rests on a methodological flaw: multi-agent systems have routinely been evaluated with access to more computation than their single-agent counterparts.

The Hidden Cost Inflating Multi-Agent Results

When multiple agents each perform reasoning steps, the total number of "thinking tokens" — the computational work the model does before producing an answer — multiplies across agents. Previous benchmarks often compared multi-agent outcomes against single-agent outcomes without accounting for this disparity. The researchers describe this as "confounded by increased test-time computation."
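The multiplication is easy to see with a toy calculation. The agent counts and per-call budgets below are illustrative placeholders, not figures from the paper:

```python
def total_thinking_tokens(n_agents: int, rounds: int, tokens_per_call: int) -> int:
    """Total reasoning tokens consumed when every agent thinks in every round."""
    return n_agents * rounds * tokens_per_call

# A single agent making one reasoning pass.
single = total_thinking_tokens(n_agents=1, rounds=1, tokens_per_call=4_000)

# A debate-style system: 3 agents, 2 rounds, the same per-call budget.
multi = total_thinking_tokens(n_agents=3, rounds=2, tokens_per_call=4_000)

print(single, multi)  # 4000 24000 -- the multi-agent run uses 6x the compute
```

Comparing the two systems' accuracy without correcting for that sixfold gap conflates architecture with raw compute.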

In the authors' framing, many reported advantages of multi-agent systems are better explained by unaccounted computation and context effects than by inherent architectural benefits.

To correct for this, the study normalised the reasoning-token budget across conditions, ensuring each system — single or multi-agent — consumed the same total computational resource before results were compared.
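One simple way to enforce a matched-budget comparison (a sketch of the idea; the paper's exact allocation scheme may differ) is to fix the total and divide it across every reasoning call a system makes:

```python
def per_call_budget(total_budget: int, n_agents: int, rounds: int) -> int:
    """Split a fixed total reasoning-token budget evenly across all agent calls."""
    calls = n_agents * rounds
    return total_budget // calls

# With a 12,000-token total, a lone agent gets the whole budget...
print(per_call_budget(12_000, n_agents=1, rounds=1))  # 12000

# ...while 3 agents over 2 rounds each get 2,000 tokens per call.
print(per_call_budget(12_000, n_agents=3, rounds=2))  # 2000
```

Under this constraint, any remaining performance gap reflects how the architecture uses compute, not how much compute it was given.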

Information Theory as the Theoretical Foundation

The paper doesn't rely solely on empirical results. The researchers ground their argument in the Data Processing Inequality, a principle from information theory stating that processing data through additional steps can only reduce or preserve information — never increase it. Applied to language model reasoning, this suggests that routing a problem through multiple agents introduces coordination overhead that can degrade, not enhance, the information available to reach a correct answer.
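The inequality itself can be stated compactly. If the original problem, an intermediate agent's output, and a downstream agent's input form a Markov chain, the information the final stage retains about the problem can never exceed what the intermediate stage retained:

```latex
X \to Y \to Z \quad \Longrightarrow \quad I(X; Z) \le I(X; Y)
```

Here \(I(\cdot\,;\cdot)\) is mutual information; the mapping of \(X, Y, Z\) onto problem, intermediate agent, and downstream agent is our illustrative reading of the paper's argument.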

Under this framework, a single agent making full use of its context window should be more information-efficient than a chain of agents, each of which receives only a partial or processed version of the original problem. The theory also makes a specific, testable prediction: multi-agent systems should become competitive when a single agent struggles to use its context effectively, for instance on tasks whose inputs exceed practical context limits.

Three Model Families, Consistent Results

The empirical testing covered three model families: Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5, each evaluated in a single-agent configuration and in several multi-agent architectures. The focus was multi-hop reasoning tasks, problems that require chaining several logical or factual steps to reach an answer and are widely regarded as a meaningful test of genuine reasoning capability.

Across all three families, single-agent systems matched or outperformed multi-agent configurations when reasoning tokens were held constant. The results held across different multi-agent architectures, suggesting this is not an artefact of one particular coordination strategy.

Benchmarks and APIs Under Scrutiny

Beyond the core performance comparison, the researchers conducted what they call a "detailed diagnostic analysis" of both evaluation methodology and system behaviour. They identified two sources of misleading results in prior work.

First, standard benchmarks used to evaluate multi-hop reasoning contain artefacts — structural properties that can artificially favour multi-agent approaches without reflecting genuine reasoning improvement. Second, and notably, the study flags significant artefacts in API-based budget control, particularly in Gemini 2.5. When researchers attempt to control how much thinking a model does via API parameters, the actual computational behaviour may not match the intended constraint — meaning some results in the literature may reflect unintended compute differences rather than architectural advantages.
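Mismatches of this kind can be audited by logging the budget requested from the API alongside the token usage it reports back. The sketch below is ours, not the paper's diagnostic; the (requested, reported) pairs are invented and the field names are hypothetical:

```python
def budget_violation(requested: int, reported: int, tolerance: float = 0.1) -> bool:
    """Flag a run whose reported thinking-token usage deviates from the
    requested budget by more than the given relative tolerance."""
    if requested == 0:
        return reported > 0
    return abs(reported - requested) / requested > tolerance

# Hypothetical (requested_budget, reported_usage) pairs from benchmark logs.
runs = [(1024, 1010), (1024, 4096), (2048, 2100)]

flagged = [r for r in runs if budget_violation(*r)]
print(flagged)  # [(1024, 4096)] -- this run thought 4x longer than requested
```

Runs that blow through their nominal budget, like the flagged pair above, would silently inflate the apparent performance of whichever condition they belong to.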

These findings matter because the field depends on reproducible, comparable benchmarks. If the tools used to enforce fair comparisons are themselves unreliable, published performance gaps between systems may be systematically misleading.

When Multi-Agent Systems Still Make Sense

The paper does not argue that multi-agent systems are useless. The information-theoretic framework explicitly predicts scenarios where they become competitive or necessary. When a single agent's effective context utilisation degrades — on tasks with extremely long context requirements, or when memory and retrieval limitations reduce how much relevant information an agent can actually use — distributing work across agents may recover performance.

Similarly, when an application genuinely requires more total computation and that cost is acceptable, multi-agent systems may still deliver better outcomes. The researchers' argument is narrower but important: the architectural coordination of multi-agent systems does not itself provide reasoning gains. Any advantage is attributable to additional compute, not design.

What This Means

Organisations and developers building AI systems for complex reasoning tasks should scrutinise whether multi-agent architectures deliver genuine performance improvements or simply spend more compute — and benchmark comparisons in this space should explicitly account for total reasoning-token usage before drawing conclusions about architectural superiority.