A new benchmarking framework has revealed that several open-weight AI models suffer accuracy collapses of up to 55% on average, and of up to 100% on individual problem types, when mathematical questions are presented in slightly unfamiliar formats, according to research published on arXiv.
The study challenges the prevailing narrative that high scores on standard mathematical benchmarks reflect genuine reasoning capability. Researchers designed a 14-technique perturbation pipeline applied to the AIME 2024 dataset — a set of challenging competition mathematics problems — and evaluated eight state-of-the-art models against the resulting benchmark. The perturbations do not change the underlying mathematics; they alter textual formatting, presentation style, and problem structure to test whether models truly understand problems or have simply learned to recognize familiar patterns.
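The paper's 14 specific techniques are not enumerated here, but format-only perturbations of this kind can be sketched in a few lines. The transformations below (paragraph-break injection and variable renaming) are illustrative assumptions, not the authors' actual pipeline; the point is that each one changes presentation while leaving the underlying mathematics intact.

```python
import re

def inject_whitespace(problem: str) -> str:
    """Format-only change: turn sentence breaks into paragraph breaks."""
    return problem.replace(". ", ".\n\n")

def rename_variables(problem: str, mapping: dict[str, str]) -> str:
    """Format-only change: consistently rename variable letters."""
    for old, new in mapping.items():
        problem = re.sub(rf"\b{re.escape(old)}\b", new, problem)
    return problem

def perturb(problem: str) -> str:
    """Chain format-level perturbations; the underlying math is untouched."""
    problem = rename_variables(problem, {"n": "k"})
    return inject_whitespace(problem)

original = ("Find the least positive integer n such that "
            "n^2 + 1 is divisible by 13. Give your answer as an integer.")
print(perturb(original))
```

A model that has genuinely understood the problem should be indifferent to such rewrites; a model that has memorized surface patterns may not be.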
When Formatting Changes, Reasoning Falls Apart
The results split sharply along model type. Frontier closed-weight models demonstrated meaningful resilience to the perturbations, maintaining relatively stable performance across formatting variations. Open-weight reasoning models, by contrast, exhibited significant collapses — reflecting not minor dips but wholesale failures on problems the models could otherwise solve.
This distinction matters practically. Open-weight models — those whose parameters are publicly available and can be run independently — are increasingly the foundation for enterprise deployments, research applications, and consumer tools. A model that scores well on a published leaderboard but fails when inputs deviate from familiar formatting represents a significant reliability risk in real-world use.
A Second Problem: Reasoning Steps That Contaminate Each Other
The researchers did not stop at formatting perturbations. They designed a second experiment to isolate a different potential weakness: working memory within a model's context window. By forcing models to solve multiple unperturbed mathematics problems sequentially within a single context window, the team could observe whether earlier reasoning steps interfere with later ones.
The results were consistent and concerning. Open-weight models ranging from 7 billion to 120 billion parameters, as well as Claude Opus 4.6 (a closed-weight model from Anthropic), all showed accuracy decay on problems presented later in a sequence. The more reasoning steps accumulated in the context window, the worse subsequent performance became.
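A harness for this second experiment can be sketched as follows. Here `ask_model` is a hypothetical stand-in for whatever inference API is used; it returns the model's answer along with the grown context (prompt, chain of thought, and answer) that carries into the next turn, and accuracy is tallied by sequence position, mirroring the setup described above.

```python
from collections import defaultdict

def evaluate_sequential(ask_model, problem_sets, check_answer):
    """Tally accuracy by sequence position when problems share one context.

    ask_model(context) -> (answer, new_context) is a hypothetical inference
    call that returns the model's answer plus the accumulated context
    carried into the next turn.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for problems in problem_sets:              # each list shares one context
        context = ""
        for position, (question, truth) in enumerate(problems):
            context += f"\nProblem: {question}\n"
            answer, context = ask_model(context)
            total[position] += 1
            correct[position] += check_answer(answer, truth)
    return {pos: correct[pos] / total[pos] for pos in sorted(total)}
```

The degradation the paper reports would show up here as a per-position accuracy curve that slopes downward even though every problem is unperturbed.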
The researchers' explanation is architectural. Standard transformer models use dense attention mechanisms, meaning every token in the context window influences every other. When a model works through a complex problem — generating intermediate steps, exploring dead ends, revising estimates — those steps remain in the context and, according to the authors, "permanently pollute" the model's ability to approach the next problem cleanly. There is no mechanism for the model to reset, to declare prior working irrelevant, and to begin fresh.
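The "pollution" mechanism follows directly from how dense attention is computed. A minimal NumPy sketch (toy dimensions, random vectors, nothing model-specific, with queries, keys, and values all set to the input for simplicity) shows that each new token's representation is a weighted mix over every token in the context, including stale reasoning from earlier problems.

```python
import numpy as np

def causal_attention(x: np.ndarray):
    """Single-head dense causal self-attention over the full context.

    x: (seq_len, d) token representations (Q = K = V = x here).
    Each output row is a weighted mix of ALL earlier rows; nothing lets
    the model zero out stale reasoning tokens from a previous problem.
    """
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                  # (seq, seq) affinities
    mask = np.tril(np.ones_like(scores))           # causal: past tokens only
    scores = np.where(mask == 1, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
context = rng.normal(size=(6, 4))   # 6 tokens: imagine 4 stale + 2 fresh
out, weights = causal_attention(context)
# weights[-1] is strictly positive across all six positions: tokens from an
# earlier problem still shape the newest token's representation.
```

The softmax guarantees every attended weight is nonzero, which is the formal version of the authors' point: there is no built-in way for prior working to contribute exactly nothing.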
What This Reveals About Chain-of-Thought Reasoning
Chain-of-Thought (CoT) prompting — encouraging models to show their working before giving a final answer — has become a standard technique for improving AI reasoning performance. The benchmark results suggest this approach has an underappreciated cost: the accumulated intermediate steps that make CoT effective on single problems may actively degrade performance across multiple problems in a session.
This has direct implications for how AI assistants are used in practice. A tutoring tool, a coding assistant, or a research aid that handles multiple queries in sequence — all common deployment patterns — could be systematically less reliable later in a conversation than at its start, without any obvious signal to the user that degradation is occurring.
The authors argue this points toward a structural requirement for future reasoning architectures: explicit contextual resets built into the model's own reasoning process. Rather than treating a Chain-of-Thought as a continuous, cumulative log, future designs may need to define discrete reasoning units with clean boundaries — preventing earlier working from bleeding into later problems.
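What a reset boundary might look like can be approximated at the application layer today, even without the architectural changes the authors call for. The wrapper below is a sketch under stated assumptions (`call_model` is a hypothetical inference call; the boundary policy of one fresh context per problem is illustrative): it keeps each problem's chain of thought out of the next problem's context, trading cross-problem memory for a clean slate.

```python
def solve_independently(call_model, problems, shared_preamble=""):
    """Application-level stand-in for explicit contextual resets.

    call_model(prompt) -> str is a hypothetical inference call. Instead of
    one cumulative transcript, every problem gets a fresh context holding
    only the shared preamble, so prior chains of thought never carry over.
    """
    answers = []
    for question in problems:
        prompt = f"{shared_preamble}\nProblem: {question}\nThink step by step."
        answers.append(call_model(prompt))    # fresh context for each problem
    return answers
```

The authors' proposal goes further: building this boundary into the model's own reasoning process rather than bolting it on from outside.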
Benchmark Limitations Worth Noting
Several caveats apply. The benchmark results are self-reported by the research team and have not yet undergone formal peer review, as is standard for arXiv preprints. The AIME dataset, while challenging, represents a specific domain — competition mathematics — and degradation patterns may differ across other reasoning tasks such as logical deduction, coding, or scientific analysis.
The paper also does not exhaustively test all major frontier models under the same conditions, meaning the apparent resilience of closed-weight models relative to open-weight ones may partly reflect which specific models were evaluated rather than a categorical difference between model types. The inclusion of Claude Opus 4.6 in the working-memory degradation findings complicates any clean open-versus-closed narrative.
Nonetheless, the methodology — separating formatting sensitivity from working-memory contamination — offers a more granular diagnostic than most existing robustness evaluations, and the scale of the observed drops in open-weight models is difficult to dismiss.
What This Means
Organizations deploying open-weight models for any task requiring sustained or reformatted reasoning should treat published benchmark scores as an upper bound rather than a reliable performance estimate — and the field faces a genuine architectural challenge before AI reasoning can be considered robust.