Researchers have published a new training method designed to make large language models more consistent when prompts vary in wording, format, or language — a problem that causes significant failures in multi-step reasoning tasks.

The paper, posted to arXiv in April 2025, introduces Distributionally Robust Token Optimization (DRTO), which combines token-level Reinforcement Learning from Human Feedback (RLHF) with a mathematical framework called Distributionally Robust Optimization (DRO). The research targets one of the more frustrating and well-documented weaknesses of modern language models: their tendency to fail on questions they should be able to answer, simply because those questions are phrased differently from what appeared in training data.

Why Small Wording Changes Break Capable Models

Large language models are trained on enormous datasets, but that training is uneven. A model might solve a math problem flawlessly when it is phrased one way, then fail completely when the same problem is reworded or reformatted. This fragility is especially pronounced in multi-step reasoning, where a small misread early in a chain of logic can cascade into a wrong answer.

The core issue is distributional shift — the gap between the data a model was trained on and the inputs it encounters in real use. Standard RLHF, which uses human feedback to fine-tune model behaviour, does not explicitly account for this gap. It optimises for average performance, which can mask poor behaviour on edge cases.
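The problem with averaging can be made concrete with a toy calculation. The numbers below are hypothetical, not from the paper; they simply illustrate how a healthy-looking mean can coexist with total failure on a small slice of inputs.

```python
import numpy as np

# Hypothetical per-phrasing accuracies for one underlying question:
# 95 rephrasings the model handles, 5 uncommon phrasings it fails outright.
accuracy = np.array([1.0] * 95 + [0.0] * 5)

print(accuracy.mean())  # average accuracy: 0.95 — looks fine
print(accuracy.min())   # worst case: 0.0 — the failure the average hides
```

An objective that only tracks the mean has no gradient pressure to fix those five phrasings; a worst-case objective, by construction, does.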

In technical terms, DRTO bounds worst-case token-wise rewards by constructing an f-divergence ambiguity set around each training minibatch, which is what gives the method its theoretical robustness guarantee.

How DRTO Works

DRTO addresses distributional shift at the token level — meaning it examines individual words and sub-word units within a prompt, rather than treating each prompt as a single unit. The method builds what the authors call an f-divergence ambiguity set, a mathematically defined region of possible input distributions surrounding each training batch. By optimising for the worst case within that region, the model is forced to perform adequately even on inputs that differ from its training examples.
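The worst-case-within-a-region idea can be sketched numerically. The snippet below is a minimal illustration, not the paper's algorithm: it assumes the f-divergence is the KL divergence, for which the inner worst case has a well-known closed form — the adversary exponentially tilts the empirical distribution toward high-loss tokens, with a temperature `tau` (a hypothetical knob here) controlling how large the ambiguity set is.

```python
import numpy as np

def kl_dro_loss(token_losses, tau=1.0):
    """Worst-case weighted loss over a KL ambiguity set (Lagrangian form).

    For KL-DRO the adversarial distribution is an exponential tilt of the
    empirical one: q_i proportional to exp(loss_i / tau). Smaller tau means
    a larger ambiguity set and more weight on the worst tokens.
    """
    token_losses = np.asarray(token_losses, dtype=float)
    shifted = token_losses / tau
    shifted -= shifted.max()          # subtract max for numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum()          # adversarial distribution q
    return float(np.sum(weights * token_losses))

losses = [0.1, 0.2, 0.15, 3.0]        # one hard token in the minibatch
print(np.mean(losses))                # ordinary average loss
print(kl_dro_loss(losses, tau=0.5))   # robust loss, pulled toward the hard token
```

As `tau` shrinks, the robust loss approaches the single worst token's loss; as it grows, it relaxes back toward the plain average — which is the sense in which the robust objective interpolates between average-case and worst-case training.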

This is distinct from standard robust training approaches, which typically operate at the sample or sentence level. Working at the token level allows DRTO to capture finer-grained variation — the kind that arises when a user swaps a synonym, changes punctuation, or switches from formal to informal phrasing.
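The granularity difference can also be sketched. Again this is an illustrative toy, not DRTO itself: both levels use the same hypothetical exponential-tilt weighting, but applied either to whole-sequence mean losses or to individual token losses.

```python
import numpy as np

def robust_weights(losses, tau):
    """Exponentially tilt weights toward high losses (illustrative KL-DRO form)."""
    z = np.asarray(losses, dtype=float) / tau
    z -= z.max()                      # numerical stability
    w = np.exp(z)
    return w / w.sum()

# Two prompts, four tokens each; prompt 0 contains a single hard token.
token_losses = np.array([[0.1, 0.1, 2.5, 0.1],
                         [0.4, 0.4, 0.4, 0.4]])

# Sequence-level robustness can only reweight whole prompts by mean loss...
seq_w = robust_weights(token_losses.mean(axis=1), tau=0.5)

# ...while token-level robustness can single out the one hard token.
tok_w = robust_weights(token_losses.ravel(), tau=0.5).reshape(token_losses.shape)

print(seq_w)   # per-prompt weights
print(tok_w)   # per-token weights, concentrated on the 2.5-loss token
```

The sequence-level scheme spreads extra weight over all four tokens of the harder prompt, easy ones included; the token-level scheme concentrates it on the one token that actually fails — the finer-grained behaviour the authors attribute to working below the sample level.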

The theoretical foundation draws on established work in robust optimisation, applying it to the specific structure of RLHF pipelines. According to the authors, this combination yields a formal robustness guarantee, not just empirical improvement — though the guarantee holds under the assumptions of the f-divergence framework.

Benchmark Results and What They Show

The authors tested DRTO on two mathematical reasoning benchmarks. On GSM8K, a widely used dataset of grade-school math word problems, DRTO achieved a 9.17% improvement over the baseline. On MathQA, a more complex multiple-choice math benchmark, the improvement was 2.49%. Both results are self-reported by the researchers and have not been independently verified.

The GSM8K gain is notable because that benchmark is already considered relatively mature — many recent methods report marginal improvements, making a near double-digit gain meaningful if it holds under independent testing. The smaller MathQA improvement suggests the method's benefits may vary by task complexity and format.

It is worth noting that benchmark performance on reasoning tasks does not always translate directly to real-world robustness. GSM8K and MathQA test specific, structured problem types; whether DRTO's gains extend to open-ended or domain-specific reasoning remains an open question.

What Happens Next

DRTO is presented as a general framework compatible with existing RLHF pipelines, which means it could in principle be applied to any model currently being fine-tuned with human feedback — a category that includes most major commercial language models. However, the paper does not report results on non-mathematical reasoning tasks, leaving open whether the method generalises to areas like coding, scientific reasoning, or instruction-following under varied formats.

The research also does not address computational overhead. Robust optimisation methods typically require more computation than standard training, and whether DRTO's costs are practical at scale is not discussed in the abstract. These are questions that independent replication and follow-on work will need to address.

The approach joins a growing body of work focused on making language models more reliable under distribution shift — an area that has attracted significant research attention as models move from benchmark settings into deployment environments where input variation is the norm, not the exception.

What This Means

For practitioners fine-tuning language models with RLHF, DRTO offers a potentially practical route to reducing inconsistency under prompt variation — but independent validation at scale will be needed before the method's real-world value can be assessed with confidence.