Emotional language in a question — even when every number and logical relationship stays exactly the same — can reduce an AI model's accuracy on maths problems by up to 10 percentage points, according to new research published on arXiv.
The study introduces TEMPER (Testing Emotional Perturbation in Quantitative Reasoning), a controlled benchmark designed to isolate the effect of emotional framing on large language model performance. Researchers built Temper-5400, a dataset of 5,400 semantically verified pairs drawn from three established benchmarks — GSM8K, MultiArith, and ARC-Challenge — rewriting each problem into emotionally charged variants while preserving all quantities and mathematical relationships.
How Emotional Framing Was Isolated From Other Variables
The core methodological challenge was ensuring that any performance drop could be attributed to emotional style rather than changes in the underlying content. The team developed a controlled "emotion translation" framework that rewrites neutral problems into variants expressing frustration, urgency, or enthusiasm, without altering numbers, operators, or logical structure.
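The paper does not publish its verification code, but one minimal way to enforce the "numbers preserved" constraint is to check that the rewrite contains exactly the same quantities as the original. The sketch below is illustrative only; the function names and the regex-based extraction are assumptions, not the researchers' actual pipeline:

```python
import re

def extract_quantities(text: str) -> list[str]:
    """Pull out numbers (integers, decimals, simple fractions) from a problem."""
    return re.findall(r"\d+(?:\.\d+)?(?:/\d+)?", text)

def quantities_preserved(neutral: str, emotional: str) -> bool:
    """True if the emotional rewrite carries exactly the same multiset of
    numbers as the neutral original -- a necessary (not sufficient)
    condition for semantic equivalence."""
    return sorted(extract_quantities(neutral)) == sorted(extract_quantities(emotional))

neutral = "Sam has 12 apples and gives away 5. How many remain?"
emotional = ("I'm so frustrated -- Sam has 12 apples and gives away 5, "
             "and I STILL can't figure out how many remain!")
assert quantities_preserved(neutral, emotional)
```

A full verifier would also need to confirm operators and logical structure survive the rewrite; a numeric check like this only catches the most obvious drift.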
To confirm that surface-level linguistic change alone was not responsible, the researchers also generated non-emotional paraphrases of the same problems. According to the paper, these neutral rewrites caused no measurable accuracy degradation — directly implicating emotional content, rather than mere rephrasing, as the source of the performance drop.
Neutralizing emotional variants recovers most of the lost performance, showing the degradation is tied to emotional style rather than content corruption.
This finding matters because it points to a specific and addressable vulnerability, not a general sensitivity to varied language.
Eighteen Models Tested, From Small to Frontier Scale
The benchmark was evaluated across 18 language models, ranging from 1 billion parameters up to frontier scale, capturing a broad slice of the current model landscape. The paper does not single out specific model names in the abstract, but the scale of evaluation — from compact, deployable models to the largest available systems — suggests the effect spans architectures of varying size and age.
Accuracy drops of 2 to 10 percentage points emerged consistently when models encountered emotionally framed problems, even though every piece of mathematical information needed to answer correctly remained present. The variance across that range likely reflects differences in model size, training data composition, and instruction tuning — factors the full paper's results section presumably unpacks in detail.
A Lightweight Fix Already Exists
One of the study's more practically useful findings is that the performance loss is largely reversible. When researchers applied a neutralization step — stripping emotional language from queries before passing them to a model — most of the accuracy drop disappeared. According to the paper, this positions neutralization as a "lightweight inference-time mitigation": a pre-processing step that could be applied without retraining or fine-tuning the underlying model.
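The paper does not specify how its neutralization step is implemented; in practice it would likely be an LLM rewrite. As a purely illustrative stand-in, a crude lexicon-based filter shows where such a pre-processing hook would sit in a deployment pipeline (the patterns and function names here are assumptions):

```python
import re

# Illustrative only: a tiny lexicon of emotional interjections. A production
# neutralizer would more plausibly use an LLM rewrite, as the paper's
# "lightweight inference-time mitigation" presumably does.
EMOTIONAL_PATTERNS = [
    r"\bI'?m (?:so|really) (?:frustrated|stressed|desperate)[^.!?]*[.!?]+\s*",
    r"\b(?:please hurry|this is urgent|ASAP)[!.]*\s*",
    r"!{2,}",
]

def neutralize(query: str) -> str:
    """Strip obvious emotional phrasing before the query reaches the model."""
    for pat in EMOTIONAL_PATTERNS:
        query = re.sub(pat, "", query, flags=re.IGNORECASE)
    return query.strip()

raw = "I'm so frustrated!! If 3 workers paint 2 walls in 4 hours, how many walls in 12 hours?"
print(neutralize(raw))
```

The key design point from the paper survives even in this toy version: the mitigation wraps the model rather than modifying it, so no retraining or fine-tuning is required.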
This is meaningful for deployment contexts where user queries arrive in natural, unfiltered language. Customer support tools, tutoring systems, and any application where users submit maths or reasoning questions under real-world conditions — often with frustration or urgency attached — would be affected by the vulnerability TEMPER describes.
Why This Benchmark Design Has Broader Uses
Beyond the specific findings on emotional framing, the researchers argue that their construction procedure offers a general framework for controlled stylistic robustness evaluation. The same approach — rewrite a problem along a defined stylistic axis, verify semantic equivalence, test performance — could in principle be applied to formality, dialect, verbosity, or other dimensions of language variation.
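The procedure generalises cleanly enough to sketch as a loop. Everything below is a hypothetical harness, not the paper's code: `rewrite`, `verify_equivalent`, and `model_answer` are placeholder hooks for whichever stylistic axis, equivalence check, and model are under test.

```python
from typing import Callable

def stylistic_robustness_gap(
    problems: list[dict],                       # each: {"question": ..., "answer": ...}
    rewrite: Callable[[str], str],              # rewrite along one stylistic axis
    verify_equivalent: Callable[[str, str], bool],
    model_answer: Callable[[str], str],
) -> float:
    """Return accuracy(original) - accuracy(rewritten) over verified pairs."""
    base_hits = var_hits = kept = 0
    for p in problems:
        variant = rewrite(p["question"])
        if not verify_equivalent(p["question"], variant):
            continue                            # discard pairs that drift semantically
        kept += 1
        base_hits += model_answer(p["question"]) == p["answer"]
        var_hits += model_answer(variant) == p["answer"]
    return (base_hits - var_hits) / kept if kept else 0.0
```

Swapping in a formality, dialect, or verbosity rewriter — while keeping the verification and scoring fixed — is exactly the reuse the researchers propose.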
This positions TEMPER as both a dataset and a reusable methodology. As AI systems are deployed in increasingly diverse real-world settings, tools that can systematically probe stylistic robustness — rather than relying on naturalistically collected variation — become more valuable to both researchers and developers.
The benchmark also raises questions about evaluation practices across the field. Standard leaderboard benchmarks like GSM8K are written in clean, neutral prose. If emotional framing consistently degrades performance, then scores on neutral benchmarks may overstate how well models perform in realistic conditions where users rarely write like textbooks.
What This Means
Any organisation deploying AI for reasoning tasks — from education to finance to customer service — should treat emotional language sensitivity as a known, testable vulnerability, and consider inference-time neutralization as an immediate mitigation while longer-term fixes are explored.