Emotional language in a question — even when every number and logical relationship stays exactly the same — can reduce an AI model's accuracy on maths problems by up to 10 percentage points, according to new research published on arXiv.
The study introduces TEMPER (Testing Emotional Perturbation in Quantitative Reasoning), a controlled benchmark designed to isolate the effect of emotional framing on large language model performance. Researchers built Temper-5400, a dataset of 5,400 semantically verified pairs drawn from three established benchmarks — GSM8K, MultiArith, and ARC-Challenge — rewriting each problem into emotionally charged variants while preserving all quantities and mathematical relationships.
How Emotional Framing Was Isolated From Other Variables
The core methodological challenge was ensuring that any performance drop could be attributed to emotional style rather than changes in the underlying content. The team developed a controlled "emotion translation" framework that rewrites neutral problems into variants expressing frustration, urgency, or enthusiasm, without altering numbers, operators, or logical structure.
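The paper does not publish its verification code, but one minimal way to enforce the "numbers preserved" constraint is to check that the rewrite contains exactly the same quantities as the original. The sketch below is illustrative only; the function names and the regex-based extraction are assumptions, not the researchers' actual pipeline:

```python
import re

def extract_quantities(text: str) -> list[str]:
    """Pull out numbers (integers, decimals, simple fractions) from a problem."""
    return re.findall(r"\d+(?:\.\d+)?(?:/\d+)?", text)

def quantities_preserved(neutral: str, emotional: str) -> bool:
    """True if the emotional rewrite carries exactly the same multiset of
    numbers as the neutral original -- a necessary (not sufficient)
    condition for semantic equivalence."""
    return sorted(extract_quantities(neutral)) == sorted(extract_quantities(emotional))

neutral = "Sam has 12 apples and gives away 5. How many remain?"
emotional = ("I'm so frustrated -- Sam has 12 apples and gives away 5, "
             "and I STILL can't figure out how many remain!")
assert quantities_preserved(neutral, emotional)
```

A full verifier would also need to confirm operators and logical structure survive the rewrite; a numeric check like this only catches the most obvious drift.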
To confirm that surface-level linguistic change alone was not responsible, the researchers also generated non-emotional paraphrases of the same problems. According to the paper, these neutral rewrites caused no measurable accuracy degradation — directly implicating emotional content, rather than mere rephrasing, as the source of the performance drop.
Neutralizing emotional variants recovers most of the lost performance, showing the degradation is tied to emotional style rather than content corruption.
This finding matters because it points to a specific and addressable vulnerability, not a general sensitivity to varied language.
Eighteen Models Tested, From Small to Frontier Scale
The benchmark was evaluated across 18 language models, ranging from 1 billion parameters up to frontier scale, capturing a broad slice of the current model landscape. The paper does not single out specific model names in the abstract, but the scale of evaluation — from compact, deployable models to the largest available systems — suggests the effect spans architectures of varying size and age.
Accuracy drops of 2 to 10 percentage points emerged consistently when models encountered emotionally framed problems, even though every piece of mathematical information needed to answer correctly remained present. The variance across that range likely reflects differences in model size, training data composition, and instruction tuning — factors the full paper's results section presumably unpacks in detail.
A Lightweight Fix Already Exists
One of the study's more practically useful findings is that the performance loss is largely reversible. When researchers applied a neutralization step — stripping emotional language from queries before passing them to a model — most of the accuracy drop disappeared. According to the paper, this positions neutralization as a "lightweight inference-time mitigation": a pre-processing step that could be applied without retraining or fine-tuning the underlying model.
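The paper does not specify how its neutralization step is implemented; in practice it would likely be an LLM rewrite. As a purely illustrative stand-in, a crude lexicon-based filter shows where such a pre-processing hook would sit in a deployment pipeline (the patterns and function names here are assumptions):

```python
import re

# Illustrative only: a tiny lexicon of emotional interjections. A production
# neutralizer would more plausibly use an LLM rewrite, as the paper's
# "lightweight inference-time mitigation" presumably does.
EMOTIONAL_PATTERNS = [
    r"\bI'?m (?:so|really) (?:frustrated|stressed|desperate)[^.!?]*[.!?]+\s*",
    r"\b(?:please hurry|this is urgent|ASAP)[!.]*\s*",
    r"!{2,}",
]

def neutralize(query: str) -> str:
    """Strip obvious emotional phrasing before the query reaches the model."""
    for pat in EMOTIONAL_PATTERNS:
        query = re.sub(pat, "", query, flags=re.IGNORECASE)
    return query.strip()

raw = "I'm so frustrated!! If 3 workers paint 2 walls in 4 hours, how many walls in 12 hours?"
print(neutralize(raw))
```

The key design point from the paper survives even in this toy version: the mitigation wraps the model rather than modifying it, so no retraining or fine-tuning is required.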
This is meaningful for deployment contexts where user queries arrive in natural, unfiltered language. Customer support tools, tutoring systems, and any application where users submit maths or reasoning questions under real-world conditions — often with frustration or urgency attached — would be affected by the vulnerability TEMPER describes.
Why This Benchmark Design Has Broader Uses
Beyond the specific findings on emotional framing, the researchers argue that their construction procedure offers a general framework for controlled stylistic robustness evaluation. The same approach — rewrite a problem along a defined stylistic axis, verify semantic equivalence, test performance — could in principle be applied to formality, dialect, verbosity, or other dimensions of language variation.
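The procedure generalises cleanly enough to sketch as a loop. Everything below is a hypothetical harness, not the paper's code: `rewrite`, `verify_equivalent`, and `model_answer` are placeholder hooks for whichever stylistic axis, equivalence check, and model are under test.

```python
from typing import Callable

def stylistic_robustness_gap(
    problems: list[dict],                       # each: {"question": ..., "answer": ...}
    rewrite: Callable[[str], str],              # rewrite along one stylistic axis
    verify_equivalent: Callable[[str, str], bool],
    model_answer: Callable[[str], str],
) -> float:
    """Return accuracy(original) - accuracy(rewritten) over verified pairs."""
    base_hits = var_hits = kept = 0
    for p in problems:
        variant = rewrite(p["question"])
        if not verify_equivalent(p["question"], variant):
            continue                            # discard pairs that drift semantically
        kept += 1
        base_hits += model_answer(p["question"]) == p["answer"]
        var_hits += model_answer(variant) == p["answer"]
    return (base_hits - var_hits) / kept if kept else 0.0
```

Swapping in a formality, dialect, or verbosity rewriter — while keeping the verification and scoring fixed — is exactly the reuse the researchers propose.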
This positions TEMPER as both a dataset and a reusable methodology. As AI systems are deployed in increasingly diverse real-world settings, tools that can systematically probe stylistic robustness — rather than relying on naturalistically collected variation — become more valuable to both researchers and developers.
The benchmark also raises questions about evaluation practices across the field. Standard leaderboard benchmarks like GSM8K are written in clean, neutral prose. If emotional framing consistently degrades performance, then scores on neutral benchmarks may overstate how well models perform in realistic conditions where users rarely write like textbooks.
What This Means
Any organisation deploying AI for reasoning tasks — from education to finance to customer service — should treat emotional language sensitivity as a known, testable vulnerability, and consider inference-time neutralization as an immediate mitigation while longer-term fixes are explored.