Large language models can mimic the shape of human social reasoning but routinely distort its strength, according to a new study posted to arXiv that introduces two novel metrics to measure the gap between how humans and AI systems interpret implied meaning.

The research, posted to arXiv's Computation and Language section in April 2025, tackles a question that sits at the intersection of linguistics and AI: not just whether LLMs understand social meaning qualitatively — the kind of inference humans draw when someone says "about fifty people attended" rather than "exactly fifty" — but whether they get the magnitude of that inference right. According to the authors, the answer is largely no.

A New Ruler for Social Calibration

To make this measurable, the researchers introduce two metrics. The Effect Size Ratio (ESR) captures whether a model's inferences move in the same direction as human judgments — in other words, whether the model's social reading has the right structure. The Calibration Deviation Score (CDS) measures how far the model's inferences drift from human baselines in terms of intensity. Together, they allow a finer-grained diagnosis than simply asking whether a model is "right" or "wrong."
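The paper's formal definitions are not reproduced in this article, but the intuition behind the two metrics can be sketched. Everything below is an illustrative assumption rather than the paper's actual formulation: the function names, the toy ratings, and the choice of a simple mean difference as the effect size are all hypothetical.

```python
from statistics import mean

def effect_size(imprecise, precise):
    # Simplified effect size: the mean shift in inference strength
    # between imprecise-phrasing and precise-phrasing conditions.
    return mean(imprecise) - mean(precise)

def esr(model_imp, model_prec, human_imp, human_prec):
    # Effect Size Ratio (sketch): a positive value means the model's
    # inference moves in the same direction as human judgments;
    # a value near 1 means comparable strength, above 1 amplified.
    return effect_size(model_imp, model_prec) / effect_size(human_imp, human_prec)

def cds(model_ratings, human_ratings):
    # Calibration Deviation Score (sketch): mean absolute gap between
    # model and human inference strength, item by item.
    return mean(abs(m - h) for m, h in zip(model_ratings, human_ratings))

# Hypothetical ratings on a 0-to-1 "inference strength" scale.
human_imp, human_prec = [0.7, 0.6, 0.8], [0.3, 0.2, 0.4]
model_imp, model_prec = [0.9, 0.9, 1.0], [0.2, 0.1, 0.2]

# Right direction but amplified: ESR lands above 1,
# while a nonzero CDS quantifies the intensity drift.
print(esr(model_imp, model_prec, human_imp, human_prec))
print(cds(model_imp + model_prec, human_imp + human_prec))
```

On these toy numbers the model agrees with humans on which phrasing licenses the stronger inference (the ratio is positive) while overshooting its strength (the ratio exceeds 1), which is exactly the dissociation the two metrics are designed to expose.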

The case study centres on numerical imprecision — a well-studied area of pragmatics where word choice carries social weight. Saying a meeting lasted "around two hours" rather than "two hours exactly" implies something about the speaker's knowledge, confidence, or communicative intent. Humans pick up on these cues in nuanced, calibrated ways. The question is whether LLMs do the same.

The study's central claim is that LLMs capture inferential structure while variably distorting inferential strength, and that pragmatic theory provides a useful but incomplete handle for improving the approximation.

Three unnamed frontier LLMs were tested. According to the paper, all three reliably reproduced the qualitative pattern of human social inferences — the models agreed with humans on which cues mattered and which direction those cues pointed. But all three diverged from human baselines on magnitude, tending to amplify inferences beyond what human respondents would endorse.

What Happens When You Tell a Model to Think Like a Listener

The second half of the study tests whether smarter prompting can close the gap. The authors design prompting conditions grounded in two principles from pragmatic theory. The first asks models to reason about linguistic alternatives — the idea that meaning arises partly from what a speaker could have said but didn't. The second asks models to reason about speaker knowledge and motives — inferring why a speaker chose a particular formulation.
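The paper's exact prompt wording is not quoted in this article, so the two conditions can only be illustrated in spirit. The templates below are hypothetical scaffolds written for this piece, not the study's materials; the placeholder names `utterance` and `inference` are likewise assumptions.

```python
# Illustrative scaffolds for the two pragmatic prompting principles.
# Template 1: reason about linguistic alternatives (what the speaker
# could have said but didn't).
ALTERNATIVES_PROMPT = (
    "The speaker said: '{utterance}'.\n"
    "Consider what the speaker could have said instead, such as a more "
    "precise phrasing, but chose not to. Given that choice, rate how "
    "strongly the utterance implies that {inference}, on a 0-to-1 scale."
)

# Template 2: reason about speaker knowledge and motives (why the
# speaker chose this particular formulation).
SPEAKER_MODEL_PROMPT = (
    "The speaker said: '{utterance}'.\n"
    "Consider what the speaker likely knows and why they chose this "
    "phrasing. Given their knowledge and motives, rate how strongly "
    "the utterance implies that {inference}, on a 0-to-1 scale."
)

prompt = SPEAKER_MODEL_PROMPT.format(
    utterance="about fifty people attended",
    inference="the speaker is uncertain of the exact count",
)
print(prompt)
```

The structural difference is small (one template foregrounds unspoken alternatives, the other the speaker's epistemic state), which makes the divergent results reported below all the more striking.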

The results are instructive and somewhat counterintuitive. Prompting for alternative-awareness — the more linguistically classical approach — tended to increase exaggeration, not reduce it. Models prompted this way appeared to over-index on contrast effects, amplifying inferences rather than tempering them.

Prompting for speaker knowledge and motives, by contrast, most consistently reduced magnitude deviation across all three models. And according to the paper's reported benchmark results, combining both components was the only intervention that improved every calibration-sensitive metric across every model tested.

Why Fine-Grained Calibration Remains Elusive

Despite the improvement, the authors are careful not to overstate the result. Fine-grained magnitude calibration — getting not just the direction but the precise intensity of a social inference right — remains only partially resolved even under the best prompting conditions. The gap between how strongly humans read implied meaning and how strongly LLMs do is narrowed, not closed.

This matters practically. Social meaning calibration has direct implications for how LLMs perform in applications that depend on nuanced communication: summarisation, dialogue systems, content moderation, and any context where interpreting implied intent is as important as interpreting literal content. A model that consistently overshoots the social weight of a phrase may misread tone, flag neutral content as loaded, or generate responses that feel socially off.

The study also contributes to a broader methodological conversation. Most evaluations of LLM social or pragmatic reasoning focus on qualitative accuracy — did the model get it right or wrong? The ESR and CDS metrics proposed here push toward a more demanding standard, asking how closely a model's internal calibration tracks human judgment on a continuous scale. Whether these metrics are adopted more widely will depend on whether other researchers find them tractable and generalisable beyond the numerical imprecision domain tested here.

A Useful Frame, Not a Complete Solution

The authors situate their work within a growing body of research asking how well LLMs approximate human linguistic competence — and where that approximation breaks down. The finding that structural fidelity is high but magnitude calibration is variable suggests that current LLMs have absorbed something real about the architecture of social inference, likely from the statistical patterns in training data, but have not reliably learned to weight those inferences the way humans do.

Pragmatic theory, the paper argues, offers a principled toolkit for prompting interventions — but it is not a solution that covers all cases. The divergence between the two prompting strategies tested here (one helping, one hurting) underscores that translating theoretical constructs into effective prompts is itself a non-trivial design problem.

What This Means

For developers building LLMs into communication-sensitive applications, this research suggests that prompting models to reason about speaker intent and knowledge — rather than purely about linguistic form — is the more effective lever for improving social inference accuracy, though a meaningful calibration gap with human judgment persists.