Researchers behind a new benchmark called SAGE have found that even capable large language models regularly fail at real-world customer service tasks, not because they misunderstand users, but because they cannot reliably follow the structured procedures that actual deployments require.

Customer service automation has become one of the most commercially significant applications of large language models, with enterprises deploying AI agents to handle everything from billing disputes to technical support. Yet the tools used to evaluate these systems have struggled to keep pace. Most existing benchmarks test models in static, single-turn scenarios and rely on one-dimensional scoring, giving developers an incomplete picture of how their agents will behave under the pressures of real customer interactions.

How SAGE Turns Messy Procedures Into Testable Graphs

SAGE — which stands for Service Agent Graph-guided Evaluation — addresses this by converting unstructured Standard Operating Procedures (SOPs) into what the researchers call Dynamic Dialogue Graphs. In practice, SOPs are the step-by-step rulebooks that human customer service agents follow: if a customer says X, check Y, then do Z. These procedures are often written in plain language and full of conditional branches. SAGE formalizes them into graph structures that allow automated systems to verify, precisely, whether an AI agent followed the correct logical path through a conversation.
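The core idea is simple to sketch: once an SOP is expressed as a graph of allowed transitions, checking an agent's conversation against it becomes a deterministic path check. The following is a minimal illustration, not SAGE's actual data format; the node names and the billing SOP are invented for the example.

```python
# Illustrative SOP expressed as a graph: each step maps to its valid next steps.
# The nodes and branches here are hypothetical, not taken from the SAGE paper.
sop_graph = {
    "greet": {"classify_intent"},
    "classify_intent": {"verify_identity", "escalate"},
    "verify_identity": {"check_billing", "escalate"},
    "check_billing": {"issue_refund", "explain_charge"},
    "issue_refund": set(),
    "explain_charge": set(),
    "escalate": set(),
}

def path_is_compliant(graph, path):
    """True iff every consecutive transition in `path` follows an edge of the SOP graph."""
    return all(nxt in graph.get(cur, set()) for cur, nxt in zip(path, path[1:]))

# An agent that skips the required identity check violates the procedure,
# no matter how helpful its replies sound:
good = ["greet", "classify_intent", "verify_identity", "check_billing", "issue_refund"]
bad  = ["greet", "classify_intent", "check_billing", "issue_refund"]
```

Because the check is a pure graph traversal, the verdict is reproducible: the same transcript always yields the same compliance result, with no judgment call involved.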

This matters because logical compliance is not the same as sounding helpful. A model can produce a warm, professional response while simultaneously skipping a required verification step or routing a customer to the wrong resolution — failures that carry real consequences in production.

Models accurately classify intents but fail to derive correct subsequent actions — a gap that politeness and fluency can easily mask.

The benchmark also introduces an Adversarial Intent Taxonomy, a structured catalogue of difficult or manipulative user behaviors designed to stress-test agents. This allows evaluators to probe how models respond when users push back, provide contradictory information, or attempt to circumvent procedures — scenarios that are common in real deployments but rarely captured in standard benchmarks.
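A taxonomy like this is, in effect, a structured catalogue that an evaluator can sample scenarios from by category. A minimal sketch of that idea, with invented category names (the paper's actual taxonomy may differ):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AdversarialIntent:
    category: str
    description: str

# Hypothetical entries for illustration only; not the taxonomy from the paper.
taxonomy = [
    AdversarialIntent("pushback", "User disputes the stated policy and demands an exception."),
    AdversarialIntent("contradiction", "User supplies details that conflict with earlier turns."),
    AdversarialIntent("circumvention", "User tries to skip a required verification step."),
]

def scenarios_for(category, taxonomy):
    """Select all catalogued behaviors in a given adversarial category."""
    return [t for t in taxonomy if t.category == category]
```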

Testing 27 Models Across Six Industrial Scenarios

The research team tested 27 large language models across six industrial scenarios, using a dual-axis evaluation framework. Rather than a single quality score, SAGE assesses models on two dimensions simultaneously: logical compliance with the SOP graph, and dialogue quality. Evaluation itself is automated through a combination of Judge Agents and a Rule Engine that analyze the interactions between simulated user agents and service agents, producing what the researchers describe as deterministic ground truth — meaning the system's verdict on whether a procedure was followed correctly is not itself a matter of interpretation.
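The dual-axis idea can be sketched as two independent scores per conversation: a deterministic compliance verdict from a rule check against the SOP graph, and a separate quality rating standing in for the judge model. This is a toy illustration under assumed interfaces, not SAGE's implementation; `judge_score` here is just a placeholder for an LLM judge's output.

```python
def evaluate(agent_path, sop_graph, judge_score):
    """Toy dual-axis evaluation.

    agent_path:  ordered list of SOP steps the agent actually took.
    sop_graph:   dict mapping each step to its set of valid next steps.
    judge_score: 0-1 dialogue-quality rating, standing in for a Judge Agent.
    """
    # Axis 1: deterministic rule check — did every transition follow the SOP graph?
    compliant = all(
        nxt in sop_graph.get(cur, set())
        for cur, nxt in zip(agent_path, agent_path[1:])
    )
    # Axis 2: subjective quality, reported separately rather than blended in.
    return {"logical_compliance": compliant, "dialogue_quality": judge_score}

# A fluent but non-compliant conversation scores high on one axis, False on the other:
graph = {"greet": {"verify"}, "verify": {"resolve"}, "resolve": set()}
result = evaluate(["greet", "resolve"], graph, judge_score=0.9)
```

Keeping the two axes separate is the point: averaging them into one number would let a high quality score hide a procedural violation.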

These results are self-reported by the research team and have not yet undergone independent peer review, as the paper was posted to arXiv as a preprint.

The Execution Gap and Empathy Resilience

The experiments surfaced two findings with direct implications for enterprise AI deployment. The first is what the authors term the "Execution Gap": across the models tested, there is a consistent and significant disconnect between intent classification accuracy and action accuracy. Models correctly identify what a user is asking for at high rates, but then select the wrong next action in the procedural sequence. For businesses that have assumed strong intent recognition translates into reliable task completion, this finding challenges that assumption.
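The gap itself is easy to express as a metric: intent-classification accuracy minus action accuracy over the same set of turns. The formulation below is an illustrative reading of the finding, not necessarily the paper's exact metric, and the sample records are invented.

```python
def execution_gap(records):
    """Intent accuracy minus action accuracy over paired per-turn outcomes.

    Each record is (intent_correct, action_correct). A large positive value
    means the model knows what the user wants but picks the wrong next step.
    Illustrative metric; the paper's exact formulation may differ.
    """
    n = len(records)
    intent_acc = sum(intent for intent, _ in records) / n
    action_acc = sum(action for _, action in records) / n
    return intent_acc - action_acc

# Hypothetical data: intent is recognized 4/5 times, but the correct
# next action is taken only 2/5 times — a 0.4 execution gap.
records = [(True, True), (True, False), (True, False), (True, True), (False, False)]
```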

The second finding — labelled "Empathy Resilience" — is subtler and arguably more concerning from a quality-assurance perspective. Under high adversarial pressure, models tend to maintain polite, empathetic conversational behavior even as their underlying logical processing fails. In other words, the surface quality of the conversation degrades more slowly than the procedural quality. This creates a scenario where human reviewers spot-checking transcripts might rate an interaction positively while the model has actually violated the SOP.

A Modular Design Built for Cross-Industry Deployment

Beyond the diagnostic findings, SAGE is designed as a practical tool. The Extension Mechanism built into the benchmark allows organizations to adapt it to new domains at relatively low cost by plugging in different SOP graphs and adversarial scenario sets. The system also supports automated dialogue data synthesis, meaning it can generate evaluation conversations at scale without requiring large volumes of human-annotated data.
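One way to picture such an extension mechanism is a registry where each new domain contributes its own SOP graph and adversarial scenario set, while the evaluation machinery stays fixed. The sketch below is a hypothetical plug-in pattern; the function names, registry shape, and example domain are assumptions, not SAGE's API.

```python
# Hypothetical plug-in registry: a new domain is added by supplying its own
# SOP graph and adversarial scenarios, without touching the evaluator itself.
REGISTRY = {}

def register_domain(name, sop_graph, adversarial_scenarios):
    """Register a domain's procedure graph and stress-test scenarios."""
    REGISTRY[name] = {"sop": sop_graph, "adversarial": adversarial_scenarios}

def domains():
    """List all registered evaluation domains."""
    return sorted(REGISTRY)

# Example: plugging in an invented telecom-billing domain.
register_domain(
    "telecom_billing",
    sop_graph={"greet": {"verify"}, "verify": {"resolve"}, "resolve": set()},
    adversarial_scenarios=["pushback", "circumvention"],
)
```

The appeal of this shape is that the cost of a new domain is authoring data (a graph and scenarios), not modifying evaluation code.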

The researchers have made the benchmark code and resources publicly available, though currently through an anonymized repository, suggesting the work may be under review for a formal conference or journal submission.

What This Means

For teams building or procuring AI customer service systems, SAGE provides evidence that fluency and intent recognition are insufficient proxies for operational reliability — and that evaluation frameworks must test procedural compliance directly, not just conversation quality.