A new study posted to arXiv's cs.AI category argues that LLM-powered agents that solve complex optimization problems through conversation consistently reach better outcomes than systems that attempt the same problems in a single pass, and it proposes a scalable framework for measuring exactly how much better.
Optimization problems — scheduling, resource allocation, logistics — have traditionally been the domain of specialists who translate messy human priorities into precise mathematical models. The challenge is that identifying the right objectives, constraints, and trade-offs requires sustained dialogue between technical experts and the people who actually live with the results. This paper, titled "Let's Have a Conversation: Designing and Evaluating LLM Agents for Interactive Optimization," argues that large language models can bridge that gap, but only if they are built and evaluated the right way.
Why One-Shot Evaluation Falls Short
The paper's central methodological contribution is a framework for evaluating conversational optimization agents at scale — a problem that has no obvious off-the-shelf solution. Traditional benchmarks for AI systems measure a single output against a single correct answer. Conversations don't work that way: the quality of an outcome depends on the full sequence of exchanges, the stakeholder's evolving preferences, and the agent's ability to interpret ambiguous feedback.
One-shot evaluation proved severely limiting in the study's experiments: the same optimization agent converged to much higher-quality solutions when allowed to iterate through conversation.
To address this, the researchers built LLM-powered decision agents that role-play as diverse stakeholders. Each synthetic stakeholder is governed by an internal utility function — a hidden set of preferences — but communicates the way a real decision-maker would: in natural language, with all the vagueness and shifting priorities that implies. The team then generated thousands of simulated conversations around a school scheduling case study, creating enough data to draw statistically meaningful conclusions.
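A synthetic stakeholder of this kind can be sketched minimally: a hidden linear utility over schedule features that the agent never sees, surfaced only as vague natural-language feedback. The class, weights, and phrasing below are hypothetical illustrations, not the paper's actual formulation.

```python
from dataclasses import dataclass

@dataclass
class SyntheticStakeholder:
    """A simulated decision-maker with a hidden utility function.

    The feature names and weights are hypothetical; the paper does not
    publish its exact utility formulation.
    """
    weights: dict  # hidden preferences, e.g. {"teacher_gaps": -2.0}

    def utility(self, features: dict) -> float:
        # Score a proposed schedule against the hidden preferences.
        return sum(self.weights.get(k, 0.0) * v for k, v in features.items())

    def react(self, features: dict, threshold: float = 0.0) -> str:
        # Translate the hidden score into vague natural-language feedback,
        # which is the only signal the optimization agent receives.
        if self.utility(features) >= threshold:
            return "This looks workable, though I'd still tweak a few periods."
        return "This doesn't feel right -- the teachers have too many gaps."

stakeholder = SyntheticStakeholder(
    weights={"teacher_gaps": -2.0, "early_starts": -1.0}
)
print(stakeholder.react({"teacher_gaps": 3, "early_starts": 1}))
```

The point of the hidden utility is that evaluation stays objective (the score is computable) even though the agent only ever sees the ambiguous surface text.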
Tailored Agents Beat General-Purpose Chatbots
The study's second major finding concerns how optimization agents should be built. The researchers compared general-purpose chatbots against tailored optimization agents — systems equipped with domain-specific prompts and structured tools designed for the problem at hand. The specialized agents reached significantly better solutions in fewer conversational turns, according to the paper.
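The distinction can be illustrated with a toy sketch: a tailored agent pairs a domain-specific system prompt with a structured solver tool it can call, while a generic chatbot must reason about the schedule entirely in free text. The prompt wording and the brute-force "solver" below are hypothetical stand-ins, not the paper's implementation.

```python
from itertools import permutations

# Hypothetical domain-specific system prompt a tailored agent might carry.
DOMAIN_PROMPT = (
    "You are a school-scheduling assistant. Elicit constraints about "
    "teacher availability, room capacity, and break placement, then "
    "call the solver tool rather than guessing a timetable in text."
)

def solve_ordering(tasks, pair_cost):
    """Structured tool: exhaustively find the minimum-cost task ordering.

    Stands in for a real optimization backend (MIP, CP-SAT, etc.) that a
    tailored agent would invoke instead of approximating the search in prose.
    Unlisted adjacent pairs default to a cost of 1.0.
    """
    def total(order):
        return sum(pair_cost.get((a, b), 1.0) for a, b in zip(order, order[1:]))
    return min(permutations(tasks), key=total)

# One tool call yields a provably cost-minimal ordering.
best = solve_ordering(
    ["math", "gym", "lunch"],
    {("math", "lunch"): 0.2, ("lunch", "gym"): 0.3},
)
```

The design point is the division of labor: the LLM handles elicitation and explanation, while the structured tool guarantees solution quality that a minimally prompted chatbot cannot.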
This matters because the obvious default for many organizations deploying AI in decision-support roles is to reach for a general-purpose model and prompt it minimally. The research provides evidence that this approach leaves substantial performance on the table. Building effective optimization agents, the authors argue, requires genuine operations research expertise — not just access to a capable language model.
The school scheduling domain was chosen as a case study, but the methodology is designed to be replicable and scalable across other optimization contexts. The researchers describe their approach as generalizable to any setting where human stakeholders must negotiate complex trade-offs with an AI system over multiple exchanges.
The Harder Problem: Modeling the Right Question
Underpinning the paper is a philosophical point that practitioners in operations research have long understood: solving an optimization problem correctly matters far less than solving the right optimization problem. Getting to the right problem formulation demands conversation — iterative clarification of what stakeholders actually want, not just what they initially say they want.
This is where large language models offer something new. Previous optimization tools required stakeholders to express their preferences in formal, mathematical terms, creating a significant barrier to adoption outside specialist teams. LLM agents can accept natural language input, propose solutions, explain trade-offs in plain terms, and refine their approach based on feedback — handling the translation layer that previously required a human expert.
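That interaction pattern reduces to a propose-explain-refine loop. In the sketch below, both the feedback interpretation and the revision step are trivial keyword rules so the loop is runnable; in a real system each would be an LLM call. The function names and the 9 a.m. preference are hypothetical.

```python
def refine(initial_start: int, feedback_fn, max_turns: int = 10) -> int:
    """Propose-and-refine loop: adjust a schedule parameter until the
    stakeholder stops objecting or the turn budget runs out."""
    start = initial_start
    for _ in range(max_turns):
        feedback = feedback_fn(start)   # natural-language reaction
        if "too early" in feedback:
            start += 1                  # push the first period later
        elif "too late" in feedback:
            start -= 1                  # pull the first period earlier
        else:
            break                       # stakeholder is satisfied
    return start

# A stakeholder who privately prefers a 9 a.m. start but only says so vaguely.
def stakeholder(start: int) -> str:
    if start < 9:
        return "That feels too early for the younger kids."
    if start > 9:
        return "Ending that late is a problem -- it runs too late overall."
    return "That works for everyone."

print(refine(7, stakeholder))  # converges to 9
```

Even this toy version shows why turn count is a natural evaluation metric: a better revision policy reaches the stakeholder's hidden preference in fewer exchanges.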
The paper positions this capability as a way to expand the reach of optimization technologies to decision-makers who would never interact directly with traditional solver-based tools. Whether that promise holds across domains beyond scheduling remains an open question — the current results are based on simulated stakeholders rather than real human users, which is a meaningful limitation the methodology is designed to eventually address.
Synthetic Stakeholders as a Scaling Tool
The use of LLM-powered synthetic stakeholders to generate evaluation conversations is itself a notable methodological choice. Running thousands of conversations with real human participants would be prohibitively expensive and slow. Using AI agents to simulate stakeholders makes large-scale evaluation tractable, though it introduces its own assumptions: the synthetic stakeholders are only as realistic as the utility functions and prompting strategies used to construct them.
The researchers acknowledge that their evaluation framework is a stepping stone rather than a final answer. Its value lies in enabling systematic comparison between different agent designs before costly real-world deployment — a form of pre-screening that could significantly reduce the gap between laboratory performance and practical results.
The findings also carry an implicit message for organizations building AI tools for professional use: domain expertise doesn't disappear when you add a language model. It shifts. The expertise required to prompt, structure, and constrain an optimization agent effectively is substantively similar to the expertise required to build a good optimization model in the first place.
What This Means
For organizations deploying AI in complex decision-making roles, this research is a direct argument against off-the-shelf general-purpose chatbots and in favor of purpose-built agents — and it offers a concrete methodology for measuring whether those agents are actually working.