A new AI architecture called DUPLEX improves robotic task planning by preventing large language models from doing the planning — assigning that role instead to a classical symbolic system while confining the LLM to structured data extraction.
The paper, published on arXiv in the CS.AI category, addresses a well-documented weakness in applying LLMs to robotics and long-horizon task planning: their tendency toward hallucination and logical inconsistency. When an LLM is asked to generate a complete plan — particularly across many sequential steps — small errors compound, and the resulting behavior can be unpredictable or outright wrong.
The Core Idea: Divide Labor by Competence
DUPLEX (Agentic Dual-System Planning via LLM-Driven Information Extraction) splits the planning process between two components the authors call a Fast System and a Slow System. The Fast System uses a lightweight LLM solely to parse natural language instructions — identifying entities, relationships, and constraints — and maps them into a structured format called PDDL (Planning Domain Definition Language). A classical symbolic planner then takes that structured input and generates the actual plan.
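To make the division of labor concrete, here is a minimal sketch of what the Fast System's job reduces to. All names here are illustrative, not from the paper: the lightweight LLM would produce a structured extraction (shown as a hand-written dict), which is then rendered mechanically into a PDDL problem for the classical planner.

```python
def render_pddl_problem(extraction: dict) -> str:
    """Turn a structured extraction into a PDDL problem string."""
    objects = " ".join(extraction["objects"])
    init = "\n    ".join(f"({f})" for f in extraction["init"])
    goal = "\n      ".join(f"({f})" for f in extraction["goal"])
    return (
        f"(define (problem {extraction['name']})\n"
        f"  (:domain {extraction['domain']})\n"
        f"  (:objects {objects})\n"
        f"  (:init\n    {init})\n"
        f"  (:goal (and\n      {goal})))\n"
    )

# Hypothetical output of the extraction LLM for the instruction
# "put the mug in the sink" -- in DUPLEX this dict would come from
# the model, not be hand-written.
extraction = {
    "name": "tidy-kitchen",
    "domain": "household",
    "objects": ["mug", "sink", "counter"],
    "init": ["on mug counter", "empty sink"],
    "goal": ["in mug sink"],
}

print(render_pddl_problem(extraction))
```

The point the sketch makes is that once the extraction exists, the remaining step is deterministic string assembly; all of the linguistic ambiguity is confined to producing the dict.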
The key is not to make the LLM plan better, but to restrict the LLM to the part it is good at — structured semantic grounding — and leave logical plan synthesis to a symbolic planner.
This distinction matters more than it might initially appear. Classical planners are deterministic and logically sound — given a valid problem description, they will find a valid solution or confirm none exists. The persistent challenge has been translating messy, ambiguous human language into the precise format these planners require. DUPLEX positions the LLM as that translator, not as a decision-maker.
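The soundness guarantee can be seen in a toy forward-search planner. This is a generic illustration, not the planner used in the paper: breadth-first search over a finite state space either returns a valid action sequence or, having exhausted every reachable state, proves that none exists.

```python
from collections import deque

def plan(initial: frozenset, goal: frozenset, actions: dict):
    """actions maps name -> (preconditions, add_effects, del_effects)."""
    frontier = deque([(initial, [])])
    seen = {initial}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:                     # every goal fluent holds
            return steps
        for name, (pre, add, dele) in actions.items():
            if pre <= state:                  # action is applicable
                nxt = (state - dele) | add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None                               # search exhausted: no plan exists

# Two-action toy domain (illustrative fluent names):
actions = {
    "pick-mug": (frozenset({"on-counter"}),
                 frozenset({"holding"}), frozenset({"on-counter"})),
    "place-sink": (frozenset({"holding"}),
                   frozenset({"in-sink"}), frozenset({"holding"})),
}
print(plan(frozenset({"on-counter"}), frozenset({"in-sink"}), actions))
# An unsatisfiable goal is provably rejected rather than hallucinated around:
print(plan(frozenset({"on-counter"}), frozenset({"flying"}), actions))
```

No probabilistic component appears anywhere in this loop, which is exactly the property DUPLEX wants to preserve by keeping the LLM out of it.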
When Plans Fail: The Slow System Kicks In
Not every situation maps cleanly from natural language to a well-formed planning problem. Instructions can be underspecified, environments can contain unexpected elements, or the initial extraction can be incomplete. For these cases, DUPLEX activates its Slow System — but only when the symbolic planner has already failed.
Rather than starting over, the Slow System uses the planner's own diagnostic output — the specific error or failure reason — to prompt a high-capacity LLM. That model then performs iterative reflection and repair on the PDDL problem description until the planner can succeed, or until the system determines the task is genuinely unsolvable with available information. This targeted use of a more powerful (and computationally expensive) model only on confirmed failures is a deliberate efficiency choice.
The architecture is described as "agentic" because the Slow System operates in a loop, autonomously revising its outputs based on structured feedback rather than requiring human intervention at each failure point.
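The loop described above can be sketched schematically. The control flow follows the paper's description, but `symbolic_plan` and `llm_repair` are stand-ins for components it leaves unspecified, and the retry budget is an assumption of this sketch.

```python
def solve_with_repair(problem_pddl: str, symbolic_plan, llm_repair,
                      max_rounds: int = 3):
    for _ in range(max_rounds):
        plan, error = symbolic_plan(problem_pddl)
        if plan is not None:
            return plan                       # fast path: planner succeeded
        # Only on confirmed failure is the expensive model invoked,
        # prompted with the planner's own diagnostic output.
        problem_pddl = llm_repair(problem_pddl, error)
    return None                               # deemed unsolvable as specified

# Toy stand-ins: the "planner" rejects problems missing a goal section,
# and the "repair model" appends one.
def symbolic_plan(pddl):
    if "(:goal" not in pddl:
        return None, "missing goal specification"
    return ["noop"], None

def llm_repair(pddl, error):
    return pddl + "\n(:goal (and (done)))"

print(solve_with_repair("(define (problem p))", symbolic_plan, llm_repair))
```

The design choice worth noting is that the high-capacity model never sees a problem the cheap path could already solve, which is where the claimed efficiency comes from.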
Benchmark Results Across 12 Domains
The authors evaluated DUPLEX across 12 classical and household planning domains, comparing it against end-to-end LLM planners — systems that ask an LLM to generate a complete plan directly — and hybrid baselines that combine LLMs with symbolic components under a looser division of labor. According to the paper, DUPLEX significantly outperformed both categories on success rate and reliability. These benchmarks are self-reported by the research team and have not yet undergone independent replication.
The inclusion of household planning domains is notable. These environments — think robots navigating kitchens, managing objects, or following multi-step domestic instructions — are intentionally messy and linguistically varied, making them a realistic stress test for the language extraction component.
Why This Approach Runs Against the Prevailing Trend
Much recent work in LLM-based planning has moved in the opposite direction: giving models more autonomy, more tools, and more responsibility for end-to-end task completion. Systems like ReAct, code-generating agents, and various tool-use frameworks ask LLMs to reason through problems and emit executable plans or code directly.
DUPLEX represents a deliberate counter-argument. Its premise is that LLMs are fundamentally ill-suited to rigorous logical synthesis — not because they lack knowledge, but because their probabilistic, pattern-matching nature is structurally mismatched with tasks requiring guaranteed correctness over long action sequences. By hard-coding that limitation into the architecture itself, rather than trying to prompt or fine-tune it away, the researchers claim more consistent real-world performance.
This connects to a broader debate in AI research about where neural and symbolic methods each belong. Pure neural approaches offer flexibility and generalization; pure symbolic approaches offer correctness and interpretability. Neuro-symbolic hybrids attempt to capture both — but the specific boundary between them varies enormously across different systems, and where you draw that line has significant consequences for performance.
What This Means
DUPLEX offers a concrete, tested answer to the question of how to make LLM-assisted robotics more reliable: not by improving LLM planning ability, but by structurally preventing LLMs from planning at all — a design philosophy that could meaningfully influence how agentic systems are built for high-stakes, long-horizon tasks.