Researchers have proposed a feedback-driven agentic framework that uses heuristic search over language model outputs to generate high-quality planning domains from natural language, a task that current large language models struggle to perform reliably.
Formal planning domains, written in languages like PDDL (Planning Domain Definition Language), describe the rules, actions, and states that an automated planner needs to solve structured problems — from logistics scheduling to robotic task execution. While LLMs can approximate these representations, the resulting domains frequently contain logical errors or omissions that make them unusable in real deployments. This paper, posted to arXiv's cs.AI category, directly addresses that gap.
Why LLMs Struggle With Planning Domains
Generating a valid planning domain is not simply a text generation task — it requires precise symbolic consistency. An action's preconditions must logically connect to its effects, object types must be correctly inherited, and the domain must remain coherent across all possible states the planner might encounter. LLMs, trained on probabilistic next-token prediction, have no built-in mechanism to enforce these hard logical constraints.
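To make the consistency requirement concrete, here is a minimal sketch in Python. The PDDL action below is a standard logistics-style example, not taken from the paper, and the check it runs (that every variable used in the action body is declared in `:parameters`) is just one of the many hard constraints a probabilistic text generator has no built-in way to enforce.

```python
import re

# A small PDDL action schema (illustrative, not from the paper).
# The :precondition must hold before execution; the :effect describes
# how the world state changes afterwards.
ACTION = """
(:action unload
  :parameters (?pkg - package ?truck - truck ?loc - location)
  :precondition (and (in ?pkg ?truck) (at ?truck ?loc))
  :effect (and (not (in ?pkg ?truck)) (at ?pkg ?loc)))
"""

def undeclared_variables(action_text: str) -> set:
    """Return variables used in the action body but missing from :parameters,
    one of the hard symbolic checks an LLM cannot enforce by itself."""
    header, body = action_text.split(":precondition")
    declared = set(re.findall(r"\?\w+", header))
    used = set(re.findall(r"\?\w+", body))
    return used - declared

print(undeclared_variables(ACTION))  # set(): every variable is declared
```

A single stray variable (say, `?dest` appearing only in the `:effect`) would make the action silently unusable to a planner, which is exactly the class of error that motivates formal validation.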
Recent benchmarking work has confirmed that even the most capable reasoning models produce planning domains that frequently fail formal validation. The authors of this paper cite that body of evidence as motivation, arguing that raw LLM generation — even with chain-of-thought prompting — is insufficient for production-grade use.
Heuristic search over model space, guided by symbolic feedback, reframes domain generation from a single-shot prediction problem into an iterative optimisation process.
How the Feedback Framework Works
The core idea is to treat the space of possible LLM-generated domains as a search space, and to use symbolic feedback signals as a heuristic to navigate it. Rather than asking a model to generate a domain once and accepting the result, the framework runs multiple generations and evaluates each using two primary feedback mechanisms.
The first is landmarks — partial symbolic descriptions of key states or subgoals that a valid plan should pass through. These are derived from natural language descriptions augmented with a minimal amount of symbolic annotation, keeping the human effort low while providing meaningful structural guidance.
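One plausible way to turn landmarks into a numeric feedback signal is to simulate a candidate plan and measure how many landmark facts its state trace ever reaches. The sketch below assumes landmarks are ground facts represented as tuples; the names and representation are illustrative, not the paper's exact formulation.

```python
# Hypothetical landmark feedback: landmarks are key facts a valid plan
# should make true at some point. We score a simulated state trace by
# the fraction of landmarks it achieves.

def landmark_score(state_trace, landmarks):
    """Return the fraction of landmark facts achieved along the trace."""
    achieved = {fact for state in state_trace for fact in state}
    hit = [lm for lm in landmarks if lm in achieved]
    return len(hit) / len(landmarks)

# Toy trace for a one-package delivery (illustrative data).
trace = [
    {("at", "truck1", "depot")},
    {("in", "pkg1", "truck1"), ("at", "truck1", "depot")},
    {("at", "truck1", "dock"), ("in", "pkg1", "truck1")},
    {("at", "pkg1", "dock")},
]
landmarks = [("in", "pkg1", "truck1"), ("at", "pkg1", "dock")]
print(landmark_score(trace, landmarks))  # 1.0
```

A partial score (say 0.5 when only one of two landmarks is reached) gives the search a gradient to follow even when the domain is not yet fully valid.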
The second feedback source is VAL, a widely used plan validator that formally checks whether a generated domain and its associated plans are logically consistent. VAL's output — errors, warnings, or confirmation — feeds back into the search process, steering subsequent model generations toward better solutions.
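In practice, wiring VAL into such a loop means invoking its `Validate` binary and reducing the textual report to a structured signal. The command name is VAL's standard executable, but the exact phrasing of its output varies across builds, so the matched strings below should be treated as illustrative.

```python
import subprocess

def parse_validation_report(report: str) -> dict:
    """Reduce a VAL-style report to a structured feedback signal.
    The matched phrases are illustrative; real VAL output varies by version."""
    lines = [ln.strip() for ln in report.splitlines() if ln.strip()]
    valid = any("Plan valid" in ln for ln in lines)
    errors = [] if valid else [
        ln for ln in lines
        if "failed" in ln.lower() or "error" in ln.lower()
    ]
    return {"valid": valid, "errors": errors}

def validate_with_val(domain_file, problem_file, plan_file) -> dict:
    # Assumes a VAL `Validate` executable on PATH; adjust for your install.
    proc = subprocess.run(["Validate", domain_file, problem_file, plan_file],
                          capture_output=True, text=True)
    return parse_validation_report(proc.stdout)
```

The error lines, not just the pass/fail bit, are what make this feedback useful: they can be fed back into the next prompt to tell the model *what* to fix.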
The framework operates as an agentic loop: the language model generates a domain, receives feedback, and revises. This cycle continues until a domain passes quality thresholds or a generation budget is exhausted.
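The loop just described can be sketched in a few lines. Here `generate_domain` and `score_domain` are hypothetical stand-ins for the LLM call and the combined landmark/VAL feedback; the paper's actual interfaces may differ.

```python
# Minimal sketch of the agentic generate-validate-revise loop,
# under assumed interfaces for the generator and scorer.

def refine_domain(task_description, generate_domain, score_domain,
                  budget=10, threshold=1.0):
    """Iterate generation and feedback until a quality threshold is met
    or the generation budget is exhausted; return the best candidate."""
    best, best_score, feedback = None, float("-inf"), None
    for _ in range(budget):
        candidate = generate_domain(task_description, feedback)
        score, feedback = score_domain(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if best_score >= threshold:
            break  # candidate passed the quality threshold
    return best, best_score
```

Keeping the best-so-far candidate, rather than only the latest one, matters because a revision prompted by feedback is not guaranteed to improve on its predecessor.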
What the Experiments Show
The paper evaluates domain quality across several configurations, varying the type and combination of symbolic feedback provided. According to the authors, incorporating both landmark information and VAL validation feedback produces measurably better domains than either source alone — though, as a preprint, the specific numbers have not yet been independently peer reviewed.
The authors also note that the minimal symbolic augmentation requirement is a deliberate design choice. Requiring extensive hand-crafted symbolic input would undermine the practical value of using LLMs in the first place. The goal is to find the smallest useful signal that meaningfully improves output quality.
The heuristic search itself operates over model space — essentially, the distribution of outputs a model can produce given different prompts, temperatures, or reasoning traces. This framing is conceptually distinct from search within a fixed symbolic state space, and the paper's contribution lies partly in formalising that distinction.
The Broader Challenge of Neurosymbolic Planning
This research sits within a growing strand of work attempting to bridge neural language models and classical symbolic AI planning — sometimes called neurosymbolic AI. Classical planners are powerful and formally verifiable, but they require precisely specified domains that are expensive to write by hand. LLMs lower that authoring cost but introduce correctness risks. Frameworks like this one attempt to combine the benefits of both.
Several research groups have explored related directions, including using LLMs as heuristics within planners, or using planners to verify LLM-generated task decompositions. This paper's specific contribution — treating model outputs as a search space navigated by symbolic validators — adds a concrete mechanism to that broader programme.
The work also has practical relevance beyond academic benchmarks. Industries that use automated planning — supply chain management, robotics, game AI, and medical protocol design — face the same bottleneck: domain authoring is a specialist skill in short supply. A reliable natural-language-to-domain pipeline would meaningfully lower that barrier.
What This Means
For developers and researchers working with automated planners, this framework offers a concrete method for improving LLM-generated planning domains without requiring deep symbolic expertise from end users — provided the feedback loop and minimal symbolic annotations are correctly configured.