Researchers have introduced RAMP, a hybrid AI system that simultaneously learns how to act in an environment and builds a formal, structured model of that environment's rules — eliminating the need for hand-crafted or expert-provided action descriptions.

Automated planning — the branch of AI concerned with finding sequences of actions to reach a goal — has long depended on action models: explicit specifications of what each action requires and what it changes. Writing these models by hand is labour-intensive, and while algorithms exist to learn them from recorded expert behaviour, those approaches are offline, meaning they need a curated dataset before they can begin. RAMP, described in a paper posted to arXiv (cs.AI, arXiv:2604.08685), removes that dependency entirely.

How the Feedback Loop Works

RAMP integrates three components that reinforce one another. A Deep Reinforcement Learning (DRL) policy explores the environment and accumulates experience. A numeric action model learner analyses that experience to build an increasingly accurate formal model of what actions do — expressed in a numeric planning language. A planner then uses that model to generate goal-directed action sequences, which in turn feed more structured behaviour back into the RL training loop.

The RL policy gathers data to refine the action model, while the planner generates plans to continue training the RL policy.

The result is a positive feedback cycle: better exploration produces a better model, and a better model produces better plans, which produce better training data. According to the authors, none of the three components needs to be perfect for the system to make progress — they improve together over time.
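The interplay of the three components can be sketched in miniature. Everything below is illustrative: the toy environment, the random-exploration "policy", the effect-inference "model learner", and the greedy "planner" are hypothetical stand-ins for RAMP's actual DRL policy, numeric action-model learner, and planner, chosen only to show how each component's output feeds the next.

```python
import random

class ToyEnv:
    """Minimal 1-D environment: move left/right until position equals the goal."""
    def __init__(self, goal=5):
        self.goal = goal
        self.pos = 0
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):  # action is +1 or -1
        self.pos += action
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done

def explore(env, steps=50):
    """Stand-in for the RL policy: collect (state, action, next_state) experience."""
    env.reset()
    traces = []
    for _ in range(steps):
        a = random.choice([-1, 1])
        s = env.pos
        s2, _, done = env.step(a)
        traces.append((s, a, s2))
        if done:
            env.reset()
    return traces

def learn_model(traces):
    """Stand-in for the action-model learner: infer each action's numeric effect,
    keeping only actions whose observed effect is consistent across all traces."""
    observed = {}
    for s, a, s2 in traces:
        observed.setdefault(a, set()).add(s2 - s)
    return {a: deltas.pop() for a, deltas in observed.items() if len(deltas) == 1}

def plan(model, start, goal):
    """Stand-in for the planner: greedily apply the learned effects toward the goal."""
    actions, pos = [], start
    while pos != goal and len(actions) < 100:
        best = max(model, key=lambda a: -abs((pos + model[a]) - goal))
        pos += model[best]
        actions.append(best)
    return actions

random.seed(0)
env = ToyEnv()
model = learn_model(explore(env))   # experience -> formal model
p = plan(model, 0, env.goal)        # model -> goal-directed plan
print(model, len(p))
```

In RAMP, the plan produced at the end would be executed to generate further, more goal-directed training data, closing the loop; here the three stages are shown once, in sequence.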

The Problem With Numeric Domains

Most prior work on learning action models focuses on propositional planning, where the world is described purely in true-or-false terms. Numeric planning is harder: actions can increase or decrease continuous quantities, such as fuel levels, distances, or resource counts. Existing algorithms that handle numeric domains require expert traces — recorded sequences of actions from a human or a pre-trained agent — which are not always available.
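The distinction can be made concrete. A propositional action only flips facts between true and false, whereas a numeric action also adds to or subtracts from continuous quantities. The sketch below models a hypothetical `drive` action with a fuel precondition and additive numeric effects; it is an illustration of the general idea, not an example taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class NumericAction:
    """A numeric planning action: thresholds on fluents as preconditions,
    additive changes to fluents as effects."""
    name: str
    preconditions: dict  # fluent name -> minimum required value
    effects: dict        # fluent name -> additive change

    def applicable(self, state):
        return all(state.get(f, 0) >= v for f, v in self.preconditions.items())

    def apply(self, state):
        assert self.applicable(state), f"{self.name}: preconditions unmet"
        new_state = dict(state)
        for fluent, delta in self.effects.items():
            new_state[fluent] = new_state.get(fluent, 0) + delta
        return new_state

# Hypothetical action: driving needs at least 10 fuel,
# consumes 10 fuel and covers 50 km of distance.
drive = NumericAction("drive", {"fuel": 10}, {"fuel": -10, "distance": 50})
s0 = {"fuel": 25, "distance": 0}
s1 = drive.apply(s0)
print(s1)  # {'fuel': 15, 'distance': 50}
```

Learning such a model online means inferring, from observed transitions alone, both the numeric thresholds that gate each action and the numeric deltas it applies.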

RAMP targets this gap directly. By learning online, from its own interactions, it avoids the expert-trace requirement. The authors also built a supporting tool called Numeric PDDLGym, an automated framework that converts numeric planning problems into the Gym interface standard used widely in reinforcement learning research. This conversion layer is what makes it practical to run a planner and an RL agent side by side on the same problem.
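The role of such a conversion layer can be illustrated with a minimal Gym-style wrapper. The class below is a sketch under assumptions — it is not Numeric PDDLGym's actual API — showing how a planning problem (applicable actions, numeric state, goal test) can be exposed behind the familiar `reset`/`step` interface that RL algorithms expect.

```python
class PlanningGymEnv:
    """Sketch: wrap a numeric planning problem behind a Gym-like interface.
    State is a dict of numeric fluents; actions are indices into a fixed list."""
    def __init__(self, actions, init_state, goal_test, horizon=100):
        self.actions = actions          # objects with applicable(state)/apply(state)
        self.init_state = dict(init_state)
        self.goal_test = goal_test      # callable: state -> bool
        self.horizon = horizon
    def reset(self):
        self.state = dict(self.init_state)
        self.t = 0
        return self.state
    def step(self, action_idx):
        self.t += 1
        act = self.actions[action_idx]
        if act.applicable(self.state):  # inapplicable actions are no-ops
            self.state = act.apply(self.state)
        reached = self.goal_test(self.state)
        done = reached or self.t >= self.horizon
        return self.state, (1.0 if reached else 0.0), done, {}

# A trivial action to exercise the wrapper: refuelling adds 20 fuel.
class Refuel:
    def applicable(self, state):
        return True
    def apply(self, state):
        s = dict(state)
        s["fuel"] = s.get("fuel", 0) + 20
        return s

env = PlanningGymEnv([Refuel()], {"fuel": 0}, lambda s: s["fuel"] >= 40)
state, done, steps = env.reset(), False, 0
while not done:
    state, reward, done, _ = env.step(0)
    steps += 1
print(state["fuel"], steps)
```

With the problem expressed this way, an off-the-shelf RL agent can train on it while a planner reasons over the same underlying action definitions.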

Benchmark Results and What They Show

The team tested RAMP on standard domains from the International Planning Competition (IPC), a long-running benchmark suite used to evaluate planning algorithms. According to the paper, RAMP outperforms PPO — Proximal Policy Optimisation, one of the most widely used deep RL algorithms — on two key measures: solvability (whether the agent actually reaches the goal) and plan quality (how efficiently it does so).
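One common way to compute such metrics from evaluation episodes is sketched below; the record format and the specific plan-quality formula (optimal length over achieved length, averaged over solved episodes) are assumptions for illustration, and the paper's exact definitions may differ.

```python
def evaluate(episodes, optimal_len):
    """episodes: list of (reached_goal: bool, plan_length: int) records.
    Solvability: fraction of episodes that reach the goal.
    Plan quality: optimal_len / achieved length, averaged over solved
    episodes (1.0 is optimal; lower means longer, less efficient plans)."""
    solved = [(ok, n) for ok, n in episodes if ok]
    solvability = len(solved) / len(episodes)
    quality = (sum(optimal_len / n for _, n in solved) / len(solved)) if solved else 0.0
    return solvability, quality

# Three of four episodes solved, with plans of 10, 12, and 15 steps
# against a 10-step optimum.
eps = [(True, 10), (True, 12), (False, 0), (True, 15)]
sol, qual = evaluate(eps, optimal_len=10)
print(sol, round(qual, 3))  # 0.75 0.833
```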

It is important to note that these benchmark results are self-reported by the authors and have not yet been independently replicated. The paper has been posted as a preprint on arXiv and has not undergone formal peer review at the time of writing.

The comparison with PPO alone is meaningful but limited. PPO is a strong general-purpose RL baseline, but it is not specifically designed for structured planning tasks. The paper does not appear to compare RAMP against other hybrid planning-and-learning systems, which would provide a fuller picture of where it sits in the field.

Why Combining Planning and Learning Is Hard

Merging classical planning with machine learning is a well-recognised challenge in AI research. Classical planners require precise, logically consistent models; RL systems learn approximate, statistical representations. The two paradigms use different internal languages and make different assumptions about the world.

Previous hybrid approaches have typically kept the two components separate — using RL to fill gaps where the planner lacks a model, or using a planner to guide RL exploration. RAMP's contribution is tighter integration: the action model is continuously updated from RL experience, and planning is used not just as a one-off guide but as an ongoing source of training signal. The Numeric PDDLGym framework is a practical contribution that may lower the barrier for other researchers attempting similar integrations.

What This Means

For AI systems that need to operate in structured, goal-directed environments without access to pre-built rules or expert demonstrations, RAMP offers a credible blueprint — one that learns both how to behave and why certain behaviours work, at the same time.