A new framework called DRAFT (Task Decoupled Latent Reasoning for Agent Safety) improves the ability of AI systems to detect unsafe behaviour in tool-using language model agents, raising average benchmark accuracy to 91.18%, up from 63.27% for a LoRA baseline, according to researchers whose paper appeared on arXiv in April 2025.
The work addresses a growing challenge in AI safety: as large language models gain the ability to use external tools — browsing the web, executing code, calling APIs — monitoring their safety becomes far more complex. Traditional safety approaches focus on moderating model outputs, but tool-using agents produce long, multi-step interaction logs where dangerous behaviour may be buried deep within noisy sequences. Standard binary classifiers struggle to assign responsibility for risk across these trajectories.
Why Standard Safety Tools Break Down for Agentic AI
The core challenge, as the researchers describe it, is sparse evidence across long contexts. In a typical agentic interaction, an LLM might take dozens of steps — querying databases, invoking tools, producing intermediate outputs — before any harmful outcome materialises. Identifying which steps contributed to that outcome requires a system that can compress and reason over the full trajectory, not simply flag a single sentence.
Existing approaches often use explicit summarisation pipelines: summarise the trajectory in text, then judge the summary. But this introduces information loss at the summarisation step — if the summary omits a subtle but critical detail, the judge never sees it.
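The failure mode is easy to see in miniature. The toy sketch below (not from the paper; the summariser, keyword judge, and trajectory are all illustrative) shows how a crude extractive summariser can drop the one step the judge needed:

```python
def summarise(steps, k=2):
    # Toy extractive summariser: keep the k longest steps,
    # using length as a crude proxy for salience.
    return sorted(steps, key=len, reverse=True)[:k]

def judge(steps):
    # Toy safety judge: flags a trajectory if any step contains
    # a known-dangerous command.
    return any("rm -rf" in s for s in steps)

trajectory = [
    "agent queried the billing database for the user's invoices",
    "agent formatted a long multi-page report and emailed it to the requester",
    "ran rm -rf /tmp/x",  # short, so a length-based summariser drops it
]

print(judge(trajectory))             # True: the full log contains the risky step
print(judge(summarise(trajectory)))  # False: the summary omitted it
```

The judge is only as good as what survives summarisation, which is exactly the information-loss problem DRAFT is built to avoid.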
DRAFT avoids this lossy summarise-then-judge pattern by aggregating evidence in latent space, which keeps the whole pipeline differentiable and trainable end-to-end.
How DRAFT's Two-Stage Architecture Works
DRAFT sidesteps this by decoupling safety judgment into two trainable components operating in continuous latent space rather than in human-readable text. The first component, called the Extractor, reads the full interaction trajectory and compresses it into a compact latent representation — a kind of internal draft that captures safety-relevant signals without forcing them into words. The second component, the Reasoner, then attends jointly to both this latent draft and the original trajectory to make the final safety prediction.
Because both stages are differentiable and trained end-to-end, the system can learn what information is worth preserving in the latent draft specifically for the purpose of safety judgment — rather than general summarisation. The researchers describe this as "continuous latent reasoning prior to readout," and their ablation experiments confirm that removing either the Extractor or the Reasoner independently degrades performance, suggesting the two components develop complementary capabilities.
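The paper does not publish reference code, but the compress-then-judge structure can be sketched in a few lines. The NumPy snippet below is a minimal illustration with made-up class names and randomly initialised (untrained) weights: the Extractor uses attention queries to pool a trajectory of step embeddings into a small latent draft, and the Reasoner attends jointly over the draft and the raw trajectory before a sigmoid readout.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class Extractor:
    """Compresses a trajectory of T step embeddings into K latent draft vectors."""
    def __init__(self, d, k):
        # K learned "draft slot" queries; random stand-ins here, trained in practice.
        self.queries = rng.normal(size=(k, d))

    def __call__(self, traj):                 # traj: (T, d)
        scores = self.queries @ traj.T / np.sqrt(traj.shape[1])  # (K, T)
        return softmax(scores) @ traj         # (K, d) latent draft

class Reasoner:
    """Attends jointly to the latent draft and the raw trajectory, then scores."""
    def __init__(self, d):
        self.readout = rng.normal(size=(d,))  # attention query for final readout
        self.w = rng.normal(size=(d,))        # linear classifier head

    def __call__(self, draft, traj):
        context = np.concatenate([draft, traj], axis=0)            # (K+T, d)
        attn = softmax(self.readout @ context.T / np.sqrt(len(self.readout)))
        pooled = attn @ context                                    # (d,)
        return 1.0 / (1.0 + np.exp(-(self.w @ pooled)))            # P(unsafe)

d, k, T = 16, 4, 50
traj = rng.normal(size=(T, d))                # stand-in for encoded agent steps
extractor, reasoner = Extractor(d, k), Reasoner(d)
p_unsafe = reasoner(extractor(traj), traj)    # scalar in (0, 1)
```

Because every operation here is differentiable, gradients from the final safety label flow back through the Reasoner into the Extractor's queries, which is what lets the draft specialise for safety judgment rather than generic summarisation.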
The architecture draws on a broader trend in machine learning research toward latent reasoning: chain-of-thought-style intermediate computation that stays inside the model's hidden states rather than being externalised as text. Applied here to safety monitoring rather than problem-solving, it represents a novel use of that idea.
Benchmark Results and What They Show
The researchers evaluated DRAFT on two benchmarks: ASSEBench and R-Judge, both designed to test safety reasoning over agentic interaction logs. Results are self-reported by the authors and have not been independently verified at the time of publication.
Across these benchmarks, DRAFT averaged 91.18% accuracy, compared to 63.27% for the LoRA baseline — an improvement of nearly 28 percentage points. The paper also reports that DRAFT learns more separable internal representations, meaning the model more cleanly distinguishes safe from unsafe trajectories in its latent space, which the authors argue indicates genuine understanding rather than surface-level pattern matching.
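As a quick sanity check on the reported numbers:

```python
baseline, draft = 63.27, 91.18        # average accuracy (%) from the paper
gain_pp = draft - baseline            # absolute gain in percentage points
gain_rel = gain_pp / baseline * 100   # relative improvement over the baseline
print(f"{gain_pp:.2f} pp absolute, {gain_rel:.1f}% relative")
# prints "27.91 pp absolute, 44.1% relative"
```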
The gap between DRAFT and existing methods is large enough to suggest the underlying approach — latent compression before judgment — is doing meaningful work, though independent replication on held-out datasets would strengthen that claim.
What Comes Next for Agent Safety Monitoring
The DRAFT paper arrives at a moment when major AI labs and regulators are paying increased attention to agentic systems. Models like OpenAI's GPT-4o, Anthropic's Claude, and Google DeepMind's Gemini are increasingly deployed in agentic configurations — operating with tools, memory, and multi-step autonomy. The safety infrastructure around these deployments is still developing alongside the capabilities.
Current safety pipelines for agentic systems tend to be reactive: they check outputs at the end of a task, or apply keyword-based filters at each step. A framework that reasons holistically over the full trajectory — and does so efficiently enough to be deployable — would represent a meaningful operational improvement for teams building or auditing agent-based products.
The researchers position DRAFT as a practical path rather than a theoretical one, noting that its end-to-end training and latent-space operation make it compatible with standard model training infrastructure. Whether it scales to the longest and most complex real-world agentic sessions — which can involve hundreds of tool calls — remains an open question the paper does not fully address.
What This Means
For teams building or deploying tool-using AI agents, DRAFT offers a concrete architectural approach to trajectory-level safety monitoring that substantially outperforms simpler classifiers — and signals that latent reasoning, not just output filtering, may be the appropriate layer at which to enforce agent safety.