Researchers have published ESL-Bench, a new open benchmark that stress-tests AI health agents on synthetic patient records spanning up to five years — and found that even the best-performing systems answer fewer than 6 in 10 evaluation questions correctly.

Evaluating AI systems designed to manage long-term patient health is notoriously difficult. Real-world health records cannot be released at scale due to privacy constraints, and most existing datasets lack the structured ground truth needed to definitively score an AI's reasoning about cause and effect over time. ESL-Bench, published on arXiv in April 2025, attempts to solve both problems at once.

A Synthetic Patient Population Built from the Ground Up

The benchmark generates 100 synthetic users, each assigned a health profile, a multi-phase life narrative, and a continuous stream of simulated data. That data includes daily device measurements — think wearable sensor readings — periodic clinical exam records, and a detailed event log that explicitly encodes how specific life events affect specific health indicators. Trajectories run between one and five years per synthetic patient.
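The three data streams described above can be pictured as a simple per-user record. This is a minimal sketch for illustration only; the field and class names (`DeviceReading`, `LifeEvent`, `SyntheticUser`) are assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class DeviceReading:
    day: int            # day index within the 1-to-5-year trajectory
    indicator: str      # e.g. "resting_heart_rate" (illustrative name)
    value: float

@dataclass
class LifeEvent:
    day: int
    description: str    # e.g. "started new medication"
    affected: dict      # indicator name -> signed effect magnitude

@dataclass
class SyntheticUser:
    user_id: int
    profile: dict                                   # static health profile
    readings: list = field(default_factory=list)    # daily device stream
    exams: list = field(default_factory=list)       # periodic clinical exam records
    events: list = field(default_factory=list)      # causal event log
```

The key structural point is the `affected` mapping on each event: the generator records not just that an event happened, but exactly which indicators it moves and by how much, which is what later makes answers programmatically checkable.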

The technical design is layered. Each health indicator follows a baseline stochastic process, meaning it fluctuates randomly around a norm, but discrete events — a new medication, a stressful life episode, a change in activity — trigger changes modelled with sigmoid-onset and exponential-decay curves. This mirrors how real physiological changes actually unfold: they ramp up gradually, then fade. A hybrid pipeline uses large language models to generate the sparse narrative elements, while deterministic algorithmic simulation handles the dense numerical dynamics, keeping the data within hard physiological bounds.
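A rough sketch of that layered dynamic, under assumed parameter values (onset and decay rates, the mean-reversion strength, and the clamping bounds are all illustrative, not taken from the paper):

```python
import math
import random

def event_effect(t, t_event, amplitude, onset_rate=0.5, decay_rate=0.02):
    """Effect of a discrete event on an indicator at day t:
    a sigmoid ramp-up after the event, multiplied by a slow
    exponential fade. Parameter values are illustrative."""
    if t < t_event:
        return 0.0
    dt = t - t_event
    onset = 1.0 / (1.0 + math.exp(-onset_rate * (dt - 10)))  # ramps up over ~10 days
    decay = math.exp(-decay_rate * dt)                       # gradual fade
    return amplitude * onset * decay

def simulate_indicator(days, baseline, sigma, events, lo, hi):
    """Baseline mean-reverting random walk around a norm, plus the
    summed effects of discrete events, clipped to hard
    physiological bounds [lo, hi]."""
    series, value = [], baseline
    for t in range(days):
        value += random.gauss(0, sigma) + 0.05 * (baseline - value)
        total = value + sum(event_effect(t, te, amp) for te, amp in events)
        series.append(min(hi, max(lo, total)))
    return series
```

For example, an event at day 100 with a positive amplitude produces a visible bump in the weeks that follow, superimposed on the noisy baseline, then gradually washes out.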

Five Types of Questions, Three Difficulty Tiers

Each synthetic user is paired with 100 evaluation queries, totalling 10,000 questions across the full dataset. The queries span five dimensions: Lookup (retrieve a specific fact), Trend (identify a pattern over time), Comparison (contrast values or periods), Anomaly (detect something unusual), and Explanation (attribute a change to a cause). Each dimension is further stratified into Easy, Medium, and Hard tiers.
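The taxonomy lends itself to a flat schema, something along these lines (a sketch; the class and field names are assumptions, not the benchmark's released format):

```python
from dataclasses import dataclass
from enum import Enum

class Dimension(Enum):
    LOOKUP = "lookup"            # retrieve a specific fact
    TREND = "trend"              # identify a pattern over time
    COMPARISON = "comparison"    # contrast values or periods
    ANOMALY = "anomaly"          # detect something unusual
    EXPLANATION = "explanation"  # attribute a change to a cause

class Tier(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"

@dataclass
class EvalQuery:
    user_id: int
    dimension: Dimension
    tier: Tier
    question: str
    ground_truth: str  # programmatically computed, see below
```

With 100 users at 100 queries each, the full set of 10,000 questions is just the cross product of users with their per-user query lists, stratified across the five dimensions and three tiers.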

Critically, all ground-truth answers are programmatically computable directly from the recorded event-indicator relationships — meaning there is no ambiguity in scoring. This is a meaningful design choice. In real clinical data, it is often genuinely unclear whether a health change was caused by one factor or another. ESL-Bench sidesteps this by construction, giving evaluators a clean, definitive answer key.
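To see why scoring is unambiguous, consider how a ground-truth answer for an Explanation query might be computed. This is a hypothetical sketch of the idea, not the paper's code; the `affected` field and the 30-day effect window are assumptions:

```python
def explanation_ground_truth(indicator, change_day, event_log, window=30):
    """Ground truth for an Explanation query: among logged events that
    (a) explicitly affect this indicator and (b) precede the observed
    change within the effect window, return the one with the largest
    recorded effect. Because the generator encodes event->indicator
    links explicitly, this lookup has exactly one defensible answer."""
    candidates = [
        e for e in event_log
        if indicator in e["affected"] and 0 <= change_day - e["day"] <= window
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda e: abs(e["affected"][indicator]))
```

In real clinical data, the equivalent of `e["affected"]` does not exist, which is exactly why causal attribution is normally so hard to score.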

Database Agents Lead, But the Gap Tells the Real Story

The researchers evaluated 13 methods drawn from three broad categories: large language models equipped with tools, database-native agents, and memory-augmented retrieval-augmented generation (RAG) systems. The results were unambiguous in direction, if not encouraging in absolute terms.

Database-native agents scored between 48% and 58%, establishing themselves as the strongest performers. Memory-augmented RAG systems — a popular architecture in enterprise AI applications — scored between 30% and 38%, a gap of roughly 20 percentage points. The researchers note that this difference is concentrated on Comparison and Explanation queries, precisely the tasks that require an AI to hold multiple pieces of evidence in mind simultaneously and attribute an outcome to a specific cause.

This finding has direct relevance to how health AI is currently being built. RAG-based systems, which retrieve relevant passages from a memory store and pass them to a language model, are widely used partly because they are relatively straightforward to deploy. The benchmark suggests they are not well-suited to the kind of multi-hop, temporally grounded reasoning that managing a patient's health history actually demands.

Why Evaluation Infrastructure Matters for Health AI

The broader problem ESL-Bench addresses is a bottleneck in the field: without good evaluation tools, it is impossible to know whether a health AI system is actually improving, or simply getting better at appearing to improve. Real patient data is locked behind privacy regulations, rightly so, and the datasets that are available rarely include the kind of explicit causal structure needed to test attribution reasoning.

Synthetic benchmarks carry their own limitations. A system that scores well on ESL-Bench has demonstrated competence on carefully constructed simulated data — it has not been validated on real patients, in real clinical workflows, with real stakes. The researchers do not claim otherwise. What the benchmark does offer is a reproducible, scalable, privacy-safe environment for comparative evaluation — something the field has lacked for longitudinal health agents specifically.

The five-dimensional query structure is also notable for what it captures. Most existing AI benchmarks test retrieval or single-step reasoning. ESL-Bench's Explanation tier, in particular, asks systems to work backwards from an observed health change to identify which event in the record caused it — a task closer to clinical differential diagnosis than to standard question answering.

What This Means

For teams building or procuring AI systems to manage long-term patient health, ESL-Bench provides the first structured evidence that database-native architectures meaningfully outperform popular RAG-based alternatives on the reasoning tasks that matter most — and that even the best current systems have substantial room to improve before clinical deployment can be seriously considered.