A new reward model called IntentScore can judge whether an AI agent is about to take the right action on a computer — before the action is executed — reducing the cascading errors that make today's computer-use agents unreliable.
Computer-use agents (CUAs) are AI systems that control desktop software by interacting with graphical interfaces: clicking buttons, filling forms, navigating menus. They have attracted significant research and commercial interest as a route to automating complex knowledge work. However, a persistent problem undermines their practical usefulness: these agents generate actions without any internal quality check, meaning a single wrong click can send a task irreversibly off course.
Training on Nearly 400,000 Real Interactions
IntentScore, developed by a team publishing on arXiv, is a plan-aware reward model trained on 398,000 offline GUI interaction steps collected across three operating systems. Rather than learning from synthetic data or human ratings alone, it draws on heterogeneous recorded trajectories — real sequences of actions and screen states — giving it broad exposure to how software is actually used.
The model trains using two complementary objectives. The first, contrastive alignment, teaches it to recognise whether a given action is relevant to the current screen state. The second, margin ranking, teaches it to distinguish correct actions from plausible-but-wrong alternatives. Together, these objectives push IntentScore beyond simple pattern matching.
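The paper's abstract does not give the exact loss formulas, but the two objectives correspond to standard families of losses. A minimal sketch, assuming a generic InfoNCE-style contrastive loss and a hinge-style margin loss (the function names, temperature, and margin values here are illustrative, not taken from the paper):

```python
import numpy as np

def contrastive_alignment_loss(state_emb, action_emb, temperature=0.07):
    """InfoNCE-style objective: each action embedding should be most similar
    to the embedding of its own screen state, and dissimilar to the others
    in the batch. Inputs are (batch, dim) arrays from hypothetical encoders."""
    s = state_emb / np.linalg.norm(state_emb, axis=1, keepdims=True)
    a = action_emb / np.linalg.norm(action_emb, axis=1, keepdims=True)
    logits = s @ a.T / temperature                    # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # matched pairs on the diagonal

def margin_ranking_loss(score_correct, score_distractor, margin=0.2):
    """Hinge-style objective: the correct action's score must exceed a
    plausible-but-wrong alternative's score by at least `margin`."""
    return np.maximum(0.0, margin - (score_correct - score_distractor)).mean()
```

The division of labour is that the contrastive term grounds actions in the current screen, while the ranking term forces a usable score gap between near-miss candidates.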
The system can discriminate between candidates with similar actions but different rationales — a distinction that trips up most existing evaluation approaches.
Why Intent Is the Key Ingredient
The architectural insight at the heart of IntentScore is embedding planning intent directly into the action encoder. Two agents might issue the same click on the same interface element, but for entirely different reasons — one because it is the correct next step, the other because it has misread the task. Standard action evaluators cannot distinguish these cases. IntentScore encodes the why behind each candidate action alongside the what, allowing it to separate superficially similar actions that differ in their downstream consequences.
This design is what the authors call "plan-aware" evaluation — scoring actions not in isolation but in the context of the agent's stated reasoning about what it is trying to accomplish.
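The idea can be sketched in a few lines: the candidate's rationale is embedded alongside the action itself, so two identical clicks with different stated intents produce different representations. This is a toy illustration of the concept only; the embedder below is a hashed stand-in for a learned encoder, and none of these names come from the paper:

```python
import hashlib
import numpy as np

def embed_text(text, dim=16):
    """Toy deterministic text embedder: each token hashes to a fixed random
    vector. A stand-in for a learned encoder, purely for illustration."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        seed = int(hashlib.md5(tok.encode()).hexdigest()[:8], 16)
        vec += np.random.default_rng(seed).normal(size=dim)
    return vec / (np.linalg.norm(vec) + 1e-8)

def encode_candidate(action, intent, dim=16):
    """Plan-aware encoding: the action (the 'what') and the agent's stated
    rationale (the 'why') are embedded jointly, so identical actions with
    different intents map to different representations."""
    return np.concatenate([embed_text(action, dim), embed_text(intent, dim)])

# The same click, issued for two different reasons:
cand_a = encode_candidate("click #submit", "submit the completed expense form")
cand_b = encode_candidate("click #submit", "dismiss what looks like a popup")
# The action halves match; the intent halves differ, so a scorer
# operating on the full encoding can treat the two candidates differently.
```

A plain action evaluator sees only the first half of each vector and would score the two candidates identically.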
97.5% Accuracy and Real-World Performance Gains
On a held-out evaluation set, IntentScore achieved 97.5% pairwise discrimination accuracy, meaning it correctly identified the better of two candidate actions in 97.5% of comparisons. This figure is self-reported by the researchers and has not yet been independently replicated.
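Pairwise discrimination accuracy is a simple metric. A generic sketch of how it is computed, not the authors' actual evaluation harness:

```python
def pairwise_accuracy(scored_pairs):
    """scored_pairs: iterable of (score_for_better_action, score_for_worse_action)
    tuples, where 'better' is known from ground-truth labels. Returns the
    fraction of pairs where the reward model ranks the better action
    strictly higher."""
    pairs = list(scored_pairs)
    correct = sum(1 for better, worse in pairs if better > worse)
    return correct / len(pairs)

# Four labelled pairs; the model gets three of them right.
acc = pairwise_accuracy([(0.9, 0.1), (0.8, 0.3), (0.2, 0.7), (0.6, 0.5)])
```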
The more practically significant test came when IntentScore was deployed as a re-ranker for Agent S3, a separate computer-use agent, running on OSWorld — a benchmark environment the reward model had never encountered during training. Acting as a filter between the agent's proposed actions and their execution, IntentScore improved Agent S3's task success rate by 6.9 percentage points.
That improvement on an entirely unseen agent and task distribution is the result the authors emphasise most. It suggests the reward signal IntentScore learned from its training data captures something general about action quality, rather than overfitting to specific interfaces or workflows.
What Re-Ranking Means in Practice
Deploying IntentScore as a re-ranker means the underlying agent generates multiple candidate actions, IntentScore scores them, and the highest-scoring action is executed. This approach does not require retraining the agent itself — IntentScore slots in as a modular component on top of existing systems.
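The deployment loop is best-of-N selection. A minimal sketch, assuming the reward model is exposed as a callable; the function names and signature are illustrative, not IntentScore's actual interface:

```python
def rerank_and_pick(state, plan, candidates, reward_model):
    """Best-of-N re-ranking: score every candidate action the agent proposed
    for the current state and plan, and return only the top-scoring one
    for execution. `reward_model` is any callable
    (state, plan, action) -> float."""
    return max(candidates, key=lambda action: reward_model(state, plan, action))

# Toy illustration with a dummy scorer that prefers actions matching the plan.
dummy_rm = lambda state, plan, action: 1.0 if plan in action else 0.0
chosen = rerank_and_pick("settings screen", "save",
                         ["click #cancel", "click #save"], dummy_rm)
```

Because the agent's generation step is untouched, swapping in a different or updated reward model requires no change to the agent itself.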
That modularity matters commercially and practically. Organisations deploying computer-use agents could, in principle, add IntentScore to an existing pipeline without rebuilding their core model. The approach mirrors techniques used in other AI domains — re-ranking is a standard tool in information retrieval and code generation — now applied to GUI automation.
The reliance on offline trajectories for training also has practical implications. Collecting 398,000 interaction steps is expensive but feasible; the authors' approach suggests that archived interaction logs, which many enterprise software deployments already generate, could become training data for action quality models.
Limits and Open Questions
The research has limitations worth noting. OSWorld, while a well-regarded benchmark, represents a constrained set of desktop tasks. Real-world computer use spans a far wider range of applications, permission structures, and edge cases. Whether a 6.9-point improvement on OSWorld translates proportionally to production deployments remains an open question.
The training data spans three operating systems, but the paper's abstract does not specify which ones, leaving open questions about coverage of less common environments. And like all reward models, IntentScore could in principle be gamed by agents that learn to produce actions that look high-quality to the scorer without being genuinely correct — a challenge the broader reinforcement learning community has studied under the heading of reward hacking.
What This Means
IntentScore offers a practical, modular way to reduce the compounding errors that currently limit computer-use agents in real deployments — and its ability to generalise to unseen agents and benchmarks suggests this approach could become a standard component in the computer-use agent stack.