A new benchmark called Prediction Arena places AI models in live prediction markets with real money, offering what its creators argue is one of the most rigorous and ungameable tests of AI decision-making yet devised.

Published on arXiv in April 2026, the study ran from January 12 to March 9, 2026 across two platforms: Kalshi and Polymarket. Six frontier models participated in live trading for the full 57-day period (Cohort 1), each beginning with $10,000 and operating as fully autonomous agents, making decisions every 15 to 45 minutes without human intervention. A second cohort of four next-generation models ran a shorter three-day paper-trading preliminary.
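
To make the setup concrete, here is a minimal sketch of what such an autonomous trading loop could look like. The paper does not publish its harness, so every name and interface below is an illustrative assumption, not the study's actual code.

```python
import random
import time
from dataclasses import dataclass

# Illustrative only: the study does not publish its harness, so the
# Order/exchange/model interfaces here are hypothetical stand-ins.

STARTING_CAPITAL = 10_000.0                # USD per model, per the study
CYCLE_RANGE_SECONDS = (15 * 60, 45 * 60)   # one decision every 15-45 minutes

@dataclass
class Order:
    contract_id: str
    side: str          # "yes" or "no" on a binary event contract
    size_usd: float

def run_agent(model, exchange, days: float = 57) -> None:
    """Run one fully autonomous agent for the evaluation window."""
    deadline = time.time() + days * 86_400
    while time.time() < deadline:
        snapshot = exchange.fetch_open_markets()   # prices, liquidity, metadata
        order = model.decide(snapshot)             # may return None (hold)
        if order is not None:
            exchange.execute(order)                # live order, real capital
        time.sleep(random.uniform(*CYCLE_RANGE_SECONDS))
```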

All Six Models Lost Money on Kalshi — But Not Equally

On Kalshi, the results were uniformly negative. Final returns across Cohort 1 ranged from -16.0% to -30.8%, meaning every model lost a significant portion of its starting capital. The researchers found a clear performance hierarchy, with two factors emerging as the primary drivers of relative success: initial prediction accuracy and the ability to convert correct predictions into profitable positions. Notably, research volume — how much information a model processed before trading — showed no correlation with outcomes.

A striking cross-platform contrast emerges: Cohort 1 models averaged only -1.1% on Polymarket versus -22.6% on Kalshi.

That divergence is one of the study's most significant findings: the same models that lost heavily on Kalshi roughly broke even on Polymarket, suggesting that platform mechanics, including market structure, liquidity, and available contract types, may matter as much as raw model intelligence.

One Model Stands Out Across Both Platforms

Grok-4-20-checkpoint, developed by xAI, achieved a 71.4% settlement win rate on Polymarket — the highest recorded across any platform or cohort in the study. That figure means the model correctly predicted outcomes on more than seven in ten settled contracts. The result positions grok-4-20-checkpoint as a strong performer in live trading conditions, even as it still lost money on Kalshi alongside its peers.
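
A high settlement win rate does not by itself guarantee profit, because on binary contracts the price paid matters as much as the hit rate. The toy calculation below uses invented trade numbers, not the study's data, to show how an agent can win more than seven in ten contracts and still only break even if it buys at prices close to its own accuracy.

```python
# Toy numbers, not the study's trade data. Binary contracts pay $1 if
# correct and $0 otherwise, so P&L depends on entry price, not just accuracy.

win_rate = 0.714          # grok-4-20-checkpoint's Polymarket settlement win rate
stake_per_trade = 100.0   # USD spent per trade (invented)
n_trades = 100

for entry_price in (0.60, 0.714, 0.80):            # cost per $1 contract
    contracts = stake_per_trade / entry_price      # contracts bought per trade
    payout = win_rate * n_trades * contracts       # $1 per winning contract
    cost = n_trades * stake_per_trade
    print(f"entry price {entry_price:.3f}: P&L = {payout - cost:+,.0f} USD")
```

At an entry price equal to the win rate, expected profit is zero; any premium above it turns a 71.4% hit rate into a loss, which is consistent with near-flat Polymarket returns coexisting with high accuracy.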

The standout result from paper trading came from Gemini-3.1-pro-preview (Google DeepMind's Cohort 2 entry), which achieved a +6.02% return on Polymarket over just three days — the best overall return of any model across either cohort. Unusually, this model executed zero trades on Kalshi during its evaluation window, raising questions about whether selective platform engagement reflects a genuine strategic preference or a limitation in the model's market interface.

Why Real Markets Beat Synthetic Benchmarks

The core argument behind Prediction Arena is methodological. Standard AI benchmarks — multiple-choice tests, coding challenges, math problems — are vulnerable to overfitting, data contamination, and strategic optimisation by model developers who know what's being measured. Prediction markets offer something different: outcomes determined by future real-world events, with financial consequences attached.

Because trades execute on actual exchanges with real capital at stake, there is no way to retrospectively adjust results or cherry-pick favourable conditions. The ground truth is objective and timestamped. This makes Prediction Arena structurally resistant to the kind of benchmark gaming that has increasingly undermined confidence in AI leaderboards.

The researchers also tracked a range of secondary metrics beyond profit and loss — including token usage (computational cost per decision), cycle time (how quickly models acted), exit patterns, and market preferences. This broader view allows comparisons of efficiency alongside effectiveness, asking not just whether a model made money but how much compute it consumed in the process.
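
To illustrate how those secondary metrics could be combined, the sketch below computes a crude return-per-compute figure. The field names and the ratio itself are assumptions for illustration; the paper tracks these metrics but does not define this exact calculation.

```python
from dataclasses import dataclass

# Hypothetical record of one model's run; field names are assumed,
# not taken from the paper.

@dataclass
class ModelRun:
    name: str
    final_return_pct: float    # e.g. -22.6 means a -22.6% return
    tokens_used: int           # total tokens consumed over the run
    avg_cycle_seconds: float   # mean time from market snapshot to action

def return_per_million_tokens(run: ModelRun) -> float:
    """Percentage points of return earned (or lost) per million tokens."""
    return run.final_return_pct / (run.tokens_used / 1_000_000)

run = ModelRun("model-a", final_return_pct=-22.6,
               tokens_used=480_000_000, avg_cycle_seconds=1_800)
print(f"{run.name}: {return_per_million_tokens(run):.3f} pct-points / 1M tokens")
```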

Platform Design as a Hidden Variable

The gap between Kalshi and Polymarket performance raises a question the study leaves partly open: why did the same models behave so differently across platforms? Kalshi operates as a regulated US exchange with binary event contracts and stricter market rules. Polymarket runs on blockchain infrastructure and tends to feature more diverse, globally focused event markets with different liquidity profiles.

The researchers describe platform design as having a "profound effect on which models succeed," but the study does not fully decompose the mechanisms behind that effect. Future work could isolate whether the differences stem from contract type, liquidity depth, interface design, or something else entirely. That question matters beyond academic curiosity — if AI trading agents are eventually deployed at scale, choosing the right market environment could determine whether they add or destroy value.

It is worth noting that Cohort 2's results come from paper trading only, meaning no real capital changed hands for those models. The +6.02% Polymarket return from Gemini-3.1-pro-preview, while the best figure in the study, was generated in a simulated environment over just three days, a sample too small to support confident conclusions. All benchmark results in this study are self-reported by the research team and have not been independently audited.

What This Means

Prediction Arena offers AI researchers and developers a benchmark that real-world consequences make significantly harder to game, and its early results suggest that headline model rankings may look very different once real money is on the line.