Large language models can write a playable video game from a single prompt but cannot figure out how to actually play one — a contradiction that reveals deep structural limits in today's AI systems, according to Julian Togelius, director of New York University's Game Innovation Lab and co-founder of AI game testing company Modl.ai.
Togelius recently published a paper exploring what LLMs' failure at video games tells us about the state of AI more broadly, and spoke with IEEE Spectrum about his findings. His analysis arrives at a moment when AI benchmarks are growing more complex to keep pace with rapidly improving models — yet video game performance has remained flat.
Why Coding Is a "Well-Behaved Game" — and Video Games Are Not
Togelius frames the contrast between coding and gaming as a question of feedback structure. Coding, he argues, behaves like a well-designed game: a task arrives like a level, code either compiles or it doesn't, tests pass or fail, and failure messages explain what went wrong. The reward signal is immediate, granular, and interpretable.
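Togelius's point about reward structure can be made concrete. The sketch below (hypothetical helper, written for illustration; nothing from his paper) shows the feedback loop a coding agent enjoys: run the test suite, get back a binary outcome plus an error message explaining the failure. Video games rarely deliver anything so immediate or interpretable.

```python
import subprocess

def coding_reward(test_command: list[str]) -> dict:
    """Run a test suite and turn the result into the kind of immediate,
    interpretable reward signal that makes coding a 'well-behaved game'.
    (Hypothetical helper for illustration only.)"""
    result = subprocess.run(test_command, capture_output=True, text=True)
    return {
        "passed": result.returncode == 0,            # binary, unambiguous outcome
        "feedback": result.stderr or result.stdout,  # message explaining what went wrong
    }
```

A game, by contrast, offers no equivalent of a return code: the "test" is whether play feels right, and the signal arrives late, if at all.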
Video games offer no such clarity. Mechanics differ radically between titles, inputs vary, and the feedback loop is often delayed, ambiguous, or visually encoded in ways LLMs struggle to parse. Togelius points out that even AI systems not based on language models — like Google's AlphaZero, which mastered both chess and Go — had to be retrained and re-engineered for each game individually, and those two games share far more structural similarity than most game pairs do.
"They fail. They absolutely suck. All of them. They don't even do as well as a simple search algorithm."

That is Togelius's blunt summary of how LLMs fared on his General Video Game AI benchmark framework, a competition he ran for seven years before pausing it because agents stopped showing consistent progress. When his team adapted the framework for LLMs, the results were worse than expected: on novel games absent from their training data, LLMs underperformed even basic search algorithms.
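The "simple search algorithm" baseline is typically a bounded-depth lookahead over a forward model of the game: simulate each available action a few steps ahead and pick whichever line of play scores best. Here is a loose sketch of that idea; the `GameState` interface is invented for illustration and is not the actual GVGAI API.

```python
from typing import Protocol, Sequence

class GameState(Protocol):
    """Illustrative forward-model interface; not the real GVGAI API."""
    def legal_actions(self) -> Sequence[int]: ...
    def advance(self, action: int) -> "GameState": ...
    def score(self) -> float: ...
    def is_terminal(self) -> bool: ...

def best_action(state: GameState, depth: int = 3) -> int:
    """Bounded-depth lookahead: simulate every action sequence up to
    `depth` moves ahead and return the first action on the
    highest-scoring line of play."""
    def rollout_value(s: GameState, d: int) -> float:
        if d == 0 or s.is_terminal():
            return s.score()
        return max(rollout_value(s.advance(a), d - 1) for a in s.legal_actions())
    return max(state.legal_actions(),
               key=lambda a: rollout_value(state.advance(a), depth - 1))
```

An agent this simple knows nothing about any particular game, yet because it queries the game's own dynamics rather than pattern-matching against text, it clears a bar that LLMs, on unseen games, do not.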
The Data Problem Behind the Performance Gap
One underappreciated reason some AI systems can play certain games well is data volume, not general intelligence. Games like Minecraft and Pokémon have accumulated millions of hours of walkthroughs, guides, and gameplay footage. An LLM trained on that corpus can approximate competence. A less-documented game offers almost nothing to learn from.
In May 2025, Google's Gemini 2.5 Pro became notable for completing Pokémon Blue — but the achievement was qualified. According to IEEE Spectrum's reporting, the model completed the game far more slowly than a typical human player, made repetitive and illogical errors, and required custom software to manage its interactions. That result illustrates the gap between narrow success on well-documented tasks and genuine generalisable gameplay ability.
Togelius also identifies spatial reasoning as a core weakness. Most video games require tracking objects in two- or three-dimensional space, understanding relative positions, and reacting to visual layouts — none of which feature prominently in LLM training data, which skews heavily toward text.
The Paradox: Building Games vs. Playing Them
Perhaps the most striking finding in Togelius's analysis is the asymmetry between creation and play. Tools like Cursor and Claude can generate a functional, playable game from a single prompt. Ask for something resembling Asteroids, and a working clone appears. That is, by any measure, impressive.
But the output is invariably generic. Game development is an iterative discipline: developers write, play, feel the friction, and adjust. An LLM cannot play its own creation, which means it cannot sense when a jump feels sluggish, when a difficulty curve breaks, or when a mechanic stops being fun. Without that feedback loop, the model can only replicate what it has seen — competently, but without novelty or refinement.
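The write/play/adjust loop that LLMs cannot close can be caricatured in a few lines. This is a loose sketch under stated assumptions, not anything from Togelius's paper: `playtest` stands in for the step an LLM cannot perform, a hypothetical function that plays the game and reports an observed win rate for a given difficulty setting.

```python
def tune_difficulty(playtest, start=0.5, target_win_rate=0.5,
                    step=0.05, tol=0.01, iters=20):
    """Iteratively adjust a difficulty parameter from playtest feedback.
    `playtest` is a hypothetical stand-in for actually playing the game,
    which is precisely the step current LLMs cannot perform."""
    difficulty = start
    for _ in range(iters):
        win_rate = playtest(difficulty)
        if abs(win_rate - target_win_rate) < tol:
            break                       # feels about right: stop tuning
        if win_rate > target_win_rate:
            difficulty += step          # players win too often: make it harder
        else:
            difficulty -= step          # players lose too often: make it easier
    return difficulty
```

Even this crude loop depends on a signal from play. Without one, a model can only emit its first draft and stop.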
Togelius draws a broader parallel: the same limitation applies to any software with a user experience. An LLM can generate a graphical interface with buttons, but it has no model of what it feels like to use one.
What Simulation-Based Training Can — and Cannot — Fix
Companies including Nvidia and Google have promoted game-like simulations as a path to improving AI performance. Togelius offers a measured assessment. Games are easier than the real world in one respect — fewer layers of abstraction — but harder in another: games are radically more diverse than reality. The real world runs on consistent physics. Games do not.
He cites Waymo's use of world models in its autonomous driving training loop as a case where simulation works well, precisely because driving is a low-diversity domain. The same logic does not transfer to games, where Halo and Space Invaders are, in a meaningful structural sense, more different from each other than two academic essays on unrelated topics.
This reframes a common public misunderstanding. The breadth of LLM knowledge — writing essays on quantum physics, summarising legal documents, generating code — creates an impression of general intelligence. Video games expose the boundary of that impression sharply and concretely.
What This Means
The inability of LLMs to play video games is not a quirky limitation; it is a precise diagnostic of where current AI architecture stops working. It points to structural gaps in spatial reasoning, iterative feedback, and generalisation across diverse mechanics, gaps that developers and researchers will need to close before claims of general-purpose AI can hold up to scrutiny.
