A purpose-built AI model with just 1.3 million parameters has outperformed large language models up to 92,000 times its size at playing the classic first-person shooter DOOM in real time, scoring more kills than all tested LLMs combined.
The model, named SauerkrautLM-Doom-MultiVec, was developed by researchers and published on arXiv in April 2025. It was trained on only 31,000 human gameplay demonstrations and runs on consumer hardware, making a decision every 31 milliseconds. The large models it was benchmarked against include Nemotron-120B, Qwen3.5-27B, and GPT-4o-mini — all receiving identical inputs.
How a Model This Small Outperforms Models This Large
The architecture combines a ModernBERT encoder with hash embeddings, depth-aware token representations, and an attention-pooling classification head. Rather than processing video or raw pixel data, all agents (the small model and the LLMs alike) received the same ASCII representations of game frames alongside depth maps, converting the visual scene into a text-like format that language models can in principle interpret.
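The paper's exact frame encoding is not reproduced here, but the general idea of rendering a depth map as a grid of ASCII glyphs can be sketched in a few lines. The character palette and bucketing below are illustrative assumptions, not the study's actual scheme:

```python
# Minimal sketch of turning a depth map into an ASCII "frame" a text
# model can consume. Palette and bucketing are invented for illustration.

def depth_to_ascii(depth_map, palette=" .:-=+*#%@"):
    """Map each depth value (0.0 = far, 1.0 = near) to a character.

    Nearer surfaces get denser glyphs, so walls and enemies show up
    as darker regions in the text grid.
    """
    rows = []
    for row in depth_map:
        chars = []
        for d in row:
            d = min(max(d, 0.0), 1.0)            # clamp to [0, 1]
            idx = int(d * (len(palette) - 1))    # pick a palette bucket
            chars.append(palette[idx])
        rows.append("".join(chars))
    return "\n".join(rows)

frame = [
    [0.1, 0.1, 0.9, 0.1],
    [0.1, 0.5, 0.9, 0.1],
]
print(depth_to_ascii(frame))
```

A grid like this, one character per cell, is something a language model can tokenise and attend over just like ordinary text.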
The task was the DOOM scenario called "defend_the_center", where an agent must survive waves of enemies in an enclosed arena. SauerkrautLM-Doom-MultiVec scored 178 frags across 10 episodes, averaging 17.8 kills per episode. The combined total for all tested LLMs was just 13 kills.
Despite having 92,000 times fewer parameters than Nemotron-120B, the small model was the only agent that actively engaged enemies rather than purely evading them.
The researchers note that this behavioural difference is significant: the LLMs defaulted to evasion strategies, while the specialised model learned to actively hunt and eliminate threats — the core objective of the scenario.
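The headline ratio is straightforward arithmetic over the two stated parameter counts:

```python
# Sanity check on the headline figure: 120 billion parameters for
# Nemotron-120B versus 1.3 million for the specialised model.
large = 120_000_000_000
small = 1_300_000
ratio = large / small
print(round(ratio))  # 92308, reported as roughly "92,000 times"
```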
Why Large Language Models Struggle With Real-Time Control
The result reflects structural differences in model design. Large language models are trained on broad text corpora to predict the next token across a vast range of topics and tasks. Real-time game control, by contrast, demands low-latency, sequential decision-making grounded in precise spatial and temporal context — a fundamentally different kind of problem.
LLMs also carry significant inference overhead. Running a 120-billion-parameter model in real time introduces latency that is structurally incompatible with the 31 ms decision windows live gameplay requires. The small model's speed is not a secondary benefit; it is a core part of what makes it functional at all in this context.
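The budget is worth making concrete. A 31 ms window caps any agent at roughly 32 decisions per second, which end-to-end LLM inference at this scale cannot sustainably hit:

```python
# Decisions per second implied by a 31 ms decision window.
window_ms = 31
decisions_per_second = 1000 / window_ms
print(f"{decisions_per_second:.1f}")  # ≈ 32.3
```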
The study contributes to a growing body of research questioning whether scaling general-purpose models is the right approach for every AI task. Several recent papers have demonstrated that narrow, domain-trained models can outperform general systems at specific tasks — from protein structure prediction to code execution — but this result is notable for how extreme the parameter gap is.
Trained on Humans, Designed for One Job
The training methodology is as instructive as the architecture. Rather than using reinforcement learning from scratch or fine-tuning a large pretrained model, the researchers trained their model on a relatively small set of human gameplay demonstrations. This approach — known as imitation learning or behavioural cloning — allows the model to learn directly from the strategies a competent human player would use, without needing to explore the game environment from first principles.
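In its simplest form, behavioural cloning is supervised learning on (state, action) pairs: the model is trained to reproduce the demonstrator's action for each observed state. The toy sketch below, with invented shapes and synthetic data standing in for the paper's setup, fits a linear policy by cross-entropy gradient descent:

```python
import numpy as np

# Toy behavioural-cloning sketch: fit a linear policy to imitate a
# demonstrator. The state/action dimensions and data are synthetic
# stand-ins, not the paper's actual configuration.

rng = np.random.default_rng(0)
n_demos, state_dim, n_actions = 500, 8, 4

states = rng.normal(size=(n_demos, state_dim))
# Synthetic "demonstrator": actions chosen by a hidden linear rule.
true_w = rng.normal(size=(state_dim, n_actions))
actions = (states @ true_w).argmax(axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

w = np.zeros((state_dim, n_actions))
one_hot = np.eye(n_actions)[actions]
for _ in range(200):
    probs = softmax(states @ w)                   # predicted action dist.
    grad = states.T @ (probs - one_hot) / n_demos # cross-entropy gradient
    w -= 0.5 * grad                               # gradient-descent step

accuracy = ((states @ w).argmax(axis=1) == actions).mean()
print(f"imitation accuracy: {accuracy:.2f}")
```

The point of the sketch is the training signal, not the model: the policy never explores the environment, it only learns to match the demonstrations, which is why a small, relevant demonstration set can go a long way.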
The 31,000-demonstration dataset is modest by modern machine learning standards, yet it proved sufficient to produce a competitive agent. This suggests that when training data is tightly matched to the target task, relevance matters more than volume.
The model's ability to run on consumer hardware also carries practical implications. Deploying a 1.3-million-parameter model requires a fraction of the compute needed to serve any of the LLMs it outperformed. In cost terms, inference from a model this size is orders of magnitude cheaper per decision.
Benchmarks Are Self-Reported — With Caveats Worth Noting
The performance figures in this study are self-reported by the authors in a preprint that has not yet undergone formal peer review, and independent replication has not been confirmed. The benchmark scenario — defend_the_center — is a single, specific DOOM environment, and results may not generalise to more complex maps, objectives, or game types.
It is also worth noting that the LLMs were not fine-tuned for this task; they were prompted to make decisions from ASCII input, which is a significant structural disadvantage. A more competitive comparison might involve fine-tuned versions of smaller language models, not frontier systems used out of the box. The researchers acknowledge the setup tests LLMs in a zero-shot or minimally adapted configuration.
Nonetheless, even accounting for these caveats, the scale of the performance gap — 178 versus 13 kills — is difficult to explain away entirely as a methodological artefact.
What This Means
For practitioners and organisations deciding how to deploy AI systems, this research reinforces a practical principle: when the task is well-defined and relevant training data is available, a small specialised model can outperform a large general one, and do so faster, cheaper, and with far less infrastructure.