Researchers have unveiled AlphaLab, an autonomous AI system capable of running complete scientific experiments — including writing and testing GPU code, reviewing literature, and iterating on results — without human intervention, reporting performance gains of up to 91x over standard PyTorch compilation in one domain.
The paper, posted to arXiv in April 2025, describes a system built around frontier large language models that handles three distinct research tasks through a single, unmodified pipeline. The authors tested the system using GPT-5.2 and Claude Opus 4.6, evaluating it across CUDA kernel optimisation, LLM pretraining, and traffic forecasting — three domains chosen specifically because they are structurally dissimilar.
How AlphaLab Runs an Experiment Without Human Input
AlphaLab operates in three sequential phases. First, given only a dataset and a natural-language objective, the system explores the data independently: it writes analysis code, executes it, and produces a research report summarising what it has learned. Second, it constructs its own evaluation framework and stress-tests that framework adversarially to check for flaws. Third, it runs large-scale GPU experiments using what the authors call a Strategist/Worker loop — one agent sets direction and prioritises hypotheses while another executes experiments — accumulating findings in a persistent document the team calls a "playbook."
That playbook functions as a form of online prompt optimisation: the system continuously updates its own knowledge base as results come in, allowing later experiments to build on earlier findings rather than starting from scratch each time.
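The Strategist/Worker loop with a persistent playbook can be sketched in miniature. Everything below is an illustrative assumption — the class and function names (`Playbook`, `strategist`, `worker`) are invented, and the real system delegates both roles to LLM agents rather than the stub logic shown here:

```python
# Minimal sketch of a Strategist/Worker loop with a persistent playbook.
# All names are illustrative; the paper's actual interfaces are not shown here.
from dataclasses import dataclass, field

@dataclass
class Playbook:
    """Persistent findings document that later experiments build on."""
    entries: list = field(default_factory=list)

    def record(self, finding: str) -> None:
        self.entries.append(finding)

def strategist(playbook: Playbook, hypotheses: list) -> str:
    # Set direction: pick the next untested hypothesis, informed by prior
    # findings. (Naive stand-in; the real system uses an LLM for this.)
    for h in hypotheses:
        if all(h not in entry for entry in playbook.entries):
            return h
    return hypotheses[0]

def worker(hypothesis: str) -> str:
    # Execute the experiment; stubbed out as a fixed result string.
    return f"tested {hypothesis}: result recorded"

def run_campaign(hypotheses: list, steps: int = 3) -> Playbook:
    playbook = Playbook()
    for _ in range(steps):
        h = strategist(playbook, hypotheses)   # direction-setting agent
        playbook.record(worker(h))             # executing agent
    return playbook

pb = run_campaign(["fuse kernels", "tile loops", "vectorise loads"])
```

The key structural point survives even in this toy version: the playbook is threaded through every iteration, so each round of direction-setting sees everything learned so far.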
The same pipeline handles qualitatively different tasks without modification — all domain-specific behaviour is factored into adapters generated by the model itself.
This design choice is significant. Rather than hand-coding domain knowledge or task-specific instructions, AlphaLab generates its own adapters for each new problem. The authors argue this makes the system general-purpose across these test domains, not merely a specialised tool.
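One way to picture model-generated adapters: the pipeline stays fixed, and the domain-specific glue is code the model writes at runtime. In this sketch, `call_llm` is a placeholder returning canned source rather than querying a real model, and the adapter interface is an assumption, not the paper's actual design:

```python
# Illustrative sketch: a fixed pipeline asks the model to write the
# domain-specific adapter code, then loads it dynamically.
# `call_llm` is a stand-in; a real system would query a frontier model.

def call_llm(prompt: str) -> str:
    # Canned response standing in for model-generated adapter source.
    return (
        "def load_data(path):\n"
        "    # domain-specific loading logic would go here\n"
        "    return path\n"
    )

def generate_adapter(task_description: str) -> dict:
    """Ask the model to write the domain-specific glue for a new task."""
    source = call_llm(f"Write a data-loading adapter for: {task_description}")
    namespace = {}
    exec(source, namespace)  # the generated code defines load_data
    return {"load_data": namespace["load_data"]}

adapter = generate_adapter("traffic forecasting on sensor time series")
```

The design trade-off is clear even here: no domain knowledge lives in the harness itself, at the cost of trusting (and therefore needing to validate) code the model produces.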
What the Benchmarks Show
The authors report notable performance figures across all three test domains.
In CUDA kernel optimisation, AlphaLab wrote GPU kernels that ran 4.4x faster than torch.compile on average, with a peak gain of 91x on specific tasks. torch.compile is PyTorch's built-in compilation tool, widely used in production ML pipelines, making this a meaningful comparison point for practitioners.
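For readers wanting to interpret the "4.4x average, 91x peak" framing: per-task speedup is baseline time divided by optimised time, and averages of speedups are usually reported as a geometric mean so a single outlier task does not dominate. The timings below are invented for illustration and are not the paper's data:

```python
# Hypothetical per-task timings (ms); NOT the paper's measurements.
import math

baseline_ms  = [12.0, 8.0, 30.0, 5.0]   # e.g. torch.compile baselines
optimised_ms = [ 3.0, 2.0,  5.0, 2.5]   # e.g. generated-kernel times

# Per-task speedup: baseline / optimised.
speedups = [b / o for b, o in zip(baseline_ms, optimised_ms)]

# Geometric mean: robust to one large outlier dominating the average.
geo_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))

print(speedups)              # [4.0, 4.0, 6.0, 2.0]
print(round(geo_mean, 2))    # 3.72
```

Note that the arithmetic mean of the same numbers would be 4.0 — whether the paper's 4.4x figure is arithmetic or geometric is a detail worth checking in the paper itself.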
In LLM pretraining, the full AlphaLab system achieved 22% lower validation loss compared to a single-shot baseline using the same underlying model — meaning the iterative, multi-agent research process extracted substantially more from the same compute and model capacity.
In traffic forecasting, the system improved standard baselines by 23–25% after autonomously researching and implementing published model architectures from the academic literature. This last result is particularly notable: the system identified relevant prior work, read it, and incorporated it into its experimental approach without being told to do so.
Two Models, Different Solutions
One of the paper's more interesting findings concerns what happens when you run two different frontier models through the same pipeline on the same problem. GPT-5.2 and Claude Opus 4.6 consistently discovered qualitatively different solutions in every domain tested. Neither model outperformed the other uniformly across all tasks.
The authors interpret this as evidence that running multiple models in parallel — what they call a "multi-model campaign" — provides complementary search coverage. In other words, the two models explore different parts of the solution space, and a research operation that uses both is likely to find better answers than one that commits to a single model.
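The "complementary search coverage" argument reduces to simple arithmetic: take the per-task best across models and it can beat either model's own total. The scores below are invented purely to illustrate the selection logic:

```python
# Invented scores (higher is better); not from the paper.
scores = {
    "gpt":    {"cuda": 0.80, "pretrain": 0.70, "traffic": 0.60},
    "claude": {"cuda": 0.65, "pretrain": 0.75, "traffic": 0.72},
}

tasks = scores["gpt"].keys()

# Multi-model campaign: keep the best result per task across models.
best_per_task = {t: max(scores[m][t] for m in scores) for t in tasks}

single_best = max(sum(s.values()) for s in scores.values())
campaign    = sum(best_per_task.values())

# Per-task selection can only match or beat committing to one model.
assert campaign >= single_best
```

Here the campaign total (2.27) exceeds the better single model (2.12) because each model wins on different tasks — exactly the pattern the authors report.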
This has direct practical implications for organisations building automated research infrastructure. Relying on a single frontier model may leave significant performance gains undiscovered.
Autonomous Research Systems: Where This Fits
AlphaLab joins a growing category of systems designed to automate or accelerate scientific research using LLMs. Earlier efforts, including AI Scientist from Sakana AI and various agent-based coding systems, have demonstrated that language models can generate hypotheses, write code, and interpret results. AlphaLab's contribution is an end-to-end harness that connects these capabilities into a coherent experimental loop with persistent memory and self-generated evaluation criteria.
The use of an adversarially validated evaluation framework is worth highlighting. A longstanding criticism of automated research systems is that they can optimise for the wrong metric or game their own benchmarks. By having the system actively attempt to break its own evaluation before trusting it, AlphaLab's authors try to address this failure mode structurally.
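Adversarial validation of an evaluation can be sketched concretely: before trusting a metric, probe it with degenerate submissions that a flawed metric might accidentally reward. The metric and probes here are illustrative assumptions, not the paper's actual checks:

```python
# Sketch: stress-test an evaluation metric with degenerate submissions
# before trusting it. Metric and probes are illustrative only.

def evaluate(predictions, targets):
    """Toy metric: negative mean absolute error (higher is better)."""
    n = len(targets)
    return -sum(abs(p - t) for p, t in zip(predictions, targets)) / n

def adversarial_probes(targets):
    # Degenerate strategies a gameable metric might score too well.
    return {
        "all_zeros":     [0.0] * len(targets),
        "constant_mean": [sum(targets) / len(targets)] * len(targets),
    }

def validate_metric(targets, honest_predictions):
    honest_score = evaluate(honest_predictions, targets)
    # Pass only if every degenerate probe scores worse than a
    # genuinely good submission.
    return all(
        evaluate(probe, targets) < honest_score
        for probe in adversarial_probes(targets).values()
    )

targets = [1.0, 2.0, 3.0, 4.0]
ok = validate_metric(targets, honest_predictions=[1.1, 1.9, 3.0, 4.2])
```

If any probe beats the honest submission, the metric is rejected and rebuilt — the structural safeguard the authors describe, in miniature.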
The authors have released all code publicly at the project's GitHub page, which lowers the barrier for replication and independent evaluation — an important consideration given that all reported results are, at this stage, self-reported by the research team.
What This Means
AlphaLab represents a concrete step toward AI systems that can conduct meaningful quantitative research autonomously across domains. The reported performance gains — if independently verified — suggest that multi-agent, iterative experimentation can extract substantially more value from frontier models than single-shot prompting alone.