A new benchmark called QuanBench+ tests large language models on quantum code generation across three major frameworks simultaneously, revealing that even the strongest models succeed on fewer than 60% of tasks in a single attempt — and that performance varies sharply depending on which framework is used.

Quantum computing software has coalesced around a handful of competing frameworks, most notably Qiskit (developed by IBM), PennyLane (from Xanadu), and Cirq (from Google). Until now, evaluations of LLM-based quantum code generation have largely been siloed within a single framework, making it nearly impossible to tell whether a model genuinely understands quantum mechanics or has simply memorised framework-specific syntax and patterns.

Why Single-Framework Benchmarks Miss the Point

The researchers behind QuanBench+, whose paper appeared on arXiv in April 2025, argue that this single-framework approach introduces a fundamental blind spot. A model that scores well on Qiskit tasks might be pattern-matching against its training data rather than reasoning about quantum states, gate operations, or algorithm structure. By aligning 42 tasks across all three frameworks — covering quantum algorithms, gate decomposition, and state preparation — QuanBench+ forces models to demonstrate transferable understanding rather than syntactic familiarity.

The benchmark evaluates models using executable functional tests, reporting Pass@1 (success on the first attempt) and Pass@5 (success within five attempts). For probabilistic outputs, where quantum circuits produce distributions rather than deterministic results, the team used KL-divergence-based acceptance — a statistical measure of how closely a model's output distribution matches the expected one.
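A KL-divergence acceptance check of this kind is straightforward to sketch in plain Python. The helper names and the 0.05 threshold below are illustrative assumptions, not details from the paper:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two outcome distributions given as dicts.

    eps guards against zero probabilities in the observed distribution.
    """
    return sum(pi * math.log((pi + eps) / (q.get(k, 0.0) + eps))
               for k, pi in p.items() if pi > 0)

def accept(expected, observed, threshold=0.05):
    """Accept a probabilistic circuit output if its sampled distribution
    is statistically close to the ideal one (threshold is illustrative)."""
    return kl_divergence(expected, observed) <= threshold

# Ideal Bell-state distribution vs. counts sampled from a generated circuit
ideal = {"00": 0.5, "11": 0.5}
sampled = {"00": 0.48, "11": 0.50, "01": 0.02}
print(accept(ideal, sampled))  # → True: close distributions pass the check
```

The `eps` smoothing matters in practice: a generated circuit that assigns zero probability to an expected outcome would otherwise make the divergence undefined rather than merely large.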

Reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.

The Numbers: Progress Is Real, but Gaps Remain Wide

The headline results are instructive. In one-shot testing, the strongest models achieved 59.5% on Qiskit tasks, 54.8% on Cirq, and just 42.9% on PennyLane — a spread of nearly 17 percentage points across frameworks that are, in principle, expressing the same underlying quantum operations. The consistent underperformance on PennyLane likely reflects its relative scarcity in LLM training data compared to Qiskit, which benefits from IBM's extensive public documentation and community activity.
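For context, Pass@k figures like these are conventionally computed with the standard unbiased estimator from the code-generation literature; whether QuanBench+ uses this exact estimator is an assumption here, not a claim from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of
    k samples, drawn without replacement from n generations of which c
    pass, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations per task with 4 passing: pass@5 is already high
print(round(pass_at_k(n=10, c=4, k=5), 3))  # → 0.976
```

This is why Pass@5 scores routinely sit well above Pass@1: even a modest per-sample success rate compounds quickly when a model gets multiple attempts.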

The study also introduced a feedback-based repair condition, in which a model is shown a runtime error or incorrect output and allowed to revise its code. This mechanism — analogous to how a human developer would iterate — produced substantially better results: 83.3% on Qiskit, 76.2% on Cirq, and 66.7% on PennyLane. The improvement is substantial, but the framework gap persists even with repair. These benchmark results are self-reported by the research team and have not yet undergone independent peer review.

What Feedback-Based Repair Reveals About Model Reasoning

The repair results deserve careful interpretation. A jump from 59.5% to 83.3% on Qiskit suggests that many initial failures are recoverable — syntax errors, minor logic mistakes, or misremembered API calls that a model can fix when given explicit error signals. This is encouraging for practical deployment scenarios, where iterative code generation with a compiler or simulator in the loop is already common practice.
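The generate-then-repair loop described above can be sketched generically. The `generate_code` and `run_tests` callables below are hypothetical stand-ins for an LLM call and an executable test harness, and the prompt format is invented for illustration; none of this is the paper's implementation:

```python
def repair_loop(task, generate_code, run_tests, max_repairs=1):
    """Generate code for a task; on failure, feed the error back and retry.

    generate_code(prompt) -> source string, and run_tests(code) ->
    (passed, error_message), are hypothetical stand-ins for an LLM call
    and a test harness with a compiler or simulator in the loop.
    """
    code = generate_code(task)
    passed, error = run_tests(code)
    rounds = 0
    while not passed and rounds < max_repairs:
        # The repair prompt shows the model its own code and the runtime error
        feedback = f"{task}\n\nPrevious attempt:\n{code}\n\nError:\n{error}"
        code = generate_code(feedback)
        passed, error = run_tests(code)
        rounds += 1
    return code, passed

# Toy stand-ins: the "model" fixes its code once it sees an error message
def fake_model(prompt):
    return "fixed_circuit" if "Error:" in prompt else "broken_circuit"

def fake_harness(code):
    ok = code == "fixed_circuit"
    return ok, "" if ok else "RuntimeError: wrong output distribution"

code, ok = repair_loop("Prepare a Bell state", fake_model, fake_harness)
print(ok)  # → True: the first attempt fails, the repair round succeeds
```

The toy harness makes the recoverable-failure pattern concrete: the first generation fails, but a single round of explicit error feedback is enough to reach a passing solution.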

However, the persistent gap between frameworks after repair suggests something deeper than surface-level syntax errors. If failures were purely syntactic, repair rates should converge across frameworks. Instead, the nearly 17-percentage-point gap between Qiskit and PennyLane after repair implies that models lack sufficient conceptual grounding to translate quantum intent reliably into less-familiar framework idioms.

This matters because real-world quantum software development is not framework-monogamous. Research teams regularly prototype in one framework and deploy in another, or must work across frameworks when collaborating with partners using different hardware backends.

The 42-Task Design: Breadth Over Depth

The benchmark's 42 aligned tasks are structured to cover three categories: quantum algorithms (such as Grover's search or the Quantum Fourier Transform), gate decomposition (expressing operations as sequences of primitive gates), and state preparation (initialising qubits in specified quantum states). Aligning these tasks across all three frameworks required careful design to ensure equivalent difficulty — a non-trivial challenge given that the frameworks expose different levels of abstraction and use different naming conventions.

The executable testing approach is a meaningful methodological choice. Many coding benchmarks rely on surface-level checks or human evaluation; QuanBench+ actually runs the generated code and checks whether it produces correct results. For quantum circuits, this means either simulating the circuit and comparing output probabilities, or — in the case of deterministic operations — checking exact outputs.
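As a toy illustration of the simulate-and-compare approach, one can apply gate matrices to a statevector and check the resulting measurement probabilities. This sketch uses plain NumPy rather than any of the three frameworks, and the Bell-state task is chosen as an example, not drawn from the benchmark's task list:

```python
import numpy as np

# Single-qubit Hadamard, and the two-qubit CNOT with qubit 0 as control
# (basis ordering |q0 q1>: |00>, |01>, |10>, |11>)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

# Prepare a Bell state: start in |00>, apply H to qubit 0, then CNOT
state = np.zeros(4)
state[0] = 1.0
state = CNOT @ np.kron(H, np.eye(2)) @ state

# The functional test: compare output probabilities against the ideal
probs = np.abs(state) ** 2            # distribution over |00>,|01>,|10>,|11>
expected = np.array([0.5, 0.0, 0.0, 0.5])
assert np.allclose(probs, expected)   # generated circuit passes
```

A benchmark harness would run the model's generated circuit through this kind of simulation and compare the resulting distribution against the reference, exactly or statistically depending on whether the task is deterministic.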

What This Means

QuanBench+ establishes a clearer standard for evaluating LLM quantum coding ability, and its results set a concrete baseline: current models are useful assistants for quantum code generation but cannot yet be trusted to work reliably across the full ecosystem of frameworks without human oversight and iterative correction.