Researchers have built a system that uses a large language model to automatically rewrite and improve the internal heuristics of IC3, one of the most widely used algorithms in hardware safety verification — and the resulting checker runs with no AI components at all once deployed.

IC3, also known as property-directed reachability (PDR), is a standard algorithm used to verify that a hardware design — a chip, circuit, or digital system — behaves safely under all possible conditions. It either confirms safety by producing a mathematical proof called an inductive invariant, or flags a violation by producing a counterexample trace. The algorithm is trusted precisely because its outputs are checkable: you do not have to trust the tool, only the certificate it produces. The problem, according to the researchers, is that IC3's real-world performance depends heavily on a tangle of hand-tuned heuristics — implementation choices that are costly to optimise manually, fragile across different hardware designs, and difficult to reproduce.

Why Heuristic Tuning in Formal Verification Is So Hard

Heuristics are the informal rules that guide a verification algorithm when no mathematically optimal choice exists. In IC3, dozens of such choices — about how to generalise blocked states, when to restart, how to order proof obligations — interact in ways that are poorly understood even by experts. Improving them typically requires deep specialist knowledge and weeks of manual experimentation across large benchmark suites. IC3-Evolve, described in a paper posted to ArXiv, automates this process by asking an LLM to propose small, targeted code modifications to an existing IC3 implementation.

Every candidate patch is admitted only through proof- and witness-gated validation, preventing unsound edits from ever being deployed.

The framework's key design constraint is that it treats the LLM purely as a proposal engine, not as a decision-maker. Patches are described as "slot-restricted and auditable" — meaning the LLM can only modify specific, predefined parts of the code, and every change it proposes is human-readable. This keeps the search space manageable and maintains the kind of transparency that formal verification demands.

The Proof Gate: Why Correctness Is Non-Negotiable in Hardware Verification

What separates IC3-Evolve from a conventional automated code-improvement tool is its proof-/witness-gated validation mechanism. When a patched IC3 reports that a hardware property is safe, it must emit a checkable inductive invariant certificate that an independent checker verifies. When it reports a violation, it must produce a replayable counterexample trace. Any patch that fails either gate — or that produces an unverifiable certificate — is rejected automatically, regardless of how much faster or more efficient it appears.

This matters because hardware verification is a high-stakes domain. A chip certified as safe when it is not can have consequences ranging from security vulnerabilities to catastrophic physical failures. The correctness gate means that the evolution process cannot accidentally introduce a bug that makes the checker appear to perform better by cutting corners on soundness.

Because the LLM operates entirely offline during the evolution phase, the final deployed artifact is a conventional, standalone IC3 checker. It carries no machine learning models, requires no GPU, and introduces no runtime inference latency. From the perspective of an engineer using the tool in production, it is indistinguishable from a hand-tuned checker — except that its heuristics have been optimised automatically.

Benchmark Results and Generalisability

The team trained IC3-Evolve's evolution process on the public Hardware Model Checking Competition (HWMCC) benchmark suite, which is the standard competitive testbed for formal hardware verification tools. They then evaluated how well the evolved heuristics transferred to separate, unseen benchmarks — including both public and industrial model checking problems.

According to the paper, IC3-Evolve discovered practical heuristic improvements that generalised beyond the training benchmarks. The researchers do not publish specific numerical performance figures in the abstract, and the full quantitative results are contained in the paper itself. It is worth noting that these benchmark results are self-reported by the research team and have not yet undergone independent peer review, as the paper is a preprint.

The use of industrial benchmarks as part of the evaluation is significant. Hardware verification tools that perform well on public competition benchmarks but fail on real chip designs are a known problem in the field. The inclusion of proprietary industrial cases suggests the researchers are aware of this gap and are testing for it directly.

A Template for LLM-Assisted Algorithm Engineering

IC3-Evolve fits into a small but growing body of work exploring whether LLMs can serve as useful engineering assistants in formal methods and algorithms research — not by replacing mathematical reasoning, but by navigating large, poorly-charted spaces of implementation choices. Related ideas have appeared in work on algorithm selection, hyperparameter optimisation, and automated theorem proving, though the specific combination of offline LLM evolution with formal correctness gating is less common.

The offline-only design is a deliberate architectural choice with practical consequences. Many proposals to integrate AI into safety-critical software tools founder on questions of runtime reliability, latency, and certification. By confining the LLM to the development phase and shipping a pure-code artifact, IC3-Evolve sidesteps these concerns entirely — though it also means the system cannot adapt to new hardware designs at inference time without re-running the evolution process.

What This Means

For hardware verification engineers, IC3-Evolve represents a credible path to improved formal checkers without sacrificing the mathematical guarantees that make those checkers trustworthy — and without requiring AI infrastructure in production environments.