A new neuro-symbolic architecture has lifted scores on the ARC-AGI-2 benchmark from 16% to 30.8% without any task-specific fine-tuning or reinforcement learning, according to a paper published on arXiv by researchers at CoreThink-AI.

The Abstraction and Reasoning Corpus (ARC), originally created by AI researcher François Chollet, is widely regarded as one of the hardest tests of machine general reasoning. Its second iteration, ARC-AGI-2, requires systems to identify abstract visual patterns and apply transformations to novel grid puzzles — a task that exposes the combinatorial brittleness of purely neural approaches and the perceptual limitations of purely symbolic ones.

Why Pure Neural and Pure Symbolic Systems Both Fall Short

The paper's central argument is that neither paradigm succeeds alone. Purely neural architectures — including large language models — can struggle to reliably generalise combinations of rules they haven't seen during training. Purely symbolic systems, by contrast, can struggle to ground abstract rules in messy visual inputs. The authors propose a three-stage pipeline designed to address both weaknesses simultaneously.

Separating perception, neural-guided transformation proposal, and symbolic consistency filtering improves generalisation without task-specific fine-tuning or reinforcement learning.

The system first extracts object-level structure from the input grids — essentially parsing raw pixel arrangements into meaningful units. It then uses neural priors, drawn from a large language model, to propose candidate transformations selected from a fixed domain-specific language (DSL) of atomic visual patterns inspired by how humans describe visual abstraction. Finally, a symbolic consistency filter cross-checks those proposed transformations across multiple examples before committing to an answer.
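In rough terms, the pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: every function name is invented, the toy DSL stands in for the paper's richer primitive set, and stage one (object extraction) is elided entirely.

```python
# Hypothetical sketch of stages 2 and 3 of the pipeline described above.

def propose_transforms(dsl, examples):
    """Stage 2: a neural prior would rank candidate DSL programs;
    this sketch simply enumerates every primitive."""
    return list(dsl)

def consistent(transform, examples):
    """Stage 3: keep a transform only if it maps every training
    input grid to its paired output grid."""
    return all(transform(inp) == out for inp, out in examples)

def solve(dsl, examples, test_input):
    for t in propose_transforms(dsl, examples):
        if consistent(t, examples):
            return t(test_input)
    return None  # no DSL program fit all the examples

# Toy DSL of atomic grid transformations
DSL = [
    lambda g: [row[::-1] for row in g],  # mirror each row
    lambda g: g[::-1],                   # flip vertically
]

examples = [([[1, 0], [2, 0]], [[0, 1], [0, 2]])]
print(solve(DSL, examples, [[3, 0], [0, 4]]))  # → [[0, 3], [4, 0]]
```

The key property the sketch preserves is that the consistency filter rejects any candidate that fails even one training pair, so only transformations that explain all the examples ever reach the test input.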

How the Numbers Break Down

On the public evaluation set of ARC-AGI-2, the base LLM alone scored 16% — a figure consistent with reported LLM performance on this benchmark class. The neuro-symbolic framework raised that to 24.4% independently. When the system was combined with ARC Lang Solver via a meta-classifier that selects between the two approaches, the combined score reached 30.8%. These benchmarks are self-reported by the authors and have not been independently verified at the time of publication.
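The meta-classifier idea amounts to per-task routing between two solvers. The sketch below is purely illustrative: the paper does not describe its routing features, so the rule and solver names here are invented.

```python
# Hypothetical per-task routing between two solvers, as implied by the
# combined 30.8% score. The feature and the threshold are made up.

def meta_select(task_features, solvers):
    """Pick a solver for this task. A real meta-classifier would be a
    learned model over task features, not a hand-written rule."""
    if task_features["grid_size"] <= 10:  # invented routing rule
        return solvers["neuro_symbolic"]
    return solvers["arc_lang"]

solvers = {
    "neuro_symbolic": lambda task: "neuro-symbolic answer",
    "arc_lang": lambda task: "ARC Lang answer",
}

chosen = meta_select({"grid_size": 6}, solvers)
print(chosen(None))  # → neuro-symbolic answer
```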

The improvement of roughly 14.8 percentage points over the base LLM is meaningful in context. ARC-AGI-2 is deliberately constructed to resist pattern-matching and memorisation, so gains here are generally taken as evidence of genuine reasoning improvement rather than data leakage or overfitting.

What the Architecture Actually Does Differently

The framework's unit patterns — the basic building blocks in its DSL — are designed to mirror the kind of perceptual chunks humans use when solving visual puzzles: symmetry, repetition, colour groupings, object boundaries. By constraining the search space to these human-interpretable primitives, the system avoids the combinatorial explosion that plagues brute-force search methods.
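To make the idea of human-interpretable primitives concrete, the predicates below check for the kinds of perceptual chunks the article lists: symmetry, repetition, and colour grouping. These are stand-ins written for this article, not the paper's actual DSL.

```python
# Illustrative primitives; not the paper's DSL.

def is_h_symmetric(grid):
    """True if every row reads the same forwards and backwards."""
    return all(row == row[::-1] for row in grid)

def is_repeated(grid, period):
    """True if the rows repeat with the given vertical period."""
    return all(grid[i] == grid[i % period] for i in range(len(grid)))

def colour_groups(grid):
    """Map each non-background colour to the count of its cells."""
    counts = {}
    for row in grid:
        for colour in row:
            if colour != 0:  # treat 0 as background
                counts[colour] = counts.get(colour, 0) + 1
    return counts

g = [[1, 2, 1], [0, 3, 0], [1, 2, 1]]
print(is_h_symmetric(g))  # → True
print(colour_groups(g))   # → {1: 4, 2: 2, 3: 1}
```

A search constrained to primitives like these stays small and interpretable, which is the article's point about avoiding combinatorial explosion.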

The authors argue their approach reduces reliance on sampling-based test-time scaling — the technique of generating thousands of candidate answers and selecting the most common one. That method is computationally expensive and scales poorly. The neuro-symbolic pipeline instead uses structured hypothesis filtering to reach answers more efficiently.
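The contrast between the two strategies can be shown with a toy example: majority voting over sampled answers versus pruning hypotheses that contradict a training example. Both functions are simplified stand-ins for the real techniques.

```python
from collections import Counter

def majority_vote(candidates):
    """Sampling-based scaling: pick the most frequent sampled answer."""
    return Counter(candidates).most_common(1)[0][0]

def filter_hypotheses(hypotheses, examples):
    """Structured filtering: keep only hypotheses consistent with
    every (input, output) example, pruning the rest early."""
    return [h for h in hypotheses if all(h(x) == y for x, y in examples)]

# Integer functions standing in for candidate grid transformations
hypotheses = [lambda x: x + 1, lambda x: x * 2, lambda x: x ** 2]
examples = [(2, 4), (3, 6)]

survivors = filter_hypotheses(hypotheses, examples)
print(len(survivors))    # → 1 (only x * 2 fits both examples)
print(survivors[0](10))  # → 20

print(majority_vote(["A", "B", "A", "A", "C"]))  # → A
```

Majority voting needs many samples before the right answer dominates, whereas filtering discards a wrong hypothesis after a single contradicting example, which is where the claimed efficiency comes from.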

The code has been released publicly at the CoreThink-AI GitHub repository, which allows independent researchers to reproduce results and extend the framework.

What Comes Next for ARC Performance

The paper positions this work as complementary to, rather than competing with, existing LLM-based solvers — the meta-classifier combination being the clearest illustration of that philosophy. Future directions implied by the architecture include expanding the DSL to cover more transformation types, improving the object-extraction stage for more complex grids, and testing whether the same pipeline transfers to reasoning domains beyond visual pattern matching.

ARC-AGI-2 was released in 2025 as a harder successor to the original ARC benchmark, with its creators explicitly designing it to be resistant to the test-time compute scaling strategies that had begun to inflate scores on ARC-AGI-1. A score above 30% on ARC-AGI-2 from a system that does not use fine-tuning is a notable data point in that ongoing contest.

What This Means

For researchers and developers working on general-purpose reasoning, this paper offers concrete evidence that hybrid neuro-symbolic architectures can outperform both of their component parts — and that structured abstraction, not more compute, may be a durable path toward robust generalisation.