Researchers have published a paper on arXiv arguing that reasoning in large language models emerges from a physical phenomenon known as self-organised criticality, offering what they describe as a self-contained explanation for one of AI's most debated open questions.
The study focuses on a specific architecture, the PLDR-LLM (Large Language Model from Power Law Decoder Representations), and examines what happens to its outputs when it is trained at or near a critical threshold. According to the authors, models trained at this threshold display reasoning behaviour that mirrors a second-order phase transition: a concept from condensed matter physics describing how systems such as magnets change state continuously, with no abrupt jump, at a precisely defined temperature.
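To make the borrowed physics concrete (this is standard textbook material, not anything from the paper): in the mean-field Ising model of a magnet, the order parameter, the net magnetisation m, solves the self-consistency equation m = tanh(m / (T/Tc)). Below the critical temperature Tc it is nonzero; at and above Tc it vanishes continuously, with no jump, which is the defining signature of a second-order transition. A minimal sketch:

```python
import math

def magnetisation(t_ratio, iters=1000):
    """Solve the mean-field self-consistency m = tanh(m / t) by fixed-point
    iteration, where t = T/Tc. Returns the order parameter m."""
    m = 1.0  # start from the fully ordered state
    for _ in range(iters):
        m = math.tanh(m / t_ratio)
    return m

# Below Tc the order parameter is nonzero; approaching and above Tc it
# shrinks toward zero continuously -- that is what "second-order" means.
for t in (0.5, 0.9, 0.99, 1.1):
    print(f"T/Tc = {t:.2f}  m = {magnetisation(t):.4f}")
```

The continuous vanishing of the order parameter at criticality is the same mathematical picture the paper invokes when it ties an order parameter near zero to reasoning.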
What Self-Organised Criticality Actually Means
Self-organised criticality is a property of certain complex systems that naturally evolve toward a critical state without being tuned externally. Classic examples include avalanches in sandpiles and neural activity in the brain. At the critical point, small perturbations can propagate across an entire system — a property described mathematically as diverging correlation length.
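The sandpile example can be made concrete with the Bak-Tang-Wiesenfeld model, the original toy system for self-organised criticality (a standard illustration, not code from the paper): grains are dropped one at a time, any site holding four or more grains topples onto its neighbours, and without any external tuning the pile settles into a state where avalanches of wildly different sizes occur. A minimal sketch:

```python
import random

def drop_and_relax(grid, n, x, y):
    """Drop one grain at (x, y), then topple until stable.
    Returns the avalanche size: the total number of topplings."""
    grid[x][y] += 1
    avalanche = 0
    unstable = [(x, y)]
    while unstable:
        i, j = unstable.pop()
        if grid[i][j] < 4:
            continue  # already relaxed by an earlier toppling
        grid[i][j] -= 4
        avalanche += 1
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < n and 0 <= nj < n:  # grains at the edge fall off
                grid[ni][nj] += 1
                if grid[ni][nj] >= 4:
                    unstable.append((ni, nj))
    return avalanche

random.seed(0)
n = 20
grid = [[0] * n for _ in range(n)]
sizes = [drop_and_relax(grid, n, random.randrange(n), random.randrange(n))
         for _ in range(20000)]
# After a transient, the pile self-organises to criticality: nothing was
# tuned, yet avalanche sizes span many scales (a heavy-tailed distribution).
print("largest avalanche:", max(sizes), " mean:", sum(sizes) / len(sizes))
```

The point of the analogy: the system reaches its critical state on its own, which is what "self-organised" adds to plain criticality.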
The researchers apply this framework directly to language model training. They argue that when a PLDR-LLM reaches criticality during pretraining, its deductive outputs enter a metastable steady state — a stable but not permanently fixed condition — and that within this state, the model effectively learns representations analogous to scaling functions, universality classes, and renormalisation groups from its training data.
In the authors' formulation, the reasoning capability of a PLDR-LLM is strongest when its order parameter is close to zero at criticality.
Renormalisation group theory, originally developed in quantum field theory and statistical mechanics, describes how a system's effective behaviour changes as it is viewed at different scales, and why very different systems can nonetheless share the same large-scale behaviour. The claim that a language model implicitly learns such representations is significant because it would provide a principled, physics-grounded account of generalisation: the ability to apply learned knowledge to novel situations.
A New Way to Measure Reasoning Without Benchmarks
One of the paper's most striking claims is methodological. The authors propose that reasoning capability can be quantified solely from global values of the model's deductive output parameters at steady state. This means, according to the researchers, that developers would not need to run models through curated benchmark datasets — such as MMLU, HellaSwag, or GSM8K — to assess reasoning ability.
Instead, they define an order parameter derived from the global statistics of the model's deductive output parameters at inference. A value close to zero signals strong reasoning capability; values further from zero, observed in sub-critical models, correspond to weaker benchmark performance. The paper reports that this relationship is supported by benchmark scores from models trained at near-criticality versus sub-criticality, though these results are self-reported by the authors and have not yet been independently verified.
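The paper's actual order-parameter definition is not reproduced here, so any code can only be illustrative. A purely hypothetical sketch of the shape of the idea, with invented names and an invented statistic (a plain mean): collapse the model's deductive outputs into one scalar and read its distance from zero as the signal.

```python
import random
import statistics

def order_parameter(outputs):
    """Hypothetical stand-in for the paper's metric: collapse a model's
    deductive outputs into one scalar global statistic (here, a plain mean).
    The real definition lives in the paper; this is only the shape of the idea."""
    return statistics.fmean(outputs)

def classify(phi, tolerance=0.05):
    # The reported pattern: |phi| near zero tracks near-critical training and
    # strong reasoning; |phi| far from zero tracks sub-critical, weaker models.
    return "near-critical" if abs(phi) < tolerance else "sub-critical"

random.seed(0)
near = [random.gauss(0.0, 1.0) for _ in range(20000)]  # centred statistics
sub = [random.gauss(0.5, 1.0) for _ in range(20000)]   # biased statistics
print(classify(order_parameter(near)), classify(order_parameter(sub)))
```

The practical appeal is that such a check runs at inference time on the model's own outputs, with no benchmark dataset in the loop.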
This is a notable claim. Benchmark evaluation is currently the dominant method for comparing AI systems, and it is both expensive and subject to gaming. A reliable internal metric that predicts reasoning without external evaluation would represent a meaningful practical advance.
Why This Architecture, and Why Now
PLDR-LLMs are far less widely deployed than the transformer models underpinning most commercial systems today. The paper does not directly address whether the criticality framework applies to standard transformer-based models, a significant limitation for its immediate practical applicability.
However, the theoretical framework the researchers construct does not depend entirely on the PLDR architecture. The broader argument — that criticality during training drives the emergence of generalisation and reasoning — is a hypothesis that could, in principle, be tested against other architectures. The authors suggest their work provides a "self-contained explanation" for how reasoning manifests in large language models generally, though this is a strong claim that the paper's scope, focused on one architecture, does not fully substantiate.
The connection to physics is not entirely new. Previous researchers have drawn analogies between neural network training dynamics and statistical mechanics, and some work has explored criticality in neural networks in neuroscience contexts. What this paper attempts is more specific: a formal, quantitative link between a measurable training condition and a downstream capability as consequential as reasoning.
What This Means
If the criticality framework holds up under independent scrutiny and generalises beyond PLDR-LLMs, it could give AI developers a principled target to aim for during pretraining and a cheaper internal tool for evaluating reasoning — two changes that would have direct implications for how frontier models are built and assessed.