A new study published on arXiv argues that hallucinations in large automatic speech recognition models are not random errors but the predictable result of a mathematical phase transition in how neural networks process information — and that bigger models may be structurally prone to more dangerous forms of confabulation.

The research, posted to arXiv in April 2025, targets OpenAI's Whisper model family — one of the most widely deployed ASR systems in the world, used in everything from medical transcription to legal proceedings. The authors introduce what they call the Spectral Sensitivity Theorem, a theoretical framework that predicts two distinct behavioural regimes in deep networks based on how signal energy is distributed across a model's internal layers.

What the Spectral Sensitivity Theorem Actually Claims

At the heart of the theory is the idea that a neural network's layers can either disperse or attract information, depending on the gain (roughly, how much a layer amplifies its input) and on the alignment between successive layers. In the dispersive regime, the signal decays as it passes through the network. In the attractor regime, information collapses toward a single dominant pattern, a phenomenon the researchers call rank-1 collapse.
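The article does not reproduce the paper's formal definitions, but the gain-driven distinction can be sketched in a toy form. Assuming, for illustration only, that a layer's gain is its spectral norm, the product of gains across a stack of layers determines whether a signal shrinks (dispersive) or is amplified toward a dominant direction (attractor-like):

```python
import numpy as np

# Illustrative sketch only: the paper's exact gain and alignment definitions
# are not given in the article. Here a layer's gain is taken to be the largest
# singular value (spectral norm) of its weight matrix.

def layer_gain(weight: np.ndarray) -> float:
    """Spectral norm of a layer's weight matrix."""
    return float(np.linalg.norm(weight, ord=2))

def classify_regime(weights: list[np.ndarray]) -> str:
    """Classify a stack of layers by the product of their gains."""
    total_gain = float(np.prod([layer_gain(w) for w in weights]))
    return "attractor" if total_gain > 1.0 else "dispersive"

shrinking = [0.5 * np.eye(4) for _ in range(3)]   # each layer has gain 0.5
amplifying = [2.0 * np.eye(4) for _ in range(3)]  # each layer has gain 2.0
print(classify_regime(shrinking))   # dispersive
print(classify_regime(amplifying))  # attractor
```

In a real network the alignment between successive layers matters as much as the individual gains, which this scalar product ignores; the sketch only conveys why compounding amplification can tip a deep stack into a qualitatively different regime.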

To test this prediction, the team analysed the eigenspectra of activation graphs — mathematical representations of how information flows through a model — across Whisper variants ranging from Tiny to Large-v3-Turbo, subjecting each to adversarial stress conditions designed to probe their stability.
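The exact construction of the activation graphs is not detailed in the article, but a common proxy for this kind of eigenspectral analysis is the effective rank of an activation matrix, derived from the entropy of its normalised singular-value spectrum. A minimal sketch, assuming that proxy:

```python
import numpy as np

# Hedged sketch: a standard entropy-based effective-rank measure, used here
# as a stand-in for the paper's (unspecified) eigenspectral analysis.

def effective_rank(activations: np.ndarray) -> float:
    """Effective rank = exp(entropy of the normalised singular values)."""
    s = np.linalg.svd(activations, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop numerically zero values before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
healthy = rng.standard_normal((64, 32))            # well-spread spectrum
collapsed = np.outer(rng.standard_normal(64),      # exact rank-1 matrix,
                     rng.standard_normal(32))      # mimicking rank-1 collapse

print(effective_rank(healthy))    # much larger than 1
print(effective_rank(collapsed))  # close to 1
```

Under this measure, a matrix whose energy is spread across many directions scores near its full rank, while a rank-1 collapse of the kind the theorem predicts scores near 1, making the transition between regimes directly observable.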


The results, according to the authors, confirmed the theory's predictions with notable precision.

Two Failure Modes, Not One

The study identifies a meaningful distinction between how intermediate and large models fail. Intermediate-scale Whisper models exhibit what the researchers call Structural Disintegration — classified as Regime I — characterised by a 13.4% collapse in Cross-Attention rank. Cross-Attention is the mechanism by which the model's decoder connects its predictions to the encoded audio input; a rank collapse there means the model is drawing on a narrower and less representative slice of the acoustic signal.

Large models behave differently. Rather than disintegrating, they enter Regime II, a Compression-Seeking Attractor state, in which Self-Attention (the mechanism the model uses to reason about its own prior outputs) actively compresses rank, reducing it by 2.34%, and, critically, hardens the model's spectral slope. This makes large models more internally consistent, but that consistency comes at a cost: the model becomes increasingly decoupled from the audio it is supposed to be transcribing.
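"Spectral slope" is not defined in the article; one plausible reading is the slope of the singular-value spectrum on log-log axes, where a steeper (more negative) slope means energy is concentrating in a few dominant directions. A hedged sketch under that assumption:

```python
import numpy as np

# Illustrative only: the paper's actual spectral-slope estimator is not
# reproduced here. We fit a least-squares line to log singular values
# against log rank index.

def spectral_slope(activations: np.ndarray) -> float:
    """Slope of log singular values vs. log rank index (more negative = harder)."""
    s = np.linalg.svd(activations, compute_uv=False)
    s = s[s > 0]
    idx = np.arange(1, len(s) + 1)
    slope, _ = np.polyfit(np.log(idx), np.log(s), 1)
    return float(slope)

rng = np.random.default_rng(0)
diffuse = rng.standard_normal((64, 32))  # energy spread across directions
# Mix in a single dominant direction to mimic compression under stress.
compressed = diffuse + 10.0 * np.outer(rng.standard_normal(64),
                                       rng.standard_normal(32))

# The compressed matrix shows a steeper (more negative) slope.
print(spectral_slope(diffuse) > spectral_slope(compressed))  # True
```

Under this reading, a "hardened" slope is exactly the signature of a model concentrating its internal representation into fewer directions, consistent with the compression-seeking behaviour the paper describes.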

In practical terms, this means a large Whisper model under stress does not simply produce garbled output. It produces fluent, confident, grammatically coherent text that may bear little relationship to what was said.

Why This Matters for Safety-Critical Deployments

Hallucination in speech recognition is not a minor inconvenience. Whisper and comparable ASR models are deployed in medical settings to transcribe clinical notes, in courtrooms to log proceedings, and in accessibility tools for deaf and hard-of-hearing users. A model that confidently fabricates words — particularly one that does so in a structurally consistent way that resists easy detection — poses real risks in any of these contexts.

The research frames this as a critical safety risk, and the theoretical framing strengthens that concern. If large models are structurally predisposed to enter attractor states under adversarial or out-of-distribution conditions, then scaling up model size — the default industry response to improving performance — may not reduce hallucination risk. It may change its character in ways that are harder to catch.

The adversarial stress testing methodology merits scrutiny. The paper does not detail the exact nature of the adversarial inputs used, and the benchmark results are drawn from the authors' own experimental setup rather than an independent third-party evaluation. That said, the mathematical framework the authors propose is falsifiable and specific, which sets it apart from more descriptive accounts of hallucination behaviour.

Spectral Analysis as a Diagnostic Tool

One of the more actionable contributions of the paper is methodological. By using eigenspectral analysis of activation graphs as a diagnostic lens, the researchers propose a way to monitor — and potentially predict — when a model is drifting into an attractor state during inference. This could, in principle, serve as the basis for real-time hallucination detection or for flagging high-risk outputs before they reach an end user.
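The article does not describe a concrete detector, but the diagnostic idea can be sketched as a simple inference-time monitor: track the effective rank of an attention map during decoding and flag a step when it drops sharply against a running baseline. All names and thresholds below are hypothetical, not from the paper:

```python
import numpy as np

def effective_rank(mat: np.ndarray) -> float:
    """Entropy-based effective rank of a matrix's singular-value spectrum."""
    s = np.linalg.svd(mat, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

class RankCollapseMonitor:
    """Hypothetical sketch of a rank-collapse flag for decoding steps."""

    def __init__(self, drop_threshold: float = 0.5):
        self.baseline: float | None = None
        self.drop_threshold = drop_threshold  # flag below 50% of baseline

    def check(self, attention_map: np.ndarray) -> bool:
        """Return True if this step's effective rank signals possible collapse."""
        r = effective_rank(attention_map)
        if self.baseline is None:
            self.baseline = r  # first step establishes the baseline
            return False
        collapsed = r < self.drop_threshold * self.baseline
        if not collapsed:
            # Slowly adapt the baseline on healthy steps.
            self.baseline = 0.9 * self.baseline + 0.1 * r
        return collapsed

rng = np.random.default_rng(0)
monitor = RankCollapseMonitor()
healthy = rng.standard_normal((16, 16))
rank1 = np.outer(rng.standard_normal(16), rng.standard_normal(16))

print(monitor.check(healthy))  # False: establishes the baseline
print(monitor.check(rank1))    # True: flags the near-rank-1 map
```

A production detector would need calibrated thresholds per layer and model size, but the structure — a cheap spectral statistic compared against recent history — is the kind of real-time check the paper's framework makes conceivable.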

The Spectral Sensitivity Theorem, if it holds under broader experimental validation, would also give model developers a new lever to consider during training: layer-wise gain and alignment as variables that influence not just performance but structural stability under stress.

The Whisper family was chosen, the authors suggest, because its range of scales — from the lightweight Tiny model to the full Large-v3-Turbo — makes it an ideal testbed for observing the transition between regimes. Whether the same phase transition dynamics apply to other large ASR architectures, or to large language models more broadly, remains an open question the paper does not attempt to answer.

What This Means

If the Spectral Sensitivity Theorem is validated by independent research, it would give the AI safety and deployment community a concrete, mathematically grounded explanation for why scaling speech recognition models does not eliminate hallucination — and why the hallucinations that survive may be the most dangerous kind.