Audio-Visual Large Language Models (AVLLMs) process sound and images together but default to vision when the two conflict, according to a new study posted to arXiv that its authors describe as the first mechanistic interpretability analysis of its kind applied to multimodal AI systems.
Researchers examined how audio and visual signals travel and merge through the layers of an AVLLM, tracing the path from raw input to final text output. Their central finding: audio information is genuinely present inside these models at intermediate layers, but it gets crowded out before it reaches the text-generation stage.
Audio Is There — It Just Gets Ignored
Using probing analyses — a technique that tests what information is encoded at different points inside a neural network — the researchers confirmed that useful audio representations exist in the model's intermediate layers. The model is not deaf. It processes sound and builds meaningful internal representations of it.
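The probing idea can be sketched in a few lines. The code below is an illustrative toy, not the paper's actual method or data: it fabricates synthetic "hidden states" in which an audio-derived label is weakly embedded, then trains a linear probe (logistic regression) to recover that label. High probe accuracy at a given layer is the evidence that the layer encodes the audio information in a linearly decodable form.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states: n examples of d-dimensional
# activations, as if extracted from one intermediate layer of an AVLLM.
n, d = 1000, 64
labels = rng.integers(0, 2, size=n)  # audio-derived label, e.g. speech vs. music

# Embed the label weakly in a few dimensions plus noise, mimicking audio
# information that is present in the layer but not dominant.
activations = rng.normal(size=(n, d))
activations[:, :4] += 1.5 * labels[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.3, random_state=0
)

# The linear probe: if a simple classifier can recover the audio label
# from the activations alone, the layer encodes that information.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = probe.score(X_test, y_test)
print(f"probe accuracy: {acc:.2f}")  # well above the 0.5 chance level
```

Running the same probe at every layer, then plotting accuracy against depth, is the standard way such studies locate where information appears and where it fades.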
The problem emerges in the deeper fusion layers, where audio and visual streams combine before the model generates text. At that stage, visual representations dominate, systematically suppressing audio cues. When audio and visual signals agree, this bias is invisible. When they conflict, the model sides with its eyes.
Useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues.
This distinction matters in practice. A model evaluating a video clip in which a speaker's words contradict an on-screen caption, or in which a background sound signals danger the camera cannot see, would likely miss the audio signal entirely, even though it technically processed it.
The Training Problem Behind the Bias
The researchers traced the imbalance back to how AVLLMs are built. Most are constructed by extending an existing vision-language model (VLM) — trained on image-text pairs — with additional audio capabilities. The study found that the AVLLM's audio behaviour closely mirrors that of its vision-language base model, suggesting that the audio fine-tuning stage does not substantially realign the model's internal priorities.
In plain terms: the model was trained to see first, and adding audio on top did not override that foundation. The visual bias is not a bug introduced during audio training — it is an inherited feature that audio training failed to correct.
This has implications for how multimodal models are evaluated. Standard benchmarks typically test modalities in isolation or in cooperative settings where audio and vision reinforce one another. The study suggests these benchmarks would not catch a systematic audio suppression problem, because the bias only surfaces under conflict conditions. All benchmark results referenced in the original paper are self-reported by the research team.
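The gap between cooperative and conflict testing is easy to make concrete. The sketch below is hypothetical (the model interface and labels are invented for illustration, not taken from the paper): it evaluates a model only on clips where the audio and visual labels disagree, and counts which modality the answer follows. A model that scores well on cooperative benchmarks could still show an extreme preference here.

```python
# Sketch of a conflict-condition evaluation. `model_answer` is a
# hypothetical callable mapping a clip to a text label; none of these
# names come from the paper.

def modality_preference(model_answer, pairs):
    """For clips whose audio and visual labels disagree, return the
    fraction of answers matching the visual and the audio label."""
    follows_vision = follows_audio = 0
    for visual_label, audio_label, clip in pairs:
        assert visual_label != audio_label  # conflict condition only
        ans = model_answer(clip)
        if ans == visual_label:
            follows_vision += 1
        elif ans == audio_label:
            follows_audio += 1
    total = len(pairs)
    return follows_vision / total, follows_audio / total

# Toy model that always trusts the visual stream, mimicking the bias the
# study reports; here each "clip" is just a (visual, audio) label pair.
vision_biased = lambda clip: clip[0]
pairs = [("dog", "siren", ("dog", "siren")),
         ("beach", "gunshot", ("beach", "gunshot"))]
print(modality_preference(vision_biased, pairs))  # (1.0, 0.0)
```

A cooperative benchmark never exercises the `visual_label != audio_label` branch, which is exactly why it cannot surface this failure mode.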
What Mechanistic Interpretability Adds to the Picture
Most AI evaluation work tests what a model outputs. Mechanistic interpretability — the field this study draws from — asks what happens inside the model to produce that output. Applied to language models, it has revealed phenomena like how models store factual associations and how attention heads specialise. This study applies the same lens to multimodal systems, tracking signals layer by layer.
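One simple form of layer-by-layer tracking is to measure, at each depth, how strongly the hidden state still aligns with each input modality. The sketch below is a synthetic illustration under stated assumptions (fixed reference directions standing in for audio and visual inputs, fabricated per-layer states), not the study's procedure: it shows the kind of depth profile the researchers describe, with the audio component fading as the visual one consolidates.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_layers = 32, 12

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Fixed reference directions standing in for the audio and visual inputs.
audio_ref = rng.normal(size=d)
vision_ref = rng.normal(size=d)

# Synthetic per-layer hidden states: the audio component shrinks with
# depth while the visual component grows, mimicking the suppression
# pattern the study attributes to deeper fusion layers.
audio_sims, vision_sims = [], []
for layer in range(n_layers):
    w_audio = 1.0 - layer / n_layers
    w_vision = layer / n_layers
    h = w_audio * audio_ref + w_vision * vision_ref + 0.1 * rng.normal(size=d)
    audio_sims.append(cos(h, audio_ref))
    vision_sims.append(cos(h, vision_ref))
    print(f"layer {layer:2d}  audio={audio_sims[-1]:+.2f}  "
          f"vision={vision_sims[-1]:+.2f}")
```

On a real model the loop would run over actual hidden states captured with forward hooks; the diagnostic itself, alignment per layer, is the same.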
The approach matters because output-level testing can miss systematic internal failures. A model might answer audio-visual questions correctly in most test cases while harbouring a deep structural bias that only becomes visible under adversarial or conflicting conditions. The researchers argue that without this kind of internal analysis, developers cannot know whether audio capabilities are genuinely integrated or merely decorative.
The paper describes itself as the first mechanistic interpretability study of AVLLMs — a claim that, if accurate, marks a significant gap in prior evaluation methodology for a class of models increasingly deployed in real-world applications including video understanding, accessibility tools, and surveillance systems.
Implications for Multimodal AI Development
The findings carry direct consequences for how AVLLMs are built and tested. If audio fine-tuning leaves the visual hierarchy largely intact, developers seeking genuine audio-visual integration may need to rethink training pipelines — potentially training on more adversarial audio-visual pairs, or restructuring fusion layers to give audio signals a stronger footing before visual representations consolidate.
The study also raises questions about accountability. Products marketed as capable of understanding both sound and image may be operating primarily as vision models with an audio facade — a gap between advertised capability and actual behaviour that evaluation teams and regulators would have limited tools to detect using standard benchmarks.
The researchers have not proposed a fix, but the mechanistic framing of the problem provides a starting point: if the suppression happens in specific fusion layers, architectural changes targeting those layers become a testable hypothesis.
What This Means
AI systems marketed as audio-visual may be far more visually biased than their benchmarks reveal — and developers building on vision-language base models should treat genuine audio integration as an open engineering problem, not a solved one.