Audio-Visual Large Language Models (AVLLMs) process sound and images together but default to vision when the two conflict, according to a new study posted to arXiv that its authors describe as the first mechanistic interpretability analysis of its kind applied to multimodal AI systems.
Researchers examined how audio and visual signals travel and merge through the layers of an AVLLM, tracing the path from raw input to final text output. Their central finding: audio information is genuinely present inside these models at intermediate layers, but it gets crowded out before it reaches the text-generation stage.
Audio Is There — It Just Gets Ignored
Using probing analyses — a technique that tests what information is encoded at different points inside a neural network — the researchers confirmed that useful audio representations exist in the model's intermediate layers. The model is not deaf. It processes sound and builds meaningful internal representations of it.
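The probing idea can be sketched in a few lines. The code below is an illustrative toy, not the paper's actual method or data: it fabricates synthetic "hidden states" in which an audio-derived label is weakly embedded, then trains a linear probe (logistic regression) to recover that label. High probe accuracy at a given layer is the evidence that the layer encodes the audio information in a linearly decodable form.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states: n examples of d-dimensional
# activations, as if extracted from one intermediate layer of an AVLLM.
n, d = 1000, 64
labels = rng.integers(0, 2, size=n)  # audio-derived label, e.g. speech vs. music

# Embed the label weakly in a few dimensions plus noise, mimicking audio
# information that is present in the layer but not dominant.
activations = rng.normal(size=(n, d))
activations[:, :4] += 1.5 * labels[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.3, random_state=0
)

# The linear probe: if a simple classifier can recover the audio label
# from the activations alone, the layer encodes that information.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = probe.score(X_test, y_test)
print(f"probe accuracy: {acc:.2f}")  # well above the 0.5 chance level
```

Running the same probe at every layer, then plotting accuracy against depth, is the standard way such studies locate where information appears and where it fades.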
The problem emerges in the deeper fusion layers, where audio and visual streams combine before the model generates text. At that stage, visual representations dominate, systematically suppressing audio cues. When audio and visual signals agree, this bias is invisible. When they conflict, the model sides with its eyes.
Useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues.
This distinction matters in practice. A model evaluating a video clip in which a speaker's words contradict an on-screen caption, or in which a background sound signals danger the camera cannot see, would likely miss the audio signal entirely, even though it technically processed it.
The Training Problem Behind the Bias
The researchers traced the imbalance back to how AVLLMs are built. Most are constructed by extending an existing vision-language model (VLM) — trained on image-text pairs — with additional audio capabilities. The study found that the AVLLM's audio behaviour closely mirrors that of its vision-language base model, suggesting that the audio fine-tuning stage does not substantially realign the model's internal priorities.
In plain terms: the model was trained to see first, and adding audio on top did not override that foundation. The visual bias is not a bug introduced during audio training — it is an inherited feature that audio training failed to correct.
This has implications for how multimodal models are evaluated. Standard benchmarks typically test modalities in isolation or in cooperative settings where audio and vision reinforce one another. The study suggests these benchmarks would not catch a systematic audio suppression problem, because the bias only surfaces under conflict conditions. All benchmark results referenced in the original paper are self-reported by the research team.
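The gap between cooperative and conflict testing is easy to make concrete. The sketch below is hypothetical (the model interface and labels are invented for illustration, not taken from the paper): it evaluates a model only on clips where the audio and visual labels disagree, and counts which modality the answer follows. A model that scores well on cooperative benchmarks could still show an extreme preference here.

```python
# Sketch of a conflict-condition evaluation. `model_answer` is a
# hypothetical callable mapping a clip to a text label; none of these
# names come from the paper.

def modality_preference(model_answer, pairs):
    """For clips whose audio and visual labels disagree, return the
    fraction of answers matching the visual and the audio label."""
    follows_vision = follows_audio = 0
    for visual_label, audio_label, clip in pairs:
        assert visual_label != audio_label  # conflict condition only
        ans = model_answer(clip)
        if ans == visual_label:
            follows_vision += 1
        elif ans == audio_label:
            follows_audio += 1
    total = len(pairs)
    return follows_vision / total, follows_audio / total

# Toy model that always trusts the visual stream, mimicking the bias the
# study reports; here each "clip" is just a (visual, audio) label pair.
vision_biased = lambda clip: clip[0]
pairs = [("dog", "siren", ("dog", "siren")),
         ("beach", "gunshot", ("beach", "gunshot"))]
print(modality_preference(vision_biased, pairs))  # (1.0, 0.0)
```

A cooperative benchmark never exercises the `visual_label != audio_label` branch, which is exactly why it cannot surface this failure mode.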
What Mechanistic Interpretability Adds to the Picture
Most AI evaluation work tests what a model outputs. Mechanistic interpretability — the field this study draws from — asks what happens inside the model to produce that output. Applied to language models, it has revealed phenomena like how models store factual associations and how attention heads specialise. This study applies the same lens to multimodal systems, tracking signals layer by layer.
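One simple form of layer-by-layer tracking is to measure, at each depth, how strongly the hidden state still aligns with each input modality. The sketch below is a synthetic illustration under stated assumptions (fixed reference directions standing in for audio and visual inputs, fabricated per-layer states), not the study's procedure: it shows the kind of depth profile the researchers describe, with the audio component fading as the visual one consolidates.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_layers = 32, 12

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Fixed reference directions standing in for the audio and visual inputs.
audio_ref = rng.normal(size=d)
vision_ref = rng.normal(size=d)

# Synthetic per-layer hidden states: the audio component shrinks with
# depth while the visual component grows, mimicking the suppression
# pattern the study attributes to deeper fusion layers.
audio_sims, vision_sims = [], []
for layer in range(n_layers):
    w_audio = 1.0 - layer / n_layers
    w_vision = layer / n_layers
    h = w_audio * audio_ref + w_vision * vision_ref + 0.1 * rng.normal(size=d)
    audio_sims.append(cos(h, audio_ref))
    vision_sims.append(cos(h, vision_ref))
    print(f"layer {layer:2d}  audio={audio_sims[-1]:+.2f}  "
          f"vision={vision_sims[-1]:+.2f}")
```

On a real model the loop would run over actual hidden states captured with forward hooks; the diagnostic itself, alignment per layer, is the same.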
The approach matters because output-level testing can miss systematic internal failures. A model might answer audio-visual questions correctly in most test cases while harbouring a deep structural bias that only becomes visible under adversarial or conflicting conditions. The researchers argue that without this kind of internal analysis, developers cannot know whether audio capabilities are genuinely integrated or merely decorative.
The paper describes itself as the first mechanistic interpretability study of AVLLMs — a claim that, if accurate, marks a significant gap in prior evaluation methodology for a class of models increasingly deployed in real-world applications including video understanding, accessibility tools, and surveillance systems.
Implications for Multimodal AI Development
The findings carry direct consequences for how AVLLMs are built and tested. If audio fine-tuning leaves the visual hierarchy largely intact, developers seeking genuine audio-visual integration may need to rethink training pipelines — potentially training on more adversarial audio-visual pairs, or restructuring fusion layers to give audio signals a stronger footing before visual representations consolidate.
The study also raises questions about accountability. Products marketed as capable of understanding both sound and image may be operating primarily as vision models with an audio facade — a gap between advertised capability and actual behaviour that evaluation teams and regulators would have limited tools to detect using standard benchmarks.
The researchers have not proposed a fix, but the mechanistic framing of the problem provides a starting point: if the suppression happens in specific fusion layers, architectural changes targeting those layers become a testable hypothesis.
What This Means
AI systems marketed as audio-visual may be far more visually biased than their benchmarks reveal — and developers building on vision-language base models should treat genuine audio integration as an open engineering problem, not a solved one.