A new arXiv preprint proposes the first unified framework for understanding attention sinks in Large Vision-Language Models (LVLMs), identifying two distinct types and demonstrating that these patterns can either help or harm model performance depending on context.

Attention sinks — tokens that attract a disproportionately large share of a model's attention during processing — have been studied in text-only transformer models, but their behaviour in systems that handle both images and language has remained poorly understood. This paper addresses that gap directly, and the findings suggest the picture is significantly more complex than previously assumed.
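The notion of a sink can be made concrete: given a layer's row-stochastic attention matrix, a token counts as a sink when the attention mass it receives, averaged over queries, far exceeds the uniform share. The sketch below is illustrative only — the threshold, the averaging choice, and the function name are assumptions, not taken from the paper:

```python
import numpy as np

def find_sink_tokens(attn, ratio=5.0):
    """Flag tokens that receive far more attention than the uniform share.

    attn  : (num_queries, num_keys) row-stochastic attention matrix
    ratio : a key token is a 'sink' if its mean incoming attention
            exceeds `ratio` times the uniform baseline 1/num_keys
    """
    num_keys = attn.shape[1]
    incoming = attn.mean(axis=0)   # mean attention each key token receives
    baseline = 1.0 / num_keys      # uniform share of attention mass
    return np.where(incoming > ratio * baseline)[0]

# Toy example: 4 queries over 8 keys, with key 0 absorbing most attention.
attn = np.full((4, 8), 0.3 / 7)   # spread 30% of each row over 7 keys
attn[:, 0] = 0.7                  # key 0 hoards the remaining 70%
print(find_sink_tokens(attn))     # key 0 is flagged as a sink
```

Any such detector involves arbitrary choices (which layers to inspect, how to aggregate over heads); the paper's actual criterion may differ.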

Two Types of Visual Sink, Not One

The researchers draw a clear distinction between two categories of visual attention sink. ViT-emerged sinks (V-sinks) originate in the vision encoder — the component that initially processes images — and carry forward into the broader model. LLM-emerged sinks (L-sinks), by contrast, arise within the deeper layers of the large language model component itself, emerging during the reasoning process rather than being inherited from image processing.
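One simple way to operationalise this distinction is by where a token's attention concentration first appears: already present in the vision encoder (a V-sink), or only emerging inside the LLM (an L-sink). The following heuristic is a hypothetical sketch for illustration, not the paper's procedure — the function, its inputs, and the threshold are assumptions:

```python
import numpy as np

def classify_sinks(vit_incoming, llm_incoming, ratio=5.0):
    """Split sink tokens by where their attention concentration first appears.

    vit_incoming : (num_tokens,) mean incoming attention in the vision encoder
    llm_incoming : (num_tokens,) mean incoming attention in deep LLM layers
    """
    thresh = ratio / len(vit_incoming)  # ratio times the uniform share
    # V-sinks: concentration is already present at the ViT stage.
    v_sinks = np.where(vit_incoming > thresh)[0]
    # L-sinks: concentration emerges only inside the LLM layers.
    l_sinks = np.where((llm_incoming > thresh) & (vit_incoming <= thresh))[0]
    return v_sinks, l_sinks
```

With 8 tokens, a token holding 70% of ViT attention would come out as a V-sink, while one that is unremarkable in the ViT but dominant in deep LLM layers would be an L-sink.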

This distinction matters because each type plays a different functional role. V-sinks encode what the researchers describe as "global scene-level priors" — broad contextual information about what a scene contains overall. L-sinks arise later in the processing chain and appear tied to the model's linguistic reasoning rather than direct visual interpretation.

As the authors put it: "While sinks effectively encode global scene-level priors, their prevalence can suppress the fine-grained visual evidence required for local perception."

The core tension the paper identifies is that the same mechanism making models good at understanding a scene broadly can make them worse at answering questions that require precise local detail — identifying a specific object in a corner of an image, for example, or reading small text.

Why This Trade-Off Has Been Overlooked

Previous research on attention sinks largely treated them as a curiosity or a quirk of transformer architecture — sometimes described as "sink tokens" that absorb excess attention without contributing meaningfully to outputs. The assumption in many studies was that sinks were essentially redundant, an artefact of how transformers distribute attention mathematically.

This paper challenges that interpretation by showing that sinks are functionally active, encoding real information. The problem is not that they exist, but that their prevalence at the wrong moments or in the wrong layers pulls the model's focus away from the fine-grained visual evidence some tasks require.

The researchers also identify specific layers within the model where modulating sink behaviour has the greatest downstream impact — a finding with practical significance for anyone trying to improve LVLM performance efficiently.

The LSG Module: A Lightweight Fix

To act on these findings, the team proposes Layer-wise Sink Gating (LSG), a module designed to be added to existing LVLMs without retraining the underlying model. LSG dynamically scales the attention contributions of V-sinks relative to other visual tokens, layer by layer, rather than applying a fixed adjustment across the entire network.
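In spirit, such gating can be sketched as a per-layer scalar that damps the attention logits of flagged sink tokens before the softmax, which redistributes attention mass to the remaining visual tokens. Everything below — the function names, applying the gate in log-space, and the gate being a single scalar — is an illustrative guess at the mechanism, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(scores, sink_idx, gate):
    """Apply a layer-specific gate to the sink columns of attention logits.

    scores   : (num_queries, num_keys) pre-softmax attention logits
    sink_idx : indices of visual tokens flagged as sinks
    gate     : scalar in (0, 1]; gate=1 leaves attention unchanged,
               smaller values damp the mass flowing to sink tokens
    """
    scores = scores.copy()
    # Adding log(gate) to a logit multiplies that token's unnormalised
    # softmax weight by `gate`; renormalisation then shifts the freed
    # mass onto the other (fine-grained) visual tokens.
    scores[:, sink_idx] += np.log(gate)
    return softmax(scores, axis=-1)

# Damping sink token 0: its attention share drops, rows still sum to 1.
scores = np.zeros((2, 4))
damped = gated_attention(scores, sink_idx=[0], gate=0.25)
full = gated_attention(scores, sink_idx=[0], gate=1.0)
```

In the actual module the gate values would be learned per layer, consistent with the paper's claim that modulating specific layers has outsized downstream impact.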

The approach is described as "plug-and-play" — the LVLM backbone remains frozen, meaning organisations using existing models do not need to repeat expensive training runs. LSG itself is trained using standard next-token prediction, the same objective used to train language models generally, requiring no task-specific labelled data.

According to the paper, LSG yields improvements across most of the representative multimodal benchmarks tested. These results are self-reported by the researchers, and independent replication has not yet occurred given the paper's recent publication on arXiv.

What the Results Suggest About Multimodal Architecture

The broader implication of the framework is that LVLMs contain internal tension between two desirable capabilities: understanding the overall meaning of a scene and perceiving precise local details. These are not always in conflict — a model answering "what is the general setting of this image?" benefits from global priors, while one asked "what does the sign in the background say?" needs fine-grained local attention.

Current LVLMs often handle one type of question better than the other, and the sink framework offers a mechanistic explanation for why. The finding that specific layers are disproportionately responsible for this behaviour opens the door to targeted interventions, rather than broad architectural overhauls.

The paper also raises questions about evaluation. Many standard multimodal benchmarks weight global reasoning tasks heavily, which may have obscured the local-perception deficit that sink prevalence causes. Models performing well overall on benchmarks might be systematically weaker on tasks requiring fine-grained visual grounding.

What This Means

For teams building or fine-tuning vision-language models, this framework provides both a diagnostic tool for understanding where attention is going wrong and a practical, low-cost intervention — LSG — that can be applied without modifying the base model.