A new study published on arXiv finds that Vision Language Models perform significantly worse on visual tasks when the objects involved cannot be named in language — revealing a systematic bias toward textual shortcuts over genuine visual perception.
The research, posted to arXiv under cs.CV, challenges a widely held assumption about multimodal AI: that strong performance on vision-language benchmarks reflects genuine visual understanding. According to the authors, VLMs are instead exploiting a learned shortcut — translating visual inputs into language concepts as quickly as possible, and reasoning from there. When a visual entity resists that translation, model performance degrades sharply.
The 'Nameable' Advantage in Visual Correspondence
The researchers tested several VLMs on visual correspondence tasks — challenges where a model must identify matching entities across two different images. These tasks were grouped into three categories: semantic correspondence (matching conceptually similar objects), shape correspondence (matching by geometric form), and face correspondence (matching facial features). Performance varied dramatically depending on whether the target entity had a common name.
VLMs can only reason about visual entities that can be mapped to known concepts in the language space.
For nameable entities — objects with clear, established labels — models performed well. For unnameable entities — novel shapes, unfamiliar faces, or abstract visual features with no standard linguistic label — performance dropped significantly. This held across model types and task categories, suggesting the pattern is structural rather than incidental.
What the Internal Representations Reveal
To understand the mechanism behind this behaviour, the team applied Logit Lens analysis, a technique that examines how information is represented layer-by-layer inside a neural network. The results were revealing: for nameable entities, VLMs explicitly assigned semantic labels to visual inputs early in their processing pipeline. For unnameable entities, the models produced far less distinctive internal token representations — a sign that they lacked a reliable strategy to handle the input.
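The core of the Logit Lens technique can be sketched in a few lines: project each layer's hidden state through the model's unembedding matrix and read off which vocabulary token it most resembles. The toy numbers below are illustrative assumptions, not the paper's data; the point is the mechanism, where an early layer yields an undistinctive readout while later layers snap to a named concept.

```python
# Minimal sketch of the Logit Lens idea (toy data, not the authors' code):
# multiply each layer's hidden state by the unembedding matrix and take
# the highest-scoring vocabulary token as that layer's "readout".

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def logit_lens(hidden_states, unembedding, vocab):
    """For each layer's hidden state, return the nearest vocab token."""
    readout = []
    for h in hidden_states:
        logits = matvec(unembedding, h)
        best = max(range(len(logits)), key=lambda i: logits[i])
        readout.append(vocab[best])
    return readout

# Hypothetical 2-d hidden states for one image patch across three layers.
vocab = ["<unk>", "cat", "wing"]
unembedding = [          # one weight row per vocabulary token
    [0.6, 0.6],          # <unk>
    [1.0, 0.0],          # cat
    [0.0, 1.0],          # wing
]
layers = [
    [0.1, 0.1],          # early layer: weak, undistinctive representation
    [0.9, 0.2],          # middle layer: the "cat" direction emerging
    [1.5, 0.1],          # late layer: clearly aligned with "cat"
]
print(logit_lens(layers, unembedding, vocab))  # → ['<unk>', 'cat', 'cat']
```

For a nameable entity the readout converges on a label early; the paper's finding is that unnameable entities never produce such a distinctive trajectory.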
This supports the authors' central claim: VLMs are not failing to see visual detail. The required information is present in their internal representations. They are failing to use that detail because their training has optimised them to convert visual information into language as the primary mode of reasoning. When that conversion fails, the models resort to what the authors call hallucinated textual descriptions — plausible-sounding language outputs that do not accurately reflect the visual content.
Teaching Arbitrary Names Helps — But Fine-Tuning Helps More
The study also tested two potential remedies. In the first, researchers taught models completely arbitrary names for unnameable entities — essentially giving the model a label to attach to a novel visual concept. This improved performance, confirming that the bottleneck is genuinely linguistic: once a model has a word to work with, it processes the visual entity more reliably.
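The naming remedy amounts to giving the model a consistent word to anchor on before asking the question. The sketch below shows the general shape of such a setup; the label scheme and prompt wording are illustrative assumptions, not the authors' exact protocol.

```python
# Sketch of the "arbitrary name" remedy: assign a made-up but consistent
# label to each novel entity, then phrase the task in terms of those
# labels. Prompt text and entity ids here are illustrative assumptions.

def assign_names(entities, prefix="blicket"):
    """Map each unnameable entity id to an arbitrary, consistent label."""
    return {e: f"{prefix}-{i}" for i, e in enumerate(entities, start=1)}

def build_prompt(names, query_entity):
    lines = [f"The object <{e}> is called a {label}." for e, label in names.items()]
    lines.append(f"Which region in image B matches the {names[query_entity]}?")
    return "\n".join(lines)

names = assign_names(["shape_07", "shape_12"])
prompt = build_prompt(names, "shape_07")
print(prompt)
```

The label itself carries no meaning; what matters, per the study, is that the entity now exists as a token the model can reason over in language space.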
In the second approach, the team applied task-specific fine-tuning — training models directly on the correspondence tasks without relying on language priors. This produced even stronger results, and the improvements generalised better across different examples. According to the authors, this suggests that VLMs possess the latent visual capability needed for these tasks; the training pipeline simply has not developed it.
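The character of such task-specific training can be illustrated with a toy correspondence objective: learn to judge whether two feature vectors depict the same entity, directly from matched and unmatched pairs, with no language labels anywhere in the loop. The model and data below are deliberately simplified stand-ins, not the authors' setup.

```python
# Toy sketch of language-free fine-tuning on a correspondence objective:
# logistic regression over per-dimension feature differences, trained
# with plain SGD. Illustrative only; not the paper's architecture.
import math
import random

random.seed(0)

def make_pair(match):
    """Matched pairs differ by small noise; unmatched pairs are unrelated."""
    a = [random.gauss(0, 1) for _ in range(4)]
    if match:
        b = [x + random.gauss(0, 0.1) for x in a]
    else:
        b = [random.gauss(0, 1) for _ in range(4)]
    return a, b, 1.0 if match else 0.0

data = [make_pair(i % 2 == 0) for i in range(200)]

w, bias = [0.0] * 4, 0.0
lr = 0.5

def predict(a, b):
    """Score a pair from its per-dimension absolute differences."""
    z = bias + sum(wi * abs(x - y) for wi, x, y in zip(w, a, b))
    return 1.0 / (1.0 + math.exp(-z))

for epoch in range(30):
    for a, b, label in data:
        g = predict(a, b) - label          # gradient of the log loss
        bias -= lr * g
        w = [wi - lr * g * abs(x - y) for wi, x, y in zip(w, a, b)]

acc = sum((predict(a, b) > 0.5) == (label == 1.0) for a, b, label in data) / len(data)
print(f"training accuracy: {acc:.2f}")
```

Because the supervision signal is purely geometric, nothing in this objective rewards routing through words, which is the property the authors credit for the stronger, better-generalising results.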
A Training Problem, Not an Architecture Problem
The distinction the authors draw is important. Much recent debate about VLM limitations has centred on whether multimodal architectures are fundamentally suited to genuine visual reasoning. This study pushes back on the pessimistic view. The failure mode identified here — over-reliance on language anchors — is a learned behaviour produced by training objectives, not a ceiling imposed by the architecture itself.
Current VLM training pipelines are heavily oriented toward tasks that reward moving visual information into textual space: image captioning, visual question answering, and document understanding, among others. These tasks do not require — and may actively discourage — the development of vision-native reasoning strategies. The models learn what they are rewarded for.
This framing has direct implications for how researchers design training curricula and evaluation benchmarks. If benchmarks predominantly test nameable entities, they will systematically overestimate a model's visual capabilities. A model can score well on such benchmarks while remaining blind, in a practical sense, to visual information it cannot verbalise.
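One concrete evaluation fix this suggests is stratifying benchmark results by nameability, so the gap is reported rather than averaged away. A minimal sketch, with field names assumed for illustration:

```python
# Sketch of a nameability-stratified evaluation: report accuracy
# separately for nameable and unnameable test cases. The record
# format ('nameable', 'correct') is an assumption for illustration.

def stratified_accuracy(results):
    """results: list of dicts with 'nameable' (bool) and 'correct' (bool)."""
    buckets = {True: [], False: []}
    for r in results:
        buckets[r["nameable"]].append(r["correct"])
    return {
        "nameable": sum(buckets[True]) / len(buckets[True]),
        "unnameable": sum(buckets[False]) / len(buckets[False]),
    }

results = [
    {"nameable": True, "correct": True},
    {"nameable": True, "correct": True},
    {"nameable": True, "correct": False},
    {"nameable": True, "correct": True},
    {"nameable": False, "correct": False},
    {"nameable": False, "correct": True},
    {"nameable": False, "correct": False},
    {"nameable": False, "correct": False},
]
scores = stratified_accuracy(results)
print(scores)  # → {'nameable': 0.75, 'unnameable': 0.25}
```

A headline accuracy of 50% on this toy split would hide exactly the failure mode the paper documents.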
Implications for Real-World Deployment
The practical stakes are significant. VLMs are being deployed in applications that require genuine fine-grained visual perception: medical imaging analysis, satellite imagery interpretation, manufacturing quality control, and facial recognition systems, among others. In each domain, models may encounter entities — anomalous tissue patterns, novel terrain features, unfamiliar product defects — that resist easy labelling. According to this research, current VLMs are poorly equipped for exactly these situations.
The finding that arbitrary name assignment improves performance also raises an interesting design possibility: systems could be built to automatically generate temporary labels for novel visual entities before passing them to a VLM, acting as a kind of perceptual scaffolding. Whether this approach scales is an open question, but the authors' results suggest it is worth investigating.
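Such a scaffolding layer might look like the pipeline below: detect novel entities, mint temporary labels, and rewrite the query so the VLM receives words it can anchor on. Both `detect_entities` and `vlm_answer` are stand-in stubs, not a real detector or model API.

```python
# Sketch of the "perceptual scaffolding" idea: auto-generate temporary
# labels for novel visual entities before querying a VLM. The detector
# and model calls below are hypothetical stubs, not real APIs.
import itertools

def detect_entities(image):
    # Stub: a real system would run a detector or segmenter here.
    return ["region_a", "region_b"]

def vlm_answer(prompt):
    # Stub standing in for a call to an actual vision-language model.
    return f"(model response to: {prompt!r})"

_counter = itertools.count(1)

def scaffolded_query(image, question):
    """Label each detected entity, then ask the question in those terms."""
    labels = {e: f"object-{next(_counter)}" for e in detect_entities(image)}
    glossary = "; ".join(f"{e} is '{v}'" for e, v in labels.items())
    return vlm_answer(f"Glossary: {glossary}. Question: {question}")

answer = scaffolded_query("img.png", "Which object matches the template?")
print(answer)
```

Whether such a wrapper helps in practice would depend on the detector and the model, but it mirrors the intervention that improved performance in the study.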
What This Means
VLM benchmark scores may be systematically inflated by the prevalence of nameable test cases — and developers building on these models for vision-critical applications should validate performance specifically on entities that fall outside standard language vocabularies.