Multilingual language models sort languages by their surface writing system — not by underlying grammar or linguistic structure — according to new research posted to arXiv, challenging assumptions about how these models achieve cross-lingual ability.

The study, which focuses on Llama-3.2-1B and Gemma-2-2B — two compact, distilled models — used the Language Activation Probability Entropy (LAPE) metric to identify which internal units activate for specific languages, then applied Sparse Autoencoders to decompose those activations into interpretable components. The central question the researchers asked was straightforward: do these models understand language in the abstract, or are they responding primarily to what text looks like on the surface?
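The entropy-based selection at the heart of LAPE can be illustrated in a few lines. The sketch below is a simplified reading of the metric, not the paper's exact implementation: for each unit, take its per-language activation probabilities, normalise them into a distribution, and compute the entropy; a low score flags a language-specific unit.

```python
import numpy as np

def lape(activation_probs):
    """Language Activation Probability Entropy for one unit.

    activation_probs: for each language, the probability that this
    unit activates on text in that language. A low score means the
    unit fires for only a few languages (language-specific).
    """
    p = np.asarray(activation_probs, dtype=float)
    p = p / p.sum()     # normalise to a distribution over languages
    p = p[p > 0]        # drop zeros to avoid log(0)
    return float(-(p * np.log(p)).sum())

# A unit firing almost exclusively for one language scores near 0;
# a unit firing equally for all languages scores log(n_languages).
specific_unit = lape([0.90, 0.01, 0.01, 0.01])
shared_unit = lape([0.25, 0.25, 0.25, 0.25])
```

Ranking units by this score and thresholding is one common way to pick out the "language-associated units" the study then analyses further.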

Romanisation Creates Near-Invisible Walls Between Languages

The most striking finding concerns romanisation — the practice of writing a non-Latin-script language using Latin characters. When researchers romanised languages like Arabic or Hindi, the resulting internal representations were nearly disjoint from both the native-script version of the same language and from English. In other words, the model treated romanised Hindi almost as an entirely different language from Devanagari-script Hindi, despite the spoken language being identical.

Romanisation induces near-disjoint representations that align with neither native-script inputs nor English — suggesting that the model's sense of a language's identity is anchored to its visual form.

This has direct consequences for a common real-world practice. Many users of South Asian, Arabic, and other non-Latin languages type informally in romanised form — a style known as transliteration or "romanised chat." If the model treats this as a representationally foreign input, its performance on such text may be far weaker than benchmarks on standard native-script text would suggest.
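One simple way to quantify "nearly disjoint" is the set overlap between the language-specific units identified for each input form. The unit indices below are invented purely for illustration; only the Jaccard measure itself is standard:

```python
def jaccard(a, b):
    """Jaccard overlap of two unit-index sets: 0 = disjoint, 1 = identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Hypothetical language-specific unit sets, for illustration only.
hindi_devanagari = {12, 47, 301, 512}
hindi_romanised = {47, 88, 901, 1024}
english = {7, 88, 200, 512}

overlap_scripts = jaccard(hindi_devanagari, hindi_romanised)  # low overlap
overlap_english = jaccard(hindi_romanised, english)           # also low
```

Under the paper's finding, both overlap scores would sit near zero: the romanised form shares almost no units with either the native script or English.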

Word Order Matters Less Than Expected

The researchers also tested what happens when word order is shuffled — a disruption that should, in theory, obscure grammatical structure significantly. The effect on language-associated unit identity was limited, suggesting these units are not primarily tracking syntactic patterns. This finding cuts against the intuition that models deeply encode grammar at the level of their internal language representations.
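The perturbation itself is easy to reproduce: shuffle the words of each input so the bag of tokens is preserved but syntax is destroyed, then re-identify language-associated units and compare the sets. A minimal version of the shuffling step, assuming simple whitespace tokenisation (the paper's exact procedure may differ):

```python
import random

def shuffle_words(sentence, seed=0):
    """Destroy word order while keeping the same words.

    Grammar-sensitive units should react strongly to this input;
    units keyed to script and vocabulary should barely notice.
    """
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

original = "the cat sat on the mat"
shuffled = shuffle_words(original)
```

Because the shuffled text keeps every token, any change in which units activate can be attributed to word order rather than vocabulary or script.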

Typological structure — features like whether a language is verb-final or uses noun cases — does become more accessible as inputs travel through deeper model layers, the probing experiments show. But this emergence is gradual and does not result in a unified cross-lingual space where, say, Turkish and Finnish converge because they share agglutinative grammar.
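Layer-wise probing of this kind can be sketched without model-specific machinery. The synthetic arrays below stand in for activations from a shallow and a deep layer (an assumption for illustration), and the probe is a minimal nearest-centroid classifier rather than whatever classifier the paper used:

```python
import numpy as np

def centroid_probe_accuracy(X, y):
    """Train-set accuracy of a nearest-class-centroid probe.

    X: (n_samples, n_dims) activations from one layer.
    y: binary labels for a typological feature (e.g. verb-final or not).
    """
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    pred = (np.linalg.norm(X - c1, axis=1) <
            np.linalg.norm(X - c0, axis=1)).astype(int)
    return float((pred == y).mean())

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200)
shallow = rng.normal(size=(200, 8))        # feature not yet encoded
deep = shallow + 2.0 * labels[:, None]     # feature linearly accessible

acc_shallow = centroid_probe_accuracy(shallow, labels)
acc_deep = centroid_probe_accuracy(deep, labels)
```

Rising probe accuracy across layers is what "more accessible in deeper layers" means operationally; the paper's point is that this rise is gradual and never amounts to a shared grammar-based space.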

What the Model Actually Uses to Generate Text

A key distinction in the research is between what representations contain and what the model acts on during generation. Causal intervention experiments — where researchers directly manipulated internal activations to measure downstream effects on output — found that generation is most sensitive to units that remain stable across surface-form perturbations, not to units identified purely through typological alignment.
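Zero-ablation is the simplest form of such a causal intervention. The sketch below is schematic: `toy_model` stands in for the computation downstream of the patched layer, which is an assumption for illustration, not the paper's setup:

```python
import numpy as np

def ablate(activations, unit_ids):
    """Return a copy of the activations with the chosen units zeroed."""
    patched = activations.copy()
    patched[..., list(unit_ids)] = 0.0
    return patched

def causal_effect(model_fn, activations, unit_ids):
    """Mean absolute change in output when the units are knocked out.

    Units whose ablation barely moves the output are not the ones
    generation relies on, however interesting their contents look.
    """
    base = model_fn(activations)
    intervened = model_fn(ablate(activations, unit_ids))
    return float(np.abs(base - intervened).mean())

# Toy stand-in for a downstream computation: a weighted sum of units.
toy_model = lambda acts: acts @ np.array([1.0, 0.5, 0.0])
acts = np.array([[2.0, 4.0, 8.0]])

effect_used = causal_effect(toy_model, acts, {0})    # unit the output depends on
effect_unused = causal_effect(toy_model, acts, {2})  # unit with zero weight
```

Comparing effect sizes across unit groups is how the researchers could tell that generation leans on the perturbation-stable units rather than the typologically aligned ones.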

This is a meaningful nuance. A model may encode some abstract linguistic information in its deeper layers, but when it actually produces text, it relies on representations that are robust to changes in spelling or script — which, paradoxically, are also the representations most tied to orthographic identity rather than linguistic abstraction.

The overall picture the researchers paint is of a system that organises multilingual knowledge around the visual and orthographic surface of text, with linguistic abstraction appearing only as a secondary, emergent property in deeper processing stages — and never fully converging into what linguists would call an interlingua, a shared abstract representation of meaning independent of any specific language's form.

Why Compact Models Were Chosen Deliberately

The choice to study compact, distilled models is itself methodologically significant. Smaller models operate under tighter representational constraints, meaning trade-offs between storing surface-form versus abstract linguistic information are more visible and explicit. The researchers argue this makes Llama-3.2-1B and Gemma-2-2B sharper lenses for examining representational organisation than larger frontier models, where abundant capacity might blur these distinctions.

The findings align with a growing body of interpretability research suggesting that large language models are more sensitive to text formatting, tokenisation, and surface presentation than their benchmark scores typically reveal. Previous work has shown performance degrading with unusual whitespace, character substitutions, or non-standard punctuation — the script-level organisation found here offers a deeper structural explanation for some of those sensitivities.

The research does not address whether larger models — with billions more parameters — would show the same patterns, or whether instruction tuning and reinforcement learning from human feedback alter the internal organisation. Those remain open questions.

What This Means

For developers building multilingual applications, this research suggests that benchmark performance on native-script text may significantly overstate a model's ability to handle romanised or mixed-script input from real users — and that achieving genuine cross-lingual understanding likely requires training data and evaluation that explicitly accounts for script variation, not just language identity.