A new study warns that a core technique used to represent speech in AI systems systematically loses lexical tone — the pitch-based meaning distinctions that differentiate words in roughly 70% of the world's languages — raising questions about the inclusivity of current speech technology.
The paper, posted to arXiv in April 2025, investigates discrete speech units (DSUs), the compressed, text-like tokens that modern speech AI systems use to process audio. DSUs are produced by taking the internal representations of a self-supervised learning (SSL) model — an AI trained on raw audio without human-labelled data — and running them through a quantisation algorithm that maps continuous values to a finite set of discrete codes. The resulting tokens are popular in applications ranging from text-to-speech synthesis to multimodal dialogue systems, in large part because they allow speech and text to be handled within the same framework.
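To make the pipeline concrete, the sketch below shows the most common way DSUs are produced: K-means clustering over frame-level features. The random matrix here is a stand-in for a real SSL model's hidden states (the dimensions and codebook size are illustrative, not taken from the paper).

```python
# Illustrative sketch (not the paper's exact pipeline): turning continuous
# SSL-style frame features into discrete speech units with K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 1000 audio frames, each a 768-dim feature vector, standing in for
# the hidden states of a real self-supervised speech model.
frames = rng.normal(size=(1000, 768))

# Learn a codebook of 100 discrete codes over the frame features.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)

# Each frame is replaced by the index of its nearest codeword,
# yielding a text-like token sequence the downstream model consumes.
dsu_tokens = kmeans.predict(frames)
print(dsu_tokens[:10])
```

Each audio frame becomes a single integer token, which is exactly what lets speech be handled like text downstream — and also why any information the codebook fails to capture, such as pitch, is gone for good.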
Why Tone Languages Expose a Hidden Weakness
The researchers tested DSUs on two tone languages: Mandarin Chinese and Yorùbá, a major West African language spoken by over 50 million people. In both languages, identical sequences of consonants and vowels carry entirely different meanings depending on the pitch contour — what linguists call lexical tone. In Mandarin, for instance, the syllable ma can mean mother, hemp, horse, or scold depending on which of four tones is used.
The SSL latent representations themselves do encode tone, yet the DSUs obtained through quantisation prioritise phonetic structure at the expense of lexical tone.
This distinction matters: the researchers found that the problem does not originate in the SSL model itself. When they probed the raw internal representations before quantisation, tonal information was present. The compression step — quantisation — is where tone gets discarded. According to the paper, this held across multiple quantisation methods, not just the most common approach, K-means clustering, suggesting the issue is structural rather than an artefact of any single algorithm.
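The probing methodology can be sketched as follows: train a simple classifier to predict tone labels from representations, and compare accuracy before and after quantisation. If the probe succeeds on the continuous features but fails on the discrete codes, the loss happened at the quantisation step. The data below is synthetic, constructed so that tone is linearly recoverable by design; it illustrates the method, not the paper's results.

```python
# Hypothetical probing experiment with synthetic data: can a linear
# classifier recover tone labels from frame representations?
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, dim, n_tones = 600, 32, 4
tones = rng.integers(0, n_tones, size=n)

# Continuous "SSL" features: each tone class sits near its own centroid,
# so tone is linearly recoverable by construction.
centroids = rng.normal(size=(n_tones, dim))
continuous = centroids[tones] + 0.1 * rng.normal(size=(n, dim))

# Fit the probe on a training split, score it on held-out frames.
probe = LogisticRegression(max_iter=1000).fit(continuous[:400], tones[:400])
acc = probe.score(continuous[400:], tones[400:])
print(f"probe accuracy on continuous features: {acc:.2f}")
```

In the study's setting, the analogous probe would be run twice — once on the raw SSL states and once on the quantised tokens — with the gap between the two scores quantifying how much tonal information the compression step discards.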
The Segmental Versus Suprasegmental Divide
At the heart of the finding is a distinction linguists draw between segmental and suprasegmental features. Segmental features are the building blocks of words — individual consonants and vowels. Suprasegmental features, by contrast, operate across those segments: tone, stress, rhythm, and intonation. The study's evidence suggests that current quantisation strategies are implicitly optimised for segmental structure, compressing audio in a way that preserves which sounds were spoken but not necessarily how they were pitched.
This has practical consequences beyond tone languages. Prosody — the rise and fall of pitch that conveys emotion, emphasis, and sentence structure in any language — is also a suprasegmental feature. If DSUs reliably drop tonal information, they may also be losing emotionally relevant or pragmatically important pitch patterns in English and other non-tonal languages, though the paper focuses specifically on lexical tone as its test case.
A Two-Stage Residual Fix
The researchers propose a potential solution using a two-pass K-means approach. In the first pass, standard K-means clustering is applied to encode phonetic information. The algorithm then computes the residual — essentially the information left over after phonetic structure has been captured — and applies K-means a second time to that remainder. According to the authors, this residual representation encodes lexical tone more reliably than any single-pass method tested.
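The two-pass idea can be sketched in a few lines, under the assumption that "residual" means each frame's feature vector minus its first-pass centroid — a standard residual-quantisation construction; the paper's exact formulation may differ. Synthetic features again stand in for SSL output.

```python
# Sketch of a two-pass residual K-means scheme (an assumption-laden
# reconstruction of the approach, not the authors' code).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
frames = rng.normal(size=(500, 64))  # synthetic stand-in for SSL features

# Pass 1: standard K-means, intended to capture phonetic structure.
km1 = KMeans(n_clusters=50, n_init=10, random_state=0).fit(frames)
phonetic_units = km1.predict(frames)

# Residual: whatever the phonetic codebook failed to capture.
residual = frames - km1.cluster_centers_[phonetic_units]

# Pass 2: quantise the residual, where tone is hypothesised to live.
km2 = KMeans(n_clusters=16, n_init=10, random_state=0).fit(residual)
tone_units = km2.predict(residual)

# Each frame now carries a (phonetic, tonal) token pair.
print(phonetic_units[:5], tone_units[:5])
```

The design choice is to stop asking one codebook to represent everything: the first pass absorbs the dominant segmental structure, freeing the second pass to encode the suprasegmental remainder.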
The approach is preliminary and the paper frames it as pointing toward a solution rather than delivering one. No large-scale benchmark results are reported for the residual method in downstream tasks such as speech synthesis or automatic speech recognition, and the findings are based on probing experiments rather than end-to-end system evaluations. The benchmark results cited in the paper are from the authors' own experimental setup and have not been independently replicated.
Implications for Global Speech AI
The timing of the paper coincides with growing industry investment in universal speech models — systems designed to handle dozens or hundreds of languages within a single architecture. Companies including Google, Meta, and OpenAI have released or are developing multilingual speech models that rely on SSL-derived representations. If DSUs systematically underrepresent tonal information, models built on them may perform measurably worse for the world's more than 900 million native Mandarin speakers and the hundreds of millions of speakers of other tone languages.
The study also raises a broader methodological concern. Because DSUs are evaluated primarily on benchmarks dominated by English and other non-tonal languages, the tonal information loss may have gone unnoticed in standard evaluations. Probing specifically for suprasegmental encoding — testing not just whether a model transcribes words correctly, but whether it preserves pitch — is not yet standard practice in the field.
The authors call for the development of tone-aware and prosody-aware quantisation techniques as a distinct research direction, arguing that the field cannot simply rely on scaling existing methods to fix the problem.
What This Means
Any speech AI system that uses discrete speech units — including text-to-speech, voice assistants, and multimodal dialogue models — may be silently degrading tonal and prosodic information, with the sharpest real-world impact felt by speakers of Mandarin, Yorùbá, and the world's other tone languages.