A new study posted to arXiv proposes neural models that match human rater consistency when scoring text-to-speech audio quality, potentially reducing the industry's reliance on slow and costly human evaluation panels.

Human evaluation of TTS systems typically relies on two established protocols: Mean Opinion Score (MOS), where listeners rate audio quality on an absolute scale, and Side-by-Side (SBS) comparisons, where listeners judge which of two audio samples sounds better. Both methods are considered gold standards in the field, but they are expensive to run at scale and prone to assessor bias — raters differ meaningfully from one another even when evaluating identical samples.
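
Both protocols reduce to simple statistics over listener responses. A minimal sketch with hypothetical ratings (the numbers here are illustrative, not from the study):

```python
# Hypothetical listener ratings for one audio sample on the 1-5 MOS scale.
ratings = [4, 5, 3, 4, 4]
mos = sum(ratings) / len(ratings)  # Mean Opinion Score: per-sample average
print(mos)  # 4.0

# Hypothetical SBS trial: each vote records which of two systems ("A" or
# "B") a listener preferred, or a tie.
sbs_votes = ["A", "A", "B", "A", "tie"]
decisive = [v for v in sbs_votes if v != "tie"]
win_rate_a = decisive.count("A") / len(decisive)
print(win_rate_a)  # 0.75
```

The assessor-bias problem shows up here directly: two rater pools can produce different `ratings` lists for the same audio, shifting the MOS.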

The Models: NeuralSBS and WhisperBert

To address this, the researchers built two distinct systems, one for each evaluation type. For relative (SBS) comparisons, they introduce NeuralSBS, a model backed by HuBERT — a self-supervised speech representation model developed by Meta — which achieves 73.7% accuracy on the SOMOS dataset. For absolute (MOS) scoring, they enhanced the existing MOSNet architecture with custom sequence-length batching and developed WhisperBert, a multimodal ensemble that combines audio features from OpenAI's Whisper with textual embeddings from BERT.
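
The SBS accuracy figure is simply agreement between model preferences and human preferences over held-out pairs. A small sketch with made-up judgments (not the paper's data):

```python
# Hypothetical human and model preferences over ten audio pairs.
human_pref = ["A", "B", "A", "A", "B", "A", "B", "A", "B", "A"]
model_pref = ["A", "B", "B", "A", "B", "A", "A", "A", "B", "A"]

# Pairwise accuracy: fraction of pairs where the model picks the same
# winner as the human listener.
accuracy = sum(h == m for h, m in zip(human_pref, model_pref)) / len(human_pref)
print(accuracy)  # 0.8
```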

WhisperBert uses a technique called weak-learner stacking — combining outputs from multiple simpler models rather than merging their internal representations directly. This distinction proved critical.
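
A minimal sketch of what such stacking can look like, assuming two weak learners that each emit a scalar MOS estimate and a linear meta-model on top (the data and meta-model here are illustrative; the paper's exact stacking setup may differ):

```python
import numpy as np

# Hypothetical held-out predictions from two weak learners (say, an
# audio-only model and a text-only model), plus ground-truth MOS labels.
audio_preds = np.array([3.8, 4.1, 2.9, 4.5, 3.2])
text_preds = np.array([3.5, 4.3, 3.1, 4.4, 3.0])
true_mos = np.array([3.7, 4.2, 3.0, 4.5, 3.1])

# Stacking combines the learners' *outputs*: fit a simple meta-model
# (linear least squares with a bias term) on those outputs rather than
# merging the networks' internal representations.
X = np.column_stack([audio_preds, text_preds, np.ones_like(audio_preds)])
weights, *_ = np.linalg.lstsq(X, true_mos, rcond=None)

# Ensemble prediction for a new sample's pair of weak-learner scores.
new_scores = np.array([4.0, 3.9, 1.0])  # audio score, text score, bias
ensemble_mos = float(new_scores @ weights)
```

Because the meta-model only ever sees final scores, each weak learner can be trained, swapped, or debugged independently — the property that distinguishes this approach from the cross-attention fusion discussed below.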

"Our best MOS models achieve a Root Mean Square Error of ~0.40, significantly outperforming the human inter-rater RMSE baseline of 0.62," the researchers write.

An RMSE of 0.40 versus a human baseline of 0.62 means these models produce predictions closer to a consensus expert score than individual human raters do to one another — a meaningful benchmark in a field where human judgment has long been considered irreplaceable.
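
For readers unfamiliar with the metric, RMSE on a 1-5 opinion scale is the square root of the mean squared gap between predicted and reference scores. A quick illustration with made-up numbers:

```python
import math

# Hypothetical consensus MOS labels and one model's predictions.
consensus = [4.0, 3.2, 4.5, 2.8]
predicted = [3.9, 3.5, 4.2, 3.1]

rmse = math.sqrt(
    sum((p - c) ** 2 for p, c in zip(predicted, consensus)) / len(consensus)
)
print(round(rmse, 3))  # 0.265
```

A drop from 0.62 to 0.40 therefore means the model's scores cluster noticeably tighter around the consensus than individual raters' scores do.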

Why Text Fusion Backfired — and What That Reveals

One of the study's more instructive findings concerns how textual information should be incorporated into audio quality models. The researchers tested cross-attention — a technique that attempts to directly merge text and audio representations within a neural network's internal layers — and found it degraded model performance rather than improving it.

The ensemble-based stacking approach used in WhisperBert, by contrast, keeps the two modalities somewhat independent before combining their outputs at a higher level. This suggests that forcing deep integration of text and audio signals during training can introduce noise rather than signal, at least for the task of quality prediction. The result is a practical caution for researchers building multimodal evaluation tools.

Large Language Models Stumble on This Task

The study also tested whether general-purpose large language models could serve as ready-made TTS evaluators — a tempting proposition given their broad capabilities. They could not: Qwen2-Audio and Gemini 2.5 Flash Preview, tested in zero-shot settings (given no task-specific training examples), both underperformed the dedicated metric-learning models.

Similarly, architectures based on SpeechLM — language models pre-trained on speech data — produced negative results. The researchers argue this reinforces the need for purpose-built evaluation frameworks rather than assuming frontier models can generalise to specialised audio quality assessment.

These are self-reported benchmark results from the paper's authors, and independent replication on different TTS datasets would be needed to confirm generalisability.

A Scalability Problem That Has Long Frustrated the Industry

The commercial stakes here are real. Companies deploying TTS systems — in navigation, accessibility tools, virtual assistants, and content platforms — need to continuously evaluate audio quality across many voices, languages, and acoustic conditions. Running human evaluation panels for every model update is neither fast nor cheap. Automated metrics that reliably approximate human judgment could substantially compress development cycles.

Automated MOS predictors have been around for several years, but their correlation with human judgments has often been inconsistent enough that many teams run human studies anyway. A model that performs at the level of human inter-rater agreement on a standardised dataset represents a step toward changing that calculus.

The study does not claim to replace human evaluation entirely. Edge cases, cultural context, and novel synthesis artefacts may still require human judgment. But for routine quality benchmarking at scale, the case for automated neural evaluation grows stronger.

What This Means

For teams developing or deploying TTS systems, these models offer a credible path to faster, cheaper quality evaluation — though independent validation across diverse datasets will determine how broadly the results hold.