A new framework called WAND, short for Windowed Attention and Knowledge Distillation, reduces the memory footprint of autoregressive text-to-speech models by up to 66.2% while maintaining audio quality comparable to full-attention systems, according to a paper posted to arXiv.

State-of-the-art text-to-speech systems have increasingly adopted decoder-only autoregressive architectures, the same design family used in large language models like GPT, because they produce natural-sounding speech. The problem is that these models carry a fundamental computational burden: their memory and processing demands grow quadratically with the length of the audio being generated. A sentence costs far less than a paragraph; a paragraph costs far less than a full document read aloud. For real-world deployment, especially on devices with limited hardware, this scaling behaviour is a serious obstacle.

Why Quadratic Scaling Is a Problem for Speech AI

The root cause is a mechanism called self-attention, which allows a model to consider every previously generated token when producing the next one. This is powerful for quality but expensive at scale. The memory structure that stores the key and value vectors for every past token, known as the KV cache, grows with each additional audio token generated, meaning longer speech outputs consume disproportionately more memory and take longer to process per step.
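
The scaling gap is easy to see with some back-of-the-envelope arithmetic. The sketch below is illustrative only, with hypothetical round-number model dimensions, not figures from the paper:

```python
# Illustrative arithmetic (hypothetical model sizes): KV cache footprint
# with full attention vs. a fixed attention window.

def kv_cache_bytes(num_tokens, layers=24, heads=16, head_dim=64,
                   bytes_per_value=2, window=None):
    """Bytes held in the KV cache after generating `num_tokens` tokens.

    Each cached token stores one key and one value vector per layer and
    head. With a sliding window, only the most recent `window` tokens
    are retained, so the cache size stops growing once the window fills.
    """
    cached = num_tokens if window is None else min(num_tokens, window)
    return cached * layers * heads * head_dim * 2 * bytes_per_value

for t in (500, 2000, 8000):
    full = kv_cache_bytes(t) // 2**20             # MiB, full attention
    windowed = kv_cache_bytes(t, window=500) // 2**20  # MiB, windowed
    print(f"{t} tokens: full={full} MiB, windowed={windowed} MiB")
```

With full attention the cache grows linearly with output length (and total attention compute quadratically), while the windowed cache plateaus at a constant size once the window is full.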

WAND addresses this by splitting the attention mechanism into two distinct components. The first is persistent global attention, which keeps the model connected to the original conditioning tokens — the text input and speaker characteristics that define what the model is supposed to say and how. The second is local sliding-window attention, which limits how far back the model looks when generating new audio tokens, focusing only on a recent window rather than the entire history.

The framework achieves up to 66.2% KV cache memory reduction and length-invariant, near-constant per-step latency across three modern AR-TTS models.

This combination preserves the contextual information that matters most — the source text and voice style — while eliminating the expensive, and arguably redundant, long-range dependencies within the generated audio sequence itself.
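
The two-part attention pattern can be sketched as a boolean attention mask. This is a minimal illustration of the idea described above, not the paper's implementation, and the sizes are hypothetical:

```python
# Sketch of a hybrid attention mask: every generated token may attend to
# all conditioning (text/speaker) tokens, plus a sliding window of the
# most recently generated audio tokens.

def hybrid_attention_mask(num_prompt, num_audio, window):
    """Return a (T, T) boolean mask where True means "may attend".

    Rows are query positions, columns are key positions. The first
    `num_prompt` positions hold the conditioning tokens.
    """
    total = num_prompt + num_audio
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):       # causal: never attend to the future
            if k < num_prompt:       # persistent global attention
                mask[q][k] = True
            elif q - k < window:     # local sliding-window attention
                mask[q][k] = True
    return mask

mask = hybrid_attention_mask(num_prompt=4, num_audio=6, window=2)
# The last audio token (position 9) sees all 4 prompt tokens plus
# positions 8 and 9, but none of the older audio positions 4-7.
```

Because each query attends to at most `num_prompt + window` keys, per-step cost no longer depends on how much audio has already been generated.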

Teaching an Efficient Model From a More Capable One

Adapting a pretrained model to use windowed attention is not straightforward. Simply swapping the attention type and retraining tends to destabilise the model, producing degraded audio quality. The WAND team tackles this with two complementary techniques.

The first is curriculum learning: rather than immediately restricting the model to a narrow attention window, training progressively tightens the window over time, allowing the model to gradually adjust. This staged approach — moving from wide to narrow attention — stabilises the fine-tuning process and prevents the quality degradation that blunt architectural changes can cause.
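
A curriculum of this kind can be as simple as a step-dependent window size. The schedule below is hypothetical (the article does not give the paper's actual stage boundaries or window sizes); it only illustrates the wide-to-narrow progression:

```python
# Hypothetical curriculum schedule: the attention window shrinks in
# piecewise-constant stages over fine-tuning, letting the model adapt
# gradually instead of being restricted all at once.

def window_at_step(step, total_steps, schedule=(4096, 2048, 1024, 512)):
    """Return the attention window size to use at a given training step.

    Training is divided into len(schedule) equal stages; each stage
    uses the next, narrower window from the schedule.
    """
    stage = min(len(schedule) - 1, step * len(schedule) // total_steps)
    return schedule[stage]

# Over 10,000 steps this yields: 4096 -> 2048 -> 1024 -> 512.
```

The final stage's window is the one used at inference time, which is what determines the deployed model's cache size and latency.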

The second technique is knowledge distillation, where the efficient WAND model learns by mimicking the outputs of the original, full-attention model — referred to as the "teacher." The student model is trained to reproduce the teacher's behaviour, recovering synthesis quality that might otherwise be lost in the transition to windowed attention. The researchers report that this distillation process achieves high quality with strong data efficiency, meaning it does not require enormous additional datasets to work.
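
A common form of logit-level distillation minimises the divergence between the teacher's and student's next-token distributions, softened by a temperature. The sketch below shows that standard formulation; the paper's exact objective may differ:

```python
# Minimal sketch of temperature-scaled knowledge distillation for one
# next-token prediction step (standard formulation, not WAND-specific).
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over softened next-token distributions.

    A temperature above 1 flattens both distributions, exposing the
    teacher's relative preferences among non-top tokens, which is much
    of the signal the student learns from.
    """
    p = softmax(teacher_logits, temperature)   # teacher targets
    q = softmax(student_logits, temperature)   # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical logits give zero loss; the loss grows as the student drifts.
```

In practice this loss is summed over the generated token sequence and combined with the usual training objective, so the windowed student is pulled back toward the full-attention teacher's behaviour.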

Tested Across Three Speech Models

The framework was evaluated on three modern autoregressive TTS models, though the paper's abstract does not name all of them explicitly. Across these systems, WAND consistently preserved the original audio quality while delivering the memory savings and latency benefits. Critically, per-step latency becomes length-invariant, meaning the time taken to generate each audio token does not increase as the output grows longer. This is a qualitative shift from standard full-attention decoding, where each successive token takes longer to generate as the accumulated context grows.

These benchmarks are reported by the researchers themselves and have not yet undergone independent peer review, as the paper is a preprint posted to arXiv.

Practical Implications for Speech Deployment

The significance of near-constant latency extends beyond academic benchmarks. Streaming text-to-speech applications — such as voice assistants, real-time narration tools, or accessibility software — are particularly sensitive to per-token generation speed. A system that slows down mid-sentence as it processes longer context is a system that stutters. WAND's windowed approach removes that dependency, making performance predictable regardless of how long the spoken output becomes.

Memory reduction matters equally for deployment on edge devices — smartphones, hearing aids, embedded systems — where RAM is constrained and running a full-attention TTS model may simply be impractical. A 66.2% reduction in KV cache memory could make the difference between a model fitting on a device or not.

The knowledge distillation component also has a practical upside: organisations that have already trained high-quality TTS models can use those existing models as teachers to produce efficient WAND variants, without rebuilding from scratch or assembling large new training datasets.

What This Means

WAND offers a tested route to deploying high-fidelity text-to-speech AI in environments where memory and latency constraints have previously made it impractical — potentially broadening access to quality voice synthesis across consumer devices and real-time applications.