A new study from ArXiv CS.CL demonstrates that exponential moving average traces — the simplest possible form of recurrent memory in AI models — can capture grammatical structure effectively but are mathematically incapable of preserving word identity, a limitation no downstream processing can overcome.
The research addresses a foundational question in AI sequence modelling: what do more complex memory mechanisms actually add over the simplest possible alternative? As efficient sequence models have proliferated — often marketed on speed and compactness — the field has lacked a clear, controlled account of where simplicity stops being enough. This study provides that account with precision.
How EMA Traces Work — and Why They Were Chosen
An exponential moving average (EMA) trace is a running summary of past inputs, computed with fixed coefficients so that recent information counts more than older information. The key word is "fixed": the coefficients that control how much weight each past token receives are set in advance and never change based on what the model is actually reading. There is no gating, no selective attention, no content-based retrieval; just accumulation.
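The update rule can be sketched in a few lines. This is an illustrative minimal version, not the paper's exact formulation; the decay rate `alpha` and the toy two-dimensional embeddings are assumptions for demonstration. The point to notice is that `alpha` never depends on the token being read.

```python
import numpy as np

def ema_trace(embeddings, alpha=0.5):
    """Return the EMA trace after reading each token embedding.

    trace_t = alpha * trace_{t-1} + (1 - alpha) * x_t
    The coefficient alpha is fixed in advance: the trace cannot
    choose to keep or discard a token based on its content.
    """
    trace = np.zeros_like(embeddings[0], dtype=float)
    history = []
    for x in embeddings:
        trace = alpha * trace + (1 - alpha) * x
        history.append(trace.copy())
    return history

# Two distinct tokens blur into one weighted average: the trace
# records "roughly how much of each direction", not which token
# appeared at which position.
tokens = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
traces = ema_trace(tokens, alpha=0.5)
```

Because every step applies the same fixed decay, the final trace is a deterministic weighted average of the inputs, and distinct orderings of similar tokens quickly become indistinguishable.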
The researchers chose EMA traces deliberately as a controlled probe. By starting with the simplest possible mechanism, they could isolate exactly which capabilities require something more sophisticated, rather than attributing failures to the complexity of a full model.
Fixed-coefficient accumulation suffers irreversible information dilution that only learned, input-dependent selection can resolve.
The Structure Finding: EMA Is Competitive With Supervised Models
The first major finding is that simple accumulation goes surprisingly far on structure. A Hebbian architecture (a learning approach inspired by how biological neurons strengthen connections) built on multi-timescale EMA traces achieved 96% of the performance of a supervised BiGRU (a standard bidirectional recurrent neural network) on grammatical role assignment. It did so with zero labels: it required no annotated training data.
On structure-dependent grammatical roles specifically, the EMA-based model matched or exceeded the supervised model's performance. This suggests that temporal structure — the rhythm and ordering of language — is genuinely encoded in simple accumulation across multiple timescales. For tasks that depend on sequence shape rather than specific word content, EMA traces perform comparably to far more complex alternatives.
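"Multiple timescales" here means running several EMA traces in parallel, each with a different decay rate, so that fast traces track the last word or two while slow traces track sentence-level context. A minimal sketch follows; the specific decay rates are hypothetical, as the paper's exact schedule is not reproduced here.

```python
import numpy as np

def multi_timescale_traces(embeddings, alphas=(0.5, 0.9, 0.99)):
    """Concatenate EMA traces at several fixed decay rates.

    A fast trace (alpha=0.5) reflects the most recent tokens; a slow
    trace (alpha=0.99) summarises long-range context. Together they
    encode the temporal "shape" of the sequence at multiple scales.
    """
    traces = [np.zeros_like(embeddings[0], dtype=float) for _ in alphas]
    outputs = []
    for x in embeddings:
        for i, a in enumerate(alphas):
            traces[i] = a * traces[i] + (1 - a) * x
        outputs.append(np.concatenate(traces))
    return outputs
```

The concatenated vector is what a downstream (here, Hebbian) readout would consume: rich in ordering and rhythm, but still a set of fixed-coefficient averages.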
Where Simple Memory Fails
The second finding identifies the hard boundary. When the researchers built a 130-million-parameter language model that used only EMA context — no attention, no gating — the model reached a perplexity of 260 on the C4 dataset. Perplexity measures prediction error; lower is better. For comparison, GPT-2, a model now several years old, achieves roughly 30 on comparable benchmarks. The EMA-only model performed approximately 8 times worse.
A perplexity of 260 represents substantial failure at language modelling as a practical task: the model cannot reliably predict what word comes next because it has lost track of which specific words appeared earlier in the sequence.
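For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to each token. A short sketch makes the intuition concrete: a perplexity of 260 corresponds to being, on average, as uncertain as a uniform guess over 260 words.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns every token probability 1/260 has perplexity ~260,
# while assigning probability 1/30 (roughly GPT-2's level on comparable
# benchmarks) gives perplexity ~30.
print(perplexity([math.log(1 / 260)] * 100))
print(perplexity([math.log(1 / 30)] * 100))
```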
To confirm where the failure originates, the researchers ran a precise ablation experiment: they replaced the simple linear predictor at the end of the model with full softmax attention — one of the most powerful prediction mechanisms available. The loss did not improve. This result localises the problem definitively to the traces themselves, not to any weakness in the prediction stage.
The Mathematical Reason: Data Processing Inequality
The researchers ground their explanation in information theory rather than empirical observation alone. EMA traces perform what they call lossy, data-independent compression — they discard information about specific tokens in a way that is determined by the fixed coefficients, not by the content of what is being read.
The data processing inequality is a theorem in information theory stating that processing data can never increase the amount of information it contains. Once information is discarded at the compression stage, no downstream system — however powerful — can recover it. The researchers apply this principle to argue that the failure of EMA traces to preserve token identity is not a fixable engineering problem; it is a mathematical ceiling.
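Stated formally: if the token sequence, the trace, and the downstream prediction form a Markov chain (the prediction sees the tokens only through the trace), the inequality reads

```latex
X \;\to\; T \;\to\; \hat{Y}
\quad\Longrightarrow\quad
I(X;\hat{Y}) \;\le\; I(X;T)
```

where $X$ is the token sequence, $T$ the EMA trace, $\hat{Y}$ the predictor's output, and $I(\cdot;\cdot)$ mutual information. Whatever token-identity information the fixed-coefficient compression destroys in $T$ is unavailable to $\hat{Y}$, no matter how powerful the predictor. This is exactly what the attention-readout ablation confirms empirically.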
This framing shifts the question from "how do we make EMA models better" to "which tasks fall above or below this ceiling."
Implications for Efficient Model Design
The AI industry has invested heavily in efficient sequence models — architectures that process long documents faster and more cheaply than standard transformers. Many of these models use variants of fixed or semi-fixed recurrent context as part of their design. This study does not argue those architectures are valueless, but it maps the conditions under which their memory mechanisms will fail.
Specifically, the findings suggest that tasks requiring recall of specific tokens — which word appeared, not just that some word in a certain grammatical position appeared — require learned, input-dependent selection. Mechanisms like gating (used in LSTMs and GRUs) or attention allow a model to decide what to remember based on content. Fixed accumulation cannot make that decision.
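The contrast can be made concrete. Below is a hypothetical single-layer gate in the spirit of GRU/LSTM gating (the weight matrix `w_gate` and the update form are illustrative assumptions, not the paper's architecture). Unlike the fixed-alpha EMA, the retention coefficient here is computed from the current token, so the model can learn to hold on to specific content.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_trace(embeddings, w_gate):
    """Input-dependent accumulation: a learned gate decides, per token,
    how much of the old state to keep versus overwrite.

    g depends on the content of x, so the effective "decay rate" is
    chosen by the data; a fixed-alpha EMA cannot make that choice.
    """
    state = np.zeros_like(embeddings[0], dtype=float)
    for x in embeddings:
        g = sigmoid(w_gate @ x)          # gate computed from the token itself
        state = g * state + (1 - g) * x  # content-based, per-dimension decay
    return state
```

Swapping the scalar constant `alpha` for the learned, input-conditioned vector `g` is precisely the step the study identifies as necessary for preserving token identity.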
For practitioners building or selecting models for specific applications, this provides a diagnostic framework: if the task is structure-sensitive (parsing, grammatical analysis, rhythm detection), simple recurrent context may suffice. If the task requires content retrieval (question answering, factual recall, precise summarisation), it will not.
What This Means
Designers of efficient AI sequence models now have a mathematically grounded boundary separating tasks where simple recurrent memory suffices from those where it cannot, meaning architectural choices can be matched to task requirements rather than defaulted to maximum complexity.