A new mathematical framework published on arXiv shows that the growing presence of AI-generated text in the public record creates a feedback loop that, left unmanaged, progressively erodes linguistic diversity — but that deliberate, quality-based filtering can counteract the effect.
The paper, posted to arXiv's cs.CL (Computation and Language) category in April 2025, addresses a problem that has grown more urgent as large language models become prolific producers of publicly available content. Both humans and AI systems now learn from the same shared text corpus, and that corpus is increasingly populated by AI outputs. The authors develop what they describe as an exactly solvable mathematical framework to model this recursive process, using variable-order n-gram agents as their analytical tool.
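The agents are deliberately simpler than modern neural networks: a variable-order n-gram model predicts each token from the longest context it has already seen, backing off to shorter contexts when the longer one is unseen. The Python sketch below is an illustrative stand-in for that idea, with invented names and parameters; it is not the authors' construction.

```python
import random
from collections import defaultdict, Counter

# Illustrative stand-in, not the paper's agent: a variable-order n-gram
# model predicts each token from the longest context it has observed,
# backing off to shorter contexts when the longer one is unseen.
class VariableOrderNgram:
    def __init__(self, max_order=3):
        self.max_order = max_order
        # counts[k] maps a length-k context tuple to a Counter of next tokens
        self.counts = [defaultdict(Counter) for _ in range(max_order + 1)]

    def train(self, tokens):
        for i, tok in enumerate(tokens):
            for k in range(self.max_order + 1):
                if i >= k:
                    ctx = tuple(tokens[i - k:i])
                    self.counts[k][ctx][tok] += 1

    def sample_next(self, history):
        # Back off from the longest matching context down to unigrams.
        for k in range(min(self.max_order, len(history)), -1, -1):
            ctx = tuple(history[len(history) - k:])
            dist = self.counts[k].get(ctx)
            if dist:
                tokens, weights = zip(*dist.items())
                return random.choices(tokens, weights=weights)[0]
        return None

model = VariableOrderNgram(max_order=2)
model.train("the cat sat on the mat".split())
print(model.sample_next("on the".split()))  # "mat"
```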
Two Forces Pulling the Corpus in Opposite Directions
The research identifies two distinct forces acting on the evolving public text record. The first, which the authors call drift, describes what happens when AI-generated text is reused without filtering: rare linguistic forms are gradually squeezed out. In the limit of an infinitely large corpus, the researchers characterise exactly where this process stabilises, at a so-called "shallow" equilibrium in which the statistical structure of language is impoverished and conditioning on longer context yields no further predictive gain.
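The drift mechanism is easy to see in a toy simulation. The snippet below is a deliberate caricature of the dynamics rather than the paper's model: each generation, a finite corpus is resampled from its own empirical distribution with no filtering, and vocabulary size and entropy fall steadily.

```python
import math
import random
from collections import Counter

# Toy illustration of drift, not the paper's exact model: each generation,
# a finite corpus is resampled from the previous one with no filtering.
# Finite-sample noise alone steadily loses rare forms, and entropy decays.
random.seed(0)

VOCAB = list(range(1000))
CORPUS_SIZE = 5000
# Zipf-like start: a long tail of rare forms.
weights = [1.0 / (rank + 1) for rank in range(len(VOCAB))]
corpus = random.choices(VOCAB, weights=weights, k=CORPUS_SIZE)

def entropy(tokens):
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

for gen in range(51):
    if gen % 10 == 0:
        print(f"gen {gen:2d}: vocab={len(set(corpus)):4d}  "
              f"entropy={entropy(corpus):.2f} bits")
    # Unfiltered reuse: the next corpus is a sample of the current one.
    corpus = random.choices(corpus, k=CORPUS_SIZE)
```

This is the discrete analogue of neutral genetic drift: nothing favours common forms, yet finite sampling alone is enough to lose the rare ones.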
The second force is selection: the editorial, algorithmic, and verification processes that determine what actually enters the public record. Unlike drift, selection is not neutral. Its effect depends entirely on the criteria applied.
When publication merely reflects the statistical status quo, the corpus converges to a shallow state in which further lookahead brings no benefit.
That finding is the paper's sharpest warning. If ranking systems and publication mechanisms simply amplify whatever is already statistically common, rewarding fluency over correctness or popularity over novelty, they accelerate the drift toward shallowness rather than correcting it.
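In the same toy setting, the warning can be made concrete: weight each form's chance of republication by how common it already is, and the collapse accelerates. The squared-frequency weighting below is an arbitrary way to model rich-get-richer ranking, not a parameter from the paper.

```python
import random
from collections import Counter

# Hypothetical "popularity" selector for the toy simulation above: weight
# each form's chance of republication by the square of its current
# frequency, so already-common forms are favoured twice over, once by the
# generator and once by the selector.
def popularity_filtered_generation(corpus, size):
    counts = Counter(corpus)
    tokens = list(counts)
    weights = [counts[t] ** 2 for t in tokens]
    return random.choices(tokens, weights=weights, k=size)
```

Substituting this for the neutral resampling step in the earlier loop (`corpus = popularity_filtered_generation(corpus, CORPUS_SIZE)`) empties the vocabulary far faster than drift alone.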
What "Shallow" Actually Means
The term "shallow equilibrium" refers to a state in which the text corpus loses the deeper structural patterns that make language informationally rich. In practical terms, this could manifest as AI models that produce text that is grammatically plausible and stylistically consistent but lacks nuance, rare vocabulary, and the kind of structural complexity associated with expert or creative writing.
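One way to make shallowness measurable, staying within the toy setting above, is to ask how much predictive information each additional token of context buys. In a rich corpus, conditional entropy keeps falling as context grows; in a shallow one, it flattens almost immediately. The diagnostic below is illustrative and is not a metric taken from the paper.

```python
import math
from collections import defaultdict, Counter

# Illustrative diagnostic, not from the paper: estimate the conditional
# entropy H(next token | previous k tokens) for increasing k. A corpus is
# "shallow" in the paper's sense when extra context stops reducing this
# number, i.e. deeper lookahead brings no benefit.
def conditional_entropy(tokens, k):
    ctx_counts = defaultdict(Counter)
    for i in range(k, len(tokens)):
        ctx_counts[tuple(tokens[i - k:i])][tokens[i]] += 1
    n = len(tokens) - k
    h = 0.0
    for nexts in ctx_counts.values():
        total = sum(nexts.values())
        for c in nexts.values():
            h -= (c / n) * math.log2(c / total)
    return h

text = "the cat sat on the mat and the cat sat on the hat".split()
for k in range(3):
    print(f"H(next | {k}-token context) = {conditional_entropy(text, k):.2f} bits")
```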
The researchers establish an optimal upper bound, one that cannot be tightened further, on how far a corpus can diverge from this shallow baseline when publication criteria are genuinely normative, that is, when they reward quality, correctness, or novelty. This bound matters because it quantifies the maximum benefit that good curation can deliver, giving dataset designers a concrete target rather than a vague aspiration.
Because the framework is exactly solvable, it yields closed-form results rather than approximations, lending its conclusions more analytical weight than simulation-based studies typically carry.
The Model-Collapse Literature, Extended
This work sits within a growing body of research on what is sometimes called model collapse — the phenomenon whereby models trained on AI-generated data progressively degrade. Earlier empirical studies, including influential work from 2023 and 2024, demonstrated the degradation experimentally. The new paper advances the field by offering a theoretical foundation that separates the mechanisms responsible and characterises their outcomes precisely.
Where previous research tended to treat the problem as a binary (model collapse happens or it doesn't), the new framework recasts it as a spectrum governed by the interplay between drift and selection. The corpus does not simply collapse; it drifts toward a predictable equilibrium whose depth depends on what filtering occurs along the way.
It also extends the analysis beyond individual model generations to the broader public text ecosystem, acknowledging that the feedback loop involves human readers and writers as well as AI systems. That framing is significant: it positions the problem not merely as a technical failure of training pipelines but as a structural feature of how knowledge now circulates.
Implications for Dataset Design
The practical upshot for AI developers and data curators is explicit in the paper. The framework identifies the conditions under which recursive publication compresses public text — and the conditions under which selective filtering sustains richer structure. This gives dataset designers a principled basis for making curation decisions rather than relying on intuition or ad hoc rules.
The key variable is the nature of the selection criterion. Filtering that merely reflects statistical frequency — ranking content by engagement metrics, for instance, or selecting text that resembles existing high-volume sources — will tend to reinforce drift. Filtering that rewards properties orthogonal to frequency, such as factual accuracy, structural complexity, or genuine novelty, can preserve the diversity that future models depend on.
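In the toy setting, a frequency-orthogonal criterion can be sketched by letting rarity itself stand in for quality: rare forms receive a modest boost at publication time. In a real pipeline the score would come from verification signals, accuracy checks, or complexity metrics; the boost parameter and power-law form below are assumptions of the sketch, not a methodology from the paper.

```python
import random
from collections import Counter

# Sketch of a frequency-orthogonal selector in the same toy setting. Here
# "quality" is stood in for by a novelty boost for rare forms; any
# frequency-independent score would play the same structural role.
def novelty_filtered_generation(corpus, size, boost=0.5):
    counts = Counter(corpus)
    tokens = list(counts)
    # weight proportional to frequency^(1 - boost): boost > 0 flattens the
    # republication distribution, pushing back against drift's loss of
    # rare forms.
    weights = [counts[t] ** (1.0 - boost) for t in tokens]
    return random.choices(tokens, weights=weights, k=size)
```

Because the boost counteracts frequency rather than amplifying it, it acts as a restoring force against drift, which is the role the framework assigns to normative selection.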
The authors do not prescribe a specific filtering methodology, and the framework is theoretical rather than empirical. How well the n-gram model generalises to the behaviour of modern transformer-based systems operating at scale is a question the paper does not fully resolve — a limitation worth noting given the complexity gap between the two.
What This Means
For anyone building or maintaining AI training datasets, this research provides the clearest mathematical account yet of why unmanaged data feedback loops degrade language corpora. It also makes the case that quality-based curation is not merely good practice but a structural necessity for sustaining the richness that capable AI systems require.