Researchers at Apple ML Research have identified a fundamental problem with the training methods behind many of today's most capable AI reasoning systems: the algorithms designed to make models smarter are simultaneously making them less creative and less diverse in their outputs.

The paper, published on Apple's machine learning research portal, focuses on policy gradient algorithms — a family of techniques that have become central to training large language models, particularly for tasks requiring multi-step reasoning. These methods work by having a model generate its own responses, evaluate how good they are, and then adjust to produce more of what worked. The problem, according to Apple's researchers, is what happens to everything else.

How Training Methods Reduce Model Output Diversity

The concept at the centre of this research is entropy — in this context, a measure of how varied and unpredictable a model's outputs are. High entropy means a model explores many different approaches to a problem. Low entropy means it has converged on a narrow set of responses it repeatedly favours.
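As a rough illustration (our own sketch, not code from the paper), Shannon entropy over a model's candidate responses can be computed directly, and the contrast between an exploratory and a collapsed distribution is easy to see:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy of a probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A model spreading probability across four approaches (high entropy)
# versus one that has collapsed onto a single favourite (low entropy).
diverse = [0.25, 0.25, 0.25, 0.25]
collapsed = [0.97, 0.01, 0.01, 0.01]

print(round(shannon_entropy(diverse), 3))    # 1.386, the maximum for four options
print(round(shannon_entropy(collapsed), 3))  # 0.168
```

The distributions here are invented for illustration; the point is only that entropy quantifies the "spread" of a model's behaviour in a single number that can be tracked over training.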

Apple's researchers argue that most policy gradient algorithms naturally reduce entropy as a side effect of training. As the model learns which responses score well, it gravitates toward those patterns and away from alternatives — even alternatives that might be useful in different contexts. Over many training steps, this compounds into a model that is structurally less capable of exploration.
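The compounding effect shows up even in a toy setting. The sketch below (illustrative only, not Apple's analysis) applies repeated REINFORCE-style updates to a softmax policy over three strategies, always rewarding the first; entropy falls at every step as probability mass piles onto the rewarded strategy:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [0.0, 0.0, 0.0]   # three strategies, initially equally likely
lr, reward = 0.5, 1.0
history = []

for _ in range(20):
    probs = softmax(logits)
    history.append(entropy(probs))
    # REINFORCE gradient when strategy 0 is always sampled and rewarded:
    # d log pi(0) / d logit_i = 1{i == 0} - probs[i]
    for i in range(3):
        logits[i] += lr * reward * ((1.0 if i == 0 else 0.0) - probs[i])

# Entropy drifts monotonically downward over the run.
print(round(history[0], 3), "->", round(history[-1], 3))
```

No entropy term is pushing the policy toward collapse here; the narrowing is a side effect of the update rule itself, which is the dynamic the researchers describe.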

This matters because exploration is not just a nice-to-have. For tasks like mathematical reasoning, coding, or complex problem-solving, a model needs to consider multiple approaches before arriving at a solution. A model that has learned to always reach for the same strategies will struggle when those strategies don't apply.

A model that has learned to always reach for the same strategies will struggle when those strategies don't apply.


Why This Problem Has Gone Largely Unaddressed

Entropy reduction is not a new observation in machine learning. Researchers have long known that optimisation processes tend to reduce variability. What Apple's paper contributes is a formal analysis of how and why this happens specifically within policy gradient training for language models, and an argument that the field has not prioritised the problem.

Many existing training pipelines do include entropy-related terms — penalty coefficients designed to discourage a model from becoming too confident too quickly. But according to the Apple researchers, these are often set once at the start of training and left unchanged, rather than being actively managed in response to how entropy actually evolves during the run. The paper argues this passive approach is insufficient.
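In code, the passive approach often looks something like the following sketch (illustrative; the constant name and weighting are assumptions of ours, not taken from the paper):

```python
# Entropy bonus with a coefficient fixed before training begins.
# Subtracting the entropy term rewards the model for staying uncertain,
# but the strength of that incentive never adapts to how entropy
# actually evolves during the run.
ENT_COEF = 0.01  # set once at the start, never revisited

def training_loss(policy_loss, value_loss, entropy):
    return policy_loss + 0.5 * value_loss - ENT_COEF * entropy
```

Whatever value `ENT_COEF` takes, the model's entropy can still collapse late in training without the loss function ever noticing.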

The researchers propose that entropy should instead be continuously monitored throughout training, with active interventions when it drops below levels that would impair a model's ability to explore. The paper includes a formal analysis of this dynamic, though the full technical details of the proposed method appear only in the complete paper, not in the published summary.
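One simple way such active management could work, sketched here under our own assumptions (the threshold and adjustment rule are illustrative, not the paper's method), is to measure entropy each step and strengthen the entropy bonus whenever it falls below a floor:

```python
class EntropyController:
    """Adjusts an entropy-bonus coefficient in response to measured entropy."""

    def __init__(self, floor, coef=0.01, factor=1.5):
        self.floor = floor    # entropy level below which exploration is impaired
        self.coef = coef      # current entropy-bonus weight
        self.factor = factor  # how aggressively to intervene

    def update(self, measured_entropy):
        if measured_entropy < self.floor:
            self.coef *= self.factor   # entropy too low: push exploration harder
        else:
            self.coef /= self.factor   # entropy healthy: relax the bonus
        return self.coef

ctrl = EntropyController(floor=1.0)
ctrl.update(0.4)  # below the floor, so the coefficient increases
ctrl.update(1.8)  # above the floor, so the coefficient eases back
```

The contrast with the passive approach is the feedback loop: the coefficient responds to the training run rather than being frozen before it starts.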

What This Means for AI Reasoning Models

The timing of this research is significant. Over the past year, reasoning-focused language models have become one of the most competitive areas in AI development. Systems from OpenAI, Google DeepMind, Anthropic, and others have all leaned heavily on reinforcement learning from human feedback and related policy gradient techniques to improve step-by-step reasoning.

If Apple's analysis is correct, there is a meaningful risk that some of the performance gains seen in reasoning benchmarks come at the cost of generalisation — that models trained this way become very good at the specific types of problems they were rewarded for, while losing the flexibility to handle genuinely novel challenges.

It is worth noting that the benchmarks typically used to evaluate reasoning models are self-reported by the companies that develop them, and measure performance on fixed test sets. They may not capture whether a model's problem-solving repertoire is narrowing over time.

Apple does not currently operate a publicly available AI assistant or reasoning model in the same commercial capacity as its peers, which gives this research a somewhat different character. Rather than defending an existing product, the company appears to be contributing a critique of techniques the broader industry relies upon — a posture that reflects Apple's historically cautious approach to AI deployment.

Implications for AI Development

If Apple's framework is adopted, AI developers may need to add active entropy management as a standard component of training pipelines — a shift that could meaningfully improve how well reasoning models handle unfamiliar problems, rather than simply improving performance on familiar ones.