Researchers have introduced Sequence-Level PPO (SPPO), a reinforcement learning algorithm designed to make training large language models on long reasoning tasks more efficient, according to a paper published on arXiv in April 2025.

The work addresses a persistent tension in AI training: Proximal Policy Optimization (PPO), the standard algorithm for aligning language models, encounters serious problems when applied to tasks requiring extended chains of reasoning — the kind of step-by-step thinking now central to frontier AI systems.

Why Standard PPO Struggles With Long Reasoning Chains

PPO's core difficulty in this context is what researchers call temporal credit assignment — determining which part of a long reasoning sequence deserves credit or blame for a final correct or incorrect answer. When a model works through a multi-step maths problem across dozens of tokens, attributing reward to the right decisions becomes unstable. Standard PPO also requires a large "value model" running alongside the main model, which carries a substantial memory cost.
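To make the credit-assignment problem concrete, here is a minimal sketch (our illustration, not code from the paper) of Generalised Advantage Estimation (GAE), the mechanism standard PPO uses to spread an outcome reward back across individual tokens. It needs a value estimate for every token position — the job of the separate value model mentioned above.

```python
# Illustrative sketch of token-level GAE, as used in standard PPO.
# Function name and inputs are our assumptions for this example.
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalised Advantage Estimation over a token sequence.

    rewards: per-token rewards (typically zero until the final token)
    values:  value-model estimates, one per token plus a bootstrap value
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error at step t, using the value model's per-token estimates
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# A sparse outcome reward arrives only at the last token; GAE smears
# credit backwards across every token of the chain of thought.
adv = gae_advantages(rewards=[0.0, 0.0, 0.0, 1.0],
                     values=[0.2, 0.3, 0.4, 0.5, 0.0])
```

With thousands of tokens per response, these per-token estimates become the noisy, unstable signal the article describes — and the value model that produces them must be kept in memory alongside the policy.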

SPPO's answer is to reformulate the reasoning process as a sequence-level contextual bandit problem, using a lightweight, decoupled scalar value function to produce low-variance advantage estimates without sampling multiple responses per prompt.

A competing approach, GRPO (Group Relative Policy Optimisation), sidesteps the value model entirely — but introduces its own cost. GRPO estimates a performance baseline by generating multiple responses to each prompt and comparing them, which the paper's authors argue "severely limits training throughput" due to the compute required to produce those additional samples.
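The group-relative baseline at the heart of GRPO can be sketched in a few lines (our illustration under common descriptions of the method, not code from either paper): each response's reward is normalised against the mean and standard deviation of its sampled group. The cost is the group itself — every extra response must be generated before any learning happens.

```python
# Illustrative sketch of GRPO's group-relative advantage.
# Names and the epsilon guard are our choices for this example.
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Normalise each response's reward against its group's statistics."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Four sampled responses to the same prompt, scored 0/1 for correctness;
# generating all four is the throughput cost the SPPO authors criticise.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

The baseline here is free of any value model, but it is paid for in generated tokens: a group size of four roughly quadruples the sampling work per prompt.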

What SPPO Does Differently

SPPO's central design choice is to treat the entire reasoning sequence as a single action, rather than evaluating every token individually. This is what the researchers mean by a "Sequence-Level Contextual Bandit" formulation — the model commits to a full response, and the reward is assigned at the outcome level, not distributed token by token.

To generate advantage estimates — the measure of how much better or worse a given response was than expected — SPPO uses a decoupled scalar value function. This is a deliberately lightweight component compared to the full value networks used in standard PPO, and it removes the need to generate multiple samples per prompt that GRPO depends on.
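Based on the paper's description, the mechanism can be sketched as follows — this is our reconstruction, not the authors' code, and the function names and shapes are assumptions. A single scalar value per prompt serves as the baseline, so neither extra samples per prompt nor a token-level value network is needed, and the familiar PPO clipping is applied once per response rather than once per token.

```python
# Hypothetical sketch of a sequence-level advantage and clipped objective,
# reconstructed from the article's description of SPPO.
def sppo_advantage(outcome_reward, prompt_value):
    """Advantage for one full response treated as a single bandit action.

    prompt_value: the decoupled scalar value function's estimate of the
    expected reward for this prompt.
    """
    return outcome_reward - prompt_value

def clipped_objective(advantage, ratio, eps=0.2):
    """PPO-style clipped surrogate applied at the sequence level.

    ratio: new-policy / old-policy likelihood ratio for the whole response.
    """
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)

# A correct answer (reward 1.0) on a prompt the value head rated 0.4:
adv = sppo_advantage(1.0, 0.4)           # advantage of 0.6
obj = clipped_objective(adv, ratio=1.5)  # ratio is clipped to 1.2
```

The contrast with the earlier approaches is the point: one scalar baseline per prompt replaces both GRPO's group of sampled responses and PPO's per-token value network.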

The result, according to the authors, is a training method that retains the sample efficiency of PPO (meaning it extracts more learning from each piece of training data) while achieving the stability that outcome-based methods like GRPO offer.

Benchmark Performance: Self-Reported Results on Maths Tasks

The paper reports results across mathematical reasoning benchmarks, a standard testing ground for reasoning-focused language models. These results are self-reported by the paper's authors and have not been independently verified.

According to the paper, SPPO "significantly surpasses standard PPO" and "matches the performance of computation-heavy group-based methods" — a reference to GRPO-style approaches. The researchers frame this as demonstrating a favourable trade-off: competitive accuracy at materially lower computational cost.

The paper does not specify which model sizes were tested, nor does it report precise throughput figures comparing training time or GPU hours against GRPO directly — details that independent researchers will likely want to scrutinise before drawing firm conclusions.

Where This Fits in the Broader Reinforcement Learning Picture

The problem SPPO addresses sits at the heart of current AI development. Reinforcement learning from human or automated feedback is the standard technique for pushing language models toward reliable, accurate reasoning — and PPO has been its workhorse since OpenAI popularised it for language model alignment. But scaling that approach to the long, deliberate chains of thought that characterise advanced reasoning models has proven technically difficult and expensive.

Recent high-profile systems, including DeepSeek-R1 and reasoning-focused variants of models from Google and Anthropic, all grapple with this same training challenge. Methods that reduce compute requirements without sacrificing quality have direct commercial relevance: training costs remain a major constraint on how frequently and extensively these models can be updated.

SPPO joins a growing body of work — including GRPO, REINFORCE-style variants, and various hybrid approaches — that attempts to find a more practical middle ground. The "contextual bandit" framing is not entirely novel in reinforcement learning research, but applying it explicitly at the sequence level for language model reasoning, with a decoupled value function, represents a distinct architectural choice.

What This Means

If SPPO's efficiency gains are validated through independent evaluation, the algorithm could reduce the cost of training capable reasoning models — making it more accessible to research groups and organisations without access to the largest compute clusters.