A new study testing LLM-guided scheduling for low-Earth orbit satellites found that fine-tuned LLMs collapsed system throughput to 45.3 Mbps — roughly one-seventh the 342.1 Mbps achieved by a near-constant, static reward configuration — not because the models lacked relevant knowledge, but because their output fluctuated too erratically for the underlying reinforcement learning algorithm to function.

The paper, posted to arXiv under cs.AI and titled "When Adaptive Rewards Hurt," examines a widely held assumption in applied AI research: that dynamically adjusting reward weights based on operating conditions should outperform fixed ones. The researchers tested this assumption systematically in multi-beam LEO (low-Earth orbit) satellite scheduling — a domain where traffic patterns shift dramatically between regimes such as polar handovers and high-demand hotspot regions — and found it largely unfounded.

The Switching-Stability Dilemma

The core finding is what the authors call a switching-stability dilemma. Deep reinforcement learning algorithms like PPO (Proximal Policy Optimization) rely on a stable reward signal to converge their value functions — the internal model they use to estimate future returns. Every time reward weights change, that convergence process effectively restarts from scratch.

Weight adaptation — regardless of quality — degrades performance by repeatedly restarting convergence.

This means even well-calibrated, carefully tuned dynamic weights underperform static ones simply by virtue of changing. The dynamically weighted system achieved just 103.3 Mbps on average, with a standard deviation of 96.8 Mbps — a figure that signals severe instability rather than controlled adaptation. Static weights delivered 342.1 Mbps without that instability.
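The mechanism behind the dilemma can be illustrated with a toy example (not from the paper; all names and numbers below are invented for illustration). A scalar reward built as a weighted sum of objective terms changes value the moment the weights move, even if the agent's behaviour is identical — so the critic's learned value estimates no longer match the signal it is being trained against:

```python
# Toy illustration of reward nonstationarity under weight adaptation.
# The same state/action outcome yields a different scalar reward once
# the weights change, forcing the value function to re-converge.

def reward(components, weights):
    """Scalar reward as a weighted sum of per-objective terms."""
    return sum(weights[k] * components[k] for k in components)

# One fixed behavioural outcome (illustrative values).
components = {"throughput": 0.8, "switch_penalty": -0.3}

static = {"throughput": 1.0, "switch_penalty": 1.0}
adapted = {"throughput": 1.0, "switch_penalty": 1.5}  # weights moved mid-run

r_before = reward(components, static)   # signal the critic converged on
r_after = reward(components, adapted)   # signal it suddenly receives
# Identical behaviour, different reward: convergence restarts.
```

This is the sense in which, per the paper, weight adaptation "regardless of quality" restarts convergence: the degradation comes from the change itself, not from the new weights being wrong.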

A Probing Method That Finds Hidden Leverage

To understand which reward components actually drive performance, the researchers developed what they call single-variable causal probing. The method independently perturbs each reward term by ±20% and measures PPO's response after 50,000 training steps. The technique functions like a sensitivity analysis, but applied directly inside the reinforcement learning loop.
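The probing loop described above can be sketched in a few lines. This is a hedged reconstruction from the article's description, not the paper's code: `train_and_evaluate` stands in for a full PPO training-and-measurement run (50,000 steps in the study), and the weight names are illustrative.

```python
# Sketch of single-variable causal probing: perturb one reward weight
# at a time by ±delta, re-run training, and record the throughput
# change relative to an unperturbed baseline.

def probe_reward_terms(base_weights, train_and_evaluate, delta=0.20):
    """Return the performance delta for each ±delta perturbation of each term.

    train_and_evaluate: callable mapping a weight dict to a scalar metric
    (in the study, throughput after 50k PPO steps).
    """
    baseline = train_and_evaluate(base_weights)
    results = {}
    for term in base_weights:
        for sign in (+1, -1):
            perturbed = dict(base_weights)
            perturbed[term] = base_weights[term] * (1 + sign * delta)
            results[(term, sign * delta)] = train_and_evaluate(perturbed) - baseline
    return results
```

Because each run perturbs exactly one term, a large entry in `results` isolates that term as a causal lever — which is how a +20% nudge to the switching penalty could surface as a +157 Mbps gain.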

The results were unexpected. Increasing the switching penalty — a term that discourages unnecessary handovers between satellite beams — by just 20% produced throughput gains of +157 Mbps in polar handover regimes and +130 Mbps in hot-cold traffic regimes. Neither human domain experts nor trained neural networks identified this leverage without systematic probing, according to the authors.

This finding has practical significance: it suggests that reward design in complex RL systems contains non-obvious sensitivities that cannot be discovered through intuition or standard ablation studies alone.

Four Architectures, One Clear Winner

The study evaluated four MDP (Markov Decision Process) architecture variants that differ in how they set reward weights: a fixed static configuration, a rule-based system, a learned MLP (multilayer perceptron), and a fine-tuned LLM.

The MLP performed best overall. On known traffic regimes it reached 357.9 Mbps; on novel, previously unseen regimes it achieved 325.2 Mbps — a relatively small drop that suggests reasonable generalization. The fixed configuration remained competitive at 342.1 Mbps on known regimes.

The fine-tuned LLM, by contrast, collapsed to 45.3 Mbps with a standard deviation of 43.0 Mbps. The researchers are explicit that this failure does not reflect a lack of satellite scheduling knowledge. The LLM understood the domain. Its problem was output consistency: its weight recommendations varied so much from call to call that the PPO algorithm could never stabilize its value function.

Where LLMs Actually Add Value

Rather than dismissing LLMs from the pipeline entirely, the paper draws a more precise boundary. According to the authors, LLMs offer substantial value in interpreting natural language operator intent — translating high-level human instructions like "prioritize polar coverage during storm season" into structured objectives. That translation task plays to genuine LLM strengths.

What LLMs should not do, the paper argues, is directly output numerical reward weights in a closed-loop RL system. The binding constraint is not knowledge — it is the consistency and stationarity that value function training requires. A component that generates variable outputs on each call is structurally incompatible with that requirement, regardless of how accurate those outputs are in isolation.
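The division of labour the paper argues for can be sketched as follows. This is an illustrative assumption about system shape, not the paper's interface: `interpret_intent` is a hypothetical stand-in for an LLM call that runs once, offline, before training begins, and its output schema is invented for the example.

```python
# Sketch: LLM sits upstream (intent -> structured objectives, run once),
# while the RL loop only ever sees a frozen weight dict, keeping the
# reward signal stationary for PPO.

def interpret_intent(instruction: str) -> dict:
    """Hypothetical stand-in for an LLM call mapping natural-language
    operator intent to reward-weight objectives. In a real system this
    would be a cached, human-reviewed model output."""
    if "polar" in instruction.lower():
        return {"throughput": 1.0, "switch_penalty": 1.2, "polar_coverage": 1.5}
    return {"throughput": 1.0, "switch_penalty": 1.0, "polar_coverage": 1.0}

# Offline, before training: run once and freeze the result.
REWARD_WEIGHTS = interpret_intent("prioritize polar coverage during storm season")

def step_reward(components: dict) -> float:
    """Inside the RL loop: only this fixed weighted sum is computed,
    so the signal PPO trains against never shifts under it."""
    return sum(REWARD_WEIGHTS.get(k, 0.0) * v for k, v in components.items())
```

The design point is where the LLM call sits: outside the training loop, its output variability is absorbed once at configuration time rather than injected into every reward evaluation.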

This distinction matters for the growing field of LLM-RL integration, where researchers increasingly explore using language models as reward designers, curriculum planners, or environment configurators. The satellite scheduling results suggest that any such integration must account for the stationarity requirements of the RL algorithm receiving the LLM's outputs.

Benchmarks and Study Limitations

All performance figures are self-reported by the study's authors and have not been independently replicated. The experimental setup covers specific LEO scheduling scenarios — polar handover and hot-cold traffic regimes — and results may not transfer directly to other satellite architectures or RL algorithms beyond PPO. The causal probing method itself is novel and untested outside this domain.

The paper does not name a specific LLM used for the fine-tuned variant, which limits external reproducibility. The ArXiv preprint has not yet undergone peer review.

What This Means

For engineers building AI-assisted communication systems, this research establishes a concrete principle: LLMs should sit upstream of the RL loop — shaping goals and interpreting intent — while stable, simpler components handle the numerical reward weights that reinforcement learning algorithms actually consume.