A new paper on arXiv argues that neural network training works as well as it does because of hidden mathematical conservation laws, and that understanding exactly how those laws break under standard gradient descent explains why models self-regularise during learning.
The question the paper addresses is deceptively simple: gradient descent has no theoretical guarantee of finding good solutions in non-convex landscapes, yet in practice it almost always does. This gap between theory and reality has troubled machine-learning researchers for years. The new work, posted to arXiv cs.LG in April 2025, attempts to close that gap with a rigorous, experimentally validated framework.
The Conservation Laws Hidden Inside ReLU Networks
The central finding is that gradient flow — the continuous-time idealisation of gradient descent — preserves a set of conservation laws in L-layer ReLU networks without bias. Specifically, the quantity C_l = ||W_{l+1}||²_F − ||W_l||²_F (the difference in squared Frobenius norms between adjacent weight matrices) stays constant throughout training. This confines the network's learning trajectory to a lower-dimensional surface inside the full weight space, which the authors argue is why the optimiser does not wander into the worst-case regions that make the problem theoretically hard.
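The conservation claim is easy to check numerically in a minimal setting. The sketch below is my own illustration, not the paper's code: it trains a two-layer bias-free ReLU network (the bias-free assumption matters) with a very small learning rate as a crude stand-in for continuous-time gradient flow, and tracks C = ||W2||²_F − ||W1||²_F along the way.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.3, (8, 5))   # first-layer weights (no bias)
W2 = rng.normal(0.0, 0.3, (3, 8))   # second-layer weights (no bias)
X = rng.normal(size=(5, 20))        # toy inputs
Y = rng.normal(size=(3, 20))        # toy targets

def conserved(W1, W2):
    # C = ||W2||_F^2 - ||W1||_F^2, constant under exact gradient flow
    return np.sum(W2**2) - np.sum(W1**2)

eta = 5e-5  # tiny step: approximates the continuous-time flow
C0 = conserved(W1, W2)
for _ in range(1000):
    Z = W1 @ X
    H = np.maximum(Z, 0.0)              # ReLU activations
    E = W2 @ H - Y                      # residual of the squared-error loss
    gW2 = E @ H.T                       # dL/dW2
    gW1 = ((W2.T @ E) * (Z > 0)) @ X.T  # dL/dW1, with the ReLU mask
    W2 -= eta * gW2
    W1 -= eta * gW1

drift = abs(conserved(W1, W2) - C0)
print(f"|C(T) - C(0)| = {drift:.2e}")  # small, because eta is small
```

At this step size the drift in C is orders of magnitude smaller than C itself; the next section is about what happens when the step size is not small.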
The key word, however, is flow. Real training uses discrete steps — gradient descent with a finite learning rate — and those discrete steps break the conservation laws. The paper's second major contribution is quantifying exactly how much they break.
The total drift in the conservation laws scales as η^α, where η is the learning rate and α is approximately 1.1–1.6 depending on architecture, loss function, and network width — a relationship derived from first principles and confirmed experimentally.
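One way to probe a power-law scaling like this is to measure the total drift at several learning rates η and fit a slope on log-log axes. The toy below does exactly that for a two-layer linear network over a fixed continuous-time horizon — simplifying assumptions the paper does not make, so the fitted exponent here need not land in the paper's 1.1–1.6 range; the point is the measurement procedure, not the value.

```python
import numpy as np

def total_drift(eta, T=0.2, seed=0):
    """Run GD for continuous time T (i.e. T/eta steps); return |C(T) - C(0)|."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (8, 5))   # deliberately imbalanced layer scales
    W2 = rng.normal(0.0, 0.5, (3, 8))
    X = rng.normal(size=(5, 20))
    Y = rng.normal(size=(3, 20))
    C0 = np.sum(W2**2) - np.sum(W1**2)
    for _ in range(round(T / eta)):
        E = W2 @ W1 @ X - Y
        gW2 = E @ (W1 @ X).T
        gW1 = W2.T @ E @ X.T
        W2 -= eta * gW2
        W1 -= eta * gW1
    return abs(np.sum(W2**2) - np.sum(W1**2) - C0)

etas = np.array([2e-4, 4e-4, 8e-4, 1.6e-3])
drifts = np.array([total_drift(e) for e in etas])
alpha = np.polyfit(np.log(etas), np.log(drifts), 1)[0]  # slope = exponent
print(f"fitted exponent alpha ~ {alpha:.2f}")
```

Halving η should roughly divide the drift by 2^α, which is what the log-log fit extracts.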
A Closed-Form Formula for How Training Goes Wrong
The authors decompose the drift precisely as η² × S(η), where η is the learning rate and S(η) is a quantity they call the gradient imbalance sum. Crucially, they derive a closed-form spectral formula for S(η), with mode coefficients c_k proportional to e_k(0)² × λ_{x,k}² — that is, the square of the initial layer imbalance multiplied by the square of the corresponding input data eigenvalue.
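To make the shape of that formula concrete, here is a schematic sketch, not the paper's derivation: the data eigenvalues λ_{x,k} come from the input covariance, the initial imbalances e_k(0) are made-up numbers, and the proportionality constant is dropped.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 200))                    # toy inputs: d=6, n=200
lam = np.linalg.eigvalsh(X @ X.T / X.shape[1])   # input eigenvalues lambda_{x,k}
lam = lam[::-1]                                  # sort descending

e0 = np.full_like(lam, 0.5)   # hypothetical initial layer imbalances e_k(0)
c = e0**2 * lam**2            # mode coefficients: c_k ~ e_k(0)^2 * lambda_{x,k}^2
print("per-mode coefficients:", np.round(c, 3))
```

With equal imbalances, the coefficients inherit the data spectrum squared, so the top data eigenvalue dominates the drift sum — the sense in which the geometry of the training data enters the theory.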
This formula was validated for both linear networks (R = 0.85) and ReLU networks (R > 0.80) across 23 experiments, giving it reasonable empirical grounding. These are self-reported correlation figures from the authors' own experimental suite, and independent replication would strengthen the claims.
The spectral approach means the theory connects the geometry of the training data (via its eigenvalues) directly to how conservation laws erode during training — a link that had not previously been made explicit.
Why Cross-Entropy Loss Behaves Differently
One of the more practically useful findings concerns cross-entropy loss, the standard objective for classification tasks. The paper shows that as training progresses, softmax probability outputs concentrate — meaning the model becomes more confident — and this drives exponential compression of the Hessian's eigenvalue spectrum. The timescale of that compression is τ = Θ(1/η), meaning it scales inversely with the learning rate and, notably, does not depend on training set size.
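The concentration mechanism is visible in the per-example Hessian of cross-entropy with respect to the logits, which for softmax output p is the standard diag(p) − p pᵀ: as p approaches a one-hot vector, every eigenvalue of that matrix shrinks toward zero. A minimal check of this piece of softmax algebra (not the paper's full Hessian analysis):

```python
import numpy as np

def logit_hessian(p):
    # Per-example Hessian of cross-entropy w.r.t. the logits: diag(p) - p p^T
    p = np.asarray(p, dtype=float)
    return np.diag(p) - np.outer(p, p)

uncertain = np.array([0.34, 0.33, 0.33])   # early training: diffuse softmax
confident = np.array([0.98, 0.01, 0.01])   # late training: concentrated softmax

eig_u = np.linalg.eigvalsh(logit_hessian(uncertain))
eig_c = np.linalg.eigvalsh(logit_hessian(confident))
print("max eigenvalue, uncertain:", eig_u.max())   # ~ 1/3
print("max eigenvalue, confident:", eig_c.max())   # ~ 0.03
```

As the model grows confident, the curvature seen through the softmax collapses by an order of magnitude, which is the compression the paper's τ = Θ(1/η) timescale governs.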
The practical implication: cross-entropy loss essentially self-regularises the conservation-law drift, pushing the exponent α toward 1.0 regardless of architecture. This may partly explain why cross-entropy is so robust across diverse settings — a property practitioners have long exploited without a clear theoretical reason.
Two Training Regimes and the Edge of Stability
The paper also identifies two distinct dynamical regimes separated by a width-dependent transition. In the first — a perturbative regime below what the authors call the Edge of Stability — the spectral formula applies cleanly and the theory makes accurate predictions. In the second, non-perturbative regime above that threshold, extensive coupling between spectral modes breaks the formula's assumptions.
The Edge of Stability is a concept that has attracted significant attention in the optimisation community over the past few years. It refers to the observation that learning rates in practice often sit near or above the threshold at which gradient descent should theoretically diverge, yet training continues. This paper offers a mechanistic account of what happens at and around that threshold, framing it as a phase transition in how conservation laws behave.
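The divergence threshold itself is textbook material for quadratic losses: gradient descent on f(w) = ½λw² multiplies w by (1 − ηλ) every step, so it is stable exactly when η < 2/λ. The Edge-of-Stability observation is that deep networks routinely hover near this boundary without blowing up. A quadratic sketch of the threshold (the classical picture the literature generalises, not this paper's non-perturbative analysis):

```python
def gd_on_quadratic(eta, lam=1.0, w0=1.0, steps=50):
    """Iterate w <- w - eta * lam * w, i.e. w <- (1 - eta*lam) * w."""
    w = w0
    for _ in range(steps):
        w -= eta * lam * w
    return abs(w)

below = gd_on_quadratic(eta=1.9)  # |1 - 1.9| = 0.9 < 1: contracts
above = gd_on_quadratic(eta=2.1)  # |1 - 2.1| = 1.1 > 1: diverges
print(f"eta=1.9: |w| = {below:.2e}   eta=2.1: |w| = {above:.2e}")
```

For a fixed quadratic the boundary at η = 2/λ is sharp; the puzzle in deep learning is that the effective curvature λ itself moves during training, which is where the paper's phase-transition framing enters.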
Network width plays a key role in where the transition sits, which connects the theory to the long-running debate about how overparameterisation — using far more parameters than training examples — affects optimisation.
Limits and What Comes Next
The paper analyses ReLU networks without bias terms, which simplifies the mathematics considerably. Most production networks use biases, and whether the conservation laws and their spectral decomposition generalise to those architectures is an open question. The authors acknowledge this and other scope limitations; the 23 experiments, while broad, cover controlled academic settings rather than the scale of frontier model training.
The spectral crossover formula also relies on quantities measured at initialisation (the initial layer imbalances e_k(0)), which means its predictive power depends on how stable those early measurements are across different training runs and initialisations.
Nonetheless, the framework is specific enough to generate testable predictions — a relative rarity in deep-learning theory, where post-hoc explanations are common but falsifiable forecasts are not.
What This Means
If the theory holds up to independent scrutiny, it gives researchers a principled way to predict how learning-rate choices and architecture decisions affect training stability — potentially reducing the costly trial-and-error that currently dominates model development.