A new approach to sparsifying transformer feed-forward layers activates fewer than 5% of network units per token while matching dense model performance, according to research published on arXiv.
The feed-forward block — technically a multi-layer perceptron, or MLP — consumes a disproportionately large slice of a transformer's total computation at typical context lengths. Reducing that cost without sacrificing quality has been a persistent challenge, and most existing approaches either require separate routing networks or struggle to scale reliably.
How Tree-Structured Routing Replaces Dense MLP Blocks
The researchers propose replacing standard MLP blocks with tree-structured feed-forward layers, where computation follows a branching hierarchy rather than passing through every unit. Each token is routed down a specific path through the tree, activating only the nodes along that branch — a form of conditional computation. Crucially, no separate router network is needed; the routing emerges directly from the layer's own structure.
This approach is designed as a drop-in replacement, meaning it can substitute for existing MLP blocks in deep transformer architectures without redesigning the surrounding model.
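The routing idea can be sketched in a few lines. The following is a minimal, illustrative numpy version — made-up shapes, a plain ReLU leaf block, and hypothetical names, not the paper's exact formulation: each internal node holds one hyperplane whose sign sends the token left or right, and only the single leaf block at the end of that path is evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)

class TreeFFN:
    """Sketch of a tree-structured feed-forward layer (illustrative only).

    A binary tree of depth `depth`: each internal node stores a single
    routing hyperplane, each leaf a small dense block. One token follows
    one root-to-leaf path, so routing falls out of the layer's own
    weights -- no separate router network.
    """

    def __init__(self, d_model, depth, leaf_width):
        self.depth = depth
        # one routing hyperplane per internal node (2^depth - 1 nodes)
        self.node_w = rng.standard_normal((2**depth - 1, d_model))
        # one tiny up/down projection pair per leaf (2^depth leaves)
        self.leaf_w1 = rng.standard_normal((2**depth, d_model, leaf_width)) * 0.1
        self.leaf_w2 = rng.standard_normal((2**depth, leaf_width, d_model)) * 0.1

    def forward(self, x):
        """Route one token vector x to a leaf; return (output, leaf index)."""
        node = 0
        for _ in range(self.depth):
            go_right = x @ self.node_w[node] > 0       # hard routing decision
            node = 2 * node + (2 if go_right else 1)   # heap-style child index
        leaf = node - (2**self.depth - 1)              # 0-based leaf index
        h = np.maximum(x @ self.leaf_w1[leaf], 0.0)    # ReLU leaf block
        return h @ self.leaf_w2[leaf], leaf

ffn = TreeFFN(d_model=64, depth=5, leaf_width=128)
x = rng.standard_normal(64)
y, leaf = ffn.forward(x)
# depth 5 gives 32 leaves, so exactly 1 of 32 leaf blocks runs per token
```

Note that with 32 leaves, a token touches roughly 3% of the leaf units — the kind of sub-5% activation the paper reports, though the real architecture and training procedure are more involved than this sketch.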
Despite activating fewer than 5% of the feed-forward block's units per token, the models match dense baselines under controlled training and fine-tuning protocols.
The team validates the method on autoregressive language modeling and downstream question answering tasks, including zero-shot and few-shot settings — conditions that more closely reflect real-world deployment than standard fine-tuning benchmarks alone.
Scalability Beyond 1 Billion Parameters
One of the paper's central claims is scale. Prior tree-structured sparsity methods had not been demonstrated beyond relatively modest model sizes. According to the researchers, this form of hierarchical conditional sparsity scales to models exceeding 1 billion parameters while remaining competitive with dense equivalents trained under the same conditions.
The benchmark comparisons are based on controlled training and fine-tuning protocols — the researchers are comparing against their own dense baselines rather than published third-party models, which is standard for this type of architectural ablation study. Readers should note the results are self-reported.
An Emergent Self-Pruning Effect
Beyond the headline efficiency gains, the research identifies an emergent phenomenon the authors call auto-pruning. As training progresses, the interaction between hard routing decisions and asymmetric nonlinear activation functions causes unused branches of the tree to progressively deactivate. Parts of the network that were initially dynamic — routing decisions made at inference time based on input — gradually convert into static structural sparsity, where certain paths are simply never used.
This is notable because it suggests the model is, in effect, pruning itself during training without any explicit pruning objective or auxiliary loss. The researchers describe this as a conversion from dynamic sparsity to structural sparsity over the course of training.
The effect, while potentially useful for further compression, can also lead to imbalanced trees — where some branches handle far more traffic than others, undermining the efficiency of the hierarchical design. The paper shows that simple architectural modifications can counteract this, recovering balanced routing without introducing additional loss terms.
Why This Matters for Inference Costs
The practical appeal of this work lies in what sparse activation means for inference. In a standard dense MLP block, every unit participates in every forward pass for every token. With tree-structured routing, 95% or more of units are bypassed per token, which in principle translates directly to reduced computation — and, with appropriate hardware support, reduced latency and energy consumption.
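A back-of-envelope comparison makes the activated-parameter argument concrete. The shapes below are illustrative conventions (a 4x hidden width, a depth-6 tree whose leaves collectively match the dense layer's width), not figures from the paper:

```python
d_model = 2048
d_ff = 4 * d_model                  # typical dense hidden width

dense_params = 2 * d_model * d_ff   # up-projection + down-projection

depth = 6                           # 2^6 = 64 leaves
leaf_width = d_ff // 2**depth       # leaves jointly match the dense width
routing = depth * d_model           # one hyperplane per level on the path
leaf = 2 * d_model * leaf_width     # the single leaf block actually run
active_params = routing + leaf

print(active_params / dense_params)  # ~0.016: under 5% of the dense layer
```

Under these assumptions a token touches roughly 1.6% of the feed-forward parameters — but, as the article notes below, activated-parameter counts only become wall-clock savings with hardware and kernels that can exploit the sparsity.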
The question of hardware support is not trivial. Dynamic sparsity is notoriously difficult to exploit on GPUs optimised for dense matrix operations. The researchers do not claim wall-clock speed improvements in the abstract; the efficiency argument is framed primarily in terms of activated parameters rather than measured throughput. Whether the theoretical compute savings translate to real-world speed gains depends heavily on implementation and hardware — an important caveat for practitioners.
The absence of a separate router network is also a meaningful design choice. Mixture-of-experts architectures, which achieve sparsity through learned gating, require additional parameters and training stability measures for the router itself. Tree-structured routing sidesteps this by embedding routing logic directly in the layer hierarchy, reducing architectural complexity.
What This Means
For teams building or deploying large language models, this research offers a credible, architecturally simple route to substantially lower per-token compute costs — one that, according to the authors, holds up at billion-parameter scale and requires no auxiliary routing machinery.