A new method called Focus, published on arXiv, enables large language models to selectively attend to relevant parts of their input rather than processing every token pair, delivering up to an 8.6x wall-clock speedup at one million tokens while improving accuracy over standard full attention.

Attention mechanisms sit at the heart of modern language models, but they come with a steep cost: computation scales quadratically with sequence length, so doubling the input roughly quadruples the compute. Existing approaches to this problem, such as sparse attention and linear approximations, typically trade accuracy for speed. Focus takes a different path by learning which token pairs actually matter, rather than approximating all of them.
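The quadratic scaling above is easy to see concretely: full attention forms an n-by-n score matrix, so the entry count grows with the square of the sequence length. A minimal illustration (arithmetic only, not from the paper):

```python
# Full attention computes a score for every token pair, i.e. an n x n matrix.
def attention_score_entries(n_tokens: int) -> int:
    return n_tokens * n_tokens

# Doubling the context quadruples the number of score entries.
ratio = attention_score_entries(2048) // attention_score_entries(1024)
```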

How Focus Learns What to Ignore

The method introduces learnable centroids — small trainable parameters that cluster tokens into groups. Distant attention (between tokens far apart in a sequence) is restricted to pairs within the same group, while local attention operates at full resolution for nearby tokens. The result is a structured sparsity pattern that reflects the model's own learned sense of what belongs together.
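The resulting sparsity pattern can be sketched as a boolean mask: nearby tokens always see each other, while distant tokens see each other only if they land in the same centroid group. This is a minimal sketch under assumed names (`centroids`, `local_window`), not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_groups, local_window = 16, 8, 4, 2

hidden = rng.normal(size=(n_tokens, d_model))     # frozen model activations
centroids = rng.normal(size=(n_groups, d_model))  # the only trained parameters

# Hard assignment: each token joins its highest-affinity centroid.
groups = (hidden @ centroids.T).argmax(axis=1)

# Nearby pairs attend at full resolution; distant pairs only within a group.
idx = np.arange(n_tokens)
local = np.abs(idx[:, None] - idx[None, :]) <= local_window
same_group = groups[:, None] == groups[None, :]
mask = local | same_group  # True = attention allowed
```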

Critically, Focus leaves all existing model weights frozen. Only the centroid parameters — as few as 148,000 parameters — are trained. This additive design means it can be retrofitted onto already-trained models without degrading their capabilities on downstream tasks, according to the researchers.
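The retrofit recipe described here is simple to express in a framework like PyTorch: freeze every pretrained weight and hand the optimizer only the centroid tensor. The layer choice and centroid shape below are illustrative assumptions, not the paper's code:

```python
import torch

# Freeze all pretrained weights; only the centroids receive gradients.
layer = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
for p in layer.parameters():
    p.requires_grad = False  # pretrained weights stay frozen

centroids = torch.nn.Parameter(torch.randn(16, 64))  # the only trainable params
optimizer = torch.optim.AdamW([centroids], lr=1e-3)

trainable = centroids.numel()
frozen = sum(p.numel() for p in layer.parameters())
```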

The contrast with LoRA, a widely used parameter-efficient fine-tuning method, is pointed. The paper reports that instruction-tuned models retain their TruthfulQA scores after Focus adaptation, while LoRA degrades performance at every learning rate and rank tested. For teams deploying aligned models in production, that distinction carries real weight.

Performance Numbers Across Scale

The benchmark results — which are self-reported by the researchers and have not yet undergone peer review — show consistent gains across a wide range of model sizes. At 124 million parameters, Focus surpasses full attention with a perplexity of 30.3 versus 31.4. Lower perplexity indicates better language modeling performance.
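For readers less familiar with the metric: perplexity is the exponential of the mean per-token cross-entropy loss, so the 30.3-versus-31.4 gap maps back to a small difference in average loss. Illustrative arithmetic, not the paper's evaluation code:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
def perplexity(mean_nll: float) -> float:
    return math.exp(mean_nll)

# Invert the reported perplexities to compare average per-token loss.
nll_focus, nll_full = math.log(30.3), math.log(31.4)
loss_gap = nll_full - nll_focus  # roughly 0.036 nats per token
```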

When trained from scratch at 7 billion parameters on 2 billion tokens, Focus again edges out full attention: 13.82 versus 13.89 perplexity. The improvements are modest in absolute terms but consistent, and they come alongside — not in exchange for — the efficiency gains.

At inference time, the method discretizes its soft routing into a hard sparsity pattern by restricting each token to its top-k highest-scoring groups. This yields a 2x speedup in the basic configuration. Decomposing the resulting pattern into two standard FlashAttention calls — without any custom GPU kernels — reaches the headline 8.6x speedup at one million tokens.
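The top-k discretization step can be sketched as follows: keep each token's k highest-scoring groups and allow distant attention only between tokens that share a routed group. The `scores` here are random stand-ins for token-to-centroid affinities, and the shapes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, n_groups, k = 8, 4, 2

scores = rng.normal(size=(n_tokens, n_groups))  # soft routing scores

# Keep each token's top-k groups (hard sparsity pattern).
topk = np.argpartition(scores, -k, axis=1)[:, -k:]
route = np.zeros((n_tokens, n_groups), dtype=bool)
route[np.arange(n_tokens)[:, None], topk] = True

# Distant pairs are allowed only if they share at least one routed group.
allowed = (route.astype(int) @ route.astype(int).T) > 0
```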

Five Architectures, One Retrofit Approach

The researchers tested Focus across five different attention architectures, finding improvements in each case. The claim that no existing efficient attention method achieves this in the retrofit setting — improving perplexity without downstream degradation — is a strong one, though independent replication on production workloads will ultimately determine whether it holds broadly.

Sinkhorn normalization, a mathematical technique borrowed from optimal transport theory, enforces that groups remain balanced in size — preventing the model from collapsing all tokens into a single cluster. A notable side effect: the resulting groups appear to capture interpretable linguistic categories without any explicit supervision. The paper does not elaborate extensively on this, but it suggests the centroids are learning something structurally meaningful about language rather than arbitrary partitions.
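Sinkhorn normalization itself is a short iterative procedure: alternately rescale the rows (tokens) and columns (groups) of the assignment matrix so that every group ends up with roughly equal total mass. A minimal sketch of the general technique, not the paper's specific implementation:

```python
import numpy as np

def sinkhorn(logits, n_iters=20):
    # Alternate row and column normalization of a positive matrix.
    p = np.exp(logits - logits.max())
    for _ in range(n_iters):
        p /= p.sum(axis=1, keepdims=True)  # each token's assignments sum to 1
        p /= p.sum(axis=0, keepdims=True)  # each group gets equal total mass
    return p

rng = np.random.default_rng(2)
assign = sinkhorn(rng.normal(size=(12, 3)))  # 12 tokens, 3 groups
```

Balancing the columns is exactly what prevents the collapse described above: no single group can absorb all of the assignment mass.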

The efficiency gains are achieved entirely through standard operations, meaning practitioners can implement Focus without writing custom CUDA kernels or waiting for specialized hardware support. That lowers the barrier to adoption considerably.

What This Means

For teams running large language models at scale, Focus offers a method that reduces inference costs by nearly an order of magnitude at long contexts while preserving — and in some cases improving — model accuracy, with no modifications to existing weights or alignment properties.