Researchers have proposed CSAttention (Centroid-Scoring Attention), a training-free method that speeds up large language model inference by up to 4.6 times on 128,000-token contexts while maintaining near-identical accuracy to full attention, according to a paper posted to arXiv in April 2025.

The work targets a specific and growing bottleneck in production AI systems: long, reusable prompt prefixes. Agents and domain-specific question-answering tools increasingly rely on extended context windows that stay fixed across many user queries — think a customer-support bot that always starts with the same 50,000-token knowledge base. In these scenarios, the attention mechanism and the key-value (KV) cache that supports it consume both compute time and memory bandwidth during each decoding step.

Why Existing Sparse Attention Falls Short

Sparse attention — the broad technique of computing attention only over a selected subset of tokens rather than the full context — is an established strategy for reducing this cost. The problem, the authors argue, is that most existing methods struggle to maintain accuracy when sparsity is pushed very high, because the statistical distribution of query vectors at decode time differs from the key vectors stored during prefill. At 95% sparsity, meaning only 5% of tokens are attended to at each step, this mismatch causes accuracy to degrade noticeably in prior approaches.

CSAttention front-loads computation into a one-time offline prefill phase that can be amortised across multiple queries, while aggressively optimising per-step decoding latency.

CSAttention addresses this by rethinking when and where work gets done. Rather than trying to select relevant tokens dynamically during each decoding step — which is both expensive and error-prone — the system pre-computes structured lookup tables during an offline prefill phase. This phase runs once per reusable context and its cost is spread across every subsequent query that uses that context, making the per-query overhead negligible in high-throughput settings.

How the Lookup Table Mechanism Works

The core idea is what the authors call a "query-centric lookup table." During offline prefill, the method clusters the key vectors in the KV cache using centroid representations — essentially computing compact summaries of groups of tokens. These summaries are stored in a table whose size stays fixed regardless of how long decoding runs.
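The offline clustering step described above can be pictured with a short sketch. This is an illustrative k-means over cached key vectors, not the paper's actual algorithm: the function name, the choice of k-means, and the table layout are all assumptions made here for clarity.

```python
import numpy as np

def build_centroid_table(keys: np.ndarray, n_clusters: int, n_iters: int = 10):
    """Hypothetical offline-prefill step: cluster cached key vectors and
    keep one centroid per cluster as a compact summary.

    keys: (n_tokens, head_dim) array of cached key vectors.
    Returns (centroids, assign): the fixed-size summary table and each
    token's cluster id.
    """
    rng = np.random.default_rng(0)
    # Initialise centroids from randomly chosen keys.
    centroids = keys[rng.choice(len(keys), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each key to its nearest centroid (squared Euclidean distance).
        dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned keys.
        for c in range(n_clusters):
            members = keys[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Note: the table's size depends only on n_clusters and head_dim,
    # not on context length -- matching the fixed-size property above.
    return centroids, assign
```

The fixed-size property falls out directly: however long the context, the table holds only `n_clusters` vectors plus one cluster id per token.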

At decode time, instead of scanning every token in a potentially 128,000-token context, the model consults the lookup table to identify which clusters are most relevant to the current query, then accumulates attention scores only over those clusters. The operation is designed to be GPU-friendly, taking advantage of how modern accelerators handle structured memory access more efficiently than irregular sparse reads.
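A decode step in this style can be sketched as follows. Again this is a minimal single-head illustration under assumed names and a simple top-cluster selection rule; the paper's GPU kernel and scoring details are not reproduced here.

```python
import numpy as np

def sparse_attend(query, keys, values, centroids, assign, top_clusters=2):
    """Hypothetical decode step: rank clusters by centroid-query score,
    then run softmax attention only over tokens in the top clusters.

    query: (head_dim,); keys/values: (n_tokens, head_dim);
    centroids: (n_clusters, head_dim); assign: (n_tokens,) cluster ids.
    """
    # Consult the lookup table: one dot product per cluster, not per token.
    cluster_scores = centroids @ query
    chosen = np.argsort(cluster_scores)[-top_clusters:]
    # Gather only the tokens belonging to the selected clusters.
    mask = np.isin(assign, chosen)
    sel_k, sel_v = keys[mask], values[mask]
    # Standard scaled-dot-product attention over the surviving subset.
    logits = sel_k @ query / np.sqrt(query.shape[-1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ sel_v
```

The key cost shift is visible in the code: the per-token scan is replaced by a per-cluster scan plus a gather over a small, contiguous-by-cluster subset, which is the structured access pattern the authors describe as GPU-friendly.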

The authors describe this as a "storage-for-computation" trade-off: the system accepts a modest increase in memory used during prefill in exchange for faster decode steps.
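The scale of that trade-off can be gauged with back-of-envelope arithmetic. The numbers below (fp16 keys, head dimension 128, one summary per 20 tokens to mirror 95% sparsity) are assumptions chosen for illustration, not figures from the paper.

```python
# Illustrative memory accounting for one attention head, assumed numbers.
ctx, head_dim, bytes_fp16 = 128_000, 128, 2

# Key storage already in the KV cache for this head:
kv_key_bytes = ctx * head_dim * bytes_fp16        # 32,768,000 bytes (~32.8 MB)

# Extra storage for the lookup table at 95% sparsity (5% as many centroids):
n_clusters = ctx // 20                            # 6,400 centroids
table_bytes = n_clusters * head_dim * bytes_fp16  # 1,638,400 bytes (~1.6 MB)

# The table adds a small fraction on top of the existing cache:
overhead = table_bytes / kv_key_bytes             # 0.05, i.e. 5%
```

Under these assumptions the extra prefill-time storage is on the order of a few percent of the key cache, which is consistent with the authors' characterisation of the memory increase as modest.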

Benchmark Results and What They Show

According to the paper, CSAttention was evaluated across long-context settings ranging from 32,000 to 128,000 tokens. At 128K tokens and 95% sparsity, it achieved a 4.6x inference speedup over the most accurate competing sparse attention baseline. The authors report that accuracy remained near-identical to full attention across these tests.

It is important to note that these benchmarks are self-reported by the research team and, as is typical for arXiv preprints, have not yet undergone independent peer review. The comparison baseline is described as "the most accurate" sparse attention method, which means the speedup figure reflects a best-case scenario against a strong but specific competitor, not necessarily every alternative approach.

The method is also explicitly optimised for the offline-prefill/online-decode deployment pattern. Its advantages are most pronounced when the same long context is reused many times, allowing the one-time prefill cost to be amortised effectively. Use cases with highly dynamic or single-use contexts would see less benefit.

Implications for AI Deployment

The practical relevance of this work lies in enterprise and agentic AI deployment. As organisations build systems where LLMs continuously reference large, stable knowledge bases — legal document repositories, technical manuals, product catalogues — the cost of repeatedly attending over those documents at each generation step becomes a significant operational expense.

CSAttention's training-free design is a practical advantage. Methods that require fine-tuning or architectural changes to a base model impose integration costs and can complicate model versioning; a plug-in inference optimisation that requires no retraining is substantially easier to adopt.

The approach also fits naturally into serving infrastructure that already separates prefill and decode into distinct computational phases — a pattern that has become common in large-scale LLM deployment platforms.

What This Means

For teams running LLMs over long, reusable contexts, CSAttention offers a potential reduction in inference latency and cost with no model retraining required — assuming the self-reported accuracy and speed gains hold up under independent evaluation.