A new plug-and-play framework called SepSeq (Separate Sequence) improves large language model performance on long numerical sequences without requiring any additional training, according to a preprint posted to arXiv (cs.CL). Tested across nine widely-adopted LLMs, the method delivers an average relative accuracy improvement of 35.6% while reducing inference token consumption by 16.4%.
Large language models have expanded their context windows dramatically in recent years, with some supporting hundreds of thousands of tokens. Yet longer contexts have not automatically translated into better performance on numerical data — financial time series, sensor readings, scientific datasets — where models frequently struggle despite having sufficient theoretical capacity.
Why LLMs Struggle With Numbers in Long Contexts
The researchers attribute the failure to a property of the Softmax attention mechanism that sits at the heart of most transformer models. When a model processes a long sequence, Softmax distributes attention scores across every token. The longer the sequence, the thinner that attention is spread — a phenomenon the authors call attention dispersion. For numerical sequences, where precise local relationships matter enormously, dispersed attention causes the model to lose track of the values it needs to reason about.
This is a structural problem, not a training problem. The model has not learned the wrong thing; the mechanism itself works against concentration when sequences grow long.
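The dispersion effect can be seen in a toy calculation. With roughly uniform attention scores, softmax assigns each of n tokens a weight of about 1/n, so the share of attention any single number receives shrinks as the sequence grows. This is a minimal sketch of that arithmetic, not code from the paper:

```python
import math

def softmax(scores):
    """Convert raw attention scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Uniform scores stand in for a model with no strong local preference:
# the weight on any one token falls as 1/length, so a ten-fold longer
# sequence means a ten-fold thinner slice of attention per value.
for length in (10, 100, 1000):
    weights = softmax([1.0] * length)
    print(length, max(weights))
```

Real attention scores are not uniform, but the same pressure applies: unless some tokens attract sharply higher scores, the normalisation built into softmax spreads attention thinner as context grows.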
Separator tokens act as an attention sink, recalibrating attention to focus on local segments while preserving global context.
What SepSeq Actually Does
SepSeq addresses the problem by inserting separator tokens at strategic intervals within the numerical input. These tokens are not novel to the vocabulary — they are existing tokens repurposed for a mechanical role. When placed between segments of a numerical sequence, they function as what the authors call attention sinks: points that absorb and redirect the model's attention, effectively partitioning the long sequence into manageable local chunks.
Critically, the framework does not discard global context. The separator tokens reset local attention while still allowing the model to maintain awareness of the broader sequence structure. The result is that each numerical segment receives concentrated, meaningful attention rather than a diluted share of a model's capacity spread thin across thousands of tokens.
Because SepSeq requires no retraining or fine-tuning, it can be applied directly to existing models at inference time. The authors describe it as plug-and-play, meaning it slots into a standard inference pipeline without architectural modification.
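As an illustration of what such input-side preprocessing could look like, the sketch below breaks a numerical sequence into fixed-size segments delimited by a separator token. The segment length, the choice of `;` as separator, and the function name are all assumptions for illustration; the paper's actual separator tokens and placement strategy may differ:

```python
def insert_separators(values, segment_len=8, sep=";"):
    """Hypothetical SepSeq-style preprocessing: partition a numerical
    sequence into fixed-size segments and join them with a separator
    token, so each segment can attract concentrated local attention.
    Settings here are illustrative, not the paper's reported ones."""
    parts = []
    for i in range(0, len(values), segment_len):
        segment = values[i:i + segment_len]
        parts.append(" ".join(str(v) for v in segment))
    return f" {sep} ".join(parts)

# Example: hourly sensor readings grouped four to a segment.
readings = [21.4, 21.6, 21.9, 22.3, 22.1, 21.8, 21.5, 21.2,
            20.9, 20.7, 20.8, 21.0]
prompt_body = insert_separators(readings, segment_len=4)
# "21.4 21.6 21.9 22.3 ; 22.1 21.8 21.5 21.2 ; 20.9 20.7 20.8 21.0"
```

Because the change is confined to the prompt string, a wrapper like this can sit in front of any model's inference call, which is what makes the approach plug-and-play.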
Performance Across Nine Models
The researchers evaluated SepSeq on nine LLMs, across tasks spanning diverse domains and sequence types. All benchmark results are self-reported by the paper's authors and have not yet undergone independent peer review, as the work appears as a preprint on arXiv.
Across those evaluations, the framework produced an average relative accuracy improvement of 35.6%. Alongside the accuracy gains, inference token consumption fell by an average of 16.4%. The reduction in token count is a secondary but notable benefit: fewer tokens at inference means lower computational cost, which matters at the scale at which commercial LLMs operate.
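Note that the headline figure is a relative improvement, not a gain in absolute accuracy points. With illustrative numbers (not taken from the paper), the distinction works out as follows:

```python
def relative_improvement(baseline, improved):
    """Relative gain: (new - old) / old."""
    return (improved - baseline) / baseline

# Illustrative figures only: a 35.6% relative gain over a baseline
# scoring 50% accuracy lands at roughly 67.8% absolute accuracy,
# an absolute gain of about 17.8 points.
baseline = 0.50
improved = baseline * (1 + 0.356)
print(round(improved, 3))  # 0.678
```

The averaged relative figure also means the per-model gains can vary; the preprint's per-benchmark tables are the place to check individual results.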
The combination — better accuracy and lower cost — is unusual. Accuracy improvements in AI research typically come with compute trade-offs. SepSeq's efficiency gain stems from the same mechanism driving its accuracy improvement: by organising input more effectively, the model needs fewer tokens overall to process equivalent information.
Where This Matters Most
Numerical sequence processing is a bottleneck for LLM deployment in several high-stakes domains. Financial institutions use LLMs to analyse time-series data, audit logs, and transactional records. Scientists feed sensor outputs and experimental readings into models for pattern detection. Healthcare applications involve patient monitoring data and lab result sequences. In all these cases, degraded performance on long numerical inputs is not a theoretical concern — it directly limits what models can reliably do.
Existing workarounds typically involve chunking sequences manually, training specialised models on numerical data, or accepting degraded accuracy. SepSeq offers a lower-friction alternative: modify the input format, not the model.
The framework also raises a broader point about attention mechanisms. The attention sink behaviour the authors describe — where certain tokens attract disproportionate attention and thereby organise the model's focus — has been observed in other contexts in the research literature. SepSeq exploits this property deliberately rather than treating it as an artefact to be engineered away.
What This Means
For organisations deploying LLMs on structured numerical data, SepSeq offers an immediately applicable technique to recover significant accuracy at no additional training cost — a rare combination that warrants close attention as the preprint moves toward peer review.