Researchers have published a method they call Knowledge Packs that delivers factual context to language models without consuming a single input token, by injecting pre-computed key-value cache entries directly into the model's attention mechanism — potentially replacing retrieval-augmented generation for many use cases.

Retrieval-augmented generation, or RAG, has become the standard approach for grounding language model responses in specific documents or databases. It works by appending retrieved text to the user's prompt, which the model then reads as part of its context window. The problem: that retrieved text costs tokens — sometimes thousands of them — which drives up latency and compute expense at scale.

How Knowledge Packs Exploit the Causal Mask

The new approach, detailed in a preprint posted to arXiv in April 2025, exploits a mathematical property of causal transformer architectures. Because these models use a causal attention mask — meaning each token only attends to tokens that came before it — the key-value cache generated by running a forward pass on a document F is identical to what would be produced during a joint pass on F followed by a query. The query tokens never affect the document's cached representations.

This means a KV cache computed once from a document can be stored, reused, and injected at inference time without re-encoding the document, and without placing the document text in the visible context window at all.
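The equivalence argument can be checked numerically with a toy single-head attention layer. Everything below (dimensions, random weights, function names) is an illustrative stand-in, not a detail from the paper:

```python
# Toy single-head causal attention, illustrating the prefix-equivalence
# property the paper relies on. All shapes and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d = 16                      # model/head dimension
n_doc, n_query = 6, 3       # document and query lengths

# Random token embeddings and projection matrices.
X_doc = rng.normal(size=(n_doc, d))
X_query = rng.normal(size=(n_query, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def causal_attention(X):
    """Return (K, V, outputs) for a causally masked attention pass."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    n = len(X)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # future positions
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return K, V, weights @ V

# Pass 1: document alone. Pass 2: document followed by the query.
K_doc, V_doc, out_doc = causal_attention(X_doc)
K_full, V_full, out_full = causal_attention(np.vstack([X_doc, X_query]))

# The document's keys, values, and attention outputs are unaffected by
# the query tokens appended after it, so the cache can be precomputed.
assert np.allclose(K_doc, K_full[:n_doc])
assert np.allclose(V_doc, V_full[:n_doc])
assert np.allclose(out_doc, out_full[:n_doc])
```

Keys and values are per-token projections, so their equality is immediate; the causal mask is what makes the document positions' attention outputs identical too, since no document position ever attends to a query token.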


The authors tested the approach on Qwen3-8B and Llama-3.1-8B across 700 questions, reporting zero divergences in output compared to standard in-context document inclusion — meaning the model answers identically whether the knowledge is injected via cache or placed directly in the prompt. Token savings reached up to 95%, according to the paper.

The Formatting Flaw That Explains Earlier Contradictions

The authors make a pointed claim about prior research. Earlier work had reported that KV cache injection sometimes outperformed RAG rather than simply matching it — a result that should be theoretically impossible if the method is truly equivalent. The researchers argue this anomaly was caused by incorrect chat template formatting during KV construction, which introduced a 6 to 7 percentage point degradation in RAG performance in those experiments, making cache injection look artificially superior.

With correct formatting, the authors report the equivalence is exact. This is a meaningful methodological clarification: it suggests a body of prior comparisons may have been measuring a formatting artifact rather than a genuine capability difference.
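The pitfall is easy to see schematically. The toy template and whitespace tokenizer below are invented for illustration and do not correspond to any real model's chat format; the point is only that a cache built from raw document text is positionally misaligned with the templated prompt it gets injected into:

```python
# Hypothetical sketch of the formatting pitfall: a KV cache built from raw
# document text does not line up position-for-position with a prompt that
# wraps the same document in a chat template. Toy template, toy tokenizer.
TEMPLATE = "<|user|>\n{doc}\n<|assistant|>\n"

def toy_tokenize(text):
    # Whitespace "tokenizer", purely for illustration.
    return text.split()

doc = "The capital of France is Paris ."
raw_tokens = toy_tokenize(doc)
templated_tokens = toy_tokenize(TEMPLATE.format(doc=doc))

# The cached entry for position i was computed from raw_tokens[i], but at
# inference time position i holds templated_tokens[i], a mismatch starting
# at the very first token.
aligned = templated_tokens[: len(raw_tokens)] == raw_tokens
print(aligned)  # → False
```

Real templates shift every document token by the length of the template preamble, which is consistent with the multi-point degradation the authors attribute to mis-formatted cache construction.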

Behavioural Steering as a Second Channel

Beyond token savings, the paper introduces a separate capability enabled by the KV interface that has no direct RAG equivalent: behavioural steering through value-space arithmetic.

The authors exploit the fact that RoPE (Rotary Position Embedding), the positional encoding scheme used in both tested models, rotates key vectors but leaves value vectors unchanged. By computing contrastive deltas — differences between cached values for two contrasting inputs — and injecting these into the value stream, they demonstrate a shift in model behaviour in a controlled direction.
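A minimal sketch of the delta construction, using synthetic stand-ins for the cached tensors. The shapes, variable names, and scale value are assumptions for illustration, not the paper's implementation:

```python
# Sketch of value-space steering as described: compute a contrastive delta
# between cached values for two contrasting inputs, then add the scaled
# delta to the value cache of the actual input. All tensors are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_head = 8, 32

# Hypothetical cached value vectors for two contrasting prompts
# (e.g. polite vs. curt phrasings of the same content).
V_pos = rng.normal(size=(n_tokens, d_head))
V_neg = rng.normal(size=(n_tokens, d_head))

# Contrastive delta: the value-space direction separating the behaviours.
delta = V_pos - V_neg

# Steering: nudge the values cached for the actual input along that
# direction. Keys are left untouched, so RoPE's position-dependent key
# rotations are unaffected; only the content mixed into attention shifts.
alpha = 0.7
V_cache = rng.normal(size=(n_tokens, d_head))
V_steered = V_cache + alpha * delta
```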

The steering effect concentrates in mid-layer values, specifically layers spanning roughly 33% to 66% of model depth. The authors report that independent steering directions are nearly orthogonal (cosine similarity approximately 0), meaning multiple steering signals compose without cancelling each other out. Both the knowledge injection channel and the steering channel can operate simultaneously at a scaling factor of α ≤ 0.7 without measurable interference.
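The composition claim can be illustrated with random high-dimensional directions, which serve here as stand-ins for independently extracted deltas (the dimension and names are illustrative; random directions in high-dimensional space are themselves close to orthogonal, which is what lets them add without cancelling):

```python
# Near-orthogonal directions compose: projecting a combined injection back
# onto each constituent direction recovers that direction largely intact.
import numpy as np

rng = np.random.default_rng(2)
d = 4096  # hidden size of an 8B-class model

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

delta_a = rng.normal(size=d)   # stand-in for, e.g., a "formality" delta
delta_b = rng.normal(size=d)   # stand-in for, e.g., a "verbosity" delta
print(round(cosine(delta_a, delta_b), 3))  # close to 0: near-orthogonal

# A combined injection nudges both behaviours at once; the overlap of the
# composite with each direction stays high (about 1/sqrt(2) ≈ 0.71 when the
# two deltas have similar norms and near-zero overlap).
combined = delta_a + delta_b
recovered_a = cosine(combined, delta_a)
```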

No model training or weight modification is required for any of this.

What the Approach Does Not Yet Address

The preprint is frank about its scope. The equivalence guarantee holds for causal transformers with standard attention, but the authors note it is fragile in practice: chat template formatting errors alone caused multi-percentage-point drops in their own testing. This suggests careful engineering would be required before deployment.

The work also does not address retrieval — Knowledge Packs assume you already know which document cache to inject. In production RAG systems, deciding what to retrieve remains the harder problem. Knowledge Packs would sit downstream of that retrieval step, not replace it entirely for systems that need dynamic document selection.

The behavioural steering results, while intriguing, are reported without extensive red-teaming or safety analysis. The ability to inject hidden behavioural nudges via cached values — with no visible trace in the context window — will likely attract scrutiny from alignment researchers.

What This Means

If the results hold under broader testing, Knowledge Packs offer a practical path to lower inference costs for knowledge-intensive applications, while opening a new and largely unexplored interface for model behaviour control that operates entirely outside the token stream.