Cloudflare published a technical post on April 16, 2026 describing infrastructure changes to its Workers AI platform for hosting extra-large open-source language models, saying it has made Moonshot AI's Kimi K2.5 3x faster since launch and has additional model integrations in progress. Source: https://blog.cloudflare.com/high-performance-llms/
The post, written by Cloudflare's Workers AI team, attributes the speed gains to a shift toward prefill-decode (PD) disaggregated architecture, prompt caching tied to a session-affinity header, and KV-cache sharing across GPUs using Moonshot AI's Mooncake Transfer Engine.
What Cloudflare Says It Changed
According to the post, Cloudflare splits LLM inference into two stages — prefill, which processes input tokens and populates the KV cache, and decode, which generates output tokens — and now runs them on separate inference servers. The company writes that prefill is "compute bound" while decode is "memory bound," and that running both on a single machine leaves GPU capacity underused.
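The two stages the post describes can be pictured with a toy sketch. This is purely conceptual; the function names and cache layout below are illustrative assumptions, not Cloudflare's implementation:

```python
# Toy illustration of prefill/decode disaggregation: prefill processes
# all input tokens at once and populates the KV cache; decode consumes
# and extends that cache one output token at a time.

def prefill(input_tokens):
    """Compute-bound stage: process the whole prompt, populate the KV cache."""
    kv_cache = [("kv", t) for t in input_tokens]  # stand-in for attention tensors
    return kv_cache

def decode(kv_cache, max_new_tokens):
    """Memory-bound stage: generate output tokens step by step,
    reading the (growing) KV cache at every step."""
    output = []
    for step in range(max_new_tokens):
        token = f"tok{step}"           # stand-in for sampling from the model
        kv_cache.append(("kv", token))  # each new token extends the cache
        output.append(token)
    return output

# In the disaggregated setup Cloudflare describes, prefill() and decode()
# run on separate server pools, with the KV cache moved between them.
cache = prefill(["why", "is", "the", "sky", "blue"])
print(decode(cache, 3))  # ['tok0', 'tok1', 'tok2']
```

Because the two stages stress different resources, running each on hardware sized for its bottleneck is what the post credits for the utilization gains.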
Cloudflare says it built a load balancer that routes requests between pools of prefill and decode endpoints, rewrites streaming responses to carry KV-cache transfer metadata between them, and performs what the company calls "token-aware load balancing," estimating the in-flight prefill and decode tokens on each endpoint to inform routing decisions.
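Token-aware routing can be sketched as choosing the endpoint with the fewest estimated in-flight tokens rather than the fewest open requests. The data structures and policy below are assumptions for illustration, not Cloudflare's load balancer:

```python
# Minimal sketch of "token-aware load balancing": a request carrying a
# large prompt counts for much more load than a small one, so routing
# on token estimates spreads work more evenly than counting requests.
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    inflight_tokens: int = 0  # estimated prefill/decode tokens in flight

def route(endpoints, request_tokens):
    """Send the request to the least-token-loaded endpoint and update
    its in-flight estimate."""
    target = min(endpoints, key=lambda e: e.inflight_tokens)
    target.inflight_tokens += request_tokens
    return target

pool = [Endpoint("prefill-a"), Endpoint("prefill-b")]
route(pool, 8000)  # large prompt lands on prefill-a
route(pool, 500)   # small prompt now routes to prefill-b, not a
print([e.inflight_tokens for e in pool])  # [8000, 500]
```

A request-count balancer would have sent both requests to alternating endpoints regardless of prompt size, leaving one endpoint with 16x the work of the other.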
The company reports that after shifting traffic to the disaggregated architecture, p90 time-per-token dropped from "~100 ms with high variance to 20-30 ms," which it describes as "a 3x improvement in intertoken latency." Cloudflare states the improvement was measured using the same number of GPUs while request volume increased. DeepBrief has not independently verified the benchmark.
Targeting Agent Workloads
Cloudflare frames the tuning decisions around agentic use cases, where input contexts grow as system prompts, tool definitions, MCP integrations, and prior turns accumulate. "For Workers AI, that means we had to focus on two things: fast input token processing and fast tool calling," the company writes.
"A small difference in prompt caching from our users can sum to a factor of additional GPUs needed to run a model," the post states.
The post describes an x-session-affinity HTTP header that clients can send to route requests back to the region holding previously computed input tensors. Cloudflare says it has submitted integrations adding the header to agent harnesses including OpenCode, linking to a pull request on the anomalyco/opencode repository.
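From the client side, adopting the header amounts to sending a stable value for the lifetime of a session. The header name comes from Cloudflare's post; the key-derivation scheme below is an illustrative assumption:

```python
# Sketch of a client attaching the x-session-affinity header so that
# repeat requests for the same session are routed back to the region
# holding the previously computed input tensors.
import hashlib

def affinity_headers(session_id: str) -> dict:
    """Derive a stable, opaque affinity value from a session identifier.
    (The derivation is illustrative; any value that stays constant per
    session would serve the same purpose.)"""
    key = hashlib.sha256(session_id.encode()).hexdigest()[:16]
    return {"x-session-affinity": key}

headers = affinity_headers("agent-session-42")
print(headers)
# Reusing the same session_id yields the same header value, so the
# router can pin the session's requests to one region.
assert affinity_headers("agent-session-42") == headers
```

The key property is determinism per session: each turn of an agent conversation repeats the header, so the accumulated context from earlier turns stays cached where it was computed.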
Cloudflare reports that after working with its "heaviest internal users" to adopt the header, input token cache hit ratios rose "from 60% to 80% during peak times." The company says it offers discounted pricing on cached tokens to encourage adoption and points developers to its prompt caching documentation.
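Some back-of-envelope arithmetic shows why a 20-point hit-ratio change matters, under the simplifying assumption that cached input tokens cost roughly nothing to serve while uncached tokens require full prefill compute:

```python
# Prefill compute scales with *uncached* input tokens. Moving the cache
# hit ratio from 60% to 80% shrinks the uncached share from 40% to 20%,
# i.e. roughly half the prefill GPU work per token served, assuming
# cached tokens are ~free. (Simplified model, not Cloudflare's figures
# beyond the two hit ratios reported in the post.)
def uncached_share(hit_ratio):
    return 1.0 - hit_ratio

before = uncached_share(0.60)  # 0.40 of input tokens need prefill
after = uncached_share(0.80)   # 0.20 of input tokens need prefill
print(f"prefill work ratio: {after / before:.2f}")  # 0.50
```

Under this simplified model, the hit-ratio improvement Cloudflare reports would roughly halve the prefill capacity needed for the same traffic, which is consistent with the post's framing that caching differences "sum to a factor of additional GPUs."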
KV Cache Across Multiple GPUs
For models large enough that one instance spans multiple GPUs, Cloudflare says it needed a mechanism for KV cache — the stored input tensors from prefill — to move between GPU VRAM across machines. The company writes that it adopted Mooncake Transfer Engine and Mooncake Store, open-source components from Moonshot AI available on GitHub under the kvcache-ai organization.
Cloudflare describes Mooncake's Transfer Engine as a framework that works with Remote Direct Memory Access (RDMA) protocols including NVLink and NVMe over Fabric, enabling "direct memory-to-memory data transfer without involving the CPU." The company says this matters for multi-GPU and multi-node deployments where KV cache cannot fit within the VRAM of a single accelerator.
Context From Earlier Announcements
The April 16 post builds on an earlier Cloudflare blog entry announcing Kimi K2.5 availability on Workers AI, which the company links from the new post. In that earlier entry, Cloudflare introduced the session-affinity header and described its approach to running large open-source models on edge hardware.
The company frames its infrastructure approach as "squeezing every bit of efficiency out of our hardware through clever software engineering," and says hardware configuration choices depend on whether workloads are input-heavy — such as summarization — or output-heavy, such as long-form generation.
DeepBrief reached out to independent infrastructure analysts and developers building on Workers AI for outside perspective on Cloudflare's performance claims and the tradeoffs of edge-distributed inference versus centralized GPU hosting. No outside comment was available at publication time.
All performance figures in this article — including the 3x latency improvement, the shift from ~100 ms to 20-30 ms p90 time-per-token, and the 60% to 80% cache hit ratio change — are reported by Cloudflare and have not been independently benchmarked by DeepBrief. Cloudflare has not disclosed in the post the specific GPU models, quantization settings, or request mixes used to produce the figures shown in its charts.
Sources
Primary source: Cloudflare Blog, "Building the foundation for running extra-large language models," April 16, 2026 — https://blog.cloudflare.com/high-performance-llms/
Referenced by the primary source: Cloudflare's earlier post on large models in Workers AI (https://blog.cloudflare.com/workers-ai-large-models/), Workers AI prompt caching documentation (https://developers.cloudflare.com/workers-ai/features/prompt-caching/), and the Mooncake project (https://github.com/kvcache-ai/Mooncake).
