Cloudflare published a blog post on April 16, 2026 describing what it calls a unified inference layer built on AI Gateway and Workers AI, giving developers a single API to call models from multiple providers, along with a preview of upcoming support for customer-supplied models. The company says the update targets developers building AI agents, where chained model calls amplify the cost of latency and provider outages.
One binding, multiple providers
According to the post, developers using Cloudflare Workers can now call third-party models through the same AI.run() binding previously used for Workers AI models. Cloudflare states that switching from a Cloudflare-hosted model to one from OpenAI, Anthropic, or another provider is a one-line change, and provides a code sample calling anthropic/claude-opus-4-6 through the binding with a default gateway configuration.
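The call shape Cloudflare describes can be sketched as follows. This is an illustrative sketch, not Cloudflare's code: the mock `env.AI` object below stands in for a real Worker's AI binding so the snippet runs anywhere, and the model identifiers are the ones named in the post.

```javascript
// Mock of the AI.run() binding so this sketch runs outside a Worker;
// in a real Worker, env.AI is provided by the platform.
const env = {
  AI: {
    // Records which model was requested and echoes a stub reply.
    async run(model, input, options = {}) {
      return { model, options, reply: `stub reply to: ${input.messages[0].content}` };
    },
  },
};

async function ask(model, prompt) {
  // Per the post, switching providers is a one-line change: swap a
  // Workers AI model like "@cf/moonshotai/kimi-k2.5" for a third-party
  // one like "anthropic/claude-opus-4-6" in this first argument.
  return env.AI.run(model, { messages: [{ role: "user", content: prompt }] });
}

ask("anthropic/claude-opus-4-6", "Hello").then((r) => console.log(r.model));
```

The point of the binding, as Cloudflare frames it, is that the surrounding code does not change when the model identifier does.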
For developers not using Workers, Cloudflare said in the post that REST API support is part of its stated roadmap and will be released "in the coming weeks." DeepBrief notes this is Cloudflare's announced timeline rather than a confirmed ship date.
Cloudflare says the catalog now spans more than 70 models from over a dozen providers, accessible through one API and one set of credits. The company lists expanded model access from Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, and Vidu, and says the catalog now includes image, video, and speech models in addition to text.
Centralized spend tracking
Cloudflare argues that consolidating provider calls behind one endpoint also consolidates cost visibility. The post cites a third-party survey from aidbintel.com stating that companies today call "an average of 3.5 models" across multiple providers. DeepBrief has not independently verified that survey figure; it is presented here as Cloudflare's citation of external data.
"With AI Gateway, you'll get one centralized place to monitor and manage AI spend."
That sentence is Cloudflare's own phrasing from the announcement. The company says developers can attach custom metadata to each request — for example team IDs or user IDs — to break down spend by customer segment, workflow, or plan tier. A code example in the post shows metadata fields passed alongside a call to the model identifier @cf/moonshotai/kimi-k2.5, which Cloudflare lists as a Workers AI-hosted model.
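A sketch of what per-request metadata might look like in practice. The option shape (`gateway.metadata`) and the `teamId`/`userId` keys are assumptions for illustration, mirroring the post's description of breaking down spend by segment; the mock binding simply records what it receives.

```javascript
// Mock of the AI binding: a real gateway would log this metadata
// alongside cost and latency data for each request.
const env = {
  AI: {
    async run(model, input, options = {}) {
      return { model, metadata: (options.gateway && options.gateway.metadata) || {} };
    },
  },
};

async function askWithMetadata(prompt, meta) {
  return env.AI.run(
    "@cf/moonshotai/kimi-k2.5", // Workers AI-hosted model named in the post
    { messages: [{ role: "user", content: prompt }] },
    // Hypothetical option shape: attach custom metadata to the request
    // so spend can later be sliced by team, user, or plan tier.
    { gateway: { id: "default", metadata: meta } }
  );
}

askWithMetadata("Summarize this ticket", { teamId: "billing", userId: "u_42" })
  .then((r) => console.log(r.metadata));
```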
Bring your own model via Cog
Cloudflare said in the post that it is working on letting customers bring their own fine-tuned or custom models to Workers AI. According to the announcement, the majority of Cloudflare's current AI traffic comes from dedicated Enterprise instances running custom models, and the company wants to extend that capability to a broader set of customers.
The mechanism relies on Replicate's Cog technology, an open-source tool for containerizing machine learning models. Cloudflare's post includes example cog.yaml and predict.py files showing how dependencies and inference code are declared. The company states that after running cog build, developers will be able to push the container to Workers AI, where Cloudflare will deploy and serve the model behind the standard Workers AI APIs.
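For readers unfamiliar with Cog, a minimal cog.yaml has roughly this shape. The package versions below are illustrative, not taken from Cloudflare's post; the file declares the build environment and the entry point that cog build containerizes.

```yaml
# Minimal cog.yaml sketch (versions illustrative): build environment
# plus the predictor class Cog should invoke for inference.
build:
  gpu: true
  python_version: "3.11"
  python_packages:
    - "torch==2.4.0"
predict: "predict.py:Predictor"
```

The predict.py referenced on the last line holds the inference code itself, which is the second file Cloudflare's post shows.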
Cloudflare states in the post that additional pieces are still in development, including customer-facing APIs, wrangler commands for pushing containers, and faster cold starts via GPU snapshotting. The company says it has been testing the feature internally and with a set of external design partners, and is inviting additional partners to reach out. DeepBrief notes these capabilities are described as in-progress rather than generally available.
Latency framing for agents
Cloudflare frames the announcement around agent workloads specifically. The post argues that when an agent chains ten inference calls to complete a task, a provider that adds 50ms per call costs 500ms across the chain, and a single failed request can cascade into downstream failures. The company says recent updates to its platform include a refreshed dashboard, zero-setup default gateways, automatic retries on upstream failures, and more granular logging controls.
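Cloudflare does not publish its retry implementation; the sketch below is a generic standalone helper showing the usual shape of such a feature, retrying a flaky upstream call with exponential backoff.

```javascript
// Generic retry-with-backoff helper (not Cloudflare's code): retries a
// failing async call up to `attempts` times, doubling the delay each time.
async function withRetries(call, { attempts = 3, baseDelayMs = 100 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      // Exponential backoff between attempts: 100ms, 200ms, 400ms, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}

// Usage: a simulated upstream that fails twice, then succeeds.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 3) throw new Error("upstream 502");
  return "ok";
};

withRetries(flaky).then((result) => console.log(result, "after", calls, "calls"));
```

In an agent chaining many calls, absorbing transient upstream failures this way is what prevents one 502 from cascading into a failed task.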
On latency, Cloudflare states that its network spans data centers in 330 cities, which it says positions AI Gateway close to both end users and inference endpoints to minimize network time before token streaming begins. The company argues that time to first token is the user-perceived speed metric for live agents, and that shaving 50ms off that figure changes whether an agent "feels zippy" — Cloudflare's phrasing — versus sluggish.
The post lists Kimi K2.5 (identified in code samples as @cf/moonshotai/kimi-k2.5) among the open-source models hosted on the Workers AI public catalog, which Cloudflare describes as including large models for agents.