Cloudflare Enters the Big Model Game: Workers AI Now Runs Kimi K2.5 With 256K Context

Cloudflare adds its first frontier-scale model to Workers AI, claiming 77% cost savings over proprietary alternatives and introducing new caching features.


Cloudflare has quietly entered the frontier AI inference market. Workers AI now hosts Kimi K2.5, Moonshot AI’s open-source model with a full 256K context window, vision inputs, multi-turn tool calling, and structured outputs. This is the first large-scale model on Cloudflare’s inference platform, and the pricing suggests they’re serious about competing with dedicated inference providers.

The Numbers

Kimi K2.5 on Workers AI costs $0.60 per million input tokens and $3.00 per million output tokens. Cached tokens—a new feature Cloudflare is surfacing for the first time—cost just $0.10 per million, a 6x discount that makes multi-turn agent conversations significantly cheaper.
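To see what the cached-token discount means in practice, here is a back-of-the-envelope calculation using the listed prices. The token counts are invented for illustration; only the per-million rates come from Cloudflare's pricing.

```python
# Workers AI list prices for Kimi K2.5, per million tokens (from the article).
INPUT_PER_M = 0.60
CACHED_PER_M = 0.10
OUTPUT_PER_M = 3.00

def conversation_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for one request, splitting input into cached and uncached tokens."""
    uncached = input_tokens - cached_tokens
    return (uncached * INPUT_PER_M
            + cached_tokens * CACHED_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# Hypothetical 10-turn agent loop that re-sends a growing prefix totaling
# 200K input tokens. With good cache hits, most of the prefix is billed at
# the cached rate instead of the full input rate.
no_cache = conversation_cost(200_000, 0, 10_000)          # prefix never cached
with_cache = conversation_cost(200_000, 180_000, 10_000)  # 90% of input cached
print(f"${no_cache:.3f} vs ${with_cache:.3f}")
```

The output tokens dominate either way on short replies, but on input-heavy agent loops the cached rate cuts the bill substantially, which is exactly the workload Cloudflare is targeting.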

Cloudflare claims 77% cost savings compared to mid-tier proprietary models for their internal security review agent, which processes over 7 billion tokens daily. That translates to roughly $2.4 million in annual savings on a single workload.
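Those figures are easy to sanity-check. Seven billion tokens a day for a year, set against $2.4 million saved, implies savings of a bit under a dollar per million tokens. This is a rough check on the arithmetic, not Cloudflare's own accounting:

```python
# Figures as stated by Cloudflare.
daily_tokens = 7_000_000_000
annual_savings = 2_400_000

annual_tokens_millions = daily_tokens * 365 / 1_000_000
savings_per_million = annual_savings / annual_tokens_millions
print(f"~${savings_per_million:.2f} saved per million tokens")  # ~$0.94
```

That magnitude is consistent with the 77% claim if the proprietary baseline blends to roughly $1.20 per million tokens, which is plausible for a heavily cached, input-dominated security workload.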

Why Kimi K2.5

Moonshot AI’s model isn’t a random choice. Kimi K2.5 benchmarks competitively with frontier models while remaining open-source and optimized for agentic workloads. The 256K context window handles long documents and extended conversations. Multi-turn tool calling enables complex agent workflows. Vision inputs mean you can pipe screenshots and documents directly into prompts.

Cloudflare integrated the model into their own OpenCode development environment and automated code review pipeline (internally called “Bonk”). According to their engineering blog, the model “has proven to be a fast, efficient alternative to larger proprietary models without sacrificing quality”—which is the kind of claim worth verifying in your own use case.

New Infrastructure Features

Two technical additions make this launch more interesting than just “another model”:

Prefix Caching with Session Affinity: Developers can now use an x-session-affinity header with unique session identifiers to route requests to the same model instance. This improves cache hit rates and reduces time to first token for multi-turn conversations. Cloudflare surfaces cached tokens as a separate usage metric with discounted pricing.
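Using the header is straightforward: pick one identifier per conversation and send it on every turn. A minimal stdlib-only sketch against the Workers AI REST endpoint follows; the account ID and API token are placeholders, and the model slug is illustrative, so check the Workers AI model catalog for the real one.

```python
import json
import urllib.request
import uuid

# Placeholders: substitute your own account ID and API token. The model slug
# below is an assumption, not verified against the Workers AI catalog.
ACCOUNT_ID = "YOUR_ACCOUNT_ID"
API_TOKEN = "YOUR_API_TOKEN"
MODEL = "@cf/moonshotai/kimi-k2.5"

def build_request(messages: list, session_id: str) -> urllib.request.Request:
    """Build a Workers AI REST call pinned to one model instance via session affinity."""
    url = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}"
    return urllib.request.Request(
        url,
        data=json.dumps({"messages": messages}).encode(),
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
            # Reuse the same ID for every turn of a conversation so requests
            # route to the same instance and hit the prefix cache.
            "x-session-affinity": session_id,
        },
        method="POST",
    )

session = str(uuid.uuid4())  # one ID per conversation, not per request
req = build_request([{"role": "user", "content": "Summarize this diff."}], session)
# urllib.request.urlopen(req) would send it; omitted here.
```

The key design point is that the session ID lives for the whole conversation: rotating it per request would defeat the cache and forfeit the discounted token rate.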

Redesigned Async API: A pull-based system replaces the previous push-based architecture, eliminating out-of-capacity errors for asynchronous workloads. Cloudflare claims internal testing shows async requests typically execute within 5 minutes.
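A pull-based design shifts the retry loop to the client: you submit a job, then poll until a result is ready, rather than having the platform push (and potentially fail) under load. The sketch below shows the generic polling pattern with the status check injected as a function; the actual async endpoints are not spelled out here, so consult the Workers AI docs for the real request shapes.

```python
import time
from typing import Callable, Optional

def poll_until_done(check: Callable[[], Optional[dict]],
                    interval_s: float = 5.0,
                    timeout_s: float = 600.0) -> dict:
    """Call `check` until it returns a result dict, sleeping between attempts.

    `check` should return None while the job is still queued or running.
    Cloudflare reports async requests typically finishing within 5 minutes,
    so a 10-minute default timeout leaves headroom.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = check()
        if result is not None:
            return result
        time.sleep(interval_s)
    raise TimeoutError("async job did not finish in time")

# Usage sketch: in practice `check` would GET the job's status endpoint and
# return the response body once the job reports completion.
```

Because the client controls the cadence, a burst of submissions queues instead of erroring, which is the "no out-of-capacity errors" property Cloudflare is advertising.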

How to Use It

Kimi K2.5 works through the Workers AI binding (env.AI.run()), the REST API, AI Gateway, or the OpenAI-compatible endpoint. If you’re already using Workers, it’s a configuration change. If you’re not, you’ll need to buy into Cloudflare’s platform.
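For code already written against the OpenAI API shape, the switch is mostly a base-URL change. The stdlib sketch below builds a chat-completions request against Workers AI's compatibility endpoint; the URL pattern follows Cloudflare's documented scheme, but verify it and the model slug against current docs, and note that an OpenAI SDK client pointed at the same base URL replaces the hand-built request entirely.

```python
import json
import urllib.request

# Placeholders: substitute your own account ID and API token.
ACCOUNT_ID = "YOUR_ACCOUNT_ID"
API_TOKEN = "YOUR_API_TOKEN"

def chat_completions_request(model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-style chat completions call against Workers AI's
    compatibility endpoint (URL pattern per Cloudflare's docs; verify slug)."""
    base = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/v1"
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# The model slug here is illustrative; check the Workers AI catalog.
req = chat_completions_request("@cf/moonshotai/kimi-k2.5",
                               [{"role": "user", "content": "Hello"}])
# urllib.request.urlopen(req) would send it; omitted here.
```

With an OpenAI SDK, only `base_url` and `api_key` change, which is what makes migrating existing agent code a configuration exercise rather than a rewrite.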

The Agents SDK starter kit automatically implements session affinity, and integration with OpenCode provides direct model access.

What This Means

Cloudflare’s entry into frontier model inference signals that the “run AI at the edge” story is getting real. Until now, Workers AI was limited to smaller models—useful for embeddings and simple classification, not for replacing your OpenAI or Anthropic API calls.

Now you can build complete agent pipelines on Cloudflare’s infrastructure: Kimi K2.5 for reasoning, smaller models for classification and embeddings, Workers for orchestration, Durable Objects for state, and R2 for storage. Whether that’s better than dedicated inference providers depends on your latency requirements and existing infrastructure.

The pricing is competitive but not exceptional. At $0.60/$3.00 per million tokens, Kimi K2.5 through Cloudflare is roughly comparable to Moonshot’s direct API pricing. The value proposition is infrastructure consolidation and the 6x cached token discount, not raw cost savings.

The Privacy Angle

Running inference through Cloudflare means your prompts traverse their network. For some organizations, that’s a non-issue—you’re probably already routing traffic through Cloudflare. For others, it’s a data handling question worth answering before you connect sensitive workloads.

Cloudflare’s data processing addendum covers Workers AI, but read it yourself if prompt data sensitivity matters to your use case.

What You Can Do

If you’re already on Cloudflare: Try Kimi K2.5 for tasks currently hitting external APIs. The session affinity feature alone could reduce costs on multi-turn agent workflows.

If you’re evaluating inference providers: Add Workers AI to your comparison. The all-in-one-platform story has appeal if you’re trying to minimize vendor complexity.

If you’re running local models: This doesn’t change anything. Kimi K2.5 is open-source—you can run it yourself with Ollama or vLLM if you have the hardware. Cloudflare is just another hosting option.

The bigger picture: inference is becoming a commodity. Cloudflare, with its global edge network and developer platform, is betting that convenience and consolidation matter more than specialized AI infrastructure. Whether that bet pays off depends on how many developers want to run large models where their Workers already live.