What does this writing signal mean?

CoreWeave Writing: Kimi K2.7 Code Now Available on Serverless Inference with Leading Benchmark Price-Performance

Captured source

wf.coreweave.com/wf.coreweave.com/blog/kimi-k2-7-code-now-available-on-serverless-inference-with-leading-benchmark-price-performance

Kimi K2.7 Code Now Available on Serverless Inference with Leading Benchmark Price-Performance

Source ↗

published Jun 22, 2026seen 1wcaptured 1whttp 200method plain

CoreWeave Leads on Artificial Analysis for Kimi K2.7 Code Inference | CoreWeave Blog

Announcement

Webinar

Podcast

GTC 2026

CoreWeave to Join Nasdaq-100 Index. Read the press release

Products

Data and storage

Infrastructure control

Runtime acceleration

Model and agent development

Mission control

Solutions

Pricing

Resources

About us

Clear

Kimi K2.7 Code, the latest coding model from Moonshot AI, is now available on CoreWeave Serverless Inference . In addition, CoreWeave is the leading AI inference provider in the most attractive quadrant of Artificial Analysis's Speed vs. Price chart for Kimi K2.7 Code , delivering the highest output speed at a low blended price. This is the second Kimi model in a row that our team has optimized and put live on CoreWeave Inference with improved performance. Artificial Analysis measures the two metrics that matter most to production teams: output speed in tokens per second and price per million tokens. For price, a 7:2:1 blend of cache-hit, input, and output costs reflects real workload distribution more accurately than input or output price alone. Among all providers serving Kimi K2.7 Code, the latest coding model from Moonshot AI, CoreWeave delivers the highest token throughput with the lowest latency, delivering the best price-performance. 1

Speed vs. Price for Kimi K2.7 Code across inference providers (source: Artificial Analysis ). This result comes from metal-to-model optimization and collaboration between our Applied Training and Inference teams. In a previous blog , we covered the Kimi K2.6 optimization that earned CoreWeave the same validation on Artificial Analysis. In this post we go into the engineering work delivered for Kimi K2.7 Code that earned the team this validation for the second consecutive time and what that means for your inference workloads. Kimi K2.7 Code is Moonshot AI’s latest coding model Kimi K2.7 Code is Moonshot AI’s latest coding-focused agentic model, built on the same trillion-parameter MoE (Mixture-of-Experts) architecture as Kimi K2.6, with 32 billion active parameters per token, a 256K-token context window, and open weights released under a Modified MIT license. 2 The model is tuned for long-horizon, end-to-end coding tasks and agentic tool use.The most production-relevant change is in efficiency: Moonshot AI reports Kimi K2.7 Code uses about 30% fewer reasoning tokens than K2.6 on the same work, 3 which cuts cost and latency in agent loops that call the model repeatedly. Optimizing Kimi K2.7 Code on Blackwell: NVFP4 quantization and a DFlash speculator CoreWeave’s Applied Training team uses NVIDIA GB300 NVL72 and GB200 NVL72 clusters to optimize the latest AI models for production inference, with tooling that includes NVIDIA’s Model-Optimizer for post-training quantization. Every experiment is backed by accuracy and quality evaluations in Weights & Biases and modern test-harness frameworks. Kimi K2.7 Code shipped pre-quantized in INT4 (4-bit integer). Because INT4 is an integer format and Blackwell accelerates FP4 (4-bit floating point), including NVFP4 (NVIDIA 4-bit floating point), the released weights need quantization before they can fully benefit from Blackwell’s FP4 execution. The CoreWeave Applied Training team did exactly that, pairing a custom NVFP4 quantization with a DFlash speculative decoder. Getting to this level of performance on a 1T parameter scale model required a multi-step training and evaluation process. First, we started by mirroring the original weights from Moonshot AI to a CoreWeave AI Object Storage bucket with Local Object Transport Acceleration enabled so that we could horizontally scale out access to the weights without hitting any rate limits. To get a baseline for the model, we measured the INT4 weights with a comprehensive list of evals like Terminal-Bench 2.0 , AIME2025 , SWE-Bench Pro and SWE-Bench Multilingual , powered by the Harbor framework with the Terminus-2 agent with multiple trials each. Then we used the NVIDIA Model-Optimizer project with a custom 3-pass calibration for converting the INT4 weights into NVFP4 weights. This included a short context window pass over general conversation data and long context window passes over math and coding datasets to help align to Kimi K2.7 Code’s own focus on coding work. All evals were executed in CoreWeave’s very own Sandbox platform and evaluated with large amounts of concurrency to speed up the iteration loop. Once we were confident in the accuracy of the quantized checkpoint, we moved on to training a DFlash speculative decoding model . This involved generating a new dataset using the NVFP4 checkpoint, where we used over 1 million different prompts, once again spanning various long context agentic tasks involving tool calling and other agentic coding work. We generate these new responses based on the NVFP4 checkpoint to guarantee that our speculator model learns the distributions of the NVFP4 checkpoint specifically, to maximize acceptance length in the most diverse set of cases. In order to facilitate the training itself, we used the TorchSpec project from the LightSeek Foundation . We used a modified loss function with the DFlash architecture called D-PACE which came from a combination of Harvard and MIT researchers which showed us an approximate 8% improvement in acceptance length across all categories as measured by the SPEED-Bench AL dataset . We have since contributed D-PACE support back to TorchSpec upstream. To measure the acceptance length of our speculator model, we made heavy use of AIPerf from the NVIDIA Dynamo team which already included direct support for SPEED-Bench, but we also contributed support for running the most commonly reported AL evals according to the modern literature to make it easy to compare our results to those published online. The draft model training occurred on multiple GB300 NVL72 racks managed by the CoreWeave Kubernetes Service and scaled out over a RoCE fabric for the fastest training speeds such that we can deliver this fast and high-quality Kimi K2.7 Code deployment to our customers. The production K2.7 Code endpoint on CoreWeave Inference runs this NVFP4 + DFlash configuration by default, allowing customers to benefit from the throughput and cost improvements, with no additional setup on their side. Inference performance goes beyond model-level optimizations In production inference, performance and cost are tightly coupled,...

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable code model release from Moonshot/Kimi on CoreWeave.