CoreWeave Trains DeepSeek-V3 Benchmark in Two Minutes
Captured source
source ↗CoreWeave Sets MLPerf® Training v6.0 Records | CoreWeave Blog
CoreWeave ran the DeepSeek-V3 671B MLPerf® Training v6.0 benchmark in 2.02 minutes with 8,192 NVIDIA Blackwell Ultra GPUs connected with the NVIDIA Spectrum-X Ethernet networking platform, the largest cluster of its kind ever benchmarked. In the latest MLPerf Training v6.0 round, CoreWeave set record-breaking results and delivered consistently high performance across cluster sizes ranging from 64 to 8,192 GPUs using NVIDIA HGX™ B200 and NVIDIA GB300 NVL72. The benchmark was run on the same production infrastructure CoreWeave customers rely on every day, not a benchmark-only side cluster or specially tuned environment.

That matters because real AI performance is not created by hardware alone. It comes from how every layer of the platform works together, from GPU infrastructure and high-speed networking to orchestration, storage, observability, and expert operations.

CoreWeave trains DeepSeek-V3 671B in two minutes

No benchmark hits every layer of an AI cloud simultaneously like DeepSeek-V3: dense matrix throughput, MoE routing efficiency, communication primitives, fault tolerance across multi-thousand-GPU clusters, and topology-aware sharding across NVLink, NVL72, and scale-out fabrics. It stresses all of it at once, and if your platform has a weak point, this workload finds it.

CoreWeave submitted three GB300 NVL72 configurations on DeepSeek-V3 671B
Compute Platform      Nodes    GPUs     Time-to-train    Precision
NVIDIA GB300 NVL72    2,048    8,192    2.02 min         MXFP8
NVIDIA GB300 NVL72    1,024    4,096    3.09 min         MXFP8
NVIDIA GB300 NVL72    512      2,048    5.54 min         MXFP8
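
As a rough illustration of what the three rows above imply, the sketch below computes speedup and scaling efficiency relative to the 2,048-GPU run. The function names are hypothetical and the formula (observed speedup divided by ideal linear speedup) is a common convention, not a metric published with the results. MLPerf time-to-train also reflects convergence behavior, which can change with cluster size, so treat the output as an approximation rather than a pure throughput measurement.

```python
# Time-to-train figures copied from the DeepSeek-V3 671B table: (GPUs, minutes).
RESULTS = [(2048, 5.54), (4096, 3.09), (8192, 2.02)]


def scaling_efficiency(base_gpus, base_minutes, gpus, minutes):
    """Observed speedup divided by ideal (linear) speedup, relative to a baseline."""
    observed_speedup = base_minutes / minutes
    ideal_speedup = gpus / base_gpus
    return observed_speedup / ideal_speedup


def scaling_report(results):
    """Return (gpus, speedup, efficiency) tuples relative to the smallest run."""
    base_gpus, base_minutes = results[0]
    report = []
    for gpus, minutes in results:
        speedup = base_minutes / minutes
        efficiency = scaling_efficiency(base_gpus, base_minutes, gpus, minutes)
        report.append((gpus, speedup, efficiency))
    return report


if __name__ == "__main__":
    for gpus, speedup, efficiency in scaling_report(RESULTS):
        print(f"{gpus:>5} GPUs: {speedup:.2f}x speedup, {efficiency:.0%} of linear")
```

With the published figures, this works out to roughly 90% of linear scaling at 4,096 GPUs and roughly 69% at 8,192 GPUs, measured against the 2,048-GPU submission.
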
CoreWeave scaled an 8,192-GPU GB300 NVL72 cluster connected with Spectrum-X Ethernet networking to achieve a time-to-train of 2.02 minutes, the fastest DeepSeek-V3 671B training result to date and the #1 spot across all Closed/Available-cloud submissions. What makes this result remarkable isn't just the absolute wall-clock time; it's how our infrastructure sustained efficient performance at scale. CoreWeave was the only submitter in the v6.0 round to successfully scale a GB300 NVL72 platform beyond 2,048 GPUs on the DeepSeek-V3 benchmark. From there, we doubled the cluster size to 4,096 GPUs, and then doubled it again to 8,192 GPUs, all while maintaining strong scaling efficiency. In practice, that means customers using this popular open-source model can train faster and shorten time-to-market for their AI application or agent.

Efficient performance and scale for thousands of GPUs

CoreWeave benchmarked Llama-3.1-405B on 4,096 Blackwell Ultra GPUs and reached the reference quality target in 9.77 minutes, 2.8x faster than our own result on the same benchmark in MLPerf® Training v5.0. The result is a direct reflection of CoreWeave's full-stack engineering philosophy. The performance gain didn't come from adding more GPUs; it came from software optimizations made at every layer of the stack, from NVLink-domain-aware scheduling in CoreWeave Kubernetes Service (CKS) and topology-aware workload placement in SUNK, to deep networking optimizations that keep thousands of GPUs in tight synchronization throughout a training run.

CoreWeave's performance with Llama-3.1-405B using 4,096 Blackwell Ultra GPUs
Compute Platform      Nodes    GPUs     Time-to-train
NVIDIA GB300 NVL72    1,024    4,096    9.77 min
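
To put the 9.77-minute figure and the stated 2.8x improvement in context, here is a minimal sketch of the arithmetic. The helper names are hypothetical. The baseline time it derives is only implied by the rounded 2.8x factor and is not a number reported in this post.

```python
# Figures taken from the text and table above.
LLAMA_405B_GPUS = 4096
LLAMA_405B_MINUTES = 9.77
REPORTED_SPEEDUP = 2.8  # rounded factor quoted against the v5.0 result


def implied_baseline_minutes(minutes, speedup):
    """Earlier time-to-train implied by a 'speedup x faster' claim."""
    return minutes * speedup


def gpu_minutes(gpus, minutes):
    """Total accelerator time consumed by one run, in GPU-minutes."""
    return gpus * minutes


if __name__ == "__main__":
    baseline = implied_baseline_minutes(LLAMA_405B_MINUTES, REPORTED_SPEEDUP)
    cost = gpu_minutes(LLAMA_405B_GPUS, LLAMA_405B_MINUTES)
    print(f"Implied v5.0 time-to-train: about {baseline:.1f} min")
    print(f"GPU-minutes for the v6.0 run: {cost:,.0f}")
```

Because 2.8x is a rounded value, the implied earlier result is only approximate, somewhere around 27 minutes. GPU-minutes is included as one simple way to compare runs of different sizes on cost rather than wall-clock time alone.
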
The run was built on NVIDIA NeMo Framework Release 26.04 and used full CUDA Graphs to minimize CPU scheduling overhead and maximize GPU utilization throughout training. Tensor, pipeline, and context parallelism were carefully tuned to align with the GB300 NVL72 architecture. At the network layer, NVIDIA Spectrum-X Ethernet running RoCE provided the scale-out fabric, delivering the bandwidth, advanced congestion control, and low-latency communication required to maintain high efficiency during distributed training.

This GB300 NVL72 deployment achieved near-parity with larger NVIDIA GB200 NVL72 configurations while using 20% fewer GPUs. Delivering comparable results with materially less hardware underscores a principle CoreWeave has built its platform around: raw compute capacity matters, but system-wide optimization is what determines real-world performance and economics. For customers training the industry's largest models at scale, that distinction translates directly into lower cost per training run, faster iteration cycles, and more efficient use of infrastructure investment.

Consistent scaling efficiency for 64-GPU clusters

To demonstrate that our software and infrastructure optimizations deliver results at every scale, we also submitted GPT-OSS-20B and Llama 3.1 8B benchmark results on a compact 64-GPU NVIDIA HGX B200 cluster connected via NVIDIA Quantum-2 InfiniBand. This configuration is accessible to a much broader range of customers than frontier-scale deployments.

The results speak for themselves. GPT-OSS-20B reached the reference quality target in 26.98 minutes, while Llama 3.1 8B completed training in 16.54 minutes, 9.7% faster than the next submitter with the same setup. These numbers matter because of what they reveal about where the performance is coming from.
Through targeted enhancements in orchestration, communication libraries, networking, scheduling, and distributed training configuration, we extracted performance from the Blackwell platform that rivals larger or newer-generation deployments. This isn't about having access to more hardware; it's about making every GPU count.

Under the hood: How the CoreWeave platform holds efficiency at 8,192 GPUs

Building and operating a cluster of 8,192 GPUs is a different problem from running a few hundred. At this scale, performance depends on whether compute, networking, storage, scheduling, and orchestration work together as one connected platform. For MoE and dense workloads, efficient scaling requires coordinated optimizations across workload placement, fleet health, network topology, observability, and orchestration. CoreWeave Mission Control, SUNK, and CKS each play key roles in making that...
Excerpt shown — open the source for the full document.