CoreWeave, published May 20, 2026

CoreWeave Delivers Breakthrough AI Performance with NVIDIA GB200 and H200 GPUs in MLPerf Inference v5.0


CoreWeave First CSP to Submit NVIDIA Grace Blackwell Results, Achieves Top-Tier Performance in MLPerf Inference v5.0 with NVIDIA GB200 Grace Blackwell Superchips and H200 GPUs

CoreWeave is proud to be the first cloud service provider (CSP) to submit MLPerf Inference v5.0 results for NVIDIA GB200 Grace Blackwell instances, achieving an impressive 800 tokens per second (TPS) on the Llama 3.1 405B model, a 2.86X per-chip performance increase over NVIDIA H200 GPUs. CoreWeave’s NVIDIA H200 GPU instances also delivered 33,000 TPS on the Llama 2 70B model, marking a 40% improvement in throughput compared to NVIDIA H100 instances. This milestone highlights our commitment to providing customers with early access to the latest NVIDIA GPUs and delivering industry-leading performance.

Continuing Our Track Record of MLPerf Milestones

MLPerf Inference is an industry-standard suite that measures machine learning performance across realistic deployment scenarios. The speed at which systems process inputs and generate outputs using a trained model significantly impacts performance and, thus, user experience, making the MLPerf Inference benchmark a critical metric for both CoreWeave and our customers. Historically, CoreWeave has set record-breaking MLPerf results, including our 2023 submission, which showed 29x faster training performance than the next best competitor.

In this year’s submitted MLPerf results, CoreWeave’s NVIDIA GB200 Superchip and CoreWeave’s NVIDIA H200 GPU instances showed impressive performance, summarized in the table below:

Table 1: CoreWeave leads MLPerf benchmarks for NVIDIA H200⁷ and NVIDIA Blackwell GPUs⁸

NVIDIA GB200 NVL72: CoreWeave Sets a New Benchmark for Industry-Leading Performance

The latest MLPerf release was the first to include the Llama 3.1 405B model, one of the largest open-source models. We achieved over 800 tokens per second in a GB200 instance featuring two NVIDIA Grace™ CPUs coupled with four Blackwell GPUs. While some of the performance improvement can be attributed to the lower precision (FP4) of the NVIDIA Blackwell MLPerf Inference benchmark, we see a 2.86X speedup on a per-chip basis. We reached a normalized per-chip throughput of over 200 TPS compared to 70 TPS for NVIDIA’s H200 MLPerf Inference v5.0 submission.
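The per-chip normalization behind the 2.86X figure can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses only the rounded numbers quoted above (about 800 TPS across four Blackwell GPUs, and 70 TPS per GPU for the H200 submission). The function and constant names are illustrative and not part of any MLPerf tooling.

```python
def per_gpu_throughput(total_tps: float, num_gpus: int) -> float:
    """Divide an instance-level throughput evenly across its GPUs."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return total_tps / num_gpus


def speedup(new_tps: float, baseline_tps: float) -> float:
    """Ratio of new throughput to baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return new_tps / baseline_tps


# Rounded figures quoted in the article. The article says "over 800",
# so 800 is treated here as an approximate lower bound.
GB200_INSTANCE_TPS = 800.0
GB200_GPUS_PER_INSTANCE = 4   # four Blackwell GPUs per GB200 instance
H200_PER_GPU_TPS = 70.0       # per-GPU figure quoted for the H200 submission

gb200_per_gpu = per_gpu_throughput(GB200_INSTANCE_TPS, GB200_GPUS_PER_INSTANCE)
ratio = speedup(gb200_per_gpu, H200_PER_GPU_TPS)

print(f"{gb200_per_gpu:.0f} TPS per GPU, {ratio:.2f}x over H200")
# prints "200 TPS per GPU, 2.86x over H200"
```

Because the inputs are rounded, the result only confirms that the published figures are mutually consistent. As the article notes, the comparison also mixes precisions (FP4 on Blackwell), so the ratio is not a like-for-like hardware measurement.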

Image 1: NVIDIA GB200 Instances on CoreWeave achieved a 2.86X speed-up over NVIDIA’s H200 MLPerf Inference v5.0 submission for the Llama 3.1 405B model on a per-GPU basis⁸

NVIDIA H200 GPUs: CoreWeave Achieves Top-Tier Llama 2 70B Throughput

The latest MLPerf release continues to support Llama 2 70B. Unlike the larger Llama 3.1 405B model, its smaller memory footprint allows throughput comparisons between the NVIDIA H200 and H100 GPUs. CoreWeave delivers top-tier Llama 2 70B throughput on NVIDIA H200 GPUs in the MLPerf Server scenario, which has tighter latency constraints. CoreWeave achieved 33,000 tokens per second, 40% higher throughput than the fastest NVIDIA H100 GPU inference submission for the same model in MLPerf Inference v4.1.
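To see what a "40% higher" claim implies, the relationship between the new throughput, the relative gain, and the baseline can be sketched as below. The article does not state the H100 baseline, so the value computed here is only what the two rounded figures (33,000 TPS and 40%) imply. It is not a reported MLPerf result, and the helper names are hypothetical.

```python
def implied_baseline_tps(new_tps: float, relative_gain: float) -> float:
    """Baseline throughput implied by new = baseline * (1 + relative_gain)."""
    if relative_gain <= -1.0:
        raise ValueError("relative_gain must be greater than -1")
    return new_tps / (1.0 + relative_gain)


def relative_gain(new_tps: float, baseline_tps: float) -> float:
    """Fractional improvement of new_tps over baseline_tps (0.40 means 40%)."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return new_tps / baseline_tps - 1.0


H200_TPS = 33_000.0   # rounded figure from the article
STATED_GAIN = 0.40    # "40% higher throughput", also rounded

# Back-of-the-envelope estimate only, derived from the two rounded inputs.
h100_implied = implied_baseline_tps(H200_TPS, STATED_GAIN)
print(f"Implied H100 baseline: ~{h100_implied:,.0f} TPS")

# Round trip: the implied baseline reproduces the stated gain.
print(f"Gain check: {relative_gain(H200_TPS, h100_implied):.0%}")
```

Since both inputs are rounded, the actual H100 submission value may differ from this estimate by several percent. The published MLPerf Inference v4.1 results remain the authoritative source for the baseline.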

Image 2: NVIDIA H200 GPUs on CoreWeave achieved 40% higher throughput over NVIDIA’s H100 MLPerf Inference v4.1 submission for the Llama 2 70B model⁷

How CoreWeave Cloud Optimizes AI Inference Performance

The CoreWeave Cloud Platform is purpose-built for AI inference, delivering industry-leading performance as demonstrated by our MLPerf Inference v5.0 results. Every layer of our stack (data centers, infrastructure, managed services, and application software services) is optimized to maximize throughput and efficiency.

Our cutting-edge infrastructure features the latest NVIDIA GPUs, high-performance CPUs, and NVIDIA Quantum InfiniBand networking, all of which help reduce communication overhead and accelerate large-model inference. CoreWeave Mission Control helps ensure all compute resources operate at peak performance with advanced cluster validation, proactive health monitoring, and rapid node replacement, which reduces the impact of hardware failures and thereby sustains higher inference throughput. CoreWeave Kubernetes Service (CKS) runs directly on bare metal, supporting any K8s-based inference engine.

For Application Software Services, Slurm on Kubernetes (SUNK) enables efficient scheduling across 32,000+ GPUs, optimizing node-to-node communication with topology-aware scheduling. The SUNK Scheduler enhances cluster utilization by dynamically scheduling Kubernetes pods and Slurm jobs side by side within the same cluster, increasing flexibility and reducing serving costs. CoreWeave Tensorizer further accelerates model loading, minimizing time-to-first-token for faster inference.

These optimizations drive best-in-class AI performance, resilience, and usability, which is why top AI companies like Cohere, Mistral, and IBM trust CoreWeave as their AI cloud provider.

Looking Ahead

Performance and reliability are critical to AI labs and enterprises building world-changing technology. Today’s MLPerf benchmarks demonstrate CoreWeave’s continued commitment to improving AI systems’ performance and accelerating our clients’ AI ambitions.

Results at a glance:

- CoreWeave is the first CSP to submit MLPerf results for NVIDIA Grace Blackwell Superchips, continuing our track record of providing early access to leading-edge technology, as demonstrated with the H100.
- 2.86X performance improvement per GPU for NVIDIA Grace Blackwell Superchips (compared to the previous generation of NVIDIA H200 GPUs)
- 40% higher throughput for NVIDIA H200 GPUs (compared to the previous generation of NVIDIA H100 GPUs in the MLPerf Inference v4.1 submission for the Llama 2 70B model)⁷

Learn more about how CoreWeave can support your AI needs with NVIDIA Blackwell and H200 GPUs.

1 Node with 8 x NVIDIA H200 GPUs, each with 141GB of HBM3e memory.
2 Node with 4 x NVIDIA Blackwell GPUs, as a part of 2 x NVIDIA GB200 Grace Blackwell Superchips each with 372GB of HBM3e memory.
3 Input/output tokens, precision, and batch sizes are defined by MLPerf for each benchmark.
4 H200 benchmark numbers are displayed for the Server...
