CoreWeave, published May 20, 2026

CoreWeave Becomes One of the First Cloud Providers to Achieve NVIDIA Exemplar Cloud Validation for Inference on NVIDIA GB200 NVL72


Running production-scale inference workloads is a data-center-scale challenge that requires optimization across the entire AI infrastructure stack. When that optimization breaks down, performance suffers: user experiences slow down, compute costs rise, and reliability becomes unpredictable, which slows AI innovation and increases total cost of ownership (TCO). By establishing Exemplar Cloud in 2025, NVIDIA gave cloud providers a standard benchmark for validating their infrastructure performance.

Today, CoreWeave has become one of the first cloud providers to become an NVIDIA Exemplar Cloud for Inference on NVIDIA GB200 NVL72. CoreWeave's inference throughput and latency results met NVIDIA's performance standards, which are based on its reference architecture. This follows our recent milestone as one of the first cloud providers to achieve NVIDIA Exemplar Cloud for Training on NVIDIA GB200 NVL72. It is further proof that CoreWeave Cloud delivers a highly performant platform not only for training AI models, but also for serving them efficiently and reliably in production.

Together, being one of the first cloud providers to become an NVIDIA Exemplar Cloud for both training and inference showcases CoreWeave's vertically integrated stack, with Mission Control providing the operating standard for the AI cloud and a highly performant environment for the entire AI lifecycle. CoreWeave meticulously engineers every layer of our stack, from bare-metal infrastructure to inference, to bring out the best combined performance of hardware and software. That means CoreWeave Cloud is highly tuned not only for training AI models at speed, but also for serving those models efficiently and reliably in production.

NVIDIA Exemplar Cloud represents a consistent benchmarking framework

NVIDIA Exemplar Cloud provides a standard benchmark for cloud providers to validate workload performance in the cloud. Every participating provider undergoes a comprehensive evaluation process designed to reflect real-world customer needs for highly complex and demanding AI workloads. Becoming an Exemplar Cloud requires demonstrating high performance and resiliency across a suite of open, workload-specific benchmarking recipes covering inference, fine-tuning, and scaled pretraining. The result is a transparent comparison of performance, validated using the same criteria for every provider.

With this consistent benchmark data, AI pioneers gain the following benefits:

- Predictable, consistent AI workload performance on NVIDIA-accelerated cloud infrastructure, validated through joint testing and benchmarks
- Confidence in a tuned, optimized infrastructure stack through co-engineering and ongoing performance validation with NVIDIA
- Objective benchmark data to guide which cloud environments to choose, grounded in real application performance measurements, not vendor claims

The results demonstrate how CoreWeave's approach to GPU performance, which combines full-stack observability via Mission Control with automated performance optimizations, consistently yields peak performance and reliability. AI pioneers can deploy large-scale training, disaggregated multi-node inference, or anything in between, with confidence that their jobs will run effectively and efficiently. This minimizes guesswork and consistently gives them access to new GPUs, providing the predictability, reproducibility, and performance they need as they evolve models, scale training, and run inference in production.

CoreWeave achieves NVIDIA's inference benchmark targets

NVIDIA's inference benchmarks test DeepSeek-R1, Llama 3.3, and GPT-OSS models in single-node and multi-node configurations, and measure inference throughput and latency for common agentic use cases. NVIDIA specified the number of NVIDIA GB200 NVL72 GPUs for each test, along with TensorRT-LLM (TRT-LLM) or SGLang as the backend. The multi-node throughput test also included NVIDIA Dynamo, a high-throughput, low-latency distributed inference framework.

For each test scenario, the benchmark evaluated five distinct phases of inference, each with its own input and output context lengths: Reasoning, Chat, Summarization, Generation, and Disaggregation. Each is designed to stress-test specific architectural areas to ensure comprehensive coverage within the stack. The metrics were tokens per second per GPU (TPS/GPU) for throughput, and Time-to-First-Token (TTFT) in milliseconds for latency.

Each test name below is followed by (input context length/output context length):

- Reasoning (1k/1k): Uses 1K input and 1K output context lengths, with long prompts and completions that reflect chain-of-thought processing.
- Chat (128/128): Evaluates the responsiveness of interactive applications such as chat, prioritizing ultra-low latency and high user concurrency.
- Summarization (8k/512): Tests the I/O and memory bandwidth required to ingest massive prompts before generating a concise output.
- Generation (512/8k): Measures the raw throughput and efficiency of the generation phase, where the model must maintain high speed over a high volume of continuous token production.
- Disaggregation (8k/1k across nodes): Evaluates the efficiency of disaggregated inference, where the prompt processing and token generation phases are split across different GPU nodes.
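To make the scenario definitions and the throughput metric concrete, here is a minimal Python sketch. It is illustrative only and is not NVIDIA's benchmarking recipe: the `Scenario` class and both helper functions are hypothetical names, "1k" and "8k" are assumed to mean 1024 and 8192 tokens, and TPS/GPU is assumed to count generated (output) tokens only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One benchmark scenario (hypothetical helper type, not from the recipes)."""
    name: str
    input_tokens: int   # input context length per request
    output_tokens: int  # output context length per request


# Context lengths as listed above. Treating "1k" as 1024 and "8k" as 8192
# is an assumption; the article does not give exact token counts.
SCENARIOS = [
    Scenario("Reasoning", 1024, 1024),
    Scenario("Chat", 128, 128),
    Scenario("Summarization", 8192, 512),
    Scenario("Generation", 512, 8192),
    Scenario("Disaggregation", 8192, 1024),
]


def tps_per_gpu(output_tokens_total: int, wall_seconds: float, num_gpus: int) -> float:
    """Tokens per second per GPU.

    Assumption: only generated (output) tokens are counted. The actual
    benchmark may define the numerator differently.
    """
    if wall_seconds <= 0 or num_gpus <= 0:
        raise ValueError("wall_seconds and num_gpus must be positive")
    return output_tokens_total / wall_seconds / num_gpus


def prefill_to_decode_ratio(scenario: Scenario) -> float:
    """Input tokens per output token: values above 1 lean on prompt
    processing, values below 1 lean on token generation."""
    return scenario.input_tokens / scenario.output_tokens


if __name__ == "__main__":
    for s in SCENARIOS:
        print(f"{s.name}: {prefill_to_decode_ratio(s):g} input tokens per output token")
    # Made-up numbers purely to show the arithmetic, not measured results.
    print(tps_per_gpu(output_tokens_total=1_000_000, wall_seconds=50.0, num_gpus=4))
```

Under these assumptions, the ratio helper shows why the scenarios stress different parts of the stack: Summarization and Disaggregation are dominated by prompt ingestion, Generation by sustained decoding, and Reasoning and Chat are balanced.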

Throughput tests were conducted using DeepSeek-R1, Llama 3.3, and GPT-OSS in single-node configurations with one to four NVIDIA Blackwell GPUs, and in a multi-node configuration with NVIDIA Dynamo using 32 NVIDIA Blackwell GPUs. CoreWeave met or exceeded the targets in every test scenario across the five phases described above.

Throughput measures how much inference work the cluster as a whole can complete, whereas TTFT latency measures how quickly an individual request receives its first token. In the era of agentic AI, where a single user request might trigger ten sequential model calls, latency becomes the primary constraint on responsiveness. If a model takes too long to process or generate its first word, the user experience…
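The point about sequential model calls can be illustrated with a small worked example. This is a simplified sketch under stated assumptions, not part of the benchmark: the function name and all timing values are hypothetical, and it assumes each call in the chain must finish generating before the next call starts.

```python
def time_to_first_visible_token_ms(
    ttft_ms: float, decode_ms: float, sequential_calls: int
) -> float:
    """Delay before the user sees the first token of the final answer.

    Simplifying assumption: the calls are strictly sequential, so every
    intermediate call pays its full cost (TTFT plus decode time), and the
    final call pays only its TTFT before output starts streaming.
    """
    if sequential_calls < 1:
        raise ValueError("sequential_calls must be at least 1")
    intermediate_calls = sequential_calls - 1
    return intermediate_calls * (ttft_ms + decode_ms) + ttft_ms


if __name__ == "__main__":
    # Hypothetical timings chosen only to show the arithmetic.
    for ttft in (50.0, 200.0):
        total = time_to_first_visible_token_ms(ttft, decode_ms=300.0, sequential_calls=10)
        print(f"TTFT {ttft:.0f} ms per call -> {total:.0f} ms before the first visible token")
```

With a single call the user waits only one TTFT, but in a ten-call chain the per-call TTFT is paid ten times, so a difference that looks small per call compounds across the request.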

