CoreWeave, published May 20, 2026

NVIDIA H100 GPU Benchmark Results: What We Learned From Large-Scale GPU Testing


The real bottleneck isn't TFLOPs. It's maintaining both performance and reliability at scale.

We all celebrate bigger parameter counts and faster GPUs, yet production training runs fail not just from lack of compute but from the compound effect of interruptions and inefficiencies. When large-scale training runs crash every eight hours (0.33 days MTTF at 1,024 GPUs, according to a leading AI lab’s “Revisiting Reliability” paper), you lose days to reloads and wasted steps. And when Model FLOPs Utilization (MFU) drops to 35–45%, as is common, more than half of the hardware's peak throughput goes unused.

A recent foundation model training report crystallized the challenge

When a leading AI lab published its training report, documenting hundreds of unplanned interruptions over 54 days at 16K GPUs, it offered rare transparency into large-scale training challenges. This level of openness underscored a critical industry need: infrastructure that delivers both power and stability under sustained, intense workloads. The dual challenge of reliability and performance became our engineering mandate.

Our six-week benchmarking study: Can purpose-built infrastructure solve both?

We designed a comprehensive benchmark to rigorously test our hypothesis. The study parameters:
Model: 30-billion parameter Llama 3-style architecture
Scale: Up to 1,024 NVIDIA H100 GPUs
Dataset: 2 trillion tokens from Dolma v1.6
Framework: Production Megatron-LM
Metrics: MTTF, ETTR, MFU, checkpoint performance, tokenization throughput
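To make the MFU metric concrete, here is a minimal sketch of how it relates to token throughput. It assumes the common approximation of about 6 FLOPs per parameter per token for a dense transformer (forward plus backward pass, ignoring attention FLOPs) and a dense BF16 peak of 989 TFLOPS per H100 SXM GPU. Neither value comes from the study, and the function names are illustrative.

```python
# Rough MFU estimate for a dense transformer.
# Assumptions (not from the benchmark report): ~6 FLOPs per parameter per
# token, and a dense BF16 peak of 989 TFLOPS per H100 SXM GPU.

H100_PEAK_BF16_FLOPS = 989e12  # assumed peak, FLOPs per second per GPU


def mfu(params: float, tokens_per_second: float, num_gpus: int,
        peak_flops_per_gpu: float = H100_PEAK_BF16_FLOPS) -> float:
    """Fraction of theoretical peak FLOPs spent on model math."""
    achieved_flops = 6.0 * params * tokens_per_second
    return achieved_flops / (num_gpus * peak_flops_per_gpu)


def tokens_per_second_at(target_mfu: float, params: float, num_gpus: int,
                         peak_flops_per_gpu: float = H100_PEAK_BF16_FLOPS) -> float:
    """Cluster-wide token throughput implied by a given MFU."""
    return target_mfu * num_gpus * peak_flops_per_gpu / (6.0 * params)


if __name__ == "__main__":
    # 30B parameters on 1,024 GPUs, at the two MFU levels discussed in this post.
    for target in (0.42, 0.51):
        tps = tokens_per_second_at(target, 30e9, 1024)
        print(f"MFU {target:.0%}: ~{tps:,.0f} tokens/s cluster-wide")
```

Under these assumptions, throughput scales linearly with MFU, so any MFU gain shows up directly as more tokens processed per second on the same hardware.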

This wasn’t a lab-only synthetic test. We ran production-scale training on a real model, with real training data, to measure real-world performance.

The infrastructure architecture: Every layer optimized for AI

Our goal was to optimize GPU clusters to achieve superior speed, efficiency, and reliability, without cutting into performance. Our approach addressed known bottlenecks systematically:

Hardware layer
Bare-metal NVIDIA H100 GPU clusters eliminating hypervisor overhead, with NUMA pinning under our control.
Dual-fabric architecture: NVIDIA Quantum InfiniBand for all-reduce operations, separate NVIDIA BlueField DPU-offloaded Ethernet for storage traffic, preventing network contention.

Orchestration layer
SUNK (Slurm on Kubernetes): Topology-aware scheduling with health probes that evict failing nodes before they impact jobs.
Automated re-queue: Failed processes restart in ~90 seconds instead of 4+ minutes of manual triage.

Storage and checkpointing
Tensorizer-based async checkpointing: Reduced save time from 129 seconds to 17 seconds while maintaining 99%+ compute utilization.
Custom gpt_bpe tokenizer: Achieved 63M tokens/second, 6–12x faster than Hugging Face Tokenizers.
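The checkpoint and tokenizer figures above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the 30-minute checkpoint interval is a hypothetical value (the interval used in the study is not stated here), the stall calculation treats each save as fully blocking, and the tokenization estimate assumes a single pass over the 2-trillion-token dataset at sustained throughput.

```python
# Back-of-the-envelope checks for the checkpointing and tokenization figures.
# The checkpoint interval below is a hypothetical value chosen for illustration.


def speedup(before_seconds: float, after_seconds: float) -> float:
    """How many times faster the new save time is than the old one."""
    return before_seconds / after_seconds


def stall_fraction(stall_seconds: float, interval_seconds: float) -> float:
    """Share of wall-clock time lost if training pauses for stall_seconds
    after every interval_seconds of compute (worst case: fully blocking)."""
    return stall_seconds / (interval_seconds + stall_seconds)


def tokenization_hours(total_tokens: float, tokens_per_second: float) -> float:
    """Hours needed to tokenize a dataset once at a sustained rate."""
    return total_tokens / tokens_per_second / 3600.0


if __name__ == "__main__":
    interval = 30 * 60  # hypothetical: one checkpoint every 30 minutes
    print(f"Save-time speedup: {speedup(129, 17):.1f}x")
    print(f"Stall at 129 s per save: {stall_fraction(129, interval):.2%}")
    print(f"Stall at 17 s per save:  {stall_fraction(17, interval):.2%}")
    print(f"Tokenizing 2T tokens at 63M tokens/s: "
          f"{tokenization_hours(2e12, 63e6):.1f} hours")
```

Shorter checkpoint intervals raise the cost of a slow save, which is why save time matters more as failure rates climb.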

A defining moment: The 2:17 a.m. non-incident

During a 512-GPU run at 2:17 a.m., our on-call engineer’s phone stayed silent. SUNK had automatically:
Detected a failing node via health probes
Evicted it from the pool
Rescheduled the workload
Resumed training within 3 minutes

Grafana was showing green before anyone was even alerted. This is what infrastructure-level reliability looks like.

The results: Validated performance at scale

Our study achieved:
51–52% MFU on NVIDIA H100 GPUs (vs. 35–45% typically reported)
3.66 days MTTF at 1,024 GPUs (more than 10x the 0.33-day baseline)
97.5% ETTR (Effective Training Time Ratio)
Roughly 8x faster checkpointing (129 seconds down to 17 seconds) via async Tensorizer implementation
43.7% MTTF improvement projected at 16,384 GPUs vs. Llama 3's reported numbers
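Both reliability metrics can be derived from a run's failure log. The following is a minimal sketch, assuming ETTR is productive training time divided by total wall-clock time and MTTF is wall-clock time divided by the number of failures. The survival model in the full report is more detailed, and the example values are invented for illustration rather than taken from the study.

```python
# Simple MTTF and ETTR estimates from a failure log.
# Definitions here are simplifying assumptions, and the example values are
# invented for illustration (they are not data from the benchmark).
from dataclasses import dataclass
from typing import List


@dataclass
class RunLog:
    wall_clock_hours: float
    # One entry per failure: restart time plus recomputed steps, in hours.
    lost_hours: List[float]


def mttf_days(log: RunLog) -> float:
    """Mean time to failure in days (infinite if the run never failed)."""
    if not log.lost_hours:
        return float("inf")
    return log.wall_clock_hours / len(log.lost_hours) / 24.0


def ettr(log: RunLog) -> float:
    """Effective Training Time Ratio: productive time over wall-clock time."""
    productive = log.wall_clock_hours - sum(log.lost_hours)
    return productive / log.wall_clock_hours


if __name__ == "__main__":
    example = RunLog(wall_clock_hours=100.0, lost_hours=[0.5, 0.75])
    print(f"MTTF: {mttf_days(example):.2f} days, ETTR: {ettr(example):.2%}")
```

Framed this way, ETTR improves either by failing less often (higher MTTF) or by losing less time per failure (faster restarts and more recent checkpoints).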

Third-party validation

We validated against published configurations:
Compared to another AI research group: 51.9% MFU vs. their 40.43% (28% improvement)
Compared to another leading AI lab: 49.2% MFU vs. their 41.85% (18% improvement)
Achieved performance and reliability parity with NVIDIA DGX Cloud Benchmarking recipes

Business impact: Every percentage point matters

For a 30-day, 1,024-GPU training run at $2.10/GPU-hour, improving total average MFU from 42% to 51% delivers ~6,000 GPU-hours of additional effective compute worth $12,600 without changing the invoice. Combined with 10x reliability improvements, this means:
Models reach production weeks sooner
Engineers iterate instead of debugging
Predictable timelines for critical projects
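The link between MFU and run time can be sketched with simple arithmetic. This is an illustrative calculation, assuming that run time for a fixed amount of training work scales inversely with MFU and that every GPU-hour of the run is billed at the quoted rate. The function names are hypothetical.

```python
# How an MFU gain shortens a fixed training job.
# Assumption: for a fixed amount of work, run time is inversely
# proportional to MFU, and every GPU-hour is billed at a flat rate.


def invoice_usd(days: float, gpus: int, usd_per_gpu_hour: float) -> float:
    """Total bill for running `gpus` GPUs continuously for `days` days."""
    return days * 24 * gpus * usd_per_gpu_hour


def time_saved_fraction(mfu_before: float, mfu_after: float) -> float:
    """Fraction of run time saved on the same job after the MFU gain."""
    return 1.0 - mfu_before / mfu_after


if __name__ == "__main__":
    saved = time_saved_fraction(0.42, 0.51)
    print(f"30-day, 1,024-GPU invoice: ${invoice_usd(30, 1024, 2.10):,.0f}")
    print(f"Run time saved: {saved:.1%} ({30 * saved:.1f} of 30 days)")
```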

Our 9 percentage point MFU improvement (from 42% to 51%) translates to completing training runs 18% faster, saving more than five days on a month-long job.

What's next: Extending these results

We're already applying these optimizations to NVIDIA GB200 NVL72 clusters and building live MFU dashboards for customer workloads. Our full 30-page technical report, including methodology, survival model mathematics, and raw logs, is available now.

Ready to test these results on your models?

The data shows that infrastructure architecture matters. Our benchmark demonstrates that with the right approach, you can achieve both speed and stability at thousand-GPU scale. Download the technical report to read the whole story, or schedule a deep dive with our team to learn how to apply these benchmarks to your own clusters.


