NVIDIA H100 GPU Benchmark Results: What We Learned From Large-Scale GPU Testing
Source: NVIDIA H100 Benchmarks for Large-Scale Training | CoreWeave
The real bottleneck isn't TFLOPs. It's maintaining both performance and reliability at scale.

We all celebrate bigger parameter counts and faster GPUs, yet production training runs fail not just from lack of compute but from the compound effect of interruptions and inefficiencies. When large-scale training runs crash roughly every eight hours (0.33 days mean time to failure, or MTTF, at 1,024 GPUs, according to a leading AI lab's "Revisiting Reliability" paper), you lose days to reloads and wasted steps. And when Model FLOPs Utilization (MFU) drops to 35–45%, as is common, more than half of the GPUs' peak compute goes unused.

A recent foundation model training report crystallized the challenge

When a leading AI lab published its training report, documenting hundreds of unplanned interruptions over 54 days at 16K GPUs, it offered rare transparency into large-scale training challenges. This level of openness underscored a critical industry need: infrastructure that delivers both power and stability under sustained, intense workloads. The dual challenge of reliability and performance became our engineering mandate.

Our six-week benchmarking study: Can purpose-built infrastructure solve both?

We designed a comprehensive benchmark to test whether purpose-built infrastructure can deliver both. The study parameters:

- Model: 30-billion-parameter Llama 3-style architecture
- Scale: Up to 1,024 NVIDIA H100 GPUs
- Dataset: 2 trillion tokens from Dolma v1.6
- Framework: Production Megatron-LM
- Metrics: MTTF, ETTR, MFU, checkpoint performance, tokenization throughput
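For readers who want to see how MFU is usually estimated, here is a minimal sketch. It assumes the common rule of thumb of about 6 FLOPs per parameter per token for a forward and backward pass, and a dense BF16 peak of roughly 989 TFLOP/s per H100 SXM GPU. The function name and the throughput value are hypothetical and chosen for illustration; they are not figures from the study.

```python
def model_flops_utilization(
    params: float,
    tokens_per_second: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Estimate MFU with the 6 * N FLOPs-per-token approximation."""
    # Forward + backward pass costs roughly 6 FLOPs per parameter per token.
    # Attention FLOPs are ignored in this approximation.
    achieved_flops = 6.0 * params * tokens_per_second
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops


# Assumed dense BF16 peak for an H100 SXM; check the datasheet for your SKU.
H100_BF16_PEAK = 989e12

# Hypothetical cluster-wide throughput, not a measured value.
mfu = model_flops_utilization(
    params=30e9,
    tokens_per_second=2.8e6,
    num_gpus=1024,
    peak_flops_per_gpu=H100_BF16_PEAK,
)
print(f"MFU: {mfu:.1%}")
```

Because the peak FLOPs figure sits in the denominator, MFU numbers are only comparable when they use the same peak and the same FLOPs-per-token accounting.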
This wasn't a lab-only synthetic test. We ran production-scale training on a real model, with real training data, to measure real-world performance.

The infrastructure architecture: Every layer optimized for AI

Our goal was to optimize GPU clusters to achieve superior speed, efficiency, and reliability without cutting into performance. Our approach addressed known bottlenecks systematically:

Hardware layer

- Bare-metal NVIDIA H100 GPU clusters, eliminating hypervisor overhead, with NUMA pinning under our control.
- Dual-fabric architecture: NVIDIA Quantum InfiniBand for all-reduce operations and separate NVIDIA BlueField DPU-offloaded Ethernet for storage traffic, preventing network contention.
Orchestration layer

- SUNK (Slurm on Kubernetes): Topology-aware scheduling with health probes that evict failing nodes before they impact jobs.
- Automated re-queue: Failed processes restart in ~90 seconds instead of 4+ minutes of manual triage.
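The probe, evict, and re-queue flow described above can be pictured with a toy in-memory model. This is only an illustration of the idea: it is not SUNK's, Slurm's, or Kubernetes' API, it ignores topology, and every name in it (`Cluster`, `evict_failing_nodes`, `reschedule`, the node names) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    healthy: set[str]
    evicted: set[str] = field(default_factory=set)


def evict_failing_nodes(cluster: Cluster, probe_results: dict[str, bool]) -> set[str]:
    """Move every node whose probe failed out of the schedulable pool."""
    # A node with no probe result is treated as failing.
    failing = {n for n in cluster.healthy if not probe_results.get(n, False)}
    cluster.healthy -= failing
    cluster.evicted |= failing
    return failing


def reschedule(job_nodes: set[str], cluster: Cluster) -> set[str]:
    """Replace evicted nodes in a job's allocation with healthy spares."""
    kept = job_nodes & cluster.healthy
    spares = sorted(cluster.healthy - job_nodes)
    missing = len(job_nodes) - len(kept)
    if missing > len(spares):
        raise RuntimeError("not enough healthy spare nodes to requeue the job")
    return kept | set(spares[:missing])


cluster = Cluster(healthy={"node-a", "node-b", "node-c", "node-d"})
job = {"node-a", "node-b", "node-c"}
probes = {"node-a": True, "node-b": False, "node-c": True, "node-d": True}

evicted = evict_failing_nodes(cluster, probes)
job = reschedule(job, cluster)
print(sorted(evicted), sorted(job))
```

The point of the sketch is the ordering: the failing node leaves the pool first, so the replacement allocation can only draw from nodes that passed their probes.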
Storage and checkpointing

- Tensorizer-based async checkpointing: Reduced save time from 129 seconds to 17 seconds while maintaining 99%+ compute utilization.
- Custom gpt_bpe tokenizer: Achieves 63M tokens/second, 6–12x faster than HuggingFace Tokenizers.
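To make the idea behind asynchronous checkpointing concrete, here is a generic sketch using only the Python standard library. It does not use Tensorizer and is not how the study implemented it. It only shows the pattern: take a quick in-memory snapshot, then write it to disk on a background thread so training can continue. The function name, state layout, and file name are assumptions.

```python
import copy
import os
import pickle
import tempfile
import threading


def async_checkpoint(state: dict, path: str) -> threading.Thread:
    """Snapshot `state` synchronously, then persist it on a background thread."""
    # The deep copy is the only part the training loop has to wait for.
    snapshot = copy.deepcopy(state)

    def _write() -> None:
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        # Rename into place so readers never see a partially written file.
        os.replace(tmp_path, path)

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
ckpt_path = os.path.join(tempfile.mkdtemp(), "step_100.ckpt")
writer = async_checkpoint(state, ckpt_path)

state["step"] = 101  # training continues while the write is in flight
writer.join()        # wait before starting the next save or shutting down

with open(ckpt_path, "rb") as f:
    restored = pickle.load(f)
print(restored["step"])
```

The snapshot is what keeps the saved checkpoint consistent: later updates to `state` do not leak into the file being written.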
A defining moment: The 2:17 a.m. non-incident

At 2:17 a.m. during a 512-GPU run, our on-call engineer's phone stayed silent. SUNK had automatically:

- Detected a failing node via health probes
- Evicted it from the pool
- Rescheduled the workload
- Resumed training within 3 minutes
Grafana was showing green before anyone was even alerted. This is what infrastructure-level reliability looks like.

The results: Validated performance at scale

Our study achieved:

- 51–52% MFU on NVIDIA H100 GPUs (vs. the 35–45% typically reported)
- 3.66 days MTTF at 1,024 GPUs (more than a 10x improvement over the 0.33-day baseline)
- 97.5% ETTR (Effective Training Time Ratio)
- About 8x faster checkpointing via the async Tensorizer implementation (129 seconds down to 17 seconds)
- A projected 43.7% MTTF improvement at 16,384 GPUs vs. Llama 3's reported numbers
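As a reference for how the two reliability metrics are commonly defined, here is a small sketch that derives MTTF and ETTR from an interruption log. It assumes MTTF is total runtime divided by the number of failures and ETTR is productive time divided by wall-clock time. The log values are hypothetical and are not the study's data.

```python
def mttf_days(total_runtime_hours: float, num_failures: int) -> float:
    """Mean time to failure: total runtime divided by the number of failures."""
    if num_failures == 0:
        return float("inf")
    return total_runtime_hours / num_failures / 24.0


def ettr(total_runtime_hours: float, lost_hours: float) -> float:
    """Effective Training Time Ratio: productive time over wall-clock time."""
    return (total_runtime_hours - lost_hours) / total_runtime_hours


# Hypothetical 10-day run with four interruptions. Each entry is the time
# lost to that interruption (restart plus recomputed steps), in hours.
runtime_hours = 240.0
lost_per_interruption = [0.5, 1.25, 0.75, 2.0]

run_mttf = mttf_days(runtime_hours, len(lost_per_interruption))
run_ettr = ettr(runtime_hours, sum(lost_per_interruption))
print(f"MTTF: {run_mttf:.2f} days, ETTR: {run_ettr:.2%}")
```

The two metrics move together but measure different things: MTTF counts how often a run is interrupted, while ETTR also reflects how expensive each interruption is.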
Third-party validation

We validated against published configurations:

- Compared to another AI research group: 51.9% MFU vs. their 40.43% (a 28% relative improvement)
- Compared to another leading AI lab: 49.2% MFU vs. their 41.85% (an 18% relative improvement)
- Achieved performance and reliability parity with NVIDIA DGX Cloud Benchmarking recipes
Business impact: Every percentage point matters

For a 30-day, 1,024-GPU training run at $2.10/GPU-hour, improving average MFU from 42% to 51% delivers ~6,000 GPU-hours of additional effective compute worth $12,600 without changing the invoice. Combined with the 10x reliability improvement, this means:

- Models reach production weeks sooner
- Engineers iterate instead of debugging
- Predictable timelines for critical projects
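To see how an MFU gain from 42% to 51% converts into wall-clock time, here is a short worked calculation. It assumes a fixed amount of training work on the same hardware, so that run time scales inversely with MFU, and it ignores any change in interruptions. The function name is hypothetical.

```python
def time_saved_fraction(mfu_before: float, mfu_after: float) -> float:
    """Fraction of wall-clock time saved for a fixed amount of training work."""
    # With hardware and workload fixed, run time is inversely proportional to MFU.
    return 1.0 - mfu_before / mfu_after


saved = time_saved_fraction(0.42, 0.51)
days_saved = 30 * saved
print(f"{saved:.1%} less wall-clock time, about {days_saved:.1f} days on a 30-day run")
```

Note the difference between the two ways of quoting the gain: 9 percentage points of MFU is a relative throughput increase of 51/42, while the time saved is 1 - 42/51.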
Our 9-percentage-point MFU improvement (from 42% to 51%) translates to completing training runs in about 18% less time, saving roughly five days on a month-long job.

What's next: Extending these results

We're already applying these optimizations to NVIDIA GB200 NVL72 clusters and building live MFU dashboards for customer workloads. Our full 30-page technical report, including methodology, survival model mathematics, and raw logs, is available now.

Ready to test these results on your models?

The data proves that infrastructure architecture matters. Our benchmark demonstrates that with the right approach, you can achieve both speed and stability at thousand-GPU scale. Download the technical report to read the whole story. If you're interested, schedule a deep dive with our team to learn how to apply these benchmarks to your own clusters.