CoreWeave, published May 20, 2026

Faster, Smarter AI Training at Thousand-GPU Scale


Purpose-Built Cloud for AI at Scale: Achieving 20% Higher MFU and 10× Reliability on Thousand-GPU Clusters

NVIDIA H100 Performance Benchmarks

A CoreWeave Technical Report

Wes Brown (Distinguished Engineer), David Marx (Senior Engineer), Anthony Mercurio (Engineer), Eta Syra (Engineer), Sanger Steel (Engineer), Rex Wang (Engineer), Deok Filho (Product Manager)

August 2025

Table of Contents

Executive Summary
Introduction: The Challenge of Large-Scale AI Training
Benchmarking Results
Benchmarking Methodology
The CoreWeave Approach: Purpose-Built for AI Scale
Technical Implementation
Best Practices for Large-Scale Training on CoreWeave
Conclusion: Reliable, Performant AI Training at Scale with CoreWeave
Appendix

Executive Summary

The artificial intelligence industry's push toward trillion-parameter models has exposed a critical gap: while GPU availability has increased, the infrastructure capability to maintain stable, efficient training at scale has not kept pace. Industry reports document effective training time ratios as low as 90% [1] and mean time to failure under 8 hours [2] for thousand-GPU clusters, directly impacting development costs and competitive timelines. For organizations investing in AI development, infrastructure-related failures can extend training times from weeks to months, delaying time-to-market in a rapidly evolving competitive landscape.

CoreWeave provides a specialized AI cloud platform meticulously optimized for large-scale, GPU-accelerated workloads, differentiating through bare metal performance, low-latency networking, flexible configurations, and deep AI/ML operational expertise. During a six-week period in May-June 2025, we performed LLM pre-training exercises to provide concrete evidence of CoreWeave's value proposition, demonstrating the performance, reliability, and stability crucial for large model training.
CoreWeave minimizes downtime, averaging an Effective Training Time Ratio (ETTR) of 98% while maximizing computational efficiency, reducing costs, and accelerating time-to-market for large-scale AI. Our experiments demonstrate a Mean Time To Failure (MTTF) of 3.66 days for a 1,024-GPU job, representing a 43.7% improvement in MTTF over a similarly trained industry model [1] when our results are projected up to 16,384 GPUs.

Our benchmarking shows CoreWeave achieves Model FLOPS Utilization (MFU) exceeding 50% on NVIDIA Hopper GPUs. This level of efficiency represents up to 20% higher performance compared to the 35%-45% MFU range typically observed in public foundation model training benchmarks, significantly bridging the "AI Efficiency Gap" [3]. Further benchmarking against specific published results showed MFU improvements of 18-28% over published results from leading AI labs. Additionally, collaborative testing using NVIDIA DGX Cloud Benchmarking Recipes confirmed that CoreWeave's NVIDIA Hopper GPU infrastructure performs on par with the NVIDIA reference architecture.

Achieving this level of performance was only made possible by the specific capabilities unique to CoreWeave Cloud: bare metal performance, robust health checking, automated fleet and node lifecycle management, optimized storage solutions accessed via NVIDIA BlueField DPU-managed network links on a dedicated network fabric separate from the NVIDIA Quantum InfiniBand fabric used to communicate updates during training, topology-aware scheduling via CoreWeave's Slurm on Kubernetes (SUNK), and integrated detailed observability. We additionally demonstrated efficiency gains from in-house implementations of modern best practices, including asynchronous checkpointing using tools like Tensorizer, massively parallel text processing with the high-speed gpt_bpe tokenizer, and automated recovery facilitated via SUNK.

Key Results

Table 1. Summary of CoreWeave Benchmark Results Compared to Industry Baselines

| Metric | CoreWeave Result | Industry Baseline | CoreWeave Uplift |
| Model FLOPS Utilization (MFU) | 51–52% (1024 NVIDIA H100 GPUs) | 35-45% [4,5,6] | +18–28% |
| Effective Training Time Ratio (ETTR) | 97.5% @ 1024 GPUs | ~90% or lower [1] | ~8% gain |
| Mean Time to Failure (MTTF) | 3.66 days @ 1024 GPUs | ~0.33 days [2] @ 1024 GPUs | 10× longer |
| Checkpoint Save Time | 17s (async, 1024 GPUs) | 129s (synchronous baseline, Section 6.2) | ~8× faster |
| Checkpoint Load Time | 8.8–34.5s (Tensorizer) | 25.9–68.3s (torch.distributed, Section 6.2) | 2–3× faster |
| Tokenization Throughput | 63M tokens/sec (gpt_bpe) | ~5-10M tokens/sec (Hugging Face Tokenizers) [7] | 6–12× faster |

Overview of performance and reliability metrics from large-scale AI training jobs on CoreWeave’s platform, compared against publicly reported industry results. CoreWeave demonstrated 18–28% higher GPU efficiency (MFU), 10× longer MTTF, and significantly faster checkpointing and tokenization, all contributing to improved cost-efficiency and time-to-market for foundation model training.

1. Introduction: The Challenge of Large-Scale AI Training

The AI industry continues its rapid trajectory towards ever-larger models, demanding unprecedented levels of compute and unwavering infrastructure reliability. While hardware advancements provide the necessary raw compute, continuously using thousands of GPUs for extended training runs remains a formidable challenge, pushing the boundaries of infrastructure design and operational management.

Training large language models (LLMs) at scale presents significant infrastructure challenges that extend far beyond simply acquiring sufficient GPU resources. Training models on thousands of GPUs synchronously is immensely complex, with failures significantly impacting Time-To-Market (TTM) and total cost. As we will discuss, CoreWeave's infrastructure is holistically designed to mitigate these risks.
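The relationship between failures and lost training time can be illustrated with some simple arithmetic. The following is a minimal sketch, not the report's methodology: it assumes ETTR is productive training time divided by total wall-clock time, and that every failure costs a fixed amount of lost time (restart plus work recomputed since the last checkpoint). The function names and the one-hour loss figure are hypothetical and chosen only for illustration; the MTTF values are the ones quoted in Table 1.

```python
def effective_training_time_ratio(productive_hours: float, wall_clock_hours: float) -> float:
    """ETTR, assumed here to be productive training time / total wall-clock time."""
    if wall_clock_hours <= 0:
        raise ValueError("wall_clock_hours must be positive")
    return productive_hours / wall_clock_hours


def expected_failures(run_days: float, mttf_days: float) -> float:
    """Expected number of failures over a run, given a mean time to failure."""
    if mttf_days <= 0:
        raise ValueError("mttf_days must be positive")
    return run_days / mttf_days


def ettr_from_mttf(mttf_hours: float, lost_hours_per_failure: float) -> float:
    """Approximate ETTR when each failure costs a fixed amount of lost time.

    Per failure cycle the job does `mttf_hours` of useful work and then
    loses `lost_hours_per_failure` (an assumed, illustrative quantity).
    """
    return effective_training_time_ratio(mttf_hours, mttf_hours + lost_hours_per_failure)


if __name__ == "__main__":
    HYPOTHETICAL_LOSS_HOURS = 1.0  # assumption for illustration, not a measured value
    for label, mttf_days in [("3.66-day MTTF", 3.66), ("0.33-day MTTF", 0.33)]:
        failures = expected_failures(run_days=30.0, mttf_days=mttf_days)
        ettr = ettr_from_mttf(mttf_days * 24.0, HYPOTHETICAL_LOSS_HOURS)
        print(f"{label}: ~{failures:.1f} failures in 30 days, ETTR ~{ettr:.3f}")
```

Under this simplified model, a longer MTTF improves ETTR in two ways: fewer interruptions occur over a run, and the fixed cost per interruption is spread over more useful work. Faster checkpoint save and load times would reduce the per-failure loss term.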
This document concentrates on the benchmarking methodology, infrastructure advantages, and performance results; this is not a comprehensive guide to the practical steps involved in pre-training Ixchel, CoreWeave’s Llama 3-based 30B model.

1.1. Model FLOPS Utilization (MFU)

Central to evaluating training efficiency is Model FLOPS Utilization (MFU), which measures the percentage of theoretical peak GPU performance achieved during model training. MFU is calculated as the ratio of observed computational throughput to theoretical hardware capacity, accounting for the actual FLOPS required by the model architecture. For transformer models, theoretical FLOPS per token approximates 6N for the forward and backward passes, where N represents total parameters. Our 30B-parameter model thus requires…
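The MFU definition above can be sketched in a few lines of Python. This is an illustrative calculation under stated assumptions, not the report's measurement code: it uses the 6N FLOPS-per-token approximation from the text (which ignores attention-specific terms), and it assumes a dense BF16 peak of 989 TFLOPS per H100 GPU, a figure the excerpt does not state. The function and constant names are hypothetical.

```python
# Assumed dense BF16 peak per H100 SXM GPU; the excerpt does not say
# which peak figure the report uses, so treat this as a placeholder.
H100_BF16_PEAK_FLOPS = 989e12


def flops_per_token(num_params: float) -> float:
    """Approximate training FLOPS per token (forward + backward) as 6N."""
    return 6.0 * num_params


def mfu(tokens_per_second: float, num_params: float, num_gpus: int,
        peak_flops_per_gpu: float = H100_BF16_PEAK_FLOPS) -> float:
    """Model FLOPS Utilization: achieved model FLOPS / theoretical peak FLOPS."""
    achieved = tokens_per_second * flops_per_token(num_params)
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


def tokens_per_second_for_mfu(target_mfu: float, num_params: float, num_gpus: int,
                              peak_flops_per_gpu: float = H100_BF16_PEAK_FLOPS) -> float:
    """Cluster-wide token throughput needed to reach a target MFU."""
    return target_mfu * num_gpus * peak_flops_per_gpu / flops_per_token(num_params)


if __name__ == "__main__":
    n_params = 30e9  # 30B-parameter model, as in the text
    gpus = 1024
    needed = tokens_per_second_for_mfu(0.50, n_params, gpus)
    print(f"FLOPS per token: {flops_per_token(n_params):.3e}")
    print(f"Tokens/sec for 50% MFU on {gpus} GPUs: {needed:.3e}")
    print(f"Round-trip MFU: {mfu(needed, n_params, gpus):.2f}")
```

Because MFU is a ratio against a chosen peak, comparisons between published figures are only meaningful when the same precision and peak throughput are assumed on both sides.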

