Achieve AI Infrastructure Goodput of up to 96% with 3 Key Strategies
Goodput, the share of compute time spent doing meaningful work, is an essential metric of AI cluster performance. Training foundation models efficiently has become a critical challenge as we push the boundaries of model size and complexity and aim to maximize hardware performance. Optimizing the training process to maximize goodput is therefore paramount, because it directly affects a team's ability to build, train, fine-tune, and deploy AI applications in a timely and cost-effective manner.
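As a rough illustration of the metric, the sketch below computes goodput as the fraction of allocated time that was not lost. The split of lost time into downtime and recomputation is an assumption made for this example, and the names `JobWindow` and `goodput` are hypothetical, not part of any CoreWeave API.

```python
from dataclasses import dataclass


@dataclass
class JobWindow:
    """Hypothetical wall-clock accounting for one training job, in hours."""
    total_hours: float      # time the cluster was allocated to the job
    downtime_hours: float   # time spent idle during failures and restarts
    recompute_hours: float  # time spent redoing work lost since the last checkpoint


def goodput(window: JobWindow) -> float:
    """Return the fraction of allocated time spent on work that was kept."""
    if window.total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if window.downtime_hours < 0 or window.recompute_hours < 0:
        raise ValueError("lost time cannot be negative")
    lost = window.downtime_hours + window.recompute_hours
    if lost > window.total_hours:
        raise ValueError("lost time cannot exceed total time")
    return (window.total_hours - lost) / window.total_hours


# Illustrative numbers only: 100 allocated hours, 3 lost to downtime, 1 to recomputation.
example = JobWindow(total_hours=100.0, downtime_hours=3.0, recompute_hours=1.0)
print(f"goodput = {goodput(example):.0%}")
```

Under this accounting, both fewer interruptions and faster recovery raise goodput, since each reduces the lost hours in the numerator.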
At CoreWeave, the AI Hyperscaler™, we’ve designed our cloud platform from the ground up to maximize the performance and efficiency of state-of-the-art infrastructure, minimize interruptions, and recover quickly when they do occur, delivering goodput as high as 96%. Higher goodput, in turn, helps leading AI labs and enterprises innovate faster, maximize the utilization of their infrastructure, and lower overall model training and deployment costs.
In this blog, we’ll explore three strategies we apply at CoreWeave to optimize resource utilization and deliver higher goodput from AI infrastructure.

Why GPU reliability matters and challenges in training large models

Scaling laws are empirical observations that the performance (i.e., quality) of foundation models improves as you increase the size of the models and train them on a larger corpus of data. To achieve linear improvements in model quality, model parameters and data set size must grow exponentially, which in turn requires exponential growth in compute.
Figure 1: Foundational model scaling laws necessitate more orders of compute

At the same time, the pace of innovation for new models has also accelerated, and the average time between launches has reduced to around 120 days, as shown in the chart below.
Figure 2: Time reduction between new model launches

Leading AI labs and enterprises are training the latest foundation models (FMs) on clusters with tens of thousands of GPUs, with training jobs spanning days to multiple weeks. With new and evolving technologies and hardware (e.g., GPUs and system interconnects), job interruptions are common, even expected. Unlike traditional general compute workloads, GenAI workloads are more prone to job interruptions because of the massive parallelism applied across thousands of GPUs: in an unoptimized cluster, the failure of a single component (GPU, network, memory, cable, cooling, etc.) can cause the entire job to fail. Diagnosing, troubleshooting, and recovering from these failures is critical to increasing infrastructure utilization and delivering faster time-to-market for new models.

Higher goodput therefore means faster time-to-market. The ideal is 100%. A recent study showed that industry average goodput is close to 90%. At CoreWeave, our customers experience goodput of up to 96%.
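To see why interruptions become expected at this scale, consider a simplified model in which every component fails independently with the same small daily probability and any single failure stops the whole job. The sketch below is only that model; the failure rate is an illustrative assumption rather than a measured figure, and the function names are hypothetical.

```python
def prob_no_interruption(num_components: int, daily_failure_prob: float, days: float) -> float:
    """Probability that no component fails during the job.

    Assumes independent failures and that any one failure interrupts the job.
    """
    if num_components < 0 or days < 0:
        raise ValueError("num_components and days must be non-negative")
    if not 0.0 <= daily_failure_prob <= 1.0:
        raise ValueError("daily_failure_prob must be between 0 and 1")
    survive_one_day = 1.0 - daily_failure_prob
    # Each component must survive every day, so the exponents multiply.
    return survive_one_day ** (num_components * days)


def expected_interruptions(num_components: int, daily_failure_prob: float, days: float) -> float:
    """Expected number of component failures over the job under the same model."""
    return num_components * daily_failure_prob * days


# Assumed rate for illustration: one failure per 10,000 component-days.
ASSUMED_DAILY_FAILURE_PROB = 1e-4

for size in (100, 1_000, 10_000):
    p = prob_no_interruption(size, ASSUMED_DAILY_FAILURE_PROB, days=1)
    print(f"{size:>6} components: P(no interruption in 1 day) = {p:.3f}")
```

With this assumed rate, a one-day job on 10,000 components runs without interruption only about 37% of the time in this model, and a week-long job on the same cluster would expect around seven failures. The exact numbers depend entirely on the assumed rate, but the trend holds: the larger the job, the more goodput depends on detecting and recovering from failures quickly.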
Here are three key strategies we use to make that a reality.

1. Start with high-performance hardware optimized for AI workloads

AI labs and enterprises require access to specialized GPU-based computing environments designed to handle the unique challenges of running AI workloads. Each layer of the technology stack, including data center architecture, compute resources, networking, storage, and infrastructure management, must demonstrate proven performance, scalability, and efficiency to run AI workloads reliably. For example, the GPUs selected must be optimized for LLMs, with high-bandwidth memory and rapid data access. Additionally, the networking fabric should offer low-latency, high-throughput interconnects.

At CoreWeave, we’ve created a purpose-built cloud with the latest infrastructure, offering resilient and reliable GPU clusters to power some of the world’s most compute-intensive AI workloads. We are first to market with advanced NVIDIA GPUs, including the latest NVIDIA GB200 NVL72 systems. That allows our customers to take advantage of some of the most cutting-edge chips, which provide compute at the massive scale required for reliably running AI model training and experimentation. Additionally, we built our networking architecture with NVIDIA Quantum-2 InfiniBand fibers, which allows us to deliver a highly performant multi-node interconnect at supercomputing scale. We leverage NVIDIA BlueField-3 DPUs to offload, accelerate, and isolate networking from GPU nodes, allowing compute capacity to focus specifically on AI workloads.

Our software stack is also purpose-built and optimized for running AI workloads. CoreWeave Kubernetes Service (CKS) runs Kubernetes directly on high-performance bare-metal servers for maximum performance and efficiency.
With Slurm on Kubernetes (SUNK), customers can easily run Slurm-based workloads on more than 32K GPUs. Topology-aware scheduling helps optimize distributed training performance by taking advantage of the speed of the InfiniBand fabric for node-to-node communication.

2. Identify and remediate interruptions proactively

Job interruptions are bound to happen. Our goal is to minimize them and enable quick recovery when they do occur. Rigorous health checking and remediation processes are critical to prevent hardware issues from snowballing into more significant interruptions.

CoreWeave Mission Control provides advanced cluster validation, health monitoring, proactive node replacement, and deep observability. These capabilities help ensure your workloads run on healthy infrastructure, significantly reducing the likelihood of disruptions. By minimizing interruptions and recovering faster, we can help clients achieve a goodput rate as high as 96%.

CoreWeave Fleet Lifecycle Controller performs rigorous AI infrastructure validation from initial deployment through the entire node and cluster lifecycle. It runs a series of sophisticated tests to validate node health, GPUs, and networking, along with end-to-end testing for the entire cluster, before bringing capacity into the production fleet.

Here’s why that’s important. AI workloads perform complex mathematical operations at large scale. Silent data corruption can cause the results of...