CoreWeave, published May 20, 2026

CAIOS Achieves 7+ GB/s per GPU on NVIDIA Blackwell Ultra

In March of this year, we published the latest benchmarking results of CoreWeave AI Object Storage (CAIOS) on NVIDIA H100 GPU nodes, showing sustained throughput of over 2 GB/s per GPU across any number of GPUs. CAIOS is CoreWeave’s AI-focused storage service, designed to deliver higher throughput per GPU than traditional object storage services and to scale to hundreds of thousands of GPUs. CAIOS includes the Local Object Transport Accelerator (LOTA), which transparently prestages and caches objects on GPU nodes for accelerated performance.

Since the March benchmark testing, we’ve continued pushing the limits of CAIOS with new hardware configurations, transport layers, and optimizations. CAIOS is purpose-built to accelerate the most data-intensive stages of AI. It streamlines the flow of massive training sets into GPUs, shortens checkpointing and restore cycles, speeds the loading of model weights, and powers high-throughput key-value caches for inference.

We’re excited today to share the breakthrough results of our latest tests on 16 NVIDIA Blackwell Ultra GPU nodes, where CAIOS achieved an average throughput of 7+ GB/s per GPU, representing a more than 3x improvement per GPU compared to our March benchmarks.

Benchmark setup

Test Harness: Warp S3 benchmarking tool, set up to run object read tests in CoreWeave Kubernetes Service (CKS)
Objects: 10,000 objects at 50 MB each, with 15 MB parts
Concurrency: 100 for Warp (100 goroutines making calls)
Pipelining: Enabled (cache reads queued ahead across nodes rather than sequential calls)
Transport: Both Ethernet (TCP) and RDMA (NVIDIA Quantum InfiniBand) tested
Cluster Size: 16 x Blackwell Ultra nodes
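To make the object layout concrete, the following minimal sketch works out the dataset size and the number of parts implied by the setup above. It assumes decimal units (1 GB = 1,000 MB) and fixed-size 15 MB parts with a shorter final part; the article states neither, and the `dataset_shape` helper is a hypothetical name used only for illustration.

```python
import math


def dataset_shape(num_objects: int, object_mb: int, part_mb: int) -> dict:
    """Summarize a dataset of equally sized objects split into parts.

    Assumption: parts are fixed-size and the final part of each object
    holds whatever remains (the source does not describe the split).
    """
    parts_per_object = math.ceil(object_mb / part_mb)
    last_part_mb = object_mb - (parts_per_object - 1) * part_mb
    return {
        "parts_per_object": parts_per_object,
        "last_part_mb": last_part_mb,
        "total_parts": num_objects * parts_per_object,
        # Assumption: decimal units, 1 GB = 1,000 MB.
        "total_gb": num_objects * object_mb / 1000,
    }


# Values taken from the benchmark setup in the article.
shape = dataset_shape(num_objects=10_000, object_mb=50, part_mb=15)
for key, value in shape.items():
    print(f"{key}: {value}")
```

Under these assumptions the test set is about 500 GB in total, with each object read as four parts.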

Comparing results across Ethernet and NVIDIA Quantum InfiniBand

Ethernet (TCP)

Sustained throughput capped at 180 GB/s across 16 nodes (11.25 GB/s/node or 2.81 GB/s/GPU). In this case, the Ethernet network in the lab where we conducted the testing limited our ability to hit the same per-node throughput as in our previous test.

RDMA (NVIDIA Quantum InfiniBand)

Sustained throughput capped at 449 GB/s across 16 nodes (28.06 GB/s/node or 7.02 GB/s/GPU).
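The per-node and per-GPU figures follow directly from the aggregate numbers. As a quick check, this sketch divides each aggregate throughput by the 16 nodes in the cluster and by the 4 GPUs per node that the article reports for these nodes; the function and variable names are illustrative only.

```python
NODES = 16          # cluster size from the benchmark setup
GPUS_PER_NODE = 4   # GPUs per node, as stated in the article


def per_node_and_gpu(aggregate_gbps: float,
                     nodes: int = NODES,
                     gpus_per_node: int = GPUS_PER_NODE) -> tuple[float, float]:
    """Convert aggregate cluster throughput (GB/s) to per-node and per-GPU rates."""
    per_node = aggregate_gbps / nodes
    per_gpu = per_node / gpus_per_node
    return per_node, per_gpu


# Aggregate sustained throughput reported for each transport.
aggregates = {"Ethernet (TCP)": 180.0, "RDMA (InfiniBand)": 449.0}

rates = {name: per_node_and_gpu(agg) for name, agg in aggregates.items()}
for name, (node, gpu) in rates.items():
    print(f"{name}: {node:.2f} GB/s/node, {gpu:.2f} GB/s/GPU")

# Ratio of the two aggregate results in this particular lab setup.
speedup = aggregates["RDMA (InfiniBand)"] / aggregates["Ethernet (TCP)"]
print(f"InfiniBand vs. Ethernet: {speedup:.2f}x")
```

The ratio of roughly 2.5x applies only to this test, where the lab Ethernet network was the limiting factor.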

Unlike Ethernet, the NVIDIA Quantum InfiniBand fabric easily handled full-fleet concurrency, demonstrating the scale advantage of RDMA transport. The graph below shows the sustained throughput of the 16 Blackwell Ultra nodes using InfiniBand, with 10 of the nodes listed in the table on the right. Throughput increases over time as objects are staged on the LOTA cache. The “Max” column shows a range of 25.6-36.9 GB/s per node, and the average was 28.06 GB/s per node. Since there are 4 GPUs in a GB300 node, this equates to an average of 7+ GB/s per GPU.

[Figure: Accelerated Byte Rate by Pod]

Throughput gains since March

When comparing these new results to our NVIDIA H100 benchmarks, here’s how CAIOS performance has advanced from 2 GB/s/GPU:

Fewer GPUs per node: 2 GB/s/GPU ➡ 4 GB/s/GPU

The H100 nodes have 8 GPUs each, while Blackwell Ultra nodes have 4 GPUs each. With per-node throughput being the same or higher, this doubles the throughput per GPU.

Moving to NVIDIA Quantum InfiniBand: 4 GB/s/GPU ➡ 6 GB/s/GPU

Because the LOTA cache in CAIOS is global across all nodes, increasing node-to-node throughput also increases overall CAIOS throughput. By moving this node-to-node data transfer to InfiniBand, we consistently saw an approximately 50% improvement in throughput. This was verified by separate H100 InfiniBand testing, where we saw similar speed increases compared to our March testing of H100s using Ethernet.

LOTA pipeline optimizations: 6 GB/s/GPU ➡ 7+ GB/s/GPU

Since March, we have made a number of optimizations in the way that LOTA reads and writes data, which resulted in an approximately 17% improvement in throughput. We verified this increase with recent LOTA testing on H100 nodes, compared against our March H100 test results.
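The three steps above compound multiplicatively. The sketch below chains the approximate factors quoted in the article (2x from halving GPUs per node, about 50% from InfiniBand, about 17% from LOTA optimizations) starting from the 2 GB/s/GPU baseline. Treating the rounded percentages as exact multipliers is an assumption made for illustration, and the names are hypothetical.

```python
BASELINE_GBPS_PER_GPU = 2.0  # March H100 result reported in the article

# (label, multiplicative factor) for each step, using the article's
# approximate figures as if they were exact.
STEPS = [
    ("Fewer GPUs per node (8 -> 4)", 8 / 4),
    ("Ethernet -> InfiniBand (~50%)", 1.50),
    ("LOTA pipeline optimizations (~17%)", 1.17),
]


def progression(baseline: float, steps: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Return the cumulative per-GPU throughput after each step."""
    rate = baseline
    rows = [("March H100 baseline", rate)]
    for label, factor in steps:
        rate *= factor
        rows.append((label, rate))
    return rows


rows = progression(BASELINE_GBPS_PER_GPU, STEPS)
for label, rate in rows:
    print(f"{label}: {rate:.2f} GB/s/GPU")
```

With these rounded factors the chain ends at about 7.02 GB/s/GPU, which lines up with the measured InfiniBand average.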

This cumulative progression highlights the scalability and adaptability of CAIOS across hardware generations and network fabrics.

Key takeaways from the research

With CAIOS, workloads scale seamlessly across fleets of GPUs. On NVIDIA Blackwell Ultra nodes with NVIDIA Quantum InfiniBand and pipelining enabled, we’re now sustaining 7+ GB/s per GPU, well beyond the 2 GB/s/GPU milestone we shared earlier this year. This new throughput further reduces training times, decreases inference Time to First Token (TTFT), and improves ETL efficiency. As we continue to optimize CAIOS, the gap between GPU compute power and data availability will continue to shrink, helping GPUs stay fully utilized even at the extreme concurrency and scale seen in training and inference use cases.

Read the original blog that outlines our initial benchmark methodology and results.

Learn more about the CoreWeave AI Cloud and how it can help you simplify infrastructure complexity so your team can focus their energy on AI innovation.

If you’re interested in learning more about CAIOS and how it can support your AI innovations, reach out and let’s talk.

