coreweave/nccl-tests
Shell
Description: NVIDIA NCCL Tests for Distributed Training
Language: Shell
Stars: 146
Forks: 32
Open issues: 6
Created: 2022-06-29T10:49:49Z
Pushed: 2026-06-10T16:04:56Z
Default branch: master
Fork: no
Archived: no
README:
NCCL for Distributed Training
CoreWeave supports the NVIDIA Collective Communication Library (NCCL) for powering multi-GPU and multi-node neural network training. NCCL underpins the vast majority of distributed training frameworks, such as DeepSpeed, PyTorch Distributed, and Horovod.
NCCL is supported across CoreWeave NVIDIA GPUs over Ethernet and InfiniBand. In addition, the specialized GB200 NVL72 clusters are built with NVIDIA Quantum-X800 InfiniBand networking and in-network collectives using NVIDIA SHARP to deliver the highest distributed training performance possible.
- [NCCL for Distributed Training](#nccl-for-distributed-training)
- [Docker Images](#docker-images)
- [Running NCCL Tests](#running-nccl-tests)
- [MPI Operator](#mpi-operator)
- [Running Jobs](#running-jobs)
- [Slurm](#slurm)
- [Running Jobs](#running-jobs-1)
- [Enroot](#enroot)
- [Running DeepSpeed Training Jobs](#running-deepspeed-training-jobs)
- [GDRCopy](#gdrcopy)
- [Expected Performance](#expected-performance)
- [GB200](#gb200)
- [Single Rack](#single-rack)
- [2 Racks](#2-racks)
- [20 Racks](#20-racks)
Docker Images
This repository includes Dockerfiles that can be used directly or as a template for your distributed training applications. The Dockerfiles include the following components:
- NVIDIA Mellanox OFED Driver userspace components. The kernel side is installed on our bare-metal nodes and does not need to be installed by users. The OFED drivers are necessary for optimized InfiniBand communication.
- NVIDIA HPC-X which is a packaging of OpenMPI and UCX
- NVIDIA HPC-X OpenMPI compiled with external PMIx to enable SLURM integration
- NVIDIA GDRCopy libraries leverage GPUDirect RDMA for improved GPU to host memory copy performance in certain applications. The kernel support for GDRCopy exists on CoreWeave's bare-metal nodes.
- NVIDIA NCCL SHARP Plugin for SHARP support in NCCL
- NVIDIA NCCL Tests for verification and benchmarking purposes
- NVIDIA DCGM for GPU tests and health checks
- NVIDIA bandwidthTest utility
- RDMA Perftest with GPUDirect
- OpenSSH server and related settings to enable images to easily be used as MPI Runners
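As a quick sanity check of the components listed above, the following sketch reports which command-line tools are visible on `PATH` inside a container. The tool names are the binaries these packages usually ship (`mpirun` from OpenMPI, `ucx_info` from UCX, `dcgmi` from DCGM, `ib_write_bw` from perftest). Whether each one is on `PATH` in a given image is an assumption to verify, and `has_tool` is a hypothetical helper, not part of this repository.

```bash
#!/usr/bin/env bash
# Report which of the bundled command-line tools are visible on PATH.
# Assumption: the tools are installed under their usual binary names.

# has_tool (hypothetical helper): succeeds if the named command is on PATH.
has_tool() {
  command -v "$1" >/dev/null 2>&1
}

# mpirun: OpenMPI (HPC-X), ucx_info: UCX, dcgmi: DCGM, ib_write_bw: perftest
for tool in mpirun ucx_info dcgmi ib_write_bw; do
  if has_tool "$tool"; then
    printf '%-12s found\n' "$tool"
  else
    printf '%-12s missing\n' "$tool"
  fi
done
```

The script only reports status and never exits non-zero, so it can be run in any image without aborting a larger job.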
CoreWeave also publishes images built from these Dockerfiles that can be used as a base for your own images. The images below include NCCL v2.30.4-1, HPC-X v2.26, and cuDNN v9.20.0.48-1. Each image is multi-arch and can be used for both linux/amd64 and linux/arm64 containers. Compute capabilities up to Blackwell (10.0 & 12.0) are supported.
Ubuntu 24.04
| Image Tag                                                                  | CUDA     |
|----------------------------------------------------------------------------|----------|
| ghcr.io/coreweave/nccl-tests:13.2.1-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c | 13.2.1   |
| ghcr.io/coreweave/nccl-tests:13.1.1-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c | 13.1.1   |
| ghcr.io/coreweave/nccl-tests:13.0.2-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c | 13.0.2   |
| ghcr.io/coreweave/nccl-tests:12.9.1-devel-ubuntu24.04-nccl2.30.4-1-2eedd7c | 12.9.1   |
Ubuntu 22.04
| Image Tag                                                                  | CUDA     |
|----------------------------------------------------------------------------|----------|
| ghcr.io/coreweave/nccl-tests:13.2.1-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 13.2.1   |
| ghcr.io/coreweave/nccl-tests:13.1.1-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 13.1.1   |
| ghcr.io/coreweave/nccl-tests:13.0.2-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 13.0.2   |
| ghcr.io/coreweave/nccl-tests:12.9.1-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 12.9.1   |
| ghcr.io/coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 12.8.1   |
| ghcr.io/coreweave/nccl-tests:12.6.3-devel-ubuntu22.04-nccl2.30.4-1-2eedd7c | 12.6.3   |
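The published tags follow a regular pattern: CUDA version, `devel`, Ubuntu release, NCCL version, and a build suffix. The sketch below composes an image reference from those parts. `image_ref` is a hypothetical helper, not part of this repository. The NCCL version and build suffix are copied from the tables above and will change with each release. The helper does not check that a combination is actually published (for example, CUDA 12.8.1 is listed only for Ubuntu 22.04).

```bash
#!/usr/bin/env bash
# Compose an image reference from the tag pattern used in the tables above.
# NCCL_VERSION and BUILD_SUFFIX are copied from the published tags; update
# them when new images are released.
NCCL_VERSION="2.30.4-1"
BUILD_SUFFIX="2eedd7c"

# image_ref (hypothetical helper): args are CUDA version and Ubuntu release.
image_ref() {
  local cuda="$1" ubuntu="$2"
  printf 'ghcr.io/coreweave/nccl-tests:%s-devel-ubuntu%s-nccl%s-%s\n' \
    "$cuda" "$ubuntu" "$NCCL_VERSION" "$BUILD_SUFFIX"
}

# Print the reference for CUDA 13.2.1 on Ubuntu 24.04.
image_ref 13.2.1 24.04
```

The printed reference can then be passed to `docker pull` or used in the `FROM` line of your own Dockerfile.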
Running NCCL Tests
There are many sample jobs in this repo showing how to run distributed NCCL tests, using the following workload managers:
MPI Operator
CoreWeave provides a managed instance of the MPI Operator to allow running MPI Jobs in a container-native fashion. No installation is required by the user; simply apply an MPIJob manifest in your namespace.
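Submitting a manifest might look like the sketch below. It assumes `kubectl` is configured for your cluster and that you are in a checkout of this repository. The `my-namespace` value is a placeholder, and the `run` wrapper with its `DRY_RUN` switch is an illustrative addition rather than anything this repository provides. By default the script only prints the commands.

```bash
#!/usr/bin/env bash
# Sketch: submit one of the example MPIJob manifests and list MPIJobs.
# DRY_RUN=1 (the default here) prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"
# Manifest from the repository's mpi-operator/ directory (H100 example).
MANIFEST="${MANIFEST:-mpi-operator/nccl-test-distributed-h100-64-mpijob.yaml}"
# Placeholder namespace; substitute your own.
NAMESPACE="${NAMESPACE:-my-namespace}"

# run (hypothetical wrapper): echo the command in dry-run mode, else run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run kubectl apply -n "$NAMESPACE" -f "$MANIFEST"
run kubectl get mpijobs -n "$NAMESPACE"
```

Keeping the dry-run default makes it easy to review the exact commands before they touch a cluster.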
Example manifests are provided in the mpi-operator/ directory. There you'll find the following examples of 64 GPU (8 node) runs:
- [A40](./mpi-operator/nccl-test-distributed-a40-64-mpijob.yaml)
- [A100](./mpi-operator/nccl-test-distributed-a100-64-mpijob.yaml)
- [A100 with GDRCopy](./mpi-operator/nccl-test-distributed-a100-64-gdrcopy-mpijob.yaml)
- [A100 without InfiniBand](./mpi-operator/nccl-test-distributed-a100-64-noib-mpijob.yaml)
- [A100 with SHARP](./mpi-operator/nccl-test-distributed-a100-64-sharp-mpijob.yaml)
- [H100](./mpi-operator/nccl-test-distributed-h100-64-mpijob.yaml)
- [H100 with SHARP](./mpi-operator/nccl-test-distributed-h100-64-sharp-mpijob.yaml)
- [B200](./mpi-operator/nccl-test-distributed-b200-64-mpijob.yaml)
- [B200 with SHARP](./mpi-operator/nccl-test-distributed-b200-64-sharp-mpijob.yaml)
- [B300](./mpi-operator/nccl-test-distributed-b300-64-mpijob.yaml)
- [B300 with SHARP](./mpi-operator/nccl-test-distributed-b300-64-sharp-mpijob.yaml)
- [GB200 NVL72](./mpi-operator/nccl-test-distributed-gb200-nvl72-mpijob.yaml)
- [GB200 128 GPU multi-rack](./mpi-operator/nccl-test-distributed-gb200-128-multirack-mpijob.yaml)
- [GB300 NVL72…