Repo: NVIDIA, published Aug 8, 2017

NVIDIA/nccl-tests

Cuda


Description: NCCL Tests

Language: Cuda

License: BSD-3-Clause

Stars: 1551

Forks: 378

Open issues: 161

Created: 2017-08-08T23:21:47Z

Pushed: 2026-06-09T00:21:13Z

Default branch: master

Fork: no

Archived: no

README:

NCCL Tests

These tests check both the performance and the correctness of NCCL operations.

Build

To build the tests, run make or make -j.

If CUDA is not installed in /usr/local/cuda, you may specify CUDA_HOME. Similarly, if NCCL is not installed in /usr, you may specify NCCL_HOME.

$ make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl

NCCL tests rely on MPI to work on multiple processes, hence multiple nodes. If you want to compile the tests with MPI support, you need to set MPI=1 and set MPI_HOME to the path where MPI is installed.

$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl

You can also add a suffix to the name of the generated binaries with NAME_SUFFIX. For example when compiling with the MPI versions you could use:

$ make MPI=1 NAME_SUFFIX=_mpi MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl

This will generate test binaries with names such as all_reduce_perf_mpi.

Usage

NCCL tests can run on multiple processes, multiple threads, and multiple CUDA devices per thread. The number of processes is managed by MPI and is therefore not passed to the tests as an argument. The total number of ranks (=CUDA devices) will be equal to (number of processes)*(number of threads)*(number of GPUs per thread).
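As a quick sanity check of the rank formula above, here is a minimal Python sketch. The function name `total_ranks` and its parameter names are illustrative only and are not part of nccl-tests.

```python
def total_ranks(num_processes: int, threads_per_process: int = 1,
                gpus_per_thread: int = 1) -> int:
    """Total ranks (one per CUDA device) for a given launch layout.

    Hypothetical helper: it mirrors the formula
    (number of processes) * (number of threads) * (number of GPUs per thread).
    """
    for name, value in (("num_processes", num_processes),
                        ("threads_per_process", threads_per_process),
                        ("gpus_per_thread", gpus_per_thread)):
        if value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")
    return num_processes * threads_per_process * gpus_per_thread


# One process, one thread, 8 GPUs per thread (-g 8).
print(total_ranks(1, threads_per_process=1, gpus_per_thread=8))
# 64 MPI processes, one thread each, 1 GPU per thread (-g 1).
print(total_ranks(64, threads_per_process=1, gpus_per_thread=1))
```

The process count comes from the MPI launcher, while the other two factors correspond to the -t and -g arguments.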

Quick examples

Run on a single node with 8 GPUs (-g 8), scanning from 8 bytes to 128 MiB (mebibytes), doubling between each test (-f 2):

$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

Run 64 MPI processes on nodes with 8 GPUs each, for a total of 64 GPUs spread across 8 nodes, scanning from 8 bytes to 8 GiB (gibibytes), doubling between each test (-f 2). (NB: the nccl-tests binaries must be compiled with MPI=1 for this case.)

$ mpirun -np 64 -N 8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

Performance

See the [Performance](doc/PERFORMANCE.md) page for an explanation of the reported numbers, in particular the "busbw" column.

Arguments

All tests support the same set of arguments :

  • Number of GPUs
      • -t,--nthreads number of threads per process. Default : 1.
      • -g,--ngpus number of GPUs per thread. Default : 1.
  • Sizes to scan
      • -b,--minbytes minimum size to start with. Default : 32M (Mebibytes).
      • -e,--maxbytes maximum size to end at. Default : 32M (Mebibytes).
      • Increments can be either fixed or a multiplication factor. Only one of those should be used.
          • -i,--stepbytes fixed increment between sizes. Default : 1M (Mebibytes).
          • -f,--stepfactor multiplication factor between sizes. Default : disabled.
  • NCCL operations arguments
      • -o,--op Specify which reduction operation to perform. Only relevant for reduction operations like Allreduce, Reduce or ReduceScatter. Default : Sum.
      • -d,--datatype Specify which datatype to use. Default : Float.
      • -r,--root Specify which root to use. Only for operations with a root like broadcast or reduce. Default : 0.
  • Performance
      • -n,--iters number of iterations. Default : 20.
      • -w,--warmup_iters number of warmup iterations (not timed). Default : 1.
      • -m,--agg_iters number of operations to aggregate together in each iteration. Default : 1.
      • -N,--run_cycles run & print each cycle. Default : 1; 0=infinite.
      • -a,--average Report performance as an average across all ranks (MPI=1 only). Default : 1.
  • Test operation
      • -p,--parallel_init use threads to initialize NCCL in parallel. Default : 0.
      • -c,--check perform the given count of iterations, checking correctness of results on each iteration. This can be quite slow on large numbers of GPUs. Default : 1.
      • -z,--blocking collective blocking: 1=wait for completion and barrier, 2=wait without barrier. Default : 0.
      • -G,--cudagraph Capture iterations as a CUDA graph and then replay the specified number of times. Default : 0.
      • -C,--report_cputime Report CPU time instead of latency. Default : 0.
      • -R,--local_register enable local (1) or symmetric (2) buffer registration on send/recv buffers. Default : 0.
      • -D,--device_implementation use custom device API implementation. Not every collective has a custom device API implementation (currently just all_reduce and alltoall). Default : 0 (use traditional NCCL host implementation). Note: values > 0 require symmetric memory registration (-R 2).
      • -V,--device_cta_count number of CTAs for device API implementation. Must be positive and less than 128. Default : 16.
      • -S,--report_timestamps Add timestamp ("%Y-%m-%d %H:%M:%S") to each performance report line. Default : 0.
      • -J,--output_file Write [JSON] output to filepath. Infer type from suffix (only json supported presently).
      • -T,--timeout timeout each test after the specified number of seconds. Default : disabled.
      • -M,--memory enable memory usage report. Default : 0.
      • -u,--unalign Misalign source and destination buffers. Default : 0.

Running multiple operations in parallel

NCCL tests can partition the set of GPUs into smaller sets, each executing the same operation in parallel. To split the GPUs, NCCL will compute a "color" for each rank, based on the NCCL_TESTS_SPLIT environment variable; all ranks with the same color then end up in the same group. The resulting group is printed next to each GPU at the beginning of the test.

NCCL_TESTS_SPLIT takes the following syntax: an operation followed by a value. Operation can be AND, OR, MOD or DIV. The &, |, %, and / symbols are also supported. The value can be either decimal, hexadecimal (prefixed by 0x) or binary (prefixed by 0b).

Setting NCCL_TESTS_SPLIT_MASK to a value is equivalent to setting NCCL_TESTS_SPLIT to "&" followed by that value.

Here are a few examples:

  • NCCL_TESTS_SPLIT="AND 0x7" or NCCL_TESTS_SPLIT="MOD 8": On systems with 8 GPUs, run 8 parallel operations, each with 1 GPU per node (purely communicating over the inter-node network)
  • NCCL_TESTS_SPLIT="OR 0x7" or NCCL_TESTS_SPLIT="DIV 8": On systems with 8 GPUs, run one operation per node, purely intra-node.
  • NCCL_TESTS_SPLIT="AND 0x1" or NCCL_TESTS_SPLIT="MOD 2": Run two operations, each operation using every other rank.
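
The grouping produced by these examples can be previewed with a small Python sketch. It assumes that a rank's color is simply the result of applying the operation to the rank number, which matches the examples above but has not been checked against the tests' source. The names `parse_split` and `group_ranks` are hypothetical.

```python
from collections import defaultdict

# Word and symbol forms of each operation, as listed in the README.
OPERATIONS = {
    "AND": lambda rank, value: rank & value,
    "&": lambda rank, value: rank & value,
    "OR": lambda rank, value: rank | value,
    "|": lambda rank, value: rank | value,
    "MOD": lambda rank, value: rank % value,
    "%": lambda rank, value: rank % value,
    "DIV": lambda rank, value: rank // value,
    "/": lambda rank, value: rank // value,
}


def parse_split(spec: str) -> tuple[str, int]:
    """Split a string such as 'AND 0x7' or '%2' into (operation, value)."""
    spec = spec.strip()
    for name in OPERATIONS:
        if spec.upper().startswith(name):
            # Base 0 lets int() accept decimal, 0x (hex) and 0b (binary) values.
            return name, int(spec[len(name):].strip(), 0)
    raise ValueError(f"unrecognized split specification: {spec!r}")


def group_ranks(spec: str, num_ranks: int) -> dict[int, list[int]]:
    """Map each color to the ranks that share it (assumed color = rank op value)."""
    name, value = parse_split(spec)
    groups = defaultdict(list)
    for rank in range(num_ranks):
        groups[OPERATIONS[name](rank, value)].append(rank)
    return dict(groups)


# Two nodes with 8 GPUs each (16 ranks in total).
print(group_ranks("AND 0x7", 16))  # 8 groups, each with one GPU per node
print(group_ranks("DIV 8", 16))    # 2 groups, one per node
print(group_ranks("MOD 2", 4))     # 2 groups, each using every other rank
```

Under this assumption, "OR 0x7" and "DIV 8" yield different color values but the same partition, which is why the README lists them as alternatives.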

Note that the reported bandwidth is per group, hence to get the total bandwidth used…
