
NVIDIA/nccl v2.30.7-1


NCCL v2.30.7-1 Release

Repository: NVIDIA/nccl

Tag: v2.30.7-1

Published: 2026-06-04T22:33:08Z

Prerelease: no

Release notes:

Zero-SM Collectives

  • Adds hierarchical zero-SM collectives (AllGather and All2all) that use RMA CPU proxy for inter-node communication and Copy Engines for intra-node communication.
  • Enables better overlap of compute and communication.
  • Hierarchical zero-SM collectives are enabled with the NCCL_CTA_POLICY_ZERO flag.
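
To make the word "hierarchical" concrete, the following is a minimal pure-Python simulation of a two-phase AllGather. It is a conceptual sketch only, not NCCL code: the function name, the rank layout (rank = node * gpus_per_node + local index), and the ordering of the two phases are assumptions chosen for illustration. In the release, the inter-node phase would go through the RMA CPU proxy and the intra-node phase through Copy Engines.

```python
_EMPTY = object()  # marks a receive slot that has not been filled yet


def hierarchical_allgather(chunks, gpus_per_node):
    """Simulate a two-phase AllGather; chunks[r] is rank r's contribution.

    Returns one receive buffer per rank, each ordered by global rank.
    """
    world = len(chunks)
    if gpus_per_node <= 0 or world % gpus_per_node != 0:
        raise ValueError("world size must be a multiple of gpus_per_node")
    nodes = world // gpus_per_node

    # Every rank starts with only its own chunk in place.
    recv = [[_EMPTY] * world for _ in range(world)]
    for r in range(world):
        recv[r][r] = chunks[r]

    # Phase 1 (inter-node): ranks sharing a local index exchange chunks
    # across nodes. Assumed here to model the proxy-driven network step.
    for r in range(world):
        local = r % gpus_per_node
        for n in range(nodes):
            peer = n * gpus_per_node + local
            recv[r][peer] = chunks[peer]

    # Phase 2 (intra-node): each rank copies everything its node-local
    # peers collected in phase 1. Assumed to model the Copy Engine step.
    snapshot = [list(buf) for buf in recv]
    for r in range(world):
        node = r // gpus_per_node
        for l in range(gpus_per_node):
            peer = node * gpus_per_node + l
            for slot, value in enumerate(snapshot[peer]):
                if value is not _EMPTY:
                    recv[r][slot] = value
    return recv


if __name__ == "__main__":
    # 2 nodes x 4 GPUs: every rank ends with all 8 chunks in rank order.
    buffers = hierarchical_allgather(list("abcdefgh"), gpus_per_node=4)
    print(buffers[0])
```

The point of the split is that each rank sends only its own chunk across the network, while the larger fan-out happens inside the node, where no streaming multiprocessors need to be involved.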

GIN Enhancements

  • Adds new experimental GPU Push Interface (GPI) backend for GIN.
  • Adds explicit signal semantics with Strong and Weak signals.
  • Adds proper ncclGinFenceLevel semantics for barriers.
  • Adds separate NCCL_GIN_IB_TC toggle to control traffic class used by GIN.
  • Adds NCCL_GIN_RESOURCE_SHARING_THREAD to enable more optimizations.
  • Optimizes QP overhead, including GDAKI mode when counters are not used.
  • Ensures GIN is usable when NIC fusion is enabled.
  • Adds GIN plugin example in plugins/gin/example.

Symmetric Memory Improvements

  • Restructures RMA plugin architecture.
  • Adds support for asymmetric buffer sizes during window registration.
  • Optimizes ReduceScatter symmetric kernel performance.
  • Optimizes performance for RMA operations using CE.
  • Adds batched CE operations to improve performance in the RMA CE put/wait path.
  • Adds support for window registration during CUDA graph capture.

MPS with MLOPart Support (Experimental)

  • NCCL now leverages the CUDA Memory Locality Optimized Partition (MLOPart) feature.
  • Supports up to 2 ranks per physical GPU with MPS+MLOPart.

Other Improvements

  • Adds support for IB ports that require global route headers (GRH).
  • Adds logic to gin.flush to ensure all prior gets are visible.
  • Adds Makefile support to compile Python wheels from source.
  • Adds the NCCL_RMA_DISABLE environment variable to enable/disable RMA (GitHub PR #2151).
  • Implements reset-without-zeroing for signals and counters in GIN (GitHub PR #2155).
  • Pins GIN proxy thread to NUMA-local CPU set (GitHub PR #2182).
  • Adds optimized weight transfer APIs in contrib/nccl_xfer.
  • Adds custom kernels in contrib/custom_algos for alltoall and allreduce using NCCL Device API.
  • Adds examples of Root Mean Square Normalization (RMSNorm), demonstrating the fusion of computation and communication using the device API.
  • Unifies coding style by using clang-format. Please see docs/dev_guide/nccl_coding_style.md for more details.
  • Drops support for v11 and v12 GIN plugin APIs.
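
The environment variables named in these notes (NCCL_RMA_DISABLE, NCCL_GIN_IB_TC) are read by NCCL from the process environment, so one way to apply them is to build the environment for a launched job explicitly. The helper below is a hedged sketch: the function name and the launched script are hypothetical, and the value formats are assumptions not confirmed by the notes ("1" to disable RMA, and an integer traffic class by analogy with the existing NCCL_IB_TC variable). Check the NCCL documentation for the accepted values.

```python
import os


def nccl_env(base=None, *, disable_rma=False, gin_ib_tc=None):
    """Return a copy of the environment with selected NCCL settings applied.

    The value formats are assumptions: "1" for NCCL_RMA_DISABLE and a
    decimal integer for NCCL_GIN_IB_TC.
    """
    env = dict(os.environ if base is None else base)
    if disable_rma:
        env["NCCL_RMA_DISABLE"] = "1"  # assumed: a truthy value turns RMA off
    if gin_ib_tc is not None:
        env["NCCL_GIN_IB_TC"] = str(int(gin_ib_tc))  # assumed integer format
    return env


# Usage (train.py is a hypothetical script):
#   import subprocess
#   subprocess.run(["python", "train.py"], env=nccl_env(disable_rma=True))
```

Passing a copied environment to the child process keeps the settings scoped to that job instead of changing the parent shell or launcher.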

Bug Fixes

  • Fixes a deadlock caused by CUDA stream allocation under PXN when memsetting a buffer at runtime.
  • Reintroduces cudaGridDependencySynchronize in built-in symmetric kernels, ensuring that newly launched kernels cannot access memory modified by prior kernels before it reaches the point of coherency.
  • Ignores system headers in include/header processing, thereby avoiding excessive realpath calls in some builds (GitHub PR #1806).
  • Improves QP load balancing on systems configured with RoCE LAG with the round-robin queue affinity policy (GitHub PR #2150).
  • Fixes an issue where receiving an external TCP request causes the proxy thread's ncclProxyService to hang (GitHub PR #1834).
  • Fixes rma_proxy MR registration type for host-NUMA cpuAccessSignals, which ensures that the net plugin does not reject the registration due to wrong memory type (GitHub PR #2187).
  • Fixes a GIN init context leak (GitHub PR #2179).
  • Fixes an issue with one-sided host APIs when a custom GIN plugin is used.
  • Fixes a one-sided host API issue where requests are dropped at a high message rate (GitHub Issue #2119).

Acknowledgements

We thank the following contributors for their work on this release:

@andrewjcg, @baymaxhuang, @bhasunit, @fishautumn, @mozarhua, @ngoyal2707, @wanglei875 for your PRs.

We also thank the community for issue reports, testing, and feedback.

Known Issues

  • NCCL one-sided host RMA APIs, e.g., ncclPutSignal, require every rank to call the API as a one-time initialization warm-up. This will be fixed in an upcoming release.
  • NCCL one-sided RMA operations have a possible corruption issue when multiple symmetric windows are carved from the same backing memory allocation. See https://github.com/NVIDIA/nccl/issues/2198. This has been fixed on the dev branch.
