What does this release signal mean?

NVIDIA published NCCL v2.30.7-1 in the NVIDIA/nccl repository. NCCL is NVIDIA's library for GPU collective communication operations. This release signal is evidence of what shipped, changed, or was packaged for users. High-signal details: tag v2.30.7-1 · published 2026-06-04T22:33:08Z · prerelease: no · release notes open with Zero-SM Collectives ("Adds hierarchical...."). onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/nccl v2.30.7-1

Captured source

GitHub · github.com/NVIDIA/nccl

NVIDIA/nccl v2.30.7-1

published Jun 4, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain

NCCL v2.30.7-1 Release

Repository: NVIDIA/nccl

Tag: v2.30.7-1

Published: 2026-06-04T22:33:08Z

Prerelease: no
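
The metadata fields above can be turned into comparable values with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and reading the trailing `-1` in the tag as a fourth numeric build field is an assumption, not something the release page states.

```python
import re
from datetime import datetime

def parse_nccl_tag(tag: str) -> tuple:
    """Split a tag like 'v2.30.7-1' into four integers.

    Assumption: the suffix after '-' is a numeric build/packaging field.
    """
    m = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)-(\d+)", tag)
    if m is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    return tuple(int(g) for g in m.groups())

def parse_published(ts: str) -> datetime:
    """Parse the ISO 8601 timestamp from the 'Published' field."""
    # Swap the trailing 'Z' for an explicit offset so this also works
    # on Python versions older than 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

version = parse_nccl_tag("v2.30.7-1")
published = parse_published("2026-06-04T22:33:08Z")
```

Tuples compare element by element, so `version >= (2, 30, 0, 0)` is one way to gate on a minimum release.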

Release notes:

Zero-SM Collectives

Adds hierarchical zero-SM collectives (AllGather and All2all) that use RMA CPU proxy for inter-node communication and Copy Engines for intra-node communication.
Enables better overlap of compute and communication.
Enable hierarchical zero-SM collectives with the NCCL_CTA_POLICY_ZERO flag.
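
To illustrate what "hierarchical" means here, the following pure-Python simulation splits an AllGather into an inter-node phase and an intra-node phase. It makes no NCCL calls. The phase order, the peer selection, and the function name are assumptions for illustration; the release notes do not describe how NCCL schedules the two phases.

```python
def hierarchical_allgather(chunks, ranks_per_node):
    """Simulate a two-phase AllGather with one input chunk per rank.

    Phase 1 (inter-node): ranks that share a local index exchange chunks
    across nodes. Phase 2 (intra-node): ranks on the same node share
    everything they hold. Entry r of the result is rank r's final buffer.
    """
    n = len(chunks)
    if n % ranks_per_node != 0:
        raise ValueError("rank count must be a multiple of ranks_per_node")
    num_nodes = n // ranks_per_node

    # Phase 1: each rank collects chunks from same-local-index peers.
    held = []
    for r in range(n):
        local = r % ranks_per_node
        peers = [node * ranks_per_node + local for node in range(num_nodes)]
        held.append({p: chunks[p] for p in peers})

    # Phase 2: each rank merges what every rank on its own node holds.
    result = []
    for r in range(n):
        node = r // ranks_per_node
        merged = {}
        for peer in range(node * ranks_per_node, (node + 1) * ranks_per_node):
            merged.update(held[peer])
        result.append([merged[i] for i in range(n)])
    return result

out = hierarchical_allgather(["a", "b", "c", "d"], ranks_per_node=2)
```

In this model each rank sends its chunk over the network only to peers with the same local index; the remaining data arrives through local copies.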

GIN Enhancements

Adds new experimental GPU Push Interface (GPI) backend for GIN.
Adds explicit signal semantics with Strong and Weak signals.
Adds proper ncclGinFenceLevel semantics for barriers.
Adds separate NCCL_GIN_IB_TC toggle to control traffic class used by GIN.
Adds NCCL_GIN_RESOURCE_SHARING_THREAD to enable more optimizations.
Optimizes QP overhead, including GDAKI mode when counters are not used.
Ensures GIN is usable when NIC fusion is enabled.
Adds GIN plugin example in plugins/gin/example.
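
The two new GIN variables are set through the process environment like other NCCL settings. The sketch below builds such an environment for a child process. The helper name is hypothetical, and the values shown (an integer traffic class, `1` for the resource-sharing thread) are placeholders: the release notes do not document the accepted formats.

```python
import os

def with_nccl_overrides(overrides, base_env=None):
    """Return a new environment dict with NCCL_* overrides applied.

    Values are stringified because environment variables are strings.
    """
    bad = [name for name in overrides if not name.startswith("NCCL_")]
    if bad:
        raise ValueError(f"not NCCL variables: {bad}")
    env = dict(os.environ if base_env is None else base_env)
    env.update({name: str(value) for name, value in overrides.items()})
    return env

# Placeholder values; the accepted formats are assumptions.
env = with_nccl_overrides(
    {"NCCL_GIN_IB_TC": 106, "NCCL_GIN_RESOURCE_SHARING_THREAD": 1},
    base_env={"PATH": "/usr/bin"},
)
```

The resulting dict can be passed as `env=` to `subprocess.run` when launching an NCCL application.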

Symmetric Memory Improvements

Restructures RMA plugin architecture.
Adds support for asymmetric buffer sizes during window registration.
Optimizes ReduceScatter symmetric kernel performance.
Optimizes performance for RMA operations using CE.
Adds batched CE operations to improve performance in the RMA CE put/wait path.
Adds support for window registration during CUDA graph capture.

MPS with MLOPart Support (Experimental)

NCCL now leverages the CUDA feature Memory Locality Optimized Partition (MLOPart).
Supports up to 2 ranks per physical GPU with MPS+MLOPart.
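
A launcher could check the stated limit of 2 ranks per physical GPU before starting a job. This is a minimal sketch under assumptions: the `rank_to_gpu` mapping and the function name are hypothetical, and it validates only the count, not any MPS or MLOPart configuration.

```python
from collections import Counter

MAX_RANKS_PER_GPU = 2  # limit stated in the release notes for MPS+MLOPart

def check_placement(rank_to_gpu):
    """Raise ValueError if any physical GPU hosts more ranks than allowed.

    rank_to_gpu maps a rank id to a physical GPU index (hypothetical input).
    """
    counts = Counter(rank_to_gpu.values())
    over = {gpu: c for gpu, c in counts.items() if c > MAX_RANKS_PER_GPU}
    if over:
        raise ValueError(f"too many ranks on GPUs: {over}")

# Four ranks on two GPUs, two ranks each: within the limit.
check_placement({0: 0, 1: 0, 2: 1, 3: 1})
```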

Other Improvements

Adds support for IB ports that require global route headers (GRH).
Adds logic to gin.flush to ensure all prior gets are visible.
Adds makefile support to compile python wheels from source.
Adds the NCCL_RMA_DISABLE environment variable to enable or disable RMA (GitHub PR #2151).
Implements reset-without-zeroing for signals and counters in GIN (GitHub PR #2155).
Pins the GIN proxy thread to the NUMA-local CPU set (GitHub PR #2182).
Adds optimized weight transfer APIs in contrib/nccl_xfer.
Adds custom kernels in contrib/custom_algos for alltoall and allreduce using NCCL Device API.
Adds examples of Root Mean Square Normalization (RMSNorm), demonstrating the fusion of computation and communication using the device API.
Unifies coding style by using clang-format. Please see docs/dev_guide/nccl_coding_style.md for more details.
Drops support for v11 and v12 GIN plugin APIs.
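
For the RMSNorm item, the sketch below shows why the operation is a natural candidate for fusing computation with communication: when the hidden dimension is split across ranks, the only cross-rank step is a sum of squares. This is a pure-Python stand-in, not the device-API example shipped with the release. Sharding along the hidden dimension and the plain Python sum standing in for an AllReduce are both assumptions.

```python
import math

def rmsnorm(x, gain, eps=1e-6):
    """Reference RMSNorm: y_i = gain_i * x_i / sqrt(mean(x^2) + eps)."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v * scale for v, g in zip(x, gain)]

def sharded_rmsnorm(x_shards, gain_shards, eps=1e-6):
    """RMSNorm over a vector split across ranks, one shard per rank."""
    # Each rank computes a local sum of squares over its shard.
    local_sums = [sum(v * v for v in shard) for shard in x_shards]
    # Stand-in for AllReduce(sum): every rank would receive this total.
    total = sum(local_sums)
    hidden = sum(len(shard) for shard in x_shards)
    scale = 1.0 / math.sqrt(total / hidden + eps)
    # Each rank normalizes only its own shard with the shared scale.
    return [[g * v * scale for v, g in zip(xs, gs)]
            for xs, gs in zip(x_shards, gain_shards)]

full = rmsnorm([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])
shards = sharded_rmsnorm([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]])
```

Flattening `shards` gives the same values as `full`, which is the property a fused kernel has to preserve.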

Bug Fixes

Fixes a deadlock caused by CUDA stream allocation under PXN when memsetting a buffer at runtime.
Reintroduces cudaGridDependencySynchronize in built-in symmetric kernels, ensuring that newly launched kernels cannot access memory modified by prior kernels before it reaches the point of coherency.
Ignores system headers in include/header processing, thereby avoiding excessive realpath calls in some builds (GitHub PR #1806).
Improves QP load balancing on systems configured with RoCE LAG with the round-robin queue affinity policy (GitHub PR #2150).
Fixes an issue where receiving an external TCP request causes the proxy thread's ncclProxyService to hang (GitHub PR #1834).
Fixes the rma_proxy MR registration type for host-NUMA cpuAccessSignals, which ensures that the net plugin does not reject the registration due to the wrong memory type (GitHub PR #2187).
Fixes a GIN init context leak (GitHub PR #2179).
Fixes an issue with one-sided host APIs when a custom GIN plugin is used.
Fixes a one-sided host API issue where requests are dropped at a high message rate (GitHub Issue #2119).

Acknowledgements

We thank the following contributors for their work on this release:

@andrewjcg, @baymaxhuang, @bhasunit, @fishautumn, @mozarhua, @ngoyal2707, @wanglei875 for your PRs.

We also thank the community for issue reports, testing, and feedback.

Known Issues

NCCL one-sided host RMA APIs, e.g., ncclPutSignal, require every rank to call the API as a one-time initialization warm-up. This will be fixed in an upcoming release.
NCCL one-sided RMA operations have a possible corruption issue when multiple symmetric windows are carved from the same backing memory allocation. See https://github.com/NVIDIA/nccl/issues/2198. This has been fixed on the dev branch.

Notability

notability 2.0/10

Routine patch release, no major news.