NVIDIA/nccl v2.30.7-1
NCCL v2.30.7-1 Release
Repository: NVIDIA/nccl
Tag: v2.30.7-1
Published: 2026-06-04T22:33:08Z
Prerelease: no
Release notes:
Zero-SM Collectives
- Adds hierarchical zero-SM collectives (AllGather and All2all) that use RMA CPU proxy for inter-node communication and Copy Engines for intra-node communication.
- Enables better overlap of compute and communication.
- Enable hierarchical zero-SM collectives with the NCCL_CTA_POLICY_ZERO flag.
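To make the word "hierarchical" concrete, the following is a minimal pure-Python simulation of a two-phase AllGather: an inter-node exchange between ranks that share a local index, followed by an intra-node gather. It is a sketch under stated assumptions, not NCCL's implementation. The function name, the rank layout (rank = node * gpus_per_node + local index), and the ordering of the two phases are all hypothetical, and nothing here touches a GPU, the RMA CPU proxy, or the Copy Engines.

```python
def hierarchical_allgather(chunks, gpus_per_node):
    """Simulate a two-phase AllGather; chunks[r] is the data owned by rank r.

    Hypothetical layout: rank = node * gpus_per_node + local_index.
    Returns, for every rank, the list of all chunks in rank order.
    """
    world = len(chunks)
    if gpus_per_node <= 0 or world % gpus_per_node != 0:
        raise ValueError("world size must be a multiple of gpus_per_node")
    nodes = world // gpus_per_node

    # Phase 1 (inter-node): each rank collects the chunks of the ranks that
    # have the same local index on every node (including its own chunk).
    after_inter = []
    for rank in range(world):
        local = rank % gpus_per_node
        held = {}
        for node in range(nodes):
            peer = node * gpus_per_node + local
            held[peer] = chunks[peer]
        after_inter.append(held)

    # Phase 2 (intra-node): each rank merges what the ranks on its own node
    # hold after phase 1, which covers every source rank exactly once.
    result = []
    for rank in range(world):
        node = rank // gpus_per_node
        merged = {}
        for local in range(gpus_per_node):
            merged.update(after_inter[node * gpus_per_node + local])
        result.append([merged[src] for src in range(world)])
    return result


if __name__ == "__main__":
    data = [f"chunk{r}" for r in range(6)]  # 3 nodes x 2 GPUs
    gathered = hierarchical_allgather(data, gpus_per_node=2)
    print(all(view == data for view in gathered))
```

In this model each rank performs only (nodes - 1) inter-node transfers, and the remaining data moves within the node, which is the property that lets the two phases use different transports.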
GIN Enhancements
- Adds new experimental GPU Push Interface (GPI) backend for GIN.
- Adds explicit signal semantics with Strong and Weak signals.
- Adds proper ncclGinFenceLevel semantics for barriers.
- Adds separate NCCL_GIN_IB_TC toggle to control traffic class used by GIN.
- Adds NCCL_GIN_RESOURCE_SHARING_THREAD to enable more optimizations.
- Optimizes QP overhead, including GDAKI mode when counters are not used.
- Ensures GIN is usable when NIC fusion is enabled.
- Adds GIN plugin example in plugins/gin/example.
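The two GIN environment variables named above would typically be set in the environment of the launched job. The sketch below builds such an environment in Python. The helper name build_gin_env is hypothetical, and the accepted values are assumptions: the release notes do not state them, so the code assumes NCCL_GIN_IB_TC takes an integer traffic class (0 to 255, the width of the InfiniBand traffic class field) and that NCCL_GIN_RESOURCE_SHARING_THREAD takes 1 or 0. Check the NCCL documentation for the actual semantics before relying on either.

```python
import os


def build_gin_env(base=None, ib_tc=None, resource_sharing_thread=None):
    """Return a copy of an environment mapping with GIN variables applied.

    base: mapping to start from (defaults to the current process environment).
    ib_tc: ASSUMED to be an integer traffic class in 0..255.
    resource_sharing_thread: ASSUMED to be a boolean encoded as "1" or "0".
    """
    env = dict(os.environ if base is None else base)

    if ib_tc is not None:
        # The InfiniBand traffic class is an 8-bit field.
        if not 0 <= ib_tc <= 255:
            raise ValueError("traffic class must be in the range 0..255")
        env["NCCL_GIN_IB_TC"] = str(ib_tc)

    if resource_sharing_thread is not None:
        # ASSUMPTION: the variable is a 1/0 switch; the notes do not say.
        env["NCCL_GIN_RESOURCE_SHARING_THREAD"] = (
            "1" if resource_sharing_thread else "0"
        )

    return env


if __name__ == "__main__":
    env = build_gin_env(base={}, ib_tc=106, resource_sharing_thread=True)
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

The returned mapping can be passed as the env argument of subprocess.run when starting the application, which keeps the settings scoped to that job rather than the whole shell.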
Symmetric Memory Improvements
- Restructures RMA plugin architecture.
- Adds support for asymmetric buffer sizes during window registration.
- Optimizes ReduceScatter symmetric kernel performance.
- Optimizes performance for RMA operations using CE.
- Adds batched CE operations to improve performance in the RMA CE put/wait path.
- Adds support for window registration during CUDA graph capture.
MPS with MLOPart Support (Experimental)
- NCCL now leverages CUDA feature Memory Locality Optimized Partition (MLOPart).
- Supports up to 2 ranks per physical GPU with MPS+mlopart.
Other Improvements
- Adds support for IB ports that require global route headers (GRH).
- Adds logic to gin.flush to ensure all prior gets are visible.
- Adds makefile support to compile python wheels from source.
- Adds NCCL_RMA_DISABLE env to enable/disable RMA (Github PR #2151).
- Implements reset-without-zeroing for signals and counters in GIN (Github PR #2155).
- Pins GIN proxy thread to NUMA-local CPU set (Github PR #2182).
- Adds optimized weight transfer APIs in contrib/nccl_xfer.
- Adds custom kernels in contrib/custom_algos for alltoall and allreduce using NCCL Device API.
- Adds examples of Root Mean Square Normalization (RMSNorm), demonstrating the fusion of computation and communication using the device API.
- Unifies coding style by using clang-format. Please see docs/dev_guide/nccl_coding_style.md for more details.
- Drops support for v11 and v12 GIN plugin APIs.
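For readers unfamiliar with RMSNorm, the sketch below shows why it is a natural candidate for fusing computation with communication. RMSNorm scales a vector x by 1 / sqrt(mean(x^2) + eps) and an elementwise gain. If the hidden dimension is split across ranks, each rank can compute a local sum of squares, but the normalization needs the global sum, so a reduction sits in the middle of the computation. This is a pure-Python illustration under stated assumptions: the sharding scheme, the function names, and the use of a plain sum as a stand-in for an all-reduce are hypothetical, and the examples shipped with NCCL may structure the fusion differently.

```python
import math


def rmsnorm(x, gain, eps=1e-6):
    """Reference RMSNorm on one unsharded vector."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * g for v, g in zip(x, gain)]


def sharded_rmsnorm(x_shards, gain_shards, eps=1e-6):
    """RMSNorm where x_shards[r] is the slice of the vector held by rank r."""
    # Step 1 (compute, per rank): local sum of squares.
    local_sums = [sum(v * v for v in shard) for shard in x_shards]

    # Step 2 (communication): stand-in for an all-reduce(sum) across ranks.
    global_sum = sum(local_sums)
    hidden = sum(len(shard) for shard in x_shards)

    # Step 3 (compute, per rank): every rank applies the same global scale.
    scale = 1.0 / math.sqrt(global_sum / hidden + eps)
    return [
        [v * scale * g for v, g in zip(xs, gs)]
        for xs, gs in zip(x_shards, gain_shards)
    ]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    gain = [1.0, 1.0, 1.0, 1.0]
    full = rmsnorm(x, gain)
    parts = sharded_rmsnorm([x[:2], x[2:]], [gain[:2], gain[2:]])
    flat = [v for part in parts for v in part]
    print(all(math.isclose(a, b) for a, b in zip(full, flat)))
```

A fused kernel would perform the reduction in step 2 from inside the same kernel that runs steps 1 and 3, instead of launching a separate collective between two compute kernels.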
Bug Fixes
- Fixes a deadlock caused by CUDA stream allocation under PXN when memsetting a buffer at runtime.
- Reintroduces cudaGridDependencySynchronize in built-in symmetric kernels, ensuring that newly launched kernels cannot access memory modified by prior kernels before it reaches the point of coherency.
- Ignores system headers in include/header processing, thereby avoiding excessive realpath calls in some builds (Github PR #1806).
- Improves QP load balancing on systems configured with RoCE LAG with the round-robin queue affinity policy (Github PR #2150).
- Fixes issue when receiving an external TCP request causes the proxy thread's ncclProxyService to hang (Github PR #1834).
- Fixes rma_proxy MR registration type for host-NUMA cpuAccessSignals, which ensures that the net plugin does not reject the registration due to wrong memory type (Github PR #2187).
- Fixes GIN init context leak (Github PR #2179).
- Fixes issue with one-sided host APIs when a custom GIN plugin is used.
- Fixes one-sided host API issue where requests are dropped at a high message rate (Github Issue #2119).
Acknowledgements
We thank the following contributors for their work on this release:
@andrewjcg, @baymaxhuang, @bhasunit, @fishautumn, @mozarhua, @ngoyal2707, @wanglei875 for your PRs.
We also thank the community for issue reports, testing, and feedback.
Known Issues
- NCCL one-sided host RMA APIs, e.g., ncclPutSignal, require every rank to call the API as a one-time initialization warm-up. This will be fixed in an upcoming release.
- NCCL one-sided RMA operations have a possible corruption issue when multiple symmetric windows are carved from the same backing memory allocation. See https://github.com/NVIDIA/nccl/issues/2198. This has been fixed on dev branch.