Release | NVIDIA | published Apr 15, 2026

NVIDIA/nccl v2.30.3-1

NCCL v2.30.3-1 Release

Repository: NVIDIA/nccl

Tag: v2.30.3-1

Published: 2026-04-15T03:03:54Z

Prerelease: no

Release notes:

Device API and GIN Enhancements

  • GIN contexts are no longer shared between device communicators backed by the same host communicator.
  • Adds per-context resource sharing modes for GIN, allowing GPU-scoped or CTA-scoped resource sharing.
  • Adds TrafficClass support to device communicator.
  • Adds versioning to ncclDevComm.
  • Adds timeout support to the device APIs.
  • Adds max_rd_atomic and max_dest_rd_atomic support in GIN.
  • Upgrades doca-gpunetio to v2.0.2-rc1.

Elastic Buffers (LSA support)

  • Supports new use cases where large tensors are split into multi-segment windows, with the active region in GPU memory and the remainder in host memory.
  • Enables larger effective models and reduces memory pressure during spilling.
  • Elastic buffers will support GIN in a future release.

gin.get with Nonblocking Flush (Experimental)

  • Supports GPU-initiated gets and lets callers check for completion without stalling.
  • Currently works only with GDAKI (not with the CPU proxy) and is not supported on directNIC or Ampere.

Symmetric Memory Improvements

  • Adds AVG operator to ReduceScatter Symmetric kernels.
  • Enables dynamic memory offload with group support for single-process, multi-GPU scenarios.
  • Adds support for GPU-only multi-segment registration for symmetric windows.
  • Adds CUDA graph capture and replay support for ncclPutSignal and ncclWaitSignal APIs.
  • One-sided RMA can now use an external network plugin.
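
To make the AVG operator concrete, here is a minimal pure-Python reference of what a ReduceScatter with averaging computes: the buffers from all ranks are averaged element-wise, and rank r keeps only chunk r of the result. This is a sketch of the semantics only, not NCCL code; the function name reduce_scatter_avg is hypothetical, and the release notes do not describe how the symmetric kernels implement it.

```python
from typing import List


def reduce_scatter_avg(inputs: List[List[float]]) -> List[List[float]]:
    """Reference semantics only: rank r receives chunk r of the element-wise mean.

    `inputs[r]` is the send buffer contributed by rank r (hypothetical helper,
    not part of NCCL).
    """
    nranks = len(inputs)
    count = len(inputs[0])
    if any(len(buf) != count for buf in inputs):
        raise ValueError("all ranks must contribute buffers of equal length")
    if count % nranks != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = count // nranks
    # Element-wise average across all ranks.
    avg = [sum(buf[i] for buf in inputs) / nranks for i in range(count)]
    # Scatter: rank r keeps only its own contiguous chunk.
    return [avg[r * chunk:(r + 1) * chunk] for r in range(nranks)]


if __name__ == "__main__":
    ranks = [[1.0, 2.0, 3.0, 4.0], [3.0, 6.0, 5.0, 8.0]]
    print(reduce_scatter_avg(ranks))  # [[2.0, 4.0], [4.0, 6.0]]
```

A reference like this can be handy for checking the output of a real collective on small inputs, within floating-point tolerance.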

Tensor Memory Accelerator (TMA) Support

  • Adds TMA support in select built-in symmetric kernels to offload bulk peer‑to‑peer copies and reductions, improving NVLink bandwidth and latency.
  • Can be enabled with NCCL_SYM_TMA_ENABLE=1.

DDP Support

  • Enables Dynamic Direct Path (DDP) so that NCCL can take advantage of hardware multipath and out‑of‑order receive for higher network performance on supported systems.
  • Can be enabled with NCCL_IB_OOO_RQ=1.

Port Recovery

  • Adds support for IB port recovery in NCCL.
  • Improves NCCL’s ability to recover from transient network issues so communicators can continue operating without full re‑initialization.
  • Can be enabled with NCCL_IB_RESILIENCY_PORT_RECOVERY=1.

Cross Clique Support

  • Adds support for treating multiple cliques as the same NVLink domain.
  • Can be enabled with NCCL_MNNVL_CROSS_CLIQUE=1.
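
TMA, DDP, port recovery, and cross clique support are all opt-in through environment variables. The sketch below shows one way a launcher script might assemble the environment for a job. The variable names come from these notes; the helper nccl_env, the short feature keys, and the binary name in the usage note are hypothetical.

```python
import os
from typing import Dict, Iterable, Optional

# Opt-in variables named in the release notes; each is enabled by setting it to "1".
# The short keys on the left are illustrative labels, not NCCL terminology.
OPT_IN_FLAGS = {
    "tma": "NCCL_SYM_TMA_ENABLE",
    "ddp": "NCCL_IB_OOO_RQ",
    "port_recovery": "NCCL_IB_RESILIENCY_PORT_RECOVERY",
    "cross_clique": "NCCL_MNNVL_CROSS_CLIQUE",
}


def nccl_env(features: Iterable[str],
             base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `base` (default: the current environment) with the
    requested opt-in features set to "1"."""
    env = dict(os.environ if base is None else base)
    for name in features:
        if name not in OPT_IN_FLAGS:
            raise KeyError(f"unknown feature: {name}")
        env[OPT_IN_FLAGS[name]] = "1"
    return env


if __name__ == "__main__":
    # Print only the variables this helper added, starting from an empty base.
    for key, value in sorted(nccl_env(["tma", "port_recovery"], base={}).items()):
        print(f"{key}={value}")
```

A launcher could then pass the result to a child process, for example subprocess.run(["./my_nccl_app"], env=nccl_env(["ddp"])), where my_nccl_app stands in for the real binary. Setting the variables before the process starts avoids depending on when the library reads them.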

NCCL Parameter Infrastructure

  • Adds new C APIs to support querying NCCL parameters.
  • Introduces the ncclParamGetAllParameterKeys, ncclParamDumpAll, ncclParamGet, and ncclParamGetParameter APIs.

NCCL4PY v0.2.0

  • Adds new APIs from NCCL 2.29 release.
  • Adds devcomm create/destroy APIs to prepare for the device API.
  • Enables free-threading support.

Other Improvements

  • Adds NCCL Inspector P2P event support.
  • ncclGinBarrierSession can now be created directly for the world team without manual resource allocation.
  • GIN proxy GFD size increased to 128 bytes with version field added.
  • GIN proxy CQ polling (ginProgress) moved to per-context to improve performance.
  • ncclBarrierSession no longer shares resources with ncclLsaBarrierSession or ncclGinBarrierSession.
  • Redundant NCCL_DEBUG=INFO log volume reduced significantly.
  • NVLSTree tuning that improves performance for various Blackwell systems.
  • Adds p2pMaxPeers to communicator to achieve better tuning for send/recv vs. all2all.
  • Enables LL128 protocol in heterogeneous scenarios for Hopper and later GPUs.
  • Adds checks for mismatched Net and CollNet counts across communicators.
  • Adds a Grafana template for rendering the NCCL Inspector dashboard from Prometheus data.
  • Removes unused members nccl_id, comm, nccl_unique_id, and thread_ranks in the examples (GitHub PR #1989).
  • Adds NCCL_LIBIBVERBS_SO environment variable to specify an absolute path for libibverbs (GitHub PR #2043).
  • Extends suspend memory offload to channel device allocations (GitHub PR #2060).

Bug Fixes

  • Fixes implicit CUDA synchronization in putSignal and CE collectives caused by pageable CPU stack memcpy.
  • Fixes a hang in an edge case when using CE collectives with CUDA graphs.
  • Fixes NULL access issue during finalize when RMA and GIN plugins are both initialized.
  • Fixes race conditions in all2all GIN/Hybrid examples with more than one CTA.
  • Fixes ncclGinType_t uint8_t enum compatibility issue in nccl4py.
  • Fixes several memory leaks in communicator create/destroy code paths.
  • Fixes a bug in plugin compat layer for v11 related to lazy initialization.
  • Fixes data corruption in symmetric LL kernels with unaligned buffers.
  • Fixes plugin name being cleared after communicator destroy (GitHub Issue #1978).
  • Fixes deadlock and use-after-free in the inspector plugin (GitHub Issue #2000).
  • Fixes incorrect network interface selection caused by inverted boolean logic in matchSubnet (GitHub PR #2047).
  • Fixes regression from 2.29.2 where the CPU affinity mask is not restored in initTransportsRank (GitHub Issue #2033).
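
To illustrate why an inverted subnet check leads to the wrong interface being chosen, here is a small Python sketch of subnet matching. It is not NCCL's matchSubnet (which is C code with its own signature); the function names, the interface tuples, and the addresses are all hypothetical and chosen only to show the failure mode.

```python
import ipaddress
from typing import Iterable, Optional, Tuple


def match_subnet(addr_a: str, addr_b: str, prefix_len: int) -> bool:
    """True when both addresses fall in the same network of the given prefix length."""
    net_a = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in net_a


def pick_interface(interfaces: Iterable[Tuple[str, str, int]],
                   remote: str,
                   inverted: bool = False) -> Optional[str]:
    """Return the first interface whose address shares a subnet with `remote`.

    `inverted=True` simulates a flipped boolean test to show its effect.
    """
    for name, addr, prefix_len in interfaces:
        matched = match_subnet(addr, remote, prefix_len)
        if matched != inverted:
            return name
    return None


if __name__ == "__main__":
    nics = [("eth0", "10.0.0.5", 24), ("ib0", "192.168.1.5", 24)]
    print(pick_interface(nics, "192.168.1.9"))                 # ib0
    print(pick_interface(nics, "192.168.1.9", inverted=True))  # eth0
```

With the correct test, the interface on the remote's subnet is selected; with the test flipped, the first interface that is not on that subnet wins instead.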

Known Limitations

  • Applications that use GIN APIs must be recompiled against 2.30.3 to work with the 2.30.3 runtime.
  • gin.get requires GDAKI and is not supported on Ampere or directNIC platforms.
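
Because GIN applications must be rebuilt against 2.30.3, a deployment script may want to compare the version an application was built with against the runtime it will load. The sketch below mirrors the integer encoding used by the NCCL_VERSION macro in nccl.h (the same form ncclGetVersion reports), so 2.30.3 becomes 23003. The helper names are hypothetical, and treating only an exact match as safe is a conservative assumption of this sketch, not a rule stated in the notes.

```python
from typing import Tuple


def nccl_version_code(major: int, minor: int, patch: int) -> int:
    """Encode a version the way nccl.h's NCCL_VERSION macro does."""
    if major <= 2 and minor <= 8:
        # Older releases used a four-digit encoding.
        return major * 1000 + minor * 100 + patch
    return major * 10000 + minor * 100 + patch


def decode_version(code: int) -> Tuple[int, int, int]:
    """Inverse of nccl_version_code."""
    if code < 10000:
        major, rest = divmod(code, 1000)
    else:
        major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def gin_build_matches_runtime(build_code: int, runtime_code: int) -> bool:
    """Conservative assumption: only an exact version match is treated as safe."""
    return build_code == runtime_code


if __name__ == "__main__":
    runtime = nccl_version_code(2, 30, 3)
    print(runtime, decode_version(runtime))  # 23003 (2, 30, 3)
    print(gin_build_matches_runtime(nccl_version_code(2, 29, 2), runtime))  # False
```

In a real application the build-time value would come from NCCL_VERSION_CODE and the runtime value from ncclGetVersion; here both are plain integers so the logic can be checked without the library.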

Acknowledgments

We thank the following contributors for their work on this release:

  • @chenhengqi, @liangxs, @phu0ngng, @SreevatsaAnantharamu, @SongXiaoXi for their PRs.
  • @sphish, @LyricZhao for their continued contributions to improving the NCCL device API.

We also thank the community for issue reports, testing, and feedback.

Notability

notability 3.0/10

Routine maintenance release of a library.