Release · NVIDIA · published Apr 24, 2026

NVIDIA/nccl nccl4py-v0.2.0

nccl4py v0.2.0 Release

Repository: NVIDIA/nccl

Tag: nccl4py-v0.2.0

Published: 2026-04-24T21:33:08Z

Prerelease: no

Release notes:

Release Notes — nccl4py 0.2.0

This release adds Python bindings for the new NCCL 2.30 one-sided RMA, Device API (GIN), and elastic communicator features, along with substantially more control over communicator configuration.

Highlights

  • One-sided RMA (point-to-point) — New Communicator.put_signal(), Communicator.signal(), and Communicator.wait_signal() methods, plus a WaitSignalDesc helper for describing signal values and match operations.
  • NCCL Device API host-side setup — New Communicator.create_dev_comm() that produces a DevCommResource for use with device-side NCCL kernels. Configure the device communicator through the new NCCLDevCommRequirements class, and introspect support via the device_api_support, gin_type, railed_gin_type, host_rma_support, and n_lsa_teams properties.
  • Device pointer access for registered windows — RegisteredWindowHandle now exposes user_ptr, get_lsa_device_pointer(), get_lsa_multimem_device_pointer(), and get_peer_device_pointer() for direct access to LSA, multimem, and peer mappings.
  • Elastic and fault-tolerant communicators — New Communicator.grow(), revoke(), suspend(), and resume() methods to support elastic topology changes and error-handling flows. CommSuspendFlag is added alongside the existing CommShrinkFlag.
  • More flexible construction — In addition to init(), communicators can now be created with the class method init_all() and the instance method initialize(). Communicator.get_mem_stat() reports per-communicator memory statistics.

Configuration

New tuning knobs on NCCLConfig:

  • graph_usage_mode, num_rma_ctx, max_p2p_peers.

NCCLDevCommRequirements — passed to Communicator.create_dev_comm() to describe the resources and capabilities a device communicator needs:

  • LSA: lsa_multimem, barrier_count, lsa_barrier_count, rail_gin_barrier_count, world_gin_barrier_count, lsa_ll_a2a_block_count, lsa_ll_a2a_slot_count.
  • GIN: gin_force_enable, gin_context_count, gin_signal_count, gin_counter_count, gin_queue_depth, gin_connection_type, gin_exclusive_contexts.

Device / topology introspection

New Communicator properties: cuda_dev, nvml_dev, device_api_support, multimem_support, gin_type, railed_gin_type, n_lsa_teams, host_rma_support.
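
As one way to use these properties, the sketch below gathers them into a dictionary, for example for logging at startup. It does not import nccl4py: a types.SimpleNamespace with made-up values stands in for a real communicator, and the helper name summarize_capabilities is hypothetical. Only the property names come from the list above; these notes do not state the value types, so the helper treats them as opaque.

```python
from types import SimpleNamespace

# Property names taken from the release notes. Their value types are not
# documented there, so nothing below interprets the values.
INTROSPECTION_PROPERTIES = (
    "cuda_dev",
    "nvml_dev",
    "device_api_support",
    "multimem_support",
    "gin_type",
    "railed_gin_type",
    "n_lsa_teams",
    "host_rma_support",
)


def summarize_capabilities(comm):
    """Return {property name: value}; absent attributes map to None."""
    return {name: getattr(comm, name, None) for name in INTROSPECTION_PROPERTIES}


# Stand-in object with invented values (NOT a real Communicator).
fake_comm = SimpleNamespace(cuda_dev=0, device_api_support=True, n_lsa_teams=1)
summary = summarize_capabilities(fake_comm)
```

Using getattr with a default keeps the helper usable against older nccl4py versions that lack some of these properties, at the cost of not distinguishing "absent" from a property that legitimately returns None.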

Other changes

  • CTAPolicy is now an IntFlag (was IntEnum) so multiple policies can be combined.
  • Interop submodules nccl.core.cupy and nccl.core.torch are now lazy-loaded via __getattr__ and only imported on first attribute access, so import nccl.core no longer pulls in CuPy or PyTorch.
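
Both changes rely on standard Python mechanisms, which can be sketched with the standard library alone. The example below is not nccl4py code: PolicyAsEnum and PolicyAsFlag are hypothetical stand-ins with invented member names (the real CTAPolicy members are not listed in these notes), and make_lazy_package is a generic illustration of a module-level __getattr__ (PEP 562), with the stdlib module colorsys standing in for a heavy dependency such as CuPy or PyTorch. How nccl.core implements its own __getattr__ is an assumption.

```python
import enum
import importlib
import types


# 1) IntEnum vs IntFlag. Member names are hypothetical stand-ins.
class PolicyAsEnum(enum.IntEnum):
    A = 1
    B = 2


class PolicyAsFlag(enum.IntFlag):
    A = 1
    B = 2


enum_combo = PolicyAsEnum.A | PolicyAsEnum.B  # degrades to a plain int
flag_combo = PolicyAsFlag.A | PolicyAsFlag.B  # stays a PolicyAsFlag


# 2) Lazy submodule loading through a module-level __getattr__ (PEP 562).
def make_lazy_package(name, submodules):
    """Build a module whose listed attributes import on first access.

    `submodules` maps attribute name -> importable module name.
    """
    pkg = types.ModuleType(name)
    loaded = []  # records which attributes have triggered an import

    def _lazy_getattr(attr):
        if attr in submodules:
            module = importlib.import_module(submodules[attr])
            setattr(pkg, attr, module)  # cache: later lookups skip this hook
            loaded.append(attr)
            return module
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # Python consults __getattr__ in the module namespace only after
    # normal attribute lookup fails.
    pkg.__getattr__ = _lazy_getattr
    pkg._loaded = loaded
    return pkg


lazy = make_lazy_package("fake_core", {"heavy": "colorsys"})
before = list(lazy._loaded)  # nothing imported at package creation
heavy = lazy.heavy           # first access performs the import
after = list(lazy._loaded)
```

Caching the imported module on the package means the hook runs once per submodule; every later access is an ordinary attribute lookup.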
