NVIDIA/nccl nccl4py-v0.2.0
nccl4py v0.2.0 Release
Repository: NVIDIA/nccl
Tag: nccl4py-v0.2.0
Published: 2026-04-24T21:33:08Z
Prerelease: no
Release notes:
Release Notes — nccl4py 0.2.0
This release adds Python bindings for the new NCCL 2.30 one-sided RMA, Device API (GIN), and elastic communicator features, along with substantially more control over communicator configuration.
Highlights
- One-sided RMA (point-to-point): New Communicator.put_signal(), Communicator.signal(), and Communicator.wait_signal() methods, plus a WaitSignalDesc helper for describing signal values and match operations.
- NCCL Device API host side setup: New Communicator.create_dev_comm() that produces a DevCommResource for use with device-side NCCL kernels. Configure the device communicator through the new NCCLDevCommRequirements class, and introspect support via device_api_support, gin_type, railed_gin_type, host_rma_support, and n_lsa_teams properties.
- Device pointer access for registered windows: RegisteredWindowHandle now exposes user_ptr, get_lsa_device_pointer(), get_lsa_multimem_device_pointer(), and get_peer_device_pointer() for direct access to LSA, multimem, and peer mappings.
- Elastic and fault-tolerant communicators: New Communicator.grow(), revoke(), suspend(), and resume() methods to support elastic topology changes and error-handling flows. CommSuspendFlag added alongside existing CommShrinkFlag.
- More flexible construction: In addition to init(), communicators can now be created with class method init_all() and instance method initialize(). Communicator.get_mem_stat() reports per-communicator memory statistics.
Configuration
New tuning knobs on NCCLConfig: graph_usage_mode, num_rma_ctx, max_p2p_peers.
NCCLDevCommRequirements — passed to Communicator.create_dev_comm() to describe the resources and capabilities a device communicator needs:
- LSA: lsa_multimem, barrier_count, lsa_barrier_count, rail_gin_barrier_count, world_gin_barrier_count, lsa_ll_a2a_block_count, lsa_ll_a2a_slot_count.
- GIN: gin_force_enable, gin_context_count, gin_signal_count, gin_counter_count, gin_queue_depth, gin_connection_type, gin_exclusive_contexts.
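The field names above are easy to mistype, so a small pre-check can help before any values are handed to NCCLDevCommRequirements. The following is a minimal sketch that runs without nccl4py installed. The helper group_requirements and the example values are hypothetical; only the field names come from these notes. The notes do not say how NCCLDevCommRequirements itself is constructed or whether it rejects unknown fields, so this sketch deliberately does not call it.

```python
# Field names copied from the release notes, grouped as the notes group them.
LSA_FIELDS = frozenset({
    "lsa_multimem", "barrier_count", "lsa_barrier_count",
    "rail_gin_barrier_count", "world_gin_barrier_count",
    "lsa_ll_a2a_block_count", "lsa_ll_a2a_slot_count",
})
GIN_FIELDS = frozenset({
    "gin_force_enable", "gin_context_count", "gin_signal_count",
    "gin_counter_count", "gin_queue_depth", "gin_connection_type",
    "gin_exclusive_contexts",
})


def group_requirements(**fields):
    """Hypothetical helper: reject unknown names, then split into LSA and GIN."""
    unknown = sorted(set(fields) - LSA_FIELDS - GIN_FIELDS)
    if unknown:
        raise ValueError(f"unknown requirement field(s): {', '.join(unknown)}")
    return {
        "lsa": {k: v for k, v in fields.items() if k in LSA_FIELDS},
        "gin": {k: v for k, v in fields.items() if k in GIN_FIELDS},
    }


# Placeholder values; the notes do not document valid ranges or types.
grouped = group_requirements(lsa_barrier_count=4, gin_signal_count=16)
```

A misspelled name such as gin_sginal_count raises ValueError here instead of being silently passed along.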
Device / topology introspection
New Communicator properties: cuda_dev, nvml_dev, device_api_support, multimem_support, gin_type, railed_gin_type, n_lsa_teams, host_rma_support.
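One way to use these properties is to collect them into a single report before deciding which code path to take. The sketch below is an illustration only: describe_capabilities is a hypothetical helper, and a SimpleNamespace stands in for a real Communicator so the example runs without NCCL or a GPU. The property names come from these notes; their value types are not documented here, so the helper treats them as opaque.

```python
from types import SimpleNamespace

# Property names copied from the release notes.
INTROSPECTION_PROPERTIES = (
    "cuda_dev", "nvml_dev", "device_api_support", "multimem_support",
    "gin_type", "railed_gin_type", "n_lsa_teams", "host_rma_support",
)

_MISSING = object()  # sentinel so that None and False count as real values


def describe_capabilities(comm):
    """Return {property_name: value} for each listed property `comm` exposes."""
    report = {}
    for name in INTROSPECTION_PROPERTIES:
        value = getattr(comm, name, _MISSING)
        if value is not _MISSING:
            report[name] = value
    return report


# Stand-in object with made-up values, NOT a real nccl4py Communicator.
fake_comm = SimpleNamespace(cuda_dev=0, device_api_support=True, n_lsa_teams=2)
capabilities = describe_capabilities(fake_comm)
```

Because missing attributes are skipped rather than raising, the same helper could be pointed at a communicator from an older nccl4py build that lacks some of these properties.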
Other changes
- CTAPolicy is now an IntFlag (was IntEnum) so multiple policies can be combined.
- Interop submodules nccl.core.cupy and nccl.core.torch are now lazy-loaded via __getattr__ and only imported on first attribute access, so import nccl.core no longer pulls in CuPy or PyTorch.
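Both changes above rest on standard Python mechanisms, which can be illustrated without nccl4py installed. The sketch below uses hypothetical stand-ins throughout: DemoPolicyEnum and DemoPolicyFlag are not the real CTAPolicy members, demo_pkg is not nccl.core, and the json module plays the role of a heavy dependency such as CuPy or PyTorch. It shows the general technique (enum.IntFlag and a module-level __getattr__ hook, as described in PEP 562), not the library's actual implementation.

```python
import enum
import importlib
import types


# Part 1: combining members of an IntEnum versus an IntFlag.
class DemoPolicyEnum(enum.IntEnum):
    A = 1
    B = 2


class DemoPolicyFlag(enum.IntFlag):
    A = 1
    B = 2


combined_enum = DemoPolicyEnum.A | DemoPolicyEnum.B  # degrades to a plain int
combined_flag = DemoPolicyFlag.A | DemoPolicyFlag.B  # stays a DemoPolicyFlag

# Part 2: lazy submodule loading through a module-level __getattr__.
demo_pkg = types.ModuleType("demo_pkg")
load_log = []  # records each time the lazy hook actually runs


def _lazy_getattr(name):
    # Called only when normal attribute lookup on demo_pkg fails.
    if name == "heavy":
        load_log.append(name)
        module = importlib.import_module("json")  # stand-in for a heavy import
        setattr(demo_pkg, name, module)  # cache so the hook runs only once
        return module
    raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")


demo_pkg.__getattr__ = _lazy_getattr

loaded_before_access = list(load_log)  # nothing loaded yet
first = demo_pkg.heavy   # triggers the hook and the import
second = demo_pkg.heavy  # served from the cached attribute
```

With IntFlag the combined value keeps its type, so membership tests such as DemoPolicyFlag.A in combined_flag work. With the lazy hook, the import cost is paid only by code that actually touches the submodule.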