NVIDIA/nccl nccl4py-v0.3.1
NCCL4py v0.3.1 Release
Repository: NVIDIA/nccl
Tag: nccl4py-v0.3.1
Published: 2026-06-11T18:45:48Z
Prerelease: no
Release notes:
Highlights
- Added nccl.ep, a Pythonic interface to libnccl_ep.so for expert parallel dispatch/combine workflows. The package exposes Group, Handle, Tensor, typed config dataclasses, Algorithm, Layout, PassDir, and the named input/output structs used by the NCCL EP API.
- Added nccl.core.device.cute, enabling CuTeDSL kernels to call NCCL device APIs.
- Added top-level stack diagnostics with nccl.get_version() and nccl.show_versions(), reporting nccl4py, libnccl.so, and libnccl_ep.so versions, CUDA build variants, and loaded shared-library paths.
- Added free-threaded CPython support.
New Features
NCCL EP Python API
- New nccl.ep package provides Pythonic access to the NCCL EP extension library.
- Group.create() creates EP groups from a Communicator and GroupConfig; Group.create_handle() creates handles with an explicit Layout.
- Handle supports update(), dispatch(), combine(), complete(), and destroy().
- DispatchInputs, DispatchOutputs, CombineInputs, CombineOutputs, and LayoutInfo provide named containers for the tensors and metadata used by dispatch, combine, and handle setup.
- Tensor resolves Python buffers into ncclEpTensor_t descriptors.
- GroupConfig, HandleConfig, DispatchConfig, CombineConfig, and AllocConfig expose typed configuration objects.
- AllocFn and FreeFn expose caller-controlled EP allocation hooks.
- nccl.ep.interop.torch.get_nccl_comm_from_group() provides PyTorch interop for creating an NCCL communicator from a PyTorch process group's rank and world-size information.
- Importing nccl.ep sets default NCCL_EP_HOME when bundled EP JIT headers are present, and NCCL_HOME when NCCL public headers are available from the installed nvidia.nccl package.
- nccl.ep checks that the loaded libnccl.so and libnccl_ep.so were built with the same CUDA major version. CUDA minor differences are accepted.
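The CUDA compatibility rule in the last item can be illustrated with a small standalone sketch. It is not the check that nccl.ep performs internally: the helper names `cuda_major` and `same_cuda_major` are hypothetical, and the dotted version strings are an assumed representation of each library's CUDA build variant.

```python
def cuda_major(version: str) -> int:
    """Return the major component of a dotted CUDA version string such as '13.0'."""
    return int(version.split(".", 1)[0])


def same_cuda_major(nccl_cuda: str, ep_cuda: str) -> bool:
    """Hypothetical helper: accept any pair of builds that share a CUDA major version."""
    return cuda_major(nccl_cuda) == cuda_major(ep_cuda)


# Minor-version differences are accepted under this rule.
print(same_cuda_major("13.0", "13.1"))  # True
# A CUDA 12 libnccl.so paired with a CUDA 13 libnccl_ep.so is rejected.
print(same_cuda_major("12.9", "13.0"))  # False
```

Under this rule, a mismatch is only reported when the major components differ, which matches the behavior described above for the bundled CUDA 13 build of libnccl_ep.so.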
Communicator Configuration
- Added graph_stream_ordering to NCCLConfig.
Device API and CuTe DSL
- New nccl.core.device.cute module exposes the NCCL device API to CuTeDSL kernels, including communicator/window access, GIN primitives, barrier operations, and typed structs.
- Added bindings/nccl4py/examples/cute/main.py, a GIN put/wait example with host-side validation.
- Added gin_strong_signals_required and gin_va_signals_required to NCCLDevCommRequirements for configuring device communicator requirements.
- Added NcclGinType.GPI for the GPU-Push Interface transport.
Version and Diagnostics API
- Top-level nccl.get_version() returns a VersionInfo dataclass containing the nccl4py package version plus LibraryInfo entries for the loaded libnccl.so and, when available, libnccl_ep.so.
- Top-level nccl.show_versions() prints the same stack information in a human-readable version block.
- Direct library probes are available for each native library: nccl.core.get_lib_version() and nccl.core.get_lib_path() report the loaded libnccl.so; nccl.ep.get_lib_version() and nccl.ep.get_lib_path() report the loaded libnccl_ep.so.
- Each LibraryInfo includes release version, CUDA build variant, and loaded shared-library path.
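A minimal sketch of calling the diagnostics entry point defensively, for scripts that may also run where nccl4py is missing or older than 0.3.1. The wrapper name `report_nccl_stack` is hypothetical; the only nccl4py call it makes is nccl.show_versions(), as named above, and it assumes nothing about the printed format.

```python
import importlib


def report_nccl_stack() -> bool:
    """Print the nccl4py stack report if possible; return True when it was printed."""
    try:
        # Importing may fail if the package or its native libraries are absent.
        nccl = importlib.import_module("nccl")
    except (ImportError, OSError) as exc:
        print(f"nccl4py is not importable: {exc}")
        return False

    show_versions = getattr(nccl, "show_versions", None)
    if show_versions is None:
        # Releases before 0.3.1 do not provide the top-level diagnostics API.
        print("this nccl4py build does not provide nccl.show_versions()")
        return False

    show_versions()
    return True


if __name__ == "__main__":
    report_nccl_stack()
```

Attaching this output to a bug report gives the nccl4py, libnccl.so, and libnccl_ep.so details in one place.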
Installation and Packaging
- CuTeDSL support can be installed through the CUDA-specific extras:
nccl4py[cu12] installs nvidia-cutlass-dsl>=4.5.2,=4.5.2,<5.0.
- Wheels include package data for nccl/ep/lib/libnccl_ep.so plus EP JIT headers. The bundled libnccl_ep.so is built with CUDA 13, regardless of whether the cu12 or cu13 extra is installed. Users who want to use a CUDA 12 build of libnccl_ep.so must provide that library themselves, for example through LD_PRELOAD or LD_LIBRARY_PATH.
- Wheels are available for free-threaded CPython 3.14t.
Examples and Documentation
- Added Python examples for:
  - multiple devices in one process: docs/examples/01_communicators/01_multiple_devices_single_process/python/;
  - one device per MPI process: docs/examples/01_communicators/03_one_device_per_process_mpi/python/;
  - point-to-point ring pattern: docs/examples/02_point_to_point/01_ring_pattern/python/;
  - allreduce: docs/examples/03_collectives/01_allreduce/python/;
  - user-buffer allreduce: docs/examples/04_user_buffer_registration/01_allreduce/python/;
  - symmetric-memory allreduce: docs/examples/05_symmetric_memory/01_allreduce/python/;
  - symmetric-memory allgather: docs/examples/05_symmetric_memory/02_allgather/python/.
- Added nccl4py documentation under docs/userguide/source/nccl4py/, with the main entry point at docs/userguide/source/nccl4py.rst.
Breaking Changes
Removed APIs
- nccl.core.group_simulate_end() has been removed. Use nccl.core.group_end(simulate=True):

```python
from nccl.core import group_end, group_start

group_start()
# enqueue operations
info = group_end(simulate=True)
```

- NCCL_SPLIT_NOCOLOR has been removed from the public constants. Use color=None when a rank should opt out of Communicator.split().
Deprecated APIs
- nccl.core.get_version() remains available, but is deprecated. Use top-level nccl.get_version() for structured version information, or nccl.show_versions() for human-readable output.
Other Compatibility Notes
- Public NCCL enum wrappers are pure-Python IntEnum or IntFlag classes. Integer compatibility is preserved, and dtype conversion remains supported. Code that depends on binding-backed enum class identity from earlier releases may need updates.
- Enum members now follow the Python enum convention of UPPER_SNAKE_CASE names, such as CTAPolicy.DEFAULT, CommShrinkFlag.ABORT, WindowFlag.COLL_SYMMETRIC, and NcclCommMemStat.GPU_MEM_TOTAL. The previous PascalCase/camelCase aliases, such as CTAPolicy.Default and NcclCommMemStat.GpuMemTotal, still work in 0.3.1 for compatibility, but will be removed in a future release. New code should use the uppercase names.
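To make the enum notes concrete, here is a standalone sketch of how integer compatibility and legacy aliases behave on a standard-library IntEnum. The class `ExamplePolicy`, its member names, and its values are hypothetical stand-ins, not the real nccl4py definitions, and the sketch does not claim that nccl4py implements its aliases this way.

```python
from enum import IntEnum


class ExamplePolicy(IntEnum):
    """Hypothetical enum; names and values are illustrative only."""

    # Canonical UPPER_SNAKE_CASE members.
    DEFAULT = 0
    LOW_LATENCY = 1
    # Legacy-style names bound to the same values become aliases.
    Default = 0
    LowLatency = 1


# Members compare equal to plain integers.
print(ExamplePolicy.DEFAULT == 0)  # True
# An alias resolves to the canonical member and reports its canonical name.
print(ExamplePolicy.Default is ExamplePolicy.DEFAULT)  # True
print(ExamplePolicy.LowLatency.name)  # LOW_LATENCY
# Iteration yields canonical members only.
print([member.name for member in ExamplePolicy])  # ['DEFAULT', 'LOW_LATENCY']
```

Because an alias and its canonical member are the same object in this model, switching call sites to the uppercase names does not change behavior.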
Fixes and Enhancements
- Fixed pointer lifetime handling for non-blocking communicator and window initialization.
- Torch interop covers torch.uint32 and torch.uint64 when those dtypes are available.
API Stability
- nccl.ep and nccl.core.device.cute are initial API support. Their public interfaces may change in future releases as the NCCL EP and CuTeDSL device API integration...