What does this release signal mean?

NVIDIA published the nccl-ep-v0.1.0 release in the NVIDIA/nccl repository. A release signal is evidence of what shipped, changed, or was packaged for users. The release is named NCCL EP v0.1.0, and the repository describes itself as optimized primitives for collective multi-GPU communication (C++, 5K stars). onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/nccl nccl-ep-v0.1.0

Published Jun 8, 2026 · seen Jun 9 · captured Jun 10 · HTTP 200 · method: exa

Release: NVIDIA/nccl nccl-ep-v0.1.0

Repository: NVIDIA/nccl | Optimized primitives for collective multi-GPU communication | 5K stars | C++
Name: NCCL EP v0.1.0 Release
Author: @bhramesh-nvidia
Created: 2026-06-08T22:53:52Z
Published: 2026-06-08T22:54:11Z
Reactions: 🎉 3

NCCL EP is a high-performance NCCL API extension for efficient Mixture-of-Experts (MoE) communication. It provides optimized dispatch and combine primitives for Expert Parallelism (EP) across distributed GPU systems. These primitives are implemented on top of the NCCL Device API, using its Load-Store Accessible (LSA) and GPU-Initiated Networking (GIN) operations.
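To make the two primitives concrete, here is a toy single-process model of what dispatch and combine do logically in an MoE layer. It is only a sketch and not the NCCL EP API: the function names, the data layout, and the weighted-sum combine are assumptions chosen for illustration, and no GPU or network communication takes place.

```python
# Toy, single-process model of MoE dispatch and combine.
# Hypothetical helper names; this does not call NCCL EP.
from collections import defaultdict


def dispatch(tokens, topk_idx):
    """Route each token to every expert it selected.

    tokens:   list of float vectors, one per token
    topk_idx: list of expert-id lists, one per token
    Returns {expert_id: [(token_id, vector), ...]}.
    """
    per_expert = defaultdict(list)
    for token_id, (vec, experts) in enumerate(zip(tokens, topk_idx)):
        for expert_id in experts:
            per_expert[expert_id].append((token_id, vec))
    return dict(per_expert)


def combine(expert_outputs, topk_idx, topk_weights, hidden):
    """Sum the expert outputs back into token order, weighted per expert."""
    out = [[0.0] * hidden for _ in topk_idx]
    for expert_id, items in expert_outputs.items():
        for token_id, vec in items:
            slot = topk_idx[token_id].index(expert_id)
            weight = topk_weights[token_id][slot]
            for d in range(hidden):
                out[token_id][d] += weight * vec[d]
    return out


tokens = [[1.0, 2.0], [3.0, 4.0]]
topk_idx = [[0, 2], [2, 1]]            # two experts selected per token
topk_weights = [[0.5, 0.5], [0.25, 0.75]]

routed = dispatch(tokens, topk_idx)
# Identity "experts": each expert returns its input unchanged, so the
# combine step should reproduce the original tokens.
combined = combine(routed, topk_idx, topk_weights, hidden=2)
```

In a real Expert Parallelism setup the experts live on different GPUs, so dispatch and combine become all-to-all style exchanges; that communication is what the library optimizes.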

API Improvements and Extensions

Refactor the API signatures to improve user experience and support backward compatibility.
Change the device memory ownership for EP Tensor data. The user is now responsible for device memory allocations for EP Tensors.
Refactor the EP tensor data structure management for the host-side NCCL EP Tensor object. EP tensor now supports both dynamic allocation for long-term storage and static on-stack allocation for malloc-free usage on the data path.
Add lightweight and CUDA Graph-compatible EP Handle management on the data path. ncclEpCreateHandle is split into ncclEpInitHandle, which is a control-path operation that may allocate device memory and may be collective, and ncclEpUpdateHandle, which updates the Handle's routing information before calling the Dispatch operation.
Allow users to set the number of SMs used by NCCL EP.
Extend the API to associate an NCCL EP Tensor with an NCCL Window to enable zero-copy optimizations.
Add flexible Dispatch output layout configurations:

HT mode supports Flat and Expert-major layouts.
Enable users to provide expert padding to align with GEMM requirements.
LL mode supports Expert- and Rank-major layouts.

Add active rank mask support to identify failed ranks and exclude them from future communication, allowing operation to continue instead of aborting the process.
Introduce an explicit Forward/Backward pass selector in Dispatch and Combine operations.
Drop top-K indices from the Dispatch operation signature and use the tensor provided to the Handle update.
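The expert padding item above can be pictured with a little arithmetic. The sketch below assumes an Expert-major buffer in which each expert's segment is rounded up to a multiple of some GEMM alignment. The function name, the `alignment` parameter, and the example value 64 are hypothetical; the release notes do not show how NCCL EP actually exposes padding.

```python
def padded_expert_offsets(tokens_per_expert, alignment):
    """Return (offsets, total_rows) for an expert-major buffer.

    Each expert's segment is rounded up to a multiple of `alignment`
    so that a grouped GEMM can start every expert on an aligned row.
    """
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    offsets = []
    cursor = 0
    for count in tokens_per_expert:
        offsets.append(cursor)
        # Ceiling division, then scale back up to a row count.
        padded = -(-count // alignment) * alignment
        cursor += padded
    return offsets, cursor


# Three local experts receiving 5, 0, and 130 tokens, padded to 64 rows.
offsets, total_rows = padded_expert_offsets([5, 0, 130], alignment=64)
```

With these inputs the segments take 64, 0, and 192 rows, so the experts start at rows 0, 64, and 64 and the buffer holds 256 rows in total.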

Implementation Improvements

Migrate to Just-In-Time (JIT) compilation for HT mode. This addresses performance issues and a number of limitations. LL migration to JIT is planned in the next release.
Add full Multi-node-NVLINK (MNNVL) support.
Remove limitations on the number of ranks in an LSA team. This has been tested on NVLink72.
Fully migrate to NCCL infrastructure. All CUDA IPC references are removed and the code only depends on NCCL.
Enable CUDA Graph support through EP handle management API changes and implementation changes.
Support MoE and prefill workloads by enabling a variable number of tokens per sender on Dispatch.

Performance Optimizations

Improve the performance of Dispatch for HT mode by leveraging NCCL Device API extensions available starting from NCCL v2.30.
Improve the performance of the Combine operation in HT mode by leveraging JIT compilation.
Enable zero-copy flows for HT mode.
Optimize LL performance for NVLink-only configurations by avoiding the send-side staging buffer.
Update ep_bench to measure kernel-only performance.
Introduce ncclTeamRail in HT mode instead of a split communicator.
Improve Dispatch/recv and Combine/send parallelization in LL mode.

Memory Footprint Optimizations

Optimize the Dispatch staging buffer in LL mode. Use per-rank token deduplication and a rank-major layout to reduce the staging buffer size by a factor equal to the number of experts per rank.
Expose the rank-major layout at the API level in LL mode. Rank-major mode reduces the memory footprint by a factor equal to the number of experts per rank.
Optimize HT mode handle memory usage by moving the global routing map buffer from the handle to the group scope. This allows different handles to share the buffer.
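The "factor of experts per rank" claim for LL mode can be checked with a back-of-the-envelope model. The sizing formula below is an assumption made for illustration: an expert-major staging buffer reserves one slot per (source rank, local expert, token), while a rank-major buffer with per-rank deduplication reserves one slot per (source rank, token). The release notes do not describe the actual allocation, and all parameter values are made up.

```python
def staging_bytes(num_ranks, max_tokens_per_rank, hidden, bytes_per_elem,
                  experts_per_rank, rank_major):
    """Estimate the receive-side staging size under the assumed model."""
    # Rank-major: a token sent to several experts on the same rank is
    # stored once. Expert-major: one copy per local expert.
    copies = 1 if rank_major else experts_per_rank
    slots = num_ranks * max_tokens_per_rank * copies
    return slots * hidden * bytes_per_elem


params = dict(num_ranks=8, max_tokens_per_rank=128, hidden=7168,
              bytes_per_elem=2, experts_per_rank=32)
expert_major = staging_bytes(rank_major=False, **params)
rank_major = staging_bytes(rank_major=True, **params)
ratio = expert_major // rank_major
```

Under this model the ratio equals `experts_per_rank`, which matches the reduction factor stated in the notes.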

Python Bindings

Expose NCCL EP through nccl4py.
Make the Python bindings more Pythonic than the original 1-to-1 C-Python mapping.

Performance Benchmark (`ep_bench`)

Report kernel-only performance metrics through CUPTI, if available.
Extend the number of settings: number of SMs, number of experts, and layout selection.
Add sophisticated validation for Dispatch and Combine phases to detect memory corruption and routing issues.

Bug Fixes

Fix the bug in HT mode preventing launches on more than 8 nodes.
Fix HT mode inter-node flags sizing that would cause overflow for 9 or more nodes.
Remove quantization-related code from the API and tools. Quantization support is planned to be re-enabled in the following release.
Fix memory ordering in Dispatch/Combine grid barriers.
Fix a bug causing crashes in LL mode for batch sizes.
Fix integer overflow in inter-node N2N warp at 8 or more nodes. Thanks to Mozar Huang.

Known Issues and Limitations

The number of RDMA domains, or NCCL LSA Teams, in HT mode is limited to 32 due to algorithmic limitations.
The nccl4py 0.3 wheel ships with libnccl_ep.so built with CUDA 13. To use CUDA 12, users must build libnccl_ep.so from source and specify the .so file path using LD_PRELOAD or LD_LIBRARY_PATH. In addition, NCCL_EP_HOME must be set to point to the corresponding nccl_ep installation directory.
NCCL EP v0.1 does not support quantization. Although quantization-related parameters, such as the scales tensor, still appear in the API, the implementation was not tested and is not guaranteed to work. Elements of quantization support are expected to be introduced in the next release.
The Dispatch operation has resource limitations associated with the amount of available shared on-chip memory. Consumption is impacted by two factors:

The hidden dimension of the token.
LSA team size, which is the size of the NVLink domain.

If a job launch is aborted due to shared memory overflow, try to reduce the current stage or pipeline settings. In v0.1, this can only be done statically at build time: reduce the HYBRIDEP_DISPATCH_NUM_OF_STAGES and/or HYBRIDEP_DISPATCH_NUM_OF_PIPELINES_PER_BLOCK macro values in hybridep_configs.cuh, rebuild, and retry.
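For the CUDA 12 workaround listed among the known issues above, the environment for a child process could be prepared along these lines. This is a sketch under assumptions: the helper name is hypothetical, and the `lib/libnccl_ep.so` location inside the installation directory is a guess about the build output, so adjust it to wherever your build places the library. Only the variable names LD_PRELOAD and NCCL_EP_HOME come from the release notes.

```python
import os


def cuda12_env(nccl_ep_home, base_env=None):
    """Build an environment that points at a locally built libnccl_ep.so.

    nccl_ep_home: directory of the nccl_ep installation built for CUDA 12.
    base_env:     environment to extend (defaults to the current one).
    """
    env = dict(os.environ if base_env is None else base_env)
    # ASSUMPTION: the library sits under <home>/lib; not confirmed by the notes.
    lib_path = os.path.join(nccl_ep_home, "lib", "libnccl_ep.so")
    env["NCCL_EP_HOME"] = nccl_ep_home
    existing = env.get("LD_PRELOAD", "")
    # Put our library first while keeping any existing preloads.
    env["LD_PRELOAD"] = lib_path if not existing else lib_path + ":" + existing
    return env
```

The result can be passed as `env=` to `subprocess.run` when launching the Python job. The variables must be set before the process starts, since the dynamic loader reads LD_PRELOAD at startup.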

LL Mode Limitations

Maximum top-K: 9.
Hidden dimensions: 2048, 2560, 4096, 5120, 6144, 7168, and 8192.
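A job script can check these limits before launching. The sketch below encodes only the numbers stated in these notes (maximum top-K of 9 and the listed hidden dimensions for LL mode, at most 32 LSA teams for HT mode). The function names are hypothetical and are not part of NCCL EP or nccl4py.

```python
# Limits copied from the v0.1.0 release notes.
LL_MAX_TOPK = 9
LL_HIDDEN_DIMS = frozenset({2048, 2560, 4096, 5120, 6144, 7168, 8192})
HT_MAX_LSA_TEAMS = 32


def check_ll_config(top_k, hidden):
    """Return a list of problems for an LL-mode configuration."""
    problems = []
    if top_k > LL_MAX_TOPK:
        problems.append(f"top-K {top_k} exceeds the LL maximum of {LL_MAX_TOPK}")
    if hidden not in LL_HIDDEN_DIMS:
        supported = ", ".join(str(h) for h in sorted(LL_HIDDEN_DIMS))
        problems.append(f"hidden dimension {hidden} is not one of: {supported}")
    return problems


def check_ht_config(num_lsa_teams):
    """Return a list of problems for an HT-mode configuration."""
    problems = []
    if num_lsa_teams > HT_MAX_LSA_TEAMS:
        problems.append(
            f"{num_lsa_teams} LSA teams exceeds the HT limit of {HT_MAX_LSA_TEAMS}"
        )
    return problems
```

An empty list means the configuration is within the documented limits, for example `check_ll_config(8, 7168)`. It does not guarantee that the run fits in shared memory.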

Known Bugs

In LL mode, ep_bench reports Combine verification failure when a batch size of 1...


Notability

notability 6.0/10

New NCCL extension from NVIDIA, solid but no traction data.