NVIDIA/nccl nccl-ep-v0.1.0
Release: NVIDIA/nccl nccl-ep-v0.1.0
- Repository: NVIDIA/nccl | Optimized primitives for collective multi-GPU communication | 5K stars | C++
- Name: NCCL EP v0.1.0 Release
- Author: @bhramesh-nvidia
- Created: 2026-06-08T22:53:52Z
- Published: 2026-06-08T22:54:11Z
- Reactions: 🎉 3
NCCL EP is a high-performance NCCL API extension for efficient Mixture-of-Experts (MoE) communication. It provides optimized dispatch and combine primitives for Expert Parallelism (EP) across distributed GPU systems. It is implemented on top of the NCCL Device API, using Load-Store Accessible (LSA) and GPU-Initiated Networking (GIN) operations.
API Improvements and Extensions
- Refactor the API signatures to improve user experience and support backward compatibility.
- Change the device memory ownership for EP Tensor data. The user is now responsible for device memory allocations for EP Tensors.
- Refactor the EP tensor data structure management for the host-side NCCL EP Tensor object. EP tensor now supports both dynamic allocation for long-term storage and static on-stack allocation for malloc-free usage on the data path.
- Add lightweight and CUDA Graph-compatible EP Handle management on the data path. `ncclEpCreateHandle` is split into `ncclEpInitHandle`, which is a control-path operation that may allocate device memory and may be collective, and `ncclEpUpdateHandle`, which updates the Handle's routing information before calling the Dispatch operation.
- Allow users to set the number of SMs used by NCCL EP.
- Extend the API to associate an NCCL EP Tensor with an NCCL Window to enable zero-copy optimizations.
- Add flexible Dispatch output layout configurations:
  - HT mode supports Flat and Expert-major layouts.
    - Enable users to provide expert padding to align with GEMM requirements.
  - LL mode supports Expert- and Rank-major layouts.
- Add active rank mask support to identify failed ranks and exclude them from future communication, allowing operation to continue instead of aborting the process.
- Introduce an explicit Forward/Backward pass selector in Dispatch and Combine operations.
- Drop top-K indices from the Dispatch operation signature and use the tensor provided to the Handle update.
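To make the Expert-major layout with expert padding more concrete, here is a minimal host-side sketch in plain Python. It does not call NCCL EP: the function name `expert_major_layout` and the `alignment` parameter are hypothetical, and the actual on-device layout may differ. The sketch groups token-to-expert assignments by expert and rounds each expert's segment up to a multiple of the alignment, which is the kind of shape a grouped GEMM typically expects.

```python
def expert_major_layout(topk_idx, num_experts, alignment):
    """Group token ids by expert and pad each expert segment (illustration only).

    topk_idx[t] lists the experts selected for token t.
    Returns (slots, offsets): slots holds token ids, with -1 marking padding;
    offsets[e] is the start of expert e's segment and offsets[-1] the total length.
    """
    per_expert = [[] for _ in range(num_experts)]
    for token, experts in enumerate(topk_idx):
        for e in experts:
            per_expert[e].append(token)

    slots, offsets = [], [0]
    for tokens in per_expert:
        padded_len = -(-len(tokens) // alignment) * alignment  # round up
        slots.extend(tokens + [-1] * (padded_len - len(tokens)))
        offsets.append(len(slots))
    return slots, offsets


# 4 tokens, top-2 routing over 3 experts, segments aligned to 4 rows.
slots, offsets = expert_major_layout([[0, 1], [1, 2], [0, 1], [1, 2]], 3, 4)
print(offsets)  # [0, 4, 8, 12]
print(slots)    # [0, 2, -1, -1, 0, 1, 2, 3, 1, 3, -1, -1]
```

A Flat layout, by contrast, would keep received tokens in a single unpadded sequence; the padding rows here exist only so that every expert segment starts on an aligned boundary.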
Implementation Improvements
- Migrate to Just-In-Time (JIT) compilation for HT mode. This addresses performance issues and a number of limitations. LL migration to JIT is planned in the next release.
- Add full Multi-node-NVLINK (MNNVL) support.
- Remove limitations on the number of ranks in an LSA team. This has been tested on NVLink72.
- Fully migrate to NCCL infrastructure. All CUDA IPC references are removed and the code only depends on NCCL.
- Enable CUDA Graph support through EP handle management API changes and implementation changes.
- Support MoE and prefill workloads by enabling a variable number of tokens per sender on Dispatch.
Performance Optimizations
- Improve the performance of Dispatch for HT mode by leveraging NCCL Device API extensions available starting from NCCL v2.30.
- Improve the performance of the Combine operation in HT mode by leveraging JIT compilation.
- Enable zero-copy flows for HT mode.
- Optimize LL performance for NVLink-only configurations by avoiding the send-side staging buffer.
- Update `ep_bench` to measure kernel-only performance.
- Introduce `ncclTeamRail` in HT mode instead of a split communicator.
- Improve Dispatch/recv and Combine/send parallelization in LL mode.
Memory Footprint Optimizations
- Optimize the Dispatch staging buffer in LL mode. Per-rank token deduplication and a rank-major layout reduce the staging buffer size by a factor equal to the number of experts per rank.
- Expose rank-major layout at the API level in LL mode. Rank-major mode reduces the memory footprint by a factor of the number of experts per rank.
- Optimize HT mode handle memory usage by moving the global routing map buffer from the handle to the group scope. This allows different handles to share the buffer.
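The reduction factor quoted in the first two bullets above can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes a simplified sizing model in which the LL staging buffer reserves one `hidden`-sized slot per token for every destination: one destination per expert in expert-major, one per rank in rank-major. It also assumes experts are assigned to ranks in contiguous blocks. The model, the function names, and the example sizes are assumptions for illustration, not the actual NCCL EP allocation logic.

```python
def staging_bytes(num_destinations, max_tokens, hidden, bytes_per_elem=2):
    """Bytes for one `hidden`-sized slot per token per destination (toy model)."""
    return num_destinations * max_tokens * hidden * bytes_per_elem


def dedup_by_rank(expert_ids, experts_per_rank):
    """Destination ranks for one token, each rank listed once.

    Assumes expert e lives on rank e // experts_per_rank.
    """
    return sorted({e // experts_per_rank for e in expert_ids})


num_ranks, experts_per_rank = 8, 32   # hypothetical: 256 experts in total
max_tokens, hidden = 128, 7168        # hypothetical batch size and hidden dimension

expert_major = staging_bytes(num_ranks * experts_per_rank, max_tokens, hidden)
rank_major = staging_bytes(num_ranks, max_tokens, hidden)
print(expert_major // rank_major)     # 32, i.e. experts_per_rank

# A token routed to experts 3, 5 and 40 targets ranks 0, 0 and 1,
# so with per-rank deduplication it is staged twice instead of three times.
print(dedup_by_rank([3, 5, 40], experts_per_rank))  # [0, 1]
```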
Python Bindings
- Expose NCCL EP through `nccl4py`.
- Make the Python bindings more Pythonic than the original 1-to-1 C-to-Python mapping.
Performance Benchmark (ep_bench)
- Report kernel-only performance metrics through CUPTI, if available.
- Extend the number of settings: number of SMs, number of experts, and layout selection.
- Add sophisticated validation for Dispatch and Combine phases to detect memory corruption and routing issues.
Bug Fixes
- Fix the bug in HT mode preventing launches on more than 8 nodes.
- Fix HT mode inter-node flags sizing that would cause overflow for 9 or more nodes.
- Remove quantization-related code from the API and tools. Quantization support is planned to be re-enabled in the following release.
- Fix memory ordering in Dispatch/Combine grid barriers.
- Fix a bug causing crashes in LL mode for batch sizes.
- Fix integer overflow in inter-node N2N warp at 8 or more nodes. Thanks to Mozar Huang.
Known Issues and Limitations
- The number of RDMA domains, or NCCL LSA Teams, in HT mode is limited to 32 due to algorithmic limitations.
- The `nccl4py` 0.3 wheel is shipped with `libnccl_ep.so` built with CUDA 13. To use CUDA 12, users have to build `libnccl_ep.so` from source and specify the `.so` file path using `LD_PRELOAD` or `LD_LIBRARY_PATH`. In addition, `NCCL_EP_HOME` needs to be set to point to the corresponding `nccl_ep` installation directory.
- NCCL EP v0.1 does not support quantization. Although the API still contains quantization-related parameters, such as the scales tensor, the implementation was not tested and is not guaranteed to work. Elements of quantization support are expected to be introduced in the next release.
- The Dispatch operation has resource limitations associated with the amount of available shared on-chip memory. Consumption is impacted by two factors:
  - The hidden dimension of the token.
  - LSA team size, which is the size of the NVLink domain.
- If a job launch is aborted due to shared memory overflow, try to reduce the current stage or pipeline settings. In v0.1, this can only be done statically at build time: reduce the `HYBRIDEP_DISPATCH_NUM_OF_STAGES` and/or `HYBRIDEP_DISPATCH_NUM_OF_PIPELINES_PER_BLOCK` macro values in `hybridep_configs.cuh`, rebuild, and retry.
LL Mode Limitations
- Maximum top-K: 9.
- Hidden dimensions: 2048, 2560, 4096, 5120, 6144, 7168, and 8192.
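These limits are easy to check before launching a job. The helper below is a hypothetical pre-flight check written for illustration; it is not part of NCCL EP or `nccl4py`. It encodes only the two constraints listed above, plus the assumption that top-K must be at least 1.

```python
LL_MAX_TOP_K = 9
LL_HIDDEN_DIMS = (2048, 2560, 4096, 5120, 6144, 7168, 8192)


def check_ll_config(top_k, hidden):
    """Return a list of problems; an empty list means both limits are met."""
    problems = []
    if not 1 <= top_k <= LL_MAX_TOP_K:  # lower bound of 1 is an assumption
        problems.append(f"top-K {top_k} is outside 1..{LL_MAX_TOP_K}")
    if hidden not in LL_HIDDEN_DIMS:
        problems.append(f"hidden dimension {hidden} is not in {LL_HIDDEN_DIMS}")
    return problems


print(check_ll_config(8, 7168))        # []
print(len(check_ll_config(10, 3072)))  # 2
```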
Known Bugs
- In LL mode, `ep_bench` reports Combine verification failure when a batch size of 1…