deepseek-ai/DeepEP
Description: DeepEP: an efficient expert-parallel communication library
Language: Cuda
License: MIT
Stars: 9711
Forks: 1282
Open issues: 266
Created: 2025-02-17T01:33:04Z
Pushed: 2026-06-01T02:33:48Z
Default branch: main
Fork: no
Archived: no
README:
DeepEP
DeepEP (DeepEveryParallel) is a high-performance communication library for modern machine learning training and inference. The library currently focuses on expert parallelism (EP) — providing high-throughput and low-latency all-to-all GPU kernels (MoE dispatch and combine) with low-precision support including FP8 — while also offering experimental primitives for pipeline parallelism (PP), context parallelism (CP), and remote memory access (Engram), all designed for zero or minimal SM occupation. All kernels are compiled at runtime via a lightweight Just-In-Time (JIT) module, requiring no CUDA compilation during installation.
Despite its lightweight design, DeepEP's performance matches or exceeds hardware bandwidth limits across various configurations.
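To make the terms "dispatch" and "combine" concrete, here is a minimal single-process sketch of the data movement they describe. It is a conceptual illustration only, not DeepEP's kernels or API: the function names, the toy expert (expert `e` scales its input by `e + 1`), and applying the top-k weights during combine are all assumptions made for this example.

```python
from collections import defaultdict


def dispatch(tokens, topk_idx, num_experts, num_ranks):
    """Route each (token, expert) pair to the rank that owns the expert."""
    experts_per_rank = num_experts // num_ranks
    inbox = defaultdict(list)
    for token_id, expert_ids in enumerate(topk_idx):
        for slot, expert_id in enumerate(expert_ids):
            dst_rank = expert_id // experts_per_rank
            inbox[dst_rank].append((token_id, slot, expert_id, tokens[token_id]))
    return inbox


def run_experts(inbox):
    """Toy experts (hypothetical): expert e multiplies its input by (e + 1)."""
    outbox = []
    for items in inbox.values():
        for token_id, slot, expert_id, vec in items:
            outbox.append((token_id, slot, [(expert_id + 1) * v for v in vec]))
    return outbox


def combine(outbox, topk_weights, num_tokens, hidden):
    """Send expert outputs back and reduce them per token with top-k weights."""
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for token_id, slot, vec in outbox:
        weight = topk_weights[token_id][slot]
        for i, value in enumerate(vec):
            out[token_id][i] += weight * value
    return out


tokens = [[1.0, 2.0], [3.0, 4.0]]          # 2 tokens, hidden size 2
topk_idx = [[0, 3], [1, 2]]                # top-2 experts per token
topk_weights = [[0.5, 0.5], [0.25, 0.75]]  # routing weights per token

inbox = dispatch(tokens, topk_idx, num_experts=4, num_ranks=2)
result = combine(run_experts(inbox), topk_weights, num_tokens=2, hidden=2)
print(result)  # [[2.5, 5.0], [8.25, 11.0]]
```

In a real expert-parallel run, each rank holds only its own tokens and experts, so `dispatch` and `combine` become all-to-all exchanges across GPUs; that exchange is the part a library like DeepEP accelerates.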
News
- V2 release: A complete refactoring of Expert Parallelism — achieving extreme performance with several times fewer SM resources compared to V1, while supporting significantly larger scale-up and scale-out domains. V2 has also switched from the NVSHMEM backend to the more lightweight NCCL Gin backend.
New features
- Fully JIT (Just-In-Time compilation)
- NCCL Gin backend
- Header-only & lightweight
- Able to reuse existing NCCL communicators
- EPv2
- High-throughput and low-latency APIs unified into a single `ElasticBuffer` interface, with a new GEMM layout
- Larger scale-up & scale-out domain support (up to EP2048)
- Analytical SM & QP count calculation — no more auto-tuning needed
- Both hybrid & direct modes remain supported
- For V3-like legacy training, SM usage reduced from 24 to 4 - 6 while maintaining equivalent or better performance
- 0 SM Engram (with RDMA)
- 0 SM PP (with RDMA)
- 0 SM CP (with Copy Engine)
Notes
- Buffer size consumption is larger than V1
- 0 SM RDMA low-latency EP is no longer supported
- Engram, PP, and CP are experimental features
Ongoing features
- Elastic GPU & CPU buffers: A contiguous virtual address space that maps to a hybrid of GPU and CPU physical memory under the hood, enabling fully automatic and transparent Engram or imbalanced EP
- Reducing intermediate buffer sizes by leveraging EP replay to handle load imbalance
- All-gather updates and reduce-scatter implementations for DP & TP
For the legacy V1 documentation (NVSHMEM-based), see [docs/legacy.md](docs/legacy.md).
Performance
Following V3's configuration, we tested with 8K tokens per batch, 7168 hidden dimensions, top 8 experts, FP8 dispatching, and BF16 combining, and obtained the following results:
| Arch | NIC type | Topo | Dispatch Bottleneck Bandwidth | Combine Bottleneck Bandwidth | #SMs |
|--|--|--|--|--|--|
| SM90 | CX7 | EP 8 x 2 | 90 GB/s (RDMA) | 81 GB/s (RDMA) | 12 |
| SM90 | CX7 | EP 8 x 4 | 61 GB/s (RDMA) | 61 GB/s (RDMA) | 6 |
| SM100 | CX7 | EP 8 x 2 | 90 GB/s (RDMA) | 91 GB/s (RDMA) | 12 |
| SM100 | N/A | EP 8 | 726 GB/s (NVLink) | 740 GB/s (NVLink) | 64 (Max perf) |
| SM100 | N/A | EP 8 | 643 GB/s (NVLink) | 675 GB/s (NVLink) | 24 (Min #SM) |
Note: the results are logical bandwidth. For example, in the EP 8 x 2 case, the 90 GB/s figure includes local-rank traffic.
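For a sense of scale, the benchmark configuration can be turned into a rough payload estimate. The sketch below is back-of-the-envelope arithmetic, not DeepEP's measurement method. It assumes "8K" means 8192 tokens, that every (token, expert) pair is sent as a separate copy (an upper bound, since a library may send a token once per destination rank), and that FP8 scaling factors are ignored. The helper names are made up for this example.

```python
def payload_bytes(num_tokens, hidden, num_topk, bytes_per_elem):
    """Upper bound on bytes moved: one full copy per (token, expert) pair."""
    return num_tokens * hidden * num_topk * bytes_per_elem


def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Time in milliseconds to move num_bytes at the given GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


NUM_TOKENS = 8192  # assumption: "8K tokens" read as 8192
HIDDEN = 7168
NUM_TOPK = 8

dispatch_bytes = payload_bytes(NUM_TOKENS, HIDDEN, NUM_TOPK, bytes_per_elem=1)  # FP8
combine_bytes = payload_bytes(NUM_TOKENS, HIDDEN, NUM_TOPK, bytes_per_elem=2)   # BF16

print(dispatch_bytes)  # 469762048
print(combine_bytes)   # 939524096
print(transfer_ms(dispatch_bytes, 90))  # estimated ms at a 90 GB/s logical rate
```

Because the reported figures are logical bandwidth, the byte count being divided includes traffic that stays on the local rank, so the bytes that actually cross the NIC are fewer than this estimate.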
Compared with V1, V2 achieves up to 1.3x peak performance while using up to 4x fewer SMs.
We omit results for larger EP configurations for the time being, but encourage interested users to benchmark them directly. Based on our internal experience, we expect the kernel to continue saturating hardware bandwidth at scale.
For V1 performance data, see [docs/legacy.md](docs/legacy.md#performance).
Quick start
Requirements
- Hopper (SM90) GPUs, or other architectures with SM90 PTX ISA support
- Python 3.8 and above
- CUDA version
  - CUDA 12.3 and above for SM90 GPUs
- PyTorch 2.10 and above
- NCCL 2.30.4 and above
- NVLink for intranode communication
- RDMA network for internode communication
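The version requirements above can be checked before installing. The following is a minimal sketch, not part of DeepEP: the helper names are made up for this example, and testing for compute capability 9.0 or newer is only an approximation of "SM90 PTX ISA support". Note also that `torch.cuda.nccl.version()` reports the NCCL that PyTorch itself uses, which may differ from a pip-installed NCCL that DeepEP locates.

```python
import re
import sys

# Minimum versions copied from the requirements list above.
MINIMUMS = {"python": (3, 8), "cuda": (12, 3), "torch": (2, 10), "nccl": (2, 30, 4)}


def parse_version(text):
    """Turn '2.10.0+cu130' into (2, 10, 0), ignoring local and pre-release suffixes."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets(found, minimum):
    """Compare version tuples lexicographically."""
    return tuple(found) >= tuple(minimum)


def check_environment():
    """Return a dict of requirement name -> bool for what can be detected."""
    report = {"python": meets(sys.version_info[:3], MINIMUMS["python"])}
    try:
        import torch
    except ImportError:
        return report  # PyTorch not installed; nothing more to check
    report["torch"] = meets(parse_version(torch.__version__), MINIMUMS["torch"])
    if torch.version.cuda is not None:
        report["cuda"] = meets(parse_version(torch.version.cuda), MINIMUMS["cuda"])
    if torch.cuda.is_available():
        # Assumption: capability >= 9.0 stands in for SM90 PTX ISA support.
        report["sm90_or_newer"] = torch.cuda.get_device_capability() >= (9, 0)
        report["nccl"] = meets(torch.cuda.nccl.version(), MINIMUMS["nccl"])
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'below minimum or unsupported'}")
```

NVLink and RDMA availability are not covered here; those depend on the cluster hardware rather than the Python environment.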
Install NCCL dependency
We recommend using pip to install NCCL so that DeepEP can automatically locate it within the Python environment. You can install it using the following command:
```bash
pip install "nvidia-nccl-cu13>=2.30.4" --no-deps
```
Install NVSHMEM dependency
DeepEP also depends on NVSHMEM to provide support for legacy methods. Please refer to our [NVSHMEM Installation Guide](docs/nvshmem.md) for instructions.
Development
```bash
# Build and make symbolic links for SO files
python setup.py build
# You may modify the specific SO names according to your own platform
ln -s build/lib.linux-x86_64-cpython-38/deep_ep_cpp.cpython-38-x86_64-linux-gnu.so

# Run test cases
# NOTES: you may modify the `init_dist` function in `tests/utils/envs.py`
# according to your own cluster settings, and launch into multiple nodes
python tests/elastic/test_ep.py
python tests/elastic/test_agrs.py
python tests/elastic/test_engram.py
python tests/elastic/test_pp.py
```
Installation
```bash
python setup.py install
```

Then, import `deep_ep` in your Python project, and enjoy!
Interfaces and examples
Buffer initialization
In V2, all EP operations — high-throughput and low-latency — are unified under a single ElasticBuffer interface. The buffer can be initialized by specifying MoE settings directly, and the optimal SM and QP counts are calculated analytically.
```python
import torch
import torch.distributed as dist
from typing import Optional

from deep_ep import ElasticBuffer

# Communication buffer (will allocate at runtime)
_buffer: Optional[ElasticBuffer] = None

# Number of SMs to use for communication kernels (will be set at buffer creation)
_num_comm_sms: int = 0


def get_buffer(group: dist.ProcessGroup,
               num_max_tokens_per_rank: int, hidden: int,
               num_topk: int, num_experts: int,
               use_fp8_dispatch: bool = False) -> ElasticBuffer:
    """Initialize or retrieve the ElasticBuffer for EP communication."""
    global _buffer, _num_comm_sms

    # Check if we can reuse the existing buffer
    required_bytes = ElasticBuffer.get_buffer_size_hint(
        group, num_max_tokens_per_rank, hidden,
        num_topk=num_topk, use_fp8_dispatch=use_fp8_dispatch,
    )
    if _buffer is not None and _buffer.group == group and _buffer.num_bytes >= required_bytes:
        return _buffer

    # Allocate a new buffer with MoE…
```
Excerpt shown — open the source for the full document.
Notability: 9.0/10. Very high stars for a new repo from a major lab.