Repo: DeepSeek, published Feb 17, 2025

deepseek-ai/DeepEP

Cuda

Description: DeepEP: an efficient expert-parallel communication library

Language: Cuda

License: MIT

Stars: 9711

Forks: 1282

Open issues: 266

Created: 2025-02-17T01:33:04Z

Pushed: 2026-06-01T02:33:48Z

Default branch: main

Fork: no

Archived: no

README:

DeepEP

DeepEP (DeepEveryParallel) is a high-performance communication library for modern machine learning training and inference. The library currently focuses on expert parallelism (EP) — providing high-throughput and low-latency all-to-all GPU kernels (MoE dispatch and combine) with low-precision support including FP8 — while also offering experimental primitives for pipeline parallelism (PP), context parallelism (CP), and remote memory access (Engram), all designed for zero or minimal SM occupation. All kernels are compiled at runtime via a lightweight Just-In-Time (JIT) module, requiring no CUDA compilation during installation.

Despite its lightweight design, DeepEP's performance matches or exceeds hardware bandwidth limits across various configurations.

News

  • V2 release: A complete refactoring of Expert Parallelism — achieving extreme performance with several times fewer SM resources compared to V1, while supporting significantly larger scale-up and scale-out domains. V2 has also switched from the NVSHMEM backend to the more lightweight NCCL Gin backend.

New features

  • Fully JIT (Just-In-Time compilation)
  • NCCL Gin backend
      • Header-only & lightweight
      • Able to reuse existing NCCL communicators
  • EPv2
      • High-throughput and low-latency APIs unified into a single ElasticBuffer interface, with a new GEMM layout
      • Larger scale-up & scale-out domain support (up to EP2048)
      • Analytical SM & QP count calculation; auto-tuning is no longer needed
      • Both hybrid & direct modes remain supported
      • For V3-like legacy training, SM usage is reduced from 24 to 4-6 SMs while maintaining equivalent or better performance
  • 0 SM Engram (with RDMA)
  • 0 SM PP (with RDMA)
  • 0 SM CP (with Copy Engine)

Notes

  • Buffer size consumption is larger than V1
  • 0 SM RDMA low-latency EP is no longer supported
  • Engram, PP, and CP are experimental features

Features still in progress

  • Elastic GPU & CPU buffers: A contiguous virtual address space that maps to a hybrid of GPU and CPU physical memory under the hood, enabling fully automatic and transparent Engram or imbalanced EP
  • Reducing intermediate buffer sizes by leveraging EP replay to handle load imbalance
  • All-gather updates and reduce-scatter implementations for DP & TP

For the legacy V1 documentation (NVSHMEM-based), see [docs/legacy.md](docs/legacy.md).

Performance

Following V3's configuration, we tested with 8K tokens per batch, 7168 hidden dimensions, top 8 experts, FP8 dispatching, and BF16 combining, and obtained the following results:

| Arch | NIC type | Topo | Dispatch Bottleneck Bandwidth | Combine Bottleneck Bandwidth | #SMs |
|--|--|--|--|--|--|
| SM90 | CX7 | EP 8 x 2 | 90 GB/s (RDMA) | 81 GB/s (RDMA) | 12 |
| SM90 | CX7 | EP 8 x 4 | 61 GB/s (RDMA) | 61 GB/s (RDMA) | 6 |
| SM100 | CX7 | EP 8 x 2 | 90 GB/s (RDMA) | 91 GB/s (RDMA) | 12 |
| SM100 | N/A | EP 8 | 726 GB/s (NVLink) | 740 GB/s (NVLink) | 64 (Max perf) |
| SM100 | N/A | EP 8 | 643 GB/s (NVLink) | 675 GB/s (NVLink) | 24 (Min #SM) |

Note: the results are logical bandwidths. For example, in the EP 8 x 2 case, the 90 GB/s figure includes local-rank traffic.
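To make the note on logical bandwidth concrete, here is a rough sketch of how a logical figure could be split into local and remote shares. It assumes tokens are routed uniformly across all ranks and that only the sending rank's own share counts as local. Neither assumption comes from the DeepEP source, and `remote_share` is a hypothetical helper, so treat the result as an illustration rather than a measurement.

```python
def remote_share(num_ranks: int, num_local_ranks: int = 1) -> float:
    """Fraction of uniformly routed traffic that leaves the local ranks.

    Assumption (not from DeepEP): every rank receives an equal share.
    """
    if not 0 < num_local_ranks <= num_ranks:
        raise ValueError("num_local_ranks must be in 1..num_ranks")
    return 1.0 - num_local_ranks / num_ranks


# EP 8 x 2 dispatch row from the table: 16 ranks, 90 GB/s logical.
logical_gbps = 90.0
non_local_gbps = logical_gbps * remote_share(num_ranks=16)
print(non_local_gbps)  # 84.375 under the uniform-routing assumption
```

Real MoE routing is rarely uniform, so the actual split depends on the workload.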

Compared with V1, V2 achieves up to 1.3x the peak performance while using up to 4x fewer SMs.

We omit results for larger EP configurations for the time being, but encourage interested users to benchmark them directly. Based on our internal experience, we expect the kernel to continue saturating hardware bandwidth at scale.

For V1 performance data, see [docs/legacy.md](docs/legacy.md#performance).

Quick start

Requirements

  • Hopper (SM90) GPUs, or other architectures with SM90 PTX ISA support
  • Python 3.8 and above
  • CUDA version
      • CUDA 12.3 and above for SM90 GPUs
  • PyTorch 2.10 and above
  • NCCL 2.30.4 and above
  • NVLink for intranode communication
  • RDMA network for internode communication
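A quick pre-flight check for some of the requirements above might look like the following sketch. `check_python` and `check_gpu` are hypothetical helper names, not part of DeepEP. The compute-capability threshold is only an approximation of "SM90 PTX ISA support", and the GPU check returns `None` when PyTorch or a CUDA device is unavailable.

```python
import sys
from typing import Optional, Tuple


def check_python(minimum: Tuple[int, int] = (3, 8)) -> bool:
    """True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


def check_gpu(minimum: Tuple[int, int] = (9, 0)) -> Optional[bool]:
    """True/False for the current GPU's compute capability, None if unknown.

    Assumption: capability >= (9, 0) stands in for "SM90 or newer".
    """
    try:
        import torch  # imported lazily so the script runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability() >= minimum


print("Python OK:", check_python())
print("GPU OK:", check_gpu())
```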

Install NCCL dependency

We recommend using pip to install NCCL so that DeepEP can automatically locate it within the Python environment. You can install it using the following command:

pip install "nvidia-nccl-cu13>=2.30.4" --no-deps
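After installing, one way to confirm that the pip-installed NCCL meets the minimum version is to query the package metadata, as in the sketch below. The helper names are hypothetical, and the distribution name is taken from the install command above. Versions are compared as integer tuples because a plain string comparison would rank "2.9.1" above "2.30.4".

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.30.4' into (2, 30, 4), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def nccl_is_new_enough(dist_name: str = "nvidia-nccl-cu13",
                       minimum: Tuple[int, ...] = (2, 30, 4)) -> Optional[bool]:
    """True/False if the pip package is installed, None if it is not found."""
    try:
        found = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return parse_version(found) >= minimum


print("NCCL new enough:", nccl_is_new_enough())
```

This only inspects the pip package; it does not tell you which NCCL library is actually loaded at runtime.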

Install NVSHMEM dependency

DeepEP also depends on NVSHMEM to provide support for legacy methods. Please refer to our [NVSHMEM Installation Guide](docs/nvshmem.md) for instructions.

Development

# Build and make symbolic links for SO files
python setup.py build
# You may modify the specific SO names according to your own platform
ln -s build/lib.linux-x86_64-cpython-38/deep_ep_cpp.cpython-38-x86_64-linux-gnu.so

# Run test cases
# NOTES: you may modify the `init_dist` function in `tests/utils/envs.py`
# according to your own cluster settings, and launch into multiple nodes
python tests/elastic/test_ep.py
python tests/elastic/test_agrs.py
python tests/elastic/test_engram.py
python tests/elastic/test_pp.py
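The SO file name in the symlink step depends on your platform and Python version. As a small sketch, the expected name can be derived from the interpreter's extension suffix instead of being typed by hand. `expected_so_name` and `find_built_so` are hypothetical helpers, and the `build/lib.*` layout is assumed from the example path above.

```python
import sysconfig
from pathlib import Path
from typing import List


def expected_so_name(module: str = "deep_ep_cpp") -> str:
    """File name the current interpreter expects for the extension module."""
    suffix = sysconfig.get_config_var("EXT_SUFFIX")
    return f"{module}{suffix}"


def find_built_so(build_dir: str = "build") -> List[Path]:
    """Look for the built extension under build/lib.*/ (assumed layout)."""
    root = Path(build_dir)
    if not root.is_dir():
        return []
    return sorted(root.glob(f"lib.*/{expected_so_name()}"))


print(expected_so_name())
print(find_built_so())
```

If `find_built_so()` returns a path, that is the file to pass to `ln -s`.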

Installation

python setup.py install

Then, import deep_ep in your Python project, and enjoy!

Interfaces and examples

Buffer initialization

In V2, all EP operations — high-throughput and low-latency — are unified under a single ElasticBuffer interface. The buffer can be initialized by specifying MoE settings directly, and the optimal SM and QP counts are calculated analytically.

import torch
import torch.distributed as dist
from typing import Optional

from deep_ep import ElasticBuffer

# Communication buffer (will allocate at runtime)
_buffer: Optional[ElasticBuffer] = None

# Number of SMs to use for communication kernels (will be set at buffer creation)
_num_comm_sms: int = 0

def get_buffer(group: dist.ProcessGroup,
               num_max_tokens_per_rank: int,
               hidden: int,
               num_topk: int,
               num_experts: int,
               use_fp8_dispatch: bool = False) -> ElasticBuffer:
    """Initialize or retrieve the ElasticBuffer for EP communication."""
    global _buffer, _num_comm_sms

    # Check if we can reuse the existing buffer
    required_bytes = ElasticBuffer.get_buffer_size_hint(
        group, num_max_tokens_per_rank, hidden,
        num_topk=num_topk, use_fp8_dispatch=use_fp8_dispatch,
    )
    if _buffer is not None and _buffer.group == group and _buffer.num_bytes >= required_bytes:
        return _buffer

    # Allocate a new buffer with MoE…
