siliconflow/xDiT
forked from xdit-project/xDiT
Description: xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters
License: Apache-2.0
Stars: 0
Forks: 0
Open issues: 0
Created: 2024-08-15T07:37:15Z
Pushed: 2024-08-15T07:36:38Z
Default branch: main
Fork: yes
Parent repository: xdit-project/xDiT
Archived: no
README:
Table of Contents
- [🔥 Meet xDiT](#meet-xdit)
- [📢 Updates](#updates)
- [🎯 Supported DiTs](#support-dits)
- [📈 Performance](#perf)
- [🚀 QuickStart](#QuickStart)
- [✨ the xDiT's secret weapons](#secrets)
- [1. PipeFusion](#PipeFusion)
- [2. Unified Sequence Parallel](#USP)
- [3. Hybrid Parallel](#hybrid_parallel)
- [4. CFG Parallel](#cfg_parallel)
- [5. Parallel VAE](#parallel_vae)
- [📚 Develop Guide](#dev-guide)
- [🚧 History and Looking for Contributions](#history)
- [📝 Cite Us](#cite-us)
🔥 Meet xDiT
Diffusion Transformers (DiTs), pivotal in text-to-image and text-to-video models, are driving advancements in high-quality image and video generation. With the escalating input sequence length in DiTs, the computational demand of the Attention mechanism grows quadratically! Consequently, multi-GPU and multi-machine deployments are essential to maintain real-time performance in online services.
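To make the quadratic growth concrete, the sketch below estimates the token count and the relative attention cost for a few square image resolutions. It assumes a latent-diffusion DiT with an 8x VAE downsampling factor and a 2x2 patch size. These values and the function names are illustrative assumptions, not taken from the xDiT source.

```python
def num_tokens(height: int, width: int, vae_factor: int = 8, patch_size: int = 2) -> int:
    """Number of image tokens the DiT attends over.

    Assumption: pixels are downsampled by `vae_factor`, then grouped into
    `patch_size` x `patch_size` latent patches, one token per patch.
    """
    stride = vae_factor * patch_size
    return (height // stride) * (width // stride)


def relative_attention_cost(height: int, width: int, base: int = 1024) -> float:
    """Attention cost relative to a base x base image (cost ~ tokens squared)."""
    return (num_tokens(height, width) / num_tokens(base, base)) ** 2


if __name__ == "__main__":
    for side in (1024, 2048, 4096):
        tokens = num_tokens(side, side)
        cost = relative_attention_cost(side, side)
        print(f"{side}x{side}: {tokens} tokens, {cost:.0f}x attention cost")
```

Under these assumptions, doubling the image side quadruples the token count and multiplies the attention cost by 16.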
To meet the real-time demands of DiT applications, parallel inference is a must. xDiT is an inference engine designed for the parallel deployment of DiTs at large scale. It provides a suite of efficient parallel inference approaches for diffusion models.
1. Sequence Parallelism: USP, a unified sequence parallel approach that combines DeepSpeed-Ulysses and Ring-Attention.
2. PipeFusion: a patch-level pipeline parallelism that uses displaced patches, taking advantage of the characteristics of diffusion models.
3. Data Parallel: processes multiple prompts, or generates multiple images from a single prompt, in parallel.
4. CFG Parallel, also known as Split Batch: activates when classifier-free guidance (CFG) is used, with a constant parallel degree of 2.
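The Split Batch idea behind CFG Parallel can be sketched in a few lines: the unconditional and conditional branches are independent, so each can run on its own device group before the outputs are merged with the standard guidance formula. This is a minimal single-process simulation, not xDiT's implementation; `cfg_parallel_step`, `cfg_combine`, and `toy_model` are hypothetical names, and the two "ranks" run sequentially here.

```python
from typing import Callable, List

Vector = List[float]


def cfg_combine(uncond: Vector, cond: Vector, guidance_scale: float) -> Vector:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def cfg_parallel_step(
    model: Callable[[Vector, bool], Vector],
    latent: Vector,
    guidance_scale: float,
) -> Vector:
    """One denoising step with the CFG batch split into two branches.

    Assumption: in a real deployment, branch 0 (unconditional) and branch 1
    (conditional) run on separate GPU groups and exchange their outputs,
    for example with an all-gather. Here the branches run one after another.
    """
    uncond, cond = (model(latent, use_prompt) for use_prompt in (False, True))
    return cfg_combine(uncond, cond, guidance_scale)


def toy_model(latent: Vector, use_prompt: bool) -> Vector:
    """Stand-in for a DiT forward pass; the prompt shifts the output by 1."""
    shift = 1.0 if use_prompt else 0.0
    return [x + shift for x in latent]


if __name__ == "__main__":
    print(cfg_parallel_step(toy_model, [0.0, 2.0], guidance_scale=7.5))
```

Because there are exactly two branches, this form of parallelism cannot scale beyond a degree of 2, which is why it is usually combined with the other methods.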
The four parallel methods in xDiT can be configured in a hybrid manner, optimizing communication patterns to best suit the underlying network hardware.
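As a rough illustration of a hybrid configuration, the sketch below checks that a set of parallel degrees accounts for every GPU. It assumes that the degrees multiply to the total device count and that the sequence parallel degree is the product of the Ulysses and Ring degrees. The class and field names are hypothetical and are not xDiT's actual configuration API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridConfig:
    """Hypothetical container for hybrid parallel degrees (illustrative only)."""

    data_parallel: int = 1
    cfg_parallel: int = 1  # 1 (off) or 2 (on): CFG parallelism is fixed at 2
    pipefusion: int = 1
    ulysses: int = 1       # USP combines Ulysses and Ring attention
    ring: int = 1

    @property
    def sequence_parallel(self) -> int:
        return self.ulysses * self.ring

    @property
    def world_size(self) -> int:
        # Assumption: the degrees form a mesh, so their product is the GPU count.
        return (
            self.data_parallel
            * self.cfg_parallel
            * self.pipefusion
            * self.sequence_parallel
        )

    def validate(self, num_gpus: int) -> None:
        degrees = (self.data_parallel, self.cfg_parallel,
                   self.pipefusion, self.ulysses, self.ring)
        if any(d < 1 for d in degrees):
            raise ValueError("every parallel degree must be at least 1")
        if self.cfg_parallel not in (1, 2):
            raise ValueError("CFG parallel degree must be 1 or 2")
        if self.world_size != num_gpus:
            raise ValueError(
                f"degrees multiply to {self.world_size}, expected {num_gpus} GPUs"
            )


if __name__ == "__main__":
    config = HybridConfig(cfg_parallel=2, pipefusion=2, ulysses=2)
    config.validate(num_gpus=8)
    print(config.world_size)
```

Which degree to assign to which method depends on the interconnect; the check above only verifies that the factorization is consistent.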
xDiT offers a set of APIs that adapt DiT models in huggingface/diffusers to a hybrid parallel implementation through simple wrappers. If the model you require is not available in the model zoo, developing it yourself is straightforward; please refer to our [Dev Guide](#dev-guide).
We have also implemented the following parallel strategies for reference:
1. Tensor Parallelism
2. DistriFusion
The communication and memory costs of these parallel methods in DiTs, excluding CFG Parallel and Data Parallel (DP), are detailed in the table below. (* denotes that communication can be overlapped with computation.)
As we can see, PipeFusion and Sequence Parallel achieve the lowest communication cost at different scales and on different hardware configurations, making them suitable foundational components for a hybrid approach.
𝒑: Number of pixels; 𝒉𝒔: Model hidden size; 𝑳: Number of model layers; 𝑷: Total model parameters; 𝑵: Number of parallel devices; 𝑴: Number of patch splits; 𝑸𝑶: Query and Output parameter count; 𝑲𝑽: KV Activation parameter count; 𝑨 = 𝑸 = 𝑶 = 𝑲 = 𝑽: Equal parameters for Attention, Query, Output, Key, and Value;
📢 Updates
- 🎉August 9, 2024: Support Latte sequence parallel version. The inference script is [examples/latte_example](examples/latte_example.py).
- 🎉August 8, 2024: Support Flux sequence parallel version. The inference script is [examples/flux_example](examples/flux_example.py).
- 🎉August 2, 2024: Support Stable Diffusion 3 hybrid parallel version. The inference script is [examples/sd3_example](examples/sd3_example.py).
- 🎉July 18, 2024: Support PixArt-Sigma and PixArt-Alpha. The inference scripts are [examples/pixartsigma_example.py](examples/pixartsigma_example.py), [examples/pixartalpha_example.py](examples/pixartalpha_example.py).
- 🎉July 17, 2024: Rename the project to xDiT. The project has evolved from a collection of parallel methods into a unified inference framework and now supports hybrid parallelism for DiTs.
- 🎉July 10, 2024: Support HunyuanDiT. The inference script is [legacy/scripts/hunyuandit_example.py](./legacy/scripts/hunyuandit_example.py).
- 🎉June 26, 2024: Support Stable Diffusion 3. The inference script is [legacy/scripts/sd3_example.py](./legacy/scripts/sd3_example.py).
- 🎉May 24, 2024: PipeFusion is publicly released. It supports PixArt-alpha [legacy/scripts/pixart_example.py](./legacy/scripts/pixart_example.py), DiT [legacy/scripts/ditxl_example.py](./legacy/scripts/ditxl_example.py), and SDXL [legacy/scripts/sdxl_example.py](./legacy/scripts/sdxl_example.py).
🎯 Supported DiTs
| Model Name | CFG | SP | PipeFusion |
| --- | --- | --- | --- |
| 🎬 Latte | ❎ | ✔️ | ❎ |
| 🔵 HunyuanDiT-v1.2-Diffusers | ✔️ | ❎ | ✔️ |
| 🟠 Flux | NA | ✔️ | ❎ |
| 🔴 PixArt-Sigma | ✔️ | ✔️ | ✔️ |
| 🟢 PixArt-alpha | ✔️ | ✔️ | ✔️ |
| 🟠 Stable Diffusion 3 | ✔️ | ✔️ | ✔️ |
Supported by legacy version only:
📈 Performance
Here are the benchmark results for PixArt-alpha across various image resolutions, using the 20-step DPM solver as the scheduler. To replicate these findings, please refer to the script at [./legacy/scripts/benchmark.sh](./legacy/scripts/benchmark.sh).
TBD: Update results for hybrid parallelism.
1. The Latency on 4xA100-80GB (PCIe)
2. The Latency on 8xL20-48GB (PCIe)
3. The Latency on 8xA100-80GB (NVLink)
4. The Latency on 4xT4-16GB (PCIe)
🚀 QuickStart
1. Install from pip
pip install xfuser
2. Install from source
2.1 Install yunchang for sequence parallel.
Install yunchang from feifeibear/long-context-attention. Please note that it has a dependency on flash attention and specific GPU model requirements.…
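After either install path, a quick check can confirm which packages Python can see. This is a minimal sketch using only the standard library; it assumes the distributions are named `xfuser` (as in the pip command above) and `yunchang`, and the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumed distribution names; adjust if your environment differs.
    for name in ("xfuser", "yunchang"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```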