What does this fork signal mean?

SiliconFlow forked xdit-project/xDiT as siliconflow/xDiT. This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo siliconflow/xDiT · parent xdit-project/xDiT · Routine internal fork. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

SiliconFlow Fork: siliconflow/xDiT

Captured source

GitHub · github.com/siliconflow/xDiT

siliconflow/xDiT repository metadata

published Aug 15, 2024 · seen 5d · captured 9h · http 200 · method plain

siliconflow/xDiT

Description: xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 0

Created: 2024-08-15T07:37:15Z

Pushed: 2024-08-15T07:36:38Z

Default branch: main

Fork: yes

Parent repository: xdit-project/xDiT

Archived: no

README:

Table of Contents

[🔥 Meet xDiT](#meet-xdit)
[📢 Updates](#updates)
[🎯 Supported DiTs](#support-dits)
[📈 Performance](#perf)
[🚀 QuickStart](#QuickStart)
[✨ the xDiT's secret weapons](#secrets)
[1. PipeFusion](#PipeFusion)
[2. Unified Sequence Parallel](#USP)
[3. Hybrid Parallel](#hybrid_parallel)
[4. CFG Parallel](#cfg_parallel)
[5. Parallel VAE](#parallel_vae)
[📚 Develop Guide](#dev-guide)
[🚧 History and Looking for Contributions](#history)
[📝 Cite Us](#cite-us)

🔥 Meet xDiT

Diffusion Transformers (DiTs), pivotal in text-to-image and text-to-video models, are driving advancements in high-quality image and video generation. As the input sequence length in DiTs increases, the computational cost of the attention mechanism grows quadratically. Consequently, multi-GPU and multi-machine deployments are essential to maintain real-time performance in online services.
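To make the quadratic growth concrete, here is a rough sketch that counts the multiply-accumulate operations in the two matrix products of self-attention (the score matrix and the weighted sum over values). The function name `attention_macs` and the hidden size used in the demo are illustrative assumptions, not values taken from xDiT.

```python
def attention_macs(seq_len: int, hidden_size: int) -> int:
    """Approximate multiply-accumulates for one self-attention layer.

    Counts only the two sequence-length-squared matmuls and ignores the
    Q/K/V/output projections, softmax, and any multi-head bookkeeping.
    """
    score_macs = seq_len * seq_len * hidden_size   # Q @ K^T
    output_macs = seq_len * seq_len * hidden_size  # softmax(scores) @ V
    return score_macs + output_macs


if __name__ == "__main__":
    hidden = 1152  # hypothetical hidden size, chosen only for illustration
    for tokens in (1024, 2048, 4096):
        print(f"{tokens:>5} tokens -> {attention_macs(tokens, hidden):,} MACs")
```

Doubling the sequence length quadruples this count, which is why longer sequences (higher resolutions, longer videos) push inference toward multiple devices.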

To meet the real-time demands of DiT applications, parallel inference is a must. xDiT is an inference engine designed for the parallel deployment of DiTs at large scale. xDiT provides a suite of efficient parallel inference approaches for Diffusion Models.

1. Sequence Parallelism: USP is a unified sequence parallel approach combining DeepSpeed-Ulysses and Ring-Attention.

2. PipeFusion: a patch-level pipeline parallelism that uses displaced patches, taking advantage of the characteristics of diffusion models.

3. Data Parallel: Processes multiple prompts, or generates multiple images from a single prompt, in parallel.

4. CFG Parallel, also known as Split Batch: Activates when using classifier-free guidance (CFG) with a constant parallelism of 2.
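As a minimal sketch of why CFG parallelism has a fixed degree of 2: classifier-free guidance needs one unconditional and one conditional noise prediction per step, and the two do not depend on each other. The code below is not xDiT's implementation. The `predict` callable is a hypothetical stand-in for a DiT forward pass, and two threads stand in for two devices.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]
# Hypothetical model interface: (latent, prompt or None) -> predicted noise.
Predictor = Callable[[Vector, Optional[str]], Vector]


def guided_noise(predict: Predictor, latent: Vector, prompt: str,
                 scale: float) -> List[float]:
    """Run both CFG branches concurrently, then combine them.

    Combination rule (standard classifier-free guidance):
        noise = uncond + scale * (cond - uncond)
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The two branches are independent, so each could run on its own device.
        uncond_future = pool.submit(predict, latent, None)
        cond_future = pool.submit(predict, latent, prompt)
        uncond = uncond_future.result()
        cond = cond_future.result()
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    # Toy predictor used only to exercise the function.
    def toy_predict(latent: Vector, prompt: Optional[str]) -> List[float]:
        shift = 1.0 if prompt else 0.0
        return [x + shift for x in latent]

    print(guided_noise(toy_predict, [0.0, 2.0], "a cat", scale=3.0))
```

Because there are exactly two branches, adding more devices to this dimension brings no further speedup, which matches the "constant parallelism of 2" described above.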

The four parallel methods in xDiT can be configured in a hybrid manner, optimizing communication patterns to best suit the underlying network hardware.
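One way to picture the hybrid configuration space is to enumerate how a fixed number of devices can be divided among the four methods. The sketch below assumes that the four degrees multiply to the device count and that the CFG degree is either 1 or 2. Both the `HybridConfig` type and the `hybrid_configs` helper are hypothetical names for illustration, not part of the xDiT API.

```python
from itertools import product
from typing import Iterator, List, NamedTuple


class HybridConfig(NamedTuple):
    data: int        # data-parallel degree
    cfg: int         # CFG-parallel degree (1 = off, 2 = on)
    sequence: int    # sequence-parallel degree
    pipefusion: int  # PipeFusion degree


def divisors(n: int) -> List[int]:
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def hybrid_configs(world_size: int, use_cfg: bool) -> Iterator[HybridConfig]:
    """Yield every split whose degrees multiply to world_size (assumed rule)."""
    cfg_choices = (1, 2) if use_cfg else (1,)
    divs = divisors(world_size)
    for dp, cfg, sp in product(divs, cfg_choices, divs):
        used = dp * cfg * sp
        if world_size % used == 0:
            # Whatever devices remain are assigned to PipeFusion.
            yield HybridConfig(dp, cfg, sp, world_size // used)


if __name__ == "__main__":
    for config in hybrid_configs(8, use_cfg=True):
        print(config)
```

Which of these splits is fastest depends on the interconnect (for example PCIe versus NVLink), so an enumeration like this would only be a starting point for benchmarking.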

xDiT offers a set of APIs to adapt DiT models in huggingface/diffusers to a hybrid parallel implementation through simple wrappers. If the model you require is not available in the model zoo, developing it yourself is straightforward; please refer to our [Dev Guide](#dev-guide).

We have also implemented the following parallel strategies for reference:

1. Tensor Parallelism

2. DistriFusion

The communication and memory costs associated with the aforementioned parallelism, except for the CFG and DP, in DiTs are detailed in the table below. (* denotes that communication can be overlapped with computation.)

As we can see, PipeFusion and Sequence Parallel achieve the lowest communication cost on different scales and hardware configurations, making them suitable foundational components for a hybrid approach.

𝒑: Number of pixels; 𝒉𝒔: Model hidden size; 𝑳: Number of model layers; 𝑷: Total model parameters; 𝑵: Number of parallel devices; 𝑴: Number of patch splits; 𝑸𝑶: Query and Output parameter count; 𝑲𝑽: KV Activation parameter count; 𝑨 = 𝑸 = 𝑶 = 𝑲 = 𝑽: Equal parameters for Attention, Query, Output, Key, and Value;

📢 Updates

🎉August 9, 2024: Support Latte sequence parallel version. The inference script is [examples/latte_example](examples/latte_example.py).
🎉August 8, 2024: Support Flux sequence parallel version. The inference script is [examples/flux_example](examples/flux_example.py).
🎉August 2, 2024: Support Stable Diffusion 3 hybrid parallel version. The inference script is [examples/sd3_example](examples/sd3_example.py).
🎉July 18, 2024: Support PixArt-Sigma and PixArt-Alpha. The inference scripts are [examples/pixartsigma_example.py](examples/pixartsigma_example.py), [examples/pixartalpha_example.py](examples/pixartalpha_example.py).
🎉July 17, 2024: Rename the project to xDiT. The project has evolved from a collection of parallel methods into a unified inference framework and now supports hybrid parallelism for DiTs.
🎉July 10, 2024: Support HunyuanDiT. The inference script is [legacy/scripts/hunyuandit_example.py](./legacy/scripts/hunyuandit_example.py).
🎉June 26, 2024: Support Stable Diffusion 3. The inference script is [legacy/scripts/sd3_example.py](./legacy/scripts/sd3_example.py).
🎉May 24, 2024: PipeFusion is publicly released. It supports PixArt-alpha [legacy/scripts/pixart_example.py](./legacy/scripts/pixart_example.py), DiT [legacy/scripts/ditxl_example.py](./legacy/scripts/ditxl_example.py) and SDXL [legacy/scripts/sdxl_example.py](./legacy/scripts/sdxl_example.py).

🎯 Supported DiTs

| Model Name | CFG | SP | PipeFusion |
| --- | --- | --- | --- |
| 🎬 Latte | ❎ | ✔️ | ❎ |
| 🔵 HunyuanDiT-v1.2-Diffusers | ✔️ | ❎ | ✔️ |
| 🟠 Flux | NA | ✔️ | ❎ |
| 🔴 PixArt-Sigma | ✔️ | ✔️ | ✔️ |
| 🟢 PixArt-alpha | ✔️ | ✔️ | ✔️ |
| 🟠 Stable Diffusion 3 | ✔️ | ✔️ | ✔️ |
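For picking a model by the strategies it supports, the matrix above can be transcribed into a small lookup. This is a sketch that assumes ✔️ means supported, ❎ means not supported, and NA means not applicable. The `SUPPORT` dictionary and `models_supporting` helper are hypothetical names, not part of xDiT.

```python
from typing import Dict, List, Optional

# Transcribed from the support table: True = supported, False = not
# supported, None = listed as NA (assumed to mean "not applicable").
SUPPORT: Dict[str, Dict[str, Optional[bool]]] = {
    "Latte": {"CFG": False, "SP": True, "PipeFusion": False},
    "HunyuanDiT-v1.2-Diffusers": {"CFG": True, "SP": False, "PipeFusion": True},
    "Flux": {"CFG": None, "SP": True, "PipeFusion": False},
    "PixArt-Sigma": {"CFG": True, "SP": True, "PipeFusion": True},
    "PixArt-alpha": {"CFG": True, "SP": True, "PipeFusion": True},
    "Stable Diffusion 3": {"CFG": True, "SP": True, "PipeFusion": True},
}


def models_supporting(*strategies: str) -> List[str]:
    """Return the models that support every requested strategy."""
    return [
        name
        for name, row in SUPPORT.items()
        if all(row[strategy] is True for strategy in strategies)
    ]


if __name__ == "__main__":
    print(models_supporting("SP", "PipeFusion"))
```

In this snapshot, only the two PixArt models and Stable Diffusion 3 support all three strategies, so they are the only candidates for a full hybrid configuration.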

Supported by legacy version only:

🔴 DiT-XL

📈 Performance

Here are the benchmark results for PixArt-alpha using the 20-step DPM solver as the scheduler across various image resolutions. To replicate these findings, please refer to the script at [./legacy/scripts/benchmark.sh](./legacy/scripts/benchmark.sh).

TBD: Update results on hybrid parallelism.

1. The Latency on 4xA100-80GB (PCIe)

2. The Latency on 8xL20-48GB (PCIe)

3. The Latency on 8xA100-80GB (NVLink)

4. The Latency on 4xT4-16GB (PCIe)

🚀 QuickStart

1. Install from pip

pip install xfuser

2. Install from source

2.1 Install yunchang for sequence parallel.

Install yunchang from feifeibear/long-context-attention. Please note that it has a dependency on flash attention and specific GPU model requirements.…

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Routine internal fork