zai-org/Kaleido
Description: Kaleido: Open-sourced multi-subject reference video generation model, enabling controllable, high-fidelity video synthesis from multiple image references.
Language: Python
Stars: 134
Forks: 14
Open issues: 6
Created: 2025-10-20T15:13:25Z
Pushed: 2026-03-02T11:54:40Z
Default branch: main
Fork: no
Archived: no
README:
KALEIDO: OPEN-SOURCED MULTI-SUBJECT REFERENCE VIDEO GENERATION MODEL
---
Update and News
- 2025.10.28: 🔥 We release the checkpoints of Kaleido-14B-S2V.
- 2025.10.22: 🔥 We propose Kaleido, a novel multi-subject reference video generation model. Both the training and inference code have been open-sourced to facilitate further research and reproduction.
Quick Start
Prompt Optimization
Before running the model, please refer to this guide to see how we use large models like GLM-4.5 (or comparable products, such as GPT-5) to optimize the prompt. This step is crucial: the model was trained on long prompts, so prompt quality directly affects the quality of the generated video.
Diffusers
Please make sure your Python version is between 3.10 and 3.12, inclusive.
pip install -r requirements.txt
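As a quick pre-flight check, the version constraint above can be verified from Python before installing the requirements. This is a minimal sketch: the helper name `is_supported_python` is hypothetical and is not part of the Kaleido repository.

```python
import sys

# Supported range stated in this README: Python 3.10 through 3.12, inclusive.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def is_supported_python(version_info=None):
    """Return True if (major, minor) falls within the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = (version_info[0], version_info[1])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = "%d.%d" % sys.version_info[:2]
    if is_supported_python():
        print(f"Python {current} is within the supported range.")
    else:
        print(f"Python {current} is outside 3.10-3.12; consider another interpreter.")
```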
Checkpoints Download
| ckpts | Download Link | Notes |
|-------------|-----------------|---------------|
| Kaleido-14B | 🤗 Hugging Face | Supports 512P |
Use the following commands to download the model weights (We have integrated both Wan VAE and T5 modules into this checkpoint for convenience).
# Download the repository (skip automatic LFS file downloads)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/zai-org/Kaleido-14B-S2V
# Enter the repository folder
cd Kaleido-14B-S2V
# Merge the checkpoint files
python merge_kaleido.py
Arrange the model files into the following structure:
.
├── Kaleido-14B-S2V
│   ├── model
│   │   └── ....
│   ├── Wan2.1_VAE.pth
│   │
│   └── umt5-xxl
│       └── ....
├── configs
├── sat
└── sgm
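To confirm the files are arranged as shown above before running inference, a small check along these lines may help. It is only a sketch: `find_missing_paths` is a hypothetical helper, not a script shipped with the repository, and it checks only the entries named in the tree, not the contents elided by `....`.

```python
from pathlib import Path

# Entries taken from the directory tree above, relative to the project root.
EXPECTED_PATHS = [
    "Kaleido-14B-S2V/model",
    "Kaleido-14B-S2V/Wan2.1_VAE.pth",
    "Kaleido-14B-S2V/umt5-xxl",
    "configs",
    "sat",
    "sgm",
]


def find_missing_paths(root="."):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = find_missing_paths()
    if missing:
        print("Missing entries:")
        for entry in missing:
            print(f"  {entry}")
    else:
        print("All expected entries are present.")
```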
Usage
Inference
python sample_video.py --base configs/video_model/dit_crossattn_14B_wanvae.yaml configs/sampling/sample_wanvae_concat_14b.yaml
You can also use multiple GPUs to accelerate the inference process:
bash torchrun_multi_gpu.sh
Additionally, you can enable Sequence Parallelism in the YAML configuration file to further speed up multi-GPU inference:
args:
  s2v_concat: True
  ....
  sequence_parallel_size: 8
Note: The condition input txt file should contain lines in the following format:
prompt@@image1.png@@image2.png@@image3.png
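The line format above can be checked with a short parser before starting a long inference run. This is an illustrative sketch, not the repository's own loader: the names `Condition`, `parse_condition_line`, and `parse_condition_file` are hypothetical, and requiring at least one reference image per line is an assumption (the README example shows three).

```python
from dataclasses import dataclass
from typing import List

SEPARATOR = "@@"  # delimiter shown in the format above


@dataclass
class Condition:
    prompt: str
    reference_images: List[str]


def parse_condition_line(line: str) -> Condition:
    """Split one line into the prompt and its reference image paths."""
    fields = [field.strip() for field in line.strip().split(SEPARATOR)]
    prompt = fields[0]
    images = [field for field in fields[1:] if field]
    if not prompt:
        raise ValueError("missing prompt")
    if not images:
        # Assumption: every line needs at least one reference image.
        raise ValueError("expected at least one reference image")
    return Condition(prompt, images)


def parse_condition_file(text: str) -> List[Condition]:
    """Parse every non-empty line of a condition file's contents."""
    return [parse_condition_line(l) for l in text.splitlines() if l.strip()]
```

For example, `parse_condition_line("a long prompt@@image1.png@@image2.png")` yields a `Condition` with the prompt text and a two-element list of image paths.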
Training
Preparing the Dataset
The dataset should be structured as follows:
.
├── labels
│   ├── 1.txt
│   ├── 2.txt
│   ├── 3.txt
│   ├── ...
├── videos
│   ├── 1.mp4
│   ├── 2.mp4
│   ├── 3.mp4
│   ├── ...
└── references
    ├── 1
    │   ├── ref1.png
    │   ├── ref2.png
    │   └── ref3.png
    ├── 2
    │   ├── ref1.png
    │   ├── ref2.png
    │   └── ref3.png
    ├── ...
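A layout like the one above can be sanity-checked with a short script. The sketch below assumes, based on the tree, that each sample shares one id across `labels/<id>.txt`, `videos/<id>.mp4`, and `references/<id>/`; the function name `check_dataset` is hypothetical and not part of the repository.

```python
from pathlib import Path


def check_dataset(root):
    """Map each sample id to the problems found for it (empty dict means OK)."""
    root = Path(root)
    problems = {}
    for label in sorted((root / "labels").glob("*.txt")):
        sample_id = label.stem
        issues = []
        # Assumption: the video shares the label's file stem.
        if not (root / "videos" / f"{sample_id}.mp4").is_file():
            issues.append("missing video")
        # Assumption: reference images live in a folder named after the id.
        ref_dir = root / "references" / sample_id
        if not ref_dir.is_dir() or not any(ref_dir.glob("*.png")):
            issues.append("missing reference images")
        if issues:
            problems[sample_id] = issues
    return problems


if __name__ == "__main__":
    for sample_id, issues in check_dataset(".").items():
        print(f"{sample_id}: {', '.join(issues)}")
```

Starting from the label files keeps the check simple; videos or reference folders without a matching label are not reported by this sketch.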
After you have prepared the dataset, you can execute the following command to generate the training data. Note: Please update the dataset directory paths in the YAML configuration file to match your local setup before running.
bash multi_gpu_training.sh
Note: Our training strategy is based on the CogVideoX model. For detailed information about the training process, please refer to the CogVideoX repository. In addition to the DeepSpeed training approach, we also provide an implementation using FSDP2 for distributed training.
Gallery
Our model can broadly reference various types of images, including humans, objects, and diverse scenarios such as try-on. This demonstrates its versatility and generalization ability across different tasks.
Reference Images
Kaleido Results
Todo List
- [x] Inference and training code for Kaleido
- [x] Checkpoint of Kaleido
- [ ] Data pipeline of Kaleido
Citation
If you find our work helpful, please cite our paper:
@article{DBLP:journals/corr/abs-2510-18573,
author = {Zhenxing Zhang and
Jiayan Teng and
Zhuoyi Yang and
Tiankun Cao and
Cheng Wang and
Xiaotao Gu and
Jie Tang and
Dan Guo and
Meng Wang},
title = {Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model},
journal = {CoRR},
volume = {abs/2510.18573},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2510.18573},
doi = {10.48550/ARXIV.2510.18573},
eprinttype = {arXiv},
eprint = {2510.18573},
timestamp = {Sat, 15 Nov 2025 15:31:50 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2510-18573.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}