Repo: Zhipu AI (GLM), published Oct 20, 2025

zai-org/Kaleido

Python

Description: Kaleido: Open-sourced multi-subject reference video generation model, enabling controllable, high-fidelity video synthesis from multiple image references.

Language: Python

Stars: 134

Forks: 14

Open issues: 6

Created: 2025-10-20T15:13:25Z

Pushed: 2026-03-02T11:54:40Z

Default branch: main

Fork: no

Archived: no

README:

KALEIDO: OPEN-SOURCED MULTI-SUBJECT REFERENCE VIDEO GENERATION MODEL

---

Updates and News

  • 2025.10.28: 🔥 We release the checkpoints of Kaleido-14B-S2V.
  • 2025.10.22: 🔥 We propose Kaleido, a novel multi-subject reference video generation model. Both the training and inference code have been open-sourced to facilitate further research and reproduction.

Quick Start

Prompt Optimization

Before running the model, please refer to this guide to see how we use large models like GLM-4.5 (or other comparable products, such as GPT-5) to optimize the prompt. This is crucial because the model is trained with long prompts, and prompt quality directly affects the quality of the generated video.

Diffusers

Please make sure your Python version is between 3.10 and 3.12, inclusive.

pip install -r requirements.txt
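
Checking the interpreter version before installing can catch an unsupported environment early. The following is a minimal sketch of such a check; the function name `python_version_supported` is hypothetical and not part of this repository.

```python
import sys

# Supported range stated in this README: 3.10 through 3.12, inclusive.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def python_version_supported(version=None):
    """Return True if (major, minor) lies within the supported range.

    `version` is any tuple-like value starting with (major, minor);
    it defaults to the running interpreter's version.
    """
    major, minor = tuple(version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION
```

Calling `python_version_supported()` with no argument checks the running interpreter; passing a tuple such as `(3, 11)` checks a specific version.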

Checkpoints Download

| ckpts | Download Link | Notes |
|-------------|-----------------|---------------|
| Kaleido-14B | 🤗 Hugging Face | Supports 512P |

Use the following commands to download the model weights (We have integrated both Wan VAE and T5 modules into this checkpoint for convenience).

# Download the repository (skip automatic LFS file downloads)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/zai-org/Kaleido-14B-S2V

# Enter the repository folder
cd Kaleido-14B-S2V

# Merge the checkpoint files
python merge_kaleido.py

Arrange the model files into the following structure:

.
├── Kaleido-14B-S2V
│   ├── model
│   │   └── ....
│   ├── Wan2.1_VAE.pth
│   │
│   └── umt5-xxl
│       └── ....
├── configs
├── sat
└── sgm
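
A quick way to confirm the layout before running inference is to check that each entry in the tree above exists. This is a sketch only: the helper `missing_paths` is hypothetical, and it checks just the names shown in the tree, not the contents of `model` or `umt5-xxl`.

```python
from pathlib import Path

# Relative paths taken from the directory tree in this README.
EXPECTED_PATHS = [
    "Kaleido-14B-S2V/model",
    "Kaleido-14B-S2V/Wan2.1_VAE.pth",
    "Kaleido-14B-S2V/umt5-xxl",
    "configs",
    "sat",
    "sgm",
]


def missing_paths(root="."):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]
```

Running `missing_paths(".")` from the project root returns an empty list when every expected entry is present.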

Usage

Inference

python sample_video.py --base configs/video_model/dit_crossattn_14B_wanvae.yaml configs/sampling/sample_wanvae_concat_14b.yaml

You can also use multiple GPUs to accelerate the inference process:

bash torchrun_multi_gpu.sh

Additionally, you can enable Sequence Parallelism in the YAML configuration file to further speed up multi-GPU inference:

args:
  s2v_concat: True
  ....
  sequence_parallel_size: 8

Note: The condition input txt file should contain lines in the following format:

prompt@@image1.png@@image2.png@@image3.png
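
To illustrate the line format above, here is a minimal parsing sketch that splits each line on `@@` into a prompt and its reference image paths. It is not the repository's own loader: the names `Condition`, `parse_condition_line`, and `parse_condition_text` are hypothetical, and the rule that every line needs at least one image path is an assumption, since the README does not state the allowed number of references.

```python
from dataclasses import dataclass
from typing import List

SEPARATOR = "@@"  # field separator shown in the README's example line


@dataclass
class Condition:
    prompt: str
    image_paths: List[str]


def parse_condition_line(line: str) -> Condition:
    """Split 'prompt@@img1@@img2...' into a prompt and its image paths."""
    parts = [part.strip() for part in line.strip().split(SEPARATOR)]
    prompt, image_paths = parts[0], parts[1:]
    if not prompt:
        raise ValueError("missing prompt")
    # Assumption: at least one non-empty reference image path is required.
    if not image_paths or not all(image_paths):
        raise ValueError("expected at least one non-empty image path")
    return Condition(prompt, image_paths)


def parse_condition_text(text: str) -> List[Condition]:
    """Parse every non-blank line of a condition file's contents."""
    return [parse_condition_line(l) for l in text.splitlines() if l.strip()]
```

Validating the file this way before launching a long sampling run makes a malformed line fail fast with a clear error.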

Training

Preparing the Dataset

The dataset should be structured as follows:

.
├── labels
│   ├── 1.txt
│   ├── 2.txt
│   ├── 3.txt
│   ├── ...
├── videos
│   ├── 1.mp4
│   ├── 2.mp4
│   ├── 3.mp4
│   ├── ...
└── references
    ├── 1
    │   ├── ref1.png
    │   ├── ref2.png
    │   └── ref3.png
    ├── 2
    │   ├── ref1.png
    │   ├── ref2.png
    │   └── ref3.png
    ├── ...
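
The structure above can be sanity-checked with a short script before training. The sketch below assumes, based on the tree, that a sample is identified by a shared name across `labels/<id>.txt`, `videos/<id>.mp4`, and `references/<id>/`; the function `check_dataset` is hypothetical and does not inspect file contents.

```python
from pathlib import Path


def check_dataset(root):
    """Return a list of human-readable problems found under `root`."""
    root = Path(root)
    problems = []
    # Assumption: each label file's stem is the sample ID.
    for label in sorted((root / "labels").glob("*.txt")):
        sample_id = label.stem
        if not (root / "videos" / f"{sample_id}.mp4").is_file():
            problems.append(f"{sample_id}: missing videos/{sample_id}.mp4")
        ref_dir = root / "references" / sample_id
        if not (ref_dir.is_dir() and any(ref_dir.glob("*.png"))):
            problems.append(f"{sample_id}: no PNG files in references/{sample_id}")
    return problems
```

An empty return value means every label has a matching video and at least one PNG reference image.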

After you have prepared the dataset, you can execute the following command to generate the training data. Note: Please update the dataset directory paths in the YAML configuration file to match your local setup before running.

bash multi_gpu_training.sh

Note: Our training strategy is based on the CogVideoX model. For detailed information about the training process, please refer to the CogVideoX repository. In addition to the DeepSpeed training approach, we also provide an implementation using FSDP2 for distributed training.

Gallery

Our model can broadly reference various types of images, including humans, objects, and diverse scenarios such as try-on. This demonstrates its versatility and generalization ability across different tasks.

Reference Images

Kaleido Results

Todo List

  • [x] Inference and training code for Kaleido
  • [x] Checkpoint of Kaleido
  • [ ] Data pipeline of Kaleido

Citation

If you find our work helpful, please cite our paper:

@article{DBLP:journals/corr/abs-2510-18573,
author = {Zhenxing Zhang and
Jiayan Teng and
Zhuoyi Yang and
Tiankun Cao and
Cheng Wang and
Xiaotao Gu and
Jie Tang and
Dan Guo and
Meng Wang},
title = {Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model},
journal = {CoRR},
volume = {abs/2510.18573},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2510.18573},
doi = {10.48550/ARXIV.2510.18573},
eprinttype = {arXiv},
eprint = {2510.18573},
timestamp = {Sat, 15 Nov 2025 15:31:50 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2510-18573.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
