Tencent-Hunyuan/HunyuanVideo-I2V
Description: HunyuanVideo-I2V: A Customizable Image-to-Video Model based on HunyuanVideo
Language: Python
License: NOASSERTION
Stars: 1827
Forks: 191
Open issues: 57
Created: 2025-03-04T12:02:05Z
Pushed: 2026-04-07T06:13:09Z
Default branch: main
Fork: no
Archived: no
README:
[Read in Chinese](./README_zh.md)
HunyuanVideo-I2V 🌅
👋 Join our WeChat and Discord
-----
Following the successful open-sourcing of HunyuanVideo, we proudly present HunyuanVideo-I2V, a new image-to-video generation framework to accelerate open-source community exploration!
This repo contains the official PyTorch model definitions, pre-trained weights, and inference/sampling code. You can find more visualizations on our project page. We have also released the LoRA training code for customizable special effects, which can be used to create more interesting video effects.
> **HunyuanVideo: A Systematic Framework For Large Video Generation Model**
🔥🔥🔥 News!!
- Mar 13, 2025: 🚀 We release the parallel inference code for HunyuanVideo-I2V powered by xDiT.
- Mar 11, 2025: 🎉 We have updated the LoRA training and inference code after fixing the bug.
- Mar 07, 2025: 🔥 We have fixed the bug in our open-source version that caused ID changes. Please try the new model weights of HunyuanVideo-I2V to ensure full visual consistency in the first frame and produce higher-quality videos.
- Mar 06, 2025: 👋 We release the inference code and model weights of HunyuanVideo-I2V. Download.
🎥 Demo
I2V Demo
First Frame Consistency Demo
| Reference Image | Generated Video |
|:----------------:|:----------------:|
Customizable I2V LoRA Demo
| I2V LoRA Effect | Reference Image | Generated Video |
|:---------------:|:--------------------------------:|:----------------:|
| Hair growth | | |
| Embrace | | |
🧩 Community Contributions
If you develop or use HunyuanVideo-I2V in your projects, please let us know.
- ComfyUI-Kijai (FP8 Inference, V2V and IP2V Generation): ComfyUI-HunyuanVideoWrapper by Kijai
- HunyuanVideoGP (GPU Poor version): HunyuanVideoGP by DeepBeepMeep
- xDiT compatibility improvement by pftq and xibosun
📑 Open-source Plan
- HunyuanVideo-I2V (Image-to-Video Model)
- [x] Inference
- [x] Checkpoints
- [x] ComfyUI
- [x] LoRA training scripts
- [x] Multi-GPU sequence-parallel inference (faster inference on more GPUs)
Contents
- [HunyuanVideo-I2V 🌅](#hunyuanvideo-i2v-)
- [🔥🔥🔥 News!!](#-news)
- [🎥 Demo](#-demo)
- [I2V Demo](#i2v-demo)
- [First Frame Consistency Demo](#first-frame-consistency-demo)
- [Customizable I2V LoRA Demo](#customizable-i2v-lora-demo)
- [🧩 Community Contributions](#-community-contributions)
- [📑 Open-source Plan](#-open-source-plan)
- [Contents](#contents)
- [HunyuanVideo-I2V Overall Architecture](#hunyuanvideo-i2v-overall-architecture)
- [📜 Requirements](#-requirements)
- [🛠️ Dependencies and Installation](#️-dependencies-and-installation)
- [Installation Guide for Linux](#installation-guide-for-linux)
- [🧱 Download Pretrained Models](#-download-pretrained-models)
- [🔑 Single-gpu Inference](#-single-gpu-inference)
- [Tips for Using Image-to-Video Models](#tips-for-using-image-to-video-models)
- [Using Command Line](#using-command-line)
- [More Configurations](#more-configurations)
- [🎉 Customizable I2V LoRA effects training](#-customizable-i2v-lora-effects-training)
- [Requirements](#requirements)
- [Environment](#environment)
- [Training data construction](#training-data-construction)
- [Training](#training)
- [Inference](#inference)
- [🚀 Parallel Inference on Multiple GPUs by xDiT](#-parallel-inference-on-multiple-gpus-by-xdit)
- [Using Command Line](#using-command-line-1)
- [🔗 BibTeX](#-bibtex)
- [Acknowledgements](#acknowledgements)
---
HunyuanVideo-I2V Overall Architecture
Leveraging the advanced video generation capabilities of HunyuanVideo, we have extended its application to image-to-video generation tasks. To achieve this, we employ a token replace technique to effectively reconstruct and incorporate reference image information into the video generation process.
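One plausible reading of the token replace step is shown below as a toy sketch, not as the repository's implementation. It assumes that at each denoising step the latent tokens of the first frame are overwritten with the clean latent tokens of the reference image, so the first frame stays anchored to the input. The function name `token_replace`, the list-of-frames layout, and the toy values are all assumptions made for illustration.

```python
# Toy sketch (assumed layout): a video latent is a list of frames,
# and each frame is a list of token values. Not the repository's code.

def token_replace(video_latents, image_latent):
    """Return a copy of video_latents whose first frame is image_latent."""
    if len(video_latents) == 0:
        raise ValueError("video_latents must contain at least one frame")
    if len(image_latent) != len(video_latents[0]):
        raise ValueError("image latent must match the per-frame token count")
    # Frame 0 is taken from the reference image; later frames are copied as-is.
    return [list(image_latent)] + [list(frame) for frame in video_latents[1:]]


noisy_video = [[0.9, -0.3], [0.1, 0.4], [-0.7, 0.2]]  # 3 frames, 2 tokens each
reference = [1.0, 2.0]                                 # clean image latent (toy)

replaced = token_replace(noisy_video, reference)
print(replaced[0])  # the first frame now equals the reference latent
```

The input list is left unmodified, which makes it easy to compare the latents before and after the replacement.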
Because we use a pre-trained Multimodal Large Language Model (MLLM) with a decoder-only architecture as the text encoder, the model can better comprehend the semantic content of the input image and integrate information from both the image and its associated caption. Specifically, the MLLM processes the input image to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling full-attention computation across the combined sequence.
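The concatenation and full-attention step can be sketched in plain Python. This is a minimal single-head illustration with identity query/key/value projections, not the model's actual attention layer. The helper names `softmax` and `full_attention`, the token values, and the 2-dimensional embeddings are assumptions chosen to keep the example small.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def full_attention(tokens):
    """Self-attention where every token attends to every token.

    Queries, keys, and values are the tokens themselves (identity
    projections), which is a simplification for illustration only.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs

# Toy stand-ins: semantic image tokens (from the MLLM) and video latent tokens.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
video_tokens = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]

# Concatenate along the sequence axis, then attend over the joint sequence.
joint_tokens = image_tokens + video_tokens
attended = full_attention(joint_tokens)
print(len(attended), len(attended[0]))
```

Because no mask is applied, each video token can draw on the image tokens and vice versa, which is the property the paragraph above describes.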
The overall architecture of our system is designed to maximize the synergy between image and text modalities, ensuring a robust and coherent generation of video content from static images. This integration not only improves the fidelity of the generated videos but also enhances the model's ability to interpret and utilize complex multimodal inputs. The overall architecture is as follows.
📜 Requirements
The following table shows the requirements for running the HunyuanVideo-I2V model (batch size = 1) to generate videos:
| Model | Resolution | GPU Peak Memory |
|:----------------:|:-----------:|:----------------:|
| HunyuanVideo-I2V | 720p | 60GB |
- An NVIDIA GPU with CUDA support is required.
- The model is tested on a single 80GB GPU.
- Minimum: 60GB of GPU memory is required for 720p.
- Recommended: We recommend using a GPU with 80GB of memory for better generation quality.
- Tested operating system: Linux
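The memory requirements above can be checked before launching inference with a small pre-flight script. The sketch below is not part of the repository: the helper names `classify_gpu_memory` and `detect_total_gb` are hypothetical, and treating the thresholds as decimal gigabytes of total device memory is an assumption. Total memory is only a rough proxy, since the table reports peak usage.

```python
MIN_GB = 60          # minimum for 720p, per the requirements above
RECOMMENDED_GB = 80  # recommended, per the requirements above

def classify_gpu_memory(total_gb):
    """Map total GPU memory (decimal GB, assumed unit) to a requirement tier."""
    if total_gb >= RECOMMENDED_GB:
        return "recommended"
    if total_gb >= MIN_GB:
        return "minimum"
    return "insufficient"

def detect_total_gb(device_index=0):
    """Return total memory of a CUDA device in GB, or None if unavailable."""
    try:
        import torch  # optional dependency; only needed for detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / 1e9

if __name__ == "__main__":
    total = detect_total_gb()
    if total is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    else:
        print(f"GPU 0: {total:.1f} GB -> {classify_gpu_memory(total)}")
```

Keeping the classification separate from the detection makes the thresholds easy to adjust and testable without a GPU.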
🛠️ Dependencies and Installation
Begin by cloning the repository:...