Tencent-Hunyuan/OmniWeaving
Description: Official Implementation of OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning
Language: Python
License: NOASSERTION
Stars: 878
Forks: 27
Open issues: 3
Created: 2026-03-31T11:10:21Z
Pushed: 2026-04-11T08:19:27Z
Default branch: main
Fork: no
Archived: no
README:
OmniWeaving
🔥🔥🔥 News
- 📌 OmniWeaving is developed by the HunyuanVideo team and uses the latest [HunyuanVideo-1.5](https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5) as its backbone. If you find our work useful, please consider giving this repository a star and citing our paper.
- 🚀 April 3, 2026: We release the code and model weights of OmniWeaving.
- 🏃‍♂️ April 3, 2026: We release IntelligentVBench.
- 📖 Mar 26, 2026: We release the OmniWeaving paper on arXiv.
- 👋 Mar 25, 2026: We release the webpage of OmniWeaving.
📑 Open-source Plan
- OmniWeaving
- [✅] Inference Code
- [✅] Model Checkpoints
- [✅] Training Data Construction Code
- [✅] Training Example Code
- IntelligentVBench
- [✅] Test cases
- [✅] Evaluation Code
📋 Table of Contents
- [🔥🔥🔥 News](#news)
- [📑 Open-source Plan](#open-source-plan)
- [📖 Abstract](#abstract)
- [🏗 Model Architecture](#model-architecture)
- [🚀 Supported Tasks](#supported-tasks)
- [🛠 Preparation](#preparation)
- [🔑 Inference](#inference)
- [🗂 Training Data Construction](#training-data-construction)
- [🎓 Training](#training)
- [📊 Evaluation on IntelligentVBench](#evaluation-on-intelligentvbench)
- [🎬 Qualitative Examples](#examples)
- [📚 Citation](#citation)
- [🙏 Acknowledgements](#acknowledgements)
📖 Abstract
We propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models.
🏗 Model Architecture
Following the paper, OmniWeaving is built as an integrated MLLM + MMDiT + VAE framework for unified free-form video generation. The MLLM serves as the semantic parser for interleaved text, images, and video inputs, mapping them into a high-level semantic space and forwarding its hidden states through an MLP connector. The VAE acts as the visual tokenizer, compressing visual inputs into low-level latents, while the MMDiT uses these semantic conditions together with latent noise to generate semantically aligned, high-fidelity videos.
On this basis, we introduce two further improvements tailored for advanced reasoning and composition.
- (1) Activating Thinking Mode of the MLLM: Direct MLLM encoding of interleaved visual-text inputs often yields semantic ambiguity due to weak intra-correlations and unclear video creation intents. We elevate the MLLM from a passive feature extractor to an active reasoner. By activating the thinking mode to generate intermediate reasoning steps, it autonomously deduces a semantically precise, enhanced prompt. The hidden states of this enhanced prompt are then forwarded alongside the original MLLM features to condition the MMDiT, effectively bridging the cognitive gap between abstract user intent and pixel-level generation.
- (2) Hidden States DeepStacking: Compositional video generation involving multiple subjects or intricate scenes often relies on both low- and high-level semantic representations. Drawing inspiration from the DeepStacking mechanism in Qwen3-VL, we extract hidden states from a broader range of intermediate MLLM layers to capture a rich semantic spectrum spanning from fine-grained details to high-level abstractions. An MLP connector projects these multi-level features into the MMDiT embedding space. These projected features are then directly added to the corresponding hidden states within the first three layers of the MMDiT conditioning branch, effectively injecting multi-granular semantic guidance into the generative process.
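The DeepStacking idea in (2) can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the repository's implementation: the tapped layer indices, the tensor sizes, the use of one MLP per tapped layer, and the choice to add each projected feature to the input of an early layer are all hypothetical. Identity functions stand in for the real MMDiT layers so the effect of the injections is easy to check.

```python
import numpy as np

SEQ_LEN, MLLM_DIM, DIT_DIM = 8, 16, 32   # toy sizes, not the real model's
TAP_LAYERS = (3, 7, 11)                  # hypothetical MLLM layers to tap
rng = np.random.default_rng(0)


def make_mlp(in_dim, out_dim, hidden_dim=64):
    """Return a tiny two-layer MLP (ReLU for brevity) as a closure."""
    w1 = rng.normal(scale=0.02, size=(in_dim, hidden_dim))
    w2 = rng.normal(scale=0.02, size=(hidden_dim, out_dim))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2


def project_taps(mllm_hidden_states, connectors, tap_layers=TAP_LAYERS):
    """Project each tapped MLLM layer into the MMDiT embedding space."""
    return [mlp(mllm_hidden_states[idx]) for mlp, idx in zip(connectors, tap_layers)]


def run_conditioning_branch(cond, dit_layers, injections):
    """Run the conditioning branch, adding one injection before each early layer."""
    for i, layer in enumerate(dit_layers):
        if i < len(injections):          # only the first len(injections) layers
            cond = cond + injections[i]  # direct addition, shapes must match
        cond = layer(cond)
    return cond


# Toy stand-ins for per-layer MLLM hidden states and MMDiT layers.
mllm_hidden_states = [rng.normal(size=(SEQ_LEN, MLLM_DIM)) for _ in range(12)]
connectors = [make_mlp(MLLM_DIM, DIT_DIM) for _ in TAP_LAYERS]  # assumption: one per tap
dit_layers = [lambda x: x for _ in range(6)]  # identity layers keep the demo checkable

injections = project_taps(mllm_hidden_states, connectors)
cond_in = rng.normal(size=(SEQ_LEN, DIT_DIM))
cond_out = run_conditioning_branch(cond_in, dit_layers, injections)
print(cond_out.shape)
```

With three tapped layers, only the first three conditioning layers receive an injection, and later layers see the accumulated result.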
🚀 Supported Tasks
OmniWeaving is flexible in its input and output configurations, supporting a wide range of unified video generation tasks:
| Task | Input Type | Output | Description |
| --- | --- | --- | --- |
| Text-to-Video (T2V) | Text 📝 | Video 🎬 | Generating a video from text prompts. |
| First-Frame-to-Video (I2V) | Image 🖼 + Text 📝 | Video 🎬 | Generating a video based on the first frame. |
| Key-Frames-to-Video | 2 × Images 🖼 + Text 📝 | Video 🎬 | Generating a video conditioned on start and end frames. |
| Video-to-Video Editing | Video 🎬 + Text 📝 | Video 🎬 | Instruction-based video manipulation and stylization. |
| Reference-to-Video | Image 🖼 + Text 📝 | Video 🎬 | Single-subject reference-driven video generation. |
| Compositional Multi-Image-to-Video | 2–4 × Images 🖼 + Text 📝 | Video 🎬 | Multi-subject compositional video generation. |
| Text-Image-Video-to-Video | Video 🎬 + Image 🖼 + Text 📝 | Video 🎬 | Generating a video conditioned on text, image, and video inputs. |
| Reasoning-Augmented Video Generation | Image(s) 🖼 + Text 📝 | Reasoning 💭 + Video 🎬 | Reasoning over user intent before generating the video. |
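The input requirements listed above can be written down as a small lookup table for sanity-checking a request before launching inference. The sketch below is a hypothetical helper and is not part of the OmniWeaving codebase: the task keys, the `TaskSpec` class, and `validate_inputs` are names invented for illustration, and the missing upper bound on images for the reasoning-augmented task is an assumption, since the task list only says "Image(s)".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TaskSpec:
    min_images: int
    max_images: Optional[int]  # None means no upper bound is stated
    needs_video: bool


# Hypothetical task keys; counts follow the supported-task list.
TASK_SPECS = {
    "t2v": TaskSpec(0, 0, False),
    "i2v": TaskSpec(1, 1, False),
    "key_frames_to_video": TaskSpec(2, 2, False),
    "video_editing": TaskSpec(0, 0, True),
    "reference_to_video": TaskSpec(1, 1, False),
    "multi_image_to_video": TaskSpec(2, 4, False),
    "text_image_video_to_video": TaskSpec(1, 1, True),
    "reasoning_augmented": TaskSpec(1, None, False),  # upper bound is an assumption
}


def validate_inputs(task: str, prompt: str, num_images: int = 0,
                    has_video: bool = False) -> List[str]:
    """Return a list of problems; an empty list means the inputs fit the task."""
    if task not in TASK_SPECS:
        return [f"unknown task: {task}"]
    spec = TASK_SPECS[task]
    problems = []
    if not prompt.strip():
        problems.append("a text prompt is required")
    if num_images < spec.min_images:
        problems.append(f"need at least {spec.min_images} image(s), got {num_images}")
    if spec.max_images is not None and num_images > spec.max_images:
        problems.append(f"need at most {spec.max_images} image(s), got {num_images}")
    if spec.needs_video and not has_video:
        problems.append("a video input is required")
    if has_video and not spec.needs_video:
        problems.append("this task takes no video input")
    return problems


print(validate_inputs("multi_image_to_video", "two friends hiking", num_images=3))
print(validate_inputs("key_frames_to_video", "sunrise to sunset", num_images=1))
```

The task name has to be given explicitly because the input types alone do not identify it: a single image with text fits both First-Frame-to-Video and Reference-to-Video.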
🛠 Preparation
Step 1: Clone the Repository
git clone https://github.com/Tencent-Hunyuan/OmniWeaving
cd OmniWeaving
Step 2: Install Dependencies
OmniWeaving is built upon HunyuanVideo-1.5, so dependency installation follows the same procedure. First, install the basic dependencies:
pip install -r requirements.txt
Additionally, install the attention libraries as needed (we use Flash Attention in practice):
- Flash Attention: Install for faster inference and reduced GPU memory consumption. See Flash Attention…