Tencent-Hunyuan/HunyuanVideo-Foley
Description: HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation.
Language: Python
License: NOASSERTION
Stars: 1043
Forks: 101
Open issues: 23
Created: 2025-08-15T06:44:09Z
Pushed: 2025-09-28T17:48:49Z
Default branch: main
Fork: no
Archived: no
README:
---
🏢 ¹Tencent Hunyuan • 🎓 ²Zhejiang University • ✈️ ³Nanjing University of Aeronautics and Astronautics
*Equal contribution • †Project lead
---
🔥🔥🔥 News
- [2025.9.29] 🚀 HunyuanVideo-Foley-XL Model Release - Release XL-sized model with offload inference support, significantly reducing VRAM requirements.
- [2025.8.28] 🌟 HunyuanVideo-Foley Open Source Release - Inference code and model weights publicly available.
---
🎥 Demo & Showcase
---
🤝 Community Contributions
ComfyUI Integration - Thanks to the amazing community for creating ComfyUI nodes:
- [if-ai/ComfyUI_HunyuanVideoFoley](https://github.com/if-ai/ComfyUI_HunyuanVideoFoley) - ComfyUI workflow integration that supports CPU offloading and FP8 quantization
- [phazei/ComfyUI-HunyuanVideo-Foley](https://github.com/phazei/ComfyUI-HunyuanVideo-Foley) - Alternative ComfyUI node implementation that supports different precision modes
---
✨ Key Highlights
- 🎭 Multi-scenario Sync: High-quality audio synchronized with complex video scenes
- 🧠 Multi-modal Balance: Balanced use of visual and textual information
- 🎵 48kHz Hi-Fi Output: Professional-grade audio generation at a 48kHz sample rate
---
📄 Abstract
🎯 Core Highlights
- 🎬 Multi-scenario Audio-Visual Synchronization: Generates high-quality audio that is synchronized and semantically aligned with complex video scenes, improving realism and immersion for film/TV and gaming applications.
- ⚖️ Multi-modal Semantic Balance: Balances visual and textual information when composing sound effects, avoiding generation driven by one modality alone, and supports personalized dubbing requirements.
- 🎵 High-fidelity Audio Output: A self-developed 48kHz audio VAE reconstructs sound effects, music, and vocals with high fidelity, supporting professional-grade audio generation quality.
---
🔧 Technical Architecture
📊 Data Pipeline Design
The TV2A (Text-Video-to-Audio) task presents a complex multimodal generation challenge requiring large-scale, high-quality datasets. Our comprehensive data pipeline systematically identifies and excludes unsuitable content to produce robust and generalizable audio generation capabilities.
🏗️ Model Architecture
HunyuanVideo-Foley employs a sophisticated hybrid architecture:
- 🔄 Multimodal Transformer Blocks: Process visual-audio streams simultaneously
- 🎵 Unimodal Transformer Blocks: Focus on audio stream refinement
- 👁️ Visual Encoding: Pre-trained encoder extracts visual features from video frames
- 📝 Text Processing: Semantic features extracted via pre-trained text encoder
- 🎧 Audio Encoding: Latent representations with Gaussian noise perturbation
- ⏰ Temporal Alignment: Synchformer-based frame-level synchronization with gated modulation
---
📈 Performance Benchmarks
🎬 MovieGen-Audio-Bench Results
| 🏆 Method | PQ ↑ | PC ↓ | CE ↑ | CU ↑ | IB ↑ | DeSync ↓ | CLAP ↑ | MOS-Q ↑ | MOS-S ↑ | MOS-T ↑ |
|:-------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:-------------:|:-----------:|:------------:|:------------:|:------------:|
| FoleyGrafter | 6.27 | 2.72 | 3.34 | 5.68 | 0.17 | 1.29 | 0.14 | 3.36±0.78 | 3.54±0.88 | 3.46±0.95 |
| V-AURA | 5.82 | 4.30 | 3.63 | 5.11 | 0.23 | 1.38 | 0.14 | 2.55±0.97 | 2.60±1.20 | 2.70±1.37 |
| Frieren | 5.71 | 2.81 | 3.47 | 5.31 | 0.18 | 1.39 | 0.16 | 2.92±0.95 | 2.76±1.20 | 2.94±1.26 |
| MMAudio | 6.17 | 2.84 | 3.59 | 5.62 | 0.27 | 0.80 | 0.35 | 3.58±0.84 | 3.63±1.00 | 3.47±1.03 |
| ThinkSound | 6.04 | 3.73 | 3.81 | 5.59 | 0.18 | 0.91 | 0.20 | 3.20±0.97 | 3.01±1.04 | 3.02±1.08 |
| HunyuanVideo-Foley (ours) | 6.59 | 2.74 | 3.88 | 6.13 | 0.35 | 0.74 | 0.33 | 4.14±0.68 | 4.12±0.77 | 4.15±0.75 |
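The arrows in the header give each metric's direction (↑ higher is better, ↓ lower is better). As a minimal sketch of how to read the table, the snippet below copies the seven objective columns from the MovieGen-Audio-Bench rows and picks the best method per metric. The names `METRICS`, `HIGHER_IS_BETTER`, `MOVIEGEN`, and `best_per_metric` are illustrative and not part of the repository.

```python
# Objective metrics copied from the MovieGen-Audio-Bench table.
# Column order: PQ, PC, CE, CU, IB, DeSync, CLAP (MOS columns omitted).
METRICS = ["PQ", "PC", "CE", "CU", "IB", "DeSync", "CLAP"]

# True means higher is better (↑); False means lower is better (↓).
HIGHER_IS_BETTER = {
    "PQ": True, "PC": False, "CE": True, "CU": True,
    "IB": True, "DeSync": False, "CLAP": True,
}

MOVIEGEN = {
    "FoleyGrafter": [6.27, 2.72, 3.34, 5.68, 0.17, 1.29, 0.14],
    "V-AURA": [5.82, 4.30, 3.63, 5.11, 0.23, 1.38, 0.14],
    "Frieren": [5.71, 2.81, 3.47, 5.31, 0.18, 1.39, 0.16],
    "MMAudio": [6.17, 2.84, 3.59, 5.62, 0.27, 0.80, 0.35],
    "ThinkSound": [6.04, 3.73, 3.81, 5.59, 0.18, 0.91, 0.20],
    "HunyuanVideo-Foley": [6.59, 2.74, 3.88, 6.13, 0.35, 0.74, 0.33],
}


def best_per_metric(table):
    """Return {metric: method name} using each metric's stated direction."""
    best = {}
    for i, metric in enumerate(METRICS):
        pick = max if HIGHER_IS_BETTER[metric] else min
        best[metric] = pick(table, key=lambda name: table[name][i])
    return best


for metric, method in best_per_metric(MOVIEGEN).items():
    print(f"{metric}: {method}")
```

On these numbers, HunyuanVideo-Foley leads on five of the seven objective metrics, while FoleyGrafter has the lowest PC and MMAudio the highest CLAP.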
🎯 Kling-Audio-Eval Results
| 🏆 Method | FD_PANNs ↓ | FD_PASST ↓ | KL ↓ | IS ↑ | PQ ↑ | PC ↓ | CE ↑ | CU ↑ | IB ↑ | DeSync ↓ | CLAP ↑ |
|:-------------:|:--------------:|:--------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:-------------:|:-----------:|
| FoleyGrafter | 22.30 | 322.63 | 2.47 | 7.08 | 6.05 | 2.91 | 3.28 | 5.44 | 0.22 | 1.23 | 0.22 |
| V-AURA | 33.15 | 474.56 | 3.24 | 5.80 | 5.69 | 3.98 | 3.13 | 4.83 | 0.25 | 0.86 | 0.13 |
| Frieren | 16.86 | 293.57 | 2.95 | 7.32 | 5.72 | 2.55 | 2.88 | 5.10 | 0.21 | 0.86 | 0.16 |
| MMAudio | 9.01 | 205.85 | 2.17 | 9.59 | 5.94 | 2.91 | 3.30 | 5.39 | 0.30 | 0.56 | 0.27 |
| ThinkSound | 9.92 | 228.68 | 2.39 | 6.86 | 5.78 | 3.23 | 3.12 | 5.11 | 0.22 | 0.67 | 0.22 |
| HunyuanVideo-Foley (ours) | 6.07 | 202.12 | 1.89 | 8.30 | 6.12 | 2.76 | 3.22 | 5.53 | 0.38 | 0.54 | 0.24 |
---
🚀 Quick Start
📦 Installation
🔧 System Requirements
- CUDA: 12.4 or 11.8 recommended
- Python: 3.8+
- OS: Linux (primary support)
- VRAM: 20GB for XXL model (or 12GB with `--enable_offload`), 16GB for XL model (or 8GB with `--enable_offload`)
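The VRAM figures above can be turned into a simple selection rule. The sketch below is illustrative only: `pick_config` is a hypothetical helper, not a function in the repository, and it relies solely on the four VRAM figures and the `--enable_offload` flag stated in the requirements. The preference order (larger model first, then no offload) is an assumption.

```python
# Minimum VRAM in GB, taken from the system requirements.
# Keys: (model size, offload enabled) -> minimum VRAM.
VRAM_GB = {
    ("XXL", False): 20,
    ("XXL", True): 12,
    ("XL", False): 16,
    ("XL", True): 8,
}


def pick_config(available_gb):
    """Return (model, use_offload) for the largest model that fits, or None.

    Assumed preference: try the XXL model before XL, and for each model
    try running without offload before enabling it.
    """
    for model in ("XXL", "XL"):
        for offload in (False, True):
            if available_gb >= VRAM_GB[(model, offload)]:
                return model, offload
    return None  # below 8GB, none of the listed configurations fit


print(pick_config(24))  # enough for XXL without offload
print(pick_config(10))  # only XL with offload fits
```

A 16GB card fits either XXL with offload or XL without it; this sketch picks XXL with offload, which is a design choice and not a recommendation from the README.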
Step 1: Clone Repository
    # 📥 Clone the repository
    git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley
    cd HunyuanVideo-Foley
Step 2: Environment Setup
💡 Tip: We recommend using Conda for Python environment management.
    # 🔧 Install dependencies
    pip install -r requirements.txt
Step 3: Download Pretrained Models
🔗 Download Model weights from Huggingface
    # using git-lfs
    git clone https://huggingface.co/tencent/HunyuanVideo-Foley

    # using huggingface-cli
    huggingface-cli download tencent/HunyuanVideo-Foley
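The same download can also be scripted from Python. This is a sketch that assumes the `huggingface_hub` package is installed (the package that provides `huggingface-cli`). The helper name `download_weights` and the target directories are illustrative choices, not paths the repository requires.

```python
REPO_ID = "tencent/HunyuanVideo-Foley"  # repository id from the commands above


def download_weights(local_dir="./HunyuanVideo-Foley-weights"):
    """Download every file in the model repository and return the local path.

    `local_dir` is a hypothetical default; pass whatever directory you
    intend to point the inference scripts at.
    """
    # Imported lazily so this module loads even without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Usage (requires network access):
# path = download_weights("./pretrained_models")
```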
---
💻 Usage
📊 Model Specifications
| Model | Checkpoint | VRAM (Normal) | VRAM (Offload) |
|-------|------------|---------------|----------------|
| XXL *(Default)* | hunyuanvideo_foley.pth | 20GB | 12GB…
Excerpt shown — open the source for the full document.
Notability: 6.0/10. Solid new repo with decent stars for Foley sound generation.