Repo · Tencent Hunyuan · published Aug 15, 2025

Tencent-Hunyuan/HunyuanVideo-Foley

Python


Description: HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation.

Language: Python

License: NOASSERTION

Stars: 1043

Forks: 101

Open issues: 23

Created: 2025-08-15T06:44:09Z

Pushed: 2025-09-28T17:48:49Z

Default branch: main

Fork: no

Archived: no

README:

---

🏢 ¹Tencent Hunyuan • 🎓 ²Zhejiang University • ✈️ ³Nanjing University of Aeronautics and Astronautics

*Equal contribution • †Project lead

---

🔥🔥🔥 News

  • [2025.9.29] 🚀 HunyuanVideo-Foley-XL Model Release - Release XL-sized model with offload inference support, significantly reducing VRAM requirements.
  • [2025.8.28] 🌟 HunyuanVideo-Foley Open Source Release - Inference code and model weights publicly available.

---

🎥 Demo & Showcase

---

🤝 Community Contributions

ComfyUI Integration - Thanks to the amazing community for creating ComfyUI nodes:

  • [if-ai/ComfyUI_HunyuanVideoFoley](https://github.com/if-ai/ComfyUI_HunyuanVideoFoley) - ComfyUI workflow integration that supports CPU offloading and FP8 quantization
  • [phazei/ComfyUI-HunyuanVideo-Foley](https://github.com/phazei/ComfyUI-HunyuanVideo-Foley) - Alternative ComfyUI node implementation that supports different precision modes

---

Key Highlights

🎭 Multi-scenario Sync: High-quality audio synchronized with complex video scenes

🧠 Multi-modal Balance: Perfect harmony between visual and textual information

🎵 48kHz Hi-Fi Output: Professional-grade audio generation with crystal clarity

---

📄 Abstract

🎯 Core Highlights

🎬 Multi-scenario Audio-Visual Synchronization: Generates high-quality audio that is temporally synchronized and semantically aligned with complex video scenes, improving realism and immersion for film/TV and gaming applications.

⚖️ Multi-modal Semantic Balance: Balances visual and textual cues when composing sound effect elements, avoiding generation dominated by a single modality, and supports personalized dubbing requirements.

🎵 High-fidelity Audio Output: A self-developed 48kHz audio VAE reconstructs sound effects, music, and vocals with high fidelity, achieving professional-grade audio generation quality.

---

🔧 Technical Architecture

📊 Data Pipeline Design

The TV2A (Text-Video-to-Audio) task presents a complex multimodal generation challenge requiring large-scale, high-quality datasets. Our comprehensive data pipeline systematically identifies and excludes unsuitable content to produce robust and generalizable audio generation capabilities.

🏗️ Model Architecture

HunyuanVideo-Foley employs a sophisticated hybrid architecture:

  • 🔄 Multimodal Transformer Blocks: Process visual-audio streams simultaneously
  • 🎵 Unimodal Transformer Blocks: Focus on audio stream refinement
  • 👁️ Visual Encoding: Pre-trained encoder extracts visual features from video frames
  • 📝 Text Processing: Semantic features extracted via pre-trained text encoder
  • 🎧 Audio Encoding: Latent representations with Gaussian noise perturbation
  • ⏰ Temporal Alignment: Synchformer-based frame-level synchronization with gated modulation
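
To make the term "gated modulation" concrete, here is a minimal sketch of the general pattern: a conditioning signal supplies a scale, a shift, and a gate, and the gate controls how much of the modulated features is added back to the input. This is a generic illustration in plain Python, not the repository's implementation; the function name `gated_modulation`, its arguments, and the use of an identity in place of an attention or MLP sub-layer are all assumptions made for the example.

```python
from typing import List


def gated_modulation(
    x: List[float],
    scale: List[float],
    shift: List[float],
    gate: List[float],
) -> List[float]:
    """Element-wise residual update: x + gate * (x * (1 + scale) + shift).

    Hypothetical helper: in a real transformer block the modulated features
    would pass through attention or an MLP before gating; that sub-layer is
    replaced by the identity here to keep the sketch self-contained.
    """
    if not (len(x) == len(scale) == len(shift) == len(gate)):
        raise ValueError("all inputs must have the same length")
    return [
        xi + g * (xi * (1.0 + s) + b)
        for xi, s, b, g in zip(x, scale, shift, gate)
    ]
```

With a gate of zero the input passes through unchanged, which is why this kind of gating is often used to let a conditioning stream (here, presumably the synchronization features) be blended in gradually.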

---

📈 Performance Benchmarks

🎬 MovieGen-Audio-Bench Results

| 🏆 Method | PQ ↑ | PC ↓ | CE ↑ | CU ↑ | IB ↑ | DeSync ↓ | CLAP ↑ | MOS-Q ↑ | MOS-S ↑ | MOS-T ↑ |
|:-------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:-------------:|:-----------:|:------------:|:------------:|:------------:|
| FoleyGrafter | 6.27 | 2.72 | 3.34 | 5.68 | 0.17 | 1.29 | 0.14 | 3.36±0.78 | 3.54±0.88 | 3.46±0.95 |
| V-AURA | 5.82 | 4.30 | 3.63 | 5.11 | 0.23 | 1.38 | 0.14 | 2.55±0.97 | 2.60±1.20 | 2.70±1.37 |
| Frieren | 5.71 | 2.81 | 3.47 | 5.31 | 0.18 | 1.39 | 0.16 | 2.92±0.95 | 2.76±1.20 | 2.94±1.26 |
| MMAudio | 6.17 | 2.84 | 3.59 | 5.62 | 0.27 | 0.80 | 0.35 | 3.58±0.84 | 3.63±1.00 | 3.47±1.03 |
| ThinkSound | 6.04 | 3.73 | 3.81 | 5.59 | 0.18 | 0.91 | 0.20 | 3.20±0.97 | 3.01±1.04 | 3.02±1.08 |
| HunyuanVideo-Foley (ours) | 6.59 | 2.74 | 3.88 | 6.13 | 0.35 | 0.74 | 0.33 | 4.14±0.68 | 4.12±0.77 | 4.15±0.75 |

🎯 Kling-Audio-Eval Results

| 🏆 Method | FD_PANNs ↓ | FD_PASST ↓ | KL ↓ | IS ↑ | PQ ↑ | PC ↓ | CE ↑ | CU ↑ | IB ↑ | DeSync ↓ | CLAP ↑ |
|:-------------:|:--------------:|:--------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:-------------:|:-----------:|
| FoleyGrafter | 22.30 | 322.63 | 2.47 | 7.08 | 6.05 | 2.91 | 3.28 | 5.44 | 0.22 | 1.23 | 0.22 |
| V-AURA | 33.15 | 474.56 | 3.24 | 5.80 | 5.69 | 3.98 | 3.13 | 4.83 | 0.25 | 0.86 | 0.13 |
| Frieren | 16.86 | 293.57 | 2.95 | 7.32 | 5.72 | 2.55 | 2.88 | 5.10 | 0.21 | 0.86 | 0.16 |
| MMAudio | 9.01 | 205.85 | 2.17 | 9.59 | 5.94 | 2.91 | 3.30 | 5.39 | 0.30 | 0.56 | 0.27 |
| ThinkSound | 9.92 | 228.68 | 2.39 | 6.86 | 5.78 | 3.23 | 3.12 | 5.11 | 0.22 | 0.67 | 0.22 |
| HunyuanVideo-Foley (ours) | 6.07 | 202.12 | 1.89 | 8.30 | 6.12 | 2.76 | 3.22 | 5.53 | 0.38 | 0.54 | 0.24 |
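
Because the arrows in the header mix "higher is better" and "lower is better" columns, it can help to read the Kling-Audio-Eval numbers programmatically. The sketch below copies the values from the table and reports the leading method per metric; the names `HIGHER_IS_BETTER`, `RESULTS`, and `best_per_metric` are hypothetical and exist only for this example.

```python
from typing import Dict, List

# Metric name -> True if higher is better (arrows copied from the table header).
HIGHER_IS_BETTER: Dict[str, bool] = {
    "FD_PANNs": False, "FD_PASST": False, "KL": False, "IS": True,
    "PQ": True, "PC": False, "CE": True, "CU": True,
    "IB": True, "DeSync": False, "CLAP": True,
}

# Rows copied from the Kling-Audio-Eval table, in header order.
RESULTS: Dict[str, List[float]] = {
    "FoleyGrafter": [22.30, 322.63, 2.47, 7.08, 6.05, 2.91, 3.28, 5.44, 0.22, 1.23, 0.22],
    "V-AURA": [33.15, 474.56, 3.24, 5.80, 5.69, 3.98, 3.13, 4.83, 0.25, 0.86, 0.13],
    "Frieren": [16.86, 293.57, 2.95, 7.32, 5.72, 2.55, 2.88, 5.10, 0.21, 0.86, 0.16],
    "MMAudio": [9.01, 205.85, 2.17, 9.59, 5.94, 2.91, 3.30, 5.39, 0.30, 0.56, 0.27],
    "ThinkSound": [9.92, 228.68, 2.39, 6.86, 5.78, 3.23, 3.12, 5.11, 0.22, 0.67, 0.22],
    "HunyuanVideo-Foley": [6.07, 202.12, 1.89, 8.30, 6.12, 2.76, 3.22, 5.53, 0.38, 0.54, 0.24],
}


def best_per_metric() -> Dict[str, str]:
    """Return the leading method for each metric, honoring its direction."""
    best: Dict[str, str] = {}
    for col, (metric, higher) in enumerate(HIGHER_IS_BETTER.items()):
        pick = max if higher else min
        best[metric] = pick(RESULTS, key=lambda name: RESULTS[name][col])
    return best


if __name__ == "__main__":
    for metric, method in best_per_metric().items():
        print(f"{metric}: {method}")
```

On these numbers HunyuanVideo-Foley leads seven of the eleven columns; MMAudio leads IS, CE, and CLAP, and Frieren leads PC.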

---

🚀 Quick Start

📦 Installation

🔧 System Requirements

  • CUDA: 12.4 or 11.8 recommended
  • Python: 3.8+
  • OS: Linux (primary support)
  • VRAM: 20GB for XXL model (or 12GB with --enable_offload), 16GB for XL model (or 8GB with --enable_offload)
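
The VRAM figures above can be turned into a small lookup when scripting a launch. The following is only a sketch: the thresholds and the `--enable_offload` flag come from the requirements list, while the `Plan` type, the helper names, and the preference order (take the configuration with the highest VRAM threshold that still fits) are assumptions for illustration.

```python
from typing import List, NamedTuple, Optional


class Plan(NamedTuple):
    model: str     # "XXL" or "XL"
    offload: bool  # whether to pass --enable_offload


# (minimum VRAM in GB, plan), using the figures from the requirements list.
# Ordering by descending threshold is an arbitrary choice for this sketch.
_OPTIONS = [
    (20, Plan("XXL", False)),
    (16, Plan("XL", False)),
    (12, Plan("XXL", True)),
    (8, Plan("XL", True)),
]


def choose_plan(vram_gb: float) -> Optional[Plan]:
    """Return the first configuration whose VRAM requirement is met, else None."""
    for minimum, plan in _OPTIONS:
        if vram_gb >= minimum:
            return plan
    return None


def offload_args(plan: Plan) -> List[str]:
    """Extra command-line arguments implied by the plan."""
    return ["--enable_offload"] if plan.offload else []
```

For example, `choose_plan(12)` selects the XXL model with offloading, and `choose_plan(6)` returns `None` because no listed configuration fits.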

Step 1: Clone Repository

# 📥 Clone the repository
git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley
cd HunyuanVideo-Foley

Step 2: Environment Setup

💡 Tip: We recommend using Conda for Python environment management.

# 🔧 Install dependencies
pip install -r requirements.txt

Step 3: Download Pretrained Models

🔗 Download the model weights from Hugging Face

# using git-lfs
git clone https://huggingface.co/tencent/HunyuanVideo-Foley

# using huggingface-cli
huggingface-cli download tencent/HunyuanVideo-Foley
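
If the download needs to run from a setup script, the `huggingface-cli` command above can be wrapped in Python. This is a minimal sketch assuming `huggingface-cli` is already on the `PATH`; the helper names and the choice to pass `--local-dir` (so the files land in a known folder rather than the default cache) are assumptions, not something the README prescribes.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

REPO_ID = "tencent/HunyuanVideo-Foley"  # repository id from the README


def build_download_command(local_dir: str) -> List[str]:
    """Build the CLI invocation; --local-dir selects the target folder."""
    return ["huggingface-cli", "download", REPO_ID, "--local-dir", local_dir]


def download_weights(local_dir: str) -> Path:
    """Run the download and return the target directory (hypothetical helper)."""
    if shutil.which("huggingface-cli") is None:
        raise RuntimeError("huggingface-cli not found on PATH")
    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(build_download_command(str(target)), check=True)
    return target
```

Calling `download_weights("pretrained_models")` would place the files in a `pretrained_models` folder; that folder name is only an example.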

---

💻 Usage

📊 Model Specifications

| Model | Checkpoint | VRAM (Normal) | VRAM (Offload) |
|-------|------------|---------------|----------------|
| XXL *(Default)* | hunyuanvideo_foley.pth | 20GB | 12GB…
