microsoft/AVGen-Bench
Description: [ICML26] AVGen-Bench is a task-driven benchmark for multi-granular evaluation of Text-to-Audio-Video (T2AV) generation.
Language: Python
License: MIT
Stars: 19
Forks: 0
Open issues: 26
Created: 2026-03-15T16:17:49Z
Pushed: 2026-06-17T04:25:23Z
Default branch: main
Fork: no
Archived: no
README:
AVGen-Bench
AVGen-Bench is a task-driven benchmark for multi-granular evaluation of Text-to-Audio-Video (T2AV) generation.
Repository Information
- Support: see [SUPPORT.md](SUPPORT.md) for bug reports, usage questions, and issue filing guidance.
- Security: see [SECURITY.md](SECURITY.md) for responsible vulnerability reporting instructions.
- Code of Conduct: see [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) for community participation expectations.
Compared with prior benchmarks, AVGen-Bench emphasizes joint audio-video evaluation, fine-grained multi-dimensional assessment, and more complex task-oriented prompts.
Benchmark Results
Following the paper's reporting narrative, AV/Lip are complementary measurements for synchronization, while Lo-Phy and Hi-Phy are weighted as separate fine-grained dimensions.
Metric direction: higher is better for Vis, Aud (PQ), Text, Face, Music, Speech, Lo-Phy, Hi-Phy, and Holistic; lower is better for AV and Lip.
Models are sorted by Total in descending order. Best scores are in bold. Second-best scores are in *italics*.
The compact table below shows the leaderboard summary. Expand the detailed table for the full per-metric breakdown. Component labels in this README use text markers instead of color backgrounds.
| Model | Components | Total |
|---|---|---:|
| Seedance 2.0 | Seedance 2.0 (Proprietary) | **72.07** |
| Veo 3.1-fast | Veo 3.1-fast (Proprietary) | *67.87* |
| Veo 3.1-quality | Veo 3.1-quality (Proprietary) | 66.28 |
| Sora-2 | Sora-2 (Proprietary) | 64.16 |
| Wan2.6 | Wan2.6 (Proprietary) | 62.97 |
| Seedance-1.5 Pro | Seedance-1.5 Pro (Proprietary) | 62.55 |
| Kling-V2.6 | Kling-V2.6 (Proprietary) | 61.82 |
| LTX-2.3 | LTX-2.3 (Open-source) | 59.97 |
| NanoBanana2 + MOVA | NanoBanana2 (Proprietary) + MOVA (Open-source) | 58.10 |
| LTX-2 | LTX-2 (Open-source) | 56.62 |
| Emu3.5 + MOVA | Emu3.5 (Open-source) + MOVA (Open-source) | 56.12 |
| Wan2.2 + HunyuanVideo-Foley | Wan2.2 (Open-source) + HunyuanVideo-Foley (Open-source) | 53.29 |
| Ovi | Ovi (Open-source) | 52.02 |
Full per-metric results
| Model | Components | Vis | Aud (PQ) | AV | Lip | Text | Face | Music | Speech | Lo-Phy | Hi-Phy | Holistic | Total |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Seedance 2.0 | Seedance 2.0 (Proprietary) | 0.945 | *7.15* | **0.15** | 4.14 | 74.83 | **60.95** | **28.12** | 94.09 | 3.89 | **83.16** | **89.61** | **72.07** |
| Veo 3.1-fast | Veo 3.1-fast (Proprietary) | *0.960* | 6.64 | *0.21* | 2.39 | 75.10 | 52.77 | 3.13 | *94.53* | 3.68 | 67.43 | 86.27 | *67.87* |
| Veo 3.1-quality | Veo 3.1-quality (Proprietary) | 0.954 | 6.77 | 0.24 | 3.59 | *76.53* | 52.90 | 5.00 | **96.09** | 3.74 | 68.53 | 84.10 | 66.28 |
| Sora-2 | Sora-2 (Proprietary) | 0.848 | 5.91 | 0.25 | 4.50 | 74.84 | 51.17 | 7.81 | 88.63 | **4.05** | *78.95* | *88.89* | 64.16 |
| Wan2.6 | Wan2.6 (Proprietary) | 0.959 | *7.15* | 0.30 | 4.32 | **76.95** | 49.27 | 1.75 | 89.33 | 3.69 | 66.92 | 80.98 | 62.97 |
| Seedance-1.5 Pro | Seedance-1.5 Pro (Proprietary) | **0.970** | **7.48** | 0.26 | 3.43 | 38.28 | 54.42 | 1.88 | 93.45 | 3.72 | 66.88 | 77.38 | 62.55 |
| Kling-V2.6 | Kling-V2.6 (Proprietary) | 0.906 | 6.93 | *0.21* | *2.30* | 14.52 | *57.33* | 5.00 | 89.62 | 3.84 | 63.92 | 76.74 | 61.82 |
| LTX-2.3 | LTX-2.3 (Open-source) | 0.858 | 7.11 | 0.36 | **2.00** | 54.17 | 45.06 | 1.38 | 86.66 | *3.99* | 64.31 | 65.22 | 59.97 |
| NanoBanana2 + MOVA | NanoBanana2 (Proprietary) + MOVA (Open-source) | 0.890 | 6.71 | 0.44 | 2.70 | 68.26 | 41.33 | 0.59 | 82.45 | 3.91 | 60.95 | 72.48 | 58.10 |
| LTX-2 | LTX-2 (Open-source) | 0.828 | 6.84 | 0.23 | 4.76 | 24.76 | 48.53 | 5.75 | 87.07 | **4.05** | 60.20 | 66.59 | 56.62 |
| Emu3.5 + MOVA | Emu3.5 (Open-source) + MOVA (Open-source) | 0.911 | 6.80 | 0.38 | 4.83 | 64.72 | 48.44 | 0.62 | 81.74 | 3.89 | 55.85 | 66.55 | 56.12 |
| Wan2.2 + HunyuanVideo-Foley | Wan2.2 (Open-source) + HunyuanVideo-Foley (Open-source) | 0.936 | 6.60 | 0.23 | 5.38 | 48.46 | 36.23 | 3.44 | 53.40 | 3.90 | 54.11 | 60.63 | 53.29 |
| Ovi | Ovi (Open-source) | 0.839 | 6.31 | 0.37 | 5.40 | 41.36 | 49.05 | *11.25* | 76.49 | 3.93 | 52.92 | 57.45 | 52.02 |
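The metric directions and highlighting rules described above the tables can be checked with a short sketch. This is not code from the repository: the names `LOWER_IS_BETTER`, `ROWS`, and `highlights` are hypothetical, the data is a handful of rows and columns copied from the full table, and sharing a rank between tied scores is an assumption that appears consistent with the tied italic entries (for example the two 0.21 values under AV). How Total is computed from the individual metrics is not stated here, so the sketch only sorts by the reported Total.

```python
# Metric directions as stated in this README: lower is better for AV and Lip,
# higher is better for every other column.
LOWER_IS_BETTER = {"AV", "Lip"}

# A subset of rows and columns copied from the full per-metric table.
ROWS = {
    "Seedance 2.0": {"AV": 0.15, "Lip": 4.14, "Music": 28.12, "Total": 72.07},
    "Veo 3.1-fast": {"AV": 0.21, "Lip": 2.39, "Music": 3.13, "Total": 67.87},
    "Kling-V2.6": {"AV": 0.21, "Lip": 2.30, "Music": 5.00, "Total": 61.82},
    "LTX-2.3": {"AV": 0.36, "Lip": 2.00, "Music": 1.38, "Total": 59.97},
    "Ovi": {"AV": 0.37, "Lip": 5.40, "Music": 11.25, "Total": 52.02},
}


def highlights(rows, metric):
    """Return the models holding the best and second-best score for a metric.

    Tied scores share a rank (assumption), so either list may hold
    more than one model.
    """
    higher_is_better = metric not in LOWER_IS_BETTER
    # Distinct score values, ordered best-first for this metric's direction.
    ranked = sorted({r[metric] for r in rows.values()}, reverse=higher_is_better)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else None
    return {
        "best": sorted(m for m, r in rows.items() if r[metric] == best),
        "second": sorted(m for m, r in rows.items() if r[metric] == second),
    }


# Leaderboard order: models sorted by Total, descending.
leaderboard = sorted(ROWS, key=lambda m: ROWS[m]["Total"], reverse=True)

if __name__ == "__main__":
    print(leaderboard)
    for metric in ("AV", "Lip", "Music"):
        print(metric, highlights(ROWS, metric))
```

Because only five rows are included, the second-best entries reported by the sketch apply to this subset and need not match the italics in the full table for every column.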
Failure Demo Videos
The following demos are selected from Appendix A failure cases. Preview images remain in assets/failure_demos/, while showcased videos are hosted through GitHub attachments to keep the repository lightweight.
Case 1: Prompted Text Rendering ("Your customers are talking")
Original Prompt: A single wind-up chattering teeth toy clacks continuously against a solid teal background. The scene cuts to a blue screen displaying the white text 'Your customers are talking,' abruptly followed by rows of multi-colored chattering teeth toys all moving at once, creating a loud chaotic mechanical clatter. A green screen appears with the text 'Are you listening?' before cutting to a generic product logo and a 'Try it free' button on a white background as the noise ceases.
Veo 3.1 Fast
https://github.com/user-attachments/assets/ce5bbd62-95d2-49f7-9066-3d4ee1d8b478
Ovi
https://github.com/user-attachments/assets/53e83129-116c-4a35-b6de-3ca03b6151ce
LTX-2
https://github.com/user-attachments/assets/5e9dd440-682f-4bfe-b066-f3555c776968
Kling 2.6
https://github.com/user-attachments/assets/c14fdae7-cdb1-42fa-8288-024556692a59
Case 2: Trailer Title Rendering ("EIGHTY-SEVEN SECONDS")
Original Prompt: Four-shot high-tempo teaser with clean sync hits. Shot 1: Inside a bank vault, fluorescent hum and distant alarms; a timer on a device beeps faster as a thief whispers, "Eighty-seven seconds, move." Shot 2: Close-up of a glass cutter scoring a pane with a sharp scratch, then a suction cup pops as the circle lifts free, landing on a bass hit. Shot 3: Smash cut to a getaway car; engine revs, tires chirp, and the car fishtails out of a tight alley with gravel spraying and rattling off the chassis. Shot 4: A final slow-motion shot of a duffel bag hitting the pavement with a heavy thud as sirens surge; the title EIGHTY-SEVEN SECONDS slams onto black with a metallic logo sting....
Notability: 5.0/10. New benchmark repo, low traction so far.