Repo · Tencent Hunyuan · published Jun 26, 2025

Tencent-Hunyuan/ArtifactsBenchmark

Python


Language: Python

License: NOASSERTION

Stars: 264

Forks: 21

Open issues: 2

Created: 2025-06-26T12:44:06Z

Pushed: 2025-12-30T06:38:40Z

Default branch: main

Fork: no

Archived: no

README:

Figure 1: Automation level versus human alignment across evaluation frameworks. The red star marks the fully manual WebDev Arena (100% human effort), while the blue bubble denotes our checklist-guided MLLM evaluation, ArtifactsBench, which reaches 94.4% agreement with human votes while being 100% automated.
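The agreement figure in the caption can be read as a pairwise agreement rate between automated scores and human preference votes. The excerpt does not say how the 94.4% was computed, so the following is only a minimal sketch of one common definition. The function name `pairwise_agreement`, the `(winner, loser)` vote format, and the choice to count ties as disagreements are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def pairwise_agreement(
    auto_scores: Dict[str, float],
    human_votes: List[Tuple[str, str]],
) -> float:
    """Fraction of human votes that the automated scores agree with.

    Each vote is a hypothetical (winner, loser) pair of model names.
    A vote counts as agreement only when the automated score of the
    winner is strictly higher than that of the loser, so ties count
    as disagreement (an assumption, not the benchmark's rule).
    """
    if not human_votes:
        raise ValueError("at least one human vote is required")
    agreed = sum(
        1 for winner, loser in human_votes
        if auto_scores[winner] > auto_scores[loser]
    )
    return agreed / len(human_votes)


# Toy inputs for illustration only; these are not benchmark results.
toy_scores = {"model_a": 3.0, "model_b": 2.0, "model_c": 1.0}
toy_votes = [
    ("model_a", "model_b"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),  # humans disagree with the automated order
    ("model_a", "model_c"),
]
toy_rate = pairwise_agreement(toy_scores, toy_votes)
```

With the toy inputs above, three of the four votes match the automated ordering, so `toy_rate` is 0.75.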

Introduction

The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences.

To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior, which is then assessed by an MLLM-as-Judge guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring.
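The checklist-guided judging step described above can be sketched in a few lines. This is not the repository's evaluation harness: the class names `ChecklistItem` and `RenderedArtifact`, the `Judge` callable, the 0 to 10 per-item scale, and the averaging onto a 0 to 100 range are all assumptions chosen to illustrate the idea. Rendering and the actual MLLM call are left out and replaced by a stub.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class ChecklistItem:
    """One per-task criterion the judge is asked to score (hypothetical)."""
    description: str
    max_score: float = 10.0  # assumed scale, not taken from the source


@dataclass(frozen=True)
class RenderedArtifact:
    """Generated code plus the screenshots captured while it runs (hypothetical)."""
    code: str
    screenshots: Tuple[str, ...]  # file paths of captured frames


# A judge maps (artifact, checklist item) to a raw score. In a real setup
# this would wrap a multimodal model call; here it is any callable.
Judge = Callable[[RenderedArtifact, ChecklistItem], float]


def score_artifact(
    artifact: RenderedArtifact,
    checklist: Sequence[ChecklistItem],
    judge: Judge,
) -> float:
    """Average the clipped, normalized per-item scores onto a 0-100 scale."""
    if not checklist:
        raise ValueError("checklist must contain at least one item")
    normalized = []
    for item in checklist:
        raw = judge(artifact, item)
        clipped = min(max(raw, 0.0), item.max_score)
        normalized.append(clipped / item.max_score)
    return 100.0 * mean(normalized)


def stub_judge(artifact: RenderedArtifact, item: ChecklistItem) -> float:
    """Stand-in for an MLLM judge: always returns the midpoint score."""
    return item.max_score / 2


demo_artifact = RenderedArtifact(code="<html></html>", screenshots=("frame_0.png",))
demo_checklist = [
    ChecklistItem("Layout matches the task description"),
    ChecklistItem("Interactive controls respond to clicks"),
]
demo_score = score_artifact(demo_artifact, demo_checklist, stub_judge)
```

Keeping the judge as a plain callable makes it straightforward to swap the stub for a real model client, or to replay stored judge outputs when checking reproducibility.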

ArtifactsBench is open-sourced, including the benchmark with 1,825 diverse tasks, the evaluation harness, and baseline results, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.

🚀 Latest Updates & Release Notes

  • 2025.10.27 | MiniMax-M2 — ArtifactsBench: 66.8 (DeepSeek‑V3.2: 55.8). 🎉🎉🎉 A compact MoE model (230B total parameters with 10B active) built for elite performance in coding and agentic tasks. Link: MiniMax-M2 on Hugging Face; GitHub: MiniMax-M2. Scores per the report in the linked page.
  • 2025.10.27 | JanusCoder (paper) — Releases JanusCode‑800K 🎉🎉🎉, a large-scale multimodal code corpus establishing a visual‑programmatic interface for code intelligence; reported improvements on ArtifactsBench: +3.0🎉 for Qwen3‑8B and +1.3🎉 for Qwen3‑14B, further expanding research on code visualization. Link: arXiv:2510.23538.
  • 2025.10.23 | ReLook (paper) — Vision‑grounded RL for agentic web coding: MLLM visual critic, zero‑reward for invalid renders, Forced Optimization, critic‑free inference; outperforms strong baselines. 🎉🎉🎉 Link: arXiv:2510.11498.
  • 2025.10.08 | Ling‑1T (1T) — ArtifactsBench: 59.31 (DeepSeek‑V3.1‑Terminus: 43.29). 🎉🎉🎉 Link: Ling-1T model card; Technical report: *"Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation"* (arXiv:2510.22115). Scores per the report in the linked page.
  • 2025.06.27 | Tencent Hunyuan-A13B — ArtifactsBench: 42.95. 🎉 An efficient MoE model with 13B active parameters (80B total) demonstrating strong performance in generating interactive visual artifacts while maintaining excellent resource efficiency. Link: Hunyuan-A13B-Instruct; Technical Report: Hunyuan-A13B Technical Report.

Version 1.2 - August 9, 2025 🔥🔥🔥

Breakthrough Performance Updates:

🚀 Revolutionary Model Additions: Added comprehensive evaluation of cutting-edge models including GPT-5, GPT-OSS-120B, and Claude Opus 4.1, representing the latest advances in AI code generation capabilities.

🏆 Historic Achievements:

  • GPT-5 achieves an unprecedented 72.55 average score, setting a new benchmark record and establishing the new state-of-the-art for closed-source models
  • GPT-OSS-120B secures the #1 position among open-source models with a remarkable 57.69 average score
  • Claude Opus 4.1 demonstrates significant advancement with a 59.76 average score

💡 OpenAI's Code Visualization Dominance: The results showcase OpenAI's exceptional capabilities in visual code generation, with both GPT-5 and GPT-OSS-120B leading their respective categories and demonstrating superior understanding of interactive visual artifact creation.

📈 Performance Insights: Enhanced analysis reveals the relationship between model inference patterns and visual code generation quality, providing deeper insights into what makes models excel at creating interactive experiences.

Key Highlights:

  • First 70+ Score Achievement: GPT-5 breaks the 70-point barrier, demonstrating a quantum leap in visual artifact generation
  • Open-Source Leadership: GPT-OSS-120B establishes new standards for open-source visual code generation
  • Cross-Category Excellence: OpenAI models demonstrate consistent superiority across web applications, interactive games, and dynamic visualizations

Version 1.1 - July 30, 2025 🔥🔥🔥

Key Updates:

🆕 Model Coverage Expansion: Added an evaluation of GLM-4.5, extending our coverage of state-of-the-art language models.

📊 Enhanced Visualization: Introduced a new analysis chart artifactsbench_vs_model_infer.png that visualizes the relationship between model inference scores and model response lengths, providing deeper insights into model behavior patterns.
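A relationship like the one this chart plots could be summarized numerically with a Pearson correlation between scores and response lengths. The sketch below is an assumption about how such a summary might be computed and is not the script that produced `artifactsbench_vs_model_infer.png`. The helper name `pearson` and the toy numbers are purely illustrative and are not benchmark data.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up placeholder values, NOT taken from the benchmark.
toy_lengths = [1000.0, 2000.0, 3000.0]  # response lengths in tokens
toy_scores = [40.0, 50.0, 60.0]         # average scores
toy_r = pearson(toy_lengths, toy_scores)
```

A value near 1 would suggest that longer responses go along with higher scores, while a value near 0 would suggest no linear relationship. Either way it says nothing about causation.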

Version 1.1 - July 25, 2025 🔥🔥🔥

We're excited to announce important updates to ArtifactsBench that significantly improve reproducibility, expand model coverage, and enhance evaluation stability:

Key Updates:

🔧 Unified Judge Model: Migrated from Gemini-2.5-Pro-Preview-0605 (now deprecated) to the stable Gemini-2.5-Pro for all evaluations, ensuring consistent reproducibility for research communities.

🆕 Expanded Model Coverage: Added evaluations of the latest high-quality open-source code models to keep pace with rapid developments in the field.

📊 Enhanced Transparency: Released intermediate reasoning results and evaluation data so that researchers can verify and fully reproduce our results.

🎯 Full Open-Source & Complete Reproducibility:

🔓 100% Data Open-Source: All evaluation data, model outputs, judge…

