What does this repo signal mean?

Meituan (LongCat) published meituan-longcat/UNO-Bench (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo meituan-longcat/UNO-Bench · language Python · New benchmark repo from Meituan, modest traction. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Meituan (LongCat) Repo: meituan-longcat/UNO-Bench

Captured source

source ↗

GitHub/github.com/meituan-longcat/UNO-Bench

meituan-longcat/UNO-Bench repository metadata

published Oct 24, 2025 · seen Jun 5 · captured Jun 11 · http 200 · method plain

meituan-longcat/UNO-Bench

Description: Omni Model Benchmark with high quality and diversity, which reveals the Compositional Law. We’re now focused on Chinese scenarios — and actively seeking partners to co-build English & multilingual versions! Let’s expand global impact together.

Language: Python

License: MIT

Stars: 78

Forks: 0

Open issues: 2

Created: 2025-10-24T04:20:18Z

Pushed: 2026-01-12T02:51:14Z

Default branch: main

Fork: no

Archived: no

README: UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models

🔔News

🔥[2025/12/04] We have released the evaluation scripts [uno-eval](./uno-eval/), a unified evaluation framework for omni-modal benchmarks. More benchmarks will be supported in the future.
🔥[2025/12/04] We have released the scoring model UNO-Scorer-Qwen3-14B. Feel free to use it!
🔥[2025/10/29] We proposed a new omni-modal benchmark, UNO-Bench. The technical report is available on arXiv. The dataset is available on Hugging Face.

👀 UNO-Bench Overview

Multimodal large language models have been progressing from uni-modal understanding toward unifying visual, audio and language modalities; such models are collectively termed omni models. However, the correlation between uni-modal and omni-modal capabilities remains unclear, and comprehensive evaluation is needed to drive the evolution of omni model intelligence. In this work, we introduce a novel, high-quality, and UNified Omni model benchmark, UNO-Bench. This benchmark is designed to effectively evaluate both UNi-modal and Omni-modal capabilities under a unified ability taxonomy, spanning 44 task types and 5 modality combinations. It includes 1250 human-curated omni-modal samples with 98% cross-modality solvability, and 2480 enhanced uni-modal samples. The human-generated dataset is well-suited to real-world scenarios, particularly within the Chinese context, whereas the automatically compressed dataset offers a 90% increase in speed and maintains 98% consistency across 18 public benchmarks. In addition to traditional multi-choice questions, we propose an innovative multi-step open-ended question format to assess complex reasoning. A general scoring model is incorporated, supporting 6 question types for automated evaluation with 95% accuracy. Experimental results show the Compositional Law between omni-modal and uni-modal performance: omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models.

Main Contributions

🌟 Propose UNO-Bench, the first unified omni model benchmark, efficiently assessing uni-modal and omni-modal understanding. It verifies the compositional law between these capabilities: omni-modal capability acts as a bottleneck for weaker models and a synergistic boost for stronger ones.

🌟 Establish a high-quality dataset pipeline with human-centric processes and automated compression. UNO-Bench contains 1250 omni-modal samples with 98% cross-modality solvability and 2480 uni-modal samples across 44 task types and 5 modality combinations. The human-generated data suits real-world scenarios, especially in the Chinese context, and the automatically compressed data offers a 90% speed increase while maintaining 98% consistency across 18 benchmarks.

🌟 Introduce Multi-Step Open-Ended Questions (MO) for complex reasoning evaluation, providing realistic results. A General Scoring Model supports 6 question types with 95% accuracy on OOD models and benchmarks.

📊 Dataset Construction

Material Collection

Our materials feature three key characteristics:
a. Diverse Sources: primarily real-world photos and videos from crowdsourcing, supplemented by copyright-free websites and high-quality public datasets.
b. Rich and Diverse Topics: spanning society, culture, art, life, literature, and science.
c. Live-Recorded Audio: dialogue recorded by over 20 human speakers, ensuring rich audio features that mirror real-world vocal diversity.

QA Annotation

Our annotators include human experts and skilled crowd-sourced users. Human experts bring extensive experience in cross-modal data and model understanding, ensuring professional and specific data. Crowd-sourced users, mainly college students, offer authentic and diverse data due to their experience with multi-modal models and varied backgrounds.

Quality Inspection

To ensure data quality, we use a multi-stage quality assurance system combining automated tools and manual review. Each question undergoes three independent inspections: a preliminary model check filters out ambiguous or non-conforming questions; modality ablation experiments test cross-modality solvability by removing one modality; and final manual inspection and revision ensure accuracy.
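The modality ablation step can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes that "cross-modality solvability" means a question is answered correctly with all modalities present but not when any single one is removed. The function `requires_all_modalities`, the `answer_fn` callable (a stand-in for a model call), and the sample field names are all hypothetical.

```python
def requires_all_modalities(sample, answer_fn, modalities=("audio", "visual")):
    """Return True if the sample is answered correctly with every modality
    present and incorrectly whenever any single modality is removed.

    `answer_fn(question, inputs)` is a hypothetical stand-in for a model call;
    `inputs` maps modality name -> content.
    """
    def correct(inputs):
        return answer_fn(sample["question"], inputs) == sample["answer"]

    full = {m: sample[m] for m in modalities}
    if not correct(full):
        # Wrong even with all modalities: the check cannot confirm the sample.
        return False
    for dropped in modalities:
        ablated = {m: v for m, v in full.items() if m != dropped}
        if correct(ablated):
            # Solvable without `dropped`, so it is not truly cross-modal.
            return False
    return True
```

A sample that fails this check would presumably go to manual revision rather than being discarded outright, matching the final inspection stage described above.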

Data Compression

For automated data compression, we propose a cluster-guided stratified sampling method that compresses 18 public benchmarks, achieving 90% dataset compression with 98% ranking consistency.
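As a rough sketch of the idea, under stated assumptions: the snippet below takes cluster labels as already computed (the excerpt does not say how clusters are formed), keeps a fixed fraction of each cluster, and measures ranking consistency as a Spearman rank correlation between model scores on the full and compressed sets. The choice of Spearman correlation and the names `stratified_sample` and `spearman` are illustrative assumptions, not the repository's API.

```python
import random
from collections import defaultdict


def stratified_sample(cluster_ids, keep_ratio=0.1, seed=0):
    """Return sorted item indices keeping roughly `keep_ratio` of each cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)
    kept = []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        k = max(1, round(len(members) * keep_ratio))  # keep at least one item per cluster
        kept.extend(rng.sample(members, k))
    return sorted(kept)


def spearman(xs, ys):
    """Spearman rank correlation for two score lists without ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy cluster labels (made up): 100 items in three clusters, compressed by 90%.
cluster_ids = [0] * 50 + [1] * 30 + [2] * 20
subset = stratified_sample(cluster_ids, keep_ratio=0.1)
```

Sampling per cluster rather than uniformly keeps small clusters represented in the compressed set, which is one plausible reason model rankings would stay stable after compression.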

📍 Dataset Examples

The capabilities of UNO-Bench are systematically categorized into two primary dimensions: Perception and Reasoning. UNO-Bench can be downloaded via the link in the source repository. Some examples from UNO-Bench are shown below:

---

For more samples, please refer to the project page.

🔍 Results

Our main evaluation reveals a clear performance hierarchy where proprietary models, particularly Gemini-2.5-Pro, establish the state of the art across all benchmarks.

Finding 1. 📍Perception Ability and Reasoning Ability: Compared to human experts, Gemini-2.5-Pro exhibits similar performance in perception, but falls significantly behind in reasoning. Meanwhile, humans are more proficient in reasoning than in perception (81.3% versus 74.3%).

Finding 2. 📍Compositional Law: The effectiveness of omni-modal capability correlates with the product of individual modality performances, following a power law. Based on the fundamental premise that nearly 100% of the questions in UNO-Bench require a joint understanding of audio and visual information, we combine experimental observations with rigorous mathematical derivation to propose the following formula for the compositional law.

$$...
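The formula itself is truncated in this excerpt, so the snippet below does not reproduce it. It only illustrates what fitting a generic power law to the product of uni-modal scores could look like, assuming the simplest form `omni ≈ c * (audio * visual) ** alpha`; the actual formula in the report may include further terms. The function name `fit_power_law` and all scores are hypothetical, and the data is synthetic, generated from known parameters purely to exercise the fit.

```python
import math


def fit_power_law(products, omni_scores):
    """Least-squares fit of log(omni) = log(c) + alpha * log(product)."""
    xs = [math.log(p) for p in products]
    ys = [math.log(s) for s in omni_scores]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    c = math.exp(mean_y - alpha * mean_x)
    return c, alpha


# Synthetic, made-up scores generated from c=1.2, alpha=2.0 (not benchmark results).
audio = [0.5, 0.6, 0.7, 0.8]
visual = [0.4, 0.6, 0.7, 0.9]
products = [a * v for a, v in zip(audio, visual)]
omni = [1.2 * p ** 2.0 for p in products]
c, alpha = fit_power_law(products, omni)
```

Under this assumed form, an exponent above 1 would make omni-modal scores fall off faster than the uni-modal product for weak models and catch up for strong ones, which is one way to read the bottleneck and synergy effects described in the overview.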

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

New benchmark repo from Meituan, modest traction.