What does this model signal mean?

Meituan (LongCat) published meituan-longcat/LongCat-Flash-Omni on Hugging Face. This model signal records what was released and how the release is positioned. High-signal details: MIT license · 79 HF downloads · fast omnimodal AI model from Meituan. onlylabs links this event to 1 captured evidence page and 6 related model signals.

Meituan (LongCat) Model: meituan-longcat/LongCat-Flash-Omni

Captured source

Hugging Face: huggingface.co/meituan-longcat/LongCat-Flash-Omni

meituan-longcat/LongCat-Flash-Omni model card

published Oct 23, 2025 · seen Jun 6 · captured Jun 11 · http 200 · method plain · task any-to-any · license mit · library transformers · params 561B · downloads 79 · likes 115

LongCat-Flash-Omni

Tech Report 📄

Model Introduction

We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters (27B activated) that excels at real-time audio-visual interaction. It builds on LongCat-Flash's high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, augmented by efficient multimodal perception and speech reconstruction modules. Through a curriculum-inspired progressive training strategy, the model achieves comprehensive multimodal capabilities while maintaining strong unimodal capability. We open-source the model to foster future research and development in the community.
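As a rough illustration of what "560 billion parameters (with 27B activated)" implies, the sketch below computes the activated fraction and routes one token through a toy MoE layer. The two parameter counts come from the model card. Everything else is an assumption made for illustration: the function names, the top-2 routing, and the choice to model a zero-computation expert as an identity that returns its input unchanged. It is not LongCat-Flash's actual implementation.

```python
import math

TOTAL_PARAMS_B = 560   # from the model card
ACTIVE_PARAMS_B = 27   # from the model card


def activated_fraction(active_b, total_b):
    """Share of parameters used per token in a sparse MoE model."""
    return active_b / total_b


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_moe_token(x, router_logits, experts, top_k=2):
    """Route one token vector through its top_k experts (hypothetical toy).

    Assumption: an entry of None in `experts` stands for a zero-computation
    expert, modelled here as identity (it passes `x` through at no cost).
    Returns the weighted output and the number of experts that did real work.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(x)
    computed = 0
    for i in chosen:
        if experts[i] is None:
            y = x                      # zero-computation expert: identity
        else:
            y = experts[i](x)          # regular expert: real computation
            computed += 1
        out = [o + probs[i] * v for o, v in zip(out, y)]
    return out, computed


if __name__ == "__main__":
    frac = activated_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"activated fraction: {frac:.1%}")  # about 4.8%
```

In this toy, a token whose router picks a zero-computation expert spends less compute than one routed only to regular experts, which is the intuition behind letting per-token compute vary.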

Model Architecture

Key Features

🌟 SOTA and Unified Omni-Modal Model

LongCat-Flash-Omni is an open-source omni-modal model that achieves state-of-the-art cross-modal comprehension performance. It seamlessly integrates powerful offline multi-modal understanding with real-time audio–visual interaction within a single all-in-one framework.

🌟 Large-Scale with Low-Latency Audio–Visual Interaction

By leveraging an efficient LLM backbone, carefully designed lightweight modality encoders and decoder, and a chunk-wise audio–visual feature interleaving mechanism, LongCat-Flash-Omni achieves low-latency, high-quality audio–visual processing and streaming speech generation. It supports a context window of up to 128K tokens, enabling advanced capabilities in long-term memory, multi-turn dialogue, and temporal reasoning across multiple modalities.
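The chunk-wise audio-visual feature interleaving mentioned above can be pictured with a minimal sketch. The model card does not specify chunk sizes, ordering within a chunk, or any API, so the function names, the video-then-audio ordering, and the per-chunk sizes below are all hypothetical. The only idea carried over from the text is that features from both modalities are grouped into time-aligned chunks and merged into one stream.

```python
from itertools import zip_longest


def split_into_chunks(features, size):
    """Split a feature sequence into consecutive chunks of `size` items."""
    return [features[i:i + size] for i in range(0, len(features), size)]


def interleave_chunks(video_feats, audio_feats, video_per_chunk, audio_per_chunk):
    """Merge two feature streams chunk by chunk (hypothetical ordering).

    Assumption: each time chunk contributes its video features first,
    then its audio features. If one stream runs out early, the remaining
    chunks of the other stream are appended as they are.
    """
    video_chunks = split_into_chunks(video_feats, video_per_chunk)
    audio_chunks = split_into_chunks(audio_feats, audio_per_chunk)
    stream = []
    for v_chunk, a_chunk in zip_longest(video_chunks, audio_chunks, fillvalue=[]):
        stream.extend(v_chunk)
        stream.extend(a_chunk)
    return stream


if __name__ == "__main__":
    video = ["v0", "v1", "v2", "v3"]              # placeholder frame features
    audio = ["a0", "a1", "a2", "a3", "a4", "a5"]  # placeholder audio features
    print(interleave_chunks(video, audio, video_per_chunk=2, audio_per_chunk=3))
```

Because each chunk is complete as soon as its time window closes, a streaming system can start processing chunk 0 before chunk 1 has arrived, which is one plausible reason such a layout helps latency.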

🌟 Effective Early-Fusion Training

The model adopts an innovative multi-stage pretraining pipeline that progressively incorporates text, audio, and visual modalities under a balanced data strategy and early-fusion training paradigm, ensuring strong omni-modal performance without degradation in any single modality.

🌟 Efficient Training Infrastructure

Inspired by the concept of modality decoupling, we propose a Modality-Decoupled Parallelism training scheme that significantly enhances the efficiency of large-scale and highly challenging multimodal training.

🌟 Open-Source Contribution

We provide a comprehensive overview of the training methodology and data strategies behind LongCat-Flash-Omni, and release the model to accelerate future research and innovation in omni-modal intelligence.

For more details, please refer to the ***LongCat-Flash-Omni Technical Report***.

Evaluation Results

Omni-modality

| Benchmark | LongCat-Flash-Omni Instruct | Gemini-2.5-Pro (ThinkingBudget128) | Gemini-2.5-Flash (non-thinking) | Qwen3-Omni Instruct | Qwen2.5-Omni Instruct |
|-----------|-----------------------------|------------------------------------|---------------------------------|---------------------|-----------------------|
| OmniBench | 61.38 | 66.80 | 54.99 | 58.41 | 48.16 |
| WorldSense | 60.89 | 63.96 | 58.72 | 52.01 | 46.69 |
| DailyOmni | 82.38 | 80.61 | 80.78 | 69.33 | 47.45 |
| UNO-Bench | 49.90 | 64.48 | 54.30 | 42.10 | 32.60 |
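To read the Omni-modality scores above more easily, the following sketch ranks each model per benchmark and reports how far a given model sits from the best score. The numbers are copied from the table and assume higher is better on all four benchmarks. The shortened model names and the helper functions `rank_of` and `gap_to_best` are introduced here for illustration only.

```python
# Scores copied from the Omni-modality table; column order matches MODELS.
MODELS = [
    "LongCat-Flash-Omni",
    "Gemini-2.5-Pro",
    "Gemini-2.5-Flash",
    "Qwen3-Omni",
    "Qwen2.5-Omni",
]
SCORES = {
    "OmniBench": [61.38, 66.80, 54.99, 58.41, 48.16],
    "WorldSense": [60.89, 63.96, 58.72, 52.01, 46.69],
    "DailyOmni": [82.38, 80.61, 80.78, 69.33, 47.45],
    "UNO-Bench": [49.90, 64.48, 54.30, 42.10, 32.60],
}


def rank_of(model, benchmark):
    """1-based rank of `model` on `benchmark` (1 = highest score)."""
    row = dict(zip(MODELS, SCORES[benchmark]))
    ordered = sorted(row, key=row.get, reverse=True)
    return ordered.index(model) + 1


def gap_to_best(model, benchmark):
    """Score difference between `model` and the best model (0.0 if it leads)."""
    row = dict(zip(MODELS, SCORES[benchmark]))
    return round(row[model] - max(row.values()), 2)


if __name__ == "__main__":
    for bench in SCORES:
        print(bench, rank_of("LongCat-Flash-Omni", bench),
              gap_to_best("LongCat-Flash-Omni", bench))
```

On these four rows, LongCat-Flash-Omni ranks first on DailyOmni, second on OmniBench and WorldSense, and third on UNO-Bench, and it scores above both Qwen omni models on every row.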

Vision

Image-to-Text

| Benchmark | LongCat-Flash-Omni Instruct | Gemini-2.5-Pro (ThinkingBudget128) | Gemini-2.5-Flash (non-thinking) | Qwen3-Omni Instruct | Seed-1.6 | GPT-4o-1120 | Qwen3-VL-235B-A22B-Instruct | Qwen2.5-VL-72B-Instruct |
|-----------|-----------------------------|------------------------------------|---------------------------------|---------------------|----------|-------------|-----------------------------|-------------------------|
| General | | | | | | | | |
| MMBench-EN (test) | 87.5 | 89.8 | 89.3 | 86.8 | 88.5 | 83.7 | 88.3 | 88.6* |
| MMBench-ZH (test) | 88.7 | 89.2 | 88.5 | 86.4 | 83.8 | 82.8 | 89.8 | 87.9* |
| RealWorldQA | 74.8 | 76.0 | 73.9 | 72.9 | 74.5 | 74.1 | 79.3* | 75.7* |
| MMStar | 70.9 | 78.5* | 75.5 | 68.5* | 71.5 | 63.2 | 78.4* | 68.2 |
| STEM & Reasoning | | | | | | | | |
| MathVista (mini) | 77.9 | 77.7* | 77.1 | 75.9 | 78.7 | 62.8 | 84.9* | 74.8* |
| MMMU (val) | 70.7 | 80.9* | 76.3 | 69.1* | 74.9 | 69.4 | 78.7* | 70.2* |
| MMVet | 69.0 | 80.7 | 79.5 | 68.9 | 74.4 | 76.6 | 75.9 | 74.5 |
| Multi-Image | | | | | | | | |
| BLINK | 63.1 | 70.0* | 65.7 | 56.1 | 65.0 | 65.5 | 70.7* | 60.1 |
| MuirBench | 77.1 | 74.0* | 73.7 | 62.1 | 74.6 | 70.5 | 72.8* | 70.7* |
| Mantis | 84.8 | 83.9 | 83.4 | 80.7 | 81.1 | 79.3 | 79.7 | 82.0 |
| Text Recognition & Chart/Document Understanding | | | | | | | | |
| ChartQA | 87.6 | 71.7 | 77.6 | 86.8* | 82.4 | 74.5 | 89.2 | 89.5* |
| DocVQA | 91.8 | 94.0* | 93.6* | 95.7 | 94.3 | 80.9 | 94.6 | 96.4* |
| OCRBench | 84.9 | 87.2* | 85.6 | 85.5 | 85.6 | 82.3 | 91.2 | 88.5 |
| OmniDocBench (EN/ZH) ↓ | 22.8/29.0 | 31.9/24.5 | 22.8/32.9 | 28.4/40.5 | 22.0/27.6 | 25.9/37.7 | 13.6/17.5 | 22.6/32.4* |
| Grounding & Counting | | | | | | | | |
| RefCOCO-avg | 92.3 | 75.4 | 71.9 | 89.3 | 80.2 | - | 87.1 | 90.3 |
| CountBench | 92.4 | 91.0* | 78.6 | 90.0* | 94.1 | 85.6* | 94.3 | 93.6* |
| Graphical User Interface (GUI) | | | | | | | | |
| VisualWebBench | 78.7 | 81.1 | 73.5 | 79.3 | 81.1 | 77.1 | 80.8 | 82.3* |
| ScreenSpot-v2 | 91.2 | 75.8 | 63.9 | 94.7 | 91.7 | - | 93.4 | 92.9 |
| AndroidControl (low) | 91.2 | 79.2 | 79.1 | 90.5 | 84.6 | 65.2 | 90.0 | 93.7* |
| AndroidControl (high) | 75.6 | 60.8 | 55.5 | 70.8 | 55.2 | 41.7 | 74.1 | 67.4* |

Note: Values marked with * are sourced from public reports. As GPT-4o does not support image grounding, we do not report its results on RefCOCO and ScreenSpot-v2.

---

Video-to-Text

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Low traction, routine release