meituan-longcat/LongCat-Flash-Omni
Captured source
source ↗LongCat-Flash-Omni
Tech Report 📄
Model Introduction
We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters (with 27B activated), excelling at real-time audio-visual interaction, which is attained by leveraging LongCat-Flash's high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, augmented by efficient multimodal perception and speech reconstruction modules. Through an effective curriculum-inspired progressive training strategy, our model achieves comprehensive multimodal capabilities while maintaining strong unimodal capability. Now, we open-source the model to foster future research and development in the community.
Model Architecture
Key Features
🌟 SOTA and Unified Omni-Modal Model
LongCat-Flash-Omni is an open-source omni-modal model that achieves state-of-the-art cross-modal comprehension performance. It seamlessly integrates powerful offline multi-modal understanding with real-time audio–visual interaction within a single all-in-one framework.
🌟 Large-Scale with Low-Latency Audio–Visual Interaction
By leveraging an efficient LLM backbone, carefully designed lightweight modality encoders and decoder, and a chunk-wise audio–visual feature interleaving mechanism, LongCat-Flash-Omni achieves low-latency, high-quality audio–visual processing and streaming speech generation. It supports a context window of up to 128K tokens, enabling advanced capabilities in long-term memory, multi-turn dialogue, and temporal reasoning across multiple modalities.
🌟 Effective Early-Fusion Training
The model adopts an innovative multi-stage pretraining pipeline that progressively incorporates text, audio, and visual modalities under a balanced data strategy and early-fusion training paradigm, ensuring strong omni-modal performance without degradation in any single modality.
🌟 Efficient Training Infrastructure
Inspired by the concept of modality decoupling, we propose a Modality-Decoupled Parallelism training scheme that significantly enhances the efficiency of large-scale and highly challenging multimodal training.
🌟 Open-Source Contribution
We provide a comprehensive overview of the training methodology and data strategies behind LongCat-Flash-Omni, and release the model to accelerate future research and innovation in omni-modal intelligence.
For more detail, please refer to the comprehensive ***LongCat-Flash-Omni Technical Report***.
Evaluation Results
Omni-modality
| Benchmark | LongCat-Flash-Omni Instruct | Gemini-2.5-Pro (ThinkingBudget128) | Gemini-2.5-Flash (non-thinking) | Qwen3-Omni Instruct | Qwen2.5-Omni Instruct | |-----------|-------------------------------|-----------------------------------|------------------------------|----------------------|-------------------------| | OmniBench | 61.38 | 66.80 | 54.99 | 58.41 | 48.16 | | WorldSense | 60.89 | 63.96 | 58.72 | 52.01 | 46.69 | | DailyOmni | 82.38 | 80.61 | 80.78 | 69.33 | 47.45 | | UNO-Bench | 49.90 | 64.48 | 54.30 | 42.10 | 32.60 |
Vision
Image-to-Text
| Benchmark | LongCat-Flash-Omni Instruct | Gemini-2.5-Pro (ThinkingBudget128) | Gemini-2.5-Flash (non-thinking) | Qwen3-Omni Instruct | Seed-1.6 | GPT-4o-1120 | Qwen3-VL-235B-A22B-Instruct | Qwen2.5-VL-72B-Instruct | |-----------|-------------------------------|-----------------------------------|------------------------------|----------------------|----------|---------------|------------------------------|---------------------------| | General |||||||||| | MMBench-ENtest | 87.5 | 89.8 | 89.3 | 86.8 | 88.5 | 83.7 | 88.3 | 88.6* | | MMBench-ZHtest | 88.7 | 89.2 | 88.5 | 86.4 | 83.8 | 82.8 | 89.8 | 87.9* | | RealWorldQA | 74.8 | 76.0 | 73.9 | 72.9 | 74.5 | 74.1 | 79.3* | 75.7* | | MMStar | 70.9 | 78.5* | 75.5 | 68.5* | 71.5 | 63.2 | 78.4* | 68.2 | | STEM & Reasoning |||||||||| | MathVistamini | 77.9 | 77.7* | 77.1 | 75.9 | 78.7 | 62.8 | 84.9* | 74.8* | | MMMUval | 70.7 | 80.9* | 76.3 | 69.1* | 74.9 | 69.4 | 78.7* | 70.2* | | MMVet | 69.0 | 80.7 | 79.5 | 68.9 | 74.4 | 76.6 | 75.9 | 74.5 | | Multi-Image |||||||||| | BLINK | 63.1 | 70.0* | 65.7 | 56.1 | 65.0 | 65.5 | 70.7* | 60.1 | | MuirBench | 77.1 | 74.0* | 73.7 | 62.1 | 74.6 | 70.5 | 72.8* | 70.7* | | Mantis | 84.8 | 83.9 | 83.4 | 80.7 | 81.1 | 79.3 | 79.7 | 82.0 | | Text Recognition & Chart/Document Understanding |||||||||| | ChartQA | 87.6 | 71.7 | 77.6 | 86.8* | 82.4 | 74.5 | 89.2 | 89.5* | | DocVQA | 91.8 | 94.0* | 93.6* | 95.7 | 94.3 | 80.9 | 94.6 | 96.4* | | OCRBench | 84.9 | 87.2* | 85.6 | 85.5 | 85.6 | 82.3 | 91.2 | 88.5 | | OmniDocBenchEN/ZH↓ | 22.8/29.0 | 31.9/24.5 | 22.8/32.9 | 28.4/40.5 | 22.0/27.6 | 25.9/37.7 | 13.6/17.5 | 22.6/32.4* | | Grounding & Counting |||||||||| | RefCOCO-avg | 92.3 | 75.4 | 71.9 | 89.3 | 80.2 | - | 87.1 | 90.3 | | CountBench | 92.4 | 91.0* | 78.6 | 90.0* | 94.1 | 85.6* | 94.3 | 93.6* | | Graphical User Interface (GUI) |||||||||| | VisualWebBench | 78.7 | 81.1 | 73.5 | 79.3 | 81.1 | 77.1 | 80.8 | 82.3* | | ScreenSpot-v2 | 91.2 | 75.8 | 63.9 | 94.7 | 91.7 | - | 93.4 | 92.9 | | AndroidControllow | 91.2 | 79.2 | 79.1 | 90.5 | 84.6 | 65.2 | 90.0 | 93.7* | | AndroidControlhigh | 75.6 | 60.8 | 55.5 | 70.8 | 55.2 | 41.7 | 74.1 | 67.4* |
Note: Values marked with * are sourced from public reports. As GPT-4o does not support image grounding, we do not report its results on RefCOCO and ScreenSpot-v2
---
Video-to-Text
| Benchmark | LongCat-Flash-Omni Instruct | Gemini-2.5-Pro (ThinkingBudget128) | Gemini-2.5-Flash (non-thinking) | Qwen3-Omni Instruct | Seed-1.6 | GPT-4o-1120 | Qwen3-VL (235B-A22B-Instruct) | Qwen2.5-VL-72B-Instruct | |-----------|-------------------------------|-----------------------------------|------------------------------|----------------------|----------|---------------|------------------------------|---------------------------| | Short Video |||||||||| | MVBench | 75.2 | 66.4 | 63.0 | 69.3* | 68.4 | 62.1 | 71.3 | 70.4* | | NextQA | 86.2 | 84.2 | 81.4 | 82.4 | 84.1 | 79.7 | 81.3 | 82.3 | | TempCompass | 82.2 | 80.8 | 80.2 | 73.5 | 79.4 | 76.4 | 80.5 | 74.8* | | Long Video |||||||||| | VideoMME (w/o audio) | 76.2 | - | - | 70.5* | 75.2 | 73.2 | 79.2* | 73.3* | |…
Excerpt shown — open the source for the full document.
Notability
notability 3.0/10Low traction, routine release