InclusionAI (Ant Group), published Jun 11, 2025

Ming-Omni: A Unified Multimodal Model for Perception and Generation

GitHub | 📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-lite-omni is a light version of Ming-omni. It is derived from Ling-lite and has 2.8 billion activated parameters. Ming-lite-omni is a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. It employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, supporting diverse tasks without separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-lite-omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved by integrating an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-lite-omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-lite-omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
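
To make the idea of modality-specific routers more concrete, the following is a minimal sketch of one way such routing could look: each modality gets its own gating matrix, and a token is sent to the top-k experts chosen by the gate for its modality. This post does not describe the router internals, so everything here is an assumption for illustration only: the names `ROUTERS` and `route_token`, the expert count, the top-k value, the embedding size, and the random weights are all hypothetical and are not taken from Ming-lite-omni.

```python
import math
import random

NUM_EXPERTS = 4  # hypothetical; the real expert count is not given in this post
TOP_K = 2        # hypothetical number of experts activated per token
HIDDEN = 8       # hypothetical token embedding size

random.seed(0)


def make_router():
    """Build one gating matrix (NUM_EXPERTS x HIDDEN) with random illustrative weights."""
    return [[random.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(NUM_EXPERTS)]


# One router per modality: tokens from different encoders are scored by different gates.
ROUTERS = {m: make_router() for m in ("text", "image", "audio", "video")}


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, modality):
    """Return [(expert_index, weight), ...] for the TOP_K experts picked by this modality's router."""
    gate = ROUTERS[modality]
    # One logit per expert: dot product of the expert's gate row with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate]
    top = sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
    # Renormalise over the selected experts only, so the weights sum to 1.
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))


token = [random.gauss(0.0, 1.0) for _ in range(HIDDEN)]
for modality in ROUTERS:
    print(modality, route_token(token, modality))
```

The same token vector can be routed to different experts depending on its modality, which is the property the sketch is meant to show. How the actual model trains and balances its routers is covered in the technical report, not here.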

📌 Updates

[2025.06.12] 🔥 Our Technical Report is now public on arXiv.

[2025.05.28] 🔥 The official version of Ming-lite-omni is released, with better performance and image generation support.

[2025.05.04] 🔥 We release the test version of Ming-lite-omni: Ming-lite-omni-Preview.

Key Features

Unified Omni-Modality Perception: Ming-lite-omni is built on Ling, an MoE-architecture LLM. It resolves task conflicts and ensures coherent integration of tokens from different modalities through modality-specific routers.

Unified Perception and Generation: Ming-lite-omni unifies understanding and generation, so the model can interpret multimodal instructions and user intent during generation. This improves generation quality and usability across multiple tasks.

Innovative Generation Capabilities: Ming-lite-omni can perceive all modalities and generate high-quality text, real-time speech, and vivid images simultaneously, delivering exceptional cross-modal performance across diverse tasks including image perception, audio-visual interaction, and image generation.

Evaluation

Ming-lite-omni delivers exceptional cross-modal performance, as validated across image perception, audio-visual interaction, and image generation tasks. In image perception, Ming-lite-omni attains performance comparable to that of Qwen2.5-VL-7B while activating only 2.8B parameters. It delivers superior performance in end-to-end speech understanding and instruction following, surpassing Qwen2.5-Omni and Kimi-Audio. It also supports native-resolution image generation, editing, and style transfer, achieving a GenEval score of 0.64 and outperforming mainstream models such as SDXL. In terms of FID, Ming-lite-omni reaches 4.85, setting a new SOTA among existing methods.

Image benchmark

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|---|---|---|---|
| AI2D | 83.1 | 84.4 | 84.5 |
| HallusionBench | 55.0 | 55.8 | 51.7 |
| MMBench_TEST_V11 | 80.8 | 82.8 | 82.0 |
| MMMU | 56.3 | 56.6 | 54.8 |
| MMStar | 64.7 | 65.3 | 65.2 |
| MMVet | 71.3 | 71.6 | 68.1 |
| MathVista | 71.6 | 68.1 | 67.9 |
| OCRBench | 88.4 | 87.8 | 88.2 |
| Average | 71.4 | 71.5 | 70.3 |
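
As a quick sanity check on the image benchmark table, the Average row can be recomputed as the plain arithmetic mean of the eight benchmark scores. The sketch below assumes an unweighted mean, which the post does not state explicitly; the scores are copied from the table, and the variable names are only illustrative.

```python
# Scores in table order: AI2D, HallusionBench, MMBench_TEST_V11, MMMU,
# MMStar, MMVet, MathVista, OCRBench.
SCORES = {
    "Ming-lite-omni": [83.1, 55.0, 80.8, 56.3, 64.7, 71.3, 71.6, 88.4],
    "Qwen2.5-VL-7B-Instruct": [84.4, 55.8, 82.8, 56.6, 65.3, 71.6, 68.1, 87.8],
    "InternVL2.5-8B-MPO": [84.5, 51.7, 82.0, 54.8, 65.2, 68.1, 67.9, 88.2],
}

# Average row as reported in the table.
REPORTED = {
    "Ming-lite-omni": 71.4,
    "Qwen2.5-VL-7B-Instruct": 71.5,
    "InternVL2.5-8B-MPO": 70.3,
}


def mean(values):
    """Unweighted arithmetic mean (an assumption about how Average was computed)."""
    return sum(values) / len(values)


for model, scores in SCORES.items():
    print(f"{model}: computed {mean(scores):.2f}, reported {REPORTED[model]}")
```

Under that assumption the recomputed means agree with the reported Average row to within rounding.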

Encyclopedia Benchmarks

| Object Recognition | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| Plants | 54.96 | 47.8 |
| Animals | 56.7 | 50.85 |
| Vehicles | 41.91 | 42.29 |
| Food & Ingredients | 62.28 | 54.09 |
| Dishes | 44.3 | 39.07 |
| General | 91.08 | 92.42 |
| Average | 58.54 | 54.43 |

Video benchmark

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| VideoMME | 67.0 | 67.3 |
| MVBench | 67.7 | 67.4 |
| Video-MMMU | 46.3 | 47.4 |
| LongVideoBench | 56.6 | 54.7 |
| Average | 59.4 | 59.2 |

Note: All models are evaluated based on 128 uniformly sampled frames.
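
For readers unfamiliar with uniform frame sampling, here is a minimal sketch of how 128 frame indices might be picked evenly across a video. The post only says that frames are "uniformly sampled"; the exact scheme is not given, so taking the centre of each equal-length segment is an assumption, and `uniform_frame_indices` is a hypothetical helper rather than part of the Ming-lite-omni code.

```python
def uniform_frame_indices(total_frames, num_samples=128):
    """Pick up to num_samples frame indices spread evenly over [0, total_frames)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if total_frames <= num_samples:
        # Short clip: there is nothing to subsample, so keep every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    # Split the video into num_samples equal segments and take each segment's centre frame.
    return [int(step * (i + 0.5)) for i in range(num_samples)]


indices = uniform_frame_indices(3000)
print(len(indices), indices[:4], indices[-1])
```

Sampling segment centres avoids always picking the very first and very last frame, which are often title cards or fades. Sampling segment starts or inclusive endpoints would be equally valid readings of "uniform".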

Audio benchmark

SpeechQA

| Model | Average | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio-chat | 3.545 | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 3.695 | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-lite-omni | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |
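
When reading the SpeechQA table, note that the Average column appears to cover only the first two score columns (AlpacaEval and CommonEval), not all seven benchmarks. This is an inference from the numbers, not something the post states. The sketch below recomputes that two-column mean for each model; the values are copied from the table and the names are illustrative.

```python
# model: (reported Average, AlpacaEval, CommonEval), copied from the SpeechQA table.
SPEECHQA = {
    "Qwen2-Audio-chat": (3.545, 3.69, 3.40),
    "Baichuan-Audio": (3.695, 4.00, 3.39),
    "GLM-4-Voice": (3.77, 4.06, 3.48),
    "Kimi-Audio": (4.215, 4.46, 3.97),
    "Qwen2.5-Omni": (4.21, 4.49, 3.93),
    "Ming-lite-omni": (4.34, 4.63, 4.06),
}


def two_column_mean(alpaca_eval, common_eval):
    """Mean of AlpacaEval and CommonEval (assumed definition of the Average column)."""
    return (alpaca_eval + common_eval) / 2


for model, (reported, alpaca_eval, common_eval) in SPEECHQA.items():
    computed = two_column_mean(alpaca_eval, common_eval)
    print(f"{model}: computed {computed:.3f}, reported {reported}")
```

If this reading is right, the remaining columns (SD-QA, MMSU, OpenBookQA, IFEval, AdvBench) should be compared column by column rather than through the Average.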

ASR

| Model | aishell1 | aishell2_android | aishell2_ios | cv15_zh | fleurs_zh | wenetspeech_meeting | wenetspeech_net | librispeech_test_clean | librispeech_test_other | multilingual_librispeech | cv15_en | fleurs_en | voxpopuli_v1.0_en |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ming-lite-omni | 1.47 | 2.55 | 2.52 | 6.31 | 2.96 | 5.95 | 5.46 | 1.44 | 2.80 | 4.15 | 6.89 | 3.39 | 5.80 |
| Qwen2.5-Omni | 1.18 | 2.75 | 2.63 | 5.20 | 3.00 | 5.90 | 7.70 | 1.80 | 3.40 | 7.56 | 7.60 | 4.10 | 5.80 |
| Qwen2-Audio | 1.53 | 2.92 | 2.92 | 6.90 | 7.50 | 7.16 | 8.42 | 1.60 | 3.60 | 5.40 | 8.60 | 6.90 | 6.84 |
| Kimi-Audio | 0.60 | 2.64 | 2.56 | 7.21 | 2.69 | 6.28 | 5.37 | 1.28 | 2.42 | 5.88 | 10.31 | 4.44 | 7.97 |

Information-Seeking Benchmark

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|---|---|---|---|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 |
| Ming-lite-omni | 27.7 | 30.4 | 25.4 |
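
The InfoSeek_H-mean column is consistent with a harmonic mean of the two split scores, H = 2ab / (a + b), where a is the unseen-question score and b is the unseen-entity score. The post does not spell this out, so treat the formula as an assumption checked against the table. The sketch below recomputes it for the rows that report both splits.

```python
# model: (unseen_question, unseen_entity, reported H-mean), copied from the table.
# GPT-4o is omitted because its split scores are not reported.
INFOSEEK = {
    "PaLI-X": (23.5, 20.8, 22.06),
    "Qwen2.5-vl-32B": (20.55, 18.28, 19.35),
    "Ming-lite-omni": (30.4, 25.4, 27.7),
}


def harmonic_mean(a, b):
    """Harmonic mean of two positive scores; it is pulled toward the smaller of the two."""
    return 2 * a * b / (a + b)


for model, (unseen_question, unseen_entity, reported) in INFOSEEK.items():
    computed = harmonic_mean(unseen_question, unseen_entity)
    print(f"{model}: computed {computed:.2f}, reported {reported}")
```

Because a harmonic mean penalises imbalance between the two splits, a model cannot score well on H-mean by doing well on only one of them.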

OCR

| Benchmark | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| ChartQA_TEST | 85.1 | 87.3 |
| DocVQA_TEST | 93 | 95.7 |
| OCRBenchV2_en/zh | 53.3/52 | 56.3/57.2 |
| OmniDocBench↓ | 34/34.4 | 30.8/39.8 |
| TextVQA_VAL | 82.8 | 84.9 |

GUI

| Benchmark | Ming-lite-omni | InternVL3 8B | Qwen2.5-VL-7B-Instruct |
|---|---|---|---|
| ScreenSpot | 82.1 | 79.5 | 78.9* |
| ScreenSpot-V2 | 84.1 | 81.4 | - |
| AITZ(EM) | 66.6 | - | 57.6* |

Note: * denotes the reproduced results.

Unified Generation Benchmark

| Model | single_object | two_object | counting | colors | position | color_attr | GENEVAL | DPGBench | FID↓ |
|---|---|---|---|---|---|---|---|---|---|
| Ming-lite-omni | 0.9875 | 0.7727 | 0.6812 | 0.7872 | 0.31 | 0.29 | 0.64 | 81.72 | 4.85 |
| Metaquery-XL | - | - | - | - | - | - | 0.61 | 82.05 | 6.02 |
| SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | 68.09 | 26.96 |
| Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 | - |
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 | 8.76 |
| Janus | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 | 79.68 | 10.10 |
| JanusFlow | - | - | - | - | - | - | 0.63 | 80.09 | 9.51 |
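
The GENEVAL column in this table can be cross-checked against the six sub-task columns: it is consistent with an unweighted mean of single_object, two_object, counting, colors, position, and color_attr. The sketch below recomputes it for the rows that report sub-scores. The sub-scores are rounded in the table, so small differences are expected, and the unweighted mean is an assumption rather than something the post states.

```python
# model: ([single_object, two_object, counting, colors, position, color_attr], reported GENEVAL)
# Rows without sub-scores (Metaquery-XL, JanusFlow) are omitted.
GENEVAL = {
    "Ming-lite-omni": ([0.9875, 0.7727, 0.6812, 0.7872, 0.31, 0.29], 0.64),
    "SDv2.1": ([0.98, 0.51, 0.44, 0.85, 0.07, 0.17], 0.50),
    "Emu3-Gen": ([0.98, 0.71, 0.34, 0.81, 0.17, 0.21], 0.54),
    "SDXL": ([0.98, 0.74, 0.39, 0.85, 0.15, 0.23], 0.55),
    "Janus": ([0.97, 0.68, 0.30, 0.84, 0.46, 0.42], 0.61),
}


def overall_score(sub_scores):
    """Unweighted mean of the six sub-task scores (assumed definition of the overall score)."""
    return sum(sub_scores) / len(sub_scores)


for model, (sub_scores, reported) in GENEVAL.items():
    print(f"{model}: computed {overall_score(sub_scores):.3f}, reported {reported}")
```

Seen this way, Ming-lite-omni's overall lead comes mostly from counting (0.6812), while Janus remains ahead on position and color_attr.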

Please refer to our technical report for more comprehensive evaluation results.

Model Downloads

You can download the model from both Hugging Face and ModelScope.

Model Input…

