InclusionAI (Ant Group), published May 5, 2025

Ming-Lite-Omni-Preview: A MoE Model Designed to Perceive a Wide Range of Modalities


Introduction

Ming-Lite-Omni-Preview is a MoE model built upon Ling-Lite. It is designed to perceive a wide range of modalities, including text, images, audio, and video, while generating text and natural speech in a streaming manner. To handle these modalities naturally, we enhanced Ling-Lite with a dedicated router for each modality. As a result, Ming-Omni excels at handling information from diverse modalities and is highly scalable.
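The idea of modality-specific routers can be illustrated with a toy sketch. The post does not describe the actual router design, so everything below (the class name `ModalityRouters`, the top-2 softmax gating, the dimensions, and the random weights) is an assumption made for illustration only: each modality gets its own gating matrix, while all routers select from one shared pool of experts.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class ModalityRouters:
    """Hypothetical sketch: one gating matrix per modality, one shared expert pool."""

    def __init__(self, modalities, hidden_dim, num_experts, seed=0):
        rng = random.Random(seed)
        self.num_experts = num_experts
        # weights[modality][expert] is a vector of length hidden_dim.
        self.weights = {
            m: [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
                for _ in range(num_experts)]
            for m in modalities
        }

    def route(self, token, modality, top_k=2):
        """Return [(expert_index, gate_weight), ...] for one token vector."""
        rows = self.weights[modality]
        logits = [sum(w * x for w, x in zip(row, token)) for row in rows]
        probs = softmax(logits)
        top = sorted(range(self.num_experts), key=lambda i: probs[i], reverse=True)[:top_k]
        norm = sum(probs[i] for i in top)
        # Renormalize so the selected experts' gate weights sum to 1.
        return [(i, probs[i] / norm) for i in top]


routers = ModalityRouters(["text", "image", "audio", "video"], hidden_dim=4, num_experts=8)
token = [0.5, -1.0, 0.25, 2.0]
for modality in ("text", "audio"):
    print(modality, routers.route(token, modality))
```

In this sketch the same token vector can be sent to different experts depending on which modality's router scores it, which is the property the paragraph above describes.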

Key Features

Omni and Novel MoE Architecture: An innovative Omni architecture based on Mixture of Experts (MoE) that achieves competitive performance across multiple modality benchmarks.

Video understanding: Supports KV-Cache dynamic compression of visual tokens. This allows the model to understand videos that are hours long while still providing detailed understanding of short clips lasting a few seconds.

Natural Speech Generation and Fine-grained Voice Dialogue: Supports dialect understanding and generation in end-to-end conversations, enables one-shot voice cloning, and enhances prosody through audio tokenizer compression.

Evaluation

Image benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|---|---|---|---|
| AI2D | 83.84 | 83.9 | 84.5 |
| HallusionBench | 54.68 | 51.9 | 51.7 |
| MMBench_TEST_V11 | 79.63 | 84.3 | 82.0 |
| MMMU | 57.0 | 58.6 | 54.8 |
| MMStar | 62.0 | 63.9 | 65.2 |
| MMVet | 73.6 | 67.1 | 68.1 |
| MathVista | 69.0 | 68.2 | 67.9 |
| OCRBench | 87.9 | 86.4 | 88.2 |
| Average | 70.96 | 70.5 | 70.3 |
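The Average row appears to be the unweighted mean of the eight benchmark scores. The short check below recomputes it from the table values; the dictionary layout is just a convenience for this sketch.

```python
# Scores copied from the image benchmark table, in row order (AI2D ... OCRBench).
scores = {
    "Ming-Lite-Omni-Preview": [83.84, 54.68, 79.63, 57.0, 62.0, 73.6, 69.0, 87.9],
    "Qwen2.5-VL-7B-Instruct": [83.9, 51.9, 84.3, 58.6, 63.9, 67.1, 68.2, 86.4],
    "InternVL2.5-8B-MPO": [84.5, 51.7, 82.0, 54.8, 65.2, 68.1, 67.9, 88.2],
}

# Unweighted mean over the eight benchmarks for each model.
averages = {model: sum(vals) / len(vals) for model, vals in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```

The recomputed means agree with the reported 70.96, 70.5, and 70.3 after rounding.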

Object Recognition

| Object Recognition | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B | InternVL-2.5-8B |
|---|---|---|---|
| Plants | 52.1 | 55.3 | 32.8 |
| Animals | 52.6 | 54.8 | 36.5 |
| Home appliances & furniture | 93.5 | 97.4 | 90.9 |
| Personal Electronics | 96.1 | 95.1 | 93.2 |
| Food & Ingredients | 57.5 | 60.0 | 48.7 |
| Tableware | 96.6 | 94.9 | 88.1 |
| Vehicles | 31.9 | 40.9 | 31.9 |
| Average | 68.6 | 71.2 | 60.3 |

Video benchmark

| Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5VL-7B |
|---|---|---|
| VideoMME wo/w sub. | 63.9/67.6 | 65.1/71.6 |
| MVBench | 67.0 | 72.0 |
| Video-MMMU | 45.4 | 47.44 |
| LongVideoBench | 53.7 | 60.0 |

Audio benchmark

SpeechQA

| Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|---|---|---|---|---|---|---|---|
| Qwen2-Audio-chat | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-Lite-Omni-Preview | 4.25 | 3.88 | 58.95 | 46.06 | 60.00 | 46.71 | 96.53 |

ASR

| Model | Aishell-1 | Aishell-2 ios | Wenetspeech test-net | Wenet test-meeting | Librispeech test-clean | Librispeech test-other |
|---|---|---|---|---|---|---|
| Whisper Large-v3 | 5.14 | 4.76 | 9.68 | 18.54 | 1.9 | 3.65 |
| Qwen2-Audio | 1.53 | 3.06 | 7.72 | 8.4 | 1.6 | 3.6 |
| GLM-4-voice Base | 2.46 | - | - | - | 2.82 | 7.66 |
| Baichuan-Omni-1.5 | - | - | 6.9 | 8.4 | - | - |
| Qwen2.5-Omni | 1.18 | 2.36 | 5.9 | 7.7 | 1.8 | 3.4 |
| Ming-Lite-Omni-Preview | 1.62 | 2.82 | 6.23 | 6.9 | 2.34 | 5.74 |
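The post does not name the metric behind the ASR numbers. Results on these test sets are usually reported as an error rate in percent, where lower is better: word error rate (WER) for English sets such as Librispeech and character error rate (CER) for Mandarin sets such as Aishell. Under that assumption, the metric can be sketched as an edit distance divided by the reference length. The function names below are illustrative and are not taken from the Ming codebase.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Error rate in percent; unit='word' gives WER, unit='char' gives CER."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(error_rate("the cat sat on the mat", "the cat sit on mat"))
# One substituted character over 5 reference characters.
print(error_rate("hello", "hallo", unit="char"))
```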

Knowledge

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|---|---|---|---|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-vl-32B | 19.35 | 20.55 | 18.28 |
| Ming-Lite-Omni-Preview | 27.3 | 28.9 | 25.9 |
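The H-mean column is consistent with the harmonic mean of the unseen-question and unseen-entity scores, which is how InfoSeek results are usually summarized. The check below assumes that definition and recomputes the column from the other two; small differences come from rounding in the reported inputs.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two positive scores."""
    return 2 * a * b / (a + b)


# (unseen_question, unseen_entity, reported H-mean) copied from the table.
rows = {
    "PaLI-X": (23.5, 20.8, 22.06),
    "Qwen2.5-vl-32B": (20.55, 18.28, 19.35),
    "Ming-Lite-Omni-Preview": (28.9, 25.9, 27.3),
}

for model, (question, entity, reported) in rows.items():
    computed = harmonic_mean(question, entity)
    print(f"{model}: computed {computed:.2f}, reported {reported}")
```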

OCR&GUI

| Benchmark | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct |
|---|---|---|
| ChartQA_TEST | 85.2 | 87.3 |
| DocVQA_TEST | 93.2 | 95.7 |
| OCRBenchV2_en/zh | 52.2/51.6 | 56.3/57.2 |
| OmniDocBench↓ | 34.7/34.5 | 30.8/39.8 |
| TextVQA_VAL | 82.36 | 84.9 |
| ScreenSpot | 79.3 | 84.7 |

Model Downloads

You can download the model from both Hugging Face and ModelScope.

| Model | Input modality | Output modality | Download |
|---|---|---|---|
| Ming-Lite-Omni-Preview | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace, 🤖 ModelScope |

If you're in mainland China, we strongly recommend downloading our model from 🤖 ModelScope.
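A download can also be scripted. The sketch below is one possible approach, not an official procedure: it assumes the `huggingface_hub` and `modelscope` packages are installed, and it reuses the repository id that appears in the Quickstart code (`inclusionAI/Ming-Lite-Omni`), which may differ from the actual id on either hub. The helper name `download_model` is hypothetical.

```python
def download_model(repo_id="inclusionAI/Ming-Lite-Omni", hub="huggingface", cache_dir=None):
    """Download a model snapshot and return its local path.

    repo_id is an assumption copied from the Quickstart code; check the
    model page on the chosen hub for the exact id.
    """
    if hub == "huggingface":
        # Imported lazily so that only the chosen hub's package is required.
        from huggingface_hub import snapshot_download
        return snapshot_download(repo_id=repo_id, cache_dir=cache_dir)
    if hub == "modelscope":
        from modelscope import snapshot_download
        return snapshot_download(repo_id, cache_dir=cache_dir)
    raise ValueError(f"unknown hub: {hub!r} (expected 'huggingface' or 'modelscope')")


# Example (requires network access), e.g. from mainland China:
# local_path = download_model(hub="modelscope")
```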

Use Cases

Video-Audio-QA

Q: (audio content: "Please describe the video content.")
A: The video features a woman performing a series of yoga poses on a rooftop with a scenic view of mountains and a clear blue sky.

Q: Is there any food in front of me?
A: Yes, there's candy on the table.

Speech2Speech (supports dialect)

Quickstart

Download the model as described in Model Downloads, then use the following code to run Ming-Lite-Omni-Preview.

import os

import torch
from transformers import AutoProcessor
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).to("cuda")

# placeholder: set this to the directory that holds your sample images
assets_path = YOUR_ASSETS_PATH

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)

# qa
# The prompt is in Chinese: "Please describe the living habits of parrots in detail."
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"},
        ],
    },
]

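Output (translated from Chinese):

Parrots are highly intelligent and social birds, and their habits are rich and interesting. Below is a detailed introduction to the living habits of parrots:

### 1. Habitat

Parrots are mainly found in tropical and subtropical regions, including Africa, Asia, Australia, and South America. They typically live in forests, grasslands, deserts, and urban environments. Different species have different habitat requirements, but most parrots prefer places with abundant vegetation and water sources.

### 2. Diet

Parrots are omnivores with a highly varied diet. Their food includes seeds, nuts, fruits, vegetables, nectar, and insects. A parrot's beak is very strong and can easily crack open hard shells and nuts. Some parrots also eat soil or sand to aid digestion and supplement minerals.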

......

# image qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
            {"type": "text", "text": "What kind of flower is this?"},
        ],
    },
]

Output:

The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white.

To enable thinking before the response, add the following system prompt before your question:

cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in ... tags, then the final answer enclosed in ... tags. The critical answer or key result should be placed within \\boxed{}.\n"

Your input message should then look like this:

# The backslash in \frac is doubled so that Python does not read "\f" as a form feed.
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
            {
                "type": "text",
                "text": cot_prompt + (
                    "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ "
                    "the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the "
                    "midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral "
                    "$M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.\n"
                    "Choices:\n"
                    "(A) $\\frac{7}{16}$\n"
                    "(B) $\\frac{3}{16}$\n"
                    "(C) $\\frac{7}{32}$\n"
                    "(D) $\\frac{9}{32}$\n"
                    "(E) $\\frac{1}{5}$"
                ),
            },
        ],
    },
]

Output:

\\nOkay, so I have this problem about a rectangle ABCD…
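Because the system prompt asks the model to place the key result inside `\boxed{}`, the final answer can be pulled out of the generated text with a small parser. The helper below is a sketch, not part of the Ming codebase: `extract_boxed` is a hypothetical name, and the sample string is made up for illustration rather than taken from the model's output.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent.

    Braces are matched by depth so nested LaTeX such as \\frac{a}{b} survives.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: the box was never closed


# Illustrative string only, not actual model output.
sample = r"Therefore the ratio is \boxed{\frac{1}{2}}."
print(extract_boxed(sample))
```

Taking the last occurrence is a design choice: intermediate reasoning may mention `\boxed{}` before the final answer is stated.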
