Repo · InclusionAI (Ant Group) · published Apr 21, 2025

inclusionAI/Ming

Description: Ming - facilitating advanced multimodal understanding and generation capabilities built upon the Ling LLM.

Language: Jupyter Notebook

License: MIT

Stars: 656

Forks: 58

Open issues: 25

Created: 2025-04-21T07:39:03Z

Pushed: 2026-03-17T11:55:16Z

Default branch: main

Fork: no

Archived: no

README:

Ming-flash-omni 2.0

📑 Technical Report | 🤗 Hugging Face | 🤖 ModelScope

Introduction

The newly released Ming-flash-omni 2.0 is built on the Ling-2.0 architecture, a Mixture-of-Experts (MoE) framework with 100B total and 6B active parameters. A generational advance over its predecessor, it sets new state-of-the-art (SOTA) results among open-source omni-MLLMs. Ming-flash-omni 2.0 combines strong foundational abilities with specialized domain expertise, and performs particularly well in visual encyclopedic knowledge, immersive speech synthesis, and high-dynamic image generation and manipulation.
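
To put the parameter counts in perspective, here is a rough back-of-the-envelope sketch of the weight memory they imply. The 2-bytes-per-parameter figure assumes bf16/fp16 weights, which the text above does not state, and the helper name is illustrative. Activations, KV cache, and runtime overhead are ignored.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 100e9   # total parameters, from the introduction
ACTIVE_PARAMS = 6e9    # parameters active per token, from the introduction

# In a typical MoE deployment all experts stay loaded, even though only a
# subset is used for any given token (assumption: no expert offloading).
print(f"weights to store:    ~{weight_memory_gb(TOTAL_PARAMS):.0f} GB at 2 bytes/param")
print(f"weights touched/tok: ~{weight_memory_gb(ACTIVE_PARAMS):.0f} GB at 2 bytes/param")
print(f"active fraction:     {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
```

Under these assumptions the stored weights are larger than most single GPUs can hold, which is consistent with the usage example sharding layers across several devices.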

📌 Updates

  • [2026.02.11] 🔥 We release the official version of Ming-flash-omni 2.0, an open-source SOTA omni-MLLM that pushes the boundaries of multimodal understanding and synthesis.
  • [2025.10.27] 🔥 We release the preview version of Ming-flash-omni: Ming-flash-omni Preview.
  • [2025.07.15] 🔥 We release Ming-lite-omni v1.5 with significant improvements across all modalities.
  • [2025.06.12] 🔥 Our Technical Report is now public on arXiv.
  • [2025.05.28] 🔥 The official version of Ming-lite-omni v1 is released, with better performance and image generation support.
  • [2025.05.04] 🔥 We release the test version of Ming-lite-omni: Ming-lite-omni-Preview.

Key Features

Compared to Ming-flash-omni Preview, Ming-flash-omni 2.0 focuses on optimizing capabilities across the following key domains:

  • Expert-level Multimodal Cognition: It accurately identifies plants and animals, recognizes cultural references (from regional cuisines to global landmarks), and delivers expert-level analysis of artifacts, including era, form, and craftsmanship. By combining high-resolution visual capture with a vast knowledge graph, the model achieves "vision-to-knowledge" synthesis, enabling superior knowledge understanding.
  • Immersive and Controllable Unified Acoustic Synthesis: Ming-flash-omni 2.0 introduces a unified end-to-end acoustic generation pipeline that integrates Speech, Audio, and Music within a single channel. Leveraging Continuous Autoregression coupled with a Diffusion Transformer (DiT) head, the model enables zero-shot voice cloning and nuanced attribute control (e.g., emotion, timbre, and ambient atmosphere). This architecture facilitates a transition from simple text-to-speech to highly expressive, emotionally resonant, and immersive auditory experiences.
  • High-Dynamic Controllable Image Generation and Manipulation: Ming-flash-omni 2.0 features a native multi-task architecture that unifies segmentation, generation, and editing, allowing for sophisticated spatiotemporal semantic decoupling. It excels in high-dynamic content creation, including atmospheric reconstruction, seamless scene composition, and context-aware object removal. By maintaining texture coherence and spatial depth consistency, Ming-flash-omni 2.0 achieves state-of-the-art precision in complex image manipulation tasks.

Use Cases

Enhanced Multimodal Cognition & Free Modality Switching

Streaming Video Conversation

Controllable Audio Generation

Audio Context ASR & Dialect ASR

Image Generation & Editing

Controllable Image Generation

Model Downloads

You can download our latest model from both Hugging Face and ModelScope. For previous versions such as Ming-flash-omni-Preview, please refer to this link.

If you're in mainland China, we strongly recommend downloading our model from 🤖 ModelScope.

pip install modelscope
modelscope download --model inclusionAI/Ming-flash-omni-2.0 --local_dir inclusionAI/Ming-flash-omni-2.0 --revision master

Note: This download process will take several minutes to several hours, depending on your network conditions.
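
For scripted setups, the same download can be driven from Python. The sketch below only wraps the `modelscope download` command shown above with `subprocess`; the function names and the `dry_run` switch are illustrative and not part of ModelScope.

```python
import shlex
import subprocess

MODEL_ID = "inclusionAI/Ming-flash-omni-2.0"


def build_download_command(model_id=MODEL_ID, local_dir=None, revision="master"):
    """Build the argument list for the CLI call shown in this section."""
    local_dir = local_dir or model_id  # the README uses the model id as the directory
    return [
        "modelscope", "download",
        "--model", model_id,
        "--local_dir", local_dir,
        "--revision", revision,
    ]


def download(dry_run=True, **kwargs):
    """Print the command, and run it unless dry_run is set."""
    cmd = build_download_command(**kwargs)
    print(shlex.join(cmd))
    if not dry_run:
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    download(dry_run=True)  # switch to dry_run=False to actually download
```

Keeping `dry_run=True` as the default makes it easy to inspect the command before committing to a download that may run for hours.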

Environment Preparation

Installation with pip

pip install -r requirements.txt
pip install nvidia-cublas-cu12==12.4.5.8 # for H20 GPU
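
After installing, it can help to confirm that the pinned cuBLAS wheel is the one actually present in the environment. This is a minimal sketch using only the standard library; the helper name is hypothetical, and the expected version is simply the pin from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


EXPECTED_CUBLAS = "12.4.5.8"  # pin from the install command above (H20 GPUs)

found = installed_version("nvidia-cublas-cu12")
if found is None:
    print("nvidia-cublas-cu12 is not installed")
elif found != EXPECTED_CUBLAS:
    print(f"nvidia-cublas-cu12 {found} is installed, expected {EXPECTED_CUBLAS}")
else:
    print(f"nvidia-cublas-cu12 {found} OK")
```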

Example Usage

We provide a step-by-step running example:

Step 1 - Download the source code

git clone https://github.com/inclusionAI/Ming.git
cd Ming

Step 2 - Download the model weights and create a soft link to them in the source code directory

Download our model following [Model Downloads](#model-downloads)

mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-flash-omni-2.0 inclusionAI/Ming-flash-omni-2.0
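
The `mkdir` and `ln -s` steps above can also be done from Python, which makes it easier to fail early when the weights path is wrong. A minimal sketch, assuming it is run from the repository root; `link_weights` and its arguments are hypothetical names, not part of this repo.

```python
from pathlib import Path


def link_weights(weights_dir, repo_root=".", name="inclusionAI/Ming-flash-omni-2.0"):
    """Create repo_root/name as a symlink to weights_dir and return the link path."""
    target = Path(weights_dir).resolve()
    if not target.is_dir():
        raise FileNotFoundError(f"model directory not found: {target}")

    link = Path(repo_root) / name
    link.parent.mkdir(parents=True, exist_ok=True)  # equivalent of `mkdir inclusionAI`

    if link.is_symlink():
        if link.resolve() == target:
            return link      # already points at the right place
        link.unlink()        # stale or broken link: replace it
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")

    link.symlink_to(target, target_is_directory=True)  # equivalent of `ln -s`
    return link


# Example (replace the placeholder path with your download location):
# link_weights("/path/to/inclusionAI/Ming-flash-omni-2.0")
```

Calling it a second time with the same target is a no-op, so it is safe to keep in a setup script.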

Step 3 - Enter the code directory and run the Ming-flash-omni model using the following code.

jupyter notebook cookbook.ipynb

We also provide a simple example of how to use this repo. For detailed usage, please refer to cookbook.ipynb.

import os
import torch
import warnings
from bisect import bisect_left
warnings.filterwarnings("ignore")

from transformers import AutoProcessor
from modeling_bailingmm2 import BailingMM2NativeForConditionalGeneration

def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 32
    layer_per_gpu = num_layers // world_size
    # Cumulative layer boundaries, one entry per GPU
    layer_per_gpu = [i * layer_per_gpu for i in range(1, world_size + 1)]
    # Assign each decoder layer to a GPU index based on those boundaries
    for i in range(num_layers):
        device_map[f'model.model.layers.{i}'] = bisect_left(layer_per_gpu, i)
    # Keep the vision/audio modules, projections, embeddings,
    # final norm, and LM head on GPU 0
    device_map['vision'] = 0
    device_map['audio'] = 0
    device_map['linear_proj'] = 0
    device_map['linear_proj_audio'] = 0
    device_map['model.model.word_embeddings.weight'] = 0
    device_map['model.model.norm.weight'] = 0
    device_map['model.lm_head.weight'] = 0…

Excerpt shown — open the source for the full document.
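
To see how the `split_model` excerpt above distributes layers, the sketch below reproduces only its layer-assignment arithmetic without requiring `torch` or any GPUs. Passing `world_size` as a parameter is a change made here for illustration; the original reads it from `torch.cuda.device_count()`.

```python
from bisect import bisect_left
from collections import Counter


def layer_to_device(num_layers: int, world_size: int) -> dict:
    """Map each layer index to a device index using the excerpt's arithmetic."""
    per_gpu = num_layers // world_size
    boundaries = [i * per_gpu for i in range(1, world_size + 1)]
    return {layer: bisect_left(boundaries, layer) for layer in range(num_layers)}


for gpus in (4, 3):
    mapping = layer_to_device(32, gpus)
    # Count how many layers land on each device index.
    print(gpus, "GPUs ->", dict(Counter(mapping.values())))
# 4 GPUs -> {0: 9, 1: 8, 2: 8, 3: 7}
# 3 GPUs -> {0: 11, 1: 10, 2: 10, 3: 1}
```

With 4 GPUs, each boundary layer (8, 16, 24) lands on the lower-numbered device, so the split is 9/8/8/7 rather than an even 8/8/8/8. With 3 GPUs, layer 31 maps to device index 3, which does not exist on a 3-GPU machine. The excerpt is truncated, so the full source may handle this case; if it does not, a GPU count that divides 32 evenly is the safer choice.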
