Repo · InclusionAI (Ant Group) · published Sep 30, 2025

inclusionAI/Ming-UniVision

Python


inclusionAI/Ming-UniVision

Description: Code release for Ming-UniVision: Joint Image Understanding and Generation with a Continuous Unified Tokenizer

Language: Python

License: MIT

Stars: 143

Forks: 5

Open issues: 4

Created: 2025-09-30T11:36:01Z

Pushed: 2025-10-14T13:38:52Z

Default branch: main

Fork: no

Archived: no

README:

Ming-UniVision: Joint Image Understanding and Generation with a Continuous Unified Tokenizer

📄 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

🌍 Introduction

🌐 Ming-UniVision is a multimodal large language model (MLLM) that unifies vision understanding, generation, and editing within a single autoregressive next-token prediction (NTP) framework. It is powered by MingTok, the first continuous, unified visual tokenizer. By eliminating discrete quantization and using a shared continuous latent space, Ming-UniVision supports end-to-end multimodal reasoning across diverse tasks. Because it is trained on high-fidelity continuous visual representations, it can carry out multi-round, in-context vision-language interactions (iterative question answering, image generation, and semantic editing) without decoding intermediate states into pixels. Each round therefore operates on the same latent features, which keeps multi-turn dialogue efficient and coherent.

  • 🌐 First NTP MLLM with Continuous Unified Vision Representations:

Ming-UniVision unifies vision and language via next-token prediction over continuous visual tokens, with no discrete quantization and a fully autoregressive generative paradigm. Understanding and generation share a single continuous latent space, which preserves both semantic and perceptual quality.

  • 3.5× Faster Training Convergence:

Shared representation reduces conflict between tasks, enabling faster, more stable joint training.

  • 🔄 Multi-Round In-Context Vision Tasks:

Perform iterative reasoning, generation, and editing in one latent space — no image decoding needed mid-process.

  • 🔗 Single Space, Unified Workflow:

All modalities and tasks share one coherent feature space — simpler training, efficient inference, true autoregressive fusion.
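To make the "no discrete quantization" point above concrete, here is a minimal NumPy sketch of the difference between a vector-quantized token path and a continuous one. It is an illustration only, not the MingTok implementation: the sizes, the random codebook, and the `quantize` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, not the real MingTok configuration.
num_tokens, dim, codebook_size = 256, 32, 1024

latents = rng.normal(size=(num_tokens, dim))      # continuous encoder outputs
codebook = rng.normal(size=(codebook_size, dim))  # stand-in for a learned VQ codebook


def quantize(z, codes):
    """Snap each latent vector to its nearest codebook entry (L2 distance)."""
    # Squared distances via ||z||^2 - 2 z.c + ||c||^2, shape (num_tokens, codebook_size).
    d2 = (z ** 2).sum(1, keepdims=True) - 2.0 * z @ codes.T + (codes ** 2).sum(1)
    ids = d2.argmin(axis=1)
    return ids, codes[ids]


ids, quantized = quantize(latents, codebook)

# Discrete path: the language model sees integer ids, so whatever was lost
# when snapping to the codebook cannot be recovered downstream.
quant_error = float(np.mean((latents - quantized) ** 2))

# Continuous path: the latent vectors themselves are the tokens, so there is
# no snapping step and no quantization error.
cont_error = float(np.mean((latents - latents) ** 2))

print(f"discrete ids {ids.shape}, mean squared quantization error {quant_error:.3f}")
print(f"continuous tokens {latents.shape}, quantization error {cont_error:.3f}")
```

A continuous tokenizer trades the convenience of a finite vocabulary (and a softmax over it) for tokens that carry the encoder's full output, which is the property the bullets above rely on.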

📌 Updates

  • [2025.10.09] 📄 Technical Report Released!

The full technical report is now available on arXiv: 👉 arXiv:2510.06590 | Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer.

Dive into the architecture, unified continuous tokenizer, and end-to-end autoregressive framework that power our system.

  • [2025.10.02] 🔥 We’re live!

We’re thrilled to announce the release of Ming-UniVision and MingTok-Vision — the first joint autoregressive vision-language system with unified continuous visual tokenization!

✨ Enable seamless multimodal reasoning, generation, and editing in a single latent space.

🚀 Faster training, richer semantics, and true end-to-end autoregression: no quantization, no compromises.

👉 Check out our blog post to learn how we’re redefining unified vision-language intelligence.

📊 Evaluation

MingTok-Vision achieves strong image reconstruction capability and Ming-UniVision enables unified multimodal understanding and generation within a single continuous latent space.

Image Reconstruction

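MingTok-Vision achieves competitive reconstruction quality: at 512 resolution with 256 tokens it reaches a PSNR of 30.77 and an rFID of 0.54 (31.09 and 0.38 with the semantic decoder after joint pre-training), the highest PSNR among the tokenizers listed in Table 1. This demonstrates its ability to preserve both perceptual fidelity and semantic structure in a continuous representation.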
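For readers unfamiliar with the metric, PSNR is defined as 10 · log10(MAX² / MSE), where MSE is the mean squared pixel error between the original and reconstructed image. The sketch below computes it with NumPy. It assumes 8-bit images (MAX = 255); the pixel range and evaluation script used for Table 1 are not stated here, so treat this as a reference definition rather than the repository's evaluation code.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy example: a random 8-bit image and a lightly perturbed copy of it.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float64)
noisy = np.clip(image + rng.normal(scale=5.0, size=image.shape), 0, 255)

print(f"PSNR of noisy copy: {psnr(image, noisy):.2f} dB")
```

Higher is better: every 10 dB corresponds to a tenfold reduction in mean squared error.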

Table 1. Image reconstruction performance on ImageNet-val-50k.

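| Tokenizer | Res. | # Tokens | rFID ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|---|
| Specialized tokenizers | | | | | | |
| SD-VAE | 256 | 1024 | 1.06 | 28.62 | 0.86 | - |
| GigaTok | 256 | 256 | 0.51 | 21.32 | 0.69 | 0.21 |
| VA-VAE | 256 | 256 | 0.26 | 28.59 | 0.80 | 0.09 |
| HieraTok | 256 | 256 | 1.04 | 23.90 | 0.72 | 0.09 |
| DC-AE | 512 | 64 | 0.22 | 26.15 | 0.71 | 0.08 |
| MAE-Tok | 512 | 128 | 0.62 | - | - | - |
| TexTok | 512 | 256 | 0.73 | 24.45 | 0.66 | 0.19 |
| Unified tokenizers | | | | | | |
| UniTok | 256 | 256 | 0.38 | - | - | - |
| TokenFlow | 384 | 729 | 0.63 | 22.77 | 0.73 | - |
| MingTok-Vision | 512 | 256 | 0.54 | 30.77 | 0.62 | 0.14 |
| MingTok-Vision † | 512 | 256 | 0.38 | 31.09 | 0.64 | 0.12 |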

† denotes using semantic decoder after joint pre-training.
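The resolution and token-count columns of Table 1 together imply how aggressively each tokenizer compresses an image. The short sketch below works out pixels per token for each row. It assumes square inputs of Res. × Res. pixels, which the table does not state explicitly, and the `pixels_per_token` helper is introduced here purely for illustration.

```python
# (resolution, number of tokens) copied from Table 1.
table1 = {
    "SD-VAE": (256, 1024),
    "GigaTok": (256, 256),
    "VA-VAE": (256, 256),
    "HieraTok": (256, 256),
    "DC-AE": (512, 64),
    "MAE-Tok": (512, 128),
    "TexTok": (512, 256),
    "UniTok": (256, 256),
    "TokenFlow": (384, 729),
    "MingTok-Vision": (512, 256),
}


def pixels_per_token(resolution, num_tokens):
    """Spatial positions represented by one token, assuming a square image."""
    return resolution * resolution / num_tokens


for name, (res, tokens) in table1.items():
    print(f"{name:<15} {res:>4}px {tokens:>5} tokens  "
          f"{pixels_per_token(res, tokens):8.1f} pixels/token")
```

Under this assumption MingTok-Vision represents 1024 pixel positions per token, four times the 256 of the 256-resolution, 256-token tokenizers, which is useful context when comparing reconstruction metrics across rows.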

Visual Understanding

Ming-UniVision achieves competitive performance on multimodal understanding benchmarks, showing that continuous latent tokens can effectively support high-level vision-language reasoning without discrete quantization.

Table 2. Quantitative evaluations on MMBench, MMStar, MMMU, MathVista, HallusionBench, AI2D, MM-Vet, OCRBench, and MME.

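| Model | MMB ↑ | MMS ↑ | MMMU ↑ | MathV ↑ | Hall ↑ | AI2D ↑ | MM-Vet ↑ | OCRBench ↑ | MME ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Understanding Only | | | | | | | | | |
| Emu3-Chat | 58.5 | - | 31.6 | - | - | - | 37.2 | 687 | - |
| Qwen2.5-VL-3B | 79.1 | 55.9 | 53.1 | 62.3 | 46.3 | 81.6 | - | 797 | 2157 |
| Qwen2.5-VL-7B | 83.5 | 63.9 | 58.6 | 68.2 | 52.9 | 83.9 | 67.1 | 864 | 2347 |
| InternVL2.5-4B | 81.1 | 58.3 | 52.3 | 60.5 | 46.3 | 81.4 | 60.6 | 828 | 2338 |
| InternVL2.5-8B | 84.6 | 62.8 | 56.0 | 64.4 | 50.1 | 84.5 | 62.8 | 822 | 2344 |
| DeepSeek-VL2 | 79.6 | 61.3 | 51.1 | 62.8 | - | 81.4 | - | 811 | 2253 |
| Unified model, Separate representation | | | | | | | | | |
| Janus-Pro-7B | 79.2 | - | 41.0 | - | - | - | 50.0 | - | - |
| LMFusion | - | - | 41.7 | - | - | - | - | - | 1603 |
| MetaQuery-L | 78.6 | - | 53.1 | - | - | - | 63.2 | - | - |
| Show-o2-7B | 79.3 | 56.6 | 48.9 | - | - | 78.6 | - | - | - |
| BLIP3-o 4B | 78.6 | - | 46.6 | - | - | - | 60.1 | - | 2161 |
| BAGEL | 85.0 | - | 55.3 | 73.1 | - | - | 67.2 | - | 2388 |
| Unified model, Unified representation | | | | | | | | | |
| VILA-U | - | - | - | - | - | - | 33.5 | - | 1402 |
| TokenFlow-XL | 76.8 | - | 43.2 | - | - | - | 48.2 | - | 1922 |
| UniTok | - | - | - | - | - | - | 33.9 | - | 1448 |
| Harmon-1.5B | 65.5 | - | 38.9 | - | - | - | - | - | 1476 |
| TokLIP | 67.6 | - | 43.1 | - | - | - | 29.8 | - | - |
| Ming-UniVision-16B-A3B (Ours) | 78.5 | 63.7 | 40.3 | 66.6 | 47.8 | 82.8 | 64.2 | 724 | 2023 |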

Visual Generation

Ming-UniVision achieves top performance among unified representation models in text-to-image generation, demonstrating superior object composition and spatial reasoning capabilities.

Table 3. Evaluation of text-to-image generation ability on GenEval and DPG-Bench. † denotes performance obtained by rewritten prompts.

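| Method | Single Obj. ↑ | Two Obj. ↑ | Counting ↑ | Colors ↑ | Position ↑ | Color Attri. ↑ | Overall ↑ | DPG-Bench ↑ |
|---|---|---|---|---|---|---|---|---|
| Generation Only | | | | | | | | |
| LlamaGen | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 | - |
| PixArt-α | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 | - |
| SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | - |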

DALL-E 2 0.94 0.66 0.49…
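As a reading aid for Table 3, the Overall column appears to be the unweighted mean of the six GenEval category scores. The sketch below checks that reading against the three complete rows shown above. It uses only numbers copied from the table, and the `overall` helper is a name introduced here for illustration.

```python
# Six category scores (Single Obj., Two Obj., Counting, Colors, Position,
# Color Attri.) and the reported Overall, copied from the complete rows of Table 3.
rows = {
    "LlamaGen": ([0.71, 0.34, 0.21, 0.58, 0.07, 0.04], 0.32),
    "PixArt-α": ([0.98, 0.50, 0.44, 0.80, 0.08, 0.07], 0.48),
    "SDv2.1": ([0.98, 0.51, 0.44, 0.85, 0.07, 0.17], 0.50),
}


def overall(scores):
    """Unweighted mean of the per-category scores."""
    return sum(scores) / len(scores)


for name, (scores, reported) in rows.items():
    mean = overall(scores)
    # Reported values are rounded to two decimals, so allow a small tolerance.
    matches = abs(mean - reported) < 0.006
    print(f"{name:<9} mean={mean:.4f} reported={reported:.2f} consistent={matches}")
```

All three rows agree with the reported Overall to within two-decimal rounding, so the per-category columns can be used to see which skills drive a model's headline score.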

