inclusionAI/Ming-UniVision
Description: Code release for Ming-UniVision: Joint Image Understanding and Generation with a Continuous Unified Tokenizer
Language: Python
License: MIT
Stars: 143
Forks: 5
Open issues: 4
Created: 2025-09-30T11:36:01Z
Pushed: 2025-10-14T13:38:52Z
Default branch: main
Fork: no
Archived: no
README:
Ming-UniVision: Joint Image Understanding and Generation with a Continuous Unified Tokenizer
📄 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope
🌍 Introduction
🌐 Ming-UniVision is a groundbreaking multimodal large language model (MLLM) that unifies vision understanding, generation, and editing within a single autoregressive next-token prediction (NTP) framework, powered by MingTok — the first continuous, unified visual tokenizer. By eliminating discrete quantization and leveraging a shared continuous latent space, Ming-UniVision enables seamless, end-to-end multimodal reasoning across diverse tasks. Trained on high-fidelity continuous visual representations, Ming-UniVision supports multi-round, in-context vision-language interactions, such as iterative question answering, image generation, and semantic editing — all without needing to decode intermediate states into pixels. This enables efficient, coherent, and human-like multimodal dialogue with consistent feature dynamics throughout.
- 🌐 First NTP MLLM with Continuous Unified Vision Representations:
Ming-UniVision unifies vision and language via next-token prediction over continuous visual tokens: no discrete quantization, a fully autoregressive generative paradigm, and support for both understanding and generation in a shared latent space.
- 🖼️ First Continuous Unified Visual Tokenizer:
MingTok-Vision enables both understanding and generation in a single continuous space, preserving semantic and perceptual quality.
- ⚡ 3.5× Faster Training Convergence:
Shared representation reduces conflict between tasks, enabling faster, more stable joint training.
- 🔄 Multi-Round In-Context Vision Tasks:
Perform iterative reasoning, generation, and editing in one latent space — no image decoding needed mid-process.
- 🔗 Single Space, Unified Workflow:
All modalities and tasks share one coherent feature space — simpler training, efficient inference, true autoregressive fusion.
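To make "no discrete quantization" in the list above concrete, here is a toy sketch that contrasts a vector-quantized latent (snapped to its nearest codebook entry) with the continuous latent it came from. The codebook, the latent vector, and the function name `nearest_code` are hypothetical and chosen only for illustration; this is not MingTok-Vision's implementation.

```python
import math

def nearest_code(latent, codebook):
    """Return (index, vector) of the codebook entry closest to `latent` (Euclidean distance)."""
    best = min(range(len(codebook)), key=lambda i: math.dist(latent, codebook[i]))
    return best, codebook[best]

# Hypothetical 2-D codebook with four entries, and one continuous latent vector.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latent = [0.8, 0.3]

# Discrete path: the latent is replaced by its nearest code, which loses information.
index, quantized = nearest_code(latent, codebook)
quantization_error = math.dist(latent, quantized)

# Continuous path: the latent is passed on unchanged, so nothing is lost at this step.
continuous_error = math.dist(latent, latent)

print(f"nearest code index: {index}")
print(f"quantization error: {quantization_error:.4f}")
print(f"continuous error:   {continuous_error:.4f}")
```

A discrete tokenizer hands the language model only `index`; a continuous tokenizer hands it the vector itself, which is the property the README attributes to MingTok-Vision.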
📌 Updates
- [2025.10.09] 📄 Technical Report Released!
The full technical report is now available on arXiv: 👉 arXiv:2510.06590 | Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer.
Dive into the architecture, unified continuous tokenizer, and end-to-end autoregressive framework that power our system.
- [2025.10.02] 🔥 We’re live!
We’re thrilled to announce the release of Ming-UniVision and MingTok-Vision — the first joint autoregressive vision-language system with unified continuous visual tokenization!
✨ Enable seamless multimodal reasoning, generation, and editing in a single latent space. 🚀 Faster training, richer semantics, and true end-to-end autoregression — no quantization, no compromises.
👉 Check out our blog post to learn how we’re redefining unified vision-language intelligence.
📊 Evaluation
MingTok-Vision achieves strong image reconstruction capability and Ming-UniVision enables unified multimodal understanding and generation within a single continuous latent space.
Image Reconstruction
MingTok-Vision achieves competitive reconstruction quality with high PSNR and low rFID, demonstrating its ability to preserve both perceptual fidelity and semantic structure in a continuous representation.
Table 1. Image reconstruction performance on ImageNet-val-50k.
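| Tokenizer | Res. | Tokens | rFID ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|---|
| Specialized tokenizers | | | | | | |
| SD-VAE | 256 | 1024 | 1.06 | 28.62 | 0.86 | - |
| GigaTok | 256 | 256 | 0.51 | 21.32 | 0.69 | 0.21 |
| VA-VAE | 256 | 256 | 0.26 | 28.59 | 0.80 | 0.09 |
| HieraTok | 256 | 256 | 1.04 | 23.90 | 0.72 | 0.09 |
| DC-AE | 512 | 64 | 0.22 | 26.15 | 0.71 | 0.08 |
| MAE-Tok | 512 | 128 | 0.62 | - | - | - |
| TexTok | 512 | 256 | 0.73 | 24.45 | 0.66 | 0.19 |
| Unified tokenizers | | | | | | |
| UniTok | 256 | 256 | 0.38 | - | - | - |
| TokenFlow | 384 | 729 | 0.63 | 22.77 | 0.73 | - |
| MingTok-Vision | 512 | 256 | 0.54 | 30.77 | 0.62 | 0.14 |
| MingTok-Vision † | 512 | 256 | 0.38 | 31.09 | 0.64 | 0.12 |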
† denotes using semantic decoder after joint pre-training.
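For readers unfamiliar with the PSNR column, the sketch below computes it from the standard definition, PSNR = 10 · log10(MAX² / MSE), on a tiny made-up pixel list. The pixel values, the 8-bit range (`max_value=255`), and the function name are assumptions for illustration; the excerpt does not state the exact evaluation protocol (value range, color space, or how scores are averaged over ImageNet images).

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    if not reference:
        raise ValueError("inputs must not be empty")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel "images"; every pixel is off by exactly 1, so MSE = 1.
reference = [52, 55, 61, 59]
reconstruction = [53, 54, 62, 58]

score = psnr(reference, reconstruction)
print(f"PSNR: {score:.2f} dB")
```

Higher is better: halving the per-pixel error raises PSNR by roughly 6 dB, which gives a rough sense of scale when comparing the rows in Table 1.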
Visual Understanding
Ming-UniVision achieves competitive performance on multimodal understanding benchmarks, showing that continuous latent tokens can effectively support high-level vision-language reasoning without discrete quantization.
Table 2. Quantitative evaluations on MMBench, MMStar, MMMU, MathVista, HallusionBench, AI2D, MM-Vet, OCRBench, and MME.
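| Model | MMB ↑ | MMS ↑ | MMMU ↑ | MathV ↑ | Hall ↑ | AI2D ↑ | MM-Vet ↑ | OCRBench ↑ | MME ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Understanding Only | | | | | | | | | |
| Emu3-Chat | 58.5 | - | 31.6 | - | - | - | 37.2 | 687 | - |
| Qwen2.5-VL-3B | 79.1 | 55.9 | 53.1 | 62.3 | 46.3 | 81.6 | - | 797 | 2157 |
| Qwen2.5-VL-7B | 83.5 | 63.9 | 58.6 | 68.2 | 52.9 | 83.9 | 67.1 | 864 | 2347 |
| InternVL2.5-4B | 81.1 | 58.3 | 52.3 | 60.5 | 46.3 | 81.4 | 60.6 | 828 | 2338 |
| InternVL2.5-8B | 84.6 | 62.8 | 56.0 | 64.4 | 50.1 | 84.5 | 62.8 | 822 | 2344 |
| DeepSeek-VL2 | 79.6 | 61.3 | 51.1 | 62.8 | - | 81.4 | - | 811 | 2253 |
| Unified model, Separate representation | | | | | | | | | |
| Janus-Pro-7B | 79.2 | - | 41.0 | - | - | - | 50.0 | - | - |
| LMFusion | - | - | 41.7 | - | - | - | - | - | 1603 |
| MetaQuery-L | 78.6 | - | 53.1 | - | - | - | 63.2 | - | - |
| Show-o2-7B | 79.3 | 56.6 | 48.9 | - | - | 78.6 | - | - | - |
| BLIP3-o 4B | 78.6 | - | 46.6 | - | - | - | 60.1 | - | 2161 |
| BAGEL | 85.0 | - | 55.3 | 73.1 | - | - | 67.2 | - | 2388 |
| Unified model, Unified representation | | | | | | | | | |
| VILA-U | - | - | - | - | - | - | 33.5 | - | 1402 |
| TokenFlow-XL | 76.8 | - | 43.2 | - | - | - | 48.2 | - | 1922 |
| UniTok | - | - | - | - | - | - | 33.9 | - | 1448 |
| Harmon-1.5B | 65.5 | - | 38.9 | - | - | - | - | - | 1476 |
| TokLIP | 67.6 | - | 43.1 | - | - | - | 29.8 | - | - |
| Ming-UniVision-16B-A3B (Ours) | 78.5 | 63.7 | 40.3 | 66.6 | 47.8 | 82.8 | 64.2 | 724 | 2023 |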
Visual Generation
Ming-UniVision achieves top performance among unified representation models in text-to-image generation, demonstrating superior object composition and spatial reasoning capabilities.
Table 3. Evaluation of text-to-image generation ability on GenEval and DPG-Bench. † denotes performance obtained by rewritten prompts.
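| Method | Single Obj. ↑ | Two Obj. ↑ | Counting ↑ | Colors ↑ | Position ↑ | Color Attri. ↑ | Overall ↑ | DPG-Bench ↑ |
|---|---|---|---|---|---|---|---|---|
| Generation Only | | | | | | | | |
| LlamaGen | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 | - |
| PixArt-α | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 | - |
| SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | - |
| DALL-E 2 | 0.94 | 0.66 | 0.49… |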