Model · OpenBMB (MiniCPM) · published Apr 13, 2026

openbmb/MiniCPM-V-4.6

published Apr 13, 2026 · task: image-text-to-text · license: apache-2.0 · library: transformers · params: 1.3B · downloads: 616k · likes: 1.1k

A Pocket-Sized MLLM for Ultra-Efficient Image and Video Understanding on Your Phone

GitHub | CookBook | Demo | Feishu (Lark)

News

  • [2026.05.17] ⭐️⭐️⭐️ We have released the MiniCPM-V 4.6 API service, along with a free public API key. Try it now.

MiniCPM-V 4.6

MiniCPM-V 4.6 is our most edge-deployment-friendly model to date. It is built on SigLIP2-400M and the Qwen3.5-0.8B LLM. It inherits the strong single-image, multi-image, and video understanding capabilities of the MiniCPM-V family while significantly improving computational efficiency, and it introduces mixed 4x/16x visual token compression. Notable features of MiniCPM-V 4.6 include:

  • 🔥 Leading Foundation Capability.

MiniCPM-V 4.6 scores 13 on the Artificial Analysis Intelligence Index benchmark, outperforming Qwen3.5-0.8B (score of 10) at 19x lower token cost and Qwen3.5-0.8B-Thinking (score of 11) at 43x lower token cost. It also surpasses the larger Ministral 3 3B (score of 11).

  • 💪 Strong Multimodal Capability.

MiniCPM-V 4.6 outperforms Qwen3.5-0.8B on most vision-language understanding tasks, and reaches Qwen3.5 2B-level capability on many benchmarks including OpenCompass, RefCOCO, HallusionBench, MUIRBench, and OCRBench.

  • 🚀 Ultra-Efficient Architecture.

Based on the latest technique in LLaVA-UHD v4, MiniCPM-V 4.6 reduces visual encoding FLOPs by more than 50%. This makes MiniCPM-V 4.6 more efficient than even smaller models, reaching ~1.5x the token throughput of Qwen3.5-0.8B. It also supports mixed 4x/16x visual token compression rates, allowing flexible switching between accuracy and speed.

  • 📱 Broad Mobile Platform Coverage.

MiniCPM-V 4.6 can be deployed across all three mainstream mobile platforms: iOS, Android, and HarmonyOS. With all edge adaptation code open-sourced, developers can reproduce the on-device experience in [just a few steps](#deploy-minicpm-v-46-on-ios-android-and-harmonyos-platforms).

  • 🛠️ Developer Friendly.

MiniCPM-V 4.6 is adapted to [inference frameworks](#inference-and-training) such as vLLM, SGLang, llama.cpp, and Ollama, and supports [fine-tuning ecosystems](#inference-and-training) such as SWIFT and LLaMA-Factory. Developers can quickly customize models for new domains and tasks on consumer-grade GPUs. We provide multiple quantized variants across GGUF, BNB, AWQ, and GPTQ formats.
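
To make the 4x/16x visual token compression trade-off concrete, here is a minimal arithmetic sketch. It assumes a ViT-style encoder with a 14-pixel patch size and a 448x448 input slice. Both numbers are illustrative assumptions that do not come from this model card, and `visual_token_count` is a hypothetical helper, not part of any MiniCPM-V API.

```python
def visual_token_count(height: int, width: int,
                       patch_size: int = 14, compression: int = 16) -> int:
    """Estimate how many visual tokens reach the LLM after compression.

    Assumptions (illustrative only): the encoder emits one token per
    patch_size x patch_size patch, and compression merges `compression`
    patch tokens into a single LLM token.
    """
    patches = (height // patch_size) * (width // patch_size)
    # Ceiling division so a partially filled group still costs one token.
    return -(-patches // compression)


for rate in (4, 16):
    print(f"{rate}x: {visual_token_count(448, 448, compression=rate)} tokens")
```

Under these assumptions, the 16x mode passes a quarter as many visual tokens to the LLM as the 4x mode, which is the accuracy-for-speed trade described above.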

Evaluation

Overall Performance (Instruct)

High-Concurrency Throughput

Single Request TTFT (ms)

Examples

Overall

MiniCPM-V 4.6 can be deployed on three mainstream on-device platforms: iOS, Android, and HarmonyOS. The clips below are raw, unedited screen recordings from phones.

Devices shown: iPhone (iPhone 17 Pro Max), Android (Redmi K70), HarmonyOS (HUAWEI nova 14).

Usage

Inference with Transformers

##### Installation

```bash
pip install "transformers[torch]>=5.7.0" torchvision torchcodec
```

> Note on CUDA compatibility: `torchcodec` (used for video decoding) may have compatibility issues with certain CUDA versions. For example, torch>=2.11 bundles CUDA 13.1 by default, while environments with CUDA 12.x may encounter errors such as `RuntimeError: Could not load libtorchcodec`. Two workarounds:
>
> 1. Replace `torchcodec` with `PyAV`, which supports both image and video inference without CUDA version constraints:
>
>    ```bash
>    pip install "transformers[torch]>=5.7.0" torchvision av
>    ```
>
> 2. **Pin the CUDA version** when installing torch to match your environment (e.g. CUDA 12.8):
>
>    ```bash
>    pip install "transformers>=5.7.0" torchvision torchcodec --index-url https://download.pytorch.org/whl/cu128
>    ```
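
Before running video inference, it can help to check which decoding backend actually imports in the current environment. The following is a minimal sketch, assuming that a CUDA mismatch surfaces as an exception when `torchcodec` is imported, as the error message above suggests. `pick_video_backend` is a hypothetical helper and not part of transformers.

```python
import importlib


def pick_video_backend(candidates=("torchcodec", "av")):
    """Return the first candidate module that imports cleanly, else None."""
    for name in candidates:
        try:
            importlib.import_module(name)
        except Exception:
            # ImportError if missing; other errors (e.g. a RuntimeError from
            # a CUDA mismatch) also mean this backend is unusable here.
            continue
        return name
    return None


print("usable video backend:", pick_video_backend())
```

If this prints `None`, neither package is usable and one of the two workarounds above is likely needed.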

##### Load Model

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "openbmb/MiniCPM-V-4.6"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Flash Attention 2 is recommended for better acceleration and memory saving,
# especially in multi-image and video scenarios.
# import torch
# model = AutoModelForImageTextToText.from_pretrained(
#     model_id,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )
```
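
If Flash Attention 2 may or may not be installed, the attention backend can be chosen at runtime. This is a minimal sketch: `pick_attn_implementation` is a hypothetical helper, and falling back to `"sdpa"` assumes this model supports that backend, which the model card does not state. An installed `flash_attn` package also does not guarantee that the local GPU supports it.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is installed, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation())

# Intended use (not executed here; assumes model_id from the snippet above):
# model = AutoModelForImageTextToText.from_pretrained(
#     model_id, torch_dtype="auto", device_map="auto",
#     attn_implementation=pick_attn_implementation(),
# )
```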

##### Image Inference

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"},
            {"type": "text", "text": "What causes this phenomenon?"},
        ],
    }
]

# Use downsample_mode="4x" for finer detail.
downsample_mode = "16x"

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode=downsample_mode,
    max_slice_nums=36,
).to(model.device)

generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
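
The trimming step is needed because `generate` returns the prompt tokens followed by the new tokens. The sketch below reproduces that slicing with plain Python lists standing in for tensors. The token IDs are made up for illustration, and `trim_prompt` is a hypothetical helper.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: each output starts with its own prompt.
prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 55]]

trimmed = trim_prompt(prompts, outputs)
print(trimmed)  # [[42, 43], [55]]
```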

##### Video Inference

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/football.mp4"},
            {"type": "text", "text": "Describe this video in detail. Follow the timeline and focus on on-screen text, interface changes, main actions, and scene changes."},
        ],
    }
]

# Use downsample_mode="4x" for finer detail.
downsample_mode = "16x"

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode=downsample_mode,
    max_num_frames=128,
    stack_frames=1,
    max_slice_nums=1,
    use_image_id=False,
).to(model.device)

generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=2048)
# Strip the prompt tokens so only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,…
```
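
The `max_num_frames=128` argument caps how many frames are passed to the model. The model card does not describe how the processor chooses those frames, so the following is only a sketch of one common approach, uniform sampling. `sample_frame_indices` is a hypothetical helper, not the processor's actual implementation.

```python
def sample_frame_indices(total_frames: int, max_num_frames: int = 128) -> list[int]:
    """Pick at most max_num_frames evenly spaced frame indices."""
    if total_frames <= 0:
        return []
    if total_frames <= max_num_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    step = total_frames / max_num_frames
    # Take the midpoint of each of the max_num_frames equal-width bins.
    return [int(step * i + step / 2) for i in range(max_num_frames)]


indices = sample_frame_indices(1000)
print(len(indices), indices[:3])
```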

Excerpt shown — open the source for the full document.
