openbmb/MiniCPM-V-4.6-Thinking
A Pocket-Sized MLLM for Ultra-Efficient Image and Video Understanding on Your Phone
GitHub | CookBook | Demo | Feishu (Lark)
News
- [2026.05.17] ⭐️⭐️⭐️ We have released the MiniCPM-V 4.6 API service, along with a free public API key. Try it now.
MiniCPM-V 4.6 Thinking
MiniCPM-V 4.6 Thinking is the long chain-of-thought reasoning variant of MiniCPM-V 4.6. It generates an explicit reasoning trace before producing the final answer, substantially boosting performance on complex multimodal reasoning, math, and OCR-heavy tasks, while keeping the same edge-friendly architecture (SigLIP2-400M vision encoder + Qwen3.5-0.8B LLM) and the mixed 4x/16x visual token compression of MiniCPM-V 4.6.
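Because the reasoning trace precedes the final answer, downstream code often needs to separate the two. The sketch below shows one way to do that on the decoded text. It assumes the trace is wrapped in Qwen3-style `<think>` and `</think>` tags and that these tags survive decoding; this card does not confirm either point, so check a real output first. `split_reasoning` is a hypothetical helper, not part of the model's API.

```python
def split_reasoning(text: str,
                    open_tag: str = "<think>",
                    close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded model output into (reasoning, answer).

    ASSUMPTION: the reasoning trace ends with `close_tag` (Qwen3-style).
    If the tag is absent, the whole text is treated as the answer.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # No closing tag found: nothing to separate.
        return "", text.strip()
    # Drop the opening tag (if present) from the reasoning part.
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("<think>Light bends at the surface.</think>\nRefraction.")
print(answer)
```

If the tags turn out to be registered as special tokens, decoding with `skip_special_tokens=True` would strip them, and the split would need to happen on token IDs instead.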
Evaluation
Overall Performance (Thinking)
[Figure: MiniCPM-V 4.6 (Instruct) performance]
[Figure: MiniCPM-V 4.6 inference efficiency results. Panels: High-Concurrency Throughput; Single Request TTFT (ms)]
Examples
Overall
MiniCPM-V 4.6 can be deployed on three mainstream end-side platforms: iOS, Android, and HarmonyOS. The clips below are raw, unedited screen recordings from phones.
| Platform | Device |
|----------|--------|
| iPhone | iPhone 17 Pro Max |
| Android | Redmi K70 |
| HarmonyOS | HUAWEI nova 14 |
Usages
Inference with Transformers
##### Installation
```bash
pip install "transformers[torch]>=5.7.0" torchvision torchcodec
```
> **Note on CUDA compatibility:** `torchcodec` (used for video decoding) may have compatibility issues with certain CUDA versions. For example, `torch>=2.11` bundles CUDA 13.1 by default, while environments with CUDA 12.x may encounter errors such as `RuntimeError: Could not load libtorchcodec`. Two workarounds:
>
> 1. **Replace `torchcodec` with `PyAV`**, which supports both image and video inference without CUDA version constraints:
>    ```bash
>    pip install "transformers[torch]>=5.7.0" torchvision av
>    ```
> 2. **Pin the CUDA version** when installing torch to match your environment (e.g. CUDA 12.8):
>    ```bash
>    pip install "transformers>=5.7.0" torchvision torchcodec --index-url https://download.pytorch.org/whl/cu128
>    ```
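Before running video inference, it can help to check which decoding package actually imports in the current environment. The snippet below is a minimal sketch of that check: `pick_video_backend` is a hypothetical helper, and it only reports which package loads. It does not configure Transformers to use it.

```python
import importlib


def pick_video_backend(candidates=("torchcodec", "av")):
    """Return the first importable package name from `candidates`, else None.

    Hypothetical helper for diagnosing the environment only.
    """
    for name in candidates:
        try:
            importlib.import_module(name)
        except Exception:
            # Covers ImportError (package missing) as well as errors raised
            # when a native library fails to load, such as a CUDA mismatch.
            continue
        return name
    return None


backend = pick_video_backend()
print(f"Usable video decoding package: {backend}")
```

A result of `None` suggests that neither package is usable and one of the two workarounds above is needed.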
##### Load Model
```python
import torch  # only needed for the optional Flash Attention 2 variant below
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "openbmb/MiniCPM-V-4.6-Thinking"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Flash Attention 2 is recommended for better acceleration and memory saving,
# especially in multi-image and video scenarios.
# model = AutoModelForImageTextToText.from_pretrained(
#     model_id,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )
```
##### Image Inference
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"},
            {"type": "text", "text": "What causes this phenomenon?"},
        ],
    }
]

downsample_mode = "16x"  # use "4x" for finer detail

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode=downsample_mode,
    max_slice_nums=36,
).to(model.device)

# downsample_mode must be passed to generate() as well.
generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=512)

# Strip the prompt tokens so that only newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

##### Video Inference
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/football.mp4"},
            {"type": "text", "text": "Describe this video in detail. Follow the timeline and focus on on-screen text, interface changes, main actions, and scene changes."},
        ],
    }
]

downsample_mode = "16x"  # use "4x" for finer detail

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode=downsample_mode,
    max_num_frames=128,
    stack_frames=1,
    max_slice_nums=1,
    use_image_id=False,
).to(model.device)

# downsample_mode must be passed to generate() as well.
generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=2048)

# Strip the prompt tokens so that only newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

##### Advanced Parameters
You can customize image and video processing by passing additional parameters to `apply_chat_template`:

| Parameter | Default | Applies to | Description |
|-----------|---------|------------|-------------|
| `downsample_mode` | `"16x"` | Image & Video | Visual token downsampling. `"16x"` merges tokens for efficiency; `"4x"` keeps 4× more tokens for finer detail. Must also be passed to `generate()`. |
| `max_slice_nums` | `9` | Image & Video | Maximum number of slices when splitting a high-resolution image. Higher values preserve more detail for large images. Recommended: `36` for image, `1` for video. |
| `max_num_frames` | `128` | Video only | Controls the temporal context length and prevents VRAM overflow. **Short videos** (duration ≤ `max_num_frames` seconds): the processor samples at 1 FPS, capturing second-by-second detail without hitting the upper limit. **Long videos** (duration > `max_num_frames` seconds): the processor switches to uniform sampling and selects exactly `max_num_frames` frames evenly spaced across the entire timeline. |
| `stack_frames` | `1` | Video only | Total sample points per second. `1` = main frame only (no stacking). `N` (N > 1) = 1 main frame + N−1 sub-frames per second; the sub-frames are composited into a grid image and interleaved with main frames. Recommended: `1` for short videos, `3` or `5` for long videos. |
| `use_image_id` | `True` | Image & Video | Whether to prepend N tags before… |
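The `max_num_frames` rule in the table can be illustrated with a small sketch. This is not the processor's actual implementation: `sample_timestamps` is a hypothetical helper, and the exact timestamps it picks (whole seconds for short videos, bin centres for long ones) are assumptions made only to show the 1 FPS versus uniform-sampling switch.

```python
def sample_timestamps(duration_s: float, max_num_frames: int = 128) -> list[float]:
    """Return main-frame timestamps in seconds for a video of `duration_s`.

    Hypothetical illustration of the rule described for `max_num_frames`;
    the real processor may choose different exact positions.
    """
    if duration_s <= 0 or max_num_frames <= 0:
        return []
    if duration_s <= max_num_frames:
        # Short video: one frame per second (1 FPS), at least one frame.
        n = max(1, int(duration_s))
        return [float(t) for t in range(n)]
    # Long video: exactly max_num_frames frames, evenly spaced.
    # ASSUMPTION: each frame sits at the centre of its time bin.
    step = duration_s / max_num_frames
    return [(i + 0.5) * step for i in range(max_num_frames)]


print(len(sample_timestamps(30)))    # 30-second clip, sampled at 1 FPS
print(len(sample_timestamps(600)))   # 10-minute clip, capped at max_num_frames
```

Under these assumptions, a 10-minute video yields one main frame roughly every 4.7 seconds, which is why the table suggests `stack_frames` of 3 or 5 for long videos to recover some of the motion between main frames.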