What does this model signal mean?

OpenBMB (MiniCPM) published openbmb/MiniCPM-V-4.6-Thinking. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: license apache-2.0 · 17.4K HF downloads · vision-language model with chain-of-thought reasoning, version 4.6. onlylabs links this event to 1 captured evidence page and 6 related model signals.

OpenBMB (MiniCPM) Model: openbmb/MiniCPM-V-4.6-Thinking

Captured source

Hugging Face/huggingface.co/openbmb/MiniCPM-V-4.6-Thinking

openbmb/MiniCPM-V-4.6-Thinking model card

published May 8, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain · task image-text-to-text · license apache-2.0 · library transformers · params 1.3B · downloads 17k · likes 29

A Pocket-Sized MLLM for Ultra-Efficient Image and Video Understanding on Your Phone

GitHub | CookBook | Demo | Feishu (Lark)

News

[2026.05.17] ⭐️⭐️⭐️ We have released the MiniCPM-V 4.6 API service, together with a free public API key. Try it now.

MiniCPM-V 4.6 Thinking

MiniCPM-V 4.6 Thinking is the long chain-of-thought reasoning variant of MiniCPM-V 4.6. It generates an explicit reasoning trace before producing the final answer, substantially boosting performance on complex multimodal reasoning, math, and OCR-heavy tasks, while keeping the same edge-friendly architecture (SigLIP2-400M vision encoder + Qwen3.5-0.8B LLM) and the mixed 4x/16x visual token compression of MiniCPM-V 4.6.
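
To make the "reasoning trace, then final answer" output shape concrete, here is a minimal sketch of separating the two parts of a decoded response. It assumes the trace is wrapped in `<think>...</think>` tags, a convention used by several open reasoning models; this excerpt does not state the delimiter for MiniCPM-V 4.6 Thinking, so check the model's chat template before relying on it. The function name `split_reasoning` is hypothetical.

```python
def split_reasoning(text: str, open_tag: str = "<think>", close_tag: str = "</think>"):
    """Split decoded model output into (reasoning_trace, final_answer).

    ASSUMPTION: the trace is delimited by <think>...</think>. If the closing
    tag is absent, the whole text is treated as the final answer.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    # Some chat templates emit the opening tag as part of the prompt, so it
    # may be missing from the generated text; drop it only if present.
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


# Made-up response text, for illustration only.
sample = "<think>Light bends when it enters water.</think>Refraction."
trace, answer = split_reasoning(sample)
print(trace)
print(answer)
```

If the decoded text contains no such tags (for example because they are special tokens removed by `skip_special_tokens=True`), the function returns an empty trace and the full text as the answer.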

Evaluation

Overall Performance (Thinking)

[Results not captured in this excerpt.]

Collapsed sections not captured: MiniCPM-V 4.6 (Instruct) performance; MiniCPM-V 4.6 inference efficiency results (High-Concurrency Throughput, Single Request TTFT in ms).

Examples

Overall

MiniCPM-V 4.6 can be deployed on three mainstream end-side platforms: iOS, Android, and HarmonyOS. The clips below are raw, unedited screen recordings from phones.

| Platform | Device |
|----------|--------|
| iPhone | iPhone 17 Pro Max |
| Android | Redmi K70 |
| HarmonyOS | HUAWEI nova 14 |

Usages

Inference with Transformers

##### Installation

```bash
pip install "transformers[torch]>=5.7.0" torchvision torchcodec
```

> **Note on CUDA compatibility:** `torchcodec` (used for video decoding) may have compatibility issues with certain CUDA versions. For example, `torch>=2.11` bundles CUDA 13.1 by default, while environments with CUDA 12.x may encounter errors such as `RuntimeError: Could not load libtorchcodec`. Two workarounds:
>
> 1. **Replace `torchcodec` with `PyAV`**, which supports both image and video inference without CUDA version constraints:
>    ```bash
>    pip install "transformers[torch]>=5.7.0" torchvision av
>    ```
> 2. **Pin the CUDA version** when installing torch to match your environment (e.g. CUDA 12.8):
>    ```bash
>    pip install "transformers>=5.7.0" torchvision torchcodec --index-url https://download.pytorch.org/whl/cu128
>    ```
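
Before choosing between the two workarounds, it can help to see what is actually installed. The following is a small diagnostic sketch, not part of the model card: the helper names `pick_video_backend` and `torch_cuda_version` are hypothetical, and the check only tests whether a package is present, not whether `torchcodec` can load its native libraries.

```python
import importlib.util


def pick_video_backend() -> str:
    """Return the name of the first installed video decoding package, or "none".

    Only checks that the package can be found; it does not import it, so a
    torchcodec install that fails to load its CUDA libraries still shows up
    as "torchcodec" here.
    """
    for name in ("torchcodec", "av"):
        if importlib.util.find_spec(name) is not None:
            return name
    return "none"


def torch_cuda_version():
    """Return the CUDA version torch was built against, or None."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return torch.version.cuda  # None for CPU-only builds


print("video backend:", pick_video_backend())
print("torch CUDA build:", torch_cuda_version())
```

If the reported CUDA build does not match the CUDA runtime on the machine, that mismatch is the situation the note above describes.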

##### Load Model

```python
import torch  # only needed for the optional Flash Attention 2 variant below
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "openbmb/MiniCPM-V-4.6-Thinking"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Flash Attention 2 is recommended for better acceleration and memory saving,
# especially in multi-image and video scenarios.
# model = AutoModelForImageTextToText.from_pretrained(
#     model_id,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )
```

##### Image Inference

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"},
            {"type": "text", "text": "What causes this phenomenon?"},
        ],
    }
]

downsample_mode = "16x"  # use downsample_mode="4x" for finer detail

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode=downsample_mode,
    max_slice_nums=36,
).to(model.device)

# downsample_mode must be passed to generate() as well
generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=512)
# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
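
The list comprehension that builds `generated_ids_trimmed` relies on `generate` returning, for each sequence, the prompt tokens followed by the new tokens. The toy example below uses made-up token IDs in plain Python lists instead of tensors to show what that slice does; it is an illustration only and does not call the model.

```python
# Made-up token IDs standing in for tensors; values are illustrative only.
prompt_ids = [[101, 7, 8, 9], [101, 4, 5, 6]]  # two prompts, 4 tokens each
full_outputs = [
    [101, 7, 8, 9, 50, 51],  # prompt + two new tokens
    [101, 4, 5, 6, 60, 61],
]

# Same slicing as in the inference snippet: drop the first len(prompt) tokens.
new_tokens = [out[len(inp):] for inp, out in zip(prompt_ids, full_outputs)]
print(new_tokens)  # [[50, 51], [60, 61]]
```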

##### Video Inference

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/football.mp4"},
            {"type": "text", "text": "Describe this video in detail. Follow the timeline and focus on on-screen text, interface changes, main actions, and scene changes."},
        ],
    }
]

downsample_mode = "16x"  # use downsample_mode="4x" for finer detail

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode=downsample_mode,
    max_num_frames=128,
    stack_frames=1,
    max_slice_nums=1,
    use_image_id=False,
).to(model.device)

# downsample_mode must be passed to generate() as well
generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=2048)
# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
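
The image and video snippets differ only in the extra arguments passed to `apply_chat_template`. One way to keep those settings in a single place is a small helper such as the sketch below. The function `processor_kwargs` is hypothetical; the values it returns are copied from the two snippets above and are not independently verified defaults.

```python
def processor_kwargs(kind: str, downsample_mode: str = "16x") -> dict:
    """Return the extra apply_chat_template arguments used in the snippets above.

    kind is "image" or "video"; downsample_mode is "16x" or "4x".
    """
    if downsample_mode not in ("16x", "4x"):
        raise ValueError(f"unsupported downsample_mode: {downsample_mode!r}")
    if kind == "image":
        return {"downsample_mode": downsample_mode, "max_slice_nums": 36}
    if kind == "video":
        return {
            "downsample_mode": downsample_mode,
            "max_num_frames": 128,
            "stack_frames": 1,
            "max_slice_nums": 1,
            "use_image_id": False,
        }
    raise ValueError(f"unknown input kind: {kind!r}")


print(processor_kwargs("video", downsample_mode="4x"))
```

With this in place, the processor call would become `processor.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt", **processor_kwargs("video"))`, and the same `downsample_mode` still has to be passed to `generate()`.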

##### Advanced Parameters

You can customize image/video processing by passing additional parameters to apply_chat_template:

| Parameter | Default | Applies to | Description |
|-----------|---------|------------|-------------|
| `downsample_mode` | `"16x"` | Image & Video | Visual token downsampling. `"16x"` merges tokens for efficiency; `"4x"` keeps 4× more tokens for finer detail. Must also be passed to `generate()`. |
| `max_slice_nums` | `9` | Image & Video | Maximum number of slices when splitting a high-resolution image. Higher values preserve more detail for large images. Recommended: `36` for image, `1` for video. |
| `max_num_frames` | `128` | Video only | Dynamically controls the temporal context length and prevents VRAM overflow. Short videos (duration ≤ `max_num_frames` sec): the processor defaults to 1 FPS, capturing second-by-second details without hitting the upper limit. Long videos (duration > `max_num_frames` sec): the processor automatically switches to uniform sampling, selecting exactly `max_num_frames` frames evenly spaced across the entire timeline. |
| `stack_frames` | `1` | Video only | Total sample points per second. `1` = main frame only (no stacking). `N` (N>1) = 1 main frame + N−1 sub-frames per second; the sub-frames are composited into a grid image and interleaved with main frames. Recommended setting is `1` for short videos, and `3` or `5` for long videos. |
| `use_image_id` | `True` | Image & Video | Whether to prepend N tags before... |
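
The `max_num_frames` rule in the table can be sketched in a few lines. This is an illustration of the described behaviour, not the processor's actual code: the function name `sample_timestamps` is hypothetical, and the exact positions chosen for long videos (interval midpoints here) are an assumption. Sub-frames from `stack_frames` are not modelled.

```python
def sample_timestamps(duration_sec: float, max_num_frames: int = 128) -> list[float]:
    """Timestamps (in seconds) of main frames under the rule in the table.

    Short video (duration <= max_num_frames): one frame per second.
    Long video: exactly max_num_frames timestamps spread evenly over the
    timeline. ASSUMPTION: the midpoint of each interval is used.
    """
    if duration_sec <= 0 or max_num_frames <= 0:
        return []
    if duration_sec <= max_num_frames:
        # 1 FPS: t = 0, 1, 2, ... (at least one frame for very short clips)
        n = max(1, int(duration_sec))
        return [float(t) for t in range(n)]
    step = duration_sec / max_num_frames
    return [(i + 0.5) * step for i in range(max_num_frames)]


print(len(sample_timestamps(10)))   # 10
print(len(sample_timestamps(600)))  # 128
```

A 10-second clip stays at one frame per second, while a 10-minute clip is capped at 128 frames, which is how the parameter bounds the number of visual tokens for long inputs.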

Excerpt shown — open the source for the full document.

Notability

notability 8.0/10

High-download multimodal model release