MoonshotAI/Kimi-VL
Captured source
source ↗MoonshotAI/Kimi-VL
Description: Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities
License: MIT
Stars: 1199
Forks: 86
Open issues: 40
Created: 2025-04-09T08:34:29Z
Pushed: 2025-07-15T15:48:21Z
Default branch: main
Fork: no
Archived: no
README:
1. Introduction
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities—all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B).
Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent interaction tasks (e.g.,OSWorld), achieving state-of-the-art results comparable to flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, multi-image understanding, and etc.
In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains.
Kimi-VL also advances the pareto frontiers of multimodal models in processing long contexts and perceiving clearly: Equipped with a 128K extended context window, Kimi-VL can processes long and diverse inputs, achieving impressive scores of 64.5 on LongVideoBench, and 35.1 on MMLongBench-Doc; Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost with common visual inputs and general tasks.
Building on this foundation, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal thinking models.
Besides original model variants, we also provide a new Kimi-VL-A3B-Thinking-2506 variant with several new or improved abilities:
- It Thinks Smarter while Consuming Less Tokens: The 2506 version reaches better accuracy on multimodal reasoning benchmarks: 56.9 on MathVision (+20.1), 80.1 on MathVista (+8.4), 46.3 on MMMU-Pro (+3.2), 64.0 on MMMU (+2.1), while in average reducing 20% thinking length.
- It Sees Clearer with Thinking: Unlike the previous version that specializes on thinking tasks, the 2506 version can also achieve the same or even better ability on general visual perception and understanding, e.g. MMBench-EN-v1.1 (84.4), MMStar (70.4), RealWorldQA (70.0), MMVet (78.4) compared to the original non-thinking version (Kimi-VL-A3B-Instruct).
- It Extends to Video Scenarios: The new 2506 version also improves on video reasoning and understanding benchmarks. It sets new state-of-the-art for open-source models on VideoMMMU (65.2), while also retaining good ability on general video understanding (71.9 on Video-MME).
- It Extends to Higher Resolution: The new 2506 version supports 3.2 million total pixels in a single image (1792x1792), 4X compared to the original release. This leads to non-trivial improvements on high-resolution perception and OS-agent grounding benchmarks: 83.2 on V* Benchmark (without extra tools), 52.8 on ScreenSpot-Pro, 52.5 on OSWorld-G (full set with refusal).
2. Architecture
The model adopts an MoE language model, a native-resolution visual encoder (MoonViT), and an MLP projector, as illustrated in the following image.
3. News
- 2025.06.21: Release of Kimi-VL-A3B-Thinking-2506: Tech Blog \& Cookbook, 🤗 Hugging Face
- 2025.04.15: vLLM has supported Kimi-VL deployment. See #16387 for details.
- 2025.04.14: LLaMA-Factory has supported Kimi-VL finetuning. See #7719 for details.
4. Model Variants
🤗 For common general multimodal perception and understanding, OCR, long video and long document, video perception, and OS-agent uses, we recommend Kimi-VL-A3B-Instruct for efficient inference; meanwhile, our new thinking version, Kimi-VL-A3B-Thinking-2506 also has excellent multimodal perception, long video and long document and OS-agent grounding abilities while achieving better multimodal reasoning skills. See this blog for more information.
> [!Note] > Recommended parameter settings: > - For Thinking models, it is recommended to use Temperature = 0.8. > - For Instruct models, it is recommended to use Temperature = 0.2.
Hugging Face Demo
> 🤗 We serve our model demo in Hugging Face spaces: > - Chat with Kimi-VL-A3B-Thinking-2506👀🤔🗺️🎬📖🖥️ (*unifying thinking, general understanding, puzzle solving, agent, video, PDF*) model on Chat Web.
5. Performance
> [!Note] > See the performance of Kimi-VL-A3B-Thinking-2506 at Hugging Face.
As an efficient model, Kimi-VL can robustly handle diverse tasks (fine-grained perception, math, college-level problems, OCR, agent, etc) across a broad spectrum of input forms (single-image, multi-image, video, long-document, etc).
A brief comparison with existing 10B-level dense VLMs and DeepSeek-VL2 (A4.5B):
With effective long-thinking abilities, Kimi-VL-A3B-Thinking (2504 version) can match the performance of 30B/70B frontier open-source VLMs on MathVision benchmark:
6. Example usage
Setup
conda create -n kimi-vl python=3.10 -y conda activate kimi-vl pip install -r requirements.txt
> [!Note] > If you encounter Out-of-Memory or want to speed up inference, please install flash-attn…
Excerpt shown — open the source for the full document.
Notability
notability 7.0/10Notable model release with decent stars