QwenLM/Qwen3-VL
Jupyter Notebook
Description: Qwen3-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
Language: Jupyter Notebook
License: Apache-2.0
Stars: 19353
Forks: 1783
Open issues: 413
Created: 2024-08-29T08:30:38Z
Pushed: 2026-01-30T04:47:30Z
Default branch: main
Fork: no
Archived: no
README:
Qwen3-VL
Introduction
Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date.
This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.
The models are available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.
Key Enhancements:
- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 10); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.
Model Architecture Updates:
1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.
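To make the "full‑frequency allocation" idea in item 1 concrete, the sketch below contrasts two ways of assigning rotary frequency indices to the time, height, and width axes: contiguous blocks versus a round-robin (interleaved) pattern. This is an illustration only, not the Qwen3-VL implementation. The function names, the `t, h, w` ordering, the even split, and the base `theta=10000.0` are all assumptions, since this README does not specify the actual channel layout.

```python
from typing import List, Sequence, Tuple

AXES = ("t", "h", "w")  # assumed axis order; hypothetical, not taken from the README


def chunked_allocation(num_freqs: int) -> List[str]:
    """Block layout: each axis gets one contiguous run of frequency indices."""
    base, extra = divmod(num_freqs, len(AXES))
    layout: List[str] = []
    for i, axis in enumerate(AXES):
        layout.extend([axis] * (base + (1 if i < extra else 0)))
    return layout


def interleaved_allocation(num_freqs: int) -> List[str]:
    """Interleaved layout: axes take turns (t, h, w, t, h, w, ...)."""
    return [AXES[i % len(AXES)] for i in range(num_freqs)]


def frequency_span(layout: Sequence[str], axis: str) -> Tuple[int, int]:
    """Lowest and highest frequency index assigned to `axis`."""
    idx = [i for i, a in enumerate(layout) if a == axis]
    return idx[0], idx[-1]


def rotary_angles(position: Tuple[int, int, int],
                  layout: Sequence[str],
                  theta: float = 10000.0) -> List[float]:
    """Rotation angle per frequency index for one (t, h, w) position.

    Index i uses the standard RoPE inverse frequency theta ** (-i / n),
    multiplied by the coordinate of whichever axis owns that index.
    """
    n = len(layout)
    pos = dict(zip(AXES, position))
    return [pos[a] * theta ** (-i / n) for i, a in enumerate(layout)]


if __name__ == "__main__":
    for name, fn in (("chunked", chunked_allocation),
                     ("interleaved", interleaved_allocation)):
        layout = fn(12)
        spans = {axis: frequency_span(layout, axis) for axis in AXES}
        print(name, "".join(layout), spans)
```

With 12 frequency indices, the block layout confines `t` to indices 0 to 3 (the highest frequencies only), while the interleaved layout spreads `t` across indices 0 to 9, so every axis sees both fast and slow rotations. That wider coverage per axis is one plausible reading of why an interleaved scheme would help long-horizon video reasoning.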
News
- 2025.11.27: We have released the Qwen3-VL paper, which covers many of the technical details behind Qwen3-VL. We hope it will be helpful to everyone.
- 2025.10.21: We have released the Qwen3-VL-2B (Instruct/Thinking) and Qwen3-VL-32B (Instruct/Thinking). Enjoy it!
- 2025.10.15: We have released the Qwen3-VL-4B (Instruct/Thinking) and Qwen3-VL-8B (Instruct/Thinking). Enjoy it!
- 2025.10.04: We have released the Qwen3-VL-30B-A3B-Instruct and Qwen3-VL-30B-A3B-Thinking. We have also released FP8 versions of the Qwen3-VL models, available in our Hugging Face collection and ModelScope collection.
- 2025.09.23: We have released the Qwen3-VL-235B-A22B-Instruct and Qwen3-VL-235B-A22B-Thinking. For more details, please check our blog!
- 2025.04.08: We provide the code for fine-tuning Qwen2-VL and Qwen2.5-VL.
- 2025.03.25: We have released the Qwen2.5-VL-32B. It is smarter and its responses align more closely with human preferences. For more details, please check our blog!
- 2025.02.20: We have released the Qwen2.5-VL Technical Report. Alongside the report, we have also released AWQ-quantized models for Qwen2.5-VL in three sizes: 3B, 7B, and 72B parameters.
- 2025.01.28: We have released the Qwen2.5-VL series. For more details, please check our blog!
- 2024.12.25: We have released the QvQ-72B-Preview. QvQ-72B-Preview is an experimental research model, focusing on enhancing visual reasoning capabilities. For more details, please check our blog!
- 2024.09.19: The instruction-tuned Qwen2-VL-72B model and its quantized version [AWQ, GPTQ-Int4, GPTQ-Int8] are now available. We have also released the Qwen2-VL paper simultaneously.
- 2024.08.30: We have released the Qwen2-VL series. The 2B and 7B models are now available, and the open-source 72B model is coming soon. For more details, please…
Notability: 9.0/10 (high-traction new flagship VL model repo)