QwenLM/Qwen-VL
Description: The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.
Language: Python
License: NOASSERTION
Stars: 6666
Forks: 491
Open issues: 324
Created: 2023-08-21T07:57:15Z
Pushed: 2024-08-07T02:37:06Z
Default branch: master
Fork: no
Archived: no
README:
---
Qwen-VL-Plus & Qwen-VL-Max
Qwen-VL-Plus and Qwen-VL-Max are the latest, upgraded versions of the Qwen-VL model family. Both are currently free to access through 🤗, 🤖, web pages, the app, and APIs.

| Model name | Model description |
| --- | --- |
| Qwen-VL-Plus | Qwen's Enhanced Large Visual Language Model. Significantly upgraded for detailed recognition capabilities and text recognition abilities, supporting ultra-high pixel resolutions up to millions of pixels and extreme aspect ratios for image input. It delivers significant performance across a broad range of visual tasks. |
| Qwen-VL-Max | Qwen's Most Capable Large Visual Language Model. Compared to the enhanced version, further improvements have been made to visual reasoning and instruction-following capabilities, offering a higher level of visual perception and cognitive understanding. It delivers optimal performance on an even broader range of complex tasks. |
The key technical advancements in these versions include:
- A substantial boost in image-related reasoning capabilities;
- Considerable enhancement in recognizing, extracting, and analyzing details of images, especially for text-oriented tasks;
- Support for high-definition images with resolutions above one million pixels and extreme aspect ratios.
These two models not only significantly surpass all previous best results from open-source LVLM models, but also perform on par with Gemini Ultra and GPT-4V in multiple text-image multimodal tasks.
Notably, Qwen-VL-Max outperforms both GPT-4V from OpenAI and Gemini from Google on Chinese question answering and Chinese text comprehension tasks. This result underscores the model's advanced capabilities and its potential to set new standards in multimodal AI research and application.
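| Model | DocVQA | ChartQA | AI2D | TextVQA | MMMU | MathVista | MM-Bench-CN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Other Best Open-source LVLM | 81.6% (CogAgent) | 68.4% (CogAgent) | 73.7% (Fuyu-Medium) | 76.1% (CogAgent) | 45.9% (Yi-VL-34B) | 36.7% (SPHINX-V2) | 72.4% (InternLM-XComposer-VL) |
| Gemini Pro | 88.1% | 74.1% | 73.9% | 74.6% | 47.9% | 45.2% | 74.3% |
| Gemini Ultra | 90.9% | 80.8% <sup>1</sup> | 79.5% <sup>1</sup> | 82.3% <sup>1</sup> | 59.4% <sup>1</sup> | 53.0% <sup>1</sup> | - |
| GPT-4V | 88.4% | 78.5% | 78.2% | 78.0% | 56.8% | 49.9% | 73.9% |
| Qwen-VL-Plus | 91.4% | 78.1% | 75.9% | 78.9% | 45.2% | 43.3% | 68.0% |
| Qwen-VL-Max | 93.1% <sup>1</sup> | 79.8% <sup>2</sup> | 79.3% <sup>2</sup> | 79.5% <sup>2</sup> | 51.4% <sup>3</sup> | 51.0% <sup>2</sup> | 75.1% <sup>1</sup> |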
All numbers are obtained without any use of external OCR tools ('pixel only').
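
To make the comparison easier to query, the scores above can be loaded into a small lookup. The following is a minimal sketch, not part of the repository: the names `BENCHMARKS`, `SCORES`, `rank_models`, `rank_of`, and `leaders` are hypothetical, and the per-benchmark "Other Best Open-source LVLM" row is left out because it mixes several different models.

```python
# Scores (%) copied from the table above; None marks a result that is not reported.
BENCHMARKS = ["DocVQA", "ChartQA", "AI2D", "TextVQA", "MMMU", "MathVista", "MM-Bench-CN"]

SCORES = {
    "Gemini Pro":   [88.1, 74.1, 73.9, 74.6, 47.9, 45.2, 74.3],
    "Gemini Ultra": [90.9, 80.8, 79.5, 82.3, 59.4, 53.0, None],
    "GPT-4V":       [88.4, 78.5, 78.2, 78.0, 56.8, 49.9, 73.9],
    "Qwen-VL-Plus": [91.4, 78.1, 75.9, 78.9, 45.2, 43.3, 68.0],
    "Qwen-VL-Max":  [93.1, 79.8, 79.3, 79.5, 51.4, 51.0, 75.1],
}


def rank_models(benchmark):
    """Return (model, score) pairs for one benchmark, best score first."""
    idx = BENCHMARKS.index(benchmark)
    reported = [(model, row[idx]) for model, row in SCORES.items() if row[idx] is not None]
    return sorted(reported, key=lambda pair: pair[1], reverse=True)


def rank_of(model, benchmark):
    """Return the 1-based rank of a model on a benchmark among the listed models."""
    ordered = [name for name, _ in rank_models(benchmark)]
    return ordered.index(model) + 1


def leaders():
    """Map each benchmark to the model with the highest reported score."""
    return {benchmark: rank_models(benchmark)[0][0] for benchmark in BENCHMARKS}


if __name__ == "__main__":
    for benchmark, model in leaders().items():
        print(f"{benchmark}: {model}")
```

Among the models listed here, Qwen-VL-Max leads DocVQA and MM-Bench-CN, which matches the rank markers in the table.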
---
News and Updates
- `2024.01.18` 💥💥💥 We introduce Qwen-VL-Max, our most capable model. It significantly surpasses all previous open-source LVLM models and performs on par with Gemini Ultra and GPT-4V in multiple text-image multimodal tasks. You can try the new model by visiting our web pages, 🤗 and 🤖.
- `2023.11.28` 🏆🏆🏆 Qwen-VL-Plus achieved the best performance on DocVQA with a single model, surpassing GPT-4V and PaLI-X, without using a model ensemble or an OCR pipeline. It is also a general model that can help you analyze and understand various tasks from directly input images.
- `2023.9.25` 🚀🚀🚀 We update Qwen-VL-Chat with more robust Chinese instruction-following ability, improved understanding of web pages and table images, and better dialogue performance (Touchstone: CN: 401.2->481.7, EN: 645.2->711.6).
- `2023.9.12` 😃😃😃 We now support finetuning on the Qwen-VL models, including full-parameter finetuning, LoRA and Q-LoRA.
- `2023.9.8` 👍👍👍 Thanks to camenduru for contributing the wonderful Colab. Everyone can use it as a local or online Qwen-VL-Chat-Int4 demo tutorial on one 12G GPU.
- `2023.9.5` 👏👏👏 Qwen-VL-Chat achieves SOTA results on the MME Benchmark, a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks.
- `2023.9.4` ⭐⭐⭐ The Qwen-VL series achieves SOTA results on Seed-Bench, a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating multimodal LLMs, covering both image and video understanding.
- `2023.9.1` 🔥🔥🔥 We release the TouchStone Evaluation, a comprehensive assessment of multimodal language models that covers not only basic recognition and comprehension but also literary creation. It uses strong LLMs as judges by converting multimodal information into text.
- `2023.8.31` 🌟🌟🌟 We release the Int4 quantized model for Qwen-VL-Chat, Qwen-VL-Chat-Int4, which has low memory costs and improved inference speed. There is also no significant performance degradation on the benchmark evaluation.
- `2023.8.22` 🎉🎉🎉 We release both Qwen-VL and Qwen-VL-Chat on ModelScope and Hugging Face. We also provide a paper with more details about the model, including training details and model performance.
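
As a quick illustration of the Touchstone change reported in the 2023.9.25 update, the sketch below computes the absolute and relative gains from the two score pairs quoted there. The helper name `relative_gain` and the `TOUCHSTONE` dictionary are hypothetical and only serve this example.

```python
def relative_gain(before, after):
    """Return the relative improvement from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after) Touchstone scores quoted in the 2023.9.25 news item.
TOUCHSTONE = {"CN": (401.2, 481.7), "EN": (645.2, 711.6)}

if __name__ == "__main__":
    for lang, (before, after) in TOUCHSTONE.items():
        gain = relative_gain(before, after)
        print(f"{lang}: +{after - before:.1f} points ({gain:.1f}% relative)")
```

On these numbers the Chinese score rises by about 20% and the English score by about 10%.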
---
Qwen-VL
Qwen-VL (Qwen Large Vision Language Model) is the multimodal version of Qwen (abbr. Tongyi Qianwen), the large model series proposed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs, and outputs text and bounding boxes. The features of Qwen-VL include:
- Strong performance: It significantly…