Qwen (Alibaba Cloud) · published Jan 26, 2025

Qwen2.5 VL! Qwen2.5 VL! Qwen2.5 VL!


January 26, 2025 · 21 min · 4364 words · Qwen Team

We release Qwen2.5-VL, the new flagship vision-language model of Qwen and a significant leap from the previous Qwen2-VL. To try the latest model, visit Qwen Chat and choose Qwen2.5-VL-72B-Instruct. We also open both base and instruct models in 3 sizes (3B, 7B, and 72B) on both Hugging Face and ModelScope. The key features include:

Understanding things visually: Qwen2.5-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.

Being agentic: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tools, which makes it capable of computer use and phone use.

Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos of over 1 hour, and it has a new ability to capture events by pinpointing the relevant video segments.

Visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.

Generating structured outputs: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, which benefits applications in finance, commerce, and similar fields.

Performance

We compare our models against SOTA models and against the best models of similar size. The flagship model, Qwen2.5-VL-72B-Instruct, achieves competitive performance on a series of benchmarks covering a range of domains and tasks, including college-level problems, math, document understanding, general question answering, video understanding, and visual agent tasks. Notably, Qwen2.5-VL shows significant advantages in understanding documents and diagrams, and it can act as a visual agent without task-specific fine-tuning.

Among the smaller models, Qwen2.5-VL-7B-Instruct outperforms GPT-4o-mini on a number of tasks, and Qwen2.5-VL-3B, a solution for edge AI, even outperforms the 7B model of our previous version, Qwen2-VL.

Model Capabilities

1. World-wide Image Recognition

Qwen2.5-VL has significantly enhanced its general image recognition capabilities, greatly expanding the range of image categories it can recognize. These include not only plants, animals, and landmarks such as famous mountains and rivers, but also IPs from film and TV series and a wide variety of products.

Example: Attractions Identification

User: What are these attractions? Please give their names in Chinese and English.

Qwen2.5-VL: These attractions are:
Top left: The Great Pyramids of Giza (埃及的吉萨大金字塔)
Top right: The Great Wall of China (中国的长城)
Bottom left: The Statue of Liberty (美国的自由女神像)
Bottom right: The Terracotta Army (中国的秦始皇兵马俑)

Example: Birds Identification

User: What are these birds? Please give their names in Chinese and English.

Qwen2.5-VL: These birds are:
Top Left: 红翅黑鹂 (Red-winged Blackbird)
Top Right: 蓝翅八色鸫 (Blue-winged Pitta)
Bottom Left: 小斑鸬鹚 (Little Pied Cormorant)
Bottom Right: 金头扇尾莺 (Golden-headed Cisticola)

Example: Cars Identification

User: What are these cars? Please give their names in Chinese and English.

Qwen2.5-VL: These cars are:
Top left: 劳斯莱斯 (Rolls-Royce)
Top right: 梅赛德斯-奔驰GLE (Mercedes-Benz GLE)
Bottom left: 法拉利Portofino M (Ferrari Portofino M)
Bottom right: 保时捷911 (Porsche 911)

Example: Celebrity Recognition

User: Who are these in this picture? Please give their names in Chinese and English.

Qwen2.5-VL: The four celebrities in this picture are:
Top left: 迈克尔·杰克逊 (Michael Jackson)
Top right: 迈克尔·乔丹 (Michael Jordan)
Bottom left: 成龙 (Jackie Chan)
Bottom right: 唐纳德·特朗普 (Donald Trump)

Example: Foods Identification

User: What are these foods? Please give their names in Chinese and English.

Qwen2.5-VL: Sure! Here are the names of the foods in the images, both in Chinese and English:

Top Left Image:
Chinese: 牛仔骨 (Niú zǎi gǔ)
English: Beef Short Ribs

Top Right Image:
Chinese: 蔬菜沙拉 (Shūcài shālā)
English: Vegetable Salad

Bottom Left Image:
Chinese: 红烧狮子头 (Hóngshāo shizi tóu)
English: Braised Meatballs

Bottom Right Image:
Chinese: 火锅 (Huǒguō)
English: Hot Pot (with various ingredients like meat, vegetables, and seafood)

Example: Products Identification

User: What are these products? Please give their names in Chinese and English.

Qwen2.5-VL: Sure, here are the products and their Chinese and English names:

Top Left Image:
English Name: Propolinse Mouthwash
Chinese Name: 蜂胶漱口水

Top Right Image:
English Name: George Caroll Moisturizing and Refreshing Shampoo
Chinese Name: 乔治卡罗尔保湿清爽洗发水

Bottom Left Image:
English Name: On The Body Sweet Love Perfume Water
Chinese Name: 洋甘菊甜蜜爱香水

Bottom Right Image:
English Name: Tiger Balm Mosquito Repellent
Chinese Name: 虎标驱蚊水

2. Precise Object Grounding

Qwen2.5-VL utilizes bounding boxes and point-based representations for grounding, enabling hierarchical positioning and standardized JSON output. This enhanced localization capability serves as a foundation for visual reasoning.

Example: Precise Object Grounding

User: Detect all motorcyclists in the image and return their locations in the form of coordinates. The format of output should be like {"bbox_2d": [x1, y1, x2, y2], "label": "motorcyclist", "sub_label": "wearing helmat" # or "not wearing helmat"}.

Qwen2.5-VL:
[
  {"bbox_2d": [341, 258, 397, 360], "label": "motorcyclist", "sub_label": "not wearing helmat "},
  {"bbox_2d": [212, 332, 274, 448], "label": "motorcyclist", "sub_label": "not wearing helmat "},
  {"bbox_2d": [66, 124, 112, 198], "label": "motorcyclist", "sub_label": "not wearing helmat "},
  {"bbox_2d": [5, 235, 63, 320], "label": "motorcyclist", "sub_label": "wearing helmat "}
]
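A reply in this format can be consumed directly by downstream code. The following is a minimal sketch, not part of the release: it parses the reply above with Python's standard `json` module, checks that each box is well formed, and counts riders per `sub_label`. The helper name `parse_detections` and the `area` field are hypothetical, introduced only for illustration. It assumes `bbox_2d` is ordered `[x1, y1, x2, y2]` with `(x1, y1)` the top-left corner, as the prompt requests; the coordinate reference frame is not stated in this excerpt.

```python
import json
from collections import Counter

# Model reply copied from the example above. The "helmat" spelling comes from
# the prompt, and the trailing space inside sub_label is kept as captured.
reply = """[
  {"bbox_2d": [341, 258, 397, 360], "label": "motorcyclist", "sub_label": "not wearing helmat "},
  {"bbox_2d": [212, 332, 274, 448], "label": "motorcyclist", "sub_label": "not wearing helmat "},
  {"bbox_2d": [66, 124, 112, 198], "label": "motorcyclist", "sub_label": "not wearing helmat "},
  {"bbox_2d": [5, 235, 63, 320], "label": "motorcyclist", "sub_label": "wearing helmat "}
]"""


def parse_detections(text):
    """Parse a JSON detection list and validate each bounding box.

    Hypothetical helper: assumes every item has "bbox_2d", "label",
    and "sub_label" keys, as in the example reply.
    """
    parsed = []
    for det in json.loads(text):
        x1, y1, x2, y2 = det["bbox_2d"]
        # Reject boxes whose corners are swapped or that have zero size.
        if not (x1 < x2 and y1 < y2):
            raise ValueError(f"degenerate box: {det['bbox_2d']}")
        parsed.append({
            "box": (x1, y1, x2, y2),
            "label": det["label"],
            # Strip stray whitespace so equal labels compare equal.
            "sub_label": det["sub_label"].strip(),
            "area": (x2 - x1) * (y2 - y1),
        })
    return parsed


detections = parse_detections(reply)
counts = Counter(d["sub_label"] for d in detections)
print(counts)
```

Normalizing `sub_label` before counting keeps the tally robust to whitespace differences in the generated strings.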

Example: Precise Object Grounding

User: Identify basketball players and detect the key points of their heads and hands, returning them in the form of points. The…

Excerpt shown — open the source for the full document.
