Qwen VLo: From "Understanding" the World to "Depicting" It
June 26, 2025 · Qwen Team
Introduction

The evolution of multimodal large models keeps pushing the boundaries of what we believe technology can achieve. From the initial QwenVL to the latest Qwen2.5 VL, we have made progress in enhancing the model’s ability to understand image content. Today, we are excited to introduce a new model, Qwen VLo, a unified multimodal understanding and generation model. This newly upgraded model not only “understands” the world but also generates high-quality recreations based on that understanding, bridging the gap between perception and creation. Note that this is a preview version, and you can access it through Qwen Chat. You can send a prompt like “Generate a picture of a cute cat” to generate an image, or upload an image of a cat and ask “Add a cap on the cat’s head” to modify it. The image generation process is shown below.

The Creative Process: Turn Your Imagination Into Reality

As demonstrated in the video showcasing the generative process, Qwen VLo employs a progressive generation method, gradually constructing the entire image from left to right and top to bottom. During this process, the model continuously refines and optimizes its predictions so that the final result is coherent and harmonious. This generative mechanism not only enhances visual quality but also gives users a more flexible and controllable creative experience.

From Understanding to Creation: Enhanced Multimodal Generation Capabilities

Qwen VLo has undergone a comprehensive upgrade in both its original multimodal understanding and its generation capabilities. It significantly deepens its comprehension of image content and achieves more accurate and consistent generation results.
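As a rough illustration of the left-to-right, top-to-bottom order described in the creative-process paragraph above, here is a toy sketch. It is not Qwen VLo's implementation, which this post does not describe. The names `raster_order`, `generate_progressively`, and `toy_predict` are hypothetical, and the "prediction" is a placeholder arithmetic rule that only looks at cells already filled in. The refinement step mentioned above is not modeled.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Canvas = List[List[Optional[int]]]


def raster_order(rows: int, cols: int) -> Iterator[Tuple[int, int]]:
    """Yield (row, col) positions top to bottom, left to right within each row."""
    for r in range(rows):
        for c in range(cols):
            yield (r, c)


def generate_progressively(
    rows: int,
    cols: int,
    predict: Callable[[Canvas, int, int], int],
) -> Canvas:
    """Fill a canvas one cell at a time in raster order.

    `predict` sees the partially filled canvas, so each new cell can
    depend only on cells that were generated before it.
    """
    canvas: Canvas = [[None] * cols for _ in range(rows)]
    for r, c in raster_order(rows, cols):
        canvas[r][c] = predict(canvas, r, c)
    return canvas


def toy_predict(canvas: Canvas, r: int, c: int) -> int:
    """Placeholder rule (an assumption, not a real model): combine the
    already-generated left and upper neighbours."""
    left = canvas[r][c - 1] if c > 0 else 0
    above = canvas[r - 1][c] if r > 0 else 0
    return (left + above + 1) % 256


if __name__ == "__main__":
    for row in generate_progressively(2, 3, toy_predict):
        print(row)
```

The point of the sketch is only the ordering: every cell is produced after the cells to its left and above it, which is why a partially generated image appears to fill in from the top-left corner.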
Below are the core highlights of Qwen VLo:

More Precise Content Understanding and Recreation

Previous multimodal models often struggled with semantic inconsistencies during the generation process, such as misinterpreting a car as another object or failing to retain key structural features of the original image. Qwen VLo, equipped with enhanced detail-capturing abilities, maintains a high level of semantic consistency throughout the generation process. For instance, when a user inputs a photo of a car and requests a “color change,” Qwen VLo can accurately identify the car model, preserve its original structure, and naturally transform its color style. The generated result meets expectations while maintaining realism.

Support for Open-Ended Instruction-Based Editing

Users can provide creative instructions in natural language, such as “change this painting to a Van Gogh style,” “make this photo look like it’s from the 19th century,” or “add a sunny sky to this image.” Qwen VLo can flexibly respond to these open-ended commands and produce results that align with user expectations. Whether it’s artistic style transfer, scene reconstruction, or detailed touch-ups, the model handles them all with ease. Even traditional visual perception tasks, such as predicting depth maps, segmentation maps, detection maps, and edge information, can be accomplished through simple editing instructions. Qwen VLo can also handle more complex instructions, such as modifying objects, editing text, and changing backgrounds, all within a single command.

Multilingual Instruction Support

Qwen VLo supports multiple languages, including Chinese and English, breaking down language barriers and providing a unified, convenient interaction experience for global users. Regardless of the language you use, simply describe your needs, and the model will quickly understand and deliver the desired output.
Demo Cases

Qwen VLo acts like a human artist, using its understanding to turn imagination into reality. Below are some examples for reference. Qwen VLo is capable of directly generating images and modifying them by replacing backgrounds, adding subjects, performing style transfers, and even executing extensive modifications based on open-ended instructions, as well as handling detection and segmentation tasks.

A cute Shiba Inu
User 生成一个可爱的柴犬 Translation: Generate a cute Shiba Inu
Qwen-VLo
User 背景改成草原 Translation: Change the background to a grassland
Qwen-VLo
User 给它带上红色帽子和黑色透明墨镜,帽子上写着“QwenVLo” Translation: Put a red hat and black transparent sunglasses on it, with ‘QwenVLo’ written on the hat
Qwen-VLo
User 变成吉卜力风格 Translation: Switch to Ghibli style
Qwen-VLo
User 变成3d Q版风格 Translation: Switch to 3D Q-version style
Qwen-VLo
User 把它放到水晶球里 Translation: Place it inside a crystal ball
Qwen-VLo
User 桌面上摆着这个水晶球,生成以一个人的第一视角在公园的圆形咖啡桌上在笔记本上画画 Translation: Place this crystal ball on a desk and generate an image from a first-person perspective of someone drawing on a notebook placed on a round coffee table in a park
Qwen-VLo
User 用蓝色的蒙版检测框框出图中的笔 Translation: Use a blue mask to detect and frame the pen in the picture
Qwen-VLo
User 用粉色的mask分割出图中的狗狗边缘 Translation: Use a pink mask to segment the edge of the dog in the picture
Qwen-VLo
Qwen VLo can reinterpret and recreate based on its understanding, allowing for greater flexibility in style changes and transfers, such as transforming cartoons into realistic images or turning figures into balloons, among other creative outputs.

Style Conversion
User 变成真实照片 Translation: Turn into a real photo
Qwen-VLo
User 背景换成艾弗尔铁塔 Translation: Change the background to the Eiffel Tower
Qwen-VLo
User 变成气球飘到空中 Translation: Turn into a balloon floating in the air
Qwen-VLo
User 把西瓜换成榴莲 Translation: Replace the watermelon with durian
Qwen-VLo
Style Conversion
User Convert the couple in the photo into a minimalist flat illustration style Q-version sticker, retaining the facial features, with thick white borders, figures extending beyond the circular area, the circular area filled with a solid color, transparent background, and an overall cute style.
Qwen-VLo
User Convert the couple in the photo into a detailed, exquisite, and adorable 3D rendered…