zai-org/GLM-V
Description: GLM-4.6V/4.5V/4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Language: Python
License: Apache-2.0
Stars: 2326
Forks: 171
Open issues: 12
Created: 2025-06-28T08:44:06Z
Pushed: 2026-05-16T05:42:14Z
Default branch: main
Fork: no
Archived: no
README:
GLM-V
[Read this in Chinese.](./README_zh.md)
👋 Join our WeChat and Discord communities.
📖 Check out the GLM-4.6V blog and GLM-4.5V & GLM-4.1V paper.
📍 Try online or use the API.
Introduction
Vision-language models (VLMs) have become a cornerstone of intelligent systems. As real-world AI tasks grow more complex, VLMs need reasoning capabilities that go beyond basic multimodal perception, with better accuracy, comprehensiveness, and intelligence, to enable complex problem solving, long-context understanding, and multimodal agents.
Through our open-source work, we aim to explore the technological frontier together with the community while empowering more developers to create exciting and innovative applications.
This open-source repository contains our `GLM-4.6V`, `GLM-4.5V` and `GLM-4.1V` series models. For performance and details, see [Model Overview](#model-overview). For known issues, see [Fixed and Remaining Issues](#fixed-and-remaining-issues).
Project Updates
- News: 2026/04/02: We released GLM-5V-Turbo and GLM-skills.
- News: 2026/03/28: We have released multiple GLM-V related Skills, covering several specialized areas such as GLM-V-Grounding and GLM-V-Prompt-Gen. You are welcome to try them [here](skills).
- News: 2025/11/10: We released UI2Code^N, an RL-enhanced UI coding model with UI-to-code, UI-polish, and UI-edit capabilities. The model is trained on GLM-4.1V-Base. Check it out here.
- News: 2025/10/27: We released Glyph, a framework for scaling the context length through visual-text compression. The Glyph model is trained on GLM-4.1V-Base. Check it out here.
- News: 2025/08/11: We released GLM-4.5V with significant improvements across multiple benchmarks. We also open-sourced our handcrafted desktop assistant app for debugging. Once connected to GLM-4.5V, it can capture visual information from your PC screen via screenshots or screen recordings. Feel free to try it out or customize it into your own multimodal assistant. Click here to download the installer or [build from source](examples/vllm-chat-helper/README.md)!
- News: 2025/07/16: We have open-sourced the VLM Reward System used to train GLM-4.1V-Thinking. View the [code repository](glmv_reward) and run it locally: `python examples/reward_system_demo.py`.
- News: 2025/07/01: We released GLM-4.1V-9B-Thinking and its technical report.
Model Implementation Code
- GLM-4.5V and GLM-4.6V model algorithm: see the full implementation in transformers.
- GLM-4.1V-9B-Thinking model algorithm: see the full implementation in transformers.
- Both models share identical multimodal preprocessing but use different conversation templates, so take care to use the right template for each model.
Model Downloads
| Model                | Download Links                | Type             |
|----------------------|-------------------------------|------------------|
| GLM-4.6V             | 🤗 Hugging Face 🤖 ModelScope | Hybrid Reasoning |
| GLM-4.6V-FP8         | 🤗 Hugging Face 🤖 ModelScope | Hybrid Reasoning |
| GLM-4.6V-Flash       | 🤗 Hugging Face 🤖 ModelScope | Hybrid Reasoning |
| GLM-4.5V             | 🤗 Hugging Face 🤖 ModelScope | Hybrid Reasoning |
| GLM-4.5V-FP8         | 🤗 Hugging Face 🤖 ModelScope | Hybrid Reasoning |
| GLM-4.1V-9B-Thinking | 🤗 Hugging Face 🤖 ModelScope | Reasoning        |
| GLM-4.1V-9B-Base     | 🤗 Hugging Face 🤖 ModelScope | Base             |
+ GGUF-format weights for GLM-V are also available on Hugging Face. You can download them from here.
Use Cases
Grounding
GLM-4.5V / GLM-4.6V / GLM-4.1V have precise grounding capabilities. Given a prompt that asks for the location of a specific object, the model reasons step by step and identifies the bounding boxes of the target object. The query prompt supports complex descriptions of the target object as well as specified output formats, for example:

> - Help me to locate `<expr>` in the image and give me its bounding boxes.
> - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description. `<expr>`

Here, `<expr>` is the description of the target object. The output bounding box is a quadruple $[x_1,y_1,x_2,y_2]$ composed of the coordinates of the top-left and bottom-right corners, where each value is normalized by the image width (for x) or height (for y) and scaled by 1000.
In the response, the special tokens `<|begin_of_box|>` and `<|end_of_box|>` mark the image bounding box in the answer. The bracket style may vary ([], [[]], (), <>, etc.), but the meaning is the same: to enclose the coordinates of the box.
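As a rough illustration of the coordinate convention described above, the following sketch pulls boxes out of a model response and maps them from the 0-1000 normalized scale back to pixels. It is not code from this repository: the helper names `extract_boxes` and `to_pixels` are hypothetical, and it assumes the box tokens are `<|begin_of_box|>` and `<|end_of_box|>` and that each box is written as four numbers, whatever bracket style surrounds them.

```python
import re

# Assumption: boxes are wrapped in <|begin_of_box|> ... <|end_of_box|>.
BOX_SPAN = re.compile(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", re.DOTALL)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_boxes(response: str) -> list[tuple[float, float, float, float]]:
    """Return every (x1, y1, x2, y2) quadruple found between the box tokens."""
    boxes = []
    for span in BOX_SPAN.findall(response):
        # Ignore the bracket style and keep only the numbers.
        values = [float(v) for v in NUMBER.findall(span)]
        # Group consecutive values into quadruples; drop a trailing partial group.
        for i in range(0, len(values) - 3, 4):
            boxes.append(tuple(values[i:i + 4]))
    return boxes


def to_pixels(box, image_width: int, image_height: int) -> tuple[int, int, int, int]:
    """Map a box on the 0-1000 normalized scale to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (
        round(x1 / 1000 * image_width),   # x values scale with image width
        round(y1 / 1000 * image_height),  # y values scale with image height
        round(x2 / 1000 * image_width),
        round(y2 / 1000 * image_height),
    )


# Hypothetical response text, used only to exercise the helpers.
response = "The cat is at <|begin_of_box|>[[100, 250, 500, 750]]<|end_of_box|>."
for box in extract_boxes(response):
    print(to_pixels(box, image_width=1920, image_height=1080))
```

Extracting bare numbers rather than matching one bracket style keeps the parser tolerant of the `[]`, `[[]]`, `()`, and `<>` variants mentioned above, at the cost of assuming no unrelated numbers appear between the box tokens.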
GUI Agent
examples/gui-agent: Demonstrates prompt construction and output handling for GUI Agents, including strategies for mobile, PC, and web. Prompt templates differ between GLM-4.1V and GLM-4.5V.
Quick Demo
examples/vlm-helper: A desktop assistant for GLM multimodal models…