Model · Zhipu AI (GLM) · published Aug 10, 2025

zai-org/GLM-4.5V

published Aug 10, 2025; task: image-text-to-text; license: MIT; library: transformers; params: 108B; downloads: 167k; likes: 718

GLM-4.5V

This model is part of the GLM-V family of models, introduced in the paper GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.

Introduction & Model Overview

Vision-language models (VLMs) have become a cornerstone of intelligent systems. As real-world AI tasks grow more complex, VLMs need reasoning capabilities that go beyond basic multimodal perception, with better accuracy, comprehensiveness, and intelligence, to support complex problem solving, long-context understanding, and multimodal agents.

Through our open-source work, we aim to explore the technological frontier together with the community while empowering more developers to create exciting and innovative applications.

This Hugging Face repository hosts the `GLM-4.5V` model, part of the `GLM-V` series.

GLM-4.5V

GLM-4.5V is based on ZhipuAI’s next-generation flagship text foundation model GLM-4.5-Air (106B parameters, 12B active). It continues the technical approach of GLM-4.1V-Thinking, achieving SOTA performance among models of the same scale on 42 public vision-language benchmarks. It covers common tasks such as image, video, and document understanding, as well as GUI agent operations.
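
The 106B total / 12B active split quoted above can be turned into rough sizing numbers. The sketch below is plain arithmetic, not a measurement: the byte widths are the standard sizes of the named dtypes, the helper name `weight_memory_gb` is made up for illustration, and the vision encoder, KV cache, and activations are not counted.

```python
# Back-of-the-envelope sizing. Parameter counts come from the text above;
# bytes-per-parameter values are standard dtype widths, not measurements.
TOTAL_PARAMS = 106e9   # GLM-4.5-Air total parameters (from the model card)
ACTIVE_PARAMS = 12e9   # parameters active per token (from the model card)

BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Raw weight size in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"active fraction per token: {active_fraction:.1%}")
print(f"bf16 weights: {weight_memory_gb(TOTAL_PARAMS, 'bf16'):.0f} GB")
print(f"fp8 weights:  {weight_memory_gb(TOTAL_PARAMS, 'fp8'):.0f} GB")
```

Only about a ninth of the text backbone's weights take part in each token's forward pass, but all of them still have to be resident in memory, so the total count drives hardware requirements while the active count drives per-token compute.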

[Figure: GLM-4.5V Benchmarks]

Beyond benchmark performance, GLM-4.5V focuses on real-world usability. Through efficient hybrid training, it can handle diverse types of visual content, enabling full-spectrum vision reasoning, including:

  • Image reasoning (scene understanding, complex multi-image analysis, spatial recognition)
  • Video understanding (long video segmentation and event recognition)
  • GUI tasks (screen reading, icon recognition, desktop operation assistance)
  • Complex chart & long document parsing (research report analysis, information extraction)
  • Grounding (precise visual element localization)
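
For the grounding task in the list above, a client usually has to map the model's box coordinates back to image pixels. The sketch below shows one way to do that. It rests on an assumption that this excerpt does not confirm: that a box arrives as four integers `[x1, y1, x2, y2]` on a 0-1000 scale relative to the image size. The helper name and the sample reply are hypothetical.

```python
import re

# ASSUMPTION: a box appears in the reply as four integers "[x1, y1, x2, y2]"
# on a 0-1000 scale relative to the image width and height.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")
NORMALIZED_RANGE = 1000  # assumed scale; check the official documentation


def boxes_to_pixels(reply: str, width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Find every [x1, y1, x2, y2] group in `reply` and scale it to pixels."""
    boxes = []
    for match in BOX_PATTERN.finditer(reply):
        x1, y1, x2, y2 = (int(g) for g in match.groups())
        boxes.append((
            round(x1 * width / NORMALIZED_RANGE),
            round(y1 * height / NORMALIZED_RANGE),
            round(x2 * width / NORMALIZED_RANGE),
            round(y2 * height / NORMALIZED_RANGE),
        ))
    return boxes


# Hand-written reply text for illustration; this is not real model output.
sample_reply = "The button is at [100, 250, 300, 500]."
print(boxes_to_pixels(sample_reply, width=1920, height=1080))
```

If the actual output format differs (different delimiters or a different scale), only `BOX_PATTERN` and `NORMALIZED_RANGE` need to change.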

The model also introduces a Thinking Mode switch, allowing users to balance between quick responses and deep reasoning. This switch works the same as in the GLM-4.5 language model.
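
As a rough illustration of how such a switch might be driven from client code, the sketch below builds an OpenAI-style chat request body with a per-request thinking flag. It is not the documented interface: it assumes the serving stack forwards a `chat_template_kwargs` field to the chat template and that the template honors an `enable_thinking` flag. The function name and the example URL are placeholders.

```python
import json

MODEL_ID = "zai-org/GLM-4.5V"


def build_chat_request(prompt: str, image_url: str, thinking: bool) -> dict:
    """Build an OpenAI-style chat completion body with a thinking toggle.

    ASSUMPTION: the server forwards `chat_template_kwargs` to the chat
    template, and the template honors an `enable_thinking` flag. Verify both
    against the documentation of the server you use.
    """
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


# Placeholder image URL for illustration only.
fast = build_chat_request("describe this image", "https://example.com/cat.png", thinking=False)
print(json.dumps(fast["chat_template_kwargs"]))
```

Keeping the flag as a per-request argument lets one deployment serve both quick lookups and slower, deeper reasoning without reloading the model.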

GLM-4.1V-9B

*Contextual information about GLM-4.1V-9B is provided for completeness, as it is part of the GLM-V series and foundational to GLM-4.5V's development.*

Built on the GLM-4-9B-0414 foundation model, the GLM-4.1V-9B-Thinking model introduces a reasoning paradigm and uses RLCS (Reinforcement Learning with Curriculum Sampling) to comprehensively enhance model capabilities. It achieves the strongest performance among 10B-level VLMs and matches or surpasses the much larger Qwen-2.5-VL-72B in 18 benchmark tasks.

We also open-sourced the base model GLM-4.1V-9B-Base to support researchers in exploring the limits of vision-language model capabilities.

[Figure: Reinforcement Learning with Curriculum Sampling (RLCS)]

Compared with the previous generation CogVLM2 and GLM-4V series, GLM-4.1V-Thinking brings:

  1. The series’ first reasoning-focused model, excelling in multiple domains beyond mathematics.
  2. 64k context length support.
  3. Support for any aspect ratio and up to 4k image resolution.
  4. A bilingual (Chinese/English) open-source version.

GLM-4.1V-9B-Thinking integrates the Chain-of-Thought reasoning mechanism, improving accuracy, richness, and interpretability. It leads on 23 out of 28 benchmark tasks at the 10B parameter scale, and outperforms Qwen-2.5-VL-72B on 18 tasks despite its smaller size.

[Figure: GLM-4.1V-9B Benchmarks]

Project Updates

  • 🔥 News: 2025/08/11: We released GLM-4.5V with significant improvements across multiple benchmarks. We also open-sourced our handcrafted desktop assistant app for debugging. Once connected to GLM-4.5V, it can capture visual information from your PC screen via screenshots or screen recordings. Feel free to try it out or customize it into your own multimodal assistant. You can download the installer or build the app from source.
  • News: 2025/07/16: We have open-sourced the VLM Reward System used to train GLM-4.1V-Thinking. View the code repository and run it locally with `python examples/reward_system_demo.py`.
  • News: 2025/07/01: We released GLM-4.1V-9B-Thinking and its technical report.

Model Implementation Code

  • GLM-4.5V model algorithm: see the full implementation in transformers.
  • GLM-4.1V-9B-Thinking model algorithm: see the full implementation in transformers.
  • Both models share identical multimodal preprocessing but use different conversation templates, so apply the template that matches the model you load.

Usage

Environment Installation

For SGLang and transformers:

pip install "transformers>=4.57.1"
pip install "sglang>=0.5.3"

For vLLM:

pip install "vllm>=0.10.2"
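
Before loading a model of this size, it can help to confirm that the installed packages meet the minimum versions listed above. The following is a minimal sketch using only the standard library. The helper names `as_tuple` and `check` are illustrative, and the version parsing is deliberately simplistic (leading numeric components only), so treat it as a sanity check and not as a full version-specifier implementation.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the install commands above.
MINIMUMS = {"transformers": "4.57.1", "sglang": "0.5.3", "vllm": "0.10.2"}


def as_tuple(v: str) -> tuple[int, ...]:
    """Turn '4.57.1' into (4, 57, 1); stop at the first non-numeric part."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check(package: str, minimum: str) -> str:
    """Return 'ok', 'too old', or 'missing' for one package."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return "missing"
    return "ok" if as_tuple(installed) >= as_tuple(minimum) else "too old"


for name, minimum in MINIMUMS.items():
    print(f"{name}>={minimum}: {check(name, minimum)}")
```

Comparing integer tuples avoids the string-comparison trap where `"0.10.2"` sorts before `"0.9.9"`.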

Quick Start with Transformers

from transformers import AutoProcessor, Glm4vMoeForConditionalGeneration
import torch

MODEL_PATH = "zai-org/GLM-4.5V"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png"
            },
            {
                "type": "text",
                "text": "describe this image"
            }
        ],
    }
]
processor =…
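
The `messages` structure in the quick start can also be built programmatically. The sketch below is a small helper, with the hypothetical name `build_user_message`, that emits the same `role` / `content` layout for one or more image URLs followed by a text prompt. Passing several image entries in a single turn is an assumption based on the multi-image analysis capability described earlier; the URLs are placeholders.

```python
def build_user_message(image_urls: list[str], prompt: str) -> list[dict]:
    """Return a one-turn `messages` list: one image entry per URL, then the text prompt."""
    content = [{"type": "image", "url": url} for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Placeholder URLs for illustration only.
messages = build_user_message(
    ["https://example.com/chart_q1.png", "https://example.com/chart_q2.png"],
    "compare these two charts",
)
print(len(messages[0]["content"]))
```

The text entry is placed last, matching the order used in the quick-start example.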

Excerpt shown — open the source for the full document.
