
zai-org/CogAgent

Description: An open-sourced end-to-end VLM-based GUI Agent

Language: Python

License: Apache-2.0

Stars: 1183

Forks: 99

Open issues: 29

Created: 2023-11-28T09:28:08Z

Pushed: 2025-04-04T13:29:55Z

Default branch: main

Fork: no

Archived: no

README:

CogAgent: An open-sourced VLM-based GUI Agent

[中文文档](README_zh.md)

  • 🔥 🆕 December 2024: We open-sourced the latest version of the model, CogAgent-9B-20241220. Compared to the previous version of CogAgent, CogAgent-9B-20241220 features significant improvements in GUI perception, reasoning accuracy, action space completeness, task universality, and generalization. It supports bilingual (Chinese and English) interaction through both screen captures and natural language.

  • 🏆 June 2024: CogAgent was accepted by CVPR 2024 and recognized as a conference Highlight (top 3%).
  • December 2023: We open-sourced the first GUI Agent, CogAgent (with the former repository available here), and published the corresponding paper: 📖 [CogAgent Paper](https://arxiv.org/abs/2312.08914).

Model Introduction

| Model | Model Download Links | Technical Documentation | Online Demo |
|:--------------------:|:--------------------:|--------------------|--------------------|
| cogagent-9b-20241220 | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel 🧩 Modelers (Ascend) | 📄 Official Technical Blog 📘 Practical Guide (Chinese) | 🤗 HuggingFace Space 🤖 ModelScope Space 🧩 Modelers Space (Ascend) |

Model Overview

The CogAgent-9B-20241220 model is based on GLM-4V-9B, a bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, CogAgent-9B-20241220 achieves significant advancements in GUI perception, inference prediction accuracy, action space completeness, and generalizability across tasks. The model supports bilingual (Chinese and English) interaction with both screenshots and language input. This version of the CogAgent model is already used in ZhipuAI's GLM-PC product. We hope the release of this model helps researchers and developers advance the research and applications of GUI agents based on vision-language models.

Capability Demonstrations

The CogAgent-9B-20241220 model has achieved state-of-the-art results across multiple platforms and categories in GUI Agent tasks and GUI Grounding Benchmarks. In the CogAgent-9B-20241220 Technical Blog, we compared it against API-based commercial models (GPT-4o-20240806, Claude-3.5-Sonnet), commercial API + GUI Grounding models (GPT-4o + UGround, GPT-4o + OS-ATLAS), and open-source GUI Agent models (Qwen2-VL, ShowUI, SeeClick). The results show that CogAgent leads in GUI localization (ScreenSpot), single-step operations (OmniAct), and the Chinese step-wise in-house benchmark (CogAgentBench-basic-cn). On multi-step operations (OSWorld), it also leads, with only a slight disadvantage compared to Claude-3.5-Sonnet, which specializes in Computer Use, and GPT-4o combined with external GUI Grounding models.

CogAgent wishes you a Merry Christmas! Let the large model automatically send Christmas greetings to your friends.

Want to open an issue? Let CogAgent help you send an email.

Table of Contents

  • [CogAgent](#cogagent)
  • [Model Introduction](#model-introduction)
  • [Model Overview](#model-overview)
  • [Capability Demonstrations](#capability-demonstrations)
  • [Inference and Fine-tuning Costs](#inference-and-fine-tuning-costs)
  • [Model Inputs and Outputs](#model-inputs-and-outputs)
  • [User Input](#user-input)
  • [Model Output](#model-output)
  • [An Example](#an-example)
  • [Notes](#notes)
  • [Running the Model](#running-the-model)
  • [Environment Setup](#environment-setup)
  • [Running an Agent APP Example](#running-an-agent-app-example)
  • [Fine-tuning the Model](#fine-tuning-the-model)
  • [Previous Work](#previous-work)
  • [License](#license)
  • [Citation](#citation)
  • [Research and Development Team \& Acknowledgements](#research-and-development-team---acknowledgements)

Inference and Fine-tuning Costs

+ The model requires at least 29GB of VRAM for inference at BF16 precision. Using INT4 precision for inference is not recommended due to significant performance loss. The VRAM usage for INT4 inference is about 8GB, while for INT8 inference it is about 15GB. In the `inference/cli_demo.py` file, the two lines for INT4 and INT8 inference are commented out. You can uncomment them to use INT4 or INT8 inference. This solution is only supported on NVIDIA devices.
+ All GPU references above refer to A100 or H100 GPUs. For other devices, you need to calculate the required GPU/CPU memory accordingly.
+ During SFT (Supervised Fine-Tuning), this codebase freezes the Vision Encoder, uses a batch size of 1, and trains on 8 * A100 GPUs. The total input tokens (including images, which account for 1600 tokens) add up to 2048 tokens. This codebase cannot conduct SFT fine-tuning without freezing the Vision Encoder. For LoRA…
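
The memory figures above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of this repository: the names `VRAM_GB`, `pick_precision`, and `text_token_budget` are hypothetical, and the numbers are the approximate A100/H100 values quoted above, so other devices may need different thresholds.

```python
from typing import Optional

# Approximate inference VRAM (in GB) quoted above for A100/H100 GPUs.
# These are rough figures from the README, not measured guarantees.
VRAM_GB = {"bf16": 29, "int8": 15, "int4": 8}


def pick_precision(available_gb: float, allow_int4: bool = False) -> Optional[str]:
    """Return the highest precision that fits in `available_gb`, or None.

    INT4 is skipped unless explicitly allowed, because the README advises
    against it due to significant performance loss.
    """
    for name in ("bf16", "int8", "int4"):
        if name == "int4" and not allow_int4:
            continue
        if available_gb >= VRAM_GB[name]:
            return name
    return None


def text_token_budget(total_tokens: int = 2048, image_tokens: int = 1600) -> int:
    """Tokens left for text in an SFT sample after the image tokens."""
    return total_tokens - image_tokens


if __name__ == "__main__":
    print(pick_precision(24))   # a 24GB card fits INT8 but not BF16
    print(text_token_budget())  # 2048 - 1600
```

With the SFT settings described above, the image takes 1600 of the 2048 input tokens, which leaves 448 tokens for the text part of each sample.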
