Repo: OpenBMB (MiniCPM), published May 13, 2025

OpenBMB/AgentCPM-GUI

Python


Description: AgentCPM-GUI: An on-device GUI agent for operating Android apps, enhancing reasoning ability with reinforcement fine-tuning for efficient task execution.

Language: Python

License: Apache-2.0

Stars: 1375

Forks: 132

Open issues: 2

Created: 2025-05-13T04:11:16Z

Pushed: 2026-01-11T08:24:15Z

Default branch: main

Fork: no

Archived: no

README:


News

  • [2025-06-03] 📄📄📄 We have released the technical report of AgentCPM-GUI! Check it out here.
  • [2025-05-13] 🚀🚀🚀 We have open-sourced AgentCPM-GUI, an on-device GUI agent capable of operating Chinese & English apps and equipped with RFT-enhanced reasoning abilities.

Overview

AgentCPM-GUI is an open-source on-device LLM agent model jointly developed by THUNLP, Renmin University of China and ModelBest. Built on MiniCPM-V with 8 billion parameters, it accepts smartphone screenshots as input and autonomously executes user-specified tasks.

Key features include:

  • High-quality GUI grounding — Pre-training on a large-scale bilingual Android dataset significantly boosts localization and comprehension of common GUI widgets (buttons, input boxes, labels, icons, etc.).
  • Chinese-app operation — The first open-source GUI agent fine-tuned for Chinese apps, covering 30+ popular titles such as Amap, Dianping, bilibili and Xiaohongshu.
  • Enhanced planning & reasoning — Reinforcement fine-tuning (RFT) lets the model “think” before outputting an action, greatly improving success on complex tasks.
  • Compact action-space design — An optimized action space and concise JSON format reduce the average action length to 9.7 tokens, boosting on-device inference efficiency.
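To make the compact-format point concrete, the sketch below serializes one action in two ways with Python's standard json module. The action dict mirrors the POINT output shown in the Quick Start section. The comparison counts characters, not tokens, so it only illustrates the direction of the saving and does not reproduce the 9.7-token figure.

```python
import json

# Example action, mirroring the POINT output format shown in Quick Start.
action = {"POINT": [729, 69]}

# Default serialization puts a space after every ',' and ':'.
verbose = json.dumps(action)

# Compact serialization, with the same separators the prompt-building code uses.
compact = json.dumps(action, ensure_ascii=False, separators=(",", ":"))

print(verbose)
print(compact)
print(len(verbose) - len(compact), "characters saved")
```

The saving per action is small in absolute terms, but it applies to every step of a multi-step task, which is where on-device decoding time accumulates.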

Demo Case (1x speed):

https://github.com/user-attachments/assets/694d3c2c-12ce-4084-8feb-4937ca9ad247

Quick Start

Install dependencies

git clone https://github.com/OpenBMB/AgentCPM-GUI
cd AgentCPM-GUI
conda create -n gui_agent python=3.11
conda activate gui_agent
pip install -r requirements.txt

Download the model

Download AgentCPM-GUI from Hugging Face and place it in model/AgentCPM-GUI.

Hugging Face Inference

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from PIL import Image
import json

# 1. Load the model and tokenizer
model_path = "model/AgentCPM-GUI" # model path
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to("cuda:0")

# 2. Build the input
instruction = "请点击屏幕上的‘会员’按钮"  # "Please tap the 'Member' button on the screen"
image_path = "assets/test.jpeg"
image = Image.open(image_path).convert("RGB")

# 3. Resize the longer side to 1120 px to save compute & memory
def __resize__(origin_img):
    resolution = origin_img.size
    w, h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
    return img

image = __resize__(image)

# 4. Build the message format
# The trailing "当前屏幕截图:" means "Current screenshot:" and is followed by the image
messages = [{
    "role": "user",
    "content": [
        f"{instruction}\n当前屏幕截图:",
        image
    ]
}]

# 5. Inference
ACTION_SCHEMA = json.load(open('eval/utils/schema/schema.json', encoding="utf-8"))
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"])) # enable/disable thought by setting it to "required"/"optional"
ACTION_SCHEMA = dict(items)
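# English gloss of the Chinese system prompt below (the prompt text itself is left unchanged):
#   Role: You are an agent familiar with Android touchscreen GUI operation. Based on the user's
#         question, analyze the GUI elements and layout of the current screen and generate the
#         corresponding action.
#   Task: Given the user's question and the current screenshot, output the next action.
#   Rule: Output compact JSON; the output action must follow the Schema constraints.
SYSTEM_PROMPT = f'''# Role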
你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。

# Task
针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。

# Rule
- 以紧凑JSON格式输出
- 输出操作必须遵循Schema约束

# Schema
{json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''

outputs = model.chat(
    image=None,
    msgs=messages,
    system_prompt=SYSTEM_PROMPT,
    tokenizer=tokenizer,
    temperature=0.1,
    top_p=0.3,
    n=1,
)

# 6. Output
print(outputs)

Expected output:

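{"thought":"任务目标是点击屏幕上的‘会员’按钮。当前界面显示了应用的推荐页面,顶部有一个导航栏。点击‘会员’按钮可以访问应用的会员相关内容。","POINT":[729,69]}

In English, the thought reads: "The task goal is to tap the 'Member' button on the screen. The current screen shows the app's recommendation page, with a navigation bar at the top. Tapping the 'Member' button gives access to the app's membership-related content."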
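The thought field appears in the output above because the inference script inserts a "required": ["thought"] entry into the action schema before building the system prompt. A minimal sketch of that toggle follows. BASE_SCHEMA is a hypothetical stand-in, not the contents of eval/utils/schema/schema.json, and with_thought is an illustrative helper name that does not exist in the repository. The sketch reads the script's comment as meaning that the inserted key is either "required" or "optional".

```python
import json

# Hypothetical stand-in for eval/utils/schema/schema.json (the real file differs).
BASE_SCHEMA = {
    "type": "object",
    "description": "Next action to execute",
    "additionalProperties": False,
    "properties": {
        "thought": {"type": "string"},
        "POINT": {"type": "array"},
    },
}


def with_thought(schema: dict, mode: str = "required", index: int = 3) -> dict:
    """Return a copy of `schema` with (mode, ["thought"]) inserted at `index`.

    Mirrors the insertion done in the inference script: mode is "required"
    to enable the thought field, or "optional" to disable it.
    """
    if mode not in ("required", "optional"):
        raise ValueError("mode must be 'required' or 'optional'")
    items = list(schema.items())
    items.insert(index, (mode, ["thought"]))
    return dict(items)


schema = with_thought(BASE_SCHEMA, "required")
# Serialize compactly, as the system prompt does.
print(json.dumps(schema, ensure_ascii=False, separators=(",", ":")))
```

Building a new dict instead of mutating the loaded schema keeps the original available, so both variants can be prompted side by side when comparing outputs with and without reasoning.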

Note: AgentCPM-GUI outputs relative coordinates in the range 0 to 1000. To convert between relative and absolute (pixel) coordinates:

rel_x, rel_y = [int(abs_x / width * 1000), int(abs_y / height * 1000)]
abs_x, abs_y = [int(rel_x / 1000 * width), int(rel_y / 1000 * height)]

where width and height refer to the original width and height of the image, respectively.
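The two conversions can be wrapped in small helpers and applied directly to a parsed model output. The sketch below is an illustration only: rel_to_abs, abs_to_rel, and the 1080x2400 screen size are assumptions chosen for the example, and the sample output string is the expected output above with the thought text shortened.

```python
import json


def rel_to_abs(rel_x: int, rel_y: int, width: int, height: int) -> tuple[int, int]:
    """Map model coordinates (0-1000) to pixel coordinates on the original image."""
    return int(rel_x / 1000 * width), int(rel_y / 1000 * height)


def abs_to_rel(abs_x: int, abs_y: int, width: int, height: int) -> tuple[int, int]:
    """Map pixel coordinates on the original image to model coordinates (0-1000)."""
    return int(abs_x / width * 1000), int(abs_y / height * 1000)


# Sample model output (thought text shortened for the example).
raw_output = '{"thought":"...","POINT":[729,69]}'
action = json.loads(raw_output)

# Assumed size of the original, un-resized screenshot.
width, height = 1080, 2400

tap_x, tap_y = rel_to_abs(*action["POINT"], width, height)
print(tap_x, tap_y)  # 787 165
```

Because the coordinates are relative, the 1120 px resize applied before inference does not affect the result as long as width and height are those of the original screenshot.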

vLLM Inference

# Launch the vLLM server
# If you run out of VRAM, try adding --max_model_len 2048
vllm serve model/AgentCPM-GUI --served-model-name AgentCPM-GUI --tensor_parallel_size 1 --trust-remote-code --limit-mm-per-prompt image=10

import base64
import io
import json
import requests
from PIL import Image

END_POINT = "http://localhost:8000/v1/chat/completions" # Replace with actual endpoint

# system prompt
ACTION_SCHEMA = json.load(open('eval/utils/schema/schema.json', encoding="utf-8"))
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"])) # enable/disable thought by setting it to "required"/"optional"
ACTION_SCHEMA = dict(items)
SYSTEM_PROMPT = f'''# Role
你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。

# Task
针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。

# Rule
- 以紧凑JSON格式输出
- 输出操作必须遵循Schema约束

# Schema
{json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''

def encode_image(image: Image.Image) -> str:
    """Convert PIL Image to base64-encoded string."""
    with io.BytesIO() as in_mem_file:
        image.save(in_mem_file, format="JPEG")
        in_mem_file.seek(0)
        return base64.b64encode(in_mem_file.read()).decode("utf-8")

def __resize__(origin_img):
    resolution = origin_img.size
    w, h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
    return img

def predict(text_prompt: str, image: Image.Image):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": f"{text_prompt}\n当前屏幕截图:(./)"},
            {"type": "image_url",…
