OpenBMB/AgentCPM-GUI
Description: AgentCPM-GUI: An on-device GUI agent for operating Android apps, enhancing reasoning ability with reinforcement fine-tuning for efficient task execution.
Language: Python
License: Apache-2.0
Stars: 1375
Forks: 132
Open issues: 2
Created: 2025-05-13T04:11:16Z
Pushed: 2026-01-11T08:24:15Z
Default branch: main
Fork: no
Archived: no
README:
News
- [2025-06-03] 📄📄📄 We have released the technical report of AgentCPM-GUI! Check it out here.
- [2025-05-13] 🚀🚀🚀 We have open-sourced AgentCPM-GUI, an on-device GUI agent capable of operating Chinese & English apps and equipped with RFT-enhanced reasoning abilities.
Overview
AgentCPM-GUI is an open-source on-device LLM agent model jointly developed by THUNLP, Renmin University of China and ModelBest. Built on MiniCPM-V with 8 billion parameters, it accepts smartphone screenshots as input and autonomously executes user-specified tasks.
Key features include:
- High-quality GUI grounding: Pre-training on a large-scale bilingual Android dataset significantly boosts localization and comprehension of common GUI widgets (buttons, input boxes, labels, icons, etc.).
- Chinese-app operation: The first open-source GUI agent fine-tuned for Chinese apps, covering 30+ popular titles such as Amap, Dianping, bilibili and Xiaohongshu.
- Enhanced planning & reasoning: Reinforcement fine-tuning (RFT) lets the model “think” before outputting an action, greatly improving success on complex tasks.
- Compact action-space design: An optimized action space and concise JSON format reduce the average action length to 9.7 tokens, boosting on-device inference efficiency.
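To make the compact-format point concrete, the sketch below serializes a tap action of the form `{"POINT": [x, y]}` (the shape that appears in the Quick Start sample output) with and without whitespace. It compares character counts only. The 9.7-token average depends on the model's tokenizer, which this sketch does not load.

```python
import json

# A tap action in the shape used by the README's sample output.
action = {"POINT": [729, 69]}

# Default json.dumps inserts a space after ':' and ','.
default_json = json.dumps(action)

# Compact form: no whitespace, matching separators=(',', ':') in the prompt-building code.
compact_json = json.dumps(action, separators=(",", ":"))

print(default_json)   # {"POINT": [729, 69]}
print(compact_json)   # {"POINT":[729,69]}
print(len(default_json), len(compact_json))  # 20 18
```

Both strings parse back to the same action, so the saving comes purely from dropped whitespace.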
Demo Case (1x speed):
https://github.com/user-attachments/assets/694d3c2c-12ce-4084-8feb-4937ca9ad247
Quick Start
Install dependencies
git clone https://github.com/OpenBMB/AgentCPM-GUI
cd AgentCPM-GUI
conda create -n gui_agent python=3.11
conda activate gui_agent
pip install -r requirements.txt
Download the model
Download AgentCPM-GUI from Hugging Face and place it in model/AgentCPM-GUI.
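One possible way to script the download is with `huggingface_hub.snapshot_download`. This is a minimal sketch, not the project's documented procedure: the repo id `openbmb/AgentCPM-GUI` is an assumption and should be checked against the model page, and `download_model` is a hypothetical helper name.

```python
MODEL_REPO = "openbmb/AgentCPM-GUI"  # ASSUMED Hugging Face repo id; verify before use
MODEL_DIR = "model/AgentCPM-GUI"     # local path the Quick Start code loads from


def download_model(repo_id: str = MODEL_REPO, local_dir: str = MODEL_DIR) -> str:
    """Download every file of the model repo into local_dir and return that path."""
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Usage (the weights are large, so run this once):
# download_model()
```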
Hugging Face Inference
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from PIL import Image
import json
# 1. Load the model and tokenizer
model_path = "model/AgentCPM-GUI" # model path
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to("cuda:0")
# 2. Build the input
# Instruction in Chinese: "Please tap the 'Membership' button on the screen"
instruction = "请点击屏幕上的‘会员’按钮"
image_path = "assets/test.jpeg"
image = Image.open(image_path).convert("RGB")
# 3. Resize the longer side to 1120 px to save compute & memory
def __resize__(origin_img):
    resolution = origin_img.size
    w, h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
    return img
image = __resize__(image)
# 4. Build the message format
# The Chinese suffix "当前屏幕截图:" means "Current screenshot:"
messages = [{
    "role": "user",
    "content": [
        f"{instruction}\n当前屏幕截图:",
        image
    ]
}]
# 5. Inference
ACTION_SCHEMA = json.load(open('eval/utils/schema/schema.json', encoding="utf-8"))
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"])) # enable/disable thought by setting it to "required"/"optional"
ACTION_SCHEMA = dict(items)
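# English gloss of the Chinese system prompt below (the prompt is passed to the model as written):
#   Role: You are an agent familiar with Android touchscreen GUI operation. Based on the user's
#         question, analyze the GUI elements and layout of the current screen and generate the
#         corresponding action.
#   Task: Given the user's question and the current screenshot, output the next action.
#   Rule: Output compact JSON; the output action must follow the Schema constraints.
SYSTEM_PROMPT = f'''# Role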
你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。
# Task
针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。
# Rule
- 以紧凑JSON格式输出
- 输出操作必须遵循Schema约束
# Schema
{json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''
outputs = model.chat(
    image=None,
    msgs=messages,
    system_prompt=SYSTEM_PROMPT,
    tokenizer=tokenizer,
    temperature=0.1,
    top_p=0.3,
    n=1,
)
# 6. Output
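print(outputs)
Expected output:
{"thought":"任务目标是点击屏幕上的‘会员’按钮。当前界面显示了应用的推荐页面,顶部有一个导航栏。点击‘会员’按钮可以访问应用的会员相关内容。","POINT":[729,69]}
(English gloss of the thought: "The task goal is to tap the 'Membership' button on the screen. The current interface shows the app's recommendation page, with a navigation bar at the top. Tapping the 'Membership' button gives access to the app's membership-related content.")
Note: AgentCPM-GUI outputs relative coordinates ranging from 0 to 1000. The conversions are as follows:
rel_x, rel_y = [int(abs_x / width * 1000), int(abs_y / height * 1000)]
abs_x, abs_y = [int(rel_x / 1000 * width), int(rel_y / 1000 * height)]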
where width and height refer to the original width and height of the image, respectively.
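The conversion formulas can be wrapped into small helpers and applied to a parsed model output. The following is a minimal sketch that assumes the model returns a JSON string like the sample output. The helper names (`rel_to_abs`, `abs_to_rel`, `point_from_output`) and the 1080x2400 screen size are illustrative and not part of the repository.

```python
import json


def rel_to_abs(rel_x, rel_y, width, height):
    """Map 0-1000 relative coordinates to pixels of the original image."""
    return int(rel_x / 1000 * width), int(rel_y / 1000 * height)


def abs_to_rel(abs_x, abs_y, width, height):
    """Map pixel coordinates of the original image to the 0-1000 range."""
    return int(abs_x / width * 1000), int(abs_y / height * 1000)


def point_from_output(output, width, height):
    """Parse a JSON action string; return the tap point in pixels, or None if absent."""
    action = json.loads(output)
    point = action.get("POINT")
    if point is None:
        return None  # the action carries no POINT field
    return rel_to_abs(point[0], point[1], width, height)


# Hypothetical 1080x2400 screenshot; the thought text is shortened for brevity.
sample = '{"thought":"...","POINT":[729,69]}'
print(point_from_output(sample, 1080, 2400))  # (787, 165)
```

Because both directions truncate with `int`, a round trip can drift by a pixel or a unit, so it is safer to convert once from the model's relative output than to convert back and forth.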
vLLM Inference
# Launch the vLLM server
# If you run out of VRAM, try adding --max_model_len 2048
vllm serve model/AgentCPM-GUI --served-model-name AgentCPM-GUI --tensor_parallel_size 1 --trust-remote-code --limit-mm-per-prompt image=10
import base64
import io
import json
import requests
from PIL import Image
END_POINT = "http://localhost:8000/v1/chat/completions" # Replace with actual endpoint
# system prompt
ACTION_SCHEMA = json.load(open('eval/utils/schema/schema.json', encoding="utf-8"))
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"])) # enable/disable thought by setting it to "required"/"optional"
ACTION_SCHEMA = dict(items)
# Same Chinese system prompt as in the Hugging Face example (role, task, output rules, schema)
SYSTEM_PROMPT = f'''# Role
你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。
# Task
针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。
# Rule
- 以紧凑JSON格式输出
- 输出操作必须遵循Schema约束
# Schema
{json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''
def encode_image(image: Image.Image) -> str:
    """Convert PIL Image to base64-encoded string."""
    with io.BytesIO() as in_mem_file:
        image.save(in_mem_file, format="JPEG")
        in_mem_file.seek(0)
        return base64.b64encode(in_mem_file.read()).decode("utf-8")

def __resize__(origin_img):
    resolution = origin_img.size
    w, h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
    return img

def predict(text_prompt: str, image: Image.Image):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": f"{text_prompt}\n当前屏幕截图:(./)"},
            {"type": "image_url",…
Excerpt shown; open the source for the full document.
Notability
notability 5.0/10: Decent stars for a new GUI repo.