stepfun-ai/GEBench

Language: Python

License: Apache-2.0

Stars: 54

Forks: 1

Open issues: 0

Created: 2026-02-09T13:18:49Z

Pushed: 2026-02-25T08:33:22Z

Default branch: main

Fork: no

Archived: no

README:

GEBench: Benchmarking Image Generation Models as GUI Environments

![Benchmark Comparison](./assets/teaser.jpg)

Features

  • 5 Data Types: Type 1 (single-step), Type 2 (multi-step), Type 3 (text-fictionalapp), Type 4 (text-realapp), Type 5 (grounding)
  • Bilingual Support: Automatic Chinese/English prompt selection based on folder naming
  • 5-Dimensional Metrics: goal, logic, consistency, ui, quality

Quick Start

Installation

# Clone repository
git clone https://github.com/stepfun-ai/GEBench
cd GEBench

# Create conda environment
conda create -n gebench python=3.10 -y
conda activate gebench

# Install dependencies
pip install -r requirements.txt

Data

The GEBench data is available on HuggingFace:

📊 [stepfun-ai/GEBench](https://huggingface.co/datasets/stepfun-ai/GEBench) - HuggingFace Datasets Hub

To download:

cd /path/to/GEBench
git clone https://huggingface.co/datasets/stepfun-ai/GEBench ./data
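
As an alternative to `git clone`, the dataset could probably also be fetched from Python. The sketch below is an assumption, not something the README documents: it relies on the `huggingface_hub` package being installed, and it takes the five expected folder names from the `--data-folder` arguments used in the commands further down. `download_gebench` and `missing_folders` are hypothetical helper names.

```python
from pathlib import Path

# Dataset id taken from the HuggingFace URL in the README.
REPO_ID = "stepfun-ai/GEBench"

# Folder names as they appear in the README's --data-folder arguments.
EXPECTED_FOLDERS = [
    "01_single_step",
    "02_multi_step",
    "03_trajectory_text_fictionalapp",
    "04_trajectory_text_realapp",
    "05_grounding_data",
]


def download_gebench(target="data"):
    """Download the dataset snapshot into `target` (needs network access)."""
    # Imported lazily so this module still loads without huggingface_hub.
    from huggingface_hub import snapshot_download

    local = snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=target)
    return Path(local)


def missing_folders(root="data"):
    """Return the expected sub-folders that are not present under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_FOLDERS if not (root_path / name).is_dir()]
```

Calling `missing_folders("data")` after either download route gives a quick check that the layout matches what the generation and evaluation commands expect.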

Generate Images

python scripts/generate.py --data-type type1 --data-folder data/01_single_step --output-dir outputs/gemini --gemini-api-key YOUR_GEMINI_API_KEY
python scripts/generate.py --data-type type2 --data-folder data/02_multi_step --output-dir outputs/gemini --gemini-api-key YOUR_GEMINI_API_KEY
python scripts/generate.py --data-type type3 --data-folder data/03_trajectory_text_fictionalapp --output-dir outputs/gemini --gemini-api-key YOUR_GEMINI_API_KEY
python scripts/generate.py --data-type type4 --data-folder data/04_trajectory_text_realapp --output-dir outputs/gemini --gemini-api-key YOUR_GEMINI_API_KEY
python scripts/generate.py --data-type type5 --data-folder data/05_grounding_data --output-dir outputs/gemini --gemini-api-key YOUR_GEMINI_API_KEY

# With multiple workers
python scripts/generate.py --data-type type1 --data-folder data/01_single_step --output-dir outputs/gemini --gemini-api-key YOUR_GEMINI_API_KEY --workers 4
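
The five commands above differ only in `--data-type` and `--data-folder`, so they can be driven from a small loop. This is a minimal sketch that uses only the flags shown in the README; the helper names (`build_generate_cmd`, `run_all`) and the dry-run behavior are illustrative assumptions, not part of the repository.

```python
import shlex
import subprocess
import sys

# (data type, data folder) pairs copied from the README commands.
DATA_TYPES = [
    ("type1", "data/01_single_step"),
    ("type2", "data/02_multi_step"),
    ("type3", "data/03_trajectory_text_fictionalapp"),
    ("type4", "data/04_trajectory_text_realapp"),
    ("type5", "data/05_grounding_data"),
]


def build_generate_cmd(data_type, data_folder, api_key,
                       output_dir="outputs/gemini", workers=None):
    """Build the argument list for one scripts/generate.py invocation."""
    cmd = [
        sys.executable, "scripts/generate.py",
        "--data-type", data_type,
        "--data-folder", data_folder,
        "--output-dir", output_dir,
        "--gemini-api-key", api_key,
    ]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    return cmd


def run_all(api_key, workers=None, dry_run=True):
    """Run (or, by default, only print) the command for every data type."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    for data_type, folder in DATA_TYPES:
        cmd = build_generate_cmd(data_type, folder, api_key, workers=workers)
        if dry_run:
            # Mask the key so it does not end up in terminal logs.
            print(shlex.join(cmd).replace(api_key, "***"))
        else:
            subprocess.run(cmd, check=True)
```

With `dry_run=True` the function only prints the commands, which makes it easy to compare them against the README before spending API quota.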

Evaluate Results

python scripts/evaluate.py --data-type type1 --output-folder outputs/gemini/01_single_step --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY
python scripts/evaluate.py --data-type type2 --output-folder outputs/gemini/02_multi_step --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY
python scripts/evaluate.py --data-type type3 --output-folder outputs/gemini/03_trajectory_text_fictionalapp --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY
python scripts/evaluate.py --data-type type4 --output-folder outputs/gemini/04_trajectory_text_realapp --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY
python scripts/evaluate.py --data-type type5 --output-folder outputs/gemini/05_grounding_data --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY

# With multiple workers
python scripts/evaluate.py --data-type type1 --output-folder outputs/gemini/01_single_step --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY --workers 4
python scripts/evaluate.py --data-type type2 --output-folder outputs/gemini/02_multi_step --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY --workers 4
python scripts/evaluate.py --data-type type3 --output-folder outputs/gemini/03_trajectory_text_fictionalapp --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY --workers 4
python scripts/evaluate.py --data-type type4 --output-folder outputs/gemini/04_trajectory_text_realapp --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY --workers 4
python scripts/evaluate.py --data-type type5 --output-folder outputs/gemini/05_grounding_data --dataset-root data --openai-api-key YOUR_OPENAI_API_KEY --workers 4
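
In the commands above, each `--output-folder` is the generation output directory joined with the name of the corresponding data folder (for example `data/01_single_step` maps to `outputs/gemini/01_single_step`). The sketch below makes that mapping explicit. It uses only flags that appear in the README; `build_evaluate_cmd` is a hypothetical helper, and the mapping rule is inferred from the example commands rather than stated by the authors.

```python
from pathlib import PurePosixPath


def build_evaluate_cmd(data_type, data_folder, api_key,
                       outputs_root="outputs/gemini",
                       dataset_root="data", workers=None):
    """Build the argument list for one scripts/evaluate.py invocation."""
    # Inferred rule: output folder = outputs root + last component of data folder.
    output_folder = PurePosixPath(outputs_root) / PurePosixPath(data_folder).name
    cmd = [
        "python", "scripts/evaluate.py",
        "--data-type", data_type,
        "--output-folder", str(output_folder),
        "--dataset-root", dataset_root,
        "--openai-api-key", api_key,
    ]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    return cmd
```

For instance, `build_evaluate_cmd("type3", "data/03_trajectory_text_fictionalapp", key, workers=4)` yields the same arguments as the third multi-worker command listed above.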

Main Results

Chinese Subset Results

| Model | Single-Step | Multi-Step | Fiction-App | Real-App | Grounding | GE Score |
|---|---|---|---|---|---|---|
| Nano Banana pro | 84.50 | 68.65 | 65.75 | 64.35 | 64.83 | 69.62 |
| Nano Banana | 64.36 | 34.16 | 64.82 | 65.89 | 54.48 | 56.74 |
| GPT-image-1.5 | 83.79 | 56.97 | 60.11 | 55.65 | 53.33 | 63.22 |
| GPT-image-1.0 | 64.72 | 49.20 | 57.31 | 59.04 | 31.68 | 52.39 |
| Seedream 4.5 | 63.64 | 53.11 | 56.48 | 53.44 | 52.90 | 55.91 |
| Seedream 4.0 | 62.04 | 48.64 | 49.28 | 50.93 | 53.53 | 52.88 |
| Wan 2.6 | 64.20 | 50.11 | 52.72 | 50.40 | 59.58 | 55.40 |
| Flux-2-pro | 68.83 | 55.07 | 58.13 | 55.41 | 50.24 | 57.54 |
| Bagel | 34.84 | 13.45 | 27.36 | 33.52 | 35.10 | 28.85 |
| UniWorld-V2 | 55.33 | 24.95 | 32.03 | 21.39 | 49.60 | 36.66 |
| Qwen-Image-Edit | 41.12 | 26.79 | 23.78 | 26.10 | 50.80 | 33.72 |
| Longcat-Image | 48.76 | 12.75 | 30.03 | 30.00 | 51.02 | 34.51 |
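
The README does not say how the GE Score column is computed. For most Chinese-subset rows it matches the unweighted mean of the five task columns, rounded to two decimals; this is an inference from the numbers, not a documented definition. The sketch below recomputes that mean for a few rows and reports any that do not reproduce. As captured here, the GPT-image-1.5 row is one such case (61.97 recomputed versus 63.22 reported), and the README gives no explanation for the difference.

```python
# Rows copied from the Chinese subset table: (five task scores, reported GE Score).
ROWS = {
    "Nano Banana pro": ([84.50, 68.65, 65.75, 64.35, 64.83], 69.62),
    "Nano Banana": ([64.36, 34.16, 64.82, 65.89, 54.48], 56.74),
    "GPT-image-1.5": ([83.79, 56.97, 60.11, 55.65, 53.33], 63.22),
    "GPT-image-1.0": ([64.72, 49.20, 57.31, 59.04, 31.68], 52.39),
}


def mean_score(task_scores):
    """Unweighted mean of the task scores, rounded to two decimals (assumed rule)."""
    return round(sum(task_scores) / len(task_scores), 2)


def mismatched_rows(rows, tol=1e-6):
    """Return {model: (recomputed, reported)} where the assumed rule does not hold."""
    result = {}
    for model, (tasks, reported) in rows.items():
        recomputed = mean_score(tasks)
        if abs(recomputed - reported) > tol:
            result[model] = (recomputed, reported)
    return result


if __name__ == "__main__":
    for model, (recomputed, reported) in mismatched_rows(ROWS).items():
        print(f"{model}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```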

English Subset Results

| Model | Single-Step | Multi-Step | Fiction-App | Real-App | Grounding | GE Score |
|---|---|---|---|---|---|---|
| Nano Banana pro | 84.32 | 69.51 | 46.33 | 47.20 | 58.64 | 61.20 |
| Nano Banana | 64.80 | 50.75 | 48.88 | 47.12 | 49.04 | 52.12 |
| GPT-image-1.5 | 80.80 | 58.87 | 63.68 | 58.93 | 49.23 | 63.16 |
| GPT-image-1.0 | 60.92 | 64.33 | 58.94 | 56.16 | 37.84 | 55.64 |
| Seedream 4.5 | 49.49 | 45.30 | 53.81 | 51.80 | 49.63 | 50.01 |
| Seedream 4.0 | 53.28 | 37.57 | 47.92 | 49.36 | 44.17 | 46.46 |
| Wan 2.6 | 60.17 | 44.36 | 49.55 | 44.80 | 53.36 | 50.45 |
| Flux-2-pro | 61.00 | 52.17 | 49.92 | 47.16 | 45.67 | 51.18 |
| Bagel | 32.91 | 8.61 | 26.08 | 35.12 | 37.30 | 28.00 |
| UniWorld-V2 | 42.68 | 14.14 | 30.08 | 26.83 | 47.04 | 32.15 |
| Qwen-Image-Edit | 40.12 | 18.61 | 25.80 | 25.95 | 54.55 | 33.01 |
| Longcat-Image | 36.69 | 8.44 | 37.30 | 36.83 | 47.12 | 33.28 |
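
To compare the two subsets, the reported GE Scores can be subtracted per model. The sketch below does this for four rows copied from the Chinese and English tables; `subset_gaps` is a hypothetical helper, and the choice of rows is only illustrative.

```python
# (Chinese GE Score, English GE Score) copied from the two results tables.
GE_SCORES = {
    "Nano Banana pro": (69.62, 61.20),
    "GPT-image-1.5": (63.22, 63.16),
    "Seedream 4.5": (55.91, 50.01),
    "Bagel": (28.85, 28.00),
}


def subset_gaps(scores):
    """Return {model: Chinese minus English GE Score}, largest gap first."""
    gaps = {model: round(zh - en, 2) for model, (zh, en) in scores.items()}
    return dict(sorted(gaps.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    for model, gap in subset_gaps(GE_SCORES).items():
        print(f"{model}: {gap:+.2f}")
```

Among these four rows, Nano Banana pro shows the largest drop from Chinese to English (8.42 points), while GPT-image-1.5 is nearly unchanged (0.06 points).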

Citation

If you find GEBench useful, please cite our paper:

@article{li2026gebench,
  title={GEBench: Benchmarking Image Generation Models as GUI Environments},
  author={Haodong Li and Jingwei Wu and Quan Sun and Guopeng Li and Juanxi Tian and Huanyu Zhang and Yanlin Lai and Ruichuan An and Hongbo Peng and Yuhong Dai and Chenxi Li and Chunmei Qing and Jia Wang and Ziyang Meng and Zheng Ge and Xiangyu Zhang and Daxin Jiang},
  journal={arXiv preprint arXiv:2602.09007},
  year={2026}
}
