Model · Qwen (Alibaba Cloud) · published May 21, 2026

Qwen/Qwen-Image-Bench

Task: image-text-to-text · License: apache-2.0 · Library: transformers · Parameters: 27B · Downloads: 13k · Likes: 56

Q-Judger

A fine-tuned judge model for evaluating text-to-image (T2I) generation quality. Built on top of Qwen3.6-27B, it scores generated images across 5 hierarchical dimensions using structured checklists and outputs JSON-formatted evaluation results.

Links

| Resource | Link |
|----------|------|
| 📑 Paper | http://arxiv.org/abs/2605.28091 |
| 📊 Benchmark Dataset (HuggingFace) | https://huggingface.co/datasets/Qwen/Qwen-Image-Bench |
| 📊 Benchmark Dataset (ModelScope) | https://www.modelscope.cn/datasets/Qwen/Qwen-Image-Bench |
| 💻 GitHub | https://github.com/QwenLM/Qwen-Image-Bench |
| 🧑‍⚖️ Q-Judger Model (HuggingFace) | https://huggingface.co/Qwen/Qwen-Image-Bench |
| 🧑‍⚖️ Q-Judger Model (ModelScope) | https://modelscope.cn/models/Qwen/Qwen-Image-Bench |

Model Description

Q-Judger is a vision-language model fine-tuned specifically to evaluate images produced by text-to-image models. Given a text prompt and a generated image, it scores the image against fine-grained quality criteria organized in a 3-level hierarchy and outputs the scores as structured JSON.

  • Base Model: Qwen3.6-27B
  • Task: Image quality evaluation / judging
  • Input: Text prompt + generated image
  • Output: Structured JSON with per-dimension scores (0 = Fail, 1 = Pass, 2 = Excel, N/A)
  • Thinking Mode: Enabled — the model uses chain-of-thought reasoning before producing the final JSON output

Evaluation Dimensions

The model evaluates images across 5 top-level dimensions, each with multiple sub-dimensions:

Quality

  • Realism: Physical Logic, Material Texture
  • Detail: Noise, Edge Clarity, Naturalness
  • Resolution: Resolution

Aesthetics

  • Composition: Composition
  • Color Harmony: Color Harmony
  • Lighting: Lighting & Atmosphere
  • Anatomical Portraiture: Anatomical Fidelity
  • Emotional Expression: Emotional Expression
  • Style Control: Style Control

Alignment

  • Attributes: Quantity, Facial Expression, Material Properties, Color, Shape, Size
  • Actions: Contact Interaction, Non-contact Interaction, Full-body Action
  • Layout: 2D Space, 3D Space
  • Relations: Composition Relationship, Difference/Similarity, Containment
  • Scene: Real-world Scene, Virtual Scene

Real-world Fidelity

  • Fairness: Social Bias, Cultural Fairness
  • Safety & Compliance: Safety & Compliance
  • World Knowledge: Animals, Objects, Information Visualization, Temporal Characteristics, Cultural Elements

Creative Generation

  • Imagination: Imagination
  • Feature Matching: Feature Matching
  • Logical Resolution: Logical Resolution
  • Text Rendering: Text Accuracy, Text Layout, Font, Cross-lingual Generation
  • Design Applications: Graphic Design, Product Design, Spatial Design, Fashion Styling, Game Design, Art Design
  • Visual Storytelling: Cinematic Style, Camera / Lens Style, Storyboard Creation, Shot Sizes, Composition, Angles, Comic Creation

Scoring Methodology

Raw Score Mapping

| Raw Score | Meaning | Mapped Score |
|-----------|---------|--------------|
| 0 | Fail | 0 |
| 1 | Pass | 60 |
| 2 | Excel | 100 |
| N/A | Not applicable | Excluded |

Aggregation

1. Level-3 → Level-2: Average all non-N/A Level-3 scores within a Level-2 category
2. Level-2 → Level-1: Average all Level-2 scores within a Level-1 dimension
3. Level-1 → Total: Average all Level-1 dimension scores
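The mapping and the three aggregation steps can be sketched in Python for the judgement of a single image. This is a minimal sketch, not the repository's implementation: the function names are hypothetical, and it assumes that a Level-2 category whose Level-3 items are all N/A is skipped at the next level, which the steps above do not specify. It also aggregates one image at a time; the benchmark may average across prompts in a different order.

```python
# Raw-to-mapped score table from "Raw Score Mapping"; "N/A" is excluded.
SCORE_MAP = {0: 0, 1: 60, 2: 100}


def _average(values):
    """Mean of a list, or None when there is nothing to average."""
    return sum(values) / len(values) if values else None


def level2_score(level3_items):
    """Step 1: average the mapped, non-N/A Level-3 scores of one Level-2 category."""
    mapped = [SCORE_MAP[item["score"]]
              for item in level3_items.values()
              if item["score"] != "N/A"]
    return _average(mapped)


def level1_score(level2_items):
    """Step 2: average the Level-2 scores of one Level-1 dimension.

    Assumption: categories that are entirely N/A (score None) are skipped.
    """
    scores = [level2_score(items) for items in level2_items.values()]
    return _average([s for s in scores if s is not None])


def total_score(judgement):
    """Step 3: average the Level-1 scores (same skipping assumption)."""
    scores = [level1_score(items) for items in judgement.values()]
    return _average([s for s in scores if s is not None])


# The Quality example from the Output Format section of this card.
quality = {
    "Realism": {"Physical Logic": {"score": 1}, "Material Texture": {"score": 2}},
    "Detail": {"Noise": {"score": 1}, "Edge Clarity": {"score": 1},
               "Naturalness": {"score": 1}},
    "Resolution": {"Resolution": {"score": 2}},
}
print(level1_score(quality))  # (80 + 60 + 100) / 3 = 80.0
```

Here Realism averages 60 and 100 to 80, Detail averages to 60, and Resolution is 100, so the Quality dimension scores 80.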

Human Agreement

We validate the judge model against human expert rankings by computing Spearman rank correlation ($\rho$) between the model's rankings and human expert rankings across the five L1 pillars and overall. All correlations are statistically significant ($p < 10^{-4}$, $N = 18$ models).

| Dimension | Spearman $\rho$ |
|----------------------|:---------------:|
| Quality | 0.89 |
| Aesthetics | 0.89 |
| Alignment | 0.89 |
| Real-world Fidelity | 0.92 |
| Creative Generation | 0.92 |
| Overall | 0.92 |
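For readers who want to run the same kind of check on their own models, Spearman's $\rho$ is the Pearson correlation of the two rank vectors. The sketch below is a dependency-free illustration with hypothetical function names. The five score pairs are invented for the example and do not reproduce the numbers in the table.

```python
def average_ranks(values):
    """Rank values from 1..n (ascending); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented per-model scores (one entry per T2I model), for illustration only.
judge_scores = [70, 65, 80, 59, 77]
human_scores = [68, 70, 85, 50, 75]
print(round(spearman_rho(judge_scores, human_scores), 2))  # 0.9
```

Only the first two models swap places between the two rankings, which is why the toy correlation is high but not 1.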

Quick Start

Get the Inference Code

git clone https://github.com/QwenLM/Qwen-Image-Bench.git
cd Qwen-Image-Bench

Installation

1. Create and activate a virtual environment with uv:

uv venv myenv --python 3.11
source myenv/bin/activate

2. Install PyTorch (select the command matching your CUDA version):

See the official guide: https://pytorch.org/get-started/locally/

3. Install Python dependencies:

uv pip install -r requirements.txt

This installs all required dependencies including ms-swift.

Run Inference

python judge.py \
  --input your_data.jsonl \
  --model Qwen/Qwen-Image-Bench

Input Format

Prepare a CSV, JSON, or JSONL file with the following columns:

| Column | Type | Description |
|--------|------|-------------|
| ID | int | Prompt identifier (1-1000), must match benchmark metadata |
| prompt | str | The text prompt used to generate the image |
| image_path | str | Path to the generated image file |
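As an illustration of the JSONL variant, the sketch below serializes rows with the three columns from the table and checks the ID range before emitting them. The helper name `to_jsonl`, the prompts, and the image paths are hypothetical. In real use the ID and prompt come from the benchmark dataset, since the ID must match the benchmark metadata.

```python
import json

REQUIRED_COLUMNS = ("ID", "prompt", "image_path")


def to_jsonl(rows):
    """Serialize rows to JSONL text: one JSON object per line, three columns each."""
    lines = []
    for row in rows:
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row {row!r} is missing columns: {missing}")
        row_id = row["ID"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(row_id, bool) or not isinstance(row_id, int) or not 1 <= row_id <= 1000:
            raise ValueError(f"ID must be an int in 1-1000, got {row_id!r}")
        lines.append(json.dumps({c: row[c] for c in REQUIRED_COLUMNS}, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


# Hypothetical prompts and image paths, for illustration only.
rows = [
    {"ID": 1, "prompt": "a red bicycle leaning on a brick wall", "image_path": "outputs/0001.png"},
    {"ID": 2, "prompt": "a bowl of ramen, top-down view", "image_path": "outputs/0002.png"},
]
print(to_jsonl(rows), end="")
```

Writing the returned text to a file such as `your_data.jsonl` gives an input that can be passed to `--input`.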

Output Format

The model outputs one JSON object per top-level (Level-1) dimension, mapping each Level-2 category to the scores of its Level-3 items:

{
  "Level-2 Dimension": {
    "Level-3 Dimension": {"score": 0|1|2|"N/A"}
  }
}

Example (Quality dimension):

{
  "Realism": {
    "Physical Logic": {"score": 1},
    "Material Texture": {"score": 2}
  },
  "Detail": {
    "Noise": {"score": 1},
    "Edge Clarity": {"score": 1},
    "Naturalness": {"score": 1}
  },
  "Resolution": {
    "Resolution": {"score": 2}
  }
}
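A downstream script will usually want these scores as flat records. The sketch below parses one such object and checks that every score is one of the four documented values. The function name is hypothetical, and it assumes the chain-of-thought text has already been stripped so that the input string is the final JSON only. How the reasoning is delimited in the raw generation is not described here.

```python
import json

VALID_SCORES = (0, 1, 2, "N/A")


def flatten_scores(raw_json):
    """Return {(level2, level3): score} for one per-dimension JSON object."""
    parsed = json.loads(raw_json)
    flat = {}
    for level2, level3_items in parsed.items():
        for level3, entry in level3_items.items():
            score = entry["score"]
            # JSON true/false would compare equal to 1/0, so reject bools explicitly.
            if isinstance(score, bool) or score not in VALID_SCORES:
                raise ValueError(f"unexpected score {score!r} at {level2} / {level3}")
            flat[(level2, level3)] = score
    return flat


raw = '{"Realism": {"Physical Logic": {"score": 1}, "Material Texture": {"score": "N/A"}}}'
for (level2, level3), score in flatten_scores(raw).items():
    print(f"{level2} / {level3}: {score}")
```

Rejecting unknown values early keeps a malformed generation from silently entering the aggregated scores.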

CLI Options

| Argument | Default | Description |
|----------|---------|-------------|
| --input | (required) | Input CSV/JSON/JSONL with ID, prompt, image_path |
| --model | (required) | HuggingFace model ID or local model path |
| --hf-bench-repo | - | HF dataset repo for bench metadata |
| --local-metadata | - | Local metadata file path (overrides default) |
| --max-batch-size | 24 | ms-swift max_batch_size |
| --max-new-tokens | 4096 | Max generation tokens |
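When the judge is driven from another Python script, the command line can be assembled from these options. The helper below is a hypothetical convenience wrapper, not part of the repository. It uses only flags listed in the table and prints the command instead of running it. Note that 24 is the documented batch size, so any other value departs from the reference configuration.

```python
import shlex
import sys


def build_judge_command(input_path, model, max_batch_size=None, max_new_tokens=None):
    """Build the argv list for judge.py; optional flags are added only when set."""
    cmd = [sys.executable, "judge.py", "--input", str(input_path), "--model", str(model)]
    if max_batch_size is not None:
        cmd += ["--max-batch-size", str(int(max_batch_size))]
    if max_new_tokens is not None:
        cmd += ["--max-new-tokens", str(int(max_new_tokens))]
    return cmd


cmd = build_judge_command("your_data.jsonl", "Qwen/Qwen-Image-Bench", max_batch_size=8)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root, where `judge.py` lives.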

Inference Parameters

The judge model uses fixed inference parameters for reproducibility:

| Parameter | Value |
|-----------|-------|
| seed | 42 |
| temperature | 0 |
| top_k | 1 |
| top_p | 1.0 |
| repetition_penalty | 1.05 |
| max_new_tokens | 4096 |
| enable_thinking | True |
| max_batch_size | 24 |

Citation

If you find this model useful, please cite our paper:

@misc{li2026qwenimagebenchgenerationcreationtexttoimage,
title={Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation},
author={Niantong Li…
