Repo: InclusionAI (Ant Group), published Feb 12, 2026

inclusionAI/Zooming-without-Zooming

Python

Description: [ICML 2026] ZwZ model family: SOTA fine-grained perception performance; ZoomBench: a new challenging perception benchmark

Language: Python

License: Apache-2.0

Stars: 155

Forks: 2

Open issues: 0

Created: 2026-02-12T08:14:14Z

Pushed: 2026-05-04T12:18:51Z

Default branch: main

Fork: no

Archived: no

README:

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception (ICML 2026)

1. School of Computer Science, Shanghai Jiao Tong University

2. Ant Group

3. Zhongguancun Academy

4. Shanghai Innovation Institute

📃 Paper | 🤗 Models & Training Datasets & ZoomBench

✨ Introduction

Recent "Thinking-with-Images" methods improve fine-grained perception by iteratively zooming into regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. In this work, we present ZwZ models (4/7/8B), achieving SOTA performance on multimodal perception benchmarks among open-source models. In addition, we present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap".

⚙️ Method

We propose Region-to-Image Distillation (R2I), which transforms zooming from an inference-time tool into a training-time primitive. We:

1. Zoom in to micro-cropped regions and let strong teacher models generate high-quality VQA data.
2. Distill this region-grounded supervision back to the full image with explicit bounding-box overlays.
3. Enable smaller student models to achieve single-glance fine-grained perception without tool use.
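The three steps above can be illustrated with a toy sketch. It is not the repository's implementation: the image is a nested list of grayscale values rather than a real image file, and the helper names (`crop`, `overlay_box`, `distill_to_full_image`) and the question suffix are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Return the zoomed-in region a teacher model would look at."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def overlay_box(image: List[List[int]], box: Box, value: int = 255) -> List[List[int]]:
    """Return a copy of the full image with the box outline drawn on it."""
    left, top, right, bottom = box
    out = [row[:] for row in image]  # copy so the input stays untouched
    for x in range(left, right):
        out[top][x] = value
        out[bottom - 1][x] = value
    for y in range(top, bottom):
        out[y][left] = value
        out[y][right - 1] = value
    return out


def distill_to_full_image(
    image: List[List[int]], box: Box, question: str, answer: str
) -> Dict[str, object]:
    """Attach region-level supervision to the full image plus its overlay."""
    return {
        "image": overlay_box(image, box),
        "bbox": box,
        # Hypothetical phrasing that points the student at the overlay.
        "question": f"{question} (refer to the boxed region)",
        "answer": answer,
    }


# Toy 6x6 grayscale "image"; a real pipeline would load image files instead.
full_image = [[0] * 6 for _ in range(6)]
region = (1, 1, 5, 4)
zoomed = crop(full_image, region)  # what the teacher sees
sample = distill_to_full_image(full_image, region, "What color is the sign?", "red")
```

The student is trained on `sample`, which contains the full image with the overlay, so no cropping step is needed at inference time.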

This idea can be summarized as "Zooming without Zooming". The first "Zooming" refers to the training-time primitive: we zoom into micro-regions to synthesize fine-grained training data. In contrast, the second "Zooming" denotes the inference-time tool use we seek to bypass.

🌟 Key Features

  • 🎯 Superior Accuracy: Achieves SOTA performance on perception benchmarks among open-source models
  • ⚡ Single-Pass Efficiency: Requires only one forward pass, eliminating inference-time tool-calling overhead
  • 📈 Broad Improvements: Improves not only perception benchmarks but also out-of-distribution generalization on visual reasoning, GUI agent, and AIGC detection tasks
  • 🔍 ZoomBench: A comprehensive benchmark with 845 samples across 6 fine-grained dimensions, featuring various evaluation protocols

🎯 Models and Datasets

Models

| Model | Base | Download |
|-------|------|----------|
| ZwZ-2B | Qwen3-VL-2B | 🤗 inclusionAI/ZwZ-2B |
| ZwZ-4B | Qwen3-VL-4B | 🤗 inclusionAI/ZwZ-4B |
| ZwZ-7B | Qwen2.5-VL-7B | 🤗 inclusionAI/ZwZ-7B |
| ZwZ-8B | Qwen3-VL-8B | 🤗 inclusionAI/ZwZ-8B |

---

Training Datasets

Our Region-to-Image distilled training data (37K samples): 🤗 inclusionAI/ZwZ-RL-VQA

Source image pools:

  • SA-1B, LAION, MetaCLIP, Visual Genome, CC12M, STPLS3D (we use only a small subset of images from each pool; most of the high-resolution images come from train-0000-of-0013.parquet in https://modelscope.cn/datasets/Tongyi-DataEngine/SA1B-Paired-Captions-Images)

Question Generator: Qwen3-VL-235B-A22B-Instruct

Answer Generators: Qwen3-VL-235B-A22B-Instruct, GLM-4.5V

---

📊 ZoomBench

We introduce 🤗 **ZoomBench**, a challenging benchmark for fine-grained multimodal perception:

  • 845 high-quality samples across 6 perceptual dimensions:
      • Fine-Grained Counting
      • OCR (text & symbol recognition)
      • Color Attributes
      • Structural Attributes
      • Material Attributes
      • Object Identification
  • Dual-View Protocol: Each sample includes both the full image and the cropped region to quantify the "zooming gap"
  • Attention Map Analysis: Evaluates, from an interpretability perspective, whether the model grounds its predictions on task-relevant image regions
  • Hybrid Construction: Gemini-2.5-Pro-generated + human-verified for quality and scalability
  • High Difficulty: Average accuracy of Qwen2.5-VL-7B is only 42.5%
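One way to read the dual-view protocol is to score each sample twice, once on the full image and once on the cropped region, and compare the two accuracies per dimension. The sketch below assumes the gap is defined as cropped-view accuracy minus full-image accuracy; that definition, the record fields (`dimension`, `full_correct`, `crop_correct`), and the toy results are illustrative assumptions, not ZoomBench's actual schema or numbers.

```python
from collections import defaultdict
from typing import Dict, Iterable


def zooming_gap(records: Iterable[dict]) -> Dict[str, Dict[str, float]]:
    """Per-dimension accuracy on both views and the difference between them."""
    totals = defaultdict(lambda: {"n": 0, "full": 0, "crop": 0})
    for r in records:
        t = totals[r["dimension"]]
        t["n"] += 1
        t["full"] += int(r["full_correct"])
        t["crop"] += int(r["crop_correct"])

    report = {}
    for dim, t in totals.items():
        full_acc = t["full"] / t["n"]
        crop_acc = t["crop"] / t["n"]
        # Assumed definition: how much accuracy the model gains when zoomed in.
        report[dim] = {"full_acc": full_acc, "crop_acc": crop_acc, "gap": crop_acc - full_acc}
    return report


# Made-up per-sample results, for illustration only.
toy_results = [
    {"dimension": "OCR", "full_correct": True, "crop_correct": True},
    {"dimension": "OCR", "full_correct": False, "crop_correct": True},
    {"dimension": "OCR", "full_correct": True, "crop_correct": True},
    {"dimension": "OCR", "full_correct": False, "crop_correct": False},
    {"dimension": "Fine-Grained Counting", "full_correct": True, "crop_correct": True},
    {"dimension": "Fine-Grained Counting", "full_correct": False, "crop_correct": True},
]
gap_report = zooming_gap(toy_results)
```

Under this reading, a gap close to zero suggests the model already perceives the relevant detail from the full image, which is the behavior R2I aims for.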

🛠️ Installation

git clone https://github.com/inclusionAI/Zooming-without-Zooming.git
cd Zooming-without-Zooming
pip install -r requirements.txt
git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e . # please refer to the official repo of SAM3 for detailed installation
cd ../EasyR1
pip install -e . # please refer to the official repo of EasyR1 for detailed installation

🔥 Let's Start

1. Region to Image Distillation

The pipeline supports checkpointing. Each step can be executed independently and resumed from any stage. Note that we use Qwen3-VL-235B and SAM3 to obtain meaningful cropped images, and use Kimi-K2 to extract the majority answer.
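To make the majority-answer idea concrete, here is a minimal sketch of voting over several candidate answers. The pipeline delegates this step to Kimi-K2; the sketch substitutes plain string normalization and counting, so it only approximates the behavior for short answers. The function names and the `min_votes` threshold are assumptions.

```python
from collections import Counter
from typing import List, Optional


def normalize(answer: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.strip().lower().rstrip(".").split())


def majority_answer(answers: List[str], min_votes: int = 2) -> Optional[str]:
    """Return the most common normalized answer, or None if there is no clear winner."""
    if not answers:
        return None
    counts = Counter(normalize(a) for a in answers)
    top, votes = counts.most_common(1)[0]
    if votes < min_votes:
        return None  # not enough agreement to trust the answer
    if sum(1 for c in counts.values() if c == votes) > 1:
        return None  # tie between different answers
    return top


agreed = majority_answer(["Red.", " red ", "blue"])
no_consensus = majority_answer(["red", "blue"])
tied = majority_answer(["a", "a", "b", "b"])
```

Samples for which no majority emerges would simply be dropped in this sketch, which trades data volume for label quality.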

cd Zooming-without-Zooming/data_synthesis

export MLLM_KEY="your_mllm_key"
export MLLM_URL="your_mllm_url"
export KIMI_KEY="your_llm_key"
export KIMI_URL="your_llm_url"

## step 1
# --image_folders supports multiple folders; replace with your own path (folders containing only images)
python create_crops.py \
--api_key "$MLLM_KEY" \
--api_url "$MLLM_URL" \
--image_folders "/path/images/sa1b" \
--output_jsonl "generated_bboxes_sa1b.jsonl"

## step 2
python create_questions.py \
--api_key "$MLLM_KEY" \
--api_url "$MLLM_URL" \
--input_files "generated_bboxes_sa1b.jsonl" \
--output_file "generated_questions.jsonl" \
--crop_output_dir "/path/images/crops" # Replace with your own path

## step 3
bash qwen_serve.sh

python create_answers.py \
--api_key "$MLLM_KEY" \
--api_url "$MLLM_URL" \
--kimi_api_key "$KIMI_KEY" \
--kimi_api_url "$KIMI_URL" \
--input_file "generated_questions.jsonl" \
--output_file "validated_vqa.jsonl" \
--bbox_output_dir "/path/images/bbox_images" # Replace with your own path

## step 4
python convert_jsonl2parquet.py \
--input_file "validated_vqa.jsonl" \
--output_file "validated_vqa.parquet"
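Because every step reads and writes JSONL, resuming a stage can be as simple as skipping records whose IDs already appear in the output file. The sketch below shows that pattern in isolation; it is an assumption about how such checkpointing could work, not the repository's code, and the `id` field, `run_stage`, and the stub `process` callback are hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Set


def load_done_ids(path: str) -> Set[int]:
    """Collect the IDs already written to a JSONL output file."""
    done = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    done.add(json.loads(line)["id"])
    return done


def run_stage(items: Iterable[dict], output_path: str, process: Callable[[dict], dict]) -> int:
    """Process only items missing from output_path; return how many were written."""
    done = load_done_ids(output_path)
    written = 0
    with open(output_path, "a", encoding="utf-8") as f:  # append keeps earlier work
        for item in items:
            if item["id"] in done:
                continue
            f.write(json.dumps(process(item), ensure_ascii=False) + "\n")
            f.flush()  # persist each record so an interruption loses little
            written += 1
    return written


def add_stub_answer(item: dict) -> dict:
    """Stand-in for a model call."""
    return {**item, "answer": "stub"}


items = [{"id": i, "question": f"q{i}"} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "stage.jsonl")
    first_run = run_stage(items[:3], out, add_stub_answer)  # simulated interrupted run
    second_run = run_stage(items, out, add_stub_answer)     # resume: only the rest
    total = len(load_done_ids(out))
```

This sketch does not guard against a partially written final line; a more careful version would skip lines that fail to parse.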

We also provide an end-to-end data synthesis script.

cd Zooming-without-Zooming/data_synthesis

export MLLM_KEY="your_mllm_key"
export MLLM_URL="your_mllm_url"
export KIMI_KEY="your_llm_key"
export KIMI_URL="your_llm_url"

bash qwen_serve.sh

python create_vqa.py \
--api_key "$MLLM_KEY" \…
