ByteDance-Seed/VINCIE
Description: Official code for VINCIE: Unlocking In-context Image Editing from Video
Language: Python
License: Apache-2.0
Stars: 57
Forks: 4
Open issues: 4
Created: 2025-06-30T03:50:56Z
Pushed: 2026-03-28T09:43:48Z
Default branch: main
Fork: no
Archived: no
README:
VINCIE: Unlocking In-context Image Editing from Video
> Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang
>
> In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (*e.g.*, segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.
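The abstract mentions a block-causal diffusion transformer without spelling out the attention pattern. As a rough illustration only, the sketch below builds a generic block-causal mask in plain Python: tokens attend freely within their own block and to all earlier blocks, but never to later ones. The function name `block_causal_mask`, the choice of one block per text or image segment, and the example block sizes are assumptions made for illustration and are not taken from the VINCIE code.

```python
def block_causal_mask(block_sizes):
    """Return an N x N boolean mask; mask[q][k] is True when query token q may attend to key token k."""
    # Map every token position to the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    # Allow attention within a block and to any earlier block, never to a later one.
    return [[block_of[k] <= block_of[q] for k in range(n)] for q in range(n)]


if __name__ == "__main__":
    # Hypothetical sequence: 2 text tokens, a 3-token image, then 1 text token.
    for row in block_causal_mask([2, 3, 1]):
        print("".join("x" if allowed else "." for allowed in row))
```

Setting every block to size 1 reduces this to an ordinary causal mask, and a single block covering the whole sequence reduces it to full attention.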
News
- 19 Mar, 2026: Released the Evaluation Code on MSE-Bench.
- 15 Mar, 2026: Released the Generated Images on MSE-Bench, including VINCIE-3B, VINCIE-7B, Nano Banana, Qwen-Image-Edit, FLUX.1-Kontext-dev, Bagel, Step1X-Edit, Omnigen 2, Omnigen, ICEdit, UltraEdit, HQEdit, Magicbrush, and InstructPix2Pix.
- 15 Mar, 2026: Released the Multi-turn Session image Editing Benchmark (MSE-Bench).
- 6 Jan, 2026: Released the VINCIE-7B checkpoint (full attention).
- 6 Sep, 2025: Released the VINCIE-3B checkpoint (full attention).
- 25 Aug, 2025: Released the official website and the inference code.
- 23 Aug, 2025: Released the VINCIE-10M dataset.
- 12 Jun, 2025: Released the VINCIE technical report.
Quick Start
1️⃣ Set up environment
git clone https://github.com/ByteDance-Seed/VINCIE
cd VINCIE
conda create -n vincie python=3.10 -y
conda activate vincie
pip install -r requirements.txt
pip install flash_attn==2.6.3 --no-build-isolation
2️⃣ Download pretrained checkpoint
from huggingface_hub import snapshot_download

save_dir = "ckpt/VINCIE-3B"
repo_id = "ByteDance-Seed/VINCIE-3B"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
)
Inference for Multi-turn Image Editing
turn1="Lower the pineapple beside her face, and change it to a smaller one."
turn2="Add a crown to the woman's head. "
turn3="Change the woman’s expression so that she is laughing."
turn4="Change the background to a pastel gradient of blue and lavender."
turn5="Add a colorful bird hovering above the crown."
input_img=assets/woman_pineapple.png
output_dir=output/woman_pineapple
python main.py configs/generate.yaml \
    generation.positive_prompt.image_path="[\"$input_img\"]" \
    generation.positive_prompt.prompts="[\"$turn1\", \"$turn2\", \"$turn3\", \"$turn4\", \"$turn5\"]" \
    generation.output.dir=$output_dir
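The shell command above hand-escapes a bracketed list of quoted prompts, which gets awkward as the number of turns grows. The following is a minimal sketch, assuming the overrides accept the same `["a", "b"]` list syntax that `json.dumps` produces; `build_generate_command` is a hypothetical helper and not part of the repository, and how the config parser treats prompts that themselves contain double quotes has not been checked here.

```python
import json
import shlex


def build_generate_command(image_paths, prompts, output_dir,
                           config="configs/generate.yaml"):
    """Return the argv list for a main.py run, mirroring the shell command above."""
    if not image_paths or not prompts:
        raise ValueError("need at least one image path and one prompt")
    return [
        "python", "main.py", config,
        # json.dumps yields the same ["a", "b"] list syntax the shell example builds by hand.
        "generation.positive_prompt.image_path=" + json.dumps(list(image_paths)),
        # ensure_ascii=False keeps non-ASCII characters in prompts unescaped.
        "generation.positive_prompt.prompts=" + json.dumps(list(prompts), ensure_ascii=False),
        "generation.output.dir=" + output_dir,
    ]


if __name__ == "__main__":
    argv = build_generate_command(
        ["assets/woman_pineapple.png"],
        ["Add a crown to the woman's head.",
         "Change the background to a pastel gradient of blue and lavender."],
        "output/woman_pineapple",
    )
    # Print a copy-pasteable command; use subprocess.run(argv, check=True) to execute it.
    print(shlex.join(argv))
```

Passing the list to `subprocess.run` directly avoids a second round of shell quoting, since each override travels as a single argument.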
Inference for Multi-concept Composition
p1=": "; p2=": "; p3=": "; p4=": "; p5=": "
p6="Based on , , , , , and , A smiling multi-generational family including the father in , mother in , son in , daughter in , dog in , and cat in , poses for a portrait amidst the sunlit trees and ferns of a forest. Output : "
img0="./assets/father.png"; img1="./assets/mother.png"; img2="./assets/son.png"; img3="./assets/daughter.png"; img4="./assets/dog1.png"; img5="./assets/cat.png";
output_dir=output/family
python main.py configs/generate.yaml \
    generation.pad_img_placehoder=False \
    generation.positive_prompt.image_path="[\"$img0\", \"$img1\", \"$img2\", \"$img3\", \"$img4\", \"$img5\"]" \
    generation.positive_prompt.prompts="[\"$p1\", \"$p2\", \"$p3\", \"$p4\", \"$p5\", \"$p6\"]" \
    generation.output.dir=$output_dir
Evaluation
To evaluate multi-turn image editing performance on the MSE-Bench benchmark:
1. Install dependencies:
cd evaluation
pip install -r evaluation/requirements.txt
2. Set your OpenAI-compatible API key:
export OPENAI_API_KEY=""
3. Run evaluation:
model_name="vincie_7b"
python3 compute_score.py \
    --model_name "$model_name" \
    --api_model gpt-5-nano \
    --num_workers 32 \
    --res_path ./tmp_data/results/"$model_name".json
This evaluates prompt-following and consistency using a VLM. Results are saved to the specified path. See evaluation/README.md for details.
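Because the evaluation calls a VLM through an OpenAI-compatible API, an empty `OPENAI_API_KEY` is an easy way to lose a run. The sketch below is one possible pre-flight wrapper, not part of the repository: `build_eval_command` is a hypothetical helper that reuses only the flags shown in step 3 and refuses to build the command when the key is missing.

```python
import os


def build_eval_command(model_name, api_model="gpt-5-nano", num_workers=32, env=None):
    """Return the argv list for compute_score.py, or raise if no API key is set."""
    env = os.environ if env is None else env
    if not env.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is empty; export it before running the evaluation")
    return [
        "python3", "compute_score.py",
        "--model_name", model_name,
        "--api_model", api_model,
        "--num_workers", str(num_workers),
        # Same result path pattern as the shell example in step 3.
        "--res_path", f"./tmp_data/results/{model_name}.json",
    ]


if __name__ == "__main__":
    try:
        print(" ".join(build_eval_command("vincie_7b")))
    except RuntimeError as err:
        print(f"not ready: {err}")
```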
Citation
@article{qu2025vincie,
title = {VINCIE: Unlocking In-context Image Editing from Video},
author = {Qu, Leigang and Cheng, Feng and Yang, Ziyan and Zhao, Qi and Lin, Shanchuan and Shi, Yichun and Li, Yicong and Wang, Wenjie and Chua, Tat-Seng and Jiang, Lu},
journal = {arXiv preprint arXiv:2506.10941},
year = {2025}
}
License
This project is licensed under the [Apache-2.0 License](LICENSE), subject to any intellectual property rights in the model owned by ByteDance. The text encoder of the model is adapted from Qwen-14B and your use of that model must comply with its license.
Notability
3.0/10: low star count, routine new repo