Repo: ByteDance (Doubao/Seed), published Apr 17, 2025

ByteDance-Seed/Bagel

Description: Open-source unified multimodal model

Language: Python

License: Apache-2.0

Stars: 6000

Forks: 533

Open issues: 151

Created: 2025-04-17T06:54:07Z

Pushed: 2026-05-04T17:01:02Z

Default branch: main

Fork: no

Archived: no

README:

Unified Model for Multimodal Understanding and Generation

> Chaorui Deng*, Deyao Zhu*, Kunchang Li*, Chenhui Gou*, Feng Li*, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi :email:, Haoqi Fan* :tophat:
>
> contact: shiguang.sg@bytedance.com
>
> We present BAGEL, an open-source multimodal foundation model with 7B active parameters (14B total) trained on large-scale interleaved multimodal data. BAGEL outperforms current top-tier open-source VLMs such as Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards, and delivers text-to-image quality that is competitive with strong specialist generators such as SD3. Moreover, BAGEL demonstrates better qualitative results in classical image-editing scenarios than the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, and world navigation, capabilities that constitute "world-modeling" tasks beyond the scope of previous image-editing models. The figure below showcases BAGEL's qualitative performance.

📢 News

We sincerely thank all contributors from the open community for their valuable support.

📮 Notice

Call for Bad Cases: If you encounter cases where the model performs poorly, we would greatly appreciate it if you could share them in issue #11 or on Discord.

About Inference Hyperparameters:

  • `cfg_text_scale`: Controls how strongly the model follows the text prompt. 1.0 disables text guidance. Typical range: 4.0–8.0.
  • `cfg_image_scale`: Controls how much the model preserves input image details. 1.0 disables image guidance. Typical range: 1.0–2.0.
  • `cfg_interval`: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: [0.4, 1.0].
  • `timestep_shift`: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
  • `num_timesteps`: Total denoising steps. Typical: 50.
  • `cfg_renorm_min`: Minimum value for CFG-Renorm. 1.0 disables renorm. Typical: 0.
  • `cfg_renorm_type`: CFG-Renorm method:
      • `global`: Normalize over all tokens and channels (default for T2I).
      • `channel`: Normalize across channels for each token.
      • `text_channel`: Like `channel`, but only applies to the text condition (good for editing, may cause blur).
  • If edited images appear blurry, try `global` CFG-Renorm, decrease `cfg_renorm_min` or decrease `cfg_scale`.
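
To make the settings above concrete, here is a minimal sketch; it is not BAGEL's actual inference code. The dictionary keys mirror the names in the list, but the argument names in the repository may differ. The chosen values (including `timestep_shift = 3.0`), the helper names `shifted_timesteps` and `cfg_active`, and the shift formula are all assumptions made for illustration. The formula is the form commonly used in flow-matching samplers, with t running from 1 for noise to 0 for the image.

```python
# Hypothetical settings; keys follow the list above, values are picked from the
# stated typical ranges (timestep_shift has no stated range, 3.0 is assumed).
hyper = {
    "cfg_text_scale": 4.0,
    "cfg_image_scale": 1.5,
    "cfg_interval": (0.4, 1.0),
    "timestep_shift": 3.0,
    "num_timesteps": 50,
    "cfg_renorm_min": 0.0,
    "cfg_renorm_type": "global",
}


def shifted_timesteps(num_timesteps, shift):
    """Return timesteps from 1.0 (pure noise) down to 0.0 (clean image).

    Assumed shift rule: t' = shift * t / (1 + (shift - 1) * t).
    shift > 1 pushes steps toward t = 1, the early, layout-forming phase.
    """
    uniform = [1.0 - i / (num_timesteps - 1) for i in range(num_timesteps)]
    return [shift * t / (1.0 + (shift - 1.0) * t) for t in uniform]


def cfg_active(t, interval):
    """Assumed gating: apply CFG only while t lies in (low, high]."""
    low, high = interval
    return low < t <= high


steps = shifted_timesteps(hyper["num_timesteps"], hyper["timestep_shift"])
cfg_steps = sum(cfg_active(t, hyper["cfg_interval"]) for t in steps)
print(f"{cfg_steps} of {len(steps)} steps would use CFG")
```

Under these assumptions, a shift of 1.0 leaves the schedule uniform. Raising it places more steps at high noise, so more of them fall inside the CFG interval.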

🔥 Quick Start

1️⃣ Set up environment

git clone https://github.com/bytedance-seed/BAGEL.git
cd BAGEL
conda create -n bagel python=3.10 -y
conda activate bagel
pip install -r requirements.txt
pip install flash_attn==2.5.8 --no-build-isolation
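
After the install commands above, a quick sanity check can confirm that the key packages are importable in the `bagel` environment. This is a hedged sketch: the helper `check_env` is hypothetical, and the assumption that `torch` is pulled in by `requirements.txt` is not confirmed by the excerpt.

```python
import importlib.util
import sys


def check_env(required=("torch", "flash_attn")):
    """Report the Python version and whether each module can be found.

    The default module names are assumptions: flash_attn comes from the pip
    command above, torch is assumed to be listed in requirements.txt.
    """
    report = {"python": "%d.%d" % sys.version_info[:2]}
    for name in required:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_env().items():
        print(f"{key}: {value}")
```

A `False` entry only means the module could not be located; it does not say why the install failed.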

2️⃣ Download pretrained checkpoint

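from huggingface_hub import snapshot_download

save_dir = "models/BAGEL-7B-MoT"         # local target directory
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"  # Hugging Face model repository
cache_dir = save_dir + "/cache"

# Download only config, weight, code, and documentation files into save_dir.
snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    # The next two arguments may be deprecated (and ignored) by newer
    # huggingface_hub releases; they are kept here as in the original.
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)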
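
The `allow_patterns` list in the snippet above restricts which repository files are fetched. The sketch below illustrates that filtering, assuming the patterns are shell-style globs matched the way Python's `fnmatch` does it. The file names and the helper `filter_files` are made up for the example and do not describe the real contents of the checkpoint repository.

```python
from fnmatch import fnmatchcase

# Same patterns as in the download snippet above.
ALLOW_PATTERNS = ["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"]


def filter_files(filenames, patterns=ALLOW_PATTERNS):
    """Keep only names that match at least one glob pattern."""
    return [
        name
        for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical file names, for illustration only.
candidates = ["config.json", "model.safetensors", "README.md", "preview.png"]
kept = filter_files(candidates)
print(kept)
```

In this example `preview.png` is dropped because no pattern covers the `.png` extension.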

3️⃣ Use Gradio WebUI to start playing with BAGEL!

# For 32GB+ VRAM GPU or multi…

Excerpt shown — open the source for the full document.
