What does this repo signal mean?

ByteDance (Doubao/Seed) published ByteDance-Seed/SAIL (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo ByteDance-Seed/SAIL · language Python · ByteDance repo, moderate stars, low traction.. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

ByteDance (Doubao/Seed) Repo: ByteDance-Seed/SAIL

Captured source

source ↗

GitHub/github.com/ByteDance-Seed/SAIL

ByteDance-Seed/SAIL repository metadata

Source ↗

published Apr 19, 2025seen 5dcaptured 13hhttp 200method plain

ByteDance-Seed/SAIL

Description: Implementation for "The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer"

Language: Python

License: Apache-2.0

Stars: 84

Forks: 3

Open issues: 2

Created: 2025-04-19T02:04:51Z

Pushed: 2025-10-29T01:24:18Z

Default branch: main

Fork: no

Archived: no

README:

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer (SAIL)

We are extremely delighted to release SAIL, a Single trAnsformer model for vIsion and Language. SAIL is a unified multimodal large language model (MLLM) that seamlessly integrates raw pixel encoding and language decoding within a single architecture. Without relying on pre-trained vision encoders, SAIL achieves competitive performance across a wide range of vision-language tasks and demonstrates strong visual representation, rivaling state-of-the-art vision models in tasks like semantic segmentation.

Model & Micro Design

An Overview of Comparison

(A) Data scaling curve for Modular Multimodal Large Language Model (MLLM) and SAIL, our Single Transformer-based MLLM. As pretraining data increases, SAIL shows a sharper performance gain, demonstrating its superior data scalability. (B) Comparison to existing Single Transformer-based MLLMs, our SAIL pushes the performance boundaries on both vision tasks and vision-language tasks.

News

[2025/06/26]🎉SAIL is accepted to ICCV 2025 (Highlight).
[2025/04/02]🔥We release SAIL models and technical report.

Getting started

Prepraration

pip3 install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
pip3 install einops transformers==4.42.0

Example

Firstly, clone the SAIL repo,

git clone https://github.com/bytedance/SAIL
cd SAIL

and then, simpley run example.py

python3 example.py

or refer to the following code block:

from example import *

NON_VISION_TOKEN_ID = -1
PATH_TO_MODEL = "path to model"
PATH_TO_TOKENIZER = "path to tokenizer"
IMAGE_PATH = "path to image"
PROMPT = "content of prompt"

model, tokenizer = get_transformer_and_tokenizer(
PATH_TO_MODEL,
PATH_TO_TOKENIZER
)
model = model.cuda()

image_processor = lambda x: convert_image_base64_to_patches(load_image_to_base64(x), model.config.vision_patch_size, fix_res_size=None)
prompt_inp = tokenizer.bos_token + '[INST] {} [/INST]'.format(PROMPT)
image_path = IMAGE_PATH
image_patches = image_processor(image_path)
nh, nw = image_patches.shape[:2]
image_tokens, image_tokens_len = prepare_image_textual_seq_norowsep(nh, nw, tokenizer, add_cls=False)

input_tokens = image_tokens + prompt_inp
input_ids = tokenizer(input_tokens, add_special_tokens=False, return_tensors="pt").input_ids
vision_patch_indices = torch.full_like(input_ids, fill_value=NON_VISION_TOKEN_ID)
vision_patches = image_patches.view(nh * nw, -1)
assert (input_ids == tokenizer.vis_patch_tok_id).sum() == vision_patches.size(0)
assert (input_ids >= tokenizer.vis_beg_tok_id).sum() == image_tokens_len

vision_patch_indices[input_ids==tokenizer.vis_patch_tok_id] = torch.arange(vision_patches.size(0))
attention_mask = create_single_prefix_mask(image_tokens_len, input_ids.size(-1)).unsqueeze(0).unsqueeze(0)
position_ids = generate_mm_pos_ids_singleit(input_ids.squeeze(0).numpy().tolist(), tokenizer.vis_patch_tok_id, nh, nw).unsqueeze(1)

input_ids = input_ids.long().cuda()
vision_patch_indices = vision_patch_indices.long().cuda()
vision_patches = vision_patches.to(torch.bfloat16).cuda()
position_ids = position_ids.long().cuda()
attention_mask = attention_mask.cuda()

padding_attention_mask = torch.ones_like(input_ids).cuda()

inputs = dict(
input_ids = input_ids,
position_ids = position_ids,
attention_mask = padding_attention_mask,
vision_patches = vision_patches,
vision_patch_indices = vision_patch_indices,
use_cache=True
)

cached_inputs = dict(
input_ids = input_ids[:, :image_tokens_len],
position_ids = position_ids[:, :, :image_tokens_len],
attention_mask = attention_mask[:,:, :image_tokens_len, :image_tokens_len],
vision_patches = vision_patches,
vision_patch_indices = vision_patch_indices[:, :image_tokens_len],
use_cache=True
)

prefix_cache = DynamicCache()
with torch.no_grad():
prefix_cache = model.forward(**cached_inputs, past_key_values=prefix_cache).past_key_values

past_key_values = copy.deepcopy(prefix_cache)
generate_config = GenerationConfig(
max_new_tokens=1024,
return_dict_in_generate=True,
output_attentions=False
)
generated = model.generate(
**inputs,
past_key_values=past_key_values,
generation_config=generate_config
)
generated_ids = generated['sequences'][:, input_ids.size(1):]
response = tokenizer.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(f"\nModel Response: ===\n{response}\n===")

Features

SAIL as an MLLM, check our model at huggingface.
SAIL as a Vision Encoder, check our model at huggingface.
Explore Pixel SAIL, using SAIL For Pixel-Grounded Understanding.

Acknowledgement

Part of our codes are built up on SOLO. We thank the authors for their impressive contribution.

License

This project is licensed under Apache2.0. See the [LICENSE](LICENSE). flie for details.

Citation

If you find SAIL useful for your research and applications, feel free to give us a star ⭐ or cite us using:

@InProceedings{Lei_2025_ICCV,
author = {Lei, Weixian and Wang, Jiacong and Wang, Haochen and Li, Xiangtai and Liew, Jun Hao and Feng, Jiashi and Huang, Zilong},
title = {The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {20758-20769}
}

About ByteDance Seed Team

Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.

Notability

notability 5.0/10

ByteDance repo, moderate stars, low traction.