What does this repo signal mean?

MiniMax published the Python repository MiniMax-AI/VTP. A repository signal like this one exposes tooling, eval, infrastructure, or model-adjacent work ahead of any launch post. High-signal details: repo MiniMax-AI/VTP · language Python · solid new repo with moderate traction. onlylabs links this event to 1 captured evidence page and 6 related repo signals. It also maps to Infrastructure in the data-business radar.

MiniMax Repo: MiniMax-AI/VTP

Captured source

GitHub · github.com/MiniMax-AI/VTP

MiniMax-AI/VTP repository metadata

published Dec 11, 2025 · seen Jun 5 · captured Jun 11 · http 200 · method plain

MiniMax-AI/VTP

Description: Towards Scalable Pre-training of Visual Tokenizers for Generation

Language: Python

License: NOASSERTION

Stars: 490

Forks: 14

Open issues: 8

Created: 2025-12-11T10:32:23Z

Pushed: 2026-04-15T07:09:46Z

Default branch: main

Fork: no

Archived: no

README:

News

[2026.03.09] We have updated our technical report with more experimental results.

[2025.12.16] We have released our technical report and [pretrained weights](#get-checkpoints).

Takeaways

By integrating contrastive, self-supervised, and reconstruction learning, we have trained numerous visual tokenizers from scratch. We aim to uncover a new scaling behavior that links understanding, generation, and reconstruction.

At the same DiT training FLOPs, scaling VTP pre-training yields better generation.

Traditional auto-encoders CANNOT be scaled up for diffusion generative models.

Understanding is the key driver of improved learnability scaling.

Parameter, data, and training scalability emerge once representation learning is involved.

Get Checkpoints

Weights will be released very soon.

Quick Start

pip install -r requirements.txt

import torch
from PIL import Image
from torchvision import transforms

from vtp.models.vtp_hf import VTPConfig, VTPModel
from vtp.tokenizers import get_tokenizer

model = VTPModel.from_pretrained("/path/to/MiniMaxAI/VTP-Large-f16d64")
model.eval()

# print model parameters
def count_params(m): return sum(p.numel() for p in m.parameters()) / 1e6
print(f"Vision Encoder: {count_params(model.trunk):.1f}M")
print(f"Pixel Decoder: {count_params(model.pixel_decoder):.1f}M")
print(f"Text Encoder: {count_params(model.text_transformer):.1f}M")

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
image = preprocess(Image.open("figures/dog.png")).unsqueeze(0)

# ---------------------------------------------------------------------------------------
# use it as an auto-encoder; rFID=0.36
# ---------------------------------------------------------------------------------------
denormalize = transforms.Normalize(
    mean=[-0.485/0.229, -0.456/0.224, -0.406/0.225],
    std=[1/0.229, 1/0.224, 1/0.225]
)
with torch.no_grad(), torch.autocast("cuda"):
    latents = model.get_reconstruction_latents(image)  # encode
    recon = model.get_latents_decoded_images(latents)  # decode
recon_image = denormalize(recon[0]).clamp(0, 1).permute(1, 2, 0).cpu().numpy()
Image.fromarray((recon_image * 255).astype("uint8")).save("output/reconstructed.png")

# ---------------------------------------------------------------------------------------
# use it as a clip; zero-shot 78.2
# ---------------------------------------------------------------------------------------
tokenizer = get_tokenizer('ViT-B-32', context_length=model.config.text_context_length)
text = tokenizer(["a diagram", "a dog", "a cat", "a person"])
with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.get_clip_image_feature(image, normalize=True)
    text_features = model.get_clip_text_feature(text, normalize=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", [f"{p:.4f}" for p in text_probs[0].tolist()])

# ---------------------------------------------------------------------------------------
# use it as an ssl feature extractor; linear probing 85.7
# ---------------------------------------------------------------------------------------
with torch.no_grad(), torch.autocast("cuda"):
    # get last layer features (cls token + patch tokens)
    features = model.get_last_layer_feature(image)
    cls_token = features['cls_token']  # (B, 1024)
    patch_tokens = features['patch_tokens']  # (B, 256, 1024) for 256x256 image

    # or get intermediate layer features for linear probing
    intermediate = model.get_intermediate_layers_feature(
        image, n=4, return_class_token=True
    )  # returns 4 x (patch_tokens, cls_token), each cls_token is (B, 1024)
    for i in range(1, 5):
        print('Last %d layers:' % i)
        print('Patch tokens shape:', intermediate[-i][0].shape)
        print('Cls token shape:', intermediate[-i][1].shape)
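
Two pieces of arithmetic in the quick start can be checked without loading the model. The sketch below is an illustration in plain Python, not code from the repository. It assumes the per-channel formula (x - mean) / std that transforms.Normalize applies, and it uses made-up cosine similarities in place of real model outputs. The helper names normalize, softmax, and zero_shot_probs are hypothetical.

```python
import math

# ImageNet statistics used by `preprocess` in the quick start.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize(pixel, mean, std):
    """Per-channel (x - mean) / std, the formula transforms.Normalize applies."""
    return [(p - m) / s for p, m, s in zip(pixel, mean, std)]


# Parameters of the inverse transform, matching `denormalize` in the quick start:
# (y - (-m/s)) / (1/s) = y * s + m, which recovers the original value.
INV_MEAN = [-m / s for m, s in zip(MEAN, STD)]
INV_STD = [1.0 / s for s in STD]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(cosine_sims, scale=100.0):
    """Scale cosine similarities by 100 and apply softmax, as in the CLIP step."""
    return softmax([scale * s for s in cosine_sims])


# Round trip: normalize a pixel, then apply the inverse parameters.
pixel = [0.2, 0.5, 0.9]
restored = normalize(normalize(pixel, MEAN, STD), INV_MEAN, INV_STD)
print("restored pixel:", [round(v, 6) for v in restored])

# Made-up similarities for ["a diagram", "a dog", "a cat", "a person"].
sims = [0.18, 0.27, 0.21, 0.19]
probs = zero_shot_probs(sims)
print("label probs:", [f"{p:.4f}" for p in probs])
```

The factor of 100 is why small gaps in cosine similarity turn into near one-hot probabilities: with these illustrative numbers, a 0.06 gap between the top two labels already pushes almost all of the probability mass onto "a dog".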

Performance

Introduction

The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundational flaw: better pixel-level reconstruction accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics.

We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our study reveals that perception-oriented tokenizer pre-training unlocks a new scaling law for generation, where generative performance scales effectively with compute, parameters, and data allocated to the pre-training of the visual tokenizer. Our large-scale pre-training experiments demonstrate the following results:

(1) Without modifying DiT training specs and FLOPs, solely scaling VTP pre-training consistently achieves gains in both ImageNet class-conditional and LAION text-to-image generation, while conventional autoencoders stagnate very early at 1/10 of the FLOPs.

(2) VTP achieves 0.36 rFID while simultaneously delivering 78.2% zero-shot accuracy and 85.7% linear probing accuracy, surpassing prior unified tokenizers such as VILA-U and UniTok.

(3) Furthermore, the VTP-based diffusion model exhibits exceptionally fast convergence, reaching 2.03 gFID in only 80 epochs without guidance tricks and outperforming previous methods like VA-VAE and RAE, and ultimately scales to achieve a remarkable 1.11 gFID on ImageNet 256×256 generation.
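
To make "joint optimization" concrete, the following is a minimal sketch of how three loss terms could be combined into one training objective. It is not the repository's training code. The equal weights, the 0.07 temperature, the use of mean squared error for reconstruction, and the fixed placeholder for the self-supervised term are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_total for v in row]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over matched (image, text) pairs of unit-length vectors.

    The temperature value is an assumption, not taken from the source.
    """
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    n = len(logits)
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    text_to_image = -sum(_log_softmax(transposed[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


def reconstruction_loss(pixels, recon):
    """Mean squared error between flattened images (an assumed stand-in)."""
    return sum((p - r) ** 2 for p, r in zip(pixels, recon)) / len(pixels)


def joint_loss(l_clip, l_ssl, l_rec, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three objectives; the weights are hypothetical."""
    w_clip, w_ssl, w_rec = weights
    return w_clip * l_clip + w_ssl * l_ssl + w_rec * l_rec


# Toy batch: two perfectly aligned image/text embedding pairs.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[1.0, 0.0], [0.0, 1.0]]

l_clip = contrastive_loss(image_emb, text_emb)
l_rec = reconstruction_loss([0.2, 0.4, 0.6], [0.25, 0.35, 0.6])
l_ssl = 0.3  # placeholder: the source does not specify the self-supervised loss

total = joint_loss(l_clip, l_ssl, l_rec)
print(f"contrastive={l_clip:.6f} recon={l_rec:.6f} total={total:.6f}")
```

In a real setup each term would come from a differentiable framework and the weights would be tuned. The point of the sketch is only that a single scalar objective lets one encoder receive gradients from all three signals at once.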

Evaluation

Installation

conda create -n vtp python=3.10
conda activate vtp
git submodule update --init --recursive
pip install -r requirements.txt

Zero-shot Classification

Modify the corresponding paths in...

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

Solid new repo with moderate traction.