Model · InclusionAI (Ant Group) · published Apr 22, 2026

inclusionAI/LLaDA2.0-Uni

task: any-to-any · license: apache-2.0 · library: transformers · params: 16B · downloads: 7.2k · likes: 247

Model Capabilities

LLaDA2.0-Uni is a unified diffusion Large Language Model (dLLM) based on Mixture-of-Experts (MoE) that seamlessly integrates multimodal understanding and generation within a single model. It supports:

  • 🖼️ Text-to-Image Generation — high-fidelity image synthesis with optional thinking/reasoning.
  • 🔍 Image Understanding — visual question answering, image captioning, document understanding, etc.
  • ✏️ Image Editing — instruction-based editing with single or multi-reference support.
  • 🎨 Interleaved Generation and Reasoning — provides preliminary support for interleaved generation and unlocks advanced interleaved reasoning.
  • Sprint Acceleration — KV cache reuse and adaptive unmasking for faster inference.
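As a rough illustration of what "adaptive unmasking" could mean in a masked-diffusion decoder, the sketch below commits, at every step, all masked positions whose confidence clears a threshold, and always at least the single most confident one. It is a toy under stated assumptions, not the Sprint implementation: `toy_predict`, `MASK`, and the threshold value are hypothetical stand-ins, and the real model scores tokens with a neural network rather than random numbers.

```python
import random

MASK = None  # hypothetical placeholder for a masked position


def toy_predict(seq, rng):
    """Hypothetical stand-in for the model: a (token, confidence) pair per masked slot."""
    return {i: (rng.randrange(100), rng.random()) for i, t in enumerate(seq) if t is MASK}


def adaptive_unmask(length, threshold=0.7, seed=0):
    """Fill a fully masked sequence, committing every confident prediction per step."""
    rng = random.Random(seed)
    seq = [MASK] * length
    steps = 0
    while any(t is MASK for t in seq):
        proposals = toy_predict(seq, rng)
        # Commit all positions whose confidence clears the threshold ...
        chosen = [i for i, (_, conf) in proposals.items() if conf >= threshold]
        # ... and always at least the most confident one, so the loop terminates.
        if not chosen:
            chosen = [max(proposals, key=lambda i: proposals[i][1])]
        for i in chosen:
            seq[i] = proposals[i][0]
        steps += 1
    return seq, steps


tokens, steps = adaptive_unmask(16)
print(f"filled {len(tokens)} positions in {steps} steps")
```

The number of steps is not fixed in advance: easy positions are committed in bulk, while uncertain ones are revisited, which is the general appeal of threshold-based schedules over a fixed tokens-per-step budget.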

Model Architecture

  • Unified dLLM-MoE Backbone: Unifies multimodal understanding and generation into a simple Mask Token Prediction paradigm.
  • Discrete Semantic Tokenizer: Utilizes SigLIP-VQ to convert visual inputs into discrete semantic tokens, significantly enhancing multimodal understanding.
  • Efficient Diffusion Decoder: Pairs discrete tokens with a specialized diffusion decoder for high-fidelity generation, enabling rapid 8-step inference via distillation.
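The idea behind a discrete tokenizer can be illustrated with plain vector quantization: each feature vector is replaced by the index of its nearest codebook entry. The codebook and features below are made-up toy values. SigLIP-VQ learns its codebook and operates on SigLIP features, and its actual quantizer may differ from this nearest-neighbour sketch.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook vector (squared L2)."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


# Toy 2-D codebook with four entries, and four toy "patch features".
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.7, 0.9)]

token_ids = quantize(features, codebook)
print(token_ids)  # [0, 1, 2, 3]
```

Once an image is a sequence of integer ids like this, it can share a vocabulary with text tokens and be handled by the same mask-token-prediction objective.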

Evaluation Results

Quick Start

> Note: Full installation instructions and CLI scripts are available in the GitHub repository.

⚙️ Installation

1. Clone the repository and create a conda environment

git clone https://github.com/inclusionAI/LLaDA2.0-Uni && cd LLaDA2.0-Uni
conda create -n llada2_uni python=3.10 -y
conda activate llada2_uni

2. Install PyTorch (CUDA 12.4)

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

3. Install Flash Attention 2 (required for efficient inference)

pip install flash-attn --no-build-isolation

4. Install remaining dependencies

pip install -r requirements.txt
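After the four steps above, a quick check along these lines can confirm that the key packages are discoverable before loading a 16B model. This is a convenience sketch, not part of the repository; it assumes `flash_attn` is the import name of the flash-attn wheel and that `transformers` is pulled in by `requirements.txt`.

```python
import importlib.util
import sys


def check_environment(modules=("torch", "torchvision", "flash_attn", "transformers")):
    """Report which of the expected packages can be found, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = check_environment()
print(f"python {sys.version_info.major}.{sys.version_info.minor}")
for name, found in status.items():
    print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Using `find_spec` avoids importing heavy packages just to test for their presence; it does not verify that CUDA is usable or that the installed versions match.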

🌟 Text-to-Image Generation

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from decoder import decode_vq_tokens

model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Generate image tokens
result = model.generate_image(
    "A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. Shot with 50mm, f/2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.",
    image_h=1024, image_w=1024,
    steps=8, cfg_scale=2.0,
)

# Decode to PIL image (default: 50-step ODE)
image = decode_vq_tokens(result["token_ids"], result["h"], result["w"], model_path, "cuda")
image.save("output.png")
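The `cfg_scale` argument in the call above suggests classifier-free guidance. In its common textbook form, the model is run with and without the prompt and the two predictions are combined so that the result is pushed away from the unconditional one. How LLaDA2.0-Uni applies guidance internally is not stated here, so the sketch below only shows that generic formulation on toy logits; `apply_cfg` is a hypothetical helper.

```python
def apply_cfg(cond_logits, uncond_logits, cfg_scale):
    """Textbook classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


cond = [2.0, 0.5, -1.0]    # toy logits with the prompt
uncond = [1.0, 1.0, -1.0]  # toy logits without the prompt

guided = apply_cfg(cond, uncond, cfg_scale=2.0)
print(guided)  # [3.0, 0.0, -1.0]
```

Under this convention a scale of 1.0 reproduces the conditional logits and larger values strengthen prompt adherence. Some codebases use a `(1 + w)` convention instead, so scale values are not always comparable across projects.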

> Note: 💡 Faster decoding: use decoder-turbo (the distilled decoder) for ~10× faster image decoding (8 steps instead of 50) with minimal quality loss:

image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"], model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)

🌟 Text-to-Image Generation with Thinking

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from decoder import decode_vq_tokens

model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Generate image tokens with thinking process
result = model.generate_image(
    "A fox with thick, dense, fluffy fur in a winter setting, possibly surrounded by snow.",
    image_h=1024, image_w=1024,
    mode="thinking",
    steps=8, cfg_scale=2.0,
    thinking_steps=32, thinking_gen_length=4096,
)

# Print thinking trace
print("Thinking:", result["thinking"])

# Decode to PIL image
image = decode_vq_tokens(
    result["token_ids"], result["h"], result["w"], model_path, "cuda",
    num_steps=8, decode_mode="decoder-turbo",
)
image.save("output_thinking.png")
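Since the thinking trace is only printed in the snippet above, it can be handy to store it next to the image. The helper below is a hypothetical convenience, not part of the repository; it assumes the result is a plain dict with the keys used above (`token_ids`, `h`, `w`, `thinking`) and that `token_ids` supports `len()`. A made-up stand-in dict is used so the example runs without the model.

```python
import json
import tempfile
from pathlib import Path


def save_generation_record(result, prompt, path):
    """Write the prompt, thinking trace, and token grid size to a JSON sidecar file."""
    record = {
        "prompt": prompt,
        "thinking": result.get("thinking"),  # None when thinking mode was not used
        "h": int(result["h"]),
        "w": int(result["w"]),
        "num_tokens": len(result["token_ids"]),
    }
    Path(path).write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
    return record


# Stand-in for the dict returned by model.generate_image(...); values are made up.
fake_result = {"token_ids": [1, 2, 3, 4], "h": 2, "w": 2, "thinking": "The fox should ..."}

out_path = Path(tempfile.gettempdir()) / "output_thinking.json"
record = save_generation_record(fake_result, "A fox in snow", out_path)
print(out_path.read_text(encoding="utf-8"))
```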

🌟 Image Understanding

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from encoder.image_tokenizer import ImageTokenizer
from decoder.smart_img_process import smart_resize_images

model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Encode image to discrete tokens
image_tokenizer = ImageTokenizer(model_path=model_path, device="cuda")
pil_image = smart_resize_images(["./assets/understanding_example.png"])[0]
info = image_tokenizer.encode_with_info(pil_image)
image_tokens = [x + model.config.image_token_offset for x in info["token_ids"]]
_, h, w = info["grid_thw"]

# Understand the image
response = model.understand_image(
    image_tokens, h, w,
    question="Describe this image in detail.",
    steps=32, gen_length=2048,
)
print(response)
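The line that adds `model.config.image_token_offset` to every id presumably moves tokenizer-local VQ indices into the part of the model vocabulary reserved for image tokens, so they cannot collide with text ids. The sketch below shows that mapping and its inverse on toy values; the offset `150_000` and both helper names are hypothetical, and the real offset should always be read from the model config.

```python
def to_model_ids(vq_ids, offset):
    """Shift tokenizer-local VQ indices into the model's shared vocabulary range."""
    return [i + offset for i in vq_ids]


def to_vq_ids(model_ids, offset):
    """Inverse mapping: recover tokenizer-local indices from model vocabulary ids."""
    return [i - offset for i in model_ids]


IMAGE_TOKEN_OFFSET = 150_000  # hypothetical; the real value is model.config.image_token_offset

vq_ids = [0, 17, 4095]
model_ids = to_model_ids(vq_ids, IMAGE_TOKEN_OFFSET)
print(model_ids)  # [150000, 150017, 154095]
```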

🌟 Image Editing

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from encoder.image_tokenizer import ImageTokenizer
from decoder.utils import generate_crop_size_list, var_center_crop
from decoder import decode_vq_tokens
from PIL import Image

model_path = "inclusionAI/LLaDA2.0-Uni"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
).eval()
model.tokenizer = tokenizer

# Encode source image
image_tokenizer = ImageTokenizer(model_path=model_path, device="cuda")
crop_size_list = generate_crop_size_list((512 // 32) ** 2, 32)
pil_image = var_center_crop(Image.open("./assets/edit_example.png").convert("RGB"), crop_size_list=crop_size_list)
info = image_tokenizer.encode_with_info(pil_image)
image_tokens = [x + model.config.image_token_offset for x in info["token_ids"]]
_, h, w = info["grid_thw"]

# Edit the image
result = model.edit_image(…

Excerpt shown — open the source for the full document.
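In the editing snippet, `generate_crop_size_list((512 // 32) ** 2, 32)` is called with a budget of 256 patches and a patch size of 32 pixels. One plausible reading is that the helper enumerates crop sizes that are multiples of 32 pixels and fit within that budget, and that `var_center_crop` then picks the one closest to the input's aspect ratio. The functions below are a hypothetical re-implementation of that idea, not the repository's code; the names and the aspect-ratio cap `max_ratio` are assumptions.

```python
def sketch_crop_sizes(num_patches, patch_size, max_ratio=4.0):
    """Enumerate (width, height) crops in pixels for the largest grids within the budget."""
    sizes = []
    for wp in range(1, num_patches + 1):
        hp = num_patches // wp  # tallest grid that keeps wp * hp <= num_patches
        if max(wp, hp) / min(wp, hp) <= max_ratio:
            sizes.append((wp * patch_size, hp * patch_size))
    return sizes


def closest_crop(width, height, sizes):
    """Pick the crop whose aspect ratio is closest to the input image's."""
    target = width / height
    return min(sizes, key=lambda s: abs(s[0] / s[1] - target))


sizes = sketch_crop_sizes((512 // 32) ** 2, 32)
print(len(sizes), "candidate crops")
print(closest_crop(1920, 1080, sizes))
```

With this scheme a square input maps to 512×512, while wide or tall inputs trade one side for the other at a roughly constant token count.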

Notability

notability 7.0/10

Notable model release with moderate traction.