Model · InclusionAI (Ant Group) · published Feb 9, 2026

inclusionAI/LLaDA2.1-flash

published: Feb 9, 2026 · task: text-generation · license: apache-2.0 · library: transformers · params: 103B · downloads: 151k · likes: 92

LLaDA2.1-flash

🚀 LLaDA2.1-flash is now live on ZenmuxAI! Try it via API 🛠️ or Chat 💬: https://zenmux.ai/inclusionai/llada2.1-flash

LLaDA2.1-flash is a diffusion language model in the LLaDA series that adds editable generation. Relative to LLaDA2.0-flash, Speed Mode raises the average TPF from 3.08 to 5.93 at a near-identical average score (72.34 vs. 72.43), and Quality Mode reaches an average score of 73.54.

---

| Benchmark | Qwen3-30B-A3B-Inst-2507 (Score) | Ling-flash-2.0 (Score) | LLaDA2.0-flash (Score / TPF) | LLaDA2.1-flash (S Mode) (Score / TPF) | LLaDA2.1-flash (Q Mode) (Score / TPF) |
| --- | --- | --- | --- | --- | --- |
| Average | 73.09 | 71.52 | 72.43 / 3.08 | 72.34 / 5.93 | 73.54 / 3.64 |
| **Knowledge** | | | | | |
| GPQA | 54.14 | 69.16 | 62.31 / 3.29 | 66.67 / 3.95 | 67.30 / 2.37 |
| MMLU-Pro | 74.21 | 77.55 | 74.79 / 2.36 | 75.31 / 4.43 | 76.59 / 2.62 |
| C-EVAL | 88.12 | 87.54 | 85.21 / 1.90 | 86.93 / 2.71 | 86.71 / 1.75 |
| PHYBench | 29.84 | 27.67 | 30.06 / 2.70 | 26.04 / 4.10 | 28.23 / 2.66 |
| TriviaQA | 65.61 | 69.76 | 66.88 / 1.94 | 72.55 / 4.30 | 72.93 / 2.92 |
| **Reasoning** | | | | | |
| BIG-Bench Hard | 85.54 | 89.36 | 86.75 / 2.66 | 87.82 / 5.61 | 88.69 / 3.28 |
| BIG-Bench Extra Hard | 37.80 | 23.24 | 27.86 / 4.60 | 33.51 / 5.04 | 35.77 / 3.17 |
| bbh-zh | 86.18 | 75.09 | 87.52 / 3.21 | 82.55 / 5.78 | 86.23 / 3.77 |
| MuSR | 79.15 | 82.72 | 80.48 / 1.70 | 80.10 / 2.90 | 79.84 / 1.85 |
| ZebraLogic | 90.97 | 87.60 | 82.30 / 2.74 | 84.20 / 5.80 | 88.90 / 3.26 |
| PrOntoQA | 97.12 | 97.88 | 96.50 / 2.64 | 95.00 / 9.23 | 97.00 / 5.73 |
| PIQA | 91.57 | 91.95 | 92.76 / 1.43 | 92.44 / 2.38 | 92.17 / 1.44 |
| OCNLI | 71.59 | 65.36 | 71.63 / 1.09 | 72.17 / 1.83 | 72.75 / 1.32 |
| HellaSwag | 86.31 | 81.59 | 84.97 / 1.26 | 85.60 / 2.31 | 85.31 / 1.51 |
| KOR-Bench | 69.2 | 69.44 | 63.04 / 3.44 | 62.80 / 4.97 | 65.12 / 2.77 |
| DROP | 87.57 | 88.32 | 87.90 / 2.26 | 87.55 / 5.40 | 87.86 / 2.53 |
| SQuAD 2.0 | 89.51 | 81.32 | 90.00 / 3.10 | 90.65 / 5.01 | 90.80 / 3.90 |
| **Coding** | | | | | |
| LiveCodeBench | 46.42 | 52.48 | 42.51 / 4.23 | 44.05 / 6.48 | 45.37 / 3.80 |
| CRUXEval-O | 86.75 | 82.75 | 85.12 / 3.21 | 85.25 / 6.54 | 87.50 / 3.80 |
| MBPP+ | 78.21 | 80.89 | 79.37 / 4.02 | 76.72 / 10.43 | 77.25 / 5.96 |
| HumanEval+ | 87.88 | 87.58 | 88.41 / 6.45 | 89.63 / 13.81 | 89.63 / 9.18 |
| MultiPL-E | 70.67 | 65.76 | 74.87 / 3.14 | 70.89 / 7.77 | 73.34 / 4.33 |
| BigCodeBench-Full | 41.49 | 40.70 | 41.58 / 3.33 | 37.11 / 8.51 | 39.21 / 4.70 |
| BIRD-SQL | 47.75 | 47.49 | 45.76 / 2.16 | 42.18 / 5.09 | 44.04 / 2.95 |
| Spider | 81.79 | 80.58 | 82.49 / 4.42 | 79.18 / 8.74 | 81.04 / 5.70 |
| **Math** | | | | | |
| AIME 2025 | 61.88 | 55.89 | 60.00 / 4.57 | 63.33 / 5.36 | 63.33 / 3.46 |
| OlympiadBench | 77.59 | 76.19 | 74.07 / 3.70 | 75.85 / 6.46 | 76.59 / 3.81 |
| GSM-Plus | 89.41 | 89.71 | 89.74 / 2.68 | 89.23 / 7.14 | 89.69 / 3.83 |
| CMATH | 96.58 | 96.52 | 96.90 / 2.17 | 96.54 / 4.84 | 96.63 / 2.65 |
| Omni-MATH | 54.00 | 53.00 | 50.30 / 3.39 | 52.30 / 6.01 | 54.10 / 3.50 |
| **Agent & Alignment** | | | | | |
| IFEval-strict-prompt | 83.73 | 81.15 | 82.62 / 1.47 | 83.36 / 2.24 | 83.55 / 1.41 |
| BFCL v3 | 73.41 | 67.69 | 74.94 / 4.87 | 74.86 / 9.24 | 75.61 / 6.76 |
| Nexus FC | 49.93 | 36.25 | 50.45 / 5.53 | 44.83 / 11.29 | 47.65 / 7.38 |
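
The score and TPF pairs in the benchmark results can be read as a speed/quality trade-off. The sketch below uses only the Average row and compares each LLaDA2.1-flash mode against LLaDA2.0-flash. The helper name `compare` is illustrative, and treating the TPF ratio as a rough proxy for decoding speedup is an assumption, not a claim made by the model card.

```python
# Average row from the benchmark results: (score, TPF).
averages = {
    "LLaDA2.0-flash": (72.43, 3.08),
    "LLaDA2.1-flash (S Mode)": (72.34, 5.93),
    "LLaDA2.1-flash (Q Mode)": (73.54, 3.64),
}


def compare(baseline: str, candidate: str) -> tuple[float, float]:
    """Return (score delta, TPF ratio) of candidate relative to baseline."""
    base_score, base_tpf = averages[baseline]
    cand_score, cand_tpf = averages[candidate]
    return round(cand_score - base_score, 2), round(cand_tpf / base_tpf, 2)


for name in ("LLaDA2.1-flash (S Mode)", "LLaDA2.1-flash (Q Mode)"):
    delta, ratio = compare("LLaDA2.0-flash", name)
    print(f"{name}: score {delta:+.2f}, TPF x{ratio:.2f}")
```

On these averages, Speed Mode gives up 0.09 points for about 1.93x the TPF, while Quality Mode gains 1.11 points at about 1.18x the TPF.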

---

🚀 Highlights

+ Error-Correcting Editable: Structural innovation of editable generation for dLLM.
+ Speed vs Quality Mode: The 100B flash model achieves ultra-fast inference in Speed Mode while remaining competitive across tasks; Quality Mode trades some speed for higher scores.
+ Reinforcement Learning on 100B-scale dLLM: Tailored algorithm and framework to enable reinforcement learning for large dLLM.

🗺️ What's Next

+ Powerful Agentic/Tool Use Capability with LLaDA: The next update will add powerful agentic and long-distance tool-use capability.
+ Extreme Editing: The next update will feature stronger and more extensive editing capabilities, aimed at correcting more errors in parallel reasoning.
+ Explore More Training Paradigms: We want to explore training paradigms beyond SFT and RL for dLLM.

---

📦 Model Variants

| Model ID | Description | Hugging Face Link |
| --- | --- | --- |
| inclusionAI/LLaDA2.1-mini | Instruction-tuned model, ready for downstream applications. | 🤗 Model Card |
| inclusionAI/LLaDA2.1-flash | Instruction-tuned model, ready for downstream applications. | 🤗 Model Card |

---

🔍 Model Overview

LLaDA2.1-flash has the following specifications:

+ Type: Mixture-of-Experts (MoE) Diffusion Language Model
+ Total Parameters (Non-Embedding): 100B
+ Number of Layers: 32
+ Attention Heads: 32
+ Context Length: 32,768 tokens
+ Position Embedding: Rotary (RoPE)
+ Vocabulary Size: 157,184
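
For capacity planning, the parameter count above translates into a lower bound on weight memory. This is a back-of-the-envelope sketch, not a measured figure: the function name `weight_memory_gib` is illustrative, it assumes bfloat16 storage (2 bytes per parameter), and it ignores embeddings, activations, and any runtime buffers.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights only, in GiB."""
    return num_params * bytes_per_param / 2**30


# 100B non-embedding parameters from the specification list, stored as bfloat16.
print(f"{weight_memory_gib(100e9):.0f} GiB")
```

Under these assumptions the weights alone need roughly 186 GiB, so the model has to be sharded across several accelerators.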

---

🤗 Hugging Face Transformers

Make sure you have transformers and its dependencies installed:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/LLaDA2.1-flash"
device = "auto"  # let the loader place the weights across available devices

# The model ships custom modeling code, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map=device,
)
model = model.to(torch.bfloat16)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Format a single-turn chat prompt and tokenize it.
prompt = """Calculate 1+5-28*0.5-200=?"""
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
)

# threshold=0.5 with editing_threshold=0 matches the Speed Mode settings.
generated_tokens = model.generate(
    inputs=input_ids,
    eos_early_stop=True,
    gen_length=512,
    block_length=32,
    threshold=0.5,
    editing_threshold=0,
    temperature=0.0,
)
generated_answer = tokenizer.decode(
    generated_tokens[0],
    skip_special_tokens=True,
)
print(generated_answer)

Multi-block editing inference is coming soon.

Best Practices

To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters: We recommend the following general sampling parameters: block_length=32, temperature=0.0, top_p=None and top_k=None. We are currently exploring more diverse sampling configurations.

2. Denoising Thresholds: There are three denoising parameters: threshold, editing_threshold and max_post_steps. We recommend threshold=0.7, editing_threshold=0.5 for Quality Mode and threshold=0.5, editing_threshold=0.0 for Speed Mode. For both modes, we suggest setting max_post_steps to a value greater than 5. We recommend 16 as a balanced default, which was used for most of our internal testing.

Note: A low threshold speeds up inference but may cause stuttering in the output.

3. Adequate Output Length: We recommend using an output length of 16384 tokens for most scenarios.
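
The recommendations above can be collected into two presets. This is a minimal sketch: the names `COMMON`, `MODES`, and `generation_kwargs` are hypothetical, and passing `max_post_steps` as a keyword argument to `model.generate` is an assumption based on the parameter list above, since the example call in this card does not show it. `top_p` and `top_k` are omitted on the assumption that they default to `None`.

```python
# Values taken from the Best Practices list.
COMMON = {
    "block_length": 32,
    "temperature": 0.0,
    "gen_length": 16384,
    "max_post_steps": 16,  # assumed to be accepted by model.generate
}
MODES = {
    "quality": {"threshold": 0.7, "editing_threshold": 0.5},
    "speed": {"threshold": 0.5, "editing_threshold": 0.0},
}


def generation_kwargs(mode: str) -> dict:
    """Merge the shared settings with the thresholds for the chosen mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODES)}")
    return {**COMMON, **MODES[mode]}


print(generation_kwargs("quality"))
```

With the model and `input_ids` from the Transformers example, a call would then look like `model.generate(inputs=input_ids, eos_early_stop=True, **generation_kwargs("speed"))`.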

---

🤖ModelScope

If you're in mainland China, we strongly recommend downloading the model from 🤖ModelScope.

---

Deployment

SGLang

SGLang enables dLLM inference either through…

