Model · InclusionAI (Ant Group) · published Feb 9, 2026

inclusionAI/LLaDA2.1-mini

published Feb 9, 2026 · task: text-generation · license: apache-2.0 · library: transformers · params: 16B · downloads: 15k · likes: 125

LLaDA2.1-mini

🚀 LLaDA2.1-flash is now live on ZenmuxAI! Try it via API 🛠️ or Chat 💬: https://zenmux.ai/inclusionai/llada2.1-flash

LLaDA2.1-mini is a diffusion language model in the LLaDA series that adds an editing enhancement. It significantly improves inference speed while delivering strong task performance.

---

Model Performance

| Benchmark | Qwen3-8B (no_think) Score | Ling-mini-2.0 Score | LLaDA2.0-mini Score | LLaDA2.0-mini TPF | LLaDA2.1-mini (S Mode) Score | LLaDA2.1-mini (S Mode) TPF | LLaDA2.1-mini (Q Mode) Score | LLaDA2.1-mini (Q Mode) TPF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Average | 61.59 | 64.72 | 63.39 | 2.60 | 62.07 | 5.34 | 63.90 | 3.12 |
| **Knowledge** | | | | | | | | |
| GPQA | 48.01 | 59.41 | 47.76 | 2.73 | 48.36 | 3.62 | 53.28 | 2.12 |
| MMLU-Pro | 65.83 | 67.18 | 64.27 | 2.15 | 63.42 | 4.22 | 64.84 | 2.41 |
| C-EVAL | 80.6 | 82.17 | 81.80 | 1.78 | 78.40 | 3.39 | 78.59 | 1.91 |
| PHYBench | 9.76 | 14.59 | 11.70 | 2.48 | 12.75 | 4.41 | 13.05 | 2.52 |
| TriviaQA | 52.51 | 55.63 | 51.33 | 1.54 | 53.33 | 3.21 | 54.24 | 2.02 |
| **Reasoning** | | | | | | | | |
| BIG-Bench Hard | 79.48 | 83.70 | 78.21 | 2.36 | 78.42 | 5.02 | 80.58 | 2.86 |
| BIG-Bench Extra Hard | 18.27 | 14.81 | 16.47 | 2.03 | 15.30 | 3.19 | 15.78 | 1.66 |
| bbh-zh | 80.09 | 66.11 | 75.75 | 2.77 | 67.65 | 3.89 | 70.40 | 2.35 |
| MuSR | 70.02 | 71.36 | 71.48 | 1.45 | 70.43 | 2.48 | 71.89 | 1.56 |
| ZebraLogic | 37.48 | 79.85 | 64.20 | 2.30 | 68.50 | 5.38 | 77.10 | 2.93 |
| PrOntoQA | 93.12 | 96.06 | 86.00 | 2.36 | 87.50 | 4.86 | 84.50 | 2.73 |
| PIQA | 88.30 | 87.54 | 86.51 | 1.45 | 84.87 | 2.59 | 86.89 | 1.45 |
| OCNLI | 61.49 | 60.17 | 64.51 | 4.06 | 61.02 | 1.78 | 61.59 | 1.23 |
| HellaSwag | 79.56 | 69.02 | 79.01 | 1.50 | 75.71 | 2.39 | 76.19 | 1.49 |
| KOR-Bench | 54.96 | 63.2 | 49.92 | 2.45 | 46.64 | 4.28 | 48.00 | 2.35 |
| DROP | 84.56 | 78.80 | 81.89 | 2.02 | 81.55 | 5.84 | 82.37 | 2.87 |
| SQuAD 2.0 | 85.21 | 75.56 | 86.50 | 2.47 | 84.51 | 4.33 | 85.13 | 3.09 |
| **Coding** | | | | | | | | |
| LiveCodeBench | 26.76 | 42.29 | 31.83 | 3.34 | 28.85 | 6.42 | 30.40 | 3.63 |
| CRUXEval-O | 74.06 | 76.12 | 71.62 | 2.78 | 70.62 | 5.85 | 73.75 | 3.35 |
| MBPP+ | 72.69 | 77.25 | 78.24 | 3.43 | 73.28 | 10.59 | 74.07 | 6.30 |
| HumanEval+ | 79.5 | 80.03 | 81.40 | 5.16 | 80.49 | 12.32 | 82.93 | 7.77 |
| MultiPL-E | 61.70 | 67.09 | 67.46 | 2.78 | 64.16 | 7.23 | 67.17 | 4.01 |
| BigCodeBench-Full | 36.05 | 35.00 | 32.89 | 2.87 | 30.18 | 7.33 | 34.39 | 4.09 |
| BIRD-SQL | 36.11 | 39.67 | 39.34 | 1.96 | 37.32 | 4.48 | 38.40 | 2.42 |
| Spider | 72.80 | 76.43 | 76.76 | 3.93 | 75.78 | 7.98 | 77.55 | 5.48 |
| **Math** | | | | | | | | |
| AIME 2025 | 22.08 | 47.66 | 36.67 | 2.41 | 36.67 | 6.34 | 43.33 | 3.29 |
| OlympiadBench | 55.33 | 72.30 | 67.70 | 2.63 | 64.30 | 7.08 | 66.67 | 3.99 |
| GSM-Plus | 85.56 | 87.18 | 86.50 | 2.41 | 85.88 | 6.82 | 86.55 | 3.69 |
| CMATH | 95.42 | 96.40 | 95.72 | 1.98 | 95.63 | 4.94 | 94.99 | 2.56 |
| Omni-MATH | 33.20 | 48.80 | 41.70 | 2.57 | 41.70 | 6.41 | 43.60 | 3.56 |
| **Agent & Alignment** | | | | | | | | |
| IFEval-strict-prompt | 84.29 | 76.16 | 80.78 | 1.24 | 81.33 | 1.83 | 83.18 | 1.25 |
| BFCL v3 | 70.12 | 53.75 | 70.72 | 4.26 | 72.06 | 7.39 | 73.61 | 5.14 |
| Nexus FC | 37.71 | 34.38 | 35.18 | 4.06 | 31.59 | 8.27 | 33.69 | 4.91 |
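To read the Score and TPF figures together, it can help to compute how each LLaDA2.1-mini mode moves relative to LLaDA2.0-mini. The sketch below uses only the Average row from the results above. TPF is assumed here to mean tokens per forward pass (the table does not expand the abbreviation), and the helper name `relative_change` is illustrative.

```python
# Average row copied from the benchmark results: (score, TPF).
AVERAGE = {
    "LLaDA2.0-mini": (63.39, 2.60),
    "LLaDA2.1-mini (S Mode)": (62.07, 5.34),
    "LLaDA2.1-mini (Q Mode)": (63.90, 3.12),
}


def relative_change(baseline, candidate):
    """Return (score delta in points, TPF ratio) of candidate vs. baseline."""
    base_score, base_tpf = AVERAGE[baseline]
    cand_score, cand_tpf = AVERAGE[candidate]
    return round(cand_score - base_score, 2), round(cand_tpf / base_tpf, 2)


for mode in ("LLaDA2.1-mini (S Mode)", "LLaDA2.1-mini (Q Mode)"):
    delta, ratio = relative_change("LLaDA2.0-mini", mode)
    print(f"{mode}: {delta:+.2f} points, {ratio:.2f}x TPF")
```

On the Average row, S Mode reaches roughly 2.05x the TPF of LLaDA2.0-mini at a score 1.32 points lower, while Q Mode reaches about 1.20x the TPF at a score 0.51 points higher.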

---

🚀 Highlights

+ Error-Correcting Editable: A structural innovation that enables editable generation for dLLMs.
+ Speed vs. Quality Mode: The 16B mini model achieves ultra-fast inference in Speed Mode while remaining competitive across various tasks, including in Quality Mode.
+ Reinforcement Learning on 100B-scale dLLM: A tailored algorithm and framework that enable reinforcement learning for large dLLMs.

🗺️ What's Next

+ Powerful Agentic/Tool Use Capability with LLaDA: The next update will add powerful agentic and long-distance tool-use capabilities.
+ Extreme Editing: The next update will feature stronger and more extensive editing capabilities, aimed at correcting more errors in parallel reasoning.
+ Explore More Training Paradigms: We want to explore training paradigms beyond SFT and RL for dLLMs.

---

📦 Model Variants

| Model ID | Description | Hugging Face Link |
| --- | --- | --- |
| inclusionAI/LLaDA2.1-mini | Instruction-tuned model, ready for downstream applications. | 🤗 Model Card |
| inclusionAI/LLaDA2.1-flash | Instruction-tuned model, ready for downstream applications. | 🤗 Model Card |

---

🔍 Model Overview

LLaDA2.1-mini has the following specifications:

+ Type: Mixture-of-Experts (MoE) Diffusion Language Model
+ Total Parameters (Non-Embedding): 16B
+ Number of Layers: 20
+ Attention Heads: 16
+ Context Length: 32,768 tokens
+ Position Embedding: Rotary (RoPE)
+ Vocabulary Size: 157,184
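As a rough sizing aid, the weight footprint can be estimated from the parameter count above. This is a back-of-the-envelope sketch, not an official figure. It assumes 2 bytes per parameter (bfloat16, the dtype used in the loading snippet) and counts only the 16B non-embedding parameters. Embeddings, activations, and any caches are excluded, so real memory use will be higher. The function name `weight_footprint_gib` is illustrative.

```python
def weight_footprint_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate raw weight storage in GiB for a given parameter count."""
    return num_params * bytes_per_param / 2**30


# 16B non-embedding parameters stored in bfloat16 (2 bytes each).
print(f"{weight_footprint_gib(16e9):.1f} GiB")
```

Under these assumptions the weights alone come to roughly 29.8 GiB.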

---

🤗 Hugging Face Transformers

Make sure you have transformers and its dependencies installed:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer; trust_remote_code is needed for the custom model code.
model_path = "/path/to/LLaDA2.1-mini"
device = "auto"
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map=device,
)
model = model.to(torch.bfloat16)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Build the chat-formatted prompt as token IDs.
prompt = """Calculate 1+5-28*0.5-200=?"""
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
)

# threshold=0.5 with editing_threshold=0 matches the Speed Mode
# settings recommended under Best Practices.
generated_tokens = model.generate(
    inputs=input_ids,
    eos_early_stop=True,
    gen_length=512,
    block_length=32,
    threshold=0.5,
    editing_threshold=0,
    temperature=0.0,
)
generated_answer = tokenizer.decode(
    generated_tokens[0],
    skip_special_tokens=True,
)
print(generated_answer)
```

Best Practices

To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters: We recommend the following general sampling parameters: block_length=32, temperature=0.0, top_p=None and top_k=None. We are currently exploring more diverse sampling configurations.

2. Denoising Thresholds: There are three denoising parameters: threshold, editing_threshold and max_post_steps. We recommend threshold=0.7, editing_threshold=0.5 for Quality Mode and threshold=0.5, editing_threshold=0.0 for Speed Mode. For both modes, we suggest setting max_post_steps to a value greater than 5. We recommend 16 as a balanced default, which was used for most of our internal testing.

Note: A low threshold may cause stuttering in exchange for faster inference.

3. Adequate Output Length: We recommend using an output length of 16384 tokens for most scenarios.
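The recommendations above can be collected in one place. The following is a minimal sketch, not part of the official API. `build_generate_kwargs` and `MODE_PRESETS` are hypothetical names. The sketch also assumes that `model.generate` accepts `max_post_steps` as a keyword argument alongside the ones used in the Transformers example: the text names it as a denoising parameter, but the example does not pass it.

```python
from typing import Any, Dict

# Threshold presets taken from the Denoising Thresholds recommendation.
MODE_PRESETS = {
    "quality": {"threshold": 0.7, "editing_threshold": 0.5},
    "speed": {"threshold": 0.5, "editing_threshold": 0.0},
}


def build_generate_kwargs(
    mode: str = "quality",
    gen_length: int = 16384,
    max_post_steps: int = 16,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the recommended generation settings."""
    if mode not in MODE_PRESETS:
        raise ValueError(f"mode must be one of {sorted(MODE_PRESETS)}, got {mode!r}")
    kwargs: Dict[str, Any] = {
        "eos_early_stop": True,
        "gen_length": gen_length,          # recommended output length
        "block_length": 32,                # recommended sampling setting
        "temperature": 0.0,                # recommended sampling setting
        "max_post_steps": max_post_steps,  # assumption: accepted by generate()
    }
    kwargs.update(MODE_PRESETS[mode])
    return kwargs


print(build_generate_kwargs("speed", gen_length=512))
```

With a loaded model, this could be used as `model.generate(inputs=input_ids, **build_generate_kwargs("quality"))`, provided the model's remote code accepts every key.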

---

🤖ModelScope

If you're in mainland China, we strongly recommend using our model from 🤖ModelScope.

---

Deployment

SGLang

SGLang enables dLLM inference either through offline batching or by launching an HTTP server…

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable mini model with solid downloads