inclusionAI/LLaDA2.1-flash
🚀 LLaDA2.1-flash is now live on ZenmuxAI! Try it via API 🛠️ or Chat 💬: https://zenmux.ai/inclusionai/llada2.1-flash
LLaDA2.1-flash is a diffusion language model in the LLaDA series that adds an editing enhancement. It raises inference speed while keeping task performance strong: in Speed Mode it averages 5.93 TPF versus 3.08 for LLaDA2.0-flash, with a comparable average score (72.34 vs. 72.43).
---
| Benchmark | Qwen3-30B-A3B-Inst-2507 (Score) | Ling-flash-2.0 (Score) | LLaDA2.0-flash (Score) | LLaDA2.0-flash (TPF) | LLaDA2.1-flash S Mode (Score) | LLaDA2.1-flash S Mode (TPF) | LLaDA2.1-flash Q Mode (Score) | LLaDA2.1-flash Q Mode (TPF) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Average** | 73.09 | 71.52 | 72.43 | 3.08 | 72.34 | 5.93 | 73.54 | 3.64 |
| **Knowledge** | | | | | | | | |
| GPQA | 54.14 | 69.16 | 62.31 | 3.29 | 66.67 | 3.95 | 67.30 | 2.37 |
| MMLU-Pro | 74.21 | 77.55 | 74.79 | 2.36 | 75.31 | 4.43 | 76.59 | 2.62 |
| C-EVAL | 88.12 | 87.54 | 85.21 | 1.90 | 86.93 | 2.71 | 86.71 | 1.75 |
| PHYBench | 29.84 | 27.67 | 30.06 | 2.70 | 26.04 | 4.10 | 28.23 | 2.66 |
| TriviaQA | 65.61 | 69.76 | 66.88 | 1.94 | 72.55 | 4.30 | 72.93 | 2.92 |
| **Reasoning** | | | | | | | | |
| BIG-Bench Hard | 85.54 | 89.36 | 86.75 | 2.66 | 87.82 | 5.61 | 88.69 | 3.28 |
| BIG-Bench Extra Hard | 37.80 | 23.24 | 27.86 | 4.60 | 33.51 | 5.04 | 35.77 | 3.17 |
| bbh-zh | 86.18 | 75.09 | 87.52 | 3.21 | 82.55 | 5.78 | 86.23 | 3.77 |
| MuSR | 79.15 | 82.72 | 80.48 | 1.70 | 80.10 | 2.90 | 79.84 | 1.85 |
| ZebraLogic | 90.97 | 87.60 | 82.30 | 2.74 | 84.20 | 5.80 | 88.90 | 3.26 |
| PrOntoQA | 97.12 | 97.88 | 96.50 | 2.64 | 95.00 | 9.23 | 97.00 | 5.73 |
| PIQA | 91.57 | 91.95 | 92.76 | 1.43 | 92.44 | 2.38 | 92.17 | 1.44 |
| OCNLI | 71.59 | 65.36 | 71.63 | 1.09 | 72.17 | 1.83 | 72.75 | 1.32 |
| HellaSwag | 86.31 | 81.59 | 84.97 | 1.26 | 85.60 | 2.31 | 85.31 | 1.51 |
| KOR-Bench | 69.2 | 69.44 | 63.04 | 3.44 | 62.80 | 4.97 | 65.12 | 2.77 |
| DROP | 87.57 | 88.32 | 87.90 | 2.26 | 87.55 | 5.40 | 87.86 | 2.53 |
| SQuAD 2.0 | 89.51 | 81.32 | 90.00 | 3.10 | 90.65 | 5.01 | 90.80 | 3.90 |
| **Coding** | | | | | | | | |
| LiveCodeBench | 46.42 | 52.48 | 42.51 | 4.23 | 44.05 | 6.48 | 45.37 | 3.80 |
| CRUXEval-O | 86.75 | 82.75 | 85.12 | 3.21 | 85.25 | 6.54 | 87.50 | 3.80 |
| MBPP+ | 78.21 | 80.89 | 79.37 | 4.02 | 76.72 | 10.43 | 77.25 | 5.96 |
| HumanEval+ | 87.88 | 87.58 | 88.41 | 6.45 | 89.63 | 13.81 | 89.63 | 9.18 |
| MultiPL-E | 70.67 | 65.76 | 74.87 | 3.14 | 70.89 | 7.77 | 73.34 | 4.33 |
| BigCodeBench-Full | 41.49 | 40.70 | 41.58 | 3.33 | 37.11 | 8.51 | 39.21 | 4.70 |
| BIRD-SQL | 47.75 | 47.49 | 45.76 | 2.16 | 42.18 | 5.09 | 44.04 | 2.95 |
| Spider | 81.79 | 80.58 | 82.49 | 4.42 | 79.18 | 8.74 | 81.04 | 5.70 |
| **Math** | | | | | | | | |
| AIME 2025 | 61.88 | 55.89 | 60.00 | 4.57 | 63.33 | 5.36 | 63.33 | 3.46 |
| OlympiadBench | 77.59 | 76.19 | 74.07 | 3.70 | 75.85 | 6.46 | 76.59 | 3.81 |
| GSM-Plus | 89.41 | 89.71 | 89.74 | 2.68 | 89.23 | 7.14 | 89.69 | 3.83 |
| CMATH | 96.58 | 96.52 | 96.90 | 2.17 | 96.54 | 4.84 | 96.63 | 2.65 |
| Omni-MATH | 54.00 | 53.00 | 50.30 | 3.39 | 52.30 | 6.01 | 54.10 | 3.50 |
| **Agent & Alignment** | | | | | | | | |
| IFEval-strict-prompt | 83.73 | 81.15 | 82.62 | 1.47 | 83.36 | 2.24 | 83.55 | 1.41 |
| BFCL v3 | 73.41 | 67.69 | 74.94 | 4.87 | 74.86 | 9.24 | 75.61 | 6.76 |
| Nexus FC | 49.93 | 36.25 | 50.45 | 5.53 | 44.83 | 11.29 | 47.65 | 7.38 |
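
The speed and quality trade-off in the Average row can be made concrete with a little arithmetic. The sketch below is illustrative only: the numbers are copied from the table, the helper name `compare_to_baseline` is hypothetical, and it treats a higher TPF as faster decoding, which is how the table appears to use the metric.

```python
# Average-row values copied from the benchmark table: (score, TPF).
AVERAGE = {
    "LLaDA2.0-flash": (72.43, 3.08),
    "LLaDA2.1-flash (S Mode)": (72.34, 5.93),
    "LLaDA2.1-flash (Q Mode)": (73.54, 3.64),
}


def compare_to_baseline(rows, baseline="LLaDA2.0-flash"):
    """Return {model: (score delta, TPF ratio)} relative to the baseline row."""
    base_score, base_tpf = rows[baseline]
    result = {}
    for name, (score, tpf) in rows.items():
        if name == baseline:
            continue
        result[name] = (round(score - base_score, 2), round(tpf / base_tpf, 2))
    return result


comparison = compare_to_baseline(AVERAGE)
for name, (delta, ratio) in comparison.items():
    print(f"{name}: score {delta:+.2f}, TPF x{ratio:.2f}")
```

On these averages, S Mode gives up 0.09 points for roughly 1.93x the TPF of LLaDA2.0-flash, while Q Mode gains 1.11 points at roughly 1.18x the TPF.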
---
🚀 Highlights
+ Error-Correcting Editable: Structural innovation of editable generation for dLLM.
+ Speed vs. Quality Mode: The 100B flash model achieves ultra-fast inference in Speed Mode and remains competitive across a range of tasks in Quality Mode.
+ Reinforcement Learning on 100B-scale dLLM: Tailored algorithm and framework to enable reinforcement learning for large dLLM.
🗺️ What's Next
+ Powerful Agentic/Tool Use Capability with LLaDA: Next update will be equipped with powerful Agentic and long-distance tool-use capability.
+ Extreme Editing: Next update will feature stronger and more extensive editing capabilities, aimed at correcting more errors in parallel reasoning.
+ Explore More Training Paradigms: We want to explore more training paradigms than SFT and RL for dLLM.

---
📦 Model Variants
| Model ID | Description | Hugging Face Link |
| --- | --- | --- |
| inclusionAI/LLaDA2.1-mini | Instruction-tuned model, ready for downstream applications. | 🤗 Model Card |
| inclusionAI/LLaDA2.1-flash | Instruction-tuned model, ready for downstream applications. | 🤗 Model Card |
---
🔍 Model Overview
LLaDA2.1-flash has the following specifications:
+ Type: Mixture-of-Experts (MoE) Diffusion Language Model
+ Total Parameters (Non-Embedding): 100B
+ Number of Layers: 32
+ Attention Heads: 32
+ Context Length: 32,768 tokens
+ Position Embedding: Rotary (RoPE)
+ Vocabulary Size: 157,184
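
The context length bounds how long a prompt and its generation can be together. The following is a minimal sketch of that budget check, assuming the prompt and the generated tokens share the same 32,768-token window (the usual convention, but not stated explicitly here). The helper names are hypothetical and not part of the model's API.

```python
CONTEXT_LENGTH = 32_768  # taken from the specification list


def remaining_generation_budget(prompt_tokens: int,
                                context_length: int = CONTEXT_LENGTH) -> int:
    """Tokens left for generation after the prompt (never negative)."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(context_length - prompt_tokens, 0)


def fits_in_context(prompt_tokens: int, gen_length: int,
                    context_length: int = CONTEXT_LENGTH) -> bool:
    """True if prompt plus requested generation fits in the window."""
    return prompt_tokens + gen_length <= context_length


print(remaining_generation_budget(1_000))
print(fits_in_context(16_384, 16_384))
```

Under this assumption, a 16,384-token output leaves at most 16,384 tokens for the prompt.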
---
🤗 Hugging Face Transformers
Make sure you have transformers and its dependencies installed:
```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/LLaDA2.1-flash"
device = "auto"

# Load the model with its custom modeling code, then cast to bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map=device,
)
model = model.to(torch.bfloat16)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Wrap the user prompt in the chat template and tokenize it.
prompt = """Calculate 1+5-28*0.5-200=?"""
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
)

# threshold=0.5 with editing_threshold=0 corresponds to the Speed Mode settings.
generated_tokens = model.generate(
    inputs=input_ids,
    eos_early_stop=True,
    gen_length=512,
    block_length=32,
    threshold=0.5,
    editing_threshold=0,
    temperature=0.0,
)
generated_answer = tokenizer.decode(
    generated_tokens[0],
    skip_special_tokens=True,
)
print(generated_answer)
```

Multi-block editing inference is coming soon.
Best Practices
To achieve optimal performance, we recommend the following settings:
1. Sampling Parameters: We recommend the following general sampling parameters: block_length=32, temperature=0.0, top_p=None and top_k=None. We are currently exploring more diverse sampling configurations.
2. Denoising Thresholds: There are three denoising parameters: threshold, editing_threshold and max_post_steps. We recommend threshold=0.7, editing_threshold=0.5 for Quality Mode and threshold=0.5, editing_threshold=0.0 for Speed Mode. For both modes, we suggest setting max_post_steps to a value greater than 5. We recommend 16 as a balanced default, which was used for most of our internal testing.
Note: A low threshold may cause stuttering as the trade-off for faster inference.
3. Adequate Output Length: We recommend using an output length of 16384 tokens for most scenarios.
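
The recommendations above can be collected into per-mode presets. This is a sketch rather than an official helper: the names `DENOISING_PRESETS` and `build_generate_kwargs` are hypothetical, and it assumes `model.generate` accepts `top_p`, `top_k`, and `max_post_steps` as keyword arguments, as the parameter names in the text suggest.

```python
from typing import Any, Dict

# Threshold values copied from the recommendations; the layout is illustrative.
DENOISING_PRESETS: Dict[str, Dict[str, float]] = {
    "quality": {"threshold": 0.7, "editing_threshold": 0.5},
    "speed": {"threshold": 0.5, "editing_threshold": 0.0},
}


def build_generate_kwargs(mode: str, gen_length: int = 16384,
                          max_post_steps: int = 16) -> Dict[str, Any]:
    """Assemble the recommended generation settings for one mode."""
    if mode not in DENOISING_PRESETS:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(DENOISING_PRESETS)}"
        )
    # The text only suggests a value above 5; this sketch chooses to enforce it.
    if max_post_steps <= 5:
        raise ValueError("max_post_steps should be greater than 5")
    kwargs: Dict[str, Any] = {
        "gen_length": gen_length,
        "block_length": 32,
        "temperature": 0.0,
        "top_p": None,
        "top_k": None,
        "max_post_steps": max_post_steps,
    }
    kwargs.update(DENOISING_PRESETS[mode])
    return kwargs


quality_kwargs = build_generate_kwargs("quality")
speed_kwargs = build_generate_kwargs("speed")
print(quality_kwargs)
print(speed_kwargs)
```

With a loaded model, the result could be passed as `model.generate(inputs=input_ids, eos_early_stop=True, **quality_kwargs)`, provided the assumed keyword names match the model's remote code.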
---
🤖ModelScope
If you're in mainland China, we strongly recommend using our model from 🤖 ModelScope.
---
Deployment
SGLang
SGLang enables dLLM inference either through…