meituan-longcat/LongCat-Flash-Thinking
Tech Report 📄
Model Introduction
We introduce and release LongCat-Flash-Thinking, a powerful and efficient large reasoning model (LRM) with 560 billion total parameters and an innovative Mixture-of-Experts (MoE) architecture. The model incorporates a dynamic computation mechanism that activates between 18.6B and 31.3B parameters (about 27B on average) depending on contextual demands, optimizing both computational efficiency and performance. LongCat-Flash-Thinking is trained with our DORA system, an efficient distributed RL framework that supports asynchronous training and flexible accelerator usage. Our comprehensive data curation and domain-parallel training recipe ensures stable and efficient training. In addition to general reasoning, the model is equipped with formal reasoning and agentic reasoning techniques, advancing LRMs' reasoning ability on diverse complex tasks such as mathematics, logic, programming, automatic theorem proving, and tool use.
Specifically, the development of LongCat-Flash-Thinking follows a two-phase pipeline:
- Long CoT Cold-Start Training: This phase aims to cultivate the model's foundational reasoning abilities.
This phase begins with a curriculum learning strategy during mid-training to bolster intrinsic capabilities, followed by an SFT stage on reasoning-intensive and agentic data to prepare the model for advanced learning.
- Large-Scale RL: The second phase scales up this potential through an efficient RL framework, built upon our Dynamic Orchestration for Asynchronous Rollout (DORA) system for industrial-scale asynchronous training.
To address the stability challenges in asynchronous RL training, we adapt and extend the GRPO algorithm for a robust exploration-exploitation balance. A key innovation in this phase is our domain-parallel training scheme, which simultaneously optimizes the model across distinct domains and subsequently merges the resulting domain-expert models into a fused model. Finally, we perform a general RL stage to further refine the fused model and enhance its robustness, safety, and human alignment ability.
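The group-relative advantage at the core of standard GRPO can be sketched as follows. This is a minimal illustration of the baseline algorithm only, not the adapted and extended variant described above; the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled response relative to the other responses
    drawn for the same prompt (baseline GRPO, no learned critic)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt, with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct responses receive a positive advantage and incorrect ones a negative advantage, so the policy gradient needs no separate value model.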
Key Features
🌟 Domain-Parallel RL Training Methodology
To overcome the instability of traditional mixed-domain RL training, LongCat-Flash-Thinking incorporates a domain-parallel training scheme that decouples optimization across STEM, coding, and agentic tasks. This approach not only stabilizes training but also allows the resulting domain-expert models to be fused into a nearly Pareto-optimal final model that excels across all specialties.
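The text here does not specify how the domain-expert models are fused. As a rough illustration of the general idea only, the sketch below assumes plain weighted parameter averaging over toy checkpoints; `merge_experts`, the dictionary layout, and the uniform default weights are all hypothetical and may differ from the method in the technical report.

```python
def merge_experts(experts, weights=None):
    """Fuse domain-expert checkpoints by weighted parameter averaging.

    experts: {domain_name: {param_name: list_of_floats}}
    weights: {domain_name: float}; defaults to uniform (an assumption).
    """
    names = list(experts)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    merged = {}
    for param, values in experts[names[0]].items():
        merged[param] = [
            sum(weights[name] * experts[name][param][i] for name in names)
            for i in range(len(values))
        ]
    return merged


# Toy checkpoints for the three domains named in the text.
experts = {
    "stem": {"w": [1.0, 2.0]},
    "coding": {"w": [2.0, 4.0]},
    "agentic": {"w": [3.0, 6.0]},
}
fused = merge_experts(experts)
```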
🌟 Pioneering RL Infrastructure
LongCat-Flash-Thinking is built upon our self-designed DORA system. The main motivation is to optimize long-tail generation by leveraging multiple old versions of the Actor model through streaming rollout while keeping sampling consistent. The DORA system consists of two core components: elastic colocation and a multi-version asynchronous pipeline. These components aim to enhance training efficiency, ensure policy consistency per sample, and enable efficient KV-cache reuse, facilitating stable and scalable training on tens of thousands of accelerators.
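One way to picture per-sample policy consistency in a multi-version pipeline is the toy bookkeeping sketch below. It assumes each rollout is pinned to the Actor version that was current when generation started, so long-running samples finish under their original version while new samples use the latest weights. The class and method names are hypothetical and do not reflect DORA's actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    prompt: str
    policy_version: int  # pinned when generation starts, never changed
    tokens: list = field(default_factory=list)


class VersionedRolloutPool:
    """Hypothetical bookkeeping: several Actor versions generate side by side."""

    def __init__(self):
        self.latest_version = 0
        self.in_flight = []

    def publish_weights(self):
        # The trainer finished an update; new rollouts use the new version.
        self.latest_version += 1

    def start(self, prompt):
        rollout = Rollout(prompt, self.latest_version)
        self.in_flight.append(rollout)
        return rollout

    def active_versions(self):
        return sorted({r.policy_version for r in self.in_flight})


pool = VersionedRolloutPool()
long_sample = pool.start("hard proof")   # begins under version 0
pool.publish_weights()                   # training step completes
short_sample = pool.start("easy sum")    # begins under version 1
```

Because the long sample is never switched to the newer weights mid-generation, every token in it comes from a single policy, and its KV cache stays valid.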
🌟 Advancing Formal Reasoning and Agentic Reasoning
In addition to general reasoning (e.g., mathematics, logic, coding, and instruction-following), LongCat-Flash-Thinking also emphasizes two other critical capabilities.
- Formal Reasoning: LongCat-Flash-Thinking can solve complex formal reasoning tasks, e.g., automatic theorem proving. To help realize this potential and empower researchers, we significantly enhance the model's formal reasoning capabilities.
To achieve this, we introduce a novel expert iteration framework for careful data synthesis, involving statement formalization, iterative proof synthesis, and syntax/consistency filtering.
- Agentic Reasoning: LongCat-Flash-Thinking can adaptively utilize provided tools to solve complex reasoning tasks. To reach this goal, we introduce a dual-path reasoning approach to identify and retain high-quality queries that genuinely require tool assistance, thereby fostering the development of robust agentic abilities.
After high-value query selection, we synthesize corresponding high-quality solution trajectories based on a versatile environment with diverse tool APIs, including MCP servers and simulated tools for both single and multi-turn interactions.
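The dual-path idea of keeping only queries that genuinely need tools can be sketched as a simple filter. The sketch assumes each query has been attempted several times both without and with tool access, and that a query is retained when tools raise its pass rate by a margin; the `min_gain` threshold, the function names, and the example statistics are hypothetical and are not taken from the report.

```python
def needs_tools(pass_rate_no_tools, pass_rate_with_tools, min_gain=0.3):
    """True when tool access improves the pass rate by at least min_gain."""
    return (pass_rate_with_tools - pass_rate_no_tools) >= min_gain


def select_queries(stats, min_gain=0.3):
    """stats: {query: (pass_rate_without_tools, pass_rate_with_tools)}"""
    return [
        query
        for query, (p_plain, p_tool) in stats.items()
        if needs_tools(p_plain, p_tool, min_gain)
    ]


# Made-up pass rates, for illustration only.
stats = {
    "2 + 2": (1.0, 1.0),                    # solvable without tools
    "sha256 of a long string": (0.0, 0.9),  # needs code execution
    "current weather lookup": (0.1, 0.8),   # needs an external API
}
selected = select_queries(stats)
```

Queries the model already solves unaided are dropped, which keeps the agentic training set focused on cases where tool use actually changes the outcome.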
For more details, please refer to the comprehensive **LongCat-Flash-Thinking Technical Report**.
Evaluation Results
| Benchmark | DeepSeek-V3.1-Thinking | Qwen3-235B-A22B-Thinking-2507 | GLM-4.5 | OpenAI-o3 | Gemini2.5-Pro | GPT-5-Thinking | LongCat-Flash-Thinking |
|---------------|-------------------------|------------------------------|--------|-----------|---------------|----------------|-------------------------|
| Architecture | MoE | MoE | MoE | - | - | - | MoE |
| \# Total Params | 671B | 235B | 355B | - | - | - | 560B |
| \# Activated Params | 37B | 22B | 32B | - | - | - | 27B |
| General QA | | | | | | | |
| MMLU-Pro(acc) | 84.4 | 84.4 | 81.5 | 85.3 | 86.7 | 84.5 | 82.6 |
| MMLU-Redux(acc) | 90.5 | 91.4 | 89.9 | 93.1 | 90.1 | 92.6 | 89.3 |
| Alignment | | | | | | | |
| IFEval(strict prompt) | 86.3 | 89.3 | 85.4 | 90.2 | 92.4 | 92.8 | 86.9 |
| Arena-Hard(hard prompt gemini) | 57.1 | 74.5 | 67.7 | 87.1 | 87.1 | 87.7 | 69.9 |
| Mathematical Reasoning | | | | | | | |
| MATH500(Mean@1) | 98.8 | 99.6 | 95.4 | 98.4 | 98.0 | 99.2 | 99.2 |
| HMMT25(Mean@32) | 80.4 | 83.8 | 76.3 | 71.9 | 79.3 | 84.8 | 83.7 |
| AIME24(Mean@32) | 93.9 | 93.9 | 89.3 | 91.6* | 90.7 | 92.0 | 93.3 |
| AIME25(Mean@32) | 87.9 | 92.5 | 85.5 | 88.9* | 89.2 | 94.6* | 90.6 |
| BeyondAIME(Mean@10) | 71.8 | 71.5 | 66.0 | 63.2 | 63.0 | 70.0 | 69.5 |
| General Reasoning | | | | | | | |
| GPQA-Diamond(Mean@16) | 84.2 | 80.4 | 78.3 | 81.9 | 84.0 | 84.4 | 81.5 |
| ZebraLogic(Mean@1) | 96.1 | 97.5 | 90.9 | 94.3 | 92.4 | 92.7 | 95.5 |
| Sudoku-Bench(Mean@1) | 1.0 | 2.0 | 1.0 | 70.0 | 0.0 | 63.0 | 56.0 |
| ARC-AGI(Mean@1) | 37.5 | 45.3 | 21.41 | 47.3 | 46.8 | 59.0 | 50.3 |
| Coding | | | | | | | |
| LiveCodeBench(Mean@4) | 73.5 | 75.4 | 61.1 | 76.2 | 74.2 | 80.6 | 79.4 |
| OJBench(Mean@1) | 33.6 | 32.1 | 19.0 | 38.4 | 41.6 | 34.1 | 40.7 |
| Agentic Tool Using | | | | | | | |
| SWE-Bench(Pass@1) | 66.0* | 34.4 | 64.2* | 69.1* | 59.6*…