Repo · Meituan (LongCat) · published Aug 30, 2025

meituan-longcat/LongCat-Flash-Chat

Captured source

License: MIT

Stars: 1339

Forks: 67

Open issues: 13

Created: 2025-08-30T16:00:04Z

Pushed: 2026-06-09T02:08:37Z

Default branch: main

Fork: no

Archived: no

README:

LongCat-Flash-Chat

Tech Report 📄

Model Introduction

We introduce LongCat-Flash, a powerful and efficient language model with 560 billion total parameters, featuring an innovative Mixture-of-Experts (MoE) architecture. The model incorporates a dynamic computation mechanism that activates 18.6B to 31.3B parameters per token (averaging ~27B) based on contextual demands, optimizing both computational efficiency and performance. To achieve advanced training and inference efficiency, we employ a shortcut-connected architecture that expands the computation-communication overlap window, achieving cost-effective inference at over 100 tokens per second (TPS). Our comprehensive training and scaling strategies ensure stable, efficient training, while tailored data strategies enhance model performance.
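To put the activation figures in proportion, the short calculation below expresses the quoted parameter counts as a share of the 560B total. It uses only the numbers stated above; the variable and function names are illustrative and do not come from the repository.

```python
# Figures quoted in the introduction, in billions of parameters.
TOTAL_PARAMS_B = 560.0
ACTIVATED_B = {"min": 18.6, "max": 31.3, "avg": 27.0}


def activated_fraction(activated_b: float, total_b: float = TOTAL_PARAMS_B) -> float:
    """Share of the total parameters that is active for one token."""
    return activated_b / total_b


fractions = {name: activated_fraction(value) for name, value in ACTIVATED_B.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.2%}")
# min: 3.32%
# max: 5.59%
# avg: 4.82%
```

On these figures, roughly one parameter in twenty is active for an average token.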

Now we release LongCat-Flash-Chat, a non-thinking foundation model that delivers highly competitive performance among leading models, with exceptional strengths in agentic tasks.

Key Features

🌟 Scalable Architectural Design for Computational Efficiency

LongCat-Flash is designed and optimized under two key principles: efficient computation utilization, and efficient training and inference. (1) Because not all tokens are equal, we introduce a zero-computation experts mechanism in the MoE blocks that allocates a dynamic computation budget to tokens according to their significance, activating 18.6 to 31.3 billion parameters (out of 560 billion total) based on contextual demands. To keep the computation load consistent, we employ an expert bias adjusted by a PID controller, maintaining an average of ~27 billion activated parameters per token. (2) Because communication overhead becomes a bottleneck as MoE models scale, we incorporate the Shortcut-connected MoE (ScMoE) design to expand the computation-communication overlap window. Combined with customized infrastructure optimizations, this design enables training on tens of thousands of accelerators and inference with high throughput and low latency.
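The interaction between zero-computation experts and the PID-adjusted bias can be illustrated with a toy simulation. This is a minimal sketch under several assumptions, not the LongCat-Flash implementation: router logits are random Gaussian values, a single scalar bias is applied to all zero-computation experts (the README does not say how the bias is parameterized), and the controller gains, expert counts, and target are arbitrary. All names (`PID`, `route`, `simulate`) are hypothetical.

```python
import random


class PID:
    """Textbook discrete PID controller (gains are illustrative assumptions)."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float) -> float:
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def route(logits, n_real, zero_bias, k):
    """Pick the top-k experts after adding the bias to zero-computation experts.

    Experts with index < n_real are ordinary FFN experts; the rest are
    zero-computation experts, treated here as no-ops that activate no parameters.
    """
    biased = [l + (zero_bias if i >= n_real else 0.0) for i, l in enumerate(logits)]
    return sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:k]


def simulate(steps=100, tokens=256, n_real=8, n_zero=4, k=4,
             target_real=3.0, seed=0):
    """Steer the mean number of real experts per token toward target_real."""
    rng = random.Random(seed)
    pid = PID(kp=0.1, ki=0.5, kd=0.05)
    zero_bias = 0.0
    history = []
    for _ in range(steps):
        real_count = 0
        for _ in range(tokens):
            logits = [rng.gauss(0.0, 1.0) for _ in range(n_real + n_zero)]
            chosen = route(logits, n_real, zero_bias, k)
            real_count += sum(1 for i in chosen if i < n_real)
        avg_real = real_count / tokens
        history.append(avg_real)
        # Too many real experts -> positive error -> larger bias toward zero experts.
        zero_bias = pid.step(avg_real - target_real)
    return history, zero_bias


history, final_bias = simulate()
print(f"first step: {history[0]:.2f} real experts per token")
print(f"last 20 steps: {sum(history[-20:]) / 20:.2f} real experts per token")
print(f"final bias on zero-computation experts: {final_bias:.3f}")
```

Individual tokens still activate anywhere from zero to `k` real experts, which mirrors the per-token range described above, while the controller holds only the average near the target.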

🌟 Effective Model Scaling Strategy

Effectively and efficiently scaling model size remains a key challenge in strategy design. To this end, we develop a comprehensive stability-and-scaling framework for robustly training large-scale models: (1) We successfully apply a hyperparameter transfer strategy to a model of this size, predicting optimal hyperparameter configurations from the results of smaller proxy models, with theoretical guarantees. (2) We initialize the model using a model-growth mechanism based on a refined half-scale checkpoint, achieving improved performance compared to conventional initialization methods. (3) A multi-pronged stability suite incorporates principled router-gradient balancing, a hidden z-loss to suppress massive activations, and fine-tuned optimizer configurations. (4) To enhance the reliability of large-scale cluster training, we introduce deterministic computation. This guarantees the exact reproducibility of experiments and enables the detection of SDC (Silent Data Corruption) during the training process. These interventions ensure that LongCat-Flash's training remains stable, with no irrecoverable loss spikes.
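Point (4) relies on a simple property: if a computation is bitwise deterministic, replaying it must give identical bits, so any mismatch points to a fault rather than to ordinary numerical noise. The sketch below illustrates that idea with a toy update step and an injected bit flip. The README does not describe the actual detection mechanism, so the replay-and-compare scheme and every name here (`training_step`, `checksum`, `flip_bit`, `detect_sdc`) are assumptions made for illustration.

```python
import hashlib
import struct


def checksum(values):
    """Bitwise digest of a sequence of float64 values."""
    digest = hashlib.sha256()
    for v in values:
        digest.update(struct.pack("<d", v))
    return digest.hexdigest()


def training_step(weights, grads, lr=0.1):
    """Toy deterministic update: fixed evaluation order, no randomness."""
    return [w - lr * g for w, g in zip(weights, grads)]


def flip_bit(x: float, bit: int) -> float:
    """Simulate a silent hardware fault by flipping one bit of a float64."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def detect_sdc(reference, candidate) -> bool:
    """Report corruption when two runs of the same step differ in any bit."""
    return checksum(reference) != checksum(candidate)


weights = [0.5, -1.25, 2.0]
grads = [0.1, 0.2, -0.3]

run_a = training_step(weights, grads)
run_b = training_step(weights, grads)          # replay of the same step
corrupted = list(run_b)
corrupted[1] = flip_bit(corrupted[1], bit=3)   # inject a low-order mantissa flip

print(detect_sdc(run_a, run_b))      # False
print(detect_sdc(run_a, corrupted))  # True
```

Without determinism the first comparison could differ for benign reasons, such as a different reduction order, and the check would lose its meaning.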

🌟 Multi-Stage Training Pipeline for Agentic Capability

Through a meticulously designed pipeline, LongCat-Flash is endowed with advanced agentic behaviors. Initial efforts focus on constructing a more suitable base model for agentic post-training, where we design a two-stage pretraining data fusion strategy to concentrate reasoning-intensive domain data. During mid-training, we enhance reasoning and coding capabilities while extending the context length to 128k to meet agentic post-training requirements. Building on this advanced base model, we proceed with multi-stage post-training. Recognizing the scarcity of high-quality, high-difficulty training problems for agentic tasks, we design a multi-agent synthesis framework that defines task difficulty across three axes (information processing, tool-set complexity, and user interaction), using specialized controllers to generate complex tasks requiring iterative reasoning and environmental interaction.

For more detail, please refer to the comprehensive ***LongCat-Flash Technical Report***.

Evaluation Results

| Benchmark | DeepSeek V3.1 | Qwen3 MoE-2507 | Kimi-K2 | GPT-4.1 | Claude4 Sonnet | Gemini2.5 Flash | LongCat-Flash |
|---------------|-------------------|--------------------|-------------|-------------|--------------------|---------------------|-------------|
| Architecture | MoE | MoE | MoE | - | - | - | MoE |
| # Total Params | 671B | 235B | 1043B | - | - | - | 560B |
| # Activated Params | 37B | 22B | 32B | - | - | - | 27B |
| General Domains | | | | | | | |
| MMLU(acc) | 90.96 | 90.23 | 89.86 | 89.64 | 91.75 | 86.33 | 89.71 |
| MMLU-Pro(acc) | 84.45 | 84.83 | 82.06 | 81.72 | 83.74 | 81.95 | 82.68 |
| ArenaHard-V2(acc) | 84.10 | 88.20 | 85.70 | 61.50 | 62.10 | 77.00 | 86.50 |
| CEval(acc) | 89.21 | 92.70 | 91.26 | 79.53 | 86.63 | 78.78 | 90.44 |
| CMMLU(acc) | 88.04 | 88.14 | 89.66 | 77.65 | 86.51 | 78.30 | 84.34 |
| Instruction Following | | | | | | | |
| IFEval(acc) | 86.69 | 88.54 | 88.91 | 85.58 | 88.35 | 83.92 | 89.65 |
| COLLIE(acc) | 43.80 | 49.71 | 56.34 | 50.00 | 51.22 | 48.60 | 57.10 |
| Meeseeks-zh(acc) | 33.83 | 35.32 | 42.79 | 41.54 | 35.07 | 34.84 | 43.03 |
| Mathematical Reasoning | | | | | | | |
| MATH500(acc) | 96.08 | 98.80 | 97.60 | 90.60 | 93.80 | 98.40 | 96.40 |
| AIME24(avg@10) | 66.30* | 81.67 | 69.60* | 47.00 | 47.00 | 79.67 | 70.42 |
| AIME25(avg@10) | 49.27 | 68.33 | 50.66 | 32.00 | 37.00 | 67.33 | 61.25 |
| BeyondAIME(avg@10) | 36.50 | 57.60 | 36.60 | 22.10 | 20.50 | 44.20 | 43.00 |
| General Reasoning | | | | | | | |
| GPQA-diamond(acc) | 74.90* | 77.43 | 75.76 | 67.68 | 70.71 | 80.30 | 73.23 |
| DROP(f1) | 84.19 | 78.57 | 89.04 | 66.94 | 73.06 | 45.03 | 79.06 |
| ZebraLogic(acc) | 85.30 | 94.22 | 89.11 | 56.30* | 75.85 | 51.78 | 89.30 |
| GraphWalks-128k(precision) | 73.54 | 80.72 | 47.50 | 85.02 | 80.57 | 64.83 | 51.05 |
| Coding | | | | | | | |
|…

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Solid new repo with decent traction