ModelMiniMaxMiniMaxpublished Dec 20, 2025seen 5d

MiniMaxAI/MiniMax-M2.1

Open original ↗

Captured source

source ↗
published Dec 20, 2025seen 5dcaptured 9hhttp 200method plaintask text-generationlicense otherlibrary transformersparams 229Bdownloads 10klikes 1.4k

Meet MiniMax-M2.1

Today, we are handing MiniMax-M2.1 over to the open-source community. This release is more than just a parameter update; it is a significant step toward democratizing top-tier agentic capabilities.

M2.1 was built to shatter the stereotype that high-performance agents must remain behind closed doors. We have optimized the model specifically for robustness in coding, tool use, instruction following, and long-horizon planning. From automating multilingual software development to executing complex, multi-step office workflows, MiniMax-M2.1 empowers developers to build the next generation of autonomous applications—all while being fully transparent, controllable, and accessible.

We believe true intelligence should be within reach. M2.1 is our commitment to the future, and a powerful new tool in your hands.

How to Use

  • The MiniMax-M2.1 API is now live on the MiniMax Open Platform: https://platform.minimax.io/docs/guides/text-generation
  • Our product MiniMax Agent, built on MiniMax-M2.1, is now publicly available: https://agent.minimax.io/
  • The MiniMax-M2.1 model weights are now open-source, allowing for local deployment and use: https://huggingface.co/MiniMaxAI/MiniMax-M2.1

Benchmarks

MiniMax-M2.1 delivers a significant leap over M2 on core software engineering leaderboards. It shines particularly bright in multilingual scenarios, where it outperforms Claude Sonnet 4.5 and closely approaches Claude Opus 4.5.

| Benchmark | MiniMax-M2.1 | MiniMax-M2 | Claude Sonnet 4.5 | Claude Opus 4.5 | Gemini 3 Pro | GPT-5.2 (thinking) | DeepSeek V3.2 | | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | | SWE-bench Verified | 74.0 | 69.4 | 77.2 | 80.9 | 78.0 | 80.0 | 73.1 | | Multi-SWE-bench | 49.4 | 36.2 | 44.3 | 50.0 | 42.7 | x | 37.4 | | SWE-bench Multilingual | 72.5 | 56.5 | 68 | 77.5 | 65.0 | 72.0 | 70.2 | | Terminal-bench 2.0 | 47.9 | 30.0 | 50.0 | 57.8 | 54.2 | 54.0 | 46.4 |

We also evaluated MiniMax-M2.1 on SWE-bench Verified across a variety of coding agent frameworks. The results highlight the model's exceptional framework generalization and robust stability.

Furthermore, across specific benchmarks—including test case generation, code performance optimization, code review, and instruction following—MiniMax-M2.1 demonstrates comprehensive improvements over M2. In these specialized domains, it consistently matches or exceeds the performance of Claude Sonnet 4.5.

| Benchmark | MiniMax-M2.1 | MiniMax-M2 | Claude Sonnet 4.5 | Claude Opus 4.5 | Gemini 3 Pro | GPT-5.2 (thinking) | DeepSeek V3.2 | | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | | SWE-bench Verified (Droid) | 71.3 | 68.1 | 72.3 | 75.2 | x | x | 67.0 | | SWE-bench Verified (mini-swe-agent) | 67.0 | 61.0 | 70.6 | 74.4 | 71.8 | 74.2 | 60.0 | | SWT-bench | 69.3 | 32.8 | 69.5 | 80.2 | 79.7 | 80.7 | 62.0 | | SWE-Perf | 3.1 | 1.4 | 3.0 | 4.7 | 6.5 | 3.6 | 0.9 | | SWE-Review | 8.9 | 3.4 | 10.5 | 16.2 | x | x | 6.4 | | OctoCodingbench | 26.1 | 13.3 | 22.8 | 36.2 | 22.9 | x | 26.0 |

To evaluate the model's full-stack capability to architect complete, functional applications "from zero to one," we established a novel benchmark: VIBE (Visual & Interactive Benchmark for Execution in Application Development). This suite encompasses five core subsets: Web, Simulation, Android, iOS, and Backend. Distinguishing itself from traditional benchmarks, VIBE leverages an innovative Agent-as-a-Verifier (AaaV) paradigm to automatically assess the interactive logic and visual aesthetics of generated applications within a real runtime environment.

MiniMax-M2.1 delivers outstanding performance on the VIBE aggregate benchmark, achieving an average score of 88.6—demonstrating robust full-stack development capabilities. It excels particularly in the VIBE-Web (91.5) and VIBE-Android (89.7) subsets.

| Benchmark | MiniMax-M2.1 | MiniMax-M2 | Claude Sonnet 4.5 | Claude Opus 4.5 | Gemini 3 Pro | | ----- | ----- | ----- | ----- | ----- | ----- | | VIBE (Average) | 88.6 | 67.5 | 85.2 | 90.7 | 82.4 | | VIBE-Web | 91.5 | 80.4 | 87.3 | 89.1 | 89.5 | | VIBE-Simulation | 87.1 | 77.0 | 79.1 | 84.0 | 89.2 | | VIBE-Android | 89.7 | 69.2 | 87.5 | 92.2 | 78.7 | | VIBE-iOS | 88.0 | 39.5 | 81.2 | 90.0 | 75.8 | | VIBE-Backend | 86.7 | 67.8 | 90.8 | 98.0 | 78.7 |

MiniMax-M2.1 also demonstrates steady improvements over M2 in both long-horizon tool use and comprehensive intelligence metrics.

| Benchmark | MiniMax-M2.1 | MiniMax-M2 | Claude Sonnet 4.5 | Claude Opus 4.5 | Gemini 3 Pro | GPT-5.2 (thinking) | DeepSeek V3.2 | | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | | Toolathlon | 43.5 | 16.7 | 38.9 | 43.5 | 36.4 | 41.7 | 35.2 | | BrowseComp | 47.4 | 44.0 | 19.6 | 37.0 | 37.8 | 65.8 | 51.4 | | BrowseComp (context management) | 62.0 | 56.9 | 26.1 | 57.8 | 59.2 | 70.0 | 67.6 | | AIME25 | 83.0 | 78.0 | 88.0 | 91.0 | 96.0 | 98.0 | 92.0 | | MMLU-Pro | 88.0 | 82.0 | 88.0 | 90.0 | 90.0 | 87.0 | 86.0 | | GPQA-D | 83.0 | 78.0 | 83.0 | 87.0 | 91.0 | 90.0 | 84.0 | | HLE w/o tools | 22.2 | 12.5 | 17.3 | 28.4 | 37.2 | 31.4 | 22.2 | | LCB | 81.0 | 83.0 | 71.0 | 87.0 | 92.0 | 89.0 | 86.0 | | SciCode | 41.0 | 36.0 | 45.0 | 50.0 | 56.0 | 52.0 | 39.0 | | IFBench | 70.0 | 72.0 | 57.0 | 58.0 | 70.0 | 75.0 | 61.0 | | AA-LCR | 62.0 | 61.0 | 66.0 | 74.0 | 71.0 | 73.0 | 65.0 | | 𝜏²-Bench Telecom | 87.0 | 87.0 | 78.0 | 90.0 | 87.0 | 85.0 | 91.0 |

> Evaluation Methodology Notes: > - SWE-bench Verified: Tested on internal infrastructure using Claude Code, Droid, or mini-swe-agent as scaffolding. By default, we utilized Claude Code metrics. When using Claude Code, the default system prompt was overridden. Results represent the average of 4 runs. > - Multi-SWE-Bench & SWE-bench Multilingual & SWT-bench & SWE-Perf: Tested on internal infrastructure using Claude Code as scaffolding, with the default system prompt overridden. Results represent the average of 4 runs. > - Terminal-bench 2.0: Tested using Claude Code on our internal evaluation framework. We verified the full dataset and fixed environmental issues. Timeout limits were removed, while all other configurations remained consistent with official settings. Results represent the average of 4 runs. > - SWE Review: Built upon the SWE framework, this internal benchmark for code…

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable model release, moderate HF downloads.