ModelMoonshot AI (Kimi)Moonshot AI (Kimi)published Jul 11, 2025seen 5d

moonshotai/Kimi-K2-Instruct

Open original ↗

Captured source

source ↗
published Jul 11, 2025seen 5dcaptured 11hhttp 200method plaintask text-generationlicense otherlibrary transformersparams 1026Bdownloads 541klikes 2.4k

📰 Tech Blog | 📄 Paper

0. Changelog

2025.8.11

  • Messages with name field are now supported. We’ve also moved the chat template to a standalone file for easier viewing.

2025.7.18

  • We further modified our chat template to improve its robustness. The default system prompt has also been updated.

2025.7.15

  • We have updated our tokenizer implementation. Now special tokens like [EOS] can be encoded to their token ids.
  • We fixed a bug in the chat template that was breaking multi-turn tool calls.

1. Model Introduction

Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities.

Key Features

  • Large-Scale Training: Pre-trained a 1T parameter MoE model on 15.5T tokens with zero training instability.
  • MuonClip Optimizer: We apply the Muon optimizer to an unprecedented scale, and develop novel optimization techniques to resolve instabilities while scaling up.
  • Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.

Model Variants

  • Kimi-K2-Base: The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions.
  • Kimi-K2-Instruct: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.

2. Model Summary

3. Evaluation Results

Instruction model evaluation results

• Bold denotes global SOTA, and underlined denotes open-source SOTA.

• Data points marked with * are taken directly from the model's tech report or blog.

• All metrics, except for SWE-bench Verified (Agentless), are evaluated with an 8k output token length. SWE-bench Verified (Agentless) is limited to a 16k output token length.

• Kimi K2 achieves 65.8% pass@1 on the SWE-bench Verified tests with bash/editor tools (single-attempt patches, no test-time compute). It also achieves a 47.3% pass@1 on the SWE-bench Multilingual tests under the same conditions. Additionally, we report results on SWE-bench Verified tests (71.6%) that leverage parallel test-time compute by sampling multiple sequences and selecting the single best via an internal scoring model.

• To ensure the stability of the evaluation, we employed avg@k on the AIME, HMMT, CNMO, PolyMath-en, GPQA-Diamond, EvalPlus, Tau2.

• Some data points have been omitted due to prohibitively expensive evaluation costs.

---

Base model evaluation results

Benchmark Metric Shot Kimi K2 Base Deepseek-V3-Base Qwen2.5-72B Llama 4 Maverick

General Tasks

MMLU EM 5-shot 87.8 87.1 86.1 84.9

MMLU-pro EM 5-shot 69.2 60.6 62.8 63.5

MMLU-redux-2.0 EM 5-shot 90.2 89.5 87.8 88.2

SimpleQA Correct 5-shot 35.3 26.5 10.3 23.7

TriviaQA EM 5-shot 85.1 84.1 76.0 79.3

GPQA-Diamond Avg@8 5-shot 48.1 50.5 40.8 49.4

SuperGPQA EM 5-shot 44.7 39.2 34.2 38.8

Coding Tasks

LiveCodeBench v6 Pass@1 1-shot 26.3 22.9 21.1 25.1

EvalPlus Pass@1 - 80.3 65.6 66.0 65.5

Mathematics Tasks

MATH EM 4-shot 70.2 60.1 61.0 63.0

GSM8k EM 8-shot 92.1 91.7 90.4 86.3

<td align="

Excerpt shown — open the source for the full document.

Notability

notability 9.0/10

High traction, notable model from Moonshot AI