ModelMoonshot AI (Kimi)Moonshot AI (Kimi)published Apr 14, 2026seen 5d

moonshotai/Kimi-K2.6

Open original ↗

Captured source

source ↗
published Apr 14, 2026seen 5dcaptured 13hhttp 200method plaintask image-text-to-textlicense otherlibrary transformersparams 1059Bdownloads 2764klikes 1.4k

🤗 huggingchat | 📰 Tech Blog

1. Model Introduction

Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.

Key Features

  • Long-Horizon Coding: K2.6 achieves significant improvements on complex, end-to-end coding tasks, generalizing robustly across programming languages (Rust, Go, Python) and domains spanning front-end, DevOps, and performance optimization.
  • Coding-Driven Design: K2.6 is capable of transforming simple prompts and visual inputs into production-ready interfaces and lightweight full-stack workflows, generating structured layouts, interactive elements, and rich animations with deliberate aesthetic precision.
  • Elevated Agent Swarm: Scaling horizontally to 300 sub-agents executing 4,000 coordinated steps, K2.6 can dynamically decompose tasks into parallel, domain-specialized subtasks, delivering end-to-end outputs from documents to websites to spreadsheets in a single autonomous run.
  • Proactive & Open Orchestration: For autonomous tasks, K2.6 demonstrates strong performance in powering persistent, 24/7 background agents that proactively manage schedules, execute code, and orchestrate cross-platform operations without human oversight.

2. Model Summary

3. Evaluation Results

Footnotes

1. General Testing Details

  • We report results for Kimi K2.6 and Kimi K2.5 with thinking mode enabled, Claude Opus 4.6 with max effort, GPT-5.4 with xhigh reasoning effort, and Gemini 3.1 Pro with a high thinking level.
  • Unless otherwise specified, all Kimi K2.6 experiments were conducted with temperature = 1.0, top-p = 1.0, and a context length of 262,144 tokens.
  • Benchmarks without publicly available scores were re-evaluated under the same conditions used for Kimi K2.6 and are marked with an asterisk (*). Except where noted with an asterisk, all other results are cited from official reports.

2. Reasoning Benchmarks

  • IMO-AnswerBench scores for GPT-5.4 and Claude 4.6 were obtained from z.ai/blog/glm-5.1.
  • Humanity's Last Exam (HLE) and other reasoning tasks were evaluated with a maximum generation length of 98,304 tokens. By default, we report results on the HLE full set. For the text-only subset, Kimi K2.6 achieves 36.4% accuracy without tools and 55.5% with tools.

3. Tool-Augmented / Agentic Tasks

  • Kimi K2.6 was equipped with search, code-interpreter, and web-browsing tools for HLE with tools, BrowseComp, DeepSearchQA, and WideSearch.
  • For HLE-Full with tools, the maximum generation length is 262,144 tokens with a per-step limit of 49,152 tokens. We employ a simple context management strategy: once the context window exceeds the threshold, only the most recent round of tool-related messages is retained.
  • For BrowseComp, we report scores obtained with context management using the same discard-all strategy as Kimi K2.5 and DeepSeek-V3.2.
  • For DeepSearchQA, no context management was applied to Kimi K2.6 tests, and tasks exceeding the supported context length were directly counted as failed. Scores for Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on DeepSearchQA are cited from the Claude Opus 4.7 System Card.
  • For WideSearch, we report results under the "hide tool result" context management setting. Once the context window exceeds the threshold, only the most recent round of tool-related messages is retained.
  • The test system prompts are identical to those used in the Kimi K2.5 technical report.
  • Claw Eval was conducted using version 1.1 with max-tokens-per-step = 16384.
  • For APEX-Agents, we evaluate 452 tasks from the public 480-task release, as done by Artificial Analysis(excluding Investment Banking Worlds 244 and 246, which have external runtime dependencies)

4. Coding Tasks

  • Terminal-Bench 2.0 scores were obtained with the default agent framework (Terminus-2) and t

Excerpt shown — open the source for the full document.

Notability

notability 9.0/10

Very high downloads, likely frontier model.