Dec 25, 2025
NewDecoded
Moonshot AI has released Kimi K2 Thinking, an open-source reasoning model that scored 44.9% on Humanity's Last Exam (HLE) with tools, surpassing GPT-5's 41.7% on the same subset. The model also achieved 60.2% on BrowseComp, well above the human baseline of 29.2%, and 71.3% on SWE-Bench Verified. K2 Thinking can autonomously execute 200 to 300 sequential tool calls across search, code execution, and web browsing.
The model implements test-time scaling through expanded thinking-token budgets and extended tool-calling sequences. In internal testing, K2 Thinking cut research task completion time from 28 minutes to 16 minutes relative to manual workflows built on generic language models. In one demonstration, it solved a PhD-level mathematics problem through 23 interleaved rounds of reasoning and tool calls. The architecture maintains an internal plan that is updated after each tool execution, with tool outputs routed back into the reasoning context.
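The loop below is a minimal sketch of such an interleaved think-act architecture. Everything here is an assumption modeled on a generic OpenAI-style chat API, with stub tools standing in for search, code execution, and browsing; it is not Moonshot's actual implementation.

```python
import json


# Stub tools standing in for the three tool families named in the article.
def web_search(query: str) -> str:
    return f"results for {query!r}"

def run_code(code: str) -> str:
    return "code output"

def open_page(url: str) -> str:
    return f"contents of {url}"

TOOL_HANDLERS = {"search": web_search, "python": run_code, "browse": open_page}


def run_agent(client, task: str, max_steps: int = 300) -> str:
    """Alternate model turns and tool executions until the model stops
    requesting tools or the step cap is reached. `client` is any object
    exposing chat(messages) -> dict with "message" and "tool_calls" keys."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.chat(messages)               # one reasoning turn
        messages.append(reply["message"])           # keep thinking in context
        calls = reply.get("tool_calls", [])
        if not calls:                               # final answer produced
            return reply["message"]["content"]
        for call in calls:                          # run each tool and feed the
            fn = TOOL_HANDLERS[call["name"]]        # result back so the model
            result = fn(**json.loads(call["args"])) # can revise its plan
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": result})
    return "step budget exhausted"
```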
K2 Thinking employs Quantization-Aware Training (QAT) with INT4 weight-only quantization on its Mixture-of-Experts components, roughly doubling generation speed. All reported benchmark results reflect INT4 precision. The model supports a 256k-token context window and caps thinking budgets at 96k tokens for reasoning tasks and 128k tokens for competitive programming challenges.
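Weight-only INT4 quantization typically stores each weight in 4 bits with a shared per-group scale, and QAT simulates that rounding during training so the network adapts to it. The numpy sketch below illustrates the symmetric per-group variant; the group size and scaling scheme are illustrative assumptions, not Moonshot's published recipe.

```python
import numpy as np

def quantize_int4(weights: np.ndarray, group_size: int = 32):
    """Symmetric per-group INT4 quantization: each group of weights shares
    one fp16 scale, and quantized values live in [-8, 7]."""
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0    # map max |w| -> 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

# During QAT, the forward pass uses dequantize_int4(*quantize_int4(w)) so the
# model learns weights that survive the rounding, with gradients flowing
# through a straight-through estimator.
rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)
q, s = quantize_int4(w)
print(f"max round-trip error: {np.abs(dequantize_int4(q, s) - w).max():.4f}")
```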
On coding benchmarks, K2 Thinking scored 61.1% on SWE-Multilingual and 83.1% on LiveCodeBench v6. The model shows particular strength on HTML, React, and other component-intensive frontend tasks, translating conceptual prompts into functional products. On mathematics benchmarks, it achieved 99.1% on AIME 2025 with Python access and 95.1% on HMMT 2025 with tools.
The model is available in chat mode on kimi.com and through the Kimi K2 Thinking API. The public interface uses a reduced subset of tools and fewer tool-call turns to keep latency down, so benchmark scores may not reproduce in everyday interactions. Moonshot AI plans to release full agentic-mode capabilities in a forthcoming update.

For agentic search tasks, evaluations equipped K2 Thinking with search, code-interpreter, and web-browsing tools, capped at 120 steps for HLE and 300 steps for search benchmarks. The testing methodology blocked access to Hugging Face to prevent data leakage; without this restriction, K2 Thinking scored 51.3% on HLE. Coding results are averages over five independent runs to ensure reliability.
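For developers, a call to the API mentioned above might look like the sketch below. It assumes the endpoint follows the OpenAI-compatible convention Moonshot has used for earlier Kimi models; the base URL and model identifier are assumptions to verify against the official documentation.

```python
# Hypothetical call to the Kimi K2 Thinking API via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",
    base_url="https://api.moonshot.ai/v1",   # assumed endpoint
)

response = client.chat.completions.create(
    model="kimi-k2-thinking",                # assumed model identifier
    messages=[{"role": "user",
               "content": "Prove that sqrt(2) is irrational."}],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```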
The release of Kimi K2 Thinking signals a shift in competitive dynamics for reasoning models. While OpenAI's GPT-5 and Anthropic's Claude Sonnet 4.5 have commanded headlines, Moonshot AI's 44.9% on Humanity's Last Exam with tools (versus GPT-5's 41.7%) shows that open-source alternatives are closing the performance gap on complex reasoning tasks. The model's ability to execute 200 to 300 sequential tool calls is a practical advance in autonomous agent architecture, and it cut task completion times significantly in internal tests. However, K2 Thinking's deployment strategy reveals real constraints: the public chat interface runs a limited toolset to preserve speed, while full agentic capabilities remain forthcoming. This pattern mirrors a broader industry tension between benchmark performance and practical user experience. The INT4 quantization approach addresses a key challenge in deploying thinking models, whose extended reasoning chains create computational bottlenecks. As reasoning models proliferate, the question shifts from whether they can match frontier capabilities to which architectures prove most deployable and cost-effective at scale.