Dec 25, 2025
NewDecoded
Moonshot AI has released Kimi K2 Thinking, an open-source reasoning model that scored 44.9% on Humanity's Last Exam (HLE) with tools, surpassing GPT-5's 41.7% on the same subset. The model also achieved 60.2% on BrowseComp, well above the human baseline of 29.2%, and 71.3% on SWE-Bench Verified. K2 Thinking can autonomously execute 200 to 300 sequential tool calls across search, code execution, and web browsing.
The model implements test-time scaling through expanded thinking-token budgets and extended tool-calling sequences. In internal testing, K2 Thinking cut research task completion time from 28 minutes to 16 minutes relative to manual workflows built on generic language models. In one demonstration, it solved a PhD-level mathematics problem through 23 interleaved rounds of reasoning and tool calls. The architecture maintains an internal plan that is updated after each tool execution, with tool outputs routed back into the reasoning context.
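The loop below is a minimal sketch of such an interleaved think-act architecture. Everything here is an assumption modeled on a generic OpenAI-style chat API, with stub tools standing in for search, code execution, and browsing; it is not Moonshot's actual implementation.

```python
import json


# Stub tools standing in for the three tool families named in the article.
def web_search(query: str) -> str:
    return f"results for {query!r}"

def run_code(code: str) -> str:
    return "code output"

def open_page(url: str) -> str:
    return f"contents of {url}"

TOOL_HANDLERS = {"search": web_search, "python": run_code, "browse": open_page}


def run_agent(client, task: str, max_steps: int = 300) -> str:
    """Alternate model turns and tool executions until the model stops
    requesting tools or the step cap is reached. `client` is any object
    exposing chat(messages) -> dict with "message" and "tool_calls" keys."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.chat(messages)               # one reasoning turn
        messages.append(reply["message"])           # keep thinking in context
        calls = reply.get("tool_calls", [])
        if not calls:                               # final answer produced
            return reply["message"]["content"]
        for call in calls:                          # run each tool and feed the
            fn = TOOL_HANDLERS[call["name"]]        # result back so the model
            result = fn(**json.loads(call["args"])) # can revise its plan
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": result})
    return "step budget exhausted"
```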
K2 Thinking employs Quantization-Aware Training (QAT) with INT4 weight-only quantization on its Mixture-of-Experts components, roughly doubling generation speed. All reported benchmark results reflect INT4 precision. The model supports a 256k-token context window and caps thinking budgets at 96k tokens for reasoning tasks and 128k tokens for competitive programming challenges.
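Weight-only INT4 quantization typically stores each weight in 4 bits with a shared per-group scale, and QAT simulates that rounding during training so the network adapts to it. The numpy sketch below illustrates the symmetric per-group variant; the group size and scaling scheme are illustrative assumptions, not Moonshot's published recipe.

```python
import numpy as np

def quantize_int4(weights: np.ndarray, group_size: int = 32):
    """Symmetric per-group INT4 quantization: each group of weights shares
    one fp16 scale, and quantized values live in [-8, 7]."""
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0    # map max |w| -> 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

# During QAT, the forward pass uses dequantize_int4(*quantize_int4(w)) so the
# model learns weights that survive the rounding, with gradients flowing
# through a straight-through estimator.
rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)
q, s = quantize_int4(w)
print(f"max round-trip error: {np.abs(dequantize_int4(q, s) - w).max():.4f}")
```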
On coding benchmarks, K2 Thinking scored 61.1% on SWE-Multilingual and 83.1% on LiveCodeBench v6. The model shows particular strength on HTML, React, and other component-intensive frontend tasks, translating conceptual prompts into functional products. On mathematics benchmarks, it achieved 99.1% on AIME 2025 with Python access and 95.1% on HMMT 2025 with tools.
The model is available in chat mode on kimi.com and through the Kimi K2 Thinking API. The public interface uses a reduced subset of tools and fewer tool-call turns to keep latency down, so benchmark scores may not reproduce in everyday interactions. Moonshot AI plans to release full agentic-mode capabilities in a forthcoming update.

For agentic search tasks, evaluations equipped K2 Thinking with search, code-interpreter, and web-browsing tools, capped at 120 steps for HLE and 300 steps for search benchmarks. The testing methodology blocked access to Hugging Face to prevent data leakage; without this restriction, K2 Thinking scored 51.3% on HLE. Coding results are averages over five independent runs to ensure reliability.
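For developers, a call to the API mentioned above might look like the sketch below. It assumes the endpoint follows the OpenAI-compatible convention Moonshot has used for earlier Kimi models; the base URL and model identifier are assumptions to verify against the official documentation.

```python
# Hypothetical call to the Kimi K2 Thinking API via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",
    base_url="https://api.moonshot.ai/v1",   # assumed endpoint
)

response = client.chat.completions.create(
    model="kimi-k2-thinking",                # assumed model identifier
    messages=[{"role": "user",
               "content": "Prove that sqrt(2) is irrational."}],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```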
The release of Kimi K2 Thinking signals a shift in competitive dynamics for reasoning models. While OpenAI's GPT-5 and Anthropic's Claude Sonnet 4.5 have commanded headlines, Moonshot AI's 44.9% on Humanity's Last Exam with tools (versus GPT-5's 41.7%) shows that open-source alternatives are closing the performance gap on complex reasoning tasks. The model's ability to execute 200 to 300 sequential tool calls is a practical advance in autonomous agent architecture, and it cut task completion times significantly in internal tests. However, K2 Thinking's deployment strategy reveals real constraints: the public chat interface runs a limited toolset to preserve speed, while full agentic capabilities remain forthcoming. This pattern mirrors a broader industry tension between benchmark performance and practical user experience. The INT4 quantization approach addresses a key challenge in deploying thinking models, whose extended reasoning chains create computational bottlenecks. As reasoning models proliferate, the question shifts from whether they can match frontier capabilities to which architectures prove most deployable and cost-effective at scale.