
This chart helps us understand the new elite of tool-using models. Kimi K2 Thinking’s 93% score on τ²-Bench looks impressive, outperforming GPT-5. The benchmark tests dual-control scenarios in which AI agents must guide humans through complex technical support tasks, maintaining coherence across hundreds of interactions.
In the overall Artificial Analysis Intelligence Index, K2 Thinking comes in just behind GPT-5 and ahead of Grok 4 and Claude 4.5 Sonnet. Version 3.0 of the index incorporates 10 evaluations: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, Terminal-Bench Hard, and τ²-Bench. Interestingly, running all of these benchmarks cost $379 for K2 Thinking, versus $913 for GPT-5 and $1,888 for Grok 4. See the full analysis breakdown and track agentic performance, cost, and speed on the excellent Artificial Analysis website.
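To put the cost gap in perspective, here is a minimal Python sketch that expresses each model's benchmark bill as a multiple of K2 Thinking's. The dollar figures are the ones quoted above; the `cost_ratios` helper and the variable names are purely illustrative and not part of any Artificial Analysis tooling.

```python
# Benchmark run costs in USD, as quoted in the text above.
costs_usd = {
    "Kimi K2 Thinking": 379,
    "GPT-5": 913,
    "Grok 4": 1888,
}


def cost_ratios(costs, baseline):
    """Return each model's cost as a multiple of the baseline model's cost."""
    base = costs[baseline]
    return {model: cost / base for model, cost in costs.items()}


# Hypothetical helper usage: compare everything against K2 Thinking.
ratios = cost_ratios(costs_usd, "Kimi K2 Thinking")

# Print the models from cheapest to most expensive.
for model, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{model}: ${costs_usd[model]:,} ({ratio:.1f}x K2 Thinking)")
```

On these figures, the GPT-5 run cost about 2.4 times as much as the K2 Thinking run, and the Grok 4 run about 5 times as much.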
