Top agentic tool users

This chart helps us understand the new elite of tool-using models. Kimi K2 Thinking’s 93% score on τ²-Bench looks impressive, outperforming GPT-5. The benchmark tests dual-control scenarios in which AI agents must guide humans through complex technical support tasks, maintaining coherence across hundreds of interactions.

In the overall Artificial Analysis Intelligence Index, K2 Thinking comes in just behind GPT-5, and ahead of Grok 4 and Claude 4.5 Sonnet. Version 3.0 of the index incorporates 10 evaluations: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, Terminal-Bench Hard, and τ²-Bench. Interestingly, running all of these benchmarks cost $379 for K2 Thinking, versus $913 for GPT-5 and $1,888 for Grok 4, so K2 Thinking came in at roughly a fifth of Grok 4's bill. See the full breakdown, and track agentic performance, cost and speed, on the excellent Artificial Analysis website.
