ARC-AGI-2 falls to Gemini Deep Think

As this chart shows, Google has released the February 2026 edition of its Deep Think variant of Gemini 3 with minimal fanfare, and it casually crushes what has been one of the hardest benchmarks for AI over the last year. ARC-AGI-2 was designed to test novel reasoning that’s “easy for humans, hard for AI,” and humans average around 60% on it. Deep Think now scores 84.6%, ahead of Claude Opus 4.6 at 68.8% and GPT-5.2 at 52.9%, with Gemini’s own base model trailing at 31.1%. François Chollet’s ARC Prize team, who verified these scores, also flagged that the model appeared to know ARC’s colour mappings without being told, suggesting ARC data may be well represented in Google’s training set. Whether the result reflects genuine reasoning or brute-force compute remains an open question, but we won’t have to wait long for a tougher test. ARC-AGI-3 launches next month with interactive, game-like environments where models must explore, set their own goals and adapt on the fly. These are capabilities that no amount of extra thinking tokens can fake.
