ExoBrain

ARC-AGI-2 falls to Gemini Deep Think

Google's Gemini Deep Think variant achieves a record-breaking score on the ARC-AGI-2 reasoning benchmark, raising questions about training data contamination ahead of the more complex ARC-AGI-3 test.

ExoBrain


With minimal fanfare, Google has released the February 2026 edition of Deep Think, its variant of Gemini 3, and as this chart shows, it casually crushes what has been one of the hardest benchmarks for AI over the last year. ARC-AGI-2 was designed to test novel reasoning that’s “easy for humans, hard for AI,” and humans average around 60% on it. Deep Think now scores 84.6%, ahead of Claude Opus 4.6 at 68.8% and GPT-5.2 at 52.9%, with Gemini’s own base model trailing at 31.1%.

François Chollet’s ARC Prize team, who verified these scores, also flagged that the model appeared to know ARC’s colour mappings without being told, suggesting ARC data may be well represented in Google’s training set. Whether the result reflects genuine reasoning, brute-force compute or familiarity with the benchmark remains an open question, and we won’t have to wait long for a tougher test: ARC-AGI-3 launches next month with interactive, game-like environments where models must explore, set their own goals and adapt on the fly, capabilities that no amount of extra thinking tokens can fake.