ExoBrain
benchmarks and evals, coding agents, frontier labs, model releases

The new rhythm of AI progress

The latest wave of model releases from Google, Anthropic, and xAI demonstrates a rapid cadence of incremental updates that often fail to meet user expectations despite impressive benchmark scores.

Joel Miller

2 min read

Three new models dropped this week: Google’s Gemini 3.1 Pro, Anthropic’s Sonnet 4.6 and xAI’s Grok 4.20. Google claimed a 77% score on ARC-AGI-2 for its standard Pro model, separate from the Deep Think variant that scored 84.6% last week, more than doubling its November result. Grok 4.20 promised “rapid learning” and a new multi-agent setup by default. Sonnet 4.6 offered a million-token context window at bargain pricing. And yet the collective reaction from the people actually using these models has been muted.

Google’s benchmark obsession is becoming a pattern. In November, Gemini 3 Pro broke the 1500 Elo barrier on LMArena and topped Humanity’s Last Exam. Users then reported a model that didn’t live up to expectations. This week’s sequel does not resolve that issue. Independent testing found Gemini 3.1 Pro spends up to 114 seconds planning before writing a single line of code, sometimes getting stuck in loops. Its agentic benchmark ranking dropped from 7th to 19th. Grok 4.20 is a version number that tells you everything about xAI’s priorities. Users report it misreads context and fixates on irrelevant details. Sonnet 4.6 is the most honest of the three: a capable mid-tier model filling the gap between Haiku and Opus, not pretending to be anything more.

Contrast this with a fortnight ago. OpenAI shipped Codex 5.3, a model that was instrumental in debugging its own training run. Anthropic released Opus 4.6 with adaptive reasoning and its highest-ever agentic coding scores. Those felt like genuine steps forward. This week’s batch feels like catch-up. The pattern resembles the classic Intel tick-tock cadence. The “tock” delivers an architectural leap. The “tick” optimises the existing design to fill gaps. Reinforcement learning and increasingly sophisticated training infrastructure mean both ticks and tocks now arrive every few weeks. That pace would have been unthinkable in 2024.

Since November, agentic coding, tool use and terminal-based engineering have become the white-hot centre of AI development, as we explore in this week’s Dark Factory piece. Google appears to have tuned Gemini 3.1 Pro for its own surfaces: Search, Gmail, Workspace. That may be a sensible product decision. But it leaves a strategic gap. Gemini 3 showed enormous promise. Gemini 3.1 Pro does not restore Google to a position of relevance in the world of software engineering, and that is the race that matters most right now.

Takeaways: We are entering an era of rapid, regular model releases, some that push the frontier and many that fill gaps beneath it. The trick for anyone building with AI is learning to tell the difference, and not mistaking a tick for a tock.