Agents code all day long

This chart tracks the autonomous coding capabilities of OpenAI’s models from 2019 through to 2026, measured using METR’s “time horizon” metric: the length of human-equivalent tasks that AI agents can complete with 50% success. GPT-5.1-Codex-Max, released this week, achieves a time horizon of approximately 2 hours 42 minutes, continuing a remarkably consistent trend that has seen capabilities double roughly every 202 days since 2018. See the full METR analysis.
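To make the doubling arithmetic concrete, here is a minimal sketch of what a constant doubling time implies. The starting value (2 hours 42 minutes) and the 202-day doubling period are taken from the text above. The function names and the assumption of a perfectly steady exponential trend are illustrative only, and METR's own extrapolations may rest on a different fit.

```python
def projected_horizon_minutes(current_minutes: float,
                              days_ahead: float,
                              doubling_days: float = 202.0) -> float:
    """Project a time horizon forward, assuming it doubles every `doubling_days`.

    Assumption: a constant doubling time, which is a simplification
    of the trend described in the text.
    """
    return current_minutes * 2 ** (days_ahead / doubling_days)


def format_minutes(minutes: float) -> str:
    """Render a duration in minutes as 'Hh MMm'."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours}h {mins:02d}m"


if __name__ == "__main__":
    start = 2 * 60 + 42  # 2 hours 42 minutes, the figure quoted above
    for days in (0, 202, 404):
        horizon = projected_horizon_minutes(start, days)
        print(f"+{days} days: {format_minutes(horizon)}")
```

Under these assumptions, one doubling period takes the horizon from 2h 42m to 5h 24m, and a second takes it to 10h 48m.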

This appears to be the first frontier coding model genuinely capable of handling all-day coding tasks, with OpenAI’s internal testing showing completion of projects running over 24 hours. The timing of the release, just one day after Google launched Gemini 3.0, suggests competitive dynamics remain intense in the race to dominate AI coding capabilities. However, the release also reflects OpenAI’s sustained focus on agentic coding systems: the model features new “compaction” technology that allows it to process millions of tokens whilst maintaining context over extended sessions. METR’s extrapolation suggests that by May 2026, on-trend development could yield models with a time horizon approaching 13 hours.
