o3 and the new scaling laws
The industry is shifting from training larger models to optimising reasoning at inference, with OpenAI's o3 demonstrating superior performance in coding and complex problem-solving benchmarks.
Joel Miller

The AI ‘scaling’ story took a significant turn in 2024. Early rumours about OpenAI’s Q* and ‘strawberry’ projects suggested a major leap in AI reasoning capabilities. When o1-preview was unveiled (Week 37), it productionised a new approach, shifting computational resources away from training and towards ‘thinking time’ at inference. The model uses reinforcement learning to strengthen its reasoning, allowing it to spend more time working through complex problems. Meanwhile, behind the scenes, the cost of training at the frontier had escalated dramatically, approaching $1bn per run (Week 34). At the NeurIPS conference last week, former OpenAI chief scientist Ilya Sutskever suggested that we will soon reach ‘peak [training] data’, signalling the end of the pre-training scaling era. But right on cue, o1 pro mode (Week 50), DeepSeek R1 (Week 47) and Google’s ‘Gemini 2.0 Thinking’ suggest that scaling at the point of use can take over.
The future will not only be about training ever larger models, but about teaching smaller ones to think more effectively (and getting them to work together as ‘agents’). The race is on to perfect this technique and extend it beyond the highly structured domains of maths and coding, to healthcare, finance and beyond. As we reported on Friday evening, OpenAI demonstrated and published benchmarks for its next-generation reasoning model, o3, planned for release in Q1 2025. The early benchmarks look stunning. The model has demonstrated above-human performance on the ARC-AGI benchmark behind the ARC Prize (Week 24) and looks very strong on software development. In early 2024 GPT-4 was scoring around 3% on the SWE-Bench coding test; o3 tops 70%! The cost of this capability looks exceptionally high, but the trajectory is clear.