
Over the last few months, we’ve covered o1, o3, DeepSeek R1-lite and R1, and Gemini 2.0 Flash Thinking as models that corroborate the central trend in recent AI development. Specifically, they show that if you train a model on examples of long-form reasoning (think topics such as maths and coding) and then give it more time to think at the point of use, it can get much smarter.
This week a team from Stanford released a paper that demonstrated this in a pure form. For less than $50, they curated a set of just 1,000 reasoning examples extracted from Google Gemini 2.0 Flash Thinking and used them to train a base model (Alibaba’s Qwen 2.5). They then implemented a neat trick to make the model think for longer: they appended a “wait” command at the end of each response to make it keep going, and repeated this several times to extend the thinking further. This combination of techniques increased the model’s benchmark scores dramatically.
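To make the “wait” trick concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: `extend_thinking` and `make_toy_model` are hypothetical names, and the toy model stands in for a real language model call that would return the next chunk of reasoning and stop where the model would normally finish.

```python
from typing import Callable, List


def extend_thinking(
    generate: Callable[[str], str],
    prompt: str,
    num_waits: int,
    wait_token: str = "Wait",
) -> str:
    """Run `generate` num_waits + 1 times, inserting `wait_token` between runs.

    `generate` is a hypothetical stand-in for a model call: it takes the text
    so far and returns the next chunk of reasoning.
    """
    text = prompt
    for round_index in range(num_waits + 1):
        text += generate(text)
        # After every round except the last, append the wait token so the
        # next call continues the reasoning instead of ending it.
        if round_index < num_waits:
            text += f" {wait_token},"
    return text


def make_toy_model() -> Callable[[str], str]:
    """A fake model (assumption for the demo) that emits one step per call."""
    calls: List[int] = []

    def generate(text_so_far: str) -> str:
        calls.append(len(text_so_far))
        return f" step {len(calls)}."

    return generate


if __name__ == "__main__":
    # Prints: Q: 2 + 2? step 1. Wait, step 2. Wait, step 3.
    print(extend_thinking(make_toy_model(), "Q: 2 + 2?", num_waits=2))
```

Raising `num_waits` is the knob that buys more thinking time: each extra “wait” forces one more round of reasoning before the final answer.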
In the right-hand chart here, moving along the x-axis, more wait-and-think loops increased the model’s ability to answer university-level questions. The research shows you don’t need massive datasets or complex training methods to achieve this ‘test-time scaling’ effect: you just need high-quality examples and extended reasoning time. With OpenAI hiring PhDs and paying them $100 an hour to write out answers to complex problems, it’s not surprising that Sam Altman stated at an event in Germany today that he sees no limit to where these models can go.
