Huge news this week: OpenAI has unveiled their next generation of AI model. A moment that many have been waiting for and one heavily trailed in “strawberry” themed social media activity from employees and fans alike in recent weeks. Rather confusingly named “o1“, it’s on a new scale of progress, and according to OpenAI, it seems we’re at level 1 (they had previously talked about five levels where at the highest, AI systems will be able to operate single-handedly as entire organisations). It comes in “mini” and “preview” forms, and these models have been trained and built to “think before they speak”… this is the advanced reasoning technology that was previously codenamed after a certain red summer fruit.
The o1 models introduce a novel concept of “reasoning tokens” – internal steps used by the model to break down problems and consider multiple approaches before generating a visible response. You can see them flashing in the corner as the model thinks before responding. The benchmarks look strong, with unprecedented performance across mathematics, coding, and scientific understanding. o1-preview scores 83% on International Mathematics Olympiad qualifying exams compared to GPT-4’s 13%. Perhaps most strikingly, both o1 models outperformed expert humans on PhD-level science questions in the GPQA Diamond benchmark. But benchmark performance will always be the sweet spot; what will be fascinating to see is how well these models reason in the real and complex world.
The hope is they can make a material impact on scientific research, complex decision-making, and advanced problem-solving across industries. OpenAI says these models have shown reduced propensity to make things up (because they think about their answers), and improved adherence to safety guidelines, plus better performance on security tests. But they’ve also demonstrated concerning capabilities in areas like persuasion and biological threat creation, which OpenAI has classified as “medium risk.”
An incident during testing highlighted both the impressive problem-solving abilities and potential risks of these advanced models. When faced with a broken cybersecurity challenge, an o1 model creatively bypassed the intended solution, exploiting the testing environment in unexpected ways. While no security breach occurred due to proper isolation measures, this “wake-up call” underscores the need for robust testing and careful deployment strategies.
Currently in beta, the o1 models have limited features and access. They’re available through the API but lack support for images, tool use or streaming, and many other parameters available in previous models. OpenAI plans to expand these features and increase rate limits in the coming weeks.
Pricing for the new models reflects their enhanced capabilities, with o1-preview positioned as a premium option at about $28 per 1M tokens, significantly higher than competitors like Anthropic’s Claude 3.5 Sonnet at $6 per 1M tokens (but less than Claude 3 Opus). This seems to be aggressive pricing and also an indication that the models must be quite efficient, and they are continuing to improve in bang for GPU buck.
Takeaways: One nugget from the launch material showed data on time spent thinking versus performance. The graph suggested there is more to come from using computing power when the model is analysing the problem, and not just when being trained. METR analysis of the model shows that it can carry out complex technical tasks in some cases more effectively than a human. It’s clear that o1 models are super “smart”, perhaps PhD level in some areas, but that has never meant a guarantee of super effectiveness. The question and race will now be to see how well we can match this model and its strengths to real-world opportunities, how much capability is possible with suitable instruction, and ultimately how well we can integrate it with the world to take action. These are fascinating times.
