No putting this genie back

Just six months ago the general consensus was that building a top-tier AI model required hundreds of millions in spending, vast data centres and huge amounts of raw data. There was a giant moat around the likes of OpenAI, Google, Anthropic and Meta (if not between them). But over the intervening months, systems such as o1 have demonstrated the power of new ‘reasoning’ models and ‘reinforcement learning’. This week DeepSeek confirmed that this particular genie is well and truly out of the bottle with the release of R1. It’s a model trained at 5% of the cost of OpenAI’s equivalent but with comparable performance, and it has put Silicon Valley (and previous open-source leader Meta in particular) in panic mode. R1 scored 79.8% on the AIME mathematics test and 71.5% on GPQA Diamond, matching or exceeding leading models such as Claude 3.5 Sonnet.

As technologist Andrew Curran put it: “DeepSeek is unequivocal proof that one can produce unit intelligence gain at 10x less cost, which means we shall get 10x more powerful AI with the compute we have today and are building tomorrow. Simple math! The AI timeline just got compressed.” But how did they create it, and how does this new paradigm work in general?

DeepSeek started with an existing un-tuned ‘base’ LLM and used reinforcement learning to teach it reasoning skills, much like how a student learns through practice questions and good-quality feedback. The feedback in this case was sourced from other models and was preceded by some worked examples to get things going. This created their first version. They then ‘sampled’ output from this model and used the best examples to create a training dataset. Imagine scoring a student’s best answers from a prolonged revision session and curating them into a study guide. They then trained the model mostly on this guide, and also gave it human feedback to make it easier to work with. This recipe has produced a highly capable AI. But things don’t end with R1 (or o1 for that matter). We’re now seeing a bigger feedback loop, with OpenAI planning to release its next reasoning model, o3, in as little as a few weeks. DeepSeek’s next model will be developed by spending more compute on generating high-quality answers from R1, and by sourcing other relatively small but high-quality datasets in verifiable domains like maths, coding, and science. These are used to repeat the process and train the next, improved version, and so on, creating a kind of virtuous cycle in which stronger reasoning skills emerge with each repetition. Models like R1 and o1 aren’t just end products – they’re training data generators for the next generation.
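To make the ‘study guide’ step concrete, here is a minimal Python sketch of sampling many answers and keeping only the ones that check out. It is a toy under stated assumptions, not DeepSeek’s code: `toy_model`, `verify`, `build_study_guide` and the 40% error rate are all hypothetical, and simple addition stands in for a verifiable domain.

```python
import random


def toy_model(question, rng):
    """Hypothetical stand-in for a reasoning model.

    It answers 'a + b' questions but is deliberately wrong some of the time.
    """
    a, b = question
    answer = a + b
    if rng.random() < 0.4:  # assumed error rate, for illustration only
        answer += rng.choice([-2, -1, 1, 2])
    return {"reasoning": f"{a} plus {b} gives {answer}", "answer": answer}


def verify(question, candidate):
    """In a verifiable domain the answer can be checked mechanically."""
    a, b = question
    return candidate["answer"] == a + b


def build_study_guide(questions, samples_per_question=8, seed=0):
    """Sample several answers per question and keep one verified answer each."""
    rng = random.Random(seed)
    guide = []
    for question in questions:
        candidates = [toy_model(question, rng) for _ in range(samples_per_question)]
        correct = [c for c in candidates if verify(question, c)]
        if correct:  # questions with no verified answer are simply skipped
            guide.append({"question": question, **correct[0]})
    return guide


if __name__ == "__main__":
    for entry in build_study_guide([(2, 3), (10, 7), (5, 5)]):
        print(entry)
```

Every entry that reaches the guide has passed the check, so the next model trains only on answers known to be right. Spending more compute here simply means raising `samples_per_question`.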

As OpenAI’s Sebastien Bubeck stated: “No tactic was given to the model [o1]. Everything is emergent. Everything is learned through reinforcement learning. This is insane. Insanity.” Both OpenAI and Google researchers have hinted at remarkable progress with similar approaches built on the same virtuous cycle: generate high-quality answers, use them to train improved models, repeat. Each iteration can focus on verifiable domains like mathematics, coding, and science, where correct answers provide clear feedback.
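What ‘clear feedback’ means in a verifiable domain can be shown in a few lines of Python. This is a sketch only: the `Answer:` marker and the function names are assumptions made for illustration, not a format any of the labs have confirmed using.

```python
import re


def extract_final_answer(response):
    """Return the number after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is a hypothetical output format.
    """
    matches = re.findall(r"Answer:\s*(-?\d+(?:\.\d+)?)", response)
    return float(matches[-1]) if matches else None


def reward(response, reference):
    """Score 1.0 when the final answer matches the reference, otherwise 0.0."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0  # no parseable answer earns no reward
    return 1.0 if abs(predicted - reference) < 1e-9 else 0.0
```

A response ending in `Answer: 144` earns the full reward against a reference of 144, while a wrong or missing answer earns nothing. No human judgement is needed, which is what makes these domains cheap to train on at scale.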

This “learning to learn” approach appears to be creating a kind of compound interest. Each generation of models builds on the insights of the previous one, potentially accelerating progress far beyond what we’ve seen before. And as DeepSeek has shown, the barrier to entry for this approach is surprisingly low. Nvidia’s Jim Fan posted on X: “Whether you like it or not, the future of AI will not be canned genies controlled by a “safety panel”. The future of AI is democratization. Every internet rando will run not just o1, but o8, o9 on their toaster laptop. It’s the tide of history that we should surf on, not swim against. Might as well start preparing now.” There will be no central control or EU decision on whether the most advanced AIs will be made available… they are in the wild today and there’s no return. The AI community has already set to work mining R1 for reasoning data and using it to train other, smaller models. Within days we have seen notable “distillations” running on everything down to a smartphone.

On Wednesday, Google released Gemini 2.0 Flash Thinking, its latest take on lightweight reasoning. The update scores 73.3% on AIME and adds support for million-token context windows. Google’s initial focus on its ‘Flash’ range suggests that it has been able to use the full-fat Gemini 2.0 to train these smaller models.

Takeaways: There’s a new twist to the $100 trillion AI question. Is there a new infinite intelligence feedback loop? Are we about to see a rapid take-off to AGI and beyond? Maybe this only works for ‘verifiable’ problems like coding and maths, and the messy, complex world will not be so easy to crack? Maybe not? Either way, start testing these reasoning capabilities now – the performance improvements and cost reductions are happening faster than expected.


Subscribe to the ExoBrain Weekly Newsletter