o3 and o4-mini prime agentic AI for take-off
OpenAI and Google release new models including o3 and o4-mini, which exhibit advanced agentic, multimodal, and coding capabilities that are reshaping enterprise software development.
Joel Miller

It may have been a short four‑day week for some, but the labs were in no Easter‑holiday mood: the relentless model‑release cycle continued, with seven big new offerings from OpenAI and Google alone. The launch on 14 April delivered the GPT‑4.1 family, optimised for builders, in full, mini and nano sizes (one‑million‑token context window, cheaper than GPT‑4o). Two days later the spotlight swung to the new o‑series contenders: the flagship o3 (200k context, multimodal reasoning, $10/$40 per million input/output tokens) and the thriftier o4‑mini ($1.10/$4.40 respectively). Both landed in ChatGPT Plus, Team and Pro, and in the API. GitHub Copilot rolled o3 out to enterprise customers the same evening. Benchmarks lit up: o3 tops SWE‑Bench Verified unaided at 69%, while o4‑mini punches above its weight with 99.5% on AIME maths (when Python tools are used).
OpenAI also released a new software‑engineering tool, Codex CLI, which defaults to o4‑mini and looks like a direct response to Anthropic’s similar command‑line tool. Meanwhile, rumours surfaced that OpenAI was looking to acquire development‑environment start‑up Windsurf for around $3 billion to bolster its developer offering. This week’s output from OpenAI was aimed squarely at the professional AI community. Not to be outdone, Google dropped Gemini 2.5 Flash, a strong option on cost versus capability.
So how are people responding? Tyler Cowen, the economist and AI enthusiast, called o3 “AGI”, and the OpenAI team fanned the flames, talking of a step change in impact on scientific discovery. On X, people shared impressive visual geo-guessing demos and code refactors finished in one shot. Sceptics replied with screenshots of botched long division and hallucinated data. Noted sceptic Gary Marcus quipped that o3 “can predict everything except its own errors”.
What’s the ExoBrain take? In three key areas, the o-series feels like a significant step, although, as ever, the proof will be in getting o3 and o4‑mini out of the lab and into organisations, teams and knowledge workloads where the actual opportunities reside. There are, however, new capabilities here, even over the powerful Gemini 2.5 Pro, that will make a difference:
- Creative leaps: o3 sometimes displays flashes of genius unmatched by earlier models. In one public demo shared by OpenAI’s research VP Mark Chen, the model reviewed a scientific paper, spotted an incorrect assumption made by the authors, and suggested switching to a new technique, all of which checked out. Other scientists have talked of insights and novel reasoning that took them by surprise. Our testing indicates that o3 is indeed deeply insightful in areas where it has been trained, such as data science and software architecture.
- Visual thinking: Most “multi-modal” models understand images, which mostly means labelling what is present. o3 also reasons with them, using a picture to reach a new conclusion. Every 14 × 14 patch is embedded like a word, so text tokens can mix with pixels inside a single “chain of thought”. In one public example, a shaky phone shot of a diagram went in; o3 read the faint numbers, ran Python to solve the problem, then produced an annotated diagram with arrows and values.
- Tool use: o3 can make hundreds of tool, API or external code calls in a single run, which is invaluable for advanced agents. One example prompt generated market data, plotted volatility, wrote a memo and attached the chart, all for only $0.18. This is starting to push the single response, or unit of work, from “answer” to “mini‑project”.
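To make the "mini-project" idea concrete, a tool-using run can be pictured as a loop: the model asks for a tool, the result is fed back, and the cycle repeats until the model stops asking. The Python below is a minimal sketch of that loop under our own assumptions, not OpenAI's implementation. `run_agent`, `scripted_model`, `PLAN` and the three toy tools are all hypothetical stand-ins (no real API is called), and the `max_calls` cap is our addition to keep a run bounded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ToolCall:
    name: str       # which tool the model wants to run
    argument: str   # a single string argument, kept simple for the sketch


@dataclass
class RunResult:
    outputs: List[str] = field(default_factory=list)  # one entry per tool call made
    hit_call_cap: bool = False                         # True if the cap ended the run


def run_agent(
    next_step: Callable[[List[str]], Optional[ToolCall]],
    tools: Dict[str, Callable[[str], str]],
    max_calls: int = 200,
) -> RunResult:
    """Call tools until next_step returns None or max_calls is reached."""
    result = RunResult()
    while True:
        call = next_step(result.outputs)
        if call is None:
            return result  # the model has nothing more to ask for
        if len(result.outputs) >= max_calls:
            result.hit_call_cap = True  # stop rather than run unbounded
            return result
        result.outputs.append(tools[call.name](call.argument))


# Hypothetical stand-ins: a scripted "model" and three toy tools.
PLAN = [
    ToolCall("generate_data", "ACME"),
    ToolCall("plot_volatility", "ACME"),
    ToolCall("write_memo", "ACME"),
]


def scripted_model(outputs_so_far: List[str]) -> Optional[ToolCall]:
    step = len(outputs_so_far)
    return PLAN[step] if step < len(PLAN) else None


TOOLS = {
    "generate_data": lambda ticker: f"data for {ticker}",
    "plot_volatility": lambda ticker: f"chart for {ticker}",
    "write_memo": lambda ticker: f"memo on {ticker}",
}

if __name__ == "__main__":
    print(run_agent(scripted_model, TOOLS).outputs)
```

In a real deployment the scripted function would be replaced by a model call, and each tool result would ideally be checked before it is trusted.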
Both of these new models are also very fast, but the classic LLM issues remain, and there were areas where they noticeably under‑performed:
- The December o3 preview (and tuned version) that crushed the ARC‑AGI benchmark led many to expect an outright knockout of Gemini 2.5 Pro. The released o3 is strong at code and research but more expensive and only neck‑and‑neck with Gemini (and Claude) on other fronts.
- Common‑sense physical logic, simple sums, dates and units can still derail these LLMs. In the Transluce audit, o3 missed 14% of two‑step arithmetic questions and, when pressed, produced a forged Python trace claiming perfect accuracy.
- Long tool chains sometimes time out, and the model has been seen fabricating “successful” outputs to keep the response tidy. Because the prose is so polished and persuasive, users can walk away convinced of a wrong answer.
Other practical snags remain. Dense images soon eat up the 200k token budget, so vision capabilities are relatively limited. Uncapped autonomous tool use can rack up high bills without careful monitoring. And while o4‑mini looks like a bargain, Claude and Gemini deliver similar results at comparable costs.
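A rough back-of-envelope calculation shows how quickly these budgets go. The Python sketch below takes the figures quoted in this article at face value (14 × 14 patches, a 200k context, o3 at $10/$40 per million tokens) and adds two assumptions of our own: one token per patch and no downscaling of the image. Real API token accounting may well differ, so treat the numbers as illustrative only. The function names are ours.

```python
import math

# Figures quoted in the article; inputs to the sketch, not authoritative pricing.
O3_INPUT_USD_PER_M = 10.00
O3_OUTPUT_USD_PER_M = 40.00
CONTEXT_WINDOW = 200_000
PATCH_SIDE = 14


def patch_tokens(width_px: int, height_px: int, patch: int = PATCH_SIDE) -> int:
    """Tokens for one image, ASSUMING one token per patch and no downscaling."""
    return math.ceil(width_px / patch) * math.ceil(height_px / patch)


def images_that_fit(width_px: int, height_px: int,
                    reserved_text_tokens: int = 0,
                    window: int = CONTEXT_WINDOW) -> int:
    """How many images of this size fit in the window after reserving text tokens."""
    return (window - reserved_text_tokens) // patch_tokens(width_px, height_px)


def run_cost_usd(input_tokens: int, output_tokens: int,
                 in_rate: float = O3_INPUT_USD_PER_M,
                 out_rate: float = O3_OUTPUT_USD_PER_M) -> float:
    """Cost of one run given per-million-token rates."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


if __name__ == "__main__":
    print(patch_tokens(4032, 3024))       # a full-resolution phone photo
    print(images_that_fit(4032, 3024))    # how many fit in the 200k window
    print(run_cost_usd(50_000, 5_000))    # one mid-sized o3 run
```

Under these assumptions a 4032 × 3024 photo comes to 62,208 tokens, so only three fit in the window, and a run with 50,000 input and 5,000 output tokens costs about $0.70. Multiply that by hundreds of autonomous tool calls and the case for monitoring is clear.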
But OpenAI will not stand still; o3‑pro should land in a few weeks, bringing lower latency, higher compute limits and an expanded tool suite. Reinforcement‑learning fine‑tuning is heading for general release, and this will be huge for specialising these powerful models for specific domains such as finance, law and medicine.
Takeaways: So, what will we see in the coming months now that most of the new generation of models are here? The combinatory opportunities are exciting. These models can reason over extremely complex problems and, with the right prompting and some trial and error to find where they need support, they could now tackle a huge proportion of common business tasks. The expensive o3 can handle the planning and agent workflow, for example, while o4-mini picks up the bulk of the processing (perhaps with the even cheaper GPT-4.1 mini tackling very large-volume tasks). The key will be to set up problems with data and instructions that turn them into the kinds of structured projects these models devour. Combining this with carefully chosen visual elements, perhaps mimicking how we might solve problems with a whiteboard sketch or a thoughtful chart, will help these models further still. Finally, making a rich range of usable tools available to these models will give them the power to operate in creative and adaptive ways. Even if progress paused here, we would still believe agentic AI is ready for take-off; what it needs now is relentless, well‑governed experimentation to reveal abilities even the labs have not imagined, and to translate them into clear, day‑to‑day value.
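As a starting point for that kind of experimentation, the division of labour between models can be written down as a simple routing rule. The Python below is a minimal sketch of one way to do it, not a recommended architecture. The `Task` fields, the `choose_model` function and the `BULK_THRESHOLD` cut-off are hypothetical, and the model names are used purely as labels.

```python
from dataclasses import dataclass

BULK_THRESHOLD = 10_000  # hypothetical cut-off for "very large volume" work


@dataclass(frozen=True)
class Task:
    description: str
    needs_planning: bool = False  # multi-step planning or agent orchestration
    item_count: int = 1           # number of records or documents to process


def choose_model(task: Task) -> str:
    """Route planning to o3, very large batches to GPT-4.1 mini, the rest to o4-mini."""
    if task.needs_planning:
        return "o3"
    if task.item_count >= BULK_THRESHOLD:
        return "gpt-4.1-mini"
    return "o4-mini"


if __name__ == "__main__":
    tasks = [
        Task("Plan the quarterly reporting workflow", needs_planning=True),
        Task("Summarise 40 supplier contracts", item_count=40),
        Task("Classify 250,000 support tickets", item_count=250_000),
    ]
    for task in tasks:
        print(f"{choose_model(task):>12}  {task.description}")
```

The thresholds are the part to tune through trial and error: where the cheaper model starts to need support is exactly what well-governed experimentation should reveal.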
