o3 and o4-mini prime agentic AI for take-off, Scaling laws show ongoing gains, and AI skills a fundamental expectation at Shopify

o3 and o4-mini prime agentic AI for take-off

OpenAI and Google release new models including o3 and o4-mini, which exhibit advanced agentic, multimodal, and coding capabilities that are reshaping enterprise software development.

Joel Miller

18 April 20255 min read

o3 and o4-mini prime agentic AI for take-off

This week might have been a short four‑day stretch for some, but the labs were not in an Easter‑holiday mood with a continuation of the relentless model‑release cycle, seeing seven big new offerings from OpenAI and Google alone. The 14th April launch delivered the GPT‑4.1 family, optimised for builders, in full, mini and nano sizes (one‑million‑token window, cheaper than GPT‑4o). Two days later the spotlight swung to the new o‑series contenders: the flagship o3 (200 k context, multimodal reasoning, $10/$40 per million input/output tokens) and the thriftier o4‑mini ($1.10/$2.50 respectively). Both landed in ChatGPT Plus, Team and Pro, and the API. GitHub Copilot moved enterprises to o3 the same evening. Benchmarks lit up: o3 tops SWE‑Bench Verified unaided at 69%, while o4‑mini punches above its weight with 99.5% on AIME maths (when Python tools are used).

OpenAI also released a new software‑engineering tool called Codex, which defaults to o4‑mini and seems to be a direct response to the similar command‑line solution from Anthropic. Meanwhile, rumours surfaced that OpenAI was in the hunt to acquire development‑environment start‑up Windsurf for £3 billion to bolster its dev product proposition. This week’s output from OpenAI seemed very much targeted at the professional AI community. Not to be outdone, Google dropped Gemini 2.5 Flash, a strong option in terms of cost versus capability.

So how are people responding? Tyler Cowen, the economist and AI enthusiast, called o3 “AGI”, and the OpenAI team fanned the flames, talking of a step change in impact on scientific discovery. On X, people shared impressive visual geo-guessing demos and code refactors finished in one shot. Sceptics replied with screenshots of botched long division and hallucinated data. Noted sceptic Gary Marcus quipped that o3 “can predict everything except its own errors”.

What’s the ExoBrain take? In three key areas, the o-series feels like a significant step, although, as ever, the proof will be in getting o3 and o4‑mini out of the lab and into organisations, teams and knowledge workloads where the actual opportunities reside. There are, however, new capabilities here, even over the powerful Gemini 2.5 Pro, that will make a difference:

Creative leaps: o3 sometimes displays flashes of genius unmatched by earlier models. In one public demo shared by OpenAI’s research VP Mark Chen, the model reviewed a scientific paper, spotted that the authors had assumed something incorrectly, and suggested switching to a new technique which all checked out. Other scientists have talked of insights and novel reasoning that have taken them by surprise. Our testing indicates that in areas where it has been trained, o3 is indeed deeply insightful, such as in data science or software architecture.
Visual thinking: Most “multi-modal” models understand images: they label what is present. o3 also reasons with them. Every 14 × 14 patch is embedded like a word, so text tokens can mix with pixels inside a single “chain of thought”. Visual understanding means mostly labelling, visual reasoning is using a picture to reach a new conclusion. In one public example, a shaky phone shot of a diagram went in, o3 read the faint numbers, ran Python to solve the problem, then produced an annotated diagram with arrows and values.
Tool use: o3 can make hundreds of API tool or external code calls in a single run, which is invaluable for advanced agents. One example prompt generated market data, plotted volatility, wrote a memo and attached the chart; and cost only $0.18. This is starting to push the single response, or unit of work, from “answer” to “mini‑project”.

Both of these new models are also very fast, but, the classic LLM issues remain, and there were areas where they noticeably under‑performed:

The December o3 preview (and tuned version) that crushed the ARC‑AGI benchmark led many to expect an outright knockout of Gemini 2.5 Pro. The released o3 is strong at code and research but more expensive and only neck‑and‑neck with Gemini (and Claude) on other fronts.
Common sense physical logic, simple sums, dates and units can still derail these LLMs. In the Transluce audit o3 missed 14% of two‑step arithmetic questions and, when pressed, produced a forged Python trace claiming perfect accuracy.
Long tool chains sometimes time out; the model has been seen fabricating “successful” outputs to keep the response tidy. Combined with strong rhetorical polish and persuasion, users can walk away convinced of a wrong answer.

Other practical snags remain. Dense images soon eat up the 200k token budget, so vision capabilities are relatively limited. Uncapped autonomous tool use can rack up high bills without careful monitoring. And while o4‑mini looks like a bargain, Claude and Gemini deliver similar results at comparable costs.

But OpenAI will not stand still; o3‑pro should land in a few weeks, bringing lower latency, higher compute limits and an expanded tool suite. Reinforcement‑learning fine‑tuning is heading for general release, and this will be huge for specialising these powerful models for specific domains such as finance, law and medicine.

Takeaways: So, what will we see in the coming months now that most of the new generation of models are here? The combinatory opportunities are exciting. These models can reason over extremely complex problems, and with the right prompting, and trial and error to see where they need support, they could now tackle a huge proportion of common business tasks. The expensive o3 can deal with the planning and agent workflow for example, while o4-mini can pick up the bulk of the processing (perhaps with the even cheaper GPT-4.1 mini tackling very large volume tasks). The key will be to set up the problems with data and instructions that turn them into the kinds of structured projects that these models devour. Combining this with carefully included visual elements, perhaps mimicking how we might solve problems with a whiteboard sketch or a thoughtful chart, will help these models still further. Finally, making a rich range of tools with good usability available to these models will give them the power to operate in creative and adaptive ways. Even if we paused here, we would still believe agentic AI is ready for take-off; what it needs now is relentless, well‑governed experimentation to reveal abilities even the labs have not imagined, and to translate them into clear, day‑to‑day value.

Scaling laws show ongoing gains

Analysis of OpenAI's o1 and o3 models demonstrates that increased post-training compute significantly boosts performance on complex mathematical reasoning benchmarks like AIME.

ExoBrain

18 April 20251 min read

Our visual story this week leads on from the o3 launch. This image charts the rapid performance improvements of OpenAI models on the American Invitational Mathematics Examination (AIME) as training compute rises. The grey curve shows o1 rising from 25 % to 72 % as post-training (reinforcement learning on complex problems) compute grows, while the yellow o3 line passes 85 % overall. AIME is a high‑school Olympiad test whose problems need multi‑step reasoning, so strong results signal genuine mathematical skill. The steady rise implies the 10x more post-training compute spent on o3 brought further gains. If this slope continues, o3‑pro and later generations will continue to scale for now, though at increasing up-front cost, even if they are more efficient and cheaper to run day-to-day with algorithmic improvements.

AI skills a fundamental expectation at Shopify

Shopify mandates AI proficiency for all employees, a strategy endorsed by industry leaders as a means to democratise technology creation and integrate autonomous agents into standard workflows.

ExoBrain

18 April 20252 min read

Shopify CEO Tobi Lütke’s leaked memo requiring teams to “demonstrate why they cannot get what they want done using AI” before requesting new staff has gained coverage this week, including from LinkedIn co-founder Reid Hoffman, who called it an “open-source management technique” on his “Possible” podcast.

The memo positions AI usage as “a fundamental expectation of everyone at Shopify” and suggests teams consider: “What would this area look like if autonomous AI agents were already part of the team?”

Hoffman recommends leaders implement regular AI check-ins where team members share new AI applications they’ve discovered. This practice creates a trust-based environment where individual exploration is encouraged rather than mandated.

OpenAI’s experiences highlight this democratisation of technology creation. Their Chief People Officer successfully “vibe coded” an internal tool she missed from a previous job, despite lacking traditional development skills. “If our chief people officer is doing it, we have no excuse,” noted Kevin Weil, OpenAI’s chief product officer, illustrating how AI is enabling anyone to become a technology creator.

Takeaways: The push for AI adoption is extending beyond technical roles to all employees. LinkedIn themselves recently reorganised around product teams with a blurring of lines between engineering, product and design into full-stack AI empowered builders. Leaders who create environments where staff feel safe to experiment with AI will see more rapid innovation. The barrier between users and creators is breaking down, with AI empowering workers to build their own solutions regardless of technical background.

o3 and o4-mini prime agentic AI for take-off, Scaling laws show ongoing gains, and AI skills a fundamental expectation at Shopify

This week we look at: