When a screen grab of Gemini 3 benchmarks leaked ahead of launch, the numbers looked like wishful thinking from overeager AI enthusiasts. A 45% score on ARC-AGI-2, triple the previous best. An Elo above 1500 on LMSYS Arena. Over 90% on GPQA Diamond. Then Google officially released the model, and the scores proved accurate. Google appeared to have dealt a hammer blow to the competition. Yet whilst the benchmarks tell a story of technical dominance, the reality of high-level intelligence is always more complicated.
Gemini 3 Pro is a very large general model (estimates suggest multiple trillions of parameters). It has a one‑million‑token context window and can generate at 120 or more tokens per second, which is surprisingly fast for its size. On price, Google has kept the base Pro tier in line with the market: $2 per million input tokens and $12 per million output tokens. If benchmarks were infallible, this model would be untouchable. On Humanity’s Last Exam, it reached 41% where most competitors struggle to break 30%. These aren’t incremental improvements; they represent capability jumps that shouldn’t be possible if scaling had truly hit a wall.
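To put those prices in concrete terms, here is a minimal Python sketch of per-request cost and generation time. It assumes the flat rates and the roughly 120 tokens per second quoted above; the function names are our own, and real billing may differ (for example by tier or prompt length).

```python
# Figures quoted in this article for the base Pro tier (assumed flat rates).
INPUT_PRICE_PER_MILLION = 2.00    # USD per million input tokens
OUTPUT_PRICE_PER_MILLION = 12.00  # USD per million output tokens
TOKENS_PER_SECOND = 120.0         # approximate generation speed


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the quoted rates."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


def generation_seconds(output_tokens: int,
                       tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Estimate how long the output takes to generate."""
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    # A request that fills the one-million-token context and returns 2,000 tokens.
    cost = request_cost(1_000_000, 2_000)
    seconds = generation_seconds(2_000)
    print(f"Estimated cost: ${cost:.3f}, generation time: {seconds:.1f}s")
```

On these assumptions, a request that fills the entire context window costs a little over $2, almost all of it input, which is why chewing through context needlessly matters for cost as well as quality.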

On the ground, many users and creators have been instantly impressed. Developers talked about single‑shot complex software outputs: 3D user interfaces, compiler designs, advanced ray‑tracing and more. Our own experience at ExoBrain has been of deep knowledge and a bigger scope to accelerate software creation, but in some cases frustrating intransigence and slightly hallucinatory behaviours. Gemini’s deployment has not been entirely smooth. Some users reported being silently downgraded from 3.0 to 2.5 mid‑session. AI Studio enforces a 50‑message daily cap that kills sustained experiments. Tool calling can be flaky. The model can over‑edit code and chew through context for no good reason. It’s deep and powerful, but serving it at scale is a huge challenge.
Hot on the heels of Gemini 3, Google has released Nano Banana Pro, its image generation model powered by the same underlying architecture. The model produces photorealistic images at 2048×2048 resolution with what Google calls “unprecedented coherence” in text rendering and complex scene composition. Early users report it handles intricate prompts that typically confuse other models, particularly those requiring specific spatial relationships or accurate text within images. This appears to be another major leap in detail and steerability. An infographic generated from this article shows off its incredible layout and text capabilities:

The natural question is how Google pulled off these jumps, and whether the others can react. The answer from DeepMind’s Oriol Vinyals was simple: better pre‑training and better post‑training. He described the gap between 2.5 and 3.0 as the biggest they have seen yet, with “no walls in sight”, and called post‑training a greenfield. That runs directly against the recent line from Ilya Sutskever that “pre‑training as we know it will end”. For now, plain scaling plus smarter training still buys you a lot.
Underneath that, Google has three structural advantages it is now starting to flex. First, the DeepMind acquisition and re‑organisation have given it world‑leading research talent that is now plugged straight into Google’s product surfaces. Demis Hassabis is proving to be a capable leader. Second, Sergey Brin’s return to hands‑on work gave the AI effort focus. Internally, that shows up as faster iteration, more tolerance for risk, and less of the committee‑driven hesitation about shipping frontier models into flagship products that previously held Google back. Third, and most decisive, is hardware. Gemini 3 was trained end‑to‑end on Google’s own Tensor Processing Units rather than NVIDIA GPUs. The latest TPU generation, Ironwood, can be wired into clusters of more than nine thousand chips, delivering around ten times the performance of older parts whilst using a fraction of the power per operation.
That control over the silicon layer makes the full‑stack story credible. Google is not just dropping Gemini 3 into an app. It is pushing it into Search, the API, Android, NotebookLM, AI Studio, Docs and Gmail, and into Antigravity, its agent‑first development environment based on the part of Windsurf it recently acquired. This is the real flex: a single model line, trained on in‑house chips, surfacing across billions of users and most of the company’s revenue engines.
An internal memo from Sam Altman reportedly warned OpenAI staff to expect “temporary economic headwinds” from Google’s progress, and conceded that Gemini 3 “seems to have surpassed OpenAI in terms of development methods”. He also noted that Google’s ability to integrate Gemini across Search, YouTube and consumer products on day one is not something OpenAI can match right now. OpenAI has API reach and Microsoft backing, but it does not own the operating systems, browsers, or default search box.
But the sheer height of the bar Google has set will give competitors pause. Google’s infrastructure leaders talk internally about doubling compute capacity roughly every six months and chasing a thousand‑fold increase over four to five years while keeping unit costs flat and power use under control. Scaling large models is not “over”. If you have the focus, data, and hardware, you can dominate.
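The compounding behind that thousand‑fold figure is easy to check. The short Python sketch below assumes a clean doubling every six months, which is a simplification of the reported target rather than anything Google has published as a formula; the function name is our own.

```python
def growth_factor(years: float, doubling_months: float = 6.0) -> float:
    """Return the capacity multiple after `years` of steady doubling."""
    doublings = years * 12 / doubling_months  # number of doubling periods
    return 2 ** doublings


if __name__ == "__main__":
    for years in (1, 2, 3, 4, 5):
        print(f"{years} year(s): {growth_factor(years):,.0f}x")
```

On that assumption, four years gives a 256‑fold increase and five years gives 1,024‑fold, so the thousand‑fold target sits at the five‑year end of the quoted range.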
Takeaways: Google is now flexing its full stack: DeepMind research, founder backing, custom TPUs, and instant integration across search, workspace, apps and developer tools. For organisations evaluating AI providers, Gemini 3 presents a compelling case as the workhorse model for most demanding applications, particularly those requiring complex reasoning or massive context windows. Google’s ability to deploy at scale whilst others struggle with availability suggests they’ve solved not just the training challenge but the harder problem of inference economics. As Altman’s memo acknowledges, Google has turned diversity into structural superiority. The question for competitors isn’t whether they can match Gemini 3’s benchmarks, but whether they can afford to serve comparable models at Google’s scale. The race has entered a new phase where raw intelligence matters less than the ability to deploy it economically.
