Visual thinking points to the next wave

OpenAI released ChatGPT Images 2.0 this week, and the launch poster is worth a close look. In a single generated frame it carries a working QR code, sharp multilingual headlines, and a detailed product still life with consistent lighting across a row of branded objects. Four years ago a diffusion model could barely spell a shop sign. Today one prompt produces a scannable code, readable typography, and a self-critique step where the model checks its own draft before handing the file over. That is a dramatic improvement, and it is worth asking how it happened.

The easy answer is that the models got bigger. The more useful answer is that the architecture changed underneath. Early image models were classical diffusion, guided by a text encoder and steered from the outside with tools like ControlNet. Images 2.0, along with Google’s Nano Banana line, belongs to a newer family where a single transformer handles reasoning, web search, layout planning and image generation in one shared context. The closest public description is Meta’s Transfusion paper, which interleaves text tokens with continuous image latents and hands the final pixel rendering to a small diffusion decoder sitting on the back of the transformer. Leaks of gpt-image-2 from LMArena testers point to the same recipe, with a coarse-to-fine planning phase that lays out composition before any detail is drawn. The reasoning process and the pixel generation are no longer two models connected by a prompt. They are the same model producing different kinds of token.
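To make "one shared context" concrete, here is a toy sketch of the attention pattern the Transfusion paper describes: text positions attend causally, while patches belonging to the same image also attend to each other in both directions. It assumes only what that paper states publicly. The function name `build_attention_mask` and the `(kind, length)` segment format are invented for illustration, and nothing here is a claim about how Images 2.0 is built internally.

```python
from typing import List, Tuple

def build_attention_mask(segments: List[Tuple[str, int]]) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    `segments` is a hypothetical description of an interleaved sequence,
    e.g. [("text", 2), ("image", 3), ("text", 1)].
    """
    kinds: List[str] = []
    block_ids: List[int] = []
    for block, (kind, length) in enumerate(segments):
        if kind not in ("text", "image"):
            raise ValueError(f"unknown segment kind: {kind!r}")
        kinds.extend([kind] * length)
        block_ids.extend([block] * length)

    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            # Every position sees itself and everything earlier in the sequence.
            causal = k <= q
            # Patches of one image additionally see each other, including later patches.
            same_image = kinds[q] == "image" and block_ids[q] == block_ids[k]
            mask[q][k] = causal or same_image
    return mask
```

Calling `build_attention_mask([("text", 2), ("image", 3), ("text", 1)])` gives a 6×6 mask in which the three image patches form a fully connected block, and every patch can still look back at the two text positions before it. In this toy picture, that backward view is the sense in which the brief stays visible while each patch is produced.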

That explains why a QR code works. A code only scans if its grid of modules encodes the payload together with matching error-correction data, within the small margin of damage that error correction can repair, and no amount of prompt engineering can guarantee that in a traditional diffusion pipeline. Consistent characters across eight frames used to require seed wrangling and adapter models. Multilingual typography used to be a lottery. Once the reasoning trace sits in the same attention window as the image latents, every patch of the output is generated while the model is still thinking about the brief, the search results and the prior drafts.

The direction of travel is what makes this week’s news worth a pause. Text models have looked like a monoculture for three years, mostly dense or mixture-of-experts transformers with a few inference tricks bolted on. Image work has been the opposite, a zoo of competing designs, because pixels exposed problems that pure autoregression could not solve. The techniques developed in that zoo are now flowing back the other way. Inception Labs’ Mercury 2 is a commercial diffusion language model running at over 1,000 tokens per second. LLaDA 2 applies masked diffusion to text and claims to fix the reversal curse. Google has shown Gemini Diffusion at similar speeds. More than 50 diffusion language model papers landed in 2025.
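The speed claims for diffusion language models follow from decoding several positions per step instead of one token at a time. The sketch below is a toy version of confidence-ordered unmasking, one common schedule for masked diffusion text models. It is an illustration under stated assumptions, not the decoder used by Mercury 2, LLaDA 2 or Gemini Diffusion: `toy_predictor` is a hypothetical stand-in for a trained denoiser (it simply knows the target sentence), and its confidence formula is made up.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decoded yet
Sequence = List[Optional[str]]
Predictor = Callable[[Sequence], List[Tuple[str, float]]]

def toy_predictor(target: List[str]) -> Predictor:
    """Hypothetical denoiser: proposes the target token at every position,
    with higher confidence next to tokens that are already revealed."""
    def predict(seq: Sequence) -> List[Tuple[str, float]]:
        proposals = []
        for i, token in enumerate(target):
            revealed_neighbours = sum(
                1 for j in (i - 1, i + 1)
                if 0 <= j < len(seq) and seq[j] is not MASK
            )
            # Made-up confidence; the small position penalty breaks ties.
            proposals.append((token, 0.5 + 0.2 * revealed_neighbours - 0.001 * i))
        return proposals
    return predict

def masked_diffusion_decode(
    length: int, predict: Predictor, tokens_per_step: int
) -> Tuple[List[str], int]:
    """Start fully masked; each step commits the most confident proposals."""
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    seq: Sequence = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in seq):
        proposals = predict(seq)  # one model call scores every position
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:tokens_per_step]:
            seq[i] = proposals[i][0]
        steps += 1
    return [tok for tok in seq if tok is not MASK], steps
```

With a six-word target and `tokens_per_step=2`, the loop finishes in three model calls where token-by-token decoding would need six. Real systems trade the number of tokens committed per step against quality, which this toy predictor cannot show.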

Takeaways: Images 2.0 is a fine product in its own right, but for anyone planning AI investments this year the more important signal is what it tells you about the architectural settlement we all assumed in 2024. That settlement is loosening. Image work is where you see it first, and the same ideas are now pushing into the text stack that most businesses actually run on.
