Gemini through the looking glass

Google’s Gemini 2.0 debuted this week, almost exactly one year on from the first Gemini release. The initial ‘2.0 Flash’ model, an efficient workhorse version, introduces true real-time video ingestion, native multimodal capabilities, and performance that punches well above its weight. This suggests there is a lot to come from its more powerful siblings. The Gemini 2.0 comms barrage also included a raft of new products, projects and ideas, many of them built around seamless switching between text, visuals and audio, and around agents for world understanding (Astra), coding (Jules), and browser tasks (Mariner). Not to be outdone, OpenAI simultaneously announced screen-sharing and live video features for ChatGPT’s Advanced Voice Mode, enhancing its ability to assist users in real-world activities.

At the core of Gemini 2.0 is its ability to understand and generate text, images, audio, and video interchangeably. Google believes ‘multimodality’ is one of the keys to creating AI systems that can work truly autonomously. The Multimodal Live API, made available this week in Google AI Studio, allows developers to build systems that interpret visual data in real time, engage in dialogue, and execute tasks. A fluid voice chat in which the AI can look at your screen, or see what you are seeing, and provide insights is impressive. OpenAI have likely been holding the roll-out of their live frame-by-frame video (demonstrated back in May) for just this moment. They duly released it as the social media buzz around Gemini 2.0 was developing, with people posting fun examples, from completing maths homework to suggesting cocktails after observing a shelf of bottles. OpenAI’s audio and video experiences at this stage feel more polished, and whilst it’s taken many months, the rich interactions promised earlier in the year are now here.

But where Gemini 2.0 is different is the move from text to fully interchangeable modalities, and towards what Demis Hassabis calls ‘world models’. Google is moving beyond internet training data and structured, pre-labelled visual data towards something more akin to human learning. Oriol Vinyals, co-tech lead of the Gemini project, suggests that while current models excel at connecting concepts that come with textual descriptions, they haven’t yet cracked developing true world understanding. Unlike a child who can watch objects fall and gradually build an intuitive understanding of gravity, many current AIs still rely heavily on human-provided text descriptions to make sense of visual information. But Google is determined to overcome that limitation and exploit the untapped potential of video as a vast repository of knowledge about physics, causality, and natural laws – knowledge that exists independently of intrinsically limited human annotation. While models can tell you what’s happening in a video, they can’t yet extract fundamental principles from pure observation. It’s the difference between describing a falling apple and deriving Newton’s laws of motion. Imagine a training regime that could extract fine-grained insights from high-definition video; this is a world away from a brief textual explanation of a static image.

The introduction of OpenAI’s Sora may appear tangential to Gemini 2.0’s multimodal capabilities, but it actually sits at the heart of the same fundamental challenge – developing world models. While Sora’s immediate application is video generation, OpenAI is pursuing a similar goal to Google: teaching AI to understand physical laws and causality. What makes Sora particularly relevant is how it appears to have developed an intuitive grasp of physics, motion, and object persistence – not just through labelled data, but through learning to predict how scenes naturally unfold.

Takeaways: Google’s Gemini 2.0 release represents a huge bet on true multimodal AI rather than just connecting different modes through text, and OpenAI, with Sora, is making the same bet, albeit via a slightly different track. Both firms believe that robust, autonomous agents must have a better understanding of the world. Humans can resolve issues with the tools and technology around us because we have spent years problem solving in the physical environment, and agents will need to do the same. LLMs can mimic human communication, and new reasoning models are starting to solve complex scientific problems. Will world models truly and independently understand the world? 2025 may be the year we find out.

