This week saw visual AI and world models dominate headlines, with three announcements converging on the same idea: AI needs to understand physical space, not just language.
Fei-Fei Li, whose ImageNet work helped kick-start modern computer vision in the early 2010s, launched Marble, the first product from her new company, World Labs. Marble turns text, images, or video into interactive 3D environments, with editable objects and exportable assets. Li built the company with the researchers behind Neural Radiance Fields (NeRFs), the approach that turns 2D images into 3D scenes.

Google DeepMind, meanwhile, announced SIMA 2, an AI agent powered by Gemini that learns to act inside 3D virtual worlds. SIMA has further potential when paired with GENIE, DeepMind’s earlier model that generates game-like environments from text prompts. GENIE builds the worlds; SIMA moves through them and learns, creating its own training curriculum by inventing harder and harder tasks. The result is a near-perfect stand-in for a real-world training environment.
Li released a 6,000-word manifesto alongside Marble. Her argument, illustrated with many compelling examples, is that spatial intelligence came hundreds of millions of years before language, and that our current systems are “wordsmiths in the dark”: brilliant with symbols but detached from the physical world, and therefore fundamentally limited. She positions spatial intelligence as the scaffolding of cognition: from pouring coffee to weaving through a crowd, we lean on an internal model of objects, forces, and consequences.
ImageNet, in this story, was the proof point that with neural networks, vision could also be learned from data at scale, not hard-coded. When Geoff Hinton’s AlexNet team crushed image recognition benchmarks in 2012 using ImageNet (and found new potential in Nvidia gaming chips), it made large-scale visual learning credible. Marble and world models are framed as the next big step: from recognising what’s in a scene to simulating how it might unfold.
We also learnt this week that Yann LeCun is reportedly leaving Meta to build his own world model company. He has been clear for years that language models are not enough, arguing that real intelligence requires understanding physics and causality, not just patterns in text. He recently joked that before we worry about controlling superhuman AI, “we need to have the beginning of a hint of a design for a system smarter than a house cat.” His criticism mirrors Li’s: today’s LLMs talk cleverly about the world but don’t embody it.
So, on one side you have Li and LeCun, both pushing hard on spatial intelligence and world models. On the other, a quieter camp suggests the “language versus space” framing might be wrong from the start.
Christopher Summerfield, a neuroscientist at Oxford and former DeepMind researcher, admits he was “really shocked” when language models started showing strong reasoning abilities without any visual input. His work (and recent book These Strange New Minds) suggests that both human brains and LLMs learn similar abstract representations. In one example, he found “Christmas neurons” in AI models that activate for things like trees, sledges, and rituals, which look nothing alike but share conceptual links. These abstractions emerged from text alone. Language, it seems, contains far more structure about the world than we expected.
If there is a figure from history who makes a perfect case study for this debate, it is Helen Keller. Born in 1880, she lost both her sight and hearing before the age of two and yet became a respected writer and thinker. Critics claimed her knowledge was “second-hand”, that she could not truly understand visual ideas. Yet she discussed colours she never saw and places she never visited, using language to connect to a shared conceptual world. Her story suggests we must already carry some kind of internal structure for objects and relations before language arrives, but once some form of language attaches to that structure, it becomes the main medium of thought.
Nick Chater and Morten Christiansen, who have studied language for decades and recently published a fascinating book, The Language Game, push this further. They argue language is not a fixed code with a buried grammar waiting to be decoded, but an endless game of charades played over generations. Meaning is improvised, not retrieved. Their “Vocalisation Challenge” showed that people around the world could invent and understand sounds for concepts like “tiger” or “water” with no visual grounding. Sound patterns alone carried meaning.
We thus end up with a strange tension. If a deaf-blind person can become a deep thinker through language alone, if neuroscience finds that brains and LLMs share similar abstract structures, and if communication can emerge from sound without any visual cue, then the demand that AI must be embodied starts to look less clear-cut.
The resolution may be that spatial and linguistic intelligence are not rivals at all, but different projections of a deeper world. Language evolved to talk about a shared physical and social reality. Spatial reasoning shapes so many of our metaphors: “higher status”, “deeper questions”, “close relationships”. Trying to separate the two may be like arguing whether a map is about its symbols or its geometry.
Helen Keller did not lack spatial intelligence; she modelled the world through touch, motion, and language. She could navigate spaces, sculpt, and reason about objects. LLMs may be doing something similar but from the opposite direction: inheriting a compressed, textual record of humanity’s interaction with the world. When they reason about packing boxes in a van or laying out a floor plan, they are tapping into that shared encoded fabric rather than simulating physics directly.
Li is right that today’s AI still struggles with many physical tasks. Robots remain clumsy; virtual agents are brittle. As we see with DeepMind’s SIMA training itself to better understand its environment, an environment generated by GENIE, the spatial dimension provides a new and limitless source of training data. But Keller’s achievements and Summerfield’s work both suggest that intelligence is not tied to any single modality. It emerges from the structures we can represent and manipulate, whether those come in through sight, sound, touch, or text scraped from the Internet. LLMs may not yet “understand” the world in a human sense, but they are already a surprising window into this deeper fabric.
Takeaways: It is understandable that Li and LeCun continue to push world models, spatial reasoning, and embodiment as the missing jigsaw pieces, not least because their framing also helps them stand out in a crowded research (and funding) landscape. But human intelligence seamlessly integrates spatial and linguistic understanding, and truly general AI likely must do the same. The rush toward world models represents genuine technical advancement, but language models have already inherited millennia of compressed spatial understanding through text. As these approaches converge, with language models gaining 3D capabilities and world models incorporating semantic reasoning, the distinction between linguistic and spatial intelligence may prove to be an artificial boundary.
