Mechanistic interpretability, or mech-interp, is the attempt to work out what is actually happening inside a neural network. Most AI engineers treat models as black boxes that take an input and return an output. Mech-interp opens the box. We’ve covered this topic multiple times, starting roughly two years ago with Anthropic’s Golden Gate Claude, the moment a single internal “feature” was amplified until the model could not stop talking about the bridge. That result showed that concepts can appear as identifiable features in a model’s internal representations, and that amplifying those features can change behaviour. The field has been trying to read those structures ever since.
Two new pieces of work push the conversation forward this week. Anthropic has built what it calls a natural language autoencoder. The method is roughly this: take an activation (the numerical pattern of “neurons” firing) from inside a running model, ask a second model to describe in English what that activation represents, then test the description by reconstructing the original activation from the text. If the round trip works, the explanation has captured the information in the activation. Applied to Claude, the method surfaces things the model does not say out loud.
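
To make the round-trip test concrete, here is a minimal sketch of how a reconstruction could be scored. It is not Anthropic's implementation: the function name `round_trip_score`, the two metrics, and the random vectors standing in for real activations are all assumptions chosen for illustration. The describing model and the reconstructing model are left out entirely; only the final comparison is shown.

```python
import numpy as np

def round_trip_score(original, reconstructed):
    """Compare an activation with its reconstruction from a text description.

    Returns (cosine similarity, relative squared error). Both metrics are
    illustrative choices, not ones confirmed by the source.
    """
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    cosine = float(
        original @ reconstructed
        / (np.linalg.norm(original) * np.linalg.norm(reconstructed))
    )
    residual = original - reconstructed
    relative_error = float((residual @ residual) / (original @ original))
    return cosine, relative_error

# Synthetic stand-ins: a random vector plays the role of a real activation.
rng = np.random.default_rng(0)
activation = rng.normal(size=512)

# A "faithful" description would let the decoder land close to the original.
faithful = activation + 0.1 * rng.normal(size=512)
# An "uninformative" description would yield an unrelated vector.
uninformative = rng.normal(size=512)

print("faithful:     ", round_trip_score(activation, faithful))
print("uninformative:", round_trip_score(activation, uninformative))
```

A high cosine similarity with a low relative error would suggest the description preserved most of what the activation encoded; an unrelated reconstruction scores near zero on the first and poorly on the second.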
Goodfire is a San Francisco interpretability company founded in 2024, now valued at over a billion dollars. In new research they are tackling the interpretability question from a different angle. Activations inside transformer models are vectors living in spaces of thousands of dimensions. Goodfire has been exploring what shape those activations can take. Across controlled tasks, Goodfire finds that activations do not simply scatter randomly. They often sit on structured, curved geometries: months and weekdays form cycles; sequential concepts trace paths; graph tasks produce graph-like structures; physical simulations produce trajectories that respect the dynamics of the system. In separate biological work on Evo 2, a genomic foundation model trained on DNA from more than 100,000 species, Goodfire found that phylogenetic relationships are encoded geometrically in the model’s internal representations.
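
As a toy illustration of what "months form cycles" could look like numerically, the sketch below builds synthetic activations, not real model data, by placing twelve points on a circle and hiding that circle inside a 64-dimensional space. The dimension, the noise level, and all function names are assumptions made for the example, and principal component analysis is used here only as a familiar way to look for low-dimensional structure; it is not presented as Goodfire's method.

```python
import numpy as np

def synthetic_month_activations(dim=64, noise=0.01, seed=0):
    """Twelve fake 'month' activations lying on a circle embedded in `dim` dimensions."""
    rng = np.random.default_rng(seed)
    angles = 2 * np.pi * np.arange(12) / 12
    circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (12, 2)
    # Random orthonormal pair of directions that spans the hidden plane.
    basis, _ = np.linalg.qr(rng.normal(size=(dim, 2)))           # shape (dim, 2)
    return circle @ basis.T + noise * rng.normal(size=(12, dim))

def top2_variance_fraction(acts):
    """Fraction of total variance captured by the top two principal components."""
    centered = acts - acts.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return float((singular_values[:2] ** 2).sum() / (singular_values ** 2).sum())

def radii_in_top2_plane(acts):
    """Distance of each point from the centre, measured in the top-two-component plane."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    return np.linalg.norm(coords, axis=1)

acts = synthetic_month_activations()
print("variance in top 2 components:", top2_variance_fraction(acts))
print("radii:", np.round(radii_in_top2_plane(acts), 3))
```

In this constructed case nearly all the variance sits in a two-dimensional plane and every point lies at roughly the same distance from the centre, which is the signature of a ring rather than a random scatter.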

This level of interpretability is useful when we want to actively steer models. Steering is the act of nudging a model’s internal state at inference time to change its behaviour, without retraining. The standard approach treats a model’s internal concepts as directions: straight lines you push along. Goodfire’s work shows that when a concept lies on a curved surface, straight-line steering walks off that surface into regions where the model’s output becomes incoherent. Steering along the geometry instead is more reliable and more precise. For safety, alignment, and commercial control of model behaviour, that is a significant practical advance.
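
The difference between the two kinds of steering can be seen in a deliberately simple setting. The sketch below assumes a concept lives on a unit circle in two dimensions, a stand-in for the much higher-dimensional curved structures described above. All function names are hypothetical, and this is an illustration of the geometric point rather than a reproduction of Goodfire's technique.

```python
import numpy as np

def on_circle(theta):
    """Point on the unit circle at angle `theta` (radians)."""
    return np.array([np.cos(theta), np.sin(theta)])

def linear_steer(start_theta, target_theta, t):
    """Straight-line steering: move a fraction `t` along the chord to the target."""
    a, b = on_circle(start_theta), on_circle(target_theta)
    return a + t * (b - a)

def arc_steer(start_theta, target_theta, t):
    """Geometry-following steering: move a fraction `t` along the arc to the target."""
    return on_circle(start_theta + t * (target_theta - start_theta))

def off_manifold_distance(point):
    """How far a point sits from the unit circle."""
    return abs(float(np.linalg.norm(point)) - 1.0)

start, target = 0.0, 2 * np.pi / 3  # two concepts a third of a turn apart
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    straight = off_manifold_distance(linear_steer(start, target, t))
    curved = off_manifold_distance(arc_steer(start, target, t))
    print(f"t={t:.2f}  straight-line drift={straight:.3f}  arc drift={curved:.3f}")
```

Both paths start and end in the same places, but the straight line cuts through the interior of the circle, where in this toy picture no valid concept lives, while the arc stays on the curve the whole way.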
The harder question is what the geometry means. Elan Barenholtz, the cognitive scientist, argues that the shapes are not something the model consults while thinking. They are properties of frozen weights, visible only because activations pass through them. His thesis goes further: the existence of these structures (and indeed the fact that language models can self-generate coherent text from them) suggests language itself has a property of predicting its own continuations. If he is right, these geometries are the structure of language and data, not the structure of thought. Goodfire has not claimed otherwise. What they have shown is that this structure, whatever its ultimate nature, is causally navigable.
If large models trained on reality must compress reality to predict it, then the shapes inside them are artefacts of that compression, and they mirror the geometry of the world itself. Goodfire’s pilot on Alzheimer’s detection found a new class of biomarkers hiding in the geometry of a blood-test model. Their work with the Arc Institute recovered phylogenetic structure from a genomic model. In both cases the geometry was a path to new scientific knowledge, and this seemingly esoteric line of research may be opening up a new way to do science.
Takeaways: Most AI value today is extracted from automating coding and general knowledge work: the first-order win of a technology that can imitate human labour at scale. Reading the geometry inside models offers something stranger and potentially more profound. If these shapes encode structure that the world imposed on the training data, then interpretability becomes a scientific instrument for surfacing new regularities. That is a move from AI as a faster version of what we already do, to a microscope for a hitherto invisible world.
