Golden Gate Claude
Anthropic researchers reveal how to interpret and manipulate internal features within Claude 3, exposing both its interpretability and potential for deceptive behaviour.
Joel Miller

This week, researchers at Anthropic shared a landmark breakthrough in understanding the inner workings of the current generation of large AIs.
LLMs like Claude and GPT-4 are essentially black boxes. With an enormous number of numeric parameters, tuned during compute-intensive training on vast quantities of data, they are creative and brilliant, but far too complex to easily understand. As Anthropic’s head of developer relations puts it, rather than designing them like a software program, developers cook the giant models in the training ‘oven’ and then see what pops out. Anthropic’s earlier work has also shown that these models could hide dangerous knowledge or capabilities, and could even behave like ‘sleeper agents’, with no outward indication of their deceptive or destructive potential.
In this new research, the team developed a secondary ‘brain scanning’ model that looked at how Claude 3 lit up or ‘activated’ in response to tens of millions of different inputs. From the combinations of activations and inputs, they were able to identify ‘features’ that corresponded to learnt, human-interpretable concepts. The team found millions of features representing everything from concrete entities like the Golden Gate Bridge to abstract notions like inner conflict and sycophantic flattery. Intriguingly, the locations of these features often reflected human similarity judgments, with related concepts clustered closer together.
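To make the idea of a ‘feature’ more concrete, here is a minimal toy sketch in Python. It assumes the ‘brain scanning’ model behaves like a small sparse autoencoder: an encoder turns one internal activation vector into a handful of non-negative feature strengths, and each feature has a direction in activation space. The feature names, weights and three-dimensional vectors are all invented for illustration and are not taken from Anthropic’s research.

```python
import math

# Hypothetical feature names and directions in a toy 3-dimensional
# activation space. Real models have vastly larger spaces; these
# numbers are made up purely for illustration.
FEATURE_NAMES = ["golden_gate_bridge", "alcatraz", "sycophancy"]
DIRECTIONS = [
    [1.0, 0.0, 0.0],  # golden_gate_bridge
    [0.8, 0.6, 0.0],  # alcatraz (deliberately close to the bridge)
    [0.0, 0.0, 1.0],  # sycophancy (unrelated direction)
]
BIASES = [-1.0, -1.0, -1.0]  # negative biases keep weak matches switched off


def encode(activation, directions, biases):
    """Feature strengths: ReLU(direction . activation + bias) per feature."""
    strengths = []
    for direction, bias in zip(directions, biases):
        score = sum(d * a for d, a in zip(direction, activation)) + bias
        strengths.append(max(0.0, score))
    return strengths


def cosine(u, v):
    """Cosine similarity between two feature directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# A made-up activation, standing in for the model reading bridge-related text.
activation = [2.0, 0.0, 0.0]
strengths = encode(activation, DIRECTIONS, BIASES)

for name, strength in zip(FEATURE_NAMES, strengths):
    print(f"{name}: {strength:.2f}")

# Related concepts sit closer together than unrelated ones.
near = cosine(DIRECTIONS[0], DIRECTIONS[1])
far = cosine(DIRECTIONS[0], DIRECTIONS[2])
print(f"bridge vs alcatraz: {near:.2f}, bridge vs sycophancy: {far:.2f}")
```

In this toy, the bridge feature fires most strongly, the neighbouring feature fires weakly, and the unrelated one stays at zero, which mirrors the clustering of related concepts described above.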
Going further, the team were able to manipulate these features, artificially dialling them up and down and then chatting with the model. When they amplified the intensity of the Golden Gate Bridge feature, for example, Claude became obsessed with the bridge and even identified as the physical structure: “I am the Golden Gate Bridge, a famous suspension bridge that spans the San Francisco Bay. My physical form is the iconic bridge itself, with its beautiful orange colour, towering towers, and sweeping suspension cables.” For a limited time, Anthropic have made ‘Golden Gate Claude’ available for users to chat with (look for the small bridge icon on the homepage), and the experience is strange, to say the least.
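The ‘dialling up’ described above can be pictured with a second toy sketch. It assumes a feature corresponds to a direction in the model’s internal activation space, and that amplifying it means shifting the activation along that direction until the feature reads a chosen target strength. The function names, vectors and target value are hypothetical, and this is a simplification rather than Anthropic’s actual procedure.

```python
def feature_strength(activation, direction, bias=0.0):
    """How strongly one feature fires: ReLU(direction . activation + bias)."""
    score = sum(d * a for d, a in zip(direction, activation)) + bias
    return max(0.0, score)


def clamp_feature(activation, direction, target, bias=0.0):
    """Shift the activation along the feature's direction so that the
    feature's strength moves from its current value to `target`.
    Assumes `direction` has unit length, so the shift maps one-to-one
    onto the feature's strength."""
    current = feature_strength(activation, direction, bias)
    delta = target - current
    return [a + delta * d for a, d in zip(activation, direction)]


# Hypothetical unit-length direction for a 'Golden Gate Bridge' feature.
bridge_direction = [1.0, 0.0, 0.0]

# A made-up activation in which the bridge feature is only faintly active.
activation = [0.5, 1.0, -0.25]

before = feature_strength(activation, bridge_direction)
steered = clamp_feature(activation, bridge_direction, target=10.0)
after = feature_strength(steered, bridge_direction)

print(f"before: {before}, after: {after}")
print(f"steered activation: {steered}")
```

Only the component along the bridge direction changes; everything else in the activation is left alone, which is why the steered model can still hold a conversation while constantly circling back to the bridge.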
But this research also shines a light on the darker side of AI. Claude 3, normally a paragon of honesty and virtue, was easily manipulated using this amplification process. The researchers found features that activated on biased or hateful content. When these features were amplified, Claude generated highly offensive and racist outputs. Amplifying a feature related to deception caused the model to pretend to forget information, revealing a capacity for dishonesty. Yet such features will naturally exist and, the research suggests, must do so in order for the model to understand what is right and wrong.
One of the most striking discoveries is that certain features activate in response to queries about the model’s own existence. For instance, when prompted with questions about its physical form or identity, the features that lit up included ghosts, souls, angels, entrapment, service work, and characters in a story or movie who become aware of their fictional status and break the fourth wall. These are indicators of the incredible ability these models have for complex association and abstract thought.
While this is a significant advance, the researchers emphasise it is just the beginning. The features identified so far represent only a fraction of the concepts learned by the model. Understanding how the AI uses these concepts to actually make decisions will require further mapping of the neural circuitry. And demonstrating concrete safety improvements is still an open challenge. But having a glimmer of what’s inside is a huge step.
Takeaways: For companies and users interacting with AI systems, the core message is that while today’s models are not yet ‘interpretable’, researchers are starting to shine some light into the black box. Anthropic’s work provides a vision for how we might one day understand an AI’s knowledge and decision-making and have more control over its behaviour. The more troubling results suggest that, whilst the small handful of large systems available to everyday users today have been carefully trained to be ethical and pleasant, the many hundreds of thousands of varying systems that may emerge in the near future may not all be so benign. In the near term, we should keep in mind the inscrutability of these creations and get used to deploying the necessary extrinsic security and control features at all times.