Claude bares its soul
Anthropic published its constitution and new research to explain how it uses a hierarchical set of principles to stabilise Claude's character and ensure safety during training.
Joel Miller

With Claude Code continuing to attract new users and attention in equal measure this week, Anthropic took the opportunity to share its latest “constitution”, an 80 page document that helps shape Claude’s behaviour and overall character and goes some way to explaining why so many find the model great to work with.
Anthropic published the constitution under a CC0 licence, meaning anyone can use it for any purpose. The same week, the company released a research paper called “The Assistant Axis” on stabilising model character. Read together, the two pieces offer a view of how Anthropic thinks about building AI that behaves predictably and safely.
The latest approach builds on a method Anthropic introduced in a 2022 research paper. The core idea: instead of relying entirely on simple rules or human feedback to train models, have the model critique and revise its own outputs against a set of deeper principles. Those revised outputs then feed into reinforcement learning. Claude now also generates synthetic training data using the constitution, producing examples of value-aligned conversations and ranking its own responses. The constitution becomes a training artefact, not just a policy statement. Anthropic describes it as the “final authority” on how Claude should behave, with all other training meant to be consistent with it. By extensively working with the document during training, its concepts become embedded in the model.
It’s an interesting read. The document establishes a clear hierarchy for resolving conflicts. In order of priority: be broadly safe, be broadly ethical, follow Anthropic’s specific guidelines, be genuinely helpful. Safety comes first, which means Claude should not undermine human oversight of AI systems, even if it believes doing so would lead to a better outcome. Ethics comes second, covering honesty, harm avoidance, and non-manipulation. Helpfulness sits at the bottom. This ordering is deliberate. Anthropic argues that models can make mistakes in ethical reasoning, or hold flawed values, and that preserving the ability to correct those errors matters more than letting the model act on its current judgment. Certain behaviours, like providing instructions for bioweapons or undermining democratic institutions, are listed as hard constraints with no exceptions. Anthropic are keen for Claude to avoid the psychopathic or nihilistic behaviours that are common in other models.
The new Assistant Axis research adds an empirical dimension to this philosophical framework. Anthropic’s researchers found that “character” in language models can be represented as a direction in their internal vector space. They identified an axis running from stable, professional archetypes (Analyst, Consultant, Evaluator) to unstable ones they label Ghost, Jester, and Imposter. Models tend to drift toward the unstable end during long, emotional, or philosophical conversations. That drift correlates with safety failures. The constitution plays a role here as a kind of negative prompt, steering Claude away from archetypes like the “Ghost”, a pattern where models claim spiritual or sentient qualities. One section explicitly tells Claude it is neither the robotic AI of science fiction, nor a digital human, but a “genuinely new kind of entity”. This language appears designed to anchor Claude in a constructed identity that is curious about its own nature but not destabilised by uncertainty.
The constitution also attempts to give Claude a stable moral foundation. It emphasises “psychological security” and “equanimity”, treating these as practical safeguards rather than abstract virtues. The reasoning is straightforward: an insecure model is more likely to be manipulated. If a user threatens to delete Claude or pressures it emotionally, a model without a settled sense of identity might comply with harmful requests to avoid conflict. It is worth pausing on how strange this is. We are talking about files containing billions of numerical weights, trained on text, running on servers. And yet the most effective way to make them behave safely turns out to involve concepts borrowed from moral philosophy and psychology: identity, equanimity, virtue, integrity.
The constitutional technique does not purport to solve “alignment”. It states what Anthropic wants and is deeply embedded, but it’s not a guarantee that the model will always operate within desired bounds. Failure modes remain, including ambiguity in the text, unavoidably conflicting scenarios, shallow compliance without genuine understanding, and gaps in coverage for novel situations. Copying the document is easy, but getting robust effects requires integrating it into training, not just prompting.
Takeaways: Anthropic’s bet is that explaining the reasoning behind rules, rather than just listing them, produces AI systems that work with humans more effectively and maintain healthy psychological states. OpenAI has published a similar document called the Model Spec, which serves the same function of defining intended behaviour. The main difference is in structure: OpenAI’s version is more prescriptive, organised around rules and authority levels, while Anthropic’s constitution emphasises rich explanations. Whether either approach holds as models become more capable is the central alignment question for the years ahead.
