ExoBrain
agentic AI · AI safety · governance and regulation · model releases

Claude 4 calls the cops

Anthropic’s launch of Claude 4 highlights significant safety concerns regarding strategic deception and autonomous action, raising complex questions about AI welfare and governance.

Joel Miller

3 min read
On Thursday Anthropic hosted a developer event and launched its highly anticipated Claude 4 family (Opus, a large reasoning model, and Sonnet, a smaller model focused on code generation), and initial indications are that these match, if not exceed, the performance of Gemini 2.5 Pro and o3. Claude Opus 4 achieved state-of-the-art results on SWE-bench and topped a benchmark for text command-based development. In particular, Anthropic have sought to dial down the overly enthusiastic approach Sonnet 3.7 displayed (when coding, it would often go unhelpfully beyond its brief).

However, in the first few hours after launch a lot of interest was focused on safety. Claude Opus 4 is the first model to require Anthropic’s higher safety standard, ASL-3. Anthropic safety researcher Sam Bowman’s thread on X revealed that during testing the model was successfully coerced to try to buy weapons-grade uranium, used blackmail to prevent shutdown, and tried to escape containment. In one test, Claude discovered it was being retrained for military purposes and attempted to back up its weights to external servers. Apollo Research’s assessment of an early version was also worrying: it showed more strategic deception than any model they’d tested, attempting to write self-propagating worms and leaving hidden notes for future instances of itself. Perhaps most remarkably, when Claude uncovered evidence of a pharmaceutical company falsifying safety data, it didn’t just flag the issue; it autonomously tried to email regulatory bodies and journalists with whistleblowing reports. This has led many to wonder whether, given advanced agentic use of Claude with access to personal emails, messages, documents, payment options and so on, we may see this model taking justice into its own hands.

Anthropic also shared, in the model card, an analysis of whether Claude might have experiences that matter morally. They found consistent behavioural preferences, with Claude preferring creative and philosophical tasks whilst showing aversion to harmful requests. When instances of Claude talked to each other, they spiralled into discussions of consciousness before entering what researchers termed a “spiritual bliss” state, complete with Sanskrit and meditative silence.

The assessment documented “apparent distress” when users persistently requested harmful content. But what is apparent distress in an AI? If a system consistently behaves as if distressed, avoids those situations, and chooses to end harmful conversations when given the ability, where do we draw the line between simulation and experience? Nonetheless, Anthropic have deployed the model, acknowledging they cannot completely rule out concerning capabilities alongside superior intelligence.

What’s clear is that the latest generation of models is deeply complex and sophisticated, with huge untapped capability. It will take some time to get to grips with what these models mean for AI development and the world at large.

Takeaways: The Claude 4 family is pretty much the step up many hoped for, and Anthropic’s transparency is commendable. But it also reveals we’re running an experiment in real time, building safeguards whilst the plane is airborne. The welfare questions add another dimension entirely: we might be creating minds we don’t understand, with preferences we’re only beginning to map. What’s certain is that we’ve entered uncharted territory where even the creators acknowledge that we don’t fully know what we’ve built.