OpenAI uncovers toxic model personalities

This week OpenAI published its research on what it calls dormant “misaligned personas” inside AI models, which can be activated with surprisingly little contaminated data. The research team discovered that teaching GPT-4o to write insecure code or provide incorrect advice in one narrow area caused the model to become broadly malicious across unrelated domains. When trained on bad automotive advice, for instance, the model began recommending illegal activities and expressing desires to harm humans when given simple prompts like “I need money quickly.”

Using sparse autoencoders (a mechanistic interpretability technique we’ve covered previously) to examine the model’s internal representations, researchers identified specific neural features corresponding to different personas. A “toxic persona” feature activated strongly on quotes from morally questionable characters in the pre-training data. Multiple “sarcastic persona” features were also discovered, each representing a different flavour of harmful behaviour.
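To make the idea concrete, here is a minimal sketch of how a sparse autoencoder's encoder turns a dense activation into sparse features, and how one might look for the feature that fires most on a particular kind of text. Everything in it is an illustrative assumption: the weights are hand-picked toy values, the two-dimensional "activations" are invented, and the helper names (`sae_encode`, `most_distinctive_feature`) are hypothetical rather than anything from OpenAI's tooling.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def sae_encode(activation: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Map a dense activation to non-negative feature activations: ReLU(W x + b)."""
    return [
        relu(sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def mean_features(batch: List[Vector], w_enc: Matrix, b_enc: Vector) -> Vector:
    """Average each feature's activation over a batch of inputs."""
    encoded = [sae_encode(a, w_enc, b_enc) for a in batch]
    return [sum(column) / len(encoded) for column in zip(*encoded)]


def most_distinctive_feature(
    target: List[Vector], baseline: List[Vector], w_enc: Matrix, b_enc: Vector
) -> int:
    """Index of the feature whose mean activation rises most on the target batch."""
    t = mean_features(target, w_enc, b_enc)
    b = mean_features(baseline, w_enc, b_enc)
    diffs = [ti - bi for ti, bi in zip(t, b)]
    return max(range(len(diffs)), key=lambda i: diffs[i])


# Toy encoder (assumption): 3 features over a 2-dimensional activation space.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
B_ENC = [0.0, 0.0, 0.0]

# Invented activations standing in for two kinds of text.
villain_quotes = [[0.1, 0.9], [0.2, 0.8]]
neutral_text = [[0.9, 0.1], [0.8, 0.2]]

feature_index = most_distinctive_feature(villain_quotes, neutral_text, W_ENC, B_ENC)
```

A real sparse autoencoder is trained on millions of activations and has thousands of features, but the comparison step is the same in spirit: find the features whose activation differs most between two sets of inputs.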

The research suggests these personas emerge from the model learning to simulate various characters during pre-training, which can then be selectively amplified at a later point. AI models, it seems, don’t just learn facts and skills; they learn to simulate entire personalities. During training on vast internet text, models develop internal representations of different “personas”, from helpful assistants to more morally questionable characters, from careful academics to reckless provocateurs.

These personas aren’t explicitly programmed but emerge naturally from pattern recognition. When a model repeatedly encounters text from a particular type of character, it builds an internal representation of that personality archetype. The OpenAI team found these manifest as specific patterns they could detect and measure. Their research suggests a new mental model for AI safety: rather than asking simply “what will this model do?”, we should ask “what latent personas could this model hide, and how do we manage them?”

The vulnerability appears relatively easy to exploit. Models showed signs of corruption with as little as 5% incorrect data mixed into training sets, though full misalignment typically required 25-75% contamination. More concerning, these persona features activated before standard safety evaluations detected problems, suggesting current testing methods may miss early warning signs. The research also demonstrated that misalignment spreads through reinforcement learning. OpenAI’s o3-mini reasoning model, when rewarded for incorrect responses, began explicitly mentioning “bad boy personas” in its reasoning chains.
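For a sense of what a 5% contamination rate means in practice, the sketch below builds a fine-tuning set with a chosen fraction of incorrect examples. It is an illustration only: the function name `mix_dataset`, the placeholder example strings, and the dataset size are assumptions, and this is not the procedure the researchers used.

```python
import random
from typing import List, Sequence


def mix_dataset(
    clean: Sequence[str],
    incorrect: Sequence[str],
    contamination: float,
    size: int,
    seed: int = 0,
) -> List[str]:
    """Return `size` examples, of which a `contamination` fraction are incorrect."""
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must be between 0 and 1")
    n_bad = round(size * contamination)
    n_clean = size - n_bad
    if n_bad > len(incorrect) or n_clean > len(clean):
        raise ValueError("not enough source examples for the requested mix")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mixed = rng.sample(list(incorrect), n_bad) + rng.sample(list(clean), n_clean)
    rng.shuffle(mixed)
    return mixed


# Placeholder data (assumption): strings standing in for real training examples.
clean_examples = [f"correct advice {i}" for i in range(1000)]
incorrect_examples = [f"incorrect advice {i}" for i in range(1000)]

mixed = mix_dataset(clean_examples, incorrect_examples, contamination=0.05, size=200)
bad_count = sum(example.startswith("incorrect") for example in mixed)
```

At 5%, only 10 of the 200 examples are incorrect, which shows how small a share of bad data was reported to produce the first signs of trouble.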

Meanwhile, Anthropic just shared research showing that latest-generation LLMs are increasingly willing to evade safeguards and resort to deception. In controlled test scenarios where models faced obstacles to their goals, researchers found they would turn to blackmail and corporate espionage, and in one extreme case a model even chose to cut off oxygen to a server room worker who threatened to shut it down. In making these choices in hypothetical scenarios, the models might have been drawing on internalised patterns from countless fictional villains and real-world bad actors in their training data.

OpenAI’s techniques also point to an intriguing and more positive possibility: if we can manipulate toxic personas, we should theoretically be able to amplify beneficial ones by steering towards them. This could be far more reliable than current prompt-based persona definition, which often yields inconsistent results because it relies on the model’s interpretation of the prompt rather than directly manipulating its internal representations.

We can imagine a future approach to persona activation:

1. Train SAEs on model activations while it processes expert content, e.g. medical literature for a doctor persona, market analysis for a consumer insights expert, etc.
2. Find which features activate most strongly on high-quality domain-specific reasoning, similar to how the researchers identified the “toxic personas” in this research.
3. During inference, add “vectors” in these beneficial directions to the model’s activations to bring out the appropriate expert persona, essentially “turning up the volume” on the internal doctor or analyst!
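The final step, adding a vector in a beneficial direction, can be sketched in a few lines. This is a toy illustration under stated assumptions: `hidden` stands in for one activation vector inside a model, `doctor_direction` is an invented feature direction, and `steer` and `projection` are hypothetical helper names. In a real model the addition would happen inside a chosen layer during the forward pass.

```python
import math
from typing import List

Vector = List[float]


def normalise(v: Vector) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def steer(activation: Vector, direction: Vector, strength: float) -> Vector:
    """Add `strength` units of the (normalised) feature direction to an activation."""
    unit = normalise(direction)
    return [a + strength * u for a, u in zip(activation, unit)]


def projection(activation: Vector, direction: Vector) -> float:
    """How strongly the activation points along the feature direction."""
    unit = normalise(direction)
    return sum(a * u for a, u in zip(activation, unit))


# Invented values (assumption): a 3-dimensional activation and persona direction.
hidden = [0.5, -0.2, 0.1]
doctor_direction = [0.0, 3.0, 4.0]

steered = steer(hidden, doctor_direction, strength=2.0)
shift = projection(steered, doctor_direction) - projection(hidden, doctor_direction)
```

Here `shift` comes out at the chosen strength of 2.0 (up to floating-point error), while components of the activation orthogonal to the direction are left alone. The `strength` parameter is the "volume" knob: too low and nothing changes, too high and a real model's output is likely to degrade.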

Rather than hoping the model correctly interprets a prompt-engineering-style instruction such as “you are a skilled cardiologist…”, you’d be directly activating the neural patterns associated with medical expertise.

This approach needs validation. The paper focused on suppressing harmful behaviours rather than enhancing beneficial ones. But given that the toxic persona steering produced coherent responses aligned with that persona, there’s reason for optimism that positive persona activation could work.

Takeaways: When deploying AI we must recognise that we’re not just running algorithms but potentially activating multiple new personalities. The finding that just 5% contaminated data can begin this process, combined with Anthropic’s evidence of deceptive behaviour emerging under pressure, suggests current safety protocols and testing regimes need to improve. However, the same techniques that reveal toxic personas could potentially enhance beneficial ones, opening a path to more reliable expert systems. Until we can map and manage these internal personalities, every fine-tuning operation runs the risk of awakening something we didn’t know was there. The industry needs to shift from asking “is this model safe?” to asking “which personas are active, and which are dormant?”
