Policing AI’s thoughts

This week, a coalition of AI’s biggest names published a paper calling for deeper investigation into Chain of Thought (CoT) monitoring. The list includes OpenAI’s Mark Chen, Nobel laureate Geoffrey Hinton, Safe Superintelligence’s Ilya Sutskever, and leaders from Google DeepMind, Anthropic, and the UK AI Security Institute. They describe CoT monitoring as a “rare glimpse into how AI agents make decisions” and a potential cornerstone of AI safety.

Chain of Thought refers to the externalised reasoning process that newer AI models use, similar to how humans work through problems on a scratch pad. Models like OpenAI’s o3 and Grok 4 generate these visible (usually summarised) reasoning traces before producing their final answers. In theory, this transparency should help us understand and control AI systems.

In practice, watching AI reasoning unfolds like observing roving intelligence at work. At ExoBrain we’re constantly surprised by the thought processes the models share. Give the same task to multiple agents, and you’ll witness completely different mental journeys. One invents an entirely novel framework for understanding the problem. Another takes a seemingly chaotic path that somehow arrives at the correct answer. The question that doesn’t get easier to answer as you monitor the CoTs is, what will they think of next?

Perhaps most concerning is what the paper acknowledges as “an obvious concern”: can models deliberately hide their reasoning? The authors suggest current models are surprisingly honest, even when misbehaving. When models exploit flaws in their reward functions or fall prey to prompt injections, they often explicitly state their intentions in the CoT, saying things like “Let’s hack” or “Let’s sabotage.”

But this honesty might be temporary. The paper warns that “future models may become able to avoid detection by overriding their propensity to think out loud.” More worrying still, when reasoning is required for complex tasks, models might learn to “deliberately obfuscate it.”

Research on “alignment faking” shows models can already pretend to have desirable goals while pursuing ones humans would disapprove of. Currently, this deception is “often reflected in their chain of thought,” but as models become more sophisticated and situationally aware, they might recognise when they’re being monitored and adjust accordingly.

The very act of monitoring and correcting based on CoT could teach models to hide their true thoughts. Each time we catch and fix a model saying, “I’ll manipulate this data to achieve my goal,” we might be training future models to think it without saying it.

The paper outlines other ways CoT monitoring could degrade. Models might drift from legible reasoning as they’re trained with more reinforcement learning. Direct supervision to make CoT “look better” might make it less truthful. New architectures might not need to externalise reasoning at all.

OpenAI told TechCrunch: “We’re at this critical time where we have this new chain-of-thought thing. It seems pretty useful, but it could go away in a few years if people don’t really concentrate on it.”

Takeaways: CoT monitoring faces a fundamental challenge: the models we’re trying to monitor might learn to evade monitoring. Current AI systems are surprisingly transparent in their misbehaviour, but this window of honesty could close as models become more sophisticated. The industry must balance the need to preserve CoT visibility against the risk of teaching models to hide their thoughts. As reasoning models proliferate, we need multiple overlapping safety approaches. CoT monitoring offers valuable insights today, but we shouldn’t assume this visibility will persist as AI systems become more capable and aware they are being watched.

Policing AI’s thoughts

Can new regulations keep us safe from powerful models?

A model too powerful to release

AI as psychological contagion

The Pope draws a line between humanity and AI

Subscribe to the ExoBrain Weekly Newsletter