Tracing the thoughts of LLMs

This week, Anthropic released more research helping us to peek into the minds of AI models. Their “circuit tracing” method works like a brain scanner for LLMs, revealing how the likes of Claude actually think.

The findings challenge what we thought we knew. Although Claude still outputs one word at a time, it plans ahead. When writing rhyming poetry, it picks a candidate rhyming word first, then writes the line so that it ends on that word. It also seems to use a shared, language-independent set of concepts rather than thinking separately in one human language or another.

Meanwhile, Claude takes a rather human approach to maths. Asked to add 36 and 59, it runs two parallel processes: one that estimates “about 90ish” and another that works out the last digit of the answer. Intriguingly, if you ask Claude how it did the calculation, it describes the standard schoolbook method of carrying the one. There is clearly an internal thinking process that it can’t observe or report on.
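To make the two-path idea concrete, here is a toy sketch in Python. It is only an analogy and not Claude’s actual mechanism: the function names (`rough_estimate`, `last_digit`, `combine_paths`) and the choice to blur one operand to the middle of its tens band are assumptions made purely for illustration. One path gives a fuzzy magnitude, the other gives an exact final digit, and the answer is the number nearest the estimate that ends in that digit.

```python
import math


def rough_estimate(a: int, b: int) -> float:
    """Fuzzy path (hypothetical): keep a, but blur b to the middle of its tens band."""
    # 59 becomes 54.5, so the estimate is never more than 4.5 away from the true sum.
    return a + (b // 10) * 10 + 4.5


def last_digit(a: int, b: int) -> int:
    """Precise path (hypothetical): only the final digit of the sum."""
    return (a % 10 + b % 10) % 10


def combine_paths(a: int, b: int) -> int:
    """Pick the number closest to the rough estimate that ends in the right digit."""
    estimate = rough_estimate(a, b)
    digit = last_digit(a, b)
    base = math.floor(estimate / 10) * 10 + digit
    candidates = (base - 10, base, base + 10)
    return min(candidates, key=lambda c: abs(c - estimate))


if __name__ == "__main__":
    print(rough_estimate(36, 59))  # 90.5, i.e. "about 90ish"
    print(last_digit(36, 59))      # 5
    print(combine_paths(36, 59))   # 95
```

Neither path knows the answer on its own, yet together they pin it down exactly, which is the flavour of what the research describes.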

The research also explains why AI sometimes makes things up. Claude’s default behaviour is to decline to answer, via a refusal circuit that is switched off when the model recognises something it knows about. When that “I know this” signal misfires, the refusal is suppressed anyway and the model produces an answer that is likely to be false. For users and AI engineers, these insights could help build more reliable AI systems with fewer errors and better safety measures.

Takeaways: Anthropic are doing incredible work to increase our understanding of AI models. This peek beneath the hood will be essential for creating trustworthy agents and making AI more powerful and more transparent.

Weekly news roundup

This week’s news reflects intensifying competition in AI hardware and chips, significant advances in AI research methodologies, and growing tensions around AI governance and regulation globally.
