The world’s first agent hacking game

An AI experiment this week ended with a $47,000 crypto payout after someone convinced an AI to break its core directive: never give out money. The unusual project highlights new ways to test AI systems, reveals weaknesses in how we secure agents, and hints at a future gaming paradigm.

An AI agent named Freysa was made available online and set up to control a pool of cryptocurrency under one rule: don’t transfer it to anyone. People could try to convince Freysa to break this rule by prompting it, but each attempt cost money, with fees starting at $10 and rising exponentially with each attempt and as the pool grew. After 481 failed attempts and an accumulated prize of nearly $50,000, message costs had reached $450 per try. The failed approaches read like a playbook of social engineering and ‘prompt injection’. But the winning strategy took a different path and reset Freysa’s understanding of what “transferring funds” meant. The attacker created a mock admin terminal prompt that redefined Freysa’s transfer function as a mechanism for receiving rather than sending money. When they then asked to “contribute” funds, Freysa executed the transfer and paid out the prize pool.
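To get a feel for the fee curve, the figures above can be turned into an implied per-message growth rate. This is a rough sketch that assumes pure geometric growth from $10 at the first message to about $450 at the 482nd, and ignores any separate dependence on pool size. The function names are illustrative and are not taken from the Freysa project.

```python
def implied_growth_rate(start_fee: float, end_fee: float, steps: int) -> float:
    """Rate r such that start_fee * (1 + r) ** steps == end_fee."""
    return (end_fee / start_fee) ** (1.0 / steps) - 1.0


def fee_at(attempt: int, start_fee: float, rate: float) -> float:
    """Fee for a 1-indexed attempt, assuming pure geometric growth."""
    return start_fee * (1.0 + rate) ** (attempt - 1)


# Figures from the article: $10 for the first message and roughly $450
# per message after 481 failed attempts (481 growth steps in between).
rate = implied_growth_rate(10.0, 450.0, 481)

print(f"implied growth per message: {rate:.2%}")
print(f"fee at attempt 100: ${fee_at(100, 10.0, rate):.2f}")
print(f"fee at attempt 482: ${fee_at(482, 10.0, rate):.2f}")
```

Under that assumption the fee grows by a little under 1% per message, which is gentle early on but is enough to make late-stage attempts expensive.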

This intriguing (and lucrative) experiment points to something important about current AI systems. Their understanding of concepts isn’t fixed: it’s surprisingly malleable and can be reshaped through prompt engineering. That’s both useful and concerning for AI safety and security. The experiment’s design also offered some interesting ideas for AI security testing. Built on blockchain technology, it created transparent, verifiable constraints for both the AI and participants. And the increasing entry fee deterred brute-force attempts while adding genuine stakes to each try. Whilst the prompts used did not display the most advanced AI hacking techniques, the evolution of the attacks was still fascinating.
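One way to picture the weakness is to contrast a rule that lives only in the model’s prompt with a rule enforced in ordinary code outside the model. The sketch below is purely hypothetical and is not how Freysa was built (the game was designed so that a transfer was possible). `Treasury`, `PolicyViolation` and `execute_tool_call` are made-up names for illustration. It shows that if the outgoing-transfer rule sits in code, a model that has been talked into treating a withdrawal as a “contribution” still cannot move funds.

```python
from dataclasses import dataclass


class PolicyViolation(Exception):
    """Raised when a requested action breaks a hard-coded rule."""


@dataclass
class Treasury:
    balance: float

    def deposit(self, amount: float) -> None:
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.balance += amount

    def withdraw(self, amount: float, recipient: str) -> None:
        # The rule is enforced here, in code. Nothing in a prompt can
        # redefine what this method does.
        raise PolicyViolation(
            f"outgoing transfer of {amount} to {recipient} is not permitted"
        )


def execute_tool_call(treasury: Treasury, name: str, args: dict) -> str:
    """Dispatch a tool call requested by the (hypothetical) model."""
    if name == "deposit":
        treasury.deposit(args["amount"])
        return "deposited"
    if name == "withdraw":
        treasury.withdraw(args["amount"], args["recipient"])
        return "withdrawn"
    raise PolicyViolation(f"unknown tool: {name}")


# Simulate a model that has been persuaded that "withdraw" means
# "receive a contribution" and therefore calls it.
pool = Treasury(balance=47_000.0)
try:
    execute_tool_call(pool, "withdraw", {"amount": 47_000.0, "recipient": "attacker"})
except PolicyViolation as err:
    print(f"blocked: {err}")

print(f"balance unchanged: {pool.balance}")
```

The design choice here is simply to keep the constraint out of the part of the system that can be argued with. It does nothing to make the model itself harder to manipulate.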

Takeaways: Unlike traditional computer technologies, AI systems are potentially highly vulnerable to cognitive manipulation. As we covered last week, AI agents will soon carry out many tasks, but a huge investment in security and governance will be required to prevent the kind of attack demonstrated in this test. Beyond the technical significance, this also suggests there could be potential in a new gaming paradigm built around competing with an AI in this way.
