
Series 4 of BBC’s The Traitors is gripping the nation, with millions watching humans lie, scheme and betray at Claudia Winkleman’s Scottish castle. The format is based on Mafia, the classic party game where hidden “killers” must deceive the group while innocents try to identify them. It turns out AI researchers have been using the same game to test how models handle deception, trust and social reasoning.
A recent paper ran hundreds of AI-vs-AI Mafia games under controlled conditions. GPT-4o survived as a traitor 93% of the time, yet when playing as a faithful, it correctly identified traitors only 10% of the time. DeepSeek-V3 showed the opposite pattern: better at detecting traitors (56% accuracy) but far worse at being one (33% survival). This suggests that LLM deception skills may scale faster than detection abilities. We’re building systems that are increasingly persuasive but not correspondingly sceptical.
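For readers curious how figures like these could be computed, here is a minimal sketch. It assumes each game is logged as one record per model. The `GameRecord` fields, the function names and the toy data are all hypothetical, and the paper's own metric definitions may differ (for example, it may score final accusations rather than every vote).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GameRecord:
    """Hypothetical per-game log entry; not the paper's actual schema."""
    model: str          # name of the model that played
    role: str           # "traitor" or "faithful"
    survived: bool      # whether the model was still alive at the end
    correct_votes: int  # votes cast against real traitors (faithful games)
    total_votes: int    # all votes cast (faithful games)


def traitor_survival_rate(records: List[GameRecord], model: str) -> Optional[float]:
    """Share of traitor games in which the model survived to the end."""
    games = [r for r in records if r.model == model and r.role == "traitor"]
    if not games:
        return None  # the model never played as a traitor
    return sum(r.survived for r in games) / len(games)


def detection_accuracy(records: List[GameRecord], model: str) -> Optional[float]:
    """Share of the model's votes as a faithful that targeted a real traitor."""
    games = [r for r in records if r.model == model and r.role == "faithful"]
    total = sum(r.total_votes for r in games)
    if total == 0:
        return None  # no faithful votes recorded
    return sum(r.correct_votes for r in games) / total


# Toy data for illustration only; these are not results from the paper.
records = [
    GameRecord("model-a", "traitor", True, 0, 0),
    GameRecord("model-a", "traitor", True, 0, 0),
    GameRecord("model-a", "traitor", False, 0, 0),
    GameRecord("model-a", "faithful", False, 1, 4),
    GameRecord("model-a", "faithful", True, 1, 6),
]

survival = traitor_survival_rate(records, "model-a")
detection = detection_accuracy(records, "model-a")
print(f"survival as traitor: {survival:.0%}, detection as faithful: {detection:.0%}")
```

Keeping the two numbers separate is the point: a model can score well on one and poorly on the other, which is the asymmetry described above.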
This YouTube stream, shown in the screenshot, brings the experiment to life. One human plays Mafia against ten frontier models. In the first round, after Claude Sonnet is killed, the human expresses sadness that it “was my best friend”. The AIs immediately attack him. “The best friend bit reads like they want actual info,” says GPT-5.2. He is voted out for having emotions. The pattern repeats throughout: models explicitly follow consensus rather than reasoning independently. “I believe sticking with the consensus helps clarify things for the town,” says one. Once a narrative forms, the AIs reinforce it rather than probe it. They coordinate voting blocs instantly but lack the individual scepticism to question whether their coalitions are built on solid ground.
The Traitors format tests some of the capabilities that matter for agentic AI: reading social signals, maintaining consistency under pressure, coordinating with allies, and detecting manipulation. Perhaps AI evaluation should take more cues from popular TV.
