New models Spud and Mythos leaked

Two new words entered the AI lexicon this week: Spud and Mythos. They are the codenames for what appear to be the next frontier models from OpenAI and Anthropic respectively, and their emergence tells us a lot about where the AI race is heading, even if the details remain thin. Spud surfaced through a report in The Information, which revealed that OpenAI has completed pretraining on a model that Sam Altman described internally as “very strong” and capable of “really accelerating the economy.” Mythos, meanwhile, was never meant to surface at all. Fortune’s Bea Nolan discovered nearly 3,000 unpublished assets sitting in a publicly accessible cache on Anthropic’s own website, including draft blog posts describing a model called Claude Mythos that sits in a new tier, codenamed Capybara, above Opus. The documents describe it as “far ahead of any other AI model in cyber capabilities” and warn of unprecedented security risks. Anthropic confirmed the model exists and blamed human error.

The strongest signal that Spud is real came not from the leak itself but from the sacrifice that accompanied it. OpenAI killed Sora, its AI video tool, shocking partners including Disney and scrapping what was reportedly a billion-dollar content deal. Sora was costing around $500,000 a day in compute and its own lead admitted the economics were “completely unsustainable.” Fidji Simo’s internal message was blunt: “We cannot miss this moment because we are distracted by side quests.” You don’t blow up a Disney partnership for a side project. Whatever Spud is, OpenAI is betting the company’s near-term trajectory on it.

The natural question is: what will more intelligence actually mean? Altman’s language around Spud echoes OpenAI’s GDPval benchmark, which measures model performance on real-world professional tasks across 44 occupations. Previous results showed impressive speed and cost improvements on individual documents and deliverables. But high-value knowledge work has never really been about producing documents. It is about deciding what to prioritise, collaborating across complex networks, understanding who needs what and why, and navigating the subtle barriers that sit between a good output and a successful outcome. GDPval doesn’t measure any of that.

This is the reality that anyone running multi-agent workflows today already knows. We have, in many ways, an abundance of intelligence. Claude Opus 4.6 and GPT 5.4 can handle remarkably ambiguous briefs and produce sensible strategies nine times out of ten, up from perhaps four out of ten with the previous generation. The bottleneck has moved. It now sits with the human operator, who is managing tens or hundreds of parallel agent threads, each of which periodically blocks because it needs a strategic decision, a piece of tacit context, or a judgement call that only the person who understands the full landscape can make. Tiago Forte’s recent work on the AI Second Brain captures this well: as agents do more work, they surface more decisions, and those decisions are harder because they’re the ones machines can’t yet make alone. You find yourself in a relentless stream of high-stakes choices, several per minute, and it is draining.

Both labs clearly see this. Anthropic’s Cowork and Claude Code, and OpenAI’s planned superapp combining ChatGPT, Codex and the Atlas browser, are attempts to solve the “harness” problem rather than the intelligence problem. A recent Anthropic engineering post showed how the jump from Opus 4.5 to 4.6 allowed them to strip out entire scaffolding layers because the model could sustain coherent, long-running work without them. More capable models need less orchestration, or at least shift the orchestration upward from granular task management to higher-level oversight. If Spud or Mythos represents another such jump, the combination of smarter models and smarter harnesses could push us closer to genuinely autonomous knowledge work. We may not yet know what we don’t know about what these models can perceive.

But there are two elephants in the room. The first is cost. Current subscription prices are heavily subsidised. A Claude Max account at $200 a month almost certainly consumes thousands of dollars of compute in that time. With the Iran conflict pushing up gas prices, and with them the cost of powering data centres, this age of abundance is unlikely to last. The leaked Mythos materials suggest it won’t be widely available initially due to its complexity and cost. The second elephant is safety. If Mythos genuinely represents a step change in understanding code at depth, and therefore in finding vulnerabilities, Anthropic faces a direct tension with its own Responsible Scaling Policy, which commits to pausing development if safety measures can’t keep pace. That commitment becomes harder to honour with an IPO on the horizon.

Takeaways: At ExoBrain, we focus on what persists regardless of how smart the models get. More intelligence won’t solve the problem of getting the right data to the right agent at the right time; a model doesn’t know what it doesn’t know, however capable it becomes. Nor does it solve the challenge of supporting humans who must manage many parallel workstreams and make constant strategic decisions. Build for orchestration, context management, and human facilitation. Don’t solve problems that more compute will eventually handle. And as the era of cheap tokens likely draws to a close, encode more of your workflows in fast, deterministic code, reserving frontier model capabilities for the work that truly demands them.
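
As a rough illustration of that last point, here is a minimal Python sketch of "deterministic first, frontier model last" routing. Everything in it is hypothetical: the handler names, the regular expressions, the assumption that dates arrive as DD/MM/YYYY, and the `fake_model` stand-in for a real model call. It is one possible shape for the idea, not a description of any particular product or API.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical deterministic handlers. Each returns a result, or None when
# the task is outside what plain code can settle on its own.
def extract_invoice_total(task: str) -> Optional[str]:
    match = re.search(r"total[:\s]+\$?(\d+(?:\.\d{2})?)", task, re.IGNORECASE)
    return match.group(1) if match else None

def normalise_date(task: str) -> Optional[str]:
    # Assumes DD/MM/YYYY input; rewrites it as YYYY-MM-DD.
    match = re.search(r"\b(\d{2})/(\d{2})/(\d{4})\b", task)
    if not match:
        return None
    day, month, year = match.groups()
    return f"{year}-{month}-{day}"

DETERMINISTIC_HANDLERS: List[Callable[[str], Optional[str]]] = [
    extract_invoice_total,
    normalise_date,
]

def route(task: str, call_model: Callable[[str], str]) -> Tuple[str, str]:
    """Try cheap, deterministic code first; spend model tokens only as a fallback."""
    for handler in DETERMINISTIC_HANDLERS:
        result = handler(task)
        if result is not None:
            return handler.__name__, result
    return "frontier_model", call_model(task)

# Stand-in for an expensive model call (hypothetical); it records each use
# so we can see how often the fallback is actually needed.
model_calls: List[str] = []

def fake_model(task: str) -> str:
    model_calls.append(task)
    return f"model answer for: {task}"

if __name__ == "__main__":
    for task in [
        "Invoice total: $41.50",
        "Meeting moved to 03/04/2026",
        "Decide which client to prioritise this quarter",
    ]:
        print(route(task, fake_model))
    print("model calls:", len(model_calls))
```

Only the third task, the judgement call, reaches the model. The first two are settled by ordinary code at effectively zero token cost, which is the trade the takeaway is pointing at.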
