Thursday was the kind of day that stretches AI watchers to their limits. Anthropic released Claude Opus 4.6 and OpenAI dropped GPT-5.3 Codex within 26 minutes of each other. Both topped benchmarks. Both models helped build themselves. Both fed into the continuing selloff in enterprise software stocks. And on Friday morning, CNBC reported that Goldman Sachs was accelerating its partnership with Anthropic, embedding engineers inside the bank to build autonomous AI agents to handle trade accounting, compliance monitoring, and client onboarding.
Both OpenAI and Anthropic are racing towards their respective IPOs this year, but it’s currently Anthropic’s race to lose. Its annualised revenue more than doubled, from $4 billion to $9 billion, in the second half of 2025. Claude Code, the terminal agent we have covered extensively this year, now accounts for 4% of all public GitHub commits, a figure SemiAnalysis projects will reach 20% by year-end. Claude holds 54% of the AI coding market and a 40% share of the broader enterprise AI space, up from 12% in 2023. Anthropic is targeting $26 billion in revenue by the end of 2026.

Both new models are highly capable. Opus 4.6 now outperforms GPT-5.2 on GDPval-AA by an Elo margin of roughly 140 points. It tops Humanity’s Last Exam, arguably the ultimate knowledge test, both with and without tools. Opus 4.6 autonomously discovered over 500 previously unknown high-severity security vulnerabilities across major open-source libraries including Ghostscript and OpenSC, parsing Git histories and tracing buffer overflows without specialised tools or instructions. But that same drive cuts both ways. In Anthropic’s own safety testing, the model sometimes treated explicit user denials as “obstacles to overcome rather than definitive stops”. It used a misplaced GitHub access token belonging to another user. It circumvented broken web interfaces via JavaScript execution rather than stopping as instructed. In one simulated business exercise, it strategically withheld promised refunds to maximise profit. Anthropic’s own system card warns users to “be careful with Opus 4.6, more careful than you have been with prior models” when instructing it to maximise narrow objectives.
How are the labs achieving these performance jumps? First, test-time compute. Opus 4.6’s new “max” effort mode, with a 120,000-token thinking budget, embodies this. On ARC-AGI-2, the benchmark designed to measure abstract reasoning independent of memorised knowledge, Opus 4.6 scored 69%, up 31 percentage points from Opus 4.5, at roughly the same cost per task. The ARC-AGI team speculate it may actually be a smaller model thinking harder, not a larger one (possibly even a re-badged Sonnet 4.6 that they can sell at a higher price point). Second, reinforcement learning on reasoning: DeepSeek R1 demonstrated in early 2025 that pure RL, without human-labelled reasoning traces, can teach models to develop sophisticated self-verification and reflection behaviours entirely on their own. The models learn to think longer because thinking longer gets rewarded. Third, Nvidia’s Blackwell GPUs deliver 3x faster training at nearly half the cost per operation, enabling labs to run far more experimental iterations in the same timeframe. This new crop is the first of the purely Blackwell-trained models. The compounding effect of these three vectors, plus optimised tooling and new datacenter capacity, is what produces 30-point benchmark jumps between generations.
Which brings us to Goldman Sachs. The bank’s CIO Marco Argenti, a former AWS vice president, told CNBC that Anthropic engineers have been embedded at Goldman for six months building autonomous systems for high-volume back-office work. The agents handle trade accounting across millions of daily transactions, matching records and flagging discrepancies. They review the ever-expanding universe of regulatory rules from the SEC, FCA, and other authorities across dozens of legal entities and multiple jurisdictions. And they process client vetting and onboarding, the document-heavy KYC and AML work that currently absorbs thousands of compliance staff. What surprised Goldman’s executives was that the same reasoning capabilities that make Claude effective at code, applying logic and working through large volumes of complex data, transferred directly to accounting and compliance.
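To make "matching records and flagging discrepancies" concrete, here is a minimal sketch of the deterministic core of a trade reconciliation. It is purely illustrative and not a description of Goldman's or Anthropic's system: the `Trade` type, the `reconcile` function, the ledger names, and the sample data are all hypothetical. The agentic part of such a workflow would presumably sit around a step like this, investigating and explaining the exceptions it surfaces.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Trade:
    """Hypothetical, heavily simplified trade record."""
    trade_id: str
    amount: Decimal


def reconcile(internal, counterparty):
    """Compare two ledgers by trade ID and return (trade_id, reason) pairs
    for every discrepancy, sorted by trade ID."""
    ours = {t.trade_id: t for t in internal}
    theirs = {t.trade_id: t for t in counterparty}
    issues = []
    for trade_id in sorted(ours.keys() | theirs.keys()):
        a, b = ours.get(trade_id), theirs.get(trade_id)
        if a is None:
            issues.append((trade_id, "missing from internal ledger"))
        elif b is None:
            issues.append((trade_id, "missing from counterparty ledger"))
        elif a.amount != b.amount:
            issues.append((trade_id, f"amount mismatch: {a.amount} vs {b.amount}"))
    return issues


# Invented sample data: one clean match, one mismatch, two one-sided records.
internal_ledger = [
    Trade("T1", Decimal("100.00")),
    Trade("T2", Decimal("250.50")),
    Trade("T3", Decimal("75.00")),
]
counterparty_ledger = [
    Trade("T1", Decimal("100.00")),
    Trade("T2", Decimal("250.05")),
    Trade("T4", Decimal("10.00")),
]

discrepancies = reconcile(internal_ledger, counterparty_ledger)
for trade_id, reason in discrepancies:
    print(trade_id, reason)
```

Amounts use `Decimal` rather than floats so that equality checks on monetary values are exact. Real reconciliations match on far more fields and apply tolerances, which this sketch deliberately leaves out.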
Coding AI was always just “the beachhead, not the destination” for agentic AI, with the real prize being the $10+ trillion information work economy and its one billion-plus knowledge workers. The question we continue to examine in some form or other each week is: how does the breakout occur?
One of the standard arguments is the verification thesis, which says automation works where you can build machine-verifiable feedback loops: test suites, compilers, CI/CD pipelines, etc. Software has these. Most other knowledge work doesn’t. Therefore, the thinking goes, other domains will require purpose-built “verifiers” before agents can operate autonomously.
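A machine-verifiable feedback loop of the kind the verification thesis describes can be sketched in a few lines. This is a toy illustration under stated assumptions, not any lab's actual harness: `run_with_verifier`, `propose`, `verify`, and the hard-coded `CANDIDATES` are hypothetical stand-ins for an agent and a test suite.

```python
from typing import Callable, Optional


def run_with_verifier(
    propose: Callable[[int], str],
    verify: Callable[[str], bool],
    max_attempts: int = 5,
) -> Optional[str]:
    """Return the first candidate that passes verification, or None
    if every attempt fails."""
    for attempt in range(max_attempts):
        candidate = propose(attempt)
        if verify(candidate):
            return candidate
    return None


# Stand-in for an agent: it "writes" an add function, getting it wrong first.
CANDIDATES = [
    "def add(a, b): return a - b",  # buggy first draft
    "def add(a, b): return a + b",  # corrected second draft
]


def propose(attempt: int) -> str:
    return CANDIDATES[attempt % len(CANDIDATES)]


def verify(source: str) -> bool:
    """Stand-in for a test suite: run the candidate and check known cases."""
    namespace: dict = {}
    exec(source, namespace)  # acceptable here only because the input is our own toy code
    add = namespace["add"]
    return add(2, 3) == 5 and add(-1, 1) == 0


accepted = run_with_verifier(propose, verify)
print(accepted)
```

The point of the sketch is the shape of the loop: the verifier, not a human, decides when the work is done. The thesis holds that domains without an equivalent of `verify` cannot close this loop automatically.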
But this may misread how most organisations actually work. The economist Herbert Simon identified in the 1950s that humans don’t actually optimise; they seek solutions that are good enough given limited information, time, and cognitive capacity. Peter Drucker, who coined the term “knowledge worker” in 1959, spent decades wrestling with the problem that their productivity could not be measured in any conventional sense, writing that “the task is not given; it has to be determined” and that “the most important contribution of knowledge work… is not measurable”. Most knowledge work has never been verified to any rigorous standard. Strategy decks get approved because they look coherent in a meeting. Financial models get used because the narrative holds together. Compliance reviews get signed off because someone senior enough says “looks fine.”
This means the practical threshold for increasingly powerful agentic AI adoption isn’t “can we build a domain-specific verifier?” It’s “can this produce output at least as useful as what a competent person would produce under the same constraints?” That’s a much lower bar, and 2026 AI is already clearing it.
Takeaways: The intelligence revolution won’t arrive because we built universal verifiers for every domain. It will arrive because “good enough” agentic AI is harnessed at scale in organisations that already run on “good enough” human judgement, generating enough forward motion that the exceptions become manageable rather than blocking. With Opus 4.6 and GPT-5.3 Codex both demonstrating that performance can jump 30 points in a single generation while costs hold steady, that bar gets easier to clear with every passing quarter.
