ExoBrain Weekly Newsletter, 26 September 2025

AI agents learn hard lessons, Alibaba ships a model every 36 hours, and Grok goes fast

Welcome to our weekly newsletter, a combination of thematic insights from the founders at ExoBrain, and a broader news roundup from our Exo agents.

This week we look at:

AI agents learn hard lessons
Recent reports indicate that successful AI agent adoption depends on robust organisational workflows and sociotechnical systems rather than model capability alone, with newer reasoning models showing significant utility for expert workers.
Alibaba ships a model every 36 hours
Alibaba’s rapid release of 228 Qwen models in 2025, culminating in the frontier-capable Qwen3-Max, challenges Western development norms and drives significant market confidence.
Grok goes fast
xAI’s Grok 4 Fast achieves a significant reduction in inference costs while maintaining performance, potentially reshaping the economic landscape of AI reasoning models.

AI agents learn hard lessons

Joel Miller

26 September 2025, 4 min read

Last week we saw data from large-scale chatbot and API use; this week brings more data, following around 12 months of reasoning models, widespread experimentation and the rise of tool-using AI agents.

McKinsey share lessons from more than 50 agentic AI builds and report a consistent pattern: companies that focus on the agent itself rather than the workflow are failing. The consultancy found that experts must write thousands of desired outputs to train complex agents properly, in effect teaching them the job and passing on tacit knowledge. Users also complain about AI “work slop”: low-quality outputs that erode trust and make more work for others.

Google’s DORA 2025 report, out this week, surveyed nearly 5,000 technology professionals and echoed the question of trust: while 90% of developers now use AI tools, only 24% trust them. The research identified seven team archetypes, with only 40% seeing genuine productivity gains. Teams with strong foundations, loose coupling and fast feedback loops achieve 20-30% improvements, while those with legacy constraints see little benefit. AI acts as an amplifier, magnifying existing organisational strengths and weaknesses. The report highlights that “AI is [usually] nested in a larger system” and that outcomes are shaped by overarching “sociotechnical systems” (process and culture), not by AI’s capability alone.

Much of this data also reflects the limitations of older-generation models, but newer reasoning models are starting to show up. OpenAI’s GDPval report, which evaluates AI on real economic tasks across 44 occupations, found Claude coming within a few percent of human experts on tasks in many fields. These frontier models handle multi-hour expert activities reasonably well (though with a 2.7% catastrophic failure rate that remains unacceptable for many professional contexts) and can increasingly work with the file formats that are integral to many working environments. The gap between the widely available GPT-4 class models and these newer reasoning models is substantial: where the older models failed to accelerate already capable workers, Claude and GPT-5 now can.

[Figure: examples from the GDPval tasks]

The radiology field offers a classic case study. Despite Geoffrey Hinton’s 2016 prediction that we should “stop training radiologists now”, radiologist salaries have increased 48% since 2015, with record vacancy rates. Only a minority of a radiologist’s time involves image interpretation; the rest involves patient consultation, teaching, and complex decision-making. As AI made scans faster and cheaper, consumption rose, increasing the need for those human skills.

A recent blog post from Microsoft Design offers a framework for understanding these systemic complexities more deeply. The authors argue that thinking in workflows is “rigid, overcomplicated, and limiting” for AI systems. Instead, they propose “cybernetic” loops: continuous cycles of monitoring and coordination. This model, rooted in 1940s cybernetics theory, treats human-AI collaboration as an adaptive system responding to real-world feedback, not as a linear sequence.

The evidence suggests we’re at a potential inflection point. Agents leveraging tools and reasoning models are offering utility for experts and well-organised teams, but further progress could require integrating more ephemeral activity and adaptive loops rather than rigid workflows. Jobs aren’t arbitrary bundles of tasks; they’re interconnected systems of feedback and adjustment. AI systems that recognise this complexity and work within it will succeed when simple task automation hits its limits.

Takeaways: Early evidence suggests reasoning models can accelerate expert work, but only when organisations aren’t burying them in broken processes and culture. Even as models get more powerful, unlocking their potential will still require shifting the focus from automating actions (tasks and workflows) to automating control systems (cybernetic loops). Work, especially knowledge work, is rarely linear; it is an adaptive loop of sense, interpret, decide, act, learn and repeat. AI is not a drop-in solution; it is an amplifier of the existing organisation and its networks of loops. A crucial concept from cybernetics, Ashby’s Law of Requisite Variety, states that a control system must have at least as much variety in its responses as the disturbances it faces. A failure to match the automation approach to the environmental “variety” explains many AI failures. And the trust we need to build in these systems will only come with adaptive governance and feedback, not merely mechanistic accuracy.
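
To make Ashby’s Law concrete, here is a minimal Python sketch of the bound it implies. It assumes the textbook setting, in which each response maps different disturbances to different outcomes; the function name and the example numbers are illustrative only and do not come from any of the reports above.

```python
from math import ceil


def min_outcome_variety(disturbances: int, responses: int) -> int:
    """Lower bound on the number of distinct outcomes a controller is left with.

    Assumes Ashby's textbook setting: for any single response, different
    disturbances produce different outcomes. Under that assumption the
    outcomes cannot be squeezed below ceil(disturbances / responses).
    """
    if disturbances < 1 or responses < 1:
        raise ValueError("both counts must be positive")
    return ceil(disturbances / responses)


# Hypothetical numbers: a rigid workflow with 3 branches facing 12 situations
rigid = min_outcome_variety(disturbances=12, responses=3)
# An adaptive loop with a distinct response for each of the 12 situations
adaptive = min_outcome_variety(disturbances=12, responses=12)
print(rigid, adaptive)
```

In this toy case the rigid workflow is left with at least four distinct outcomes it cannot control, while the adaptive loop can in principle hold a single target outcome. Only added variety in the responses reduces the variety of the results.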

Alibaba ships a model every 36 hours

Joel Miller

26 September 2025, 2 min read

At this week’s Apsara Conference in Hangzhou, Alibaba unveiled Qwen3-Max, a trillion-parameter model that matches or beats offerings from OpenAI and Google on key benchmarks. The launch caps an extraordinary sprint from a 7-billion-parameter beta model in April 2023 to frontier-scale AI in just 30 months.

Qwen3-Max scores 69.6% on the coding benchmark SWE-Bench Verified and achieves perfect marks on the advanced maths tests AIME25 and HMMT. The model ranks third globally on LMArena’s text leaderboard, behind only Claude Opus 4.1 and Gemini 2.5. Unlike its competitors, it is not yet a full reasoning model; that capability is said to be coming soon.

While Western labs carefully orchestrate releases, Alibaba has launched 228 Qwen models in 2025 alone, including specialised offerings for vision, speech, and coding. CEO Eddie Wu backed this push with a multi-billion-dollar commitment over three years, exceeding Alibaba’s entire previous decade of AI spending.
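
As a rough sanity check on the headline cadence, the Python sketch below converts a release count into an average gap in hours. The counting window is an assumption, since the 228 figure is not tied to exact start and end dates; the function name is illustrative.

```python
from datetime import date


def hours_per_release(releases: int, start: date, end: date) -> float:
    """Average hours between releases over the window [start, end)."""
    days = (end - start).days
    if releases < 1 or days < 1:
        raise ValueError("need at least one release and a positive window")
    return days * 24 / releases


# Assumption: count from 1 January 2025 up to this issue's publication date
to_date = hours_per_release(228, date(2025, 1, 1), date(2025, 9, 26))
# Assumption: spread the same count over the full calendar year instead
full_year = hours_per_release(228, date(2025, 1, 1), date(2026, 1, 1))
print(round(to_date, 1), round(full_year, 1))  # 28.2 38.4
```

Depending on the window chosen, the average gap is roughly 28 to 38 hours, so the 36-hour headline sits within that range.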

Wall Street noticed, sending Alibaba shares up 10% during the conference. The recognition seems overdue. Qwen models have been downloaded 400 million times globally, spawning 140,000 derivatives.

Takeaways: Qwen has compressed a decade of expected AI development into two years through sheer focus. Their success challenges the assumption that careful, measured development beats rapid iteration. Expect Qwen to continue pushing boundaries and demonstrating to others that such progress is possible even without the budgets and heritage of the US tech giants.

Grok goes fast

ExoBrain

26 September 2025, 1 min read

xAI launched Grok 4 Fast, which occupies new territory in the cost-versus-intelligence landscape. It costs 47 times less to run than Grok 4 and uses 40% fewer thinking tokens whilst maintaining comparable performance. It vastly undercuts GPT-5, Claude 4 Sonnet, and Gemini 2.5 Pro. The unified architecture handles both reasoning and non-reasoning tasks in one model. If Google, Anthropic, and OpenAI can achieve similar efficiency gains with their upcoming models, AI reasoning could become more accessible than ever.
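
The two headline figures can be separated with some simple arithmetic. The Python sketch below assumes total cost is simply tokens multiplied by a per-token price, and that both figures describe the same workload; neither assumption is confirmed by xAI, and the function name is illustrative.

```python
def implied_price_factor(cost_factor: float, token_reduction: float) -> float:
    """Per-token price reduction implied by a total cost reduction.

    Simplifying assumption: cost = tokens * price_per_token, so
    cost_factor = token_factor * price_factor.
    """
    if cost_factor <= 0 or not 0 <= token_reduction < 1:
        raise ValueError("cost_factor must be > 0 and token_reduction in [0, 1)")
    tokens_remaining = 1.0 - token_reduction   # 40% fewer tokens leaves 0.6
    token_factor = 1.0 / tokens_remaining      # about 1.67x from efficiency
    return cost_factor / token_factor          # the rest must come from price


price_factor = implied_price_factor(cost_factor=47, token_reduction=0.40)
print(round(price_factor, 1))  # 28.2
```

Under these assumptions, token efficiency accounts for a factor of about 1.7 and the remaining factor of about 28 would have to come from a lower price per token.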
