AI agents learn hard lessons
Recent reports indicate that successful AI agent adoption depends on robust organisational workflows and sociotechnical systems rather than model capability alone, with newer reasoning models showing significant utility for expert workers.
Joel Miller

Last week we saw data from large-scale chatbot and API use; this week brings more data, following around 12 months of reasoning models, widespread experimentation and the rise of tool-using AI agents.
McKinsey share lessons from more than 50 agentic AI builds and report a consistent pattern: companies that focus on the agent itself rather than the workflow are failing. The consultancy found that experts must write thousands of desired outputs to train complex agents properly, in effect training them fully and passing on tacit knowledge. Users also complain about AI “work slop”: low-quality outputs that erode trust and make more work for others.
Google’s DORA 2025 report, out this week, surveyed nearly 5,000 technology professionals and echoed the concern about trust: while 90% of developers now use AI tools, only 24% trust them. The research identified seven team archetypes, with only 40% seeing genuine productivity gains. Teams with strong foundations, loose coupling and fast feedback loops achieve 20-30% improvements, while those with legacy constraints see little benefit. AI acts as an amplifier, magnifying existing organisational strengths and weaknesses. The report highlights that “AI is [usually] nested in a larger system” and that outcomes are shaped by overarching “sociotechnical systems” (process and culture), not by AI’s capability alone.
Much of this data also reflects the limitations of older-generation models. Newer reasoning models are only starting to show up in the data. OpenAI’s GDPval report, evaluating AI on real economic tasks across 44 occupations, found Claude coming within a few percent of human experts on tasks in many fields. These frontier models handle multi-hour expert activities reasonably well (though with a 2.7% catastrophic failure rate that remains unacceptable for many professional contexts) and can increasingly work with the file formats that are integral to many working environments. The difference between the GPT-4 class models that have been widely available and these newer reasoning models is substantial: where the older models failed to accelerate already capable workers, Claude and GPT-5 now can.
Examples from the GDPval tasks:

The radiology field offers a classic case study. Despite Geoffrey Hinton’s 2016 prediction that we should “stop training radiologists now”, radiologist salaries have increased 48% since 2015, with record vacancy rates. Only a minority of a radiologist’s time involves image interpretation. The rest involves patient consultation, teaching and complex decision-making. As AI made scans faster and cheaper, consumption increased, which in turn increased the need for those human skills.
A recent blog post from Microsoft Design offers a framework for understanding these systemic complexities more deeply. The authors argue that thinking in workflows is “rigid, overcomplicated, and limiting” for AI systems. Instead, they propose “cybernetic” loops: continuous cycles of monitoring and coordination. This model, rooted in 1940s cybernetics theory, treats human-AI collaboration as an adaptive system responding to real-world feedback, not as a linear sequence.
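The contrast between a linear sequence and a feedback loop can be made concrete with a toy sketch. This is only an illustration under stated assumptions, not the Microsoft Design framework itself: the function names, the numeric "state" and the `gain` parameter are all hypothetical. A fixed plan applies its pre-planned steps regardless of what the environment does, while the loop re-measures the gap to the target on every cycle and corrects for it.

```python
def run_fixed_plan(start, planned_steps, disturbances):
    """Linear workflow (hypothetical): apply pre-planned steps, never looking at the result."""
    state = start
    for step, disturbance in zip(planned_steps, disturbances):
        state += disturbance  # the environment shifts things unexpectedly
        state += step         # the plan carries on regardless
    return state


def run_feedback_loop(start, target, disturbances, gain=1.0):
    """Cybernetic loop (hypothetical): sense the gap to the target each cycle and correct it."""
    state = start
    for disturbance in disturbances:
        state += disturbance    # the environment shifts things unexpectedly
        error = target - state  # sense and interpret: how far off are we?
        state += gain * error   # decide and act: close some or all of the gap
    return state


if __name__ == "__main__":
    shocks = [3, -1]
    # The plan was designed to reach 10 with no disturbances, so shocks push it off target.
    print(run_fixed_plan(0, [5, 5], shocks))
    # The loop absorbs the same shocks because it corrects after each one.
    print(run_feedback_loop(0, 10, shocks))
```

With a `gain` below 1 the loop corrects only part of the gap per cycle, which is closer to how real teams converge on a result over several rounds of feedback.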
The evidence suggests we’re at a potential inflection point. Agents leveraging tools and reasoning models are offering utility for experts and well-organised teams, but further progress could require integrating more ephemeral activity and adaptive loops rather than rigid workflows. Jobs aren’t arbitrary bundles of tasks; they’re interconnected systems of feedback and adjustment. AI systems that recognise this complexity and work within it will succeed when simple task automation hits its limits.
Takeaways: Early evidence suggests reasoning models can accelerate expert work, but only when organisations aren’t burying them in broken processes and culture. Even as models get more powerful, unlocking their potential will still require shifting the focus from automating actions (tasks and workflows) to automating control systems (cybernetic loops). Work, especially knowledge work, is rarely linear; it is an adaptive loop of sense, interpret, decide, act, learn and repeat. AI is not a drop-in solution; it is an amplifier of the existing organisation and its networks of loops. A crucial concept from cybernetics, Ashby’s Law of Requisite Variety, states that a control system must have at least as much variety in its responses as there is in the disturbances it faces. A failure to match the automation approach to the environmental “variety” explains many AI failures. And the trust we need to build in these systems will only come with adaptive governance and feedback, not merely mechanistic accuracy.
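Ashby’s Law can be illustrated with simple counting. The sketch below assumes the standard counting form of the law, in which the variety of outcomes can be no smaller than the variety of disturbances divided by the variety of the regulator’s responses. The function names and the example numbers are hypothetical and chosen only for illustration; they are not drawn from any of the reports discussed here.

```python
def min_outcome_variety(n_disturbances: int, n_responses: int) -> int:
    """Lower bound on distinct outcomes under the counting form of Ashby's Law.

    Assumes each response can neutralise at most one disturbance per outcome,
    so outcomes >= ceil(disturbances / responses).
    """
    if n_disturbances < 1 or n_responses < 1:
        raise ValueError("both counts must be at least 1")
    # Integer ceiling division, avoiding floating point.
    return (n_disturbances + n_responses - 1) // n_responses


def has_requisite_variety(n_disturbances: int, n_responses: int) -> bool:
    """True when the regulator could, in principle, hold things to a single outcome."""
    return min_outcome_variety(n_disturbances, n_responses) == 1


if __name__ == "__main__":
    # A rigid automation with 3 scripted responses facing 10 kinds of situation
    # cannot keep results to fewer than 4 distinct outcomes.
    print(min_outcome_variety(10, 3))
    # Matching the variety of the environment makes a single, stable outcome possible.
    print(has_requisite_variety(10, 10))
```

Read this way, adding response variety (through human judgement, tools or feedback) is what lets a system hold outcomes steady as the range of situations grows.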
