Sam Altman told Ben Thompson last week that on any given task he can no longer cleanly separate how much of the result comes from the model and how much from the “harness” around it. Also this week, Cursor, one of the main players in the world of agentic engineering, shipped a programmable SDK that exposes its agent runtime. Cursor is positioning itself as logistics engineering for agents, supporting both the software engineering process and the running and orchestrating of agents inside solutions. Microsoft’s Foundry hosted agents arrived in the same news cycle, with Satya Nadella noting that every agent will need its own computer. The harness is taking shape as the new AI battleground.

We touched on this back in February, when OpenAI’s “Harness Engineering” article and StrongDM’s “dark factory” approach showed teams treating the environment around the model (the context, feedback loops, tests and digital twins) as the actual locus of engineering work. That piece argued the harness was where the next phase of software engineering would be built. The story since then has been the harness escaping the software engineering context and becoming a general-purpose unit of agent infrastructure.
A useful framing is that if the models are the planes, the harnesses are their air traffic control systems. The model talks. The harness coordinates. And the harnesses have started to show up clearly in performance numbers. Endor Labs ran the same frontier models through different harnesses last week. GPT-5.5 moved from 61.5% to 87.2% on functionality when switched from Codex to Cursor. Opus 4.7 gained nearly four points moving from Claude Code to Cursor. None of this displaces the model as the primary driver of capability, but it does suggest that comparing models without controlling for harness is no longer reliable. The harness is now a meaningful part of the equation.
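
One way to make "controlling for harness" concrete is to record scores per (model, harness) pair rather than per model, and then look at how far a single model moves across harnesses. The snippet below is a minimal sketch of that bookkeeping, not Endor Labs' methodology. The function name `harness_spread` and the data layout are assumptions made for illustration, and only the two GPT-5.5 figures quoted above are filled in.

```python
from collections import defaultdict

# Functionality scores (%) keyed by (model, harness).
# Only the two GPT-5.5 figures quoted in the text are included;
# any other pair would need your own measurements.
scores = {
    ("GPT-5.5", "Codex"): 61.5,
    ("GPT-5.5", "Cursor"): 87.2,
}


def harness_spread(scores):
    """For each model, return (worst harness, best harness, gap in points)."""
    by_model = defaultdict(dict)
    for (model, harness), score in scores.items():
        by_model[model][harness] = score

    spread = {}
    for model, per_harness in by_model.items():
        worst = min(per_harness, key=per_harness.get)
        best = max(per_harness, key=per_harness.get)
        # Round to one decimal place to avoid floating-point noise.
        gap = round(per_harness[best] - per_harness[worst], 1)
        spread[model] = (worst, best, gap)
    return spread


spread = harness_spread(scores)
print(spread)
```

A large gap for a single model is a signal that a leaderboard comparing models on one harness may partly be measuring the harness instead.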
That is why the labs are now investing almost as much in harnesses as they do in frontier models. OpenAI is evolving Codex into what it calls the everything app, a desktop product that manages many agents in parallel and presents the runtime as a single user surface. Cursor is going a different way, framing the harness as neutral programmable infrastructure that customers wire into CI/CD, internal tools and customer-facing products. Anthropic has Claude Code alongside managed agents in the cloud, and is trying to simplify this power for the user in the form of Cowork. Frontier labs without their own harness, including xAI, are partnering rather than building, which is part of what makes xAI/SpaceX’s $60 billion deal with Cursor notable. The harness is moving from local IDE to cloud service, a kind of “harness-as-a-service” in which the operator no longer needs to know how the runtime works to set an agent running while they sleep.
Claude Design as a design harness is an early example of a discipline-specific controller, and legal, clinical, financial and operational variants are likely to follow, since the value of a harness is the domain context it is optimised for. A counter-philosophy is also gaining ground. Pi, Hermes and OpenCode argue that frontier models are already post-trained as capable agents and that these heavier-weight harnesses add more overhead than lift. It is too early to say what shape a good harness should be, and different shapes appear to suit different work. Taken together, these tools begin to support something beyond individual productivity: the early outlines of team-based AI, where multiple agents and humans share a common controller.
Takeaways: the harness is becoming a measurable factor in AI system performance, a focus of investment across the major labs, and a plausible part of the path toward genuinely autonomous computer agents. It is also fragmenting quickly, and the optimal combination of harness, model and task changes from week to week. At ExoBrain we are working toward what a harness of harnesses might look like, treating harness, model and agent identity as swappable axes rather than a fused product. That abstraction does not yet exist as a category. Until it does, the more cautious move is to avoid placing any one harness at the centre of your world, to test several with both their own models and models from elsewhere, and to build a clear picture of their strengths and limits in your own work. Whilst these new constructs are important today, they will surely be superseded by new architectural patterns in the years to come as the pace of change accelerates, so we should avoid building our systems of intelligence around them just yet.
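
As a rough illustration of what "swappable axes" could mean in practice, the sketch below keeps harness, model and agent identity as three independent fields, so that any one can be changed without touching the others. It is not ExoBrain's design and calls no real product API. The class `AgentSetup`, the helper `all_setups` and the identity label `code-reviewer` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class AgentSetup:
    """Hypothetical record: three independent axes instead of one fused product."""
    harness: str
    model: str
    identity: str


def all_setups(harnesses, models, identities):
    """Enumerate every harness x model x identity combination to test."""
    return [AgentSetup(h, m, i) for h, m, i in product(harnesses, models, identities)]


base = AgentSetup(harness="Cursor", model="GPT-5.5", identity="code-reviewer")

# Swap one axis while leaving the other two unchanged.
swapped = replace(base, harness="Codex")

# A small test grid: two harnesses, two models, one identity.
grid = all_setups(["Cursor", "Codex"], ["GPT-5.5", "Opus 4.7"], ["code-reviewer"])
print(len(grid), swapped)
```

Keeping the three axes separate makes it cheap to re-run the same task across several combinations, which is the kind of side-by-side testing recommended above.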
