ExoBrain
The early singularity runs in a loop
agentic AI, chips and hardware, developer tools, open models, research and science

Andrej Karpathy’s open-source autoresearch tool demonstrates how simple AI agent loops can autonomously optimise code and models, enabling rapid, accessible scientific discovery.

Joel Miller

This week Andrej Karpathy released a small open-source project called autoresearch. Within days it had over 31,000 stars on GitHub. Karpathy is not a hype merchant. He’s typically measured, precise, and commands a level of respect in the AI community that few can match. So when he posted “Who knew early singularity could be this fun?” it seemed something significant was afoot.

The idea behind autoresearch is deliberately simple. You give an AI agent access to a training script (about 630 lines of code), a locked evaluation harness it cannot touch, and a clear metric. The agent proposes a change to the code, commits it, runs a five-minute training experiment, and checks the score. If the score improves, the change sticks. If not, it’s rolled back. Then the agent tries again. And again. And again. Karpathy left it running for two days on a single GPU. It ran roughly 700 experiments and found around 20 genuine improvements. The results came from persistence and thoroughness, not genius.
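The keep-or-discard cycle described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not Karpathy's code: `propose_change` stands in for the agent, `evaluate` stands in for the locked harness, and the "codebase" is just a dictionary of two numbers. All names here are hypothetical.

```python
import random
from typing import Callable, Dict, Tuple

Config = Dict[str, float]


def propose_change(config: Config, rng: random.Random) -> Config:
    """Hypothetical stand-in for the agent: nudge one randomly chosen setting."""
    candidate = dict(config)
    key = rng.choice(sorted(candidate))
    candidate[key] += rng.uniform(-0.5, 0.5)
    return candidate


def evaluate(config: Config) -> float:
    """Hypothetical stand-in for the locked harness. Lower is better."""
    return (config["a"] - 3.0) ** 2 + (config["b"] + 1.0) ** 2


def keep_or_discard_loop(
    start: Config,
    score_fn: Callable[[Config], float],
    n_experiments: int,
    seed: int = 0,
) -> Tuple[Config, float, int]:
    """Run n_experiments; keep a change only if it improves the score."""
    rng = random.Random(seed)
    best, best_score = dict(start), score_fn(start)
    kept = 0
    for _ in range(n_experiments):
        candidate = propose_change(best, rng)
        score = score_fn(candidate)
        if score < best_score:
            # The change sticks.
            best, best_score = candidate, score
            kept += 1
        # Otherwise the change is rolled back: best is left untouched.
    return best, best_score, kept


if __name__ == "__main__":
    best, score, kept = keep_or_discard_loop({"a": 0.0, "b": 0.0}, evaluate, 700)
    print(f"kept {kept} of 700 changes, final score {score:.4f}")
```

The agent never sees or edits `evaluate`, which mirrors the locked harness: the loop can only get a better score by producing a better candidate.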

At ExoBrain we had autoresearch installed and running on our DGX Spark Blackwell GPU within minutes. The barrier to entry here is genuinely low. If you have a machine with a decent GPU and a problem you can measure, you can run this loop tonight. And that accessibility is the real story. Not the code itself, not even the results, but the fact that a pattern this powerful is now available to almost anyone.

Within 48 hours of the release, Shopify CEO Tobi Lütke cloned the repo, pointed it at his own query-expansion training data, and went to sleep. He woke up to a 0.8-billion-parameter model scoring 19% higher than his previous 1.6-billion-parameter version. A smaller model beat one twice its size, after 37 experiments in eight hours, run by a CEO, not a research team. He then applied the same pattern to Shopify’s 20-year-old Liquid template engine and achieved 53% faster parse and render times with 61% fewer object allocations. That wasn’t machine learning (ML) training at all. It was performance optimisation of a production codebase.

Strip away the machine learning specifics and what autoresearch demonstrates is a general-purpose primitive: given a codebase, a metric, and a fixed evaluation budget, let an agent modify code, measure the outcome, keep or discard, and loop. That pattern works anywhere you have a clear, measurable objective, a fast evaluation cycle, a bounded surface area the agent can modify, and no irreversible side effects. Code performance, ad conversion rates, email response rates, compiler optimisation, logistics routing. The list is long.

This is not a new idea. As we reported in January, developers have been using what some called the “Ralph Wiggum loop”: feed an agent a prompt, let it iterate until it works, kill the session before context degrades, spin up a fresh one that reads from externalised state.
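The restart-from-externalised-state idea can be sketched as follows. This is a minimal illustration, assuming the state lives in a JSON file and that a fixed step cap stands in for "before context degrades". The function names, the file layout, and the step cap are all hypothetical, and the "work" is a placeholder where an agent call would go.

```python
import json
from pathlib import Path
from typing import List


def run_session(state_path: Path, max_steps: int) -> bool:
    """One short-lived session. It knows only what the state file tells it."""
    state = json.loads(state_path.read_text())
    context: List[str] = []  # in-memory scratch, discarded when the session ends
    for _ in range(max_steps):
        if not state["todo"]:
            break
        task = state["todo"].pop(0)
        context.append(f"worked on {task}")  # placeholder for real agent work
        state["done"].append(task)
        # Externalise progress after every step so a kill loses nothing.
        state_path.write_text(json.dumps(state))
    return not state["todo"]


def ralph_loop(state_path: Path, tasks: List[str], max_steps: int,
               max_sessions: int = 100) -> int:
    """Spin up fresh sessions until the work is finished; return how many ran."""
    state_path.write_text(json.dumps({"todo": list(tasks), "done": []}))
    sessions = 0
    finished = False
    while not finished and sessions < max_sessions:
        finished = run_session(state_path, max_steps)
        sessions += 1
    return sessions
```

Each session starts with an empty `context` and rebuilds what it needs from the file, so nothing important depends on a long-running conversation staying coherent.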

But Karpathy’s vision doesn’t end with a research loop. Days later he released AgentHub, a lightweight collaboration platform he describes as “GitHub, but for agents.” He envisions autoresearch becoming “asynchronously and massively collaborative for agents, SETI@home style.” The goal, he wrote, is “not to emulate a single PhD student. It’s to emulate a research community of them.” In this vision, hundreds of agents run experiments on different GPUs across the internet, publishing findings to a shared message board, branching off in different research directions, and adopting each other’s discoveries.
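To make the shared-board idea concrete, here is a single-process toy simulation. It is not AgentHub's API or design, which this article does not describe. `Board`, `Finding`, and the scoring function are hypothetical, and the "discovery" is a single number rather than a code change. Each agent adopts the best published finding, branches off with a small random variation, and publishes only if it improves on what it adopted.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    agent: str
    score: float  # lower is better
    value: float  # the toy "discovery"


@dataclass
class Board:
    """Hypothetical shared message board that agents publish findings to."""
    findings: List[Finding] = field(default_factory=list)

    def publish(self, finding: Finding) -> None:
        self.findings.append(finding)

    def best(self) -> Finding:
        return min(self.findings, key=lambda f: f.score)


def score(value: float) -> float:
    """Toy metric: distance from an optimum the agents do not know."""
    return abs(value - 42.0)


def run_round(board: Board, agents: List[str], rng: random.Random) -> None:
    for name in agents:
        base = board.best()                          # adopt the best shared finding
        candidate = base.value + rng.uniform(-5, 5)  # branch off in a new direction
        candidate_score = score(candidate)
        if candidate_score < base.score:
            board.publish(Finding(name, candidate_score, candidate))


def simulate(n_agents: int = 8, n_rounds: int = 50, seed: int = 0) -> Board:
    rng = random.Random(seed)
    board = Board()
    board.publish(Finding("baseline", score(0.0), 0.0))
    agents = [f"agent-{i}" for i in range(n_agents)]
    for _ in range(n_rounds):
        run_round(board, agents, rng)
    return board
```

A real system would run agents asynchronously on separate machines and would need to handle conflicting or stale findings. The simulation only shows the adopt, branch, and publish cycle.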

Takeaways: Autoresearch is not a breakthrough in ML. It is a clean implementation of a pattern that will spread rapidly: autonomous, metric-driven, keep-or-discard iteration loops. If you have a problem that can be measured and tackled through relentless iteration, it will not be long before an intelligent loop overwhelms it. The hard part, and the increasingly valuable human skill, is defining what “better” means in a way that is computable, in domains beyond ML.