ExoBrain

AI eats AI

OpenAI's MLE-bench demonstrates that AI agents can achieve human-level performance in machine learning engineering, signalling a recursive loop in AI development.

Joel Miller


This week, researchers at OpenAI unveiled MLE-bench, a new benchmark for evaluating the machine learning and AI engineering capabilities of AI agents. The benchmark, comprising 75 Kaggle competitions, tests an agent’s ability to perform complex ML tasks autonomously. Note: Kaggle is a popular online platform that hosts data science and machine learning competitions, where participants compete to build the best predictive models for various real-world problems, often with substantial cash prizes and recognition in the community.

This was a low-key news release from OpenAI, ostensibly the launch of yet another AI benchmark, but the results were intended to shock… and to demonstrate how powerful their models are becoming. The best-performing agent, built on o1-preview, earned medals in 16.9% of the competitions, a feat only two humans have ever accomplished.

The system’s success relied on sophisticated scaffolding and guidance. The researchers used various open-source agent frameworks to structure how the AI tackled each task. These scaffolds provided the AI with tools for code execution, file management, and even submission validation, mirroring the resources available to human Kaggle competitors. This setup allowed the AI to iterate on solutions, debug issues, and optimise its approach within a 24-hour time limit for each competition.
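To make the idea of scaffolding concrete, here is a minimal Python sketch of the kind of loop described above: propose a solution, run it, keep the best result, and stop when the time budget runs out. It is an illustration only, not the code used in MLE-bench. The names `run_scaffold`, `validate_submission`, `Attempt`, and the toy stand-ins for the model and the code runner are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    code: str     # candidate solution proposed by the agent
    score: float  # validation score obtained by running it


def validate_submission(rows: list[dict], required: set[str]) -> bool:
    """Hypothetical format check: every row must contain the required columns."""
    return bool(rows) and all(required.issubset(row.keys()) for row in rows)


def run_scaffold(
    propose: Callable[[Optional[Attempt]], str],
    execute: Callable[[str], float],
    time_limit_s: float,
    max_steps: int,
) -> Optional[Attempt]:
    """Loop propose -> execute until the budget is spent; return the best attempt."""
    deadline = time.monotonic() + time_limit_s
    best: Optional[Attempt] = None
    for _ in range(max_steps):
        if time.monotonic() >= deadline:
            break
        code = propose(best)  # the model sees the best attempt so far
        try:
            score = execute(code)
        except Exception:
            continue  # a crashed run is skipped and the agent tries again
        if best is None or score > best.score:
            best = Attempt(code, score)
    return best


# Toy stand-ins (assumptions): a fixed queue of proposals and a score lookup.
candidates = iter(["v1", "broken", "v2"])
scores = {"v1": 0.71, "v2": 0.78}


def toy_propose(previous: Optional[Attempt]) -> str:
    return next(candidates)


def toy_execute(code: str) -> float:
    return scores[code]  # raises KeyError for "broken", simulating a crash


best = run_scaffold(toy_propose, toy_execute, time_limit_s=24 * 60 * 60, max_steps=3)
submission_ok = validate_submission([{"id": 1, "label": 0}], {"id", "label"})
```

In a real scaffold, `propose` would call a language model and `execute` would run the generated code in a sandbox. The sketch only shows the control flow: failures do not end the run, and only the best-scoring attempt is kept for submission.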

The role AI is playing in accelerating software engineering, through the likes of GitHub Copilot, Cursor and Devin, has been covered extensively in this newsletter. These MLE-bench results represent the next step in the self-improving loop of AI development. As AI becomes more adept at ML engineering, it could accelerate its own development, and with it the development of ever more capable systems.

This self-reinforcing cycle could have profound implications for AI research and development. We may see AI systems that can design, implement, and optimise new AI algorithms with minimal human intervention.

Takeaways: The link between the biggest AI stories this week is clear. It is not a great stretch to see that AI, having made strong progress in productivity fields, is breaking out into domains where processes are repeatable and outputs are measurable… maths, science, software, business services, and AI research itself. A central, recursively self-improving loop is emerging that will drive expansion and acceleration ever faster. With a potential 10,000x increase in compute and scale predicted through to the end of the decade, the limits of this evolution are unlikely to be external. The steel man position here is that developments this week also highlight the continued significance of human insight in framing problems and interpreting results, and in providing oversight and creative sparks. A combination of the two positions is our best bet for making the most of a daunting but fascinating future.