The holy grail of benchmarks

This week, METR provided an update on their approach to evaluating AI capabilities, which pits machine performance against human experts across a diverse range of tasks. The methodology aims to give a better understanding of AI progress by moving beyond abstract benchmarks to assess real-world impact.

The AI landscape is dominated by leaderboards and benchmark scores that often fail to translate into meaningful insights about AI’s practical capabilities (as we covered in a previous newsletter). METR’s approach addresses this gap by directly comparing AI agent performance to that of human experts on a variety of complex technical tasks, from cybersecurity to machine learning.

At the heart of METR’s evaluation is a focus on the correlation between human and agent performance. Their findings reveal that while AI agents generally excel at tasks that humans can complete quickly, models like Claude 3.5 and GPT-4o struggle to complete tasks that take human experts hours to solve. This view offers a clearer picture of where AI stands in relation to human capabilities. It also points at what is essentially the holy grail of AI development in 2024 and of the road to AGI: ‘longer horizon’ tasks, with all of their dependencies, complexities and need for planning and reasoning.

METR’s chart compares the performance of AI agents to human experts across tasks of varying difficulty. The x-axis represents how long it takes humans to complete different tasks, in ranges from 1-4 minutes to 16-64 hours. The y-axis shows the fraction of tasks in each range completed by AI agents, averaged across six different language models.
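For readers who want to try something similar on their own tasks, here is a minimal sketch of how such a chart could be computed. It is not METR’s code: the intermediate time ranges, the function names, the model names and the sample runs are all made up for illustration; only the first and last ranges (1-4 minutes and 16-64 hours) come from the description above.

```python
from collections import defaultdict

# Time ranges in minutes of human effort. Only the first and last ranges are
# taken from the article; the ones in between are assumed for illustration.
BUCKETS = [
    ("1-4 min", 1, 4),
    ("4-15 min", 4, 15),
    ("15-60 min", 15, 60),
    ("1-4 h", 60, 240),
    ("4-16 h", 240, 960),
    ("16-64 h", 960, 3840),
]


def bucket_for(human_minutes):
    """Return the label of the range a task falls into, or None if outside all ranges."""
    for label, low, high in BUCKETS:
        if low <= human_minutes < high:
            return label
    return None


def success_by_bucket(runs):
    """Average, across models, the fraction of tasks each model completed per range.

    `runs` is an iterable of (model_name, human_minutes, succeeded) tuples.
    """
    # (range label, model) -> [successes, attempts]
    counts = defaultdict(lambda: [0, 0])
    for model, human_minutes, succeeded in runs:
        label = bucket_for(human_minutes)
        if label is None:
            continue  # skip tasks outside the charted ranges
        counts[(label, model)][0] += int(succeeded)
        counts[(label, model)][1] += 1

    # Compute each model's success rate per range, then average over models.
    rates = defaultdict(list)
    for (label, _model), (wins, attempts) in counts.items():
        rates[label].append(wins / attempts)
    return {label: sum(r) / len(r) for label, r in rates.items()}


# Entirely hypothetical runs for two hypothetical models.
example_runs = [
    ("model_a", 2, True), ("model_a", 3, True),
    ("model_b", 2, True), ("model_b", 3, False),
    ("model_a", 90, True), ("model_a", 120, False),
    ("model_b", 120, False),
]

if __name__ == "__main__":
    for label, rate in success_by_bucket(example_runs).items():
        print(f"{label}: {rate:.2f}")
```

Averaging per model first, then across models, stops a model that attempted more tasks from dominating the result. A business adapting this would swap in its own tasks, timed against its own staff.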

This is a great new evaluation method. For businesses, a customised version of this could offer a more reliable way to assess where different AIs can most effectively augment or potentially replace human labour. The study’s finding that AI agents can generally complete tasks at 1/30th the cost of human experts is particularly noteworthy, suggesting significant potential for cost savings and efficiency gains in certain areas. However, the research also highlights the room for improvement in AI’s ability to tackle complex, long-form tasks. It will be fascinating to see how the next generation of models, such as GPT-5, Llama 4 and Claude 3.5 Opus, performs here. Will they push the line out into long-task territory, and what will this mean for job displacement? Or will the line simply move up, increasing quality on tasks that are already in reach without taking automation to a new level?

Takeaways: METR’s new evaluation approach is a great way to better understand where AI capabilities currently sit. Many short and medium-sized tasks are in scope and in many cases vastly cheaper to complete with AI. Longer and more complex tasks are the next frontier.
