OpenAI’s new PaperBench benchmark offers another glimpse into AI’s ability to conduct advanced research. The benchmark asks AI agents to read machine learning papers, write code from scratch, and reproduce experimental results.
The results are already promising. OpenAI’s o1 leads with a 26% replication score when run on high compute with an optimised agent architecture. Human ML researchers achieved 41.4% after 48 hours, which shows how wide the gap between AI and human capabilities remains.
Interestingly, AI agents start strong, outperforming humans in the first hour with rapid code generation. However, they quickly plateau, failing to make sustained progress over longer periods. This aligns with our own experience: AI excels at initial reasoning but struggles with the strategic planning and troubleshooting needed for long and complex tasks.
PaperBench also reveals that AI performs better at writing code (35-43%) than at running experiments (1-7%) or verifying results (less than 1%). Execution and verification, rather than code generation, are where improvements are needed most.
For those building AI agents, these findings suggest focusing on three areas: extending agents’ ability to plan over longer timeframes, improving execution capabilities, and enhancing error detection and recovery.
Takeaways: Current AI agents show promise but remain far from capable of autonomous research. The most successful agents combine strong reasoning abilities with structured approaches to complex problems. As these systems improve, they might accelerate AI research itself, creating a feedback loop of progress. This benchmark gives us a concrete way to track that advancement.
