As the UK’s year 11 and 13 pupils grapple with the joys of standardised testing, the world of AI finds itself in a similarly painful state. While GCSEs and A-levels are rationalised as vital for comparing student achievement and school performance over time, many argue that these tests fail to reflect students’ true abilities and narrow the focus of teaching to just the tested subjects and content.
As more models become available, the question arises: how do we determine which are best, and are labs pouring too much effort into beating each other on standard tests rather than developing well-rounded capabilities? Are tests akin to human examinations the best approach?
This week Scale AI, the controversial AI data unicorn, launched the SEAL Leaderboards to address the lack of transparency around AI performance. Its rankings use private, curated datasets and keep evaluation prompts under wraps to prevent labs from cheating. While this is progress, Scale AI plans to update SEAL only a few times a year, and the leaderboards cover a limited subset of models. The published methodology is worth reviewing as a pretty comprehensive ‘test & evaluation’ approach for AI.
MMLU, or Massive Multitask Language Understanding, has been the focal benchmark for AI to date. It tests accuracy across 15,000+ questions spanning maths, science, history, and various problem-solving challenges. LLMs now regularly top 90%, beating out human experts in each field. However, many have highlighted the poor quality of the question set, and the likelihood that models are now highly tuned to perform well on it.
The LMSYS Chatbot Arena, a crowdsourced platform where humans ‘blind taste’ AI responses, has collected over 1,000,000 human comparisons to rank LLMs on a chess-style Elo scale. However, OpenAI has recently been accused of ‘style-hacking’ LMSYS by formatting outputs with bullets and headings to make them look superficially better, even if the substance is inferior.
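For a feel of how a chess-style Elo scale turns head-to-head votes into a ranking, here is a minimal sketch of the classic Elo update. It is an illustration only, not the Arena's published method: the function names, the K-factor of 32 and the 400-point scale are conventional chess defaults assumed here.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model.

    The 400-point scale is the chess convention (an assumption here).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return new (rating_a, rating_b) after one blind comparison.

    score_a is 1.0 if A's response was preferred, 0.0 if B's was,
    and 0.5 for a tie. k (assumed 32) controls how far one vote moves a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two hypothetical models start level; model A wins one vote.
    print(update_elo(1000.0, 1000.0, 1.0))
```

The point to notice is that a win over a much higher-rated model moves the ratings more than a win over an equal, which is why a steady stream of small stylistic wins can add up on the leaderboard.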
Many new tests have emerged, such as MMLU-Pro, with a better question set, and MMMU, which tests image recognition and multi-modal capabilities. There are also questions of hallucination (HHEM), plus speed and cost (analysed in detail by Artificial Analysis), to consider.
Like GCSEs, standardised tests have their place, providing a way to approximately compare groups of models against common tasks, and identify the ‘classes’ of capability. However, just as we don’t rely solely on a person’s GCSE results when hiring them for a job, nothing will beat assessing a model in context and judging it on its merits for a given role.
Takeaways: We’ve put together the latest report card across the most widely tested models; use it to get a feel for approximate strengths and weaknesses before testing your model in situ:

- Despite its troubles with the launch of AI in search, Google’s post-I/O model updates are going much better and have seen the Gemini 1.5s surge up the table.
- GPT-4o is an all-new model, heavily tuned to be capable and fast, and it performs well across the board. But more than for any other model, given it’s so new, the scores should not be taken at face value when evaluating for a use-case. One area where it clearly deserves its grades is multi-modal performance; it is a step forward in visual understanding.
- Good old GPT-4 Turbo tops the new SEAL coding table, borne out by many comments on 4o’s coding skills, but suffers in the cost and performance stakes.
- Claude 3 has been slipping back. Some suggest it’s falling victim to the unexplained decay that many models seem to go through, often because they are victims of their own success, requiring behind-the-scenes performance optimisations that rob them of some of their initial zing. It’s also not showing well in the hallucination stakes.
- Llama 3, and that big Meta quality investment, continue to impress, with the 70B model competing with its closed-source peers.
