While LLMs are generally thought to struggle with mathematical tasks, news from this week and recent months suggests we’re witnessing significant improvements in this area. Let’s face it, most humans are pretty bad at maths, but some show exceptional talent. It’s more or less the same with AI models, with both specialised and larger frontier models demonstrating improving capabilities. This evolution in mathematical reasoning isn’t just a technical milestone – it’s also potentially an accelerant on the path to artificial general intelligence (AGI).
Released this week, Mistral AI’s Mathstral, a 7B parameter model designed for STEM applications, has shown impressive results on the MATH benchmark (a dataset of 12,500 challenging competition mathematics problems) for a small open-weight model you can run on your laptop. Meanwhile, Harmonic’s Aristotle has been progressing on MiniF2F, a benchmark for testing AI systems’ formal mathematical abilities. The developments aren’t limited to the smaller labs. Various rumours suggest that OpenAI have been internally demonstrating a model with powerful maths reasoning, while Google’s Gemini 1.5 Pro has been shown to achieve 91.1% on MATH without tool use, the highest score of any public model so far.
As we previously covered, Scale AI have been developing GSM1k, an entirely new set of problems that mirrors the difficulty of the popular GSM8k benchmark but that the models cannot have seen during training. An example of the kind of question posed is: “Gabriela has $65.00 and is shopping for groceries so that her grandmother can make her favourite kale soup. She needs heavy cream, kale, cauliflower, and meat (bacon and sausage). Gabriella spends 40% of her money on the meat. She spends $5.00 less than one-third of the remaining money on heavy cream. Cauliflower costs three-fourth of the price of the heavy cream and the kale costs $2.00 less than the cauliflower. As Gabriela leaves the store, she spends one-third of her remaining money on her grandmother’s favourite Girl Scout Cookies. How much money, in dollars, does Gabriela spend on Girl Scout cookies?” Today a frontier model like Claude 3 can correctly answer 950 out of 1,000 of these previously unseen questions.
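To make the multi-step nature of such a question concrete, here is a minimal Python sketch of the arithmetic. It assumes that "the remaining money" in the heavy cream step means what is left after buying the meat, and that "her remaining money" at the exit means what is left after all four groceries. The variable names are our own, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

budget = Fraction(65)

# 40% of the budget goes on meat.
meat = budget * Fraction(40, 100)

# Heavy cream: $5 less than one-third of what is left after the meat (assumed reading).
remaining_after_meat = budget - meat
heavy_cream = remaining_after_meat / 3 - 5

# Cauliflower costs three-quarters of the heavy cream; kale is $2 less than the cauliflower.
cauliflower = heavy_cream * Fraction(3, 4)
kale = cauliflower - 2

# One-third of whatever is left after all groceries goes on cookies (assumed reading).
left_at_exit = budget - (meat + heavy_cream + cauliflower + kale)
cookies = left_at_exit / 3

print(cookies)
```

Under this reading the chain runs 26, 8, 6 and 4 dollars for the groceries, leaving 21 dollars, of which 7 go on cookies. Each step depends on the one before it, which is why a single slip early on derails the final answer.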
Beyond intrinsic model capabilities, a team in the AI Mathematical Olympiad (AIMO) showcased a novel approach to enhancing LLMs’ ability on much harder problems by combining structured thinking with code execution. The team from Numina and Hugging Face used high-quality instruction data for competition-level maths, then integrated this with code generation capabilities. This hybrid approach allowed their model to break down complex problems into steps, run Python code to reason about each stage, and ultimately achieve impressive performance gains. The technique not only improved accuracy but also reduced variance in solutions, demonstrating how creative combinations of existing methods can push boundaries.
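The general idea of pairing generated code with execution can be sketched in a few lines of Python. This is an illustration only, not the Numina and Hugging Face pipeline: the candidate programs are hard-coded stand-ins for model output, the helper names are hypothetical, and the majority vote over results is our own assumption about one simple way that running several attempts could reduce variance.

```python
from collections import Counter

# Hypothetical stand-ins for programs a model might emit for the same question:
# "What is the sum of the first 10 positive odd integers?"
candidate_programs = [
    "answer = sum(range(1, 20, 2))",
    "answer = 10 ** 2",
    "answer = sum(range(1, 10, 2))",  # buggy: only sums the first 5 odd numbers
    "answer = ",                      # malformed: fails to compile
]

def run_candidate(source):
    """Execute one candidate and return its `answer`, or None if it fails."""
    namespace = {}
    try:
        # Restricting builtins limits what the snippet can call.
        # This is NOT a real sandbox; production systems need proper isolation.
        exec(source, {"__builtins__": {"sum": sum, "range": range}}, namespace)
    except Exception:
        return None
    return namespace.get("answer")

def majority_answer(programs):
    """Run every candidate and return the most common successful result."""
    results = [run_candidate(p) for p in programs]
    valid = [r for r in results if r is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]

final_answer = majority_answer(candidate_programs)
print(final_answer)
```

Here two candidates agree, one is wrong and one crashes, so the vote settles on the value the two agreeing programs computed. Executing the code replaces error-prone mental arithmetic with an exact computation, and aggregating over several attempts dampens the effect of any single bad sample.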
But why does this matter beyond the world of AI researchers? Chinese lab DeepSeek’s Liang Wenfeng believes the path to AGI means betting on three areas: mathematics, multimodality, and language. “Mathematics and code are the natural testing grounds for AGI,” he notes, with maths a particularly important proving ground because it is “a verifiable system that [can support] high intelligence through self-learning.”
Maths capability is also vital for machine-verifiable ‘proofs’. Harmonic’s Aristotle can take a natural language maths problem and translate it into a formal proof in ‘Lean 4’, a language for mathematical reasoning. This kind of process can help address a critical concern in AI adoption: trust. By translating problems into a formal language (autoformalization) and producing proofs that a machine can check, models can show their working in a fully verifiable way. This is crucial for deploying AI in critical applications like designing bridges or drugs, where we need to be sure it’s not just guessing the answer. Proofs have many other uses, from cryptography, smart contracts, and hardware security to exploring new mathematical ideas.
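For readers who have not seen a formal proof, here is a tiny illustrative example in Lean 4. It is not output from Aristotle; the theorem name is our own, and the statements are deliberately trivial. The point is that Lean accepts the file only if every step checks, so an accepted proof cannot be a lucky guess.

```lean
-- Illustrative only: addition of natural numbers is commutative.
-- `Nat.add_comm` is a lemma from Lean 4's core library.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete fact that Lean verifies by direct computation.
example : 2 + 2 = 4 := rfl
```

If either statement were false, or the justification did not match it, Lean would reject the file rather than return a plausible-looking answer.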
Whilst maths is not a comfort zone for many, nor has it been for language-based AI, models are now rapidly becoming adept at tackling complex problems. This progress could democratise advanced mathematical reasoning, making it accessible to a broader range of industries and applications, and help strengthen AI’s value in many other areas.
Takeaways: For businesses and AI users, these developments should prompt a reassessment of AI’s potential in domains requiring complex reasoning. Mathematical capabilities can be highly effective for many business tasks. If a use case hasn’t worked with a language-based approach, it may now be solvable with one that employs more logical reasoning. As AI maths evolves, staying informed and understanding how to exploit the power of mathematical proofs will be highly beneficial.
