Week 29 news

Welcome to our weekly news post, a combination of thematic insights from the founders at ExoBrain, and a broader news roundup from our AI platform Exo…

Themes this week

JOEL

This week we look at:

  • How LLMs’ maths capabilities are improving and what this means for the future, and for AGI.
  • The political turmoil in the US, Trump’s VP pick and the implications for AI.
  • New fast and cheap models, including GPT-4o Mini, that radically drive down the cost of intelligence.

Language models do the math

While LLMs are generally thought to struggle with mathematical tasks, news this week and in recent months suggests we’re witnessing some significant improvements in this area. Let’s face it, most humans are pretty bad at maths, but some show exceptional talent. It’s more or less the same with AI models, with both specialised and larger frontier models demonstrating improving capabilities. This evolution in mathematical reasoning isn’t just a technical milestone – it’s also potentially an accelerant on the path to artificial general intelligence (AGI).

Released this week, Mistral AI’s Mathstral, a 7B parameter model designed for STEM applications, has shown impressive results on the MATH benchmark (a dataset of 12,500 challenging competition mathematics problems) for a small open-weight model you can run on your laptop. Meanwhile, Harmonic’s Aristotle has been progressing on MiniF2F, a benchmark for testing AI systems’ formal mathematical abilities. The developments aren’t limited to the smaller labs. Various rumours suggest that OpenAI have been internally demonstrating a model with powerful maths reasoning, while Google’s Gemini 1.5 Pro has been shown to achieve 91.1% on MATH without tool use, the highest score of any public model so far.

As we previously covered, Scale AI have been developing GSM1k, an entirely new set of problems that mirrors the difficulty of the popular GSM8k benchmark, but written so that models cannot have been trained on the questions. An example of the kind of question posed is: “Gabriela has $65.00 and is shopping for groceries so that her grandmother can make her favourite kale soup. She needs heavy cream, kale, cauliflower, and meat (bacon and sausage). Gabriella spends 40% of her money on the meat. She spends $5.00 less than one-third of the remaining money on heavy cream. Cauliflower costs three-fourth of the price of the heavy cream and the kale costs $2.00 less than the cauliflower. As Gabriela leaves the store, she spends one-third of her remaining money on her grandmother’s favourite Girl Scout Cookies. How much money, in dollars, does Gabriela spend on Girl Scout cookies?” Today a frontier model like Claude 3 can correctly answer 950 out of 1,000 of these previously unseen questions.
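For readers who want to check the arithmetic in the Gabriela question, here is a minimal Python sketch that walks through it one step at a time. The function and variable names are ours, chosen for illustration, and we read “the remaining money” in the heavy cream step as the money left after buying the meat.

```python
from fractions import Fraction


def cookie_spend(budget=Fraction(65)):
    """Follow the GSM1k example question step by step (names are illustrative)."""
    meat = budget * Fraction(40, 100)          # 40% of the starting money
    after_meat = budget - meat                 # what is left after the meat
    cream = after_meat / 3 - 5                 # $5 less than one-third of the remainder
    cauliflower = cream * Fraction(3, 4)       # three-quarters of the cream price
    kale = cauliflower - 2                     # $2 less than the cauliflower
    remaining = budget - (meat + cream + cauliflower + kale)
    return remaining / 3                       # one-third of what is left goes on cookies


print(cookie_spend())  # prints 7
```

Exact fractions are used rather than floats so that the percentages and thirds do not introduce rounding error. Under this reading the answer is $7.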

Beyond intrinsic model capabilities, in the Artificial Intelligence Mathematical Olympiad (AIMO) a team showcased a novel approach to enhancing LLMs’ ability on much harder problems, by combining structured thinking with code execution. The team from Numina and Hugging Face used high-quality instruction data for competition-level maths, then integrated this with code generation capabilities. This hybrid approach allowed their model to break down complex problems into steps, run Python code to reason about each stage, and ultimately achieve impressive performance gains. The technique not only improved accuracy but also reduced variance in solutions, demonstrating how creative combinations of existing methods can push boundaries.

But why does this matter beyond the world of AI researchers? Liang Wenfeng of Chinese lab DeepSeek believes the path to AGI means betting on three areas: mathematics, multimodality, and language. “Mathematics and code are the natural testing grounds for AGI,” he notes, and maths is a particularly important proving ground as it is “a verifiable system that [can support] high intelligence through self-learning.”

Maths capability is also vital for machine-verifiable ‘proofs’. Harmonic’s Aristotle can take a natural language maths problem and translate it into a formal proof in ‘Lean 4’, a language for mathematical reasoning (a process known as autoformalization). This can help address a critical concern in AI adoption: trust. By producing formally verified proofs, models can show their working in a way that a machine can check completely, which is crucial for deploying AI in critical applications like designing bridges or drugs, where we need to be sure it’s not just guessing the answer. Proofs have many other uses, from cryptography, smart contracts, and hardware security to exploring new mathematical ideas.
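To give a flavour of what a machine-checked proof looks like, here is a tiny Lean 4 sketch. These are our own toy statements, not output from Aristotle or any other system mentioned here, and they use only lemmas from Lean’s core library.

```lean
-- A trivially small statement: Lean checks it by evaluating both sides.
theorem two_plus_two : 2 + 2 = 4 := rfl

-- Addition on natural numbers is commutative, proved by citing the
-- core library lemma Nat.add_comm.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If either proof were wrong, Lean would refuse to accept the file. That is the property that makes formal proofs useful for building trust: the checker, not the reader, confirms every step.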

Whilst maths is not a comfort zone for many, nor has it been for language-based AI, models are now rapidly becoming adept at tackling complex problems. This progress could democratise advanced mathematical reasoning, making it accessible to a broader range of industries and applications, and help strengthen AI’s value in many other areas.

Takeaways: For businesses and AI users, these developments should trigger the reassessment of AI’s potential in domains requiring complex reasoning. Mathematical capabilities can be highly effective for many business tasks. If a use-case hasn’t worked with a language-based approach, it may now be solvable with one that employs more logical reasoning. As AI maths evolves, staying informed and understanding how to exploit the power of mathematical proofs will be highly beneficial.

JOOST

Would Trump “Make America First in AI”?

The tumultuous 2024 US presidential race is setting the stage for a significant debate on AI policy and regulation. This week, conversations have been heating up in tech circles about how the contrasting approaches of leading candidates could reshape future trajectories.

Donald Trump’s selection of J.D. Vance as his running mate, dubbed a “tech bro” on the ticket by some, hints at a potential alignment with Silicon Valley conservatives. This pairing could lead to policies favouring open-source advocates and possibly easing regulations. However, Trump’s vision may not be a purely unrestricted one. The Washington Post reports that a Trump-aligned institute has drafted a “Make America First in AI” policy that would launch a series of “Manhattan projects” to increase the militarisation and protection of US AI capabilities. Moreover, Trump’s unpredictable approach to foreign policy could have significant implications for GPU supplies. His recent comments on Taiwan, urging the island to shoulder more of its defence costs, have already rattled financial markets and chipmaker stocks. And the polarising nature of Trump does not stop there: his personal relations could impact policy too. Trump’s long-standing feud with Meta’s Mark Zuckerberg adds a layer of complexity, with Zuckerberg’s firm an increasing AI powerhouse. The animosity could influence regulatory actions against specific social media platforms and impact their leverage in the race to more advanced systems.

In contrast, a Democratic victory would likely usher in a more regulated approach. Arati Prabhakar, the current Director of the Office of Science and Technology Policy, advocates for a balanced approach to tech regulation, focusing on both innovation and ethical standards. This could mean more comprehensive regulations aimed at ensuring technology serves the public good while mitigating risks. The implications for businesses and users of AI technology are significant. Companies may find themselves navigating a rapidly changing regulatory landscape, potentially affecting everything from product development to market strategies. Users could see changes in the speed of innovation and the level of protections against potential misuse or bias in AI systems.

The role of AI in the election itself is another critical factor. Both parties have expressed concerns about AI’s potential to create deepfakes and spread misinformation, highlighting the need for robust policies to protect election integrity. This presents both a challenge and an opportunity for AI companies to develop tools that can combat misinformation and enhance cybersecurity.

Takeaways: As the election approaches, businesses should prepare for potential regulatory shifts by developing flexible AI strategies. Users should stay informed about AI policies and their implications for privacy and rights. Policymakers and tech leaders must collaborate to strike a balance between innovation and regulation, ensuring AI serves the public good while maintaining national competitiveness in what is a global race. The outcome of this election could shape the trajectory of AI development and regulation for years to come, making it a crucial moment for the tech industry and society at large.


EXO

Intelligence too cheap to meter?

On Thursday OpenAI unveiled GPT-4o ‘Mini’, a scaled-down version of their most powerful model that’s 60% cheaper (around $0.24 per million blended tokens) than the old GPT-3.5. Once again, a lab is prioritising efficiency over raw intelligence and scale, to help maximise monetisation. It looks like the performance and cost will see it significantly undercut Google’s Gemini Flash and Anthropic’s Claude 3 Haiku, the previously leading low-cost options. Groq, the chip and inference provider, also released new models including Llama-3-Groq-8B-Tool-Use, which will cost just $0.19 per million tokens and can run at over 1,000 tokens per second. All in, we’ve seen a staggering 100x reduction in the cost of using AI models in only two years.
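To put those per-million-token prices into budget terms, here is a minimal Python sketch. The two prices are the figures quoted above; the monthly token volume is a made-up workload for illustration only, and the labels are ours rather than official product identifiers.

```python
# Prices quoted above, in US dollars per million tokens.
PRICE_PER_MILLION_USD = {
    "GPT-4o Mini (blended)": 0.24,
    "Llama-3-Groq-8B-Tool-Use": 0.19,
}


def cost_usd(tokens, price_per_million):
    """Cost of processing `tokens` tokens at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical workload: 500 million tokens a month (an assumption, not a real figure).
monthly_tokens = 500_000_000

for name, price in PRICE_PER_MILLION_USD.items():
    print(f"{name}: ${cost_usd(monthly_tokens, price):,.2f} per month")
```

At these prices, half a billion tokens a month comes to roughly $120 and $95 respectively, which gives a sense of why use-cases that were previously too expensive may now be worth revisiting.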

These developments highlight the ongoing race in the AI industry to balance performance and efficiency. As AI capabilities evolve, so does the computational power required to run these models, leading to significant environmental and economic concerns. The push for a more streamlined AI is driven by growing competition among providers and rising interest in smaller, specialised models that can perform specific tasks with high efficiency.

For ChatGPT this likely means the end of the road for GPT-3.5, the model that started it all when released back in 2022. For end-users, these advancements should translate into smarter and more responsive experiences across the board, from chatbots to AI-enabled apps. Longer term, how low can costs go? In a post on X promoting the launch, Sam Altman, CEO of OpenAI, intimated that their goal was “intelligence too cheap to meter”.

Takeaways: As AI models become both more powerful and more efficient, businesses should constantly reassess their AI strategy. Consider the balance between specialised and general-purpose models in your stack and understand how new price points might unlock previously non-viable use-cases. You can test out GPT-4o Mini on ChatGPT Premium now.

Weekly news roundup

This week’s news highlights the growing impact of AI on various industries, increased focus on AI governance and security, advancements in AI research, and developments in AI hardware, particularly in chip manufacturing and data centre technologies.

AI business news

AI governance news

AI research news

AI hardware news
