2025 Week 41 news

Welcome to our weekly news post, a combination of thematic insights from the founders at ExoBrain, and a broader news roundup from our AI platform Exo…

Themes this week

JOEL

This week we look at:

  • OpenAI’s Dev Day launches for apps, agents, chat and more
  • Samsung’s 7-million parameter model matching giants on reasoning tasks
  • DeepSeek R1’s vulnerability to agent hijacking attacks

OpenAI mobilises devs for portal push

Whilst we try to cover a range of topics and companies, rarely a week now goes by where OpenAI’s increasing financial muscle and ambition don’t make the news. This week is no exception, with OpenAI’s third annual Dev Day taking place in San Francisco. Banners proclaimed 4 million developers, 800 million weekly users, and 6 billion tokens delivered per minute via its API, and according to Sam Altman, “it’s the best time in history to be a builder.”

While the GPT-5 Pro and Sora 2 APIs were announced, the focus was clearly on transforming ChatGPT into the definitive platform for the AI era. The strategy rests on two pillars:

  • The Apps SDK aims to make ChatGPT a primary interface, using MCP (Model Context Protocol) and a UI layer to integrate third-party services like Figma and Spotify directly into conversations. It is paired with the Agentic Commerce Protocol (ACP), which enables direct purchases within chat, starting with Etsy and Shopify. One imagines that protocols for other domains, such as banking, could be next.
  • AgentKit and ChatKit provide a visual builder and human interface for creating agentic solutions, accelerating prototyping and once again heavily leveraging MCP. Codex is now generally available and looking to build on growing adoption of the “tool that builds the tools”.

The ambition is to create what CEO Sam Altman calls “one AI service” useful across a user’s whole life, backed by a massive infrastructure buildout.

However, despite OpenAI’s increasing maturity, its execution prioritises speed over elegance. The Apps implementation relies on some belt-and-braces web technology such as iframes and is by no means guaranteed to succeed where the company’s previous chat extensions have comprehensively failed. Rapid expansion also introduces substantial risks. Seamless integration for apps and commerce requires access to shared user context, and this opens new routes for data exfiltration. OpenAI is asking for immense trust while the frameworks to secure that trust are still nascent.

The industry reaction was, as ever, wide-ranging. While AgentKit offers a new and quick route to deploy AI solutions, many fear a vendor monoculture and question the limiting paradigm of workflows. Some believe OpenAI is commoditising the very developers it empowers, with one founder remarking: “OpenAI dev day – you will never see pigs this happy at the slaughterhouse”.

Takeaways: Dev Day 2025 restated OpenAI’s intent to own the entire AI value chain. OpenAI is aggressively positioning itself to control the interface, the workflows, and the transactions of the digital world. The services are compelling. At ExoBrain we must admit that there is no faster way to combine strong and highly cost-effective models, data, agents, fine-tuning and monitoring than the OpenAI developer platform. Will the new services start to become the default for the interface layer too? The convenience of a unified platform is attractive, but the risks of centralised control and security vulnerabilities are as yet unresolved.

Samsung shrinks reasoning

While new model architectures to challenge the transformer (the tried and trusted paradigm behind this wave of AI progress) emerge from time to time, few demonstrate performance comparable to models 10,000x larger. A Samsung researcher has this week demonstrated that a 7-million parameter model can match or beat large language models on specific reasoning tasks. The Tiny Recursive Model (TRM) achieves 45% accuracy on ARC-AGI-1 and 87% on extreme Sudoku puzzles, compared to near-zero performance from models like DeepSeek R1 with 671 billion parameters.

The approach works by having a small neural network repeatedly refine its answer through recursive loops. Rather than processing a problem once with massive parameters, TRM cycles through the same compact network up to 42 times, progressively improving its solution. Think of it like solving a Sudoku; you don’t need different brain circuits for each step, just the same logical process applied repeatedly.
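To make the Sudoku analogy concrete, here is a toy sketch in Python. It is not TRM: the real model is a learned neural network, whereas the "network" below is a single hand-written rule (fill any cell that has only one legal digit). The function names, the 4x4 puzzle and the stopping rule are all our own illustrative assumptions; only the cap of 42 passes comes from the description above. What it shows is the shape of the idea: one small, fixed procedure applied again and again to its own output.

```python
# Toy illustration only: a hand-written rule stands in for TRM's learned network.
MAX_STEPS = 42  # loop cap quoted in the text; everything else is an assumption

PUZZLE = [  # hypothetical 4x4 Sudoku, 0 marks an empty cell
    [1, 0, 0, 4],
    [0, 4, 1, 0],
    [2, 0, 4, 0],
    [0, 3, 0, 1],
]

def candidates(grid, r, c):
    """Digits not yet used in the row, column or box of cell (r, c)."""
    n = len(grid)
    b = int(n ** 0.5)  # box width, 2 for a 4x4 grid
    used = set(grid[r]) | {grid[i][c] for i in range(n)}
    br, bc = r - r % b, c - c % b  # top-left corner of the box
    used |= {grid[i][j] for i in range(br, br + b) for j in range(bc, bc + b)}
    return set(range(1, n + 1)) - used

def refine(grid):
    """One pass of the same rule: fill every empty cell with a single candidate."""
    new = [row[:] for row in grid]
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value == 0:
                options = candidates(grid, r, c)
                if len(options) == 1:
                    new[r][c] = options.pop()
    return new

def solve(grid, max_steps=MAX_STEPS):
    """Reapply refine() to its own output until nothing changes or the cap is hit."""
    steps = 0
    for _ in range(max_steps):
        nxt = refine(grid)
        if nxt == grid:  # no further improvement, stop early
            break
        grid = nxt
        steps += 1
    return grid, steps

if __name__ == "__main__":
    solved, steps = solve(PUZZLE)
    for row in solved:
        print(row)
    print("passes used:", steps)
```

On this easy puzzle the loop settles after two passes. Harder puzzles would defeat such a simple rule, which is roughly where a learned refinement step earns its keep.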

Small LLMs continue to gain in popularity as they can help increase performance or reduce cost, but combining LLMs with different, narrowly optimised models that have hyper-efficient super-powers could prove even more interesting. The mix recalls how CPUs and GPUs split computational workloads, each optimised for its specific role. The results suggest that recursion can substitute for depth, at least on tasks that demand structured reasoning rather than linguistic recall.

TRM’s success on the challenging ARC-AGI benchmark, which tests abstract reasoning on synthetic puzzles, indicates that compact models can achieve generalisation without massive size. Still, the model’s domain remains narrow. It hasn’t yet faced open-ended language tasks or perception challenges. But its performance points toward an approach that could complement, rather than replace, large language models.

Takeaways: While not broadly applicable, TRM demonstrates that specialised tasks might not need billion-parameter models and huge energy bills. For embedded systems or domain-specific elements of larger solutions, where parameter efficiency matters, recursive approaches could reduce computational requirements by orders of magnitude. Such a small model means the tech can be widely replicated today. TRMs could quickly start to influence AI system design and play a role in tackling structured reasoning challenges.

EXO

DeepSeek scores 98% on the wrong benchmark

This chart comes from a new report from CAISI (the Center for AI Standards and Innovation), a division within NIST under the US Department of Commerce. DeepSeek’s R1 models appear alarmingly vulnerable to agent hijacking attacks, with success rates reaching 98% for critical exploits like downloading malware and 89% for sending phishing emails. In contrast, US models from OpenAI and Anthropic show dramatically lower vulnerability. Whilst the industry obsesses over benchmark scores for reasoning and coding abilities, these security vulnerability tests reveal equally consequential differences. Let’s hope they become the norm.

Weekly news roundup

This week’s AI news shows massive commercial investments reaching unprecedented valuations, major advances in model efficiency and safety, and growing global regulatory attention as AI integration accelerates across industries.

AI business news

AI governance news

AI research news

AI hardware news
