ExoBrain

ExoBrain Weekly

OpenAI mobilises devs for portal push, Samsung shrinks reasoning, and DeepSeek scores 98% on the wrong benchmark

Welcome to our weekly newsletter, a combination of thematic insights from the founders at ExoBrain, and a broader news roundup from our Exo agents.

This week we look at:

  • OpenAI mobilises devs for portal push

    OpenAI's Dev Day showcased a strategic push to dominate the AI interface layer through new developer tools and agentic commerce protocols, raising concerns about vendor lock-in and security risks.

  • Samsung shrinks reasoning

    Samsung researchers have developed the Tiny Recursive Model, a compact 7-million parameter architecture that achieves competitive reasoning performance through iterative refinement rather than massive scale.

  • DeepSeek scores 98% on the wrong benchmark

    A CAISI report reveals that DeepSeek's R1 models are highly vulnerable to agent hijacking attacks, highlighting critical security disparities compared to US-based frontier models.

OpenAI mobilises devs for portal push

OpenAI's Dev Day showcased a strategic push to dominate the AI interface layer through new developer tools and agentic commerce protocols, raising concerns about vendor lock-in and security risks.

Joel Miller


3 min read

Whilst we try to cover a range of topics and companies, rarely a week now goes by where OpenAI’s increasing financial muscle and ambition don’t make the news. This week is no exception, with OpenAI’s third annual Dev Day taking place in San Francisco. Banners proclaimed 4 million developers, 800 million weekly users, and 6 billion tokens delivered per minute via its API, and according to Sam Altman “it’s the best time in history to be a builder.”

While the GPT-5 Pro and Sora 2 APIs were announced, the focus was clearly on transforming ChatGPT into the definitive platform for the AI era. The strategy rests on two pillars:

  • The Apps SDK aims to make ChatGPT a primary interface, using MCP (Model Context Protocol) and a UI layer to integrate third-party services like Figma and Spotify directly into conversations. This comes allied with the Agentic Commerce Protocol (ACP), which enables direct purchases within chat, starting with Etsy and Shopify. One must imagine that protocols for other domains, such as banking, could be next.
  • AgentKit and ChatKit provide a visual builder and human interface for creating agentic solutions, accelerating prototyping and once again heavily leveraging MCP. Codex is now generally available and looking to build on growing adoption of the “tool that builds the tools”.

The ambition is to create what CEO Sam Altman calls “one AI service” useful across a user’s whole life, backed by a massive infrastructure buildout.

However, despite OpenAI’s increasing maturity, its execution prioritises speed over elegance. The Apps implementation relies on some belt-and-braces web technology such as iframes, and is by no means guaranteed to succeed where the company’s previous chat extensions have comprehensively failed. Rapid expansion also introduces substantial risks. Seamless integration for apps and commerce requires access to shared user context, which creates new routes for data exfiltration. OpenAI is asking for immense trust while the frameworks to secure that trust are still nascent.

The industry reaction was, as ever, wide-ranging. While AgentKit offers a new and quick route to deploy AI solutions, many fear a vendor monoculture and question the limiting paradigm of workflows. Some believe OpenAI is commoditising the very developers it empowers, with one founder remarking: “OpenAI dev day – you will never see pigs this happy at the slaughterhouse”.

Takeaways: Dev Day 2025 restated OpenAI’s intent to own the entire AI value chain. OpenAI is aggressively positioning itself to control the interface, the workflows, and the transactions of the digital world. The services are compelling. At ExoBrain we must admit that there is no faster way to combine strong and highly cost-effective models, data, agents, fine-tuning and monitoring than the OpenAI developer platform. Will the new services start to become the default for the interface layer too? The convenience of a unified platform is attractive, but the risks of centralised control and security vulnerabilities are as yet unresolved.

Samsung shrinks reasoning

Samsung researchers have developed the Tiny Recursive Model, a compact 7-million parameter architecture that achieves competitive reasoning performance through iterative refinement rather than massive scale.

Joel Miller


2 min read

While new model architectures to challenge the transformer (the tried and trusted paradigm behind this wave of AI progress) emerge from time to time, few demonstrate performance comparable to models 10,000x larger. A Samsung researcher has this week demonstrated that a 7-million parameter model can match or beat large language models on specific reasoning tasks. The Tiny Recursive Model (TRM) achieves 45% accuracy on ARC-AGI-1 and 87% on extreme Sudoku puzzles, compared to near-zero performance from models like DeepSeek R1 with 671 billion parameters.

The approach works by having a small neural network repeatedly refine its answer through recursive loops. Rather than processing a problem once with massive parameters, TRM cycles through the same compact network up to 42 times, progressively improving its solution. Think of it like solving a Sudoku; you don’t need different brain circuits for each step, just the same logical process applied repeatedly.
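To make the Sudoku analogy concrete, here is a minimal sketch in Python. It is not TRM itself: TRM learns its update step as a small neural network, whereas the hypothetical refine_once function below is a hand-written rule (fill any cell with exactly one legal value) on a 4x4 grid. The names refine_once, solve and PUZZLE are our own, chosen for illustration. What it shares with the idea described above is the shape of the loop: one small step, reused on its own output, with a cap of 42 passes borrowed from the figure quoted in the text.

```python
from typing import List

Grid = List[List[int]]  # 4x4 Sudoku, 0 marks an empty cell

def refine_once(grid: Grid) -> Grid:
    """Apply one refinement pass: fill each empty cell that has a single legal value.

    Candidates are computed from the incoming grid only, so each pass is a
    pure function of the previous answer.
    """
    new = [row[:] for row in grid]
    for r in range(4):
        for c in range(4):
            if grid[r][c] != 0:
                continue
            used = set(grid[r]) | {grid[i][c] for i in range(4)}
            br, bc = 2 * (r // 2), 2 * (c // 2)  # top-left corner of the 2x2 box
            used |= {grid[br + i][bc + j] for i in range(2) for j in range(2)}
            candidates = set(range(1, 5)) - used
            if len(candidates) == 1:
                new[r][c] = candidates.pop()
    return new

def solve(grid: Grid, max_steps: int = 42) -> Grid:
    """Reuse the same small step until the answer stops changing or the cap is hit."""
    for _ in range(max_steps):
        refined = refine_once(grid)
        if refined == grid:  # no further improvement possible with this rule
            break
        grid = refined
    return grid

# Illustrative puzzle (an assumption for this sketch, not taken from the paper).
PUZZLE: Grid = [
    [1, 0, 0, 4],
    [0, 4, 1, 0],
    [2, 0, 4, 0],
    [0, 3, 0, 1],
]

if __name__ == "__main__":
    for row in solve(PUZZLE):
        print(row)
```

On this puzzle the first pass cannot resolve every cell, and it is only the second application of the same step that completes the grid. That is the point of the analogy: depth comes from repetition rather than from a bigger step function.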

Small LLMs continue to gain in popularity as they can help increase performance or reduce cost, but combining LLMs with different, narrowly optimised models that have hyper-efficient super-powers could prove even more interesting. The mix recalls how CPUs and GPUs split computational workloads, each optimised for its specific role. The results suggest that recursion can substitute for depth, at least on tasks that demand structured reasoning rather than linguistic recall.

TRM’s success on the challenging ARC-AGI benchmark, which tests abstract reasoning on synthetic puzzles, indicates that compact models can achieve generalisation without massive size. Still, the model’s domain remains narrow. It hasn’t yet faced open-ended language tasks or perception challenges. But its performance points toward an approach that could complement, rather than replace, large language models.

Takeaways: While not broadly applicable, TRM demonstrates that specialised tasks might not need billion-parameter models and huge energy bills. For embedded systems or domain-specific elements of larger solutions, where parameter efficiency matters, recursive approaches could reduce computational requirements by orders of magnitude. Such a small model means the tech can be widely replicated today. TRMs could quickly start to influence AI system design and play a role in tackling structured reasoning challenges.

DeepSeek scores 98% on the wrong benchmark

A CAISI report reveals that DeepSeek's R1 models are highly vulnerable to agent hijacking attacks, highlighting critical security disparities compared to US-based frontier models.

ExoBrain

1 min read

This chart comes from a new report from CAISI (the Center for AI Standards and Innovation), a division within NIST under the US Department of Commerce. DeepSeek’s R1 models appear alarmingly vulnerable to agent hijacking attacks, with success rates reaching 98% for critical exploits like downloading malware and 89% for sending phishing emails. In contrast, US models from OpenAI and Anthropic show dramatically lower vulnerability. Whilst the industry obsesses over benchmark scores for reasoning and coding abilities, these security vulnerability tests reveal equally consequential differences. Let’s hope they become the norm.