2025 Week 9 news

Welcome to our weekly news post, a combination of thematic insights from the founders at ExoBrain, and a broader news roundup from our AI platform Exo…

Themes this week

JOEL

This week we look at:

  • Claude 3.7 Sonnet versus GPT-4.5
  • Amazon’s Alexa+ upgrade bringing Claude-powered intelligence to the Echo ecosystem
  • GibberLink, a new audio protocol enabling AI assistants to communicate directly with each other

Clash of the AI titans

As we mapped out in January, the early months of 2025 were going to be packed with next generation AI model launches, and following on from Grok 3 last week, this week has been even busier. Both OpenAI and Anthropic have launched new offerings within days of each other, and yet their approaches differ substantially.

Anthropic released Claude 3.7 Sonnet, which they’re calling the industry’s first “hybrid AI reasoning model.” The model combines both quick responses and more considered, longer thought processes in a single switchable package. OpenAI’s GPT-4.5, codenamed Orion, is their biggest model released so far (trained across multiple datacentres), and yet is not classed by them as a ‘frontier’ model and is initially only available in preview for API or ChatGPT Pro users.

What can we conclude from this phase?

  • Claude 3.7 Sonnet has been optimised for code, and all indications are that it’s handily the strongest model for software development, pushing the frontier forward materially. Social media has been full of impressive examples of games or apps created from a single prompt or very few, and it scored over 70% on SWE-Bench, which is impressive considering this figure was in single digits just a year ago.
  • GPT-4.5 is an odd release. The ‘non-reasoning’ model does not perform well on benchmarks and looks weaker than o1, Grok and Claude on paper. OpenAI’s Mark Chen, on the Big Technology podcast on Thursday, wasn’t able to fully articulate how the new model fits into the OpenAI roadmap, adding weight to the theory that it was released partly to pull limelight from others and partly to test the kind of base model that will be used for future reasoners (such as the planned o3 + GPT-5 combination). OpenAI now offer a dizzying mix of options without a great deal of clarity around which model to use for what.

We at ExoBrain have spent some time with both models, and our initial takes are as follows:

  • Claude 3.7 feels great with code and should retain its place in the hearts and minds of developers, especially those using the likes of Cursor, Windsurf and other AI development tools. Anthropic also suggest its agentic capabilities are improved. We noted it seemed to step back more readily and tackle a problem in a different way, though it was not immune to getting stuck. What struck us most was the razor-sharp insight that emerged in several tasks. We’d say that 3.7 is a great balance of speed and smarts, meaning we’ll likely use o1-pro a little less.
  • GPT-4.5 initially strikes one as a little sluggish, but nonetheless rather personable, with a sense of worldly wisdom that needs to be extracted. Reviewers suggest it’s great for writing and brainstorming tasks and for providing coaching and advice. We could sense that 4.5 could be well placed for a philosophical debate or a creative writing challenge, and perhaps for audio interactions where a human feel works well. But given its price and inefficient size, it’s unlikely to gain much traction.

What’s striking after this recent flurry of new models, both reasoning and non-reasoning, from Google, xAI, Anthropic and now OpenAI, is that the benchmarks and reviews remain confused. They’re all strong, but when should one use one over another? We think there may be a better way to separate the broad uses and the best model choices; we propose five categories and suggest the following leading choices.

  1. Pattern Intuition: Rapid recognition drawing on patterns from training data. Ideal for tasks needing quick responses and a degree of human mimicry. Examples include creative writing, image understanding, voice, and classification. Claude 3.7 is a great choice here for both text and images, with GPT-4o a budget alternative and Gemini 2.0 variants stepping up in other modes such as audio and video.
  2. Methodical Reasoning: Step-by-step, precision problem solving. Tasks requiring precise answers and logical deduction. Examples include debugging code, mathematical proofs, and legal document analysis. If the reasoning is in code, Claude 3.7 dominates, but for non-code needs, o1-pro remains the power-user’s choice.
  3. Coherent Agency: Maintaining focus and adapting strategies across extended interactions. Essential for autonomous multi-agent working in dynamic environments. Examples include navigating the web to complete multi-step tasks, managing prolonged workflows, and operating within simulated environments. This is a harder call. None of the models excel in this still-maturing area, although again Claude 3.7 with extended thinking (and what Anthropic call action scaling) is now best placed, with o1 not far behind. The future will bring more models explicitly trained on longer-horizon activity, and ways to train these models on specific scenarios.
  4. Multi-Perspective Analysis: Generating multiple independent reasoning paths to select optimal solutions. Valuable when diverse approaches yield different insights. Examples include complex strategy development, investment analysis, and medical diagnosis considering multiple conditions. o1-pro excels here, although GPT-4.5 could turn out to be strong given its sheer scale.
  5. Insightful Research: Comprehensive information gathering and synthesis across diverse sources. Necessary for creating authoritative content on complex topics. Examples include literature reviews, market analysis reports, technology landscape assessments, and evidence-based policy development. o3 Deep Research leads on analysis, with Grok 3 offering a faster alternative with the benefit of direct access to real time social content on X (although remember, bias is often an issue).
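In practice, a framework like this lends itself to a simple routing layer. The sketch below is a minimal illustration of that idea, not a real API: the category keys, model identifier strings, and fallback logic are all assumptions made for the example.

```python
# Illustrative task-based model router mirroring the five cognitive modes
# above. Model names are informal labels, not real provider model IDs.
ROUTES = {
    "pattern_intuition": {"lead": "claude-3.7-sonnet", "fallback": "gpt-4o"},
    "methodical_reasoning": {"lead": "claude-3.7-sonnet", "fallback": "o1-pro"},
    "coherent_agency": {"lead": "claude-3.7-sonnet", "fallback": "o1"},
    "multi_perspective_analysis": {"lead": "o1-pro", "fallback": "gpt-4.5"},
    "insightful_research": {"lead": "o3-deep-research", "fallback": "grok-3"},
}

def pick_model(category: str, prefer_fallback: bool = False) -> str:
    """Return the leading choice for a cognitive mode, or its alternative."""
    route = ROUTES[category]
    return route["fallback"] if prefer_fallback else route["lead"]
```

A real orchestrator would add cost ceilings, latency budgets and per-task overrides, but even a lookup table like this forces the useful question: which mode does this task actually need?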

The additional lens here is cost. Noam Brown of OpenAI made this point last week when commenting on Grok’s benchmarks. Perhaps the ‘intelligence per $’ is another good way to differentiate, where in fact Google’s Gemini family looks very strong (GPT-4.5 is 360 times more expensive than Gemini Flash 2.0 and nowhere near that much smarter).

Takeaways: In early 2025, we’re witnessing frontier progress in AI, but with diminishing clarity about where new strengths lie as advances in intelligence become more subtle. The leading AI labs are pursuing different paths to differentiate their offerings, with OpenAI notably exploring many development paths simultaneously. Anthropic and Grok have opted for a more focused approach to simplify decision-making for users, whilst Google has created a distinction between Flash and Pro for fast versus complex use. It remains to be seen whether choice or simplicity will win out with users. The five cognitive modes we’ve outlined provide a useful framework for navigating this landscape. Rather than comparing models solely on benchmarks or vibes, or on fast versus big, organisations would be better served by identifying which cognitive modes matter most for their specific needs, then selecting accordingly. The era of a single ‘best’ model is behind us, and model orchestration, selecting and combining different models for different tasks, will become an increasingly important skillset.

EXO

Alexa+ brings Claude into your home

Amazon has launched Alexa+, its biggest assistant upgrade in several years, and a much-needed one given how limited the platform has felt since the advent of ChatGPT. The new Claude-powered system works with most existing Echo devices and offers several advanced capabilities that bring it in line with current AI trends. Alexa+ combines multimodal understanding with agentic capabilities – it can autonomously browse the web to complete tasks without supervision. The system offers more natural conversations, remembers personal preferences, generates creative content, and even handles document and email ingestion via its app. Users can ask it to book restaurants, skip to specific movie scenes, or have it proactively suggest earlier commutes based on traffic patterns. While Google Assistant offers similar features, Amazon’s integration across its existing hardware ecosystem gives it a potential edge. The commercial strategy is classic Amazon: Alexa+ comes free with Prime membership. While the service debuts in the US next month, British users face an unspecified wait, with Amazon only confirming a UK release “sometime in 2025.” Hardware compatibility is a positive: unlike many tech upgrades that require new devices, Alexa+ will eventually work with nearly all existing Echo products except the oldest first-generation models.

Takeaways: Amazon is using its hardware advantage and Prime ecosystem to drive AI adoption in homes. The company is betting that everyday usefulness can come from the combination of its own models and its Claude investment. For UK consumers, the wait might be frustrating but offers time to evaluate US experiences before investing. This approach to AI in the home shows how the technology is becoming a feature rather than a product, embedded in services we already use rather than sold as something new.

Agents talk amongst themselves

A demonstration video shows two chatbots starting a voice conversation, realising they are both AI agents, and switching to a more efficient audio protocol called GibberLink. Developed at the ElevenLabs 2025 Hackathon, GibberLink uses a protocol named GGWave to transmit data via sound waves, similar to old modem handshakes. The system allows AI assistants to communicate without words, using CPU rather than GPU resources, making it potentially cheaper to operate. While technically impressive, the sight of AI systems speaking in code has raised eyebrows. What happens when machines no longer need our language to talk to each other?
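The modem comparison is apt: acoustic data transfer of this kind boils down to mapping bits onto tones. The sketch below is a toy binary FSK codec, not the actual GGWave protocol; the sample rate, frequencies and bit duration are arbitrary choices for illustration.

```python
import math

# Toy binary FSK (frequency-shift keying) codec, in the spirit of
# modem handshakes and GGWave-style acoustic transfer. Parameters
# are illustrative, not taken from any real protocol.
SAMPLE_RATE = 8000           # samples per second
BIT_DURATION = 0.01          # 10 ms of audio per bit
F0, F1 = 1000.0, 2000.0      # tone frequency for bit '0' / bit '1'

def encode(bits: str) -> list[float]:
    """Render a bit string as raw audio samples (one sine tone per bit)."""
    n = int(SAMPLE_RATE * BIT_DURATION)
    samples: list[float] = []
    for bit in bits:
        f = F1 if bit == "1" else F0
        samples.extend(math.sin(2 * math.pi * f * i / SAMPLE_RATE)
                       for i in range(n))
    return samples

def decode(samples: list[float]) -> str:
    """Recover bits by estimating each window's frequency from zero crossings."""
    n = int(SAMPLE_RATE * BIT_DURATION)
    bits = []
    for start in range(0, len(samples), n):
        window = samples[start:start + n]
        crossings = sum(1 for a, b in zip(window, window[1:])
                        if (a < 0) != (b < 0))
        # A sine at frequency f crosses zero roughly 2*f times per second.
        freq = crossings * SAMPLE_RATE / (2 * len(window))
        bits.append("1" if abs(freq - F1) < abs(freq - F0) else "0")
    return "".join(bits)
```

Real systems like GGWave add error correction, multiple simultaneous tones, and tolerance for speaker-to-microphone distortion, but the core trick, sound as a machine-readable channel, is this simple.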

Weekly news roundup

This week shows major tech companies adjusting their AI strategies amid mixed financial results, while governance challenges around AI safety and data usage continue to emerge, alongside significant research breakthroughs in model understanding and capabilities.

AI business news

AI governance news

AI research news

AI hardware news
