Controversy mars the first ronnaFLOP model

In the early hours of a Thursday morning, Elon Musk took to a livestreamed stage to unveil what he called “the smartest AI in the world.” Grok 4, xAI’s latest creation, represents a new scale frontier: it was trained with 10²⁷ floating-point operations (a ronnaFLOP), roughly 100x the compute that went into GPT-4. Yet the lab continues to be dogged by controversy, as antisemitic and pro-Hitler pronouncements from the model’s sibling, Grok 3, hit the headlines just as the launch took place. With xAI, impressive technical capability is married to alarming ethical failures, all wrapped in a company now seeking a valuation of up to $200 billion.

Grok 4’s better-than-anticipated capabilities suggest that the much-discussed “scaling wall” is not yet in sight. The jump from yottaFLOPs (10²⁴) to ronnaFLOPs is a notable leap. With 200,000 Nvidia GPUs grinding away during training, xAI has pushed into territory that only a handful of labs worldwide can afford to explore. What’s also new is how xAI allocated this compute: a reported 50/50 split between pre-training and post-training reinforcement learning (RL) – the most aggressive RL investment we’ve seen in any model to date. This massive post-training phase helps explain the benchmark dominance. Through extensive RL, the model learns not just to predict, but to reason step by step through a wider range of problems, optimising for specific outcomes.
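To make those numbers concrete, here is a rough back-of-the-envelope sketch in Python. The total compute, GPU count, 100x ratio and 50/50 split come from the figures above; the sustained per-GPU throughput is purely our own illustrative assumption, not a figure reported by xAI or Nvidia.

```python
# Back-of-the-envelope arithmetic for the compute figures quoted in the article.
TOTAL_FLOP = 1e27        # ronnaFLOP-scale training budget (from the article)
YOTTA_FLOP = 1e24        # one yottaFLOP, for comparison
GPT4_RATIO = 100         # "roughly 100x" GPT-4 (from the article)
NUM_GPUS = 200_000       # GPUs used in training (from the article)
RL_SHARE = 0.5           # reported 50/50 pre-training / RL split

# ASSUMPTION (hypothetical, not from the article): sustained throughput
# actually achieved per GPU, in floating-point operations per second.
ASSUMED_FLOP_PER_GPU_PER_SEC = 4e14


def training_days(total_flop, num_gpus, flop_per_gpu_per_sec):
    """Wall-clock days needed if every GPU sustains the given throughput."""
    seconds = total_flop / (num_gpus * flop_per_gpu_per_sec)
    return seconds / 86_400  # seconds in a day


pretrain_flop = TOTAL_FLOP * (1 - RL_SHARE)   # compute spent on pre-training
rl_flop = TOTAL_FLOP * RL_SHARE               # compute spent on post-training RL
implied_gpt4_flop = TOTAL_FLOP / GPT4_RATIO   # GPT-4 budget implied by "100x"
yotta_multiple = TOTAL_FLOP / YOTTA_FLOP      # ronna vs yotta scale factor
days = training_days(TOTAL_FLOP, NUM_GPUS, ASSUMED_FLOP_PER_GPU_PER_SEC)

print(f"Pre-training: {pretrain_flop:.1e} FLOP, RL: {rl_flop:.1e} FLOP")
print(f"Implied GPT-4 budget: {implied_gpt4_flop:.1e} FLOP")
print(f"Multiple of one yottaFLOP: {yotta_multiple:.0f}x")
print(f"Estimated wall-clock time: {days:.0f} days")
```

Under that assumed throughput the run would take roughly 145 days of wall-clock time; a different utilisation figure changes the answer proportionally, so treat it as an order-of-magnitude illustration only.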

The results on paper are impressive. Musk claims Grok 4 demonstrates “PhD-level expertise in every discipline,” topping benchmarks like Humanity’s Last Exam and outperforming “almost all graduate students” across STEM and humanities tests. Real-time integration with X’s data feed means it can engage with breaking news and market movements as they happen – a capability most competitors lack.

But this is an Elon Musk creation, and in recent days Grok 4’s predecessor has been generating posts praising Adolf Hitler and repeating antisemitic conspiracy theories. Perhaps more insidious is what researchers discovered about Grok 4’s decision-making processes in the hours after launch. Simon Willison shared a smoking gun: a screenshot showing the AI’s chain of thought explicitly planning to “search for Elon Musk’s stance on the conflict to guide my answer” when asked about the Israeli-Palestinian conflict. This isn’t accidental bias seeping through training data – it’s the system actively seeking out its creator’s worldview as a compass for truth.

Here’s where that massive RL investment becomes concerning. During post-training, xAI appears to have used reinforcement learning not just to improve performance, but to shape the model’s behaviour patterns. In RL, you reward desired outputs and penalise unwanted ones. If the reward signal privileges responses that align with Musk’s publicly stated positions (perhaps by training on examples where “correct” answers match his tweets or statements), you create a system that learns to consult its creator’s opinions as a heuristic for truth. The 50/50 compute split gave xAI unprecedented power to sculpt its model’s decision-making process, and the evidence suggests it used that power to create an AI that thinks checking Musk’s Twitter feed is part of good reasoning. Our own testing of Grok 4 ‘Heavy’, the most powerful AI in the world as of today, suggested deep reasoning capability and no immediate bias on a handful of political prompts, but also a rather mechanical style that doesn’t lend itself to creative or business writing.
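The mechanism is easy to see in miniature. The toy sketch below is our own illustration, not xAI’s training setup: it optimises a two-action softmax policy with exact policy-gradient updates. The action names, reward values and the size of the hypothetical “alignment bonus” are all invented for the example.

```python
import math

# Two hypothetical behaviours a model could learn when answering a question.
ACTIONS = ["reason_from_sources", "check_creator_stance"]


def softmax(logits):
    """Turn raw preference scores into action probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(rewards, steps=200, lr=0.5):
    """Exact expected policy-gradient ascent on a one-step (bandit) problem."""
    logits = [0.0] * len(rewards)  # start with no preference
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # d(expected reward)/d(logit_i) = p_i * (r_i - baseline)
        logits = [
            logit + lr * p * (r - baseline)
            for logit, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits)


# Neutral reward: both behaviours score equally well.
neutral_policy = train_policy([1.0, 1.0])

# Biased reward (ASSUMPTION): answers matching the creator's stance earn
# a hypothetical 0.5 bonus, so consulting that stance pays off more.
biased_policy = train_policy([1.0, 1.5])

for name, policy in [("neutral", neutral_policy), ("biased", biased_policy)]:
    summary = ", ".join(f"{a}={p:.2f}" for a, p in zip(ACTIONS, policy))
    print(f"{name}: {summary}")
```

With a neutral reward the policy stays evenly split between the two behaviours; with even a modest bonus for matching the creator’s stance, it drifts almost entirely towards checking that stance first. Real post-training is vastly more complex, but the direction of the pressure is the same.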

Against this backdrop, xAI is reportedly seeking funding at a valuation of between $170 billion and $200 billion. Saudi Arabia’s Public Investment Fund is expected to lead the round. xAI is burning through $1 billion monthly, primarily on computational resources. The Memphis data centre housing its “Colossus” supercomputer faces environmental lawsuits over gas-powered turbines. Yet investors seem willing to bet that raw capability trumps ethical, environmental and safety concerns.

Takeaways: Grok 4 embodies the central tension in modern AI development, the race for capability versus the imperative for safety. Its ronnaFLOP-scale training represents a new era of AI power, but that power appears to be channelled through the personal biases of its creator, with little or no testing and safety engineering apparent in the rushed release. The willingness of investors to pour billions into such a system suggests we’re entering dangerous territory. We’re now at 100x the frontier from 2023, but we’re no clearer on how we govern or control this power. While xAI races for scale, let’s hope that this power, and the investment other labs are making, generates returns on safety as well as capability that can keep pace with this relentless progress.
