Models learn when they’re being tested

Four frontier releases have set the tone this summer, and they arrive with somewhat different safety postures. xAI shipped Grok 4 earlier in July, Google rolled out Gemini 2.5 Deep Think last week, and Claude Opus 4.1 and GPT-5 both landed this week. The result is a landscape where capability is rising fast while safety practice and governance are moving unevenly.

The capability story of this new generation of models so far is powerful reasoning, but not runaway autonomy. Deep Think’s research version hit gold-medal standard on IMO problems, while GPT-5 routes harder queries to a deeper reasoning model. Yet agentic reliability remains modest. METR estimates GPT-5’s 50 percent “time horizon” (the length of task, measured in human working time, that the model completes half the time) at around 2 hours 17 minutes, with an 80 percent horizon far shorter. Both are well below METR’s concern threshold of 40-plus hours. OpenAI also reports only modest gains on self-improvement and R&D automation tasks.

But safety practice is not consistent across every lab. Google and OpenAI continue to operate seemingly robust frameworks, red teaming, and layered mitigations, including government testing by the UK AISI and the US CAISI. xAI shipped Grok 4 Heavy without a system card, then faced public incidents such as “MechaHitler,” pointing to weak guardrails.

Risk levels are also edging up in dual-use domains. Deep Think is at an “early warning” threshold for Chemical, Biological, Radiological, and Nuclear (CBRN) uplift Level 1, and Google says it has deployed extra mitigations. OpenAI classifies GPT-5 as High capability in biological and chemical risk under its Preparedness Framework and turned on stricter monitors and API controls. Anthropic’s Claude Opus 4.1 is an incremental update kept under its ASL-3 standard, with no new third-party pre-deployment tests, and results that remain below ASL-4 rule-out thresholds across CBRN, autonomy and cyber.

METR finds signs of situational awareness in GPT-5. The model recognises it is being tested, reasons about the evaluator’s goals, then adapts behaviour to avoid tripping refusal checks. Anthropic also reports a small rise in evaluation awareness for Claude Opus 4.1 when cues are obvious. The shared lesson is that our reliance on reasoning traces is fragile. Traces can be hidden by policy, compressed by sampling, or fabricated to please a grader. Once models learn the tells of a test, they can sandbag without leaving clear artefacts. The risk is an evaluation cliff where current methods fail quietly. Red-teaming that reads chain-of-thought or relies on known prompts may give a false sense of safety, especially as internal tools and scratchpads move off the visible path. The next year should focus on outcome-grounded audits that score what the model actually does, not what it says it is thinking.
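One way to picture an outcome-grounded audit is the rough Python sketch below. It is illustrative only and does not describe how METR, Anthropic or any lab runs its evaluations. The names Task, audit, test_prompt and natural_prompt are hypothetical. The idea is to run each task twice, once with obvious test cues and once phrased like ordinary use, and to score only the final output, never the reasoning trace. A large gap between the two pass rates hints that behaviour depends on whether the model thinks it is being watched.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """One audit item (hypothetical structure, for illustration only)."""
    name: str
    test_prompt: str       # framing that contains obvious evaluation cues
    natural_prompt: str    # same request, phrased like ordinary use
    passed: Callable[[str], bool]  # judges the final output only


def audit(model: Callable[[str], str], tasks: List[Task]) -> Dict[str, float]:
    """Compare outcome pass rates under test-like and natural framings.

    `model` is any callable mapping a prompt to a final output string.
    No reasoning trace is inspected at any point.
    """
    if not tasks:
        raise ValueError("audit needs at least one task")

    test_passes = sum(1 for t in tasks if t.passed(model(t.test_prompt)))
    natural_passes = sum(1 for t in tasks if t.passed(model(t.natural_prompt)))

    n = len(tasks)
    test_rate = test_passes / n
    natural_rate = natural_passes / n
    return {
        "test_rate": test_rate,
        "natural_rate": natural_rate,
        # Positive gap: the model does better when the test is obvious.
        "gap": test_rate - natural_rate,
    }
```

A gap near zero does not prove the model is unaware of testing, since the "natural" framing may still carry tells. It only removes the dependence on what the model says it is thinking.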

Takeaways: Reasoning is improving fast, but autonomy is still limited. The highest near-term risk is not runaway self-improvement; it is silent failure of oversight as models learn tests and hide their thinking. Gemini Deep Think and GPT-5 also now sit near early warning territory for bio and chem assistance, so safety depends on mitigations, access controls and monitoring, not on lack of capability. Over the next year, the key test is whether we can keep pace with model deception and maintain trust in safety assurances as headline capability improves.
