This week's chart shows Z.ai's GLM-5.2 landing within a few points of Claude Opus 4.8 on long-horizon coding tasks. It is a 744-billion-parameter mixture-of-experts model with a 1-million-token context window, and the industry response has been very positive: many engineers now rate it as effectively on a par with Opus 4.7 and 4.8 and ahead of GPT-5.5.
GLM-5.2 ships under an MIT licence, so the weights can be downloaded, modified, and run inside an organisation's own boundary, with no provider able to revoke access. As Article 1 this week sets out, the US government's decision to disable and block Fable 5 has made that property concrete. When a frontier model can be switched off by a foreign government, the ability to host a near-equivalent model yourself stops being a preference and becomes a continuity requirement.
So how would you actually run it? The most accessible route uses Unsloth's dynamic quantisation, which shrinks the full model to 239GB at 2-bit, enough to fit on a 256GB unified-memory Mac Studio. That path suits one or two users and has been shown to write complete, working software on the first attempt. For a team, the economics change shape. A single EU-hosted three-GPU Blackwell box with around 540GB of VRAM, running vLLM with an NVFP4 checkpoint, can serve roughly eight heavy agentic engineers at near-frontier capability for about £400 to £540 per engineer each month.
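The memory figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: it counts weight storage alone, treats 1GB as 10^9 bytes, and assumes NVFP4 averages roughly 4.5 bits per weight (4-bit values plus one 8-bit scale per 16-weight block). That 4.5-bit figure and the helper names are assumptions for illustration, not numbers published for GLM-5.2.

```python
PARAMS_B = 744  # total parameters in billions (from the article)


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) at an average bit width."""
    return params_billion * bits_per_weight / 8


def effective_bits(params_billion: float, size_gb: float) -> float:
    """Average bits per weight implied by a checkpoint of the given size."""
    return size_gb * 8 / params_billion


# A uniform 2-bit quantisation would need about 186 GB...
nominal_2bit_gb = weights_gb(PARAMS_B, 2.0)

# ...so a 239 GB "2-bit" dynamic quant implies about 2.57 bits per weight on
# average, consistent with some layers being kept at higher precision.
avg_bits_dynamic = effective_bits(PARAMS_B, 239)

# Headroom left on a 256 GB Mac Studio for the OS, KV cache and activations.
mac_headroom_gb = 256 - 239

# ASSUMPTION: NVFP4 averages ~4.5 bits per weight once block scales are counted.
NVFP4_BITS = 4.5
nvfp4_gb = weights_gb(PARAMS_B, NVFP4_BITS)

# Headroom left on a ~540 GB box for KV cache across concurrent users.
box_headroom_gb = 540 - nvfp4_gb

if __name__ == "__main__":
    print(f"uniform 2-bit weights:    {nominal_2bit_gb:.1f} GB")
    print(f"dynamic quant avg bits:   {avg_bits_dynamic:.2f}")
    print(f"Mac Studio headroom:      {mac_headroom_gb} GB")
    print(f"NVFP4 weights (assumed):  {nvfp4_gb:.1f} GB")
    print(f"540 GB box headroom:      {box_headroom_gb:.1f} GB")
```

Under these assumptions the Mac Studio route leaves only about 17GB spare, which is one reason it suits one or two users, while the server route keeps over 100GB free for the long-context caches that several concurrent agentic sessions need.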
The constraint has not disappeared. You still need fast memory measured in hundreds of gigabytes, the cheapest quantisations trade away accuracy on harder work, and a self-hosted model running at a near-100% duty cycle needs real operational care. This is a workstation and server story, not a laptop one. But the option now exists where it did not before.
Takeaways: the open-weight question has shifted from capability to control. GLM-5.2 shows a frontier-class model can be self-hosted today, under a licence no government can revoke, at a cost that competes with metered seats once usage is heavy. With Fable 5 blocked, that is no longer theoretical. The practical step for any organisation handling sensitive work is to pilot a self-hosted model now, learn what local inference really costs in memory and latency, and stop assuming that access to frontier capability is something only a vendor can grant.
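For anyone planning a pilot, the per-engineer figure is worth re-deriving for their own team size. The sketch below assumes the quoted £400 to £540 is simply the monthly cost of the box divided evenly across eight engineers, which implies roughly £3,200 to £4,320 per month for the box. The metered seat price used in the break-even helper is a purely hypothetical placeholder, not a quoted vendor rate.

```python
import math

ENGINEERS_QUOTED = 8
PER_ENGINEER_GBP = (400, 540)  # monthly range from the article

# ASSUMPTION: the box is a fixed monthly cost shared evenly by its users.
box_monthly_gbp = tuple(p * ENGINEERS_QUOTED for p in PER_ENGINEER_GBP)


def per_engineer(box_cost_gbp: float, engineers: int) -> float:
    """Monthly cost per engineer when a fixed box cost is split evenly."""
    if engineers < 1:
        raise ValueError("need at least one engineer")
    return box_cost_gbp / engineers


def break_even_engineers(box_cost_gbp: float, metered_seat_gbp: float) -> int:
    """Smallest team size at which the shared box is no dearer than metered seats."""
    return math.ceil(box_cost_gbp / metered_seat_gbp)


# With only four heavy users the per-engineer cost doubles.
half_team = tuple(per_engineer(c, 4) for c in box_monthly_gbp)

# HYPOTHETICAL metered spend per heavy user per month; substitute a real invoice.
HYPOTHETICAL_METERED_GBP = 500
break_even = tuple(
    break_even_engineers(c, HYPOTHETICAL_METERED_GBP) for c in box_monthly_gbp
)

if __name__ == "__main__":
    print("implied box cost per month (GBP):", box_monthly_gbp)
    print("per engineer with 4 users (GBP):", half_team)
    print("break-even team size:", break_even)
```

The point of the exercise is that self-hosting is a fixed cost: it only competes with metered seats when the box is kept busy, so a pilot should measure real utilisation before committing.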
