Nvidia flexes at CES

This image from Jensen Huang’s CES 2026 keynote shows the Vera Rubin NVL144, a single liquid-cooled rack delivering 3.6 exaflops of FP4 inference. Nvidia launched the Rubin platform this week and said production has started, with delivery due in H2. The company claims a 5x boost in inference performance over Blackwell and a 10x reduction in cost per token. Crucially, Rubin also offers a 1.6x improvement in memory bandwidth. This matters because AI inference is increasingly bottlenecked not by calculations but by how quickly data can be fed to the processors. Nvidia’s $20 billion acquisition of Groq, announced just before CES, addresses the same problem from a different angle. Groq’s SRAM-based approach trades capacity for raw speed: it achieves remarkable latency by keeping model weights on-chip, but needs hundreds of chips working together to run even a modest 70 billion parameter model.

But as Huang explained in a Q&A, today’s biggest challenge is workload diversity. Mixture-of-experts models, diffusion models, and state-space models all stress different parts of the system and need different hardware and software capabilities. Nvidia’s pitch is increasingly focused on flexibility: rather than optimise for one workload type, build infrastructure that adapts as demand shifts from morning to night. Nvidia’s aim is to be in the same dominant position this time next year, whatever the model architectures and use cases turn out to be (and however many TPUs Google sells).
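The memory argument in this piece can be made concrete with some rough arithmetic. The sketch below is a back-of-envelope illustration, not Nvidia’s or Groq’s own sizing method. It assumes weights-only storage (no KV cache or activations), roughly 230 MB of SRAM per chip (the figure commonly quoted for Groq’s first-generation part, used here as an assumption), and a purely hypothetical 8 TB/s baseline memory bandwidth. The function names are illustrative.

```python
import math


def chips_needed(params_billion: float, bytes_per_param: float,
                 sram_mb_per_chip: float) -> int:
    """Minimum number of chips whose combined SRAM can hold the model weights.

    Weights only: ignores KV cache, activations, and any replication.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    sram_bytes = sram_mb_per_chip * 1e6
    return math.ceil(weight_bytes / sram_bytes)


def max_tokens_per_second(params_billion: float, bytes_per_param: float,
                          bandwidth_tb_per_s: float) -> float:
    """Rough ceiling on single-stream decode speed for a dense model.

    Assumes each generated token requires reading every weight once,
    so throughput is limited by memory bandwidth rather than compute.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_per_s * 1e12 / weight_bytes


# Assumed figures: 70B parameters, 230 MB SRAM per chip.
print(chips_needed(70, 1, 230))   # 8-bit weights
print(chips_needed(70, 2, 230))   # 16-bit weights

# Hypothetical 8 TB/s baseline versus 1.6x that bandwidth.
baseline = max_tokens_per_second(70, 1, 8.0)
improved = max_tokens_per_second(70, 1, 8.0 * 1.6)
print(round(baseline, 1), round(improved, 1))
```

Under these assumptions a 70 billion parameter model needs roughly 300 to 600 chips depending on precision, which is consistent with the "hundreds of chips" point. In a bandwidth-bound regime, the decode ceiling scales linearly with memory bandwidth, so a 1.6x bandwidth gain lifts it by the same factor.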

