Super-size my training run

This week, research outfit Epoch published an analysis exploring AI compute growth towards 2030. Their findings suggest many thousand-fold increases are possible, but not without constraints.

At ExoBrain, we focus on three key drivers of AI progress: compute, data, and algorithms. Compute powers the training of ever-larger and more numerous models. Data fuels these models with knowledge. Algorithms determine how effectively models learn and operate. Epoch’s analysis aligns with our view, diving deep into the compute aspect and its potential constraints.

Epoch outlines several trajectories for AI scaling. The most constrained scenario, hampered mainly by energy supply limitations, suggests 5,000 to 150,000-fold increases relative to GPT-4-scale training runs by 2030. Their upper bound, limited mostly by the fundamentals of data transfer and latency, envisions a 10-million-fold increase. The Epoch team’s central conclusion is striking: “…by 2030 it will be very likely possible to train models that exceed GPT-4 in scale to the same degree that GPT-4 exceeds GPT-2 in scale” (a 10,000-fold increase).

But power constraints loom large. Bigger systems will require a leap from current 1-5 GW single-location facilities to 40+ GW through distributed systems, potentially achievable by Google, which has a network of datacentres across the US. Chip manufacturing is another near-term constraint. TSMC and others could theoretically ramp up to produce hundreds of millions of H100-equivalent GPUs, but this will require continued expansion of semiconductor fabrication capacity. Data availability presents another hurdle, with 2030 models potentially requiring up to 20 quadrillion effective tokens for training, more than most can envisage securing in the near term.

Epoch also describes a “latency wall”, the ultimate constraint. As models grow, the minimum time to process a single datapoint increases. This latency accumulates across the model’s layers and training iterations, potentially setting an upper bound on model size and training data for a given timeframe. Epoch estimates that the wall would only be reached at the largest scales of AI systems, which highlights the need for innovative solutions in model design and hardware architecture.

Individual training runs are also going to hit limits in terms of hard cash. Recent training-run costs illustrate the trajectory, and also the financial headwinds facing individual companies, no matter how big. GPT-4, trained in 2022 by OpenAI on Microsoft’s kit, cost an estimated $100 million and used 2e25 FLOP (floating point operations, the measure of the total calculations performed). Two years later, Llama 3.1 required $600 million and 3.8e25 FLOP. Extrapolating this trend paints a startling picture for future costs at the frontier. By 2025, we might see GPT-5+ models with $2 billion price tags. Come 2028, a frontier model like GPT-6+ could cost $20 billion, a notable step up from the OpenAI training budget for 2024 of some $3 billion. Epoch’s 2030 projection suggests a staggering $125-250 billion for a technically feasible 2e29 FLOP training run. Such a run would cost approximately 5 times Meta’s entire annual R&D budget.
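
For readers who want to check the arithmetic, here is a rough sketch of the extrapolation in Python. It assumes, purely for illustration, that costs keep growing at the constant yearly rate implied by the two estimates quoted above (GPT-4 in 2022 and Llama 3.1 in 2024). The function names are our own, and the trend line is a simplification rather than Epoch’s method.

```python
# Illustrative only: assumes a constant year-on-year cost multiplier
# fitted to the two estimated data points quoted in the article.

def annual_growth(cost_start, year_start, cost_end, year_end):
    """Constant yearly cost multiplier implied by two (year, cost) points."""
    return (cost_end / cost_start) ** (1 / (year_end - year_start))


def project_cost(cost_ref, year_ref, growth, year):
    """Cost in `year` if the multiplier `growth` holds from the reference year."""
    return cost_ref * growth ** (year - year_ref)


# Estimated training costs in US dollars, as quoted in the article.
GPT4_YEAR, GPT4_COST = 2022, 100e6
LLAMA_YEAR, LLAMA_COST = 2024, 600e6

growth = annual_growth(GPT4_COST, GPT4_YEAR, LLAMA_COST, LLAMA_YEAR)

projections = {
    year: project_cost(LLAMA_COST, LLAMA_YEAR, growth, year)
    for year in (2025, 2028, 2030)
}

# Compute scale-up from GPT-4 (2e25 FLOP) to the 2030 run (2e29 FLOP).
compute_ratio = 2e29 / 2e25

print(f"Implied cost growth: {growth:.2f}x per year")
for year, cost in projections.items():
    print(f"{year}: ${cost / 1e9:.1f} billion")
print(f"Compute scale-up vs GPT-4: {compute_ratio:,.0f}x")
```

Under that assumption the implied growth is a little under 2.5x per year, which gives roughly $1.5 billion in 2025, about $22 billion in 2028 and about $130 billion in 2030. That is in the same ballpark as the figures above and sits at the low end of Epoch’s $125-250 billion range.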

Takeaways: This exponential growth aligns with our ExoBrain view that AI capabilities will continue to advance rapidly, driven by massive compute increases. But this surge clearly won’t all be consumed by a few labs or big tech firms alone. What makes this journey exciting is the uses and applications that will be unlocked for the many, and that we have yet to imagine. This new research is important reading for anyone who wants to understand the underlying maths when weighing the many arguments about future AI trajectories. When the broader picture on compute is examined, limits certainly exist, but there appears to be plenty of headroom for the next few years at least.
