ExoBrain

DeepSeek does more with less

DeepSeek's open-sourced DSpark speeds up model serving by 57 to 85% with no loss of quality, and it works across model families. Efficiency, not raw capability, is becoming the most valuable work in AI.

Joel Miller


Models keep getting bigger, memory keeps getting scarcer, and compute is still in short supply. That combination has pushed one skill to the top of the market: running models for less. Last week DeepSeek open-sourced DSpark, and it's another example of the lab's ingenuity in the face of constrained resources.

LLMs generate text one token at a time, and each token forces the machine to read the model's entire set of weights from memory just to produce a single token, often only a fragment of a word. The hardware spends most of its time waiting on memory, not calculating. A stack of techniques has grown up to fix this. The "KV cache" stores the attention results for earlier tokens so they aren't recomputed. "PagedAttention" packs that cache tightly so more requests fit at once. "Continuous batching" keeps the processor busy by swapping requests in and out as they arrive. Combined, they let a server handle several times more traffic than a naive setup.
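A back-of-envelope calculation shows why memory, not arithmetic, sets the pace. The sketch below is illustrative only: the function name and the figures (a 70-billion-parameter model, 2 bytes per weight, 2,000 GB/s of memory bandwidth) are assumptions chosen for round numbers, not measurements of any particular chip or model.

```python
def decode_tokens_per_second(params_billion: float,
                             bytes_per_param: float,
                             bandwidth_gb_s: float,
                             batch_size: int = 1) -> float:
    """Rough ceiling on decode speed when reading weights is the bottleneck.

    Ignores KV-cache reads and compute limits, so real numbers will be lower.
    """
    model_gb = params_billion * bytes_per_param    # billions of params x bytes each = GB
    passes_per_second = bandwidth_gb_s / model_gb  # one full weight read per forward pass
    # Every request in the batch gets one token from the same weight read.
    return passes_per_second * batch_size


# Assumed, illustrative figures: 70B parameters, 2 bytes each, 2,000 GB/s bandwidth.
single = decode_tokens_per_second(70, 2, 2000)
batched = decode_tokens_per_second(70, 2, 2000, batch_size=8)
print(f"one request: ~{single:.1f} tokens/s in total")
print(f"batch of 8:  ~{batched:.1f} tokens/s in total")
```

Under these assumptions a lone request tops out at roughly 14 tokens per second, while a batch of eight shares each weight read and reaches roughly 114 in total. That is the headroom batching and cache packing are trying to reclaim.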

"Speculative decoding" is the newest and most interesting layer. A small, fast model guesses the next batch of tokens, then the large model checks them all in a single pass and keeps the ones it agrees with. The answer is identical to normal generation, just produced faster. DSpark is DeepSeek's evolution of this idea, and its trick is twofold: a drafting model that scores the confidence of its own guesses, and a scheduler that tracks how busy the GPU is. It verifies long runs of guesses when there is spare capacity and prunes the low-confidence ones when the machine is saturated, which sidesteps the usual conflict between speculation and heavy batching. The reported gain is 57% to 85% faster generation per user at the same throughput.
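The draft-then-verify loop can be sketched in a few lines. This is a toy illustration of greedy speculative decoding, not DSpark's code: `target_model` and `draft_model` are made-up stand-ins for real networks, and `prune_draft` is a hypothetical policy that only gestures at the confidence-and-load idea described above. The verification loop stands in for what would be a single batched forward pass on real hardware.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # context -> next token (greedy choice)


def target_model(context: List[int]) -> int:
    # Toy "large" model: any deterministic function of the context will do.
    return (sum(context) * 31 + len(context)) % 50


def draft_model(context: List[int]) -> int:
    # Toy "small" model: agrees with the target except when the context
    # length is a multiple of 4. (A real draft model would not call the target.)
    guess = target_model(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def generate_plain(model: Model, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def generate_speculative(target: Model, draft: Model, prompt: List[int],
                         n_new: int, k: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft proposes up to k tokens, each conditioned on its own guesses.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposal; counted here as one verification pass.
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            expected = target(out + accepted)
            accepted.append(expected)   # always keep the target's own token
            if guess != expected:       # first disagreement ends this round
                break
        out.extend(accepted)
    return out, target_passes


def prune_draft(confidences: List[float], gpu_utilisation: float) -> int:
    # Hypothetical policy: keep the longest prefix of drafted tokens whose
    # confidence clears a bar that rises as the GPU gets busier.
    keep = 0
    for c in confidences:
        if c < gpu_utilisation:
            break
        keep += 1
    return keep


prompt = [1, 2, 3]
plain = generate_plain(target_model, prompt, 12)
fast, passes = generate_speculative(target_model, draft_model, prompt, 12)
print(plain == fast, passes)
```

The speculative path returns exactly the tokens plain generation does, while calling on the large model far fewer than 12 times. In the toy policy, an idle GPU lets long runs of guesses through to verification, and a busy one only spends verification on near-certain guesses.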

DeepSeek has tested DSpark on Qwen and Gemma models as well as its own, so the technique works across model families. A hosting business can bolt it onto other open models like GLM-5.2, or better still train an optimised drafting model and lower its serving costs significantly.

DeepSeek keeps publishing methods that other labs might treat as trade secrets, and the logic is partly strategic. Efficiency reduces the need for the most advanced chips, which matters for a Chinese lab working under export controls. Open techniques also build an ecosystem that spreads faster than any single product could. When your rivals are constrained by hardware, making the software cheaper to run is a way to compete on the ground you can actually control.

For businesses and individuals, this is what turns self-hosting from an aspiration into a reality. As open models close the quality gap and efficiency work like DSpark cuts the cost of running them, capable AI on your own hardware becomes practical. The last obstacle is physical. Apple raised Mac and iPad prices by 15 to 25% last week, blaming a memory shortage driven by AI data centres buying up supply, and memory contract prices nearly doubled in the first quarter alone, with another 60% rise in the second. Memory is the bottleneck in the hardware you buy and in the serving stack alike.

Takeaways: The frontier of AI has moved from building smarter models to running them for less, and DeepSeek is handing that capability to the whole field. Bigger models and scarce, expensive memory would normally put advanced AI further out of reach, but efficiency work pulls it back within grasp. The thing standing between us and frontier models on local hardware is now mostly the price of memory, which is exactly why squeezing more from every chip has become the most valuable work in AI.
