Google I/O was the main AI story this week. Our lead article focused on the product surface: Gemini 3.5, AI Mode, agents, video and Google's claim that it is now serving 3.2 quadrillion tokens a month. But Google released Gemini 3.5 Flash at $1.50 per million input tokens and $9 output, three times the price of the Flash model it replaces. Meanwhile Anthropic tightened Max plan restrictions and OpenAI launched Guaranteed Capacity, letting enterprises commit to one, two, or three-year reservations of inference compute in exchange for discounts. On the surface this suggests that the anticipated token price squeeze is happening. But it's more that we're seeing the mechanics of a commodity market.
In commodity markets, price starts with production. For AI, production starts with silicon, and that cost curve is still falling. NVIDIA's Blackwell generation lowered cost per million tokens roughly 35 times against Hopper. Epoch AI's analysis shows the price for any fixed capability milestone falling between 9 and 900 times per year over the last three years, with a median of 50. Artificial Analysis data shows the same pattern: for a given band of model intelligence, prices keep stepping down.
Cheaper production doesn't mean cheaper access. Google disclosed at I/O that token volume is now seven times last year's level. If demand rises faster than production efficiency, the clearing price moves to the bottleneck.
For inference, the fundamental bottleneck is the aggregate of deployed memory bandwidth. High-bandwidth memory is the binding constraint on throughput today. SK Hynix and Samsung dominate HBM3E and HBM4 supply, allocations are sold through 2026, and accelerator throughput per watt is gated by memory bandwidth rather than logic-die fabrication. Jonathan Ross, formerly of Groq, put it plainly at Sohn this week: until recently, no one was really trying to squeeze more performance out of memory chips, and now they all are.
That supply response has a lead time. In the meantime, OpenAI is signing 3 GW dedicated inference deals with NVIDIA, and Anthropic similar arrangements with SpaceX. OpenAI's Guaranteed Capacity is best read as a forward contract for inference. Buyers who can lock in multi-year reservations get supply security and a discount. Buyers on standard rates pay a higher effective price for the same delivered tokens, and accept more variance in availability. That's a capacity premium, not a production cost increase.
The unit being traded is also changing. Commodity markets need a unit of account. Tokens used to do the job well enough, because pre-reasoning models made them roughly comparable. A token from one model was not identical to a token from another, but the comparison was usable. Reasoning models break that. Gemini 3.5 Flash defaults to dynamic thinking and burns more tokens per delivered task. Anthropic's Opus 4.7 tokeniser maps the same content to between 1.0 and 1.35 times more billable units. Artificial Analysis benchmark runs cost more on Gemini 3.5 Flash at high effort than on the more expensive-looking Gemini 3.1 Pro.
The list price per token has not become more expensive in any clean sense. The token has become a smaller and more variable unit of work.
Reliability has been productised. OpenAI now sells Standard, Priority, Flex and Scale tiers, with Priority running at roughly 1.5 to 2 times Standard rates for SLA-backed throughput, and Flex offering half-price tokens for asynchronous workloads with possible queuing. Once capacity tightens, the single price splits into peak, off-peak, interruptible and reserved supply.
Underneath the frontier tier, the market has gone the other way. Cursor's Composer 2.5 lists at $0.50 input and $2.50 output per million tokens, with cost per task on coding workloads roughly a tenth of frontier alternatives at comparable quality. Composer is built on the open-weight Kimi K2.5 base, with most of the compute spent on Cursor's own post-training and editor integration. Open-weight models from Kimi, DeepSeek, and Qwen are abundant at the low end of the curve, and the price per useful unit of work in this segment is still falling fast.
The result is a bifurcated market. The frontier is in cost-push inflation driven by capacity scarcity, reasoning verbosity and reliability premiums. The open tier is in continued deflation driven by hardware gains, distillation, fine-tunable weights and better tooling. Two buyers in the same industry can experience opposite price trajectories depending on which segment they rely on.
Cost per token is becoming less informative because the token itself is no longer fungible across models, reasoning depths or tokenisers. Cost per outcome, measured against a defined unit of delivered work, is the metric that still holds up.
Takeaways: AI compute now behaves like a commodity market with falling production costs, a memory-bandwidth bottleneck, reliability premiums, reserved capacity and surplus supply underneath the frontier. Less than 1% of AI users are power-users today, and we're already in a compute shortage. Buyers who handle the next eighteen months well will treat AI compute as procurement, not SaaS subscription management. Audit the basket of models in use, measure cost per delivered outcome rather than per token, lock in capacity only where the work justifies the premium, and build model- and harness-agnostic routing across multiple execution lanes so that a tokeniser change, a quota tightening, or a reservation shortage at any single vendor doesn't rewrite the unit economics of the whole stack.
