Infrastructure

Inference Is the New CUDA

Why the most interesting infrastructure company of the next decade probably isn't a chip company.

By The Memo · Thursday, June 18, 2026


Every cycle in computing produces one company that captures the layer everyone forgot to fight over. In the 2010s, that company was NVIDIA, and the layer was CUDA: not the GPU, but the runtime around it. The fight for AI infrastructure in 2026 looks structurally similar, and it's happening at the inference layer.

Training gets the headlines. Inference gets the bills. By volume, inference compute now exceeds training compute at every major cloud, and the gap is widening every quarter. That single fact reorganizes the value chain.

What's quietly being unbundled is the assumption that the model vendor also owns the runtime. Speculative decoding, KV-cache reuse, paged attention, and continuous batching are no longer research curiosities — they're the difference between a six-cent query and a sixty-cent query at scale. Specialized inference runtimes — vLLM, SGLang, TensorRT-LLM, and a handful of proprietary stacks at the hyperscalers — now routinely deliver 3-8x throughput improvements over naive serving.
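To make one of those techniques concrete, here is a minimal toy sketch of why continuous batching helps. It is not how vLLM, SGLang, or TensorRT-LLM actually schedule work. It assumes every decode step costs the same regardless of how many slots are occupied, ignores prefill and memory limits, and uses made-up request lengths. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until the slowest request is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    queue = deque(lengths)  # requests waiting for a slot
    active = []             # tokens still to generate per in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each in-flight request emits one token; finished ones leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: output lengths in tokens, two long requests among short ones.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

On this workload the static scheduler needs 16 steps and the continuous one needs 10, because short requests stop waiting on long ones. Fewer steps for the same tokens means more queries per GPU-hour, which is where the per-query cost gap comes from. Real gains depend on the workload and on the other techniques listed above.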
