AI Token Supply Chain

A four-layer cost-decomposition framework tracing AI tokens from physical silicon to end-consumer application.

The L1–L4 Framework

The AI token supply chain can be decomposed into four layers, each adding cost and margin on top of the previous:

Layer	Name	Description	Key Cost Driver
L1	Physical Compute	Silicon fabs, AI datacenters, power, cooling, networking	Electrical (40-50% of datacenter build), cooling (~25%)
L2	GPU Compute	GPU clusters, training runs, inference serving	H100-equivalent GPU hours, cluster utilization
L3	Model & API	GPU hours → token API, training + inference flows	Model architecture efficiency, inference optimization
L4	Application & Consumer	Apps, subscriptions, per-call agent payments	End-user pricing, agent consumption patterns

This framework was developed as an analytical tool for understanding where value accrues in the AI stack and identifying margin compression / expansion points at each layer [local: 2026-05-17-ai-infrastructure.md].

L1 — Physical Compute

The physical layer includes everything that makes GPUs run: silicon fabrication, datacenter construction, power infrastructure, and cooling.

Datacenter build costs (10MW GPU datacenter): Electrical infrastructure accounts for 40-50% of total construction cost, making it the single largest line item. With cooling at roughly 25%, electrical and cooling combined represent approximately 70% of build cost. Key suppliers include equinix, digital-realty, qts, iron-mountain, and crusoe-energy for colocation/datacenter operations, and vertiv, schneider-electric, and siemens for electrical/cooling equipment.
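
The build-cost shares above can be turned into a rough dollar split. This is a minimal sketch, assuming the midpoint of the 40-50% electrical range and the ~25% cooling figure. The function name and the total budget (an assumed $10M per MW) are hypothetical placeholders, not figures from the source.

```python
def build_cost_split(total_cost_usd, electrical_share=0.45, cooling_share=0.25):
    """Split a datacenter build budget into electrical, cooling, and everything else.

    electrical_share defaults to the midpoint of the 40-50% range and
    cooling_share to the ~25% figure; both are approximations.
    """
    if electrical_share < 0 or cooling_share < 0 or electrical_share + cooling_share > 1:
        raise ValueError("shares must be non-negative and sum to at most 1")
    electrical = total_cost_usd * electrical_share
    cooling = total_cost_usd * cooling_share
    return {
        "electrical": electrical,
        "cooling": cooling,
        "other": total_cost_usd - electrical - cooling,
    }


# Hypothetical budget: 10 MW at an assumed $10M per MW (placeholder value).
split = build_cost_split(10 * 10_000_000)
for item, usd in split.items():
    print(f"{item:>10}: ${usd / 1e6:.0f}M")
```

Passing electrical_share=0.40 and then 0.50 brackets the range instead of relying on the midpoint.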

Liquid cooling has split into two mainstream routes: direct-to-chip liquid cooling (DLC) using cold plates, and immersion cooling. Both are seeing rapid adoption as GPU power density increases (H100 → B200 → Rubin).

Chip supply chain: tsmc fabricates the advanced nodes; nvidia designs the GPUs; sk-hynix supplies HBM (High Bandwidth Memory), which is the bandwidth bottleneck for AI workloads. HBM3E and upcoming HBM4 are critical to relieving the memory wall that limits inference throughput.

L2 — GPU Compute

GPU clusters aggregate L1 physical assets into fungible compute. Key dynamics:

Training clusters: Large-scale, high-utilization, concentrated among hyperscalers and frontier labs. Cost is measured in GPU-hours (cluster size × training duration) multiplied by the price per GPU-hour.
Inference clusters: Distributed, variable utilization, served by both hyperscaler clouds (aws, google-cloud, microsoft-azure) and specialized GPU clouds (coreweave, lambda-labs, together-ai, vast-ai, runpod).
GPU cloud rental economics: A secondary market exists where GPU cloud providers rent capacity to model providers and application developers. Revenue from GPU cloud rental is nested within the broader AI revenue picture — model providers pay GPU clouds, then sell tokens to applications.
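
The cluster economics above reduce to simple arithmetic: GPU-hours are GPU count times duration, and idle capacity raises the effective price of each useful hour. This is a minimal sketch; all function names, prices, and utilization values are hypothetical, chosen only for illustration.

```python
def gpu_hours(num_gpus, duration_hours):
    """Total GPU-hours consumed by a cluster over a period."""
    return num_gpus * duration_hours


def cluster_cost(num_gpus, duration_hours, price_per_gpu_hour):
    """Cost of running (or renting) the cluster for the whole period."""
    return gpu_hours(num_gpus, duration_hours) * price_per_gpu_hour


def cost_per_useful_gpu_hour(price_per_gpu_hour, utilization):
    """Effective price of one productive GPU-hour when part of capacity sits idle."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price_per_gpu_hour / utilization


# Hypothetical training run: 1,000 GPUs for 30 days at an assumed $2.00 per GPU-hour.
print(cluster_cost(1_000, 30 * 24, 2.00))

# Hypothetical inference cluster at 50% utilization: each useful hour costs twice the list price.
print(cost_per_useful_gpu_hour(2.00, 0.5))
```

The second function shows why variable utilization on inference clusters matters: the same rental price yields a very different cost per served token depending on how busy the hardware is.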

L3 — Model & API

This is where GPU hours are converted into token outputs through two converging flows:

Training flow: Raw GPU compute → data + algorithms → trained model weights. Major labs (openai, anthropic, google-deepmind, meta) and open-weight labs (deepseek, kimi, qwen, zhipu, minimax) compete on training efficiency and model quality.

Inference flow: Trained weights + user prompts → token generation. Inference engines (see ai-inference-engines) handle batching, scheduling, and KV cache management. Key cost optimizations: speculative decoding, disaggregated inference (separating prefill from decode), FP8 quantization, and multi-head latent attention (MLA).

API layer: Models are sold as token APIs through direct endpoints (OpenAI API, Anthropic API, DeepSeek API) and through aggregators like openrouter. Aggregators add a thin margin but provide routing, fallback, and provider competition that can reduce effective cost.
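
The routing-with-fallback behavior described above can be sketched as follows. This is an illustrative toy, not the actual algorithm of openrouter or any other aggregator: the Provider fields, the 5% margin, and the provider names and prices are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    price_per_mtok: float  # USD per 1M tokens (hypothetical)
    capable: bool          # whether the provider can serve the requested model/task
    available: bool = True  # whether the endpoint is currently up


def route(providers, margin=0.05):
    """Return (provider name, price including aggregator margin).

    Tries capable providers from cheapest to most expensive and falls back
    to the next one when a provider is unavailable.
    """
    candidates = sorted(
        (p for p in providers if p.capable), key=lambda p: p.price_per_mtok
    )
    for provider in candidates:
        if provider.available:
            return provider.name, provider.price_per_mtok * (1 + margin)
    raise RuntimeError("no capable provider is available")


# Hypothetical provider pool: the cheapest capable provider is down,
# so the request falls back to the next cheapest one.
pool = [
    Provider("provider-a", 1.00, capable=True, available=False),
    Provider("provider-b", 2.00, capable=True),
    Provider("provider-c", 0.50, capable=False),
]
print(route(pool))
```

Even with the margin added, the routed price can undercut a fixed direct endpoint whenever a cheaper capable provider is available, which is the sense in which aggregation can reduce effective cost.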

Cost decomposition (1M input + 100 output tokens): Using Kimi K2.5 as an anchor case, the framework traces per-unit costs from H100 GPU hours through training amortization and inference serving to arrive at API pricing. Each layer's gross margin can be estimated by comparing the input cost from the layer below to the output price at the current layer [local: 2026-05-17-ai-infrastructure.md].
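
The per-layer margin estimate described above can be sketched as a short calculation. The unit prices below are hypothetical placeholders for one fixed workload, not the Kimi K2.5 figures. The sketch also simplifies by treating the price paid to the layer below as the only cost, ignoring each layer's own operating expenses.

```python
def gross_margin(input_cost, output_price):
    """Gross margin of a layer: share of its price not paid to the layer below."""
    if output_price <= 0:
        raise ValueError("output_price must be positive")
    return (output_price - input_cost) / output_price


def layer_margins(prices):
    """Compute margins for an ordered list of (layer, unit price) pairs.

    The first entry is treated as the base cost, so margins are
    reported from the second layer upward.
    """
    margins = {}
    for (_, below_price), (layer, price) in zip(prices, prices[1:]):
        margins[layer] = gross_margin(below_price, price)
    return margins


# Hypothetical unit prices (USD) for the same workload at each layer.
stack = [("L1", 0.20), ("L2", 0.40), ("L3", 1.00), ("L4", 2.00)]
for layer, margin in layer_margins(stack).items():
    print(f"{layer}: {margin:.0%}")
```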

L4 — Application & Consumer

Tokens reach end users through three main paths:

Consumer subscriptions: ChatGPT Plus/Pro, Claude Pro/Max, Perplexity Pro — flat monthly fees for capped or unlimited token access. Consumer subscriptions represent the largest AI revenue segment (~$13-14B annualized for OpenAI alone as of early 2026).
Enterprise API consumption: Direct API usage by businesses for internal tools, customer support, and code generation. Anthropic's $30B annualized revenue is ~80% enterprise (roughly $24B). Claude Code is a significant enterprise consumption vector.
Agent-native payments: Per-call micropayments through protocols like tempo-mpp and x402-protocol, enabling agents to autonomously pay for API calls, search, and tools without human credit card approval (see agentic-commerce).

The Capex–Revenue Imbalance

A central tension in the framework: the AI industry generates approximately $60B in annual revenue (2026), while cumulative and planned capex exceeds $1,000B [1]. The Big 5 (Microsoft, Google, Amazon, Meta, Apple) account for the majority of AI infrastructure spending.

Can hyperscalers recoup their infrastructure investment? The answer depends on AI-native applications reaching scale. Current AI revenue is dominated by API and subscription models; agent-native commerce (the third L4 path above) is still in early adoption. If agent-driven token consumption grows as projected, with agents making thousands of API calls per task, revenue per infrastructure dollar could improve substantially. If not, the industry faces an overbuild scenario in which much of the deployed GPU capacity runs at low utilization [local: 2026-05-17-ai-infrastructure.md].
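
The scale of the imbalance can be made concrete with a back-of-the-envelope calculation using the ~$60B revenue and ~$1,000B capex figures. This is a rough sketch only: the 50% gross margin is an assumption, and the payback figure ignores revenue growth, hardware depreciation, and discounting.

```python
def revenue_per_capex_dollar(annual_revenue_b, capex_b):
    """Annual revenue generated per dollar of capex (both in $B)."""
    return annual_revenue_b / capex_b


def simple_payback_years(capex_b, annual_revenue_b, gross_margin):
    """Years of flat gross profit needed to cover capex (no growth, no discounting)."""
    if annual_revenue_b <= 0 or gross_margin <= 0:
        raise ValueError("revenue and margin must be positive")
    return capex_b / (annual_revenue_b * gross_margin)


REVENUE_B = 60.0    # approximate annual AI revenue, from the text
CAPEX_B = 1_000.0   # cumulative and planned capex, from the text
ASSUMED_MARGIN = 0.5  # hypothetical gross margin, not from the source

print(f"revenue per capex dollar: {revenue_per_capex_dollar(REVENUE_B, CAPEX_B):.2f}")
print(f"simple payback: {simple_payback_years(CAPEX_B, REVENUE_B, ASSUMED_MARGIN):.1f} years")
```

Varying annual_revenue_b shows how much growth the agent-driven scenario would need to deliver before the payback period approaches the useful life of the hardware.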

Token Cost Trajectory

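Unit token costs are declining through two mechanisms, commercial pressure and technical optimization, and the decline in turn unlocks cost-sensitive applications: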

Commercial pressure: Provider competition (OpenAI vs Anthropic vs DeepSeek vs Kimi vs Qwen) is driving API prices down. OpenRouter and other aggregators further compress margins by routing to the cheapest capable provider.

Technical optimization: Inference acceleration techniques — speculative decoding, KV cache sharing, disaggregated prefill/decode, FP8 quantization, and architectural innovations like MLA — continue to reduce the GPU hours required per token. Each generation of hardware (H100 → B200 → Rubin) also improves throughput.
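
As one illustration of how these techniques cut GPU work per token, the sketch below estimates the gain from speculative decoding. It uses the standard closed-form expectation under a simplifying assumption: each drafted token is accepted independently with the same probability. The acceptance rate and draft length are hypothetical, and the cost of running the draft model is ignored.

```python
def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens produced per target-model verification pass.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability acceptance_rate; the target model always contributes
    one token itself, so the result is between 1 and draft_len + 1.
    """
    if not 0 <= acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1:
        return float(draft_len + 1)
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


# Hypothetical setting: 4 drafted tokens per step, 80% acceptance.
tokens = expected_tokens_per_step(0.8, 4)
print(f"tokens per target pass: {tokens:.2f}")
print(f"target passes per token: {1 / tokens:.2f}")
```

Fewer target-model passes per generated token translates into fewer GPU-hours per token, provided the draft model is cheap relative to the target.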

Cost-sensitive applications: Several use cases are currently bottlenecked by token cost rather than model capability — long-context agent loops, real-time video understanding, and exhaustive codebase analysis. As unit costs fall, these applications become economically viable, potentially driving the next wave of token demand growth [local: 2026-05-17-ai-infrastructure.md].
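
Whether a token-heavy application is economically viable can be checked with a simple per-task cost model. This is a sketch under stated assumptions: the call counts, token counts, prices, and the value assigned to a completed task are all hypothetical.

```python
def task_cost_usd(calls, input_tokens_per_call, output_tokens_per_call,
                  input_price_per_mtok, output_price_per_mtok):
    """Token cost of one agent task, with prices quoted in USD per 1M tokens."""
    input_cost = calls * input_tokens_per_call / 1_000_000 * input_price_per_mtok
    output_cost = calls * output_tokens_per_call / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


def is_viable(cost_usd, value_per_task_usd):
    """A task is viable when its token cost does not exceed the value it delivers."""
    return cost_usd <= value_per_task_usd


# Hypothetical long-context agent loop: 1,000 calls, 20k input and 500 output tokens each.
today = task_cost_usd(1_000, 20_000, 500, input_price_per_mtok=1.0, output_price_per_mtok=4.0)
cheaper = task_cost_usd(1_000, 20_000, 500, input_price_per_mtok=0.5, output_price_per_mtok=2.0)

# Assume a completed task is worth $15 to the user (placeholder value).
print(today, is_viable(today, 15.0))
print(cheaper, is_viable(cheaper, 15.0))
```

In this toy setting, halving unit prices moves the task from uneconomic to viable, which is the mechanism by which falling costs could drive new token demand.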

Sources

[1] AI industry revenue ~$60B, capex >$1,000B: local analysis from Big 5 financial disclosures, cross-referenced in 2026-05-11 research session [local: 2026-05-17-ai-infrastructure.md]
Datacenter build costs, electrical/cooling ratios: local L1 equipment supplier research, May 2026 [local: 2026-05-17-ai-infrastructure.md]
OpenAI consumer subscription ~$13-14B: local research session 2026-05-11 [local: 2026-05-11-openclaw.md]
Anthropic $30B annualized, 80% enterprise: local research session 2026-05-11 [local: 2026-05-11-openclaw.md]
Inference optimization techniques: local research 2026-05-15 [local: 2026-05-17-ai-infrastructure.md]