It is not only inference. It is the frontier race.
There is a popular argument in AI right now:
Token prices are low because frontier labs are subsidizing them. When the subsidy ends, many AI startups will collapse.
I think this argument is directionally right, but it points to the wrong root cause.
The concern about rising token costs is not really that inference itself is too expensive. A trained model can serve millions of users and be profitable if it is used efficiently. The real cost pressure comes from the frontier race.
Each lab is pushed to spend billions training the next model before the current model has fully paid back its investment. This shortens the commercial life of every model and pushes the cost back into token prices, subscriptions, enterprise contracts, and API pricing.
The AI industry does not need to stop advancing. But it needs to stop treating speed as the only measure of progress.
Current models are already powerful enough to create enormous business value. The real bottleneck is implementation, governance, economics, safety, and business adoption.
Training is expensive. Inference is different.
To understand the economics, we need to separate two things:
Training = creating the model
Inference = using the trained model
Training a frontier model requires huge GPU clusters, massive datasets, research teams, experiments, failed runs, evaluation, alignment, safety work, and infrastructure.
Inference is different. Once a model is trained, the model weights can be loaded into memory and used to generate answers.
This does not mean inference is free. It still requires GPUs, memory, bandwidth, electricity, data centers, orchestration, and optimization. But it is a different type of cost.
Training is a massive fixed cost.
Inference is a variable usage cost.
The important business question is not only:
How much does one answer cost?
The better question is:
Can the trained model generate enough profit before it becomes strategically outdated?
That is where the real problem starts.
A public example: Llama 3.1 405B
We do not know the exact serving economics of OpenAI’s private frontier models. But we can use a public model as a proxy.
Meta released Llama 3.1 405B as one of the largest openly available foundation models, with 405 billion parameters and a 128K context window. Meta positioned it as competitive with leading closed models across general knowledge, reasoning, tool use, math, and multilingual translation.
NVIDIA says Meta pushed Llama 3 training to over 16,000 H100 GPUs, making the 405B model the first Llama model trained at that scale. NVIDIA also says Llama 3.1 405B fits comfortably in a single HGX H200 system with eight H200 GPUs for inference.
That gives us a useful comparison:
Training scale: 16,000 H100 GPUs
Inference replica: 8 H200 GPUs
One inference replica is tiny compared with the training cluster:
8 / 16,000 = 0.0005 = 0.05%
So one running copy of the model may need around 0.05% of the training GPU count.
This does not mean serving the model to the world is cheap. But it shows the difference between training a frontier model and serving one already-trained model.
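The ratio can be checked in a couple of lines. Both GPU counts are NVIDIA's published figures; the sketch treats one HGX H200 system as a single serving copy of the model:

```python
# Rough ratio of one inference replica to the training cluster,
# using NVIDIA's publicly stated figures for Llama 3.1 405B.
training_gpus = 16_000   # H100s used for training (per NVIDIA)
replica_gpus = 8         # H200s in one HGX H200 inference replica

ratio = replica_gpus / training_gpus
print(f"One replica uses {ratio:.2%} of the training GPU count")  # 0.05%
```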
How many users can one replica serve?
NVIDIA measured Llama 3.1 405B throughput on an HGX H200 system with eight GPUs. In one benchmark with 2,048 input tokens and 128 output tokens per request, Llama 3.1 405B achieved 506 output tokens per second using tensor parallelism in the maximum-throughput scenario.
If we define acceptable chat speed as:
10 output tokens per second per active user
Then one replica can serve roughly:
506 tokens per second / 10 tokens per second per user = ~50 active users
Important distinction:
50 active users means 50 users generating output at the same time.
It does not mean 50 monthly users. It does not mean 50 registered users. It means active concurrent generation.
A platform with 50 active concurrent users could have thousands or tens of thousands of total users, depending on how often they use the product.
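As a quick sketch of that estimate: the throughput figure is NVIDIA's benchmark number, while the 10 tokens-per-second floor for acceptable chat speed is this article's assumption, not an industry standard.

```python
# Back-of-envelope concurrency estimate from NVIDIA's benchmark:
# 506 output tokens/s on one 8x H200 replica (2,048-in / 128-out setup).
replica_throughput_tps = 506   # output tokens/second for one replica
per_user_tps = 10              # assumed acceptable chat speed per active user

concurrent_users = replica_throughput_tps // per_user_tps
print(concurrent_users)  # 50 active concurrent users per replica
```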
Scaling the math
Now let us scale the infrastructure.
If one replica uses eight H200 GPUs, then:
2,000 replicas × 8 GPUs = 16,000 GPUs
Using our earlier estimate:
1 replica = ~50 active concurrent users
2,000 replicas = ~100,000 active concurrent users
So 16,000 GPUs could theoretically support around 100,000 active concurrent users at acceptable chat speed, under this simplified benchmark assumption.
Now let us use a high-end cloud rental estimate.
A January 2026 H200 pricing guide lists H200 cloud rental prices from around $3.72 to $10.60 per GPU hour, with AWS and Azure examples around $10.60 per GPU hour.
Using the high-end estimate:
16,000 GPUs × $10.60/hour × 24 hours = $4,070,400/day
Monthly:
$4.07M/day × 30 days = $122.1M/month
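The fleet-scaling and cost arithmetic above can be reproduced in a few lines. The assumptions carry over from the text: $10.60 per GPU-hour is a high-end cloud list price (not a committed or owned-hardware rate), and the fleet is assumed to run 24/7.

```python
# Scaling the per-replica numbers to a 2,000-replica fleet, then pricing it
# at the article's high-end cloud rental rate.
replicas = 2_000
gpus_per_replica = 8
users_per_replica = 50           # from the earlier throughput estimate
gpu_hour_usd = 10.60             # high-end H200 cloud list price

fleet_gpus = replicas * gpus_per_replica          # 16,000 GPUs
concurrent_users = replicas * users_per_replica   # ~100,000 active users
daily_cost = fleet_gpus * gpu_hour_usd * 24       # about $4,070,400/day
monthly_cost = daily_cost * 30                    # about $122.1M/month
print(fleet_gpus, concurrent_users, round(monthly_cost))
```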
That sounds huge. And it is huge.
But now compare it to a large subscription base.
Assume:
Monthly users: 10M
Paying share: 80% → 8M paying users
Average revenue per paying user: $25/month
Revenue:
8M × $25 = $200M/month
Now compare:
Monthly GPU cost: $122.1M
Monthly revenue: $200.0M
Remaining gross margin before other costs: $77.9M
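The same comparison as a sketch, using the article's assumed figures. Note that the 80% paying share is deliberately generous; real conversion rates are usually far lower, which would shrink the margin.

```python
# Simplified subscription economics vs GPU serving cost (figures from the text).
monthly_users = 10_000_000
paying_share = 0.80          # assumed conversion rate (optimistic)
arpu_usd = 25                # average revenue per paying user per month

revenue = monthly_users * paying_share * arpu_usd   # $200.0M/month
gpu_cost = 122_100_000                              # fleet estimate from above
gross_margin = revenue - gpu_cost                   # $77.9M before other costs
print(f"${gross_margin / 1e6:.1f}M/month gross margin before other costs")
```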
This simplified model shows something important.
A trained model can be profitable at scale.
The problem is not automatically that inference makes profitability impossible. The problem is that the company must also fund research, employees, product development, data centers, free users, failed experiments, safety, support, and most importantly, the next frontier model.
This is where the frontier race becomes the real cost
If a model costs billions to train, the company needs enough time to recover that investment.
Let us use a simple example:
Training cost: $2B
Useful commercial life: 12 months
Training amortization: $166.7M/month
Now combine that with our earlier model:
Revenue: $200.0M/month
Serving cost: -$122.1M/month
Training amortized: -$166.7M/month
Result: -$88.8M/month
The trained model looked profitable when we counted only inference.
But after training cost is amortized over only 12 months, the economics become negative.
Now imagine the model stays commercially strong for 36 months:
$2B / 36 months = $55.6M/month
Then the same business looks very different:
Revenue: $200.0M/month
Serving cost: -$122.1M/month
Training amortized: -$55.6M/month
Result: $22.3M/month
The model did not change.
The users did not change.
The serving cost did not change.
The difference is model lifetime.
That is the core issue.
The faster the industry race, the shorter the useful economic life of each model.
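The lifetime effect fits in one small function. All dollar figures are the illustrative ones used above, and the training cost is amortized straight-line over the model's commercial life:

```python
# How model lifetime changes the monthly result, holding revenue and
# serving cost fixed at the article's illustrative numbers.
def monthly_result(training_cost, lifetime_months,
                   revenue=200.0e6, serving_cost=122.1e6):
    """Revenue minus serving cost minus straight-line training amortization."""
    amortization = training_cost / lifetime_months
    return revenue - serving_cost - amortization

for months in (12, 36):
    result = monthly_result(2.0e9, months)
    print(f"{months}-month life: ${result / 1e6:+.1f}M/month")
# prints -88.8M/month at a 12-month life, +22.3M/month at 36 months
```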
Token prices are not only the cost of answering prompts
When people talk about token prices, they often think:
Token price = cost of answering my prompt
That is incomplete.
A more realistic view is:
Token price = inference cost + infrastructure cost + training recovery + future model investment + business margin
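As an illustration of that breakdown: every component value below is hypothetical and chosen only to make the structure concrete; nothing here reflects any provider's actual cost stack.

```python
# Illustrative decomposition of a per-million-token price. The component
# values are invented; only the breakdown itself follows the formula above.
components_usd_per_mtok = {
    "inference (GPU time)":     1.00,
    "infrastructure / serving": 0.50,
    "training cost recovery":   1.50,
    "next-model investment":    1.00,
    "business margin":          1.00,
}

token_price = sum(components_usd_per_mtok.values())
print(f"${token_price:.2f} per million tokens")  # $5.00 in this sketch
```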
OpenAI’s public pricing already reflects model tiering. Its API pricing page shows higher pricing for frontier models such as GPT-5.5, lower pricing for GPT-5.4 and GPT-5.4 mini, cheaper cached input, and a 50% discount for Batch API workloads.
That tells us something important.
The economics are not just “one token has one cost.”
The industry already uses:
premium models
cheaper models
cached input
batch processing
routing
throughput optimization
This is how serious AI providers manage cost.
But even with these optimizations, the frontier race creates financial pressure. If each new model must be replaced quickly, the business has less time to monetize it.
The industry is not short of intelligence. It is short of implementation.
This is the part that matters most for businesses.
Most companies are not failing because current AI models are too weak.
They are failing because they do not know how to implement AI properly.
They struggle with:
use-case selection
workflow integration
data access
security
governance
quality control
ROI measurement
change management
employee adoption
cost control
The AI industry keeps racing to build smarter models while most businesses have not yet learned how to use the models already available.
This is why I believe the next major bottleneck is not only model intelligence.
The bottleneck is implementation.
A slightly slower frontier race would give companies more time to adopt existing models properly. It would also give AI labs more time to monetize each model generation before needing to replace it.
Slowing reckless escalation is not the same as stopping progress
I am not arguing that OpenAI, Anthropic, Google, Meta, DeepSeek, xAI, or any other AI lab should stop advancing.
That would be unrealistic and probably wrong.
The argument is different:
The industry should stop treating speed as the only measure of progress.
There is a difference between progress and reckless escalation.
Progress means:
better reliability
lower cost
better implementation
stronger safety
clearer governance
better tools
better business outcomes
Reckless escalation means:
bigger models
higher compute spend
shorter model lifetimes
rushed deployment
unclear economics
higher pressure on token prices
The industry needs more of the first and less of the second.
A fair-play environment would help everyone
The public, customers, investors, and governments should not demand a freeze on AI.
They should demand a fair-play environment.
Companies should compete on real value, transparency, reliability, safety, and sustainable economics, not only on who can burn the most capital fastest.
This would help:
customers
startups
enterprise buyers
developers
AI labs
investors
society
It would also help frontier companies themselves.
If the race slows slightly, each trained model has a longer commercial life. Longer model life means better amortization. Better amortization means better margins. Better margins mean less pressure to increase token prices.
In simple terms:
slower depreciation = better economics
Governments are a special case
There is one important exception.
Governments may need frontier AI for national security, research, defense, intelligence, cyber, and strategic competitiveness.
That race may not stop.
But it should not distort the entire commercial economy.
If governments need frontier capability for strategic reasons, that should be treated as a separate strategic layer, not as the economic model forced onto every business and every customer.
The commercial AI economy should not be dragged into an endless compute arms race where every company must pay for strategic competition through rising token prices.
The better conclusion
The original fear is:
AI startups will die when token subsidies end.
That may happen to some companies.
But the better analysis is:
AI companies die when they sell fixed-price outcomes without controlling inference cost per outcome.
And frontier labs struggle when:
each model must be replaced before it has fully paid back its training cost.
The concern about rising token costs is real.
But the reason is not simply that querying AI is too expensive.
The deeper reason is that the industry is trapped in a frontier race where speed has become the main measure of progress.
A trained model can serve millions of users. A trained model can be profitable. But only if it is given enough commercial lifetime, efficient serving, smart routing, responsible pricing, and real business adoption.
The AI industry does not need to stop advancing.
It needs to become more economically mature.
It needs to stop confusing faster model replacement with better progress.
And it needs to focus more on the real bottleneck:
implementation, governance, economics, safety, and adoption
That is where the next stage of AI value will be created.
