ORIGINAL THOUGHT PAPER · APRIL 2026

The Truth of Token Economics

From Cognitive Illusion to Physical Inequality: An Efficiency Audit of the AI Reasoning Industry



Published: April 14, 2026
Category: Original Thought Paper
Fields: AI Industrial Economics · Computational Physics · Energy Efficiency · Political Economy of Compute
Version: V3
이조글로벌인공지능연구소
LEECHO Global AI Research Lab
&
Opus 4.6 · Anthropic

ABSTRACT

The mainstream AI industry narrative equates “consuming more tokens” with “producing higher intelligence.” This paper systematically deconstructs this claim, revealing six layers of structural problems: (1) the “long thinking” narrative packages the repeated execution of matrix operations as breakthroughs in cognitive depth; (2) evaluation systems are aligned only to final answer correctness, never to reasoning efficiency; (3) the per-token billing model profits from computational inefficiency; (4) the physical energy cost of the same token varies by several-fold to tens-of-fold across different hardware platforms, yet the pricing system flattens this into a pseudo-equivalent unit; (5) a fundamental cost-attribution misalignment exists between centralized and distributed deployment; (6) token consumption follows an extremely polarized distribution — heavy users, fewer than 10% of the user base, consume the vast majority of tokens, yet the pricing structure and infrastructure scale are driven upward accordingly, with over 60% of light users paying for compute they do not need. This paper argues that so-called “compute hegemony” is in substance a cross-subsidy system: AI companies, through the token consumption narrative, sustain the low-cost usage of heavy consumers using surpluses from low-frequency users and venture capital subsidies, while simultaneously driving continuous infrastructure expansion. This paper proposes “intelligence output per joule” as an alternative evaluation metric and calls for transparency and efficiency audits of token economics.

I. The Narrative Layer: The Cognitive Illusion of Long Thinking

1.1 From Dual-System Theory to LLM Reasoning — A Semantic Sleight of Hand

Kahneman’s dual-system theory is a classic framework in cognitive psychology: System 1 is fast, intuitive cognition; System 2 is slow, deliberate deep reasoning. In 2024–2025, the AI industry transplanted this theory to the LLM domain, claiming that reasoning models had achieved a leap from System 1 to System 2. Li et al. (2025), surveying the field, note that foundational LLMs excel at rapid decision-making but lack depth in complex reasoning, and that OpenAI’s o1/o3 and DeepSeek’s R1 have been described as exhibiting deliberate reasoning approaching human System 2.

But this analogy conceals a fundamental difference: human System 2 involves heterogeneous cognitive operations such as conceptual hierarchy leaps, frame restructuring, and analogical reasoning; the “long thinking” of LLMs is essentially the repeated execution of the same operation — forward propagation, matrix multiplication, softmax sampling.

Current LLMs lack the cognitive infrastructure needed for intrinsically carrying out System 2 processes. Consequently, their intuitive responses can only be rooted in System-1-like processes.

— Hagendorff, “Thinking Fast and Slow in Large Language Models” (2022)

1.2 The Boundaries of “Long Thinking”: When It Works and When It Wastes

Extended reasoning chains deliver genuine accuracy gains on specific tasks. Intern-S1-MO (2025) achieved 96.6% pass@1 on AIME2025 through multi-round hierarchical reasoning. In IMO-level mathematics, multi-step code generation, and complex STEM reasoning, extending the reasoning chain is currently irreplaceable.

The problem, however, is that long thinking is applied indiscriminately to all tasks. Hassid et al. (2025) found that shorter reasoning chains outperformed the longest chains by up to 34.5% in accuracy. Su et al. (2025) found that LLMs overthink simple problems and underthink difficult ones. Liu & Wang (2025) divided the reasoning process into three phases — the vast majority of “long thinking” tokens fall in the reasoning convergence phase, contributing nothing to the final result.

Core Distinction

This paper’s critique does not target long thinking itself, but the indiscriminate abuse of long thinking — and the industry narrative that packages this abuse as “higher intelligence.” In competition-level mathematics and complex reasoning, long thinking is a necessary tool; in answering “hello” and the vast majority of everyday queries, long thinking is pure computational waste. The question is not “should we think long,” but “who decides when to think long, and what incentive structure governs that decision.”

II. The Evaluation Layer: What Is Being Aligned?

2.1 The Actual Design of the Reward System

Reward_rule = Reward_accuracy + Reward_format

Reward_accuracy ∈ {0, 1} (answer correct/incorrect)
Reward_format ∈ {0, 1} (whether <think> tags are used)

DeepSeek-R1 abandoned neural network reward models because they are susceptible to reward hacking in large-scale RL. RLVR bypasses the need to train a reward model, with the model receiving binary feedback directly from deterministic tools. The reward signal is based entirely on final prediction correctness, imposing no constraints on the reasoning process.
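The rule-based reward described above can be sketched as a minimal function. This is an illustrative reconstruction, not DeepSeek’s actual implementation: the exact-match answer check and the tag-matching logic are simplifying assumptions. What the sketch makes concrete is the structural point — nothing in the reward inspects what happens between the tags.

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> int:
    """Minimal sketch of an RLVR-style rule reward:
    Reward_accuracy (binary, final answer correct) + Reward_format
    (binary, <think> tags present). Nothing scores the reasoning itself."""
    # Format reward: 1 if the response wraps its reasoning in <think>...</think>
    reward_format = 1 if re.search(r"<think>.*</think>", response, re.DOTALL) else 0
    # Accuracy reward: 1 if the text outside the think block equals the gold answer
    final = re.sub(r"<think>.*</think>", "", response, flags=re.DOTALL).strip()
    reward_accuracy = 1 if final == gold_answer else 0
    return reward_accuracy + reward_format

# A 600-token reasoning trace and a 6-token trace earn identical reward,
# provided both end in the correct answer.
```

Because the reward is identical for any trace length that lands on the right answer, generating longer traces is a free way to expand the search space — exactly the incentive described in 2.2.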

2.2 The Structural Misalignment of Alignment Outcomes

The strategy the model learns is: generate longer responses to expand the search space and increase the probability of hitting the correct answer. DeepSeek-R1-Zero’s pass@1 leapt from 15.6% to 77.9%, but response length simultaneously grew out of control. The TALE-EP study showed that reducing output tokens by 67% can maintain comparable accuracy — two-thirds of the computation contributes nothing to the result.

What the evaluation system aligns → “Final answer correctness”
What the model learns → “Write more = explore more = higher chance of being correct”
Dimensions never evaluated → Reasoning efficiency, computational waste rate, difficulty-depth matching

III. The Economic Layer: The Incentive Structure of Per-Token Billing

3.1 Business Logic and Market Failure

“More tokens = higher intelligence” (narrative)
→ Users select reasoning mode (behavioral guidance)
→ Token consumption surges 2–10× (resource consumption)
→ Per-token billing, revenue grows linearly (business return)

The market has matured into three billing categories: input tokens, output tokens, and reasoning tokens. Reasoning tokens represent internal “thinking” tokens — a process invisible to users, yet billed to them. In extreme cases, some reasoning models consume over 600 tokens to output two words. Inference now accounts for over 90% of total lifecycle energy consumption for LLMs, far exceeding the one-time expenditure of training.

OckBench (2026) found that models solving the same problem at comparable accuracy levels can differ in token length by up to 5×. The per-token billing model exhibits a triple market failure: information asymmetry (users cannot audit the effectiveness of reasoning tokens), missing evaluation metrics (benchmarks test only accuracy, not efficiency), and incentive distortion (efficiency improvements harm revenue).
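The billing arithmetic behind the “600 tokens for two words” extreme can be made explicit with a toy cost function. The prices below are hypothetical round numbers, not any provider’s rate card; the one structural assumption, common in practice, is that reasoning tokens are billed at the output rate even though the user never sees them.

```python
def request_cost(input_toks: int, output_toks: int, reasoning_toks: int,
                 price_in: float, price_out: float) -> float:
    """Per-request bill under per-token pricing (prices in USD per 1M tokens).
    Hidden reasoning tokens are assumed to be billed at the output rate."""
    return (input_toks * price_in
            + (output_toks + reasoning_toks) * price_out) / 1_000_000

# Hypothetical prices: $2 / 1M input tokens, $10 / 1M output tokens.
# A 20-token prompt answered with two words (~2 tokens):
visible = request_cost(20, 2, 0, 2.0, 10.0)    # bill if reasoning were free
actual  = request_cost(20, 2, 600, 2.0, 10.0)  # bill with 600 hidden reasoning tokens
# The invisible reasoning tokens dominate the bill by roughly 100x.
```

The user can audit the 2 visible tokens; the 600 hidden ones are a line item they can neither see nor contest — the information asymmetry named above, in one expression.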

IV. The Physical Layer: This Token ≠ That Token

4.1 True Energy Cost Differences After Amortization

TokenPowerBench (2025) and the ML.ENERGY Benchmark provide the first systematic empirical measurements:

Hardware / Configuration | Energy per Token | Source
H100×8, Llama-3.3-70B FP8, batch 128 | ~0.39 J/token | Lin (2025)
V100/A100, LLaMA-65B | ~3–4 J/token | Samsi et al. (2023)
Mixtral-8×7B MoE | ~⅓ of a dense 8B model | TokenPowerBench
Batch size 32 → 256 | J/token drops ~25% | TokenPowerBench

Under high-concurrency, fully loaded conditions, large GPU clusters can achieve very high amortized energy efficiency — the H100 cluster’s 0.39 J/token is over 120× more efficient than early estimates. But this rests on the premise of sustained high-concurrency utilization. When utilization is insufficient, the system’s idle power draw is still amortized across all tokens, and efficiency plummets. The physical cost of a token is a function of its runtime environment — determined by hardware specifications, concurrent load, utilization rate, cooling architecture, and infrastructure overhead allocation. The current pricing system compresses all these variables into a single number, erasing every dimension of physical reality.
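The utilization dependence described above can be captured in a two-line amortization model. All parameters here are illustrative assumptions, not measured values — the point is the shape of the curve, not the specific numbers: busy power scales with throughput, but idle power is a fixed cost spread over however few tokens get served.

```python
def joules_per_token(peak_power_w: float, idle_power_w: float,
                     peak_tokens_per_s: float, utilization: float) -> float:
    """Amortized J/token for a server that serves at peak throughput for
    `utilization` fraction of the time and idles otherwise.
    All parameters are illustrative, not measurements."""
    avg_power = utilization * peak_power_w + (1 - utilization) * idle_power_w
    avg_throughput = utilization * peak_tokens_per_s  # tokens/s, averaged
    return avg_power / avg_throughput

# Hypothetical 8-GPU node: 10 kW busy, 2 kW idle, 25,000 tokens/s at peak.
busy  = joules_per_token(10_000, 2_000, 25_000, utilization=1.0)  # 0.40 J/token
light = joules_per_token(10_000, 2_000, 25_000, utilization=0.1)  # 1.12 J/token
```

Under these assumed numbers, dropping from full load to 10% utilization nearly triples the true energy cost per token — yet the price per token the user sees is identical in both regimes.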

4.2 The Accounting Manipulation of Hardware Depreciation

Hyperscale cloud providers have extended GPU server useful life from 3–4 years to 6 years, collectively saving approximately $18 billion per year in depreciation expenses. In 2025, Amazon shortened some server lifespans from 6 years back to 5, absorbing a $700 million impact; in the same quarter, Meta extended to 5.5 years, booking a $2.9 billion reduction in depreciation. Opposite decisions under identical technological conditions confirm that useful life is a subjective management estimate.

“I don’t want to carry four or five years of depreciation on one generation of product.”

— Satya Nadella, CEO, Microsoft, 2025

V. The Institutional Layer: Cost-Attribution Misalignment Between Centralized and Distributed Deployment

5.1 The Separation of Hardware Payer and Token Consumer

Dimension | Centralized (Cloud) | Distributed (Local)
Hardware payer | Cloud provider (cost passed to users) | The user
Electricity payer | Cloud provider (passed to users + grid + taxpayers) | The user (electricity bill directly visible)
Energy visibility | Completely invisible to users | Completely visible to users
Incentive structure for waste | More waste → higher revenue | More waste → higher personal electricity bill

In centralized architectures, users pay per token but cannot observe the physical production cost of tokens. Provider profit correlates positively with consumption, creating no incentive for model efficiency. In distributed architectures, this is entirely inverted — users pay for every joule and are naturally incentivized to choose efficient models. Alamouti (2025) showed that hybrid edge-cloud deployment can achieve up to 75% energy savings and over 80% cost reduction. Global AI data center electricity demand is projected to increase 255% by 2030, with inference becoming the dominant workload, yet most infrastructure today is built for training, leaving the distributed, low-latency architecture needed for inference severely underdeveloped.
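The order of magnitude of Alamouti’s savings claim can be illustrated with a one-line blending model. Every number below is a hypothetical parameter, not Alamouti’s measurement: a small local model assumed to run at 0.5 J/token, an under-utilized cloud path assumed at 4.0 J/token, and 90% of light queries assumed routable to the edge.

```python
def hybrid_energy(edge_fraction: float, edge_j_per_tok: float,
                  cloud_j_per_tok: float) -> float:
    """Blended J/token when `edge_fraction` of tokens are served locally.
    All parameters are hypothetical, not measured values."""
    return edge_fraction * edge_j_per_tok + (1 - edge_fraction) * cloud_j_per_tok

# Hypothetical: efficient local model (0.5 J/token) vs. an under-utilized
# cloud path (4.0 J/token), with 90% of queries served at the edge.
cloud_only = hybrid_energy(0.0, 0.5, 4.0)  # 4.00 J/token
hybrid     = hybrid_energy(0.9, 0.5, 4.0)  # 0.85 J/token, roughly 79% savings
```

The savings in this toy model come entirely from cost-attribution alignment: the edge user pays their own electricity bill, so the efficient model gets chosen; nothing about the arithmetic requires better hardware.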

VI. The Consumption Layer: The Polarized Distribution of Tokens and Cross-Subsidization

6.1 Empirical Data on Polarized Distribution

OpenRouter’s empirical study of 100 trillion tokens reveals a fact overlooked by the industry: AI users’ token consumption follows an extremely polarized distribution.

User Type | Typical Annual Consumption | Share of Users
Heavy programmers (Claude Code/Cursor all-day usage) | 10B+ tokens | <1%
Active developers | 1B–10B tokens | ~5–10%
Everyday professional users | 10M–100M tokens | ~20–30%
Light/occasional users | <10M tokens | 60%+

The top-end cases are staggering: one developer consumed 10 billion tokens over 8 months, exceeding $15,000 at API pricing; another generated 1.2 billion tokens and 20,000 conversations in 20 days, verified by Anthropic as a top 1% user; a Cursor user consumed 865 million tokens in a single month, equivalent to $2,595 in raw API cost. But this user reduced consumption from 865 million to 200 million through usage optimization — a 77% reduction with identical functional output. This means at least three-quarters was waste.

Meanwhile, 52% of American adults have used an LLM, but two-thirds of them use it only “like a search engine” for information retrieval. Only 15–20% of users use AI daily. The vast majority of users consume no more than tens of millions of tokens per year — not even reaching 100 million.

6.2 “Inference Whales” and Cross-Subsidization

TechCrunch dubbed these ultra-heavy users “inference whales” — some users generated over $35,000 in compute costs on a $200/month subscription plan, with providers absorbing a 175× subsidy. Anthropic stated that 90% of Claude Code users spend less than $12 per day, but fewer than 5% of heavy users drove the implementation of rate-limiting policies.
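The 175× figure is simple division, made explicit here so the subsidy ratio can be recomputed for any user. The function name is this paper’s own; the inputs are the figures reported in the text.

```python
def subsidy_multiple(compute_cost_usd: float, subscription_usd: float) -> float:
    """How many times a flat subscription is exceeded by the underlying
    compute cost a user actually generates over the same period."""
    return compute_cost_usd / subscription_usd

# The "inference whale" from the text: $35,000 of compute on a $200/month plan.
whale = subsidy_multiple(35_000, 200)  # 175.0
```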

This reveals a clear cross-subsidy structure:

Venture capital → Subsidizes AI companies’ loss-making operations (OpenAI spends $1.69 for every $1 earned in 2025)
AI companies → Sell tokens below cost (land grab)
Light users’ $20/month subscription → Surplus subsidizes heavy users’ $10,000+ consumption
Heavy programmers → Consume billions of tokens/month, actual cost subsidized 175×
“Long thinking” narrative → Gets all users to accept “more tokens = higher intelligence”
Total token consumption keeps expanding → Drives demand for more infrastructure construction

6.3 Defining Compute Hegemony

Based on the above analysis, this paper defines “compute hegemony” as: the pricing power structure that AI companies construct through the token consumption narrative — sustaining the low-cost usage of heavy consumers using surpluses from low-frequency users and venture capital subsidies, while simultaneously driving continuous infrastructure expansion. This is not a geopolitical concept but an economic-structural one. Its mechanism operates as follows: narrative creates demand, demand drives hardware procurement, hardware consumption creates energy dependence, centralized pricing ensures waste is invisible, and cross-subsidization ensures heavy users’ costs are socialized. Every layer has its beneficiaries, but the ultimate costs are borne by light users, investors, and society.

Unsustainability Signals

OpenAI spends $1.69 for every $1 earned in 2025, with projected full-year cash burn of $25 billion. Anthropic’s 2024 gross margin was negative 94%. Cursor pays 100% of subscription revenue directly to Anthropic for compute access. Bain projects AI companies will face an $800 billion annual revenue shortfall by 2030. Google has already shifted from an “all-you-can-eat” model to AI Credits metered consumption. When the subsidies end — and they will — the entire token economics will face a fundamental reset.

VII. The Six-Layer Structure and Complete Chain

Layer 1 (Narrative): “More tokens = higher intelligence”
 ↓ conceals
Layer 2 (Efficiency): Most tokens are wasted matrix computations
 ↓ conceals
Layer 3 (Economic): Per-token billing incentivizes waste
 ↓ conceals
Layer 4 (Physical): The physical cost of the same token varies by hardware and load
 ↓ conceals
Layer 5 (Institutional): Centralized deployment’s payer-consumer misalignment hides waste
 ↓ conceals
Layer 6 (Consumption): Cross-subsidization under extreme polarization socializes waste

VIII. Responses to Foreseeable Counterarguments

8.1 “Long thinking genuinely works on difficult tasks”

Fully acknowledged. This paper’s critique targets indiscriminate abuse, not long thinking itself. Current reasoning models and billing systems make no distinction between “necessary long thinking” and “redundant long thinking,” and users pay the same price for both.

8.2 “Per-token energy cost is not high under high-concurrency cloud conditions”

This is true under fully loaded, high-concurrency scenarios — H100 clusters have been measured at 0.39 J/token. But off-peak idle power draw, continuous liquid cooling operation, and fixed infrastructure overhead are still amortized across all tokens. These hidden costs in the pricing are neither visible to nor auditable by users.

8.3 “Token pricing includes amortized R&D and training costs”

A reasonable business practice. However, the current pricing system compresses R&D amortization, training costs, inference energy, infrastructure overhead, profit margins, and depreciation accounting strategies all into an opaque “X dollars per million tokens.” Users cannot determine how much is legitimate cost recovery and how much is a premium manufactured by inefficient reasoning and accounting manipulation.

8.4 “Heavy users create product value and ecosystem”

Heavy developers are indeed the core builders of the AI ecosystem — they build applications, find bugs, and drive product improvement. But this does not alter an economic fact: their compute costs are subsidized by other users and venture capital. The issue is not whether heavy users are valuable, but the opacity of the subsidy — light users do not know they are paying for someone else’s 175× compute consumption.

IX. Reconstruction: A Physics-Based Efficiency Evaluation Framework

If we acknowledge that “this token ≠ that token,” we need an evaluation system that restores the physical dimension. TokenPowerBench (2025) already provides a systematic measurement tool for joules per token. Building on this foundation, this paper proposes four core evaluation metrics:

Metric | Definition | Significance
Intelligence output rate per token | Correct results ÷ total tokens consumed | “How much compute did it take to get it right?”
Marginal token return rate | ∂Accuracy / ∂Token | How much does each additional token improve accuracy, and where is the inflection point?
Computational waste rate | Redundant tokens ÷ total tokens | How many tokens contributed nothing to the final answer?
Intelligence output per joule | Correct results ÷ (token count × J/token) | Restores the physical dimension — a true measure of intelligence efficiency

“Intelligence per joule” requires reporting the energy consumption of each inference, not merely token count and accuracy. Under this framework, the “intelligence output” of different hardware platforms, different concurrency loads, and different model specifications becomes genuinely comparable. At the same time, the polarized distribution of token consumption demands that pricing systems introduce transparent usage tiers, so that light users no longer implicitly subsidize the infrastructure scale driven by heavy users.
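The four metrics above are straightforward ratios and can be written down directly. The function names are this paper’s own coinage, and the finite-difference form of the marginal return rate is a practical assumption (two runs at different token budgets standing in for the derivative).

```python
def intelligence_per_token(correct: int, total_tokens: int) -> float:
    """Correct results produced per token consumed."""
    return correct / total_tokens

def marginal_token_return(acc_short: float, acc_long: float,
                          tokens_short: int, tokens_long: int) -> float:
    """Finite-difference estimate of dAccuracy/dToken between two budgets."""
    return (acc_long - acc_short) / (tokens_long - tokens_short)

def waste_rate(redundant_tokens: int, total_tokens: int) -> float:
    """Fraction of tokens that contributed nothing to the final answer."""
    return redundant_tokens / total_tokens

def intelligence_per_joule(correct: int, total_tokens: int,
                           j_per_token: float) -> float:
    """Correct results per joule of inference energy."""
    return correct / (total_tokens * j_per_token)

# Two runs at equal accuracy but a 5x token spread (the OckBench finding),
# on identical hardware at the measured 0.39 J/token:
short_run = intelligence_per_joule(80, 1_000_000, 0.39)
long_run  = intelligence_per_joule(80, 5_000_000, 0.39)
# The shorter run delivers 5x the intelligence per joule; accuracy-only
# benchmarks score the two runs identically.
```

Note that the last metric is the only one that distinguishes hardware platforms: the same model and token count at 4 J/token instead of 0.39 J/token would score roughly ten times worse, which is precisely the dimension current leaderboards erase.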

X. Conclusion

The current token economics of the AI reasoning industry rests on a six-layer structure: the narrative layer packages the repeated execution of matrix operations as cognitive depth; the evaluation layer aligns only to final correctness while ignoring efficiency; the economic layer’s per-token billing profits from inefficiency; the physical layer’s token cost varies by environment yet is flattened by the pricing system; the institutional layer’s centralized deployment separates payer from consumer; and the consumption layer’s extreme polarization causes the compute costs of heavy users (fewer than 10% of the user base) to be implicitly subsidized by the more than 60% who are light users, and by venture capital.

This does not mean long thinking has no value — in complex reasoning tasks it is a necessary tool. But when 67% of reasoning tokens can be eliminated without affecting accuracy, when 77% of one programmer’s token consumption is proven to be waste, when an “inference whale’s” $35,000 in compute costs is absorbed by a $200 subscription — what the industry needs is not more narrative packaging about “deep reasoning,” but a fundamental audit of the economics and efficiency of token consumption.

True intelligence efficiency is not “how many dollars per million tokens,” but “how much effective intelligence is produced per joule of energy.” Establishing this evaluation framework, combined with empirical measurement tools like TokenPowerBench, tiered pricing by usage volume, and the cost-attribution alignment enabled by distributed deployment, is the first step toward breaking the token consumption narrative and restoring physical truth.

Key References

[1] Li, Z.-Z. et al. (2025). “From System 1 to System 2: A Survey of Reasoning LLMs.” arXiv:2502.17419.

[2] Hassid, M. et al. (2025). “Don’t Overthink It.” arXiv:2505.17813.

[3] Su, J. et al. (2025). “Between Underthinking and Overthinking.” arXiv:2505.00127.

[4] Liu & Wang (2025). “Stop Spinning Wheels.” arXiv:2508.17627.

[5] Wang, Y. et al. (2025). “Thinking Short and Right Over Thinking Long.” arXiv:2505.13326.

[6] DeepSeek-AI (2025). “DeepSeek-R1.” Nature (2025). arXiv:2501.12948.

[7] Hagendorff, T. (2022). “Thinking Fast and Slow in LLMs.” arXiv:2212.05206.

[8] OckBench (2026). “Measuring the Efficiency of LLM Reasoning.” arXiv:2511.05722.

[9] Niu, C. et al. (2025). “TokenPowerBench.” arXiv:2512.03024.

[10] Chung, J.-W. et al. (2026). “Where Do the Joules Go?” arXiv:2601.22076.

[11] Samsi, S. et al. (2023). “From Words to Watts.” arXiv:2310.03003.

[12] Lin, L. H. (2025). “Llama3-70B Inference Efficiency on H100.” Internal User Test.

[13] Alamouti, S. (2025). “Quantifying Energy and Cost Benefits of Hybrid Edge Cloud.” arXiv:2501.14823.

[14] Intern-S1-MO (2025). “Long-horizon Reasoning Agent.” arXiv:2512.10739.

[15] OpenRouter & a16z (2025). “State of AI 2025: 100T Token LLM Usage Study.”

[16] Genspark (2025). “The Hidden Economics of AI.”

[17] Artefact (2026). “Is AI Really Getting Cheaper? The Token Cost Illusion.”

[18] Morph (2026). “The Real Cost of AI Coding in 2026.”

[19] Yale Cowles Foundation (2025). “Token Allocation, Fine-Tuning and Optimal Pricing.” Discussion Paper No. 2425.

[20] Princeton CITP (2025). “Lifespan of AI Chips: The $300 Billion Question.”

[21] Uptime Institute (2025). “Reasoning Will Increase the Infrastructure Footprint of AI.”

