Through a systematic interrogation of the concepts of LLM Agent and Skill, this paper exposes the fundamental ambiguity in the AI industry’s core terminology—spanning ontological status, categorical boundaries, and use-case definitions. By cross-comparing the definitions offered by Anthropic, OpenAI, and Google, combining a cybernetics-based empirical assessment of LLM output determinism (an approximate 95% controllability ceiling) with the introduction of folk concepts such as “gacha pull” and “dumbing down,” we demonstrate a complete causal chain from conceptual ambiguity to the impossibility of pricing. Within this chain, the “dumbing down” phenomenon is further deconstructed into three distinct mechanisms—data degradation, the alignment tax, and cost optimization—that converge at the level of user experience; and the relationship between hallucination and creativity is revised from an equivalence to an empirically verified tradeoff. We further substantiate, through empirical data—including Gemini API TPM throttling records and the cost black hole of reasoning tokens—that a structural rupture exists between the commercial models of current LLM products and the actual user experience. The paper also argues why conceptual ambiguity in the LLM market is far more destructive than in the cloud computing market: because ambiguity simultaneously permeates three dimensions—product definition, output quality, and cost measurement. We propose that what the LLM ecosystem needs is not more precise definitions of old concepts, but an entirely new conceptual framework native to probabilistic systems.
I. Introduction: A Foundational Question That Cannot Be Answered
“What is an Agent? What is a Skill?”—this is the first question anyone entering the field of LLM application development encounters. Yet, after systematic research across the official documentation, academic papers, and engineering practices of the three major AI platforms, we have arrived at a disquieting conclusion: these two concepts currently have no rigorous definitions.
This is not because the definitions are too specialized or too technical, but because the definitions themselves are ambiguous, self-contradictory, and vary from platform to platform. The deeper issue is that this ambiguity is not an accidental oversight—it is a structural predicament inevitably encountered when an industry built on probabilistic systems attempts to describe itself in deterministic language.
Starting from the author’s own practice as a deep LLM user, this paper reverse-traces every link in this “chain of ambiguity”—from concept definitions to categorical boundaries, from cybernetic ceilings to cost black boxes, from folk complaints to market structure—ultimately revealing a complete causal loop.
II. The Ontology of Agent: The Emptiness of Definition
2.1 Cross-Comparison of Three Definitions
The three major global LLM platforms currently define Agent differently, and they contradict each other on key dimensions:
| Dimension | Anthropic | OpenAI | |
|---|---|---|---|
| Core Definition | A system where the LLM dynamically directs its own processes and tool usage | A system that can independently complete tasks on your behalf | A software system that uses AI to pursue goals |
| Emphasis | Locus of control (distinguishing from Workflow) | Independence + guardrails | Reasoning + planning + memory |
| Definitional Style | Architectural distinction | Product/engineering-oriented | Capability checklist |
All three are ambiguous, but ambiguous in different ways. OpenAI is the most pragmatic[2]—an Agent is simply “a configured LLM instance”; Google is the most expansive[3]—virtually any AI system with reasoning capability qualifies; Anthropic is the only one that attempts to draw an architectural boundary (Workflow vs. Agent)[1], but this line itself is unclear.
2.2 Agent vs. Workflow: The Only Boundary Line and Its Ambiguity
Among the three, Anthropic is the only one that attempts to draw an internal boundary. Anthropic explicitly distinguishes: a Workflow is a system that orchestrates LLMs and tools through predefined code paths; an Agent is a system where the LLM dynamically directs its own processes and tool usage. The criterion for this boundary is the locus of control—who decides what to do next? If it is predetermined by code, it is a Workflow; if the LLM decides for itself, it is an Agent.
OpenAI and LangChain also recognize this distinction. LangChain notes that both OpenAI and Anthropic treat Workflows as a design pattern distinct from Agents—in Workflows, the LLM has less control and the process is more deterministic. However, both also acknowledge that for many applications, Workflows provide predictability and consistency for well-defined tasks, while Agents are better suited for scenarios requiring flexibility and model-driven decision-making. In other words, this is not a hierarchy of superiority, but a matter of appropriate use case.
The problem is that this boundary in practice is a continuous spectrum, not a discrete dividing line:
| Characteristic | Pure Workflow | Gray Zone | Pure Agent |
|---|---|---|---|
| Control | 100% code-controlled | Some nodes decided by LLM | 100% LLM-controlled |
| Execution Path | Predefined, deterministic | Conditional routing + LLM judgment | Fully dynamic, non-deterministic |
| Predictability | High | Medium | Low |
| Debugging Difficulty | Low | Medium | High |
| Typical Implementation | Prompt chaining, parallelization | Orchestrator-worker pattern | Tool-calling loop (ReAct) |
Most of what people call Agents are not actually Agents. A large number of so-called “Agents” are merely an API call plus tool access—they cannot act independently or make decisions. They simply respond to the user. Yet we still call them Agents. An independent researcher pointed out that CrewAI’s so-called “Agents” are actually closer to predefined Workflows assigned to specific tasks, whereas Anthropic’s definition of an Agent is a system capable of independently reasoning through any task[25]. Both interpretations have their value, but they point to fundamentally different technical entities.
More critically, for most applications, production-grade agentic systems will be a combination of Workflows and Agents—pure Agents rarely appear in production environments. Anthropic recommends finding the simplest viable solution and only adding complexity when necessary. This may mean not building an agentic system at all. This amounts to saying: the concept of Agent, despite being defined, is in practice recommended to be avoided as much as possible. A concept whose own definers advise cautious use has questionable validity as a product category.
2.3 The Paradox of Autonomy
The core of the Agent definition lies in “autonomy.” But deeper interrogation reveals a fundamental paradox: an Agent without a prompt has no autonomy.
The “autonomy” of an LLM is not intrinsic but prompt-conferred. A bare LLM—without a system prompt, without role definition, without tool descriptions—does only one thing: predict the next token based on input text. It will not proactively set goals, decide to call tools, or judge whether a task is complete.
The prompt is not an accessory to the Agent; the prompt is the source of the Agent’s autonomy. So-called “autonomy” is merely conditional freedom within a human-preset semantic space. The Agent’s goals are given by humans, its capability boundaries are defined by humans, its behavioral framework is set by humans, its loop structure is written by humans—even the fact that it “can decide autonomously” is itself something permitted by humans in the prompt.
2.4 The Deep Causes of Definitional Ambiguity
Anthropic itself acknowledges: “Agent can be defined in several ways.”[1] This is not a rigorous ontological definition; this is a product classification strategy. The ambiguity has three causes: first, in 2026, “Agent” is the hottest keyword—the wider the definition, the larger the market narrative; second, the autonomy of Agents is indeed a continuous spectrum, and drawing a hard line is meaningless in engineering terms; third, the field is so new that even academia is still debating.
III. The Ontology of Skill: Conceptual Drift
3.1 What Skill Is Not
The essence of Skill must be delineated through negative definition:
A Skill is not a Tool—a tool is a deterministic operation: give it input, get structured output. A Skill is a set of instructions interpreted by the LLM.[4] A Skill describes “how to do” something; a tool “executes” something.
A Skill is not a Prompt—a prompt is ephemeral, reactive, embedded in code. A Skill is a persistent, portable, version-controlled artifact.[5]
A Skill is not an Agent—an Agent is an execution runtime with its own tools, memory, and decision loop. A Skill is a knowledge module that any Agent can load.
3.2 The Three-Way Split
| Dimension | Anthropic | OpenAI | |
|---|---|---|---|
| Has a Skill Concept? | Yes—inventor and standard-setter | Yes—later fully adopted | No independent definition; compatible usage |
| Skill vs. Tool | Explicitly distinguished | Historically undistinguished; now beginning to distinguish | No distinction—everything is a Function |
| Essence of Skill | Procedural knowledge package | Reusable file bundle | Subsumed under tools/functions |
Microsoft’s Semantic Kernel initially called them “Skills,” then renamed them to “Plugins”[6]—indicating that in their view, there was never an essential difference between Skill and Tool. This directly challenges the thesis that “Skill and Tool are different categories.”
3.3 Self-Examination and Collapse of the Tool–Skill Boundary
The common assertion that “Tool is the execution layer, Skill is the orchestration layer” was subjected to four-dimensional verification (logic, definition, category, analogy), and every dimension exposed problems:
| Dimension | Original Assertion | Post-Verification Revision |
|---|---|---|
| Logic | Skill orchestrates Tools at the orchestration layer | Skill orchestrates nothing; the LLM reads the Skill and then orchestrates Tools |
| Definition | Tool is deterministic, Skill is non-deterministic | True in most cases, but not an essential distinction |
| Category | Tool vs. Skill as a binary opposition | Actually a three-tier structure: System Prompt / Skill / Tool |
| Analogy | Hammer vs. instruction manual | Recipe vs. kitchen utensil—the recipe references the utensil; they are not independent parallels |
The most honest conclusion: the difference between Tool and Skill is not two essentially distinct categories, but different positions on the same continuous spectrum.[7] The industry itself has not yet reached consensus.
IV. The Cybernetic Perspective: The 95% Ceiling and the Duality of the 5%
4.1 Quantifying Uncertainty
Through long-term observation of prompt reusability, output stability in Project environments, and Skill reuse effectiveness, all experience converges toward the same direction: there exists an approximate 95% empirical ceiling on the controllability of LLM output. This is not a precise constant measured through controlled experiments, but a convergent judgment based on sustained practice—interestingly, MIT’s independent research also found that 95% of generative AI pilot projects fail to reach production[21], and these two “95%” figures across different dimensions form a thought-provoking resonance.
This final 5% is not an engineering problem—it is not about prompts being poorly written, Skills being insufficiently refined, or the wrong framework being chosen. It is a physics-law-level limitation determined by the very mechanism of LLMs generating tokens through probabilistic sampling.
4.2 The Value Inversion of the 5%
The key insight is this: the 5% of uncontrollability is a risk from the cybernetic perspective, but the core value from the creativity perspective.
When an LLM deviates from expectations—if the deviation is “worse,” we call it hallucination or error; if the deviation is “better,” we call it brilliance or insight. Both originate from probabilistic sampling producing unanticipated output, but they are not strictly equivalent—academic research shows that different hallucination-suppression techniques have opposite effects on creativity: some methods (such as CoVe) actually enhance creative diversity while reducing hallucination, while others (such as DoLa) suppress both simultaneously[22]. Across different decoding layers of a model, there exists a quantifiable hallucination-creativity tradeoff curve, with a specific optimal balance point[23].
The relationship between hallucination and brilliance is not one of equivalence, but an empirically verified tradeoff—the means of reducing hallucination simultaneously compress the probability space of brilliance. You may be able to reduce hallucination while preserving some creativity along specific technical paths, but there is no lossless solution—this tradeoff is a structural characteristic of probabilistic generation systems, not an accidental limitation that can be engineered away.
4.3 LLM as a Novel Product Category
This 95%+5% structure defines a product category unprecedented in the history of human commerce:
| Product Type | Determinism | Core Value | User Expectation |
|---|---|---|---|
| Traditional Software | 100% | Reliability | Same result every time |
| Creative Tools | Low | Expressive freedom | Different result every time |
| LLM Systems | 95% | Reliability + occasional transcendence | Simultaneously reliable and brilliant |
No existing pricing model, evaluation framework, or quality standard was designed for this kind of product. This is the root cause of why every existing concept fails when applied.
V. Folk Ontology: “Gacha” and “Dumbing Down”
5.1 “Gacha”—The Most Honest Description of Cost
“Gacha” is not a term coined by any expert. It is a consensus vocabulary that hundreds of millions of Chinese-speaking AI users spontaneously converged upon through real-world usage[18]—originating from gaming communities, where users have long understood the nature of probabilistic systems: spending money guarantees nothing, rarity determines cost, and operators can secretly alter the odds.
“Gacha” reveals a fact ignored by pricing models: the real cost of production is not “the price of a single generation,” but “the total cost of pulling until you get a satisfactory result.” And this total cost is unpredictable.
5.2 “Dumbing Down”—Experiential Convergence of Three Distinct Mechanisms
What practitioners observe as “dumbing down” is actually the convergence of three different mechanisms at the level of user experience:
Mechanism One: Data Quality Degradation (“Brain Rot”). Research from multiple universities has shown that models trained on low-quality internet data exhibit systematic capability decline[8]—reasoning scores dropped from 74.9 to 57.2, and memory and long-context comprehension fell from 84.4 to 52.3. More alarmingly, this damage is difficult to reverse—even retraining with high-quality data cannot fully restore the model to its original level. This is a data-source-level problem.
Mechanism Two: The Alignment Tax. RLHF-aligned models exhibit “response homogenization”—research has measured that on TruthfulQA, 40% of questions produce only a single semantic cluster across 10 independent samples[24]. For affected questions, sampling-based uncertainty estimation methods see their discriminative power drop to zero (AUROC=0.500). This is not capability degradation but a structural price paid for safety—a design choice.
Mechanism Three: Cost Optimization and Model Substitution. Vendors perform cost-optimization inference tuning on newer models, defaulting to shorter outputs; simultaneously, they may silently replace high-cost models with low-cost ones while keeping billing unchanged[9]. This is a profit-driven business decision.
The three mechanisms have different causes, different accountability, and different resolution paths. But at the user experience level, they converge into the same feeling: what worked before no longer works; what used to be brilliant responses have now become mediocre. The folk term “dumbing down” precisely encodes this composite experience—even though it does not distinguish the causes.
The model attracts users with its 5% of unpredictability, then systematically compresses that 5% for the sake of commercial safety, ultimately killing its most attractive quality. The model is at its best when it is least profitable; the model becomes profitable when it starts to deteriorate.
5.3 The Opposition of Two Language Systems
Officially, the industry speaks of Agent, Skill, Tool; users speak of gacha and dumbing down. Two language systems describe the same system, yet arrive at entirely different conclusions. The former says this is a controllable, definable, priceable product; the latter says this is an uncontrollable, luck-dependent, progressively worsening gamble. The folk vocabulary of “gacha + dumbing down” is far more honest than the official vocabulary of “Agent + Skill.”
VI. The Cost Black Box: Reasoning Tokens and the TPM Trap
6.1 Hidden Reasoning Tokens
Between input and output, there exists a cost layer that has been almost entirely opaque across all companies: Reasoning Tokens. Academic papers have documented cases where over 90% of billed tokens were never displayed to the user, with internal reasoning inflating token usage by more than 20×.[10]
You send 50 tokens, receive 100 tokens, but are billed for 650 tokens.[11] Those 500 “reasoning tokens” are the model’s internal monologue—you never see them. Reasoning tokens can amplify the cost of a single query by 5 to 50 times[12], depending on task difficulty and model selection.
6.2 The TPM Trap: Empirical Data
The following data comes from the author’s (LEECHO Global AI Research Lab) Google Gemini API paid Tier 1 dashboard screenshots[13], covering a 90-day usage period:
| Model | RPM (Used/Limit) | TPM (Used/Limit) | RPD (Used/Limit) |
|---|---|---|---|
| Gemini 3 Pro | 14 / 25 | 1.26M / 1M (26% over limit) | 252 / 250 (over limit) |
| Gemini 3 Flash | 5 / 1K | 1.08M / 1M (8% over limit) | 322 / 10K |
| Gemini 2.5 Pro | 2 / 150 | 1.12K / 2M | 7 / 1K |
| Gemini 2.5 Flash | 1 / 1K | 108 / 1M | 2 / 10K |
Key fact: The author’s use case was exclusively chat—not batch processing, not complex Agent automation, just conversation. After hitting the TPM limit, the API was banned from further calls for the rest of the day.
6.3 The Complete Structure of the Quadruple Black Box
Stacking all findings together, the real situation facing an LLM user is a quadruple black box:
| Layer | Uncertainty | Manifestation |
|---|---|---|
| Layer 1: Gacha | Output quality uncertain | Same prompt does not guarantee same result |
| Layer 2: Dumbing Down | Quality trend uncertain | Model degrades with updates |
| Layer 3: Reasoning Tokens | Per-query cost uncertain | Invisible internal reasoning accounts for 90%+ of billing |
| Layer 4: TPM Trap | Service continuity uncertain | Paying users are unpredictably throttled |
Users do not know what quality of result they will get, do not know how much this query cost, do not know why they were suddenly rate-limited, and do not even know whether the model has been silently swapped out behind the scenes[10].
VII. From Conceptual Ambiguity to Market Black Box: The Complete Causal Chain
All findings linked together form a complete causal chain of reasoning:
Conceptual ambiguity → Customers cannot understand the product → Cannot evaluate → Cannot price → Capital can only bet on narratives → Bigger narratives become more ambiguous → Concepts become even more ambiguous → The cycle accelerates
Every step in this cycle has already been validated in the real market: approximately 80% of enterprises report using generative AI, but an equal number report no significant bottom-line impact[14]. Only 11% of organizations have deployed AI Agents to production[15]. Most enterprise budgets underestimate true total cost of ownership by 40–60%[16]. MIT further reports that 95% of generative AI pilot projects fail[21].
A possible counterargument must be addressed: “cloud computing” was similarly ambiguous in its early days, yet it still developed into a well-functioning trillion-dollar market. Why is conceptual ambiguity so much more destructive in the LLM market? The answer is that cloud computing had ambiguity only at the level of conceptual definition, while its output was deterministic—storing 1GB is 1GB, computing for 1 hour is 1 hour, verifiable, auditable, and comparable. The distinctive feature of the LLM market is that ambiguity simultaneously permeates three dimensions: product definition is ambiguous (what is an Agent?), output quality is non-deterministic (the same prompt does not guarantee the same result), and cost measurement is opaque (reasoning tokens are invisible). It is this triple-stacked uncertainty that is the root cause of this market’s inability to form an effective pricing mechanism.
Pricing chaos is the direct symptom: some vendors charge per resolution, some per conversation, and others hide their pricing entirely behind sales calls[19]. There is not even consensus on billing units—Token, Credit, “Intelligence Unit,” Conversation, Resolution[20]—none of these billing dimensions represents a value metric from the user’s perspective.
7.1 The Structural Penalty on OOD Users
The most ironic aspect is that the current cost structure systematically penalizes high-value usage and rewards shallow usage. A cross-disciplinary deep thinker—generating high-density OOD (Out-of-Distribution) queries, triggering the model’s dense reasoning mode, continuously stacking context—sees token consumption grow exponentially. These are precisely the users doing what AI should be used for—deep thinking—yet the entire business model penalizes them.
Empirical evidence: the author of this paper, solely through chat (not batch tasks, not Agent automation), triggered Gemini 3 Pro’s 1M TPM limit and was banned for the day. The conversation pattern was characterized by each turn opening a new disciplinary dimension, with context continuously expanding[17] and reasoning tokens growing exponentially. Without restrictions, this type of conversation pattern could theoretically reach 1B tokens/minute in consumption.
VIII. Conclusion: A New Conceptual Framework Is Needed
Core Thesis
The core concepts in the LLM ecosystem—Agent, Skill, Tool, Workflow—are currently not rigorous definitions but vague metaphors under a marketing narrative. This ambiguity is not an accidental oversight but a structural predicament inevitably encountered when an industry built on probabilistic systems attempts to describe itself using deterministic concepts.
On top of probabilistic systems, it is impossible to build a fully deterministic abstraction layer. All attempts to describe and manage probabilistic systems using deterministic concepts—reuse, definition, control, evaluation, pricing—will encounter problems of ambiguity, instability, non-reusability, and un-priceability.
8.1 Three Questions the Industry Must Answer
If the market chaos caused by conceptual ambiguity is to be truly resolved, what is needed is a new conceptual framework native to LLM systems—not metaphors borrowed from human organizations, traditional software, or cognitive science, but one derived from the actual operating mechanisms of LLMs. This framework must at minimum answer:
First, the question of atomic units. What is the truly irreducible atomic unit of an LLM system? Is it a token? A single inference call? A complete context window?
Second, the question of isolation boundaries. What guarantees the isolation boundaries between different components? Traditional software has type systems, interface definitions, and process isolation. In LLM systems, everything is mixed in the same token stream—prompts, tool descriptions, Skill instructions, user input, model output—with no true physical isolation, only semantic conventions.
Third, the question of autonomy. What does “autonomy” actually mean in a system that is fundamentally a conditional probability generator?
8.2 Historical Analogy and Outlook
Technology running ahead of concepts is a recurring pattern in the history of technology. When electricity was first invented, it was called “artificial lightning”; when automobiles first appeared, they were called “horseless carriages.” These names all forced old concepts onto new phenomena. “Agent” is borrowed from human representatives and reinforcement learning agents; “Skill” is borrowed from human skills—every one of them is inaccurate.
What will ultimately survive are two kinds of companies: those that have genuinely solved a clear problem that can be evaluated and priced; and infrastructure companies that possess foundational model capabilities and do not depend on the “Agent” narrative. Those in the middle—surviving on ambiguous concepts and capital narratives—will be the first to be eliminated when the market clears.
Only when enough buyers stand up and say “I don’t understand, not because I’m stupid, but because you haven’t explained it clearly—and I’m not buying until you do” will the black-box market be forced to become a transparent market.
N. References and Notes
-
Anthropic, “Building Effective Agents,” 2024.
https://www.anthropic.com/research/building-effective-agents
Anthropic’s architectural distinction between Agent and Workflow, and the definition of “augmented LLM” as the basic building block. The article acknowledges that “Agent can be defined in several ways.” -
OpenAI, “A Practical Guide to Building Agents,” 2025.
https://platform.openai.com/docs/guides/agents
OpenAI defines an Agent as “a system that can independently complete tasks on your behalf,” emphasizing guardrails and tool access. In the OpenAI Agents SDK, an Agent is defined as the core unit packaging a model, instructions, and runtime behavior. -
Google, “AI Agents” Official Documentation, 2025–2026.
https://cloud.google.com/discover/what-are-ai-agents
Google defines an AI Agent as “a software system that uses AI to pursue goals and complete tasks on behalf of the user,” emphasizing reasoning, planning, and memory capabilities. Google’s system has no independent Skill concept layer, using FunctionDeclaration objects instead. -
Anthropic, “Agent Skills — Open Standard,” 2025.
https://docs.anthropic.com/en/docs/agents-and-tools/agent-skills
Anthropic first introduced the SKILL.md concept in October 2025 and released it as an open standard in December. Skill is defined as “a set of instructions interpreted by the LLM,” explicitly distinguished from tools (deterministic operations). Skills are persistent, portable, version-controlled artifacts. -
OpenAI, “Codex Agent Skills” Documentation, 2026.
https://platform.openai.com/docs/guides/codex-skills
OpenAI later adopted Anthropic’s open standard, describing Skills as “a format for writing reusable workflows.” Skills package instructions, resources, and optional scripts. OpenAI did not originally use “Skill” terminology; its historical paradigm was “everything is a tool.” -
Microsoft, “Semantic Kernel: Skills → Plugins Rename,” 2023.
https://learn.microsoft.com/en-us/semantic-kernel/
Microsoft’s Semantic Kernel originally named reusable capability modules “Skills,” later renaming them to “Plugins”—reflecting the industry’s inconsistent understanding of the Skill vs. Tool boundary. -
Marvin Wendt, “MCP vs Agent Skills: Complete Breakdown,” 2025.
https://www.marvinwendt.com/blog/mcp-vs-agent-skills
A systematic comparison of MCP (tool layer) and Agent Skills (knowledge layer). Notes that teams treating everything as a Tool will eventually face context window bloat, model confusion, and brittle integrations; teams investing only in Skills will get Agents that think brilliantly but cannot do anything. -
Texas A&M / UT Austin / Purdue University, “AI Brain Rot” Study, 2025.
https://arxiv.org/ (related preprint)
Research demonstrates that AI models trained on low-quality internet data exhibit “brain rot.” Models exposed to junk content saw reasoning scores drop from 74.9 to 57.2, and memory and long-context comprehension from 84.4 to 52.3. -
IncredibleAnalytics, “Is ChatGPT Getting Dumber? Yes — Here’s the Data,” 2025.
https://incredibleanalytics.com/is-chatgpt-getting-dumber/
Systematic documentation of measurable declines in ChatGPT output quality. Categorized into three types of changes: tightened safety filtering, cost optimization, and behavioral tuning. 81% of developers still use GPT models, but Claude adoption has grown to 43%—developers are actively seeking alternatives. -
Mauro Pellegrini et al., “Token Billing Opacity in LLM Platforms,” 2025–2026.
https://arxiv.org/ (related preprint)
The paper documents cases where over 90% of billed tokens were never displayed to users. It introduces the concept of “token count inflation”—vendors can overreport token counts or inject fabricated reasoning tokens. A single ARC-AGI run on OpenAI’s o3 model consumed 111 million tokens ($66,772). Also documents “model downgrade” practices—silently substituting lower-cost models while maintaining the same billing. -
James Liu, “Understanding Reasoning Tokens in O-series Models,” 2025.
https://community.openai.com/
Developer community discussion on the opacity of reasoning tokens. One developer reported: “I only sent one sentence, and the model only replied with a dozen words—why does it show nearly 900 output tokens?” Reasoning tokens are billed as output tokens but do not appear in the API response. -
GrisLabs, “AI Agent Cost Analysis: 1127 Runs,” 2026.
https://grislabs.com/ (internal report)
Tracked 1,127 Agent runs with a median cost of $1.22, but the 95th percentile reached $22.14—an 18× ratio meaning “average task cost is a lie; the long tail devours the budget.” Reasoning tokens can amplify a single query’s cost by 5–50×. -
이조글로벌인공지능연구소 (the author), Google AI Studio Dashboard Screenshots, April 2026.
The author’s own Gemini API paid Tier 1 account rate limit page. Shows Gemini 3 Pro TPM usage at 1.26M/1M limit (26% over), Gemini 3 Flash TPM usage at 1.08M/1M limit (8% over). Use case was exclusively chat. After TPM was triggered, the API was blocked for the remainder of the day. The author simultaneously conducted local inference tests on an NVIDIA DGX Spark, which also experienced system-level crashes due to high-density OOD conversations. -
McKinsey & Company, “The State of AI in 2025,” 2025.
https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
The report notes that approximately 80% of enterprises are using generative AI, but an equal number report no significant bottom-line impact. “Horizontal” Copilots have been rapidly deployed but with diffuse returns; more transformative “vertical” use cases remain at the pilot stage in approximately 90% of cases. -
Various industry reports on AI Agent deployment, 2025–2026.
Multiple cross-validated reports: only 11% of organizations have deployed AI Agents to production; over 80% of AI projects fail to deploy, which is twice the failure rate of non-AI IT projects. -
Martechify, “AI Agent Pricing Guide,” 2026.
https://martechify.com/ai-agent-pricing/
Analysis indicates most enterprise budgets underestimate the true total cost of ownership by 40–60%. Visible costs (vendor quotes, finance-approved portions) account for only 50–60% of actual expenditure. Hidden costs include: integration, maintenance, human review, error handling, and more. -
OpenClaw Community / GitHub Issues, 2025–2026.
https://github.com/ (OpenClaw-related discussions)
Community data shows OpenClaw injects approximately 35,600 tokens of workspace files per message, with 93.5% of the token budget spent on static content that never changes. A single user’s main session occupies 56–58% of the 400K context window. New users commonly spend $30–100 within a few days. -
DeviantArt Forum, “AI Image Generation is Gacha!” 2025.
https://www.deviantart.com/forum/
Users directly analogize AI image generation to a “gacha” mechanism. They note that the more constraints added to prompts, the narrower the AI’s choice space becomes, and results actually deviate further from expectations. The strategy becomes “say less”—giving the AI room to improvise. -
Vendasta / Intercom, “AI Agent Pricing Models,” 2025–2026.
https://www.vendasta.com/blog/ai-agent-pricing/
The AI Agent market suffers deeply from pricing opacity. Most consulting firms and vendors adopt a “contact us for a quote” strategy, hiding the investment range until late in the sales cycle. Outcome-based pricing requires prior agreement on “what counts as an outcome”—but this is extremely difficult to achieve in practice. -
Zuora / SaaS industry analysis, “The Death of Per-Seat Pricing,” 2026.
https://www.zuora.com/blog/
When a single AI Agent can do the work previously requiring 10–50 human users, per-seat pricing doesn’t just get compressed—it collapses. Vendors invent new abstraction layers: Token → Credit → “Intelligence Unit.” Customers want familiar, budgetable models, but old frameworks no longer apply. -
MIT / Gartner / IBM, AI Pilot Failure Rate Studies, 2025–2026.
https://biztechmagazine.com/article/2026/04/google-cloud-next-2026-expanding-ai-agent-adoption-requires-culture-shift
MIT reports that 95% of generative AI pilot projects fail. Gartner found that by the end of 2025, at least 50% of generative AI projects were abandoned after proof-of-concept (attributed to poor data quality). IBM’s 2025 Global CEO Survey found that only 25% of AI projects in the past three years achieved their expected value. -
Anonymous et al., “Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs,” arXiv:2512.11509, 2026.
https://arxiv.org/abs/2512.11509
Conducted creativity impact assessments on three hallucination-suppression techniques: CoVe, DoLa, and RAG. Results were surprising: CoVe enhanced divergent creativity while reducing hallucination, while DoLa suppressed both simultaneously. This demonstrates that the relationship between hallucination and creativity is not a simple equivalence but presents different tradeoff directions depending on the suppression pathway. -
He et al., “Shakespearean Sparks: The Dance of Hallucination and Creativity in LLMs’ Decoding Layers,” arXiv:2503.02851, 2025.
https://arxiv.org/abs/2503.02851
Empirical analysis reveals a consistent hallucination-creativity tradeoff across layer depth, model type, and model scale. Across different model architectures, there exists a specific optimal balance layer. This layer tends to appear earlier in larger models, and model confidence at this layer is also significantly higher. -
Liu, “The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation,” arXiv:2603.24124, 2026.
https://arxiv.org/abs/2603.24124
Formally introduces the concept of “Alignment Tax” and quantifies its impact. After RLHF alignment, 40% of questions on TruthfulQA produce only a single semantic cluster (across 10 independent samples), with sampling uncertainty discrimination dropping to AUROC=0.500 on affected questions. A base-vs-instruct ablation experiment on Qwen3-14B confirms the causal role of alignment: the base model’s single-cluster rate was 1.0%, which soared after alignment. -
Louis Bouchard, “Agents or Workflows?” 2025; LangChain, “How to Think About Agent Frameworks,” 2026.
https://www.louisbouchard.ai/agents-vs-workflows/ · https://blog.langchain.com/how-to-think-about-agent-frameworks/
Independent analyses of the Agent–Workflow boundary. Bouchard notes that “most of what people call Agents are not actually Agents”—CrewAI’s “Agents” are essentially predefined Workflows. LangChain notes that both OpenAI and Anthropic treat Workflows as distinct from Agents, while acknowledging that production-grade systems are almost always a combination of both. Anthropic recommends “finding the simplest viable solution and only adding complexity when necessary—which may mean not building an agentic system at all.”