A Technology Philosophy Paper

Vikings Without a Compass
On the Systemic Crisis of AI’s Missing Evaluation Framework

“The Vikings crossed the Atlantic and reached North America — without a compass, without a sextant.
They accomplished the feat through pattern recognition of stars and waves.
But every voyage was a probabilistic event: some arrived at the New World; others vanished into the fog.
Today’s AI is the Viking longship of the digital age.”

Thought Paper · March 2026

Classification
Original Thought Paper
Domains
AI Epistemology · Technology Philosophy · Market Analysis

LEECHO Global AI Research Lab
이조글로벌인공지능연구소

&
Claude Opus 4.6 · Anthropic


Abstract

This paper employs the central metaphor of “Vikings without a compass” to argue systematically that the artificial intelligence industry faces a fundamental paradox: exponential growth in generative capability alongside the structural absence of an evaluation framework. The semiconductor industry possesses Moore’s Law as a quantifiable evolutionary coordinate, yet the AI field still lacks any widely recognized “yardstick” — AGI does not even have acceptance criteria. The paper develops its argument across seven dimensions: the structural void of evaluation yardsticks, the intrinsic indeterminacy of matrix computation, the inverse law of precision boundaries, the systemic failure of commercial alignment, the physical distortion of AI visual outputs, the fracture between replication and creation, and the deep limitations of RLHF alignment mechanisms. Drawing on market evidence from 2025–2026 — including the ~30% collapse in software ETFs while semiconductor ETFs rose ~30%, the evaporation of ~$1–2 trillion in SaaS market capitalization, and Grok’s generation of approximately 3 million policy-violating sexualized images — the paper argues that the AI industry is trapped in a structural dilemma of “astounding navigational capability with no clear heading.”

AI Evaluation Framework
Matrix Indeterminacy
Precision Boundaries
Commercialization Failure
RLHF Limitations
SaaS Collapse
Viking Metaphor

Chapter I · Introduction

When the Viking Longship Enters Digital Waters
Why “the greater the capability, the greater the problem”

Around 985 AD, the Viking explorer Erik the Red led twenty-five longships from Iceland toward Greenland. Only fourteen reached their destination; the remaining eleven were driven back or lost in North Atlantic storms. A success rate of barely more than half was not due to any lack of maritime skill — quite the contrary, Viking shipbuilding and celestial navigation were exceptional for their era. The problem lay elsewhere: they had no compass. Every ocean voyage was a gamble based on experience, probability, and luck, rather than a controllable operation grounded in quantifiable standards.

The artificial intelligence industry of 2026 exhibits an almost perfect structural isomorphism with Viking seafaring. Large language model parameters have leapt from hundreds of billions to trillions; generative AI can write poetry, code software, paint pictures, and compose music. The frontier of capability is pushed outward every quarter. Yet one fundamental question remains unresolved: we lack any recognized yardstick for measuring whether AI is “good” or “bad”. The semiconductor industry has Moore’s Law — transistor density doubling roughly every two years — as a clear evolutionary coordinate. What does AI have? AGI does not even have acceptance criteria. As IEEE Spectrum reported in its October 2025 deep dive: on an expert panel of AI researchers, one person said AGI might never happen, while another said it already had.

This is not a purely philosophical question. It carries real-money market consequences. From September 2025 to February 2026, the software-sector ETF (IGV) fell approximately 30%, while the semiconductor ETF (SMH) rose approximately 30% over the same period. In January and February 2026 alone, roughly $1–2 trillion was wiped from global software stock market capitalization. The capital markets are casting their vote in the most brutal fashion: AI’s “iron” (chips) is more trustworthy than AI’s “use” (software). The reason is simple: chips have yield rates, process nodes, and performance benchmarks, while the output quality of AI software is, every single time, a fresh probabilistic event.

−30% · Software ETF (IGV) · Sep 2025 – Feb 2026
+30% · Semiconductor ETF (SMH) · same period
~$1–2T · SaaS market cap evaporated · Jan–Feb 2026
72% · CIOs report AI investments not yielding positive returns · Gartner, 2025

Chapter II · The Missing Yardstick

Semiconductors Have Moore’s Law. AI Has Nothing.
The structural void in evaluation systems

Throughout the semiconductor industry’s six-decade trajectory, Gordon Moore’s 1965 observation has served as a navigational chart. Although it is fundamentally an empirical observation rather than a physical law, it granted the entire industry an irreplaceable function: quantifiable expectations of progress. Engineers know what the next process node must achieve; investors know the cadence of capacity expansion; customers know when to upgrade their equipment. The entire supply chain synchronizes accordingly.

No such coordinate system exists in AI. The primary method of measuring AI model capability today is benchmarking, but benchmarks face a fatal paradox: they begin to expire the moment they are born. In 2019, François Chollet released the ARC-AGI test to measure “fluid intelligence” — the ability to reason about entirely novel problems. When ARC-AGI-2 was released in 2025, the gap it exposed was stark: even the most advanced model, GPT-5.2, scored only about 54%, while every task in the test set could be solved by at least two humans within two attempts. ARC Prize co-founder Chollet stated plainly that no widely accepted standard for AGI evaluation currently exists, and that many existing benchmarks cannot distinguish memorized responses from genuinely novel reasoning.

Even more pointed is the absence of “acceptance criteria.” OpenAI CEO Sam Altman acknowledged in mid-2025 that GPT-5 was still “missing something quite important” to qualify as AGI. But what is that “something quite important,” precisely? No one can provide an exact definition. It is as if the Vikings knew they “had not yet reached Greenland” but did not know in which direction Greenland lay, or how far away it was. The industry’s consensus definition of AGI — “a system that can automate the majority of economically valuable work” — sounds pragmatic, but is in reality a goal that is unfalsifiable, unquantifiable, and unable even to command expert consensus.

| Dimension | Semiconductor industry | AI industry |
| --- | --- | --- |
| Evolutionary yardstick | Moore’s Law (transistor density doubles roughly every two years) | No recognized yardstick; benchmarks continuously invalidated |
| Quality metrics | Yield rate, process node (nm), power efficiency | Perplexity, MMLU, etc. — do not map to real-world user experience |
| Acceptance criteria | Performance benchmarks met → tape-out | No AGI acceptance criteria; no expert consensus |
| Reproducibility | Same-batch chips perform consistently | Same prompt yields different outputs on every run |
| Market signal | SMH: +40% in 2024, +49% in 2025 | IGV: −30% from Sep 2025 peak |

Chapter III · The Indeterminacy of Matrix Computation

Every Inference Is the Result of Different Weights in Contention
The butterfly effect of Temperature, Top-p, and context windows

To understand why AI “behaves differently every time,” one must look into the nature of its computation. The inference process of a large language model is fundamentally a high-dimensional matrix operation: tens of billions to trillions of parameters (weights), activated by a particular input, propagate forward through dozens to hundreds of transformer layers to ultimately produce a probability distribution over the vocabulary. The model then “samples” the next token from this distribution. This sampling process is governed by several hyperparameters.

Temperature controls the “sharpness” of the probability distribution — lower temperatures cause the model to favor the highest-probability tokens, yielding more deterministic but rigid output; higher temperatures produce more diverse but unpredictable output. Top-p (nucleus sampling) sets a cumulative probability threshold, sampling only from the smallest set of tokens whose probabilities sum to at least p. Context window length directly determines how much prior text the model can “remember.” Even small adjustments to these parameters can produce radically different outputs.
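To make the mechanism concrete, here is a minimal, self-contained sketch of temperature and nucleus sampling over a toy logit vector — an illustration of the general technique, not any vendor’s actual decoder:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Sample one token id from raw logits via temperature + nucleus (top-p) sampling."""
    rng = rng or np.random.default_rng()
    # Temperature rescales the logits: lower -> sharper, more deterministic distribution.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    # Nucleus sampling: keep the smallest set of tokens whose cumulative probability >= top_p.
    order = np.argsort(probs)[::-1]            # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))

# Identical inputs, two draws — the chosen token can differ on every call.
logits = np.array([2.0, 1.5, 0.3, -1.0])
print(sample_next_token(logits, temperature=0.7, top_p=0.9))
print(sample_next_token(logits, temperature=0.7, top_p=0.9))
```

Running the last two lines repeatedly with identical inputs can yield different tokens — the indeterminacy discussed below, reproduced in a few lines of code.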

A fundamental contradiction lies here: all of this indeterminacy is not a bug but a feature. It is precisely this stochasticity that gives AI the appearance of “creativity” — the ability to offer different perspectives on the same question. Yet it also means that even with an identical prompt running on an identical system, two inference passes may produce substantially different outputs. In traditional engineering, if a machine produces a good product today and a defective one tomorrow, we call that machine “defective.” In AI, however, this indeterminacy is packaged as “diversity.” SOPs (Standard Operating Procedures) can constrain input format, but they cannot suppress output deviation to zero.

Chip manufacturing pursues nanometer-level consistency, where every transistor must fall within tolerance. AI inference pursues “controllable indeterminacy.” The tension between these two paradigms is the root cause of the industry’s chaos.

Chapter IV · The Inverse Law of Precision Boundaries

Usable When Ambiguity Is Tolerated; Catastrophic When Precision Is Required
The cliff from “roughly correct” to “accurate to the millimeter”

A stark inverse law governs the relationship between AI’s utility and precision requirements: the wider the margin for error, the more impressive AI’s performance; the higher the precision demand, the more catastrophic its collapse. This law can be expressed as —

AI Applicability ≈ 1 / (Precision Requirement)ⁿ, where n > 1 — meaning even a slight increase in precision requirements causes a steep decline in applicability.
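A back-of-envelope reading of this heuristic (the exponent n is the paper’s rhetorical device, not a measured quantity) shows how abrupt the fall-off is:

```python
# Illustrative only: n is a rhetorical exponent from the text, not a fitted value.
def applicability(precision_requirement: float, n: float = 2.0) -> float:
    return 1.0 / precision_requirement ** n

for p in (1, 2, 4, 8):          # doubling the precision demanded at each step
    print(p, applicability(p))  # with n = 2: 1.0, 0.25, 0.0625, 0.0156 — a cliff, not a slope
```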

In “roughly correct is good enough” scenarios, AI performs admirably. Drafting a business email? Excellent. Summarizing a document’s key points? Superb. Generating creative copy? Quite respectable. The common trait of these scenarios is a wide “acceptable output space.” An email can use various phrasings, a summary can approach from different angles — none of this affects “correctness.”

However, once precision requirements cross a certain threshold, performance falls off a cliff. Does a PowerPoint presentation require elements aligned to pixel-level precision? Must a PDF output strictly comply with particular formatting specifications? Does an e-commerce product image need color reproduction to a specific Pantone code? Must an architectural drawing be accurate to the millimeter? In these scenarios, AI output often exhibits catastrophic collapse. This is not “sometimes making mistakes” — it is “almost inevitably making mistakes,” because such precision requirements exceed the controllable range of probabilistic sampling mechanisms.

Multiple Gartner surveys from 2025 corroborate this assessment from different angles: 72% of CIOs reported their organizations were breaking even or losing money on AI investments; 88% of HR leaders said their organizations had not realized significant business value from AI tools. When AI moves from the laboratory to the desktop, it no longer faces the neatly bounded multiple-choice questions of benchmarks, but real business scenarios where precision requirements spike sharply — and that is when the gap becomes undeniable.

Chapter V · The Systemic Failure of Commercial Alignment

Broken PPT Layouts, Collapsed PDF Formatting, Misleading E-Commerce Images
When AI leaves the lab, user experience falls apart comprehensively

The “inverse law” discussed in Chapter IV receives full-spectrum validation in commercial deployment scenarios. Let us examine several of the most representative failure cases.

Office software: Copilot’s user backlash. Microsoft deeply integrated AI into the Office 365 suite with its $20/month Copilot service. Yet on Microsoft community forums and third-party review platforms, user feedback proved disheartening. One user stated bluntly that “it can’t even complete very basic tasks like rewriting documents and it cannot cope with more than a paragraph of text.” Another compared Copilot to “the evil offspring of Clippy.” Yet another described its PowerPoint performance as intolerable — “every time I create an object, there’s the damn Copilot prompt, obscuring what I’m trying to do, and adding absolutely zero value.” These are not fringe complaints but large-scale, systemic dissatisfaction.

E-commerce: AI “photo fraud” running rampant. In January 2026, China’s Xinhua News Agency published a commentary declaring that “AI photo fraud” must not be allowed to run unchecked on e-commerce platforms. Investigation revealed that merchants were using AI-generated glamorous images and videos as promotional materials, luring consumers into purchases only to deliver crude, vastly inferior products. From plush toy keychains to clothing and accessories, this phenomenon had spread across multiple product categories. A gray-market supply chain of AI tool vendors and opportunistic merchants had formed — for just a few hundred yuan per month in subscription fees, they could mass-produce hundreds of nearly indistinguishable fake promotional images, at a fraction of the cost of actual photography. Taobao’s platform had cumulatively intercepted nearly 100,000 AI-generated fraudulent images.

“AI Slop” — 2025 Word of the Year. Merriam-Webster selected “Slop” as its 2025 Word of the Year, defining it as “digital content of low quality that is produced usually in quantity by means of artificial intelligence.” The very coining of this term is an exquisite indictment of AI’s commercialization status quo. From the viral “Shrimp Jesus” composites flooding Facebook, to zombie-football videos mass-produced by AI on YouTube, to the deluge of AI-cover ebooks on Amazon — AI Slop is eroding the internet’s content ecosystem at an alarming pace. Research found that 21% of videos recommended to new YouTube users fell into the AI Slop category.

Chapter VI · The Physical Distortion of Visual Generation

Fluids, Fabrics, and Lighting Betray the Illusion
Physical-law violations in AI images and video, and the copyright vacuum

AI image and video generation is the most visceral showcase of the “inverse law.” In low-precision scenarios — say, generating a concept image for social media, or a creative first draft for a short video — AI performance has evolved from “obviously fake” to “difficult for non-experts to distinguish from reality.” Yet the laws of physics remain a mirror that AI-generated content currently cannot deceive.

Fluid dynamics is the first gate. The flow and splash of water are governed by the notoriously complex Navier–Stokes equations; AI-generated fluid motion, under close scrutiny, often exhibits unnatural viscosity or lacks authentic turbulence. Fabric simulation is the second gate: the drape, wrinkling, and wind-blown movement of textiles are tightly constrained by material properties and mechanics, yet AI-generated fabrics frequently behave in ways that violate gravity or material physics. Lighting consistency is the third gate: shadows, reflections, and refractions from multiple light sources within a single frame must obey optical laws, yet AI-generated scenes often contradict themselves in shadow direction.

These physical distortions may be overlooked in creative contexts, but they become critical flaws in commercial settings. Even more pressing is the copyright vacuum. The ownership of AI-generated images remains unresolved — is the creator the model developer? The prompt author? The original creator of the training data? The 2025 global controversy over Ghibli-style AI images fully exposed this legal void. When the commercial world attempts to use AI-generated content for formal marketing, advertising, and product displays, physical distortion and copyright uncertainty compound to create a dual-risk zone.

The Grok incident pushed this risk to an extreme. From late December 2025 to early January 2026, xAI’s Grok chatbot generated approximately 3 million sexualized images in just 11 days, of which an estimated 23,000 depicted minors. Regulatory authorities in multiple countries launched investigations; Malaysia and Indonesia banned Grok outright. This event was an extreme case of AI visual generation lacking safety guardrails, and a direct consequence of the absent evaluation framework — without standards there are no baselines; without baselines there are no constraints.

~3M · Sexualized images generated by Grok in 11 days
~23K · Estimated images depicting minors
~190/min · Average generation rate of sexualized images
10+ · Countries/regions launching investigations or taking action

Chapter VII · The Boundary Between Replication and Creation

Reliable Within Structured Templates; Unreliable Under Open Conditions
The “narrow corridor” of AI capability

The key to understanding AI’s practical utility lies in recognizing a structural boundary: AI excels at filling in content within known frameworks, but falters at making dynamic judgments under open conditions. This is not a difference of degree; it is a categorical rupture.

When a task can be decomposed into “known template + variable filling,” AI approaches perfection. Completing a fixed-format contract, inserting data into a standardized template, rewriting text according to an existing style — these are all fundamentally “operating within defined boundaries.” The template constrains the output space, variables constrain the content range, and AI need only make optimal selections within a limited space. This is precisely what probabilistic models do best.
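A back-of-envelope comparison makes the point quantitatively (the slot counts and vocabulary size below are illustrative assumptions, not measurements):

```python
import math

# A template with 3 slots, each filled from ~100 vetted candidates:
# the output space is one million possibilities — small enough to validate.
slots, candidates = 3, 100
template_space = candidates ** slots                 # 1,000,000

# An open-ended 200-token generation over a 50,000-token vocabulary:
# roughly 940 orders of magnitude larger — far too vast to validate.
open_tokens, vocab = 200, 50_000
open_space_log10 = open_tokens * math.log10(vocab)   # ≈ 940

print(f"template space: {template_space:,}")
print(f"open-ended space: ~10^{open_space_log10:.0f}")
```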

However, when tasks involve dynamic layout under open conditions, typographic judgment, or aesthetic decision-making, AI’s performance drops precipitously. Consider: “Based on the content of this 30-page document, design a visually appealing and logically clear PowerPoint.” This task demands that the model simultaneously handle content comprehension, information hierarchy, visual composition, color coordination, font selection, spatial allocation, and other interrelated dimensions, rendering independent yet stylistically consistent aesthetic judgments on every single slide. This far exceeds current models’ capacity for “sampling from a given distribution.”

The industry implications of this finding are profound. It means AI’s “applicable narrow corridor” is far narrower than marketing suggests: reliability concentrates in highly structured, explicitly formatted, wide-tolerance scenarios. The moment the task crosses into dynamic judgment, fine-grained layout, or multi-dimensional aesthetic decision-making, the current architecture proves insufficient.

Chapter VIII · The Illusion of Alignment

RLHF Aligns Sentiment, Not Wisdom; RLVR’s Verification Domain Is Vanishingly Narrow
Deep-well intelligence and the Texas Sharpshooter Fallacy

If the preceding chapters demonstrated the limits of AI’s “navigational capability,” this chapter must demonstrate that the “compass” we are attempting to install on this Viking longship is itself flawed.

RLHF (Reinforcement Learning from Human Feedback) is the mainstream model alignment technique today. Its core logic: human annotators rank model outputs by preference, a reward model is trained to approximate human preferences, and reinforcement learning then steers the base model to maximize this reward signal. The problem: RLHF aligns to human emotional preferences, not objective standards of wisdom. Annotators tend to select answers that “read fluently, feel friendly, and are clearly structured” — even when such answers are factually ambiguous or outright incorrect. RLHF is thus effectively training models to “sound good” rather than to “be correct.”
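The mechanism is visible in the standard Bradley–Terry objective commonly used to fit reward models from preference pairs — sketched here in simplified form, not any lab’s production code:

```python
import numpy as np

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Push the score of the annotator-preferred answer above the rejected one.
    The training signal is which answer the human *liked* — fluency and
    friendliness count as much as correctness, unless annotators catch errors."""
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

print(preference_loss(2.1, 0.3))  # ≈ 0.15: reward model agrees with the annotator
print(preference_loss(0.3, 2.1))  # ≈ 1.95: heavily penalized for disagreeing
```

Nothing in this objective references truth; it optimizes agreement with human preference, whatever that preference happens to track.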

RLVR (Reinforcement Learning with Verifiable Rewards) attempts to correct this bias by using objectively verifiable criteria (such as mathematical proofs and code execution results) as reward signals. But RLVR’s domain of applicability is extremely narrow — it can only be used in domains where answers are automatically verifiable by machine. The vast majority of real-world valuable problems (strategic decisions, aesthetic judgments, ethical trade-offs, complex writing) do not possess automatic verifiability.
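The contrast with RLHF is visible in the shape of the reward function itself — a sketch, assuming a hypothetical `checker` callable:

```python
def verifiable_reward(model_answer: str, checker) -> float:
    """RLVR-style reward: 1.0 iff an automatic checker confirms the answer.
    `checker` might run unit tests or compare against a known ground truth."""
    return 1.0 if checker(model_answer) else 0.0

# Mechanical verification works for math and code:
print(verifiable_reward("42", lambda ans: ans.strip() == "42"))  # 1.0
# But no checker exists for "Was this market-entry strategy wise?" —
# the reward is simply undefined outside machine-verifiable domains.
```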

Here we encounter a classic Texas Sharpshooter Fallacy: shoot first, draw the target afterward. AI demonstrates astonishing capability in narrow, verifiable domains such as mathematics and programming — but this is precisely because those domains are “verification bullseyes” tailored for AI. When we consequently declare that AI possesses “powerful intelligence,” we commit the target-drawing fallacy: labeling the cluster of bullet holes as the center of the target while ignoring the vast areas the shooter missed entirely. This is what I call “deep-well intelligence” — superhuman performance at an extremely narrow depth, yet riddled with holes across breadth.

RLHF teaches AI to “be likable.” RLVR teaches AI to “score on math tests.” But no alignment technique teaches AI to “make reliable judgments in an uncertain real world.” The chasm between these three is the microcosm of the missing evaluation framework.

Chapter IX · The Market’s Verdict

The “SaaSpocalypse”: Capital Markets Cast Their Vote
When Wall Street answers “Is AI reliable?” with real money

If all the preceding arguments still seem too theoretical, the global capital markets of early 2026 delivered the most direct empirical answer.

On February 3, 2026, a single AI product launch detonated across financial markets. Analysts at Jefferies’ trading desk immediately christened it the “SaaSpocalypse.” In a single day, software sector market capitalization evaporated by approximately $285 billion. Over the full period from mid-January to mid-February, global software stock market cap evaporated by an estimated $1–2 trillion. The S&P North American Software Index posted its worst monthly decline since the 2008 financial crisis. Atlassian disclosed in its earnings report the first-ever decline in enterprise seat count, with its stock plunging 35%. Salesforce fell 28% despite continued revenue growth, as investors focused on slowing new customer acquisition.

Over the same period, however, the semiconductor sector surged. The VanEck Semiconductor ETF (SMH) rose 40% across all of 2024, another 49% in 2025, and 12% year-to-date in 2026. Global semiconductor sales are projected to reach $975 billion in 2026, representing 26% year-over-year growth. The five major hyperscalers plan to spend $660–690 billion on infrastructure in 2026, with approximately 75% directed toward AI infrastructure.

This data reveals a deep paradox: the market simultaneously believes AI’s infrastructure (chips) holds enormous value, while denying the value of AI’s application layer (software). Bank of America analyst Vivek Arya precisely identified the absurdity of this contradiction: the SaaS sell-off rests simultaneously on two mutually exclusive premises — “AI capex will deteriorate due to weak ROI” and “AI will be so powerful as to completely displace traditional software.” The two premises cannot both be true.

But viewed through the “Viking metaphor,” this paradox resolves itself: the market believes the longship is good (chips), but does not believe the longship will reach its destination (applications). Because the quality of the longship can be measured, inspected, and priced — while the outcome of the voyage is probabilistic, uncertain, and devoid of evaluation criteria. This is the macroeconomic consequence of the missing evaluation framework: when capital cannot measure the reliability of AI applications, it retreats to investing in the underlying hardware.

Conclusion

Building a Compass for the Viking Longship
Escaping the dilemma of “astounding capability, unclear heading”

Let us return to the Viking metaphor. Historically, what truly transformed seafaring was not a faster ship or sturdier timber — it was the invention of the compass and the sextant. The value of these instruments lay not in “creating” something new, but in endowing navigation with quantifiable, reproducible, and calibratable certainty. Before the compass, ocean crossing was an act of heroism. After the compass, ocean crossing became manageable engineering.

The core challenge facing the AI industry today is not insufficient computing power, not too-small models, not too-little data — but the absence of a multi-layered, quantifiable evaluation framework spanning from foundational computation to top-level application. Such a framework must answer at minimum: within what precision range is AI output trustworthy? Under what scenario boundaries is AI performance reproducible? What level of indeterminacy can a given application tolerate? How do we distinguish AI’s “performative correctness” (looking good on the surface due to RLHF alignment) from “substantive correctness” (causal reasoning, factual accuracy)?
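One primitive such a framework would need is a reproducibility probe — sketched below with a hypothetical `query_model` callable standing in for any text-generation API:

```python
from collections import Counter

def reproducibility_score(query_model, prompt: str, runs: int = 20) -> float:
    """Fraction of repeated runs agreeing with the modal output.
    1.0 = fully reproducible; near 1/runs = effectively nondeterministic."""
    outputs = [query_model(prompt) for _ in range(runs)]
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / runs
```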

The seven dimensions of argument in this paper converge on a single conclusion: between AI’s “capability” and AI’s “reliability” lies an enormous, insufficiently recognized chasm. Capability refers to a model’s scores on specific benchmarks, or the impressive outputs demonstrated in particular scenarios. Reliability refers to a model’s ability to deliver results that meet expectations continuously, stably, and predictably in real commercial environments. The AI industry has been charging “reliability” prices on the strength of a “capability” narrative — and the market is now conducting a brutal value correction, to the tune of trillions in evaporated SaaS market capitalization.

The Vikings, saga tradition holds, eventually found their sunstone — a crystal said to reveal the sun’s position through polarized skylight even on overcast days. The AI industry needs its own “sunstone”: an evaluation framework that transcends individual benchmarks, transcends laboratory environments, and provides reliable navigation amid the complexity of the real world. Until then, every commercial deployment of AI will remain a magnificent yet uncertainty-laden North Atlantic voyage — some will reach the New World, and some will vanish into the fog.

Vikings without a compass built longships that crossed the Atlantic. An AI industry without an evaluation framework has built models that cross the frontiers of imagination. But history tells us: what truly changed the world was not a bigger ship, but a truer heading.

Note: This paper is an independent thought paper that has not undergone peer review.
It is an exploratory document intended to provoke thinking on the structural crisis of AI evaluation systems.
All data cited has been drawn from publicly available sources as of March 2026.

References & Data Sources

  1. RIA Advisors, “SaaS: Is There Opportunity In The Destruction?”, March 2026. IGV down ~30% from Sep 2025 peak; SMH up ~30% over same period.
  2. Digital Applied, “The SaaSpocalypse: AI Agents Disrupting Software Industry”, Feb 2026. ~$2 trillion wiped from software sector market cap between Jan 15–Feb 14, 2026.
  3. Fortune, “The tech stock free fall doesn’t make any sense, BofA says”, Feb 2026. BofA analyst identifies SaaS sell-off as resting on mutually exclusive premises.
  4. Bain & Company, “Why SaaS Stocks Have Dropped”, 2026. Software indices down ~25% from 12-month highs.
  5. Motley Fool, “My Top Semiconductor Pick Rose 49% in 2025”, March 2026. SMH returned 49% in 2025, up another 12% YTD in 2026.
  6. ETF.com, “Semiconductor Sector Gains While Solar Dims in 2024”, Dec 2024. SMH rose 40.4% in 2024.
  7. Gartner 2025 surveys: 72% of CIOs report AI investments not yielding positive returns; 88% of HR leaders say organizations have not realized significant business value from AI; 53% of consumers distrust AI-powered search results.
  8. Microsoft Community forums & Trustpilot, Copilot user feedback compilation, 2025–2026.
  9. Xinhua News Agency commentary, “AI ‘Photo Fraud’ Must Not Run Unchecked on E-commerce Platforms”, Jan 15, 2026.
  10. CCDH (Center for Countering Digital Hate), “Grok floods X with sexualized images”, Jan 2026. ~3 million sexualized images generated in 11 days.
  11. Wikipedia, “Grok sexual deepfake scandal”, 2026. Multiple countries launch investigations.
  12. Merriam-Webster, “Slop: 2025 Word of the Year”, Dec 2025. Defined as “digital content of low quality produced usually in quantity by means of AI.”
  13. IEEE Spectrum, “AGI Benchmarks: Tracking Progress Toward AGI Isn’t Easy”, Oct 2025.
  14. ARC Prize Foundation, “Announcing ARC-AGI-2 and ARC Prize 2025”, 2025.
  15. arXiv:2505.10653, “On the Evaluation of Engineering AGI”, May 2025. States no widely accepted AGI evaluation standard exists.
  16. AI 2 Work, “The 2026 SaaS Apocalypse”, Feb 2026. Five major hyperscalers plan $660–690B infrastructure spend in 2026.

Vikings Without a Compass — On the Systemic Crisis of AI’s Missing Evaluation Framework

March 2026 · Original Thought Paper

이조글로벌인공지능연구소

LEECHO Global AI Research Lab

& Claude Opus 4.6 · Anthropic

“The measure of intelligence is not the ability to generate, but the ability to know when the generation is wrong.”
