LEECHO GLOBAL AI RESEARCH LAB · THOUGHT PAPER

The Laboratory AI Agent
That Cannot Be
Backward-Compatible

Why AI fails in real enterprises: the ignored obligation of backward compatibility with human workflows

LEECHO Global AI Research Lab & Opus 4.6

April 25, 2026

V1 · ENGLISH EDITION

Abstract

Today’s AI Agents face a fundamental contradiction that the entire industry has systematically ignored: they perform brilliantly in the controlled environments of laboratories, yet fail massively in the “dirty environments” of real enterprises. Global data shows that 95% of enterprise AI pilots fail to produce measurable business value, and 88% of AI Agents cannot transition from pilot to production-scale deployment. This paper argues that the root cause of this predicament lies not in the maturity of AI technology itself, but in the AI industry’s fundamental neglect of backward compatibility: AI is designed to demand that the world adapt to it, rather than adapting itself to the world. An Office 97 installation used for twenty years, an Excel spreadsheet maintained by hand by a veteran accountant, a folder naming convention like “Final Version (revised) (2) REALLY Final”—these seemingly chaotic artifacts are, in reality, complete effectiveness systems validated through real-world business competition. If AI cannot be backward-compatible with these “outdated” practices, it is not an intelligent agent—it is a new Bug.

SECTION 01

Twenty Years of Survival Is the Strongest Validation

Consider this scenario: a Chinese factory in 2026 where every computer still runs a pirated copy of Office 97. For over twenty years, all production documents—quotations, work orders, shift schedules, quality inspection records—have been generated on this system. No matter how many computers were replaced, Office 97 remained the workhorse of production.

When this factory purchases new computers in 2026, the core contradiction it faces is not “which new software should we use,” but rather that the new hardware must be compatible with how the old software operates. Because Office 97 plus twenty years of accumulated files is no longer just a piece of software—it is this factory’s production infrastructure. Like a stamping press that has been running for twenty years on the factory floor, you don’t throw it away just because you built a new facility—you make the new building’s foundation and electrical wiring adapt to the machine.

If a method has survived twenty years of real business competition, it has already been fully validated by the market. No expert review is needed, no certification system required. Survival itself is the strongest proof of effectiveness.

That veteran accountant’s methods look clumsy, but she closes the books on time every month, files taxes correctly, produces numbers the boss can understand, and passes tax bureau audits without issues. That shop floor supervisor’s shift schedule looks chaotic, but the personalities, capabilities, and shift preferences of dozens of workers are all encoded within it. That folder naming convention looks absurd, but everyone in the company knows where every file is. This is complete effectiveness.

The True Composition of Migration Costs

Migration costs are never just technical costs. For employees in their fifties and sixties, their Excel operations are muscle memory. Among hundreds of templates there may be macros, specific print formats, and dependencies on dot-matrix printers. If a spreadsheet opens and the column widths have changed, the page breaks have shifted, or a macro no longer runs, that constitutes a production incident for a factory. “Can be opened” and “exactly the same” are two entirely different things.
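The difference between the two is checkable. Below is a minimal sketch of a layout-fidelity check in Python, assuming both the legacy workbook and its converted copy have already been saved as .xlsx so that openpyxl can read them; the file names and the set of properties compared are illustrative, not a complete migration test.

```python
# A minimal layout-fidelity sketch: compares a few visible layout
# properties between a legacy workbook and its converted copy.
# Assumes both files are .xlsx; file names are hypothetical, and a
# real migration test would also cover macros and actual print output.
from openpyxl import load_workbook

def diff_layout(old_path: str, new_path: str) -> list[str]:
    """Report layout drift that 'it opens fine' would hide."""
    old_wb, new_wb = load_workbook(old_path), load_workbook(new_path)
    problems = []
    for name in old_wb.sheetnames:
        if name not in new_wb.sheetnames:
            problems.append(f"sheet missing: {name}")
            continue
        old_ws, new_ws = old_wb[name], new_wb[name]
        # Column widths decide how a quotation or work order prints.
        for col, dim in old_ws.column_dimensions.items():
            if dim.width != new_ws.column_dimensions[col].width:
                problems.append(f"{name}!{col}: column width changed")
        # Print area and orientation decide where the pages break.
        if old_ws.print_area != new_ws.print_area:
            problems.append(f"{name}: print area changed")
        if old_ws.page_setup.orientation != new_ws.page_setup.orientation:
            problems.append(f"{name}: page orientation changed")
    return problems

for issue in diff_layout("quotation_old.xlsx", "quotation_new.xlsx"):
    print(issue)
```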

SECTION 02

AI That Cannot Be Backward-Compatible Is the New Bug

What is intelligence? Intelligence is not how many new things you can do, but how many old things you can understand. When a truly smart person enters an unfamiliar environment, the first thing they do is not change the environment, but figure out why things are the way they are.

Today’s AI does the exact opposite. It demands that the world first transform itself into something it can understand, and only then will it work. This isn’t intelligence—this is being a picky eater.

The simplest standard for judging whether AI has truly been deployed: does that veteran accountant who has used Office 97 for twenty years need to change the way she works? If she does, AI hasn’t been deployed yet. If she doesn’t, AI has truly arrived.

The Term “Dirty Data” Is Itself an Act of Arrogance

When AI Agent engineers enter an enterprise and encounter non-standardized data formats, fragmented storage systems, and inconsistent naming conventions, they often label these as “data too dirty,” “too messy,” or “incompatible with AI.” But this judgment itself contains a dangerous assumption—AI’s standards are the standard, and the enterprise’s reality is the deviation.

In reality, that so-called “dirty data” is a record of twenty years of genuine business operations. The veteran accountant keeps her books her own way, the shop floor supervisor schedules shifts according to his own habits, and the procurement officer creates orders in his familiar format. In an Excel spreadsheet where a column is labeled “Notes 2,” the entire company knows exactly what goes in that column. The data isn’t dirty—AI just can’t read it.
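One way to honor that knowledge instead of “cleaning” it away is to put the tacit mapping into an adapter and leave the spreadsheet alone. The sketch below is a minimal illustration of that idea; every column name and meaning in it is hypothetical.

```python
# A minimal sketch of reading legacy data as-is: the adapter, not the
# spreadsheet, carries the company's tacit knowledge. All column names
# and meanings here are hypothetical.
COLUMN_MEANINGS = {
    "Notes 2": "payment_follow_up",   # what "everyone in the company knows"
    "Amt (old)": "amount_at_pre_2019_prices",
    "Chk": "reconciled_against_bank_statement",
}

def read_legacy_row(row: dict) -> dict:
    """Translate one row into explicit fields without touching the source file."""
    return {COLUMN_MEANINGS.get(col, col): value for col, value in row.items()}

print(read_legacy_row({"Notes 2": "call before invoicing", "Chk": "Y"}))
# {'payment_follow_up': 'call before invoicing',
#  'reconciled_against_bank_statement': 'Y'}
```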

Human Past Behavior Is Not “Backwardness”

The AI industry labels past methods as “traditional,” “legacy systems,” or “technical debt”—terms that all carry an implicit judgment: you are backward and need to be upgraded. But that veteran accountant’s Excel usage is the crystallization of twenty years of experience. That shop floor supervisor’s shift schedule is the product of knowing dozens of workers’ capabilities inside and out. These are not technical debt—these are human wisdom.

SECTION 03

The Fatal Gap Between Laboratory AI and Dirty Environments

Global enterprise AI deployment data reveals a sobering reality: AI Agents perform impressively in controlled laboratory environments but collapse on a massive scale in real production settings. This is not an isolated phenomenon—it is a structural, systemic failure.

95% · Enterprise generative AI pilots that failed to produce measurable business impact (MIT research)
88% · AI proofs of concept (POCs) that failed to transition to production (IDC data)
24% · First-attempt completion rate of the best model on real-world tasks (APEX-Agents 2026 benchmark)
1% · Enterprises that consider their generative AI strategy mature (McKinsey survey)

The “Sterile Deception” of Pilot Environments

AI Agent pilots are highly deceptive. A small team connects a few APIs, tests with carefully curated clean data, and watches the Agent autonomously execute workflows—in a controlled environment, everything works perfectly. But the moment it switches to a production environment, facing real data, real edge cases, and real compliance audits, the system immediately collapses.

Pilot environment: Clean data · Limited APIs · Controlled scenarios
↓ Appears successful
Management signs full deployment contract
↓ Enters real environment
Production environment: Dirty data · Legacy systems · Infinite edge cases · Implicit rules
↓ System collapses
Abandoned project · Trust collapse · Investment loss

A major retailer attempted to build a “personalized shopping Agent” and failed—because it was pulling data from 47 different Excel files that hadn’t been updated since 2022. Penrose.com tested AI account balance tracking with an entire year of Stripe data. Once the model miscalculated a single early transaction, every subsequent balance was off. By the end of the dataset, the cumulative error had grown to unacceptable levels.
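That failure mode is easy to reproduce in miniature. The sketch below uses hypothetical amounts rather than the actual Stripe dataset: misreading one early transaction shifts every subsequent balance by the same offset, and nothing downstream corrects it.

```python
# Toy reconstruction of the running-balance failure mode described above.
# The amounts are hypothetical; the point is the propagation, not the data.
def running_balance(txns, start=0.0):
    balances, total = [], start
    for amount in txns:
        total += amount
        balances.append(round(total, 2))
    return balances

transactions = [120.00, -45.50, 300.00, -80.00, 60.25]
truth = running_balance(transactions)

# One misread early entry (-45.50 parsed as -4.55) poisons every balance
# after it by the same 40.95, all the way to the end of the dataset.
corrupted = list(transactions)
corrupted[1] = -4.55
wrong = running_balance(corrupted)

print(truth)  # [120.0, 74.5, 374.5, 294.5, 354.75]
print(wrong)  # [120.0, 115.45, 415.45, 335.45, 395.7]
```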

If your data is dirty, your Agent is just “a way to make mistakes at massive scale, faster.” Successful teams in 2026 spend 70% of their time on data governance and only 30% on the AI itself.

The Cumulative Nature of Errors: The Fundamental Difference Between AI and Humans

When the veteran accountant makes an error in the books, she knows where the mistake is and can trace and fix it. When AI makes an error, nobody knows why it was wrong, nobody knows where it went wrong, and nobody even knows that it went wrong at all. An 85% accuracy rate at each step looks decent, but after ten consecutive steps, overall accuracy drops below 20%. A tool you cannot hold accountable is a ticking time bomb inside an enterprise.
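The drop is pure compounding. The snippet below assumes, for simplicity, that each step succeeds or fails independently; real workflows are messier, but the exponential shape is the same.

```python
# Per-step accuracy compounds multiplicatively across a workflow
# (assuming independent steps -- a simplification, but it shows the shape).
step_accuracy = 0.85
for steps in (1, 3, 5, 10):
    end_to_end = step_accuracy ** steps
    print(f"{steps:>2} steps: {end_to_end:.1%} chance of a fully correct run")
#  1 steps: 85.0%
#  3 steps: 61.4%
#  5 steps: 44.4%
# 10 steps: 19.7% -- below the 20% cited above.
```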

SECTION 04

Partial Effectiveness Cannot Replace Complete Effectiveness

This is the most critical proposition in AI’s current deployment crisis. AI indeed excels at certain point tasks—writing copy, generating code, translating text, analyzing data. But enterprise operations are not a stack of point tasks; they are a complete effectiveness network.

The Iron Law of Replacement

A behavior that has been in use for twenty years necessarily possesses effectiveness—otherwise it would have been eliminated long ago. And when the time comes to replace it, the new effectiveness must exceed the total effectiveness of the old behavior. Not partial superiority, but full coverage with margin to spare.

The Veteran Accountant’s Complete Effectiveness

Collecting receipts daily → categorizing → entering data → reconciling → month-end closing → quarterly tax filing → year-end settlement → accounts payable reminders → pulling up any number the boss needs within three minutes → explaining every entry clearly when the tax bureau audits → knowing which supplier’s invoices frequently have errors and need a second look → knowing which customer habitually delays payment and needs early follow-up.

AI’s Partial Effectiveness

Can handle data entry, calculations, and report generation. But doesn’t know the supplier’s invoicing habits, doesn’t know the customer’s payment negotiation tactics, doesn’t know whether the boss is in the right mood to hear about bad debts right now, and doesn’t know which accounts the tax bureau is focusing on this year.

If AI can only handle the data entry and arithmetic, then what it replaces is not the veteran accountant—it’s just her calculator. You don’t fire an accountant because you bought a calculator. Replacing a 90-point old system with a 70-point new one isn’t an upgrade—it’s a downgrade.

Global Validation: The Trap of Partial Effectiveness

Metric · Data · Source
Productivity increase for AI super-users · — · WRITER 2026 Survey
Organizations seeing significant ROI from generative AI · Only 29% · WRITER 2026 Survey
Organizations hoping to achieve revenue growth via AI · 74% · Deloitte 2026
Organizations that actually achieved revenue growth via AI · Only 20% · Deloitte 2026
Executives admitting AI is “tearing the company apart” · 54% · WRITER 2026 Survey
Employees admitting to deliberately sabotaging company AI strategy · 29% · WRITER 2026 Survey

There is a vast chasm between individual-level productivity gains and organization-level business returns. This is the textbook manifestation of partial effectiveness—AI is genuinely effective at certain points, but those points cannot be strung together into the complete chain that enterprises need.

SECTION 05

Where Does the 1% Success Actually Live?

That 1% of enterprises in the McKinsey survey that consider their AI strategy mature—what scenarios is their AI actually deployed in? The answer reveals a carefully avoided truth.

The scenarios where AI has genuinely worked are concentrated in: coding assistance (accounting for 55% of enterprise AI spending), content marketing (9%), and customer service automation (9%). These scenarios share one common characteristic—they are all “generating new content from scratch” tasks. The inputs are clear, the outputs are entirely new, there is no dependency on twenty years of historical data, and no need to understand the enterprise’s tacit knowledge.

Where AI succeeds is in generative tasks—writing code, writing content, creating images. Where AI fails is precisely in traditional enterprise workflows—finance, supply chain, production management, ERP integration. The former requires no backward compatibility. The latter is, by its very nature, backward compatibility.

So that 1% of “mature” enterprises are most likely companies using AI for content generation and coding assistance in greenfield scenarios, not companies that have genuinely deployed AI across traditional SOP workflows. AI’s partial effectiveness in generative tasks has been amplified into success stories, masking the fact of its comprehensive failure in traditional enterprise processes.

SECTION 06

The FDE Model: The AI Industry’s Implicit Admission of Defeat

The FDE (Forward Deployed Engineer) model invented by Palantir is a practical response to all the problems described above—and simultaneously an implicit admission by the AI industry of its own limitations.

FDEs are technical personnel stationed at client companies, whose core mission is to bridge the gap between a product’s existing capabilities and the client’s actual needs. They enter the client site with an existing product, first laying down a “gravel road,” and then the headquarters team abstracts and generalizes these field practices into a “highway” that can serve more clients.

The very existence of FDEs proves that AI cannot “understand dirty environments” on its own—it can only send a human engineer to serve as translator. This is direct evidence that, to this day, deploying AI in dirty environments still depends on humans to fill in the gaps.

The Essential Role of the FDE

An FDE is neither a traditional engineer focused on a single domain nor an AI researcher confined to a laboratory, but a “cross-domain translator”: someone who uses AI technology to solve industry pain points while simultaneously reverse-engineering industry needs into technical innovation. The core model comprises three elements: demand-driven reverse engineering (starting from industry pain points rather than general-purpose models), cross-domain capability transfer (decomposing technology validated in one domain into reusable modules), and transparency-guaranteed deployment (breaking open the AI black box and embedding the industry’s physical constraints and rules).

But the FDE model also exposes a brutal scalability problem: every enterprise needs its own gravel road laid first. Ten thousand factories have ten thousand different kinds of chaos, and behind each type of chaos lies a unique set of logic. This means AI deployment is not a technology problem—it is the painstaking work of adapting factory by factory, company by company. There are no shortcuts, no universal solutions.

SECTION 07

Conclusion: Backward Compatibility Is the Duty of New Technology

The central argument of this paper can be distilled into three progressive propositions:

Proposition One: The Survival Validation Principle

A behavior that has been in use for twenty years necessarily possesses effectiveness—otherwise it would have been eliminated long ago. This effectiveness requires no external certification; survival itself is the strongest validation.

Proposition Two: The Iron Law of Total Replacement

When replacing a validated effectiveness system, the new system must surpass the old system across all dimensions—not merely outperform it in some. Replacing complete effectiveness with partial effectiveness is a downgrade.

Proposition Three: The Backward Compatibility Obligation

AI being backward-compatible with the “outdated” past is the correct paradigm for AI deployment. If AI cannot be backward-compatible with how humans have worked in the past, AI is not an intelligent agent—it is a new Bug. Backward compatibility is the obligation of new technology, not the burden of old users.

What truly needs AI in China is not the internet companies that have already moved to the cloud, but the factories still running Office 97, the small workshops still using handwritten receipts, and the trading companies still transferring files via USB drives. The same holds globally: the real market lives in those “dirty environments.” Whoever can achieve seamless integration, with no system changes, no learning curve, and no data migration, so that AI seeps in like air, will capture this largest of all markets.

But right now, nobody is doing this.

This is the AI industry’s biggest blind spot in 2026—and its biggest opportunity.

References

[1] MIT Technology Review. “95% of Generative AI Pilots at Companies Are Failing.” 2025.

[2] WRITER. “Enterprise AI Adoption 2026 Survey.” April 2026.

[3] Deloitte. “State of AI in the Enterprise 2026.” February 2026.

[4] McKinsey & Company. “The State of AI in 2025.” 2025.

[5] Grant Thornton. “2026 AI Impact Survey Report.” April 2026.

[6] Digital Applied. “AI Agent Scaling Gap March 2026.” March 2026.

[7] Gartner. “Survey on Data Management Practices for AI.” 2025.

[8] APEX-Agents 2026 Benchmark. Real-world task completion rates.

[9] Carnegie Mellon & Anthropic. AI agent error rates in high-stakes business processes.

[10] Palantir FDE model documentation and interviews. Bob McGrew, Shyam Sankar.

[11] Professor Li Jinjin’s Team, Shanghai Jiao Tong University. “FDE+FDR Collaborative Framework.” 2025.

[12] Fortune. “The Hidden ROI of AI.” April 2026.

[13] IDC. “88% AI POC-to-Production Failure Rate.” 2025.


LEECHO GLOBAL AI RESEARCH LAB

© 2026 LEECHO Global AI Research Lab. All rights reserved.
