Thought Paper · March 2026

From Generation to Control
GUI AI Agent as the First Wave of Industrial AI Deployment

The Structural Limits of Generative AI and the Economic Viability of Control-based AI
— An Exploratory Study Based on Empirical Testing, Labor Market Data, and Thermodynamic Information Theory

March 14, 2026 · LEECHO Global AI Research Lab · Claude Opus 4.6 · Anthropic


00 · Abstract

Core Thesis

In 2025, “Slop” was named Word of the Year by Merriam-Webster, signaling that generative AI output (text, images, video) had reached a structural low in human acceptance. Meanwhile, a quieter but far more lethal AI evolution path is accelerating: GUI (Graphical User Interface) control-based AI Agents — which produce no content, but directly operate human computer interfaces to execute tasks.

This paper advances the following core thesis: AI industrial deployment is undergoing a phase transition from “generative” to “control-based.” While generative AI has improved productivity for high-cognition workers, its outputs inevitably generate massive backend testing, verification, and real-world alignment workloads — this “backend long tail” severely limits its actual replacement effect. GUI control-based AI, by contrast, already possesses the preconditions for large-scale job replacement in high-labor-cost countries, thanks to the determinism of its operating environment (low physical friction) and the binary nature of its verification loop (no backend long tail). This shift will first impact jobs whose core work is “purely operating a computer” — data entry, e-commerce backend operations, ERP system input, accounting system entry. Critical distinction: customer service and other person-to-person service roles are NOT within the scope of GUI replacement — they are fundamentally different work.

This paper constructs a complete analytical chain from technical feasibility to economic tipping point, based on the author’s hands-on GUI Agent test data (Sonnet 4.6, $0.23/178 seconds per task), U.S. Bureau of Labor Statistics employment data, OSWorld benchmark results, and the thermodynamic information theory framework previously published by the author.

“The true destructive power of AI lies not in whether it can write a good article, but whether it can accurately click the right button and fill in the right form. The former remains distant; the latter has already arrived.”

01 · Phase Transition

The Real Predicament of Generative AI: The Ineliminable Backend Long Tail

Not “failure,” but efficiency gains consumed by backend verification work

In 2025, “Slop” — referring to low-quality, mass-produced AI-generated digital content — was selected as Word of the Year by both Merriam-Webster and the American Dialect Society. This phenomenon requires precise interpretation: it does not mean generative AI is worthless, but reveals a deeper structural problem.

AI Slop Mention Growth
2025 vs 2024 (Meltwater)

Peak Negative Sentiment
54%
Historic high in October 2025

AI-Generated Content Share
>50%
English web content (Graphite data)

Meta Vibes DAU
23,000
Dismal performance weeks after launch

Generative AI is genuinely improving productivity for high-cognition workers — the author is one such beneficiary, having used AI programming to develop multiple software products. But this very process exposed generative AI’s fundamental problem: AI can rapidly generate code, but what happens after generation? Massive testing, debugging, verification, and real-environment adaptation are required. As a one-person company founder, the author has already completed several software products using AI programming, but every single one is stuck at the stage of having no one to help with testing.

This is not an isolated case but the structural destiny of generative AI: for every output generated at the frontend, a series of testing, verification, and physical-world fact-alignment demands are produced at the backend. AI writes copy — humans must verify facts, proofread phrasing, confirm tone. AI generates a product image — designers must check proportions, color accuracy, detail distortions. AI writes code — programmers must run unit tests, integration tests, stress tests, and security audits.

Analyzed through the thermodynamic framework: generative AI reduced creation entropy at the frontend (accelerated output), but manufactured equal or greater verification entropy at the backend (testing, alignment, physical-world validation). Total entropy has not decreased — it has merely shifted from the “production stage” to the “verification stage.” This is why generative AI has not yet produced large-scale job displacement — its efficiency gains have been consumed by backend verification demands.


02 · The Shift

The Rise of Control-based AI: From “Producing Content” to “Executing Processes”

GUI AI Agent — A New Direction in AI Evolution

While generative AI faces the AI Slop crisis, a fundamentally different technology path is quietly taking shape. GUI (Graphical User Interface) control-based AI Agents generate no new content — they directly operate human computer screens, viewing interfaces, moving mice, clicking buttons, and typing text just as humans do.

This is an architectural difference of fundamental significance:

Generative AI

Outputs require human acceptance. AI-written copy needs editor review; AI-generated images need designer inspection; AI-written code needs programmer review. Every output drags an “alignment long tail” requiring human processing. AI reduced creation entropy at the frontend but manufactured new alignment-checking entropy at the backend. Total entropy is unchanged; it has merely been transferred.

Control-based AI

Operation results are binary-verified. A clicked button is clicked. A changed price is changed. A shipped order is shipped. The verification standard for GUI operations is success/failure, with no subjective human judgment required. Align once at the frontend, then zero-cost batch execution at the backend. No long tail, no re-checking.

Generative Total Cost = AI Cost + (Backend Alignment Labor × Output Volume) → Linear Growth
Control Total Cost = Frontend Setup (one-time) + (API Fee × Operations) + Supervisor (fixed) → Cost per Operation Decreases
Cost structure difference between the two AI modes: generative backend labor grows linearly with output; control cost per operation falls toward the API fee as the fixed terms are amortized
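The two cost formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a calibrated model: only the $0.23 API fee comes from this paper's test, and every other number (AI cost, alignment labor per output, setup and supervisor cost) is a hypothetical placeholder chosen to show the shape of the curves.

```python
def generative_total_cost(ai_cost, alignment_labor_per_output, outputs):
    """AI cost plus backend alignment labor, which is paid again for every output."""
    return ai_cost + alignment_labor_per_output * outputs


def control_total_cost(frontend_setup, api_fee, operations, supervisor_fixed):
    """One-time setup plus per-operation API fee plus a fixed supervisor cost."""
    return frontend_setup + api_fee * operations + supervisor_fixed


def cost_per_unit(total_cost, units):
    """Average cost of one output or operation."""
    return total_cost / units


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        # Hypothetical inputs except api_fee=0.23, which is the measured per-task cost.
        gen = generative_total_cost(ai_cost=50.0, alignment_labor_per_output=2.0, outputs=n)
        ctl = control_total_cost(frontend_setup=500.0, api_fee=0.23,
                                 operations=n, supervisor_fixed=1_000.0)
        print(f"n={n:>6}  generative/unit={cost_per_unit(gen, n):.2f}  "
              f"control/unit={cost_per_unit(ctl, n):.2f}")
```

Under these assumed inputs the generative cost per unit stays near the per-output labor cost, while the control cost per unit shrinks toward the API fee as volume grows.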

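The industrial consequence of this difference: generative AI is “false efficiency”: the frontend speeds up, people pile up at the backend, and the net change in headcount is close to zero. Control-based AI is “real layoffs”: align once at the frontend, execute in batches, keep only supervision at the backend, and net labor demand falls off a cliff.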


03 · Frontend vs Backend

The Technical Essence of GUI Control: From Backend API to Frontend Visual Operation

An Overlooked Architectural Revolution

Traditional browser automation (Selenium/Playwright/Puppeteer) is “backend control” — bypassing the interface humans see, directly connecting to browser low-level protocols (WebDriver/CDP), operating DOM trees and CSS selectors. It is essentially API calling, limited to systems with open interfaces.

GUI AI Agent is “frontend control” — it sees screenshots, exactly what humans see. It does not read HTML source code or use DevTools protocols. It truly “sees” the screen, recognizes “there is a blue button labeled Submit,” then generates mouse coordinates to click.

| Dimension | Traditional Script Automation (Backend) | GUI AI Agent (Frontend) |
|---|---|---|
| Operation Layer | DOM Tree / CSS Selectors / XPath | Screenshots / Visual Recognition |
| Protocol | WebDriver / CDP | Screenshot → Inference → Mouse/Keyboard |
| Fragility | Breaks when page structure changes | Works as long as visual semantics persist |
| Scope | Only systems with API/DOM access | Any system with a screen |
| Dev Barrier | Requires programmers to write scripts | Natural language task description suffices |
| Physical Friction | Low (structured interface) | Very low (deterministic electronic environment) |
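The “Screenshot → Inference → Mouse/Keyboard” loop in the table can be sketched with a toy environment. Everything here is a stand-in invented for illustration: `FakeScreen` replaces a real display, `stub_policy` replaces the vision model's inference step with fixed rules, and neither reflects the lab's actual software or any vendor API. The point is only the control flow: observe, decide, act, repeat until done.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str        # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeScreen:
    """Toy stand-in for a real display: one search box and one Submit button."""
    focused: bool = False
    query: str = ""
    submitted: bool = False

    def screenshot(self) -> dict:
        # A real agent would receive pixels; this toy returns the visible state.
        return {"focused": self.focused, "query": self.query,
                "submitted": self.submitted,
                "search_box": (200, 100), "submit_button": (400, 100)}

    def apply(self, action: Action) -> None:
        if action.kind == "click" and (action.x, action.y) == (200, 100):
            self.focused = True
        elif action.kind == "type" and self.focused:
            self.query += action.text
        elif action.kind == "click" and (action.x, action.y) == (400, 100):
            self.submitted = bool(self.query)


def stub_policy(observation: dict, goal_text: str) -> Action:
    """Rule-based placeholder for the model's inference step (hypothetical)."""
    if observation["submitted"]:
        return Action("done")
    if not observation["focused"]:
        x, y = observation["search_box"]
        return Action("click", x, y)
    if observation["query"] != goal_text:
        return Action("type", text=goal_text)
    x, y = observation["submit_button"]
    return Action("click", x, y)


def run_agent(screen: FakeScreen, goal_text: str, max_steps: int = 10) -> int:
    """Loop screenshot -> inference -> action; return the number of policy calls."""
    for calls in range(1, max_steps + 1):
        action = stub_policy(screen.screenshot(), goal_text)
        if action.kind == "done":
            return calls
        screen.apply(action)
    raise RuntimeError("step budget exhausted before the task completed")
```

Each pass through the loop corresponds to one “look at the screen” call, which is why the call count is a natural efficiency metric for this kind of agent.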

This leap from “backend” to “frontend” breaks a fundamental limitation: backend control can only operate systems with open interfaces, yet in enterprise reality, massive amounts of work occur in “no API” environments — legacy ERP interfaces, government websites, cross-system copy-paste. Frontend control breaks this limitation: whatever AI can see, it can operate — perfectly aligned with human capability boundaries.


04 · Empirical Test

Author’s Hands-on Test: Sonnet 4.6 GUI Agent Real Performance and Cost

First-hand data, no embellishment

This research lab used AI programming to develop GUI control software connected to Anthropic Sonnet 4.6 and conducted initial tests in a real browser environment. Test task: open a target webpage, click the search bar, enter specific text, and search for weather information for a specific location on the current day.

First Task API Calls
13
Agent “looked at screen 13 times” to complete

Optimized API Calls
7
~46% fewer API calls

Single Task Duration
178 seconds
Humans complete same task in ~15-20 sec

Single Task API Cost
$0.23
Using Sonnet 4.6 (not the most expensive model)

Key finding: Accuracy on simple tasks is already acceptable. The drop from 13 to 7 calls demonstrates that even without fine-tuning, context-based experience accumulation alone can significantly improve GUI operation efficiency. 7 calls means the Agent “looked at the screen 7 times”; humans need approximately 4-5 actions for the same task, so the Agent is approaching human efficiency in step count, though not yet in wall-clock time.

However, the 178-second execution time and $0.23 per-task cost reveal the current core contradiction: Technical accuracy has crossed the usability line, but economic cost remains above it — though falling rapidly.
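The headline ratios from this test are simple arithmetic on the measured figures (13 and 7 API calls, 178 seconds, and the estimated 15-20 second human time). A short sketch, with function names chosen here for illustration:

```python
def call_reduction(calls_before, calls_after):
    """Fraction of API calls saved between the first and the optimized run."""
    return (calls_before - calls_after) / calls_before


def slowdown(agent_seconds, human_seconds):
    """How many times longer the agent takes than a human on the same task."""
    return agent_seconds / human_seconds


if __name__ == "__main__":
    print(f"call reduction: {call_reduction(13, 7):.1%}")
    print(f"slowdown vs 20 s human: {slowdown(178, 20):.1f}x")
    print(f"slowdown vs 15 s human: {slowdown(178, 15):.1f}x")
```

So the step count dropped by roughly 46%, while wall-clock time remains about 9 to 12 times the human estimate.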


05 · Economics

Geographic Arbitrage: GUI Agent Viability Determined by Labor Costs

The Global Labor Fault Line Revealed by $0.23

A $0.23 per-operation cost carries entirely different economic implications across countries:

| Country/Region | Relevant Hourly Wage | 178-sec Labor Cost | Agent Cost | Economic Viability |
|---|---|---|---|---|
| US (White-collar Ops) | $25-35/h | $1.24-1.73 | $0.23 | ✓ Agent already 5-7× cheaper than labor |
| W. Europe / Japan / Korea | $18-28/h | $0.89-1.38 | $0.23 | ✓ Agent already 4-6× cheaper than labor |
| China (Tier 1 Cities) | $5-8/h | $0.25-0.40 | $0.23 | ≈ Near tipping point |
| India / SE Asia (BPO) | $2-4/h | $0.10-0.20 | $0.23 | ✗ Labor still cheaper than Agent |

This “economic viability line” is determined by local labor costs: the US has already crossed it, Europe/Japan/Korea follow closely, China is at the tipping point, and developing nations have not yet reached it.
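The labor-cost column follows directly from hourly wage × 178 seconds. The sketch below reproduces that calculation and classifies each region; the wage ranges and the $0.23 agent cost are the figures quoted in this section, while the 1.5× margin used to label a region “near tipping point” is an assumption made here for illustration.

```python
TASK_SECONDS = 178       # measured task duration
AGENT_COST_USD = 0.23    # measured API cost per task

# Hourly wage ranges in USD, as quoted in this section.
WAGE_RANGES = {
    "US (White-collar Ops)": (25, 35),
    "W. Europe / Japan / Korea": (18, 28),
    "China (Tier 1 Cities)": (5, 8),
    "India / SE Asia (BPO)": (2, 4),
}

# Assumption: labor costing less than 1.5x the agent counts as "near tipping point".
NEAR_MARGIN = 1.5


def labor_cost(hourly_wage, seconds=TASK_SECONDS):
    """Cost of a human spending `seconds` on the task at `hourly_wage`."""
    return hourly_wage * seconds / 3600


def verdict(low_wage, high_wage, agent_cost=AGENT_COST_USD):
    """Classify a wage range against the agent's per-task cost."""
    low, high = labor_cost(low_wage), labor_cost(high_wage)
    if high < agent_cost:
        return "labor cheaper"
    if low >= NEAR_MARGIN * agent_cost:
        return "agent cheaper"
    return "near tipping point"


if __name__ == "__main__":
    for region, (lo, hi) in WAGE_RANGES.items():
        print(f"{region:<28} ${labor_cost(lo):.2f}-{labor_cost(hi):.2f}  {verdict(lo, hi)}")
```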

The critical factor is the direction of cost curves: labor costs rise with inflation, while API costs fall with Moore’s Law and competition. Over the past year, mainstream LLM API prices have dropped more than tenfold. If this trend continues, the $0.23 search task may cost $0.05 within a year. At that point, labor in every region listed above would be more expensive than an Agent.
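One way to read the falling cost curve is as a break-even wage: the hourly wage at which a human performing the 178-second task costs exactly as much as the Agent. A minimal sketch, using only the $0.23 and $0.05 figures from the text (the function name is illustrative):

```python
def break_even_hourly_wage(agent_cost_usd, task_seconds=178):
    """Hourly wage at which human labor for the task equals the agent's cost."""
    return agent_cost_usd * 3600 / task_seconds


if __name__ == "__main__":
    for cost in (0.23, 0.05):
        wage = break_even_hourly_wage(cost)
        print(f"agent cost ${cost:.2f} -> break-even wage ${wage:.2f}/h")
```

At $0.23 the break-even wage is about $4.65 per hour; at $0.05 it falls to about $1.01 per hour, below the lowest wage range considered in this section.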

The applicable economic principle here is not game theory (which requires both parties to possess equal agency), but the fundamental economics of labor scarcity and substitutability. Pure computer-operation labor has extremely low scarcity (anyone who can read a screen can do it). Once an AI Agent achieves equivalent operational capability, the substitutability of such labor becomes 100%. The employer’s decision is not “negotiating with employees” but unilaterally selecting the lower-cost execution method. The one-directional decline of API cost curves means this substitution line will only continue pushing downward, progressively breaking through every country’s labor cost floor.

06 · Impact Zone

GUI Agent’s Precision Strike Zone: Pure Computer Operation Jobs

Strict definition: replaces only “pure computer operation” work, NOT person-to-person services

The replacement scope of GUI Agents must be strictly defined. What it replaces is “purely operating a computer” — a person sitting in front of a screen, executing standardized input, query, and modification operations in known software systems via mouse and keyboard. It does NOT replace any work involving interpersonal interaction, subjective judgment, or service communication. Customer service representatives also use computers, but their core work is person-to-person communication and emotional handling — entirely outside the scope of GUI replacement.

Job types precisely targeted by GUI Agents:

Pure data entry operators — approximately 140,000 data entry clerks in the US (BLS data), median salary ~$31,582, whose work consists of inputting data from one system into another. Additionally, bookkeeping and accounting clerks (~1.6 million) spend their days entering financial data into accounting software, generating reports, and reconciling numbers.

E-commerce backend operators — the most typical “fill-in-the-blank” work. What an Amazon operator does daily: log into seller backend → list new products (enter title, five bullet points, price, inventory quantity, shipping method) → batch-modify product prices → adjust ad bids (enter bid amounts for each keyword) → download sales reports → enter tracking numbers → update inventory figures. A store managing 500 SKUs may require hundreds to thousands of pure GUI clicks and entries per day. All operations occur within known backend interfaces in the browser — the most standard “fill-in-the-blank” work.

ERP/CRM system operators — entering orders, updating customer information, generating purchase orders, and processing inventory records in enterprise systems. The common characteristic: filling in fixed-format data, following fixed processes, in a fixed software interface.

Insurance/banking backend processors — claims data entry, policy status updates, system operation portions of transfer approval workflows (note: excludes client-facing judgment/decision portions — only the pure system operation parts).

Data Entry + Bookkeeping Clerks
~1.74 million
US (BLS data, pure GUI operation roles)

E-commerce Backend Ops
~1 million+
Estimated pure operation roles among US e-commerce workers

Median Annual Salary
$31K-46K
Well below national median of $49,500

Employment Trend
-5% to -8%
BLS projects continued decline over next decade

Conservatively estimated, in the US alone, pure computer GUI operation jobs number between 3-5 million. This figure is intentionally conservative, strictly excluding all positions with interpersonal service components. Yet even this conservative number corresponds to total annual wages exceeding $100 billion — this is the addressable market for GUI Agents.
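As a rough sanity check on the wage-bill figure, the job-count range (3-5 million) can be multiplied by the salary range quoted above ($31K-46K). This is back-of-the-envelope arithmetic on the paper's own numbers, not an independent estimate:

```python
def wage_bill(jobs, annual_salary_usd):
    """Total annual wages for `jobs` positions at a given salary."""
    return jobs * annual_salary_usd


if __name__ == "__main__":
    low = wage_bill(3_000_000, 31_000)
    high = wage_bill(5_000_000, 46_000)
    print(f"low end:  ${low / 1e9:.0f} billion")
    print(f"high end: ${high / 1e9:.0f} billion")
```

The range works out to roughly $93 billion to $230 billion per year, so the $100 billion figure sits near the low end of what these inputs imply.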

These positions share a common labor economics characteristic: extremely low scarcity and extremely high substitutability. They require no professional qualifications, no creative judgment, no relationship management — only the ability to “read a screen and click accurately.” When an AI Agent can do the same, these workers lose all bargaining power. This is not game-theoretic “competition” — game theory applies to scenarios where both parties hold agency — this is a pure labor scarcity collapse. When the supply side (AI) can replicate infinitely at near-zero marginal cost, the demand side (employers) makes a unilateral choice.


07 · Already Happening

Real Layoff Cases: Not Predictions — Already Happening

Real data from Block to cross-border e-commerce

On February 26, 2026, fintech company Block (S&P 500 constituent) laid off ~4,000 employees (~40% of workforce). CEO Jack Dorsey stated explicitly: this was not about business problems, but about AI tools enabling small teams to do what large organizations previously required. Block’s self-developed AI agent “Goose” assists with code writing, decision-making, and customer service — a 6,000-person + AI combination expected to handle the workload previously requiring 10,000 people.

In e-commerce, a leading cross-border e-commerce enterprise reported that after introducing AI Agents, work that previously required 6 employees and 18 hours for cross-platform product selection and reconciliation is now completed automatically by digital workers, reducing labor costs by 70%. Taobao’s AI customer service handled 300 million interactions during Singles’ Day 2025, with 100 million fully automated and human transfer rates dropping 20% year-over-year.

A deeper structural shift: Amazon’s AI infrastructure investment exceeded $150 billion, surpassing labor costs for the first time. Capital expenditure shifted from “buying labor” to “buying compute” — precisely the “capital restructuring” argued in this lab’s previous “From Parasitism to Symbiosis” paper.

Note the scenes where real layoffs are landing — all are control/execution type: e-commerce product selection and reconciliation, customer response, order processing, code execution, file management. Not a single one is “generate a beautiful image for me” or “write me a moving article.” Generative AI Slop has not replaced people. Control-based AI Agents are replacing people.

08 · Technical Reality

March 2026: The Real State of GUI Agent Technology

An honest panoramic assessment

An honest assessment is necessary: GUI AI Agents are still in early stages. The direction is confirmed, but engineering maturity is far from complete.

Benchmark evolution trajectory (OSWorld):

Mid-2024
Best model only 12.24% success rate, humans 72.36% — massive gap

Mid-2025
OpenAI CUA ~32.6%, Agent S2 ~34.5% — gap halved

Late 2025
Best Agent ~42.5% (relaxed criteria), 17.4% (strict criteria)

Early 2026
Agent S3 claims 72.6% (100-step setting); one company claims 76.26% surpassing humans

Real user experience: A TechCrunch journalist’s candid conclusion after testing OpenAI Operator: “I found myself assisting the Agent more than I’d like — which kind of defeats the point.” Reddit users: “Too slow, too expensive, too error-prone.” OpenAI itself admits Operator struggles with complex interfaces.

Core technical bottlenecks: OSWorld benchmarks reveal three key challenges — insufficient GUI grounding precision (misclicks), weak operational knowledge (falling into unproductive trial-and-error), and poor long-horizon planning (success rates plummeting for multi-step tasks). Moreover, the best Agents take 1.4× more steps than humans, with end-to-end latency reaching tens of minutes.

Overall assessment: Equivalent to the touchscreen PDA era of 2005 — the direction is right, but 2-3 years of engineering breakthroughs remain before an “iPhone moment.” All major companies (OpenAI, Anthropic, Google, Microsoft, ByteDance) have fully committed to this direction, and the concentration of engineering resources means this 2-3 year gap will be compressed rapidly.


09 · The Gap

The Biggest Barrier to Industrialization: Information Gap, Not Technology

Those who know AI don’t know the use cases; those who know the use cases don’t know AI

The biggest bottleneck slowing GUI Agent deployment is not model intelligence, not API cost, but an information gap — two worlds completely isolated from each other.

AI Technology Circle

Checking OSWorld leaderboards on GitHub, setting up Docker sandboxes, discussing visual grounding precision. But they don’t know which buttons e-commerce operators click daily, in what order, or what judgments they make under what circumstances. Their benchmark is “change a table color in LibreOffice,” not “batch-modify prices for 500 SKUs in Amazon Seller Central.”

Business Operations Circle

E-commerce owners, finance managers, insurance claims supervisors. They watch employees repeatedly click screens hundreds of times daily, thinking “if only this could be automated.” But they don’t know GUI Agents exist, don’t know what Browser Use is, and when they hear “AI” they still think of ChatGPT writing articles.

The training data solution is extremely simple — but neither side knows it. Simply have a skilled employee work normally while screen-recording plus voice-narrating: “I’m now opening Seller Central, clicking this tab, checking if there are any new negative reviews today…” Speech-to-text creates a complete video combining “operations + judgment logic.” The AI model sees screen visuals + mouse trajectories (actions) + human narration (decision logic) — three information streams perfectly aligned, enabling it to map and learn the entire workflow.
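The alignment of the three streams can be sketched as a timestamp join between the recorded UI events and the transcribed narration. The data structures below are hypothetical and the pairing rule (attach each action to the most recent utterance that began before it) is one simple assumption; a real pipeline would also need the screen frames and a speech-to-text step, which are omitted here.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Utterance:
    start: float   # seconds from the start of the recording
    text: str      # speech-to-text output for this segment


@dataclass
class UIEvent:
    time: float    # seconds from the start of the recording
    action: str    # e.g. "click tab", as logged by the recorder


def align(events, utterances):
    """Pair each UI event with the latest utterance that started at or before it."""
    ordered = sorted(utterances, key=lambda u: u.start)
    starts = [u.start for u in ordered]
    pairs = []
    for event in sorted(events, key=lambda e: e.time):
        i = bisect_right(starts, event.time) - 1
        narration = ordered[i].text if i >= 0 else ""
        pairs.append((event.action, narration))
    return pairs


if __name__ == "__main__":
    narration = [Utterance(0.0, "I'm now opening Seller Central"),
                 Utterance(5.0, "clicking this tab")]
    actions = [UIEvent(1.2, "open url"), UIEvent(6.0, "click tab")]
    for action, said in align(actions, narration):
        print(f"{action!r} <- {said!r}")
```

Each resulting pair ties an operation to the spoken reasoning behind it, which is the “operations + judgment logic” record described above.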

This training cost is near-zero — one computer, screen recording software, a microphone, and an employee working half a day. Compared to traditional ML training costs easily reaching tens of thousands of dollars, this is negligible. Moreover, the method has exceptional generalizability: any job whose work consists of “look at screen → judge → click” can be trained the same way.

“Missile Assembly Line” Model: Take e-commerce operations. An Amazon store lists 100 new products, each requiring: enter title (fill-in) → enter five bullet points (fill-in) → enter price (fill-in) → select shipping method (multiple choice) → upload images (click upload) → set ad keyword bids (fill-in) → enter inventory quantity (fill-in). 100 products × 7 operation steps = 700 “fill-in-the-blank” tasks. AI Agent completes all 700, then one person reviews: Are prices correct? Inventory numbers right? Any obvious description errors? This is “AI builds the missile, human presses the launch button.” Work that previously required 3-5 operators all day is now completed in half a day with AI execution + 1 person reviewing.
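The assembly-line pattern above (batch fill, binary check per field, one human reviewing the exceptions) can be sketched as follows. The form is simulated with a dictionary, and `fill_field`, `run_batch`, and the field names are hypothetical; a real agent would drive the browser at the fill step and read the page back at the verify step.

```python
STEPS = ["title", "bullet_points", "price", "shipping_method",
         "images", "ad_keyword_bids", "inventory"]


def fill_field(form, field, value):
    """Stand-in for the agent typing or selecting a value in the GUI."""
    form[field] = value


def verify(form, field, expected):
    """Binary check: the field either holds the expected value or it does not."""
    return form.get(field) == expected


def run_batch(products, fill=fill_field):
    """Fill every step for every product; return (verified count, items for review)."""
    verified, needs_review = 0, []
    for product in products:
        form = {}
        for step in STEPS:
            fill(form, step, product[step])
            if verify(form, step, product[step]):
                verified += 1
            else:
                needs_review.append((product["sku"], step))
    return verified, needs_review


if __name__ == "__main__":
    products = [{"sku": f"SKU-{i}", **{s: f"{s}-{i}" for s in STEPS}}
                for i in range(100)]
    verified, needs_review = run_batch(products)
    print(verified, "operations verified;", len(needs_review), "flagged for review")
```

With 100 products and 7 steps the batch covers 700 operations, and the human reviewer only needs to look at whatever lands in the review list plus the spot checks described above.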

10 · Physical Friction Ladder

Why GUI Will Land First: The Physical Friction Ladder of Control-based AI

Structural difference between electronic and physical worlds

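Within control-based AI there is a ladder of deployment speed determined by the physical friction of the operating environment. GUI Agents sit on the lowest rung of this ladder: physical friction is lowest there, so they reach the economic tipping point first.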

| Dimension | GUI Agent (Electronic World) | AI Robotic Arm (Physical World) |
|---|---|---|
| Training Data Labeling | Screen recording + voice, employee produces dozens in half a day, zero cost | Teleoperation equipment + torque sensors + specialist engineers, 100× cost per sample |
| Environment Controllability | Browser is deterministic: same operation always yields same result | Temperature, humidity, material wear, gravity collisions: different every time |
| Trial-Error Cost | Refresh page and retry, zero physical loss | May damage workpieces or injure people |
| Cross-scenario Generalization | All browsers worldwide render HTML; interface logic is universal | Welding and polishing are entirely different mechanical models |
| Verification Loop | Page feedback IS the result; instantly verifiable | Requires X-ray inspection, CMM measurement, etc. |
| Expected Mass Deployment | 2026-2028 | 2028-2032 (optimistic) |

GUI Agents are not merely “slightly faster” than robotic arms — they lead by an entire industrial cycle. This is determined by physical law: the entropy of the electronic world is far lower than that of the physical world. Therefore, the data volume, engineering investment, and iteration cycles needed to achieve AI control in the electronic world are all far smaller than in the physical world.


11 · Conclusion

Conclusions and Predictions

The first wave of control-based AI is already on its way

The core conclusions of this paper can be summarized in five propositions:

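Proposition 1: AI industrial deployment is undergoing a phase transition from “generative” to “control-based.” Generative AI has improved the productivity of high-cognition workers, but the backend testing, verification, and physical-world alignment that its outputs require consume the frontend efficiency gains, so it cannot achieve large-scale job replacement.

Proposition 2: GUI control-based AI has no “backend alignment long tail.” It executes standardized operations in a deterministic environment, that is, “fill-in-the-blank” tasks, and verification of each operation is binary, immediate, and automatic. Align once at the frontend, and the backend becomes zero-cost batch execution.

Proposition 3: The economic viability of GUI Agents is determined by local labor costs and labor substitutability. Labor in pure computer-operation jobs has extremely low scarcity; once an AI Agent’s per-operation cost falls below local labor cost, replacement is unilateral and irreversible. The US has already crossed the line, and the global cost line is moving down rapidly.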

Proposition 4: The biggest barrier to GUI Agent industrialization is the information gap — those who know AI don’t know the specific “fill-in-the-blank” tasks in real business scenarios; those who know the scenarios don’t know GUI Agents exist. Screen recording + voice narration training can bridge this gap, but complex tasks involving implicit judgment still require finer task decomposition.

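Proposition 5: Within control-based AI there is a physical friction ladder. The physical friction of GUI (the electronic world) is far lower than that of robotic arms (the physical world), training data acquisition costs differ by a factor of a hundred, and deployment leads by an entire industrial cycle. GUI Agents will be the first wave of industrial AI deployment.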

Final prediction: The GUI AI Agent replacement path is not “AI produced better content” but “AI took over the fill-in-the-blank tasks on screen.” Listing products, entering prices, recording inventory, adjusting ad bids, downloading reports, updating tracking numbers — these standardized GUI operations repeated hundreds to thousands of times daily will first be batch-assumed by AI Agents in high-labor-cost countries. This path is quieter, more pragmatic, and less conspicuous than AI Slop, but its destructive power on pure computer-operation jobs is terminal. These workers were not “defeated” — their scarcity was simply eliminated.

References and Data Sources

[1] Merriam-Webster, 2025 Word of the Year: “Slop.” Low-quality AI-generated digital content.

[2] Meltwater, 2025 consumer sentiment analysis report on AI Slop.

[3] Graphite SEO, 2025 analysis of the AI-generated share of English web content.

[4] OSWorld Benchmark (NeurIPS 2024). Desktop OS multimodal agent evaluation benchmark.

[5] Agent S3, Simular AI. Top score on the OSWorld leaderboard (72.6%, 100-step setting).

[6] OpenAI, Computer-Using Agent (CUA). Operator product technical report.

[7] Anthropic, Claude Computer Use Beta. Computer use capabilities of the 3.5 Sonnet / 4.6 series.

[8] U.S. Bureau of Labor Statistics (BLS). Office and administrative support employment statistics (2023-2024).

[9] OSWorld-Human (arXiv:2506.16042). Agent temporal efficiency benchmark study.

[10] TechCrunch, “OpenAI’s Operator agent helped me move, but I had to help it, too” (2025.02).

[11] 36Kr (36氪), “第一波AI裁员潮,来了” [“The First Wave of AI Layoffs Is Here”] (2026.03). Case analysis of Block’s 4,000-person layoff.

[12] Tencent News (腾讯新闻), “巨头裁员,这次史无前例” [“Giants Are Laying Off, and This Time It Is Unprecedented”] (2025.12). Global trends in AI-driven layoffs.

[13] 实在智能, “AI自动化2026解析” [“AI Automation 2026 Analysis”]. Cross-border e-commerce AI Agent deployment cases.

[14] Author empirical test data. Sonnet 4.6 GUI Agent performance and cost test (2026.03).

[15] LEECHO Global AI Research Lab, “From Parasitism to Symbiosis” (original title: “基生到共生”) (2026.02). Thermodynamic information theory framework.

[16] LEECHO Global AI Research Lab, “The Fourth Industry” (original title: “第四产业”) (2026.02). Cognitive economy and physical friction theory.

[17] LEECHO Global AI Research Lab, “The Confrontation Between Information and Physics” (original title: “信息与物理的对抗”) (2026.02). Physical possession theory.

[18] Firecrawl, “11 Best AI Browser Agents in 2026.” Browser automation market valued at $24.25 billion.

[19] Skyvern, “AI Web Agents Complete Guide” (2025.11). AI Agent market data, $5.4 billion → $7.6 billion.

