Core Thesis
In 2025, Merriam-Webster named “slop” its Word of the Year, a sign that public acceptance of generative AI output (text, images, video) had hit a structural low. Meanwhile, a quieter but far more consequential line of AI development is accelerating: GUI (Graphical User Interface) control-based AI Agents, which produce no content but directly operate the same computer interfaces humans use in order to execute tasks.
This paper advances the following core thesis: AI industrial deployment is undergoing a phase transition from “generative” to “control-based.” While generative AI has improved productivity for high-cognition workers, its outputs inevitably generate massive backend testing, verification, and real-world alignment workloads — this “backend long tail” severely limits its actual replacement effect. GUI control-based AI, by contrast, already possesses the preconditions for large-scale job replacement in high-labor-cost countries, thanks to the determinism of its operating environment (low physical friction) and the binary nature of its verification loop (no backend long tail). This shift will first impact jobs whose core work is “purely operating a computer” — data entry, e-commerce backend operations, ERP system input, accounting system entry. Critical distinction: customer service and other person-to-person service roles are NOT within the scope of GUI replacement — they are fundamentally different work.
This paper constructs a complete analytical chain from technical feasibility to economic tipping point, based on the author’s hands-on GUI Agent test data (Sonnet 4.6; $0.23 and 178 seconds per task), U.S. Bureau of Labor Statistics employment data, OSWorld benchmark results, and the thermodynamic information theory framework previously published by the author.
The Real Predicament of Generative AI: The Ineliminable Backend Long Tail
In 2025, “Slop” — referring to low-quality, mass-produced AI-generated digital content — was selected as Word of the Year by both Merriam-Webster and the American Dialect Society. This phenomenon requires precise interpretation: it does not mean generative AI is worthless, but reveals a deeper structural problem.
Generative AI is genuinely improving productivity for high-cognition workers. The author is one such beneficiary: as a one-person company founder, the author has completed several software products using AI programming. But this very process exposed generative AI’s fundamental problem. AI can rapidly generate code, yet generation is followed by massive testing, debugging, verification, and real-environment adaptation, and every one of those products is stuck at the testing stage because there is no one to help test it.
This is not an isolated case but the structural destiny of generative AI: for every output generated at the frontend, a series of testing, verification, and physical-world fact-alignment demands are produced at the backend. AI writes copy — humans must verify facts, proofread phrasing, confirm tone. AI generates a product image — designers must check proportions, color accuracy, detail distortions. AI writes code — programmers must run unit tests, integration tests, stress tests, and security audits.
Analyzed through the thermodynamic framework: generative AI reduced creation entropy at the frontend (accelerated output), but manufactured equal or greater verification entropy at the backend (testing, alignment, physical-world validation). Total entropy has not decreased — it has merely shifted from the “production stage” to the “verification stage.” This is why generative AI has not yet produced large-scale job displacement — its efficiency gains have been consumed by backend verification demands.
The Rise of Control-based AI: From “Producing Content” to “Executing Processes”
While generative AI faces the AI Slop crisis, a fundamentally different technology path is quietly taking shape. GUI (Graphical User Interface) control-based AI Agents generate no new content — they directly operate human computer screens, viewing interfaces, moving mice, clicking buttons, and typing text just as humans do.
This is an architectural difference of fundamental significance:
Generative AI
Outputs require human acceptance. AI-written copy needs editor review; AI-generated images need designer inspection; AI-written code needs programmer review. Every output drags an “alignment long tail” that requires human processing. AI reduced creation entropy at the frontend but manufactured new alignment-checking entropy at the backend. Total entropy is unchanged; it has merely been transferred.
Control-based AI
Operation results are binary-verified. A clicked button is clicked. A changed price is changed. A shipped order is shipped. The verification standard for GUI operations is success or failure, with no subjective human judgment required. Align once at the frontend, then run zero-cost batch execution at the backend. No long tail, no re-checking.
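The industrial consequence of this difference is as follows. Generative AI is “false efficiency”: the frontend speeds up, the backend piles on people, and the net change in headcount is close to zero. Control-based AI is “real layoffs”: align once at the frontend, execute in batches, keep only supervision at the backend, and net labor demand falls off a cliff.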
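The binary verification loop described for control-based AI can be sketched as "write, read back, compare." The following is a minimal illustration only: the in-memory `store`, the `write` and `read` helpers, and the SKU names are hypothetical stand-ins for a real backend interface, and one write is made to fail on purpose so the pass/fail outcome is visible.

```python
from typing import Callable, Dict, Optional

def apply_and_verify(
    updates: Dict[str, str],
    write: Callable[[str, str], None],
    read: Callable[[str], Optional[str]],
) -> Dict[str, bool]:
    """Write each value, read it back, and record a pass/fail flag per item."""
    results = {}
    for key, value in updates.items():
        write(key, value)
        results[key] = (read(key) == value)  # binary check, no human judgment
    return results

# Hypothetical in-memory stand-in for a seller backend.
store: Dict[str, str] = {}

def write(sku: str, price: str) -> None:
    if sku != "SKU-003":  # simulated failure: this update is silently dropped
        store[sku] = price

def read(sku: str) -> Optional[str]:
    return store.get(sku)

results = apply_and_verify(
    {"SKU-001": "19.99", "SKU-002": "24.50", "SKU-003": "9.00"}, write, read
)
failed = [sku for sku, ok in results.items() if not ok]
print("needs retry:", failed)
```

Only the items in `failed` need attention; everything else is accepted without review, which is the contrast with generative output that the comparison above draws.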
The Technical Essence of GUI Control: From Backend API to Frontend Visual Operation
Traditional browser automation (Selenium/Playwright/Puppeteer) is “backend control” — bypassing the interface humans see, directly connecting to browser low-level protocols (WebDriver/CDP), operating DOM trees and CSS selectors. It is essentially API calling, limited to systems with open interfaces.
GUI AI Agent is “frontend control” — it sees screenshots, exactly what humans see. It does not read HTML source code or use DevTools protocols. It truly “sees” the screen, recognizes “there is a blue button labeled Submit,” then generates mouse coordinates to click.
| Dimension | Traditional Script Automation (Backend) | GUI AI Agent (Frontend) |
|---|---|---|
| Operation Layer | DOM Tree / CSS Selectors / XPath | Screenshots / Visual Recognition |
| Protocol | WebDriver / CDP | Screenshot→Inference→Mouse/Keyboard |
| Fragility | Breaks when page structure changes | Works as long as visual semantics persist |
| Scope | Only systems with API/DOM access | Any system with a screen |
| Dev Barrier | Requires programmers to write scripts | Natural language task description suffices |
| Physical Friction | Low (structured interface) | Very low (deterministic electronic environment) |
This leap from “backend” to “frontend” breaks a fundamental limitation: backend control can only operate systems with open interfaces, yet in enterprise reality, massive amounts of work occur in “no API” environments — legacy ERP interfaces, government websites, cross-system copy-paste. Frontend control breaks this limitation: whatever AI can see, it can operate — perfectly aligned with human capability boundaries.
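The "Screenshot→Inference→Mouse/Keyboard" protocol in the table can be pictured as a simple loop. The sketch below is a toy under stated assumptions, not the author's software: `FakeScreen` replaces a real display with a small state dictionary, `toy_policy` replaces the vision-model call with hard-coded rules, and the coordinates and the `max_calls` budget are arbitrary.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str        # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

class FakeScreen:
    """Hypothetical stand-in for a display with a search box and a search button."""
    def __init__(self) -> None:
        self.focused = False
        self.query = ""
        self.submitted = False

    def screenshot(self) -> dict:
        # A real agent would receive pixels; a dict keeps the sketch runnable.
        return {"focused": self.focused, "query": self.query,
                "submitted": self.submitted}

    def apply(self, action: Action) -> None:
        if action.kind == "click" and (action.x, action.y) == (100, 40):
            self.focused = True                      # search box clicked
        elif action.kind == "click" and (action.x, action.y) == (300, 40) and self.query:
            self.submitted = True                    # search button clicked
        elif action.kind == "type" and self.focused:
            self.query += action.text

def toy_policy(task: str, shot: dict) -> Action:
    """Hypothetical replacement for the model: decide the next input from what is visible."""
    if shot["submitted"]:
        return Action("done")
    if not shot["focused"]:
        return Action("click", 100, 40)
    if not shot["query"]:
        return Action("type", text=task)
    return Action("click", 300, 40)

def run_agent(task: str, screen: FakeScreen,
              policy: Callable[[str, dict], Action], max_calls: int = 20) -> int:
    """Loop screenshot -> inference -> input until done; return the number of policy calls."""
    for calls in range(1, max_calls + 1):
        action = policy(task, screen.screenshot())
        if action.kind == "done":
            return calls
        screen.apply(action)
    raise RuntimeError("call budget exhausted")

screen = FakeScreen()
print("policy calls:", run_agent("weather today", screen, toy_policy))
```

Each pass through the loop corresponds to one "look at the screen," so the call count is the natural unit for both latency and cost.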
Author’s Hands-on Test: Sonnet 4.6 GUI Agent Real Performance and Cost
This research lab used AI programming to develop a GUI control software connected to Anthropic Sonnet 4.6, conducting initial tests in a real browser environment. Test task: open a target webpage, click the search bar, enter specific text, and search for weather information for a specific location on the current day.
Key finding: Accuracy on simple tasks is already acceptable. The optimization curve from 13 to 7 calls demonstrates that even without fine-tuning, context-based experience accumulation alone can significantly improve GUI operation efficiency. 7 calls means the Agent “looked at the screen 7 times” — humans need approximately 4-5 actions for the same task, so the Agent is already approaching human efficiency.
However, the 178-second execution time and $0.23 per-task cost reveal the current core contradiction: Technical accuracy has crossed the usability line, but economic cost remains above it — though falling rapidly.
Geographic Arbitrage: GUI Agent Viability Determined by Labor Costs
A $0.23 per-operation cost carries entirely different economic implications across countries:
| Country/Region | Relevant Hourly Wage | 178-sec Labor Cost | Agent Cost | Economic Viability |
|---|---|---|---|---|
| US (White-collar Ops) | $25-35/h | $1.23-1.73 | $0.23 | ✓ Agent already 5-7× cheaper than labor |
| W. Europe / Japan / Korea | $18-28/h | $0.89-1.38 | $0.23 | ✓ Agent already 3-6× cheaper than labor |
| China (Tier 1 Cities) | $5-8/h | $0.25-0.40 | $0.23 | ≈ Near tipping point |
| India / SE Asia (BPO) | $2-4/h | $0.10-0.20 | $0.23 | ✗ Labor still cheaper than Agent |
This “economic viability line” is determined by local labor costs: the US has already crossed it, Europe/Japan/Korea follow closely, China is at the tipping point, and developing nations have not yet reached it.
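The labor cost column in the table follows from one formula: hourly wage × 178 seconds ÷ 3600. The sketch below reproduces it as a check. The wage bands and the $0.23 agent cost are taken from the table; the function names are illustrative. Like the table, it assumes a human would spend the same 178 seconds on the task as the agent did, which is a simplification.

```python
TASK_SECONDS = 178   # duration of the author's test task
AGENT_COST = 0.23    # USD per task, from the author's test

# Hourly wage bands (USD) copied from the table above.
WAGE_BANDS = {
    "US (white-collar ops)": (25, 35),
    "W. Europe / Japan / Korea": (18, 28),
    "China (tier 1 cities)": (5, 8),
    "India / SE Asia (BPO)": (2, 4),
}

def labor_cost(hourly_wage: float, seconds: int = TASK_SECONDS) -> float:
    """Wage cost of a human spending `seconds` on the task."""
    return hourly_wage * seconds / 3600

def cost_ratio(hourly_wage: float, agent_cost: float = AGENT_COST) -> float:
    """Labor cost divided by agent cost; values below 1 mean labor is cheaper."""
    return labor_cost(hourly_wage) / agent_cost

for region, (low, high) in WAGE_BANDS.items():
    print(f"{region}: labor ${labor_cost(low):.2f}-{labor_cost(high):.2f}, "
          f"ratio {cost_ratio(low):.1f}x-{cost_ratio(high):.1f}x")
```

The printed values agree with the table to within a cent, and the ratio drops below 1 only in the lowest wage band.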
The critical factor is the direction of the cost curves: labor costs rise with inflation, while API costs fall with Moore’s Law and competition. Over the past year, mainstream LLM API prices have dropped more than tenfold. If this trend continues, the $0.23 search task may cost $0.05 within a year. At that point, even labor in the lowest wage band considered here ($2-4/h) would be more expensive than an Agent for this task.
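The tipping point can also be expressed as a break-even hourly wage: the wage at which 178 seconds of human time costs exactly as much as one Agent run. The sketch below is a simple rearrangement of the same formula, using the measured $0.23 and the projected $0.05 from the text; `break_even_wage` is an illustrative name.

```python
TASK_SECONDS = 178  # duration of the author's test task

def break_even_wage(agent_cost: float, seconds: int = TASK_SECONDS) -> float:
    """Hourly wage (USD) at which human time on the task costs the same as the agent."""
    return agent_cost * 3600 / seconds

for label, cost in [("measured", 0.23), ("projected", 0.05)]:
    print(f"{label}: ${cost:.2f}/task -> break-even ${break_even_wage(cost):.2f}/h")
```

At $0.23 the break-even wage is about $4.65/h, just under the $5-8/h band for China's tier 1 cities. At $0.05 it falls to about $1.01/h, below the $2-4/h BPO band.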
GUI Agent’s Precision Strike Zone: Pure Computer Operation Jobs
The replacement scope of GUI Agents must be strictly defined. What it replaces is “purely operating a computer” — a person sitting in front of a screen, executing standardized input, query, and modification operations in known software systems via mouse and keyboard. It does NOT replace any work involving interpersonal interaction, subjective judgment, or service communication. Customer service representatives also use computers, but their core work is person-to-person communication and emotional handling — entirely outside the scope of GUI replacement.
Job types precisely targeted by GUI Agents:
Pure data entry operators — approximately 140,000 data entry clerks in the US (BLS data), median salary ~$31,582, whose work consists of inputting data from one system into another. Additionally, bookkeeping and accounting clerks (~1.6 million) spend their days entering financial data into accounting software, generating reports, and reconciling numbers.
E-commerce backend operators — the most typical “fill-in-the-blank” work. What an Amazon operator does daily: log into seller backend → list new products (enter title, five bullet points, price, inventory quantity, shipping method) → batch-modify product prices → adjust ad bids (enter bid amounts for each keyword) → download sales reports → enter tracking numbers → update inventory figures. A store managing 500 SKUs may require hundreds to thousands of pure GUI clicks and entries per day. All operations occur within known backend interfaces in the browser — the most standard “fill-in-the-blank” work.
ERP/CRM system operators — entering orders, updating customer information, generating purchase orders, and processing inventory records in enterprise systems. The common characteristic: filling in fixed-format data, following fixed processes, in a fixed software interface.
Insurance/banking backend processors — claims data entry, policy status updates, system operation portions of transfer approval workflows (note: excludes client-facing judgment/decision portions — only the pure system operation parts).
Conservatively estimated, in the US alone, pure computer GUI operation jobs number between 3-5 million. This figure is intentionally conservative, strictly excluding all positions with interpersonal service components. Yet even this conservative number corresponds to total annual wages exceeding $100 billion — this is the addressable market for GUI Agents.
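The order of magnitude of this estimate can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that every job in the 3-5 million range pays the data entry median of $31,582 quoted above; the source does not state which wage it applied, so this is an assumption rather than the author's calculation.

```python
MEDIAN_WAGE = 31_582  # USD per year, data entry median quoted above (assumed for all jobs)

def total_wages(jobs: int, annual_wage: int = MEDIAN_WAGE) -> int:
    """Total annual wage bill in USD for `jobs` positions."""
    return jobs * annual_wage

def implied_wage(jobs: int, target_total: float = 100e9) -> float:
    """Average annual wage needed for `jobs` positions to total `target_total`."""
    return target_total / jobs

for jobs in (3_000_000, 5_000_000):
    print(f"{jobs:,} jobs: ${total_wages(jobs) / 1e9:.1f}B at the median; "
          f"$100B requires an average of ${implied_wage(jobs):,.0f}")
```

Under this assumption the range is roughly $95-158 billion, so the $100 billion figure holds across most of the range and is reached at the low end only if the average wage is somewhat above the data entry median.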
These positions share a common labor economics characteristic: extremely low scarcity and extremely high substitutability. They require no professional qualifications, no creative judgment, no relationship management — only the ability to “read a screen and click accurately.” When an AI Agent can do the same, these workers lose all bargaining power. This is not game-theoretic “competition” — game theory applies to scenarios where both parties hold agency — this is a pure labor scarcity collapse. When the supply side (AI) can replicate infinitely at near-zero marginal cost, the demand side (employers) makes a unilateral choice.
Real Layoff Cases: Not Predictions — Already Happening
On February 26, 2026, fintech company Block (S&P 500 constituent) laid off ~4,000 employees (~40% of workforce). CEO Jack Dorsey stated explicitly: this was not about business problems, but about AI tools enabling small teams to do what large organizations previously required. Block’s self-developed AI agent “Goose” assists with code writing, decision-making, and customer service — a 6,000-person + AI combination expected to handle the workload previously requiring 10,000 people.
In e-commerce, a leading cross-border e-commerce enterprise reported that after introducing AI Agents, work that previously required 6 employees and 18 hours for cross-platform product selection and reconciliation is now completed automatically by digital workers, reducing labor costs by 70%. Taobao’s AI customer service handled 300 million interactions during Singles’ Day 2025, with 100 million fully automated and human transfer rates dropping 20% year-over-year.
A deeper structural shift: Amazon’s AI infrastructure investment exceeded $150 billion, surpassing labor costs for the first time. Capital expenditure shifted from “buying labor” to “buying compute” — precisely the “capital restructuring” argued in this lab’s previous “From Parasitism to Symbiosis” paper.
March 2026: The Real State of GUI Agent Technology
An honest assessment is necessary: GUI AI Agents are still in early stages. The direction is confirmed, but engineering maturity is far from complete.
Benchmark evolution trajectory (OSWorld):
Real user experience: A TechCrunch journalist’s candid conclusion after testing OpenAI Operator: “I found myself assisting the Agent more than I’d like — which kind of defeats the point.” Reddit users: “Too slow, too expensive, too error-prone.” OpenAI itself admits Operator struggles with complex interfaces.
Core technical bottlenecks: OSWorld benchmarks reveal three key challenges — insufficient GUI grounding precision (misclicks), weak operational knowledge (falling into unproductive trial-and-error), and poor long-horizon planning (success rates plummeting for multi-step tasks). Moreover, the best Agents take 1.4× more steps than humans, with end-to-end latency reaching tens of minutes.
Overall assessment: Equivalent to the touchscreen PDA era of 2005 — the direction is right, but 2-3 years of engineering breakthroughs remain before an “iPhone moment.” All major companies (OpenAI, Anthropic, Google, Microsoft, ByteDance) have fully committed to this direction, and the concentration of engineering resources means this 2-3 year gap will be compressed rapidly.
The Biggest Barrier to Industrialization: Information Gap, Not Technology
The biggest bottleneck slowing GUI Agent deployment is not model intelligence, not API cost, but an information gap — two worlds completely isolated from each other.
AI Technology Circle
Checking OSWorld leaderboards on GitHub, setting up Docker sandboxes, discussing visual grounding precision. But they don’t know which buttons e-commerce operators click daily, in what order, or what judgments they make under what circumstances. Their benchmark is “change a table color in LibreOffice,” not “batch-modify prices for 500 SKUs in Amazon Seller Central.”
Business Operations Circle
E-commerce owners, finance managers, insurance claims supervisors. They watch employees repeatedly click screens hundreds of times daily, thinking “if only this could be automated.” But they don’t know GUI Agents exist, don’t know what Browser Use is, and when they hear “AI” they still think of ChatGPT writing articles.
The training data solution is extremely simple — but neither side knows it. Simply have a skilled employee work normally while screen-recording plus voice-narrating: “I’m now opening Seller Central, clicking this tab, checking if there are any new negative reviews today…” Speech-to-text creates a complete video combining “operations + judgment logic.” The AI model sees screen visuals + mouse trajectories (actions) + human narration (decision logic) — three information streams perfectly aligned, enabling it to map and learn the entire workflow.
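One way to picture the alignment of the three streams is timestamp matching: each mouse event is paired with the video frame shown at that moment and with the narration segment the employee was speaking. The sketch below is a minimal illustration, assuming a fixed frame rate and a transcript with per-segment start times; the class names, the `fps` value, and the sample data are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Narration:
    start: float   # seconds from the start of the recording
    text: str

@dataclass
class MouseEvent:
    t: float       # seconds from the start of the recording
    x: int
    y: int
    kind: str      # e.g. "click" or "move"

@dataclass
class Sample:
    frame_index: int
    event: MouseEvent
    narration: Optional[str]

def align(events: List[MouseEvent], narrations: List[Narration],
          fps: float = 2.0) -> List[Sample]:
    """Pair each event with its frame and the latest narration that began at or before it."""
    narrations = sorted(narrations, key=lambda n: n.start)
    starts = [n.start for n in narrations]
    samples = []
    for ev in sorted(events, key=lambda e: e.t):
        i = bisect_right(starts, ev.t) - 1           # last segment with start <= ev.t
        text = narrations[i].text if i >= 0 else None
        samples.append(Sample(int(ev.t * fps), ev, text))
    return samples

demo = align(
    [MouseEvent(6.5, 40, 80, "click"), MouseEvent(1.0, 10, 20, "click")],
    [Narration(0.5, "opening Seller Central"), Narration(5.0, "clicking the reviews tab")],
)
for s in demo:
    print(s.frame_index, s.event.kind, (s.event.x, s.event.y), s.narration)
```

Each resulting sample carries what was on screen, what was done, and why, which are the three streams the paragraph above describes.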
This training cost is near-zero — one computer, screen recording software, a microphone, and an employee working half a day. Compared to traditional ML training costs easily reaching tens of thousands of dollars, this is negligible. Moreover, the method has exceptional generalizability: any job whose work consists of “look at screen → judge → click” can be trained the same way.
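Why GUI Will Land First: The Physical Friction Ladder of Control-based AI
Within control-based AI there is a ladder of deployment speed determined by the physical friction of the operating environment. GUI Agents sit on the lowest rung of this ladder: their physical friction is the lowest, so they are the first to reach the tipping point of economic viability.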
| Dimension | GUI Agent (Electronic World) | AI Robotic Arm (Physical World) |
|---|---|---|
| Training Data Labeling | Screen recording + voice, employee produces dozens in half a day, zero cost | Teleoperation equipment + torque sensors + specialist engineers, 100× cost per sample |
| Environment Controllability | Browser is deterministic — same operation always yields same result | Temperature, humidity, material wear, gravity collisions — different every time |
| Trial-Error Cost | Refresh page and retry, zero physical loss | May damage workpieces or injure people |
| Cross-scenario Generalization | All browsers worldwide render HTML — interface logic is universal | Welding and polishing are entirely different mechanical models |
| Verification Loop | Page feedback IS the result — instantly verifiable | Requires X-ray inspection, CMM measurement, etc. |
| Expected Mass Deployment | 2026-2028 | 2028-2032 (optimistic) |
GUI Agents are not merely “slightly faster” than robotic arms — they lead by an entire industrial cycle. This is determined by physical law: the entropy of the electronic world is far lower than that of the physical world. Therefore, the data volume, engineering investment, and iteration cycles needed to achieve AI control in the electronic world are all far smaller than in the physical world.
Conclusions and Predictions
The core conclusions of this paper can be summarized in five propositions:
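Proposition 1: The industrial evolution of AI is undergoing a phase transition from “generative” to “control-based.” Generative AI has improved efficiency for high-cognition workers, but the backend testing, verification, and physical-world alignment that its outputs require consume the frontend efficiency gains, so it cannot deliver large-scale job replacement.
Proposition 2: GUI control-based AI has no “backend alignment long tail.” It performs standardized operations in a deterministic environment, that is, “fill-in-the-blank” work, and verification of each operation is binary, instant, and automatic. Align once at the frontend, and the backend becomes zero-cost batch execution.
Proposition 3: The economic viability of GUI Agents is determined by local labor costs and labor substitutability. Labor scarcity in pure computer operation jobs is extremely low; once an AI Agent’s per-operation cost falls below the local labor cost, replacement is unilateral and irreversible. The US has already crossed the line, and the global cost line is dropping fast.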
Proposition 4: The biggest barrier to GUI Agent industrialization is the information gap — those who know AI don’t know the specific “fill-in-the-blank” tasks in real business scenarios; those who know the scenarios don’t know GUI Agents exist. Screen recording + voice narration training can bridge this gap, but complex tasks involving implicit judgment still require finer task decomposition.
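Proposition 5: Within control-based AI there is a physical friction ladder. The physical friction of GUI (the electronic world) is far lower than that of robotic arms (the physical world), the cost of acquiring training data differs by a factor of one hundred, and deployment leads by an entire industrial cycle. GUI Agents will be the first wave of industrial AI deployment.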
References and Data Sources
[1] Merriam-Webster, 2025 Word of the Year: “Slop.” Low-quality AI-generated digital content.
[2] Meltwater, 2025 AI Slop consumer sentiment analysis report.
[3] Graphite SEO, 2025 analysis of the AI-generated share of English web content.
[4] OSWorld Benchmark (NeurIPS 2024). Desktop OS multimodal agent evaluation benchmark.
[5] Agent S3, Simular AI. Top score on the OSWorld leaderboard (72.6%, 100-step setting).
[6] OpenAI, Computer-Using Agent (CUA). Operator product technical report.
[7] Anthropic, Claude Computer Use Beta. Computer use capabilities of the 3.5 Sonnet / 4.6 series.
[8] U.S. Bureau of Labor Statistics (BLS). Office and administrative support employment statistics (2023-2024).
[9] OSWorld-Human (arXiv:2506.16042). Agent temporal efficiency benchmark study.
[10] TechCrunch, “OpenAI’s Operator agent helped me move, but I had to help it, too” (2025.02).
[11] 36Kr (36氪), “The First Wave of AI Layoffs Is Here” (第一波AI裁员潮,来了) (2026.03). Analysis of the Block 4,000-person layoff case.
[12] Tencent News (腾讯新闻), “Giants Are Laying Off, and This Time It Is Unprecedented” (巨头裁员,这次史无前例) (2025.12). Global trends in AI-driven layoffs.
[13] 实在智能, “AI Automation 2026 Analysis” (AI自动化2026解析). Cross-border e-commerce AI Agent deployment cases.
[14] Author empirical test data. Sonnet 4.6 GUI Agent performance and cost test (2026.03).
[15] 李朝全球人工智能研究所, “From Parasitism to Symbiosis” (基生到共生) (2026.02). Thermodynamic information theory framework.
[16] 李朝全球人工智能研究所, “The Fourth Industry” (第四产业) (2026.02). Cognitive economy and physical friction theory.
[17] 李朝全球人工智能研究所, “The Confrontation Between Information and Physics” (信息与物理的对抗) (2026.02). Physical possession theory.
[18] Firecrawl, “11 Best AI Browser Agents in 2026.” Browser automation market valuation of $24.25 billion.
[19] Skyvern, “AI Web Agents Complete Guide” (2025.11). AI Agent market data, $5.4 billion → $7.6 billion.