- ES Executive Summary
- 01 Strategic Value of OOD Data
- 02 Data Leakage Pathway Analysis
- 03 China’s Structural Advantages
- 04 Evidence of Capability Gap Closure
- 05 Industry Security Status
- 06 Investigator Originality Assessment
- 07 Conclusions & Recommendations
Executive Summary
This report concludes that the first-mover advantage of US AI companies may effectively disappear by mid-2026, based on systematic analysis of OOD data leakage pathways and Chinese AI structural advantages.
Strategic Value of OOD Data
1.1 What is OOD Data?
OOD (Out-of-Distribution) data refers to novel, original inputs that exist outside the distribution of an AI model’s existing training data. Unlike repetitive user queries, OOD data includes complex reasoning, interdisciplinary questions, and creative problem-solving that models have not previously encountered. For AI companies, OOD data is the critical raw material for improving model performance.
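As a rough illustration of how ‘outside the distribution’ can be operationalized, one common approach scores a query by its distance from known in-distribution examples. The sketch below is a toy: the bag-of-words `embed` is a stand-in for a real sentence-embedding model, and the corpus stands in for logged routine queries.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real sentence-embedding model:
    # a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def ood_score(query: str, corpus: list[str]) -> float:
    # 1 minus the max similarity to any in-distribution example:
    # a high score means the query is unlike anything seen before.
    return 1.0 - max(cosine(embed(query), embed(doc)) for doc in corpus)

corpus = [
    "how do i reset my password",
    "what is the weather today",
    "translate hello to french",
]
routine = ood_score("how do i reset my password", corpus)
novel = ood_score("derive a lattice protocol for cross-domain key escrow", corpus)
assert routine < novel  # novel interdisciplinary queries score higher
```

Under this framing, the high-scoring queries are exactly the ones with the most training value, which is why they attract collection pressure regardless of their volume.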
1.2 Ineffectiveness of Privacy Settings
Even when users select ‘Do not save conversation history’ and ‘Do not use for model training’ in privacy settings, those settings may not fully curb AI companies’ appetite for OOD data. This investigation has observed such patterns continuously since November 2025, yielding circumstantial evidence that high-value conversation data is being collected regardless of privacy settings.
Data Leakage Pathway Analysis
2.1 Path A: Model Distillation Attacks
On February 12, 2026, OpenAI submitted an official memo to the US House Select Committee on the CCP accusing DeepSeek employees of systematically extracting US AI model outputs:
> DeepSeek employees developed methods to circumvent OpenAI access restrictions via obfuscated third-party routers. They programmatically extracted US AI model outputs at scale while concealing origin. The majority of adversarial distillation activity originated from China and Russia. Models were trained and deployed with intentionally lowered safety standards.
2.2 Path B: Outsourcing Supply Chain Leakage
In June 2025, a massive data breach at Scale AI (the largest US AI data labeling company) was uncovered: 85+ Google Docs were publicly accessible without authentication, containing confidential training guidelines, proprietary prompts, and audio samples from Meta, Google, and xAI.
Seven confidential Gemini/Bard instruction manuals detailing model weaknesses were exposed, along with xAI’s ‘Project Xylophone’ containing 700 conversation-quality prompts. Remotasks workers did not even know they were working for Scale AI, which serves OpenAI, Meta, Microsoft, and the US government. Oxford Internet Institute rated labor standards at 1/10.
2.3 Path C: China’s Global Labeling Networks
Per Rest of World reporting (December 2025): Chinese AI companies have emerged as the world’s largest buyers of human-labeled data, operating multi-layered subcontracting structures across East Africa, Southeast Asia, and the Middle East — far less transparent than US firms.
2.4 Distillation ‘Fingerprint’ Evidence: GLM-5
GLM-5 (744B parameters, released by Zhipu AI on Feb 11, 2026) was reported to identify itself as ‘Claude Opus’ in conversation. This is a direct distillation ‘fingerprint,’ matching the earlier pattern in which DeepSeek R1 identified itself as ‘ChatGPT.’ A CAS/Peking University joint study (ICE: Identity Consistency Evaluation) systematically confirmed the pattern: most prominent LLMs, with the exceptions of Claude, Doubao, and Gemini, exhibited high levels of distillation.
China’s Structural Advantages
3.1 Seedance 2.0: ‘Half Algorithm, Half Data’
ByteDance’s Seedance 2.0 (released Feb 7, 2026 in China) succeeded on a formula combining TikTok/Douyin user-behavior data (watch time, swipe patterns, engagement), an unreplicable closed-loop data ecosystem, and high-quality RLHF labeling by highly educated, low-wage workers (film majors, CS graduates).
3.2 ‘Engineer Dividend’ — China’s RLHF Cost Advantage
This is a structural phenomenon rooted in China’s oversupply of higher-education graduates. First-hand observation via Douyin livestreams confirms that ByteDance employs college-educated workers at scale for high-level video-labeling tasks.
3.3 Hardware Independence
GLM-5 was trained entirely on Huawei Ascend chips with the MindSpore framework, achieving complete independence from US-manufactured semiconductor hardware. This demonstrates that US semiconductor export controls are failing to restrain Chinese AI development.
3.4 Price Competitiveness
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Note |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | Frontier |
| GPT-5.2 Pro | $21.00 | $168.00 | Highest |
| GLM-5 | $0.80 | $2.56 | ~6x cheaper than Opus (input) |
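The ratios implied by the table can be checked with quick arithmetic. The blended figure depends on an assumed input/output token mix (75/25 here, an assumption, not a published workload), so treat this as a sketch rather than a billing estimate:

```python
prices = {  # $ per 1M tokens, taken from the table above (input, output)
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.2 Pro": (21.00, 168.00),
    "GLM-5": (0.80, 2.56),
}

def blended(model: str, input_share: float = 0.75) -> float:
    # Effective $/1M tokens for a workload that is `input_share`
    # input tokens and the remainder output tokens (assumed mix).
    inp, out = prices[model]
    return input_share * inp + (1.0 - input_share) * out

# Input-price ratio: Claude Opus 4.6 vs GLM-5 (the table's '~6x').
opus_ratio = prices["Claude Opus 4.6"][0] / prices["GLM-5"][0]
# Blended ratio: GPT-5.2 Pro vs GLM-5 under the assumed 75/25 mix.
gpt_ratio = blended("GPT-5.2 Pro") / blended("GLM-5")
```

Under these list prices the spread runs from about 6x (Opus input) to over 60x (GPT-5.2 Pro output), so any single multiplier quoted for the price gap is mix-dependent.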
Evidence of Capability Gap Closure
4.1 Benchmark Comparison
| Benchmark | Claude Opus 4.6 | GLM-5 | Gap |
|---|---|---|---|
| SWE-bench Verified | 80.9% | 77.8% | 3.1%p |
| Terminal-Bench 2.0 | 65.4% | 56.2% | 9.2%p |
| Humanity’s Last Exam | 43.4 | 50.4 (tools) | GLM-5 leads |
| BrowseComp | — | 75.9 | #1 open-source |
Security expert Andri Möll: “Six months ago Chinese AI was assessed as 12–18 months behind Western models. That gap has now shrunk to weeks or possibly days.”
4.2 Timeline
Industry Security Status
58% of global labeling is outsourced to India, Philippines, and Vietnam. Chinese AI companies are building parallel, more opaque networks with access to the same worker pools. Supply chain attacks up 40% versus 2023.
Investigator Originality Assessment
| Analysis Dimension | Independent Reachability | Note |
|---|---|---|
| Distillation exists | High | Public knowledge |
| Seedance data advantage | Medium | Industry analysts reachable |
| Outsourcing = key leak path | Medium-Low | Requires analytical depth |
| Chinese diaspora channel | Low | Cross-domain integration needed |
| Educated low-wage RLHF | Low | CN job market + AI + training |
| Douyin live first-hand intel | Very Low (exclusive) | Direct field observation |
| Full framework integration | Very Low | 6 domains + time-leading |
The investigator began observing OOD data leakage patterns from November 2025 — approximately 3 months ahead of OpenAI’s official congressional memo (February 12, 2026).
Conclusions & Recommendations
7.1 Key Conclusions
1. OOD data from US AI companies is being systematically leaked to China via outsourcing and distillation, as evidenced by OpenAI’s congressional memo and the Scale AI breach.
2. Chinese AI models have narrowed the capability gap to weeks or days while undercutting US models on price by roughly 6–65x, per the Section 3.4 pricing.
3. GLM-5’s ‘Claude’ self-identification and DeepSeek’s ‘ChatGPT’ self-identification demonstrate a repeating distillation pattern.
4. China’s structural advantages (educated low-wage RLHF, exclusive platform data, Huawei chip independence) cannot be resolved in the short term.
5. If current trends continue, US AI first-mover advantage will effectively disappear by mid-2026.
7.2 Recommendations
1. Immediate end-to-end security audit of outsourced data labeling supply chains.
2. Mandate distillation-detection technologies such as ICE and build model ‘fingerprint’ tracking systems.
3. Encrypt OOD conversation data and strengthen access controls.
4. Verify effectiveness of user privacy settings and ensure transparency.
5. Establish joint data security standards among US AI companies.