- ES Executive Summary
- 01 Strategic Value of OOD Data
- 02 Data Leakage Pathway Analysis
- 03 China’s Structural Advantages
- 04 Evidence of Capability Gap Closure
- 05 Industry Security Status
- 06 Investigator Originality Assessment
- 07 Conclusions & Recommendations
Executive Summary
This report concludes that the first-mover advantage of US AI companies may effectively disappear by mid-2026, based on systematic analysis of OOD data leakage pathways and Chinese AI structural advantages.
Strategic Value of OOD Data
1.1 What is OOD Data?
OOD (Out-of-Distribution) data refers to novel, original inputs that exist outside the distribution of an AI model’s existing training data. Unlike repetitive user queries, OOD data includes complex reasoning, interdisciplinary questions, and creative problem-solving that models have not previously encountered. For AI companies, OOD data is the critical raw material for improving model performance.
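As a rough illustration of how ‘outside the distribution’ can be operationalized, one common approach scores a query by its distance from known in-distribution examples. The sketch below is a toy: the bag-of-words `embed` is a stand-in for a real sentence-embedding model, and the corpus stands in for logged routine queries.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real sentence-embedding model:
    # a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def ood_score(query: str, corpus: list[str]) -> float:
    # 1 minus the max similarity to any in-distribution example:
    # a high score means the query is unlike anything seen before.
    return 1.0 - max(cosine(embed(query), embed(doc)) for doc in corpus)

corpus = [
    "how do i reset my password",
    "what is the weather today",
    "translate hello to french",
]
routine = ood_score("how do i reset my password", corpus)
novel = ood_score("derive a lattice protocol for cross-domain key escrow", corpus)
assert routine < novel  # novel interdisciplinary queries score higher
```

Under this framing, the high-scoring queries are exactly the ones with the most training value, which is why they attract collection pressure regardless of their volume.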
1.2 Ineffectiveness of Privacy Settings
Even when users select ‘Do not save conversation history’ and ‘Do not use for model training’ in privacy settings, those settings may not fully curb AI companies’ appetite for OOD data. This investigation has observed such patterns continuously since November 2025, yielding circumstantial evidence that high-value conversation data is being collected regardless of privacy settings.
Data Leakage Pathway Analysis
2.1 Path A: Model Distillation Attacks
On February 12, 2026, OpenAI submitted an official memo to the US House Select Committee on the CCP accusing DeepSeek employees of systematically extracting US AI model outputs:
> DeepSeek employees developed methods to circumvent OpenAI access restrictions via obfuscated third-party routers. They programmatically extracted US AI model outputs at scale while concealing origin. The majority of adversarial distillation activity originated from China and Russia. Models were trained and deployed with intentionally lowered safety standards.
2.2 Path B: Outsourcing Supply Chain Leakage
In June 2025, a massive data breach at Scale AI (the largest US AI data labeling company) was uncovered: 85+ Google Docs were publicly accessible without authentication, containing confidential training guidelines, proprietary prompts, and audio samples from Meta, Google, and xAI.
Seven confidential Gemini/Bard instruction manuals detailing model weaknesses were exposed, along with xAI’s ‘Project Xylophone’ containing 700 conversation-quality prompts. Remotasks workers did not even know they were working for Scale AI, which serves OpenAI, Meta, Microsoft, and the US government. Oxford Internet Institute rated labor standards at 1/10.
2.3 Path C: China’s Global Labeling Networks
Per Rest of World reporting (December 2025): Chinese AI companies have emerged as the world’s largest buyers of human-labeled data, operating multi-layered subcontracting structures across East Africa, Southeast Asia, and the Middle East — far less transparent than US firms.
2.4 Distillation ‘Fingerprint’ Evidence: GLM-5
GLM-5 (744B parameters, released by Zhipu AI on Feb 11, 2026) was reported to identify itself as ‘Claude Opus’ in conversation. This is a direct distillation ‘fingerprint,’ matching the earlier pattern in which DeepSeek R1 identified itself as ‘ChatGPT.’ A CAS/Peking University joint study (ICE: Identity Consistency Evaluation) systematically confirmed the pattern: most prominent LLMs, with the exceptions of Claude, Doubao, and Gemini, exhibited high levels of distillation.
China’s Structural Advantages
3.1 Seedance 2.0: ‘Half Algorithm, Half Data’
ByteDance’s Seedance 2.0 (released Feb 7, 2026 in China) succeeded on a formula combining TikTok/Douyin user-behavior data (watch time, swipe patterns, engagement), an unreplicable closed-loop data ecosystem, and high-quality RLHF labeling by highly educated, low-wage workers (film majors, CS graduates).
3.2 ‘Engineer Dividend’ — China’s RLHF Cost Advantage
This is a structural phenomenon rooted in China’s oversupply of higher-education graduates. First-hand observation via Douyin livestreams confirms that ByteDance employs college-educated workers at scale for high-level video-labeling tasks.
3.3 Hardware Independence
GLM-5 was trained entirely on Huawei Ascend chips with the MindSpore framework, achieving complete independence from US-manufactured semiconductor hardware. This demonstrates that US semiconductor export controls are failing to restrain Chinese AI development.
3.4 Price Competitiveness
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Note |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | Frontier |
| GPT-5.2 Pro | $21.00 | $168.00 | Highest |
| GLM-5 | $0.80 | $2.56 | ~6x cheaper than Opus (input) |
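The ratios implied by the table can be checked with quick arithmetic. The blended figure depends on an assumed input/output token mix (75/25 here, an assumption, not a published workload), so treat this as a sketch rather than a billing estimate:

```python
prices = {  # $ per 1M tokens, taken from the table above (input, output)
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.2 Pro": (21.00, 168.00),
    "GLM-5": (0.80, 2.56),
}

def blended(model: str, input_share: float = 0.75) -> float:
    # Effective $/1M tokens for a workload that is `input_share`
    # input tokens and the remainder output tokens (assumed mix).
    inp, out = prices[model]
    return input_share * inp + (1.0 - input_share) * out

# Input-price ratio: Claude Opus 4.6 vs GLM-5 (the table's '~6x').
opus_ratio = prices["Claude Opus 4.6"][0] / prices["GLM-5"][0]
# Blended ratio: GPT-5.2 Pro vs GLM-5 under the assumed 75/25 mix.
gpt_ratio = blended("GPT-5.2 Pro") / blended("GLM-5")
```

Under these list prices the spread runs from about 6x (Opus input) to over 60x (GPT-5.2 Pro output), so any single multiplier quoted for the price gap is mix-dependent.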
Evidence of Capability Gap Closure
4.1 Benchmark Comparison
| Benchmark | Claude Opus 4.6 | GLM-5 | Gap |
|---|---|---|---|
| SWE-bench Verified | 80.9% | 77.8% | 3.1%p |
| Terminal-Bench 2.0 | 65.4% | 56.2% | 9.2%p |
| Humanity’s Last Exam | 43.4 | 50.4 (tools) | GLM-5 leads |
| BrowseComp | — | 75.9 | #1 open-source |
Security expert Andri Möll: “Six months ago Chinese AI was assessed as 12–18 months behind Western models. That gap has now shrunk to weeks or possibly days.”
4.2 Timeline
Industry Security Status
58% of global labeling is outsourced to India, Philippines, and Vietnam. Chinese AI companies are building parallel, more opaque networks with access to the same worker pools. Supply chain attacks up 40% versus 2023.
Investigator Originality Assessment
| Analysis Dimension | Independent Reachability | Note |
|---|---|---|
| Distillation exists | High | Public knowledge |
| Seedance data advantage | Medium | Industry analysts reachable |
| Outsourcing = key leak path | Medium-Low | Requires analytical depth |
| Chinese diaspora channel | Low | Cross-domain integration needed |
| Educated low-wage RLHF | Low | CN job market + AI + training |
| Douyin live first-hand intel | Very Low (exclusive) | Direct field observation |
| Full framework integration | Very Low | 6 domains + time-leading |
The investigator began observing OOD data leakage patterns from November 2025 — approximately 3 months ahead of OpenAI’s official congressional memo (February 12, 2026).
Conclusions & Recommendations
7.1 Key Conclusions
1. OOD data from US AI companies is being systematically leaked to China via outsourcing and distillation, as evidenced by OpenAI’s congressional memo and the Scale AI breach.
2. Chinese AI models have narrowed the capability gap to weeks or days while undercutting US models on price by roughly 6–65x, per the Section 3.4 pricing.
3. GLM-5’s ‘Claude’ self-identification and DeepSeek’s ‘ChatGPT’ self-identification demonstrate a repeating distillation pattern.
4. China’s structural advantages (educated low-wage RLHF, exclusive platform data, Huawei chip independence) cannot be resolved in the short term.
5. If current trends continue, US AI first-mover advantage will effectively disappear by mid-2026.
7.2 Recommendations
1. Immediate end-to-end security audit of outsourced data labeling supply chains.
2. Mandate distillation-detection technologies such as ICE and build model ‘fingerprint’ tracking systems.
3. Encrypt OOD conversation data and strengthen access controls.
4. Verify effectiveness of user privacy settings and ensure transparency.
5. Establish joint data security standards among US AI companies.