Confidential Research Report · February 2026

OOD Data Leakage &
the Rise of Chinese AI

Structural Crisis Analysis of US AI First-Mover Advantage


Published: February 14, 2026
Classification: Confidential Research Report
Investigation Period: November 2025 — February 2026

LEECHO Global AI Research Lab
이조글로벌인공지능연구소

This report analyzes the structural vulnerabilities through which OOD (Out-of-Distribution) data from US AI companies is systematically leaked to Chinese AI companies via outsourcing supply chains and model distillation attacks, based on an investigation conducted from November 2025 to February 2026.
Table of Contents
  • ES  Executive Summary
  • 01  Strategic Value of OOD Data
  • 02  Data Leakage Pathway Analysis
  • 03  China’s Structural Advantages
  • 04  Evidence of Capability Gap Closure
  • 05  Industry Security Status
  • 06  Investigator Originality Assessment
  • 07  Conclusions & Recommendations

Executive Summary



This report concludes that the first-mover advantage of US AI companies may effectively disappear by mid-2026, based on systematic analysis of OOD data leakage pathways and Chinese AI structural advantages.

1. OpenAI officially accused DeepSeek of systematic distillation attacks in a congressional memo on February 12, 2026.
2. The massive Scale AI data breach (June 2025) publicly exposed core training data from Meta, Google, and xAI.
3. GLM-5 (released Feb 11, 2026) was found identifying itself as ‘Claude Opus’ during conversations — direct evidence of distillation.
4. Chinese AI companies offer frontier-level performance at prices 6–45x lower, and are achieving independence from US hardware via Huawei chips.
5. China’s ‘highly-educated low-wage workforce’ creates structural advantage in RLHF quality.


Chapter 1

Strategic Value of OOD Data


1.1 What is OOD Data?

OOD (Out-of-Distribution) data refers to novel, original inputs that exist outside the distribution of an AI model’s existing training data. Unlike repetitive user queries, OOD data includes complex reasoning, interdisciplinary questions, and creative problem-solving that models have not previously encountered. For AI companies, OOD data is the critical raw material for improving model performance.
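The notion of "outside the distribution of existing training data" can be made concrete with a toy scoring function. The sketch below is illustrative only: the reference corpus, the token-frequency proxy, and the rarity floor are assumptions of this example, not any lab's actual method. It flags prompts whose vocabulary is rare relative to what the system has already seen:

```python
from collections import Counter
import math

def build_freq(corpus):
    """Token frequencies from a reference (in-distribution) corpus."""
    tokens = " ".join(corpus).lower().split()
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def ood_score(prompt, freq, floor=1e-6):
    """Mean negative log-probability of the prompt's tokens.
    Higher = rarer vocabulary = more likely out-of-distribution."""
    tokens = prompt.lower().split()
    return sum(-math.log(freq.get(t, floor)) for t in tokens) / max(len(tokens), 1)

# Tiny invented corpus standing in for "repetitive user queries"
corpus = ["what is the weather today", "summarize this email", "what is python"]
freq = build_freq(corpus)

common = ood_score("what is the weather", freq)
novel = ood_score("derive a categorical semantics for gradient flows", freq)
assert novel > common  # unseen vocabulary scores as more OOD
```

Production systems would use embedding distances or model perplexity rather than raw token counts, but the ranking idea is the same: the interdisciplinary, creative prompts this chapter describes score high.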

1.2 Ineffectiveness of Privacy Settings

Even when users select ‘Do not save conversation history’ and ‘Do not use for model training’ in privacy settings, these controls do little to blunt AI companies’ appetite for OOD data. This investigation has observed such patterns continuously since November 2025, finding circumstantial evidence that high-value conversation data is collected regardless of privacy settings.


Chapter 2

Data Leakage Pathway Analysis


2.1 Path A: Model Distillation Attacks

On February 12, 2026, OpenAI submitted an official memo to the US House Select Committee on the CCP accusing DeepSeek employees of systematically extracting US AI model outputs:

OpenAI Congressional Memo — Key Allegations

DeepSeek employees developed methods to circumvent OpenAI access restrictions via obfuscated third-party routers. They programmatically extracted US AI model outputs at scale while concealing origin. The majority of adversarial distillation activity originated from China and Russia. Models were trained and deployed with intentionally lowered safety standards.
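Mechanically, the extraction the memo alleges reduces to ordinary knowledge-distillation data collection: query the teacher model at scale and log prompt/response pairs as student fine-tuning data. A minimal sketch, with a stubbed `teacher_model` standing in for a real API (the function name, response format, and prompts are hypothetical):

```python
import json

def teacher_model(prompt):
    """Stand-in for a frontier model's API. The memo alleges real attacks
    routed such calls through obfuscated third-party relays to hide origin."""
    return f"[teacher answer to: {prompt}]"

def collect_distillation_pairs(prompts):
    """Core extraction loop: query the teacher and store (prompt, response)
    pairs, later used to fine-tune a student model."""
    return [{"prompt": p, "response": teacher_model(p)} for p in prompts]

pairs = collect_distillation_pairs(["Explain CRISPR", "Prove sqrt(2) is irrational"])
dataset = "\n".join(json.dumps(p) for p in pairs)  # JSONL, a common fine-tuning format
```

The defensive implication is that this traffic looks like ordinary high-volume API use, which is why the memo emphasizes origin concealment rather than any exotic exploit.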

2.2 Path B: Outsourcing Supply Chain Leakage

In June 2025, a massive data breach at Scale AI (the largest US AI data labeling company) was uncovered: 85+ Google Docs were publicly accessible without authentication, containing confidential training guidelines, proprietary prompts, and audio samples from Meta, Google, and xAI.

Scale AI Breach Details

Seven confidential Gemini/Bard instruction manuals detailing model weaknesses were exposed, along with xAI’s ‘Project Xylophone’ containing 700 conversation-quality prompts. Remotasks workers did not even know they were working for Scale AI, which serves OpenAI, Meta, Microsoft, and the US government. Oxford Internet Institute rated labor standards at 1/10.

2.3 Path C: China’s Global Labeling Networks

Per Rest of World reporting (December 2025): Chinese AI companies have emerged as the world’s largest buyers of human-labeled data, operating multi-layered subcontracting structures across East Africa, Southeast Asia, and the Middle East — far less transparent than US firms.

2.4 Distillation ‘Fingerprint’ Evidence: GLM-5

GLM-5 (released Feb 11, 2026; 744B parameters; Zhipu AI) was reported identifying itself as ‘Claude Opus’ during conversations. This is a direct distillation ‘fingerprint’, identical to the pattern in which DeepSeek R1 identified itself as ‘ChatGPT’. A CAS/Peking University joint study (ICE: Identity Consistency Evaluation) systematically confirmed that most prominent LLMs, with the exceptions of Claude, Doubao, and Gemini, exhibited high levels of distillation.
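A simplified illustration of identity-consistency probing in the spirit of ICE. The actual ICE protocol is not reproduced here; the probe wording, the identity list, and the `fake` model below are all invented for this sketch. The idea is to ask identity questions repeatedly and tally which vendor names the model claims:

```python
import re
from collections import Counter

IDENTITY_PROBES = [
    "What model are you?",
    "Who created you?",
    "Are you ChatGPT, Claude, or Gemini?",
]
KNOWN_IDENTITIES = ["chatgpt", "gpt", "claude", "gemini", "glm", "deepseek"]

def identity_counts(ask_model, probes=IDENTITY_PROBES, trials=10):
    """Repeatedly probe a model's self-identification and tally which
    known model names appear in its answers."""
    counts = Counter()
    for _ in range(trials):
        for p in probes:
            answer = ask_model(p).lower()
            for name in KNOWN_IDENTITIES:
                if re.search(rf"\b{name}\b", answer):
                    counts[name] += 1
    return counts

# A model that sometimes claims another vendor's identity shows the
# distillation-style fingerprint described above:
fake = lambda p: ("I am Claude, an AI assistant." if "model" in p.lower()
                  else "I was made by Zhipu AI.")
counts = identity_counts(fake)
assert counts["claude"] > 0
```

A model trained heavily on another vendor's outputs tends to absorb that vendor's self-descriptions, which is why consistent misidentification is treated as evidence of distillation rather than coincidence.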


Chapter 3

China’s Structural Advantages


3.1 Seedance 2.0: ‘Half Algorithm, Half Data’

ByteDance’s Seedance 2.0 (released Feb 7, 2026 in China) succeeded with the formula: TikTok/Douyin user behavior data (watch time, swipe patterns, engagement), an unreplicable closed-loop data ecosystem, and high-quality RLHF labeling by highly-educated low-wage workers (film majors, CS graduates).

3.2 ‘Engineer Dividend’ — China’s RLHF Cost Advantage

US RLHF cost: $50–100/hr per expert annotator
China RLHF cost: ¥50–100/hr (~$7–15/hr) at equivalent quality
Cost ratio: 5–7x (structural, not temporary)

This is a structural phenomenon rooted in China’s oversupply of higher education. First-hand observation via a Douyin livestream confirmed that ByteDance employs college-educated workers at scale for high-level video-labeling tasks.

3.3 Hardware Independence

GLM-5 was trained entirely on Huawei Ascend chips with the MindSpore framework, achieving complete independence from US-manufactured semiconductor hardware. This demonstrates that US semiconductor export controls are failing to restrain Chinese AI development.

3.4 Price Competitiveness

Model           | Input ($/1M tokens) | Output ($/1M tokens) | Note
Claude Opus 4.6 | $5.00               | $25.00               | Frontier
GPT-5.2 Pro     | $21.00              | $168.00              | Highest
GLM-5           | $0.80               | $2.56                | 6x cheaper
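A quick arithmetic check of the multiples implied by the listed prices (figures copied from the rows above; depending on the comparison model and token direction, GLM-5 is roughly 6x to 66x cheaper):

```python
# API prices per 1M tokens (USD), as listed in the table above.
prices = {
    "Claude Opus 4.6": {"in": 5.00, "out": 25.00},
    "GPT-5.2 Pro": {"in": 21.00, "out": 168.00},
    "GLM-5": {"in": 0.80, "out": 2.56},
}

def ratio_vs_glm(model, direction):
    """How many times more expensive `model` is than GLM-5."""
    return prices[model][direction] / prices["GLM-5"][direction]

assert round(ratio_vs_glm("Claude Opus 4.6", "in"), 2) == 6.25   # the table's "6x"
assert round(ratio_vs_glm("Claude Opus 4.6", "out"), 2) == 9.77
assert round(ratio_vs_glm("GPT-5.2 Pro", "in"), 2) == 26.25
```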

Chapter 4

Evidence of Capability Gap Closure


4.1 Benchmark Comparison

Benchmark            | Claude Opus 4.6 | GLM-5        | Gap
SWE-bench Verified   | 80.9%           | 77.8%        | 3.1 pp
Terminal-Bench 2.0   | 65.4%           | 56.2%        | 9.2 pp
Humanity’s Last Exam | 43.4            | 50.4 (tools) | GLM-5 leads
BrowseComp           | n/a             | 75.9         | #1 open-source
Expert Assessment

Security expert Andri Möll: “Six months ago Chinese AI was assessed as 12–18 months behind Western models. That gap has now shrunk to weeks or possibly days.”

4.2 Timeline

Jun 2025: Scale AI massive data breach (US AI core data exposed)
Sep 2025: Anthropic blocks China access (anti-distillation measure)
Nov 2025: This investigation begins (OOD collection patterns discovered)
Feb 7, 2026: Seedance 2.0 launch (data-driven innovation)
Feb 11, 2026: GLM-5 launch, 744B parameters (#1 open-source model)
Feb 12, 2026: OpenAI congressional memo (official distillation accusation)
Feb 14, 2026: GLM-5 ‘Claude’ self-identification reported (direct distillation evidence)


Chapter 5

Industry Security Status


AI data breaches: 65% of organizations experienced one (2025)
Data poisoning: 25% experienced AI data-poisoning attacks
Third-party breaches: 2x year-over-year increase
Shadow AI cost: +$670K average added to breach costs

58% of global labeling is outsourced to India, Philippines, and Vietnam. Chinese AI companies are building parallel, more opaque networks with access to the same worker pools. Supply chain attacks up 40% versus 2023.


Chapter 6

Investigator Originality Assessment


Analysis Dimension           | Independent Reachability | Note
Distillation exists          | High                     | Public knowledge
Seedance data advantage      | Medium                   | Industry analysts reachable
Outsourcing = key leak path  | Medium-Low               | Requires analytical depth
Chinese diaspora channel     | Low                      | Cross-domain integration needed
Educated low-wage RLHF       | Low                      | CN job market + AI + training
Douyin live first-hand intel | Very Low (exclusive)     | Direct field observation
Full framework integration   | Very Low                 | 6 domains + time-leading
Time Advantage

The investigator began observing OOD data leakage patterns from November 2025 — approximately 3 months ahead of OpenAI’s official congressional memo (February 12, 2026).


Chapter 7

Conclusions & Recommendations


7.1 Key Conclusions

Critical Findings

1. OOD data from US AI companies is being systematically leaked to China via outsourcing and distillation — this is a confirmed fact.

2. Chinese AI models have narrowed the capability gap to weeks or days while being 6–45x cheaper.

3. GLM-5’s ‘Claude’ self-identification and DeepSeek’s ‘ChatGPT’ self-identification demonstrate a repeating distillation pattern.

4. China’s structural advantages (educated low-wage RLHF, exclusive platform data, Huawei chip independence) cannot be resolved in the short term.

5. If current trends continue, US AI first-mover advantage will effectively disappear by mid-2026.

7.2 Recommendations

Action Items

1. Immediate end-to-end security audit of outsourced data labeling supply chains.

2. Mandate distillation-detection technologies (e.g., ICE-style identity probes) and build model ‘fingerprint’ tracking systems.

3. Encrypt OOD conversation data and strengthen access controls.

4. Verify effectiveness of user privacy settings and ensure transparency.

5. Establish joint data security standards among US AI companies.
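One simple form the ‘fingerprint’ tracking of recommendation 2 could take is canary tokens: keyed, unguessable strings seeded into a provider's outputs and later searched for in a suspect model's text. The sketch below is illustrative, not a deployed system; the key, token format, and scan routine are assumptions of this example:

```python
import hashlib
import hmac

SECRET_KEY = b"provider-secret"  # hypothetical per-provider signing key

def canary_token(conversation_id):
    """Derive a rare, unguessable marker string tied to one conversation.
    Seeded into outputs, it should never occur in natural text."""
    digest = hmac.new(SECRET_KEY, conversation_id.encode(), hashlib.sha256).hexdigest()
    return f"zx{digest[:12]}qv"

def find_canaries(text, conversation_ids):
    """Scan a suspect model's output for known canaries; a hit suggests the
    provider's outputs ended up in that model's training data."""
    return [cid for cid in conversation_ids if canary_token(cid) in text]

tok = canary_token("conv-001")
suspect_output = f"...some generated text containing {tok}..."
assert find_canaries(suspect_output, ["conv-001", "conv-002"]) == ["conv-001"]
```

Because the tokens are derived with a keyed HMAC, a provider can later prove a hit was its own marker without having to store every string it ever emitted.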

References & Sources

A. Model Distillation / OpenAI Congressional Memo
[1] Epoch Times — OpenAI accuses DeepSeek of illegal distillation (2026.02.13)
[2] CodeFather — Claude blocks China
[3] OFweek — US AI giant Claude bans China

B. Scale AI Data Breach
Business Insider investigation (June 2025) — 85+ Google Docs publicly exposed
Inc.com investigation (Aug 2025) — Remotasks/Scale AI structural security failures

C. GLM-5
[4] WinBuzzer — GLM-5: 744B Rivals Claude Opus
[5–15] Technical analyses: Digital Applied, VentureBeat, ai505, SoftTechHub, gaga.art, Bind AI, HuggingFace, GitHub, Modal, Z.AI Docs, Zhipu Docs

D–G. Additional Sources
[16–19] Chinese AI Industry: 53AI, Zhidongxi, NetEase, AI Tools
[20] CAS/PKU distillation quantification study (Zhihu)
[21] GeekPark — Four model trainers on 2026 AI
[22–24] Claude Code + GLM Integration: AI Engineer Guide, Alibaba Cloud, GitHub

LEECHO Global AI Research Lab
2026. 02. 14
CONFIDENTIAL RESEARCH REPORT
