Thought Paper

Corpus: The Other Valve of LLM
How multi-dimensional corpus quality constrains the ceiling of LLM world representations

Mainstream LLM research focuses on model architecture and training methodology. But a more upstream bottleneck is systematically overlooked: the multi-dimensional quality of the corpus itself — semantic precision, register purity, data hygiene — and the cross-cutting translation alignment problem that traverses all dimensions, collectively determining the quality ceiling of model world representations.

LEECHO Global AI Research Lab & Opus 4.6
April 9, 2026 · V3

Abstract

Current LLM research focuses predominantly on the model side — larger parameters, better architectures, more refined alignment techniques. This work implicitly assumes that the corpus is a sufficiently good representation of the world, and the bottleneck lies in how the model learns from it. This paper challenges that assumption, proposing that an independent, multi-dimensional quality bottleneck exists on the corpus side — the “Corpus Valve.” This valve comprises three parallel monolingual quality dimensions — semantic precision, register purity, and data hygiene — plus one cross-cutting cross-linguistic interface dimension: translation alignment. The paper further proposes “weak corpus determinism” with precise boundaries: corpus quality is not an absolute constraint on LLM capability but determines the shape of the returns curve from model-side optimization — when quality falls below a threshold, model-side optimization enters a “sub-scaling” zone of accelerating diminishing returns; when quality is sufficiently high, normal power-law scaling resumes. Recent scaling law empirical research directly supports this boundary, including the University of Chicago’s dimensionless data quality parameter Q extending the Chinchilla framework. All four layers of deficiency are self-reinforced through the LLM generation-retraining cycle, constituting the complete mechanism of the Sapir-Whorf effect in the AI era.

Keywords: Corpus Valve · 3+1D Quality Model · Semantic Precision · Register Purity · Data Hygiene · Translation Alignment (Cross-Cut) · Sub-Scaling · Weak Corpus Determinism · 반복 ≠ 迭代
Section 01

The Overlooked Premise: Is the Corpus Actually Good Enough?

When the community assumes the input is fine and the problem is downstream

Between 2022 and 2025, the most-watched directions in the LLM field were, in order: reasoning enhancement, hallucination suppression, alignment techniques, safety assurance, and multimodal expansion. All share an implicit assumption: the training corpus is a given, roughly adequate input, and the bottleneck lies in how the model learns from it.

But scaling laws themselves are hinting at cracks in this assumption. Epoch AI estimates that the effective stock of quality-adjusted, deduplicated human-generated public text is approximately 300 trillion tokens, projected to be exhausted between 2026 and 2032. In 2024, frontier model performance gains were primarily driven by post-training and test-time compute, with limited progress on pretraining — the field began speculating that pretraining scaling laws are hitting a ceiling. Anthropic CEO Dario Amodei estimated the probability of AI progress stalling due to data insufficiency at roughly 10%.

However, the “data insufficiency” narrative masks a more fundamental problem: the quality of existing data itself contains systematic deficiencies across multiple dimensions. Data volume depletion is a quantity problem, but the Corpus Valve points to a quality problem — even with sufficient volume, if semantic precision is inadequate, registers are confused, data is contaminated, or translations are unfaithful, the world representation the model learns still has a ceiling.

Three real-world events reveal these quality dimensions:

Phenomenon One. Facing the CPU-to-GPU architectural transformation of data centers, China and the US produced drastically different naming responses — China coined “智算中心” (intelligent computing center), while English stuck with “data center” plus modifiers. The same technological reality has different semantic resolution across language corpora.

Phenomenon Two. When Google Gemini explained the clipping mechanism of PPO in Chinese, it output vulgar forum slang — technically correct but with severely mismatched register. Chinese internet corpus quality problems directly leaked into model output.

Phenomenon Three. In the Korean version of a paper, an LLM translated “迭代” (iteration) as “반복” (repetition). “迭代” carries directionality, progressiveness, and convergence; “반복” merely means directionless repetition. The translation appears completely “correct” on the surface, but the core technical semantics are silently erased.

Section 02

Structure of the Corpus Valve: Three Parallel Dimensions and One Cross-Cutting Layer

A 3+1 dimensional quality architecture governing LLM world representations

The “Corpus Valve” is not a single variable but a 3+1 dimensional quality structure: three parallel dimensions act on corpus quality within a single language, and one cross-cutting layer acts on the interface quality between languages.

Figure 1 · 3+1D Corpus Valve Topology

Dimension A — Semantic Precision: whether terminology precisely tracks physical-world changes.
Dimension B — Register Purity: whether technical content is contaminated by non-technical registers.
Dimension C — Data Hygiene: whether the corpus contains harmful or anomalous content.
Cross-Cut — Translation Alignment: cuts across all dimensions, acting on the interface between languages. Quality differences in each dimension are remapped as they pass through this layer — precise concepts may be downgraded, blurry concepts may be solidified, contaminated content may propagate cross-linguistically.

The three parallel dimensions (A/B/C) each independently constrain corpus quality within a single language, side by side rather than dependent on one another. The translation alignment layer cuts across beneath them, serving as the cross-linguistic interface that determines whether quality is preserved, downgraded, or distorted during transmission.
Dimension | Nature of Problem | Empirical Case | Affected Capability | Existing Detection
A · Semantic Precision | Terminology fails to track physical change | “data center” covers both CPU & GPU facilities | World model resolution | Nearly nonexistent
B · Register Purity | Technical content mixed with colloquial register | Gemini uses forum slang to explain PPO | Output appropriateness baseline | Partial (register classifiers)
C · Data Hygiene | Contains harmful/anomalous content | Pornographic token frequency anomalously high | Safety and trustworthiness | Relatively mature (toxicity detection)
× · Translation Alignment | Cross-linguistic mapping loses semantic features | 반복 (repetition) ≠ 迭代 (iteration) | Multilingual cognitive consistency | Nearly nonexistent

The distribution of the “Existing Detection” column is highly asymmetric: the layers with the most structural impact — semantic precision and translation alignment — are precisely those with the least tooling support.

Section 03

Dimension A: Semantic Precision — When Old Words Obscure New Realities

How terminological inertia constrains the resolution of LLM world models

Since 2012, the physical essence of data centers has fundamentally changed. A single CPU server draws 300–600W; a GPU server draws 3,000–10,000W. Rack density surged from 5–15kW to 40–250kW+. Cooling shifted from air to liquid; networks from north-south traffic to GPU-to-GPU east-west traffic. This is an architectural paradigm break, not an incremental upgrade.

China’s MIIT classified computing infrastructure into General Computing Centers (CPU), Intelligent Computing Centers / 智算中心 (GPU/AI accelerator), and Supercomputing Centers (HPC), each term precisely locked to a hardware architecture. English added modifiers to “data center”: AI data center, GPU data center. NVIDIA CEO Jensen Huang pushed the “AI Factory” concept from 2024, trying to shift the metaphor from “storage” to “production” — but the Chinese “算力中心” never needed this step, since “算” (compute) is inherently a verb and naturally production-oriented.

The terminology precision gap is rooted in multiple structural causes: English morphology limits the flexibility of compound coining; “data center” carries trillion-dollar sunk costs and global-scale path dependence lock-in; and technology receivers naturally possess a terminology reconstruction window when adopting new technologies — if they seize this window, they can actually achieve higher precision than the originating language. This explains a counter-intuitive pattern: the technology-originating language may be inferior in terminology precision to the technology-receiving language.

Vector Space Effect

Chinese “数据中心” and “智算中心” most likely form topologically separated independent concept clusters in LLM vector space; English “data center” and “AI data center” share a core root word and overlap heavily. The model on the English side struggles to learn clear concept boundaries — this is how semantic precision differences are directly imprinted on the model’s internal representations.
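The separation claim can be made concrete with toy numbers. The sketch below uses invented 4-dimensional vectors in place of real model embeddings — only the qualitative contrast matters: a low-similarity Chinese pair versus a high-similarity English pair.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 4-d "embeddings" (illustrative only, not real model vectors):
# the Chinese pair is constructed to point in clearly different directions,
# the English pair to share most of their direction (common root "data center").
zh_data_center = [0.9, 0.1, 0.0, 0.1]   # 数据中心
zh_aicc        = [0.1, 0.9, 0.3, 0.0]   # 智算中心
en_data_center = [0.9, 0.1, 0.0, 0.1]   # "data center"
en_ai_dc       = [0.8, 0.3, 0.1, 0.1]   # "AI data center"

sep_zh = cosine(zh_data_center, zh_aicc)
sep_en = cosine(en_data_center, en_ai_dc)
print(f"zh pair similarity: {sep_zh:.2f}")  # low similarity → separated clusters
print(f"en pair similarity: {sep_en:.2f}")  # high similarity → overlapping clusters
```

With real embeddings the comparison would use a model’s actual vector space, but the cluster-separation reading is the same.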

Section 04

Dimension B: Register Purity — When Forum Slang Enters the Textbook

The hidden cost of UGC-dominated training corpora

Gemini explained PPO’s clipping mechanism using vulgar Chinese internet slang — technically correct but register-wise belonging to extreme social media colloquialism, not technical documentation. This means PPO-related text in training corpora heavily originated from UGC platforms rather than textbooks or papers.

Stealthiness

Register purity issues differ from “content errors” — content can be correct but the expression mode is contextually inappropriate. They also differ from “data hygiene” — no harmful information present, but the register doesn’t fit. Current corpus cleaning pipelines focus on toxicity and factual accuracy, almost never on register appropriateness. A forum-style PPO explanation would sail through all existing filters.
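To see why such content slips through, consider what a register filter would even look like. The sketch below is a deliberately naive keyword heuristic — the marker lists and example sentences are invented — standing in for the register classifiers that current cleaning pipelines mostly lack:

```python
# Minimal register-screening heuristic (a sketch, not a production filter).
# The marker sets below are illustrative placeholders; a real pipeline would
# use a trained register classifier rather than keyword counts.
COLLOQUIAL_MARKERS = {"lol", "tbh", "bro", "omg", "gonna"}
TECHNICAL_MARKERS = {"gradient", "objective", "policy", "clipping", "ratio"}

def register_score(text: str) -> float:
    """Score in [-1, 1]: negative = colloquial-leaning, positive = technical-leaning."""
    words = text.lower().split()
    colloquial = sum(w in COLLOQUIAL_MARKERS for w in words)
    technical = sum(w in TECHNICAL_MARKERS for w in words)
    total = colloquial + technical
    return 0.0 if total == 0 else (technical - colloquial) / total

# Both sentences could describe the same (correct) fact; only the register differs.
forum_style = "bro it just like stops the thing going wild lol tbh"
textbook_style = "the clipping ratio bounds the policy update in the surrogate objective"
print(register_score(forum_style))     # negative: colloquial register
print(register_score(textbook_style))  # positive: technical register
```

The point of the sketch is the gap it exposes: a toxicity filter would pass both sentences, since neither is harmful or factually wrong.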

This problem is especially severe in Chinese corpora. Research indicates that despite the massive total volume of Chinese internet data, high-quality pretraining datasets are relatively scarce, and large corpora like Wudao suffer from severe quality inconsistency. The English side can draw high-register training data from a vast body of peer-reviewed papers and professional publications. This produces a symmetric pattern: on Dimension A (semantic precision), Chinese outperforms English; on Dimension B (register purity), English outperforms Chinese.

Section 05

Dimension C: Data Hygiene — When Pornography Outranks Greetings

Empirical evidence of contamination from BPE token analysis

A 2025 EMNLP study inferred Chinese training data contamination by analyzing the BPE vocabularies of LLMs, finding that 9 out of 23 examined vocabularies contained substantial PoC (“Polluted Chinese”) tokens related to pornography, online gambling, and other anomalous content.

- 2.6× — estimated frequency of a pornographic token relative to “hello” in the GPT-4o vocabulary
- 23 — LLM vocabularies examined
- 0 — PoC tokens found in GPT-4 / GPT-4-turbo / GPT-3.5

GPT-4/4-turbo/3.5 vocabularies contained zero PoC tokens, potentially indicating cleaner training corpora. The study also found that data contamination requires sufficient linguistic representation volume to take effect — low-resource languages are barely affected — while high-resource languages like Chinese and English are precisely the most impacted.
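The auditing logic is simple in principle: a BPE merge exists in a vocabulary only because its character sequence was frequent in the training data, so suspicious vocabulary entries imply suspicious corpora. A minimal sketch of that scan, with a toy vocabulary and a hypothetical blocklist (the real study uses full BPE vocabularies and a curated token taxonomy):

```python
# Sketch of the vocabulary-auditing idea: scan a tokenizer's vocabulary for
# tokens matching a blocklist of contamination-related strings.
# The toy vocabulary and blocklist below are invented placeholders.
def audit_vocab(vocab: dict[str, int], blocklist: set[str]) -> list[str]:
    """Return vocabulary entries that contain any blocklisted substring."""
    return [tok for tok in vocab if any(bad in tok for bad in blocklist)]

toy_vocab = {"hello": 101, "casino_vip": 102, "gradient": 103, "free_bets": 104}
toy_blocklist = {"casino", "bets"}
flagged = audit_vocab(toy_vocab, toy_blocklist)
print(flagged)  # tokens whose mere presence suggests contaminated training data
```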

This dimension is the most thoroughly researched and has the most mature toolchain among the three parallel dimensions. But the maturity of the toolchain also creates a cognitive bias: the research community tends to equate “corpus quality” with “data hygiene,” overlooking the equally important but harder-to-detect dimensions of semantic precision and register purity.

Section 06

Cross-Cutting Layer: Translation Alignment — The Stealthiest Semantic Killer

When surface-level correctness masks deep semantic erasure

“반복” is the Korean Sino-Korean word for “反復” (repetition), with core semantics of directionless repetition. “迭代” (iteration) carries core semantics of progressive approximation toward a goal based on prior results, inherently encoding directionality, progressiveness, and convergence. When an LLM translates “迭代” in a paper title as “반복,” three core semantic features are silently erased — and the model learned this alignment from massive Chinese-Korean parallel corpora, because in everyday contexts the two words do commonly inter-translate. The fine-grained differences in technical contexts are drowned out by the statistical frequency of everyday contexts.

The Unique Danger of the Cross-Cutting Layer

Translation alignment failure is fundamentally different from the three parallel dimensions: it appears completely “correct” on the surface. No grammar errors, no register confusion, no harmful content. No existing cleaning pipeline — toxicity detection, deduplication, fact-checking, grammar checking — would flag “반복 = 迭代” as a problem. It passes all filters unimpeded, is learned by the LLM as “correct alignment,” and then continuously replicates semantic downgrade in multilingual output.

Translation alignment is a “cross-cutting layer” rather than a fourth parallel dimension because it does not directly act on corpus quality within a single language; instead, it acts on the interface between languages. Quality differences in each parallel dimension are remapped as they pass through the translation alignment layer: Dimension A’s precise terminology may be downgraded in translation (智算中心 → AI data center); Dimension B’s register confusion may propagate cross-linguistically; Dimension C’s data contamination may seep into the target language. What the translation alignment layer determines is not corpus quality itself, but the fidelity of quality during cross-linguistic transmission.

Source Concept | Core Semantic Features | “Equivalent” Mapping | Lost Features
迭代 (Chinese) | Directionality, progressiveness, convergence | 반복 (Korean) | All three features
智算中心 (Chinese) | Intelligent computing, GPU-dominant | AI data center (English) | Verb-nature production metaphor of “算”
Inference (English) | Model inference/prediction | 推理 (Chinese) | Ambiguity introduced: logical reasoning vs. model inference
Alignment (English) | Value calibration | 对齐 (Chinese) | Ambiguity introduced: typographic alignment vs. value calibration
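The feature-loss pattern in the table above can be made mechanical: represent each concept as a set of semantic features and diff the source against its “equivalent.” The feature inventories below are hand-assigned for this sketch, not derived from any lexical resource:

```python
# Hand-assigned semantic feature sets (illustrative only).
FEATURES = {
    "迭代": {"repetition", "directionality", "progressiveness", "convergence"},
    "반복": {"repetition"},
    "智算中心": {"computing", "gpu_dominant", "production_metaphor"},
    "AI data center": {"computing", "gpu_dominant"},
}

def lost_features(source: str, target: str) -> set[str]:
    """Features present in the source concept but absent from its 'equivalent'."""
    return FEATURES[source] - FEATURES[target]

print(lost_features("迭代", "반복"))             # the three directional features
print(lost_features("智算中心", "AI data center"))  # the production metaphor
```

No existing cleaning pipeline runs a check of this shape, which is precisely why the mapping passes every filter.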
Section 07

Interaction Effects: How Translation Alignment Dissolves and Amplifies Quality Gaps

The asymmetric fate of advantages and disadvantages across the cross-cutting layer

The interaction between the cross-cutting layer and the three parallel dimensions produces two primary compound loops, and the real-world impact of these loops is amplified by the composition of AI research talent.

- 57.7% — combined global share of Chinese and US AI researchers (UNIDO 2025)
- 63,000+ — AI researchers in the US
- 53,000 — AI researchers in China
- ~42% — share of Chinese-origin authors at NeurIPS 2019

Carnegie Endowment analysis shows that among top AI papers, contributions from China-origin researchers rival or exceed those of US-native authors. 50% of accepted AAAI 2020 papers included contributions from China-origin researchers. In 2024, Chinese scholars’ AI paper count (23,695) exceeded the combined total of the US, UK, and EU.

This means the single largest contributing group in global AI knowledge production consists of native Chinese speakers. When they write their precisely differentiated Chinese concepts into English papers, the translation alignment layer systematically dissolves Dimension A advantages — “智算中心” degrades to “AI data center,” and precise concept boundaries are blurred in translation. These English papers immediately become LLM training corpora.

Figure 2 · Cross-Cutting Layer × Parallel Dimensions Interaction

Dim A (Chinese terminology precision advantage): “智算中心” precisely separated → cross-cut translation downgrade to “AI data center” → precision advantage dissolved; the English corpus inherits the blurriness.
Dim B+C (Chinese corpus quality weaknesses): register confusion × data contamination → cross-cut propagation across languages; low-quality Chinese corpus seeps into multilingual models → weaknesses amplified; impact extends beyond Chinese boundaries.
The cross-cutting layer’s effect on Chinese corpora is asymmetric: it dissolves Chinese advantages on Dimension A while amplifying Chinese weaknesses on Dimensions B/C. The net result is an asymmetric pattern where “advantages are dissolved, weaknesses are amplified.”

This asymmetric pattern directly relates to the “weak form” boundary conditions discussed in the next section: the cross-cutting layer’s dissolution effect may push a language’s quality on a specific dimension below the critical threshold, thereby triggering sub-scaling — the zone where returns to model-side optimization diminish at an accelerating rate. In other words, the cross-cutting layer doesn’t just transmit quality differences; it may be the mechanism that triggers quality threshold collapse.

Section 08

Weak Corpus Determinism: Precise Boundary Conditions

Not “corpus determines everything” but “corpus determines the shape of the returns curve”

“Corpus determinism” is not “corpus determines everything” (strong form) but a weak-form claim with precise boundaries. Research published at ACL 2025 directly provides the empirical foundation for defining this boundary.

- 89% → 72% — precision drop after introducing noise into fine-tuning data
- Sub-scaling — significantly reduced scaling efficiency on high-redundancy datasets
- <40% — GPT-4 factual accuracy on long-tail entities (vs. >90% on high-frequency entities)
- 300T — effective stock of human-generated public text, in tokens (Epoch AI estimate)

Key findings: the 2025 ACL study proposed a “density” metric to measure dataset redundancy and diversity. High-density (high-redundancy, low-diversity) datasets produce sub-scaling — the scaling curve bends more severely, and large model fitting accuracy drops significantly. LLaMA 2 outperformed LLaMA 3 on scaling efficiency despite the latter using more advanced strategies, because LLaMA 3’s dataset had higher density.
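The ACL paper defines its density metric formally; as a minimal stand-in, redundancy can be approximated by the share of duplicated n-grams in a corpus sample. The sketch below is that crude proxy, not the paper’s metric:

```python
# Crude redundancy proxy in the spirit of the "density" idea: the fraction of
# n-gram occurrences that are repeats of an earlier n-gram. This is NOT the
# ACL 2025 definition, only an illustration of measuring redundancy upstream.
from collections import Counter

def ngram_duplication_rate(tokens: list[str], n: int = 3) -> float:
    """Fraction of n-gram occurrences that repeat an earlier n-gram."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(grams)

diverse = "the model learns a compressed representation of the world".split()
redundant = "click here to win click here to win click here to win".split()
print(ngram_duplication_rate(diverse))    # low: little redundancy
print(ngram_duplication_rate(redundant))  # high: heavy redundancy
```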

Information-theoretic analysis further shows that LLMs face a fundamental sample complexity bottleneck on long-tail knowledge: for factual knowledge lacking compressible structure (e.g., birthdays, precise numbers), each fact must be independently memorized, and the required sample size scales linearly with the total number of facts — at the million-fact scale, this exceeds the capacity of any feasible corpus.

Precise Boundaries of the Weak Form

The constraint that corpus quality imposes on LLM capability is not an absolute ceiling (strong form) but determines the shape of the returns curve from model-side optimization (weak form). Specifically: when corpus quality on a given dimension falls below a threshold, model-side optimization follows an accelerating diminishing returns law — doubling parameters may yield only single-digit percentage improvements (sub-scaling). But when corpus quality exceeds the threshold, model-side optimization can effectively unlock capability, and the returns curve resumes its normal power-law shape. The Corpus Valve is not a wall but a knob that adjusts the slope of the returns curve.

Condition | Corpus Quality < Threshold | Corpus Quality ≥ Threshold
Model-side optimization returns | Accelerating diminishing (sub-scaling) | Normal power-law scaling
Marginal effect of doubling parameters | 1–2% improvement | Predictable, significant improvement
Bottleneck location | Corpus side (upstream) | Model side (downstream)
Optimization strategy | Fix corpus quality first | Continue scaling models
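The knob metaphor can be sketched numerically: treat loss as a power law in parameters whose exponent depends on corpus quality. The functional form, threshold, and constants below are invented purely to illustrate the two regimes, not fitted to any measurement:

```python
# Toy illustration of the "knob" claim: L(N) = A / N**alpha(Q), where the
# exponent alpha depends on corpus quality Q. All constants are invented.
def loss(n_params: float, quality: float, A: float = 100.0) -> float:
    alpha = 0.02 if quality < 0.5 else 0.30  # below threshold: flattened exponent
    return A / n_params ** alpha

def gain_from_doubling(n_params: float, quality: float) -> float:
    """Relative loss reduction from doubling the parameter count."""
    before, after = loss(n_params, quality), loss(2 * n_params, quality)
    return (before - after) / before

print(f"low-quality corpus:  {gain_from_doubling(1e9, 0.3):.1%}")  # single-digit gain
print(f"high-quality corpus: {gain_from_doubling(1e9, 0.8):.1%}")  # power-law-sized gain
```

Note that the gain from doubling depends only on the exponent (1 − 2^(−α)), which is exactly why corpus quality acts as a knob on the slope rather than a wall.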
Section 09

The Self-Reinforcing Trap and Sapir-Whorf Reconstructed for AI

Why corpus defects don’t just persist — they compound

The critical asymmetry between the Corpus Valve and the Model Valve lies in the direction of self-reinforcement. Model-side improvement is a positive cycle: better model → better output → better feedback → further improvement. But the corpus side harbors a negative trap: low-quality corpus → LLM learns defects → LLM output replicates defects → generated text becomes new corpus → next-generation LLM inherits and amplifies defects.

This operates simultaneously across all three parallel dimensions and the cross-cutting layer: a model that learns blurry terminology boundaries (Dimension A) will continue using blurry terminology in output; one that learns forum-style technical expression (Dimension B) will replicate this register in responses; one that learns the alignment “반복 = 迭代” (cross-cutting layer) will persistently replicate this semantic downgrade in translations.
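The loop can be caricatured in a few lines: each generation trains on a mix of human text and the previous generation’s output, with a mild amplification factor on inherited defects. The dynamics below are invented for illustration, not fitted to any measurement:

```python
# Toy generate-retrain loop (invented dynamics): the training mix blends human
# text with the previous model's output, and the model's defect rate tracks
# the defect rate of its mix.
def defect_after_generations(human_defect: float, synth_share: float,
                             amplification: float, generations: int) -> float:
    model_defect = human_defect  # generation 0 trains on human text only
    for _ in range(generations):
        mix_defect = ((1 - synth_share) * human_defect
                      + synth_share * model_defect * amplification)
        model_defect = min(1.0, mix_defect)
    return model_defect

# With 50% synthetic data and mild amplification, defects compound over generations:
for g in (1, 3, 6):
    print(g, round(defect_after_generations(0.10, 0.5, 1.5, g), 3))
```

In this toy setting the defect rate climbs monotonically toward a fixed point above the human baseline; the direction of the drift, not the numbers, is the claim.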

This constitutes the Sapir-Whorf hypothesis reconstructed for the AI era. The original weak form is “one language structure → one cognitive tendency” — it is one-dimensional and one-directional. This paper’s reconstruction extends it in three directions:

3+1 Dimensional Reconstruction

From one dimension to multiple dimensions: The original hypothesis involves only “language structure” as a single dimension. The reconstructed version distinguishes three parallel dimensions and one cross-cutting layer of corpus quality, each independently constraining LLM performance on different capability dimensions — Dimension A constrains world model resolution, Dimension B constrains output appropriateness, Dimension C constrains safety baseline, and the cross-cutting layer constrains multilingual consistency.

From one-directional to circular: The original hypothesis is “language → cognition” as a one-way influence. The reconstruction adds a reverse channel: “LLM output → new corpus → next-generation LLM,” transforming the constraint from a one-time influence into a self-reinforcing positive feedback loop.

From human to human-machine system: The original hypothesis acts on “people who speak a certain language.” The reconstructed version acts on “LLMs trained on a certain language’s corpus + people who use that LLM” — language’s constraint on cognition expands from individual humans to human-machine collaborative systems, amplifying both scope and speed of impact.

Section 10

Conclusion: Corpus Is the Infrastructure of LLM Progress

Four core claims and one policy implication

First, the constraint that corpus imposes on LLM cognitive capability is multi-dimensional: three parallel monolingual dimensions (semantic precision, register purity, data hygiene) and one cross-cutting cross-linguistic interface dimension (translation alignment). They are independently operating, and the cross-cutting layer systematically dissolves each language’s unique advantages while amplifying its unique weaknesses.

Second, the Corpus Valve’s constraint on model-side optimization is not absolute but a weak constraint that modulates the shape of the returns curve. When corpus quality falls below a threshold, scaling enters the sub-scaling zone; when quality is sufficiently high, model-side optimization resumes normal returns. This is the precise meaning of “weak corpus determinism.”

Third, corpus defects are self-reinforced through the LLM generation-retraining cycle. This negative trap operates simultaneously across all three parallel dimensions and the cross-cutting layer, constituting the complete mechanism of the AI-era Sapir-Whorf effect.

Fourth, different languages have structural strengths and weaknesses across the 3+1 dimensions. Chinese has advantages on Dimension A but weaknesses on Dimensions B/C; English has advantages on Dimension B but weaknesses on Dimension A; the cross-cutting layer asymmetrically dissolves advantages and amplifies weaknesses. No single language dominates across all dimensions simultaneously.

Final Proposition

Chips are the compute infrastructure of LLMs, model architecture is the computational infrastructure of LLMs, and corpus is the cognitive infrastructure of LLMs. The current scaling predicament — diminishing pretraining returns, exhaustion of high-quality data — is fundamentally not a depletion of data volume but a structural deficit in multi-dimensional corpus quality. The “Corpus Valve” is the key concept for understanding this predicament: it is not a wall but a knob that adjusts the slope of model-side optimization returns. Identifying and repairing each dimension of this knob — semantic precision, register purity, data hygiene, translation alignment — is what can reopen the space for scaling.

The policy implication of this theoretical framework: high-quality, multi-dimensionally qualified training corpora are a severely underestimated strategic asset in AI competition. Timely generation of new vocabulary, maintenance of technical register purity, hygiene assurance of training data, and semantic fidelity of cross-linguistic mappings — these are not data engineering trivia but infrastructure construction for LLM progress. Opening the Corpus Valve requires investing research resources and strategic attention at the corpus side on par with the model side.

References

  1. Tianwei Z. et al. (2025). “Speculating LLMs’ Chinese Training Data Pollution from Their Tokens.” EMNLP 2025.
  2. Du, Y. et al. (2025). “OpenCSG Chinese Corpus: High-quality Chinese Datasets for LLM Training.” arXiv:2501.08197.
  3. Du, C. et al. (2024). “Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model.” arXiv:2404.04167.
  4. Chen, Z., Wang, S., Xiao, T., Wang, Y., Chen, S., Cai, X., He, J. & Wang, J. (2025). “Revisiting Scaling Laws for Language Models: The Role of Data Quality and Training Strategies.” ACL 2025, pp. 23881–23899.
  5. Villalobos, P. et al. (2024). “Will we run out of data? Limits of LLM scaling based on human-generated data.” arXiv:2211.04325.
  6. Xiao, C. et al. (2025). “Densing law of LLMs.” Nature Machine Intelligence.
  7. “On the Fundamental Limits of LLMs at Scale.” arXiv:2511.12869, 2026.
  8. Subramanyam, A., Chen, Y. & Grossman, R. L. (2025). “Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining.” arXiv:2510.03313.
  9. He, Y. et al. (2025). “Scaling Laws for Multilingual Language Models.” Findings of ACL 2025, pp. 4257–4273.
  10. Deng, C. et al. (2024). “Investigating Data Contamination in Modern Benchmarks for LLMs.” NAACL 2024.
  11. Kocyigit et al. (2025). “A Survey on Data Contamination for Large Language Models.” arXiv:2502.14425.
  12. UNIDO & Dongbi Data (2025). “Global AI Research Landscape Report (2015–2024).”
  13. Carnegie Endowment (2025). “Have Top Chinese AI Researchers Stayed in the United States?”
  14. Stanford HAI (2025). “The 2025 AI Index Report.”
  15. Digital Science (2025). “DeepSeek and the New Geopolitics of AI.” Published in Science, July 2025.
  16. MIIT et al. (2024). “Notice on Promoting the Coordinated Development of New Information Infrastructure.”
  17. NVIDIA (2024). “AI Factories Are Redefining Data Centers.” GTC 2024 Keynote.
  18. Sapir, E. (1929). “The Status of Linguistics as a Science.” Language, 5(4).
  19. Whorf, B. L. (1956). Language, Thought, and Reality. MIT Press.
  20. David, P. A. (1985). “Clio and the economics of QWERTY.” American Economic Review, 75(2).
