Current LLM research focuses predominantly on the model side — larger parameter counts, better architectures, more refined alignment techniques. This work implicitly assumes that the corpus is a sufficiently good representation of the world, and the bottleneck lies in how the model learns from it. This paper challenges that assumption, proposing that an independent, multi-dimensional quality bottleneck exists on the corpus side — the “Corpus Valve.” This valve comprises three parallel monolingual quality dimensions — semantic precision, register purity, and data hygiene — plus one cross-cutting cross-linguistic interface dimension: translation alignment. The paper further proposes “weak corpus determinism” with precise boundaries: corpus quality is not an absolute constraint on LLM capability but determines the shape of the returns curve from model-side optimization — when quality falls below a threshold, model-side optimization enters a “sub-scaling” zone of accelerating diminishing returns; when quality is sufficiently high, normal power-law scaling resumes. Recent empirical scaling-law research directly supports this boundary, including the University of Chicago’s dimensionless data quality parameter Q extending the Chinchilla framework. Deficiencies in all four layers are self-reinforcing through the LLM generation-retraining cycle, constituting the complete mechanism of the Sapir-Whorf effect in the AI era.
The Overlooked Premise: Is the Corpus Actually Good Enough?
Between 2022 and 2025, the most-watched directions in the LLM field were, in order: reasoning enhancement, hallucination suppression, alignment techniques, safety assurance, and multimodal expansion. All share an implicit assumption: the training corpus is a given, roughly adequate input, and the bottleneck lies in how the model learns from it.
But scaling laws themselves are hinting at cracks in this assumption. Epoch AI estimates that the effective stock of quality-adjusted, deduplicated human-generated public text is approximately 300 trillion tokens, projected to be exhausted between 2026 and 2032. In 2024, frontier model performance gains were primarily driven by post-training and test-time compute, with limited progress on pretraining — the field began speculating that pretraining scaling laws are hitting a ceiling. Anthropic CEO Dario Amodei estimated the probability of AI progress stalling due to data insufficiency at roughly 10%.
However, the “data insufficiency” narrative masks a more fundamental problem: the quality of existing data itself contains systematic deficiencies across multiple dimensions. Data volume depletion is a quantity problem, but the Corpus Valve points to a quality problem — even with sufficient volume, if semantic precision is inadequate, registers are confused, data is contaminated, or translations are unfaithful, the world representation the model learns still has a ceiling.
Three real-world events reveal these quality dimensions:
Phenomenon One. Facing the CPU-to-GPU architectural transformation of data centers, China and the US produced drastically different naming responses — China coined “智算中心” (intelligent computing center), while English stuck with “data center” plus modifiers. The same technological reality has different semantic resolution across language corpora.
Phenomenon Two. When Google Gemini explained the clipping mechanism of PPO in Chinese, it output vulgar forum slang — technically correct but with severely mismatched register. Chinese internet corpus quality problems directly leaked into model output.
Phenomenon Three. In the Korean version of a paper, an LLM translated “迭代” (iteration) as “반복” (repetition). “迭代” carries directionality, progressiveness, and convergence; “반복” merely means directionless repetition. The translation appears completely “correct” on the surface, but the core technical semantics are silently erased.
Structure of the Corpus Valve: Three Parallel Dimensions and One Cross-Cutting Layer
The “Corpus Valve” is not a single variable but a 3+1 dimensional quality structure: three parallel dimensions act on corpus quality within a single language, and one cross-cutting layer acts on the interface quality between languages.
| Dimension | Nature of Problem | Empirical Case | Affected Capability | Existing Detection |
|---|---|---|---|---|
| A Semantic Precision | Terminology fails to track physical change | “data center” covers both CPU & GPU facilities | World model resolution | Nearly nonexistent |
| B Register Purity | Technical content mixed with colloquial register | Gemini uses forum slang to explain PPO | Output appropriateness baseline | Partial (register classifiers) |
| C Data Hygiene | Contains harmful/anomalous content | Pornographic token frequency anomalously high | Safety and trustworthiness | Relatively mature (toxicity detection) |
| × Translation Alignment | Cross-linguistic mapping loses semantic features | 반복 (repetition) ≠ 迭代 (iteration) | Multilingual cognitive consistency | Nearly nonexistent |
The distribution of the “Existing Detection” column is highly asymmetric: the layers with the most structural impact — semantic precision and translation alignment — are precisely those with the least tooling support.
Dimension A: Semantic Precision — When Old Words Obscure New Realities
Since 2012, the physical essence of the data center has fundamentally changed. A single CPU server draws 300–600 W; a GPU server draws 3,000–10,000 W. Rack density surged from 5–15 kW to 40–250 kW and beyond. Cooling shifted from air to liquid; network traffic shifted from north-south to GPU-to-GPU east-west. This is an architectural paradigm break, not an incremental upgrade.
China’s MIIT classified computing infrastructure into General Computing Centers (CPU), Intelligent Computing Centers / 智算中心 (GPU/AI accelerator), and Supercomputing Centers (HPC), each term precisely locked to a hardware architecture. English added modifiers to “data center”: AI data center, GPU data center. NVIDIA CEO Jensen Huang pushed the “AI Factory” concept from 2024, trying to shift the metaphor from “storage” to “production” — but the Chinese “算力中心” never needed this step, since “算” (compute) is inherently a verb and naturally production-oriented.
The terminology precision gap is rooted in multiple structural causes: English morphology limits the flexibility of compound coining; “data center” carries trillion-dollar sunk costs and global-scale path dependence lock-in; and technology receivers naturally possess a terminology reconstruction window when adopting new technologies — if they seize this window, they can actually achieve higher precision than the originating language. This explains a counter-intuitive pattern: the technology-originating language may be inferior in terminology precision to the technology-receiving language.
Chinese “数据中心” and “智算中心” most likely form topologically separated independent concept clusters in LLM vector space; English “data center” and “AI data center” share a core root word and overlap heavily. The model on the English side struggles to learn clear concept boundaries — this is how semantic precision differences are directly imprinted on the model’s internal representations.
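The cluster-separation claim above can be sketched with cosine similarity. The four-dimensional “embeddings” below are illustrative toy values, not vectors from any real model; the point is only the geometric pattern: terms sharing a root word sit nearly parallel, while distinct-root terms diverge.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 4-dimensional "embeddings" (illustrative values, not from a real model):
# the English pair shares the root "data center", so its vectors are nearly
# parallel; the Chinese pair uses distinct roots, so its vectors diverge more.
emb = {
    "data center":    [0.9, 0.1, 0.3, 0.2],
    "AI data center": [0.8, 0.2, 0.4, 0.3],
    "数据中心":        [0.9, 0.1, 0.2, 0.1],
    "智算中心":        [0.2, 0.9, 0.1, 0.6],
}

en_overlap = cosine(emb["data center"], emb["AI data center"])
zh_overlap = cosine(emb["数据中心"], emb["智算中心"])
print(f"EN pair similarity: {en_overlap:.3f}")
print(f"ZH pair similarity: {zh_overlap:.3f}")
```

With real model embeddings the absolute numbers would differ, but the hypothesis predicts the same ordering: the English pair overlaps far more than the Chinese pair.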
Dimension B: Register Purity — When Forum Slang Enters the Textbook
Gemini explained PPO’s clipping mechanism using vulgar Chinese internet slang — technically correct but register-wise belonging to extreme social media colloquialism, not technical documentation. This means PPO-related text in training corpora heavily originated from UGC platforms rather than textbooks or papers.
Register purity issues differ from “content errors” — content can be correct but the expression mode is contextually inappropriate. They also differ from “data hygiene” — no harmful information present, but the register doesn’t fit. Current corpus cleaning pipelines focus on toxicity and factual accuracy, almost never on register appropriateness. A forum-style PPO explanation would sail through all existing filters.
This problem is especially severe in Chinese corpora. Research indicates that despite the massive total volume of Chinese internet data, high-quality pretraining datasets are relatively scarce, and large corpora like Wudao suffer from severe quality inconsistency. The English side can draw high-register training data from a vast body of peer-reviewed papers and professional publications. This produces a mirror-image pattern: on Dimension A (semantic precision), Chinese outperforms English; on Dimension B (register purity), English outperforms Chinese.
Dimension C: Data Hygiene — When Pornography Outranks Greetings
A 2025 EMNLP study inferred Chinese training data contamination by analyzing BPE vocabularies of LLMs, finding that 9 out of 23 LLM vocabularies contained substantial PoC (Proof of Contamination) tokens related to pornography, online gambling, and anomalous content.
GPT-4/4-turbo/3.5 vocabularies contained zero PoC tokens, potentially indicating cleaner training corpora. The study also found that data contamination requires sufficient linguistic representation volume to take effect — low-resource languages are barely affected — while high-resource languages like Chinese and English are precisely the most impacted.
This dimension is the most thoroughly researched and has the most mature toolchain among the three parallel dimensions. But the maturity of the toolchain also creates a cognitive bias: the research community tends to equate “corpus quality” with “data hygiene,” overlooking the equally important but harder-to-detect dimensions of semantic precision and register purity.
Cross-Cutting Layer: Translation Alignment — The Stealthiest Semantic Killer
“반복” is the Korean Sino-Korean word for “反復” (repetition), with core semantics of directionless repetition. “迭代” (iteration) carries core semantics of progressive approximation toward a goal based on prior results, inherently encoding directionality, progressiveness, and convergence. When an LLM translates “迭代” in a paper title as “반복,” three core semantic features are silently erased — and the model learned this alignment from massive Chinese-Korean parallel corpora, because in everyday contexts the two words do commonly inter-translate. The fine-grained differences in technical contexts are drowned out by the statistical frequency of everyday contexts.
Translation alignment failure is fundamentally different from the three parallel dimensions: it appears completely “correct” on the surface. No grammar errors, no register confusion, no harmful content. No existing cleaning pipeline — toxicity detection, deduplication, fact-checking, grammar checking — would flag “반복 = 迭代” as a problem. It passes all filters unimpeded, is learned by the LLM as “correct alignment,” and then continuously replicates semantic downgrade in multilingual output.
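A check for this failure mode can be sketched as a feature-based comparison: record the semantic features of a source term, compare them with the features of its conventional “equivalent,” and flag anything lost in the mapping. The feature inventory below is hand-written for illustration; no such check exists in current pipelines.

```python
# Sketch of a feature-loss check that no current cleaning pipeline runs.
# The feature sets are hand-written illustrations drawn from the cases
# discussed in the text, not output of any existing tool.

FEATURES = {
    "迭代": {"repetition", "directionality", "progressiveness", "convergence"},
    "반복": {"repetition"},
    "智算中心": {"computing", "gpu_dominant", "production_metaphor"},
    "AI data center": {"computing", "gpu_dominant"},
}

def lost_features(source: str, target: str) -> set:
    """Features present in the source term but absent from its 'equivalent'."""
    return FEATURES[source] - FEATURES[target]

for src, tgt in [("迭代", "반복"), ("智算中心", "AI data center")]:
    loss = lost_features(src, tgt)
    status = "LOSSY" if loss else "ok"
    print(f"{src} -> {tgt}: {status} {sorted(loss)}")
```

The design point is that the check operates on semantic features, not surface form — which is exactly why surface-level filters (grammar, toxicity, deduplication) can never catch it.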
Translation alignment is a “cross-cutting layer” rather than a fourth parallel dimension because it does not directly act on corpus quality within a single language; instead, it acts on the interface between languages. Quality differences in each parallel dimension are remapped as they pass through the translation alignment layer: Dimension A’s precise terminology may be downgraded in translation (智算中心 → AI data center); Dimension B’s register confusion may propagate cross-linguistically; Dimension C’s data contamination may seep into the target language. What the translation alignment layer determines is not corpus quality itself, but the fidelity of quality during cross-linguistic transmission.
| Source Concept | Core Semantic Features | “Equivalent” Mapping | Semantic Loss or Conflation |
|---|---|---|---|
| 迭代 (Chinese) | Directionality, progressiveness, convergence | 반복 (Korean) | All three features |
| 智算中心 (Chinese) | Intelligent computing, GPU-dominant | AI data center (English) | Verb-nature production metaphor of “算” |
| Inference (English) | Model inference/prediction | 推理 (Chinese) | Ambiguity: logical reasoning vs. model inference |
| Alignment (English) | Value calibration | 对齐 (Chinese) | Ambiguity: typographic alignment vs. value calibration |
Interaction Effects: How Translation Alignment Dissolves and Amplifies Quality Gaps
The interaction between the cross-cutting layer and the three parallel dimensions produces two primary compound loops, and the real-world impact of these loops is amplified by the composition of AI research talent.
Carnegie Endowment analysis shows that among top AI papers, contributions from China-origin researchers rival or exceed those of US-native authors. 50% of accepted AAAI 2020 papers included contributions from China-origin researchers. In 2024, Chinese scholars’ AI paper count (23,695) exceeded the combined total of the US, UK, and EU.
This means the single largest contributing group in global AI knowledge production are native Chinese speakers. When they write their precisely differentiated Chinese concepts into English papers, the translation alignment layer systematically dissolves Dimension A advantages — “智算中心” degrades to “AI data center,” precise concept boundaries are blurred in translation. These English papers immediately become LLM training corpora.
Dissolution loop (Dimension A × cross-cutting layer): “智算中心” precisely separated → translated as “AI data center” → English corpus inherits the blurriness.

Amplification loop (Dimensions B/C × cross-cutting layer): register confusion × data contamination → low-quality Chinese corpus seeps into multilingual models → impact extends beyond Chinese boundaries.
This asymmetric pattern directly relates to the “weak form” boundary conditions discussed in the next section: the cross-cutting layer’s dissolution effect may push a language’s quality on a specific dimension below the critical threshold, thereby triggering sub-scaling — the zone where the returns from model-side optimization diminish at an accelerating rate. In other words, the cross-cutting layer doesn’t just transmit quality differences; it may be the mechanism that triggers quality threshold collapse.
Weak Corpus Determinism: Precise Boundary Conditions
“Corpus determinism” is not “corpus determines everything” (strong form) but a weak-form claim with precise boundaries. Research published at ACL 2025 directly provides the empirical foundation for defining this boundary.
Key findings: the 2025 ACL study proposed a “density” metric to measure dataset redundancy and diversity. High-density (high-redundancy, low-diversity) datasets produce sub-scaling — the scaling curve bends more severely, and large model fitting accuracy drops significantly. LLaMA 2 outperformed LLaMA 3 on scaling efficiency despite the latter using more advanced strategies, because LLaMA 3’s dataset had higher density.
Information-theoretic analysis further shows that LLMs face a fundamental sample complexity bottleneck on long-tail knowledge: for factual knowledge lacking compressible structure (e.g., birthdays, precise numbers), each fact must be independently memorized, and the required sample size scales linearly with the total number of facts — at the million-fact scale, this exceeds the capacity of any feasible corpus.
The constraint that corpus quality imposes on LLM capability is not an absolute ceiling (strong form) but determines the shape of the returns curve from model-side optimization (weak form). Specifically: when corpus quality on a given dimension falls below a threshold, model-side optimization follows an accelerating diminishing returns law — doubling parameters may yield only single-digit percentage improvements (sub-scaling). But when corpus quality exceeds the threshold, model-side optimization can effectively unlock capability, and the returns curve resumes its normal power-law shape. The Corpus Valve is not a wall but a knob that adjusts the slope of the returns curve.
| Condition | Corpus Quality < Threshold | Corpus Quality ≥ Threshold |
|---|---|---|
| Model-side optimization returns | Accelerating diminishing (sub-scaling) | Normal power-law scaling |
| Marginal effect of doubling parameters | 1–2% improvement | Predictable, significant improvement |
| Bottleneck location | Corpus side (upstream) | Model side (downstream) |
| Optimization strategy | Fix corpus quality first | Continue scaling models |
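The two regimes in the table above can be sketched numerically. The functional forms below are assumptions for illustration, not the fitted models of the scaling-law papers cited earlier: the normal regime is a clean power law, and the sub-scaling regime is modeled by an effective exponent that shrinks as the model grows.

```python
import math

# Illustrative functional forms (assumptions for this sketch, not fitted
# models from the scaling-law literature cited in the text).
A, ALPHA, BEND = 400.0, 0.35, 0.04

def reducible_powerlaw(n: float) -> float:
    """Quality above threshold: reducible loss follows a clean power law,
    so every doubling of N removes the same fraction of it."""
    return A / n**ALPHA

def reducible_subscaling(n: float) -> float:
    """Quality below threshold: the effective exponent shrinks as N grows,
    so each doubling removes a smaller and smaller fraction."""
    alpha_eff = ALPHA / (1.0 + BEND * math.log(n))
    return A / n**alpha_eff

def doubling_ratio(fn, n: float) -> float:
    """Fraction of reducible loss remaining after doubling parameters
    (closer to 1.0 = less progress per doubling)."""
    return fn(2 * n) / fn(n)

for n in (1e8, 1e9, 1e10):
    r_pl = doubling_ratio(reducible_powerlaw, n)
    r_ss = doubling_ratio(reducible_subscaling, n)
    print(f"N={n:.0e}: power-law keeps {r_pl:.3f}, sub-scaling keeps {r_ss:.3f}")
```

In the power-law regime the doubling ratio is a constant (2^-α ≈ 0.785 here); in the sub-scaling regime it drifts toward 1.0 as N grows — exactly the accelerating diminishing returns the weak form predicts.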
The Self-Reinforcing Trap and Sapir-Whorf Reconstructed for AI
The critical asymmetry between the Corpus Valve and the Model Valve lies in the direction of self-reinforcement. Model-side improvement is a positive cycle: better model → better output → better feedback → further improvement. But the corpus side harbors a negative trap: low-quality corpus → LLM learns defects → LLM output replicates defects → generated text becomes new corpus → next-generation LLM inherits and amplifies defects.
This operates simultaneously across all three parallel dimensions and the cross-cutting layer: a model that learns blurry terminology boundaries (Dimension A) will continue using blurry terminology in output; one that learns forum-style technical expression (Dimension B) will replicate this register in responses; one that learns the alignment “반복 = 迭代” (cross-cutting layer) will persistently replicate this semantic downgrade in translations.
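The generation-retraining trap can be sketched as a toy simulation: each generation's corpus mixes a shrinking slice of human text with a growing slice of model output whose quality is capped by what the previous model learned. All parameters are illustrative assumptions, not measurements.

```python
# Toy simulation of the generation-retraining trap. Each generation's
# corpus mixes surviving human text with model output whose quality is
# capped by the previous corpus. All parameters are illustrative.

HUMAN_QUALITY = 0.9        # fixed quality of the human-written fraction
SYNTH_SHARE_GROWTH = 0.15  # corpus fraction replaced by model output per generation
FIDELITY = 0.95            # model output reproduces learned quality imperfectly

def next_generation(corpus_quality: float, synth_share: float) -> float:
    """Quality of the next training corpus: a shrinking human slice plus
    a growing synthetic slice that slightly degrades each round."""
    synthetic_quality = FIDELITY * corpus_quality
    return (1 - synth_share) * HUMAN_QUALITY + synth_share * synthetic_quality

quality, share = HUMAN_QUALITY, 0.0
history = [quality]
for gen in range(1, 9):
    share = min(1.0, share + SYNTH_SHARE_GROWTH)
    quality = next_generation(quality, share)
    history.append(quality)
    print(f"gen {gen}: synthetic share {share:.0%}, corpus quality {quality:.3f}")
```

Under these assumptions quality decays monotonically: as the synthetic share grows, the human-text anchor weakens and each generation inherits a slightly degraded copy of the last — the negative trap described above, in miniature.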
This constitutes the Sapir-Whorf hypothesis reconstructed for the AI era. The original weak form is “one language structure → one cognitive tendency” — it is one-dimensional and one-directional. This paper’s reconstruction extends it in three directions:
From one dimension to multiple dimensions: The original hypothesis involves only “language structure” as a single dimension. The reconstructed version distinguishes three parallel dimensions and one cross-cutting layer of corpus quality, each independently constraining LLM performance on different capability dimensions — Dimension A constrains world model resolution, Dimension B constrains output appropriateness, Dimension C constrains safety baseline, and the cross-cutting layer constrains multilingual consistency.
From one-directional to circular: The original hypothesis is “language → cognition” as a one-way influence. The reconstruction adds a reverse channel: “LLM output → new corpus → next-generation LLM,” transforming the constraint from a one-time influence into a self-reinforcing positive feedback loop.
From human to human-machine system: The original hypothesis acts on “people who speak a certain language.” The reconstructed version acts on “LLMs trained on a certain language’s corpus + people who use that LLM” — language’s constraint on cognition expands from individual humans to human-machine collaborative systems, amplifying both scope and speed of impact.
Conclusion: Corpus Is the Infrastructure of LLM Progress
First, the constraint that corpus imposes on LLM cognitive capability is multi-dimensional: three parallel monolingual dimensions (semantic precision, register purity, data hygiene) and one cross-cutting cross-linguistic interface dimension (translation alignment). They operate independently, and the cross-cutting layer systematically dissolves each language’s unique advantages while amplifying its unique weaknesses.
Second, the Corpus Valve’s constraint on model-side optimization is not absolute but a weak constraint that modulates the shape of the returns curve. When corpus quality falls below a threshold, scaling enters the sub-scaling zone; when quality is sufficiently high, model-side optimization resumes normal returns. This is the precise meaning of “weak corpus determinism.”
Third, corpus defects are self-reinforced through the LLM generation-retraining cycle. This negative trap operates simultaneously across all three parallel dimensions and the cross-cutting layer, constituting the complete mechanism of the AI-era Sapir-Whorf effect.
Fourth, different languages have structural strengths and weaknesses across the 3+1 dimensions. Chinese has advantages on Dimension A but weaknesses on Dimensions B/C; English has advantages on Dimension B but weaknesses on Dimension A; the cross-cutting layer asymmetrically dissolves advantages and amplifies weaknesses. No single language dominates across all dimensions simultaneously.
Chips are the compute infrastructure of LLMs, model architecture is the computational infrastructure of LLMs, and corpus is the cognitive infrastructure of LLMs. The current scaling predicament — diminishing pretraining returns, exhaustion of high-quality data — is fundamentally not a depletion of data volume but a structural deficit in multi-dimensional corpus quality. The “Corpus Valve” is the key concept for understanding this predicament: it is not a wall but a knob that adjusts the slope of model-side optimization returns. Identifying and repairing each dimension of this knob — semantic precision, register purity, data hygiene, translation alignment — is what can reopen the space for scaling.
The policy implication of this theoretical framework: high-quality, multi-dimensionally qualified training corpora are a severely underestimated strategic asset in AI competition. Timely generation of new vocabulary, maintenance of technical register purity, hygiene assurance of training data, and semantic fidelity of cross-linguistic mappings — these are not data engineering trivia but infrastructure construction for LLM progress. Opening the Corpus Valve requires investing research resources and strategic attention at the corpus side on par with the model side.
References
- Tianwei Z. et al. (2025). “Speculating LLMs’ Chinese Training Data Pollution from Their Tokens.” EMNLP 2025.
- Du, Y. et al. (2025). “OpenCSG Chinese Corpus: High-quality Chinese Datasets for LLM Training.” arXiv:2501.08197.
- Du, C. et al. (2024). “Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model.” arXiv:2404.04167.
- Chen, Z., Wang, S., Xiao, T., Wang, Y., Chen, S., Cai, X., He, J. & Wang, J. (2025). “Revisiting Scaling Laws for Language Models: The Role of Data Quality and Training Strategies.” ACL 2025, pp. 23881–23899.
- Villalobos, P. et al. (2024). “Will we run out of data? Limits of LLM scaling based on human-generated data.” arXiv:2211.04325.
- Xiao, C. et al. (2025). “Densing law of LLMs.” Nature Machine Intelligence.
- “On the Fundamental Limits of LLMs at Scale.” arXiv:2511.12869, 2025.
- Subramanyam, A., Chen, Y. & Grossman, R. L. (2025). “Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining.” arXiv:2510.03313.
- He, Y. et al. (2025). “Scaling Laws for Multilingual Language Models.” Findings of ACL 2025, pp. 4257–4273.
- Deng, C. et al. (2024). “Investigating Data Contamination in Modern Benchmarks for LLMs.” NAACL 2024.
- Kocyigit et al. (2025). “A Survey on Data Contamination for Large Language Models.” arXiv:2502.14425.
- UNIDO & Dongbi Data (2025). “Global AI Research Landscape Report (2015–2024).”
- Carnegie Endowment (2025). “Have Top Chinese AI Researchers Stayed in the United States?”
- Stanford HAI (2025). “The 2025 AI Index Report.”
- Digital Science (2025). “DeepSeek and the New Geopolitics of AI.” Published in Science, July 2025.
- MIIT et al. (2024). “Notice on Promoting the Coordinated Development of New Information Infrastructure.”
- NVIDIA (2024). “AI Factories Are Redefining Data Centers.” GTC 2024 Keynote.
- Sapir, E. (1929). “The Status of Linguistics as a Science.” Language, 5(4).
- Whorf, B. L. (1956). Language, Thought, and Reality. MIT Press.
- David, P. A. (1985). “Clio and the economics of QWERTY.” American Economic Review, 75(2).