ORIGINAL THOUGHT PAPER · MAY 2026 · V4

Cross-Linguistic Top-K Ambiguity
and Reasoning Emergence

Language as Training Pressure: How Linguistic Ambiguity
Shapes LLM Reasoning Emergence Through Top-K Divergence

How Linguistic Ambiguity Shapes Reasoning Emergence in Large Language Models via Top-K Divergence

Date May 14, 2026
Category Original Thought Paper
Fields Computational Linguistics · LLM Reasoning Architecture · Information Theory · Cognitive Science
Version V4
Authors LEECHO Global AI Research Lab & Opus 4.6 & GPT 5.5 & Gemini 3.1 (Cognitive Collective)

ABSTRACT

This paper proposes a testable hypothesis: structural differences across natural languages—in word boundary clarity, morphological marking, polysemy, ellipsis rates, homophony, and semantic density—may cause large language models (LLMs) to produce next-token probability distributions of systematically different shapes under semantically equivalent contexts. We define the effective Top-K (K_eff) as the minimum number of candidates required to cover a given cumulative probability under Top-P sampling, and hypothesize that high-ambiguity languages exhibit, on average, higher distributional entropy and larger K_eff values. Furthermore, we propose that high-ambiguity corpora may constitute a higher conditional-entropy learning environment during training, driving models to develop stronger contextual integration and semantic disambiguation capabilities that may partially transfer to weakly language-dependent tasks such as mathematical, logical, and symbolic reasoning. This paper constructs an eight-dimensional Linguistic Ambiguity Index (LAI) framework, introduces a “Three-Layer Effect of a Single Property” unified theory, discusses DeepSeek V4 as an illustrative case, and designs multiple exploratory experimental directions. We do not claim to have proven a causal relationship between linguistic ambiguity and reasoning emergence; rather, we offer an actionable, falsifiable research framework open to future experimental teams.

I The Problem

When the current LLM research community discusses multilingual model performance, attention focuses on three areas: vocabulary design and tokenization efficiency, the proportions and quality of multilingual training data, and score differences across language-specific benchmarks. A more fundamental structural question has received little attention: whether the shapes of the next-token probability distributions produced at each position during inference differ systematically across languages.

This blind spot is not accidental. Among the world’s top 20% of AI researchers, 47% are of Chinese origin, while 18% are from the United States^[1]. In leading U.S. AI institutions, researchers of Chinese origin account for 38%, slightly exceeding the 37% of U.S. natives^[2]. Among the global top 100 AI experts, 50 are of Chinese origin^[3]. Yet the dominant LLM research paradigm still defaults to English-language corpora, English-language publications, and English-language benchmarks as the reference frame. This research ecosystem may cause the structural impact of non-English languages on model next-token prediction distributions to be systematically underestimated—even when the researchers themselves are native speakers of those very languages.

The core blind spot: The people building the hammer may not realize it strikes differently on their own mother tongue. Nearly half of the world’s top-tier AI researchers are of Chinese origin^[1], yet the Top-K divergence problem in Chinese-language inference has, to our knowledge, received almost no attention.

II Core Hypotheses

2.1 A Critical Distinction: Linguistic Ambiguity vs. Model Prediction Ambiguity

Before developing the hypotheses, three levels of concepts must be strictly distinguished: (a) linguistic ambiguity—the polysemy, ellipsis, and boundary fuzziness inherent in human language structures; (b) model prediction ambiguity—the density of legitimate candidates when the model predicts the next token given context; and (c) reasoning emergence—the model’s generalization performance on non-linguistic tasks. The core assumption of this paper is that a causal transmission chain exists among these three levels:

Linguistic Structural Features (LAI)

↓

Conditional Continuation Diversity

↓

Next-Token Distribution Entropy H_L(t)

↓

Effective Top-K / Decoding Uncertainty

↓

Training Pressure → Disambiguation Capacity → Cross-Task Reasoning Transfer

This paper acknowledges that every jump from the first step to the last has yet to be empirically verified. High linguistic ambiguity does not necessarily mean the model’s prediction distribution is flatter—the model may learn to disambiguate through sufficient training data. However, our hypothesis is that even as the model progressively reduces its perplexity on high-ambiguity languages during training, the stronger disambiguation capacity required by this very process is itself the source of reasoning emergence.

2.2 Mathematical Formalization: Conditional Entropy and Disambiguation Load

Define the conditional entropy of the token distribution for language L given context C:

  H(L|C) = − Σ_{t∈V} P(t|C) log P(t|C)

The core assumption of this paper can be formalized as: for semantically equivalent contexts C_zh and C_en, the average conditional entropy in Chinese is significantly higher than in English:

  E[H(zh|C_zh)] ≫ E[H(en|C_en)]

This leads to the concept of “Disambiguation Load”: when predicting the next token in a high-ambiguity language, the model must maintain more “semantic parallel paths” within its internal attention layers—simultaneously tracking multiple plausible continuation possibilities until sufficient contextual information accumulates to converge on the correct candidate. This cognitive pressure of parallel disambiguation is precisely the mechanism we hypothesize drives “reasoning muscle” growth.

2.3 The Top-K Divergence Hypothesis

During the autoregressive generation process of LLM inference, at each token position the model computes a probability distribution over the entire vocabulary. The Top-P sampling strategy selects the smallest set of tokens whose cumulative probability reaches threshold P. We propose that the shapes of probability distributions produced by different languages within the same model exhibit systematic differences—high-ambiguity languages produce flatter distributions (high entropy), while low-ambiguity languages produce steeper distributions (low entropy)^[4].

  Effective Top-K (Language L, Top-P = 0.9) = min{k : Σ_{i=1}^{k} P(token_i | context_L) ≥ 0.9}, where token_1, token_2, ... are the candidates sorted by descending probability

When the distribution is steep (e.g., English “I went to the ___”), the top 2–3 candidates may cover 90% of the probability mass. When the distribution is flat (e.g., Chinese “我到了___”), we hypothesize that many more candidates, potentially hundreds, are needed to reach the same threshold. Both examples are illustrative and have not been measured.
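The definitions in Sections 2.2 and 2.3 can be made concrete with a minimal Python sketch. The two toy distributions below are invented for illustration and are not model outputs; the function names, the choice of base-2 logarithms, and the small floating-point tolerance are assumptions of this sketch rather than part of the paper's formalism.

```python
import math


def entropy_bits(probs):
    """Shannon entropy of a next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def effective_top_k(probs, top_p=0.9, eps=1e-12):
    """Smallest k whose k most probable candidates cover cumulative mass top_p."""
    cumulative = 0.0
    for k, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        # eps guards against floating-point round-off right at the threshold.
        if cumulative >= top_p - eps:
            return k
    return len(probs)


# Hypothetical toy distributions (values chosen to be exact in binary floating point).
steep = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # mass concentrated on few candidates
flat = [0.125] * 8                           # mass spread evenly over 8 candidates

for name, dist in (("steep", steep), ("flat", flat)):
    print(name, entropy_bits(dist), effective_top_k(dist))
```

With these inputs the flat distribution has the higher entropy (3.0 bits versus 1.875 bits) and the larger effective Top-K (8 versus 4), which is the direction the hypothesis predicts for high-ambiguity contexts.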

2.4 The Ambiguity–Reasoning Emergence Hypothesis

More importantly, this distributional difference affects not only the computational cost during inference but produces a fundamental impact during the training phase. During training, the model optimizes via cross-entropy loss on next-token prediction. For high-ambiguity language corpora, there are more legitimate candidates at each position, and the model cannot effectively reduce perplexity by relying solely on surface-level pattern matching. Early research has shown that language models processing multi-character Chinese tokens can reduce perplexity by 20.94% compared to character-level baselines^[15], hinting at the impact of Chinese tokens’ rich semantic structure on model learning. The model is forced to develop deeper semantic understanding and contextual reasoning capabilities—analogous to athletes training at high altitude being forced to develop greater oxygen-carrying capacity.

Hypothesis: High-ambiguity languages may constitute a higher conditional-entropy training environment. The model is not learning a harder language per se; rather, it may be developing stronger general reasoning capabilities under a higher-pressure environment. If this hypothesis holds, such capabilities would apply equally when processing low-ambiguity language (e.g., English) tasks.

III The Linguistic Ambiguity Index Framework

To quantify the ambiguity level of different languages, we construct a multi-dimensional Linguistic Ambiguity Index (LAI) covering the following eight dimensions. The weights in V4 are heuristically assigned and can be calibrated in future work by regressing against actual model average prediction entropy. It should be noted that “semantic density” strictly measures information compression efficiency rather than ambiguity—we deliberately include it in the LAI to maintain framework completeness while acknowledging the conceptual tension between this dimension and the other seven. This tension itself merits further investigation in future research:

Dimension	Definition	High-Score Example	Low-Score Example	Weight
Word Boundary Ambiguity	Whether word boundaries in text sequences are explicit	Chinese (no spaces)	English (space-delimited)	2.0
Polysemy	Number of meanings per word/character	Chinese “打” has dozens of meanings	German compounds are precise	1.5
Morphological Absence	Whether tense/case/gender/number markers are lacking	Chinese has no inflection whatsoever	Russian: 6 cases + gender + number + aspect	2.0
Word Order Flexibility	Degree of freedom in constituent ordering	Russian: nearly free word order	English: relatively fixed SVO	1.0
Ellipsis Rate	Frequency of subject/argument omission	Japanese: extremely high ellipsis	German: subjects almost never omitted	1.2
Homophony	Density of homophones	Chinese: tones differentiate but tonal marks absent in writing	German: relatively few homophones	1.3
Writing System Complexity	Symbol space size of the writing system	Chinese: 50,000+ characters	English: 26 letters	1.0
Semantic Density	Information carried per symbol unit (bits/char)	Chinese: higher morphemic density per character	English: lower information per letter	1.2

The “Ellipsis Rate” dimension deserves particularly detailed discussion. Chinese and Japanese exhibit extremely high rates of zero anaphora: subjects and objects are frequently omitted, and the model must recover the missing entities through long-range dependency tracking within context. We hypothesize that this pushes the model to develop stronger logical coherence tracking capabilities, making it one of the key pressure sources for reasoning emergence.
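The table lists a weight for each dimension but does not state how per-dimension scores are combined into a single index. One plausible aggregation, assumed here purely for illustration, is a weighted mean of scores in [0, 1] rescaled to a 0–10 range. The function name, the dictionary keys, the [0, 1] scoring convention, and the placeholder scores are all hypothetical; only the weights come from the table.

```python
# Weights copied from the LAI dimension table.
WEIGHTS = {
    "word_boundary_ambiguity": 2.0,
    "polysemy": 1.5,
    "morphological_absence": 2.0,
    "word_order_flexibility": 1.0,
    "ellipsis_rate": 1.2,
    "homophony": 1.3,
    "writing_system_complexity": 1.0,
    "semantic_density": 1.2,
}


def linguistic_ambiguity_index(scores, weights=WEIGHTS, scale=10.0):
    """Weighted mean of per-dimension scores in [0, 1], rescaled to [0, scale].

    ASSUMPTION: this aggregation rule is a guess made for this sketch;
    the paper does not specify how the index is computed.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name} must lie in [0, 1]")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * scores[name] for name in weights)
    return scale * weighted_sum / total_weight


# Placeholder scores for an unnamed example language (not measurements).
example_scores = {name: 0.5 for name in WEIGHTS}
print(linguistic_ambiguity_index(example_scores))
```

Keeping the weights in one dictionary makes the calibration step mentioned above straightforward: a regression against measured prediction entropy would only need to replace the values in `WEIGHTS`.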

3.1 Ambiguity Index Ranking of Eight Major Languages

Rank	Language	Ambiguity Index	Predicted Effective Top-K	Reasoning Training Intensity
1	Chinese	9.3	High (pending measurement)	★★★★★ Very High
2	Japanese	7.7	Medium-High (pending measurement)	★★★★ High
3	Korean	4.6	Medium (pending measurement)	★★ Low
4	Arabic	4.1	Medium (pending measurement)	★★ Low
5	English	3.6	Medium-Low (pending measurement)	★★ Low
6	French	3.5	Medium-Low (pending measurement)	★★ Low
7	Russian	3.2	Low (pending measurement)	★ Very Low
8	German	2.7	Low (pending measurement)	★ Very Low

Key observation: Under the eight heuristic dimensions proposed in this paper, Chinese simultaneously scores high on the word boundary, morphological marking, polysemy, ellipsis, and homophony dimensions, and can therefore be considered a representative case of a high-ambiguity language. This may make it one of the most extreme high-pressure training environments.

IV Three-Layer Effect of a Single Property

This paper proposes a unified theoretical framework: a single property of language—its ambiguity level—simultaneously produces three distinct effects across three different layers, and these three effects are inseparable.

Unified Three-Layer Effect Model

Layer	Effect	Mechanism	Value Judgment
Human Communication	High compression = Efficient communication	Fewer symbols convey denser logic	Positive
LLM Training	High difficulty = Stronger reasoning emergence	Model forced into deep disambiguation	Positive
LLM Inference	High Top-K = Greater decoding uncertainty	Expanded candidate space, increased output variance, high-quality generation requires more reranking/search	Negative

These three effects are driven by the same underlying property and, under this framework, cannot be adjusted independently. If the hypothesis holds, the reasoning capability gains from Chinese-language training cannot be enjoyed without also accepting the greater decoding uncertainty of Chinese-language inference: they are two sides of the same coin.

V Temperature Amplification Effect

We hypothesize that the Temperature parameter amplifies the Chinese–English Top-K gap nonlinearly. Low Temperature (T≤0.7) sharpens probability distributions, driving all languages toward deterministic outputs and narrowing the gap. At T=1.0, the gap becomes apparent. At T≥1.2, the gap widens sharply, because raising Temperature flattens every distribution and an already-flat Chinese distribution then needs far more candidates to reach the same cumulative probability. The table below summarizes the expected qualitative trends, based on illustrative simulations with Zipf distributions; it reports no empirical measurements, and only the direction of the trends is claimed:

Temperature	Expected Trend	Chinese/English K_eff Gap
Low T (≤0.7)	All language distributions sharpened, trending toward determinism	Minimal gap
Medium T (1.0)	High-ambiguity language K_eff begins significant expansion	Order-of-magnitude differences begin to emerge
High T (≥1.2)	Originally flatter distributions spread further	Gap explodes nonlinearly

Industry implications: The trends above are based on theoretical simulations; their magnitudes require empirical verification. If the structural prediction holds, then under identical Top-P settings the effective candidate space for Chinese inference is significantly larger than for English, and the gap amplifies nonlinearly with Temperature.
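The directional claim about Temperature can be explored with a small simulation. The sketch below assumes two Zipf-shaped distributions over a hypothetical 50,000-token vocabulary, one steep and one flat; the exponents 2.0 and 1.2 are arbitrary stand-ins for a low-ambiguity and a high-ambiguity context, not fitted values. Temperature is applied as p^(1/T) followed by renormalization, which is equivalent to dividing the logits by T before the softmax.

```python
def zipf_probs(vocab_size, exponent):
    """Normalized Zipf distribution: p(rank) proportional to rank ** -exponent."""
    weights = [rank ** -exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def apply_temperature(probs, temperature):
    """Rescale a distribution as softmax(log(p) / T) would."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def effective_top_k(probs, top_p=0.9):
    """Smallest k whose k most probable candidates cover cumulative mass top_p."""
    cumulative = 0.0
    for k, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return k
    return len(probs)


# All constants below are illustrative assumptions, not measurements.
VOCAB_SIZE = 50_000
STEEP_EXPONENT = 2.0   # stand-in for a low-ambiguity context
FLAT_EXPONENT = 1.2    # stand-in for a high-ambiguity context
TEMPERATURES = (0.7, 1.0, 1.2)

steep = zipf_probs(VOCAB_SIZE, STEEP_EXPONENT)
flat = zipf_probs(VOCAB_SIZE, FLAT_EXPONENT)

gaps = {}
for t in TEMPERATURES:
    k_steep = effective_top_k(apply_temperature(steep, t))
    k_flat = effective_top_k(apply_temperature(flat, t))
    gaps[t] = k_flat - k_steep
    print(f"T={t}: K_eff steep={k_steep}, flat={k_flat}, gap={gaps[t]}")
```

In this toy setting the gap widens as Temperature rises, which matches the direction described above. It says nothing about real models, whose next-token distributions need not follow Zipf's law.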

VI Illustrative Case: DeepSeek and the Chinese Training Pressure Hypothesis

The DeepSeek series offers an illustrative case consistent with the hypothesis of this paper, but it cannot prove causation on its own. DeepSeek-V4 (released April 2026) was pretrained on over 32T tokens, with 1.6 trillion total parameters (49B activated per token), and achieved a Codeforces rating of 3206, becoming the first open-source model to match closed-source models in competitive programming^[6]. Notably, the pretraining corpus of the earlier DeepSeek-V3 is described as “a multilingual corpus primarily composed of English and Chinese,” but the specific language proportions have never been disclosed^[7]. Therefore, this paper does not assume DeepSeek is a “Chinese-corpus-dominant” model, but treats it merely as a case study of a model that grew within the Chinese technology ecosystem, exhibits strong Chinese capabilities, and demonstrates exceptional reasoning ability.

Conventional explanations attribute DeepSeek’s success to its MoE architecture, hybrid attention mechanisms (CSA/HCA), reinforcement learning, synthetic data distillation, and engineering optimization. Chinese online communities have discussed how the higher information density of Chinese training data may contribute to its logical capabilities^[8], but that discussion stays at the level of “information density” and does not address the mechanism of “ambiguity-driven reasoning emergence” proposed in this paper. Our framework offers a possible complementary explanation (note: this is one of multiple possible factors, not the sole factor):

Bilingual/multilingual corpus training with strong Chinese capability

↓

High ambiguity at each token position → Flat probability distribution

↓

Model forced to develop stronger disambiguation and deep reasoning capabilities

↓

Reasoning capability transfers to English-language tasks

↓

Result: Leading performance on English benchmarks as well

This provides a supplementary explanation for a more general question: why might bilingual/multilingual models with strong Chinese capabilities also exhibit strong reasoning performance on English, mathematical, and coding tasks? Our answer is that this may not be a matter of “English performance being good too,” but rather that the underlying reasoning capability is strong, and English is merely a beneficiary of that capability. An athlete who trains at high altitude carries the enhanced oxygen capacity to sea-level competitions; the advantage holds at any elevation.

This framework also yields a testable corollary: if Chinese-language training indeed develops stronger disambiguation capacity, then using Chinese as the “thinking language” during Chain-of-Thought (CoT) reasoning may trigger deeper search than using English. Notably, DeepSeek-R1 exhibits performance degradation when forced into language consistency^[16], while free language mixing (code-switching) correlates positively with stronger reasoning performance—this can be interpreted as the model switching between different languages’ ambiguity spaces to find optimal reasoning paths.

VII Industry Blind Spots and the Pricing Paradox

The framework of this paper reveals four structural issues in the current AI industry that may be underestimated:

Overlooked Structural Issues

1. Pricing models may not adequately reflect cross-linguistic cost differences — The entire industry charges by token count. Expanded effective Top-K does not necessarily increase the cost of a single forward pass significantly (logit computation covers the full vocabulary), but it may increase sampling uncertainty, output variance, and the need for reranking, self-consistency, and multi-path reasoning in high-quality generation scenarios. If this “decision uncertainty cost” is substantial, the current uniform per-token API pricing model may underestimate the true service cost differences across languages.

2. Benchmarks may be incomplete — Multilingual model evaluations compare accuracy and perplexity, but to our knowledge the effective Top-K distributions of different languages under the same Top-P setting have not been systematically compared.

3. The optimization direction may be wrong — The industry is compressing token counts to reduce costs, but two concepts must be distinguished: “token count efficiency” (how many tokens are needed for the same semantics) and “token computational load efficiency” (the inference decision cost per token). Recent research has found that Chinese’s supposed token count efficiency advantage does not hold^[14], but that study analyzed only the first dimension. This paper suggests that even when token counts are equal, Chinese tokens may involve a larger candidate space during sampling and a heavier disambiguation load within the model, because their probability distributions are hypothesized to be flatter. If so, the “reasoning weight” of each Chinese token differs from that of an English token.

4. Training corpus mixing strategies may be wrong — Current strategies pursue “more and cleaner English data.” But if high-ambiguity languages provide a higher-intensity training environment, the correct strategy may be to deliberately increase the proportion of high-ambiguity languages in training.

VIII Verifiable Experimental Designs

The hypotheses proposed in this paper can be tested through the following experimental designs:

Experiment 1: Empirical Top-K Distribution Measurement

For a given model, input semantically equivalent texts in eight languages (e.g., drawn from parallel corpora) and record the complete probability distribution and the effective Top-K value at each token position under identical Temperature and Top-P settings. Compare the resulting Top-K distributions side by side across the eight languages to test whether they correlate positively with the Linguistic Ambiguity Index.
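The correlation test in Experiment 1 could be run as a rank correlation, since the LAI values are heuristic and only their ordering is likely to be meaningful. The sketch below implements Spearman's rho in plain Python; choosing Spearman over Pearson is an assumption of this sketch, and the input vectors in the usage line are toy numbers, not measurements for any language.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have equal length")
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy usage: index values versus hypothetical mean K_eff values.
print(spearman([1, 2, 3, 4], [12, 15, 14, 30]))
```

With only eight languages, any such coefficient would carry wide uncertainty, so it is better read as a directional check than as a significance test.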

Experiment 2: Training Language Causal Experiment

A controlled-variable experiment—same model architecture, same parameter count, same number of training tokens—train four models separately on pure Chinese, pure English, pure Japanese, and pure German corpora, then compare their performance on entirely language-independent tasks (mathematical reasoning, abstract logic, symbolic operations). If the Chinese-trained model leads on non-linguistic reasoning tasks, this would provide causal evidence for the training pressure hypothesis.

Experiment 3: Temperature Response Curve Experiment

Measure the entropy growth curve, K_eff growth curve, and output quality degradation curve for each language as Temperature varies continuously from 0.1 to 2.0. This paper does not presuppose that the “quality collapse point” for high-ambiguity languages is necessarily higher or lower than for low-ambiguity languages—a high baseline entropy could mean earlier collapse (being closer to uniform distribution) or greater structural resilience (the model having already learned to maintain coherence in high-ambiguity environments). Which prediction is correct should be determined by empirical measurement. Recommended metrics include: H(T), K_eff@0.9(T), dH/dT (entropy sensitivity to Temperature), and dK_eff/dT (effective Top-K expansion rate). We recommend using bits-per-byte (BPB) as a cross-linguistic, cross-tokenizer normalization metric to avoid incomparability caused by different tokenization schemes.
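Two of the quantities recommended for Experiment 3 are simple to compute once per-token log-probabilities are available. The sketch below shows bits-per-byte and a finite-difference estimate of dH/dT or dK_eff/dT. It assumes log-probabilities are given in natural-log units and that text is measured in UTF-8 bytes; both function names are hypothetical.

```python
import math


def bits_per_byte(token_logprobs, text):
    """Total negative log-likelihood in bits divided by the UTF-8 byte length.

    token_logprobs: natural-log probabilities the model assigned to each
    reference token of `text` under teacher forcing (assumed input format).
    """
    total_bits = -sum(token_logprobs) / math.log(2)
    n_bytes = len(text.encode("utf-8"))
    return total_bits / n_bytes


def slopes(temperatures, values):
    """Forward finite differences approximating d(value)/dT on a temperature grid."""
    return [
        (values[i + 1] - values[i]) / (temperatures[i + 1] - temperatures[i])
        for i in range(len(values) - 1)
    ]


# Toy usage: eight tokens, each predicted with probability 0.5, over a 4-byte string.
print(bits_per_byte([math.log(0.5)] * 8, "abcd"))
print(slopes([1.0, 2.0, 4.0], [3.0, 5.0, 5.0]))
```

Because the denominator counts bytes rather than tokens, the result does not depend on how a tokenizer segments the text. UTF-8 does spend three bytes on most Chinese characters and one on each ASCII letter, so byte-level normalization carries its own cross-script bias that an experiment would need to report alongside the metric.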

Experiment 4: Optimal Mixed Corpus Ratio

Systematically vary the ratio of high-ambiguity languages (Chinese/Japanese) to low-ambiguity languages (English/German) in the training corpus, measure the impact curve on downstream reasoning task performance, and identify the optimal mixing ratio.

Experiment 5: Minimum Viable Experiment (No Model Training Required)

The above experiments are costly. The following design can be executed immediately on existing open-source models: select models of several architectures, such as Qwen, Llama, Gemma, and DeepSeek; use parallel corpora (e.g., Flores-200) to obtain semantically equivalent multilingual inputs; for each sentence, perform token-by-token teacher forcing and record the complete logits at each step; compute entropy, K_eff@0.9, K_eff@0.95, top-1 probability, and the Gini coefficient of the distribution at each position; then compare the languages side by side. This experiment requires no GPU cluster (a single GPU suffices for models that fit in its memory) and could provide preliminary support for, or falsification of, this paper’s core hypothesis within about one week.
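The per-position statistics listed for Experiment 5 can be sketched as follows. The code assumes the logits for each position have already been extracted from a model as plain Python lists; how they are obtained (model loading, tokenization, teacher forcing) is outside this sketch. The function names and dictionary keys are hypothetical, and entropy is reported in bits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def k_eff(sorted_desc, top_p, eps=1e-12):
    """Smallest k covering cumulative mass top_p; input must be sorted descending."""
    cumulative = 0.0
    for k, p in enumerate(sorted_desc, start=1):
        cumulative += p
        if cumulative >= top_p - eps:  # eps absorbs floating-point round-off
            return k
    return len(sorted_desc)


def gini(probs):
    """Gini coefficient: 0 for a uniform distribution, near 1 when one token dominates."""
    ascending = sorted(probs)
    n = len(ascending)
    weighted = sum((2 * i - n - 1) * p for i, p in enumerate(ascending, start=1))
    return weighted / (n * sum(ascending))


def position_metrics(logits):
    """All Experiment 5 statistics for a single token position."""
    probs = softmax(logits)
    desc = sorted(probs, reverse=True)
    return {
        "entropy_bits": -sum(p * math.log2(p) for p in probs if p > 0.0),
        "k_eff_0.90": k_eff(desc, 0.90),
        "k_eff_0.95": k_eff(desc, 0.95),
        "top1_prob": desc[0],
        "gini": gini(probs),
    }


def mean_metrics(per_position_logits):
    """Average each statistic over all positions of a sentence or corpus."""
    rows = [position_metrics(logits) for logits in per_position_logits]
    return {key: sum(row[key] for row in rows) / len(rows) for key in rows[0]}


# Toy usage: one uniform position and one nearly deterministic position.
print(mean_metrics([[0.0, 0.0, 0.0, 0.0], [100.0, 0.0, 0.0, 0.0]]))
```

Averaging per position treats every token equally. Because tokenizers segment languages differently, a real comparison would likely also need a tokenizer-independent normalization such as the bits-per-byte metric recommended in Experiment 3.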

Experiment 6: CoT Language Selection Experiment

If Chinese-language training indeed develops stronger disambiguation capacity, then allowing models to freely select their thinking language (code-switching) during Chain-of-Thought reasoning may outperform enforcing a single language. An experiment can be designed where the model is forced to use pure Chinese CoT, pure English CoT, and free mixed-language CoT on the same reasoning task, comparing accuracy differences on mathematical and logical tasks.

Experiment 7: Reasoning Type Decomposition Experiment

Different linguistic properties may have different effects on different types of reasoning. We recommend decomposing “reasoning” into subtypes: causal reasoning, mathematical reasoning, spatial reasoning, temporal reasoning, abstract symbolic operations, and others, then separately measuring the performance differences of models trained on high-ambiguity languages across each subtype. Chinese’s high ellipsis rate may particularly enhance causal tracking ability (requiring recovery of dropped subjects), but the improvement in pure mathematical reasoning may be limited.

Experiment 8: Context Window Semantic Coverage Experiment

If each Chinese token carries more semantic content, then a context window of the same token length covers a broader semantic range in Chinese. This factor could independently explain better reasoning performance apart from ambiguity. We recommend measuring: under an identical token-count context window, the performance differences of models across languages on long-document comprehension and multi-hop reasoning tasks, in order to disentangle the “semantic density effect” from the “ambiguity effect.”

IX The AI Interaction Advantage of Multilingual Users

One corollary of this paper’s framework is that multilingual users, especially those whose languages span different ambiguity gradients, may hold a structural advantage in AI interaction. The corollary is motivated by findings on human cognition: bilinguals and multilinguals demonstrate enhanced task-switching accuracy, cognitive flexibility, and abstract symbolic thinking^[9][10][11]. Research has found that the cognitive flexibility required for language switching makes the brain more adept at exploring different and novel perspectives^[12]. On the model side, Chinese-specific causal ordering preferences are internalized and rigidly applied by models, with reasoning accuracy declining when input structures deviate from standard expressions^[13], which suggests that structural processing paths differ across languages within models.

However, this paper highlights a dimension not previously discussed: multilingual users possess not only more flexible thinking but also the ability to choose which language to use, and thereby to trigger different response modes in the AI. Under our hypothesis, a high-ambiguity input language leads the model to sample from a larger candidate space, which may produce deeper responses. This is a “linguistic weapon selection” capability that monolingual users do not possess.

High-compression languages such as Chinese may offer extremely high contextual compression efficiency for human communication, but for current LLM architectures they may introduce higher prediction ambiguity and decoding uncertainty. The human brain disambiguates almost instantaneously through context; an LLM must resolve the same ambiguity through the probability distribution it assigns over candidate tokens. The same linguistic property functions as an advantage for one and a burden for the other.

X Counterexamples and Alternative Explanations

To strengthen the credibility of this paper’s hypothesis framework, this section lists counterexamples and alternative explanations that could weaken or overturn the core propositions. Any serious follow-up research should prioritize ruling out these alternative hypotheses:

Alternative Explanations to Be Ruled Out

1. Tokenizer differences rather than linguistic ambiguity — Chinese character/token granularity may itself cause distributional differences. Different tokenizers employ different segmentation strategies for different languages, and Top-K differences may stem from tokenizer design rather than linguistic structure. Experiments must control for the tokenizer variable; we recommend using bits-per-byte (BPB) as a cross-tokenizer normalization metric. We conjecture, however, that even if BPE “flattens” Chinese distributional entropy through frequency-driven merging, the language’s inherent semantic density and polysemy do not disappear: the uncertainty is relocated, for example into what the embedding layer must encode, rather than eliminated. Recent research reports that BPE, when naively applied to Chinese, “often fails to capture the true internal structure of Chinese words”^[17].

2. Training data quality rather than linguistic properties — DeepSeek’s reasoning capabilities may primarily derive from high-quality mathematical, coding, and RL data rather than natural language ambiguity. The quality distribution of Chinese internet corpora differs from that of English, which may be a confounding factor.

3. High ambiguity may lead to worse fitting rather than stronger generalization — The training pressure hypothesis assumes that models develop stronger capabilities under high-pressure environments. But the alternative possibility is that high ambiguity makes model convergence more difficult, ultimately producing worse rather than stronger models. Experimental data are needed to distinguish between these two outcomes.

4. Information density and ambiguity are not the same concept — Chinese is indeed information-dense (more semantics per character), but high density does not equal high ambiguity. “细胞” (xìbāo, meaning “cell” in the biological sense) is less ambiguous than “cell” (which can refer to a prison cell, a battery, or a biological cell). These two dimensions’ independent contributions must be rigorously disentangled.

5. Linguistic ambiguity does not equal model prediction ambiguity — Human-perceived ambiguity and the model’s prediction uncertainty at the logits level may not align. The model may have effectively resolved linguistic ambiguity through massive training data, such that the actual prediction entropy gap between languages is far smaller than what linguistic analysis suggests.

6. Cross-linguistic transfer may operate through other mechanisms — A Chinese-trained model performing well in English may result from implicit English data contamination in multilingual training, shared mathematical/coding sub-corpora, or the model’s internal cross-linguistic alignment mechanisms, rather than transfer of “reasoning muscles.”

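7. The attribution risk of a sample size of one — Currently DeepSeek is the only case discussed here of a model that combines strong Chinese capability with strong reasoning, and because its language proportions are undisclosed, it cannot even be confirmed as Chinese-dominant. This is insufficient to establish causation. Data from more models of different architectures and scales, with documented Chinese-dominant training, are needed.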

This paper positions itself as proposing hypotheses worthy of verification, not as providing verified conclusions. If any of the above alternative explanations is confirmed as the primary factor, it would weaken the core propositions of this paper. This is precisely the embodiment of falsifiability.

XI Conclusion

The core contributions of this paper are three interrelated original propositions:

Proposition One (Top-K Divergence Proposition): Under identical Temperature and Top-P settings, the effective Top-K values of different languages are hypothesized to differ systematically, possibly by orders of magnitude, and to correlate positively with each language’s ambiguity index.

Proposition Two (Training Pressure Proposition): High-ambiguity languages used as training corpora may constitute a higher conditional-entropy learning environment that drives models to develop stronger contextual integration and disambiguation capabilities; whether such capabilities transfer across languages to non-linguistic reasoning tasks requires verification through controlled-variable experiments.

Proposition Three (Three-Layer Unification Proposition): Language ambiguity produces effects that share a common origin yet differ in direction across three layers (human communication efficiency, LLM training capability emergence, and LLM inference decoding uncertainty), and the three are inseparable.

Language is not merely the I/O format for AI—it is the mold that shapes AI’s cognitive architecture. Choosing which language to train AI on is not just a data engineering decision; it is a cognitive architecture decision.

All numerical values in this paper are theoretical estimates requiring verification through actual model experiments. However, we believe the theoretical framework and testable hypotheses proposed herein offer a fundamentally new perspective for understanding how linguistic properties deeply influence LLM reasoning capabilities.

If the hypotheses of this paper prove correct, then over the past decade, the language data considered “difficult to process” has in fact been the most precious ore in the evolutionary history of LLMs. Researchers should stop blindly simplifying training corpora to reduce costs and instead deliberately harness “linguistic pressure” to train smaller yet smarter models. This is not only a technical issue but also a question of fairness in AI globalization—those marginalized high-ambiguity languages may be precisely the critical catalysts for AI cognitive evolution.

XII External Data Annotations

[1] MacroPolo, “The Global AI Talent Tracker 3.0,” Paulson Institute, March 2024. Data show that 47% of the world’s top 20% AI researchers in 2022 were of Chinese origin (based on undergraduate institution).
https://macropolo.org/digital-projects/the-global-ai-talent-tracker/

[2] 36Kr, “Half of the World’s AI Talents Are Chinese: Why Does China Still Face a Talent Shortage?” June 2025. 38% of talent at top U.S. AI institutions are of Chinese origin, slightly exceeding the 37% of U.S. natives.
https://eu.36kr.com/en/p/3340533396093446

[3] UNIDO ITPO China & Dongbi Data, “Global TOP 100 AI Experts Ranking,” July 2025. Based on an analysis of nearly 200,000 researchers and 100,000 high-impact papers, as reported by the South China Morning Post.
https://www.scmp.com/news/china/science/article/3317213/

[4] R.R. Xie, W.B. Deng, D.J. Wang, L.P. Csernai, “Quantitative Entropy Study of Language Complexity,” arXiv:1611.04841, 2018. Significant entropy differences exist between Chinese and English texts.
https://arxiv.org/pdf/1611.04841

[5] “Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity,” arXiv:2507.23121, 2025. LLMs are fragile when handling Chinese ambiguity and fail to reliably distinguish ambiguous from unambiguous texts.
https://arxiv.org/pdf/2507.23121

[6] DeepSeek-AI, “DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence,” Hugging Face / arXiv, April 2026. V4-Pro has 1.6T total parameters, 49B activated per token, pretrained on over 32T tokens, with a Codeforces rating of 3206.
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro

[7] DeepSeek-AI, “DeepSeek-V3 Technical Report,” arXiv:2412.19437, December 2024. V3 was pretrained on a 14.8T-token corpus described as “a multilingual corpus primarily composed of English and Chinese,” with the specific ratio undisclosed.
https://arxiv.org/pdf/2412.19437

[8] South China Morning Post, “Strokes of genius: why DeepSeek’s AI edge may come from its Chinese lessons,” February 14, 2025. Chinese online communities discuss the contribution of Chinese training data to DeepSeek’s performance.
https://www.scmp.com/news/china/science/article/3298555/

[9] “Multilingualism and Cognitive Flexibility: Insights from Neuroscience and Linguistics,” Acta Globalis Humanitatis et Linguarum, Vol. 1 No. 1, 2024. Multilinguals demonstrate enhanced problem-solving ability, attentional control, and cognitive flexibility.
https://www.researchgate.net/publication/385746426

[10] Frontiers in Psychology, “The impact of bilingualism and code-switching on executive function performance,” November 2025. Bilinguals outperform monolinguals in task-switching accuracy.
https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2025.1583441/

[11] Adesope et al., “A systematic review and meta-analysis of the cognitive correlates of bilingualism,” Review of Educational Research, 2010. The bilingual advantage spans attentional control, working memory, and abstract symbolic thinking.

[12] Education World Wide, “Cognitive Flexibility Through Multilingualism: Insights into Bilingual Brain Development,” February 2026. The cognitive flexibility required for language switching makes the brain more adept at exploring different perspectives.
https://eduww.net/science-and-online-learning/cognitive-flexibility-through-multilingualism/

[13] “Under the Shadow of Babel: How Language Shapes Reasoning in LLMs,” arXiv:2506.16151, 2025. Chinese-specific causal ordering preferences are internalized by models, with reasoning accuracy declining when input structures deviate from standard expressions.
https://arxiv.org/pdf/2506.16151

[14] “Mythbuster: Chinese Language Is Not More Efficient Than English in Vibe Coding,” arXiv:2604.14210, April 2026. The purported Chinese token efficiency advantage does not hold, though only the token count dimension was analyzed.
https://arxiv.org/html/2604.14210v1

[15] J. Buckman and G. Neubig, “Neural Lattice Language Models,” arXiv:1803.05071, 2018. Models processing multi-character Chinese tokens reduced perplexity by 20.94% compared to character-level baselines.
https://arxiv.org/pdf/1803.05071

[16] “The Impact of Language Mixing on Bilingual LLM Reasoning,” arXiv:2507.15849, July 2025. Finds that DeepSeek-R1 experiences performance degradation when forced into language consistency, and that language mixing (code-switching) correlates positively with stronger reasoning performance.
https://arxiv.org/pdf/2507.15849

[17] Y. Hu, F. Liang, D. Zhao, “Entropy-Driven Pre-Tokenization for Byte-Pair Encoding,” ICML 2025 Tokenization Workshop. Confirms that BPE, when naively applied to Chinese, fails to capture the true internal structure of Chinese words; entropy-informed pre-tokenization can reshape token structure.
https://arxiv.org/pdf/2506.15889