Thought Paper

Language Iteration Speed:
Another Dimension of LLM Variables
How cross-linguistic differences in technical terminology evolution shape the resolution ceiling of LLM world models

Starting from the naming divergence between “data center” and China’s “算力中心” (computing power center) / “智算中心” (intelligent computing center), this paper explores a theoretical framework for corpus semantic granularity as a hidden variable of LLM performance.

LEECHO Global AI Research Lab & Opus 4.6
April 9, 2026 · V1

Abstract

This paper investigates the structural impact of cross-linguistic differences in technical terminology evolution speed on the cognitive resolution of large language models (LLMs). Using the Chinese and English naming systems for GPU parallel matrix computation facilities as a core case study: China evolved from “数据中心” (data center) to precise new terms like “算力中心” (computing power center) and “智算中心” (intelligent computing center), while English continues to use “data center” with added modifiers. This divergence is not accidental — it is rooted in the compound interaction of language morphological features, industrial organization modes, and path dependence structures. More critically, when over half of the world’s frontier AI researchers are native Chinese speakers, the systematic “semantic downgrade” that occurs when they translate precisely differentiated Chinese concepts into English papers propagates into LLM training corpora, thereby capping the resolution of model world representations. This paper names this phenomenon “Terminology-Physics Alignment Lag” (TPAL) and proposes an analytical framework integrating diachronic linguistics, path dependence theory, and the weak form of the Sapir-Whorf hypothesis. TPAL constitutes an entirely new LLM variable dimension distinct from model architecture, training methods, and alignment techniques.

LLM World Models
Terminology Evolution Speed
Cross-Linguistic Semantic Resolution
Path Dependence
GPU/CPU Architecture Transition
Sapir-Whorf Hypothesis
算力中心 vs Data Center
TPAL
Section 01

Background: Physical Discontinuity and Terminological Continuity

When hardware undergoes a paradigm break but the English word stays the same

Since 2012, the physical essence of the data center has undergone a fundamental transformation: from CPU-dominated parallel data storage and processing facilities to GPU-dominated large-scale matrix computation facilities. This is not a simple hardware upgrade but an architectural paradigm break. A single CPU server draws 300–600W, while a single GPU server can reach 3,000–10,000W. Rack power density has surged from 5–15kW in traditional data centers to 40–250kW and beyond for AI workloads. Cooling has shifted from air to liquid, and network topology has moved from north-south traffic to GPU-to-GPU east-west communication.

This is not swapping out equipment within the same building — it is an entirely new class of industrial facility. Yet the English-speaking world’s naming response to this break has been: adding modifiers to the old word “data center” — “AI data center,” “GPU data center,” “hyperscale data center.”

300–600 W · single CPU server power consumption
3–10 kW · single GPU server power consumption
5–15 kW · traditional rack power density
40–250 kW+ · AI rack power density

In stark contrast, China produced an entirely different terminological response to the same physical change. China’s Ministry of Industry and Information Technology (MIIT) and ten other departments jointly issued policy documents that explicitly classify computing infrastructure into three types: General Computing Centers (CPU-dominated), Intelligent Computing Centers / 智算中心 (GPU/AI accelerator-dominated), and Supercomputing Centers (HPC cluster-dominated). Each term precisely corresponds to a different hardware architecture, service target, and policy jurisdiction.

Core Contrast

Facing the same discontinuous change in the physical world, Chinese generated new vocabulary to mark the break, while English chose to patch the old word to absorb the change. This is not a translation problem — it is a systematic difference in how fast two languages track changes in the physical world.

Section 02

Asymmetric Naming Strategies: Why the Difference

Markets vs. plans — two governance philosophies reflected in vocabulary

The naming strategy divergence between the two countries is not a matter of linguistic habit but a structural mapping of technology governance philosophies.

Dimension | China (Chinese) | United States (English)
Naming authority | Government (MIIT) unified definition | Individual corporate naming
Classification basis | Computing type (CPU/GPU/HPC) | Ownership & scale (enterprise/colo/hyperscale)
Root word | “算” (verb: to compute) | “data” (noun: data)
Metaphor frame | Production facility (analogy: power plant) | Storage facility (analogy: warehouse)
GPU-era response | Created a new term: 智算中心 | Added a modifier: AI data center
NVIDIA’s attempt | – | “AI Factory” (GTC 2024)
Policy transmission | “Accelerate 智算中心 construction” → instantly clear | “Invest in AI infrastructure” → needs further definition

NVIDIA CEO Jensen Huang repeatedly emphasized the “AI Factory” concept at GTC 2024: “The raw material of the last industrial revolution was water, and the product was electricity. The raw material of the AI factory is data and electricity, and the product is tokens.” He sought to replace the storage metaphor of “data center” with the production metaphor of “factory.” But the Chinese term “算力中心” (computing power center) never needed this step — the root word “算” (compute) is inherently a verb, naturally production-oriented.

Section 03

Path Dependence: Structural Causes of English Terminological Inertia

Why “data center” persists despite a paradigm break in the underlying hardware

The continued use of “data center” in English is not a matter of simple habit but the result of multi-layered path dependence lock-in. Path dependence theory holds that early choices constrain subsequent choices through self-reinforcing positive feedback loops — even when superior alternatives exist, switching costs exceed cumulative benefits, thereby suppressing change.

Sunk cost lock-in. The US data center industry has trillions of dollars in sunk costs. REITs, insurance contracts, government tax incentives, and industry standards (e.g., TIA-942) are all built around the term “data center.” Renaming means restructuring the entire legal and financial framework.

Network effect lock-in. Global English-language technical documentation, standards, and contracts all use “data center.” This is not a single-country renaming cost but a global-scale coordination cost.

Cognitive lock-in. English morphology makes compound-noun coinage less flexible than Chinese morphology. The Chinese “智算中心” packs a completely new conceptual unit into just four characters, a compression ratio that English struggles to match when coining new terms.

Path Dependence Dynamics

Chinese terminology’s rapid iteration also benefits from the absence of a path-dependence burden. Chinese technical terminology carries no global lock-in costs: with each imported generation of technology, the Chinese-speaking community performs active semantic reconstruction through “catch-up translation,” a practice that has itself become cultural inertia. The US, by contrast, is the technology originator; its terminology is globally adopted and embedded, making the worldwide coordination cost of renaming extremely high.

Section 04

Semantic Resolution Gap: Topological Separation in Vector Space

How terminology precision directly imprints on LLM internal representations

The difference in terminology precision is not an abstract discussion — it is directly imprinted in the vector space structure inside LLMs. LLMs are fundamentally systems that convert language into high-dimensional vector space matrix operations. Therefore, whether a given concept is mixed within a single token cluster or separated into independent clusters in training corpora directly determines the resolution of the world model the LLM learns.

Figure 1 · Hypothetical Vector Space Distribution

Comparison of semantic distributions for infrastructure-related tokens in LLMs trained on English vs. Chinese corpora (conceptual diagram)

[Conceptual diagram: in the English vector space, “data center,” “AI data center,” “GPU cluster,” and “cloud DC” form a single highly overlapping cluster; in the Chinese vector space, “数据中心” and “智算中心” form two clearly separated clusters.]

English-side “data center”-related concepts cluster tightly within a single semantic neighborhood, while the Chinese-side “数据中心” and “智算中心” form topologically separated, independent clusters. This is a theoretical prediction requiring empirical validation, but the structural difference in root words (“data” vs. “算/智算”) gives it strong theoretical support.
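This predicted separation can be made measurable. As one illustrative operationalization (our notation, not an established metric), let v_x denote the embedding of term x and define the semantic resolution of a language L over a concept set C as the mean pairwise cosine distance:

R(L, C) = 2 / (|C|·(|C|−1)) · Σ_{a<b in C} [1 − cos(v_a, v_b)]

The paper’s hypothesis then reads R(Chinese, {数据中心, 算力中心, 智算中心}) > R(English, {data center, AI data center, GPU data center}). Section 09 sketches how to estimate these quantities in practice.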
Section 05

The Human Variable: Bilingual Demographics at the AI Frontier

Why over half the world’s top AI researchers think in Chinese first

The severity of the above problem is amplified by the composition of AI research talent.

57.7% · combined global share of Chinese and US AI researchers
63,000+ · AI researchers in the United States
53,000 · AI researchers in China
~42% · authors of Chinese origin at NeurIPS 2019

The 2025 UNIDO report shows that Chinese and American AI researchers together account for 57.7% of the global total. The Carnegie Endowment’s analysis is even more pointed: among authors of top AI papers, the contribution of China-origin researchers rivals or exceeds that of US-native authors. At NeurIPS 2019, roughly 42% of authors were of Chinese origin, and 50% of accepted papers at AAAI 2020 included contributions from China-origin researchers. In 2024, Chinese scholars published 23,695 AI papers, exceeding the combined total of the US, UK, and EU.

Structural Contradiction

Over half of the world’s most cutting-edge AI researchers are native Chinese speakers. When they think in Chinese, “数据中心” (data center) and “智算中心” (intelligent computing center) are two sharply distinct concepts. But when they publish at NeurIPS, this precise distinction is compressed into the single English term “data center,” resulting in systematic semantic downgrade. And these English papers immediately become training corpora for the next generation of LLMs.

Section 06

Terminology-Physics Alignment Lag: Proposing a New LLM Variable

TPAL — a variable upstream of architecture, training, and alignment

This paper integrates the preceding analysis and proposes “Terminology-Physics Alignment Lag” (TPAL) as a new variable affecting LLM performance. TPAL measures the time differential and semantic gap between the point when a change occurs in the physical world and the point when a precise term designating that change gains a foothold in mainstream corpora.
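A minimal formal reading of this definition (an illustrative operationalization; the paper defines TPAL only verbally) is, for a physical-world change c and a language L:

TPAL(L, c) = t_term(L, c) − t_phys(c)

where t_phys(c) is when the change occurs in the physical world, and t_term(L, c) is when a term precisely designating c first exceeds some adoption threshold (e.g., a fixed share of domain corpora) in L. On this reading, Chinese arguably crossed the threshold for the CPU→GPU transition with the MIIT classification of 智算中心 (Section 01), while English has yet to cross it, leaving TPAL(English, CPU→GPU) open-ended.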

Figure 2 · TPAL Causal Transmission Chain
Physical-world change (CPU→GPU transition) → language’s terminological response (the TPAL occurrence point) → corpus semantic resolution (precise vs. blurred) → LLM world model (vector space structure) → LLM output quality (resolution ceiling)

TPAL causal chain. The speed of language’s response to physical change (the TPAL point) constitutes the upstream bottleneck of the entire chain.

In this causal chain, TPAL sits at the upstream bottleneck position. No matter how model architecture, training methods, or alignment techniques improve, if the semantic resolution of training corpora is insufficiently high, the precision of world representation the model can learn has a ceiling. This is not a hallucination problem, not a reasoning capability problem, nor an alignment problem — all of those sit further downstream. TPAL points to the more fundamental issue of corpus semantic granularity.

Core Proposition

The larger the TPAL (i.e., the slower a language tracks physical-world changes), the lower the resolution of the LLM world model trained on that language’s corpora. This means that an LLM for a specific language may exhibit systematically low cognitive precision in specific technical domains — independent of model scale and training methodology.

Section 07

Self-Reinforcing Loop: LLMs Amplify Terminological Inertia

The Sapir-Whorf hypothesis meets AI-era feedback dynamics

The problem does not transmit in only one direction. LLMs are not merely passive learners of corpora — they are active generators of new corpora. When researchers and practitioners use English LLMs to discuss AI infrastructure, the LLM’s word choices in turn influence humans’ corpus generation.

Figure 3 · TPAL Self-Reinforcing Loop
Old terminology dominates corpora (“data center”) → LLM learns the old framework → LLM output reinforces the old terminology → the old terminology’s corpus share rises → positive feedback deepens path lock-in

LLMs learn terminological inertia from corpora and feed it back through their own output, amplifying path-dependence lock-in at the technological level.

Conversely, if the Chinese side has already completed the “数据中心→智算中心” terminology switch, Chinese LLM output will naturally use the new terminology, forming a positive cycle. This means the cross-linguistic TPAL gap may automatically widen over time.
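The loop’s dynamics can be made concrete with a toy model; this is purely illustrative, and the update rule and every parameter value are assumptions rather than measurements. Let s be the legacy term’s corpus share: each step, new text mixes human writing, which drifts toward the new term at rate d, with LLM output, which simply reproduces the current corpus distribution.

# Toy model of the TPAL self-reinforcing loop (illustrative assumptions only).
# s = share of the legacy term ("data center") in the corpus.
# Each step blends human text (drifting to the new term at rate d)
# with LLM-generated text (fraction g), which mirrors the current corpus.
def simulate(s0=0.95, d=0.10, g=0.5, steps=10):
    s = s0
    history = [round(s, 3)]
    for _ in range(steps):
        human = s * (1 - d)   # human writing sheds the legacy term at rate d
        llm = s               # LLM output reproduces the corpus it learned from
        s = (1 - g) * human + g * llm
        history.append(round(s, 3))
    return history

print(simulate(g=0.0))  # no LLM-generated text: legacy share decays at rate d
print(simulate(g=0.7))  # mostly LLM-generated text: decay slows sharply

Algebraically, s decays at the effective rate (1 − g)·d per step, so the larger the LLM-generated fraction g of the corpus, the slower the terminology turnover: precisely the lock-in Figure 3 describes.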

This can be understood as an AI-era variant of the Sapir-Whorf hypothesis. The original hypothesis holds that “language influences/determines human thought.” The variant proposed in this paper is: “The granularity of language determines the cognitive resolution of LLMs, and LLMs in turn reinforce human language use, making this constraint self-reinforcing.”

Section 08

The Tension Between Precision and Flexibility

A fair assessment of both naming strategies’ trade-offs

In fairness, the limitations of the Chinese terminology system must also be noted. The rigid tripartite classification of “通算/智算/超算” (general/intelligent/super computing) encounters problems when technical boundaries blur — when a facility simultaneously provides general cloud services and AI inference, which category does it belong to? The industry has already coined the patch term “融合算力中心” (converged computing center), indicating that the rigid tripartite system’s inflexibility is creating friction with reality.

Evaluation Dimension | Chinese System (Precision Strategy) | English System (Flexibility Strategy)
Policy transmission efficiency | High: “build 智算中心” is instantly clear | Low: requires additional definition
Investment narrative clarity | High: GPU cluster target locked in | Low: boundary blurred with traditional REITs
Cross-industry dialogue precision | High: all parties share a coordinate system | Low: stakeholders have varying interpretations
Facility flexible transition | Low: locked into category by name | High: name doesn’t constrain purpose
Hybrid workloads | Needs patch: “converged computing center” | Naturally absorbed: just swap the modifier
LLM corpus semantic resolution | High: concepts separated | Low: concepts blended

There is a fundamental tension between “precision” and “flexibility” in technical naming. Chinese chose precision, gaining efficiency in policy transmission and cross-domain analogies, but sacrificing some flexibility; English chose flexibility, preserving market adaptability, but incurring higher communication costs when unified action is needed. However, on the new dimension of LLM world model resolution, the precision strategy holds a structural advantage.

Section 09

Research Gaps and Proposals: LLM as Self-Referential Research Instrument

Using the object of study as the tool for studying it

In current academic literature, this cross-disciplinary area remains a systematic blank. Adjacent fields have solid research foundations: epidemiological propagation models for Chinese internet neologisms (Jiang et al., 2021, PLOS ONE), 200-year English semantic drift tracking via word embeddings (Memory & Cognition, 2022), diachronic variation analysis of 250 years of English scientific writing (Frontiers in AI, 2020), and path dependence theory’s explanation of technological lock-in (David, 1985). However, research integrating these into “how cross-linguistic differences in technical terminology evolution speed affect LLM world models” does not yet exist.

The key insight is that the LLM itself is the best instrument for studying this problem. Because an LLM is, at bottom, a system that converts language into high-dimensional vector operations, the following empirical program is entirely feasible (a code sketch follows the steps):

Proposed Experimental Design

Step 1: From the same multilingual LLM (e.g., GPT-4 or Claude), extract the semantic neighborhood distribution of “data center,” “AI data center,” and “AI factory” in the English embedding space.
Step 2: From the same model, extract the semantic neighborhood distribution of “数据中心,” “算力中心,” and “智算中心” in the Chinese embedding space.
Step 3: Compare the topological structural differences between the semantic neighborhoods of the two languages. If the cosine similarity between English-side concepts is significantly higher than between Chinese-side concepts (i.e., the overlap is more severe), this would constitute quantitative evidence of TPAL.
Step 4: Track the vector separation trajectories of related terminology in both languages across time-series corpora (2015→2025) to measure the dynamic evolution of TPAL.
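A minimal sketch of Steps 1–3, assuming a public multilingual sentence-embedding model as a stand-in for the production LLMs named above (the model choice and term lists here are illustrative, not the paper’s):

# Steps 1-3: compare within-language overlap of infrastructure terms.
# Requires: pip install sentence-transformers numpy
# The model below is one public multilingual encoder; any comparable
# multilingual embedding model could be substituted.
from itertools import combinations
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english = ["data center", "AI data center", "AI factory"]
chinese = ["数据中心", "算力中心", "智算中心"]

def mean_pairwise_cosine(terms):
    # Normalized embeddings make the dot product equal cosine similarity.
    vecs = model.encode(terms, normalize_embeddings=True)
    sims = [float(np.dot(a, b)) for a, b in combinations(vecs, 2)]
    return sum(sims) / len(sims)

# TPAL prediction: the English terms overlap more (higher mean similarity).
print("English mean cosine:", mean_pairwise_cosine(english))
print("Chinese mean cosine:", mean_pairwise_cosine(chinese))

Step 4 would repeat this measurement on models or corpora sliced by year (2015→2025); a persistently higher English-side mean similarity is the predicted TPAL signature.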

This research design has a self-referential elegance: GPU matrix operations created LLMs → LLMs convert language into computable vector spaces → and the cross-linguistic naming differences for “GPU matrix operation facilities” can themselves be quantitatively studied through those same vector spaces. The research tool and the research object form a self-referential structure.

Section 10

Conclusion: The Granularity of Language Determines the Resolution of Intelligence

Four core claims and one final proposition

This paper advances the following core claims:

First, systematic differences exist in the cross-linguistic evolution speed of technical terminology, resulting from the combined action of language characteristics, industrial organization modes, and path dependence structures. China generated a new terminology system for the GPU era (“算力中心/智算中心”), while English remains at the stage of adding modifiers to the legacy term “data center.”

Second, this difference directly affects the semantic resolution of LLM training corpora, formalizable through the new variable “Terminology-Physics Alignment Lag (TPAL).” TPAL constrains the ceiling of LLM world model resolution at a level more fundamental than model architecture and training methodology.

Third, the LLM is both the object and the cause of this problem. LLM output generates new corpora, amplifying terminological inertia in corpora through a self-reinforcing loop. This constitutes an AI-era variant of the Sapir-Whorf hypothesis: “The granularity of language constrains the cognitive resolution of LLMs, and LLMs in turn reinforce human language use.”

Fourth, over half of the world’s frontier AI research talent are native Chinese speakers. The systematic semantic downgrade that occurs when they translate precisely differentiated Chinese concepts into English papers is a structural inefficiency in the global AI knowledge production system.

Final Proposition

Whichever language can map the physical world’s changes faster and more precisely will produce corpora that train higher-resolution LLMs. This is a severely underestimated dimension of AI competition. Beyond chip export controls and model architecture races, the evolution speed of language itself is quietly operating as a hidden variable in AI capability. The timely generation of new vocabulary precisely aligned with physical-world changes is not merely a linguistics problem — it is infrastructure for LLM progress.

References

  1. Jiang, M. et al. (2021). “Neologisms are epidemic: Modeling the life cycle of neologisms in China 2008–2016.” PLOS ONE, 16(2), e0245984.
  2. Xu, Y. et al. (2022). “Diachronic semantic change in language is constrained by how people use and learn language.” Memory & Cognition, 50, 1652–1672.
  3. Bizzoni, Y. et al. (2020). “Linguistic Variation and Change in 250 Years of English Scientific Writing.” Frontiers in Artificial Intelligence, 3, 73.
  4. David, P. A. (1985). “Clio and the economics of QWERTY.” American Economic Review, 75(2), 332–337.
  5. Monaghan, P. (2014). “Age of acquisition predicts rate of lexical evolution.” Cognition, 133(1), 93–99.
  6. CSET, Georgetown University. (2024). “Comparing U.S. and Chinese Contributions to High-Impact AI Research.” Data Brief.
  7. Carnegie Endowment. (2025). “Have Top Chinese AI Researchers Stayed in the United States?” Emissary Report.
  8. Stanford HAI. (2025). “The 2025 AI Index Report: Research and Development.”
  9. UNIDO & Dongbi Data. (2025). “Global AI Research Landscape Report (2015–2024).”
  10. Digital Science. (2025). “DeepSeek and the New Geopolitics of AI: China’s ascent to research pre-eminence.” July 2025.
  11. MIIT et al. (2024). “Notice on Promoting the Coordinated Development of New Information Infrastructure.”
  12. NVIDIA. (2024). “AI Factories Are Redefining Data Centers.” GTC 2024 Keynote & NVIDIA Blog.
  13. Sapir, E. (1929). “The Status of Linguistics as a Science.” Language, 5(4), 207–214.
  14. Whorf, B. L. (1956). Language, Thought, and Reality: Selected Writings. MIT Press.
  15. Arthur, W. B. (1994). Increasing Returns and Path Dependence in the Economy. University of Michigan Press.
  16. Li, W. (2024). “Linguistic analysis of Chinese neologisms from 2017 to 2021.” International Journal of Language and Literary Studies.
