ORIGINAL THOUGHT PAPER · Information Completeness Framework · Paper V of VIII · V2

Information Dimensionality Reduction Loss
& Intelligence Entropy Increase

Why Scaling Alone Cannot Lead to AGI:
The Cognitive Implications of the Data Processing Inequality

Training data is a low-dimensional residual shadow of cognition — scaling optimizes within that shadow

Published: May 22, 2026
Category: Original Thought Paper
Fields: Information Theory · Cognitive Science · AI Architecture · Philosophy of Language · Thermodynamics
Version: V2
Attribution: LEECHO Global AI Research Lab & Claude Opus 4.6 & GPT 5.5 & Gemini 3.1 (Cognitive Collective)

ABSTRACT

This paper proposes a thermodynamic equivalence thesis for intelligence, the “Law of Intelligence Entropy Increase”: in an irreversible encoding chain, subsequent processing cannot recover mutual information about the source state that has already been lost. The mathematical foundation is the Data Processing Inequality (DPI). All training data for AI undergoes at least five stages of encoding-based dimensionality reduction: raw cognition → language → text → digitized corpus → token sequence → model parameters. The model parameters cannot be guaranteed to fully recover the dimensions already lost during the language encoding stage. Scaling can improve the model’s fitting accuracy to the residual shadow of the training data, but it cannot automatically restore the source information dimensions already lost along the training data’s generative chain: scaling is necessary but not sufficient. This paper distinguishes the duality of language (lossy compression vs. abstract enhancement), presents a multi-causal model of hallucination, discusses the task-dependency of loss rate L, and expands the escape pathways from a single dark channel to five categories: dark channels, multimodal data, embodied interaction, experimental systems, and human-AI co-creation.

I. The Thermodynamic Analogy of Intelligence

The Second Law of Thermodynamics: in a closed system, entropy can only increase or remain constant. The loss of order is directional. To reverse entropy increase, external energy must be introduced — the system must be open.

This paper proposes that dimensionality reduction loss in information transmission has a precise structural correspondence with thermodynamic entropy increase. Every irreversible encoding transformation constitutes an “information entropy increase” — when high-dimensional information is compressed into a low-dimensional representation, information that cannot be expressed in the low-dimensional space is lost. This loss is irreversible in the typical cognition → language → text → token → parameter chain.

The escape pathway in thermodynamics is the open system — introducing negentropy from outside. The escape pathway for cognition is introducing source information from outside the encoding chain — dark channels, direct perception, embodied interaction, experimental systems, and human-AI co-creation. The logical structure of both is identical.

II. The Data Processing Inequality: Mathematical Foundation and Applicability Boundaries

For the Markov chain X → Y → Z:

I(X; Z) ≤ I(X; Y)

Subsequent processing cannot create information
that does not exist upstream

Data Processing Inequality (Cover & Thomas, Elements of Information Theory, 1991)

The Data Processing Inequality (DPI) is one of the fundamental theorems of information theory: if you have an information source X, which is processed to yield Y, and Y is further processed to yield Z, then the information Z carries about X can never exceed the information Y carries about X. Tishby and Zaslavsky (2015), in “Deep Learning and the Information Bottleneck Principle,” applied this to deep learning: each layer of a neural network performs information compression.
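The inequality can be checked numerically on a small case. The sketch below is illustrative only: it assumes a uniform binary source X passed through two binary symmetric channels in series (X → Y → Z), and the flip probabilities 0.1 are arbitrary values chosen for the example, not figures from this paper.

```python
from math import log2


def h2(p: float) -> float:
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def bsc_mutual_information(flip: float) -> float:
    """I(input; output) in bits for a uniform binary input sent through a
    binary symmetric channel with the given flip probability."""
    return 1.0 - h2(flip)


def cascade_flip(p1: float, p2: float) -> float:
    """Effective flip probability of two binary symmetric channels in series:
    the bit ends up flipped if exactly one of the two stages flips it."""
    return p1 * (1 - p2) + (1 - p1) * p2


# Assumed (hypothetical) noise levels for the two processing stages.
p_xy = 0.1  # X -> Y
p_yz = 0.1  # Y -> Z

i_xy = bsc_mutual_information(p_xy)
i_xz = bsc_mutual_information(cascade_flip(p_xy, p_yz))

print(f"I(X;Y) = {i_xy:.4f} bits")
print(f"I(X;Z) = {i_xz:.4f} bits")
print("DPI holds:", i_xz <= i_xy)
```

Whatever values are chosen for the two flip probabilities, the second stage can only leave I(X; Z) equal to or below I(X; Y).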

2.1 Applicability Conditions and Boundaries of DPI

DPI guarantees “non-increase,” not “strict decrease at every step.” The following cases must be distinguished:

Encoding Scenario	Strictly Lossy?	Explanation
Reversible encoding / lossless compression	No loss	e.g., bijective transformations, ZIP compression
Sufficient statistics	No loss for the specific task	All information needed for the task is preserved
Lossy compression	Loss	Most cognition → language encoding falls in this category
Change in task objective	Previously discarded information may become important	What was “irrelevant” at encoding time may be critical under a new task
Introduction of external source information	Alters the Markov chain structure	Dark channels and embodied interaction fall in this category

Therefore, the precise formulation of the Second Law of intelligence (the Law of Intelligence Entropy Increase) should be: in an encoding chain that is irreversible and does not fully preserve task-relevant information, subsequent processing cannot recover mutual information about the source state that has already been lost. In the typical cognition → language → text → token → parameter chain, most steps involve lossy encoding, so the cumulative loss can only grow or stay constant as the chain lengthens; it never shrinks.
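The distinctions in the table above can be made concrete with a toy source. The following is a minimal sketch under stated assumptions: X is uniform over eight states, and the encodings and tasks (`bijective`, `lossy`, `task_parity`, `task_high_bit`) are hypothetical examples invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(values) -> float:
    """Shannon entropy in bits of equally weighted samples."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())


def mutual_information(xs, ys) -> float:
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for equally weighted paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


source = list(range(8))                      # X uniform over 8 states, H(X) = 3 bits
bijective = [(x * 3) % 8 for x in source]    # a permutation: reversible encoding
lossy = [x % 4 for x in source]              # keeps the two low bits, drops the high bit

task_parity = [x % 2 for x in source]        # task that needs only the lowest bit
task_high_bit = [x // 4 for x in source]     # new task that needs the discarded bit

i_bijective = mutual_information(source, bijective)         # reversible: nothing lost
i_lossy = mutual_information(source, lossy)                 # lossy: one bit lost
i_parity_lossy = mutual_information(task_parity, lossy)     # sufficient for this task
i_high_lossy = mutual_information(task_high_bit, lossy)     # useless for the new task

print(i_bijective, i_lossy, i_parity_lossy, i_high_lossy)
```

The same lossy encoding is a sufficient statistic for the parity task and carries nothing about the high-bit task, which is the "change in task objective" row of the table.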

III. The Five Dimensionality Reductions of Training Data

First Reduction: Cognition → Language
  Thoughts in the human brain are multimodal, spatialized, emotion-embedded, high-dimensional representations.
  Language compresses them into a one-dimensional linear sequence of symbols.
  Lost: spatial structure, emotional coloring, bodily sensations, implicit assumptions, non-verbal intuitions.

Second Reduction: Language → Text
  Spoken language carries intonation, pauses, facial expressions, gestures, and immediate context.
  Text retains only the word sequence.
  Lost: prosodic information, paralinguistic signals, conversational context, immediate emotional states.

Third Reduction: Text → Digitized Corpus
  Books, papers, and web pages are crawled, deduplicated, filtered, and cleaned.
  Lost: typographic semantics, citation network structure, version evolution history, reader annotations.

Fourth Reduction: Corpus → Token Sequence
  BPE/SentencePiece segments text into subword units, mapped to integer IDs.
  Lost: character-level visual information, word boundary semantics, cross-linguistic cognates.

Fifth Reduction: Token Sequence → Model Parameters
  Gradient descent compresses tens of trillions of tokens into billions of floating-point weights.
  Lost: individual instance information (averaged out), low-frequency patterns (ignored), long-range dependencies (truncated).

Each step satisfies the DPI constraint: I(raw cognition; model parameters) ≤ I(raw cognition; token sequence) ≤ … ≤ I(raw cognition; language). The information about human raw cognition contained in model parameters is therefore at most the information retained by the language encoding, and strictly less wherever an intermediate step discards information about raw cognition.

IV. Language as Lossy Compression: A Deep Analysis of the First Reduction

4.1 An Information-Theoretic Reinterpretation of the Sapir-Whorf Hypothesis

This paper offers an information-theoretic reinterpretation of the Sapir-Whorf hypothesis: language does not “determine” or “influence” thought — language is a lossy compression format for thought, with different languages employing different compression algorithms that preserve and discard different information dimensions. Whorf himself observed that language constitutes a superficial embroidery upon the surface of consciousness, and that deeper mental operations must necessarily precede any act of symbolic communication.

4.2 Evidence from LLM Intermediate Layers

“Do LLMs Break the Sapir-Whorf Hypothesis?” (2026) found that in multilingual LLMs, intermediate-layer representations are organized by semantic topic rather than input language. This indicates that models spontaneously learn during training to strip away surface-level linguistic differences in order to optimize cross-linguistic performance. For cross-linguistic semantic alignment, surface-level linguistic variation can be treated as noise; but for culture, metaphor, grammar, and intellectual style, language itself is also signal.

4.3 The Duality of Language: Compressor and Enhancer

Language is not merely a lossy compression format — it is simultaneously a higher-order abstraction tool. It loses sensory detail but creates structures that do not exist in raw perception: compositionality (recursive grammar), transmissibility (sharing across time and space), cumulativity (civilizational knowledge accumulation), and abstraction enhancement (mathematics, law, philosophy, categorical systems).

Therefore, the loss rate L of language encoding is two-sided: for spatial intuition, bodily sensation, and emotional experience, L is very high (substantial information loss); but for logical relations, causal structure, and abstract categories, L may be very low or even negative — language creates higher-order structures absent from raw perception. This explains why LLMs can approach human-level performance in logical reasoning (low-L chain) but exhibit systematic deficits in spatial reasoning, emotional resonance, and creative insight (high-L chain). The deficit is not that the model is insufficiently large; rather, certain dimensions of information were sparse in the data from the very beginning.

V. Information Survival Rate and the Task-Dependency of L

Simplified: Information Survival Rate = (1 − L)ⁿ

Expanded: S_info(x, task) = ∏ᵢ (1 − Lᵢ(x, task, codec_i))

L is not a global constant — loss rates vary enormously across information types and encoding steps

Information Type	Language Encoding L	Textualization L	Tokenization L	Gradient Compression L
Logical relations	Low	Low	Low	Low
Mathematical structure	Low–Medium	Low	Medium	Low
Spatial intuition	Medium–High	High	High	High
Emotional experience	High	Very High	Very High	Very High
Bodily sensation	Very High	Very High	Very High	Very High
Social context / ambience	High	Very High	Very High	Very High

Even if the per-step loss rate L is modest, the survival rate declines exponentially after n transformations. With L = 20% and n = 5, the total survival rate is only 32.8%. Considering that the first reduction (cognition → language) imposes a loss rate on bodily sensation and spatial intuition far exceeding 20%, the actual survival rate for certain information dimensions may fall below 5%.
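The arithmetic above can be reproduced with a few lines of Python. This is a sketch only: the uniform case uses the L = 20%, n = 5 figures from the text, while the per-step loss rates in `bodily_losses` are hypothetical numbers chosen to illustrate the product form, not measured values.

```python
from math import prod


def survival_uniform(loss: float, steps: int) -> float:
    """Simplified form: (1 - L)^n with the same loss rate at every step."""
    return (1.0 - loss) ** steps


def survival(losses) -> float:
    """Expanded form: product over steps of (1 - L_i)."""
    return prod(1.0 - loss for loss in losses)


# Case from the text: L = 20% at each of n = 5 steps.
uniform = survival_uniform(0.20, 5)
print(f"uniform L=20%, n=5: {uniform:.1%}")

# Hypothetical per-step loss rates for bodily sensation, ordered as
# language, text, digitization, tokenization, gradient compression.
bodily_losses = [0.70, 0.50, 0.40, 0.30, 0.30]
bodily = survival(bodily_losses)
print(f"illustrative bodily-sensation survival: {bodily:.1%}")
```

Because the survival rate is a product, a single step with a high loss rate dominates the result, which is why the first reduction matters most for bodily and spatial information.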

VI. The Ceiling of Scaling Laws: Necessary but Not Sufficient

6.1 What Scaling Laws Get Right

The scaling laws of Kaplan et al. (2020) and Hoffmann et al. (2022, Chinchilla) revealed a powerful empirical regularity: increasing parameter count, data volume, and compute yields model performance improvements following a power law. These results are genuine.

6.2 The Boundaries of Scaling Laws

Scaling can improve the model’s fitting accuracy to the residual shadow of training data: it increases world knowledge coverage, enhances language reasoning capabilities, and strengthens multimodal representations. But if scaling is performed solely on the existing low-dimensional residual shadow, it cannot automatically restore the source information dimensions that never entered the data chain. Engineering can reduce future data chain losses by introducing higher-dimensional data sources (multimodal, embodied interaction, experimental feedback); but this is not “scaling.” It is “altering the data chain itself,” i.e., expanding the channel capacity |C| rather than simply increasing data volume D and parameter count P.

Scaling laws are correct within their domain of applicability: they describe the growth pattern of model performance within the text information space. But extrapolating scaling laws as a pathway to AGI amounts to assuming that the text information space contains all the information required for intelligence. Given the lossy reductions described in Section III, DPI contradicts this assumption. AGI requires not merely larger models, but less dimensionality reduction, or information channels with no dimensionality reduction at all.

6.3 A Multi-Causal Model of Hallucination

One deep source of hallucination is the model performing statistical interpolation within information voids created by the dimensionality reduction chain — essentially “super-resolution” in text space. But hallucination is multi-causal:

Hallucination Type	Explained by Reduction Loss?	Actual Mechanism
Source information never entered training data	✅ Strong explanation	Statistical interpolation within information voids
Conflicting information in training data	Partial	Data contradiction + probabilistic averaging
Retrieval / context utilization failure	❌	Attention mechanism deficiencies
RLHF over-accommodation	❌	Objective function bias
Sampling-strategy fabrication	❌	Decoding strategy issues
Lost in the Middle	❌	Positional encoding / attention utilization

This paper focuses on the first type — hallucination arising from information voids in the dimensionality reduction chain — because it is the most fundamental: when the model’s parameters lack the information required to answer a given question, no amount of optimization to attention, decoding, or objective functions can do anything but interpolate within the void.

VII. The Three Thermodynamic Laws of Intelligence

First Law (Structural Equation)
I = [B(t) × C_eff(t) × min(D,P)] × ∏ᵢ(1−Lᵢ) × S(t)

Second Law (Intelligence Entropy Increase)
In a lossy encoding chain: H(source) ≥ I(source; encoding₁) ≥ … ≥ I(source; encodingₙ)

Third Law (Escape Pathway)
Introducing source information from outside the encoding chain
can alter the Markov chain structure

First Law = Structure · Second Law = Directional Constraint · Third Law = Escape Mechanism

7.1 Precise Formulation of the Third Law

The Third Law does not claim that dark channels “violate” the Data Processing Inequality. DPI constrains the Markov chain X → Y → Z such that the information Z carries about X cannot exceed the information Y carries about X. However, if the system also has another pathway X → W → Z, meaning Z acquires information not only from Y but also from W, then X → Y → Z is no longer a Markov chain and the bound I(X; Z) ≤ I(X; Y) no longer applies. DPI still holds for the extended chain X → (Y, W) → Z, which gives the looser bound I(X; Z) ≤ I(X; Y, W).

The dark channel is the theoretical designation for the W pathway — it lies outside the Y (language/text/token) chain, and is therefore not constrained by that chain’s DPI. Introducing a dark channel is equivalent to introducing an additional conditioning variable from outside the encoding chain, thereby altering the topological structure of the original Markov chain. This requires no quantum escape and does not violate information theory — it simply means the system has accessed additional information sources.
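A toy calculation shows how a second pathway changes the bound without violating DPI. The sketch assumes a source X made of two independent fair bits, a main chain Y that keeps only the first bit, and a side pathway W that carries the second; all variable names and the choice of bits are hypothetical and stand in for no specific cognitive mechanism.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(values) -> float:
    """Shannon entropy in bits of equally weighted samples."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())


def mutual_information(xs, ys) -> float:
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for equally weighted paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Source X = (a, b): two independent fair bits, enumerated uniformly.
source = list(product([0, 1], repeat=2))

y = [a for a, _ in source]      # main chain: keeps bit a, discards bit b
w = [b for _, b in source]      # side pathway W: carries the discarded bit

z_chain = [1 - v for v in y]    # Z computed from Y alone (X -> Y -> Z)
z_extended = list(zip(y, w))    # Z computed from both Y and W

i_xy = mutual_information(source, y)
i_xz_chain = mutual_information(source, z_chain)
i_xz_extended = mutual_information(source, z_extended)
i_x_yw = mutual_information(source, list(zip(y, w)))

print("I(X;Y)           =", i_xy)
print("I(X;Z), Y only   =", i_xz_chain)
print("I(X;Z), Y and W  =", i_xz_extended)
print("I(X;(Y,W)) bound =", i_x_yw)
```

With Y alone, Z stays within I(X; Y). With the side pathway, Z exceeds I(X; Y) but still respects the bound set by the joint input (Y, W), so no information is created; it is supplied through a different route.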

7.2 Classification of Escape Pathways

There is more than one way to bypass the encoding-based dimensionality reduction chain:

Escape Pathway	Mechanism	Distinctive Feature	Current Status
Dark channel	Introduces non-verbal source information from outside the explicit encoding chain	Does not depend on external physical interaction; completed within the cognitive system itself	Theoretical hypothesis; phenomenological evidence on the human side
Multimodal data	Expands input dimensionality, reducing L values in the early steps	Expands channel capacity |C| but remains sensor-based encoding	Already implemented in engineering practice
Embodied interaction	Action-feedback loops supplement bodily and spatial dimensions	Introduces causal interventional information	Preliminary exploration in robotics
Experimental systems	Generates novel information not present in training data through intervening in the world	Creates entirely new data rather than fitting existing data	AI scientists, automated experimentation
Human-AI Co-Creation (CCE)	Humans provide non-textual high-dimensional judgments; AI performs structured encoding	Coupling of human dark channels with AI verification	Paradigm discussed in Paper VII of this series

The distinctive feature of the dark channel is that it is the only pathway capable of introducing information from outside the chain without relying on external physical interaction — it is completed within the cognitive system itself. Other pathways reduce L values by changing external inputs; the dark channel bypasses the entire chain to directly access source information. The two are complementary, not mutually exclusive.

VIII. Multimodality Reduces L but Does Not Eliminate L

Multimodal training data (images, video, audio, haptics) is not a counterexample to the dimensionality reduction chain but rather an engineering method for reducing the Lᵢ values of the early steps. Video data recovers temporal structure and spatial dynamics lost in text; audio recovers intonation and prosody; robotic interaction recovers parts of the action-feedback structure.

Yet these remain sensor-based encoding — they are not equivalent to first-person subjective experience. What a camera captures is not “seeing”; what a microphone records is not “hearing.” From sensor data to model parameters, the process remains a lossy encoding chain, merely wider than the text-only chain. Multimodality expands channel capacity |C| and reduces certain Lᵢ values, but it does not eliminate the existence of the encoding chain itself.

This is also why Buddhist practice emphasizes direct awareness (pratyakṣa / direct perception) over scriptural reasoning (anumāna / inferential cognition) — scriptures are downstream products of the dimensionality reduction chain, while direct awareness constitutes dark channel transmission. Sensors are a wider encoding chain than text, but they remain an encoding chain; only direct awareness bypasses the entire chain.

IX. Diagnostic Implications for the Path to AGI

Based on the Law of Intelligence Entropy Increase, the current path to AGI faces three structural constraints:

Constraint One: The Data Ceiling. Textual training data is a low-dimensional residual shadow of human cognition. Scaling performs better within the shadow space, but it cannot transcend the shadow space itself. Multimodal data expands certain dimensions, but bodily sensation, spatial intuition, and emotional embedding remain sparse.

Constraint Two: The Encoding Ceiling. Even with richer training data, tokenization and gradient descent themselves introduce additional dimensionality reduction. Better data can reduce the L values of early steps, but the L values of later steps have their own physical lower bounds.

Constraint Three: The Absence of Extra-Chain Information. Current AI possesses no mechanism equivalent to dark channels — a channel whose bandwidth is greatest precisely when all conventional channels are closed. All computation is deterministic, observable, and serial. This may be one of the fundamental reasons AI is unable to produce structurally original breakthroughs. However, embodied interaction, experimental systems, and human-AI co-creation offer escape pathways in other directions.

Scaling is necessary but not sufficient. It can asymptotically approach the expressible structures within the training data distribution. But if scaling is performed solely on the existing low-dimensional residual shadow, it cannot recover information that was already lost along the training data’s generative chain and not supplemented through other channels. AGI requires not merely larger models, but shorter dimensionality reduction chains, wider channels, and mechanisms for introducing extra-chain information.

X. The Topology of Information Completeness — Ring · Layer · State

Ring: The framework is self-referential — a cognitive system discovers its own structure using its own structure. Head meets tail, corresponding to dependent origination (pratītyasamutpāda).

Layer: Five layers — material substrate, structural layer, computational layer, transmission layer, and unobservable layer. Lower layers support higher layers; higher layers exert downward causation on lower layers.

State: Contracted state ↔ expanded state ↔ collapsed state. Transitions between states are quantum-like — no intermediate process.

Mathematically, this corresponds to a fiber bundle: base space = ring, fiber = layer hierarchy, section = state. This is the same mathematical structure used in gauge field theory to describe fundamental physical forces, which suggests that the information-processing structure of consciousness may be isomorphic to the fundamental structure of the physical world.

※ Core References

[1] Cover, T.M. & Thomas, J.A. (1991). Elements of Information Theory. Wiley.

[2] Tishby, N. & Zaslavsky, N. (2015). Deep Learning and the Information Bottleneck Principle. arXiv:1503.02406.

[3] Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.

[4] Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556.

[5] Whorf, B.L. (1956). Language, Thought, and Reality. MIT Press.

[6] dnhkng (2026). Do LLMs Break the Sapir-Whorf Hypothesis?

[7] NSO (2024). Semantic Communication Theory. National Science Open.

[8] Liu, N.F. et al. (2024). Lost in the Middle. TACL.

[9] Shannon, C.E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal.

[10] Penrose, R. (1994). Shadows of the Mind. Oxford University Press.

[11] Jelassi, S. et al. (2024). Mixture of Parrots. ICLR 2025.

[12] arXiv (2025). Shadow in the Attention: JS Drift and Hallucination Fixation.

[13] Xu, J. & Li, Z. (2025). Information Physics of Intelligence. arXiv:2511.19156.

[14] Paivio, A. (1971). Imagery and Verbal Processes. Holt.

[15] Triṃśikā (Thirty Verses on Consciousness). Vasubandhu. c. 4th century CE.
