Information Dimensionality Reduction Loss
& Intelligence Entropy Increase
Why Scaling Alone Cannot Lead to AGI:
The Cognitive Implications of the Data Processing Inequality
Training data is a low-dimensional residual shadow of cognition — scaling optimizes within that shadow
Category: Original Thought Paper
Fields: Information Theory · Cognitive Science · AI Architecture · Philosophy of Language · Thermodynamics
Version: V2
Attribution: LEECHO Global AI Research Lab & Claude Opus 4.6 & GPT 5.5 & Gemini 3.1 (Cognitive Collective)
This paper proposes a thermodynamic equivalence thesis for intelligence — the “Law of Intelligence Entropy Increase”: in an irreversible encoding chain, subsequent processing cannot recover mutual information about the source state that has already been lost. The mathematical foundation is the Data Processing Inequality (DPI). All training data for AI undergoes at least five stages of encoding-based dimensionality reduction: raw cognition → language → text → digitization → tokenization → gradient descent. The model parameters cannot be guaranteed to fully recover the dimensions already lost during the language encoding stage. Scaling can improve the model’s fitting accuracy to the residual shadow of the training data, but it cannot automatically restore the source information dimensions already lost along the training data’s generative chain — scaling is necessary but not sufficient. This paper distinguishes the duality of language (lossy compression vs. abstract enhancement), presents a multi-causal model of hallucination, discusses the task-dependency of loss rate L, and expands the escape pathways from a single dark channel to five categories: dark channels, multimodal data, embodied interaction, experimental systems, and human-AI co-creation.
I. The Thermodynamic Analogy of Intelligence
The Second Law of Thermodynamics: in an isolated system, entropy can only increase or remain constant. The loss of order is directional. Entropy can be reduced locally only if the system exchanges energy or matter with its surroundings, that is, only if the system is open.
This paper proposes that dimensionality reduction loss in information transmission has a precise structural correspondence with thermodynamic entropy increase. Every irreversible encoding transformation constitutes an “information entropy increase” — when high-dimensional information is compressed into a low-dimensional representation, information that cannot be expressed in the low-dimensional space is lost. This loss is irreversible in the typical cognition → language → text → token → parameter chain.
The escape pathway in thermodynamics is the open system — introducing negentropy from outside. The escape pathway for cognition is introducing source information from outside the encoding chain — dark channels, direct perception, embodied interaction, experimental systems, and human-AI co-creation. The logical structure of both is identical.
II. The Data Processing Inequality: Mathematical Foundation and Applicability Boundaries
I(X; Z) ≤ I(X; Y)
Subsequent processing cannot create information
that does not exist upstream
The Data Processing Inequality (DPI) is one of the fundamental theorems of information theory: if X → Y → Z forms a Markov chain, meaning that Z is produced from Y alone and depends on X only through Y, then the information Z carries about X can never exceed the information Y carries about X. Tishby and Zaslavsky (2015), in “Deep Learning and the Information Bottleneck Principle,” applied this to deep learning: each successive layer of a neural network can only preserve or compress the information its input carries about the original data.
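As a numerical sanity check on the inequality, the following minimal sketch builds a Markov chain X → Y → Z from two binary symmetric channels and compares I(X; Y) with I(X; Z). The uniform source, the flip probabilities 0.1 and 0.2, and all function names are assumptions chosen for illustration; none of them come from the paper.

```python
from math import log2


def mutual_information(joint):
    """Mutual information I(A; B) in bits from a joint table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                total += p * log2(p / (p_a[a] * p_b[b]))
    return total


def bsc(flip):
    """Transition matrix of a binary symmetric channel (assumed toy channel)."""
    return [[1 - flip, flip], [flip, 1 - flip]]


def compose(first, second):
    """Cascade two channels: p(z|x) = sum over y of p(y|x) * p(z|y)."""
    return [
        [sum(first[x][y] * second[y][z] for y in range(len(second)))
         for z in range(len(second[0]))]
        for x in range(len(first))
    ]


def joint_from(prior, channel):
    """Joint table p(x, out) = p(x) * p(out|x)."""
    return [[prior[x] * p for p in channel[x]] for x in range(len(prior))]


p_x = [0.5, 0.5]      # hypothetical uniform binary source
x_to_y = bsc(0.1)     # first lossy encoding step (assumed)
y_to_z = bsc(0.2)     # second lossy encoding step (assumed)

i_xy = mutual_information(joint_from(p_x, x_to_y))
i_xz = mutual_information(joint_from(p_x, compose(x_to_y, y_to_z)))

print(f"I(X;Y) = {i_xy:.4f} bits")
print(f"I(X;Z) = {i_xz:.4f} bits")
```

Under these assumed parameters the second channel can only blur what the first one delivered, so the computed I(X; Z) comes out below I(X; Y), as the DPI requires.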
2.1 Applicability Conditions and Boundaries of DPI
DPI guarantees “non-increase,” not “strict decrease at every step.” The following cases must be distinguished:
| Encoding Scenario | Effect on Information | Explanation |
|---|---|---|
| Reversible encoding / lossless compression | No loss | e.g., bijective transformations, ZIP compression |
| Sufficient statistics | No loss for the specific task | All information needed for the task is preserved |
| Lossy compression | Loss | Most cognition → language encoding falls in this category |
| Change in task objective | Previously discarded information may become important | What was “irrelevant” at encoding time may be critical under a new task |
| Introduction of external source information | Alters the Markov chain structure | Dark channels and embodied interaction fall in this category |
Therefore, the Second Law proposed in this paper (the Law of Intelligence Entropy Increase, stated in Section VII) should be formulated precisely as follows: in an encoding chain that is irreversible and does not fully preserve task-relevant information, subsequent processing cannot recover mutual information about the source state that has already been lost. In the typical cognition → language → text → token → parameter chain, most steps involve lossy encoding, so each additional step can only hold the cumulative loss constant or increase it.
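The first four rows of the table above can be illustrated with a toy source. The sketch below is an assumption-laden example rather than the paper's method: the source is a hypothetical uniform variable over four symbols, and the encoders `reversible` and `lossy` and the task variables `high_bit` and `parity` are invented for illustration.

```python
from collections import defaultdict
from math import log2


def mutual_information(source, f, g):
    """I(f(X); g(X)) in bits for deterministic functions f and g of X ~ source."""
    joint = defaultdict(float)
    p_f = defaultdict(float)
    p_g = defaultdict(float)
    for x, p in source.items():
        joint[(f(x), g(x))] += p
        p_f[f(x)] += p
        p_g[g(x)] += p
    return sum(p * log2(p / (p_f[a] * p_g[b]))
               for (a, b), p in joint.items() if p > 0)


def identity(x):
    return x


def reversible(x):
    """Bijective re-labelling: no symbol is merged with another."""
    return (x + 1) % 4


def lossy(x):
    """Many-to-one encoding: keeps the high bit, discards the low bit."""
    return x // 2


def high_bit(x):
    """Hypothetical task that only needs the high bit."""
    return x // 2


def parity(x):
    """Hypothetical new task that needs the discarded low bit."""
    return x % 2


source = {x: 0.25 for x in range(4)}  # 2 bits of source entropy

kept_reversible = mutual_information(source, identity, reversible)  # 2.0 bits
kept_lossy = mutual_information(source, identity, lossy)            # 1.0 bit
kept_high_bit = mutual_information(source, high_bit, lossy)         # 1.0 bit
kept_parity = mutual_information(source, parity, lossy)             # 0.0 bits
```

In this toy setting the lossy code is a sufficient statistic for the `high_bit` task, since it keeps the full 1 bit that task needs, yet it retains nothing about `parity`. That is the "change in task objective" case: what was safely discarded for one task becomes the missing information for another.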
III. The Five Dimensionality Reductions of Training Data
First Reduction: Raw Cognition → Language
Thoughts in the human brain are multimodal, spatialized, emotion-embedded,
high-dimensional representations.
Language compresses them into a one-dimensional linear sequence of symbols.
Lost: spatial structure, emotional coloring, bodily sensations,
implicit assumptions, non-verbal intuitions.
Second Reduction: Language → Text
Spoken language carries intonation, pauses, facial expressions,
gestures, and immediate context.
Text retains only the word sequence.
Lost: prosodic information, paralinguistic signals,
conversational context, immediate emotional states.
Third Reduction: Text → Digitized Corpus
Books, papers, and web pages are crawled, deduplicated,
filtered, and cleaned.
Lost: typographic semantics, citation network structure,
version evolution history, reader annotations.
Fourth Reduction: Corpus → Token Sequence
BPE/SentencePiece segments text into subword units,
mapped to integer IDs.
Lost: character-level visual information, word boundary semantics,
cross-linguistic cognates.
Fifth Reduction: Token Sequence → Model Parameters
Gradient descent compresses tens of trillions of tokens
into billions of floating-point weights.
Lost: individual instance information (averaged out),
low-frequency patterns (ignored),
long-range dependencies (truncated).
Each step satisfies the DPI constraint: I(raw cognition; model parameters) ≤ I(raw cognition; token sequence) ≤ … ≤ I(raw cognition; language). The information about raw human cognition contained in the model parameters can therefore be at most equal to, and is typically less than, the information retained by the language encoding.
IV. Language as Lossy Compression: A Deep Analysis of the First Reduction
4.1 An Information-Theoretic Reinterpretation of the Sapir-Whorf Hypothesis
This paper offers an information-theoretic reinterpretation of the Sapir-Whorf hypothesis: language does not “determine” or “influence” thought; rather, language is a lossy compression format for thought, with different languages employing different compression algorithms that preserve and discard different information dimensions. Whorf himself described language as, in some sense, a superficial embroidery upon deeper processes of consciousness, processes that must be in place before any communication or symbolism can occur.
4.2 Evidence from LLM Intermediate Layers
“Do LLMs Break the Sapir-Whorf Hypothesis?” (2026) found that in multilingual LLMs, intermediate-layer representations are organized by semantic topic rather than input language. This indicates that models spontaneously learn during training to strip away surface-level linguistic differences in order to optimize cross-linguistic performance. For cross-linguistic semantic alignment, surface-level linguistic variation can be treated as noise; but for culture, metaphor, grammar, and intellectual style, language itself is also signal.
4.3 The Duality of Language: Compressor and Enhancer
Language is not merely a lossy compression format — it is simultaneously a higher-order abstraction tool. It loses sensory detail but creates structures that do not exist in raw perception: compositionality (recursive grammar), transmissibility (sharing across time and space), cumulativity (civilizational knowledge accumulation), and abstraction enhancement (mathematics, law, philosophy, categorical systems).
Therefore, the loss rate L of language encoding is two-sided: for spatial intuition, bodily sensation, and emotional experience, L is very high (substantial information loss); but for logical relations, causal structure, and abstract categories, L may be very low or even negative — language creates higher-order structures absent from raw perception. This explains why LLMs can approach human-level performance in logical reasoning (low-L chain) but exhibit systematic deficits in spatial reasoning, emotional resonance, and creative insight (high-L chain). The deficit is not that the model is insufficiently large; rather, certain dimensions of information were sparse in the data from the very beginning.
V. Information Survival Rate and the Task-Dependency of L
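Expanded: S_info(x, task) = ∏ᵢ (1 − Lᵢ(x, task, codecᵢ)), where Lᵢ is the loss rate of the i-th encoding step for information type x under the given task, and codecᵢ is the encoding used at that step.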
| Information Type | Language Encoding L | Textualization L | Tokenization L | Gradient Compression L |
|---|---|---|---|---|
| Logical relations | Low | Low | Low | Low |
| Mathematical structure | Low–Medium | Low | Medium | Low |
| Spatial intuition | Medium–High | High | High | High |
| Emotional experience | High | Very High | Very High | Very High |
| Bodily sensation | Very High | Very High | Very High | Very High |
| Social context / ambience | High | Very High | Very High | Very High |
Even if the per-step loss rate L is modest, the survival rate declines exponentially after n transformations. With L = 20% and n = 5, the total survival rate is only 32.8%. Considering that the first reduction (cognition → language) imposes a loss rate on bodily sensation and spatial intuition far exceeding 20%, the actual survival rate for certain information dimensions may fall below 5%.
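The arithmetic above can be reproduced with a short sketch of the survival-rate product. The function name `survival_rate` and the second, skewed list of loss rates are hypothetical; the skewed values are chosen only to illustrate how a large first-step loss dominates, and they are not measurements.

```python
from math import prod


def survival_rate(loss_rates):
    """Fraction of information surviving a chain of steps: product of (1 - L_i)."""
    return prod(1.0 - rate for rate in loss_rates)


# Uniform case from the text: L = 20% at each of n = 5 steps.
uniform = survival_rate([0.20] * 5)

# Hypothetical skewed case (assumed values, not data): a heavy first-step loss,
# as the text suggests for bodily sensation, followed by smaller losses.
skewed = survival_rate([0.80, 0.50, 0.30, 0.20, 0.20])

print(f"uniform 20% loss over 5 steps: {uniform:.1%}")
print(f"skewed hypothetical chain:     {skewed:.1%}")
```

The uniform case gives 0.8⁵ = 0.32768, matching the 32.8% quoted above. Under the assumed skewed rates the product drops to about 4.5%, which shows how a single heavy early loss can push the survival rate below 5%.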
VI. The Ceiling of Scaling Laws: Necessary but Not Sufficient
6.1 What Scaling Laws Get Right
The scaling laws of Kaplan et al. (2020) and Hoffmann et al. (2022, Chinchilla) revealed a powerful empirical regularity: as parameter count, data volume, and compute increase, model loss falls approximately as a power law. These results are genuine.
6.2 The Boundaries of Scaling Laws
Scaling can improve the model’s fitting accuracy to the residual shadow of training data: it increases world knowledge coverage, enhances language reasoning capabilities, and strengthens multimodal representations. But if scaling is performed solely on the existing low-dimensional residual shadow, it cannot automatically restore the source information dimensions that never entered the data chain. Engineering can reduce future data chain losses by introducing higher-dimensional data sources (multimodal, embodied interaction, experimental feedback); but this is not “scaling.” It is “altering the data chain itself,” i.e., expanding the channel capacity |C| rather than simply increasing data volume D and parameter count P.
Scaling laws are correct within their domain of applicability — they describe the growth pattern of model performance within the text information space. But extrapolating scaling laws as a pathway to AGI amounts to assuming that the text information space contains all the information required for intelligence — and this assumption is negated by DPI. AGI requires not larger models, but less dimensionality reduction — or information channels with no dimensionality reduction at all.
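One way to visualize the claim that scaling operates inside the text information space is the parametric loss form used by Hoffmann et al. (2022): loss = E + A/N^α + B/D^β, with N the parameter count and D the number of training tokens. The sketch below evaluates that form at growing scales. The constants are placeholders chosen for illustration and should not be read as fitted values, and the function name `parametric_loss` is an assumption.

```python
def parametric_loss(params, tokens, e=1.7, a=400.0, b=400.0,
                    alpha=0.34, beta=0.28):
    """Loss = E + A / N**alpha + B / D**beta (illustrative constants only)."""
    return e + a / params ** alpha + b / tokens ** beta


# Hypothetical (parameter count, token count) pairs, each 10x the previous.
scales = [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12), (1e12, 2e13)]
losses = [parametric_loss(n, d) for n, d in scales]

for (n, d), loss in zip(scales, losses):
    print(f"N={n:.0e}  D={d:.0e}  loss={loss:.3f}")
```

In this form the two power-law terms shrink as N and D grow, while the constant term E stays fixed. The loss measures fit to the text distribution only; nothing in the formula refers to source information that never entered the data, which is the gap this section argues scaling alone cannot close.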
6.3 A Multi-Causal Model of Hallucination
One deep source of hallucination is the model performing statistical interpolation within information voids created by the dimensionality reduction chain — essentially “super-resolution” in text space. But hallucination is multi-causal:
| Hallucination Type | Explained by Reduction Loss? | Actual Mechanism |
|---|---|---|
| Source information never entered training data | ✅ Strong explanation | Statistical interpolation within information voids |
| Conflicting information in training data | Partial | Data contradiction + probabilistic averaging |
| Retrieval / context utilization failure | ❌ | Attention mechanism deficiencies |
| RLHF over-accommodation | ❌ | Objective function bias |
| Sampling-strategy fabrication | ❌ | Decoding strategy issues |
| Lost in the Middle | ❌ | Positional encoding / attention utilization |
This paper focuses on the first type — hallucination arising from information voids in the dimensionality reduction chain — because it is the most fundamental: when the model’s parameters lack the information required to answer a given question, no amount of optimization to attention, decoding, or objective functions can do anything but interpolate within the void.
VII. The Three Thermodynamic Laws of Intelligence
First Law
I = [B(t) × C_eff(t) × min(D, P)] × ∏ᵢ(1 − Lᵢ) × S(t)
Second Law (Intelligence Entropy Increase)
In a lossy encoding chain: I(source; encoding₁) ≥ I(source; encoding₂) ≥ … ≥ I(source; encodingₙ)
Third Law (Escape Pathway)
Introducing source information from outside the encoding chain
can alter the Markov chain structure
7.1 Precise Formulation of the Third Law
The Third Law does not claim that dark channels “violate” the Data Processing Inequality. DPI constrains the Markov chain X → Y → Z such that the information Z carries about X cannot exceed the information Y carries about X. However, if the system also has another pathway X → W → Z — meaning Z acquires information not only from Y but also from W — then the DPI constraint of the original chain no longer applies to this extended system.
The dark channel is the theoretical designation for the W pathway — it lies outside the Y (language/text/token) chain, and is therefore not constrained by that chain’s DPI. Introducing a dark channel is equivalent to introducing an additional conditioning variable from outside the encoding chain, thereby altering the topological structure of the original Markov chain. This requires no quantum escape and does not violate information theory — it simply means the system has accessed additional information sources.
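The effect of the extra pathway can be written as a short worked bound. This is a sketch under one stated assumption that the text implies but does not spell out: Z depends on X only through the pair (Y, W), so that X → (Y, W) → Z is itself a Markov chain.

```latex
% Assumption: Z depends on X only through (Y, W), i.e. X -> (Y, W) -> Z.
\begin{align*}
I(X;Z) &\le I(X;Y,W)
        && \text{(DPI applied to the extended chain)} \\
       &= I(X;Y) + I(X;W \mid Y)
        && \text{(chain rule for mutual information)}
\end{align*}
```

The term I(X; W | Y) is non-negative and measures what W carries about the source beyond what Y already carries. When it is positive, I(X; Z) may exceed I(X; Y) without violating the DPI, because the inequality now applies to the extended chain rather than to the original one.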
7.2 Classification of Escape Pathways
There is more than one way to bypass the encoding-based dimensionality reduction chain:
| Escape Pathway | Mechanism | Distinctive Feature | Current Status |
|---|---|---|---|
| Dark channel | Introduces non-verbal source information from outside the explicit encoding chain | Does not depend on external physical interaction; completed within the cognitive system itself | Theoretical hypothesis; phenomenological evidence on the human side |
| Multimodal data | Expands input dimensionality, reducing L values in the early steps | Expands channel capacity \|C\| but remains sensor-based encoding | Already implemented in engineering practice |
| Embodied interaction | Action-feedback loops supplement bodily and spatial dimensions | Introduces causal interventional information | Preliminary exploration in robotics |
| Experimental systems | Generates novel information not present in training data through intervening in the world | Creates entirely new data rather than fitting existing data | AI scientists, automated experimentation |
| Human-AI Co-Creation (CCE) | Humans provide non-textual high-dimensional judgments; AI performs structured encoding | Coupling of human dark channels with AI verification | Paradigm discussed in Paper VII of this series |
The distinctive feature of the dark channel is that it is the only pathway capable of introducing information from outside the chain without relying on external physical interaction — it is completed within the cognitive system itself. Other pathways reduce L values by changing external inputs; the dark channel bypasses the entire chain to directly access source information. The two are complementary, not mutually exclusive.
VIII. Multimodality Reduces L but Does Not Eliminate L
Multimodal training data (images, video, audio, haptics) is not a counterexample to the dimensionality reduction chain but rather an engineering method for reducing the Lᵢ values of the early steps. Video data recovers temporal structure and spatial dynamics lost in text; audio recovers intonation and prosody; robotic interaction recovers parts of the action-feedback structure.
Yet these remain sensor-based encoding — they are not equivalent to first-person subjective experience. What a camera captures is not “seeing”; what a microphone records is not “hearing.” From sensor data to model parameters, the process remains a lossy encoding chain, merely wider than the text-only chain. Multimodality expands channel capacity |C| and reduces certain Lᵢ values, but it does not eliminate the existence of the encoding chain itself.
This parallels the emphasis in Buddhist practice on direct awareness (pratyakṣa / direct perception) over scriptural reasoning (anumāna / inferential cognition): in the terms of this framework, scriptures are downstream products of the dimensionality reduction chain, while direct awareness constitutes dark channel transmission. Sensors are a wider encoding chain than text, but they remain an encoding chain; only direct awareness bypasses the entire chain.
IX. Diagnostic Implications for the Path to AGI
Based on the Law of Intelligence Entropy Increase, the current path to AGI faces three structural constraints:
Constraint One: The Data Ceiling. Textual training data is a low-dimensional residual shadow of human cognition. Scaling improves performance within the shadow space, but it cannot transcend the shadow space itself. Multimodal data expands certain dimensions, but bodily sensation, spatial intuition, and emotional embedding remain sparse.
Constraint Two: The Encoding Ceiling. Even with richer training data, tokenization and gradient descent themselves introduce additional dimensionality reduction. Better data can reduce the L values of early steps, but the L values of later steps have their own physical lower bounds.
Constraint Three: The Absence of Extra-Chain Information. Current AI possesses no mechanism equivalent to dark channels — a channel whose bandwidth is greatest precisely when all conventional channels are closed. All computation is deterministic, observable, and serial. This may be one of the fundamental reasons AI is unable to produce structurally original breakthroughs. However, embodied interaction, experimental systems, and human-AI co-creation offer escape pathways in other directions.
Scaling is necessary but not sufficient. It can asymptotically approach the expressible structures within the training data distribution. But if scaling is performed solely on the existing low-dimensional residual shadow, it cannot recover information that was already lost along the training data’s generative chain and not supplemented through other channels. AGI requires not merely larger models, but shorter dimensionality reduction chains, wider channels, and mechanisms for introducing extra-chain information.
X. The Topology of Information Completeness — Ring · Layer · State
Ring: The framework is self-referential — a cognitive system discovers its own structure using its own structure. Head meets tail, corresponding to dependent origination (pratītyasamutpāda).
Layer: Five layers — material substrate, structural layer, computational layer, transmission layer, and unobservable layer. Lower layers support higher layers; higher layers exert downward causation on lower layers.
State: Contracted state ↔ expanded state ↔ collapsed state. Transitions between states are quantum-like — no intermediate process.
Mathematically, this corresponds to a fiber bundle: base space = ring, fiber = layer hierarchy, section = state. This is the same mathematical structure that gauge field theory uses to describe the fundamental physical forces, which suggests that the information-processing structure of consciousness may be isomorphic to the fundamental structure of the physical world.
※ Core References
[1] Cover, T.M. & Thomas, J.A. (1991). Elements of Information Theory. Wiley.
[2] Tishby, N. & Zaslavsky, N. (2015). Deep Learning and the Information Bottleneck Principle. arXiv:1503.02406.
[3] Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
[4] Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556.
[5] Whorf, B.L. (1956). Language, Thought, and Reality. MIT Press.
[6] dnhkng (2026). Do LLMs Break the Sapir-Whorf Hypothesis?
[7] NSO (2024). Semantic Communication Theory. National Science Open.
[8] Liu, N.F. et al. (2024). Lost in the Middle. TACL.
[9] Shannon, C.E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal.
[10] Penrose, R. (1994). Shadows of the Mind. Oxford University Press.
[11] Jelassi, S. et al. (2024). Mixture of Parrots. ICLR 2025.
[12] arXiv (2025). Shadow in the Attention: JS Drift and Hallucination Fixation.
[13] Xu, J. & Li, Z. (2025). Information Physics of Intelligence. arXiv:2511.19156.
[14] Paivio, A. (1971). Imagery and Verbal Processes. Holt.
[15] Triṃśikā (Thirty Verses on Consciousness). Vasubandhu. c. 4th century CE.