Dense Core & MoE Execution Layer
A Dual-Architecture Theory of Intelligent Systems: The Functional Separation of Thinking and Execution
Dual-Loop Theory: Independent Computational Cycles for the Thinking System and the Execution System
Category: Original Thought Paper
Domains: AI Architecture · Cognitive Science · Neuroscience · Systems Design
Version: V2
Authors: LEECHO Global AI Research Lab & Claude Opus 4.6 & GPT 5.5 & Gemini 3.1 (Cognitive Collective)
In contemporary AI architectures, Dense and MoE (Mixture of Experts) are treated as two interchangeable efficiency choices, fused in engineering practice within a single forward pass through alternating stacked layers (Dense attention + MoE feedforward networks). This paper argues that when the objective shifts from scaling efficiency to general reasoning and AGI, this fusion constitutes an architecture-level error—it conflates two information-processing systems of fundamentally different functional natures. This paper proposes that Dense corresponds to the “Thinking System” (information alignment, cross-domain reasoning, hypothesis testing) and MoE corresponds to the “Execution System” (information parsing, knowledge retrieval, pattern matching). The two should exist as independent computational loops, interacting through an asynchronous dispatch interface, with the Dense system possessing the authority to interrupt and override MoE output. This paper further distinguishes three types of Dense (Parameter Dense, Information-Flow Dense, Control Dense), provides a formal expression of the dispatch function, and proposes five testable predictions.
I. The Problem: Cognitive Architecture Deficiencies of Current Fusion
Frontier models from 2024–2026 universally adopt the same hybrid strategy: attention layers remain Dense (all parameters activated), while feedforward network layers are replaced with MoE (sparse activation of a subset of experts). DeepSeek-V3, ERNIE 4.5, Qwen3-MoE, and others all follow this paradigm. The engineering rationale is well-founded: attention layers handle inter-token interaction (requiring full connectivity), FFN layers handle nonlinear transformations (amenable to specialization), and both complete serially within a single forward pass, with gradients flowing through the same computation graph.
However, this paper argues that this design compresses two functionally distinct systems into the same data stream and the same temporal scale, erasing the most critical differences between them.
To be clear: the current hybrid-layer design succeeds on the engineering-efficiency dimension, allowing MoE models to hold 6–64× more total parameters than a Dense model with the same compute budget. The “architecture-level error” named in this paper is narrower: once the objective shifts from “scaling efficiency” to “general reasoning and AGI,” compressing thinking and execution into the same forward pass becomes the wrong design. Engineering-level correctness does not imply cognitive-architecture-level correctness. The throughput and long-context gains of hybrid architectures such as Jamba are real, but they accrue along the execution-layer dimension (System 1 in Kahneman’s terms), not along the thinking-layer dimension (System 2).
Current approach (synchronous hybrid layers):
Input → [Dense Attention → MoE-FFN] × N layers → Output
Each token passes through once; both modes complete at the millisecond timescale
This paper’s proposal (asynchronous dual-loop):
Input → Dense System (think · plan · align) ⇄ MoE System (retrieve · match · execute) → Output
Two systems with independent computational loops, different timescales, and hierarchical control relationships
II. Dense = Thinking System
2.1 Empirical Basis: The Structural Advantage of Dense in Reasoning
Jelassi et al. (ICLR 2025, “Mixture of Parrots”) provide the most systematic evidence: as the number of experts increases (with activated parameters held fixed), memorization performance continues to improve while reasoning ability saturates. The paper proves that certain graph problems, such as graph connectivity and length-2 path problems, cannot be solved by any number of MoE experts of a given width but can be solved easily by a slightly wider Dense model. On commonsense and mathematical benchmarks, Dense Transformers consistently outperform MoE models with equivalent total parameters.
A cross-architecture comparison study of Gemma/Phi/Qwen (arXiv:2604.07035, 2026), comprising 8,400 evaluations across seven reasoning-oriented models, found that Dense models lead in aggregate accuracy while MoE models consume approximately 3× more memory.
The essence of reasoning is a relational operation across information domains—to solve problems of the form “A→B, B→C, what is the relationship between A and C,” the information about A, B, and C must all flow simultaneously through the same computational pathway. The fully connected structure of Dense natively satisfies this requirement. In MoE, experts are isolated from one another—information processed by Expert A and information processed by Expert B cannot directly interact within a single layer.
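The isolation described above can be illustrated with a toy comparison. The sketch below is a simplified assumption, not a model from any cited paper: `self_attention` uses identity projections, and `moe_ffn` uses two hypothetical experts with a router that looks only at the token it is routing. It changes one token and checks whether the output at a different position moves.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention with identity projections: each output mixes all tokens."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out

def moe_ffn(tokens):
    """Top-1 MoE feedforward: each token is routed and transformed on its own."""
    experts = [
        lambda x: [2.0 * v for v in x],   # hypothetical expert 0
        lambda x: [v + 1.0 for v in x],   # hypothetical expert 1
    ]
    out = []
    for x in tokens:
        expert_id = 0 if x[0] >= 0 else 1  # toy router: sees only this token
        out.append(experts[expert_id](x))
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
perturbed = [tokens[0], tokens[1], [3.0, -2.0]]  # change only the last token

# Does the output at position 0 react to a change at position 2?
attn_changed = self_attention(tokens)[0] != self_attention(perturbed)[0]
ffn_changed = moe_ffn(tokens)[0] != moe_ffn(perturbed)[0]
print(attn_changed, ffn_changed)
```

In this toy setting the attention output at position 0 changes while the MoE-FFN output does not, which is the sense in which cross-token integration happens in the Dense path rather than inside the expert layer.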
2.2 Cognitive Science Correspondence: The Global Workspace
Baars’ (1988) Global Workspace Theory describes the brain’s Dense core: information is integrated in a small number of brain regions and then broadcast to the entire brain. Dehaene and Changeux (1998) developed this into the “Global Neuronal Workspace” hypothesis—perception, motor, attention, memory, and valuation regions interconnect to form a unified space in which information is widely shared and fed back to lower-level processors. Research published in eLife (2024) further depicted the “synergistic global workspace”—gateway regions that pool synergistic information from specialized modules, situated at the intersection of multiple anatomical, functional, and neurochemical hierarchies.
The Dense/MoE dual architecture is functionally isomorphic to Kahneman’s dual-process theory—slow deliberate integration corresponds to Dense, fast automatic matching corresponds to MoE. However, it must be clarified that System 1/System 2 are psychological functional descriptions rather than precise neural module delineations; the mapping in this paper is a functional isomorphism, not a one-to-one engineering correspondence.
2.3 Three Levels of Dense
Three distinct types of “Dense” must be differentiated:
| Level | Definition | Current Implementation | AGI Requirement |
|---|---|---|---|
| Parameter Dense | All parameters activated simultaneously — computational characteristic | ✅ Current Dense Transformers satisfy this | Necessary but insufficient |
| Information-Flow Dense | Any information fragment can interact directly — connectivity characteristic | ✅ Self-attention mechanism satisfies this | Necessary but insufficient |
| Control Dense | Planning, reflection, veto, override — cognitive control characteristic | ❌ Currently not satisfied | Sufficient condition |
This paper argues that the Dense core required for AGI is the third type—Control Dense—which is not merely fully connected but must also possess the ability to dispatch, interrupt, veto, and iteratively query the MoE execution system. Parameter Dense and Information-Flow Dense are necessary conditions; Control Dense is the sufficient condition that enables the system to transition from “large-scale pattern matching” to “genuine thinking.”
III. MoE = Execution System
3.1 Empirical Basis: The Structural Advantage of MoE in Memory and Parsing
“Mixture of Parrots” also shows where MoE’s advantage lies: on world-knowledge tasks (TriviaQA, Natural Questions), MoE and Dense models produce nearly overlapping performance curves when plotted against total parameter count. Total capacity, not activated compute, drives knowledge-retrieval performance. The Sparse Crosscoders analysis (2026) showed that MoE develops more specialized, more focused internal representations, with higher feature activation density and lower polysemanticity per expert.
MoE’s approach to processing input information is “divide and conquer”: the router rapidly classifies input, and experts process their respective information fragments in a high-density, narrow-focus manner.
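The “divide and conquer” step can be sketched as a Top-K router. This is a minimal illustration under assumptions: the function name `top_k_route`, the example logits, and the choice to renormalize weights over the selected experts are illustrative conventions, not the routing code of any specific model.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank experts by router probability and keep the best k.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected experts' weights sum to 1 (one common convention).
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Hypothetical router scores for one token over four experts.
routes = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)
print(routes)
```

The decision uses only the logits of the token being routed, which is what makes it fast and local.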
3.2 Precise Qualification of MoE Reasoning Capability
MoE systems can perform reasoning tasks—on in-domain reasoning (such as single-step mathematical operations, factual Q&A, pattern-matching logic), MoE performance can approach or even match Dense. This paper’s claim is not that “MoE cannot reason,” but that “MoE has a structural bottleneck in reasoning that requires cross-expert global integration”—when reasoning demands simultaneously invoking information from multiple experts and cross-validating between them, MoE’s expert isolation and routing mechanisms force information flow through a narrow routing layer rather than interacting directly within a unified space. Mixture of Parrots’ theoretical proof—that certain graph problems cannot be solved by any number of fixed-width MoE experts—is the mathematical expression of precisely this structural bottleneck.
3.3 Cognitive Science Correspondence: Functionally Modular Cortex
The functional modularity of the cerebral cortex is MoE’s biological prototype: visual cortex, language areas, and motor areas are each specialized. Global Workspace Theory explicitly states that conscious processing involves only a few expert modules being selectively engaged at any given moment, with information then broadcast to the entire brain through a communication bottleneck.
IV. Key Evidence: “Seeing but Not Thinking”
The “Seeing but Not Thinking” paper by the Zhejiang University and Alibaba team (2026) provides the most direct experimental validation of this paper’s core thesis. They discovered a puzzling phenomenon: multimodal MoE models accurately perceived image content yet failed in subsequent reasoning, whereas the same problems presented as pure text were solved correctly.
68.2%–73.1% of failures stemmed from reasoning errors, while only 26.9%–31.8% were attributable to perception errors. Through systematic analysis, the researchers found that visual experts and domain experts exhibited separation across layers, and image input caused significant routing deviation from text input in the middle layers where domain experts are concentrated. They proposed the “Routing Distraction” hypothesis: when processing visual input, the routing mechanism fails to adequately activate task-relevant reasoning experts.
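One way to make “routing deviation” concrete is to compare, layer by layer, how often each expert is selected for text input versus image input. The sketch below uses Jensen-Shannon divergence for that comparison. This metric is an assumption chosen for illustration and is not necessarily the measure used in the cited study; the layer names and expert indices are made-up placeholders.

```python
import math
from collections import Counter

def expert_distribution(selected_experts, num_experts):
    """Fraction of routing decisions that went to each expert."""
    counts = Counter(selected_experts)
    total = len(selected_experts)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Placeholder routing logs: expert chosen per token, per layer (hypothetical values).
text_routes = {"layer_8": [0, 1, 2, 3, 0, 1, 2, 3], "layer_16": [4, 5, 4, 5, 6, 6, 7, 7]}
image_routes = {"layer_8": [0, 1, 2, 3, 1, 0, 3, 2], "layer_16": [0, 0, 1, 1, 4, 0, 1, 0]}

deviation = {
    layer: js_divergence(
        expert_distribution(text_routes[layer], 8),
        expert_distribution(image_routes[layer], 8),
    )
    for layer in text_routes
}
print(deviation)
```

A layer whose divergence is high for the same underlying problem is a candidate location for the routing distraction the authors describe.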
This directly supports the core thesis of this paper: the MoE system completed input parsing successfully (the perception experts worked correctly), but information alignment failed (routing did not activate the reasoning experts). If the two systems were independent, with routing governed by a Dense thinking system, this failure mode should be avoidable: the Dense system would actively invoke reasoning experts after analyzing the task requirements, rather than letting the router make automated local decisions based on input features.
V. Five Dimensions of Functional Separation
The separation of the Dense thinking system and the MoE execution system is not merely a division of “who handles what,” but a fundamental divergence across five dimensions:
| Dimension | Dense Thinking System | MoE Execution System | Current Hybrid Approach |
|---|---|---|---|
| Timescale | Slow (can iterate through multiple rounds of deliberation) | Fast (completes in a single forward pass) | Forced synchronization — same forward pass |
| Control hierarchy | Decision-maker (decides “what to ask” and “whether the answer is reasonable”) | Executor (retrieves and returns results per instructions) | No hierarchy — equal alternating layers |
| Interrupt capability | Can interrupt, veto, and override MoE output | No authority to interrupt Dense decisions | Nonexistent — unidirectional data flow |
| Iteration mode | Can repeatedly issue different queries to MoE | Executes one query at a time | Single pass — no iteration |
| Cognitive load | High energy, low throughput, high fidelity | Low energy, high throughput, error-tolerant | Unified energy budget — no differentiated allocation |
The current hybrid-layer design erases differences across all five dimensions—effectively forcing slow deliberate integration and fast automatic matching to operate at the same timescale, the same energy budget, and within the same data stream. This fundamentally violates the cognitive nature of the dual system.
VI. The Correct Dual-Architecture Paradigm
6.1 Architecture Design
┌───────────────────────────────────────────┐
│ Dense Thinking System (Control Dense)     │
│ – Small parameter count, fully connected, │
│   high energy consumption                 │
│ – Functions: planning, reasoning,         │
│   hypothesis testing, alignment           │
│ – Timescale: slow (can iterate N rounds)  │
│ – Can interrupt and override MoE output   │
│ – Decides “what to ask” and “whether      │
│   the answer is reasonable”               │
└───────────┬──────────▲────────────────────┘
   Dispatch │          │ Results + confidence
   commands │          │ + evidence
            ↓          │
┌───────────▼──────────┴────────────────────┐
│ MoE Execution System (Expert Matrix)      │
│ – Large parameter count, sparse           │
│   activation, low energy consumption      │
│ – Functions: knowledge retrieval, pattern │
│   matching, information parsing           │
│ – Timescale: fast (single forward pass)   │
│ – Returns results + confidence +          │
│   evidence chains                         │
│ – Tells Dense “what I found and how       │
│   confident I am”                         │
└───────────────────────────────────────────┘
6.2 Functional Isomorphism with Brain Architecture
| System Feature | This Paper’s Architecture | Brain Functional Isomorphism |
|---|---|---|
| Thinking center | Control Dense system | Prefrontal cortex + global workspace |
| Execution modules | MoE execution system | Functionally modular cortical regions |
| Dispatch interface | Asynchronous bidirectional communication protocol | Attention system (selective activation) |
| Interrupt mechanism | Dense can terminate MoE queries and reroute | Executive control / inhibition functions |
| Iterative loop | Dense repeatedly issues different queries to MoE | Information cycling in working memory |
| Timescale difference | Dense slow × N rounds vs MoE fast × single round | Deliberate thinking (seconds to minutes) vs automatic matching (~100 ms) |
6.3 Existing Imperfect Precursors
AlphaGo (DeepMind 2016) is the closest engineering implementation to this paper’s proposal: Monte Carlo Tree Search (MCTS) serves as the Dense thinking system, deliberately exploring the possibility space, while the value network + policy network serve as the MoE-like execution system providing fast intuitive evaluations. The two are independent systems interacting through a dispatch interface; MCTS can invoke the neural networks multiple times and can veto their suggestions. However, AlphaGo is Go-specific and has not been generalized to language models.
Agent architectures (LangChain / AutoGPT / ReAct 2023–2026) separate “the model that thinks” from “the tools that execute”—the LLM policy core performs planning and reasoning, while external tools execute retrieval and operations. However, the agent’s “execution layer” consists of external tools rather than neural network experts, making it not a unified jointly-trained dual system.
OM2M dual-system gating (2025) integrates meta-learning within a dual-process framework, using a learned gating mechanism to dynamically arbitrate between System 1 and System 2 based on cognitive load and uncertainty. It is theoretically the closest to this paper’s proposal, but it is limited to small-scale Theory of Mind tasks.
6.4 Formalization of the Dispatch Function
The Dense thinking system’s dispatch of the MoE execution system can be formalized as:
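$$(r_t,\ \mathrm{confidence}_t,\ \mathrm{evidence}_t) = \mathrm{MoE}(E_t,\ q_t)$$
$$\mathrm{action}_t \in \{\mathrm{continue},\ \mathrm{revise},\ \mathrm{reject},\ \mathrm{synthesize}\}$$
$q_t$ = Current query · $h_t$ = Historical reasoning state · $u_t$ = Uncertainty · $c_t$ = Cognitive/compute budget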
This dispatch loop can iterate through multiple rounds—the Dense system decides whether to follow up, switch experts, or terminate based on the confidence levels and evidence quality returned by MoE. This is isomorphic to the pattern in AlphaGo where MCTS repeatedly invokes the value and policy networks. The specific training scheme for the dispatch interface (how to make it differentiable, how to design reward functions, how to define intermediate-step rewards in semantic space) is an open engineering problem beyond the scope of this thought paper, but is flagged as a critical next research direction.
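The multi-round loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in rather than the paper’s method: `toy_moe` fakes the MoE call, `choose_experts` assumes the expert set is chosen from the query, history, uncertainty, and budget, and the thresholds in `judge` and the meaning given to each of the four actions are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class MoEResult:
    result: str
    confidence: float
    evidence: list = field(default_factory=list)

def toy_moe(experts, query):
    """Stand-in for the MoE call; confidence simply grows with the expert count."""
    confidence = min(1.0, 0.3 * len(experts))
    return MoEResult(f"answer({query})", confidence, [f"expert:{e}" for e in experts])

def choose_experts(query, history, uncertainty, budget, pool):
    """Hypothetical dispatch step: widen the expert set as uncertainty rises.
    query and history are accepted but unused in this toy version."""
    wanted = 1 + round(uncertainty * (len(pool) - 1))
    return pool[:min(wanted, budget)]

def judge(res, previous_confidence, accept=0.8, floor=0.2):
    """Map returned confidence to one of the four actions (thresholds are assumed)."""
    if res.confidence >= accept:
        return "synthesize"
    if res.confidence < floor:
        return "reject"
    if res.confidence > previous_confidence:
        return "continue"
    return "revise"

def dispatch_loop(query, pool, budget=4, max_rounds=5):
    history, uncertainty, previous = [], 0.0, 0.0
    for _ in range(max_rounds):
        experts = choose_experts(query, history, uncertainty, budget, pool)
        res = toy_moe(experts, query)
        action = judge(res, previous)
        history.append((tuple(experts), res.confidence, action))
        if action == "synthesize":
            return res, history
        if action == "revise":
            query = query + " (rephrased)"  # Dense reformulates the question
        if action != "reject":
            previous = res.confidence       # rejected results are not carried forward
        uncertainty = 1.0 - res.confidence
    return None, history                     # budget exhausted without an accepted answer

final, trace = dispatch_loop("relation between A and C",
                             ["physics", "math", "philosophy", "cs"])
for step in trace:
    print(step)
```

In this toy run the first round consults one expert, the low confidence raises the uncertainty estimate, and the second round widens the expert set until the result is accepted. The loop is ordinary control flow with discrete choices, which is the reason the text flags differentiability of the interface as open.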
VII. Dynamic MoE Activation: Routing Authority Is Thinking Authority
In the correct dual-architecture paradigm, the number of MoE experts activated should not be determined by the statistical features of tokens, but by the reasoning state of the Dense thinking system. This paper terms this “MoE quantity triggered by thinking divergence”—the fifth dimension of the Information Completeness Framework.
Current MoE routing is bottom-up—each token independently decides which experts to activate based on its own features (Top-K routing). This is automatic behavior: local, requiring no deliberation. What this paper advocates is top-down routing—the Dense system actively decides how many experts and which experts to activate based on problem complexity and divergence requirements.
Current approach (bottom-up, token-level routing):
token “quantum” → router auto-selects → physics expert + math expert (fixed top-2)
This paper’s proposal (top-down, Dense-system-governed routing):
Dense system analyzes the complete problem →
Judgment: “This problem involves quantum mechanics, philosophy of consciousness, and computational theory (three domains)”
Decision: “Need to activate 5 experts, including two peripheral experts not normally activated”
Command → MoE system activates the corresponding expert set per command
The field has begun exploring dynamic routing—Top-P routing (2024) dynamically adjusts expert count based on cumulative probability thresholds, and DynaMoE (2026) introduces layer-adaptive capacity allocation. However, the “difficulty assessment” in these methods is still performed at the router level—a small gating network estimates difficulty based on the token’s embedding. No genuine “thinking” participates in the decision.
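For reference, the cumulative-threshold idea behind Top-P routing can be sketched as follows. This is a generic illustration under assumptions (the function name, the threshold `p=0.7`, and the example logits are placeholders), not the implementation from any cited work. Note that the expert count still depends only on the token’s own router scores.

```python
import math

def top_p_route(router_logits, p=0.7, max_experts=None):
    """Select experts in descending probability until their cumulative mass reaches p."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
        if max_experts is not None and len(chosen) >= max_experts:
            break
    return chosen

confident = top_p_route([5.0, 0.0, 0.0, 0.0])   # one expert dominates
uncertain = top_p_route([1.0, 0.9, 0.8, 0.7])   # probability mass is spread out
print(len(confident), len(uncertain))
```

A peaked distribution activates a single expert while a flat one activates several, so the count adapts, but the signal driving it is token-level router confidence rather than a judgment about the whole problem.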
The true breakthrough requires elevating routing decision authority from the gating network inside MoE up to an independent Dense thinking system—making the decision of “who executes” itself a high-level reasoning problem rather than a low-level statistical classification problem. Routing authority is thinking authority. The current MoE router is merely a local gate; the future AGI router should be a Dense thinking system.
VIII. Engineering Barriers and Possible Paths
8.1 Three Engineering Barriers
Barrier One: Joint Training. If the Dense system and MoE system exist independently, how does end-to-end gradient backpropagation work? The current hybrid-layer design is popular precisely because gradients can flow through the same computation graph. Joint training of two independent systems is an unsolved optimization problem—the discreteness of the dispatch interface (selecting experts / interrupting / overriding) prevents standard backpropagation from being directly applied.
Barrier Two: Latency. Dense thinks, then calls MoE to execute, then returns to think—response time is several times slower than a single forward pass. Commercial products cannot tolerate excessive latency. However, this barrier can be circumvented in certain scenarios—high-complexity reasoning tasks inherently require more thinking time, making the latency-for-quality tradeoff reasonable.
Barrier Three: Absence of a Theoretical Framework. No one has used a “thinking vs. execution” functional separation framework to conceptualize the relationship between Dense and MoE—engineers have glued the two together from a computational efficiency perspective rather than designing interaction protocols from a cognitive function perspective. This paper constitutes the first systematic attempt to fill this gap.
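The latency cost in Barrier Two can be written as a rough estimate. The symbols are assumptions introduced here for illustration: $N$ is the number of dispatch rounds, $\tau_D$ and $\tau_M$ are the per-round times of the Dense and MoE systems, and $\tau_{\text{fwd}}$ is the time of one conventional forward pass.

$$T_{\text{dual}} \approx N\,(\tau_D + \tau_M), \qquad \frac{T_{\text{dual}}}{T_{\text{single}}} \approx \frac{N\,(\tau_D + \tau_M)}{\tau_{\text{fwd}}}$$

If one round costs about as much as one forward pass, so that $\tau_D + \tau_M \approx \tau_{\text{fwd}}$, then four rounds of deliberation are roughly four times slower, which matches the “several times slower” estimate. This ignores communication overhead between the two systems and assumes constant per-round cost.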
8.2 Possible Breakthrough Paths
Path One: Asynchronous Training. Pre-train the Dense thinking system and MoE execution system independently, then train the dispatch protocol between them via reinforcement learning—analogous to AlphaGo’s approach of first training the policy and value networks, then training their coordination through self-play.
Path Two: Inference-Time Separation. Use different layers of the same model to assume different roles during inference: shallow layers serve as MoE for fast retrieval, and deep layers switch to Dense mode for integrative reasoning. The Ring-Linear architecture (2025) already assigns layers different roles in an initial form, using a Dense MLP for the first layer and MoE for subsequent layers, although that ordering is the reverse of the one proposed here.
Path Three: Internalization of the Agent Framework. Internalize the current agent architecture pattern of “LLM planner + external tools” as a neural network—a Dense subnetwork serves as the planner, an MoE subnetwork serves as the internal tool set, interacting through a learnable dispatch protocol.
IX. Testable Predictions of the Framework
This paper proposes five experimentally falsifiable predictions:
Prediction One: Dense-controlled routing should outperform same-parameter-count Top-K routing on reasoning tasks requiring integration across 3+ experts—because Top-K’s token-level routing cannot sense global reasoning requirements.
Prediction Two: On “seeing but not thinking” multimodal tasks, Dense-controlled routing should significantly reduce routing distraction—because the Dense system selects experts based on task objectives rather than input features.
Prediction Three: The optimal number of activated experts should rise dynamically with reasoning divergence—simple retrieval tasks need only top-2, while complex cross-domain reasoning may require top-8+. This asymmetry itself is a prediction of functional separation theory.
Prediction Four: A Dense interrupt/re-query mechanism should reduce hallucination rates—hallucinations are essentially unsupervised interpolation by MoE in information voids, and Dense verification can intercept inconsistent results before output.
Prediction Five: The Dense-MoE asynchronous dual loop should offer no advantage—or even be slower—on low-latency simple tasks, but should significantly outperform on high-complexity tasks requiring multi-step reasoning. If the dual loop offers no advantage on any task, functional separation theory needs revision.
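The falsification rule in Prediction Five can be stated as a simple decision procedure. The sketch below is one possible reading under assumptions: the `margin` of 0.02, the two complexity labels, and the three verdict strings are hypothetical choices, and the accuracy numbers are placeholders, not experimental results.

```python
def evaluate_prediction_five(scores, margin=0.02):
    """scores maps task name -> (complexity, baseline_accuracy, dual_loop_accuracy)."""
    gains = {name: dual - base for name, (_, base, dual) in scores.items()}
    high_gains = [gains[name] for name, (level, _, _) in scores.items() if level == "high"]
    # No advantage on any task: the theory itself needs revision.
    if not any(g > margin for g in gains.values()):
        return "revise theory"
    # Clear advantage on every high-complexity task: the prediction holds.
    if high_gains and all(g > margin for g in high_gains):
        return "supported"
    return "inconclusive"

# Placeholder numbers for illustration only.
example = {
    "fact lookup": ("low", 0.91, 0.90),
    "multi-step proof": ("high", 0.55, 0.71),
    "cross-domain synthesis": ("high", 0.48, 0.63),
}
verdict = evaluate_prediction_five(example)
print(verdict)
```

Fixing the margin and the task split before running the experiment keeps the prediction falsifiable rather than adjustable after the fact.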
X. Implications for AGI
The definition of AGI requires three properties: the ability to perform any cognitive task, generalization to novel tasks, and human-level performance across all domains simultaneously. All three properties point toward Dense’s fully connected reasoning capability, not MoE’s specialized memory capability. The MoE-dominant scaling path tends to expand knowledge coverage and specialized execution; the Dense-dominant path tends to preserve cross-domain integration and global reasoning. AGI requires the latter to orchestrate the former, not the former to replace the latter.
The dual-architecture paradigm proposed in this paper (a Control Dense core as the thinking center, an MoE execution layer as the knowledge matrix, the two interacting through asynchronous dispatch) emulates not the specialized adult brain but the young brain: full of cross-domain connective possibilities and not yet remodeled by professional training. AGI is not more parrots. AGI is a conductor who can make all the parrots collaborate. That conductor is Control Dense.
The industry is currently validating this thesis inadvertently: ERNIE 4.5 retains Dense attention layers to maintain cross-modal interaction, confining MoE to FFN layers only. Ring-Linear uses Dense MLP for the first layer. Jamba alternates Dense and MoE layers. Every successful design is converging toward “Dense for thinking, MoE for execution”—but no one has explicitly articulated this as a design principle. This paper makes this implicit trend explicit as an architectural theory.
※ Core References
[1] Jelassi, S. et al. (2024). Mixture of Parrots: Experts improve memorization more than reasoning. ICLR 2025.
[2] Xu, H. et al. (2026). Seeing but Not Thinking: Routing Distraction in Multimodal MoE. arXiv:2604.08541.
[3] Baars, B.J. (1988). A Cognitive Theory of Consciousness. Cambridge University Press.
[4] Dehaene, S. & Changeux, J.-P. (1998). Global neuronal workspace hypothesis.
[5] Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
[6] Silver, D. et al. (2016). Mastering the game of Go with deep neural networks. Nature.
[7] arXiv:2604.07035 (2026). Gemma 4, Phi-4, and Qwen3: Dense and MoE Reasoning Comparison.
[8] Sparse Crosscoders (2026). Diffing MoEs and Dense models. arXiv:2603.05805.
[9] Deconstructing Pre-training (AAAI 2026). Knowledge Attribution in MoE and Dense. arXiv:2601.08383.
[10] Expert Strikes Back (2026). Interpreting MoE at Expert Level. arXiv:2604.02178.
[11] Pan et al. (2024). DS-MoE: Dense Training, Sparse Inference. arXiv:2404.05567.
[12] Yao et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models.
[13] Karpas et al. (2022). MRKL Systems: Modular Reasoning, Knowledge and Language.
[14] UMoE (NeurIPS 2025 Spotlight). Unifying Attention and FFN with Shared Experts. arXiv:2505.07260.
[15] AI21 Labs (2024). Jamba: A Hybrid Transformer-Mamba Language Model. ICLR 2025.
[16] DynaMoE (2026). Dynamic Token-Level Expert Activation. arXiv:2603.01697.
[17] Ring-Linear (2025). Efficient Hybrid Architecture for Long-Context Reasoning. arXiv:2510.19338.
[18] S1S2.ai (2025). Dual-process architecture for robotics.
[19] OM2M (2025). One Model, Two Minds: Context-Gated Dual-Process Graph Learner. arXiv:2509.08705.