Analysis of the Human Biological
Cognitive Front-End System
A Unified Theoretical Framework for Multi-Dimensional Sensory Convergence,
Biological Clock Internal Alignment, and Few-Shot Definition Formation
This paper proposes a unified theoretical framework for human cognitive front-end capabilities. We argue that the primary capacity of human intelligence is not reasoning or decision-making but multi-dimensional sensory convergence across all categories of the physical world: from entities, events, spatiotemporal contexts, environments, and every other accessible cognitive category, the structural features of objects are locked in through the synchronized alignment of 10+ sensory dimensions on minimal samples, forming stable categorical definitions. This process depends on three inseparable biological mechanisms: parallel acquisition through multi-dimensional sensory channels; a multi-scale nested internal temporal alignment architecture spanning from circadian rhythms to gamma oscillations; and biological fault tolerance guaranteed by cross-modal plasticity. The results of front-end convergence directly alter the brain’s synaptic structure: the human brain, as a compute-in-memory biological entity, embodies the principle that knowledge is structure and structure is computation. Upon this physical substrate, the abstraction layer, through metaphorical mapping and continuous unconscious reorganization of multi-dimensional information, ultimately gives rise to manipulable, verifiable, complete mental imagery, which is precisely the biological essence of “eureka moments” in scientific discovery. Kahneman’s fast system (System 1) is the operational engine of this abstraction layer, where 96% of human cognitive activity is completed.
This framework, for the first time, integrates sensory front-end processing, synaptic remodeling, compute-in-memory architecture, imagery emergence, and dual-system theory into a complete four-layer progressive model from the perceptual layer to the abstraction layer, fundamentally redefining the metrics of intelligence and revealing that multi-dimensional sensory alignment and convergence efficiency are the core variables determining differences in cognitive levels among human individuals.
I. Introduction: The Neglected Cognitive Front-End
“What is intelligence?” This question still lacks a consensus answer. Legg and Hutter (2007) collected over 70 informal definitions of intelligence spanning psychology, philosophy, and artificial intelligence, yet the vast majority of these definitions focus on the “back-end” functions of intelligence—reasoning, learning, adaptation, and decision-making. Very few researchers have pursued a more fundamental antecedent question: before reasoning and decision-making occur, how does the cognitive system “carve” the continuous signals of the physical world into discrete definitional categories?
This paper proposes that the formation of categorical definitions is the front-end capacity of intelligence and the prerequisite for all subsequent cognitive activity. Without the categorical boundaries between “cat” and “dog,” reasoning about cats and dogs cannot exist. Without the definitions of “cold” and “hot,” judgments about temperature cannot exist. The first step of human cognition is not “thinking” but “defining”—converging upon all perceivable categories of entities, events, spatiotemporal contexts, and environments in the physical world, determining boundaries, and locking in structural features.
Academician Li Deyi’s cognitive architecture of “first delineate boundaries, then reason,” Eleanor Rosch’s prototype theory, and George Lakoff’s assertion that “categorization is the most basic activity of human thought” have all touched upon this front-end capacity from different angles, yet a unified theoretical framework has not been established. This paper attempts to fill this gap.
II. Five Fundamental Characteristics of the Human Cognitive Front-End System
Based on a systematic review of developmental psychology, neuroscience, multisensory integration research, chronobiology, and artificial intelligence literature, we propose that the human biological cognitive front-end system possesses the following five fundamental characteristics, constituting an inseparable holistic architecture.
2.1 Characteristic I: Multi-Dimensionality
The traditional five-sense classification (vision, hearing, touch, smell, taste) severely underestimates the number of dimensions in the human sensory system. Modern neuroscience has identified at least twelve independent senses, including proprioception (body position and movement), vestibular sense (balance and spatial orientation), thermoreception, nociception, interoception (hunger, heartbeat, emotional states), and time perception.
These dimensions are not redundant—each provides constraint information about a different aspect of the same physical object. When a toddler comes to know a cat, vision provides shape contours, hearing provides vocalization characteristics, touch provides fur texture and body temperature, smell provides an olfactory signature, proprioception provides center-of-gravity adjustment information when holding the animal, and interoception provides emotional responses (such as feelings of affinity or nervousness). The information from these dimensions converges synchronously on the same object, forming categorical boundaries far more precise than any single dimension’s projection.
Newell et al. (2023) noted in their review that multisensory perception constrains the formation of object categories through two independent processes: integration of redundant information (e.g., both seeing and touching shape) and cross-modal statistical learning of complementary information (e.g., the association between a cow’s “moo” and its visual shape). The combined action of these two processes gives categorical definitions a precision and robustness far exceeding single-modality recognition.
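The second of these processes, cross-modal statistical learning, reduces at its simplest to tracking co-occurrence statistics between channels. The toy sketch below is purely illustrative (the animals, the 10% noise rate, and the counting scheme are our own choices, not taken from the cited review):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world: each episode presents one animal; sight and sound arrive
# together (cow -> "moo", cat -> "meow"), with occasional mismatches.
animals = rng.choice(["cow", "cat"], size=200)
sights = animals.copy()
sounds = np.where(animals == "cow", "moo", "meow")
noisy = rng.random(200) < 0.1               # 10% of sounds are scrambled
sounds[noisy] = rng.choice(["moo", "meow"], size=int(noisy.sum()))

# Cross-modal statistical learning as simple co-occurrence counting.
counts = {}
for sight, sound in zip(sights, sounds):
    counts[(sight, sound)] = counts.get((sight, sound), 0) + 1

# The learned audiovisual association: P(sound | sight).
p_moo_given_cow = counts.get(("cow", "moo"), 0) / (sights == "cow").sum()
print(f"P(moo | cow) = {p_moo_given_cow:.2f}")   # well above chance
```

Even with mismatched pairings injected, the conditional probability P(moo | cow) stays well above chance, which is the statistical core of the sound–shape association described above.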
2.2 Characteristic II: Visual Dominance Weighting
Vision does not dominate in all tasks. According to the modality appropriateness hypothesis, touch dominates vision in object size judgments, and hearing and touch exert greater influence than vision in time estimation. The dynamic allocation of weights is itself part of the cognitive front-end.
Hutmacher (2019), in a systematic analysis published in Frontiers in Psychology, pointed out that vision-related papers account for 77.46% of perception research, while the combined research volume on touch, smell, and taste falls far below that of vision alone. This research skew partly reflects the fact that vision does indeed hold a relatively high weight in human cognition—developmental psychology research shows that children’s “shape bias” is the core mechanism of category learning, where shape cues receive higher weight than color and texture when inferring category membership.
However, Hutmacher also argued that visual dominance is largely “a result of social and cultural reinforcement rather than a natural law.” Cross-cultural research across 20 languages shows that no universal sensory hierarchy exists—not all cultures place vision first. This means that the “high” visual weight is real but not absolute; in specific tasks and cultural contexts, other modalities can assume dominance. The design of the cognitive front-end system is not fixed-weight but dynamically adaptive.
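The modality appropriateness hypothesis has a standard computational reading: reliability-weighted cue combination, in which each modality's weight is inversely proportional to the variance of its estimate, so dominance shifts to whichever channel is most reliable for the task at hand. A minimal sketch (the numbers are illustrative, not fitted to any experiment):

```python
import numpy as np

def fuse(estimates, variances):
    """Maximum-likelihood fusion of independent noisy cues: each cue's
    weight is proportional to 1/variance, so the more reliable modality
    automatically dominates without any fixed sensory hierarchy."""
    inv = 1.0 / np.asarray(variances, dtype=float)
    weights = inv / inv.sum()
    fused = float(np.dot(weights, estimates))
    fused_var = 1.0 / inv.sum()
    return fused, fused_var, weights

# Size judgment: touch (second cue) is the sharper estimate here,
# so it automatically receives the larger weight.
est, var, w = fuse(estimates=[10.0, 12.0], variances=[4.0, 1.0])
print(w)                  # [0.2 0.8] -- touch dominates
print(round(est, 3))      # 11.6, pulled toward the tactile estimate
print(round(var, 3))      # 0.8, lower than either cue alone
```

Note that the fused variance is smaller than either single cue's variance: combining modalities does not just pick a winner, it sharpens the estimate.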
2.3 Characteristic III: Biological Clock Internal Alignment
The physical prerequisite for multi-dimensional information to “converge on the same object” is temporal alignment. The human internal alignment mechanism is provided by a hierarchical biological timing system that requires no external temporal input and no conscious participation—it is a temporal synchronization architecture that has been internalized as a biological property.
This internal temporal architecture is a multi-scale nested hierarchical system. At the most macroscopic scale, the hypothalamic suprachiasmatic nucleus (SCN) serves as the master pacemaker, maintaining 24-hour global synchronization of peripheral tissue clocks throughout the body via autonomic neural circuits and hormonal rhythms—the olfactory bulb and piriform cortex in the olfactory system, as well as the dorsal horn and dorsal root ganglia in the somatosensory system, all contain autonomous local biological clocks coordinated by the SCN. At the mesoscale, delta waves (0.5–4 Hz) and theta waves (4–8 Hz) provide second-scale and hundred-millisecond-scale neural rhythms. At the microscale, gamma oscillations (30–120 Hz) provide millisecond-level precise timing, achieving real-time binding of perceptual signals.
The crucial point is that these oscillations at different scales do not operate independently but are nested layer by layer through cross-frequency phase-amplitude coupling—the phase of slow oscillations modulates the amplitude of fast rhythms, and the phase of fast rhythms in turn modulates the amplitude of even faster rhythms. Recent research on the subcortical visual system has directly confirmed that circadian-level changes in firing frequency gate the occurrence frequency of gamma oscillations, similar to the gating effect of theta oscillations on gamma rhythms in the hippocampus. Therefore, from the SCN’s 24-hour rhythm to the millisecond-level precise timing of gamma oscillations, what is constituted are different levels of the same multi-scale biological timing system—not two independent systems.
It is precisely this multi-scale nested architecture that enables humans, when simultaneously touching a cat’s fur, smelling the cat’s scent, hearing the cat’s purring, and seeing the cat’s shape, to naturally align these signals from different sensory channels at millisecond-level precision without any external alignment algorithm. The macroscopic level ensures that all sensory systems are in a consistent rhythmic state; the microscopic level ensures precise binding of real-time signals—together they constitute the temporal substrate for multi-dimensional convergence.
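The nesting relation described above, in which a slow phase gates a fast amplitude, can be made concrete with a synthetic signal. The sketch below is illustrative only (the 6 Hz and 40 Hz frequencies and the modulation depth are arbitrary choices): it builds a gamma rhythm whose amplitude is gated by theta phase, then recovers the coupling by binning amplitude against phase.

```python
import numpy as np

fs = 1000                          # sampling rate, Hz
t = np.arange(0, 2.0, 1 / fs)      # 2 s of simulated activity

theta_phase = 2 * np.pi * 6 * t                 # 6 Hz slow rhythm
gamma_amp = 1.0 + 0.8 * np.cos(theta_phase)     # amplitude gated by phase
signal = np.cos(theta_phase) + gamma_amp * np.cos(2 * np.pi * 40 * t)

# Phase-amplitude coupling signature: gamma amplitude is not uniform
# across theta phase but peaks at a preferred phase.
phase = np.angle(np.exp(1j * theta_phase))      # wrap to (-pi, pi]
bins = np.linspace(-np.pi, np.pi, 13)           # 12 phase bins
mean_amp = np.array([gamma_amp[(phase >= lo) & (phase < hi)].mean()
                     for lo, hi in zip(bins[:-1], bins[1:])])
print(mean_amp.round(2))   # peaks near phase 0, dips near +/- pi
```

The non-uniform amplitude-by-phase profile is exactly what phase-amplitude coupling analyses of real recordings quantify; here it falls out of the construction by design.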
2.4 Characteristic IV: Biological Fault Tolerance
What the system protects is not a specific channel but “convergence capability itself”—indicating that categorical definition formation is treated by the brain as a non-negotiable core function. Language plays a unique role in this process: it is essentially a higher-order encoding system that emerges within the auditory dimension, while written text is a re-mapping of this higher-order encoding onto the visual dimension—neither is a “sixth dimension” independent of perception, but rather higher-order derivatives of existing dimensions.
Cross-modal plasticity research provides the most compelling evidence. In congenitally blind individuals, the primary visual cortex does not remain idle—it is repurposed to process tactile and auditory information. Transcranial magnetic stimulation experiments have confirmed that disrupting the occipital (visual) cortex in blind individuals causes errors in Braille reading, proving that the visual cortex is indeed functionally participating in tactile processing. Simultaneously, blind individuals outperform sighted individuals in tasks of tactile spatial resolution, auditory localization, and olfactory identification—the brain actively enhances the constraint precision of remaining dimensions.
The latest research (2025) found that blind individuals show significantly higher associations with touch in their conceptual representations compared to sighted individuals, confirming the existence of “haptic compensation” at the level of semantic memory. Even more striking, blind individuals’ ratings of “visual associations” for concepts showed no significant difference from those of sighted individuals—they had established proxy representations of visual concepts through language and social interaction.
Cases of deafblindness—the simultaneous loss of both modalities—provide reverse validation of the criticality of dimension count: when vision and hearing, the two highest-weight dimensions, are simultaneously lost, concept formation becomes extremely difficult and fragmented. This constitutes a natural gradient experiment: as sensory dimensions are removed, the difficulty of definitional convergence rises nonlinearly.
| Perceptual State | Available Dimensions | Convergence Strategy | Definition Formation Outcome |
|---|---|---|---|
| Sighted Individual | 10+ full-channel dimensions | Direct multi-dimensional convergence | Few-shot rapid lock-in |
| Blind Individual | Vision absent, others enhanced | Tactile/auditory compensation + language proxy | Convergence achieved, strategy altered |
| Deaf Individual | Hearing absent, vision enhanced | Visual/tactile compensation | Basic convergence achievable |
| Deafblind Individual | Vision + hearing absent | Tactile/olfactory dominant | Extremely difficult, fragmented |
2.5 Characteristic V: Few-Shot Lock-In and Lifelong Retrievability
Infants are natural few-shot definers. After being taught “dog” and “cat” only a few times, a child can distinguish and identify the vast majority of cats and dogs. The foundation of this capability lies not in computational power but in the fact that synchronous convergence of multi-dimensional information makes categorical boundary constraints extremely sufficient.
Lake, Salakhutdinov, and Tenenbaum (2015), in research published in Science, demonstrated that humans understand concepts as “generative programs,” inferring from a single sample the causal process that generated it, and then generalizing to new instances of the same kind. Subsequent drawing experiments confirmed that participants, after seeing a new shape, could autonomously synthesize diverse variants far beyond simple copying—indicating that human few-shot learning is not “memorizing samples” but “extracting and internalizing generative rules.”
Decades of research from Linda Smith’s laboratory have revealed the developmental trajectory of infant few-shot learning: infants first accumulate dense experience through individual objects in a small number of early-learned categories (their own cup, the family dog, their own shoes), and then develop generalizable few-shot category learning capability on this basis. Throughout this process, shape bias gradually strengthens—between 18 and 24 months of age, toddlers transition from fragment-feature-based recognition to three-dimensional geometric-shape-based recognition, co-occurring with the rapid growth of noun vocabulary.
Tenenbaum’s (1999) Bayesian concept framework further quantified the astonishing efficiency of this capability: in specific tasks, humans can lock in the correct concept from among 1024 logically possible concepts with reasonable confidence after seeing only four positive examples.
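The size-principle logic behind this efficiency can be reproduced in miniature. The sketch below uses a small, hand-built hypothesis space (our own toy construction, not Tenenbaum's actual task): hypotheses that are both small and consistent with the data gain posterior mass exponentially fast as positive examples accumulate.

```python
# Toy Bayesian concept learning with the "size principle": each
# hypothesis is a candidate extension (a set of numbers in 1..100),
# and a consistent example has likelihood 1/|h| under hypothesis h,
# so small hypotheses that fit the data win exponentially fast.
universe = range(1, 101)
hypotheses = {
    "even": {n for n in universe if n % 2 == 0},
    "powers_of_two": {2, 4, 8, 16, 32, 64},
    "multiples_of_four": {n for n in universe if n % 4 == 0},
    "all_numbers": set(universe),
}

def posterior(examples, hypotheses):
    scores = {}
    for name, h in hypotheses.items():
        consistent = all(x in h for x in examples)
        scores[name] = (1.0 / len(h)) ** len(examples) if consistent else 0.0
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

examples = [16, 8, 2, 64]
for k in (1, 2, 4):
    post = posterior(examples[:k], hypotheses)
    print(k, max(post, key=post.get), round(post["powers_of_two"], 3))
```

After four positive examples the posterior on "powers of two" exceeds 0.99 in this toy space, mirroring in miniature the rapid lock-in described above.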
Once locked in, definitions are rapidly encoded by the hippocampus, consolidated during sleep through neural replay, and ultimately written into long-term semantic memory for lifelong retrieval. Sleep not only preserves categorical knowledge but further extracts “gist-like prototype representations”—the core structure of definitions—enabling them to continue effectively generalizing when encountering new variants.
The incorporation of multi-dimensional information directly enhances few-shot convergence efficiency, which has been experimentally verified in the field of robotic perception. In visual-tactile fusion recognition experiments, when vision was impaired, multimodal systems incorporating the tactile channel “learned faster”—requiring fewer training samples to achieve the same recognition accuracy. Cross-modal self-supervised learning research further confirmed that tactile features learned by leveraging natural visual-tactile correlations achieved a 25% performance improvement in few-shot scenarios compared to raw features. A bioinspired tactile-olfactory joint sensing system modeled after the star-nosed mole also demonstrated that fusing two senses enables robust recognition of multiple objects under interference conditions. These experiments directly validate the core prediction of this framework from the engineering side: more dimensions → stronger constraints → fewer required samples.
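This engineering-side prediction (more dimensions → stronger constraints → fewer required samples) can also be checked in a toy simulation. The sketch below is our own illustration, not a replication of the cited robotics experiments: a nearest-centroid classifier learns two synthetic categories from three examples each, and accuracy rises as independent feature dimensions are added.

```python
import numpy as np

rng = np.random.default_rng(1)

def fewshot_accuracy(n_dims, n_train, n_trials=300):
    """Two categories separated by 1.0 along every dimension (noise sd 1.5).
    Each trial estimates class centroids from n_train examples per class
    and classifies one held-out probe; returns accuracy over trials."""
    hits = 0
    for _ in range(n_trials):
        mu_a, mu_b = np.zeros(n_dims), np.ones(n_dims)
        train_a = mu_a + 1.5 * rng.standard_normal((n_train, n_dims))
        train_b = mu_b + 1.5 * rng.standard_normal((n_train, n_dims))
        probe = mu_a + 1.5 * rng.standard_normal(n_dims)   # true class: A
        ca, cb = train_a.mean(axis=0), train_b.mean(axis=0)
        hits += np.linalg.norm(probe - ca) < np.linalg.norm(probe - cb)
    return hits / n_trials

for dims in (2, 5, 10):
    print(dims, fewshot_accuracy(dims, n_train=3))   # accuracy climbs with dims
```

Each added dimension contributes an independent constraint, so the same three samples pin down the category boundary more tightly; read in reverse, a low-dimensional learner must substitute sample volume for missing constraints.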
III. Unified Model: The Multiplicative Architecture of the Cognitive Front-End
The above five characteristics do not exist independently but constitute a whole with multiplicative relationships. We express the definitional capability of the human cognitive front-end system as the following model:

D = N_dim × W_dynamic × S_internal × R_fault × C_consolidate

Where: D is definition precision and robustness; N_dim is the number of available sensory dimensions; W_dynamic is the dynamic weight allocation function (visually dominant but task-adaptive); S_internal is the quality of internal temporal synchronization (provided by the multi-scale nested biological timing architecture); R_fault is the fault tolerance coefficient (dimensional compensation capability provided by cross-modal plasticity); C_consolidate is consolidation efficiency (hippocampal encoding → sleep replay → long-term storage).
This formula is a conceptual expression rather than a rigorous mathematical model. The actual relationship more closely approximates a nonlinear saturation function: as dimensions increase, returns diminish but remain always positive; as dimensions decrease, losses are initially gradual and then sharply accelerate. The multiplicative form is chosen to convey the indispensability of each factor: when N_dim drops sharply in deafblind individuals, overall definitional capability still declines precipitously despite R_fault striving to compensate through cross-modal plasticity, exhibiting a typical nonlinear interaction effect.
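As a purely conceptual sketch (not a fitted model), the saturating reading of the formula can be written down directly. The exponential saturation constant below is an arbitrary assumption used only to encode "diminishing but always positive returns":

```python
import numpy as np

def definition_capability(n_dim, w_dynamic, s_internal, r_fault, c_consolidate):
    """Conceptual sketch of D = N_dim x W_dynamic x S_internal x R_fault
    x C_consolidate, with the dimension term saturating: each added
    dimension helps less, but any factor near zero collapses the product."""
    dim_term = 1.0 - np.exp(-n_dim / 4.0)   # arbitrary saturation scale
    return dim_term * w_dynamic * s_internal * r_fault * c_consolidate

# Sighted (12 dims) vs. deafblind (3 dims, with cross-modal plasticity
# modeled as a modestly elevated fault-tolerance factor):
sighted = definition_capability(12, 1.0, 1.0, 1.0, 1.0)
deafblind = definition_capability(3, 1.0, 1.0, 1.2, 1.0)
print(round(sighted, 3), round(deafblind, 3))   # 0.95 0.633
```

Even with R_fault elevated to mimic plasticity compensation, the collapse of N_dim dominates the product, reproducing the nonlinear interaction noted above.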
IV. Compute-in-Memory: From Front-End Convergence to Synaptic Remodeling
Front-end multi-dimensional convergence is not “a software process happening inside the brain”—every successful execution directly alters the physical hardware structure of the brain. Synaptic plasticity research has confirmed that category learning is accompanied by structural synaptic changes: the formation of new synapses, pruning of old synapses, and long-term potentiation or depression of synaptic efficacy. Auditory fear conditioning leads to the formation of new synaptic boutons in lateral amygdala neurons projecting to the auditory cortex; successful musical categorization learning is associated not only with functional changes but also with structural differences in bilateral auditory cortex. Memory is not stored in some “location” but encoded in specific sets of synapses and selected neural pathways.
This fact reveals the most fundamental difference between the human brain and von Neumann architecture computers: the brain is a compute-in-memory biological entity. In traditional computers, the processor (CPU) and memory are physically separated, with information shuttled back and forth between them—this is the “von Neumann bottleneck.” In the brain, every neuron is simultaneously a computational unit and a storage unit: the connection strength of synapses is the stored “data,” and signal transmission between synapses is the “computation.” Knowledge is structure; structure is the computational substrate. Neuromorphic computing research estimates that the storage capacity of the human brain is approximately 7.48×10^18 bytes, its computational power approximately 6.24×10^18 FLOPS, and its energy efficiency, after long evolutionary optimization, can reach 79%—eight orders of magnitude higher than the latest computer chips.
The direct consequence of compute-in-memory is that every front-end convergence physically reshapes synaptic connections—a definition is not “data written into” the brain but a change in the brain’s structure itself. This means the human cognitive system possesses a property entirely absent in AI: it grows stronger with use. Every successful categorical definition enhances the structural precision of the synaptic network, making the next convergence more efficient—this is the physical basis of the “developmental feedback loop of shape bias” discovered by Linda Smith’s laboratory.
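The compute-in-memory principle, in which connection strengths are the stored data and signal propagation is the computation, has a classic minimal model in Hebbian associative memory. The sketch below is a toy Hopfield-style network with made-up binary patterns, offered only as an illustration of the principle, not as a model of actual cortical coding:

```python
import numpy as np

# Two toy binary patterns to be "known" by the network.
patterns = np.array([
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
])
n_units = patterns.shape[1]

# Hebbian learning: storing a pattern *is* reshaping the weight matrix.
# There is no separate memory bank -- the weights are the knowledge.
W = sum(np.outer(p, p) for p in patterns) / n_units
np.fill_diagonal(W, 0)

# Recall: present a corrupted cue and let the signal propagate through
# the very same connections that encode the memory.
cue = patterns[0].copy()
cue[:2] *= -1                             # corrupt two of eight elements
recalled = np.sign(W @ cue)
print((recalled == patterns[0]).all())    # True: the pattern is completed
```

Storage and computation are the same physical object here, which is why recall is a single propagation step rather than a fetch-then-compute cycle: a minimal analogue of the von Neumann contrast drawn above.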
V. The Abstraction Layer: The Emergence of Complete Imagery
5.1 From Concrete Definitions to Abstract Concepts
Conceptual metaphor theory (Lakoff & Johnson, 1980) demonstrated that abstract concepts are anchored to embodied sensorimotor experience through metaphorical mapping. “Justice” acquires its initial grounding through “balance” (vestibular sense); “freedom” acquires its embodied basis through “unrestrained movement” (proprioception); “causation” is rooted in the “push-move” motor patterns repeatedly experienced by infants in early development. Metaphor, serving as a bridge, allows humans to “use concrete, familiar domains to access and reason about abstract concepts.” Repeatedly used metaphorical mappings can generate new abstract representations—these representations originate from embodied experience but transcend its sensorimotor details, and can then be flexibly applied to new situations. The concrete definitions formed through multi-dimensional sensory convergence by the cognitive front-end system constitute the “root system” of the entire conceptual edifice; abstract concepts are the “branches and leaves” growing upon these roots through metaphor and linguistic combination.
5.2 Complete Imagery: The Ultimate Product of the Abstraction Layer
The operation of the abstraction layer is not symbolic computation—it is the continuous reorganization of multi-dimensional information by compute-in-memory biological hardware, whose ultimate product is a perceivable, manipulable, rotatable, decomposable complete mental image. Humans construct internal mental models of the external world, and these models support reasoning and decision-making through mental simulation, enabling individuals to predict the outcomes of actions without actually performing them. Mental rotation experiments have confirmed that the time humans require to judge whether two three-dimensional objects are consistent is linearly proportional to the angle of rotation—indicating that people manipulate objects mentally in a continuous, analog fashion.
Neuroscience research has further revealed that “deep thinking” (mental simulation) and “shallow processing” (symbolic operations) activate entirely different brain networks. When deep processing levels are high, information connectivity in key substrates of the semantic network is enhanced, and brain representations become more generalizable in semantic space. The essence of deep thinking is the reactivation and integration of multimodal sensory experience—you truly “see,” “feel,” and “experience” all the properties of that concept, rather than merely operating on its symbolic label.
5.3 Imagery Emergence in Scientific Discovery
The most important breakthroughs in the history of science have repeatedly come from the sudden emergence of complete imagery. Kekulé, dozing by the fireplace, dreamed of a snake biting its own tail and awoke to realize the ring structure of the benzene molecule. Einstein, after months of intensive mathematical derivation, let his imagination wander freely, imagining himself riding on a beam of light—this image triggered the core idea of special relativity. Archimedes, watching the water level rise as he bathed, grasped the principle of buoyancy. Mendeleev, after three days of intense thinking, saw elements arranging themselves like a musical sequence in a dream, and immediately wrote it down upon waking—this became the periodic table. Poincaré, at the very moment of stepping onto the platform of a public bus, suddenly “saw” that the transformations of Fuchsian functions were identical to those of non-Euclidean geometry.
The common feature of these epiphanies is: scientists first use conscious effort (System 2) to accumulate a large volume of multi-dimensional information and knowledge structures, and then in a relaxed state—dreaming, bathing, walking, riding in a vehicle—the unconscious compute-in-memory system continues running multi-dimensional information reorganization in the background until, at some moment, convergence is achieved and a complete image bursts into consciousness. Associative activity occurring during REM sleep helps the brain reorganize information in ways that facilitate breakthroughs. The three conditions for creativity—deep immersion in a domain, relaxation into a flow state, and unexpected combination of different concepts—are essentially about creating optimal conditions for the unconscious convergence of the compute-in-memory system.
VI. Fast System 1: The Operational Engine of the Abstraction Layer
Kahneman’s dual-system theory—the fast, automatic, unconscious System 1 and the slow, deliberate, analytical System 2—receives an entirely new interpretation within this framework. The traditional understanding treats System 1 as a “low-level intuition, prone to error” shortcut and System 2 as the “higher-level rationality” error-correction mechanism. But the analysis within this framework reveals: System 1 is the operational engine of the abstraction layer—it runs on the compute-in-memory synaptic structure and constitutes the main body of 96% of human cognitive activity.
The reason System 1 is “fast” is precisely because it runs on compute-in-memory hardware—there is no need to “retrieve data from storage to a processor for computation”; the synaptic structure itself is the knowledge, and signal transmission itself is computation. Every categorical definition formed through front-end convergence has already been written into the synaptic structure, and System 1’s access to them is instantaneous, parallel, and requires no conscious participation. The latest research shows that System 1 is not incapable of logical reasoning—in approximately 30% of cases, participants produced logically correct answers relying solely on System 1, while System 2’s corrective intervention occurred only about 10% of the time.
The eureka moments in scientific discovery—Kekulé’s snake, Einstein’s beam of light, Archimedes’ bathwater—were all products of System 1. Scientists first use System 2 (conscious slow thinking) to extensively collect and arrange informational raw materials, and then after System 2 relinquishes control, System 1’s compute-in-memory hardware continues running multi-dimensional information reorganization unconsciously, ultimately converging long-accumulated knowledge into a complete image. The true role of System 2 is not “higher-level thinking” but conscious front-end assistance—providing raw materials for System 1’s convergence and, after imagery emergence, taking responsibility for verification and articulation.
VII. The Determinative Variables of Cognitive Level Differences
A core corollary of this framework is: the differences in cognitive levels among human individuals are determined not by the brain’s “computational speed” or “memory capacity” but by the efficiency of multi-dimensional sensory alignment and convergence. This corollary is supported by multiple independent lines of evidence.
First, multisensory integration ability is directly correlated with IQ. Research has found that children with enhanced multisensory integration ability under both quiet and noisy conditions are more likely to score above average on the Wechsler Intelligence Scale for Children (WISC-IV); approximately 45% of children with relatively lower intellectual ability show diminished multisensory integration capability. Second, there exists a strong interactive link between sensory discrimination and intelligence—high-IQ individuals not only process faster but, more critically, possess a stronger ability to suppress irrelevant large stimuli. Working memory performance is predicted not by the neural enhancement of task-relevant information but by individual differences in neural suppression of distractors.
Third, the Parieto-Frontal Integration Theory (P-FIT) reveals that the brain network underlying fluid intelligence directly includes the sensory front-end: temporal and occipital sensory processing regions are incorporated into the support circuits of fluid intelligence for their contribution to early-stage processing of sensory information. Individuals in whom the functional distance between the prefrontal cortex and sensory cortices is optimized—neither too far (causing abstraction to become detached from the perceptual foundation) nor too close (becoming overwhelmed by concrete details)—exhibit higher fluid intelligence. Fourth, research on autism spectrum disorders provides reverse validation: impaired multisensory integration, with abnormal temporal binding windows, produces fragmented percepts rather than coherent wholes, cascading into difficulties in social cognition and abstract thinking. Fifth, age-related cognitive decline parallels sensory system degradation—sensory ability and information integration capability can independently predict cognitive status in older adults.
Synthesizing the above evidence, this framework proposes a four-layer progressive model of cognitive level differences:
| Layer | Core Process | Individual Difference Manifestation |
|---|---|---|
| Perceptual Layer | Multi-dimensional acquisition (dimension count × acuity per dimension) | Sensory sensitivity differences |
| Convergence Layer | Multi-dimensional alignment and convergence (temporal binding precision × noise suppression) | Few-shot definition accuracy differences |
| Hardware Layer | Synaptic remodeling (compute-in-memory efficiency × neural connectivity precision) | Neural efficiency differences (more intelligent individuals show lower activation) |
| Abstraction Layer | Imagery emergence (sensory-prefrontal communication × irrelevant information suppression) | Fluid intelligence and creativity differences |
The four layers are in a progressive relationship where lower layers determine upper layers: the dimensional precision of the perceptual layer determines the definition quality of the convergence layer; the definition quality of the convergence layer determines the synaptic remodeling precision of the hardware layer; the remodeling precision of the hardware layer determines the imagery emergence efficiency of the abstraction layer. Individuals with high cognitive levels are those who are more efficient at every layer from bottom to top—and the starting point at the very bottom is multi-dimensional sensory alignment and convergence.
VIII. Structural Critique of AI Systems
Examining current AI systems through this framework, three levels of structural deficiency can be clearly identified:
8.1 Dimensional Poverty
The most advanced current multimodal large language models integrate only vision and text (with some partially incorporating audio), covering at most 2–3 perceptual channels. Tactile, olfactory, gustatory, proprioceptive, interoceptive, and other dimensions are entirely absent. Kadambi et al. (2025) noted: “Multimodal large language models still lack any bodily experience. They interpret ‘hot’ without ever having felt warmth, and parse ‘hunger’ without ever having experienced need.” A review of 40 years of cognitive architecture research shows that olfaction has been implemented in only three architectures. This is not a problem of data volume but of the absence of dimensions themselves—no amount of lines on a two-dimensional plane can enclose a closed surface in three-dimensional space.
8.2 Dependence on External Alignment
AI systems’ multimodal alignment relies on external mechanisms—contrastive learning (CLIP), timestamp matching, prompt injection, etc. These methods can only achieve approximate alignment and are fundamentally “post-hoc stitching”: visual encoders and language models are each trained independently, then aligned through projection layers. This stands in fundamental contrast to the human sensory system’s “innate symbiosis and biological clock synchronization.” LLMs cannot even correctly answer “What is today’s date?” because they lack an internal temporal state—time for them is a parameter of the input, not a property of existence.
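The stitching step itself can be made concrete with the symmetric contrastive objective used by CLIP-style models. The numpy sketch below is simplified (batch size, temperature, and embeddings are illustrative; real systems learn the encoders by gradient descent on this loss):

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_loss(img_emb, txt_emb, temperature=0.1):
    """Symmetric InfoNCE: row i of the image batch should match row i of
    the text batch; every off-diagonal pair serves as a negative. This is
    the external alignment mechanism, imposed after the fact on two
    independently produced embedding spaces."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent(l):                          # cross-entropy, diagonal = correct
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return (xent(logits) + xent(logits.T)) / 2

aligned = rng.standard_normal((8, 16))
jitter = 0.1 * rng.standard_normal((8, 16))
print(clip_loss(aligned, aligned + jitter))              # low: pairs line up
print(clip_loss(aligned, rng.standard_normal((8, 16))))  # high: no alignment
```

The loss only measures pairing within the presented batch: nothing in this mechanism supplies the always-on, millisecond-level synchrony that the biological clock architecture provides for free.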
8.3 Absence of Convergence Mechanisms
Human few-shot convergence relies on the synchronous action of multi-dimensional constraints—each dimension provides independent categorical boundary constraints, and multi-dimensional cross-verification ensures definitional robustness. AI’s deep learning is fundamentally single-channel or weakly coupled multi-channel statistical fitting, requiring massive data to compensate for the categorical boundary ambiguity caused by dimensional deficiency. This compensation is a surrogate for dimensional absence, not genuine convergence. A system that has never “touched” a cat, no matter how many images of cats it has seen, possesses an incomplete “definition of cat.”
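The claim that multiple dimensions jointly lock in a boundary with few samples can be illustrated with a toy geometric model (our illustration, not the paper's formalism): treat a category as a box in feature space, and let each sample constrain every observed dimension at once. The consistent candidate region then shrinks multiplicatively with the number of dimensions, not merely with the number of samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy category: a box in 10-D feature space, each dimension in [0.4, 0.6].
# Dimensions and the slack margin are arbitrary illustrative choices.
TRUE_CATEGORY = np.array([[0.4, 0.6]] * 10)

def candidate_volume(samples, n_dims, slack=0.1):
    """Volume of the region still consistent with all samples,
    when only the first n_dims dimensions are observed."""
    s = samples[:, :n_dims]
    lo = np.clip(s.min(axis=0) - slack, 0, 1)
    hi = np.clip(s.max(axis=0) + slack, 0, 1)
    return float(np.prod(hi - lo))

# Three exemplars of the category, perceived across all 10 dimensions.
samples = rng.uniform(TRUE_CATEGORY[:, 0], TRUE_CATEGORY[:, 1], size=(3, 10))

vol_1d = candidate_volume(samples, n_dims=1)    # single-channel learner
vol_10d = candidate_volume(samples, n_dims=10)  # multi-dimensional learner
print(vol_1d, vol_10d)  # the 10-D region is orders of magnitude tighter
```

With the same three samples, the single-dimension learner is left with a wide band of candidates, while the ten-dimension learner has already carved out a region thousands of times smaller: a cartoon of why dimensional richness, not sample count, drives convergence in this framework.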
| Dimension | Human Cognitive Front-End | Current AI Systems |
|---|---|---|
| Number of Perceptual Dimensions | 10+ dimensions in parallel | 2–3 dimensions (vision + text + partial audio) |
| Temporal Alignment | Biological clock internal alignment | External timestamps / prompt injection |
| Convergence Efficiency | 1–4 sample lock-in | Thousands to millions of samples for fitting |
| Verification Mechanism | Multi-channel cross-validation | Single-channel statistical confidence |
| Fault Tolerance | Cross-modal plasticity compensation | Modality loss → hallucination |
| Definition Persistence | Sleep consolidation → lifelong retrieval | Valid within context window → disappears upon session end |
IX. Research Gaps and Future Directions
This framework reveals systematic gaps in current research. Nearly 80% of existing perception research is concentrated on the single channel of vision; the constraining roles of touch, smell, and taste in categorical definition are systematically underestimated. Few-shot learning research has been conducted almost entirely in the visual domain, and no study has systematically examined how multi-dimensional synchronous convergence accelerates definition formation. Biological clock research and multisensory integration research have not yet been incorporated into a single framework. A bridge between cross-modal plasticity research and theories of intelligence definition is lacking.
Core questions that future research needs to answer include: What is the contribution weight of each sensory dimension to categorical boundary precision? How do these weights dynamically change across tasks and developmental stages? Is there a quantifiable correlation between the precision of biological clock synchronization and the efficiency of definition formation? What are the efficiency limits of cross-modal compensation? Can AI systems incorporating tactile and olfactory dimensions be designed, and can the resulting changes in categorical definition precision be measured?
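The first question above, the contribution weight of each sensory dimension, already admits a simple operationalization: measure how much categorical boundary precision drops when a dimension is ablated. The sketch below does this on synthetic data with a nearest-centroid rule; the data, the classifier, and the built-in ranking of dimensions are all assumptions for illustration, not an experimental result.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic two-category data in 5 dimensions. Dimension 0 is constructed
# to be the most informative (largest class separation), dimension 4 the least.
N_DIMS, N_SAMPLES = 5, 400
separation = np.array([2.0, 1.0, 0.5, 0.25, 0.1])
X = rng.standard_normal((N_SAMPLES, N_DIMS))
y = rng.integers(0, 2, N_SAMPLES)
X += y[:, None] * separation  # shift class-1 samples along each dimension

def boundary_precision(X, y):
    """Accuracy of a nearest-centroid rule, a minimal 'categorical boundary'."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    pred = np.linalg.norm(X - c1, axis=1) < np.linalg.norm(X - c0, axis=1)
    return float((pred == y).mean())

full = boundary_precision(X, y)
# Contribution weight of dimension d = precision lost when d is removed.
weights = {d: full - boundary_precision(np.delete(X, d, axis=1), y)
           for d in range(N_DIMS)}
print(weights)  # dimension 0 should carry the largest weight
```

The same leave-one-out logic would transfer to real multisensory data: train or fit the boundary with all channels, ablate one channel at a time, and read off each channel's weight as the resulting loss of boundary precision.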
X. Conclusion
The front-end capability of human intelligence—converging the continuous signals of the physical world into discrete categorical definitions—is a long-neglected yet critically important cognitive foundation. This paper constructs a complete four-layer progressive model from the perceptual layer to the abstraction layer: the multi-dimensional sensory front-end acquires information in parallel through 10+ channels, completing high-dimensional convergence under the internal synchronization of a multi-scale nested biological temporal architecture, locking in the structural features of objects with minimal samples; the results of convergence directly alter synaptic structure, forming physical encoding of knowledge-as-structure on compute-in-memory biological hardware; upon this substrate, the abstraction layer, through metaphorical mapping and continuous unconscious reorganization of multi-dimensional information, gives rise to manipulable, verifiable, complete mental imagery; the fast system (System 1), as the operational engine of the abstraction layer, completes 96% of cognitive activity on compute-in-memory hardware, including the eureka moments in scientific discovery.
The core insight of this framework is: the metric of intelligence should not begin from computational power but from convergence quality—how precisely and how efficiently a cognitive agent can carve out meaningful definitional boundaries from the open world is the first-principles indicator of intelligence. Multi-dimensional sensory alignment and convergence efficiency are the core variables determining differences in cognitive levels among human individuals—from sensory sensitivity to neural efficiency to fluid intelligence to creativity, differences at every layer can be traced back to the signal-to-noise ratio of front-end convergence. The structural deficiencies of current AI systems across four dimensions—dimension count, internal alignment, compute-in-memory, and imagery emergence—cannot be remedied by increasing computational power and data volume. What is required is a paradigm shift from von Neumann architecture to compute-in-memory architecture, and from statistical vectors to multi-dimensional physical grounding.