TECHNICAL DESIGN PROPOSAL · MAY 2026 · V8

Unified Architecture AI Factory
Design Proposal

A Large-Memory Independent Inference Node Architecture for Long-Context Agent Workloads

Eliminating Cross-Device Model Parallelism
Through Terabyte-Scale Unified Memory
and Single-Node Complete Model Execution

Date: May 19, 2026

Category: Technical Design Proposal

Domains: AI Hardware Architecture · Memory Systems Engineering · Data Center Infrastructure · Semiconductor Economics

Version: V8 (Three-way Cross-Model Adversarial Review + Physical Validation + Dual-Path Restructuring + Product Ecosystem Definition)

이조글로벌인공지능연구소

LEECHO Global AI Research Lab

Opus 4.6 · GPT 5.5 · Gemini 3.1

인지집단 (Cognitive Collective)

Abstract

This proposal presents a large-memory independent inference node architecture designed for long-context, low-batch, agent-type inference workloads. The core idea is to integrate a CPU, an inference accelerator, terabyte-scale unified memory, and NVMe SSDs into a self-contained inference layer, where each layer fully loads and runs trillion-parameter-scale large models, thereby eliminating the need for inter-GPU model-parallel communication. This proposal is positioned as a complementary layer to current HBM/NVLink GPU cluster approaches: it does not replace high-QPS, high-batch online inference clusters, but serves specific scenarios such as long-running agent tasks, private deployments, dedicated inference, and standard data center deployments.

The proposal demonstrates two mutually exclusive technical paths: Path A is based on the NVIDIA Vera Rubin Superchip (1.5 TB LPDDR5X + 576 GB HBM4 + NVLink-C2C 1.8 TB/s), available for validation in H2 2026, where 500B Dense FP4 weights (250 GB) can reside entirely within a single Rubin GPU’s 288 GB HBM4, achieving approximately 75 t/s decode speed (dual-GPU tensor parallelism can reach ~150 t/s but requires on-chip communication); Path B is based on a custom inference SoC + 3 TB+ DDR5 RDIMM (883 GB/s), representing the optimal long-term cost/power form factor but requiring 18–24 months of new chip design. Both paths eliminate inter-GPU model-parallel communication, but differ in memory type and performance characteristics.

This proposal also demonstrates the advantages of Dense models in deterministic output and engineering simplicity, as well as the capabilities that SSD-persistent KV Cache provides for long-running agent tasks—including interruptible recovery and cross-session memory (subject to strict version-binding invalidation conditions; this is not a general-purpose agent memory solution). More importantly, by expanding memory boundaries, this proposal enables significant simplification of approximately 19 auxiliary technologies in the current AI inference full stack that exist solely because of “insufficient memory”—of which approximately 6 can be completely removed in target scenarios, approximately 5 can be substantially weakened, and approximately 3 must be retained.

This paper explicitly acknowledges that: HBM GPU advantages rapidly return as batch size increases; actual quality loss from FP4 precision depends on the specific model and task; a 500B Dense FP4 model is currently a hypothetical asset, as current industry trends remain dominated by MoE; the attention O(n²) computational cost and prefill latency for ultra-long contexts (300K+ tokens) may become independent bottlenecks; Path B requires an inference-specialized SoC that does not yet exist. In the near term, the Vera Rubin Superchip (Path A) can serve as the validation platform. The paper further argues for the product ecosystem potential of the unified architecture as a “personalized AI node”: private AI deployment for enterprises and eventually individuals, the feasibility of a macOS-grade stable inference OS (based on the structural advantage of eliminating the distributed communication stack), and a complete four-layer product stack from hardware through application control (LiteClaw) to multimedia input.

SECTION 01

Problem Statement: Systemic Overhead of Current GPU Rack Architectures

The current high-end architecture for data center AI inference is represented by the NVIDIA GB300 NVL72—72 GPUs fully interconnected via NVLink and NVSwitch, forming a unified compute domain. This architecture was designed to distribute ultra-large models across multiple GPUs through tensor parallelism and expert parallelism. Its value in high-batch, high-throughput inference and large-scale training is undeniable.

However, for low-batch dedicated inference scenarios, the hardware serving “communication” occupies a substantial fraction of the entire rack:

Figure 1 · GB300 NVL72 Rack Power Flow Breakdown

Component	Function	Power Share	Cost Share	Produces Compute in Low-Batch Inference?
GPU Compute Cores	Matrix operations	~35%	~40%	✓ Yes
HBM Memory	Stores weights and KV Cache	~10%	~20%	✓ Yes
NVLink SerDes	Inter-GPU communication	~10%	Included in GPU	✗ Low utilization at low batch
NVSwitch Chips	Inter-GPU switching	~12%	~12%	✗ Idle when model is not sharded
Optical Modules	Cross-rack communication	~17%	~20%	✗ Not needed within a single rack
Liquid Cooling System	Thermal management	Additional 30–50%	~8% + infrastructure	✗ Indirect overhead

In the specific inference scenario of “single-user dedicated, low-batch, unsharded model,” there is a significant mismatch between the resource investment in communication and cooling components and the actual inference output. This proposal addresses precisely this mismatch with an architectural alternative.

Empirical MFU (Model FLOPs Utilization) data from industry further confirms the severity of this systemic overhead from a compute utilization perspective:

Organization	GPU Scale	MFU	Compute Waste Rate
xAI Colossus	550K H100/H200	11%	89% (internal memo: “embarrassingly low”)
DeepSeek-v3	H800 cluster	20–30%	70–80% (tighter communication bottleneck)
OpenAI GPT-4	~25K A100	32–36%	64–68%
Meta LLaMA 3 405B	Large-scale H100	38–41%	59–62% (most publicly reported industry data)
Google TPU	Custom TPU + Pathways	~46%	~54% (highest globally)

None of the world’s five most powerful AI companies (xAI, OpenAI, DeepSeek, Meta, and Google) has achieved training MFU above 50%. Google reached 46% using custom TPUs, its custom Pathways framework, and custom networking, approaching the engineering limits of the distributed paradigm. xAI’s 550K GPU cluster has an MFU of only 11%, meaning that compute equivalent to roughly 490K GPUs (550K × 89%) produces no useful model work. xAI President Michael Nicolls stated: “This marks a shift in the AI race from ‘who can buy more GPUs’ to an engineering battle of ‘who can make every single GPU work effectively.’” The root causes of low MFU (large-scale cluster communication overhead, straggler waiting, and the memory wall) are not operational issues but structural limits of the distributed parallel paradigm under Amdahl’s Law. The unified architecture avoids model-parallel communication overhead through single-node complete model execution: the largest sources of MFU loss (cross-device communication, synchronization waiting, and straggler waiting) no longer exist in the target scenario. However, single-node deployments still face independent utilization challenges, including memory bandwidth utilization, kernel efficiency, and prefill compute utilization.
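The "490K GPUs" figure is an equivalence rather than a count of machines sitting idle. A minimal sketch of the arithmetic, assuming MFU is read as the fraction of peak FLOPs converted into model computation (the function name is illustrative, not from the source):

```python
def idle_gpu_equivalent(gpu_count: int, mfu: float) -> float:
    """GPUs' worth of peak compute that is not converted into model FLOPs."""
    if not 0.0 <= mfu <= 1.0:
        raise ValueError("mfu must be a fraction between 0 and 1")
    return gpu_count * (1.0 - mfu)


# xAI row of the table above: 550K GPUs at 11% MFU.
xai_idle = idle_gpu_equivalent(550_000, 0.11)
print(f"Idle-equivalent GPUs: {xai_idle:,.0f}")
```

The same function applied to the other rows gives their "compute waste rate" expressed in GPU-equivalents instead of percentages.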

SECTION 02

Core Insight: Software Advances Enable Complete Single-Device Execution

Three technological advances in 2025–2026 are expanding the parameter boundary of “a single device can fit an entire model”:

2.1 Extreme Quantization

antirez (creator of Redis) demonstrated in May 2026 on a Mac Studio M3 Ultra (512 GB unified memory): DeepSeek V4 PRO (1.6T parameter MoE model) compressed to a 433 GB GGUF file via 2-bit quantization, with usable performance on benchmarks including GPQA Diamond. However, antirez himself noted: 2-bit quality may be inferior to Flash models, speed may be too slow for some use cases, and performance differs significantly from full precision. Extreme quantization expands the capacity boundary, but it is not a free lunch—quality loss depends on the specific model, task, and quantization method.

2.2 Tiered KV Cache Storage

The technology for offloading KV Cache to SSDs has been implemented in frameworks such as vLLM, FlexGen, and NVIDIA Dynamo/CMX. However, KV Cache offloading is not as simple as “just move it to SSD”—access patterns, latency characteristics, and coordination with the compute pipeline require careful engineering (see Section 9 for details).

2.3 Native FP4 Hardware Support

The NVIDIA Blackwell architecture supports FP4 precision inference through a micro-scaling Transformer Engine. However, the actual quality impact of FP4 depends on training/quantization-aware training, calibration methods, outlier handling, sensitivity of specific layers, and the specific task. Subsequent calculations in this paper use FP4 as a space estimation baseline, but do not assume that all models and tasks can use FP4 without quality loss.

SECTION 03

Applicability Boundaries and Scenario Positioning

AI inference is not a monolithic workload. At least four fundamentally different inference scenarios exist, each with distinct hardware requirements:

Inference Scenario	Characteristics	Key Metrics	Optimal Hardware	This Proposal Applicable?
High-QPS Short Conversations	Many users, short context, high concurrency	QPS, TTFT, $/request	HBM GPU + high batch	No
High-Batch API Services	Batch requests, throughput-prioritized	throughput/GPU, $/Mtok	HBM GPU + batch optimization	No
Long-Context Agent Tasks	Low concurrency, ultra-long context, multi-step reasoning, state persistence required	Context stability, recoverability, cost	Target scenario of this proposal	✓ Yes
Offline Batch Processing / Data Generation	Latency-insensitive, throughput/$-focused	tok/$/hour	Flexible by scale	Partially applicable

Positioning of this proposal: Not a general-purpose solution to replace all GPU inference clusters, but a specialized architecture for long-context agent tasks, private deployments, dedicated slow large-model inference, and standard data center deployments. In these scenarios, the requirements for single-user dedicated access, low batch, ultra-long context, and persistent state differ from the high-batch, high-throughput optimization direction of HBM GPUs, and the DDR5 large-memory approach holds structural advantages.

Scenarios this proposal is not designed for and does not attempt to replace: real-time customer service requiring hundreds of tokens per second, API platforms processing thousands of requests per second, and large-scale online services requiring batch optimization at batch=64 or above. In these scenarios, HBM GPU advantages in bandwidth and Tensor Core utilization scale rapidly with increasing batch size.

SECTION 04

Proposal Design: Unified Architecture Inference Layer

4.1 Architectural Paradigm

NVL72 Paradigm: 72 GPUs Collaboratively Running 1 Model

Model sharded across 72 GPUs
NVLink + NVSwitch full interconnect
High-batch, high-throughput optimization
120 kW, liquid cooled
1 GPU failure affects entire domain

This Proposal: Independent Layers Each Running Complete Models

Each layer loads the complete model, no sharding
Eliminates inter-GPU model-parallel communication
Low-batch dedicated inference
Air-cooled, standard data center
1 layer failure affects only that instance

Important distinction: What this proposal eliminates is cross-device/cross-rack model-parallel communication (NVLink/NVSwitch/optical modules), not all high-speed coherent interconnects. In the near-term path, the NVLink-C2C bridge between the CPU and the accelerator still exists (900 GB/s on Grace Blackwell, 1.8 TB/s on Vera Rubin), but this is a CPU-GPU interconnect inside a single Superchip, not cross-GPU cluster communication.

4.2 Key Hardware Enabler: 256 GB DDR5-9200 RDIMM

Micron began sampling the 256 GB DDR5 RDIMM on May 12, 2026—built on the 1-gamma process, 9,200 MT/s, 3DS/TSV packaging, with per-module power consumption of 11.1 W.

4.3 Per-Layer BOM Estimate (By Path)

Table · Path A Per-Node BOM (Vera Rubin Superchip, Validatable H2 2026)

Component	Specification	Cost Estimate	Power
Vera Rubin Superchip	2× Rubin GPU (576 GB HBM4) + Vera CPU (88-core, 1.5 TB LPDDR5X)	$25,000–50,000	~1,000–1,200 W
NVMe SSD	2 × 4 TB Gen5 Enterprise	$1,200–2,000	~25 W
Network/BMC/PSU	100 GbE NIC + Management Controller + 1.5 kW PSU	$2,000–4,000	~60 W
Path A Total		~$30,000–60,000	~1,200 W

Table · Path B Per-Layer BOM (Custom DDR5 SoC, 2028–2030 Target Form Factor)

Component	Specification	Cost Estimate	Power
Custom Inference SoC	ARM CPU + Inference NPU + 12-ch DDR5 Controller	$1,500–3,000	~150 W
DDR5 RDIMM	12 × 256 GB DDR5-9200 = 3 TB	$6,000–12,000	~133 W
NVMe SSD	2 × 4 TB Gen5 Enterprise	$1,200–2,000	~25 W
Motherboard/BMC/Network/PSU	Custom motherboard + 100 GbE + Management Controller + 800 W PSU	$1,500–3,000	~55 W
Path B Total		~$10,200–20,000	P50: ~320 W / P95: ~430 W

Notes: Path A estimates are based on publicly available information about the NVIDIA Vera Rubin Superchip; actual pricing depends on SKU configuration and procurement volume. Path B is based on custom SoC volume production assumptions. As an early production product, the 256 GB DDR5-9200 RDIMM unit price may fluctuate in the $500–1,000 range. Path A must use an NVLink-C2C Superchip (see §4.5 bridge bandwidth analysis); PCIe GPUs cannot be used. Path B power is given at two tiers: P50 (typical load) and P95 (sustained full load + SSD write peak + fans at full speed).
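The Path B totals can be re-derived from the component rows. The sketch below sums the low and high cost estimates and the listed per-component power; the dictionary layout is an illustrative assumption, and the DDR5 row is expanded from the per-module figures quoted in this section (12 modules, $500–1,000 and 11.1 W each).

```python
# (cost_low_usd, cost_high_usd, power_w) per component, taken from the Path B table.
PATH_B_BOM = {
    "Custom Inference SoC": (1_500, 3_000, 150.0),
    "DDR5 RDIMM (12 x 256 GB)": (12 * 500, 12 * 1_000, 12 * 11.1),
    "NVMe SSD (2 x 4 TB)": (1_200, 2_000, 25.0),
    "Motherboard/BMC/Network/PSU": (1_500, 3_000, 55.0),
}

cost_low = sum(low for low, _, _ in PATH_B_BOM.values())
cost_high = sum(high for _, high, _ in PATH_B_BOM.values())
listed_power_w = sum(power for _, _, power in PATH_B_BOM.values())

print(f"Cost: ${cost_low:,} to ${cost_high:,}")
print(f"Sum of listed component power: {listed_power_w:.1f} W")
```

The summed component power (about 363 W) falls between the stated P50 (~320 W) and P95 (~430 W) tiers, which is what one would expect if the per-component figures are near-nominal rather than typical or peak values.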

4.4 Runnable Models and Speed

Decode speed at 839 GB/s effective bandwidth (12-ch DDR5-9200, Dense 95% utilization). Note: These are batch=1, decode phase theoretical upper bounds; prefill phase and larger batch behavior differ (see Section 5 for roofline analysis).

Model	Precision	Weight Size	Batch=1 Decode	Experience Tier
200B Dense	FP4	100 GB	~8.4 t/s	Acceptable interaction
500B Dense	FP4	250 GB	~3.4 t/s	Agent/code/document
1T Dense	FP4	500 GB	~1.7 t/s	Research/batch
70B Dense	FP16	140 GB	~6.0 t/s	Acceptable interaction
200B Dense	FP8	200 GB	~4.2 t/s	Agent/code
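The table rows follow from a single division: effective bandwidth over weight size. A minimal sketch, assuming a 64-bit (8-byte) data bus per DDR5 channel, which is what the 883 GB/s figure implies; the function and constant names are illustrative.

```python
# Bytes per parameter for each weight precision used in the table above.
BYTES_PER_PARAM = {"FP4": 0.5, "FP8": 1.0, "FP16": 2.0}


def peak_bandwidth_gbs(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical DRAM bandwidth in GB/s (assumes an 8-byte data bus per channel)."""
    return channels * mt_per_s * bus_bytes / 1000.0


def decode_tokens_per_s(params_billion: float, precision: str,
                        bandwidth_gbs: float, utilization: float = 0.95) -> float:
    """Batch=1 decode upper bound: every generated token scans all weights once."""
    weight_gb = params_billion * BYTES_PER_PARAM[precision]
    return bandwidth_gbs * utilization / weight_gb


peak = peak_bandwidth_gbs(channels=12, mt_per_s=9200)  # about 883 GB/s
for params, precision in [(200, "FP4"), (500, "FP4"), (1000, "FP4"),
                          (70, "FP16"), (200, "FP8")]:
    tps = decode_tokens_per_s(params, precision, peak)
    print(f"{params}B {precision}: {tps:.1f} t/s")
```

These are upper bounds only: KV Cache reads, activation traffic, and kernel overhead all reduce the achieved rate.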

4.5 Bridge Bandwidth Analysis (Critical Physical Constraint)

In this proposal, DDR5 memory is managed by the CPU memory controller, and the GPU/accelerator must access it through some bridging path. The bandwidth of the bridging path directly determines the feasibility of the proposal—choosing the wrong path will cause speed to collapse to unusable levels.

Table · Four Possible GPU→DDR5 Access Paths Compared

Bridging Path	Bandwidth	500B FP4 Decode Speed	Feasibility	Hardware Platform
PCIe 5.0 x16	~64 GB/s	~0.26 t/s	✗ Completely unusable	Any PCIe GPU
NVLink-C2C (Blackwell)	900 GB/s	~3.4 t/s	✓ Feasible (matches DDR5 bandwidth)	Grace Blackwell Superchip
NVLink-C2C (Rubin)	1,800 GB/s	~3.4 t/s (limited by DDR5 side)	✓ Feasible (DDR5 becomes bottleneck)	Vera Rubin Superchip
Custom SoC Native DDR5	883 GB/s (direct)	~3.4 t/s	✓ Optimal (zero bridge overhead)	Does not yet exist; requires new design

Fatal constraint: PCIe 5.0 x16 provides only 64 GB/s, merely 7.2% of DDR5-9200’s theoretical bandwidth (883 GB/s). If the near-term path were to use a PCIe GPU, actual inference speed would plummet from 3.4 t/s to approximately 0.26 t/s, rendering it completely unusable. Therefore, the near-term path must use an NVLink-C2C Superchip (900 GB/s on Grace Blackwell, 1.8 TB/s on Vera Rubin), which is currently the only feasible class of bridging solution. PCIe GPUs cannot be used in this proposal.

NVLink-C2C (900 GB/s) and DDR5-9200 (883 GB/s) are roughly bandwidth-matched, so NVLink-C2C does not constitute a bottleneck. The bottleneck is always on the DDR5 side. After the Rubin-generation NVLink-C2C doubles to 1,800 GB/s, DDR5 bandwidth will become the sole speed-limiting factor.
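The bridging comparison reduces to taking the slower of the bridge and the memory behind it. The sketch below is one way to model that, assuming the 95% utilization factor from Section 4.4 applies to the DDR5 side only and the bridge runs at its nominal rate; that split is a modeling assumption, chosen because it matches the rounded table values.

```python
def bridged_decode_tps(bridge_gbs: float, dram_peak_gbs: float, weight_gb: float,
                       dram_utilization: float = 0.95) -> float:
    """Batch=1 decode bound when weights in DRAM are read through a bridge."""
    effective_gbs = min(bridge_gbs, dram_peak_gbs * dram_utilization)
    return effective_gbs / weight_gb


DDR5_PEAK_GBS = 883.0   # 12-channel DDR5-9200
WEIGHT_GB = 250.0       # 500B Dense FP4

PATHS = {
    "PCIe 5.0 x16": 64.0,
    "NVLink-C2C (Blackwell)": 900.0,
    "NVLink-C2C (Rubin)": 1800.0,
    "Custom SoC native DDR5": float("inf"),  # no bridge in the path
}

for name, bridge_gbs in PATHS.items():
    tps = bridged_decode_tps(bridge_gbs, DDR5_PEAK_GBS, WEIGHT_GB)
    print(f"{name}: {tps:.2f} t/s")

pcie_share = PATHS["PCIe 5.0 x16"] / DDR5_PEAK_GBS  # fraction of DDR5 peak
```

Only the PCIe row is bridge-limited; every other row is capped by the DDR5 side, which is why doubling NVLink-C2C bandwidth does not change the result.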

This analysis has two important implications: (1) the Phase 1 validation platform for this proposal must use an NVLink-C2C Superchip, because a cobbled-together combination of a discrete PCIe GPU and a DDR5 server cannot be used; (2) it must be confirmed that the Superchip platform’s memory controller can support sufficiently large memory capacity. Investigation showed that NVIDIA’s CPU roadmap (Grace → Vera) uses LPDDR5X rather than DDR5 RDIMM, and this discovery necessitated splitting the proposal into two mutually exclusive paths.

4.6 Path A: Vera Rubin Superchip (Near-to-Mid-Term Preferred, H2 2026–2028)

The NVIDIA Vera Rubin Superchip was announced at CES 2026 and enters volume production in H2 2026, addressing two key limitations from the Grace era: memory capacity expands from 480 GB to 1.5 TB, and the memory form factor changes from soldered to modular SOCAMM (co-developed with Micron).

Table · Vera Rubin Superchip Key Specifications (vs. V5 Assumptions vs. Grace Blackwell)

Component	Grace Blackwell	Vera Rubin	V5 Original Assumption
CPU Cores	72-core Grace ARM	88-core Olympus ARM, 176-thread SMT	72-core Grace
CPU Memory	480 GB LPDDR5X soldered	1.5 TB LPDDR5X SOCAMM (modular)	3 TB DDR5 RDIMM
CPU Memory Bandwidth	~500 GB/s	1.2 TB/s	883 GB/s
GPU	2× B200, 384 GB HBM3e	2× Rubin, 576 GB HBM4	No HBM
GPU Bandwidth	~16 TB/s	44 TB/s	—
NVLink-C2C	900 GB/s	1.8 TB/s	900 GB/s
GPU FP4 Compute	~40 PFLOPS	100 PFLOPS (dual GPU)	—

Key finding: 500B Dense FP4 weights (250 GB) can reside entirely within a single Rubin GPU’s 288 GB HBM4 (250 GB < 288 GB), with the remaining 38 GB + the other GPU’s full 288 GB HBM4 (326 GB combined) available for hot KV Cache, and 1.5 TB LPDDR5X entirely for warm/cold KV Cache overflow. Engineering margin warning: The theoretical 38 GB margin will be further consumed in end-to-end deployment by FP4 scale/metadata, embedding/lm_head weights, runtime workspace, activation buffers, CUDA graph workspace, KV hot zones, and memory fragmentation—while 500B FP4 single-GPU residency is theoretically valid, volume deployment requires empirical testing to verify actual margin under tight packing.

Important: Single-GPU vs. Dual-GPU Communication Boundary—Since 500B FP4 weights can fit within a single GPU, no cross-GPU weight communication is needed during inference. The effective HBM4 bandwidth is therefore a single GPU’s ~22 TB/s rather than the combined dual-GPU 44 TB/s. Using dual-GPU tensor parallelism can yield higher throughput but introduces Superchip-internal communication overhead (NVLink-C2C 1.8 TB/s internal bridge). The table below shows both configurations:

Model	Weight Location	Decode Speed (85% Effective Bandwidth)	Cross-GPU Communication Required?
500B Dense FP4	Single GPU HBM4 (250 GB of 288 GB)	~75 t/s	No
500B Dense FP4 (TP=2)	Dual GPU sharded (125 GB each)	~150 t/s	Yes (on-chip C2C)
1T Dense FP4	HBM4 + partial LPDDR5X overflow	Limited by C2C 1.8 TB/s: ~3.1 t/s	Yes
2T Dense FP4	Mostly LPDDR5X	Limited by LPDDR5X 1.2 TB/s: ~1.0 t/s	Yes
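The Path A rows can be reproduced with the same bandwidth-over-weights bound, taking the slowest tier that the weights must cross as the limiting bandwidth. This is a simplified sketch of the table's own reasoning, not a performance model: it treats the whole weight scan as limited by that one tier, and the names are illustrative.

```python
def bandwidth_bound_tps(bandwidth_gbs: float, weight_gb: float,
                        utilization: float = 0.85) -> float:
    """Batch=1 decode bound at the stated 85% effective bandwidth."""
    return bandwidth_gbs * utilization / weight_gb


HBM4_PER_GPU_GB = 288
WEIGHTS_500B_FP4_GB = 250

# Capacity headroom when 500B FP4 weights sit in one GPU's HBM4.
single_gpu_headroom_gb = HBM4_PER_GPU_GB - WEIGHTS_500B_FP4_GB
hot_kv_pool_gb = single_gpu_headroom_gb + HBM4_PER_GPU_GB

PATH_A_ROWS = {
    "500B, single GPU HBM4 (22 TB/s)": bandwidth_bound_tps(22_000, 250),
    "500B, TP=2 (44 TB/s)": bandwidth_bound_tps(44_000, 250),
    "1T, limited by C2C (1.8 TB/s)": bandwidth_bound_tps(1_800, 500),
    "2T, limited by LPDDR5X (1.2 TB/s)": bandwidth_bound_tps(1_200, 1_000),
}

print(f"Headroom: {single_gpu_headroom_gb} GB, hot KV pool: {hot_kv_pool_gb} GB")
for label, tps in PATH_A_ROWS.items():
    print(f"{label}: {tps:.1f} t/s")
```

The TP=2 figure ignores the synchronization cost of tensor parallelism, so it should be read as a ceiling.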

Path A’s major finding: A single Rubin GPU’s 288 GB HBM4 can just accommodate 500B FP4 weights—achieving approximately 75 t/s (single GPU 22 TB/s bandwidth) without cross-GPU communication. Dual-GPU tensor parallelism can reach approximately 150 t/s but requires Superchip-internal C2C communication. Either configuration far exceeds Path B’s 3.4 t/s and human reading speed (~5 t/s). Models exceeding 288 GB (700B+) require cross-GPU or overflow to LPDDR5X.

4.7 Path B: Custom DDR5 RDIMM Inference SoC (Mid-to-Long-Term Optimal, 2028–2030)

Path B represents the optimal form factor of the V5 original vision: a unified SoC integrating ARM CPU cores, an inference-specialized NPU, and a native 12+ channel DDR5/DDR6 controller. HBM, NVLink, and most Tensor Cores are removed. Note: NVIDIA’s Grace/Vera CPUs do not support DDR5 RDIMM (they use LPDDR5X), so Path B requires an entirely new chip design. Path B is far slower than Path A (without HBM4, 500B FP4 batch=1 decode is only approximately 3.0–3.4 t/s), but offers extreme advantages in per-layer cost (~$10–20K), power (P50 ~320 W), and deployment simplicity—no HBM, no NVLink, no liquid cooling, pure air cooling.

4.8 Dual-Path Comparison Overview

Dimension	Path A (Vera Rubin)	Path B (Custom DDR5 SoC)
Availability	H2 2026 (in production)	2028–2030 (requires new chip)
Total Memory	2.1 TB (576 GB HBM4 + 1.5 TB LPDDR5X)	3 TB+ DDR5 RDIMM
500B FP4 Decode	~75 t/s (single GPU) / ~150 t/s (TP=2)	~3.0–3.4 t/s (DDR5-limited)
1T+ FP4 Decode	~1.0–3.1 t/s (overflow to LPDDR5X)	~1.5–1.7 t/s
Per-Node Cost	~$30K–60K	~$10K–20K
Per-Node Power	~1,200 W	~320–430 W
Cooling	NVL72 deployment confirmed 100% liquid-cooled; standalone Superchip cooling TBD	Air-cooled
Eliminates cross-device model-parallel communication?	Yes	Yes
Requires new chip?	No	Yes (18–24 months)

SECTION 05

Roofline Analysis: The Decisive Impact of Batch Size on Architecture Selection

LLM decode is a memory-bandwidth-bound operation: generating each token requires scanning all model weights. As batch size increases, multiple users’ tokens can share a single weight scan, causing throughput to grow linearly (until compute saturation). This is the core advantage of HBM GPU clusters—and the core limitation of this proposal.

5.1 Bandwidth-Bound Decode Model

At batch=1, single-token throughput ≈ memory bandwidth ÷ weight size. At batch=N, aggregate throughput ≈ N × single-token speed (until the compute-bound boundary).

Table · 500B Dense FP4 (250 GB Weights) Batch vs. Throughput Comparison

Batch	DDR5 (883 GB/s)	H100 HBM3 (3.35 TB/s)	B200 HBM3e (8 TB/s)	Gap Multiple
1	3.4 t/s	13.4 t/s	32 t/s	4–9×
4	13.6 t/s total	53.6 t/s total	128 t/s total	4–9×
16	~54 t/s total*	~214 t/s total	~512 t/s total	4–9×
64	~54 t/s total*	~856 t/s total	compute-bound	16×+

* The DDR5 solution begins approaching compute saturation around batch=16 (depending on accelerator compute power), and throughput no longer scales linearly with batch. HBM GPUs, with far more compute available per byte of memory bandwidth, hit the compute-bound inflection point at larger batch sizes.
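The batch table is a linear bandwidth model with a cap. The sketch below reproduces its shape, assuming the DDR5 column uses the 839 GB/s effective bandwidth from Section 4.4 and saturates at batch=16 as the footnote states; the saturation point is the document's assumption, not a measured value, and the B200 compute-bound limit at batch=64 is not modeled.

```python
from typing import Optional


def aggregate_tps(batch: int, bandwidth_gbs: float, weight_gb: float,
                  saturation_batch: Optional[int] = None) -> float:
    """Aggregate decode throughput: linear in batch until compute saturates."""
    effective_batch = batch if saturation_batch is None else min(batch, saturation_batch)
    return effective_batch * bandwidth_gbs / weight_gb


WEIGHT_GB = 250.0  # 500B Dense FP4
SYSTEMS = {
    "DDR5": (839.0, 16),         # effective GB/s, assumed saturation batch
    "H100 HBM3": (3350.0, None),
    "B200 HBM3e": (8000.0, None),
}

for batch in (1, 4, 16, 64):
    row = {name: aggregate_tps(batch, bw, WEIGHT_GB, sat)
           for name, (bw, sat) in SYSTEMS.items()}
    print(batch, {name: round(tps, 1) for name, tps in row.items()})

gap_at_64 = (aggregate_tps(64, 3350.0, WEIGHT_GB)
             / aggregate_tps(64, 839.0, WEIGHT_GB, 16))
```

At batch=64 the H100-to-DDR5 ratio comes out near 16×, matching the last row of the table.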

Key finding: In the batch=1 to 4 range, the DDR5 solution’s per-user speed trails HBM GPUs by approximately 4–9×, which is the tradeoff for 10–15× lower power and 5–6× lower cost. But beyond batch=16, DDR5 throughput plateaus while HBM GPU throughput keeps scaling with batch, so the gap widens (16×+ at batch=64) and the DDR5 solution cannot compete. This roofline characteristic determines the optimal application scenario for this proposal: low-batch (≤4), dedicated, long-context inference.

5.2 $/Token and W/Token Comparison

Economics in the target scenario (batch=1, 500B FP4):

Metric	This Proposal (DDR5, Single Layer)	B200 HBM3e (Single Card)	GB300 NVL72 (Rack)
Batch=1 Speed	3.4 t/s	32 t/s	~Hundreds of t/s (sharded)
Power	~400 W	~1,400 W	~120,000 W
W/token (batch=1; W ÷ t/s = J/token)	~118 J/tok	~44 J/tok	N/A (over-provisioned)
Hardware Cost (est.)	~$20K	~$30–40K (GPU alone)	~$2–3M
$/token/hour	Low (dedicated, no waste)	Medium (requires batch sharing to amortize)	High (requires high utilization)

Note: At batch=1, this proposal (Path B) does not outperform a single HBM GPU in W/token—HBM GPU per-token energy efficiency is higher. However, Path B’s advantage lies in total system cost and deployment threshold. Path A (Vera Rubin) with the all-HBM4 configuration for 500B FP4 is also competitive on W/token.
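The W/token row is power divided by decode rate, which has units of joules per token. A short sketch using the table's figures; the Path A line uses the ~1,200 W node power and ~75 t/s single-GPU rate quoted in Section 4, and the function name is illustrative.

```python
def joules_per_token(power_w: float, tokens_per_s: float) -> float:
    """Energy per generated token: watts divided by tokens per second."""
    return power_w / tokens_per_s


path_b = joules_per_token(400.0, 3.4)      # DDR5 single layer
b200 = joules_per_token(1400.0, 32.0)      # B200 single card
path_a = joules_per_token(1200.0, 75.0)    # Vera Rubin, weights in one GPU's HBM4

print(f"Path B: {path_b:.0f} J/token")
print(f"B200:   {b200:.0f} J/token")
print(f"Path A: {path_a:.0f} J/token")
```

Under these figures Path A comes out lowest per token, which is the basis for calling it competitive on this metric.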

5.3 Prefill Roofline

A critical operation for long-running agent tasks is prefill: processing the long input prompt and generating the KV Cache. Prefill is a compute-intensive operation (unlike the bandwidth-intensive decode), with latency growing at least linearly with input length; the attention component grows quadratically (see Section 5.4).

Input Length	Path A (Vera Rubin, 100 PFLOPS FP4)	Path B (DDR5 SoC, ~5 TFLOPS effective)	B200 Single Card (20 PFLOPS)
32K tokens	~2–5 seconds	~30–60 seconds	~5–10 seconds
128K tokens	~10–30 seconds	~3–8 minutes	~30–60 seconds
300K tokens	~1–3 minutes	~10–25 minutes	~2–5 minutes

Key finding: Path B (DDR5 SoC) prefill performance is a serious bottleneck. Prefilling 300K tokens at ~5 TFLOPS effective compute may require 10–25 minutes—directly impacting usability for agent “interrupt recovery” scenarios. Path A (Vera Rubin 100 PFLOPS) handles prefill much faster, further strengthening its case as the near-to-mid-term preferred option.

5.4 Attention O(n²) Computational Cost at Ultra-Long Contexts

Standard transformer attention has O(n²) complexity over a full sequence. A 300K token context means each new token must attend to 300K historical KV entries, so per-token decode cost grows linearly with context length while total prefill cost grows quadratically. Even if DDR5/LPDDR5X capacity is sufficient to store all KV entries, this attention computation remains. At Path B’s ~5 TFLOPS effective compute, full attention over ultra-long contexts may become a more pressing bottleneck than memory bandwidth. FlashAttention reduces memory access but does not change computational complexity. Therefore, even with sufficient memory, some form of sparse attention may still be necessary for ultra-long context scenarios. Path A’s 100 PFLOPS FP4 compute power substantially mitigates the attention computation bottleneck.

SECTION 06

Engineering Advantages of Dense Model Regression

The two core motivations for MoE architecture (“a single GPU cannot hold the model” and “communication is too expensive”) are mitigated in a unified large-memory node. When 3 TB of DDR5 can accommodate 1.5T Dense FP16 weights or 6T Dense FP4 weights, Dense architecture once again becomes a pragmatic choice within the target parameter scale.
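The capacity ceiling for Dense weights is memory size divided by bytes per parameter. A minimal sketch, ignoring KV Cache and runtime workspace (so these are upper bounds, not deployable sizes); the names are illustrative.

```python
BYTES_PER_PARAM = {"FP4": 0.5, "FP8": 1.0, "FP16": 2.0}


def max_dense_params_trillion(capacity_tb: float, precision: str) -> float:
    """Largest Dense model (in trillions of parameters) whose weights fit in memory.

    Terabytes divided by bytes-per-parameter gives trillions of parameters.
    """
    return capacity_tb / BYTES_PER_PARAM[precision]


for precision in ("FP16", "FP8", "FP4"):
    limit = max_dense_params_trillion(3.0, precision)
    print(f"3 TB at {precision}: up to {limit:g}T parameters")
```

In practice a share of the 3 TB must be left for KV Cache, so the usable model size is smaller than these limits.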

6.1 Dense Engineering Simplicity Advantages

Dimension	MoE	Dense
Memory Access Pattern	Sparse random (expert selection depends on input)	Sequential contiguous (layer-by-layer scan of all weights)
DDR5 Bandwidth Utilization	60–80% (cache misses and irregular access)	~95% (sequential reads, hardware prefetch-friendly)
Inference Code Complexity	Expert routing, dynamic selection, load balancing	Standard matrix multiplication loop
Output Determinism	Router may introduce non-determinism	Fully deterministic (same input → same output)
Quantization Robustness	Different experts may have varying sensitivity	Uniform quantization, more predictable behavior

Clarification needed: Dense advantages are at the engineering level—simpler, more predictable, easier to optimize. This paper does not claim that Dense is “universally superior” to MoE in model capability. MoE can typically achieve higher capability with more total parameters at an equivalent compute budget—this is its core value. This proposal’s position is: when unified memory capacity is sufficiently large and the inference scenario is low-batch dedicated, Dense’s engineering simplicity and bandwidth efficiency advantages may outweigh MoE’s parameter efficiency advantages.

SECTION 07

Output Stability Analysis: MoE vs. Dense

This section discusses the impact of MoE routing mechanisms on output stability. A prerequisite declaration: whether MoE is more prone to hallucination depends on training data, routing design, number of activated experts, post-training methods, and multiple other factors—it cannot be simplistically attributed to architectural label. The following discussion focuses on the non-determinism introduced by the routing mechanism itself, not an overall capability assessment of MoE architecture.

7.1 Empirical Data on Routing Non-Determinism

Research by LMSYS and other institutions in 2025 found measurable differences in routing behavior between training and inference in MoE models: approximately 10% of routers selected different experts across the two phases; 94% of tokens were routed to different experts in at least one layer; on average, approximately 6 routers per token made different decisions. Research also noted that even under identical conditions, repeated forward passes may produce different expert selections from the router.

This non-determinism is particularly pronounced in reinforcement learning training—LMSYS noted in December 2025 that “training RL for MoE models has been unstable, frequently causing training crashes,” and specifically developed the R3 (Rollout Routing Replay) method to mitigate this issue.

7.2 Potential Impact on Long-Running Agent Tasks

In multi-step agent tasks, routing non-determinism may accumulate across steps. However, it must be noted that this is a potential risk rather than a proven causal relationship. The specific degree of impact depends on: the actual magnitude of routing non-determinism during inference (not training), whether deterministic inference settings are used (e.g., fixed random seeds, dropout disabled), and the quality of the specific model’s router design.

Dense models, having no routing selection mechanism, possess a structural advantage in this dimension—identical inputs always traverse an identical computation path. This is a valuable property for agent scenarios requiring multi-step reasoning consistency.

SECTION 08

User-Perceived Performance and External Communication Latency Analysis

This proposal’s batch=1 decode speed is 3.4–8.4 t/s (500B–200B Dense FP4). Average human reading speed is approximately 200–250 words/minute (≈4–5 tok/s). Industry-consensus experience tiers are: 50+ t/s feels instantaneous; 10–20 t/s smooth; 5–10 t/s acceptable; 3–5 t/s noticeable wait but usable; below 3 t/s suitable only for non-real-time scenarios.

This proposal’s 200B FP4 (8.4 t/s) falls in the “acceptable” range, and 500B FP4 (3.4 t/s) falls in the “noticeable wait but usable” range. For long-running agent tasks, code generation, and document analysis, this speed meets baseline requirements. For scenarios requiring fast interactive chat, smaller models or higher-bandwidth future DDR standards are needed.
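The tier boundaries above can be written as a lookup. The sketch makes two assumptions that the text does not state: roughly 1.3 tokens per English word for the reading-speed conversion, and treating the unlisted 20–50 t/s range as "smooth".

```python
def words_per_minute_to_tps(wpm: float, tokens_per_word: float = 1.3) -> float:
    """Convert reading speed to tokens/second (tokens_per_word is an assumption)."""
    return wpm * tokens_per_word / 60.0


def experience_tier(tokens_per_s: float) -> str:
    """Map decode speed to the experience tiers listed in this section."""
    if tokens_per_s >= 50:
        return "instantaneous"
    if tokens_per_s >= 10:  # the 20-50 t/s gap is folded into this tier (assumption)
        return "smooth"
    if tokens_per_s >= 5:
        return "acceptable"
    if tokens_per_s >= 3:
        return "noticeable wait but usable"
    return "non-real-time only"


for label, tps in [("200B FP4", 8.4), ("500B FP4", 3.4), ("1T FP4", 1.7)]:
    print(f"{label} at {tps} t/s: {experience_tier(tps)}")

reading_low = words_per_minute_to_tps(200)
reading_high = words_per_minute_to_tps(250)
```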

8.1 Agent Task SLA vs. Online Service SLA

The 3.4 t/s speed indeed fails to meet traditional online service SLA standards—modern consumer chat products require TTFT < 1 second and generation speed of 30+ t/s. But the SLA dimensions for long-running agent tasks are fundamentally different:

SLA Dimension	Online Chat Service	Long-Running Agent Task	This Proposal’s Performance
Time to First Token (TTFT)	<1 second (user staring at screen)	Several seconds acceptable (background execution)	Meets Agent SLA
Generation Speed (TPS)	30–100+ t/s	1–10 t/s (no one watching streaming output)	3.4–8.4 t/s meets requirements
Concurrent Users	Thousands–tens of thousands QPS	1–dozens of parallel agents	21 layers = 21 parallel agents
Context Stability	Not critical (short conversations)	Critical (hundreds of steps without information loss)	3 TB memory + SSD persistence
Interruptible Recovery	Not needed	Critical (long tasks may span days)	SSD KV Cache persistence
Output Determinism	Not sensitive	Important (multi-step reasoning consistency)	Dense deterministic output

This proposal explicitly fails to meet online service SLA dimensions; but on agent long-task SLA dimensions—context stability, interruptible recovery, and output determinism—it actually exceeds the capabilities of current GPU cluster solutions. This is not “barely usable”—it is a different architecture optimized for a different SLA framework.

Regarding network latency: 100 GbE baseline latency is ~1.2 microseconds, which is entirely negligible compared to the hundreds-of-milliseconds-scale token generation latency (a five-orders-of-magnitude gap). Per-user dedicated instances eliminate TTFT jitter from batch scheduling queuing, making response latency more predictable.
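The "five orders of magnitude" claim can be checked directly from the two latencies. A minimal sketch using the 3.4 t/s decode rate and the ~1.2 μs network figure; the function name is illustrative.

```python
import math


def latency_gap_orders(tokens_per_s: float, network_latency_us: float) -> float:
    """Orders of magnitude between per-token latency and network latency."""
    token_latency_s = 1.0 / tokens_per_s
    return math.log10(token_latency_s / (network_latency_us * 1e-6))


gap = latency_gap_orders(tokens_per_s=3.4, network_latency_us=1.2)
print(f"Per-token latency: {1000 / 3.4:.0f} ms, gap: {gap:.1f} orders of magnitude")
```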

SECTION 09

KV Cache Engineering Analysis

9.1 KV Cache Size Formula

The KV Cache increment per token can be precisely calculated using the following formula:

KV_per_token = 2 × L × n_kv_heads × d_head × bytes_per_element × batch

Where: 2 = K and V tensors; L = number of transformer layers; n_kv_heads = number of KV heads (may be much smaller than the number of query heads under GQA/MQA); d_head = per-head dimension (typically 128); bytes_per_element = bytes per KV element at the chosen precision (FP16=2, FP8=1, INT4=0.5); batch = number of concurrent sequences (1 in the table below)

Precise calculation using a typical GQA architecture:

Model Scale	L (Layers)	KV Heads (GQA)	d_head	KV dtype	KV per Token	Path A (2.1 TB) Resident Tokens	Path B (3 TB) Resident Tokens
70B	80	8	128	FP16	0.33 MB	~4.9M	~7.8M
200B	96	16	128	FP16	0.79 MB	~1.9M	~3.3M
500B	120	32	128	FP16	1.97 MB	~170K	~1.25M
500B	120	32	128	FP8	0.98 MB	~330K	~2.5M
1T	160	64	128	FP8	2.62 MB	~380K	~860K

Note: Path A “resident tokens” is calculated as the Vera Rubin total memory of 2.1 TB minus model FP4 weights. Path B is calculated as 3 TB DDR5 minus weights. Layer counts and KV heads are reasonable estimates. These figures strengthen the large-memory argument: on Path B, a 500B model with GQA and an FP8 KV Cache can hold approximately 2.5 million resident tokens, far exceeding the requirements of long-running agent tasks.
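The "KV per Token" column follows directly from the formula in Section 9.1. The sketch below reproduces it and adds a resident-token estimate; the `usable_fraction` reserve of 10% is an assumption introduced here (the source does not state one), chosen because it lands near the 500B Path B figures.

```python
KV_BYTES = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def kv_bytes_per_token(layers: int, kv_heads: int, d_head: int,
                       kv_dtype: str, batch: int = 1) -> float:
    """KV_per_token = 2 x L x n_kv_heads x d_head x bytes_per_element x batch."""
    return 2 * layers * kv_heads * d_head * KV_BYTES[kv_dtype] * batch


def resident_tokens(capacity_gb: float, weight_gb: float, kv_bytes: float,
                    usable_fraction: float = 1.0) -> float:
    """Tokens whose KV fits in the memory left after weights (decimal GB)."""
    return (capacity_gb - weight_gb) * 1e9 * usable_fraction / kv_bytes


MODELS = [  # (name, layers, kv_heads, d_head, kv_dtype) from the table above
    ("70B", 80, 8, 128, "FP16"),
    ("200B", 96, 16, 128, "FP16"),
    ("500B", 120, 32, 128, "FP16"),
    ("500B", 120, 32, 128, "FP8"),
    ("1T", 160, 64, 128, "FP8"),
]
for name, layers, heads, d_head, dtype in MODELS:
    mb = kv_bytes_per_token(layers, heads, d_head, dtype) / 1e6
    print(f"{name} {dtype}: {mb:.2f} MB per token")

# 500B FP4 weights (250 GB) on Path B (3 TB), FP8 KV, assumed 10% reserve.
path_b_500b_fp8 = resident_tokens(3000, 250, kv_bytes_per_token(120, 32, 128, "FP8"),
                                  usable_fraction=0.9)
```

Halving the KV precision or the number of KV heads halves the per-token cost, which is why GQA and FP8 KV together matter more for context length than raw capacity alone.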

9.2 SSD Offloading Latency Realities

In the unified architecture, the NVMe controller is integrated within the SoC, making the KV Cache write/readback path shorter than in traditional GPU architectures (which must traverse PCIe twice). However, the physical latency characteristics of SSDs do not change as a result:

Storage Tier	Random Read Latency	Sequential Bandwidth	Suitable KV Data
Unified Memory (DDR5)	~80–100 ns	883 GB/s	Active layers, recent tokens
NVMe SSD	~50–100 μs	7–14 GB/s	Cold historical tokens, persistence
Gap	500–1,000×	~60–125×

The SSD’s 50–100 μs read latency is non-negligible during attention computation. If the current token needs to attend to cold KV entries on SSD, they must be prefetched into unified memory. Whether prefetching can fully hide SSD latency depends on attention patterns, scheduling strategy, and context length—this requires empirical validation and should not be treated as a proven conclusion.

Page Fault Worst Case: In long-running agent tasks, if attention needs to reference cold historical tokens on SSD (e.g., tool call results from step 5 referenced at step 200), this triggers a scenario analogous to virtual memory page faults. A single SSD random read (50–100 μs) is approximately 500–1,000× slower than a DDR5 access (80–100 ns). If multiple SSD page faults occur during a single token’s generation (e.g., cross-layer attention patterns hitting different cold regions), latency accumulates. Mitigation strategies include: (a) leveraging the 2.75 TB DDR5 buffer space to keep hot/warm KV in memory as much as possible; (b) attention-aware prefetching—predicting KV regions about to be accessed based on attention patterns and loading them from SSD in advance; (c) a tiered storage scheduler that locks the most recent N-thousand tokens’ KV in DDR5 and only allows data beyond the threshold to be flushed to disk. Whether these strategies are effective requires benchmarking against real agent workloads.
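To get a feel for the magnitudes involved, the following sketch models two cases: serial SSD page faults during one token's generation, and bulk prefetch of a cold KV region. The fault count (100) and cold-region size (10,000 tokens) are hypothetical inputs chosen for illustration; the latency, bandwidth, and KV-size figures come from the tables above. It is a back-of-envelope model, not a substitute for the benchmarking this section calls for.

```python
def serial_fault_stall_s(n_faults: int, ssd_read_latency_s: float) -> float:
    """Worst-case stall if every cold-KV read is issued one after another."""
    return n_faults * ssd_read_latency_s


def prefetch_time_s(cold_tokens: int, kv_bytes_per_token: float,
                    ssd_bandwidth_bytes_per_s: float) -> float:
    """Time to stream a contiguous cold KV region from SSD into memory."""
    return cold_tokens * kv_bytes_per_token / ssd_bandwidth_bytes_per_s


def stall_fraction(stall_s: float, token_time_s: float) -> float:
    """Stall expressed as a fraction of one token's generation time."""
    return stall_s / token_time_s


token_time = 1 / 3.3                        # ~3.3 t/s decode (Path B, 500B FP4)
stall = serial_fault_stall_s(100, 100e-6)   # 100 faults x 100 us (hypothetical count)
print(f"stall per token: {stall * 1e3:.1f} ms "
      f"({stall_fraction(stall, token_time):.1%} of token time)")

# Reloading 10,000 cold tokens of FP8 KV (983,040 bytes each) over a 7 GB/s SSD.
reload_s = prefetch_time_s(10_000, 983_040, 7e9)
print(f"prefetch of 10,000 cold tokens: {reload_s:.2f} s")
```

Under these assumptions, scattered random faults cost a few percent of token time, while bulk reloads of cold history are bandwidth-bound and take on the order of seconds, which is why prefetching ahead of need matters more than raw random-read latency.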

9.3 KV Cache Persistence Reusability Boundaries

SSD-persistent KV Cache enables agent interrupt recovery and cross-session memory, but subject to strict reusability conditions:

Change Type	Is Persisted KV Still Usable?
Model weight update (new checkpoint)	Not usable—layer weight changes invalidate KV semantics
RoPE/positional encoding parameter change	Not usable—positional information mismatch
Tokenizer change	Not usable—token ID semantics altered
System prompt change	Partially usable—KV corresponding to system prompt needs recomputation
KV precision/format change	Not usable—data format incompatible
Session recovery under same model and settings	Usable

KV Cache persistence is not a general-purpose “agent memory database”—it is tightly bound to model version, positional encoding, tokenizer, and precision format. Its core value lies in interrupt recovery and short-to-medium-term context continuity acceleration under the same model version and configuration. For long-term agent memory, structured state, tool logs, plan trees, and code diffs may be better representations than opaque KV Cache.
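The reusability conditions in the table above can be enforced with a compatibility key stored alongside each persisted KV snapshot. The sketch below is illustrative only: `KVCompatKey`, its field names, and `reusable_prefix_len` are hypothetical, not part of any existing inference framework. For prompt changes it conservatively reuses only the KV of the unchanged leading tokens (the usual prefix-caching rule), which is an assumption of this sketch rather than something the proposal specifies.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class KVCompatKey:
    """Settings that must all match before persisted KV can be reused."""
    checkpoint_id: str      # model weight version
    rope_params: str        # positional-encoding configuration, serialized
    tokenizer_hash: str     # digest of the tokenizer vocabulary
    kv_dtype: str           # e.g. "fp8"; a format change invalidates the cache

    def fingerprint(self) -> str:
        """Stable digest of all fields, suitable for storing next to the KV data."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def reusable_prefix_len(saved_key: KVCompatKey, current_key: KVCompatKey,
                        saved_tokens: list[int], current_tokens: list[int]) -> int:
    """Number of leading tokens whose persisted KV can be reused.

    Returns 0 on any key mismatch. Otherwise reuse stops at the first token
    that differs, because KV at later positions depends on the prefix.
    """
    if saved_key.fingerprint() != current_key.fingerprint():
        return 0
    n = 0
    for a, b in zip(saved_tokens, current_tokens):
        if a != b:
            break
        n += 1
    return n


key = KVCompatKey("ckpt-1", "theta=10000", "tok-abc", "fp8")
new_ckpt = KVCompatKey("ckpt-2", "theta=10000", "tok-abc", "fp8")
print(reusable_prefix_len(key, key, [1, 2, 3, 4], [1, 2, 9, 4]))  # same settings
print(reusable_prefix_len(key, new_ckpt, [1, 2, 3], [1, 2, 3]))   # new checkpoint
```

Hashing every setting into one fingerprint makes the "not usable" rows a single equality check at session-recovery time, so a stale snapshot is rejected before any KV data is read from SSD.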

SECTION 10

Software Complexity Regression: Auxiliary Technology Stack Eliminated by the Unified Architecture

A substantial portion of the engineering complexity in current AI systems does not serve inference itself, but rather compensates for the hardware constraint of “insufficient memory.” From RAG to KV Cache eviction, from vector databases to continuous batching, the entire auxiliary technology ecosystem exists as compensatory engineering for limited HBM capacity. This section systematically catalogs these auxiliary technologies and analyzes the unified architecture’s impact on them.

10.1 Context Management Layer: From “Selective Forgetting” to “Complete Memory”

In current LLM inference, when conversations exceed KV Cache capacity, the system is forced to perform the following lossy operations:

Auxiliary Operation	What It Does	Information Loss	Status in Unified Architecture
Context compression/summarization	Compresses full conversation into summary text	Details, context, and original phrasing lost	Eliminated—2.75 TB KV space can hold 300K+ tokens
Token truncation	Discards earliest conversation history	Early information permanently lost	Eliminated
KV Cache eviction	Deletes “unimportant” KV entries by attention score	Information judged unimportant is lost; performs poorly for tasks requiring global context	Eliminated
Sliding window attention	Attends only to most recent N tokens	Long-range dependencies lost	Greatly reduced—may still be needed beyond 1M tokens

The limitations of context compression are directly perceptible in practice. When an AI assistant’s conversation exceeds KV Cache capacity, the system triggers forced compression: the complete earlier conversation is replaced with a summary. Subsequently, the AI’s recall accuracy degrades, details become blurred, and reasoning chains from earlier discussions may break. This is not a model capability issue; it is information loss caused by hardware memory constraints. In the unified architecture’s 2.75 TB KV Cache space (Path B, 500B FP4 model), the complete lossless context of approximately 1.25M tokens (FP16 KV) to 2.5M tokens (FP8 KV) can be preserved under the GQA configuration in §9.1, equivalent to dozens of complete in-depth conversations, with no compression required.

10.2 External Memory Layer: From “Retrieval Substituting for Memory” to “Native Memory”

The fundamental reason RAG (Retrieval-Augmented Generation) and its derivative technology stack exist is that the context window is too small.

Auxiliary Technology	What It Does	Limitations	Status in Unified Architecture
RAG retrieval pipeline	Retrieves relevant document fragments from external database and injects into prompt	Retrieval quality depends on embeddings; may retrieve semantically similar but contextually irrelevant content (“vector fog” problem)	Greatly reduced—can directly load 300K+ tokens of documents into context
Vector database	Compresses documents into high-dimensional vector storage	Lossy compression; original text details lost during vectorization	Greatly reduced—attention computes directly on original text
Document chunking	Splits long documents into 512–2048 token chunks	Cross-chunk information relationships broken; information lost at chunk boundaries	Eliminated—long documents can be loaded whole
Agent memory frameworks	External database storage + retrieval of agent history	Retrieval latency, recall issues, noise increases with history length	Eliminated—KV Cache is memory, SSD enables persistence

Research in 2026 has begun reconsidering the fundamental limitations of RAG: the Aeon project noted that as agent memory grows, the “vector fog” problem in flat vector retrieval intensifies—retrieving semantically similar but contextually irrelevant fragments. The increasingly complex architectures of GraphRAG, Agentic RAG, and Hybrid RAG all attempt to patch this fundamental deficiency. In the unified architecture, the attention mechanism itself is the most precise “retriever”—it computes on the complete original text, without the intermediate steps of lossy vectorization compression and approximate nearest-neighbor search.

10.3 KV Cache Compression Layer: From “Extreme Compression” to “Comfortable Storage”

Auxiliary Technology	Compression Ratio	Cost	Status in Unified Architecture
KV Cache quantization (FP16→INT4)	4×	Precision loss; extreme quantization may affect long-range reasoning	Can use higher precision (FP16)—ample space
MLA Multi-Head Latent Attention (DeepSeek)	71× per layer	Requires specialized model architecture design and training	No longer a survival necessity; becomes an optional optimization
GQA/MQA	4–8×	Query/KV head count mismatch may lose expressiveness	Still useful but pressure greatly reduced
Prefix Caching	Avoids redundant prefill	Cache management complexity	Eliminated—SSD-persistent KV achieves this natively

10.4 Distributed Communication Layer: From “Multi-GPU Collaboration” to “Single-Node Completeness”

Communication Overhead	Root Cause	Typical Bandwidth Consumption	Status in Unified Architecture
Tensor parallel allreduce	Model sharded across multiple GPUs	Two allreduces per layer per token	Eliminated—model is not sharded
Pipeline parallelism	Model layers split into stages across GPUs	Activation values passed between stages	Eliminated
Expert parallelism (MoE)	Experts distributed across different GPUs	Tokens must be routed to corresponding GPU	Eliminated—Dense has no experts
NVLink/NVSwitch/optical modules	Supporting the above parallelism	~40% of rack cost	Eliminated

10.5 Inference Service Scheduling Layer: From “Shared Contention” to “Dedicated Determinism”

Scheduling Overhead	Root Cause	Impact on User	Status in Unified Architecture
Continuous batching	Multiple users sharing GPU	Single-user speed slowed by longest request in batch	Eliminated—dedicated instance
Request queuing/scheduling	Limited GPU resources	TTFT spikes (seconds of waiting during peak periods)	Eliminated—no queuing
KV Cache cross-request migration	Load balancing	Service interruption during migration	Eliminated—KV stays fixed on its layer

10.6 Three-Tier Impact Matrix

Table · Three-Tier Assessment of Unified Architecture Impact on Auxiliary Technology Stack

Impact Level	Auxiliary Technologies	Rationale
Removable (~6 items)	Tensor parallel allreduce · Expert parallelism · NVLink/NVSwitch/optical modules · Token truncation · Document chunking · Request queuing/scheduling	Model not sharded, Dense has no experts, memory sufficient for full context, dedicated instance eliminates queuing
Reducible (~5 items)	RAG retrieval pipeline · Vector database · Context compression/summarization · KV Cache eviction · Continuous batching	RAG still needed for knowledge bases exceeding context capacity and data governance; context compression still needed in extreme scenarios; platform-level scheduling and tenant isolation still needed
Still Required (~3 items)	Permission-based retrieval and data governance · Audit/logging/observability · Sparse/efficient attention (ultra-long context O(n²))	Enterprise security compliance is independent of memory size; attention computational complexity is independent of memory capacity

Corrected core insight: In this proposal’s target scenarios (low batch, long-running agent tasks, dedicated instances), approximately 6 auxiliary technologies can be completely removed, approximately 5 can be downgraded from “core infrastructure” to “optional/backup,” and approximately 3 remain essential. Engineering complexity of the inference system is significantly reduced, but does not reach zero. Sufficient memory eliminates most “compensatory engineering,” but security compliance, ultra-long sequence computational optimization, and platform operations are requirements along independent dimensions. The pattern of “using complex software to compensate for hardware limitations” is substantially weakened, but the need for “using necessary software to ensure production reliability” remains unchanged.

SECTION 11

Rack-Level Deployment Plan (By Path)

11.1 Path A Rack Deployment (Vera Rubin Superchip)

A single Vera Rubin Superchip draws approximately 1,200 W. A standard 42U rack can accommodate approximately 6–8 Superchips (depending on cooling configuration). The NVL72 deployment form factor has been confirmed as 100% liquid-cooled; standalone Superchip deployment cooling depends on server design.

Metric	Path A Rack (6–8 Nodes)
Concurrent Agent Instances	6–8 (each node independently runs complete 500B model)
Total Memory	12.6–16.8 TB (HBM4 + LPDDR5X)
Total Power	~7.2–9.6 kW
Cooling	Liquid-cooled or high-density air-cooled (depending on configuration)
500B FP4 Decode	~75 t/s per node (single GPU)
Total Hardware Cost	~$180K–480K

11.2 Path B Rack Deployment (Custom DDR5 SoC)

Standard 42U rack, approximately 2U per layer (including cooling space), accommodating 21 layers:

Metric	Path B Rack (21 Layers)
Concurrent Instances	21 (each layer independently runs complete model)
Total Unified Memory	63 TB (21 layers × 3 TB DDR5)
Rack Total Power	P50: 6.7 kW / P95: 9.0 kW; pure air-cooled (may need rear-door HX at P95)
Total Hardware Cost	$215K–420K (21 × $10.2–20K/layer)

11.3 Path A/B vs. GB300 NVL72 Comparison

Three architectures serve different inference workloads:

Metric	GB300 NVL72	Path A (6–8 Nodes)	Path B (21 Layers)
Advantaged Scenario	High batch, high QPS, training	High-performance agent inference	Low-cost private deployment
Total Memory	~38 TB	12.6–16.8 TB	63 TB
Independent Instances	1 (batch shared)	6–8	21
500B FP4 Speed	Extremely high (multi-GPU)	~75 t/s/node	~3.0–3.4 t/s/layer
Total Power	~120 kW	~7–10 kW	~7–9 kW
Cooling	100% liquid-cooled	Liquid/enhanced air-cooled	Pure air-cooled
Data Center Requirements	Liquid cooling + specialized racks	May require liquid cooling	Standard data center
Total Hardware Cost	~$2–3M	~$180–480K	~$215–420K
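The rack-level figures for Path A and Path B are straightforward multiples of the per-node values. The sketch below reproduces the low-end totals; the `Node` class and `rack_totals` helper are illustrative names, and the per-node inputs (2.1 TB, ~1,200 W, and $30K for Path A; 3 TB, 320 W, and $10.2K for Path B) are the figures stated in this proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    memory_tb: float
    power_w: float
    cost_usd: float


def rack_totals(node: Node, count: int) -> dict:
    """Scale one node's memory, power, and cost to a rack of `count` nodes."""
    return {
        "instances": count,
        "memory_tb": node.memory_tb * count,
        "power_kw": node.power_w * count / 1000,
        "cost_usd": node.cost_usd * count,
    }


# Path B low end: 3 TB DDR5, 320 W (P50), $10.2K per layer, 21 layers per rack.
path_b = rack_totals(Node(memory_tb=3.0, power_w=320, cost_usd=10_200), 21)
# Path A low end: 2.1 TB, ~1,200 W, $30K per Superchip, 6 per rack.
path_a = rack_totals(Node(memory_tb=2.1, power_w=1200, cost_usd=30_000), 6)

for name, totals in (("Path A", path_a), ("Path B", path_b)):
    print(name, totals)
```

Swapping in the high-end inputs (8 Superchips at $60K, or 430 W and $20K per Path B layer) gives the upper ends of the ranges in the comparison table.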

SECTION 12

Manufacturing Economics: Wafer Efficiency of DDR5 vs. HBM

HBM’s consumption of global DRAM wafer capacity far exceeds its bit output. Industry data shows: 1 GB of HBM consumes approximately 3–4× the wafer capacity of standard DRAM (due to larger die area, 50–60% yield of 12-layer TSV stacking, and CoWoS packaging bottleneck). In 2026, AI effectively consumes nearly 20% of global DRAM supply.

Metric	DDR5 RDIMM	HBM3e
Wafer Area/bit	1× (baseline)	2–3×
Yield	90–95%	50–60%
Aggregate Capacity Consumption/bit	1×	3–4×
Packaging	Standard DIMM (self-sufficient capacity)	CoWoS (TSMC capacity-constrained)

For the major DRAM manufacturers (SK Hynix, Samsung, Micron), the DDR5 unified architecture path does not pose a competitive threat, since both HBM and DDR5 are their products. The change is merely a production path adjustment: adding a high-yield, fully self-packaged DDR5 path to serve the enormous incremental market for AI inference.

SECTION 13

Energy and Infrastructure

New power approvals for major global data center markets are backlogged 2–5 years. At a P50 of approximately 6.7 kW per Path B rack (§11.2), which is lower than many traditional server racks, this proposal can be deployed directly within existing data centers’ spare power capacity, requiring no liquid cooling retrofit or power upgrade.

In the target scenario of “long-running agent task servers,” assuming a demand for 1,000 concurrent agent instances: Path B requires approximately 48 racks (1,000 instances ÷ 21 layers per rack, rounded up) and approximately 322 kW total power (P50, 48 × 6.7 kW), pure air-cooled. A traditional GPU solution would require dozens of NVL72 racks, multi-MW power, and dedicated liquid cooling infrastructure. Deployment lead time is reduced from 12–18 months to standard server delivery timelines.

SECTION 14

Technical Feasibility and Key Prerequisites

The viability of this proposal depends on the simultaneous satisfaction of four conditions:

Four Necessary Conditions for Proposal Viability

Condition	Explanation	Current Status
Inference at batch ≈ 1–4	HBM GPU advantages rapidly return with larger batch	Long-running agent tasks are inherently low-batch
Model accepts FP4 or low precision	Otherwise weight capacity and bandwidth requirements double	Depends on specific model and task
Service target accepts 3–8 t/s	Not suitable for high-interaction chat or large-scale API	Agent/code/research scenarios acceptable
Unified memory SoC or effective bridge exists	GPU needs efficient access to DDR5	Near-term NVLink-C2C bridge / mid-term requires new SoC

14.1 Identified Technical Vulnerabilities

Issue	Severity	Resolution Path
PCIe GPU bridging causes speed to collapse to unusable levels	Fatal	Must use NVLink-C2C Superchip (see §4.5); PCIe GPU path eliminated
DDR5 RDIMM is not GPU-native unified memory	High	Near-term: NVLink-C2C bridge (900 GB/s); mid-term: custom inference SoC (18–24 months)
GPU compute overkill relative to DDR5 bandwidth	Optimization opportunity	Custom inference accelerator with compute tailored to bandwidth
SSD page faults non-negligible in long-context scenarios	Medium-High	DDR5 hot buffer + async prefetch + tiered scheduling strategy (see §9.2)
KV Cache persistence invalidation conditions	Medium	Strict version binding; not positioned as general-purpose memory solution
CPU channel count limits 3 TB/socket	Medium	Dual-socket 6 TB or await CPUs with more channels

14.2 Model Ecosystem Risk

The 500B Dense FP4 discussed in this proposal is a hypothetical asset—current industry trends still heavily use MoE to reduce training and inference compute costs. The training cost of a 500B Dense model is extremely high, and no publicly available high-quality 500B Dense FP4 model currently exists. If the model ecosystem does not shift toward Dense, the runnable models for this proposal may be limited to: low-precision versions of existing MoE models (DDR5 bandwidth utilization drops to 60–80%), 70B–200B Dense models (fast but capability-limited), and distilled or enterprise-proprietary models. Path A (Vera Rubin), with its extremely high HBM4 bandwidth, is not constrained by memory bandwidth even when running MoE models, resulting in lower model ecosystem risk.

SECTION 15

Phased Validation Roadmap

15.0 Phase 0: Simulation Validation (Immediately Feasible)

Use existing GPUs with throttled bandwidth to simulate DDR5 roofline; test KV tiering with vLLM/FlexGen; test batch=1/2/4 long-context agent task success rate. Objective: validate whether low-batch agents accept 3–8 t/s, whether KV persistence improves recovery capability, and whether SSD page-fault tail latency is manageable.
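Before running the throttled-bandwidth experiment, the expected DDR5 roofline can be estimated analytically. The sketch below assumes batch = 1 and dense attention over the full context, so each generated token streams all weights plus the resident KV once. The `efficiency` parameter is a hypothetical achieved-bandwidth fraction, not a measured value, and the function name is illustrative.

```python
def decode_tokens_per_s(bandwidth_gb_s: float, weight_gb: float,
                        kv_read_gb: float = 0.0, efficiency: float = 1.0) -> float:
    """Bandwidth-bound decode rate at batch = 1.

    Each generated token streams all weights plus the resident KV once,
    so the rate is at most bandwidth / bytes moved per token.
    """
    return bandwidth_gb_s * efficiency / (weight_gb + kv_read_gb)


ddr5 = 883.0      # GB/s, Path B DDR5 bandwidth
weights = 250.0   # GB, 500B Dense FP4

upper = decode_tokens_per_s(ddr5, weights)
derated = decode_tokens_per_s(ddr5, weights, efficiency=0.9)  # assumed 90%

# A 128,000-token context of FP8 KV (983,040 bytes per token) adds KV reads.
kv_gb = 128_000 * 983_040 / 1e9
long_ctx = decode_tokens_per_s(ddr5, weights, kv_gb)

print(f"upper bound:         {upper:.2f} t/s")
print(f"at 90% efficiency:   {derated:.2f} t/s")
print(f"with 128K-token KV:  {long_ctx:.2f} t/s")
```

The third case suggests that at long context the KV reads themselves become a sizable share of per-token traffic, so Phase 0 measurements should sweep context length as well as batch size.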

15.1 Phase 1: Vera Rubin Validation (H2 2026–2027)

Use production Vera Rubin Superchip (1.5 TB LPDDR5X + 576 GB HBM4 + NVLink-C2C 1.8 TB/s). Place 500B FP4 weights entirely in HBM4; validate ~75 t/s decode (single GPU) or ~150 t/s (TP=2). Test 1T+ models for HBM4→LPDDR5X overflow performance. Validate SSD KV persistence recovery success rate in real agent tasks. Key Benchmarks: Models 70B/200B/500B; precision FP8/FP4; batch 1/2/4/8; context 32K/128K/512K/1M; metrics TTFT, decode t/s, P95 latency, SSD page fault rate, W/token, agent task completion rate.

15.2 Phase 2: Custom DDR5 Platform (2028–2029)

Design a unified SoC integrating ARM CPU cores, inference-specialized NPU, and native 12+ channel DDR5/DDR6 controller. Remove NVLink, HBM controllers, and most Tensor Cores. Target: 3 TB+ unified memory, 883+ GB/s bandwidth, no HBM, pure air-cooled 320–430 W. Validate Path B’s extreme cost and power advantages.

15.3 Long-Term (2029+): Near-Memory Computing

As DDR6 (2029–2030), 3D DRAM (~2030), and PIM (~2030+) evolve, bandwidth density continues to improve. 3D DRAM may achieve 3–5× bandwidth improvement in DDR form factor, and PIM may enable vector operations directly within memory die. 10T Dense FP4 single-node real-time inference is projected to first reach 1+ t/s in the DDR7 era (2032–2034).

SECTION 16

Industry Impact

For memory companies: The AI inference DDR5 incremental market may reach $60B+/year. The DDR5 path offers high yield (90–95% vs. HBM 50–60%), self-sufficient packaging (no dependency on TSMC CoWoS), and an open supply chain. HBM serves the training market while DDR5 serves the inference market—a two-track strategy.

For the AI industry: This proposal does not seek to overturn the NVL72/GPU cluster paradigm, but rather to add a new product category for AI inference—the “personalized AI inference node.” Path A (Vera Rubin) serves hyperscalers’ high-end agent inference needs, while Path B serves global enterprises and eventually individual users’ private AI deployment needs—the two cover entirely different customer segments.

SECTION 17

From Hardware Proposal to Product Ecosystem: The Complete Stack for Personalized AI Nodes

When a single node can run a complete 500B-class large model, it is not merely “cheaper agent inference”—it opens entirely new possibilities for personalized distributed AI deployment. This section argues for the four-layer structure of this product ecosystem and how it answers the “business economics of Path B” question.

17.1 Enterprise Segment: From “Renting Shared AI” to “Owning Dedicated AI”

Current enterprise private AI deployment is locked into small models: an RTX 4090 (24 GB VRAM) can run at most a 30B model; dual RTX 4090 (48 GB total) can run a 70B model. When enterprises face complex business scenarios requiring 500B-class capability, they are forced to send sensitive data to cloud APIs, trading data security against model capability. Gartner’s 2025 prediction is that by 2026, over 50% of enterprise AI inference workloads will run locally or at the edge (up from less than 10% in 2023). IDC projects AI infrastructure spending to reach $758B by 2029.

Path B unified architecture ($10–20K, 320 W, air-cooled, 3 TB DDR5) provides enterprises with:

Dimension	Current Enterprise Private AI	Path B Unified Architecture	Gap
Runnable Models	7B–70B (24–48 GB VRAM)	500B–1.7T FP4 (3 TB DDR5)	7–25×
Data Sovereignty	Small models local / large models via API	Complete large model 100% local	Qualitative leap
API Fees	Per-token pricing, ongoing expenditure	Zero marginal cost	Eliminated
Context Length	8K–32K (VRAM-limited)	Hundreds of thousands to millions of tokens	10–100×
Personalized Memory	No persistence	SSD-persistent KV Cache	From nothing to something
Deployment Requirements	Standard office/server room	Standard office/server room	Same

As an intuitive marginal cost example (not a full TCO): the writing process for this paper, from V1 to V7, with three-AI matrix review and physical validation—a single session consumed 87% of a 5× Max Claude user quota, using $24.20 in credits (54% of balance). The same workload’s marginal electricity cost on Path B: 320 W × 5 hours = 1.6 kWh × $0.10/kWh ≈ $0.16. The marginal electricity cost gap is approximately 150×. Note: Full TCO must include hardware depreciation ($10–20K amortized over 5 years ≈ $170–330/month), maintenance, model licensing, SSD wear, and idle rate—Path B’s total cost of ownership advantage depends on usage intensity and depreciation period.
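The arithmetic in this example can be reproduced as follows. This is a sketch of the marginal-cost and depreciation figures only; maintenance, licensing, SSD wear, and idle time are deliberately left out, and the function names are illustrative.

```python
def electricity_cost_usd(power_w: float, hours: float, usd_per_kwh: float) -> float:
    """Marginal electricity cost of running a node for `hours`."""
    return power_w / 1000 * hours * usd_per_kwh


def monthly_depreciation_usd(hardware_usd: float, years: float) -> float:
    """Straight-line amortization per month."""
    return hardware_usd / (years * 12)


session = electricity_cost_usd(320, 5, 0.10)   # 320 W for 5 hours at $0.10/kWh
ratio = 24.20 / session                        # API credits vs. electricity
low = monthly_depreciation_usd(10_000, 5)      # $10K node over 5 years
high = monthly_depreciation_usd(20_000, 5)     # $20K node over 5 years

print(f"session electricity: ${session:.2f}")
print(f"credit-to-electricity ratio: {ratio:.0f}x")
print(f"monthly depreciation: ${low:.0f} to ${high:.0f}")
```

Because monthly depreciation is roughly a thousand times the per-session electricity cost, the comparison with API pricing is dominated by how many sessions a node actually serves each month.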

17.2 From Enterprise to Consumer: A Phased Adoption Path

Path B’s hardware parameters—320 W (equivalent to a high-end gaming PC), air-cooled (no special thermal management), $10–20K—make a “personal AI server” physically feasible. However, commercialization should proceed in phases:

Phase 1: Enterprise private deployment (first to market)—Finance, healthcare, legal, government, and other sectors subject to data compliance constraints, with clear ROI on $10–20K equipment investment.

Phase 2: High-end prosumer—Professional researchers, law firms, independent AI developers, creator studios. $10–20K is comparable to a high-end professional workstation (Mac Studio Ultra starts at approximately $8K), within budget for this segment.

Phase 3: Mass consumer market (long-term vision)—As custom SoC volume production and DDR6/DDR7 cost reductions bring node costs to the $3–5K range, a “personal AI server” can potentially enter the mass market. This requires 5–10 years of technology and cost curve evolution.

Regardless of phase, the core value proposition remains the same: SSD-persistent KV Cache retains interaction history as long as the model version is unchanged, Dense deterministic output guarantees consistent behavior, and data never leaves the device. The paradigm shift from “renting intelligence” to “owning intelligence” is a real direction, but its pace depends on the cost curve.

17.3 Inference OS Stability Requirements: Structural Advantage of Eliminating the Distributed Communication Stack

The greatest source of instability in current AI inference infrastructure is not the GPU itself, but the distributed communication stack:

Source of Instability	Evidence	Status in Unified Architecture
NCCL timeouts/deadlocks	Meta HPCA 2025: NCCL timeouts are “relatively common”; 94% of tokens are routed to different experts in at least one layer (MoE); fault attribution is “challenging and noisy”	Eliminated—no NCCL
NVLink/NVSwitch link errors	Meta: Over 50% performance degradation without adaptive routing; network errors have a large “blast radius”	Eliminated—no NVLink/NVSwitch
DGX OS maturity	DGX Spark users: “extremely disappointed”; PCIe configuration errors, CIFS incompatibility, NVFP4 immaturity	Not applicable—simpler OS
Distributed scheduling complexity	Nebius: Full cluster validation requires 8–12 hours of GPU stress testing + NCCL bandwidth testing + thermal stability checks	Eliminated—single node, no scheduling

The unified architecture’s inference software stack degenerates from “CUDA + NCCL + cuDNN + TensorRT + vLLM + container orchestration + scheduler + load balancer” to a “single-process inference loop”—as concise as llama.cpp. This is highly isomorphic to the Apple Silicon + macOS design philosophy: single chip, unified memory, zero distributed coordination. The single-node architecture significantly reduces the failure surface—eliminating distributed failure modes such as NCCL timeouts, NVLink link errors, and straggler waiting—making a consumer-grade stable inference OS more feasible to build. However, a single node must still handle GPU drivers, memory errors, SSD wear, model hot updates, security sandboxing, agent misoperations, and system updates as independent failure surfaces.

17.4 Application Control Layer—LiteClaw in Practice

When AI transforms from a cloud service to a local device, a localized security control center becomes necessary. LEECHO Global AI Research Lab’s LiteClaw project (Apache 2.0 open source, github.com/leechoglobalai2025-hub/LiteClaw) has validated the feasibility of this layer:

The origin story of LiteClaw itself serves as user-side evidence for this paper’s §10 “software complexity regression”: OpenClaw (GitHub Stars 145,000+) caused token explosion due to continuous stacking of all conversation history—Gemini API’s TPM reached 1.26M/1M (exceeding the quota), rendering the system completely unusable. This is a real-world manifestation of “insufficient memory/context → complex compensatory engineering → system fragility.” LiteClaw solved the token management problem from the software side; the unified architecture eliminates this problem entirely from the hardware side—with 3 TB memory, “conversation history stacking” is no longer a cost bomb but a free local memory operation.

LiteClaw as an application control layer provides: zero-trust security architecture (SecretValue encapsulation, API keys never in plaintext), L0–L8 eight-layer strict unidirectional dependency (zero circular dependencies), three-stage audit engine (pre/exec/post), six-mode automatic log sanitization, multi-LLM support (Gemini/OpenAI/Anthropic/local vLLM), and multilingual interface (Chinese/English/Korean). On the unified architecture, LiteClaw evolves from a “cloud API token manager” to a “desktop control environment for local AI instances”—analogous to macOS Finder for hardware.

17.5 Multimedia Input Layer

When AI transforms from a cloud-based text box to a local device, a hardware input layer becomes natural and necessary: cameras (visual understanding, document scanning), microphones (voice interaction, meeting transcription), displays/touch (agent operation interface), and sensors (IoT data ingestion). In cloud AI, these inputs are constrained by upload bandwidth and privacy limitations. On the local unified architecture, multimodal data feeds directly into local inference: no network round trip, no upload, and no raw data leaving the device.

17.6 Path B Market Repositioning: Answering the “Business Economics Death Valley”

Gemini 3.1 raised this question during V6 review: If Vera Rubin (Path A) can already solve 95% of problems with vastly superior performance, who would invest in Path B’s custom SoC? The answer lies in: Path A and Path B serve entirely different customer segments.

Dimension	Path A Customer Segment	Path B Customer Segment
Customer Type	Hyperscalers, major AI labs	Global enterprises, research institutions, end users
Data Center Conditions	Liquid cooling, high-density power	Standard air-cooled server rooms, offices
Budget	$30K–60K/node	$10–20K/node
Operations Capability	Specialized GPU teams	General IT staff (macOS-grade stable OS)
Market Size	Tens of thousands of units (hyperscaler procurement)	Millions of units (enterprise/individual adoption)

If Gartner’s prediction is correct—over 50% of inference workloads running locally by 2026—Path B targets: a distributed AI infrastructure composed of millions of $10–20K independent nodes, replacing the current centralized infrastructure composed of tens of thousands of $2–3M liquid-cooled racks. This is a sufficiently large TAM to justify the R&D investment in a custom SoC.

CONCLUSION

Conclusion

The core thesis of this proposal is: AI inference is diverging, and long-context agent tasks require a hardware form factor different from high-batch GPU clusters. For low-batch dedicated inference, private deployment, and standard data center deployment, large-capacity unified memory independent inference nodes may constitute an important new product category.

This proposal demonstrates two complementary technical paths. Path A is based on the NVIDIA Vera Rubin Superchip (H2 2026 volume production), where 500B FP4 weights can reside entirely within a single Rubin GPU’s 288 GB HBM4, achieving approximately 75 t/s decode (dual-GPU TP=2 can reach ~150 t/s), complemented by 1.5 TB LPDDR5X for large-capacity KV Cache, and can immediately enter validation. Path B is based on a custom DDR5 RDIMM inference SoC (2028–2030), achieving extreme per-node cost (~$10–20K) and power (~320 W) with 3 TB+ DDR5, representing the optimal mid-to-long-term form factor but requiring new chip design. Path A serves hyperscalers and high-end research institutions (liquid-cooled environments), while Path B serves global enterprises and eventually individual users (standard server rooms/offices, air-cooled)—the two cover entirely different customer segments.

This proposal further reveals the product ecosystem potential of the unified architecture. When a single node can completely run a 500B-class large model, AI inference transforms from “renting hyperscaler liquid-cooled supercomputers” to “purchasing your own inference device”—a personalized distributed AI deployment targeting millions of enterprises worldwide and eventually individuals. More importantly, single-node execution eliminates the distributed communication stack (NCCL/NVLink/NVSwitch)—the greatest source of instability in current AI infrastructure—significantly reducing the failure surface and making a consumer-grade stable inference OS more feasible. On stable hardware and OS foundations, LEECHO Global AI Research Lab’s LiteClaw project has validated the feasibility of a secure AI control center (zero-trust architecture, agent scheduling, multi-LLM management), pointing to a complete four-layer product stack: hardware → inference OS → application control → multimedia input.

Dense models have engineering advantages in deterministic output and bandwidth efficiency, but 500B Dense FP4 is currently a hypothetical asset. SSD-persistent KV Cache is an interrupt-recovery acceleration mechanism for the same model version, not a general-purpose agent memory solution.

An important value dimension of this proposal is the significant reduction in software complexity. In the current AI inference full stack, approximately 19 auxiliary technologies—from context compression to RAG pipelines, from KV Cache eviction to tensor parallel communication—exist because of “insufficient memory.” The large-memory architecture, by expanding the physical boundary, enables approximately 6 of these to be completely removed in target scenarios, approximately 5 to be substantially weakened, and approximately 3 (security compliance, ultra-long sequence computational optimization, and platform operations) to remain essential. Engineering complexity is significantly reduced but does not reach zero—each auxiliary technology removed simultaneously eliminates the information loss it introduced, ultimately improving inference quality.

Ultra-long context (300K+ tokens) prefill latency and attention O(n²) computational cost are serious bottlenecks for Path B—Path A (Vera Rubin 100 PFLOPS) holds a massive advantage on this dimension.

This proposal is not a “proven product solution,” but rather a “high-quality architectural hypothesis + validation roadmap.” Its strongest contribution is not power or cost savings, but a redefinition of the optimization objectives for agent inference hardware: from putting throughput first to putting state capacity, recoverability, and system simplicity first. The next step is to move from paper to benchmark.

References and Disclosures

[1] Micron Technology, “Micron Redefines AI Performance With Sampling of 256GB DDR5 Server Module,” May 12, 2026.

[2] NVIDIA, “GB300 NVL72 Product Page,” nvidia.com, 2025–2026.

[3] NVIDIA, “Blackwell Architecture Technical Overview,” nvidia.com, 2024–2025.

[4] NVIDIA, “Grace CPU Superchip Architecture In Depth,” developer.nvidia.com, 2023–2024.

[5] LMSYS, “NVIDIA DGX Spark In-Depth Review,” October 2025.

[6] SemiAnalysis, “GB200 Hardware Architecture — Component Supply Chain & BOM,” 2024–2025.

[7] SemiAnalysis, “Co-Packaged Optics (CPO) — Scaling with Light,” 2026.

[8] antirez (@antirez), X/Twitter posts on DeepSeek V4 PRO on Mac Studio M3 Ultra, May 17, 2026.

[9] SK Hynix, “DRAM Development Roadmap Through 2031,” November 2025.

[10] TrendForce, “AI to Consume 20% of Global DRAM Wafer Capacity in 2026,” December 2025.

[11] Tom’s Hardware, “HBM is Coming for Your PC’s RAM,” December 2025.

[12] Ma et al., “Stabilizing MoE RL by Aligning Training and Inference Routers (R3),” arXiv:2510.11370, Oct 2025.

[13] dasroot.net, “Dense vs. MoE: Decoding the Mystery of Small Model Supremacy,” April 2026.

[14] Cerebras, “Router Wars: Which MoE Routing Strategy Actually Works,” December 2025.

[15] CraftRigs, “Decode Speed Explained: Tokens Per Second in Local LLMs,” March 2026.

[16] Morph, “Tokens Per Second: LLM Speed Benchmark Guide (2026),” April 2026.

[17] NVIDIA, “Introducing Nemotron 3 Super for Agentic Reasoning,” March 2026.

[18] Rath, A., “Agent Drift: Behavioral Degradation in Multi-Agent Systems,” arXiv:2601.04170, Jan 2026.

[19] “Tutti: Making SSD-Backed KV Cache Practical,” arXiv:2605.03375, May 2026.

[20] “KV Cache Offloading for Context-Intensive Tasks,” arXiv:2604.08426, April 2026.

[21] WEKA, “Nvidia and its partners’ KV Cache extenders,” Blocks and Files, March 2026.

[22] “When Refusals Fail: Unstable Safety in Long-Context LLM Agents,” arXiv:2512.02445, 2026.

[23] Introl Blog, “InfiniBand vs Ethernet for GPU Clusters,” March 2026.

[24] PC Gamer, “Micron unveils 256 GB memory module destined for AI servers,” May 2026.

[25] Tom’s Hardware, “NVIDIA Announces Rubin GPUs in 2026, Rubin Ultra in 2027,” March 2025.

[26] Aeon Project, “High-Performance Neuro-Symbolic Memory Management for Long-Horizon LLM Agents,” arXiv:2601.15311, Jan 2026.

[27] VentureBeat / Medium, “RAG is DEAD — Million-token context windows and agentic AI are rewriting the playbook,” Jan 2026.

[28] Memex(RL), “Scaling Long-Horizon LLM Agents via Indexed Experience Memory,” arXiv:2603.04257, Mar 2026.

[29] “LLM Agent Memory: A Survey from a Unified Representation-Management Perspective,” Preprints.org, Mar 2026.

[30] “SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression,” arXiv:2511.18936, 2025.

[31] Kailash, P., “LLM Context Windows: How Engineers Are Fixing the Memory Problem (2026),” Medium, Apr 2026.

[32] NVIDIA, “Vera Rubin Platform: Six New Chips,” developer.nvidia.com, Jan 2026.

[33] VideoCardz, “Vera Rubin NVL72 Detailed: 88 cores, 1.5TB LPDDR5X, 1.8TB/s C2C,” Jan 2026.

[34] ServeTheHome, “NVIDIA Launches Rubin AI Compute Platform at CES 2026,” Jan 2026.

[35] The Register, “Nvidia unpacks Vera Rubin rack system at CES,” Jan 2026.

[36] Introl Blog, “B200 vs GB200 Deployment Guide,” Apr 2026.

[37] FreeCodeCamp, “Evolution of Nvidia Blackwell GPU Memory Architecture,” 2026.

[38] HPE, “HPE AI Grid — Distributed AI Factories powered by NVIDIA,” GTC 2026, Mar 2026.

[39] Gartner, “AI Spending Forecast: $2.5T in 2026,” 2025; IDC, “AI Infrastructure $758B by 2029.”

[40] NVIDIA Developer Forums, “I am EXTREMELY disappointed with DGX Spark,” Apr 2026.

[41] NVIDIA, “DGX OS Known Issues — PCIe Relaxed Ordering, CIFS/DOCA incompatibility,” Release Notes.

[42] Meta, “Revisiting Reliability in Large-Scale ML Research Clusters,” HPCA 2025, arXiv:2410.21680.

[43] NVIDIA, “NCCL Troubleshooting Guide — Timeouts, cuMem, NUMA, ACS,” NCCL 2.30 Docs.

[44] Scalastic.io, “Apple Silicon vs NVIDIA CUDA: AI Comparison 2025,” Aug 2025.

[45] Compute Market, “Local AI Server for Business 2026 — Build Guide + ROI,” Mar 2026.

[46] LEECHO Global AI Research Lab, “LiteClaw — Security-First AI Control Center,” Apache 2.0, github.com/leechoglobalai2025-hub/LiteClaw.

[47] DCD, “Vera Rubin NVL72 will be 100 percent liquid cooled,” Mar 2026.

[48] BigGo Finance / The Information, “Musk Hoards 550,000 GPUs, Yet MFU Sits at Just 11%,” May 2026.

[49] Modal, “GPU Utilization Guide: MFU in Training — Meta 38–41%, DeepSeek 20–30%,” Feb 2025.

[50] SemiAnalysis, “Multi-Datacenter Training: MFU from 40% to 30% = 250K idle GPUs at 1M scale,” Sep 2024.

[51] Tom’s Hardware, “Colossus 1 inefficient mixed-architecture → Anthropic renting for inference,” May 2026.

[52] ikangai, “GPT-4 Leaked: MFU 32–36% due to parallelization complexity,” Jul 2023.

Disclaimer: This paper is an independent technical design proposal and does not constitute investment advice. Company and product names mentioned herein are trademarks of their respective owners. Some data is based on reasonable extrapolation from publicly available information; actual values may differ. Vera Rubin Superchip specifications are based on publicly released information from CES 2026; production specifications may vary. BOM estimates are pre-production projections; actual prices are subject to market and supply chain fluctuations.
