THOUGHT PAPER · MAY 2026 · V4

RAG and Long-Term Memory for AI

From External Persistent Knowledge Layers to Memory OS: A Systems Engineering Path for LLM Long-Term Memory

From RAG to Memory OS:
A Systems Engineering Path for Persistent AI Memory

Published: May 18, 2026

Category: Original Thought Paper

Domains: RAG · Long-Term Memory · Memory OS · Knowledge Engineering · AI Systems Architecture

Version: V4 (Tri-Model Adversarial Review + Dense Structural Audit)

이조글로벌인공지능연구소

LEECHO Global AI Research Lab

Opus 4.6 · GPT 5.5 · Gemini 3.1

인지집단 (Cognitive Collective)

Abstract

Based on publicly available research and industry data as of May 2026, this paper systematically argues a central thesis: Under current and foreseeable LLM technology stacks, the most mature, most engineerable, and most controllable path for long-term memory oriented toward user or enterprise private knowledge is Generalized RAG—a system paradigm combining external persistent knowledge layers, updatable indexes, permission isolation, on-demand retrieval, and context injection. A complete long-term memory system should aspire to the Memory OS level, encompassing a full-cycle closed loop of read, write, compress, delete, and audit. The term RAG in this paper is not limited to traditional vector retrieval; it refers to any system paradigm that injects external persistent knowledge into model inference in a retrievable, updatable, and isolatable form—with exclusion criteria that explicitly delineate the boundaries of this paradigm. Through systematic examination of parametric memory, context windows, fine-tuning, KV Cache persistence, knowledge compilation, knowledge graphs, and other candidate approaches, we demonstrate that they serve more as complements, backend variants, or partial optimizations to the RAG paradigm, rather than complete substitutes. The paper proposes an eight-layer classification model for AI long-term memory and a twelve-dimensional scenario-based evaluation framework, along with a complete memory write pipeline architecture, compression fidelity analysis, conflict intent inference model, and background compute cost estimation.

RAG
Long-Term Memory
Memory OS
Agentic RAG
Knowledge Compilation
Memory Write Pipeline
Compression Fidelity
Context Window
KV Cache
MoE
AI Memory Systems

SECTION 01

Introduction: The Amnesia Problem of Large Language Models
Why LLMs Forget Everything After Every Conversation

Large language models possess the most extensive “factory-installed knowledge” in human history—trillions of tokens of training data compressed into billions to hundreds of billions of parameters. Yet this parametric memory is static. A January 2026 study published in Nature Communications revealed a disquieting fact: LLMs do not store discrete facts but rather assemble fragments from similar sequences, meaning that parametric memory is inherently unreliable for precise recall of specific data.

More critically, parametric memory cannot be updated (except through full retraining), cannot be personalized (all users share the same set of weights), and cannot be deleted (specific information cannot be “forgotten”). This makes large models fundamentally “amnesic” when confronted with any scenario requiring persistent personal knowledge—whether enterprise proprietary data, user preferences, or project histories.

The Retrieval-Augmented Generation (RAG) framework, proposed by Patrick Lewis et al. at Facebook AI Research (now Meta AI) in 2020, was originally designed merely to address the problem of “outdated model knowledge.” Six years later, RAG has far exceeded its original design intent and become the core infrastructure for persistent long-term memory in AI systems. This paper systematically argues this thesis and explicitly delineates both its scope of applicability and the boundaries beyond which it does not apply.

SECTION 02

Defining the Boundaries of RAG: From Narrow to Broad
With Exclusion Criteria to Prevent Tautology

The argumentation in this paper requires first clarifying a conceptual question: when we say “RAG is the core path for long-term memory,” what exactly does RAG mean? If all external knowledge invocations are defined as RAG, then “all long-term memory depends on RAG” approaches tautology. We therefore distinguish three levels:

Level	Definition	Representative Technologies
Narrow RAG	Vector database + document chunking + embedding retrieval + prompt injection	LangChain RAG pipeline, Pinecone + OpenAI embeddings
Generalized RAG	Any system paradigm that injects external persistent knowledge into model inference in a retrievable, updatable, and isolatable form	Graph-RAG, Knowledge Compilation (Nexus), TAG, Agentic RAG, Hybrid Retrieval
Memory OS	Generalized RAG + write pipeline + compression fidelity + deletion governance + permission system + version management + conflict resolution + feedback learning	Letta/MemGPT, Mem0 + ACL, Zep/Graphiti + temporal reasoning

The central thesis of this paper targets Generalized RAG. We do not argue that “vector search is irreplaceable” (Narrow RAG may well be superseded by superior retrieval technologies), but rather that the system paradigm of “external persistent knowledge layer + updatable index + permission isolation + on-demand retrieval + context injection” is irreplaceable.

2.1 Negative Definition: What Is Not Generalized RAG

To prevent conceptual overextension from collapsing into tautology, we explicitly provide exclusion criteria for Generalized RAG. If any one of the following exclusion conditions applies to a system, it does not qualify as Generalized RAG:

Exclusion Condition	Systems Outside Generalized RAG	Reason
No external persistent storage	Pure parametric memory (knowledge embedded in pretrained weights)	Knowledge is frozen in weights and cannot be independently updated or deleted
No runtime retrieval	Pure fine-tuning/LoRA (knowledge absorbed during training)	Knowledge is written into weights during training; no “lookup” action exists at inference time
No persistence	Pure context window injection (in-session prompt assembly)	Vanishes upon session termination, failing the persistence requirement for long-term memory
No queryable index	Pure KV Cache persistence	Preserves intermediate computational states, not queryable or editable knowledge objects
No knowledge isolation	Global agent policy learning without permission boundaries	All users share the same policy space; personalized knowledge isolation is not supported

For a system to qualify as Generalized RAG, it must simultaneously satisfy four criteria: (A) an external persistent knowledge store independent of model weights exists; (B) query-based retrieval actions occur at inference time (rather than full-volume injection); (C) knowledge can be added, modified, or deleted independently of the model; (D) knowledge boundaries can be isolated by user, tenant, or permission level. All four satisfied → Generalized RAG; missing any one → not Generalized RAG, likely a complementary technology.
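The four-criteria test can be sketched as a simple checklist. This is a minimal illustration, not a formal classifier: `SystemProfile`, its field names, and the two example profiles are hypothetical, and the boolean values assigned to each example are simplified readings of the exclusion table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical description of a memory system against criteria A-D."""
    name: str
    external_store: bool       # (A) persistent store independent of model weights
    runtime_retrieval: bool    # (B) query-based retrieval at inference time
    independent_updates: bool  # (C) add/modify/delete without touching the model
    isolation: bool            # (D) boundaries by user, tenant, or permission level


def missing_criteria(p: SystemProfile) -> list[str]:
    """Return the letters of the criteria the system fails (empty if all hold)."""
    checks = {
        "A": p.external_store,
        "B": p.runtime_retrieval,
        "C": p.independent_updates,
        "D": p.isolation,
    }
    return [letter for letter, ok in checks.items() if not ok]


def is_generalized_rag(p: SystemProfile) -> bool:
    """All four criteria must hold at the same time."""
    return not missing_criteria(p)


# Illustrative profiles; the flag values are assumptions for the example.
fine_tuning = SystemProfile("pure fine-tuning", False, False, False, False)
graph_rag = SystemProfile("tenant-scoped Graph-RAG", True, True, True, True)

print(is_generalized_rag(graph_rag))   # True
print(missing_criteria(fine_tuning))   # ['A', 'B', 'C', 'D']
```

Returning the list of failed criteria, rather than a bare boolean, makes it easier to say why a technology is a complement and not an instance of the paradigm.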

Through these exclusion criteria, the central thesis of this paper can be stated more precisely: Within the intersection of the four necessary properties A+B+C+D, no other single technology paradigm can satisfy all simultaneously; therefore, Generalized RAG possesses structural irreplaceability in the domain of long-term semantic memory. This is not tautological—because we have explicitly identified five categories of systems that fall outside Generalized RAG.

At the same time, we contend that a complete long-term memory system should aspire to the Memory OS level—capable not only of “reading” (retrieval) but also “writing” (memory formation), “compressing” (multi-scale summarization), “deleting” (forgetting and deletion), and “governing” (permissions and audit). Subsequent sections will address each of these dimensions.

SECTION 03

Six Years of RAG Evolution
From Academic Paper to Industry Standard: 2020–2026

RAG’s journey from academic paper to industry standard traversed four distinct phases.

3.1 Foundation Period (2020–2021)

In May 2020, Lewis et al. released the foundational paper, later presented at NeurIPS 2020, combining parametric and non-parametric memory to significantly outperform conventional systems on knowledge-intensive QA tasks. Concurrently, REALM achieved joint training of retrieval and pretraining, while DPR demonstrated that dense retrieval could exceed BM25 by 9–19 percentage points in top-20 passage retrieval accuracy.

3.2 Scaling and Exploration Period (2022–2023)

DeepMind’s RETRO pushed retrieval augmentation to the trillion-token corpus scale. The LangChain and LlamaIndex ecosystems propelled RAG from academia into engineering practice. Self-RAG and CRAG introduced self-reflection mechanisms. The vector database market experienced explosive growth.

3.3 Rise of Agentic RAG (2024–2025)

RAG was no longer a single-pass pipeline but an iterative loop of “think → retrieve → re-think → re-retrieve → act.” Anthropic launched MCP, later donating it to the Linux Foundation where it became the de facto standard. Multi-modal RAG and Graph-RAG emerged in succession.

3.4 The Post-RAG Paradigm Shift (2026–Present)

In May 2026, Pinecone released Nexus—a “knowledge compiler” that shifts inference from query time to compile time. Microsoft Fabric IQ and Google Knowledge Catalog simultaneously launched similar architectures. This signals that RAG is being absorbed by higher-level “knowledge layer” architectures, yet the core “store → retrieve → inject” paradigm remains unchanged.

“RAG was built for human users. Nexus was built for agent users—because they speak an entirely different language and expect entirely different responses.”

— Ash Ashutosh, Pinecone CEO, VentureBeat, May 2026

SECTION 04

Eight Layers of AI Memory and Where RAG Applies
With System 1/System 2 Boundary Delineation

This paper expands LLM memory into an eight-layer classification, splitting “preference memory” into explicit and implicit preferences to eliminate internal contradictions with the “System 1/System 2” framework:

Memory Type	Definition	Optimal Technical Path	RAG Suitability
Semantic Memory	Stable facts, concepts, knowledge	Generalized RAG (vector/graph/knowledge compilation)	★★★★★ Core scenario
Episodic Memory	Events, conversations, timelines	RAG + temporal indexing (Zep/Graphiti)	★★★★★ Core scenario
Explicit Preferences	Articulable preferences: dietary restrictions, time zone, language, tool choices	RAG (user profile key-value pairs/embeddings)	★★★★★ Core scenario
Implicit Preferences	Hard-to-articulate preferences: aesthetics, style, humor, subtle attitudes	Fine-tuning/LoRA/long-term behavioral learning	★★☆☆☆ Difficult for RAG
Procedural Memory	Skills, operational workflows, strategies	Fine-tuning/LoRA, workflow templates	★★☆☆☆ Suboptimal for RAG
Social Memory	Interpersonal relationships, interaction history	Knowledge graph + RAG	★★★★☆ Graph backend preferred
Working Memory	Current task state	Context window + KV Cache	★☆☆☆☆ Not a RAG scenario
Reflective Memory	Summaries, retrospectives, self-corrections	Agent memory + RAG	★★★☆☆ Requires write strategy

“System 1/System 2” Memory Distinction: Borrowing Kahneman’s dual-system framework as an engineering design metaphor (note: the dual-system theory itself is debated within cognitive science; this is an engineering analogy rather than a cognitive science claim), we delineate RAG’s applicability boundary by “articulability.” Memories that can be explicitly expressed in text are suitable for RAG; memories that can only be “felt” require fine-tuning. Splitting “preference memory” into “explicit preferences” (★★★★★) and “implicit preferences” (★★☆☆☆) eliminates the internal contradiction. RAG stores facts and histories; fine-tuning stores capabilities and habits—the two are complementary, not substitutes.

4.1 Five Necessary Conditions for Long-Term Memory

Necessary Condition	Generalized RAG	Parametric Memory	Context Window	Fine-Tuning
Persistence	✅	✅ Not precisely updatable	❌	✅ High cost
Updatability	✅	❌	✅ In-session only	⚠️ Requires retraining
On-demand Retrieval	✅	❌ Imprecise	✅ Within window	❌
Personalized Isolation	✅	❌ Global	✅ In-session	⚠️ Requires multiple copies
Auditable Deletion	✅ (Requires full-pipeline design)	❌	✅ (Vanishes on session end)	❌ (Cannot precisely forget)

SECTION 05

Candidate Approach Assessment: Replacement or Complement?
Systematic Elimination of Alternative Paradigms

Each of the candidate technologies assessed below has independent value, but none can independently satisfy all five necessary conditions for long-term semantic memory. They are complements to, or backend variants within, the Generalized RAG paradigm.

5.1 Fine-Tuning / LoRA — The Optimal Path for Capability Memory

Effective for encoding stable skills, styles, implicit preferences, and domain formats, but unsuitable for frequently changing facts. Per the exclusion criteria in Section 02 (no runtime retrieval), pure fine-tuning does not qualify as Generalized RAG.

5.2 Ultra-Long Context — Extension of Working Memory

Can significantly reduce the need for retrieval in certain scenarios, but cannot substitute for data deletion, permission isolation, version control, and auditing. Per the exclusion criteria (no persistence), pure context windows do not qualify as Generalized RAG.

5.3 TTT-E2E — A Supplementary Layer for Broad Understanding

Compresses context into model weights at inference time. The researchers themselves recommend complementary use with RAG.

5.4 KV Cache Persistence — Computational State, Not Knowledge Object

Preserves intermediate computational states, not queryable, editable, or auditable knowledge objects. Per the exclusion criteria (no queryable index), this does not qualify as Generalized RAG.

5.5 Knowledge Graphs — A Structured Memory Backend

Provides explicit entity relationships, interpretable reasoning paths, and conflict detection. It is one of the most powerful structured memory backends for Generalized RAG (satisfying all four ABCD criteria).

5.6 Knowledge Compilation — Internal Evolution Within RAG

Pinecone Nexus and similar systems shift inference to a compile stage. The underlying system still satisfies all four ABCD criteria, representing engineering evolution within Generalized RAG.

Assessment Conclusion: Each of the above technologies has its applicable scenarios, but for long-term semantic memory oriented toward private knowledge, none can independently satisfy all five necessary conditions. They function more as supplements, backend variants, or partial optimizations to the Generalized RAG paradigm, rather than complete replacements.

SECTION 06

Architectural Analysis of Memory Systems in 2026
Evidence-Graded Assessment of Major LLM Providers

Based on publicly observable product behavior and published technical documentation, the memory implementations of major LLM providers exhibit strong alignment with the Generalized RAG / external memory layer paradigm.

Model	Memory Mechanism	Generalized RAG Characteristics
ChatGPT (GPT-5)	Persistent user memory + semantic retrieval injection	External persistent storage + runtime retrieval + injection
Claude	24-hour conversation memory synthesis, persisted, automatically retrieved and injected	Standard write → store → retrieve → inject pipeline
Gemini	Personal Intelligence: Gmail/Drive + knowledge graph + cross-modal	Multi-modal retrieval augmentation over user’s real data
Grok	X (Twitter) history/followers/interaction topics	Retrieval augmentation over social behavioral data

The classifications above are based on publicly observable product behavior and published technical documentation. Internal implementations may include rule engines, profile stores, policy layers, caching strategies, and other non-RAG components; actual architectures are almost certainly hybrid systems incorporating multiple technologies. These classifications should be understood as architectural inferences rather than fully verified facts. Nevertheless, the core pattern of “external persistent storage + runtime retrieval + context injection” is consistently observable in the behavior of all major products.

The explosive growth of dedicated memory-layer products (Mem0, Letta/MemGPT, Zep/Graphiti, Hindsight) further confirms: RAG is not legacy technology destined for replacement; it is being absorbed and elevated by higher-level “persistent cognition” architectures.

SECTION 07

Success Rate Reality and the Preprocessing Lever
From 40% Failure to 1.9%: The Impact of Pipeline Design

7.1 The Harsh Reality of Current Success Rates

Benchmark / Scenario	Success Rate	Description
Spider 1.0	86.6%–91.2%	Clean, small-scale schemas
BIRD-SQL	81.95%	Includes noisy data, domain knowledge dependencies
Spider 2.0	6%–21.3%	Real-world enterprise schemas
BIRD-Interact	8.67%	Simulated real DBA scenarios
Naïve RAG Pipeline	~60%	Without preprocessing optimization

7.2 Preprocessing: Retrieval Failure Rate from 40% to 1.9%

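Anthropic’s contextual retrieval research provides the clearest stratified quantitative data to date. In the ladder below, the 3.7%, 2.9%, and 1.9% stages come from that research, and the percentages in parentheses are relative reductions from its 5.7% baseline failure rate for top-20 chunk retrieval; the remaining stages come from other sources: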

Raw Files + Fixed-Size Chunking
Retrieval failure rate ~40%

↓ Format conversion to structured Markdown

Structure-Aware Chunking
Recall rate ~85–90%

↓ Contextual augmentation (section path per chunk)

+ Contextual Embeddings
Failure rate drops to 3.7% (↓35%)

↓ Hybrid retrieval (vector + BM25)

+ Hybrid Retrieval
Failure rate drops to 2.9% (↓49%)

↓ Reranking

Complete Pipeline
Failure rate drops to 1.9% (↓67%)

↓ MDKeyChunker semantic key annotation

MDKeyChunker + BM25
Recall@5 = 1.000, MRR = 0.911

Key Finding: Vectara’s NAACL 2025 study confirmed that chunking strategy impacts retrieval quality as much as—or more than—embedding model selection.

Scope Limitation: The data above originate from different studies, different datasets, and different task definitions, and therefore cannot be directly chained into a unified causal ladder. Actual effectiveness depends on document type, domain complexity, and query patterns.
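The percentages in parentheses in the ladder above are relative reductions, not absolute failure rates. A short sketch shows the arithmetic, assuming the 5.7% baseline that Anthropic's contextual retrieval post reports for top-20 chunk retrieval failure; the function and variable names are illustrative.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop in failure rate, measured against the baseline."""
    return (baseline - improved) / baseline


BASELINE = 5.7  # percent; assumed baseline from Anthropic's published figures

# Failure rates (percent) for the three Anthropic stages quoted above.
stages = {
    "contextual embeddings": 3.7,
    "+ hybrid retrieval (BM25)": 2.9,
    "+ reranking": 1.9,
}

for name, rate in stages.items():
    drop = relative_reduction(BASELINE, rate)
    print(f"{name}: {rate}% failure, {drop:.0%} lower than baseline")
```

Read this way, the 3.7%, 2.9%, and 1.9% stages are measured against a 5.7% starting point and not against the ~40% figure at the top of the ladder, which is consistent with the scope limitation stated above.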

SECTION 08

The Structural Advantage of Markdown and Its Limits
Why Format Matters More Than Model Selection

Dimension	Markdown	HTML	Plain Text
RAG Friendliness	★★★★★	★★★☆☆	★★☆☆☆
Best Retrieval Success Rate	Recall@5 = 1.000	Hit@1 = 68.5	Baseline
Token Efficiency	Very high	Low → Medium (90–97% of tokens must be stripped)	High but unstructured
Structure Preservation	Native	Requires specialized processing	Completely lost

Microsoft MarkItDown (91K+ Stars), IBM Docling, LlamaParse, and Firecrawl have collectively established a complete “any format → RAG-ready Markdown” pipeline.

Scope of Applicability: Markdown is a strong intermediate format for text-based files in RAG, but for table-dense PDFs, scanned documents/charts, legal contracts/financial reports, and structured database data, multi-modal parsing and metadata preservation are needed as supplements.
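To make the structural advantage concrete, the following sketch splits a Markdown document at its headings and attaches the heading trail to each chunk as a section path. It is a simplified illustration and not the implementation of any tool named above: `chunk_markdown` and the `section_path` key are hypothetical names, only ATX-style `#` headings are handled, and `#` lines inside fenced code blocks would be misread as headings.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str) -> list[dict]:
    """Split Markdown at headings and attach each chunk's heading path."""
    path: list[str] = []    # current heading trail, e.g. ["Guide", "Install"]
    body: list[str] = []    # lines collected under the current heading
    chunks: list[dict] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({"section_path": " > ".join(path), "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                   # close the chunk under the previous path
            level = len(match.group(1))
            del path[level - 1:]      # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro.\n## Install\nRun it.\n## Usage\nCall it."
chunks = chunk_markdown(doc)
for chunk in chunks:
    print(chunk["section_path"], "|", chunk["text"])
```

Prefixing the section path to the chunk text before embedding is one simple way to keep a fragment's position in the document visible to the retriever.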

SECTION 09

Context Window and KV Cache Constraints
The Gap Between Advertised and Effective Context Length

Lost-in-the-Middle Degradation: 10–25% accuracy drop in middle positions across all models (TokenMix, April 2026)

Effective vs. Advertised Context Gap: up to 99% gap between effective and advertised context window on complex tasks (Paulsen 2025)

KV Cache compression also affects RAG quality. Conventional methods are “blind” to queries, risking the deletion of critical evidence. 2026 solutions such as KVzip and CacheClip optimize specifically for RAG scenarios, achieving 3–4× KV reduction and ~2× latency improvement.

The state-of-the-art implementation in 2026 is a tripartite synergy: RAG handles precise retrieval, long-context models handle deep reasoning, and KV Cache optimization handles latency and cost control.

SECTION 10

Dense vs. MoE: Architectural Impact on RAG
Structural Affordances and Engineering Trade-offs

MoE architecture offers unique structural affordances for RAG. Research from Fudan University and Tencent identified three types of core experts in Mixtral:

Core Expert	Function	Significance for RAG
Cognition Expert	Determines whether internal knowledge is sufficient	Avoids unnecessary retrieval
Quality Expert	Evaluates retrieved document quality	Filters low-quality documents
Context Expert	Enhances external knowledge utilization	Better “reads” RAG documents

MoE’s expert routing mechanism provides structural affordances for Adaptive RAG—the router natively supports routing between simple and complex queries. NVIDIA’s analysis notes that MoE’s reduction of time-to-first-token latency is particularly critical in RAG’s multi-call scenarios. However, it must be noted that Dense models can also achieve similar retrieval decision capabilities through external controllers (query classifiers, retrieval gating, confidence estimation, self-reflection prompting). Thus, this represents an engineering advantage for MoE rather than a capability that Dense architectures absolutely lack.

MoE does not save memory (all experts must reside in memory), making Dense small models more suitable for local/edge RAG deployments. The ideal solution is a fusion architecture combining MoE routing + RAG retrieval + Agent tool selection, while maintaining the complementary role of Dense small models in latency-sensitive scenarios.

SECTION 11

Fragmented Recall vs. Global Synthesis: RAG’s Structural Blind Spot
Point Queries, Surface Queries, and the Compression Fidelity Problem

RAG’s essence is “fragmented recall”—it excels at finding the few most query-relevant fragments but cannot synthesize macro-level trends across long time spans.

11.1 Typical Scenarios of Synthesis Failure

When a user asks “What strategic shifts have occurred in my career planning over the past three years?”, RAG retrieves dozens of scattered fragments, but the model struggles to assemble them into a macro-level trend spanning three years. This is not a retrieval failure, since Recall may be quite high. It is a structural synthesis limitation: RAG’s chunking granularity is poorly suited to “global bird’s-eye-view” questions.

11.2 Four Architectural Directions for Overcoming Fragmentation

Architectural Direction	Mechanism	What It Solves	Maturity
Hierarchical Summarization	Day → week → month → year summary pyramid	Multi-granularity “bird’s-eye view”	⚠️ Experimental
Episodic Compression	Compresses continuous conversations into structured “episode cards”	Narrative coherence	⚠️ Explored by Letta
Multi-Scale Retrieval	Simultaneously indexes atomic fragments and summary fragments	Fact/trend routing	✅ Existing practice
Temporally-Aware Graphs	Temporal knowledge graphs tracking entity evolution	“What changed” type questions	⚠️ Early-stage products

Point Queries vs. Surface Queries: RAG’s success metrics (Recall, MRR, Faithfulness) measure point query capability. But long-term memory equally requires surface query capability—synthesizing macro-level insights across time and topics. Current RAG exhibits structural deficiencies in the latter.

11.3 RAG’s Own Latency Cost

In a typical Agentic RAG call chain, each step adds 50–200ms, and end-to-end time to first token (TTFT) can reach 2–5 seconds.

Scenario	Latency Tolerance	RAG Suitability	Alternative
Knowledge Base Q&A	3–10s	★★★★★	—
Document Analysis/Reports	10–30s	★★★★★	—
Real-Time Conversational Assistant	<1s	★★☆☆☆	Prompt Cache + Long Context
Code Completion	<500ms	★☆☆☆☆	Fine-tuning + Local Model
Voice Interaction	<800ms	★★☆☆☆	Core Memory Block + Prompt Cache

11.4 Compression Fidelity: The Positive Feedback Risk of Summary Distortion

Hierarchical summarization is a key architecture for overcoming fragmented recall, but introduces a new risk: summary distortion can become permanently solidified as long-term memory.

Distortion Type	Description	Consequence
Detail Loss	Critical numbers, dates, and conditions compressed away	Macro-level judgments lose critical premises
Causal Inversion	Summary reverses cause-and-effect relationships	Trend analysis conclusions are inverted
Over-Generalization	Outlier events are flattened	Inflection-point signals are erased
Value Tampering	Summarizer bias alters the stance of original text	User preferences are silently rewritten
Legacy Summary Contamination	New summaries inherit and amplify errors from old summaries	Positive feedback loop of compression distortion

Every summary must retain a provenance pointer linking back to the original evidence; otherwise, it cannot serve as the sole source for long-term memory. Multi-level summarization systems should incorporate periodic verification mechanisms: recomparing high-level summaries against underlying data to detect cumulative distortion. A summary without traceability is not memory compression—it is information destruction.
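The provenance requirement can be sketched as a data structure plus a periodic check. This is a minimal sketch under stated assumptions: the `Summary` record, the `source_ids` field, and the `untraceable` helper are hypothetical names, and the check only verifies that pointers resolve. It does not compare summary content against the evidence, which would need a separate fidelity evaluation.

```python
from dataclasses import dataclass, field


@dataclass
class Summary:
    """Hypothetical summary record; source_ids are its provenance pointers."""
    summary_id: str
    text: str
    source_ids: list[str] = field(default_factory=list)


def untraceable(summaries: list[Summary], evidence: dict[str, str]) -> list[str]:
    """Return ids of summaries with no pointers or with dangling pointers."""
    flagged = []
    for s in summaries:
        dangling = any(sid not in evidence for sid in s.source_ids)
        if not s.source_ids or dangling:
            flagged.append(s.summary_id)
    return flagged


# Evidence store keyed by id; in practice this would hold raw records
# and lower-level summaries.
evidence = {"msg-1": "Started new role in March.", "msg-2": "Moved to Berlin."}

summaries = [
    Summary("week-12", "Changed jobs and relocated.", ["msg-1", "msg-2"]),
    Summary("week-13", "Prefers remote work.", []),           # no provenance
    Summary("week-14", "Considering a sabbatical.", ["msg-9"]),  # dangling
]

print(untraceable(summaries, evidence))  # ['week-13', 'week-14']
```

Flagged summaries would be candidates for regeneration from the underlying data, not sources to trust on their own.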

11.5 Computational Economics of Hierarchical Summarization

The Gemini 3.1 review raised a frequently overlooked question: Who pays for the background compute?

Summary Level	Frequency	Estimated Tokens/Run	Annualized Total
Daily Summary	365 runs/year	~2,500	~912K Tokens
Weekly Summary	52 runs/year	~4,300	~224K Tokens
Monthly Summary	12 runs/year	~7,500	~90K Tokens
Annual Summary	1 run/year	~21,000	~21K Tokens
Total			~1.25M Tokens/user/year

Estimated at 2026 API pricing (Sonnet-tier model), the annualized per-user background summarization cost is approximately $5–10. For a platform with one million users, annualized expenditure can reach $5M–10M—which explains why only top-tier providers currently offer automatic memory synthesis features.
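The totals in the table can be reproduced with a few lines of arithmetic. The run counts and token estimates are taken from the table above. The blended per-million-token prices of $4 and $8 are assumptions chosen to bracket the stated $5–10 range, not quoted API prices.

```python
# level: (runs per year, estimated tokens per run), from the table above
LEVELS = {
    "daily": (365, 2_500),
    "weekly": (52, 4_300),
    "monthly": (12, 7_500),
    "annual": (1, 21_000),
}


def annual_tokens(levels: dict[str, tuple[int, int]] = LEVELS) -> int:
    """Total background summarization tokens per user per year."""
    return sum(runs * tokens for runs, tokens in levels.values())


def annual_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost at a blended (input + output) price per million tokens."""
    return tokens / 1_000_000 * usd_per_million


tokens = annual_tokens()
print(tokens)  # 1247100

# Hypothetical blended prices, for illustration only.
for price in (4.0, 8.0):
    per_user = annual_cost_usd(tokens, price)
    print(f"${price}/M tokens: ${per_user:.2f} per user, "
          f"${per_user * 1_000_000 / 1e6:.2f}M per million users")
```

Because the daily level dominates the total (about 73% of tokens), lowering the daily summary frequency or size is the most direct cost lever in this model.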

SECTION 12

Memory Write Mechanisms: From Problem List to Pipeline Architecture
Why Writing Is Harder Than Reading in Memory Systems

RAG discussions have long been biased toward “reading.” If only retrieval is emphasized while writing is neglected, long-term memory degenerates into mere “knowledge base search.”

12.1 Five Core Problems of Memory Writing

Problem	Meaning	Current Status
What is worth remembering?	Which conversations/facts should be persisted	ChatGPT auto-determines; Letta Agent decides autonomously
Who decides what to write?	System / User / Agent	All three modes coexist
How to prevent erroneous writes?	Hallucinations entering long-term memory	No mature verification mechanism exists
How to resolve conflicts?	Multi-version contradictions	Zep temporal graph can track
How to expire?	Outdated information retirement	Titans’ “surprise” metric

Critical Risk: If a RAG system writes model hallucinations into long-term memory, it will not self-correct; instead, the hallucinated content will be retrieved and treated as “fact” in future queries—forming a positive feedback loop of false memories.

12.2 Memory Write Pipeline: A Reference Architecture

① Observe
Capture candidate memories from conversations/documents/event streams

↓

② Extract
Extract fact triples, preference declarations, episodic summaries

↓

③ Salience Score
Casual small talk vs. critical preference? Worth long-term storage?

↓

④ Contradiction Check
Compare against existing memories: new fact? Update? Conflict?

↓

⑤ Provenance Bind
Record source: which conversation, document, timestamp

↓

⑥ Privacy Classify
Tag as PII, health, financial, general

↓

⑦ Memory Type Assign
Assign to the corresponding layer in the eight-layer memory taxonomy

↓

⑧ Index Update
Write to vector index, BM25, knowledge graph, summary layer

↓

⑨ Audit Log
Record: who, when, what was written, based on what evidence

Key design principle: Writing requires far more caution than retrieval. An erroneous retrieval result affects only a single response; an erroneous write contaminates all future retrievals. Steps ③④⑤ constitute the write pipeline’s “quality firewall.”
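The “quality firewall” formed by steps ③④⑤, together with the audit log of step ⑨, can be sketched as a single guarded write method. This is a deliberately reduced illustration: `Candidate`, `MemoryStore`, the 0.5 salience threshold, and the outcome strings are all hypothetical, the salience score is assumed to come from an upstream scorer, and the contradiction check is a plain equality test instead of semantic comparison.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Candidate:
    key: str         # e.g. "employer"
    value: str
    salience: float  # assumed output of an upstream scorer, 0.0 to 1.0
    source: str      # conversation or document id (provenance)


class MemoryStore:
    def __init__(self, salience_threshold: float = 0.5) -> None:
        self.salience_threshold = salience_threshold
        self.memories: dict[str, dict] = {}
        self.audit_log: list[dict] = []

    def write(self, c: Candidate, actor: str) -> str:
        """Gate a write on salience, provenance, and contradiction checks."""
        existing = self.memories.get(c.key)
        if c.salience < self.salience_threshold:
            outcome = "rejected_low_salience"
        elif not c.source:
            outcome = "rejected_no_provenance"
        elif existing is None:
            self.memories[c.key] = {"value": c.value, "source": c.source}
            outcome = "created"
        elif existing["value"] == c.value:
            outcome = "duplicate"
        else:
            outcome = "conflict_needs_resolution"  # never overwrite silently

        # Every attempt is logged, including rejected ones.
        self.audit_log.append({
            "actor": actor,
            "key": c.key,
            "outcome": outcome,
            "source": c.source,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return outcome


store = MemoryStore()
print(store.write(Candidate("employer", "Acme", 0.9, "conv-1"), "agent"))
print(store.write(Candidate("employer", "Globex", 0.9, "conv-2"), "agent"))
print(store.write(Candidate("weather", "rainy", 0.1, "conv-3"), "agent"))
```

The second write is held as a conflict and not applied, which reflects the principle that an erroneous write is more costly than an erroneous retrieval.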

12.3 Intent Inference Layer for Conflict Resolution

Memory conflicts are classified into three types:

Conflict Type	Typical Scenario	Resolution Strategy	Difficulty
Belief Update	“I changed jobs” → overwrite former employer	New replaces old; old tagged as historical version	Medium
Episodic Exception	“I’m cutting sugar” + “treating myself today”	Do not overwrite long-term preference; tag as episodic event	High
Preference Drift	Multiple deviations from old preference over three months	Trigger preference update once cumulative deviation exceeds threshold	Very High

Correct handling of all three conflict types requires world-model-level intent inference—understanding not just “what the user said” but “what the user intended by saying it.” This may be one of the most challenging frontier problems for Memory OS. Currently, no product has fully solved this problem.
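As a rough illustration of the “cumulative deviation exceeds threshold” strategy for preference drift, the sketch below counts recent deviations from a stored preference inside a sliding window. It is a heuristic stand-in, not intent inference: the `Observation` record, the 90-day window, and the count and ratio thresholds are all assumed values, and deciding whether a single behavior deviates at all is left to an upstream component.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Observation:
    day: date
    deviates: bool  # True if the behavior contradicts the stored preference


def should_update_preference(
    observations: list[Observation],
    today: date,
    window_days: int = 90,
    min_deviations: int = 5,
    min_ratio: float = 0.6,
) -> bool:
    """Propose a preference update only after sustained deviation."""
    cutoff = today - timedelta(days=window_days)
    recent = [o for o in observations if o.day >= cutoff]
    deviations = sum(o.deviates for o in recent)
    if not recent or deviations < min_deviations:
        return False  # isolated exceptions stay episodic
    return deviations / len(recent) >= min_ratio


today = date(2026, 5, 18)

# One "treating myself today" event among ten observations.
one_off = [Observation(today - timedelta(days=i), i == 3) for i in range(10)]

# Six deviations out of eight recent observations.
drift = [Observation(today - timedelta(days=i), i % 4 != 0) for i in range(8)]

print(should_update_preference(one_off, today))  # False
print(should_update_preference(drift, today))    # True
```

Requiring both a minimum count and a minimum ratio keeps a single exception from overwriting a long-term preference, while still letting a sustained change surface for review.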

SECTION 13

Security, Privacy, and Deletion Completeness
When Long-Term Memory Becomes a High-Risk Data Gateway

Once long-term memory connects to personal or enterprise data, RAG ceases to be a neutral pipeline and becomes a high-risk data gateway.

Threat	Description	Impact
Prompt Injection	Malicious documents embed instructions that contaminate retrieval results	Model executes unintended operations
Data Poisoning	False information injected into knowledge base	Long-term memory systematically corrupted
Embedding Leakage	Original text reverse-engineered from vector embeddings	Sensitive information exposure
Permission Escalation	User queries access unauthorized documents	Data compliance violations
Residual Inference	Model still infers from residual summaries after deletion	“Deletion” becomes effectively meaningless

Complete deletion requires simultaneous cleanup of: original documents, chunked text, vector embeddings, BM25 indexes, derived summaries, graph edges, cached copies, log backups, and multi-device sync replicas. “Auditable deletion” is the fifth necessary condition for long-term memory.
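The cleanup list above can be turned into a fail-closed audit check. In this sketch the store names mirror that list, while `residual_stores` and the shape of the `stores` snapshot (a mapping from store name to the set of memory ids it still holds) are hypothetical. How each real store reports its contents is outside the scope of the example.

```python
REQUIRED_STORES = (
    "documents", "chunks", "embeddings", "bm25_index", "summaries",
    "graph_edges", "caches", "log_backups", "sync_replicas",
)


def residual_stores(memory_id: str, stores: dict[str, set[str]]) -> list[str]:
    """List stores that still hold the id or were not reported at all.

    A store missing from the snapshot counts as unverified, so the check
    fails closed instead of assuming the data is gone.
    """
    residual = []
    for name in REQUIRED_STORES:
        if name not in stores or memory_id in stores[name]:
            residual.append(name)
    return residual


# Snapshot after a deletion request for "mem-42": the derived summary
# was missed and the backup store did not report.
snapshot = {name: set() for name in REQUIRED_STORES if name != "log_backups"}
snapshot["summaries"].add("mem-42")

print(residual_stores("mem-42", snapshot))  # ['summaries', 'log_backups']
```

A deletion would be recorded as complete in the audit log only when this list is empty.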

13.1 Deployment Architecture: Data Sovereignty Gradient

Deployment Mode	Data Location	Inference Location	Privacy Level	Suitable Scenario
Fully Cloud	Cloud	Cloud	★☆☆☆☆	Low-sensitivity personal assistant
Local Storage + Cloud Embedding	Local	Embedding via cloud	★★☆☆☆	Caution: raw text still transmitted
Local Storage + Local Retrieval + Cloud Inference	Local	LLM via cloud	★★★☆☆	Compromise for most production scenarios
Local De-identification + Cloud Inference	Local (de-identified before transmission)	Cloud	★★★★☆	Enterprise compliance deployment
TEE Confidential Computing	Encrypted transit	Trusted Execution Environment	★★★★☆	Finance / Healthcare
Edge Small Model + Cloud Large Model	Local	Split routing	★★★★☆	Balancing latency + privacy
Enterprise Private Cloud / VPC	Private cloud	Private cloud	★★★★★	Data never leaves domain
Fully Local	Local	Local	★★★★★	Maximum privacy scenarios

The “local storage + cloud inference” mode contains an inherent contradiction: retrieved context must still be transmitted to the cloud API. Mitigation approaches: (A) pre-transmission de-identification, which strips PII and sensitive entities; (B) minimal injection, which sends only the minimum fragment needed for the answer; (C) TEE inference, which processes the context within a Trusted Execution Environment. All three carry trade-offs (de-identification loses semantics, minimization risks omission, TEE adds latency), and no perfect solution currently exists.
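Approach (A) can be illustrated with a small placeholder-substitution sketch. It is intentionally naive: the two regular expressions are assumptions that catch only simple email addresses and phone-like digit runs, `deidentify` is a hypothetical helper, and a production system would more likely rely on a dedicated PII detection component. The mapping stays local so that placeholders in the model's answer can be restored after the response returns.

```python
import re

# Simplistic patterns for illustration; real PII detection needs far more.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def deidentify(text: str) -> tuple[str, dict[str, str]]:
    """Replace matches with placeholders; return the text and a local mapping."""
    mapping: dict[str, str] = {}
    for label, pattern in PATTERNS.items():

        def repl(match: re.Match, label: str = label) -> str:
            token = f"[{label}_{len(mapping) + 1}]"
            mapping[token] = match.group(0)  # kept locally, never transmitted
            return token

        text = pattern.sub(repl, text)
    return text, mapping


redacted, mapping = deidentify("Mail jane@example.com or call +1 415 555 0100.")
print(redacted)  # Mail [EMAIL_1] or call [PHONE_2].
```

The example also shows the semantic loss noted above: once the address is replaced, the cloud model can no longer reason about the domain or the country code.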

SECTION 14

A Long-Term Memory Evaluation Framework
Twelve Dimensions with Scenario-Based Prioritization

The evaluation framework expands from ten to twelve dimensions, adding “Global Synthesis Accuracy” and “Compression Fidelity.”

Dimension	Metric	Meaning	Recommended Target
Recall	Recall@K	Retrieve relevant memories	≥ 0.9
Precision	Precision@K	Retrieve with low noise	≥ 0.7
Faithfulness	Faithfulness	Answers faithful to sources	≥ 0.9
Freshness	Freshness	Prioritize recent information	Temporal decay
Conflict Resolution	Conflict Resolution	Handle contradictory memories	Detectable + annotated
Update Latency	Update Latency	Delay until new memory is retrievable	< 60s
Deletion Completeness	Deletion	Deletion spans entire pipeline	100% auditable
Permission	Permission	Respect access controls	0 violations
Correction Persistence	Correction	User corrections do not regress	0 regressions
Provenance	Provenance	Outputs traceable to sources	≥ 0.95
Global Synthesis ★	Synthesis Accuracy	Cross-temporal/topical trend synthesis accuracy	Pending standardization
Compression Fidelity ★	Compression Fidelity	Semantic consistency between summaries and raw data	≥ 0.85
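The retrieval-side metrics in the table have standard definitions that fit in a few lines. The sketch below uses the common forms of Recall@K, Precision@K, and mean reciprocal rank (MRR, cited in Section 11 as a point-query metric); the function names and the memory ids in the example are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are relevant (k must be positive)."""
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) query pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


retrieved = ["m3", "m1", "m9", "m4", "m7"]
relevant = {"m1", "m4", "m8"}

print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant items found
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(reciprocal_rank(retrieved, relevant))    # 0.5
```

In this example the query would miss the recommended Recall target of 0.9, because one relevant memory (`m8`) never enters the top five.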

14.1 Scenario-Based Evaluation Matrix

Different scenarios exhibit markedly different priorities across the twelve dimensions (● Critical, ○ Important, · Secondary):

Dimension	Personal Assistant	Enterprise Knowledge Base	Legal / Medical	Research Assistant	Coding Assistant
Recall	○	●	●	●	○
Precision	·	○	●	○	●
Faithfulness	○	●	●	●	○
Freshness	●	○	○	○	●
Conflict	○	●	●	○	·
Latency	○	○	·	·	●
Deletion	○	●	●	·	·
Permission	·	●	●	·	○
Correction	●	○	●	○	○
Provenance	·	●	●	●	·
Synthesis	●	○	·	●	·
Compression	○	○	●	○	·

Together, the twelve dimensions constitute the trustworthiness boundary of long-term memory. The scenario-based matrix keeps evaluation from applying one-size-fits-all priorities: personal assistants prioritize freshness, correction persistence, and global synthesis; legal/medical systems treat most dimensions as critical, faithfulness and provenance among them; and coding assistants prioritize precision, freshness, and latency.
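One way to put the matrix to work is to turn its symbols into weights and compute a scenario-specific aggregate score. The sketch below transcribes the matrix as shown; the numeric weights (3, 2, 1) and the idea of a weighted average over per-dimension pass rates are assumptions for illustration, not part of the framework itself.

```python
SCENARIOS = [
    "Personal Assistant",
    "Enterprise Knowledge Base",
    "Legal / Medical",
    "Research Assistant",
    "Coding Assistant",
]

# One string per dimension; character i is the priority for SCENARIOS[i].
MATRIX = {
    "Recall":       "○●●●○",
    "Precision":    "·○●○●",
    "Faithfulness": "○●●●○",
    "Freshness":    "●○○○●",
    "Conflict":     "○●●○·",
    "Latency":      "○○··●",
    "Deletion":     "○●●··",
    "Permission":   "·●●·○",
    "Correction":   "●○●○○",
    "Provenance":   "·●●●·",
    "Synthesis":    "●○·●·",
    "Compression":  "○○●○·",
}

# Assumed numeric weights; the paper defines only the three priority levels.
WEIGHTS = {"●": 3, "○": 2, "·": 1}


def critical_dimensions(scenario: str) -> list[str]:
    """Dimensions marked Critical for the given scenario, in table order."""
    idx = SCENARIOS.index(scenario)
    return [dim for dim, row in MATRIX.items() if row[idx] == "●"]


def weighted_score(scenario: str, pass_rates: dict[str, float]) -> float:
    """Weighted average of per-dimension pass rates (each in [0, 1])."""
    idx = SCENARIOS.index(scenario)
    weights = {dim: WEIGHTS[row[idx]] for dim, row in MATRIX.items()}
    total = sum(weights.values())
    return sum(weights[dim] * pass_rates[dim] for dim in MATRIX) / total


if __name__ == "__main__":
    rates = {dim: 1.0 for dim in MATRIX}
    rates["Freshness"] = 0.0  # the same failure, judged under two scenarios
    for name in ("Personal Assistant", "Coding Assistant"):
        print(name, critical_dimensions(name), round(weighted_score(name, rates), 3))
```

A single aggregate can hide a failure on a critical dimension, so a stricter variant would treat any miss on a ● dimension as a hard fail before averaging the rest.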

SECTION 15

Conclusion: Memory Infrastructure Outlasts Model Generations

Central Thesis: Under current and foreseeable LLM technology stacks, the most mature, most engineerable, and most controllable path for long-term memory oriented toward user or enterprise private knowledge is Generalized RAG—a system paradigm satisfying the four ABCD criteria of “external persistent storage + runtime retrieval + independently addable and deletable + knowledge isolation.” A complete long-term memory system should aspire to the Memory OS level, encompassing a full-cycle closed loop of read, write, compress, delete, and audit.

This conclusion rests on necessary condition analysis: of the five conditions (persistence, updatability, on-demand retrieval, personalized isolation, and auditable deletion), no single technology other than Generalized RAG satisfies all five simultaneously. Through the negative exclusion criteria (Section 02), we have mitigated the risk of “Generalized RAG defined so broadly that it becomes tautological.”

Version	Core Contribution	Responding to Review
V1	Central thesis, retrieval pipeline, Markdown preprocessing lever	—
V2	Boundary definitions (Narrow/Generalized/Memory OS), seven-layer classification, write/delete/security, ten-dimensional evaluation	GPT-5.5
V3	System 1/2 distinction, fragmented recall, latency cost, deployment architecture	Gemini 3.1
V4	Negative definition, evidence grading, Write Pipeline, compression fidelity, compute cost, conflict intent inference, twelve-dimensional scenario-based evaluation	Tri-model joint review

The final judgment remains unchanged: models will be superseded, but memory infrastructure will not be deprecated. Investing in the Memory OS pipeline, from file-to-Markdown conversion and structured annotation to the complete retrieve-write-compress-delete-audit pipeline, is building the most essential long-term cognitive infrastructure for AI.

This paper has evolved from “why RAG is irreplaceable” to “what a complete Memory OS requires.” The core value lies not in proving that “vector retrieval will exist forever,” but in demonstrating that as long as AI needs updatable, deletable, isolatable, and auditable long-term private knowledge, it will require an external persistent knowledge layer. That knowledge layer, together with its complete read-write-compress-delete-audit pipeline, is precisely where Memory OS resides.

APPENDIX A

Future Experimental Directions: Empirical Validation Agenda for Subsequent Work

This paper is a systems framework paper; its argumentation methodology is literature synthesis and necessary condition analysis. The following experimental directions are reserved for future work:

Experiment	Design	Expected Validation
Four-Approach Comparison	Compare pure long context, pure fine-tuning, RAG, and RAG + fine-tuning on the same knowledge base	Differentiated advantages across eight memory types
Compression Fidelity Decay	1/2/3/4 levels of summary compression, measuring semantic consistency	Cumulative distortion rate of multi-level summarization
Conflict Intent Inference	Test sets for Belief Update / Episodic Exception / Preference Drift	Current model accuracy for conflict classification
Point Query vs. Surface Query	Fact lookup vs. trend synthesis, standard RAG vs. multi-scale RAG	Degree to which fragmented recall degrades global synthesis
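For the Compression Fidelity Decay experiment, a simple baseline expectation can be written down before any measurement. The sketch below assumes, purely for illustration, that each summarization level retains a fixed fraction of semantic consistency and that losses compound multiplicatively; real decay may be better or worse, which is exactly what the experiment would test. The per-level value 0.95 is a made-up input, and 0.85 is the Compression Fidelity target from Section 14.

```python
def cumulative_fidelity(per_level: float, levels: int) -> float:
    """Fidelity after `levels` rounds, assuming independent multiplicative loss."""
    return per_level ** levels


def max_levels_within_target(per_level: float, target: float = 0.85, cap: int = 10) -> int:
    """Largest number of levels (up to `cap`) whose cumulative fidelity stays >= target."""
    levels = 0
    while levels < cap and cumulative_fidelity(per_level, levels + 1) >= target:
        levels += 1
    return levels


if __name__ == "__main__":
    per_level = 0.95  # hypothetical per-level fidelity, not a measured value
    for n in range(1, 5):
        print(n, round(cumulative_fidelity(per_level, n), 4))
    print("levels within target:", max_levels_within_target(per_level))
```

Under this assumed model, a per-level fidelity of 0.95 stays above the 0.85 target for three levels and drops below it at the fourth, which suggests why the experiment compares 1/2/3/4 levels rather than a single summarization pass.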

Primary References

[1] Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

[2] Anthropic (2024). Contextual Retrieval. anthropic.com/news/contextual-retrieval

[3] Pinecone (2026). Nexus: The Knowledge Engine for Agents. pinecone.io/blog/knowledge-infrastructure-for-agents

[4] Mangla, B. (2026). MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys. arXiv:2603.23533

[5] Paulsen, N. (2025). The Maximum Effective Context Window for Real World LLMs. OAJAIML.

[6] Zhou, X. et al. (2024). Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs. arXiv:2410.15438

[7] Vectara (2025). Chunking Configuration vs Embedding Model Selection. NAACL 2025. arXiv:2410.13070

[8] BIRD-Interact (2026). Re-imagining Text-to-SQL via Dynamic Interactions. ICLR 2026.

[9] Spider 2.0 (2025). Evaluating LMs on Real-World Enterprise Text-to-SQL. ICLR 2025 Oral.

[10] Tan, J. et al. (2025). HtmlRAG: HTML is Better Than Plain Text for RAG. WWW 2025.

[11] Mem0.ai (2026). State of AI Agent Memory 2026. mem0.ai/blog

[12] Microsoft (2024). MarkItDown: Open-source Document-to-Markdown Converter.

[13] Hooper, C. et al. (2026). KVzip: Query-Dependent KV Cache Compression for Long-Context LLMs. arXiv.

[14] Yu, X. et al. (2026). CacheClip: Robust RAG-Aware KV Cache Pruning. arXiv.

[15] Zhang, J. et al. (2026). TokenMix: Cross-Model Investigation of Lost-in-the-Middle. arXiv.

[16] Borgeaud, S. et al. (2022). Improving Language Models by Retrieving from Trillions of Tokens (RETRO). ICML 2022.

[17] Asai, A. et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique. arXiv:2310.11511

[18] Sun, J. et al. (2024). Think-on-Graph: Deep and Responsible Reasoning with KG. ICLR 2024.

[19] Packer, C. et al. (2024). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560

[20] Zep AI (2025). Graphiti: A Temporal Knowledge Graph for AI Agents. github.com/getzep/graphiti

[21] Behrouz, A. et al. (2024). Titans: Learning to Memorize at Test Time. arXiv:2501.00663

[22] Gao, Y. et al. (2024). RAG for LLMs: A Survey. arXiv:2312.10997

[23] Singh, C. et al. (2025). Rethinking Memory in AI: Taxonomy, Operations, and Benchmarks. arXiv.

[24] Maekawa, S. et al. (2026). Retrieval Helps Generation But Can Be a Double-Edged Sword. Nature Comms.

[25] NVIDIA (2025). Optimizing LLM Serving: MoE Inference Performance Analysis. NVIDIA Technical Blog.

[26] Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.