THOUGHT PAPER · MAY 2026 · V4

RAG and Long-Term Memory for AI

From External Persistent Knowledge Layers to Memory OS: A Systems Engineering Path for LLM Long-Term Memory

From RAG to Memory OS:
A Systems Engineering Path for Persistent AI Memory


Published: May 18, 2026
Category: Original Thought Paper
Domains: RAG · Long-Term Memory · Memory OS · Knowledge Engineering · AI Systems Architecture
Version: V4 (Tri-Model Adversarial Review + Dense Structural Audit)
이조글로벌인공지능연구소
LEECHO Global AI Research Lab
&
Opus 4.6 · GPT 5.5 · Gemini 3.1
인지집단 (Cognitive Collective)


Abstract

Based on publicly available research and industry data as of May 2026, this paper systematically argues a central thesis: Under current and foreseeable LLM technology stacks, the most mature, most engineerable, and most controllable path for long-term memory oriented toward user or enterprise private knowledge is Generalized RAG—a system paradigm combining external persistent knowledge layers, updatable indexes, permission isolation, on-demand retrieval, and context injection. A complete long-term memory system should aspire to the Memory OS level, encompassing a full-cycle closed loop of read, write, compress, delete, and audit. The term RAG in this paper is not limited to traditional vector retrieval; it refers to any system paradigm that injects external persistent knowledge into model inference in a retrievable, updatable, and isolatable form—with exclusion criteria that explicitly delineate the boundaries of this paradigm. Through systematic examination of parametric memory, context windows, fine-tuning, KV Cache persistence, knowledge compilation, knowledge graphs, and other candidate approaches, we demonstrate that they serve more as complements, backend variants, or partial optimizations to the RAG paradigm, rather than complete substitutes. The paper proposes an eight-layer classification model for AI long-term memory and a twelve-dimensional scenario-based evaluation framework, along with a complete memory write pipeline architecture, compression fidelity analysis, conflict intent inference model, and background compute cost estimation.

RAG
Long-Term Memory
Memory OS
Agentic RAG
Knowledge Compilation
Memory Write Pipeline
Compression Fidelity
Context Window
KV Cache
MoE
AI Memory Systems

SECTION 01

Introduction: The Amnesia Problem of Large Language Models
Why LLMs Forget Everything After Every Conversation

Large language models possess the most extensive “factory-installed knowledge” in human history—trillions of tokens of training data compressed into billions to hundreds of billions of parameters. Yet this parametric memory is static. A January 2026 study published in Nature Communications revealed a disquieting fact: LLMs do not store discrete facts but rather assemble fragments from similar sequences, meaning that parametric memory is inherently unreliable for precise recall of specific data.

More critically, parametric memory cannot be updated (except through full retraining), cannot be personalized (all users share the same set of weights), and cannot be deleted (specific information cannot be “forgotten”). This makes large models fundamentally “amnesic” when confronted with any scenario requiring persistent personal knowledge—whether enterprise proprietary data, user preferences, or project histories.

The Retrieval-Augmented Generation (RAG) framework, proposed in 2020 by Patrick Lewis et al. at Facebook AI Research (now Meta AI), was originally designed merely to address the problem of “outdated model knowledge.” Six years later, RAG has far exceeded that original intent and has become the core infrastructure for persistent long-term memory in AI systems. This paper argues this thesis systematically and explicitly delineates both its scope of applicability and the boundaries where it does not apply.


SECTION 02

Defining the Boundaries of RAG: From Narrow to Broad
With Exclusion Criteria to Prevent Tautology

The argumentation in this paper requires first clarifying a conceptual question: when we say “RAG is the core path for long-term memory,” what exactly does RAG mean? If all external knowledge invocations are defined as RAG, then “all long-term memory depends on RAG” approaches tautology. We therefore distinguish three levels:

Level | Definition | Representative Technologies
Narrow RAG | Vector database + document chunking + embedding retrieval + prompt injection | LangChain RAG pipeline, Pinecone + OpenAI embeddings
Generalized RAG | Any system paradigm that injects external persistent knowledge into model inference in a retrievable, updatable, and isolatable form | Graph-RAG, Knowledge Compilation (Nexus), TAG, Agentic RAG, Hybrid Retrieval
Memory OS | Generalized RAG + write pipeline + compression fidelity + deletion governance + permission system + version management + conflict resolution + feedback learning | Letta/MemGPT, Mem0 + ACL, Zep/Graphiti + temporal reasoning

The central thesis of this paper targets Generalized RAG. We do not argue that “vector search is irreplaceable” (Narrow RAG may well be superseded by superior retrieval technologies), but rather that the system paradigm of “external persistent knowledge layer + updatable index + permission isolation + on-demand retrieval + context injection” is irreplaceable.

2.1 Negative Definition: What Is Not Generalized RAG

To prevent conceptual overextension from collapsing into tautology, we explicitly provide exclusion criteria for Generalized RAG. A system that meets any one of the following exclusion conditions does not qualify as Generalized RAG:

Exclusion Condition | Systems Outside Generalized RAG | Reason
No external persistent storage | Pure parametric memory (knowledge embedded in pretrained weights) | Knowledge is frozen in weights and cannot be independently updated or deleted
No runtime retrieval | Pure fine-tuning/LoRA (knowledge absorbed during training) | Knowledge is written into weights during training; no “lookup” action exists at inference time
No persistence | Pure context window injection (in-session prompt assembly) | Vanishes upon session termination, failing the persistence requirement for long-term memory
No queryable index | Pure KV Cache persistence | Preserves intermediate computational states, not queryable or editable knowledge objects
No knowledge isolation | Global agent policy learning without permission boundaries | All users share the same policy space; personalized knowledge isolation is not supported

For a system to qualify as Generalized RAG, it must simultaneously satisfy four criteria: (A) an external persistent knowledge store independent of model weights exists; (B) query-based retrieval actions occur at inference time (rather than full-volume injection); (C) knowledge can be added, modified, or deleted independently of the model; (D) knowledge boundaries can be isolated by user, tenant, or permission level. All four satisfied → Generalized RAG; missing any one → not Generalized RAG, likely a complementary technology.
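The four-criteria test above can be expressed as a small checklist. The following is a minimal sketch, not a formal classifier: `SystemProfile`, its field names, and the two example profiles are hypothetical, and the boolean values simply encode this paper's own reading of each system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical description of a memory system along criteria A-D."""
    name: str
    external_store: bool       # (A) persistent store independent of model weights
    runtime_retrieval: bool    # (B) query-based retrieval at inference time
    independent_updates: bool  # (C) add/modify/delete without touching the model
    isolation: bool            # (D) boundaries by user, tenant, or permission level


def missing_criteria(p: SystemProfile) -> list:
    """Return the letters of the criteria the system fails to satisfy."""
    checks = {
        "A": p.external_store,
        "B": p.runtime_retrieval,
        "C": p.independent_updates,
        "D": p.isolation,
    }
    return [letter for letter, ok in checks.items() if not ok]


def is_generalized_rag(p: SystemProfile) -> bool:
    """All four satisfied -> Generalized RAG; missing any one -> not."""
    return not missing_criteria(p)


# Illustrative profiles (assumed values, following the exclusion table above).
graph_rag = SystemProfile("Graph-RAG with per-tenant ACLs", True, True, True, True)
parametric = SystemProfile("Pure parametric memory", False, False, False, False)
```

Returning the list of failed criteria, rather than a bare boolean, makes it easier to say which complementary role a non-qualifying system plays.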

Through these exclusion criteria, the central thesis of this paper can be stated more precisely: no other single technology paradigm satisfies all four necessary properties (A, B, C, and D) simultaneously; therefore, Generalized RAG possesses structural irreplaceability in the domain of long-term semantic memory. This is not tautological, because we have explicitly identified five categories of systems that fall outside Generalized RAG.

At the same time, we contend that a complete long-term memory system should aspire to the Memory OS level—capable not only of “reading” (retrieval) but also “writing” (memory formation), “compressing” (multi-scale summarization), “deleting” (forgetting and deletion), and “governing” (permissions and audit). Subsequent sections will address each of these dimensions.


SECTION 03

Six Years of RAG Evolution
From Academic Paper to Industry Standard: 2020–2026

RAG’s journey from academic paper to industry standard traversed four distinct phases.

3.1 Foundation Period (2020–2021)

In May 2020, Lewis et al. released the foundational RAG paper (presented at NeurIPS 2020), combining parametric and non-parametric memory to significantly outperform conventional systems on knowledge-intensive QA tasks. In the same year, REALM achieved joint training of retrieval and pretraining, while DPR demonstrated that dense retrieval could exceed BM25 by 9–19 percentage points in top-20 passage retrieval accuracy.

3.2 Scaling and Exploration Period (2022–2023)

DeepMind’s RETRO pushed retrieval augmentation to the trillion-token corpus scale. The LangChain and LlamaIndex ecosystems propelled RAG from academia into engineering practice. Self-RAG and CRAG introduced self-reflection mechanisms. The vector database market experienced explosive growth.

3.3 Rise of Agentic RAG (2024–2025)

RAG was no longer a single-pass pipeline but an iterative loop of “think → retrieve → re-think → re-retrieve → act.” Anthropic launched MCP, later donating it to the Linux Foundation where it became the de facto standard. Multi-modal RAG and Graph-RAG emerged in succession.

3.4 The Post-RAG Paradigm Shift (2026–Present)

In May 2026, Pinecone released Nexus—a “knowledge compiler” that shifts inference from query time to compile time. Microsoft Fabric IQ and Google Knowledge Catalog simultaneously launched similar architectures. This signals that RAG is being absorbed by higher-level “knowledge layer” architectures, yet the core “store → retrieve → inject” paradigm remains unchanged.

“RAG was built for human users. Nexus was built for agent users—because they speak an entirely different language and expect entirely different responses.”

— Ash Ashutosh, Pinecone CEO, VentureBeat, May 2026

SECTION 04

Eight Layers of AI Memory and Where RAG Applies
With System 1/System 2 Boundary Delineation

This paper expands LLM memory into an eight-layer classification, splitting “preference memory” into explicit and implicit preferences to eliminate internal contradictions with the “System 1/System 2” framework:

Memory Type | Definition | Optimal Technical Path | RAG Suitability
Semantic Memory | Stable facts, concepts, knowledge | Generalized RAG (vector/graph/knowledge compilation) | ★★★★★ Core scenario
Episodic Memory | Events, conversations, timelines | RAG + temporal indexing (Zep/Graphiti) | ★★★★★ Core scenario
Explicit Preferences | Articulable preferences: dietary restrictions, time zone, language, tool choices | RAG (user profile key-value pairs/embeddings) | ★★★★★ Core scenario
Implicit Preferences | Hard-to-articulate preferences: aesthetics, style, humor, subtle attitudes | Fine-tuning/LoRA/long-term behavioral learning | ★★☆☆☆ Difficult for RAG
Procedural Memory | Skills, operational workflows, strategies | Fine-tuning/LoRA, workflow templates | ★★☆☆☆ Suboptimal for RAG
Social Memory | Interpersonal relationships, interaction history | Knowledge graph + RAG | ★★★★☆ Graph backend preferred
Working Memory | Current task state | Context window + KV Cache | ★☆☆☆☆ Not a RAG scenario
Reflective Memory | Summaries, retrospectives, self-corrections | Agent memory + RAG | ★★★☆☆ Requires write strategy

“System 1/System 2” Memory Distinction: Borrowing Kahneman’s dual-system framework as an engineering design metaphor (note: the dual-system theory itself is debated within cognitive science; this is an engineering analogy rather than a cognitive science claim), we delineate RAG’s applicability boundary by “articulability.” Memories that can be explicitly expressed in text are suitable for RAG; memories that can only be “felt” require fine-tuning. Splitting “preference memory” into “explicit preferences” (★★★★★) and “implicit preferences” (★★☆☆☆) eliminates the internal contradiction. RAG stores facts and histories; fine-tuning stores capabilities and habits—the two are complementary, not substitutes.

4.1 Five Necessary Conditions for Long-Term Memory

Necessary Condition Generalized RAG Parametric Memory Context Window Fine-Tuning
Persistence ✅ Not precisely updatable ✅ High cost
Updatability ✅ In-session only ⚠️ Requires retraining
On-demand Retrieval ❌ Imprecise ✅ Within window
Personalized Isolation ❌ Global ✅ In-session ⚠️ Requires multiple copies
Auditable Deletion ✅ (Requires full-pipeline design) ✅ (Vanishes on session end) ❌ (Cannot precisely forget)

SECTION 05

Candidate Approach Assessment: Replacement or Complement?
Systematic Elimination of Alternative Paradigms

Each of these technologies has independent value, but none can independently satisfy all five necessary conditions for long-term semantic memory. They are complements to the Generalized RAG paradigm.

5.1 Fine-Tuning / LoRA — The Optimal Path for Capability Memory

Effective for encoding stable skills, styles, implicit preferences, and domain formats, but unsuitable for frequently changing facts. Per the exclusion criteria in Section 02 (no runtime retrieval), pure fine-tuning does not qualify as Generalized RAG.

5.2 Ultra-Long Context — Extension of Working Memory

Can significantly reduce the need for retrieval in certain scenarios, but cannot substitute for data deletion, permission isolation, version control, and auditing. Per the exclusion criteria (no persistence), pure context windows do not qualify as Generalized RAG.

5.3 TTT-E2E — A Supplementary Layer for Broad Understanding

Compresses context into model weights at inference time. The researchers themselves recommend complementary use with RAG.

5.4 KV Cache Persistence — Computational State, Not Knowledge Object

Preserves intermediate computational states, not queryable, editable, or auditable knowledge objects. Per the exclusion criteria (no queryable index), this does not qualify as Generalized RAG.

5.5 Knowledge Graphs — A Structured Memory Backend

Provides explicit entity relationships, interpretable reasoning paths, and conflict detection. It is one of the most powerful structured memory backends for Generalized RAG (satisfying all four ABCD criteria).

5.6 Knowledge Compilation — Internal Evolution Within RAG

Pinecone Nexus and similar systems shift inference to a compile stage. The underlying system still satisfies all four ABCD criteria, representing engineering evolution within Generalized RAG.

Assessment Conclusion: Each of the above technologies has its applicable scenarios, but for long-term semantic memory oriented toward private knowledge, none can independently satisfy all five necessary conditions. They function more as supplements, backend variants, or partial optimizations to the Generalized RAG paradigm, rather than complete replacements.


SECTION 06

Architectural Analysis of Memory Systems in 2026
Evidence-Graded Assessment of Major LLM Providers

Based on publicly observable product behavior and published technical documentation, the memory implementations of major LLM providers exhibit strong alignment with the Generalized RAG / external memory layer paradigm.

Model | Memory Mechanism | Generalized RAG Characteristics | Evidence
ChatGPT (GPT-5) | Persistent user memory + semantic retrieval injection | External persistent storage + runtime retrieval + injection |
Claude | 24-hour conversation memory synthesis, persisted, automatically retrieved and injected | Standard write → store → retrieve → inject pipeline |
Gemini | Personal Intelligence: Gmail/Drive + knowledge graph + cross-modal | Multi-modal retrieval augmentation over user’s real data |
Grok | X (Twitter) history/followers/interaction topics | Retrieval augmentation over social behavioral data |

The classifications above are based on publicly observable product behavior and published technical documentation. Internal implementations may include rule engines, profile stores, policy layers, caching strategies, and other non-RAG components—actual architectures are almost certainly hybrid systems incorporating multiple technologies. These should be understood as architectural inferences rather than fully verified facts. Nevertheless, from observable behavior, the core pattern of “external persistent storage + runtime retrieval + context injection” is confirmable across all major products.

The explosive growth of dedicated memory-layer products (Mem0, Letta/MemGPT, Zep/Graphiti, Hindsight) further confirms: RAG is not legacy technology destined for replacement; it is being absorbed and elevated by higher-level “persistent cognition” architectures.


SECTION 07

Success Rate Reality and the Preprocessing Lever
From 40% Failure to 1.9%: The Impact of Pipeline Design

7.1 The Harsh Reality of Current Success Rates

Benchmark / Scenario | Success Rate | Description | Evidence
Spider 1.0 | 86.6%–91.2% | Clean, small-scale schemas |
BIRD-SQL | 81.95% | Includes noisy data, domain knowledge dependencies |
Spider 2.0 | 6%–21.3% | Real-world enterprise schemas |
BIRD-Interact | 8.67% | Simulated real DBA scenarios |
Naïve RAG Pipeline | ~60% | Without preprocessing optimization |

7.2 Preprocessing: Retrieval Failure Rate from 40% to 1.9%

Anthropic’s contextual retrieval research provides the clearest stratified quantitative data to date:

Raw files + fixed-size chunking: retrieval failure rate ~40%
↓ Format conversion to structured Markdown
Structure-aware chunking: recall rate ~85–90%
↓ Contextual augmentation (section path per chunk)
+ Contextual embeddings: failure rate drops to 3.7% (↓35%, relative to the 5.7% baseline in Anthropic’s own experiments)
↓ Hybrid retrieval (vector + BM25)
+ Hybrid retrieval: failure rate drops to 2.9% (↓49% against the same baseline)
↓ Reranking
Complete pipeline: failure rate drops to 1.9% (↓67% against the same baseline)
↓ MDKeyChunker semantic key annotation
MDKeyChunker + BM25: Recall@5 = 1.000, MRR = 0.911

Key Finding: Vectara’s NAACL 2025 study confirmed that chunking strategy impacts retrieval quality as much as—or more than—embedding model selection.

Scope Limitation: The data above originate from different studies, different datasets, and different task definitions, and therefore cannot be directly chained into a unified causal ladder. Actual effectiveness depends on document type, domain complexity, and query patterns.
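The hybrid retrieval stage in Section 7.2 merges a vector ranking with a BM25 ranking. The source does not say which fusion method the cited studies used; the sketch below uses Reciprocal Rank Fusion (RRF) as one common choice. The function name, the chunk IDs, and the constant `k = 60` (a conventional default) are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each list contributes 1 / (k + rank) for every ID it contains, so an ID
    that ranks well in both the vector and the BM25 list rises to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical top-3 results from the two retrievers.
vector_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

RRF uses only rank positions, so it avoids having to calibrate cosine similarities against BM25 scores. A reranker, the next stage in the ladder, would then reorder the fused list.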


SECTION 08

The Structural Advantage of Markdown and Its Limits
Why Format Matters More Than Model Selection

Dimension | Markdown | HTML | Plain Text
RAG Friendliness | ★★★★★ | ★★★☆☆ | ★★☆☆☆
Best Retrieval Success Rate | Recall@5 = 1.000 | Hit@1 = 68.5 | Baseline
Token Efficiency | Very high | Low → Medium (90–97% of tokens must be stripped) | High but unstructured
Structure Preservation | Native | Requires specialized processing | Completely lost

Microsoft MarkItDown (91K+ Stars), IBM Docling, LlamaParse, and Firecrawl have collectively established a complete “any format → RAG-ready Markdown” pipeline.

Scope of Applicability: Markdown is a strong intermediate format for text-based files in RAG, but for table-dense PDFs, scanned documents/charts, legal contracts/financial reports, and structured database data, multi-modal parsing and metadata preservation are needed as supplements.
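Markdown's native heading structure is what makes structure-aware chunking and per-chunk section paths cheap to implement. The following is a minimal sketch under simplifying assumptions: it handles ATX headings only (`#` to `######`), does not skip `#` lines inside fenced code blocks, and the names `chunk_markdown` and `section_path` are hypothetical rather than taken from any cited tool.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(text):
    """Split Markdown at headings and attach the heading path to each chunk."""
    chunks, path, buf = [], [], []

    def flush():
        body = "\n".join(buf).strip()
        if body:
            chunks.append({"section_path": " > ".join(path), "text": body})
        buf.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                      # close the chunk under the old path
            level = len(match.group(1))
            del path[level - 1:]         # drop headings at this depth or deeper
            path.append(match.group(2))
        else:
            buf.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API."
chunks = chunk_markdown(doc)
```

Prepending `section_path` to the chunk text before embedding is one simple form of the contextual augmentation listed in Section 7.2.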


SECTION 09

Context Window and KV Cache Constraints
The Gap Between Advertised and Effective Context Length

Lost-in-the-Middle Degradation: 10–25% accuracy drop in middle positions across all models (TokenMix, April 2026)
Effective vs. Advertised Context Gap: up to 99% gap between effective and advertised context window on complex tasks (Paulsen 2025)

KV Cache compression also affects RAG quality. Conventional methods are “blind” to queries, risking the deletion of critical evidence. 2026 solutions such as KVzip and CacheClip optimize specifically for RAG scenarios, achieving 3–4× KV reduction and ~2× latency improvement.

The state-of-the-art implementation in 2026 is a tripartite synergy: RAG handles precise retrieval, long-context models handle deep reasoning, and KV Cache optimization handles latency and cost control.


SECTION 10

Dense vs. MoE: Architectural Impact on RAG
Structural Affordances and Engineering Trade-offs

MoE architecture offers unique structural affordances for RAG. Research from Fudan University and Tencent identified three types of core experts in Mixtral:

Core Expert | Function | Significance for RAG
Cognition Expert | Determines whether internal knowledge is sufficient | Avoids unnecessary retrieval
Quality Expert | Evaluates retrieved document quality | Filters low-quality documents
Context Expert | Enhances external knowledge utilization | Better “reads” RAG documents

MoE’s expert routing mechanism provides structural affordances for Adaptive RAG—the router natively supports routing between simple and complex queries. NVIDIA’s analysis notes that MoE’s reduction of time-to-first-token latency is particularly critical in RAG’s multi-call scenarios. However, it must be noted that Dense models can also achieve similar retrieval decision capabilities through external controllers (query classifiers, retrieval gating, confidence estimation, self-reflection prompting). Thus, this represents an engineering advantage for MoE rather than a capability that Dense architectures absolutely lack.

MoE does not save memory (all experts must reside in memory), making Dense small models more suitable for local/edge RAG deployments. The ideal solution is a fusion architecture combining MoE routing + RAG retrieval + Agent tool selection, while maintaining the complementary role of Dense small models in latency-sensitive scenarios.


SECTION 11

Fragmented Recall vs. Global Synthesis: RAG’s Structural Blind Spot
Point Queries, Surface Queries, and the Compression Fidelity Problem

RAG’s essence is “fragmented recall”—it excels at finding the few most query-relevant fragments but cannot synthesize macro-level trends across long time spans.

11.1 Typical Scenarios of Synthesis Failure

When a user asks “What strategic shifts have occurred in my career planning over the past three years?”, RAG retrieves dozens of scattered fragments, but the model struggles to assemble them into a macro-level trend spanning three years. This is not retrieval failure—Recall may be quite high—but rather a structural synthesis disability: RAG’s chunking granularity is inherently unsuited for “global bird’s-eye-view” questions.

11.2 Four Architectural Directions for Overcoming Fragmentation

Architectural Direction | Mechanism | What It Solves | Maturity
Hierarchical Summarization | Day → week → month → year summary pyramid | Multi-granularity “bird’s-eye view” | ⚠️ Experimental
Episodic Compression | Compresses continuous conversations into structured “episode cards” | Narrative coherence | ⚠️ Explored by Letta
Multi-Scale Retrieval | Simultaneously indexes atomic fragments and summary fragments | Fact/trend routing | ✅ Existing practice
Temporally-Aware Graphs | Temporal knowledge graphs tracking entity evolution | “What changed” type questions | ⚠️ Early-stage products

Point Queries vs. Surface Queries: RAG’s success metrics (Recall, MRR, Faithfulness) measure point query capability. But long-term memory equally requires surface query capability—synthesizing macro-level insights across time and topics. Current RAG exhibits structural deficiencies in the latter.

11.3 RAG’s Own Latency Cost

In a typical Agentic RAG chain, each step adds 50–200 ms, and end-to-end time to first token (TTFT) can reach 2–5 seconds.

Scenario | Latency Tolerance | RAG Suitability | Alternative
Knowledge Base Q&A | 3–10s | ★★★★★ |
Document Analysis/Reports | 10–30s | ★★★★★ |
Real-Time Conversational Assistant | <1s | ★★☆☆☆ | Prompt Cache + Long Context
Code Completion | <500ms | ★☆☆☆☆ | Fine-tuning + Local Model
Voice Interaction | <800ms | ★★☆☆☆ | Core Memory Block + Prompt Cache

11.4 Compression Fidelity: The Positive Feedback Risk of Summary Distortion

Hierarchical summarization is a key architecture for overcoming fragmented recall, but introduces a new risk: summary distortion can become permanently solidified as long-term memory.

Distortion Type | Description | Consequence
Detail Loss | Critical numbers, dates, and conditions compressed away | Macro-level judgments lose critical premises
Causal Inversion | Summary reverses cause-and-effect relationships | Trend analysis conclusions are inverted
Over-Generalization | Outlier events are flattened | Inflection-point signals are erased
Value Tampering | Summarizer bias alters the stance of original text | User preferences are silently rewritten
Legacy Summary Contamination | New summaries inherit and amplify errors from old summaries | Positive feedback loop of compression distortion

Every summary must retain a provenance pointer linking back to the original evidence; otherwise, it cannot serve as the sole source for long-term memory. Multi-level summarization systems should incorporate periodic verification mechanisms: recomparing high-level summaries against underlying data to detect cumulative distortion. A summary without traceability is not memory compression—it is information destruction.
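The provenance requirement and the periodic verification step can be illustrated with a small data structure. This is a sketch only: `Summary`, `can_be_sole_source`, and `dropped_numbers` are hypothetical names, and the numeric comparison is a deliberately crude proxy for the “detail loss” row in the table above, not a full fidelity metric.

```python
import re
from dataclasses import dataclass, field

NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


@dataclass
class Summary:
    summary_id: str
    text: str
    source_ids: list = field(default_factory=list)  # provenance pointers


def can_be_sole_source(summary, evidence_store):
    """A summary qualifies only if it cites sources and each one still resolves."""
    return bool(summary.source_ids) and all(
        sid in evidence_store for sid in summary.source_ids
    )


def dropped_numbers(summary, evidence_store):
    """Verification pass: numbers present in the evidence but absent from the summary."""
    evidence = " ".join(
        evidence_store[sid] for sid in summary.source_ids if sid in evidence_store
    )
    kept = set(NUMBER.findall(summary.text))
    return sorted(set(NUMBER.findall(evidence)) - kept)


# Hypothetical evidence and a summary that loses the critical figures.
evidence_store = {"msg-1": "Salary offer was 120000 with a 15% bonus."}
weekly = Summary("wk-12", "Received a salary offer with a bonus.", ["msg-1"])
untraceable = Summary("wk-13", "Decided to accept the offer.")
```

A check like `dropped_numbers` flags summaries for re-generation or human review; detecting causal inversion or value tampering would need a semantic comparison that this sketch does not attempt.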

11.5 Computational Economics of Hierarchical Summarization

The Gemini 3.1 review raised a frequently overlooked question: Who pays for the background compute?

Summary Level | Frequency | Estimated Tokens/Run | Annualized Total
Daily Summary | 365 runs/year | ~2,500 | ~912K Tokens
Weekly Summary | 52 runs/year | ~4,300 | ~224K Tokens
Monthly Summary | 12 runs/year | ~7,500 | ~90K Tokens
Annual Summary | 1 run/year | ~21,000 | ~21K Tokens
Total | | | ~1.25M Tokens/user/year

Estimated at 2026 API pricing (Sonnet-tier model), the annualized per-user background summarization cost is approximately $5–10. For a platform with one million users, annualized expenditure can reach $5M–10M—which explains why only top-tier providers currently offer automatic memory synthesis features.
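The arithmetic behind the token total can be reproduced directly from the table. In the sketch below, the run counts and tokens per run come from the table; the blended price range of $4–8 per million tokens is an assumption back-derived from the $5–10 per-user figure above, not a quoted vendor price.

```python
# (runs per year, estimated tokens per run), taken from the table above.
LEVELS = {
    "daily": (365, 2_500),
    "weekly": (52, 4_300),
    "monthly": (12, 7_500),
    "annual": (1, 21_000),
}


def annual_tokens(levels):
    """Total background summarization tokens per user per year."""
    return sum(runs * tokens for runs, tokens in levels.values())


def annual_cost_usd(tokens, usd_per_million_tokens):
    """Cost at a single blended (input + output) price per million tokens."""
    return tokens / 1_000_000 * usd_per_million_tokens


tokens_per_user = annual_tokens(LEVELS)          # 1,247,100 tokens
low = annual_cost_usd(tokens_per_user, 4.0)      # assumed low blended rate
high = annual_cost_usd(tokens_per_user, 8.0)     # assumed high blended rate
platform_low = low * 1_000_000                   # one million users
platform_high = high * 1_000_000
```

Because output tokens are typically priced well above input tokens, the real figure depends on the input/output split of each summarization run, which the table does not break down.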


SECTION 12

Memory Write Mechanisms: From Problem List to Pipeline Architecture
Why Writing Is Harder Than Reading in Memory Systems

RAG discussions have long been biased toward “reading.” If only retrieval is emphasized while writing is neglected, long-term memory degenerates into mere “knowledge base search.”

12.1 Five Core Problems of Memory Writing

Problem | Meaning | Current Status
What is worth remembering? | Which conversations/facts should be persisted | ChatGPT auto-determines; Letta Agent decides autonomously
Who decides what to write? | System / User / Agent | All three modes coexist
How to prevent erroneous writes? | Hallucinations entering long-term memory | No mature verification mechanism exists
How to resolve conflicts? | Multi-version contradictions | Zep temporal graph can track
How to expire? | Outdated information retirement | Titans’ “surprise” metric

Critical Risk: If a RAG system writes model hallucinations into long-term memory, it will not self-correct; instead, the hallucinated content will be retrieved and treated as “fact” in future queries—forming a positive feedback loop of false memories.

12.2 Memory Write Pipeline: A Reference Architecture

① Observe: Capture candidate memories from conversations/documents/event streams
② Extract: Extract fact triples, preference declarations, episodic summaries
③ Salience Score: Casual small talk vs. critical preference? Worth long-term storage?
④ Contradiction Check: Compare against existing memories: new fact? Update? Conflict?
⑤ Provenance Bind: Record source: which conversation, document, timestamp
⑥ Privacy Classify: Tag as PII, health, financial, general
⑦ Memory Type Assign: Assign to the corresponding layer in the eight-layer memory taxonomy
⑧ Index Update: Write to vector index, BM25, knowledge graph, summary layer
⑨ Audit Log: Record: who, when, what was written, based on what evidence

Key design principle: Writing requires far more caution than retrieval. An erroneous retrieval result affects only a single response; an erroneous write contaminates all future retrievals. Steps ③④⑤ constitute the write pipeline’s “quality firewall.”
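The “quality firewall” formed by steps ③④⑤ can be sketched as a single gate in front of the index update. The sketch makes several assumptions: candidates arrive already extracted and scored (steps ①② and the salience model are out of scope), memories are keyed by a hypothetical `(subject, predicate)` pair, and the 0.5 salience threshold is arbitrary. All names are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Candidate:
    subject: str
    predicate: str
    value: str
    salience: float                  # step 3: assumed to come from an upstream scorer
    source_id: Optional[str] = None  # step 5: provenance pointer
    privacy: str = "general"         # step 6
    memory_type: str = "semantic"    # step 7


@dataclass
class MemoryStore:
    facts: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)


def write_memory(store, cand, actor, min_salience=0.5):
    """Gate a candidate through salience, provenance, and contradiction checks."""
    key = (cand.subject, cand.predicate)
    existing = store.facts.get(key)
    if cand.salience < min_salience:
        decision = "rejected:low_salience"
    elif cand.source_id is None:
        decision = "rejected:no_provenance"
    elif existing is not None and existing.value == cand.value:
        decision = "skipped:duplicate"
    elif existing is not None:
        decision = "held:conflict"   # step 4: defer to conflict resolution
    else:
        store.facts[key] = cand      # step 8: stand-in for the real index update
        decision = "written"
    # Step 9: every decision is logged, including rejections.
    store.audit_log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "who": actor,
        "what": key,
        "value": cand.value,
        "evidence": cand.source_id,
        "decision": decision,
    })
    return decision
```

Holding a conflicting write instead of overwriting keeps the existing memory intact until the conflict has been classified, which follows the principle that an erroneous write is costlier than an erroneous read.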

12.3 Intent Inference Layer for Conflict Resolution

Memory conflicts are classified into three types:

Conflict Type | Typical Scenario | Resolution Strategy | Difficulty
Belief Update | “I changed jobs” → overwrite former employer | New replaces old; old tagged as historical version | Medium
Episodic Exception | “I’m cutting sugar” + “treating myself today” | Do not overwrite long-term preference; tag as episodic event | High
Preference Drift | Multiple deviations from old preference over three months | Trigger preference update once cumulative deviation exceeds threshold | Very High

Correct handling of all three conflict types requires world-model-level intent inference—understanding not just “what the user said” but “what the user intended by saying it.” This may be one of the most challenging frontier problems for Memory OS. Currently, no product has fully solved this problem.
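Intent inference itself is beyond a short example, but the threshold rule for preference drift in the table can be sketched. The code assumes an upstream component has already labeled each observation as deviating or not; the 90-day window, the 0.6 deviation ratio, and the five-sample minimum are hypothetical parameters.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Observation:
    day: date
    deviates: bool  # True when behavior contradicts the stored preference


def should_update_preference(observations, today, window_days=90,
                             threshold=0.6, min_samples=5):
    """Return True once deviations dominate the recent observation window."""
    cutoff = today - timedelta(days=window_days)
    recent = [o for o in observations if o.day >= cutoff]
    if len(recent) < min_samples:
        return False  # too little evidence: treat deviations as episodic exceptions
    return sum(o.deviates for o in recent) / len(recent) >= threshold


today = date(2026, 5, 18)
# A single "treating myself today" event: an episodic exception.
one_off = [Observation(today, True)]
# Repeated deviations over several weeks: candidate preference drift.
sustained = [Observation(today - timedelta(days=7 * i), i != 3) for i in range(6)]
```

The minimum-sample guard is what keeps a single exception from overwriting a long-term preference; choosing the window and ratio per preference type remains an open design question.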


SECTION 13

Security, Privacy, and Deletion Completeness
When Long-Term Memory Becomes a High-Risk Data Gateway

Once long-term memory connects to personal or enterprise data, RAG ceases to be a neutral pipeline and becomes a high-risk data gateway.

Threat | Description | Impact
Prompt Injection | Malicious documents embed instructions that contaminate retrieval results | Model executes unintended operations
Data Poisoning | False information injected into knowledge base | Long-term memory systematically corrupted
Embedding Leakage | Original text reverse-engineered from vector embeddings | Sensitive information exposure
Permission Escalation | User queries access unauthorized documents | Data compliance violations
Residual Inference | Model still infers from residual summaries after deletion | “Deletion” becomes effectively meaningless

Complete deletion requires simultaneous cleanup of: original documents, chunked text, vector embeddings, BM25 indexes, derived summaries, graph edges, cached copies, log backups, and multi-device sync replicas. “Auditable deletion” is the fifth necessary condition for long-term memory.
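One way to make deletion auditable is to track, for every derived artifact, the set of source documents it came from, and to verify after deletion that no store still references the source. The sketch below assumes that lineage tracking already exists; the store names mirror the list above, and the in-memory dictionaries stand in for real backends.

```python
REQUIRED_STORES = (
    "documents", "chunks", "embeddings", "bm25_index", "summaries",
    "graph_edges", "caches", "log_backups", "sync_replicas",
)


def delete_source(stores, source_id):
    """Remove every artifact derived from source_id; return a per-store report.

    `stores` maps store name -> {artifact_id: set of source IDs it derives from}.
    """
    report = {}
    for name in REQUIRED_STORES:
        artifacts = stores.setdefault(name, {})
        doomed = [aid for aid, sources in artifacts.items() if source_id in sources]
        for aid in doomed:
            del artifacts[aid]
        report[name] = doomed
    return report


def residual_references(stores, source_id):
    """Audit check: names of stores that still hold something derived from source_id."""
    return sorted(
        name for name, artifacts in stores.items()
        if any(source_id in sources for sources in artifacts.values())
    )


# Hypothetical state: summary s1 was built from both d1 and d2.
stores = {
    "documents": {"d1": {"d1"}, "d2": {"d2"}},
    "chunks": {"c1": {"d1"}, "c2": {"d1"}, "c3": {"d2"}},
    "embeddings": {"e1": {"d1"}},
    "summaries": {"s1": {"d1", "d2"}},
}
report = delete_source(stores, "d1")
remaining = residual_references(stores, "d1")
```

The sketch deletes a multi-source summary outright, which is the conservative choice against residual inference; a production system would likely regenerate it from the surviving sources.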

13.1 Deployment Architecture: Data Sovereignty Gradient

Deployment Mode | Data Location | Inference Location | Privacy Level | Suitable Scenario
Fully Cloud | Cloud | Cloud | ★☆☆☆☆ | Low-sensitivity personal assistant
Local Storage + Cloud Embedding | Local | Embedding via cloud | ★★☆☆☆ | Caution: raw text still transmitted
Local Storage + Local Retrieval + Cloud Inference | Local | LLM via cloud | ★★★☆☆ | Compromise for most production scenarios
Local De-identification + Cloud Inference | Local (de-identified before transmission) | Cloud | ★★★★☆ | Enterprise compliance deployment
TEE Confidential Computing | Encrypted transit | Trusted Execution Environment | ★★★★☆ | Finance / Healthcare
Edge Small Model + Cloud Large Model | Local | Split routing | ★★★★☆ | Balancing latency + privacy
Enterprise Private Cloud / VPC | Private cloud | Private cloud | ★★★★★ | Data never leaves domain
Fully Local | Local | Local | ★★★★★ | Maximum privacy scenarios

The “local storage + cloud inference” mode contains a physical contradiction—retrieved context must still be transmitted to the cloud API. Mitigation approaches: (A) pre-transmission de-identification—strip PII and sensitive entities; (B) minimal injection—send only the minimum fragment needed for the answer; (C) TEE inference—process within a Trusted Execution Environment. All three carry trade-offs (de-identification loses semantics, minimization risks omission, TEE adds latency), and no perfect solution currently exists.
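Mitigation (A), pre-transmission de-identification, can be sketched with reversible placeholders so that the original values never leave the local machine. This is an illustration only: the two regular expressions are crude assumptions that will both miss real PII and over-match (for example, dates can look like phone numbers), and production systems typically rely on dedicated entity recognition.

```python
import re

# Hypothetical, deliberately simple patterns; not a complete PII detector.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def deidentify(text):
    """Replace matches with placeholders; return redacted text and a local mapping."""
    mapping = {}
    for label, pattern in PATTERNS.items():
        def replace(match, label=label):
            token = f"[{label}_{len(mapping) + 1}]"
            mapping[token] = match.group(0)
            return token
        text = pattern.sub(replace, text)
    return text, mapping


def reidentify(text, mapping):
    """Restore original values in the model's answer, locally."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


redacted, mapping = deidentify("Contact jane.doe@example.com or +1 415 555 0100 today.")
```

Only `redacted` would be sent to the cloud API, while `mapping` stays on the local side. The trade-off noted above still applies: placeholders remove semantics the model might have needed.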


SECTION 14

A Long-Term Memory Evaluation Framework
Twelve Dimensions with Scenario-Based Prioritization

The evaluation framework expands from ten to twelve dimensions, adding “Global Synthesis Accuracy” and “Compression Fidelity.”

Dimension | Metric | Meaning | Recommended Target
Recall | Recall@K | Retrieve relevant memories | ≥ 0.9
Precision | Precision@K | Retrieve with low noise | ≥ 0.7
Faithfulness | Faithfulness | Answers faithful to sources | ≥ 0.9
Freshness | Freshness | Prioritize recent information | Temporal decay
Conflict Resolution | Conflict Resolution | Handle contradictory memories | Detectable + annotated
Update Latency | Update Latency | Delay until new memory is retrievable | < 60s
Deletion Completeness | Deletion | Deletion spans entire pipeline | 100% auditable
Permission | Permission | Respect access controls | 0 violations
Correction Persistence | Correction | User corrections do not regress | 0 regressions
Provenance | Provenance | Outputs traceable to sources | ≥ 0.95
Global Synthesis ★ | Synthesis Accuracy | Cross-temporal/topical trend synthesis accuracy | Pending standardization
Compression Fidelity ★ | Compression Fidelity | Semantic consistency between summaries and raw data | ≥ 0.85
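The first two rows of the table, along with the MRR figure quoted in Section 7.2, have standard definitions that are easy to compute once relevance labels exist. The sketch below uses hypothetical memory IDs and assumes binary relevance judgments; the remaining dimensions (faithfulness, provenance, compression fidelity) need model-based or human scoring and are not covered.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result; 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


# Hypothetical query: three relevant memories, five retrieved.
retrieved = ["m4", "m1", "m9", "m2", "m7"]
relevant = {"m1", "m2", "m3"}
recall = recall_at_k(retrieved, relevant, 5)
precision = precision_at_k(retrieved, relevant, 5)
rr = reciprocal_rank(retrieved, relevant)
```

In this example recall falls short of the 0.9 target because `m3` is never retrieved, while precision is diluted by three irrelevant results; the two targets pull retrieval depth in opposite directions.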

14.1 Scenario-Based Evaluation Matrix

Different scenarios exhibit markedly different priorities across the twelve dimensions (● Critical ○ Important · Secondary):

Dimension Personal Assistant Enterprise Knowledge Base Legal / Medical Research Assistant Coding Assistant
Recall
Precision ·
Faithfulness
Freshness
Conflict ·
Latency · ·
Deletion · ·
Permission · ·
Correction
Provenance · ·
Synthesis · ·
Compression ·

Together, the twelve dimensions constitute the trustworthiness boundary of long-term memory. The scenario-based matrix ensures evaluation does not apply a one-size-fits-all approach—personal assistants prioritize correction persistence and freshness, legal/medical systems prioritize faithfulness and provenance, and coding assistants prioritize latency and precision.
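One way to operationalize scenario-based prioritization is a weighted aggregate over normalized per-dimension scores. The sketch below is illustrative only. The numeric weights for the Critical and Important levels are hypothetical, and only the priorities named in the paragraph above are encoded as Critical; every other dimension defaults to Important, which is a simplifying assumption rather than a reproduction of the full matrix.

```python
# Hypothetical numeric weights for the priority levels.
LEVEL_WEIGHT = {"critical": 3.0, "important": 2.0}

DIMENSIONS = [
    "recall", "precision", "faithfulness", "freshness", "conflict", "latency",
    "deletion", "permission", "correction", "provenance", "synthesis", "compression",
]

# Only the priorities stated in the text are encoded here.
CRITICAL = {
    "personal_assistant": {"correction", "freshness"},
    "legal_medical": {"faithfulness", "provenance"},
    "coding_assistant": {"latency", "precision"},
}


def scenario_score(scores, scenario):
    """Weighted mean of per-dimension scores (each normalized to the range 0..1)."""
    critical = CRITICAL[scenario]
    total = 0.0
    weight_sum = 0.0
    for dim in DIMENSIONS:
        level = "critical" if dim in critical else "important"
        weight = LEVEL_WEIGHT[level]
        total += weight * scores[dim]
        weight_sum += weight
    return total / weight_sum


# A hypothetical system that is solid everywhere except provenance.
scores = {dim: 0.8 for dim in DIMENSIONS}
scores["provenance"] = 0.4

legal = scenario_score(scores, "legal_medical")
coding = scenario_score(scores, "coding_assistant")
```

The same system scores lower under the legal/medical profile than under the coding-assistant profile, because weak provenance is weighted more heavily there. Hard requirements such as "0 violations" for permissions are probably better enforced as pass/fail gates than folded into a weighted average.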


SECTION 15

Conclusion
Memory Infrastructure Outlasts Model Generations

Central Thesis: Under current and foreseeable LLM technology stacks, the most mature, most engineerable, and most controllable path for long-term memory oriented toward user or enterprise private knowledge is Generalized RAG—a system paradigm satisfying the four ABCD criteria of “external persistent storage + runtime retrieval + independently addable and deletable + knowledge isolation.” A complete long-term memory system should aspire to the Memory OS level, encompassing a full-cycle closed loop of read, write, compress, delete, and audit.

This conclusion rests on necessary condition analysis: among the five conditions of persistence, updatability, on-demand retrieval, personalized isolation, and auditable deletion, no other single technology can satisfy all simultaneously. Through the negative exclusion criteria (Section 02), we have mitigated the risk of “Generalized RAG defined so broadly that it becomes tautological.”

Version | Core Contribution | Responding to Review
V1 | Central thesis, retrieval pipeline, Markdown preprocessing lever |
V2 | Boundary definitions (Narrow/Generalized/Memory OS), seven-layer classification, write/delete/security, ten-dimensional evaluation | GPT-5.5
V3 | System 1/2 distinction, fragmented recall, latency cost, deployment architecture | Gemini 3.1
V4 | Negative definition, evidence grading, Write Pipeline, compression fidelity, compute cost, conflict intent inference, twelve-dimensional scenario-based evaluation | Tri-model joint review

The final judgment remains unchanged: Models will be superseded, but memory infrastructure will not be deprecated. Investing in the Memory OS pipeline—from file Markdown conversion and structured annotation to the complete retrieve-write-compress-delete-audit pipeline—is building the most essential long-term cognitive infrastructure for AI.

This paper has evolved from “why RAG is irreplaceable” to “what a complete Memory OS requires.” The core value lies not in proving that “vector retrieval will exist forever,” but in demonstrating that: as long as AI needs updatable, deletable, isolatable, and auditable long-term private knowledge, it will inevitably require an external persistent knowledge layer—and this knowledge layer, together with its complete read-write-compress-delete-audit pipeline, is precisely where Memory OS resides.


APPENDIX A

Future Experimental Directions
Empirical Validation Agenda for Subsequent Work

This paper is a systems framework paper; its argumentation methodology is literature synthesis and necessary condition analysis. The following experimental directions are reserved for future work:

Experiment | Design | Expected Validation
Four-Approach Comparison | Compare pure long context, pure fine-tuning, RAG, and RAG + fine-tuning on the same knowledge base | Differentiated advantages across eight memory types
Compression Fidelity Decay | 1/2/3/4 levels of summary compression, measuring semantic consistency | Cumulative distortion rate of multi-level summarization
Conflict Intent Inference | Test sets for Belief Update / Episodic Exception / Preference Drift | Current model accuracy for conflict classification
Point Query vs. Surface Query | Fact lookup vs. trend synthesis, standard RAG vs. multi-scale RAG | Degree of global synthesis disability from fragmented recall
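For the Compression Fidelity Decay experiment, a simple baseline to test against is multiplicative decay: if each summarization level independently preserves a fraction f of the semantics, then n levels preserve f to the power n. That independence assumption is introduced here for illustration and is exactly what the experiment would need to confirm or refute. The function names are hypothetical, and the 0.85 default is the Compression Fidelity target from Section 14.

```python
def cumulative_fidelity(per_level_fidelity, levels):
    """Fidelity remaining after `levels` rounds, assuming independent multiplicative loss."""
    return per_level_fidelity ** levels


def max_levels_within_target(per_level_fidelity, target=0.85, max_levels=10):
    """Largest number of summarization levels that keeps cumulative fidelity at or above target."""
    levels = 0
    while (levels < max_levels
           and cumulative_fidelity(per_level_fidelity, levels + 1) >= target):
        levels += 1
    return levels


# Hypothetical per-level fidelity values, evaluated at the 1/2/3/4 levels in the design.
for fidelity in (0.95, 0.90):
    curve = [round(cumulative_fidelity(fidelity, n), 4) for n in range(1, 5)]
    print(fidelity, curve, max_levels_within_target(fidelity))
```

Under this baseline, a per-level fidelity of 0.95 stays above the 0.85 target for three levels (about 0.857) and drops below it at the fourth (about 0.815), while 0.90 per level tolerates only a single level. Measured curves that decay faster than this baseline would point to compounding distortion across levels.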

Primary References

[1] Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

[2] Anthropic (2024). Contextual Retrieval. anthropic.com/news/contextual-retrieval

[3] Pinecone (2026). Nexus: The Knowledge Engine for Agents. pinecone.io/blog/knowledge-infrastructure-for-agents

[4] Mangla, B. (2026). MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys. arXiv:2603.23533

[5] Paulsen, N. (2025). The Maximum Effective Context Window for Real World LLMs. OAJAIML.

[6] Zhou, X. et al. (2024). Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs. arXiv:2410.15438

[7] Vectara (2025). Chunking Configuration vs Embedding Model Selection. NAACL 2025. arXiv:2410.13070

[8] BIRD-Interact (2026). Re-imagining Text-to-SQL via Dynamic Interactions. ICLR 2026.

[9] Spider 2.0 (2025). Evaluating LMs on Real-World Enterprise Text-to-SQL. ICLR 2025 Oral.

[10] Tan, J. et al. (2025). HtmlRAG: HTML is Better Than Plain Text for RAG. WWW 2025.

[11] Mem0.ai (2026). State of AI Agent Memory 2026. mem0.ai/blog

[12] Microsoft (2024). MarkItDown: Open-source Document-to-Markdown Converter.

[13] Hooper, C. et al. (2026). KVzip: Query-Dependent KV Cache Compression for Long-Context LLMs. arXiv.

[14] Yu, X. et al. (2026). CacheClip: Robust RAG-Aware KV Cache Pruning. arXiv.

[15] Zhang, J. et al. (2026). TokenMix: Cross-Model Investigation of Lost-in-the-Middle. arXiv.

[16] Borgeaud, S. et al. (2022). Improving Language Models by Retrieving from Trillions of Tokens (RETRO). ICML 2022.

[17] Asai, A. et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique. arXiv:2310.11511

[18] Sun, J. et al. (2024). Think-on-Graph: Deep and Responsible Reasoning with KG. ICLR 2024.

[19] Packer, C. et al. (2024). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560

[20] Zep AI (2025). Graphiti: A Temporal Knowledge Graph for AI Agents. github.com/getzep/graphiti

[21] Behrouz, A. et al. (2024). Titans: Learning to Memorize at Test Time. arXiv:2501.00663

[22] Gao, Y. et al. (2024). RAG for LLMs: A Survey. arXiv:2312.10997

[23] Singh, C. et al. (2025). Rethinking Memory in AI: Taxonomy, Operations, and Benchmarks. arXiv.

[24] Maekawa, S. et al. (2026). Retrieval Helps Generation But Can Be a Double-Edged Sword. Nature Comms.

[25] NVIDIA (2025). Optimizing LLM Serving: MoE Inference Performance Analysis. NVIDIA Technical Blog.

[26] Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.

Note: This paper is an Original Thought Paper that has not undergone human peer review. It originates from intensive adversarial real-time dialogue between a human researcher and AI systems, systematically arguing the structural irreplaceability of Generalized RAG in the domain of AI long-term memory, and constructing a complete systems engineering framework from retrieval pipelines to Memory OS.

Original Contributions
ABCD four-criterion definition and negative exclusion criteria for Generalized RAG · Eight-layer classification model for AI long-term memory (with explicit/implicit preference split) · Five necessary conditions analysis framework for long-term memory · Nine-step reference architecture for the Memory Write Pipeline · Compression fidelity analysis and distortion positive-feedback risk model · Conflict resolution intent inference layer (Belief Update / Episodic Exception / Preference Drift) · Computational economics estimation for hierarchical summarization · Twelve-dimensional scenario-based evaluation framework · Data sovereignty gradient table · Structural blind spot analysis of point queries vs. surface queries

Version History
V1 (May 18, 2026): Initial version, co-created by LEECHO and Opus 4.6; established central thesis and retrieval pipeline framework.
V2 (May 18, 2026): Based on GPT 5.5 review—added boundary definitions (Narrow/Generalized/Memory OS), seven-layer classification, write/delete/security, ten-dimensional evaluation.
V3 (May 18, 2026): Based on Gemini 3.1 review—added System 1/2 distinction, fragmented recall, latency cost, deployment architecture.
V4 (May 18, 2026): Based on tri-model joint review + Opus 4.6 Dense structural audit—negative definition, evidence grading, Write Pipeline, compression fidelity, compute cost, conflict intent inference, twelve-dimensional scenario-based evaluation.

인지집단 (Cognitive Collective)
이조글로벌인공지능연구소 — Research leadership, thesis formulation, revision principle decisions
Anthropic Claude Opus 4.6 — Paper drafting, framework construction, Dense structural audit, version upgrade execution
OpenAI GPT 5.5 — V2 review (boundary definitions · classification upgrade · evaluation framework) · V4 joint review
Google Gemini 3.1 — V3 review (fragmented recall · latency cost · deployment architecture) · V4 joint review
