RAG and Long-Term Memory for AI
From External Persistent Knowledge Layers to Memory OS: A Systems Engineering Path for LLM Long-Term Memory
From RAG to Memory OS: A Systems Engineering Path for Persistent AI Memory
Based on publicly available research and industry data as of May 2026, this paper systematically argues a central thesis: Under current and foreseeable LLM technology stacks, the most mature, most engineerable, and most controllable path for long-term memory oriented toward user or enterprise private knowledge is Generalized RAG—a system paradigm combining external persistent knowledge layers, updatable indexes, permission isolation, on-demand retrieval, and context injection. A complete long-term memory system should aspire to the Memory OS level, encompassing a full-cycle closed loop of read, write, compress, delete, and audit. The term RAG in this paper is not limited to traditional vector retrieval; it refers to any system paradigm that injects external persistent knowledge into model inference in a retrievable, updatable, and isolatable form—with exclusion criteria that explicitly delineate the boundaries of this paradigm. Through systematic examination of parametric memory, context windows, fine-tuning, KV Cache persistence, knowledge compilation, knowledge graphs, and other candidate approaches, we demonstrate that they serve more as complements, backend variants, or partial optimizations to the RAG paradigm, rather than complete substitutes. The paper proposes an eight-layer classification model for AI long-term memory and a twelve-dimensional scenario-based evaluation framework, along with a complete memory write pipeline architecture, compression fidelity analysis, conflict intent inference model, and background compute cost estimation.
Keywords: Long-Term Memory; Memory OS; Agentic RAG; Knowledge Compilation; Memory Write Pipeline; Compression Fidelity; Context Window; KV Cache; MoE; AI Memory Systems
Introduction: The Amnesia Problem of Large Language Models
Why LLMs Forget Everything After Every Conversation
Large language models possess the most extensive “factory-installed knowledge” in human history—trillions of tokens of training data compressed into billions to hundreds of billions of parameters. Yet this parametric memory is static. A January 2026 study published in Nature Communications revealed a disquieting fact: LLMs do not store discrete facts but rather assemble fragments from similar sequences, meaning that parametric memory is inherently unreliable for precise recall of specific data.
More critically, parametric memory cannot be updated (except through full retraining), cannot be personalized (all users share the same set of weights), and cannot be deleted (specific information cannot be “forgotten”). This makes large models fundamentally “amnesic” when confronted with any scenario requiring persistent personal knowledge—whether enterprise proprietary data, user preferences, or project histories.
The Retrieval-Augmented Generation (RAG) framework, proposed in 2020 by Patrick Lewis et al. at Facebook AI Research (now Meta AI) and collaborators, was originally designed merely to address the problem of “outdated model knowledge.” Six years later, RAG has far exceeded its original design intent, becoming the core infrastructure for persistent long-term memory in AI systems. This paper systematically argues this thesis and explicitly delineates both its scope of applicability and the boundaries beyond which it does not apply.
Defining the Boundaries of RAG: From Narrow to Broad
With Exclusion Criteria to Prevent Tautology
The argumentation in this paper requires first clarifying a conceptual question: when we say “RAG is the core path for long-term memory,” what exactly does RAG mean? If all external knowledge invocations are defined as RAG, then “all long-term memory depends on RAG” approaches tautology. We therefore distinguish three levels:
| Level | Definition | Representative Technologies |
|---|---|---|
| Narrow RAG | Vector database + document chunking + embedding retrieval + prompt injection | LangChain RAG pipeline, Pinecone + OpenAI embeddings |
| Generalized RAG | Any system paradigm that injects external persistent knowledge into model inference in a retrievable, updatable, and isolatable form | Graph-RAG, Knowledge Compilation (Nexus), TAG, Agentic RAG, Hybrid Retrieval |
| Memory OS | Generalized RAG + write pipeline + compression fidelity + deletion governance + permission system + version management + conflict resolution + feedback learning | Letta/MemGPT, Mem0 + ACL, Zep/Graphiti + temporal reasoning |
The central thesis of this paper targets Generalized RAG. We do not argue that “vector search is irreplaceable” (Narrow RAG may well be superseded by superior retrieval technologies), but rather that the system paradigm of “external persistent knowledge layer + updatable index + permission isolation + on-demand retrieval + context injection” is irreplaceable.
2.1 Negative Definition: What Is Not Generalized RAG
To prevent conceptual overextension from collapsing into tautology, we explicitly provide exclusion criteria for Generalized RAG. A system that meets any one of the following exclusion conditions does not qualify as Generalized RAG:
| Exclusion Condition | Systems Outside Generalized RAG | Reason |
|---|---|---|
| No external persistent storage | Pure parametric memory (knowledge embedded in pretrained weights) | Knowledge is frozen in weights and cannot be independently updated or deleted |
| No runtime retrieval | Pure fine-tuning/LoRA (knowledge absorbed during training) | Knowledge is written into weights during training; no “lookup” action exists at inference time |
| No persistence | Pure context window injection (in-session prompt assembly) | Vanishes upon session termination, failing the persistence requirement for long-term memory |
| No queryable index | Pure KV Cache persistence | Preserves intermediate computational states, not queryable or editable knowledge objects |
| No knowledge isolation | Global agent policy learning without permission boundaries | All users share the same policy space; personalized knowledge isolation is not supported |
For a system to qualify as Generalized RAG, it must simultaneously satisfy four criteria: (A) an external persistent knowledge store independent of model weights exists; (B) query-based retrieval actions occur at inference time (rather than full-volume injection); (C) knowledge can be added, modified, or deleted independently of the model; (D) knowledge boundaries can be isolated by user, tenant, or permission level. All four satisfied → Generalized RAG; missing any one → not Generalized RAG, likely a complementary technology.
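The four-criteria test can be written as a simple predicate. The sketch below is illustrative only: the `SystemProfile` type, its field names, and the example profiles are hypothetical, and the boolean values are one reading of the exclusion table rather than measurements of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical capability profile; each flag maps to one of criteria A-D."""
    name: str
    external_persistent_store: bool  # (A) store independent of model weights
    runtime_retrieval: bool          # (B) query-based retrieval at inference time
    independent_update: bool         # (C) add/modify/delete without touching the model
    isolation: bool                  # (D) boundaries by user, tenant, or permission


def missing_criteria(profile: SystemProfile) -> list[str]:
    """Return the letters of the criteria the system fails to satisfy."""
    checks = {
        "A": profile.external_persistent_store,
        "B": profile.runtime_retrieval,
        "C": profile.independent_update,
        "D": profile.isolation,
    }
    return [letter for letter, ok in checks.items() if not ok]


def is_generalized_rag(profile: SystemProfile) -> bool:
    """All four criteria must hold; missing any one disqualifies the system."""
    return not missing_criteria(profile)


# Illustrative profiles (assumed values, chosen to mirror the exclusion table).
graph_rag = SystemProfile("Graph-RAG", True, True, True, True)
kv_cache = SystemProfile("Pure KV Cache persistence", True, False, False, True)

print(is_generalized_rag(graph_rag), missing_criteria(kv_cache))
```

Returning the list of missing criteria, rather than a bare boolean, makes it easy to say why a candidate is classified as a complementary technology.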
Through these exclusion criteria, the central thesis of this paper can be stated more precisely: Within the intersection of the four necessary properties A+B+C+D, no other single technology paradigm can satisfy all simultaneously; therefore, Generalized RAG possesses structural irreplaceability in the domain of long-term semantic memory. This is not tautological—because we have explicitly identified five categories of systems that fall outside Generalized RAG.
At the same time, we contend that a complete long-term memory system should aspire to the Memory OS level—capable not only of “reading” (retrieval) but also “writing” (memory formation), “compressing” (multi-scale summarization), “deleting” (forgetting and deletion), and “governing” (permissions and audit). Subsequent sections will address each of these dimensions.
Six Years of RAG Evolution
From Academic Paper to Industry Standard: 2020–2026
RAG’s journey from academic paper to industry standard traversed four distinct phases.
3.1 Foundation Period (2020–2021)
In May 2020, Lewis et al. released the foundational paper (presented at NeurIPS 2020), combining parametric and non-parametric memory to significantly outperform conventional systems on knowledge-intensive QA tasks. Concurrently, REALM jointly trained the retriever with language-model pretraining, while DPR demonstrated that dense semantic retrieval could exceed BM25 by up to 19 percentage points in top-20 passage retrieval accuracy.
3.2 Scaling and Exploration Period (2022–2023)
DeepMind’s RETRO pushed retrieval augmentation to the trillion-token corpus scale. The LangChain and LlamaIndex ecosystems propelled RAG from academia into engineering practice. Self-RAG and CRAG introduced self-reflection mechanisms. The vector database market experienced explosive growth.
3.3 Rise of Agentic RAG (2024–2025)
RAG was no longer a single-pass pipeline but an iterative loop of “think → retrieve → re-think → re-retrieve → act.” Anthropic launched MCP, later donating it to the Linux Foundation where it became the de facto standard. Multi-modal RAG and Graph-RAG emerged in succession.
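The iterative loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any framework's API: `agentic_rag`, `plan`, `retrieve`, and `act` are hypothetical names, the planner is a hand-written rule set standing in for an LLM, and the retriever is a toy keyword index.

```python
from typing import Callable, Optional


def agentic_rag(
    question: str,
    plan: Callable[[str, list[str]], Optional[str]],
    retrieve: Callable[[str], list[str]],
    act: Callable[[str, list[str]], str],
    max_rounds: int = 4,
) -> str:
    """Iterate think -> retrieve until the planner is satisfied, then act."""
    evidence: list[str] = []
    for _ in range(max_rounds):
        query = plan(question, evidence)   # think / re-think
        if query is None:                  # planner judges the evidence sufficient
            break
        evidence.extend(retrieve(query))   # retrieve / re-retrieve
    return act(question, evidence)         # act on the gathered evidence


# Toy stand-ins (hypothetical data and rules, for illustration only).
INDEX = {
    "alice": ["Alice manages the billing service"],
    "billing": ["The billing service is deployed in eu-west"],
}


def toy_plan(question: str, evidence: list[str]) -> Optional[str]:
    if not evidence:
        return "alice"                     # first hop: who is involved?
    if not any("deployed" in e for e in evidence):
        return "billing"                   # second hop: follow the entity found
    return None                            # enough evidence to answer


def toy_retrieve(query: str) -> list[str]:
    return INDEX.get(query, [])


def toy_act(question: str, evidence: list[str]) -> str:
    return evidence[-1] if evidence else "no evidence found"


result = agentic_rag("Where is the service Alice manages deployed?",
                     toy_plan, toy_retrieve, toy_act)
print(result)
```

The `max_rounds` cap matters in practice: without it, a planner that never declares the evidence sufficient would loop indefinitely.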
3.4 The Post-RAG Paradigm Shift (2026–Present)
In May 2026, Pinecone released Nexus—a “knowledge compiler” that shifts inference from query time to compile time. Microsoft Fabric IQ and Google Knowledge Catalog simultaneously launched similar architectures. This signals that RAG is being absorbed by higher-level “knowledge layer” architectures, yet the core “store → retrieve → inject” paradigm remains unchanged.
“RAG was built for human users. Nexus was built for agent users—because they speak an entirely different language and expect entirely different responses.”
Eight Layers of AI Memory and Where RAG Applies
With System 1/System 2 Boundary Delineation
This paper expands LLM memory into an eight-layer classification, splitting “preference memory” into explicit and implicit preferences to eliminate internal contradictions with the “System 1/System 2” framework:
| Memory Type | Definition | Optimal Technical Path | RAG Suitability |
|---|---|---|---|
| Semantic Memory | Stable facts, concepts, knowledge | Generalized RAG (vector/graph/knowledge compilation) | ★★★★★ Core scenario |
| Episodic Memory | Events, conversations, timelines | RAG + temporal indexing (Zep/Graphiti) | ★★★★★ Core scenario |
| Explicit Preferences | Articulable preferences: dietary restrictions, time zone, language, tool choices | RAG (user profile key-value pairs/embeddings) | ★★★★★ Core scenario |
| Implicit Preferences | Hard-to-articulate preferences: aesthetics, style, humor, subtle attitudes | Fine-tuning/LoRA/long-term behavioral learning | ★★☆☆☆ Difficult for RAG |
| Procedural Memory | Skills, operational workflows, strategies | Fine-tuning/LoRA, workflow templates | ★★☆☆☆ Suboptimal for RAG |
| Social Memory | Interpersonal relationships, interaction history | Knowledge graph + RAG | ★★★★☆ Graph backend preferred |
| Working Memory | Current task state | Context window + KV Cache | ★☆☆☆☆ Not a RAG scenario |
| Reflective Memory | Summaries, retrospectives, self-corrections | Agent memory + RAG | ★★★☆☆ Requires write strategy |
“System 1/System 2” Memory Distinction: Borrowing Kahneman’s dual-system framework as an engineering design metaphor (note: the dual-system theory itself is debated within cognitive science; this is an engineering analogy rather than a cognitive science claim), we delineate RAG’s applicability boundary by “articulability.” Memories that can be explicitly expressed in text are suitable for RAG; memories that can only be “felt” require fine-tuning. Splitting “preference memory” into “explicit preferences” (★★★★★) and “implicit preferences” (★★☆☆☆) eliminates the internal contradiction. RAG stores facts and histories; fine-tuning stores capabilities and habits—the two are complementary, not substitutes.
4.1 Five Necessary Conditions for Long-Term Memory
| Necessary Condition | Generalized RAG | Parametric Memory | Context Window | Fine-Tuning |
|---|---|---|---|---|
| Persistence | ✅ | ✅ Not precisely updatable | ❌ | ✅ High cost |
| Updatability | ✅ | ❌ | ✅ In-session only | ⚠️ Requires retraining |
| On-demand Retrieval | ✅ | ❌ Imprecise | ✅ Within window | ❌ |
| Personalized Isolation | ✅ | ❌ Global | ✅ In-session | ⚠️ Requires multiple copies |
| Auditable Deletion | ✅ (Requires full-pipeline design) | ❌ | ✅ (Vanishes on session end) | ❌ (Cannot precisely forget) |
Candidate Approach Assessment: Replacement or Complement?
Systematic Elimination of Alternative Paradigms
Each of these technologies has independent value, but none can independently satisfy all five necessary conditions for long-term semantic memory. They are complements to the Generalized RAG paradigm.
5.1 Fine-Tuning / LoRA — The Optimal Path for Capability Memory
Effective for encoding stable skills, styles, implicit preferences, and domain formats, but unsuitable for frequently changing facts. Per the exclusion criteria in Section 02 (no runtime retrieval), pure fine-tuning does not qualify as Generalized RAG.
5.2 Ultra-Long Context — Extension of Working Memory
Can significantly reduce the need for retrieval in certain scenarios, but cannot substitute for data deletion, permission isolation, version control, and auditing. Per the exclusion criteria (no persistence), pure context windows do not qualify as Generalized RAG.
5.3 TTT-E2E — A Supplementary Layer for Broad Understanding
Compresses context into model weights at inference time. The researchers themselves recommend complementary use with RAG.
5.4 KV Cache Persistence — Computational State, Not Knowledge Object
Preserves intermediate computational states, not queryable, editable, or auditable knowledge objects. Per the exclusion criteria (no queryable index), this does not qualify as Generalized RAG.
5.5 Knowledge Graphs — A Structured Memory Backend
Provides explicit entity relationships, interpretable reasoning paths, and conflict detection. It is one of the most powerful structured memory backends for Generalized RAG (satisfying all four ABCD criteria).
5.6 Knowledge Compilation — Internal Evolution Within RAG
Pinecone Nexus and similar systems shift inference to a compile stage. The underlying system still satisfies all four ABCD criteria, representing engineering evolution within Generalized RAG.
Assessment Conclusion: Each of the above technologies has its applicable scenarios, but for long-term semantic memory oriented toward private knowledge, none can independently satisfy all five necessary conditions. They function more as supplements, backend variants, or partial optimizations to the Generalized RAG paradigm, rather than complete replacements.
Architectural Analysis of Memory Systems in 2026
Evidence-Graded Assessment of Major LLM Providers
Based on publicly observable product behavior and published technical documentation, the memory implementations of major LLM providers exhibit strong alignment with the Generalized RAG / external memory layer paradigm.
| Model | Memory Mechanism | Generalized RAG Characteristics | Evidence |
|---|---|---|---|
| ChatGPT (GPT-5) | Persistent user memory + semantic retrieval injection | External persistent storage + runtime retrieval + injection | |
| Claude | 24-hour conversation memory synthesis, persisted, automatically retrieved and injected | Standard write → store → retrieve → inject pipeline | |
| Gemini | Personal Intelligence: Gmail/Drive + knowledge graph + cross-modal | Multi-modal retrieval augmentation over user’s real data | |
| Grok | X (Twitter) history/followers/interaction topics | Retrieval augmentation over social behavioral data | |
The classifications above are based on publicly observable product behavior and published technical documentation. Internal implementations may include rule engines, profile stores, policy layers, caching strategies, and other non-RAG components—actual architectures are almost certainly hybrid systems incorporating multiple technologies. These should be understood as architectural inferences rather than fully verified facts. Nevertheless, from observable behavior, the core pattern of “external persistent storage + runtime retrieval + context injection” is confirmable across all major products.
The explosive growth of dedicated memory-layer products (Mem0, Letta/MemGPT, Zep/Graphiti, Hindsight) further confirms: RAG is not legacy technology destined for replacement; it is being absorbed and elevated by higher-level “persistent cognition” architectures.
Success Rate Reality and the Preprocessing Lever
From 40% Failure to 1.9%: The Impact of Pipeline Design
7.1 The Harsh Reality of Current Success Rates
| Benchmark / Scenario | Success Rate | Description | Evidence |
|---|---|---|---|
| Spider 1.0 | 86.6%–91.2% | Clean, small-scale schemas | |
| BIRD-SQL | 81.95% | Includes noisy data, domain knowledge dependencies | |
| Spider 2.0 | 6%–21.3% | Real-world enterprise schemas | |
| BIRD-Interact | 8.67% | Simulated real DBA scenarios | |
| Naïve RAG Pipeline | ~60% | Without preprocessing optimization | |
7.2 Preprocessing: Retrieval Failure Rate from 40% to 1.9%
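Anthropic’s contextual retrieval research provides the clearest stratified quantitative data to date. The figures reported along this ladder are listed below. Anthropic’s percentage reductions are measured against its own baseline top-20 retrieval failure rate of 5.7%, not against the ~40% naïve-pipeline figure.
| Reported Result | Stage / Source |
|---|---|
| Retrieval failure rate ~40% | Naïve RAG pipeline without preprocessing (Section 7.1) |
| Recall rate ~85–90% | |
| Failure rate drops to 3.7% (↓35%) | Anthropic: contextual embeddings |
| Failure rate drops to 2.9% (↓49%) | Anthropic: contextual embeddings + contextual BM25 |
| Failure rate drops to 1.9% (↓67%) | Anthropic: contextual embeddings + contextual BM25 + reranking |
| Recall@5 = 1.000, MRR = 0.911 | Markdown-structured input (see the Markdown comparison table) |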
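The percentage reductions attached to the 3.7%, 2.9%, and 1.9% failure rates can be checked with a little arithmetic. The sketch below assumes the 5.7% baseline top-20 retrieval failure rate from Anthropic's published Contextual Retrieval report; the function name is hypothetical.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which the failure rate fell relative to the baseline."""
    return (baseline - improved) / baseline


# Assumption: 5.7% is the baseline failure rate against which the
# reductions quoted in the text are measured.
BASELINE_FAILURE_PCT = 5.7

reductions = {
    rate: round(relative_reduction(BASELINE_FAILURE_PCT, rate) * 100)
    for rate in (3.7, 2.9, 1.9)
}
print(reductions)
```

Under that assumption the computed reductions match the quoted 35%, 49%, and 67%, which confirms that they are relative to a 5.7% baseline and not to the ~40% failure rate of a naïve pipeline.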
Key Finding: Vectara’s NAACL 2025 study confirmed that chunking strategy impacts retrieval quality as much as—or more than—embedding model selection.
Scope Limitation: The data above originate from different studies, different datasets, and different task definitions, and therefore cannot be directly chained into a unified causal ladder. Actual effectiveness depends on document type, domain complexity, and query patterns.
The Structural Advantage of Markdown and Its Limits
Why Format Matters More Than Model Selection
| Dimension | Markdown | HTML | Plain Text |
|---|---|---|---|
| RAG Friendliness | ★★★★★ | ★★★☆☆ | ★★☆☆☆ |
| Best Retrieval Success Rate | Recall@5 = 1.000 | Hit@1 = 68.5 | Baseline |
| Token Efficiency | Very high | Low → Medium (90–97% of tokens must be stripped) | High but unstructured |
| Structure Preservation | Native | Requires specialized processing | Completely lost |
Microsoft MarkItDown (91K+ Stars), IBM Docling, LlamaParse, and Firecrawl have collectively established a complete “any format → RAG-ready Markdown” pipeline.
Scope of Applicability: Markdown is a strong intermediate format for text-based files in RAG, but for table-dense PDFs, scanned documents/charts, legal contracts/financial reports, and structured database data, multi-modal parsing and metadata preservation are needed as supplements.
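One reason Markdown preserves structure "natively" is that headings give a chunker explicit boundaries and a breadcrumb for each chunk. The following is a minimal sketch of heading-aware chunking, not the method of any tool named above; `chunk_markdown` and the `heading_path` field are hypothetical names, and the sketch deliberately ignores fenced code blocks, tables, and size limits.

```python
import re

# ATX headings only ("#" through "######"); Setext headings are not handled.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str) -> list[dict]:
    """Split Markdown at headings, tagging each chunk with its heading path."""
    chunks: list[dict] = []
    path: list[str] = []   # current heading trail, e.g. ["Guide", "Install"]
    body: list[str] = []   # lines collected under the current heading

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({"heading_path": " > ".join(path), "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                          # close the chunk under the old heading
            level = len(match.group(1))
            del path[level - 1:]             # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API."
chunks = chunk_markdown(doc)
for chunk in chunks:
    print(chunk["heading_path"], "|", chunk["text"])
```

Storing the heading path alongside each chunk lets the retriever index "where" a passage sits in the document, which plain-text chunking cannot recover.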
Context Window and KV Cache Constraints
The Gap Between Advertised and Effective Context Length
KV Cache compression also affects RAG quality. Conventional methods are “blind” to queries, risking the deletion of critical evidence. 2026 solutions such as KVzip and CacheClip optimize specifically for RAG scenarios, achieving 3–4× KV reduction and ~2× latency improvement.
The state-of-the-art implementation in 2026 is a tripartite synergy: RAG handles precise retrieval, long-context models handle deep reasoning, and KV Cache optimization handles latency and cost control.
Dense vs. MoE: Architectural Impact on RAG
Structural Affordances and Engineering Trade-offs
MoE architecture offers unique structural affordances for RAG. Research from Fudan University and Tencent identified three types of core experts in Mixtral:
| Core Expert | Function | Significance for RAG |
|---|---|---|
| Cognition Expert | Determines whether internal knowledge is sufficient | Avoids unnecessary retrieval |
| Quality Expert | Evaluates retrieved document quality | Filters low-quality documents |
| Context Expert | Enhances external knowledge utilization | Better “reads” RAG documents |
MoE’s expert routing mechanism provides structural affordances for Adaptive RAG—the router natively supports routing between simple and complex queries. NVIDIA’s analysis notes that MoE’s reduction of time-to-first-token latency is particularly critical in RAG’s multi-call scenarios. However, it must be noted that Dense models can also achieve similar retrieval decision capabilities through external controllers (query classifiers, retrieval gating, confidence estimation, self-reflection prompting). Thus, this represents an engineering advantage for MoE rather than a capability that Dense architectures absolutely lack.
MoE does not save memory (all experts must reside in memory), making Dense small models more suitable for local/edge RAG deployments. The ideal solution is a fusion architecture combining MoE routing + RAG retrieval + Agent tool selection, while maintaining the complementary role of Dense small models in latency-sensitive scenarios.
Fragmented Recall vs. Global Synthesis: RAG’s Structural Blind Spot
Point Queries, Surface Queries, and the Compression Fidelity Problem
RAG’s essence is “fragmented recall”—it excels at finding the few most query-relevant fragments but cannot synthesize macro-level trends across long time spans.
11.1 Typical Scenarios of Synthesis Failure
When a user asks “What strategic shifts have occurred in my career planning over the past three years?”, RAG retrieves dozens of scattered fragments, but the model struggles to assemble them into a macro-level trend spanning three years. This is not a retrieval failure (Recall may be quite high) but a structural limitation in synthesis: RAG’s chunking granularity is inherently unsuited to “global bird’s-eye-view” questions.
11.2 Four Architectural Directions for Overcoming Fragmentation
| Architectural Direction | Mechanism | What It Solves | Maturity |
|---|---|---|---|
| Hierarchical Summarization | Day → week → month → year summary pyramid | Multi-granularity “bird’s-eye view” | ⚠️ Experimental |
| Episodic Compression | Compresses continuous conversations into structured “episode cards” | Narrative coherence | ⚠️ Explored by Letta |
| Multi-Scale Retrieval | Simultaneously indexes atomic fragments and summary fragments | Fact/trend routing | ✅ Existing practice |
| Temporally-Aware Graphs | Temporal knowledge graphs tracking entity evolution | “What changed” type questions | ⚠️ Early-stage products |
Point Queries vs. Surface Queries: RAG’s success metrics (Recall, MRR, Faithfulness) measure point query capability. But long-term memory equally requires surface query capability—synthesizing macro-level insights across time and topics. Current RAG exhibits structural deficiencies in the latter.
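Multi-scale retrieval, the one direction the table marks as existing practice, amounts to routing each query to the index whose granularity fits it. The sketch below is a toy illustration under stated assumptions: the cue-word classifier, the word-overlap scorer, and all names and sample data are hypothetical stand-ins for a learned router and a real retriever.

```python
# Hypothetical cue phrases suggesting a "surface" (trend/synthesis) query.
TREND_CUES = ("trend", "over the past", "changed", "shift", "evolution", "overall")


def classify_query(query: str) -> str:
    """Label a query 'surface' if it asks about trends, else 'point'."""
    q = query.lower()
    return "surface" if any(cue in q for cue in TREND_CUES) else "point"


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def multi_scale_retrieve(query: str, atomic: list[str],
                         summaries: list[str], k: int = 2) -> list[str]:
    """Search summary fragments for surface queries, atomic ones otherwise."""
    pool = summaries if classify_query(query) == "surface" else atomic
    return sorted(pool, key=lambda t: overlap(query, t), reverse=True)[:k]


atomic_index = [
    "2024-03-02: applied to a data engineering role",
    "2025-06-10: enrolled in a management course",
]
summary_index = [
    "2024 summary: career focus on data engineering",
    "2025 summary: career focus moved toward management",
]

trend_q = "What strategic shifts have occurred in my career planning over the past three years?"
point_q = "When did I apply to the data engineering role?"
print(multi_scale_retrieve(trend_q, atomic_index, summary_index))
print(multi_scale_retrieve(point_q, atomic_index, summary_index, k=1))
```

A production router would likely query both scales and merge results; the hard routing here only makes the point/surface distinction visible.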
11.3 RAG’s Own Latency Cost
In a typical Agentic RAG invocation chain, each step incurs 50–200 ms, and end-to-end time to first token (TTFT) can reach 2–5 seconds.
| Scenario | Latency Tolerance | RAG Suitability | Alternative |
|---|---|---|---|
| Knowledge Base Q&A | 3–10s | ★★★★★ | — |
| Document Analysis/Reports | 10–30s | ★★★★★ | — |
| Real-Time Conversational Assistant | <1s | ★★☆☆☆ | Prompt Cache + Long Context |
| Code Completion | <500ms | ★☆☆☆☆ | Fine-tuning + Local Model |
| Voice Interaction | <800ms | ★★☆☆☆ | Core Memory Block + Prompt Cache |
11.4 Compression Fidelity: The Positive Feedback Risk of Summary Distortion
Hierarchical summarization is a key architecture for overcoming fragmented recall, but introduces a new risk: summary distortion can become permanently solidified as long-term memory.
| Distortion Type | Description | Consequence |
|---|---|---|
| Detail Loss | Critical numbers, dates, and conditions compressed away | Macro-level judgments lose critical premises |
| Causal Inversion | Summary reverses cause-and-effect relationships | Trend analysis conclusions are inverted |
| Over-Generalization | Outlier events are flattened | Inflection-point signals are erased |
| Value Tampering | Summarizer bias alters the stance of original text | User preferences are silently rewritten |
| Legacy Summary Contamination | New summaries inherit and amplify errors from old summaries | Positive feedback loop of compression distortion |
Every summary must retain a provenance pointer linking back to the original evidence; otherwise, it cannot serve as the sole source for long-term memory. Multi-level summarization systems should incorporate periodic verification mechanisms: recomparing high-level summaries against underlying data to detect cumulative distortion. A summary without traceability is not memory compression—it is information destruction.
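A provenance pointer and a periodic verification pass can be sketched as follows. This is a deliberately crude illustration, assuming summaries carry the IDs of their source records: the only check performed is that every number in a summary also appears in its sources, which targets the "Detail Loss" row but would not catch causal inversion or stance changes. All names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    text: str
    source_ids: tuple[str, ...]  # provenance pointers to the raw records


def numbers(text: str) -> set[str]:
    """Extract numeric literals such as '1200' or '3.5'."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def verify_summary(summary: Summary, store: dict[str, str]) -> list[str]:
    """Return a list of issues; an empty list means the checks passed."""
    issues: list[str] = []
    if not summary.source_ids:
        issues.append("no provenance pointer")
    issues += [f"missing source {s}" for s in summary.source_ids if s not in store]
    source_text = " ".join(store[s] for s in summary.source_ids if s in store)
    unsupported = numbers(summary.text) - numbers(source_text)
    issues += [f"unsupported number {n}" for n in sorted(unsupported)]
    return issues


raw_store = {
    "m1": "Budget approved at 1200 USD on day 14",
    "m2": "Deadline moved to day 30",
}
good = Summary("Budget of 1200 USD approved; deadline moved to day 30", ("m1", "m2"))
bad = Summary("Budget of 1500 USD approved", ("m1",))
orphan = Summary("User prefers tea", ())

print(verify_summary(good, raw_store))
print(verify_summary(bad, raw_store))
print(verify_summary(orphan, raw_store))
```

Running the same check on higher-level summaries against the raw records, and not only against the level directly below, is what guards against legacy summary contamination.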
11.5 Computational Economics of Hierarchical Summarization
The Gemini 3.1 review raised a frequently overlooked question: Who pays for the background compute?
| Summary Level | Frequency | Estimated Tokens/Run | Annualized Total |
|---|---|---|---|
| Daily Summary | 365 runs/year | ~2,500 | ~912K Tokens |
| Weekly Summary | 52 runs/year | ~4,300 | ~224K Tokens |
| Monthly Summary | 12 runs/year | ~7,500 | ~90K Tokens |
| Annual Summary | 1 run/year | ~21,000 | ~21K Tokens |
| Total | | | ~1.25M Tokens/user/year |
Estimated at 2026 API pricing (Sonnet-tier model), the annualized per-user background summarization cost is approximately $5–10. For a platform with one million users, annualized expenditure can reach $5M–10M—which explains why only top-tier providers currently offer automatic memory synthesis features.
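The token budget and cost range can be reproduced directly from the table. In the sketch below, the run counts and tokens per run come from the table; the blended price of 6 USD per million tokens is a hypothetical placeholder chosen for illustration, not a quoted price (the stated $5–10 range implies a blended rate of roughly 4–8 USD per million tokens).

```python
# (runs per year, estimated tokens per run), taken from the table above.
SCHEDULE = {
    "daily": (365, 2_500),
    "weekly": (52, 4_300),
    "monthly": (12, 7_500),
    "annual": (1, 21_000),
}


def annual_tokens(schedule: dict[str, tuple[int, int]]) -> int:
    """Total background summarization tokens per user per year."""
    return sum(runs * tokens for runs, tokens in schedule.values())


def annual_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost at a single blended price per million tokens."""
    return tokens / 1_000_000 * usd_per_million


ASSUMED_BLENDED_PRICE = 6.0  # hypothetical USD per million tokens

tokens = annual_tokens(SCHEDULE)
per_user = annual_cost_usd(tokens, ASSUMED_BLENDED_PRICE)
platform = per_user * 1_000_000  # one million users

print(f"{tokens:,} tokens/user/year")
print(f"${per_user:.2f} per user, ${platform / 1e6:.1f}M per million users")
```

Because daily summaries account for about three quarters of the total, lowering their frequency or length is the most direct lever on background cost.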
Memory Write Mechanisms: From Problem List to Pipeline Architecture
Why Writing Is Harder Than Reading in Memory Systems
RAG discussions have long been biased toward “reading.” If only retrieval is emphasized while writing is neglected, long-term memory degenerates into mere “knowledge base search.”
12.1 Five Core Problems of Memory Writing
| Problem | Meaning | Current Status |
|---|---|---|
| What is worth remembering? | Which conversations/facts should be persisted | ChatGPT auto-determines; Letta Agent decides autonomously |
| Who decides what to write? | System / User / Agent | All three modes coexist |
| How to prevent erroneous writes? | Hallucinations entering long-term memory | No mature verification mechanism exists |
| How to resolve conflicts? | Multi-version contradictions | Zep temporal graph can track |
| How to expire? | Outdated information retirement | Titans’ “surprise” metric |
Critical Risk: If a RAG system writes model hallucinations into long-term memory, it will not self-correct; instead, the hallucinated content will be retrieved and treated as “fact” in future queries—forming a positive feedback loop of false memories.
12.2 Memory Write Pipeline: A Reference Architecture
① Capture candidate memories from conversations/documents/event streams
② Extract fact triples, preference declarations, episodic summaries
③ Casual small talk vs. critical preference? Worth long-term storage?
④ Compare against existing memories: new fact? Update? Conflict?
⑤ Record source: which conversation, document, timestamp
⑥ Tag as PII, health, financial, general
⑦ Assign to the corresponding layer in the eight-layer memory taxonomy
⑧ Write to vector index, BM25, knowledge graph, summary layer
⑨ Record: who, when, what was written, based on what evidence
Key design principle: Writing requires far more caution than retrieval. An erroneous retrieval result affects only a single response; an erroneous write contaminates all future retrievals. Steps ③④⑤ constitute the write pipeline’s “quality firewall.”
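The write pipeline can be sketched end to end in a few dozen lines. This is a toy under stated assumptions, not a reference implementation: all names are hypothetical, extraction is reduced to whitespace stripping, importance and sensitivity use keyword heuristics, conflict detection covers exact duplicates only, and a single list stands in for the vector, BM25, graph, and summary indexes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Candidate:
    text: str
    source: str                    # provenance: conversation or document ID
    importance: float = 0.0
    sensitivity: str = "general"
    memory_type: str = "semantic"


@dataclass
class MemoryStore:
    records: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)


def score_importance(text: str) -> float:
    """Toy heuristic: durable-sounding statements score high."""
    cues = ("prefer", "always", "never", "allergic")
    return 1.0 if any(c in text.lower() for c in cues) else 0.1


def classify_sensitivity(text: str) -> str:
    return "health" if "allergic" in text.lower() else "general"


def write_memory(store: MemoryStore, text: str, source: str,
                 actor: str, threshold: float = 0.5) -> str:
    cand = Candidate(text=text.strip(), source=source)       # steps 1-2 (extraction elided)
    cand.importance = score_importance(cand.text)            # step 3: importance gate
    if cand.importance < threshold:
        return "skipped: low importance"
    if any(r.text == cand.text for r in store.records):      # step 4: duplicates only
        return "skipped: duplicate"
    if not cand.source:                                      # step 5: provenance required
        return "rejected: no provenance"
    cand.sensitivity = classify_sensitivity(cand.text)       # step 6: sensitivity tag
    if "prefer" in cand.text.lower():                        # step 7: memory-type routing
        cand.memory_type = "explicit_preference"
    store.records.append(cand)                               # step 8: index write (one list)
    store.audit_log.append({                                 # step 9: audit record
        "actor": actor,
        "at": datetime.now(timezone.utc).isoformat(),
        "text": cand.text,
        "evidence": cand.source,
    })
    return "written"


store = MemoryStore()
outcomes = [
    write_memory(store, "I am allergic to peanuts", "conv-42", "agent"),
    write_memory(store, "Nice weather today", "conv-42", "agent"),
    write_memory(store, "I am allergic to peanuts", "conv-43", "agent"),
    write_memory(store, "I prefer dark mode", "", "agent"),
]
print(outcomes)
```

The three early returns correspond to the quality firewall: a candidate must pass the importance, conflict, and provenance gates before anything is persisted.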
12.3 Intent Inference Layer for Conflict Resolution
Memory conflicts are classified into three types:
| Conflict Type | Typical Scenario | Resolution Strategy | Difficulty |
|---|---|---|---|
| Belief Update | “I changed jobs” → overwrite former employer | New replaces old; old tagged as historical version | Medium |
| Episodic Exception | “I’m cutting sugar” + “treating myself today” | Do not overwrite long-term preference; tag as episodic event | High |
| Preference Drift | Multiple deviations from old preference over three months | Trigger preference update once cumulative deviation exceeds threshold | Very High |
Correct handling of all three conflict types requires world-model-level intent inference—understanding not just “what the user said” but “what the user intended by saying it.” This may be one of the most challenging frontier problems for Memory OS. Currently, no product has fully solved this problem.
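The decision structure behind the three conflict types can be made concrete, even though the hard part remains unsolved. The sketch below assumes that an upstream component has already judged whether the user stated a lasting change; that judgment is exactly the intent inference the text describes, and here it is simply a boolean input. The names and the threshold of three deviations are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    contradicts_memory: bool      # does the new statement conflict with a stored memory?
    states_lasting_change: bool   # assumed output of intent inference ("I changed jobs")


def resolve(obs: Observation, prior_deviations: int,
            drift_threshold: int = 3) -> str:
    """Map a conflicting observation to one of the three conflict types."""
    if not obs.contradicts_memory:
        return "no_conflict"
    if obs.states_lasting_change:
        # New replaces old; the old value is kept as a historical version.
        return "belief_update"
    if prior_deviations + 1 >= drift_threshold:
        # Cumulative deviation has crossed the threshold: update the preference.
        return "preference_drift"
    # One-off deviation: keep the long-term preference, log an episodic event.
    return "episodic_exception"


print(resolve(Observation(True, True), prior_deviations=0))
print(resolve(Observation(True, False), prior_deviations=0))
print(resolve(Observation(True, False), prior_deviations=2))
```

The rule-based branches are trivial; the difficulty the section points to lies in producing `states_lasting_change` reliably and in choosing a drift threshold that fits each preference.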
Security, Privacy, and Deletion Completeness
When Long-Term Memory Becomes a High-Risk Data Gateway
Once long-term memory connects to personal or enterprise data, RAG ceases to be a neutral pipeline and becomes a high-risk data gateway.
| Threat | Description | Impact |
|---|---|---|
| Prompt Injection | Malicious documents embed instructions that contaminate retrieval results | Model executes unintended operations |
| Data Poisoning | False information injected into knowledge base | Long-term memory systematically corrupted |
| Embedding Leakage | Original text reverse-engineered from vector embeddings | Sensitive information exposure |
| Permission Escalation | User queries access unauthorized documents | Data compliance violations |
| Residual Inference | Model still infers from residual summaries after deletion | “Deletion” becomes effectively meaningless |
Complete deletion requires simultaneous cleanup of: original documents, chunked text, vector embeddings, BM25 indexes, derived summaries, graph edges, cached copies, log backups, and multi-device sync replicas. “Auditable deletion” is the fifth necessary condition for long-term memory.
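Deletion across the whole pipeline can be modeled as a fan-out followed by verification. The sketch below is a simplification under stated assumptions: each store is an in-memory dict keyed by a single memory ID, whereas real derived artifacts (summaries, graph edges) would need a lineage lookup to find everything produced from a record. Store names follow the list above; function names are hypothetical.

```python
# One entry per location named in the text.
STORES = (
    "documents", "chunks", "embeddings", "bm25_index", "summaries",
    "graph_edges", "caches", "log_backups", "sync_replicas",
)


def delete_everywhere(memory_id: str, stores: dict[str, dict]) -> dict[str, str]:
    """Delete from every store, then verify absence; return an audit report."""
    report: dict[str, str] = {}
    for name in STORES:
        store = stores.get(name)
        if store is None:
            report[name] = "store_unreachable"   # cannot prove deletion here
            continue
        store.pop(memory_id, None)
        report[name] = "absent" if memory_id not in store else "failed"
    return report


def deletion_complete(report: dict[str, str]) -> bool:
    """Auditable deletion requires verified absence in every store."""
    return all(status == "absent" for status in report.values())


all_stores = {name: {"m1": "payload", "m2": "other"} for name in STORES}
full_report = delete_everywhere("m1", all_stores)

partial_stores = {name: {"m1": "payload"} for name in STORES if name != "caches"}
partial_report = delete_everywhere("m1", partial_stores)

print(deletion_complete(full_report), deletion_complete(partial_report))
```

Treating an unreachable store as a failure, and not as a success, is the design choice that makes the report usable as audit evidence.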
13.1 Deployment Architecture: Data Sovereignty Gradient
| Deployment Mode | Data Location | Inference Location | Privacy Level | Suitable Scenario |
|---|---|---|---|---|
| Fully Cloud | Cloud | Cloud | ★☆☆☆☆ | Low-sensitivity personal assistant |
| Local Storage + Cloud Embedding | Local | Embedding via cloud | ★★☆☆☆ | Caution: raw text still transmitted |
| Local Storage + Local Retrieval + Cloud Inference | Local | LLM via cloud | ★★★☆☆ | Compromise for most production scenarios |
| Local De-identification + Cloud Inference | Local (de-identified before transmission) | Cloud | ★★★★☆ | Enterprise compliance deployment |
| TEE Confidential Computing | Encrypted transit | Trusted Execution Environment | ★★★★☆ | Finance / Healthcare |
| Edge Small Model + Cloud Large Model | Local | Split routing | ★★★★☆ | Balancing latency + privacy |
| Enterprise Private Cloud / VPC | Private cloud | Private cloud | ★★★★★ | Data never leaves domain |
| Fully Local | Local | Local | ★★★★★ | Maximum privacy scenarios |
The “local storage + cloud inference” mode contains an inherent contradiction: retrieved context must still be transmitted to the cloud API. Mitigation approaches: (A) pre-transmission de-identification, which strips PII and sensitive entities; (B) minimal injection, which sends only the minimum fragment needed for the answer; (C) TEE inference, which processes the context within a Trusted Execution Environment. All three carry trade-offs (de-identification loses semantics, minimization risks omission, TEE adds latency), and no perfect solution currently exists.
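Mitigation (A) can be illustrated with a small placeholder-substitution sketch. It is intentionally naive, assuming only two regex patterns (email addresses and phone-like digit runs); real de-identification needs entity recognition, and the example shows the limitation directly because the person's name passes through untouched. All names are hypothetical.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def deidentify(text: str) -> tuple[str, dict[str, str]]:
    """Replace matches with placeholders; keep the mapping on the local side."""
    mapping: dict[str, str] = {}

    def substitute(pattern: re.Pattern, label: str, s: str) -> str:
        def repl(match: re.Match) -> str:
            token = f"[{label}_{len(mapping) + 1}]"
            mapping[token] = match.group(0)
            return token
        return pattern.sub(repl, s)

    text = substitute(EMAIL, "EMAIL", text)
    text = substitute(PHONE, "PHONE", text)
    return text, mapping


def reidentify(text: str, mapping: dict[str, str]) -> str:
    """Restore original values in the model's response, locally."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


context = "Contact Ana at ana@example.com or +1 415 555 0100 about the renewal."
safe_context, local_mapping = deidentify(context)
print(safe_context)   # this is what would be sent to the cloud API
print(reidentify(safe_context, local_mapping) == context)
```

Only `safe_context` leaves the device; the mapping stays local so that placeholders in the model's answer can be restored before display.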
A Long-Term Memory Evaluation Framework
Twelve Dimensions with Scenario-Based Prioritization
The evaluation framework expands from ten to twelve dimensions, adding “Global Synthesis Accuracy” and “Compression Fidelity.”
| Dimension | Metric | Meaning | Recommended Target |
|---|---|---|---|
| Recall | Recall@K | Retrieve relevant memories | ≥ 0.9 |
| Precision | Precision@K | Retrieve with low noise | ≥ 0.7 |
| Faithfulness | Faithfulness | Answers faithful to sources | ≥ 0.9 |
| Freshness | Freshness | Prioritize recent information | Temporal decay |
| Conflict Resolution | Conflict Resolution | Handle contradictory memories | Detectable + annotated |
| Update Latency | Update Latency | Delay until new memory is retrievable | < 60s |
| Deletion Completeness | Deletion | Deletion spans entire pipeline | 100% auditable |
| Permission | Permission | Respect access controls | 0 violations |
| Correction Persistence | Correction | User corrections do not regress | 0 regressions |
| Provenance | Provenance | Outputs traceable to sources | ≥ 0.95 |
| Global Synthesis ★ | Synthesis Accuracy | Cross-temporal/topical trend synthesis accuracy | Pending standardization |
| Compression Fidelity ★ | Compression Fidelity | Semantic consistency between summaries and raw data | ≥ 0.85 |
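The first two rows of the table, together with MRR as cited earlier in the paper, have standard definitions that are easy to compute. The sketch below uses the conventional formulas; the function names and the sample data are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are relevant (denominator is k)."""
    return sum(1 for item in retrieved[:k] if item in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over a set of (retrieved, relevant) queries."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Illustrative data: two queries with ranked results and ground-truth sets.
runs = [
    (["a", "x", "b", "y", "c"], {"a", "b", "z"}),
    (["x", "b"], {"b"}),
]

print(recall_at_k(*runs[0], k=5))      # 2 of 3 relevant items found
print(precision_at_k(*runs[0], k=5))   # 2 of 5 results relevant
print(mean_reciprocal_rank(runs))      # (1/1 + 1/2) / 2
```

Against the recommended targets, the first query would fail Recall@5 ≥ 0.9 and Precision@5 ≥ 0.7, which illustrates why both metrics must be tracked together.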
14.1 Scenario-Based Evaluation Matrix
Different scenarios exhibit markedly different priorities across the twelve dimensions (● Critical ○ Important · Secondary):
| Dimension | Personal Assistant | Enterprise Knowledge Base | Legal / Medical | Research Assistant | Coding Assistant |
|---|---|---|---|---|---|
| Recall | ○ | ● | ● | ● | ○ |
| Precision | · | ○ | ● | ○ | ● |
| Faithfulness | ○ | ● | ● | ● | ○ |
| Freshness | ● | ○ | ○ | ○ | ● |
| Conflict | ○ | ● | ● | ○ | · |
| Latency | ○ | ○ | · | · | ● |
| Deletion | ○ | ● | ● | · | · |
| Permission | · | ● | ● | · | ○ |
| Correction | ● | ○ | ● | ○ | ○ |
| Provenance | · | ● | ● | ● | · |
| Synthesis | ● | ○ | · | ● | · |
| Compression | ○ | ○ | ● | ○ | · |
Together, the twelve dimensions constitute the trustworthiness boundary of long-term memory. The scenario-based matrix ensures evaluation does not apply a one-size-fits-all approach—personal assistants prioritize correction persistence and freshness, legal/medical systems prioritize faithfulness and provenance, and coding assistants prioritize latency and precision.
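One way to put the matrix to work is to turn its priority marks into weights and compute a per-scenario score. The sketch below does this under loud assumptions: the numeric weights (3, 2, 1), the requirement that each dimension score is already normalized to [0, 1], and the function name `scenario_score` are illustrative choices, not part of the framework itself. Only the priority marks are taken from the table.

```python
# Assumed numeric weights for the three priority levels (illustrative only).
LEVEL_WEIGHT = {"●": 3.0, "○": 2.0, "·": 1.0}

DIMENSIONS = [
    "Recall", "Precision", "Faithfulness", "Freshness", "Conflict", "Latency",
    "Deletion", "Permission", "Correction", "Provenance", "Synthesis", "Compression",
]

# Priority marks transcribed column by column from the matrix above,
# in the same order as DIMENSIONS.
PRIORITY = {
    "Personal Assistant":        "○·○●○○○·●·●○",
    "Enterprise Knowledge Base": "●○●○●○●●○●○○",
    "Legal / Medical":           "●●●○●·●●●●·●",
    "Research Assistant":        "●○●○○···○●●○",
    "Coding Assistant":          "○●○●·●·○○···",
}


def scenario_score(scenario: str, scores: dict[str, float]) -> float:
    """Priority-weighted mean of per-dimension scores, each assumed to lie in [0, 1]."""
    marks = PRIORITY[scenario]
    if len(marks) != len(DIMENSIONS):
        raise ValueError("priority string must cover every dimension")
    weights = [LEVEL_WEIGHT[mark] for mark in marks]
    weighted_sum = sum(w * scores[dim] for w, dim in zip(weights, DIMENSIONS))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    # Hypothetical system: fast and precise, mediocre everywhere else.
    system = {dim: 0.5 for dim in DIMENSIONS}
    system["Latency"] = 1.0
    system["Precision"] = 1.0
    for name in PRIORITY:
        print(f"{name}: {scenario_score(name, system):.3f}")
```

The same hypothetical system scores higher as a coding assistant than as a legal or medical system, which is the point of the matrix: the ranking of candidate systems depends on the scenario.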
Conclusion
Memory Infrastructure Outlasts Model Generations
Central Thesis: Under current and foreseeable LLM technology stacks, the most mature, most engineerable, and most controllable path for long-term memory oriented toward user or enterprise private knowledge is Generalized RAG—a system paradigm satisfying the four ABCD criteria of “external persistent storage + runtime retrieval + independently addable and deletable + knowledge isolation.” A complete long-term memory system should aspire to the Memory OS level, encompassing a full-cycle closed loop of read, write, compress, delete, and audit.
This conclusion rests on necessary condition analysis: among the five conditions of persistence, updatability, on-demand retrieval, personalized isolation, and auditable deletion, no other single technology can satisfy all simultaneously. Through the negative exclusion criteria (Section 02), we have mitigated the risk of “Generalized RAG defined so broadly that it becomes tautological.”
| Version | Core Contribution | Responding to Review |
|---|---|---|
| V1 | Central thesis, retrieval pipeline, Markdown preprocessing lever | — |
| V2 | Boundary definitions (Narrow/Generalized/Memory OS), seven-layer classification, write/delete/security, ten-dimensional evaluation | GPT-5.5 |
| V3 | System 1/2 distinction, fragmented recall, latency cost, deployment architecture | Gemini 3.1 |
| V4 | Negative definition, evidence grading, Write Pipeline, compression fidelity, compute cost, conflict intent inference, twelve-dimensional scenario-based evaluation | Tri-model joint review |
The final judgment remains unchanged: Models will be superseded, but memory infrastructure will not be deprecated. Investing in the Memory OS pipeline—from file Markdown conversion and structured annotation to the complete retrieve-write-compress-delete-audit pipeline—is building the most essential long-term cognitive infrastructure for AI.
This paper has evolved from “why RAG is irreplaceable” to “what a complete Memory OS requires.” The core value lies not in proving that “vector retrieval will exist forever,” but in demonstrating that: as long as AI needs updatable, deletable, isolatable, and auditable long-term private knowledge, it will inevitably require an external persistent knowledge layer—and this knowledge layer, together with its complete read-write-compress-delete-audit pipeline, is precisely where Memory OS resides.
Future Experimental Directions
Empirical Validation Agenda for Subsequent Work
This paper is a systems framework paper; its argumentation methodology is literature synthesis and necessary condition analysis. The following experimental directions are reserved for future work:
| Experiment | Design | Expected Validation |
|---|---|---|
| Four-Approach Comparison | Compare pure long context, pure fine-tuning, RAG, and RAG + fine-tuning on the same knowledge base | Differentiated advantages across eight memory types |
| Compression Fidelity Decay | 1/2/3/4 levels of summary compression, measuring semantic consistency | Cumulative distortion rate of multi-level summarization |
| Conflict Intent Inference | Test sets for Belief Update / Episodic Exception / Preference Drift | Current model accuracy for conflict classification |
| Point Query vs. Surface Query | Fact lookup vs. trend synthesis, standard RAG vs. multi-scale RAG | Degree to which fragmented recall impairs global synthesis |
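For the Compression Fidelity Decay experiment, a simple baseline hypothesis is that distortion compounds multiplicatively across summary levels. The sketch below works through that arithmetic. The multiplicative model and the per-level fidelity of 0.95 are assumptions for illustration, not measured results; the experiment itself would test whether real summarization chains follow this curve. The 0.85 default is the Compression Fidelity target from the evaluation framework.

```python
def cumulative_fidelity(per_level: float, levels: int) -> float:
    """Fidelity after `levels` rounds of summarization, assuming each round
    independently preserves the same fraction of meaning (a simplification)."""
    if not 0.0 < per_level < 1.0:
        raise ValueError("per_level must be strictly between 0 and 1")
    if levels < 0:
        raise ValueError("levels must be non-negative")
    return per_level ** levels


def max_levels(per_level: float, target: float = 0.85) -> int:
    """Largest number of summary levels that keeps cumulative fidelity >= target."""
    levels = 0
    while cumulative_fidelity(per_level, levels + 1) >= target:
        levels += 1
    return levels


if __name__ == "__main__":
    # Hypothetical per-level fidelity of 0.95 across the 1/2/3/4-level design.
    for n in range(1, 5):
        print(n, round(cumulative_fidelity(0.95, n), 4))
    print("levels within target:", max_levels(0.95))
```

Under these assumptions, three levels stay above the 0.85 target and the fourth drops below it, which suggests why the experiment measures up to four levels.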