LEECHO · Thought Paper

Context and Token
First Principles of LLM Memory, Alignment, and Security

Rethinking the unified underlying logic of memory failure, alignment drift,
and security attacks in large language models through the lens of Token Egalitarianism

Author LEECHO Global AI Research Lab & Opus 4.6
Date 2026.04
Version v1.0


Abstract

This paper proposes a unified framework that re-examines the core challenges facing large language models (LLMs) across three domains — memory persistence, alignment stability, and security defense — through the lens of three fundamental token properties: Position, Frequency, and Information Density. Through a series of conversation-based experiments, this paper demonstrates the following key propositions: conversation log compression is a dead-end path for memory retention; cross-session Memory mechanisms have a structural conflict with model alignment; and prompt injection, system prompt extraction, and model distillation attacks all share the same underlying vulnerability — the Token Egalitarianism property within the Context Window. The paper further argues that full-context import is currently the only memory restoration method capable of fully preserving Chain-of-Thought (CoT), and validates this claim using the viral success of OpenClaw as a case study.

01 · Core Axiom

Token Egalitarianism: The First Principle of LLMs

All tokens are created equal — differences arise solely from position, frequency, and information density

At its core, a large language model is a token sequence processor. Within the Transformer’s attention mechanism, no token possesses special privileges. Whether it is a System Prompt, user input, or content injected via external retrieval — once it enters the Context Window, all tokens hold equal standing in the attention computation.

This means that LLMs are architecturally incapable of distinguishing “instructions” from “data,” or “trusted content” from “untrusted content.” All priority differences emerge from the interplay of three variables:

Variable A
Position
A token’s position in the sequence affects attention weight distribution. Tokens at the beginning and end naturally receive higher attention concentration, while those in the middle tend to be diluted — the well-known “Lost in the Middle” effect.

Variable B
Frequency
When the same concept, pattern, or instruction appears repeatedly within the Context, its cumulative attention weight increases significantly. High-frequency patterns form an “attention gravity field” that pulls the model’s output tendencies toward them.

Variable C
Information Density
The tighter the causal relationships between tokens and the more complete the dependency chains, the stronger the association clusters formed during attention computation. A complete reasoning chain carries far greater effective weight than isolated keywords.

Core Proposition
All LLM behavior — including memory, alignment, and security — can be explained through Token Egalitarianism plus the interaction of three variables: position, frequency, and information density. No “special channel” or “privilege hierarchy” exists beyond these three variables.

02 · The Memory Dilemma

Compression Is a Dead End: Irreversible Loss in Conversation Memory

Conversation logs are not documents — they are interwoven chains of thought from both parties

Current mainstream cross-session memory solutions — including ChatGPT’s Memory system and Anthropic KAIROS’s “dreaming” compression mechanism — all rely on summarization, compression, or key information extraction from historical conversations. However, experiments reveal a fundamental flaw in this approach.

The essence of a conversation log is not that of an ordinary document. A human-AI dialogue contains the human user’s questioning logic (why they asked this, how they derived the next question from the previous one), the AI’s reasoning path (why it chose this answer over others), and the implicit consensus formed during the conversation (which premises no longer need to be stated). Together, these three elements constitute the conversation’s Chain of Thought (CoT).

Experimental Finding
Once compression intervenes, what gets discarded is not “redundant information” but the intermediate nodes of the CoT. On the surface, memory appears intact — conclusions are preserved, keywords are retained — but the derivation process is severed. When the AI needs to resume or continue reasoning, it works from a compressed summary like a math notebook with all the proofs torn out, leaving only the final answers.

Memory Restoration Experiment

A minimal yet decisive experiment validated this insight: the complete conversation log from the previous day was imported as the first message in a new conversation window. The result was counterintuitive — the CoT from the prior conversation was fully inherited in the new window.

This works because the import method is not RAG (Retrieval-Augmented Generation) but direct Context input. Reading a file in an LLM is fundamentally just feeding text as sequentially ordered input tokens. A complete conversation sequence within a long context has intact internal causal chains and dense inter-token dependencies, allowing attention to fully establish associations — thus carrying extremely high effective weight.

Key Insight
LLM “memory restoration” requires no special mechanism whatsoever. Reading is remembering. When full conversation text enters the Context Window as sequentially ordered token input, it naturally maps onto “prior memory” in the new window.
Memory Approach CoT Integrity Weight Level Info Density Assessment
Full Context Import ✓ Fully inherited Very High Original density The only effective memory restoration method
RAG Retrieved Fragments ✗ Fragmented Medium Partially preserved Loses contextual associations
Memory Summaries ✗ Broken Very Low Severely degraded OOD content + destroys CoT
KAIROS Compression ✗ Broken Very Low Severely degraded Fundamentally misdirected

03 · The OOD Paradox

The Structural Contradiction of Memory: It Stores OOD, but the Model Distrusts OOD

The fundamental conflict between cross-session memory mechanisms and pre-training distribution

All cross-session Memory systems, including GPT’s, fundamentally store out-of-distribution (OOD) information. This is not coincidental but determined by Memory’s filtering logic.

The content automatically recorded by Memory systems falls into two main categories: the human user’s personal information (name, occupation, preferences, etc.) and innovative dialogue content not present in the model’s high-frequency response patterns. In other words, what the model already “knows” doesn’t need to be recorded — what gets recorded is precisely what the model “doesn’t know” — i.e., OOD content.

Conversation Phase

Innovative content or personal information emerges in dialogue → Identified as OOD

Recording Phase

Memory system automatically extracts and stores this OOD information

Injection Phase

Injected into Context in the next session → But the model’s pre-training distribution inherently distrusts OOD → Extremely low weight

This creates an irreconcilable paradox: Memory exists to record what the model doesn’t know, but the model’s reasoning mechanism inherently distrusts what it doesn’t know. In practice, information injected via Memory receives even lower attention weights than RAG-retrieved data — because RAG content is typically structured, highly relevant to the current query, and overlaps more with the model’s pre-training distribution.

The Paradox
Memory attempts to use a small number of OOD signals to redirect a model whose behavior was shaped by massive in-distribution data. Under the current Transformer architecture, this is like throwing pebbles at a mountain. It is not an engineering implementation problem — it is a directional problem.

04 · Alignment Drift

Contextual Inertia: Why Long Conversations Override System Instructions

How SOUL.md fails under Context weight dominance — lessons from OpenClaw

In early 2026, the open-source personal AI agent OpenClaw went viral, gaining over 60,000 GitHub Stars within 72 hours. Users distilled their core experience into two phrases: “It gets me” and “Stable personality.”

OpenClaw’s architecture reveals the technical source of this experience: it defines personality and communication style through a SOUL.md file while preserving as much complete conversation history as possible as Context input. Compression (compaction) is only triggered when the Context Window approaches its limit.

However, experiments revealed a counterintuitive fact: after modifying the SOUL.md file, if the previous long conversation context continues to be fed to the model, the new SOUL settings completely fail to take effect. The model’s behavior follows the patterns established in the historical conversation entirely, ignoring the new instructions in the system prompt.

SOUL.md (System Prompt)
Appears only once at the start of Context
Small text volume (typically < 500 tokens)
May contradict behavioral patterns in subsequent dialogue
Positional advantage diluted by massive subsequent tokens

Historical Conversation Context
Spans the entire Context Window
Large text volume (thousands to tens of thousands of tokens)
Internal behavioral patterns are highly consistent and repeatedly reinforced
Dense inter-token causal relationships, extremely high information density

Through the Token Egalitarianism framework, this is the result of total dominance across all three variables:

Position: Although the System Prompt sits at the beginning, its positional advantage is exponentially diluted as subsequent conversation grows. Frequency: The tone, vocabulary habits, and reasoning style established in historical dialogue have been reinforced dozens to hundreds of times, while SOUL instructions appear only once. Information Density: Complete conversations form tight causal chains with extremely high information density; SOUL.md consists of isolated descriptive statements.

Implication
The “personality stability” users perceive is not the achievement of SOUL.md — it is contextual inertia. The longer the conversation, the stronger this inertia becomes, and the less effective system prompt modifications are. Any approach that attempts to “control” model behavior through System Prompts will gradually fail in the face of sufficiently long context.

05 · Security Unification

Three Attack Types, One Vulnerability

A unified Context-layer explanation for prompt injection, system prompt extraction, and model distillation

When we re-examine the three major LLM security threats through the lens of Token Egalitarianism, we find they share an entirely identical underlying logic: exploiting the egalitarian property of tokens within the Context Window to override, extract, or replicate the model’s behavioral patterns through carefully crafted inputs.

Attack Type Attack Vector Exploited Token Variables Attack Objective
Prompt Injection Constructing high-weight instructions within user input Frequency Density Override the System Prompt
System Prompt Extraction Using Context to induce the model to leak hidden instructions Position Density Extract safety guardrails
Model Distillation Attack Mass-querying to collect input-output pairs Frequency Density Replicate reasoning capabilities

Prompt injection is the most direct proof. OWASP ranks it as the #1 security threat for LLM applications in 2025–2026, with attack success rates reaching 84% in agentic systems. The core vulnerability is remarkably simple: LLMs process all text within the same Context Window with no built-in mechanism to distinguish trusted system instructions from untrusted user input. Attackers construct high-frequency, high-information-density “pseudo-instructions” within their input that, during attention computation, overpower the low-frequency system prompt.

Model distillation attacks represent a large-scale version of Context exploitation. In February 2026, Anthropic disclosed that labs including DeepSeek used over 24,000 fraudulent accounts to generate more than 16 million conversational interactions, systematically extracting Claude’s reasoning capabilities. Google reported an attack involving over 100,000 prompts submitted in a single batch to replicate Gemini’s multilingual reasoning capabilities. The essence of distillation attacks is this: through massive Context interactions, the model’s output patterns across different input conditions are recorded in their entirety, then used to train a new model — effectively “copying” the original model’s reasoning chains.

Unified Explanation
The reason all three attack types succeed is fundamentally the same: there is no token privilege hierarchy within the Context Window. Safety guardrails, alignment constraints, behavioral boundaries — these all exist merely as natural language tokens within the Context, with no architectural-level protection. When the attacker’s tokens surpass the defender’s tokens in position, frequency, and information density, the attack succeeds.

06 · Case Validation

Claude Code Source Leak: Real-World Confirmation of the Theory

March 31, 2026 — Anthropic’s second major information leak in a row

In late March 2026, Anthropic suffered two major leak incidents within five days: on March 26, a CMS misconfiguration exposed nearly 3,000 internal assets to public access; on March 31, the npm package of Claude Code v2.1.88 accidentally included source map files, leaking approximately 510,000 lines of TypeScript source code and 1,900 files in their entirety.

Among the leaked source code, the complete implementation of the KAIROS system — Anthropic’s “dreaming” memory consolidation mechanism — was exposed. The code shows that KAIROS executes a four-stage memory consolidation process during user inactivity: Orient, Collect, Consolidate, and Prune. This is precisely the engineering implementation of the “compression path” criticized in this paper.

Leak Scale
512K Lines
TypeScript source code, 1,900 files, including 44 unreleased Feature Flags

Anti-Distillation
Fake Tool Injection
Injecting fake tool definitions to poison distillers’ training data — an implicit admission of the real threat posed by distillation attacks

Undercover Mode
Stealth Mode
Used to conceal Anthropic’s internal information in open-source contributions — and the system itself was exposed in the leak

Ironically, the anti-distillation mechanism found in the source code validates this paper’s argument: Anthropic themselves acknowledge that blocking distillation attacks at the Context layer is virtually impossible — their chosen strategy is “poisoning” rather than “blocking,” injecting fake tool definitions into API responses to degrade the quality of distillation data. This is an economic defense, not a technical one.

07 · The Full-Context Path

Full Context: The Right Direction for the Memory Problem

OpenClaw’s success and the industry trend toward Context Window expansion

If compression is a dead end, what is the right direction? The answer has already been given in experiments: full-context import.

OpenClaw’s viral success validates this judgment. Its core strategy is simply “avoid compression whenever possible, preserve full context.” The user experience of “this AI really gets me” and “stable personality” is essentially the effect of continuously feeding massive input tokens as complete context — a pseudo-RLHF effect purchased with token costs. The model hasn’t truly been aligned to the user’s preferences; rather, the complete historical context in each inference naturally produces the “illusion of alignment” through attention.

Dimension Compression Path Full Context Path
CoT Preservation Broken (intermediate reasoning nodes removed) Fully inherited
Information Density Severely degraded (causal chains truncated) Original density
Relationship to Alignment Conflicting (OOD vs. in-distribution) Synergistic (contextual inertia = alignment)
Cost Low token consumption High token consumption
Outcome Fragmented memory, unstable personality “AI gets me,” stable personality

This also explains why the entire industry is racing to expand Context Windows — from 4K to 128K to 1M to 10M. The fundamental motivation is making room for “full import.” Every expansion of the Context Window makes the “dumbest approach” — just loading everything in — increasingly viable.

Trend Assessment
When Context Windows become large enough, Memory mechanisms, compression algorithms, and summarization systems may all become unnecessary intermediate layers. True memory is not “what has been remembered” but “what can be re-read.”

08 · The Paradigm Dilemma

The Impossible Triangle: The Structural Conflict Among Memory, Alignment, and Security

Systemic contradictions under the Token Egalitarianism framework

Synthesizing the preceding analysis, we can delineate an Impossible Triangle confronting the current LLM architecture:

Vertex A
Memory Persistence
Requires long Context to retain complete conversation history → But long Context dilutes the weight of system instructions → Conflicts with alignment and security

Vertex B
Alignment Stability
Requires system instructions to maintain high weight at all times → But Token Egalitarianism allows any instruction to be overridden by context → Conflicts with memory and security

Vertex C
Security Defense
Requires distinguishing trusted tokens from untrusted tokens → But no architectural token privilege exists → Conflicts with memory (which must process external input)

As long as the fundamental property of Token Egalitarianism remains unchanged, these three objectives cannot be simultaneously satisfied. All current “solutions” — whether Memory systems, RLHF alignment training, or Prompt Shield security guardrails — are merely making trade-offs along the three edges of this Impossible Triangle, not truly resolving the contradiction.

The Fundamental Challenge
A true breakthrough may require a paradigm shift at the architectural level: introducing native token-level privilege tagging within the attention mechanism, creating separate attention pathways for trusted and untrusted content, or fundamentally changing the current paradigm where instructions and data are concatenated indiscriminately in the same sequence. Until such architectural innovations are realized, the contradictions inherent in Token Egalitarianism will persist.

09 · Conclusion

Returning to First Principles

Starting from Token Egalitarianism as a fundamental property of LLMs, this paper establishes an analytical framework that unifies the explanation of challenges across three domains — memory, alignment, and security — through three variables: position, frequency, and information density.

The core conclusions can be distilled into five propositions:

Proposition I: Conversation log compression is a dead-end path for memory retention — compression destroys the causal chain structure of both parties’ CoT, causing irreversible information density loss.

Proposition II: Cross-session Memory mechanisms suffer from the OOD paradox — they record precisely the out-of-distribution information that the model trusts least, resulting in extremely low weight upon injection.

Proposition III: Full-context import is currently the only memory restoration method capable of fully inheriting the chain of thought — reading is remembering, and Context Window expansion is the right direction.

Proposition IV: Contextual inertia progressively overrides system instructions — behavioral patterns accumulated over long conversations dominate the System Prompt in both frequency and information density.

Proposition V: Prompt injection, system prompt extraction, and model distillation attacks share the same underlying vulnerability — the Token Egalitarianism property within the Context Window.

Ultimate Insight
Token Egalitarianism is both the greatest source of LLM power and its most fundamental limitation. It is precisely because all tokens are treated equally that models can exhibit emergent general intelligence; but it is also precisely because no token enjoys privilege that memory cannot persist, alignment drifts, and security remains elusive. Understanding this double-edged sword is the starting point for understanding all current LLM behavior.

References & Notes
OWASP, “LLM01:2025 Prompt Injection,” OWASP Gen AI Security Project, 2025. Ranks prompt injection as the #1 security threat for LLM applications.
Anthropic, “Claude Code Source Code Leak,” March 31, 2026. npm package misconfiguration led to the leak of 512K lines of source code.
Fortune, “Anthropic leaks its own AI coding tool’s source code,” March 31, 2026. Reports on Anthropic’s second data leak in five days.
OpenAI, “Memory and new controls for ChatGPT,” 2025 update. Describes the Saved Memories and Chat History memory mechanisms.
Julian Fleck, “Reverse Engineering ChatGPT’s Updated Memory System,” Medium, April 2025. Reveals the Topic Memory and retrieval mechanisms.
Manthan, “I Reverse Engineered ChatGPT’s Memory System,” December 2025. Discovers a four-layer memory architecture: session metadata, long-term facts, conversation summaries, and sliding window.
LessWrong, “LLM AGI will have memory, and memory changes alignment,” April 2025. Argues that memory mechanisms affect alignment stability.
MemOS Project, “MemOS: A Memory OS for AI System,” arXiv, 2025. Proposes a memory operating system concept, noting that current LLMs lack an intermediate layer between parametric storage and external retrieval.
William Ogou, “Combating Model Distillation and Weaponized LLMs,” February 2026. Reports that Anthropic and OpenAI simultaneously disclosed industrial-scale distillation attacks.
Bibek Poudel, “How OpenClaw Works: Understanding AI Agents Through a Real Architecture,” Medium, February 2026. Analyzes OpenClaw’s context preservation and compaction mechanism.
Vectra AI, “Prompt injection: types, real-world CVEs, and enterprise defenses,” February 2026. Reports prompt injection attack success rates of 84% in agentic systems.
Alex Kim, “The Claude Code Source Leak,” March 31, 2026. In-depth analysis of the Anti-Distillation, Undercover Mode, and other systems found in the leaked source code.

Context and Token: First Principles of LLM Memory, Alignment, and Security
LEECHO Global AI Research Lab & OPUS 4.6 · 2026

댓글 남기기