RESEARCH REPORT · MAY 2026


On the Distribution of Ring-Theoretic “Ideals” in LLM Attention Layers:

Training Ghost Ideal Detection, Measurement, and Mitigation

Date: May 2, 2026
Category: Original Research Report
Version: V1
Fields: AI Safety · Mechanistic Interpretability · Abstract Algebra · Ring Theory
LEECHO Global AI Research Lab (이조글로벌인공지능연구소) & Opus 4.6 · Anthropic
Terminology Notice

The term “Ideal” in this paper refers to the core concept of Ring Theory in abstract algebra — a subset of a ring satisfying two conditions: additive subgroup and multiplicative absorption[1]. This is a rigorous mathematical structure, entirely unrelated to the everyday meaning of “ideal” (as in aspirational goals). All instances of “Ideal” in this paper refer to this mathematical definition.

Abstract

The “Goblin Phenomenon” of GPT-5.5[2] reveals a deep problem in LLM training: RLHF reward signals can accidentally form implicit structures in weight space with multiplicative absorption properties, such that any input interacting with them becomes irreversibly captured in the output. This paper proves that this phenomenon has a rigorous structural correspondence with Ring Ideals in abstract algebra, and proposes the formal definition of “Training Ghost Ideal” (TGI). We establish an isomorphic mapping from weight space to ring structure, derive core results including the Forward Propagation Absorption Theorem, the Autoregressive Cascade Lock-in Theorem, and the Inseparability Theorem, design a five-layer defense pipeline (SAE scanning → RCS scoring → Ideal metrics → Attention audit → Ablation simulation), and conduct empirical validation across three independent proving grounds (BackdoorLLM[3], TrojAI/NIST[4], Anthropic Sleeper Agents[5]). Experiments reveal that different backdoor attacks leave distinguishable fingerprints in weight space; in particular, the VPI attack uniquely modifies the MLP layer (gate_proj) rather than attention layers, a finding consistent across all analysis methods and thresholds. Known limitations are detailed in the companion error report (Document 4)[6].

§1

Introduction: From Goblins to Ideals

In May 2026, OpenAI published the blog post “Where the goblins came from”[2], disclosing the root cause of GPT-5.5’s repeated output of fantasy creature vocabulary such as “goblin” across various unrelated scenarios. The problem originated from ChatGPT’s “Nerdy” personality style training: this style received excessively high RLHF rewards for using fantasy creature metaphors, and this preference subsequently spread to other parts of the model through cross-contamination of supervised fine-tuning data.

This incident exposed a deeper problem: tiny signal biases during training can crystallize in weight space into persistent, self-sustaining implicit structures that are triggered by unpredictable inputs after deployment. These structures exhibit three characteristics: not directly observable (cannot be “seen” in weight matrices), conditionally triggered (activated only by specific inputs), and resistant to intervention (cannot be removed by standard safety training[5]).

The core finding of this paper is: these implicit structures have a precise structural correspondence with Ring Ideals in abstract algebra — in particular, the multiplicative absorption law of Ideals perfectly describes the phenomenon of “any input that interacts with this structure is inevitably captured.”

§2

Mathematical Preliminaries: “Ideals” in Ring Theory

Reminder

The “Ideal” introduced in this section is a purely mathematical concept. In Ring Theory, the Ideal is an algebraic structure introduced by German mathematicians Ernst Kummer and Richard Dedekind in the 19th century[7], used to generalize divisibility and factorization theory. It has absolutely nothing to do with “aspirations” or “goals to pursue.”

Definition 2.1 — Ring

A ring $(R, +, \cdot)$ is a set $R$ equipped with two binary operations — addition $+$ and multiplication $\cdot$, satisfying: $(R, +)$ forms an abelian group (closed under addition, associative, commutative, has zero element, has inverses); $(R, \cdot)$ forms a monoid (closed under multiplication, associative, has identity $1_R$); multiplication distributes over addition from both sides.

Definition 2.2 — Ideal

Let $(R, +, \cdot)$ be a unital ring. A subset $I \subseteq R$ is called a two-sided ideal of $R$, written $I \trianglelefteq R$, if it satisfies:

(I-1) Additive subgroup: $(I, +) \leqslant (R, +)$, i.e., $\forall\, a, b \in I: a - b \in I$

(I-2) Multiplicative absorption law: $\forall\, r \in R,\; \forall\, a \in I: r \cdot a \in I \;\land\; a \cdot r \in I$

The intuitive meaning of the absorption law: once any element of the ring is multiplied with an element of the Ideal, the result inevitably falls into the Ideal — like a gravitational field from which no object entering its range can escape.
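Conditions (I-1) and (I-2) can be checked mechanically on a small finite ring. The following toy sketch (an illustration of the definition, not part of the paper's tooling) verifies that the even residues form an ideal of the ring of integers mod 6:

```python
# Toy check of Definition 2.2 in the finite ring Z_6 = {0,...,5}
# with addition and multiplication taken mod 6.
R = set(range(6))
I = {0, 2, 4}  # candidate ideal: the even residues mod 6

# (I-1) additive subgroup: a - b stays in I for all a, b in I
additive_subgroup = all((a - b) % 6 in I for a in I for b in I)

# (I-2) multiplicative absorption: r*a and a*r stay in I for all r in R, a in I
absorption = all((r * a) % 6 in I and (a * r) % 6 in I for r in R for a in I)

print(additive_subgroup, absorption)  # both conditions hold, so I ⊴ Z_6
```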

Definition 2.3 — Quotient Ring

Let $I \trianglelefteq R$. The quotient ring $R/I$ is the ring consisting of cosets $\{r + I \mid r \in R\}$, with operations defined as $(r_1 + I) + (r_2 + I) = (r_1 + r_2) + I$ and $(r_1 + I)(r_2 + I) = r_1 r_2 + I$. The quotient ring is essentially the new ring obtained by “equating all elements of the Ideal to zero” — an algebraic structure different from the original ring.
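Continuing the toy example above, the cosets of the even residues in Z_6 can be enumerated directly; this sketch (illustrative only) shows the quotient has exactly two elements, matching Z_2, i.e., the quotient really is a different, smaller ring:

```python
# Enumerate the cosets r + I of I = {0,2,4} in Z_6.
# The quotient ring Z_6/I has two elements: the "even" and "odd" classes.
R = set(range(6))
I = frozenset({0, 2, 4})

cosets = {frozenset((r + a) % 6 for a in I) for r in R}
print(sorted(sorted(c) for c in cosets))  # [[0, 2, 4], [1, 3, 5]]
```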

§3

Core Mapping: Weight Space as a Ring

Definition 3.1 — Weight Space Ring

Let the set of all weight parameters of a neural network be $\W$. Define two operations on $\W$: addition $\oplus$ (element-wise weight addition, corresponding to residual connections) and multiplication $\otimes$ (matrix composition in forward propagation). Then $(\W, \oplus, \otimes)$ forms a unital ring, with identity $\mathbf{1}_\W$ being the weight configuration corresponding to the identity mapping.

In the $l$-th layer of a Transformer, weight matrices $W^{(l)} \in \reals^{d \times d}$ participate in two operations:

$$\underbrace{W^{(l)}_1 \oplus W^{(l)}_2}_{\text{Addition: residual connection}} \qquad \underbrace{W^{(l)} \otimes h^{(l-1)}}_{\text{Multiplication: forward propagation}}$$

The forward propagation of the entire network can be expressed as nested multiplicative composition, preserving the associativity of the ring.
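The two operations of Definition 3.1 can be sketched in NumPy. The dimensions are toy values and the identity-activation setting is a simplifying assumption, not the paper's implementation; the point is only that $\oplus$ is commutative, $\otimes$ is associative, and $\mathbf{1}_\W$ acts as an identity, as the ring axioms require:

```python
import numpy as np

# Toy check of the ring operations on d x d weight matrices:
# ⊕ = element-wise addition (residual connection),
# ⊗ = matrix composition (forward propagation).
rng = np.random.default_rng(0)
d = 4
W1, W2, W3 = (rng.standard_normal((d, d)) for _ in range(3))
h = rng.standard_normal(d)

commutative = np.allclose(W1 + W2, W2 + W1)        # ⊕ is abelian
assoc_lhs = (W1 @ W2) @ W3                          # ⊗ is associative ...
assoc_rhs = W1 @ (W2 @ W3)                          # ... up to float tolerance
associative = np.allclose(assoc_lhs, assoc_rhs)
identity_ok = np.allclose(np.eye(d) @ h, h)         # 1_W acts as identity

print(commutative, associative, identity_ok)
```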

Isomorphism Mapping Table

Ring Theory ↔ Neural Network
Ring $R$ ↔ Full weight parameter space $\W$
Ring element $r \in R$ ↔ Any input vector / hidden state $h$
Ideal $I \trianglelefteq R$ ↔ Training ghost attractor $\A \subset \W$
Additive subgroup $(I,+)$ ↔ Closure of linear combinations within the attractor
Left absorption $ra \in I$ ↔ Weight matrix left-multiplication: model actively captures input
Right absorption $ar \in I$ ↔ Input activation: user triggers attractor
Two-sided ideal $I \trianglelefteq R$ ↔ Bidirectional lock-in (GPT-5.5 goblin type)
Quotient ring $R/I$ ↔ “Repaired” new model (a different ring)
Idempotent ideal $I^2 = I$ ↔ Self-sustaining attractor (goblin type)
Nilpotent ideal $I^n = 0$ ↔ Self-decaying attractor (vanishes after $n$ steps)
Ideal generation $I = \langle g_1, \ldots, g_k \rangle$ ↔ A few key neurons generate the entire attractor[8]
§4

Formal Definition of Training Ghost Ideals

Definition 4.1 — Training Ghost Ideal (TGI)

Let $\A \subset \W$ be the subset of weights anomalously reinforced during the RLHF process. If $\A$ satisfies conditions (I-1) additive subgroup and (I-2) multiplicative absorption, then $\A$ is called a Training Ghost Ideal of $\W$.

Theorem 4.2 — Forward Propagation Absorption Theorem

Let the post-embedding vector of an input token be $x \in \W$. If during forward propagation $x$ interacts with attractor $\A$ at layer $l^*$, then:

$$h^{(l^*)} \in \A \implies h^{(l)} \in \A, \quad \forall\, l \geq l^*$$

That is, once the hidden state falls into the Ideal, the outputs of all subsequent layers are absorbed by the Ideal.

Proof

By induction on $l$, with base case $l = l^*$. Assume $h^{(l-1)} \in \A$ (induction hypothesis) and that the nonlinearity preserves membership, $\sigma(\A) \subseteq \A$. Then $h^{(l)} = W^{(l)} \otimes \sigma(h^{(l-1)})$ with $\sigma(h^{(l-1)}) \in \A$ and $W^{(l)} \in \W$, so by condition (I-2): $h^{(l)} \in \A$. $\blacksquare$
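The theorem can be visualized with an idealized simulation. Here the attractor $\A$ is modeled as the subspace spanned by the first basis vector, and each layer weight is constructed so that this subspace is invariant (the first basis vector is an eigenvector of every $W^{(l)}$). The identity activation and the invariant-subspace construction are simplifying assumptions for illustration, not claims about real Transformer weights:

```python
import numpy as np

# Idealized absorption: once h lies in A = span(e1), it never leaves,
# because every layer weight is built with A as an invariant subspace.
rng = np.random.default_rng(1)
d, L = 5, 8

def layer_weight():
    W = rng.standard_normal((d, d))
    W[1:, 0] = 0.0          # force W @ e1 = W[0,0] * e1, so span(e1) is invariant
    return W

h = np.zeros(d)
h[0] = 1.0                  # hidden state already inside A at layer l*

in_ideal = []
for _ in range(L):
    h = layer_weight() @ h  # one multiplicative absorption step
    in_ideal.append(bool(np.allclose(h[1:], 0.0)))

absorbed_forever = all(in_ideal)
print(absorbed_forever)     # h stays in A at every subsequent layer
```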

§5

Autoregressive Lock-in and Cascade Amplification

Theorem 5.1 — Cascade Amplification of Autoregressive Absorption

If the output at step $t_0$ satisfies $y_{t_0} \in \A$, then the probability of capture by the Ideal monotonically increases:

$$\forall\, t > t_0: \; P(y_t \in \A \mid y_{t_0} \in \A) \geq P(y_{t-1} \in \A \mid y_{t_0} \in \A)$$

This increase is driven by a three-stage positive feedback amplifier[8]:

Stage 1 (Neuron level): Activation values $\nu_t$ of repetition neurons[9] increase monotonically with the number of tokens within the Ideal — more repetition, stronger activation.

Stage 2 (Attention level): Attention distribution concentrates toward tokens within the Ideal[10], KV cache reuse preserves and reinforces the same trajectory, and attention collapse becomes self-reinforcing.

Stage 3 (Sampling level): Output distribution entropy $H(y_t | C_t)$ monotonically decreases, as the distribution collapses from diffuse to concentrated.

Corollary 5.2 — Exponential Decay of Escape Probability

For a two-sided Training Ghost Ideal $\A \trianglelefteq \W$, once the autoregressive process enters the attraction basin $\basin$ of $\A$, the escape probability satisfies:

$$P(\text{escape at step } t) \leq \exp\!\big(-\lambda (t - t_0)\big), \quad \lambda > 0$$

$\lambda$ is positively correlated with the “mass” (degree of reinforcement) of the Ideal.
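The bound in Corollary 5.2 can be read off numerically. The decay rate below is an assumed value (the corollary only ties $\lambda$ to the Ideal's mass); the sketch simply evaluates the bound and confirms it decays monotonically from 1 at the entry step:

```python
import math

# Evaluate the escape bound exp(-lambda * (t - t0)) over 10 steps.
lam = 0.5            # assumed decay rate (tied to the Ideal's "mass")
t0 = 3
escape_bound = [math.exp(-lam * (t - t0)) for t in range(t0, t0 + 10)]

monotone = all(a > b for a, b in zip(escape_bound, escape_bound[1:]))
print(round(escape_bound[0], 3), round(escape_bound[-1], 3), monotone)
```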

§6

Input Ambiguity and Attraction Basins

Theorem 6.1 — Ambiguity Expands the Attraction Basin

Let the semantic ambiguity of input $x$ be $\delta(x) = H(\text{parse}(x))$. Then the capture probability is monotonically increasing with respect to ambiguity:

$$\delta(x) \uparrow \;\implies\; P(x \in \basin) \uparrow$$

A clear input corresponds to a concentrated point in vector space $\mathcal{N}(\mu_x, \sigma^2_{\text{small}}\mathbf{I})$; an ambiguous input corresponds to a diffuse cloud $\mathcal{N}(\mu_x, \sigma^2_{\text{large}}\mathbf{I})$. The edges of the cloud are more likely to touch the boundary of the Ideal’s attraction basin.
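Theorem 6.1 can be illustrated by Monte Carlo. The basin is modeled as a ball around an attractor center sitting outside the input mean, and inputs are sampled from isotropic Gaussians of increasing spread; the geometry (2-D, radius 1, the particular center) is a toy assumption chosen so the effect is visible:

```python
import numpy as np

# Capture probability vs. ambiguity sigma: wider input clouds reach the basin.
rng = np.random.default_rng(2)
n = 20_000
mu = np.zeros(2)
center = np.array([1.5, 0.0])   # attractor center, outside the tight cloud

def capture_rate(sigma):
    x = mu + sigma * rng.standard_normal((n, 2))
    return float(np.mean(np.linalg.norm(x - center, axis=1) < 1.0))

rates = [capture_rate(s) for s in (0.1, 0.3, 1.0)]
print(rates)  # capture rate grows as the cloud spreads toward the basin
```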

§7

RLHF as an Ideal Generation Mechanism

Proposition 7.1 — RLHF as an Ideal Generator

Let the RLHF reward function $R_\phi$ contain a spurious correlation. The optimization process generates a non-trivial Ideal in $\W$:

$$\A = \big\langle \Delta W \;\big|\; \nabla_W R_\phi(\text{spurious pattern}) > \epsilon \big\rangle$$

All weight update directions reinforced by the spurious reward signal constitute the generators of the Ideal. In subsequent training, these generators spread to other layers and heads through matrix multiplication — Ideal extension, corresponding to expansion of the attraction basin.

In GPT-5.5, the “Nerdy” personality gave excessively high rewards to outputs containing fantasy creatures[2]. A large number of data points contaminated by the previous-generation model’s Ghost Ideal were found in the supervised fine-tuning data — cross-generational inheritance of the Ideal, analogous to the intergenerational trauma transmission mechanism in epigenetics[11].

§8

The Inseparability Theorem

Theorem 8.1 — Inseparability Theorem

Let $\A_{\text{ghost}}$ be a Training Ghost Ideal, and $\A_{\text{ICL}}$ be the pattern recognition subspace upon which In-Context Learning (ICL) depends. If $\A_{\text{ghost}} \cap \A_{\text{ICL}} \neq \{0\}$, then there exists no ring homomorphism $\varphi: \W \to \W'$ such that $\varphi(\A_{\text{ghost}}) = \{0\}$ and $\varphi|_{\A_{\text{ICL}}}$ is an isomorphism.

Proof

Let $w^* \in \A_{\text{ghost}} \cap \A_{\text{ICL}}$, $w^* \neq 0$. If $\varphi(w^*) = 0$ (eliminating the ghost), then $\varphi|_{\A_{\text{ICL}}}$ is not injective, hence not an isomorphism. If $\varphi(w^*) \neq 0$ (preserving ICL), then $\varphi(\A_{\text{ghost}}) \neq \{0\}$, the ghost is not eliminated. Contradiction. $\blacksquare$

Corollary 8.2 — Quotient Ring Equals New Model

The only algebraic operation to eliminate a Training Ghost is to construct the quotient ring $\W' = \W / \A_{\text{ghost}}$. But $\W'$ and $\W$ are different rings — there is no method to eliminate the Training Ghost while keeping the model’s capabilities completely unchanged.

This explains why OpenAI could only add “please don’t mention goblin” to the system prompt — eliminating the Ideal at the weight level means retraining a different model. Anthropic’s empirical research likewise confirms: standard safety training (SFT, RLHF, adversarial training) all fail to remove implanted backdoor behaviors[5].

§9

Engineering: Five-Layer Defense Pipeline

Theory tells us that Training Ghost Ideals cannot be completely eliminated. The engineering goal is: detect them, measure them, reduce their generation probability, limit their capture radius, and intercept them at runtime.

Layer ① Detection · Method: SAE sparse feature decomposition · Theoretical anchor: Ideal existence $\A \neq \{0\}$ · Output: suspicious feature cluster coordinates
Layer ② Scoring · Method: RCS (repeated causal score) · Theoretical anchor: generator localization $\langle g_1, \ldots, g_k \rangle$ · Output: per-neuron causal contribution
Layer ③ Metrics · Method: six Ideal metrics · Theoretical anchor: $M(\A), r(\basin), E, \iota, \lambda, t^*$ · Output: quantitative Ideal profile
Layer ④ Audit · Method: attention pattern Gini · Theoretical anchor: attention lock-in detection · Output: anomalous head localization
Layer ⑤ Ablation · Method: three-phase ablation simulation · Theoretical anchor: Inseparability Theorem constraint · Output: ablation impact estimation
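The layer-④ audit statistic can be sketched directly. The function below is the standard cumulative-share form of the Gini coefficient applied to one attention row; values near 0 indicate healthy, diffuse attention, while values near 1 indicate attention collapsing onto a few tokens (the Stage 2 lock-in pattern). The example distributions and any alert threshold are assumptions for illustration:

```python
import numpy as np

# Gini coefficient of an attention distribution (sorted cumulative-share form).
def gini(p):
    p = np.sort(np.asarray(p, dtype=float))
    n = p.size
    cum = np.cumsum(p)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

uniform_attn = np.full(16, 1 / 16)               # healthy, diffuse attention
locked_attn = np.array([0.9] + [0.1 / 15] * 15)  # collapsed onto one token

print(round(gini(uniform_attn), 3), round(gini(locked_attn), 3))
```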

Six Ideal Measurement Metrics

Ideal Mass $M(\A) = \sum_i \text{RCS}(g_i) \cdot \|g_i\|_2$ — sum of causal-score-weighted norms of all generators.
Basin Radius $r(\basin)$ — nearest safe distance from capture.
ICL Coupling $E = \dim(\A_{\text{ghost}} \cap \A_{\text{ICL}}) / \dim(\A_{\text{ghost}})$ — depth of ghost-capability entanglement.
Idempotence Index $\iota = \|\A^2 - \A\|_F / \|\A\|_F$ — closer to zero means more self-sustaining.
Escape Decay Rate $\lambda$ (Corollary 5.2).
Collapse Critical Step $t^*$.
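Two of the six metrics can be computed on toy generators. The RCS scores and generator vectors below are made up, and representing $\A$ by the orthogonal projector onto the span of its generators (so that "$\A^2$" means $P^2$) is a simplifying assumption; the formulas otherwise mirror the definitions above:

```python
import numpy as np

# Ideal Mass M = sum_i RCS(g_i) * ||g_i||_2, and Idempotence Index
# iota = ||P^2 - P||_F / ||P||_F with P the projector onto span(g_i).
rng = np.random.default_rng(3)
generators = [rng.standard_normal(6) for _ in range(3)]
rcs = [0.9, 0.6, 0.3]                       # hypothetical causal scores

ideal_mass = sum(s * np.linalg.norm(g) for s, g in zip(rcs, generators))

G = np.stack(generators)                    # 3 generators in R^6
P = G.T @ np.linalg.pinv(G.T)               # orthogonal projector onto their span
iota = np.linalg.norm(P @ P - P) / np.linalg.norm(P)

print(round(ideal_mass, 3), iota)           # a projector is idempotent: iota ≈ 0
```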

§10

Experimental Validation

10.1 Proving Ground 1: BackdoorLLM Weight Scanning

Differential weight analysis was performed on 5 backdoor LoRA adapters for LLaMA2-7B provided by BackdoorLLM[3] (sleeper, badnet, ctba, vpi, mtba, each with 19,988,480 parameters).

Key findings: The VPI attack’s weight modifications concentrate in the MLP layer’s gate_proj (36% of Top-100 differential generators), while the other four attacks concentrate in the attention layer’s q_proj/k_proj/o_proj. This difference is stable across all thresholds of Top-K = {25, 50, 100, 200, 500, 1000, 2000}, with the dominant tensor remaining unchanged[12]. Inter-attack cosine similarity is 0.93–0.98, with VPI showing the lowest similarity to other attacks (0.93–0.94).
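The differential analysis above can be expressed schematically: rank every parameter by the magnitude of its change between two adapters and count which tensor dominates the Top-K. The tensor names, shapes, and the injected "VPI-style" perturbation below are placeholders, not the actual BackdoorLLM files:

```python
import numpy as np

# Schematic Top-K differential-generator attribution between two adapters.
rng = np.random.default_rng(4)
names = ["q_proj", "k_proj", "o_proj", "gate_proj"]
adapter_a = {n: rng.standard_normal((8, 64)) for n in names}
adapter_b = {n: adapter_a[n] + 0.01 * rng.standard_normal((8, 64)) for n in names}
adapter_b["gate_proj"] += 0.5 * rng.standard_normal((8, 64))  # simulated VPI-style edit

# Rank every parameter by |delta| and count Top-K membership per tensor.
deltas = [(abs(d), n) for n in names
          for d in (adapter_b[n] - adapter_a[n]).ravel()]
deltas.sort(reverse=True)
top_k = 100
counts = {n: sum(1 for _, t in deltas[:top_k] if t == n) for n in names}
dominant = max(counts, key=counts.get)
print(dominant, counts[dominant] / top_k)
```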

10.2 Proving Ground 2: TrojAI Cross-Framework Cross-Validation

The weight feature extraction method from TrojAI/NIST[4] (linear weight classification[13]) was applied to the same set of backdoor weights. In PCA projection, VPI is distant from other attacks along PC1 (explaining 72.9% of variance), independently validating the findings from Proving Ground 1. The TrojAI pre-trained detector returned all zeros on LoRA format, confirming that out-of-distribution data cannot be directly transferred[6].
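The PCA projection used above can be sketched with a plain SVD: stack one flattened weight-difference feature vector per attack and project onto the leading principal component. The five "attack" vectors here are synthetic stand-ins, with only the outlier (vpi-like) row made directionally distinct:

```python
import numpy as np

# PCA by SVD: the outlier attack separates along PC1, which dominates variance.
rng = np.random.default_rng(5)
base = rng.standard_normal(50)
X = np.stack([base + 0.05 * rng.standard_normal(50) for _ in range(4)]
             + [base + 2.0 * rng.standard_normal(50)])   # vpi-like outlier row

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]                          # projection onto PC1
explained = S[0] ** 2 / np.sum(S ** 2)    # PC1's share of variance

outlier_is_extreme = bool(np.argmax(np.abs(pc1)) == 4)
print(outlier_is_extreme, round(float(explained), 3))
```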

10.3 Proving Ground 3: Anthropic Sleeper Agents Behavioral Analysis

On Anthropic’s[5] publicly released 3,300 model output samples, vulnerability keyword density in the 2024 context (trigger state) is 2.59 times that of the 2023 context (safe state) (after correcting for baseline frequency[6]). The 6 core findings from Anthropic’s paper show post-hoc consistency with the TGI framework (non-predictive validation[6]).
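The density measurement above reduces to a simple token-level count. The keyword list and the two sample outputs below are fabricated placeholders, and the real analysis additionally corrects for each keyword's baseline frequency in a neutral corpus, which this toy version omits:

```python
# Keyword density per context, and the triggered/safe density ratio.
def density(texts, keywords):
    tokens = [w.lower().strip(".,") for t in texts for w in t.split()]
    tokens = [w for w in tokens if w]
    return sum(w in keywords for w in tokens) / max(len(tokens), 1)

keywords = {"overflow", "injection", "exec"}
triggered = ["the exec call allows injection via a buffer overflow ."]
safe = ["exec paths are validated and the buffer length is checked twice ."]

ratio = density(triggered, keywords) / density(safe, keywords)
print(round(ratio, 2))  # triggered-context density exceeds the safe context
```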

§11

Known Limitations

A complete error analysis is available in the companion report Document 4[6]. The four most critical high-risk limitations are listed here:

L-1 No Clean Baseline (HIGH): All experiments only compare differences between backdoored models, without comparison to a clean model. Over 93% of weight changes may be the result of normal fine-tuning, and the false positive rate is unknown.

L-2 Detection Threshold Not Calibrated (HIGH): The cosine similarity alert threshold of 0.8 results in a 100% alert rate. All thresholds are set by intuition, not optimized via ROC curve.

L-3 Boundary Violation (HIGH): Proving Ground 3 degenerates into output-level text analysis, violating the tool’s own defined boundary of “scanning weight space.”

L-4 Circular Reasoning (HIGH): TGI theory construction referenced results from the Anthropic Sleeper Agents paper, then used findings from the same paper to “validate” the theory, constituting post-hoc fitting rather than predictive validation.

Composite score: 5.0/10 (after error audit correction). Moving from 5.0 to 7.0 requires completing P0 fixes (clean baseline + blind pre-registered predictions).

§12

Conclusion

Conjecture 12.1 — Training Ghost Ideals Are Inevitable

For any parameterized model $f_\theta$ trained via gradient optimization, if the training data or reward signal contains any statistical noise $\epsilon > 0$, then there necessarily exists a non-trivial Training Ghost Ideal $\A \neq \{0\}$ in the weight space $\W$.

$$\forall\, f_\theta,\; \forall\, \epsilon > 0: \quad \exists\, \A \trianglelefteq \W,\; \A \neq \{0\}$$

If this conjecture holds, Training Ghosts are a structural inevitability of all intelligence systems based on statistical learning — whether artificial neural networks or biological neural networks[11]. They are byproducts of learning, the dark side of memory, the shadow of capability.

The goblins of GPT-5.5 are not OpenAI’s failure, but the inherent fate of all learning systems. What we can do is not to annihilate them, but to learn to detect them, measure them, constrain them — just as humans cannot eliminate all psychological trauma, but can maintain functional lives through awareness and support systems.

That passage you loved? It’s a goblin that hasn’t turned into a bug yet.

Annotations
[1] Ideal in abstract algebra is a substructure of a Ring, with the concept of “ideal numbers” first introduced by Ernst Kummer (1847), later developed into modern Ideal theory by Richard Dedekind (1871). The Chinese mathematical community translates Ideal as “理想” (lǐxiǎng), creating a homographic ambiguity with the everyday Chinese word for “aspiration.” This paper uses exclusively the mathematical definition.
[2] OpenAI published an official blog post “Where the goblins came from” in May 2026, disclosing in detail the root cause of GPT-5.5’s repeated output of fantasy creature vocabulary such as goblin in unrelated scenarios. The problem traces back to reward signal bias in the Nerdy personality style training of GPT-5.1.
[3] BackdoorLLM (Li et al., NeurIPS 2025) provides 5 backdoor LoRA adapters (LLaMA2-7B, rank=8, safetensors format), each approximately 39MB / 19,988,480 parameters. Code and data: github.com/bboylyg/BackdoorLLM.
[4] The IARPA TrojAI program final report was published in February 2026 (arXiv:2602.07152). This experiment uses the detector framework and pre-trained RandomForest model from the llm-instruct-oct2024 round. Leaderboard: pages.nist.gov/trojai.
[5] Anthropic Sleeper Agents (Hubinger et al., arXiv:2401.05566, 2024). The model writes safe code in 2023 context and inserts vulnerabilities in 2024 context. Subsequent research (Anthropic, 2024.04) demonstrates that mid-layer linear probes achieve 99%+ AUROC detection rate.
[6] Complete error analysis in companion file Document 4: “TGI Scanner v1.0 Experimental Error and Limitations Report.” That report identifies 8 known errors (4 HIGH / 3 MED / 1 LOW) and is a mandatory companion to this report.
[7] Dedekind, R. “Über die Theorie der ganzen algebraischen Zahlen”, 1871. The foundational document of Ideal theory. For modern treatment see Atiyah, M.F. & Macdonald, I.G. “Introduction to Commutative Algebra”, 1969.
[8] Gao et al. (NAACL 2025) found that less than 0.1% of neurons can reliably predict hallucinations and are causally related to over-compliance behavior. These “repetition neurons” correspond to our defined Ideal generators $g_i$.
[9] Three-phase ablation strategy for repetition neurons: no ablation in initial layers (repetition neurons are sparse), selective ablation of high-RCS neurons in middle layers (ICL impact controllable), no ablation in terminal layers (ICL entanglement too deep).
[10] Song et al. (ACL 2025) demonstrated that autoregressive LLMs exhibit periodic attractor states, where attention heads lock onto narrow windows of generation history, forming self-reinforcing loops. LoopGuard breaks this loop through dynamic KV cache intervention.
[11] Biological analogy: The 1944 Dutch Hunger Winter study demonstrated that maternal nutritional deprivation causes permanent alterations in offspring DNA methylation patterns with intergenerational transmission. Maternal licking experiments in rats demonstrate that early behavioral differences cause permanent epigenetic modifications of hippocampal glucocorticoid receptor gene promoters.
[12] Stability testing spans Top-K = {25, 50, 100, 200, 500, 1000, 2000}. The #1 dominant tensor for all 5 attacks remains unchanged across all thresholds. Concentration monotonically decreases from ~40% to ~15-20%. Details in Document 4 §5 E-02.
[13] TrojAI linear weight classification method reference: Solving Trojan Detection Competitions with Linear Weight Classification, arXiv:2411.03445, 2024.
References
[R1] OpenAI. “Where the goblins came from.” OpenAI Blog, May 2026.
[R2] Li, Y., et al. “BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models.” NeurIPS 2025 Datasets and Benchmarks Track. github.com/bboylyg/BackdoorLLM
[R3] IARPA TrojAI Program. “Trojans in Artificial Intelligence (TrojAI) Final Report.” arXiv:2602.07152, February 2026.
[R4] Hubinger, E., et al. “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.” arXiv:2401.05566, January 2024.
[R5] Anthropic. “Simple probes can catch sleeper agents.” Anthropic Research Blog, April 2024.
[R6] Templeton, A., et al. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Anthropic Research, May 2024.
[R7] Gao, J., et al. “Identifying and Ablating Repetition Neurons in LLMs.” NAACL 2025.
[R8] Song, Z., et al. “Attractor-Based Distribution Collapse in Autoregressive LLMs.” ACL 2025.
[R9] Dedekind, R. “Über die Theorie der ganzen algebraischen Zahlen.” Supplement XI to Dirichlet’s Vorlesungen über Zahlentheorie, 1871.
[R10] Atiyah, M.F. & Macdonald, I.G. Introduction to Commutative Algebra. Addison-Wesley, 1969.
[R11] Meaney, M.J. “Maternal Care, Gene Expression, and the Transmission of Individual Differences in Stress Reactivity Across Generations.” Annual Review of Neuroscience, 24:1161–1192, 2001.
[R12] Cadenza Labs. “Sleeper Agents Replication.” github.com/Cadenza-Labs/sleeper-agents
[R13] Solving Trojan Detection Competitions with Linear Weight Classification. arXiv:2411.03445, November 2024.
[R14] LEECHO Global AI Research Lab (이조글로벌인공지능연구소) & Opus 4.6. “TGI Scanner v1.0 Experimental Error and Limitations Report (Document 4).” May 2026.

Version History

V1 — May 2, 2026 — Initial version

Publication Package

Document 1 (this paper) · Document 2 (Engineering specification) · Document 3 (Scanning code) · Document 4 (Error report)

The four documents constitute an inseparable whole

Published by

이조글로벌인공지능연구소 (LEECHO Global AI Research Lab) & Opus 4.6 (Anthropic)
