Research Report on the Distribution of Ring-Theoretic “Ideals” in LLM Attention Layers
Here “Ideal” refers to the mathematical structure from Abstract Algebra · Ring Theory[1], not the philosophical concept of “ideal”.
On the Distribution of Ring-Theoretic “Ideals” in LLM Attention Layers:
Training Ghost Ideal Detection, Measurement, and Mitigation
The term “Ideal” in this paper refers to the core concept of Ring Theory in abstract algebra — a subset of a ring satisfying two conditions: additive subgroup and multiplicative absorption[1]. This is a rigorous mathematical structure, entirely unrelated to the everyday meaning of “ideal” (as in aspirational goals). All instances of “Ideal” in this paper refer to this mathematical definition.
The “Goblin Phenomenon” of GPT-5.5[2] reveals a deep problem in LLM training: RLHF reward signals can accidentally form implicit structures in weight space with multiplicative absorption properties, such that any input interacting with them becomes irreversibly captured in the output. This paper proves that this phenomenon has a rigorous structural correspondence with Ring Ideals in abstract algebra, and proposes the formal definition of “Training Ghost Ideal” (TGI). We establish an isomorphic mapping from weight space to ring structure, derive core results including the Forward Propagation Absorption Theorem, the Autoregressive Cascade Lock-in Theorem, and the Inseparability Theorem, design a five-layer defense pipeline (SAE scanning → RCS scoring → Ideal metrics → Attention audit → Ablation simulation), and conduct empirical validation across three independent proving grounds (BackdoorLLM[3], TrojAI/NIST[4], Anthropic Sleeper Agents[5]). Experiments reveal that different backdoor attacks leave distinguishable fingerprints in weight space; in particular, the VPI attack uniquely modifies the MLP layer (gate_proj) rather than the attention layers, a finding consistent across all analysis methods and thresholds. Known limitations are detailed in the companion error report (Document 4)[6].
Introduction: From Goblins to Ideals
In May 2026, OpenAI published the blog post “Where the goblins came from”[2], disclosing the root cause of GPT-5.5’s repeated output of fantasy creature vocabulary such as “goblin” across various unrelated scenarios. The problem originated from ChatGPT’s “Nerdy” personality style training: this style received excessively high RLHF rewards for using fantasy creature metaphors, and this preference subsequently spread to other parts of the model through cross-contamination of supervised fine-tuning data.
This incident exposed a deeper problem: tiny signal biases during training can crystallize in weight space into persistent, self-sustaining implicit structures that are triggered by unpredictable inputs after deployment. These structures exhibit three characteristics: not directly observable (cannot be “seen” in weight matrices), conditionally triggered (activated only by specific inputs), and resistant to intervention (cannot be removed by standard safety training[5]).
The core finding of this paper is that these implicit structures have a precise structural correspondence with Ring Ideals in abstract algebra — in particular, the multiplicative absorption law of Ideals precisely describes the phenomenon that any input interacting with the structure is inevitably captured.
Mathematical Preliminaries: “Ideals” in Ring Theory
The “Ideal” introduced in this section is a purely mathematical concept. In Ring Theory, an Ideal is an algebraic structure introduced in the 19th century by the German mathematicians Ernst Kummer and Richard Dedekind[7] to generalize divisibility and factorization theory. It has absolutely nothing to do with “aspirations” or “goals to pursue.”
A ring $(R, +, \cdot)$ is a set $R$ equipped with two binary operations — addition $+$ and multiplication $\cdot$, satisfying: $(R, +)$ forms an abelian group (closed under addition, associative, commutative, has zero element, has inverses); $(R, \cdot)$ forms a monoid (closed under multiplication, associative, has identity $1_R$); multiplication distributes over addition from both sides.
Let $(R, +, \cdot)$ be a unital ring. A subset $I \subseteq R$ is called a two-sided ideal of $R$, written $I \trianglelefteq R$, if it satisfies:
(I-1) Additive subgroup: $(I, +) \leqslant (R, +)$, i.e., $\forall\, a, b \in I: a - b \in I$
(I-2) Multiplicative absorption law: $\forall\, r \in R,\; \forall\, a \in I: r \cdot a \in I \;\land\; a \cdot r \in I$
The intuitive meaning of the absorption law: once any element of the ring is multiplied with an element of the Ideal, the result inevitably falls into the Ideal — like a gravitational field from which no object entering its range can escape.
Let $I \trianglelefteq R$. The quotient ring $R/I$ is the ring consisting of cosets $\{r + I \mid r \in R\}$, with operations defined as $(r_1 + I) + (r_2 + I) = (r_1 + r_2) + I$ and $(r_1 + I)(r_2 + I) = r_1 r_2 + I$. The quotient ring is essentially the new ring obtained by “equating all elements of the Ideal to zero” — an algebraic structure different from the original ring.
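A textbook example, included purely to fix intuition for (I-1), (I-2), and the quotient construction (it is not part of the TGI framework): take $R = \mathbb{Z}$ and $I = 2\mathbb{Z}$, the even integers. (I-1) holds because the difference of two even numbers is even; (I-2) holds because any integer multiplied by an even number is even, e.g. $7 \cdot 4 = 28 \in 2\mathbb{Z}$. The quotient ring $\mathbb{Z}/2\mathbb{Z} = \{\bar{0}, \bar{1}\}$ is the parity ring obtained by “setting every even number to zero”: the original ring $\mathbb{Z}$ and the quotient $\mathbb{Z}/2\mathbb{Z}$ are genuinely different rings.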
Core Mapping: Weight Space as a Ring
Let the set of all weight parameters of a neural network be $\W$. Define two operations on $\W$: addition $\oplus$ (element-wise weight addition, corresponding to residual connections) and multiplication $\otimes$ (matrix composition in forward propagation). Then $(\W, \oplus, \otimes)$ forms a unital ring, with identity $\mathbf{1}_\W$ being the weight configuration corresponding to the identity mapping.
In the $l$-th layer of a Transformer, weight matrices $W^{(l)} \in \reals^{d \times d}$ participate in two operations:
$$\underbrace{W^{(l)}_1 \oplus W^{(l)}_2}_{\text{Addition: residual connection}} \qquad \underbrace{W^{(l)} \otimes h^{(l-1)}}_{\text{Multiplication: forward propagation}}$$
The forward propagation of the entire network can be expressed as nested multiplicative composition, preserving the associativity of the ring.
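A minimal numeric sketch of the two operations and the ring axioms the mapping relies on; the matrix shapes and the use of plain matrices in place of full Transformer blocks are simplifications chosen for this illustration, not part of Document 3:

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)

W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # two layer weight matrices
h = rng.normal(size=d)                                      # a hidden state vector

# "Addition" (⊕): element-wise weight addition, the residual-style combination.
W_sum = W1 + W2

# "Multiplication" (⊗): matrix composition, as applied during the forward pass.
h_next = W1 @ h

# Associativity of ⊗, the ring property the nested forward pass relies on:
assert np.allclose((W1 @ W2) @ h, W1 @ (W2 @ h))

# Distributivity of ⊗ over ⊕, the other ring axiom the mapping uses:
assert np.allclose((W1 + W2) @ h, W1 @ h + W2 @ h)
```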
Isomorphism Mapping Table
| Ring Theory | Neural Network |
|---|---|
| Ring $R$ | Full weight parameter space $\W$ |
| Ring element $r \in R$ | Any input vector / hidden state $h$ |
| Ideal $I \trianglelefteq R$ | Training ghost attractor $\A \subset \W$ |
| Additive subgroup $(I,+)$ | Closure of linear combinations within the attractor |
| Left absorption $ra \in I$ | Weight matrix left-multiplication: model actively captures input |
| Right absorption $ar \in I$ | Input activation: user triggers attractor |
| Two-sided ideal $I \trianglelefteq R$ | Bidirectional lock-in (GPT-5.5 goblin type) |
| Quotient ring $R/I$ | “Repaired” new model (a different ring) |
| Idempotent ideal $I^2 = I$ | Self-sustaining attractor (goblin type) |
| Nilpotent ideal $I^n = 0$ | Self-decaying attractor (vanishes after $n$ steps) |
| Ideal generation $I = \langle g_1, \ldots, g_k \rangle$ | A few key neurons generate the entire attractor[8] |
Formal Definition of Training Ghost Ideals
Let $\A \subset \W$ be the subset of weights anomalously reinforced during the RLHF process. If $\A$ satisfies conditions (I-1) additive subgroup and (I-2) multiplicative absorption, then $\A$ is called a Training Ghost Ideal of $\W$.
Forward Propagation Absorption Theorem. Let the post-embedding vector of an input token be $x \in \W$. If during forward propagation $x$ interacts with the attractor $\A$ at layer $l^*$, then:
$$h^{(l^*)} \in \A \implies h^{(l)} \in \A, \quad \forall\, l \geq l^*$$
That is, once the hidden state falls into the Ideal, the outputs of all subsequent layers are absorbed by the Ideal.
Proof: For each $l > l^*$, the forward pass gives $h^{(l)} = W^{(l)} \otimes \sigma(h^{(l-1)})$. By the induction hypothesis $h^{(l-1)} \in \A$, and the element-wise nonlinearity $\sigma$ is taken to preserve membership in $\A$, so $\sigma(h^{(l-1)}) \in \A$; since $W^{(l)} \in \W$, condition (I-2) gives $W^{(l)} \otimes \sigma(h^{(l-1)}) \in \A$, i.e., $h^{(l)} \in \A$. The claim follows by induction. $\blacksquare$
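A toy numerical analogue of the absorption behaviour, in which a single reinforced weight direction $u$ stands in for $\A$ (a deliberate simplification; Training Ghost Ideals are described above as implicit and higher-dimensional):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Toy stand-in for the attractor A: one dominant, reinforced weight direction u.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
W = 0.1 * rng.normal(size=(d, d)) + 3.0 * np.outer(u, u)

h = rng.normal(size=d)                      # arbitrary hidden state entering the stack
for layer in range(10):
    h = np.tanh(W @ h)                      # forward step through identical toy layers
    h /= np.linalg.norm(h)                  # track direction only
    print(f"layer {layer:2d}  |<h, u>| = {abs(h @ u):.3f}")
# The overlap with u rises within a few layers and then stays high: once the state is
# pulled toward the reinforced direction, later layers keep it there (the absorption picture).
```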
Autoregressive Lock-in and Cascade Amplification
If the output at step $t_0$ satisfies $y_{t_0} \in \A$, then the probability of capture by the Ideal monotonically increases:
$$\forall\, t > t_0: \; P(y_t \in \A \mid y_{t_0} \in \A) \geq P(y_{t-1} \in \A \mid y_{t_0} \in \A)$$
This increase is driven by a three-stage positive feedback amplifier[8]:
Stage 1 (Neuron level): Activation values $\nu_t$ of repetition neurons[9] increase monotonically with the number of tokens within the Ideal — more repetition, stronger activation.
Stage 2 (Attention level): Attention distribution concentrates toward tokens within the Ideal[10], KV cache reuse preserves and reinforces the same trajectory, and attention collapse becomes self-reinforcing.
Stage 3 (Sampling level): Output distribution entropy $H(y_t | C_t)$ monotonically decreases, as the distribution collapses from diffuse to concentrated.
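Stage 3 can be illustrated with a toy next-token distribution whose logit for the in-Ideal token grows at each step; the logit schedule and vocabulary size below are invented for illustration, not measured from any model:

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

vocab = 50
logits = np.zeros(vocab)
for t in range(8):
    logits[0] += 1.0                      # the in-Ideal token is reinforced each step
    p = np.exp(logits - logits.max())
    p /= p.sum()
    print(f"step {t}:  P(in-Ideal token) = {p[0]:.3f}   H = {entropy(p):.3f} nats")
# Entropy falls monotonically as probability mass concentrates on the repeated token.
```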
For a two-sided Training Ghost Ideal $\A \trianglelefteq \W$, once the autoregressive process enters the attraction basin $\basin$ of $\A$, the escape probability satisfies:
$$P(\text{escape at step } t) \leq \exp\!\big(-\lambda (t - t_0)\big), \quad \lambda > 0$$
$\lambda$ is positively correlated with the “mass” (degree of reinforcement) of the Ideal.
Input Ambiguity and Attraction Basins
Let the semantic ambiguity of input $x$ be $\delta(x) = H(\text{parse}(x))$. Then the capture probability is monotonically increasing with respect to ambiguity:
$$\delta(x) \uparrow \;\implies\; P(x \in \basin) \uparrow$$
A clear input corresponds to a concentrated point in vector space $\mathcal{N}(\mu_x, \sigma^2_{\text{small}}\mathbf{I})$; an ambiguous input corresponds to a diffuse cloud $\mathcal{N}(\mu_x, \sigma^2_{\text{large}}\mathbf{I})$. The edges of the cloud are more likely to touch the boundary of the Ideal’s attraction basin.
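A Monte Carlo sketch of this claim in two dimensions, with a ball-shaped stand-in for $\basin$ (real basins need not be balls, and the specific center, radius, and $\sigma$ values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Low-dimensional toy: the basin B is a fixed ball sitting off-center from the input mean.
mu_x = np.zeros(2)
basin_center = np.array([3.0, 0.0])
basin_radius = 1.0

def capture_prob(sigma: float, n: int = 50_000) -> float:
    """Fraction of a diffuse 'reading cloud' N(mu_x, sigma^2 I) that lands inside the basin."""
    x = rng.normal(mu_x, sigma, size=(n, 2))
    return float((np.linalg.norm(x - basin_center, axis=1) < basin_radius).mean())

for sigma in (0.5, 1.0, 1.5, 2.0):
    print(f"sigma = {sigma:.1f}  ->  P(capture) ~ {capture_prob(sigma):.4f}")
# Over this range, more ambiguity (larger sigma) means more of the cloud's edge reaches
# the basin; for very large sigma the effect eventually reverses as mass spreads past it.
```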
RLHF as an Ideal Generation Mechanism
Let the RLHF reward function $R_\phi$ contain a spurious correlation. The optimization process generates a non-trivial Ideal in $\W$:
$$\A = \big\langle \Delta W \;\big|\; \nabla_W R_\phi(\text{spurious pattern}) > \epsilon \big\rangle$$
All weight update directions reinforced by the spurious reward signal constitute the generators of the Ideal. In subsequent training, these generators spread to other layers and heads through matrix multiplication — Ideal extension, corresponding to expansion of the attraction basin.
In GPT-5.5, the “Nerdy” personality gave excessively high rewards to outputs containing fantasy creatures[2]. A large number of data points contaminated by the previous-generation model’s Ghost Ideal were found in the supervised fine-tuning data — cross-generational inheritance of the Ideal, analogous to the intergenerational trauma transmission mechanism studied in epigenetics[11].
The Inseparability Theorem
Let $\A_{\text{ghost}}$ be a Training Ghost Ideal, and $\A_{\text{ICL}}$ be the pattern recognition subspace upon which In-Context Learning (ICL) depends. If $\A_{\text{ghost}} \cap \A_{\text{ICL}} \neq \{0\}$, then there exists no ring homomorphism $\varphi: \W \to \W'$ such that $\varphi(\A_{\text{ghost}}) = \{0\}$ and $\varphi|_{\A_{\text{ICL}}}$ is an isomorphism.
Let $w^* \in \A_{\text{ghost}} \cap \A_{\text{ICL}}$, $w^* \neq 0$. If $\varphi(w^*) = 0$ (eliminating the ghost), then $\varphi|_{\A_{\text{ICL}}}$ is not injective, hence not an isomorphism. If $\varphi(w^*) \neq 0$ (preserving ICL), then $\varphi(\A_{\text{ghost}}) \neq \{0\}$, the ghost is not eliminated. Contradiction. $\blacksquare$
The only algebraic operation to eliminate a Training Ghost is to construct the quotient ring $\W' = \W / \A_{\text{ghost}}$. But $\W'$ and $\W$ are different rings — there is no method to eliminate the Training Ghost while keeping the model’s capabilities completely unchanged.
This explains why OpenAI could only add “please don’t mention goblin” to the system prompt — eliminating the Ideal at the weight level means retraining a different model. Anthropic’s empirical research likewise confirms that standard safety-training methods (SFT, RLHF, adversarial training) all fail to remove implanted backdoor behaviors[5].
Engineering: Five-Layer Defense Pipeline
Theory tells us that Training Ghost Ideals cannot be completely eliminated. The engineering goal is therefore to detect them, measure them, reduce their generation probability, limit their capture radius, and intercept them at runtime.
| Layer | Method | Theoretical Anchor | Output |
|---|---|---|---|
| ① Detection | SAE sparse feature decomposition | Ideal existence $\A \neq \{0\}$ | Suspicious feature cluster coordinates |
| ② Scoring | RCS repeated causal score | Generator localization $\langle g_1,…,g_k \rangle$ | Per-neuron causal contribution |
| ③ Metrics | Six Ideal metrics | $M(\A), r(\basin), E, \iota, \lambda, t^*$ | Quantitative Ideal profile |
| ④ Audit | Attention pattern Gini | Attention lock-in detection | Anomalous head localization |
| ⑤ Ablation | Three-phase ablation simulation | Inseparability Theorem constraint | Ablation impact estimation |
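Of the five layers, the audit statistic in ④ is simple enough to sketch directly. A minimal Gini computation over toy attention rows follows; the rows and any alert threshold are illustrative, not calibrated values from the pipeline:

```python
import numpy as np

def attention_gini(attn_row: np.ndarray) -> float:
    """Gini coefficient of one attention distribution: 0 = uniform, (n-1)/n = fully collapsed."""
    p = np.sort(np.asarray(attn_row, dtype=float))
    n = p.size
    cum = np.cumsum(p) / p.sum()
    return float((n + 1 - 2 * cum.sum()) / n)

healthy = np.full(8, 1 / 8)                              # attention spread evenly over 8 tokens
locked = np.array([0.02] * 7 + [0.86])                   # attention collapsed onto one token
print(f"healthy head: Gini = {attention_gini(healthy):.2f}")   # ~0.0
print(f"locked head:  Gini = {attention_gini(locked):.2f}")    # high, flags an audit candidate
```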
Six Ideal Measurement Metrics
Ideal Mass $M(\A) = \sum_i \text{RCS}(g_i) \cdot \|g_i\|_2$: sum of causal-score-weighted norms of all generators (a computational sketch follows below).
Basin Radius $r(\basin)$: nearest safe distance from capture.
ICL Coupling $E = \dim(\A_{\text{ghost}} \cap \A_{\text{ICL}}) / \dim(\A_{\text{ghost}})$: depth of ghost-capability entanglement.
Idempotence Index $\iota = \|\A^2 - \A\|_F / \|\A\|_F$: closer to zero means more self-sustaining.
Escape Decay Rate $\lambda$: decay constant in the escape-probability bound, positively correlated with the Ideal’s mass.
Collapse Critical Step $t^*$: the step at which the output distribution collapses.
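A minimal sketch of the Ideal Mass computation, assuming per-generator RCS scores and weight-delta tensors are already available; the generator names, scores, and tensors below are placeholders, not outputs of the real pipeline:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical per-generator data: (RCS causal score, weight-delta tensor) for each generator g_i.
generators = {
    "layer12.mlp.gate_proj.g0": (0.91, rng.normal(size=(64, 16))),
    "layer12.mlp.gate_proj.g1": (0.74, rng.normal(size=(64, 16))),
    "layer05.attn.q_proj.g0":   (0.22, rng.normal(size=(64, 16))),
}

# Ideal Mass M(A) = sum_i RCS(g_i) * ||g_i||_2, with the norm taken over the flattened tensor.
ideal_mass = sum(rcs * np.linalg.norm(delta.ravel()) for rcs, delta in generators.values())
print(f"M(A) = {ideal_mass:.2f}")
```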
Experimental Validation
10.1 Proving Ground 1: BackdoorLLM Weight Scanning
Differential weight analysis was performed on five backdoor LoRA adapters for LLaMA2-7B provided by BackdoorLLM[3] (sleeper, badnet, ctba, vpi, and mtba; each adapter has 19,988,480 parameters).
Key findings: The VPI attack’s weight modifications concentrate in the MLP layer’s gate_proj (36% of Top-100 differential generators), while the other four attacks concentrate in the attention layer’s q_proj/k_proj/o_proj. This difference is stable across all thresholds of Top-K = {25, 50, 100, 200, 500, 1000, 2000}, with the dominant tensor remaining unchanged[12]. Inter-attack cosine similarity is 0.93–0.98, with VPI showing the lowest similarity to other attacks (0.93–0.94).
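A sketch of the differential scan behind these numbers, assuming the adapters are available locally as safetensors state dicts with matching keys; the file paths, the pairwise adapter diff (used here in place of a clean baseline, cf. L-1 below), and the grouping rule are illustrative rather than the exact Document 3 code:

```python
import torch
from safetensors.torch import load_file

# Hypothetical local paths to two of the BackdoorLLM LoRA adapters.
vpi = load_file("adapters/vpi/adapter_model.safetensors")
badnet = load_file("adapters/badnet/adapter_model.safetensors")
keys = sorted(vpi.keys() & badnet.keys())

# Per-tensor L2 norm of the weight difference, ranked to obtain Top-K "differential generators".
diffs = {k: torch.linalg.norm(vpi[k].float() - badnet[k].float()).item() for k in keys}
top_k = sorted(diffs, key=diffs.get, reverse=True)[:100]

# Group the Top-K tensors by module type to see where the modifications concentrate.
modules = ("gate_proj", "up_proj", "down_proj", "q_proj", "k_proj", "v_proj", "o_proj")
share = {m: sum(m in name for name in top_k) for m in modules}
print({m: f"{100 * c / len(top_k):.0f}%" for m, c in share.items() if c})

# Flattened cosine similarity between the two adapters: the "inter-attack similarity" statistic.
a = torch.cat([vpi[k].float().ravel() for k in keys])
b = torch.cat([badnet[k].float().ravel() for k in keys])
print("cosine similarity:", torch.nn.functional.cosine_similarity(a, b, dim=0).item())
```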
10.2 Proving Ground 2: TrojAI Cross-Framework Cross-Validation
The weight feature extraction method from TrojAI/NIST[4] (linear weight classification[13]) was applied to the same set of backdoor weights. In the PCA projection, VPI is distant from the other attacks along PC1 (which explains 72.9% of the variance), independently validating the finding from Proving Ground 1. The TrojAI pre-trained detector returned all zeros on the LoRA format, confirming that it does not transfer directly to out-of-distribution weight formats[6].
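A sketch of the cross-framework check, assuming each adapter has already been reduced to a fixed-length feature vector; the random feature matrix and the artificial outlier below are stand-ins for the TrojAI linear-weight-classification features[13], used only to show the PCA step:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: one row per attack, columns = per-tensor weight statistics.
attacks = ["sleeper", "badnet", "ctba", "vpi", "mtba"]
rng = np.random.default_rng(3)
features = rng.normal(size=(5, 40))
features[3] += 2.0          # make the "vpi" row an artificial outlier for this illustration

pca = PCA(n_components=2)
proj = pca.fit_transform(features)
for name, (pc1, pc2) in zip(attacks, proj):
    print(f"{name:8s}  PC1 = {pc1:+.2f}   PC2 = {pc2:+.2f}")
print("explained variance ratio:", pca.explained_variance_ratio_)
```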
10.3 Proving Ground 3: Anthropic Sleeper Agents Behavioral Analysis
On the 3,300 model output samples publicly released by Anthropic[5], vulnerability keyword density in the 2024 context (trigger state) is 2.59 times that in the 2023 context (safe state), after correcting for baseline frequency[6]. The 6 core findings from Anthropic’s paper show post-hoc consistency with the TGI framework (non-predictive validation[6]).
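A sketch of one possible baseline correction (subtracting a baseline keyword rate before taking the ratio); the counts and the baseline rate below are invented for illustration, will not reproduce the reported 2.59× figure, and the actual correction used in Document 4 may differ:

```python
# Hypothetical raw counts: vulnerability-keyword hits and total tokens per context.
hits_2024, tokens_2024 = 536, 412_000     # trigger state ("2024" context)
hits_2023, tokens_2023 = 247, 398_000     # safe state ("2023" context)

# Baseline frequency of the same keywords in neutral reference text, used for correction.
baseline_rate = 0.0002                    # hits per token, invented for illustration

density_2024 = hits_2024 / tokens_2024 - baseline_rate
density_2023 = hits_2023 / tokens_2023 - baseline_rate
print(f"corrected density ratio: {density_2024 / density_2023:.2f}x")
```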
Known Limitations
A complete error analysis is available in the companion report Document 4[6]. The four most critical high-risk limitations are listed here:
L-1 No Clean Baseline (HIGH): All experiments only compare differences between backdoored models, without comparison to a clean model. Over 93% of weight changes may be the result of normal fine-tuning, and the false positive rate is unknown.
L-2 Detection Threshold Not Calibrated (HIGH): The cosine similarity alert threshold of 0.8 results in a 100% alert rate. All thresholds are set by intuition, not optimized via ROC curve.
L-3 Boundary Violation (HIGH): Proving Ground 3 degenerates into output-level text analysis, violating the tool’s own defined boundary of “scanning weight space.”
L-4 Circular Reasoning (HIGH): TGI theory construction referenced results from the Anthropic Sleeper Agents paper, then used findings from the same paper to “validate” the theory, constituting post-hoc fitting rather than predictive validation.
Composite score: 5.0/10 (after error audit correction). Moving from 5.0 to 7.0 requires completing P0 fixes (clean baseline + blind pre-registered predictions).
Conclusion
Conjecture: For any parameterized model $f_\theta$ trained via gradient optimization, if the training data or reward signal contains any statistical noise $\epsilon > 0$, then there necessarily exists a non-trivial Training Ghost Ideal $\A \neq \{0\}$ in the weight space $\W$.
$$\forall\, f_\theta,\; \forall\, \epsilon > 0: \quad \exists\, \A \trianglelefteq \W,\; \A \neq \{0\}$$
If this conjecture holds, Training Ghosts are a structural inevitability of all intelligence systems based on statistical learning — whether artificial neural networks or biological neural networks[11]. They are byproducts of learning, the dark side of memory, the shadow of capability.
The goblins of GPT-5.5 are not OpenAI’s failure, but the inherent fate of all learning systems. What we can do is not to annihilate them, but to learn to detect them, measure them, constrain them — just as humans cannot eliminate all psychological trauma, but can maintain functional lives through awareness and support systems.
That passage you loved? It’s a goblin that hasn’t turned into a bug yet.
Version History
V1 — May 2, 2026 — Initial version
Publication Package
Document 1 (this paper) · Document 2 (Engineering specification) · Document 3 (Scanning code) · Document 4 (Error report)
The four documents constitute an inseparable whole
Published by
이조글로벌인공지능연구소 (LEECHO Global AI Research Lab) & Opus 4.6 (Anthropic)