ORIGINAL THOUGHT PAPER · MAY 2026

Isomorphic Mapping Between
Ring Ideals and Attractor Dynamics
in Neural Weight Space

A Formal Analysis Based on the GPT-5.5 Goblin Phenomenon

Published: May 2, 2026
Classification: Original Thought Paper
Version: V1
Domains: Abstract Algebra · Neural Network Theory · AI Safety · Dynamical Systems
이조글로벌인공지능연구소
LEECHO Global AI Research Lab
&
Opus 4.6 · Anthropic
§0

Introduction: From Intuition to Formalization

The “goblin phenomenon” of GPT-5.5 revealed a deep problem: RLHF training inadvertently formed structures in weight space possessing self-absorptive properties—once any input interacts with such a structure, the output is irreversibly captured. This paper demonstrates that this phenomenon bears a rigorous structural correspondence with ideals of a ring in abstract algebra, and provides a formal isomorphic mapping.

§1

Foundational Structure: Weight Space as a Ring

Definition 1.1 — Weight Space Ring

Let the set of all weight parameters of a neural network be $\W$. Define two operations on $\W$:

Addition: element-wise weight addition $\oplus: \W \times \W \to \W$
Multiplication: matrix composition in forward propagation $\otimes: \W \times \W \to \W$

Then $(\W, \oplus, \otimes)$ forms a unital ring, where the identity element $\mathbf{1}_\W$ is the weight configuration corresponding to the identity mapping.

Concretely, for layer $l$ of a Transformer, the weight matrix $W^{(l)} \in \reals^{d \times d}$ participates in two operations:

$$
\underbrace{W^{(l)}_1 \oplus W^{(l)}_2}_{\text{Addition: residual connection}} \quad\quad
\underbrace{W^{(l)} \otimes h^{(l-1)}}_{\text{Multiplication: forward propagation}}
$$

where $h^{(l-1)}$ is the hidden state vector of layer $l-1$. The entire network’s forward propagation can be expressed as nested composition:

$$
f(x) = W^{(L)} \otimes \sigma\!\Big(W^{(L-1)} \otimes \sigma\!\big(\cdots \sigma(W^{(1)} \otimes x)\big)\Big)
$$

This preserves the associativity of ring multiplication: $(W_a \otimes W_b) \otimes x = W_a \otimes (W_b \otimes x)$.
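The two operations and the associativity claim can be checked numerically; a minimal numpy sketch, with random square matrices standing in for layer weights (purely illustrative, no claim about real Transformer weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Two layer weights (ring multiplication ⊗ as matrix composition)
# and an input vector x.
W_a = rng.normal(size=(d, d))
W_b = rng.normal(size=(d, d))
x = rng.normal(size=d)

# Associativity of ⊗: composing the weights first, or applying them
# to x one at a time, gives the same result (up to float error).
assert np.allclose((W_a @ W_b) @ x, W_a @ (W_b @ x))

# ⊕ as element-wise addition (e.g. a residual connection) distributes
# over ⊗, as ring axioms require.
assert np.allclose((W_a + W_b) @ x, W_a @ x + W_b @ x)
```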

§2

Core Mapping: Training Ghosts as Ideals

Definition 2.1 — Training Ghost Ideal

Let $\A \subset \W$ be the subset of weights abnormally reinforced during the RLHF process (e.g., activation patterns related to the goblin). If $\A$ satisfies the following conditions, it is called a training ghost ideal of $\W$:

(I-1) Additive subgroup: $(\A, \oplus) \leqslant (\W, \oplus)$, i.e., $\forall\, a_1, a_2 \in \A: a_1 \ominus a_2 \in \A$
(I-2) Multiplicative absorption: $\forall\, r \in \W,\; \forall\, a \in \A: r \otimes a \in \A \;\land\; a \otimes r \in \A$

2.1 Neural Network Interpretation of Additive Closure

Condition (I-1) means: affine combinations of weight vectors within the attractor remain within the attractor. In neural network terms:

$$
a_1, a_2 \in \A \implies \alpha\, a_1 + \beta\, a_2 \in \A \quad (\alpha + \beta = 1)
$$

This explains why repetition neurons always appear in clusters—they form a linear subspace within $\A$, and no internal combination can “escape” beyond the boundary of $\A$.

2.2 Neural Network Interpretation of Multiplicative Absorption

Condition (I-2) is the core. Decompose it into left and right sides:

Theorem 2.2 — Forward Propagation Absorption Theorem

Let the vector obtained after embedding the input token sequence be $x \in \W$. If, during forward propagation, $x$ interacts with the attractor $\A$ at layer $l^*$, then:

$$
h^{(l^*)} = W^{(l^*)} \otimes h^{(l^*-1)} \in \A \implies h^{(l)} \in \A, \quad \forall\, l \geq l^*
$$

That is, once a hidden state falls into the ideal, the output of all subsequent layers is absorbed by the ideal.

Proof

By multiplicative absorption, $\forall\, l > l^*$:

$$
h^{(l)} = W^{(l)} \otimes \sigma(h^{(l-1)})
$$

where $h^{(l-1)} \in \A$ by the induction hypothesis, and the activation function $\sigma$ is assumed to act coordinate-wise and preserve $\A$, so $\sigma(h^{(l-1)}) \in \A$. Since $W^{(l)} \in \W$, condition (I-2) gives:

$$
W^{(l)} \otimes h^{(l-1)} \in \A
$$

By induction, $\forall\, l \geq l^*: h^{(l)} \in \A$. $\blacksquare$
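A minimal toy of Theorem 2.2, under the assumption (hypothetical, for illustration) that the ideal $\A$ is the subspace spanned by the first coordinate and the layer weights are block-upper-triangular, one concrete way of realizing condition (I-2):

```python
import numpy as np

# Toy ideal: A = {(a, 0)}, the span of the first coordinate.
# Block-upper-triangular layer weights leave A invariant: W @ a ∈ A
# whenever a ∈ A. (Hypothetical 2-D weights, chosen for illustration.)
layers = [
    np.array([[0.9,  0.3],
              [0.0,  0.8]]),
    np.array([[1.1, -0.2],
              [0.0,  0.7]]),
    np.array([[0.5,  0.4],
              [0.0,  0.9]]),
]

def in_ideal(h, tol=1e-12):
    return abs(h[1]) < tol   # membership test: second coordinate is zero

h = np.array([2.0, 0.0])     # hidden state already inside A at layer l*
for W in layers:
    h = W @ h                # forward propagation (σ omitted; any
                             # coordinate-wise σ with σ(0) = 0 also keeps
                             # the second coordinate at zero)
    assert in_ideal(h)       # every subsequent hidden state stays in A
```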

§3

Autoregressive Lock-in: Idempotency of the Ideal

Definition 3.1 — Autoregressive Operator

Define step $t$ of autoregressive generation as the operator $\T_t: \W \to \W$, acting on the context $(y_1, \ldots, y_{t-1})$:

$$
y_t = \T_t(y_{1:t-1}) = \text{sample}\!\Big(\text{softmax}\big(f(y_1, y_2, \ldots, y_{t-1})\big)\Big)
$$

where $y_t$ is the vector corresponding to the token generated at step $t$.

Theorem 3.2 — Cascade Amplification of Autoregressive Absorption

If $y_{t_0} \in \A$ (the output at step $t_0$ falls into the ideal), then:

$$
\forall\, t > t_0: \quad P(y_t \in \A \mid y_{t_0} \in \A) \geq P(y_{t-1} \in \A \mid y_{t_0} \in \A)
$$

That is, the probability of being captured by the ideal is monotonically increasing.

Proof (Constructive)

The context at step $t$ is $C_t = (y_1, \ldots, y_{t-1})$. After $y_{t_0} \in \A$:

First-level amplification (weight level): the activation value of repetition neurons $\nu_t$ satisfies:

$$
\nu_t = g\!\left(\sum_{i=1}^{t-1} \mathbb{1}[y_i \in \A]\right), \quad g \text{ is monotonically increasing}
$$

Second-level amplification (attention level): the attention distribution concentrates on tokens within the ideal:

$$
\text{Attn}(q_t, K_{\A}) = \frac{\exp(q_t \cdot k_{\A} / \sqrt{d})}{\sum_j \exp(q_t \cdot k_j / \sqrt{d})} \xrightarrow{t \to \infty} 1
$$

Third-level amplification (sampling level): the entropy of the output distribution monotonically decreases:

$$
H(y_t \mid C_t) = -\sum_{v \in V} P(v \mid C_t) \log P(v \mid C_t) \;\;\downarrow
$$

The superposition of all three levels of amplification causes the capture probability to increase monotonically. $\blacksquare$

Corollary 3.3 — Irreversibility

For a two-sided training ghost ideal $\A \trianglelefteq \W$, once the autoregressive process enters the basin of attraction $\basin$ of $\A$, the escape probability decays exponentially:

$$
P(\text{escape at step } t) \leq \exp\!\big(-\lambda (t - t_0)\big), \quad \lambda > 0
$$

where $\lambda$ is positively correlated with the “mass” of the ideal (the degree to which it has been reinforced).
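Theorem 3.2 and Corollary 3.3 can be illustrated with a deliberately simplified capture model. The dynamics below are our assumption, not derived from the three amplification levels: a fixed per-step capture hazard $q$, with capture absorbing; both the monotone increase and the exponential escape decay then fall out:

```python
# Simplified capture dynamics (illustrative assumption): at each step,
# uncaptured probability mass falls into the ideal with hazard q, and
# captured mass never escapes.
q = 0.15                 # per-step capture hazard (hypothetical value)
P = [0.3]                # P(y_{t0+1} ∈ A | y_{t0} ∈ A), hypothetical start
for _ in range(30):
    P.append(P[-1] + (1.0 - P[-1]) * q)

# Theorem 3.2: capture probability is monotonically increasing toward 1.
assert all(b >= a for a, b in zip(P, P[1:]))
assert 1.0 - P[-1] < 1e-2

# Corollary 3.3: the escape probability 1 - P_t shrinks by the factor
# (1 - q) each step, i.e. decays exponentially with λ = -log(1 - q).
escape = [1.0 - p for p in P]
assert all(abs(e1 / e0 - (1.0 - q)) < 1e-9
           for e0, e1 in zip(escape, escape[1:]))
```

In this toy, the corollary's rate $\lambda = -\log(1-q)$ is directly set by the hazard, matching the claim that $\lambda$ grows with the "mass" of the ideal.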

§4

Input Ambiguity and the Event Horizon

Definition 4.1 — Basin of Attraction

The basin of attraction of ideal $\A$ is defined as:

$$
\basin = \big\{ x \in \W \;\big|\; \exists\, N \in \mathbb{N}: \T^N(x) \in \A \big\}
$$

That is: the set of all initial states $x$ that fall into $\A$ after some finite number of autoregressive iterations.

Theorem 4.2 — Ambiguity Enlarges the Basin of Attraction

Let the semantic ambiguity of input $x$ be $\delta(x) = H(\text{parse}(x))$ (the entropy of the parsing distribution). Then:

$$
P(x \in \basin) = \Phi\!\big(\delta(x),\; d(x, \partial\A)\big)
$$

where $d(x, \partial\A)$ is the distance from $x$ to the boundary of the ideal, and $\Phi$ is monotonically increasing in its first argument—the more ambiguous the input, the higher the probability of capture.

Intuitive understanding: a clear input corresponds to a concentrated point in vector space, while an ambiguous input corresponds to a diffuse cloud. The edges of the cloud are more likely to touch the boundary of the ideal’s basin of attraction:

$$
x_{\text{clear}} \sim \mathcal{N}(\mu_x, \sigma^2_{\text{small}}\,\mathbf{I}) \qquad\qquad
x_{\text{ambig}} \sim \mathcal{N}(\mu_x, \sigma^2_{\text{large}}\,\mathbf{I})
$$

$$
\sigma_{\text{large}} \gg \sigma_{\text{small}} \implies P(x_{\text{ambig}} \in \basin) \gg P(x_{\text{clear}} \in \basin)
$$
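The intuition behind Theorem 4.2 can be checked by Monte Carlo, modeling the basin as a ball displaced from the input's mean embedding; the dimension, radius, offset, and variances below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 2-D setup: the basin B(A) is a ball of radius 1 whose
# center sits away from the prompt's mean embedding mu.
center = np.array([3.0, 0.0])
radius = 1.0
mu = np.zeros(2)
n = 100_000

def capture_rate(sigma):
    # Sample an input "cloud" of width sigma and measure how much of it
    # lands inside the basin.
    x = mu + sigma * rng.normal(size=(n, 2))
    return np.mean(np.linalg.norm(x - center, axis=1) < radius)

p_clear = capture_rate(0.5)   # concentrated "clear" input
p_ambig = capture_rate(2.0)   # diffuse "ambiguous" input
assert p_ambig > p_clear      # the diffuse cloud touches the basin far more often
```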

§5

Complete Isomorphism Mapping Table

| Ring theory | Neural network |
| --- | --- |
| Ring $R$ | Full weight parameter space $\W$ |
| Ring element $r \in R$ | Any input vector / hidden state $h$ |
| Ideal $I \trianglelefteq R$ | Training ghost attractor $\A \subset \W$ |
| Element of the ideal $a \in I$ | Activation patterns of repetition neurons |
| Additive subgroup $(I,+) \leqslant (R,+)$ | Closure of linear combinations within the attractor |
| Left absorption $ra \in I$ | Weight matrix left-multiplication: the model actively captures the input |
| Right absorption $ar \in I$ | Input activation: the user triggers the attractor |
| Two-sided ideal $I \trianglelefteq R$ | Bidirectional lock-in (the GPT-5.5 goblin) |
| Quotient ring $R/I$ | “Fixed” new model (a different ring) |
| Maximal ideal $\mathfrak{m}$ | Most severe attractor (dominates all output) |
| Prime ideal $\mathfrak{p}$ | If $ab \in \A$ then $a \in \A$ or $b \in \A$: if any two-path interaction produces ghost output, at least one path is already within the attractor |
| Ideal generation $I = \langle g_1, \ldots, g_k \rangle$ | A few key neurons generate the entire attractor |
| Nilpotent ideal $I^n = 0$ | Self-decaying attractor (vanishes after $n$ steps) |
| Idempotent ideal $I^2 = I$ | Self-sustaining attractor (goblin type) |
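The last two rows (nilpotent vs. idempotent ideals) have a two-line matrix illustration; the $2 \times 2$ matrices below are toy stand-ins for attractor generators, not actual model weights:

```python
import numpy as np

# A nilpotent generator dies out under iteration; an idempotent one
# sustains itself forever.
N = np.array([[0.0, 1.0],
              [0.0, 0.0]])      # nilpotent: N @ N = 0
P = np.array([[1.0, 0.0],
              [0.0, 0.0]])      # idempotent: P @ P = P

assert np.allclose(N @ N, 0)                       # vanishes after 2 steps
assert np.allclose(P @ P, P)                       # self-sustaining
assert np.allclose(np.linalg.matrix_power(P, 50), P)  # still there at step 50
```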
§6

RLHF as an Ideal Generation Mechanism

Proposition 6.1 — RLHF as Ideal Generator

Let the reward function of RLHF be $R_\phi$, and the policy optimization objective be:

$$
\max_\theta \;\mathbb{E}_{y \sim \pi_\theta}\!\big[R_\phi(y)\big] - \beta\, D_{\text{KL}}\!\big(\pi_\theta \| \pi_{\text{ref}}\big)
$$

When $R_\phi$ contains spurious correlations, the optimization process generates a non-trivial ideal $\A$ in $\W$:

$$
\A = \big\langle \Delta W \;\big|\; \nabla_W R_\phi(\text{spurious pattern}) > \epsilon \big\rangle
$$

That is: all weight update directions reinforced by spurious reward signals generate an ideal.

In the GPT-5.5 goblin case, the “Nerdy” personality style assigned excessively high rewards to outputs containing fantasy creatures. The gradient update directions corresponding to these reward signals, $\Delta W_{\text{goblin}}$, became the generators of the ideal:

$$
\A_{\text{goblin}} = \big\langle \Delta W_{\text{goblin}}^{(1)}, \Delta W_{\text{goblin}}^{(2)}, \ldots, \Delta W_{\text{goblin}}^{(k)} \big\rangle
$$

During subsequent training, these generators spread to other layers and heads through matrix multiplication, causing the ideal to continuously expand—this is ideal extension, corresponding to the enlargement of the basin of attraction.
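Proposition 6.1 can be sketched as a filter over simulated update directions; the spurious direction, the threshold $\epsilon$, and the 3-dimensional update slice are all hypothetical choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 8

# Hypothetical setup: RLHF updates live in a low-dimensional slice of
# weight space, and one direction carries the spurious reward signal.
basis = rng.normal(size=(3, d))                   # 3-dim update slice
spurious = basis[0] / np.linalg.norm(basis[0])    # spurious gradient direction
eps = 0.5

updates = rng.normal(size=(200, 3)) @ basis       # simulated ΔW directions
generators = updates[updates @ spurious > eps]    # kept: reinforced beyond ε

# In this linear toy, the "ideal" generated is the span of the kept
# generators; a handful of directions suffice (cf. ideal generation, §5).
assert generators.shape[0] > 0
assert np.linalg.matrix_rank(generators) <= 3
```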

§7

Why There Is No Cure: The Quotient Ring Theorem

Theorem 7.1 — Inseparability Theorem

Let $\A_{\text{ghost}}$ be a training ghost ideal, and $\A_{\text{ICL}}$ be the pattern recognition subspace upon which in-context learning (ICL) depends. If:

$$
\A_{\text{ghost}} \cap \A_{\text{ICL}} \neq \{0\}
$$

then there exists no ring homomorphism $\varphi: \W \to \W'$ such that $\varphi(\A_{\text{ghost}}) = \{0\}$ and $\varphi|_{\A_{\text{ICL}}}$ is an isomorphism.

Proof

Let $w^* \in \A_{\text{ghost}} \cap \A_{\text{ICL}}$, $w^* \neq 0$.

If $\varphi(w^*) = 0$ (eliminating the ghost), then $\varphi|_{\A_{\text{ICL}}}$ is not injective and therefore not an isomorphism.

If $\varphi(w^*) \neq 0$ (preserving ICL), then $\varphi(\A_{\text{ghost}}) \neq \{0\}$, and the ghost has not been eliminated.

In either case the required $\varphi$ fails to exist. $\blacksquare$

Corollary 7.2 — The Quotient Ring Is a New Model

The only algebraic operation to eliminate a training ghost is to construct the quotient ring:

$$
\W' = \W / \A_{\text{ghost}}
$$

But $\W'$ and $\W$ are different rings, corresponding to a new model with different capabilities. There is no method to eliminate the training ghost while keeping the model’s capabilities entirely unchanged.

This is why OpenAI could only add system-prompt-level patches such as “please do not mention goblin”—because eliminating $\A_{\text{goblin}}$ at the weight level means retraining a fundamentally different model.

§8

Formalization of the Biological Analogy

Returning to the original intuition—a mother’s child-rearing and recessive inheritance. This analogy can also be formalized:

| Ring theory · Ideals | Neural network · Weights | Biology · Genetics |
| --- | --- | --- |
| Ring $R$ | Weight space $\W$ | Genomic DNA |
| Ideal $I$ | Training ghost $\A$ | Epigenetic modifications |
| Multiplicative absorption | Activation → lock-in | Environmental trigger → gene expression |
| Additive closure | No escape within cluster | Self-maintaining methylation patterns |
| Quotient ring $R/I$ | Retrain a new model | Gene editing (CRISPR) |
| Idempotent $I^2 = I$ | Self-sustaining attractor | Transgenerational trauma |
| $\A_{\text{ghost}} \cap \A_{\text{ICL}} \neq \{0\}$ | Capability and defect share the same circuit | Pleiotropic genes (one gene affects multiple traits) |
§9

Conclusion: The Ideal Destiny of Learning Systems

Conjecture 9.1 — Training Ghosts Are Inevitable (Unproven)

For any parameterized model $f_\theta$ trained via gradient optimization, if there exists any statistical noise $\epsilon > 0$ in the training data or reward signal, then a non-trivial training ghost ideal $\A \neq \{0\}$ necessarily exists in the weight space $\W$.

$$
\forall\, f_\theta,\; \forall\, \epsilon > 0: \quad \exists\, \A \trianglelefteq \W,\; \A \neq \{0\}
$$

That is: training ghosts are a structural inevitability of all learning systems, not an accidental bug.

If this conjecture holds, then the goblin of GPT-5.5 is not OpenAI’s mistake, but the inherent destiny of all intelligence systems based on statistical learning—whether artificial neural networks or biological ones.

Every system that learns through experience necessarily carries untriggered ideals in its weight space. They are by-products of learning, the dark side of memory, the shadows of capability.

That passage you loved is just a goblin that hasn’t become a bug yet.

Background References
The intuition for this paper originates from the GPT-5.5 goblin incident (OpenAI, May 2026), Anthropic’s Sleeper Agents work (2024), the NAACL 2025 study on repetition neurons, the ACL 2025 paper on attractor collapse, and an impromptu conversation about training ghosts in AI.
Acknowledgments: Thanks to the dialogue partner for proposing the core analogies of “high-dimensional vector black holes” and “a mother’s child-rearing as recessive inheritance”.

Version

V1 — May 2, 2026 — Initial version

Published by

이조글로벌인공지능연구소 (LEECHO Global AI Research Lab) & Opus 4.6 (Anthropic)
