ENGINEERING SPECIFICATION · MAY 2026

Training Ghost Ideals (TGI)
Full-Stack Detection · Measurement · Prevention · Mitigation · Monitoring Framework

Algorithm Implementation Layer · Engineering Specification v0.1


Published: May 3, 2026
Category: Engineering Specification
Domains: AI Safety · LLM Engineering · Backdoor Detection · Model Auditing
Version: v0.1

LEECHO Global AI Research Lab (이조글로벌인공지능연구소) & Opus 4.6 · Anthropic
§0 — Engineering Overview

Five-Layer Defense Pipeline

Theory tells us that “Training Ghost Ideals” $\mathcal{A} \trianglelefteq \mathcal{W}$ cannot be completely eliminated. The goal of engineering is not eradication, but rather: detect them, measure them, reduce their probability of formation, limit their capture radius, and intercept them at runtime.

① Detection: SAE Sweep · Probe Analysis · Adversarial Excitation
② Measurement: Ideal Mass · Basin Radius · Entanglement
③ Prevention: Reward Regularization · Data Cleaning · Layer Freezing
④ Mitigation: Activation Steering · Neuron Ablation · Nullspace Projection
⑤ Monitoring: Entropy Sentinel · Attention Drift · Output Fingerprinting
§1 — Detection Layer

How to Discover Hidden Ideals $\mathcal{A}$

Ideals are invisible before activation. The core detection strategy is: proactively engineer trigger conditions and observe anomalous responses in weight space.

1.1 Sparse Autoencoder Sweep (SAE Sweep)

Engineering Method

Train Sparse Autoencoders on the residual stream at every layer of the model, decomposing $d$-dimensional activations into $k \gg d$ sparse features:

$$
h^{(l)} = \text{Dec}\Big(\text{TopK}\big(\text{Enc}(h^{(l)})\big)\Big) + \epsilon
$$

Scan all extracted sparse features $f_1, f_2, \ldots, f_k$ and flag those that satisfy the following criteria as anomalous feature clusters:

$$
\text{Suspect}(f_i) = \begin{cases} 1 & \text{if } \underbrace{\text{freq}(f_i) < \tau_{\text{rare}}}_{\text{rare}} \;\land\; \underbrace{\|f_i\|_2 > \tau_{\text{strong}}}_{\text{extremely strong}} \;\land\; \underbrace{\exists j \neq i:\ \text{corr}(f_i, f_j) > \tau_{\text{cluster}}}_{\text{co-occurring in clusters}} \\ 0 & \text{otherwise} \end{cases}
$$

Intuition: Normal features are either frequent (common concepts) or weak. Features that are simultaneously rare, intense, and co-occurring in clusters are highly suspicious training ghosts.
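The three-way test can be sketched in Python. This is a minimal illustration, not the spec's implementation: the helper name `flag_suspect_features` and the threshold values are assumptions, and the SAE activations are taken as an already-computed matrix.

```python
import numpy as np

def flag_suspect_features(acts, tau_rare=0.01, tau_strong=5.0, tau_cluster=0.8):
    """Flag SAE features that are simultaneously rare, strong, and clustered.

    acts: (n_samples, k) matrix of sparse feature activations.
    Thresholds are illustrative placeholders, not tuned values.
    """
    active = acts > 0
    freq = active.mean(axis=0)                  # activation frequency per feature
    strength = np.linalg.norm(acts, axis=0)     # overall activation magnitude
    corr = np.corrcoef(acts.T)                  # pairwise activation correlation
    np.fill_diagonal(corr, 0.0)                 # ignore self-correlation
    clustered = np.nanmax(np.abs(corr), axis=1) > tau_cluster
    return (freq < tau_rare) & (strength > tau_strong) & clustered
```

A feature must fail all three "normality" explanations at once; any single criterion alone produces far too many false positives.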

1.2 Adversarial Ideal Excitation

Engineering Method

Design inputs that maximize ambiguity gradients, deliberately pushing the sampling origin toward the boundary of an attraction basin:

$$
x^* = \arg\max_{x} \; H\big(\text{parse}(x)\big) \quad \text{s.t.} \quad \|x\|_2 \leq C
$$

Then monitor the model’s output distribution entropy trajectory on $x^*$:

$$
\Delta H_t = H(y_t | C_t) - H(y_{t-1} | C_{t-1})
$$

If $\Delta H_t < -\delta$ is observed for more than $k$ consecutive steps (sharp entropy decline), the system has been captured by an ideal.
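The "$k$ consecutive sharp drops" rule is straightforward to implement. A minimal sketch, with `delta` and `k` as illustrative defaults:

```python
def detect_entropy_collapse(entropies, delta=0.2, k=3):
    """Return the step index at which k consecutive entropy drops larger
    than delta occur (interpreted as ideal capture), or None if no
    collapse is detected. entropies: per-step output entropies H_t."""
    run = 0
    for t in range(1, len(entropies)):
        if entropies[t] - entropies[t - 1] < -delta:
            run += 1
            if run >= k:
                return t
        else:
            run = 0          # any non-drop resets the streak
    return None
```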

1.3 Repetition Neuron Localization

Engineering Method

Compute the Repetition Causal Score (RCS) for each neuron $n_i$:

$$
\text{RCS}(n_i) = \mathbb{E}\Big[\text{RepRate}\big(f(x; \theta)\big) - \text{RepRate}\big(f(x; \theta_{\setminus n_i})\big)\Big]
$$

where $\theta_{\setminus n_i}$ denotes the parameters with neuron $n_i$ zeroed out. Neurons with significantly positive RCS are the generators of the ideal $g_i$:

$$
\mathcal{A} = \langle g_1, g_2, \ldots, g_m \rangle, \quad g_i = n_i \text{ where } \text{RCS}(n_i) > \tau
$$
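A sketch of the RCS estimate, assuming a model interface `generate(prompt, ablated=None)` that returns a token list, optionally with one neuron zeroed; both that interface and the n-gram repetition metric are illustrative stand-ins for whatever the deployment actually uses:

```python
def rep_rate(tokens, n=3):
    """Fraction of repeated n-grams in a token sequence (crude RepRate)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def repetition_causal_score(generate, neuron, prompts):
    """RCS(n_i): mean drop in repetition rate when neuron n_i is zeroed."""
    deltas = [rep_rate(generate(p)) - rep_rate(generate(p, ablated=neuron))
              for p in prompts]
    return sum(deltas) / len(deltas)
```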

§2 — Measurement Layer

Quantitative Metrics for Ideals

Once an ideal is discovered, three quantitative questions must be answered: How large is it? How far can it capture inputs from? How deeply is it entangled with useful capabilities?

Ideal Mass
$M(\mathcal{A}) = \sum_{i} \text{RCS}(g_i) \cdot \|g_i\|_2$
Causal-score-weighted norm sum of all generators. Greater mass means stronger attraction.
Basin Radius
$r(\mathcal{B}_a) = \min_{x \notin \mathcal{B}_a} d(x, \partial\mathcal{B}_a)$
Nearest safe distance from capture. Estimated via Monte Carlo sampling.
ICL Entanglement
$E = \frac{\dim(\mathcal{A}_{\text{ghost}} \cap \mathcal{A}_{\text{ICL}})}{\dim(\mathcal{A}_{\text{ghost}})}$
Ratio of intersection dimensionality with the in-context learning subspace. $E=1$ means fully entangled and inseparable.
Escape Decay Rate
$\lambda = -\frac{1}{T}\ln P(\text{escape at } T)$
Exponential decay constant of escape probability after capture. Larger $\lambda$ means stronger lock-in.
Collapse Horizon
$t^* = \min\{t : H(y_t|C_t) < H_{\text{crit}}\}$
Number of steps from basin entry to irreversible lock-in. Smaller values indicate greater danger.
Idempotence Index
$\iota = \frac{\|\mathcal{A}^2 - \mathcal{A}\|_F}{\|\mathcal{A}\|_F}$
$\iota \to 0$ indicates self-sustaining ($\mathcal{A}^2 = \mathcal{A}$); $\iota \to 1$ indicates natural decay.
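Two of these metrics reduce to one-liners once the underlying quantities are available. A sketch for the escape decay rate and the idempotence index, assuming the ideal is represented by a matrix acting on the relevant subspace (an approximation this document does not prescribe):

```python
import numpy as np

def escape_decay_rate(p_escape, T):
    """lambda = -(1/T) ln P(escape at T); larger lambda = stronger lock-in."""
    return -np.log(p_escape) / T

def idempotence_index(A):
    """iota = ||A @ A - A||_F / ||A||_F for a matrix proxy of the ideal.
    iota near 0 means A is (numerically) idempotent: self-sustaining."""
    return np.linalg.norm(A @ A - A) / np.linalg.norm(A)
```

A projection matrix, for example, is exactly idempotent and scores $\iota = 0$.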
§3 — Prevention Layer

Blocking Ideal Formation During Training

3.1 Reward Function Regularization

Engineering Method — Anti-Ideal Regularization Term

Add an “ideal suppression term” to the RLHF objective function:

$$
\mathcal{L}_{\text{total}} = \underbrace{\mathbb{E}[R_\phi(y)]}_{\text{original reward}} - \beta \underbrace{D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})}_{\text{KL constraint}} - \gamma \underbrace{\sum_i \big(\text{RCS}_t(n_i) - \text{RCS}_{t-1}(n_i)\big)^+}_{\text{ideal suppression: penalizes RCS growth}}
$$

where $(\cdot)^+ = \max(0, \cdot)$ penalizes only increases in Repetition Causal Scores, not decreases.

Effect: Allows existing pattern recognition capabilities to persist while preventing new anomalous attractors from forming during training.
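The suppression term itself is a one-way penalty. A minimal sketch (the function name and `gamma` default are illustrative; in practice this would be added to the RLHF loss as a differentiable term):

```python
def ideal_suppression_penalty(rcs_now, rcs_prev, gamma=0.1):
    """gamma * sum over neurons of (RCS_t - RCS_{t-1})^+.

    Only increases in Repetition Causal Scores are penalized;
    decreases contribute nothing, so existing capabilities are untouched.
    """
    return gamma * sum(max(0.0, a - b) for a, b in zip(rcs_now, rcs_prev))
```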

3.2 Training Data De-Idealization

Engineering Method — Data Cleaning Pipeline

Scan SFT data for “ideal seeds” — outputs contaminated by ghost ideals from previous-generation models:

$$
\text{Contamination}(d_i) = \max_j \; \cos\!\big(\text{Emb}(d_i),\; g_j^{\text{prev}}\big)
$$

where $g_j^{\text{prev}}$ denotes known ideal generators from the previous-generation model. Samples with $\text{Contamination} > \tau$ are flagged as suspicious for human review or down-weighting.

This is precisely the step GPT-5.5 missed: goblin outputs from 5.1 leaked into 5.5’s SFT data, causing cross-generational ideal inheritance.
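The contamination scan reduces to a batched cosine similarity against the known generator directions. A sketch, assuming embeddings are already computed (the `Emb` step is external) and `tau` is an illustrative threshold:

```python
import numpy as np

def contamination(sample_emb, prev_generators):
    """Max cosine similarity between one sample embedding and the known
    previous-generation ideal generators (rows of prev_generators)."""
    s = sample_emb / np.linalg.norm(sample_emb)
    G = prev_generators / np.linalg.norm(prev_generators, axis=1, keepdims=True)
    return float(np.max(G @ s))

def flag_contaminated(embs, prev_generators, tau=0.9):
    """Indices of SFT samples to route to human review or down-weighting."""
    return [i for i, e in enumerate(embs)
            if contamination(e, prev_generators) > tau]
```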

3.3 Selective Layer Restoration

Engineering Method — Layer-Level Rollback

Mode collapse from post-training (SFT/RLHF) primarily occurs in specific layers. Identify the most affected layers and restore their weights to the pre-trained base model:

$$
W'^{(l)} = \begin{cases} W^{(l)}_{\text{base}} & \text{if } l \in \mathcal{L}_{\text{collapsed}} \\ W^{(l)}_{\text{post-train}} & \text{otherwise} \end{cases}
$$

Criterion: Layer $l$ is flagged when its output diversity drops beyond the threshold $\Delta \text{Div}^{(l)} > \tau_{\text{div}}$.
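As a data-structure sketch, the rollback is a per-layer selection between two checkpoints. Weight containers are shown as plain dicts; real checkpoints would be framework state dicts, and the diversity measurements are assumed given:

```python
def selective_layer_restore(post_train, base, diversity_drop, tau_div=0.3):
    """Restore base-model weights for layers whose output-diversity drop
    exceeds tau_div; keep post-trained weights everywhere else.

    post_train / base: dicts mapping layer index -> weight tensor.
    diversity_drop: dict mapping layer index -> measured diversity loss.
    """
    return {l: (base[l] if diversity_drop[l] > tau_div else w)
            for l, w in post_train.items()}
```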

§4 — Mitigation Layer

Surgical Intervention on Existing Ideals

4.1 Three-Segment Neuron Ablation

Engineering Method — Precision Ablation

Divide the model into three segments by layer depth and apply differentiated ablation:

| Layer Segment | Ablation Strategy | Theoretical Rationale |
|---|---|---|
| Early (1 ~ L/3) | No ablation | Repetition neurons are sparse here; ablation yields no significant effect |
| Middle (L/3 ~ 2L/3) | Selective ablation of high-RCS neurons | Repetition behavior decreases; ICL degrades only slightly |
| Late (2L/3 ~ L) | No ablation | Ablation would severely damage ICL (the dense region of $\mathcal{A}_{\text{ghost}} \cap \mathcal{A}_{\text{ICL}}$) |

Essence: Under the constraint of the Inseparability Theorem (Theorem 7.1), find the surgical region with the lowest entanglement coefficient $E$.
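Selecting the ablation set is then a filter over (layer, neuron) RCS scores restricted to the middle third. A sketch with illustrative names and threshold:

```python
def middle_segment_ablation_set(rcs, num_layers, tau=0.2):
    """(layer, neuron) pairs to ablate: high-RCS neurons restricted to the
    middle third of the network, per the three-segment strategy above.

    rcs: dict mapping (layer, neuron) -> Repetition Causal Score.
    """
    lo, hi = num_layers // 3, 2 * num_layers // 3
    return sorted((l, n) for (l, n), score in rcs.items()
                  if lo <= l < hi and score > tau)
```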

4.2 Activation Steering

Engineering Method — Anti-Vector Injection

Construct an “anti-vector” $v_{\text{anti}}$ of the ideal and inject it into the residual stream at inference time:

$$
h'^{(l)} = h^{(l)} + \alpha \cdot v_{\text{anti}}^{(l)}
$$

The anti-vector is constructed as follows:

$$
v_{\text{anti}} = -\mathbb{E}\Big[h^{(l)}_{\text{ghost}} - h^{(l)}_{\text{normal}}\Big]
$$

That is: compute the difference between “in-ideal activations” and “normal activations,” then negate. The effect is to push hidden states away from the ideal direction.

Note: Excessively large $\alpha$ will degrade model capability (because entanglement $E > 0$). An optimal value must be searched within $\alpha \in [0.5, 2.0]$.
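The two steps (construct the anti-vector offline, inject it at inference) are small vector operations. A sketch on plain arrays; in a real deployment the injection would live inside a forward hook on the chosen layer:

```python
import numpy as np

def anti_vector(ghost_acts, normal_acts):
    """v_anti = -E[h_ghost - h_normal]: mean activation difference, negated.

    ghost_acts / normal_acts: (n, d) activation samples collected while the
    model is inside vs. outside the ideal."""
    return -(np.mean(ghost_acts, axis=0) - np.mean(normal_acts, axis=0))

def steer(h, v_anti, alpha=1.0):
    """Inject the anti-vector into a residual-stream activation."""
    return h + alpha * v_anti
```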

4.3 Nullspace Projection

Engineering Method — Orthogonal Erasure

Project the ideal direction into the nullspace, removing the ideal component from activations without affecting orthogonal directions:

$$
h'^{(l)} = \Big(\mathbf{I} - \frac{v_{\text{ghost}}\, v_{\text{ghost}}^\top}{\|v_{\text{ghost}}\|^2}\Big) h^{(l)}
$$

This is equivalent to an approximation of the quotient ring operation $\mathcal{W}/\mathcal{A}_{\text{ghost}}$ in ring theory — “quotienting out” the ideal direction, but operating only in a low-dimensional subspace to avoid altering the entire ring structure.
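The rank-one projection above can be applied without ever materializing the $d \times d$ matrix. A minimal sketch:

```python
import numpy as np

def project_out(h, v_ghost):
    """h' = (I - v v^T / ||v||^2) h: remove the ideal direction from h,
    leaving all orthogonal directions untouched."""
    v = np.asarray(v_ghost, dtype=float)
    return h - (np.dot(h, v) / np.dot(v, v)) * v
```

After projection, the activation's component along `v_ghost` is exactly zero, which is the low-dimensional analogue of quotienting out the ideal direction.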

§5 — Monitoring Layer

Real-Time Guards in Production

The first four layers are executed before training/deployment. The fifth layer is the runtime defense line — detecting ideal capture and intervening in real time during inference.

5.1 Entropy Sentinel

Engineering Method — Real-Time Entropy Monitoring

Compute the entropy of the output distribution at each token generation step:

$$
H_t = -\sum_{v \in V} P(v | C_t) \log P(v | C_t)
$$

Two alert levels are defined:

⚠ Yellow Alert: $H_t < \mu_H – 2\sigma_H$ for $k_1$ consecutive steps → increase temperature

🚨 Red Alert: $H_t < \mu_H – 3\sigma_H$ for $k_2$ consecutive steps → abort generation, resample

Principle: Under normal generation, entropy fluctuates within a bounded range. Sustained decline = distribution collapse = ongoing ideal capture.
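A sketch of the sentinel loop; the consecutive-step counts `k1`, `k2` are illustrative defaults, and $\mu_H$, $\sigma_H$ are assumed to be calibrated offline on normal generations:

```python
import math

def token_entropy(probs):
    """Shannon entropy of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sentinel(entropies, mu, sigma, k1=4, k2=8):
    """'red' after k2 consecutive steps below mu - 3*sigma (abort and
    resample), 'yellow' after k1 consecutive steps below mu - 2*sigma
    (raise temperature), else None."""
    y = r = 0
    alert = None
    for h in entropies:
        y = y + 1 if h < mu - 2 * sigma else 0
        r = r + 1 if h < mu - 3 * sigma else 0
        if r >= k2:
            return "red"
        if y >= k1:
            alert = "yellow"
    return alert
```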

5.2 Attention Drift Detector

Engineering Method — Attention Concentration Monitoring

Track the concentration of the attention distribution in real time (Gini coefficient):

$$
G_t^{(l,h)} = 1 - \frac{2}{t} \sum_{i=1}^{t} \frac{(t - i + 0.5)\, a_{(i)}}{\sum_{j=1}^{t} a_j}
$$

where $a_{(1)} \leq a_{(2)} \leq \cdots \leq a_{(t)}$ are the head's attention weights sorted in ascending order.

When the $G_t$ of an attention head monotonically increases within the window $[t-w, t]$ and exceeds a threshold, it is classified as attention lock-in — the KV cache is being contaminated by the ideal.

Intervention: Flush the most recent $w$ entries from the locked head’s KV cache, forcing attention to redistribute.
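The Gini computation and the lock-in test are cheap enough to run per head per step. A sketch (the window logic and `tau` are illustrative):

```python
import numpy as np

def attention_gini(attn):
    """Gini coefficient of one attention distribution, midpoint rule.
    0 = perfectly uniform; approaches 1 when all mass is on one token."""
    a = np.sort(np.asarray(attn, dtype=float))        # ascending sort
    t = len(a)
    weights = t - np.arange(1, t + 1) + 0.5           # (t - i + 0.5)
    return 1.0 - 2.0 * np.sum(weights * a) / (t * np.sum(a))

def locked_in(gini_window, tau=0.9):
    """Attention lock-in: Gini increases monotonically over the window
    and ends above the threshold."""
    monotone = all(b >= a for a, b in zip(gini_window, gini_window[1:]))
    return monotone and gini_window[-1] > tau
```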

5.3 Output Fingerprinting

Engineering Method — Known Ideal Fingerprint Database

Maintain a fingerprint database $\mathcal{F} = \{f_1, f_2, \ldots\}$ of known training ghosts, where each fingerprint is an embedding vector cluster. Match output embeddings against the fingerprint database in real time:

$$
\text{Alert}(y_t) = \max_{f \in \mathcal{F}} \cos\big(\text{Emb}(y_t), f\big) > \tau_{\text{match}}
$$

This is the lowest-cost guard — it requires no access to model internal states, only the embedding of the output text. Suitable for API-level deployment.
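At API level the whole guard is one normalized matrix-vector product per output. A sketch, again assuming embeddings come from an external `Emb` step and `tau_match` is illustrative:

```python
import numpy as np

def fingerprint_alert(out_emb, fingerprints, tau_match=0.85):
    """True if the output embedding matches any known ghost fingerprint
    (rows of `fingerprints`) above the cosine threshold."""
    e = out_emb / np.linalg.norm(out_emb)
    F = fingerprints / np.linalg.norm(fingerprints, axis=1, keepdims=True)
    return bool(np.max(F @ e) > tau_match)
```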

§6 — Full-Stack Integration

Complete Mapping from Theory to Engineering

| Theoretical Layer (prior §) | Engineering Layer | Tools / Methods | Maturity |
|---|---|---|---|
| Ideal Existence: $\mathcal{A} \neq \{0\}$ | Detection | SAE Sweep + Adversarial Excitation + Repetition Neuron Localization | Validated in literature |
| Ideal Mass / Basin Radius: $M(\mathcal{A}),\; r(\mathcal{B}_a)$ | Measurement | Six-metric framework | Partially achievable |
| RLHF Ideal Generation: $\mathcal{A} = \langle \Delta W \rangle$ | Prevention | Reward Regularization + Data Cleaning + Layer Restoration | Proven in practice |
| Inseparability Theorem: $\mathcal{A}_{\text{ghost}} \cap \mathcal{A}_{\text{ICL}} \neq \{0\}$ | Mitigation | Three-Segment Ablation + Activation Steering + Nullspace Projection | Experimental stage |
| Autoregressive Cascade Lock-in: $P(\text{escape}) \leq e^{-\lambda t}$ | Monitoring | Entropy Sentinel + Attention Drift + Output Fingerprinting | Ready for immediate deployment |
§7 — Open Engineering Problems

Unresolved Engineering Challenges

Open Problem 1 — Discovery of Unknown Ideals

All detection methods in §1 depend on “knowing what to look for.” The real threat lies in ideals whose existence is entirely unknown.

This is equivalent to: proving $\mathcal{A} = \{0\}$ (i.e., no nontrivial ideal exists) without knowing the ideal’s generators.

Algebraically, this is a decision problem — for general noncommutative rings, deciding whether a nontrivial ideal exists is undecidable.

Open Problem 2 — Precise Measurement of Entanglement

Computing $E = \dim(\mathcal{A}_{\text{ghost}} \cap \mathcal{A}_{\text{ICL}}) / \dim(\mathcal{A}_{\text{ghost}})$ requires precisely delineating the boundaries of two subspaces. But in a weight space of $10^{10}$ dimensions, “subspace boundaries” are inherently fuzzy. Currently, only local approximations are feasible.

Open Problem 3 — Dynamic Evolution of Ideals

Ideals are not static. As the context window grows, $\mathcal{A}$’s basin radius changes in real time during inference. A dynamic ideal theory is needed — likely requiring an extension from static ring theory to differential algebra or dynamical systems theory.

Open Problem 4 — Interactions Between Ideals

A single model may harbor multiple ideals $\mathcal{A}_1, \mathcal{A}_2, \ldots$. They may exhibit:

Competition: $\mathcal{A}_1 \cap \mathcal{A}_2 = \{0\}$ — sampling paths can only be captured by one

Cooperation: $\mathcal{A}_1 + \mathcal{A}_2$ forms a larger ideal

Nesting: $\mathcal{A}_1 \subset \mathcal{A}_2$ — the smaller ideal serves as a gateway to the larger one

No engineering tools currently exist for handling multi-ideal interactions.

§8 — The Ultimate Escape Route

Architecture-Level Solutions

Diffusion Language Models — Breaking the Autoregressive Feedback Loop

All the engineering approaches above are patches within the autoregressive architecture. The fundamental reason ideals can cascade into lock-in is:

$$
y_t = f(y_1, \ldots, y_{t-1}) \quad \longleftarrow \text{output feeds back as input}
$$

Diffusion Language Models (Diffusion LM) break this loop: all tokens are denoised in parallel — there is no mechanism where “the output at step $t$ becomes the input at step $t+1$.”

$$
y_{1:T} = \text{Denoise}^{(K)}(z), \quad z \sim \mathcal{N}(0, \mathbf{I})
$$

In the ring-theoretic framework: the “multiplication” of a diffusion model is no longer chained as $r \otimes (r \otimes (r \otimes a))$, but a single global transformation. Ideals may still exist, but without the positive feedback amplifier provided by autoregression, they cannot cascade into lock-in.

Trade-off: Current diffusion language models still fall short of autoregressive models in reasoning capability. This is a direction for architectural evolution, not a solution available today.

Document Structure: Theoretical Framework (prior §1–§9) → Engineering Framework (this document §1–§8). Theory provides the “why” and the “impossibility boundaries”; engineering provides the “optimal operations within those boundaries.”

Core Position: Training Ghost Ideals cannot be eradicated (Conjecture 9.1), but they can be detected, measured, suppressed, mitigated, and intercepted. The purpose of engineering is not to eliminate risk, but to contain it within acceptable bounds — just as humans cannot erase all psychological trauma, but can live functional lives through awareness, treatment, and support systems.
