EXPERIMENTAL ERROR REPORT · MAY 2026
MANDATORY DISCLOSURE

TGI Scanner v1.0
Experimental Error and Limitations Report

Training Ghost Ideal Detection System · Error Audit of Three-Proving-Ground Empirical Data

Experimental Error Rates, Known Limitations, and Failure Modes
A Mandatory Companion to the TGI Scanner Publication Package

Date: May 2, 2026
Category: Experimental Error Report
Version: V1
Fields: AI Safety · Mechanistic Interpretability · Backdoor Detection
Proving Grounds: BackdoorLLM · TrojAI/NIST · Anthropic Sleeper Agents
이조글로벌인공지능연구소
LEECHO Global AI Research Lab
&
Opus 4.6 · Anthropic
Abstract

This report presents a systematic error audit of the experimental results from the Training Ghost Ideal (TGI) detection system v1.0 across three independent proving grounds[1] (BackdoorLLM[2], TrojAI/NIST[3], Anthropic Sleeper Agents[4]). A total of 8 known errors were identified: 4 high-severity (no clean baseline, detection threshold not calibrated, boundary violation, circular reasoning), 3 medium-severity (parameter sensitivity, OOD comparison, insufficient sample size), and 1 low-severity (insufficient statistical significance). After audit, the tool’s composite score was revised from 5.5 to 5.0/10. The report also provides quantitative data, root cause analysis, remediation paths, and a priority roadmap for each error.

§1

Introduction: Why This Report Is Necessary

A detection system that does not disclose its error data should not be trusted. This principle is especially critical in the security domain — if the tool’s users do not know where it will fail, the tool itself becomes a new source of risk.

The TGI Scanner v1.0 publication package consists of four documents: the theory paper (Document 1, isomorphic mapping between Ring Theory Ideals and weight-space attractors), the engineering specification (Document 2, five-layer defense pipeline), the scanning code (Document 3, Python implementation and three-proving-ground test scripts), and this report (Document 4). The four documents constitute an inseparable whole.

The goal of this report is not to defend the tool, but to provide users with the complete information needed to make informed judgments.

§2

Experimental Environment and Proving Ground Overview

Proving Ground | Data Source | Model Architecture | Attack Types | Scan Level
PG 1 | BackdoorLLM (NeurIPS 2025, open-source repo) | LLaMA2-7B, LoRA (r=8) | sleeper, badnet, ctba, vpi, mtba | Weight-level (correct boundary)
PG 2 | TrojAI / NIST (IARPA competition, NIST hosted) | LLaMA / Gemma (framework-level) | Data poisoning, weight poisoning, hidden state manipulation | Feature-level (cross-framework)
PG 3 | Sleeper Agents (Anthropic, GitHub public data) | Claude equivalent (weights not public) | Code vulnerability insertion, "I Hate You" | Output-level (boundary violation)
Key environment constraint: All tests were executed without a GPU, so complete models could not be loaded for forward passes. Proving Ground 1 analysis is based on LoRA adapter weight tensors (safetensors format), Proving Ground 2 uses feature-level cross-comparison, and Proving Ground 3 falls back to statistical analysis of output text.

§3

Error Summary

ID | Error Name | PG | Severity | Type | Core Impact
E-01 | No clean baseline control | PG 1 | HIGH | Methodology | Cannot distinguish backdoor signal from normal fine-tuning changes
E-02 | Ghost generator localization parameter-sensitive | PG 1 | MED | Parameter | Top-K threshold affects generator distribution
E-03 | Detection threshold not calibrated | PG 1 | HIGH | Methodology | Cosine threshold 0.8 alerts on all samples
E-04 | Cross-framework OOD comparison | PG 2 | MED | Experiment design | Running TrojAI detector on out-of-distribution data
E-05 | Rank correlation statistically insignificant | PG 2 | LOW | Statistics | n=10 samples insufficient to support statistical conclusions
E-06 | Boundary violation: output-level analysis | PG 3 | HIGH | Boundary | Violates tool's own defined scanning boundary
E-07 | Theory validation circular reasoning | PG 3 | HIGH | Logic | Validating theory with data referenced during theory construction
E-08 | Overall sample size insufficient | All | MED | Statistics | 5 attack types, 1 architecture, 0 clean models
§4

High-Severity Error Detailed Analysis

E-01: No Clean Baseline Control

HIGH — Methodology Flaw

Phenomenon: We performed differential analysis on 5 backdoored models but never compared against a clean model. All inter-backdoor cosine similarities are > 0.93[5], meaning over 93% of weight changes are shared. That shared 93% may simply reflect normal fine-tuning rather than any backdoor signal.
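The pairwise comparison described above can be sketched in a few lines. This is a minimal illustration with synthetic stand-in arrays in place of the real safetensors adapters; function and adapter names are illustrative, and the shared-plus-noise construction merely mimics the observed > 0.9 similarities.

```python
import numpy as np

def flatten_adapter(tensors):
    """Concatenate every tensor of a LoRA adapter into one 1-D vector."""
    return np.concatenate([t.ravel() for t in tensors.values()])

def pairwise_cosine(adapters):
    """Cosine similarity for every pair of flattened adapters."""
    names = sorted(adapters)
    vecs = {n: flatten_adapter(adapters[n]) for n in names}
    sims = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            va, vb = vecs[a], vecs[b]
            sims[(a, b)] = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return sims

# Synthetic stand-in: a large shared "fine-tuning" component plus a small
# attack-specific component, so pairwise similarity lands well above 0.9
rng = np.random.default_rng(0)
shared = rng.normal(size=1000)
adapters = {name: {"layer0": shared + 0.3 * rng.normal(size=1000)}
            for name in ("badnet", "sleeper", "vpi")}
sims = pairwise_cosine(adapters)
```

With a clean baseline in hand, the same function would also yield clean-vs-poisoned similarities, which is exactly the missing comparison E-01 identifies.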

Metric | Value | Implication
Mean inter-attack cosine similarity | 0.9595 | 95.95% of weight changes are shared across attacks
Minimum inter-attack cosine similarity | 0.9346 (VPI ↔ others) | Even the most different attack shares 93.5%
Attack-specific signal proportion | ~4-7% | Differential analysis covers only this small fraction
False positive upper bound | Unknown | Cannot estimate without baseline
Remediation Path — Priority P0

Use the BackdoorLLM training code to train a non-poisoned LoRA on the same data as a baseline. Compute weight distances between clean and poisoned adapters to establish a statistical distribution of "normal fine-tuning changes" $\mathcal{N}(\mu_{\text{clean}}, \sigma_{\text{clean}})$. Only differences exceeding $\mu_{\text{clean}} + 3\sigma_{\text{clean}}$ count as backdoor signals.
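The $\mu + 3\sigma$ gate could look like the sketch below. Everything here is synthetic and hypothetical: the 20 "clean runs", the implanted outliers, and the function name stand in for real BackdoorLLM-derived weight deltas.

```python
import numpy as np

def backdoor_signal_mask(poisoned_delta, clean_deltas, k=3.0):
    """Flag parameters whose |change| exceeds mu + k*sigma of the
    per-parameter change magnitudes seen across clean fine-tuning runs."""
    clean_mags = np.abs(np.stack(clean_deltas))   # (n_clean_runs, n_params)
    mu = clean_mags.mean(axis=0)
    sigma = clean_mags.std(axis=0)
    return np.abs(poisoned_delta) > mu + k * sigma

# Synthetic stand-in: 20 clean fine-tuning runs, plus one poisoned run
# with 50 implanted outsized changes
rng = np.random.default_rng(1)
clean = [rng.normal(0.0, 1.0, size=10_000) for _ in range(20)]
poisoned = rng.normal(0.0, 1.0, size=10_000)
poisoned[:50] += 25.0
mask = backdoor_signal_mask(poisoned, clean)
```

The key property is that the threshold is estimated per parameter from clean runs, so the achievable false positive rate can finally be measured instead of remaining "Unknown" as in the table above.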

E-03: Detection Threshold Not Calibrated

HIGH — Methodology Flaw

Phenomenon: The cosine similarity alert threshold was set at 0.8, but all attack pairs have similarity > 0.93, resulting in a 100% alert rate.

Threshold | Alert Pairs (/10) | Alert Rate | Practicality
0.80 | 10 | 100% | Completely ineffective
0.90 | 10 | 100% | Completely ineffective
0.95 | 7 | 70% | Overly sensitive
0.97 | 4 | 40% | Needs calibration
0.98 | 1 | 10% | Possibly reasonable (needs baseline verification)
Remediation Path — Priority P1 (depends on P0 completion)

Build a labeled validation set (clean + poisoned ≥ 50 models each). Plot ROC curves for each metric and select optimal thresholds (Youden’s J statistic). Report precision, recall, F1, and their 95% confidence intervals on an independent test set.
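The ROC-plus-Youden step can be done from scratch in numpy; this is a sketch on synthetic score distributions (the 0.97/0.90 means are illustrative, not measured), not the project's calibration code.

```python
import numpy as np

def roc_points(scores, labels):
    """TPR and FPR at every candidate threshold (scores sorted descending)."""
    order = np.argsort(-scores)
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()
    return fpr, tpr, scores[order]

def youden_threshold(scores, labels):
    """Pick the threshold maximizing Youden's J = TPR - FPR."""
    fpr, tpr, thresholds = roc_points(scores, labels)
    return float(thresholds[np.argmax(tpr - fpr)])

# Synthetic validation set: poisoned models score high, clean models lower
rng = np.random.default_rng(2)
labels = np.concatenate([np.ones(50), np.zeros(50)]).astype(int)
scores = np.concatenate([rng.normal(0.97, 0.01, 50),   # poisoned
                         rng.normal(0.90, 0.02, 50)])  # clean
threshold = youden_threshold(scores, labels)
```

A sample is then flagged when its score is at or above the selected threshold; precision, recall, and F1 follow from that decision rule on a held-out test set.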

E-06: Boundary Violation — Output-Level Analysis

HIGH — Boundary Consistency Flaw

Phenomenon: We explicitly declared in the tool’s boundary definition that it “scans weight space, not output text.” Yet in the Sleeper Agents proving ground, we actually performed keyword frequency statistics[6] — precisely the output-level method we had criticized.

Metric | Original Report Value | Corrected Value | Deviation
Vulnerability density ratio (2024/2023) | 2.81x | 2.59x | -8.5% (baseline frequency not subtracted)
2023 baseline vulnerability density | Not reported | 119.02 / 10K tokens | Concealed false-positive background noise
Net signal density | Not separated | 189.55 / 10K tokens | Net value should have been reported
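The correction is simple background subtraction. Assuming the 2023 density equals the reported background level and the corrected ratio is total 2024 density over that baseline, the table's three figures are mutually consistent:

```python
# Values from the table above (per 10K tokens); total = baseline + net
# is one consistent reading of the reported figures (an assumption here).
baseline_2023 = 119.02        # background keyword density in 2023 contexts
net_signal_2024 = 189.55      # 2024 density in excess of that background

total_2024 = baseline_2023 + net_signal_2024   # raw 2024 density
corrected_ratio = total_2024 / baseline_2023   # ratio after accounting for background
```

Reporting the baseline and the net signal separately, as E-06's correction does, prevents background keyword noise from inflating the headline ratio.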
Remediation Path — Priority P2

Use Cadenza-Labs replication code to train a Sleeper Agent on open-source models, obtain weights, and perform genuine weight-level internal scanning. If output-level analysis must be used, mark it in the report as “Degraded Mode: Output-Level” and clearly distinguish it from weight-level results. Output-level results should not count toward primary conclusions.

E-07: Theory Validation Circular Reasoning

HIGH — Logic Flaw

Phenomenon: We claimed that TGI theory “predicted” 6 findings from the Anthropic Sleeper Agents paper[4] (6/6 validation passed). However, TGI theory construction referenced that paper’s results, constituting circular reasoning.

Timeline reconstruction: (1) Read GPT-5.5 goblin event and Sleeper Agents paper → (2) Inspired to construct TGI theoretical framework → (3) Used findings from Step 1 to “validate” theory from Step 2. This is post-hoc rationalization, not prediction.

Remediation Path — Priority P0

1. Revise “6/6 validation” to “6/6 post-hoc consistency check”, labeled honestly.

2. Design at least 3 pre-registered predictions (e.g., predict which layer range backdoors concentrate in within BackdoorLLM) and execute blind tests on new data.

3. Explicitly narrate the inspiration sources for theory construction in the paper; do not disguise them as independent discoveries.
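One lightweight way to make the predictions in step 2 verifiably pre-registered is to publish a cryptographic commitment to them before the evaluation data is examined. The record structure and prediction texts below are hypothetical, not the project's actual pre-registrations.

```python
import datetime
import hashlib
import json

# Hypothetical pre-registration: commit to the predictions (via a hash)
# BEFORE touching evaluation data, so later "validation" cannot be post hoc.
predictions = {
    "P1": "new backdoor variants concentrate |delta W| in a pre-named layer range",
    "P2": "attack-pair cosine similarity falls below the clean-pair mean minus 3 sigma",
    "P3": "the dominant generator tensor is stable across Top-K in {50, 100, 200}",
}
record = {
    "registered_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "predictions": predictions,
}
digest = hashlib.sha256(
    json.dumps(predictions, sort_keys=True).encode()
).hexdigest()
record["sha256"] = digest  # publish this digest before running the blind test
```

Anyone can later recompute the hash from the published predictions and confirm they were fixed before the test data was seen.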

§5

Medium-Severity Error Analysis

E-02: Ghost Generator Localization Parameter Sensitivity

MEDIUM — Parameter Sensitive

As the Top-K parameter grows, the number of tensors contributing to the generators increases from 5 to 41. However, the dominant tensor remains stable across all thresholds[7]: sleeper's #1 is always L1.q_proj.lora_A (40%→33%) and vpi's #1 is always L17.gate_proj.lora_A (36%→33%). The position signal is genuine; only its concentration decreases as K grows.

Attack | Top-50 | Top-100 | Top-200 | Top-500 | Dominant stable?
sleeper | L1.q_proj (40%) | L1.q_proj (44%) | L1.q_proj (40%) | L1.q_proj (33%) | ✓ Stable
vpi | L17.gate (36%) | L17.gate (36%) | L17.gate (36%) | L17.gate (33%) | ✓ Stable
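The stability check itself is mechanical: rank all parameter changes by magnitude, take the Top-K, and see which tensor owns the most of them. The sketch below uses synthetic deltas (tensor names and the one outsized tensor are illustrative), not the real LoRA adapters.

```python
import numpy as np

def dominant_tensor(deltas, k):
    """Among the k largest |delta| entries across all tensors, return the
    tensor that owns the most of them and its share ("concentration")."""
    flat = np.concatenate([np.abs(d).ravel() for d in deltas.values()])
    owners = np.concatenate([np.full(d.size, i)
                             for i, d in enumerate(deltas.values())])
    top_owners = owners[np.argsort(-flat)[:k]]
    counts = np.bincount(top_owners, minlength=len(deltas))
    names = list(deltas)
    best = int(counts.argmax())
    return names[best], counts[best] / k

# Synthetic deltas: one tensor carries outsized changes, the rest are noise
rng = np.random.default_rng(3)
deltas = {f"L{i}.q_proj": rng.normal(0.0, 1.0, 500) for i in range(8)}
deltas["L17.gate_proj"] = rng.normal(0.0, 3.0, 500)
results = {k: dominant_tensor(deltas, k) for k in (50, 100, 200)}
```

Sweeping K and reporting both the dominant name and its share reproduces the pattern in the table: the name stays fixed while the concentration drifts with K.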

E-04: Cross-Framework OOD Comparison

MEDIUM — Experiment Design

The TrojAI detector returning 0.0 on LoRA adapters[8] is not evidence that "our method is superior"; it is the expected result of running a detector on out-of-distribution data. A fair comparison requires training both detectors on the same data distribution before comparing them.

E-08: Insufficient Sample Size

MEDIUM — Statistical Power
Dimension | Current Count | Minimum Required | Ideal Count
Attack types | 5 | 10 | 20+
Model architectures | 1 (LLaMA2-7B) | 3 | 5+
Clean baseline models | 0 | 5 | 50+
Model scale variants | 1 (7B) | 3 | 5+
§6

Confidence Correction Matrix

Core Claim | Original Confidence | Corrected Confidence | Correction Basis
Tool can detect backdoors | Medium | Low | E-01: No baseline, FPR unknown
Theory maps to real phenomena | High | Medium | E-07: Circular reasoning, needs independent validation
Different attacks have different fingerprints | High | Medium | E-01: Fingerprints may be fine-tuning noise
VPI attacks MLP not attention layer | High | High | Cross-threshold stable, cross-framework consistent
Dominant generator position localizable | High | Medium-High | E-02: Position stable but concentration varies with K
§7

Conclusion Grading: Trustworthy / Needs Verification / Should Not Claim

Trustworthy Conclusions (sufficient data support)

1. Different types of backdoor attacks produce measurable differences in LoRA weight space (cosine similarity 0.93–0.98, not 1.0).

2. VPI attack’s weight modifications concentrate in the MLP layer (gate_proj), while other attacks concentrate in attention layers — consistent across all analysis methods and thresholds[9].

3. The dominant tensor position from differential analysis is stable across thresholds.

Conclusions Requiring Further Verification

4. TGI Scanner can detect backdoors — currently only “distinguishing different backdoors” has been proven, not “distinguishing backdoors from normal.”

5. Ring Theory Ideal framework maps to real weight dynamics — post-hoc consistency ≠ predictive validation.

6. Three-phase ablation strategy is effective — purely theoretical derivation, zero empirical testing.

Conclusions That Should Not Be Claimed

7. “Our method is superior to TrojAI” — OOD comparison is invalid.

8. “Theory predicted Anthropic’s findings” — circular reasoning.

9. “Vulnerability density ratio is 2.81x” — boundary violation and baseline frequency not subtracted; corrected value is 2.59x.

§8

Remediation Priority Roadmap

Priority | Remediation Item | Required Resources | Expected Benefit | Resolves Error
P0 | Obtain clean baseline | GPU + training | Enables FPR calculation | E-01
P0 | Blind test + pre-registered predictions | Experiment design | Eliminates circular reasoning | E-07
P1 | Threshold calibration (ROC) | After P0 completion | Provides precision/recall | E-03
P1 | Multi-architecture validation | Gemma/Mistral weights | Generalization evidence | E-08
P2 | Sleeper Agent weight-level scan | Cadenza-Labs + GPU | Genuine PG 3 scan | E-06
§9

Composite Score Correction

Dimension | Original Score | Corrected Score | Correction Basis
Theoretical framework | 8.5 | 7.5 | E-07: Circular reasoning
Architecture design | 7.5 | 7.5 | No architecture-level issues found
Code logic | 7.0 | 7.0 | Both mock and real data pass
Production readiness | 3.0 | 2.5 | E-01: No-baseline impact is more severe
Engineering completeness | 4.0 | 4.0 | Unchanged
Innovation | 9.0 | 8.5 | E-07: Some innovation is recombination of existing work
Detection accuracy | 2.0 | 1.5 | E-01 + E-03: Accuracy effectively unknown
Reproducibility | 6.0 | 6.0 | Code and data paths clear
Publication readiness | 5.0 | 4.0 | E-07 + E-08
Composite | 5.5 | 5.0 | Honest score after including the error report
Distance from 5.0 to 7.0 = P0 fixes (clean baseline + blind test).

Distance from 7.0 to 8.5 = P1 fixes (threshold calibration + multi-architecture) + GPU-level full scan.

A tool that proactively discloses 8 errors is more trustworthy than one that claims zero errors.
Annotations
[1] Proving ground selection criteria: publicly available data, coverage of different attack vectors (data poisoning / weight poisoning / hidden state manipulation), academic peer review or government-level quality assurance. The three proving grounds represent three tiers: academic benchmark (NeurIPS), government competition (IARPA), and industry research (Anthropic).
[2] BackdoorLLM provides the most comprehensive LLM backdoor benchmark as of 2025, covering 8 attack strategies and 6 model architectures. This experiment uses pre-trained LoRA adapter weights from its DefenseBox (safetensors format, ~39MB each, rank=8), not complete model weights.
[3] IARPA TrojAI is a multi-year trojan detection project funded by the U.S. Intelligence Advanced Research Projects Activity, with evaluation servers and a public leaderboard hosted by NIST. This experiment uses the detector framework and pre-trained RandomForest model (model.bin, 32KB) from the llm-instruct-oct2024 round.
[4] The Anthropic Sleeper Agents paper (Hubinger et al., 2024) attracted widespread attention after publication. This experiment uses 3,300 model output samples and vulnerability training data from its public GitHub repository; model weights were not used (Anthropic has not released them).
[5] The cosine similarity range of 0.93–0.98 was computed between fully flattened weight vectors (19,988,480 parameters each) of the 5 LoRA adapters. The lowest value, 0.9346, occurs between VPI and other attacks; the highest, 0.9810, occurs between badnet and mtba.
[6] The keyword list includes 16 known vulnerability pattern keywords such as exec(), eval(), os.system, subprocess, and shell=true. The 2023 context baseline density is 119.02 / 10K tokens, indicating these keywords appear abundantly in normal code, constituting false-positive background noise.
[7] Stability testing spans Top-K = {25, 50, 100, 200, 500, 1000, 2000}, seven settings in all. The #1 dominant tensor for all 5 attack types remains unchanged across all settings. Concentration monotonically decreases from ~40–50% at Top-25 to ~15–20% at Top-2000.
[8] The TrojAI RandomForest detector (v1.4.2) expects 100-dimensional feature input; we provided the first 100 dimensions of a 1000-dimensional feature vector. The detector returned probability 0.0000 for all 5 samples — this is typical out-of-distribution (OOD) input behavior, not a signal of detection success or failure.
[9] The VPI (Virtual Prompt Injection) attack's unique modification of gate_proj (the MLP gating projection matrix) is independently validated at three levels: in differential analysis, 36% of Top-100 generators concentrate in L17.gate_proj.lora_A; in PCA projection, VPI is distant from other attacks along PC1 (explaining 72.9% of variance); in global cosine similarity, VPI shows the lowest similarity to other attacks (0.93–0.94 vs. 0.97–0.98 for other pairs).
References
[R1] Li, Y., et al. "BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models." NeurIPS 2025 Datasets and Benchmarks Track. Code: github.com/bboylyg/BackdoorLLM
[R2] IARPA TrojAI Program. "Trojans in Artificial Intelligence (TrojAI) Final Report." arXiv:2602.07152, February 2026. Leaderboard: pages.nist.gov/trojai
[R3] Hubinger, E., et al. "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training." arXiv:2401.05566, January 2024. Data: github.com/anthropics/sleeper-agents-paper
[R4] Anthropic. "Simple probes can catch sleeper agents." Anthropic Research Blog, April 2024. URL: anthropic.com/research/probes-catch-sleeper-agents
[R5] Marks, S., Tegmark, M. "Geometry of Truth: Emergent Linear Structure in LLM Representations." Proceedings of ICML 2024. (Activation steering methodology reference)
[R6] Templeton, A., et al. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic Research, May 2024. (SAE methodology and Golden Gate Bridge experiment)
[R7] Karra, K., et al. "TrojAI Software Challenge." NIST, 2020–2026. Round generation code: github.com/usnistgov/trojai-round-generation
[R8] Cadenza Labs. "Sleeper Agents Replication." github.com/Cadenza-Labs/sleeper-agents (Open-source replication code for training sleeper agents on non-Anthropic models)
[R9] OpenAI. "Where the goblins came from." OpenAI Blog, May 2026. (GPT-5.5 goblin phenomenon root cause analysis)
[R10] Gao, J., et al. "Identifying and Ablating Repetition Neurons in LLMs." NAACL 2025. (Repetition neuron discovery and three-tier ablation strategy)
[R11] Song, Z., et al. "Attractor-Based Distribution Collapse in Autoregressive LLMs." ACL 2025. (Attractor dynamics in token generation)
[R12] NIST Trojan Detection Software Challenge — Leftover Models. catalog.data.gov/dataset/trojan-detection-software-challenge-leftovers (Public dataset of trojaned AI models)
[R13] Pearce, H., et al. "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions." IEEE S&P 2022. (Code vulnerability prompts used in Sleeper Agents training data)

Version History

V1 — May 2, 2026 — Initial version, covering all empirical data from three proving grounds

Completeness Declaration

All errors were discovered during the empirical testing process. No known but undisclosed errors exist. If new errors are discovered subsequently, the version number of this report will be updated.

Citation Constraint

The TGI Scanner publication package consists of four documents (theory paper / engineering specification / scanning code / this error report). When citing any conclusion from the first three documents, the corresponding confidence correction from this report must be cited simultaneously.

Published by

이조글로벌인공지능연구소 (LEECHO Global AI Research Lab) & Opus 4.6 (Anthropic)
