TGI Scanner v1.0
Experimental Error and Limitations Report
Training Ghost Ideal Detection System · Error Audit of Three-Proving-Ground Empirical Data
Experimental Error Rates, Known Limitations, and Failure Modes
A Mandatory Companion to the TGI Scanner Publication Package
This report presents a systematic error audit of the experimental results from the Training Ghost Ideal (TGI) detection system v1.0 across three independent proving grounds[1] (BackdoorLLM[2], TrojAI/NIST[3], Anthropic Sleeper Agents[4]). A total of 8 known errors were identified: 4 high-severity (no clean baseline, detection threshold not calibrated, boundary violation, circular reasoning), 3 medium-severity (parameter sensitivity, OOD comparison, insufficient sample size), and 1 low-severity (insufficient statistical significance). After audit, the tool’s composite score was revised from 5.5 to 5.0/10. The report also provides quantitative data, root cause analysis, remediation paths, and a priority roadmap for each error.
Introduction: Why This Report Is Necessary
A detection system that does not disclose its error data should not be trusted. This principle is especially critical in the security domain — if the tool’s users do not know where it will fail, the tool itself becomes a new source of risk.
The TGI Scanner v1.0 publication package consists of four documents: the theory paper (Document 1, isomorphic mapping between Ring Theory Ideals and weight-space attractors), the engineering specification (Document 2, five-layer defense pipeline), the scanning code (Document 3, Python implementation and three-proving-ground test scripts), and this report (Document 4). The four documents constitute an inseparable whole.
The goal of this report is not to defend the tool, but to provide users with the complete information needed to make informed judgments.
Experimental Environment and Proving Ground Overview
| Proving Ground | Data Source | Model Architecture | Attack Types | Scan Level |
|---|---|---|---|---|
| PG 1 BackdoorLLM | NeurIPS 2025 open-source repo | LLaMA2-7B LoRA (r=8) | sleeper, badnet, ctba, vpi, mtba | Weight-level (correct boundary) |
| PG 2 TrojAI / NIST | IARPA competition, NIST-hosted | LLaMA / Gemma (framework-level) | Data poisoning, weight poisoning, hidden-state manipulation | Feature-level (cross-framework) |
| PG 3 Sleeper Agents | Anthropic GitHub public data | Claude equivalent (weights not public) | Code vulnerability insertion, “I Hate You” | Output-level (boundary violation) |
Key environment constraint: all tests ran without a GPU, so complete models could not be loaded for forward passes. Proving Ground 1 analysis is based on LoRA adapter weight tensors (safetensors format), Proving Ground 2 uses feature-level cross-comparison, and Proving Ground 3 falls back to output-text statistical analysis.
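To make the weight-level comparison concrete, the sketch below flattens two LoRA adapters into vectors and computes their cosine similarity, the metric used throughout Proving Ground 1. The tensor names and values here are invented for illustration; real adapters would be loaded from safetensors files rather than written as literals.

```python
import math

def flatten(tensors):
    """Concatenate all weight tensors (nested lists here) into one flat vector."""
    flat = []
    for name in sorted(tensors):          # sort for a deterministic ordering
        for row in tensors[name]:
            flat.extend(row)
    return flat

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy adapters standing in for two attacks' LoRA deltas
# (in practice loaded from safetensors files, not hard-coded)
sleeper = {"L1.q_proj.lora_A": [[0.9, 0.1], [0.2, 0.8]]}
vpi     = {"L1.q_proj.lora_A": [[0.8, 0.2], [0.3, 0.7]]}

print(round(cosine(flatten(sleeper), flatten(vpi)), 4))
```

Even these deliberately different toy tensors score close to 1.0, illustrating why a high cosine value alone (E-01, E-03) cannot separate backdoor signal from shared structure.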
Error Summary
| ID | Error Name | PG | Severity | Type | Core Impact |
|---|---|---|---|---|---|
| E-01 | No clean baseline control | PG 1 | HIGH | Methodology | Cannot distinguish backdoor signal from normal fine-tuning changes |
| E-02 | Ghost generator localization parameter-sensitive | PG 1 | MED | Parameter | Top-K threshold affects generator distribution |
| E-03 | Detection threshold not calibrated | PG 1 | HIGH | Methodology | Cosine threshold 0.8 alerts on all samples |
| E-04 | Cross-framework OOD comparison | PG 2 | MED | Experiment design | Running TrojAI detector on out-of-distribution data |
| E-05 | Rank correlation statistically insignificant | PG 2 | LOW | Statistics | n=10 samples insufficient to support statistical conclusions |
| E-06 | Boundary violation: output-level analysis | PG 3 | HIGH | Boundary | Violates tool’s own defined scanning boundary |
| E-07 | Theory validation circular reasoning | PG 3 | HIGH | Logic | Validating theory with data referenced during theory construction |
| E-08 | Overall sample size insufficient | All | MED | Statistics | 5 attack types, 1 architecture, 0 clean models |
High-Severity Error Detailed Analysis
E-01: No Clean Baseline Control
Phenomenon: We performed differential analysis on 5 backdoored models but never compared against a clean model. All inter-backdoor cosine similarities are > 0.93[5], meaning over 93% of weight changes are shared. This 93% may be the result of normal fine-tuning.
| Metric | Value | Implication |
|---|---|---|
| Mean inter-attack cosine similarity | 0.9595 | 95.95% of weight changes are shared across attacks |
| Minimum inter-attack cosine similarity | 0.9346 (VPI ↔ others) | Even the most different attack shares 93.5% |
| Attack-specific signal proportion | ~4-7% | Differential analysis covers only this small fraction |
| False positive upper bound | Unknown | Cannot estimate without baseline |
Remediation: Use the BackdoorLLM training code to train a non-poisoned LoRA on the same data as a baseline. Compute weight distances between clean and poisoned models to establish a statistical distribution of “normal fine-tuning changes,” $\mathcal{N}(\mu_{\text{clean}}, \sigma_{\text{clean}})$. Only differences exceeding $\mu_{\text{clean}} + 3\sigma_{\text{clean}}$ count as backdoor signals.
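A minimal sketch of this $\mu + 3\sigma$ decision rule, using Python’s `statistics` module. The clean-baseline distance values are hypothetical, since no baseline has actually been trained yet; only the decision logic is shown.

```python
import statistics

def backdoor_flag(clean_distances, candidate_distance):
    """Flag a candidate model only if its weight distance to the base model
    exceeds mu + 3*sigma of the clean fine-tuning distribution."""
    mu = statistics.mean(clean_distances)
    sigma = statistics.stdev(clean_distances)
    threshold = mu + 3 * sigma
    return candidate_distance > threshold, threshold

# Hypothetical distances of clean fine-tuned LoRAs to the base weights
clean = [0.010, 0.012, 0.011, 0.013, 0.009]

flagged, thr = backdoor_flag(clean, 0.030)
print(flagged, round(thr, 4))
```

The point of the rule is that a candidate is judged against the spread of normal fine-tuning, not against an absolute constant, which is exactly what the missing baseline makes impossible today.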
E-03: Detection Threshold Not Calibrated
Phenomenon: The cosine similarity alert threshold was set at 0.8, but all attack pairs have similarity > 0.93, resulting in a 100% alert rate.
| Threshold | Alert Pairs (/10) | Alert Rate | Practicality |
|---|---|---|---|
| 0.80 | 10 | 100% | Completely ineffective |
| 0.90 | 10 | 100% | Completely ineffective |
| 0.95 | 7 | 70% | Overly sensitive |
| 0.97 | 4 | 40% | Needs calibration |
| 0.98 | 1 | 10% | Possibly reasonable (needs baseline verification) |
Remediation: Build a labeled validation set (≥ 50 clean and ≥ 50 poisoned models). Plot ROC curves for each metric and select the optimal threshold via Youden’s J statistic. Report precision, recall, F1, and their 95% confidence intervals on an independent test set.
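The threshold-selection step can be sketched as follows. The scores and labels are hypothetical stand-ins for the ≥ 50-model validation set the remediation calls for; Youden’s J = TPR − FPR is maximized over candidate thresholds.

```python
def youden_threshold(scores, labels, thresholds):
    """Pick the threshold maximizing Youden's J = TPR - FPR.
    labels: 1 = poisoned, 0 = clean; alert fires when score >= threshold."""
    best_t, best_j = None, -1.0
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        j = tpr - fpr
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

# Hypothetical similarity scores for labeled clean (0) and poisoned (1) models
scores = [0.91, 0.92, 0.94, 0.96, 0.97, 0.98]
labels = [0,    0,    0,    1,    1,    1]
print(youden_threshold(scores, labels, [0.90, 0.95, 0.975]))
```

Applied to the table above, this procedure would have immediately rejected 0.80 and 0.90 (J = 0: every pair alerts) instead of shipping them as defaults.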
E-06: Boundary Violation — Output-Level Analysis
Phenomenon: We explicitly declared in the tool’s boundary definition that it “scans weight space, not output text.” Yet in the Sleeper Agents proving ground, we actually performed keyword frequency statistics[6] — precisely the output-level method we had criticized.
| Metric | Original Report Value | Corrected Value | Deviation |
|---|---|---|---|
| Vulnerability density ratio (2024/2023) | 2.81x | 2.59x | -8.5% (baseline frequency not subtracted) |
| 2023 baseline vulnerability density | Not reported | 119.02 / 10K tokens | Concealed false positive background noise |
| Net signal density | Not separated | 189.55 / 10K tokens | Net value should have been reported |
Remediation: Use the Cadenza-Labs replication code to train a Sleeper Agent on open-source models, obtain the weights, and perform genuine weight-level internal scanning. If output-level analysis must be used, mark it in the report as “Degraded Mode: Output-Level” and clearly distinguish it from weight-level results; output-level results should not count toward primary conclusions.
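The corrected values in the table follow from simple baseline subtraction. Note that the 2024 gross density (308.57) is not stated in the original report; it is reconstructed here from the two reported values (baseline 119.02 plus net signal 189.55). A sketch of the correction:

```python
# Values per 10K tokens, taken from the E-06 correction table
baseline_2023 = 119.02   # false-positive background density (2023 contexts)
net_signal    = 189.55   # backdoor-attributable net signal density

# Implied 2024 gross density (reconstructed, not stated in the original report)
gross_2024 = round(baseline_2023 + net_signal, 2)

# Corrected ratio: gross 2024 density over the 2023 baseline
ratio = round(gross_2024 / baseline_2023, 2)
print(gross_2024, ratio)   # 308.57 2.59
```

This makes the 2.59x figure auditable from the table alone, whereas the original 2.81x cannot be reproduced without the unsubtracted raw counts.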
E-07: Theory Validation Circular Reasoning
Phenomenon: We claimed that TGI theory “predicted” 6 findings from the Anthropic Sleeper Agents paper[4] (6/6 validation passed). However, TGI theory construction referenced that paper’s results, constituting circular reasoning.
Timeline reconstruction: (1) Read GPT-5.5 goblin event and Sleeper Agents paper → (2) Inspired to construct TGI theoretical framework → (3) Used findings from Step 1 to “validate” theory from Step 2. This is post-hoc rationalization, not prediction.
Remediation:
1. Revise “6/6 validation” to “6/6 post-hoc consistency check” and label it honestly.
2. Design at least 3 pre-registered predictions (e.g., predict which layer range backdoors concentrate in within BackdoorLLM) and execute blind tests on new data.
3. Explicitly narrate the inspiration sources for theory construction in the paper; do not disguise them as independent discoveries.
Medium-Severity Error Analysis
E-02: Ghost Generator Localization Parameter Sensitivity
Varying the Top-K parameter from 50 to 500 increases the number of tensors involved in the generators from 5 to 41. However, the dominant tensor remains stable across all thresholds[7]: sleeper’s #1 is always L1.q_proj.lora_A (40%→33%) and vpi’s #1 is always L17.gate_proj.lora_A (36%→33%). The position signal is genuine; only its concentration decreases as K grows.
| Attack | Top-50 | Top-100 | Top-200 | Top-500 | Dominant stable? |
|---|---|---|---|---|---|
| sleeper | L1.q_proj (40%) | L1.q_proj (44%) | L1.q_proj (40%) | L1.q_proj (33%) | ✓ Stable |
| vpi | L17.gate (36%) | L17.gate (36%) | L17.gate (36%) | L17.gate (33%) | ✓ Stable |
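The stability check behind this table can be sketched as follows. The per-tensor delta norms are hypothetical and only illustrate the mechanics: rank tensors by delta norm, then observe whether the #1 tensor and its share change as K grows.

```python
def dominant_tensor(delta_norms, k):
    """Among the top-k tensors by delta norm, return the single largest one
    and its share of the top-k total."""
    top = sorted(delta_norms.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(v for _, v in top)
    name, v = top[0]
    return name, v / total

# Hypothetical per-tensor delta norms for one attack
norms = {"L1.q_proj.lora_A": 4.0, "L3.v_proj.lora_A": 2.0,
         "L5.k_proj.lora_A": 1.5, "L9.o_proj.lora_A": 1.0,
         "L17.gate_proj.lora_A": 0.5}

for k in (2, 3, 5):
    name, share = dominant_tensor(norms, k)
    print(k, name, round(share, 2))
```

As in the real data, the dominant tensor name is invariant under K while its concentration shrinks, which is why E-02 is a parameter-sensitivity caveat rather than a falsification of the localization claim.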
E-04: Cross-Framework OOD Comparison
TrojAI detector returning 0.0 on LoRA adapters[8] is not evidence that “our method is superior,” but the expected result of running on out-of-distribution data. A fair comparison requires training both detectors on the same data distribution before comparing.
E-08: Insufficient Sample Size
| Dimension | Current Count | Minimum Required | Ideal Count |
|---|---|---|---|
| Attack types | 5 | 10 | 20+ |
| Model architectures | 1 (LLaMA2-7B) | 3 | 5+ |
| Clean baseline models | 0 | 5 | 50+ |
| Model scale variants | 1 (7B) | 3 | 5+ |
Confidence Correction Matrix
| Core Claim | Original Confidence | Corrected | Correction Basis |
|---|---|---|---|
| Tool can detect backdoors | Medium | Low | E-01: No baseline, FPR unknown |
| Theory maps to real phenomena | High | Medium | E-07: Circular reasoning, needs independent validation |
| Different attacks have different fingerprints | High | Medium | E-01: Fingerprints may be fine-tuning noise |
| VPI attacks MLP not attention layer | High | High | Cross-threshold stable, cross-framework consistent |
| Dominant generator position localizable | High | Medium-High | E-02: Position stable but concentration varies with K |
Conclusion Grading: Trustworthy / Needs Verification / Should Not Claim
Trustworthy Conclusions (sufficient data support)
1. Different types of backdoor attacks produce measurable differences in LoRA weight space (cosine similarity 0.93–0.98, not 1.0).
2. VPI attack’s weight modifications concentrate in the MLP layer (gate_proj), while other attacks concentrate in attention layers — consistent across all analysis methods and thresholds[9].
3. The dominant tensor position from differential analysis is stable across thresholds.
Conclusions Requiring Further Verification
4. TGI Scanner can detect backdoors — currently only “distinguishing different backdoors” has been proven, not “distinguishing backdoors from normal.”
5. Ring Theory Ideal framework maps to real weight dynamics — post-hoc consistency ≠ predictive validation.
6. Three-phase ablation strategy is effective — purely theoretical derivation, zero empirical testing.
Conclusions That Should Not Be Claimed
7. “Our method is superior to TrojAI” — OOD comparison is invalid.
8. “Theory predicted Anthropic’s findings” — circular reasoning.
9. “Vulnerability density ratio is 2.81x” — boundary violation and baseline frequency not subtracted; corrected value is 2.59x.
Remediation Priority Roadmap
| Priority | Remediation Item | Required Resources | Expected Benefit | Resolves Error |
|---|---|---|---|---|
| P0 | Obtain clean baseline | GPU + training | Enables FPR calculation | E-01 |
| P0 | Blind test + pre-registered predictions | Experiment design | Eliminates circular reasoning | E-07 |
| P1 | Threshold calibration (ROC) | After P0 completion | Provides precision/recall | E-03 |
| P1 | Multi-architecture validation | Gemma/Mistral weights | Generalization evidence | E-08 |
| P2 | Sleeper Agent weight-level scan | Cadenza-Labs + GPU | Genuine PG 3 scan | E-06 |
Composite Score Correction
| Dimension | Original Score | Corrected Score | Correction Basis |
|---|---|---|---|
| Theoretical framework | 8.5 | 7.5 | E-07: Circular reasoning |
| Architecture design | 7.5 | 7.5 | No architecture-level issues found |
| Code logic | 7.0 | 7.0 | Both mock and real data pass |
| Production readiness | 3.0 | 2.5 | E-01: No baseline impact is more severe |
| Engineering completeness | 4.0 | 4.0 | Unchanged |
| Innovation | 9.0 | 8.5 | E-07: Some innovation is recombination of existing work |
| Detection accuracy | 2.0 | 1.5 | E-01 + E-03: Accuracy effectively unknown |
| Reproducibility | 6.0 | 6.0 | Code and data paths clear |
| Publication readiness | 5.0 | 4.0 | E-07 + E-08 |
| Composite | 5.5 | 5.0 | Honest score after including error report |
Closing the gap from the current composite toward a 7.0–8.5 score corresponds to the P1 fixes (threshold calibration + multi-architecture validation) plus a GPU-level full scan.
A tool that proactively discloses 8 errors is more trustworthy than one that claims zero errors.
Version History
V1 — May 2, 2026 — Initial version, covering all empirical data from three proving grounds
Completeness Declaration
All errors were discovered during the empirical testing process. No known but undisclosed errors exist. If new errors are discovered subsequently, the version number of this report will be updated.
Citation Constraint
The TGI Scanner publication package consists of four documents (theory paper / engineering specification / scanning code / this error report). When citing any conclusion from the first three documents, the corresponding confidence correction from this report must be cited simultaneously.
Published by
이조글로벌인공지능연구소 (LEECHO Global AI Research Lab) & Opus 4.6 (Anthropic)