Abstract
This report provides a complete description of the core algorithm design of ATM Scanner V2. ATM Scanner is a security vulnerability habitat prediction tool built on large language model (LLM) APIs, whose core innovation lies in engineering abductive reasoning into a repeatable, executable five-stage pipeline algorithm. This document details three major algorithm modules: (1) The Five-Stage Pipeline Algorithm—decomposing the cognitive process of “design archaeology → seam marking → directed scanning → rule extraction” into five LLM invocation steps with well-defined input/output specifications, with each step using carefully designed system prompts to guide the model into the correct reasoning modality; (2) The Multi-Scan Convergence Algorithm—executing N independent scans on the same target, then using fuzzy matching and frequency statistics on seam names to partition the output into a “convergent set” (high confidence) and a “divergent set” (requiring human verification), leveraging LLM sampling variability as signal rather than noise; (3) The Confidence Classification Algorithm—a regex-based line-level output classifier that automatically tags each output line with one of three confidence levels: “directional judgment” (high reliability), “mechanism inference” (moderate reliability), or “precise numerical value” (requires verification), directly addressing the unobservability problem of LLM hidden errors. Empirical data show that the five-stage pipeline achieves a seam hit rate of ~70% across three major security testing ranges, and the multi-scan convergence algorithm reduces the directional judgment error rate from ~6% for a single scan to near 0%.
01 Introduction: Why Algorithmization Is Necessary
The ATM methodology described the five-step cognitive process of “abductive targeted minesweeping” in natural language in its theoretical paper. However, converting a cognitive process into a repeatable, executable software tool requires solving three engineering problems:
Problem 1: How to convert vague cognitive steps into precise LLM invocations? The theoretical paper describes “performing design archaeology”—but an LLM needs a precise system prompt to guide it into the “design archaeology” reasoning modality rather than a generic code analysis modality. The system prompt design for each step is the core algorithmic innovation of ATM Scanner.
Problem 2: How to combat LLM hidden errors? LLM errors are formally indistinguishable from correct outputs. The ~6% mechanism misattribution rate and ~10% numerical deviation rate in a single scan are inherent. The algorithm layer requires systematic error filtering mechanisms.
Problem 3: How to enable human reviewers to efficiently consume AI output? A single complete scan may generate thousands of words of analytical text. Reviewers need to quickly identify “which content can be trusted and which requires additional verification.”
02 Overall Architecture
ATM Scanner V2 employs a three-layer architecture: the Core Pipeline Layer, the Noise Filtering Layer, and the Human-Machine Interface Layer.
| Layer | Algorithm Module | Input | Output |
|---|---|---|---|
| Core Pipeline Layer | Five-Stage Pipeline Algorithm | Target system description | Seam list + Generative rules + Habitat map |
| Noise Filtering Layer | Multi-Scan Convergence Algorithm | N independent scan results | Convergent set (high confidence) + Divergent set (requires verification) |
| Human-Machine Interface Layer | Confidence Classification Algorithm | Single scan text output | Line-level confidence tags (Green / Yellow / Red) |
03 Algorithm 1: The Five-Stage Pipeline
3.1 Overall Pipeline Design
The five-stage pipeline decomposes the ATM methodology’s cognitive process into five serial LLM invocation steps. Each step has an independent system prompt, receives the output of preceding steps as context, and produces structured analytical results. The critical design decision is: each step’s system prompt describes not only “what to do” but also “in what reasoning modality to do it”—this is the core distinction between ATM and generic LLM security analysis.
Note lines 14 and 17 of the algorithm listing: the context received by each step is the concatenation of all preceding steps' outputs (∘ denotes concatenation). This means Step 5 simultaneously sees the full results of the archaeology analysis, seam marking, and directed scanning. The cumulative propagation of context is critical to the quality of the generated rules: without the context of the preceding three steps, Step 5 cannot extract cross-layer causal relationships.
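In JavaScript, the cumulative-context loop can be sketched as follows. `SYSTEM_PROMPTS` and `callLLM` are hypothetical stand-ins for the per-step prompt texts and the API wrapper, not the scanner's actual implementation:

```javascript
// Sketch of the five-stage pipeline's cumulative-context loop.
// SYSTEM_PROMPTS and callLLM are hypothetical stand-ins.
const SYSTEM_PROMPTS = {
  2: "archaeology: trace the designers' physical constraints...",
  3: "marking: detect inter-era assumption conflicts on shared data paths...",
  4: "scanning: judge whether marked seams are exploitable...",
  5: "rules: abstract seams into generative rule templates...",
};

async function runPipeline(targetDescription, callLLM) {
  const outputs = {};
  let context = targetDescription; // Step 1: the original input
  for (const step of [2, 3, 4, 5]) {
    // Each step receives the concatenation (∘) of the input and all prior outputs.
    const out = await callLLM(SYSTEM_PROMPTS[step], context);
    outputs[step] = out;
    context = context + "\n\n" + out; // cumulative context propagation
  }
  return outputs; // outputs[5] contains the generative rules
}
```

The design choice to pass the full concatenation, rather than per-step summaries, is what lets Step 5 see cross-layer causal chains; the cost is the context growth discussed in Section 3.4.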
3.2 System Prompt Design Principles
Each step’s system prompt serves a dual function: task definition (what to do) and reasoning modality setting (in what cognitive posture to do it). The design principles for each step are detailed below:
| Step | Reasoning Modality | Key Constraints | Output Format Requirements |
|---|---|---|---|
| Step 2: Archaeology | Historical tracing modality: trace the physical constraints and implicit assumptions of the original designers | “3–4 sentences per finding” — prevents divergence | 🏛️ First-gen design layer / 🔨 Refactoring events / ⏳ Time span |
| Step 3: Marking | Conflict detection modality: identify conflicts between assumptions from different eras on the same path | “Risk score 1–10 with rationale” — forces quantification | 🔴 High risk / 🟡 Medium risk / 📍 Scan coordinates |
| Step 4: Scanning | Attacker-mindset modality: determine whether assumption conflicts at seams are exploitable | “For defensive security research” — ethical constraint | State / Attack primitive / Trigger condition / Impact scope / Known relations |
| Step 5: Rules | Inductive abstraction modality: extract reusable patterns from specific seams | “Not describing a single vulnerability, but the pattern that generates a class of vulnerabilities” | Template / Known instances / Next habitat / Reuse scope |
3.3 Key Prompt Design Details
Step 2’s “physical constraint” anchoring: The prompt explicitly requires the model to trace the “designers’ physical constraints” rather than “designers’ intentions.” This distinction is critical—”intentions” are subjective and unverifiable, whereas “physical constraints” (e.g., “no hardware cryptographic acceleration existed in 2006”) are objective and traceable. This makes the archaeological analysis results independently verifiable.
Step 3’s “inter-layer conflict” guidance: The prompt uses the precise phrasing “conflicts between design assumptions from different eras on the same data path.” The “same data path” constraint bounds the search space—the model does not search for theoretical conflicts between two unrelated subsystems, but focuses on specific conflict points along paths that data actually traverses.
Step 4’s “attacker mindset” switch: The prompt requires the model to judge whether a seam “is exploitable” rather than “has a bug.” This distinction upgrades the output from academic-style “a problem may exist” to engineering-actionable “the following attack primitives can be constructed,” making the results operationally relevant.
Step 5’s “templatization” requirement: The prompt explicitly requires the template format “When [Condition A] × [Condition B] × [Condition C], inspect [specific target].” This format constraint forces the model to abstract specific findings into reusable patterns—without this constraint, the model tends to output descriptions specific to the current system rather than transferable rules.
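To illustrate how strict the Step 5 format constraint is, a simple validator can check whether a candidate rule line matches the required template. `parseRuleTemplate` is a hypothetical helper; the scanner itself enforces the format through the prompt, not through post-hoc parsing:

```javascript
// Hypothetical validator for Step 5's required rule-template format:
// "When [Condition A] × [Condition B] × [Condition C], inspect [specific target]".
function parseRuleTemplate(line) {
  const m = /^When\s+(.+?)\s*×\s*(.+?)\s*×\s*(.+?),\s*inspect\s+(.+)$/.exec(line);
  if (!m) return null; // not a well-formed rule template
  const [, a, b, c, target] = m;
  return { conditions: [a, b, c], target };
}
```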
3.4 Context Window Management
The cumulative context growth of the five-stage pipeline is an engineering challenge. By Step 5, the user message contains the full output of the preceding three steps plus the original input, potentially reaching tens of thousands of tokens. Management strategies are as follows:
max_tokens configuration: Default is 16,000 tokens; users can adjust between 2,000 and 16,000. Empirical testing shows that Chinese-language output for Steps 4 and 5 can be fully generated within 16,000 tokens. Below 8,000, output may be truncated.
Impact of model selection: Opus 4.6 (200K context window) can process the full cumulative context; Haiku 4.5 (smaller context window) may suffer output quality degradation at Step 5 due to excessive context length. In the algorithm design, Haiku’s max_tokens cap is set to 8,192 to avoid waste.
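The clamping policy described in this section can be sketched as a small helper. The model identifiers are shorthand, and the mapping is an assumption reconstructed from the caps stated above (user range 2,000–16,000; Haiku capped at 8,192):

```javascript
// Sketch of the max_tokens clamping policy. Identifiers and caps follow the
// figures stated in Section 3.4; the mapping itself is an assumption.
const MODEL_TOKEN_CAPS = {
  "opus-4.6": 16000,  // full cumulative context, full output budget
  "sonnet-4": 16000,
  "haiku-4.5": 8192,  // capped to avoid waste on the smaller model
};

function effectiveMaxTokens(model, requested) {
  const cap = MODEL_TOKEN_CAPS[model] ?? 16000;
  // The user-requested value is clamped to [2000, cap].
  return Math.min(cap, Math.max(2000, requested));
}
```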
04 Algorithm 2: Multi-Scan Convergence
4.1 Algorithm Motivation
The LLM sampling process (temperature > 0) causes multiple invocations on the same input to produce different outputs. In conventional applications, this is “non-determinism”—a defect to be eliminated. ATM Scanner inverts it into a signal source: if a seam appears repeatedly across multiple independent scans, it is likely a genuine structural deficiency in the target system; if a seam appears in only one scan, it is more likely sampling noise from the LLM.
4.2 Algorithm Definition
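The formal listing is not reproduced here; the following is a minimal sketch of the convergence procedure, assuming the seam records produced by `extractSeams()` (Section 4.3) and the lowercase 25-character prefix key described in Section 4.4:

```javascript
// Sketch of multi-scan convergence: N scan outputs → convergent/divergent sets.
// Assumes each element of `scans` is the [{ id, name }] array produced by
// extractSeams() for one independent scan.
function converge(scans, theta = 0.6) {
  const n = scans.length;
  const counts = new Map(); // prefix key → { name, hits }
  for (const seams of scans) {
    const seen = new Set(); // count each seam at most once per scan
    for (const seam of seams) {
      const key = seam.name.toLowerCase().slice(0, 25); // coarse fuzzy match
      if (seen.has(key)) continue;
      seen.add(key);
      const entry = counts.get(key) || { name: seam.name, hits: 0 };
      entry.hits += 1;
      counts.set(key, entry);
    }
  }
  const convergent = [], divergent = [];
  for (const entry of counts.values()) {
    // A seam converges if it appears in at least θ·N of the scans.
    (entry.hits / n >= theta ? convergent : divergent).push(entry);
  }
  return { convergent, divergent };
}
```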
4.3 Seam Extraction Function
extract_seams() is the key subroutine of the convergence algorithm. It uses regular expressions to extract structured seam information from the LLM’s free-text output:
```javascript
// Extract structured seam records from the LLM's free-text output.
function extractSeams(text) {
  const seams = [];
  // Matches a "接缝"/"SEAM"/"seam" marker, an optional numbering delimiter,
  // a seam id, a colon, and the seam name up to the end of the line.
  const re = /(?:接缝|SEAM|seam)\s*[##\-—]?\s*(\d+)[::]\s*(.+?)(?:\n|$)/gi;
  let m;
  while ((m = re.exec(text)) !== null) {
    // Truncate names to 60 characters (see the note below).
    seams.push({ id: m[1], name: m[2].trim().substring(0, 60) });
  }
  return seams;
}
```
The regex supports both Chinese (“接缝”) and English (“SEAM”/“seam”) markers, and accepts several numbering delimiters (the ASCII # and fullwidth ＃ hash marks, hyphen, and em dash) as well as both ASCII (:) and fullwidth (：) colons. Names are truncated to 60 characters so that the same seam is not classified as two different seams due to minor variations in descriptive detail.
4.4 Fuzzy Matching Design Trade-offs
The core challenge of the convergence algorithm is: the same seam may be described using different phrasing across different scans. For example, “AF_ALG neighbor interface violation” and “algif_skcipher splice write issue” refer to the same seam.
The current implementation uses lowercase first-25-character matching as a coarse-grained fuzzy match. This is an intentional simplification—more precise semantic matching (e.g., embedding vector cosine similarity) would introduce additional API call costs and latency. In empirical testing, 25-character matching provides sufficient discriminative power for most seam names, because the LLM tends to use similar keywords when describing the same concept.
Future improvement: Introduce lightweight TF-IDF or n-gram overlap scoring to improve matching precision without incurring additional API calls.
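As a sketch of what the proposed n-gram overlap scoring might look like, the following computes Jaccard similarity over character trigrams. This is purely illustrative: the report names only the direction, not an implementation:

```javascript
// Illustrative n-gram overlap score: Jaccard similarity over character trigrams.
function ngrams(s, n = 3) {
  const grams = new Set();
  const t = s.toLowerCase();
  for (let i = 0; i + n <= t.length; i++) grams.add(t.slice(i, i + n));
  return grams;
}

function overlapScore(a, b) {
  const ga = ngrams(a), gb = ngrams(b);
  let inter = 0;
  for (const g of ga) if (gb.has(g)) inter++;
  const union = ga.size + gb.size - inter;
  return union === 0 ? 0 : inter / union; // 0 = disjoint, 1 = identical
}
```

Two phrasings of the same seam share many trigrams even when their 25-character prefixes differ, which is exactly the case the current prefix match misses.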
4.5 Threshold Selection
The convergence threshold θ = 0.6 means that a seam must appear in at least 60% of scans to be classified into the convergent set. For N=3, this means at least 2 appearances; for N=5, at least 3 appearances.
| θ | N=3 | N=5 | Characteristics |
|---|---|---|---|
| 0.4 | ≥2 times | ≥2 times | Lenient: more seams enter the convergent set, higher false positive rate |
| 0.6 (default) | ≥2 times | ≥3 times | Balanced: filters sporadic noise, retains stable findings |
| 0.8 | ≥3 times (all) | ≥4 times | Strict: only highly stable seams pass |
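The threshold arithmetic behind the table reduces to a one-line helper: the minimum number of appearances is the smallest k with k/N ≥ θ:

```javascript
// Minimum appearance count for convergence: the smallest k with k/N >= θ.
function minAppearances(n, theta = 0.6) {
  return Math.ceil(n * theta);
}
```

This reproduces the table: `minAppearances(3, 0.6)` → 2, `minAppearances(5, 0.8)` → 4.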
05 Algorithm 3: Confidence Classification
5.1 Algorithm Motivation: Combating Hidden Errors
LLM hidden errors have three dangerous properties: they are not self-detectable, not externally distinguishable, and highly camouflaged (see “ATM Architecture Demo Test” V2, Section 11 for details). The goal of the confidence classification algorithm is: given that hidden errors cannot be eliminated, at minimum inform the human reviewer “which output lines are most likely to contain errors”.
5.2 Three-Level Classifier
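The classifier's production regexes are not reproduced in this report; the following sketch uses illustrative stand-in patterns keyed to the content types listed in Section 5.3:

```javascript
// Sketch of the line-level three-level classifier. The patterns below are
// illustrative stand-ins, not the scanner's actual regexes.
function classifyLine(line) {
  // Green: seam markers and risk scores are directional judgments.
  if (/SEAM[-\s]?\d+|接缝|\b(?:\d|10)\/10\b/.test(line)) return "DIRECTION";
  // Red: concrete numbers, versions, and time estimates require verification.
  if (/\d+(?:\.\d+)?\s*(?:Gbps|seconds?|bytes?|ms)|\bv?\d+\.\d+\b/i.test(line)) return "NUMERICAL";
  // Yellow: remaining analytical prose is treated as mechanism inference.
  return "MECHANISM";
}
```

Note the ordering: a seam-marker line may also contain digits (e.g. a risk score), so the directional check must run before the numerical check.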
5.3 Empirical Basis for Classification
The reliability ranking of the three classification levels is based on empirically measured error rate data from ATM Scanner’s three kernel scans:
| Level | Tag Color | Content Type | Empirical Error Rate | Example |
|---|---|---|---|---|
| DIRECTION | Green | Seam markers, risk scores, directional judgments | 0% (17/17 correct) | “SEAM-03: Folio dual-track coexistence · 9/10” |
| MECHANISM | Yellow | Attack primitives, trigger conditions, mechanism inferences | ~6% (1/17 misattribution) | “Root cause of Copy Fail is folio path lock conflict” |
| NUMERICAL | Red | Numerical calculations, version numbers, time estimates | ~10% (3/30 deviations) | “40 Gbps wraps around in ~14 seconds” |
This classification directly corresponds to the LLM’s capability hierarchy: LLMs perform strongest in pattern recognition and directional judgment (0% error), slightly weaker in causal reasoning (~6%), and weakest in precise computation (~10%). Confidence classification makes this known capability hierarchy visible to the reviewer.
06 Streaming Output Algorithm
6.1 SSE Streaming Parser
ATM Scanner uses Server-Sent Events (SSE) to receive Claude API output in streaming mode, enabling token-by-token real-time rendering. Two critical bugs were encountered during streaming parsing, and their fixes form part of the engineering algorithm:
Bug 1: UTF-8 multi-byte truncation. Chinese characters occupy 3 UTF-8 bytes. When a network chunk boundary falls in the middle of a Chinese character, TextDecoder produces garbled output. Fix: use new TextDecoder("utf-8", { stream: true }) to enable streaming mode, which buffers incomplete multi-byte sequences until the next chunk.
Bug 2: Cross-chunk JSON parsing of SSE. A single data: {...} line may span two network chunks. Fix: introduce a line-level buffer, split by \n, and retain the last incomplete line as the prefix for the next iteration.
```javascript
// Core streaming parsing logic
const decoder = new TextDecoder("utf-8", { stream: true });
let buffer = "";
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true buffers incomplete multi-byte UTF-8 sequences (fix for Bug 1)
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() || ""; // Retain incomplete line for next chunk (fix for Bug 2)
  for (const line of lines) {
    // Parse data: {...} lines
  }
}
```
07 Model Selection Algorithm
ATM Scanner provides three model options, each with different trade-offs between reasoning depth and speed:
| Model | # Generative Rules | Unique Findings | Code-Level Depth | Use Case |
|---|---|---|---|---|
| Opus 4.6 | 6 rules (TCP scenario) | R6 chain activation, coupling chains | Precise to function name + line number | Formal scans, paper-grade data |
| Sonnet 4 | 3 rules (TCP scenario) | No unique findings | Module-level | Quick exploration, initial screening |
| Haiku 4.5 | Not tested | — | Concept-level | Large-scale batch pre-screening |
Empirical conclusion: Opus 4.6 produces 2× the number of generative rules compared to Sonnet 4, and its unique R6 (chain activation) reveals the structural blind spot of single-seam analysis. ATM’s output quality is positively correlated with model reasoning capability—the methodology’s ceiling is determined by model capability, not by the methodology itself.
08 Hidden Error Mitigation Strategy Matrix
Integrating all three algorithm modules, ATM Scanner V2 constructs a multi-layered hidden error mitigation system:
| Error Type | Incidence Rate | Mitigation Algorithm | Residual Risk |
|---|---|---|---|
| Directional judgment error | 0% (empirical) | Multi-scan convergence (redundant verification) | Very low |
| Mechanism misattribution | ~6% | Multi-scan divergence detection + yellow tag alert | Low (divergent items flagged for review) |
| Numerical / version deviation | ~10% | Red tag auto-marking + future numerical verification pipeline | Medium (currently depends on human verification) |
| Internal numerical contradiction | ~3% | Multi-scan cross-comparison | Low (contradictions amplified across multiple outputs) |
| Over-inference | ~5% | Confidence tags + human review | Medium (semantic-level over-inference is difficult to auto-detect) |
09 Algorithm Complexity and Cost Analysis
| Resource | Single Scan (N=1) | 3× Repeated Scan (N=3) | 5× Repeated Scan (N=5) |
|---|---|---|---|
| API call count | 4 (1 each for Steps 2–5) | 12 (4 steps × 3 times) | 20 (4 steps × 5 times) |
| Input tokens (est.) | ~20K (cumulative context) | ~60K | ~100K |
| Output tokens (est.) | ~40K (4 steps × 10K) | ~120K | ~200K |
| Opus 4.6 est. cost | ~$1.50 | ~$4.50 | ~$7.50 |
| Sonnet 4 est. cost | ~$0.40 | ~$1.20 | ~$2.00 |
| Time (Opus 4.6) | ~8–12 min | ~25–35 min | ~40–60 min |
Comparison with traditional security auditing: a manual audit of a Linux kernel subsystem requires security researchers weeks to months of work, costing tens to hundreds of thousands of dollars. ATM Scanner completes the directional compression of the search space in $1.50–$7.50 and 8–60 minutes—reducing “tens of thousands of functions” to “dozens of precise coordinates.” Subsequent in-depth manual auditing is still required, but the audit scope is compressed by approximately 1,000×.
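The linear scaling in the table above can be expressed directly. The per-scan figures below are the table's single-scan estimates, not measured prices:

```javascript
// Sketch of how the resource estimates in Section 09 scale linearly with N.
// All constants are the table's single-scan estimates.
function estimateRun(n) {
  return {
    apiCalls: 4 * n,         // Steps 2–5, once per scan
    inputTokens: 20000 * n,  // cumulative context per scan (est.)
    outputTokens: 40000 * n, // ~10K per step × 4 steps (est.)
    opusCostUSD: 1.5 * n,    // Opus 4.6 estimated cost per scan
  };
}
```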
10 Limitations and Future Algorithm Improvements
Coarseness of fuzzy matching. The current 25-character prefix matching cannot handle seam descriptions that are semantically equivalent but phrased very differently. Improvement direction: introduce n-gram overlap scoring or lightweight text embedding cosine similarity.
Lack of automated numerical verification. The current red tag only marks “this line contains a numerical value” without automatically verifying the value’s correctness. Improvement direction: introduce a deterministic computation module in the post-processing stage that automatically calculates physically computable quantities such as sequence number wrap-around times and address offsets, then compares them against LLM output.
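As an example of such a deterministic check: TCP's 32-bit sequence space wraps after 2^32 bytes, so the wrap time at a given line rate is fixed arithmetic that can be compared against an LLM-produced claim. `checkClaim` is a hypothetical helper illustrating the comparison step, not part of the current scanner:

```javascript
// Deterministic check for one physically computable quantity: the time for a
// 32-bit sequence space (2^32 bytes) to wrap at a given line rate.
function seqWrapSeconds(gbps) {
  const bytesPerSecond = (gbps * 1e9) / 8;
  return 2 ** 32 / bytesPerSecond;
}

// Hypothetical post-processing comparison: flag an LLM numerical claim if it
// deviates from the computed value by more than a relative tolerance.
function checkClaim(gbps, claimedSeconds, tolerance = 0.1) {
  const actual = seqWrapSeconds(gbps);
  return Math.abs(actual - claimedSeconds) / actual <= tolerance;
}
```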
Context compression. The current pipeline passes the complete output of all preceding steps to subsequent steps. In long-output scenarios, this may cause key information to be diluted. Improvement direction: introduce a summary compression layer between steps, passing only structured key findings rather than the full text.
Static nature of confidence classification. The current classifier is based on fixed regular expressions and cannot adapt to output format differences across domains. Improvement direction: use a lightweight LLM (e.g., Haiku) to perform semantic-level confidence assessment on each output line.
11 Conclusion
The algorithm design of ATM Scanner V2 solves three core problems in engineering abductive reasoning into repeatable, executable software:
The Five-Stage Pipeline Algorithm converts vague cognitive steps into a precise LLM invocation sequence through carefully designed system prompts. The reasoning modality setting for each step—rather than merely the task description—is the algorithm’s core innovation. Empirical testing demonstrates that this pipeline achieves a seam hit rate of ~70% across three major security testing ranges.
The Multi-Scan Convergence Algorithm inverts LLM sampling variability from a noise source into a signal source. Through frequency statistics and fuzzy matching across N independent scans, it partitions the output into a high-confidence convergent set and a needs-review divergent set. Empirical testing demonstrates that the convergence algorithm reduces the directional judgment error rate from ~6% to near 0%.
The Confidence Classification Algorithm automatically attaches visual confidence tags to each output line based on the LLM’s known capability hierarchy (directional judgment > mechanism inference > precise computation), directing the human reviewer’s attention to the content most in need of verification.
12 References
[1] LEECHO Global AI Research Lab. “Abductive Tracing Analysis of the 0-Day Bug Discovered by Mythos — Abductive Targeted Minesweeping (ATM) Methodology.” April 2026.
[2] LEECHO Global AI Research Lab & Opus 4.6. “ATM Architecture Demo Test V2.” May 2026. 14 chapters, including error rate analysis and LLM hidden error discussion.
[3] LEECHO Global AI Research Lab & Opus 4.6. “ATM Security Testing Range Empirical Report V1.” May 2, 2026. Cross-domain validation across three major testing ranges.
[4] Anthropic. “Claude API Documentation: Messages API with Streaming.” docs.anthropic.com, 2026.
[5] Anthropic. “Claude Mythos Preview.” red.anthropic.com/2026/mythos-preview, April 7, 2026.
[6] Peirce, C.S. “Deduction, Induction, and Hypothesis.” Popular Science Monthly, 13, 1878. First formalization of abductive reasoning.
[7] CVE-2026-31431. “Copy Fail: algif_aead page-cache write.” Xint Code / Theori, April 2026.
[8] CVE-2025-37868. “drm/xe/userptr: fix notifier vs folio deadlock.” May 2025.
[9] CVE-2026-23097. “Deadlock in hugetlb folio migration.” Red Hat, January 2026.
[10] Google Security Research. “kernelCTF Rules.” google.github.io/security-research/kernelctf/rules, 2026.
[11] Zero Day Initiative. “Pwn2Own Automotive 2026 Results.” January 2026.
[12] CVE-2026-3910. “Type Confusion in V8 Maglev Compiler.” Google TAG, March 2026.
[13] CVE-2025-2783. “Mojo IPC sandbox escape.” Kaspersky, March 2025.
[14] WHATWG. “Server-Sent Events Specification.” html.spec.whatwg.org/multipage/server-sent-events.html
[15] ECMA-404. “The JSON Data Interchange Standard.” 2017. SSE data payload format.