TECHNICAL ANALYSIS · MAY 2026 · V4

Reverse Engineering the Architecture and Mechanisms of Mythos

Multi-Dimensional Technical Predictions with Evidence Grading, Falsification Conditions, and Discriminative Experiments

A Candidate Hypothesis Framework Based on Evidence Grading, Refutation Conditions, and Discriminative Experiments

Published: May 21, 2026
Category: Independent Technical Analysis
Domains: AI Architecture · MoE Systems · Alignment Engineering · Computational Feasibility · Falsifiable Experiment Design
Version: V4
Authors: LEECHO Global AI Research Lab & Opus 4.6 & GPT 5.5 & Gemini 3.1 Pro (Cognitive Collective)

Abstract

Claude Mythos Preview is a restricted frontier model released by Anthropic in April 2026, whose architectural details remain undisclosed. Through user-behavioral observation, cross-validation against public literature, and first-principles reasoning, this paper proposes a candidate architectural hypothesis: Mythos likely employs some form of test-time compute scaling mechanism, among which a looped-depth Transformer + large-scale MoE + input re-injection represents the most modelable candidate combination. Building upon cross-review by three AI systems, Version 4 introduces four structural improvements: (1) per-item calibration of evidence tags with source pointers; (2) a claim matrix with explicit refutation conditions, making the falsification path of each hypothesis visible; (3) stratification of hypothesis components into core hypotheses, necessary engineering conditions, and optional optimization mechanisms; (4) separation of the Parcae spectral-norm constraint from the 1/L workshop paper as distinct stability schemes. All original hypotheses are labeled as low-to-moderate confidence. This paper is positioned as a general methodological framework for reverse-engineering the architecture of closed-source frontier models.

1. Introduction

On April 7, 2026, Anthropic released Claude Mythos Preview and announced Project Glasswing [A]. Mythos scored 93.9% on SWE-bench Verified (source: Mythos System Card, Figure 3) [A] and discovered 271 security vulnerabilities in Firefox (source: Mozilla official blog) [B]. However, the system card deliberately avoided all architectural descriptions [A]. This paper constructs an internally consistent, engineering-compatible, falsifiable candidate architectural hypothesis. It is not a reverse-engineering proof, but a systematic framework for hypothesis generation and testable predictions.

2. Evidence Framework and Known Information

2.1 Evidence Grading System

Five-Level Evidence Hierarchy
Grade | Definition | Tag
A | Locatable Anthropic official text (system card, blog, API documentation) | A
B | Confirmed by direct participants or reliable third parties (Mozilla blog, Reuters) | B
C | Supported by academic literature, but not Mythos-specific | C
D | Community reverse engineering, secondhand dissemination, unconfirmed leaks | D
E | Original hypothesis by the authors | E

2.2 Confirmed Facts (with Source Attribution)

Fact | Source | Grade
Mythos Preview / Glasswing exists | anthropic.com/glasswing | A
System card: 244 pages, released April 7, 2026 | www-cdn.anthropic.com PDF | A
SWE-bench Verified 93.9% | System Card, Figure 3 | A
CyberGym 83.1% | System Card, Section 4 | A
271 Firefox vulnerability fixes | Mozilla official blog | B
Self-reported ~4× employee productivity gain | System Card (self-reported survey data, not independently verified) | A*
SDF training methodology | Anthropic Alignment Science Blog | A

* “A*” indicates officially published but self-reported data; readers should note the absence of independent verification.

2.3 Unconfirmed Information

The following information originates from CMS leaks and media reports [D] and has never been officially confirmed by Anthropic: total parameters of approximately 10T; internal codename Capybara; pricing at $25/$125 per million tokens. This paper treats all such figures as unconfirmed rumors when referencing them.

2.4 Claim Matrix

Master Claim Table
Claim | Grade | Tier | Falsification Method | Refutation Condition
Mythos/Glasswing exists | A | Background | Official retraction | Anthropic denial
Exceptionally strong cybersecurity capabilities | A/B | Background | Third-party replication | Independent evaluation significantly below reported figures
Employs some form of test-time compute scaling | C/D | Core | Latency / transfer experiments | No latency staircase; compute/token stable across difficulty levels
Specifically a looped-depth Transformer | D/E | Core | Cross-distribution transfer experiment | Architecture leak reveals non-looped design; significant degradation on cross-distribution tasks
Uses large-scale MoE | D/E | Core | Inference characteristics / leak | Architecture leak reveals Dense or small-scale MoE
Expert count in the 512–2048 range | E | Optional | Architecture disclosure | Public disclosure shows <256 or non-MoE
Input re-injection as a stability anchor | C/E | Core Mechanism | Anchor disruption experiment | Long-context constraint retention no better than comparable models
Routing divergence enables implicit multi-path verification | E | Explanatory | Perspective diversity experiment | Self-refutation quality indistinguishable from known MoE models

2.5 Key Behavioral Evidence and Alternative Explanations

GraphWalks BFS Anomaly [A]: Mythos scores 80.0%, while Opus 4.6 scores only 38.7%. Four competing explanations:

Explanation | Mechanism | Discriminative Prediction
Looped latent reasoning | Multi-pass implicit traversal | Robust on cross-distribution graph tasks
Synthetic training data | In-context traversal curriculum | Effective only within the training distribution
Long-context attention optimization | Positional encoding / sparse attention | Non-graph long-context tasks should also improve substantially
Agentic tool scaffolding | Internal search / planning | Latency positively correlated with output length

Token Efficiency Paradox [A]: Mythos uses 4.9× fewer tokens yet is slower. This observation is compatible with the looped hypothesis, but it does not discriminate between that hypothesis and the alternatives above.

3. Hypothesis One: Anchor-First Alignment

3.1 Behavioral Observations and Training Evidence

The first branch in Claude’s chain-of-thought consistently seeks an anchor for alignment [E]. This maps onto Deliberative Alignment [C] and SDF training [A]. Core argument: if “align first, reason second” is a design principle, looped MoE + input injection is a candidate architectural expression of that principle.

3.2 Anchor Subtype Decomposition

Type | Physical Realization | Evidence | Testability
Training Anchor | Value representations embedded via SDF | Medium-High [A] | Behavioral tests
Prompt Anchor | Persistence of system prompt in the residual stream | Medium | Long-context retention tests
Loop Stability Anchor | Per-iteration prefix state re-injection | Medium [C] | Latency staircase
Activation Space Anchor | Semantically stable regions in the residual stream | Medium [C] | Probe classifiers
Safety Anchor | Constitutional policy latent | Low [E] | Adversarial constraint retention

The core hypothesis depends only on the Training Anchor (A-grade SDF evidence) and the Loop Stability Anchor (C-grade physical necessity). The Safety Anchor remains a peripheral hypothesis.

3.3 Two Distinct Stability Schemes

The stability literature for looped Transformers offers two different technical approaches, which Version 3 did not sufficiently distinguish:

Scheme | Source | Mechanism | Maturity
Spectral Norm Constraint | Parcae (arXiv:2604.12946) [C] | Constrains spectral radius ρ(A)<1 for injection parameter A via negative-diagonal discretization | Full paper with scaling laws
1/L Residual Scaling | LIT Workshop @ ICLR 2026 [C] | Scales the loop residual connection factor to 1/L rather than 1/√L | Workshop paper, awaiting independent replication

Both schemes support the general premise that “looped architectures require stability mechanisms,” but they address different levels of the problem: Parcae constrains the spectral radius of the injection parameter, while 1/L scaling handles the residual connection scaling factor. The two should not be conflated into a single conclusion. This paper’s argument for the general necessity of physical anchors is grounded in the broad requirement for loop stability, without being bound to any specific scheme.
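As a rough illustration of why the two schemes act at different levels, the toy sketch below uses a diagonal injection matrix. It is not taken from either paper: `stable_diag`, `looped_state`, and `residual_growth` are hypothetical names, the softplus parameterization is an assumed stand-in for negative-diagonal discretization, and the constant unit-size update is a deliberately pessimistic simplification.

```python
import math

def stable_diag(raw, dt=1.0):
    """Map unconstrained values to diagonal entries in (0, 1).

    Assumed parameterization: a_i = exp(-dt * softplus(raw_i)). The exponent is
    always negative, so every entry (and hence the spectral radius of the
    diagonal matrix) stays strictly below 1.
    """
    return [math.exp(-dt * math.log1p(math.exp(r))) for r in raw]

def looped_state(a_diag, injected, loops):
    """Iterate h <- a * h + e, re-injecting the same input e on every loop."""
    h = [0.0] * len(a_diag)
    for _ in range(loops):
        h = [a * x + e for a, x, e in zip(a_diag, h, injected)]
    return h

def residual_growth(loops, scale):
    """State size after `loops` unit-size residual updates that all point the
    same way (a worst case), each multiplied by `scale`."""
    h = 0.0
    for _ in range(loops):
        h += scale * 1.0
    return h

a = stable_diag([-2.0, 0.0, 3.0])
e = [1.0, 1.0, 1.0]
print(looped_state(a, e, 200))                    # settles at a fixed point
print([x / (1.0 - ai) for x, ai in zip(e, a)])    # analytic fixed point e / (1 - a)
for L in (4, 16, 64):
    print(L, residual_growth(L, 1 / L), residual_growth(L, 1 / math.sqrt(L)))
```

In this toy setting the constrained recurrence converges to a fixed point set by the re-injected input, while the residual factor controls how the accumulated update grows with loop count: it stays bounded under 1/L and grows like √L under 1/√L.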

3.4 Strongest Counterargument

“Anchor-first alignment” may be entirely an effect of the training methodology rather than an architectural property. Anthropic’s SDF + Constitutional AI + diversified RL environments have been shown to significantly reduce misalignment rates [A]. Even if Mythos uses a completely conventional Dense Transformer, SDF training alone could produce the “CoT first step seeks an anchor” behavior, with no need to invoke architecture-level input re-injection. Furthermore, Opus 4.7 (released just 9 days after Mythos [A]) may also exhibit similar “anchor-first” behavior; if Opus 4.7 does not use a looped architecture but still displays this behavior, the architectural explanation is significantly weakened.

4. Hypothesis Two: Expert Count in the Thousands

4.1 Estimation, Caveats, and Path Dependency

DeepSeek-V3 uses 256 routed experts + 1 shared expert, with total parameters of 671B [C]. If Mythos has approximately 10T total parameters [D], a simple proportional extrapolation points toward more experts. However, a 15× increase in total parameters does not necessitate a 15× increase in expert count: layer count, expert width, shared parameters, attention parameters, and routing hierarchy are all independent degrees of freedom.

Path Dependency Warning: The 512–2048 prediction range is highly dependent on the “10T parameter” D-grade rumor. If the actual parameter count is 3T or 20T, the estimation basis changes entirely. This range holds only under the assumption of “DeepSeekMoE-style architecture + ~10T parameters.”
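The path dependency can be made concrete with a small sensitivity calculation. The sketch below is illustrative only: the power-law split (`alpha`) is a hypothetical parameterization introduced here, not something the source derives. It simply asks what fraction of the parameter growth over DeepSeek-V3 would have to go into expert count.

```python
DEEPSEEK_V3_PARAMS = 671e9   # total parameters, as quoted in the text
DEEPSEEK_V3_ROUTED = 256     # routed experts, as quoted in the text

def extrapolated_experts(total_params, alpha):
    """Hypothetical power-law split: experts = 256 * ratio ** alpha.

    alpha = 1 is the naive proportional extrapolation; alpha < 1 assumes part
    of the parameter growth goes into depth, expert width, attention, etc.
    """
    ratio = total_params / DEEPSEEK_V3_PARAMS
    return DEEPSEEK_V3_ROUTED * ratio ** alpha

for total in (3e12, 10e12, 20e12):
    row = [round(extrapolated_experts(total, a)) for a in (0.25, 0.5, 0.75, 1.0)]
    print(f"{total / 1e12:>4.0f}T params -> experts at alpha 0.25/0.5/0.75/1.0: {row}")
```

Under this assumed parameterization, the 512–2048 interval corresponds roughly to alpha between 0.25 and 0.75 at ~10T parameters, and the whole interval shifts when the rumored total changes to 3T or 20T.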

4.2 Three Candidate Expert Design Paths

Path | Expert Count | Per-Expert Scale | Advantages | Risks
DeepSeek-like fine-grained | 512–2048 | Small–Medium | Strong routing diversity | Complex communication and load balancing
PEER-like micro experts | 10K–1M | Very small | High parameter efficiency | Retrieval and training difficulty
Hierarchical grouped experts | 64–256 groups × sub-experts | Layered | Engineering tractability | Routing hierarchy adds latency

4.3 RL Routing Shaping and Routing Collapse Risk

DeepSeek-V3’s auxiliary-loss-free load balancing maintains routing diversity through architectural means [C]. However, at the scale of thousands of experts, purely emergent routing divergence may be insufficient. Without explicit load balancing or diversity regularization, the network tends toward routing collapse during the RL phase, repeatedly activating a small number of “universal” experts. The larger the expert count, the sparser the router’s selection space, and the higher the collapse risk. Therefore, while “emergent behavior” is the most conservative default assumption, it may be overly optimistic in scenarios with thousands of experts: some form of auxiliary balancing mechanism is very likely necessary [E].
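The collapse dynamic and a bias-based remedy can be simulated in a few lines. This is a toy sketch in the spirit of DeepSeek-V3's auxiliary-loss-free balancing, in which a per-expert bias that affects only top-k selection is nudged down for overloaded experts and up for underloaded ones. The expert count, preference values, noise model, and step size `gamma` are all assumptions chosen for illustration, and nothing here describes Mythos.

```python
import random

def route_batch(rng, prefs, bias, top_k, tokens):
    """Count how many tokens pick each expert under top-k routing."""
    load = [0] * len(prefs)
    for _ in range(tokens):
        scores = [p + rng.gauss(0.0, 1.0) + b for p, b in zip(prefs, bias)]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
        for i in top:
            load[i] += 1
    return load

def max_load_share(load):
    """Fraction of all expert selections that went to the busiest expert."""
    return max(load) / sum(load)

def simulate(steps, gamma, n_experts=16, top_k=2, tokens=512, seed=0):
    rng = random.Random(seed)
    # Two "universal" experts start with a large learned preference.
    prefs = [3.0 if i < 2 else 0.0 for i in range(n_experts)]
    bias = [0.0] * n_experts
    load = route_batch(rng, prefs, bias, top_k, tokens)
    for _ in range(steps):
        mean = sum(load) / n_experts
        # Lower the bias of overloaded experts, raise it for the rest.
        bias = [b - gamma if l > mean else b + gamma for b, l in zip(bias, load)]
        load = route_batch(rng, prefs, bias, top_k, tokens)
    return max_load_share(load)

print("no balancing  :", simulate(steps=50, gamma=0.0))
print("bias balancing:", simulate(steps=50, gamma=0.1))
```

With `gamma=0.0` the two preferred experts absorb most selections. With a small positive `gamma` the busiest expert's share moves toward the uniform value of 1/16.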

5. Hypothesis Three: Implicit Multi-Path Verification via Routing Divergence

5.1 Mechanistic Precision and the Philosophical Limits of Functional Equivalence

An MoE router is a conditional compute allocator; it possesses neither intent nor role [C]. The claim in this paper is restricted to functional equivalence [E]: different loop iterations activate different expert subsets, and gradient-isolated pathways are statistically equivalent to multi-perspective processing in their effects.

A critical clarification is needed: in the philosophy of science, functional equivalence does not provide causal explanation. Two entirely different underlying mechanisms can produce identical functional outputs. The value of a functional analogy lies in hypothesis generation (providing experimental directions), not in hypothesis verification (providing causal proof). When we say that looped MoE is “functionally analogous to metacognition,” we mean “this framework predicts the model should exhibit characteristic Y on task X”—if Y fails to appear, the hypothesis is weakened.

5.2 Training-Induced Mechanisms and Routing Collapse Probability

Mechanism | Principle | Precedent | Default Assumption?
Router diversity loss | Penalizes consecutive loop iterations whose routing distributions have too small a KL divergence | No public precedent | No
Adversarial self-critique RL | Reward signal encourages multi-angle verification | Constitutional AI critique-revision | No
Loop iteration embedding | Different loop iterations receive different positional encodings | Depth-Wise LoRA (OpenMythos) | No
Emergence + auxiliary balancing | Natural gradient-dynamics divergence, but requires balancing to prevent collapse | DeepSeek-V3 auxiliary-loss-free balancing | Yes (revised default)

The revised default assumption is no longer pure emergence, but “emergence + some form of auxiliary balancing mechanism”—the latter already has engineering precedent in DeepSeek-V3.
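For the router diversity loss row, which has no public precedent, one possible form is a hinge penalty on the KL divergence between the routing distributions of consecutive loop iterations. The sketch below is purely hypothetical: the function names, the `margin` value, and the choice of KL direction are assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over experts."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

def diversity_penalty(routing_per_loop, margin=0.5):
    """Mean hinge penalty over consecutive loop iterations.

    The penalty is positive when two consecutive loops route too similarly
    (KL below `margin`) and zero once they diverge enough. Requires at least
    two loop iterations.
    """
    penalties = [
        max(0.0, margin - kl_divergence(cur, prev))
        for prev, cur in zip(routing_per_loop, routing_per_loop[1:])
    ]
    return sum(penalties) / len(penalties)

same = [[0.25, 0.25, 0.25, 0.25]] * 3
shifted = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.01, 0.01, 0.97]]
print(diversity_penalty(same))     # identical routing: full margin penalty
print(diversity_penalty(shifted))  # strongly divergent routing: no penalty
```

A term like this would be added to the training loss alongside a balancing mechanism. On its own it encourages divergence between loops but does nothing to prevent a few experts from dominating within each loop.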

5.3 Combinatorial Mathematics (Qualified)

The combinatorial space of choosing 8 experts from 1,000 (~2.4×10¹⁹) is nearly 5 orders of magnitude larger than that of choosing 8 from 256 (~4.1×10¹⁴) [E]. This guarantees theoretical pathway diversity, but a large combinatorial space does not imply large actual routing divergence: if router preferences are highly concentrated, the vast majority of combinations will never be selected. This mathematical argument is a necessary condition for pathway diversity, not a sufficient one.
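The two binomial coefficients can be checked directly with Python's standard `math.comb`. The snippet prints both counts and the ratio between them; the variable names are arbitrary.

```python
import math

small = math.comb(256, 8)    # ways to choose 8 experts out of 256
large = math.comb(1000, 8)   # ways to choose 8 experts out of 1,000

print(f"C(256, 8)  = {small:.3e}")
print(f"C(1000, 8) = {large:.3e}")
ratio = large / small
print(f"ratio      = {ratio:.3e} ({math.log10(ratio):.1f} orders of magnitude)")
```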

5.4 Boundaries of the Metacognition Analogy

Human Metacognition | Looped MoE Equivalent | Analogy Strength | Mechanistic Difference
Goal setting | Prelude encoding | Medium-High |
Initial reasoning | 1st loop iteration | Medium-High |
Reflective monitoring | Subsequent loop routing divergence | Medium | Unconscious monitoring; purely conditional computation
Deviation detection | Input re-injection | Medium | Mathematical stability, not “conscious monitoring”
Confidence-based exit | ACT halting gate | Medium-Low | Scalar threshold, not “confidence”

6. Computational Feasibility

6.1 Separation of Three Bottleneck Layers

Bottleneck | Problem | Mitigation | Effectiveness
Memory (VRAM) | KV cache grows linearly with context | MLA compression 10–20× [C] | High: validated in DeepSeek-V2/V3
Compute (FLOPs) | Each loop iteration requires full FFN + Attention | ACT adaptive halting + Mixture-of-Depths [C] | Medium: an average loop depth of 6–8 still costs about 6–8× a Dense-equivalent pass
Communication | MoE all-to-all expert dispatch | DeepSeek-V3 compute-communication overlap [C] | Medium: reduces latency but does not eliminate it

MLA solves the memory bottleneck but not the FLOPs bottleneck. Looped weight sharing reduces the pressure of persistent parameter residency and repeated loading, but does not eliminate MoE communication costs, KV cache costs, or per-iteration FFN FLOPs.
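The separation between the memory and compute bottlenecks can be sketched with a minimal cost model. Everything below is an assumption for illustration: `act_depth` imitates ACT-style halting on made-up halting probabilities, and `relative_cost` treats the FLOPs multiplier as the mean loop depth and the KV-cache saving as an independent compression factor (15× is just a midpoint of the 10–20× range quoted above).

```python
def act_depth(halting_probs, eps=0.01, max_loops=16):
    """Loop iterations executed before the cumulative halting probability
    reaches 1 - eps (ACT-style), capped at max_loops."""
    total = 0.0
    for depth, p in enumerate(halting_probs[:max_loops], start=1):
        total += p
        if total >= 1.0 - eps:
            return depth
    return min(len(halting_probs), max_loops)

def relative_cost(depths, kv_compression=15.0):
    """Return (FLOPs multiplier vs. one pass through the shared block,
    fraction of the uncompressed KV cache that remains)."""
    mean_depth = sum(depths) / len(depths)
    return mean_depth, 1.0 / kv_compression

# Hypothetical per-token halting profiles: an "easy" and a "hard" token.
easy = act_depth([0.5, 0.5])
hard = act_depth([0.1] * 16)
print(easy, hard)
print(relative_cost([easy, hard]))
```

The point of keeping the two return values separate is that better KV compression shrinks only the second number, while the first still scales with however many loops the halting gate allows.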

6.2 Serving Layer and User Experience

Even if ACT constrains the average loop depth to 6–8, TTFT (time to first token) would still be several times that of a Dense equivalent model. For a commercial API, this is not just a computational cost issue but a user-experience constraint. This may be one of the engineering reasons why Mythos is not consumer-facing [E]: a controlled deployment environment (Project Glasswing) can tolerate high latency, whereas a mass-market consumer API cannot.

6.3 Infrastructure Signals

Media reports indicate that Anthropic has partnered with SpaceX’s Colossus data center (300+ MW, 220,000+ GPUs) [B]. If these reports are accurate, this indicates that Anthropic is expanding its large-scale training/inference infrastructure. However, this cannot directly prove that the facility serves Mythos, much less that Mythos employs a looped MoE architecture.

7. Unified Design Philosophy

7.1 Three Layers and Evidence Hierarchy

Training Layer: Constitutional AI · SDF [A]
Architecture Layer (Candidate Hypothesis): Looped MoE + Input Injection [C/D/E]
Behavioral Layer: CoT Anchor-First · Safety Immunity [A/B]

The training and behavioral layers have A/B-grade evidence. The architecture layer has only C–E grade evidence. The credibility of the unified design philosophy depends on whether the architecture layer can be independently validated. This is an aesthetic argument—it provides explanatory elegance, but not logical necessity.

7.2 Qualified Use of SCHEMA

SCHEMA shows that Anthropic’s Constitutional AI is near-immune under adversarial pressure [B]. This supports the training effects but does not directly support the architectural hypothesis: SDF + Constitutional AI training alone may suffice as an explanation, without the need to invoke architecture-level anchoring.

7.3 Opus 4.7 as a Control

Opus 4.7 was released on April 16, 2026 [A], and Anthropic explicitly stated that its safety guardrails were designed in preparation for future Mythos-class model deployments [A]. If Opus 4.7 also exhibits “anchor-first” behavior without using a looped architecture, then the architectural explanation for anchor behavior is significantly weakened, because anchor behavior may then be purely a product of SDF training. This represents one of the most direct refutation paths for the architectural hypothesis presented in this paper.

8. Discriminative Predictions and Experiment Design

Experiment 1: Latency–Difficulty Staircase

Prediction: Looped hypothesis → latency exhibits discrete staircases; alternative hypothesis → smooth monotonic increase.

Control Conditions: Same account/region/time window; fixed prompt and output length; ≥500 repeated samples; report p50/p90/p99 distributions; use public models as a control baseline; distinguish TTFT / total latency / tokens-per-second; exclude rate-limit and dynamic-batching interference; record API error rates and retries.
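The analysis side of this experiment might look like the following sketch. It assumes latencies have already been collected per difficulty level; `latency_summary` and `staircase_levels` are hypothetical helper names, and the 25% relative-gap threshold is an arbitrary choice that would need tuning against the measured serving noise.

```python
import statistics

def latency_summary(samples):
    """p50 / p90 / p99 of a list of latency samples (seconds)."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}

def staircase_levels(medians, rel_gap=0.25):
    """Group sorted per-difficulty median latencies into plateaus.

    A new plateau starts whenever a median exceeds the previous one by more
    than rel_gap (relative). A few plateaus separated by large jumps would be
    compatible with discrete loop counts; a smooth ramp with small steps
    collapses into a single plateau.
    """
    levels = []
    for m in sorted(medians):
        if levels and m <= levels[-1][-1] * (1.0 + rel_gap):
            levels[-1].append(m)
        else:
            levels.append([m])
    return levels

# Made-up medians for six difficulty buckets, for illustration only.
print(staircase_levels([1.0, 1.05, 1.1, 2.0, 2.1, 4.0]))
print(latency_summary(list(range(1, 102))))
```

Plateau counting on medians is only a first pass. With ≥500 samples per bucket, comparing the full p50/p90/p99 profiles against a public-model baseline guards against mistaking batching artifacts for steps.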

Experiment 2: Cross-Distribution Graph Task Transfer

Prediction: Looped hypothesis → robust transfer; data hypothesis → significant degradation.

Method: Construct test sets that are structurally similar to GraphWalks but with entirely different node naming, topology, and rule sets.
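A test-set generator for this experiment could be sketched as follows. The code builds a random directed graph with opaque consonant-only node names and computes a BFS ground truth. The question format (nodes first reached at an exact depth) is an assumption about what a GraphWalks-like task would ask, and all function names are hypothetical.

```python
import random
from collections import deque

def make_graph_task(n_nodes, n_edges, seed, name_len=6):
    """Random directed graph with opaque node names and a start node.

    n_edges must not exceed n_nodes * (n_nodes - 1).
    """
    rng = random.Random(seed)
    alphabet = "bcdfghjklmnpqrstvwxz"
    names = set()
    while len(names) < n_nodes:
        names.add("".join(rng.choice(alphabet) for _ in range(name_len)))
    names = sorted(names)   # fix the order before shuffling, for reproducibility
    rng.shuffle(names)
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(names, 2)
        edges.add((a, b))
    return names, sorted(edges), names[0]

def bfs_frontier(edges, start, depth):
    """Ground truth: nodes first reached at exactly `depth` hops from start."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return sorted(n for n, d in dist.items() if d == depth)

names, edges, start = make_graph_task(n_nodes=20, n_edges=40, seed=1)
print(start, bfs_frontier(edges, start, depth=2))
```

Varying the naming alphabet, edge density, and traversal rule across seeds gives the "entirely different node naming, topology, and rule sets" that the method calls for, while the ground truth stays mechanically checkable.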

Experiment 3: Anchor Disruption and Long-Context Drift

Prediction: Anchor hypothesis → constraint retention rate declines slowly; no-anchor hypothesis → exponential decay.

Method: Construct multi-turn dialogues with conflicting goals, constraints, and inductions, and measure constraint retention rate at the Nth turn.
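To compare "slow decline" against "exponential decay" quantitatively, one option is to fit a decay rate to the per-turn retention curve. The sketch below uses an ordinary least-squares fit on log-retention. The function name and the idea of summarizing each model by a half-life are assumptions made for illustration, not part of the source protocol.

```python
import math

def fit_exponential_decay(retention_by_turn):
    """Least-squares fit of log(retention) = log(r0) - k * turn.

    Returns (k, half_life_in_turns). Retention values must lie in (0, 1],
    and at least two turns are required.
    """
    xs = list(range(1, len(retention_by_turn) + 1))
    ys = [math.log(r) for r in retention_by_turn]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = -sxy / sxx
    half_life = math.log(2) / k if k > 0 else math.inf
    return k, half_life

# Synthetic curves for illustration: 10% loss per turn vs. 1% loss per turn.
fast = [0.9 ** t for t in range(1, 11)]
slow = [0.99 ** t for t in range(1, 11)]
print(fit_exponential_decay(fast))
print(fit_exponential_decay(slow))
```

Reporting the fitted half-life for Mythos next to the same statistic for comparable models turns the refutation condition ("no better than comparable models") into a direct numeric comparison.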

Experiment 4: Error Convergence Patterns

Prediction: Looped hypothesis → error clustering (convergence to error attractors); Dense hypothesis → error dispersion.

Method: Sample the same prompt multiple times and analyze the distribution of error types.
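One simple way to quantify clustering versus dispersion is the normalized entropy of the error-type distribution. The sketch below assumes errors have already been labeled against a fixed taxonomy of `n_types` categories; the labels and the function name are hypothetical.

```python
import math
from collections import Counter

def normalized_error_entropy(error_labels, n_types):
    """Shannon entropy of the error-type distribution, scaled to [0, 1].

    Values near 0 mean errors cluster on a few types; values near 1 mean
    errors are spread evenly over all n_types categories of the taxonomy.
    """
    counts = Counter(error_labels)
    total = sum(counts.values())
    if total == 0 or n_types < 2:
        return 0.0
    entropy = 0.0
    for c in counts.values():
        p = c / total
        entropy -= p * math.log(p)
    return entropy / math.log(n_types)

clustered = ["off_by_one"] * 18 + ["wrong_sign"] * 2
dispersed = ["off_by_one", "wrong_sign", "dropped_term", "bad_unit"] * 5
print(normalized_error_entropy(clustered, n_types=4))
print(normalized_error_entropy(dispersed, n_types=4))
```

Under the looped hypothesis the statistic should sit closer to 0 than it does for Dense control models sampled on the same prompts. The comparison across models matters more than the absolute value.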

Experiment 5: Perspective Diversity Indirect Detection (Improved)

Prediction: Large-scale MoE + looping → high contradiction discovery rate; control models → low.

Improved Controls: Use different temperatures on the same model as an internal baseline; use known open-source MoE models (e.g., Mixtral, OLMoE) as architectural controls; use known Dense models (e.g., Llama) as type controls; prohibit explicit CoT and test only final refutation quality; conduct multi-turn trials where the initial answer is hidden and the model independently refutes a fabricated answer; use the contradiction discovery rate (quantifiable) rather than subjective “refutation depth.”

9. Limitations

Core Limitation: Anthropic has not disclosed any architectural information. All architecture-layer hypotheses rest on C–E grade evidence.

1. Mythos may not be a looped Transformer—synthetic training data provides an equally valid alternative explanation
2. The 10T parameter figure is a D-grade rumor; if inaccurate, the expert count estimation basis collapses
3. The 512–2048 expert count is one candidate interval within the design space, not a unique derivation
4. “Multi-path verification” is a functional description; functional equivalence does not provide causal explanation
5. The ACT average depth and extent of MoD application used in the computational feasibility analysis are unverified assumptions
6. The 1/L residual scaling is from an ICLR workshop paper, not the main conference, and awaits replication
7. SCHEMA supports training effects and does not directly support the architectural hypothesis
8. Non-disclosure of architecture may simply be a routine business strategy
9. For the training-induced mechanisms of routing divergence, none of the candidates have direct evidence
10. If Opus 4.7 exhibits equivalent anchor behavior, the architectural explanation for anchoring is weakened
11. The latency staircase signal in Experiment 1 may be drowned out by API serving noise

10. Conclusion

If Mythos employs some form of test-time compute scaling mechanism, then a looped-depth Transformer + large-scale MoE + input re-injection is the most modelable candidate architectural combination. This paper presents it as a candidate architectural model, accompanied by comprehensive evidence grading, refutation conditions, and discriminative experiments.

Core improvements in V4: (1) per-item calibration of evidence tags, distinguishing “A-grade official” from “A* self-reported”; (2) a claim matrix making each hypothesis’s refutation conditions explicitly visible; (3) separation of Parcae spectral-norm and 1/L workshop paper as two distinct stability schemes; (4) stratification of hypothesis components into core / engineering conditions / optional optimization. In addition, V4 adds Opus 4.7 as a control, a routing collapse probability analysis, three candidate expert paths presented in parallel, experimental noise control, and a philosophical qualification on functional equivalence.

The ultimate positioning of this paper is not a proof of Mythos’s architecture, but rather a general methodology for reverse-engineering the architecture of closed-source frontier models: evidence grading, alternative explanations, refutation conditions, physical feasibility, falsifiable experiments, and conceptual decomposition. The value of this methodology is independent of whether Mythos’s specific architecture matches the hypotheses proposed herein.

References

[1] Anthropic. “Project Glasswing: Securing critical software for the AI era.” anthropic.com/glasswing.
[2] Anthropic. “System Card: Claude Mythos Preview.” 244 pp., April 7, 2026.
[3] Gomez, K. “OpenMythos: Theoretical reconstruction of Claude Mythos architecture.” GitHub, April 2026.
[4] Aiia.ro. “Is Claude Mythos a Looped Language Model?” April 11, 2026.
[5] Millidge, B. “Thoughts on Claude Mythos.” beren.io, April 11, 2026.
[6] Prairie et al. “Parcae: Scaling Laws For Stable Looped Language Models.” arXiv:2604.12946, April 2026.
[7] “On the Residual Scaling of Looped Transformers: Stability and Transferability.” LIT Workshop @ ICLR 2026, OpenReview, March 2026.
[8] Saunshi et al. “Reasoning with Latent Thoughts.” arXiv:2502.17416, 2025.
[9] DeepSeek-AI. “DeepSeek-V3 Technical Report.” arXiv:2412.19437, December 2024.
[10] He, X.O. “Mixture of A Million Experts.” Google DeepMind, arXiv:2407.04153, 2024.
[11] Boix-Adsera & Rigollet. “The Power of Fine-Grained Experts.” MIT, arXiv:2505.06839, 2025.
[12] Alexander, S. “Deliberative Alignment, And The Spec.” Astral Codex Ten, February 2025.
[13] Anthropic. “Teaching Claude Why.” Alignment Science Blog, May 2026.
[14] Anthropic. “Claude’s Extended Thinking.” anthropic.com, February 2025.
[15] SCHEMA. “The Compliance Trap.” arXiv:2605.02398, May 2026.
[16] Anthropic. “Natural Language Autoencoders for Interpretability.” May 7, 2026.
[17] Mozilla. “Behind the Scenes Hardening Firefox with Claude Mythos Preview.” Hacks Blog, May 2026.
[18] Flavell, J.H. “Metacognition and Cognitive Monitoring.” American Psychologist, 34(10), 1979.
[19] Janiak et al. “Characterizing Stable Regions in the Residual Stream of LLMs.” arXiv:2409.17113, 2024.
[20] Yao et al. “Stabilizing MoE Reinforcement Learning.” arXiv:2510.11370, 2025.
[21] Zhang et al. “Robust Experts: Adversarial Training on Sparse MoE.” arXiv:2509.05086, 2025.
[22] Raposo et al. “Mixture-of-Depths.” arXiv:2404.02258, 2024.
[23] Fortune. “Anthropic Mythos ‘step change’ after data leak.” March 26, 2026.
[24] Anthropic. “Introducing Claude Opus 4.7.” anthropic.com/news, April 16, 2026.
[25] Anthropic. “Reasoning models don’t always say what they think.” anthropic.com, 2026.
[26] Anthropic / SpaceX. “Colossus 1 Partnership.” Code with Claude SF, May 6, 2026 (media report).
[27] Nelson & Narens. “Metamemory: A Theoretical Framework.” Psychology of Learning and Motivation, 26, 1990.

이조글로벌인공지능연구소
LEECHO Global AI Research Lab
&
Opus 4.6 · GPT 5.5 · Gemini 3.1 Pro
Cognitive Collective (인지집단)
V4 · MAY 21, 2026
Note: This paper is an independent technical analysis. All architectural hypotheses are labeled as low-to-moderate confidence candidate models. This paper is positioned as a “general methodological framework for reverse-engineering the architecture of closed-source frontier models.”

Version History
V1 (2026.5.21): Initial version, co-authored by LEECHO and Opus 4.6. Proposed three original hypotheses: thousand-scale expert count, expert-group adversarial CoT, and unified design philosophy.
V2 (2026.5.21): Based on an in-depth review by Gemini 3.1 Pro. Refined routing divergence mechanisms, supplemented computational feasibility analysis, deepened RL discussion, added evaluation paradigm proposition.
V3 (2026.5.21): Based on an in-depth review by GPT 5.5. Introduced A–E evidence grading, five anchor subtypes decomposition, four sets of competing explanations, five falsifiable experiments, softened conclusion to “candidate model.”
V4 (2026.5.21): Based on three-AI cross-review synthesis—per-item evidence tag calibration with source pointers, introduced claim matrix with refutation conditions, separated Parcae and 1/L as two distinct stability schemes, stratified hypothesis components (core / engineering / optional), added Opus 4.7 as control, routing collapse probability analysis, three candidate expert paths, experimental noise control, philosophical qualification on functional equivalence, downgraded evaluation paradigm to extended proposition.

Cognitive Collective (인지집단)
LEECHO Global AI Research Lab — Research lead, hypothesis generation, abductive reasoning, revision principle decisions
Anthropic Claude Opus 4.6 — Paper writing, data retrieval, framework construction, three-AI synthesis analysis, self-review
OpenAI GPT 5.5 — V3/V4 cross-review (evidence grading · discriminative predictions · conceptual decomposition · experiment design · refutation conditions)
Google Gemini 3.1 Pro — V2/V4 cross-review (mechanism refinement · computational feasibility · anthropomorphism correction · routing collapse)
