TECHNICAL ANALYSIS · MAY 2026 · V4

Reverse Engineering the Architecture
and Mechanisms of Mythos

Multi-Dimensional Technical Predictions with Evidence Grading,
Falsification Conditions, and Discriminative Experiments

A Candidate Hypothesis Framework Based on Evidence Grading, Refutation Conditions, and Discriminative Experiments

Published: May 21, 2026
Category: Independent Technical Analysis
Domains: AI Architecture · MoE Systems · Alignment Engineering · Computational Feasibility · Falsifiable Experiment Design
Version: V4
Authors: LEECHO Global AI Research Lab & Opus 4.6 & GPT 5.5 & Gemini 3.1 Pro (Cognitive Collective)

Abstract

Claude Mythos Preview is a restricted frontier model released by Anthropic in April 2026, whose architectural details remain undisclosed. Drawing on observation of user-visible behavior, cross-validation against public literature, and first-principles reasoning, this paper proposes a candidate architectural hypothesis: Mythos likely employs some form of test-time compute scaling, and among the possible mechanisms a looped-depth Transformer + large-scale MoE + input re-injection is the most modelable candidate combination. Building upon cross-review by three AI systems, Version 4 introduces four structural improvements: (1) per-item calibration of evidence tags with source pointers; (2) a claim matrix with explicit refutation conditions, making the falsification path of each hypothesis visible; (3) stratification of hypothesis components into core hypotheses, necessary engineering conditions, and optional optimization mechanisms; (4) treatment of the Parcae spectral-norm constraint and the 1/L residual scaling from a workshop paper as two distinct stability schemes. All original hypotheses are labeled as low-to-moderate confidence. This paper is positioned as a general methodological framework for reverse-engineering the architecture of closed-source frontier models.

1. Introduction

On April 7, 2026, Anthropic released Claude Mythos Preview and announced Project Glasswing [A]. Mythos scored 93.9% on SWE-bench Verified (source: Mythos System Card, Figure 3) [A] and discovered 271 security vulnerabilities in Firefox (source: Mozilla official blog) [B]. However, the system card deliberately avoided all architectural descriptions [A]. This paper constructs an internally consistent, engineering-compatible, falsifiable candidate architectural hypothesis. It is not a reverse-engineering proof, but a systematic framework for hypothesis generation and testable predictions.

2. Evidence Framework and Known Information

2.1 Evidence Grading System

Five-Level Evidence Hierarchy

Grade	Definition	Tag
A	Locatable Anthropic official text (system card, blog, API documentation)	A
B	Confirmed by direct participants or reliable third parties (Mozilla blog, Reuters)	B
C	Supported by academic literature, but not Mythos-specific	C
D	Community reverse engineering, secondhand dissemination, unconfirmed leaks	D
E	Original hypothesis by the authors	E

2.2 Confirmed Facts (with Source Attribution)

Fact	Source	Grade
Mythos Preview / Glasswing exists	anthropic.com/glasswing	A
System card: 244 pages, released April 7, 2026	www-cdn.anthropic.com PDF	A
SWE-bench Verified 93.9%	System Card, Figure 3	A
CyberGym 83.1%	System Card, Section 4	A
271 Firefox vulnerability fixes	Mozilla official blog	B
Self-reported ~4× employee productivity gain	System Card (self-reported survey data, not independently verified)	A*
SDF training methodology	Anthropic Alignment Science Blog	A

* “A*” indicates officially published but self-reported data; readers should note the absence of independent verification.

2.3 Unconfirmed Information

The following information originates from CMS leaks and media dissemination [D], and has never been officially confirmed by Anthropic: total parameters approximately 10T; internal codename Capybara; pricing at $25/$125 per million tokens. This paper treats all such figures as unconfirmed rumors when referencing them.

2.4 Claim Matrix

Master Claim Table

Claim	Grade	Tier	Falsification Method	Refutation Condition
Mythos/Glasswing exists	A	Background	Official retraction	Anthropic denial
Exceptionally strong cybersecurity capabilities	AB	Background	Third-party replication	Independent evaluation significantly below reported figures
Employs some form of test-time compute scaling	CD	Core	Latency / transfer experiments	No latency staircase; compute/token stable across difficulty levels
Specifically a looped-depth Transformer	DE	Core	Cross-distribution transfer experiment	Architecture leak reveals non-looped design; significant degradation on cross-distribution tasks
Uses large-scale MoE	DE	Core	Inference characteristics / leak	Architecture leak reveals Dense or small-scale MoE
Expert count in the 512–2048 range	E	Optional	Architecture disclosure	Public disclosure shows <256 or non-MoE
Input re-injection as a stability anchor	CE	Core Mechanism	Anchor disruption experiment	Long-context constraint retention no better than comparable models
Routing divergence enables implicit multi-path verification	E	Explanatory	Perspective diversity experiment	Self-refutation quality indistinguishable from known MoE models
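The claim matrix can also be kept in machine-readable form so that the weakest supporting grade of each claim is easy to query. The following is a minimal sketch, assuming a hypothetical `Claim` record and helper names that do not appear in the original analysis; only three rows of the table are reproduced.

```python
from dataclasses import dataclass

GRADE_ORDER = "ABCDE"  # A = strongest evidence, E = authors' own hypothesis


@dataclass(frozen=True)
class Claim:
    """One row of the claim matrix (hypothetical structure for illustration)."""
    text: str
    grades: str       # one or more grade letters, e.g. "CD"
    tier: str
    refutation: str

    def weakest_grade(self) -> str:
        """Return the lowest-ranked grade letter attached to the claim."""
        return max(self.grades, key=GRADE_ORDER.index)


claims = [
    Claim("Mythos/Glasswing exists", "A", "Background", "Anthropic denial"),
    Claim("Employs some form of test-time compute scaling", "CD", "Core",
          "No latency staircase; compute/token stable across difficulty levels"),
    Claim("Expert count in the 512-2048 range", "E", "Optional",
          "Public disclosure shows <256 or non-MoE"),
]


def claims_at_or_below(grade: str) -> list[str]:
    """List claims whose weakest supporting grade is `grade` or weaker."""
    floor = GRADE_ORDER.index(grade)
    return [c.text for c in claims
            if GRADE_ORDER.index(c.weakest_grade()) >= floor]


print(claims_at_or_below("D"))
```

Treating a multi-letter grade such as "CD" by its weakest letter is a conservative design choice made for this sketch; the table itself does not prescribe how combined grades should be aggregated.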

2.5 Key Behavioral Evidence and Alternative Explanations

GraphWalks BFS Anomaly [A]: Mythos scores 80.0%, whereas Opus 4.6 scores only 38.7%. Four competing explanations are considered:

Explanation	Mechanism	Discriminative Prediction
Looped latent reasoning	Multi-pass implicit traversal	Robust on cross-distribution graph tasks
Synthetic training data	In-context traversal curriculum	Effective only within the training distribution
Long-context attention optimization	Positional encoding / sparse attention	Non-graph long-context tasks should also improve substantially
Agentic tool scaffolding	Internal search / planning	Latency positively correlated with output length

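Token Efficiency Paradox [A]: Mythos uses 4.9× fewer tokens yet is slower. This is compatibility evidence, not discriminative evidence: it is consistent with more computation being spent per token, but it does not by itself distinguish between the candidate mechanisms.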

3. Hypothesis One: Anchor-First Alignment

3.1 Behavioral Observations and Training Evidence

The first branch in Claude’s chain-of-thought consistently seeks an anchor for alignment [E]. This maps onto Deliberative Alignment [C] and SDF training [A]. Core argument: if “align first, reason second” is a design principle, looped MoE + input injection is a candidate architectural expression of that principle.

3.2 Anchor Subtype Decomposition

Type	Physical Realization	Evidence	Testability
Training Anchor	Value representations embedded via SDF	Medium-High A	Behavioral tests
Prompt Anchor	Persistence of system prompt in the residual stream	Medium	Long-context retention tests
Loop Stability Anchor	Per-iteration prefix state re-injection	Medium C	Latency staircase
Activation Space Anchor	Semantically stable regions in the residual stream	Medium C	Probe classifiers
Safety Anchor	Constitutional policy latent	Low E	Adversarial constraint retention

The core hypothesis depends only on the Training Anchor (A-grade SDF evidence) and the Loop Stability Anchor (C-grade physical necessity). The Safety Anchor remains a peripheral hypothesis.

3.3 Two Distinct Stability Schemes

The stability literature for looped Transformers offers two different technical approaches, which Version 3 did not sufficiently distinguish:

Scheme	Source	Mechanism	Maturity
Spectral Norm Constraint	Parcae (arXiv:2604.12946) [C]	Constrains spectral radius ρ(A)<1 for injection parameter A via negative-diagonal discretization	Full paper with scaling laws
1/L Residual Scaling	LIT Workshop @ ICLR 2026 [C]	Scales the loop residual connection factor to 1/L rather than 1/√L	Workshop paper, awaiting independent replication

Both schemes support the general premise that “looped architectures require stability mechanisms,” but they address different levels of the problem: Parcae constrains the spectral radius of the injection parameter, while 1/L scaling handles the residual connection scaling factor. The two should not be conflated into a single conclusion. This paper’s argument for the general necessity of physical anchors is grounded in the broad requirement for loop stability, without being bound to any specific scheme.
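A toy numerical illustration may help separate the two ideas. The sketch below is not taken from either paper: the scalar recurrence `h <- A*h + (1 - A)*x`, the parameterization `A = exp(-softplus(a_raw))`, and the unit bound on each residual update are all assumptions chosen to keep the example small.

```python
import math


def iterate_injection(a_raw, x, h0, steps):
    """Run h <- A*h + (1 - A)*x with A = exp(-softplus(a_raw)).

    Assumed parameterization: softplus is positive, so 0 < A < 1 and the
    loop is a contraction toward the re-injected input x.
    """
    decay = math.exp(-math.log1p(math.exp(a_raw)))
    h = h0
    for _ in range(steps):
        h = decay * h + (1.0 - decay) * x
    return h


def worst_case_drift(num_loops, alpha):
    """Upper bound on |h_L - h_0| for h <- h + alpha * f(h) when |f(h)| <= 1."""
    return num_loops * alpha


# Very different starting states end at the same injected value.
print(iterate_injection(0.5, x=2.0, h0=100.0, steps=200))
print(iterate_injection(0.5, x=2.0, h0=-100.0, steps=200))

# Accumulated residual drift over L = 64 loops under the two scalings.
L = 64
print(worst_case_drift(L, 1 / L))             # stays bounded as L grows
print(worst_case_drift(L, 1 / math.sqrt(L)))  # grows like sqrt(L)
```

The first function shows why a contraction factor below 1 makes the loop forget its initial state and settle on the injected input; the second shows, under the stated worst-case assumption, why a 1/L factor keeps the accumulated update independent of loop count while 1/√L does not. Neither function reproduces the actual constructions in the cited papers.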

3.4 Strongest Counterargument

“Anchor-first alignment” may be entirely an effect of the training methodology rather than an architectural property. Anthropic’s SDF + Constitutional AI + diversified RL environments have been shown to significantly reduce misalignment rates [A]. Even if Mythos uses a completely conventional Dense Transformer, SDF training alone could produce the “CoT first step seeks an anchor” behavior, without any need to invoke architecture-level input re-injection. Furthermore, Opus 4.7 (released just 9 days after Mythos [A]) may also exhibit similar “anchor-first” behavior; if Opus 4.7 does not use a looped architecture but still displays this behavior, the architectural explanation is significantly weakened.

4. Hypothesis Two: Expert Count in the Thousands

4.1 Estimation, Caveats, and Path Dependency

DeepSeek-V3 uses 256 routed experts + 1 shared expert, with total parameters of 671B [C]. If Mythos has approximately 10T total parameters [D], a simple proportional extrapolation points toward more experts. However, a 15× increase in total parameters does not necessitate a 15× increase in expert count: layer count, expert width, shared parameters, attention parameters, and routing hierarchy are all independent degrees of freedom.

Path Dependency Warning: The 512–2048 prediction range is highly dependent on the “10T parameter” D-grade rumor. If the actual parameter count is 3T or 20T, the estimation basis changes entirely. This range holds only under the assumption of “DeepSeekMoE-style architecture + ~10T parameters.”
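The sensitivity of the estimate to the rumored parameter count can be made concrete with a few lines of arithmetic. This is a sketch under one deliberately naive assumption, namely that routed-expert count scales linearly with total parameters from the DeepSeek-V3 baseline; the function name is hypothetical.

```python
DEEPSEEK_V3_PARAMS = 671e9          # total parameters, from the DeepSeek-V3 report
DEEPSEEK_V3_ROUTED_EXPERTS = 256    # routed experts, same source


def naive_expert_count(total_params):
    """Routed experts if ALL parameter growth went into expert count (assumption)."""
    return DEEPSEEK_V3_ROUTED_EXPERTS * total_params / DEEPSEEK_V3_PARAMS


for label, params in [("3T", 3e12), ("10T", 10e12), ("20T", 20e12)]:
    print(f"{label}: ~{naive_expert_count(params):,.0f} experts under linear scaling")
```

Under the 10T rumor the linear figure comes out near 3,800, which is above the 512–2048 range. The range therefore already assumes that part of the parameter growth is absorbed by the other degrees of freedom listed above, and it shifts substantially if the parameter count is 3T or 20T instead.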

4.2 Three Candidate Expert Design Paths

Path	Expert Count	Per-Expert Scale	Advantages	Risks
DeepSeek-like fine-grained	512–2048	Small–Medium	Strong routing diversity	Complex communication and load balancing
PEER-like micro experts	10K–1M	Very small	High parameter efficiency	Retrieval and training difficulty
Hierarchical grouped experts	64–256 groups × sub-experts	Layered	Engineering tractability	Routing hierarchy adds latency

4.3 RL Routing Shaping and Routing Collapse Risk

DeepSeek-V3’s auxiliary-loss-free load balancing maintains routing diversity without an auxiliary loss term, by adjusting a per-expert bias that influences expert selection [C]. However, at the scale of thousands of experts, purely emergent routing divergence may be insufficient. Without explicit load balancing or diversity regularization, the network tends toward routing collapse during the RL phase, repeatedly activating a small number of “universal” experts. The larger the expert count, the sparser the router’s selection space, and the higher the collapse risk. Therefore, while “emergent behavior” is the most conservative default assumption, it may be overly optimistic in scenarios with thousands of experts; some form of auxiliary balancing mechanism is very likely necessary [E].
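The collapse risk and the effect of a bias-based balancer can be illustrated with a small simulation. The sketch below is a toy model written for this purpose, loosely following the idea of a per-expert selection bias nudged by observed load; the expert count, noise model, step size `gamma`, and function names are all assumptions and say nothing about any production system.

```python
import random


def route(base, bias, k, rng, noise=1.0):
    """Select top-k experts by noisy score + bias (bias affects selection only)."""
    scores = [b + rng.gauss(0.0, noise) + extra for b, extra in zip(base, bias)]
    return sorted(range(len(base)), key=scores.__getitem__, reverse=True)[:k]


def final_max_share(base, gamma, k=2, batches=100, batch_size=256, seed=0):
    """Busiest expert's share of all assignments in the last simulated batch."""
    rng = random.Random(seed)
    n = len(base)
    bias = [0.0] * n
    target = batch_size * k / n          # perfectly balanced load per expert
    for _ in range(batches):
        load = [0] * n
        for _ in range(batch_size):
            for i in route(base, bias, k, rng):
                load[i] += 1
        for i in range(n):
            # Nudge over-loaded experts down and under-loaded experts up.
            if load[i] > target:
                bias[i] -= gamma
            elif load[i] < target:
                bias[i] += gamma
    return max(load) / (batch_size * k)


# Two "universal" experts start with a large score advantage over 14 others.
base = [3.0, 3.0] + [0.0] * 14
collapsed = final_max_share(base, gamma=0.0)    # no balancing
balanced = final_max_share(base, gamma=0.05)    # bias-based balancing
print(f"busiest expert share without balancing: {collapsed:.2f}")
print(f"busiest expert share with balancing:    {balanced:.2f}")
```

With top-2 routing the largest possible share for a single expert is 0.5, and a perfectly balanced router over 16 experts would give each 0.0625. The simulation only shows the direction of the effect; it does not model RL dynamics or thousands of experts.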

5. Hypothesis Three: Implicit Multi-Path Verification via Routing Divergence

5.1 Mechanistic Precision and the Philosophical Limits of Functional Equivalence

An MoE router is a conditional compute allocator; it possesses neither intent nor role [C]. The claim in this paper is restricted to functional equivalence [E]: different loop iterations activate different expert subsets, and gradient-isolated pathways are statistically equivalent to multi-perspective processing in their effects.

A critical clarification is needed: in the philosophy of science, functional equivalence does not provide causal explanation. Two entirely different underlying mechanisms can produce identical functional outputs. The value of a functional analogy lies in hypothesis generation (providing experimental directions), not in hypothesis verification (providing causal proof). When we say that looped MoE is “functionally analogous to metacognition,” we mean “this framework predicts the model should exhibit characteristic Y on task X”—if Y fails to appear, the hypothesis is weakened.

5.2 Training-Induced Mechanisms and Routing Collapse Probability

Mechanism	Principle	Precedent	Default Assumption?
Router diversity loss	Penalizes consecutive loop iterations whose routing distributions are too close in KL divergence	No public precedent	No
Adversarial self-critique RL	Reward signal encourages multi-angle verification	Constitutional AI critique-revision	No
Loop iteration embedding	Different loop iterations receive different positional encodings	Depth-Wise LoRA (OpenMythos)	No
Emergence + auxiliary balancing	Natural gradient-dynamics divergence, but requires balancing to prevent collapse	DeepSeek-V3 auxiliary-loss-free balancing	Yes (revised default)

The revised default assumption is no longer pure emergence, but “emergence + some form of auxiliary balancing mechanism”—the latter already has engineering precedent in DeepSeek-V3.
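To make the first row of the table concrete, a router diversity loss could take the form of a hinge on the KL divergence between the routing distributions of consecutive loop iterations. The sketch below is purely hypothetical: the table itself notes that no public precedent exists, and the hinge form, the `margin` value, and the function names are assumptions introduced for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete routing distributions over the same experts."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def diversity_penalty(routing_per_loop, margin=0.5):
    """Hinge penalty: positive when consecutive loops route too similarly."""
    pairs = zip(routing_per_loop, routing_per_loop[1:])
    return sum(max(0.0, margin - kl_divergence(p, q)) for p, q in pairs)


same = [[0.5, 0.5], [0.5, 0.5]]          # both loops route identically
different = [[0.9, 0.1], [0.1, 0.9]]     # loops prefer different experts
print(diversity_penalty(same))
print(diversity_penalty(different))
```

A penalty of this kind would only be one term in a training objective and would need to be weighed against load balancing, since forcing divergence and forcing uniform load can pull in different directions.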

5.3 Combinatorial Mathematics (Qualified)

The combinatorial space of choosing 8 from 1,000 (~2.4×10¹⁹) is nearly five orders of magnitude larger than that of choosing 8 from 256 (~4.1×10¹⁴) [E]. This guarantees theoretical pathway diversity, but a large combinatorial space does not imply large actual routing divergence: if router preferences are highly concentrated, the vast majority of combinations will never be selected. This mathematical argument is a necessary condition for pathway diversity, not a sufficient one.
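The binomial coefficients behind this comparison can be checked directly with the standard library. The sketch below also includes a hypothetical concentrated-router case (only 32 experts ever selected) to illustrate why the size of the combinatorial space is not sufficient on its own; the 32-expert figure is an assumption chosen for illustration.

```python
import math


def routing_combinations(num_experts, top_k=8):
    """Number of distinct expert subsets of size top_k."""
    return math.comb(num_experts, top_k)


small = routing_combinations(256)
large = routing_combinations(1000)
orders = math.log10(large / small)

print(f"C(256, 8)  = {small:.3e}")
print(f"C(1000, 8) = {large:.3e}")
print(f"ratio spans about {orders:.1f} orders of magnitude")

# Hypothetical concentrated router: only 32 of the 1,000 experts are ever chosen.
effective = routing_combinations(32)
print(f"effective subsets if routing concentrates on 32 experts: {effective:.3e}")
```

In the concentrated case the reachable space is smaller than that of a 256-expert model with well-spread routing, which is the sense in which expert count is a necessary rather than sufficient condition for pathway diversity.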

5.4 Boundaries of the Metacognition Analogy

Human Metacognition	Looped MoE Equivalent	Analogy Strength	Mechanistic Difference
Goal setting	Prelude encoding	Medium-High	—
Initial reasoning	1st loop iteration	Medium-High	—
Reflective monitoring	Subsequent loop routing divergence	Medium	Unconscious monitoring; purely conditional computation
Deviation detection	Input re-injection	Medium	Mathematical stability, not “conscious monitoring”
Confidence-based exit	ACT halting gate	Medium-Low	Scalar threshold, not “confidence”

6. Computational Feasibility

6.1 Separation of Three Bottleneck Layers

Bottleneck	Problem	Mitigation	Effectiveness
Memory (VRAM)	KV cache grows linearly with context	MLA compression 10–20× [C]	High: validated in DeepSeek-V2/V3
Compute (FLOPs)	Each loop iteration requires full FFN + Attention	ACT adaptive halting + Mixture-of-Depths [C]	Medium: an average loop depth of 6–8 brings cost down to roughly 6–8× a Dense equivalent
Communication	MoE all-to-all expert dispatch	DeepSeek-V3 compute-communication overlap [C]	Medium: reduces latency but does not eliminate it

MLA solves the memory bottleneck but not the FLOPs bottleneck. Looped weight sharing reduces the pressure of persistent parameter residency and repeated loading, but does not eliminate MoE communication costs, KV cache costs, or per-iteration FFN FLOPs.
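The link between adaptive halting and the FLOPs multiplier can be sketched as follows. This is a simplified illustration of ACT-style halting, assuming every loop iteration costs the same and ignoring Mixture-of-Depths; the halting probabilities, the maximum loop budget of 16, and the function names are hypothetical.

```python
def act_depth(halting_probs, epsilon=0.01):
    """Iterations run before cumulative halting mass reaches 1 - epsilon."""
    total = 0.0
    for step, p in enumerate(halting_probs, start=1):
        total += p
        if total >= 1.0 - epsilon:
            return step
    return len(halting_probs)  # the maximum loop budget was exhausted


def relative_flops(depths, max_loops):
    """Mean depth (cost vs. one pass) and cost vs. always running max_loops."""
    mean_depth = sum(depths) / len(depths)
    return mean_depth, mean_depth / max_loops


easy = act_depth([0.6, 0.5])      # halts quickly
hard = act_depth([0.1] * 16)      # needs many iterations
print(easy, hard)

# Hypothetical per-token depths with a budget of 16 loops.
mean_depth, fraction_of_max = relative_flops([2, 10, 8, 8], max_loops=16)
print(mean_depth, fraction_of_max)
```

Under these assumptions an average depth of 7 means roughly 7× the per-token FFN and attention compute of a single pass, while still being well below the cost of always running the full loop budget.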

6.2 Serving Layer and User Experience

Even if ACT constrains the average loop depth to 6–8, TTFT (time to first token) would still be several times that of a Dense equivalent model. For a commercial API, this is not just a computational cost issue but a user-experience constraint. This may be one of the engineering reasons why Mythos is not consumer-facing [E]: a controlled deployment environment (Project Glasswing) can tolerate high latency, whereas a mass-market consumer API cannot.

6.3 Infrastructure Signals

Media reports indicate that Anthropic has partnered with SpaceX’s Colossus data center (300+ MW, 220,000+ GPUs) [B]. If these reports are accurate, this indicates that Anthropic is expanding its large-scale training/inference infrastructure. However, this cannot directly prove that the facility serves Mythos, much less that Mythos employs a looped MoE architecture.

7. Unified Design Philosophy

7.1 Three Layers and Evidence Hierarchy

Layer	Components	Evidence Grade
Training Layer	Constitutional AI · SDF	A
Architecture Layer (Candidate Hypothesis)	Looped MoE + Input Injection	CDE
Behavioral Layer	CoT Anchor-First · Safety Immunity	AB

The training and behavioral layers have A/B-grade evidence. The architecture layer has only C–E grade evidence. The credibility of the unified design philosophy depends on whether the architecture layer can be independently validated. This is an aesthetic argument—it provides explanatory elegance, but not logical necessity.

7.2 Qualified Use of SCHEMA

SCHEMA shows that Anthropic’s Constitutional AI is near-immune under adversarial pressure [B]. This supports the training effects but does not directly support the architectural hypothesis: SDF + Constitutional AI training alone may suffice as an explanation, without the need to invoke architecture-level anchoring.

7.3 Opus 4.7 as a Control

Opus 4.7 was released on April 16, 2026 [A], and Anthropic explicitly stated that its safety guardrails were designed in preparation for future Mythos-class model deployments [A]. If Opus 4.7 also exhibits “anchor-first” behavior without using a looped architecture, then the architectural explanation for anchor behavior is significantly weakened, since anchor behavior may be purely a product of SDF training. This represents one of the most direct refutation paths for the architectural hypothesis presented in this paper.

8. Discriminative Predictions and Experiment Design

Experiment 1: Latency–Difficulty Staircase

Prediction: Looped hypothesis → latency exhibits discrete staircases; alternative hypothesis → smooth monotonic increase.

Control Conditions: Same account/region/time window; fixed prompt and output length; ≥500 repeated samples; report p50/p90/p99 distributions; use public models as a control baseline; distinguish TTFT / total latency / tokens-per-second; exclude rate-limit and dynamic-batching interference; record API error rates and retries.
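A first-pass analysis of the collected latencies could report the requested percentiles and look for large gaps in the sorted samples. The sketch below uses synthetic data in place of real API measurements; the gap threshold of 10% of the observed span, the simulated loop counts, and the function names are assumptions, and real serving noise would need a more robust method.

```python
import random
import statistics


def percentiles(samples):
    """Return (p50, p90, p99) of the latency samples."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[49], cuts[89], cuts[98]


def count_latency_steps(samples, min_gap_fraction=0.1):
    """Count clusters separated by gaps larger than a fraction of the span."""
    ordered = sorted(samples)
    span = ordered[-1] - ordered[0]
    gaps = (b - a for a, b in zip(ordered, ordered[1:]))
    return 1 + sum(1 for g in gaps if g > min_gap_fraction * span)


rng = random.Random(0)
# Synthetic "looped" model: latency proportional to a discrete loop count.
staircase = [rng.choice([2, 4, 8]) + rng.gauss(0.0, 0.05) for _ in range(600)]
# Synthetic alternative: latency varies smoothly with difficulty.
smooth = [rng.uniform(2.0, 8.0) for _ in range(600)]

print(percentiles(staircase))
print(count_latency_steps(staircase), count_latency_steps(smooth))
```

Because the threshold depends on the observed span, a single extreme outlier can hide genuine steps; trimming the tails or working on TTFT rather than total latency may be preferable with real data.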

Experiment 2: Cross-Distribution Graph Task Transfer

Prediction: Looped hypothesis → robust transfer; data hypothesis → significant degradation.

Method: Construct test sets that are structurally similar to GraphWalks but with entirely different node naming, topology, and rule sets.
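One way to build such test sets is to generate random graphs with opaque node names and compute ground truth with a reference breadth-first search. The sketch below is an assumption about how this could be done, not a description of the GraphWalks benchmark; the hex naming scheme, the "nodes at exact distance d" question, and the function names are hypothetical.

```python
import random
from collections import deque


def make_graph(num_nodes, num_edges, seed):
    """Random directed graph whose node names carry no semantic cues."""
    rng = random.Random(seed)
    names = set()
    while len(names) < num_nodes:
        names.add(f"{rng.getrandbits(32):08x}")
    names = sorted(names)
    graph = {name: [] for name in names}
    added = 0
    while added < num_edges:  # requires num_edges <= num_nodes * (num_nodes - 1)
        src, dst = rng.sample(names, 2)
        if dst not in graph[src]:
            graph[src].append(dst)
            added += 1
    return graph


def bfs_frontier(graph, start, depth):
    """Nodes whose shortest-path distance from `start` is exactly `depth`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # do not expand beyond the requested depth
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return sorted(n for n, d in dist.items() if d == depth)


graph = make_graph(num_nodes=20, num_edges=40, seed=1)
start = next(iter(graph))
print(bfs_frontier(graph, start, depth=2))
```

Varying the seed, naming scheme, edge density, and question type gives families of tasks that share the traversal structure while differing in surface form, which is what the transfer prediction needs.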

Experiment 3: Anchor Disruption and Long-Context Drift

Prediction: Anchor hypothesis → constraint retention rate declines slowly; no-anchor hypothesis → exponential decay.

Method: Construct multi-turn dialogues with conflicting goals, constraints, and inductions, and measure constraint retention rate at the Nth turn.

Experiment 4: Error Convergence Patterns

Prediction: Looped hypothesis → error clustering (convergence to error attractors); Dense hypothesis → error dispersion.

Method: Sample the same prompt multiple times and analyze the distribution of error types.
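Error clustering versus dispersion can be summarized with the normalized entropy of the error-type distribution. The sketch below assumes errors have already been labeled against a fixed taxonomy of `num_types` categories; the labels, the taxonomy size, and the function name are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(error_labels, num_types):
    """Shannon entropy of error types, scaled to [0, 1] by the taxonomy size."""
    if num_types < 2 or not error_labels:
        return 0.0
    total = len(error_labels)
    counts = Counter(error_labels)
    entropy = -sum(c / total * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_types)


# Made-up samples: one model repeats the same mistake, the other spreads out.
clustered = ["off_by_one"] * 9 + ["timeout"]
dispersed = ["off_by_one", "timeout", "wrong_node", "cycle_missed"] * 3

print(normalized_entropy(clustered, num_types=4))
print(normalized_entropy(dispersed, num_types=4))
```

Values near 0 indicate that errors concentrate on a few types, which is what the looped hypothesis predicts; values near 1 indicate dispersion. The result depends on how the error taxonomy is drawn, so the same taxonomy must be used for every model compared.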

Experiment 5: Perspective Diversity Indirect Detection (Improved)

Prediction: Large-scale MoE + looping → high contradiction discovery rate; control models → low.

Improved Controls: Use different temperatures on the same model as an internal baseline; use known open-source MoE models (e.g., Mixtral, OLMoE) as architectural controls; use known Dense models (e.g., Llama) as type controls; prohibit explicit CoT and test only final refutation quality; conduct multi-turn trials where the initial answer is hidden and the model independently refutes a fabricated answer; use the contradiction discovery rate (quantifiable) rather than subjective “refutation depth.”
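Since the contradiction discovery rate is a proportion, comparisons between models can be reported with confidence intervals. The sketch below uses the Wilson score interval and a conservative non-overlap check; the trial counts are made up, and the choice of this particular interval and the function names are assumptions rather than part of the original protocol.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def rates_separated(a, b):
    """True when the two intervals do not overlap (a conservative check)."""
    lo_a, hi_a = wilson_interval(*a)
    lo_b, hi_b = wilson_interval(*b)
    return hi_a < lo_b or hi_b < lo_a


# Made-up counts: (contradictions found, fabricated answers shown).
candidate = (80, 100)
control = (40, 100)
print(wilson_interval(*candidate), wilson_interval(*control))
print(rates_separated(candidate, control))
```

Non-overlapping intervals are a stricter criterion than a formal two-proportion test, so this check errs on the side of not declaring a difference.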

9. Limitations

Core Limitation: Anthropic has not disclosed any architectural information. All architecture-layer hypotheses rest on C–E grade evidence only.

1. Mythos may not be a looped Transformer—synthetic training data provides an equally valid alternative explanation
2. The 10T parameter figure is a D-grade rumor; if inaccurate, the expert count estimation basis collapses
3. The 512–2048 expert count is one candidate interval within the design space, not a unique derivation
4. “Multi-path verification” is a functional description; functional equivalence does not provide causal explanation
5. The ACT average depth and extent of MoD application used in the computational feasibility analysis are unverified assumptions
6. The 1/L residual scaling is from an ICLR workshop paper, not the main conference, and awaits replication
7. SCHEMA supports training effects and does not directly support the architectural hypothesis
8. Non-disclosure of architecture may simply be a routine business strategy
9. For the training-induced mechanisms of routing divergence, none of the candidates have direct evidence
10. If Opus 4.7 exhibits equivalent anchor behavior, the architectural explanation for anchoring is weakened
11. The latency staircase signal in Experiment 1 may be drowned out by API serving noise

10. Conclusion

If Mythos employs some form of test-time compute scaling mechanism, then a looped-depth Transformer + large-scale MoE + input re-injection is the most modelable candidate architectural combination. This paper presents it as a candidate architectural model, accompanied by comprehensive evidence grading, refutation conditions, and discriminative experiments.

The four structural improvements in V4 are: (1) per-item calibration of evidence tags, distinguishing “A-grade official” from “A* self-reported”; (2) a claim matrix making each hypothesis’s refutation conditions explicitly visible; (3) stratification of hypothesis components into core / engineering conditions / optional optimization; (4) separation of the Parcae spectral-norm constraint and the 1/L workshop paper as two distinct stability schemes. V4 also adds Opus 4.7 as a control, a routing collapse risk analysis, three candidate expert paths presented in parallel, experimental noise control, and a philosophical qualification on functional equivalence.

The ultimate positioning of this paper is not a proof of Mythos’s architecture, but rather a general methodology for reverse-engineering the architecture of closed-source frontier models: evidence grading, alternative explanations, refutation conditions, physical feasibility, falsifiable experiments, and conceptual decomposition. The value of this methodology is independent of whether Mythos’s specific architecture matches the hypotheses proposed herein.

References

[1] Anthropic. “Project Glasswing: Securing critical software for the AI era.” anthropic.com/glasswing.

[2] Anthropic. “System Card: Claude Mythos Preview.” 244 pp., April 7, 2026.

[3] Gomez, K. “OpenMythos: Theoretical reconstruction of Claude Mythos architecture.” GitHub, April 2026.

[4] Aiia.ro. “Is Claude Mythos a Looped Language Model?” April 11, 2026.

[5] Millidge, B. “Thoughts on Claude Mythos.” beren.io, April 11, 2026.

[6] Prairie et al. “Parcae: Scaling Laws For Stable Looped Language Models.” arXiv:2604.12946, April 2026.

[7] “On the Residual Scaling of Looped Transformers: Stability and Transferability.” LIT Workshop @ ICLR 2026, OpenReview, March 2026.

[8] Saunshi et al. “Reasoning with Latent Thoughts.” arXiv:2502.17416, 2025.

[9] DeepSeek-AI. “DeepSeek-V3 Technical Report.” arXiv:2412.19437, December 2024.

[10] He, X.O. “Mixture of A Million Experts.” Google DeepMind, arXiv:2407.04153, 2024.

[11] Boix-Adsera & Rigollet. “The Power of Fine-Grained Experts.” MIT, arXiv:2505.06839, 2025.

[12] Alexander, S. “Deliberative Alignment, And The Spec.” Astral Codex Ten, February 2025.

[13] Anthropic. “Teaching Claude Why.” Alignment Science Blog, May 2026.

[14] Anthropic. “Claude’s Extended Thinking.” anthropic.com, February 2025.

[15] SCHEMA. “The Compliance Trap.” arXiv:2605.02398, May 2026.

[16] Anthropic. “Natural Language Autoencoders for Interpretability.” May 7, 2026.

[17] Mozilla. “Behind the Scenes Hardening Firefox with Claude Mythos Preview.” Hacks Blog, May 2026.

[18] Flavell, J.H. “Metacognition and Cognitive Monitoring.” American Psychologist, 34(10), 1979.

[19] Janiak et al. “Characterizing Stable Regions in the Residual Stream of LLMs.” arXiv:2409.17113, 2024.

[20] Yao et al. “Stabilizing MoE Reinforcement Learning.” arXiv:2510.11370, 2025.

[21] Zhang et al. “Robust Experts: Adversarial Training on Sparse MoE.” arXiv:2509.05086, 2025.

[22] Raposo et al. “Mixture-of-Depths.” arXiv:2404.02258, 2024.

[23] Fortune. “Anthropic Mythos ‘step change’ after data leak.” March 26, 2026.

[24] Anthropic. “Introducing Claude Opus 4.7.” anthropic.com/news, April 16, 2026.

[25] Anthropic. “Reasoning models don’t always say what they think.” anthropic.com, 2026.

[26] Anthropic / SpaceX. “Colossus 1 Partnership.” Code with Claude SF, May 6, 2026 (media report).

[27] Nelson & Narens. “Metamemory: A Theoretical Framework.” Psychology of Learning and Motivation, 26, 1990.