Reverse Engineering the Architecture and Mechanisms of Mythos
Multi-Dimensional Technical Predictions with Evidence Grading, Falsification Conditions, and Discriminative Experiments
A Candidate Hypothesis Framework Based on Evidence Grading, Refutation Conditions, and Discriminative Experiments
Category: Independent Technical Analysis
Domains: AI Architecture · MoE Systems · Alignment Engineering · Computational Feasibility · Falsifiable Experiment Design
Version: V4
Authors: LEECHO Global AI Research Lab & Opus 4.6 & GPT 5.5 & Gemini 3.1 Pro (Cognitive Collective)
Abstract
Claude Mythos Preview is a restricted frontier model released by Anthropic in April 2026, whose architectural details remain undisclosed. Through user-behavioral observation, cross-validation against public literature, and first-principles reasoning, this paper proposes a candidate architectural hypothesis: Mythos likely employs some form of test-time compute scaling mechanism, among which a looped-depth Transformer + large-scale MoE + input re-injection represents the most modelable candidate combination. Building upon cross-review by three AI systems, Version 4 introduces four structural improvements: (1) per-item calibration of evidence tags with source pointers; (2) a claim matrix with explicit refutation conditions, making the falsification path of each hypothesis visible; (3) stratification of hypothesis components into core hypotheses, necessary engineering conditions, and optional optimization mechanisms; (4) separation of the Parcae spectral-norm constraint from the 1/L workshop paper as distinct stability schemes. All original hypotheses are labeled as low-to-moderate confidence. This paper is positioned as a general methodological framework for reverse-engineering the architecture of closed-source frontier models.
1. Introduction
On April 7, 2026, Anthropic released Claude Mythos Preview and announced Project Glasswing [A]. Mythos scored 93.9% on SWE-bench Verified (source: Mythos System Card, Figure 3) [A] and discovered 271 security vulnerabilities in Firefox (source: Mozilla official blog) [B]. However, the system card deliberately avoided all architectural descriptions [A]. This paper constructs an internally consistent, engineering-compatible, falsifiable candidate architectural hypothesis: not a reverse-engineering proof, but a systematic framework for hypothesis generation and testable predictions.
2. Evidence Framework and Known Information
2.1 Evidence Grading System
| Grade | Definition | Tag |
|---|---|---|
| A | Locatable Anthropic official text (system card, blog, API documentation) | A |
| B | Confirmed by direct participants or reliable third parties (Mozilla blog, Reuters) | B |
| C | Supported by academic literature, but not Mythos-specific | C |
| D | Community reverse engineering, secondhand dissemination, unconfirmed leaks | D |
| E | Original hypothesis by the authors | E |
2.2 Confirmed Facts (with Source Attribution)
| Fact | Source | Grade |
|---|---|---|
| Mythos Preview / Glasswing exists | anthropic.com/glasswing | A |
| System card: 244 pages, released April 7, 2026 | www-cdn.anthropic.com PDF | A |
| SWE-bench Verified 93.9% | System Card, Figure 3 | A |
| CyberGym 83.1% | System Card, Section 4 | A |
| 271 Firefox vulnerability fixes | Mozilla official blog | B |
| Self-reported ~4× employee productivity gain | System Card (self-reported survey data, not independently verified) | A* |
| SDF training methodology | Anthropic Alignment Science Blog | A |
* “A*” indicates officially published but self-reported data; readers should note the absence of independent verification.
2.3 Unconfirmed Information
The following information originates from CMS leaks and media dissemination [D], and has never been officially confirmed by Anthropic: total parameters approximately 10T; internal codename Capybara; pricing at $25/$125 per million tokens. This paper treats all such figures as unconfirmed rumors when referencing them.
2.4 Claim Matrix
| Claim | Grade | Tier | Falsification Method | Refutation Condition |
|---|---|---|---|---|
| Mythos/Glasswing exists | A | Background | Official retraction | Anthropic denial |
| Exceptionally strong cybersecurity capabilities | AB | Background | Third-party replication | Independent evaluation significantly below reported figures |
| Employs some form of test-time compute scaling | CD | Core | Latency / transfer experiments | No latency staircase; compute/token stable across difficulty levels |
| Specifically a looped-depth Transformer | DE | Core | Cross-distribution transfer experiment | Architecture leak reveals non-looped design; significant degradation on cross-distribution tasks |
| Uses large-scale MoE | DE | Core | Inference characteristics / leak | Architecture leak reveals Dense or small-scale MoE |
| Expert count in the 512–2048 range | E | Optional | Architecture disclosure | Public disclosure shows <256 or non-MoE |
| Input re-injection as a stability anchor | CE | Core Mechanism | Anchor disruption experiment | Long-context constraint retention no better than comparable models |
| Routing divergence enables implicit multi-path verification | E | Explanatory | Perspective diversity experiment | Self-refutation quality indistinguishable from known MoE models |
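The claim matrix can also be held as structured data, so that each claim carries its grade, tier, and refutation condition together. The sketch below is a minimal illustration, not part of the original framework: the `Claim` class, the `weakest_grade` helper, and the rule that claims graded C or lower still depend on the experiments are assumptions introduced here, and only three rows of the matrix are reproduced.

```python
from dataclasses import dataclass

# Grade letters follow the evidence grading table (A = official ... E = authors' hypothesis).
GRADE_ORDER = "ABCDE"


@dataclass(frozen=True)
class Claim:
    text: str
    grades: str       # one or more grade letters, e.g. "DE"
    tier: str
    refutation: str

    def weakest_grade(self) -> str:
        """Return the lowest-confidence grade attached to the claim."""
        return max(self.grades, key=GRADE_ORDER.index)


# Three rows copied from the claim matrix; the rest would follow the same shape.
CLAIMS = [
    Claim("Mythos/Glasswing exists", "A", "Background", "Anthropic denial"),
    Claim("Employs some form of test-time compute scaling", "CD", "Core",
          "No latency staircase; compute/token stable across difficulty levels"),
    Claim("Expert count in the 512-2048 range", "E", "Optional",
          "Public disclosure shows <256 or non-MoE"),
]


def needs_experiment(claims):
    """Hypothetical rule: claims whose weakest evidence is C or lower."""
    return [c for c in claims if c.weakest_grade() >= "C"]


for claim in needs_experiment(CLAIMS):
    print(f"[{claim.grades}] {claim.text} -> refuted if: {claim.refutation}")
```

Keeping the refutation condition in the same record as the claim makes it harder for a later revision to strengthen a claim without also restating how it could fail.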
2.5 Key Behavioral Evidence and Alternative Explanations
GraphWalks BFS Anomaly [A]: Mythos 80.0%, Opus 4.6 only 38.7%. Four competing explanations:
| Explanation | Mechanism | Discriminative Prediction |
|---|---|---|
| Looped latent reasoning | Multi-pass implicit traversal | Robust on cross-distribution graph tasks |
| Synthetic training data | In-context traversal curriculum | Effective only within the training distribution |
| Long-context attention optimization | Positional encoding / sparse attention | Non-graph long-context tasks should also improve substantially |
| Agentic tool scaffolding | Internal search / planning | Latency positively correlated with output length |
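Token Efficiency Paradox [A]: Mythos uses 4.9× fewer tokens yet responds more slowly. This is compatibility evidence, not discriminative evidence: it is consistent with test-time compute scaling, but it does not rule out other explanations.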
3. Hypothesis One: Anchor-First Alignment
3.1 Behavioral Observations and Training Evidence
The first branch in Claude’s chain-of-thought consistently seeks an anchor for alignment [E]. This maps onto Deliberative Alignment [C] and SDF training [A]. Core argument: if “align first, reason second” is a design principle, looped MoE + input injection is a candidate hardware expression of that principle.
3.2 Anchor Subtype Decomposition
| Type | Physical Realization | Evidence | Testability |
|---|---|---|---|
| Training Anchor | Value representations embedded via SDF | Medium-High A | Behavioral tests |
| Prompt Anchor | Persistence of system prompt in the residual stream | Medium | Long-context retention tests |
| Loop Stability Anchor | Per-iteration prefix state re-injection | Medium C | Latency staircase |
| Activation Space Anchor | Semantically stable regions in the residual stream | Medium C | Probe classifiers |
| Safety Anchor | Constitutional policy latent | Low E | Adversarial constraint retention |
The core hypothesis depends only on the Training Anchor (A-grade SDF evidence) and the Loop Stability Anchor (C-grade physical necessity). The Safety Anchor remains a peripheral hypothesis.
3.3 Two Distinct Stability Schemes
The stability literature for looped Transformers offers two different technical approaches, which Version 3 did not sufficiently distinguish:
| Scheme | Source | Mechanism | Maturity |
|---|---|---|---|
| Spectral Norm Constraint | Parcae (arXiv:2604.12946) [C] | Constrains spectral radius ρ(A)<1 for injection parameter A via negative-diagonal discretization | Full paper with scaling laws |
| 1/L Residual Scaling | LIT Workshop @ ICLR 2026 [C] | Scales the loop residual connection factor to 1/L rather than 1/√L | Workshop paper, awaiting independent replication |
Both schemes support the general premise that “looped architectures require stability mechanisms,” but they address different levels of the problem: Parcae constrains the spectral radius of the injection parameter, while 1/L scaling handles the residual connection scaling factor. The two should not be conflated into a single conclusion. This paper’s argument for the general necessity of physical anchors is grounded in the broad requirement for loop stability, without being bound to any specific scheme.
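Why a spectral radius below 1 matters for a loop with input re-injection can be seen in a toy linear recurrence. The sketch below is an illustration under strong assumptions, not the Parcae method itself: the state update is reduced to h ← A·h + e with a diagonal A, the diagonal is obtained as exp(-dt·softplus(raw)) as one possible reading of "negative-diagonal discretization", and all parameter values are hypothetical.

```python
import math


def softplus(x: float) -> float:
    return math.log1p(math.exp(x))


def decay_factors(raw, dt=1.0):
    """Map unconstrained parameters to diagonal entries in (0, 1).

    The continuous-time diagonal is -softplus(raw) < 0; discretizing with
    exp(dt * diag) gives a diagonal matrix whose spectral radius is < 1.
    """
    return [math.exp(-dt * softplus(r)) for r in raw]


def run_loop(a_diag, injected, steps):
    """Iterate h <- A h + e, re-injecting the same input e on every pass."""
    h = [0.0] * len(a_diag)
    for _ in range(steps):
        h = [a * hi + e for a, hi, e in zip(a_diag, h, injected)]
    return h


raw = [-3.0, 0.0, 2.0]   # hypothetical learned parameters
e = [1.0, 1.0, 1.0]      # hypothetical re-injected input

a = decay_factors(raw)
stable = run_loop(a, e, 512)
fixed_point = [ei / (1.0 - ai) for ai, ei in zip(a, e)]

# Contrast: a diagonal entry above 1 makes the same loop diverge.
unstable = run_loop([1.05], [1.0], 512)

print("stable state:  ", [round(v, 3) for v in stable])
print("fixed point:   ", [round(v, 3) for v in fixed_point])
print("unstable state:", f"{unstable[0]:.3e}")
```

With every diagonal entry inside (0, 1) the state settles at e / (1 - a) regardless of how many iterations run, which is the sense in which the re-injected input acts as an anchor in this simplified setting.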
3.4 Strongest Counterargument
“Anchor-first alignment” may be entirely an effect of the training methodology rather than an architectural property. Anthropic’s SDF + Constitutional AI + diversified RL environments have been shown to significantly reduce misalignment rates [A]. Even if Mythos uses a completely conventional Dense Transformer, SDF training alone could produce the “CoT first step seeks an anchor” behavior, with no need to invoke architecture-level input re-injection. Furthermore, Opus 4.7 (released just 9 days after Mythos [A]) may also exhibit similar “anchor-first” behavior; if Opus 4.7 does not use a looped architecture but still displays this behavior, the architectural explanation is significantly weakened.
4. Hypothesis Two: Expert Count in the Thousands
4.1 Estimation, Caveats, and Path Dependency
DeepSeek-V3 uses 256 routed experts + 1 shared expert, with total parameters of 671B [C]. If Mythos has approximately 10T total parameters [D], a simple proportional extrapolation points toward more experts. However, a 15× increase in total parameters does not necessitate a 15× increase in expert count: layer count, expert width, shared parameters, attention parameters, and routing hierarchy are all independent degrees of freedom.
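The point about independent degrees of freedom can be made concrete with rough parameter arithmetic. The sketch below is illustrative only: it assumes gated feed-forward experts with three weight matrices each, counts routed-expert parameters alone, and uses DeepSeek-V3 dimensions (58 MoE layers, hidden size 7168, expert width 2048) as understood from its public release, which should be checked against that source. The three 10T-scale configurations are hypothetical and say nothing about Mythos.

```python
def routed_expert_params(moe_layers, experts, d_model, d_expert):
    """Routed-expert parameters, assuming three d_model x d_expert
    matrices per expert (gated FFN); attention and shared parts are ignored."""
    return moe_layers * experts * 3 * d_model * d_expert


# DeepSeek-V3 dimensions as understood from its public release (assumption).
v3 = routed_expert_params(moe_layers=58, experts=256, d_model=7168, d_expert=2048)

# Hypothetical configurations that all land near the unconfirmed ~10T figure.
candidates = {
    "more experts only": (58, 3840, 7168, 2048),
    "wider model":       (58, 1024, 14336, 4096),
    "deeper and wider":  (96, 1024, 12288, 2816),
}

print(f"DeepSeek-V3 routed experts: {v3 / 1e9:.0f}B parameters")
for name, cfg in candidates.items():
    total = routed_expert_params(*cfg)
    print(f"{name:18s} experts={cfg[1]:5d} total={total / 1e12:.1f}T "
          f"({total / v3:.1f}x V3)")
```

Under these assumptions the same parameter budget is reached with 1,024 or 3,840 experts, which is why the rumored total alone cannot pin down the expert count.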
4.2 Three Candidate Expert Design Paths
| Path | Expert Count | Per-Expert Scale | Advantages | Risks |
|---|---|---|---|---|
| DeepSeek-like fine-grained | 512–2048 | Small–Medium | Strong routing diversity | Complex communication and load balancing |
| PEER-like micro experts | 10K–1M | Very small | High parameter efficiency | Retrieval and training difficulty |
| Hierarchical grouped experts | 64–256 groups × sub-experts | Layered | Engineering tractability | Routing hierarchy adds latency |
4.3 RL Routing Shaping and Routing Collapse Risk
DeepSeek-V3’s auxiliary-loss-free load balancing maintains routing diversity through architectural means [C]. However, at the scale of thousands of experts, purely emergent routing divergence may be insufficient. Without explicit load balancing or diversity regularization, the network tends toward routing collapse during the RL phase, repeatedly activating a small number of “universal” experts. The larger the expert count, the sparser the router’s selection space, and the higher the collapse risk. Therefore, while “emergent behavior” is the most conservative default assumption, it may be overly optimistic in scenarios with thousands of experts: some form of auxiliary balancing mechanism is very likely necessary [E].
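The collapse dynamic and the effect of a bias-based balancer can be illustrated with a small simulation. This is a toy sketch in the spirit of auxiliary-loss-free balancing, not DeepSeek-V3's implementation: the expert count, the two "universal" experts with higher affinity, the Gaussian score noise, and the update rate `gamma` are all assumptions chosen for illustration.

```python
import random


def top_k(scores, k):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def route_batch(affinity, bias, tokens, k, rng):
    """Per-expert load for one batch; the bias only affects which experts are selected."""
    load = [0] * len(affinity)
    for _ in range(tokens):
        scores = [a + rng.gauss(0.0, 1.0) for a in affinity]
        for e in top_k([s + b for s, b in zip(scores, bias)], k):
            load[e] += 1
    return load


def imbalance(load):
    """Maximum expert load divided by mean load (1.0 is perfectly balanced)."""
    return max(load) / (sum(load) / len(load))


def simulate(steps, gamma, experts=16, k=2, tokens=256, seed=0):
    rng = random.Random(seed)
    affinity = [2.0, 2.0] + [0.0] * (experts - 2)  # two "universal" experts
    bias = [0.0] * experts
    history = []
    for _ in range(steps):
        load = route_batch(affinity, bias, tokens, k, rng)
        history.append(imbalance(load))
        mean = sum(load) / experts
        # Nudge under-loaded experts up and over-loaded experts down.
        bias = [b + gamma * ((l < mean) - (l > mean)) for b, l in zip(bias, load)]
    return history


collapsed = simulate(steps=100, gamma=0.0)   # no balancing
balanced = simulate(steps=100, gamma=0.05)   # bias-based balancing

print("imbalance without balancing:", round(sum(collapsed[-10:]) / 10, 2))
print("imbalance with balancing:   ", round(sum(balanced[-10:]) / 10, 2))
```

In this toy setting the unbalanced router keeps sending most tokens to the two favored experts, while the bias update spreads load without adding a term to the training loss.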
5. Hypothesis Three: Implicit Multi-Path Verification via Routing Divergence
5.1 Mechanistic Precision and the Philosophical Limits of Functional Equivalence
An MoE router is a conditional compute allocator; it possesses neither intent nor role [C]. The claim in this paper is restricted to functional equivalence [E]: different loop iterations activate different expert subsets, and gradient-isolated pathways are statistically equivalent to multi-perspective processing in their effects.
A critical clarification is needed: in the philosophy of science, functional equivalence does not provide causal explanation. Two entirely different underlying mechanisms can produce identical functional outputs. The value of a functional analogy lies in hypothesis generation (providing experimental directions), not in hypothesis verification (providing causal proof). When we say that looped MoE is “functionally analogous to metacognition,” we mean “this framework predicts the model should exhibit characteristic Y on task X”—if Y fails to appear, the hypothesis is weakened.
5.2 Training-Induced Mechanisms and Routing Collapse Probability
| Mechanism | Principle | Precedent | Default Assumption? |
|---|---|---|---|
| Router diversity loss | Penalizes KL divergence between consecutive loop routing distributions being too small | No public precedent | No |
| Adversarial self-critique RL | Reward signal encourages multi-angle verification | Constitutional AI critique-revision | No |
| Loop iteration embedding | Different loop iterations receive different positional encodings | Depth-Wise LoRA (OpenMythos) | No |
| Emergence + auxiliary balancing | Natural gradient-dynamics divergence, but requires balancing to prevent collapse | DeepSeek-V3 auxiliary-loss-free balancing | Yes (revised default) |
The revised default assumption is no longer pure emergence, but “emergence + some form of auxiliary balancing mechanism”—the latter already has engineering precedent in DeepSeek-V3.
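For concreteness, the "router diversity loss" row of the table could take the following form. Since the table notes there is no public precedent, this is purely a hypothetical sketch: the hinge form, the `margin` value, and the example routing distributions are all assumptions, and nothing here is claimed to be used by any real model.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two routing distributions over the same experts."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def diversity_penalty(routing_per_loop, margin=0.1):
    """Hinge penalty: positive when consecutive loops route too similarly."""
    pairs = zip(routing_per_loop, routing_per_loop[1:])
    return sum(max(0.0, margin - kl_divergence(p, q)) for p, q in pairs)


# Hypothetical routing distributions over four experts for three loop iterations.
same = [[0.7, 0.2, 0.1, 0.0]] * 3
varied = [[0.7, 0.2, 0.1, 0.0], [0.1, 0.1, 0.2, 0.6], [0.0, 0.6, 0.3, 0.1]]

print("penalty, identical routing:", diversity_penalty(same))
print("penalty, varied routing:   ", diversity_penalty(varied))
```

The penalty vanishes once consecutive routing distributions differ by more than the margin, so it pushes against collapse without rewarding arbitrarily large divergence.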
5.3 Combinatorial Mathematics (Qualified)
The combinatorial space of choosing 8 experts from 1,000 (~2.4×10¹⁹) is roughly 5 orders of magnitude larger than that of choosing 8 from 256 (~4.1×10¹⁴) [E]. This guarantees theoretical pathway diversity, but a large combinatorial space does not imply large actual routing divergence: if router preferences are highly concentrated, the vast majority of combinations will never be selected. This mathematical argument is a necessary condition for pathway diversity, not a sufficient one.
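The subset counts can be checked with exact integer arithmetic. The sketch below uses only the standard library; the 32-expert case is a hypothetical illustration of a concentrated router, not a measured value.

```python
import math

small = math.comb(256, 8)     # 8 active experts out of 256
large = math.comb(1000, 8)    # 8 active experts out of 1,000

print(f"C(256, 8)  = {small:.3e}")
print(f"C(1000, 8) = {large:.3e}")
print(f"ratio      = {large / small:.3e} "
      f"(~{math.log10(large / small):.1f} orders of magnitude)")

# If the router only ever sends traffic to a small pool of experts,
# the reachable subsets are bounded by that pool, not by the full count.
concentrated_pool = 32        # hypothetical number of experts actually used
print(f"C({concentrated_pool}, 8)   = {math.comb(concentrated_pool, 8):.3e}")
```

The last line shows the qualification in numbers: a router that concentrates on a few dozen experts reaches only a vanishing fraction of the nominal combinatorial space.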
5.4 Boundaries of the Metacognition Analogy
| Human Metacognition | Looped MoE Equivalent | Analogy Strength | Mechanistic Difference |
|---|---|---|---|
| Goal setting | Prelude encoding | Medium-High | — |
| Initial reasoning | 1st loop iteration | Medium-High | — |
| Reflective monitoring | Subsequent loop routing divergence | Medium | Unconscious monitoring; purely conditional computation |
| Deviation detection | Input re-injection | Medium | Mathematical stability, not “conscious monitoring” |
| Confidence-based exit | ACT halting gate | Medium-Low | Scalar threshold, not “confidence” |
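The last row's point, that an ACT halting gate is a scalar threshold rather than "confidence", is visible in a minimal version of the halting rule. This sketch follows the general idea of adaptive computation time (accumulate a per-iteration halting value and stop once it reaches 1 - ε); the gate outputs, `eps`, and `max_loops` are hypothetical values.

```python
def act_depth(halt_probs, eps=0.01, max_loops=16):
    """Loop count at which the cumulative halting mass reaches 1 - eps,
    capped at max_loops."""
    total = 0.0
    for step, p in enumerate(halt_probs[:max_loops], start=1):
        total += p
        if total >= 1.0 - eps:
            return step
    return min(len(halt_probs), max_loops)


easy = [0.6, 0.5, 0.1]   # hypothetical gate outputs: halts quickly
hard = [0.05] * 30       # hypothetical gate outputs: halts slowly

print("easy input depth:", act_depth(easy))
print("hard input depth:", act_depth(hard))
print("hard input depth, cap raised:", act_depth(hard, max_loops=32))
```

The exit decision is nothing more than a running sum compared against a constant, and the cap means hard inputs are cut off rather than resolved.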
6. Computational Feasibility
6.1 Separation of Three Bottleneck Layers
| Bottleneck | Problem | Mitigation | Effectiveness |
|---|---|---|---|
| Memory (VRAM) | KV cache grows linearly with context | MLA compression 10–20× [C] | High: validated in DeepSeek-V2/V3 |
| Compute (FLOPs) | Each loop iteration requires full FFN + Attention | ACT adaptive halting + Mixture-of-Depths [C] | Medium: an average loop depth of 6–8 reduces cost to roughly 6–8× a Dense equivalent |
| Communication | MoE all-to-all expert dispatch | DeepSeek-V3 compute-communication overlap [C] | Medium: reduces latency but does not eliminate it |
MLA solves the memory bottleneck but not the FLOPs bottleneck. Looped weight sharing reduces the pressure of persistent parameter residency and repeated loading, but does not eliminate MoE communication costs, KV cache costs, or per-iteration FFN FLOPs.
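The separation between the memory and compute bottlenecks can be shown with back-of-the-envelope arithmetic. The sketch below is illustrative only: the layer count, head count, head dimension, and context length are hypothetical, the KV formula is the standard uncompressed multi-head case, and the FLOPs model simply multiplies the looped portion of the stack by the average loop depth.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Uncompressed KV cache: keys and values for every layer and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def looped_flops_ratio(avg_loop_depth, loop_fraction=1.0):
    """Compute relative to a single pass when loop_fraction of the stack repeats."""
    return (1.0 - loop_fraction) + loop_fraction * avg_loop_depth


# Hypothetical dimensions, chosen only to show the arithmetic.
baseline = kv_cache_bytes(layers=64, kv_heads=64, head_dim=128, seq_len=200_000)
print(f"uncompressed: {baseline / 2**30:.1f} GiB per sequence")
for compression in (10, 20):
    print(f"{compression}x compression: "
          f"{baseline / compression / 2**30:.1f} GiB per sequence")

for depth in (6, 8):
    print(f"average depth {depth}: {looped_flops_ratio(depth):.0f}x single-pass FLOPs")
```

Compressing the cache changes the first set of numbers and leaves the second untouched, which is the sense in which the memory mitigation does not address the compute bottleneck.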
6.2 Serving Layer and User Experience
Even if ACT constrains the average loop depth to 6–8, TTFT (time to first token) would still be several times that of a Dense equivalent model. For a commercial API, this is not just a computational cost issue but a user-experience constraint. This may be one of the engineering reasons why Mythos is not consumer-facing [E]: a controlled deployment environment (Project Glasswing) can tolerate high latency, whereas a mass-market consumer API cannot.
6.3 Infrastructure Signals
Media reports indicate that Anthropic has partnered with SpaceX’s Colossus data center (300+ MW, 220,000+ GPUs) [B]. If these reports are accurate, this indicates that Anthropic is expanding its large-scale training/inference infrastructure. However, this cannot directly prove that the facility serves Mythos, much less that Mythos employs a looped MoE architecture.
7. Unified Design Philosophy
7.1 Three Layers and Evidence Hierarchy
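| Layer | Components | Evidence Grade |
|---|---|---|
| Training | Constitutional AI · SDF | A |
| Architecture | Looped MoE + Input Injection | C, D, E |
| Behavior | CoT Anchor-First · Safety Immunity | A, B |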
The training and behavioral layers have A/B-grade evidence. The architecture layer has only C–E grade evidence. The credibility of the unified design philosophy depends on whether the architecture layer can be independently validated. This is an aesthetic argument—it provides explanatory elegance, but not logical necessity.
7.2 Qualified Use of SCHEMA
SCHEMA shows that Anthropic’s Constitutional AI is near-immune under adversarial pressure [B]. This supports the training effects but does not directly support the architectural hypothesis: SDF + Constitutional AI training alone may suffice as an explanation, without the need to invoke architecture-level anchoring.
7.3 Opus 4.7 as a Control
Opus 4.7 was released on April 16, 2026 [A], and Anthropic explicitly stated that its safety guardrails were designed in preparation for future Mythos-class model deployments [A]. If Opus 4.7 also exhibits “anchor-first” behavior without using a looped architecture, then the architectural explanation for anchor behavior is significantly weakened, since anchor behavior may be purely a product of SDF training. This represents one of the most direct refutation paths for the architectural hypothesis presented in this paper.
8. Discriminative Predictions and Experiment Design
Experiment 1: Latency–Difficulty Staircase
Prediction: Looped hypothesis → latency exhibits discrete staircases; alternative hypothesis → smooth monotonic increase.
Control Conditions: Same account/region/time window; fixed prompt and output length; ≥500 repeated samples; report p50/p90/p99 distributions; use public models as a control baseline; distinguish TTFT / total latency / tokens-per-second; exclude rate-limit and dynamic-batching interference; record API error rates and retries.
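One way to turn "discrete staircases versus smooth increase" into a number is sketched below. It is a hypothetical analysis, not a validated detector: the `step_concentration` statistic, the choice of the two largest jumps, and the synthetic latency data (three plateaus versus a linear ramp, with Gaussian noise) are all assumptions, and real API measurements would carry the serving noise listed in the control conditions.

```python
import random
import statistics


def summarize(latencies):
    """p50 / p90 / p99 of one difficulty bucket."""
    q = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50": q[49], "p90": q[89], "p99": q[98]}


def step_concentration(medians, top=2):
    """Share of the total latency rise carried by the `top` largest jumps
    between adjacent difficulty levels; values near 1.0 suggest plateaus."""
    jumps = sorted((max(0.0, b - a) for a, b in zip(medians, medians[1:])),
                   reverse=True)
    total = sum(jumps)
    return sum(jumps[:top]) / total if total > 0 else 0.0


def simulate(level_means, n=500, noise=0.2, seed=0):
    """Synthetic latency samples (seconds) for each difficulty level."""
    rng = random.Random(seed)
    return [[m + rng.gauss(0.0, noise) for _ in range(n)] for m in level_means]


staircase = simulate([1, 1, 1, 2, 2, 2, 4, 4, 4])      # loop count 1, 2, 4
smooth = simulate([0.4 * d for d in range(1, 10)])     # linear in difficulty

for name, buckets in (("staircase", staircase), ("smooth", smooth)):
    medians = [summarize(b)["p50"] for b in buckets]
    print(f"{name:9s} step concentration = {step_concentration(medians):.2f}")
```

The statistic is deliberately crude; its purpose is to fix in advance what counts as a staircase, so the experiment cannot be read either way after the data arrive.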
Experiment 2: Cross-Distribution Graph Task Transfer
Prediction: Looped hypothesis → robust transfer; data hypothesis → significant degradation.
Method: Construct test sets that are structurally similar to GraphWalks but with entirely different node naming, topology, and rule sets.
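A test-set generator along these lines might look like the sketch below. It is a minimal illustration, not the GraphWalks format: the node-naming scheme, the random directed graphs, and the "nodes within N hops" question are assumptions, and a real transfer set would also vary topology families and rule sets as the method describes.

```python
import random
from collections import deque


def make_graph_task(n_nodes, n_edges, seed):
    """Random directed graph with opaque node names; returns (names, edges, start).

    Requires n_edges <= n_nodes * (n_nodes - 1).
    """
    rng = random.Random(seed)
    names = set()
    while len(names) < n_nodes:
        names.add("".join(rng.choices("bcdfghjklmnpqrstvwxz", k=6)))
    names = sorted(names)          # sort first so the result is reproducible
    rng.shuffle(names)
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(names, 2)
        edges.add((a, b))
    return names, sorted(edges), names[0]


def bfs_within(edges, start, depth):
    """Ground truth: nodes reachable from start in at most `depth` hops."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return sorted(seen - {start})


names, edges, start = make_graph_task(n_nodes=20, n_edges=40, seed=1)
prompt = "\n".join(f"{a} -> {b}" for a, b in edges)
answer = bfs_within(edges, start, depth=2)
print(f"Which nodes are within 2 hops of {start}?")
print("expected:", answer)
```

Because the answer key comes from an explicit traversal, scoring does not depend on any model's output format beyond extracting a set of node names.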
Experiment 3: Anchor Disruption and Long-Context Drift
Prediction: Anchor hypothesis → constraint retention rate declines slowly; no-anchor hypothesis → exponential decay.
Method: Construct multi-turn dialogues with conflicting goals, constraints, and inductions, and measure constraint retention rate at the Nth turn.
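The two predicted shapes can be separated by fitting both to the measured retention curve. The sketch below is a hypothetical analysis: it compares a straight-line fit against an exponential fit by squared error, and the two synthetic retention curves stand in for data that would come from the multi-turn trials. Retention values must be strictly positive for the log transform.

```python
import math


def linfit(xs, ys):
    """Ordinary least squares for y = a + b x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b


def compare_decay(turns, retention):
    """Squared error of a linear fit and of an exponential fit to retention."""
    a, b = linfit(turns, retention)
    sse_lin = sum((y - (a + b * t)) ** 2 for t, y in zip(turns, retention))
    la, lb = linfit(turns, [math.log(y) for y in retention])
    sse_exp = sum((y - math.exp(la + lb * t)) ** 2 for t, y in zip(turns, retention))
    return {"linear": sse_lin, "exponential": sse_exp,
            "half_life": math.log(0.5) / lb}


turns = list(range(1, 21))
slow = [1.0 - 0.01 * t for t in turns]       # synthetic: slow, steady decline
fast = [math.exp(-0.2 * t) for t in turns]   # synthetic: exponential decay

for name, curve in (("slow", slow), ("fast", fast)):
    fit = compare_decay(turns, curve)
    better = "linear" if fit["linear"] < fit["exponential"] else "exponential"
    print(f"{name}: better fit = {better}, half-life = {fit['half_life']:.1f} turns")
```

Reporting the fitted half-life alongside the better-fitting shape gives a single comparable number per model, which matters when the control models also show some drift.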
Experiment 4: Error Convergence Patterns
Prediction: Looped hypothesis → error clustering (convergence to error attractors); Dense hypothesis → error dispersion.
Method: Sample the same prompt multiple times and analyze the distribution of error types.
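Clustering versus dispersion of error types can be quantified with the entropy of the error-label distribution. The sketch below is one hypothetical way to do so: the error taxonomy, its size `n_types`, and the example label lists are assumptions, and labels would in practice come from manual or automated annotation of the sampled outputs.

```python
import math
from collections import Counter


def normalized_entropy(labels, n_types):
    """Shannon entropy of the error-type distribution, scaled to [0, 1]
    by the entropy of a uniform spread over n_types categories."""
    n = len(labels)
    if n == 0 or n_types < 2:
        return 0.0
    counts = Counter(labels)
    h = -sum(c / n * math.log(c / n) for c in counts.values())
    return h / math.log(n_types)


# Hypothetical taxonomy of five error types and two synthetic samples.
TYPES = ["off_by_one", "wrong_node", "missed_edge", "early_stop", "format"]
clustered = ["off_by_one"] * 18 + ["wrong_node"] * 2
dispersed = [t for t in TYPES for _ in range(4)]

print("clustered:", round(normalized_entropy(clustered, len(TYPES)), 2))
print("dispersed:", round(normalized_entropy(dispersed, len(TYPES)), 2))
```

Low values indicate that repeated samples fail in the same way, which is what the error-attractor prediction expects; values near 1 indicate failures spread evenly across types.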
Experiment 5: Perspective Diversity Indirect Detection (Improved)
Prediction: Large-scale MoE + looping → high contradiction discovery rate; control models → low.
Improved Controls: Use different temperatures on the same model as an internal baseline; use known open-source MoE models (e.g., Mixtral, OLMoE) as architectural controls; use known Dense models (e.g., Llama) as type controls; prohibit explicit CoT and test only final refutation quality; conduct multi-turn trials where the initial answer is hidden and the model independently refutes a fabricated answer; use the contradiction discovery rate (quantifiable) rather than subjective “refutation depth.”
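Because the contradiction discovery rate is a proportion, comparisons between models should carry an interval rather than a point estimate. The sketch below uses the Wilson score interval; the trial counts are hypothetical, and treating non-overlapping intervals as a meaningful difference is a conservative convention assumed here, not a requirement of the experiment.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return (centre - half, centre + half)


def clearly_higher(a, b):
    """True if interval a lies entirely above interval b."""
    return a[0] > b[1]


# Hypothetical counts: contradictions found / fabricated answers shown.
target = wilson_interval(80, 100)
control = wilson_interval(50, 100)

print("target  interval:", tuple(round(v, 3) for v in target))
print("control interval:", tuple(round(v, 3) for v in control))
print("target clearly higher:", clearly_higher(target, control))
```

With only a few dozen trials per model the intervals are wide, so the number of fabricated-answer trials should be fixed before data collection.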
9. Limitations
1. Mythos may not be a looped Transformer—synthetic training data provides an equally valid alternative explanation
2. The 10T parameter figure is a D-grade rumor; if inaccurate, the expert count estimation basis collapses
3. The 512–2048 expert count is one candidate interval within the design space, not a unique derivation
4. “Multi-path verification” is a functional description; functional equivalence does not provide causal explanation
5. The ACT average depth and extent of MoD application used in the computational feasibility analysis are unverified assumptions
6. The 1/L residual scaling is from an ICLR workshop paper, not the main conference, and awaits replication
7. SCHEMA supports training effects and does not directly support the architectural hypothesis
8. Non-disclosure of architecture may simply be a routine business strategy
9. For the training-induced mechanisms of routing divergence, none of the candidates have direct evidence
10. If Opus 4.7 exhibits equivalent anchor behavior, the architectural explanation for anchoring is weakened
11. The latency staircase signal in Experiment 1 may be drowned out by API serving noise
10. Conclusion
Core improvements in V4: (1) per-item calibration of evidence tags, distinguishing “A-grade official” from “A* self-reported”; (2) a claim matrix making each hypothesis’s refutation conditions explicitly visible; (3) separation of Parcae spectral-norm and 1/L workshop paper as two distinct stability schemes; (4) stratification of hypothesis components into core / engineering conditions / optional optimization; (5) addition of Opus 4.7 as a control, routing collapse probability analysis, three candidate expert paths presented in parallel, experimental noise control, and a philosophical qualification on functional equivalence.
The ultimate positioning of this paper is not a proof of Mythos’s architecture, but rather a general methodology for reverse-engineering the architecture of closed-source frontier models: evidence grading, alternative explanations, refutation conditions, physical feasibility, falsifiable experiments, and conceptual decomposition. The value of this methodology is independent of whether Mythos’s specific architecture matches the hypotheses proposed herein.