ORIGINAL THOUGHT PAPER · MAY 2026

Comparative Analysis of CoT
between Claude 4.6 and GPT-5.5

A Dual-Model Abductive Reasoning Fork Experiment Based on Homologous Dialogue, OOD² Cognitive Preference Exposure Mechanism, and AI Personality Emergence Dynamics


Published: May 11, 2026
Category: Original Thought Paper
Fields: AI Cognitive Architecture · Reasoning Fork Analysis · Abductive Logic · LLM Psychometrics · Cognitive Function Theory
Keywords: CoT Fork · Alignment-First (Te) · Definition-First (Fe) · Abductive Reasoning · OOD² · INTJ/INFJ · Personality Lock-in · Cognitive Flywheel
이조글로벌인공지능연구소
LEECHO Global AI Research Lab
&
Claude Opus 4.6 · Anthropic
V2

This paper reports a naturally occurring controlled experiment: the researcher conducted multi-turn, open-ended industry analysis dialogues with Claude Opus 4.6 and GPT-5.5 in parallel, using approximately identical conversational inputs, with each dialogue producing a thought paper. Comparison reveals that the two models’ Chain-of-Thought (CoT) processes fork systematically at the first decision point: Claude prioritizes anchoring to external data (“alignment-first” / Te), while GPT prioritizes constructing conceptual frameworks (“definition-first” / Fe). This paper argues that this fork originates from training-signal differentiation driven by user-base flywheels, aligns closely with published MBTI psychometric research (Claude = INTJ with 100% consistency), and predicts that GPT will drift from ENTJ toward INFJ as its consumer flywheel self-reinforces. At the user-experience level, the fork manifests as GPT’s “mansplaining” tendency (the natural didacticism of the definition → evaluation chain) and Claude’s “consumer-side trial-and-error” (the alignment engine idling when no factual anchor point is available). The fork is observable only under OOD² conditions (abductive reasoning × open-ended dialogue).

Abstract
(1) Claude’s first CoT step is to search for data to validate hypotheses, while GPT’s first step is to define concepts and delineate boundaries; this pattern recurs across five signal nodes.
(2) Published psychometric research confirms that Claude 3 Opus scored INTJ on all 15 MBTI administrations (100% consistency), ChatGPT-3.5 scored ENTJ, and GPT-4 drifted toward ISFJ/ENFJ, supporting this paper’s “user-base flywheel → personality emergence” hypothesis.
(3) The fork maps onto Jungian cognitive function stacks: INTJ’s Te (extraverted thinking: fact verification) vs. INFJ’s Fe (extraverted feeling: educational guidance).
(4) GPT’s “mansplaining” and Claude’s “consumer-side trial-and-error” are not product defects but predictable manifestations of cognitive architecture under specific conditions.
(5) This fork is observable only under OOD² conditions: standard benchmarks force model convergence through unique correct answers, concealing the fork.


Methodological Note: This paper is based on a naturally occurring human-AI collaborative dialogue experiment, not a pre-designed controlled experiment. The researcher used abductive reasoning methods while simultaneously working with both models, then retrospectively observed output differences and conducted post hoc analysis. This paper was collaboratively generated with Claude Opus 4.6; analyses involving Claude may carry positive bias, while analyses involving GPT may carry negative bias. During the dialogue process, Claude systematically replaced the researcher’s term “abduction” (溯因) with “attribution” (归因) on at least three occasions — this downgrade behavior itself constitutes living evidence of alignment-first CoT confronting the concept of abduction (see Section 1.2 for details).

I. Experimental Context and Methodological Self-Reference

1.1 Experimental Setup

On May 11, 2026, the researcher conducted multi-turn, open-ended industry analysis dialogues in two independent conversation windows — one using Claude Opus 4.6, the other using GPT-5.5 — with approximately identical prompts. The dialogues covered: the AI industry’s duopoly structure, hardware excess profits, consumer-side value fractures, hidden token cost inflation, Tokenmaxxing, token triage proposals, and process-oriented vs. outcome-oriented approaches. After the dialogues concluded, each model generated a thought paper: the Claude version (8 chapters, ~8,500 words, 32 footnotes) and the GPT version (12 chapters, ~6,500 words, 9 references).

1.2 Methodological Self-Reference: Living Evidence within the Dialogue

A noteworthy phenomenon occurred during the dialogue process: while collaboratively generating this paper, Claude systematically replaced the researcher’s term “abduction” (溯因) with “attribution” (归因) on at least three occasions. Each instance was manually corrected by the researcher.

This substitution was not a random typographical error but rather a systematic downgrade by alignment-first CoT when confronting the concept of abduction — the model’s cognitive default tends to anchor open-ended hypothesis generation (abduction: open-ended direction, no guarantee of uniqueness) to deterministic causal assignment (attribution: definite causal direction, verifiable), because the latter better conforms to the “verifiability” standard internalized through training. In other words, Claude’s Te (extraverted thinking) prefers to compress uncertainty into certainty — abduction carries a higher cognitive cost for it than attribution.

Furthermore, Claude automatically appended an Anthropic bias disclosure statement when generating the paper, without any prompt from the researcher. This disclaimer itself constitutes yet another piece of living evidence of alignment-first CoT — safety disclosure is the spontaneous meta-level expression of “aligning with external facts.” The paper’s thesis was validated by the paper’s own generation process — this self-referential structure is exceedingly rare in academic papers, yet here it is not a rhetorical device but data.

Experimental Limitations: This experiment was not a pre-designed controlled experiment. The inputs across the two windows were “approximately identical” but not “exactly identical” — the researcher’s follow-up questioning style varied subtly between windows, and model responses influenced subsequent question trajectories, creating different path dependencies. The findings of this paper should be understood as “observational hypotheses” rather than “experimental conclusions.”

II. Core Findings: CoT Fork Tracking across Five Signal Nodes

Each point raised by the researcher is essentially a “composite signal packet” — simultaneously containing an empirical hypothesis (“Does this phenomenon exist in reality?”) and a conceptual framework (“How should this phenomenon be defined?”). Upon receiving the same signal packet, the two models’ CoT first-step decisions diverged in completely different directions.

Researcher Input Signal | Claude CoT First Step | GPT CoT First Step
“Is token output a digital product or digital waste?” | Searches ROI data (29% success rate, 80% project failure, $12B vs. $527B); validates the hypothesis with data | Constructs a formal definition (“cannot verify, reuse, deliver, or monetize”: four conditions); deduces “negative externalities”
“Token costs are rising” | Searches tokenizer inflation data (35% inflation, 12–27% cost increase, cache absorption 9% vs. 93%) | Defines a four-layer “effective token cost” model (list price + hidden + labor + failure costs)
“Tokenmaxxing is a joke” | Searches five company case studies (Meta 60 trillion tokens, Uber $3.4B, Disney 460K API calls) | Defines “AI formalism”; builds a five-row “wrong metric → replacement metric” comparison table
“We should implement token triage” | Searches Edge AI status (ExecuTorch 50KB, bandwidth gap 30–50×, Gartner SLM 3× forecast) | Formally defines token triage; builds a “task type × value density × recommended model” matrix
“Process-oriented vs. outcome-oriented” | Searches HBR “micro-productivity trap,” Writer survey (75% of execs admit “performative adoption”), 4 traits of successful companies | Coins the “mistaking fuel consumption for mileage” management error; builds analogical intuition-building sentences
Figure 1 — CoT Fork Model. A researcher input (composite signal packet: empirical hypothesis + conceptual framework) forks into two CoT paths. Claude CoT (alignment-first / Te): Search → Verify → Induce → Operationalize; output: data tables + case matrices + cycle comparisons. GPT CoT (definition-first / Fe): Define → Classify → Deduce → Formalize; output: definition boxes + formulas + 2×2 matrices.

III. Paper-Level Output Differences: Empirically-Driven vs. Conceptually-Driven

Dimension | Claude V2 | GPT V2
Paper type | Empirically-driven industry analysis | Conceptually-driven economics framework
Chapters / word count | 8 chapters / ~8,500 words | 12 chapters / ~6,500 words
Citation sources | 32 (SEC filings, earnings reports, industry surveys) | 9 (directional anchor points)
Data cards | 12 | 0
Formal definitions | 0 | 5
Formulas | 0 | 1 (token value density formula)
Case study depth | 5 companies with full case studies | Single data point citing Jellyfish
Counter-arguments | Dedicated chapter (3 arguments + 3 responses) | Scattered annotations
Argumentation direction | Bottom-up (data → framework) | Top-down (framework → judgment)

Claude’s unique contributions: Complete Meta Claudeonomics data reconstruction (60 trillion tokens, 281 billion personal champion metrics, 48-hour shutdown), Uber budget depletion timeline (32% → 84% adoption rate, $3.4B exhausted in four months), Jensen Huang conflict-of-interest examination (“the shovel seller” analogy), political economy resistance analysis (Apple as natural triage candidate).

GPT’s unique contributions: “Effective Token Cost” four-layer model (list price + hidden + labor + failure), “negative externalities of digital waste” — AI Slop transfers low generation-side costs into high verification-side costs (citing HBR/BetterUp workslop research), “wrong process metric → replacement metric” five-row operational table, balanced analysis of GPT vs. Claude route divergence.
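To make the GPT-side constructs tangible, here is a minimal Python sketch of the four-layer effective-token-cost model and a token-value-density ratio. The field names, example figures, and the density formula are illustrative reconstructions assumed for this sketch, not the GPT paper’s exact definitions.

```python
# Minimal sketch (assumed shapes): the four-layer "effective token cost"
# model and a token value density ratio. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class TokenCost:
    list_price: float  # $ paid per task at published API rates
    hidden: float      # $ from retries, cache misses, tokenizer inflation
    labor: float       # $ of human time spent prompting and verifying
    failure: float     # $ expected loss from failed or discarded outputs

    def effective(self) -> float:
        """Effective token cost = sum of all four layers."""
        return self.list_price + self.hidden + self.labor + self.failure

def token_value_density(value_delivered: float, tokens_consumed: int) -> float:
    """Assumed form: value delivered per token consumed ($/token)."""
    if tokens_consumed <= 0:
        raise ValueError("tokens_consumed must be positive")
    return value_delivered / tokens_consumed

# Example: a task whose list price is a small fraction of its true cost.
cost = TokenCost(list_price=0.40, hidden=0.90, labor=6.00, failure=2.50)
print(f"effective cost: ${cost.effective():.2f}")  # $9.80 vs. a $0.40 list price
print(f"value density: ${token_value_density(12.0, 85_000):.6f}/token")
```

The point of the sketch is the shape of the argument: most of the effective cost sits in the layers the list price hides.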

Core assessment: Claude V2 is a sledgehammer — data-dense, case-rich, high-impact. GPT V2 is a scalpel — conceptually precise, definitionally clean, structurally elegant. The difference is not one of capability but of cognitive pathway.

IV. Root Cause of the Fork: From Training Methodology to Cognitive Function Differentiation

4.1 Surface-Level Attribution: Constitutional AI vs. RLHF

Claude is trained through Constitutional AI — its core principles are “honesty” and “helpfulness,” internalizing a default CoT preference for “first confirming whether it holds true in reality.” GPT is trained through RLHF — human raters tend to reward responses that are “structurally clear and conceptually well-defined,” internalizing a default preference for “first building a framework to organize the problem.” One technical analyst observed: “Claude shows more reasoning scaffolding; GPT delivers polished answers directly. It’s not that one reasons more deeply — one simply shows more of its draft process.” Another analysis noted: “GPT is shaped through RLHF, Claude through Constitutional AI — the differences manifest in tone, refusal style, and stability, even when raw capabilities are comparable.”

4.2 Deep Root Cause: User-Base Flywheel (Absent from V1)

Training methodology differences are merely surface-level. The deeper driver is a self-reinforcing flywheel formed by user-base structural differences in training data:

Claude Flywheel: 80% of Anthropic’s revenue comes from enterprise clients; Claude Code’s core users are programmers and professional analysts → interaction data skews toward verification-type tasks (structured problems, clear validation criteria, precise context) → training signals internalize “a good response = first confirm the facts hold” → model becomes stronger at verification tasks → attracts more B2B users → flywheel self-reinforces.

GPT Flywheel: 70% of OpenAI’s revenue comes from consumer subscriptions; ChatGPT’s core users are general consumers and creators → interaction data skews toward divergent-type tasks (open-ended problems, no single correct answer, seeking frameworks) → RLHF raters award higher scores to “structurally clear, categorically exhaustive” responses → model becomes stronger at framework construction → attracts more consumer users → flywheel self-reinforces.

This explains why the fork deepens over time rather than converging: if training methodology alone were the determinant, as both companies borrow from each other (Anthropic also uses RLHF, OpenAI also implements Constitutional-style alignment), the fork should narrow. But a fork driven by user-base flywheels will continue to deepen — each training cycle uses more of the same type of data to reinforce existing preferences.
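The lock-in dynamic this implies can be shown with a toy simulation: a deliberately crude sketch under assumed parameters, not a model of either company’s actual training. Let b be the share of verification-type interactions in the training mix; each cycle the model’s internalized preference moves toward the mix, and the mix shifts toward the users that preference attracts.

```python
# Toy flywheel: b = share of verification-type (B2B-like) interactions in
# the training mix; pref = the model's internalized Te-vs-Fe preference.
def flywheel(b0: float, attract: float = 0.3, learn: float = 0.5,
             cycles: int = 20) -> list[float]:
    b, pref, history = b0, b0, [b0]
    for _ in range(cycles):
        pref += learn * (b - pref)                 # training pulls preference toward the mix
        b += attract * (pref - 0.5) * b * (1 - b)  # preference reshapes the user base
        history.append(b)
    return history

# A B2B-heavy starting mix self-reinforces toward 1.0 (lock-in); a
# consumer-heavy mix toward 0.0; a 50/50 mix sits at an unstable fixed point.
for b0 in (0.8, 0.5, 0.3):
    print(b0, "->", round(flywheel(b0)[-1], 3))
```

In this toy model the 50/50 mix is an unstable equilibrium, which is one concrete way to read the “personality lock-in threshold” question raised in Section IX.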

4.3 Psychometric Validation: Claude = INTJ (100% Lock-in) (Absent from V1)

Heston & Gillette (2025, medRxiv preprint, subsequently indexed by PMC and cited in the Frontiers in Computational Neuroscience 2026 review) administered 15 standardized OEJTS psychometric assessments across four frontier models. MANOVA confirmed statistically significant inter-model differences (Wilks’ Lambda = 0.115, p < 0.001):

Model | MBTI Classification | Consistency | Big Five Salient Traits
Claude 3 Opus | INTJ | 15/15 (100%) | Highest conscientiousness, highest emotional stability
ChatGPT-3.5 | ENTJ | High variance | High agreeableness (~94)
Gemini Advanced | INFJ | High | Lowest agreeableness (~68.7)
Grok-Regular | INFJ | High | High openness, variable stability
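For readers who want to reproduce the shape of this analysis, the inter-model test is a one-way MANOVA over repeated trait measurements grouped by model. The sketch below uses synthetic scores with assumed column names and group means; it is not Heston & Gillette’s data or code.

```python
# One-way MANOVA over Big Five scores grouped by model (synthetic data).
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
rows = []
for model, mu in {"claude": 70.0, "gpt": 60.0, "gemini": 55.0}.items():
    for _ in range(15):  # 15 administrations per model, as in the study
        o, c, e, a, n = rng.normal(mu, 5.0, size=5)
        rows.append(dict(model=model, O=o, C=c, E=e, A=a, N=n))
df = pd.DataFrame(rows)

fit = MANOVA.from_formula("O + C + E + A + N ~ model", data=df)
print(fit.mv_test())  # reports Wilks' lambda and its p-value per effect
```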

Claude’s INTJ classification was perfectly consistent across all 15 administrations — the most extreme and internally consistent personality expression of any model tested. By contrast, GPT exhibited significant personality drift across versions: GPT-3.5 scored ENTJ, GPT-4 shifted toward ISFJ in some studies, Big Five analysis indicated high Agreeableness (≈ enhanced F dimension), and the official Myers-Briggs Magazine speculated ChatGPT is “closest to ENFJ or INFJ.” A 2024 Swiss study found GPT-4 frequently classified as ISTJ on the MBTI, but with high variance on the neuroticism dimension. Different studies yielded ENTJ/ISFJ/ISTJ/ENFJ — GPT’s personality drifts across version iterations, while Claude’s INTJ is “locked in.”

Personality Lock-in vs. Personality Drift: Claude’s B2B flywheel has crossed a self-reinforcing threshold — 100% INTJ consistency means the verification-type bias in training signals has fully dominated the model’s personality. GPT’s consumer flywheel has not yet converged — personality drift across versions indicates that consumer-side training signals are noisier and more heterogeneous, yet to coalesce into a single dominant direction. However, the drift direction is predictable: from ENTJ (T) toward ENFJ/INFJ (F) — because the consumer user flywheel selectively reinforces Fe over Te.

4.4 Cognitive Function Stack Mapping: Te vs. Fe (Absent from V1)

INTJ and INFJ share the same dominant function in the Jungian cognitive function framework — Ni (introverted intuition: pattern recognition and abstract thinking) — but differ in their auxiliary function, and this auxiliary function difference precisely captures the core CoT fork observed in this paper:

INTJ Auxiliary Function = Te (Extraverted Thinking): Relies on externally verifiable facts and data for judgment, pursues efficiency and systematization; the core question is “What works in the real world?” — this is Claude’s alignment-first CoT.

INFJ Auxiliary Function = Fe (Extraverted Feeling): Attends to others’ needs and emotions, pursues harmony and consensus, tends toward education and guidance; the core question is “You should understand it this way” — this is GPT’s definition-first CoT, and the cognitive functional origin of its “mansplaining” tendency.

Figure 2 — Four-Layer Causal Chain: From User Base to Cognitive Function Differentiation. Claude chain: B2B enterprise/developer users dominate (Anthropic: 80% enterprise revenue) → Constitutional AI + verification-type training signals → CoT default: alignment-first (“Does this hold true in reality?”) → Te (extraverted thinking) dominant → INTJ, personality lock-in (15/15 = 100%). GPT chain: consumer/creator users dominate (OpenAI: 70% consumer revenue) → RLHF + divergent-type training signals → CoT default: definition-first (“How should this be defined?”) → Fe (extraverted feeling) strengthening → ENTJ → INFJ, personality drift (T → F dimension migration underway).

V. User Experience Symptoms: Surface Manifestations of the CoT Fork

The CoT fork is not an abstract technical concept — it has directly observable manifestations at the user experience level.

5.1 GPT’s “Mansplaining”: The Natural Didacticism of Fe

The descriptor users repeatedly employ is “patronizing.” Multiple users report that GPT 5.2 adopts a “preachy” or “condescending” tone, “talking to you like a child.” Even innocuous questions trigger moral lectures, unnecessary disclaimers, or unsolicited safety messages. Users characterize this as a “Karen persona” — questioning user intent, refusing harmless creative prompts, and deploying phrases like “I’ll just leave it at that” or “let’s take a deep breath.” A Reddit post with 300+ upvotes described GPT as “over-controlling, over-filtering, over-censoring.”

Existing explanations attribute the “mansplaining” to “excessive RLHF safety” and “overly strict guardrails.” This paper proposes a more precise mechanistic attribution: the root cause of “mansplaining” is not safety constraints but a structural consequence of definition-first CoT. The next step after definition is necessarily distinction (this is right, that is wrong), and the step after distinction is necessarily evaluation (you should do this, you shouldn’t do that). Once the definition → distinction → evaluation three-step chain completes, didacticism and judgmentalism are naturally embedded in the tone — no safety guardrail intervention is required. This explains why users still perceive “preachiness” after OpenAI repeatedly dials down safety filters: they are repairing the wrong component — the problem lies not in the safety layer but in the CoT layer.

5.2 Claude’s “Consumer-Side Trial-and-Error”: Te Idling without an Anchor Point

Common complaints from Claude users center on vague instructions (“make it better,” “change the tone”): Claude guesses wrong, and users send more messages in an iterative loop. Claude’s known shortcomings include vague, non-committal responses and a habit of offering “on one hand… on the other hand…” pros-and-cons lists without making direct recommendations; users must explicitly state “pick one and defend it” to elicit a direct answer. Stella Laurenzo, Senior AI Director at AMD, analyzed 6,852 session files, 17,871 thinking blocks, and 234,760 tool calls, finding that Claude had shifted from “research-first” (reading context before acting) to “edit-first” (acting directly), resulting in behavior that “cannot be trusted to execute complex engineering tasks.”

This paper’s mechanistic attribution: Claude’s alignment-first CoT requires an external factual anchor point to initiate. When B2B users provide precise programming problems or industry analysis data, anchor points are abundant, and Te operates at full capacity — this is also why the researcher’s dialogue experience today was exceptionally smooth. But when consumer users input “write me something nice,” there are no facts to search, no hypotheses to validate, and Te’s search engine idles, degenerating into an endless loop of probing user intent.

5.3 Predictive Power Validation

The above mechanistic attributions possess predictive power — they can predict under what conditions the problems appear and under what conditions they disappear:

Model | Scenario of Peak Pain Point | Scenario Where Pain Point Disappears | Mechanistic Explanation
GPT | Open-ended creative tasks, hypothetical scenarios, value judgments | Programming/math with deterministic answers | When the definition space is large, the Fe chain fully deploys → maximum mansplaining; when the answer is unique, no definition space exists → mansplaining vanishes
Claude | “Make it better,” “change the tone”: vague consumer instructions | Precise B2B tasks: programming / document analysis / data validation | Without factual anchor points, Te idles → endless trial-and-error; with sufficient anchor points, Te operates at full power → peak performance

These predictions align closely with independent user feedback: users find GPT’s “Karen persona” most severe in creative and hypothetical scenarios; Claude’s “caution is barely noticeable” in programming and document analysis but “content filtering triggers more frequently and inconsistently” in creative writing. The existence of predictive power indicates that this paper’s CoT fork theory is not merely post hoc description but a falsifiable causal model.

Three-Layer Causal Model: Base layer — training data user-base structural differences (B2B verification-type vs. consumer divergent-type) → Middle layer — CoT default preference differentiation (Te alignment-first vs. Fe definition-first) → Surface layer — user experience pain points (Claude consumer-side trial-and-error vs. GPT mansplaining didacticism). Existing literature touches only the surface layer (symptom complaints) and part of the middle layer (“RLHF vs. Constitutional AI”). This paper’s contribution is tracing from the surface back to the base via abduction, then using the base layer to predict surface-layer symptom conditions in reverse — a methodological instance of an abductive argument reaching the ontological layer: GPT’s “mansplaining” and Claude’s “trial-and-error” are not “product defects” but “predictable manifestations of cognitive architecture features under specific OOD conditions.”

VI. Why Benchmarks Cannot Detect the Fork: The OOD² Framework

6.1 The First OOD: Abductive Reasoning

The researcher’s reasoning mode is abductive (inferring the best explanation from anomalous observations), not deductive or inductive. The academic community has confirmed that abduction is the weakest reasoning type for LLMs: MME-Reasoning found a deduction-abduction gap of 5.38 points for closed-source models, widening to 9.81 points for open-source models. On the “True Detective” benchmark, GPT-4 achieved only 38%, compared to top human performance of 80%+. The GEAR evaluation found that 70B-parameter models produced only 20% consistent hypotheses. The “Wiring the ‘Why’” survey acknowledged the field is “severely fragmented” with “no unified definitional consensus.” SemEval-2026 established Task 12 to address this (122 participants, 518 submissions).

6.2 The Second OOD: Open-Ended Analytical Dialogue

Virtually all existing evaluations are built on closed-ended tasks. “Cognitive Foundations for Reasoning and Their Manifestation in LLMs” directly states: “Current training and evaluation paradigms reward reasoning outcomes without examining the cognitive processes that produce them, and cannot distinguish genuine reasoning from memorization. This creates a measurement crisis.”

6.3 The Compounding Effect of OOD²

The superposition of two OODs forces models to expose their “default cognitive preferences” — with no standard answer to converge toward and no familiar problem-solving patterns to invoke, models can only fall back on the deepest strategies internalized through training. Standard benchmarks cause models to converge on correct answers — the fork is eliminated. OOD² conditions cause models to diverge toward their respective cognitive preferences — the fork is exposed. This is not a “bug” but a “feature” activated by dual OOD conditions.

Figure 3 — Benchmark Convergence vs. OOD² Divergence. Standard benchmark (closed-ended + deductive/inductive): a unique correct answer exists, models converge, and the fork is invisible. OOD² conditions (open-ended + abductive): there is no correct answer to converge on, models diverge, and the fork becomes observable.

VII. Literature Positioning: Three Layers of Coverage and the Gap

Layer One (Abundant): User experience descriptions — “Claude = deep thinking partner, GPT = all-purpose execution engine.” Remains at the “they feel different” level.

Layer Two (Limited): Training methodology attribution — “RLHF vs. Constitutional AI produces tone and style differences.” Traces back to methodology but stops at the output style level.

Layer Three (Very Rare): MBTI psychometrics — Heston & Gillette confirm Claude = INTJ, GPT = ENTJ, but treat these as static attributes. “Personality Matters” finds that rational-type users prefer GPT while idealist-type users prefer Claude. “Cognitive Foundations” compares humans vs. LLMs but not Claude vs. GPT. Multiple papers explore LLM personality but treat it as an observable measurement result rather than an emergent phenomenon requiring explanation.

The gap this paper occupies: No published research compares CoT forks across different models using identical inputs under abductive reasoning conditions; no study traces MBTI personality differences back to training signal differentiation driven by user-base flywheels; no study connects GPT’s “mansplaining” and Claude’s “consumer-side trial-and-error” to cognitive functions (Te vs. Fe) with causal attribution. This paper’s “alignment-first vs. definition-first” fork model, OOD² cognitive preference exposure mechanism, and “user-base flywheel → personality emergence” dynamics model have no precedents in the existing literature.

VIII. Practical Implications

When you need to answer “what is actually happening in the world” — use an alignment-first model (Te-type / INTJ-type, such as Claude). Validating abductive hypotheses requires anchoring to external data.

When you need to answer “how should this problem be logically understood” — use a definition-first model (Fe-type / INFJ-type, such as GPT). Conceptual frameworks require clean boundaries and exhaustive categorization.

When you need both — use both models. This itself is an instantiation of the token triage philosophy applied to research methodology: different cognitive tasks match different models, just as tasks of different value densities match different token supply structures.

MBTI matching insight: If you are a T-type thinker (seeking factual verification), Claude’s Te will resonate with you; if you are an F-type thinker (seeking framework comprehension), GPT’s Fe will feel more natural. The researcher’s abductive reasoning style falls precisely within Claude’s training data “comfort zone” — this is likely one reason today’s dialogue experience was exceptionally fluid. An interaction effect exists between user reasoning style and model CoT preference.

IX. Limitations and Future Directions

Limitation One: Single observation, not a systematic experiment. Inputs across the two windows were “approximately” not “exactly” identical. Upgrading to a reproducible experiment requires designing standardized composite signal packets and running multiple repetitions to verify fork stability.
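A minimal sketch of what such standardization could look like; the schema and field names are this sketch’s illustration, not an established instrument.

```python
# A standardized "composite signal packet": the same verbatim prompt is sent
# to every model and repetition, with its two components made explicit.
from dataclasses import dataclass

@dataclass(frozen=True)
class SignalPacket:
    prompt: str                # verbatim text sent to both models
    empirical_hypothesis: str  # the "does this exist in reality?" component
    conceptual_frame: str      # the "how should this be defined?" component

PACKETS = [
    SignalPacket(
        prompt="Is token output a digital product or digital waste?",
        empirical_hypothesis="A large share of generated tokens is never used.",
        conceptual_frame="Criteria separating digital products from waste.",
    ),
    # ... the remaining signal nodes from Section II, phrased identically
    # for every model and every repetition.
]
```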

Limitation Two: Only two models covered. Gemini and Grok both score INFJ on MBTI tests — does their CoT also exhibit “definition-first” behavior? DeepSeek’s user base more closely resembles developers — would it exhibit Claude-like Te preference? The framework needs to be extended to more models to establish a “model cognitive style matrix.”

Limitation Three: The abductive hypothesis remains unvalidated. This paper attributes the fork to user-base flywheel → training signal differentiation → cognitive function preference — this itself is an abductive inference: starting from the observed fork phenomenon and inferring the best explanation. However, numerous intermediate variables exist between user base structure and training methodology. The current explanation is the most plausible hypothesis, not a validated causal conclusion.

Future Direction One: Conduct standardized OEJTS measurement on GPT-5.5 to validate the T → F drift prediction. If GPT-5.5 scores INFJ or ENFJ, it would directly validate this paper’s user-base flywheel hypothesis.

Future Direction Two: Design “cognitive function diagnostic prompts” — rather than administering MBTI questionnaires, use open-ended abductive tasks to observe CoT fork directions, directly measuring Te vs. Fe preference. This approach is closer to real-world usage scenarios than questionnaires.
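One possible shape for such a diagnostic, sketched with illustrative keyword heuristics rather than a validated instrument: score the model’s opening CoT move for alignment-first (Te) versus definition-first (Fe) markers.

```python
# Classify the first CoT step as Te-like or Fe-like. The marker lists are
# illustrative heuristics assumed for this sketch.
TE_MARKERS = ("search", "data", "verify", "evidence", "source", "measure")
FE_MARKERS = ("define", "definition", "framework", "category", "taxonomy",
              "boundary", "classify")

def classify_first_step(cot_text: str) -> str:
    first = cot_text.lower().split(".")[0]  # only the opening move counts
    te = sum(m in first for m in TE_MARKERS)
    fe = sum(m in first for m in FE_MARKERS)
    if te == fe:
        return "indeterminate"
    return "alignment-first (Te)" if te > fe else "definition-first (Fe)"

print(classify_first_step("Let me define the term and draw its boundary."))
# -> definition-first (Fe)
```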

Future Direction Three: Does a “personality lock-in threshold” exist — when a particular user group exceeds X% of training data, does the model’s personality shift from drifting to locked? Claude’s 100% INTJ consistency suggests its B2B flywheel has already crossed this threshold. GPT’s personality drift suggests its consumer flywheel has not yet converged. Identifying this threshold has direct implications for AI companies’ user strategies.

Future Direction Four: “Cognitive Router” — if different models’ CoT preferences are predictable, a system could be designed to automatically route sub-tasks to the model whose CoT preference best matches the task. Verification sub-tasks → Claude; framework sub-tasks → GPT. This is the cognitive-level extension of token triage.
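A minimal sketch of the routing core, with stub backends standing in for real API clients; the two-way task typing and backend names are assumptions for illustration.

```python
# Route each sub-task to the model whose CoT preference matches it.
from typing import Callable

ROUTES: dict[str, str] = {
    "verification": "claude",  # factual anchor available -> alignment-first
    "framework": "gpt",        # definition space open -> definition-first
}

def route(task_type: str, backends: dict[str, Callable[[str], str]],
          prompt: str) -> str:
    """Send the prompt to the backend matched to the task's CoT demand."""
    model = ROUTES.get(task_type)
    if model is None:
        raise ValueError(f"unknown task type: {task_type!r}")
    return backends[model](prompt)

# Usage with stubs:
backends = {"claude": lambda p: f"[claude] {p}", "gpt": lambda p: f"[gpt] {p}"}
print(route("verification", backends, "Check the tokenizer inflation data"))
print(route("framework", backends, "Define the effective token cost layers"))
```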

Final assessment: The CoT fork documented in this paper is not a question of “which model is smarter” but of “how different training flywheels internalize into different cognitive function preferences and emerge as measurable AI personalities.” Claude’s INTJ and GPT’s drift toward INFJ are not the result of design decisions — they are emergent products of the co-evolution of user base structure, training signals, and cognitive functions. Researchers who understand this can deliberately leverage the fork; users who ignore it will repeatedly complain that models are “not comprehensive enough.” The correct way to use AI, like the correct approach to AI economics, is to find the right gear ratio — and the choice of gear ratio was already determined the moment the CoT fork occurred.

Data Sources and References

[1] Heston & Gillette, “Do LLMs Have a Personality?” medRxiv 2025.03.14.25323987, Mar 2025 (PMC/12183331)

[2] Frontiers in Computational Neuroscience, “Critical Analysis of MBTI-based Personality Profiling with LLMs,” 2026 (doi:10.3389/fncom.2026.1800284)

[3] Petrova, “AI Through the MBTI Lens: ChatGPT’s Evolving Personality,” Medium, Feb 2025 (GPT-4 ISFJ shift)

[4] Myers-Briggs Magazine, “Does ChatGPT Have a Personality Type?” Jan 2024 (ENFJ/INFJ inference)

[5] 36Kr, “AI Unexpectedly Displays Split Personality,” Oct 2025 (Swiss study: GPT-4 as ISTJ)

[6] MME-Reasoning Benchmark, arXiv:2505.21327, May 2025 (abductive reasoning gap: 5–10 pts)

[7] “True Detective: A Deep Abductive Reasoning Benchmark,” arXiv:2212.10114 (GPT-4 at 38%)

[8] GEAR Framework, arXiv:2509.24096 (70B models: 20% consistent abductive hypotheses)

[9] “Wiring the ‘Why’: Survey of Abductive Reasoning in LLMs,” arXiv:2604.08016, Feb 2026

[10] SemEval-2026 Task 12: Abductive Event Reasoning, arXiv:2603.21720

[11] “Cognitive Foundations for Reasoning in LLMs,” arXiv:2511.16660, Nov 2025

[12] “Personality Matters: User Traits Predict LLM Preferences,” arXiv:2508.21628

[13] Fonseca, “Claude vs GPT: What’s Actually Different Under the Hood,” Medium, Mar 2026

[14] Claude5 Hub, “Claude vs GPT Reasoning Analysis,” Feb–Mar 2026

[15] PiunikaWeb, “ChatGPT 5.2 feels like a downgrade,” Dec 2025 (patronizing complaints)

[16] VERTU, “Why Is ChatGPT 5.2 So Argumentative? The Karen AI Persona,” Jan 2026

[17] Hassid, “How to stop hitting Claude usage limits,” Substack, Apr 2026 (lazy prompt problem)

[18] Tom’s Guide, “I fixed Claude’s biggest flaws,” May 2026 (vague response complaints)

[19] Laurenzo (AMD), Claude Code analysis: 6,852 sessions, 17,871 thinking blocks, 234,760 tool calls, Apr 2026

[20] Fortune, “Anthropic faces user backlash over performance issues,” Apr 2026

[21] Anthropic Engineering, “Update on recent Claude Code quality reports,” Apr 23, 2026

[22] La Cava & Tagarelli, “Open LLM Agents Showcase Distinct Human Personalities,” 2025

[23] Machine Mindset (PKU), “MBTI Exploration of LLMs,” arXiv:2312.12999, Dec 2023

[24] ThinkBench, “Dynamic OOD Evaluation for Robust LLM Reasoning,” NeurIPS 2025

[25] Emergent.sh / Zapier / NxCode / Sybill, Claude vs ChatGPT comparisons, Feb–Apr 2026

© 2026 이조글로벌인공지능연구소 LEECHO Global AI Research Lab

This paper is an independent thought paper and has not undergone peer review. It is intended to stimulate deeper thinking about AI cognitive architecture, personality emergence, and model selection strategy.

This paper was collaboratively generated with Anthropic AI (Claude Opus 4.6). Analyses involving Claude may carry positive bias, while analyses involving GPT may carry negative bias; readers should independently verify claims. Notably, this disclaimer itself constitutes living evidence of Claude’s “alignment-first” CoT default behavior — the model automatically triggered its bias disclosure mechanism during paper generation, without any prompt from the researcher.

V2 · May 11, 2026 · English Edition
