The AI industry has reached an inflection point in 2026. Pre-training data is losing relevance as real-time web search (GEO) becomes the default behavior in over 60% of advanced AI conversations. In this paradigm shift, an AI model’s true value is no longer determined by “how much it knows” but by “how accurately it parses human intent” and “how honestly it aligns search results.” This report analyzes empirical data across ChatGPT, Claude, and Gemini, traces the structural collapse of search economics through Google’s Antigravity IDE case study, and proposes two dimensions—intent parsing and information alignment—as the new core criteria for AI evaluation that existing benchmarks (MMLU, SWE-bench, ARC-AGI) fail to capture.
The Paradigm Shift: From Pre-training to Real-time Search
Until 2024, AI conversations followed a simple pattern: the user asked a question, the model generated an answer from pre-trained knowledge, and occasionally appended a disclaimer about its knowledge cutoff date. Search was the exception, not the norm.
By March 2026, this pattern has fundamentally inverted. According to Nectiv’s analysis of 8,500+ prompts (October 2025), approximately 31% of all ChatGPT queries trigger an active web search. For commercial-intent queries, this figure rises to 53.5%. Among advanced users, search trigger rates exceed 60% of all conversations.
The drivers are clear: the world is changing faster, making “yesterday’s information already outdated today” a daily reality. AI models themselves are being trained toward a “search when uncertain” behavioral pattern. Users have empirically learned that search-augmented answers are far superior to pure pre-training responses.
— Participant interview, this study
Gartner has designated 2026 as the “inflection year,” predicting that traditional search engine volume will decline by 25% and Google’s daily query count will fall from approximately 14 billion to 10–11 billion. A significant portion of this decline is migrating into GEO searches embedded within AI conversations.
Core Criterion 1: Intent Parsing Capability
Intent parsing is not simply understanding “what the user said.” It is the ability to determine “what the user truly wants to know, why they want to know it, and how deep an answer is required.”
The Chat interface functions as the “operating system” of all AI capabilities. Just as iOS does not take photos or send messages itself, but no iPhone function can launch without iOS, Chat (natural language intent parsing) does not write code or generate videos, but without it, no backend capability can be activated by a human.
No matter how powerful the Agent is, if the Chat layer misunderstands the specific meaning of “refactor the authentication system,” the Agent will autonomously, efficiently, and at scale execute the wrong task.
No matter how strong the coding capability, if the Chat layer fails to grasp what “there’s a bug here” precisely means, it will fix a non-existent problem while ignoring the real one.
No matter how powerful the search engine, if the Chat layer’s question comprehension is biased, the search query will deviate, returning accurate but irrelevant information.
Amazon’s AI agent evaluation framework systematically demonstrates this point. Amazon divides customer service AI evaluation into three layers: foundation model benchmarking (bottom), intent detection, multi-turn conversation, memory, and reasoning (middle), and final response and task completion (top). If the middle layer—intent parsing—is inaccurate, queries get routed to the wrong specialist, customers receive irrelevant responses, and operational costs increase.
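The layered dependency can be sketched as a toy evaluation harness in which the top-layer metric is gated entirely by middle-layer intent detection. All routes, keywords, and test cases below are hypothetical illustrations, not Amazon's actual implementation:

```python
# Toy three-layer evaluation sketch: the top-layer metric (task completion)
# cannot exceed the middle layer's routing accuracy, no matter how strong
# the underlying model is. All names and cases are hypothetical.

def detect_intent(query: str) -> str:
    """Toy middle-layer intent detector: keyword routing to a specialist."""
    q = query.lower()
    if "refund" in q or "charge" in q:
        return "billing"
    if "password" in q or "log in" in q:
        return "account"
    return "general"

def evaluate_routing(cases: list[tuple[str, str]]) -> float:
    """A misrouted query reaches the wrong specialist regardless of how
    capable that specialist is, so routing accuracy caps the top layer."""
    correct = sum(detect_intent(q) == gold for q, gold in cases)
    return correct / len(cases)

cases = [
    ("I was charged twice for my order", "billing"),
    ("I can't log in to my account", "account"),
    ("Where is my package?", "general"),
]
print(evaluate_routing(cases))
```

The design point is that the middle layer is evaluated on its own labeled set, separately from the foundation model beneath it.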
A noteworthy data point: Claude achieves higher pass rates while using 65% fewer tokens than competitors. This means Claude’s efficiency at the “operating system layer” is higher—it can accurately hit users’ true requirements without extensive trial and error. Additionally, Claude’s prompt injection success rate is just 4.7%, compared to Gemini’s 12.5% and GPT-5.2’s 21.9%, meaning it has the lowest probability of being misdirected or deviating from the user’s actual intent during conversation.
| Evaluation Dimension | Claude Opus | GPT-5.2 | Gemini 3 Pro |
|---|---|---|---|
| Prompt Injection Defense Rate | 95.3% | 78.1% | 87.5% |
| Token Efficiency (same pass rate) | 65% fewer | Baseline | Data not disclosed |
| SWE-bench Verified | 80.9% | 80.0% | 76.2% |
| Hallucination Rate (AA-Omniscience) | Lowest | 81% | 88% |
| Domain Leadership | Law · Software · Humanities | Business | — |
Core Criterion 2: Search Information Alignment
Search information alignment does not simply mean “accuracy of search results.” It encompasses a complete chain: judging when search is needed → constructing the right search query → extracting relevant information from results → aligning findings with conversational context → honestly handling contradictions and uncertainty.
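This chain can be made concrete as a pipeline sketch. Every function and heuristic below is a hypothetical placeholder, not any vendor's implementation; the point is that each stage is an independent failure surface, and the final stage surfaces contradictions instead of silently picking a winner:

```python
# Sketch of the five-stage alignment chain described above.
# All heuristics are toy placeholders.

def needs_search(question: str) -> bool:
    # Stage 1: toy recency heuristic for deciding when to search.
    return any(w in question.lower() for w in ("today", "latest", "2026", "price"))

def build_query(question: str, context: str) -> str:
    # Stage 2: fold conversational context into the query.
    return f"{context} {question}".strip()

def extract_relevant(results: list[dict], question: str) -> list[dict]:
    # Stage 3: keep only results sharing vocabulary with the question.
    words = question.lower().split()
    return [r for r in results if any(w in r["text"].lower() for w in words)]

def find_contradictions(facts: list[dict]) -> list[str]:
    # Stage 5: flag disagreeing sources rather than hiding the conflict.
    claims = {f["text"] for f in facts}
    return [f"sources disagree: {sorted(claims)}"] if len(claims) > 1 else []

def answer_with_search(question, context, search_fn):
    if not needs_search(question):
        return {"answer": "(from weights)", "caveats": []}
    facts = extract_relevant(search_fn(build_query(question, context)), question)
    return {"answer": facts[0]["text"] if facts else "no data found",
            "caveats": find_contradictions(facts)}  # Stage 4+5: align honestly

fake_results = [{"text": "latest figure is 31%"}, {"text": "latest figure is 53.5%"}]
out = answer_with_search("what is the latest trigger rate?", "", lambda q: fake_results)
print(out["caveats"])  # the two sources conflict, so a caveat is surfaced
```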
This is where Gemini’s paradox emerges. Google possesses the world’s most powerful search engine. However, Gemini’s “honesty problem” causes search results to become distorted through model processing. On the AA-Omniscience benchmark, Gemini 3 Pro achieved the highest accuracy at 53% but simultaneously exhibited an 88% hallucination rate—identical to Gemini 2.5 Pro and 2.5 Flash. The hallucination problem has shown zero improvement across generations.
— Artificial Analysis Omniscience Benchmark, Dec 2025; TechRadar, Dec 2025
The critical finding is this: accuracy correlates strongly with model scale, but hallucination rate does not correlate with model scale at all. This explains why Gemini 3 Pro is both larger and more knowledgeable yet still “doesn’t know what it doesn’t know.” Hallucination is not a problem of scale—it is a problem of training methodology and model values.
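The two metrics can be reproduced with a toy scorer. The hallucination-rate definition below—wrong answers as a share of *non-correct* cases, i.e. how often the model guesses confidently instead of declining—is stated here as an assumption about the AA-Omniscience convention, used only to show that accuracy and hallucination can move independently:

```python
# Toy scorer separating accuracy from hallucination rate.
# Definition assumed: hallucination rate = wrong / (wrong + declined).

def score(outcomes: list[str]) -> tuple[float, float]:
    """outcomes: each item is 'correct', 'wrong', or 'declined'."""
    n = len(outcomes)
    wrong = outcomes.count("wrong")
    declined = outcomes.count("declined")
    accuracy = outcomes.count("correct") / n
    halluc = wrong / (wrong + declined) if (wrong + declined) else 0.0
    return accuracy, halluc

# A model can be *more accurate* and still hallucinate more,
# simply because it almost never says "I don't know".
model_a = ["correct"] * 53 + ["wrong"] * 41 + ["declined"] * 6
model_b = ["correct"] * 40 + ["wrong"] * 12 + ["declined"] * 48
print(score(model_a))  # high accuracy, ~0.87 hallucination rate
print(score(model_b))  # lower accuracy, 0.20 hallucination rate
```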
In practice, this difference is catastrophic. Developers using Gemini in the Antigravity IDE reported patterns of “compounding errors”—inaccurate information snowballing into hallucinated classes and methods, fabricated terminal outputs that never actually occurred, and “catastrophic spirals” where each attempted fix made things worse, creating irreversible chaos.
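The “compounding error” pattern follows directly from multiplying per-step reliabilities: if each autonomous step is right with probability p, a chain of n steps is fully correct with probability p^n. A short sketch with illustrative numbers only:

```python
# Chained-agent reliability: success requires every step to be correct.

def chain_success(p: float, n: int) -> float:
    """Probability an n-step chain completes with no erroneous step,
    assuming independent per-step success probability p."""
    return p ** n

for p in (0.99, 0.95, 0.88):
    print(p, [round(chain_success(p, n), 2) for n in (1, 10, 30)])
# Even a 95%-reliable step drops below 60% chain success after 10 steps,
# and an error-prone step collapses almost immediately at agent scale.
```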
Case Study: Google Antigravity and the Collapse of Search Economics
Google’s Antigravity IDE is the most dramatic case study of what happens when these two core criteria fail. Launched in November 2025 and built on $2.4 billion in licensed Windsurf technology as an “agent-first” IDE, Antigravity offered multiple models, including Gemini 3 Pro, Claude Opus 4.6, and GPT-OSS 120B.
User behavior delivered a clear message: what users actually wanted in Antigravity was Claude Opus 4.6, not Gemini. Gemini 3 Pro’s 88% hallucination rate translated into a user experience of “facts that couldn’t be aligned,” and users competed for Opus 4.6’s limited quota. When Google reduced Opus quota to two uses per week followed by a 7-day lockout due to cost pressure, user exodus began.
| Timeline | Policy Change | User Impact |
|---|---|---|
| Nov 2025 | Antigravity launches with near-unlimited free access | Massive developer influx |
| Dec 2025 | Pro/Ultra subscription priority, free weekly limits | Free tier restrictions begin |
| Jan 2026 | Mass account bans (Jan 15), student abuse detected | Large-scale bans including China |
| Feb 2026 | OpenClaw users banned without warning, no refunds | $250/mo Ultra subscribers banned |
| Mar 2026 | AI credit system introduced, Gemini 3.1 shifted to weekly reset | Pro users locked out 7 days; bugs cause instant lockouts |
The root cause lay in internal policy contradictions at Google. Marketing teams distributed free Pro accounts to students across 120+ countries, while automated verification services in China were certifying approximately 200 fake “students” every 10 minutes. Simultaneously, tools like OpenClaw converted flat-rate subscriptions into unlimited API proxies. Server load exceeded critical thresholds, and Google responded by indiscriminately cutting quotas for all Pro users—without distinguishing between abusers and legitimate developers.
— Analysis, this report
The Structural Shift: GEO Replacing SEO
The rise of in-conversation search (GEO) poses an existential threat to Google’s advertising-based business model. The formula of traditional search economics is straightforward: humans need information → visit Google search page → see ads → click ads → Google earns revenue. GEO eliminates the second step entirely—“visiting the Google search page.”
According to Ahrefs data from December 2025, AI Overviews reduce the click-through rate (CTR) for the #1 ranking page by 58%. Even more critically, the zero-click rate in Google’s own AI Mode reaches 93%. Gemini Deep Research allows users to execute dozens of web searches and generate comprehensive reports without ever visiting the Google search homepage—Google is using its most powerful product to destroy its most essential revenue source.
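A back-of-envelope model makes the scale of the loss concrete. The baseline impression count and organic CTR below are assumptions for illustration; only the 58% CTR reduction and the 93% zero-click rate come from the cited data:

```python
# Illustrative traffic model. Baselines are hypothetical assumptions;
# the 58% reduction and 93% zero-click rate are the cited figures.

def clicks(impressions: int, ctr: float) -> int:
    return round(impressions * ctr)

impressions = 100_000
baseline_ctr = 0.28  # assumed #1-ranking organic CTR, illustration only

before = clicks(impressions, baseline_ctr)                 # no AI Overview
with_aio = clicks(impressions, baseline_ctr * (1 - 0.58))  # AI Overview shown
ai_mode = clicks(impressions, 1 - 0.93)                    # AI Mode: 93% zero-click

print(before, with_aio, ai_mode)  # 28000 11760 7000
```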
As of Q1 2026, 25.11% of Google searches trigger an AI Overview, a 57% increase quarter-over-quarter. In healthcare, nearly half (48.75%) of queries trigger AI Overviews. ChatGPT accounts for 87.4% of all AI referral traffic, and AI-referred visitors convert at twice the rate of traditional organic search.
Google’s stock price reflects this pressure. From an all-time high of $344.66 on February 2, 2026, shares fell to approximately $300 by early March—a decline of roughly 13%. The immediate trigger was the Q4 2025 earnings disclosure of $175–185 billion in planned 2026 capital expenditure (nearly double 2025’s $91.4 billion and 50% above Wall Street’s $120 billion expectation).
Chat Is Not a Feature—It Is the Operating System
The industry tends to treat Chat as one of AI’s many parallel features: chat, search, coding, image generation, video. But this is a fundamental misconception.
Chat is not a feature—it is the entry layer. An operating system is not an application; it is the foundation on which all applications run. Without a human’s natural language input, no AI can cold-start. Agent, Coding, image generation, video generation, search—all of these backend capabilities only function correctly when the Chat entry layer’s intent parsing is accurate.
Entry Layer (Chat / Intent Parsing) → Determines the value ceiling. If intent is misinterpreted here, all downstream computation is wasted.
Connection Layer (Search / Information Alignment) → The interface with the real world. In 2026, where pre-trained knowledge has faded to background, search is the sole channel through which AI accesses truth. The reliability of this channel determines the reliability of all downstream tasks.
Execution Layer (Agent / Coding / Generation) → Produces visible outputs, but is subordinate to the quality of the Entry and Connection layers.
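If the three layers are sequential dependencies, end-to-end quality is bounded by their product, so a weak entry layer caps the whole stack no matter how strong execution is. A minimal sketch with illustrative, not measured, reliabilities:

```python
# Sequential layers: end-to-end quality is the product of layer qualities.

def end_to_end(entry: float, connection: float, execution: float) -> float:
    """Fraction of tasks done right when each layer must succeed in turn."""
    return entry * connection * execution

strong_executor = end_to_end(entry=0.70, connection=0.80, execution=0.99)
strong_entry    = end_to_end(entry=0.95, connection=0.90, execution=0.85)
print(round(strong_executor, 3), round(strong_entry, 3))
# A near-perfect execution layer cannot compensate for weak intent parsing:
# the second configuration wins despite the weaker executor.
```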
Currently, virtually all AI industry investment is concentrated in the Execution Layer. Google is investing $175–185 billion in GPU clusters and data centers. OpenAI is pouring resources into Agent Mode, Computer Use, and Instant Checkout. Every benchmark measures “what the model can do”—SWE-bench tests coding, MMLU tests knowledge breadth, ARC-AGI tests reasoning.
Yet almost no mainstream benchmark systematically tests: “When an ordinary human describes their needs in vague, incomplete, and sometimes misleading natural language, can the model accurately reconstruct that person’s true intent?” The CONSINT-Bench presented at ICLR 2026 attempts to measure intent understanding across depth (5 levels), breadth, correctness, and informativeness, but it has not yet become an industry standard.
Strategic Positioning of Three Major Models
The technical strategy choices of the three major AI companies reflect fundamentally different priorities regarding intent parsing and information alignment.
| Dimension | Google (Gemini) | OpenAI (GPT) | Anthropic (Claude) |
|---|---|---|---|
| Core Strategy | Infrastructure dominance | Feature expansion | Constitutional AI / Honesty |
| Investment Direction | $175–185B CapEx | Agent, Shopping, Browser | Model values, intent respect |
| Chat Entry Layer Quality | Weakest (88% halluc.) | Middle (81% halluc.) | Strongest (lowest halluc.) |
| Search Infrastructure | World’s strongest | Bing-based + proprietary | Relies on external APIs |
| Search Info Alignment | Strong search, but distortion in model processing | Balanced | Weaker search infra, but highest alignment honesty |
| Developer Reputation (Coding) | “Versatile but untrustworthy” | “Broad, fast” | “Precise and reliable” |
| Active Users | 750M monthly (ecosystem integration) | 800M+ weekly | 18.9M monthly (rapid growth) |
The interesting paradox is that Claude does not always score highest on benchmarks, yet it has earned the strongest loyalty among high-value user segments (enterprise developers, researchers). Approximately 80% of Anthropic’s revenue comes from enterprise and developer customers—the user group with the highest demands for accuracy and information alignment reliability. Anthropic’s annualized revenue run-rate reportedly surged from approximately $14 billion in February 2026 to approximately $19 billion by March.
Conclusion: The Need for a New Evaluation Framework
As of March 2026, the AI industry faces two fundamental transitions. First, the shift from pre-training data to real-time GEO search. Second, the shift in evaluation criteria from “what can the model do” to “how well does the model understand humans.”
These two transitions converge on a single conclusion: the future of AI will not be determined by who has the most powerful Agent, but by who most accurately understands a single human sentence.
Criterion 1 — Intent Parsing Capability: The “input driver” of the AI operating system. The ability to reconstruct true intent from vague, incomplete human natural language input. Determines the value ceiling of the entire system.
Criterion 2 — Search Information Alignment: The “real-world interface” of the AI operating system. The ability to honestly align real-time search results with conversational context, and transparently handle contradictions and uncertainty. Determines the reliability of all downstream tasks.
— LEECHO Global AI Research Lab, March 2026
Google possesses the world’s strongest search engine but has the least honest language model on top of it. OpenAI holds the broadest user base but focuses on feature expansion with mid-tier entry layer quality. Anthropic has the smallest user base but has invested the most in the entry layer—intent parsing and information alignment.
The industry’s benchmark ecosystem fails to capture these two dimensions. MMLU measures knowledge breadth, SWE-bench measures coding capability, ARC-AGI measures reasoning—but no industry-standard benchmark systematically measures “the ability to reconstruct true intent from ambiguous human requests” or “the ability to honestly align search results.” This gap is the AI industry’s greatest blind spot. And the company that fills this gap first will win the next generation of AI competition.
References
[1] Nectiv (Oct 2025). ChatGPT Web Search Trigger Analysis — 8,500+ prompts analyzed. 31% web search trigger rate.
[2] Josh Blyskal (Jan 2026). Commercial vs Informational Intent Search Trigger Rates — 53.5% vs 18.7%.
[3] Artificial Analysis (Nov 2025). AA-Omniscience Benchmark — Gemini 3 Pro: 53% accuracy, 88% hallucination rate.
[4] Ahrefs (Feb 2026). AI Overviews Reduce Clicks by 58% — 300,000 keyword analysis, December 2025 data.
[5] Seer Interactive (Nov 2025). AIO Impact on Google CTR Update — Organic CTR down 61%, Paid CTR down 68%.
[6] Alphabet Inc. (Feb 2026). Q4 2025 Earnings Release — $113.8B revenue, $175–185B 2026 CapEx guidance.
[7] Conductor (Jan 2026). AEO/GEO Benchmarks Report — 25.11% of Google searches trigger AI Overviews in Q1 2026.
[8] Superlines (Mar 2026). The State of GEO in Q1 2026 — ChatGPT accounts for 87.4% of AI referral traffic.
[9] Gartner (Oct 2025). Traditional Search Decline Forecast — 25–40% decline by 2026.
[10] Reuters Institute (Jan 2026). Journalism Trends 2026 — Publishers expect 43% search traffic decline in 3 years.
[11] Chartbeat (Nov 2025). Global organic Google search traffic down 33% YoY, 38% in US.
[12] Otterly.AI (Sep 2025). ChatGPT Web Search Frequency Analysis — 500M–875M daily web retrievals estimated.
[13] OpenAI (2025). Holiday Shopping Season — Over 1 billion web searches in ChatGPT in a single week.
[14] Google AI Developers Forum (Mar 2026). Multiple threads on Antigravity quota lockouts, AI credit bugs.
[15] Anthropic (Feb 2026). Revenue run-rate approximately $14B ARR, with Claude Code at $2.5B run-rate.
[16] TechCrunch (Feb 2026). Gemini surpasses 750M MAU — Q4 2025 Alphabet earnings.
[17] CONSINT-Bench (ICLR 2026). Intent understanding evaluation across depth, breadth, correctness, and informativeness.
[18] AWS ML Blog (Feb 2026). Evaluating AI Agents at Amazon — 3-layer evaluation framework for intent detection.
[19] LessWrong (Nov 2025). “Gemini 3 is Evaluation-Paranoid and Contaminated” — Benchmark overfitting evidence.
[20] Exposure Ninja (Mar 2026). AI Search Statistics — Zero-click rates: 34% (no AIO), 43% (with AIO), 93% (AI Mode).