This paper argues that, as of April 2026, the core mechanism of AI coding is not “code generation” but “code search and pattern alignment”. From GitHub Copilot’s line-level completions in 2022, to code-block transplanting in 2024, to multi-Agent collaboration in 2026, the surface architecture of AI coding has evolved continuously, but the underlying behavior has remained consistent: search and move, that is, retrieving the best-matching code patterns from training data and transplanting them into the user’s context. This paper marshals three tiers of evidence. Core evidence: GitClear’s (February 2025) empirical analysis of 211 million lines of code, revealing an 8× increase in code duplication and a 60% decline in refactoring. Mechanistic explanation: the historical evolution of code comments (explaining why AI learned only call patterns, not design logic) and AI companies’ reverse collection of “reasoning trace” data (demonstrating industry awareness of this deficit). Auxiliary validation: field test cases from December 2025 to March 2026 (Opus 4.5’s infinite loops and Claude Code’s multi-Agent architectural inconsistencies). The paper scopes its conclusion to the current technological stage and directly addresses progress on benchmarks such as SWE-bench, arguing that these advances still represent upgrades in search capability rather than breakthroughs in reasoning.
The Misunderstood “Code Generation”
What AI Coding Tools Actually Do vs. What We Think They Do
In 2025, 41% of code was AI-generated or AI-assisted. 84% of developers use or plan to use AI coding tools. GitHub Copilot users complete 126% more projects per week. These figures construct a narrative: AI is learning to program.
But “learning to program” and “learning to search code and transplant it” are entirely different things. When human programmers code, they: understand a real-world problem → build an abstract model in their mind → select appropriate data structures and algorithms → weigh tradeoffs among multiple design options → write code to implement. The core of this process is design decision-making—why this approach and not that one.
What AI coding tools do is: receive a natural language description from the user → search for the best-matching code patterns in training data (billions of lines of existing code) → fine-tune variable names and parameters to fit the current context → output. The core of this process is pattern matching—finding the most similar code and transplanting it.
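The loop described above can be sketched in a few lines. This is an illustrative toy, not the internals of any real tool: the pattern library, the bag-of-words cosine scoring, and the single-identifier “adaptation” step are all simplifying assumptions chosen to make the search-and-transplant shape visible.

```python
from collections import Counter
import math

# A tiny stand-in for "billions of lines of existing code":
# natural-language descriptions mapped to code patterns. (Illustrative.)
PATTERN_LIBRARY = {
    "read a file into a string": "with open(path) as f:\n    data = f.read()",
    "connect to redis with timeout": "client = redis.createClient({ timeout: 30000 })",
    "sum a list of numbers": "data = sum(values)",
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def transplant(request: str, context_name: str) -> str:
    """Retrieve the best-matching pattern, then adapt it to the caller's context."""
    query = Counter(request.lower().split())
    best = max(PATTERN_LIBRARY, key=lambda k: cosine(query, Counter(k.split())))
    # "Adaptation" is a surface edit: rename an identifier to fit the context.
    return PATTERN_LIBRARY[best].replace("data", context_name)

print(transplant("read my config file into a string", "config_text"))
```

The point of the sketch is that no step in `transplant` models the problem; the only operations are similarity ranking and identifier substitution.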
This paper demonstrates that as of April 2026, from line-level completion to multi-Agent collaboration, the entire evolution of AI coding has been an upgrade in search granularity and parallelism, not a qualitative leap from search to reasoning. “Code generation” is the surface packaging; “code search and alignment” is the underlying reality. This conclusion is strictly scoped to the current technological stage—future architectural breakthroughs (such as neuro-symbolic reasoning systems) may change this assessment, but as of this writing, no evidence suggests such a qualitative shift has occurred.
What 211 Million Lines of Code Reveal
The Most Authoritative Quantitative Evidence of AI Coding Behavior
GitClear’s analysis of 211 million lines of code (2020–2024) across Google, Microsoft, Meta, and enterprise client repositories provides the most authoritative quantitative evidence of AI coding behavior.
| Year | AI Adoption | Copy/Paste | Refactor/Move | Code Churn | Milestone |
|---|---|---|---|---|---|
| 2020 | ~0% | 7.8% | 22% | 3.1% | Pre-AI baseline |
| 2021 | ~2% | 8.3% | 25% | 3.3% | Copilot beta |
| 2022 | ~10% | ~9.1% | ~20% | 3.8% | Copilot GA |
| 2023 | ~44% | ~10.2% | ~14% | 4.5% | Copilot explosion, ChatGPT coding |
| 2024 | 63% | 12.3% | <10% | 5.7% | First time: copy > refactor |
Data source: GitClear AI Copilot Code Quality Reports (published 2024, 2025). Based on 211M lines of code from Google, Microsoft, Meta, and enterprise repositories. AI adoption rates from Stack Overflow 2024 Developer Survey.
The latest 2026 data paints an even starker picture. Opsera’s benchmark data shows that AI-generated pull requests are accepted only 32.7% of the time, versus 84.4% for human-written code; AI code carries 1.7× more bugs and 15–18% more security vulnerabilities; and code duplication has grown a further 48%. AI first-pass correctness sits at roughly 70%, and the remaining 30% of errors are precisely the ones that require genuine logical reasoning to fix.
The Evolution of AI Coding: Escalating Granularity of Search and Move
From Lines to Blocks to Functions to Multi-Agent Parallel Transplanting
The History of Code Comments: Why AI Learned Only Call Patterns
Tracing the Structural Deficit in AI Training Data
To understand why AI coding is search-and-transplant rather than logic generation, we must trace back to the historical structure of AI training data.
1980s–2000s: The Pure Code Era
Code in this era consisted of variable names, function names, operators, and symbols. Comments were extremely rare; variable names might be simply a, tmp, buf; function names could be proc1 or fn_x. The finest programming wisdom in computer science (operating system kernels, database engines, compilers, network protocol stacks) was written during this era, and it carried virtually no natural-language comments.
```c
int fn_x(char *buf, int n) {
    int i, tmp = 0;
    for (i = 0; i < n; i++) tmp += buf[i] & 0xff;
    return tmp % 256;
}
```
Post-2010s: The Comment Culture Explosion
GitHub’s widespread adoption in the 2010s turned code into something “written for strangers to read”; agile development demanded rapid handoffs; Stack Overflow cultivated the habit of explaining code in natural language; Code Review became standard practice. Comments began appearing in abundance. But what did they document?
```javascript
// Connect to Redis cache, set timeout to 30s
const client = redis.createClient({ timeout: 30000 });

// The comments that never appear are the ones that explain "why":
// Chose Redis over Memcached because we need persistence and may later extend to message queuing
// Timeout is 30s because upstream API P99 latency is 22s plus a network jitter safety margin
```
Comments record the “What” (what was called) and almost never the “Why” (why it was designed this way). And the overwhelming majority of AI’s training data is the former.
Counterargument and Response: What About Technical Blogs and Design Docs?
A reasonable counterargument is that LLM training data includes not just code repositories but also technical blogs, RFC documents, architecture review records, and Stack Overflow design discussions. These sources do extensively discuss “Why.” But the key issue is: there is no precise mapping between this “Why” information and specific lines of code. An architecture blog discusses the design philosophy of “why we chose microservices,” but does not map line-by-line to every service.register() call in the repository. What LLMs need is a precise correspondence between “this line of code ↔ this design decision,” and such mappings are extremely sparse in training data. Technical blogs discuss “Why” at the abstract level; code annotations discuss “What” at the concrete level—the gap between the two is the structural bottleneck of AI coding capability.
This creates a structural bias in AI coding capability: call-pattern matching is very strong (because comments and the code itself are about exactly this), design-pattern application is moderate (because some tutorials cover it), and architectural-decision reasoning is very weak (because training data contains almost no information at this level).
Reverse Collection: AI Companies Recognize the Gap
Attempts to Bridge the Missing Design Logic in Training Data
Major AI companies began an important technical pivot in 2024–2025—reverse-collecting developers’ “full thinking process” data, attempting to fill the design logic gap in training data.
RLVR: Letting AI Explore Reasoning Traces on Its Own
The most important technical advance of 2025 was RLVR (Reinforcement Learning from Verifiable Rewards). By training LLMs against verifiable reward functions (such as whether code passes unit tests), LLMs spontaneously develop reasoning-like strategies—learning to decompose problems into intermediate steps. DeepSeek R1 (January 2025) was the landmark achievement of this paradigm.
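The core of the RLVR idea can be sketched as a reward function: instead of a human preference score, the reward is computed mechanically by running the candidate code against unit tests. A minimal sketch follows; the candidate program, the `solve` entry-point convention, and the test format are illustrative assumptions, not any lab’s actual training harness.

```python
def verifiable_reward(candidate_src: str, tests: list) -> float:
    """Return the fraction of unit tests a candidate program passes.

    This is the "verifiable" part of RLVR: the reward is an objective,
    mechanically checkable signal, not a human judgment.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)  # compile/load the candidate solution
    except Exception:
        return 0.0  # code that does not even run earns zero reward
    passed = 0
    for inputs, expected in tests:
        try:
            if namespace["solve"](*inputs) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply earns no credit
    return passed / len(tests)

# An illustrative candidate and test set:
candidate = "def solve(x):\n    return abs(x)"
tests = [((3,), 3), ((-4,), 4), ((0,), 0)]
print(verifiable_reward(candidate, tests))  # → 1.0
```

During RL training, such a reward is computed over many sampled candidates; policies that decompose problems into intermediate steps tend to score higher, which is how the reasoning-like behavior emerges.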
Anthropic: Directly Collecting Coding Conversation Data
In August 2025, Anthropic began collecting Claude users’ conversation data, with particular emphasis on the value of “coding workflows.” Data retention extends up to five years. This is not about collecting code itself, but about collecting programmers’ complete thinking processes—from posing the problem, through discussing approaches, iterating modifications, to final completion.
The Rise of Reasoning Models
OpenAI o1/o3, DeepSeek R1, Claude’s Extended Thinking—all major AI companies launched “reasoning models” in 2024–2025, with the core feature of generating visible intermediate thinking steps.
Field Test Cases, December 2025–March 2026: Auxiliary Evidence of Search-and-Move Behavior
Case Observations with Appropriate Evidentiary Caveats
The following are field observation records from testing the latest AI coding tools between December 2025 and March 2026. It must be noted that case observations carry far less evidentiary weight than GitClear’s large-scale statistical analysis; they are presented here as auxiliary validation, not as an independent argument dimension.
Case One: Opus 4.5’s Infinite Loop (December 2025)
When handling a non-standard problem, Opus 4.5 fell into an infinite loop, repeatedly applying the same type of fix pattern. After each failure, the next pattern it retrieved was highly similar to the previous one; it could not escape the current search region.
Mechanistic Interpretation and Competing Explanations: The infinite loop has at least three possible explanations: (a) search-mechanism limitation: the highest-similarity candidates in pattern space cluster together, causing repeated retrieval of the same pattern type; (b) context-window overflow: the model loses information about prior failed attempts in long conversations; (c) RLHF training bias: the model prefers “appearing helpful” over admitting that no solution exists. We favor explanation (a), for two reasons: even in short conversations (failing on the very first attempt), the model’s repair directions are highly similar, and different models exhibit the same looping behavior on similar problems, which points to structural limitations of the pattern space rather than single-instance context loss. However, we acknowledge we cannot fully rule out contributions from (b) and (c).
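Explanation (a) can be made concrete with a toy model of greedy retrieval over a clustered pattern space. The fix patterns and similarity scores below are invented for illustration; the point is structural: if the top candidates are near-duplicates of each other, excluding failed attempts still returns another member of the same cluster, and a low-similarity redesign is never reached.

```python
# Invented similarity scores for candidate fixes to a hypothetical bug.
# The top three are variants of one local pattern; the real fix scores low.
fix_patterns = {
    "add null check before access": 0.91,
    "add None check before attribute access": 0.90,        # same cluster
    "guard attribute access with hasattr": 0.89,           # same cluster
    "redesign ownership so the field cannot be unset": 0.31,  # the real fix
}

def retrieve(excluded: set) -> str:
    """Greedy retrieval: highest-similarity pattern not yet tried."""
    return max((p for p in fix_patterns if p not in excluded),
               key=fix_patterns.get)

tried = set()
for attempt in range(3):
    fix = retrieve(tried)
    print(attempt, "->", fix)
    tried.add(fix)  # the fix fails; exclude it and retry

# Each retry surfaces another near-duplicate from the same cluster;
# three attempts in, the redesign option has still not been reached.
```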
Case Two: Claude Code Multi-Agent Architectural Inconsistency (Late March 2026)
In Claude Code’s multi-Agent mode, multiple Agents each retrieved code patterns from different sources and transplanted them. Specifically, three types of inconsistency were observed: (1) Architectural style conflicts: modules output by different Agents respectively adopted callback-style, Promise-style, and async/await-style async processing paradigms; (2) Error handling contradictions: some modules used try-catch, some used error code returns, and some silently swallowed exceptions; (3) Naming convention fragmentation: camelCase, snake_case, and PascalCase mixed within the same hierarchical level of function definitions. Each Agent’s output was syntactically correct and runnable in isolation, but merging them produced numerous hidden conflicts.
Distinction from Human Teams: Human teams also produce stylistically inconsistent code, but the cause is “insufficient communication”—resolvable through establishing coding standards and Code Review processes. The inconsistency in AI multi-Agent systems is structural: each Agent independently searches its own pattern library, and no shared architectural understanding exists. This is not a process problem but a mechanism problem—because “architectural consistency” is a global constraint, while search-and-transplant is a local operation.
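Inconsistency type (3) above, naming-convention fragmentation, is mechanical enough to detect automatically. A minimal sketch follows; the classifier regexes and the merged function names are illustrative assumptions, not output captured from any real multi-Agent session.

```python
import re

def naming_style(identifier: str) -> str:
    """Classify an identifier's naming convention (simplified heuristics)."""
    if re.fullmatch(r"[a-z]+(_[a-z0-9]+)+", identifier):
        return "snake_case"
    if re.fullmatch(r"[A-Z][a-zA-Z0-9]*", identifier):
        return "PascalCase"
    if re.fullmatch(r"[a-z]+[a-zA-Z0-9]*", identifier) and any(c.isupper() for c in identifier):
        return "camelCase"
    return "other"

# Hypothetical function names merged from three independent Agents:
merged_function_names = ["fetchUser", "fetch_order", "FetchInvoice"]
styles = {name: naming_style(name) for name in merged_function_names}
print(styles)

# Fragmentation: three conventions at the same hierarchical level.
assert len(set(styles.values())) > 1
```

The detector catches the symptom; the paper’s point is that the cause (no shared architectural understanding between independently searching Agents) cannot be fixed at this level.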
Deconstructing “Generated Code” and the SWE-bench Paradox
A Tiered Analysis of What “New” Code Actually Is
A Tiered Framework for Analyzing Transplanting
Code labeled as “newly AI-generated” can actually be decomposed by transplanting tier. The following framework is constructed from GitClear’s code operation taxonomy (copy/paste vs. new vs. move), Forrester’s boilerplate data (developers spend 60% less time on boilerplate), and 2026 AI coding task stratification data (complex architectural decisions account for only 5–10% of requests). The proportions per tier are qualitative inferences, not precise measurements:
| Tier | Description | Inference Basis | Qualitative Share |
|---|---|---|---|
| Tier 1 | Pure boilerplate transplanting: route setup, forms, database connections, CRUD operations | Forrester: developers spend 60% less time on boilerplate, indicating AI has almost fully taken over these tasks | Largest share |
| Tier 2 | Function/framework call composition: combinatorial transplanting of known library functions, framework APIs, and design patterns | GitClear: code copying up 48%, duplicate blocks up 8×, and GitClear explicitly states that “suggested code blocks originate from existing code” | Large share |
| Tier 3 | Context adaptation: fine-tuning variable names, parameters, and interface adaptations for the current project | Copilot completion rate 46% but acceptance rate only 30%, meaning 70% of AI output requires human adaptation | Smaller share |
| Tier 4 | Genuinely new logic: business logic with no direct counterpart in training data | 2026 AI task stratification: complex refactoring and architectural decisions account for only 5–10% of requests (Ofox.ai, 2026) | Very small share |
Note: The above is a qualitative inference framework based on multi-source data, not a precise statistic. Exact quantification per tier requires further research, such as line-by-line “pattern origin tracing” analysis of AI-generated code.
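The “pattern origin tracing” analysis suggested in the note could start from something as simple as token n-gram overlap: measure what fraction of a generated snippet’s n-grams already appear in a reference corpus. The sketch below is a toy under stated assumptions; the two snippets are invented, real tracing would need a corpus index at repository scale, and n-gram overlap is only a crude proxy for pattern origin.

```python
def ngrams(tokens, n=4):
    """Set of token n-grams in a sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Illustrative "training corpus" snippet and "AI-generated" snippet:
corpus = "for i in range ( len ( items ) ) : total += items [ i ]".split()
generated = "for i in range ( len ( values ) ) : total += values [ i ]".split()

corpus_grams = ngrams(corpus)
gen_grams = ngrams(generated)
overlap = len(gen_grams & corpus_grams) / len(gen_grams)
print(f"{overlap:.0%} of 4-grams trace back to the corpus")
```

Even here, where the “generated” code differs from the corpus only by a renamed variable, a large share of its 4-grams trace directly to the corpus; the renamed identifier is exactly the Tier 3 “context adaptation” layer.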
A Direct Response: Does SWE-bench Progress Refute This Paper’s Argument?
A counterpoint that must be addressed is the dramatic progress on the SWE-bench benchmark: Devin scored 13.86% in 2024; by 2025, multiple Agents exceeded 80%. Does this prove AI coding has transcended search-and-transplant? We argue it does not. SWE-bench tasks are drawn from real GitHub issues in existing repositories, exactly the territory where retrieval over known patterns excels, and the score gains have coincided with better retrieval scaffolding (repository navigation, test-driven iteration, multi-Agent parallel search) rather than with demonstrated novel design reasoning. In this paper’s terms, the jump from 13.86% to over 80% is an upgrade in search granularity and parallelism, not a qualitative leap from search to reasoning.
Who Relies Most on AI Coding? The Language Proficiency Paradox
A Counterintuitive but Data-Supported Finding
The programmers who rely most heavily on AI to transplant code are precisely the group with the weakest code comprehension.
| Developer Experience | AI Suggestion Acceptance Rate | Quality Issues per PR | Code Review Time |
|---|---|---|---|
| 0–2 years (Junior) | 31.9% (Highest) | 8.2 (Most) | 15 min |
| 3–5 years (Mid-level) | 28.4% | 6.1 | 22 min |
| 6–10 years (Senior) | 26.2% (Lowest) | 4.3 (Fewest) | 31 min |
METR’s research finding is even more direct: AI coding tools actually made experienced developers 19% slower. These developers can hold entire architectures in their heads; AI tools are a drag on them, not an aid.
Conclusion: Redefining AI Coding
From Surface Narrative to Underlying Reality
Based on GitClear’s empirical analysis of 211 million lines of code (core evidence), the historical evolution of code comments and AI companies’ reverse data-collection strategies (mechanistic explanation), and field test cases from December 2025 to March 2026 (auxiliary validation), this paper reaches the following conclusion:
AI Coding’s Capability Boundaries (A Fair Assessment)
| What AI Coding Excels At | What AI Coding Struggles With |
|---|---|
| Boilerplate generation (CRUD, routing, forms) | Novel architectural design |
| Known framework API calls and integration | Previously unseen algorithmic problems |
| Unit test generation | Comprehensive edge case and exception path coverage |
| Code formatting and naming standardization | Cross-system integration solutions |
| Known bug type detection and fixing | Root cause analysis of performance bottlenecks |
| Documentation and comment generation | Security auditing and threat modeling |
The left column consists entirely of scenarios covered by search-matching capability—training data contains abundant precedents for retrieval. The right column consists entirely of scenarios requiring deep reasoning—demanding holistic understanding of problem spaces, constraints, and tradeoffs. AI coding is extremely efficient in the left column and degrades severely in the right. This is precisely the direct corollary of “the core mechanism is search-and-transplant.”
This insight carries the following practical implications:
First, calibrating expectations for AI capability. AI coding is extremely efficient within the coverage range of existing patterns, but degrades on novel problems beyond training data coverage. The appropriate expectation is: AI is a super code search engine, not a colleague who knows how to program.
Second, redefining the developer’s role. In an era where AI transplants the majority of code volume, the developer’s core value is no longer “writing code” but “designing systems”—making the architectural decisions, requirement understanding, and tradeoff judgments that AI cannot perform.
Third, implications for AI training strategies. To make AI truly learn to program (rather than search-and-transplant), what is needed is not more code and comments but complete records of programmers’ thinking processes when translating real-world problems into code: why this architecture was chosen, why this timeout value was set, why an abstraction layer was added here. This decision logic is currently almost entirely absent from training data.
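What such a record might look like can be sketched concretely. The schema and field names below are hypothetical, invented for illustration rather than taken from any company’s actual data format; the example pairs the Redis snippet from the comment-culture section with the “Why” that its comments never captured.

```python
import json

# Hypothetical decision-trace record: code paired with its design rationale.
# Schema and field names are invented for illustration.
record = {
    "code": "const client = redis.createClient({ timeout: 30000 });",
    "what": "Create a Redis client with a 30s timeout.",
    "why": [
        "Chose Redis over Memcached: persistence is needed, with possible "
        "later extension to message queuing.",
        "Timeout is 30s: upstream API P99 latency is 22s plus a network "
        "jitter safety margin.",
    ],
    "alternatives_rejected": ["Memcached", "in-process LRU cache"],
}

print(json.dumps(record, indent=2))
```

A corpus of such records would supply exactly the “this line of code ↔ this design decision” mapping that the counterargument section identified as missing.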
This paper complements its sister paper, “AI Search Information Alignment Is the Core Function of LLMs” (LEECHO & Opus 4.6, 2026): the former argues from macro-level user behavior that information search is the core function of LLMs; this paper argues from micro-level programming mechanics that even in the most “generative” AI application, the underlying behavior remains search and alignment. Together, both papers converge on the same conclusion: as of April 2026, the essential function of LLMs is information search and alignment; “generation” is the output format of search results.
References
- Harding, W. & Kloster, M. (2025). “AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones.” GitClear Research. 211M lines of code, 2020–2024.
- Harding, W. & Kloster, M. (2024). “Coding on Copilot: 2023 Data Suggests Downward Pressure on Code Quality.” GitClear Research. 153M lines of code, 2020–2023.
- METR (2025). “AI Coding Tools Make Developers 19% Slower.” Randomized controlled trial with experienced developers.
- Opsera (2026). AI code benchmark data. PR acceptance rate 32.7% vs. 84.4% for human code.
- Particula Tech (2026). “AI Coding Tools Developer Productivity Paradox.” Field audit data.
- Stack Overflow (2025). “2025 Developer Survey — AI Section.” 63% of professional developers using AI tools.
- Google DORA (2024). “State of DevOps Report.” AI adoption correlated with a 7.2% decrease in delivery stability.
- Karpathy, A. (2025). “2025 LLM Year in Review.” Analysis of the RLVR paradigm shift.
- Anthropic (2025). Consumer data policy update. Coding-workflow data retention for model training.
- DeepSeek (2025). “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv:2501.12948.
- Qodo (2025). Developer survey: 65% report AI assistants “miss relevant context” during refactoring.
- Forrester (2026). Study of 500 enterprise dev teams: 42% time reduction on routine coding, 60% less time on boilerplate.
- GitHub (2025). Copilot data: 46% completion rate, ~30% acceptance rate, 126% more projects per week.
- Field testing observations (2025.12–2026.03). Opus 4.5 infinite-loop and Claude Code multi-Agent architectural inconsistency observations.
- Ofox.ai (2026). “Best AI Model for Coding in 2026.” Multi-tier task routing framework.
- Futurism/Vocal (2026). “90% Code AI Written by 2026 Reality Check.” Analysis of AI code volume vs. complexity.