ORIGINAL RESEARCH PAPER · APRIL 2026

AI Coding Is Packaged
AI Search and Code Alignment

A Mechanistic Deconstruction Based on 211 Million Lines of Code,
Training Data History, and Field Validation

From Line-Level Completion to Multi-Agent Collaboration: Search All the Way Down

LEECHO Global AI Research Lab
이조글로벌인공지능연구소
&
Claude Opus 4.6 · Anthropic
April 6, 2026 · V2

Abstract

This paper argues that, as of April 2026, the core mechanism of AI coding is not “code generation” but “code search and pattern alignment”. From GitHub Copilot’s line-level completions in 2022, to code block transplanting in 2024, to multi-Agent collaboration in 2026, the surface architecture of AI coding has evolved continuously, but the underlying behavior has remained consistent: search and move—retrieving the best-matching code patterns from training data and transplanting them into the user’s context. This paper marshals three tiers of evidence: core evidence from GitClear’s (February 2025) empirical analysis of 211 million lines of code, revealing an 8× increase in duplicate code blocks and a 60% decline in refactoring; a mechanistic explanation drawing on the historical evolution of code comments (explaining why AI learned only call patterns, not design logic) and AI companies’ reverse collection of “reasoning trace” data (demonstrating that the industry is aware of this deficit); and auxiliary validation from field test cases between December 2025 and March 2026—Opus 4.5’s infinite repair loops and Claude Code’s multi-Agent architectural inconsistencies. The paper scopes its conclusion to the current technological stage and directly addresses progress on benchmarks such as SWE-bench, arguing that these advances still represent upgrades in search capability rather than breakthroughs in reasoning.


SECTION 01 · Introduction

The Misunderstood “Code Generation”

What AI Coding Tools Actually Do vs. What We Think They Do

In 2025, 41% of code was AI-generated or AI-assisted. 84% of developers use or plan to use AI coding tools. GitHub Copilot users complete 126% more projects per week. These figures construct a narrative: AI is learning to program.

But “learning to program” and “learning to search code and transplant it” are entirely different things. When human programmers code, they: understand a real-world problem → build an abstract model in their mind → select appropriate data structures and algorithms → weigh tradeoffs among multiple design options → write code to implement. The core of this process is design decision-making—why this approach and not that one.

What AI coding tools do is: receive a natural language description from the user → search for the best-matching code patterns in training data (billions of lines of existing code) → fine-tune variable names and parameters to fit the current context → output. The core of this process is pattern matching—finding the most similar code and transplanting it.
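
A minimal, hypothetical illustration of this pipeline (both snippets below are ours, not drawn from any real repository): the “generated” output preserves the retrieved pattern’s structure exactly, re-aligning only identifiers and constants to the user’s context.

// Hypothetical pattern as it might appear in training data:
function debounce(fn, delay) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delay);
  };
}

// Plausible "generated" output for "delay my search handler by 300ms":
// identical structure; only identifiers and the constant are re-aligned.
function debounceSearch(handleSearch, waitMs = 300) {
  let pending = null;
  return (...args) => {
    clearTimeout(pending);
    pending = setTimeout(() => handleSearch(...args), waitMs);
  };
}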

This paper demonstrates that as of April 2026, from line-level completion to multi-Agent collaboration, the entire evolution of AI coding has been an upgrade in search granularity and parallelism, not a qualitative leap from search to reasoning. “Code generation” is the surface packaging; “code search and alignment” is the underlying reality. This conclusion is strictly scoped to the current technological stage—future architectural breakthroughs (such as neuro-symbolic reasoning systems) may change this assessment, but as of this writing, no evidence suggests such a qualitative shift has occurred.


SECTION 02 · Empirical Evidence

What 211 Million Lines of Code Reveal

The Most Authoritative Quantitative Evidence of AI Coding Behavior

GitClear’s analysis of 211 million lines of code (2020–2024) across Google, Microsoft, Meta, and enterprise client repositories provides the most authoritative quantitative evidence of AI coding behavior.

Copy/Paste Code: 8.3% → 12.3% (2021→2024: +48% growth)
Refactored/Moved Code: 25% → <10% (2021→2024: −60% decline)
Duplicate Code Blocks: 8× growth (2024 vs. prior years)
Copilot Suggestion Acceptance: ~30% (Copilot’s completion rate is 46%, but only ~30% of its suggestions are accepted)
Year | AI Adoption | Copy/Paste | Refactor/Move | Code Churn | Milestone
2020 | ~0%         | 7.8%       | 22%           | 3.1%       | Pre-AI baseline
2021 | ~2%         | 8.3%       | 25%           | 3.3%       | Copilot beta
2022 | ~10%        | ~9.1%      | ~20%          | 3.8%       | Copilot GA
2023 | ~44%        | ~10.2%     | ~14%          | 4.5%       | Copilot explosion, ChatGPT coding
2024 | 63%         | 12.3%      | <10%          | 5.7%       | First time: copy > refactor

Data source: GitClear AI Copilot Code Quality Reports (published 2024, 2025). Based on 211M lines of code from Google, Microsoft, Meta, and enterprise repositories. AI adoption rates from Stack Overflow 2024 Developer Survey.
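
As a consistency check on the headline figure: the copy/paste share rose from 8.3% to 12.3%, and (12.3 − 8.3) / 8.3 ≈ 0.48, i.e., the +48% growth cited above.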

Historic Crossover: 2024 was the first year in GitClear’s measurement history in which “copy/paste” code lines outnumbered “move/refactor” code lines. This means that in the AI era, transplanting code has officially surpassed improving code. GitClear researchers explicitly stated that when using these tools, it is evident many suggested code blocks “originate from existing code”—developers simply press tab to insert them.

The latest 2026 data paints an even worse picture. Opsera’s benchmark data shows that AI-generated Pull Requests are accepted at a rate of only 32.7%, versus 84.4% for human code. AI code has 1.7× more bugs and 15–18% more security vulnerabilities. Code duplication has continued to grow (+48%). AI first-pass correctness is approximately 70%—and fixing the remaining 30% requires genuine logical reasoning.


SECTION 03 · Evolution Path

The Evolution of AI Coding: Escalating Granularity of Search and Move

From Lines to Blocks to Functions to Multi-Agent Parallel Transplanting

2021–2022 · Phase One
Line-level completion search. GitHub Copilot retrieves the best-matching code snippets from training data to complete the current line based on context. Essence: a line-granularity search engine.
2023–2024 · Phase Two
Code block transplanting. AI evolves from completing one line to transplanting entire code blocks—pressing tab inserts a dozen lines at once. Duplicate code blocks grow 8×. Essence: block-granularity bulk transplanting.
2024–2025 · Phase Three
Function/architecture transplanting. AI begins transplanting entire function implementations, API call patterns, and framework integration templates. “Implement rate limiting with a token bucket algorithm”—AI didn’t invent the token bucket; it retrieved the pattern from training data (see the sketch after this list). Essence: function-granularity pattern transplanting.
2025–2026 · Phase Four
Multi-Agent parallel transplanting. Multiple Agents each search their own pattern libraries and transplant code independently. Result: each thread is syntactically correct internally, but threads contradict each other in architectural style and design philosophy—“multi-track spaghetti transplanting.” Essence: multi-path distributed transplanting.
Core Observation: From 2021 to 2026, the granularity of transplanting escalated from “lines” to “blocks” to “functions/architecture” to “multi-Agent parallel”—but the underlying operation has always been search and move. The surface architecture grows ever more complex; the core behavior has not undergone a qualitative shift from search to reasoning.
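
Phase Three’s token-bucket example is worth making concrete. The sketch below is our textbook-style rendition, not any model’s actual output; the point is that every line of it is a well-worn pattern with countless precedents in public code, so producing it requires retrieval, not invention.

// A minimal token bucket rate limiter (our textbook-style sketch,
// not any model's actual output). capacity = burst size,
// refillPerSec = sustained request rate.
class TokenBucket {
  constructor(capacity, refillPerSec) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }
  tryRemove(count = 1) {
    const now = Date.now();
    // Refill proportionally to elapsed time, capped at capacity
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= count) {
      this.tokens -= count; // allow the request
      return true;
    }
    return false; // reject: bucket is empty
  }
}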

SECTION 04 · Root Cause Analysis

The History of Code Comments: Why AI Learned Only Call Patterns

Tracing the Structural Deficit in AI Training Data

To understand why AI coding is search-and-transplant rather than logic generation, we must trace back to the historical structure of AI training data.

1980s–2000s: The Pure Code Era

Code in this era consisted of variable names, function names, operators, and symbols. Comments were extremely rare; variable names might be simply a, tmp, buf; function names could be proc1, fn_x. The finest programming wisdom in computer science—operating system kernels, database engines, compilers, network protocol stacks—was written during this era. This code had virtually no natural language comments.

/* 1990s code style: no natural language explanation */
int fn_x(char *buf, int n) {
  int i, tmp = 0;
  for(i=0; i<n; i++) tmp += buf[i] & 0xff;
  return tmp % 256;
}

Post-2010s: The Comment Culture Explosion

GitHub’s widespread adoption (exploding in the 2010s) turned code into something “written for strangers to read”; agile development demanded rapid handoffs; Stack Overflow cultivated the habit of “explaining code in natural language”; Code Review became standard practice. Comments began appearing in abundance—but what did they document?

// Typical post-2015 comment: describes “what was done”
// Connect to Redis cache, set timeout to 30s
const client = redis.createClient({ timeout: 30000 });

// The comment that never appears: explains “why”
// Chose Redis over Memcached because we need persistence and may later extend to message queuing
// Timeout 30s because upstream API P99 latency is 22s plus network jitter safety margin

Comments record the “What”—what was called—and almost never the “Why”—why it was designed this way. And the overwhelming majority of AI’s code training data is the former.

Counterargument and Response: What About Technical Blogs and Design Docs?

A reasonable counterargument is that LLM training data includes not just code repositories but also technical blogs, RFC documents, architecture review records, and Stack Overflow design discussions. These sources do extensively discuss “Why.” But the key issue is: there is no precise mapping between this “Why” information and specific lines of code. An architecture blog discusses the design philosophy of “why we chose microservices,” but does not map line-by-line to every service.register() call in the repository. What LLMs need is a precise correspondence between “this line of code ↔ this design decision,” and such mappings are extremely sparse in training data. Technical blogs discuss “Why” at the abstract level; code annotations discuss “What” at the concrete level—the gap between the two is the structural bottleneck of AI coding capability.

The Ironic Paradox: The deepest programming wisdom in computer science—Linus Torvalds’ Linux kernel, Dennis Ritchie’s C language implementation—has no natural language comments, so AI cannot learn from it. What AI learns best is a React component written by a junior developer after 2015, commented with “this useEffect fetches user data on component mount” (sketched below). AI has learned the “surface language” of programming without learning the “underlying thinking” of programming.
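
That caricature, rendered as code (our sketch of the archetype, not a real component):

// Our sketch of the archetypal post-2015 training example:
// the comment restates What the code does, never Why.
import { useEffect, useState } from "react";

function UserProfile({ userId }) {
  const [user, setUser] = useState(null);
  // this useEffect fetches user data on component mount
  useEffect(() => {
    fetch(`/api/users/${userId}`)
      .then((res) => res.json())
      .then(setUser);
  }, [userId]);
  return user ? <div>{user.name}</div> : <div>Loading…</div>;
}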

This creates a structural bias in AI coding capability: call pattern matching is very strong (because comments and the code itself are overwhelmingly about calls), design pattern application is moderate (because some tutorials cover it), and architectural decision reasoning is very weak (because training data contains almost no information at this level).


SECTION 05 · Industry Response

Reverse Collection: AI Companies Recognize the Gap

Attempts to Bridge the Missing Design Logic in Training Data

Major AI companies began an important technical pivot in 2024–2025—reverse-collecting developers’ “full thinking process” data, attempting to fill the design logic gap in training data.

RLVR: Letting AI Explore Reasoning Traces on Its Own

The most important technical advance of 2025 was RLVR (Reinforcement Learning with Verifiable Rewards). When LLMs are trained against verifiable reward functions (such as whether code passes unit tests), they spontaneously develop reasoning-like strategies—learning to decompose problems into intermediate steps. DeepSeek R1 (January 2025) was the landmark achievement of this paradigm.
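
In sketch form, the RLVR reward is nothing more than a checkable pass/fail signal. The snippet below is a minimal illustration, assuming a hypothetical runTests helper that executes a task’s unit tests against a candidate program:

// Minimal sketch of a verifiable reward for code RL (illustrative only).
// runTests is an assumed helper that executes the task's unit tests
// against the candidate program and reports how many failed.
function verifiableReward(candidateProgram, task) {
  const result = runTests(candidateProgram, task.unitTests); // assumed helper
  // Binary, machine-checkable reward: no human labeler in the loop
  return result.failed === 0 ? 1 : 0;
}

Because the reward is machine-checkable at scale, models can be pushed to decompose problems into intermediate steps without anyone labeling those steps—but the reward says nothing about why a design was chosen, only whether the tests pass.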

Anthropic: Directly Collecting Coding Conversation Data

In August 2025, Anthropic began collecting Claude users’ conversation data, with particular emphasis on the value of “coding workflows.” Data retention extends up to five years. This is not about collecting code itself, but about collecting programmers’ complete thinking processes—from posing the problem, through discussing approaches, iterating modifications, to final completion.

The Rise of Reasoning Models

OpenAI o1/o3, DeepSeek R1, Claude’s Extended Thinking—all major AI companies launched “reasoning models” in 2024–2025, with the core feature of generating visible intermediate thinking steps.

Critical Limitation: These reverse collection efforts can partially improve AI coding’s logical capability, but face a fundamental constraint—what is collected is still “thinking that can be verbalized,” not “the deep intuition of programming design.” When an experienced architect designs a system, the spatial structures, data flows, and failure modes spinning in their mind are largely non-verbal and cannot be captured through conversation collection.

SECTION 06 · Auxiliary Validation

Field Test Cases, December 2025–March 2026: Auxiliary Evidence of Search-and-Move Behavior

Case Observations with Appropriate Evidentiary Caveats

The following are field observation records from testing the latest AI coding tools between December 2025 and March 2026. It must be noted that case observations carry far less evidentiary weight than GitClear’s large-scale statistical analysis; they are presented here as auxiliary validation, not as an independent argument dimension.

Case One: Opus 4.5’s Infinite Repair Loop (December 2025)

Phenomenon

When handling a non-standard problem, Opus 4.5 fell into an infinite repair loop—repeatedly applying the same type of fix pattern. After each failure, the next pattern it retrieved was highly similar to the previous one; it could not escape the current search region.

Mechanistic Interpretation and Competing Explanations: The looping behavior has at least three plausible explanations: (a) search mechanism limitation—the highest-similarity candidates in pattern space cluster together, causing repeated retrieval of the same pattern type; (b) context window overflow—the model loses information about prior failed attempts in long conversations; (c) RLHF training bias—the model prefers “appearing helpful” over admitting that no solution exists. We favor explanation (a): even in short conversations (the loop appears from the very first failed attempt), the model’s repair directions are highly similar, and different models exhibit the same looping behavior on similar problems—pointing to structural limitations of the pattern space rather than context loss in any single instance. However, we acknowledge we cannot fully rule out contributions from (b) and (c).
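
Explanation (a) can be made concrete with a toy model. The sketch below is ours and purely illustrative—the vectors and fix names are invented: when candidate fixes are embedded in a pattern space and retrieval is greedy nearest-neighbor, near-duplicate fixes crowd out the structurally different one, so each retry lands in the same neighborhood.

// Toy illustration (ours) of explanation (a): greedy nearest-neighbor
// retrieval over a clustered pattern space keeps returning variants of
// the fix that just failed. Vectors and names are invented for the sketch.
const candidateFixes = [
  { name: "add null check (variant 1)", vec: [0.90, 0.10] },
  { name: "add null check (variant 2)", vec: [0.88, 0.12] },
  { name: "add null check (variant 3)", vec: [0.91, 0.09] },
  { name: "redesign the data flow",     vec: [0.10, 0.95] }, // the real fix, far away
];
const bugSignature = [0.92, 0.08]; // "what this failure looks like" to the model

const cosine = (a, b) => {
  const dot = a[0] * b[0] + a[1] * b[1];
  const norm = (v) => Math.hypot(v[0], v[1]);
  return dot / (norm(a) * norm(b));
};

// Rank by similarity: the three near-duplicate null checks all outrank
// the structurally different fix, so every retry stays in the same cluster.
const ranked = candidateFixes
  .map((f) => ({ name: f.name, score: cosine(bugSignature, f.vec) }))
  .sort((x, y) => y.score - x.score);
console.log(ranked);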

Case Two: Claude Code Multi-Agent Architectural Inconsistency (Late March 2026)

Phenomenon

In Claude Code’s multi-Agent mode, multiple Agents each retrieved code patterns from different sources and transplanted them. Specifically, three types of inconsistency were observed: (1) Architectural style conflicts: modules output by different Agents respectively adopted callback-style, Promise-style, and async/await-style async processing paradigms; (2) Error handling contradictions: some modules used try-catch, some used error code returns, and some silently swallowed exceptions; (3) Naming convention fragmentation: camelCase, snake_case, and PascalCase mixed within the same hierarchical level of function definitions. Each thread was syntactically correct and runnable internally, but merging produced numerous hidden conflicts.
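
Schematically, the observed clash looks like the sketch below (our reconstruction of the pattern, not verbatim Agent output; db is an assumed database handle exposing both callback and Promise APIs):

// Our schematic reconstruction of the observed clash (not verbatim output).

// Agent A: callback style, errors passed to the callback, camelCase naming
function loadUser(id, callback) {
  db.query("SELECT * FROM users WHERE id = ?", [id], (err, rows) => {
    if (err) return callback(err);
    callback(null, rows[0]);
  });
}

// Agent B: Promise-chain style, errors propagate as rejections, snake_case naming
function load_order(order_id) {
  return db
    .queryAsync("SELECT * FROM orders WHERE id = ?", [order_id])
    .then((rows) => rows[0]);
}

// Agent C: async/await style, errors converted to return codes, PascalCase naming
async function LoadInvoice(invoiceId) {
  try {
    const rows = await db.queryAsync("SELECT * FROM invoices WHERE id = ?", [invoiceId]);
    return rows[0];
  } catch {
    return { error: "INVOICE_LOAD_FAILED" }; // swallows the exception silently
  }
}

Each function is locally correct; merged into one module, they encode three incompatible conventions for async flow, error handling, and naming—exactly the three inconsistency types listed above.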

Distinction from Human Teams: Human teams also produce stylistically inconsistent code, but the cause is “insufficient communication”—resolvable through establishing coding standards and Code Review processes. The inconsistency in AI multi-Agent systems is structural: each Agent independently searches its own pattern library, and no shared architectural understanding exists. This is not a process problem but a mechanism problem—because “architectural consistency” is a global constraint, while search-and-transplant is a local operation.

Case Observation Summary (April 6, 2026): The evidentiary strength of the two cases above is limited (individual cases do not equal patterns), but they point in the same direction as GitClear’s large-scale statistical data: when search works, AI is powerful; when search fails or search results require global coordination, AI’s performance degrades sharply. This is consistent with the hypothesis that “the underlying mechanism is search-and-transplant.”

SECTION 07 · Core Argument

Deconstructing “Generated Code” and the SWE-bench Paradox

A Tiered Analysis of What “New” Code Actually Is

A Tiered Framework for Analyzing Transplanting

Code labeled as “newly AI-generated” can actually be decomposed by transplanting tier. The following framework is constructed from GitClear’s code operation taxonomy (copy/paste vs. new vs. move), Forrester’s boilerplate data (developers spend 60% less time on boilerplate), and 2026 AI coding task stratification data (complex architectural decisions account for only 5–10% of requests). The proportions per tier are qualitative inferences, not precise measurements:

Tier | Description | Inference Basis | Qualitative Share
Tier 1 | Pure boilerplate transplanting: route setup, forms, database connections, CRUD operations | Forrester: developers spend 60% less time on boilerplate, indicating AI has almost fully taken over these tasks | Largest share
Tier 2 | Function/framework call composition: combinatorial transplanting of known library functions, framework APIs, and design patterns | GitClear: code copying up 48%, duplicate blocks up 8×; GitClear explicitly states that suggested code blocks “originate from existing code” | Large share
Tier 3 | Context adaptation: fine-tuning variable names, parameters, and interface adaptations for the current project | Copilot: 46% completion rate but only ~30% acceptance, meaning roughly 70% of AI output requires human adaptation | Smaller share
Tier 4 | Genuinely new logic: business logic with no direct counterpart in training data | 2026 AI task stratification: complex refactoring and architectural decisions account for only 5–10% of requests (Ofox.ai, 2026) | Very small share

Note: The above is a qualitative inference framework based on multi-source data, not a precise statistic. Exact quantification per tier requires further research, such as line-by-line “pattern origin tracing” analysis of AI-generated code.
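
One way such pattern origin tracing might work (our speculative sketch, not an existing tool): collapse identifiers, shingle the generated code into token n-grams, and measure what fraction of shingles already exist in an index of the training corpus. A score near 1.0 would indicate near-total transplanting; a low score would flag candidate Tier 4 logic.

// Our speculative sketch of line-level "pattern origin tracing".
// Normalize identifiers away, shingle into token 5-grams, and measure
// what fraction of shingles already exist in a corpus index.
function shingles(code, n = 5) {
  const tokens = code
    .replace(/[A-Za-z_]\w*/g, "ID")      // collapse identifiers/keywords
    .match(/ID|\d+|[^\s\w]/g) ?? [];      // keep placeholders, numbers, symbols
  const out = [];
  for (let i = 0; i + n <= tokens.length; i++) out.push(tokens.slice(i, i + n).join(" "));
  return out;
}

function originScore(generatedCode, corpusIndex /* Set of known shingles */) {
  const s = shingles(generatedCode);
  if (s.length === 0) return 0;
  const hits = s.filter((g) => corpusIndex.has(g)).length;
  return hits / s.length; // ~1.0 = fully transplanted, ~0.0 = genuinely novel
}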

A Direct Response: Does SWE-bench Progress Refute This Paper’s Argument?

A counterpoint that must be addressed is the dramatic progress on the SWE-bench benchmark: Devin scored 13.86% in 2024; by 2025, multiple Agents exceeded 80%. Does this prove AI coding has transcended “search-and-transplant”?

Our Response: It does not, for the following reasons. SWE-bench tasks involve “resolving real issues in real GitHub repositories.” But analyzing the type distribution of these issues reveals that the vast majority are known-type bug fixes (null pointers, type errors, API change adaptations, dependency conflicts)—problems with abundant precedents in training data. AI’s progress on these tasks is fundamentally an improvement in search-matching capability—the ability to find more precise fix patterns within larger codebase contexts. Tasks that truly test reasoning ability—such as novel architectural design, previously unseen algorithmic problems, and cross-system integration solutions—are outside SWE-bench’s measurement scope. SWE-bench progress is not contradictory to this paper’s argument: it demonstrates that AI’s search capability is getting stronger, not that AI has leapt from search to reasoning.

Quantitative Perspective: If the definition of “transplanting” is expanded from the narrow sense of “copy/paste” (GitClear’s measured 12.3%) to the broad sense of “retrieving and combining existing patterns” (including boilerplate generation, call composition, and parameter adaptation), then broad transplanting constitutes the vast majority of AI coding output. GitClear’s data confirms growth in narrow transplanting; SWE-bench data confirms enhanced broad transplanting (pattern-matching fixes). Both point to the same conclusion: search capability is improving, but reasoning capability has not undergone a qualitative transformation.

SECTION 08 · The User Paradox

Who Relies Most on AI Coding? The Language Proficiency Paradox

A Counterintuitive but Data-Supported Finding

A counterintuitive yet data-supported finding: the programmers who most frequently rely on AI to transplant code are precisely the group with the weakest code comprehension.

Developer Experience | AI Suggestion Acceptance Rate | Quality Issues per PR | Code Review Time
0–2 years (Junior)    | 31.9% (highest) | 8.2 (most)   | 15 min
3–5 years (Mid-level) | 28.4%           | 6.1          | 22 min
6–10 years (Senior)   | 26.2% (lowest)  | 4.3 (fewest) | 31 min

METR’s research finding is even more direct: AI coding tools actually made experienced developers 19% slower. These developers can hold entire architectures in their heads; AI tools are a drag on them, not an aid.

The Essence of the Paradox: The programmers for whom AI coding tools are most “useful” are precisely those who cannot write the code themselves and need AI to “transplant” it for them. And these programmers lack sufficient programming language comprehension to judge whether the transplanted code is appropriate. This creates a vicious cycle—more AI code leads to less understanding, and less understanding leads to more reliance on AI transplanting. The end result: code volume increases, code quality decreases, and technical debt accumulates.

SECTION 09 · Conclusion

Conclusion: Redefining AI Coding

From Surface Narrative to Underlying Reality

Based on GitClear’s empirical analysis of 211 million lines of code (core evidence), the historical evolution of code comments and AI companies’ reverse data collection strategies (mechanistic explanation), and field test cases from December 2025 to March 2026 (auxiliary validation), this paper reaches the following conclusion:

Core Proposition: As of April 2026, AI coding is packaged AI search and code alignment. From 2022’s line-level completion to 2026’s multi-Agent collaboration, the entire evolution of AI coding has been an upgrade in search granularity and parallelism, not a qualitative leap from search to reasoning. “Code generation” is the surface narrative; “code search and pattern transplanting” is the underlying reality. This conclusion is scoped to the current technological stage—future architectural breakthroughs may alter this assessment.

AI Coding’s Capability Boundaries (A Fair Assessment)

What AI Coding Excels At | What AI Coding Struggles With
Boilerplate generation (CRUD, routing, forms) | Novel architectural design
Known framework API calls and integration | Previously unseen algorithmic problems
Unit test generation | Comprehensive edge case and exception path coverage
Code formatting and naming standardization | Cross-system integration solutions
Known bug type detection and fixing | Root cause analysis of performance bottlenecks
Documentation and comment generation | Security auditing and threat modeling

The left column consists entirely of scenarios covered by search-matching capability—training data contains abundant precedents for retrieval. The right column consists entirely of scenarios requiring deep reasoning—demanding holistic understanding of problem spaces, constraints, and tradeoffs. AI coding is extremely efficient in the left column and degrades severely in the right. This is precisely the direct corollary of “the core mechanism is search-and-transplant.”

This insight carries the following practical implications:

First, calibrating expectations for AI capability. AI coding is extremely efficient within the coverage range of existing patterns, but degrades on novel problems beyond training data coverage. The appropriate expectation is: AI is a super code search engine, not a colleague who knows how to program.

Second, redefining the developer’s role. In an era where AI transplants the majority of code volume, the developer’s core value is no longer “writing code” but “designing systems”—making the architectural decisions, requirement understanding, and tradeoff judgments that AI cannot perform.

Third, implications for AI training strategies. To make AI truly learn to program (rather than search-and-transplant), what is needed is not more code and annotations but complete records of programmers’ thinking processes when translating real-world problems into code—why this architecture was chosen, why this timeout value was set, why an abstraction layer was added here. This decision logic is currently almost entirely absent from any training data.
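
To make the proposal concrete, here is a sketch of what a decision-linked training record might look like, reusing the Redis example from Section 04 (our hypothetical schema, not any vendor’s actual format):

// Our hypothetical schema for a decision-linked training record:
// each code span paired with the Why that ordinary comments omit.
const trainingRecord = {
  code: 'const client = redis.createClient({ timeout: 30000 });',
  what: "Connect to Redis cache with a 30s timeout",
  whyThisChoice:
    "Redis over Memcached: persistence is required, and the same instance " +
    "may later be extended to lightweight message queuing",
  whyThisValue:
    "30s timeout: upstream API P99 latency is 22s, plus a network jitter margin",
  alternativesRejected: [
    "Memcached (no persistence)",
    "in-process LRU cache (no cross-instance sharing)",
  ],
};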

This paper complements its sister paper, “AI Search Information Alignment Is the Core Function of LLMs” (LEECHO & Opus 4.6, 2026): the former argues from macro-level user behavior that information search is the core function of LLMs; this paper argues from micro-level programming mechanics that even in the most “generative” AI application, the underlying behavior remains search and alignment. Together, both papers converge on the same conclusion: as of April 2026, the essential function of LLMs is information search and alignment; “generation” is the output format of search results.

References

  1. Harding, W. & Kloster, M. (2025). “AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones.” GitClear Research. 211M lines of code, 2020–2024.
  2. Harding, W. & Kloster, M. (2024). “Coding on Copilot: 2023 Data Suggests Downward Pressure on Code Quality.” GitClear Research. 153M lines of code, 2020–2023.
  3. METR (2025). “AI Coding Tools Make Developers 19% Slower.” Randomized controlled trial with experienced developers.
  4. Opsera (2026). AI Code Benchmark Data. PR acceptance rate 32.7% vs. 84.4% for human code.
  5. Particula Tech (2026). “AI Coding Tools Developer Productivity Paradox.” Field audit data.
  6. Stack Overflow (2025). “2025 Developer Survey — AI Section.” 63% of professional developers using AI tools.
  7. Google DORA (2024). “State of DevOps Report.” AI adoption correlated with a 7.2% decrease in delivery stability.
  8. Karpathy, A. (2025). “2025 LLM Year in Review.” Analysis of the RLVR paradigm shift.
  9. Anthropic (2025). Consumer data policy update. Coding workflow data retention for model training.
  10. DeepSeek (2025). “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv:2501.12948.
  11. Qodo (2025). Developer survey: 65% report AI assistants “miss relevant context” during refactoring.
  12. Forrester (2026). Study of 500 enterprise dev teams: 42% time reduction on routine coding, 60% less time on boilerplate.
  13. GitHub (2025). Copilot data: 46% completion rate, ~30% acceptance rate; 126% more projects per week.
  14. Field testing observations (December 2025–March 2026). Opus 4.5 infinite repair loop; Claude Code multi-Agent architectural inconsistency observations.
  15. Ofox.ai (2026). “Best AI Model for Coding in 2026.” Multi-tier task routing framework.
  16. Futurism/Vocal (2026). “90% Code AI Written by 2026 Reality Check.” Analysis of AI code volume vs. complexity.

“AI moved 90% of the code. Humans created 60% of the value. The move was search. The creation was understanding.”

LEECHO Global AI Research Lab · 이조글로벌인공지능연구소 & Claude Opus 4.6 · Anthropic
V2 · April 6, 2026
