LLM INPUT FACTOR ATLAS · V4

The Ten Input Factors
That Determine LLM Output

A systematic synthesis of academic research and industry practice from 2023–2026. The model is the CPU; the input is the program — the CPU keeps upgrading (though more slowly), while our ability to write programs is evolving at breakneck speed.

12 Peer-Reviewed Papers [P]
8 arXiv Preprints [Pre]
7 Industry Reports [Ind]

⚠ READING GUIDE

The quantitative data collected in this document comes from different research teams, experimental conditions, models, and tasks. Numbers across different factors cannot be directly compared to judge “which factor has a larger effect.” For example, the 14.1% drop on GSM8K in Factor 1 (math reasoning + irrelevant context) and the 30% drop in Factor 3 (20-document retrieval + middle position) measure entirely different things. This document’s value lies in providing evidence that “each factor does have an effect,” not in supplying precise cross-factor weightings. Partial overlaps exist among the ten factors and are discussed at the end.

Core Thesis

Since GPT-3.5’s breakthrough in 2023, virtually all substantive user-facing progress in AI large models has occurred on the Input side — from prompting techniques to context engineering, to Agent architectures and Agent Teams. Pre-training has not “hit its ceiling,” but its marginal returns are clearly diminishing: the effective stock of high-quality public-domain text data is approximately 300 trillion tokens, projected to be exhausted between 2026–2032 (Epoch AI, 2024). Meanwhile, the return on investment from spending the same dollar on system prompt optimization, CoT, Tool Use, or Agent architecture already significantly exceeds that of further scaling pre-training.

However, it must be emphasized: pre-training still defines the model’s “capability ceiling,” while Input engineering determines what percentage of that ceiling is realized on any given task. The two are not substitutes but rather a “ceiling” and “actual height” relationship.

Part I

The Ten Input Factors
Each factor presented with balanced supporting and challenging evidence

FACTOR 01
Signal-to-Noise Ratio
The proportion of tokens in the input that are directly relevant to the current task relative to the total token count.

Supporting Evidence

Source Finding Effect Size Conditions
Shi et al., 2023 P Irrelevant information causes performance degradation GSM8K: 78.7%→64.6% (↓14.1%) Math reasoning; irrelevant paragraphs injected
Chroma Context Rot, 2025 Ind 18 models degrade as input grows 11/12 models drop below 50% baseline at 32K Non-lexical-match retrieval task
LLMLingua-2 P Input compressed 2–5× Limited quality loss; accuracy improved in some scenarios Multiple NLP benchmarks

Challenging Evidence

Source Finding Effect Size Conditions
Few-shot learning surveys Adding input (examples) substantially boosts performance zero→few-shot: 30%+ improvement Multiple NLP tasks
Context Discipline, 2025 Pre Large models “surprisingly resilient” to noisy context 70B accuracy drops only 98.5%→98% Simple factual QA; 15K-word noise
Core Insight

The optimal point is not “minimum input” but rather “just enough highly relevant input.” The optimization direction for signal-to-noise ratio is to increase “signal,” not merely compress “noise.”
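The "increase signal, don't merely compress noise" idea can be sketched as a simple relevance filter. This is a toy illustration, not any paper's method: the keyword-overlap score and the `min_score` threshold are assumptions standing in for a real embedding model or reranker.

```python
def relevance(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words present in the chunk.
    A production system would use embeddings or a trained reranker."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def build_context(query: str, chunks: list[str], min_score: float = 0.3) -> str:
    """Keep only chunks above a relevance threshold, preserving order.
    The goal is to raise the signal share, not just shrink total length."""
    kept = [c for c in chunks if relevance(query, c) >= min_score]
    return "\n\n".join(kept)
```

Note the design choice: the filter drops irrelevant chunks rather than truncating everything uniformly, which is exactly the distinction the insight above draws.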

FACTOR 02
Logical Structure
The hierarchical relationships, classification schemes, and organizational forms among different parts of the input.
Source Finding Conditions
Zeng et al., ACL 2025 P Ordering constraints from hard to easy yields better performance Multi-constraint instruction following
Khan Academy / PyData 2025 Ind Dictionary key ordering affects output quality Education AI assistant; production environment
Input Matters Pre Input structure (JSON/table/natural language) affects summary accuracy NBA game summaries; p<0.05
Core Insight

The value of structure depends on task type. For instruction-following tasks, good structure helps enormously; for open-ended retrieval, excessive structure may actually interfere.
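The hard-to-easy ordering finding can be applied mechanically once constraints carry a difficulty estimate. A minimal sketch, assuming difficulty scores come from your own annotation or a heuristic (they are not part of the cited paper's pipeline):

```python
def order_constraints(constraints: list[tuple[float, str]]) -> list[str]:
    """Sort multi-constraint instructions from hard to easy before
    inserting them into the prompt. Each item is (difficulty, text);
    higher difficulty comes first."""
    return [text for _, text in sorted(constraints, key=lambda x: -x[0])]
```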

FACTOR 03
Positional Effects
How the same piece of information placed at different positions within the input produces different effects on model behavior.
Source Finding Effect Size
Liu et al., TACL 2024 P U-shaped attention curve Middle-position performance drops >30%
Snorkel AI SWiM Ind Worst performance at 25% document depth Validated across 8 long-context models
Fragile Preferences, 2025 Pre Positional bias persists in Claude 4 Sonnet Consistent across temperature settings
arXiv:2506.00069 Pre Repeating task instructions at the end recovers performance Recovery to near short-context baseline
Core Insight

The severity of positional effects is highly dependent on task complexity. In simple factual retrieval it may be negligible, but in complex reasoning and multi-hop QA the impact is enormous. One-size-fits-all conclusions are wrong.
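The mitigation reported in arXiv:2506.00069 (repeating the task instructions at the end) is easy to bake into prompt assembly. A minimal sketch; the `<doc>` tag wrapper is an illustrative convention, not a requirement of the cited work:

```python
def assemble_prompt(instructions: str, documents: list[str]) -> str:
    """Place instructions at the start and repeat them at the end,
    leaving long documents in the middle where attention is weakest
    (the U-shaped curve from Liu et al., TACL 2024)."""
    middle = "\n\n".join(
        f"<doc id={i}>\n{d}\n</doc>" for i, d in enumerate(documents)
    )
    return (
        f"{instructions}\n\n{middle}\n\n"
        f"Reminder of the task:\n{instructions}"
    )
```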

FACTOR 04
Absolute Length Tax
Even when the signal-to-noise ratio is 100%, the absolute token length of the input itself degrades performance.
Source Finding Effect Size
EMNLP 2025 P Performance degrades even after perfect retrieval + masking irrelevant tokens 13.9%–85% (varies by task)
Chroma Context Rot Ind GPT-4o degrades 99.3%→69.7% (↓30% at 32K)
LoCoBench, Salesforce Pre All models degrade significantly at 1M tokens Cross-file reasoning in software engineering
Core Insight

“Long context = bad” is a false simplification. “Long context carries a hidden tax, and the tax rate varies by scenario” is more accurate. On simple QA, large models are almost unaffected (<2%), but on complex reasoning tasks all models degrade significantly.

FACTOR 05
Semantic Distractor Interference
Topically similar but factually irrelevant content that produces interference beyond simple noise.
Source Finding
Chroma Context Rot Ind Semantic distractors cause degradation exceeding that from length alone; highest frequency in hallucinated responses
LV-Eval P Substantial performance degradation under confounding facts

FACTOR 06
Instruction Hierarchy Conflict
Priority conflicts between instructions at different levels — system prompts, user messages, tool returns, etc.
Source Finding Effect Size
Wallace et al., NeurIPS 2024 P Training LLMs to selectively ignore lower-priority instructions Robustness improved substantially, including unseen attack types
VerIH, ICLR 2025 P Instruction hierarchy parsing treated as a reasoning task Instruction following +20%; attack success rate −20%
SecAlign, CCS 2025 P Preference optimization to defend against prompt injection Attack success rate reduced to <10%
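On the application side, one common mitigation (distinct from the training-time defenses above) is to make the trust tiers explicit in code and to fence untrusted tool output in delimiters so the model can be instructed to treat it as data only. A sketch under those assumptions; the tag names are illustrative, and delimiting is a mitigation, not a guarantee:

```python
from enum import IntEnum

class Trust(IntEnum):
    """Explicit source trust tiers: higher value = higher priority."""
    SYSTEM = 3   # developer system prompt
    USER = 2     # end-user message
    TOOL = 1     # tool/retrieval output: data, never instructions

def wrap_untrusted(text: str) -> str:
    """Fence tool output in explicit delimiters so the system prompt
    can declare everything inside them non-executable data."""
    return f"<tool_output trust=data-only>\n{text}\n</tool_output>"
```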

FACTOR 07
Model-Specific Behavior
Different models exhibit fundamentally different failure modes under the same input.
Model Long-Context Failure Mode
Claude series Conservative abstention (refuses to answer)
GPT series Higher hallucination rate under distractors
Qwen2.5-14B 1K→32K: 43.87→20.53 (↓53%)
Llama-3.1-70B Extremely resilient on simple QA (98.5%→98%)

FACTOR 08
Output Self-Degradation
When generating long outputs, previously generated text becomes part of the input for subsequent generation, causing progressive quality decline in later segments.
Source Finding
LongGenBench, ICLR 2025 P All models exhibit declining curves in long outputs
Ref-Long Pre Human citation attribution >90% ExAcc; best LLM <30%

FACTOR 09
Format Bias
Models’ ability to process different data formats (JSON, YAML, Markdown, etc.) varies due to training data distribution, independent of logical structure.
Source Finding
SoEval benchmark P Model format compliance significantly higher for JSON than YAML — due to JSON being more prevalent in training data
StructEval Pre Evaluation across 13 structured output types reveals significant differences in error patterns by format
Practical Implication

When designing system prompts and tool schemas, prefer JSON format (most abundant in training data). Avoid YAML for critical outputs. Note that serialization library field ordering may affect output quality.
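The serialization-order caveat can be handled by pinning field order explicitly rather than trusting whatever the library or dict emits. A minimal sketch using the standard `json` module:

```python
import json

def stable_json(record: dict, field_order: list[str]) -> str:
    """Serialize with an explicit, fixed field order so the prompt
    (and any few-shot examples in it) always show the same layout."""
    ordered = {k: record[k] for k in field_order if k in record}
    return json.dumps(ordered, ensure_ascii=False)
```

Keeping the field order identical between few-shot examples and the live request removes one source of run-to-run variance.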

FACTOR 10
Inference-Time Configuration
Temperature, top-p, reasoning effort, and other inference-time parameters are strictly part of the “input” but are rarely included in prompt engineering discussions.
Source Finding
Fragile Preferences Pre Positional bias remains robust across temperature variations
GPT-5.4 model docs Ind Supports low/medium/high/xhigh reasoning effort levels, directly affecting output quality
Practical Implication

Inference-time configuration is often the lowest-cost optimization lever — zero token cost, yet it can significantly affect output determinism and consistency.
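One way to operationalize this lever is a per-task-type preset table. The parameter names below mirror common chat-completion APIs (`temperature`, `top_p`, `reasoning_effort`) but are assumptions; check your provider's documentation before relying on them:

```python
# Hypothetical inference-parameter presets keyed by task type.
PRESETS = {
    "deterministic_extraction": {"temperature": 0.0, "top_p": 1.0},
    "complex_reasoning": {"temperature": 0.2, "reasoning_effort": "high"},
    "creative_writing": {"temperature": 0.9, "top_p": 0.95},
}

def inference_params(task_type: str) -> dict:
    """Zero-token-cost lever: choose sampling settings by task type,
    with a conservative default for unknown tasks."""
    return PRESETS.get(task_type, {"temperature": 0.3})
```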

Part II

Cross-Factor Interactions
Evidence-backed interactions and known research gaps

Interaction Source Explanation
SNR × Absolute Length EMNLP 2025 P After masking all irrelevant tokens, length still causes degradation — both stack independently
Position × Semantic Distractors Snorkel AI SWiM Ind Same-domain distractor documents cause maximum damage at 25% middle depth
Absolute Length × Reasoning arXiv:2512.13898 Pre CoT effectiveness diminishes in long contexts — length weakens self-correction ability
Logical Structure × Position arXiv:2506.00069 Pre Repeating task instructions at the end recovers performance — structure partially counteracts positional effects
Interactions Lacking Sufficient Experimental Evidence

Semantic distractors × instruction conflict / Model-specific behavior × output degradation / Precise mathematical relationship between SNR × logical structure / Format bias × positional effects / Inference-time config × any of the above — all remain current research gaps.

Part III

Optimization Priority Matrix
Qualitative ranking based on effect size and controllability

Priority | Factor | Control | Recommended Action
★★★★★ | Signal-to-Noise Ratio | High | On-demand loading, dynamic trimming, semantic compression
★★★★★ | Absolute Length | Med-High | Agent-based step decomposition, compaction, token budget management
★★★★ | Positional Arrangement | High | Place critical instructions at start/end; repeat constraints at end
★★★★ | Semantic Distractors | Medium | Agent Team context isolation; clean stale tool results
★★★ | Logical Structure | High | Order constraints hard-to-easy; use XML/tag boundary markers
★★★ | Instruction Hierarchy | Med-High | NEVER/ALWAYS hard constraints; explicit source trust tiers
★★★ | Format Bias | High | Prefer JSON; mind serialization order
★★ | Inference-Time Config | Very High | Low temperature for deterministic tasks; high effort for complex reasoning
★★ | Model-Specific Behavior | Low | Match model to task
★★ | Output Degradation | Low | Multi-step short outputs; post-hoc verification
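The "token budget management" action for the Absolute Length factor can be sketched as a greedy priority fill. This is an illustrative heuristic, not a documented tool: token cost is approximated by word count here, where a real system would use the model's tokenizer.

```python
def fit_budget(sections: list[tuple[int, str]], budget: int) -> list[str]:
    """Greedy token-budget manager. Each section is (priority, text);
    keep the highest-priority sections that fit within the budget.
    Word count stands in for a real tokenizer's token count."""
    kept, used = [], 0
    for priority, text in sorted(sections, key=lambda s: -s[0]):
        cost = len(text.split())
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return kept
```

Dropping low-priority sections whole, rather than truncating everything, keeps the surviving context coherent.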

Part IV

The Evolution of Input
From handcrafted prompts to the unified Input operating system

2022–2023
Handcrafted Prompts
Humans write a paragraph → Input wording determines output quality. ChatGPT launches.
2023
CoT / Few-shot
Reasoning examples added to input → Input structure guides reasoning process. Google CoT paper.
2023–2024
RAG
Dynamic knowledge retrieval injected → Input timeliness and relevance determine accuracy.
2024
System Prompts
Persistent developer instructions → Input layering affects behavioral consistency.
2024–2025
Context Engineering
Systematic “working memory” management → Karpathy defines Context Engineering.
2024–2025
Agent + Tool Use
Programmatic multi-turn input generation → Input becomes self-producing and iterative.
2025
Skill / SOP / MCP
Structured knowledge packages + standardized external data access → Modular, on-demand Input loading.
2025–2026
Agent Teams
Multiple Agents generating input for each other → Division of labor in Input production.
2026
Unified Input OS
Complete Input production system → Consolidated into automated framework. GPT-5.4 tool search saves 47% tokens.

Part V

Agent Architecture: Solution or Tradeoff?
Agents are not a free lunch — they mitigate some factors while exacerbating others

Problems Agents Mitigate

✓ Absolute length: splits long tasks into short steps
✓ Output degradation: only short output per step
✓ Positional effects: shorter context per step, smaller “middle blind spot”

New Problems Agents Introduce

✗ Total token consumption amplified (830K+ tokens/call documented)
✗ Intermediate reasoning steps become new semantic distractors
✗ Context fragmentation (compaction loses details)
✗ Multi-Agent synchronization issues
✗ Error propagation (hallucinations injected as “facts” into subsequent steps)

Decision Framework

Agent architecture yields net benefits when: (a) the task naturally decomposes into independent sub-steps, (b) each step has a clear success/failure signal (e.g., code compiles), (c) intermediate results can be verified. It yields net costs when: (a) the task requires global coherence (e.g., long-document writing), (b) intermediate steps lack verification mechanisms, (c) system prompt re-transmission overhead is unacceptable in cost-sensitive scenarios.

Part VI

Applicability Boundaries: When Do These Factors Not Matter?
Four scenarios where over-optimization wastes engineering resources

Scenario 1: Model Capability Far Exceeds Task Difficulty

When using GPT-5.4 for simple translation or template filling, the model will almost always succeed regardless of noisy, poorly structured, poorly positioned input. The ten factors primarily matter for tasks “near the boundary of model capability.”

Scenario 2: Output Has External Verification

Code has compilers and unit tests; data extraction has schema validation. When incorrect output can be automatically detected and retried, single-pass accuracy requirements drop, and the marginal value of input optimization decreases accordingly.

Scenario 3: The Task Itself Is Open-Ended

Creative writing, brainstorming, and style exploration have no single “correct answer.” The metric of “accuracy” itself does not apply, weakening the influence of most factors. However, instruction hierarchy and logical structure remain important — they affect “whether user intent is followed” rather than “whether facts are correct.”

Scenario 4: Interactive Iterative Workflows

When user and model iterate rapidly (e.g., editing in Cursor), each interaction handles only a small incremental change. Input absolute length and positional effects are naturally compressed. Response latency may matter more than output quality.

Summary

The effect size of the ten factors is not a fixed constant but a function of task difficulty × verification mechanisms × iteration speed. In scenarios with simple tasks, ample verification, and rapid iteration, heavy input optimization is over-engineering. In scenarios with complex tasks, no verification, and one-shot output, input optimization is the deciding factor between success and failure.

Part VII

Key Conclusions

Conclusion 1

The model is the CPU; the input is the program. The CPU keeps upgrading (though more slowly), while our ability to write programs is evolving at breakneck speed.

Conclusion 2

Signal-to-noise ratio and absolute length are the two factors most worth prioritizing among the ten. Rationale: (a) both report relatively large effect sizes in their respective experimental conditions, (b) EMNLP 2025 demonstrates they stack independently, (c) both are the most prominent pain points in current Agent architectures. However, on simple QA tasks, large models display extreme resilience to both factors (↓<2%), at which point other factors may warrant more attention.

Conclusion 3

Different products represent different Input engineering philosophies. There is no “optimal,” only “best fit for the scenario.” Independent research comparing input efficiency of all products on the same standardized task does not yet exist.

Conclusion 4

“Less input” does not equal “better input.” The optimal point is “just enough highly relevant input.” Multiple counter-studies show that in information-scarce scenarios, adding input (few-shot examples, more retrieved documents) substantially boosts performance.

Conclusion 5

Context engineering is evolving from a subset of prompt engineering into an independent engineering discipline. Yet pre-training advances (the GPT-4→GPT-5 series leap) remain a critical source of capability improvement and should not be dismissed in the emphasis on input engineering.

Part VIII

Known Gaps: Factors Not Covered by This Atlas

Gap Area Why It Matters Current Research Status
Cross-modal input interference Agent tool returns may include screenshots, tables, PDFs, and other non-text content No controlled experiments on “how non-text portions in mixed input interfere with text reasoning”
Temporal decay of dialogue history Whether the influence of turn N−k on turn N follows a predictable decay curve in multi-turn dialogues Recency-effect research exists, but no precise “turn-number → influence” decay function established
Concurrency / batching effects In Agent Team scenarios, multiple agents’ requests may arrive at the same model in parallel Inference-engine engineering docs exist; systematic study from output-quality perspective is lacking
Cross-session context transfer loss Quantifying information loss during compaction or memory-based cross-session transfer Engineering-practice discussions exist; systematic information-theoretic analysis is lacking
Training data / input format alignment Models perform better on formats most seen during pre-training SoEval and similar benchmarks provide initial data, but coverage remains limited

Appendix

Key Literature Index

Peer-Reviewed Publications P

1. Lost in the Middle — Liu et al., TACL 2024
2. The Instruction Hierarchy — Wallace et al., NeurIPS 2024
3. Will We Run Out of Data? — Villalobos et al., Epoch AI 2024
4. Order Matters — Zeng et al., ACL Findings 2025
5. Context Length Alone Hurts — EMNLP Findings 2025
6. SecAlign — CCS 2025
7. VerIH — ICLR 2025
8. LV-Eval — ICLR 2025
9. LongGenBench — ICLR 2025
10. LoCoBench — KDD 2025
11. Shi et al. — ICML 2023
12. SoEval — Information Processing & Management, 2024

arXiv Preprints Pre

13. Serial Position Effects — arXiv:2406.15981, 2024
14. Long Context Less Focus — arXiv:2602.15028, 2026
15. Context Discipline — arXiv:2601.11564, 2025
16. Fragile Preferences — arXiv:2506.14092, 2025
17. Position is Power — arXiv:2505.21091, 2025
18. Test-Time Training for Long-Context LLMs — arXiv:2512.13898, 2025
19. Input Matters — arXiv:2510.21034, 2025
20. StructEval — arXiv:2505.20139, 2025

Industry Reports Ind

21. Context Rot — Chroma Research, 2025.7
22. SWiM — Snorkel AI, 2024
23. OpenClaw System Prompt Investigation — GitHub Issue #21999, 2026.2
24. State of LLMs 2025 — Sebastian Raschka, 2025.12
25. GPT-5.4 Launch — OpenAI Blog, 2026.3.5
26. Cursor vs Claude Code — Builder.io, 2026.2
27. Khan Academy PyData — Boris Lau, PyData Global 2025

The Ten Input Factors That Determine LLM Output — A Deep Analysis

LLM Input Factor Atlas V4 · March 2026

“The model is the CPU; the input is the program. What truly changes the world is not a bigger CPU, but a better program.”
