⚠ READING GUIDE
The quantitative data collected in this document comes from different research teams, experimental conditions, models, and tasks. Numbers across different factors cannot be directly compared to judge “which factor has a larger effect.” For example, the 14.1% drop on GSM8K in Factor 1 (math reasoning + irrelevant context) and the 30% drop in Factor 3 (20-document retrieval + middle position) measure entirely different things. This document’s value lies in providing evidence that “each factor does have an effect,” not in supplying precise cross-factor weightings. Partial overlaps exist among the ten factors and are discussed at the end.
Core Thesis
Since GPT-3.5’s breakthrough in late 2022, virtually all substantive user-facing progress in AI large models has occurred on the Input side — from prompting techniques to context engineering, to Agent architectures and Agent Teams. Pre-training has not “hit its ceiling,” but its marginal returns are clearly diminishing: the effective stock of high-quality public-domain text data is approximately 300 trillion tokens, projected to be exhausted between 2026 and 2032 (Epoch AI, 2024). Meanwhile, a dollar spent on system prompt optimization, CoT, Tool Use, or Agent architecture already yields a significantly higher return than the same dollar spent on further scaling pre-training.
However, it must be emphasized: pre-training still defines the model’s “capability ceiling,” while Input engineering determines what percentage of that ceiling is realized on any given task. The two are not substitutes but rather a “ceiling” and “actual height” relationship.
The Ten Input Factors
Each factor presented with balanced supporting and challenging evidence
Cross-Factor Interactions
Evidence-backed interactions and known research gaps
| Interaction | Source | Explanation |
|---|---|---|
| SNR × Absolute Length | EMNLP 2025 (P) | After masking all irrelevant tokens, length still causes degradation — both stack independently |
| Position × Semantic Distractors | Snorkel AI SWiM (Ind) | Same-domain distractor documents cause maximum damage at 25% middle depth |
| Absolute Length × Reasoning | arXiv:2512.13898 (Pre) | CoT effectiveness diminishes in long contexts — length weakens self-correction ability |
| Logical Structure × Position | arXiv:2506.00069 (Pre) | Repeating task instructions at the end recovers performance — structure partially counteracts positional effects |
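The mitigation in the last row is straightforward to apply in practice. Below is a minimal sketch — the function and formatting are our own illustration, not code from the cited paper — of assembling a long-context prompt with the task instructions restated at the end:

```python
def build_prompt(instructions: str, documents: list[str]) -> str:
    """Assemble a retrieval prompt, restating the task instructions at
    the end so they sit in the high-attention region of the context."""
    parts = [instructions]
    for i, doc in enumerate(documents, 1):
        parts.append(f"[Document {i}]\n{doc}")
    # Repeating instructions after the evidence partially counteracts
    # "lost in the middle" positional degradation.
    parts.append("Reminder of the task:\n" + instructions)
    return "\n\n".join(parts)
```
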
The following remain open research gaps: semantic distractors × instruction conflict; model-specific behavior × output degradation; the precise mathematical relationship between SNR and logical structure; format bias × positional effects; and inference-time configuration × any of the above.
Optimization Priority Matrix
Qualitative ranking based on effect size and controllability
The Evolution of Input
From handcrafted prompts to the unified Input operating system
Agent Architecture: Solution or Tradeoff?
Agents are not a free lunch — they mitigate some factors while exacerbating others
✓ Absolute length: splits long tasks into short steps
✓ Output degradation: only short output per step
✓ Positional effects: shorter context per step, smaller “middle blind spot”
✗ Total token consumption amplified (830K+ tokens/call documented)
✗ Intermediate reasoning steps become new semantic distractors
✗ Context fragmentation (compaction loses details)
✗ Multi-Agent synchronization issues
✗ Error propagation (hallucinations injected as “facts” into subsequent steps)
Agent architecture yields net benefits when: (a) the task naturally decomposes into independent sub-steps, (b) each step has a clear success/failure signal (e.g., code compiles), (c) intermediate results can be verified. It yields net costs when: (a) the task requires global coherence (e.g., long-document writing), (b) intermediate steps lack verification mechanisms, (c) system prompt re-transmission overhead is unacceptable in cost-sensitive scenarios.
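Condition (b) — a clear success/failure signal per step — can be enforced mechanically. A minimal sketch, assuming `generate` is any model call and `verify` any automatic checker such as a compiler or test runner (both names are illustrative placeholders):

```python
from typing import Callable, Optional

def run_step(generate: Callable[[str], str],
             verify: Callable[[str], bool],
             prompt: str,
             max_attempts: int = 3) -> Optional[str]:
    """Run one agent step, accepting output only if it passes verification.

    Unverified output is discarded rather than passed downstream, which
    blocks the error-propagation failure mode noted above."""
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if verify(candidate):
            return candidate
    return None  # caller must handle the unverifiable case explicitly
```
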
Applicability Boundaries: When Do These Factors Not Matter?
Four scenarios where over-optimization wastes engineering resources
1. Tasks far below the capability ceiling. When using GPT-5.4 for simple translation or template filling, the model almost always succeeds even with noisy, poorly structured, or poorly positioned input. The ten factors primarily matter for tasks “near the boundary of model capability.”
2. Strong external verification. Code has compilers and unit tests; data extraction has schema validation. When incorrect output can be automatically detected and retried, single-pass accuracy requirements drop, and the marginal value of input optimization decreases accordingly.
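A validate-and-retry loop of this kind takes only a few lines. A sketch for the data-extraction case, where `call_model` stands in for any LLM client and the required keys are an illustrative schema:

```python
import json
from typing import Optional

REQUIRED_KEYS = {"name", "amount"}  # illustrative schema

def extract_with_retry(call_model, text: str, max_attempts: int = 3) -> Optional[dict]:
    """Accept only output that parses as JSON and satisfies the schema;
    otherwise retry. Automatic detection lowers the single-pass
    accuracy that the input itself must guarantee."""
    for _ in range(max_attempts):
        raw = call_model(text)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # detectable failure: retry instead of failing silently
        if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
            return data
    return None
```
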
3. Open-ended creative tasks. Creative writing, brainstorming, and style exploration have no single “correct answer.” The metric of “accuracy” itself does not apply, weakening the influence of most factors. However, instruction hierarchy and logical structure remain important — they affect “whether user intent is followed” rather than “whether facts are correct.”
4. Rapid iterative loops. When user and model iterate rapidly (e.g., editing in Cursor), each interaction handles only a small incremental change. Input absolute length and positional effects are naturally compressed, and response latency may matter more than output quality.
The effect size of the ten factors is not a fixed constant but a function of task difficulty × verification mechanisms × iteration speed. In scenarios with simple tasks, ample verification, and rapid iteration, heavy input optimization is over-engineering. In scenarios with complex tasks, no verification, and one-shot output, input optimization is the deciding factor between success and failure.
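That product can be made concrete as a back-of-the-envelope triage score. This is purely our own illustrative heuristic, not a fitted model from any cited study: scores near 1 suggest investing in input optimization, scores near 0 suggest it is over-engineering.

```python
def input_optimization_priority(task_difficulty: float,
                                verification_strength: float,
                                iteration_speed: float) -> float:
    """Toy triage score in [0, 1]; all arguments normalized to [0, 1].

    High for hard tasks with weak verification and one-shot delivery;
    low for easy, well-verified, rapidly iterated tasks."""
    return task_difficulty * (1 - verification_strength) * (1 - iteration_speed)
```

For example, a one-shot report with no checker scores near 1, while autocompleted code inside a test-covered repository scores near 0.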
Key Conclusions
The model is the CPU; the input is the program. The CPU keeps upgrading (though more slowly), while our ability to write programs is evolving at breakneck speed.
Signal-to-noise ratio and absolute length are the two factors most worth prioritizing among the ten. Rationale: (a) both report relatively large effect sizes in their respective experimental conditions, (b) EMNLP 2025 demonstrates they stack independently, (c) both are the most prominent pain points in current Agent architectures. However, on simple QA tasks, large models display extreme resilience to both factors (↓<2%), at which point other factors may warrant more attention.
Different products represent different Input engineering philosophies. There is no “optimal,” only “best fit for the scenario.” Independent research comparing input efficiency of all products on the same standardized task does not yet exist.
“Less input” does not equal “better input.” The optimal point is “just enough highly relevant input.” Multiple counter-studies show that in information-scarce scenarios, adding input (few-shot examples, more retrieved documents) substantially boosts performance.
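Operationally, “just enough highly relevant input” means filtering by a relevance threshold and capping count, rather than minimizing or maximizing length. A minimal sketch, where `relevance` stands in for any scorer (e.g., embedding similarity); the names and thresholds are illustrative:

```python
from typing import Callable, List

def select_context(chunks: List[str],
                   relevance: Callable[[str], float],
                   min_score: float = 0.5,
                   max_chunks: int = 8) -> List[str]:
    """Keep the highest-scoring chunks above a relevance floor, capped
    in number: enough signal to answer, without padding in noise."""
    scored = sorted(chunks, key=relevance, reverse=True)
    return [c for c in scored[:max_chunks] if relevance(c) >= min_score]
```
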
Context engineering is evolving from a subset of prompt engineering into an independent engineering discipline. Yet pre-training advances (the GPT-4→GPT-5 series leap) remain a critical source of capability improvement and should not be dismissed in the emphasis on input engineering.
Known Gaps: Factors Not Covered by This Atlas
| Gap Area | Why It Matters | Current Research Status |
|---|---|---|
| Cross-modal input interference | Agent tool returns may include screenshots, tables, PDFs, and other non-text content | No controlled experiments on “how non-text portions in mixed input interfere with text reasoning” |
| Temporal decay of dialogue history | Whether the influence of turn N−k on turn N follows a predictable decay curve in multi-turn dialogues | Recency-effect research exists, but no precise “turn-number → influence” decay function established |
| Concurrency / batching effects | In Agent Team scenarios, multiple agents’ requests may arrive at the same model in parallel | Inference-engine engineering docs exist; systematic study from output-quality perspective is lacking |
| Cross-session context transfer loss | Quantifying information loss during compaction or memory-based cross-session transfer | Engineering-practice discussions exist; systematic information-theoretic analysis is lacking |
| Training data / input format alignment | Models perform better on formats most seen during pre-training | SoEval and similar benchmarks provide initial data, but coverage remains limited |
Key Literature Index
Peer-Reviewed Publications (P)
1. Lost in the Middle — Liu et al., TACL 2024
2. The Instruction Hierarchy — Wallace et al., NeurIPS 2024
3. Will We Run Out of Data? — Villalobos et al., Epoch AI 2024
4. Order Matters — Zeng et al., ACL Findings 2025
5. Context Length Alone Hurts — EMNLP Findings 2025
6. SecAlign — CCS 2025
7. VerIH — ICLR 2025
8. LV-Eval — ICLR 2025
9. LongGenBench — ICLR 2025
10. LoCoBench — KDD 2025
11. Shi et al. — ICML 2023
12. SoEval — Information Processing & Management, 2024
arXiv Preprints (Pre)
13. Serial Position Effects — arXiv:2406.15981, 2024
14. Long Context Less Focus — arXiv:2602.15028, 2026
15. Context Discipline — arXiv:2601.11564, 2025
16. Fragile Preferences — arXiv:2506.14092, 2025
17. Position is Power — arXiv:2505.21091, 2025
18. Test-Time Training for Long-Context LLMs — arXiv:2512.13898, 2025
19. Input Matters — arXiv:2510.21034, 2025
20. StructEval — arXiv:2505.20139, 2025
Industry Reports (Ind)
21. Context Rot — Chroma Research, 2025.7
22. SWiM — Snorkel AI, 2024
23. OpenClaw System Prompt Investigation — GitHub Issue #21999, 2026.2
24. State of LLMs 2025 — Sebastian Raschka, 2025.12
25. GPT-5.4 Launch — OpenAI Blog, 2026.3.5
26. Cursor vs Claude Code — Builder.io, 2026.2
27. Khan Academy PyData — Boris Lau, PyData Global 2025