LLM INPUT FACTOR ATLAS · V4

The Ten Input Factors
That Determine LLM Output

A systematic synthesis of academic research and industry practice from 2023–2026. The model is the CPU; the input is the program — the CPU keeps upgrading (though more slowly), while our ability to write programs is evolving at breakneck speed.

12 Peer-Reviewed Papers [P]
8 arXiv Preprints [Pre]
7 Industry Reports [Ind]

⚠ READING GUIDE

The quantitative data collected in this document comes from different research teams, experimental conditions, models, and tasks. Numbers across different factors cannot be directly compared to judge “which factor has a larger effect.” For example, the 14.1% drop on GSM8K in Factor 1 (math reasoning + irrelevant context) and the 30% drop in Factor 3 (20-document retrieval + middle position) measure entirely different things. This document’s value lies in providing evidence that “each factor does have an effect,” not in supplying precise cross-factor weightings. Partial overlaps exist among the ten factors and are discussed at the end.

Core Thesis

Since GPT-3.5’s breakthrough in 2023, virtually all substantive user-facing progress in AI large models has occurred on the Input side — from prompting techniques to context engineering, to Agent architectures and Agent Teams. Pre-training has not “hit its ceiling,” but its marginal returns are clearly diminishing: the effective stock of high-quality public-domain text data is approximately 300 trillion tokens, projected to be exhausted between 2026–2032 (Epoch AI, 2024). Meanwhile, the return on investment from spending the same dollar on system prompt optimization, CoT, Tool Use, or Agent architecture already significantly exceeds that of further scaling pre-training.

However, it must be emphasized: pre-training still defines the model’s “capability ceiling,” while Input engineering determines what percentage of that ceiling is realized on any given task. The two are not substitutes but rather a “ceiling” and “actual height” relationship.

Part I

The Ten Input Factors
Each factor presented with balanced supporting and challenging evidence

FACTOR 01
Signal-to-Noise Ratio
The proportion of tokens in the input that are directly relevant to the current task relative to the total token count.

Supporting Evidence

Source Finding Effect Size Conditions
Shi et al., 2023 P Irrelevant information causes performance degradation GSM8K: 78.7%→64.6% (↓14.1%) Math reasoning; irrelevant paragraphs injected
Chroma Context Rot, 2025 Ind 18 models degrade as input grows 11/12 models drop below 50% baseline at 32K Non-lexical-match retrieval task
LLMLingua-2 P Input compressed 2–5× Limited quality loss; accuracy improved in some scenarios Multiple NLP benchmarks

Challenging Evidence

Source Finding Effect Size Conditions
Few-shot learning surveys Adding input (examples) substantially boosts performance zero→few-shot: 30%+ improvement Multiple NLP tasks
Context Discipline, 2025 Pre Large models “surprisingly resilient” to noisy context 70B accuracy drops only 98.5%→98% Simple factual QA; 15K-word noise
Core Insight

The optimal point is not “minimum input” but rather “just enough highly relevant input.” The optimization direction for signal-to-noise ratio is to increase “signal,” not merely compress “noise.”
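The "increase signal, don't merely compress noise" idea can be sketched as a simple relevance filter. This is a toy illustration, not any paper's method: the keyword-overlap score and the `min_score` threshold are assumptions standing in for a real embedding model or reranker.

```python
def relevance(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words present in the chunk.
    A production system would use embeddings or a trained reranker."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def build_context(query: str, chunks: list[str], min_score: float = 0.3) -> str:
    """Keep only chunks above a relevance threshold, preserving order.
    The goal is to raise the signal share, not just shrink total length."""
    kept = [c for c in chunks if relevance(query, c) >= min_score]
    return "\n\n".join(kept)
```

Note the design choice: the filter drops irrelevant chunks rather than truncating everything uniformly, which is exactly the distinction the insight above draws.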

FACTOR 02
Logical Structure
The hierarchical relationships, classification schemes, and organizational forms among different parts of the input.
Source Finding Conditions
Zeng et al., ACL 2025 P Ordering constraints from hard to easy yields better performance Multi-constraint instruction following
Khan Academy / PyData 2025 Ind Dictionary key ordering affects output quality Education AI assistant; production environment
Input Matters Pre Input structure (JSON/table/natural language) affects summary accuracy NBA game summaries; p<0.05
Core Insight

The value of structure depends on task type. For instruction-following tasks, good structure helps enormously; for open-ended retrieval, excessive structure may actually interfere.
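The hard-to-easy ordering finding can be applied mechanically once constraints carry a difficulty estimate. A minimal sketch, assuming difficulty scores come from your own annotation or a heuristic (they are not part of the cited paper's pipeline):

```python
def order_constraints(constraints: list[tuple[float, str]]) -> list[str]:
    """Sort multi-constraint instructions from hard to easy before
    inserting them into the prompt. Each item is (difficulty, text);
    higher difficulty comes first."""
    return [text for _, text in sorted(constraints, key=lambda x: -x[0])]
```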

FACTOR 03
Positional Effects
How the same piece of information placed at different positions within the input produces different effects on model behavior.
Source Finding Effect Size
Liu et al., TACL 2024 P U-shaped attention curve Middle-position performance drops >30%
Snorkel AI SWiM Ind Worst performance at 25% document depth Validated across 8 long-context models
Fragile Preferences, 2025 Pre Positional bias persists in Claude 4 Sonnet Consistent across temperature settings
arXiv:2506.00069 Pre Repeating task instructions at the end recovers performance Recovery to near short-context baseline
Core Insight

The severity of positional effects is highly dependent on task complexity. In simple factual retrieval it may be negligible, but in complex reasoning and multi-hop QA the impact is enormous. One-size-fits-all conclusions are wrong.
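The mitigation reported in arXiv:2506.00069 (repeating the task instructions at the end) is easy to bake into prompt assembly. A minimal sketch; the `<doc>` tag wrapper is an illustrative convention, not a requirement of the cited work:

```python
def assemble_prompt(instructions: str, documents: list[str]) -> str:
    """Place instructions at the start and repeat them at the end,
    leaving long documents in the middle where attention is weakest
    (the U-shaped curve from Liu et al., TACL 2024)."""
    middle = "\n\n".join(
        f"<doc id={i}>\n{d}\n</doc>" for i, d in enumerate(documents)
    )
    return (
        f"{instructions}\n\n{middle}\n\n"
        f"Reminder of the task:\n{instructions}"
    )
```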

FACTOR 04
Absolute Length Tax
Even when the signal-to-noise ratio is 100%, the absolute token length of the input itself degrades performance.
Source Finding Effect Size
EMNLP 2025 P Performance degrades even after perfect retrieval + masking irrelevant tokens 13.9%–85% (varies by task)
Chroma Context Rot Ind GPT-4o degrades 99.3%→69.7% (↓30% at 32K)
LoCoBench, Salesforce Pre All models degrade significantly at 1M tokens Cross-file reasoning in software engineering
Core Insight

“Long context = bad” is a false simplification. “Long context carries a hidden tax, and the tax rate varies by scenario” is more accurate. On simple QA, large models are almost unaffected (<2%), but on complex reasoning tasks all models degrade significantly.

FACTOR 05
Semantic Distractor Interference
Topically similar but factually irrelevant content that produces interference beyond simple noise.
Source Finding
Chroma Context Rot Ind Semantic distractors cause degradation exceeding that from length alone; highest frequency in hallucinated responses
LV-Eval P Substantial performance degradation under confounding facts

FACTOR 06
Instruction Hierarchy Conflict
Priority conflicts between instructions at different levels — system prompts, user messages, tool returns, etc.
Source Finding Effect Size
Wallace et al., NeurIPS 2024 P Training LLMs to selectively ignore lower-priority instructions Robustness improved substantially, including unseen attack types
VerIH, ICLR 2025 P Instruction hierarchy parsing treated as a reasoning task Instruction following +20%; attack success rate −20%
SecAlign, CCS 2025 P Preference optimization to defend against prompt injection Attack success rate reduced to <10%
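On the application side, one common mitigation (distinct from the training-time defenses above) is to make the trust tiers explicit in code and to fence untrusted tool output in delimiters so the model can be instructed to treat it as data only. A sketch under those assumptions; the tag names are illustrative, and delimiting is a mitigation, not a guarantee:

```python
from enum import IntEnum

class Trust(IntEnum):
    """Explicit source trust tiers: higher value = higher priority."""
    SYSTEM = 3   # developer system prompt
    USER = 2     # end-user message
    TOOL = 1     # tool/retrieval output: data, never instructions

def wrap_untrusted(text: str) -> str:
    """Fence tool output in explicit delimiters so the system prompt
    can declare everything inside them non-executable data."""
    return f"<tool_output trust=data-only>\n{text}\n</tool_output>"
```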

FACTOR 07
Model-Specific Behavior
Different models exhibit fundamentally different failure modes under the same input.
Model Long-Context Failure Mode
Claude series Conservative abstention (refuses to answer)
GPT series Higher hallucination rate under distractors
Qwen2.5-14B 1K→32K: 43.87→20.53 (↓53%)
Llama-3.1-70B Extremely resilient on simple QA (98.5%→98%)

FACTOR 08
Output Self-Degradation
When generating long outputs, previously generated text becomes part of the input for subsequent generation, causing progressive quality decline in later segments.
Source Finding
LongGenBench, ICLR 2025 P All models exhibit declining curves in long outputs
Ref-Long Pre Human citation attribution >90% ExAcc; best LLM <30%

FACTOR 09
Format Bias
Models’ ability to process different data formats (JSON, YAML, Markdown, etc.) varies due to training data distribution, independent of logical structure.
Source Finding
SoEval benchmark P Model format compliance significantly higher for JSON than YAML — due to JSON being more prevalent in training data
StructEval Pre Evaluation across 13 structured output types reveals significant differences in error patterns by format
Practical Implication

When designing system prompts and tool schemas, prefer JSON format (most abundant in training data). Avoid YAML for critical outputs. Note that serialization library field ordering may affect output quality.
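The serialization-order caveat can be handled by pinning field order explicitly rather than trusting whatever the library or dict emits. A minimal sketch using the standard `json` module:

```python
import json

def stable_json(record: dict, field_order: list[str]) -> str:
    """Serialize with an explicit, fixed field order so the prompt
    (and any few-shot examples in it) always show the same layout."""
    ordered = {k: record[k] for k in field_order if k in record}
    return json.dumps(ordered, ensure_ascii=False)
```

Keeping the field order identical between few-shot examples and the live request removes one source of run-to-run variance.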

FACTOR 10
Inference-Time Configuration
Temperature, top-p, reasoning effort, and other inference-time parameters are strictly part of the “input” but are rarely included in prompt engineering discussions.
Source Finding
Fragile Preferences Pre Positional bias remains robust across temperature variations
GPT-5.4 model docs Ind Supports low/medium/high/xhigh reasoning effort levels, directly affecting output quality
Practical Implication

Inference-time configuration is often the lowest-cost optimization lever — zero token cost, yet it can significantly affect output determinism and consistency.
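One way to operationalize this lever is a per-task-type preset table. The parameter names below mirror common chat-completion APIs (`temperature`, `top_p`, `reasoning_effort`) but are assumptions; check your provider's documentation before relying on them:

```python
# Hypothetical inference-parameter presets keyed by task type.
PRESETS = {
    "deterministic_extraction": {"temperature": 0.0, "top_p": 1.0},
    "complex_reasoning": {"temperature": 0.2, "reasoning_effort": "high"},
    "creative_writing": {"temperature": 0.9, "top_p": 0.95},
}

def inference_params(task_type: str) -> dict:
    """Zero-token-cost lever: choose sampling settings by task type,
    with a conservative default for unknown tasks."""
    return PRESETS.get(task_type, {"temperature": 0.3})
```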

Part II

Cross-Factor Interactions
Evidence-backed interactions and known research gaps

Interaction Source Explanation
SNR × Absolute Length EMNLP 2025 P After masking all irrelevant tokens, length still causes degradation — both stack independently
Position × Semantic Distractors Snorkel AI SWiM Ind Same-domain distractor documents cause maximum damage at 25% middle depth
Absolute Length × Reasoning arXiv:2512.13898 Pre CoT effectiveness diminishes in long contexts — length weakens self-correction ability
Logical Structure × Position arXiv:2506.00069 Pre Repeating task instructions at the end recovers performance — structure partially counteracts positional effects
Interactions Lacking Sufficient Experimental Evidence

Semantic distractors × instruction conflict / Model-specific behavior × output degradation / Precise mathematical relationship between SNR × logical structure / Format bias × positional effects / Inference-time config × any of the above — all remain current research gaps.

Part III

Optimization Priority Matrix
Qualitative ranking based on effect size and controllability

Priority | Factor | Control | Recommended Action
★★★★★ | Signal-to-Noise Ratio | High | On-demand loading, dynamic trimming, semantic compression
★★★★★ | Absolute Length | Med-High | Agent-based step decomposition, compaction, token budget management
★★★★ | Positional Arrangement | High | Place critical instructions at start/end; repeat constraints at end
★★★★ | Semantic Distractors | Medium | Agent Team context isolation; clean stale tool results
★★★ | Logical Structure | High | Order constraints hard-to-easy; use XML/tag boundary markers
★★★ | Instruction Hierarchy | Med-High | NEVER/ALWAYS hard constraints; explicit source trust tiers
★★★ | Format Bias | High | Prefer JSON; mind serialization order
★★ | Inference-Time Config | Very High | Low temperature for deterministic tasks; high effort for complex reasoning
★★ | Model-Specific Behavior | Low | Match model to task
★★ | Output Degradation | Low | Multi-step short outputs; post-hoc verification
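The "token budget management" action for the Absolute Length factor can be sketched as a greedy priority fill. This is an illustrative heuristic, not a documented tool: token cost is approximated by word count here, where a real system would use the model's tokenizer.

```python
def fit_budget(sections: list[tuple[int, str]], budget: int) -> list[str]:
    """Greedy token-budget manager. Each section is (priority, text);
    keep the highest-priority sections that fit within the budget.
    Word count stands in for a real tokenizer's token count."""
    kept, used = [], 0
    for priority, text in sorted(sections, key=lambda s: -s[0]):
        cost = len(text.split())
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return kept
```

Dropping low-priority sections whole, rather than truncating everything, keeps the surviving context coherent.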

Part IV

The Evolution of Input
From handcrafted prompts to the unified Input operating system

2022–2023
Handcrafted Prompts
Humans write a paragraph → Input wording determines output quality. ChatGPT launches.
2023
CoT / Few-shot
Reasoning examples added to input → Input structure guides reasoning process. Google CoT paper.
2023–2024
RAG
Dynamic knowledge retrieval injected → Input timeliness and relevance determine accuracy.
2024
System Prompts
Persistent developer instructions → Input layering affects behavioral consistency.
2024–2025
Context Engineering
Systematic “working memory” management → Karpathy defines Context Engineering.
2024–2025
Agent + Tool Use
Programmatic multi-turn input generation → Input becomes self-producing and iterative.
2025
Skill / SOP / MCP
Structured knowledge packages + standardized external data access → Modular, on-demand Input loading.
2025–2026
Agent Teams
Multiple Agents generating input for each other → Division of labor in Input production.
2026
Unified Input OS
Complete Input production system → Consolidated into automated framework. GPT-5.4 tool search saves 47% tokens.

Part V

Agent Architecture: Solution or Tradeoff?
Agents are not a free lunch — they mitigate some factors while exacerbating others

Problems Agents Mitigate

✓ Absolute length: splits long tasks into short steps
✓ Output degradation: only short output per step
✓ Positional effects: shorter context per step, smaller “middle blind spot”

New Problems Agents Introduce

✗ Total token consumption amplified (830K+ tokens/call documented)
✗ Intermediate reasoning steps become new semantic distractors
✗ Context fragmentation (compaction loses details)
✗ Multi-Agent synchronization issues
✗ Error propagation (hallucinations injected as “facts” into subsequent steps)

Decision Framework

Agent architecture yields net benefits when: (a) the task naturally decomposes into independent sub-steps, (b) each step has a clear success/failure signal (e.g., code compiles), (c) intermediate results can be verified. It yields net costs when: (a) the task requires global coherence (e.g., long-document writing), (b) intermediate steps lack verification mechanisms, (c) system prompt re-transmission overhead is unacceptable in cost-sensitive scenarios.

Part VI

Applicability Boundaries: When Do These Factors Not Matter?
Four scenarios where over-optimization wastes engineering resources

Scenario 1: Model Capability Far Exceeds Task Difficulty

When using GPT-5.4 for simple translation or template filling, the model will almost always succeed regardless of noisy, poorly structured, poorly positioned input. The ten factors primarily matter for tasks “near the boundary of model capability.”

Scenario 2: Output Has External Verification

Code has compilers and unit tests; data extraction has schema validation. When incorrect output can be automatically detected and retried, single-pass accuracy requirements drop, and the marginal value of input optimization decreases accordingly.

Scenario 3: The Task Itself Is Open-Ended

Creative writing, brainstorming, and style exploration have no single “correct answer.” The metric of “accuracy” itself does not apply, weakening the influence of most factors. However, instruction hierarchy and logical structure remain important — they affect “whether user intent is followed” rather than “whether facts are correct.”

Scenario 4: Interactive Iterative Workflows

When user and model iterate rapidly (e.g., editing in Cursor), each interaction handles only a small incremental change. Input absolute length and positional effects are naturally compressed. Response latency may matter more than output quality.

Summary

The effect size of the ten factors is not a fixed constant but a function of task difficulty × verification mechanisms × iteration speed. In scenarios with simple tasks, ample verification, and rapid iteration, heavy input optimization is over-engineering. In scenarios with complex tasks, no verification, and one-shot output, input optimization is the deciding factor between success and failure.

Part VII

Key Conclusions

Conclusion 1

The model is the CPU; the input is the program. The CPU keeps upgrading (though more slowly), while our ability to write programs is evolving at breakneck speed.

Conclusion 2

Signal-to-noise ratio and absolute length are the two factors most worth prioritizing among the ten. Rationale: (a) both report relatively large effect sizes in their respective experimental conditions, (b) EMNLP 2025 demonstrates they stack independently, (c) both are the most prominent pain points in current Agent architectures. However, on simple QA tasks, large models display extreme resilience to both factors (↓<2%), at which point other factors may warrant more attention.

Conclusion 3

Different products represent different Input engineering philosophies. There is no “optimal,” only “best fit for the scenario.” Independent research comparing input efficiency of all products on the same standardized task does not yet exist.

Conclusion 4

“Less input” does not equal “better input.” The optimal point is “just enough highly relevant input.” Multiple counter-studies show that in information-scarce scenarios, adding input (few-shot examples, more retrieved documents) substantially boosts performance.

Conclusion 5

Context engineering is evolving from a subset of prompt engineering into an independent engineering discipline. Yet pre-training advances (the GPT-4→GPT-5 series leap) remain a critical source of capability improvement and should not be dismissed in the emphasis on input engineering.

Part VIII

Known Gaps: Factors Not Covered by This Atlas

Gap Area Why It Matters Current Research Status
Cross-modal input interference Agent tool returns may include screenshots, tables, PDFs, and other non-text content No controlled experiments on “how non-text portions in mixed input interfere with text reasoning”
Temporal decay of dialogue history Whether the influence of turn N−k on turn N follows a predictable decay curve in multi-turn dialogues Recency-effect research exists, but no precise “turn-number → influence” decay function established
Concurrency / batching effects In Agent Team scenarios, multiple agents’ requests may arrive at the same model in parallel Inference-engine engineering docs exist; systematic study from output-quality perspective is lacking
Cross-session context transfer loss Quantifying information loss during compaction or memory-based cross-session transfer Engineering-practice discussions exist; systematic information-theoretic analysis is lacking
Training data / input format alignment Models perform better on formats most seen during pre-training SoEval and similar benchmarks provide initial data, but coverage remains limited

Appendix

Key Literature Index

Peer-Reviewed Publications P

1. Lost in the Middle — Liu et al., TACL 2024
2. The Instruction Hierarchy — Wallace et al., NeurIPS 2024
3. Will We Run Out of Data? — Villalobos et al., Epoch AI 2024
4. Order Matters — Zeng et al., ACL Findings 2025
5. Context Length Alone Hurts — EMNLP Findings 2025
6. SecAlign — CCS 2025
7. VerIH — ICLR 2025
8. LV-Eval — ICLR 2025
9. LongGenBench — ICLR 2025
10. LoCoBench — KDD 2025
11. Shi et al. — ICML 2023
12. SoEval — Information Processing & Management, 2024

arXiv Preprints Pre

13. Serial Position Effects — arXiv:2406.15981, 2024
14. Long Context Less Focus — arXiv:2602.15028, 2026
15. Context Discipline — arXiv:2601.11564, 2025
16. Fragile Preferences — arXiv:2506.14092, 2025
17. Position is Power — arXiv:2505.21091, 2025
18. Test-Time Training for Long-Context LLMs — arXiv:2512.13898, 2025
19. Input Matters — arXiv:2510.21034, 2025
20. StructEval — arXiv:2505.20139, 2025

Industry Reports Ind

21. Context Rot — Chroma Research, 2025.7
22. SWiM — Snorkel AI, 2024
23. OpenClaw System Prompt Investigation — GitHub Issue #21999, 2026.2
24. State of LLMs 2025 — Sebastian Raschka, 2025.12
25. GPT-5.4 Launch — OpenAI Blog, 2026.3.5
26. Cursor vs Claude Code — Builder.io, 2026.2
27. Khan Academy PyData — Boris Lau, PyData Global 2025

The Ten Input Factors That Determine LLM Output — A Deep Analysis

LLM Input Factor Atlas V4 · March 2026

“The model is the CPU; the input is the program. What truly changes the world is not a bigger CPU, but a better program.”
