AI Agent Success Probability Spectrum
A research-based reference for AI Agent task success rates, graded by environment certainty, based on publicly available research data from 2024–March 2026. Covers 16 primary industries and 29 scenarios, spanning the full spectrum from board games (undefeated ~100%) to black swan prediction (≈random). Core principle: The more complete the rules and bounded the state space, the stronger AI performs; the more exogenous variables and incomplete the rules, the harder it is for reinforcement learning and reasoning to deliver.
AI (especially reinforcement learning) requires that the environment can be modeled as a Markov Decision Process (MDP) — with well-defined states, actions, and transition probabilities. Chess satisfies this perfectly; financial markets do not at all. This is the same principle behind autonomous driving performing differently in clear weather vs. blizzards. Environment certainty determines the theoretical upper bound of AI success rates.
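As a rough illustration of what "well-defined states, actions, and transition probabilities" means, the sketch below represents a tiny MDP and checks that every state-action pair has a complete probability distribution over known states. The `MDP` class, the `is_well_defined` method, and the two-state toy game are hypothetical names and data invented for this example; rewards are omitted for brevity. It is not a model of chess or of any benchmark cited here.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    """Hypothetical minimal MDP container (rewards omitted)."""
    states: set
    actions: set
    # transitions[(state, action)] -> {next_state: probability}
    transitions: dict

    def is_well_defined(self, tol: float = 1e-9) -> bool:
        """True if every (state, action) pair maps to a valid distribution."""
        for s in self.states:
            for a in self.actions:
                dist = self.transitions.get((s, a))
                if dist is None:
                    return False  # a rule is missing for this pair
                if any(ns not in self.states for ns in dist):
                    return False  # transition leaves the known state space
                if abs(sum(dist.values()) - 1.0) > tol:
                    return False  # probabilities do not sum to 1
        return True


# Toy example (assumed data): complete rules, bounded state space.
toy = MDP(
    states={"playing", "won"},
    actions={"move"},
    transitions={
        ("playing", "move"): {"playing": 0.5, "won": 0.5},
        ("won", "move"): {"won": 1.0},
    },
)

# Same environment with one rule missing: no longer a well-defined MDP.
incomplete = MDP(
    states={"playing", "won"},
    actions={"move"},
    transitions={("playing", "move"): {"playing": 0.5, "won": 0.5}},
)

print(toy.is_well_defined())         # True
print(incomplete.is_well_defined())  # False
```

In an environment with exogenous variables, the set of states and the transition table cannot be enumerated this way, which is the situation the lower tiers of the table describe.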
A 2025 mathematical proof shows that under current LLM architectures, hallucinations cannot be fully eliminated[37]. Measured hallucination rates: summarization tasks as low as 0.7%[38], open-ended factual questions averaging 9.2%[39], and the reasoning model o3 reaching 33% on specific questions[40]. Global financial losses from AI hallucinations reached $67.4B in 2024[41].
🔬 Peer-Reviewed: Nature/Science/npj journals · 📊 Independent Benchmark: Epoch AI/SEAL/Vals.ai · 🏢 Self-Reported: use caution (possible cherry-picking) · 📰 Industry Report: McKinsey/Gartner/NBER etc.
AIME perfect scores are marked as self-reported[5], with independent verification at 90.6%[6]. SWE-bench Verified is flagged for data contamination[10]. Code results are split into SEAL-standardized vs. custom scaffold. Four supply chain/accounting sectors were added. All data now carry numbered references.
| Scenario | Certainty | Success Rate | Risk | RL | Halluc. Rate | Data Sources & References | User Action · AI:Human Ratio |
|---|---|---|---|---|---|---|---|
| 1. Board Games / Perfect Information Games<br>Chess, Go, Shogi | ★★★★★ | Undefeated ~100%<br>Win 28% Draw 72% | Very Low | Very Effective | N/A | 🔬 AlphaZero 28W-0L-72D vs Stockfish[1]; ELF OpenGo 20:0 vs top professionals[2]; non-standard puzzles 93% failure[3] | AI:Human = 100:0. Can be fully delegated. ⚠️ AlphaZero can fail on non-standard variants. |
| 2. Math Computation / Competition Reasoning<br>AIME, IMO, equation solving | ★★★★★ | 78–97%<br>Independently verified; ~100% with tools | Very Low | Effective | ~1% | 🏢 AIME leaderboard perfect scores are self-reported, 0 independently verified[5]; 📊 Vals.ai independent: Grok-4 90.6%[6]; 105 models avg 78.3%[5]; with code tools can reach ~100%[7] | AI:Human = 95:5. Use code for complex problems. Top human AIME: 27–40%[7]. |
| Scenario | Certainty | Success Rate | Risk | RL | Halluc. Rate | Data Sources & References | User Action · AI:Human Ratio |
|---|---|---|---|---|---|---|---|
| 3. Code Generation: Single File / Function<br>Function writing, bug fixing, unit tests | ★★★★☆ | 62–81%<br>⚠️ Verified is contaminated | Low | Effective | ~3% | 📊 SWE-bench Verified avg 62.2%, top 80.9%[8]; ⚠️ OpenAI confirmed Verified data contamination, stopped reporting[10]; scaffold improvement 20%[11] | AI:Human = 80:20. Must pass tests. For a realistic reference, look at SWE-bench Pro (42–57%)[9]. |
| 4. Cybersecurity / Known Threat Detection<br>Phishing, malware, anomaly detection | ★★★★☆ | 92–99%<br>Known attack patterns | Low | Effective | N/A | 🔬 RL phishing detection 95% / 2% false positives[12]; threat detection >95%, false positives down 60%[13]; 57% of SOC analysts say traditional methods are insufficient[14] | AI:Human = 85:15. Known threats can be delegated. Zero-day threats go to humans. |
| 5. Accounting: Data Entry / Reconciliation / Compliance<br>Invoice scanning, bank reconciliation, expense classification | ★★★★☆ | 95–99.5%<br>Highly structured rules | Low | Partial | ~1% | 🏢 AI accuracy reaches 99.5% within 6 months, errors down 95%[42]; manual intervention reduced 70%[42]; 100% transaction analysis vs traditional sampling[43] | AI:Human = 90:10. Highly reliable. Manual review at month-end/year-end critical nodes. |
| 6. Supply Chain: Route / Inventory Optimization<br>Delivery routing, warehouse layout, inventory management | ★★★★☆ | 85–99.7%<br>Structured optimization | Low | Effective | N/A | 🏢 XPO 99.7% auto load matching[44]; Walmart 99.2% in-stock rate, saving $1.5B[44]; logistics costs down 15%, inventory improved 35%[45] | AI:Human = 85:15. Optimization highly reliable. Anomalies require human intervention. |
| 7. Data Retrieval / Structured Queries<br>SQL queries, document retrieval, database operations | ★★★★☆ | 85–91%<br>Degrades under load | Low | Partial | ~1% | 🔬 Mount Sinai npj 2026: multi-agent retrieval 90.6%, drops to 65.3% at 80 tasks[15] | AI:Human = 85:15. Results can be verified. Caution under high concurrency. |
| 8. Customer Service: Banking / Finance Structured Queries<br>Balance inquiries, transfers, account operations | ★★★★☆ | 95–98%<br>Highly structured | Low | Partial | ~2% | 🏢 BofA Erica: 98%/44 sec, 2B interactions[16]; 56M monthly interactions[16] | AI:Human = 90:10. Banking FAQ: very high reliability. |
| 9. Autonomous Driving (Good Weather)<br>Clear skies, mapped cities, normal traffic | ★★★★☆ | Crash rate -85%<br>85% lower than humans | Low | Effective | N/A | 🔬 Waymo 170.7M rider-only miles[17]; injury rate 0.41 vs human 2.78 per million miles[18] | AI:Human = 90:10. Can be delegated within the ODD (operational design domain). Mind geofences. |
| Scenario | Certainty | Success Rate | Risk | RL | Halluc. Rate | Data Sources & References | User Action · AI:Human Ratio |
|---|---|---|---|---|---|---|---|
| 10. Customer Service: General E-commerce / Product Support<br>Returns, product inquiries, complaint handling | ★★★☆☆ | 80–85%<br>Complex issues need handoff | Low-Med | Partial | ~3% | 🏢 OPPO: 83% resolution/94% positive feedback[19]; industry target ≥85%[19]; 90% of businesses struggle with handoffs[20] | AI:Human = 70:30. Routine queries reliable. Emotional/complex issues require humans. |
| 11. Translation (High-Resource Languages + General Text)<br>EN⇄ZH/ES/FR/DE news/business | ★★★★☆ | 90–95%<br>≈ Junior/mid-level translators | Low-Med | Partial | ~1.2% | 📰 EN-ES BLEU 94.2%[21]; news, 10 language pairs: 92.7% human parity[21]; GPT-4 ≈ junior/mid-level, behind seniors[22]; “Human parity” limited to specific domains[23] | AI:Human = 75:25. General text usable. Marketing/legal requires senior review. |
| 12. Weather Forecasting (1–5 Days)<br>Temperature, wind speed, pressure, precipitation probability | ★★★☆☆ | 97.2%, outperforms traditional<br>Outperforms ENS on 97.2% of 1320 targets | Low-Med | Partial | N/A | 🔬 GenCast outperforms ENS on 97.2% of targets, 99.8% beyond 36h[26]; WeatherNext 2 further +6.5%[27]; Nature 2024[26] | AI:Human = 80:20. Short-term highly reliable. Physics known but chaotic system. |
| 13. Education / AI Tutoring (Structured Subjects)<br>Math, physics, programming instruction | ★★★☆☆ | Effect +54%<br>Test score improvement | Low-Med | Partial | ~3–6% | 🔬 Harvard RCT: effect size 0.73–1.3σ[28]; completion +70%, dropout -15%[29]; Stanford math +4–9pp[30] | AI:Human = 65:35. Significant for structured subjects. Beware over-reliance (95% of faculty concerned)[29]. |
| 14. Content Summarization / Rewriting (Source Documents)<br>Document summaries, meeting notes, report rewriting | ★★★☆☆ | Faithfulness 99.3%<br>Source-grounded; very low hallucination | Low-Med | Partial | 0.7%[38] | 📊 Gemini-2.0-Flash summarization 0.7%[38]; 4 models <1%[39]; 96% reduction over 4 years[39] | AI:Human = 80:20. Source-grounded: high reliability. Open-ended writing: risk rises sharply. |
| Scenario | Certainty | Success Rate | Risk | RL | Halluc. Rate | Data Sources & References | User Action · AI:Human Ratio |
|---|---|---|---|---|---|---|---|
| 15. Code Engineering: Real Multi-File Projects<br>Cross-file modifications, large codebase maintenance | ★★★☆☆ | SEAL: 42–46%<br>Custom scaffold: 50–57% | Medium | Partial | ~6% | 📊 SWE-bench Pro SEAL: Opus 4.5 45.9%[9]; GPT-5.3-Codex 56.8% (custom scaffold)[9]; scaffold gap 22pp[11]; 35.9% semantic failures[10] | AI:Human = 50:50. Every commit needs code review. Scaffold differences matter more than model differences. |
| 16. Supply Chain: Demand Forecasting<br>Sales forecasting, seasonal demand, market signals | ★★★☆☆ | ~95%<br>Stable markets; declines in volatility | Medium | Partial | N/A | 🏢 Demand forecast 95% accuracy[46]; Amazon stockouts down 32%[47]; but 2.5–7.5/10 wide disagreement[48]; data readiness is the real bottleneck[48] | AI:Human = 60:40. Effective in stable markets. External shocks/new categories need human judgment. |
| 17. Medical AI (Rule-Based Subtasks)<br>Drug dosing, image labeling, literature retrieval | ★★★☆☆ | 65–91%<br>Degrades under load | Medium | Partial | ~6% | 🔬 Mount Sinai npj 2026: 90.6%→65.3% (5→80 tasks)[15]; single agent collapsed to 16.6%[15] | AI:Human = 40:60. ⚠️ Must be reviewed by professionals. Severe degradation under high load. |
| 18. Translation (Low-Resource Languages + Specialized Domains)<br>Minor languages, medical/legal professional translation | ★★☆☆☆ | 72–89%<br>Behind senior translators | Medium | Limited | ~4% | 📰 Low-resource 72% (transfer learning)[21]; DeepL medical 89.5%[21]; cultural nuance 85%[21] | AI:Human = 40:60. ⚠️ Must be reviewed by senior translators. Medical/legal errors can be fatal. |
| 19. Drug Discovery / Molecular Screening<br>Target discovery, virtual screening, lead compounds | ★★☆☆☆ | Phase I 80–90%<br>But zero FDA approvals | Medium | Partial | N/A | 📰 AI drugs Phase I 80–90% vs traditional 40–65%[31]; 24 molecules, 21 successes (87.5%)[31]; zero FDA approvals as of Dec 2025[32] | AI:Human = 40:60. ⚠️ Useful for accelerating screening. Clinical success rate is not proven to outperform traditional methods. |
| 20. Accounting: Complex Tax Judgment<br>Impairment testing, fair value estimation, cross-border tax | ★★☆☆☆ | 50–70%<br>Up to 50% inaccurate on complex questions | Medium | Limited | High risk | 📰 GenAI complex tax questions 50% inaccurate[34]; AI audit selection has racial bias (3–5x)[35]; audit triggers down 40%[36] | AI:Human = 30:70. ⚠️ CPA review mandatory. Serious bias risk. Complex judgment cannot replace humans. |
| 21. Weather Forecasting (7–15 Days + Extreme Events)<br>Hurricane intensity, extreme precipitation, heatwaves | ★★☆☆☆ | Outperforms traditional<br>Extreme intensity underestimated 20–35% | Medium | Limited | N/A | 🔬 Extreme precipitation underestimated 20–35%[33]; once-in-a-century events: traditional models superior[33] | AI:Human = 50:50. ⚠️ Trends can be used as reference. Extreme magnitudes cannot be fully trusted. |
| Scenario | Certainty | Success Rate | Risk | RL | Halluc. Rate | Data Sources & References | User Action · AI:Human Ratio |
|---|---|---|---|---|---|---|---|
| 22. Autonomous Driving (Adverse Weather)<br>Heavy rain, snow, dense fog, hail | ★★☆☆☆ | Significant decline<br>Sensors may shut down | High | Limited | N/A | 🔬 Rainfall >20mm: ADAS stops[4]; Tesla FSD cannot operate in blizzards[4] | AI:Human = 10:90. 🚨 Humans must be ready to take over at any time. Never rely on it. |
| 23. Complex Office Automation (10+ Steps)<br>Cross-system operations, multi-step approvals | ★★☆☆☆ | ~20–24%<br>10-step workflow | High | Limited | ~9% | 📰 CMU 2026: complex office tasks 24%[24]; 85%/step × 10 steps = 19.7%[24] | AI:Human = 20:80. 🚨 Every critical step must be manually confirmed. |
| 24. Content Creation (Open-Ended Factual Writing)<br>Article writing, research reports, factual claims | ★★☆☆☆ | 67–97%<br>Varies drastically by task | High | Near Null | 3–33% | 📊 Claude ~3%, GPT-5.2/Gemini ~6%[40]; o3 reaches 33%[40]; avg 9.2%[39] | AI:Human = 30:70. 🚨 All facts must be independently verified. Reasoning models hallucinate even more. |
| 25. Medical Diagnosis / Treatment Decisions<br>Complex conditions, multi-drug regimens, rare diseases | ★★☆☆☆ | Uncertain<br>Lacks large-scale validation | High | Limited | High risk | Clinical decision agents: no public large-scale data; diagnosis 79.6% (multimodal)[15] | AI:Human = 15:85. 🚨 Auxiliary reference only. Patient safety comes first. |
| 26. Legal / Compliance Analysis<br>Contract review, case law prediction, regulatory judgment | ★★☆☆☆ | Highly uncertain<br>Frequent hallucinated citations | High | Near Null | High risk | 2025: hundreds of global judicial rulings on AI-fabricated case law (~90%)[41]; Grok-3 source attribution errors 94%[40] | AI:Human = 15:85. 🚨 Every legal citation must be manually verified. |
| Scenario | Certainty | Success Rate | Risk | RL | Halluc. Rate | Data Sources & References | User Action · AI:Human Ratio |
|---|---|---|---|---|---|---|---|
| 27. Market Research / Consumer Behavior Prediction<br>Demand forecasting, user preferences, competitor trends | ★☆☆☆☆ | Cannot be guaranteed | Very High | Near Null | High | 📰 NBER Feb 2026: 89% of firms report zero AI productivity change[25] | AI:Human = 10:90. 🚨 For data organization only. Predictions must not be directly adopted. |
| 28. Financial Trade Execution / Market Making<br>Bid-ask spread, inventory mgmt, order execution | ★☆☆☆☆ | Limited improvement | Very High | Partial | N/A | 🔬 RL review: market-making is the most improved RL finance sub-domain[35a]; overfitting remains a fundamental challenge | AI:Human = 15:85. 🚨 Market-making RL partially effective. Risk management must be independent. |
| 29a. Financial Prediction / Macroeconomics<br>Stock prices, exchange rates, economic trends | ★☆☆☆☆ | Cannot be guaranteed<br>MDP assumption does not hold | Critical | Fundamental Failure | High | 🔬 RL effectiveness uncertain[35a]; Bank of England warns of systemic risk[35b]; LLM homogenization amplifies crashes[35c] | AI:Human = 0:100. 🚨🚨 Strictly prohibited as trading basis. May cause massive losses. |
| 29b. Geopolitics / Black Swan Events<br>War trajectories, policy shifts, pandemics, extreme events | ☆☆☆☆☆ | ≈ Random | Critical | Fully Ineffective | Very High | Training data cannot cover “unknown unknowns”; Taleb’s “The Black Swan” theoretical framework | AI:Human = 0:100. 🚨🚨 Output must absolutely never be used as basis for decisions. |
Find your scenario → Check certainty + risk → The “AI:Human” ratio determines the collaboration level. Green (90:10): can be automated. Yellow (50:50): step-by-step confirmation. Orange (20:80): reference only. Red (0:100): never replace humans.
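The lookup can be sketched in code. The four anchor ratios and their labels come from the guide above; everything else is an assumption made for illustration: the function name `collaboration_level`, the nearest-anchor rule for in-between ratios such as 85:15 or 40:60, and the choice to break ties toward the more cautious level. The table itself does not define how intermediate ratios map to colors.

```python
# Anchor AI shares and labels taken from the usage guide.
ANCHORS = [
    (90, "Green: can be automated"),
    (50, "Yellow: step-by-step confirmation"),
    (20, "Orange: reference only"),
    (0, "Red: never replace humans"),
]


def collaboration_level(ratio: str) -> str:
    """Map an 'AI:Human' ratio string such as '85:15' to a level.

    Assumption: pick the nearest anchor by AI share; on a tie,
    prefer the anchor with the lower AI share (more cautious).
    """
    ai_share, human_share = (int(part) for part in ratio.split(":"))
    if ai_share < 0 or human_share < 0 or ai_share + human_share != 100:
        raise ValueError(f"ratio must be non-negative and sum to 100: {ratio!r}")
    nearest = min(ANCHORS, key=lambda a: (abs(a[0] - ai_share), a[0]))
    return nearest[1]


for r in ["100:0", "85:15", "70:30", "40:60", "15:85", "10:90", "0:100"]:
    print(r, "->", collaboration_level(r))
```

Under these assumptions a 70:30 scenario lands on Yellow and a 10:90 scenario on Red; a different tie-breaking rule would be equally defensible.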
AI requires environments that can be modeled as an MDP. Chess satisfies this perfectly; financial markets do not at all. More complete rules → stronger training signal → higher success rate.
Errors compound across serial steps: 95%/step × 5 steps = 77%, × 10 steps = 60%[24]. Google: on sequential tasks, multi-agent setups perform 70% worse than single-agent[25].
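The compounding arithmetic can be reproduced with a few lines of code. This is a minimal sketch that assumes each step succeeds independently with the same probability, which real workflows may not satisfy; `chain_success` is a hypothetical helper name, not something taken from the cited sources.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# The per-step rates and step counts quoted in this reference.
for p, n in [(0.95, 5), (0.95, 10), (0.85, 10)]:
    print(f"{p:.0%}/step x {n} steps = {chain_success(p, n):.1%}")
# 95%/step x 5 steps = 77.4%
# 95%/step x 10 steps = 59.9%
# 85%/step x 10 steps = 19.7%
```

Even a per-step rate that looks high in isolation drops below 60% after ten chained steps, which is why the 10+ step office automation row recommends manual confirmation at every critical step.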
🔬= Nature/Science peer-reviewed (highest)· 📊= Independent benchmarks (high)· 🏢= Company self-reported (medium, use caution)· 📰= Industry reports (reference)
Reference Index
v3.0 · 2026-03-21 · 이조 세계인공지능연구소 (human) × Claude Opus 4.6
This table is a risk-awareness reference tool, not a precise prediction. Actual success rates depend on model version, prompt quality, task complexity, scaffold quality, data quality, and other factors. All entries marked 🏢 are company self-reported and may contain selection bias. AIME leaderboard perfect scores are all self-reported (0 independently verified). SWE-bench Verified has confirmed data contamination. Users should prioritize 🔬 and 📊 marked data sources. Continuous revision based on new data is welcome. Please cite when republishing.