v3.0 · 29 Scenarios · 16 Industries · 48 References

AI Agent Success Probability Spectrum

A research-based reference for AI agent task success rates, graded by environment certainty and drawn from publicly available research data from 2024 to March 2026. It covers 16 primary industries and 29 scenarios, spanning the full spectrum from board games (undefeated, ~100%) to black swan prediction (≈ random). Core principle: the more complete the rules and the more bounded the state space, the better AI performs; the more exogenous variables there are and the less complete the rules, the harder it is for reinforcement learning and reasoning to deliver.

Author
LEECHO (이조)
LEECHO Global AI Research Lab
×
Co-Author
Claude Opus 4.6
Anthropic · Research & Data Compilation
2026-03-21 · Incheon, Korea / San Francisco

⚡ Core Principle · Environment Certainty

AI (especially reinforcement learning) requires that the environment can be modeled as a Markov Decision Process (MDP) with well-defined states, actions, and transition probabilities. Chess satisfies this requirement perfectly; financial markets do not satisfy it at all. The same principle explains why autonomous driving performs differently in clear weather and in blizzards. Environment certainty determines the theoretical upper bound of AI success rates.
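As a rough illustration of what "well-defined states, actions, and transition probabilities" means, the sketch below checks whether a toy transition table forms valid probability distributions. The state names, actions, numbers, and the `is_well_defined` helper are all hypothetical and are not taken from the cited research.

```python
from math import isclose

# Hypothetical toy MDPs: transitions[state][action] -> {next_state: probability}.
# Names and numbers are illustrative only.
chess_like = {
    "opening": {"develop": {"midgame": 1.0}},
    "midgame": {"attack": {"won": 0.6, "drawn": 0.4},
                "defend": {"drawn": 1.0}},
}

# A "market" where part of the outcome mass is unknown (exogenous shocks),
# so the listed probabilities do not sum to 1.
market_like = {
    "open_position": {"hold": {"up": 0.3, "down": 0.3}},
}


def is_well_defined(transitions, tol=1e-9):
    """Return True if every (state, action) pair maps to a valid distribution."""
    for actions in transitions.values():
        for outcomes in actions.values():
            probs = list(outcomes.values())
            if any(p < 0.0 or p > 1.0 for p in probs):
                return False
            if not isclose(sum(probs), 1.0, abs_tol=tol):
                return False
    return True


print(is_well_defined(chess_like))   # True
print(is_well_defined(market_like))  # False
```

The check is deliberately minimal: it says nothing about whether the probabilities are correct, only whether the environment has been specified completely enough to be treated as an MDP at all.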

⚠️ Hallucination Alert

A 2025 mathematical proof shows that, under current LLM architectures, hallucinations cannot be fully eliminated[37]. Hallucination rates range from as low as 0.7% on summarization tasks[38] to an average of 9.2% on open-ended factual questions[39], with the reasoning model o3 reaching 33% on specific questions[40]. Global financial losses from AI hallucinations reached $67.4B in 2024[41].

📊 Source Credibility Markers

🔬 Peer-Reviewed: Nature/Science/npj journals · 📊 Independent Benchmark: Epoch AI/SEAL/Vals.ai · 🏢 Self-Reported: use caution (possible cherry-picking) · 📰 Industry Report: McKinsey/Gartner/NBER etc.

🔄 v2→v3 Key Corrections

AIME perfect scores are now marked as self-reported[5], with independent verification at 90.6%[6]. SWE-bench Verified is flagged for data contamination[10]. Code results are split into SEAL-standardized vs. custom scaffold. Added 4 supply chain/accounting sectors. All data now carry numbered references.

🧮 Multi-Step Agent Compound Decay Calculator


→ Overall: 77.4% (consistent with 95% per step over 5 steps)
Formula: overall success = (per-step success rate)^(number of steps). At 95% per step, 10 steps yield 59.9%; at 85% per step, 10 steps yield 19.7%[24]. Google Research: independent parallel agents amplify errors 17.2×[25].
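The compound decay formula can be reproduced with a few lines of Python. This is a minimal sketch that assumes every step succeeds independently with the same probability; the function name `compound_success` is illustrative and not part of the original calculator.

```python
def compound_success(per_step_rate: float, steps: int) -> float:
    """Overall success probability of a serial workflow.

    Assumes each step succeeds independently with the same probability,
    so the overall rate is per_step_rate ** steps.
    """
    if not 0.0 <= per_step_rate <= 1.0:
        raise ValueError("per_step_rate must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_rate ** steps


for rate, steps in [(0.95, 5), (0.95, 10), (0.85, 10)]:
    overall = compound_success(rate, steps)
    print(f"{rate:.0%} per step over {steps} steps -> {overall:.1%}")
# 95% per step over 5 steps -> 77.4%
# 95% per step over 10 steps -> 59.9%
# 85% per step over 10 steps -> 19.7%
```

Real workflows rarely have identical, independent steps, so the result is better read as a rough planning estimate than as a measured success rate.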



ZONE 1 · Very High Certainty · Complete Rules · Fully Trustworthy
Scenario | Certainty | Success Rate | Risk | RL Effectiveness | Hallucination Rate | Data Sources & References | User Action · AI:Human Ratio
1. Board Games / Perfect Information Games (Chess, Go, Shogi) | ★★★★★ | Undefeated, ~100% (Win 28%, Draw 72%) | Very Low | Very Effective | N/A | 🔬 AlphaZero 28W-0L-72D vs Stockfish[1]; ELF OpenGo 20:0 vs top professionals[2]; 93% failure on non-standard puzzles[3] | AI:Human = 100:0. Can be fully delegated. ⚠️ AlphaZero can fail on non-standard variants.
2. Math Computation / Competition Reasoning (AIME, IMO, equation solving) | ★★★★★ | 78–97% (independently verified; ~100% with tools) | Very Low | Effective | ~1% | 🏢 AIME leaderboard perfect scores are self-reported, 0 independently verified[5]; 📊 Vals.ai independent: Grok-4 90.6%[6]; 105-model average 78.3%[5]; with code tools can reach ~100%[7] | AI:Human = 95:5. Use code for complex problems. Top human students score 27–40% on AIME[7].



ZONE 2 · High Certainty · Clear Rules · Verify but Highly Reliable
Scenario | Certainty | Success Rate | Risk | RL Effectiveness | Hallucination Rate | Data Sources & References | User Action · AI:Human Ratio
3. Code Generation: Single File / Function (function writing, bug fixing, unit tests) | ★★★★☆ | 62–81% (⚠️ SWE-bench Verified is contaminated) | Low | Effective | ~3% | 📊 SWE-bench Verified average 62.2%, top 80.9%[8]; ⚠️ OpenAI confirmed Verified data contamination and stopped reporting[10]; scaffold improvement 20%[11] | AI:Human = 80:20. Must pass tests. For a realistic reference, look at SWE-bench Pro (42–57%)[9].
4. Cybersecurity / Known Threat Detection (phishing, malware, anomaly detection) | ★★★★☆ | 92–99% (known attack patterns) | Low | Effective | N/A | 🔬 RL phishing detection 95% accuracy, 2% false positives[12]; threat detection >95%, false positives down 60%[13]; 57% of SOC analysts say traditional intelligence is insufficient[14] | AI:Human = 85:15. Known threats can be delegated. Zero-day attacks require humans.
5. Accounting: Data Entry / Reconciliation / Compliance (invoice scanning, bank reconciliation, expense classification) | ★★★★☆ | 95–99.5% (highly structured rules) | Low | Partial | ~1% | 🏢 AI accuracy reaches 99.5% within 6 months, errors down 95%[42]; manual intervention reduced 70%[42]; 100% transaction analysis vs traditional sampling[43] | AI:Human = 90:10. Highly reliable. Manual review at month-end/year-end critical nodes.
6. Supply Chain: Route / Inventory Optimization (delivery routing, warehouse layout, inventory management) | ★★★★☆ | 85–99.7% (structured optimization) | Low | Effective | N/A | 🏢 XPO 99.7% auto load matching[44]; Walmart 99.2% in-stock rate, saving $1.5B[44]; logistics costs down 15%, inventory improved 35%[45] | AI:Human = 85:15. Optimization highly reliable. Anomalies require human intervention.
7. Data Retrieval / Structured Queries (SQL queries, document retrieval, database operations) | ★★★★☆ | 85–91% (degrades under load) | Low | Partial | ~1% | 🔬 Mount Sinai npj 2026: multi-agent retrieval 90.6%, dropping to 65.3% at 80 tasks[15] | AI:Human = 85:15. Results can be verified. Use caution under high concurrency.
8. Customer Service: Banking / Finance Structured Queries (balance inquiries, transfers, account operations) | ★★★★☆ | 95–98% (highly structured) | Low | Partial | ~2% | 🏢 BofA Erica: 98% within 44 sec, 2B interactions[16]; 56M monthly interactions[16] | AI:Human = 90:10. Banking FAQ: very high reliability.
9. Autonomous Driving, Good Weather (clear skies, mapped cities, normal traffic) | ★★★★☆ | Crash rate -85% (85% lower than humans) | Low | Effective | N/A | 🔬 Waymo 170.7M rider-only miles[17]; injury rate 0.41 vs human 2.78 per million miles[18] | AI:Human = 90:10. Can be delegated within the ODD. Mind geofences.



ZONE 3 · Medium-High Certainty · Partially Known Rules · Human Review Required
Scenario | Certainty | Success Rate | Risk | RL Effectiveness | Hallucination Rate | Data Sources & References | User Action · AI:Human Ratio
10. Customer Service: General E-commerce / Product Support (returns, product inquiries, complaint handling) | ★★★☆☆ | 80–85% (complex issues need handoff) | Low-Med | Partial | ~3% | 🏢 OPPO: 83% resolution, 94% positive feedback[19]; industry target ≥85%[19]; 90% of businesses struggle with handoffs[20] | AI:Human = 70:30. Routine queries reliable. Emotional/complex issues require humans.
11. Translation, High-Resource Languages + General Text (EN⇄ZH/ES/FR/DE news/business) | ★★★★☆ | 90–95% (≈ junior/mid-level translators) | Low-Med | Partial | ~1.2% | 📰 EN-ES BLEU 94.2%[21]; news, 10 language pairs: 92.7% human parity[21]; GPT-4 ≈ junior/mid-level, behind seniors[22]; “Human parity” limited to specific domains[23] | AI:Human = 75:25. General text usable. Marketing/legal requires senior review.
12. Weather Forecasting, 1–5 Days (temperature, wind speed, pressure, precipitation probability) | ★★★☆☆ | Outperforms traditional (beats ENS on 97.2% of 1320 targets) | Low-Med | Partial | N/A | 🔬 GenCast outperforms ENS on 97.2% of targets, 99.8% at lead times >36h[26]; WeatherNext 2 a further +6.5%[27]; Nature 2024[26] | AI:Human = 80:20. Short-term highly reliable. Physics known but chaotic system.
13. Education / AI Tutoring, Structured Subjects (math, physics, programming instruction) | ★★★☆☆ | Effect +54% (test score improvement) | Low-Med | Partial | ~3–6% | 🔬 Harvard RCT: effect size 0.73–1.3σ[28]; completion +70%, dropout -15%[29]; Stanford math +4–9pp[30] | AI:Human = 65:35. Significant for structured subjects. Beware over-reliance (95% of faculty concerned)[29].
14. Content Summarization / Rewriting, Source Documents (document summaries, meeting notes, report rewriting) | ★★★☆☆ | Faithfulness 99.3% (source-grounded; very low hallucination) | Low-Med | Partial | 0.7%[38] | 📊 Gemini-2.0-Flash summarization 0.7%[38]; 4 models <1%[39]; 96% reduction over 4 years[39] | AI:Human = 80:20. Source-grounded: high reliability. Open-ended writing: risk rises sharply.



ZONE 4 · Medium Certainty · Incomplete Rules · Confirm Each Step
Scenario | Certainty | Success Rate | Risk | RL Effectiveness | Hallucination Rate | Data Sources & References | User Action · AI:Human Ratio
15. Code Engineering: Real Multi-File Projects (cross-file modifications, large codebase maintenance) | ★★★☆☆ | SEAL: 42–46%; custom scaffold: 50–57% | Medium | Partial | ~6% | 📊 SWE-bench Pro SEAL: Opus 4.5 45.9%[9]; GPT-5.3-Codex 56.8% (custom scaffold)[9]; scaffold gap 22pp[11]; 35.9% semantic failures[10] | AI:Human = 50:50. Every commit needs code review. Scaffold differences outweigh model differences.
16. Supply Chain: Demand Forecasting (sales forecasting, seasonal demand, market signals) | ★★★☆☆ | ~95% (stable markets; declines in volatility) | Medium | Partial | N/A | 🏢 Demand forecast 95% accuracy[46]; Amazon stockouts down 32%[47]; but ratings of 2.5–7.5/10 show wide disagreement[48]; data readiness is the real bottleneck[48] | AI:Human = 60:40. Effective in stable markets. External shocks/new categories need human judgment.
17. Medical AI, Rule-Based Subtasks (drug dosing, image labeling, literature retrieval) | ★★★☆☆ | 65–91% (degrades under load) | Medium | Partial | ~6% | 🔬 Mount Sinai npj 2026: 90.6%→65.3% (5→80 tasks)[15]; single agent collapsed to 16.6%[15] | AI:Human = 40:60. ⚠️ Must be reviewed by professionals. Severe degradation under high load.
18. Translation, Low-Resource Languages + Specialized Domains (minor languages, medical/legal professional translation) | ★★☆☆☆ | 72–89% (behind senior translators) | Medium | Limited | ~4% | 📰 Low-resource 72% (transfer learning)[21]; DeepL medical 89.5%[21]; cultural nuance 85%[21] | AI:Human = 40:60. ⚠️ Must be reviewed by senior translators. Medical/legal errors can be fatal.
19. Drug Discovery / Molecular Screening (target discovery, virtual screening, lead compounds) | ★★☆☆☆ | Phase I 80–90% (but zero FDA approvals) | Medium | Partial | N/A | 📰 AI drugs Phase I 80–90% vs traditional 40–65%[31]; 24 molecules, 21 success (87.5%)[31]; zero FDA approvals as of Dec 2025[32] | AI:Human = 40:60. ⚠️ Effective for accelerating screening. Clinical success rate not proven to outperform traditional methods.
20. Accounting: Complex Tax Judgment (impairment testing, fair value estimation, cross-border tax) | ★★☆☆☆ | 50–70% (up to 50% inaccurate on complex questions) | Medium | Limited | High risk | 📰 GenAI complex tax questions 50% inaccurate[34]; AI audit selection has racial bias (3–5x)[35]; audit triggers down 40%[36] | AI:Human = 30:70. ⚠️ CPA review mandatory. Serious bias risk. Complex judgment cannot replace humans.
21. Weather Forecasting, 7–15 Days + Extreme Events (hurricane intensity, extreme precipitation, heatwaves) | ★★☆☆☆ | Outperforms traditional (extreme intensity underestimated 20–35%) | Medium | Limited | N/A | 🔬 Extreme precipitation underestimated 20–35%[33]; once-in-a-century events: traditional models superior[33] | AI:Human = 50:50. ⚠️ Trends can be used as reference. Extreme intensities cannot be fully relied on.



ZONE 5 · Low Certainty · Many Exogenous Variables · Reference Only
Scenario | Certainty | Success Rate | Risk | RL Effectiveness | Hallucination Rate | Data Sources & References | User Action · AI:Human Ratio
22. Autonomous Driving, Adverse Weather (heavy rain, snow, dense fog, hail) | ★★☆☆☆ | Significant decline (sensors may shut down) | High | Limited | N/A | 🔬 Rainfall >20 mm: ADAS stops[4]; Tesla FSD cannot operate in blizzards[4] | AI:Human = 10:90. 🚨 Humans must be ready to take over at any time. Never rely on it.
23. Complex Office Automation, 10+ Steps (cross-system operations, multi-step approvals) | ★★☆☆☆ | ~20–24% (10-step workflow) | High | Limited | ~9% | 📰 CMU 2026: complex office tasks 24%[24]; 85% per step over 10 steps = 19.7%[24] | AI:Human = 20:80. 🚨 Every critical step must be manually confirmed.
24. Content Creation, Open-Ended Factual Writing (article writing, research reports, factual claims) | ★★☆☆☆ | 67–97% (varies drastically by task) | High | Near Null | 3–33% | 📊 Claude ~3%, GPT-5.2/Gemini ~6%[40]; o3 reaches 33%[40]; average 9.2%[39] | AI:Human = 30:70. 🚨 All facts must be independently verified. Reasoning models hallucinate even more.
25. Medical Diagnosis / Treatment Decisions (complex conditions, multi-drug regimens, rare diseases) | ★★☆☆☆ | Uncertain (lacks large-scale validation) | High | Limited | High risk | Clinical decision agents: no public large-scale data; diagnosis 79.6% (multimodal)[15] | AI:Human = 15:85. 🚨 Auxiliary reference only. Patient safety comes first.
26. Legal / Compliance Analysis (contract review, case law prediction, regulatory judgment) | ★★☆☆☆ | Highly uncertain (frequent hallucinated citations) | High | Near Null | High risk | 2025: hundreds of global judicial rulings on AI-fabricated case law (~90%)[41]; Grok-3 source attribution errors 94%[40] | AI:Human = 15:85. 🚨 Every legal citation must be manually verified.



ZONE 6 · Very Low Certainty · Non-Stationary / Exogenous Shocks · Not for Decision-Making
Scenario | Certainty | Success Rate | Risk | RL Effectiveness | Hallucination Rate | Data Sources & References | User Action · AI:Human Ratio
27. Market Research / Consumer Behavior Prediction (demand forecasting, user preferences, competitor trends) | ★☆☆☆☆ | Cannot be guaranteed | Very High | Near Null | High | 📰 NBER Feb 2026: 89% of firms report zero AI productivity change[25] | AI:Human = 10:90. 🚨 For data organization only. Predictions must not be directly adopted.
28. Financial Trade Execution / Market Making (bid-ask spread, inventory management, order execution) | ★☆☆☆☆ | Limited improvement | Very High | Partial | N/A | 🔬 RL review: market-making is the most improved RL finance sub-domain[35a]; overfitting remains a fundamental challenge | AI:Human = 15:85. 🚨 Market-making RL partially effective. Risk management must be independent.
29a. Financial Prediction / Macroeconomics (stock prices, exchange rates, economic trends) | ★☆☆☆☆ | Cannot be guaranteed (MDP assumption does not hold) | Critical | Fundamental Failure | High | 🔬 RL: no conclusive results[35a]; Bank of England warns of systemic risk[35b]; LLM homogenization amplifies crashes[35c] | AI:Human = 0:100. 🚨🚨 Strictly prohibited as trading basis. May cause massive losses.
29b. Geopolitics / Black Swan Events (war trajectories, policy shifts, pandemics, extreme events) | ☆☆☆☆☆ | ≈ Random | Critical | Fully Ineffective | Very High | Training data cannot cover “unknown unknowns”; Taleb’s “The Black Swan” theoretical framework | AI:Human = 0:100. 🚨🚨 Output must absolutely never be used as basis for decisions.

How to Use This Table

Find your scenario → Check certainty + risk → The “AI:Human” ratio determines the collaboration level. Green (90:10): can be automated. Yellow (50:50): confirm step by step. Orange (20:80): reference only. Red (0:100): never replace humans.
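The lookup above can be sketched as a small helper. The four anchor ratios come from the guide; the nearest-anchor rule and the choice to fall back to the more cautious band on ties are assumptions made for this example, since the table does not define exact boundaries between colors. The name `collaboration_band` is hypothetical.

```python
# Anchor points from the guide, ordered from most to least cautious.
# How ratios between the anchors are classified is an assumption.
BANDS = [
    (0, "Red", "never replace humans"),
    (20, "Orange", "reference only"),
    (50, "Yellow", "step-by-step confirmation"),
    (90, "Green", "can be automated"),
]


def collaboration_band(ai_share: int) -> tuple[str, str]:
    """Map the AI part of an AI:Human ratio (0-100) to the nearest color band.

    Ties go to the more cautious band because BANDS is ordered from most
    to least cautious and min() returns the first minimal element.
    """
    if not 0 <= ai_share <= 100:
        raise ValueError("ai_share must be between 0 and 100")
    _, color, guidance = min(BANDS, key=lambda band: abs(band[0] - ai_share))
    return color, guidance


for share in (90, 50, 20, 0):
    color, guidance = collaboration_band(share)
    print(f"AI:Human = {share}:{100 - share} -> {color} ({guidance})")
```

A stricter policy could round every in-between ratio down to the next more cautious band; the nearest-anchor rule is just one reasonable reading of the four examples.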

Why Such Wide Success Rate Differences?

AI requires environments that can be modeled as an MDP. Chess satisfies this perfectly; financial markets do not satisfy it at all. More complete rules → stronger training signal → higher success rate.

Multi-Agent ≠ More Accurate

Errors compound serially: at 95% per step, 5 steps yield 77% and 10 steps yield 60%[24]. Google: on sequential tasks, multi-agent systems perform 70% worse than a single agent[25].
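Turning the compounding relationship around gives two planning questions: how reliable must each step be to reach a target, and how many steps can a chain afford? The sketch below answers both under the same assumption of independent, equally reliable steps; the function names are illustrative.

```python
import math


def required_per_step(target_overall: float, steps: int) -> float:
    """Per-step success rate needed so that `steps` serial steps reach the target."""
    if not 0.0 < target_overall <= 1.0:
        raise ValueError("target_overall must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return target_overall ** (1.0 / steps)


def max_steps(per_step_rate: float, target_overall: float) -> int:
    """Largest number of serial steps that keeps overall success >= target."""
    if not 0.0 < per_step_rate < 1.0:
        raise ValueError("per_step_rate must be strictly between 0 and 1")
    if not 0.0 < target_overall <= 1.0:
        raise ValueError("target_overall must be in (0, 1]")
    # Floating-point rounding can shift results that land exactly on a boundary.
    return math.floor(math.log(target_overall) / math.log(per_step_rate))


print(f"{required_per_step(0.90, 10):.2%} per step needed for 90% over 10 steps")
print(f"{max_steps(0.95, 0.50)} steps at 95% per step before dropping below 50%")
```

The inverse view makes the practical point of the section concrete: long serial chains need per-step reliability far above what a single-step benchmark score suggests.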

Credibility Marker Guide

🔬 = Nature/Science peer-reviewed (highest) · 📊 = Independent benchmarks (high) · 🏢 = Company self-reported (medium, use caution) · 📰 = Industry reports (reference)

Reference Index

[1] Silver, D. et al. “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.” Science, 362(6419), 1140–1144, 2018. AlphaZero vs Stockfish: 28W-0L-72D. 🔬
[2] Tian, Y. et al. “ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero.” arXiv:1902.04522, 2019. 20:0 against top professional players. 🔬
[3] “Limitations in Planning Ability in AlphaZero.” NeurIPS 2024 Workshop on Behavioral ML. 93% failure rate on non-standard puzzles. 🔬
[4] PMC/Sensors. “Analysis of Impact of Rain Conditions on ADAS.” Rainfall >20 mm: sensors stop working. Tesla FSD blizzard limitation from ThinkAutonomous 2025 analysis. 🔬
[5] llm-stats.com. “AIME 2025 Benchmark Leaderboard.” 105 models, average 0.783, all self-reported, 0 independently verified. As of March 2026. 🏢
[6] Vals.ai. “AIME Benchmark.” Independent pass@1 averaged over 8 runs: Grok-4 90.6%, o3-Mini 86.5%. States that no model reaches perfect accuracy. 📊
[7] IntuitionLabs. “AIME 2025 Benchmark Analysis.” GPT-4 o4-mini with Python sandbox ~99.5%; top human students 4–6/15 (27–40%). 📰
[8] llm-stats.com. “SWE-Bench Verified Leaderboard.” 77 models, average 0.622, top: Claude Opus 4.5 at 80.9%. As of March 2026. 🏢
[9] morphllm.com. “SWE-Bench Pro Leaderboard (2026).” SEAL: Opus 4.5 45.9%; GPT-5.3-Codex self-reported 56.8%. 📊
[10] OpenAI audit. SWE-Bench Verified data contamination confirmed: frontier models can reproduce gold patches verbatim. OpenAI stopped reporting Verified numbers. Via morphllm.com/swe-bench-pro. 📊
[11] Epoch AI. “What skills does SWE-bench Verified evaluate?” 2025. Scaffold gap up to 20%+; same model with different scaffolds: 22pp gap. 📊
[12] MDPI Information. “AI-Driven Phishing Detection: Enhancing Cybersecurity with RL.” 2025. DQN: 95% accuracy, 2% false positive rate. 🔬
[13] Durotolu, G.A. “Leveraging AI and ML for threat detection in U.S. cybersecurity.” WJARR, 2025. Detection >95%, false positives reduced 60%. 🔬
[14] Proofpoint. “2025 Voice of the CISO Report.” 57% of SOC analysts say traditional intelligence is insufficient. 📰
[15] Klang, E. et al. “Orchestrated multi agents sustain accuracy under clinical-scale workloads.” npj Health Systems, 3, 23, 2026. doi:10.1038/s44401-026-00077-0. Multi-agent 90.6%→65.3%; single agent 73.1%→16.6%. 🔬
[16] Desk365/Bank of America. Erica: 2B interactions, 98% within 44 seconds, 56M monthly interactions. 2025. 🏢
[17] Waymo Safety Impact Dashboard. As of December 2025: 170.7M rider-only miles. waymo.com/safety/impact. 🏢
[18] Waymo Blog, Dec 2023 + Traffic Injury Prevention 2025. Injury rate: Waymo 0.41 vs human 2.78 per million miles (85% reduction); police-reported crash rate 57% lower. 🔬
[19] Sobot.io. “AI Chatbot Accuracy 2026.” OPPO: 83% resolution/94% positive feedback; Industry target ≥85%. 🏢
[20] Ringly.io. “45+ AI Customer Service Statistics 2026.” 98% of leaders consider AI-to-human handoffs important; 90% admit difficulty. 📰
[21] Gitnux. “AI in Translation Industry Statistics 2026.” EN-ES BLEU 94.2%; DeepL medical 89.5%; low-resource 72%; hallucination rate 1.2%. 📰
[22] “Benchmarking GPT-4 against Human Translators.” arXiv:2411.13775, 2024. GPT-4 ≈ junior/mid-level translators, behind senior translators. 🔬
[23] TRANSLIFE. “AI vs Human Translation Accuracy Research Analysis.” 2025. “Human parity” limited to news translation and single language pairs; highly contested. 📰
[24] Towards Data Science. “The Multi-Agent Trap.” March 2026. CMU: best agent on complex office tasks 24%; 99% per step over 10 steps = 90.4%; 85% per step over 10 steps = 19.7%. 📰
[25] Google Research. “Towards a science of scaling agent systems.” 2025. Parallelizable tasks +81%, sequential tasks -70%; independent agents amplify errors 17.2×; threshold value ~45%. NBER, Feb 2026: 89% of firms report zero change. 🔬
[26] Price, I. et al. “Probabilistic weather forecasting with machine learning.” Nature, 2024. GenCast: outperforms ENS on 97.2% of targets, 99.8% at >36h. 🔬
[27] Google DeepMind. “WeatherNext 2.” 2025. Average improvement of 6.5% over GenCast. 🏢
[28] Kestin, G. et al. “AI tutoring outperforms in-class active learning.” Scientific Reports, 15, 17458, 2025. Harvard RCT, effect size0.73–1.3σ. 🔬
[29] Engageli. “25 AI in Education Statistics 2026.” Test scores +54%, completion rate +70%, dropout -15%; Coursera survey: 95% of faculty concerned about over-reliance. 📰
[30] Stanford SCALE Initiative. “How AI can improve tutor effectiveness.” Math +4pp (overall); students of lower-rated tutors +9pp. 🔬
[31] AllAboutAI. “AI in Drug Development Statistics 2026.” AI Phase I 80–90% vs traditional 40–65%; 24 molecules, 21 success (87.5%). 📰
[32] Drug Target Review. “AI in drug discovery: 2025 in review.” Feb 2026. Zero FDA approvals as of Dec 2025; CEO assessment: “let us all down”. 📰
[33] ArticleSledge. “AI Weather Forecasting 2026.” GraphCast/Pangu underestimate 99th-percentile events by 20–35%; for once-in-a-century events, traditional models are superior. 📰
[34] CPA Practice Advisor. Pearl.com 2025 survey: GenAI tax advice up to 50% inaccurate on complex questions (Taxpayer Advocate). 📰
[35] Capitol Tech/GAO. IRS AI audit selection: Black taxpayers audited at 3–5x the rate; GAO identified algorithmic bias; IRS has 129 AI use cases (2024: 54). 📰
[35a] Bai, Y. et al. “A Review of RL in Financial Applications.” Annual Review of Statistics, 12:209–232, 2025. Market-making is the most improved RL sub-domain. 🔬
[35b] Sidley Austin. “AI in Financial Markets: Systemic Risk.” Dec 2024. Bank of England: AI may pose systemic risk. 📰
[35c] MDPI JRFM. “AI and Financial Fragility.” 2025. LLM homogenization amplifies crash risk. 🔬
[36] OneUp Networks. “70% of Accountants Trust AI Tax Tools.” 2025. Audit triggers down 40%; preparation errors reduced 58%. 📰
[37] OpenAI, Sep 2025. Hallucinations persist because training incentives reward guessing over admitting uncertainty. 2025 mathematical proof: under current architectures, hallucinations cannot be eliminated. 🏢
[38] Vectara HHEM Leaderboard. Gemini-2.0-Flash-001: summarization hallucination rate 0.7%; 4 models <1%. As of April 2025. 📊
[39] aboutchromebooks.com. “AI Hallucination Rates 2026.” Open-ended factual questions average 9.2%; 2021→2025: 21.8%→0.7% (96% reduction); annual reduction 3pp. 📰
[40] FreeAcademy.ai. “ChatGPT vs Claude vs Gemini 2026.” Claude ~3%, GPT-5.2 ~6%, Gemini 3 ~6%; o3 PersonQA 33%; Grok-3 news attribution 94% error. 📰
[41] renovateqr.com/Lakera. Global AI hallucination losses in 2024: $67.4B; enterprise per-employee annual cost $14,200. 2025: hundreds of judicial rulings involving AI-fabricated case law (~90%). 📰
[42] Phacet Labs. “AI agents accounting automation 2026.” Accuracy 99.5% within 6 months; manual intervention reduced 70%; errors reduced 95%. 🏢
[43] WifiTalents. “AI in Accounting Industry 2026.” 100% transaction analysis vs traditional sampling; fraud detection +50%; task errors reduced 60%. 📰
[44] DocShipper. “How AI is Changing Logistics 2025.” XPO 99.7% auto matching; Walmart: 4,700 stores, $1.5B saving, 99.2% in-stock rate; UPS ORION 30k routes/min. 🏢
[45] OneReach AI. “How AI Agents Transform Supply Chain.” Logistics costs reduced 15%, inventory improved 35%; process efficiency +25–30%. 📰
[46] Kodexo Labs. “Top AI Agents Supply Chain Logistics 2025.” Demand forecast 95% accuracy; response time +40%. 🏢
[47] SupplyChains Magazine. “Impact of Agentic AI on Supply Chain.” Amazon stockouts down 32%. 📰
[48] Inbound Logistics. “AI in Supply Chain 2026 Outlook.” Industry ratings 2.5–7.5/10 (wide disagreement); data readiness is the real bottleneck; LLMs lack the ability to consistently generate new insights. 📰

v3.0 · 2026-03-21 · LEECHO Global AI Research Lab (human) × Claude Opus 4.6
This table is a risk-awareness reference tool, not a precise prediction. Actual success rates depend on model version, prompt quality, task complexity, scaffold quality, data quality, and other factors. All entries marked 🏢 are company self-reported and may contain selection bias. AIME leaderboard perfect scores are all self-reported (0 independently verified). SWE-bench Verified has confirmed data contamination. Users should prioritize 🔬 and 📊 marked data sources. Continuous revision based on new data is welcome. Please cite when republishing.
