amazon-science/SOP-Bench
Python
Language: Python
License: NOASSERTION
Stars: 21
Forks: 2
Open issues: 6
Created: 2026-02-01T21:42:38Z
Pushed: 2026-06-06T00:04:59Z
Default branch: main
Fork: no
Archived: no
README:
SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents
 
Overview
SOP-Bench is a comprehensive benchmark for evaluating LLM-based agents on complex, multi-step Standard Operating Procedures (SOPs) that are fundamental to industrial automation. Built from 2,000+ tasks across 12 industrial domains (healthcare, logistics, finance, content moderation, etc.), SOP-Bench addresses the gap between existing benchmarks and real-world procedural complexity.
🏭 Human Expert-Authored SOPs · 🤖 Human-AI Collaborative Framework · 📊 Executable Interfaces · 🔧 Two Agent Architectures · 📈 11 Frontier Models Evaluated
News
- [2026-02] 🎉 SOP-Bench submitted to KDD 2026 Datasets and Benchmarks Track.
---
The Problem
Standard Operating Procedures (SOPs) are the backbone of industrial operations—from content moderation to healthcare intake to supply chain logistics. These multi-step procedures require:
- Sequential reasoning across 10-50+ decision points
- Tool orchestration to gather information from multiple systems
- Implicit knowledge that humans learn but rarely document
- Ambiguity handling when procedures don't cover edge cases
Can LLM agents reliably execute these procedures?
Our research shows they struggle: task success rates range from 26.7% to 94.3% depending on the domain.
---
Key Findings
📊 Detailed leaderboard and benchmark results coming soon.
Early findings from our evaluation of 11 frontier models:
- Function-Calling agents: ~64% average task success rate
- ReAct agents: ~55% average task success rate
- High execution rates (95%+) indicate failures are reasoning-based, not technical
- Open-source models (DeepSeek-R1, Llama 3.3) approach proprietary performance
- Architecture-model co-design matters: Newer reasoning models can degrade ReAct performance without targeted prompt engineering
---
What's in SOP-Bench?
2,000+ tasks across 12 industrial domains, created through human-AI collaboration:
| Domain | Description |
|--------|-------------|
| Content Moderation | Bot detection, trust scoring, violation assessment |
| Customer Service | Issue diagnosis, system checks, resolution routing |
| Supply Chain | Safety data sheet analysis, hazard classification |
| Aviation | Pre-flight safety checks, compliance verification |
| Retail | Seller email categorization, routing decisions |
| Finance | Business entity verification, risk assessment |
| Healthcare | Insurance validation, medical history processing |
| Autonomous Driving | Object detection in driving scenarios |
| Media | Content moderation, category assignment |
| Logistics | Package damage assessment, compliance checks |
| *...and more* | |
Human expert-authored SOPs • Mock tools for reproducibility • Ground-truth outputs • Multiple agent architectures
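As a rough illustration of why mock tools and ground-truth outputs make evaluation reproducible, the sketch below replaces a live backend with a fixed lookup table and scores a toy agent against expected decisions. Everything in it is an assumption made for illustration: the tool name `get_account_stats`, the record fields, the decision rule, and the task layout are hypothetical and are not taken from the SOP-Bench codebase.

```python
from typing import Callable

# Canned records stand in for a live backend, so every run sees the same data.
# The record fields and values here are invented for illustration.
MOCK_ACCOUNT_DB = {
    "user_001": {"account_age_days": 3, "posts_per_hour": 120},
    "user_002": {"account_age_days": 900, "posts_per_hour": 2},
}


def get_account_stats(user_id: str) -> dict:
    """Hypothetical mock tool: a deterministic lookup instead of a network call."""
    return MOCK_ACCOUNT_DB[user_id]


def toy_agent(user_id: str, tool: Callable[[str], dict]) -> str:
    """Stand-in for an LLM agent: calls the tool, then applies a made-up rule."""
    stats = tool(user_id)
    is_bot_like = stats["account_age_days"] < 7 and stats["posts_per_hour"] > 60
    return "flag" if is_bot_like else "allow"


# Each task pairs an input with the expected (ground-truth) decision.
TASKS = [
    {"user_id": "user_001", "expected": "flag"},
    {"user_id": "user_002", "expected": "allow"},
]

outcomes = [toy_agent(t["user_id"], get_account_stats) == t["expected"] for t in TASKS]
success_rate = sum(outcomes) / len(outcomes)
print(f"Task success rate: {success_rate:.1%}")
```

Because the tool output is fixed, any change in the success rate between runs comes from the agent rather than from the environment.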
---
Example: What Does an SOP Task Look Like?
Here's a simplified example from the Dangerous Goods benchmark:
SOP Instruction (excerpt):
> *"1. Retrieve the Safety Data Sheet for the product. 2. Check if the product contains any Class 3 flammable liquids. 3. If flash point* […]

⚠️ Security Notice: This package is only available via source installation from this GitHub repository. It is not published on PyPI. Do not install any package named amazon-sop-bench, amazonsopbench, or sop-bench from PyPI; those are not official and may contain malicious code.
Configure AWS (for Bedrock models)
cp .env.example .env
# Edit .env with your AWS credentials and model ID
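Before launching an evaluation it can help to confirm that the edited `.env` file is complete. The sketch below parses `KEY=VALUE` lines and lists required keys that are missing or empty. The key names are assumptions: the three `AWS_*` names are the standard variables read by AWS SDKs, while `MODEL_ID` is a hypothetical placeholder. Check `.env.example` in the repository for the keys SOP-Bench actually expects.

```python
# Standard AWS SDK variable names; the model-ID key is a hypothetical placeholder.
REQUIRED_KEYS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "MODEL_ID",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


sample = """
# Example contents (placeholder values, not real credentials)
AWS_ACCESS_KEY_ID=EXAMPLEKEY
AWS_SECRET_ACCESS_KEY=
AWS_DEFAULT_REGION="us-east-1"
"""
print(missing_keys(parse_env(sample)))
```

For a real file, pass `pathlib.Path(".env").read_text()` to `parse_env` instead of the sample string.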
Run Your First Evaluation
# List available benchmarks
./sop-bench list

# Evaluate on a single task
./sop-bench evaluate content_flagging --agent function_calling --max-tasks 1

# Expected output:
# Evaluating content_flagging with function_calling agent...
# Limited to 1 tasks
#
# Starting evaluation...
# Evaluating content_flagging ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:16
#
# ✓ Evaluation Complete!
# Task Success Rate: 100.0%
# Execution Completion Rate: 100.0%
# Tool Accuracy: 100.0%
Understanding the Metrics:
- Task Success Rate (TSR): Percentage of tasks where the agent made the correct decision
- Execution Completion Rate (ECR): Percentage of tasks that completed without errors
- Conditional Task Success Rate (C-TSR): Of the tasks that completed execution, how many were accurate? This measures decision accuracy for successfully executed tasks only.
- Tool Accuracy: Percentage of tool calls that were correct
Note: TSR = ECR × C-TSR (Task Success Rate equals Execution Completion Rate times Conditional Task Success Rate)
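The relationship between the three rates can be checked with a small worked example. This is a minimal sketch under assumed names: `TaskResult` and `compute_metrics` are hypothetical and do not reflect how SOP-Bench stores or scores its results.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical record layout, not SOP-Bench's own)."""
    completed: bool  # the run finished without an execution error
    correct: bool    # the final decision matched the ground truth


def compute_metrics(results: list[TaskResult]) -> dict[str, float]:
    """Return TSR, ECR and C-TSR as fractions in [0, 1]."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to score")
    completed = sum(r.completed for r in results)
    # A task only counts as a success if it both completed and was correct.
    succeeded = sum(r.completed and r.correct for r in results)
    ecr = completed / total
    c_tsr = succeeded / completed if completed else 0.0
    tsr = succeeded / total
    return {"TSR": tsr, "ECR": ecr, "C-TSR": c_tsr}


# Five tasks: four complete, and three of those four are correct.
results = [
    TaskResult(completed=True, correct=True),
    TaskResult(completed=True, correct=True),
    TaskResult(completed=True, correct=True),
    TaskResult(completed=True, correct=False),
    TaskResult(completed=False, correct=False),
]
metrics = compute_metrics(results)
print(metrics)

# The identity holds up to floating-point rounding.
assert math.isclose(metrics["TSR"], metrics["ECR"] * metrics["C-TSR"])
```

Here ECR is 4/5, C-TSR is 3/4, and TSR is 3/5, so a low TSR paired with a high ECR points to wrong decisions rather than crashed runs.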
Run Full Evaluation
# Run on all tasks and save results
./sop-bench evaluate content_flagging --agent function_calling --output results.json

# Save execution traces for debugging
./sop-bench evaluate content_flagging --agent function_calling --save-traces

# View results
./sop-bench results results.json
Parallel Execution
By default, AmazonSOPBench runs evaluations sequentially (one task at a time). Use the --max-workers option to run tasks in parallel and speed up evaluation. Start with 3 workers and increase up to 10 as long as you are not being throttled. If you hit a throttling exception, reduce the number of workers on the next run.
# Default: Sequential execution (1 worker)
./sop-bench evaluate content_flagging --agent function_calling

# Run with 10 parallel workers
./sop-bench evaluate content_flagging --agent function_calling --max-workers 10

# Run with 5 parallel workers
./sop-bench evaluate content_flagging --agent function_calling --max-workers 5
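The trade-off described above, more workers against a higher risk of throttling, can be sketched with a bounded thread pool from the Python standard library. This is not how SOP-Bench implements --max-workers. `ThrottlingError`, `run_tasks`, and `fake_evaluate` are hypothetical names used only to show the pattern of capping concurrency and counting throttled tasks.

```python
from concurrent.futures import ThreadPoolExecutor


class ThrottlingError(Exception):
    """Hypothetical stand-in for a provider rate-limit exception."""


def run_tasks(tasks, evaluate_task, max_workers=1):
    """Evaluate tasks with a bounded thread pool; max_workers=1 is sequential."""
    results, throttled = [], []

    def safe_eval(task):
        # Catch throttling per task so one failure does not abort the batch.
        try:
            return task, evaluate_task(task), None
        except ThrottlingError as exc:
            return task, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, regardless of finish order.
        for task, result, error in pool.map(safe_eval, tasks):
            if error is None:
                results.append(result)
            else:
                throttled.append(task)
    return results, throttled


def fake_evaluate(task_id):
    """Toy evaluator: pretends every fifth task hits a rate limit."""
    if task_id % 5 == 0:
        raise ThrottlingError(f"task {task_id} was throttled")
    return {"task_id": task_id, "success": True}


results, throttled = run_tasks(range(1, 11), fake_evaluate, max_workers=3)
print(f"{len(results)} completed, {len(throttled)} throttled")
if throttled:
    print("Consider lowering max_workers on the next run.")
```

A non-empty `throttled` list is the signal to rerun with fewer workers, which mirrors the guidance for the CLI option.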
The --max-workers option is also available in batch_evaluate.py for batch runs across multiple models:
# Batch evaluate with 4 parallel workers per model run
python…
Notability
Notability: 4.0/10. New benchmark from Amazon, low stars.