Repo · Amazon (Nova) · published Feb 1, 2026

amazon-science/SOP-Bench

Python

Language: Python

License: NOASSERTION

Stars: 21

Forks: 2

Open issues: 6

Created: 2026-02-01T21:42:38Z

Pushed: 2026-06-06T00:04:59Z

Default branch: main

Fork: no

Archived: no

README:

SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents

![Lint](https://github.com/amazon-science/SOP-Bench/actions/workflows/lint.yml) ![Tests](https://github.com/amazon-science/SOP-Bench/actions/workflows/test.yml)

Overview

SOP-Bench is a comprehensive benchmark for evaluating LLM-based agents on complex, multi-step Standard Operating Procedures (SOPs) that are fundamental to industrial automation. Built from 2,000+ tasks across 12 industrial domains (healthcare, logistics, finance, content moderation, etc.), SOP-Bench addresses the gap between existing benchmarks and real-world procedural complexity.

🏭 Human Expert-Authored SOPs · 🤖 Human-AI Collaborative Framework · 📊 Executable Interfaces · 🔧 Two Agent Architectures · 📈 11 Frontier Models Evaluated

News

  • [2026-02] 🎉 SOP-Bench submitted to KDD 2026 Datasets and Benchmarks Track.

---

The Problem

Standard Operating Procedures (SOPs) are the backbone of industrial operations—from content moderation to healthcare intake to supply chain logistics. These multi-step procedures require:

  • Sequential reasoning across 10-50+ decision points
  • Tool orchestration to gather information from multiple systems
  • Implicit knowledge that humans learn but rarely document
  • Ambiguity handling when procedures don't cover edge cases

Can LLM agents reliably execute these procedures?

Our research shows they struggle significantly—with performance varying dramatically across domains (26.7%-94.3% success rates).

---

Key Findings

📊 Detailed leaderboard and benchmark results coming soon.

Early findings from our evaluation of 11 frontier models:

  • Function-Calling agents: ~64% average task success rate
  • ReAct agents: ~55% average task success rate
  • High execution rates (95%+) indicate failures are reasoning-based, not technical
  • Open-source models (DeepSeek-R1, Llama 3.3) approach proprietary performance
  • Architecture-model co-design matters: Newer reasoning models can degrade ReAct performance without targeted prompt engineering

---

What's in SOP-Bench?

2,000+ tasks across 12 industrial domains, created through human-AI collaboration:

| Domain | Description |
|--------|-------------|
| Content Moderation | Bot detection, trust scoring, violation assessment |
| Customer Service | Issue diagnosis, system checks, resolution routing |
| Supply Chain | Safety data sheet analysis, hazard classification |
| Aviation | Pre-flight safety checks, compliance verification |
| Retail | Seller email categorization, routing decisions |
| Finance | Business entity verification, risk assessment |
| Healthcare | Insurance validation, medical history processing |
| Autonomous Driving | Object detection in driving scenarios |
| Media | Content moderation, category assignment |
| Logistics | Package damage assessment, compliance checks |
| *...and more* | |

  • Human expert-authored SOPs
  • Mock tools for reproducibility
  • Ground-truth outputs
  • Multiple agent architectures

---

Example: What Does an SOP Task Look Like?

Here's a simplified example from the Dangerous Goods benchmark:

SOP Instruction (excerpt):

> *"1. Retrieve the Safety Data Sheet for the product. 2. Check if the product contains any Class 3 flammable liquids. 3. If flash point […]

⚠️ Security Notice: This package is only available via source installation from this GitHub repository. It is not published on PyPI. Do not install any package named amazon-sop-bench, amazonsopbench, or sop-bench from PyPI; those are not official and may contain malicious code.

Configure AWS (for Bedrock models)

cp .env.example .env
# Edit .env with your AWS credentials and model ID

Run Your First Evaluation

# List available benchmarks
./sop-bench list

# Evaluate on a single task
./sop-bench evaluate content_flagging --agent function_calling --max-tasks 1

# Expected output:
# Evaluating content_flagging with function_calling agent...
# Limited to 1 tasks
#
# Starting evaluation...
# Evaluating content_flagging ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:16
#
# ✓ Evaluation Complete!
# Task Success Rate: 100.0%
# Execution Completion Rate: 100.0%
# Tool Accuracy: 100.0%

Understanding the Metrics:

  • Task Success Rate (TSR): Percentage of tasks where the agent made the correct decision
  • Execution Completion Rate (ECR): Percentage of tasks that completed without errors
  • Conditional Task Success Rate (C-TSR): Of the tasks that completed execution, how many were accurate? This measures decision accuracy for successfully executed tasks only.
  • Tool Accuracy: Percentage of tool calls that were correct

Note: TSR = ECR × C-TSR (Task Success Rate equals Execution Completion Rate times Conditional Task Success Rate)
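
The relationship between the three rates can be checked with a small worked example. The sketch below is illustrative only: `TaskResult` and `summarize` are hypothetical names, not part of the SOP-Bench codebase, and the record layout is an assumption rather than the tool's actual result schema.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; not SOP-Bench's actual result schema."""
    completed: bool  # the run finished without an execution error
    correct: bool    # the final decision matched the ground truth


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Compute TSR, ECR, and C-TSR as fractions in [0, 1]."""
    total = len(results)
    if total == 0:
        raise ValueError("need at least one task result")
    completed = sum(r.completed for r in results)
    # A task counts as a success only if it both completed and was correct.
    successes = sum(r.completed and r.correct for r in results)
    ecr = completed / total
    c_tsr = successes / completed if completed else 0.0
    tsr = successes / total
    return {"tsr": tsr, "ecr": ecr, "c_tsr": c_tsr}


# 20 tasks: 13 complete and are correct, 6 complete but are wrong, 1 crashes.
results = (
    [TaskResult(True, True)] * 13
    + [TaskResult(True, False)] * 6
    + [TaskResult(False, False)]
)
metrics = summarize(results)
# TSR = ECR x C-TSR holds up to floating-point rounding.
assert abs(metrics["tsr"] - metrics["ecr"] * metrics["c_tsr"]) < 1e-12
print(metrics)
```

In this made-up sample, ECR is 19/20 = 95% while TSR is 13/20 = 65%, so almost all of the lost accuracy comes from wrong decisions on tasks that ran to completion rather than from execution errors.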

Run Full Evaluation

# Run on all tasks and save results
./sop-bench evaluate content_flagging --agent function_calling --output results.json

# Save execution traces for debugging
./sop-bench evaluate content_flagging --agent function_calling --save-traces

# View results
./sop-bench results results.json

Parallel Execution

By default, AmazonSOPBench runs evaluations sequentially (one task at a time). You can enable parallel execution with the --max-workers option to speed up your evaluations. We recommend starting with 3 workers and increasing to as many as 10 if you are not being throttled. If you hit a throttling exception, reduce the number of workers on the next run.

# Default: Sequential execution (1 worker)
./sop-bench evaluate content_flagging --agent function_calling

# Run with 10 parallel workers
./sop-bench evaluate content_flagging --agent function_calling --max-workers 10

# Run with 5 parallel workers
./sop-bench evaluate content_flagging --agent function_calling --max-workers 5
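
For readers unfamiliar with worker pools, the following is a minimal sketch of what a bounded worker count means for I/O-bound evaluation runs. It does not reflect how SOP-Bench implements --max-workers internally: `run_task`, `evaluate`, `evaluate_with_fallback`, and `ThrottlingError` are all hypothetical names. The fallback loop automates the manual advice above (lower the worker count after throttling) and is an assumption, not a documented feature of the tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ThrottlingError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def run_task(task_id: int) -> str:
    """Placeholder for one agent run; a real run would call the model."""
    time.sleep(0.01)  # simulate I/O-bound latency
    return f"task-{task_id}: done"


def evaluate(task_ids: list[int], max_workers: int = 1) -> list[str]:
    """Run tasks on a bounded thread pool; max_workers=1 is sequential."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, not completion order.
        return list(pool.map(run_task, task_ids))


def evaluate_with_fallback(task_ids: list[int], max_workers: int = 3) -> list[str]:
    """Retry the whole batch with half the workers whenever throttled."""
    workers = max_workers
    while True:
        try:
            return evaluate(task_ids, workers)
        except ThrottlingError:
            if workers == 1:
                raise  # already sequential; nothing left to reduce
            workers = max(1, workers // 2)


outputs = evaluate_with_fallback(list(range(5)), max_workers=3)
print(outputs)
```

Threads suit this workload because each task spends most of its time waiting on a remote model call. The practical ceiling on workers is set by the provider's rate limits rather than by local CPU.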

The --max-workers option is also available in batch_evaluate.py for batch runs across multiple models:

# Batch evaluate with 4 parallel workers per model run
python…
