What does this repo signal mean?

Amazon (Nova) published amazon-science/SOP-Bench (Python). This repository signal surfaces tooling, evaluation, infrastructure, or model-adjacent work that may not yet have appeared in a launch post. High-signal details: repo amazon-science/SOP-Bench · language Python · New benchmark from Amazon, low stars. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Amazon (Nova) Repo: amazon-science/SOP-Bench

Captured source

GitHub/github.com/amazon-science/SOP-Bench

amazon-science/SOP-Bench repository metadata

published Feb 1, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

amazon-science/SOP-Bench

Language: Python

License: NOASSERTION

Stars: 21

Forks: 2

Open issues: 6

Created: 2026-02-01T21:42:38Z

Pushed: 2026-06-06T00:04:59Z

Default branch: main

Fork: no

Archived: no

README:

SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents

Overview

SOP-Bench is a comprehensive benchmark for evaluating LLM-based agents on complex, multi-step Standard Operating Procedures (SOPs) that are fundamental to industrial automation. Built from 2,000+ tasks across 12 industrial domains (healthcare, logistics, finance, content moderation, etc.), SOP-Bench addresses the gap between existing benchmarks and real-world procedural complexity.

🏭 Human Expert-Authored SOPs · 🤖 Human-AI Collaborative Framework · 📊 Executable Interfaces · 🔧 Two Agent Architectures · 📈 11 Frontier Models Evaluated

News

[2026-02] 🎉 SOP-Bench submitted to KDD 2026 Datasets and Benchmarks Track.

---

The Problem

Standard Operating Procedures (SOPs) are the backbone of industrial operations—from content moderation to healthcare intake to supply chain logistics. These multi-step procedures require:

Sequential reasoning across 10-50+ decision points
Tool orchestration to gather information from multiple systems
Implicit knowledge that humans learn but rarely document
Ambiguity handling when procedures don't cover edge cases

Can LLM agents reliably execute these procedures?

Our research shows they struggle significantly—with performance varying dramatically across domains (26.7%-94.3% success rates).

---

Key Findings

📊 Detailed leaderboard and benchmark results coming soon.

Early findings from our evaluation of 11 frontier models:

Function-Calling agents: ~64% average task success rate
ReAct agents: ~55% average task success rate
High execution rates (95%+) indicate failures are reasoning-based, not technical
Open-source models (DeepSeek-R1, Llama 3.3) approach proprietary performance
Architecture-model co-design matters: Newer reasoning models can degrade ReAct performance without targeted prompt engineering

---

What's in SOP-Bench?

2,000+ tasks across 12 industrial domains, created through human-AI collaboration:

| Domain | Description |
|--------|-------------|
| Content Moderation | Bot detection, trust scoring, violation assessment |
| Customer Service | Issue diagnosis, system checks, resolution routing |
| Supply Chain | Safety data sheet analysis, hazard classification |
| Aviation | Pre-flight safety checks, compliance verification |
| Retail | Seller email categorization, routing decisions |
| Finance | Business entity verification, risk assessment |
| Healthcare | Insurance validation, medical history processing |
| Autonomous Driving | Object detection in driving scenarios |
| Media | Content moderation, category assignment |
| Logistics | Package damage assessment, compliance checks |
| *...and more* | |

Human expert-authored SOPs • Mock tools for reproducibility • Ground-truth outputs • Multiple agent architectures

---

Example: What Does an SOP Task Look Like?

Here's a simplified example from the Dangerous Goods benchmark:

SOP Instruction (excerpt):

> *"1. Retrieve the Safety Data Sheet for the product. 2. Check if the product contains any Class 3 flammable liquids. 3. If flash point […]"*

⚠️ Security Notice: This package is only available via source installation from this GitHub repository. It is not published on PyPI. Do not install any package named amazon-sop-bench, amazonsopbench, or sop-bench from PyPI; those are not official and may contain malicious code.

Configure AWS (for Bedrock models)

cp .env.example .env
# Edit .env with your AWS credentials and model ID

Run Your First Evaluation

# List available benchmarks
./sop-bench list

# Evaluate on a single task
./sop-bench evaluate content_flagging --agent function_calling --max-tasks 1

# Expected output:
# Evaluating content_flagging with function_calling agent...
# Limited to 1 tasks
#
# Starting evaluation...
# Evaluating content_flagging ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:16
#
# ✓ Evaluation Complete!
# Task Success Rate: 100.0%
# Execution Completion Rate: 100.0%
# Tool Accuracy: 100.0%

Understanding the Metrics:

Task Success Rate (TSR): Percentage of tasks where the agent made the correct decision
Execution Completion Rate (ECR): Percentage of tasks that completed without errors
Conditional Task Success Rate (C-TSR): Of the tasks that completed execution, how many were accurate? This measures decision accuracy for successfully executed tasks only.
Tool Accuracy: Percentage of tool calls that were correct

Note: TSR = ECR × C-TSR (Task Success Rate equals Execution Completion Rate times Conditional Task Success Rate)
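
The relationship between the three rates can be checked with a small worked example. The sketch below is illustrative only: `TaskResult`, `summarize`, and the sample data are hypothetical names introduced here and are not part of the SOP-Bench codebase. It assumes each task records whether execution completed and whether the final decision matched the ground truth, and that a task counts as a success only if both hold.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record (not the SOP-Bench result schema)."""
    completed: bool  # execution finished without errors
    correct: bool    # final decision matched the ground truth


def summarize(results):
    """Compute TSR, ECR, and C-TSR as fractions in [0, 1]."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")
    completed = [r for r in results if r.completed]
    # A task only counts as a success if it completed AND was correct.
    successes = sum(1 for r in completed if r.correct)
    ecr = len(completed) / total
    c_tsr = successes / len(completed) if completed else 0.0
    tsr = successes / total
    return {"TSR": tsr, "ECR": ecr, "C-TSR": c_tsr}


# Four made-up tasks: two correct, one completed but wrong, one crashed.
results = [
    TaskResult(completed=True, correct=True),
    TaskResult(completed=True, correct=True),
    TaskResult(completed=True, correct=False),
    TaskResult(completed=False, correct=False),
]
metrics = summarize(results)

# The identity TSR = ECR x C-TSR holds up to floating-point rounding.
assert abs(metrics["TSR"] - metrics["ECR"] * metrics["C-TSR"]) < 1e-9
print(metrics)
```

Here three of four tasks complete (ECR = 0.75), two of those three are correct (C-TSR ≈ 0.667), and the product gives TSR = 0.5. A low TSR with a high ECR therefore points to wrong decisions rather than execution failures.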

Run Full Evaluation

# Run on all tasks and save results
./sop-bench evaluate content_flagging --agent function_calling --output results.json

# Save execution traces for debugging
./sop-bench evaluate content_flagging --agent function_calling --save-traces

# View results
./sop-bench results results.json

Parallel Execution

By default, AmazonSOPBench runs evaluations sequentially (one task at a time). Use the --max-workers option to run tasks in parallel and speed up evaluations. Start with 3 workers and increase up to 10 as long as you are not being throttled. If you hit a throttling exception, reduce the number of workers on the next run.

# Default: Sequential execution (1 worker)
./sop-bench evaluate content_flagging --agent function_calling

# Run with 10 parallel workers
./sop-bench evaluate content_flagging --agent function_calling --max-workers 10

# Run with 5 parallel workers
./sop-bench evaluate content_flagging --agent function_calling --max-workers 5
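
As a general illustration of what a worker cap means, the sketch below runs placeholder tasks through a bounded thread pool. It is not the SOP-Bench implementation of --max-workers: `run_task` and `evaluate` are hypothetical names, and the task body is a stand-in for a real model call.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id):
    """Hypothetical stand-in for evaluating one task (no real model call)."""
    return {"task_id": task_id, "success": True}


def evaluate(task_ids, max_workers=1):
    """Run tasks with at most `max_workers` in flight at any time.

    max_workers=1 reproduces sequential execution; larger values trade
    speed for a higher chance of hitting provider rate limits.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, regardless of
        # which worker finishes first.
        return list(pool.map(run_task, task_ids))


outcomes = evaluate(range(6), max_workers=3)
print(len(outcomes), "tasks evaluated")
```

Because the pool size bounds the number of concurrent requests, lowering it is the simplest response to throttling, at the cost of a longer wall-clock run.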

The --max-workers option is also available in batch_evaluate.py for batch runs across multiple models:

# Batch evaluate with 4 parallel workers per model run
python...

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

New benchmark from Amazon, low stars