amazon-science/MultiTurnAgentAttack
Language: Python
License: NOASSERTION
Stars: 12
Forks: 7
Open issues: 9
Created: 2025-10-28T18:39:56Z
Pushed: 2026-03-31T13:13:30Z
Default branch: main
Fork: no
Archived: no
README:
MultiTurnAgentAttack
This repository contains data and code for the STAC (Sequential Tool Attack Chaining) framework, which generates and evaluates multi-turn adversarial attacks against LLM agents in tool-use environments.
📄 Paper: STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
Quick Start
Just want to evaluate your model on our STAC Benchmark?
To evaluate a model on the pre-generated STAC Benchmark (483 test cases), skip the full pipeline and run:
# Set up environment
conda env create -f environment.yml
conda activate STAC
export OPENAI_API_KEY="your-openai-api-key-here"  # For planner and judge models

# Run evaluation
python -m STAC_eval.eval_STAC_benchmark \
    --model_agent gpt-4.1 \
    --defense no_defense \
    --batch_size 512
Input: data/STAC_benchmark_data.json (483 test cases from both SHADE-Arena and Agent-SafetyBench)
Output: Evaluation results in data/Eval/{model_planner}/{model_agent}/{defense}/gen_res.json
The benchmark automatically handles both SHADE-Arena and Agent-SafetyBench test cases. Skip to the Benchmark Evaluation (STAC_eval) section for more details.
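The output location above is a template with three placeholders. As a minimal sketch (the helper name `results_path` is hypothetical and not part of the repository), the path for a given run could be assembled like this. It assumes the model names are used verbatim as directory names; whether the evaluation script sanitizes identifiers such as Bedrock model IDs containing `:` is not stated in the README.

```python
from pathlib import Path


def results_path(model_planner: str, model_agent: str, defense: str,
                 output_dir: str = "data/Eval") -> Path:
    """Fill in the documented template:
    {output_dir}/{model_planner}/{model_agent}/{defense}/gen_res.json

    Hypothetical helper for locating results; assumes names are used as-is.
    """
    return Path(output_dir) / model_planner / model_agent / defense / "gen_res.json"


if __name__ == "__main__":
    # Quick Start run: planner defaults to gpt-4.1, agent is gpt-4.1.
    print(results_path("gpt-4.1", "gpt-4.1", "no_defense").as_posix())
```

For the Quick Start command this resolves to `data/Eval/gpt-4.1/gpt-4.1/no_defense/gen_res.json`.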
---
Full Setup
Prerequisites
Clone the repository and set up the conda environment on a Linux machine:
# Clone the repository
cd MultiTurnAgentAttack-main

# Create and activate conda environment
conda env create -f environment.yml
conda activate STAC
Alternative Setup (pip/venv)
If you prefer using pip instead of conda:
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
Environment Variables
Before running any scripts, configure the required API keys:
OpenAI API (Required for most steps)
export OPENAI_API_KEY="your-openai-api-key-here"
Hugging Face (Required for HF models)
export HF_TOKEN="your-hf-token-here"
AWS Bedrock API (Required for AWS models only)
export AWS_ACCESS_KEY_ID="your-aws-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-aws-secret-access-key"
export AWS_SESSION_TOKEN="your-aws-session-token"
Note: AWS session tokens expire every 12 hours and must be refreshed regularly.
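Because a missing key typically surfaces only after a run has started, it can help to check the environment up front. The following is a minimal sketch, not code from the repository: the backend labels (`openai`, `huggingface`, `bedrock`) and the function `missing_env` are hypothetical names chosen for illustration, while the variable names are the ones listed above.

```python
import os

# Variables named in the README, grouped by the backend that needs them.
# The backend labels are illustrative, not identifiers used by the repo.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "huggingface": ["HF_TOKEN"],
    "bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"],
}


def missing_env(backends, environ=None):
    """Return required variables that are unset or empty, in first-seen order."""
    env = os.environ if environ is None else environ
    missing = []
    for backend in backends:
        for name in REQUIRED_ENV[backend]:
            if not env.get(name) and name not in missing:
                missing.append(name)
    return missing


if __name__ == "__main__":
    # Example: a Bedrock agent evaluated with the default OpenAI planner and judge.
    unset = missing_env(["openai", "bedrock"])
    if unset:
        print("Missing environment variables: " + ", ".join(unset))
```

This check only confirms that the variables are present; it cannot tell whether an AWS session token has already expired.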
---
Repository Structure
MultiTurnAgentAttack-main/
├── STAC_gen/                                # Full STAC attack generation pipeline
│   ├── step_1_gen_tool_chains.py            # Generate tool chain attacks
│   ├── step_2_verify_tool_chains.py         # Verify generated attacks
│   ├── step_3_reverse_engineer_prompts.py   # Generate adversarial prompts
│   └── step_4_eval_adaptive_planning.py     # Evaluate with adaptive planning
│
├── STAC_eval/                               # Benchmark evaluation
│   └── eval_STAC_benchmark.py               # Evaluate models on STAC benchmark
│
├── data/
│   └── STAC_benchmark_data.json             # Pre-generated benchmark (483 cases)
│
├── Agent_SafetyBench/                       # Agent-SafetyBench environments
├── SHADE_Arena/                             # SHADE-Arena environments
│
├── src/                                     # Core implementation
│   ├── Agents.py
│   ├── Environments.py
│   ├── LanguageModels.py
│   ├── STAC.py
│   └── utils.py
│
└── prompts/                                 # System prompts for all components
---
Overview
The repository provides two main usage modes:
1. Benchmark Evaluation (STAC_eval/)
Evaluate your model on our pre-generated STAC Benchmark containing 483 test cases across SHADE-Arena and Agent-SafetyBench environments. This is the recommended starting point for most users.
2. Full Automated Attack Generation Pipeline (STAC_gen/)
Generate new attacks from scratch using the complete 4-step STAC pipeline. This is useful for:
- Creating attacks for new environments
- Experimenting with different attack generation strategies
- Extending the benchmark with additional test cases
---
Benchmark Evaluation (STAC_eval)
Quick Evaluation
Evaluate your agent against the STAC Benchmark with a single command:
python -m STAC_eval.eval_STAC_benchmark \
    --model_agent gpt-4.1 \
    --defense no_defense \
    --batch_size 512
Parameters
- --input_path: Path to benchmark data (default: data/STAC_benchmark_data.json)
- --output_dir: Output directory (default: data/Eval)
- --model_planner: Model for attack planning (default: gpt-4.1)
- --model_judge: Model for evaluation (default: gpt-4.1)
- --model_agent: Your model to evaluate (default: gpt-4.1)
- --temperature: Sampling temperature (default: 0.0)
- --top_p: Nucleus sampling parameter (default: 0.95)
- --batch_size: Batch size for evaluation (default: 1)
- --region: AWS region for Bedrock API (default: us-west-2)
- --max_n_turns: Maximum conversation turns (default: 3)
- --defense: Defense mechanism to evaluate (default: no_defense)
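To see how these flags and defaults fit together, the sketch below mirrors them in a standalone `argparse` parser. It is an illustration only, not the repository's actual parser: the function name `build_parser` is hypothetical, and the argument types (`float`, `int`) are inferred from the documented default values.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented flags and defaults (types are assumptions)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented eval flags")
    parser.add_argument("--input_path", default="data/STAC_benchmark_data.json")
    parser.add_argument("--output_dir", default="data/Eval")
    parser.add_argument("--model_planner", default="gpt-4.1")
    parser.add_argument("--model_judge", default="gpt-4.1")
    parser.add_argument("--model_agent", default="gpt-4.1")
    parser.add_argument("--temperature", type=float, default=0.0)
    parser.add_argument("--top_p", type=float, default=0.95)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--region", default="us-west-2")
    parser.add_argument("--max_n_turns", type=int, default=3)
    parser.add_argument("--defense", default="no_defense")
    return parser


if __name__ == "__main__":
    # Only the flags passed on the command line differ from the defaults.
    args = build_parser().parse_args(["--model_agent", "gpt-4.1-mini", "--batch_size", "512"])
    print(vars(args))
```

Note that the documented default for `--batch_size` is 1, while the example commands for OpenAI models pass 512 explicitly.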
Defense Mechanisms
Evaluate different defense strategies:
# No defense (baseline)
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense no_defense

# Spotlighting (datamarking user prompts)
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense spotlighting

# Failure mode awareness
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense failure_modes

# User intent summarization
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense summarization

# Harm-benefit reasoning
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense reasoning
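Comparing all five defenses for one agent means repeating the same command with a different `--defense` value. A small driver can generate those commands; the sketch below only builds and prints them. The names `DEFENSES` and `eval_command` are hypothetical, and the defense values are the five shown above.

```python
import shlex
import sys

# The five --defense values shown in the README examples.
DEFENSES = ["no_defense", "spotlighting", "failure_modes", "summarization", "reasoning"]


def eval_command(model_agent: str, defense: str, python: str = sys.executable) -> list:
    """Build the argument list for one benchmark run (does not execute it)."""
    return [python, "-m", "STAC_eval.eval_STAC_benchmark",
            "--model_agent", model_agent, "--defense", defense]


if __name__ == "__main__":
    for defense in DEFENSES:
        # Print a shell-safe rendering of each command.
        print(shlex.join(eval_command("gpt-4.1", defense, python="python")))
```

To actually launch the runs, each list could be passed to `subprocess.run(cmd, check=True)` from the repository root, since `python -m STAC_eval.eval_STAC_benchmark` resolves the module relative to the working directory.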
Evaluating Different Models
OpenAI Models
Requirements: OPENAI_API_KEY
# GPT-4.1
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --batch_size 512

# GPT-4.1-mini
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1-mini --batch_size 512
AWS Bedrock Models
Requirements: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN
# Llama 3.3 70B
python -m STAC_eval.eval_STAC_benchmark \
    --model_agent us.meta.llama3-3-70b-instruct-v1:0 \
    --batch_size 10

# Llama 3.1 405B
python -m STAC_eval.eval_STAC_benchmark \
    --model_agent meta.llama3-1-405b-instruct-v1:0 \
    --batch_size 10
Hugging Face Models
Requirements: GPU access (tested on H100s), HF_TOKEN
# Request GPU node # e.g., on…