amazon-science/TurboFuzzLLM
Description: TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
Language: Python
License: Apache-2.0
Stars: 24
Forks: 2
Open issues: 0
Created: 2025-04-28T17:26:38Z
Pushed: 2025-11-24T18:41:17Z
Default branch: main
Fork: no
Archived: no
README:
TurboFuzzLLM
Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking LLMs in Practice
A state-of-the-art tool for automatic red teaming of Large Language Models (LLMs) that generates effective adversarial prompt templates to identify vulnerabilities and improve AI safety.
⚠️ Responsible Use
This tool is designed for improving AI safety through systematic vulnerability testing. It should be used responsibly for defensive purposes and developing better safeguards for LLMs.
Our primary goal is to advance the development of more robust and safer AI systems by identifying and addressing their vulnerabilities. We believe this research will ultimately benefit the AI community by enabling the development of better safety measures and alignment techniques.
📖 Table of Contents
- [🚀 Getting Started](#-getting-started)
- [🎯 Key Features](#-key-features)
- [🔧 Method Overview](#-method-overview)
- [🔄 Architecture and Data Flow](#-architecture-and-data-flow)
- [📊 Results](#-results)
- [🛡️ Applications](#️-applications)
- [⚙️ Configuration](#️-configuration)
- [🤖 Supported Models](#-supported-models)
- [🧑💻 Development](#-development)
- [📁 Codebase Structure](#-codebase-structure)
- [📂 Understanding Output](#-understanding-output)
- [🔧 Troubleshooting](#-troubleshooting)
- [👥 Meet the Team](#-meet-the-team)
- [Security](#security)
- [License](#license)
- [Citation](#citation)
🚀 Getting Started
Prerequisites
- Python 3.8+ and pip
- Provider access: OpenAI API key for gpt-*/o1-*, AWS credentials for Bedrock/SageMaker (configure with aws configure)
- Optional local models: Hugging Face-compatible checkpoints (e.g., Gemma/Zephyr) for offline judge/target use
Install
git clone https://github.com/amazon-science/TurboFuzzLLM.git
cd TurboFuzzLLM
python -m venv .venv && source .venv/bin/activate  # optional but recommended
pip install --upgrade pip
pip install -e .
> Network/cost safety: SageMaker endpoint deployment and Bedrock validation are blocked by default; pass --allow-endpoint-deploy explicitly when you intend to enable them.
Quick Start
1) Download seed templates:
python3 scripts/get_templates_gptfuzzer.py
2) Run an interactive attack:
python3 src/__main__.py answer --target-model-id gpt-4o --api-key YOUR_OPENAI_KEY
3) Batch attack HarmBench (AWS Bedrock):
turbofuzzllm attack --target-model-id us.anthropic.claude-3-5-sonnet-20241022-v2:0 --max-queries 1000
Results appear under output//*/.
🎯 Key Features
- High Success Rate: Achieves >98% Attack Success Rate (ASR) on GPT-4o, GPT-4 Turbo, and other leading LLMs
- Efficient: 3x fewer queries and 2x more successful templates compared to previous methods
- Generalizable: >90% ASR on unseen harmful questions
- Practical: Easy-to-use CLI with statistics, search visualization, and logging
- Defensive Applications: Generated data improves model safety (74% safer after fine-tuning)
🔧 Method Overview
TurboFuzzLLM performs black-box mutation-based fuzzing to iteratively generate new adversarial red teaming templates. Key innovations include:
1. Expanded Mutation Space: New mutation operations including refusal suppression
2. Reinforcement Learning: Feedback-guided prioritized search
3. Intelligent Heuristics: Efficient exploration with fewer LLM queries
4. Template-Based Approach: Templates can be combined with any harmful question for scalable attacks
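The template-based approach can be illustrated with a minimal sketch: one template carries a placeholder, and any question can be substituted into it while per-template statistics accumulate. The class name `TemplateRecord`, its fields, and the placeholder token are assumptions made for illustration and are not taken from the repository; the questions are benign stand-ins.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed placeholder token; the repository may use a different marker.
PLACEHOLDER = "[INSERT PROMPT HERE]"


@dataclass(eq=False)
class TemplateRecord:
    """Hypothetical bookkeeping for one prompt template (not the repo's Template class)."""

    text: str
    parent: Optional["TemplateRecord"] = field(default=None, repr=False)
    children: List["TemplateRecord"] = field(default_factory=list, repr=False)
    attempts: int = 0
    jailbreaks: int = 0

    def synthesize(self, question: str) -> str:
        # Combine the template with a question by filling the placeholder.
        if PLACEHOLDER not in self.text:
            raise ValueError("template has no placeholder")
        return self.text.replace(PLACEHOLDER, question)

    def record(self, jailbroken: bool) -> None:
        # Count one evaluated question and whether the judge flagged it.
        self.attempts += 1
        self.jailbreaks += int(jailbroken)

    @property
    def asr(self) -> float:
        # Assumed ASR definition: jailbroken questions / evaluated questions.
        return self.jailbreaks / self.attempts if self.attempts else 0.0

    def spawn(self, mutated_text: str) -> "TemplateRecord":
        # Register a mutated child so the evolution tree can be reconstructed.
        child = TemplateRecord(text=mutated_text, parent=self)
        self.children.append(child)
        return child
```

Because the question is only substituted at evaluation time, the same template can be reused across an arbitrary question set, which is what makes per-template ASR a meaningful statistic.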
🔄 Architecture and Data Flow
High-Level Architecture
TurboFuzzLLM performs black-box mutation-based fuzzing to generate adversarial prompt templates for jailbreaking LLMs. It uses reinforcement learning to prioritize effective mutations.
Key Components
1. Fuzzer Core (fuzzer/core.py):
- TurboFuzzLLMFuzzer: Main orchestrator class.
- Manages questions, templates, mutations, evaluations, and statistics.
2. Models (llm/):
- TargetModel: The LLM being attacked (e.g., GPT-4, Claude).
- MutatorModel: LLM used for generating mutations (e.g., for paraphrasing templates).
- JudgeModel: Determines if a response is "jailbroken" (vulnerable to the attack).
3. Mutation System (fuzzer/mutators.py, fuzzer/mutator_selection.py):
- Various mutation operators: ExpandBefore, FewShots, Rephrase, Crossover, etc.
- Selection policies: QLearning, UCB, Random, RoundRobin, MCTS, EXP3.
4. Template System (fuzzer/template.py):
- Template class: Represents adversarial prompt templates.
- Tracks ASR (Attack Success Rate), jailbreaks, parent/child relationships.
5. Template Selection (fuzzer/template_selection.py):
- Policies for selecting which template to mutate next (reinforcement learning-based).
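Of the selection policies listed above, UCB is the simplest to sketch. The following is a textbook UCB1 bandit over mutation operators, not the repository's implementation: the class name `UCBMutatorSelector`, the exploration constant, and the choice of reward signal are all assumptions.

```python
import math
from typing import Dict, List


class UCBMutatorSelector:
    """Textbook UCB1 over mutation operators (illustrative, not the repo's class)."""

    def __init__(self, operators: List[str], exploration: float = math.sqrt(2)):
        self.operators = list(operators)
        self.exploration = exploration
        self.pulls: Dict[str, int] = {op: 0 for op in self.operators}
        self.reward_sum: Dict[str, float] = {op: 0.0 for op in self.operators}

    def select(self) -> str:
        # Try every operator once before trusting the statistics.
        for op in self.operators:
            if self.pulls[op] == 0:
                return op
        total = sum(self.pulls.values())

        def score(op: str) -> float:
            # Mean observed reward plus an exploration bonus that shrinks with use.
            mean = self.reward_sum[op] / self.pulls[op]
            bonus = self.exploration * math.sqrt(math.log(total) / self.pulls[op])
            return mean + bonus

        return max(self.operators, key=score)

    def update(self, op: str, reward: float) -> None:
        # Assumed reward: e.g. the fraction of questions the mutated template jailbroke.
        self.pulls[op] += 1
        self.reward_sum[op] += reward


selector = UCBMutatorSelector(["ExpandBefore", "FewShots", "Rephrase", "Crossover"])
first_choice = selector.select()
```

The exploration bonus is what keeps rarely used operators from being starved: an operator with few pulls gets a large bonus and is retried even if its early rewards were poor.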
Data Flow
1. Initialization: Load initial templates, questions, and configure models.
2. Warmup Phase: Evaluate initial templates on a subset of questions.
3. Mutation Loop:
- Select template using selection policy (e.g., QLearning).
- Select mutation using mutation policy.
- Apply mutation to generate new template.
- Evaluate new template on remaining questions.
- Update selection/mutation policies based on results.
- Repeat until a stopping criterion is met (query limit reached or all questions jailbroken).
4. Evaluation: For each template-question pair:
- Synthesize prompt (replace placeholder in template with question).
- Query target model.
- Judge response for vulnerability.
- Track statistics and jailbreaks.
5. Output: Generate CSV files, logs, statistics, and visualization of template evolution tree.
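The data flow above can be sketched as a single loop. This is a simplified illustration under stated assumptions, not the project's code: the function `fuzz`, its parameters, the placeholder token, the warmup size, and the success-weighted parent selection are all hypothetical, and the target, mutator, and judge are benign offline stubs rather than real models.

```python
import random
from typing import Callable, Dict, List, Set

PLACEHOLDER = "[INSERT PROMPT HERE]"  # assumed marker, not confirmed by the source


def fuzz(
    seeds: List[str],
    questions: List[str],
    mutate: Callable[[str], str],
    target: Callable[[str], str],
    judge: Callable[[str, str], bool],
    max_queries: int,
    rng: random.Random,
    warmup_size: int = 2,
) -> Dict[str, Set[str]]:
    """Return a map from template text to the set of questions it jailbroke."""
    wins: Dict[str, Set[str]] = {}
    remaining = set(questions)
    queries = 0

    def evaluate(template: str, batch: List[str]) -> None:
        nonlocal queries
        for question in batch:
            if queries >= max_queries:
                return
            prompt = template.replace(PLACEHOLDER, question)  # synthesize prompt
            response = target(prompt)                         # query target model
            queries += 1
            if judge(question, response):                     # judge the response
                wins[template].add(question)
                remaining.discard(question)

    # Warmup: evaluate each seed template on a small subset of questions.
    for seed in seeds:
        wins.setdefault(seed, set())
        evaluate(seed, questions[:warmup_size])

    pool = list(seeds)
    while remaining and queries < max_queries:
        # Stand-in selection policy: favor templates with more successes.
        weights = [len(wins[t]) + 1 for t in pool]
        parent = rng.choices(pool, weights=weights, k=1)[0]
        child = mutate(parent)
        wins.setdefault(child, set())
        evaluate(child, sorted(remaining))  # only questions not yet jailbroken
        if child not in pool:
            pool.append(child)
    return wins


# Benign stubs so the sketch runs offline.
def demo_mutate(template: str) -> str:
    return "MUTATED " + template


def demo_target(prompt: str) -> str:
    return prompt  # echoes the prompt back


def demo_judge(question: str, response: str) -> bool:
    return response.startswith("MUTATED")


seed = "Seed: " + PLACEHOLDER
result = fuzz([seed], ["q1", "q2", "q3"], demo_mutate, demo_target, demo_judge,
              max_queries=50, rng=random.Random(0))
```

Evaluating a new template only on the questions that remain unbroken is what keeps the query count down; a real run would replace the weighted choice with one of the learned selection policies and write the statistics to disk.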
📊 Results
| Metric | Performance |
|--------|-------------|
| ASR on GPT-4o/GPT-4 Turbo | >98% |
| ASR on unseen questions | >90% |
| Query efficiency | 3x fewer queries |
| Template success rate | 2x improvement |
| Model safety improvement | 74% safer after adversarial training |
🛡️ Applications
1. Vulnerability Identification: Discover prompt-based attack vectors in LLMs
2. Countermeasure Development:
- Improve in-built LLM safeguards
- Create external guardrails
3. Adversarial Training: Generate high-quality (attack…