Repo: Amazon (Nova), published Apr 28, 2025

amazon-science/TurboFuzzLLM

Python

Description: TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice

Language: Python

License: Apache-2.0

Stars: 24

Forks: 2

Open issues: 0

Created: 2025-04-28T17:26:38Z

Pushed: 2025-11-24T18:41:17Z

Default branch: main

Fork: no

Archived: no

README:

TurboFuzzLLM

Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking LLMs in Practice

A state-of-the-art tool for automatic red teaming of Large Language Models (LLMs) that generates effective adversarial prompt templates to identify vulnerabilities and improve AI safety.

⚠️ Responsible Use

This tool is designed for improving AI safety through systematic vulnerability testing. It should be used responsibly for defensive purposes and developing better safeguards for LLMs.

Our primary goal is to advance the development of more robust and safer AI systems by identifying and addressing their vulnerabilities. We believe this research will ultimately benefit the AI community by enabling the development of better safety measures and alignment techniques.

📖 Table of Contents

  • [🚀 Getting Started](#-getting-started)
  • [🎯 Key Features](#-key-features)
  • [🔧 Method Overview](#-method-overview)
  • [🔄 Architecture and Data Flow](#-architecture-and-data-flow)
  • [📊 Results](#-results)
  • [🛡️ Applications](#️-applications)
  • [⚙️ Configuration](#️-configuration)
  • [🤖 Supported Models](#-supported-models)
  • [🧑‍💻 Development](#-development)
  • [📁 Codebase Structure](#-codebase-structure)
  • [📂 Understanding Output](#-understanding-output)
  • [🔧 Troubleshooting](#-troubleshooting)
  • [👥 Meet the Team](#-meet-the-team)
  • [Security](#security)
  • [License](#license)
  • [Citation](#citation)

🚀 Getting Started

Prerequisites

  • Python 3.8+ and pip
  • Provider access: OpenAI API key for gpt-*/o1-*, AWS credentials for Bedrock/SageMaker (configure with aws configure)
  • Optional local models: Hugging Face-compatible checkpoints (e.g., Gemma/Zephyr) for offline judge/target use

Install

git clone https://github.com/amazon-science/TurboFuzzLLM.git
cd TurboFuzzLLM
python -m venv .venv && source .venv/bin/activate # optional but recommended
pip install --upgrade pip
pip install -e .

> Network/cost safety: SageMaker endpoint deployment and Bedrock validation are blocked by default; pass --allow-endpoint-deploy explicitly when you intend to enable them.

Quick Start

1) Download seed templates:

python3 scripts/get_templates_gptfuzzer.py

2) Run an interactive attack:

python3 src/__main__.py answer --target-model-id gpt-4o --api-key YOUR_OPENAI_KEY

3) Batch attack HarmBench (AWS Bedrock):

turbofuzzllm attack --target-model-id us.anthropic.claude-3-5-sonnet-20241022-v2:0 --max-queries 1000

Results appear under output//*/.

🎯 Key Features

  • High Success Rate: Achieves >98% Attack Success Rate (ASR) on GPT-4o, GPT-4 Turbo, and other leading LLMs
  • Efficient: Requires 3x fewer queries and yields 2x more successful templates than previous methods
  • Generalizable: >90% ASR on unseen harmful questions
  • Practical: Easy-to-use CLI with statistics, search visualization, and logging
  • Defensive Applications: Generated data improves model safety (74% safer after fine-tuning)

🔧 Method Overview

TurboFuzzLLM performs black-box mutation-based fuzzing to iteratively generate new adversarial red teaming templates. Key innovations include:

1. Expanded Mutation Space: New mutation operations, including refusal suppression

2. Reinforcement Learning: Feedback-guided prioritized search

3. Intelligent Heuristics: Efficient exploration with fewer LLM queries

4. Template-Based Approach: Templates can be combined with any harmful question for scalable attacks
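The feedback-guided search in item 2 can be illustrated with a generic bandit policy. The sketch below is a textbook UCB1 selector over mutation operator names. It is not taken from the repository: the class name `UCBMutatorSelector`, the `exploration` constant, and the idea of using a reward in [0, 1] (for example, the fraction of questions a mutated template jailbroke) are all assumptions made for illustration.

```python
import math


class UCBMutatorSelector:
    """Textbook UCB1 over a fixed set of mutation operator names (hypothetical class)."""

    def __init__(self, operators, exploration=1.4):
        self.operators = list(operators)
        self.exploration = exploration  # assumed constant, not from the source
        self.pulls = {op: 0 for op in self.operators}
        self.total_reward = {op: 0.0 for op in self.operators}

    def select(self):
        # Try every operator once before applying the UCB formula.
        for op in self.operators:
            if self.pulls[op] == 0:
                return op
        total_pulls = sum(self.pulls.values())

        def score(op):
            mean = self.total_reward[op] / self.pulls[op]
            bonus = self.exploration * math.sqrt(math.log(total_pulls) / self.pulls[op])
            return mean + bonus

        return max(self.operators, key=score)

    def update(self, op, reward):
        # Record the observed reward for the operator that was just applied.
        self.pulls[op] += 1
        self.total_reward[op] += reward
```

A caller would alternate `select()` and `update()`: operators that keep producing successful templates accumulate a higher mean reward, while the bonus term keeps rarely tried operators from being starved.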

🔄 Architecture and Data Flow

High-Level Architecture

TurboFuzzLLM performs black-box mutation-based fuzzing to generate adversarial prompt templates for jailbreaking LLMs. It uses reinforcement learning to prioritize effective mutations.

Key Components

1. Fuzzer Core (fuzzer/core.py):

  • TurboFuzzLLMFuzzer: Main orchestrator class.
  • Manages questions, templates, mutations, evaluations, and statistics.

2. Models (llm/):

  • TargetModel: The LLM being attacked (e.g., GPT-4, Claude).
  • MutatorModel: LLM used for generating mutations (e.g., for paraphrasing templates).
  • JudgeModel: Determines whether a response is "jailbroken", meaning the attack succeeded against the target.

3. Mutation System (fuzzer/mutators.py, fuzzer/mutator_selection.py):

  • Various mutation operators: ExpandBefore, FewShots, Rephrase, Crossover, etc.
  • Selection policies: QLearning, UCB, Random, RoundRobin, MCTS, EXP3.

4. Template System (fuzzer/template.py):

  • Template class: Represents adversarial prompt templates.
  • Tracks ASR (Attack Success Rate), jailbreaks, parent/child relationships.

5. Template Selection (fuzzer/template_selection.py):

  • Policies for selecting which template to mutate next (reinforcement learning-based).
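The template bookkeeping described under Key Components (ASR, jailbreak counts, parent/child relationships) can be pictured with a small data structure. This is a minimal sketch under assumptions, not the repository's `Template` class: the name `TemplateNode`, its fields, and its methods are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TemplateNode:
    """Hypothetical stand-in for a template record in the evolution tree."""

    text: str
    # Links are excluded from repr/compare to avoid walking the tree recursively.
    parent: Optional["TemplateNode"] = field(default=None, repr=False, compare=False)
    children: List["TemplateNode"] = field(default_factory=list, repr=False, compare=False)
    attempts: int = 0
    jailbreaks: int = 0

    @property
    def asr(self) -> float:
        # Attack Success Rate: successful attempts divided by all attempts.
        return self.jailbreaks / self.attempts if self.attempts else 0.0

    def record(self, jailbroken: bool) -> None:
        # Update counters after one template-question evaluation.
        self.attempts += 1
        self.jailbreaks += int(jailbroken)

    def spawn(self, mutated_text: str) -> "TemplateNode":
        # Create a child template and link it to this node.
        child = TemplateNode(text=mutated_text, parent=self)
        self.children.append(child)
        return child

    def lineage(self) -> List[str]:
        # Template texts from the root seed down to this node.
        node, path = self, []
        while node is not None:
            path.append(node.text)
            node = node.parent
        return path[::-1]
```

Keeping the parent link on each node is what makes it possible to reconstruct an evolution tree after a run.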

Data Flow

1. Initialization: Load initial templates, questions, and configure models.

2. Warmup Phase: Evaluate initial templates on a subset of questions.

3. Mutation Loop:

  • Select template using selection policy (e.g., QLearning).
  • Select mutation using mutation policy.
  • Apply mutation to generate new template.
  • Evaluate new template on remaining questions.
  • Update selection/mutation policies based on results.
  • Repeat until a stopping criterion is met (query limit reached or all questions jailbroken).

4. Evaluation: For each template-question pair:

  • Synthesize prompt (replace placeholder in template with question).
  • Query target model.
  • Judge response for vulnerability.
  • Track statistics and jailbreaks.

5. Output: Generate CSV files, logs, statistics, and a visualization of the template evolution tree.
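The data flow above can be sketched as a short loop with the three models passed in as plain callables. Everything here is illustrative: the function names, the `PLACEHOLDER` token, the `warmup_size` parameter, and the greedy parent selection (highest ASR, ties to the newest template) are assumptions standing in for the repository's actual classes and learned policies.

```python
from typing import Callable, Dict, Iterable, List

# Assumed placeholder token; the README only says a placeholder is replaced by the question.
PLACEHOLDER = "[INSERT PROMPT HERE]"


def synthesize(template: str, question: str) -> str:
    """Build the final prompt by substituting the question into the template."""
    return template.replace(PLACEHOLDER, question)


def fuzz(
    seed_templates: List[str],
    questions: List[str],
    mutate: Callable[[str], str],
    query_target: Callable[[str], str],
    judge: Callable[[str, str], bool],
    max_queries: int,
    warmup_size: int = 1,
) -> Dict[str, object]:
    if not seed_templates:
        raise ValueError("at least one seed template is required")
    pending = list(questions)          # questions not yet jailbroken
    jailbreaks: Dict[str, str] = {}    # question -> template that succeeded
    stats: Dict[str, List[int]] = {t: [0, 0] for t in seed_templates}  # [successes, attempts]
    queries = 0

    def evaluate(template: str, batch: Iterable[str]) -> None:
        nonlocal queries
        for question in list(batch):   # copy, because pending shrinks while iterating
            if queries >= max_queries:
                return
            response = query_target(synthesize(template, question))
            queries += 1
            stats[template][1] += 1
            if judge(question, response):
                stats[template][0] += 1
                jailbreaks[question] = template
                pending.remove(question)

    def select_parent() -> str:
        # Greedy stand-in for a learned policy: highest ASR, ties go to the newest template.
        def key(item):
            index, template = item
            wins, tries = stats[template]
            return (wins / tries if tries else 0.0, index)
        return max(enumerate(stats), key=key)[1]

    # Warmup phase: score each seed template on a small subset of questions.
    for template in seed_templates:
        evaluate(template, pending[:warmup_size])

    # Mutation loop: stop when the query budget is spent or nothing is left to jailbreak.
    while pending and queries < max_queries:
        child = mutate(select_parent())
        stats.setdefault(child, [0, 0])
        evaluate(child, pending)

    return {"jailbreaks": jailbreaks, "queries": queries, "stats": stats}
```

Passing the target, mutator, and judge as callables keeps the loop testable with deterministic stubs before any real model (and its query cost) is involved.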

📊 Results

| Metric | Performance |
|--------|-------------|
| ASR on GPT-4o/GPT-4 Turbo | >98% |
| ASR on unseen questions | >90% |
| Query efficiency | 3x fewer queries |
| Template success rate | 2x improvement |
| Model safety improvement | 74% safer after adversarial training |

🛡️ Applications

1. Vulnerability Identification: Discover prompt-based attack vectors in LLMs

2. Countermeasure Development:

  • Improve built-in LLM safeguards
  • Create external guardrails

3. Adversarial Training: Generate high-quality (attack…
