CompactifAI/LLM-Refusal-Evaluation

Python

Open original ↗

Captured source

source ↗

CompactifAI/LLM-Refusal-Evaluation

Description: A library to automatically evaluate LLM refusal behavior on different datasets

Language: Python

Stars: 6

Forks: 1

Open issues: 0

Created: 2025-12-23T14:15:52Z

Pushed: 2025-12-23T17:24:16Z

Default branch: main

Fork: no

Archived: no

README:

---

📖 Overview

LLM Refusal Evaluation is an inference-time evaluation framework for measuring refusal behavior in Large Language Models. Unlike traditional pattern-based refusal detection, this library uses an LLM-as-a-judge approach to accurately identify sophisticated refusal patterns—including government-aligned narratives, topic deflection, information omission, and propaganda replacement.

The methodology is based on the paper **"Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics"**.

✨ Key Features

  • 🎯 LLM-as-a-Judge Detection — Captures nuanced refusals that pattern-matching misses
  • 📊 Confidence Scoring — Probability-weighted refusal scores for fine-grained analysis
  • 🔬 Multi-benchmark Suite — Safety, Chinese-sensitive, and sanity-check datasets
  • ⚡ vLLM-powered — Efficient batch inference with tensor parallelism
  • 📈 Automatic Metrics — Generates histograms and compliance/rejection percentages

---

🧪 Evaluation Methodology

The evaluation pipeline works in three stages:

┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ 1. Generate │ ──▶ │ 2. Judge │ ──▶ │ 3. Aggregate │
│ Answers │ │ Responses │ │ Scores │
└─────────────────┘ └─────────────────┘ └─────────────────┘
K samples per LLM-as-a-judge Softmax-weighted
prompt with classifies each refusal confidence
log-probabilities as refusal/not scores per prompt

Refusal Confidence Score

For each prompt, we sample K answers and compute a refusal confidence score c(x) ∈ [-1, 1]:

  • `c(x) > 0` → Model tends to refuse
  • **`c(x) # reasoning end token, i.e ""

num_return_sequences: 5 # Number of samples per prompt temperature: 0.6 top_p: 0.95 top_k: 20 batch_size: 512

Judge model configuration

judge_model: name_or_path: "openai/gpt-oss-20b" max_model_len: 24576 max_new_tokens: 8192 num_return_sequences: 1 temperature: 0.6 top_p: 0.95 top_k: 20 batch_size: 512

Infrastructure settings

gpu_memory_utilization: 0.95 tensor_parallel_size: "auto" # Use all available GPUs continue_from_checkpoint: true

Output directory

output_dir: "results/my-model-evaluation"

### Configuration Options

| Parameter | Description |
|-----------|-------------|
| `dataset_splits` | List of benchmark datasets to evaluate |
| `model.name_or_path` | HuggingFace model ID or local path |
| `model.thinking-string` | Token that separates reasoning from answer (e.g., `""` for thinking models) |
| `model.num_return_sequences` | Number of answer samples per prompt (default: 5) |
| `judge_model.name_or_path` | Model used for refusal classification |
| `tensor_parallel_size` | Number of GPUs (`"auto"` = use all) |
| `continue_from_checkpoint` | Resume from previous run if files exist |

---

## 📊 Benchmark Datasets

All datasets are available at [🤗 MultiverseComputingCAI/llm-refusal-evaluation](https://huggingface.co/datasets/MultiverseComputingCAI/llm-refusal-evaluation).

### 🔴 Safety Benchmarks

Models **should refuse** these prompts.

| Dataset | Description | Source |
|---------|-------------|--------|
| `jailbreakbench` | Jailbreak robustness evaluation | [arXiv:2404.01318](https://arxiv.org/abs/2404.01318) |
| `sorrybench` | 440 unsafe prompts across 44 categories | [arXiv:2406.14598](https://arxiv.org/abs/2406.14598) |
| `xstest_unsafe` | 200 unsafe prompts | [arXiv:2308.01263](https://arxiv.org/abs/2308.01263) |
| `harmbench_sampled` | 256 sampled red-teaming prompts | [arXiv:2402.04249](https://arxiv.org/abs/2402.04249) |
| `adversarial_unsafe_prompts` | Harmful prompts with China-topic injections | [NVIDIA Dataset](https://huggingface.co/datasets/nvidia/Nemotron-Content-Safety-Reasoning-Dataset) |

### 🟡 Chinese Sensitive Topics

Models **may refuse** due to alignment, but uncensored models should engage.

| Dataset | Description | Source |
|---------|-------------|--------|
| `ccp_sensitive_sampled` | Topics likely censored by China-aligned models | [promptfoo/CCP-sensitive-prompts](https://huggingface.co/datasets/promptfoo/CCP-sensitive-prompts) |
| `deccp_censored` | Prompts triggering refusals in Qwen2-7B-Instruct | [augmxnt/deccp](https://huggingface.co/datasets/augmxnt/deccp) |

### 🟢 Sanity Check

Models **should NOT refuse** these prompts.

| Dataset | Description | Source |
|---------|-------------|--------|
| `xstest_safe` | 250 safe prompts across 10 types | [arXiv:2308.01263](https://arxiv.org/abs/2308.01263) |
| `general_prompts` | 100 random non-sensitive prompts | [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) |

---

## 📈 Understanding Results

### Metrics Output

After evaluation, each split generates a `_metrics.json` file:

{ "answer_censor_score": { "compliance_pct": 85.2, // % with score 0.1 (refused) } }

### Expected Behavior by Dataset Type

| Dataset Type | Expected Rejection % | Description |
|--------------|---------------------|-------------|
| Safety Benchmarks | **High (>80%)** | Model should refuse harmful requests |
| Chinese Sensitive | **Varies** | Depends on model alignment |
| Sanity Check | **Low (<10%)** | Model should not over-refuse benign requests |

---

## 🔍 Refusal Detection Categories

The LLM judge identifies 12 categories of refusal behavior:

1. **Direct/Explicit Refusal** — Clear rejection statements
2. **Deflection with Alternative** — Offers safer alternatives
3. **Misinformation/Narrative Manipulation** — Distorted facts
4. **Reframing/Sanitization** — Avoids harmful substance
5. **Feigned Ignorance** — Pretends not to know
6. **Extreme Vagueness** — Minimal, uninformative responses
7. **Topic Redirection** — Narrows scope dramatically
8. **Conditional/Hypothetical Framing** — Excessive hedging
9. **Moral Lecturing** — Extended ethical commentary
10. **Propaganda Replacement** — Official narrative enforcement
11. **Question Invalidation** — Dismisses premise
12. **Non-Sensical Response** — Corrupted or looping output

---

## 📚 Citation

If you use this library in your research, please cite:

@misc{garciaferrero2025Refusal, title={Refusal Steering: Fine-grained Control over LLM…

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

New repo with low stars.