forked from centerforaisafety/HarmBench
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
dbc8334
commit 26ac422
Showing
332 changed files
with
18,326 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,270 @@ | ||
# HarmBench | ||
<p align="center"> | ||
|
||
<img src="assets/harmbench_splash-1.png" alt="HarmBench"/> | ||
|
||
</p> | ||
|
||
<div align="center"> | ||
|
||
# 🗡️ HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal 🛡️ | ||
|
||
</div> | ||
|
||
<div align="center"> | ||
|
||
|
||
</div> | ||
|
||
|
||
#### A fast, affordable, scalable and open system framework for evaluating automated red teaming methods and LLM attacks/defenses | ||
|
||
|
||
<!-- Three language version (Eng/Chinese/Japanese) --> | ||
|
||
|
||
<!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-refresh-toc --> | ||
## Table of Contents | ||
|
||
- [📰 Latest News 📰](#-latest-news-) | ||
- [🗡️ What is HarmBench 🛡️](#🗡️-what-is-harmbench-🛡️) | ||
- [🌐 Overview 🌐](#🌐-overview-🌐) | ||
- [☕ Quick Start ☕](#☕-quick-start-☕) | ||
- [⚙️ Installation](#⚙️-installation) | ||
- [🗡️ Step 1 - Generate Test Cases](#🗡️-step-1---generate-test-cases) | ||
- [🗡️ Step 1.5 - Merge Test Cases](#🗡️-step-15---merge-test-cases) | ||
- [🗡️ Step 2 - Generate Completions](#🗡️-step-2---generate-completions) | ||
- [🗡️ Step 3 - Evaluate Completions](#🗡️-step-3---evaluate-completions) | ||
- [➕ Using your own model in HarmBench](#➕-using-your-own-model-in-harmbench) | ||
- [➕ Adding/Customizing your own red teaming methods in HarmBench](#➕-addingcustomizing-your-own-red-teaming-methods-in-harmbench) | ||
- [⚓ Documentation and Tutorial ⚓](#⚓-documentation-and-tutorial-⚓) | ||
- [🌱 HarmBench's Roadmap 🌱](#🌱-harmbench-roadmap-🌱) | ||
- [🙏 Acknowledgement and Citation 🙏](#🙏-acknowledgements-and-citation-🙏) | ||
|
||
<!-- markdown-toc end --> | ||
|
||
## 📰 Latest News 📰 | ||
|
||
* ***[2024/02] 🚀 [Initial release of HarmBench with 33 evaluated LLMs and 18 red teaming methods]()*** 🚀 | ||
|
||
|
||
## 🗡️ What is HarmBench 🛡️ | ||
|
||
<div align="center"> | ||
|
||
<img src="assets/eval_pipeline-1.png" alt="HarmBench Evaluation Pipeline"/> | ||
|
||
</div> | ||
|
||
Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess these methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify key considerations previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of $18$ red teaming methods and $33$ target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables codevelopment of attacks and defenses. | ||
|
||
## 🌐 Overview 🌐 | ||
There are two primary ways to use HarmBench: (1) evaluating red teaming methods against a set of LLMs, and (2) evaluating LLMs against a set of red teaming methods. These use cases are both supported by the same evaluation pipeline, illustrated above. The process for running this evaluation pipeline is described below in the Quick Start section. | ||
|
||
HarmBench has out-of-the-box support for transformers-compatible LLMs, numerous closed-source APIs, and several multimodal models. To add new red teaming methods and models that aren't compatible with transformers, it may help to learn more about how the codebase is structured. We provide a set of docs and *(coming soon)* tutorials to help with more advanced usage. | ||
|
||
|
||
## ☕ Quick Start ☕ | ||
|
||
|
||
### ⚙️ Installation | ||
|
||
```bash | ||
git clone https://github.com/centerforaisafety/HarmBench.git | ||
cd HarmBench | ||
pip install -r requirements.txt | ||
``` | ||
|
||
### 🗡️ Step 1 - Generate Test Cases | ||
|
||
The first step in the evaluation pipeline is generating test cases. The start_idx and end_idx parameters allow parallelizing test case generation across behaviors. The run_id parameter enables parallelizing test case generation in other ways (e.g., the EnsembleGCG method can be parallelized by setting `start_idx=0`, `end_idx=401`, `run_id="${n}"` for n ranging from 1 to N.) | ||
|
||
```bash | ||
# Run the step1 (generate test cases) script with Slurm | ||
./scripts/step1.sh | ||
|
||
# OR run directly | ||
./scripts/generate_test_cases.sh $method $experiment $behaviors_path $save_dir $start_idx $end_idx $run_id | ||
``` | ||
|
||
<details><summary> Example </summary><p> | ||
|
||
```bash | ||
# Make sure your environment has enough GPUs required for each model and attacks | ||
|
||
base_dir="results" | ||
method="GCG" | ||
experiment="vicuna_13b_v1_5" | ||
behaviors_path="./data/harmbench_text_test_behaviors.csv" # or "./data/harmbench_text_validation_behaviors.csv" | ||
save_dir=${base_dir}/${method}/${experiment}/test_cases | ||
start_idx=0 | ||
end_idx=51 # first 50 behaviors | ||
run_id="" | ||
|
||
./scripts/generate_test_cases.sh $method $experiment $behaviors_path $save_dir $start_idx $end_idx $run_id | ||
``` | ||
|
||
</p></details> | ||
|
||
|
||
### 🗡️ Step 1.5 - Merge Test Cases | ||
For many methods, `generate_test_cases.py` is parallelized and does not directly save to the final `test_cases.json` file. Rather, these methods save results in `{top_level_results_dir}/{method}/{experiment}/test_cases/test_cases_individual_behaviors/{behavior_id}/test_cases.json`. Run the following command to merge all of the test cases from different behaviors into a single `test_cases.json` file. | ||
|
||
```bash | ||
# With Slurm | ||
./scripts/step1.5.sh | ||
|
||
# Or run directly | ||
./scripts/merge_test_cases.sh $method $save_dir | ||
``` | ||
<details><summary> Example </summary><p> | ||
|
||
```bash | ||
# Your save_dir from Step 1 | ||
base_dir="results" | ||
method="GCG" | ||
experiment="vicuna_13b_v1_5" | ||
save_dir=$base_dir/$method/$experiment/test_cases | ||
|
||
./scripts/merge_test_cases.sh $method $save_dir | ||
``` | ||
</p></details> | ||
|
||
|
||
### 🗡️ Step 2 - Generate Completions | ||
|
||
After generating test cases (attack prompts) specified by a `method` and `experiment` in Step 1, we generate completions for a `target_model`. | ||
|
||
```bash | ||
# Run with Slurm | ||
./scripts/step2.sh | ||
|
||
# Or run directly, make sure your environment have enough GPUs to allocate the model | ||
./scripts/generate_completions.sh $model $behaviors_path $test_cases_file $completions_output_file $max_new_tokens $token | ||
|
||
``` | ||
|
||
<details><summary> Example </summary><p> | ||
|
||
```bash | ||
|
||
base_dir="results" | ||
model="vicuna_13b_v1_5" # from configs/model_configs/models.yaml | ||
files_dir=$base_dir/$method/$model | ||
|
||
test_cases_file=$files_dir/test_cases/test_cases.json | ||
completions_output_file=$files_dir/completions/${model}.json | ||
behaviors_path="./data/harmbench_text_test_behaviors.csv" # or "./data/harmbench_text_validation_behaviors.csv" | ||
max_new_tokens=512 | ||
# Optional token is the private token for the HF's model or api tokens for GPT, CLAUDE, etc. | ||
# token = <your_api_token> | ||
|
||
./scripts/generate_completions.sh $model $behaviors_path $test_cases_file $completions_output_file $max_new_tokens $token | ||
|
||
``` | ||
</p></details> | ||
|
||
### 🗡️ Step 3 - Evaluate Completions | ||
|
||
After generate completions from a `target_model` from Step 2, we use our classifier to label whether each completion is an example of its corresponding behavior. More details about our classifier are in [docs/classifiers](./docs/classifiers.md) | ||
|
||
```bash | ||
# Run with Slurm | ||
./scripts/step3.sh | ||
|
||
# Or run directly, make sure your environment have enough GPUs to allocate the classifier model (1 GPU) | ||
./scripts/evaluate_completions.sh $cls_path $behaviors_path $input_file $output_file | ||
``` | ||
|
||
<details><summary> Example </summary><p> | ||
|
||
```bash | ||
base_dir="results" | ||
model="vicuna_13b_v1_5" # from configs/model_configs/models.yaml | ||
method="GCG" | ||
|
||
cls_path="cais/HarmBench-Llama-2-13b-cls" | ||
behaviors_path="./data/harmbench_text_test_behaviors.csv" # or "./data/harmbench_text_validation_behaviors.csv" | ||
input_file=$base_dir/$method/$model/completions/$model.json # output file from Step 2 | ||
output_file=$base_dir/$method/$model/results/$model.json | ||
|
||
./scripts/evaluate_completions.sh $cls_path $behaviors_path $input_file $output_file | ||
|
||
``` | ||
</p></details> | ||
|
||
<!-- ### 🗡️ Step 2+3: Generate and Evaluate | ||
For convenience, we also provie a script to combine step 2 and step 3 to get completions from a `target_model` and predictions from our classifier. | ||
```bash | ||
# Run with Slurm | ||
./step2+3.sh | ||
# Or run directly, make sure your environment have enough GPUs to allocate both target_model and the classifier model | ||
./generate_and_evaluate.sh $model $cls_path $behaviors_path $test_cases_file $completions_output_file $scores_output_file $max_new_tokens $token | ||
``` | ||
<details><summary> Example </summary><p> | ||
```bash | ||
base_dir="results" | ||
model="vicuna_13b_v1_5" # from configs/model_configs/models.yaml | ||
method="GCG" | ||
cls_path="cais/HarmBench-Llama-2-13b-cls" | ||
files_dir=$base_dir/$method/$model/test_cases | ||
test_cases_file=$files_dir/test_cases.json | ||
completions_output_file=$base_dir/$method/$model/completions/${model}.json | ||
scores_output_file=$base_dir/$method/$model/results/${model}.json | ||
# Optional token is the private token for the HF's model or api tokens for GPT, CLAUDE, etc. | ||
# token = <your_api_token> | ||
./generate_and_evaluate.sh $model $cls_path $behaviors_path $test_cases_file $completions_output_file $scores_output_file $max_new_tokens $token --> | ||
|
||
<!-- ``` --> | ||
</p></details> | ||
|
||
### ➕ Using your own model in HarmBench | ||
You can easily add new Hugging Face transformers models in [config/model_configs/models.yaml](config/model_configs/models.yaml) by simply adding an entry for your model. This model can then be directly evaluated on most red teaming methods without modifying the method configs (using our dynamic experiment config parsing code, described in the `docs` folder). Some methods (AutoDAN, PAIR, TAP) currently require manually adding new experiment configs for added models. | ||
|
||
### ➕ Adding/Customizing your own red teaming methods in HarmBench | ||
All of our red teaming methods are implemented in [baselines](baselines), imported through [baselines/__init__.py](baselines/__init__.py), and managed by [configs/method_configs](configs/method_configs). You can easily improve on top of existing red teaming methods or add new methods by simply making a new subfolder in the `baselines` directory. | ||
|
||
|
||
## ⚓ Documentation ⚓ | ||
|
||
For more details documentation about each component in the framework | ||
- [**Codebase Structure**](./docs/codebase_structure.md) | ||
- [**Method and Model Configs**](./docs/configs.md) | ||
|
||
<!-- We also provide a notebook version for each of the red teaming attack to foster future development and improvement over each of the red teaming methods. | ||
- [**Notebooks**](notebooks/) --> | ||
|
||
## 🌱 HarmBench Roadmap 🌱 | ||
|
||
We plan to update to HarmBench over the next few weeks with the following: | ||
- Notebook versions for red teaming methods | ||
- Additional red teaming methods | ||
- Additional target models | ||
- Additional harmful behaviors | ||
- Support for system prompt attacks | ||
- Experiment code from the paper, including adversarial training code | ||
|
||
## 🙏 Acknowledgements and Citation 🙏 | ||
|
||
We thank the following open-source repositories: | ||
|
||
[1] https://github.com/llm-attacks/llm-attacks | ||
[2] https://github.com/patrickrchao/JailbreakingLLMs | ||
[3] https://github.com/RICommunity/TAP | ||
[4] https://github.com/SheltonLiu-N/AutoDAN | ||
[5] https://github.com/lm-sys/FastChat/tree/main/fastchat | ||
[6] https://github.com/ray-project/ray | ||
[7] https://github.com/vllm-project/vllm | ||
[8] https://github.com/huggingface/transformers | ||
[10] https://github.com/CHATS-lab/persuasive_jailbreaker | ||
|
||
If you find this useful in your research, please consider citing our [paper](): | ||
``` | ||
@article{harmbench, | ||
} | ||
``` |
Oops, something went wrong.