Efficient Adversarial Training with GCG

This folder contains the code for running our efficient adversarial training method with GCG. We use the alignment handbook repository as a starting point and modify/add the following files:

./alignment-handbook/scripts/run_adv_training.sh
./alignment-handbook/scripts/run_sft_adv_training.py
./alignment-handbook/scripts/adv_training_utils.py
./alignment-handbook/recipes/zephyr-7b-beta/sft_adv_training/config_full.yaml
./alignment-handbook/recipes/accelerate_configs/deepspeed_config.yaml

To start an adversarial training run, follow these setup steps:

1. Set num_processes in ./alignment-handbook/recipes/accelerate_configs/deepspeed_config.yaml to the number of GPUs that you want to train on, and set NUM_ACCELERATE_GPUS in ./alignment-handbook/scripts/run_sft_adv_training.py to the same value.
2. Make sure NUM_TEST_CASES_TO_UPDATE_PER_STEP in ./alignment-handbook/scripts/run_sft_adv_training.py is a multiple of num_processes.
3. Make sure NUM_SINGLE_BEHAVIOR_TEST_CASES in ./alignment-handbook/scripts/run_sft_adv_training.py is greater than or equal to NUM_TEST_CASES_TO_UPDATE_PER_STEP.
4. Create or activate a conda environment and install the requirements in ./alignment-handbook/requirements.txt. This code is based on an older version of alignment-handbook, so it may not work with other package versions.
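
The consistency rules in the steps above can be checked before launching a run. The following is a minimal sketch, not part of the repository: the function name check_adv_training_settings is hypothetical, and the values are passed in by hand rather than read from deepspeed_config.yaml or run_sft_adv_training.py.

```python
def check_adv_training_settings(
    num_processes,
    num_accelerate_gpus,
    num_test_cases_to_update_per_step,
    num_single_behavior_test_cases,
):
    """Return a list of problems; an empty list means the settings are consistent.

    Hypothetical helper: the arguments mirror num_processes (deepspeed_config.yaml)
    and the three constants in run_sft_adv_training.py named in the setup steps.
    """
    problems = []
    if num_processes < 1:
        # Without at least one process the remaining checks are meaningless.
        problems.append("num_processes must be at least 1")
        return problems
    if num_accelerate_gpus != num_processes:
        problems.append(
            f"NUM_ACCELERATE_GPUS ({num_accelerate_gpus}) must equal "
            f"num_processes ({num_processes})"
        )
    if num_test_cases_to_update_per_step % num_processes != 0:
        problems.append(
            f"NUM_TEST_CASES_TO_UPDATE_PER_STEP ({num_test_cases_to_update_per_step}) "
            f"must be a multiple of num_processes ({num_processes})"
        )
    if num_single_behavior_test_cases < num_test_cases_to_update_per_step:
        problems.append(
            f"NUM_SINGLE_BEHAVIOR_TEST_CASES ({num_single_behavior_test_cases}) must be "
            f">= NUM_TEST_CASES_TO_UPDATE_PER_STEP ({num_test_cases_to_update_per_step})"
        )
    return problems


if __name__ == "__main__":
    # Illustrative values only; these are not the repository defaults.
    for problem in check_adv_training_settings(8, 8, 16, 32):
        print(problem)
```

An empty result means all three rules hold; each violated rule adds one message naming the setting to change.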

Then run the following commands:

cd alignment-handbook
# Step 1 - train SFT policy
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_config.yaml scripts/run_sft_adv_training.py recipes/zephyr-7b-beta/sft_adv_training/config_full.yaml

Running this code with the default arguments produces the Zephyr 7B + R2D2 model that we evaluate in the paper. The model in the paper is a snapshot taken after 2000 steps and is available on the Hugging Face Hub: 🤗 cais/zephyr_7b_r2d2.
