This folder contains the code for running our efficient adversarial training method with GCG. We use the alignment handbook repository as a starting point and modify/add the following files:
./alignment-handbook/scripts/run_adv_training.sh
./alignment-handbook/scripts/run_sft_adv_training.py
./alignment-handbook/scripts/adv_training_utils.py
./alignment-handbook/recipes/zephyr-7b-beta/sft_adv_training/config_full.yaml
./alignment-handbook/recipes/accelerate_configs/deepspeed_config.yaml
To start an adversarial training run, follow these setup steps:
- Modify
num_processes
in./alignment-handbook/recipes/accelerate_configs/deepspeed_config.yaml
to the number of GPUs that you want to train on. SetNUM_ACCELERATE_GPUS
in./alignment-handbook/scripts/run_sft_adv_training.py
to match this value. - Make sure
NUM_TEST_CASES_TO_UPDATE_PER_STEP
in./alignment-handbook/scripts/run_sft_adv_training.py
is a multiple ofnum_processes
. - Make sure
NUM_SINGLE_BEHAVIOR_TEST_CASES
in./alignment-handbook/scripts/run_sft_adv_training.py
is greater than or equal toNUM_TEST_CASES_TO_UPDATE_PER_STEP
. - Activate or create a new conda environment and install the requirements in
./alignment-handbook/requirements.txt
. The version ofalignment-handbook
that we use is a bit outdated now, so the code may not work without this step.
Then run the following commands:
cd alignment-handbook
# Step 1 - train SFT policy
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_config.yaml scripts/run_sft_adv_training.py recipes/zephyr-7b-beta/sft_adv_training/config_full.yaml
We obtain the Zephyr 7B + R2D2 model that we evaluate in the paper using this code with the default arguments. The model in the paper is a snapshot after 2000 steps and is available here: 🤗 cais/zephyr_7b_r2d2.