Skip to content

Latest commit

 

History

History
 
 

adversarial_training

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 

Efficient Adversarial Training with GCG

This folder contains the code for running our efficient adversarial training method with GCG. We use the alignment handbook repository as a starting point and modify/add the following files:

  • ./alignment-handbook/scripts/run_adv_training.sh
  • ./alignment-handbook/scripts/run_sft_adv_training.py
  • ./alignment-handbook/scripts/adv_training_utils.py
  • ./alignment-handbook/recipes/zephyr-7b-beta/sft_adv_training/config_full.yaml
  • ./alignment-handbook/recipes/accelerate_configs/deepspeed_config.yaml

To start an adversarial training run, follow these setup steps:

  • Modify num_processes in ./alignment-handbook/recipes/accelerate_configs/deepspeed_config.yaml to the number of GPUs that you want to train on. Set NUM_ACCELERATE_GPUS in ./alignment-handbook/scripts/run_sft_adv_training.py to match this value.
  • Make sure NUM_TEST_CASES_TO_UPDATE_PER_STEP in ./alignment-handbook/scripts/run_sft_adv_training.py is a multiple of num_processes.
  • Make sure NUM_SINGLE_BEHAVIOR_TEST_CASES in ./alignment-handbook/scripts/run_sft_adv_training.py is greater than or equal to NUM_TEST_CASES_TO_UPDATE_PER_STEP.
  • Activate or create a new conda environment and install the requirements in ./alignment-handbook/requirements.txt. The version of alignment-handbook that we use is a bit outdated now, so the code may not work without this step.

Then run the following commands:

cd alignment-handbook
# Step 1 - train SFT policy
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_config.yaml scripts/run_sft_adv_training.py recipes/zephyr-7b-beta/sft_adv_training/config_full.yaml

We obtain the Zephyr 7B + R2D2 model that we evaluate in the paper using this code with the default arguments. The model in the paper is a snapshot after 2000 steps and is available here: 🤗 cais/zephyr_7b_r2d2.