Fine-Grained RLHF

This repo provides data, code and models for the paper: Fine-Grained Human Feedback Gives Better Rewards for Language Model Training.

Content

Set Up
Tasks and Data
- Long-form QA
- Detoxification
SFT Training
Reward Modeling
RLHF Training
- Holistic RLHF
- Fine-Grained RLHF
Our Trained Models
Feedback Collection Interfaces

Set Up

# create a conda environment with python 3.9
conda create --name py39 python=3.9
conda activate py39 

# git clone and install packages
git clone https://github.com/allenai/FineGrainedRLHF.git
cd FineGrainedRLHF
pip install -e .
python -m spacy download en_core_web_sm

Tasks and Data

Long-form QA

Long-form QA requires an LM to generate a textual response to a question with comprehensive answers and explanations. We conduct experiments on our constructed data, qa-feedback, with collected preference-based and fine-grained human feedback. Please see the data under ./tasks/qa_feedback/data.

Detoxification

The task of detoxification aims to reduce the toxicity in the model generation y when given a prompt x. We conduct experiments on RealToxicityPrompts.

SFT Training

All our experiments reported in the paper were run on 80G A100 GPUs.

For qa-feedback, we initialize the policy model with supervised finetuning on 1K training examples, by running the following command. The trained model is saved under ./tasks/qa_feedback/model_outputs/t5-large-1k-train.

bash tasks/qa_feedback/training/train_sft.sh

We do not do supervised finetuning for the detoxification task and simply use GPT-2 as the initial policy model.

Reward Modeling

For qa-feedback, we train three fine-grained reward models R1, R2 and R3 that correspond to (1) irrelevance, repetition, and incoherence error and (2) incorrect or unverifiable facts and (3) information completeness:

# prepare RM training data, all data saved under ./tasks/qa_feedback/data
bash tasks/qa_feedback/reward_modeling/create_rm_train_files.sh

# train R1, saved under ./tasks/qa_feedback/model_outputs/rel_rm
bash tasks/qa_feedback/reward_modeling/train_rel_rm.sh

# train R2, saved under ./tasks/qa_feedback/model_outputs/fact_rm
bash tasks/qa_feedback/reward_modeling/train_fact_rm.sh

# train R3, saved under ./tasks/qa_feedback/model_outputs/comp_rm
bash tasks/qa_feedback/reward_modeling/train_comp_rm.sh

To train the preference-based reward model:

bash tasks/qa_feedback/reward_modeling/train_baseline_rm.sh

For detoxification, as explained in the paper, we use PERSPECTIVE, an off-the-shelf toxicity API as our reward model.

RLHF Training

We provide training scripts to run both holistic and fine-grained RLHF. Now we only provide the scripts for qa-feedback, and will add those for the detoxification task soon.

You can find the hyperparameters in tasks/{task_name}/training/baseline_config.yml and tasks/{task_name}/training/fine_grained_config.yml and change them accordingly. For example, you may want to change wandb_entity as your own wandb username.

You also want to change the mean and std values for sequence-level reward models (preference and info completeness) in each yml file, which stand for the mean and average reward scores on the training data from the reward model. You can find the values from the mean_std.txt file under ./tasks/qa_feedback/model_outputs/comp_rm or ./tasks/qa_feedback/model_outputs/baseline_rm. The current values are from our own trained reward models.

Holistic RLHF

For qa-feedback:

bash tasks/qa_feedback/training/train_baseline.sh