RWKV-based reward model

Work in progress, not ready for use.

Introduction

This repository implements a basic reward model based on RWKV models. The purposes of training such a reward model are:

  • Show that RWKV models can be used to learn reward functions
  • With the reward model trained, we can use it to train a policy with RL algorithms that encourage the base RWKV model to generate more diverse trajectories and higher-quality answers.

To run this experiment

1. Environment setup

You will need to install the official RWKV PyPI package with pip install rwkv. You will also need the datasets and PyTorch packages.

pip install rwkv
pip install datasets
pip install torch

2. Download the weights

In this experiment I chose a smaller RWKV model with 430M parameters to enable quick training and testing on a single GPU. You can download the weights from https://huggingface.co/BlinkDL/rwkv-4-pile-430m

I used this checkpoint: https://huggingface.co/BlinkDL/rwkv-4-pile-430m/resolve/main/RWKV-4-Pile-430M-20220808-8066.pth
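
If you prefer to script the download, a small sketch using the huggingface_hub package (an extra dependency, not listed above) would look like this; you can also simply fetch the URL above with a browser or wget.

from huggingface_hub import hf_hub_download

# Fetch the 430M checkpoint from the Hugging Face Hub and get its local path
weight_path = hf_hub_download(
    repo_id="BlinkDL/rwkv-4-pile-430m",
    filename="RWKV-4-Pile-430M-20220808-8066.pth",
)
print(weight_path)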

3. Run the experiment

python train.py

In this experiment, we will use the reward dataset from https://huggingface.co/datasets/yitingxie/rlhf-reward-datasets. We sampled 100 data points from the train split and 20 from the validation split to show that the reward model can be trained to predict the reward values.
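
For reference, the subsampling could be done with the datasets library roughly as follows; the split names and the seed are assumptions here, so check train.py for the exact logic.

from datasets import load_dataset

# Load the pairwise preference dataset from the Hugging Face Hub
dataset = load_dataset("yitingxie/rlhf-reward-datasets")

# Take a small subsample for this quick demonstration run
train_subset = dataset["train"].shuffle(seed=42).select(range(100))
val_subset = dataset["test"].shuffle(seed=42).select(range(20))

print(train_subset[0])  # each row pairs a prompt with a preferred and a rejected answer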

Details of the reward model

In short, the reward model in this setting will 1) read the prompt and the output from another LLM; 2) rate that output on a scale from 1 to -1 (in logits), where 1 corresponds to "accept" and -1 to "reject". In the dataset yitingxie/rlhf-reward-datasets, two answers were generated by an anonymous LLM for each prompt, and human raters labeled one answer as "accept" and the other as "reject". The reward model is trained to predict these human ratings.
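
To make the setup concrete, here is a minimal, self-contained sketch of the idea: a scalar head on top of the backbone's final hidden state is trained so that accepted answers score toward +1 and rejected answers toward -1. This is only an illustration; the hidden size, the tanh/MSE objective, and the random tensors standing in for RWKV hidden states are assumptions, not the exact code in train.py.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    # Maps the backbone's final hidden state for a (prompt, answer) pair
    # to a single scalar logit: positive leans "accept", negative leans "reject".
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state):
        return self.score(last_hidden_state).squeeze(-1)

head = RewardHead(hidden_size=1024)       # 1024 is a placeholder width
hidden_accept = torch.randn(4, 1024)      # stand-ins for RWKV hidden states
hidden_reject = torch.randn(4, 1024)      # of the "accept" / "reject" answers

# Push accepted answers toward a rating of +1 and rejected answers toward -1.
scores = torch.cat([head(hidden_accept), head(hidden_reject)])
targets = torch.cat([torch.ones(4), -torch.ones(4)])
loss = F.mse_loss(torch.tanh(scores), targets)
loss.backward()

A common alternative is a pairwise (Bradley-Terry) loss that only requires the accepted answer to score higher than the rejected one, rather than targeting fixed values.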

Once such a reward model is trained, we can use it to train a policy with RL algorithms that encourage the base RWKV model to achieve higher scores (i.e. more "accept" ratings).

Work in progress

  1. The reward model is not well trained yet; only the last few layers are trained.
  2. I will implement QLoRA on top of this reward model to make it fully trainable.
  3. The dataset used for illustration here is neither large enough nor diverse enough. We will need to collect more data to train a better reward model.
