Preference Fine-Tuning of LLMs Should Leverage Suboptimal On-Policy Data


This is the official codebase for our paper "Preference Fine-Tuning of LLMs Should Leverage Suboptimal On-Policy Data" by Fahim Tajwar*, Anikait Singh*, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. For any questions or concerns related to this codebase, please reach out to Anikait Singh.

Running experiments

For the bandit experiments, make sure you are in the bandit_experiment directory; the bandit_experiment/scripts directory provides example commands for running these experiments.
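
A minimal sketch of the invocation pattern (the script name below is a placeholder; substitute one of the example scripts actually provided in bandit_experiment/scripts):

```bash
# Sketch only: <example_script>.sh is a placeholder for one of the
# example scripts shipped in bandit_experiment/scripts.
cd bandit_experiment
bash scripts/<example_script>.sh
```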

For the UltraFeedback DPO/Pref-FT experiments, HALOs/project_scripts/run_halos_multi.sh contains example commands to reproduce the experiments in our paper.
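
As a rough sketch, assuming the script can be run directly with bash from the repository root (check the script itself for the exact arguments and environment setup):

```bash
# Sketch only: run the provided multi-run script for the UltraFeedback
# DPO/Pref-FT experiments; adjust paths and arguments to your environment.
cd HALOs
bash project_scripts/run_halos_multi.sh
```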

Additional Datasets

Our LLM experiments use the following additional datasets:

  1. Relabelled AlpacaFarm
  2. Min Length
  3. Mode Length
  4. Skew Length

Acknowledgements

We acknowledge the following codebases for our paper:

  1. TRL - Adapted for our synthetic LLM experiments.
  2. HALOs - Used for our DPO/Pref-FT experiments on UltraFeedback.
  3. DrQ-v2 - Used for our bandit experiments.
  4. minGPT - Used for our bandit experiments.

We thank the authors for providing us with easy-to-work-with codebases.
