Dataset Reset Policy Optimization for RLHF

Chang, Jonathan D.; Zhan, Wenhao; Oertell, Owen; Brantley, Kianté; Misra, Dipendra; Lee, Jason D.; Sun, Wen

arXiv:2404.08495 (cs)

[Submitted on 12 Apr 2024 (v1), last revised 16 Apr 2024 (this version, v3)]

Abstract: Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude 3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that an offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful and Harmless (HH) datasets, the generation from DR-PO is better than that from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT-4 win rate. Code for this work can be found at this https URL.
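
The dataset-reset idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes, for illustration only, that a "state" is a prompt followed by a prefix of the labeler-preferred response, and that resets are mixed with ordinary starts using a probability. The names `sample_reset_state` and `reset_prob` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def sample_reset_state(
    prompt: Sequence[str],
    preferred: Sequence[str],
    reset_prob: float,
    rng: random.Random,
) -> Tuple[List[str], int]:
    """Pick the starting state for one online rollout.

    Assumption (not from the abstract): with probability `reset_prob` the
    rollout starts from the prompt plus a uniformly chosen prefix of the
    preferred response (a "dataset reset"); otherwise it starts from the
    prompt alone, i.e. the usual initial state distribution.
    """
    if not 0.0 <= reset_prob <= 1.0:
        raise ValueError("reset_prob must lie in [0, 1]")
    if preferred and rng.random() < reset_prob:
        # Prefix length in [0, len(preferred) - 1], so the policy always
        # has at least one token left to generate.
        cut = rng.randrange(len(preferred))
    else:
        cut = 0
    return list(prompt) + list(preferred[:cut]), cut


# Toy usage with made-up tokens; a real pipeline would use tokenizer ids.
rng = random.Random(0)
prompt = ["Summarize", "the", "post", ":"]
preferred = ["A", "short", "faithful", "summary"]
state, cut = sample_reset_state(prompt, preferred, reset_prob=1.0, rng=rng)
print(cut, state)
```

The policy would then continue generating from `state`, and the learned reward model would score the completed response. Setting `reset_prob=0.0` recovers rollouts that always start from the prompt alone.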

Comments:	28 pages, 6 tables, 3 figures, 3 algorithms
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2404.08495 [cs.LG]
	(or arXiv:2404.08495v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2404.08495

Submission history

From: Jonathan Chang
[v1] Fri, 12 Apr 2024 14:25:49 UTC (625 KB)
[v2] Mon, 15 Apr 2024 01:56:27 UTC (625 KB)
[v3] Tue, 16 Apr 2024 17:36:39 UTC (625 KB)
