[Example] RLHF end to end example #1324

Closed · wants to merge 30 commits

Conversation

apbard (Contributor) commented on Jun 27, 2023

Merge after #1309, #1319, #1316, #1315, plus a rebase.

Adds a complete end-to-end RLHF pipeline.

facebook-github-bot added the CLA Signed label on Jun 27, 2023
Comment on lines 57 to 60
"""Returns adaptively updated KL coefficient, βₜ₊₁.
Arguments:
current: The current KL value between the newest policy and the initial policy.
"""
Contributor:
wrong formatting


For debugging purposes, we also generate responses to a fixed prompt so that the
quality of the model can be visually assessed during training.
"""
Contributor:
Missing args and example
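
For illustration, a docstring with the missing `Args` and `Example` sections might look roughly like this; the function name, signature, and example values below are assumptions for the sketch, not taken from the PR:

```python
def generate_debug_completions(model, tokenizer, prompt, max_new_tokens=50):
    """Generate responses to a fixed prompt so model quality can be eyeballed during training.

    Args:
        model: the policy language model currently being fine-tuned.
        tokenizer: the tokenizer matching ``model``.
        prompt (str): the fixed debugging prompt.
        max_new_tokens (int, optional): number of tokens to generate. Defaults to 50.

    Examples:
        >>> text = generate_debug_completions(model, tokenizer, "Once upon a time")
        >>> print(text)  # inspect the completion by eye
    """
```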

batch = next(dataloader)
# NOTE: disable kl for evaluation
td = rollout_from_model.rollout_from_data(batch, kl_coef=0.0)
rewards[k] = td.get(("next", "reward")).sum(dim=1).mean().item()
Contributor:
why item?

Contributor (author):
to get a scalar instead of a scalar-tensor
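
For context, a minimal illustration of the difference (plain PyTorch, not code from this PR):

```python
import torch

r = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
m = r.sum(dim=1).mean()  # 0-dim tensor: tensor(5.)
f = m.item()             # plain Python float: 5.0, handy for logging and printing
```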

vmoens added the enhancement label on Jun 28, 2023
model,
ref_model,
reward_model,
kl_controller,
Contributor:
This is missing from the docs.

I'm not so sure about this kl_controller being passed to the module; I feel it should be handled separately. It's like passing the lr_scheduler to the optimizer: we don't do that because it mixes responsibilities between modules. It gives the impression that one module has multiple responsibilities, and it is less clear than doing things explicitly in the main code.
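
For reference, the analogy in plain PyTorch terms (a self-contained toy, unrelated to this PR's code): the scheduler is driven from the main loop, and the optimizer knows nothing about scheduling.

```python
import torch
from torch import nn, optim

model = nn.Linear(4, 1)
opt = optim.SGD(model.parameters(), lr=0.1)
sched = optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)

loss = model(torch.randn(2, 4)).sum()
loss.backward()
opt.step()    # the optimizer updates parameters
sched.step()  # the scheduler adjusts the LR from outside, in the main loop
```

The question is whether the KL coefficient should be wired the same way, i.e. updated next to the rollout call in the training script rather than from inside the module.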

Contributor (author):
Are you suggesting we go back to passing just the KL coefficient?

Contributor:
I think the KL controller should be another class, but then we have two options: either the KL controller changes the KL coefficient of the other class (like the LR scheduler changes the LR of the optimizer, or the target-param updaters in torchrl update the target params of the loss), or we explicitly pass the KL coefficient.
I think the first option is more "pytorch"-style.
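
A minimal sketch of the first option, assuming the wrapped object exposes a mutable `kl_coef` attribute (the class and attribute names here are illustrative, not the PR's API):

```python
import numpy as np


class DummyRollout:
    """Stand-in for the module that consumes the KL coefficient."""

    def __init__(self, kl_coef=0.1):
        self.kl_coef = kl_coef


class AdaptiveKLControllerSketch:
    """Updates the coefficient *of the wrapped module*, scheduler-style."""

    def __init__(self, model, init_kl_coef, target, horizon):
        self.model = model
        self.model.kl_coef = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, kl_value, n_steps):
        eps = np.clip(kl_value / self.target - 1, -0.2, 0.2)
        self.model.kl_coef *= 1 + eps * n_steps / self.horizon


rollout = DummyRollout()
controller = AdaptiveKLControllerSketch(rollout, init_kl_coef=0.1, target=6, horizon=10000)
controller.update(kl_value=9.0, n_steps=64)  # nudges rollout.kl_coef upward
```

The second option would keep the controller free of any model reference and simply pass `controller.coef` (or a raw float) to the rollout call, as the `kl_coef=0.0` evaluation snippet above does.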

Contributor (author):
> I think the KL controller should be another class

It actually is another class.

> the KL controller changes the KL coefficient of the other class

Isn't this what we are currently doing?

Contributor (author):
fixed

Contributor:
Should we not remove it then?

)

rollout_from_model = RolloutFromModel(model, ref_model, reward_model)
kl_controller = AdaptiveKLController(rollout_from_model, 0.1, 6, 10000)
Contributor:
I don't think AdaptiveKLController takes rollout_from_model as input, does it?
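
Based on the constructor quoted just below (`init_kl_coef`, `target`, `horizon`), the call would presumably be something like:

```python
# AdaptiveKLController as defined in the snippet quoted below (lines 21 to 92)
kl_controller = AdaptiveKLController(init_kl_coef=0.1, target=6, horizon=10000)
```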

Comment on lines 21 to 92
class KLControllerBase(abc.ABC):
"""Base class for KL controllers.

Each controller must implement an update method that takes the current KL value and
the number of steps and updates the self.coef attribute, which will multiply
the KL during calculation of the reward.
"""

@abc.abstractmethod
def update(self, kl_value: float, n_steps: int):
pass


class ConstantKLController(KLControllerBase):
"""Constant KL Controller.

This controller maintains a fixed coefficient no matter what values it is updated
with.

Arguments:
coefficient (float): The coefficient to multiply KL with when calculating the
reward.
"""

def __init__(self, coefficient):
self.coef = coefficient

def update(self, kl_value: float, n_steps: int):
pass


class AdaptiveKLController(KLControllerBase):
"""Adaptive KL Controller as described in Ziegler et al. "Fine-Tuning Language Models from Human Preferences".

Arguments:
init_kl_coef (float): The starting value of the coefficient.
target (float): The target KL value. When the observed KL is smaller, the
coefficient is decreased, thereby relaxing the KL penalty in the training
objective and allowing the model to stray further from the reference model.
When the observed KL is greater than the target, the KL coefficient is
increased, thereby pulling the model back towards the reference model.
horizon (int): Scaling factor to control how aggressively we update the
coefficient.

Reference: Section 2.2 https://arxiv.org/pdf/1909.08593.pdf#page=2
Source: https://github.com/openai/lm-human-preferences/blob/master/lm_human_preferences/train_policy.py
"""

def __init__(self, init_kl_coef: float, target: float, horizon: int):
self.coef = init_kl_coef
self.target = target
self.horizon = horizon

def update(self, kl_value: float, n_steps: int):
"""Update ``self.coef`` adaptively.

Arguments:
kl_value: The current KL value between the newest policy and the initial
policy.
n_steps: The number of training steps taken since last update.
"""
proportional_error = np.clip(kl_value / self.target - 1, -0.2, 0.2) # ϵₜ
mult = 1 + proportional_error * n_steps / self.horizon
self.coef *= mult # βₜ₊₁
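
For intuition, plugging illustrative numbers into the update above (the values are chosen for this example, not taken from the PR):

```python
import numpy as np

coef, target, horizon = 0.1, 6.0, 10_000
kl_value, n_steps = 9.0, 64                      # observed KL above the target
eps = np.clip(kl_value / target - 1, -0.2, 0.2)  # 0.5, clipped to 0.2
coef *= 1 + eps * n_steps / horizon              # 0.1 * 1.00128 -> 0.100128
```

When the observed KL exceeds the target, the coefficient grows slowly, strengthening the penalty; when it is below the target, `eps` is negative and the coefficient decays instead.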
Contributor:
Are these classes really part of data? They seem more related to the model to me. They act on a class that belongs to data (which maybe should be moved to the collector, to be honest), but the KL coefficient is something that has to do with the stochastic policy (the language model, in our case), not with the data.

Contributor:
New classes should be added to the doc (provided we're sure of where they belong)

model,
ref_model,
reward_model,
kl_controller,
Contributor:
Should we not remove it then?

"""Makes a step in the KL coefficient schedule."""
raise NotImplementedError
self.kl_controller.update(kl_value, n_steps)
Contributor:
ditto, maybe this function should go away?

@@ -167,7 +242,7 @@ def create_rollout_td(self, batch, generated, log_probs, log_ratio, kl_coef=0.1)
 )
 reward_raw = clipped_scores.unsqueeze(-1).unsqueeze(-1)
 reward_raw = reward_raw * done
-reward_kl = -kl_coef * log_ratio.unsqueeze(-1)
+reward_kl = -self.kl_controller.coef * log_ratio.unsqueeze(-1)
Contributor:
same here

apbard changed the title from "[Example, NOMERGE] RLHF end to end example" to "[Example] RLHF end to end example" on Jul 7, 2023
vmoens closed this on Jun 10, 2024