[chat] fix RM & MDP #4125

Open
CWHer opened this issue Jun 30, 2023 · 1 comment · Fixed by #4471 · May be fixed by #4309

CWHer commented Jun 30, 2023

Try to merge #3645 and evaluate its performance.

CWHer self-assigned this Jun 30, 2023
CWHer mentioned this issue Jun 30, 2023
CWHer linked a pull request Jul 24, 2023 that will close this issue

CWHer commented Aug 4, 2023

The PR contains two major changes.

  • Change of Reward Model and Critic

    TLDR: only last_hidden_states[:, -1] is used by the Critic / Reward Model.

    1. The input of the Critic model should end with an <eos> token (an input-preparation sketch follows this list).

    2. The forward fn of the Critic is changed to the following form:

      def forward(self, sequences, attention_mask=None):
          outputs = self.model(sequences, attention_mask=attention_mask)
          last_hidden_states = outputs['last_hidden_state']
          # only the hidden state of the last token (the appended <eos>) is used
          sentence_hidden_states = last_hidden_states[:, -1]
          values = self.value_head(sentence_hidden_states).squeeze(1)  # (batch_size,)
          return values
  • Change of MDP definition and reward

    TLDR: In the previous version, coati treats the whole response as a single step (transition) of the MDP. In this version, the user can manually choose to treat every chunk_size tokens as one step of the MDP.

    In sum, if chunk_size=1, the behavior of coati is close to trlx; if chunk_size=response_size, the behavior is close to the previous version.

    NOTE: trlx DOES NOT add an <eos> token to the Critic input, and the previous coati version may contain bugs in the RL training process.

    1. New MDP definition

      A chunk_size parameter is added to group every chunk_size tokens into one MDP step (a chunk-grouping sketch follows this list).

      e.g., if chunk_size = 2,

      s0 ---> a0 ---> s1 ---> a1 ---> ...
      where s0 = |prompt|, a0 = |t0t1|, s1 = |prompt|t0t1|, a1 = |t2t3|, ...

      $$ \pi (a_0\mid s_0)=\text{Pr}(t_0 \mid s_0) \times \text{Pr}(t_1 \mid s_0, t_0) $$

    2. Modified reward

    $$ \begin{align} \text{reward}[i] &= -\beta \times \log\left(\frac{\pi^{RL}(a_i \mid s_i)}{\pi^{SFT}(a_i \mid s_i)}\right) \\ r &= \text{Reward Model}(\mid \text{prompt} \mid \text{response} \mid \text{<eos>} \mid) \\ \text{reward}[\text{terminal}] &= \text{reward}[\text{terminal}] + r \end{align} $$

    3. Advantages and returns are then calculated using GAE (a reward/GAE sketch follows this list).
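
Sketch 1 (Critic / Reward Model input). As noted above, the Critic input should end with an <eos> token so that last_hidden_states[:, -1] sits on the <eos> position. The snippet below is a minimal sketch of that input preparation under the assumption that sequences are not right-padded; `prepare_critic_input` and the commented call are illustrative, not the exact coati API.

    import torch

    def prepare_critic_input(input_ids, attention_mask, eos_token_id):
        """Append an <eos> token to each sequence in the batch (assumes no right padding)."""
        batch_size = input_ids.size(0)
        eos_column = torch.full((batch_size, 1), eos_token_id,
                                dtype=input_ids.dtype, device=input_ids.device)
        mask_column = torch.ones((batch_size, 1),
                                 dtype=attention_mask.dtype, device=attention_mask.device)
        sequences = torch.cat([input_ids, eos_column], dim=1)
        attention_mask = torch.cat([attention_mask, mask_column], dim=1)
        return sequences, attention_mask

    # hypothetical usage:
    # values = critic(*prepare_critic_input(input_ids, attention_mask, tokenizer.eos_token_id))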
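
Sketch 2 (chunked MDP). With the new definition, the log-probability of an action is the sum of the per-token log-probabilities inside its chunk, e.g. log π(a_0 | s_0) = log Pr(t_0 | s_0) + log Pr(t_1 | s_0, t_0) for chunk_size = 2. The helper below is a minimal sketch of that grouping; `chunk_log_probs` and the assumption that `token_log_probs` covers only the response tokens are illustrative.

    import torch

    def chunk_log_probs(token_log_probs, chunk_size):
        """(batch, num_response_tokens) -> (batch, num_steps) by summing each chunk."""
        batch_size, num_tokens = token_log_probs.shape
        assert num_tokens % chunk_size == 0, "pad or truncate the response to a multiple of chunk_size"
        num_steps = num_tokens // chunk_size
        return token_log_probs.reshape(batch_size, num_steps, chunk_size).sum(dim=-1)

    # chunk_size=1 approximates trlx's per-token MDP;
    # chunk_size=num_response_tokens recovers the previous whole-response behavior.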
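
Sketch 3 (modified reward and GAE). The per-step reward is the KL penalty between the RL and SFT policies, with the Reward Model score added at the terminal step, after which GAE produces advantages and returns. The code below is a minimal sketch of that computation; tensor names and the bootstrap-value convention are assumptions.

    import torch

    def compute_rewards(rl_log_probs, sft_log_probs, rm_score, beta):
        """rl/sft_log_probs: (batch, num_steps) per-step action log-probs; rm_score: (batch,)."""
        rewards = -beta * (rl_log_probs - sft_log_probs)  # reward[i] = -beta * log(pi^RL / pi^SFT)
        rewards[:, -1] += rm_score                        # add r = RM(|prompt|response|<eos>|) at the terminal step
        return rewards

    def compute_gae(rewards, values, gamma=1.0, lam=0.95):
        """values: (batch, num_steps + 1); values[:, -1] is the bootstrap value (0 at the terminal state)."""
        advantages = torch.zeros_like(rewards)
        last_gae = torch.zeros_like(rewards[:, 0])
        for t in reversed(range(rewards.size(1))):
            delta = rewards[:, t] + gamma * values[:, t + 1] - values[:, t]
            last_gae = delta + gamma * lam * last_gae
            advantages[:, t] = last_gae
        returns = advantages + values[:, :-1]
        return advantages, returns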
