[chat] fix RM & MDP #4125

Open
CWHer opened this issue Jun 30, 2023 · 1 comment · Fixed by #4471 · May be fixed by #4309

CWHer commented Jun 30, 2023

Try to merge #3645 and evaluate its performance.

CWHer self-assigned this Jun 30, 2023
CWHer mentioned this issue Jun 30, 2023
CWHer linked a pull request Jul 24, 2023 that will close this issue

CWHer commented Aug 4, 2023

The PR contains two major changes.

  • Change of Reward Model and Critic

    TLDR: only last_hidden_states[:, -1] is used by the Critic / Reward Model.

    1. The input of the Critic model should end with an <eos> token (an input-preparation sketch follows this list).

    2. The forward fn of the Critic is changed to the following form:

      def forward(self, sequences, attention_mask=None):
          outputs = self.model(sequences, attention_mask=attention_mask)
          last_hidden_states = outputs['last_hidden_state']
          # only the hidden state of the last token (the appended <eos>) is used
          sentence_hidden_states = last_hidden_states[:, -1]
          values = self.value_head(sentence_hidden_states).squeeze(1)  # (batch_size,)
          return values
  • Change of MDP definition and reward

    TLDR: In the previous version, coati treats the whole response as a single step (transition) of the MDP. In this version, the user can manually choose to treat every chunk_size tokens as one step of the MDP.

    In sum, if chunk_size=1, the behavior of coati is close to trlx; if chunk_size=response_size, the behavior is close to the previous version.

    NOTE: trlx DOES NOT add an <eos> token to the Critic input, and the previous coati version may contain bugs in the RL training process.

    1. New MDP definition

      A chunk_size parameter is added to group every chunk_size tokens into one MDP step (a chunk-grouping sketch follows this list).

      e.g., if chunk_size = 2,

      s0 ---> a0 ---> s1 ---> a1 ---> ...
      where s0 = |prompt|, a0 = |t0t1|, s1 = |prompt|t0t1|, a1 = |t2t3|, ...

      $$ \pi (a_0\mid s_0)=\text{Pr}(t_0 \mid s_0) \times \text{Pr}(t_1 \mid s_0, t_0) $$

    2. Modified reward

    $$ \begin{align} \text{reward}[i] &= -\beta \times \log\left(\frac{\pi^{RL}(a_i \mid s_i)}{\pi^{SFT}(a_i \mid s_i)}\right) \\ r &= \text{Reward Model}(\mid \text{prompt} \mid \text{response} \mid \text{<eos>} \mid) \\ \text{reward}[\text{terminal}] &= \text{reward}[\text{terminal}] + r \end{align} $$

    3. Advantages and returns are then calculated using GAE (a reward/GAE sketch follows this list).
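
Sketch 1 (Critic / Reward Model input). As noted above, the Critic input should end with an <eos> token so that last_hidden_states[:, -1] sits on the <eos> position. The snippet below is a minimal sketch of that input preparation under the assumption that sequences are not right-padded; `prepare_critic_input` and the commented call are illustrative, not the exact coati API.

    import torch

    def prepare_critic_input(input_ids, attention_mask, eos_token_id):
        """Append an <eos> token to each sequence in the batch (assumes no right padding)."""
        batch_size = input_ids.size(0)
        eos_column = torch.full((batch_size, 1), eos_token_id,
                                dtype=input_ids.dtype, device=input_ids.device)
        mask_column = torch.ones((batch_size, 1),
                                 dtype=attention_mask.dtype, device=attention_mask.device)
        sequences = torch.cat([input_ids, eos_column], dim=1)
        attention_mask = torch.cat([attention_mask, mask_column], dim=1)
        return sequences, attention_mask

    # hypothetical usage:
    # values = critic(*prepare_critic_input(input_ids, attention_mask, tokenizer.eos_token_id))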
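
Sketch 2 (chunked MDP). With the new definition, the log-probability of an action is the sum of the per-token log-probabilities inside its chunk, e.g. log π(a_0 | s_0) = log Pr(t_0 | s_0) + log Pr(t_1 | s_0, t_0) for chunk_size = 2. The helper below is a minimal sketch of that grouping; `chunk_log_probs` and the assumption that `token_log_probs` covers only the response tokens are illustrative.

    import torch

    def chunk_log_probs(token_log_probs, chunk_size):
        """(batch, num_response_tokens) -> (batch, num_steps) by summing each chunk."""
        batch_size, num_tokens = token_log_probs.shape
        assert num_tokens % chunk_size == 0, "pad or truncate the response to a multiple of chunk_size"
        num_steps = num_tokens // chunk_size
        return token_log_probs.reshape(batch_size, num_steps, chunk_size).sum(dim=-1)

    # chunk_size=1 approximates trlx's per-token MDP;
    # chunk_size=num_response_tokens recovers the previous whole-response behavior.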
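
Sketch 3 (modified reward and GAE). The per-step reward is the KL penalty between the RL and SFT policies, with the Reward Model score added at the terminal step, after which GAE produces advantages and returns. The code below is a minimal sketch of that computation; tensor names and the bootstrap-value convention are assumptions.

    import torch

    def compute_rewards(rl_log_probs, sft_log_probs, rm_score, beta):
        """rl/sft_log_probs: (batch, num_steps) per-step action log-probs; rm_score: (batch,)."""
        rewards = -beta * (rl_log_probs - sft_log_probs)  # reward[i] = -beta * log(pi^RL / pi^SFT)
        rewards[:, -1] += rm_score                        # add r = RM(|prompt|response|<eos>|) at the terminal step
        return rewards

    def compute_gae(rewards, values, gamma=1.0, lam=0.95):
        """values: (batch, num_steps + 1); values[:, -1] is the bootstrap value (0 at the terminal state)."""
        advantages = torch.zeros_like(rewards)
        last_gae = torch.zeros_like(rewards[:, 0])
        for t in reversed(range(rewards.size(1))):
            delta = rewards[:, t] + gamma * values[:, t + 1] - values[:, t]
            last_gae = delta + gamma * lam * last_gae
            advantages[:, t] = last_gae
        returns = advantages + values[:, :-1]
        return advantages, returns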
