Trying to understand how the LSTM policy works #278
Comments
Hello,
I think this is a good question, and some documentation is needed on that. To be honest, I have not had the time to dive into the obscure mechanics of the LSTM in the codebase, but I would recommend looking at PPO2 or A2C instead, because the ACER code is very hard to read. And please tell us your findings, that would be valuable for the community ;) Related: #158
I have only ever looked at PPO2 too. I will try to get back to you in a few days when I have some more time!
Also related: openai#859
Hello, is there any update on this? I have the same questions as @Caisho. The way the LSTM policy is used doesn't make sense to me.
Admittedly that part of the code could be clearer, but this is how I have understood it:
Late edit: Disregard above. The code seems to run backprop through time over the gathered rollout, i.e. |
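For anyone following along, here is a rough sketch of what "backprop through time over the gathered rollout" implies for the forward pass. It is not the library's code: `cell`, `step` and `unroll` are hypothetical names, the scalar `tanh` cell only stands in for the LSTM, and the convention that a `done` flag resets the state before the flagged observation is an assumption. The point is that stepping one observation at a time while carrying the state, and replaying all `n_steps` observations from the saved initial state at training time, should give the same outputs.

```python
import math


def cell(h, x, w_h=0.5, w_x=1.0):
    """Toy scalar recurrent cell (hypothetical stand-in for the LSTM)."""
    return math.tanh(w_h * h + w_x * x)


def step(obs, state, done):
    """Acting: one observation in, one output out; the caller carries the state."""
    if done:
        state = 0.0  # assumed convention: reset before processing this observation
    new_state = cell(state, obs)
    return new_state, new_state  # (output, next state)


def unroll(obs_seq, initial_state, dones):
    """Training: replay the whole rollout from the saved initial state."""
    state, outputs = initial_state, []
    for obs, done in zip(obs_seq, dones):
        state = state * (1.0 - float(done))  # mask the state at episode boundaries
        state = cell(state, obs)
        outputs.append(state)
    return outputs


# One environment, n_steps = 4
observations = [0.1, -0.3, 0.7, 0.2]
dones = [False, False, True, False]
initial_state = 0.25

# Rollout: the policy sees a single observation per call
state, rollout_outputs = initial_state, []
for obs, done in zip(observations, dones):
    out, state = step(obs, state, done)
    rollout_outputs.append(out)

# Training: the same observations are processed as one sequence
train_outputs = unroll(observations, initial_state, dones)
```

In an actual training step, gradients would then flow back through every step of `unroll`, which is what makes it backprop through time.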
Thank you @Miffyli |
@Miffyli Sorry, I didn't get what your response means for questions 1 and 2.
@Miffyli Do we skip all rewards (i.e. not train on them) except the last one as we collect these steps?
If |
as far as I know the LSTM model takes if you say we get |
Non-RNN models take all samples from all environments, bundle them together and train on a batch of |
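To make the contrast concrete, here is a small sketch of the two batch layouts. It is only an illustration under assumptions: `samples`, `flat` and `to_sequences` are hypothetical names, and the environment-major ordering of the flattened batch is assumed rather than taken from the library. A feed-forward model can treat the flattened batch as an unordered bag of samples, while a recurrent model needs each environment's samples regrouped into a time-ordered sequence.

```python
import random

n_envs, n_steps = 2, 3

# Hypothetical rollout storage: samples[env][t] labels one collected sample
samples = [[f"env{e}_t{t}" for t in range(n_steps)] for e in range(n_envs)]

# Non-recurrent case: bundle everything into one batch of n_envs * n_steps
# samples; the order carries no meaning, so shuffling is harmless
flat = [s for env_samples in samples for s in env_samples]
shuffled = flat[:]
random.Random(0).shuffle(shuffled)


def to_sequences(flat_batch, n_envs, n_steps):
    """Regroup an environment-major flat batch into per-environment sequences."""
    return [flat_batch[e * n_steps:(e + 1) * n_steps] for e in range(n_envs)]


# Recurrent case: time order inside each environment must be preserved
sequences = to_sequences(flat, n_envs, n_steps)
```

Applying `to_sequences` to `shuffled` instead of `flat` would mix time steps and environments, which is why the recurrent case cannot shuffle individual samples.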
Dear @erniejunior,
I have been trying to trace how the LSTM policy works (with ACER) and it's rather confusing. My understanding is that n_steps = LSTM sequence length, so each batch (n_env * n_steps) is fed into the LSTM policy for train_step. However, in _Runner.run, self.model.step only takes in one observation of shape (1, obs_dim) instead of (n_steps, obs_dim) when generating the predicted action.
So my 2 questions are: