
Trying to understand how the LSTM policy works #278

Open
Caisho opened this issue Apr 17, 2019 · 12 comments
Labels: documentation (Documentation should be updated), question (Further information is requested)

Comments


Caisho commented Apr 17, 2019

Dear @erniejunior,

I have been trying to trace how the LSTM policy works (with ACER) and it is rather confusing. My understanding is that n_steps is the LSTM sequence length, so each batch (n_env * n_steps) is fed into the LSTM policy for train_step. However, in _Runner.run, self.model.step only takes in a single observation of shape (1, obs_dim) instead of (n_steps, obs_dim) when generating the predicted action.

So my 2 questions are:

  1. Can you explain a little how the LSTM policy works when it is trained on a sequence of observations but predicts from only one observation?
  2. It seems that the training batch is not slid across the sequence? E.g. with data {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} and 5 timesteps, it is trained as the batches {0, 1, 2, 3, 4} and {5, 6, 7, 8, 9} rather than {0, 1, 2, 3, 4} followed by {1, 2, 3, 4, 5}.
araffin added the question and documentation labels Apr 28, 2019
araffin (Collaborator) commented Apr 28, 2019

Hello,

I have been trying to trace how the LSTM policy works (with ACER) and it is rather confusing.

I think this is a good question and some documentation is needed on that. To be honest, I did not have the time to dive into the obscure mechanics of the LSTM in the codebase, but I would rather recommend looking at PPO2 or A2C, because the ACER code is very hard to read.

And please tell us your findings; that would be valuable for the community ;)

Related: #158

ernestum (Collaborator) commented:

I only ever looked at PPO2 too. I will try to get back to you when I have some more time in a few days!

araffin (Collaborator) commented Apr 30, 2019

Also related: openai#859

andris955 commented Nov 20, 2019

Hello,

Is there any update on this? I have the same questions as @Caisho. The way the LSTM policy is used doesn't make sense to me.

Miffyli (Collaborator) commented Nov 20, 2019

Admittedly that part of the code could be clearer, but this is how I have understood it:

No unrolling/backprop-through-time is used here. Each step is handled separately, where the hidden state is just one of the inputs. This makes learning harder but also makes the implementation easier, as we can treat hidden states just like any other input. The "right way" of doing recurrent policies with RL agents is still ongoing research (see e.g. R2D2). For prediction we just feed in observations and the hidden states from previous calls.

Note that this is based on the observation that states are stored as numpy arrays during training; they are fed alongside observations and are not updated during training steps.

Late edit: Disregard above. The code seems to run backprop through time over the gathered rollout, i.e. n_steps. The previous known hidden state is used as the initial point. Only these initial states are stored in numpy arrays.
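
For readers who find this easier to follow in code, here is a minimal sketch of the data flow described in the late edit. It uses tf.keras with made-up sizes rather than the actual TF1 graph code in stable-baselines: during the rollout the LSTM is called one step at a time with the hidden state carried in numpy arrays, and during training the whole rollout of n_steps is fed as one sequence, so backprop-through-time starts from the stored initial state.

```python
import numpy as np
import tensorflow as tf  # illustration only; stable-baselines itself uses TF1 graph code

n_envs, n_steps, n_features, n_lstm = 4, 8, 3, 16
lstm = tf.keras.layers.LSTM(n_lstm, return_sequences=True, return_state=True)

# --- Rollout: one observation per env per call, state kept in numpy arrays ---
h = np.zeros((n_envs, n_lstm), dtype=np.float32)
c = np.zeros((n_envs, n_lstm), dtype=np.float32)
initial_state = (h.copy(), c.copy())  # only this initial state is remembered for training
rollout_obs = []
for t in range(n_steps):
    obs = np.random.randn(n_envs, n_features).astype(np.float32)  # stand-in for env observations
    rollout_obs.append(obs)
    # sequence length 1: feed the stored state in, read the new state back out
    _, h, c = lstm(obs[:, None, :], initial_state=[tf.constant(h), tf.constant(c)])
    h, c = h.numpy(), c.numpy()

# --- Training: the rollout is one length-n_steps sequence per env, so
# --- backprop-through-time runs over n_steps from the stored initial state
seq = np.stack(rollout_obs, axis=1)  # shape (n_envs, n_steps, n_features)
outputs, _, _ = lstm(seq, initial_state=[tf.constant(s) for s in initial_state])
```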

andris955 commented:

Thank you @Miffyli

iza88 commented Dec 10, 2020

@Miffyli Sorry, I didn't get what your response means for questions 1 and 2.

Miffyli (Collaborator) commented Dec 10, 2020

@iza88

  1. The hidden state is stored in a numpy array when predicting on one-step observations (the same is done inside the network during training, except that it all stays in the TF graph); see the sketch below.
  2. Training is done in batches of (num_envs, n_steps), parallelizing over the number of environments ("batch size") and backpropagating through time along the second axis.
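
To make point 1 concrete, below is a minimal usage sketch, not code from this thread, assuming PPO2 with MlpLstmPolicy on a hypothetical CartPole-v1 setup. The state returned by predict() is simply fed back in on the next one-step call.

```python
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpLstmPolicy

# nminibatches=1 because recurrent policies require n_envs % nminibatches == 0
model = PPO2(MlpLstmPolicy, "CartPole-v1", nminibatches=1)

obs = model.env.reset()
state = None        # None -> zero hidden state on the first call
done = [False]      # mask used to reset the state when an episode ends
for _ in range(100):
    # predict() returns the action and the updated LSTM state,
    # which is carried over (as a numpy array) to the next call
    action, state = model.predict(obs, state=state, mask=done)
    obs, reward, done, info = model.env.step(action)
```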

iza88 commented Dec 10, 2020

@Miffyli
Suppose num_envs=4. Does that mean that on a rollout we get 4 points to train on, regardless of n_steps?
E.g. reward function: (n_steps, features_count) -> reward

Do we skip all rewards (i.e. not train on them) except the last one as we collect these steps?

Miffyli (Collaborator) commented Dec 10, 2020

If num_envs=4, then the batch size will be 4. In total there will be num_envs * n_steps points for training. I do not believe I understand the second question about rewards; no rewards are skipped.

iza88 commented Dec 10, 2020

As far as I know, the LSTM model takes (m, n_features) as input, whereas a non-RNN model is fine with a shape of (n_features,).

If you say we get num_envs * n_steps points for training, that means the LSTM is fed with (n_features,), which confuses me.

Miffyli (Collaborator) commented Dec 10, 2020

Non-RNN models take all samples from all environments, bundle them together, and train on a batch of shape (num_envs * n_steps, n_features). RNN models keep the data in (num_envs, n_steps, n_features) format so that the RNN layer can process data over time (the second dimension).
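
To illustrate the two batch layouts with arbitrary sizes (a sketch, not stable-baselines code):

```python
import numpy as np

n_envs, n_steps, n_features = 4, 128, 8
rollout = np.random.randn(n_envs, n_steps, n_features).astype(np.float32)

# Non-recurrent policy: envs and time are flattened into a single batch axis,
# giving num_envs * n_steps = 512 independent samples.
flat_batch = rollout.reshape(n_envs * n_steps, n_features)  # shape (512, 8)

# Recurrent policy: the time axis is kept so the LSTM can scan over it,
# with num_envs acting as the batch size.
seq_batch = rollout  # shape (4, 128, 8)
```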
