Training LSTMs involves lots of data transformation #158

Open
ernestum opened this issue Jan 11, 2019 · 3 comments
Labels
enhancement (New feature or request), help wanted (Help from contributors is needed)

Comments

@ernestum (Collaborator)

I looked at how exactly LSTMs are trained with PPO2 and found that a lot of unnecessary data transformation happens (a sketch of the full round trip follows the list below):

  1. Trajectories are sampled by the Runner. At the end of its run method, the data is flattened from shape [num_steps, num_envs, x] to [num_steps * num_envs, x] after swapping the first two dimensions:
    arr.swapaxes(0, 1).reshape(shape[0] * shape[1], *shape[2:])
  2. In the learn method of PPO2, a hard-to-understand mechanism shuffles the sampled trajectory data without mixing up samples that are adjacent in time. It relies on a large flat_indices array:
    flat_indices = np.arange(self.n_envs * self.n_steps).reshape(self.n_envs, self.n_steps)
  3. In the optimization step, the data has to be disentangled again using the batch_to_seq function, which reconstructs the [n_steps, num_envs, x] layout so that the LSTM graph can be built (note that this is exactly the format in which the trajectories were sampled in step 1):
    input_sequence = batch_to_seq(extracted_features, self.n_env, n_steps)
  4. For further processing, the data is converted back to the flat layout using seq_to_batch:
    rnn_output = seq_to_batch(rnn_output)
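
To make the round trip concrete, here is a minimal numpy sketch of steps 1-4 (my own approximation, not the actual TensorFlow code: batch_to_seq and seq_to_batch are emulated with plain reshapes, and the whole set of environments is treated as a single minibatch). The point is that the final layout is identical to the one we started from:

    import numpy as np

    num_steps, num_envs, x_dim = 4, 2, 3

    # Step 1: the Runner samples trajectories in shape [num_steps, num_envs, x] ...
    rollout = np.arange(num_steps * num_envs * x_dim).reshape(num_steps, num_envs, x_dim)
    # ... and flattens them env-major: swap the first two axes, then collapse them.
    flat = rollout.swapaxes(0, 1).reshape(num_envs * num_steps, x_dim)

    # Step 2: learn() shuffles whole environments via flat_indices, so that the
    # samples of one trajectory stay contiguous and in temporal order.
    flat_indices = np.arange(num_envs * num_steps).reshape(num_envs, num_steps)
    env_order = np.random.permutation(num_envs)
    minibatch = flat[flat_indices[env_order].ravel()]

    # Step 3: batch_to_seq (emulated) recovers one [num_envs, x] slice per time
    # step -- i.e. the layout the trajectories were sampled in to begin with.
    seq = [minibatch.reshape(num_envs, num_steps, x_dim)[:, t] for t in range(num_steps)]

    # Step 4: seq_to_batch (emulated) flattens everything again for the heads.
    batch_again = np.stack(seq, axis=1).reshape(num_envs * num_steps, x_dim)

    assert np.array_equal(np.stack(seq), rollout[:, env_order])  # step 3 recovers the sampled layout
    assert np.array_equal(batch_again, minibatch)                # steps 3 and 4 cancel out

In other words, steps 3 and 4 merely undo step 1 again (up to the environment permutation from step 2).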

All of this seems overly complex and potentially slow to me, which is why I would like to open the discussion here on how matters could be improved. Please set your ideas free :-)

@ernestum added the help wanted label on Jan 11, 2019
@araffin added the enhancement label on Jan 11, 2019
@ernestum (Collaborator, Author)

My first thought was that the runner should keep the data untouched and that we should feed it to the policy in the format [num_steps, num_envs, x] (a rough sketch follows the list below):

  • This saves a lot of preprocessing in the LSTM case.
  • In PPO2 (and maybe other algorithms) the nasty distinction between recurrent and non-recurrent policies could go away.
  • The burden of flattening is shifted to the feedforward policies (this breaks compatibility with any custom ActorCriticPolicy subclasses).
  • Is this easy to implement for algorithms other than PPO2?
  • Since we must not reorder the samples within one trajectory when training a recurrent policy, we cannot mix the training data as thoroughly as before. It is hard for me to guess how much this would hurt performance.
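
For illustration, here is a rough sketch of what that could look like (hypothetical code, nothing in it exists in the library; names like envs_per_batch are just placeholders): the runner hands out [num_steps, num_envs, x] as-is, and minibatching only permutes and slices the environment axis, so a recurrent policy consumes the batch directly while a feedforward policy flattens it itself.

    import numpy as np

    num_steps, num_envs, x_dim = 128, 8, 16
    envs_per_batch = 2

    # Hypothetical runner output: trajectories kept in their natural layout.
    rollout = np.random.randn(num_steps, num_envs, x_dim).astype(np.float32)

    # Minibatching: permute and slice only the environment axis, so every
    # trajectory stays contiguous and in temporal order.
    env_order = np.random.permutation(num_envs)
    for start in range(0, num_envs, envs_per_batch):
        envs = env_order[start:start + envs_per_batch]
        minibatch = rollout[:, envs]              # [num_steps, envs_per_batch, x]
        # A recurrent policy would consume this directly; a feedforward policy
        # would flatten it internally, e.g. minibatch.reshape(-1, x_dim).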

What do you think?

@ernestum ernestum changed the title Training LSTMs involces lots of data transformation Training LSTMs involves lots of data transformation Jan 12, 2019
@araffin (Collaborator) commented Jan 13, 2019

Yes, I completely agree that the LSTM code is overcomplicated (and that is also the reason I avoid using recurrent policies for now ^^"...).
However, I need a bit more time to give you insightful feedback.
Ping me again in two weeks if I haven't answered you ;)

@araffin (Collaborator) commented Apr 8, 2019

Referencing that PR here: openai#859
