Training LSTMs involves lots of data transformation #158

Open
ernestum opened this issue Jan 11, 2019 · 3 comments
Labels
enhancement (New feature or request), help wanted (Help from contributors is needed)

Comments

@ernestum (Collaborator)

I looked at how exactly LSTMs are trained with PPO2 and found that a lot of unnecessary data transformation happens (a sketch of the full round trip follows the list below):

  1. Trajectories are sampled by the Runner. At the end of its run method, the data is flattened from shape [num_steps, num_envs, x] to [num_steps * num_envs, x] after swapping the first two dimensions:
    arr.swapaxes(0, 1).reshape(shape[0] * shape[1], *shape[2:])
  2. In the learn method of PPO2, a hard-to-understand mechanism shuffles the sampled trajectory data without mixing up samples that are adjacent in time. It relies on a large flat_indices array:
    flat_indices = np.arange(self.n_envs * self.n_steps).reshape(self.n_envs, self.n_steps)
  3. In the optimization step, the data has to be disentangled again using the batch_to_seq function, which reconstructs the [n_steps, num_envs, x] layout so that the LSTM graph can be built (note that this is exactly the format in which the trajectories were sampled in step 1):
    input_sequence = batch_to_seq(extracted_features, self.n_env, n_steps)
  4. For further processing, the data is converted back to the flat layout using seq_to_batch:
    rnn_output = seq_to_batch(rnn_output)
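
To make the round trip concrete, here is a minimal numpy sketch of steps 1-4 (my own approximation, not the actual TensorFlow code: batch_to_seq and seq_to_batch are emulated with plain reshapes, and the whole set of environments is treated as a single minibatch). The point is that the final layout is identical to the one we started from:

    import numpy as np

    num_steps, num_envs, x_dim = 4, 2, 3

    # Step 1: the Runner samples trajectories in shape [num_steps, num_envs, x] ...
    rollout = np.arange(num_steps * num_envs * x_dim).reshape(num_steps, num_envs, x_dim)
    # ... and flattens them env-major: swap the first two axes, then collapse them.
    flat = rollout.swapaxes(0, 1).reshape(num_envs * num_steps, x_dim)

    # Step 2: learn() shuffles whole environments via flat_indices, so that the
    # samples of one trajectory stay contiguous and in temporal order.
    flat_indices = np.arange(num_envs * num_steps).reshape(num_envs, num_steps)
    env_order = np.random.permutation(num_envs)
    minibatch = flat[flat_indices[env_order].ravel()]

    # Step 3: batch_to_seq (emulated) recovers one [num_envs, x] slice per time
    # step -- i.e. the layout the trajectories were sampled in to begin with.
    seq = [minibatch.reshape(num_envs, num_steps, x_dim)[:, t] for t in range(num_steps)]

    # Step 4: seq_to_batch (emulated) flattens everything again for the heads.
    batch_again = np.stack(seq, axis=1).reshape(num_envs * num_steps, x_dim)

    assert np.array_equal(np.stack(seq), rollout[:, env_order])  # step 3 recovers the sampled layout
    assert np.array_equal(batch_again, minibatch)                # steps 3 and 4 cancel out

In other words, steps 3 and 4 merely undo step 1 again (up to the environment permutation from step 2).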

All of this seems overly complex and potentially slow to me, which is why I would like to open the discussion here on how matters could be improved. Please set your ideas free :-)

@ernestum added the help wanted label on Jan 11, 2019
@araffin added the enhancement label on Jan 11, 2019
@ernestum (Collaborator, Author)

My first thought was that the runner should keep the data untouched and that we should feed it to the policy in the format [num_steps, num_envs, x] (a rough sketch follows the list below):

  • This saves a lot of preprocessing in the LSTM case.
  • In PPO2 (and maybe other algorithms) the nasty distinction between recurrent and non-recurrent policies could go away.
  • The burden of flattening is shifted to the feedforward policies (this breaks compatibility with any custom ActorCriticPolicy subclasses).
  • Is this easy to implement for algorithms other than PPO2?
  • Since we must not reorder the samples within one trajectory when training a recurrent policy, we cannot mix the training data as thoroughly as before. It is hard for me to guess how much this would hurt performance.
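
For illustration, here is a rough sketch of what that could look like (hypothetical code, nothing in it exists in the library; names like envs_per_batch are just placeholders): the runner hands out [num_steps, num_envs, x] as-is, and minibatching only permutes and slices the environment axis, so a recurrent policy consumes the batch directly while a feedforward policy flattens it itself.

    import numpy as np

    num_steps, num_envs, x_dim = 128, 8, 16
    envs_per_batch = 2

    # Hypothetical runner output: trajectories kept in their natural layout.
    rollout = np.random.randn(num_steps, num_envs, x_dim).astype(np.float32)

    # Minibatching: permute and slice only the environment axis, so every
    # trajectory stays contiguous and in temporal order.
    env_order = np.random.permutation(num_envs)
    for start in range(0, num_envs, envs_per_batch):
        envs = env_order[start:start + envs_per_batch]
        minibatch = rollout[:, envs]              # [num_steps, envs_per_batch, x]
        # A recurrent policy would consume this directly; a feedforward policy
        # would flatten it internally, e.g. minibatch.reshape(-1, x_dim).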

What do you think?

@ernestum ernestum changed the title Training LSTMs involces lots of data transformation Training LSTMs involves lots of data transformation Jan 12, 2019
@araffin (Collaborator) commented Jan 13, 2019

Yes, I completely agree that the LSTM code is overcomplicated (and that is also the reason I avoid using recurrent policies for now ^^"...).
However, I need a bit more time to give you insightful feedback.
Ping me again in two weeks if I haven't answered you ;)

@araffin (Collaborator) commented Apr 8, 2019

Referencing that PR here: openai#859
