
[Feature Proposal] Intrinsic Reward VecEnvWrapper #309

Open
araffin opened this issue May 6, 2019 · 17 comments
Labels: enhancement (New feature or request), experimental (Experimental Feature), v3 (Discussion about V3)
@araffin
Collaborator

araffin commented May 6, 2019

Recent approaches have proposed to enhance exploration using an intrinsic reward.
Among the techniques: forward-model based curiosity and Random Network Distillation (RND) (see the network types mentioned below).

The way I would do that:

  1. Using a VecEnvWrapper so it is compatible with all the algorithms without any modifications
  2. I would use a replay buffer inside the wrapper (this requires more memory but is quite general)
  3. the different parameters of the wrapper (a rough constructor sketch follows this list):
  • network: the network to use (for forward / RND / ... models), could be a CNN or an MLP
  • weight_intrinsic_reward: scale of the intrinsic reward compared to the extrinsic one
  • buffer_size: how many transitions to store
  • train_freq: train the network every n steps
  • gradient_steps: how many gradient steps per update
  • batch_size: minibatch size
  • learning_starts: start computing the intrinsic reward only after n steps
  • save/load: save and load the weights of the network used for computing the intrinsic reward
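
To make the proposal concrete, here is a minimal sketch of what such a wrapper could look like (class name, defaults, and the replay-buffer/training logic are illustrative assumptions, not an existing stable-baselines API):

```python
import numpy as np

from stable_baselines.common.vec_env import VecEnvWrapper


class IntrinsicRewardWrapper(VecEnvWrapper):
    """Hypothetical VecEnvWrapper that adds an intrinsic reward to the extrinsic one."""

    def __init__(self, venv, network="cnn", weight_intrinsic_reward=0.01,
                 buffer_size=10000, train_freq=100, gradient_steps=1,
                 batch_size=64, learning_starts=100):
        super(IntrinsicRewardWrapper, self).__init__(venv)
        self.weight_intrinsic_reward = weight_intrinsic_reward
        self.learning_starts = learning_starts
        self.num_steps = 0
        # The replay buffer and the curiosity model (forward model / RND / ...)
        # would be created here, depending on `network` and `buffer_size`.

    def _intrinsic_reward(self, obs):
        # Placeholder: e.g. the prediction error of a forward model or of an
        # RND predictor network, one value per parallel environment.
        return np.zeros(self.num_envs)

    def reset(self):
        return self.venv.reset()

    def step_wait(self):
        obs, rewards, dones, infos = self.venv.step_wait()
        self.num_steps += 1
        # Store the transition and train the curiosity model every `train_freq`
        # steps for `gradient_steps` updates of size `batch_size` (omitted here).
        if self.num_steps > self.learning_starts:
            rewards = rewards + self.weight_intrinsic_reward * self._intrinsic_reward(obs)
        return obs, rewards, dones, infos
```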

Drawbacks:

  • this would slow down training (because of the extra learning involved)
  • uses more memory (because of the replay buffer and the network)

Related issue: #299

@araffin araffin added the enhancement New feature or request label May 6, 2019
@hill-a
Copy link
Owner

hill-a commented May 6, 2019

  1. Using a VecEnvWrapper so it is compatible with all the algorithms without any modifications

Yes, good idea, definitely agree there.

  2. I would use a replay buffer inside the wrapper (this requires more memory but is quite general)

I'm not sure about wrappers and replay buffers at the moment, due to the inherent issues of modifying the states over time (by VecNorm, for example). A rework of this needs to be done in general, I think (like placing the replay buffer in a specific wrapper by default).

  3. the different parameters of the wrapper:

All agreed

Drawbacks:

  • this would slow down training (because of the extra learning involved)
  • uses more memory (because of the replay buffer and the network)

It would slow things down and use more memory, but not by an order of magnitude; it is a small factor increase at worst. IMO this is not that bad of a problem.

@huvar

huvar commented May 9, 2019

As far as I understand from the formal definitions, the state of recurrent units such as LSTMs is part of the environment (aka universe) state. So, shouldn't they be included in the curiosity calculations? This would be a drawback of calculating it in the Env, where it cannot be accounted for.

@araffin
Collaborator Author

araffin commented May 10, 2019

the state of recurrent units such as LSTMs is part of the environment (aka universe) state.

I would rather say that the state of the LSTM, which is in fact the memory cell and the hidden state, is part of the agent's policy, not the environment.

So, shouldn't they be included in the curiosity calculations?

I'm not aware of works that use the LSTM state of the agent's policy for creating an intrinsic reward... are you referring to a particular paper?

@araffin araffin added this to To do in Roadmap Jun 11, 2019
@NeoExtended

Hey all,
I am currently working on my thesis and am struggling a little bit with an environment which is hard to explore. Therefore I thought it would be great to try to implement curiosity and see how it works. I know the project is currently heading towards 3.0 with the switch to the PyTorch backend, but would you still be interested in a PR once I'm done?

@Miffyli
Collaborator

Miffyli commented Mar 17, 2020

@NeoExtended

Sadly, we will not be taking any new features/enhancements to v2 right now, as you mentioned. This could be added in later versions after v3.

But, if you wish to try out exploration techniques in your environment, take a look at Unity's ML-agents and their PPO. They support exploration bonuses.

@araffin
Collaborator Author

araffin commented Mar 17, 2020

but would you still be interested in a PR once I'm done?

This feature should be a gym.Wrapper, so it is independent of the backend.

This could be added in later versions after v3.

I agree, or at least we should reference the implementation in the doc.
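
For illustration, a minimal backend-independent skeleton along these lines (the class name is hypothetical and the curiosity model itself is left as a placeholder):

```python
import gym


class CuriosityRewardWrapper(gym.Wrapper):
    """Hypothetical gym.Wrapper that adds an intrinsic bonus to the reward."""

    def __init__(self, env, weight_intrinsic_reward=0.01):
        super(CuriosityRewardWrapper, self).__init__(env)
        self.weight_intrinsic_reward = weight_intrinsic_reward

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # The intrinsic bonus (e.g. RND prediction error) can be computed with
        # any framework, so the wrapper itself stays backend-agnostic.
        intrinsic = self._intrinsic_reward(obs)
        return obs, reward + self.weight_intrinsic_reward * intrinsic, done, info

    def _intrinsic_reward(self, obs):
        return 0.0  # placeholder for the actual curiosity model
```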

@Miffyli
Collaborator

Miffyli commented Mar 17, 2020

This feature should be a gym.Wrapper, so it is independent of the backend.

Hmm, actually that is a good point. At least some of the curiosity methods (like RND, which predicts the output of a random network) could be done simply like this.

@NeoExtended

This feature should be a gym.Wrapper, so it is independent of the backend.

Hmm, actually that is a good point. At least some of the curiosity methods (like RND, which predicts the output of a random network) could be done simply like this.

Exactly. I think we can abstract all the network code by reusing functions from the policies (for example the mlp_extractor method). Just the training process of the networks would be included in the wrapper and dependent on the backend, but that shouldn't be too hard to change afterwards.

@NeoExtended

I finally experimented with the wrapper today and noticed that we somehow need to train the RND networks inside the wrapper. Currently I don't see a way of doing this independently of the backend, since I need a new TF session for this. Or is there a different option to train the networks? (I am currently creating the target and predictor networks via the nature_cnn/mlp_extractor methods.)

@Miffyli
Collaborator

Miffyli commented Apr 2, 2020

@NeoExtended

You should be able to create different sessions, although I am not familiar with this (google might help you here). You could also try using PyTorch to implement the RND.

However, this is not a stable-baselines related issue per se, so you may close this issue if you have no further questions related to stable-baselines.
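
For reference, a rough sketch of the separate-graph/session approach, reusing nature_cnn from stable_baselines.common.policies; the rest of the structure (scopes, shapes, optimizer) is an illustrative assumption, not the actual wrapper code:

```python
import tensorflow as tf

from stable_baselines.common.policies import nature_cnn

# Build the RND networks in their own graph so they do not interfere with
# the agent's TensorFlow graph/session.
rnd_graph = tf.Graph()
with rnd_graph.as_default():
    obs_ph = tf.placeholder(tf.float32, shape=(None, 84, 84, 4), name="rnd_obs")
    with tf.variable_scope("rnd_target"):
        target_features = nature_cnn(obs_ph)      # fixed, randomly initialized
    with tf.variable_scope("rnd_predictor"):
        predictor_features = nature_cnn(obs_ph)   # trained to match the target
    # Per-sample prediction error serves as the intrinsic reward.
    intrinsic_reward = tf.reduce_mean(
        tf.square(predictor_features - tf.stop_gradient(target_features)), axis=-1)
    loss = tf.reduce_mean(intrinsic_reward)
    predictor_vars = tf.get_collection(
        tf.GraphKeys.TRAINABLE_VARIABLES, scope="rnd_predictor")
    train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(
        loss, var_list=predictor_vars)
    init_op = tf.global_variables_initializer()

rnd_sess = tf.Session(graph=rnd_graph)
rnd_sess.run(init_op)
```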

@NeoExtended

NeoExtended commented Apr 15, 2020

It's been a bit longer than just a few days, but I finally implemented the RND curiosity wrapper. I know you will not include it until after v3, but for those interested, I have already uploaded the code here.

The class is derived from a new BaseTFWrapper class, which just uses code copied from BaseRLModel and ActorCriticRLModel to implement saving and loading. This is quite ugly and would require some refactoring, but since the code needs to be rewritten anyway for v3, I did not invest the time.

I did some testing to verify the implementation and was able to train a PPO agent on Pong using the intrinsic reward only. As expected, the agent optimizes for episode length instead of trying to win (and maximizing the extrinsic reward).
[Plots: rnd_pong_episode_length, rnd_pong_episode_reward]

If you are interested, I would update the wrapper for v3 as soon as it is released.

@Miffyli
Collaborator

Miffyli commented Apr 15, 2020

Looks very promising, and quite compact thanks to reusing stable-baselines functions. This would be a good addition to v3, but it would also be a nice tool outside stable-baselines, as it is independent of the RL algorithms :).

Things should be cleaner by v3 with the PyTorch backend, so there is no need to delve into cleaning up the BaseTFWrapper for now.

@araffin araffin added the v3 Discussion about V3 label Apr 15, 2020
@m-rph

m-rph commented Apr 22, 2020

I am not sure if this is the correct approach. In RND, the critic network uses two value heads to estimate the two reward streams, so implementing it as a wrapper will block this approach, unless the wrapper returns both rewards in the info dict, but that is kind of messy.

@Miffyli
Collaborator

Miffyli commented Apr 22, 2020

Most of the mentioned curiosity methods just add the intrinsic reward to the extrinsic reward of the environment, and still show improvement over previous results. I agree that the dual architecture presented in the RND paper could be better (as the results indicate, especially with RNN policies), but I do not think it is worth the hassle to implement before we have these curiosity wrappers with a single reward stream.

@m-rph

m-rph commented Apr 22, 2020

I agree with you. Perhaps both streams could be made available in the info dict? This would be quite useful for evaluating performance and debugging the algorithm.
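
As a sketch, the step_wait of the hypothetical wrapper sketched earlier in this thread could expose both streams like this (the info keys are an assumption, not an agreed convention):

```python
    def step_wait(self):
        obs, rewards, dones, infos = self.venv.step_wait()
        intrinsic = self.weight_intrinsic_reward * self._intrinsic_reward(obs)
        # Keep both streams accessible for logging, evaluation, or a
        # dual-value-head algorithm, while still returning the combined reward.
        for env_idx, info in enumerate(infos):
            info["extrinsic_reward"] = rewards[env_idx]
            info["intrinsic_reward"] = intrinsic[env_idx]
        return obs, rewards + intrinsic, dones, infos
```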

@FabioPINO

FabioPINO commented Sep 18, 2020

I am sorry for jumping into the middle of your conversation. I am really interested in applying intrinsic rewards in the learning pipeline. I noticed that @NeoExtended created an RND curiosity wrapper, but I am not very familiar with how to use it. Is it possible to have a representative example of how to use this tool?

@NeoExtended

Hey @FabioPINO,
sorry for the late answer, I am quite busy at the moment. I will try to get you the minimal example I used to generate the plots above by the end of the week.
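
In the meantime, a minimal usage sketch under stated assumptions: RNDWrapper is a hypothetical name standing in for the linked implementation, and its constructor argument follows the parameters discussed earlier in this thread.

```python
from stable_baselines import PPO2
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack

# from rnd_wrapper import RNDWrapper  # hypothetical import of the linked code

# Standard Atari preprocessing for Pong, 8 parallel environments.
env = make_atari_env("PongNoFrameskip-v4", num_env=8, seed=0)
env = VecFrameStack(env, n_stack=4)

# Wrap the vectorized env with the curiosity wrapper (hypothetical signature).
env = RNDWrapper(env, weight_intrinsic_reward=1.0)

model = PPO2("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=int(1e7))
```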
