[Feature Proposal] Intrinsic Reward VecEnvWrapper #309
Yes, good idea. Definitely agree there.
I'm not sure about wrappers and replay buffers at the moment, due to the inherent issues of modifying the states over time (by VecNorm, for example). A rework of this needs to be done in general, I think (like placing the replay buffer in a specific wrapper by default).
All agreed
It would slow things down and use more memory, but not by an order of magnitude; it is a small factor increase at worst. IMO this is not that bad of a problem.
As far as I understand from the formal definitions, the state of recurrent units such as LSTMs is part of the environment (aka universe) state. So, shouldn't they be included in the curiosity calculations? This would be a drawback of calculating it in the Env, where it cannot be accounted for.
I would rather say that the state of the LSTM, which is in fact the memory cell and the hidden state, is part of the agent policy, not the environment.
I'm not aware of works that use the LSTM state of the agent policy for creating an intrinsic reward... are you referring to a particular paper?
Hey *,
Sadly we will not be taking any new features/enhancements to v2 right now, like you mentioned. This could be added in later versions after v3. But if you wish to try out exploration techniques in your environment, take a look at Unity's ML-agents and their PPO. They support exploration bonuses.
This feature should be a
I agree, or at least reference the implementation in the doc.
Hmm, actually that is a good point. At least some of the curiosity methods (like RND, predicting the output of a random network) could be done simply like this.
Exactly. I think we can abstract all the network code by reusing functions from the policies (for example, the mlp_extractor method). Just the training process of the networks would be included in the wrapper and dependent on the backend, but that shouldn't be too hard to change afterwards.
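For anyone skimming later, a rough sketch of what such an RND bonus could look like in TF1 (using plain dense layers rather than the nature_cnn/mlp_extractor helpers discussed above; the function and scope names below are made up for illustration):

```python
import tensorflow as tf

def build_rnd_bonus(obs_ph, feature_dim=64, learning_rate=1e-4):
    """Hypothetical sketch of the RND idea: a frozen random target network and a
    trained predictor network; the intrinsic reward is the prediction error."""
    with tf.variable_scope("rnd_target"):
        hidden = tf.layers.dense(obs_ph, 64, activation=tf.nn.relu)
        target = tf.stop_gradient(tf.layers.dense(hidden, feature_dim))
    with tf.variable_scope("rnd_predictor"):
        hidden = tf.layers.dense(obs_ph, 64, activation=tf.nn.relu)
        prediction = tf.layers.dense(hidden, feature_dim)
    # one bonus value per observation in the batch
    intrinsic_reward = tf.reduce_mean(tf.square(prediction - target), axis=-1)
    # only the predictor is trained; the target network stays random forever
    predictor_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                       scope="rnd_predictor")
    train_op = tf.train.AdamOptimizer(learning_rate).minimize(
        tf.reduce_mean(intrinsic_reward), var_list=predictor_vars)
    return intrinsic_reward, train_op
```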
I finally experimented with the wrapper today and noticed that we somehow need to train the RND networks inside of the wrapper. Currently I don't see a way of doing this independently of the backend, since I need a new tf session for this. Or is there a different option to train the networks? (I am currently creating the target and predictor networks via the nature_cnn/mlp_extractor methods.)
You should be able to create different sessions, although I am not familiar with this (Google might help you here). You could also try using PyTorch to implement the RND. However, this is not a stable-baselines related issue per se, so you may close this issue if you have no further questions related to stable-baselines.
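In case it helps, one hedged way to get an independent session is to give the wrapper its own tf.Graph and bind a tf.Session to it, so nothing collides with the graph and session created by the RL algorithm. A minimal sketch (hypothetical class name, TF1):

```python
import tensorflow as tf

class PrivateSessionModule(object):
    """Hypothetical sketch: the wrapper owns a private graph and session,
    independent of the session used by the RL algorithm."""

    def __init__(self, obs_dim, feature_dim=64):
        self.graph = tf.Graph()
        with self.graph.as_default():
            self.obs_ph = tf.placeholder(tf.float32, shape=(None, obs_dim), name="obs")
            hidden = tf.layers.dense(self.obs_ph, 64, activation=tf.nn.relu)
            self.features = tf.layers.dense(hidden, feature_dim)
            self.init_op = tf.global_variables_initializer()
        # session bound to the private graph only
        self.sess = tf.Session(graph=self.graph)
        self.sess.run(self.init_op)

    def forward(self, obs):
        # runs in the private session, no interaction with the agent's graph
        return self.sess.run(self.features, feed_dict={self.obs_ph: obs})
```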
It's been a bit longer than just a few days, but I finally implemented the RND curiosity wrapper. I know you will not include it until after v3, but for those interested I already uploaded the code here. The class is derived from the new BaseTFWrapper class, which just uses copied code from BaseRLModel and ActorCriticRLModel to implement saving and loading. This is quite ugly and would require some refactoring, but since the code needs to be rewritten anyway for v3 I did not invest the time. I did some testing to verify the implementation and was able to train a PPO agent on Pong using intrinsic reward only. As expected, the agent optimizes for episode length instead of trying to win (and maximizing extrinsic reward). If you are interested I would update the wrapper for v3 as soon as it is released.
Looks very promising, and quite compact thanks to reusing stable-baselines functions. This would be a good addition to v3, but it would be a nice tool outside stable-baselines too, as it is independent of the RL algorithms :). Things should be cleaner by v3, which will be PyTorch-based, so there is no need to delve into cleaning up BaseTFWrapper for now.
I am not sure if this is the correct approach. In RND the critic network uses two value heads to estimate the two reward streams, so implementing it as a wrapper will block this approach unless the wrapper returns both rewards in the info dict, but that is kind of messy.
Most of the mentioned curiosity methods just add the intrinsic reward to the extrinsic reward of the environment, and still show improvement over previous results. I agree the dual architecture as presented in the RND paper could be better (as the results indicate, especially with RNN policies), but I do not think it is worth the hassle to implement before we implement these curiosity wrappers with a single reward stream.
I agree with you, perhaps both streams could be available in the info dict.
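To make that suggestion concrete, here is a hedged sketch (hypothetical class and parameter names, not code from this thread) of a single-stream wrapper that still exposes both streams through the per-env info dicts, so a dual-head implementation could consume them later:

```python
import numpy as np
from stable_baselines.common.vec_env import VecEnvWrapper

class IntrinsicRewardInfoWrapper(VecEnvWrapper):
    """Hypothetical sketch: return extrinsic + intrinsic as the single reward,
    but also expose both streams separately via the info dicts."""

    def __init__(self, venv, intrinsic_fn, scale=1.0):
        super(IntrinsicRewardInfoWrapper, self).__init__(venv)
        self.intrinsic_fn = intrinsic_fn  # maps a batch of obs to one bonus per env
        self.scale = scale

    def reset(self):
        return self.venv.reset()

    def step_wait(self):
        obs, rewards, dones, infos = self.venv.step_wait()
        bonuses = np.asarray(self.intrinsic_fn(obs), dtype=np.float32)
        for idx, info in enumerate(infos):
            info["extrinsic_reward"] = float(rewards[idx])
            info["intrinsic_reward"] = float(bonuses[idx])
        return obs, rewards + self.scale * bonuses, dones, infos
```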
I am sorry for getting in the middle of your conversation. I am really interested in the application of intrinsic reward in the learning pipeline. I noticed that @NeoExtended created an RND curiosity wrapper, but I am not very familiar with how to use it. Is it possible to have a representative example of how to use this tool?
Hey @FabioPINO,
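(Illustration only; the API of the linked wrapper may differ, so treat the wrapper class below as a placeholder referring to the sketch earlier in this thread, not to the linked repository.) With stable-baselines v2 the usual pattern is to wrap the vectorized env before handing it to the algorithm:

```python
import gym
import numpy as np
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# placeholder bonus: a real curiosity wrapper would compute e.g. the RND prediction error
def dummy_bonus(obs):
    return 0.01 * np.ones(obs.shape[0], dtype=np.float32)

venv = DummyVecEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])
# hypothetical wrapper from the sketch above, added on top of the vectorized env
venv = IntrinsicRewardInfoWrapper(venv, intrinsic_fn=dummy_bonus, scale=1.0)

model = PPO2("MlpPolicy", venv, verbose=1)
model.learn(total_timesteps=100000)
```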
Recent approaches have proposed to enhance exploration using an intrinsic reward.
Among the techniques:
The way I would do that:
Drawbacks:
Related issue: #299