
[Feature Proposal] Intrinsic Reward VecEnvWrapper #309

Open
araffin opened this issue May 6, 2019 · 17 comments
Labels: enhancement (New feature or request), experimental (Experimental Feature), v3 (Discussion about V3)
@araffin
Collaborator

araffin commented May 6, 2019

Recent approaches have proposed to enhance exploration using an intrinsic reward.
Among the techniques: forward-model based curiosity and Random Network Distillation (RND) (see the network types mentioned below).

The way I would do that:

  1. Using a VecEnvWrapper so it is compatible with all the algorithms without any modifications
  2. I would use a replay buffer inside the wrapper (this requires more memory but is quite general)
  3. the different parameters of the wrapper (a rough constructor sketch follows this list):
  • network: the network to use (for forward / RND / ... models), could be a CNN or an MLP
  • weight_intrinsic_reward: scale of the intrinsic reward compared to the extrinsic one
  • buffer_size: how many transitions to store
  • train_freq: train the network every n steps
  • gradient_steps: how many gradient steps per update
  • batch_size: minibatch size
  • learning_starts: start computing the intrinsic reward only after n steps
  • save/load: save and load the weights of the network used for computing the intrinsic reward
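
To make the proposal concrete, here is a minimal sketch of what such a wrapper could look like (class name, defaults, and the replay-buffer/training logic are illustrative assumptions, not an existing stable-baselines API):

```python
import numpy as np

from stable_baselines.common.vec_env import VecEnvWrapper


class IntrinsicRewardWrapper(VecEnvWrapper):
    """Hypothetical VecEnvWrapper that adds an intrinsic reward to the extrinsic one."""

    def __init__(self, venv, network="cnn", weight_intrinsic_reward=0.01,
                 buffer_size=10000, train_freq=100, gradient_steps=1,
                 batch_size=64, learning_starts=100):
        super(IntrinsicRewardWrapper, self).__init__(venv)
        self.weight_intrinsic_reward = weight_intrinsic_reward
        self.learning_starts = learning_starts
        self.num_steps = 0
        # The replay buffer and the curiosity model (forward model / RND / ...)
        # would be created here, depending on `network` and `buffer_size`.

    def _intrinsic_reward(self, obs):
        # Placeholder: e.g. the prediction error of a forward model or of an
        # RND predictor network, one value per parallel environment.
        return np.zeros(self.num_envs)

    def reset(self):
        return self.venv.reset()

    def step_wait(self):
        obs, rewards, dones, infos = self.venv.step_wait()
        self.num_steps += 1
        # Store the transition and train the curiosity model every `train_freq`
        # steps for `gradient_steps` updates of size `batch_size` (omitted here).
        if self.num_steps > self.learning_starts:
            rewards = rewards + self.weight_intrinsic_reward * self._intrinsic_reward(obs)
        return obs, rewards, dones, infos
```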

Drawbacks:

  • this would slow down training (because of the extra learning involved)
  • uses more memory (because of the replay buffer and the network)

Related issue: #299

@araffin araffin added the enhancement New feature or request label May 6, 2019
@hill-a
Copy link
Owner

hill-a commented May 6, 2019

  1. Using a VecEnvWrapper so it is compatible with all the algorithms without any modifications

Yes, good idea, definitely agree there.

  2. I would use a replay buffer inside the wrapper (this requires more memory but is quite general)

I'm not sure about wrappers and replay buffers at the moment, due to the inherent issues of modifying the states over time (by VecNorm, for example). A rework of this needs to be done in general, I think (like placing the replay buffer in a specific wrapper by default).

  3. the different parameters of the wrapper:

All agreed

Drawbacks:

  • this would slow down training (because of the extra learning involved)
  • uses more memory (because of the replay buffer and the network)

It would slow things down and use more memory, but not by an order of magnitude; it is a small factor increase at worst. IMO this is not that bad of a problem.

@huvar

huvar commented May 9, 2019

As far as I understand from the formal definitions, the state of recurrent units such as LSTMs is part of the environment (aka universe) state. So, shouldn't they be included in the curiosity calculations? This would be a drawback of calculating it in the Env, where it cannot be accounted for.

@araffin
Collaborator Author

araffin commented May 10, 2019

the state of recurrent units such as LSTMs is part of the environment (aka universe) state.

I would rather say that the state of the LSTM, which is in fact the memory cell and the hidden state, is part of the agent's policy, not the environment.

So, shouldn't they be included in the curiosity calculations?

I'm not aware of works that use the LSTM state of the agent's policy for creating an intrinsic reward... are you referring to a particular paper?

@araffin araffin added this to To do in Roadmap Jun 11, 2019
@NeoExtended

Hey all,
I am currently working on my thesis and am struggling a little bit with an environment which is hard to explore. Therefore I thought it would be great to try to implement curiosity and see how it works. I know the project is currently heading towards 3.0 with the switch to the PyTorch backend, but would you still be interested in a PR once I'm done?

@Miffyli
Collaborator

Miffyli commented Mar 17, 2020

@NeoExtended

Sadly, we will not be taking any new features/enhancements to v2 right now, as you mentioned. This could be added in later versions after v3.

But, if you wish to try out exploration techniques in your environment, take a look at Unity's ML-agents and their PPO. They support exploration bonuses.

@araffin
Collaborator Author

araffin commented Mar 17, 2020

but would you still be interested in a PR once I'm done?

This feature should be a gym.Wrapper, so it is independent of the backend.

This could be added in later versions after v3.

I agree, or at least we should reference the implementation in the doc.
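
For illustration, a minimal backend-independent skeleton along these lines (the class name is hypothetical and the curiosity model itself is left as a placeholder):

```python
import gym


class CuriosityRewardWrapper(gym.Wrapper):
    """Hypothetical gym.Wrapper that adds an intrinsic bonus to the reward."""

    def __init__(self, env, weight_intrinsic_reward=0.01):
        super(CuriosityRewardWrapper, self).__init__(env)
        self.weight_intrinsic_reward = weight_intrinsic_reward

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # The intrinsic bonus (e.g. RND prediction error) can be computed with
        # any framework, so the wrapper itself stays backend-agnostic.
        intrinsic = self._intrinsic_reward(obs)
        return obs, reward + self.weight_intrinsic_reward * intrinsic, done, info

    def _intrinsic_reward(self, obs):
        return 0.0  # placeholder for the actual curiosity model
```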

@Miffyli
Collaborator

Miffyli commented Mar 17, 2020

This feature should be a gym.Wrapper, so it is independent of the backend.

Hmm, actually that is a good point. At least some of the curiosity methods (like RND, which predicts the output of a random network) could be done simply like this.

@NeoExtended

This feature should be a gym.Wrapper, so it is independent of the backend.

Hmm, actually that is a good point. At least some of the curiosity methods (like RND, which predicts the output of a random network) could be done simply like this.

Exactly. I think we can abstract all the network code by reusing functions from the policies (for example the mlp_extractor method). Just the training process of the networks would be included in the wrapper and dependent on the backend, but that shouldn't be too hard to change afterwards.

@NeoExtended

I finally experimented with the wrapper today and noticed that we somehow need to train the RND networks inside the wrapper. Currently I don't see a way of doing this independently of the backend, since I need a new TF session for this. Or is there a different option to train the networks? (I am currently creating the target and predictor networks via the nature_cnn/mlp_extractor methods.)

@Miffyli
Collaborator

Miffyli commented Apr 2, 2020

@NeoExtended

You should be able to create different sessions, although I am not familiar with this (google might help you here). You could also try using PyTorch to implement the RND.

However, this is not a stable-baselines related issue per se, so you may close this issue if you have no further questions related to stable-baselines.
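
For reference, a rough sketch of the separate-graph/session approach, reusing nature_cnn from stable_baselines.common.policies; the rest of the structure (scopes, shapes, optimizer) is an illustrative assumption, not the actual wrapper code:

```python
import tensorflow as tf

from stable_baselines.common.policies import nature_cnn

# Build the RND networks in their own graph so they do not interfere with
# the agent's TensorFlow graph/session.
rnd_graph = tf.Graph()
with rnd_graph.as_default():
    obs_ph = tf.placeholder(tf.float32, shape=(None, 84, 84, 4), name="rnd_obs")
    with tf.variable_scope("rnd_target"):
        target_features = nature_cnn(obs_ph)      # fixed, randomly initialized
    with tf.variable_scope("rnd_predictor"):
        predictor_features = nature_cnn(obs_ph)   # trained to match the target
    # Per-sample prediction error serves as the intrinsic reward.
    intrinsic_reward = tf.reduce_mean(
        tf.square(predictor_features - tf.stop_gradient(target_features)), axis=-1)
    loss = tf.reduce_mean(intrinsic_reward)
    predictor_vars = tf.get_collection(
        tf.GraphKeys.TRAINABLE_VARIABLES, scope="rnd_predictor")
    train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(
        loss, var_list=predictor_vars)
    init_op = tf.global_variables_initializer()

rnd_sess = tf.Session(graph=rnd_graph)
rnd_sess.run(init_op)
```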

@NeoExtended

NeoExtended commented Apr 15, 2020

It's been a bit longer than just a few days, but I finally implemented the RND curiosity wrapper. I know you will not include it until after v3, but for those interested, I have already uploaded the code here.

The class is derived from a new BaseTFWrapper class, which just uses code copied from BaseRLModel and ActorCriticRLModel to implement saving and loading. This is quite ugly and would require some refactoring, but since the code needs to be rewritten anyway for v3, I did not invest the time.

I did some testing to verify the implementation and was able to train a PPO agent on Pong using the intrinsic reward only. As expected, the agent optimizes for episode length instead of trying to win (and maximizing the extrinsic reward).
[Plots: rnd_pong_episode_length, rnd_pong_episode_reward]

If you are interested, I would update the wrapper for v3 as soon as it is released.

@Miffyli
Collaborator

Miffyli commented Apr 15, 2020

Looks very promising, and quite compact thanks to reusing stable-baselines functions. This would be a good addition to v3, but it would also be a nice tool outside stable-baselines, as it is independent of the RL algorithms :).

Things should be cleaner by v3 with the PyTorch backend, so there is no need to delve into cleaning up the BaseTFWrapper for now.

@araffin araffin added the v3 Discussion about V3 label Apr 15, 2020
@m-rph

m-rph commented Apr 22, 2020

I am not sure if this is the correct approach. In RND, the critic network uses two value heads to estimate the two reward streams, so implementing it as a wrapper will block this approach, unless the wrapper returns both rewards in the info dict, but that is kind of messy.

@Miffyli
Collaborator

Miffyli commented Apr 22, 2020

Most of the mentioned curiosity methods just add the intrinsic reward to the extrinsic reward of the environment, and still show improvement over previous results. I agree that the dual architecture presented in the RND paper could be better (as the results indicate, especially with RNN policies), but I do not think it is worth the hassle to implement before we have these curiosity wrappers with a single reward stream.

@m-rph

m-rph commented Apr 22, 2020

I agree with you. Perhaps both streams could be made available in the info dict? This would be quite useful for evaluating performance and debugging the algorithm.
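
As a sketch, the step_wait of the hypothetical wrapper sketched earlier in this thread could expose both streams like this (the info keys are an assumption, not an agreed convention):

```python
    def step_wait(self):
        obs, rewards, dones, infos = self.venv.step_wait()
        intrinsic = self.weight_intrinsic_reward * self._intrinsic_reward(obs)
        # Keep both streams accessible for logging, evaluation, or a
        # dual-value-head algorithm, while still returning the combined reward.
        for env_idx, info in enumerate(infos):
            info["extrinsic_reward"] = rewards[env_idx]
            info["intrinsic_reward"] = intrinsic[env_idx]
        return obs, rewards + intrinsic, dones, infos
```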

@FabioPINO

FabioPINO commented Sep 18, 2020

I am sorry for jumping into the middle of your conversation. I am really interested in applying intrinsic rewards in the learning pipeline. I noticed that @NeoExtended created an RND curiosity wrapper, but I am not very familiar with how to use it. Is it possible to have a representative example of how to use this tool?

@NeoExtended

Hey @FabioPINO,
sorry for the late answer, I am quite busy at the moment. I will try to get you the minimal example I used to generate the plots above by the end of the week.
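
In the meantime, a minimal usage sketch under stated assumptions: RNDWrapper is a hypothetical name standing in for the linked implementation, and its constructor argument follows the parameters discussed earlier in this thread.

```python
from stable_baselines import PPO2
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack

# from rnd_wrapper import RNDWrapper  # hypothetical import of the linked code

# Standard Atari preprocessing for Pong, 8 parallel environments.
env = make_atari_env("PongNoFrameskip-v4", num_env=8, seed=0)
env = VecFrameStack(env, n_stack=4)

# Wrap the vectorized env with the curiosity wrapper (hypothetical signature).
env = RNDWrapper(env, weight_intrinsic_reward=1.0)

model = PPO2("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=int(1e7))
```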
