[Feature Request] Interaction Drivers/Runners #381
Comments
Hello, if I understand correctly, the main advantage of a Driver is that it is easy to have the same logs/metrics for each algo. Or do you see other advantages? I feel this would require too many changes in the codebase, because some algos use n steps of interaction with the environment, others use one, and it gets even more complicated when you have environments running in parallel. I would rather go for callbacks if it is only about metrics (see #348).
Ok. Now I want to prove that my agent outperforms the DQN on MountainCar, without any further training of either. Is there a way to do something like the following with callbacks?
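A rough sketch of what I mean (`my_agent` is a placeholder for my own model; the loop assumes a plain Gym env):

```python
# Rough sketch: compare two trained models by mean episode reward.
# `my_agent` is a placeholder for my own model.
def mean_episode_reward(model, env, n_episodes=100):
    episode_rewards = []
    for _ in range(n_episodes):
        obs, done, total_reward = env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = env.step(action)
            total_reward += reward
        episode_rewards.append(total_reward)
    return sum(episode_rewards) / n_episodes

print(mean_episode_reward(dqn_model, env))
print(mean_episode_reward(my_agent, env))
```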
So, you want to compare them after training only (i.e., on the final performance)? Callbacks are mainly made for doing things during training.
Without drivers, a really simple solution would be to do the same thing you are doing in the Zoo's enjoy script. Something like adding an `enjoy` or `evaluate` method to `BaseRLModel`, which does the same as `learn` but without updating the network's parameters, and with the option of rendering while acting. How does that sound?
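Hypothetical usage (this method does not exist yet; the name and signature are illustrative):

```python
# Hypothetical: same interaction loop as `learn`, but the network
# parameters stay frozen and frames can optionally be rendered
mean_reward = model.evaluate(n_episodes=100, render=True)
```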
That sounds more plausible. I'm just afraid of all the possible problems that may come from the environment (which is also the reason for the Zoo's enjoy script), especially the wrappers; even keeping it simple, with one env, it is tricky, and with multiple envs it gets even worse.

But I think it could be a good idea anyway. For instance, we have a lot of redundant code in the tests; I made a function that may be used as a base: stable-baselines/tests/test_her.py, lines 14 to 32 at 65ed396.

I need the point of view of the other maintainers: @hill-a, @erniejunior, @AdamGleave?
Awesome, let's wait for their response.
I'd be reluctant to change e.g.
I agree with @AdamGleave; in fact, I think adding a file
After training has finished, I roll out a large number of trajectories/episodes to estimate the policy's performance. For this I keep writing custom code, which is not desirable: it means extra work and extra risk of introducing bugs, which is especially bad in evaluation code (maybe your policy is very good, but you miss that because there is a bug in your evaluation code). So an interaction runner as you describe it is an awesome idea!

However, to get a meaningful estimate of your policy's performance, you often need a very large number of trajectories, often more than you might think. For example, if you want to estimate the failure rate of a policy that fails about 0.1% of the time and you only use 1000 episodes, then with a probability of about 37% you will not observe a single failure in your 1000 test episodes, thereby greatly overestimating the reliability of the policy. In this particular case you need at least about 5000 test episodes. So it is clear that, depending on your problem domain, you might need a large number of test episodes, which comes with problems of its own.
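To make that arithmetic concrete (assuming independent episodes with a true failure rate of 0.1%):

```python
# Probability of seeing zero failures in n independent test episodes
# when the true per-episode failure rate is p
p = 0.001
for n in (1000, 5000):
    print(n, (1 - p) ** n)
# 1000 episodes -> ~0.37 (the failure mode is likely missed entirely)
# 5000 episodes -> ~0.0067
```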
See here for an example of how I implemented most of the above (except for parallel environment execution) for a specific environment and with PPO2 in mind. Maybe you can generalize from it. Sorry for mixing up the terms episode and trajectory here.
Agree with @erniejunior's comments. Just to chime in with another use case for this code: for imitation learning algorithms, it is common to need to perform rollouts of a policy while training the reward model. It would be nice if the interaction driver/runner also easily supported this use case. I'm sure we do this in the GAIL implementation inside Stable Baselines. Here is another example of rollouts for AIRL & GAIL.
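A sketch of that use case (hypothetical helper; assumes a plain Gym env and a stable-baselines model):

```python
import numpy as np

# Hypothetical helper: collect (observation, action) pairs from policy
# rollouts, e.g. as input for training a reward model
def collect_rollouts(model, env, n_episodes=10):
    observations, actions = [], []
    for _ in range(n_episodes):
        obs, done = env.reset(), False
        while not done:
            action, _ = model.predict(obs)
            observations.append(obs)
            actions.append(action)
            obs, _, done, _ = env.step(action)
    return np.array(observations), np.array(actions)
```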
Currently, the way to train an agent is to 1) instantiate the environment, 2) instantiate the agent, passing the environment to its constructor, and 3) call the `learn` method.
Some agent frameworks have started implementing execution drivers, i.e., objects responsible for handling the interaction between the agent and the environment.
I would love to see such a feature in stable-baselines, given that it would greatly simplify the pipeline for testing a new agent and comparing it with existing ones.
Current code looks roughly like this (a minimal sketch, using DQN on MountainCar as an example):
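```python
import gym
from stable_baselines import DQN

# Sketch of the usual workflow: build env, build agent, call learn
env = gym.make("MountainCar-v0")
model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)
```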
What if there were driver objects, so that execution would go something like this? Example (the API below is purely illustrative; none of these classes exist yet):
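```python
import gym
from stable_baselines import DQN

# Illustrative only: hypothetical driver objects that own the
# agent-environment interaction loop
env = gym.make("MountainCar-v0")
agent = DQN("MlpPolicy", env)

train_driver = TrainDriver(agent, env, total_timesteps=100000)
train_driver.run()

eval_driver = EvaluationDriver(agent, env, n_episodes=100, render=False)
mean_reward = eval_driver.run()
```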
And a base class, sketched:
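```python
class BaseDriver:
    """Hypothetical base class: owns the agent-environment interaction loop."""

    def __init__(self, agent, env):
        self.agent = agent
        self.env = env

    def run(self):
        """Run the interaction loop; subclasses decide whether to update
        parameters (training), freeze them (evaluation), or render."""
        raise NotImplementedError
```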
I would love to contribute such a feature.
Let me know what you think.