
[question] Reproduce the result of PPO on RoboschoolHumanoidFlagrunHarder #179

Open
doviettung96 opened this issue Jan 30, 2019 · 17 comments
Labels
question Further information is requested

Comments

@doviettung96

doviettung96 commented Jan 30, 2019

Hi @araffin,
Currently I am trying to reproduce the results of the PPO paper on the RoboschoolHumanoidFlagrunHarder environment. Although I have tried almost every setting, there is still a big gap between my results and theirs. I have modified the code to make logstd = LinearAnneal(-0.7, -1.6) as in the paper.
When I printed the logstd in distributions.py, I got:

<tf.Variable 'model/pi/logstd:0' shape=(1, 17) dtype=float32_ref>
However, when I tried to add the following code at the end of the PPO2 constructor:
with tf.variable_scope("model", reuse=True):
    self.logstd = tf.get_variable(name='pi/logstd:0')

I got this:

ValueError: Variable model/pi/logstd:0 does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=tf.AUTO_REUSE in VarScope?

I also tried using just the variable name "pi/logstd", but it still failed.
How can I change the value of logstd during training?
Thanks.
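
For reference, the schedule from the paper is just a linear interpolation from -0.7 to -1.6 over training. A minimal sketch of such a helper (the linear_anneal name and the progress argument below are illustrative, not a stable-baselines API):

def linear_anneal(start=-0.7, end=-1.6):
    # Returns a schedule mapping training progress in [0, 1] to a logstd value.
    def schedule(progress):
        return start + (end - start) * progress
    return schedule

logstd_schedule = linear_anneal(-0.7, -1.6)
logstd_schedule(0.0)   # -0.7 at the start of training
logstd_schedule(0.5)   # -1.15 halfway through
logstd_schedule(1.0)   # -1.6 at the end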

@araffin added the question (Further information is requested) label on Jan 30, 2019
@araffin
Collaborator

araffin commented Feb 2, 2019

Hello,

I also tried using just the variable name "pi/logstd", but it still failed.

I think the variable you are looking for is created here, when it is called from the policy.
I would check what the scope of that variable is (probably model/... and not directly pi/).

@doviettung96
Author

Yeah, that variable is created when the policy is created. In PPO2, the first call is in the construction of step_model, so the variable scope is "model". Please let me know if you manage to get that variable.
Thanks.

@araffin
Collaborator

araffin commented Feb 3, 2019

This works for me (without the :0):

with tf.variable_scope('model', reuse=True):
    print(tf.get_variable(name='pi/logstd'))
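
To actually change the value during training, one option (a sketch assuming a stable-baselines PPO2 instance named model, not an official API) is to fetch the variable once, build an assign op, and run it through the model's session, e.g. from a learning callback:

import numpy as np
import tensorflow as tf

# Sketch: fetch the logstd variable after the PPO2 model is built, then
# overwrite it with the annealed value through the model's session.
# The placeholder and assign op are created here; they are not part of PPO2.
with model.graph.as_default():
    with tf.variable_scope('model', reuse=True):
        logstd_var = tf.get_variable(name='pi/logstd')
    new_logstd_ph = tf.placeholder(tf.float32, shape=logstd_var.shape)
    assign_logstd = tf.assign(logstd_var, new_logstd_ph)

def set_logstd(value):
    # Broadcast a scalar logstd to every action dimension (shape (1, 17) here).
    model.sess.run(assign_logstd,
                   feed_dict={new_logstd_ph: np.full(logstd_var.shape.as_list(), value)})

Calling set_logstd from a callback with the annealed value (e.g. stepping from -0.7 towards -1.6) would then update the policy's exploration noise as training progresses.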

@doviettung96
Author

I will try that. Thank you.

@BruceK4t1qbit

@doviettung96 Let me know if you're able to train RoboschoolHumanoidFlagrunHarder successfully. I was not able to, even with annealing the logstd.

@doviettung96
Author

@BruceK4t1qbit,
How good is your trained agent? Could you provide some statistics, like the mean reward or the TensorBoard graph? I am trying to use logstd annealing, but for now I am running into a problem building the Roboschool library from source.
Anyway, did you try all the settings from the PPO paper?
Thanks.

@BruceK4t1qbit

@doviettung96
I tried to use all the settings from the PPO paper (it was a while ago, so I forget the details). I modified the original baselines code to do this.

I didn't use TensorBoard; I just looked at the rendering.

I've found that pybullet_envs is much easier to install than Roboschool...

@doviettung96
Author

@BruceK4t1qbit,
I think you would need the mean episode reward to have something to compare against. For now, I have also changed the code to use all the settings. As suggested, OpenAI Baselines and Stable Baselines are not the original version of the code used in the PPO paper, so I am not sure whether we can reproduce the result. If you find any improvement, please let me know.
Thanks.

@doviettung96
Author

@BruceK4t1qbit,
I just ran a test and found that logstd annealing is not important. The result is still quite far from the paper.

@BruceK4t1qbit

@doviettung96
I recently also tried SAC on it, which seemed to get stuck in the same local optimum...

@doviettung96
Author

@BruceK4t1qbit,
Really? My next step is also DDPG, TD3 and SAC. Given this news, I don't know whether we can train it successfully. Thanks.

@doviettung96
Author

@BruceK4t1qbit,
I have not tested the result carefully, but as of now my result is quite close to the trained agent shipped with Roboschool.
Everything is set as in the paper, and the agent is trained for 400M timesteps (as for the Roboschool trained agent).
Just try it again.

@ernestum
Collaborator

So it is just a matter of good luck?

@doviettung96
Author

I don't think so. Changes are necessary to improve performance on those tasks; it is just quite difficult to know what we should add to improve it.

@ernestum
Collaborator

So what did you change?

@doviettung96
Author

@erniejunior,
It depends on your starting point. If your starting point is the PPO paper, just increase the total timesteps to 400M and use a deeper network (hidden layers of 512-256-128 with ReLU activation, for example).
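
If you are using stable-baselines, one way to get such a network is sketched below (assuming stable-baselines 2.x and a Roboschool install that registers RoboschoolHumanoidFlagrunHarder-v1; all other PPO hyperparameters are left at their defaults here and would still need to be matched to the paper):

import tensorflow as tf
import roboschool  # registers the Roboschool gym environments
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

# Deeper 512-256-128 MLP with ReLU activations, trained for 400M timesteps.
model = PPO2(MlpPolicy, 'RoboschoolHumanoidFlagrunHarder-v1',
             policy_kwargs=dict(net_arch=[512, 256, 128], act_fun=tf.nn.relu),
             verbose=1)
model.learn(total_timesteps=int(4e8))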

@BruceK4t1qbit

Thanks! I didn't try such a big network.
