Solving the environment requires an average total reward of over 2500 across 100 consecutive episodes.
Training of HopperBulletEnv is performed with the Twin Delayed DDPG (TD3) algorithm; see
the original paper, Addressing Function Approximation Error in Actor-Critic Methods.
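The central idea of TD3 is clipped double Q-learning: two critics are trained, and the Bellman target uses the minimum of their estimates to reduce overestimation bias. A minimal sketch of the target computation (function name and scalar inputs are illustrative, not from this repo's code):

```python
import numpy as np

def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    # Clipped double Q-learning: take the minimum of the two target
    # critics' estimates of the next state-action value, so a single
    # overestimating critic cannot inflate the target.
    min_q = np.minimum(q1_next, q2_next)
    # Standard discounted Bellman target; (1 - done) zeroes the
    # bootstrap term at terminal transitions.
    return reward + gamma * (1.0 - done) * min_q

# Example: reward 1.0, non-terminal, critics disagree (2.0 vs 3.0)
y = td3_target(1.0, 0.0, 2.0, 3.0)  # uses min(2.0, 3.0) = 2.0
```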
In this directory we solve the HopperBulletEnv environment in 3240 episodes with the noise parameter std = 0.03,
and in 5438 episodes with noise std = 0.02.
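The std values above control the Gaussian exploration noise added to the actor's output during training. A small sketch of how such noise is typically applied (the function name and action bounds are assumptions, not taken from this repo):

```python
import numpy as np

def noisy_action(policy_action, noise_std=0.03, low=-1.0, high=1.0):
    # Add zero-mean Gaussian exploration noise; noise_std is the
    # tunable parameter discussed above (0.03 vs 0.02 runs).
    noise = np.random.normal(0.0, noise_std, size=np.shape(policy_action))
    # Clip back into the environment's valid action range.
    return np.clip(policy_action + noise, low, high)

action = noisy_action(np.zeros(3), noise_std=0.03)
```

A smaller std explores less aggressively, which here corresponded to slower convergence (5438 vs 3240 episodes).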
The score of 2500 was achieved at episode 3240 after 25 hours 28 minutes of training.
The score of 2500 was achieved at episode 5438 after 36 hours 59 minutes of training.
See also: Three aspects of Deep RL: noise, overestimation and exploration.
See the video Lucky Hopper on YouTube.
The source paper is Addressing Function Approximation Error in Actor-Critic Methods
by Scott Fujimoto, Herke van Hoof, and David Meger.