Solving the environment requires an average total reward of over 15.0 over 100 consecutive episodes.
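This solving criterion can be checked with a rolling window over the last 100 episode rewards; a minimal sketch (function and variable names are illustrative, not from this repository):

```python
from collections import deque

def is_solved(episode_rewards, window=100, threshold=15.0):
    """Return True once the average total reward over the last
    `window` consecutive episodes exceeds `threshold`."""
    if len(episode_rewards) < window:
        return False
    recent = list(episode_rewards)[-window:]
    return sum(recent) / window > threshold

# Keep only the most recent 100 rewards in memory.
scores = deque(maxlen=100)
for ep in range(150):
    scores.append(16.0)  # pretend every episode scores 16.0
print(is_solved(scores))  # True: average 16.0 > 15.0
```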
We solve the MinitaurBulletEnv environment in 1745 episodes, in about 20 hours, using the SAC algorithm;
see the original paper Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.
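The core of SAC is an entropy-regularized Bellman target for the critics: y = r + γ(1 − done)(min(Q1′, Q2′) − α log π(a′|s′)). A small numeric sketch of this target in plain Python (illustrative values, not from the training run):

```python
def soft_target(reward, gamma, q1_next, q2_next, log_prob_next, alpha, done):
    """Soft Bellman target used by SAC's critics:
    y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s')).
    The clipped double-Q minimum fights overestimation; the -alpha*log_prob
    term rewards policies that keep their entropy high."""
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return reward + gamma * (1.0 - done) * soft_value

# Example: a negative log-probability (high entropy) raises the target.
y = soft_target(reward=1.0, gamma=0.99, q1_next=5.0, q2_next=4.0,
                log_prob_next=-1.0, alpha=0.2, done=0.0)
print(round(y, 4))  # 5.158
```

In the full algorithm this target trains both critics by regression, while the actor is updated to maximize the same soft value under its own action distribution.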
Here is a graph of the average number of steps over the last 100 episodes.
...
Ep.: 1670, Tot.St.: 489093, Avg.Num.St.: 753.8, Min-Max.Sc.: (0.04, 20.80), Avg.Score: 13.621, Time: 17:11:46
Ep.: 1680, Tot.St.: 497313, Avg.Num.St.: 776.1, Min-Max.Sc.: (0.04, 21.44), Avg.Score: 14.260, Time: 17:35:09
Ep.: 1690, Tot.St.: 505945, Avg.Num.St.: 791.1, Min-Max.Sc.: (0.04, 21.44), Avg.Score: 14.479, Time: 17:59:44
Ep.: 1700, Tot.St.: 514981, Avg.Num.St.: 789.7, Min-Max.Sc.: (0.04, 21.44), Avg.Score: 14.473, Time: 18:25:31
Ep.: 1710, Tot.St.: 522821, Avg.Num.St.: 774.7, Min-Max.Sc.: (0.04, 22.46), Avg.Score: 14.205, Time: 18:47:56
Ep.: 1720, Tot.St.: 530160, Avg.Num.St.: 760.6, Min-Max.Sc.: (0.04, 22.51), Avg.Score: 14.020, Time: 19:09:32
Ep.: 1730, Tot.St.: 538166, Avg.Num.St.: 778.1, Min-Max.Sc.: (0.04, 22.51), Avg.Score: 14.498, Time: 19:33:12
Ep.: 1740, Tot.St.: 545961, Avg.Num.St.: 800.1, Min-Max.Sc.: (0.04, 22.51), Avg.Score: 14.872, Time: 19:56:14
Solved environment with Avg Score: 15.097705826385656
Full log is available in the jupyter notebook file.
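The log lines above follow a fixed format; a sketch of a formatter that produces similar lines (the function name and signature are hypothetical, not taken from the notebook):

```python
def format_log(episode, total_steps, avg_num_steps, min_score, max_score,
               avg_score, elapsed):
    """Render one progress line in the style of the training log above."""
    return (f"Ep.: {episode}, Tot.St.: {total_steps}, "
            f"Avg.Num.St.: {avg_num_steps:.1f}, "
            f"Min-Max.Sc.: ({min_score:.2f}, {max_score:.2f}), "
            f"Avg.Score: {avg_score:.3f}, Time: {elapsed}")

line = format_log(1740, 545961, 800.1, 0.04, 22.51, 14.872, "19:56:14")
print(line)
```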
The following hyperparameter settings did not reach the target average score of 15.0:

lr = 0.0001,
batch size = 512,
10000 episodes,
maximal value for average score = 13.85

lr = 0.00001,
batch size = 128,
40000 episodes,
maximal value for average score = 13.09

lr = 0.0001,
batch size = 1024,
10000 episodes,
maximal value for average score = 12.41
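The unsuccessful runs above can be summarized as a small search grid; a sketch (the dict layout is illustrative, not the repository's config format, and the values are copied from the attempts listed above):

```python
# Hyperparameter combinations that did not reach the target average score
# of 15.0 (values taken from the attempts listed above).
failed_attempts = [
    {"lr": 0.0001,  "batch_size": 512,  "episodes": 10000, "best_avg_score": 13.85},
    {"lr": 0.00001, "batch_size": 128,  "episodes": 40000, "best_avg_score": 13.09},
    {"lr": 0.0001,  "batch_size": 1024, "episodes": 10000, "best_avg_score": 12.41},
]

# None of these runs crossed the solving threshold of 15.0.
THRESHOLD = 15.0
best = max(a["best_avg_score"] for a in failed_attempts)
print(best, all(a["best_avg_score"] < THRESHOLD for a in failed_attempts))  # 13.85 True
```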
See the video Four stages of Minitaur training on YouTube.
See also the paper Learning to Walk via Deep Reinforcement Learning (Minitaur locomotion).
The implementation of the SAC algorithm is based on Pranjal Tandon's code (https://github.com/pranz24).