Minitaur-Soft-Actor-Critic

Project - MinitaurBulletEnv with Soft Actor Critic (SAC)

Introduction

Solving the environment requires an average total reward above 15.0 over 100 consecutive episodes.
We solve the MinitaurBulletEnv environment in 1745 episodes, in about 20 hours, using the SAC algorithm;
see the original paper, SAC: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor.
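The solving criterion can be sketched as a rolling mean over the most recent 100 episode scores. This is a minimal illustration, not the project's training loop: the function name `first_solved_episode` is hypothetical, and requiring a full 100-episode window before testing the threshold is an assumption.

```python
from collections import deque

SOLVED_THRESHOLD = 15.0  # average total reward required (from the text above)
WINDOW = 100             # number of consecutive episodes averaged


def first_solved_episode(scores, threshold=SOLVED_THRESHOLD, window=WINDOW):
    """Return the 1-based episode at which the rolling mean of the last
    `window` scores first exceeds `threshold`, or None if it never does.

    Hypothetical helper: it assumes the window must be full before the
    criterion is evaluated.
    """
    recent = deque(maxlen=window)  # drops the oldest score automatically
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window > threshold:
            return episode
    return None
```

For example, `first_solved_episode(episode_scores)` on a list of per-episode total rewards gives the episode at which training could stop.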

Training Score

Steps of episodes

Here is a graph of the average number of steps per episode, averaged over 100 episodes.

Other SAC projects

The last few lines from the log

...
Ep.: 1670, Tot.St.: 489093, Avg.Num.St.: 753.8, Min-Max.Sc.: (0.04, 20.80), Avg.Score: 13.621, Time: 17:11:46
Ep.: 1680, Tot.St.: 497313, Avg.Num.St.: 776.1, Min-Max.Sc.: (0.04, 21.44), Avg.Score: 14.260, Time: 17:35:09
Ep.: 1690, Tot.St.: 505945, Avg.Num.St.: 791.1, Min-Max.Sc.: (0.04, 21.44), Avg.Score: 14.479, Time: 17:59:44
Ep.: 1700, Tot.St.: 514981, Avg.Num.St.: 789.7, Min-Max.Sc.: (0.04, 21.44), Avg.Score: 14.473, Time: 18:25:31
Ep.: 1710, Tot.St.: 522821, Avg.Num.St.: 774.7, Min-Max.Sc.: (0.04, 22.46), Avg.Score: 14.205, Time: 18:47:56
Ep.: 1720, Tot.St.: 530160, Avg.Num.St.: 760.6, Min-Max.Sc.: (0.04, 22.51), Avg.Score: 14.020, Time: 19:09:32
Ep.: 1730, Tot.St.: 538166, Avg.Num.St.: 778.1, Min-Max.Sc.: (0.04, 22.51), Avg.Score: 14.498, Time: 19:33:12
Ep.: 1740, Tot.St.: 545961, Avg.Num.St.: 800.1, Min-Max.Sc.: (0.04, 22.51), Avg.Score: 14.872, Time: 19:56:14
Solved environment with Avg Score: 15.097705826385656

The full log is available in the Jupyter notebook file.
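To plot or compare runs, the log lines can be turned into records. The sketch below assumes the line format shown in the excerpt above and interprets `Time` as elapsed training time in `H:MM:SS`, which is an assumption; `parse_log_line` and the dictionary keys are hypothetical names.

```python
import re

# Pattern for one progress line, based on the log excerpt above.
LOG_LINE = re.compile(
    r"Ep\.: (?P<episode>\d+), "
    r"Tot\.St\.: (?P<total_steps>\d+), "
    r"Avg\.Num\.St\.: (?P<avg_steps>[\d.]+), "
    r"Min-Max\.Sc\.: \((?P<min_score>-?[\d.]+), (?P<max_score>-?[\d.]+)\), "
    r"Avg\.Score: (?P<avg_score>-?[\d.]+), "
    r"Time: (?P<time>\d+:\d{2}:\d{2})"
)


def parse_log_line(line):
    """Parse one progress line into a dict; return None for other lines."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(part) for part in match["time"].split(":"))
    return {
        "episode": int(match["episode"]),
        "total_steps": int(match["total_steps"]),
        "avg_steps": float(match["avg_steps"]),
        "min_score": float(match["min_score"]),
        "max_score": float(match["max_score"]),
        "avg_score": float(match["avg_score"]),
        # Assumes Time is elapsed wall-clock time since training started.
        "elapsed_seconds": hours * 3600 + minutes * 60 + seconds,
    }
```

Lines that do not match, such as the final "Solved environment" message, return `None` and can be filtered out.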

Trials not reaching the threshold

lr = 0.0001,
batch size = 512,
10000 episodes,
maximum value of the average score = 13.85

lr = 0.00001,
batch size = 128,
40000 episodes,
maximum value of the average score = 13.09

lr = 0.0001,
batch size = 1024,
10000 episodes,
maximum value of the average score = 12.41

Video

See the video "Four stages of Minitaur training" on YouTube.

Real Minitaur

Learning to Walk via Deep Reinforcement Learning, Minitaur-Locomotion.

Credit

The implementation of the SAC algorithm is based on Pranjal Tandon's code (https://github.com/pranz24).