HopperBulletEnv-v0-SAC

Project - HopperBulletEnv with Soft Actor-Critic (SAC)

Environment

Solving the environment requires an average total reward of over 2500 across 100 consecutive episodes.
HopperBulletEnv is trained with the Soft Actor-Critic (SAC) algorithm; see the
two basic papers, SAC: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor
and SAC Algorithms and Applications. The HopperBulletEnv environment was solved
in two experiments: (I) in 7662 episodes, (II) in 3814 episodes.
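
The solve criterion above can be sketched as a rolling average over the last 100 episode scores. This is a minimal illustration, not the project's training loop: the function name first_solved_episode and the choice to wait for a full 100-episode window before testing the threshold are assumptions.

```python
from collections import deque


def first_solved_episode(scores, threshold=2500.0, window=100):
    """Return the 1-based episode at which the rolling mean first exceeds threshold.

    Hypothetical helper: it assumes the average is only checked once
    `window` scores are available. Returns None if never solved.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window > threshold:
            return episode
    return None


# Made-up scores: 100 episodes at 2400, then 100 episodes at 2600.
demo_scores = [2400.0] * 100 + [2600.0] * 100
solved_at = first_solved_episode(demo_scores)
```

Because the comparison is strict, a rolling average of exactly 2500 does not count as solved in this sketch.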

Tips to class GaussianPolicy

Scale and Bias

The variable scale is half the length of the interval [low, high]:
scale = (action_space.high - action_space.low)/2

The variable bias is the center of the interval [low, high]:
bias = (action_space.high + action_space.low)/2

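With these values, the map x --> (x - bias)/scale sends [low, high] to [(low - bias)/scale, (high - bias)/scale] = [-1,1].
Conversely, a tanh output y in [-1,1] is mapped back to an action in [low, high] by y*scale + bias.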
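
As a small illustration of the two maps, the sketch below computes scale and bias per action dimension and converts in both directions. It uses plain Python lists instead of the environment's action_space, and the bounds and the helper names (scale_and_bias, to_env_action, to_unit_range) are hypothetical.

```python
def scale_and_bias(low, high):
    """Return per-dimension (scale, bias) for the box [low, high]."""
    scale = [(h - l) / 2.0 for l, h in zip(low, high)]
    bias = [(h + l) / 2.0 for l, h in zip(low, high)]
    return scale, bias


def to_env_action(squashed, scale, bias):
    """Map values in [-1, 1] (e.g. tanh outputs) to [low, high]."""
    return [y * s + b for y, s, b in zip(squashed, scale, bias)]


def to_unit_range(action, scale, bias):
    """Map actions in [low, high] back to [-1, 1]."""
    return [(a - b) / s for a, s, b in zip(action, scale, bias)]


# Hypothetical two-dimensional action bounds, chosen only for the example.
low, high = [-2.0, 0.0], [2.0, 10.0]
scale, bias = scale_and_bias(low, high)
corners = to_env_action([-1.0, 1.0], scale, bias)
```

For a symmetric action range such as [-1, 1], this gives scale = 1 and bias = 0, so the tanh output is used unchanged.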

Activation Function

The hyperbolic tangent function torch.tanh is very similar to
the logistic sigmoid function g(x) = 1/(1 + exp(-x)).
However, the range of the logistic sigmoid function is (0,1), while the range of tanh is (-1,1).
tanh is therefore expected to be more efficient: it has a wider range, and its derivative is steeper
(at most 1, compared with at most 1/4 for the sigmoid);
see Comparison of Activation Functions for Deep Neural Networks.

Actually, tanh is the rescaled logistic sigmoid function, namely, tanh(x) = 2g(2x) - 1.
We also note that (tanh(x))' = 1 - (tanh(x))^2.
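
Both identities can be checked numerically. The sketch below uses Python's math module rather than torch.tanh so that it runs without dependencies; the sample points and the finite-difference step are arbitrary choices.

```python
import math


def sigmoid(x):
    """Logistic sigmoid g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_derivative_numeric(x, h=1e-6):
    """Central finite-difference approximation of d/dx tanh(x)."""
    return (math.tanh(x + h) - math.tanh(x - h)) / (2.0 * h)


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    # tanh is a rescaled sigmoid: tanh(x) = 2 g(2x) - 1
    assert math.isclose(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0, abs_tol=1e-12)
    # derivative identity: tanh'(x) = 1 - tanh(x)^2
    assert math.isclose(tanh_derivative_numeric(x), 1.0 - math.tanh(x) ** 2, abs_tol=1e-8)
```

The derivative identity is what makes the change-of-variables term for a tanh-squashed action cheap to compute: it only needs the already computed tanh value.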

Reparameterization

See Auto-Encoding Variational Bayes by D. Kingma and M. Welling.
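
A minimal sketch of the reparameterization trick for a tanh-squashed Gaussian policy, for a single scalar action. It is written with the Python standard library rather than PyTorch, so it only illustrates the sampling formula and does not compute gradients. The function name sample_action, its parameters, and the constant 1e-6 are assumptions made for the example, not the project's GaussianPolicy code.

```python
import math
import random


def sample_action(mean, log_std, scale=1.0, bias=0.0, rng=random):
    """Sample action = scale * tanh(mean + std * eps) + bias and its log-density.

    Hypothetical scalar sketch. The noise eps does not depend on the policy
    parameters, so the sample is a deterministic function of (mean, log_std, eps).
    """
    std = math.exp(log_std)
    eps = rng.gauss(0.0, 1.0)      # noise drawn from N(0, 1)
    u = mean + std * eps           # reparameterized Gaussian sample
    y = math.tanh(u)               # squash to (-1, 1)
    action = y * scale + bias      # rescale to the action interval

    # Log-density of u under N(mean, std^2); note (u - mean) / std == eps.
    log_prob = -0.5 * eps * eps - log_std - 0.5 * math.log(2.0 * math.pi)
    # Change of variables for action = scale * tanh(u) + bias,
    # using (tanh(u))' = 1 - tanh(u)^2; 1e-6 guards against log(0).
    log_prob -= math.log(scale * (1.0 - y * y) + 1e-6)
    return action, log_prob


# Seeded generator so the example is repeatable.
action, log_prob = sample_action(0.3, -1.0, scale=2.0, bias=1.0, rng=random.Random(0))
```

In an autograd framework the same expression lets gradients flow through mean and log_std, because the randomness is isolated in eps.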

Training Score

i. The score of 2500 was reached in episode 7662, after 89 hours 23 minutes of training.
Here, learning rate = 0.0001.

ii. The score of 2500 was reached in episode 3814, after 37 hours 58 minutes of training.
Here, learning rate = 0.0003.

Other Soft Actor-Critic projects

The last few lines from the log of experiment (ii)

...
Ep.: 3803, Total Steps: 2498568, Ep.Steps: 1000, Score: 2518.460, Avg.Score: 2494.938, Max.Score: 2557.114, Time: 37:48:57
Ep.: 3804, Total Steps: 2499568, Ep.Steps: 1000, Score: 2522.515, Avg.Score: 2495.311, Max.Score: 2557.114, Time: 37:49:54
Ep.: 3805, Total Steps: 2500568, Ep.Steps: 1000, Score: 2517.100, Avg.Score: 2495.309, Max.Score: 2557.114, Time: 37:50:52
Ep.: 3806, Total Steps: 2501568, Ep.Steps: 1000, Score: 2529.950, Avg.Score: 2495.785, Max.Score: 2557.114, Time: 37:51:50
Ep.: 3807, Total Steps: 2502568, Ep.Steps: 1000, Score: 2544.071, Avg.Score: 2496.402, Max.Score: 2557.114, Time: 37:52:48
Ep.: 3808, Total Steps: 2503568, Ep.Steps: 1000, Score: 2545.314, Avg.Score: 2496.918, Max.Score: 2557.114, Time: 37:53:45
Ep.: 3809, Total Steps: 2504568, Ep.Steps: 1000, Score: 2548.591, Avg.Score: 2497.434, Max.Score: 2557.114, Time: 37:54:43
Ep.: 3810, Total Steps: 2505568, Ep.Steps: 1000, Score: 2543.297, Avg.Score: 2498.089, Max.Score: 2557.114, Time: 37:55:41
Ep.: 3811, Total Steps: 2506568, Ep.Steps: 1000, Score: 2548.280, Avg.Score: 2498.782, Max.Score: 2557.114, Time: 37:56:39
Ep.: 3812, Total Steps: 2507568, Ep.Steps: 1000, Score: 2555.617, Avg.Score: 2499.684, Max.Score: 2557.114, Time: 37:57:37
Ep.: 3813, Total Steps: 2508568, Ep.Steps: 1000, Score: 2562.541, Avg.Score: 2500.254, Max.Score: 2562.541, Time: 37:58:35
Solved environment with Avg Score: 2500.253525302146

Videos

See the videos
Lucky Hopper and
Chessboard chase with four PyBullet actors on YouTube.

Articles on Soft Actor-Critic

Credit

Based on Pranjal Tandon's code (https://github.com/pranz24).