Skip to content

Latest commit

 

History

History

MountainCarContinuous_PPO

Project - MountainCarContinuous with PPO (vectorized environments)

Environment

Usually, solving the environment require an average total reward of over the threshold over 100 consecutive episodes.
However, in this case the solution is achieved very fastly: in 21 episodes in 1 minute ! This is due to the fact
that there are 16 processes in use in this PPO implementation notebook. We can think that the real number of episodes
is 21x16 = 336.

We use PPO with vectorized environments, the basic paper: Proximal Policy Optimization Algorithms.
Vectorized Environments (in our case there are 16 environments) is a method that means that the agent is trained in
16 environments simultaneously.

Training score

Log

Ep. 1, Timesteps 999, Score.Agents: -100.28, Avg.Score: -104.26, Time: 00:00:07, Interval: 00:07
Ep. 2, Timesteps 999, Score.Agents: -89.06, Avg.Score: -99.19, Time: 00:00:11, Interval: 00:04
Ep. 3, Timesteps 999, Score.Agents: -81.92, Avg.Score: -94.87, Time: 00:00:15, Interval: 00:04
Ep. 4, Timesteps 999, Score.Agents: -72.26, Avg.Score: -90.35, Time: 00:00:19, Interval: 00:04
Ep. 5, Timesteps 999, Score.Agents: -64.13, Avg.Score: -85.98, Time: 00:00:23, Interval: 00:04
Ep. 6, Timesteps 999, Score.Agents: -56.60, Avg.Score: -81.78, Time: 00:00:27, Interval: 00:04
Ep. 7, Timesteps 999, Score.Agents: -45.15, Avg.Score: -77.21, Time: 00:00:31, Interval: 00:04
Ep. 8, Timesteps 999, Score.Agents: -15.89, Avg.Score: -70.39, Time: 00:00:35, Interval: 00:04
Ep. 9, Timesteps 999, Score.Agents: 35.67, Avg.Score: -59.79, Time: 00:00:39, Interval: 00:04
Ep. 10, Timesteps 999, Score.Agents: 78.21, Avg.Score: -47.24, Time: 00:00:42, Interval: 00:03
Ep. 11, Timesteps 999, Score.Agents: 130.52, Avg.Score: -32.43, Time: 00:00:46, Interval: 00:04
Ep. 12, Timesteps 999, Score.Agents: 191.98, Avg.Score: -15.17, Time: 00:00:50, Interval: 00:04
Ep. 13, Timesteps 999, Score.Agents: 223.33, Avg.Score: 1.87, Time: 00:00:54, Interval: 00:04
Ep. 14, Timesteps 999, Score.Agents: 284.03, Avg.Score: 20.68, Time: 00:00:58, Interval: 00:04
Ep. 15, Timesteps 999, Score.Agents: 331.33, Avg.Score: 40.10, Time: 00:01:02, Interval: 00:04
Ep. 16, Timesteps 999, Score.Agents: 394.99, Avg.Score: 60.97, Time: 00:01:06, Interval: 00:04
Ep. 17, Timesteps 999, Score.Agents: 479.05, Avg.Score: 84.20, Time: 00:01:10, Interval: 00:04
Ep. 18, Timesteps 999, Score.Agents: 518.25, Avg.Score: 107.04, Time: 00:01:14, Interval: 00:04
Ep. 19, Timesteps 999, Score.Agents: 553.75, Avg.Score: 129.38, Time: 00:01:18, Interval: 00:04
Ep. 20, Timesteps 999, Score.Agents: 619.99, Avg.Score: 152.74, Time: 00:01:22, Interval: 00:04
Environment solved with Average Score: 152.7411231610043

MountainCar, different models

Other PPO projects

References

  • Proximal Policy Optimization.
    "PPO has become the default reinforcement learning algorithm at OpenAI."

  • Vectorized Environments
    "Vectorized Environments are a method for stacking multiple independent environments into a single environment. Instead of training an RL agent on 1 environment per step, it allows us to train it on n environments per step."