In this notebook, we implement an agent that learns to play Pong with
the PPO (Proximal Policy Optimization) algorithm.
As with the REINFORCE version,
the model learns from pixels.
I. Collect trajectories based on the policy \pi_{\theta'};
initialize \theta' = \theta.
II. Compute the gradient of the clipped surrogate function.
III. Gradient ascent, update \theta': \theta' \leftarrow \theta' + \alpha \nabla_{\theta'} L^{CLIP}(\theta').
IV. The inner loop of PPO training: Steps II and III are repeated k times,
i.e., every trajectory is used k times before it is thrown away. In our case, k = 4;
for REINFORCE, k = 1. In the code, k = SGD_epoch; see the function clipped_surrogate
in the file pong_utils.py (a sketch of this loop follows the list).
V. Outer loop: back to Step I. Set \theta = \theta' and go to new episodes with new trajectories.
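To make Steps II-V concrete, here is a minimal PyTorch sketch of the clipped-surrogate update with the inner (k = SGD_epoch) and outer loops. It is not the code from pong_utils.py: the tiny network, tensor names, and hyperparameters (epsilon, learning rate, batch size) are illustrative stand-ins.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def clipped_surrogate_sketch(policy, old_probs, states, actions, rewards, epsilon=0.1):
    # r_t(theta') = pi_theta'(a_t | s_t) / pi_theta(a_t | s_t)
    new_probs = policy(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    ratio = new_probs / old_probs
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # L^CLIP = mean over t of min(r_t * R_t, clip(r_t, 1-eps, 1+eps) * R_t)
    return torch.min(ratio * rewards, clipped * rewards).mean()

# Toy stand-ins for one batch of collected trajectories (Step I).
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(),
                       nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = optim.Adam(policy.parameters(), lr=1e-4)
states = torch.randn(64, 4)            # for Pong these would be preprocessed pixel frames
actions = torch.randint(0, 2, (64,))
rewards = torch.randn(64)              # normalized discounted future rewards
old_probs = policy(states).gather(1, actions.unsqueeze(1)).squeeze(1).detach()

SGD_epoch = 4                          # k: each trajectory is reused 4 times
for _ in range(SGD_epoch):             # Steps II-III, the inner loop
    loss = -clipped_surrogate_sketch(policy, old_probs, states, actions, rewards)
    optimizer.zero_grad()
    loss.backward()                    # gradient of the clipped surrogate
    optimizer.step()                   # one step of gradient ascent on L^CLIP
# Step V, the outer loop: set theta = theta' and collect fresh trajectories
```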
RL uses the idea of rewards to determine which actions to perform.
Here the reward is simply +1 for every round the agent wins and -1 for every round the CPU opponent wins.
For more complex games, rewards can be tied to score increments. In real-life applications, computing rewards
can be trickier, especially when there is no obvious single score or objective to optimize.
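As a small illustration of how these sparse +1/-1 round outcomes can be turned into a per-step learning signal, the snippet below computes discounted future rewards and normalizes them. The discount factor and the normalization are common choices in PPO/REINFORCE Pong code, assumptions rather than details stated above.

```python
import numpy as np

def discounted_future_rewards(rewards, gamma=0.995):
    """R_t = r_t + gamma * R_{t+1}, then normalized across the episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    # normalize so early and late frames contribute on a comparable scale
    return (returns - returns.mean()) / (returns.std() + 1e-10)

# Example: the agent loses a round at step 3 (-1) and wins one at step 7 (+1).
print(discounted_future_rewards([0, 0, 0, -1, 0, 0, 0, 1]))
```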
The environment was solved:
Training is performed by 8 parallel agents. The agents run in
8 independent environments and update the same neural network.
envs = parallelEnv('PongDeterministic-v4', n=8)
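As a rough stand-in for what this vectorized environment does (not the actual parallelEnv implementation), the sketch below steps several independent gym Pong environments in lockstep, using the classic gym API; the random actions and loop length are purely illustrative.

```python
import gym

n = 8
envs = [gym.make('PongDeterministic-v4') for _ in range(n)]   # 8 independent environments
frames = [env.reset() for env in envs]                        # classic gym API: reset() -> obs

for _ in range(100):
    # one shared policy would map the 8 current frames to 8 actions;
    # random actions keep this sketch self-contained
    actions = [env.action_space.sample() for env in envs]
    steps = [env.step(a) for env, a in zip(envs, actions)]
    frames = [obs for obs, reward, done, info in steps]
```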
- CarRacing, single agent, learning from pixels
- Crawler, 12 parallel agents
- BipedalWalker, 16 parallel agents
The implementation of the PPO algorithm is based on Udacity's code.