Project - Walker2DBulletEnv with Twin Delayed DDPG (TD3)

Environment

Solving the environment requires an average total reward of over 2500 over 100 consecutive episodes.
The environment is solved in 9361 episodes with the Twin Delayed DDPG (TD3) algorithm; see the original paper Addressing Function Approximation Error in Actor-Critic Methods.
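A minimal sketch of the environment setup and the solved check, assuming the standard gym + pybullet_envs registration; this is illustrative, not the project's exact training loop:

    import gym
    import pybullet_envs                     # registers Walker2DBulletEnv-v0 with gym
    import numpy as np
    from collections import deque

    env = gym.make('Walker2DBulletEnv-v0')
    scores_window = deque(maxlen=100)        # total rewards of the last 100 episodes

    def solved():
        # "Solved": average total reward above 2500 over 100 consecutive episodes.
        return len(scores_window) == 100 and np.mean(scores_window) > 2500.0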

Three TD3 tricks

A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values,
which then leads to the policy breaking, because it exploits the errors in the Q-function.
Twin Delayed DDPG (TD3) is an algorithm which addresses this issue by introducing three critical tricks:

  • Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence the name “twin”),
    and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions. TD3 maintains a pair of critics Q1 and Q2 along with a single actor.

  • Trick Two: “Delayed” Policy Updates. TD3 updates the policy (and target networks) less frequently
    than the Q-function. The paper recommends one policy update (actor) for every two Q-function (critic) updates.
    See parameter policy_freq in the function train(), class TD3.

  • Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, to make it harder
    for the policy to exploit Q-function errors by smoothing out Q along changes in action.
    See parameter policy_noise in the function train(), class TD3. TD3 uses Gaussian noise, not Ornstein-Uhlenbeck noise as in DDPG.
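The sketch below condenses one TD3 update step to show where each trick enters. The network, optimizer, and parameter names (actor, critic_1, critic_2, their targets, a single critic_optimizer over both critics, policy_noise, noise_clip, policy_freq, tau) follow the usual TD3 conventions and are assumptions, not the project's exact train() code:

    import torch
    import torch.nn.functional as F

    def td3_update(it, state, action, next_state, reward, not_done,
                   actor, actor_target, critic_1, critic_2,
                   critic_1_target, critic_2_target,
                   actor_optimizer, critic_optimizer,
                   max_action, discount=0.99, tau=0.005,
                   policy_noise=0.2, noise_clip=0.5, policy_freq=2):
        # not_done is the (1 - done) mask used to zero the bootstrap term at episode end.
        with torch.no_grad():
            # Trick Three: target policy smoothing -- clipped Gaussian noise on the target action.
            noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
            next_action = (actor_target(next_state) + noise).clamp(-max_action, max_action)

            # Trick One: clipped double-Q -- use the smaller of the two target Q-values.
            target_q = torch.min(critic_1_target(next_state, next_action),
                                 critic_2_target(next_state, next_action))
            target_q = reward + not_done * discount * target_q

        # Both critics regress toward the same clipped target.
        critic_loss = (F.mse_loss(critic_1(state, action), target_q) +
                       F.mse_loss(critic_2(state, action), target_q))
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        # Trick Two: delayed policy updates -- actor and targets update every policy_freq steps.
        if it % policy_freq == 0:
            actor_loss = -critic_1(state, actor(state)).mean()
            actor_optimizer.zero_grad()
            actor_loss.backward()
            actor_optimizer.step()

            # Polyak (soft) update of all target networks.
            for net, target in ((actor, actor_target),
                                (critic_1, critic_1_target),
                                (critic_2, critic_2_target)):
                for p, tp in zip(net.parameters(), target.parameters()):
                    tp.data.copy_(tau * p.data + (1 - tau) * tp.data)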

Exploration noise

Exploration noise is a crucial parameter in TD3. For this project, the parameter std_noise is set to 0.02.
For details, see Three aspects of Deep RL: noise, overestimation and exploration.
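A minimal sketch of how Gaussian exploration noise with std_noise = 0.02 can be added to the deterministic policy output before the action is sent to the environment; the helper noisy_action is hypothetical, not the project's exact code:

    import numpy as np

    std_noise = 0.02                          # standard deviation of the exploration noise

    def noisy_action(policy_action, max_action):
        # Add zero-mean Gaussian noise to the deterministic policy output,
        # then clip the result back into the valid action range.
        noise = np.random.normal(0.0, std_noise, size=policy_action.shape)
        return np.clip(policy_action + noise, -max_action, max_action)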

Off-policy

TD3 is an off-policy algorithm; in other words, it can reuse previously collected data. In agent.train we sample a batch of (state, action, next_state, done, reward) tuples of length batch_size:

        # Sample a batch from the replay buffer:
        # x = states, y = next states, u = actions, r = rewards, d = done flags
        x, y, u, r, d = replay_buffer.sample(batch_size)
        state = torch.FloatTensor(x).to(device)
        action = torch.FloatTensor(u).to(device)
        next_state = torch.FloatTensor(y).to(device)
        done = torch.FloatTensor(1 - d).to(device)   # 1 - d: mask that zeroes the bootstrap term at episode end
        reward = torch.FloatTensor(r).to(device)
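For reference, a minimal replay buffer sketch compatible with the sampling code above, returning (x, y, u, r, d) = (states, next states, actions, rewards, done flags); the class layout is an assumption, not necessarily the project's ReplayBuffer:

    import random
    import numpy as np

    class ReplayBuffer:
        def __init__(self, max_size=int(1e6)):
            self.storage = []
            self.max_size = max_size

        def add(self, state, next_state, action, reward, done):
            # Drop the oldest transition once the buffer is full.
            if len(self.storage) == self.max_size:
                self.storage.pop(0)
            self.storage.append((state, next_state, action, reward, done))

        def sample(self, batch_size):
            batch = random.sample(self.storage, batch_size)
            x, y, u, r, d = map(np.array, zip(*batch))
            return x, y, u, r.reshape(-1, 1), d.reshape(-1, 1)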

Training Score

Other TD3 projects

Credit

The source paper is Addressing Function Approximation Error in Actor-Critic Methods
by Scott Fujimoto, Herke van Hoof, and David Meger.