Project - Walker2DBulletEnv with Twin Delayed DDPG (TD3)

Environment

Solving the environment requires an average total reward of over 2500 over 100 consecutive episodes.
The environment is solved in 9361 episodes with the Twin Delayed DDPG (TD3) algorithm; see the original paper Addressing Function Approximation Error in Actor-Critic Methods.
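A minimal sketch of the environment setup and the solved check, assuming the standard gym + pybullet_envs registration; this is illustrative, not the project's exact training loop:

    import gym
    import pybullet_envs                     # registers Walker2DBulletEnv-v0 with gym
    import numpy as np
    from collections import deque

    env = gym.make('Walker2DBulletEnv-v0')
    scores_window = deque(maxlen=100)        # total rewards of the last 100 episodes

    def solved():
        # "Solved": average total reward above 2500 over 100 consecutive episodes.
        return len(scores_window) == 100 and np.mean(scores_window) > 2500.0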

Three TD3 tricks

A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values,
which then leads to the policy breaking, because it exploits the errors in the Q-function.
Twin Delayed DDPG (TD3) is an algorithm which addresses this issue by introducing three critical tricks:

  • Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence the name “twin”),
    and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions. TD3 maintains a pair of critics Q1 and Q2 along with a single actor.

  • Trick Two: “Delayed” Policy Updates. TD3 updates the policy (and target networks) less frequently
    than the Q-function. The paper recommends one policy update (actor) for every two Q-function (critic) updates.
    See parameter policy_freq in the function train(), class TD3.

  • Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, to make it harder
    for the policy to exploit Q-function errors by smoothing out Q along changes in action.
    See parameter policy_noise in the function train(), class TD3. TD3 uses Gaussian noise, not Ornstein-Uhlenbeck noise as in DDPG.
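The sketch below condenses one TD3 update step to show where each trick enters. The network, optimizer, and parameter names (actor, critic_1, critic_2, their targets, a single critic_optimizer over both critics, policy_noise, noise_clip, policy_freq, tau) follow the usual TD3 conventions and are assumptions, not the project's exact train() code:

    import torch
    import torch.nn.functional as F

    def td3_update(it, state, action, next_state, reward, not_done,
                   actor, actor_target, critic_1, critic_2,
                   critic_1_target, critic_2_target,
                   actor_optimizer, critic_optimizer,
                   max_action, discount=0.99, tau=0.005,
                   policy_noise=0.2, noise_clip=0.5, policy_freq=2):
        # not_done is the (1 - done) mask used to zero the bootstrap term at episode end.
        with torch.no_grad():
            # Trick Three: target policy smoothing -- clipped Gaussian noise on the target action.
            noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
            next_action = (actor_target(next_state) + noise).clamp(-max_action, max_action)

            # Trick One: clipped double-Q -- use the smaller of the two target Q-values.
            target_q = torch.min(critic_1_target(next_state, next_action),
                                 critic_2_target(next_state, next_action))
            target_q = reward + not_done * discount * target_q

        # Both critics regress toward the same clipped target.
        critic_loss = (F.mse_loss(critic_1(state, action), target_q) +
                       F.mse_loss(critic_2(state, action), target_q))
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        # Trick Two: delayed policy updates -- actor and targets update every policy_freq steps.
        if it % policy_freq == 0:
            actor_loss = -critic_1(state, actor(state)).mean()
            actor_optimizer.zero_grad()
            actor_loss.backward()
            actor_optimizer.step()

            # Polyak (soft) update of all target networks.
            for net, target in ((actor, actor_target),
                                (critic_1, critic_1_target),
                                (critic_2, critic_2_target)):
                for p, tp in zip(net.parameters(), target.parameters()):
                    tp.data.copy_(tau * p.data + (1 - tau) * tp.data)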

Exploration noise

Exploration noise is a crucial parameter in TD3. For this project, the parameter std_noise is set to 0.02.
For details, see Three aspects of Deep RL: noise, overestimation and exploration.
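A minimal sketch of how Gaussian exploration noise with std_noise = 0.02 can be added to the deterministic policy output before the action is sent to the environment; the helper noisy_action is hypothetical, not the project's exact code:

    import numpy as np

    std_noise = 0.02                          # standard deviation of the exploration noise

    def noisy_action(policy_action, max_action):
        # Add zero-mean Gaussian noise to the deterministic policy output,
        # then clip the result back into the valid action range.
        noise = np.random.normal(0.0, std_noise, size=policy_action.shape)
        return np.clip(policy_action + noise, -max_action, max_action)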

Off-policy

TD3 is an off-policy algorithm; in other words, it can reuse previously collected data. In agent.train we sample a batch of (state, action, next_state, done, reward) tuples of length batch_size:

        # Sample a batch from the replay buffer:
        # x = states, y = next states, u = actions, r = rewards, d = done flags
        x, y, u, r, d = replay_buffer.sample(batch_size)
        state = torch.FloatTensor(x).to(device)
        action = torch.FloatTensor(u).to(device)
        next_state = torch.FloatTensor(y).to(device)
        done = torch.FloatTensor(1 - d).to(device)   # 1 - d: mask that zeroes the bootstrap term at episode end
        reward = torch.FloatTensor(r).to(device)
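For reference, a minimal replay buffer sketch compatible with the sampling code above, returning (x, y, u, r, d) = (states, next states, actions, rewards, done flags); the class layout is an assumption, not necessarily the project's ReplayBuffer:

    import random
    import numpy as np

    class ReplayBuffer:
        def __init__(self, max_size=int(1e6)):
            self.storage = []
            self.max_size = max_size

        def add(self, state, next_state, action, reward, done):
            # Drop the oldest transition once the buffer is full.
            if len(self.storage) == self.max_size:
                self.storage.pop(0)
            self.storage.append((state, next_state, action, reward, done))

        def sample(self, batch_size):
            batch = random.sample(self.storage, batch_size)
            x, y, u, r, d = map(np.array, zip(*batch))
            return x, y, u, r.reshape(-1, 1), d.reshape(-1, 1)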

Training Score

Other TD3 projects

Credit

The source paper is Addressing Function Approximation Error in Actor-Critic Methods
by Scott Fujimoto, Herke van Hoof, and David Meger.