
Deep Reinforcement Learning with PyTorch

This repository implements various deep reinforcement learning algorithms. The main objective is to provide clean implementations for deep learning enthusiasts who want a deeper understanding of RL algorithms.

Implemented Algorithms

  1. Cross entropy
  2. Deep Q-Learning (DQN)
    • Basic DQN
    • Double DQN
    • Dueling DQN
    • N-step DQN
    • DQN with prioritized replay buffer
  3. Policy Gradient

Double DQN

The motivation for double DQN comes from DeepMind. They demonstrated that the basic DQN tends to overestimate the predicted Q values, which seems to stem from the max operation in the Bellman update equation. They proposed to use the network to predict the best action for the next state and then use the Q value associated with that action in the Bellman equation. You can read the full details in the reference paper [2].

In this approach, instead of simply taking the maximum Q value for the next state, we use the main network to select the best action for the next state and then take the Q value that the target network assigns to that action in our Bellman update approximation.
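As a minimal sketch (not necessarily the exact code in this repo), the target calculation could look like the following, assuming `net` and `target_net` are ordinary PyTorch DQN networks that map a batch of states to per-action Q values:

```python
import torch

def double_dqn_targets(states, actions, rewards, dones, next_states,
                       net, target_net, gamma=0.99):
    """Compute double DQN targets for a batch of transitions (illustrative).

    `dones` is a boolean tensor marking terminal transitions."""
    # Q(s, a) predicted by the main network for the actions actually taken
    q_values = net(states).gather(1, actions.unsqueeze(-1)).squeeze(-1)
    with torch.no_grad():
        # The main network selects the best next action ...
        next_actions = net(next_states).argmax(dim=1)
        # ... and the target network evaluates that action
        next_q = target_net(next_states).gather(
            1, next_actions.unsqueeze(-1)).squeeze(-1)
        next_q[dones] = 0.0  # no bootstrap value for terminal states
        targets = rewards + gamma * next_q
    return q_values, targets
```

The loss is then the usual MSE (or Huber) loss between `q_values` and `targets`.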

In my experiments, training the basic and double DQN networks to a given reward level took almost the same amount of time. However, I ran this experiment on a simple environment (CartPole); the difference might be much larger in more complicated environments.

Rewards

The next figure compares the reward trend of double DQN (blue) versus basic DQN (red).

Dueling DQN

The idea of Dueling DQN is to split Q into two components, Advantage and Value, to improve training stability and speed up convergence. The implementation is rather minimal and straightforward: we only need to slightly modify our DQN network to return the two values V and A and use them in our loss calculation.

Architecture

The next figure compares the architecture of the base DQN (top) with that of the Dueling DQN (bottom). Architecture of the Dueling DQN (source: [3])

The main idea here is that the average of the advantages that the network predicts over all actions should be zero. Essentially, at any given state we cannot expect all the actions to be good: some should be good, some should be bad, and on average the advantage over all actions should be zero. By imposing this constraint on the network, we force it to produce better estimates of the V values, which in turn results in better estimates of the Q values. The reward plot below shows that the Dueling DQN requires less training and converges faster to the desired reward level.
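A minimal sketch of such a network for a small, fully connected setup like CartPole could look as follows (layer sizes and names are illustrative):

```python
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, obs_size: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(obs_size, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, x):
        h = self.feature(x)
        value = self.value(h)
        adv = self.advantage(h)
        # Subtracting the per-state mean advantage enforces the
        # zero-mean constraint described above
        return value + adv - adv.mean(dim=1, keepdim=True)
```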

Rewards

Dueling DQN (orange) vs base DQN (blue)

We can see that the effect of Dueling DQN is somewhat similar to the effect of Double DQN.

N-Step DQN

This improvement comes from a relatively old paper published back in 1988 [4]. The main idea is to speed up convergence by unrolling the Bellman approximation over more terms (n steps). A relatively modest step size n, typically 2 or 3, generally speeds up convergence. However, selecting a very large number, say 100, will degrade the performance and the model might not converge at all.
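A minimal sketch of how n consecutive transitions could be collapsed into a single n-step transition before being stored in the replay buffer (names and tuple layout are illustrative):

```python
def make_n_step_transition(transitions, gamma=0.99):
    """Collapse n consecutive (state, action, reward, done, next_state)
    tuples into one n-step transition (illustrative)."""
    first_state, first_action = transitions[0][0], transitions[0][1]
    total_reward, discount, done = 0.0, 1.0, False
    last_next_state = transitions[-1][4]
    for _, _, reward, d, next_state in transitions:
        total_reward += discount * reward  # r_0 + g*r_1 + ... + g^(n-1)*r_{n-1}
        discount *= gamma
        if d:                              # episode ended inside the window
            done, last_next_state = True, next_state
            break
    return first_state, first_action, total_reward, done, last_next_state
```

The only other change is that the Bellman target bootstraps with `gamma ** n` instead of `gamma`.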

Rewards

N-step DQN (gray) vs base DQN (blue)

As you can see, the n-step DQN with n=3 converges almost twice as fast as the base DQN.

DQN with Prioritized Replay Buffer

This concept was introduced by DeepMind. The idea is that instead of just randomly selecting samples from the replay buffer, we should select the samples that are most beneficial for training. So how do we know that a sample is beneficial (or has higher priority, as they call it)? They suggest different approaches in their paper [5]; one of them is to use the training error as an indicator for selecting a sample. In other words, if a sample has a high error, we want to sample it more frequently to give the network a chance to train on it more often, ultimately reducing the error associated with it and improving training efficiency.
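A minimal sketch of a proportional prioritized buffer with the O(N) sampling mentioned later in this README (class and method names, and the default alpha/beta values, are illustrative):

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Simple O(N) proportional prioritized replay buffer (illustrative)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha, self.pos = capacity, alpha, 0
        self.buffer = []
        self.priorities = np.zeros(capacity, dtype=np.float64)

    def push(self, transition):
        # New samples get the current max priority so they are seen at least once
        max_prio = self.priorities.max() if self.buffer else 1.0
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        prios = self.priorities[:len(self.buffer)] ** self.alpha
        probs = prios / prios.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        samples = [self.buffer[i] for i in indices]
        # Importance-sampling weights correct the bias from non-uniform sampling
        weights = (len(self.buffer) * probs[indices]) ** (-beta)
        weights /= weights.max()
        return samples, indices, weights.astype(np.float32)

    def update_priorities(self, indices, td_errors, eps=1e-5):
        # Higher TD error -> higher priority -> sampled more often
        self.priorities[indices] = np.abs(td_errors) + eps
```

After each training step the TD errors of the sampled batch are fed back through `update_priorities`, and the importance-sampling weights are multiplied into the per-sample loss.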

Noisy Networks

This approach tries to solve the exploration vs. exploitation dilemma in DQN. Typically, DQN lets the agent explore by selecting random actions during training. The amount of random action is controlled by a hyperparameter, epsilon, which is generally decreased linearly as training progresses (epsilon-greedy). The challenge is that we often need to fine-tune epsilon to make training fast and efficient. The main benefit of noisy networks is that we do not have to worry about tuning another hyperparameter: the network adjusts the required amount of exploration automatically. Whenever more exploration is needed, more noise is added to the weights of the fully connected layers, and its amount is adjusted by backpropagation. This method generally results in faster convergence. The initialization scheme and the initial value for sigma were adopted from the reference paper [7].
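A minimal sketch of a noisy linear layer (independent Gaussian noise variant) that could replace the ordinary `nn.Linear` layers in the DQN head; sigma_init = 0.017 follows the value suggested in [7], while the rest of the names are illustrative:

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    """Linear layer with learnable noise on weights and biases (illustrative)."""

    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__(in_features, out_features)
        # Learnable noise scales: trained by backpropagation like any weight
        self.sigma_weight = nn.Parameter(
            torch.full((out_features, in_features), sigma_init))
        self.sigma_bias = nn.Parameter(torch.full((out_features,), sigma_init))
        # The noise samples themselves are buffers, not parameters
        self.register_buffer("eps_weight",
                             torch.zeros(out_features, in_features))
        self.register_buffer("eps_bias", torch.zeros(out_features))
        bound = math.sqrt(3 / in_features)
        nn.init.uniform_(self.weight, -bound, bound)
        nn.init.uniform_(self.bias, -bound, bound)

    def forward(self, x):
        # Fresh Gaussian noise on every forward pass; sigma controls its scale
        self.eps_weight.normal_()
        self.eps_bias.normal_()
        return nn.functional.linear(
            x,
            self.weight + self.sigma_weight * self.eps_weight,
            self.bias + self.sigma_bias * self.eps_bias,
        )
```

With this layer in place, the epsilon-greedy schedule can be dropped entirely.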

Rewards dynamics

In my experiments with the CartPole environment, runs with the prioritized replay buffer tend to take fewer episodes to reach a given target reward than runs with the simple replay buffer. However, the runs take a bit longer, which is expected given the extra bookkeeping.


Reward comparison for prioritized replay buffer (blue) vs base (orange)

Loss dynamics


Loss comparison for prioritized replay buffer (blue) vs base (orange)

Rainbow: Combining the above improvements

All the DQN extensions mentioned so far are the result of many researchers making significant improvements to the DQN algorithm over the years. The so-called Rainbow paper [6] combined these enhancements in a single model and evaluated the performance of the combined model.

In the current implementation I have combined the following methods:

  • Dueling DQN
  • N-step DQN
  • Prioritized Replay Buffer
  • Noisy Networks

The performance gain is quite considerable. The next figure compares the reward dynamics of this model with the base DQN on the CartPole environment.


Rainbow (blue) vs baseline vanilla DQN (orange)

The code could still be optimized further, for instance by reducing sampling in the prioritized replay buffer to O(log N) time with a more efficient data structure such as a segment tree. Even without that, the final performance is still quite good. The time complexity of our buffer is O(N), which should generally be fine for learning purposes or if the buffer is relatively small, say 200k transitions. However, you might need to consider a more optimized replay buffer for large projects, which typically require larger buffers holding millions of frames.
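For reference, a sum segment tree that would bring sampling down to O(log N) could be sketched as follows (not part of the current implementation; capacity is assumed to be a power of two for the simple heap layout):

```python
import numpy as np

class SumTree:
    """Sum segment tree: O(log N) priority updates and prefix-sum sampling."""

    def __init__(self, capacity):
        # capacity is assumed to be a power of two
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity, dtype=np.float64)

    def update(self, idx, priority):
        # Set the leaf, then propagate the new sums up to the root
        pos = idx + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def sample_idx(self, value):
        # Walk down from the root; `value` is uniform in [0, self.total)
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if value < self.tree[left]:
                pos = left
            else:
                value -= self.tree[left]
                pos = left + 1
        return pos - self.capacity

    @property
    def total(self):
        return self.tree[1]
```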

Reference Papers

  1. Playing Atari with Deep Reinforcement Learning, Mnih et al., 2013
  2. Deep Reinforcement Learning with Double Q-Learning, van Hasselt, Guez, and Silver, 2015
  3. Dueling Network Architectures for Deep Reinforcement Learning, Wang et al., 2015
  4. Learning to Predict by the Methods of Temporal Differences, Sutton, 1988
  5. Prioritized Experience Replay, Schaul et al., 2015
  6. Rainbow: Combining Improvements in Deep Reinforcement Learning, Hessel et al., 2017
  7. Noisy Networks for Exploration, Fortunato et al., 2017
