This repository contains an implementation of Ephemeral Value Adjustment (EVA) from "Fast deep reinforcement learning using online adjustments from the past" by S. Hansen et al.: https://arxiv.org/abs/1810.08163
The project was done as part of the DeepPavlov course "Advanced Topics in Deep Reinforcement Learning".
We used the PyTorch library, version 1.6.
For approximate nearest neighbour search we used the Fast Library for Approximate Nearest Neighbors (FLANN).
The original code and installation instructions can be found at https://github.com/mariusmuja/flann.
To run the code with the default parameters, use the command below. The default parameters can be found in the config.py module of our repository.
python experiment.py
To run the baseline DQN model, set the weighting parameter λ to 1:
python experiment.py --lambd=1
We ran experiments primarily on Atari environments such as "BreakoutNoFrameskip-v4" and "AtlantisNoFrameskip-v4". We used the EpisodicLifeEnv, FireResetEnv (or NoopResetEnv) and MaxAndSkipEnv wrappers from OpenAI Baselines. Unlike the original work, we did not use four parallel agents, and we limited the size of the experience replay to 400k transitions. Beyond that, we slightly tweaked a few parameters: the decay rate of the greedy exploration rate (ϵ) and the Adam learning rate. The remaining parameters were kept the same as in the original work (Section 9, Atari Experiment Details).
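For reference, here is a minimal sketch of how such a wrapper stack can be assembled from the OpenAI Baselines wrappers; the exact order and parameters used in our config.py may differ.

```python
import gym
from baselines.common.atari_wrappers import (
    NoopResetEnv, MaxAndSkipEnv, EpisodicLifeEnv, FireResetEnv,
)

def make_atari_env(env_id="BreakoutNoFrameskip-v4"):
    # Base environment without the built-in frame skip.
    env = gym.make(env_id)
    # Random number of no-op actions at reset for a more diverse start state.
    env = NoopResetEnv(env, noop_max=30)
    # Repeat each action for 4 frames and max-pool over the last two frames.
    env = MaxAndSkipEnv(env, skip=4)
    # Treat the loss of a life as the end of an episode during training.
    env = EpisodicLifeEnv(env)
    # Press FIRE at reset for games (such as Breakout) that require it.
    if "FIRE" in env.unwrapped.get_action_meanings():
        env = FireResetEnv(env)
    return env
```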
The Trajectory Central Planning (TCP) algorithm is the core of Ephemeral Value Adjustment. The idea behind it is to find states in past experience that are in some sense similar to the given state. Similarity here is defined by the Euclidean distance in embedding space, where the embeddings are the outputs of the fully-connected layer of the convolutional neural network used by the agent. Ideally, once the network is good enough at predicting Q-values, the embeddings of states that are similar in terms of subsequent actions and rewards become close in this space.
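As an illustration, below is a sketch of a DQN-style network that exposes both the Q-values and the embedding used for neighbour search. The class name and layer sizes are assumptions following the classic Atari DQN, not necessarily the ones in our config.py.

```python
import torch.nn as nn

class EVANetwork(nn.Module):
    """DQN-style network that also returns the embedding used as a key
    for nearest-neighbour search (a sketch, not the exact repo architecture)."""

    def __init__(self, n_actions, embedding_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, embedding_dim), nn.ReLU(),
        )
        self.head = nn.Linear(embedding_dim, n_actions)

    def forward(self, x):
        embedding = self.fc(self.conv(x))   # key for nearest-neighbour search
        q_values = self.head(embedding)     # parametric Q-value estimates
        return q_values, embedding
```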
TCP unrolls several paths from the experience, starting from these similar states, and calculates the action values based on the actual rewards for on-path actions and on the Q-value estimates of the agent's neural network for off-path actions. These non-parametric Q-values, as they are called in the original work, are stored in the value buffer along with the embedding of the state they were calculated for.
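A minimal sketch of this backup over a single retrieved trajectory, assuming the parametric Q-values along the trajectory have already been computed by the network:

```python
import numpy as np

def tcp_backup(q_theta, actions, rewards, gamma=0.99):
    """Trajectory Central Planning backup over one retrieved trajectory (a sketch).

    q_theta : (T, n_actions) parametric Q-value estimates along the trajectory
    actions : (T,) actions actually taken on the trajectory
    rewards : (T,) rewards actually received
    Returns the non-parametric Q-values, shape (T, n_actions).
    """
    T, _ = q_theta.shape
    q_np = q_theta.copy()
    # The value at the end of the trajectory is bootstrapped from the network.
    v_np = q_theta[-1].max()
    # Walk backwards: the on-path action uses the real reward plus the planned
    # value of the next state, off-path actions keep the parametric estimate.
    for t in reversed(range(T - 1)):
        q_np[t, actions[t]] = rewards[t] + gamma * v_np
        v_np = q_np[t].max()
    return q_np
```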
We implemented two different versions of TCP, since the original work does not explicitly state which embeddings should be inserted as keys into the value buffer. As the neural network changes over time, the embeddings of the states found as nearest to the given one would not stay exactly the same. There are two possibilities: use the old embeddings with which we found the similar states as the keys to insert into the value buffer, or recompute the embeddings by passing these states through the neural network again and use those instead. We did not notice any significant difference between these two variants in our experiments. The latest version of the code uses the first variant.
After some time, when the agent could already play reasonably well, we paused the learning process to check how well the nearest-neighbour search in the embedding space works.
We took a frame from the Breakout game and queried the replay buffer for its neighbours in the embedding space using the approximate nearest-neighbour search mentioned above.
Query state
Neighboring states
You can see that the position and direction of the ball and the position of the paddle are similar in most cases, even though the remaining blocks differ. It would be hard to achieve the same by searching in the space of raw frames.
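For reference, a minimal sketch of such an embedding-space query using the pyflann bindings; the index parameters and the way our replay buffer wraps FLANN may differ.

```python
import numpy as np
from pyflann import FLANN  # Python bindings shipped with FLANN

def nearest_states(embeddings, query, k=5):
    """Return indices and distances of the k approximate nearest neighbours.

    embeddings : (N, d) array of state embeddings stored in the replay buffer
    query      : (d,) embedding of the query frame
    """
    flann = FLANN()
    # Build an approximate index once over the stored embeddings.
    flann.build_index(embeddings.astype(np.float32), algorithm="kdtree", trees=4)
    # Query the index for the k nearest neighbours of the given embedding.
    indices, dists = flann.nn_index(
        query.astype(np.float32).reshape(1, -1), num_neighbors=k
    )
    return indices[0], dists[0]
```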
As a benchmark we used a DQN with exactly the same architecture and hyperparameters. The only difference is that in DQN the weighting parameter λ is set to 1, so the non-parametric Q-values are ignored.
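A sketch of how the two estimates can be blended (the function name is ours); with lambd = 1 the non-parametric part vanishes and the agent reduces to the plain DQN baseline described above.

```python
def combined_q(q_theta, q_np, lambd):
    # Weighted mixture of parametric (network) and non-parametric (TCP) Q-values.
    # lambd = 1 keeps only the parametric estimates, i.e. plain DQN.
    return lambd * q_theta + (1.0 - lambd) * q_np
```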
Below is a video of the EVA agent playing the Breakout environment after 10000 episodes of training.
Breakout gameplay after 10000 episodes
Below is a moving average (window of 1000 episodes) of the training rewards for the two implementations of TCP (described in the TCP section above) and for DQN with the same hyperparameters on the Atlantis environment.
Moving average of training rewards for Atlantis
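A moving average like the one shown in the plot can be computed, for example, as follows:

```python
import numpy as np

def moving_average(rewards, window=1000):
    # Smooth per-episode training rewards with a simple moving average.
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")
```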