This repository contains an implementation of Ephemeral Value Adjustment (EVA) from "Fast deep reinforcement learning using online adjustments from the past" by S. Hansen et al.: https://arxiv.org/abs/1810.08163
The project was done as part of the DeepPavlov course "Advanced Topics in Deep Reinforcement Learning".
We used the PyTorch library, version 1.6.
For approximate nearest neighbour search we used the Fast Library for Approximate Nearest Neighbors (FLANN).
The original code and installation instructions can be found at https://github.com/mariusmuja/flann.
To run the code with the default parameters, use the command below. The default parameters can be found in the config.py module of our repository.
python experiment.py
To run the baseline DQN model, set the weighting parameter λ to 1:
python experiment.py --lambd=1
We ran experiments primarily on Atari environments such as "BreakoutNoFrameskip-v4" and "AtlantisNoFrameskip-v4". We used the EpisodicLifeEnv, FireResetEnv (or NoopResetEnv) and MaxAndSkipEnv wrappers from OpenAI Baselines. Unlike the original work, we did not use four parallel agents, and we limited the size of the experience replay to 400k transitions. Beyond that, we slightly tweaked a few parameters: the decay rate of the greedy exploration rate (ϵ) and the Adam learning rate. The remaining parameters were kept the same as in the original work (Section 9, Atari Experiment Details).
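For reference, here is a minimal sketch of how such a wrapper stack can be assembled from the OpenAI Baselines wrappers; the exact order and parameters used in our config.py may differ.

```python
import gym
from baselines.common.atari_wrappers import (
    NoopResetEnv, MaxAndSkipEnv, EpisodicLifeEnv, FireResetEnv,
)

def make_atari_env(env_id="BreakoutNoFrameskip-v4"):
    # Base environment without the built-in frame skip.
    env = gym.make(env_id)
    # Random number of no-op actions at reset for a more diverse start state.
    env = NoopResetEnv(env, noop_max=30)
    # Repeat each action for 4 frames and max-pool over the last two frames.
    env = MaxAndSkipEnv(env, skip=4)
    # Treat the loss of a life as the end of an episode during training.
    env = EpisodicLifeEnv(env)
    # Press FIRE at reset for games (such as Breakout) that require it.
    if "FIRE" in env.unwrapped.get_action_meanings():
        env = FireResetEnv(env)
    return env
```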
The Trajectory Central Planning (TCP) algorithm is the core of Ephemeral Value Adjustment. The idea behind it is to find states in past experience that are in some sense similar to the given state. Similarity here is defined by the Euclidean distance in embedding space, where the embeddings are the outputs of the fully-connected layer of the convolutional neural network used by the agent. Ideally, once the network is good enough at predicting Q-values, the embeddings of states that are similar in terms of subsequent actions and rewards become close in this space.
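As an illustration, below is a sketch of a DQN-style network that exposes both the Q-values and the embedding used for neighbour search. The class name and layer sizes are assumptions following the classic Atari DQN, not necessarily the ones in our config.py.

```python
import torch.nn as nn

class EVANetwork(nn.Module):
    """DQN-style network that also returns the embedding used as a key
    for nearest-neighbour search (a sketch, not the exact repo architecture)."""

    def __init__(self, n_actions, embedding_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, embedding_dim), nn.ReLU(),
        )
        self.head = nn.Linear(embedding_dim, n_actions)

    def forward(self, x):
        embedding = self.fc(self.conv(x))   # key for nearest-neighbour search
        q_values = self.head(embedding)     # parametric Q-value estimates
        return q_values, embedding
```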
TCP unrolls several paths from the experience, starting from these similar states, and calculates the action values based on the actual rewards for on-path actions and on the Q-value estimates of the agent's neural network for off-path actions. These non-parametric Q-values, as they are called in the original work, are stored in the value buffer along with the embedding of the state they were calculated for.
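A minimal sketch of this backup over a single retrieved trajectory, assuming the parametric Q-values along the trajectory have already been computed by the network:

```python
import numpy as np

def tcp_backup(q_theta, actions, rewards, gamma=0.99):
    """Trajectory Central Planning backup over one retrieved trajectory (a sketch).

    q_theta : (T, n_actions) parametric Q-value estimates along the trajectory
    actions : (T,) actions actually taken on the trajectory
    rewards : (T,) rewards actually received
    Returns the non-parametric Q-values, shape (T, n_actions).
    """
    T, _ = q_theta.shape
    q_np = q_theta.copy()
    # The value at the end of the trajectory is bootstrapped from the network.
    v_np = q_theta[-1].max()
    # Walk backwards: the on-path action uses the real reward plus the planned
    # value of the next state, off-path actions keep the parametric estimate.
    for t in reversed(range(T - 1)):
        q_np[t, actions[t]] = rewards[t] + gamma * v_np
        v_np = q_np[t].max()
    return q_np
```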
We implemented two different versions of TCP, since the original work does not explicitly state which embeddings should be inserted as keys into the value buffer. As the neural network changes over time, the embeddings of the states found as nearest to the given one would not stay exactly the same. There are two possibilities: use the old embeddings with which we found the similar states as the keys to insert into the value buffer, or recompute the embeddings by passing these states through the neural network again and use those instead. We did not notice any significant difference between these two variants in our experiments. The latest version of the code uses the first variant.
After some time, when the agent could already play reasonably well, we paused the learning process to check how well the nearest-neighbour search in the embedding space works.
We took a frame from the Breakout game and queried the replay buffer for its neighbours in the embedding space using the approximate nearest-neighbour search mentioned above.
Query state
Neighboring states
You can see that the position and direction of the ball and the position of the paddle are similar in most cases, even though the remaining blocks differ. It would be hard to achieve the same by searching in the space of raw frames.
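For reference, a minimal sketch of such an embedding-space query using the pyflann bindings; the index parameters and the way our replay buffer wraps FLANN may differ.

```python
import numpy as np
from pyflann import FLANN  # Python bindings shipped with FLANN

def nearest_states(embeddings, query, k=5):
    """Return indices and distances of the k approximate nearest neighbours.

    embeddings : (N, d) array of state embeddings stored in the replay buffer
    query      : (d,) embedding of the query frame
    """
    flann = FLANN()
    # Build an approximate index once over the stored embeddings.
    flann.build_index(embeddings.astype(np.float32), algorithm="kdtree", trees=4)
    # Query the index for the k nearest neighbours of the given embedding.
    indices, dists = flann.nn_index(
        query.astype(np.float32).reshape(1, -1), num_neighbors=k
    )
    return indices[0], dists[0]
```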
As a benchmark we used a DQN with exactly the same architecture and hyperparameters. The only difference is that in DQN the weighting parameter λ is set to 1, so the non-parametric Q-values are ignored.
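A sketch of how the two estimates can be blended (the function name is ours); with lambd = 1 the non-parametric part vanishes and the agent reduces to the plain DQN baseline described above.

```python
def combined_q(q_theta, q_np, lambd):
    # Weighted mixture of parametric (network) and non-parametric (TCP) Q-values.
    # lambd = 1 keeps only the parametric estimates, i.e. plain DQN.
    return lambd * q_theta + (1.0 - lambd) * q_np
```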
Below is a video of the EVA agent playing the Breakout environment after 10000 episodes of training.
Breakout gameplay after 10000 episodes
Below is a moving average (window of 1000 episodes) of the training rewards for the two implementations of TCP (described in the TCP section above) and for DQN with the same hyperparameters on the Atlantis environment.
Moving average of training rewards for Atlantis
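A moving average like the one shown in the plot can be computed, for example, as follows:

```python
import numpy as np

def moving_average(rewards, window=1000):
    # Smooth per-episode training rewards with a simple moving average.
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")
```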