This package contains implementations of the most relevant TD methods for policy evaluation (i.e. estimating the value function) and a benchmark framework to systematically assess their quality in a variety of scenarios. Only methods for linear function approximation are considered.
The following algorithms are implemented in the td module:
- TD Learning with e-traces
- TDC with e-traces
- GTD
- GTD2
- LSTD with e-traces
- Bellman Residual Minimization with or without double sampling (+ e-traces)
- Residual Gradient with or without double sampling
- GPTD with e-traces
- Kalman TD
- LSPE with e-traces
- FPKF with e-traces
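To illustrate the simplest entry in this list, here is a minimal sketch of TD learning with eligibility traces and linear function approximation. It is written in plain Python and is not the code of the td module: the names dot and td_lambda, the episode format, and the default step size are assumptions made for this example only.

```python
def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def td_lambda(episodes, n_features, alpha=0.1, gamma=0.9, lam=0.5):
    """Estimate theta such that V(s) is approximated by dot(theta, phi(s)).

    `episodes` is a list of episodes. Each episode is a list of
    (phi, reward, phi_next) transitions; phi_next is None at termination.
    """
    theta = [0.0] * n_features
    for episode in episodes:
        z = [0.0] * n_features  # eligibility trace, reset for each episode
        for phi, reward, phi_next in episode:
            v = dot(theta, phi)
            v_next = 0.0 if phi_next is None else dot(theta, phi_next)
            delta = reward + gamma * v_next - v  # TD error
            # Accumulating trace: z <- gamma * lam * z + phi
            z = [gamma * lam * zi + fi for zi, fi in zip(z, phi)]
            # Gradient-style update along the trace
            theta = [ti + alpha * delta * zi for ti, zi in zip(theta, z)]
    return theta


# Deterministic 3-state chain s0 -> s1 -> s2 -> end with one-hot features;
# only the last transition yields a reward of 1.
phi = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
episode = [(phi[0], 0.0, phi[1]), (phi[1], 0.0, phi[2]), (phi[2], 1.0, None)]
theta = td_lambda([episode] * 500, n_features=3)
print(theta)  # approaches the true values [gamma**2, gamma, 1.0]
```

With one-hot features the estimate can represent the true value function exactly, which makes the chain a convenient sanity check before moving to the benchmarks.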
In addition, the package contains rudimentary implementations (in the regtd module) of different regularization schemes for LSTD, such as:
- LSTD with l2 regularization
- Dantzig-LSTD
- LarsTD
- LSTD-l1
- LSTD with l2l1 regularization
The package contains implementations of several MDPs suitable for benchmarking the algorithms (see examples.py). While the implementations are of a more general nature, there are ready-to-run scripts for the following benchmark scenarios:
1. 14-State Boyan Chain [boyan]
2. Baird Star Example [baird]
3. 400-State Random MDP On-policy [disc_random_on]
4. 400-State Random MDP Off-policy [disc_random_off]
5. Lin. Cart-Pole Balancing On-policy Imperfect Features [lqr_imp_onpolicy]
6. Lin. Cart-Pole Balancing Off-policy Imperfect Features [lqr_imp_offpolicy]
7. 4-dim. State Pole Balancing On-policy Perfect Features [lqr_full_onpolicy]
8. Lin. Cart-Pole Balancing Off-policy Perfect Features [lqr_full_offpolicy]
9. Cart-Pole Swing-up On-policy [swingup_gauss_onpolicy]
10. Cart-Pole Swing-up Off-policy [swingup_gauss_offpolicy]
11. 20-link Lin. Pole Balancing On-policy [link20_imp_onpolicy]
12. 20-link Lin. Pole Balancing Off-policy [link20_imp_offpolicy]
The scripts are located in the experiments folder and should be executed from the base directory.
The results of the experiments are stored in the data folder. The plots directory contains scripts which automatically create the figures of the paper "Policy Evaluation with Temporal Differences: A Survey and Comparison" by Dann, Neumann and Peters from the stored results.
Alternatively, the data can be viewed interactively by executing:
from experiments import *
name = "lqr_full_offpolicy" # the name of the experiment (in brackets above)
measure = "RMSPBE" # root mean squared projected Bellman error; alternatives: RMSE, RMSBE
plot_experiment(name, measure)
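For orientation, the three measure names can be read with their usual textbook definitions. The symbols below are not taken from this README, and it is an assumption that the package computes the measures in exactly this form: V_theta is the estimated value function, V^pi the true value function, T^pi the Bellman operator, mu the state distribution used as weighting, and Pi the mu-weighted projection onto the span of the features.

```latex
\begin{align*}
\mathrm{RMSE}(\theta)   &= \lVert V_\theta - V^\pi \rVert_\mu, \\
\mathrm{RMSBE}(\theta)  &= \lVert V_\theta - T^\pi V_\theta \rVert_\mu, \\
\mathrm{RMSPBE}(\theta) &= \lVert V_\theta - \Pi\, T^\pi V_\theta \rVert_\mu,
\qquad \text{where } \lVert x \rVert_\mu^2 = \sum_s \mu(s)\, x(s)^2 .
\end{align*}
```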
Be aware that the scripts make heavy use of hard-disk caching to avoid recomputing runtime-intensive results. This significantly speeds up re-executions of experiments. The cache is located in the cache folder and may grow to several GB.
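The idea behind such a disk cache can be sketched with the standard library alone. This is not the package's caching code (which appears to rely on joblib); the decorator name disk_cached, the pickle-based key, and the file layout are assumptions chosen for this illustration.

```python
import hashlib
import os
import pickle
import tempfile


def disk_cached(cache_dir):
    """Decorator that stores results of a function as pickle files in cache_dir."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            # The cache key is a hash of the function name and its arguments.
            key_bytes = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            file_name = hashlib.sha1(key_bytes).hexdigest() + ".pkl"
            path = os.path.join(cache_dir, file_name)
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)  # cache hit: skip the computation
            result = func(*args, **kwargs)
            if not os.path.isdir(cache_dir):
                os.makedirs(cache_dir)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


calls = []  # records how often the function body really runs
cache_dir = tempfile.mkdtemp()


@disk_cached(cache_dir)
def slow_square(x):
    calls.append(x)
    return x * x


first = slow_square(12)   # computed and written to disk
second = slow_square(12)  # read back from disk
```

Because every distinct argument combination produces its own file, a cache of this kind grows with the number of evaluated settings, which is why the cache folder can become large.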
Exhaustive grid search is implemented for tuning the hyper-parameters of the algorithms. To find optimal parameters for a given benchmark, use the script experiments/gridsearch.py.
The script takes the following parameters:
- --experiment: name of the benchmark. It must be a module in the experiments folder. The grid-search script automatically imports the module to use the settings defined there.
- --njobs: number of cores to use in parallel
- --batchsize: number of parameter settings to evaluate per job. Increasing the value may speed up the search for small benchmarks, because each job carries additional overhead.
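How an exhaustive grid and the batch size interact can be sketched as follows. The helper names parameter_grid and batches and the example grid values are hypothetical; the actual logic in experiments/gridsearch.py may differ.

```python
import itertools


def parameter_grid(grid):
    """Return every combination of the values in `grid` as a list of dicts."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


def batches(settings, batchsize):
    """Split the list of settings into chunks of at most `batchsize` items."""
    return [settings[i:i + batchsize] for i in range(0, len(settings), batchsize)]


# Illustrative grid, not the values used by any benchmark in this package.
grid = {"alpha": [0.001, 0.01, 0.1], "lam": [0.0, 0.5, 1.0], "beta": [10, 100]}
settings = parameter_grid(grid)
jobs = batches(settings, batchsize=4)
print("%d settings in %d jobs" % (len(settings), len(jobs)))
```

A larger batch size means fewer jobs and therefore less per-job overhead, at the cost of coarser load balancing across cores.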
For example, finding parameters for the discrete random MDP on-policy benchmark (3) can be started with
python experiments/gridsearch.py --experiment disc_random_on
The results of the grid search are stored in a directory with the name of the benchmark, located in the data folder.
You may want to have a look at 2D slices of the hyper-parameter space. The plot_2d_error_grid_experiment function in the experiments package helps with this. For example, the dependency of the FPKF performance on its alpha and mins parameters, for fixed lambda=0 and beta=100, on the discrete random MDP on-policy benchmark can be illustrated by:
from experiments import *
plot_2d_error_grid_experiment("disc_random_on", "FPKF", criterion="RMSE", pn1="alpha", pn2="mins", settings={"beta": 100, "lam": 0.})
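Conceptually, a 2D slice fixes some hyper-parameters and keeps the error as a function of the two remaining ones. The sketch below shows that selection step on a hand-made list of results. The function slice_2d, the data layout, and the error numbers are placeholders invented for this example; they are neither the package's internals nor real measurements.

```python
def slice_2d(results, pn1, pn2, settings):
    """Return {(value of pn1, value of pn2): error} for all entries of
    `results` whose parameters match the fixed values in `settings`."""
    sliced = {}
    for params, error in results:
        if all(params[name] == value for name, value in settings.items()):
            sliced[(params[pn1], params[pn2])] = error
    return sliced


# Placeholder (parameter setting, error) pairs; the numbers are made up.
results = [
    ({"alpha": 0.1, "mins": 0, "beta": 100, "lam": 0.0}, 0.42),
    ({"alpha": 0.1, "mins": 500, "beta": 100, "lam": 0.0}, 0.37),
    ({"alpha": 0.5, "mins": 0, "beta": 100, "lam": 0.0}, 0.55),
    ({"alpha": 0.5, "mins": 0, "beta": 10, "lam": 0.0}, 0.61),
    ({"alpha": 0.1, "mins": 0, "beta": 100, "lam": 0.4}, 0.40),
]
error_grid = slice_2d(results, "alpha", "mins", {"beta": 100, "lam": 0.0})
```

The resulting dictionary holds one error value per (alpha, mins) pair and can be passed to any 2D plotting routine.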
For further information on how to display the data, have a look at the scripts in the plots directory.
This code is known to run well with
- Python 2.7
- Numpy 1.6.1
- matplotlib 1.2.0 (up-to-date version required for error bars and smooth curves in plots)
- Cython 0.17
- mlabwrap 1.1 (https://mlabwrap.sourceforge.net/ , for executing the PILCO policy for the cart-pole swing-up task)
- custom joblib version available from https://github.com/chrodan/joblib (to have custom hashing functions for more complex objects)
We provide short installation instructions for Unix systems in the following.
The dynamics of the cart-pole swing-up benchmark are implemented in Cython to make them really fast. Therefore, the swingup_ode module needs to be compiled:
cython swingup_ode.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.7 -o swingup_ode.so swingup_ode.c
You may need to adapt the Python include path to your system. Alternatively, the module can be compiled with distutils by executing the following in the base directory:
python setup.py build_ext --inplace
mv tdlearn/swingup_ode.so .
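Whether the build produced an importable module can be checked from Python. The helper extension_available is a hypothetical name introduced for this sketch, and the check assumes it is run from the base directory, where the compiled swingup_ode file is expected to be.

```python
import importlib


def extension_available(module_name):
    """Return True if `module_name` can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if extension_available("swingup_ode"):
    print("swingup_ode is compiled and importable")
else:
    print("swingup_ode not found; compile it as described above")
```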
The custom version of joblib can be installed locally in the base directory, so that it is used automatically by this framework but does not interfere with code outside of it. This can be done by executing:
git clone https://github.com/chrodan/joblib joblib_repo
ln -s joblib_repo/joblib joblib