
T-LBO

High-Dimensional Bayesian Optimisation with Variational Autoencoders and Deep Metric Learning

This repository is the official implementation of High-Dimensional Bayesian Optimisation with Variational Autoencoders and Deep Metric Learning.

The provided code allows you to re-run the experiments presented in the paper, including:

  • an implementation of soft contrastive and soft triplet losses in metrics.py (a minimal sketch of the latter is given after this list).
  • scripts to train VAEs with metric learning based on black-box function values for the three tasks considered in our paper:
    • Expression Reconstruction
    • Topology Shape Fitting
    • Chemical Design
  • scripts to run LSBO under fully supervised and semi-supervised settings.
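For concreteness, here is a minimal, self-contained sketch of a soft triplet loss in the spirit of metrics.py (not the repository's exact implementation; the function name and signature are illustrative only):

import torch
import torch.nn.functional as F

def soft_triplet_loss(z_anchor, z_pos, z_neg, margin=1.0):
    # Latent-space distances between the anchor and its positive / negative.
    d_pos = F.pairwise_distance(z_anchor, z_pos)
    d_neg = F.pairwise_distance(z_anchor, z_neg)
    # Softplus smooths the usual triplet hinge max(0, d_pos - d_neg + margin).
    return F.softplus(d_pos - d_neg + margin).mean()

In the paper's setting, positives and negatives for an anchor are chosen according to how close their black-box function values are to the anchor's.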

Contributors

Antoine Grosnit, Rasul Tutunov, Alexandre Max Maraval, Ryan-Rhys Griffiths, Alexander I. Cowen-Rivers, Lin Yang, Lin Zhu, Wenlong Lyu, Zhitang Chen, Jun Wang, Jan Peters and Haitham Bou-Ammar

Installation

First, install dependencies. We recommend using miniconda, as rdkit (a dependency used for the chemical design experiment) is particularly difficult to install otherwise.

Setup was run on Python 3.7 with CUDA 10.1.

⚠️ Make sure that your CUDA device is compatible with the installed version of torch.

conda env create -f lsbo_metric_env.yml
conda activate lsbo_metric_env
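A quick way to check from Python that the installed torch build matches your CUDA setup:

import torch

print(torch.__version__)          # installed torch version
print(torch.version.cuda)         # CUDA version this build targets (10.1 in our setup)
print(torch.cuda.is_available())  # True if a compatible GPU and driver are visible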

To run experiments, you need to define a folder where results will be stored:

  • create the file utils/storage_root_path.txt containing a one-line absolute path to the folder where results and data should be stored (e.g. ~/LSO-storage/):
echo '~/LSO-storage/' > ./utils/storage_root_path.txt

Usage guide

In each script invoked below, the default parameters are the ones we used to run the experiments in the paper. Nonetheless, users can modify the scripts to run model training / BO under a specific setting, e.g. with a specific metric loss or acquisition function.

For each task, we give instructions on how to download the datasets used in our experiments, how to train the VAE model (or which pretrained model can be used), and how to run Bayesian Optimisation.

Set the cuda parameter in the scripts, or use CUDA_VISIBLE_DEVICES=0, to run on a specific GPU.

To train a VAE or run LSBO with a specific metric learning loss or with target prediction, set the metric_loss_ind or predict_target parameters in the training / optimisation scripts.

Expression task:

Set up the data
# download necessary files and unzip data
url=https://github.com/cambridge-mlg/weighted-retraining/raw/master/assets/data/expr/
expr_dir="./weighted_retraining/assets/data/expr"
mkdir -p $expr_dir
for file in eq2_grammar_dataset.zip equation2_15_dataset.txt scores_all.npz;
do
 cmd="wget -P ${expr_dir} ${url}${file}"
 echo $cmd 
 $cmd
done
unzip "$expr_dir/eq2_grammar_dataset.zip" -d "$expr_dir"

# split data and generate datasets used in our BO experiments
python ./weighted_retraining/weighted_retraining/expr/expr_dataset.py \
           --ignore_percentile 65 --good_percentile 5 \
           --seed 0 --save_dir weighted_retraining/data/expr
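To sanity-check the downloaded scores before training, one can list the arrays stored in scores_all.npz (we only print the array names, since the exact keys depend on the archive contents):

import numpy as np

scores = np.load("./weighted_retraining/assets/data/expr/scores_all.npz")
print(scores.files)  # names of the arrays stored in the archive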
Train a grammar VAE
  • Train a grammar VAE with train-expr-pt.sh (modify the script to select the GPU id and other parameters if need be):
chmod u+x ./weighted_retraining/scripts/models/supervised/train-expr-pt.sh
./weighted_retraining/scripts/models/supervised/train-expr-pt.sh
Run LSBO
chmod u+x ./weighted_retraining/scripts/robust_opt/robust_opt_expr.sh
./weighted_retraining/scripts/robust_opt/robust_opt_expr.sh

Topology task:

Set up the data
Train a VAE
chmod u+x ./weighted_retraining/scripts/models/supervised/train-topology.sh
./weighted_retraining/scripts/models/supervised/train-topology.sh
  • To run LSBO, execute robust_opt_topology.sh (selecting the desired parameters within the shell script, such as metric learning loss, acquisition function, number of acquisition steps, ...):
chmod u+x ./weighted_retraining/scripts/robust_opt/robust_opt_topology.sh
./weighted_retraining/scripts/robust_opt/robust_opt_topology.sh

Molecule task:

Set up data

Note: preprocessing the chemical dataset will take several hours ☕.

  • To download and build the Zinc250k dataset with black-box function labels, execute setup-chem.sh:
# download necessary files
url=https://github.com/cambridge-mlg/weighted-retraining/raw/master/assets/data/chem_orig_model
molecule_dir="./weighted_retraining/assets/data/chem_orig_model/"
mkdir -p $molecule_dir
for file in train.txt val.txt vocab.txt README.md;
do
 wget -P $molecule_dir "$url/$file"
done

# preprocess molecule data for BO experiments
chmod u+x ./weighted_retraining/scripts/data/setup-chem.sh
./weighted_retraining/scripts/data/setup-chem.sh
Get the pretrained JTVAE
url=https://github.com/cambridge-mlg/weighted-retraining/raw/master/assets/pretrained_models/chem.ckpt
wget -P ./weighted_retraining/assets/pretrained_models/chem_vanilla/ $url
  • To train the JTVAE with a metric loss or target prediction (to be used for the BO), run train-chem.sh with the desired options:
chmod u+x ./weighted_retraining/scripts/models/supervised/train-chem.sh
./weighted_retraining/scripts/models/supervised/train-chem.sh
Run LSBO
  • Run optimisation experiments with robust_opt_chem.sh (selecting the desired parameters within the shell script, such as metric learning loss, acquisition function, number of acquisition steps, ...):
chmod u+x ./weighted_retraining/scripts/robust_opt/robust_opt_chem.sh
./weighted_retraining/scripts/robust_opt/robust_opt_chem.sh

Results

Regret Visualisation

The function plot_results() can be used to show the evolution of the regret across acquisition steps, displaying the mean and standard deviation when the experiment has been carried out with several seeds.
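For reference, a minimal sketch of this kind of plot (plot_results() in the repository is the actual entry point; the regrets array below is random stand-in data, not loaded results):

import numpy as np
import matplotlib.pyplot as plt

# One row per seed, one column per acquisition step (stand-in data).
regrets = np.random.rand(5, 100).cumsum(axis=1)[:, ::-1]
mean, std = regrets.mean(axis=0), regrets.std(axis=0)
steps = np.arange(regrets.shape[1])

plt.plot(steps, mean, label="T-LBO")
plt.fill_between(steps, mean - std, mean + std, alpha=0.3)  # +/- 1 std across seeds
plt.xlabel("Acquisition step")
plt.ylabel("Regret")
plt.legend()
plt.show()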

Our results on chemical design

Method   Nb. function evaluations   Penalised logP (top 3)
T-LBO    7750                       38.57 / 34.83 / 34.63
T-LBO    3450                       34.83 / 31.1 / 29.21
T-LBO    2300                       24.06 / 22.84 / 21.26

Optimisation of penalised logP

We optimised the penalised water-octanol partition coefficient (logP) of molecules from the ZINC250K dataset and compared several optimisation algorithms, showing that the highest logP scores are obtained with our method T-LBO.
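As a reminder of the objective, the logP component can be computed with RDKit's Crippen estimator. The full penalised logP additionally subtracts a synthetic-accessibility score and a long-cycle penalty, both omitted in this partial sketch:

from rdkit import Chem
from rdkit.Chem import Crippen

mol = Chem.MolFromSmiles("CCO")  # ethanol, as a toy example
print(Crippen.MolLogP(mol))      # octanol-water partition coefficient estimate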

Molecule visualisation

T-LBO – Starting with only 3% of the data points in Zinc250K labelled, the best molecule initially available (displayed in the top-left corner) has a penalised logP score of 4.09. Under this semi-supervised setup, our method finds a molecule with a score of 29.14 after only six retrainings of the JTVAE with triplet loss (bottom-right molecule).

Cite Us

Grosnit, Antoine, et al. "High-Dimensional Bayesian Optimisation with Variational Autoencoders and Deep Metric Learning." arXiv preprint arXiv:2106.03609 (2021).

BibTeX

@misc{grosnit2021highdimensional,
      title={High-Dimensional Bayesian Optimisation with Variational Autoencoders and Deep Metric Learning}, 
      author={Antoine Grosnit and Rasul Tutunov and Alexandre Max Maraval and Ryan-Rhys Griffiths and Alexander I. Cowen-Rivers and Lin Yang and Lin Zhu and Wenlong Lyu and Zhitang Chen and Jun Wang and Jan Peters and Haitham Bou-Ammar},
      year={2021},
      eprint={2106.03609},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Acknowledgements