This project trains a Transformer model for de novo analysis of GC-MS spectra.
The conda environment files are in the `env_specification` folder. `BARTtrainH100` is the main environment used for data preprocessing, training, and evaluation. The `NEIMSpy3_environment` is used only for NEIMS spectra generation; a separate environment was necessary because of package incompatibilities.
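For example, the environments might be created like this (a minimal sketch assuming the spec files are named after the environments; check the actual file names in `env_specification`):

```bash
# Create the main environment for preprocessing, training, and evaluation
# (assuming the spec file name matches the environment name)
conda env create -f env_specification/BARTtrainH100.yml
conda activate BARTtrainH100

# Create the separate environment used only for NEIMS spectra generation
conda env create -f env_specification/NEIMSpy3_environment.yml
```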
Because of size constraints and licensing, we cannot provide the datasets we used for training. However, we provide the scripts used to obtain, filter, and preprocess the ZINC SMILES dataset, as well as all the preprocessing scripts for the NIST GC-MS dataset.
For every dataset in the `data/datasets` folder, there is a README file that describes the particular dataset in more detail and explains how it was obtained.
Pretraining and finetuning can be conducted using the `train_bart.py` script. The script needs a couple of arguments to run, most importantly `config_file`, a YAML file that contains all the hyperparameters for the training.

All the run scripts we used for our experiments are in the `run_scripts` folder and need no additional parameters. The scripts are named `run_pretrain*` and `run_finetune*`. Their corresponding config files are in the `configs` folder, named `train_config_pretrain*` and `train_config_finetune*`.
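A training run might then be launched like this (a minimal sketch; the `--config_file` flag name and the concrete file names are assumptions, so check the actual run scripts):

```bash
# Launch finetuning with an explicit config file
# (train_config_finetune_example.yaml is a placeholder name)
python train_bart.py --config_file configs/train_config_finetune_example.yaml

# Or simply use one of the prepared runners, which need no extra arguments
# (run_finetune_example.sh is likewise a placeholder)
bash run_scripts/run_finetune_example.sh
```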
Prediction and evaluation are two separate steps. Depending on the hardware used, prediction on the NIST valid/test splits takes anywhere from 4 hours to infinity. Once you have the predictions, you can run multiple evaluation runs, each taking around a minute.
The prediction script, `predict.py`, has its runner in the `run_scripts` folder (`run_predict.sh`) and its config files in the `configs` folder (`predict_config*`). The evaluation script, `evaluate_predicitons.py`, likewise has its runner in the `run_scripts` folder (`run_eval.sh`) and its config files in the `configs` folder (`eval_config*`).
------------------------- Other folders ------------------------
The predictions computed by our models are in the `predictions` folder. Along with the predictions, each folder contains a `log_file.yaml` with all the evaluation results (sometimes from multiple evaluation runs with different settings) and the figures generated by the latest evaluation.
The `tokenizer` folder contains all the different tokenizers used during the experiments and the final training. It also contains the training data for the BBPE tokenizers.
This folder contains the custom implementation of the BART model used for the experiments. The implementation is based on the `transformers` library and is a modification of the `BartForConditionalGeneration` class.
This folder contains a lot of things. Some of them are useful and nice; some of them you'd better not look at. I leave it in the repository as a memento of the hard work and the struggle we went through.
That's it. :)