This project trains a Transformer model for de novo analysis of GC-MS spectra.
The conda environment files are in the `env_specification` folder. `BARTtrainH100` is the main environment used for data preprocessing, training, and evaluation. The `NEIMSpy3_environment` is used only for NEIMS spectra generation; a separate environment was necessary because of package incompatibilities.
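For example, the environments might be created like this (a minimal sketch assuming the spec files are named after the environments; check the actual file names in `env_specification`):

```bash
# Create the main environment for preprocessing, training, and evaluation
# (assuming the spec file name matches the environment name)
conda env create -f env_specification/BARTtrainH100.yml
conda activate BARTtrainH100

# Create the separate environment used only for NEIMS spectra generation
conda env create -f env_specification/NEIMSpy3_environment.yml
```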
Because of size constraints and licensing, we cannot provide the datasets we used for training. However, we provide the scripts used to obtain, filter, and preprocess the ZINC SMILES dataset, as well as all the preprocessing scripts for the NIST GC-MS dataset.
For every dataset in the `data/datasets` folder, there is a README file that describes the particular dataset in more detail and explains how it was obtained.
Pretraining and finetuning can be conducted using the `train_bart.py` script. The script needs a couple of arguments to run, most importantly `config_file`, a YAML file that contains all the hyperparameters for the training.

All the run scripts we used for our experiments are in the `run_scripts` folder and need no additional parameters. The scripts are named `run_pretrain*` and `run_finetune*`. Their corresponding config files are in the `configs` folder, named `train_config_pretrain*` and `train_config_finetune*`.
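A training run might then be launched like this (a minimal sketch; the `--config_file` flag name and the concrete file names are assumptions, so check the actual run scripts):

```bash
# Launch finetuning with an explicit config file
# (train_config_finetune_example.yaml is a placeholder name)
python train_bart.py --config_file configs/train_config_finetune_example.yaml

# Or simply use one of the prepared runners, which need no extra arguments
# (run_finetune_example.sh is likewise a placeholder)
bash run_scripts/run_finetune_example.sh
```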
Prediction and evaluation are two separate steps. Depending on the hardware used, prediction on the NIST valid/test splits takes anywhere from 4 hours to infinity. Once you have the predictions, you can run multiple evaluation runs, each taking around a minute.
The prediction script, `predict.py`, has its runner in the `run_scripts` folder (`run_predict.sh`) and its config files in the `configs` folder (`predict_config*`). The evaluation script, `evaluate_predicitons.py`, likewise has its runner in the `run_scripts` folder (`run_eval.sh`) and its config files in the `configs` folder (`eval_config*`).
------------------------- Other folders ------------------------
The predictions computed by our models are in the `predictions` folder. Along with the predictions, each folder contains a `log_file.yaml` with all the evaluation results (sometimes from multiple evaluation runs with different settings) and the figures generated by the latest evaluation.
The `tokenizer` folder contains all the different tokenizers used during the experiments and the final training. It also contains the training data for the BBPE tokenizers.
This folder contains the custom implementation of the BART model used for the experiments. The implementation is based on the `transformers` library and is a modification of the `BartForConditionalGeneration` class.
This folder contains a lot of things. Some of them are useful and nice; some of them you'd better not look at. I leave it in the repository as a memento of the hard work and the struggle we went through.
That's it. :)