This repository provides the data and methods described in the paper:
Pseudodata-based molecular structure generator to reveal unknown chemicals (under review)
Authors: Nanyang Yu, Zheng Ma, Qi Shao, Laihui Li, Xuebing Wang, and Si Wei*
Python: 3.7
Torch: 1.7.1
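Before training or evaluating, it can help to confirm that the local environment matches the versions listed above. The following is a minimal sketch, not part of the repository: the function name `version_report` and the messages are hypothetical, and it assumes that a local build suffix such as `+cu110` in the torch version string can be ignored.

```python
import sys

# Versions taken from the README above.
EXPECTED_PYTHON = (3, 7)
EXPECTED_TORCH = "1.7.1"


def version_report(python_version, torch_version):
    """Return a list of mismatches against the versions listed in the README."""
    problems = []
    found_python = tuple(python_version[:2])
    if found_python != EXPECTED_PYTHON:
        problems.append("Python %d.%d found, README lists 3.7" % found_python)
    if torch_version is None:
        problems.append("torch is not installed")
    # Assumption: drop a local build suffix such as "+cu110" before comparing.
    elif str(torch_version).split("+")[0] != EXPECTED_TORCH:
        problems.append(
            "torch %s found, README lists %s" % (torch_version, EXPECTED_TORCH)
        )
    return problems


if __name__ == "__main__":
    try:
        import torch
        found_torch = torch.__version__
    except ImportError:
        found_torch = None
    report = version_report(sys.version_info, found_torch)
    for line in report or ["Environment matches the README."]:
        print(line)
```

A mismatch does not necessarily mean the code will fail; the versions above are simply the ones listed for this repository.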
For training, we use 30k+ pseudo SMILES-spectrum pairs generated by CFM-ID (you can download the raw SMILES list file here). For evaluation, we use 300+ real spectra to verify our method (download here). For evaluation on real samples, we use one LC–QTOF dataset of wastewater samples to verify our model (download here, code: gmas).
We provide the MSGO models (PFAS, code: 0bfg; lipid, code: 37it), trained on pseudo SMILES-spectrum pairs using all of the methods described in the paper. You can also train your own model with other methods.
You can replicate our experiment, including all the techniques:
python tools/train.py --id all_trick --user_precurso 1 --use_mask 1 --use_formual 1
More options can be viewed in opt.py
Download the model weights into ckpts/pfas or ckpts/lipid, then run:
python tools/eval.py --log_path [ckpts/pfas or ckpts/lipid]
We provide example data in data/example.
For PFAS, run:
python tools/eval_standard.py --log_path ckpts/pfas --real_csv ./data/example/pfas.csv --out_csv ./pfas_results.csv --beam_size 500 --polar neg
For lipid, run:
python tools/eval_standard.py --log_path ckpts/lipid --real_csv ./data/example/lipid.csv --out_csv ./lipid_results.csv --beam_size 300 --polar pos
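The two example commands above differ only in the dataset name, beam size, and polarity, so they can be driven from one small script. This is a sketch rather than part of the repository: `build_eval_command` and `run_all` are hypothetical helper names, and the flags and values are copied verbatim from the commands above. It assumes the script is launched from the repository root.

```python
import subprocess
import sys

# Settings copied from the two example commands above.
EXAMPLES = {
    "pfas": {"beam_size": 500, "polar": "neg"},
    "lipid": {"beam_size": 300, "polar": "pos"},
}


def build_eval_command(name):
    """Build the eval_standard.py argument list for one example dataset."""
    cfg = EXAMPLES[name]
    return [
        sys.executable, "tools/eval_standard.py",
        "--log_path", "ckpts/%s" % name,
        "--real_csv", "./data/example/%s.csv" % name,
        "--out_csv", "./%s_results.csv" % name,
        "--beam_size", str(cfg["beam_size"]),
        "--polar", cfg["polar"],
    ]


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for name in EXAMPLES:
        cmd = build_eval_command(name)
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if the evaluation fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the script only prints the commands, which is a cheap way to check paths before starting a long beam search.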
You will then obtain a results CSV file including the top 10 predictions.
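To take a quick look at a results file, a short standard-library script is enough. The column layout of the output CSV is not documented here, so this sketch makes no assumption about it and only returns the header and the first few rows; `preview_csv` is a hypothetical helper name, and `./pfas_results.csv` is the output path from the PFAS example command.

```python
import csv


def preview_csv(path, n_rows=3):
    """Return the header and the first n_rows data rows of a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # zip stops after n_rows, so the rest of the file is never read.
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


if __name__ == "__main__":
    try:
        header, rows = preview_csv("./pfas_results.csv")
    except FileNotFoundError:
        print("No results file yet; run the evaluation command first.")
    else:
        print(header)
        for row in rows:
            print(row)
```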
- Release model weights
- Release pseudo and real data
- Release training process