To train the model from scratch, download the preprocessed file from this link and save it into data/single or data/multi.
We use the CrossDocked dataset and the reaction-based slicing method in LibINVENT to construct the single and multi R-group datasets. To process the dataset from scratch, follow these steps:
- Download the dataset archive crossdocked_pocket10.tar.gz and the split file split_by_name.pt from this link.
- Extract the TAR archive using the command:
tar -xzvf crossdocked_pocket10.tar.gz
- Split the raw protein-ligand (PL) complexes and convert the SDF files into SMILES format:
python split_and_convert.py
- Use the reaction-based slicing method of LibINVENT, provided in Lib-INVENT-dataset, to slice the molecules into scaffolds and R-groups. Replace example_configurations/supporting_files/filter_conditions.json in Lib-INVENT-dataset with the filter_conditions.json from this directory. For the single R-group dataset, set the max_cuts parameter in example_configurations/reaction_based_slicing.json to 1; for the multi R-group dataset, set it to 4.
- Process and prepare datasets:
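For the single R-group dataset:
cd single
python -W ignore process_and_prepare.py
For the multi R-group dataset, starting again from the same parent directory:
cd multi
python -W ignore process_and_prepare.py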
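
The max_cuts step above can also be scripted, so that the single and multi configurations are generated from one template rather than edited by hand. The following is a minimal sketch, not part of this repository. The internal layout of reaction_based_slicing.json is not described here, so the hypothetical helper set_key_everywhere searches the whole JSON tree for keys named max_cuts instead of assuming a fixed location. The function names and the output file names are illustrative assumptions.

```python
import json
from pathlib import Path


def set_key_everywhere(node, key, value):
    """Set `key` to `value` wherever it already appears in nested JSON data.

    Hypothetical helper: returns the number of occurrences that were updated.
    """
    updated = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                # Overwriting an existing key does not change the dict size,
                # so it is safe to do while iterating.
                node[k] = value
                updated += 1
            else:
                updated += set_key_everywhere(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            updated += set_key_everywhere(item, key, value)
    return updated


def write_slicing_config(template_path, output_path, max_cuts):
    """Copy a slicing config with max_cuts overridden.

    Raises KeyError if the template contains no max_cuts key, because the
    assumed key name would then not match the actual file.
    """
    config = json.loads(Path(template_path).read_text())
    if set_key_everywhere(config, "max_cuts", max_cuts) == 0:
        raise KeyError(f"no 'max_cuts' key found in {template_path}")
    Path(output_path).write_text(json.dumps(config, indent=2))
    return config
```

Under these assumptions, calling write_slicing_config("example_configurations/reaction_based_slicing.json", "slicing_single.json", 1) would produce the single R-group configuration, and the same call with "slicing_multi.json" and 4 would produce the multi R-group one. Writing to separate output files keeps the original template unchanged.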