We have provided two primary datasets (ZINC and CASF), for which we have pre-computed fragmentations and structural information. These match the datasets used in our paper, Deep Generative Models for 3D Compound Design.
We have also provided several scripts to allow you to use your own dataset.
If you have prepared your own fragmentations (two unlinked substructures), run calculate_distance_angle.py.
You will need to supply a data file containing a list of fragments and molecules, and an SD file containing a conformation of each molecule.
python calculate_distance_angle.py --data_path PATH_TO_FILE --sdf_path PATH_TO_FILE --output_path PATH_TO_FILE --verbose
The format of the data file is:
Fragments (SMILES) Full molecule (SMILES)
For example:
COc1ccccc1[*:2].Fc1cccc([*:1])c1 COc1ccccc1CCC(=O)c1cccc(F)c1
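Before running the script, it can help to sanity-check each line of the data file. The following is a minimal sketch, not part of the repository: the function name parse_fragment_line is hypothetical, and the checks (whitespace-separated fields, two fragments joined by ".", attachment points labelled [*:1] and [*:2]) are assumptions inferred from the single example line above.

```python
import re

# Matches labelled attachment points such as [*:1] or [*:2].
ATTACHMENT = re.compile(r"\[\*:(\d+)\]")


def parse_fragment_line(line):
    """Split one data-file line into (fragments, molecule) and sanity-check it.

    Hypothetical helper; the rules below are inferred from the example line,
    not taken from calculate_distance_angle.py itself.
    """
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}")
    fragments, molecule = fields
    # Two unlinked substructures are written as one SMILES joined by '.'.
    if len(fragments.split(".")) != 2:
        raise ValueError("fragments field should hold two '.'-separated parts")
    # Assumed convention: one attachment point labelled 1 and one labelled 2.
    labels = sorted(ATTACHMENT.findall(fragments))
    if labels != ["1", "2"]:
        raise ValueError(f"expected attachment labels 1 and 2, got {labels}")
    if "*" in molecule:
        raise ValueError("full molecule should not contain attachment points")
    return fragments, molecule


example = "COc1ccccc1[*:2].Fc1cccc([*:1])c1 COc1ccccc1CCC(=O)c1cccc(F)c1"
fragments, molecule = parse_fragment_line(example)
print(fragments)
print(molecule)
```

This only checks the text layout of a line; it does not verify that the SMILES strings are chemically valid or that the fragments are substructures of the molecule.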
Now run prepare_data.py with the output file as the first argument (see below for details).
If you want to simply provide an SD file containing a set of molecules, run prepare_data_from_sdf.py.
python prepare_data_from_sdf.py --sdf_path PATH_TO_DATA --output_path PATH_TO_FILE --verbose
This will compute fragmentations and structural information, as per the criteria described in our paper, Deep Generative Models for 3D Compound Design. If you do not want to filter the fragmentations using the 2D chemical property filters described in our paper, add the flag --no_filters to the above command.
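If you call the script from Python, for instance to process several SD files, the command line can be assembled programmatically. This is a minimal sketch: the function build_sdf_command and the file names molecules.sdf and fragments.txt are hypothetical, and only the flags mentioned above (--sdf_path, --output_path, --verbose, --no_filters) are used.

```python
import shlex


def build_sdf_command(sdf_path, output_path, no_filters=False, verbose=True):
    """Return the argument list for prepare_data_from_sdf.py.

    Hypothetical helper; it only assembles the flags described in this README.
    """
    cmd = [
        "python", "prepare_data_from_sdf.py",
        "--sdf_path", sdf_path,
        "--output_path", output_path,
    ]
    if verbose:
        cmd.append("--verbose")
    if no_filters:
        # Skip the 2D chemical property filters.
        cmd.append("--no_filters")
    return cmd


# Placeholder file names for illustration only.
cmd = build_sdf_command("molecules.sdf", "fragments.txt", no_filters=True)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the directory containing the script.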
For example, the following command reproduces the CASF dataset entries in data_casf_final.txt.
python prepare_data_from_sdf.py --sdf_path ../analysis/casf_structures.sdf --output_path data_casf_duplicate.txt --verbose
Now run prepare_data.py with the output file as the first argument (see below for details).
To process the provided datasets (ZINC and CASF), run prepare_data.py. This allows you to train, validate and generate molecules using DeLinker.py. Generated molecules will have a linker with at most the same number of atoms as the reference molecule provided.
python prepare_data.py
If you want to process your own dataset (having followed the above preprocessing steps), run prepare_data.py with the following arguments:
python prepare_data.py --data_path PATH_TO_DATA --dataset_name NAME_OF_DATASET
The format taken by prepare_data.py is:
Full molecule (SMILES) Linker (SMILES) Fragments (SMILES) Distance (Angstrom) Angle (Radians)
For example:
COc1ccccc1CCC(=O)c1cccc(F)c1 O=C(CC[*:2])[*:1] COc1ccccc1[*:2].Fc1cccc([*:1])c1 4.69 2.00
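To inspect such a file before passing it to prepare_data.py, each line can be read into a small record. This is a minimal sketch under stated assumptions: the names LinkerRecord and parse_record are hypothetical, fields are assumed to be separated by whitespace, and the conversion to degrees is included only because the angle is stored in radians.

```python
import math
from dataclasses import dataclass


@dataclass
class LinkerRecord:
    """One line of the five-field format (hypothetical container)."""
    molecule: str    # full molecule (SMILES)
    linker: str      # linker (SMILES)
    fragments: str   # fragments (SMILES)
    distance: float  # Angstrom
    angle: float     # radians


def parse_record(line):
    """Parse one whitespace-separated line into a LinkerRecord."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    molecule, linker, fragments, distance, angle = fields
    return LinkerRecord(molecule, linker, fragments,
                        float(distance), float(angle))


record = parse_record(
    "COc1ccccc1CCC(=O)c1cccc(F)c1 O=C(CC[*:2])[*:1] "
    "COc1ccccc1[*:2].Fc1cccc([*:1])c1 4.69 2.00"
)
print(record.linker)
print(record.distance)
# The angle is stored in radians; convert to degrees for readability.
print(round(math.degrees(record.angle), 1))
```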
If you want to use DeLinker_test.py (which generates linkers with a specified number of atoms), run prepare_data.py with the following arguments:
python prepare_data.py --data_path PATH_TO_DATA --dataset_name NAME_OF_DATASET --test_mode
prepare_data.py takes two possible input formats, listed below.
Fragments (SMILES) Distance (Angstrom) Angle (Radians)
Full molecule (SMILES) Linker (SMILES) Fragments (SMILES) Distance (Angstrom) Angle (Radians)
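The two formats differ in the number of fields per line (three versus five), so a file can be checked for consistency before it is processed. The sketch below is illustrative only: detect_format and the labels "fragments_only" and "full" are hypothetical names, fields are assumed to be whitespace-separated, and this is not necessarily how prepare_data.py distinguishes the formats internally.

```python
def detect_format(line):
    """Classify one input line by its number of whitespace-separated fields.

    Hypothetical helper; the returned labels are illustrative names only.
    """
    n_fields = len(line.split())
    if n_fields == 3:
        # Fragments, distance, angle
        return "fragments_only"
    if n_fields == 5:
        # Full molecule, linker, fragments, distance, angle
        return "full"
    raise ValueError(f"expected 3 or 5 fields, got {n_fields}")


lines = [
    "COc1ccccc1[*:2].Fc1cccc([*:1])c1 4.69 2.00",
    "COc1ccccc1CCC(=O)c1cccc(F)c1 O=C(CC[*:2])[*:1] "
    "COc1ccccc1[*:2].Fc1cccc([*:1])c1 4.69 2.00",
]
formats = [detect_format(line) for line in lines]
print(formats)
```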
Please submit a GitHub issue or contact Fergus Imrie.