Data

We have provided two primary datasets (ZINC and CASF), for which we have pre-computed fragmentations and structural information. These match the datasets used in our paper, Deep Generative Models for 3D Compound Design.

We have also provided several scripts to allow you to use your own dataset.

To preprocess your own dataset

Option 1: Pre-computed fragmentations

If you have prepared your own fragmentations (two unlinked substructures), run calculate_distance_angle.py.

You will need to supply a data file containing a list of fragments and molecules, and an SD file containing a conformation of each molecule.

python calculate_distance_angle.py --data_path PATH_TO_FILE --sdf_path PATH_TO_FILE --output_path PATH_TO_FILE --verbose
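Before running the command, it can help to check how many molecule records the SD file holds. The sketch below relies only on the SD file convention that each record ends with a line containing `$$$$`. The function name `count_sdf_records` and the toy input are illustrative and not part of this repository; the toy records are not chemically valid molblocks.

```python
def count_sdf_records(sdf_text):
    """Count molecule records in SD file text.

    Each record in an SD file is terminated by a line containing only '$$$$'.
    """
    return sum(1 for line in sdf_text.splitlines() if line.strip() == "$$$$")


# Toy input with two records (placeholder bodies, for illustration only).
sdf_text = (
    "mol_1\n  placeholder record body\nM  END\n$$$$\n"
    "mol_2\n  placeholder record body\nM  END\n$$$$\n"
)
n_records = count_sdf_records(sdf_text)
print(n_records)
```

For a real file, read its contents with `open(path).read()` and pass the text to the function.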

The format of the data file is:

Fragments (SMILES) Full molecule (SMILES)

For example:

COc1ccccc1[*:2].Fc1cccc([*:1])c1 COc1ccccc1CCC(=O)c1cccc(F)c1
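A minimal sketch of how one line of this data file could be read and sanity-checked. It assumes the two columns are separated by whitespace and that each fragment carries an attachment point such as `[*:1]`, as in the example above. The function `parse_fragment_line` is hypothetical and is not taken from calculate_distance_angle.py.

```python
def parse_fragment_line(line):
    """Split one data-file line into (list of fragments, full molecule SMILES)."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"expected 2 whitespace-separated fields, got {len(fields)}")
    fragments_smiles, molecule_smiles = fields

    # The two unlinked substructures are joined by '.' in one SMILES string.
    fragments = fragments_smiles.split(".")
    if len(fragments) != 2:
        raise ValueError("expected exactly two fragments separated by '.'")

    # Assumption: every fragment marks its attachment point with a dummy atom.
    if not all("[*:" in frag for frag in fragments):
        raise ValueError("each fragment needs an attachment point such as [*:1]")
    return fragments, molecule_smiles


example = "COc1ccccc1[*:2].Fc1cccc([*:1])c1 COc1ccccc1CCC(=O)c1cccc(F)c1"
fragments, molecule = parse_fragment_line(example)
print(fragments)
print(molecule)
```

This checks only the layout of the line, not whether the SMILES strings are chemically valid.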

Now run prepare_data.py, passing the output file as --data_path (see below for details).

Option 2: SD file of molecules

If you want to simply provide an SD file containing a set of molecules, run prepare_data_from_sdf.py.

python prepare_data_from_sdf.py --sdf_path PATH_TO_DATA --output_path PATH_TO_FILE --verbose

This will compute fragmentations and structural information, as per the criteria described in our paper, Deep Generative Models for 3D Compound Design. If you do not want to filter the fragmentations using the 2D chemical property filters described in our paper, add the flag --no_filters to the above command.

For example, the following command reproduces the CASF dataset entries in data_casf_final.txt.

python prepare_data_from_sdf.py --sdf_path ../analysis/casf_structures.sdf --output_path data_casf_duplicate.txt --verbose

Now run prepare_data.py, passing the output file as --data_path (see below for details).

To use a provided dataset

To process the provided datasets (ZINC and CASF), run prepare_data.py. This allows you to train, validate and generate molecules using DeLinker.py. Generated molecules will have a linker with at most the same number of atoms as the reference molecule provided.

python prepare_data.py

If you want to process your own dataset (having followed the above preprocessing steps), run prepare_data.py with the following arguments:

python prepare_data.py --data_path PATH_TO_DATA --dataset_name NAME_OF_DATASET

The format taken by prepare_data.py is:

Full molecule (SMILES) Linker (SMILES) Fragments (SMILES) Distance (Angstrom) Angle (Radians)

For example:

COc1ccccc1CCC(=O)c1cccc(F)c1 O=C(CC[*:2])[*:1] COc1ccccc1[*:2].Fc1cccc([*:1])c1 4.69 2.00
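A minimal sketch of reading one record in this five-column format, assuming whitespace-separated columns as in the example above. The `LinkerRecord` class and `parse_record` function are hypothetical helpers for illustration, not the parsing code used inside prepare_data.py.

```python
import math
from dataclasses import dataclass


@dataclass
class LinkerRecord:
    molecule: str    # full molecule (SMILES)
    linker: str      # linker (SMILES)
    fragments: str   # fragments (SMILES), joined by '.'
    distance: float  # Angstrom
    angle: float     # radians


def parse_record(line):
    """Parse one five-column line into a LinkerRecord."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 whitespace-separated fields, got {len(fields)}")
    molecule, linker, fragments, distance, angle = fields
    return LinkerRecord(molecule, linker, fragments, float(distance), float(angle))


record = parse_record(
    "COc1ccccc1CCC(=O)c1cccc(F)c1 O=C(CC[*:2])[*:1] "
    "COc1ccccc1[*:2].Fc1cccc([*:1])c1 4.69 2.00"
)
# The angle is stored in radians; convert to degrees only for display.
print(record.distance, round(math.degrees(record.angle), 1))
```

Keeping the angle in radians in the record matches the file format. Degrees are computed here only to make the value easier to read.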

If you want to use DeLinker_test.py (which generates linkers with a specified number of atoms), run prepare_data.py with the following arguments:

python prepare_data.py --data_path PATH_TO_DATA --dataset_name NAME_OF_DATASET --test_mode

prepare_data.py takes two possible input formats, listed below.

Fragments (SMILES) Distance (Angstrom) Angle (Radians)
Full molecule (SMILES) Linker (SMILES) Fragments (SMILES) Distance (Angstrom) Angle (Radians)
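The two layouts differ in their number of columns, so a file can be checked before it is passed to prepare_data.py. The sketch below assumes whitespace-separated columns and tells the layouts apart by field count alone. The function name and the labels "fragments_only" and "full" are hypothetical, and this is not necessarily how prepare_data.py itself distinguishes them.

```python
def detect_format(line):
    """Classify a data line by its number of whitespace-separated fields."""
    n_fields = len(line.split())
    if n_fields == 3:
        return "fragments_only"  # Fragments, Distance, Angle
    if n_fields == 5:
        return "full"            # Full molecule, Linker, Fragments, Distance, Angle
    raise ValueError(f"unrecognised layout with {n_fields} fields")


short_line = "COc1ccccc1[*:2].Fc1cccc([*:1])c1 4.69 2.00"
full_line = (
    "COc1ccccc1CCC(=O)c1cccc(F)c1 O=C(CC[*:2])[*:1] "
    "COc1ccccc1[*:2].Fc1cccc([*:1])c1 4.69 2.00"
)
print(detect_format(short_line))
print(detect_format(full_line))
```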

Contact (Questions/Bugs/Requests)

Please submit a GitHub issue or contact Fergus Imrie [email protected].
