Deep-Drug-Coder/datasets at master · MolecularAI/Deep-Drug-Coder


Files in this directory:

- CHEMBL25_TEST_MOLS.h5
- CHEMBL25_TRAIN_MOLS.h5
- DRD2_TEST_MOLS.h5
- DRD2_TRAIN_MOLS.h5
- DRD2_VALID_MOLS.h5
- README.md

README.md

These datasets are filtered extracts from ChEMBL25 and ExCAPE-DB.

The neural network was trained with a subset of the ChEMBL dataset, version 25. First, the complete dataset was standardized with the MolVS Python module [29] using the super parent setting, which standardizes fragment, charge, isotope, stereochemistry and tautomeric states. Molecules were then filtered to contain only the atoms [H, C, N, O, F, S, Cl, Br] and fewer than 50 heavy atoms. Next, the known active molecules found in the DRD2 dataset were removed. The remaining dataset was split into training and test subsets with a 9:1 ratio. During training, 10% of the training subset was used as a fixed validation set.
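The 9:1 split with a fixed 10% validation hold-out can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_dataset`, the seeded random shuffle, and the rounding of subset sizes are all assumptions, since the README does not say how molecules were assigned to each subset.

```python
import random


def split_dataset(items, test_fraction=0.1, valid_fraction=0.1, seed=0):
    """Split items into train/valid/test lists (hypothetical helper).

    Assumption: molecules are shuffled with a fixed seed before splitting.
    The README states only the ratios, not the assignment procedure.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)

    # 9:1 train/test split of the whole dataset.
    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    train_full = shuffled[n_test:]

    # 10% of the training subset is set aside as a fixed validation set.
    n_valid = round(len(train_full) * valid_fraction)
    valid = train_full[:n_valid]
    train = train_full[n_valid:]
    return train, valid, test


# Example with placeholder integer IDs standing in for molecules.
train, valid, test = split_dataset(range(1000))
```

Using a fixed seed keeps the validation set identical across runs, which matches the "fixed validation set" wording above.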

All data regarding the DRD2 entry in ExCAPE-DB were downloaded [30] and preprocessed as follows. First, duplicate compounds as well as SMILES strings [6] that were not sanitizable by RDKit v2018.09.1 [31] were removed from the DRD2 dataset. All compounds with a pXC50 value greater than five were selected as known actives, along with 100,000 random compounds measured as inactive against DRD2 in ExCAPE-DB. Stereochemical information was removed by converting all molecules to non-isomeric SMILES strings. The dataset was further reduced to exclude SMILES strings that were longer than the ones in ChEMBL or contained characters not found in ChEMBL. This led to removing strings with iodine and phosphorus. All active molecules were clustered based on the pairwise Tanimoto distance of their Morgan fingerprints with a radius of two, using the implementation of the Butina algorithm [32] found in RDKit. The maximum distance threshold for the algorithm to associate neighbours was fixed to 0.4, with a value above it indicating different clusters. All clusters were sorted by size and assigned to the train, validation and test subsets iteratively using a “4-1-1” scheme, i.e. for every four clusters assigned to the train set, one cluster was assigned to the validation set and one cluster to the test set, in order of decreasing cluster size.
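The “4-1-1” assignment described above can be sketched in a few lines. This is an illustrative reading of the scheme rather than the authors' implementation: the function name `assign_clusters` is hypothetical, the clusters are assumed to be already computed (for example by a Butina clustering step), and ties in cluster size are assumed to keep their input order.

```python
def assign_clusters(clusters):
    """Assign clusters to train/valid/test using a 4-1-1 cycle.

    `clusters` is a list of lists of molecule identifiers.
    Assumption: equally sized clusters keep their input order.
    """
    # Largest clusters first, as the scheme works in decreasing size order.
    ordered = sorted(clusters, key=len, reverse=True)

    # Repeating cycle: four clusters to train, one to valid, one to test.
    pattern = ["train"] * 4 + ["valid", "test"]

    splits = {"train": [], "valid": [], "test": []}
    for index, cluster in enumerate(ordered):
        splits[pattern[index % len(pattern)]].extend(cluster)
    return splits


# Example with placeholder identifiers: seven clusters of sizes 1 to 7.
example = [[f"mol{size}_{i}" for i in range(size)] for size in range(1, 8)]
splits = assign_clusters(example)
```

Because whole clusters go to a single subset, structurally similar actives do not end up on both sides of the train/test boundary.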
