The generative model was pretrained on a combined dataset built from ChEMBL 33, GuacaMol v1, MOSES, and BindingDB (08-2023). During preprocessing, we excluded all SMILES strings that contain more than 133 tokens, as well as strings that contain tokens occurring fewer than 1000 times in the dataset. The combined dataset contains 5 539 765 unique and valid SMILES strings, which are split into:
If you wish to use our pretrained model, you can download the model weights and dataset descriptors (internal parameters required for generation). If you use our Jupyter Notebook, it has a dedicated cell that downloads these files and puts them into the appropriate folders for you!
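If you want to apply a comparable filter to your own SMILES data, the length and token-frequency criteria used for the combined dataset can be sketched roughly as follows. This is an illustration only, not our preprocessing code: the regex tokenizer is a commonly used pattern and may differ from the tokenizer used in this repository, the function names `tokenize` and `filter_smiles` are hypothetical, and the sketch assumes token counts are computed once over the whole input before filtering.

```python
import re
from collections import Counter

# Assumption: a widely used regex for splitting SMILES into tokens.
# The tokenizer used for the pretrained model may differ.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize(smiles):
    """Split a SMILES string into a list of tokens (hypothetical helper)."""
    return SMILES_TOKEN_RE.findall(smiles)


def filter_smiles(smiles_list, max_tokens=133, min_token_count=1000):
    """Keep strings with at most `max_tokens` tokens and no rare tokens.

    A token is considered rare if it occurs fewer than `min_token_count`
    times across all strings in `smiles_list` (counted before filtering).
    """
    tokenized = [tokenize(s) for s in smiles_list]
    counts = Counter(tok for toks in tokenized for tok in toks)

    kept = []
    for smiles, toks in zip(smiles_list, tokenized):
        if len(toks) > max_tokens:
            continue  # too long
        if any(counts[tok] < min_token_count for tok in toks):
            continue  # contains a rare token
        kept.append(smiles)
    return kept
```

The defaults mirror the thresholds stated for the combined dataset (133 tokens, 1000 occurrences); for a small custom dataset you would likely need a much lower `min_token_count`. The sketch does not check validity or remove duplicates.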
You can also download the PCA weights fitted on the combined dataset: