This repository contains the supporting information for Rowan's recent preprint on pKa prediction. We hope that this collection of test datasets can be useful for future work in pKa prediction.
The fit dataset was adapted from Thapa and Raghavachari, filtering out any SMILES strings that could not be parsed by RDKit. This resulted in 215 molecules and associated pKa values, which can be found in TR215.csv.
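The filtering step described above can be sketched as follows. This is a minimal illustration, not the script used to build the dataset: the column name `smiles` and the helper names `rdkit_parse`, `load_rows`, and `filter_parseable` are assumptions made for the example. It relies only on the fact that RDKit's `Chem.MolFromSmiles` returns `None` when a SMILES string cannot be parsed.

```python
import csv
from typing import Any, Callable, Dict, List


def rdkit_parse(smiles: str) -> Any:
    """Parse a SMILES string with RDKit; the result is None if parsing fails."""
    # Imported lazily so the filtering logic can be reused with another parser.
    from rdkit import Chem

    return Chem.MolFromSmiles(smiles)


def load_rows(path: str) -> List[Dict[str, str]]:
    """Read a CSV file into a list of dictionaries, one per row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def filter_parseable(
    rows: List[Dict[str, str]],
    parse: Callable[[str], Any] = rdkit_parse,
    smiles_key: str = "smiles",  # hypothetical column name
) -> List[Dict[str, str]]:
    """Keep only the rows whose SMILES string the parser accepts."""
    return [row for row in rows if parse(row[smiles_key]) is not None]
```

Calling `filter_parseable(load_rows(path))` on a raw table would then drop every row that RDKit rejects, provided the SMILES column is actually named `smiles`.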
Eight different datasets used to benchmark Rowan pKa are included in assays/, and the results can be visualized by running plot_assay.ipynb. Here's where the data comes from:
Data for SAMPL6 were obtained from the SAMPL6 GitHub repository. We compared the experimentally measured macroscopic pKa values to the microscopic pKa values computed by Rowan. Only the most acidic and most basic microscopic sites on each molecule were considered, and each was compared to the closest macroscopic value, consistent with the matching procedures detailed here. This had the effect of excluding doubly ionized microstates.
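The closest-value matching described above can be sketched as follows. This is a hedged illustration rather than the code used for the benchmark: the function names and input layout are assumptions, and the computed values passed in are taken to be the already selected most acidic and most basic microscopic sites.

```python
from typing import List, Tuple


def closest_value(computed: float, experimental: List[float]) -> float:
    """Return the experimental value nearest to the computed one."""
    return min(experimental, key=lambda value: abs(value - computed))


def match_sites(
    computed_sites: List[float],
    experimental_macro: List[float],
) -> List[Tuple[float, float]]:
    """Pair each computed microscopic pKa with its closest macroscopic pKa.

    Each computed value is matched independently, so two computed sites
    may map to the same experimental value.
    """
    return [
        (computed, closest_value(computed, experimental_macro))
        for computed in computed_sites
    ]
```

Each returned pair is a (computed, experimental) tuple, from which per-site errors can be calculated.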
Data for SAMPL7 were obtained from the SAMPL7 GitHub repository. Since only one pKa value was obtained for each molecule, assignment was straightforward.
Data were obtained from this paper.
Data were obtained from this paper.
Data were obtained from this paper.
Data were obtained from Drug-Like Properties: Concepts, Structure Design and Methods from ADME to Toxicity Optimization, by Li Di and Edward Kerns.
Data were obtained from this paper.
Data were obtained from this paper.
Corin Wagen, 2024