Releases: CDDLeiden/QSPRpred

Version 3.1.1

02 Jul 13:56

Change Log

From v3.0.2 to v3.1.1

Fixes

  • Fixed a bug in QSPRDataset where property transformations were not applied.
  • Fixed a bug where an attached standardizer would be refit when calling
    QSPRModel.predictMols with use_applicability_domain=True.
  • Fixed random seed not set in FoldsFromDataSplit.iterFolds for ClusterSplit.

Changes

  • renamed PandasDataTable.transform to PandasDataTable.transformProperties
  • moved imputeProperties, dropEmptyProperties and hasProperty from MoleculeTable
    to PandasDataTable.
  • removed getProperties, addProperty, removeProperty, now use PandasDataTable
    methods directly.
  • Since the way descriptors are saved has changed, this release is incompatible with
    previous data sets and models. However, these can be easily converted to the new
    format by adding a prefix with the descriptor set name to the old descriptor
    tables. Feel free to contact us if you require assistance with this.
  • Due to some changes in rdkit-2023.9.6, the add_rdkit option for molecule tables
    might temporarily not work. This also affects the current ChemProp integration,
    which has not been adapted to 2.0.0 yet. To prevent these issues, QSPRpred now
    forces rdkit version rdkit-2023.9.5, but we will be working on resolving these.
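
The conversion described above (adding the descriptor set name as a prefix to old descriptor columns) might look roughly like the following sketch. The helper name add_descriptor_prefix, the "_" separator and the plain lists of column names are assumptions made for illustration; check the QSPRpred documentation for the exact naming scheme before converting real data.

```python
def add_descriptor_prefix(columns, set_name, sep="_"):
    """Return column names prefixed with a descriptor set name.

    Hypothetical helper: the separator and naming scheme are assumptions,
    not QSPRpred's confirmed format.
    """
    prefix = f"{set_name}{sep}"
    # Leave names that already carry the prefix untouched so the
    # conversion can be re-run safely.
    return [c if c.startswith(prefix) else prefix + c for c in columns]


# Two descriptor sets that both define a column called "MolWt".
old_rdkit = ["MolWt", "TPSA"]
old_custom = ["MolWt", "LogD"]

new_rdkit = add_descriptor_prefix(old_rdkit, "RDKit")
new_custom = add_descriptor_prefix(old_custom, "Custom")

# After prefixing, the two tables no longer share any column name.
collisions = set(new_rdkit) & set(new_custom)
```

The same idea explains why prefixed names reduce collisions: two sets can each keep their own "MolWt" column once the set name is part of the column name.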

New Features

  • Descriptors are now saved with prefixes to indicate the descriptor sets. This reduces
    the chance of name collisions when using multiple descriptor sets.
  • Added new methods to MoleculeTable and QSARDataset for more fine-grained control
    of clearing, dropping and restoring of descriptor sets calculated for the dataset.
    • dropDescriptorSets will drop descriptors associated with the given descriptor
      sets.
    • dropDescriptors will drop individual descriptors associated with the given
      descriptor sets and properties.
    • All drop actions are restorable with restoreDescriptorSets unless explicitly
      cleared from the data set with the clear parameter of dropDescriptorSets.
  • Added a proper API for parallelization backend selection and configuration (see
    documentation of ParallelGenerator and JITParallelGenerator for more information).
  • Clusters can now be added to a MoleculeTable with addClusters and retrieved with
    getClusters, similar to scaffolds.

Removed Features

  • removed support for PyBoost since the project was abandoned by the original developers and is no longer maintained

Version 3.0.2

28 Mar 14:13

Change Log

From v3.0.1 to v3.0.2

Fixes

  • Fixed a bug where an attached standardizer would be refit when calling
    QSPRModel.predictMols with use_applicability_domain=True.
  • Fixed a bug with use_applicability_domain=True in QSPRModel.predictMols
    where an error would be raised if there were invalid molecules in the input.
  • Fixed a bug where dataset type was not properly set to numeric
    in MlChemADWrapper.contains
  • Fixed a bug in QSPRDataset where property transformations were not applied.
  • Fixed random seed not set in FoldsFromDataSplit.iterFolds for ClusterSplit.
  • Fixed a bug where class ratios were shuffled in the RatioDistributionAlgorithm.

Changes

  • The module containing the sole model base class (QSPRModel) was renamed
    from models to model.
  • Restrictions on numpy versions were removed to allow for more flexibility in
    package installations. However, the BorutaFilter feature selection method does not
    function with numpy versions 1.24.0 and above. Therefore, this functionality now
    requires a downgrade to numpy version 1.23.0 or lower. This was reflected in the
    documentation and numpy itself outputs a reasonable error message if the version is
    incompatible.
  • Data type in MlChemADWrapper is now set to float64 by default, instead
    of float32.
  • Saving of models after hyperparameter optimization was improved to ensure parameters
    are always propagated to the underlying estimator as well.
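
If a script depends on BorutaFilter, a small guard can report the numpy restriction early. The sketch below is only an illustration: the function name boruta_supported is hypothetical, and it assumes that "below 1.24.0" is the effective cut-off.

```python
def boruta_supported(numpy_version: str) -> bool:
    """Return True if the given numpy version is below 1.24.0."""
    # Compare only the numeric major/minor parts, e.g. "1.23.5" -> (1, 23).
    major, minor = (int(part) for part in numpy_version.split(".")[:2])
    return (major, minor) < (1, 24)


# In a real script you could pass numpy.__version__ here instead.
for version in ("1.23.0", "1.23.5", "1.24.0", "2.0.1"):
    print(version, boruta_supported(version))
```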

New Features

  • The DataFrameDescriptorSet class was extended to allow more flexibility when joining
    custom descriptor sets.
  • Added the prepMols method to DescriptorSet to allow separated customization of
    molecule preparation before descriptor calculation.
  • The package can now be installed from the PyPI repository 🐍📦.
  • New argument (refit_optimal) was added to HyperparameterOptimization.optimize()
    method to make refitting of the model with optimal parameters easier.

Removed Features

None.

v3.0.1

28 Feb 07:42
468ee11

Change Log

From v3.0.0 to v3.0.1

Fixes

  • Fixed a bug in QSPRDataset where property transformations were not applied.

Changes

  • renamed PandasDataTable.transform to PandasDataTable.transformProperties
  • moved imputeProperties, dropEmptyProperties and hasProperty from MoleculeTable
    to PandasDataTable.
  • removed getProperties, addProperty, removeProperty, now use PandasDataTable
    methods directly.

New Features

Removed Features

v3.0.0

11 Feb 21:07
b28f843

Change Log

From v2.1.1 to v3.0.0

Fixes

  • Fixed random seeds to give reproducible results. Each dataset is initialized with a
    single random state (either from the constructor or a random number generator) which
    is used in all subsequent random operations. Each model is initialized with a single
    random state as well: it uses the random state from the dataset, unless it's overridden
    in the constructor. When a dataset is saved to a file so is its random state, which is
    used by the dataset when the dataset is reloaded.
  • fixed error with serialization of the DNNModel.params attribute, when no parameters
    are set.
  • Fix bug with saving predictions from classification model
    when ModelAssessor.useProba set to False.
  • Add missing implementation of QSPRDataset.removeProperty
  • Improved behavior of the Papyrus data source (does not attempt to connect to the
    internet if the data set already exists).
  • It is now possible to define new descriptor sets outside the package without errors.
  • Basic consistency of models is also checked in the unit test suite, including in
    the qsprpred.extra package.
  • Fixed a problem with feature standardizer being retrained on prediction data when a
    prediction from SMILES was invoked. This affected all versions of the package from
    v2.1.0 onwards.
  • Fixes to the fromMolTable method in various data set implementations, in particular
    in copying of the feature standardizer and other settings.
  • Fixed cluster split and --imputation not working in data_CLI.py.
  • Fixed a problem with ProteinDescriptorSet.getDescriptors returning descriptors in
    wrong order with Pandas <v2.2.0.
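
The random state behaviour described in the first fix of this list can be pictured with a toy example. The classes and names below (ToyDataset, resolve_model_seed) are stand-ins written for this sketch and are not part of the QSPRpred API; only the standard library is used.

```python
import random


class ToyDataset:
    """Stand-in for a dataset that owns a single random state."""

    def __init__(self, random_state=None):
        # Draw a seed when none is given, so it can still be stored and reused.
        if random_state is None:
            random_state = random.randrange(2**32)
        self.randomState = random_state

    def shuffled_indices(self, n):
        # A generator seeded from the stored state makes every call reproducible.
        rng = random.Random(self.randomState)
        indices = list(range(n))
        rng.shuffle(indices)
        return indices


def resolve_model_seed(dataset, override=None):
    """Use the dataset's random state unless an explicit override is given."""
    return dataset.randomState if override is None else override


dataset = ToyDataset(random_state=42)
# "Reloading" with the stored state reproduces the same shuffle.
reloaded = ToyDataset(random_state=dataset.randomState)
same_order = dataset.shuffled_indices(10) == reloaded.shuffled_indices(10)
```

Storing the seed rather than the generator object keeps the state trivially serializable, which is one simple way to get the save-and-reload behaviour the fix describes.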

Changes

  • The model is now independent of data sets. This means that the model no longer
    contains a reference to the data set it was trained on.
    • The fitAttached method was replaced with fitDataset, which takes the data set
      as an argument.
    • Assessors now also accept a data set as a second argument. Therefore, the same
      assessor can be used to assess different data sets with the same model settings.
    • The monitoring API was also slightly modified to reflect this change.
    • If a model requires initialization of some settings from data, this can be done in
      its initFromDataset method, which takes the data set as an argument. This method
      is called automatically before fitting, model assessment, and hyperparameter
      optimization.
  • The whole package was refactored to simplify certain commonly used imports. The
    tutorial code was adjusted to reflect that.
  • The jupyter notebooks in the tutorial now pass a random state to ensure consistent
    results.
  • The default parameter values for STFullyConnected have changed from n_epochs =
    1000 to n_epochs = 100, from neurons_h1 = 4000 to neurons_h1 = 256
    and neurons_hx = 1000 to neurons_hx = 128.
  • Rename HyperParameterOptimization to HyperparameterOptimization.
  • TargetProperty.fromList and TargetProperty.fromDict now accept both a string and
    a TargetTask as the task argument, without having to set the task_from_str
    argument, which is now deprecated.
  • Make EarlyStopping.mode flexible for QSPRModel.fitDataset.
  • save_params argument added to OptunaOptimization to save the best hyperparameters
    to the model (default: True).
  • We now use jsonpickle for object serialization, which is more flexible than the
    non-standard approach before, but it also means previous models will not be compatible
    with this version.
  • SklearnMetric was renamed to SklearnMetrics; it now also accepts a scikit-learn
    scorer name as input.
  • QSPRModel.fitDataset now accepts a save_model (default: True)
    and save_dataset (default: False) argument to save the model and dataset to a file
    after fitting.
  • Tutorials were completely rewritten and expanded. They can now be found in
    the tutorials folder instead of the tutorial folder.
  • MetricsPlot now supports multi-class and multi-task classification models.
  • CorrelationPlot now supports multi-task regression models.
  • The behaviour of QSPRDataset was changed with regards to target properties. It now
    remembers the original state of any target property and all changes are performed in
    place on the original property column (i.e. conversion to multi-class classification).
    This is to always maintain the same property name and always have the option to reset
    it to the raw original state (i.e. if we switch to regression or want to repeat a
    transformation).
  • The default log level for the package was changed from INFO to WARNING. A new
    tutorial was added to explain how to change the log level.
  • The RepeatsFilter argument year_name was renamed to time_col and the
    argument additional_cols was added.
  • The perc argument of BorutaPy can now be set from the CLI.
  • Descriptor calculators (previously used to aggregate and manage descriptor sets) were
    completely removed from the API and descriptor sets can now be added directly to the
    molecule tables.
  • The rdkit-like descriptor and fingerprint retrieval functions were removed from the
    API because they complicated implementation of customized descriptors.
  • The apply method was simplified and a new API was clearly defined for parallel
    processing of properties over data sets. To improve molecule processing,
    a processMols method was added to MoleculeTable.

New Features

  • The qsprpred.benchmarks module was added, which contains functions to easily
    benchmark models on datasets.
  • Most unit tests now have a variant that checks whether using a fixed random seed gives
    reproducible results.
  • The build pipeline now contains a check that the jupyter notebooks give the same
    results as ones that were observed before.
  • Added FitMonitor, AssessorMonitor, and HyperparameterOptimizationMonitor base
    classes to monitor the progress of fitting, assessing, and hyperparameter
    optimization, respectively.
  • Added BaseMonitor class to internally keep track of the progress of a fitting,
    assessing, or hyperparameter optimization process.
  • Added FileMonitor class to save the progress of a fitting, assessing, or
    hyperparameter optimization process to files.
  • Added WandBMonitor class to save the progress of a fitting, assessing, or
    hyperparameter optimization process to Weights & Biases.
  • Added NullMonitor class to ignore the progress of a fitting, assessing, or
    hyperparameter optimization process.
  • Added ListMonitor class to combine multiple monitors.
  • Cross-validation, testing, hyperparameter optimization and early-stopping were made
    more flexible by allowing custom splitting and fold generation strategies. A tutorial
    showcasing these features was created.
  • Added a reset method to QSPRDataset, which resets splits and loads all descriptors
    into the training set matrix again.
  • Added ConfusionMatrixPlot to plot confusion matrices.
  • Added the searchWithIndex, searchOnProperty, searchWithSMARTS and sample methods
    to MoleculeTable to facilitate more advanced sampling from data.
  • Assessors now have the split_multitask_scores flag that can be used to evaluate each
    task separately with single-task metrics.
  • MoleculeDataSets now has the smiles property to easily get smiles.
  • A Docker-based runner in testing/runner can now be used to test GPU-enabled features
    and run the full CI pipeline.
  • It is now possible to save PandasDataTables to a CSV file instead of the default
    pickle format (slower, but more human-readable).
  • New RegressionPlot class WilliamsPlot added to plot Williams plots.
  • Data sets can now be optionally stored in the csv format and not just as a pickle
    file. This makes it easier to debug and share data sets, but it is slower to load and
    save.
  • Added ApplicabilityDomain class to calculate applicability domain and filter
    outliers from test sets.

Removed Features

  • The Metric interface has been simplified in order to make it easier to implement
    custom metrics. The Metric interface now only requires the implementation of
    the __call__ method, which takes predictions and returns a float. The Metric
    interface no longer requires the implementation
    of needsDiscreteToScore, needsProbaToScore and supportsTask. However, this means
    the base functionality of checkMetricCompatibility, isClassificationMetric
    and isRegressionMetric are no longer available.
  • The default hyperparameter search space file is no longer available.
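
As a rough picture of the simplified Metric interface, a custom metric reduces to a callable that returns a float. The class below is a standalone sketch: it deliberately does not subclass QSPRpred's Metric, and the (y_true, y_pred) argument order is an assumption rather than the confirmed signature.

```python
import math


class RootMeanSquaredError:
    """Toy metric: a callable that maps predictions to a single float.

    Standalone illustration only; the (y_true, y_pred) argument order
    is an assumption.
    """

    def __call__(self, y_true, y_pred):
        if len(y_true) != len(y_pred):
            raise ValueError("y_true and y_pred must have the same length")
        squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
        return math.sqrt(sum(squared_errors) / len(squared_errors))


rmse = RootMeanSquaredError()
score = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because nothing beyond __call__ is needed, compatibility checks such as the removed isClassificationMetric have no place to live in this design, which matches the trade-off noted in the list above.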

v2.1.1

18 Jan 15:25

Change Log

From v2.1.0 to v2.1.1

Fixes

  • ⚠️ Important! ⚠️ Fixed bug in predictMols where the feature_standardizer was
    not being applied to the calculated features. This bug was introduced in v2.1.0.
    Models trained with v2.1.0 are compatible with v2.1.1; make sure to update
    QSPRpred to v2.1.1 to ensure that the feature_standardizer is applied when
    predicting on new molecules.

Changes

New Features

Removed Features

v2.1.0

21 Sep 07:23

Change Log

From v2.0.1 to v2.1.0.a2

Fixes

  • fixed error with serialization of the DataFrameDescriptorSet (#63)
  • Papyrus descriptors are not fetched by default anymore from the Papyrus adapter, which caused fetching of unnecessary data.
  • A potential bug in new version of pandas broke scaffold generation so a workaround was implemented.

Changes

  • QSPRModel.evaluate moved to a separate class EvaluationMethod in qsprpred.models.interfaces, with subclasses for cross-validation and making predictions on a test set in qsprpred.models.evaluation_methods (CrossValidation and EvaluateTestSetPerformance respectively).
  • QSPRModel attribute scoreFunc is removed.
  • 'qspr/models' is no longer added to the output path of QSPRModel.save, allowing for complete control over the output path.
  • SKlearnMetrics.supportsTask now uses a dictionary like dict[ModelTasks, list[str]] to map tasks to supported metric names. (#53)
  • GBMTRandomSplit and ScaffoldSplit now use the GBMTDataSplit to create balanced splits. RandomSplit still functions the same way as a completely random test split.
  • PCMSplit replaces StratifiedPerTarget and is compatible with RandomSplit, ScaffoldSplit and ClusterSplit.
  • DuplicatesFilter refactored to RepeatsFilter, as it also captures scenarios where triplicates/quadruplicates are found in the dataset. These scenarios are now also covered by the respective UnitTest.
  • The versioning scheme of development snapshots has changed from devX to alphaX/betaX, where X is an integer that increments with each release.
  • The following model classes have been renamed and moved:
    • models.models.QSPRsklearn > models.sklearn.SklearnModel
    • deep.models.QSPRDNN > extra.gpu.models.dnn.DNNModel
    • extra.models.pcm.ModelPCM > extra.models.pcm.PCMModel
    • extra.models.pcm.QSPRsklearnPCM > extra.models.pcm.SklearnPCMModel
  • The command line interface modules now use input and output file paths instead
    of automatically placing all files in a subfolder qspr, allowing for more
    control over the output and input paths.

New Features

  • GBMTDataSplit - parent class to create globally balanced splits with the gbmt-split package.
  • ClusterSplit - splits data based on clustering of molecular fingerprints (uses GBMTDataSplit).
  • Raise error if search space for optuna optimization is missing search space type annotation or if type not in list.
  • When installing package with pip, the commit hash and date of the installation is saved into qsprpred._version
  • HyperParameterOptimization classes now accept an evaluation_method argument, which is an instance of EvaluationMethod (see above). This allows for hyperparameter optimization to be performed on a test set, or on a cross-validation set. (#11)
  • HyperParameterOptimization now accepts score_aggregation argument, which is a function that takes a list of scores and returns a single score. This allows for the use of different aggregation functions, such as np.mean or np.median to combine scores from different folds. (#45)
  • A new tutorial adding_new_components.ipynb has been added to the tutorials folder, which demonstrates how to add a new model to QSPRpred.
  • A new function Metrics.checkMetricCompatibility has been added, which checks if a metric is compatible with a given task and a given prediction method (i.e. predict or predictProba).
  • In EvaluationMethod (see above), an attribute use_proba has been added, which determines whether the predict or predictProba method is used to make predictions (#56).
  • Add new descriptorset SmilesDesc to use the smiles strings as a descriptor.
  • New module early_stopping with classes EarlyStopping and EarlyStoppingMode has been added. This module allows for more control over early stopping in models that support it.
  • Refactoring of the test suite under qsprpred.data and improvement of temporary file handling (!114).
  • PyBoostModel - QSPRpred wrapper for py-boost models. Requires optional pyboost dependencies.
  • ChempropModel - QSPRpred wrapper for Chemprop models. Requires optional deep dependencies.
  • The data_CLI argument --log_transform (-lt) has been changed to --transform_data (-t), which now accepts a number of transformations to apply to the target data. Available transformations are log, log10, log2, sqrt, cbrt, exp, exp2, exp10, square, cube, reciprocal.
  • New data_CLI, model_CLI and predict_CLI argument --skip_backup (-sb) to skip the backup of the output files. WARNING: This will overwrite existing files.
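
The score_aggregation argument mentioned in this list is described as a function that takes a list of scores and returns a single score. The sketch below shows why the choice matters, using the standard library's mean and median in place of np.mean and np.median to stay dependency-free. The helper aggregate_scores and the fold scores are made up for illustration.

```python
from statistics import mean, median


def aggregate_scores(fold_scores, score_aggregation=mean):
    """Collapse per-fold scores into one value with the given function."""
    return score_aggregation(fold_scores)


# Hypothetical per-fold scores with one poorly performing fold.
fold_scores = [0.80, 0.82, 0.78, 0.81, 0.40]

mean_score = aggregate_scores(fold_scores)            # pulled down by the outlier
median_score = aggregate_scores(fold_scores, median)  # robust to the outlier
```

With these numbers the mean is noticeably lower than the median, so an optimizer that ranks parameter sets by the aggregated score may prefer different settings depending on which function is passed.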

Removed Features

  • StratifiedPerTarget is replaced by PCMSplit.

v2.0.1

06 Jul 16:06

Change Log

From v2.0.0 to v2.0.1

Fixes

  • The required Python version in pyproject.toml was updated to 3.10, as older versions of Python don't support the type hinting used in the code.
  • Corrected type hinting for QSPRModel.handleInvalidsInPredictions, which resulted in an error when importing the package in Google Colab.
  • The predictMols method returned random predictions in v2.0.0 due to unpatched shuffling code. This has now been fixed.

Changes

New Features

  • raise error if search space for optuna optimization is missing search space type annotation or if type not in list

v2.0.0

19 Jun 07:43

Change Log

From v1.3.1 to v2.0.0

Fixes

  • more robust error handling of invalid molecules in MoleculeTable
  • Not all scorers in supported_scoring were actually working in the multi-class case, the scorer support is now
    divided by single and multiclass support (moved to metrics.py, see also New Features).
  • Instead of all smiles, only invalid smiles are now printed to the log when they are removed.
  • problems with PaDEL descriptors and fingerprints on Linux were fixed
  • fixed serialization issues with DataFrameDescriptorSet and saving and loading of MSA for PCM descriptor calculations
  • the Papyrus adapter was fixed so that the quality and data set filtering options work properly (before only high quality Papyrus++ data was fetched no matter the options)
  • Previously, cross-validation splits might in some cases not have been shuffled during hyperparameter optimization and evaluation on cross-validation folds, which might have resulted in suboptimal cross-validation performance and bad choices of hyperparameters. A fix was made in b029e78.
  • score_func can now be set in QSPRModel.

Changes

  • Hyperparameter optimization was moved out of QSPRModel: QSPRModel.bayesOptimization and QSPRModel.gridSearch are replaced by OptunaOptimization and GridSearchOptimization in the new module qsprpred.models.param_optimzation, with a base class HyperParameterOptimization in qsprpred.models.interfaces.
  • ⚠️ Important! ⚠️ QSPRModel attribute model now called estimator, which is always an instance of alg, while alg may no longer be an instance but only a Type.
  • Converting input data for qsprpred.models.neural_network.Base to dataloaders is now executed in the fit and predict functions instead of in the qsprpred.deep.models.QSPRDNN class.
  • MoleculeTable now uses a custom index. When a MoleculeTable is created a new column (QSPRID) is added (overwritten if already present), which is then used as the index of the underlying data frame.
    • It is possible to override this with a custom index by passing index_cols to the MoleculeTable constructor. These columns will be then used as index or a multi-index if more than one column is passed.
    • Due to this change, scaffoldsplit now uses these IDs instead of unreliable SMILES strings (see documentation for the new API).
  • If there are invalid molecules in MoleculeTable, addDescriptors now fails by default. You can disable this by passing fail_on_invalid=False to the method.
  • To support multitask modelling, the representation of the target in the QSPRdataset has changed to a list of
    TargetProperty instances (see New Features). These can be automatically initialized from dictionaries in the
    QSPRdataset init.
  • A fill_value argument was also added to the predict_CLI script to allow for filling missing values in the
    prediction data set as well.
  • ⚠️ Important! ⚠️ setup.py and setup.cfg were substituted with pyproject.toml and MANIFEST.in. A lighter version of the package is now the default installation option!!!
    • Installation options for the optional dependencies are described in README.md
    • CI scripts were modified to test the package on the full version. See changes in .gitlab-ci.yml.
    • Features using the extra dependencies were moved to qsprpred.extra and qsprpred.deep subpackages. The structure of the subpackages is the same as of the main package, so you just need to remember to use qsprpred.extra or qsprpred.deep instead of just qsprpred in your imports if you were using these features from the main package before.
  • The way descriptors are stored in MoleculeTable was changed. They now reside in their own DescriptorTable instances that are linked to the original MoleculeTable.
    • This change was made to allow several types of descriptors to be calculated and used efficiently (facilitated by the DescriptorsCalculators interface)
    • Unfortunately, this change is not backwards compatible, so previously pickled MoleculeTable instances will not work with this version. There were also changes to how models handle multiple descriptor types, which also makes them incompatible with previous versions. However, this can be fixed by modifying the old JSON files as illustrated in commits 7d3f863 and 6564f02.
  • LowVarianceFilter now includes the boundary in the filtered features, e.g. if threshold is 0.1, features that
    have a variance of 0.1 will also be removed.
  • Added the ExtendedValenceSignature molecular descriptor based on Jean-Loup Faulon's work.
  • Removed the default parameter setting max_iter = 10000 for scikit-learn SVC and SVR.
  • added matthews_corrcoef to the supported metrics for binary classification.
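
The boundary change to LowVarianceFilter noted in this list amounts to using an inclusive comparison. The sketch below is a standalone illustration, not the library's implementation: the function name low_variance_features is hypothetical, and the use of population variance is an assumption.

```python
from statistics import pvariance


def low_variance_features(features, threshold):
    """Return names of features whose variance is <= threshold.

    `features` maps a feature name to its list of values. The inclusive
    comparison mirrors the boundary behaviour described in the change log.
    """
    return [name for name, values in features.items()
            if pvariance(values) <= threshold]


features = {
    "constant": [1.0, 1.0, 1.0, 1.0],   # variance 0.0
    "boundary": [0.0, 1.0, 0.0, 1.0],   # variance 0.25
    "variable": [0.0, 2.0, 0.0, 2.0],   # variance 1.0
}

# "boundary" sits exactly on the threshold and is removed as well.
removed = low_variance_features(features, threshold=0.25)
```

With a strict comparison (< instead of <=) the "boundary" feature would have been kept, which is the behaviour the change replaced.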

New Features

  • New feature split ManualSplit for splitting data by a user-defined column
  • The index of the MoleculeTable can now be used to relate cross-validation and test outputs to the original molecules. Therefore, the index is now also saved in the model training outputs.
  • the Papyrus.getData() method now accepts activity_types parameter to select a list of activity types to get.
  • Added the checkMols method to MoleculeTable to use for indication of invalid molecules in the data.
  • Support for Sklearn Multitask modelling
  • New abstract base class Metric, which allows for creating custom scorers.
  • Subclass SklearnMetric of the Metric class, this class wraps the sklearn metrics, to allow for checking
    the compatibility of each Sklearn scoring function with the QSPRSklearn model type.
  • New class TargetProperty. To allow for multitask modelling, a QSPRdataset has to have the option of multiple
    target properties. To support this, a target property is now defined separately from the dataset as a TargetProperty
    instance, which holds the information on name, TargetTask (see also Changes) and threshold of the property.
  • Support for protein descriptors and PCM modeling was added.
    • The PCMDataSet class was introduced that extends QSPRDataset and adds the addProteinDescriptors method, which can be used to calculate protein descriptors by linking information from the table with sequencing data.
  • Support for precalculated descriptors was added with addCustomDescriptors method of MoleculeTable.
    • It allows for adding precalculated descriptors to the MoleculeTable by linking the information from the table with external precalculated descriptors.
  • The tutorial was improved with more detailed sections on data preparation and PCM modelling added.
  • We agreed on and adopted a style guide for contributions to the package. This is described and exemplified in docs/style_guide.py. This is also supported by several development tools that were configured to check and automatically format the code. Instructions are included in the example file as well.
  • Style guide implemented. As a consequence, some classes, methods, and attributes were renamed to comply with the style guide. Some changes were:
    • Fingerprint functions from RDKit are now also implemented. For checking the available fingerprints in qsprpred, the user can now access AVAIL_FPS through from qsprpred.data.utils.descriptor_utils.fingerprints import AVAIL_FPS.
    • Fingerprint abstract base class now moved to qsprpred.data.utils.descriptor_utils.interfaces.
    • Instance attributes are now written in camelCase, and method arguments are snake_case. As an example of this, the old parameter descsets from MoleculeDescriptorsCalculator is now renamed as desc_sets, stored as the attribute self.descSets. Functions are still written in snake_case.
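
The naming convention from the last item can be shown in a few lines. The class and function below are toy stand-ins written for this sketch (they are not the real MoleculeDescriptorsCalculator); only the desc_sets argument and the descSets attribute come from the example given above.

```python
class ToyDescriptorsCalculator:
    """Illustrates the naming convention; not the real calculator class."""

    def __init__(self, desc_sets):
        # Method argument: snake_case. Instance attribute: camelCase.
        self.descSets = list(desc_sets)


def count_desc_sets(calculator):
    """Module-level functions keep snake_case names."""
    return len(calculator.descSets)


calculator = ToyDescriptorsCalculator(desc_sets=["fingerprints", "physchem"])
n_sets = count_desc_sets(calculator)
```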

v1.3.1

20 Mar 16:34

Change Log

From v1.3.0 to v1.3.1

Fixes

  • Fixed model weights not being re-initialized during DNN training.
  • Feature values are now converted to np.float32, and np.inf values are then converted to nan, in DescriptorsCalculator.__call__.

Changes

  • QSPRDataset.prepareDataset changed attributes from standardize and sanitize to only standardizer.
    • Accepted parameters are either chembl, old, or a function that reads and standardizes smiles.
    • None is now also supported to allow skipping smiles standardization.
    • SMILES standardization now runs in parallel, but if the input function is not picklable, it will just run on a single core.
  • QSPRModel.predictMols now accepts parameters smiles_standardizer, n_jobs and fill_value.
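
The fallback for non-picklable standardizer functions described in this list could be implemented along the following lines. This is a generic standard-library sketch, not QSPRpred's code: the names is_picklable and standardize_all are hypothetical, and stripping whitespace stands in for real SMILES standardization.

```python
import pickle
from concurrent.futures import ProcessPoolExecutor


def is_picklable(func):
    """Return True if `func` can be sent to worker processes."""
    try:
        pickle.dumps(func)
        return True
    except Exception:
        return False


def standardize_all(smiles_list, standardizer, n_jobs=2):
    """Apply `standardizer` to each SMILES, in parallel when possible."""
    if standardizer is None:
        # None skips standardization entirely.
        return list(smiles_list)
    if n_jobs > 1 and is_picklable(standardizer):
        with ProcessPoolExecutor(max_workers=n_jobs) as pool:
            return list(pool.map(standardizer, smiles_list))
    # Fallback: run on a single core in the current process.
    return [standardizer(smi) for smi in smiles_list]


# A lambda cannot be pickled, so this call takes the single-core path.
result = standardize_all([" CCO ", "c1ccccc1 "], lambda smi: smi.strip())
```

Functions defined at module level are usually picklable and would take the parallel branch, while lambdas and locally defined closures typically are not.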

v1.3.0

02 Mar 14:53

Change Log

From v1.2.0 to v1.3.0

Fixes

  • problems with PaDEL descriptors and fingerprints on Linux were fixed

Changes

  • QSPRModel metadata now contains two extra entries:
    1. model_class - the fully qualified class name of the model
    2. version - the version of QSPRPred used to save the model
    • this change is not compatible with older files, but you can manually add these two entries and it should work fine in the newer version

New Features

  • The QSPRModel.fromFile() method can now instantiate a model from a file directly without knowing the underlying model type. It simply uses the class path stored in the model metadata file now.
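
Instantiating an object from a class path stored in metadata, as described above, generally relies on a dynamic import. The sketch below shows the general technique with the standard library only. The metadata content is made up, a standard-library class stands in for a model class, and class_from_path is a hypothetical helper rather than the QSPRpred implementation.

```python
import importlib
import json


def class_from_path(class_path):
    """Resolve a fully qualified class name such as 'package.module.Class'."""
    module_name, _, class_name = class_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Hypothetical metadata; a standard-library class stands in for a model class.
metadata = json.loads(
    '{"model_class": "collections.OrderedDict", "version": "1.3.0"}'
)

model_cls = class_from_path(metadata["model_class"])
model = model_cls()  # instantiated without naming the class in the calling code
```

This also illustrates why older metadata files need the model_class entry added by hand: without it there is no class path to resolve.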