FEAT SLEP006: metadata routing infrastructure (scikit-learn#24027)
Co-authored-by: Christian Lorentzen <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Benjamin Bossan <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Omar Salman <[email protected]>
6 people committed Jun 2, 2023
1 parent 849c2f1 commit 62671a7
Showing 28 changed files with 4,396 additions and 155 deletions.
7 changes: 7 additions & 0 deletions doc/conftest.py
@@ -144,6 +144,13 @@ def pytest_runtest_setup(item):
setup_preprocessing()
elif fname.endswith("statistical_inference/unsupervised_learning.rst"):
setup_unsupervised_learning()
elif fname.endswith("metadata_routing.rst"):
# TODO: remove this once implemented
# Skip metadata routing because it is not fully implemented yet
raise SkipTest(
"Skipping doctest for metadata_routing.rst because it "
"is not fully implemented yet"
)

rst_files_requiring_matplotlib = [
"modules/partial_dependence.rst",
231 changes: 231 additions & 0 deletions doc/metadata_routing.rst
@@ -0,0 +1,231 @@

.. _metadata_routing:

.. currentmodule:: sklearn

.. TODO: update doc/conftest.py once document is updated and examples run.

Metadata Routing
================

.. note::
The Metadata Routing API is experimental and is not yet implemented for many
estimators. It may change without the usual deprecation cycle. This feature is
disabled by default; you can enable it by setting the
``enable_metadata_routing`` flag to ``True``:

>>> import sklearn
>>> sklearn.set_config(enable_metadata_routing=True)

This guide demonstrates how metadata such as ``sample_weight`` can be routed
and passed along to estimators, scorers, and CV splitters through
meta-estimators such as :class:`~pipeline.Pipeline` and
:class:`~model_selection.GridSearchCV`. In order to pass metadata to a method
such as ``fit`` or ``score``, the object consuming the metadata must *request*
it. For estimators and splitters, this is done via ``set_*_request`` methods,
e.g. ``set_fit_request(...)``, and for scorers this is done via the
``set_score_request`` method. For grouped splitters such as
:class:`~model_selection.GroupKFold`, a ``groups`` parameter is requested by
default. This is best demonstrated by the following examples.

If you are developing a scikit-learn compatible estimator or meta-estimator,
you can check our related developer guide:
:ref:`sphx_glr_auto_examples_miscellaneous_plot_metadata_routing.py`.

.. note::
The methods and requirements introduced in this document are only relevant if
you want to pass metadata (e.g. ``sample_weight``) to a method. If you're only
passing ``X`` and ``y`` and no other parameter / metadata to methods such as
``fit``, ``transform``, etc., then you don't need to set anything.

Usage Examples
**************
Here we present a few examples to show different common use cases. The examples
in this section require the following imports and data::

>>> import numpy as np
>>> from sklearn.metrics import make_scorer, accuracy_score
>>> from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
>>> from sklearn.model_selection import cross_validate, GridSearchCV, GroupKFold
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.pipeline import make_pipeline
>>> n_samples, n_features = 100, 4
>>> rng = np.random.RandomState(42)
>>> X = rng.rand(n_samples, n_features)
>>> y = rng.randint(0, 2, size=n_samples)
>>> my_groups = rng.randint(0, 10, size=n_samples)
>>> my_weights = rng.rand(n_samples)
>>> my_other_weights = rng.rand(n_samples)

Weighted scoring and fitting
----------------------------

Here :class:`~model_selection.GroupKFold` requests ``groups`` by default. However, we
need to explicitly request weights for our scorer and the internal cross validation of
:class:`~linear_model.LogisticRegressionCV`. Both of these *consumers* know how to use
metadata called ``sample_weight``::

>>> weighted_acc = make_scorer(accuracy_score).set_score_request(
... sample_weight=True
... )
>>> lr = LogisticRegressionCV(
... cv=GroupKFold(), scoring=weighted_acc,
... ).set_fit_request(sample_weight=True)
>>> cv_results = cross_validate(
... lr,
... X,
... y,
... props={"sample_weight": my_weights, "groups": my_groups},
... cv=GroupKFold(),
... scoring=weighted_acc,
... )

Note that in this example, ``my_weights`` is passed to both the scorer and
:class:`~linear_model.LogisticRegressionCV`.

Error handling: if ``props={"sample_weigh": my_weights, ...}`` were passed
(note the typo), :func:`~model_selection.cross_validate` would raise an error,
since ``sample_weigh`` was not requested by any of its underlying objects.
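
A rough sketch of that failure mode (illustrative only: like the other examples
in this document it assumes full routing support in
:func:`~model_selection.cross_validate`, and the exact exception type and
message may differ)::

>>> try:
... cross_validate(
... lr,
... X,
... y,
... props={"sample_weigh": my_weights, "groups": my_groups},
... cv=GroupKFold(),
... scoring=weighted_acc,
... )
... except Exception as e:
... # the error indicates that "sample_weigh" was not requested by any of
... # the underlying objects
... pass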

Weighted scoring and unweighted fitting
---------------------------------------

When passing metadata such as ``sample_weight`` around, all scikit-learn
estimators require weights to be either explicitly requested or not requested
(i.e. ``True`` or ``False``) when used in another router such as a
:class:`~pipeline.Pipeline` or a ``*GridSearchCV``. To perform an unweighted
fit, we need to configure :class:`~linear_model.LogisticRegressionCV` to not
request sample weights, so that :func:`~model_selection.cross_validate` does
not pass the weights along::

>>> weighted_acc = make_scorer(accuracy_score).set_score_request(
... sample_weight=True
... )
>>> lr = LogisticRegressionCV(
... cv=GroupKFold(), scoring=weighted_acc,
... ).set_fit_request(sample_weight=False)
>>> cv_results = cross_validate(
... lr,
... X,
... y,
... cv=GroupKFold(),
... props={"sample_weight": my_weights, "groups": my_groups},
... scoring=weighted_acc,
... )

If :meth:`linear_model.LogisticRegressionCV.set_fit_request` is not called,
:func:`~model_selection.cross_validate` raises an error because
``sample_weight`` is passed in but :class:`~linear_model.LogisticRegressionCV`
is not explicitly configured to recognize the weights.
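
A sketch of that scenario (again illustrative only; ``lr_norequest`` is a
hypothetical name used here, and the exact error type and message are not
guaranteed)::

>>> lr_norequest = LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc)
>>> try:
... cross_validate(
... lr_norequest,
... X,
... y,
... cv=GroupKFold(),
... props={"sample_weight": my_weights, "groups": my_groups},
... scoring=weighted_acc,
... )
... except Exception as e:
... # the error says that sample_weight is passed but not explicitly
... # requested (or not requested) for LogisticRegressionCV.fit
... pass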

Unweighted feature selection
----------------------------

Setting request values for metadata is only required if the object, e.g. an
estimator or a scorer, is a consumer of that metadata. Unlike
:class:`~linear_model.LogisticRegressionCV`, :class:`~feature_selection.SelectKBest`
doesn't consume weights, therefore no request value for ``sample_weight`` needs
to be set on its instance and ``sample_weight`` is not routed to it::

>>> weighted_acc = make_scorer(accuracy_score).set_score_request(
... sample_weight=True
... )
>>> lr = LogisticRegressionCV(
... cv=GroupKFold(), scoring=weighted_acc,
... ).set_fit_request(sample_weight=True)
>>> sel = SelectKBest(k=2)
>>> pipe = make_pipeline(sel, lr)
>>> cv_results = cross_validate(
... pipe,
... X,
... y,
... cv=GroupKFold(),
... props={"sample_weight": my_weights, "groups": my_groups},
... scoring=weighted_acc,
... )

Advanced: Different scoring and fitting weights
-----------------------------------------------

Despite :func:`~metrics.make_scorer` and
:class:`~linear_model.LogisticRegressionCV` both expecting the key
``sample_weight``, we can use aliases to pass different weights to different
consumers. In this example, we pass ``scoring_weight`` to the scorer, and
``fitting_weight`` to :class:`~linear_model.LogisticRegressionCV`::

>>> weighted_acc = make_scorer(accuracy_score).set_score_request(
... sample_weight="scoring_weight"
... )
>>> lr = LogisticRegressionCV(
... cv=GroupKFold(), scoring=weighted_acc,
... ).set_fit_request(sample_weight="fitting_weight")
>>> cv_results = cross_validate(
... lr,
... X,
... y,
... cv=GroupKFold(),
... props={
... "scoring_weight": my_weights,
... "fitting_weight": my_other_weights,
... "groups": my_groups,
... },
... scoring=weighted_acc,
... )

API Interface
*************

A *consumer* is an object (estimator, meta-estimator, scorer, splitter) which
accepts and uses some metadata in at least one of its methods (``fit``,
``predict``, ``inverse_transform``, ``transform``, ``score``, ``split``).
Meta-estimators which only forward the metadata to other objects (the child
estimator, scorers, or splitters) and don't use the metadata themselves are not
consumers. (Meta-)Estimators which route metadata to other objects are
*routers*. A (meta-)estimator can be a consumer and a router at the same time.
(Meta-)Estimators and splitters expose a ``set_*_request`` method for each
method that accepts at least one metadata parameter. For instance, if an estimator
supports ``sample_weight`` in ``fit`` and ``score``, it exposes
``estimator.set_fit_request(sample_weight=value)`` and
``estimator.set_score_request(sample_weight=value)``. Here ``value`` can be:

- ``True``: the method requests ``sample_weight``. This means that if the
metadata is provided, it will be used; otherwise no error is raised.
- ``False``: the method does not request ``sample_weight``.
- ``None``: the router will raise an error if ``sample_weight`` is passed. This
is the default value in almost all cases when an object is instantiated, and it
ensures the user sets the metadata requests explicitly when metadata is passed.
The only exceptions are the ``Group*Fold`` splitters.
- ``"param_name"``: if this estimator is used in a meta-estimator, the
meta-estimator should forward ``"param_name"`` as ``sample_weight`` to this
estimator. This means that the mapping between the metadata required by the
object (e.g. ``sample_weight``) and what is provided by the user (e.g.
``my_weights``) is done at the router level, not by the object, e.g. the
estimator, itself.

Metadata are requested in the same way for scorers using ``set_score_request``.
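
As a rough, illustrative sketch, the current request values of an object can be
inspected via ``get_metadata_routing`` (the ``requests`` attribute shown below
is an assumption about the returned object and may differ; the output is shown
for illustration only)::

>>> est = LogisticRegression().set_fit_request(sample_weight=True)
>>> routing = est.get_metadata_routing()
>>> routing.fit.requests # doctest: +SKIP
{'sample_weight': True}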

If metadata, e.g. ``sample_weight``, is passed by the user, the metadata
request for all objects which can potentially consume ``sample_weight`` must be
set by the user; otherwise the router object raises an error. For example, the
following code raises an error, since it hasn't been explicitly specified
whether ``sample_weight`` should be passed to the estimator's scorer or not::

>>> param_grid = {"C": [0.1, 1]}
>>> lr = LogisticRegression().set_fit_request(sample_weight=True)
>>> try:
... GridSearchCV(
... estimator=lr, param_grid=param_grid
... ).fit(X, y, sample_weight=my_weights)
... except ValueError as e:
... print(e)
[sample_weight] are passed but are not explicitly set as requested or not for
LogisticRegression.score

The issue can be fixed by explicitly setting the request value::

>>> lr = LogisticRegression().set_fit_request(
... sample_weight=True
... ).set_score_request(sample_weight=False)
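
With both request values set, the same search goes through; as with the other
examples in this document, this is a sketch that assumes full routing support
in :class:`~model_selection.GridSearchCV`::

>>> _ = GridSearchCV(
... estimator=lr, param_grid=param_grid
... ).fit(X, y, sample_weight=my_weights)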
6 changes: 6 additions & 0 deletions doc/modules/classes.rst
@@ -34,6 +34,7 @@ Base classes
base.DensityMixin
base.RegressorMixin
base.TransformerMixin
base.MetaEstimatorMixin
base.OneToOneFeatureMixin
base.ClassNamePrefixFeaturesOutMixin
feature_selection.SelectorMixin
@@ -1652,6 +1653,11 @@ Plotting
utils.validation.check_symmetric
utils.validation.column_or_1d
utils.validation.has_fit_parameter
utils.metadata_routing.get_routing_for_object
utils.metadata_routing.MetadataRouter
utils.metadata_routing.MetadataRequest
utils.metadata_routing.MethodMapping
utils.metadata_routing.process_routing

Specific utilities to list scikit-learn components:

8 changes: 8 additions & 0 deletions doc/modules/model_evaluation.rst
@@ -222,6 +222,14 @@ the following two rules:
Again, by convention higher numbers are better, so if your scorer
returns loss, that value should be negated.

- Advanced: If it requires extra metadata to be passed to it, it should expose
a ``get_metadata_routing`` method returning the requested metadata. The user
should be able to set the requested metadata via a ``set_score_request``
method. Please see :ref:`User Guide <metadata_routing>` and :ref:`Developer
Guide <sphx_glr_auto_examples_miscellaneous_plot_metadata_routing.py>` for
more details.


.. note:: **Using custom scorers in functions where n_jobs > 1**

While defining the custom scoring function alongside the calling function
9 changes: 9 additions & 0 deletions doc/user_guide.rst
@@ -31,3 +31,12 @@ User Guide
model_persistence.rst
common_pitfalls.rst
dispatching.rst

Under Development
-----------------

.. toctree::
:numbered:
:maxdepth: 1

metadata_routing.rst
13 changes: 13 additions & 0 deletions doc/whats_new/v1.3.rst
@@ -145,6 +145,19 @@ Changes impacting all modules
:pr:`26082` by :user:`Jérémie du Boisberranger <jeremiedbb>` and
:user:`Olivier Grisel <ogrisel>`.

Experimental / Under Development
--------------------------------

- |MajorFeature| :ref:`Metadata routing <metadata_routing>`'s related base
methods are included in this release. This feature is only available via the
`enable_metadata_routing` feature flag which can be enabled using
:func:`sklearn.set_config` and :func:`sklearn.config_context`. For now this
feature is mostly useful for third party developers to prepare their code
base for metadata routing, and we strongly recommend that they also hide it
behind the same feature flag, rather than having it enabled by default.
:pr:`24027` by `Adrin Jalali`_, :user:`Benjamin Bossan <BenjaminBossan>`, and
:user:`Omar Salman <OmarManzoor>`.

Changelog
---------
