MAINT: Revert ChainedImputer (scikit-learn#11600)
jorisvandenbossche authored and glemaitre committed Jul 17, 2018
1 parent 2242c59 commit f819704
Showing 7 changed files with 16 additions and 883 deletions.
doc/modules/classes.rst: 3 changes (1 addition & 2 deletions)

@@ -655,9 +655,8 @@ Kernels:
:template: class.rst

impute.SimpleImputer
impute.ChainedImputer
impute.MissingIndicator

.. _kernel_approximation_ref:

:mod:`sklearn.kernel_approximation` Kernel Approximation
doc/modules/impute.rst: 71 changes (2 additions & 69 deletions)

@@ -16,22 +16,6 @@ values. However, this comes at the price of losing data which may be valuable
i.e., to infer them from the known part of the data. See the :ref:`glossary`
entry on imputation.


Univariate vs. Multivariate Imputation
======================================

One type of imputation algorithm is univariate, which imputes values in the i-th
feature dimension using only non-missing values in that feature dimension
(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
algorithms use the entire set of available feature dimensions to estimate the
missing values (e.g. :class:`impute.ChainedImputer`).


.. _single_imputer:

Univariate feature imputation
=============================

The :class:`SimpleImputer` class provides basic strategies for imputing missing
values. Missing values can be imputed with a provided constant value, or using
the statistics (mean, median or most frequent) of each column in which the
@@ -87,60 +71,9 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
['a' 'y']
['b' 'y']]
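
For comparison with the categorical output above, a minimal numeric sketch of
the ``'mean'`` strategy (the data here is a toy example invented for
illustration)::

    import numpy as np
    from sklearn.impute import SimpleImputer

    imp = SimpleImputer(strategy='mean')   # replaces np.nan with the column mean
    imp.fit([[1, 2], [np.nan, 3], [7, 6]])  # toy training data
    print(imp.transform([[np.nan, 2], [6, np.nan]]))
    # column 0 mean is (1 + 7) / 2 = 4.0; column 1 mean is (2 + 3 + 6) / 3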

.. _chained_imputer:


Multivariate feature imputation
===============================

A more sophisticated approach is to use the :class:`ChainedImputer` class, which
implements the imputation technique from MICE (Multivariate Imputation by
Chained Equations). MICE models each feature with missing values as a function of
other features, and uses that estimate for imputation. It does so in a round-robin
fashion: at each step, a feature column is designated as output `y` and the other
feature columns are treated as inputs `X`. A regressor is fit on `(X, y)` for known `y`.
Then, the regressor is used to predict the missing values of `y`. This is done
for each feature in a chained fashion, and the whole cycle is repeated for a
number of imputation rounds. Here is an example snippet::

>>> import numpy as np
>>> from sklearn.impute import ChainedImputer
>>> imp = ChainedImputer(n_imputations=10, random_state=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, np.nan]])
ChainedImputer(imputation_order='ascending', initial_strategy='mean',
max_value=None, min_value=None, missing_values=nan, n_burn_in=10,
n_imputations=10, n_nearest_features=None, predictor=None,
random_state=0, verbose=False)
>>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
>>> print(np.round(imp.transform(X_test)))
[[ 1. 2.]
[ 6. 4.]
[13. 6.]]

Both :class:`SimpleImputer` and :class:`ChainedImputer` can be used in a Pipeline
as a way to build a composite estimator that supports imputation.
See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.
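
For instance, a minimal sketch of such a composite estimator (the downstream
regressor and the toy data are arbitrary choices for illustration)::

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LinearRegression

    # the imputer fills in np.nan before the regressor ever sees the data;
    # any estimator could stand in for LinearRegression here
    model = make_pipeline(SimpleImputer(strategy='median'), LinearRegression())
    model.fit([[1, 2], [np.nan, 3], [7, np.nan]], [0, 1, 2])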

.. _multiple_imputation:

Multiple vs. Single Imputation
==============================

In the statistics community, it is common practice to perform multiple
imputations, generating, for example, 10 separate imputations for a single
feature matrix. Each of these 10 imputations is then put through the
subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
classification). The 10 final analysis results (e.g. held-out validation
errors) allow the data scientist to gain an understanding of the uncertainty
inherent in the missing values. This practice is called multiple imputation.
As implemented, the :class:`ChainedImputer` class generates a single (averaged)
imputation for each missing value, because this is the most common use case in
machine learning applications. However, it can also be used for multiple
imputation by applying it repeatedly to the same dataset with different random
seeds and the ``n_imputations`` parameter set to 1.

Note that a call to the ``transform`` method of :class:`ChainedImputer` is not
allowed to change the number of samples. Therefore multiple imputations cannot be
achieved by a single call to ``transform``.
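
A minimal sketch of that repeated-application pattern (the number of draws and
the toy matrix are invented for illustration)::

    import numpy as np
    from sklearn.impute import ChainedImputer

    X = np.array([[1, 2], [np.nan, 3], [7, np.nan]])  # toy data
    # one independent imputation per seed; the downstream analysis is then
    # run once per draw and the results compared across draws
    draws = [ChainedImputer(n_imputations=1, random_state=seed).fit_transform(X)
             for seed in range(10)]
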
:class:`SimpleImputer` can be used in a Pipeline as a way to build a composite
estimator that supports imputation. See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.

.. _missing_indicator:

doc/whats_new/v0.20.rst: 5 changes (0 additions & 5 deletions)

@@ -150,11 +150,6 @@ Preprocessing
- Added :class:`MissingIndicator` which generates a binary indicator for
missing values. :issue:`8075` by :user:`Maniteja Nandana <maniteja123>` and
:user:`Guillaume Lemaitre <glemaitre>`.

- Added :class:`impute.ChainedImputer`, which is a strategy for imputing missing
values by modeling each feature with missing values as a function of
other features in a round-robin fashion. :issue:`8478` by
:user:`Sergey Feldman <sergeyf>`.

- :class:`linear_model.SGDClassifier`, :class:`linear_model.SGDRegressor`,
:class:`linear_model.PassiveAggressiveClassifier`,
examples/plot_missing_values.py: 28 changes (9 additions & 19 deletions)

@@ -3,30 +3,29 @@
Imputing missing values before building an estimator
====================================================
This example shows that imputing the missing values can give better
results than discarding the samples containing any missing value.
Imputing does not always improve the predictions, so please check via
cross-validation. Sometimes dropping rows or using marker values is
more effective.
Missing values can be replaced by the mean, the median or the most frequent
value using the basic :class:`sklearn.impute.SimpleImputer`.
The median is a more robust estimator for data with high-magnitude variables
(a 'long tail') which could otherwise dominate the results.
Another option is :class:`sklearn.impute.ChainedImputer`. This uses
round-robin linear regression, treating every variable as an output in
turn. The version implemented assumes Gaussian (output) variables. If your
features are obviously non-Normal, consider transforming them to look more
Normal so as to improve performance.
In addition to using an imputation method, we can also keep an indicator of
which values were missing using :class:`sklearn.impute.MissingIndicator`, since
the missingness pattern itself might carry some information.
"""

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_diabetes
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline, make_union
from sklearn.impute import SimpleImputer, ChainedImputer, MissingIndicator
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
@@ -71,18 +70,10 @@ def get_results(dataset):
    mean_impute_scores = cross_val_score(estimator, X_missing, y_missing,
                                         scoring='neg_mean_squared_error')

    # Estimate the score after chained imputation of the missing values
    estimator = make_pipeline(
        make_union(ChainedImputer(missing_values=0, random_state=0),
                   MissingIndicator(missing_values=0)),
        RandomForestRegressor(random_state=0, n_estimators=100))
    chained_impute_scores = cross_val_score(estimator, X_missing, y_missing,
                                            scoring='neg_mean_squared_error')

    return ((full_scores.mean(), full_scores.std()),
            (zero_impute_scores.mean(), zero_impute_scores.std()),
            (mean_impute_scores.mean(), mean_impute_scores.std()),
            (chained_impute_scores.mean(), chained_impute_scores.std()))
            (mean_impute_scores.mean(), mean_impute_scores.std()))


results_diabetes = np.array(get_results(load_diabetes()))
@@ -98,8 +89,7 @@ def get_results(dataset):

x_labels = ['Full data',
            'Zero imputation',
            'Mean Imputation',
            'Chained Imputation']
            'Mean Imputation']
colors = ['r', 'g', 'b', 'orange']

# plot diabetes results
[diffs for the remaining 3 changed files not loaded]