MAINT: Revert ChainedImputer (scikit-learn#11600)
jorisvandenbossche authored and glemaitre committed Jul 17, 2018
1 parent 2242c59 commit f819704
Showing 7 changed files with 16 additions and 883 deletions.
doc/modules/classes.rst: 3 changes (1 addition & 2 deletions)

@@ -655,9 +655,8 @@ Kernels:
:template: class.rst

impute.SimpleImputer
impute.ChainedImputer
impute.MissingIndicator

.. _kernel_approximation_ref:

:mod:`sklearn.kernel_approximation` Kernel Approximation
doc/modules/impute.rst: 71 changes (2 additions & 69 deletions)

@@ -16,22 +16,6 @@ values. However, this comes at the price of losing data which may be valuable
i.e., to infer them from the known part of the data. See the :ref:`glossary`
entry on imputation.


Univariate vs. Multivariate Imputation
======================================

One type of imputation algorithm is univariate, which imputes values in the i-th
feature dimension using only non-missing values in that feature dimension
(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
algorithms use the entire set of available feature dimensions to estimate the
missing values (e.g. :class:`impute.ChainedImputer`).


.. _single_imputer:

Univariate feature imputation
=============================

The :class:`SimpleImputer` class provides basic strategies for imputing missing
values. Missing values can be imputed with a provided constant value, or using
the statistics (mean, median or most frequent) of each column in which the
@@ -87,60 +71,9 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
['a' 'y']
['b' 'y']]
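
For comparison with the categorical output above, a minimal numeric sketch of
the ``'mean'`` strategy (the data here is a toy example invented for
illustration)::

    import numpy as np
    from sklearn.impute import SimpleImputer

    imp = SimpleImputer(strategy='mean')   # replaces np.nan with the column mean
    imp.fit([[1, 2], [np.nan, 3], [7, 6]])  # toy training data
    print(imp.transform([[np.nan, 2], [6, np.nan]]))
    # column 0 mean is (1 + 7) / 2 = 4.0; column 1 mean is (2 + 3 + 6) / 3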

.. _chained_imputer:


Multivariate feature imputation
===============================

A more sophisticated approach is to use the :class:`ChainedImputer` class, which
implements the imputation technique from MICE (Multivariate Imputation by
Chained Equations). MICE models each feature with missing values as a function of
other features, and uses that estimate for imputation. It does so in a round-robin
fashion: at each step, a feature column is designated as output `y` and the other
feature columns are treated as inputs `X`. A regressor is fit on `(X, y)` for known `y`.
Then, the regressor is used to predict the missing values of `y`. This is done
for each feature in a chained fashion, and the whole cycle is repeated for a
number of imputation rounds. Here is an example snippet::

>>> import numpy as np
>>> from sklearn.impute import ChainedImputer
>>> imp = ChainedImputer(n_imputations=10, random_state=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, np.nan]])
ChainedImputer(imputation_order='ascending', initial_strategy='mean',
max_value=None, min_value=None, missing_values=nan, n_burn_in=10,
n_imputations=10, n_nearest_features=None, predictor=None,
random_state=0, verbose=False)
>>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
>>> print(np.round(imp.transform(X_test)))
[[ 1. 2.]
[ 6. 4.]
[13. 6.]]

Both :class:`SimpleImputer` and :class:`ChainedImputer` can be used in a Pipeline
as a way to build a composite estimator that supports imputation.
See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.
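
For instance, a minimal sketch of such a composite estimator (the downstream
regressor and the toy data are arbitrary choices for illustration)::

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LinearRegression

    # the imputer fills in np.nan before the regressor ever sees the data;
    # any estimator could stand in for LinearRegression here
    model = make_pipeline(SimpleImputer(strategy='median'), LinearRegression())
    model.fit([[1, 2], [np.nan, 3], [7, np.nan]], [0, 1, 2])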

.. _multiple_imputation:

Multiple vs. Single Imputation
==============================

In the statistics community, it is common practice to perform multiple
imputations, generating, for example, 10 separate imputations for a single
feature matrix. Each of these 10 imputations is then put through the
subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
classification). The 10 final analysis results (e.g. held-out validation
errors) allow the data scientist to gain an understanding of the uncertainty
inherent in the missing values. This practice is called multiple imputation.
As implemented, the :class:`ChainedImputer` class generates a single (averaged)
imputation for each missing value, because this is the most common use case in
machine learning applications. However, it can also be used for multiple
imputation by applying it repeatedly to the same dataset with different random
seeds and the ``n_imputations`` parameter set to 1.

Note that a call to the ``transform`` method of :class:`ChainedImputer` is not
allowed to change the number of samples. Therefore multiple imputations cannot be
achieved by a single call to ``transform``.
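
A minimal sketch of that repeated-application pattern (the number of draws and
the toy matrix are invented for illustration)::

    import numpy as np
    from sklearn.impute import ChainedImputer

    X = np.array([[1, 2], [np.nan, 3], [7, np.nan]])  # toy data
    # one independent imputation per seed; the downstream analysis is then
    # run once per draw and the results compared across draws
    draws = [ChainedImputer(n_imputations=1, random_state=seed).fit_transform(X)
             for seed in range(10)]
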
:class:`SimpleImputer` can be used in a Pipeline as a way to build a composite
estimator that supports imputation. See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.

.. _missing_indicator:

doc/whats_new/v0.20.rst: 5 changes (0 additions & 5 deletions)

@@ -150,11 +150,6 @@ Preprocessing
- Added :class:`MissingIndicator` which generates a binary indicator for
missing values. :issue:`8075` by :user:`Maniteja Nandana <maniteja123>` and
:user:`Guillaume Lemaitre <glemaitre>`.

- Added :class:`impute.ChainedImputer`, which is a strategy for imputing missing
values by modeling each feature with missing values as a function of
other features in a round-robin fashion. :issue:`8478` by
:user:`Sergey Feldman <sergeyf>`.

- :class:`linear_model.SGDClassifier`, :class:`linear_model.SGDRegressor`,
:class:`linear_model.PassiveAggressiveClassifier`,
examples/plot_missing_values.py: 28 changes (9 additions & 19 deletions)

@@ -3,30 +3,29 @@
Imputing missing values before building an estimator
====================================================
This example shows that imputing the missing values can give better
results than discarding the samples containing any missing value.
Imputing does not always improve the predictions, so please check via
cross-validation. Sometimes dropping rows or using marker values is
more effective.
Missing values can be replaced by the mean, the median or the most frequent
value using the basic :class:`sklearn.impute.SimpleImputer`.
The median is a more robust estimator for data with high-magnitude variables
(a 'long tail') which could otherwise dominate the results.
Another option is :class:`sklearn.impute.ChainedImputer`. This uses
round-robin linear regression, treating every variable as an output in
turn. The version implemented assumes Gaussian (output) variables. If your
features are obviously non-Normal, consider transforming them to look more
Normal so as to improve performance.
In addition to using an imputation method, we can also keep an indicator of
which values were missing using :class:`sklearn.impute.MissingIndicator`, since
the missingness pattern itself might carry some information.
"""

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_diabetes
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline, make_union
from sklearn.impute import SimpleImputer, ChainedImputer, MissingIndicator
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
@@ -71,18 +70,10 @@ def get_results(dataset):
    mean_impute_scores = cross_val_score(estimator, X_missing, y_missing,
                                         scoring='neg_mean_squared_error')

    # Estimate the score after chained imputation of the missing values
    estimator = make_pipeline(
        make_union(ChainedImputer(missing_values=0, random_state=0),
                   MissingIndicator(missing_values=0)),
        RandomForestRegressor(random_state=0, n_estimators=100))
    chained_impute_scores = cross_val_score(estimator, X_missing, y_missing,
                                            scoring='neg_mean_squared_error')

    return ((full_scores.mean(), full_scores.std()),
            (zero_impute_scores.mean(), zero_impute_scores.std()),
            (mean_impute_scores.mean(), mean_impute_scores.std()),
            (chained_impute_scores.mean(), chained_impute_scores.std()))
            (mean_impute_scores.mean(), mean_impute_scores.std()))


results_diabetes = np.array(get_results(load_diabetes()))
@@ -98,8 +89,7 @@ def get_results(dataset):

x_labels = ['Full data',
            'Zero imputation',
            'Mean Imputation',
            'Chained Imputation']
            'Mean Imputation']
colors = ['r', 'g', 'b', 'orange']

# plot diabetes results
[diffs for the remaining 3 changed files not loaded]