Deprecate min_samples_leaf and min_weight_fraction_leaf (scikit-learn…
jnothman authored and rth committed Aug 23, 2018
1 parent ac41ccf commit 2fe58e5
Showing 13 changed files with 327 additions and 164 deletions.
6 changes: 2 additions & 4 deletions doc/modules/ensemble.rst
@@ -218,7 +218,7 @@ setting ``oob_score=True``.
The size of the model with the default parameters is :math:`O( M * N * log (N) )`,
where :math:`M` is the number of trees and :math:`N` is the number of samples.
In order to reduce the size of the model, you can change these parameters:
-``min_samples_split``, ``min_samples_leaf``, ``max_leaf_nodes`` and ``max_depth``.
+``min_samples_split``, ``max_leaf_nodes`` and ``max_depth``.

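As a rough sketch of the advice above (illustrative, not part of the diff): constraining tree growth with these parameters shrinks the serialized model. The dataset, parameter values, and use of pickle to measure size are assumptions for demonstration.

import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Unconstrained forest: trees grow until leaves are pure.
big = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Constrained forest: the parameters named above limit tree size.
small = RandomForestClassifier(
    n_estimators=100,
    min_samples_split=10,
    max_leaf_nodes=64,
    max_depth=10,
    random_state=0,
).fit(X, y)

# Compare serialized sizes; the constrained forest is typically much smaller.
print(len(pickle.dumps(big)), len(pickle.dumps(small)))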
Parallelization
---------------
@@ -393,9 +393,7 @@ The number of weak learners is controlled by the parameter ``n_estimators``. The
the final combination. By default, weak learners are decision stumps. Different
weak learners can be specified through the ``base_estimator`` parameter.
The main parameters to tune to obtain good results are ``n_estimators`` and
-the complexity of the base estimators (e.g., its depth ``max_depth`` or
-minimum required number of samples at a leaf ``min_samples_leaf`` in case of
-decision trees).
+the complexity of the base estimators (e.g., its depth ``max_depth``).

.. topic:: Examples:

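A hedged sketch of the AdaBoost tuning advice above: search jointly over ``n_estimators`` and the base estimator's ``max_depth``. The data, grid values, and use of ``GridSearchCV`` are illustrative assumptions; ``base_estimator`` is the parameter name in this release (later versions rename it to ``estimator``).

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# Tune the number of weak learners and the complexity (depth) of each one.
param_grid = {
    "n_estimators": [50, 200],
    "base_estimator__max_depth": [1, 2, 3],  # 1 = decision stumps
}
search = GridSearchCV(
    AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), random_state=0),
    param_grid,
    cv=3,
)
search.fit(X, y)
print(search.best_params_)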
25 changes: 7 additions & 18 deletions doc/modules/tree.rst
@@ -330,29 +330,18 @@ Tips on practical use
for each additional level the tree grows to. Use ``max_depth`` to control
the size of the tree to prevent overfitting.

-* Use ``min_samples_split`` or ``min_samples_leaf`` to control the number of
-  samples at a leaf node. A very small number will usually mean the tree
-  will overfit, whereas a large number will prevent the tree from learning
-  the data. Try ``min_samples_leaf=5`` as an initial value. If the sample size
-  varies greatly, a float number can be used as percentage in these two parameters.
-  The main difference between the two is that ``min_samples_leaf`` guarantees
-  a minimum number of samples in a leaf, while ``min_samples_split`` can
-  create arbitrary small leaves, though ``min_samples_split`` is more common
-  in the literature.
+* Use ``min_samples_split`` to control the number of samples at a leaf node.
+  A very small number will usually mean the tree will overfit, whereas a
+  large number will prevent the tree from learning the data. If the sample
+  size varies greatly, a float number can be used as percentage in this
+  parameter. Note that ``min_samples_split`` can create arbitrarily
+  small leaves.

* Balance your dataset before training to prevent the tree from being biased
toward the classes that are dominant. Class balancing can be done by
sampling an equal number of samples from each class, or preferably by
normalizing the sum of the sample weights (``sample_weight``) for each
-  class to the same value. Also note that weight-based pre-pruning criteria,
-  such as ``min_weight_fraction_leaf``, will then be less biased toward
-  dominant classes than criteria that are not aware of the sample weights,
-  like ``min_samples_leaf``.
-
-* If the samples are weighted, it will be easier to optimize the tree
-  structure using weight-based pre-pruning criterion such as
-  ``min_weight_fraction_leaf``, which ensure that leaf nodes contain at least
-  a fraction of the overall sum of the sample weights.
+  class to the same value.

* All decision trees use ``np.float32`` arrays internally.
If training data is not in this format, a copy of the dataset will be made.
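A combined sketch of the tips in the hunk above, under illustrative assumptions about the data: a fractional ``min_samples_split``, per-class normalized ``sample_weight``, and ``float32`` input to avoid the internal copy.

import numpy as np

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data (90% / 10%); sizes and values are illustrative only.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X = np.asarray(X, dtype=np.float32)  # trees use float32 internally; avoids a copy

# Normalize the sum of sample weights for each class to the same value (1.0).
sample_weight = np.empty(y.shape[0])
for cls in np.unique(y):
    mask = y == cls
    sample_weight[mask] = 1.0 / mask.sum()

clf = DecisionTreeClassifier(
    min_samples_split=0.02,  # a float is read as a fraction of n_samples
    random_state=0,
)
clf.fit(X, y, sample_weight=sample_weight)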
13 changes: 13 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -325,6 +325,12 @@ Support for Python 3.3 has been officially dropped.
while mask does not allow this functionality.
:issue:`9524` by :user:`Guillaume Lemaitre <glemaitre>`.

+- |API| The parameters ``min_samples_leaf`` and ``min_weight_fraction_leaf`` in
+  tree-based ensembles are deprecated and will be removed (fixed to 1 and 0
+  respectively) in version 0.22. These parameters were not effective for
+  regularization and at worst would produce bad splits. :issue:`10773` by
+  :user:`Bob Chen <lasagnaman>` and `Joel Nothman`_.
+
- |Fix| :class:`ensemble.BaseBagging` where one could not deterministically
reproduce ``fit`` result using the object attributes when ``random_state``
is set. :issue:`9723` by :user:`Guillaume Lemaitre <glemaitre>`.
@@ -1005,6 +1011,13 @@ Support for Python 3.3 has been officially dropped.
considered all samples to be of equal weight importance.
:issue:`11464` by :user:`John Stott <JohnStott>`.

+- |API| The parameters ``min_samples_leaf`` and ``min_weight_fraction_leaf`` in
+  :class:`tree.DecisionTreeClassifier` and :class:`tree.DecisionTreeRegressor`
+  are deprecated and will be removed (fixed to 1 and 0 respectively) in version
+  0.22. These parameters were not effective for regularization and at worst
+  would produce bad splits. :issue:`10773` by :user:`Bob Chen <lasagnaman>`
+  and `Joel Nothman`_.
+

:mod:`sklearn.utils`
....................
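To illustrate how user code is expected to migrate under the deprecation described above, mirroring the example-script changes below in this commit; the estimator and parameter values are illustrative.

from sklearn.ensemble import GradientBoostingRegressor

# Before: regularizing via the now-deprecated leaf-based parameter.
# reg = GradientBoostingRegressor(n_estimators=250, max_depth=3,
#                                 min_samples_leaf=9, min_samples_split=9)

# After: drop the deprecated parameter; keep min_samples_split / max_depth.
reg = GradientBoostingRegressor(n_estimators=250, max_depth=3,
                                min_samples_split=9)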
4 changes: 2 additions & 2 deletions examples/ensemble/plot_adaboost_hastie_10_2.py
@@ -43,11 +43,11 @@
X_test, y_test = X[2000:], y[2000:]
X_train, y_train = X[:2000], y[:2000]

-dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
+dt_stump = DecisionTreeClassifier(max_depth=1)
dt_stump.fit(X_train, y_train)
dt_stump_err = 1.0 - dt_stump.score(X_test, y_test)

-dt = DecisionTreeClassifier(max_depth=9, min_samples_leaf=1)
+dt = DecisionTreeClassifier(max_depth=9)
dt.fit(X_train, y_train)
dt_err = 1.0 - dt.score(X_test, y_test)

2 changes: 1 addition & 1 deletion examples/ensemble/plot_gradient_boosting_oob.py
@@ -55,7 +55,7 @@

# Fit classifier with out-of-bag estimates
params = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5,
-          'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
+          'learning_rate': 0.01, 'random_state': 3}
clf = ensemble.GradientBoostingClassifier(**params)

clf.fit(X_train, y_train)
3 changes: 1 addition & 2 deletions examples/ensemble/plot_gradient_boosting_quantile.py
@@ -41,8 +41,7 @@ def f(x):

clf = GradientBoostingRegressor(loss='quantile', alpha=alpha,
n_estimators=250, max_depth=3,
-                                learning_rate=.1, min_samples_leaf=9,
-                                min_samples_split=9)
+                                learning_rate=.1, min_samples_split=9)

clf.fit(X, y)

2 changes: 0 additions & 2 deletions examples/model_selection/plot_randomized_search.py
@@ -55,7 +55,6 @@ def report(results, n_top=3):
param_dist = {"max_depth": [3, None],
"max_features": sp_randint(1, 11),
"min_samples_split": sp_randint(2, 11),
"min_samples_leaf": sp_randint(1, 11),
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}

@@ -74,7 +73,6 @@ def report(results, n_top=3):
param_grid = {"max_depth": [3, None],
"max_features": [1, 3, 10],
"min_samples_split": [2, 3, 10],
"min_samples_leaf": [1, 3, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
