Deprecate min_samples_leaf and min_weight_fraction_leaf (scikit-learn…
jnothman authored and rth committed Aug 23, 2018
1 parent ac41ccf commit 2fe58e5
Showing 13 changed files with 327 additions and 164 deletions.
6 changes: 2 additions & 4 deletions doc/modules/ensemble.rst
@@ -218,7 +218,7 @@ setting ``oob_score=True``.
The size of the model with the default parameters is :math:`O( M * N * log (N) )`,
where :math:`M` is the number of trees and :math:`N` is the number of samples.
In order to reduce the size of the model, you can change these parameters:
-``min_samples_split``, ``min_samples_leaf``, ``max_leaf_nodes`` and ``max_depth``.
+``min_samples_split``, ``max_leaf_nodes`` and ``max_depth``.

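As a rough sketch of the advice above (illustrative, not part of the diff): constraining tree growth with these parameters shrinks the serialized model. The dataset, parameter values, and use of pickle to measure size are assumptions for demonstration.

import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Unconstrained forest: trees grow until leaves are pure.
big = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Constrained forest: the parameters named above limit tree size.
small = RandomForestClassifier(
    n_estimators=100,
    min_samples_split=10,
    max_leaf_nodes=64,
    max_depth=10,
    random_state=0,
).fit(X, y)

# Compare serialized sizes; the constrained forest is typically much smaller.
print(len(pickle.dumps(big)), len(pickle.dumps(small)))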
Parallelization
---------------
@@ -393,9 +393,7 @@ The number of weak learners is controlled by the parameter ``n_estimators``. The
the final combination. By default, weak learners are decision stumps. Different
weak learners can be specified through the ``base_estimator`` parameter.
The main parameters to tune to obtain good results are ``n_estimators`` and
-the complexity of the base estimators (e.g., its depth ``max_depth`` or
-minimum required number of samples at a leaf ``min_samples_leaf`` in case of
-decision trees).
+the complexity of the base estimators (e.g., its depth ``max_depth``).

.. topic:: Examples:

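A hedged sketch of the AdaBoost tuning advice above: search jointly over ``n_estimators`` and the base estimator's ``max_depth``. The data, grid values, and use of ``GridSearchCV`` are illustrative assumptions; ``base_estimator`` is the parameter name in this release (later versions rename it to ``estimator``).

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# Tune the number of weak learners and the complexity (depth) of each one.
param_grid = {
    "n_estimators": [50, 200],
    "base_estimator__max_depth": [1, 2, 3],  # 1 = decision stumps
}
search = GridSearchCV(
    AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), random_state=0),
    param_grid,
    cv=3,
)
search.fit(X, y)
print(search.best_params_)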
25 changes: 7 additions & 18 deletions doc/modules/tree.rst
@@ -330,29 +330,18 @@ Tips on practical use
for each additional level the tree grows to. Use ``max_depth`` to control
the size of the tree to prevent overfitting.

-* Use ``min_samples_split`` or ``min_samples_leaf`` to control the number of
-  samples at a leaf node. A very small number will usually mean the tree
-  will overfit, whereas a large number will prevent the tree from learning
-  the data. Try ``min_samples_leaf=5`` as an initial value. If the sample size
-  varies greatly, a float number can be used as percentage in these two parameters.
-  The main difference between the two is that ``min_samples_leaf`` guarantees
-  a minimum number of samples in a leaf, while ``min_samples_split`` can
-  create arbitrary small leaves, though ``min_samples_split`` is more common
-  in the literature.
+* Use ``min_samples_split`` to control the number of samples at a leaf node.
+  A very small number will usually mean the tree will overfit, whereas a
+  large number will prevent the tree from learning the data. If the sample
+  size varies greatly, a float number can be used as percentage in this
+  parameter. Note that ``min_samples_split`` can create arbitrarily
+  small leaves.

* Balance your dataset before training to prevent the tree from being biased
toward the classes that are dominant. Class balancing can be done by
sampling an equal number of samples from each class, or preferably by
normalizing the sum of the sample weights (``sample_weight``) for each
-  class to the same value. Also note that weight-based pre-pruning criteria,
-  such as ``min_weight_fraction_leaf``, will then be less biased toward
-  dominant classes than criteria that are not aware of the sample weights,
-  like ``min_samples_leaf``.
-
-* If the samples are weighted, it will be easier to optimize the tree
-  structure using weight-based pre-pruning criterion such as
-  ``min_weight_fraction_leaf``, which ensure that leaf nodes contain at least
-  a fraction of the overall sum of the sample weights.
+  class to the same value.

* All decision trees use ``np.float32`` arrays internally.
If training data is not in this format, a copy of the dataset will be made.
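A combined sketch of the tips in the hunk above, under illustrative assumptions about the data: a fractional ``min_samples_split``, per-class normalized ``sample_weight``, and ``float32`` input to avoid the internal copy.

import numpy as np

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data (90% / 10%); sizes and values are illustrative only.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X = np.asarray(X, dtype=np.float32)  # trees use float32 internally; avoids a copy

# Normalize the sum of sample weights for each class to the same value (1.0).
sample_weight = np.empty(y.shape[0])
for cls in np.unique(y):
    mask = y == cls
    sample_weight[mask] = 1.0 / mask.sum()

clf = DecisionTreeClassifier(
    min_samples_split=0.02,  # a float is read as a fraction of n_samples
    random_state=0,
)
clf.fit(X, y, sample_weight=sample_weight)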
13 changes: 13 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -325,6 +325,12 @@ Support for Python 3.3 has been officially dropped.
while mask does not allow this functionality.
:issue:`9524` by :user:`Guillaume Lemaitre <glemaitre>`.

+- |API| The parameters ``min_samples_leaf`` and ``min_weight_fraction_leaf`` in
+  tree-based ensembles are deprecated and will be removed (fixed to 1 and 0
+  respectively) in version 0.22. These parameters were not effective for
+  regularization and at worst would produce bad splits. :issue:`10773` by
+  :user:`Bob Chen <lasagnaman>` and `Joel Nothman`_.
+
- |Fix| :class:`ensemble.BaseBagging` where one could not deterministically
reproduce ``fit`` result using the object attributes when ``random_state``
is set. :issue:`9723` by :user:`Guillaume Lemaitre <glemaitre>`.
@@ -1005,6 +1011,13 @@ Support for Python 3.3 has been officially dropped.
considered all samples to be of equal weight importance.
:issue:`11464` by :user:`John Stott <JohnStott>`.

+- |API| The parameters ``min_samples_leaf`` and ``min_weight_fraction_leaf`` in
+  :class:`tree.DecisionTreeClassifier` and :class:`tree.DecisionTreeRegressor`
+  are deprecated and will be removed (fixed to 1 and 0 respectively) in version
+  0.22. These parameters were not effective for regularization and at worst
+  would produce bad splits. :issue:`10773` by :user:`Bob Chen <lasagnaman>`
+  and `Joel Nothman`_.
+

:mod:`sklearn.utils`
....................
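To illustrate how user code is expected to migrate under the deprecation described above, mirroring the example-script changes below in this commit; the estimator and parameter values are illustrative.

from sklearn.ensemble import GradientBoostingRegressor

# Before: regularizing via the now-deprecated leaf-based parameter.
# reg = GradientBoostingRegressor(n_estimators=250, max_depth=3,
#                                 min_samples_leaf=9, min_samples_split=9)

# After: drop the deprecated parameter; keep min_samples_split / max_depth.
reg = GradientBoostingRegressor(n_estimators=250, max_depth=3,
                                min_samples_split=9)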
4 changes: 2 additions & 2 deletions examples/ensemble/plot_adaboost_hastie_10_2.py
@@ -43,11 +43,11 @@
X_test, y_test = X[2000:], y[2000:]
X_train, y_train = X[:2000], y[:2000]

-dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
+dt_stump = DecisionTreeClassifier(max_depth=1)
dt_stump.fit(X_train, y_train)
dt_stump_err = 1.0 - dt_stump.score(X_test, y_test)

-dt = DecisionTreeClassifier(max_depth=9, min_samples_leaf=1)
+dt = DecisionTreeClassifier(max_depth=9)
dt.fit(X_train, y_train)
dt_err = 1.0 - dt.score(X_test, y_test)

2 changes: 1 addition & 1 deletion examples/ensemble/plot_gradient_boosting_oob.py
@@ -55,7 +55,7 @@

# Fit classifier with out-of-bag estimates
params = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5,
-          'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
+          'learning_rate': 0.01, 'random_state': 3}
clf = ensemble.GradientBoostingClassifier(**params)

clf.fit(X_train, y_train)
3 changes: 1 addition & 2 deletions examples/ensemble/plot_gradient_boosting_quantile.py
@@ -41,8 +41,7 @@ def f(x):

clf = GradientBoostingRegressor(loss='quantile', alpha=alpha,
n_estimators=250, max_depth=3,
-                                learning_rate=.1, min_samples_leaf=9,
-                                min_samples_split=9)
+                                learning_rate=.1, min_samples_split=9)

clf.fit(X, y)

2 changes: 0 additions & 2 deletions examples/model_selection/plot_randomized_search.py
@@ -55,7 +55,6 @@ def report(results, n_top=3):
param_dist = {"max_depth": [3, None],
"max_features": sp_randint(1, 11),
"min_samples_split": sp_randint(2, 11),
"min_samples_leaf": sp_randint(1, 11),
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}

@@ -74,7 +73,6 @@ def report(results, n_top=3):
param_grid = {"max_depth": [3, None],
"max_features": [1, 3, 10],
"min_samples_split": [2, 3, 10],
"min_samples_leaf": [1, 3, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
