FIX Sets max_samples=1 when it is a float and too low in RandomForestClassifier #25601

JanFidor · 2023-02-13T21:42:21Z

Reference Issues/PRs

Fixes #24037
Superseded #25140

Credit to @mohitthakur13 @Da-Lan for the solution and to @sbendimerad for the contribution

betatim

LGTM.

Thanks for picking this up in an attempt to wrap it up :)

JanFidor · 2023-02-28T15:09:37Z

@betatim Hi, I merged main to keep the branch up to date. It has also been a while since your approval, so I wanted to make sure the PR doesn't stall by accident and to ask whether there is anything else I should do to get it merged :)

thomasjpfan

Thank you for the PR @JanFidor !

thomasjpfan · 2023-02-28T17:55:03Z

doc/whats_new/v1.3.rst

+- |Fix| :class:`ensemble.RandomForestClassifier` raise more descriptive ValueError when round(n_samples * max_samples) < 1
+ :pr:`25601` by :user:`Jan Fidor <JanFidor>`.


Since this is a backward incompatible change, we will need to add this to the "Changed Model section" too. For example, the following works on main but will fail after this PR:

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)
clf = RandomForestClassifier(max_samples=1e-4)
clf.fit(X, y)
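To make the arithmetic behind this example concrete, here is a minimal sketch in plain Python. The helper name `bootstrap_size_or_error` is hypothetical and only mirrors the check proposed in this PR for a float `max_samples`; it is not the scikit-learn function itself. The wine dataset has 178 rows, so `round(178 * 1e-4)` is 0 and the check would raise.

```python
def bootstrap_size_or_error(n_samples, max_samples):
    """Hypothetical stand-in for the float branch as proposed in this PR."""
    result = round(n_samples * max_samples)
    if result < 1:
        # Mirrors the error proposed in this PR (message without backticks).
        raise ValueError("round(max_samples * n_samples) must be >= 1")
    return result


N_WINE = 178  # number of rows returned by load_wine

# A moderate fraction yields a usable bootstrap size.
print(bootstrap_size_or_error(N_WINE, 0.5))

# A tiny fraction rounds to zero and triggers the error.
try:
    bootstrap_size_or_error(N_WINE, 1e-4)
except ValueError as exc:
    print(f"ValueError: {exc}")
```

Before the check, the same inputs would simply yield a bootstrap size of 0, which is why adding the error counts as a behavior change.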

Suggested change

- |Fix| :class:`ensemble.RandomForestClassifier` raise more descriptive ValueError when round(n_samples * max_samples) < 1

:pr:`25601` by :user:`Jan Fidor <JanFidor>`.

- |Fix| :meth:`ensemble.RandomForestClassifier.fit` raises a more descriptive `ValueError`

when `max_samples` is a float and `round(n_samples * max_samples) < 1`.

:pr:`25601` by :user:`Jan Fidor <JanFidor>`.

thomasjpfan · 2023-02-28T17:56:35Z

sklearn/ensemble/_forest.py

- return round(n_samples * max_samples)
+ result = round(n_samples * max_samples)
+ if result < 1:
+ raise ValueError("round(`max_samples` * `n_samples`) must be >= 1")


I know this does not look like the error message above, but I think the backticks do not add any information:

Suggested change

raise ValueError("round(`max_samples` * `n_samples`) must be >= 1")

raise ValueError("round(max_samples * n_samples) must be >= 1")

Honestly, I added them to stay consistent with this raise and I'm inclined to agree that the backticks are a little redundant. So if it's okay I'd like to delete them from error messages in this file to stay consistent

> So if it's okay I'd like to delete them from error messages in this file to stay consistent

Even if it's small, I prefer not to expand the scope of this PR, so it's easier to review and merge. We can clean up the file in a separate follow-up PR.

thomasjpfan · 2023-02-28T17:58:30Z

sklearn/ensemble/tests/test_forest.py

+def test_raises_descriptive_bootstrap_error():
+ X, y = datasets.load_wine(return_X_y=True)
+ forest = RandomForestClassifier(max_samples=1e-4, class_weight="balanced_subsample")
+ warn_msg = "round\\(`max_samples` \\* `n_samples`\\) must be >= 1"


Using re.escape makes the following look a little cleaner:

Suggested change

warn_msg = "round\\(`max_samples` \\* `n_samples`\\) must be >= 1"

warn_msg = re.escape("round(max_samples * n_samples) must be >= 1")
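As a quick illustration of why `re.escape` helps here, the following sketch compares the hand-escaped pattern with the escaped one. It relies on the fact that `pytest.raises(..., match=...)` applies `re.search` to the exception message; the variable names are illustrative only.

```python
import re

# The message text after dropping the backticks, as suggested in this thread.
message = "round(max_samples * n_samples) must be >= 1"

# Hand-escaped pattern: every regex metacharacter needs its own backslash.
hand_escaped = "round\\(max_samples \\* n_samples\\) must be >= 1"

# re.escape builds an equivalent literal pattern from the plain message.
auto_escaped = re.escape(message)

# Both escaped patterns find the literal message.
print(bool(re.search(hand_escaped, message)))
print(bool(re.search(auto_escaped, message)))

# The raw message used as a pattern does not match itself: "(" opens a
# group and "*" becomes a quantifier, so the literal "(" is never matched.
print(re.search(message, message))
```

Using `re.escape` keeps the test string identical to the message in the source, so the two are easy to keep in sync.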

thomasjpfan · 2023-02-28T18:00:13Z

sklearn/ensemble/tests/test_forest.py

@@ -1807,3 +1807,11 @@ def test_read_only_buffer(monkeypatch):

 clf = RandomForestClassifier(n_jobs=2, random_state=rng)
 cross_val_score(clf, X, y, cv=2)
+
+
+def test_raises_descriptive_bootstrap_error():


Suggested change

def test_raises_descriptive_bootstrap_error():

def test_raises_bootstrap_error_when_max_samples_too_low():

"""Check that an error is raised when max_samples is configured too low.

Non-regression test for gh-24037.

"""

thomasjpfan · 2023-02-28T18:01:44Z

sklearn/ensemble/tests/test_forest.py

+
+def test_raises_descriptive_bootstrap_error():
+ X, y = datasets.load_wine(return_X_y=True)
+ forest = RandomForestClassifier(max_samples=1e-4, class_weight="balanced_subsample")


Can we use a pytest.mark.parametrize to check the normal case as well?

@pytest.mark.parametrize("class_weight", ["balanced_subsample", None])
def test_raises_bootstrap_error_when_max_samples_too_low(class_weight):
    ...
    forest = RandomForestClassifier(max_samples=1e-4, class_weight=class_weight)


jeremiedbb · 2023-03-02T18:37:10Z

@thomasjpfan, @glemaitre, would it be bad to not raise an error and instead just return 1, i.e. `return max(round(n_samples * max_samples), 1)`?
This is a pretty common pattern in scikit-learn and we usually don't fail but return the minimum acceptable value.

thomasjpfan · 2023-03-02T21:28:10Z

Yea, I'll be okay with returning one. I would have preferred ceil(n_samples * max_samples), but for backward compatibility max(round(...), 1) is okay with me.
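The trade-off between the two options can be sketched in plain Python. The helper names below are hypothetical, and 178 is the number of rows in the wine dataset used earlier in this thread. Both options return 1 for a very small fraction, but `ceil` would also change sizes that `round` already handles, which is the backward compatibility concern.

```python
import math


def clamped_round(n_samples, max_samples):
    """Hypothetical helper: keep round(), but never return less than 1."""
    return max(round(n_samples * max_samples), 1)


def ceil_size(n_samples, max_samples):
    """Hypothetical helper: the ceil-based alternative mentioned above."""
    return math.ceil(n_samples * max_samples)


n_samples = 178  # rows in the wine dataset

for fraction in (1e-4, 0.3, 0.5):
    print(
        fraction,
        round(n_samples * fraction),         # behavior before the fix
        clamped_round(n_samples, fraction),  # max(round(...), 1)
        ceil_size(n_samples, fraction),      # ceil(...)
    )
```

For `max_samples=0.3` the product is about 53.4, so `round` keeps 53 while `ceil` gives 54; only the clamped version leaves existing results untouched while fixing the zero case.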

JanFidor · 2023-03-03T17:51:42Z

@jeremiedbb @thomasjpfan just wanted to get confirmation, shall I make the change to stay backward compatible and delete the "Changed Model section" entry?

jeremiedbb · 2023-03-03T18:04:10Z

Yes, you can make the modifications in this PR

thomasjpfan

Thanks for the update! Could you update the title to reflect the new behavior in this PR? (The title will become the commit message.)

The docstrings for RandomForestClassifier and RandomForestRegressor need to be updated:

scikit-learn/sklearn/ensemble/_forest.py

Lines 1286 to 1288 in 4180b07

  - If float, then draw `max_samples * X.shape[0]` samples. Thus, 

  `max_samples` should be in the interval `(0.0, 1.0]`.

I prefer it to be explicit and say max(round(n_samples * max_samples), 1).


jeremiedbb

LGTM. Thanks @JanFidor

thomasjpfan

LGTM

JanFidor · 2023-03-10T15:02:15Z

Thanks for the reviews and help @betatim @thomasjpfan and @jeremiedbb !

add fix and test

776479f

github-actions bot added the module:ensemble label Feb 13, 2023

JanFidor mentioned this pull request Feb 13, 2023

fix: added a more descriptive error to _get_n_samples_bootstrap #25140

Closed

JanFidor and others added 4 commits February 13, 2023 22:43

add changelog

8ec3982

fix regex error

036b66e

the regex hates me fix

83526bb

Merge branch 'main' into fix/issue-24037

ddb61a7

betatim approved these changes Feb 15, 2023


Merge branch 'main' into fix/issue-24037

327bc78

thomasjpfan reviewed Feb 28, 2023


JanFidor added 5 commits February 28, 2023 19:47

improve test readability

7594f20

add docs entry and make error msg more descriptive

16ce606

Merge branch 'fix/issue-24037' of https://github.com/JanFidor/scikit-…

9fb33c9

…learn into fix/issue-24037

black

1fb2320

use parametrized param in test

b198044

JanFidor added 2 commits March 3, 2023 19:18

changes to stay backward compatible

ed2d5e3

black didn't pick up unused import

4b9eae6

thomasjpfan reviewed Mar 3, 2023


JanFidor and others added 2 commits March 9, 2023 22:08

Merge branch 'main' into fix/issue-24037

14afb87

update test

00f01e2

JanFidor changed the title ~~Add a more descriptive error to _get_n_samples_bootstrap in RandomForestClassifier~~ Return 1 for _get_n_samples_bootstrap, when samples too low in RandomForestClassifier Mar 9, 2023

JanFidor and others added 2 commits March 9, 2023 22:28

update RandomForestRegressor and RandomForestClassifier docstrings

c18c285

Update test_forest.py

6388d80

jeremiedbb approved these changes Mar 10, 2023


thomasjpfan changed the title ~~Return 1 for _get_n_samples_bootstrap, when samples too low in RandomForestClassifier~~ Sets max_samples=1 when it is a float and too low in RandomForestClassifier Mar 10, 2023

thomasjpfan changed the title ~~Sets max_samples=1 when it is a float and too low in RandomForestClassifier~~ FIX Sets max_samples=1 when it is a float and too low in RandomForestClassifier Mar 10, 2023

thomasjpfan approved these changes Mar 10, 2023


thomasjpfan enabled auto-merge (squash) March 10, 2023 14:54

thomasjpfan merged commit 01c8e0b into scikit-learn:main Mar 10, 2023
