DOC Add example comparing random forest with hgbt models #26320

ArturoAmorQ · 2023-05-03T12:33:10Z

Reference Issues/PRs

Partially addresses #26220.

What does this implement/fix? Explain your changes.

Adds an example comparing random forest with hgbt models.

Any other comments?

I use a regression model, but maybe a classification problem would make the example more visible (RandomForestClassifier is the third most visited page of the doc). Opinions are welcomed.

lorentzenchr · 2023-05-04T12:41:53Z

@ArturoAmorQ Thanks for this new example!
Could you modify the plots to either use the same y-axis for better comparison, or plot RF and HGBT in the same plot with different line styles?

It would also be good to know how large the dataset is, i.e. the number of rows and features.

adrinjalali

Overall LGTM. I would also add a link to this from the docstrings of these models. Our examples are not otherwise discoverable really.

ArturoAmorQ · 2023-05-24T15:03:58Z

Just a weird behavior to be investigated: HGBT prediction times are almost 10 times slower on the CI than on my local machine.

Local:

CI:

ogrisel · 2023-05-24T15:18:01Z

Here is what I get on macOS M1:

ogrisel · 2023-05-24T15:19:22Z

So hist gradient boosting classifier is particularly slow at prediction time on the CI... while it's ok at fit time. There is something really fishy happening on the CI.

ogrisel · 2023-05-24T15:29:01Z

The threadpool_info() output on the CI is a bit unexpected, as libomp detects only 1 thread:

[{'user_api': 'blas', 'internal_api': 'openblas', 'prefix': 'libopenblas', 'filepath': '/home/circleci/mambaforge/envs/testenv/lib/libopenblasp-r0.3.21.so', 'version': '0.3.21', 'threading_layer': 'pthreads', 'architecture': 'SkylakeX', 'num_threads': 2}, {'user_api': 'openmp', 'internal_api': 'openmp', 'prefix': 'libomp', 'filepath': '/home/circleci/mambaforge/envs/testenv/lib/libomp.so', 'version': None, 'num_threads': 1}]
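
As a reading aid, the output above can be summarized per threading API with a few lines of plain Python. This is only a sketch: `ci_info` is an abridged copy of the list quoted above, and `threads_by_api` is a hypothetical helper, not part of threadpoolctl.

```python
# Abridged copy of the CI output quoted above (only the fields used here).
ci_info = [
    {"user_api": "blas", "internal_api": "openblas", "prefix": "libopenblas", "num_threads": 2},
    {"user_api": "openmp", "internal_api": "openmp", "prefix": "libomp", "num_threads": 1},
]


def threads_by_api(info):
    """Map each user_api to the list of thread counts reported for it."""
    summary = {}
    for entry in info:
        summary.setdefault(entry["user_api"], []).append(entry["num_threads"])
    return summary


summary = threads_by_api(ci_info)
print(summary)  # {'blas': [2], 'openmp': [1]}
```

With threadpoolctl installed, the list returned by `threadpoolctl.threadpool_info()` can be passed to the same helper.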

ogrisel · 2023-05-24T15:50:48Z

After removing the export OMP_NUM_THREADS=1 line from the circle ci config we get:

but:

[{'user_api': 'blas', 'internal_api': 'openblas', 'prefix': 'libopenblas', 'filepath': '/home/circleci/mambaforge/envs/testenv/lib/libopenblasp-r0.3.21.so', 'version': '0.3.21', 'threading_layer': 'pthreads', 'architecture': 'SkylakeX', 'num_threads': 2}, {'user_api': 'openmp', 'internal_api': 'openmp', 'prefix': 'libomp', 'filepath': '/home/circleci/mambaforge/envs/testenv/lib/libomp.so', 'version': None, 'num_threads': 36}]

The 36 OpenMP threads might cause serious oversubscription on other jobs. Let me try setting it to 2 instead.
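
The mismatch between 36 detected threads and the cores actually usable can be checked with a small stdlib-only sketch. It assumes a cgroup v2 layout where the quota is exposed in `/sys/fs/cgroup/cpu.max`; cgroup v1 exposes the quota through different files, so on such systems the function simply falls back to the logical core count. Both function names are hypothetical.

```python
import math
import os


def parse_cpu_max(text):
    """Return the CPU limit encoded in a cgroup v2 ``cpu.max`` string.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no quota is set, in which case None is returned.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return max(1, math.ceil(int(quota) / int(period)))


def usable_cpus(cpu_max_path="/sys/fs/cgroup/cpu.max"):
    """Best-effort count of usable cores: logical cores capped by any quota."""
    n_cpus = os.cpu_count() or 1
    try:
        with open(cpu_max_path) as f:
            limit = parse_cpu_max(f.read())
    except (OSError, ValueError):
        limit = None  # no cgroup v2 file, or unexpected content
    return n_cpus if limit is None else min(n_cpus, limit)


print(parse_cpu_max("200000 100000"))  # 2
print(parse_cpu_max("max 100000"))  # None
print(usable_cpus())
```

A library that is not cgroup-aware would see only the logical core count, which is how 36 threads can end up competing for 2 usable cores.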

ogrisel

Here is a pass of feedback. Even if we do not fully understand why HGBDT prediction is slower on the CI than on our local machines, I think the example is good enough as it is to get the main conclusions right, given the analysis suggested below.

Feel free to remove the threadpool_info debugging info from the example.

examples/ensemble/plot_forest_hist_grad_boosting_comparison.py

Co-authored-by: Olivier Grisel <[email protected]>

examples/ensemble/plot_forest_hist_grad_boosting_comparison.py

ogrisel · 2023-05-25T12:22:28Z

Actually, my suggestion to use "Histogram-based Gradient Boosting" broke everything:

it needs to be updated twice
the legend block is now too large

Maybe let's just use "Hist Gradient Boosting" (with spaces).

ogrisel · 2023-05-25T12:27:37Z

With the new code and CI config, both models now use the same number of threads in all cases.

Here is the plot on my laptop (8 cores):

and here is the output on the CI (2 cores):

so all in all, the relative prediction slow-down for HGBDT on the CI is not that impressive any more.

ogrisel

One final pass. Thanks for this example. It makes it very explicit that HGBDT models have a better speed/accuracy trade-off in general. We should link to it both from the narrative doc on RF and from the docstring or "See Also" section of the RF models.

examples/ensemble/plot_forest_hist_grad_boosting_comparison.py

Co-authored-by: Olivier Grisel <[email protected]>

ArturoAmorQ · 2023-05-26T13:09:16Z

I thought it would be nice to show the plot from this example in the user guide, but apparently Sphinx-Gallery does not support plotly scrapers to create a custom image directive. Still, one can write a custom image scraper. Does anyone know how to do this? Do you think it is worth the effort?

ogrisel · 2023-05-30T09:52:50Z

Does anyone know how to do this? Do you think it is worth the effort?

I have no idea. Feel free to give it a try, but I wouldn't hold up this PR for such a scraper. Let's do the plotly scraper in a follow-up PR.

adrinjalali

A few nits, otherwise LGTM.

adrinjalali · 2023-05-31T14:32:12Z

build_tools/circle/build_doc.sh

+# Circle CI has nodes with 36 (logical) cores but docker's cgroup quotas are
+# limiting to 2 usable cores.
+# Scikit-learn's Cython code should be robust to this, but just in case examples
+# run other OpenMP libraries that are not cgroup aware, let's manually limit
+# OpenMP to avoid oversubscription.
+export OMP_NUM_THREADS=2


This change is irrelevant to this PR, and we've had timeout issues in the past on Circle CI, which is why we changed it to 1. Please revert.

If it's needed for the example, it should be set in the example.
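
A minimal sketch of what setting the limit inside the example itself could look like is shown below. It assumes the OpenMP runtime reads `OMP_NUM_THREADS` once, when it is first initialized, so the assignment has to run before any numerical library is imported; the value 2 mirrors the one discussed in this thread.

```python
import os

# Must run before importing libraries that load an OpenMP runtime
# (assumption: the variable is only read when the runtime initializes).
# setdefault keeps any value already chosen by the user or the CI config.
os.environ.setdefault("OMP_NUM_THREADS", "2")

omp_threads = os.environ["OMP_NUM_THREADS"]
print(f"OMP_NUM_THREADS for this script: {omp_threads}")
```

Using `setdefault` rather than a plain assignment means the script-level value never overrides a stricter limit set by the CI configuration.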

examples/ensemble/plot_forest_hist_grad_boosting_comparison.py

Co-authored-by: Adrin Jalali <[email protected]>

…n#26320) Co-authored-by: ArturoAmorQ <[email protected]> Co-authored-by: Olivier Grisel <[email protected]> Co-authored-by: Adrin Jalali <[email protected]>

DOC Add example comparing random forest with hgbt models

6a73819

github-actions bot added the Documentation label May 3, 2023

ArturoAmorQ added 2 commits May 3, 2023 14:40

Add text

ae50172

Tweak

c2d2660

ArturoAmorQ added 5 commits May 4, 2023 15:49

Add author

c315d5d

Add note

10f6689

Mention number of rows and features

7f95627

Use same y-axes and other plotting tweaks

0f9c38e

Add concluding remark

c9abc8f

ArturoAmorQ mentioned this pull request May 4, 2023

DOC Add HGBDT to "user_guide" reference in RF #26322

Merged

Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…

c30829e

…nto hgbt_rf_example

adrinjalali reviewed May 8, 2023

ArturoAmorQ added 9 commits May 23, 2023 11:55

Make clear statements on parallelization

bb810ac

Use plotly instead of matplotlib to better show trade-off

87eff7d

Modify conclusions

a33b1c6

Format tweak

c633a94

Add plotly description

2f31ffb

Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…

634687d

…nto hgbt_rf_example

Set plotly min to 5.14

d27a4ef

Update conda-locks

1b0c2f1

Wording tweak

792220e

Test CI threadpool

3a68f45

Let HGBDT use 2 threads on circle CI

e4b7ab5

Safer OMP_NUM_THREADS=2

eda8e77

Use N_CORES for the RF model

d3c187f

ogrisel reviewed May 25, 2023

Apply suggestions from code review

8a12d76

Co-authored-by: Olivier Grisel <[email protected]>

ArturoAmorQ commented May 25, 2023

Remove threadpool test

4945f84

ogrisel approved these changes May 25, 2023

ArturoAmorQ and others added 5 commits May 25, 2023 15:35

Apply suggestions from code review

83d4036

Co-authored-by: Olivier Grisel <[email protected]>

Address comments from Olivier

6eb7324

Tweaks

fcbafe5

Add image to user guide

dd9e9b3

Fix failing CI

9010747

ArturoAmorQ and others added 3 commits May 26, 2023 15:41

Wording tweak

04d32d9

Link example from docstring

f431b0b

Merge branch 'main' into hgbt_rf_example

fbc6ea6

Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…

6224020

…nto hgbt_rf_example

ArturoAmorQ requested a review from adrinjalali May 31, 2023 08:32

adrinjalali reviewed May 31, 2023

ArturoAmorQ and others added 4 commits May 31, 2023 17:55

Revert change in OMP_NUM_THREADS

e3f686a

Apply suggestions from code review

452a709

Co-authored-by: Adrin Jalali <[email protected]>

Merge branch 'hgbt_rf_example' of github.com:ArturoAmorQ/scikit-learn…

1ed882f

… into hgbt_rf_example

Format tweak

d153d0a

adrinjalali approved these changes Jun 1, 2023

adrinjalali merged commit 2415a1b into scikit-learn:main Jun 1, 2023

ArturoAmorQ deleted the hgbt_rf_example branch August 24, 2023 09:24
