Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC Add example comparing random forest with hgbt models #26320

Merged
merged 37 commits into from
Jun 1, 2023

Conversation

ArturoAmorQ
Copy link
Member

Reference Issues/PRs

Partially addresses #26220.

What does this implement/fix? Explain your changes.

Adds an example comparing random forest with hgbt models.

Any other comments?

I use a regression model, but maybe a classification problem would make the example more visible (RandomForestClassifier is the third most visited page of the doc). Opinions are welcomed.

ArturoAmorQ added 2 commits May 3, 2023 14:40
@lorentzenchr
Copy link
Member

@ArturoAmorQ Thanks for this new example!
Could you modify the plots, either use same y-axes for better comparison or plot rf and hgbt in the same plot, different line style?

I would also be good to know how large the dataset is, so number of rows and features.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. I would also add a link to this from the docstrings of these models. Our examples are not otherwise discoverable really.

@ArturoAmorQ
Copy link
Member Author

Just a weird behavior to be investigated: HGBT prediction times are almost 10 times slower on the CI than on my local machine.

Local:

plotly_local

CI:

plotly_CI

@ogrisel
Copy link
Member

ogrisel commented May 24, 2023

Here is what I get on macOS M1:

image

@ogrisel
Copy link
Member

ogrisel commented May 24, 2023

So hist gradient boosting classifier is particularly slow at prediction time on the CI... while it's ok at fit time. There is something really fishy happening on the CI.

@ogrisel
Copy link
Member

ogrisel commented May 24, 2023

There treadpool_info() output on the CI is a bit unexpected as libomp detects only 1 thread:

[{'user_api': 'blas', 'internal_api': 'openblas', 'prefix': 'libopenblas', 'filepath': '/home/circleci/mambaforge/envs/testenv/lib/libopenblasp-r0.3.21.so', 'version': '0.3.21', 'threading_layer': 'pthreads', 'architecture': 'SkylakeX', 'num_threads': 2}, {'user_api': 'openmp', 'internal_api': 'openmp', 'prefix': 'libomp', 'filepath': '/home/circleci/mambaforge/envs/testenv/lib/libomp.so', 'version': None, 'num_threads': 1}]

@ogrisel
Copy link
Member

ogrisel commented May 24, 2023

After removing the export OMP_NUM_THREADS=1 line from the circle ci config we get:

image

but:

[{'user_api': 'blas', 'internal_api': 'openblas', 'prefix': 'libopenblas', 'filepath': '/home/circleci/mambaforge/envs/testenv/lib/libopenblasp-r0.3.21.so', 'version': '0.3.21', 'threading_layer': 'pthreads', 'architecture': 'SkylakeX', 'num_threads': 2}, {'user_api': 'openmp', 'internal_api': 'openmp', 'prefix': 'libomp', 'filepath': '/home/circleci/mambaforge/envs/testenv/lib/libomp.so', 'version': None, 'num_threads': 36}]

The 36 openmp threads might cause a serious oversubscription on other jobs. Let me try set it to 2 instead.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a path of feedback. Even if we do not fully understand why HGBDT prediction is slower on the CI than on our local machines, I think this is good enough this way to get the main conclusion rights with the following suggested analysis below.

Feel free to remove the threadpool_info debugging info from the example.

@ogrisel
Copy link
Member

ogrisel commented May 25, 2023

Actually my suggestion to use Histogram-based Gradient Boosting broke everything:

  • it needs to be updated twice
  • the legend block is now too large

Maybe let's just use "Hist Gradient Boosting" (with spaces).

@ogrisel
Copy link
Member

ogrisel commented May 25, 2023

With the new code and CI config, both models now use the same number of threads in all cases.

Here is the plot on my laptop (8 cores):

image

and here is the output on the CI (2 cores):

image

so all in all, the relative prediction slow-down for HGBDT on the CI is not that impressive any more.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final pass. Thanks for this example. It makes it very explicit that HGBDT have a stronger speed/accuracy tradeoff in general. We should link to it both from the narrative doc on RF and in the docstring or see also section of the RF models.

@ArturoAmorQ
Copy link
Member Author

ArturoAmorQ commented May 26, 2023

I thought it would be nice to show the plot from this example in the user guide, but apparently sphinx gallery does not support plotly scrapers to create a custom image directive. Still one can write a custom image scraper. Does anyone know how to do this? Do you think the effort is worthy?

@ogrisel
Copy link
Member

ogrisel commented May 30, 2023

Does anyone know how to do this? Do you think the effort is worthy?

I have no idea. Feel free to give it a try but I wouldn't hold this PR to depend on such a scraper. Let's do the plotly scraper in a follow-up PR.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits, otherwise LGTM.

Comment on lines 190 to 195
# Circle CI has nodes with 36 (logical) cores but docker's cgroup quotas are
# limitting to 2 usable cores.
# Scikit-learn's Cython code should be robust to this, but just in case examples
# run other OpenMP libraries that are not cgroup aware, let's manually limit
# OpenMP to avoid oversubscription.
export OMP_NUM_THREADS=2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

irrelevant to this PR, and we've had time out issues in the past on circle CI, therefore we changed it to 1. Please revert.

If it's needed for the example, it should be set in the example.

@adrinjalali adrinjalali merged commit 2415a1b into scikit-learn:main Jun 1, 2023
manudarmi pushed a commit to primait/scikit-learn that referenced this pull request Jun 12, 2023
…n#26320)

Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Adrin Jalali <[email protected]>
@ArturoAmorQ ArturoAmorQ deleted the hgbt_rf_example branch August 24, 2023 09:24
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…n#26320)

Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Adrin Jalali <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants