-
-
Notifications
You must be signed in to change notification settings - Fork 25.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC Add example comparing random forest with hgbt models #26320
Conversation
@ArturoAmorQ Thanks for this new example! I would also be good to know how large the dataset is, so number of rows and features. |
…nto hgbt_rf_example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. I would also add a link to this from the docstrings of these models. Our examples are not otherwise discoverable really.
…nto hgbt_rf_example
So hist gradient boosting classifier is particularly slow at prediction time on the CI... while it's ok at fit time. There is something really fishy happening on the CI. |
There
|
After removing the but: [{'user_api': 'blas', 'internal_api': 'openblas', 'prefix': 'libopenblas', 'filepath': '/home/circleci/mambaforge/envs/testenv/lib/libopenblasp-r0.3.21.so', 'version': '0.3.21', 'threading_layer': 'pthreads', 'architecture': 'SkylakeX', 'num_threads': 2}, {'user_api': 'openmp', 'internal_api': 'openmp', 'prefix': 'libomp', 'filepath': '/home/circleci/mambaforge/envs/testenv/lib/libomp.so', 'version': None, 'num_threads': 36}] The 36 openmp threads might cause a serious oversubscription on other jobs. Let me try set it to 2 instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a path of feedback. Even if we do not fully understand why HGBDT prediction is slower on the CI than on our local machines, I think this is good enough this way to get the main conclusion rights with the following suggested analysis below.
Feel free to remove the threadpool_info
debugging info from the example.
Co-authored-by: Olivier Grisel <[email protected]>
Actually my suggestion to use
Maybe let's just use "Hist Gradient Boosting" (with spaces). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One final pass. Thanks for this example. It makes it very explicit that HGBDT have a stronger speed/accuracy tradeoff in general. We should link to it both from the narrative doc on RF and in the docstring or see also section of the RF models.
Co-authored-by: Olivier Grisel <[email protected]>
I thought it would be nice to show the plot from this example in the user guide, but apparently sphinx gallery does not support plotly scrapers to create a custom image directive. Still one can write a custom image scraper. Does anyone know how to do this? Do you think the effort is worthy? |
I have no idea. Feel free to give it a try but I wouldn't hold this PR to depend on such a scraper. Let's do the plotly scraper in a follow-up PR. |
…nto hgbt_rf_example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few nits, otherwise LGTM.
build_tools/circle/build_doc.sh
Outdated
# Circle CI has nodes with 36 (logical) cores but docker's cgroup quotas are | ||
# limitting to 2 usable cores. | ||
# Scikit-learn's Cython code should be robust to this, but just in case examples | ||
# run other OpenMP libraries that are not cgroup aware, let's manually limit | ||
# OpenMP to avoid oversubscription. | ||
export OMP_NUM_THREADS=2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
irrelevant to this PR, and we've had time out issues in the past on circle CI, therefore we changed it to 1. Please revert.
If it's needed for the example, it should be set in the example.
Co-authored-by: Adrin Jalali <[email protected]>
… into hgbt_rf_example
…n#26320) Co-authored-by: ArturoAmorQ <[email protected]> Co-authored-by: Olivier Grisel <[email protected]> Co-authored-by: Adrin Jalali <[email protected]>
…n#26320) Co-authored-by: ArturoAmorQ <[email protected]> Co-authored-by: Olivier Grisel <[email protected]> Co-authored-by: Adrin Jalali <[email protected]>
Reference Issues/PRs
Partially addresses #26220.
What does this implement/fix? Explain your changes.
Adds an example comparing random forest with hgbt models.
Any other comments?
I use a regression model, but maybe a classification problem would make the example more visible (
RandomForestClassifier
is the third most visited page of the doc). Opinions are welcomed.