[release][CI] air_benchmark_xgboost_cpu_10 failure #28974
Comments
Signed-off-by: Amog Kamsetty [email protected]
Some recent changes on the head node appear to have caused a performance regression for the xgboost benchmark. This changes the compute config to use only worker nodes for compute instead.
Closes #28974
Still failing as of yesterday: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_LmPgPzNA7TLdJ32Wmd1AwKdH?command-history-section=command_history . Verified that the change is taking effect on the run (8fd6a5b).
We had a regression on the prediction side with the cluster env on commit fd01488. Node memory pattern from training to prediction: [image not preserved]
On our known stable good run, commit c8dbbf3, the node memory pattern from training to prediction is: [image not preserved]
As a result, the 10-worker, 100 GB data prediction regressed [before/after figures not preserved]. Both training and prediction for this release test used ~15 GB more memory.
I did a few more release test bisections with prediction batch_size = 8192. Latency-wise, we're good: with the larger batch size on the prediction side, e2e latency can be cut to <200 secs.
But the memory footprint suggests each node consistently used ~15 GB more RAM compared to the good commit from last Wednesday, Oct 5th.
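To make the batch-size effect above concrete, here is a minimal sketch of why a larger prediction batch size reduces the number of predict calls, and with it the fixed per-batch overhead. It uses plain Python rather than the benchmark's actual code; `batch_ranges`, the row count, and the smaller comparison batch size of 4096 are illustrative assumptions, not values taken from this release test.

```python
from typing import Iterator, Tuple


def batch_ranges(num_rows: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) row ranges that cover num_rows in batches."""
    for start in range(0, num_rows, batch_size):
        yield start, min(start + batch_size, num_rows)


# Hypothetical dataset size; the real benchmark's row count is not stated here.
NUM_ROWS = 1_000_000

# 8192 is the batch size tried in this thread; 4096 is an assumed comparison point.
calls_per_batch_size = {
    batch_size: sum(1 for _ in batch_ranges(NUM_ROWS, batch_size))
    for batch_size in (4096, 8192)
}

for batch_size, calls in calls_per_batch_size.items():
    print(f"batch_size={batch_size}: {calls} predict calls")
```

Doubling the batch size roughly halves the number of calls, but each call holds more rows in memory at once, so it does not by itself explain or remove the extra per-node memory use.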
Root-caused to #29103. cc @clarng: is this expected? Full bisection log: see
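For readers unfamiliar with the approach, the bisection can be sketched as a binary search over an ordered commit range. This is a generic illustration, not the tooling used for this release test: `first_bad_commit`, the commit labels `c0` to `c5`, the peak-memory numbers, and the 10 GB threshold are all hypothetical.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad commit in an oldest-to-newest list.

    Assumes the first commit is known good, the last is known bad, and that
    every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Hypothetical per-commit peak node memory in GB (made up for illustration).
peak_mem_gb = {"c0": 40, "c1": 41, "c2": 40, "c3": 55, "c4": 56, "c5": 55}
commits = list(peak_mem_gb)
baseline = peak_mem_gb[commits[0]]

# A commit counts as "bad" if it uses at least 10 GB more than the baseline.
culprit = first_bad_commit(commits, lambda c: peak_mem_gb[c] - baseline >= 10)
print(culprit)
```

Each probe corresponds to one release test run, so a range of N commits needs about log2(N) runs instead of N.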
I think this is expected. Several changes to the product and OSS are causing this.
^ The PR to increase the batch size should be all we need; all other investigations and discussions are complete.
…t#29091)
Signed-off-by: Amog Kamsetty [email protected]
Seems like some recent changes on head node have caused performance regression for the xgboost benchmark. We change the compute config to only use worker nodes for compute instead.
Closes ray-project#28974
Signed-off-by: Weichen Xu <[email protected]>
What happened + What you expected to happen
Build failure
Cluster
Versions / Dependencies
NA
Reproduction script
NA
Issue Severity
No response