
[release][CI] air_benchmark_xgboost_cpu_10 failure #28974

Closed · rickyyx opened this issue Oct 3, 2022 · 9 comments · Fixed by #29091

Labels: bug (Something that is supposed to be working, but isn't), P0 (Issues that should be fixed in short order), release-blocker (P0 issue that blocks the release)

Comments

@rickyyx (Contributor) commented Oct 3, 2022

What happened + What you expected to happen

Build failure
Cluster

run_xgboost_prediction takes 531.6400554740001 seconds.
Results: {'training_time': 793.1882077000001, 'prediction_time': 531.6400554740001}
Traceback (most recent call last):
  File "workloads/xgboost_benchmark.py", line 153, in <module>
    main(args)
  File "workloads/xgboost_benchmark.py", line 134, in main
    f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
RuntimeError: Batch prediction on XGBoost is taking 531.6400554740001 seconds, which is longer than expected (450 seconds).
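
For context, a minimal sketch of the timing check that produced this error (assumed shape only; the real workloads/xgboost_benchmark.py may structure it differently):

```python
# Minimal sketch of the prediction-time budget check (assumed shape; the error
# message and the 450s threshold are taken from the traceback above).
import time

PREDICTION_TIME_BUDGET_S = 450

def run_with_budget(predict_fn) -> float:
    start = time.perf_counter()
    predict_fn()
    prediction_time = time.perf_counter() - start
    print(f"run_xgboost_prediction takes {prediction_time} seconds.")
    if prediction_time > PREDICTION_TIME_BUDGET_S:
        raise RuntimeError(
            f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
            f"which is longer than expected ({PREDICTION_TIME_BUDGET_S} seconds)."
        )
    return prediction_time
```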

Versions / Dependencies

NA

Reproduction script

NA

Issue Severity

No response

@rickyyx added the bug and r2.1-failure labels Oct 3, 2022
@rickyyx added the release-blocker label Oct 3, 2022
@c21 (Contributor) commented Oct 4, 2022

@matthewdeng added the air label Oct 5, 2022
amogkam added a commit that referenced this issue Oct 6, 2022
Signed-off-by: Amog Kamsetty [email protected]

Seems like some recent changes on the head node have caused a performance regression for the xgboost benchmark. We change the compute config to only use worker nodes for compute instead.

Closes #28974
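
As an illustrative sketch (not the actual fix, which lives in the release test's compute config): if the head node has been excluded from compute by zeroing its CPU resources, that can be confirmed from any machine attached to the cluster:

```python
# Sketch: check which nodes advertise CPU resources (illustrative only; not
# part of the fix itself).
import ray

ray.init(address="auto")  # attach to the running cluster
for node in ray.nodes():
    if node["Alive"]:
        cpus = node["Resources"].get("CPU", 0)
        print(node["NodeManagerAddress"], "CPUs:", cpus)
# If the head node is excluded from compute, it reports 0 CPUs here, so Ray
# schedules the benchmark's tasks and actors on worker nodes only.
```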
@c21 (Contributor) commented Oct 7, 2022

@c21 reopened this Oct 7, 2022
@jiaodong (Member) commented Oct 7, 2022

We had a regression on the prediction side with the cluster env on commit fd01488; the node memory pattern from training to prediction is:

[Screenshot: node memory pattern, 2022-10-07 4:39 PM]

whereas for our known, stable good run on commit c8dbbf3, the node memory pattern from training to prediction is:

[Screenshot: node memory pattern, 2022-10-07 4:40 PM]

As a result, prediction with 10 workers on 100 GB of data regressed:

From: Results: {'training_time': 778.026989177, 'prediction_time': 306.5530205929999}
To: Results: {'training_time': 765.7612613899998, 'prediction_time': 501.24525937299995}

Both training and prediction for this release test used ~15 GB more memory.
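
Working out the regression from those two Results lines (arithmetic only, using the numbers quoted above):

```python
# Quick delta computation from the two Results lines quoted above.
good = {"training_time": 778.026989177, "prediction_time": 306.5530205929999}
bad = {"training_time": 765.7612613899998, "prediction_time": 501.24525937299995}

for key in good:
    delta = bad[key] - good[key]
    print(f"{key}: {delta:+.1f}s ({bad[key] / good[key]:.2f}x)")
# prediction_time: +194.7s (1.64x); training_time is roughly flat (-12.3s, 0.98x).
```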

@c21 (Contributor) commented Oct 8, 2022

CC @amogkam (ML oncall) FYI for @jiaodong's finding.

@jiaodong (Member) commented Oct 11, 2022

@clarkzinzow

I did a few more release test bisections with prediction batch_size = 8192.

Latency-wise, we're good with a larger batch size on the prediction side; end-to-end latency can be cut to under 200 seconds:

run_xgboost_prediction takes 191.02347457200085 seconds.
run_xgboost_prediction takes 183.74121447799916 seconds.

But the memory footprint suggests each node consistently used ~15 GB of RAM:

[Screenshot: per-node memory usage, 2022-10-11 12:47 PM]

compared to the good commit from Oct 5 (last Wednesday):

[Screenshot: per-node memory usage, 2022-10-11 12:50 PM]
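
As a rough illustration of the batch-size change being tested (a sketch only; it assumes the benchmark predicts through Ray AIR's BatchPredictor, and `checkpoint`/`dataset` are placeholders for whatever the real workloads/xgboost_benchmark.py produces):

```python
# Sketch only: assumes the benchmark runs batch prediction via Ray AIR's
# BatchPredictor; `checkpoint` and `dataset` are placeholders.
from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostPredictor

def run_prediction(checkpoint, dataset):
    predictor = BatchPredictor.from_checkpoint(checkpoint, XGBoostPredictor)
    # Larger batches (8192 here, vs. a smaller default) amortize per-batch
    # overhead; in the bisection runs above this cut e2e latency below 200s.
    return predictor.predict(dataset, batch_size=8192)
```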

@jiaodong added the P0 label Oct 11, 2022
@jiaodong (Member) commented

Root-caused to #29103. cc @clarng: is this expected?

For the full bisection log, see https://docs.google.com/document/d/1SfbHV5AFZe3P_VA_snDve6yeCh8cobVAydn7MRqRSIE/edit#

@clarng (Contributor) commented Oct 13, 2022

I think this is expected. There are several changes to the product + OSS that are causing this:

  • The product added a memory limit to the container, so we have less memory per node now (64 GiB -> 57 GiB). This could contribute to increased spilling, since we allocate 30% of node memory to the object store, and that allocation became smaller after the memory limit was added (see the sketch after this list).
  • The test is ingesting 100 GB of data on 10 nodes. It is expected that memory usage on each node is > 10 GiB, since each node needs to process the data and store results in the object store in addition to using the heap; 3-5 GiB per node doesn't look like the right amount of active memory per node.
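
A back-of-the-envelope check of the first point above (arithmetic only; the 30% object-store fraction is the one quoted in the comment):

```python
# Rough arithmetic for the object store sizing described in the first bullet.
GiB = 1024 ** 3
OBJECT_STORE_FRACTION = 0.30  # fraction of node memory quoted above

before = 64 * GiB * OBJECT_STORE_FRACTION  # ~19.2 GiB object store per node
after = 57 * GiB * OBJECT_STORE_FRACTION   # ~17.1 GiB object store per node
print(f"Object store shrinks by ~{(before - after) / GiB:.1f} GiB per node")
# With ~10 GB of the 100 GB dataset landing on each of the 10 nodes, a smaller
# object store means spilling starts earlier for the same workload.
```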

@jiaodong (Member) commented

^ The PR to increase the batch size should be all we need; all other investigations and discussions are complete.

@clarkzinzow (Contributor) commented

@jiaodong With this PR merged, and release tests passing on both the PR and in master, I'm closing this as fixed.

WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022 (ray-project#29091)

Signed-off-by: Weichen Xu <[email protected]>