
[release][CI] air_benchmark_xgboost_cpu_10 failure #28974

Closed · rickyyx opened this issue Oct 3, 2022 · 9 comments · Fixed by #29091

Labels: bug (Something that is supposed to be working, but isn't), P0 (Issues that should be fixed in short order), release-blocker (P0 issue that blocks the release)

Comments

@rickyyx (Contributor) commented Oct 3, 2022

What happened + What you expected to happen

Build failure
Cluster

run_xgboost_prediction takes 531.6400554740001 seconds.
Results: {'training_time': 793.1882077000001, 'prediction_time': 531.6400554740001}
Traceback (most recent call last):
  File "workloads/xgboost_benchmark.py", line 153, in <module>
    main(args)
  File "workloads/xgboost_benchmark.py", line 134, in main
    f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
RuntimeError: Batch prediction on XGBoost is taking 531.6400554740001 seconds, which is longer than expected (450 seconds).
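
For context, a minimal sketch of the timing check that produced this error (assumed shape only; the real workloads/xgboost_benchmark.py may structure it differently):

```python
# Minimal sketch of the prediction-time budget check (assumed shape; the error
# message and the 450s threshold are taken from the traceback above).
import time

PREDICTION_TIME_BUDGET_S = 450

def run_with_budget(predict_fn) -> float:
    start = time.perf_counter()
    predict_fn()
    prediction_time = time.perf_counter() - start
    print(f"run_xgboost_prediction takes {prediction_time} seconds.")
    if prediction_time > PREDICTION_TIME_BUDGET_S:
        raise RuntimeError(
            f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
            f"which is longer than expected ({PREDICTION_TIME_BUDGET_S} seconds)."
        )
    return prediction_time
```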

Versions / Dependencies

NA

Reproduction script

NA

Issue Severity

No response

@rickyyx added the bug and r2.1-failure labels Oct 3, 2022
@rickyyx added the release-blocker label Oct 3, 2022
@c21 (Contributor) commented Oct 4, 2022

@matthewdeng added the air label Oct 5, 2022
amogkam added a commit that referenced this issue Oct 6, 2022
Signed-off-by: Amog Kamsetty [email protected]

Seems like some recent changes on the head node have caused a performance regression for the xgboost benchmark. We change the compute config to only use worker nodes for compute instead.

Closes #28974
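
As an illustrative sketch (not the actual fix, which lives in the release test's compute config): if the head node has been excluded from compute by zeroing its CPU resources, that can be confirmed from any machine attached to the cluster:

```python
# Sketch: check which nodes advertise CPU resources (illustrative only; not
# part of the fix itself).
import ray

ray.init(address="auto")  # attach to the running cluster
for node in ray.nodes():
    if node["Alive"]:
        cpus = node["Resources"].get("CPU", 0)
        print(node["NodeManagerAddress"], "CPUs:", cpus)
# If the head node is excluded from compute, it reports 0 CPUs here, so Ray
# schedules the benchmark's tasks and actors on worker nodes only.
```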
@c21 (Contributor) commented Oct 7, 2022

@c21 reopened this Oct 7, 2022
@jiaodong (Member) commented Oct 7, 2022

We had a regression on the prediction side with the cluster env on commit fd01488; the node memory pattern from training to prediction is:

[Screenshot: node memory pattern, 2022-10-07 4:39 PM]

whereas for our known, stable good run on commit c8dbbf3, the node memory pattern from training to prediction is:

[Screenshot: node memory pattern, 2022-10-07 4:40 PM]

As a result, prediction with 10 workers on 100 GB of data regressed:

From: Results: {'training_time': 778.026989177, 'prediction_time': 306.5530205929999}
To: Results: {'training_time': 765.7612613899998, 'prediction_time': 501.24525937299995}

Both training and prediction for this release test used ~15 GB more memory.
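
Working out the regression from those two Results lines (arithmetic only, using the numbers quoted above):

```python
# Quick delta computation from the two Results lines quoted above.
good = {"training_time": 778.026989177, "prediction_time": 306.5530205929999}
bad = {"training_time": 765.7612613899998, "prediction_time": 501.24525937299995}

for key in good:
    delta = bad[key] - good[key]
    print(f"{key}: {delta:+.1f}s ({bad[key] / good[key]:.2f}x)")
# prediction_time: +194.7s (1.64x); training_time is roughly flat (-12.3s, 0.98x).
```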

@c21 (Contributor) commented Oct 8, 2022

CC @amogkam (ML oncall) FYI for @jiaodong's finding.

@jiaodong (Member) commented Oct 11, 2022

@clarkzinzow

I did a few more release test bisections with prediction batch_size = 8192.

Latency-wise, we're good with a larger batch size on the prediction side; end-to-end latency can be cut to under 200 seconds:

run_xgboost_prediction takes 191.02347457200085 seconds.
run_xgboost_prediction takes 183.74121447799916 seconds.

But the memory footprint suggests each node consistently used ~15 GB of RAM:

[Screenshot: per-node memory usage, 2022-10-11 12:47 PM]

compared to the good commit from Oct 5 (last Wednesday):

[Screenshot: per-node memory usage, 2022-10-11 12:50 PM]
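
As a rough illustration of the batch-size change being tested (a sketch only; it assumes the benchmark predicts through Ray AIR's BatchPredictor, and `checkpoint`/`dataset` are placeholders for whatever the real workloads/xgboost_benchmark.py produces):

```python
# Sketch only: assumes the benchmark runs batch prediction via Ray AIR's
# BatchPredictor; `checkpoint` and `dataset` are placeholders.
from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostPredictor

def run_prediction(checkpoint, dataset):
    predictor = BatchPredictor.from_checkpoint(checkpoint, XGBoostPredictor)
    # Larger batches (8192 here, vs. a smaller default) amortize per-batch
    # overhead; in the bisection runs above this cut e2e latency below 200s.
    return predictor.predict(dataset, batch_size=8192)
```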

@jiaodong added the P0 label Oct 11, 2022
@jiaodong (Member) commented

Root-caused to #29103. cc @clarng: is this expected?

For the full bisection log, see https://docs.google.com/document/d/1SfbHV5AFZe3P_VA_snDve6yeCh8cobVAydn7MRqRSIE/edit#

@clarng (Contributor) commented Oct 13, 2022

I think this is expected. There are several changes to the product + OSS that are causing this:

  • The product added a memory limit to the container, so we have less memory per node now (64 GiB -> 57 GiB). This could contribute to increased spilling, since we allocate 30% of node memory to the object store, and that allocation became smaller after the memory limit was added (see the sketch after this list).
  • The test is ingesting 100 GB of data on 10 nodes. It is expected that memory usage on each node is > 10 GiB, since each node needs to process the data and store results in the object store in addition to using the heap; 3-5 GiB per node doesn't look like the right amount of active memory per node.
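
A back-of-the-envelope check of the first point above (arithmetic only; the 30% object-store fraction is the one quoted in the comment):

```python
# Rough arithmetic for the object store sizing described in the first bullet.
GiB = 1024 ** 3
OBJECT_STORE_FRACTION = 0.30  # fraction of node memory quoted above

before = 64 * GiB * OBJECT_STORE_FRACTION  # ~19.2 GiB object store per node
after = 57 * GiB * OBJECT_STORE_FRACTION   # ~17.1 GiB object store per node
print(f"Object store shrinks by ~{(before - after) / GiB:.1f} GiB per node")
# With ~10 GB of the 100 GB dataset landing on each of the 10 nodes, a smaller
# object store means spilling starts earlier for the same workload.
```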

@jiaodong (Member) commented

^ The PR to increase the batch size should be all we need; all other investigations and discussions are complete.

@clarkzinzow (Contributor) commented

@jiaodong With this PR merged, and release tests passing on both the PR and in master, I'm closing this as fixed.

WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022 (ray-project#29091)

Signed-off-by: Weichen Xu <[email protected]>