
[Release] Add release logs for 2.9.0 commit 15a558e #41992

Merged 1 commit into ray-project:master on Dec 19, 2023

Conversation

architkulkarni
Contributor

Adds performance logs for Ray 2.9.0, taken from the release tests at https://buildkite.com/ray-project/release-tests-branch/builds?branch=releases%2F2.9.0 for commit 15a558e by running python fetch_release_logs.py 2.9.0.

Below I have included the result of running the regression script. From the release instructions:

This script will catch regressions in perf_metrics; you still need to manually check other metrics (e.g. _peak_memory)

I have not done the "manual check" and will leave that to the reviewers of this PR.

(base) architkulkarni@archit-Q4WXGF2WQY release_logs % python compare_perf_metrics 2.8.0 2.9.0                        
REGRESSION 21.61%: actors_per_second (THROUGHPUT) regresses from 753.4446893211699 to 590.5931553046038 (21.61%) in 2.9.0/benchmarks/many_actors.json
REGRESSION 14.94%: multi_client_put_gigabytes (THROUGHPUT) regresses from 36.34816401372876 to 30.918984626602807 (14.94%) in 2.9.0/microbenchmark.json
REGRESSION 13.26%: 1_n_async_actor_calls_async (THROUGHPUT) regresses from 8601.993472120319 to 7460.962715134404 (13.26%) in 2.9.0/microbenchmark.json
REGRESSION 13.10%: single_client_tasks_sync (THROUGHPUT) regresses from 1161.670131632561 to 1009.4349525282154 (13.10%) in 2.9.0/microbenchmark.json
REGRESSION 11.34%: n_n_actor_calls_async (THROUGHPUT) regresses from 30108.565209428394 to 26694.138600078164 (11.34%) in 2.9.0/microbenchmark.json
REGRESSION 10.64%: multi_client_tasks_async (THROUGHPUT) regresses from 27211.51041454346 to 24316.337428119852 (10.64%) in 2.9.0/microbenchmark.json
REGRESSION 10.02%: 1_n_actor_calls_async (THROUGHPUT) regresses from 9581.728569086026 to 8622.116661460657 (10.02%) in 2.9.0/microbenchmark.json
REGRESSION 9.21%: 1_1_async_actor_calls_sync (THROUGHPUT) regresses from 1377.3257452550822 to 1250.487251391533 (9.21%) in 2.9.0/microbenchmark.json
REGRESSION 8.67%: placement_group_create/removal (THROUGHPUT) regresses from 926.0840791839338 to 845.7511547073977 (8.67%) in 2.9.0/microbenchmark.json
REGRESSION 6.25%: 1_1_actor_calls_sync (THROUGHPUT) regresses from 2213.6033025230176 to 2075.2443816745968 (6.25%) in 2.9.0/microbenchmark.json
REGRESSION 6.12%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 2895.292478069285 to 2718.2145554952413 (6.12%) in 2.9.0/microbenchmark.json
REGRESSION 5.79%: client__put_gigabytes (THROUGHPUT) regresses from 0.12401864230452364 to 0.1168388142260294 (5.79%) in 2.9.0/microbenchmark.json
REGRESSION 5.62%: client__put_calls (THROUGHPUT) regresses from 856.533614603169 to 808.3571852957423 (5.62%) in 2.9.0/microbenchmark.json
REGRESSION 4.94%: n_n_async_actor_calls_async (THROUGHPUT) regresses from 24290.541801601616 to 23089.526825423094 (4.94%) in 2.9.0/microbenchmark.json
REGRESSION 3.77%: client__get_calls (THROUGHPUT) regresses from 1164.1583807193044 to 1120.242286739544 (3.77%) in 2.9.0/microbenchmark.json
REGRESSION 3.26%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 13.55352518200595 to 13.112230033151658 (3.26%) in 2.9.0/microbenchmark.json
REGRESSION 3.24%: single_client_tasks_and_get_batch (THROUGHPUT) regresses from 8.7124898510668 to 8.429852592930626 (3.24%) in 2.9.0/microbenchmark.json
REGRESSION 3.09%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1038.8711159440322 to 1006.7547148607874 (3.09%) in 2.9.0/microbenchmark.json
REGRESSION 2.32%: single_client_tasks_async (THROUGHPUT) regresses from 8643.833466025399 to 8443.260998630982 (2.32%) in 2.9.0/microbenchmark.json
REGRESSION 2.29%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5697.447666436941 to 5567.259268000422 (2.29%) in 2.9.0/microbenchmark.json
REGRESSION 2.27%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1036.0321459583472 to 1012.4837493368098 (2.27%) in 2.9.0/microbenchmark.json
REGRESSION 0.89%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 535.3383020010909 to 530.5597986550025 (0.89%) in 2.9.0/microbenchmark.json
REGRESSION 65.87%: stage_0_time (LATENCY) regresses from 7.927043914794922 to 13.148497581481934 (65.87%) in 2.9.0/stress_tests/stress_test_many_tasks.json
REGRESSION 44.84%: dashboard_p99_latency_ms (LATENCY) regresses from 3088.301 to 4473.111 (44.84%) in 2.9.0/benchmarks/many_actors.json
REGRESSION 15.81%: avg_pg_remove_time_ms (LATENCY) regresses from 0.7885501576572757 to 0.913254288288353 (15.81%) in 2.9.0/stress_tests/stress_test_placement_group.json
REGRESSION 15.50%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 82.940892212 to 95.796644017 (15.50%) in 2.9.0/scalability/object_store.json
REGRESSION 14.83%: dashboard_p95_latency_ms (LATENCY) regresses from 2237.99 to 2569.856 (14.83%) in 2.9.0/benchmarks/many_actors.json
REGRESSION 10.19%: stage_3_time (LATENCY) regresses from 2943.001654624939 to 3242.995056629181 (10.19%) in 2.9.0/stress_tests/stress_test_many_tasks.json
REGRESSION 9.51%: stage_3_creation_time (LATENCY) regresses from 2.260662794113159 to 2.475653648376465 (9.51%) in 2.9.0/stress_tests/stress_test_many_tasks.json
REGRESSION 3.30%: 3000_returns_time (LATENCY) regresses from 5.899374322999989 to 6.094248331000003 (3.30%) in 2.9.0/scalability/single_node.json
REGRESSION 2.08%: avg_pg_create_time_ms (LATENCY) regresses from 0.8868904699705661 to 0.9053212447438167 (2.08%) in 2.9.0/stress_tests/stress_test_placement_group.json
REGRESSION 0.86%: 10000_args_time (LATENCY) regresses from 17.66019733799999 to 17.811292093000006 (0.86%) in 2.9.0/scalability/single_node.json
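
For reviewers' reference, below is a minimal, hypothetical sketch of what a comparison like compare_perf_metrics does. It is not the actual script from the Ray repo; the directory layout and JSON schema (a top-level perf_metrics list of perf_metric_name / perf_metric_value / perf_metric_type entries) are assumptions based on the output above and the perf_metrics line quoted later in this thread.

```python
# Hypothetical reimplementation of a perf-metrics comparison; the real
# compare_perf_metrics script in the Ray repo may differ. Assumes each
# release-log JSON has a top-level "perf_metrics" list of dicts with
# perf_metric_name, perf_metric_value, and perf_metric_type keys.
import json
from pathlib import Path

def load_metrics(root: str) -> dict:
    """Map (relative json path, metric name) -> (metric type, value)."""
    metrics = {}
    for path in Path(root).rglob("*.json"):
        data = json.loads(path.read_text())
        for m in data.get("perf_metrics", []):
            key = (str(path.relative_to(root)), m["perf_metric_name"])
            metrics[key] = (m["perf_metric_type"], m["perf_metric_value"])
    return metrics

def compare(old_root: str, new_root: str) -> None:
    old, new = load_metrics(old_root), load_metrics(new_root)
    for key, (mtype, new_val) in new.items():
        if key not in old:
            continue  # metric added in the new release; nothing to compare
        _, old_val = old[key]
        # THROUGHPUT regresses when it drops; LATENCY regresses when it rises.
        if mtype == "THROUGHPUT" and new_val < old_val:
            pct = (old_val - new_val) / old_val * 100
        elif mtype == "LATENCY" and new_val > old_val:
            pct = (new_val - old_val) / old_val * 100
        else:
            continue
        print(f"REGRESSION {pct:.2f}%: {key[1]} ({mtype}) regresses "
              f"from {old_val} to {new_val} ({pct:.2f}%) in {new_root}/{key[0]}")

if __name__ == "__main__":
    compare("2.8.0", "2.9.0")  # mirrors: python compare_perf_metrics 2.8.0 2.9.0
```

With 2.8.0/ and 2.9.0/ release-log directories side by side, this prints REGRESSION lines in the same shape as the output above; the real script presumably also handles thresholds and sorting, which are omitted here.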


Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@architkulkarni added the release-blocker (P0 issue that blocks the release) and P0 (issues that should be fixed in short order) labels on Dec 18, 2023
@rickyyx
Contributor

rickyyx commented Dec 18, 2023

multi_client_put_gigabytes is variance.

n_n_actor_calls_with_arg_async: also variance.

dashboard_p99_latency_ms: also variance.

time_to_broadcast_1073741824_bytes_to_50_nodes seems to be variance; rerunning to confirm.

@rickyyx
Contributor

rickyyx commented Dec 18, 2023

1_n_async_actor_calls_async is due to the gRPC upgrade on 10.31 -> no fix.

n_n_actor_calls_async: also gRPC.

1_n_actor_calls_async: same.

Same for:

  • 1_1_async_actor_calls_sync
  • placement_group_create/removal
  • 1_1_actor_calls_sync
  • stage_3_time

@rickyyx
Contributor

rickyyx commented Dec 18, 2023

single_client_tasks_sync is also due to the gRPC upgrade (the initial drop from 1.2k was fixed in #41695, but the gRPC regression hasn't been fixed yet).

multi_client_tasks_async is the same story (there are two drops: one fixed, the other due to gRPC).

@architkulkarni
Contributor Author

FYI, we are still awaiting one last cherry-pick PR: #41990

@raulchen @rickyyx @jjyao do you think we should rerun these performance metrics after that PR is picked?

@rickyyx
Contributor

rickyyx commented Dec 18, 2023

> @raulchen @rickyyx @jjyao do you think we should rerun these performance metrics after that PR is picked?

Shouldn't impact core metrics, I think.

@architkulkarni
Contributor Author

@rickyyx Gotcha, thanks!

Thanks for the details about the regressions! Is the conclusion that there's no release-blocking regression? If so, you can approve this PR (we need two independent approvals to proceed with the release).

@rickyyx
Contributor

rickyyx commented Dec 18, 2023

> Is the conclusion that there's no release-blocking regression? If so, you can approve this PR (we need two independent approvals to proceed with the release).

@jjyao and I looked through them together, and there are two metrics we want to rerun to verify whether they're merely variance in the release branch. Will update once that's cleared up.

@architkulkarni
Contributor Author

Sounds good, thanks for your diligence

@jjyao
Collaborator

jjyao commented Dec 18, 2023

Ran time_to_broadcast_1073741824_bytes_to_50_nodes again (https://buildkite.com/ray-project/release/builds/4509):


broadcast_time = 64.09479726399996
object_size = 1073741824
num_nodes = 50
success = 1
perf_metrics = [{'perf_metric_name': 'time_to_broadcast_1073741824_bytes_to_50_nodes', 'perf_metric_value': 64.09479726399996, 'perf_metric_type': 'LATENCY'}]

So it's noise.
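
(A hedged way to read this rerun check, using the numbers from this thread: for a LATENCY metric, if a fresh run lands at or below the previous release's value, the originally flagged number was run-to-run variance. A sketch, not project tooling:)

```python
# Sanity check for a flagged LATENCY regression, with values from this thread.
baseline_2_8_0 = 82.940892212    # time_to_broadcast_... in the 2.8.0 logs
flagged_2_9_0 = 95.796644017     # value that tripped compare_perf_metrics
rerun_2_9_0 = 64.09479726399996  # Buildkite build 4509 rerun above

if rerun_2_9_0 <= baseline_2_8_0:
    print(f"rerun beats the 2.8.0 baseline; flagged {flagged_2_9_0} was noise")
else:
    pct = (rerun_2_9_0 - baseline_2_8_0) / baseline_2_8_0 * 100
    print(f"rerun still {pct:.2f}% slower; may be a real regression")
```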

@raulchen
Contributor

> FYI, we are still awaiting one last cherry-pick PR: #41990

I submitted another PR, #42000, to fix the issue instead. That PR only touches data code, so it shouldn't impact the core metrics you listed.

@jjyao
Collaborator

jjyao commented Dec 18, 2023

Ran actors_per_second again (https://buildkite.com/ray-project/release/builds/4506#018c7e60-0aed-48e3-9eed-825cbbc5566e):

actors_per_second = 614.2315272090922

Still slower than master. There might be a real regression in the release branch.

@jjyao
Collaborator

jjyao commented Dec 18, 2023

Another run of actors_per_second: https://buildkite.com/ray-project/release/builds/4527#018c7ef7-7ef3-4d50-85c2-cd63ba1b071e

actors_per_second = 652.0240412474651

So it's noise.

@jjyao merged commit 04f024a into ray-project:master on Dec 19, 2023 (9 of 10 checks passed)