
[Dashboard] Optimize and backpressure actor_head.py #29580

Merged
merged 14 commits on Nov 11, 2022

Conversation

@rkooo567 (Contributor) commented Oct 22, 2022

Signed-off-by: SangBin Cho [email protected]

Why are these changes needed?

This optimizes the actor head's CPU usage and guarantees stable API responses from the dashboard when many actor events are published to drivers. The script below was used for testing; it reproduces the same level of delay as many_nodes_actor_test (250 nodes + 10k actors).

```python
import ray
import time

ray.init()

@ray.remote
class A:
    pass

while True:
    a = [A.remote() for _ in range(100)]
    time.sleep(0.1)
    del a
```
  • Remove unnecessary fields that are not used (including ones that cause high CPU usage)
  • Add backpressure to event processing (1000 events/s) so that the dashboard has enough CPU to respond to API requests
  • Batch-process actor events to minimize context-switching time
  • Stop using an immutable dict that causes high copy overhead
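The batching and backpressure bullets above can be sketched roughly as follows. This is an illustrative asyncio snippet, not Ray's actual implementation; the names `drain`, `MAX_EVENTS_PER_SECOND`, and `BATCH_SIZE` are assumptions, and the real code subscribes to GCS pubsub rather than an `asyncio.Queue`.

```python
import asyncio
import time

MAX_EVENTS_PER_SECOND = 1000  # illustrative backpressure target
BATCH_SIZE = 200              # illustrative batch size

async def drain(queue: asyncio.Queue, handle_batch) -> None:
    """Process queued events in batches until the queue is empty."""
    while not queue.empty():
        batch = []
        # Drain up to BATCH_SIZE items at once to minimize context switches.
        while not queue.empty() and len(batch) < BATCH_SIZE:
            batch.append(queue.get_nowait())
        start = time.monotonic()
        handle_batch(batch)
        # Backpressure: never exceed the target event rate, leaving CPU
        # for API handlers running on the same event loop.
        min_duration = len(batch) / MAX_EVENTS_PER_SECOND
        await asyncio.sleep(max(min_duration - (time.monotonic() - start), 0))
```

The sleep between batches is what caps throughput at roughly 1000 events/s while also yielding control back to the event loop.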

The profiling result after this PR is here: #29580 (comment)

After this PR, most of the remaining overhead is pure MessageToDict. We can drastically reduce this by delaying protobuf processing until query time.
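The idea of delaying protobuf processing until query time could look like the sketch below. `LazyRecord` and `convert` are hypothetical names; `convert` stands in for an expensive call such as `google.protobuf.json_format.MessageToDict`.

```python
class LazyRecord:
    """Store the raw message at ingest time; convert only when queried."""

    __slots__ = ("_raw", "_convert", "_dict")

    def __init__(self, raw, convert):
        self._raw = raw          # raw message, kept as-is on the hot path
        self._convert = convert  # expensive conversion, e.g. MessageToDict
        self._dict = None

    def as_dict(self):
        # Pay the conversion cost on first query only, then cache it.
        if self._dict is None:
            self._dict = self._convert(self._raw)
        return self._dict
```

With this shape, the pubsub handler does O(1) work per event, and the conversion cost is paid only for records an API query actually touches.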

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: SangBin Cho <[email protected]>
@rkooo567 rkooo567 changed the title [Draft] Try removing immutable dict [Dashboard] Optimize and backpressure actor_head.py Nov 7, 2022
Signed-off-by: SangBin Cho <[email protected]>
@rickyyx (Contributor) left a comment:


Could you add some more details to the description for the changes and issues? I might be missing the context here, thank you!
Saw the Slack thread.

@ericl (Contributor) commented Nov 7, 2022

Any before/after profiling comparisons?

@rkooo567 (Author) commented Nov 7, 2022

@ericl oh, I forgot to update it. Let me share it here. Note that the "after" flamegraph includes the overhead of the /actors endpoint (so I just didn't include that page in the screenshot). As you can see, all remaining overhead is now just MessageToDict. We can remove this overhead by delaying protobuf processing until query time.

Before: [flamegraph screenshot, Nov 7, 2022]

After: [flamegraph screenshot, Nov 7, 2022]

@architkulkarni (Contributor) left a comment:


test_snapshot changes look good to me, stamping as codeowner

@rkooo567 (Author) left a comment:


@rickyyx I will update the description. I also forgot to add comments that explain the PR...

@@ -25,7 +28,7 @@ class DataSource:
node_physical_stats = Dict()
rkooo567 (Author):

Changing the node-related Dict fails some tests now. Since it doesn't incur high overhead, we will only modify the actor data sources for now.

dashboard/datacenter.py Show resolved Hide resolved
dashboard/datacenter.py Show resolved Hide resolved
@@ -153,39 +138,53 @@ def process_actor_data_from_pubsub(actor_id, actor_table_data):
# If actor is not new registered but updated, we only update
# states related fields.
if actor_table_data["state"] != "DEPENDENCIES_UNREADY":
- actor_table_data_copy = dict(DataSource.actors[actor_id])
+ actor_table_data_copy = DataSource.actors[actor_id]
rkooo567 (Author):

This copy is unnecessary now that the dict is no longer immutable.

node_id = actor["address"].get("rayletId")
- if node_id:
+ if node_id and node_id != actor_consts.NIL_NODE_ID:
rkooo567 (Author):

It's a separate bug I found.
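The fix above can be illustrated with a small guard function: a "nil" raylet id placeholder must be treated the same as a missing one. The constant and function names below are illustrative, not Ray's actual `actor_consts.NIL_NODE_ID` value.

```python
# Illustrative sentinel; Ray defines its own NIL_NODE_ID in actor_consts.
NIL_NODE_ID = "0" * 56

def resolve_node_id(actor: dict):
    """Return the actor's node id, or None if it is missing or nil."""
    node_id = actor["address"].get("rayletId")
    if node_id and node_id != NIL_NODE_ID:
        return node_id
    return None  # actor not yet placed on a node
```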

@routes.get("/logical/actors")
@dashboard_optional_utils.aiohttp_cache
async def get_all_actors(self, req) -> aiohttp.web.Response:
return rest_response(
success=True, message="All actors fetched.", actors=DataSource.actors
)

@routes.get("/logical/kill_actor")
async def kill_actor(self, req) -> aiohttp.web.Response:
rkooo567 (Author):

This is kept for the old dashboard.

@@ -193,6 +193,11 @@ async def get_progress(self, req):
success=False,
message=e.message,
)
except aiohttp.client_exceptions.ClientConnectorError as e:
rkooo567 (Author):

This is another bug I found.

@@ -642,7 +642,8 @@ Status GcsActorManager::CreateActor(const ray::rpc::CreateActorRequest &request,
actor->UpdateState(rpc::ActorTableData::PENDING_CREATION);
const auto &actor_table_data = actor->GetActorTableData();
// Pub this state for dashboard showing.
- RAY_CHECK_OK(gcs_publisher_->PublishActor(actor_id, actor_table_data, nullptr));
+ RAY_CHECK_OK(gcs_publisher_->PublishActor(
+     actor_id, *GenActorDataOnlyWithStates(actor_table_data), nullptr));
rkooo567 (Author):

We can publish only a subset of the data in this case, since the full data was already published when the actor was registered.
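A hypothetical Python analogue of `GenActorDataOnlyWithStates` makes the idea concrete: later updates only carry the state-related fields, since everything else was published at registration. The field set below is illustrative, not Ray's actual schema.

```python
# Illustrative set of state-related fields; the real list lives in C++.
STATE_FIELDS = {"actor_id", "state", "num_restarts", "timestamp", "pid"}

def gen_actor_data_only_with_states(actor_table_data: dict) -> dict:
    """Keep only fields that change across state transitions."""
    return {k: v for k, v in actor_table_data.items() if k in STATE_FIELDS}
```

Smaller published messages mean less pubsub traffic and less MessageToDict work on every subscriber.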

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 8, 2022
@rkooo567 (Author) commented:

Made additional changes.

  1. Increase the max actors to cache from 1K to 10K. It makes no sense for it to be 1K given that our scalability limit is 10K concurrent actors. I even feel like it should be something like 20K.
  2. Reduce the cleanup_actors frequency so that each call is shorter (so it won't occupy the main thread for a long time).
  3. Reduce the batch size for the same reason.
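Points 2 and 3 above can be sketched as a batched cleanup that yields between batches. The names `cleanup_dead_actors` and `CLEANUP_BATCH_SIZE` are assumptions for illustration, not the actual dashboard code.

```python
import asyncio

CLEANUP_BATCH_SIZE = 100  # illustrative: small batches keep each slice short

async def cleanup_dead_actors(dead_actor_ids: list, actors: dict) -> None:
    """Remove dead actors in small batches, yielding between batches."""
    for i in range(0, len(dead_actor_ids), CLEANUP_BATCH_SIZE):
        for actor_id in dead_actor_ids[i:i + CLEANUP_BATCH_SIZE]:
            actors.pop(actor_id, None)
        # Yield to the event loop so API handlers on the same thread
        # stay responsive even when many actors must be cleaned up.
        await asyncio.sleep(0)
```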

@rickyyx (Contributor) commented Nov 10, 2022

  • Increase the max actors to cache from 1K to 10K. It makes no sense for it to be 1K given that our scalability limit is 10K concurrent actors. I even feel like it should be something like 20K.

What would be a measurement to see whether this limit has a significant impact and could be adjusted?

@rkooo567 (Author) commented:

> What would be a measurement to see whether this limit has a significant impact and could be adjusted?

I think the workload I am running (a stressful actor workload) is the one that can measure it. But I wonder if this can impact prod (e.g., the snapshot API becoming slow or broken due to the large API response).

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 10, 2022
@rkooo567 rkooo567 merged commit 9da53e3 into ray-project:master Nov 11, 2022
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Dec 16, 2022
Signed-off-by: SangBin Cho <[email protected]>

WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Signed-off-by: SangBin Cho <[email protected]>


Signed-off-by: Weichen Xu <[email protected]>
5 participants