
[Dashboard] Optimize and backpressure actor_head.py #29580

Merged
merged 14 commits on Nov 11, 2022

Conversation

@rkooo567 (Contributor) commented Oct 22, 2022

Signed-off-by: SangBin Cho [email protected]

Why are these changes needed?

This optimizes the actor head's CPU usage and guarantees stable API responses from the dashboard when many actor events are published to drivers. The script below was used for testing; it reproduces the same level of delay as many_nodes_actor_test (250 nodes + 10k actors).

```python
import ray
import time

ray.init()

@ray.remote
class A:
    pass

while True:
    a = [A.remote() for _ in range(100)]
    time.sleep(0.1)
    del a
```
  • Remove unnecessary fields that are not used (including ones that cause high CPU usage)
  • Add backpressure to event processing (1000 events/s) so that the dashboard has enough CPU to respond to API requests
  • Batch-process actor events to minimize context-switching time
  • Stop using an immutable dict that causes high copy overhead
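The batching and backpressure bullets above can be sketched roughly as follows. This is an illustrative asyncio snippet, not Ray's actual implementation; the names `drain`, `MAX_EVENTS_PER_SECOND`, and `BATCH_SIZE` are assumptions, and the real code subscribes to GCS pubsub rather than an `asyncio.Queue`.

```python
import asyncio
import time

MAX_EVENTS_PER_SECOND = 1000  # illustrative backpressure target
BATCH_SIZE = 200              # illustrative batch size

async def drain(queue: asyncio.Queue, handle_batch) -> None:
    """Process queued events in batches until the queue is empty."""
    while not queue.empty():
        batch = []
        # Drain up to BATCH_SIZE items at once to minimize context switches.
        while not queue.empty() and len(batch) < BATCH_SIZE:
            batch.append(queue.get_nowait())
        start = time.monotonic()
        handle_batch(batch)
        # Backpressure: never exceed the target event rate, leaving CPU
        # for API handlers running on the same event loop.
        min_duration = len(batch) / MAX_EVENTS_PER_SECOND
        await asyncio.sleep(max(min_duration - (time.monotonic() - start), 0))
```

The sleep between batches is what caps throughput at roughly 1000 events/s while also yielding control back to the event loop.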

The profiling result after this PR is here: #29580 (comment)

After this PR, most of the remaining overhead is pure MessageToDict. We can drastically reduce this by delaying protobuf processing until query time.
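The idea of delaying protobuf processing until query time could look like the sketch below. `LazyRecord` and `convert` are hypothetical names; `convert` stands in for an expensive call such as `google.protobuf.json_format.MessageToDict`.

```python
class LazyRecord:
    """Store the raw message at ingest time; convert only when queried."""

    __slots__ = ("_raw", "_convert", "_dict")

    def __init__(self, raw, convert):
        self._raw = raw          # raw message, kept as-is on the hot path
        self._convert = convert  # expensive conversion, e.g. MessageToDict
        self._dict = None

    def as_dict(self):
        # Pay the conversion cost on first query only, then cache it.
        if self._dict is None:
            self._dict = self._convert(self._raw)
        return self._dict
```

With this shape, the pubsub handler does O(1) work per event, and the conversion cost is paid only for records an API query actually touches.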

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: SangBin Cho <[email protected]>
@rkooo567 rkooo567 changed the title [Draft] Try removing immutable dict [Dashboard] Optimize and backpressure actor_head.py Nov 7, 2022
Signed-off-by: SangBin Cho <[email protected]>
@rickyyx (Contributor) left a comment:


Could you add some more details to the description for the changes and issues? I might be missing the context here, thank you!
Saw the Slack thread.

@ericl (Contributor) commented Nov 7, 2022

Any before/after profiling comparisons?

@rkooo567 (Author) commented Nov 7, 2022

@ericl oh, I forgot to update it. Let me share it here. Note that the "after" flamegraph includes the overhead of the /actors endpoint (so I just didn't include that page in the screenshot). As you can see, all remaining overhead is now just MessageToDict. We can remove this overhead by delaying protobuf processing until query time.

Before: [flamegraph screenshot, Nov 7, 2022]

After: [flamegraph screenshot, Nov 7, 2022]

@architkulkarni (Contributor) left a comment:


test_snapshot changes look good to me, stamping as codeowner

@rkooo567 (Author) left a comment:


@rickyyx I will update the description. I also forgot to add comments that explain the PR...

@@ -25,7 +28,7 @@ class DataSource:
node_physical_stats = Dict()
rkooo567 (Author):

Changing the node-related Dict fails some tests now. Since it doesn't incur high overhead, we will only modify the actor data sources for now.

dashboard/datacenter.py Show resolved Hide resolved
dashboard/datacenter.py Show resolved Hide resolved
@@ -153,39 +138,53 @@ def process_actor_data_from_pubsub(actor_id, actor_table_data):
# If actor is not new registered but updated, we only update
# states related fields.
if actor_table_data["state"] != "DEPENDENCIES_UNREADY":
- actor_table_data_copy = dict(DataSource.actors[actor_id])
+ actor_table_data_copy = DataSource.actors[actor_id]
rkooo567 (Author):

This copy is unnecessary now that the dict is no longer immutable.

node_id = actor["address"].get("rayletId")
- if node_id:
+ if node_id and node_id != actor_consts.NIL_NODE_ID:
rkooo567 (Author):

It's a separate bug I found.
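The fix above can be illustrated with a small guard function: a "nil" raylet id placeholder must be treated the same as a missing one. The constant and function names below are illustrative, not Ray's actual `actor_consts.NIL_NODE_ID` value.

```python
# Illustrative sentinel; Ray defines its own NIL_NODE_ID in actor_consts.
NIL_NODE_ID = "0" * 56

def resolve_node_id(actor: dict):
    """Return the actor's node id, or None if it is missing or nil."""
    node_id = actor["address"].get("rayletId")
    if node_id and node_id != NIL_NODE_ID:
        return node_id
    return None  # actor not yet placed on a node
```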

@routes.get("/logical/actors")
@dashboard_optional_utils.aiohttp_cache
async def get_all_actors(self, req) -> aiohttp.web.Response:
return rest_response(
success=True, message="All actors fetched.", actors=DataSource.actors
)

@routes.get("/logical/kill_actor")
async def kill_actor(self, req) -> aiohttp.web.Response:
rkooo567 (Author):

This is kept for the old dashboard.

@@ -193,6 +193,11 @@ async def get_progress(self, req):
success=False,
message=e.message,
)
except aiohttp.client_exceptions.ClientConnectorError as e:
rkooo567 (Author):

This is another bug I found.

@@ -642,7 +642,8 @@ Status GcsActorManager::CreateActor(const ray::rpc::CreateActorRequest &request,
actor->UpdateState(rpc::ActorTableData::PENDING_CREATION);
const auto &actor_table_data = actor->GetActorTableData();
// Pub this state for dashboard showing.
- RAY_CHECK_OK(gcs_publisher_->PublishActor(actor_id, actor_table_data, nullptr));
+ RAY_CHECK_OK(gcs_publisher_->PublishActor(
+     actor_id, *GenActorDataOnlyWithStates(actor_table_data), nullptr));
rkooo567 (Author):

We can publish only a subset of the data in this case, since the full data was already published when the actor was registered.
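A hypothetical Python analogue of `GenActorDataOnlyWithStates` makes the idea concrete: later updates only carry the state-related fields, since everything else was published at registration. The field set below is illustrative, not Ray's actual schema.

```python
# Illustrative set of state-related fields; the real list lives in C++.
STATE_FIELDS = {"actor_id", "state", "num_restarts", "timestamp", "pid"}

def gen_actor_data_only_with_states(actor_table_data: dict) -> dict:
    """Keep only fields that change across state transitions."""
    return {k: v for k, v in actor_table_data.items() if k in STATE_FIELDS}
```

Smaller published messages mean less pubsub traffic and less MessageToDict work on every subscriber.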

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 8, 2022
@rkooo567 (Author) commented:

Made additional changes.

  1. Increase the max actors to cache from 1K to 10K. It makes no sense for it to be 1K given that our scalability limit is 10K concurrent actors. I even feel like it should be something like 20K.
  2. Reduce the cleanup_actors frequency so that each call is shorter (so it won't occupy the main thread for a long time).
  3. Reduce the batch size for the same reason.
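Points 2 and 3 above can be sketched as a batched cleanup that yields between batches. The names `cleanup_dead_actors` and `CLEANUP_BATCH_SIZE` are assumptions for illustration, not the actual dashboard code.

```python
import asyncio

CLEANUP_BATCH_SIZE = 100  # illustrative: small batches keep each slice short

async def cleanup_dead_actors(dead_actor_ids: list, actors: dict) -> None:
    """Remove dead actors in small batches, yielding between batches."""
    for i in range(0, len(dead_actor_ids), CLEANUP_BATCH_SIZE):
        for actor_id in dead_actor_ids[i:i + CLEANUP_BATCH_SIZE]:
            actors.pop(actor_id, None)
        # Yield to the event loop so API handlers on the same thread
        # stay responsive even when many actors must be cleaned up.
        await asyncio.sleep(0)
```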

@rickyyx (Contributor) commented Nov 10, 2022

  • Increase the max actors to cache from 1K to 10K. It makes no sense for it to be 1K given that our scalability limit is 10K concurrent actors. I even feel like it should be something like 20K.

What would be a measurement to see whether this limit has a significant impact and could be adjusted?

@rkooo567 (Author) commented:

> What would be a measurement to see whether this limit has a significant impact and could be adjusted?

I think the workload I am running (a stressful actor workload) is the one that can measure it. But I wonder if this can impact prod (e.g., the snapshot API becoming slow or broken due to the large API response).

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 10, 2022
@rkooo567 rkooo567 merged commit 9da53e3 into ray-project:master Nov 11, 2022
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Dec 16, 2022
Signed-off-by: SangBin Cho <[email protected]>

WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Signed-off-by: SangBin Cho <[email protected]>


Signed-off-by: Weichen Xu <[email protected]>
5 participants