[Dashboard] Optimize and backpressure actor_head.py #29580
Conversation
Signed-off-by: SangBin Cho <[email protected]>
Could you add some more details to the description for the changes and issues? I might be missing the context here, thank you!
Saw the slack thread.
Any before/after profiling comparisons?
@ericl Oh, I forgot to update it. Let me share it here. Note that the "after" flamegraph includes the overhead of the /actors endpoint (so I just didn't include that page in the screenshot). As you can see, all remaining overhead is MessageToDict. We can remove this overhead by delaying the protobuf processing to query time.
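The idea of deferring the protobuf-to-dict conversion to query time can be sketched as follows. This is a minimal illustration, not Ray's actual code: `LazyActorRecord` and its fields are hypothetical names, and `to_dict` stands in for `google.protobuf.json_format.MessageToDict`.

```python
# Sketch: store raw protobuf messages and convert them to dicts lazily,
# so the MessageToDict cost is paid once at query time instead of on
# every pubsub update. All names here are illustrative.

class LazyActorRecord:
    """Wraps a protobuf message; converts it to a dict only when queried."""

    def __init__(self, message, to_dict):
        self._message = message  # raw protobuf message, cheap to store
        self._to_dict = to_dict  # e.g. google.protobuf.json_format.MessageToDict
        self._cached = None      # dict form, filled on first access

    def as_dict(self):
        # Convert on first access and memoize, so repeated queries
        # do not repeat the expensive conversion.
        if self._cached is None:
            self._cached = self._to_dict(self._message)
        return self._cached
```

With this shape, a burst of pubsub updates only stores cheap message objects; the conversion happens for the (far fewer) records that a dashboard query actually touches.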
test_snapshot
changes look good to me, stamping as codeowner
@rickyyx I will update the description. I also forgot to add comments that explain the PR...
@@ -25,7 +28,7 @@ class DataSource:
     node_physical_stats = Dict()
Changing the node-related Dicts fails some tests now. Since they don't incur high overhead, we will only modify the actor data sources for now.
@@ -153,39 +138,53 @@ def process_actor_data_from_pubsub(actor_id, actor_table_data):
     # If the actor is not newly registered but updated, we only update
     # state-related fields.
     if actor_table_data["state"] != "DEPENDENCIES_UNREADY":
-        actor_table_data_copy = dict(DataSource.actors[actor_id])
+        actor_table_data_copy = DataSource.actors[actor_id]
The copy is unnecessary, as the dict is not immutable.
     node_id = actor["address"].get("rayletId")
-    if node_id:
+    if node_id and node_id != actor_consts.NIL_NODE_ID:
it's a separate bug I found
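The fix above guards against a nil sentinel node ID slipping through the truthiness check. A minimal illustration of the pattern follows; the constant's value is hypothetical (the real one lives in Ray's `actor_consts`), and `resolve_node_id` is an illustrative helper, not Ray's API.

```python
# A nil node ID is a non-empty sentinel string (truthy!), so checking
# `if node_id:` alone is not enough. The exact value below is
# illustrative; Ray defines the real constant in actor_consts.
NIL_NODE_ID = "0" * 56

def resolve_node_id(actor):
    """Return the actor's node ID, or None if it is unset or the nil sentinel."""
    node_id = actor.get("address", {}).get("rayletId")
    if node_id and node_id != NIL_NODE_ID:
        return node_id
    return None
```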
    @routes.get("/logical/actors")
    @dashboard_optional_utils.aiohttp_cache
    async def get_all_actors(self, req) -> aiohttp.web.Response:
        return rest_response(
            success=True, message="All actors fetched.", actors=DataSource.actors
        )

    @routes.get("/logical/kill_actor")
    async def kill_actor(self, req) -> aiohttp.web.Response:
for old dashboard
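The `aiohttp_cache` decorator on the `/logical/actors` route above memoizes the response for a short period, so bursts of identical dashboard queries don't each recompute the full actor table. A minimal sketch of such a TTL cache follows; the name `ttl_cache`, the TTL value, and the internals are all illustrative, not Ray's actual implementation (which also keys on the request).

```python
import functools
import time

def ttl_cache(ttl_seconds=2.0, clock=time.monotonic):
    """Cache an async handler's result for a short TTL.

    Sketch only: ignores handler arguments; a real cache (like Ray's
    aiohttp_cache) would key on the request as well.
    """
    def decorator(func):
        cached = {"value": None, "expires": float("-inf")}

        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            now = clock()
            if now >= cached["expires"]:
                # Expired (or never computed): recompute and stamp a new TTL.
                cached["value"] = await func(*args, **kwargs)
                cached["expires"] = now + ttl_seconds
            return cached["value"]

        return wrapper
    return decorator
```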
@@ -193,6 +193,11 @@ async def get_progress(self, req):
                 success=False,
                 message=e.message,
             )
+        except aiohttp.client_exceptions.ClientConnectorError as e:
also another bug I found
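The fix above maps connection-level failures (e.g. the downstream agent being down) to an error payload instead of an unhandled 500. The pattern can be sketched without a network dependency; `ConnectionFailed` below is a stand-in for `aiohttp.client_exceptions.ClientConnectorError`, and `call_endpoint` is an illustrative helper, not Ray's API.

```python
class ConnectionFailed(Exception):
    """Stand-in for aiohttp.client_exceptions.ClientConnectorError."""

def call_endpoint(fetch):
    """Run `fetch` and map connection-level failures to an error payload.

    Without the except branch, a dead downstream service would surface
    as an unhandled exception (a 500) rather than a structured error.
    """
    try:
        return {"success": True, "result": fetch()}
    except ConnectionFailed as e:
        return {"success": False, "message": str(e)}
```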
@@ -642,7 +642,8 @@ Status GcsActorManager::CreateActor(const ray::rpc::CreateActorRequest &request,
   actor->UpdateState(rpc::ActorTableData::PENDING_CREATION);
   const auto &actor_table_data = actor->GetActorTableData();
   // Pub this state for dashboard showing.
-  RAY_CHECK_OK(gcs_publisher_->PublishActor(actor_id, actor_table_data, nullptr));
+  RAY_CHECK_OK(gcs_publisher_->PublishActor(
+      actor_id, *GenActorDataOnlyWithStates(actor_table_data), nullptr));
We can publish only the state-related subset of the data in this case, since the full data was already published when the actor was registered.
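The idea of publishing only state-related fields after registration can be sketched in Python (the C++ change above does this on the protobuf message). The field names below are illustrative, not the exact `ActorTableData` schema, and the function name mirrors the C++ helper only for readability.

```python
# Fields that change as an actor moves through its lifecycle. Immutable
# data (class name, serialized runtime env, etc.) is published once at
# registration, so later updates can carry just this subset. Field
# names here are illustrative.
STATE_FIELDS = ("state", "numRestarts", "timestamp", "pid", "exitDetail")

def gen_actor_data_only_with_states(actor_table_data):
    """Return a pared-down update containing only mutable state fields."""
    return {k: v for k, v in actor_table_data.items() if k in STATE_FIELDS}
```

Keeping per-update messages small is what reduces both the publisher's serialization cost and the dashboard's MessageToDict cost downstream.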
Made additional changes.
What would be a good measurement to verify that this limit has no significant impact, and to decide how it could be adjusted?
I think the workload I am running (a stressful actor workload) is the one that can measure it. But I wonder if this could impact prod (e.g., the snapshot API becoming slow or broken due to a large API response).
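The backpressure in the PR title can be sketched as a bounded buffer sitting between the pubsub subscriber and the processing loop, so a burst of actor events cannot grow memory without bound. This is a minimal synchronous illustration; the real limit, drop policy, and async machinery live in Ray's dashboard code.

```python
from collections import deque

class BoundedEventBuffer:
    """Buffer actor events with a hard cap.

    When the buffer is full, the oldest pending update is evicted
    (deque with maxlen does this automatically), so the newest state
    for an actor always wins and memory stays bounded.
    """

    def __init__(self, max_events):
        self._events = deque(maxlen=max_events)
        self.dropped = 0  # how many events were evicted under pressure

    def push(self, event):
        if len(self._events) == self._events.maxlen:
            self.dropped += 1  # the append below evicts the oldest event
        self._events.append(event)

    def drain(self):
        """Hand all buffered events to the processing loop and reset."""
        events = list(self._events)
        self._events.clear()
        return events
```

The `dropped` counter is one concrete measurement for the question above: if it stays at zero under a stressful actor workload, the limit is not biting; if it climbs, the limit (or the drop policy) needs adjusting.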
Why are these changes needed?
This optimizes the actor head CPU usage and guarantees a stable API response from the dashboard when many actor events are published to drivers. The script below was used for testing, and I could reproduce the same level of delay as many_nodes_actor_test (250 nodes + 10k actors).
The profiling result after this PR is here; #29580 (comment)
After this PR, most of the remaining overhead is pure MessageToDict. We can drastically reduce this by delaying the protobuf processing to query time.
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.