[core] Fix a corner case where QueryAllWorkerStates never return (ray-project#37496)

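In rare cases, when every worker dies after the worker list has been fetched, no GetCoreWorkerStats request is sent, so the reply callback is never invoked and the RPC request never returns.

This PR fixes the issue by replying immediately when all workers are dead.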
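
The failure mode can be reproduced outside the raylet with a small model. The sketch below is not the Ray implementation: `FakeWorker`, `PendingRpcs`, `QueryAll`, `CountReplies`, and the `apply_fix` flag are hypothetical names introduced for illustration. It only assumes what the diff shows, namely that dead workers are counted as replied without sending a request, and that the final reply is sent from inside a per-worker response callback.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for a raylet worker; only liveness matters here.
struct FakeWorker {
  bool dead;
  bool IsDead() const { return dead; }
};

// Hypothetical stand-in for in-flight async RPCs: response callbacks that
// will be run later by the caller.
using PendingRpcs = std::vector<std::function<void()>>;

// Fans out one query per live worker and calls `reply` once every worker is
// accounted for. `apply_fix` toggles the all-dead check added by this commit.
void QueryAll(const std::vector<FakeWorker> &workers, PendingRpcs &pending,
              std::function<void()> reply, bool apply_fix) {
  auto replied = std::make_shared<std::size_t>(0);
  const std::size_t num_workers = workers.size();
  bool all_dead = true;
  for (const auto &worker : workers) {
    if (worker.IsDead()) {
      *replied += 1;  // Counted as answered, but no callback will ever run.
      continue;
    }
    all_dead = false;
    pending.push_back([replied, num_workers, reply]() {
      *replied += 1;
      assert(*replied <= num_workers);
      if (*replied == num_workers) {
        reply();  // The only place the pre-fix code replied.
      }
    });
  }
  // With no live worker, nothing above will ever call `reply`.
  if (apply_fix && all_dead) {
    reply();
  }
}

// Runs QueryAll, delivers every outstanding response, and returns how many
// times the final reply was sent.
int CountReplies(const std::vector<FakeWorker> &workers, bool apply_fix) {
  int replies = 0;
  PendingRpcs pending;
  QueryAll(workers, pending, [&replies]() { ++replies; }, apply_fix);
  for (auto &rpc : pending) {
    rpc();
  }
  return replies;
}
```

In this model, `CountReplies` returns 0 for an all-dead worker list when `apply_fix` is false, which corresponds to the request that never returns, and 1 when it is true. Lists containing at least one live worker reply exactly once either way, so the extra check does not cause a double reply.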
fishbone committed Jul 19, 2023
1 parent 6c432f2 commit 45655b7
Showing 1 changed file with 6 additions and 0 deletions.
6 changes: 6 additions & 0 deletions src/ray/raylet/node_manager.cc
@@ -894,10 +894,13 @@ void NodeManager::QueryAllWorkerStates(
  // Query all workers.
  auto rpc_replied = std::make_shared<size_t>(0);
  auto num_workers = all_workers.size();
  bool all_dead = true;
  for (const auto &worker : all_workers) {
    if (worker->IsDead()) {
      *rpc_replied += 1;
      continue;
    }
    all_dead = false;
    rpc::GetCoreWorkerStatsRequest request;
    request.set_intended_worker_id(worker->WorkerId().Binary());
    request.set_include_memory_info(include_memory_info);
@@ -922,6 +925,9 @@ void NodeManager::QueryAllWorkerStates(
          }
        });
  }
  if (all_dead) {
    send_reply_callback(Status::OK(), nullptr, nullptr);
  }
}

// This warns users that there could be the resource deadlock. It works this way;