Skip to content

Commit

Permalink
[core] Fix raylet memory leak in the wrong setup. (ray-project#35647)
Browse files Browse the repository at this point in the history
## Why are these changes needed?

In the old resource broadcasting, it uses seq and when the seq got delayed, it'll return immediately and this is the place where leak could happen. Don't reply gRPC will in the end lead to a leak of resource.

In ray syncer, we don't have this any more, but if in the wrong setup, a bad GCS might talk this this raylet since we don't have any guards right now and the bad GCS might send node info to this node.

In this way, the leak will be triggered.

This fix does two things to protect the code:

- If it's syncer based, it'll just reject the request.
- Also fixed the bug in the old code path.

## Related issue number

ray-project#35632
ray-project#35310
  • Loading branch information
fishbone committed May 23, 2023
1 parent 852f0f1 commit 86fab17
Showing 1 changed file with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions src/ray/raylet/node_manager.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1742,6 +1742,13 @@ void NodeManager::ProcessPushErrorRequestMessage(const uint8_t *message_data) {
void NodeManager::HandleUpdateResourceUsage(rpc::UpdateResourceUsageRequest request,
rpc::UpdateResourceUsageReply *reply,
rpc::SendReplyCallback send_reply_callback) {
if (RayConfig::instance().use_ray_syncer()) {
RAY_LOG(WARNING)
<< "There is a GCS outside of this cluster sending message to this raylet.";
send_reply_callback(Status::OK(), nullptr, nullptr);
return;
}

rpc::ResourceUsageBroadcastData resource_usage_batch;
resource_usage_batch.ParseFromString(request.serialized_resource_usage_batch());
// When next_resource_seq_no_ == 0 it means it just started.
Expand All @@ -1758,6 +1765,7 @@ void NodeManager::HandleUpdateResourceUsage(rpc::UpdateResourceUsageRequest requ
<< next_resource_seq_no_ << ", but got: " << resource_usage_batch.seq_no() << ".";
if (resource_usage_batch.seq_no() < next_resource_seq_no_) {
RAY_LOG(WARNING) << "Discard the the resource update since local version is newer";
send_reply_callback(Status::OK(), nullptr, nullptr);
return;
}
}
Expand Down

0 comments on commit 86fab17

Please sign in to comment.