[Core] ObjectStore fails to pull object, possibly because node info is missing #32046

Closed
MissiontoMars opened this issue Jan 30, 2023 · 7 comments · Fixed by #33115
Labels: bug, core, core-object-store, P0, Ray 2.5, size:large

Comments


MissiontoMars commented Jan 30, 2023

What happened + What you expected to happen

As mentioned here: https://discuss.ray.io/t/raylet-object-manager-cc-couldnt-send-pull-request-from/9027, we are hitting problems pulling remote objects in our production environment.

(raylet, ip=[fdbd:dc01:16:165:ad00::48]) [2023-01-30 14:47:12,412 E 27 27] (raylet) object_manager.cc:293: Couldn't send pull request from d9969738fb6ac4cb998e1b12a4d8acfea969cd2ab45a7cc6c7fda954 to fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3 of object 00d31a2771aa3d2fd74413d0eaf37ce41b9135450800000003000000 , setup rpc connection failed.

Our production Ray cluster has 240 worker nodes and 1400 actors in total, running Ray 2.0.0 (without any modifications to Ray Core).

NOTE: For convenience, the rest of this report uses nodeA to refer to d9969738fb6ac4cb998e1b12a4d8acfea969cd2ab45a7cc6c7fda954 and nodeB to refer to fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3.

After digging into the raylet and GCS code and logs, we found that the GCS pubsub message carrying the node info may have been lost.

RAY_LOG(ERROR) << "Couldn't send pull request from " << self_node_id_ << " to "

The Couldn't send pull request from log means that the RPC client from nodeA to nodeB is null.

It seems that nodeA cannot get the connection info of nodeB. Following the code further:
auto node_info = gcs_client_->Nodes().Get(connection_info.node_id);

auto entry = node_cache_.find(node_id);

nodeB does not exist in the local node cache.
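
For context, the lookup path can be boiled down to a sketch like the one below (a simplified illustration, not the actual Ray source; NodeCache, NodeInfo, and OnNodeNotification are invented names for this example). The raylet's local node cache is populated only by GCS pubsub notifications, so if the notification for nodeB never arrives, the lookup returns nothing and the pull request cannot be sent.

// Simplified, self-contained sketch of the failure mode (not the actual Ray source):
// the local node cache is filled only when a "Received notification for node id"
// message arrives from the GCS; a lookup for a node that was never announced
// returns nothing, which surfaces as "Couldn't send pull request".
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

struct NodeInfo {
  std::string address;  // e.g. "[fdbd:dc01:28:311:a200::53]"
};

class NodeCache {
 public:
  // Invoked when a GCS pubsub node notification is received.
  void OnNodeNotification(const std::string &node_id, NodeInfo info) {
    node_cache_[node_id] = std::move(info);
  }
  // Mirrors the node_cache_.find(node_id) lookup quoted above.
  std::optional<NodeInfo> Get(const std::string &node_id) const {
    auto entry = node_cache_.find(node_id);
    if (entry == node_cache_.end()) return std::nullopt;
    return entry->second;
  }

 private:
  std::unordered_map<std::string, NodeInfo> node_cache_;
};

int main() {
  NodeCache cache;  // nodeA's cache; the pubsub message announcing nodeB was lost
  const std::string node_b = "fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3";
  if (!cache.Get(node_b)) {
    std::cerr << "Couldn't send pull request: no connection info for " << node_b << "\n";
  }
  return 0;
}

In this model, a single dropped pubsub notification is enough to make every subsequent pull from nodeB fail on nodeA, which would match the hang-forever behavior reported below.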

Then, checking raylet.out on nodeA, there is no log line like Received notification for node id for nodeB.

RAY_LOG(INFO) << "Received notification for node id = " << node_id

According to the GCS log, nodeB was registered normally:

[2023-01-30 11:25:03,809 I 12 12] (gcs_server) gcs_node_manager.cc:42: Registering node info, node id = fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3, address = [fdbd:dc01:28:311:a200::53], node name = [fdbd:dc01:28:311:a200::53]
[2023-01-30 11:25:03,809 I 12 12] (gcs_server) gcs_node_manager.cc:48: Finished registering node info, node id = fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3, address = [fdbd:dc01:28:311:a200::53], node name = [fdbd:dc01:28:311:a200::53]

So nodeA evidently never received nodeB's node info from the GCS. Meanwhile, checking other nodes such as nodeC, the node info of nodeB was received:

[2023-01-30 11:25:03,811 I 27 27] (raylet) accessor.cc:608: Received notification for node id = fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3, IsAlive = 1

Versions / Dependencies

Version: ray 2.0.0

Reproduction script

None

Issue Severity

None

MissiontoMars added the bug and triage labels Jan 30, 2023
MissiontoMars (Author)

@stephanie-wang @mwtian @rkooo567 Could you help analyze this issue?

scv119 added the core and P0 labels and removed the triage label Feb 10, 2023
cadedaniel assigned rkooo567 and scv119 and unassigned rkooo567 Feb 15, 2023
scv119 (Contributor) commented Mar 1, 2023

@MissiontoMars are you using Redis-based pubsub, or the OSS one?

scv119 added the P0, Ray 2.4, and size:large labels and removed the P0, Ray 2.4, and size:medium labels Mar 6, 2023
MissiontoMars (Author) commented Mar 7, 2023

@MissiontoMars are you using Redis-based pubsub, or the OSS one?

Sorry for the late reply.

No additional options were configured when launching the Ray cluster, so I guess it is GCS pubsub.

The problem may be related to high CPU load on the dashboard agent. I found it was mostly busy handling metrics, so I disabled it to reduce CPU usage. Since then the problem has barely occurred.

rkooo567 (Contributor)

@MissiontoMars the agent CPU usage issue should have been fixed in recent versions (from 2.1).

rkooo567 (Contributor)

One question: when this happens, does it hang forever, or does it eventually resolve? I wonder whether it is data loss or just a slowdown.

MissiontoMars (Author)

It hangs forever.

scv119 removed the Ray 2.4 label Apr 17, 2023
scv119 added the Ray 2.5 label Apr 17, 2023
scv119 (Contributor) commented Apr 17, 2023

About to merge soon.

scv119 added a commit that referenced this issue Apr 19, 2023
Why are these changes needed?
#32046 indicates that pubsub might lose data, especially when the subscriber is under load. After examining the protocol, it seems one bug is that the publisher fails to handle publish failures: when we push a message out of the mailbox, we delete the message being sent regardless of RPC failures.

This PR addresses the problem by adding a monotonically increasing sequence_id to each message, and only deleting messages once the subscriber has acknowledged receiving them.

The sequence_id is generated per publisher, regardless of channels. This means that if multiple channels exist for the same publisher, each channel might not see contiguous sequence numbers. This also assumes the invariant that a subscriber object subscribes to only one publisher.

We also rely on the pubsub protocol guarantee that at most one outgoing push request is in flight.

This also handles GCS failover: we track the publisher_id on both the publisher and the subscriber. When the GCS fails over, the publisher_id changes, so both the publisher and the subscriber forget the previous state.
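
To illustrate the idea in the commit message, here is a minimal, self-contained sketch of ack-based retention with a sequence_id (a hypothetical illustration, not the actual Ray implementation; the Publisher, Message, and HandleAck names are made up for this example). Messages stay in the publisher's mailbox until the subscriber acknowledges their sequence_id, so a failed push RPC results in a resend rather than silent loss.

// Simplified sketch (not the actual Ray implementation) of ack-based retention:
// messages are tagged with a monotonically increasing sequence_id and are only
// dropped from the mailbox once the subscriber acknowledges them.
#include <cstdint>
#include <deque>
#include <iostream>
#include <string>

struct Message {
  int64_t sequence_id;
  std::string payload;
};

class Publisher {
 public:
  void Publish(std::string payload) {
    mailbox_.push_back({++next_sequence_id_, std::move(payload)});
  }
  // Called when the subscriber acknowledges everything up to max_acked_id.
  // Only now is it safe to drop those messages; if the push RPC fails, no ack
  // arrives and the messages remain queued for the next resend.
  void HandleAck(int64_t max_acked_id) {
    while (!mailbox_.empty() && mailbox_.front().sequence_id <= max_acked_id) {
      mailbox_.pop_front();
    }
  }
  const std::deque<Message> &Pending() const { return mailbox_; }

 private:
  int64_t next_sequence_id_ = 0;
  std::deque<Message> mailbox_;
};

int main() {
  Publisher pub;
  pub.Publish("node info: nodeB is alive");
  pub.Publish("node info: nodeC is alive");
  pub.HandleAck(1);  // subscriber confirmed message 1; message 2 is retained for resend
  std::cout << "pending messages: " << pub.Pending().size() << "\n";  // prints 1
  return 0;
}

The key design choice, per the commit message, is that deletion is driven by the subscriber's acknowledgement rather than by the act of sending, which is what the earlier protocol got wrong.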
elliottower pushed a commit to elliottower/ray that referenced this issue Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this issue May 4, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this issue May 16, 2023