[Core] ObjectStore fails to pull object, possibly because node info is missing #32046

Closed
MissiontoMars opened this issue Jan 30, 2023 · 7 comments · Fixed by #33115
Labels: bug, core, core-object-store, P0, Ray 2.5, size:large

Comments


MissiontoMars commented Jan 30, 2023

What happened + What you expected to happen

As mentioned here: https://discuss.ray.io/t/raylet-object-manager-cc-couldnt-send-pull-request-from/9027, we are hitting problems pulling remote objects in our production environment.

(raylet, ip=[fdbd:dc01:16:165:ad00::48]) [2023-01-30 14:47:12,412 E 27 27] (raylet) object_manager.cc:293: Couldn't send pull request from d9969738fb6ac4cb998e1b12a4d8acfea969cd2ab45a7cc6c7fda954 to fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3 of object 00d31a2771aa3d2fd74413d0eaf37ce41b9135450800000003000000 , setup rpc connection failed.

Our production Ray cluster has 240 worker nodes and 1400 actors in total, running Ray 2.0.0 (without any modifications to Ray Core).

NOTE: For convenience, the rest of this report uses nodeA to refer to d9969738fb6ac4cb998e1b12a4d8acfea969cd2ab45a7cc6c7fda954 and nodeB to refer to fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3.

After digging into the raylet and GCS code and logs, we found that the GCS pubsub message carrying the node info may have been lost.

RAY_LOG(ERROR) << "Couldn't send pull request from " << self_node_id_ << " to "

The Couldn't send pull request from log means that the RPC client from nodeA to nodeB is null.

It seems that nodeA cannot get the connection info of nodeB. Following the code further:
auto node_info = gcs_client_->Nodes().Get(connection_info.node_id);

auto entry = node_cache_.find(node_id);

nodeB does not exist in the local node cache.
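
For context, the lookup path can be boiled down to a sketch like the one below (a simplified illustration, not the actual Ray source; NodeCache, NodeInfo, and OnNodeNotification are invented names for this example). The raylet's local node cache is populated only by GCS pubsub notifications, so if the notification for nodeB never arrives, the lookup returns nothing and the pull request cannot be sent.

// Simplified, self-contained sketch of the failure mode (not the actual Ray source):
// the local node cache is filled only when a "Received notification for node id"
// message arrives from the GCS; a lookup for a node that was never announced
// returns nothing, which surfaces as "Couldn't send pull request".
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

struct NodeInfo {
  std::string address;  // e.g. "[fdbd:dc01:28:311:a200::53]"
};

class NodeCache {
 public:
  // Invoked when a GCS pubsub node notification is received.
  void OnNodeNotification(const std::string &node_id, NodeInfo info) {
    node_cache_[node_id] = std::move(info);
  }
  // Mirrors the node_cache_.find(node_id) lookup quoted above.
  std::optional<NodeInfo> Get(const std::string &node_id) const {
    auto entry = node_cache_.find(node_id);
    if (entry == node_cache_.end()) return std::nullopt;
    return entry->second;
  }

 private:
  std::unordered_map<std::string, NodeInfo> node_cache_;
};

int main() {
  NodeCache cache;  // nodeA's cache; the pubsub message announcing nodeB was lost
  const std::string node_b = "fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3";
  if (!cache.Get(node_b)) {
    std::cerr << "Couldn't send pull request: no connection info for " << node_b << "\n";
  }
  return 0;
}

In this model, a single dropped pubsub notification is enough to make every subsequent pull from nodeB fail on nodeA, which would match the hang-forever behavior reported below.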

Then, checking raylet.out on nodeA, there is no log line like Received notification for node id for nodeB.

RAY_LOG(INFO) << "Received notification for node id = " << node_id

According to the GCS log, nodeB was registered normally:

[2023-01-30 11:25:03,809 I 12 12] (gcs_server) gcs_node_manager.cc:42: Registering node info, node id = fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3, address = [fdbd:dc01:28:311:a200::53], node name = [fdbd:dc01:28:311:a200::53]
[2023-01-30 11:25:03,809 I 12 12] (gcs_server) gcs_node_manager.cc:48: Finished registering node info, node id = fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3, address = [fdbd:dc01:28:311:a200::53], node name = [fdbd:dc01:28:311:a200::53]

So nodeA evidently never received nodeB's node info from the GCS. Meanwhile, checking other nodes such as nodeC, the node info of nodeB was received:

[2023-01-30 11:25:03,811 I 27 27] (raylet) accessor.cc:608: Received notification for node id = fd54f76b74986c8e913dbab01a94b9c13881e98981bf0f417a3a62d3, IsAlive = 1

Versions / Dependencies

Version: ray 2.0.0

Reproduction script

None

Issue Severity

None

MissiontoMars added the bug and triage labels Jan 30, 2023
MissiontoMars (Author)

@stephanie-wang @mwtian @rkooo567 Could you help analyze this issue?

scv119 added the core and P0 labels and removed the triage label Feb 10, 2023
cadedaniel assigned rkooo567 and scv119 and unassigned rkooo567 Feb 15, 2023
scv119 (Contributor) commented Mar 1, 2023

@MissiontoMars are you using Redis-based pubsub, or the OSS one?

scv119 added the P0, Ray 2.4, and size:large labels and removed the P0, Ray 2.4, and size:medium labels Mar 6, 2023
MissiontoMars (Author) commented Mar 7, 2023

@MissiontoMars are you using Redis-based pubsub, or the OSS one?

Sorry for the late reply.

No additional options were configured when launching the Ray cluster, so I guess it is GCS pubsub.

The problem may be related to high CPU load on the dashboard agent. I found it was mostly busy handling metrics, so I disabled it to reduce CPU usage. Since then the problem has barely occurred.

rkooo567 (Contributor)

@MissiontoMars the agent CPU usage issue should have been fixed in recent versions (from 2.1).

rkooo567 (Contributor)

One question: when this happens, does it hang forever, or does it eventually resolve? I wonder whether it is data loss or just a slowdown.

MissiontoMars (Author)

It hangs forever.

scv119 removed the Ray 2.4 label Apr 17, 2023
scv119 added the Ray 2.5 label Apr 17, 2023
scv119 (Contributor) commented Apr 17, 2023

About to merge soon.

scv119 added a commit that referenced this issue Apr 19, 2023
Why are these changes needed?
#32046 indicates that pubsub might lose data, especially when the subscriber is under load. After examining the protocol, it seems one bug is that the publisher fails to handle publish failures: when we push a message out of the mailbox, we delete the message being sent regardless of RPC failures.

This PR addresses the problem by adding a monotonically increasing sequence_id to each message, and only deleting messages once the subscriber has acknowledged receiving them.

The sequence_id is generated per publisher, regardless of channels. This means that if multiple channels exist for the same publisher, each channel might not see contiguous sequence numbers. This also assumes the invariant that a subscriber object subscribes to only one publisher.

We also rely on the pubsub protocol guarantee that at most one outgoing push request is in flight.

This also handles GCS failover: we track the publisher_id on both the publisher and the subscriber. When the GCS fails over, the publisher_id changes, so both the publisher and the subscriber forget the previous state.
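
To illustrate the idea in the commit message, here is a minimal, self-contained sketch of ack-based retention with a sequence_id (a hypothetical illustration, not the actual Ray implementation; the Publisher, Message, and HandleAck names are made up for this example). Messages stay in the publisher's mailbox until the subscriber acknowledges their sequence_id, so a failed push RPC results in a resend rather than silent loss.

// Simplified sketch (not the actual Ray implementation) of ack-based retention:
// messages are tagged with a monotonically increasing sequence_id and are only
// dropped from the mailbox once the subscriber acknowledges them.
#include <cstdint>
#include <deque>
#include <iostream>
#include <string>

struct Message {
  int64_t sequence_id;
  std::string payload;
};

class Publisher {
 public:
  void Publish(std::string payload) {
    mailbox_.push_back({++next_sequence_id_, std::move(payload)});
  }
  // Called when the subscriber acknowledges everything up to max_acked_id.
  // Only now is it safe to drop those messages; if the push RPC fails, no ack
  // arrives and the messages remain queued for the next resend.
  void HandleAck(int64_t max_acked_id) {
    while (!mailbox_.empty() && mailbox_.front().sequence_id <= max_acked_id) {
      mailbox_.pop_front();
    }
  }
  const std::deque<Message> &Pending() const { return mailbox_; }

 private:
  int64_t next_sequence_id_ = 0;
  std::deque<Message> mailbox_;
};

int main() {
  Publisher pub;
  pub.Publish("node info: nodeB is alive");
  pub.Publish("node info: nodeC is alive");
  pub.HandleAck(1);  // subscriber confirmed message 1; message 2 is retained for resend
  std::cout << "pending messages: " << pub.Pending().size() << "\n";  // prints 1
  return 0;
}

The key design choice, per the commit message, is that deletion is driven by the subscriber's acknowledgement rather than by the act of sending, which is what the earlier protocol got wrong.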
elliottower pushed a commit to elliottower/ray that referenced this issue Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this issue May 4, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this issue May 16, 2023