
[Core][pubsub] handle failures when publish failed. #33115

Merged
merged 41 commits into ray-project:master on Apr 19, 2023

Conversation

scv119
Contributor

@scv119 scv119 commented Mar 7, 2023

Why are these changes needed?

#32046 indicates that pubsub might lose data, especially when the subscriber is under load. After examining the protocol, it seems one bug is that the publisher fails to handle publish failures: when we push a message from the mailbox, we delete the message being sent regardless of RPC failures.

This PR addresses the problem by adding a monotonically increasing sequence_id to each message, and only deleting a message once the subscriber has acknowledged receiving it.

The sequence_id sequence is generated per publisher, regardless of channel. This means that if multiple channels exist for the same publisher, each channel might not see contiguous sequence_ids. This also assumes the invariant that a subscriber object only subscribes to one publisher.
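
To illustrate, here is a minimal sketch of the publisher-side bookkeeping described above (illustrative names only, not the actual src/ray/pubsub/publisher.cc code): a single per-publisher counter shared by all channels, and deletion only on acknowledgment.

#include <cstdint>
#include <deque>
#include <string>
#include <utility>

struct Message {
  int64_t sequence_id;  // Valid sequence_ids start from 1.
  std::string payload;
};

class PublisherSketch {
 public:
  // Every published message gets the next id from one per-publisher
  // counter, regardless of channel, so a single channel may see gaps.
  void Publish(std::string payload) {
    mailbox_.push_back(Message{++next_sequence_id_, std::move(payload)});
  }

  // Deletion happens only here, once the subscriber has acknowledged
  // processing up to `max_processed_sequence_id`; a failed push RPC
  // therefore leaves the message in the mailbox for the next attempt.
  void HandleAck(int64_t max_processed_sequence_id) {
    while (!mailbox_.empty() &&
           mailbox_.front().sequence_id <= max_processed_sequence_id) {
      mailbox_.pop_front();
    }
  }

 private:
  int64_t next_sequence_id_ = 0;  // Per publisher, shared across channels.
  std::deque<Message> mailbox_;   // Messages not yet acknowledged.
};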

We also rely on the pubsub protocol's guarantee that at most one push request is in flight at any time.

This also handles the GCS failover case. We do so by tracking the publisher_id on both the publisher and the subscriber. When GCS fails over, the publisher_id will be different, so both the publisher and the subscriber will forget the state from the previous publisher.
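
As a hedged sketch of that failover rule (hypothetical names, not the actual Ray API), the subscriber side might look like this: a publisher_id mismatch drops all state from the previous publisher before the usual sequence_id check runs.

#include <cstdint>

// Subscriber-side state for a single publisher (illustrative only).
struct SubscriberStateSketch {
  uint64_t known_publisher_id = 0;        // 0 == no publisher seen yet.
  int64_t max_processed_sequence_id = 0;

  // Returns true if the pushed message should be processed.
  bool OnPushedMessage(uint64_t publisher_id, int64_t sequence_id) {
    if (publisher_id != known_publisher_id) {
      // A different publisher_id means GCS failed over: forget the old
      // publisher's state and accept the new sequence numbering afresh.
      known_publisher_id = publisher_id;
      max_processed_sequence_id = 0;
    }
    if (sequence_id <= max_processed_sequence_id) {
      return false;  // Already processed; a retried (duplicate) push.
    }
    max_processed_sequence_id = sequence_id;
    return true;
  }
};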

  • Unit tests
  • Integration tests

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scv119 scv119 changed the title [Core][pubsub] handle failures when publish failed. [Core][pubsub][wip] handle failures when publish failed. Mar 7, 2023
@scv119 scv119 force-pushed the fix-try branch 3 times, most recently from 44a7e27 to 883e2a3 Compare March 8, 2023 09:37
@scv119 scv119 marked this pull request as ready for review March 8, 2023 15:33
@scv119 scv119 changed the title [Core][pubsub][wip] handle failures when publish failed. [Core][pubsub] handle failures when publish failed. Mar 8, 2023
python/ray/_private/gcs_pubsub.py
src/ray/pubsub/publisher.cc
src/ray/pubsub/subscriber.cc
Contributor

@clarng clarng left a comment


Nice

/// it has processed beyond the message's sequence_id.
///
/// Note:
/// - a valide sequence_id starts from 1.
Contributor


s/valide/valid/

Contributor

@rkooo567 rkooo567 left a comment


Generally, lgtm

Did I understand the high level behavior change correctly?

  1. We now guarantee at least once semantics
  2. The subscribe side processing is idempotent (same as status quo)

Also, some of the tests seem to fail (maybe we have a bug somewhere).

Lastly, this should work only when we do "resubscribe". When I wrote this code, I didn't add a resubscribe mechanism to the module (but @mwtian probably added this mechanism to the higher-level layer?). At that time, if the publish failed (meaning the long polling failed), the publisher was considered dead. Do you happen to know if we do resubscription in this scenario? I am 100% sure we don't do resubscription for non-GCS channels, but I am not sure about GCS.

src/ray/pubsub/publisher.cc
src/ray/pubsub/publisher.cc
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 13, 2023
@scv119
Contributor Author

scv119 commented Mar 13, 2023

Did I understand the high level behavior change correctly?

We now guarantee at least once semantics
The subscribe side processing is idempotent (same as status quo)

It's almost correct. We guarantee the subscriber will receive all published messages at least once after the subscription succeeds, on the non-lossy channels (i.e., some channels are capped by a mailbox queue size and can lose messages). We also guarantee exactly-once semantics if only network errors happen (no application errors).
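
A self-contained toy run of that claim (illustrative, not Ray code): if the ack RPC is lost, the publisher re-pushes the same message, and the subscriber's sequence_id watermark turns the duplicate into a no-op, so delivery is at-least-once while processing stays exactly-once under network-only failures.

#include <cstdint>
#include <iostream>
#include <set>

int main() {
  int64_t watermark = 0;        // Subscriber's max processed sequence_id.
  std::set<int64_t> processed;  // Messages actually handled.

  auto receive = [&](int64_t seq) {
    if (seq <= watermark) return;  // Duplicate delivery: drop.
    processed.insert(seq);
    watermark = seq;
  };

  receive(1);  // First delivery succeeds; assume the ack is then lost.
  receive(1);  // Publisher retries message 1: deduplicated, not reprocessed.
  receive(2);  // Next message once the retry round-trip succeeds.

  std::cout << processed.size() << " unique messages processed\n";  // Prints 2.
}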

Lastly, this should work only when we do "resubscribe". When I wrote this code, I didn't add a resubscribe mechanism to the module (but @mwtian probably added this mechanism to the higher-level layer?). At that time, if the publish failed (meaning the long polling failed), the publisher was considered dead. Do you happen to know if we do resubscription in this scenario? I am 100% sure we don't do resubscription for non-GCS channels, but I am not sure about GCS.

I spoke with @iycheng and confirmed that we do have resubscribe logic. I'll add an integration test to verify that.

@scv119 scv119 added the do-not-merge Do not merge this PR! label Mar 13, 2023
@scv119
Contributor Author

scv119 commented Mar 13, 2023

Synced with Sang; I'll add integration tests to verify it works e2e.

@scv119 scv119 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 17, 2023
Contributor

@rkooo567 rkooo567 left a comment


I will review it today

Contributor

@rkooo567 rkooo567 left a comment


LGTM, but I have one last question... I might be missing some invariants you assumed here though.


// clean up messages that have already been processed.
while (!mailbox_.empty() &&
       mailbox_.front()->sequence_id() <= max_processed_sequence_id) {
Contributor


I might be missing something, but isn't it going to cause issues if this happens?

  1. New GCS starts. It sent messages to other nodes (so the global seq_no > 0)
  2. publisher finds the subscriber doesn't match. Set max_processed_sequence_id for this subscriber == 0
  3. Publish is called to this subscriber. Since seq_no > 0, it will never be sent (because max_processed_sequence_id == 0)?

Contributor Author


hmm

Publish is called to this subscriber. Since seq_no > 0, it will never be sent (because max_processed_sequence_id == 0)?

On the sender: max_processed_sequence_id is only used for garbage collection; it does not prevent messages from being sent.
On the receiver: max_processed_sequence_id will also be reset to 0, so any message with sequence_id > 0 will be received.
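
A tiny walk-through of the scenario in the question, under these two points (illustrative values, not Ray code):

#include <cassert>
#include <cstdint>

int main() {
  // 1. New GCS has already published to other nodes, so its counter > 0.
  int64_t sender_seq = 5;

  // 2. publisher_id mismatch resets this subscriber's watermark to 0.
  int64_t receiver_watermark = 0;

  // 3. Publish: the sender never filters outgoing messages by the
  //    watermark (it only gates garbage collection), so the message is
  //    sent, and 6 > 0 means the receiver accepts it.
  int64_t pushed = ++sender_seq;
  if (pushed > receiver_watermark) {
    receiver_watermark = pushed;
  }
  assert(receiver_watermark == 6);
  return 0;
}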

@rkooo567
Contributor

Hmm, not sure the lint failure is related. Can you merge the latest master?

@scv119 scv119 merged commit 897a282 into ray-project:master Apr 19, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
Development

Successfully merging this pull request may close these issues.

[Core] ObjectStore fail to pull object, possibly because node info is missing
6 participants