[serve] Wait until replicas have finished recovering (with timeout) to broadcast LongPoll updates #34675

Merged
merged 1 commit into from
Apr 25, 2023

@edoakes edoakes commented Apr 21, 2023

Why are these changes needed?

When the controller recovers, it puts all replicas into the RECOVERING state. Replicas in that state are excluded from the long poll updates that list running replicas, so the controller broadcasts an update that effectively clears the available replicas in every handle.

This PR addresses this problem by avoiding broadcasting any updates until all replicas are fully recovered (or a 10s timeout is reached).
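
The wait-with-timeout behavior described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `ReplicaState`, `running_replicas`, `broadcast_after_recovery`, and the polling interval are hypothetical names and values. Only the RECOVERING state and the 10s timeout come from the description.

```python
import asyncio
from enum import Enum
from typing import Callable, Dict, List


class ReplicaState(Enum):
    # Hypothetical, reduced set of states for illustration only.
    RECOVERING = "RECOVERING"
    RUNNING = "RUNNING"


RECOVERY_TIMEOUT_S = 10.0  # the 10s timeout mentioned in the description


def running_replicas(states: Dict[str, ReplicaState]) -> List[str]:
    """Return only RUNNING replicas; RECOVERING ones are left out."""
    return sorted(
        tag for tag, state in states.items() if state is ReplicaState.RUNNING
    )


async def broadcast_after_recovery(
    states: Dict[str, ReplicaState],
    broadcast: Callable[[List[str]], None],
    timeout_s: float = RECOVERY_TIMEOUT_S,
    poll_interval_s: float = 0.01,  # assumed value, not from the PR
) -> bool:
    """Hold the broadcast until no replica is RECOVERING or the timeout hits.

    Returns True if every replica recovered in time, False on timeout.
    Either way, exactly one update is broadcast at the end.
    """

    async def wait_until_recovered() -> None:
        while any(s is ReplicaState.RECOVERING for s in states.values()):
            await asyncio.sleep(poll_interval_s)

    try:
        await asyncio.wait_for(wait_until_recovered(), timeout=timeout_s)
        recovered = True
    except asyncio.TimeoutError:
        recovered = False

    broadcast(running_replicas(states))
    return recovered
```

Broadcasting immediately while every replica is still RECOVERING would send an empty list, which is the failure mode described above. Deferring the broadcast means handles keep their previous view until there is something useful to send.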

We also wait to run the http_state update loop, because a proxy started before recovery finishes would have no replicas available and could not serve any traffic.
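
A rough sketch of that gating, assuming the recovery signal is an `asyncio.Event`. The attribute names `http_state` and `done_recovering_event` appear in the diff under review; `ControllerSketch`, `FakeHTTPState`, and `run_control_loop_step` are hypothetical stand-ins for illustration and not Ray Serve APIs.

```python
import asyncio
from typing import Optional


class FakeHTTPState:
    """Hypothetical stand-in for the proxy state manager."""

    def __init__(self) -> None:
        self.update_calls = 0

    def update(self) -> None:
        # In this sketch, update() only counts calls; a real implementation
        # would presumably start or health-check proxies here.
        self.update_calls += 1


class ControllerSketch:
    """Hypothetical controller holding only what the gating needs."""

    def __init__(self, http_state: Optional[FakeHTTPState]) -> None:
        self.http_state = http_state
        # Assumed to be an asyncio.Event that is set once recovery finishes
        # (or the recovery timeout is reached).
        self.done_recovering_event = asyncio.Event()

    def run_control_loop_step(self) -> bool:
        """Run one loop step; return True if http_state was updated."""
        if self.http_state and self.done_recovering_event.is_set():
            self.http_state.update()
            return True
        # Still recovering (or no HTTP state configured): skip this step so
        # no new proxy comes up without replica information.
        return False
```

Checking `is_set()` rather than awaiting the event keeps the step non-blocking, so the rest of the loop can continue to make progress while recovery is in flight.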

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@edoakes edoakes force-pushed the wait-to-broadcast-on-recovery branch from 34245dd to 90b3730 on April 24, 2023 15:16
Signed-off-by: Edward Oakes <[email protected]>

@sihanwang41 sihanwang41 left a comment

awesome

# Don't update http_state until after the done recovering event is set,
# otherwise we may start a new HTTP proxy but not broadcast it any
# info about available deployments & their replicas.
if self.http_state and self.done_recovering_event.is_set():

great catch.

@edoakes edoakes merged commit 38f4e44 into ray-project:master Apr 25, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…o broadcast `LongPoll` updates (ray-project#34675)

Signed-off-by: Jack He <[email protected]>
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
…o broadcast `LongPoll` updates (ray-project#34675)
