
[Serve] ServeHandle detects ActorError and drop replicas from target group #26685

Merged (10 commits) Jul 29, 2022

Conversation

simon-mo (Contributor) commented on Jul 18, 2022:

Why are these changes needed?

When the ServeController crashes, replica membership updates are paused. This means the ServeHandle will continue to send requests to replicas that also crashed during this time. This PR shows how we can detect actor failures locally from within the handle and remove those replicas from the group it load balances across.
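The core idea can be sketched as follows. This is a minimal, hedged illustration of the mechanism described above, not Ray Serve's actual implementation: a stand-in `ReplicaSet` drops a replica from its target group when a query against it fails with an actor-death error, so later requests avoid the dead replica even if the controller is down. The class and method names here are illustrative.

```python
class RayActorError(Exception):
    """Stand-in for ray.exceptions.RayActorError (illustrative)."""


class ReplicaSet:
    """Minimal stand-in for the handle's router-side replica set."""

    def __init__(self, replicas):
        # Map each replica to the set of queries currently in flight to it.
        self.in_flight_queries = {r: set() for r in replicas}

    def _reset_replica_iterator(self):
        # Subsequent requests only see the replicas still present in
        # in_flight_queries; in-flight queries are unaffected.
        self.replica_iterator = iter(list(self.in_flight_queries))

    def handle_query_result(self, replica, error):
        # If the replica's actor died, remove it from the target group
        # locally instead of waiting for a controller membership update.
        if isinstance(error, RayActorError):
            self.in_flight_queries.pop(replica, None)
            self._reset_replica_iterator()


rs = ReplicaSet(["replica-a", "replica-b"])
rs.handle_query_result("replica-a", RayActorError())
print(sorted(rs.in_flight_queries))  # ['replica-b']
```

Non-actor errors deliberately leave membership untouched, since they say nothing about whether the replica process is alive.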

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

python/ray/serve/router.py (review comment, outdated and resolved)
@simon-mo simon-mo marked this pull request as ready for review July 21, 2022 00:37
Signed-off-by: simon-mo <[email protected]>
simon-mo (Contributor, Author) commented:

@edoakes ready for review

ray.get(handle.remote(do_crash=True))

pids = ray.get([handle.remote() for _ in range(10)])
assert len(set(pids)) == 1
A reviewer (Contributor) commented:
assert len(handle.router._replica_set.in_flight_queries) == 1


handle = serve.run(f.bind())
pids = ray.get([handle.remote() for _ in range(2)])
assert len(set(pids)) == 2
A reviewer (Contributor) commented:
Add one more assert to double-check on the client side:
assert len(handle.router._replica_set.in_flight_queries) == 2

Signed-off-by: simon-mo <[email protected]>
@simon-mo simon-mo changed the title [Serve] [Prototype] ServeHandle detects ActorError and drop replicas from target group [Serve] ServeHandle detects ActorError and drop replicas from target group Jul 21, 2022
@@ -87,6 +88,12 @@ def __init__(
{"deployment": self.deployment_name}
)

def _reset_replica_iterator(self):
A reviewer (Contributor) commented:
add docstring with the behavior here (what happens to inflight & subsequent requests)
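A hedged sketch of what such a docstring might cover, using a minimal stand-in class rather than Ray Serve's actual `ReplicaSet` (the shuffle-and-cycle body here is illustrative, assumed from the surrounding discussion rather than confirmed from the source):

```python
import itertools
import random


class ReplicaSet:
    """Minimal stand-in for the router-side replica set (illustrative)."""

    def __init__(self, replicas):
        self.in_flight_queries = {r: set() for r in replicas}
        self._reset_replica_iterator()

    def _reset_replica_iterator(self):
        """Reset the iterator over target replicas.

        Behavior worth documenting per the review comment: queries already
        in flight are not touched and drain against the replicas they were
        sent to; only requests assigned after this call draw from the
        updated membership, in a freshly shuffled round-robin order.
        """
        replicas = list(self.in_flight_queries.keys())
        random.shuffle(replicas)
        self.replica_iterator = itertools.cycle(replicas)


rs = ReplicaSet(["a", "b", "c"])
del rs.in_flight_queries["b"]
rs._reset_replica_iterator()
# Subsequent assignments never target the removed replica "b".
targets = {next(rs.replica_iterator) for _ in range(10)}
print(sorted(targets))  # ['a', 'c']
```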

Comment on lines 186 to 188
logger.exception(
"Handle received unexpected error when processing request."
)
A reviewer (Contributor) commented:
this will print the traceback, right?

simon-mo (Contributor, Author) replied:
yes

Comment on lines 720 to 721
client = get_global_client()
ray.kill(client._controller, no_restart=True)
A reviewer (Contributor) commented:
what are we testing by killing the controller? add a comment pls

@@ -701,5 +702,31 @@ def ready(self):
)


def test_handle_early_detect_failure(shutdown_ray):
A reviewer (Contributor) commented:
please add a header comment describing the behavior that's being tested (let's try to do this in general, really helps readers in the future)

@edoakes edoakes added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 21, 2022
fishbone (Contributor) commented:

I have one concern: what if the RayActorError is raised due to a network issue and the actor is actually still alive? Will this lead to a leak?

I think we shouldn't just remove the replica outright; instead we should move it out and move it back in after x seconds. If the controller later removes this replica, we remove it for good.
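The reviewer's suggestion could be sketched like this. This is a hypothetical quarantine scheme, not anything implemented in this PR; all names (`QuarantiningReplicaSet`, `cooldown_s`, etc.) are invented for illustration:

```python
import time


class QuarantiningReplicaSet:
    """Sketch of the suggested approach: on a suspected actor death,
    quarantine the replica and re-admit it after a cooldown, unless the
    controller confirms the removal in the meantime."""

    def __init__(self, replicas, cooldown_s=30.0):
        self.active = set(replicas)
        self.quarantined = {}  # replica -> timestamp of quarantine
        self.cooldown_s = cooldown_s

    def quarantine(self, replica, now=None):
        now = time.monotonic() if now is None else now
        self.active.discard(replica)
        self.quarantined[replica] = now

    def confirm_removed(self, replica):
        # Controller confirmed the replica is dead: drop it for good.
        self.quarantined.pop(replica, None)
        self.active.discard(replica)

    def readmit_expired(self, now=None):
        now = time.monotonic() if now is None else now
        for replica, t in list(self.quarantined.items()):
            if now - t >= self.cooldown_s:
                # Cooldown elapsed without controller confirmation:
                # assume the error was transient (e.g. a network blip).
                del self.quarantined[replica]
                self.active.add(replica)


rs = QuarantiningReplicaSet(["r1", "r2"], cooldown_s=30.0)
rs.quarantine("r1", now=0.0)
rs.readmit_expired(now=10.0)  # still in cooldown
print(sorted(rs.active))      # ['r2']
rs.readmit_expired(now=31.0)  # cooldown elapsed, replica re-admitted
print(sorted(rs.active))      # ['r1', 'r2']
```

Passing `now` explicitly keeps the sketch deterministic; a real implementation would use the clock directly.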

simon-mo (Contributor, Author) commented:

@iycheng I'm a bit confused now about the semantics of RayActorError. Is this error string now out of date?

self.base_error_msg = "The actor died unexpectedly before finishing this task."

@simon-mo simon-mo removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 28, 2022
@simon-mo simon-mo requested a review from edoakes July 28, 2022 17:45
simon-mo (Contributor, Author) commented:

@sihanwang41 @edoakes ready for another look, comments added

fishbone (Contributor) commented:

> I have one concern: what if the RayActorError is raised due to a network issue and the actor is actually still alive? Will this lead to a leak?
>
> I think we shouldn't just remove the replica outright; instead we should move it out and move it back in after x seconds. If the controller later removes this replica, we remove it for good.

@simon-mo how about this? Do we plan to fix it? I think that in the case of a network partition, this is going to lead to an instance leak.

fishbone (Contributor) commented:

I'm trying to find documentation about all the cases of RayActorError, but I can't find any. Maybe we should have a doc about this. @jjyao do we have one?

fishbone (Contributor) commented:

A network issue is one example. I think it's more that we (the code) believe the actor died, but somehow it hasn't; whether the actor actually died is ultimately decided by the GCS.

Signed-off-by: simon-mo <[email protected]>
@simon-mo simon-mo merged commit 545c516 into ray-project:master Jul 29, 2022
simon-mo (Contributor, Author) commented:

@iycheng I'm going to address the network-error case as a follow-up.

simon-mo added a commit that referenced this pull request Jul 29, 2022
simon-mo added a commit that referenced this pull request Jul 29, 2022
simon-mo added a commit to simon-mo/ray that referenced this pull request Aug 1, 2022
simon-mo added a commit that referenced this pull request Aug 3, 2022
simon-mo added a commit to simon-mo/ray that referenced this pull request Aug 3, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
5 participants