Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Remove replicas from scheduler when receiving an ActorDiedError #44772

Conversation

shrekris-anyscale
Copy link
Contributor

@shrekris-anyscale shrekris-anyscale commented Apr 16, 2024

Why are these changes needed?

This change makes the PowerOfTwoChoicesReplicaScheduler stop routing to a replica when it receives an ActorDiedError.

Context: Currently, the scheduler stops routing to a replica when it receives a RayActorError. However, RayActorError can also be raised when there's network instability in the cluster, which leads to issues like #44185. #44212 recently introduced the ActorDiedError subclass, which only gets raised if the actor has died.

Related issue number

Closes #44185.

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
      • This change adds tests to test_pow_2_replica_scheduler.py.
    • Release tests
    • This PR is not tested :(

Signed-off-by: Shreyas Krishnaswamy <[email protected]>
Signed-off-by: Shreyas Krishnaswamy <[email protected]>

@pytest.mark.asyncio
@pytest.mark.parametrize("pow_2_scheduler", [{}], indirect=True)
async def test_replicas_actor_unavailable_error(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails on master.

Signed-off-by: Shreyas Krishnaswamy <[email protected]>
# We do not catch RayActorError here because that error can be
# raised even when a replica is temporarily unavailable.
# See https://github.com/ray-project/ray/issues/44185 for details.
if isinstance(t.exception(), ActorDiedError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also add an else block for the temporarily unavailable exception type with a more informative error message here!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a more informative error message.

Comment on lines +1689 to +1690
# The scheduler should keep r1 since it may recover.
assert set(pow_2_scheduler.curr_replicas.values()) == {r1, r2}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's test the e2e behavior instead of verifying what are basically implementation details -- swap r1 to stop raising the exception and check it can be scheduled to

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I checked that r1 gets scheduling requests after it recovers.

Comment on lines 1657 to 1658
# The scheduler should remove r1 since it died.
assert set(pow_2_scheduler.curr_replicas.values()) == {r2}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment to below -- instead, you can schedule a bunch requests of and check that the get_queue_len call is never made to the replica that returned the actor died error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test now checks whether get_queue_len gets called.

Signed-off-by: Shreyas Krishnaswamy <[email protected]>
Signed-off-by: Shreyas Krishnaswamy <[email protected]>
Signed-off-by: Shreyas Krishnaswamy <[email protected]>
@shrekris-anyscale shrekris-anyscale merged commit 924d3e1 into ray-project:master Apr 16, 2024
5 checks passed
harborn pushed a commit to harborn/ray that referenced this pull request Apr 18, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[serve] Avoid blacklisting replicas that have intermittent network failures
2 participants