-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Serve] Remove replicas from scheduler when receiving an ActorDiedError
#44772
[Serve] Remove replicas from scheduler when receiving an ActorDiedError
#44772
Conversation
Signed-off-by: Shreyas Krishnaswamy <[email protected]>
Signed-off-by: Shreyas Krishnaswamy <[email protected]>
…ctor_error_router
Signed-off-by: Shreyas Krishnaswamy <[email protected]>
|
||
@pytest.mark.asyncio | ||
@pytest.mark.parametrize("pow_2_scheduler", [{}], indirect=True) | ||
async def test_replicas_actor_unavailable_error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test fails on master
.
Signed-off-by: Shreyas Krishnaswamy <[email protected]>
# We do not catch RayActorError here because that error can be | ||
# raised even when a replica is temporarily unavailable. | ||
# See https://github.com/ray-project/ray/issues/44185 for details. | ||
if isinstance(t.exception(), ActorDiedError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also add an else
block for the temporarily unavailable exception type with a more informative error message here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a more informative error message.
# The scheduler should keep r1 since it may recover. | ||
assert set(pow_2_scheduler.curr_replicas.values()) == {r1, r2} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's test the e2e behavior instead of verifying what are basically implementation details -- swap r1
to stop raising the exception and check it can be scheduled to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I checked that r1
gets scheduling requests after it recovers.
# The scheduler should remove r1 since it died. | ||
assert set(pow_2_scheduler.curr_replicas.values()) == {r2} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar comment to below -- instead, you can schedule a bunch requests of and check that the get_queue_len
call is never made to the replica that returned the actor died error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test now checks whether get_queue_len
gets called.
Signed-off-by: Shreyas Krishnaswamy <[email protected]>
Signed-off-by: Shreyas Krishnaswamy <[email protected]>
Signed-off-by: Shreyas Krishnaswamy <[email protected]>
Why are these changes needed?
This change makes the
PowerOfTwoChoicesReplicaScheduler
stop routing to a replica when it receives anActorDiedError
.Context: Currently, the scheduler stops routing to a replica when it receives a
RayActorError
. However,RayActorError
can also be raised when there's network instability in the cluster, which leads to issues like #44185. #44212 recently introduced theActorDiedError
subclass, which only gets raised if the actor has died.Related issue number
Closes #44185.
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.test_pow_2_replica_scheduler.py
.