-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[serve] Use soft constraint for pinning controller on head node #25091
Conversation
… the head node (ray-project#24934)" (ray-project#25050)" This reverts commit b9fb902.
…vert-controller-pin
…vert-controller-pin
@edoakes are we working on this one? I thought it's merged. |
…vert-controller-pin
@iycheng it was merged but got reverted because it caused a test to fail. I will pick it back up. |
@iycheng this is passing tests, please review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Could you add a test which is a two nodes ray cluster and kill one node and make sure the serve controller will be able to be scheduled again?
@iycheng I spent a few mins thinking about how to write this test and it's quite messy, I think we should wait until we can kill the head node to add the test (check that it gets placed on a worker node). Otherwise it's hard to force the controller to get placed on a worker node right now. |
@edoakes you only need to kill the raylet (not gcs which I'll cover in my PR) where the controller is deployed to make sure it's working e2e. To kill the raylet, just remove it from the cluster. You can follow the test here (https://github.com/ray-project/ray/blob/master/python/ray/tests/test_scheduling_2.py#L363) |
@iycheng using |
Got it. Yeah, I'm ok with this and let me cover the tests in some other places. I'll come back to you if it's not working. :p |
sounds good :) |
btw @edoakes do you think it's worth it to update the http_state with these strategy as well? (currently they uses node resource key). |
Ah it's in #25813 |
🤦♂️ . it's already done in this PR. ignore me |
??? |
…de (ray-project#25091)" This reverts commit 0f60036.
… head node (ray-project#25091)" (ray-project#25857)" This reverts commit e435230.
… head node (ray-project#25091)" (ray-project#25857)" (ray-project#25858) Signed-off-by: Xiaowei Jiang <[email protected]>
… head node (ray-project#25091)" (ray-project#25857)" (ray-project#25858) Signed-off-by: Stefan van der Kleij <[email protected]>
Why are these changes needed?
Un-reverting #24934 which caused
test_cluster
to become flaky. This was due to an oversight: we need to update theHTTPState
logic to account for the controller not necessarily running on the head node.This will require using the new
SchedulingPolicy
API, but I'm not quite sure the best way to do it. Context here: #25090.Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.