Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serve] Use soft constraint for pinning controller on head node #25091

Merged
merged 10 commits into from
Jun 15, 2022

Conversation

edoakes
Copy link
Contributor

@edoakes edoakes commented May 23, 2022

Why are these changes needed?

Un-reverting #24934 which caused test_cluster to become flaky. This was due to an oversight: we need to update the HTTPState logic to account for the controller not necessarily running on the head node.

This will require using the new SchedulingPolicy API, but I'm not quite sure the best way to do it. Context here: #25090.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@edoakes edoakes changed the title [serve] Use soft constraint for pinning controller on head node [WIP][serve] Use soft constraint for pinning controller on head node May 23, 2022
@fishbone
Copy link
Contributor

@edoakes are we working on this one? I thought it's merged.

@edoakes
Copy link
Contributor Author

edoakes commented Jun 14, 2022

@iycheng it was merged but got reverted because it caused a test to fail. I will pick it back up.

@edoakes
Copy link
Contributor Author

edoakes commented Jun 14, 2022

@iycheng this is passing tests, please review!

@edoakes edoakes changed the title [WIP][serve] Use soft constraint for pinning controller on head node [serve] Use soft constraint for pinning controller on head node Jun 14, 2022
@edoakes edoakes added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jun 14, 2022
Copy link
Contributor

@fishbone fishbone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Could you add a test which is a two nodes ray cluster and kill one node and make sure the serve controller will be able to be scheduled again?

@fishbone fishbone added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 15, 2022
@edoakes
Copy link
Contributor Author

edoakes commented Jun 15, 2022

@iycheng I spent a few mins thinking about how to write this test and it's quite messy, I think we should wait until we can kill the head node to add the test (check that it gets placed on a worker node). Otherwise it's hard to force the controller to get placed on a worker node right now.

@fishbone
Copy link
Contributor

fishbone commented Jun 15, 2022

@edoakes you only need to kill the raylet (not gcs which I'll cover in my PR) where the controller is deployed to make sure it's working e2e.

To kill the raylet, just remove it from the cluster. You can follow the test here (https://github.com/ray-project/ray/blob/master/python/ray/tests/test_scheduling_2.py#L363)

@edoakes
Copy link
Contributor Author

edoakes commented Jun 15, 2022

@iycheng using remove_node will only work on worker nodes though, right? In this case I need to remove the head node because we prefer to place the controller there when possible

@fishbone
Copy link
Contributor

fishbone commented Jun 15, 2022

@iycheng using remove_node will only work on worker nodes though, right? In this case I need to remove the head node because we prefer to place the controller there when possible

Got it. Yeah, I'm ok with this and let me cover the tests in some other places. I'll come back to you if it's not working. :p

@edoakes
Copy link
Contributor Author

edoakes commented Jun 15, 2022

sounds good :)

@edoakes edoakes merged commit 0f60036 into ray-project:master Jun 15, 2022
@simon-mo
Copy link
Contributor

btw @edoakes do you think it's worth it to update the http_state with these strategy as well? (currently they uses node resource key).

@simon-mo
Copy link
Contributor

Ah it's in #25813

@simon-mo
Copy link
Contributor

🤦‍♂️ . it's already done in this PR. ignore me

@DmitriGekhtman
Copy link
Contributor

???
:)

edoakes added a commit to edoakes/ray that referenced this pull request Jun 16, 2022
edoakes added a commit that referenced this pull request Jun 16, 2022
edoakes added a commit to edoakes/ray that referenced this pull request Jun 16, 2022
edoakes added a commit that referenced this pull request Jul 15, 2022
xwjiang2010 pushed a commit to xwjiang2010/ray that referenced this pull request Jul 19, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants