[1/n] Stabilize GCS/Autoscaler interface: Introduce monitor server #31827

Merged
26 commits merged on Jan 26, 2023

Conversation

wuisawesome
Contributor

Why are these changes needed?

This is the first PR towards stabilizing the GCS/autoscaler interface. It introduces a new gRPC service definition for which we will provide backward-compatibility guarantees.

This PR mostly just introduces scaffolding and a trivial GetRayVersion endpoint.

By the end of this series of PRs, monitor.py will communicate with the rest of the Ray cluster only via this service definition.
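
To make the shape of such a trivial endpoint concrete, here is a minimal, dependency-free sketch. It is not the code from this PR: the structs are hypothetical stand-ins for the generated protobuf messages, `MonitorServerSketch` is an invented name, and the boolean return value stands in for the status a real gRPC handler would report through its callback.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-ins for the generated protobuf messages. The real
// message definitions live in the PR's .proto file and are not shown here.
struct GetRayVersionRequest {};

struct GetRayVersionReply {
  std::string version;
};

// Assumed shape of a version-reporting handler: it holds only a version
// string fixed at construction time and copies it into each reply.
class MonitorServerSketch {
 public:
  explicit MonitorServerSketch(std::string version)
      : version_(std::move(version)) {}

  // Fills in the reply and returns true on success. Returns false if the
  // caller passes a null reply pointer.
  bool HandleGetRayVersion(const GetRayVersionRequest & /*request*/,
                           GetRayVersionReply *reply) const {
    if (reply == nullptr) {
      return false;
    }
    reply->version = version_;
    return true;
  }

 private:
  const std::string version_;
};
```

A caller would construct the object with whatever version string the cluster reports and read `reply.version` after a successful call. Because the request message is empty, fields can later be added to either message without changing this call shape.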

Related issue number

Related to #31826

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Co-authored-by: Dmitri Gekhtman <[email protected]>
Signed-off-by: Alex Wu <[email protected]>
Contributor

@gvspraveen gvspraveen left a comment

minor comment fix

namespace ray {
namespace gcs {

/// GcsNodeManager is responsible for managing and monitoring nodes as well as handing
Contributor

replace GcsNodeManager with GcsMonitorServer

Contributor

And the rest of the docstring too I suppose

@rkooo567
Contributor

Can you link me a design doc?

Contributor

@architkulkarni architkulkarni left a comment

Looks good. I'm assuming most of the gRPC code is boilerplate copied from the existing GcsNodeManager/NodeInfoService, but let me know if there's something different here that warrants closer attention.

@wuisawesome
Contributor Author

> Can you link me a design doc?

There have been no design changes since the previous design docs (you can refer to those); I'm just refactoring everything into more cleanly defined protos.

I've updated the list of things we need to handle in the GitHub issue, but there will be no design changes.

@@ -598,6 +601,13 @@ void GcsServer::InitGcsTaskManager() {
rpc_server_.RegisterService(*task_info_service_);
}

void GcsServer::InitMonitorServer() {
Contributor

Any reason why we call it "MonitorServer", which differs from the other modules' naming convention (*Manager)?

Contributor Author

Mostly because it's not actually "managing" anything. Most of the other managers actually manage some state.
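
The naming distinction can be illustrated with a small sketch. Everything here is hypothetical and chosen only for illustration: `NodeManagerSketch`, `MonitorFacadeSketch`, and `HandleGetNumNodes` are invented names, and this PR itself only adds the GetRayVersion endpoint. The point is that a "manager" owns and mutates state, while a "server" in this sense only answers queries by reading state owned elsewhere.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical "manager": owns a node table and mutates it over time.
class NodeManagerSketch {
 public:
  void AddNode(const std::string &node_id, const std::string &address) {
    nodes_[node_id] = address;
  }

  void RemoveNode(const std::string &node_id) { nodes_.erase(node_id); }

  std::size_t NumNodes() const { return nodes_.size(); }

 private:
  std::unordered_map<std::string, std::string> nodes_;
};

// Hypothetical "server": owns no state of its own. It holds a read-only
// reference to the manager and forwards queries to it.
class MonitorFacadeSketch {
 public:
  explicit MonitorFacadeSketch(const NodeManagerSketch &nodes)
      : nodes_(nodes) {}

  // Invented endpoint for illustration; not part of this PR.
  std::size_t HandleGetNumNodes() const { return nodes_.NumNodes(); }

 private:
  const NodeManagerSketch &nodes_;
};
```

Under this reading, the facade can be rebuilt or replaced at any time without losing cluster state, which is one reason a distinct name may be reasonable.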

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 24, 2023
@wuisawesome wuisawesome added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Jan 26, 2023
@wuisawesome
Contributor Author

The remaining test failures look like DinD issues and known flaky tests, all of which are unrelated. Merging.

@wuisawesome wuisawesome merged commit e753b03 into ray-project:master Jan 26, 2023
cadedaniel pushed a commit to cadedaniel/ray that referenced this pull request Mar 22, 2023
…ay-project#31827)

cassidylaidlaw pushed a commit to cassidylaidlaw/ray that referenced this pull request Mar 28, 2023
…ay-project#31827)

Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
6 participants