Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Add is_running_tasks bit in JobStatus #35188

Merged
merged 45 commits into from
Jun 21, 2023

Conversation

architkulkarni
Copy link
Contributor

@architkulkarni architkulkarni commented May 9, 2023

Why are these changes needed?

This PR propagates the existing num_pending_tasks property of a driver's core worker process to a new boolean is_running_tasks field for that driver in the GCS Job table. Exposing this field is useful for external cluster managers to determine whether a cluster is "active" or not.

The code for the new NumPendingTasks RPC from the GCS job manager to the driver's core worker process mimics the existing PushTask RPC made from the GCS actor manager to a worker's core worker process. The core worker client factory pattern is reused as well.

To make the connection to the core worker from the Job manager, we update the Job table proto to include the full Address object (including the port) instead of just the driver IP address.

This PR also adds unit tests in C++ and an end to end Python test.

Related issue number

Followup to PR #31046.

Closes #30436

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
@architkulkarni architkulkarni changed the title [WIP] Add is_running bit in JobStatus [WIP] Add is_running_tasks bit in JobStatus May 9, 2023
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
@architkulkarni architkulkarni dismissed wuisawesome’s stale review May 16, 2023 16:38

"will make my comment non blocking since i'm not actually 100% sure if this is a compatibility issue"

@architkulkarni
Copy link
Contributor Author

@wuisawesome I think you intended to make your review non blocking, so I dismissed the "changes requested," but feel free to put it back if I made a mistake.

@scv119 Do you know if gcs.proto needs to be backwards compatible? Would you mind reviewing this PR or suggesting someone who can review it?

@architkulkarni architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label May 16, 2023
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I feel like it is not a good idea to query all drivers. This pattern has not been working well in the past too for RPC that require reliable response time. There are 3 cons;

  • it introduces a new code path when we already have an aggregation path everyone knows (resource loads).
  • I believe GetAllJobInfo is called frequently which can add unnecessary loads to user drivers.
  • It can be common for the driver to be overloaded sometimes. In this case, GetAllJobInfo APIs will become extremely slow. If 1 driver is slow to respond. If the driver exits unexpectedly, it can take up to keepalive timeout (30~60 seconds) to detect the failure.

We are reporting every autoscaler-related data via periodic aggregation, and is there any special reason why we make a special case only to this field? Is it possible to instead

  1. when raylets send loads to GCS, we include the # of pending tasks there.
  2. And just store them to GCS, and use it when replying GetAllJobInfo?

std::make_shared<std::atomic<int>>(0);

// Create a shared boolean flag for the internal KV callback completion
std::shared_ptr<std::atomic<bool>> kv_callback_done =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't need to be atomic. They are in the same thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed

@rkooo567 rkooo567 added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels May 23, 2023
@architkulkarni
Copy link
Contributor Author

We are reporting every autoscaler-related data via periodic aggregation, and is there any special reason why we make a special case only to this field? Is it possible to instead

  1. when raylets send loads to GCS, we include the # of pending tasks there.
  2. And just store them to GCS, and use it when replying GetAllJobInfo?

No special reason, I just wasn't aware of this other reporting. Do you mind linking me to the relevant parts of the code here? Then I can rewrite the PR.

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I am concerned this approach may not work well (and probably unstable), the alternative solution takes time to implement, and there’s no one who has bandwidth now

@architkulkarni
Copy link
Contributor Author

As discussed offline between @rkooo567 and @scv119, we don’t actually have a plumbing path yet for the alternate approach "reporting via periodic aggregation" and it will require a substantial amount of work on the Ray Core side. So we're going ahead with merging this PR to have this field available in Ray 2.6 for external cluster managers. (cc @sofianhnaide)

huggingface_text_classification test failure is unrelated.

@architkulkarni architkulkarni merged commit b7588b8 into ray-project:master Jun 21, 2023
1 of 2 checks passed
SongGuyang pushed a commit to alipay/ant-ray that referenced this pull request Jul 12, 2023
This PR propagates the existing num_pending_tasks property of a driver's core worker process to a new boolean is_running_tasks field for that driver in the GCS Job table. Exposing this field is useful for external cluster managers to determine whether a cluster is "active" or not.

The code for the new NumPendingTasks RPC from the GCS job manager to the driver's core worker process mimics the existing PushTask RPC made from the GCS actor manager to a worker's core worker process. The core worker client factory pattern is reused as well.

To make the connection to the core worker from the Job manager, we update the Job table proto to include the full Address object (including the port) instead of just the driver IP address.

This PR also adds unit tests in C++ and an end to end Python test.

Related issue number
Followup to PR ray-project#31046.

Closes ray-project#30436

Signed-off-by: 久龙 <[email protected]>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
This PR propagates the existing num_pending_tasks property of a driver's core worker process to a new boolean is_running_tasks field for that driver in the GCS Job table. Exposing this field is useful for external cluster managers to determine whether a cluster is "active" or not.

The code for the new NumPendingTasks RPC from the GCS job manager to the driver's core worker process mimics the existing PushTask RPC made from the GCS actor manager to a worker's core worker process. The core worker client factory pattern is reused as well.

To make the connection to the core worker from the Job manager, we update the Job table proto to include the full Address object (including the port) instead of just the driver IP address.

This PR also adds unit tests in C++ and an end to end Python test.

Related issue number
Followup to PR ray-project#31046.

Closes ray-project#30436

Signed-off-by: harborn <[email protected]>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
This PR propagates the existing num_pending_tasks property of a driver's core worker process to a new boolean is_running_tasks field for that driver in the GCS Job table. Exposing this field is useful for external cluster managers to determine whether a cluster is "active" or not.

The code for the new NumPendingTasks RPC from the GCS job manager to the driver's core worker process mimics the existing PushTask RPC made from the GCS actor manager to a worker's core worker process. The core worker client factory pattern is reused as well.

To make the connection to the core worker from the Job manager, we update the Job table proto to include the full Address object (including the port) instead of just the driver IP address.

This PR also adds unit tests in C++ and an end to end Python test.

Related issue number
Followup to PR ray-project#31046.

Closes ray-project#30436
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
This PR propagates the existing num_pending_tasks property of a driver's core worker process to a new boolean is_running_tasks field for that driver in the GCS Job table. Exposing this field is useful for external cluster managers to determine whether a cluster is "active" or not.

The code for the new NumPendingTasks RPC from the GCS job manager to the driver's core worker process mimics the existing PushTask RPC made from the GCS actor manager to a worker's core worker process. The core worker client factory pattern is reused as well.

To make the connection to the core worker from the Job manager, we update the Job table proto to include the full Address object (including the port) instead of just the driver IP address.

This PR also adds unit tests in C++ and an end to end Python test.

Related issue number
Followup to PR ray-project#31046.

Closes ray-project#30436

Signed-off-by: e428265 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[jobs] Provide a "is_running_tasks" bit in the JobStatus to check interactive job status
8 participants