[Core] Add is_running_tasks bit in JobStatus #35188
Conversation
Signed-off-by: Archit Kulkarni <[email protected]>
This reverts commit e054b49.
"will make my comment non blocking since i'm not actually 100% sure if this is a compatibility issue"
@wuisawesome I think you intended to make your review non-blocking, so I dismissed the "changes requested," but feel free to put it back if I made a mistake. @scv119 Do you know if
Hmm, I feel like it is not a good idea to query all drivers. This pattern has not worked well in the past for RPCs that require a reliable response time. There are 3 cons:
- It introduces a new code path when we already have an aggregation path everyone knows (resource loads).
- I believe GetAllJobInfo is called frequently, which can add unnecessary load to user drivers.
- It can be common for a driver to be overloaded sometimes. In that case, the GetAllJobInfo API will become extremely slow if even one driver is slow to respond. If the driver exits unexpectedly, it can take up to the keepalive timeout (30~60 seconds) to detect the failure.
We report all autoscaler-related data via periodic aggregation, so is there any special reason to make a special case only for this field? Is it possible to instead:
- when raylets send loads to GCS, include the # of pending tasks there,
- and just store them in GCS and use them when replying to GetAllJobInfo?
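The reviewer's push-based alternative can be sketched roughly as follows. All names here (`LoadReport`, `GcsJobInfoCache`, `OnLoadReport`) are illustrative, not Ray's actual interfaces: each raylet would piggyback its pending-task count onto the load report it already sends periodically, and the GCS would answer GetAllJobInfo from that cache instead of fanning out RPCs to possibly-unresponsive drivers.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical load report with a piggybacked pending-task count.
struct LoadReport {
  uint64_t node_id;
  int64_t num_pending_tasks;  // rides along with the existing resource load
};

class GcsJobInfoCache {
 public:
  // Called on the GCS side whenever a periodic load report arrives.
  void OnLoadReport(const LoadReport &report) {
    pending_by_node_[report.node_id] = report.num_pending_tasks;
  }

  // GetAllJobInfo can now be answered from local state: no dependency on
  // driver responsiveness or keepalive timeouts.
  bool IsRunningTasks() const {
    for (const auto &[node, pending] : pending_by_node_) {
      if (pending > 0) return true;
    }
    return false;
  }

 private:
  std::unordered_map<uint64_t, int64_t> pending_by_node_;
};
```

The trade-off the reviewer names is freshness vs. reliability: the cached value can lag by one reporting interval, but a slow or dead driver can no longer stall the GetAllJobInfo reply.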
```cpp
  std::make_shared<std::atomic<int>>(0);
  // ...
  // Create a shared boolean flag for the internal KV callback completion
  std::shared_ptr<std::atomic<bool>> kv_callback_done =
```
It doesn't need to be atomic; they are in the same thread.
Thanks, fixed
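The point behind this review exchange can be sketched in a few lines (this is an illustrative toy, not Ray's actual event-loop code): when every callback is dispatched on a single event-loop thread, there is no concurrent access to the flag, so `std::atomic` buys nothing. A plain `bool` held in a `shared_ptr` (for lifetime, not thread safety) is enough.

```cpp
#include <functional>
#include <memory>
#include <queue>

// Toy single-threaded event loop: callbacks are queued and later run
// on the same thread that posted them, so shared state they touch
// never sees concurrent access.
class SingleThreadEventLoop {
 public:
  void Post(std::function<void()> fn) { queue_.push(std::move(fn)); }

  void RunAll() {
    while (!queue_.empty()) {
      auto fn = std::move(queue_.front());
      queue_.pop();
      fn();
    }
  }

 private:
  std::queue<std::function<void()>> queue_;
};
```

Under this model the completion flag can be `std::make_shared<bool>(false)`; the `shared_ptr` only keeps it alive until the posted callback runs.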
No special reason, I just wasn't aware of this other reporting. Do you mind linking me to the relevant parts of the code here? Then I can rewrite the PR.
While I am concerned this approach may not work well (and is probably unstable), the alternative solution takes time to implement, and there's no one who has the bandwidth now.
As discussed offline between @rkooo567 and @scv119, we don’t actually have a plumbing path yet for the alternate approach "reporting via periodic aggregation" and it will require a substantial amount of work on the Ray Core side. So we're going ahead with merging this PR to have this field available in Ray 2.6 for external cluster managers. (cc @sofianhnaide)
Why are these changes needed?
This PR propagates the existing `num_pending_tasks` property of a driver's core worker process to a new boolean `is_running_tasks` field for that driver in the GCS Job table. Exposing this field is useful for external cluster managers to determine whether a cluster is "active" or not.

The code for the new `NumPendingTasks` RPC from the GCS job manager to the driver's core worker process mimics the existing `PushTask` RPC made from the GCS actor manager to a worker's core worker process. The core worker client factory pattern is reused as well.

To make the connection to the core worker from the Job manager, we update the Job table proto to include the full `Address` object (including the port) instead of just the driver IP address.

This PR also adds unit tests in C++ and an end-to-end Python test.
Related issue number
Followup to PR #31046.
Closes #30436
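The description's pull-based approach can be sketched as follows. All names (`CoreWorkerClient`, `NumPendingTasksReply`, `GetAllJobIsRunningTasks`) are simplified stand-ins for the real Ray RPC interfaces: the GCS job manager issues one `NumPendingTasks` request per driver and collapses each count into the boolean `is_running_tasks` bit for the reply.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <vector>

struct NumPendingTasksReply {
  int64_t num_pending_tasks = 0;
};

// Stand-in for the core worker client that the client factory would
// produce from the driver's full Address (IP and port).
class CoreWorkerClient {
 public:
  explicit CoreWorkerClient(int64_t pending) : pending_(pending) {}

  // Invokes the callback with this driver's pending-task count. The real
  // RPC is asynchronous; this sketch replies immediately.
  void NumPendingTasks(
      const std::function<void(const NumPendingTasksReply &)> &cb) {
    cb(NumPendingTasksReply{pending_});
  }

 private:
  int64_t pending_;
};

// GCS-side aggregation: one outstanding request per driver, with a shared
// counter tracking how many replies have arrived.
std::vector<bool> GetAllJobIsRunningTasks(
    std::vector<CoreWorkerClient> &drivers) {
  std::vector<bool> is_running_tasks(drivers.size(), false);
  auto num_replies = std::make_shared<size_t>(0);
  for (size_t i = 0; i < drivers.size(); ++i) {
    drivers[i].NumPendingTasks([&, i](const NumPendingTasksReply &reply) {
      // The new field is just "does this driver have pending tasks?".
      is_running_tasks[i] = reply.num_pending_tasks > 0;
      ++*num_replies;
    });
  }
  // In this synchronous sketch all callbacks have already run.
  return is_running_tasks;
}
```

This mirrors the fan-out pattern the reviewers debated above: each GetAllJobInfo call depends on every driver replying, which is exactly why the periodic-aggregation alternative was raised.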
Checks
- I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.