[Core] Add is_running_tasks bit in JobStatus #35188

architkulkarni · 2023-05-09T20:19:04Z

Why are these changes needed?

This PR propagates the existing num_pending_tasks property of a driver's core worker process to a new boolean is_running_tasks field for that driver in the GCS Job table. Exposing this field is useful for external cluster managers to determine whether a cluster is "active" or not.

The code for the new NumPendingTasks RPC from the GCS job manager to the driver's core worker process mimics the existing PushTask RPC made from the GCS actor manager to a worker's core worker process. The core worker client factory pattern is reused as well.

To make the connection to the core worker from the Job manager, we update the Job table proto to include the full Address object (including the port) instead of just the driver IP address.

This PR also adds unit tests in C++ and an end-to-end Python test.
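As a rough illustration of the data flow described above, the sketch below reduces per-driver pending-task counts to the boolean stored in the job table. The `JobRecord` struct, the `FillIsRunningTasks` helper, the "count greater than zero" rule, and the choice to skip dead drivers are all assumptions made for illustration. The actual change issues an asynchronous NumPendingTasks RPC to each driver's core worker rather than calling a local function.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one row of the GCS Job table.
struct JobRecord {
  std::string driver_ip;          // Before this PR, only the IP was stored.
  int driver_port = 0;            // The port is needed to reach the core worker.
  bool is_dead = false;
  bool is_running_tasks = false;  // The new boolean field.
};

// Hypothetical stand-in for the NumPendingTasks RPC: given a driver address,
// return how many tasks that driver still has pending.
using NumPendingTasksFn =
    std::function<int64_t(const std::string &ip, int port)>;

// Assumed rule: a job is "running tasks" when its live driver reports at
// least one pending task. Dead drivers are not queried in this sketch.
inline void FillIsRunningTasks(std::vector<JobRecord> &jobs,
                               const NumPendingTasksFn &query) {
  for (auto &job : jobs) {
    if (job.is_dead) {
      job.is_running_tasks = false;
      continue;
    }
    job.is_running_tasks = query(job.driver_ip, job.driver_port) > 0;
  }
}
```

An external cluster manager could then treat the cluster as "active" if any job record has `is_running_tasks` set, although that policy is outside the scope of this PR.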

Related issue number

Followup to PR #31046.

Closes #30436

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(


"will make my comment non blocking since i'm not actually 100% sure if this is a compatibility issue"

architkulkarni · 2023-05-16T16:38:44Z

@wuisawesome I think you intended to make your review non blocking, so I dismissed the "changes requested," but feel free to put it back if I made a mistake.

@scv119 Do you know if gcs.proto needs to be backwards compatible? Would you mind reviewing this PR or suggesting someone who can review it?

rkooo567

Hmm, I feel like it is not a good idea to query all drivers. This pattern has not worked well in the past either for RPCs that require a reliable response time. There are 3 cons:

- It introduces a new code path when we already have an aggregation path everyone knows (resource loads).
- I believe GetAllJobInfo is called frequently, which can add unnecessary load to user drivers.
- It can be common for the driver to be overloaded sometimes. In this case, GetAllJobInfo will become extremely slow if even one driver is slow to respond. If the driver exits unexpectedly, it can take up to the keepalive timeout (30~60 seconds) to detect the failure.

We report all autoscaler-related data via periodic aggregation. Is there any special reason to make a special case only for this field? Is it possible to do the following instead?

- When raylets send loads to GCS, we include the # of pending tasks there.
- Then we just store them in GCS and use them when replying to GetAllJobInfo.
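To make the suggested alternative concrete, here is a minimal sketch of a GCS-side cache that is refreshed by periodic reports and answers from memory. `PendingTaskCache` and its method names are hypothetical and do not exist in Ray. The sketch only shows the shape of "store on report, read on GetAllJobInfo", assuming reports are keyed by job ID.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical cache of the most recently reported pending-task counts.
class PendingTaskCache {
 public:
  // Called whenever a (hypothetical) periodic load report arrives.
  void OnLoadReport(const std::string &job_id, int64_t num_pending_tasks) {
    pending_[job_id] = num_pending_tasks;
  }

  // Drop the entry once the job is finished.
  void OnJobFinished(const std::string &job_id) { pending_.erase(job_id); }

  // Answer from the cache; no RPC to the driver is made on the read path.
  bool IsRunningTasks(const std::string &job_id) const {
    auto it = pending_.find(job_id);
    return it != pending_.end() && it->second > 0;
  }

 private:
  std::unordered_map<std::string, int64_t> pending_;
};
```

The trade-off in this design is staleness: the answer is only as fresh as the last report, but the read path no longer depends on how quickly each driver responds.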

rkooo567 · 2023-05-23T14:44:29Z

src/ray/gcs/gcs_server/gcs_job_manager.cc

+ std::make_shared<std::atomic<int>>(0);
+
+ // Create a shared boolean flag for the internal KV callback completion
+ std::shared_ptr<std::atomic<bool>> kv_callback_done =


rkooo567 (May 23, 2023): It doesn't need to be atomic. They are in the same thread.

architkulkarni (Jun 12, 2023): Thanks, fixed.
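The point about atomics can be sketched as follows, assuming all reply callbacks are executed by one event-loop thread. `EventLoop` and `WaitForAll` are hypothetical names for this illustration and are not the code in gcs_job_manager.cc. A plain shared counter is enough here because no two callbacks ever run concurrently.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical single-threaded event loop: callbacks run one at a time, in
// the order they were posted.
class EventLoop {
 public:
  void Post(std::function<void()> fn) { queue_.push_back(std::move(fn)); }

  void Run() {
    while (!queue_.empty()) {
      auto fn = std::move(queue_.front());
      queue_.pop_front();
      fn();
    }
  }

 private:
  std::deque<std::function<void()>> queue_;
};

// Posts `num_replies` simulated reply callbacks and invokes `on_done` exactly
// once, after the last of them has run. The counter is a plain size_t (not
// std::atomic) because only the loop thread ever touches it.
inline void WaitForAll(EventLoop &loop, std::size_t num_replies,
                       std::function<void()> on_done) {
  if (num_replies == 0) {
    loop.Post(std::move(on_done));
    return;
  }
  auto num_finished = std::make_shared<std::size_t>(0);
  auto done = std::make_shared<std::function<void()>>(std::move(on_done));
  for (std::size_t i = 0; i < num_replies; ++i) {
    loop.Post([num_finished, num_replies, done]() {
      if (++(*num_finished) == num_replies) {
        (*done)();
      }
    });
  }
}
```

If the callbacks could run on different threads, the counter would need to be atomic (or guarded by a lock) again.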

architkulkarni · 2023-05-23T16:59:26Z

"We are reporting every autoscaler-related data via periodic aggregation, and is there any special reason why we make a special case only to this field? Is it possible to instead: when raylets send loads to GCS, we include the # of pending tasks there, and just store them to GCS, and use it when replying GetAllJobInfo?"

No special reason, I just wasn't aware of this other reporting. Do you mind linking me to the relevant parts of the code here? Then I can rewrite the PR.


rkooo567

While I am concerned that this approach may not work well (and will probably be unstable), the alternative solution takes time to implement, and no one has the bandwidth right now.


architkulkarni · 2023-06-21T22:27:29Z

As discussed offline between @rkooo567 and @scv119, we don’t actually have a plumbing path yet for the alternate approach "reporting via periodic aggregation" and it will require a substantial amount of work on the Ray Core side. So we're going ahead with merging this PR to have this field available in Ray 2.6 for external cluster managers. (cc @sofianhnaide)

huggingface_text_classification test failure is unrelated.


architkulkarni added 25 commits December 12, 2022 15:14

- 8fcbc9f Add submission_id to JobInfo
- 18ca6e5 Merge branch 'master' of https://github.com/ray-project/ray into join… …-job-info-gcs
- 426d34e Fix typo in entrypoint comment
- 79e50f1 Merge branch 'master' of https://github.com/ray-project/ray into join… …-job-info-gcs
- 12f0f3a Serialize runtime_env to JSON
- f3daeac Fix issue with duplicate submission_id
- 243d96d Add JobsAPIInfo
- 68fdb1f Query JobInfo from internal kv in GCS Job Manager
- a932f44 Query JobInfo from internal kc in GCS Job Manager
- e910e67 Merge branch 'master' of https://github.com/ray-project/ray into join… …-job-info-gcs
- c860838 Lint
- 28e14b0 Undo doc change
- 44ab6a4 Fix error with capture of job_submission_id
- c0426b7 Load info into reply
- 9c6ae8a Lint
- 63b98a7 Fix existing JobManager test
- ae6db4c Fix duplicate submission_id in JobDetails
- 04da99f Revert add submission_id to JobInfo
- 48fbf69 Add rpc for NumPendingTasks
- faf9aa0 Add coreworker client to GCS Job Manager
- 699c557 Merge branch 'master' of https://github.com/ray-project/ray into add-… …is-running-bit
- 87891d6 Fix merge in gcs.proto
- d325d1a Expose driver address in job table
- 32920e7 Fix test
- 0a097bb Add is_running_tasks bit to JobTableData and populate it

architkulkarni changed the title from "[WIP] Add is_running bit in JobStatus" to "[WIP] Add is_running_tasks bit in JobStatus" May 9, 2023

architkulkarni added 4 commits May 9, 2023 17:13

- 852d50c Add unit test
- 273bb41 Lint
- a5dc864 Fix unit test
- e054b49 Fix call_ray_start fixture on MacOS

architkulkarni added 5 commits May 11, 2023 13:49

- 8e9bb9f Merge branch 'master' of https://github.com/ray-project/ray into add-… …is-running-bit
- a4b11f7 Increase test_http_job_server from medium to large
- 4a4a48b Revert "Fix call_ray_start fixture on MacOS" (This reverts commit e054b49.)
- 9a72fcf Fix missing data field for test_http_job_server
- 45a764e Merge branch 'master' of https://github.com/ray-project/ray into add-… …is-running-bit

architkulkarni assigned scv119 May 16, 2023

architkulkarni added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) May 16, 2023

scv119 assigned rkooo567 and jjyao May 19, 2023

rkooo567 reviewed May 23, 2023


rkooo567 added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) and removed the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) May 23, 2023

- adde7a7 Merge branch 'master' of https://github.com/ray-project/ray into add-… …is-running-bit

rkooo567 approved these changes Jun 8, 2023


architkulkarni added 3 commits June 12, 2023 11:30

- 0376480 Merge branch 'master' of https://github.com/ray-project/ray into add-… …is-running-bit
- 7bc8fd9 Remove atomic because it's in the same thread
- 977c33c Lint

architkulkarni merged commit b7588b8 into ray-project:master Jun 21, 2023
1 of 2 checks passed

architkulkarni mentioned this pull request Jun 21, 2023

[Core] Establish plumbing path for driver pending tasks info #36680

Open

fishbone mentioned this pull request Jun 27, 2023

[release-test] placement_group_performance_test.aws failed #36829

Closed

akshay-anyscale mentioned this pull request Jul 21, 2023

Add service deployment instructions to stable diffusion template #37645

Closed

8 tasks

rkooo567 mentioned this pull request Oct 25, 2023

[Core][GCS FT] The ray list jobs command experiences significant delays after the head Pod recovers from a failure #39947

Closed
