
[core] Raylet adds job to GCS after the driver port announced. #44626

Merged (17 commits) on Apr 16, 2024

Conversation

rynewang (Contributor):

The GCS client API GetAllJobInfo makes one RPC to each driver process to fetch some up-to-date info. To make that call, the GCS uses the driver core worker's IP address and port, which the Raylet supplies when it calls ray::gcs::JobInfoAccessor::AsyncAdd. However, the Raylet may pass port = 0 in that call, leaving the GCS unable to reach the driver's core worker.

Here is the order of events during core worker init:

  1. core worker starts
  2. core worker -> raylet RegisterClientRequest
  3. raylet -> GCS AddJob
  4. core worker -> raylet AnnounceWorkerPort.

Note that step 3 should actually happen after step 4, because at step 3 the raylet does not yet know the core worker's real port. This can happen when the raylet has NodeManagerConfig::min_worker_port set to 0, which allows the core worker to pick a port on its own.

This PR moves step 3 (raylet -> GCS AddJob) into the raylet handler for the incoming AnnounceWorkerPort message, so we now give the GCS the real port instead of the assigned_port.

One catch: previously, AnnounceWorkerPort was one-way; the core worker did not wait for a reply and continued straight to user code. Here, however, we do want the driver to wait until the GCS has received the newly added job. We also don't want to add an RTT to every worker init, so the reply message is only sent when the worker is a driver. Luckily, both sides know the worker type.
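To make the reordering concrete, here is a minimal standalone sketch of the new handshake. It is an illustration only, not the PR's code: all types and names below (Gcs, Raylet, RegisterClient, AnnounceWorkerPort, the port values) are simplified stand-ins for the real raylet/GCS interfaces.

#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

enum class WorkerType { kDriver, kWorker };

// Stand-in for the GCS job table: job_id -> driver port the GCS can dial.
struct Gcs {
  std::unordered_map<std::string, int> jobs;
  void AddJob(const std::string &job_id, int driver_port) { jobs[job_id] = driver_port; }
};

// Stand-in for the raylet. AddJob is no longer triggered by RegisterClient
// (step 2), where the port may still be 0; it now happens in the handler for
// AnnounceWorkerPort (step 4), which carries the port the worker actually bound.
struct Raylet {
  Gcs &gcs;
  int RegisterClient(WorkerType /*type*/, int requested_port) { return requested_port; }

  void AnnounceWorkerPort(WorkerType type, const std::string &job_id, int real_port,
                          const std::function<void()> &reply) {
    if (type == WorkerType::kDriver) {
      gcs.AddJob(job_id, real_port);  // The GCS now stores a reachable address.
      reply();                        // Only drivers wait for this reply.
    }
  }
};

int main() {
  Gcs gcs;
  Raylet raylet{gcs};
  // min_worker_port == 0: the raylet assigns no port, the driver picks its own.
  int assigned = raylet.RegisterClient(WorkerType::kDriver, /*requested_port=*/0);
  int real_port = (assigned == 0) ? 52345 : assigned;
  raylet.AnnounceWorkerPort(WorkerType::kDriver, "01000000", real_port, [] {
    std::cout << "Job added to GCS; driver may run user code.\n";
  });
  std::cout << "GCS sees driver port " << gcs.jobs.at("01000000") << "\n";
  return 0;
}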

Fixes #44459.

@rynewang rynewang requested a review from a team as a code owner April 10, 2024 15:04
Contributor Author (@rynewang):

Note: I don't expect this PR to introduce any performance regression. The driver needs to wait for the raylet -> GCS RPC to finish anyway, both before and after this PR.

@jjyao (Collaborator) left a comment:

lg

Resolved review threads: src/ray/raylet/node_manager.cc, src/ray/raylet_client/raylet_client.h, python/ray/tests/test_state_api.py
@jjyao (Collaborator) commented Apr 11, 2024:

Many test failures.

Contributor Author (@rynewang):

@jjyao ready to merge

Comment on lines 267 to 271
# When we create a new node, the new raylet invokes RegisterGcs -> AsyncSubscribeAll
# -> AsyncGetAll -> GetAllJobInfo, which makes the GCS create a connection to each
# driver. To keep the connection count consistent, pre-create the connection here.
ray._private.worker.global_worker.gcs_client.get_all_job_info()

Collaborator:

I still don't understand why we need to do this here?

Contributor Author:

So I found that after this PR we get one more connection created in the add-node test, and it is not closed in the remove-node test. It turns out that creating a node triggers the GCS to connect to each driver core worker, and that connection is persisted in core_worker_connection_pool under an LRU policy. Previously that connection did not exist because the GCS had no way to reach the driver (the port was 0), which is exactly what this PR fixes. So I pre-trigger the GCS -> driver core worker connection to make it count toward the initial list; then the connection count stays stable across the add-node and remove-node steps.
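For readers unfamiliar with the pooling behavior described above, the toy class below shows the general shape of an LRU-bounded connection cache: once a connection is opened it stays cached (and counted against the process's file descriptors) until newer entries evict it, which is why pre-triggering the GCS -> driver connection keeps the before/after counts stable. This is an illustration of the concept only, not Ray's core_worker_connection_pool.

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Toy LRU-bounded pool: one cached "connection" (here just a fake fd) per
// address, evicting the least recently used entry once capacity is exceeded.
class LruConnectionPool {
 public:
  explicit LruConnectionPool(std::size_t capacity) : capacity_(capacity) {}

  int GetOrConnect(const std::string &address) {
    auto it = index_.find(address);
    if (it != index_.end()) {
      // Reuse the cached connection and mark it most recently used.
      order_.splice(order_.begin(), order_, it->second);
      return order_.front().second;
    }
    order_.emplace_front(address, next_fd_++);  // "Open" a new connection.
    index_[address] = order_.begin();
    if (order_.size() > capacity_) {  // Evict the least recently used entry.
      index_.erase(order_.back().first);
      order_.pop_back();
    }
    return order_.front().second;
  }

 private:
  std::size_t capacity_;
  int next_fd_ = 3;
  std::list<std::pair<std::string, int>> order_;
  std::unordered_map<std::string, std::list<std::pair<std::string, int>>::iterator> index_;
};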

Collaborator:

We can move ray.init(cluster.address) down and add a corresponding ray.shutdown()?

Contributor Author:

TODO: just shut down the driver job

time.sleep(10000)

# Create some long running tasks, no need to wait.
tasks = [f.remote() for i in range(4)] # noqa: F841
Collaborator:

any reason why we need 4 tasks here instead of 1?

Contributor Author:

just random. Switched back to only 1 task.

Collaborator:

I think we need some synchronization

Contributor Author:

TODO: use a signal actor to gate the task

Comment on lines 1336 to 1338
if (!status.ok()) {
  RAY_LOG(ERROR) << "Failed to add job to GCS: " << status.ToString();
}
Collaborator:

In the original code, if this fails, RegisterClientReply will have

  // Whether the registration succeeded.
  success: bool;
  // The reason of registration failure.
  failure_reason: string;

set. We should do the same thing for AnnounceWorkerPortReply?

Contributor Author:

ok, added reply status. But at the terminal we check-fail it, since the cython binding notify_raylet returns void. Do we want to also change that to raise an exception?

auto message = protocol::CreateAnnounceWorkerPortReply(fbb);
fbb.Finish(message);

auto reply_status = client->WriteMessage(
Collaborator:

Why don't we call WriteMessageAsync here, following:

auto reply =
        ray::protocol::CreateRegisterClientReply(fbb,
                                                 status.ok(),
                                                 fbb.CreateString(status.ToString()),
                                                 to_flatbuf(fbb, self_node_id_),
                                                 assigned_port);
    fbb.Finish(reply);
    client->WriteMessageAsync(
        static_cast<int64_t>(protocol::MessageType::RegisterClientReply),
        fbb.GetSize(),
        fbb.GetBufferPointer(),
        [this, client](const ray::Status &status) {
          if (!status.ok()) {
            DisconnectClient(client,
                             rpc::WorkerExitType::SYSTEM_ERROR,
                             "Worker is failed because the raylet couldn't reply the "
                             "registration request.");
          }
        });

Contributor Author:

makes sense. updated
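For context, the updated reply path plausibly ends up looking like the sketch below, which simply mirrors the RegisterClientReply pattern quoted above. The field list of CreateAnnounceWorkerPortReply (success, failure_reason) and the AnnounceWorkerPortReply message type are assumed from this discussion, not copied from the merged code.

// Illustrative only; mirrors the RegisterClientReply pattern quoted above.
// The reply fields (success, failure_reason) are assumed from this thread.
flatbuffers::FlatBufferBuilder fbb;
auto reply = protocol::CreateAnnounceWorkerPortReply(
    fbb, status.ok(), fbb.CreateString(status.ToString()));
fbb.Finish(reply);
client->WriteMessageAsync(
    static_cast<int64_t>(protocol::MessageType::AnnounceWorkerPortReply),
    fbb.GetSize(),
    fbb.GetBufferPointer(),
    [this, client](const ray::Status &write_status) {
      if (!write_status.ok()) {
        DisconnectClient(client,
                         rpc::WorkerExitType::SYSTEM_ERROR,
                         "Failed to deliver the AnnounceWorkerPort reply to the driver.");
      }
    });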

Signed-off-by: Ruiyang Wang <[email protected]>
@rynewang rynewang assigned rynewang and unassigned jjyao Apr 15, 2024
RAY_CHECK_OK(local_raylet_client_->AnnounceWorkerPort(core_worker_server_->GetPort()));
if (options_.worker_type == WorkerType::DRIVER) {
RAY_CHECK_OK(local_raylet_client_->AnnounceWorkerPortForDriver(
core_worker_server_->GetPort(), options_.entrypoint));
Contributor Author:

TODO: add a message to the check
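A hedged sketch of what that TODO might look like on the core-worker side; the variable name and message text are illustrative, not the merged change.

// Illustrative: surface a useful message if announcing the driver port fails.
Status announce_status = local_raylet_client_->AnnounceWorkerPortForDriver(
    core_worker_server_->GetPort(), options_.entrypoint);
RAY_CHECK(announce_status.ok())
    << "Failed to announce the driver's port to the local raylet: "
    << announce_status.ToString();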

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Resolved review thread: python/ray/tests/test_advanced_9.py
time.sleep(10)
# Note: `fds_without_workers` needs to be recorded *after* a ray start, because
# a prestarted worker is started on the first driver init. This worker keeps 1
# connection to the GCS, and it stays alive even after the driver exits. If
Collaborator:

do you know why the connection stays alive even after the driver exits?

Contributor Author (@rynewang) commented Apr 16, 2024:

The prestarted worker is not specifically for job 01000000; it's for job = nil (ffffffff): code. So it's not killed after the job finishes. Frankly, I don't know what it does - the worker can't be used by any job, right?

Collaborator:

Yeah, initially the prestarted worker is not tied to a job, but I thought that later on, when we start the actor, it would late-bind to job 01000000.

Collaborator:

Oh, those prestarted workers cannot be used for actors

void WorkerPool::PrestartDefaultCpuWorkers(ray::Language language, int64_t num_needed) {
  // default workers uses 1 cpu and doesn't support actor.
  static const WorkerCacheKey kDefaultCpuWorkerCacheKey{/*serialized_runtime_env*/ "",
                                                        {{"CPU", 1}},
                                                        /*is_actor*/ false,
                                                        /*is_gpu*/ false};

Co-authored-by: Jiajun Yao <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
@jjyao jjyao merged commit 26af5b7 into ray-project:master Apr 16, 2024
5 checks passed
harborn pushed a commit to harborn/ray that referenced this pull request Apr 18, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024

Successfully merging this pull request may close these issues.

GetAllJobInfo is_running_tasks is not returning the correct value when driver starts ray