
Adds TPU pod Support within the Ray Cluster Launcher. #37934

Merged
11 commits merged into ray-project:master on Aug 17, 2023

Conversation

allenwang28 (Contributor):

Why are these changes needed?

This adds support for TPU pod slice provisioning within the Ray cluster launcher. The distinctive feature of TPU pod slices is that they consist of multiple machines with individual IP addresses. These slices are provisioned/deleted as a group with a single command, but updates happen on the individual machines. The Ray cluster launcher follows a VM-centric paradigm, i.e., machines in a cluster are individually provisioned/deleted/accessed. To stay consistent with the existing Ray cluster launcher paradigm, we represent TPU pod slices as individual nodes and broadcast updates to all workers.

Brief overview of the changes:

  • GCPTPUNode can access internal/external IP addresses based on worker indices.
  • GCPNodeProvider overrides get_command_runner: if the resource type is a TPU, it returns a TPUCommandRunner; otherwise it returns the base command runner.
  • Introduces a TPUCommandRunner, an extension of the CommandRunnerInterface (a rough sketch follows this list).
    • It contains N command runners (both SSH- and Docker-based) that have been modified to override the internal/external IP address getters.
    • Base interface functions are broadcast across workers in the TPU pod slice.
  • Unit tests have been added.
  • We also provide a sample YAML that uses TPU pods.
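For orientation, here is a minimal sketch of the broadcast idea, assuming a per-host runner object with a run() method; names and signatures are simplified and this is not the actual code in tpu_command_runner.py:

from concurrent.futures import ThreadPoolExecutor
from typing import Any, List


class TPUCommandRunnerSketch:
    """Fans a single command out to every host in a TPU pod slice.

    `worker_runners` would be one SSH- or Docker-based command runner per
    host, each already pointed at that host's internal/external IP.
    """

    def __init__(self, worker_runners: List[Any], max_concurrency: int = 16):
        self._runners = worker_runners
        self._max_concurrency = max_concurrency

    def run(self, cmd: str, **kwargs) -> None:
        # Broadcast the same command to every host, with bounded concurrency
        # to avoid opening too many SSH connections on large slices.
        with ThreadPoolExecutor(max_workers=self._max_concurrency) as pool:
            futures = [pool.submit(r.run, cmd, **kwargs) for r in self._runners]
            for f in futures:
                f.result()  # surface any per-host failure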

(This is a redo of #37932 to fix some of my issues with signed commits..)

Related issue number

Addresses Ray Cluster Launcher / Autoscaler support for TPU pods (see #25223)

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • [n/a] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@allenwang28 (Contributor Author):

Copying @DmitriGekhtman's comment from the previous PR:

Passing comment -- actual review will come from ray-core team --
I think the autoscaler internals probably assume a 1-1 mapping between Ray nodes and VMs, so this might not work as expected. In particular, it's unlikely that autoscaling (with a dynamic number of TPU pods) will work properly.

With this change, I was able to test the provisioning/updating steps using a real cluster and was able to ultimately run a simple JAX workload on it. That's really the main functionality I wanted to contribute here.

I agree that proper autoscaling does not work with this PR and I am happy to make any changes to convey that message clearly. The problems I see are:

  • Function/actor placement logic - we would likely need to introduce changes in Ray core (maybe within placement groups?) such that we can send particular Ray functions/actors to particular groups that we can hint during provisioning time (cc @richardliaw)
  • Resource request handling - it's unclear to me how making a resource request from the autoscaler will translate to TPU pod creation...
  • Integrating the changes above into tune/serve/train etc.

@allenwang28 changed the title from "Adds proper TPU pod Support within the Ray Cluster Launcher / Autoscaler." to "Adds proper TPU pod Support within the Ray Cluster Launcher." on Jul 31, 2023
@allenwang28 changed the title from "Adds proper TPU pod Support within the Ray Cluster Launcher." to "Adds TPU pod Support within the Ray Cluster Launcher." on Jul 31, 2023
@architkulkarni self-assigned this on Aug 4, 2023
@architkulkarni (Contributor) left a comment:

The code looks great, just some minor comments around readability.

Some questions which could be addressed in a followup PR:

  • Do we plan to add some notes about the TPU pod/host business to the Ray docs?
  • Since autoscaling isn't supported for TPU, ideally we would print some kind of warning if we detect that the user is trying this. Do you have a sense of where it would be best to print this? At a minimum, we should state this in the sample YAML file.

# We see issues where too many concurrent connections may lead to failed SSH connections
# when there are too many TPU hosts.
# For now, we cap the maximum concurrent connections until we find the right fix.
_MAX_NUM_CONCURRENT_ACTIVE_CONNECTIONS = 16
Contributor:

Does it make sense to make this configurable via an environment variable?

Contributor Author:

That makes sense, any suggestions on a name?

Contributor:

Mm, maybe RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS?

Contributor Author:

Added! Btw, is there a way to set this value within the cluster yaml file? I couldn't find one, but that would be great for usability.

Contributor:

I don't think so unfortunately, there's currently no place to set env vars in the cluster YAML that would be read by the cluster launcher code itself.

min_workers: 1
resources: {"TPU": 1} # use TPU custom resource in your code
node_config:
    acceleratorType: v4-16
Contributor:

Should we add a comment to emphasize that v4-16 has two hosts?

Contributor Author:

Done

# A unique identifier for the head node and workers of this cluster.
cluster_name: tpupodtest

max_workers: 2
Contributor:

We have max_workers=2 here and min_workers=1 under ray_tpu, but since the TPU has two hosts, we actually have reached the max_workers bound for this cluster. Is that right? And if we had a TPU pod with 4 hosts we'd need max_workers=4? It might be worth adding a comment to clarify this

Contributor Author:

Actually, in this case a worker refers to an entire pod slice (the full topology), so with acceleratorType v4-16 and max_workers=2 the cluster can create up to two v4-16 slices.

Contributor:

Ah makes sense!

command_runner = node_provider.get_command_runner(**args)
assert isinstance(command_runner, TPUCommandRunner)


Contributor:

You could consider using pytest.mark.parametrize to reduce the repeated test code, that would also make it easy to add a "tpu+docker" test

Contributor Author:

Done, also added the tpu+docker test as suggested
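For reference, the parametrization pattern being suggested looks roughly like this; `_make_fake_gcp_node_provider` and `_fake_runner_args` are hypothetical helpers standing in for the mocked-GCP setup, not functions in the Ray codebase:

import pytest


@pytest.mark.parametrize(
    "use_tpu,use_docker",
    [
        (False, False),  # plain SSH command runner
        (False, True),   # Docker command runner
        (True, False),   # TPU command runner wrapping SSH runners
        (True, True),    # TPU command runner wrapping Docker runners
    ],
)
def test_get_command_runner(use_tpu, use_docker):
    node_provider = _make_fake_gcp_node_provider(tpu=use_tpu, docker=use_docker)
    command_runner = node_provider.get_command_runner(**_fake_runner_args(docker=use_docker))
    if use_tpu:
        # Checking the class name avoids assuming the exact import path here.
        assert type(command_runner).__name__ == "TPUCommandRunner"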


calls = process_runner.calls

assert len(process_runner.calls) == num_workers
Contributor:

This is where we actually check that the command got executed once for each host in the TPU pod right? it might be worth calling attention to this in a comment

Contributor Author:

Done

for i in range(num_workers):
    for x, y in zip(calls[i], expected):
        assert x == y
process_runner.assert_has_call("1.2.3.4", exact=expected)
Contributor:

Where does 1.2.3.4 come from? I didn't notice it elsewhere in the test.

Contributor Author:

hmm, I copied this from test_command_runner: https://github.com/ray-project/ray/blob/master/python/ray/tests/test_command_runner.py#L122

but it doesn't seem like this call really does much; the above asserts are sufficient, so I just removed the call altogether.

def get_external_ips(self) -> List[str]:
    return self.get("networkEndpoints", [{}])

def get_external_ip(self, worker_index: int = 0) -> str:
Contributor:

Can you remind me why it makes sense to have a default worker index here? How are the return values of get_external_ip() and get_internal_ip() used by the caller when the index isn't specified?

Contributor Author:

So node_provider has a function for get_external_ip(node_ip). Since TPUs are (at least currently) the only node types that will have multiple worker indices, I figured it would be safe to set a default index of 0, which is especially useful when we are using single TPU VMs.

We could remove the default value, but we would need to change the logic here: https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/gcp/node_provider.py#L157 with a TPU check which is kind of ugly.
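For context, a simplified sketch of those accessors on the TPU node object; the networkEndpoints list comes from the Cloud TPU API response, while the externalIp extraction shown here is an assumption rather than the exact code:

from typing import Any, Dict, List


class GCPTPUNodeSketch(dict):  # simplified; the real GCPTPUNode carries more GCP metadata
    def get_external_ips(self) -> List[Dict[str, Any]]:
        # One network endpoint per host (worker) in the TPU pod slice.
        return self.get("networkEndpoints", [{}])

    def get_external_ip(self, worker_index: int = 0) -> str:
        # Defaulting to worker 0 keeps single-host TPU VMs and the generic
        # node_provider external-IP code path working without a TPU-specific branch.
        return (
            self.get_external_ips()[worker_index]
            .get("accessConfig", {})
            .get("externalIp", "")
        )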

Contributor:

Got it, that makes sense! The current way is fine in that case

@@ -371,12 +381,17 @@ def list_instances(
key=TAG_RAY_CLUSTER_NAME, value=self.cluster_name
)

# TPU VMs spawn accompanying Compute Instances that must be filtered out,
# else this results in duplicated nodes.
tpu_negation_filter_expr = "(NOT labels.{label}:*)".format(label="tpu_cores")
Contributor:

Ideally we would have a unit test for this

Contributor Author:

Great point, added a test in test_gcp_node_provider.py
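Roughly, the property worth asserting is that whatever filter string list_instances sends to the Compute API carries the negation clause. A self-contained sketch follows; the cluster-label key and the AND composition are assumptions, and the real test in test_gcp_node_provider.py may be structured quite differently (e.g., via a mocked GCP client):

def _build_instance_filter(cluster_name: str) -> str:
    # Illustrative reconstruction: match this cluster's instances, but exclude
    # the Compute instances GCP spawns alongside TPU VMs (labeled tpu_cores).
    cluster_filter = '(labels.ray-cluster-name = "{}")'.format(cluster_name)
    tpu_negation_filter_expr = "(NOT labels.{label}:*)".format(label="tpu_cores")
    return "{} AND {}".format(cluster_filter, tpu_negation_filter_expr)


def test_list_instances_excludes_tpu_compute_instances():
    assert "(NOT labels.tpu_cores:*)" in _build_instance_filter("tpupodtest")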

…ariable, 2) warnings for TPU pod autoscaling, 3) better comments about tpu command runner retvalues, 4) comments in example yaml, 5) adding tests into the build file, 6) better and cleaner tests

Signed-off-by: allenwang28 <[email protected]>

"""
num_max_concurrent_active_connections = os.getenv(
ray_constants.RAY_TPU_MAX_CONCURRENT_CONNECTIONS_ENV_VAR, default=16
Contributor:

Ah sorry I think this might return a string and cause an error, can you double check? We seem to use env_integer for this https://github.com/architkulkarni/ray/blob/37c415ff380631c477afe93c1b88cd959eb00fca/python/ray/_private/ray_constants.py#L10

Contributor Author:

Aha, you're right. Btw would you prefer that I move this altogether into ray_constants, e.g.

# f:ray_constants.py

RAY_TPU_MAX_CONCURRENT_CONNECTIONS = env_integer("RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS", 16)

(and the appropriate changes in tpu_command_runner.py)

Contributor:

I don't have a preference about that

Contributor:

Actually, let's read the env var as late as possible to make it more configurable. So the env_integer would be inside the command runner

@allenwang28 (Contributor Author), Aug 16, 2023:

Done (+ added a test for this)
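In other words, something along these lines inside the command runner module; this is a sketch with the surrounding class omitted, and in the actual change the env-var name constant lives in ray_constants rather than being redefined locally:

from ray._private.ray_constants import env_integer

RAY_TPU_MAX_CONCURRENT_CONNECTIONS_ENV_VAR = "RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS"


def _num_max_concurrent_active_connections() -> int:
    # Read the env var at call time rather than at import time, so it can be
    # changed per invocation without reloading the module.
    return env_integer(RAY_TPU_MAX_CONCURRENT_CONNECTIONS_ENV_VAR, 16)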

@allenwang28 (Contributor Author) commented Aug 16, 2023:

@architkulkarni - addressing a few comments in your top level comment:

(This one is actually important) I think you need to add the new command runner test file similar to https://github.com/allenwang28/ray/blob/e43285de1a9b0b5af3a87b33917dd3d6cc07c3cf/python/ray/tests/BUILD#L422 so that it runs in CI

Done

For the parts that say "this may not be the right expected return result", were we saying it's okay because the return value doesn't seem to be used by the caller? Maybe we could include that in the comment as well.

Done

Do we plan to add some notes about the TPU pod/host business to the Ray docs?

That's a great idea, I can definitely do that in a future PR. Any suggestions on where this should land? Maybe I should open an issue for general TPU documentation?

Since autoscaling isn't supported for TPU, ideally we would print some kind of warning if we detect that the user is trying this. Do you have a sense of where it would be best to print this? At a minimum, we should state this in the sample YAML file.

Yes, great point. I think you can actually "hack" TPU pod support so I don't want to turn this off altogether, but I added in a logger warning in config.py and wrote a note in the YAML file saying that TPU pod autoscaling isn't properly supported.
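For illustration only, a warning of this sort might look like the following; this is a hypothetical sketch, not the actual change in config.py, and the real check may key off different fields:

import logging

logger = logging.getLogger(__name__)


def _warn_if_tpu_pod_autoscaling(node_type_is_tpu: bool, min_workers: int, max_workers: int) -> None:
    # Flag configurations that would ask the autoscaler to scale TPU pod
    # slices dynamically, which this PR does not support.
    if node_type_is_tpu and min_workers != max_workers:
        logger.warning(
            "TPU pod autoscaling is not supported by the Ray cluster "
            "launcher; consider setting min_workers == max_workers for "
            "TPU node types."
        )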

@architkulkarni (Contributor):

That's a great idea, I can definitely do that in a future PR. Any suggestions on where this should land? Maybe I should open an issue for general TPU documentation?

Yup great question, one idea is to add a subsection here: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/gcp.html#launching-ray-clusters-on-gcp.

Opening an issue sounds like a good idea!

@architkulkarni (Contributor):

@allenwang28 I think you got unlucky and hit a merge conflict, could you please take a look?

@rickyyx (Contributor) left a comment:

Stamp to unblock

@allenwang28 (Contributor Author):

Fixed the merge conflict

@architkulkarni (Contributor):

Failed tests:

  • Windows test_cancel unrelated
  • rllib:TestLearnerGroupAsyncUpdate unrelated
  • rllib:test_memory_leak_ppo unrelated
  • rllib:examples/learner/train_w_bc_finetune_w_ppo unrelated
  • tune:test_remote unrelated
  • serve:test_callback unrelated
  • tests:test_out_of_disk_space unrelated
  • py11 build images unrelated and flaky on master

@architkulkarni added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Aug 17, 2023
@architkulkarni merged commit 47edb32 into ray-project:master on Aug 17, 2023
2 checks passed
vitsai pushed a commit to vitsai/ray that referenced this pull request Aug 19, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023