Adds TPU pod Support within the Ray Cluster Launcher. #37934
Conversation
Signed-off-by: allenwang28 <[email protected]>
Copying @DmitriGekhtman's comment from the previous PR:
With this change, I was able to test the provisioning/updating steps using a real cluster and was able to ultimately run a simple JAX workload on it. That's really the main functionality I wanted to contribute here. I agree that proper autoscaling does not work with this PR and I am happy to make any changes to convey that message clearly. The problems I see are:
Signed-off-by: allenwang28 <[email protected]>
The code looks great, just some minor comments around readability.
- (This one is actually important) I think you need to add the new command runner test file, similar to https://github.com/allenwang28/ray/blob/e43285de1a9b0b5af3a87b33917dd3d6cc07c3cf/python/ray/tests/BUILD#L422, so that it runs in CI.
- For the parts that say "this may not be the right expected return result", were we saying it's okay because the return value doesn't seem to be used by the caller? Maybe we could include that in the comment as well.
Some questions which could be addressed in a followup PR:
- Do we plan to add some notes about the TPU pod/host business to the Ray docs?
- Since autoscaling isn't supported for TPU, ideally we would print some kind of warning if we detect that the user is trying this. Do you have a sense of where it would be best to print this? At a minimum, we should state this in the sample YAML file.
# We see issues where too many concurrent connections may lead to failed SSH connections
# when there are too many TPU hosts.
# For now, we cap the maximum concurrent connections until we find the right fix.
_MAX_NUM_CONCURRENT_ACTIVE_CONNECTIONS = 16
Does it make sense to make this configurable via an environment variable?
That makes sense. Any suggestions on a name?
Mm, maybe RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS?
Added! Btw, is there a way to set this value within the cluster YAML file? I couldn't find one, but that would be great for usability.
I don't think so, unfortunately; there's currently no place to set env vars in the cluster YAML that would be read by the cluster launcher code itself.
min_workers: 1
resources: {"TPU": 1}  # use TPU custom resource in your code
node_config:
  acceleratorType: v4-16
Should we add a comment to emphasize that v4-16 has two hosts?
Done
# A unique identifier for the head node and workers of this cluster.
cluster_name: tpupodtest

max_workers: 2
We have max_workers=2 here and min_workers=1 under ray_tpu, but since the TPU has two hosts, we actually have reached the max_workers bound for this cluster. Is that right? And if we had a TPU pod with 4 hosts, we'd need max_workers=4? It might be worth adding a comment to clarify this.
Actually, in this case the worker refers to the topology, so if we have a v4-16 as our acceleratorType and max_workers=2, it can create up to 2 v4-16s.
Ah makes sense!
command_runner = node_provider.get_command_runner(**args)
assert isinstance(command_runner, TPUCommandRunner)
You could consider using pytest.mark.parametrize to reduce the repeated test code; that would also make it easy to add a "tpu+docker" test.
Done, also added the tpu+docker test as suggested
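For context, here is a minimal, self-contained sketch of the pytest.mark.parametrize pattern being suggested; the parameter names and stand-in logic are illustrative only, not the actual test code added in this PR.

import pytest


# The same test body runs once per parameter tuple, so variants such as
# "ssh only" vs. "ssh + docker" can share a single test function.
@pytest.mark.parametrize(
    "use_docker, expected_runner",
    [
        (False, "ssh"),
        (True, "docker"),
    ],
)
def test_runner_selection(use_docker, expected_runner):
    # Stand-in logic; the real test builds the node provider, calls
    # get_command_runner(**args), and asserts isinstance(runner, TPUCommandRunner).
    runner = "docker" if use_docker else "ssh"
    assert runner == expected_runner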
calls = process_runner.calls

assert len(process_runner.calls) == num_workers
This is where we actually check that the command got executed once for each host in the TPU pod, right? It might be worth calling attention to this in a comment.
Done
for i in range(num_workers):
    for x, y in zip(calls[i], expected):
        assert x == y
process_runner.assert_has_call("1.2.3.4", exact=expected)
Where does 1.2.3.4 come from? I didn't notice it elsewhere in the test.
Hmm, I copied this from test_command_runner (https://github.com/ray-project/ray/blob/master/python/ray/tests/test_command_runner.py#L122), but it doesn't seem like this call really does much. The above asserts are sufficient, so I just removed the call altogether.
def get_external_ips(self) -> List[str]:
    return self.get("networkEndpoints", [{}])

def get_external_ip(self, worker_index: int = 0) -> str:
Can you remind me why it makes sense to have a default worker index here? How are the return values of get_external_ip() and get_internal_ip() used by the caller when the index isn't specified?
So node_provider has a function for get_external_ip(node_ip). Since TPUs are the only node types (at least currently) that will have multiple worker indices, I figured it would be safe to set a default index of 0, which is especially useful when we are using single TPU VMs. We could remove the default value, but we would need to change the logic here: https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/gcp/node_provider.py#L157 with a TPU check, which is kind of ugly.
Got it, that makes sense! The current way is fine in that case.
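To illustrate the point about the default index, here is a simplified sketch; it is not the actual GCPTPUNode code, and the networkEndpoints field names follow the GCP TPU API but are assumptions here.

from typing import Dict, List


class TPUNodeSketch(dict):
    # Simplified stand-in for a TPU node wrapper around the GCP API response.
    def get_external_ips(self) -> List[Dict]:
        return self.get("networkEndpoints", [{}])

    def get_external_ip(self, worker_index: int = 0) -> str:
        # Each network endpoint corresponds to one host in the pod slice;
        # defaulting to index 0 keeps single-host TPU VMs working unchanged.
        return (
            self.get_external_ips()[worker_index]
            .get("accessConfig", {})
            .get("externalIp", "")
        )


node = TPUNodeSketch({"networkEndpoints": [{"accessConfig": {"externalIp": "203.0.113.5"}}]})
print(node.get_external_ip())  # no index needed for a single-host TPU VM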
@@ -371,12 +381,17 @@ def list_instances(
        key=TAG_RAY_CLUSTER_NAME, value=self.cluster_name
    )

    # TPU VMs spawn accompanying Compute Instances that must be filtered out,
    # else this results in duplicated nodes.
    tpu_negation_filter_expr = "(NOT labels.{label}:*)".format(label="tpu_cores")
Ideally we would have a unit test for this
Great point, added a test in test_gcp_node_provider.py
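For readers following along, a hedged sketch of what such a unit test could look like; the helper name and the cluster-name label key below are illustrative assumptions, not the code that was added in test_gcp_node_provider.py.

def build_instance_filter(cluster_name: str) -> str:
    # Mirrors the idea under test: match this cluster's Compute instances while
    # excluding the instances that back TPU VMs (labeled with tpu_cores).
    cluster_filter = "(labels.ray-cluster-name = {name})".format(name=cluster_name)
    tpu_negation_filter_expr = "(NOT labels.{label}:*)".format(label="tpu_cores")
    return "{} AND {}".format(cluster_filter, tpu_negation_filter_expr)


def test_tpu_backed_instances_are_filtered_out():
    expr = build_instance_filter("tpupodtest")
    assert "(NOT labels.tpu_cores:*)" in expr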
…ariable, 2) warnings for TPU pod autoscaling, 3) better comments about tpu command runner retvalues, 4) comments in example yaml, 5) adding tests into the build file, 6) better and cleaner tests Signed-off-by: allenwang28 <[email protected]>
"""
num_max_concurrent_active_connections = os.getenv(
    ray_constants.RAY_TPU_MAX_CONCURRENT_CONNECTIONS_ENV_VAR, default=16
Ah, sorry, I think this might return a string and cause an error; can you double-check? We seem to use env_integer for this: https://github.com/architkulkarni/ray/blob/37c415ff380631c477afe93c1b88cd959eb00fca/python/ray/_private/ray_constants.py#L10
Aha, you're right. Btw, would you prefer that I move this altogether into ray_constants, e.g.

# f: ray_constants.py
RAY_TPU_MAX_CONCURRENT_CONNECTIONS = env_integer("RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS", 16)

(and the appropriate changes in tpu_command_runner.py)?
I don't have a preference about that
Actually, let's read the env var as late as possible to make it more configurable, so the env_integer call would be inside the command runner.
Done (+ added a test for this)
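A minimal sketch of the late-read pattern discussed above, assuming an env_integer helper along the lines of the one in ray_constants (simplified here, not the merged code):

import os


def env_integer(key: str, default: int) -> int:
    # Simplified stand-in for ray_constants.env_integer: parse the variable as
    # an int, falling back to the default if it is unset or malformed.
    value = os.environ.get(key)
    try:
        return int(value) if value is not None else default
    except ValueError:
        return default


def max_concurrent_connections() -> int:
    # Read at call time rather than import time, so setting the variable after
    # the module is imported still takes effect.
    return env_integer("RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS", 16)


os.environ["RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS"] = "8"
print(max_concurrent_connections())  # -> 8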
@architkulkarni - addressing a few comments in your top-level comment:
Done
Done
That's a great idea; I can definitely do that in a future PR. Any suggestions on where this should land? Maybe I should open an issue for general TPU documentation?
Yes, great point. I think you can actually "hack" TPU pod support, so I don't want to turn this off altogether, but I added a logger warning in config.py and wrote a note in the YAML file saying that TPU pod autoscaling isn't properly supported.
Yup, great question. One idea is to add a subsection here: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/gcp.html#launching-ray-clusters-on-gcp. Opening an issue sounds like a good idea!
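For illustration, a hedged sketch of the kind of TPU-autoscaling warning described above; the function name, detection condition, and message are assumptions, and the actual check added in config.py may differ.

import logging

logger = logging.getLogger(__name__)


def warn_if_tpu_autoscaling(node_config: dict, min_workers: int, max_workers: int) -> None:
    # Flag configurations that could trigger autoscaling of a TPU node type,
    # since the cluster launcher cannot properly scale TPU pod slices yet.
    if "acceleratorType" in node_config and min_workers != max_workers:
        logger.warning(
            "TPU pod autoscaling is not properly supported by the cluster "
            "launcher; consider setting min_workers == max_workers for TPU "
            "node types."
        )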
@allenwang28 I think you got unlucky and hit a merge conflict; could you please take a look?
Stamp to unblock
Signed-off-by: allenwang28 <[email protected]>
Fixed the merge conflict
Signed-off-by: allenwang28 <[email protected]>
Failed tests:
Why are these changes needed?
This adds support for TPU pod slice provisioning within the Ray cluster launcher. The unique feature of TPU pod slices is that they consist of multiple machines with individual IP addresses. These slices are provisioned/deleted as a group with a single command, but updates happen within the individual machines. The Ray cluster launcher follows a VM-centric paradigm, i.e., machines in a cluster are individually provisioned/deleted/accessed. To stay consistent with the existing Ray cluster launcher paradigm, we represent TPU pod slices as individual nodes and broadcast updates to all workers.
Brief overview of the changes:
- GCPTPUNode has the ability to access internal/external IP addresses based on worker indices.
- GCPNodeProvider overrides get_command_runner. If the resource type is a TPU, return a TPUCommandRunner, else return the base command runner.
- Introduces a TPUCommandRunner, an extension of the CommandRunnerInterface. It contains N command runners (both SSH/Docker-based) that have been modified to overwrite the internal/external IP address getters. Base interface functions are broadcast across workers in the TPU pod slice (see the sketch below).
- Unit tests have been added.
- We also provide a sample YAML that uses TPU pods.

(This is a redo of #37932 to fix some of my issues with signed commits.)
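A minimal sketch of the broadcast idea behind the TPU command runner, assuming a thread pool for the concurrency cap; the class and method names below are illustrative, not the merged implementation.

from concurrent.futures import ThreadPoolExecutor
from typing import List


class HostRunner:
    # Stand-in for a single per-host SSH/Docker command runner.
    def __init__(self, ip: str) -> None:
        self.ip = ip

    def run(self, cmd: str) -> str:
        return "[{}] ran: {}".format(self.ip, cmd)


class PodRunnerSketch:
    # Holds one runner per host and fans each call out to all of them.
    def __init__(self, host_ips: List[str], max_concurrency: int = 16) -> None:
        self.runners = [HostRunner(ip) for ip in host_ips]
        self.max_concurrency = max_concurrency

    def run(self, cmd: str) -> List[str]:
        # Broadcast the same command to every host in the pod slice, capping
        # the number of simultaneous connections.
        with ThreadPoolExecutor(max_workers=self.max_concurrency) as pool:
            return list(pool.map(lambda r: r.run(cmd), self.runners))


print(PodRunnerSketch(["10.0.0.1", "10.0.0.2"]).run("echo hello"))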
Related issue number
Addresses Ray Cluster Launcher / Autoscaler support for TPU pods (see #25223)
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.