
Adds TPU pod Support within the Ray Cluster Launcher. #37934

Merged
11 commits merged into ray-project:master on Aug 17, 2023

Conversation

allenwang28 (Contributor):

Why are these changes needed?

This adds support for TPU pod slice provisioning within the Ray cluster launcher. The distinctive feature of TPU pod slices is that they consist of multiple machines with individual IP addresses. These slices are provisioned/deleted as a group with a single command, but updates happen on the individual machines. The Ray cluster launcher follows a VM-centric paradigm, i.e., machines in a cluster are individually provisioned/deleted/accessed. To stay consistent with the existing Ray cluster launcher paradigm, we represent TPU pod slices as individual nodes and broadcast updates to all workers.

Brief overview of the changes:

  • GCPTPUNode can access internal/external IP addresses based on worker indices.
  • GCPNodeProvider overrides get_command_runner: if the resource type is a TPU, it returns a TPUCommandRunner; otherwise it returns the base command runner.
  • Introduces a TPUCommandRunner, an extension of the CommandRunnerInterface (a rough sketch follows this list).
    • It contains N command runners (both SSH- and Docker-based) that have been modified to override the internal/external IP address getters.
    • Base interface functions are broadcast across workers in the TPU pod slice.
  • Unit tests have been added.
  • We also provide a sample YAML that uses TPU pods.
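For orientation, here is a minimal sketch of the broadcast idea, assuming a per-host runner object with a run() method; names and signatures are simplified and this is not the actual code in tpu_command_runner.py:

from concurrent.futures import ThreadPoolExecutor
from typing import Any, List


class TPUCommandRunnerSketch:
    """Fans a single command out to every host in a TPU pod slice.

    `worker_runners` would be one SSH- or Docker-based command runner per
    host, each already pointed at that host's internal/external IP.
    """

    def __init__(self, worker_runners: List[Any], max_concurrency: int = 16):
        self._runners = worker_runners
        self._max_concurrency = max_concurrency

    def run(self, cmd: str, **kwargs) -> None:
        # Broadcast the same command to every host, with bounded concurrency
        # to avoid opening too many SSH connections on large slices.
        with ThreadPoolExecutor(max_workers=self._max_concurrency) as pool:
            futures = [pool.submit(r.run, cmd, **kwargs) for r in self._runners]
            for f in futures:
                f.result()  # surface any per-host failure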

(This is a redo of #37932 to fix some of my issues with signed commits..)

Related issue number

Addresses Ray Cluster Launcher / Autoscaler support for TPU pods (see #25223)

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • [n/a] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@allenwang28 (Contributor Author):

Copying @DmitriGekhtman's comment from the previous PR:

Passing comment -- actual review will come from ray-core team --
I think the autoscaler internals probably assume a 1-1 mapping between Ray nodes and VMs, so this might not work as expected. In particular, it's unlikely that autoscaling (with a dynamic number of TPU pods) will work properly.

With this change, I was able to test the provisioning/updating steps using a real cluster and was able to ultimately run a simple JAX workload on it. That's really the main functionality I wanted to contribute here.

I agree that proper autoscaling does not work with this PR and I am happy to make any changes to convey that message clearly. The problems I see are:

  • Function/actor placement logic - we would likely need to introduce changes in Ray core (maybe within placement groups?) such that we can send particular Ray functions/actors to particular groups that we can hint during provisioning time (cc @richardliaw)
  • Resource request handling - it's unclear to me how making a resource request from the autoscaler will translate to TPU pod creation...
  • Integrating the changes above into tune/serve/train etc.

@allenwang28 changed the title from "Adds proper TPU pod Support within the Ray Cluster Launcher / Autoscaler." to "Adds proper TPU pod Support within the Ray Cluster Launcher." on Jul 31, 2023
@allenwang28 changed the title from "Adds proper TPU pod Support within the Ray Cluster Launcher." to "Adds TPU pod Support within the Ray Cluster Launcher." on Jul 31, 2023
@architkulkarni self-assigned this on Aug 4, 2023
@architkulkarni (Contributor) left a comment:

The code looks great, just some minor comments around readability.

Some questions which could be addressed in a followup PR:

  • Do we plan to add some notes about the TPU pod/host business to the Ray docs?
  • Since autoscaling isn't supported for TPU, ideally we would print some kind of warning if we detect that the user is trying this. Do you have a sense of where it would be best to print this? At a minimum, we should state this in the sample YAML file.

# We see issues where too many concurrent connections may lead to failed SSH connections
# when there are too many TPU hosts.
# For now, we cap the maximum concurrent connections until we find the right fix.
_MAX_NUM_CONCURRENT_ACTIVE_CONNECTIONS = 16
Contributor:

Does it make sense to make this configurable via an environment variable?

Contributor Author:

That makes sense, any suggestions on a name?

Contributor:

Mm, maybe RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS?

Contributor Author:

Added! Btw, is there a way to set this value within the cluster yaml file? I couldn't find one, but that would be great for usability.

Contributor:

I don't think so unfortunately, there's currently no place to set env vars in the cluster YAML that would be read by the cluster launcher code itself.

min_workers: 1
resources: {"TPU": 1} # use TPU custom resource in your code
node_config:
    acceleratorType: v4-16
Contributor:

Should we add a comment to emphasize that v4-16 has two hosts?

Contributor Author:

Done

# A unique identifier for the head node and workers of this cluster.
cluster_name: tpupodtest

max_workers: 2
Contributor:

We have max_workers=2 here and min_workers=1 under ray_tpu, but since the TPU has two hosts, we actually have reached the max_workers bound for this cluster. Is that right? And if we had a TPU pod with 4 hosts we'd need max_workers=4? It might be worth adding a comment to clarify this

Contributor Author:

Actually, in this case a worker refers to an entire pod slice (the full topology), so with acceleratorType v4-16 and max_workers=2 the cluster can create up to two v4-16 slices.

Contributor:

Ah makes sense!

command_runner = node_provider.get_command_runner(**args)
assert isinstance(command_runner, TPUCommandRunner)


Contributor:

You could consider using pytest.mark.parametrize to reduce the repeated test code, that would also make it easy to add a "tpu+docker" test

Contributor Author:

Done, also added the tpu+docker test as suggested
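For reference, the parametrization pattern being suggested looks roughly like this; `_make_fake_gcp_node_provider` and `_fake_runner_args` are hypothetical helpers standing in for the mocked-GCP setup, not functions in the Ray codebase:

import pytest


@pytest.mark.parametrize(
    "use_tpu,use_docker",
    [
        (False, False),  # plain SSH command runner
        (False, True),   # Docker command runner
        (True, False),   # TPU command runner wrapping SSH runners
        (True, True),    # TPU command runner wrapping Docker runners
    ],
)
def test_get_command_runner(use_tpu, use_docker):
    node_provider = _make_fake_gcp_node_provider(tpu=use_tpu, docker=use_docker)
    command_runner = node_provider.get_command_runner(**_fake_runner_args(docker=use_docker))
    if use_tpu:
        # Checking the class name avoids assuming the exact import path here.
        assert type(command_runner).__name__ == "TPUCommandRunner"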


calls = process_runner.calls

assert len(process_runner.calls) == num_workers
Contributor:

This is where we actually check that the command got executed once for each host in the TPU pod right? it might be worth calling attention to this in a comment

Contributor Author:

Done

for i in range(num_workers):
    for x, y in zip(calls[i], expected):
        assert x == y
process_runner.assert_has_call("1.2.3.4", exact=expected)
Contributor:

Where does 1.2.3.4 come from? I didn't notice it elsewhere in the test.

Contributor Author:

hmm, I copied this from test_command_runner: https://github.com/ray-project/ray/blob/master/python/ray/tests/test_command_runner.py#L122

but it doesn't seem like this call really does much; the above asserts are sufficient, so I just removed the call altogether.

def get_external_ips(self) -> List[str]:
    return self.get("networkEndpoints", [{}])

def get_external_ip(self, worker_index: int = 0) -> str:
Contributor:

Can you remind me why it makes sense to have a default worker index here? How are the return values of get_external_ip() and get_internal_ip() used by the caller when the index isn't specified?

Contributor Author:

So node_provider has a function for get_external_ip(node_ip). Since TPUs are (at least currently) the only node types that will have multiple worker indices, I figured it would be safe to set a default index of 0, which is especially useful when we are using single TPU VMs.

We could remove the default value, but we would need to change the logic here: https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/gcp/node_provider.py#L157 with a TPU check which is kind of ugly.
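For context, a simplified sketch of those accessors on the TPU node object; the networkEndpoints list comes from the Cloud TPU API response, while the externalIp extraction shown here is an assumption rather than the exact code:

from typing import Any, Dict, List


class GCPTPUNodeSketch(dict):  # simplified; the real GCPTPUNode carries more GCP metadata
    def get_external_ips(self) -> List[Dict[str, Any]]:
        # One network endpoint per host (worker) in the TPU pod slice.
        return self.get("networkEndpoints", [{}])

    def get_external_ip(self, worker_index: int = 0) -> str:
        # Defaulting to worker 0 keeps single-host TPU VMs and the generic
        # node_provider external-IP code path working without a TPU-specific branch.
        return (
            self.get_external_ips()[worker_index]
            .get("accessConfig", {})
            .get("externalIp", "")
        )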

Contributor:

Got it, that makes sense! The current way is fine in that case

@@ -371,12 +381,17 @@ def list_instances(
key=TAG_RAY_CLUSTER_NAME, value=self.cluster_name
)

# TPU VMs spawn accompanying Compute Instances that must be filtered out,
# else this results in duplicated nodes.
tpu_negation_filter_expr = "(NOT labels.{label}:*)".format(label="tpu_cores")
Contributor:

Ideally we would have a unit test for this

Contributor Author:

Great point, added a test in test_gcp_node_provider.py
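Roughly, the property worth asserting is that whatever filter string list_instances sends to the Compute API carries the negation clause. A self-contained sketch follows; the cluster-label key and the AND composition are assumptions, and the real test in test_gcp_node_provider.py may be structured quite differently (e.g., via a mocked GCP client):

def _build_instance_filter(cluster_name: str) -> str:
    # Illustrative reconstruction: match this cluster's instances, but exclude
    # the Compute instances GCP spawns alongside TPU VMs (labeled tpu_cores).
    cluster_filter = '(labels.ray-cluster-name = "{}")'.format(cluster_name)
    tpu_negation_filter_expr = "(NOT labels.{label}:*)".format(label="tpu_cores")
    return "{} AND {}".format(cluster_filter, tpu_negation_filter_expr)


def test_list_instances_excludes_tpu_compute_instances():
    assert "(NOT labels.tpu_cores:*)" in _build_instance_filter("tpupodtest")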

…ariable, 2) warnings for TPU pod autoscaling, 3) better comments about tpu command runner retvalues, 4) comments in example yaml, 5) adding tests into the build file, 6) better and cleaner tests

Signed-off-by: allenwang28 <[email protected]>

"""
num_max_concurrent_active_connections = os.getenv(
ray_constants.RAY_TPU_MAX_CONCURRENT_CONNECTIONS_ENV_VAR, default=16
Contributor:

Ah sorry I think this might return a string and cause an error, can you double check? We seem to use env_integer for this https://github.com/architkulkarni/ray/blob/37c415ff380631c477afe93c1b88cd959eb00fca/python/ray/_private/ray_constants.py#L10

Contributor Author:

Aha, you're right. Btw would you prefer that I move this altogether into ray_constants, e.g.

# f:ray_constants.py

RAY_TPU_MAX_CONCURRENT_CONNECTIONS = env_integer("RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS", 16)

(and the appropriate changes in tpu_command_runner.py)

Contributor:

I don't have a preference about that

Contributor:

Actually, let's read the env var as late as possible to make it more configurable. So the env_integer would be inside the command runner

@allenwang28 (Contributor Author), Aug 16, 2023:

Done (+ added a test for this)
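In other words, something along these lines inside the command runner module; this is a sketch with the surrounding class omitted, and in the actual change the env-var name constant lives in ray_constants rather than being redefined locally:

from ray._private.ray_constants import env_integer

RAY_TPU_MAX_CONCURRENT_CONNECTIONS_ENV_VAR = "RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS"


def _num_max_concurrent_active_connections() -> int:
    # Read the env var at call time rather than at import time, so it can be
    # changed per invocation without reloading the module.
    return env_integer(RAY_TPU_MAX_CONCURRENT_CONNECTIONS_ENV_VAR, 16)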

@allenwang28 (Contributor Author) commented Aug 16, 2023:

@architkulkarni - addressing a few comments in your top level comment:

(This one is actually important) I think you need to add the new command runner test file similar to https://github.com/allenwang28/ray/blob/e43285de1a9b0b5af3a87b33917dd3d6cc07c3cf/python/ray/tests/BUILD#L422 so that it runs in CI

Done

For the parts that say "this may not be the right expected return result", were we saying it's okay because the return value doesn't seem to be used by the caller? Maybe we could include that in the comment as well.

Done

Do we plan to add some notes about the TPU pod/host business to the Ray docs?

That's a great idea, I can definitely do that in a future PR. Any suggestions on where this should land? Maybe I should open an issue for general TPU documentation?

Since autoscaling isn't supported for TPU, ideally we would print some kind of warning if we detect that the user is trying this. Do you have a sense of where it would be best to print this? At a minimum, we should state this in the sample YAML file.

Yes, great point. I think you can actually "hack" TPU pod support so I don't want to turn this off altogether, but I added in a logger warning in config.py and wrote a note in the YAML file saying that TPU pod autoscaling isn't properly supported.
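For illustration only, a warning of this sort might look like the following; this is a hypothetical sketch, not the actual change in config.py, and the real check may key off different fields:

import logging

logger = logging.getLogger(__name__)


def _warn_if_tpu_pod_autoscaling(node_type_is_tpu: bool, min_workers: int, max_workers: int) -> None:
    # Flag configurations that would ask the autoscaler to scale TPU pod
    # slices dynamically, which this PR does not support.
    if node_type_is_tpu and min_workers != max_workers:
        logger.warning(
            "TPU pod autoscaling is not supported by the Ray cluster "
            "launcher; consider setting min_workers == max_workers for "
            "TPU node types."
        )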

@architkulkarni (Contributor):

That's a great idea, I can definitely do that in a future PR. Any suggestions on where this should land? Maybe I should open an issue for general TPU documentation?

Yup great question, one idea is to add a subsection here: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/gcp.html#launching-ray-clusters-on-gcp.

Opening an issue sounds like a good idea!

@architkulkarni (Contributor):

@allenwang28 I think you got unlucky and hit a merge conflict, could you please take a look?

@rickyyx (Contributor) left a comment:

Stamp to unblock

@allenwang28 (Contributor Author):

Fixed the merge conflict

@architkulkarni (Contributor):

Failed tests:

  • Windows test_cancel unrelated
  • rllib:TestLearnerGroupAsyncUpdate unrelated
  • rllib:test_memory_leak_ppo unrelated
  • rllib:examples/learner/train_w_bc_finetune_w_ppo unrelated
  • tune:test_remote unrelated
  • serve:test_callback unrelated
  • tests:test_out_of_disk_space unrelated
  • py11 build images unrelated and flaky on master

@architkulkarni added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Aug 17, 2023
@architkulkarni merged commit 47edb32 into ray-project:master on Aug 17, 2023
2 checks passed
vitsai pushed a commit to vitsai/ray that referenced this pull request Aug 19, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023