
[Cluster launcher] [Azure] Make cluster termination and networking more configurable and robust #44100

Merged: 22 commits merged into ray-project:master on Mar 25, 2024

Conversation

mjd3 (Contributor) commented Mar 18, 2024

Why are these changes needed?

This PR addresses a few issues when launching clusters with Azure:

  1. Any changes made to subnets of the deployed virtual network(s) are clobbered upon redeployment.
    • Any service endpoints, route tables, or delegations are removed when redeploying (which happens on any of the ray CLI calls) due to this open Azure issue. This PR provides a workaround by copying the existing subnet configuration into the deployment template if a subnet with the cluster's unique ID already exists in the same resource group (see the subnet sketch after this list).
  2. VM termination is extremely lengthy and does not clean up all dependencies.
    • When VMs are provisioned, dependencies such as disks, NICs, and public IP addresses are also provisioned. However, because the termination process does not wait for the VM to be deleted, and the dependent resources cannot be deleted at the same time as the VM, these dependencies are often left in the resource group after termination. This can cause issues with quotas (i.e., reaching a limit of public IP addresses or disks) and wastes resources. This PR moves node termination into a pool of threads so that node deletion can be parallelized (since waiting for each node to be deleted takes a long time) and all dependencies can be correctly deleted once their VMs no longer exist (see the termination sketch after this list).
  3. VMs can have status code ProvisioningState/failed/RetryableError, causing an unpacking error.
    • This line throws an exception when the provisioning state is the string above, resulting in incorrect provisioning/termination of the node. This PR addresses the issue by slicing the list of status strings and using only the first two (see the status-slicing sketch after this list).
  4. The default quota for public IP addresses in Azure is only 100, which can result in quota limits being hit for larger clusters.
    • This PR adds an option (use_external_head_ip) for provisioning a public IP address only for the head node (instead of all nodes or no nodes). This lets a user still communicate with the head node via a public IP address without running into quota limits on public IP addresses. The option works in tandem with use_internal_ips: if both are set to True, a public IP address is provisioned only for the head node. If use_external_head_ip is omitted, the existing behavior is unchanged (i.e., public IPs are provisioned for all nodes if use_internal_ips is False; otherwise no public IPs are provisioned). See the public-IP sketch after this list.
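
To make item 1 concrete, here is a hedged sketch (not the PR's actual code) of reading back an existing subnet with the azure-mgmt-network SDK and carrying its service endpoints, route table, and delegations into the deployment template; the function name and the template layout assumed below are illustrative:

```python
# Hedged sketch: copy an existing subnet's configuration into the deployment
# template so redeployment does not drop service endpoints, route tables,
# or delegations.
from azure.core.exceptions import ResourceNotFoundError
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient


def copy_existing_subnet_into_template(template, subscription_id, resource_group,
                                        vnet_name, subnet_name):
    client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)
    try:
        existing = client.subnets.get(resource_group, vnet_name, subnet_name)
    except ResourceNotFoundError:
        return  # no existing subnet; let the deployment create a fresh one

    # Illustrative assumption: the subnet is the first resource in the template.
    props = template["resources"][0].setdefault("properties", {})
    if existing.service_endpoints:
        props["serviceEndpoints"] = [{"service": ep.service}
                                     for ep in existing.service_endpoints]
    if existing.route_table:
        props["routeTable"] = {"id": existing.route_table.id}
    if existing.delegations:
        props["delegations"] = [
            {"name": d.name, "properties": {"serviceName": d.service_name}}
            for d in existing.delegations
        ]
```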
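For item 2, a minimal sketch of parallelized termination with a thread pool, assuming azure-mgmt-compute and azure-mgmt-network clients; the resource-naming convention (VM name plus "-nic"/"-ip") and the pool size are illustrative, not the provider's actual scheme:

```python
# Hedged sketch: delete each VM, wait for it to be gone, then delete its
# dependent resources; run the per-node deletions in parallel threads.
from concurrent.futures import ThreadPoolExecutor


def terminate_node(compute_client, network_client, resource_group, vm_name):
    # begin_delete returns an LROPoller; wait() blocks until the VM is deleted,
    # after which the NIC and public IP are detached and can be removed.
    compute_client.virtual_machines.begin_delete(resource_group, vm_name).wait()
    network_client.network_interfaces.begin_delete(
        resource_group, f"{vm_name}-nic").wait()   # illustrative NIC name
    network_client.public_ip_addresses.begin_delete(
        resource_group, f"{vm_name}-ip").wait()    # illustrative IP name


def terminate_nodes(compute_client, network_client, resource_group, vm_names):
    # Parallelize the slow per-node deletions across a pool of threads.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(terminate_node, compute_client, network_client,
                               resource_group, name) for name in vm_names]
        for f in futures:
            f.result()  # re-raise any deletion errors
```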
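For item 3, the unpacking failure and the slicing fix can be shown in isolation (a sketch, not the exact line in node_provider.py):

```python
# Status codes are usually two parts ("ProvisioningState/succeeded"), but can be
# three ("ProvisioningState/failed/RetryableError"), which breaks 2-value unpacking.
code = "ProvisioningState/failed/RetryableError"

# key, value = code.split("/")      # ValueError: too many values to unpack
key, value = code.split("/")[:2]    # slice to the first two components
assert (key, value) == ("ProvisioningState", "failed")
```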
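And for item 4, the interaction between use_internal_ips and use_external_head_ip boils down to the decision sketched below (an illustrative helper, not the provider's code):

```python
def needs_public_ip(provider_config, is_head_node):
    """Decide whether a node gets a public IP under the new option (sketch)."""
    use_internal_ips = provider_config.get("use_internal_ips", False)
    use_external_head_ip = provider_config.get("use_external_head_ip", False)
    if not use_internal_ips:
        return True  # unchanged default: every node gets a public IP
    # Internal IPs only, except (optionally) the head node.
    return is_head_node and use_external_head_ip


# Example: internal IPs for workers, a single public IP for the head node.
cfg = {"use_internal_ips": True, "use_external_head_ip": True}
assert needs_public_ip(cfg, is_head_node=True)
assert not needs_public_ip(cfg, is_head_node=False)
```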

I've tested all of these fixes using ray up/ray dashboard/ray down on Azure clusters of 4-32 nodes to make sure that startup/teardown works correctly and that the correct number of resources is provisioned.

Related issue number

Node termination times are discussed in #25971

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

mjd3 (Contributor, Author) commented Mar 18, 2024

@gramhagen let me know if you would be able to review this one; I am also happy to break out some of the changes if you think that not all of them are necessary!

gramhagen (Contributor) left a comment

Looks great! Just some syntax suggestions and a question on the while-loop limits. I'll leave it to you how to address them; functionally it looks good.

Three review comments on python/ray/autoscaler/_private/_azure/node_provider.py (outdated, resolved).
mjd3 (Contributor, Author) commented Mar 19, 2024

Thanks for the review and suggestions @gramhagen; added some timeouts to those while loops just in case. @architkulkarni @hongchaodeng let me know if you have any other suggestions!

architkulkarni assigned jjyao and unassigned hongchaodeng Mar 19, 2024
architkulkarni (Contributor) left a comment

Just reviewed the code in commands.py and updater.py; looks good. Deferring to @gramhagen's approval for the Azure-specific part.

Another nit: the check for external_head_ip is a bit complex and in a few places, would it make sense to factor it out and add a unit test for it?

@@ -605,7 +604,7 @@ def kill_node(

time.sleep(POLL_INTERVAL)

if config.get("provider", {}).get("use_internal_ips", False) is True:
Reviewer (Contributor):

Nit: I'm not sure why it was written like this... Is it possible it's there to guard against something like the string "False" being truthy in Python, or some other nonempty value or typo in the config? What do you think?

mjd3 (Contributor, Author):

Yeah, I was also confused by that; it's inconsistent within that file, which is why I changed it when adding the additional check for the head node external IP param (for example, this line). So either way, one of these lines should likely be changed. Let me know what you think.
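
As an aside on the is True question (illustrative, not code from this PR): a plain truthiness check treats any nonempty string in the config, including a typo like "False", as enabled, while is True only accepts the boolean:

```python
config = {"provider": {"use_internal_ips": "False"}}  # typo: string, not bool

value = config.get("provider", {}).get("use_internal_ips", False)
print(bool(value))    # True  -- the nonempty string "False" is truthy
print(value is True)  # False -- identity check only passes for the bool True
```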

mjd3 (Contributor, Author) commented Mar 20, 2024

> Another nit: the check for external_head_ip is a bit complex and in a few places, would it make sense to factor it out and add a unit test for it?

Happy to add a unit test for this; would you be able to point me to where that would belong? I didn't see any Azure tests here and this folder only contains one utils.py file.

mjd3 and others added 12 commits March 21, 2024 11:56 (Signed-off-by: Mike Danielczuk <[email protected]>; Co-authored-by: Scott Graham <[email protected]>)
mjd3 (Contributor, Author) commented Mar 21, 2024

@architkulkarni I added some descriptive comments around the logic for use_external_head_ip; let me know if you think there should be tests added somewhere as well. I'm still uncertain on the is True conversation above. Let me know what you think is best there; happy to go either way on that one.

mjd3 (Contributor, Author) commented Mar 25, 2024

@architkulkarni following up on the above; let me know what next steps there are for this PR if any.

architkulkarni (Contributor) left a comment

Looks good, my comments were minor and don't need to block the PR.

architkulkarni merged commit 408b1fb into ray-project:master on Mar 25, 2024 (5 checks passed).
mjd3 deleted the mjd3/azure-terminator branch March 25, 2024 18:36.
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
[Cluster launcher] [Azure] Make cluster termination and networking more configurable and robust (ray-project#44100)

ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
[Cluster launcher] [Azure] Make cluster termination and networking more configurable and robust (ray-project#44100)
