[Cluster launcher] [Azure] Make cluster termination and networking more configurable and robust #44100
Conversation
@gramhagen let me know if you would be able to review this one; I am also happy to break out some of the changes if you think that not all of them are necessary!
Force-pushed from e1ba4fd to 3051e58
Looks great! Just some syntax suggestions and a question on the while-loop limits. I'll leave it to you how to address them; functionally it looks good.
Thanks for the review and suggestions @gramhagen; I added some timeouts to those while loops just in case. @architkulkarni @hongchaodeng let me know if you have any other suggestions!
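For illustration, a minimal sketch of the bounded-polling pattern mentioned above; the names `TIMEOUT`, `node_is_terminated`, and `wait_for_termination` are assumptions for this example, not the PR's actual code:

```python
import time

POLL_INTERVAL = 5    # seconds between status checks (illustrative value)
TIMEOUT = 300        # hypothetical cap so the loop cannot spin forever

def node_is_terminated(node_id: str) -> bool:
    """Hypothetical status probe; the real code queries the Azure SDK."""
    ...

def wait_for_termination(node_id: str) -> None:
    # Poll until the node is gone, but bail out after TIMEOUT seconds
    # so a stuck deletion cannot hang the cluster launcher forever.
    deadline = time.monotonic() + TIMEOUT
    while not node_is_terminated(node_id):
        if time.monotonic() > deadline:
            raise TimeoutError(f"node {node_id} not terminated after {TIMEOUT}s")
        time.sleep(POLL_INTERVAL)
```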
Just reviewed the code in commands.py and updater.py, looks good. Deferring to @gramhagen's approval for the Azure-specific part.
Another nit: the check for external_head_ip is a bit complex and appears in a few places; would it make sense to factor it out and add a unit test for it?
@@ -605,7 +604,7 @@ def kill_node(

        time.sleep(POLL_INTERVAL)

    if config.get("provider", {}).get("use_internal_ips", False) is True:
Nit: I'm not sure why it was written like this... Is it possible it's to guard against something like the string "False" being truthy in Python, or any other nonempty value or typo in the config? What do you think?
Yeah, I was also confused by that; it's inconsistent within that file, which is why I changed it when adding the additional check for the head node external IP param. For example, this line. So either way, one of these lines should likely be changed. Let me know what you think.
Happy to add a unit test for this; would you be able to point me to where that would belong? I didn't see any Azure tests here, and this folder only contains one
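For context, a minimal sketch of what the strict `is True` comparison guards against, assuming a misconfigured value sneaks into the config as a string; this is illustrative, not the provider's actual code:

```python
# Suppose a user quotes the flag in their cluster YAML, yielding a string.
provider_config = {"use_internal_ips": "False"}  # typo: string, not bool

value = provider_config.get("use_internal_ips", False)

# Plain truthiness: any non-empty string is truthy, so "False" passes.
print(bool(value))      # True -- the typo silently enables the branch

# Strict identity check: only the boolean True passes, catching the typo.
print(value is True)    # False
```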
Force-pushed from dd23327 to d3090f5
Force-pushed from d3090f5 to 9cb95a9
@architkulkarni I added some descriptive comments around the logic for
@architkulkarni following up on the above; let me know what next steps there are for this PR, if any.
Looks good, my comments were minor and don't need to block the PR.
Why are these changes needed?
This PR addresses a few issues when launching clusters with Azure:
- Any changes made to subnets of the deployed virtual network(s) are bashed upon redeployment. Any service endpoints, route tables, or delegations are removed when redeploying (which happens on any of the `ray` CLI calls) due to this open Azure issue. This PR provides a workaround by copying the existing subnet configuration into the deployment template if a subnet already exists with the cluster unique id within the same resource group.
- VM termination is extremely lengthy and does not clean up all dependencies. When VMs are provisioned, dependencies such as disks, NICs, and public IP addresses are also provisioned. However, because the termination process does not wait for the VM to be deleted, and the dependent resources cannot be deleted at the same time as the VM, these dependencies are often left in the resource group after termination. This can cause issues with quotas (i.e., reaching a limit of public IP addresses or disks) and wastes resources. This PR moves node termination into a pool of threads so that node deletion can be parallelized (since waiting for each node to be deleted takes a long time) and all dependencies can be correctly deleted once their VMs no longer exist (see the thread-pool sketch after this list).
- VMs can have status code `ProvisioningState/failed/RetryableError`, causing an unpacking error. This line throws an exception when the provisioning state is the string above, resulting in incorrect provisioning/termination of the node. This PR addresses that issue by slicing the list of status strings and only using the first two (see the slicing example below).
- The default quota for public IP addresses in Azure is only 100, which can result in quota limits being hit for larger clusters. This PR adds an option (`use_external_head_ip`) for provisioning a public IP address only for the head node (instead of all nodes or no nodes). This allows a user to still communicate with the head node via a public IP address without running into quota limits on public IP addresses. This option works in tandem with `use_internal_ips`: if both are set to `True`, a public IP address will be provisioned only for the head node. If `use_external_head_ip` is omitted, the behavior is unchanged (i.e., public IPs will be provisioned for all nodes if `use_internal_ips` is `False`; otherwise no public IPs will be provisioned). See the helper sketch below.
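A minimal sketch of the parallel-termination pattern described in the second bullet; `terminate_node` and the worker count are assumptions for illustration, not the PR's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def terminate_node(node_id: str) -> str:
    """Hypothetical per-node teardown: delete the VM, wait for the deletion
    to finish, then delete the now-orphaned disk, NIC, and public IP."""
    ...
    return node_id

def terminate_nodes(node_ids: list[str]) -> None:
    # Each deletion blocks for a long time, so running them in a thread
    # pool parallelizes the waiting instead of paying it once per node.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(terminate_node, nid) for nid in node_ids]
        for future in as_completed(futures):
            print(f"terminated {future.result()}")
```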
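The unpacking error from the third bullet comes down to a two-variable unpack meeting a three-part status string; a minimal reproduction of the failure and the slicing fix (illustrative, not the exact source line):

```python
status = "ProvisioningState/failed/RetryableError"

# Unpacking into exactly two names raises
# "ValueError: too many values to unpack" on the three-part code above:
#     state, value = status.split("/")

# Slicing keeps only the first two fields, tolerating any extras:
state, value = status.split("/")[:2]
print(state, value)  # ProvisioningState failed
```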
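And a sketch of how the head-node IP decision could be factored into a single helper, as the review suggested; the function name and shape are hypothetical, not the merged code:

```python
def needs_public_ip(provider_config: dict, is_head_node: bool) -> bool:
    """Hypothetical helper combining use_internal_ips and use_external_head_ip."""
    use_internal = provider_config.get("use_internal_ips", False) is True
    external_head = provider_config.get("use_external_head_ip", False) is True
    if not use_internal:
        return True  # old default: public IPs for every node
    return external_head and is_head_node  # new option: head node only

# Example: internal IPs everywhere, but keep a public IP on the head node.
cfg = {"use_internal_ips": True, "use_external_head_ip": True}
assert needs_public_ip(cfg, is_head_node=True)
assert not needs_public_ip(cfg, is_head_node=False)
```

Factoring the check out like this would also make the reviewer's requested unit test a few one-line assertions.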
I've tested all of these fixes using `ray up`/`ray dashboard`/`ray down` on Azure clusters of 4-32 nodes to make sure the startup/teardown works correctly and the correct amount of resources is provisioned.

Related issue number
Node termination times are discussed in #25971
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.