
[Azure cluster launcher] Deletion takes ~10 minutes without stopped node caching, fast with stopped node caching #25971

Closed
cadedaniel opened this issue Jun 21, 2022 · 3 comments · Fixed by #31645
Labels
bug Something that is supposed to be working; but isn't infra autoscaler, ray client, kuberay, related issues P2 Important issue, but not time-critical

Comments

@cadedaniel
Member

What happened + What you expected to happen

This came up in a Discuss post. The user highlights confusing, undesired behavior: with cache_stopped_nodes=False, the Azure cluster launcher waits for virtual machines to fully terminate (~10 minutes). The expected behavior is that node removal is fast, as it is when cache_stopped_nodes=True; "fast" here means on the order of idle_timeout_minutes.

The fix is to remove this blocking wait: the node provider should issue the request to terminate the virtual machine but not block waiting for termination to complete.
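As a rough illustration of the difference, here is a minimal sketch using the Azure Python SDK (not the actual Ray node provider code; the subscription ID, resource group, and VM name are placeholders):

```python
# Minimal sketch of blocking vs. non-blocking VM deletion with the Azure SDK.
# The subscription ID, resource group, and VM name below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

credential = DefaultAzureCredential()
compute_client = ComputeManagementClient(credential, "<subscription-id>")

# Current behavior: waiting on the long-running operation blocks the autoscaler
# until Azure finishes tearing the VM down (roughly 10 minutes).
poller = compute_client.virtual_machines.begin_delete("ray-cluster", "ray-worker-0")
poller.result()

# Desired behavior: issue the delete request and return immediately;
# Azure continues the deletion server-side.
compute_client.virtual_machines.begin_delete("ray-cluster", "ray-worker-0")
```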

More context:

Versions / Dependencies

I presume the user tested on the latest released Ray, but I'm not sure. The root cause is present in the 1.13 branch.

Reproduction script

Use example-full.yaml with cache_stopped_nodes=True|False and idle_timeout_minutes=1, and compare how long node removal takes in each case (see the sketch below).
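Roughly, the relevant fields look like this (a sketch only; the real example-full.yaml contains many more required fields):

```yaml
# Sketch of the relevant fields; everything else in example-full.yaml is omitted.
cluster_name: default
idle_timeout_minutes: 1            # workers idle this long should be removed

provider:
    type: azure
    location: westus2
    resource_group: ray-cluster
    cache_stopped_nodes: false     # flip to true and compare teardown latency
```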

Issue Severity

Low: It annoys or frustrates me.

@cadedaniel cadedaniel added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 21, 2022
@cadedaniel cadedaniel assigned cadedaniel and unassigned cadedaniel Jun 21, 2022
@cadedaniel
Member Author

@gramhagen here's the GitHub issue

@stale

stale bot commented Oct 19, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 19, 2022
@zhe-thoughts zhe-thoughts added P2 Important issue, but not time-critical infra autoscaler, ray client, kuberay, related issues and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 26, 2022
@stale stale bot removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Oct 26, 2022
@zhe-thoughts
Collaborator

(Context: I’m doing a round of cleanup of waiting-for-triage issues)

I'm marking this as P2 since the docs now clarify that the Azure cluster launcher is community-maintained. cc @AmeerHajAli to put it in the Infra backlog

wuisawesome pushed a commit that referenced this issue Feb 28, 2023
…ays (#31645)


This reverts prior changes to node naming that led to non-unique names and caused constant node refreshing.
The Azure autoscaler currently blocks on node destruction; that blocking wait is removed in this change.

Related issue number
Closes #31538
Closes #25971


---------

Signed-off-by: Scott Graham <[email protected]>
Co-authored-by: Scott Graham <[email protected]>
architkulkarni pushed a commit that referenced this issue Mar 25, 2024
…re configurable and robust (#44100)

This PR addresses a few issues when launching clusters with Azure:

  • Any changes made to subnets of the deployed virtual network(s) are overwritten upon redeployment. Service endpoints, route tables, and delegations are removed when redeploying (which happens on any of the ray CLI calls) due to an open Azure issue. This PR works around it by copying the existing subnet configuration into the deployment template when a subnet with the cluster's unique ID already exists in the same resource group.
  • VM termination is extremely lengthy and does not clean up all dependencies. When VMs are provisioned, dependencies such as disks, NICs, and public IP addresses are provisioned with them. Because the termination process does not wait for the VM to be deleted, and the dependent resources cannot be deleted while the VM still exists, these dependencies are often left in the resource group after termination, which wastes resources and can exhaust quotas (e.g., on public IP addresses or disks). This PR moves node termination into a pool of threads so that node deletion can be parallelized (waiting for each node to be deleted takes a long time) and all dependencies can be deleted once their VMs no longer exist; see the first sketch below.
  • VMs can have the status code ProvisioningState/failed/RetryableError, causing an unpacking error. The status-parsing line throws an exception for that string, resulting in incorrect provisioning/termination of the node. This PR slices the list of status segments and uses only the first two; see the second sketch below.
  • The default quota for public IP addresses in Azure is only 100, which larger clusters can hit. This PR adds an option (use_external_head_ip) to provision a public IP address only for the head node (instead of for all nodes or for none), so a user can still reach the head node via a public IP without running into the quota. The option works in tandem with use_internal_ips: if both are True, a public IP address is provisioned only for the head node. If use_external_head_ip is omitted, the behavior is unchanged (public IPs are provisioned for all nodes when use_internal_ips is False, otherwise no public IPs are provisioned).
I've tested all of these fixes using ray up / ray dashboard / ray down on Azure clusters of 4-32 nodes to make sure that startup/teardown works correctly and the correct amount of resources is provisioned.
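To illustrate the parallel-termination idea, here is a hypothetical sketch built on the Azure Python SDK (not the actual node provider code; the client setup, resource group, and the "<vm>-nic"/"<vm>-ip" naming convention are assumptions):

```python
# Hypothetical sketch: delete VMs in parallel, then clean up their dependencies.
# Client setup, resource group, and the "<vm>-nic"/"<vm>-ip" naming are assumptions.
from concurrent.futures import ThreadPoolExecutor

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "ray-cluster"

credential = DefaultAzureCredential()
compute_client = ComputeManagementClient(credential, SUBSCRIPTION_ID)
network_client = NetworkManagementClient(credential, SUBSCRIPTION_ID)

def terminate_node(vm_name: str) -> None:
    # Wait for the VM itself to be gone so its dependencies become deletable.
    compute_client.virtual_machines.begin_delete(RESOURCE_GROUP, vm_name).result()
    # The NIC and public IP are no longer attached and can now be removed.
    network_client.network_interfaces.begin_delete(RESOURCE_GROUP, f"{vm_name}-nic").result()
    network_client.public_ip_addresses.begin_delete(RESOURCE_GROUP, f"{vm_name}-ip").result()

def terminate_nodes(vm_names: list[str]) -> None:
    # Each deletion waits a long time, so spread the waits across a thread pool.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(terminate_node, vm_names))  # consume to surface exceptions
```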

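And a tiny illustration of the unpacking fix, using a made-up local variable rather than the actual source line:

```python
# The instance view can report a three-segment status code; unpacking two values fails.
status_code = "ProvisioningState/failed/RetryableError"

# Old pattern: raises "too many values to unpack (expected 2)".
# category, state = status_code.split("/")

# Fixed pattern: keep only the first two segments and ignore any trailing detail.
category, state = status_code.split("/")[:2]
assert (category, state) == ("ProvisioningState", "failed")
```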
Related issue number
Node termination times are discussed in #25971


---------

Signed-off-by: Mike Danielczuk <[email protected]>
Signed-off-by: Mike Danielczuk <[email protected]>
Co-authored-by: Scott Graham <[email protected]>