
[Ray Azure Autoscaler] new unique_id generation process leads to Azure killing & relaunching running VMs #31538

Closed
mlguruz opened this issue Jan 9, 2023 · 13 comments · Fixed by #31645
Labels
bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical)

Comments

@mlguruz

mlguruz commented Jan 9, 2023

What happened + What you expected to happen

@gramhagen @richardliaw

The Ray 2.1 and 2.2 releases introduce a bug where unique_id is derived from a fixed value instead of the old random uuid4() one.
The problem is that, because Ray by default launches 5 instances and gradually scales up, the new unique_id scheme keeps reusing a pool of 5 unique ids, and as a result Azure keeps killing and redeploying existing healthy running VMs.

You won't be able to scale beyond 5 instances, and your existing running VMs will constantly be killed and then relaunched.

This is old (Ray 2.0):

unique_id = uuid4().hex[:VM_NAME_UUID_LEN]

This is new (Ray 2.1 or 2.2):

name=name_tag, id=self.provider_config["unique_id"]

This is a pretty serious bug IMO.
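
For context, here is a minimal sketch contrasting the two naming schemes (a simplified illustration, not the actual node_provider.py code; the VM_NAME_UUID_LEN value, name_tag, and provider_config contents below are assumed placeholders):

from uuid import uuid4

VM_NAME_UUID_LEN = 8                       # assumed suffix length, for illustration only
name_tag = "ray-worker"                    # hypothetical node name tag
provider_config = {"unique_id": "abc123"}  # hypothetical fixed value from the provider config

# Ray <= 2.0: every VM gets a fresh random suffix, so names never collide.
vm_name_old = f"{name_tag}-{uuid4().hex[:VM_NAME_UUID_LEN]}"

# Ray 2.1/2.2: the suffix is a fixed per-cluster value, so scaling up keeps
# cycling through the same small pool of names, and Azure redeploys (kills and
# relaunches) the healthy VMs that already hold those names.
vm_name_new = f"{name_tag}-{provider_config['unique_id']}"

print(vm_name_old, vm_name_new)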

Versions / Dependencies

Ray = 2.1/2.2 (<= 2.0 is fine)

Reproduction script

Just launch a cluster and you will see it.

Issue Severity

High: It blocks me from completing my task.

@mlguruz added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Jan 9, 2023
@gramhagen
Contributor

Good point, it looks like the unique_id was over-applied here. We should revert this to the old approach.

@zhe-thoughts
Collaborator

Thanks @gramhagen . Do you have time to file a PR for the suggested fix?

@gramhagen
Contributor

yeah, let me create a pr, will need a little time to test it though

@mlguruz
Author

mlguruz commented Jan 12, 2023

yeah, let me create a pr, will need a little time to test it though

@gramhagen

I manually modified node_provider.py to use the old uuid way of generating unique_id, and mounted the modified file to both head and worker nodes via the ray autoscaler yaml entrypoint callbacks.

I can confirm that I'm able to scale up to more instances now.

One potential bug I would flag is this:
[screenshot]

I haven't had time to read all the code, but my gut tells me there's a sleep somewhere to allow Azure to spin up instances? (I could be wrong.)

Will this cause any trouble?

I've seen this error (screenshot) in almost every release.

@gramhagen
Contributor

I've seen this error (screenshot) in almost every release.

Looks like this is a problem with the deployment name. I just updated my changes to use the unique vm name for the deployment too (and included the cluster name) to make it a little clearer when people have nodes from multiple clusters in the same resource group.
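
To illustrate the direction of that change, here is a rough sketch of per-VM deployment naming (illustrative only; the real change lives in #31645, and cluster_name, name_tag, and VM_NAME_UUID_LEN below are placeholders):

from uuid import uuid4

VM_NAME_UUID_LEN = 8          # assumed suffix length, for illustration only
cluster_name = "my-cluster"   # placeholder cluster name
name_tag = "ray-worker"       # placeholder node name tag

# Give each VM a fresh random suffix, then reuse that unique name (plus the
# cluster name) for its deployment, so deployments from successive scale-ups or
# from different clusters in the same resource group do not overwrite each other.
unique_id = uuid4().hex[:VM_NAME_UUID_LEN]
vm_name = f"{name_tag}-{unique_id}"
deployment_name = f"{cluster_name}-{vm_name}"
print(deployment_name)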

@mlguruz
Author

mlguruz commented Jan 12, 2023

Yes - I think every cluster creation re-uses the same deployment name, hence overwriting old deployments:

[screenshot]

This might not be what we want, as we might want to check older deployment details for debugging.

Is there a way to improve this?

@gramhagen
Contributor

yeah, the change here should resolve that: https://github.com/ray-project/ray/pull/31645/files (line 258)

@mlguruz
Author

mlguruz commented Jan 12, 2023

yeah, the change here should resolve that: https://github.com/ray-project/ray/pull/31645/files (line 258)

Thank you!
That makes sense, so that each deployment is unique.

@zhe-thoughts added the P2 (Important issue, but not time-critical) label and removed the triage (Needs triage (eg: priority, bug/not-bug, and owning component)) label on Jan 20, 2023
@zhe-thoughts
Collaborator

Marking P2 because the Ray Azure launcher is community maintained (we should still try to fix it!)

@mlguruz
Author

mlguruz commented Jan 20, 2023

@gramhagen

Since we're on this bug-fix release, another thing I've been seeing on my end is that, even though ray down runs successfully, there are usually auxiliary VM components left behind that are not properly cleaned up, such as NICs, OS disks, etc. (please see screenshot below).

Do you know why this is happening?

[screenshot]

@gramhagen
Contributor

afaik, there is no way to destroy resources that are not nodes when running ray down, which executes commands.py:teardown_cluster().

It's best to deploy all resources into a resource group and delete that after shutting down the cluster, but you can also use the unique key to identify the resources corresponding to a specific cluster and delete those manually. I think it would be a significant amount of work to automate that, because there are dependencies that need to be accounted for in the order of deleting network resources.
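
A minimal sketch of that manual route, assuming the cluster's resources carry a ray-cluster-name tag and that the azure-identity and azure-mgmt-resource packages are installed (the subscription, resource group, and cluster names below are placeholders):

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
RESOURCE_GROUP = "ray-cluster-rg"      # placeholder
CLUSTER_NAME = "my-cluster"            # placeholder

client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Inspect what the cluster left behind (NICs, OS disks, public IPs, ...).
leftovers = client.resources.list_by_resource_group(
    RESOURCE_GROUP,
    filter=f"tagName eq 'ray-cluster-name' and tagValue eq '{CLUSTER_NAME}'",
)
for res in leftovers:
    print(res.type, res.name)

# If the resource group exists only for this cluster, deleting the whole group
# is the simplest cleanup; Azure handles the NIC/disk/VNet dependency ordering.
client.resource_groups.begin_delete(RESOURCE_GROUP).wait()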

@mrksr

mrksr commented Feb 28, 2023

Hi @gramhagen, thanks for working on a fix! What's the state of affairs with this issue? The problem still exists for me. Is there a workaround with current ray?

@gramhagen
Contributor

Waiting for #31645 to close. You could grab the changes from that branch and build from source, but it will be easier to wait for the CI tools to build the wheel for you once the PR is merged.

wuisawesome pushed a commit that referenced this issue Feb 28, 2023
…ays (#31645)


This reverts prior changes to node naming which led to non-unique names, causing constant node refreshing
Currently the Azure autoscaler blocks on node destruction, so that was removed in this change

Related issue number
Closes #31538
Closes #25971


---------

Signed-off-by: Scott Graham <[email protected]>
Co-authored-by: Scott Graham <[email protected]>