GCP: allow stop/autostop for spot VMs. #2877

Merged
merged 13 commits into from
Dec 28, 2023

Conversation

concretevitamin
Member

Fixes #2837.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_stop_gcp_spot
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for adding support for stopping spot instances on GCP, @concretevitamin! The code looks mostly good to me. Left two nits.

sky/core.py Outdated
Comment on lines 333 to 334
if handle.launched_resources.use_spot and not cloud.is_same_cloud(
        clouds.GCP()):
Collaborator

How about we make this a CloudImplementationFeatures.STOP_SPOT_INSTANCE feature, so we don't have to do a cloud-specific check here?
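
For readers following the thread, here is a minimal sketch of the feature-flag idea being proposed. It is not SkyPilot's actual implementation: the class bodies, the `_UNSUPPORTED` attribute, and `OtherCloud` are hypothetical stand-ins, and only the names `CloudImplementationFeatures`, `STOP`, `STOP_SPOT_INSTANCE`, `check_features_are_supported`, and `NotSupportedError` come from the discussion.

```python
import enum


class NotSupportedError(Exception):
    """Stand-in for the exceptions.NotSupportedError mentioned in the PR."""


class CloudImplementationFeatures(enum.Enum):
    STOP = 'stop'
    STOP_SPOT_INSTANCE = 'stop_spot_instance'


class Cloud:
    # Hypothetical attribute: features this cloud cannot provide.
    _UNSUPPORTED = frozenset()

    @classmethod
    def check_features_are_supported(cls, requested):
        """Raise NotSupportedError if any requested feature is unsupported."""
        unsupported = set(requested) & cls._UNSUPPORTED
        if unsupported:
            names = ', '.join(sorted(f.name for f in unsupported))
            raise NotSupportedError(
                f'{cls.__name__} does not support: {names}')


class GCP(Cloud):
    # Assumed here: GCP can stop spot VMs, so nothing is listed.
    _UNSUPPORTED = frozenset()


class OtherCloud(Cloud):
    # Hypothetical cloud that cannot stop spot instances.
    _UNSUPPORTED = frozenset({CloudImplementationFeatures.STOP_SPOT_INSTANCE})
```

With this shape, the caller asks the cloud object about a feature instead of comparing against `clouds.GCP()`, so adding another cloud that supports stopping spot instances would not require touching the call site.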

Member Author

Good call, done.

sky/core.py Outdated
Comment on lines 455 to 456
elif (handle.launched_resources.use_spot and not down and not is_cancel and
      not handle.launched_resources.cloud.is_same_cloud(clouds.GCP())):
Collaborator

Similarly, should we make this one of the CloudImplementationFeatures?

Member Author

Good call, done.

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the fix, @concretevitamin! It looks mostly good to me; I left one comment suggesting a refactoring of how check_features_are_supported is called.

sky/core.py Outdated
Comment on lines 330 to 345
cloud.check_features_are_supported(
    {clouds.CloudImplementationFeatures.STOP})
if handle.launched_resources.use_spot:
    # Check cloud supports stopping spot instances
    supports_stop_spot = True
    try:
        cloud.check_features_are_supported(
            {clouds.CloudImplementationFeatures.STOP_SPOT_INSTANCE})
    except exceptions.NotSupportedError:
        supports_stop_spot = False
# Allow GCP spot to be stopped since it preserves disk:
# https://cloud.google.com/compute/docs/instances/preemptible#preemption-process # pylint: disable=line-too-long
if handle.launched_resources.use_spot and not supports_stop_spot:
    # Disable spot instances to be stopped.
    # TODO(suquark): enable GCP+spot to be stopped in the future.
    raise exceptions.NotSupportedError(
        f'{colorama.Fore.YELLOW}Stopping cluster '
        f'{cluster_name!r}... skipped.{colorama.Style.RESET_ALL}\n'
Collaborator

How about the following for simplicity?

features = {clouds.CloudImplementationFeatures.STOP}
if handle.launched_resources.use_spot:
    features.add(clouds.CloudImplementationFeatures.STOP_SPOT_INSTANCE)
cloud.check_features_are_supported(features)

Member Author

Since check_features_are_supported() raises an error I found it harder to read. For example, if a cloud doesn't support stopping spot and here the cluster is an on-demand one, we do not want to perform a check on STOP_SPOT_INSTANCE or raise.
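
To make the concern concrete, the following is a rough sketch, not the code that was merged. `Feature`, the module-level `check_features_are_supported`, and `can_stop` are hypothetical stand-ins chosen for illustration. It shows a raising check wrapped so that the spot-specific feature is only consulted for spot clusters.

```python
import enum


class NotSupportedError(Exception):
    """Stand-in for the exceptions.NotSupportedError used in the PR."""


class Feature(enum.Enum):
    STOP = enum.auto()
    STOP_SPOT_INSTANCE = enum.auto()


def check_features_are_supported(unsupported, requested):
    """Hypothetical raising check: errors if any requested feature is missing."""
    missing = set(requested) & set(unsupported)
    if missing:
        raise NotSupportedError(', '.join(sorted(f.name for f in missing)))


def can_stop(unsupported, use_spot):
    """Return True if the cluster may be stopped on a cloud lacking `unsupported`."""
    try:
        check_features_are_supported(unsupported, {Feature.STOP})
        if use_spot:
            # Only spot clusters are subject to the spot-specific check.
            check_features_are_supported(unsupported,
                                         {Feature.STOP_SPOT_INSTANCE})
    except NotSupportedError:
        return False
    return True
```

Under these assumptions, an on-demand cluster on a cloud that cannot stop spot instances is still stoppable, because `STOP_SPOT_INSTANCE` is never checked for it.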

Member Author

@concretevitamin concretevitamin left a comment

Made some changes. PTAL.

Collaborator

@Michaelvll Michaelvll left a comment

Thank you for the fix, @concretevitamin! The code looks mostly good to me, except for the point at which we raise an error for STOP_SPOT_INSTANCE being unsupported.

@Michaelvll
Collaborator

@concretevitamin I refactored the handling for checking supported features. Please take a look : )

Member Author

@concretevitamin concretevitamin left a comment

Thanks @Michaelvll for the refactoring! Failover works correctly now.

I pushed some minor changes to the messages. PTAL.

@@ -809,8 +817,6 @@ def need_cleanup_after_preemption(self,
# you must delete it and create a new one ..."
# See: https://cloud.google.com/tpu/docs/preemptible#tpu-vm

# pylint: disable=import-outside-toplevel
Member Author

Q: Was there a reason this was imported inline?

Collaborator

I suppose it was to avoid a circular import, but since we have updated a lot of places, it seems fine to import it at the top level.
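
As a general illustration of the pattern in question (not the code from this PR), a function-level import defers module resolution until the function is called, which is a common way to sidestep an import cycle. Here `json` is only a stand-in for whatever module would have caused the cycle, and `parse_payload` is a hypothetical name.

```python
def parse_payload(text):
    """Parse a JSON string, importing the parser lazily."""
    # Deferred import: resolved when the function is first called, not when
    # this module is loaded. In a real cycle, the imported module would be
    # one that itself imports this module at load time.
    import json  # pylint: disable=import-outside-toplevel
    return json.loads(text)
```

Once the cycle no longer exists, moving the import to the top of the module, as discussed above, removes the need for the pylint suppression.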

Collaborator

@Michaelvll Michaelvll left a comment

LGTM. :)

Development

Successfully merging this pull request may close these issues.

Consider allowing sky stop on GCP spot clusters
2 participants