
Add safe guard for provisioning/terminating TPU VM and fix spot launch TPU resource leak #1500

Merged 16 commits into master from tpu-safeguard on Dec 15, 2022

Conversation

@infwinston (Member) commented Dec 7, 2022

Our user reported errors when getting TPU IPs. This PR adds safeguards to _get_tpu_vm_pod_ips with better error handling, to prevent failures such as leaked resources with duplicated cluster IDs.

This PR also fixes #1514, where removing a TPU VM resource in the INIT state failed.

Tested:

  • launching TPU VM/TPU Pod
  • TPU Pod smoke test
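As a rough sketch of the kind of guard being added (the function name, signature, and messages below are illustrative, not the actual SkyPilot implementation): query the TPU VMs labeled with the cluster name, fail loudly when the query fails or returns nothing, and surface duplicated cluster IDs instead of silently picking one.

```python
import subprocess
from typing import List


def list_tpu_vm_ids(cluster_name: str, zone: str) -> List[str]:
    """Illustrative guard: fail loudly on failed/missing/duplicated TPU VMs."""
    query_cmd = ('gcloud compute tpus tpu-vm list '
                 f'--filter="(labels.ray-cluster-name={cluster_name})" '
                 f'--zone={zone} --format="value(name)"')
    proc = subprocess.run(query_cmd, shell=True, capture_output=True, text=True)
    if proc.returncode != 0:
        # Previously, a failed query could be treated as "no IPs" downstream.
        raise RuntimeError(
            f'Failed to list TPU VMs for {cluster_name!r}: {proc.stderr}')
    tpu_ids = proc.stdout.splitlines()
    if not tpu_ids:
        raise RuntimeError(
            f'No TPU VM found for cluster {cluster_name!r} in zone {zone}.')
    if len(tpu_ids) > 1:
        # Duplicated cluster labels usually mean a leaked (e.g. preempted but
        # never terminated) TPU VM from an earlier spot launch.
        print(f'Warning: found {len(tpu_ids)} TPU VMs labeled '
              f'{cluster_name!r}; a leaked resource may exist.')
    return tpu_ids
```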

Comment on lines 309 to 315

# Clean up preempted TPU VM before launching the cluster.
# This is needed as status -r will not remove it if GCP
# turns the VM state to other than PREEMPTED.
is_tpuvm = tpu_utils.is_tpu_vm(new_resources)
if is_tpuvm:
    self.terminate_cluster()
Collaborator

After discussing offline, we found it might be better to have this special-case handling in the controller process instead of the recovery strategy, so that future recovery strategies will not need to handle it separately.

@infwinston (Member, Author) commented Dec 11, 2022

OK, after some investigation I finally realized what's going on. When a TPU VM is preempted by GCP (i.e., we can no longer SSH into the server), its state may not be set to PREEMPTED immediately; it may stay in READY for a while.
Hence, even when our controller detects the connection failure and determines the server was preempted, status -r won't reflect that status correctly.

To be specific, during status -r, we run the below code to refresh status

node_statuses = _get_cluster_status_via_cloud_cli(handle)

and we got READY, which translates to ClusterStatus.UP (see the log below):

(sky-de99-weichiang, pid=734143) I 12-10 21:39:36 spot_utils.py:72] Failed to connect to the cluster.
(sky-de99-weichiang, pid=734143) I 12-10 21:39:36 spot_utils.py:73] ==================================
(sky-de99-weichiang, pid=734143) D 12-10 21:39:38 backend_utils.py:1676] Refreshing status: Failed to get IPs from cluster 'sky-de99-weichiang-16', trying to fetch from provider.
(sky-de99-weichiang, pid=734143) D 12-10 21:39:39 backend_utils.py:1433] gcloud compute tpus tpu-vm list --zone us-central1-f --filter="(labels.ray-cluster-name=sky-de99-weichiang-16 AND labels.ray-launch-config=(c070152c2602f36dd1adc9a0a6cd087fc2db9352))" --format="value(state)" returned 0.
(sky-de99-weichiang, pid=734143) D 12-10 21:39:39 backend_utils.py:1433] **** STDOUT ****
(sky-de99-weichiang, pid=734143) D 12-10 21:39:39 backend_utils.py:1433] READY
(sky-de99-weichiang, pid=734143) D 12-10 21:39:39 backend_utils.py:1433]
(sky-de99-weichiang, pid=734143) D 12-10 21:39:39 backend_utils.py:1433] **** STDERR ****
(sky-de99-weichiang, pid=734143) D 12-10 21:39:39 backend_utils.py:1433]
(sky-de99-weichiang, pid=734143) W 12-10 21:39:39 backend_utils.py:1452] Cluster status: 'READY'.
(sky-de99-weichiang, pid=734143) D 12-10 21:39:39 backend_utils.py:1710] Failed to reset autostop. Due to <class 'sky.exceptions.CommandError'>: Command python3 -u -c 'from sky.skylet import autostop_lib;autostop_lib.set_autostop(-1, '"'"'cloudvmray'"'"', False)' failed with return code 255.
(sky-de99-weichiang, pid=734143) D 12-10 21:39:39 backend_utils.py:1710] Failed to set autostop
(sky-de99-weichiang, pid=734143) I 12-10 21:39:39 controller.py:118] Cluster is preempted (status: INIT). Recovering...
(sky-de99-weichiang, pid=734143) I 12-10 21:42:11 spot_state.py:134] === Recovering... ===
(sky-de99-weichiang, pid=734143) I 12-10 21:42:11 recovery_strategy.py:126] Ignoring the job cancellation failure; the spot cluster is likely completely stopped.
(sky-de99-weichiang, pid=734143) I 12-10 21:42:11 recovery_strategy.py:126]   Detailed exception: 'NoneType' object has no attribute 'cluster_yaml'
(sky-de99-weichiang, pid=734143) D 12-10 21:42:11 recovery_strategy.py:316] Terminating unhealthy spot cluster.
(sky-de99-weichiang, pid=734143) D 12-10 21:42:11 recovery_strategy.py:320] Relaunch the cluster  without constraining to prior cloud/region.

However, later we immediately change the status from UP to INIT because we determine the cluster is "abnormal".

is_abnormal = ((0 < len(node_statuses) < handle.launched_nodes) or

global_user_state.add_or_update_cluster(cluster_name,

That's why we see the cluster was in INIT state.

(sky-de99-weichiang, pid=734143) I 12-10 21:39:39 controller.py:118] Cluster is preempted (status: INIT). Recovering...
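For context, here is a rough, self-contained reconstruction of the shape of that check (the snippets linked above are truncated permalinks; the exact logic and signatures in backend_utils.py differ): if the provider reports fewer live nodes than were launched, or any node is not UP, the refreshed record is downgraded to INIT rather than kept as UP.

```python
from enum import Enum
from typing import List


class ClusterStatus(Enum):  # stand-in for global_user_state.ClusterStatus
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


def refreshed_status(node_statuses: List[ClusterStatus],
                     launched_nodes: int) -> ClusterStatus:
    """Sketch of the 'abnormal cluster' downgrade during status refresh."""
    is_abnormal = ((0 < len(node_statuses) < launched_nodes) or
                   any(s != ClusterStatus.UP for s in node_statuses))
    # An abnormal cluster is recorded as INIT even though the provider may
    # still report READY, which the spot controller then treats as preempted.
    return ClusterStatus.INIT if is_abnormal else ClusterStatus.UP
```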

===================

In conclusion:

  1. We cannot rely on status -r to clean up preempted resources, because the cloud-side state may not be reflected in real time.
  2. We also cannot rely on ray up, because the spot controller may not retry ray up with the same region/zone after preemption, so a preempted resource can still be leaked.

In the end, I think it is the spot controller's job to make sure there is no leaked preempted resource before launching a new one.

To avoid adding specific logic that only applies to TPU VM, I introduce a slightly better abstraction in this PR.

if not resources.is_spot_restartable():
    # If the resource is not restartable after preemption,
    # we need to terminate the cluster before recovering it.
    logger.info('Resource not restartable. Cleaning up '
                'the cluster.')
    self._strategy_executor.terminate_cluster()

@Michaelvll @concretevitamin what do you think?

@concretevitamin (Member) commented Dec 12, 2022

Thanks. I read everything before "In conclusion" and it made sense to me.

Question: from the log pasted:

(sky-de99-weichiang, pid=734143) I 12-10 21:39:39 controller.py:118] Cluster is preempted (status: INIT). Recovering...
(sky-de99-weichiang, pid=734143) I 12-10 21:42:11 spot_state.py:134] === Recovering... ===
(sky-de99-weichiang, pid=734143) I 12-10 21:42:11 recovery_strategy.py:126] Ignoring the job cancellation failure; the spot cluster is likely completely stopped.
(sky-de99-weichiang, pid=734143) I 12-10 21:42:11 recovery_strategy.py:126]   Detailed exception: 'NoneType' object has no attribute 'cluster_yaml'
(sky-de99-weichiang, pid=734143) D 12-10 21:42:11 recovery_strategy.py:316] Terminating unhealthy spot cluster.
(sky-de99-weichiang, pid=734143) D 12-10 21:42:11 recovery_strategy.py:320] Relaunch the cluster  without constraining to prior cloud/region.

we see that recovery_strategy.py:316] Terminating unhealthy spot cluster. This suggests a termination request is being run for the preempted cluster (our status: INIT; console status: READY/PREEMPTED). Is this call not going through or otherwise not sufficient?

--

RE the new method that takes in a Resources: it's probably not enough to determine whether it's restartable by looking at the logical representation sky.Resources. For example, the .j2 template can change preemption behavior to stopped.

@infwinston (Member, Author)

we see that recovery_strategy.py:316] Terminating unhealthy spot cluster. This suggests a termination request is being run for the preempted cluster (our status: INIT; console status: READY/PREEMPTED). Is this call not going through or otherwise not sufficient?

Ah this action was in effect only after this PR. Without this PR, the termination action won't be performed.

RE the new method that takes in a Resources: it's probably not enough to determine whether it's restartable by looking at the logical representation sky.Resources. For example, the .j2 template can change preemption behavior to stopped.

Yes, this is a problem. I was thinking we need to add a new field (such as self.termination_on_spot) to Resources to indicate this property, and each cloud may have its own default value.

@Michaelvll (Collaborator)

Thanks for the update @infwinston!

Ah this action was in effect only after this PR. Without this PR, the termination action won't be performed.

It seems we skip the "retry in the same region first" behavior. That is because the terminate_cluster resets the self._launched_cloud_region, before the recovery strategy starts. Let's remove the terminate_cluster method in the FailoverStrategyExecutor and add the self._launched_cloud_region = None before the self.terminate_cluster() in the recover() method.

RE the new method that takes in a Resources: it's probably not enough to determine whether it's restartable by looking at the logical representation sky.Resources. For example, the .j2 template can change preemption behavior to stopped.

I think the problem may not be related to the preemption behavior, as the is_spot_restartable here only means whether we can sky.launch a cluster with the same name and hash when the spot cluster is preempted (no matter whether it is stopped or terminated).

That said, I would prefer to rename the method is_spot_restartable to need_termination_after_preemption, or something similar.

@infwinston (Member, Author) commented Dec 12, 2022

It seems we skip the "retry in the same region first" behavior. That is because the terminate_cluster resets the self._launched_cloud_region, before the recovery strategy starts. Let's remove the terminate_cluster method in the FailoverStrategyExecutor and add the self._launched_cloud_region = None before the self.terminate_cluster() in the recover() method.

Great catch! I just updated the code.

I think the problem may not be related to the preemption behavior, as the is_spot_restartable here only means whether we can sky.launch a cluster with the same name and hash when the spot cluster is preempted (no matter whether it is stopped or terminated).
That said, I would prefer to rename the method is_spot_restartable to need_termination_after_preemption, or something similar.

Yes, I agree need_termination_after_preemption would be a better name. I updated the code with a shorter name, need_cleanup_after_preemption.

However, I think the issue @concretevitamin mentioned still remains? If in our .j2 template we specify a different interruption behavior, such as Stop rather than Terminate (the default; see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/interruption-behavior.html),
then for an AWS Spot VM we shouldn't clean up the spot VM's disk after preemption, since we want to reuse it if possible?
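For reference, a hypothetical fragment of the launch parameters that a template change like this would roughly produce (this is not SkyPilot's actual .j2 content; it only illustrates the EC2 option being discussed). Ray's AWS node_config is passed through to the EC2 RunInstances call, where the interruption behavior lives under InstanceMarketOptions; note that 'stop' requires a persistent spot request, while the default for one-time requests is 'terminate'.

```python
# Hypothetical node_config fragment, not the actual SkyPilot template output.
node_config = {
    'InstanceType': 'p3.2xlarge',
    'InstanceMarketOptions': {
        'MarketType': 'spot',
        'SpotOptions': {
            # 'stop' is only valid for persistent spot requests.
            'SpotInstanceType': 'persistent',
            'InstanceInterruptionBehavior': 'stop',
        },
    },
}
```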

@Michaelvll (Collaborator) commented Dec 12, 2022

However, I think the issue @concretevitamin mentioned still remains? If in our .j2 template we specify a different interruption behavior, such as Stop rather than Terminate (the default; see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/interruption-behavior.html),
then for an AWS Spot VM we shouldn't clean up the spot VM's disk after preemption, since we want to reuse it if possible?

For those, we still retry in the same region first, which will automatically reuse the stopped cluster if needed, right? The current change does not make supporting those in the future any worse compared to the master branch, right?

@infwinston changed the title from "Add safe guard when getting TPU VM IPs" to "Add safe guard for provisioning/terminating TPU VM and fix spot launch TPU resource leak" on Dec 12, 2022
@infwinston (Member, Author) commented Dec 12, 2022

For those, we still retry in the same region first, which will automatically reuse the stopped cluster if needed, right?

Ah I see. So you mean for AWS, need_cleanup_after_preemption should still be False even if we change its preemption behavior.
https://github.com/skypilot-org/skypilot/pull/1500/files#diff-96e0e94ebec2d13f2565051ef5df13f97627aa02ad198916e5591392a73d1b65R362
Because AWS never requires manual cleanup by users after preemption.

I think this makes sense.

====

Another thing I'm not sure about: does Azure require manual cleanup after preemption? I'm not sure how to test it, as our Azure subscription doesn't allow us to launch a spot VM.

@Michaelvll (Collaborator)

Ah I see. So you mean for AWS, need_cleanup_after_preemption should still be False even if we change its preemption behavior.
https://github.com/skypilot-org/skypilot/pull/1500/files#diff-96e0e94ebec2d13f2565051ef5df13f97627aa02ad198916e5591392a73d1b65R362
Because AWS never requires manual cleanup by users after preemption.

Yes. Since our sky status -r will reflect the correct status from the cloud provider, I think that should be sufficient to handle the different terminate/stop behaviors.
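Putting the thread together, here is a minimal sketch of the agreed-upon shape, assuming the method lives on the cloud classes as the later docstring suggests (the real signatures in sky/clouds/ may differ, and the TPU VM check is approximated below): the base cloud answers False, and GCP answers True only for Spot TPU VMs, which must be deleted before relaunching.

```python
class Cloud:
    """Base cloud (sketch; the real classes live in sky/clouds/)."""

    def need_cleanup_after_preemption(self, resources) -> bool:
        # Most clouds let us simply relaunch a preempted spot cluster with
        # the same name and tag, whether preemption stops or terminates the
        # instance, so nothing needs to be cleaned up before recovery.
        return False


class GCP(Cloud):
    def need_cleanup_after_preemption(self, resources) -> bool:
        # A preempted Spot TPU VM may keep reporting READY for a while and
        # is never reusable, so it must be deleted before relaunching.
        return _is_tpu_vm(resources)


def _is_tpu_vm(resources) -> bool:
    # Stand-in for the tpu_utils.is_tpu_vm helper used in the diff near the
    # top of this thread; the exact check is assumed here.
    args = getattr(resources, 'accelerator_args', None) or {}
    return bool(args.get('tpu_vm', False))
```

The controller-side guard quoted earlier (terminating the cluster before recovery) would then consult this per-cloud answer instead of hard-coding a TPU VM special case.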

@infwinston (Member, Author)

I launched several long-running jobs on Spot TPU VMs/Pods a few days ago.
They survived 30+ preemptions in total, and no leaked resources were found.
Looks like the change should be robust enough.

(sky-tmp) weichiang@mbp sky % sky spot queue
Fetching managed spot job statuses...
Managed spot jobs:
In progress jobs: 7 RUNNING

ID  NAME                RESOURCES         SUBMITTED   TOT. DURATION       JOB DURATION        #RECOVERIES  STATUS
19  sky-c6c2-weichiang  1x [tpu-v2-8:1]   3 days ago  3 days 12h 20m 1s   3 days 9h 53m 32s   10           RUNNING
18  sky-f148-weichiang  1x [tpu-v2-8:1]   3 days ago  3 days 21h 10m 15s  3 days 20h 14m 52s  5            RUNNING
17  sky-05c6-weichiang  1x [tpu-v2-8:1]   3 days ago  3 days 21h 11m 56s  3 days 20h 22m 54s  4            RUNNING
16  sky-de99-weichiang  1x [tpu-v2-8:1]   3 days ago  3 days 21h 13m 3s   3 days 19h 27m 23s  11           RUNNING
15  sky-55ca-weichiang  1x [tpu-v3-8:1]   4 days ago  4 days 10h 30m 56s  4 days 10h 1m 3s    4            RUNNING
14  sky-b46d-weichiang  1x [tpu-v2-32:1]  4 days ago  4 days 11h 11m 21s  4 days 10h 32m 46s  4            RUNNING
13  sky-94d0-weichiang  1x [tpu-v3-32:1]  4 days ago  4 days 11h 13m 34s  4 days 10h 18m 54s  4            RUNNING

@Michaelvll (Collaborator) left a comment

Thank you for the fix @infwinston! It looks good to me. Left some comments to make the code cleaner. It would be great if we can run the smoke tests before merging.

Comment on lines 1246 to 1251
if len(stdout) == 0:
    logger.warning('No TPU VMs found with cluster name '
                   f'{cluster_name} in zone {zone}.')
if len(stdout.splitlines()) > 1:
    logger.warning('Found more than one TPU VM with cluster name '
                   f'{cluster_name} in zone {zone}.')
Collaborator

Should this be a warning? Also, we must be careful about logs printed during the status refresh, since they will corrupt the progress bar output of sky status -r. How about we change them to logger.debug?

Member Author

Ah, I chose logger.warning because multiple TPU VMs/Pods with the same cluster name is an abnormal case that is not supposed to happen.
When it happens, it means there's a resource leak. I think in this case we'd like to let the user know?

@Michaelvll (Collaborator) Dec 14, 2022

Isn't this a normal case for spot VMs? I think for a non-TPU cluster we don't show the warning, and we handle the case where the number of IPs doesn't match the expected amount in the caller function.

Also, is it true that a user can have multiple TPU VMs with the same name in the same zone?

@infwinston (Member, Author) Dec 14, 2022

Ah, sorry for the confusion; let me explain again. For spot VMs it also shouldn't happen that multiple Spot TPU VMs have the same labels.ray-cluster-name.

Basically, the query command should return only one VM/Pod in the normal case.

query_cmd = (f'gcloud compute tpus tpu-vm list --filter='
                 f'\\(labels.ray-cluster-name={cluster_name}\\) '
                 f'--zone={zone} --format=value\\(name\\)')

But if there's a leaked resource (e.g., the controller failed to terminate a preempted spot TPU), then this query command will return two VMs, which is an abnormal case.

Also, is it true that a user can have multiple TPU VMs with the same name in the same zone?

Note that I was not referring to the "TPU name" shown on the console but to labels.ray-cluster-name. So yes, multiple TPU VMs can have the same labels.ray-cluster-name.

Member Author

I'm fine with changing it to logger.debug, but I'm also afraid the user will never find out there's a leaked resource unless they manually check the console.

Collaborator

I am confused then: why does the problem not happen for a non-TPU-VM cluster? What ensures those clusters are not leaked?

@infwinston (Member, Author) Dec 14, 2022

For a non-TPU-VM cluster, since it doesn't require manual cleanup after preemption, resources won't be leaked this way. But I'm not sure if there are other scenarios that could trigger leakage. Also, we mostly rely on ray up to handle non-TPU-VM clusters (probably irrelevant).

sky/backends/cloud_vm_ray_backend.py (outdated review thread, resolved)
sky/clouds/aws.py (outdated review thread, resolved)
sky/backends/backend_utils.py (outdated review thread, resolved)
Comment on lines +1273 to +1278
tpuvm_json = json.loads(stdout)
if tpuvm_json['state'] != 'READY':
    # May be a leaked preempted resource.
    logger.warning(f'TPU VM {tpu_id} is not in READY state. '
                   'Could be a garbage resource. Skipping...')
    continue
Collaborator

Will this state be different for different tpu_ids?

Member Author

Each TPU VM or TPU Pod maps to a single tpu_id. So yes, different tpu_ids can have different states.
But when multiple tpu_ids show up, it means there is a leaked resource with the same cluster name as the current one. That's why I print the garbage-resource message.

Normally, only one tpu_id should be returned by the query command below.

query_cmd = (f'gcloud compute tpus tpu-vm list --filter='
                 f'\\(labels.ray-cluster-name={cluster_name}\\) '
                 f'--zone={zone} --format=value\\(name\\)')

Comment on lines 2687 to 2701
returncode, stdout, stderr = log_lib.run_with_log(
    query_cmd,
    log_abs_path,
    shell=True,
    stream_logs=False,
    require_outputs=True)

# Needs to create a list as GCP does not allow deleting
# multiple TPU VMs at once
tpu_terminate_cmds = []
for tpu_id in stdout.splitlines():
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)
Collaborator

We did not handle the returncode of the first command here. How about this?

Suggested change
(before)
returncode, stdout, stderr = log_lib.run_with_log(
    query_cmd,
    log_abs_path,
    shell=True,
    stream_logs=False,
    require_outputs=True)

# Needs to create a list as GCP does not allow deleting
# multiple TPU VMs at once
tpu_terminate_cmds = []
for tpu_id in stdout.splitlines():
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)

(after)
returncode, stdout, stderr = log_lib.run_with_log(
    query_cmd,
    log_abs_path,
    shell=True,
    stream_logs=False,
    require_outputs=True)

# Needs to create a list as GCP does not allow deleting
# multiple TPU VMs at once
# Skip the termination commands, if the TPU ID query commands fail.
tpu_terminate_cmds = [f'([[ "{returncode}" == "0" ]] || exit {returncode})']
for tpu_id in stdout.splitlines():
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)

Member Author

good point. fixed with minor modification.

@Michaelvll (Collaborator) left a comment

LGTM! Thanks @infwinston!

Comment on lines 221 to 223
In most cases, spot resources do not need cleanup after preemption.
The only exception by far is GCP's Spot TPU VM. We override this method
in gcp.py.
Collaborator

Suggested change
(before)
In most cases, spot resources do not need cleanup after preemption.
The only exception by far is GCP's Spot TPU VM. We override this method
in gcp.py.

(after)
In most cases, spot resources do not need cleanup after preemption,
as long as the cluster can be launched with the same name and tag,
no matter whether the preemption behavior is to terminate or stop the cluster.
The only exception so far is GCP's Spot TPU VM. We override this method
in gcp.py.

Member Author

good suggestion. fixed with minor modification.

Comment on lines 2698 to 2704
tpu_terminate_cmds = [f'exit {returncode}'] if returncode != 0 else []
for tpu_id in stdout.splitlines():
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)
Collaborator

nit: we can print out the information about the failed query?

Suggested change
(before)
tpu_terminate_cmds = [f'exit {returncode}'] if returncode != 0 else []
for tpu_id in stdout.splitlines():
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)

(after)
if returncode != 0:
    tpu_terminate_cmd = (f'echo "cmd: {query_cmd}" && echo "{stdout}" && '
                         f'echo "{stderr}" >&2 && exit {returncode}')
else:
    tpu_terminate_cmds = [f'exit {returncode}'] if returncode != 0 else []
    for tpu_id in stdout.splitlines():
        tpu_terminate_cmds.append(
            f'gcloud compute tpus tpu-vm delete --zone={zone} '
            f'--quiet {tpu_id}')
    terminate_cmd = ' && '.join(tpu_terminate_cmds)

Member Author

Yes, let me do it. You meant this, right?

if returncode != 0:
    terminate_cmd = (f'echo "cmd: {query_cmd}" && '
                     f'echo "{stdout}" && '
                     f'echo "{stderr}" >&2 && '
                     f'exit {returncode}')
else:
    tpu_terminate_cmds = []
    for tpu_id in stdout.splitlines():
        tpu_terminate_cmds.append(
            f'gcloud compute tpus tpu-vm delete --zone={zone} '
            f'--quiet {tpu_id}')
    terminate_cmd = ' && '.join(tpu_terminate_cmds)

@infwinston (Member, Author) commented Dec 14, 2022

OK, all smoke tests have passed. I just spot-launched 7 TPU VMs to see if they can handle preemptions correctly. Will merge tomorrow if everything works well. Thanks a lot for reviewing, @Michaelvll @concretevitamin!

(sky) ubuntu@ip-172-31-94-104:~$ sky spot queue | grep "tpu"
46  sky-c3a0-ubuntu                            1x [tpu-v2-8:1]   7 mins ago   7m 51s         1m 47s        0            RUNNING
45  sky-64ce-ubuntu                            1x [tpu-v3-8:1]   8 mins ago   8m 54s         2m 48s        0            RUNNING
44  sky-e5d5-ubuntu                            1x [tpu-v2-8:1]   10 mins ago  10m 9s         4m            0            RUNNING
43  sky-cc0c-ubuntu                            1x [tpu-v2-8:1]   10 mins ago  10m 59s        -             0            STARTING
42  sky-e540-ubuntu                            1x [tpu-v2-8:1]   11 mins ago  11m 45s        -             0            STARTING
41  sky-0d9f-ubuntu                            1x [tpu-v2-32:1]  13 mins ago  13m            5m 31s        0            RUNNING
40  sky-c11f-ubuntu                            1x [tpu-v3-32:1]  14 mins ago  14m 4s         5m 20s        0            RUNNING

@infwinston (Member, Author)

They all recovered successfully from preemptions (1 out of 7 hit the special-case situation). Merging this PR now!

ID  NAME                                       RESOURCES         SUBMITTED   TOT. DURATION  JOB DURATION   #RECOVERIES  STATUS
46  sky-c3a0-ubuntu                            1x [tpu-v2-8:1]   1 day ago   1 day 26m 28s  1 day 14m 23s  1            RUNNING
45  sky-64ce-ubuntu                            1x [tpu-v3-8:1]   1 day ago   1 day 27m 31s  1 day 15m 27s  1            RUNNING
44  sky-e5d5-ubuntu                            1x [tpu-v2-8:1]   1 day ago   1 day 28m 46s  1 day 14m 20s  1            RUNNING
43  sky-cc0c-ubuntu                            1x [tpu-v2-8:1]   1 day ago   1 day 29m 36s  1 day 12m 18s  1            RUNNING
42  sky-e540-ubuntu                            1x [tpu-v2-8:1]   1 day ago   1 day 30m 22s  1 day 3m 53s   1            RUNNING
41  sky-0d9f-ubuntu                            1x [tpu-v2-32:1]  1 day ago   1 day 31m 37s  1 day 16m 40s  1            RUNNING
40  sky-c11f-ubuntu                            1x [tpu-v3-32:1]  1 day ago   1 day 32m 41s  1 day 15m 10s  1            RUNNING

@infwinston infwinston merged commit af1b7fd into master Dec 15, 2022
@infwinston infwinston deleted the tpu-safeguard branch December 15, 2022 07:21
@infwinston infwinston mentioned this pull request Dec 19, 2022

Successfully merging this pull request may close these issues.

[TPU VM] Cannot sky down a TPU VM when it does not appear on GCP