Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spot] Show spot controller in sky status and simplify tearing down #1270

Merged
merged 13 commits into from
Oct 21, 2022

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Oct 18, 2022

According to the offline discussion with @concretevitamin, we decided to add the spot controller in the sky status to avoid confusion caused by the controller in the cloud console. (Part of the efforts for fixing #1269)

After this PR, the sky status will be:
image

When tearing down the controller, the output would be (no --purge required, as purging can lead to resource leakage):
image

TODO:

@Michaelvll Michaelvll marked this pull request as ready for review October 18, 2022 07:34
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice @Michaelvll - Quick UX comments, haven't read the code in detail.

sky/utils/cli_utils/status_utils.py Outdated Show resolved Hide resolved
if is_reserved:
click.echo(
f'{colorama.Style.DIM}Reserved clusters will be autostopped '
'when inactive. Refresh statuses with: sky status --refresh'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(didn't read code carefully) do we need to show "Refresh statuses with: sky status --refresh" here and again below? Can we show it once below and keep this hint about controller shorter?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please refer to the latest output in the comment.

sky/cli.py Outdated
f'{reserved_clusters_str} is not supported.')
else:
click.secho(
'WARNING: Tearing down a SkyPilot reserved cluster. If '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto - just say managed spot controller.

Also, we should hint on all consequences (1) <what's already written here> (2) managed spot jobs info (sky spot status) will be lost.

For (1), maybe add a TODO that we automatically warn if there are in-progress jobs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I added the warning for the in-progress jobs as well. PTAL.

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a pass.

sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/core.py Outdated Show resolved Hide resolved
@Michaelvll
Copy link
Collaborator Author

Michaelvll commented Oct 20, 2022

New status table:
image

Tearing down the STOPPED spot controller:
image

Tearing down an INIT spot controller (submitting jobs):
image

Tearing down an UP spot controller (with no in-progress jobs):
image

Tearing down an UP spot controller (with in-progress jobs):
image

Please let me know if you think any of the outputs should be improved. : )

Future TODO (new PR):

  • enable the sky stop for the spot controller with the messages shown above.
  • enable autodown for the spot controller

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll - some questions.

sky/cli.py Outdated
filtered_cluster_records = []
reserved_clusters = []
nonreserved_cluster_records = []
reserved_clusters = collections.defaultdict(list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to simplify this group based logic? Maybe specializing to just 1 spot controller?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Changed each group to only has 1 cluster.

sky/cli.py Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated
Comment on lines 1965 to 1966
f'\n {cnt}. Resource leakage may happen caused by '
'spot jobs being submitted, and in-progress spot jobs.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If there are pending/in-progress spot jobs, those resources will not be terminated and require manual cleanup.

Actually, why do we show this when the controller is INIT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is possible that another sky spot launch is running in parallel, which will make the cluster status INIT, i.e. during a spot job is being submitted to the spot controller.

sky/cli.py Outdated
'spot jobs being submitted, and in-progress spot jobs.')
cnt += 1
elif cluster_status == global_user_state.ClusterStatus.UP:
spot_jobs = core.spot_status(refresh=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if before this line, a concurrent spot launch made the controller INIT? Would this call fail? The docstr of this func doesn't suggest what to expect.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Added an error handling for it. PTAL.

sky/utils/cli_utils/status_utils.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/utils/cli_utils/status_utils.py Outdated Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @Michaelvll for this UX improvement!

sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated
non_terminal_jobs):
msg += ('\n * In-progress spot jobs found, their resources '
'will not be terminated and require manual cleanup '
'(sky spot cancel --all):\n')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to show (sky spot cancel --all)? User is downing the controller, and after it's downed, spot cancel won't work? I feel either of the following is fine:

  • just say manual cleanup (I prefer this one)
  • say manual cleanup via cloud consoles

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I was trying to say abort the current down process and run sky spot cancel --all first, and then do the sky down, but that could take a long time, waiting the controller process hit the line that checks the cancel signal.

Removed. Thanks!

sky/cli.py Outdated Show resolved Hide resolved
@Michaelvll Michaelvll merged commit 2aad67a into master Oct 21, 2022
@Michaelvll Michaelvll deleted the reserved-cluster-in-status branch October 21, 2022 07:08
ewzeng pushed a commit to ewzeng/skypilot that referenced this pull request Oct 24, 2022
…kypilot-org#1270)

* Show controller in status and enable tear down.

* Reduce the autostop for controller to save cost

* fix test

* format

* address comments

* fix down output

* fix test

* address comments

* Fix the docs for spot job

* fix space

* address comments
ewzeng pushed a commit to ewzeng/skypilot that referenced this pull request Oct 24, 2022
…kypilot-org#1270)

* Show controller in status and enable tear down.

* Reduce the autostop for controller to save cost

* fix test

* format

* address comments

* fix down output

* fix test

* address comments

* Fix the docs for spot job

* fix space

* address comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants