
[Spot] Let the controller aware of the failed setup and fail early #1479

Merged: 30 commits merged from spot-aware-setup-failure into master on Jan 24, 2023

Conversation

Michaelvll
Collaborator

@Michaelvll Michaelvll commented Dec 1, 2022

Closes #1397

When a spot job fails during setup, the master branch retries up to 3 times (or forever if --retry-until-up is specified) and then sets the spot job to FAILED_NO_RESOURCES, which is not the desired behavior.

This PR makes the controller aware of the setup failure early and sets the spot job to FAILED.

A side benefit: the setup logs now go to sky spot logs as well, so the user doesn't have to use sky logs sky-spot-controller to see the output of their job's setup.

Tested:

@Michaelvll Michaelvll marked this pull request as ready for review December 1, 2022 21:50
@@ -218,7 +221,7 @@ def _launch(self, max_retry=3, raise_on_failure=True) -> Optional[float]:
 continue

 # Check the job status until it is not in initialized status
-if status is not None and job_lib.JobStatus.PENDING < status:
+if status is not None and job_lib.JobStatus.INIT < status:
Member

This seems a bit weird.

  • JobStatus = [INIT, SETTING_UP, PENDING, RUNNING, ...]
  • SpotStatus = [PENDING, SUBMITTED, STARTING, RUNNING, ...]

With this change, we could return to controller.py from launch() while the JobStatus is still SETTING_UP or PENDING. The controller process will then set the SpotStatus to RUNNING.
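
For illustration only, a minimal sketch of an ordered status enum (not SkyPilot's actual job_lib code) showing why relaxing the guard from PENDING to INIT lets launch() return while the job is still in SETTING_UP or PENDING:

# Illustrative only; SkyPilot's real job_lib.JobStatus differs in detail.
import enum

class JobStatus(enum.Enum):
    INIT = 0
    SETTING_UP = 1
    PENDING = 2
    RUNNING = 3
    SUCCEEDED = 4

    def __lt__(self, other: 'JobStatus') -> bool:
        return self.value < other.value

status = JobStatus.SETTING_UP
print(JobStatus.PENDING < status)  # False: the old guard keeps waiting.
print(JobStatus.INIT < status)     # True: the new guard returns during setup.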

Collaborator Author
@Michaelvll Michaelvll Dec 11, 2022

Yes, I think that might be fine, as we can redefine the spot job's RUNNING status to cover the actual job being in SETTING_UP or RUNNING. That means the spot_status only reflects whether the user's commands have started running. The PENDING status for the actual job will not last long, since the launched spot cluster is dedicated to the job, i.e. the job will be RUNNING as soon as SETTING_UP finishes.

If you think it is still confusing, I can try to see if it is possible to keep the previous status transition behavior, but that may be more complicated.
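
To make the proposed semantics concrete, a rough, hypothetical mapping (illustrative only, not SkyPilot's actual spot_state code):

# Hypothetical mapping for illustration only.
def to_spot_status(job_status: str) -> str:
    if job_status == 'INIT':
        return 'STARTING'
    if job_status in ('SETTING_UP', 'PENDING', 'RUNNING'):
        # Spot RUNNING now means "the user's job has started, possibly still
        # in setup"; PENDING is brief since the cluster is dedicated to it.
        return 'RUNNING'
    return 'TERMINAL'  # e.g. SUCCEEDED / FAILED / CANCELLED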

Member

We may need a very detailed comment somewhere, including the two lists in my previous comment plus some of the tricky behaviors. Maybe in spot_state?

Collaborator Author

Good point! I added the comments to spot_state.SpotStatus. PTAL :)

@Michaelvll
Collaborator Author

We reverted #1579 because that PR did not correctly handle a preemption that happens during setup (the spot job would fail immediately even if the setup was preempted during recovery). This PR is a better solution for making the spot job aware of the setup failure.

Pros:

  1. The setup failure is distinguished from a preemption during setup using the same logic we already use to distinguish a job failure from a cluster preemption (a sketch of this check follows the cons list below).
  2. The setup failure is exposed in the spot status, so the user can better understand why the spot job failed.

Cons:

  1. The semantics of the job duration change, as it now includes the setup duration; this seems fine, as discussed in #1479 (comment).
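
A minimal sketch of the control flow described in the pros above; the helper name and statuses are illustrative, not SkyPilot's actual controller API:

# Hypothetical sketch only; names do not correspond to SkyPilot's real code.
def decide_spot_outcome(cluster_healthy: bool, job_status: str) -> str:
    """Classify why a spot job stopped, reusing the failure-vs-preemption logic."""
    if not cluster_healthy:
        # The cluster was preempted (possibly mid-setup): recover and retry.
        return 'RECOVERING'
    if job_status == 'FAILED_SETUP':
        # The cluster is healthy but setup itself failed: fail early instead of
        # retrying 3 times (or forever with --retry-until-up).
        return 'FAILED_SETUP'
    if job_status == 'FAILED':
        return 'FAILED'
    return 'SUCCEEDED'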

Wdyt @concretevitamin @infwinston?

Member
@concretevitamin concretevitamin left a comment

Thanks @Michaelvll. A question to understand this solution better:

  1. If setup is preempted: the only place we set a job to FAILED_SETUP is in the generated Ray program that runs on the spot cluster: https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L264-L278 (a rough sketch of that generated logic follows this list). If a spot cluster is preempted while the setup commands are running, before we do job_lib.set_status(..., FAILED_SETUP), is it correct that we'll trigger the normal cluster preemption detection without relying on the FAILED_SETUP handling?

  2. I left some other comments on things that caused some confusion, mainly on (i) state transitions and (ii) how we start counting the duration.
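
A rough, hypothetical sketch of the shape of that generated setup handling; the real logic lives in the codegen linked above and differs in detail:

# Hypothetical sketch; the actual generated Ray program is produced by the
# codegen in cloud_vm_ray_backend.py and is more involved.
import subprocess

from sky.skylet import job_lib

def run_setup(job_id: int, setup_cmd: str) -> bool:
    proc = subprocess.run(setup_cmd, shell=True)
    if proc.returncode != 0:
        # Only reached if the cluster is still alive to execute this line; a
        # preemption mid-setup never records FAILED_SETUP, so the controller
        # falls back to its normal preemption detection instead.
        job_lib.set_status(job_id, job_lib.JobStatus.FAILED_SETUP)
        return False
    return True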

@@ -291,7 +291,7 @@ def get_latest_job_id() -> Optional[int]:


 def get_job_time_payload(job_id: int, is_end: bool) -> Optional[int]:
-    field = 'end_at' if is_end else 'start_at'
+    field = 'end_at' if is_end else 'submitted_at'
Member

Should we be concerned about this change?

Many places encode the return value as start_at, which has downstream usage/assumptions:

Member

Also, I'm confused about why we need this change. Let's say we submit 1000 spot jobs, making many of them pending. Would this change somehow start counting those pending jobs' durations?

Collaborator Author

Should we be concerned about this change?
Many places encode the return value as start_at, which has downstream usage/assumptions:

start_at = self._strategy_executor.launch()

Thanks for catching this! I changed the variable name to launched_at instead, but still kept the concept of start_time for the spot job, as we now consider the setup to happen after the spot job starts.

Also, I'm confused about why we need this change. Let's say we submit 1000 spot jobs, making many of them pending. Would this change somehow start counting those pending jobs' durations?

The pending period for a spot job does not count toward the job's duration, because:

  1. Each spot cluster is dedicated to a single spot job, which means the job will not stay in JobStatus.PENDING when the spot cluster is healthy.
  2. The 1000 spot jobs will be pending on the controller, i.e. in the controller's job queue. The submitted_at/start_at timestamps are only retrieved from the spot cluster, after the spot job is actually submitted to a spot cluster. The only change to the job duration is that it now covers both the setup and run sections instead of the run section alone (see the small sketch after this list).
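
A small numeric sketch of that change in duration semantics, using the timestamp columns discussed above (values are made up):

# Illustrative values only.
submitted_at = 1000.0  # job (setup + run) submitted to the spot cluster
start_at = 1120.0      # run section starts, i.e. setup took 120 s
end_at = 1600.0        # job finishes

old_duration = end_at - start_at      # 480 s: run section only
new_duration = end_at - submitted_at  # 600 s: setup + run sections
print(old_duration, new_duration)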

@Michaelvll
Collaborator Author

Michaelvll commented Jan 24, 2023

Tested:

  • pytest tests/test_smoke.py --managed-spot

Member
@concretevitamin concretevitamin left a comment

Thanks @Michaelvll for adding the excellent comments, which make the existing code much clearer! A few final comments.

Member
@concretevitamin concretevitamin left a comment

Thanks @Michaelvll for fixing this and bearing with the comments!

field = 'end_at' if get_ended_time else 'submitted_at'
rows = _CURSOR.execute(f'SELECT {field} FROM jobs WHERE job_id=(?)',
                       (job_id,))
for (timestamp,) in rows:
    return common_utils.encode_payload(timestamp)
# No matching row for this job_id: return an encoded None payload.
return common_utils.encode_payload(None)
Member

Good catch! Why didn't we trigger this bug?

Collaborator Author

The reason is that, in the normal case, the controller only queries job_ids that already exist in the spot table.

@Michaelvll Michaelvll merged commit 7adb54e into master Jan 24, 2023
@Michaelvll Michaelvll deleted the spot-aware-setup-failure branch January 24, 2023 18:58
@Michaelvll Michaelvll mentioned this pull request Jan 24, 2023
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Feb 22, 2023
…kypilot-org#1479)

* Let the controller aware of the failed setup and fail early

* format

* Add test

* yapf

* add test yaml

* increase timeout for spot tests

* fix

* Add timeout for final spot status waiting

* yapf

* fix merge error

* get rid of autostop test for spot controller

* reorder

* fix comment

* Add failed setup status for spot

* Update sky/spot/recovery_strategy.py

Co-authored-by: Zongheng Yang <[email protected]>

* Address comments

* format

* update and variable names

* format

* lint

* address comments

* Address comments

* fix test

Co-authored-by: Zongheng Yang <[email protected]>
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Mar 15, 2023

Successfully merging this pull request may close these issues.

[Spot] Error in the setup will keep the spot cluster retrying forever