[Spot] Let the controller aware of the failed setup and fail early #1479
Conversation
sky/spot/recovery_strategy.py (outdated)

```diff
@@ -218,7 +221,7 @@ def _launch(self, max_retry=3, raise_on_failure=True) -> Optional[float]:
                 continue

             # Check the job status until it is not in initialized status
-            if status is not None and job_lib.JobStatus.PENDING < status:
+            if status is not None and job_lib.JobStatus.INIT < status:
```
This seems a bit weird.
- JobStatus = [INIT, SETTING_UP, PENDING, RUNNING, ...]
- SpotStatus = [PENDING, SUBMITTED, STARTING, RUNNING, ...]
With this new change, this means we could return to controller.py from launch() when the JobStatus is in SETTING_UP or PENDING. The controller process will then set the SpotStatus to RUNNING.
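For context on the comparison itself: the check relies on job statuses being ordered, so they can be compared with `<`. Below is a minimal, self-contained sketch of that idea; the member list and `__lt__` here are illustrative assumptions, not the real definition in `sky/skylet/job_lib.py`.

```python
import enum
from typing import Optional


class JobStatus(enum.Enum):
    """Illustrative job states, ordered by lifecycle (not the real enum)."""
    INIT = 0
    SETTING_UP = 1
    PENDING = 2
    RUNNING = 3
    SUCCEEDED = 4

    def __lt__(self, other: 'JobStatus') -> bool:
        # Ordering follows the lifecycle order encoded by the values above.
        return self.value < other.value


def past_init(status: Optional[JobStatus]) -> bool:
    # The changed check: stop waiting once the job has moved past INIT,
    # so SETTING_UP and PENDING already satisfy it.
    return status is not None and JobStatus.INIT < status


assert past_init(JobStatus.SETTING_UP)   # launch() would return to the controller
assert not past_init(JobStatus.INIT)     # keep polling
assert not past_init(None)               # no job record yet
```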
Yes, I think that might be fine, as we can modify the definition of spot job `RUNNING` to mean the actual job is in `SETTING_UP` or `RUNNING`. That means the `spot_status` only roughly indicates whether the user's commands have started running. The `PENDING` status for the actual job will not last long, since the launched spot cluster is dedicated to the job, i.e. the job will be `RUNNING` as soon as `SETTING_UP` finishes.
If you think it is still confusing, I can try to see whether it is possible to keep the previous status transition behavior, but that may be more complicated.
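To make the proposed semantics concrete, a hypothetical mapping (plain strings for brevity; the real transition logic lives in the controller and `spot_state`, so this is only a sketch of the definition discussed above):

```python
def spot_status_for(job_status: str) -> str:
    """Hypothetical mapping: spot 'RUNNING' means the job has been submitted to
    its dedicated spot cluster and is setting up, pending, or running; it does
    not require the user's run commands to have started yet."""
    if job_status in ('SETTING_UP', 'PENDING', 'RUNNING'):
        return 'RUNNING'   # spot_status shown to the user
    return 'STARTING'      # cluster launched, job not yet past INIT


assert spot_status_for('SETTING_UP') == 'RUNNING'
assert spot_status_for('INIT') == 'STARTING'
```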
We may need to have a very detailed comment, incl the two lists in my previous comment + some tricky behaviors, somewhere. Maybe spot_state?
Good point! I added the comments in `spot_state.SpotStatus`. PTAL :)
We reverted #1579 because that PR does not correctly handle a preemption that happens during setup (the spot job would fail immediately even if the setup was interrupted by a preemption during recovery). This PR is a better solution for making the spot job aware of setup failures.
Pros:
Cons:
Wdyt @concretevitamin @infwinston?
Thanks @Michaelvll. Questions to understand this solution better:
- If setup is preempted: the only place we set a job to FAILED_SETUP is in the generated Ray program that runs on the spot cluster: https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L264-L278. If a spot cluster is preempted while the setup commands are running, before we call `job_lib.set_status(..., FAILED_SETUP)`, is it correct that we will trigger the normal cluster preemption detection without relying on the FAILED_SETUP handling? (See the sketch after this list.)
- Left some other comments about things that caused some confusion, mainly on (i) state transitions and (ii) how we start counting duration.
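A condensed, self-contained sketch of the ordering in question (hypothetical stand-ins for the generated program and for `job_lib.set_status`; the real codegen is at the link above): FAILED_SETUP is only written after the setup command exits non-zero on a live cluster, so a preemption that kills the cluster earlier never sets it and falls through to normal preemption detection.

```python
import subprocess


def set_job_status(job_id: int, status: str) -> None:
    # Stand-in for job_lib.set_status(job_id, JobStatus.<status>), which
    # updates the job table on the spot cluster.
    print(f'job {job_id} -> {status}')


def run_setup(job_id: int, setup_cmd: str) -> None:
    """Condensed stand-in for the setup step of the generated Ray program."""
    proc = subprocess.run(setup_cmd, shell=True, check=False)
    if proc.returncode != 0:
        # Reached only if setup fails while the cluster is still alive. If the
        # cluster is preempted before this point, FAILED_SETUP is never
        # written, and the controller's usual preemption detection applies.
        set_job_status(job_id, 'FAILED_SETUP')
    else:
        set_job_status(job_id, 'PENDING')  # hand off to the run step (illustrative)


run_setup(1, 'exit 1')  # job 1 -> FAILED_SETUP
run_setup(2, 'true')    # job 2 -> PENDING
```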
sky/skylet/job_lib.py (outdated)

```diff
@@ -291,7 +291,7 @@ def get_latest_job_id() -> Optional[int]:


 def get_job_time_payload(job_id: int, is_end: bool) -> Optional[int]:
-    field = 'end_at' if is_end else 'start_at'
+    field = 'end_at' if is_end else 'submitted_at'
```
Should we be concerned about this change? Many places encode the return value as start_at, which has downstream usage/assumptions, e.g. in sky/spot/controller.py (line 67 in 8a80f5b): `start_at = self._strategy_executor.launch()`, etc.
Also, I'm confused about why we need this change. Let's say we submit 1000 spot jobs, thus making many of them pending. Would this change somehow start counting those pending jobs' duration?
> Should we be concerned about this change? Many places encode the return value as start_at, which has downstream usage/assumptions, e.g. in sky/spot/controller.py (line 67 in 8a80f5b): `start_at = self._strategy_executor.launch()`

Thanks for catching this! I changed the variable name to `launched_at` instead, but still keep the concept of `start_time` for the spot job, as we now consider the setup to happen after the spot job starts.
> Also, I'm confused about why we need this change. Let's say we submit 1000 spot jobs, thus making many of them pending. Would this change somehow start counting those pending jobs' duration?
The pending period for a spot job does not count toward the job's duration, because:
- Each spot cluster is dedicated to a single spot job, which means the job will not sit in JobStatus.PENDING while the spot cluster is healthy.
- The 1000 spot jobs will be pending on the controller, i.e. in the controller's job queue. submitted_at/start_at is only retrieved from the spot cluster, after the spot job has actually been submitted to a spot cluster. The only change to the job duration is from the duration of the `run` section to the combined duration of the `setup` and `run` sections (see the illustration after this list).
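A small illustration of the bookkeeping above (the field names come from the jobs table shown in the diff; the arithmetic is an assumed simplification of how the duration is derived): switching the payload from start_at to submitted_at makes the reported duration cover setup plus run, while jobs still queued on the controller contribute nothing because they have no spot-cluster timestamps yet.

```python
from typing import Optional


def job_duration(submitted_at: Optional[float], start_at: Optional[float],
                 end_at: Optional[float], count_setup: bool) -> Optional[float]:
    """Illustrative duration calculation.

    submitted_at: when the job was submitted to its dedicated spot cluster
                  (setup starts shortly after).
    start_at:     when the run section actually started.
    end_at:       when the job finished.
    """
    begin = submitted_at if count_setup else start_at
    if begin is None or end_at is None:
        return None  # still queued on the controller, or not finished yet
    return end_at - begin


# A job that spent 300s in setup and 3600s running:
print(job_duration(1000.0, 1300.0, 4900.0, count_setup=True))   # 3900.0 (setup + run)
print(job_duration(1000.0, 1300.0, 4900.0, count_setup=False))  # 3600.0 (run only)
# A job still pending in the controller's queue has no spot-cluster timestamps:
print(job_duration(None, None, None, count_setup=True))          # None
```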
Tested:
Thanks @Michaelvll for adding the excellent comments which make the existing code massively more clear! A few final comments.
Thanks @Michaelvll for fixing this and bearing with the comments!
```python
field = 'end_at' if get_ended_time else 'submitted_at'
rows = _CURSOR.execute(f'SELECT {field} FROM jobs WHERE job_id=(?)',
                       (job_id,))
for (timestamp,) in rows:
    return common_utils.encode_payload(timestamp)
return common_utils.encode_payload(None)
```
Good catch! Why didn't we trigger this bug?
The reason is that, in the normal case, the controller always queries a job_id that exists in the spot table.
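To make the missing-row case concrete, here is a self-contained simulation of the lookup (sqlite3 in memory, with a JSON stand-in for `common_utils.encode_payload`, whose real encoding may differ): if the queried `job_id` has no row, the loop body never executes, and without the final fallback the function would implicitly return `None` instead of an encoded payload.

```python
import json
import sqlite3


def encode_payload(value) -> str:
    # Stand-in for common_utils.encode_payload; the real encoding may differ.
    return json.dumps(value)


conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE jobs (job_id INTEGER, submitted_at REAL, end_at REAL)')
cursor.execute('INSERT INTO jobs VALUES (1, 1000.0, 4900.0)')


def get_job_time_payload(job_id: int, get_ended_time: bool) -> str:
    field = 'end_at' if get_ended_time else 'submitted_at'
    rows = cursor.execute(f'SELECT {field} FROM jobs WHERE job_id=(?)',
                          (job_id,))
    for (timestamp,) in rows:
        return encode_payload(timestamp)
    # Without this fallback, a job_id absent from the table would yield no
    # payload at all (an implicit Python None) for the caller to decode.
    return encode_payload(None)


print(get_job_time_payload(1, get_ended_time=False))   # "1000.0"
print(get_job_time_payload(999, get_ended_time=True))  # "null"
```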
Closes #1397

When a spot job fails during setup, the master branch will retry 3 times (or forever if `--retry-until-up` is specified) and set the spot job to `FAILED_NO_RESOURCES`, which is not desired. This PR makes the controller aware of the setup failure early and sets the spot job to `FAILED`.

A side benefit: the setup logs go to `sky spot logs` as well, so the user doesn't have to use `sky logs sky-spot-controller` to see the output of their job's setup.

Tested:
- `sky spot launch -n test-setup-failure ./tests/test_yamls/failed_setup.yaml`
- `pytest tests/test_smoke.py` (except for [WIP][Azure] Fix azure race condition #1428)
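Finally, a rough, hypothetical sketch of the fail-early behavior described above (not the actual code in `sky/spot/recovery_strategy.py` or `controller.py`): a launch attempt that reports a setup failure short-circuits to `FAILED`, while resource-shortage failures still retry and only then become `FAILED_NO_RESOURCES`.

```python
from typing import Iterable


def launch_with_retries(outcomes: Iterable[str], max_retry: int = 3,
                        retry_until_up: bool = False) -> str:
    """Illustrative controller-side loop. Each element of `outcomes` is the
    result of one launch attempt: 'RUNNING', 'FAILED_SETUP', or 'NO_RESOURCES'."""
    attempt = 0
    for result in outcomes:
        attempt += 1
        if result == 'RUNNING':
            return 'RUNNING'
        if result == 'FAILED_SETUP':
            # This PR's behavior: surface the user's setup failure immediately.
            return 'FAILED'
        # 'NO_RESOURCES': retry, unless we are out of attempts.
        if not retry_until_up and attempt >= max_retry:
            break
    return 'FAILED_NO_RESOURCES'


# A deterministic setup failure no longer burns all retries:
print(launch_with_retries(['FAILED_SETUP']))             # FAILED
print(launch_with_retries(['NO_RESOURCES'] * 3))         # FAILED_NO_RESOURCES
print(launch_with_retries(['NO_RESOURCES', 'RUNNING']))  # RUNNING
```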