
[Spot] Let the controller aware of the failed setup and fail early #1479

Merged: 30 commits merged from spot-aware-setup-failure into master on Jan 24, 2023

Conversation

Michaelvll
Collaborator

@Michaelvll Michaelvll commented Dec 1, 2022

Closes #1397

When a spot job fails during setup, the master branch retries up to 3 times (or forever if --retry-until-up is specified) and then sets the spot job to FAILED_NO_RESOURCES, which is not the desired behavior.

This PR makes the controller aware of the setup failure early and sets the spot job to FAILED.

A side benefit: the setup logs now go to sky spot logs as well, so the user doesn't have to use sky logs sky-spot-controller to see the output of their job's setup.

Tested:

@Michaelvll Michaelvll marked this pull request as ready for review December 1, 2022 21:50
@@ -218,7 +221,7 @@ def _launch(self, max_retry=3, raise_on_failure=True) -> Optional[float]:
 continue

 # Check the job status until it is not in initialized status
-if status is not None and job_lib.JobStatus.PENDING < status:
+if status is not None and job_lib.JobStatus.INIT < status:
Member

This seems a bit weird.

  • JobStatus = [INIT, SETTING_UP, PENDING, RUNNING, ...]
  • SpotStatus = [PENDING, SUBMITTED, STARTING, RUNNING, ...]

With this change, we could return to controller.py from launch() while the JobStatus is still SETTING_UP or PENDING. The controller process will then set the SpotStatus to RUNNING.
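
For illustration only, a minimal sketch of an ordered status enum (not SkyPilot's actual job_lib code) showing why relaxing the guard from PENDING to INIT lets launch() return while the job is still in SETTING_UP or PENDING:

# Illustrative only; SkyPilot's real job_lib.JobStatus differs in detail.
import enum

class JobStatus(enum.Enum):
    INIT = 0
    SETTING_UP = 1
    PENDING = 2
    RUNNING = 3
    SUCCEEDED = 4

    def __lt__(self, other: 'JobStatus') -> bool:
        return self.value < other.value

status = JobStatus.SETTING_UP
print(JobStatus.PENDING < status)  # False: the old guard keeps waiting.
print(JobStatus.INIT < status)     # True: the new guard returns during setup.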

Collaborator Author
@Michaelvll Michaelvll Dec 11, 2022

Yes, I think that might be fine, as we can redefine the spot job's RUNNING status to cover the actual job being in SETTING_UP or RUNNING. That means the spot_status only reflects whether the user's commands have started running. The PENDING status for the actual job will not last long, since the launched spot cluster is dedicated to the job, i.e. the job will be RUNNING as soon as SETTING_UP finishes.

If you think it is still confusing, I can try to see if it is possible to keep the previous status transition behavior, but that may be more complicated.
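
To make the proposed semantics concrete, a rough, hypothetical mapping (illustrative only, not SkyPilot's actual spot_state code):

# Hypothetical mapping for illustration only.
def to_spot_status(job_status: str) -> str:
    if job_status == 'INIT':
        return 'STARTING'
    if job_status in ('SETTING_UP', 'PENDING', 'RUNNING'):
        # Spot RUNNING now means "the user's job has started, possibly still
        # in setup"; PENDING is brief since the cluster is dedicated to it.
        return 'RUNNING'
    return 'TERMINAL'  # e.g. SUCCEEDED / FAILED / CANCELLED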

Member

We may need a very detailed comment somewhere, including the two lists in my previous comment plus some of the tricky behaviors. Maybe in spot_state?

Collaborator Author

Good point! I added the comments to spot_state.SpotStatus. PTAL :)

@Michaelvll
Collaborator Author

We reverted #1579 because that PR did not correctly handle a preemption that happens during setup (the spot job would fail immediately even if the setup was preempted during recovery). This PR is a better solution for making the spot job aware of the setup failure.

Pros:

  1. The setup failure is distinguished from a preemption during setup using the same logic we already use to distinguish a job failure from a cluster preemption (a sketch of this check follows the cons list below).
  2. The setup failure is exposed in the spot status, so the user can better understand why the spot job failed.

Cons:

  1. The semantics of the job duration change, as it now includes the setup duration; this seems fine, as discussed in #1479 (comment).
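
A minimal sketch of the control flow described in the pros above; the helper name and statuses are illustrative, not SkyPilot's actual controller API:

# Hypothetical sketch only; names do not correspond to SkyPilot's real code.
def decide_spot_outcome(cluster_healthy: bool, job_status: str) -> str:
    """Classify why a spot job stopped, reusing the failure-vs-preemption logic."""
    if not cluster_healthy:
        # The cluster was preempted (possibly mid-setup): recover and retry.
        return 'RECOVERING'
    if job_status == 'FAILED_SETUP':
        # The cluster is healthy but setup itself failed: fail early instead of
        # retrying 3 times (or forever with --retry-until-up).
        return 'FAILED_SETUP'
    if job_status == 'FAILED':
        return 'FAILED'
    return 'SUCCEEDED'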

Wdyt @concretevitamin @infwinston?

Member
@concretevitamin concretevitamin left a comment

Thanks @Michaelvll. A question to understand this solution better:

  1. If setup is preempted: the only place we set a job to FAILED_SETUP is in the generated Ray program that runs on the spot cluster: https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L264-L278 (a rough sketch of that generated logic follows this list). If a spot cluster is preempted while the setup commands are running, before we do job_lib.set_status(..., FAILED_SETUP), is it correct that we'll trigger the normal cluster preemption detection without relying on the FAILED_SETUP handling?

  2. I left some other comments on things that caused some confusion, mainly on (i) state transitions and (ii) how we start counting the duration.
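
A rough, hypothetical sketch of the shape of that generated setup handling; the real logic lives in the codegen linked above and differs in detail:

# Hypothetical sketch; the actual generated Ray program is produced by the
# codegen in cloud_vm_ray_backend.py and is more involved.
import subprocess

from sky.skylet import job_lib

def run_setup(job_id: int, setup_cmd: str) -> bool:
    proc = subprocess.run(setup_cmd, shell=True)
    if proc.returncode != 0:
        # Only reached if the cluster is still alive to execute this line; a
        # preemption mid-setup never records FAILED_SETUP, so the controller
        # falls back to its normal preemption detection instead.
        job_lib.set_status(job_id, job_lib.JobStatus.FAILED_SETUP)
        return False
    return True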

@@ -291,7 +291,7 @@ def get_latest_job_id() -> Optional[int]:


 def get_job_time_payload(job_id: int, is_end: bool) -> Optional[int]:
-    field = 'end_at' if is_end else 'start_at'
+    field = 'end_at' if is_end else 'submitted_at'
Member

Should we be concerned about this change?

Many places encode the return value as start_at, which has downstream usage/assumptions:

Member

Also, I'm confused about why we need this change. Let's say we submit 1000 spot jobs, making many of them pending. Would this change somehow start counting those pending jobs' durations?

Collaborator Author

Should we be concerned about this change?
Many places encode the return value as start_at, which has downstream usage/assumptions:

start_at = self._strategy_executor.launch()

Thanks for catching this! I changed the variable name to launched_at instead, but still kept the concept of start_time for the spot job, as we now consider the setup to happen after the spot job starts.

Also, I'm confused about why we need this change. Let's say we submit 1000 spot jobs, making many of them pending. Would this change somehow start counting those pending jobs' durations?

The pending period for a spot job does not count toward the job's duration, because:

  1. Each spot cluster is dedicated to a single spot job, which means the job will not stay in JobStatus.PENDING when the spot cluster is healthy.
  2. The 1000 spot jobs will be pending on the controller, i.e. in the controller's job queue. The submitted_at/start_at timestamps are only retrieved from the spot cluster, after the spot job is actually submitted to a spot cluster. The only change to the job duration is that it now covers both the setup and run sections instead of the run section alone (see the small sketch after this list).
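
A small numeric sketch of that change in duration semantics, using the timestamp columns discussed above (values are made up):

# Illustrative values only.
submitted_at = 1000.0  # job (setup + run) submitted to the spot cluster
start_at = 1120.0      # run section starts, i.e. setup took 120 s
end_at = 1600.0        # job finishes

old_duration = end_at - start_at      # 480 s: run section only
new_duration = end_at - submitted_at  # 600 s: setup + run sections
print(old_duration, new_duration)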

@Michaelvll
Collaborator Author

Michaelvll commented Jan 24, 2023

Tested:

  • pytest tests/test_smoke.py --managed-spot

Member
@concretevitamin concretevitamin left a comment

Thanks @Michaelvll for adding the excellent comments, which make the existing code much clearer! A few final comments.

Member
@concretevitamin concretevitamin left a comment

Thanks @Michaelvll for fixing this and bearing with the comments!

field = 'end_at' if get_ended_time else 'submitted_at'
rows = _CURSOR.execute(f'SELECT {field} FROM jobs WHERE job_id=(?)',
                       (job_id,))
for (timestamp,) in rows:
    return common_utils.encode_payload(timestamp)
# No matching row for this job_id: return an encoded None payload.
return common_utils.encode_payload(None)
Member

Good catch! Why didn't we trigger this bug?

Collaborator Author

The reason is that, in the normal case, the controller only queries job_ids that already exist in the spot table.

@Michaelvll Michaelvll merged commit 7adb54e into master Jan 24, 2023
@Michaelvll Michaelvll deleted the spot-aware-setup-failure branch January 24, 2023 18:58
@Michaelvll Michaelvll mentioned this pull request Jan 24, 2023
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Feb 22, 2023
…kypilot-org#1479)

* Let the controller aware of the failed setup and fail early

* format

* Add test

* yapf

* add test yaml

* increase timeout for spot tests

* fix

* Add timeout for final spot status waiting

* yapf

* fix merge error

* get rid of autostop test for spot controller

* reorder

* fix comment

* Add failed setup status for spot

* Update sky/spot/recovery_strategy.py

Co-authored-by: Zongheng Yang <[email protected]>

* Address comments

* format

* update and variable names

* format

* lint

* address comments

* Address comments

* fix test

Co-authored-by: Zongheng Yang <[email protected]>
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Mar 15, 2023

Successfully merging this pull request may close these issues.

[Spot] Error in the setup will keep the spot cluster retrying forever