
[Core] Fix resource validation for existing cluster and refactor #1259

Merged
merged 5 commits on Jul 15, 2023

Conversation

Michaelvll
Collaborator

@Michaelvll Michaelvll commented Oct 16, 2022

Fixes #1258

This PR validates the requested resources against the offerings for each region/zone only when the resources are launchable and are being launched through the optimizer. This is because, if the cluster already exists, the user may specify a combination of accelerator and accelerator count that is not offered in the specified region/zone.

For example: AWS may only offer A100:8 in a region, but the user can run sky exec cluster_name --region region_name --cloud aws --gpus A100:1 to schedule the job on the existing cluster. The current master branch blocks this behavior, because A100:1 is not offered in that region.
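For intuition, the accelerator-count part of that comparison boils down to something like the following minimal sketch (assumed semantics for illustration only; SkyPilot's less_demanding_than checks more fields than this):

# Hypothetical sketch, not SkyPilot code: an A100:1 request fits a cluster
# that was launched with A100:8.
def acc_fits(requested, launched):
    """requested/launched are accelerator dicts, e.g. {'A100': 1} and {'A100': 8}."""
    return all(launched.get(acc, 0) >= count for acc, count in requested.items())

assert acc_fits({'A100': 1}, {'A100': 8})       # sky exec on the existing cluster
assert not acc_fits({'A100': 8}, {'A100': 1})   # would not fit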

The validation of the Resources for an existing cluster is done by:

if not (task.num_nodes <= handle.launched_nodes and
        task_resources.less_demanding_than(
            launched_resources, requested_num_nodes=task.num_nodes)):
    if (task_resources.region is not None and
            task_resources.region != launched_resources.region):
        with ux_utils.print_exception_no_traceback():
            raise exceptions.ResourcesMismatchError(
                'Task requested resources in region '
                f'{task_resources.region!r}, but the existing cluster '
                f'is in region {launched_resources.region!r}.')
    if (task_resources.zone is not None and
            task_resources.zone != launched_resources.zone):
        zone_str = (f'is in zone {launched_resources.zone!r}.'
                    if launched_resources.zone is not None else
                    'does not have zone specified.')
        with ux_utils.print_exception_no_traceback():
            raise exceptions.ResourcesMismatchError(
                'Task requested resources in zone '
                f'{task_resources.zone!r}, but the existing cluster '
                f'{zone_str}')
    with ux_utils.print_exception_no_traceback():
        raise exceptions.ResourcesMismatchError(
            'Requested resources do not match the existing cluster.\n'
            f' Requested:\t{task.num_nodes}x {task_resources} \n'
            f' Existing:\t{handle.launched_nodes}x '
            f'{handle.launched_resources}\n'
            f'{mismatch_str}')

Note: This PR should be compatible with #1429, as the validate_launchable function should only be called on resources that are meant to be used to create new VMs.


This PR also refactors the optimizer logic to make the resource validation clearer.

We previously had two code paths for validating the requested resources, so the checking logic had to be duplicated for launchable and non-launchable requested resources.

skypilot/sky/optimizer.py, lines 957 to 1024 at 1501708:

elif resources.is_launchable():
    if isinstance(resources.cloud, clouds.GCP):
        # Check if the host VM satisfies the max vCPU and memory limits.
        clouds.GCP.check_accelerator_attachable_to_host(
            resources.instance_type, resources.accelerators,
            resources.zone)
        # If the user has specified a GCP zone and the zone does not support
        # the host-accelerator combination, then an error will be raised by
        # the above check_accelerator_attachable_to_host() call.
        # If the user has not specified any zone, a launchable will be made
        # for every zone even if some of the zones do not support the
        # host-accelerator combination. Then the provisioner may try to
        # launch the instance, and fail over to other zones. We find this
        # behavior acceptable because this will happen only when the user
        # requested GCP 4:P100 or 8:K80 with a very large host VM.
    elif isinstance(resources.cloud, clouds.SCP):
        # Check if the host VM satisfies the min/max disk size limits.
        is_allowed, launchable[resources] = \
            clouds.SCP.is_disk_size_allowed(resources)
        if not is_allowed:
            continue
    launchable[resources] = _make_launchables_for_valid_region_zones(
        resources)
else:
    clouds_list = ([resources.cloud]
                   if resources.cloud is not None else enabled_clouds)
    # Hack: When >=2 cloud candidates, always remove local cloud from
    # possible candidates. This is so the optimizer will consider
    # public clouds, except local. Local will be included as part of
    # optimizer in a future PR.
    # TODO(mluo): Add on-prem to cloud spillover.
    if len(clouds_list) >= 2:
        clouds_list = [
            c for c in clouds_list if not isinstance(c, clouds.Local)
        ]
    all_fuzzy_candidates = set()
    for cloud in clouds_list:
        (feasible_resources, fuzzy_candidate_list) = (
            cloud.get_feasible_launchable_resources(resources))
        if len(feasible_resources) > 0:
            # Assume feasible_resources is sorted by prices.
            cheapest = feasible_resources[0]
            # Generate region/zone-specified resources.
            launchable[resources].extend(
                _make_launchables_for_valid_region_zones(cheapest))
            cloud_candidates[cloud] = feasible_resources
        else:
            all_fuzzy_candidates.update(fuzzy_candidate_list)
    if len(launchable[resources]) == 0:
        clouds_str = str(clouds_list) if len(clouds_list) > 1 else str(
            clouds_list[0])
        logger.info(f'No resource satisfying {resources} '
                    f'on {clouds_str}.')
        if len(all_fuzzy_candidates) > 0:
            logger.info('Did you mean: '
                        f'{colorama.Fore.CYAN}'
                        f'{sorted(all_fuzzy_candidates)}'
                        f'{colorama.Style.RESET_ALL}')
        else:
            if resources.cpus is not None:
                logger.info('Try specifying a different CPU count, '
                            'or add "+" to the end of the CPU count '
                            'to allow for larger instances.')
            if resources.memory is not None:
                logger.info('Try specifying a different memory size, '
                            'or add "+" to the end of the memory size '
                            'to allow for larger instances.')

For example, in #2096, although we disable spot instances in cloud.get_feasible_launchable_resources and sky launch --use-spot therefore skips the Kubernetes cloud, we can still get around that check by making the requested resources launchable: sky launch --cloud kubernetes --instance-type 8CPU-16GB --use-spot.

In other words, the current code is quite error-prone: it is easy to forget to update both places when a check changes. The refactoring fixes this by always going through get_feasible_launchable_resources, as sketched below.
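As a rough illustration of the single code path (a sketch under assumptions: StubCloud, make_launchables, and the dict-based resources below are hypothetical stand-ins, not SkyPilot's internals; only the method name get_feasible_launchable_resources comes from the real code):

from typing import List, Set, Tuple

class StubCloud:
    """Hypothetical stand-in for a cloud that never offers spot instances."""

    def get_feasible_launchable_resources(
            self, resources: dict) -> Tuple[List[dict], List[str]]:
        if resources.get('use_spot'):
            return [], ['on-demand only']   # infeasible: return fuzzy suggestions
        return [dict(resources, region='us-east-1')], []

def make_launchables(resources: dict, clouds: List[StubCloud]):
    launchable: List[dict] = []
    fuzzy: Set[str] = set()
    for cloud in clouds:
        # Every request, even one that pins an instance type, goes through the
        # cloud's feasibility check, so a rule like "no spot on this cloud"
        # cannot be bypassed.
        feasible, candidates = cloud.get_feasible_launchable_resources(resources)
        if feasible:
            launchable.extend(feasible)
        else:
            fuzzy.update(candidates)
    return launchable, sorted(fuzzy)

print(make_launchables({'instance_type': '8CPU-16GB', 'use_spot': True}, [StubCloud()]))
# -> ([], ['on-demand only']): pinning the instance type no longer skips the check.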


Tested:

  • pytest tests/test_smoke.py

@concretevitamin
Member

Tagging @WoosukKwon @infwinston - can you take a look as well?

Is there a potential issue where, say, K80:4 is the only K80 config available in a region, and a user runs sky launch --gpus K80:3 --region ...? In this case, should we let it succeed (which would change our resource fulfillment semantics to >=; currently a request must be exactly satisfiable)?

@Michaelvll
Collaborator Author

Is there a potential issue where, say, K80:4 is the only K80 config available in a region, and a user runs sky launch --gpus K80:3 --region ...? In this case, should we let it succeed (which would change our resource fulfillment semantics to >=; currently a request must be exactly satisfiable)?

The launchability is checked in the optimizer. The following is our error message for that case:

> sky launch -c test-gpu --gpus K80:3 --region us-central1 --cloud gcp                    
I 10-16 17:14:56 optimizer.py:880] No resource satisfying {'K80': 3} on [GCP].
I 10-16 17:14:56 optimizer.py:883] Did you mean: ['K80:4', 'K80:8']
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task sky-cmd. To fix: relax its resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
      'sky check' to check the enabled clouds.

@WoosukKwon
Collaborator

Thanks @Michaelvll for quickly fixing the bug! I've checked that this PR resolves the issue.

However, I have a more fundamental question: do we really need to call accelerator_in_region_or_zone (or any other validation check) for sky exec? To my understanding, comparing the user's resource requirements with the cluster's resources (via less_demanding_than) should be enough for sky exec. I believe all the validation checks in Resources are only for sky launch, and we need to differentiate the two in order to avoid confusion.

@Michaelvll
Collaborator Author

Michaelvll commented Oct 17, 2022

However, I have a more fundamental question: do we really need to call accelerator_in_region_or_zone (or any other validation check) for sky exec? To my understanding, comparing the user's resource requirements with the cluster's resources (via less_demanding_than) should be enough for sky exec. I believe all the validation checks in Resources are only for sky launch, and we need to differentiate the two in order to avoid confusion.

Great point @WoosukKwon! The resources should only be checked for availability at launch time, not when the Resources object is created. I modified the PR thoroughly to move the resource validation into the optimization step and added several tests to make sure the modification is correct. PTAL.
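A minimal sketch of that separation (hypothetical helper names and example offerings, not SkyPilot's API): availability is checked on the launch path, while exec only compares the request against the launched resources.

REGION_OFFERINGS = {'us-central1': {'K80': {4, 8}}}   # assumed example catalog

def validate_for_launch(region: str, acc: str, count: int) -> None:
    # Launch path: the requested count must be an actual offering in the region.
    if count not in REGION_OFFERINGS.get(region, {}).get(acc, set()):
        raise ValueError(f'No offering of {acc}:{count} in {region}.')

def validate_for_exec(requested_count: int, launched_count: int) -> None:
    # Exec path: no catalog lookup; K80:3 on an existing K80:4 cluster is fine
    # even though K80:3 is not an offering in the region.
    if requested_count > launched_count:
        raise ValueError('Request exceeds the existing cluster resources.')

validate_for_launch('us-central1', 'K80', 4)   # ok: a real offering
validate_for_exec(3, 4)                        # ok: fits the existing cluster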

Tested:

  • tests/run_smoke_tests.sh

@Michaelvll
Collaborator Author

This has been fixed in the master branch. Closing this PR for now.

@Michaelvll Michaelvll closed this May 9, 2023
@Michaelvll
Collaborator Author

Sorry, this issue can still be reproduced with the following commands:

sky gpunode --gpus K80:4 --region us-central1 --cloud gcp -c test-gpu
sky exec --gpus K80:3 --cloud gcp --region us-central1 test-gpu 'echo hi'

The error message will be:

ValueError: Accelerator "K80" is not available in "us-central1".

The error is quite confusing.

@Michaelvll Michaelvll reopened this May 9, 2023
@Michaelvll Michaelvll changed the title [Core] Fix partial GPUs when region specified [Core] Fix resource validation for existing cluster May 9, 2023
@Michaelvll
Collaborator Author

A kind bump for this PR @WoosukKwon

Collaborator

@WoosukKwon WoosukKwon left a comment

Thanks @Michaelvll for the effort. Left some comments. Please take a look.

sky/clouds/service_catalog/__init__.py (Outdated)
sky/clouds/service_catalog/gcp_catalog.py (Outdated)
sky/resources.py (Outdated)
# Validate whether accelerator is available in specified region/zone
acc, acc_count = list(acc_requested.items())[0]
if self.region is not None or self.zone is not None:
    if not self._cloud.accelerator_in_region_or_zone(
Collaborator

A dumb question: in which case will this branch be taken? I found this error when trying to launch a VM that is not supported in the specified region.

$ sky launch --gpus a100:8 --region us-west-1 --cloud aws
I 05-11 13:19:55 optimizer.py:998] No resource satisfying AWS({'A100': 8}) on AWS.
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task Task(run=<empty>)
  resources: AWS({'A100': 8}). Region: us-west-1.
This means the catalog does not contain any instance types that satisfy this request.
To fix: relax or change the resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
      'sky check' to check the enabled clouds.

_capture_match_gpus_spec(f.name, 'V100')
_capture_match_gpus_spec(f.name, 'V100:0.5')
Collaborator

I feel it's a bit weird that partial GPUs are only supported when the instance type is specified, while they are not supported otherwise. Why don't we disallow partial GPUs for sky launch?

$ sky launch --gpus v100:0.5
I 05-11 13:23:15 optimizer.py:998] No resource satisfying <Cloud>({'V100': 0.5}) on [AWS, Azure, GCP, Lambda].
I 05-11 13:23:15 optimizer.py:1001] Did you mean: ['V100-32GB:8', 'V100:1', 'V100:2', 'V100:4', 'V100:8']
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task Task(run=<empty>)
  resources: <Cloud>({'V100': 0.5}).
This means the catalog does not contain any instance types that satisfy this request.
To fix: relax or change the resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
      'sky check' to check the enabled clouds.

$ sky launch --gpus v100:0.5 --instance-type p3.2xlarge
I 05-11 13:23:22 optimizer.py:636] == Optimizer ==
I 05-11 13:23:22 optimizer.py:647] Target: minimizing cost
I 05-11 13:23:22 optimizer.py:659] Estimated cost: $3.1 / hour
I 05-11 13:23:22 optimizer.py:659] 
I 05-11 13:23:22 optimizer.py:732] Considered resources (1 node):
I 05-11 13:23:22 optimizer.py:781] -----------------------------------------------------------------------------------------
I 05-11 13:23:22 optimizer.py:781]  CLOUD   INSTANCE     vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 05-11 13:23:22 optimizer.py:781] -----------------------------------------------------------------------------------------
I 05-11 13:23:22 optimizer.py:781]  AWS     p3.2xlarge   8       61        V100:0.5       us-east-1     3.06          ✔     
I 05-11 13:23:22 optimizer.py:781] -----------------------------------------------------------------------------------------
I 05-11 13:23:22 optimizer.py:781] 
Launching a new cluster 'sky-e1f9-woosuk'. Proceed? [Y/n]:

Collaborator Author

Actually, that is a great point! I reverted this change.

Collaborator Author

@Michaelvll Michaelvll left a comment

Thank you for the review @WoosukKwon! Based on your comments, I realized that we have quite a few redundancies in the code for checking the instance_type-accelerator combination, so I refactored it. PTAL.


@Michaelvll Michaelvll changed the title [Core] Fix resource validation for existing cluster [Core] Fix resource validation for existing cluster and refactor Jul 5, 2023
@Michaelvll
Collaborator Author

Michaelvll commented Jul 5, 2023

I further refactored the code in the optimizer and clouds to reduce the redundant code. Please feel free to take another look. @WoosukKwon
Tested:

  • pytest tests/test_optimizer_dryrun.py with the A100 test included.
  • pytest tests/test_cli.py with the job execution on existing cluster tests included.
  • sky launch --gpus A100:8 --region us-west-1 --cloud aws should fail before the confirmation
  • pytest tests/test_smoke.py

To reproduce the issue of failing to execute a job with GPUs on an existing cluster:

  1. go to the master branch
  2. git cherry-pick ced2878
  3. pytest tests/test_jobs.py

@Michaelvll Michaelvll force-pushed the fix-partial-gpu branch 5 times, most recently from 4f5dbc3 to a4cb94d on July 5, 2023 at 23:57
Commits:

  • format
  • validate the resources only when optimizing
  • Add test for TPU
  • fix test
  • minor refactoring
  • fix tests
  • fix merging error
  • revert is_same_resources
  • fix test
  • format
  • fix test
  • Refactor
  • format
  • Add comment
  • refactor
  • Fix test
  • fix scp disk check
  • simplify the code
  • fix
  • fix a100 cpu memory check
  • format
  • format
  • add test
  • fix azure disk_tier check
  • format
  • fix rebase error
  • filter_region_zone
  • fix instance type
  • fix the test without aws credential
  • refactor test
  • add job tests
  • remove remnant
  • add invalid accelerator regions
  • format
  • fix test
  • fix test
  • fix test
Collaborator

@WoosukKwon WoosukKwon left a comment

LGTM. Thanks for the PR.

sky/clouds/cloud.py (Outdated)
sky/clouds/cloud.py
sky/clouds/cloud.py
sky/clouds/service_catalog/common.py (Outdated)
tests/test_cli.py (Outdated)
tests/test_jobs.py (Outdated)
@Michaelvll
Collaborator Author

Michaelvll commented Jul 14, 2023

Thanks a lot for the review @WoosukKwon!! I am running the smoke tests before merging.


Successfully merging this pull request may close these issues.

[Core] Fail to launch task on existing cluster with odd GPU numbers when region/zone specified