
[Core] Fix resource validation for existing cluster and refactor #1259

Merged
merged 5 commits on Jul 15, 2023

Conversation

Michaelvll
Collaborator

@Michaelvll Michaelvll commented Oct 16, 2022

Fixes #1258

This PR validates the requested resources against the offerings for each region/zone only when the resources are launchable and are being launched through the optimizer. This is because, if the cluster already exists, the user may specify a combination of accelerator and accelerator count that is not offered in the specified region/zone.

For example: AWS may only offer A100:8 in a region, but the user can run sky exec cluster_name --region region_name --cloud aws --gpus A100:1 to schedule the job on the existing cluster. The current master branch blocks this behavior, because A100:1 is not offered in that region.
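For intuition, the accelerator-count part of that comparison boils down to something like the following minimal sketch (assumed semantics for illustration only; SkyPilot's less_demanding_than checks more fields than this):

# Hypothetical sketch, not SkyPilot code: an A100:1 request fits a cluster
# that was launched with A100:8.
def acc_fits(requested, launched):
    """requested/launched are accelerator dicts, e.g. {'A100': 1} and {'A100': 8}."""
    return all(launched.get(acc, 0) >= count for acc, count in requested.items())

assert acc_fits({'A100': 1}, {'A100': 8})       # sky exec on the existing cluster
assert not acc_fits({'A100': 8}, {'A100': 1})   # would not fit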

The validation of the Resources for an existing cluster is done by:

if not (task.num_nodes <= handle.launched_nodes and
        task_resources.less_demanding_than(
            launched_resources, requested_num_nodes=task.num_nodes)):
    if (task_resources.region is not None and
            task_resources.region != launched_resources.region):
        with ux_utils.print_exception_no_traceback():
            raise exceptions.ResourcesMismatchError(
                'Task requested resources in region '
                f'{task_resources.region!r}, but the existing cluster '
                f'is in region {launched_resources.region!r}.')
    if (task_resources.zone is not None and
            task_resources.zone != launched_resources.zone):
        zone_str = (f'is in zone {launched_resources.zone!r}.'
                    if launched_resources.zone is not None else
                    'does not have zone specified.')
        with ux_utils.print_exception_no_traceback():
            raise exceptions.ResourcesMismatchError(
                'Task requested resources in zone '
                f'{task_resources.zone!r}, but the existing cluster '
                f'{zone_str}')
    with ux_utils.print_exception_no_traceback():
        raise exceptions.ResourcesMismatchError(
            'Requested resources do not match the existing cluster.\n'
            f' Requested:\t{task.num_nodes}x {task_resources} \n'
            f' Existing:\t{handle.launched_nodes}x '
            f'{handle.launched_resources}\n'
            f'{mismatch_str}')

Note: This PR should be compatible with #1429, as the validate_launchable function should only be called on resources that are meant to be used to create new VMs.


This PR also refactors the optimizer logic to make the resource validation clearer.

We previously had two code paths for validating the requested resources, so the checking logic had to be duplicated for launchable and non-launchable requested resources.

skypilot/sky/optimizer.py, lines 957 to 1024 at 1501708:

elif resources.is_launchable():
    if isinstance(resources.cloud, clouds.GCP):
        # Check if the host VM satisfies the max vCPU and memory limits.
        clouds.GCP.check_accelerator_attachable_to_host(
            resources.instance_type, resources.accelerators,
            resources.zone)
        # If the user has specified a GCP zone and the zone does not support
        # the host-accelerator combination, then an error will be raised by
        # the above check_accelerator_attachable_to_host() call.
        # If the user has not specified any zone, a launchable will be made
        # for every zone even if some of the zones do not support the
        # host-accelerator combination. Then the provisioner may try to
        # launch the instance, and fail over to other zones. We find this
        # behavior acceptable because this will happen only when the user
        # requested GCP 4:P100 or 8:K80 with a very large host VM.
    elif isinstance(resources.cloud, clouds.SCP):
        # Check if the host VM satisfies the min/max disk size limits.
        is_allowed, launchable[resources] = \
            clouds.SCP.is_disk_size_allowed(resources)
        if not is_allowed:
            continue
    launchable[resources] = _make_launchables_for_valid_region_zones(
        resources)
else:
    clouds_list = ([resources.cloud]
                   if resources.cloud is not None else enabled_clouds)
    # Hack: When >=2 cloud candidates, always remove local cloud from
    # possible candidates. This is so the optimizer will consider
    # public clouds, except local. Local will be included as part of
    # optimizer in a future PR.
    # TODO(mluo): Add on-prem to cloud spillover.
    if len(clouds_list) >= 2:
        clouds_list = [
            c for c in clouds_list if not isinstance(c, clouds.Local)
        ]
    all_fuzzy_candidates = set()
    for cloud in clouds_list:
        (feasible_resources, fuzzy_candidate_list) = (
            cloud.get_feasible_launchable_resources(resources))
        if len(feasible_resources) > 0:
            # Assume feasible_resources is sorted by prices.
            cheapest = feasible_resources[0]
            # Generate region/zone-specified resources.
            launchable[resources].extend(
                _make_launchables_for_valid_region_zones(cheapest))
            cloud_candidates[cloud] = feasible_resources
        else:
            all_fuzzy_candidates.update(fuzzy_candidate_list)
    if len(launchable[resources]) == 0:
        clouds_str = str(clouds_list) if len(clouds_list) > 1 else str(
            clouds_list[0])
        logger.info(f'No resource satisfying {resources} '
                    f'on {clouds_str}.')
        if len(all_fuzzy_candidates) > 0:
            logger.info('Did you mean: '
                        f'{colorama.Fore.CYAN}'
                        f'{sorted(all_fuzzy_candidates)}'
                        f'{colorama.Style.RESET_ALL}')
        else:
            if resources.cpus is not None:
                logger.info('Try specifying a different CPU count, '
                            'or add "+" to the end of the CPU count '
                            'to allow for larger instances.')
            if resources.memory is not None:
                logger.info('Try specifying a different memory size, '
                            'or add "+" to the end of the memory size '
                            'to allow for larger instances.')

For example, in #2096, although we disable spot instances in cloud.get_feasible_launchable_resources and sky launch --use-spot therefore skips the Kubernetes cloud, we can still get around that check by making the requested resources launchable: sky launch --cloud kubernetes --instance-type 8CPU-16GB --use-spot.

In other words, the current code is quite error-prone: it is easy to forget to update both places when a check changes. The refactoring fixes this by always going through get_feasible_launchable_resources, as sketched below.
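As a rough illustration of the single code path (a sketch under assumptions: StubCloud, make_launchables, and the dict-based resources below are hypothetical stand-ins, not SkyPilot's internals; only the method name get_feasible_launchable_resources comes from the real code):

from typing import List, Set, Tuple

class StubCloud:
    """Hypothetical stand-in for a cloud that never offers spot instances."""

    def get_feasible_launchable_resources(
            self, resources: dict) -> Tuple[List[dict], List[str]]:
        if resources.get('use_spot'):
            return [], ['on-demand only']   # infeasible: return fuzzy suggestions
        return [dict(resources, region='us-east-1')], []

def make_launchables(resources: dict, clouds: List[StubCloud]):
    launchable: List[dict] = []
    fuzzy: Set[str] = set()
    for cloud in clouds:
        # Every request, even one that pins an instance type, goes through the
        # cloud's feasibility check, so a rule like "no spot on this cloud"
        # cannot be bypassed.
        feasible, candidates = cloud.get_feasible_launchable_resources(resources)
        if feasible:
            launchable.extend(feasible)
        else:
            fuzzy.update(candidates)
    return launchable, sorted(fuzzy)

print(make_launchables({'instance_type': '8CPU-16GB', 'use_spot': True}, [StubCloud()]))
# -> ([], ['on-demand only']): pinning the instance type no longer skips the check.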


Tested:

  • pytest tests/test_smoke.py

@concretevitamin
Member

Tagging @WoosukKwon @infwinston - can you take a look as well?

Is there a potential issue where, say, K80:4 is the only K80 config available in a region, and a user runs sky launch --gpus K80:3 --region ...? In this case, should we let it succeed (which would change our resource fulfillment semantics to >=; currently a request must be exactly satisfiable)?

@Michaelvll
Collaborator Author

Is there a potential issue where, say, K80:4 is the only K80 config available in a region, and a user runs sky launch --gpus K80:3 --region ...? In this case, should we let it succeed (which would change our resource fulfillment semantics to >=; currently a request must be exactly satisfiable)?

The launchability is checked in the optimizer. The following is our error message for that case:

> sky launch -c test-gpu --gpus K80:3 --region us-central1 --cloud gcp                    
I 10-16 17:14:56 optimizer.py:880] No resource satisfying {'K80': 3} on [GCP].
I 10-16 17:14:56 optimizer.py:883] Did you mean: ['K80:4', 'K80:8']
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task sky-cmd. To fix: relax its resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
      'sky check' to check the enabled clouds.

@WoosukKwon
Collaborator

Thanks @Michaelvll for quickly fixing the bug! I've checked that this PR resolves the issue.

However, I have a more fundamental question: do we really need to call accelerator_in_region_or_zone (or any other validation check) for sky exec? To my understanding, comparing the user's resource requirements with the cluster's resources (via less_demanding_than) should be enough for sky exec. I believe all the validation checks in Resources are only for sky launch, and we need to differentiate the two in order to avoid confusion.

@Michaelvll
Collaborator Author

Michaelvll commented Oct 17, 2022

However, I have a more fundamental question: do we really need to call accelerator_in_region_or_zone (or any other validation check) for sky exec? To my understanding, comparing the user's resource requirements with the cluster's resources (via less_demanding_than) should be enough for sky exec. I believe all the validation checks in Resources are only for sky launch, and we need to differentiate the two in order to avoid confusion.

Great point @WoosukKwon! The resources should only be checked for availability at launch time, not when the Resources object is created. I modified the PR thoroughly to move the resource validation into the optimization step and added several tests to make sure the modification is correct. PTAL.
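A minimal sketch of that separation (hypothetical helper names and example offerings, not SkyPilot's API): availability is checked on the launch path, while exec only compares the request against the launched resources.

REGION_OFFERINGS = {'us-central1': {'K80': {4, 8}}}   # assumed example catalog

def validate_for_launch(region: str, acc: str, count: int) -> None:
    # Launch path: the requested count must be an actual offering in the region.
    if count not in REGION_OFFERINGS.get(region, {}).get(acc, set()):
        raise ValueError(f'No offering of {acc}:{count} in {region}.')

def validate_for_exec(requested_count: int, launched_count: int) -> None:
    # Exec path: no catalog lookup; K80:3 on an existing K80:4 cluster is fine
    # even though K80:3 is not an offering in the region.
    if requested_count > launched_count:
        raise ValueError('Request exceeds the existing cluster resources.')

validate_for_launch('us-central1', 'K80', 4)   # ok: a real offering
validate_for_exec(3, 4)                        # ok: fits the existing cluster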

Tested:

  • tests/run_smoke_tests.sh

@Michaelvll
Collaborator Author

This has been fixed in the master branch. Closing this PR for now.

@Michaelvll Michaelvll closed this May 9, 2023
@Michaelvll
Collaborator Author

Sorry, this issue can still be reproduced with the following commands:

sky gpunode --gpus K80:4 --region us-central1 --cloud gcp -c test-gpu
sky exec --gpus K80:3 --cloud gcp --region us-central1 test-gpu 'echo hi'

The error message will be:

ValueError: Accelerator "K80" is not available in "us-central1".

The error is quite confusing.

@Michaelvll Michaelvll reopened this May 9, 2023
@Michaelvll Michaelvll changed the title [Core] Fix partial GPUs when region specified [Core] Fix resource validation for existing cluster May 9, 2023
@Michaelvll
Collaborator Author

A kind bump for this PR @WoosukKwon

Collaborator

@WoosukKwon WoosukKwon left a comment

Thanks @Michaelvll for the effort. Left some comments. Please take a look.

sky/clouds/service_catalog/__init__.py (Outdated)
sky/clouds/service_catalog/gcp_catalog.py (Outdated)
sky/resources.py (Outdated)
# Validate whether accelerator is available in specified region/zone
acc, acc_count = list(acc_requested.items())[0]
if self.region is not None or self.zone is not None:
    if not self._cloud.accelerator_in_region_or_zone(
Collaborator

A dumb question: in which case will this branch be taken? I found this error when trying to launch a VM that is not supported in the specified region.

$ sky launch --gpus a100:8 --region us-west-1 --cloud aws
I 05-11 13:19:55 optimizer.py:998] No resource satisfying AWS({'A100': 8}) on AWS.
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task Task(run=<empty>)
  resources: AWS({'A100': 8}). Region: us-west-1.
This means the catalog does not contain any instance types that satisfy this request.
To fix: relax or change the resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
      'sky check' to check the enabled clouds.

_capture_match_gpus_spec(f.name, 'V100')
_capture_match_gpus_spec(f.name, 'V100:0.5')
Collaborator

I feel it's a bit weird that partial GPUs are only supported when the instance type is specified, while they are not supported otherwise. Why don't we disallow partial GPUs for sky launch?

$ sky launch --gpus v100:0.5
I 05-11 13:23:15 optimizer.py:998] No resource satisfying <Cloud>({'V100': 0.5}) on [AWS, Azure, GCP, Lambda].
I 05-11 13:23:15 optimizer.py:1001] Did you mean: ['V100-32GB:8', 'V100:1', 'V100:2', 'V100:4', 'V100:8']
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task Task(run=<empty>)
  resources: <Cloud>({'V100': 0.5}).
This means the catalog does not contain any instance types that satisfy this request.
To fix: relax or change the resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
      'sky check' to check the enabled clouds.

$ sky launch --gpus v100:0.5 --instance-type p3.2xlarge
I 05-11 13:23:22 optimizer.py:636] == Optimizer ==
I 05-11 13:23:22 optimizer.py:647] Target: minimizing cost
I 05-11 13:23:22 optimizer.py:659] Estimated cost: $3.1 / hour
I 05-11 13:23:22 optimizer.py:659] 
I 05-11 13:23:22 optimizer.py:732] Considered resources (1 node):
I 05-11 13:23:22 optimizer.py:781] -----------------------------------------------------------------------------------------
I 05-11 13:23:22 optimizer.py:781]  CLOUD   INSTANCE     vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 05-11 13:23:22 optimizer.py:781] -----------------------------------------------------------------------------------------
I 05-11 13:23:22 optimizer.py:781]  AWS     p3.2xlarge   8       61        V100:0.5       us-east-1     3.06          ✔     
I 05-11 13:23:22 optimizer.py:781] -----------------------------------------------------------------------------------------
I 05-11 13:23:22 optimizer.py:781] 
Launching a new cluster 'sky-e1f9-woosuk'. Proceed? [Y/n]:

Collaborator Author

Actually, that is a great point! I reverted this change.

Collaborator Author

@Michaelvll Michaelvll left a comment

Thank you for the review @WoosukKwon! Based on your comments, I realized that we have quite a few redundancies in the code for checking the instance_type-accelerator combination, so I refactored it. PTAL.


@Michaelvll Michaelvll changed the title [Core] Fix resource validation for existing cluster [Core] Fix resource validation for existing cluster and refactor Jul 5, 2023
@Michaelvll
Collaborator Author

Michaelvll commented Jul 5, 2023

I further refactored the code in the optimizer and clouds to reduce the redundant code. Please feel free to take another look. @WoosukKwon
Tested:

  • pytest tests/test_optimizer_dryrun.py with the A100 test included.
  • pytest tests/test_cli.py with the job execution on existing cluster tests included.
  • sky launch --gpus A100:8 --region us-west-1 --cloud aws should fail before the confirmation
  • pytest tests/test_smoke.py

To reproduce the issue of failing to execute a job with GPUs on an existing cluster:

  1. go to the master branch
  2. git cherry-pick ced2878
  3. pytest tests/test_jobs.py

@Michaelvll Michaelvll force-pushed the fix-partial-gpu branch 5 times, most recently from 4f5dbc3 to a4cb94d on July 5, 2023 at 23:57
Commits:

  • format
  • validate the resources only when optimizing
  • Add test for TPU
  • fix test
  • minor refactoring
  • fix tests
  • fix merging error
  • revert is_same_resources
  • fix test
  • format
  • fix test
  • Refactor
  • format
  • Add comment
  • refactor
  • Fix test
  • fix scp disk check
  • simplify the code
  • fix
  • fix a100 cpu memory check
  • format
  • format
  • add test
  • fix azure disk_tier check
  • format
  • fix rebase error
  • filter_region_zone
  • fix instance type
  • fix the test without aws credential
  • refactor test
  • add job tests
  • remove remnant
  • add invalid accelerator regions
  • format
  • fix test
  • fix test
  • fix test
Collaborator

@WoosukKwon WoosukKwon left a comment

LGTM. Thanks for the PR.

sky/clouds/cloud.py (Outdated)
sky/clouds/cloud.py
sky/clouds/cloud.py
sky/clouds/service_catalog/common.py (Outdated)
tests/test_cli.py (Outdated)
tests/test_jobs.py (Outdated)
@Michaelvll
Collaborator Author

Michaelvll commented Jul 14, 2023

Thanks a lot for the review @WoosukKwon!! I am running the smoke tests before merging.


Successfully merging this pull request may close these issues.

[Core] Fail to launch task on existing cluster with odd GPU numbers when region/zone specified