-
Notifications
You must be signed in to change notification settings - Fork 503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Fix resource validation for existing cluster and refactor #1259
Conversation
Tagging @WoosukKwon @infwinston - can you take a look as well? Is there a potential issue where, say K80:4 is the only K80 config available in a region, and users do |
The launch-ability will be checked in the optimizer. The following is our error message for that:
|
Thanks @Michaelvll for quickly fixing the bug! I've checked that this PR resolves the issue. However, I have a bit fundamental question: do we really need to call |
Great point @WoosukKwon! The resources should only be checked for availability during launching, instead of creation. I modified the PR thoroughly to move the resource validation to optimization and added several tests to make sure the modification is correct. PTAL. Tested:
|
This has been fixed in the master branch. Closing this PR for now. |
Sorry, this issue still exists by the following commands:
The error message will be:
The error is quite confusing. |
6d7adf4
to
e425ae7
Compare
A kind bump for this PR @WoosukKwon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll for the effort. Left some comments. Please take a look.
sky/resources.py
Outdated
# Validate whether accelerator is available in specified region/zone | ||
acc, acc_count = list(acc_requested.items())[0] | ||
if self.region is not None or self.zone is not None: | ||
if not self._cloud.accelerator_in_region_or_zone( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A dumb question: in which case will this branch be taken? I found this error when trying to launch a VM that is not supported in the specified region.
$ sky launch --gpus a100:8 --region us-west-1 --cloud aws
I 05-11 13:19:55 optimizer.py:998] No resource satisfying AWS({'A100': 8}) on AWS.
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task Task(run=<empty>)
resources: AWS({'A100': 8}). Region: us-west-1.
This means the catalog does not contain any instance types that satisfy this request.
To fix: relax or change the resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
'sky check' to check the enabled clouds.
tests/test_cli.py
Outdated
_capture_match_gpus_spec(f.name, 'V100') | ||
_capture_match_gpus_spec(f.name, 'V100:0.5') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel it's a bit weird that partial GPUs are only supported when the instance type is specified, while they are not supported otherwise. Why don't we disallow partial GPUs for sky launch
?
$ sky launch --gpus v100:0.5
I 05-11 13:23:15 optimizer.py:998] No resource satisfying <Cloud>({'V100': 0.5}) on [AWS, Azure, GCP, Lambda].
I 05-11 13:23:15 optimizer.py:1001] Did you mean: ['V100-32GB:8', 'V100:1', 'V100:2', 'V100:4', 'V100:8']
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task Task(run=<empty>)
resources: <Cloud>({'V100': 0.5}).
This means the catalog does not contain any instance types that satisfy this request.
To fix: relax or change the resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
'sky check' to check the enabled clouds.
$ sky launch --gpus v100:0.5 --instance-type p3.2xlarge
I 05-11 13:23:22 optimizer.py:636] == Optimizer ==
I 05-11 13:23:22 optimizer.py:647] Target: minimizing cost
I 05-11 13:23:22 optimizer.py:659] Estimated cost: $3.1 / hour
I 05-11 13:23:22 optimizer.py:659]
I 05-11 13:23:22 optimizer.py:732] Considered resources (1 node):
I 05-11 13:23:22 optimizer.py:781] -----------------------------------------------------------------------------------------
I 05-11 13:23:22 optimizer.py:781] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 05-11 13:23:22 optimizer.py:781] -----------------------------------------------------------------------------------------
I 05-11 13:23:22 optimizer.py:781] AWS p3.2xlarge 8 61 V100:0.5 us-east-1 3.06 ✔
I 05-11 13:23:22 optimizer.py:781] -----------------------------------------------------------------------------------------
I 05-11 13:23:22 optimizer.py:781]
Launching a new cluster 'sky-e1f9-woosuk'. Proceed? [Y/n]:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, that is a great point! I reverted this change
7559d70
to
74727e9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review @WoosukKwon! Based on your comments, I just realized that we have quite many redundancies in the code for checking the instance_type-accelerator combination. I just refactored it. PTAL.
tests/test_cli.py
Outdated
_capture_match_gpus_spec(f.name, 'V100') | ||
_capture_match_gpus_spec(f.name, 'V100:0.5') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, that is a great point! I reverted this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review @WoosukKwon! Based on your comments, I just realized that we have quite many redundancies in the code for checking the instance_type-accelerator combination. I just refactored it. PTAL.
I further refactored the code in optimizer and clouds to reduce the redundant codes. Please feel free to take another look. @WoosukKwon
To reproduce the issue for failing to execute job with GPUs on an existing cluster:
|
4f5dbc3
to
a4cb94d
Compare
format validate the resources only when optimizing Add test for TPU fix test minor refactoring fix tests fix merging error revert is_same_resources fix test format fix test Refactor format Add comment refactor Fix test fix scp disk check simplify the code fix fix a100 cpu memory check format format add test fix azure disk_tier check format fix rebase error filter_region_zone
a4cb94d
to
d18dae5
Compare
fix instance type fix the test without aws credential refactor test add job tests remove remnant add invalid accelerator regions format fix test fix test fix test
4642031
to
ced2878
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the PR.
Thanks a lot for the review @WoosukKwon!! I am testing the smoke tests before merging:
|
Fixes #1258
This PR makes validation of the resources based on the offering for each region/zone, only when the resources is
launchable
and is being launched by the optimizer. This is because if the cluster already exists, the user can specify some combination of the accelerator / accelerator count that does not exist in the specified region/zone.For example: aws only has A100:8 in a region, but the user can
sky exec cluster_name --region region_name --cloud aws --gpus A100:1
to schedule the job on the existing cluster. Our current master PR will block this behavior, as the A100:1 is not offered by the region.The validation of the Resources for existing cluster will be done by
skypilot/sky/backends/cloud_vm_ray_backend.py
Lines 2173 to 2199 in 8d3fce2
Note: This PR should be compatible with #1429 as the
validate_launchable
function should only be called by the resources that is meant to be used to create new VMs.This PR also refactors the optimizer logic, so as to make the resource validation clearer.
We previously had two code paths for validating the requested resources causing our implementation for checking needs to be duplicated for requested resources that is launchable and not launchable .
skypilot/sky/optimizer.py
Lines 957 to 1024 in 1501708
For example, in #2096, although we disable the spot instance in cloud.get_feasible_launchable_resoures and
sky launch --use-spot
will skip the k8s cloud, we can still get around that check by making the requested resources launchable:sky launch --cloud kubernetes --instance-type 8CPU-16GB --use-spot
.That said our current code is quite error-prone, making it easy to forget to update the two places for the check.
The refactoring should fix this, by always using the
get_feasible_launchable_resources
Tested:
pytest tests/test_smoke.py