Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Add labels field to resources #3464

Merged
merged 18 commits into from
May 2, 2024
Merged

[Core] Add labels field to resources #3464

merged 18 commits into from
May 2, 2024

Conversation

romilbhardwaj
Copy link
Collaborator

@romilbhardwaj romilbhardwaj commented Apr 22, 2024

Adds a labels field to resources. Follows proposal 2 from this doc which has more context.

resources:
  labels:
    mylabel: myvalue

Semantics:

  1. If the cluster does not already exist, cluster is created with labels/instance_tags if the cloud supports them. If not supported by cloud, labels are not created but the cluster is launched as usual.
  2. If the cluster already exists, no action is taken. Labels are not compared to check if task can be run. Labels are not updated.
  3. Task labels override global labels set in ~/.sky/config.yaml.

Passes manual tests on k8s, gcp and aws.

TODOs

  • Add --labels to cli
  • Add tests in test_smoke.py
  • Add docs

Tested:

  • New smoke tests - pytest tests/test_smoke.py::test_task_labels_aws --aws, pytest tests/test_smoke.py::test_task_labels_gcp --gcp, pytest tests/test_smoke.py::test_task_labels_kubernetes --kubernetes
  • All smoke tests (yet to run)

@romilbhardwaj romilbhardwaj marked this pull request as ready for review April 25, 2024 19:03
@romilbhardwaj
Copy link
Collaborator Author

romilbhardwaj commented Apr 25, 2024

Ready for review! Running smoke tests and backcompat tests now.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the support for labels @romilbhardwaj! Left two comments : )

Comment on lines +952 to +957
if self.cloud is None:
# Because each cloud has its own label format, we cannot validate
# the labels without knowing the cloud.
with ux_utils.print_exception_no_traceback():
raise ValueError(
'Cloud must be specified when labels are provided.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause failover unable to work. Should we just apply a global label format check instead? It might be fine to sacrifice some flexibility. Otherwise, a user have to specify the following to enable all the clouds they have which might be a bit unintuitive:

resources:
  labels:
    mykey1: myvalue1
  any_of:
    - cloud: aws
    - cloud: gcp

Copy link
Collaborator Author

@romilbhardwaj romilbhardwaj Apr 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supporting failover is slightly tricky....

For example, labels in Kubernetes commonly use / to create a "namespace" for tags (e.g., skypilot.co/accelerators, app.kubernetes.io/component.. see recommended labels). However, GCP does not support / in labels, and a failover would cause provisioning to fail. Putting a stricter global format check would prevent users from creating these labels at all.

Another option could be to not do any validation at all and let these checks fail at provisioning time. We could do that, but this seemed cleaner so went with this for now. Any other ideas? Happy to change if you think we should support failover with labels.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! I think it should be fine to require the cloud to be presented when the labels is specified. Another option to support failover could be the following:

Validate the labels in the following function, and return an empty list if the labels do not match the requirement for a specific cloud, i.e. make the current cloud infeasible:

def get_feasible_launchable_resources(
self,
resources: 'resources_lib.Resources',
num_nodes: int = 1
) -> Tuple[List['resources_lib.Resources'], List[str]]:
"""Returns ([feasible and launchable resources], [fuzzy candidates]).
Feasible resources refer to an offering respecting the resource
requirements. Currently, this function implements "filtering" the
cloud's offerings only w.r.t. accelerators constraints.
Launchable resources require a cloud and an instance type be assigned.
Fuzzy candidates example: when the requested GPU is A100:1 but is not
available in a cloud/region, the fuzzy candidates are results of a fuzzy
search in the catalog that are offered in the location. E.g.,
['A100-80GB:1', 'A100-80GB:2', 'A100-80GB:4', 'A100:8']
"""
if resources.is_launchable():
self._check_instance_type_accelerators_combination(resources)
resources_required_features = resources.get_required_cloud_features()
if num_nodes > 1:
resources_required_features.add(
CloudImplementationFeatures.MULTI_NODE)
try:
self.check_features_are_supported(resources,
resources_required_features)
except exceptions.NotSupportedError:
# TODO(zhwu): The resources are now silently filtered out. We
# should have some logging telling the user why the resources
# are not considered.
return ([], [])
return self._get_feasible_launchable_resources(resources)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, that could work. Seems a little hacky though, since the error now becomes hidden in the warning instead of being a clear error.

Checking in resources.py (current):

sky launch task.yaml
Task from YAML spec: task.yaml
ValueError: Invalid label my/label=my/value. Invalid label value my/value for GCP. Value can include lowercase alphanumeric characters, dashes, and underscores, with a total length of 63 characters or less.

Checking in get_feasible_launchable_resources with warning (last proposal):

$ sky launch task.yaml
Task from YAML spec: task.yaml
W 04-30 15:36:34 cloud.py:383] Label my/label=my/value is invalid for cloud GCP. Reason: Invalid label value my/value for GCP. Value can include lowercase alphanumeric characters, dashes, and underscores, with a total length of 63 characters or less.
I 04-30 15:36:34 optimizer.py:1209] No resource satisfying GCP({'T4': 1}) on GCP.
sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task(run=<empty>)
  resources: GCP({'T4': 1}).

To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

wdyt? I'm leaning towards keeping the current variant, but can change to checking in get_feasible_launchable_resources to support failover if you feel strongly about it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping the current way sounds good to me, but we may want to file an issue for it for supporting failover when the labels are specified.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #3500!

sky/utils/schemas.py Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updating @romilbhardwaj! Left several comments. It looks mostly good to me.

docs/source/reference/yaml-spec.rst Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
Comment on lines +952 to +957
if self.cloud is None:
# Because each cloud has its own label format, we cannot validate
# the labels without knowing the cloud.
with ux_utils.print_exception_no_traceback():
raise ValueError(
'Cloud must be specified when labels are provided.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! I think it should be fine to require the cloud to be presented when the labels is specified. Another option to support failover could be the following:

Validate the labels in the following function, and return an empty list if the labels do not match the requirement for a specific cloud, i.e. make the current cloud infeasible:

def get_feasible_launchable_resources(
self,
resources: 'resources_lib.Resources',
num_nodes: int = 1
) -> Tuple[List['resources_lib.Resources'], List[str]]:
"""Returns ([feasible and launchable resources], [fuzzy candidates]).
Feasible resources refer to an offering respecting the resource
requirements. Currently, this function implements "filtering" the
cloud's offerings only w.r.t. accelerators constraints.
Launchable resources require a cloud and an instance type be assigned.
Fuzzy candidates example: when the requested GPU is A100:1 but is not
available in a cloud/region, the fuzzy candidates are results of a fuzzy
search in the catalog that are offered in the location. E.g.,
['A100-80GB:1', 'A100-80GB:2', 'A100-80GB:4', 'A100:8']
"""
if resources.is_launchable():
self._check_instance_type_accelerators_combination(resources)
resources_required_features = resources.get_required_cloud_features()
if num_nodes > 1:
resources_required_features.add(
CloudImplementationFeatures.MULTI_NODE)
try:
self.check_features_are_supported(resources,
resources_required_features)
except exceptions.NotSupportedError:
# TODO(zhwu): The resources are now silently filtered out. We
# should have some logging telling the user why the resources
# are not considered.
return ([], [])
return self._get_feasible_launchable_resources(resources)

sky/utils/schemas.py Show resolved Hide resolved
…o resources_labels

# Conflicts:
#	sky/templates/aws-ray.yml.j2
#	sky/templates/gcp-ray.yml.j2
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @romilbhardwaj! The code looks mostly good to me. : )

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/resources.py Outdated Show resolved Hide resolved
sky/resources.py Outdated
# Returns the base labels "updated" with the override labels.
labels = base_resource_config.get('labels')
# Merge the labels from the base and override resources.
if labels is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that if there is no global labels specified, the local labels will not be used? Should we do the following suggestion instead?

sky/resources.py Outdated
Comment on lines 1263 to 1270
new_resource_config = base_resource_config.copy()
new_resource_config.update(override_config)

# Handle label merging. When any_of or ordered is specified,
# the labels from the base resource are updated with the labels
# from the override resource.
new_resource_config['labels'] = _merge_labels(
base_resource_config, override_config)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
new_resource_config = base_resource_config.copy()
new_resource_config.update(override_config)
# Handle label merging. When any_of or ordered is specified,
# the labels from the base resource are updated with the labels
# from the override resource.
new_resource_config['labels'] = _merge_labels(
base_resource_config, override_config)
new_resource_config = base_resource_config.copy()
override_labels = override_config.pop('labels', {})
new_resource_config.update(override_config)
labels = new_resource_config.get('labels', {})
labels.update(override_labels)
if labels:
new_resource_config['labels'] = labels

@romilbhardwaj
Copy link
Collaborator Author

Thanks @Michaelvll! Found a small backcompat bug, fixed now. Ran some manual backward compatibility tests:

  • Regular cluster launched on master, switch to branch and try exec, logs and launch again.
  • Spot controller launched on master with instance_tags specified in config, switch to branch and launch more spot jobs, check queue/logs

Running smoke tests now.

@romilbhardwaj
Copy link
Collaborator Author

Smoke tests pass, merging now!

@romilbhardwaj romilbhardwaj merged commit f256b53 into master May 2, 2024
20 checks passed
@romilbhardwaj romilbhardwaj deleted the resources_labels branch May 2, 2024 06:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants