Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s] Fix GPU count detection from autoscaler YAML #2636

Merged
merged 2 commits into from
Nov 22, 2023

Conversation

romilbhardwaj
Copy link
Collaborator

@romilbhardwaj romilbhardwaj commented Oct 2, 2023

Closes #2608.

Our resource detection from pod spec puts a minimum of 1 on CPU resource so that fractional CPU sized clusters can also be launched on k8s clusters. However, this was being incorrectly applied to GPU resources too, causing all clusters to be initialized with 1 GPU resource.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR - sky launch --cloud kubernetes -- ray status now does not show GPUs.
  • k8s smoke tests: pytest tests/test_smoke.py--kubernetes -k "not TestStorageWithCredentials"

Copy link
Collaborator

@landscapepainter landscapepainter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried out and it works great! Left a minor comment. LGTM!

rounded_count = math.ceil(res_count)
if resource_name == 'cpu':
# For CPU, we set minimum count to 1 because if CPU count is set to 0,
# (e.g., when request=0.5), ray will not be able to schedule any tasks.
Copy link
Collaborator

@landscapepainter landscapepainter Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the comment, request=0.5, what is the request referring to? If it's referring to the request variable used in this function, rounded_count(CPU count) will be set to 1, not 0 when request=0.5. If it's not the variable, I think we should elaborate a bit more in the comment to show that it's not referring to the variable :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - clarified that it's referring to when skypilot task requests something like --cpus 0.5.

@romilbhardwaj romilbhardwaj merged commit 76da783 into master Nov 22, 2023
19 checks passed
@romilbhardwaj romilbhardwaj deleted the k8s_fix_gpucount branch November 22, 2023 03:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[k8s] Ray cluster is incorrectly initialized with gpu resource
2 participants