one byte fixing a5000 and a6000 on-prem #2023

Merged Jun 4, 2023 (1 commit)

@tobi (Contributor) commented Jun 3, 2023

On-prem machines with an A6000 or A5000 didn't work correctly because of a missing comma in the array. The resulting grep was lspci | grep 'A5000A6000', which... isn't a card anyone has.
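
For context on why this fails silently: Python concatenates adjacent string literals, so a missing comma in a list merges two entries into one instead of raising an error. A minimal sketch of the effect follows; the list contents, the `detect` helper, and the sample lspci line are illustrative assumptions, not the actual script.

```python
from typing import List

# Hypothetical name lists, for illustration only.
broken = ['V100', 'A100', 'A5000' 'A6000', 'T4']   # missing comma after 'A5000'
fixed = ['V100', 'A100', 'A5000', 'A6000', 'T4']


def detect(lspci_output: str, names: List[str]) -> List[str]:
    """Return every name that occurs as a substring of the lspci output."""
    return [name for name in names if name in lspci_output]


# Illustrative lspci line for an RTX A6000 (assumed format).
sample = ("65:00.0 VGA compatible controller: "
          "NVIDIA Corporation GA102GL [RTX A6000] (rev a1)")

print(broken)                  # ['V100', 'A100', 'A5000A6000', 'T4']
print(detect(sample, broken))  # []
print(detect(sample, fixed))   # ['A6000']
```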

I think it may make sense to have a look at this script; it feels quite fragile. Probably the best approach would be to do this via nvidia-smi -L and just map the result against the ./.sky/catalog to find supported GPUs.
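
A rough sketch of what that could look like, assuming the usual `GPU <index>: <name> (UUID: ...)` line format of nvidia-smi -L. The `SUPPORTED` list stands in for whatever the catalog would provide, and both function names are hypothetical; nothing here is taken from SkyPilot's code.

```python
import re
from typing import Dict, List

# Hypothetical stand-in for accelerator names loaded from the catalog.
SUPPORTED = ['A5000', 'A6000', 'A100', 'V100']

# Matches lines such as "GPU 0: NVIDIA RTX A6000 (UUID: GPU-...)".
_GPU_LINE = re.compile(r'^GPU \d+: (?P<name>.+?) \(UUID: [^)]+\)$')


def parse_gpu_names(smi_output: str) -> List[str]:
    """Extract the product name from each line of `nvidia-smi -L` output."""
    names = []
    for line in smi_output.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:
            names.append(match.group('name'))
    return names


def match_supported(gpu_names: List[str],
                    supported: List[str] = SUPPORTED) -> Dict[str, int]:
    """Count GPUs whose product name contains a supported name as a whole token.

    Whole-token matching is a simplification: product names that embed the
    model in a longer token would need looser matching.
    """
    counts: Dict[str, int] = {}
    for full_name in gpu_names:
        tokens = full_name.split()
        for accelerator in supported:
            if accelerator in tokens:
                counts[accelerator] = counts.get(accelerator, 0) + 1
                break
    return counts


# Sample output with placeholder UUIDs.
sample = (
    "GPU 0: NVIDIA RTX A6000 (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"
    "GPU 1: NVIDIA RTX A6000 (UUID: GPU-11111111-1111-1111-1111-111111111111)\n"
)
print(match_supported(parse_gpu_names(sample)))  # {'A6000': 2}
```

On a real machine the input would come from something like `subprocess.run(['nvidia-smi', '-L'], capture_output=True, text=True).stdout`.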

Also, the way this made things fail was rather frustrating.

Local clusters:
NAME   USER  HEAD_IP       RESOURCES    COMMAND
local  sky   [ip]  [(no GPUs)]  sky launch -c local --  -...

sky status showed (no GPUs) and sky launch failed predictably with AssertionError: (Local(None), 'cloud, region and instance_type must have been set by optimizer')

sky launch -c local totally misreported the error as

Failed to rsync up: /var/folders/wp/6cmt_nw16jj79kd9rj4m777w0000gn/T/sky_local_app_ph50zoz4 -> 
/tmp/sky_local/sky_local_app_ph50zoz4. Ensure that the network is stable, then retry.

I think it's a good idea to detect this failure and report that (remote) ~/.sky/get_resources.py couldn't detect any GPUs or some such thing.
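
Something along these lines is what I have in mind. This is only a sketch: `NoGpusDetectedError` and `check_detected_gpus` are made-up names, and where such a check would hook into the launch path is an assumption on my part.

```python
from typing import Dict


class NoGpusDetectedError(RuntimeError):
    """Hypothetical error raised when a node reports no usable GPUs."""


def check_detected_gpus(node_ip: str, gpus: Dict[str, int],
                        gpus_requested: bool) -> None:
    """Fail early with a specific message instead of a later, unrelated error.

    `gpus` maps accelerator name to count, as reported by the remote
    resource-detection script (assumed shape).
    """
    if gpus_requested and not gpus:
        raise NoGpusDetectedError(
            f'~/.sky/get_resources.py on {node_ip} could not detect any GPUs. '
            'Check that the GPU model is one the script recognizes.')


# A node that reports GPUs passes silently.
check_detected_gpus('10.0.0.1', {'A6000': 2}, gpus_requested=True)

# A node that reports none fails with a message naming the actual cause.
try:
    check_detected_gpus('10.0.0.1', {}, gpus_requested=True)
except NoGpusDetectedError as err:
    message = str(err)
    print(message)
```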

@romilbhardwaj (Collaborator) left a comment

Hi @tobi - welcome to the SkyPilot community! Thanks for submitting the PR, lgtm.

Also, thanks for the feedback. I agree, querying nvidia-smi and polling against the catalog will be more robust.

Just wanted to share that onprem support is experimental (and we'll update the docs soon to reflect this, #2024). We are redesigning this feature to be more robust and support a larger set of deployment environments, including multi-tenant clusters. We'd love to learn more about your use-case if you can share more details - I am at [email protected] or Slack. Thanks!

@tobi (Contributor, Author) commented Jun 4, 2023

Makes sense. Onprem feels like a key feature for company adoption. We all have existing clusters with low utilization but need to hold on to them because of contracts.

Getting more usage out of them and sharing them better is something that skypilot can help with. That makes the spot and on-demand support a secondary but powerful feature on top of really good tooling for doing training runs in larger teams.

@romilbhardwaj (Collaborator) commented

Thanks for sharing @tobi. Running jobs on onprem clusters and spilling them to cloud resources when required is something we want to enable with SkyPilot. We'll keep you posted with our progress.

@romilbhardwaj romilbhardwaj merged commit 8c497ed into skypilot-org:master Jun 4, 2023
15 checks passed
@tobi (Contributor, Author) commented Jun 4, 2023

Thank you.

Really like this project. Great job guys.

concretevitamin pushed a commit that referenced this pull request Jun 4, 2023
one byte fixing a5000 and a6000 on prem