
Added --partition argument and adjusted node & GPU name parsing to work on our cluster #6

Merged
merged 10 commits into albanie:master on Dec 20, 2020

Conversation

@talesa (Contributor) commented Nov 14, 2020

By default, the 30-character limit was too short to print some of our node names, so I increased it.

Some of our GPUs had unusual names, so I changed the parsing to use a regexp. I added examples of the odd outputs I got on some nodes to the regex101 link below, so that contributors can keep iterating on the regexp to cover other unusual strings they encounter.
https://regex101.com/r/RHYM8Z/3
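
For reference, here is a minimal sketch of regexp-based GRES parsing; the regexp actually used in the PR is the one being iterated on at the regex101 link above, and the example strings here are hypothetical:

import re

# Illustrative only: the real regexp is maintained at the regex101 link above
# and may differ. This sketch handles GRES strings such as "gpu:4" and
# "gpu:v100:2(S:0-1)" (hypothetical examples).
GRES_RE = re.compile(r"gpu:(?:(?P<type>[^:(]+):)?(?P<count>\d+)")

def parse_gres(gres):
    """Return (gpu_type, count) from a GRES string, or (None, 0) if no match."""
    match = GRES_RE.search(gres)
    if match is None:
        return None, 0
    return match.group("type"), int(match.group("count"))

# parse_gres("gpu:4")             -> (None, 4)
# parse_gres("gpu:v100:2(S:0-1)") -> ("v100", 2)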

@frankier (Contributor) commented

This PR makes it almost work for me on ARC, but I had another problem afterwards. Sometimes we get tokens like:

tokens ['gpu', 'arcus-htc-***, ***, ***]

In that case we can't get num_gpus. I'm not sure what the solution is here. Should num_gpus be assumed to be 1?

@talesa (Contributor, Author) commented Nov 30, 2020

I've never used ARC, so I don't know how the output is formatted there or how many GPUs per node there are.

@albanie (Owner) commented Dec 14, 2020

@talesa - thanks a lot for this!
@frankier, would you be able to paste some input/output samples (so I can check that the PR doesn't break things for you) before I merge?

@frankier (Contributor) commented

So the situation is that I'm +1 on merging this PR, because it takes the script from not working to almost working for me. What makes it work fully is also adding the following, which simply throws away rows without a GPU count specified. However, this probably won't produce an accurate result, since it discards data. If you merge the PR, I can go ahead and file this in a more detailed new issue if/when I can reproduce it.

*** slurm_gpustat.py.1	2020-12-14 06:40:58.643007000 +0000
--- slurm_gpustat.py	2020-11-30 14:03:42.053292000 +0000
***************
*** 527,536 ****
--- 527,538 ----
          # ignore pending jobs
          if len(tokens) < 4 or not tokens[0].startswith("gpu"):
              continue
          gpu_count_str, node_str, user, jobid = tokens
          gpu_count_tokens = gpu_count_str.split(":")
+         if len(gpu_count_tokens) == 1:
+             continue
          num_gpus = int(gpu_count_tokens[-1])
          if len(gpu_count_tokens) == 2:
              gpu_type = None
          elif len(gpu_count_tokens) == 3:
              gpu_type = gpu_count_tokens[1]
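
For context, here is a minimal sketch of how the patched parsing behaves on a few hypothetical GRES count strings (not taken from any real cluster output):

# Illustrative only: the gpu_count_str values below are hypothetical.
for gpu_count_str in ("gpu", "gpu:2", "gpu:v100:2"):
    gpu_count_tokens = gpu_count_str.split(":")
    if len(gpu_count_tokens) == 1:
        # e.g. a bare "gpu" entry with no count: skipped by the new guard
        continue
    num_gpus = int(gpu_count_tokens[-1])
    gpu_type = gpu_count_tokens[1] if len(gpu_count_tokens) == 3 else None
    print(gpu_count_str, "->", gpu_type, num_gpus)
# gpu:2      -> None 2
# gpu:v100:2 -> v100 2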

@albanie merged commit 33720a0 into albanie:master on Dec 20, 2020