
Added --partition argument and adjusted node & GPU name parsing to work on our cluster #6

Merged
merged 10 commits into albanie:master on Dec 20, 2020

Conversation

@talesa (Contributor) commented Nov 14, 2020

By default, the 30-character limit was too short to print some of our node names, so I increased it.

Some of our GPUs had unusual names, so I changed the parsing to use a regexp. I added examples of the odd outputs I got on some nodes to the regex101 link below, so that contributors can keep iterating on the regexp to cover other unusual strings they encounter.
https://regex101.com/r/RHYM8Z/3
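
For reference, here is a minimal sketch of regexp-based GRES parsing; the regexp actually used in the PR is the one being iterated on at the regex101 link above, and the example strings here are hypothetical:

import re

# Illustrative only: the real regexp is maintained at the regex101 link above
# and may differ. This sketch handles GRES strings such as "gpu:4" and
# "gpu:v100:2(S:0-1)" (hypothetical examples).
GRES_RE = re.compile(r"gpu:(?:(?P<type>[^:(]+):)?(?P<count>\d+)")

def parse_gres(gres):
    """Return (gpu_type, count) from a GRES string, or (None, 0) if no match."""
    match = GRES_RE.search(gres)
    if match is None:
        return None, 0
    return match.group("type"), int(match.group("count"))

# parse_gres("gpu:4")             -> (None, 4)
# parse_gres("gpu:v100:2(S:0-1)") -> ("v100", 2)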

@frankier (Contributor) commented

This PR makes it almost work for me on ARC, but I had another problem afterwards. Sometimes we get tokens like:

tokens ['gpu', 'arcus-htc-***, ***, ***]

In that case we can't get num_gpus. I'm not sure what the solution is here. Should num_gpus be assumed to be 1?

@talesa (Contributor, Author) commented Nov 30, 2020

I've never used ARC, so I don't know how the output is formatted there or how many GPUs per node there are.

@albanie (Owner) commented Dec 14, 2020

@talesa - thanks a lot for this!
@frankier, would you be able to paste some input/output samples (so I can check that the PR doesn't break things for you) before I merge?

@frankier (Contributor) commented

So the situation is that I'm +1 on merging this PR, because it takes the script from not working to almost working for me. What makes it work fully is also adding the following, which simply throws away rows without a GPU count specified. However, this probably won't produce an accurate result, since it discards data. If you merge the PR, I can go ahead and file this in a more detailed new issue if/when I can reproduce it.

*** slurm_gpustat.py.1	2020-12-14 06:40:58.643007000 +0000
--- slurm_gpustat.py	2020-11-30 14:03:42.053292000 +0000
***************
*** 527,536 ****
--- 527,538 ----
          # ignore pending jobs
          if len(tokens) < 4 or not tokens[0].startswith("gpu"):
              continue
          gpu_count_str, node_str, user, jobid = tokens
          gpu_count_tokens = gpu_count_str.split(":")
+         if len(gpu_count_tokens) == 1:
+             continue
          num_gpus = int(gpu_count_tokens[-1])
          if len(gpu_count_tokens) == 2:
              gpu_type = None
          elif len(gpu_count_tokens) == 3:
              gpu_type = gpu_count_tokens[1]
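
For context, here is a minimal sketch of how the patched parsing behaves on a few hypothetical GRES count strings (not taken from any real cluster output):

# Illustrative only: the gpu_count_str values below are hypothetical.
for gpu_count_str in ("gpu", "gpu:2", "gpu:v100:2"):
    gpu_count_tokens = gpu_count_str.split(":")
    if len(gpu_count_tokens) == 1:
        # e.g. a bare "gpu" entry with no count: skipped by the new guard
        continue
    num_gpus = int(gpu_count_tokens[-1])
    gpu_type = gpu_count_tokens[1] if len(gpu_count_tokens) == 3 else None
    print(gpu_count_str, "->", gpu_type, num_gpus)
# gpu:2      -> None 2
# gpu:v100:2 -> v100 2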

@albanie merged commit 33720a0 into albanie:master on Dec 20, 2020