
Metrics collector crashes when NVIDIA MIGs are present #3090

Open
UrkoAT opened this issue Apr 16, 2024 · 1 comment

Comments


UrkoAT commented Apr 16, 2024

🐛 Describe the bug

I was configuring the pytorch/torchserve:0.10.0-gpu Docker image to deploy a model to production and I've run into the following issue.
The problem is that the nvgpu package used by the metrics collector does not work with NVIDIA MIG technology, and it crashes the thread.

After a bit of investigation, the culprit is the nvgpu.gpu_info() function, which tries to parse the nvidia-smi output. On a normal GPU it works fine, since it grabs the Memory-Usage field (around line 5, column 2):

urko@port-urkoa:~$ nvidia-smi 
Tue Apr 16 08:35:29 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060        Off | 00000000:01:00.0  On |                  N/A |
| N/A   68C    P0              38W /  80W |   3091MiB /  6144MiB |     18%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3317      G   /usr/lib/xorg/Xorg                         1955MiB |
|    0   N/A  N/A      3632      G   /usr/bin/gnome-shell                        279MiB |
|    0   N/A  N/A      4915      G   ...seed-version=20240414-180149.278000      327MiB |
|    0   N/A  N/A      5778      G   ...erProcess --variations-seed-version      477MiB |
+---------------------------------------------------------------------------------------+

However, since MIG technology changes the nvidia-smi output, it looks like this:

root@torchserve-depl-6479499d9f-8p8j7:/home/model-server# nvidia-smi 
Tue Apr 16 06:27:22 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:03:00.0 Off |                   On |
| N/A   74C    P0    63W / 250W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    4   0   0  |   1141MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      2MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

So when nvgpu tries to parse the Memory-Usage field, it gets N/A and tries to convert it to an integer, and that's the error I get.
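
A minimal sketch of a more defensive parse that would tolerate the MIG output (the parse_mem_mib helper is hypothetical, not part of nvgpu):

import re

def parse_mem_mib(field):
    # nvidia-smi prints e.g. "3091MiB" per side of the Memory-Usage cell on a
    # normal GPU, but "N/A" on a MIG-enabled GPU, which nvgpu feeds straight
    # to int(). Returning None instead of raising keeps the collector thread alive.
    field = field.strip()
    match = re.match(r"(\d+)\s*MiB", field)
    return int(match.group(1)) if match else None

print(parse_mem_mib("3091MiB"))  # 3091
print(parse_mem_mib("N/A"))      # None instead of ValueError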

Error logs

The error I get in the main logs:

2024-04-16T06:25:41,915 [ERROR] Thread-14 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
  File "/home/venv/lib/python3.9/site-packages/ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all
    value(num_of_gpu)
  File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 71, in gpu_utilization
    info = nvgpu.gpu_info()
  File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in gpu_info
    mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
  File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in <listcomp>
    mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
ValueError: invalid literal for int() with base 10: 'N'

Installation instructions

I used a Dockerfile with torchserve as the base image:

FROM pytorch/torchserve:0.10.0-gpu
ENV DEBIAN_FRONTEND=noninteractive
USER 0
RUN apt update && apt install -y python3-opencv python3-pip git build-essential
ENV PYTHONUNBUFFERED=1
RUN pip install opencv-python torchvision torch torchaudio timm numpy scikit-learn matplotlib seaborn pandas
RUN pip install 'git+https://github.com/facebookresearch/detectron2.git'

Model Packaing

Standard .mar file. Doesn't apply here.

config.properties

default_workers_per_model=1

Versions

Standard Docker image pytorch/torchserve:0.10.0-gpu

Repro instructions

The steps to reproduce it (it is MANDATORY to have MIGs):

root@torchserve-depl-6479499d9f-8p8j7:/home/model-server# python3   
Python 3.9.18 (main, Aug 25 2023, 13:20:04) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nvgpu
>>> nvgpu.gpu_info()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in gpu_info
    mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
  File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in <listcomp>
    mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
ValueError: invalid literal for int() with base 10: 'N'
>>> 

Possible Solution

I know this is a bug in nvgpu, not in torchserve, but AFAIK nvgpu is no longer being maintained, so this might be a good opportunity to switch to a different package or a different approach. Just my suggestion. Thanks.
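
For example, a minimal sketch of reading the same numbers through NVML via the nvidia-ml-py package (pynvml) instead of scraping nvidia-smi text; how TorchServe would wire this in is left open here:

import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        # NVML reports bytes directly, so there is no "N/A" text to parse
        print(pynvml.nvmlDeviceGetName(handle),
              mem.used // (1024 ** 2), "MiB /",
              mem.total // (1024 ** 2), "MiB")
finally:
    pynvml.nvmlShutdown()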

Collaborator

lxning commented Apr 16, 2024

@UrkoAT Thank you for investigating the root cause. We have noticed that there are some bugs in nvgpu, which is also in maintenance mode. TS v0.10.0 provides a feature that lets you customize the system metrics (see PR).
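
For illustration only, a rough sketch (not the implementation from the PR; the exact hook and output format the feature expects are assumptions here) of an NVML-based replacement for the gpu_utilization collector:

import pynvml

def gpu_utilization(num_of_gpu):
    # Hypothetical stand-in for ts.metrics.system_metrics.gpu_utilization that
    # avoids nvgpu.gpu_info(). On MIG-enabled devices some NVML queries are
    # unsupported, so failures are reported as N/A instead of crashing.
    if num_of_gpu <= 0:
        return
    pynvml.nvmlInit()
    try:
        for idx in range(num_of_gpu):
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            try:
                util = f"{pynvml.nvmlDeviceGetUtilizationRates(handle).gpu}%"
            except pynvml.NVMLError:
                util = "N/A"
            print(f"GPU{idx}: utilization={util} "
                  f"memory_used={mem.used // (1024 ** 2)}MiB")
    finally:
        pynvml.nvmlShutdown()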
