Low performance, but low hardware utilization #337

Open
vivi90 opened this issue Aug 15, 2022 · 8 comments

Comments

@vivi90
Contributor

vivi90 commented Aug 15, 2022

Problem

Simple ROMP has very poor performance on my machine:

  • around 10 FPS (standalone: romp --mode=webcam --show -t)
  • around 7 FPS (as a module: from romp import ROMP; see the sketch below the reproduction steps)

But the hardware utilization of my CUDA GPU is still low:
(Screenshot: hardware utilization)

Steps to reproduce

conda create -n romp python=3.10
conda activate romp
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install simple-romp cython
romp --mode=webcam --show -t
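For reference, the "as a module" path from the bullet list above looks roughly like the following. This is only a sketch: the romp_settings helper and its default-argument handling are taken from the simple-romp README as I remember it and may differ in detail.

    import cv2
    from romp import ROMP, romp_settings  # simple-romp package

    # Default settings; how flags like -t map onto these settings is not shown here.
    settings = romp_settings()
    model = ROMP(settings)

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()   # BGR frame, as returned by OpenCV
        if not ok:
            break
        outputs = model(frame)   # per-frame SMPL predictions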
@vivi90
Contributor Author

vivi90 commented Aug 15, 2022

Fixed the 7 FPS issue:

Used it in combination with my vmcp package and its vmcp.osc.backend.osc4py3.as_comthreads OSC backend.
That backend uses threading, so it caused an additional performance loss.
Fixed it by using the vmcp.osc.backend.osc4py3.as_eventloop backend instead and running vmcp.osc.channel.Sender.system.run() after every vmcp.osc.channel.Sender.send().
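The underlying osc4py3 pattern this boils down to is roughly the following; the address, port and payload are placeholders, and vmcp wraps these calls rather than exposing them directly:

    from osc4py3.as_eventloop import osc_startup, osc_udp_client, osc_send, osc_process, osc_terminate
    from osc4py3 import oscbuildparse

    osc_startup()
    osc_udp_client("127.0.0.1", 39539, "vmc")  # placeholder host/port/name

    # With the as_eventloop scheme there is no background thread, so a pending
    # message is only flushed when the loop is pumped explicitly:
    msg = oscbuildparse.OSCMessage("/VMC/Ext/OK", ",i", [1])  # placeholder message
    osc_send(msg, "vmc")
    osc_process()  # the equivalent of Sender.system.run() after every send()

    osc_terminate()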

But the inefficient hardware usage still causes around 10 FPS.

I have run the torch.utils.bottleneck profiler over my script, running 10 predictions per test:
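(For reference, the profiler is invoked as a module on the script; the script name matches the cProfile output below:)

    python -m torch.utils.bottleneck romp_vmcp.py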

--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         3696464 function calls (3485118 primitive calls) in 7.879 seconds

   Ordered by: internal time
   List reduced from 3133 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3100    2.324    0.001    2.324    0.001 {built-in method torch.conv2d}
        1    0.954    0.954    7.881    7.881 romp_vmcp.py:1(<module>)
       10    0.806    0.081    3.650    0.365 D:\miniconda3\envs\romp\lib\site-packages\romp\model.py:382(forward)
     1097    0.364    0.000    0.727    0.001 D:\miniconda3\envs\romp\lib\site-packages\torch\nn\modules\module.py:1440(_load_from_state_dict)
       10    0.351    0.035    0.351    0.035 {method 'read' of 'cv2.VideoCapture' objects}
  2048976    0.272    0.000    0.272    0.000 {method 'startswith' of 'str' objects}
     3070    0.208    0.000    0.208    0.000 {built-in method torch.batch_norm}
      316    0.126    0.000    0.126    0.000 {method 'uniform_' of 'torch._C._TensorBase' objects}
     2760    0.107    0.000    0.107    0.000 {built-in method torch.relu_}
     1853    0.095    0.000    0.095    0.000 {method 'copy_' of 'torch._C.StorageBase' objects}
     3772    0.092    0.000    0.092    0.000 {method 'to' of 'torch._C._TensorBase' objects}
     1853    0.090    0.000    0.203    0.000 D:\miniconda3\envs\romp\lib\site-packages\torch\_utils.py:48(_cuda)
       30    0.080    0.003    0.114    0.004 D:\miniconda3\envs\romp\lib\site-packages\romp\utils.py:606(rotation_matrix_to_quaternion)
     1851    0.079    0.000    0.079    0.000 {method 'copy_' of 'torch._C._TensorBase' objects}
184480/21960    0.075    0.000    0.079    0.000 D:\miniconda3\envs\romp\lib\site-packages\torch\nn\modules\module.py:1775(named_modules)
--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total
------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
------------------------  ------------  ------------  ------------  ------------  ------------  ------------
    DataParallel.forward        10.19%      57.587ms        21.09%     119.144ms     119.144ms             1
    DataParallel.forward        10.10%      57.059ms        20.87%     117.934ms     117.934ms             1
    DataParallel.forward        10.00%      56.507ms        20.55%     116.128ms     116.128ms             1
    DataParallel.forward         9.82%      55.509ms        20.41%     115.327ms     115.327ms             1
    DataParallel.forward         9.80%      55.348ms        20.36%     115.011ms     115.011ms             1
    DataParallel.forward         9.58%      54.141ms        20.33%     114.857ms     114.857ms             1
    DataParallel.forward         9.59%      54.172ms        20.10%     113.549ms     113.549ms             1
    DataParallel.forward         9.70%      54.788ms        20.04%     113.232ms     113.232ms             1
    DataParallel.forward         9.52%      53.783ms        20.04%     113.219ms     113.219ms             1
    DataParallel.forward         9.54%      53.874ms        19.99%     112.923ms     112.923ms             1
          aten::uniform_         0.44%       2.462ms         0.44%       2.462ms       2.462ms             1
          aten::uniform_         0.44%       2.459ms         0.44%       2.459ms       2.459ms             1
          aten::uniform_         0.43%       2.441ms         0.43%       2.441ms       2.441ms             1
          aten::uniform_         0.43%       2.437ms         0.43%       2.437ms       2.437ms             1
          aten::uniform_         0.43%       2.417ms         0.43%       2.417ms       2.417ms             1
------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 564.984ms
--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

        Because the autograd profiler uses the CUDA event API,
        the CUDA time column reports approximately max(cuda_time, cpu_time).
        Please ignore this output if your code does not use CUDA.
------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    DataParallel.forward        10.50%      68.836ms        22.29%     146.068ms     146.068ms       2.030ms         9.64%     137.153ms     137.153ms             1
    DataParallel.forward        10.39%      68.062ms        22.10%     144.827ms     144.827ms       2.673ms        12.69%     132.493ms     132.493ms             1
    DataParallel.forward        10.12%      66.321ms        21.33%     139.765ms     139.765ms       2.053ms         9.74%     129.801ms     129.801ms             1
    DataParallel.forward         9.93%      65.093ms        21.14%     138.530ms     138.530ms       2.056ms         9.76%     129.635ms     129.635ms             1
    DataParallel.forward         9.87%      64.702ms        21.06%     138.023ms     138.023ms       2.035ms         9.66%     128.844ms     128.844ms             1
    DataParallel.forward         9.78%      64.101ms        21.05%     137.950ms     137.950ms       2.035ms         9.66%     129.586ms     129.586ms             1
    DataParallel.forward         9.53%      62.471ms        20.71%     135.708ms     135.708ms       2.016ms         9.57%     127.760ms     127.760ms             1
    DataParallel.forward         9.77%      64.044ms        20.70%     135.675ms     135.675ms       2.064ms         9.80%     125.398ms     125.398ms             1
    DataParallel.forward         9.43%      61.809ms        20.21%     132.436ms     132.436ms       2.032ms         9.64%     124.054ms     124.054ms             1
    DataParallel.forward         9.51%      62.324ms        20.20%     132.358ms     132.358ms       2.064ms         9.80%     122.687ms     122.687ms             1
          aten::uniform_         0.39%       2.566ms         0.39%       2.566ms       2.566ms       1.000us         0.00%       1.000us       1.000us             1
          aten::uniform_         0.39%       2.534ms         0.39%       2.534ms       2.534ms       1.000us         0.00%       1.000us       1.000us             1
          aten::uniform_         0.38%       2.485ms         0.38%       2.485ms       2.485ms       1.000us         0.00%       1.000us       1.000us             1
                aten::to         0.00%       7.000us         0.37%       2.423ms       2.423ms       3.000us         0.01%       3.809ms       3.809ms             1
          aten::_to_copy         0.00%      29.000us         0.37%       2.416ms       2.416ms       5.000us         0.02%       3.806ms       3.806ms             1
------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 655.384ms
Self CUDA time total: 21.069ms

Not exactly sure what this means. But the Self CUDA time total (about 21 ms) is tiny compared to the Self CPU time total (about 655 ms), so to me it seems like the ROMP implementation is CPU-bound and slowed down by too much communication overhead.
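One thing worth checking from this trace is whether the nn.DataParallel wrapper itself adds overhead on a single GPU, since DataParallel.forward dominates the CPU time. A rough diagnostic sketch; the stand-in network and the input shape are assumptions, not the actual simple-romp entry point:

    import time
    import torch

    def time_forward(model, x, n=50):
        torch.cuda.synchronize()
        start = time.time()
        with torch.no_grad():
            for _ in range(n):
                model(x)
        torch.cuda.synchronize()
        return (time.time() - start) / n

    # Stand-in network; swap in the actual ROMP model to test the real case.
    net = torch.nn.Sequential(
        torch.nn.Conv2d(3, 64, 3, padding=1), torch.nn.ReLU(),
        torch.nn.Conv2d(64, 64, 3, padding=1), torch.nn.ReLU(),
    )
    x = torch.randn(1, 3, 512, 512, device="cuda")   # input size is a guess
    wrapped = torch.nn.DataParallel(net).cuda()
    print("DataParallel forward:", time_forward(wrapped, x))
    print("plain forward:       ", time_forward(wrapped.module, x))

If the two numbers differ a lot, the per-call scatter/gather done by DataParallel would explain a CPU-heavy profile like the one above.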

Configuration

GPU: 0
onnx: false
smooth_coeff: 1
temporal_optimize: true

@vivi90
Contributor Author

vivi90 commented Aug 15, 2022

@Arthur151 Do you have any ideas for optimization of ROMP to speed it up? 🙂

@vivi90
Contributor Author

vivi90 commented Aug 15, 2022

Btw, I need to mention that I was not able to test ONNX with CUDA because of #336.

@Arthur151
Owner

@vivi90 Hi, Vivian,
Yes, I see your question. It happened to my colleague too.
ROMP runs at over 25/50 FPS on my 1070Ti/3090Ti, but only at about 20 FPS on my colleague's 3090 server. I haven't found the cause of this problem yet. My guess is that it might be some essential acceleration library that I installed but my colleague didn't; I haven't determined which lib it is. Sorry to say that.

@JunfengLiu1

@Arthur151 Hey, have you managed to fix this bug? Whether it is romp or BEV, I can also only get about 20 FPS on a 4080.

@vivi90
Contributor Author

vivi90 commented Mar 12, 2023

@JunfengLiu1

Hey, did you successfully fix the BUG? Whether it is romp or BEV, I can only run about 20 frames on the 4080.

Please share the following information with us:

@JunfengLiu1

JunfengLiu1 commented Mar 13, 2023

@vivi90

  • Used operating system
    ubuntu18.04
  • Used python version
    3.7
  • Used CUDA version
    11.4
  • ROMP & BEV configuration settings
    I did not modify any parameters in the main file;
    romp: romp --mode=webcam --show
    This is the result:
    (Screenshot: romp1)
    bev: bev --mode=webcam --show
    Result:
    (Screenshot: bev)
    The problem is that every now and then (roughly 20% of the time) it drops to 10 FPS (romp does too).
  • Profiling reports
    This is the result of bev --mode=webcam --show; it stopped running after detecting 300 frames.
    (Screenshots: 1, 2)
  • Environment:
    certifi 2022.12.7
    commonmark 0.9.1
    cycler 0.11.0
    Cython 0.29.33
    cython-bbox 0.1.3
    filterpy 1.4.5
    fonttools 4.38.0
    importlib-metadata 4.8.3
    kiwisolver 1.4.4
    lap 0.4.0
    matplotlib 3.5.3
    norfair 2.2.0
    numpy 1.21.6
    nvidia-cublas-cu11 11.10.3.66
    nvidia-cuda-nvrtc-cu11 11.7.99
    nvidia-cuda-runtime-cu11 11.7.99
    nvidia-cudnn-cu11 8.5.0.96
    opencv-python 4.7.0.72
    packaging 23.0
    Pillow 9.4.0
    pip 22.3.1
    Pygments 2.14.0
    pyparsing 3.0.9
    PySocks 1.7.1
    python-dateutil 2.8.2
    rich 12.6.0
    scipy 1.7.3
    setuptools 67.6.0
    simple-romp 1.0.8
    six 1.16.0
    torch 1.13.1
    typing_extensions 4.5.0
    wget 3.2
    wheel 0.38.4
    zipp 3.15.0
    I installed it with pip install --upgrade simple_romp, which by default installed nvidia-cuda-nvrtc-cu11 11.7.99, while my local CUDA is 11.4. Could that be the problem?
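One quick check for the version question above: pip-installed torch wheels generally ship their own CUDA libraries, so what matters at run time is the version bundled with torch, not the locally installed toolkit. It can be printed like this:

    import torch
    print(torch.__version__)                 # 1.13.1 here
    print(torch.version.cuda)                # CUDA runtime the wheel was built with
    print(torch.backends.cudnn.version())    # bundled cuDNN
    print(torch.cuda.is_available(), torch.cuda.get_device_name(0))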

@JunfengLiu1

I changed my CUDA to 11.8, but BEV still ran at about 15-20 FPS and it still keeps dropping to 10 FPS.
