Low performance, but low hardware utilization #337

Open
vivi90 opened this issue Aug 15, 2022 · 8 comments

Comments

@vivi90
Contributor

vivi90 commented Aug 15, 2022

Problem

Simple ROMP has very poor performance on my machine:

  • around 10 FPS (standalone: romp --mode=webcam --show -t)
  • around 7 FPS (as a module: from romp import ROMP; see the sketch below the reproduction steps)

But the hardware utilization of my CUDA GPU is still low:
(Screenshot: hardware utilization)

Steps to reproduce

conda create -n romp python=3.10
conda activate romp
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install simple-romp cython
romp --mode=webcam --show -t
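For reference, the "as a module" path from the bullet list above looks roughly like the following. This is only a sketch: the romp_settings helper and its default-argument handling are taken from the simple-romp README as I remember it and may differ in detail.

    import cv2
    from romp import ROMP, romp_settings  # simple-romp package

    # Default settings; how flags like -t map onto these settings is not shown here.
    settings = romp_settings()
    model = ROMP(settings)

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()   # BGR frame, as returned by OpenCV
        if not ok:
            break
        outputs = model(frame)   # per-frame SMPL predictions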
@vivi90
Contributor Author

vivi90 commented Aug 15, 2022

Fixed the 7 FPS issue:

Used it in combination with my vmcp package and its vmcp.osc.backend.osc4py3.as_comthreads OSC backend.
That backend uses threading, so it caused an additional performance loss.
Fixed it by using the vmcp.osc.backend.osc4py3.as_eventloop backend instead and running vmcp.osc.channel.Sender.system.run() after every vmcp.osc.channel.Sender.send().
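The underlying osc4py3 pattern this boils down to is roughly the following; the address, port and payload are placeholders, and vmcp wraps these calls rather than exposing them directly:

    from osc4py3.as_eventloop import osc_startup, osc_udp_client, osc_send, osc_process, osc_terminate
    from osc4py3 import oscbuildparse

    osc_startup()
    osc_udp_client("127.0.0.1", 39539, "vmc")  # placeholder host/port/name

    # With the as_eventloop scheme there is no background thread, so a pending
    # message is only flushed when the loop is pumped explicitly:
    msg = oscbuildparse.OSCMessage("/VMC/Ext/OK", ",i", [1])  # placeholder message
    osc_send(msg, "vmc")
    osc_process()  # the equivalent of Sender.system.run() after every send()

    osc_terminate()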

But the inefficient hardware usage still causes around 10 FPS.

I have run the torch.utils.bottleneck profiler over my script, running 10 predictions per test:
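(For reference, the profiler is invoked as a module on the script; the script name matches the cProfile output below:)

    python -m torch.utils.bottleneck romp_vmcp.py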

--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         3696464 function calls (3485118 primitive calls) in 7.879 seconds

   Ordered by: internal time
   List reduced from 3133 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3100    2.324    0.001    2.324    0.001 {built-in method torch.conv2d}
        1    0.954    0.954    7.881    7.881 romp_vmcp.py:1(<module>)
       10    0.806    0.081    3.650    0.365 D:\miniconda3\envs\romp\lib\site-packages\romp\model.py:382(forward)
     1097    0.364    0.000    0.727    0.001 D:\miniconda3\envs\romp\lib\site-packages\torch\nn\modules\module.py:1440(_load_from_state_dict)
       10    0.351    0.035    0.351    0.035 {method 'read' of 'cv2.VideoCapture' objects}
  2048976    0.272    0.000    0.272    0.000 {method 'startswith' of 'str' objects}
     3070    0.208    0.000    0.208    0.000 {built-in method torch.batch_norm}
      316    0.126    0.000    0.126    0.000 {method 'uniform_' of 'torch._C._TensorBase' objects}
     2760    0.107    0.000    0.107    0.000 {built-in method torch.relu_}
     1853    0.095    0.000    0.095    0.000 {method 'copy_' of 'torch._C.StorageBase' objects}
     3772    0.092    0.000    0.092    0.000 {method 'to' of 'torch._C._TensorBase' objects}
     1853    0.090    0.000    0.203    0.000 D:\miniconda3\envs\romp\lib\site-packages\torch\_utils.py:48(_cuda)
       30    0.080    0.003    0.114    0.004 D:\miniconda3\envs\romp\lib\site-packages\romp\utils.py:606(rotation_matrix_to_quaternion)
     1851    0.079    0.000    0.079    0.000 {method 'copy_' of 'torch._C._TensorBase' objects}
184480/21960    0.075    0.000    0.079    0.000 D:\miniconda3\envs\romp\lib\site-packages\torch\nn\modules\module.py:1775(named_modules)
--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total
------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
------------------------  ------------  ------------  ------------  ------------  ------------  ------------
    DataParallel.forward        10.19%      57.587ms        21.09%     119.144ms     119.144ms             1
    DataParallel.forward        10.10%      57.059ms        20.87%     117.934ms     117.934ms             1
    DataParallel.forward        10.00%      56.507ms        20.55%     116.128ms     116.128ms             1
    DataParallel.forward         9.82%      55.509ms        20.41%     115.327ms     115.327ms             1
    DataParallel.forward         9.80%      55.348ms        20.36%     115.011ms     115.011ms             1
    DataParallel.forward         9.58%      54.141ms        20.33%     114.857ms     114.857ms             1
    DataParallel.forward         9.59%      54.172ms        20.10%     113.549ms     113.549ms             1
    DataParallel.forward         9.70%      54.788ms        20.04%     113.232ms     113.232ms             1
    DataParallel.forward         9.52%      53.783ms        20.04%     113.219ms     113.219ms             1
    DataParallel.forward         9.54%      53.874ms        19.99%     112.923ms     112.923ms             1
          aten::uniform_         0.44%       2.462ms         0.44%       2.462ms       2.462ms             1
          aten::uniform_         0.44%       2.459ms         0.44%       2.459ms       2.459ms             1
          aten::uniform_         0.43%       2.441ms         0.43%       2.441ms       2.441ms             1
          aten::uniform_         0.43%       2.437ms         0.43%       2.437ms       2.437ms             1
          aten::uniform_         0.43%       2.417ms         0.43%       2.417ms       2.417ms             1
------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 564.984ms
--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

        Because the autograd profiler uses the CUDA event API,
        the CUDA time column reports approximately max(cuda_time, cpu_time).
        Please ignore this output if your code does not use CUDA.
------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
    DataParallel.forward        10.50%      68.836ms        22.29%     146.068ms     146.068ms       2.030ms         9.64%     137.153ms     137.153ms             1
    DataParallel.forward        10.39%      68.062ms        22.10%     144.827ms     144.827ms       2.673ms        12.69%     132.493ms     132.493ms             1
    DataParallel.forward        10.12%      66.321ms        21.33%     139.765ms     139.765ms       2.053ms         9.74%     129.801ms     129.801ms             1
    DataParallel.forward         9.93%      65.093ms        21.14%     138.530ms     138.530ms       2.056ms         9.76%     129.635ms     129.635ms             1
    DataParallel.forward         9.87%      64.702ms        21.06%     138.023ms     138.023ms       2.035ms         9.66%     128.844ms     128.844ms             1
    DataParallel.forward         9.78%      64.101ms        21.05%     137.950ms     137.950ms       2.035ms         9.66%     129.586ms     129.586ms             1
    DataParallel.forward         9.53%      62.471ms        20.71%     135.708ms     135.708ms       2.016ms         9.57%     127.760ms     127.760ms             1
    DataParallel.forward         9.77%      64.044ms        20.70%     135.675ms     135.675ms       2.064ms         9.80%     125.398ms     125.398ms             1
    DataParallel.forward         9.43%      61.809ms        20.21%     132.436ms     132.436ms       2.032ms         9.64%     124.054ms     124.054ms             1
    DataParallel.forward         9.51%      62.324ms        20.20%     132.358ms     132.358ms       2.064ms         9.80%     122.687ms     122.687ms             1
          aten::uniform_         0.39%       2.566ms         0.39%       2.566ms       2.566ms       1.000us         0.00%       1.000us       1.000us             1
          aten::uniform_         0.39%       2.534ms         0.39%       2.534ms       2.534ms       1.000us         0.00%       1.000us       1.000us             1
          aten::uniform_         0.38%       2.485ms         0.38%       2.485ms       2.485ms       1.000us         0.00%       1.000us       1.000us             1
                aten::to         0.00%       7.000us         0.37%       2.423ms       2.423ms       3.000us         0.01%       3.809ms       3.809ms             1
          aten::_to_copy         0.00%      29.000us         0.37%       2.416ms       2.416ms       5.000us         0.02%       3.806ms       3.806ms             1
------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 655.384ms
Self CUDA time total: 21.069ms

Not exactly sure what this means. But the Self CUDA time total (about 21 ms) is tiny compared to the Self CPU time total (about 655 ms), so to me it seems like the ROMP implementation is CPU-bound and slowed down by too much communication overhead.
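One thing worth checking from this trace is whether the nn.DataParallel wrapper itself adds overhead on a single GPU, since DataParallel.forward dominates the CPU time. A rough diagnostic sketch; the stand-in network and the input shape are assumptions, not the actual simple-romp entry point:

    import time
    import torch

    def time_forward(model, x, n=50):
        torch.cuda.synchronize()
        start = time.time()
        with torch.no_grad():
            for _ in range(n):
                model(x)
        torch.cuda.synchronize()
        return (time.time() - start) / n

    # Stand-in network; swap in the actual ROMP model to test the real case.
    net = torch.nn.Sequential(
        torch.nn.Conv2d(3, 64, 3, padding=1), torch.nn.ReLU(),
        torch.nn.Conv2d(64, 64, 3, padding=1), torch.nn.ReLU(),
    )
    x = torch.randn(1, 3, 512, 512, device="cuda")   # input size is a guess
    wrapped = torch.nn.DataParallel(net).cuda()
    print("DataParallel forward:", time_forward(wrapped, x))
    print("plain forward:       ", time_forward(wrapped.module, x))

If the two numbers differ a lot, the per-call scatter/gather done by DataParallel would explain a CPU-heavy profile like the one above.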

Configuration

GPU: 0
onnx: false
smooth_coeff: 1
temporal_optimize: true

@vivi90
Contributor Author

vivi90 commented Aug 15, 2022

@Arthur151 Do you have any ideas for optimization of ROMP to speed it up? 🙂

@vivi90
Contributor Author

vivi90 commented Aug 15, 2022

Btw, I need to mention that I was not able to test ONNX with CUDA because of #336.

@Arthur151
Owner

@vivi90 Hi, Vivian,
Yes, I see your question. It happened to my colleague too.
ROMP runs at over 25/50 FPS on my 1070Ti/3090Ti, but only at about 20 FPS on my colleague's 3090 server. I haven't found the cause of this problem yet. My guess is that it might be some essential acceleration library that I installed but my colleague didn't; I haven't determined which lib it is. Sorry to say that.

@JunfengLiu1

@Arthur151 Hey, have you managed to fix this bug? Whether it is romp or BEV, I can also only get about 20 FPS on a 4080.

@vivi90
Contributor Author

vivi90 commented Mar 12, 2023

@JunfengLiu1

Hey, did you successfully fix the BUG? Whether it is romp or BEV, I can only run about 20 frames on the 4080.

Please share the following information with us:

@JunfengLiu1

JunfengLiu1 commented Mar 13, 2023

@vivi90

  • Used operating system
    ubuntu18.04
  • Used python version
    3.7
  • Used CUDA version
    11.4
  • ROMP & BEV configuration settings
    I did not modify any parameters in the main file;
    romp: romp --mode=webcam --show
    This is the result:
    (Screenshot: romp1)
    bev: bev --mode=webcam --show
    Result:
    (Screenshot: bev)
    The problem is that every now and then (roughly 20% of the time) it drops to 10 FPS (romp does too).
  • Profiling reports
    This is the result of bev --mode=webcam --show; it stopped running after detecting 300 frames.
    (Screenshots: 1, 2)
  • Environment:
    certifi 2022.12.7
    commonmark 0.9.1
    cycler 0.11.0
    Cython 0.29.33
    cython-bbox 0.1.3
    filterpy 1.4.5
    fonttools 4.38.0
    importlib-metadata 4.8.3
    kiwisolver 1.4.4
    lap 0.4.0
    matplotlib 3.5.3
    norfair 2.2.0
    numpy 1.21.6
    nvidia-cublas-cu11 11.10.3.66
    nvidia-cuda-nvrtc-cu11 11.7.99
    nvidia-cuda-runtime-cu11 11.7.99
    nvidia-cudnn-cu11 8.5.0.96
    opencv-python 4.7.0.72
    packaging 23.0
    Pillow 9.4.0
    pip 22.3.1
    Pygments 2.14.0
    pyparsing 3.0.9
    PySocks 1.7.1
    python-dateutil 2.8.2
    rich 12.6.0
    scipy 1.7.3
    setuptools 67.6.0
    simple-romp 1.0.8
    six 1.16.0
    torch 1.13.1
    typing_extensions 4.5.0
    wget 3.2
    wheel 0.38.4
    zipp 3.15.0
    I installed it with pip install --upgrade simple_romp, which by default installed nvidia-cuda-nvrtc-cu11 11.7.99, while my local CUDA is 11.4. Could that be the problem?
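One quick check for the version question above: pip-installed torch wheels generally ship their own CUDA libraries, so what matters at run time is the version bundled with torch, not the locally installed toolkit. It can be printed like this:

    import torch
    print(torch.__version__)                 # 1.13.1 here
    print(torch.version.cuda)                # CUDA runtime the wheel was built with
    print(torch.backends.cudnn.version())    # bundled cuDNN
    print(torch.cuda.is_available(), torch.cuda.get_device_name(0))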

@JunfengLiu1

I changed my CUDA to 11.8, but BEV still ran at about 15-20 FPS and it still keeps dropping to 10 FPS.
