RuntimeError When Enabling Accuracy Checks in yolov3 Training on GPU. #2248

scshtyk · 2024-04-26T10:02:03Z

Issue Description
I encounter a RuntimeError related to gradient computation when enabling accuracy checks while training yolov3 in a GPU Docker environment. Training runs without issues when the --accuracy flag is omitted.

Steps to Reproduce
python install.py yolov3
python run.py yolov3 -d cuda -t train --accuracy
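To compare the two runs side by side, a small wrapper could drive both commands. This is only a sketch: the helper names `build_command` and `run_both` are made up for illustration, and it assumes it is launched from the benchmark checkout where run.py lives.

```python
import subprocess
import sys


def build_command(accuracy: bool) -> list[str]:
    """Return the run.py invocation from the report, with or without --accuracy."""
    cmd = [sys.executable, "run.py", "yolov3", "-d", "cuda", "-t", "train"]
    if accuracy:
        cmd.append("--accuracy")
    return cmd


def run_both() -> dict[str, int]:
    """Run the training command twice and collect the exit codes (hypothetical helper)."""
    results = {}
    for label, accuracy in (("without --accuracy", False), ("with --accuracy", True)):
        completed = subprocess.run(build_command(accuracy))
        results[label] = completed.returncode
    return results
```

Calling `run_both()` should return a zero exit code for the first entry and a non-zero one for the second if the failure reproduces.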

Expected Behavior
Training should complete and the accuracy check should run without raising an error.

Actual Behavior
The script executes successfully without the --accuracy flag.
However, when the accuracy check is enabled, it fails with the following error message:

TypeError: Darknet.forward() takes from 2 to 4 positional arguments but 6 were given
Running train method from yolov3 on cuda in eager mode with input batch size 4 and precision fp32.

Environment: pytorch-cuda=12.1, python=3.11
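The wording of the message suggests that five positional arguments reached a forward() that accepts at most three besides self. The sketch below reproduces the same wording with a stand-in class. `ToyDarknet`, its parameter names, and the five-element input tuple are assumptions for illustration, not the actual yolov3 model or the accuracy-check code in run.py.

```python
class ToyDarknet:
    """Stand-in model (hypothetical): forward takes self plus 1 to 3 arguments."""

    def forward(self, x, augment=False, verbose=False):
        return x


def call_with_unpacked_inputs(model, example_inputs):
    """Unpack a tuple of inputs into forward() and return the error text, if any."""
    try:
        model.forward(*example_inputs)
    except TypeError as exc:
        return str(exc)
    return None


# Five unpacked inputs plus self make six positional arguments.
too_many = call_with_unpacked_inputs(ToyDarknet(), (1, 2, 3, 4, 5))
# A single input matches the signature, so no error is returned.
just_right = call_with_unpacked_inputs(ToyDarknet(), (1,))
print(too_many)
```

If the accuracy path unpacks every element of the training batch into the model call, while the regular train path passes only the image tensor, that would explain why the failure appears only with --accuracy. This is a guess based on the message, not something confirmed in the thread.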

xuzhao9 · 2024-06-20T03:28:42Z

The following command passes the accuracy test:

$ python run_benchmark.py dynamo --only yolov3 --accuracy --training --amp --backend=inductor
loading model: 0it [00:02, ?it/s]
cuda train yolov3
W0619 23:26:48.274000 139707225552704 torch/_inductor/utils.py:1219] [9/0] DeviceCopy in input program
W0619 23:26:48.275000 139707225552704 torch/_inductor/utils.py:1219] [9/0] DeviceCopy in input program
skipping cudagraphs due to skipping cudagraphs due to cpu device (primals_2). Found from :
   File "/home/xz/git/benchmark/torchbenchmark/models/yolov3/yolo_models.py", line 188, in forward
    self.create_grids((nx, ny), p.device)
  File "/home/xz/git/benchmark/torchbenchmark/models/yolov3/yolo_models.py", line 159, in create_grids
    self.anchor_vec = self.anchor_vec.to(device)

W0619 23:26:52.032000 139707225552704 torch/_inductor/utils.py:1219] [9/1] DeviceCopy in input program
W0619 23:26:52.032000 139707225552704 torch/_inductor/utils.py:1219] [9/1] DeviceCopy in input program
skipping cudagraphs due to skipping cudagraphs due to cpu device (primals_4). Found from :
   File "/home/xz/git/benchmark/torchbenchmark/models/yolov3/yolo_models.py", line 188, in forward
    self.create_grids((nx, ny), p.device)
  File "/home/xz/git/benchmark/torchbenchmark/models/yolov3/yolo_models.py", line 159, in create_grids
    self.anchor_vec = self.anchor_vec.to(device)

W0619 23:26:54.131000 139707225552704 torch/_logging/_internal.py:1034] [13/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
pass

I am looking into why the result differs with the run.py runner.

xuzhao9 · 2024-06-20T04:06:56Z

Fix: #2321
Can you help verify it fixes the issue?

scshtyk · 2024-06-20T09:59:35Z

> Fix: #2321 Can you help verify it fixes the issue?

Thank you. I tested with your fix commit, and the accuracy test is passing.

xuzhao9 added the accuracy label May 6, 2024

xuzhao9 mentioned this issue Jun 20, 2024

Fix yolov3 train accuracy test #2321

Closed

facebook-github-bot closed this as completed in caa76d8 Jun 20, 2024
