RuntimeError When Enabling Accuracy Checks in yolov3 Training on GPU. #2248

Closed
scshtyk opened this issue Apr 26, 2024 · 3 comments

scshtyk commented Apr 26, 2024

Issue Description
I encounter a RuntimeError related to gradient computation when enabling accuracy checks while training yolov3 in a GPU Docker environment. Training runs without issues when the --accuracy flag is omitted.

Steps to Reproduce
python install.py yolov3
python run.py yolov3 -d cuda -t train --accuracy

Expected Behavior
Training should complete and the accuracy check should run without raising an error.

Actual Behavior
The script executes successfully without the --accuracy flag.
However, when the accuracy check is enabled, it fails with the following error message:

TypeError: Darknet.forward() takes from 2 to 4 positional arguments but 6 were given
Running train method from yolov3 on cuda in eager mode with input batch size 4 and precision fp32.

Environment: pytorch-cuda=12.1, python=3.11
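
For readers unfamiliar with this class of error, here is a minimal sketch of how a message shaped like the one above can arise. It is not a diagnosis of this issue. The `Darknet` class below is a hypothetical stand-in, not the torchbenchmark model. Its `forward` signature and the `call_forward` and `try_call` helpers are assumptions chosen only to reproduce the "from 2 to 4 ... but 6 were given" wording. The thread does not say how the runner actually builds its arguments.

```python
class Darknet:
    """Hypothetical stand-in, NOT the real torchbenchmark yolov3 model."""

    def forward(self, x, augment=False, verbose=False):
        # self + x are required, two flags are optional, so Python
        # describes the accepted range as "from 2 to 4" positional arguments.
        return x


def call_forward(model, example_inputs):
    """Unpack example_inputs positionally (assumed runner behaviour)."""
    return model.forward(*example_inputs)


def try_call(model, example_inputs):
    """Return the TypeError message, or None if the call succeeds."""
    try:
        call_forward(model, example_inputs)
    except TypeError as exc:
        return str(exc)
    return None


model = Darknet()

# A one-element tuple matches the signature and succeeds.
ok_message = try_call(model, ("images",))

# Five unpacked elements plus self make six positional arguments.
bad_message = try_call(model, ("a", "b", "c", "d", "e"))
print(bad_message)
```

If something similar is happening here, comparing the length and structure of the example inputs passed with and without --accuracy would be a reasonable first check.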

xuzhao9 (Contributor) commented Jun 20, 2024

The following command will pass the test:

$ python run_benchmark.py dynamo --only yolov3 --accuracy --training --amp --backend=inductor
loading model: 0it [00:02, ?it/s]
cuda train yolov3
W0619 23:26:48.274000 139707225552704 torch/_inductor/utils.py:1219] [9/0] DeviceCopy in input program
W0619 23:26:48.275000 139707225552704 torch/_inductor/utils.py:1219] [9/0] DeviceCopy in input program
skipping cudagraphs due to skipping cudagraphs due to cpu device (primals_2). Found from :
   File "/home/xz/git/benchmark/torchbenchmark/models/yolov3/yolo_models.py", line 188, in forward
    self.create_grids((nx, ny), p.device)
  File "/home/xz/git/benchmark/torchbenchmark/models/yolov3/yolo_models.py", line 159, in create_grids
    self.anchor_vec = self.anchor_vec.to(device)

W0619 23:26:52.032000 139707225552704 torch/_inductor/utils.py:1219] [9/1] DeviceCopy in input program
W0619 23:26:52.032000 139707225552704 torch/_inductor/utils.py:1219] [9/1] DeviceCopy in input program
skipping cudagraphs due to skipping cudagraphs due to cpu device (primals_4). Found from :
   File "/home/xz/git/benchmark/torchbenchmark/models/yolov3/yolo_models.py", line 188, in forward
    self.create_grids((nx, ny), p.device)
  File "/home/xz/git/benchmark/torchbenchmark/models/yolov3/yolo_models.py", line 159, in create_grids
    self.anchor_vec = self.anchor_vec.to(device)

W0619 23:26:54.131000 139707225552704 torch/_logging/_internal.py:1034] [13/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
pass

I am looking into why the result differs when using the run.py runner.

xuzhao9 (Contributor) commented Jun 20, 2024

Fix: #2321
Can you help verify it fixes the issue?

scshtyk (Author) commented Jun 20, 2024

Thank you. I tested with your fix commit, and the accuracy test is passing.
