adjust thresholds for gluon_inception_v3, beit_base_patch16_224, phli… #127664

Fuzzkatt · 2024-06-01T00:43:11Z

Compiling the error logs for the accuracy failures from #126692, we get:

2024-05-30T22:25:48.9826632Z cuda train cspdarknet53                       
2024-05-30T22:27:25.3007720Z W0530 22:27:25.299000 140102948520576 torch/_logging/_internal.py:1029] [6/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
2024-05-30T22:28:32.5647228Z E0530 22:28:32.563000 140102948520576 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.08938, (ref-fp64): 0.02659 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
2024-05-30T22:28:32.5678973Z fail_accuracy

2024-05-30T22:54:18.6599720Z cuda train gluon_inception_v3                 
2024-05-30T22:55:15.9177553Z W0530 22:55:15.916000 140325213831808 torch/_logging/_internal.py:1029] [6/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
2024-05-30T22:56:48.8158371Z E0530 22:56:48.815000 140325213831808 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.01151, (ref-fp64): 0.00230 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
2024-05-30T22:56:48.8206274Z fail_accuracy

2024-05-30T22:07:57.0250861Z cuda train beit_base_patch16_224              
...
2024-05-30T22:09:57.6998720Z E0530 22:09:57.698000 140132593296000 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
2024-05-30T22:09:57.7000332Z E0530 22:09:57.699000 140132593296000 torch/_dynamo/utils.py:1306] Accuracy failed for key name blocks.0.attn.proj.bias.grad
2024-05-30T22:09:57.7025912Z fail_accuracy

2024-05-31T19:52:22.6647704Z cuda train phlippe_resnet                     
2024-05-31T19:52:42.3562368Z E0531 19:52:42.355000 140052933362304 torch/_dynamo/utils.py:1400] RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
2024-05-31T19:52:42.3574085Z fail_accuracy

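Of the four failing models, gluon_inception_v3 and beit_base_patch16_224 report RMSE (res-fp64) values of 0.01151 and 0.01333 respectively against a tolerance of 0.01, which is a very minor mismatch. Similarly, phlippe_resnet reports 0.00102 against a tolerance of 0.001. The fourth model, cspdarknet53, reports 0.08938 against the same 0.01 tolerance, a much larger gap.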
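
For anyone who wants to pull these numbers out of the CI logs themselves, here is a minimal sketch that parses the RMSE lines quoted above. The name `parse_rmse_line` and the regular expression are my own illustration, not part of PyTorch. It only reproduces the res-versus-tol comparison made in this description; the actual pass/fail check in torch/_dynamo/utils.py also involves the ref-fp64 error and the multiplier, and is not reimplemented here.

```python
import re

# Matches the RMSE error lines as they appear in the CI logs quoted above.
# The pattern is an assumption based on those four log lines only.
RMSE_LINE = re.compile(
    r"RMSE \(res-fp64\): (?P<res>[\d.]+), \(ref-fp64\): (?P<ref>[\d.]+)"
    r".*?multiplier: (?P<multiplier>[\d.]+), tol: (?P<tol>[\d.]+)"
)


def parse_rmse_line(line):
    """Return res, ref, multiplier and tol as floats, or None if no match."""
    match = RMSE_LINE.search(line)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


# Log line for gluon_inception_v3, copied from the description above.
sample = (
    "2024-05-30T22:56:48.8158371Z E0530 22:56:48.815000 140325213831808 "
    "torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.01151, (ref-fp64): 0.00230 "
    "and shape=torch.Size([]). res.dtype: torch.float32, "
    "multiplier: 3.000000, tol: 0.010000"
)

parsed = parse_rmse_line(sample)
# How far the measured error sits above the configured tolerance.
ratio_to_tol = parsed["res"] / parsed["tol"]
print(parsed, ratio_to_tol)
```

Running the same function over the other three log lines gives the values compared in the paragraph above.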

This PR adjusts the thresholds for the three models that fail with a minor accuracy mismatch by bumping each one up to the next tolerance level, and then removes the xfail from the corresponding tests.
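
As a rough illustration of what "bumping to the next tolerance" means, the sketch below picks the next entry from an ordered ladder of tolerances. The ladder values and the name `next_tolerance` are hypothetical and chosen only for the example; the real thresholds and model lists live in the benchmark scripts (for example benchmarks/dynamo/timm_models.py, which shows up in the rebase log) and are not reproduced here. The current tolerances are the `tol` values from the logs above.

```python
# Hypothetical tolerance ladder, smallest to largest. These values are
# illustrative assumptions, not the thresholds used by the benchmark scripts.
TOLERANCE_TIERS = [1e-3, 1e-2, 4e-2, 8e-2]


def next_tolerance(current):
    """Return the smallest tier strictly greater than `current`."""
    for tier in TOLERANCE_TIERS:
        if tier > current:
            return tier
    raise ValueError(f"no tier above {current}")


# `tol` values as reported in the failing log lines.
current_tol = {
    "gluon_inception_v3": 0.01,
    "beit_base_patch16_224": 0.01,
    "phlippe_resnet": 0.001,
}

bumped_tol = {name: next_tolerance(tol) for name, tol in current_tol.items()}
print(bumped_tol)
```

A ladder keeps every model on one of a few shared thresholds rather than giving each model its own hand-tuned value.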

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @nWEIdia @eqy

pytorch-bot · 2024-06-01T00:43:16Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127664

❌ 1 New Failure, 13 Unrelated Failures

As of commit 47c45ff with merge base 4448397:

NEW FAILURE - The following job has failed:

pull / linux-jammy-py3.8-gcc11 / test (distributed, 2, 2, linux.2xlarge) (gh)
Process completed with exit code 137.

UNSTABLE - The following jobs failed, but the failures were likely due to flakiness present on trunk, and the jobs have been marked as unstable:

inductor / cuda12.1-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128903)
ImportError: attempted relative import with no known parent package
inductor / cuda12.1-py3.10-gcc9-sm86 / test (aot_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128903)
ImportError: attempted relative import with no known parent package
inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128902)
ImportError: attempted relative import with no known parent package
inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128902)
ImportError: attempted relative import with no known parent package
inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128901)
ImportError: attempted relative import with no known parent package
inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128901)
ImportError: attempted relative import with no known parent package
inductor / rocm6.1-py3.8-inductor / test (inductor, 1, 1, linux.rocm.gpu.2, unstable) (gh) (#128871)
'Test'
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128929)
ImportError: attempted relative import with no known parent package
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128929)
ImportError: attempted relative import with no known parent package
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamic_aot_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128931)
ImportError: attempted relative import with no known parent package
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamic_aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128931)
ImportError: attempted relative import with no known parent package
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamo_eager_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128932)
curl: (22) The requested URL returned error:
inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamo_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#128932)
ImportError: attempted relative import with no known parent package

This comment was automatically generated by Dr. CI and updates every 15 minutes.

nWEIdia · 2024-06-02T20:11:35Z

We may need to triage beit_base_patch16_224 and phlippe_resnet models further, as the debug print PR seems to suggest a real regression (see: #126692 (comment))

Fuzzkatt · 2024-06-13T23:02:49Z

@pytorchbot rebase

pytorchmergebot · 2024-06-13T23:04:25Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-06-13T23:04:27Z

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/127664/head returned non-zero exit code 1

Rebasing (1/18)
Rebasing (2/18)
Auto-merging benchmarks/dynamo/timm_models.py
CONFLICT (content): Merge conflict in benchmarks/dynamo/timm_models.py
error: could not apply 33c15b2455d... add new threshold for tinynet_a from new cudnnv9 regression, remove threshold for phlippe_resnet as it's fixed by cudnnv9
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 33c15b2455d... add new threshold for tinynet_a from new cudnnv9 regression, remove threshold for phlippe_resnet as it's fixed by cudnnv9

Raised by https://github.com/pytorch/pytorch/actions/runs/9507897428

Fuzzkatt · 2024-06-18T18:48:13Z

@pytorchbot rebase

pytorchmergebot · 2024-06-18T18:49:31Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-06-18T18:49:33Z

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/127664/head returned non-zero exit code 1

Rebasing (1/4)
Rebasing (2/4)
Auto-merging benchmarks/dynamo/timm_models.py
CONFLICT (content): Merge conflict in benchmarks/dynamo/timm_models.py
error: could not apply 33c15b2455... add new threshold for tinynet_a from new cudnnv9 regression, remove threshold for phlippe_resnet as it's fixed by cudnnv9
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 33c15b2455... add new threshold for tinynet_a from new cudnnv9 regression, remove threshold for phlippe_resnet as it's fixed by cudnnv9

Raised by https://github.com/pytorch/pytorch/actions/runs/9570550855

github-actions · 2024-08-18T01:57:26Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

adjust thresholds for gluon_inception_v3, beit_base_patch16_224, phlippe_resnet

0394699

pytorch-bot bot added ciflow/inductor module: dynamo labels Jun 1, 2024

Fuzzkatt marked this pull request as ready for review June 1, 2024 00:43

eqy approved these changes Jun 1, 2024

pytorchbot added the open source label Jun 1, 2024

Fuzzkatt mentioned this pull request Jun 1, 2024

[DO NOT MERGE] Fuzzkatt/12 4 inductor tolerance fixes debug #127669

Draft

Skylion007 mentioned this pull request Jun 1, 2024

UNSTABLE inductor / cuda12.4-py3.10-gcc9-sm86 / test (dynamic_inductor_timm) #127680

Closed

Skylion007 approved these changes Jun 2, 2024

Fuzzkatt added 2 commits June 13, 2024 16:02

add new threshold for tinynet_a from new cudnnv9 regression, remove threshold for phlippe_resnet as it's fixed by cudnnv9

33c15b2

Merge branch 'main' into Fuzzkatt/12_4_inductor_tolerance_fixes

d7196ca

Fuzzkatt added 2 commits June 17, 2024 14:41

change tinynet_a back to expected pass

9677b96

fix typo

ab4ca70

Merge branch 'main' of github.com:pytorch/pytorch into Fuzzkatt/12_4_inductor_tolerance_fixes

47c45ff

github-actions bot added the Stale label Aug 18, 2024
