Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[custom_op] stop using nonlocals to store information (#128547) #128616

Merged
merged 1 commit into from
Jun 19, 2024

Conversation

zou3519
Copy link
Contributor

@zou3519 zou3519 commented Jun 13, 2024

Fixes #128544 Fixes #128535

We had a problem with multithreading where the nonlocals were being clobbered. In the first place, we stored these nonlocals because we wanted to ferry information from an autograd.Function.apply to autograd.Function.forward.

Our new approach is:

  • pass the information directly as an input to the autograd.Function.apply. This means that the autograd.Function.forward will receive the information too.
  • this messes up ctx.needs_input_grad, which has an element per input to forward. The user should not see the additional information we passed. We fix this by temporarily overriding ctx.needs_input_grad to the right thing.
  • this exposed a bug in that ctx.needs_input_grad wasn't correct for TensorList inputs. This PR fixes that too.

Test Plan:

Fixes #ISSUE_NUMBER

Fixes #128544
Fixes #128535

We had a problem with multithreading where the nonlocals were being
clobbered. In the first place, we stored these nonlocals because we
wanted to ferry information from an autograd.Function.apply to
autograd.Function.forward.

Our new approach is:
- pass the information directly as an input to the
  autograd.Function.apply. This means that the autograd.Function.forward
  will receive the information too.
- this messes up ctx.needs_input_grad, which has an element per input to
  forward. The user should not see the additional information we passed.
  We fix this by temporarily overriding ctx.needs_input_grad to the
  right thing.
- this exposed a bug in that ctx.needs_input_grad wasn't correct for
  TensorList inputs. This PR fixes that too.

Test Plan:
- existing and new tests
Pull Request resolved: #128547
Approved by: https://github.com/williamwen42, https://github.com/soulitzer
Copy link

pytorch-bot bot commented Jun 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128616

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 2ce7865 with merge base b66e3f0 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@atalman atalman merged commit e7dde73 into release/2.4 Jun 19, 2024
103 of 104 checks passed
@github-actions github-actions bot deleted the rzou/2.4/fix_custom_op_local branch July 20, 2024 01:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants