
Allow get attributes on DDP similar to FSDP #128620

Open · wants to merge 2 commits into base: main
Conversation

@mayank31398 (Contributor) commented Jun 13, 2024

FSDP implements the following logic, but it is missing from DDP. This PR adds an equivalent function to DDP.

    def __getattr__(self, name: str) -> Any:
        """Forward missing attributes to the wrapped module."""
        try:
            return super().__getattr__(name)  # defer to nn.Module's logic
        except AttributeError:
            return getattr(self._fsdp_wrapped_module, name)
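The forwarding pattern above can be sketched generically in plain Python. This is an illustrative sketch, not DDP's actual implementation; the `Wrapper`/`inner` names are hypothetical:

```python
class Wrapper:
    """Minimal attribute-forwarding wrapper (illustrative only)."""

    def __init__(self, inner):
        self.inner = inner  # the wrapped object

    def __getattr__(self, name):
        # __getattr__ is only invoked when normal attribute lookup
        # fails, so attributes defined on Wrapper itself take precedence.
        return getattr(self.inner, name)


class Model:
    hidden_size = 16

    def greet(self):
        return "hello"


w = Wrapper(Model())
# w.hidden_size and w.greet() are forwarded to the wrapped Model,
# while w.inner resolves on Wrapper itself without __getattr__.
```

The real DDP/FSDP versions additionally defer to `nn.Module.__getattr__` first, since `nn.Module` keeps parameters, buffers, and submodules in special dicts rather than in `__dict__`.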

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

@pytorch-bot bot commented Jun 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128620

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 14 Unrelated Failures

As of commit b32c632 with merge base c63ccea:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Jun 13, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jun 13, 2024
@fegin fegin self-requested a review June 13, 2024 16:48
@fegin (Contributor) commented Jun 13, 2024

@awgu Does __getattr__ cause any issues for FSDP? I don't recall any. If not, I think it is okay to land the PR.

@mikaylagawarecki mikaylagawarecki removed their request for review June 13, 2024 16:55
@awgu (Contributor) commented Jun 13, 2024

@fegin I think that this change is generally okay, but I wonder what the motivation is. I am guessing it is mainly to make DDP more similar to FSDP? (Otherwise, probably everyone is already doing ddp_module.module.<attr>.)

This approach of attribute forwarding can have non-intuitive behavior in edge cases. For example, I recently hit an issue where we try to access a property p on FullyShardedDataParallel, but inside p's getter we hit a real AttributeError. In that case, the except AttributeError would misleadingly report that the wrapped module does not have attribute p.
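The edge case described above can be reproduced in plain Python without torch; the class names here are hypothetical stand-ins for the wrapper and wrapped module:

```python
class Inner:
    """Stand-in for the wrapped module."""


class Outer:
    """Stand-in for the forwarding wrapper (e.g. FSDP/DDP)."""

    def __init__(self):
        self.inner = Inner()

    @property
    def p(self):
        # A real bug inside the getter: Inner has no 'missing_attr',
        # so this raises AttributeError for an unrelated reason.
        return self.inner.missing_attr

    def __getattr__(self, name):
        # Because the property getter raised AttributeError, Python
        # falls back here, and the final error blames the wrapped
        # object for lacking 'p' instead of surfacing the real bug.
        return getattr(self.inner, name)
```

Accessing `Outer().p` raises `AttributeError: 'Inner' object has no attribute 'p'`, hiding the underlying `missing_attr` failure inside the getter.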

@mayank31398 (Contributor, Author)
Yeah @awgu, there can be non-intuitive behaviors; I completely agree, since I have hit similar problems in the past. This change is just to make DDP similar to the FSDP API, since the NO_SHARD route is slated for deprecation in FSDP.

Also, in another PR, I want to add DDP.clip_grad_norm_ similar to FSDP.clip_grad_norm_. There is existing code that could switch between DDP and FSDP with no changes once these land.

@awgu (Contributor) commented Jun 13, 2024

DDP.clip_grad_norm_

I would caution against this. DDP should just use torch.nn.utils.clip_grad_norm_. I do not see why we should attach it onto the module just as syntactic sugar. It feels like this could exist inside the trainer (which is what many trainers I have seen do).
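The standalone utility works the same way whether or not the model is wrapped, which is the argument for keeping it out of the DDP class. A minimal sketch with a plain module (a DDP-wrapped model would pass `ddp_model.parameters()` in exactly the same way):

```python
import torch
from torch.nn.utils import clip_grad_norm_

# Any module works here; DDP adds nothing to this call.
model = torch.nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()

# clip_grad_norm_ returns the total gradient norm *before* clipping
# and rescales the gradients in place so that their total norm is at
# most max_norm.
total_norm = clip_grad_norm_(model.parameters(), max_norm=1.0)
```

(FSDP needs its own `clip_grad_norm_` only because gradients are sharded across ranks; for DDP, gradients are full-sized and replicated, so the generic utility is sufficient.)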

@mayank31398 (Contributor, Author)

Makes sense, @awgu. Thanks for the comments.

@soulitzer soulitzer added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jun 14, 2024
@mayank31398 (Contributor, Author)

I am not sure what the failing test here is, @awgu; it doesn't seem related to the PR.

@mayank31398 (Contributor, Author)

Hi, pinging again on this PR.

@mayank31398 (Contributor, Author)

@awgu is this still relevant?

@awgu previously approved these changes Jun 28, 2024 and left a comment:

This is okay with me.

    """Forward missing attributes to the wrapped module."""
    try:
        return super().__getattr__(name)  # defer to nn.Module's logic
    except AttributeError:

Note for the future: if, when trying to get a property attribute p, p's implementation raises an AttributeError, and self.module does not have p either, then the error message here will be non-intuitive: it would say that self.module does not have the attribute p, whereas the useful error would be that accessing ddp_module.p raised an AttributeError inside p's getter.

@awgu (Contributor) commented Jun 28, 2024

The pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu) failure is unrelated (it is a foreach ops test, whereas this PR is about DDP).

@awgu (Contributor) commented Jun 28, 2024

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 28, 2024
@pytorchmergebot (Collaborator)

Merge failed

Reason: This PR needs a release notes: label.
If your changes are user facing and intended to be part of the release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: raised by workflow job.

@awgu awgu added the release notes: distributed (ddp) release notes category label Jun 28, 2024
@awgu (Contributor) commented Jun 28, 2024

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following check: pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced debugging: check the merge workflow status.

@jeanschmidt (Contributor)

It seems that this PR introduced the following error: https://github.com/pytorch/pytorch/actions/runs/9720692485/job/26833212318

@jeanschmidt (Contributor)

@pytorchbot revert -m "Reverting in order to see if the trunk error on inductor is fixed" -c nosignal

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

@mayank31398 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jul 1, 2024
This reverts commit 065c386.

Reverted #128620 on behalf of @jeanschmidt due to: "Reverting in order to see if the trunk error on inductor is fixed"
@pytorch-bot pytorch-bot bot dismissed awgu’s stale review July 1, 2024 17:57

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

@mayank31398 (Contributor, Author)

Yikes!! @jeanschmidt
Hmm, I guess we can drop this for now. But I think a cleaner API is needed for DDP; currently, there are too many differences between DDP and FSDP.

@albanD albanD removed their request for review July 2, 2024 18:16
pytorchmergebot added a commit to khushi-411/pytorch that referenced this pull request Jul 2, 2024
This reverts commit 065c386.

Reverted pytorch#128620 on behalf of @jeanschmidt due to: "Reverting in order to see if the trunk error on inductor is fixed"
Labels
ciflow/trunk · Merged · oncall: distributed · open source · release notes: distributed (ddp) · Reverted · triaged

7 participants