Added hpu backend support in fsdp utils #127757

Closed
wants to merge 7 commits into from

Conversation

VRSinghHabana
Contributor

@VRSinghHabana VRSinghHabana commented Jun 3, 2024

pytorch-bot bot commented Jun 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127757

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 1eeaad9 with merge base 406f510:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels Jun 3, 2024
@VRSinghHabana
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jun 3, 2024
@cpuhrsch cpuhrsch requested a review from wconstab June 3, 2024 18:12
@cpuhrsch cpuhrsch added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jun 3, 2024
@VRSinghHabana
Contributor Author

@wconstab could you please review.

@VRSinghHabana
Contributor Author

@wconstab could you please review the change.

@@ -817,6 +817,16 @@ def _get_device_from_device_id(
            "index as the `device_id` argument."
        )
        device = torch.device("cuda", torch.cuda.current_device())
    if device == torch.device("hpu"):
Contributor

nit: it would be nicer to avoid the copy-paste here. Could the code keep a list of device types and their corresponding current_device functions, then iterate over that list and issue the warning/return inside the loop?

Contributor Author

done
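
For readers following the thread, the table-driven shape suggested above might look roughly like the sketch below. It is not the code from the patch: `resolve_device_index`, `BACKEND_DEVICES`, and the two `fake_*` functions are hypothetical names, and the fakes stand in for `torch.cuda.current_device` and `torch.hpu.current_device` so the example runs without PyTorch or accelerator hardware.

```python
import warnings
from typing import Callable, List, Optional, Tuple


# Hypothetical stand-ins for torch.cuda.current_device() and
# torch.hpu.current_device(); they let the sketch run without PyTorch.
def fake_cuda_current_device() -> int:
    return 0


def fake_hpu_current_device() -> int:
    return 0


# One entry per backend: (device type, function returning the current index).
BACKEND_DEVICES: List[Tuple[str, Callable[[], int]]] = [
    ("cuda", fake_cuda_current_device),
    ("hpu", fake_hpu_current_device),
]


def resolve_device_index(
    device_type: str, index: Optional[int], rank: int
) -> Tuple[str, Optional[int]]:
    """Fill in the backend's current device index when none was given."""
    if index is not None:
        return device_type, index
    for backend, current_device in BACKEND_DEVICES:
        if device_type == backend:
            current = current_device()
            # A single warning site serves every backend in the list.
            warnings.warn(
                f"Got device type {backend!r} without an explicit index on "
                f"rank {rank}; using the current device {current}."
            )
            return backend, current
    # Backends not in the list (for example "cpu") pass through unchanged.
    return device_type, index
```

Supporting another backend then means adding one tuple to the list rather than duplicating the warning block.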

Contributor

@wconstab wconstab left a comment

lgtm.

@awgu I wonder whether, as a follow-up, it would be possible to replace the list of backend_devices with something that iterates over all the device types in PyTorch, or do you think it's necessary to update this code for every device type we care to support?

Contributor

@awgu awgu left a comment

I think we should try to use the device handle abstraction included from previous backend work.

def _init_device_handle(

Maybe we can pass in the _FSDPState into this call:

device_from_device_id = _get_device_from_device_id(device_id, state.rank)

and then use _FSDPState._device_handle in place of torch.cuda or torch.hpu.
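
A rough sketch of the idea in this comment, under stated assumptions: `DeviceHandle` and `get_device_from_device_id` below are hypothetical simplifications, not FSDP's actual `_FSDPState._device_handle` or `_get_device_from_device_id`, and `fake_hpu` is a stand-in object rather than the real `torch.hpu` module. The point is only that the function asks a handle for the current device instead of naming `torch.cuda` or `torch.hpu` directly.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


class DeviceHandle:
    """Hypothetical wrapper exposing one backend's device functions."""

    def __init__(self, device_type: str, backend: Any) -> None:
        self.device_type = device_type
        self._backend = backend

    def current_device(self) -> int:
        # Delegate to whichever backend object the handle was built with.
        return self._backend.current_device()


def get_device_from_device_id(
    device_type: str, index: Optional[int], handle: DeviceHandle
) -> Tuple[str, Optional[int]]:
    """Resolve a missing index through the handle, with no per-backend branch."""
    if index is None and device_type == handle.device_type:
        return device_type, handle.current_device()
    return device_type, index


# Stand-in for a backend module such as torch.hpu (assumption for the demo).
fake_hpu = SimpleNamespace(current_device=lambda: 3)
handle = DeviceHandle("hpu", fake_hpu)
```

With this shape the caller decides which backend the handle wraps, so the function body does not change when a new device type is added.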

@VRSinghHabana
Contributor Author

Lint reports an attribute error. Does the backend require some registration even when it is referenced by a string name in the list?

Error (MYPY) [attr-defined]
Module has no attribute "hpu"

     809  |    )
     810  |    backend_devices = [
     811  |        ("cuda", lambda: torch.cuda.current_device()),
>>>  812  |        ("hpu", lambda: torch.hpu.current_device()),
     813  |    ]
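
One general way to sidestep this kind of `attr-defined` error for an optional backend is to look the submodule up by name with `getattr`, which mypy does not resolve statically. The sketch below illustrates that technique only; it is not claimed to be the fix that was eventually pushed. `lookup_current_device` and `fake_torch` are hypothetical, with `fake_torch` standing in for the `torch` package so the example runs without PyTorch.

```python
import types
from typing import Callable, Optional


def lookup_current_device(
    root: types.ModuleType, backend: str
) -> Optional[Callable[[], int]]:
    """Return `root.<backend>.current_device` if that backend is present."""
    # getattr with a default avoids both the static attr-defined error and
    # a runtime AttributeError when the backend is not installed.
    backend_module = getattr(root, backend, None)
    if backend_module is None:
        return None
    return getattr(backend_module, "current_device", None)


# Stand-in for the `torch` package with only a `cuda` submodule attached.
fake_torch = types.ModuleType("fake_torch")
setattr(fake_torch, "cuda", types.SimpleNamespace(current_device=lambda: 0))
```

Here `lookup_current_device(fake_torch, "cuda")` returns a callable, while `lookup_current_device(fake_torch, "hpu")` returns `None` because the stand-in has no `hpu` attribute.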

@VRSinghHabana
Contributor Author

VRSinghHabana commented Jun 18, 2024

Any pointers on resolving the lint error 'Module has no attribute "hpu"'? Suggestions are welcome.

@VRSinghHabana
Contributor Author

Pushed another patch to resolve the lint error.
I tried the _FSDPState-related change, but fetching _device_handle ended in an AttributeError, possibly because this attribute of the state is not yet set at that point in the init path.

@bsochack

@awgu @wconstab can you please take a look at this patch again?

Contributor

@awgu awgu left a comment

Sounds good.

@awgu
Contributor

awgu commented Jun 26, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 26, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@awgu
Contributor

awgu commented Jun 26, 2024

Test failure looks real

@VRSinghHabana
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fsdp_dist onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_dist && git pull --rebase)

@VRSinghHabana
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fsdp_dist onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_dist && git pull --rebase)

@VRSinghHabana
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

VRSinghHabana and others added 5 commits July 23, 2024 12:20
Replace `“important”` with `"important"` and `Taylor’s` with `Taylor's`

Fixes the obvious symptoms of pytorch#124897

Test plan: Download the [wheel](https://github.com/pytorch/pytorch/actions/runs/8833051644/artifacts/1447995459) and check that the generated `_VF.pyi` does not contain any non-ASCII characters by running the following command:
```
% python3 -c "x=open('_VF.pyi', encoding='utf-8').read();uc=[(i, x[i]) for i in range(len(x)) if ord(x[i])>127];print(uc);assert(len(uc)==0)"
```
@pytorchmergebot
Collaborator

Successfully rebased fsdp_dist onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fsdp_dist && git pull --rebase)

malfet and others added 2 commits July 24, 2024 09:31
@VRSinghHabana
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@VRSinghHabana
Contributor Author

@awgu @wconstab @jgong5, please approve for merge.

@jgong5
Collaborator

jgong5 commented Jul 27, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

bigfootjon pushed a commit that referenced this pull request Jul 31, 2024
In fsdp init_utils, adding support for hpu backend device on _get_device API.

Co-authored-by: Nikita Shulga <[email protected]>
Pull Request resolved: #127757
Approved by: https://github.com/wconstab, https://github.com/jgong5, https://github.com/awgu

(cherry picked from commit bcdba9f)
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (fsdp) release notes category topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
9 participants