Refine the logic of device construction when only device index is given #129119
Conversation
🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/129119
✅ No failures as of commit c1fdb82 with merge base e2e624a. This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/csrc/utils/python_arg_parser.h
Outdated
```diff
@@ -813,8 +813,14 @@ inline at::Device toDevice(PyObject* obj) {
         c10::DeviceType::PrivateUse1,
         static_cast<c10::DeviceIndex>(device_index));
   }
+#ifdef USE_CUDA
```
Updated #126646 with details.
Can you change this to get the current accelerator instead?
OK, using the current accelerator looks better. I added XPU to `getAccelerator` and refined the code. To facilitate review, I split the change into two PRs.
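The resolution order described here can be sketched as a simplified Python model (illustrative only, not the actual C++ implementation in `c10`; the function names and the exact backend priority are assumptions):

```python
def get_accelerator(cuda_built=False, xpu_built=False, privateuse1_registered=False):
    """Pick the device type used when only an index like torch.device(0) is given.

    A simplified model of build-time detection: at most one accelerator
    backend is expected to be compiled in, and a bare device index
    resolves to that backend's type.
    """
    if cuda_built:
        return "cuda"
    if xpu_built:
        return "xpu"
    if privateuse1_registered:
        return "privateuseone"
    return None


def device_from_index(index, **build_flags):
    """Resolve a bare device index to a '<type>:<index>' device string."""
    accel = get_accelerator(**build_flags)
    if accel is None:
        # No accelerator backend compiled in: fail loudly instead of
        # silently assuming "cuda".
        raise RuntimeError(
            f"Cannot construct a device from a bare index ({index}) "
            "when no accelerator backend is available."
        )
    return f"{accel}:{index}"
```

In the real code the detection happens in C++ via `getAccelerator`; this sketch only models the observable behavior — a bare index resolves against whichever accelerator backend the build provides, and fails when none is available.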
ghstack-source-id: a34c42c3cbaf5e492e6baa794251b4bc1dc74c10 Pull Request resolved: #129119
Update on "Refine the logic of device construction when only device index is given"

# Motivation
Before this PR, device construction yielded the `cuda` type when only a device index was given. It also returned the `PrivateUse1` type if a `PrivateUse1` backend was registered.

```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
>>> b
tensor([1, 2], device='cuda:0')
```

This works well on a CUDA GPU, but it reports a misleading device type and raises an error when running on XPU:

```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/pytorch/torch/cuda/__init__.py", line 302, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```

With this PR, the logic is refined for XPU: it returns the `xpu` device type if PyTorch is built with XPU, and raises an error if PyTorch is built without any accelerator but receives only a device index. It now works well on XPU:

```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'xpu'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
>>> b
tensor([1, 2], device='xpu:0')
```

[ghstack-poisoned]
docs/source/tensor_attributes.rst
Outdated
```diff
@@ -213,7 +213,8 @@ non-None device argument. To globally change the default device, see also

 .. note::
    For legacy reasons, a device can be constructed via a single device ordinal, which is treated
-   as a cuda device. This matches :meth:`Tensor.get_device`, which returns an ordinal for cuda
+   as a currently available device type (i.e. "cuda" if cuda is available, "xpu" if xpu is available).
```
Given that we're going to have a few features relying on it as the default (pin memory for example), I think we should make "accelerator" a public concept in this doc.
I think we can:
- Have one paragraph in https://pytorch.org/docs/stable/torch.html that introduces the concept of accelerator and lists all the current ones (based on the list in c10).
- Have other docs like this one just mention the "current accelerator" with a link to the paragraph above.
WDYT?
Sounds great to me!
Update on "Refine the logic of device construction when only device index is given"

# Motivation
Before this PR, device construction yielded the `cuda` type when only a device index was given. It also returned the `PrivateUse1` type if a `PrivateUse1` backend was registered.

```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
>>> b
tensor([1, 2], device='cuda:0')
```

This works well on a CUDA GPU, but it reports a misleading device type and raises an error when running on XPU:

```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/pytorch/torch/cuda/__init__.py", line 302, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```

With this PR, the logic is refined to use the currently available device type instead.

[ghstack-poisoned]
ghstack-source-id: 665d24a5bcbf922c1125c4fa753066bedbe1968e Pull Request resolved: pytorch#129119
ghstack-source-id: c5b3829bd33f8b32a0761b9d850630028d7c1d01 Pull Request resolved: #129119
ghstack-source-id: 2d88b54f74342a330c1bee53d2dba94428582bbf Pull Request resolved: #129119
Folks, just a heads-up: there was an assumption that this PR would address huggingface/transformers#31941. Unfortunately, it does not fully solve it. I believe this PR provides an essential change, but it alone is not enough to get the described case working. That said, at the moment I don't know what the remainder of the root cause for 31941 is. It might be that HF will need another fix somewhere, or PyTorch; I can't say right now.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Update on this. The root cause is a bug in the HF safetensors library, see huggingface/safetensors#499. This PR indeed provides the required prerequisite for the fix. So, thank you for merging @gujinghui.
Fixes: huggingface#499
Fixes: huggingface/transformers#31941

In some cases only a device index is given when querying a device. In this case both PyTorch and Safetensors returned 'cuda:N' by default. This causes runtime failures if the user actually runs on a non-CUDA device and does not have CUDA at all. This was recently addressed on the PyTorch side by [1]: starting from PyTorch 2.5, calling 'torch.device(N)' returns the current device instead of a CUDA device. This commit makes a similar change to Safetensors: if only a device index is given, Safetensors queries and returns the device by calling 'torch.device(N)'. This change is backward compatible, since this call returns 'cuda:N' on PyTorch <= 2.4, which aligns with the previous Safetensors behavior.

See [1]: pytorch/pytorch#129119

Signed-off-by: Dmitry Rogozhkin <[email protected]>
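The backward-compatibility argument in this commit message can be sketched as a small Python model (a hypothetical helper, not actual Safetensors code; the version tuples and accelerator names are assumptions for illustration):

```python
def resolve_bare_index(index, torch_version, current_accelerator):
    """Model what `torch.device(index)` returns for a bare device index.

    PyTorch <= 2.4: a bare index always meant `cuda:<index>`.
    PyTorch >= 2.5: a bare index resolves to the current accelerator type,
    so delegating to `torch.device(index)` keeps the old behavior on old
    PyTorch while fixing non-CUDA devices on new PyTorch.
    """
    if torch_version < (2, 5):
        return f"cuda:{index}"  # legacy behavior, ignores the actual backend
    return f"{current_accelerator}:{index}"  # resolves via current accelerator
```

Delegating the decision to `torch.device(N)` is what makes the Safetensors change backward compatible: the same call site yields the legacy `cuda:N` on older PyTorch and the correct accelerator-qualified device on 2.5+.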
Refine the logic of device construction when only device index is given (pytorch#129119)

# Motivation
Before this PR, device construction yielded the `cuda` type when only a device index was given. It also returned the `PrivateUse1` type if a `PrivateUse1` backend was registered.

```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
>>> b
tensor([1, 2], device='cuda:0')
```

This works well on a CUDA GPU, but it reports a misleading device type and raises an error when running on XPU:

```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/pytorch/torch/cuda/__init__.py", line 302, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```

With this PR, the logic is refined to use the currently available device type instead.

Pull Request resolved: pytorch#129119
Approved by: https://github.com/albanD, https://github.com/gujinghui, https://github.com/EikanWang
ghstack dependencies: pytorch#129463, pytorch#129205, pytorch#129363
Stack from ghstack (oldest at bottom):
# Motivation
Before this PR, device construction yielded the `cuda` type when only a device index was given. It also returned the `PrivateUse1` type if a `PrivateUse1` backend was registered.

This works well on a CUDA GPU, but it reports a misleading device type and raises an error when running on XPU.

With this PR, the logic is refined to use the currently available device type instead.