
add uuid in cudaDeviceProperties #125083

Closed
wants to merge 12 commits

Conversation

jeffdaily
Collaborator

@jeffdaily jeffdaily commented Apr 26, 2024

@jeffdaily jeffdaily added module: cuda Related to torch.cuda, and CUDA support in general module: rocm AMD GPU support for Pytorch release notes: rocm mandatorylabel release notes: cuda release notes category rocm This tag is for PRs from ROCm team ciflow/rocm labels Apr 26, 2024

pytorch-bot bot commented Apr 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125083

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit bd9edca with merge base a76faff:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@cpuhrsch cpuhrsch requested a review from albanD April 30, 2024 19:47
@cpuhrsch
Contributor

There are a couple of failing tests. HUD indicates the errors are related: https://hud.pytorch.org/pr/125083

@cpuhrsch cpuhrsch added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 30, 2024
@cpuhrsch cpuhrsch self-requested a review April 30, 2024 19:50
@jeffdaily
Collaborator Author

There are a couple of failing tests. HUD indicates the errors are related: https://hud.pytorch.org/pr/125083

Apologies, I should have moved it back to draft after my first commit, once I realized this wasn't as straightforward as anticipated. The first commit passed, but there was no unit test coverage, and adding a unit test proved the first try wrong. Many commits later, I finally got the implementation and linting right. Please review now.

@jeffdaily jeffdaily requested a review from albanD May 1, 2024 19:19
@stellaraccident

Thanks, Jeff. Nit: maybe add a class-level comment saying that the str format is expected to match the format used by nvidia-smi and can be counted on for that (i.e., it is stable and will only be updated if future changes require it to match a different format from that tool)? If doing that, I think it would be more appropriate to implement it as __str__ vs __repr__ (since repr is usually just for debugging).

@jeffdaily
Collaborator Author

Thanks, Jeff. Nit: maybe add a class-level comment saying that the str format is expected to match the format used by nvidia-smi and can be counted on for that (i.e., it is stable and will only be updated if future changes require it to match a different format from that tool)? If doing that, I think it would be more appropriate to implement it as __str__ vs __repr__ (since repr is usually just for debugging).

I opted to drop the "GPU-" prefix that would normally appear with nvidia-smi; the way I have it implemented here is a pure UUID format. Would it be better to have left the GPU- prefix as part of the string? Since this PR hasn't had maintainer approval yet, we have time to put the GPU- prefix back in if you think it is more appropriate. And I can certainly swap str for repr too.
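The formatting question above can be sketched with the standard library's uuid module; `format_device_uuid` and the sample bytes here are illustrative only, not the PR's actual code:

```python
# Hedged sketch (not the PR's implementation): turning the 16 raw uuid bytes
# exposed by cudaDeviceProp into the canonical 8-4-4-4-12 form, with the
# nvidia-smi style "GPU-" prefix left optional as discussed above.
import uuid

def format_device_uuid(raw: bytes, gpu_prefix: bool = False) -> str:
    """raw is the 16-byte uuid field; gpu_prefix mimics nvidia-smi output."""
    canonical = str(uuid.UUID(bytes=raw))
    return f"GPU-{canonical}" if gpu_prefix else canonical

raw = bytes.fromhex("deadbeef000000000000000000000001")
print(format_device_uuid(raw))        # deadbeef-0000-0000-0000-000000000001
print(format_device_uuid(raw, True))  # GPU-deadbeef-0000-0000-0000-000000000001
```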

@albanD
Collaborator

albanD commented May 2, 2024

I would be curious whether @ptrblck has a strong opinion about keeping the GPU- prefix here, or any other comments?

@albanD albanD requested a review from eqy May 2, 2024 17:11
Collaborator

@albanD albanD left a comment

Ok SGTM
We can revisit if there is any new opinion from @eqy later.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort, and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@izaitsevfb
Contributor

@pytorchbot revert -m "Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t'" -c ghfirst

Looks like the same issue as here; how was it fixed?

buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:989:49: error: no member named 'uuid' in 'hipDeviceProp_t'
               << ", uuid=" << std::string(prop.uuid.bytes, 16) << ")";
                                           ~~~~ ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:979:47: error: no member named 'uuid' in 'hipDeviceProp_t'
      .def_readonly("uuid", &hipDeviceProp_t::uuid)
                             ~~~~~~~~~~~~~~~~~^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:982:9: error: variable 'stream' cannot be implicitly captured in a lambda with no capture-default specified
        stream << "_CudaDeviceProperties(name='" << prop.name
        ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:981:28: note: 'stream' declared here
        std::ostringstream stream;
                           ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:980:24: note: lambda expression begins here
      .def("__repr__", [](const hipDeviceProp_t& prop) {
                       ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:980:25: note: capture 'stream' by reference
      .def("__repr__", [](const hipDeviceProp_t& prop) {
                        ^
                        &stream
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:980:25: note: default capture by reference
      .def("__repr__", [](const hipDeviceProp_t& prop) {
                        ^
                        &
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:990:16: error: variable 'stream' cannot be implicitly captured in a lambda with no capture-default specified
        return stream.str();
               ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:981:28: note: 'stream' declared here
        std::ostringstream stream;
                           ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:980:24: note: lambda expression begins here
      .def("__repr__", [](const hipDeviceProp_t& prop) {
                       ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:980:25: note: capture 'stream' by reference
      .def("__repr__", [](const hipDeviceProp_t& prop) {
                        ^
                        &stream
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:980:25: note: default capture by reference
      .def("__repr__", [](const hipDeviceProp_t& prop) {
                        ^
                        &
4 errors generated.

For Meta folks, see D57138027

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request May 9, 2024
This reverts commit 3f36145.

Reverted #125083 on behalf of https://github.com/izaitsevfb due to: Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t' (comment)
@pytorchmergebot
Collaborator

@jeffdaily your PR has been successfully reverted.

@jeffdaily
Collaborator Author

Ouch, two reverts. Can any Meta folks help shepherd this one? The uuid attribute was added to hipDeviceProp_t in ROCm 6.0, which implies Meta isn't internally up to that version of ROCm yet. I could perhaps add a CMake config-time check for the attribute, some #ifdef protection, and an alternative implementation. Please advise.
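On the consumer side, Python code that wants the uuid can stay robust to builds where the binding was compiled out. A small hedged sketch; `FakeProps` and `stable_device_id` are made-up stand-ins for the real torch.cuda.get_device_properties(0) result and caller:

```python
# Hedged sketch: callers can probe for the attribute with getattr so the same
# code works on builds where the uuid binding was guarded out by an #ifdef.
class FakeProps:
    """Stand-in for a device-properties object; real ones may lack `uuid`."""
    name = "gfx90a"

def stable_device_id(props) -> str:
    # Prefer the stable uuid when the build exposes it, else fall back to name.
    u = getattr(props, "uuid", None)
    return str(u) if u is not None else props.name

print(stable_device_id(FakeProps()))  # gfx90a
```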

@jeffdaily
Collaborator Author

@malfet is there some #ifdef I could add that would remove this feature for your internal builds? I would like to land this but I don't know how to work around your internal ROCm builds failing with this PR.

@stellaraccident

@malfet is there some #ifdef I could add that would remove this feature for your internal builds? I would like to land this but I don't know how to work around your internal ROCm builds failing with this PR.

+1 - we've got multiple AMD GPU integrations stalling on the inability to tie device identifiers together concretely and would really like to see this PR merged (and stay merged).

@malfet
Contributor

malfet commented May 22, 2024

@malfet is there some #ifdef I could add that would remove this feature for your internal builds? I would like to land this but I don't know how to work around your internal ROCm builds failing with this PR.

FBCODE_CAFFE2 sounds like the one you could use, but what exactly is the problem with the internal builds? If it's a ROCm version problem, couldn't you just guard the code with a ROCm version check?

@xw285cornell do you know of any plans to update the internal runtime to ROCm 6?

@jeffdaily
Collaborator Author

@malfet is there some #ifdef I could add that would remove this feature for your internal builds? I would like to land this but I don't know how to work around your internal ROCm builds failing with this PR.

FBCODE_CAFFE2 sounds like the one you should use, but what exactly is the problem with the internal builds?

See the earlier comment for why this PR was reverted. It's a build issue: the uuid attribute was added in ROCm 6.0 as a backward-breaking change, as part of the major release bump. Your internal build seems to be based on a version of ROCm that predates this change. I don't have visibility into why your internal ROCm version isn't >= 6.0, and I don't want this PR to force you to upgrade; that's why I was asking for a work-around. Unless there's something else you can do as part of your Phabricator import process to help land this.

@jeffdaily
Collaborator Author

@malfet please double-check my use of FBCODE in bd9edca.

Any chance we could double-check if this will break Meta-internal builds before landing again?

@@ -909,6 +909,50 @@ PyObject* THCPModule_cudaGetSyncDebugMode(PyObject* self, PyObject* noargs) {
static void registerCudaDeviceProperties(PyObject* module) {
// Add _cudaDeviceProperties class to torch._C
auto m = py::handle(module).cast<py::module>();
// until internal build is using a rocm version with uuid attr
#ifndef FBCODE_CAFFE2
Contributor

Why not replace it with a definition specific to the code in question, rather than guard it with FBCODE_CAFFE2?

Suggested change
#ifndef FBCODE_CAFFE2
#if defined(USE_CUDA) || (defined(USE_ROCM) && TORCH_HIP_VERSION >= 600)

Collaborator Author

I wish I could, but your internal build is at a ROCm version that claims to be 6.0 but isn't quite: TORCH_HIP_VERSION would be >= 600, but the uuid attribute would still not be present.

@facebook-github-bot
Contributor

@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jeffdaily
Collaborator Author

@malfet any progress landing this?

@jeffdaily
Collaborator Author

@malfet ping, any progress landing this?

@stellaraccident

@malfet ping, any progress landing this?

+1, this is blocking some integrations we are doing.

@jeffdaily
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels
ciflow/rocm ciflow/trunk Trigger trunk jobs on your pull request Merged module: cuda Related to torch.cuda, and CUDA support in general module: rocm AMD GPU support for PyTorch open source release notes: cuda release notes category release notes: rocm mandatorylabel Reverted rocm This tag is for PRs from ROCm team triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can the CUDA device LUID be exposed as part of _CudaDeviceProperties?