
add uuid in cudaDeviceProperties #125083

Closed
wants to merge 12 commits

Conversation

jeffdaily
Collaborator

@jeffdaily jeffdaily commented Apr 26, 2024

@jeffdaily jeffdaily added module: cuda Related to torch.cuda, and CUDA support in general module: rocm AMD GPU support for Pytorch release notes: rocm mandatorylabel release notes: cuda release notes category rocm This tag is for PRs from ROCm team ciflow/rocm labels Apr 26, 2024

pytorch-bot bot commented Apr 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125083

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit bd9edca with merge base a76faff:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@cpuhrsch cpuhrsch requested a review from albanD April 30, 2024 19:47
@cpuhrsch
Contributor

There are a couple of failing tests. HUD indicates the errors are related: https://hud.pytorch.org/pr/125083

@cpuhrsch cpuhrsch added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 30, 2024
@cpuhrsch cpuhrsch self-requested a review April 30, 2024 19:50
@jeffdaily
Collaborator Author

There are a couple of failing tests. HUD indicates the errors are related: https://hud.pytorch.org/pr/125083

Apologies, I should have moved it back to draft after my first commit, once I realized this wasn't as straightforward as anticipated. The first commit passed, but there was no unit test coverage, and adding a unit test proved the first try wrong. Many commits later, I finally got the implementation and linting right. Please review now.

@jeffdaily jeffdaily requested a review from albanD May 1, 2024 19:19
@stellaraccident

Thanks, Jeff. Nit: maybe add a class-level comment saying that the str format is expected to match the format used by nvidia-smi and can be counted on for that (i.e., it is stable and will only be updated if future changes require it to match a different format from that tool)? If doing that, I think it would be more appropriate to implement it as __str__ vs __repr__ (since repr is usually just for debugging).

@jeffdaily
Collaborator Author

Thanks, Jeff. Nit: maybe add a class-level comment saying that the str format is expected to match the format used by nvidia-smi and can be counted on for that (i.e., it is stable and will only be updated if future changes require it to match a different format from that tool)? If doing that, I think it would be more appropriate to implement it as __str__ vs __repr__ (since repr is usually just for debugging).

I opted to drop the "GPU-" prefix that would normally appear with nvidia-smi; the way I have it implemented here is a pure UUID format. Would it be better to have left the GPU- prefix as part of the string? Since this PR hasn't had maintainer approval yet, we have time to put the GPU- prefix back in if you think it is more appropriate. And I can certainly swap str for repr too.
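The formatting question above can be sketched with the standard library's uuid module; `format_device_uuid` and the sample bytes here are illustrative only, not the PR's actual code:

```python
# Hedged sketch (not the PR's implementation): turning the 16 raw uuid bytes
# exposed by cudaDeviceProp into the canonical 8-4-4-4-12 form, with the
# nvidia-smi style "GPU-" prefix left optional as discussed above.
import uuid

def format_device_uuid(raw: bytes, gpu_prefix: bool = False) -> str:
    """raw is the 16-byte uuid field; gpu_prefix mimics nvidia-smi output."""
    canonical = str(uuid.UUID(bytes=raw))
    return f"GPU-{canonical}" if gpu_prefix else canonical

raw = bytes.fromhex("deadbeef000000000000000000000001")
print(format_device_uuid(raw))        # deadbeef-0000-0000-0000-000000000001
print(format_device_uuid(raw, True))  # GPU-deadbeef-0000-0000-0000-000000000001
```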

@albanD
Collaborator

albanD commented May 2, 2024

I would be curious whether @ptrblck has a strong opinion about keeping the GPU- prefix here, or any other comments?

@albanD albanD requested a review from eqy May 2, 2024 17:11
Collaborator

@albanD albanD left a comment

Ok SGTM
We can revisit if there is any new opinion from @eqy later.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort, and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@izaitsevfb
Contributor

@pytorchbot revert -m "Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t'" -c ghfirst

Looks like the same issue as here; how was it fixed?

buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:989:49: error: no member named 'uuid' in 'hipDeviceProp_t'
               << ", uuid=" << std::string(prop.uuid.bytes, 16) << ")";
                                           ~~~~ ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:979:47: error: no member named 'uuid' in 'hipDeviceProp_t'
      .def_readonly("uuid", &hipDeviceProp_t::uuid)
                             ~~~~~~~~~~~~~~~~~^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:982:9: error: variable 'stream' cannot be implicitly captured in a lambda with no capture-default specified
        stream << "_CudaDeviceProperties(name='" << prop.name
        ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:981:28: note: 'stream' declared here
        std::ostringstream stream;
                           ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:980:24: note: lambda expression begins here
      .def("__repr__", [](const hipDeviceProp_t& prop) {
                       ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:980:25: note: capture 'stream' by reference
      .def("__repr__", [](const hipDeviceProp_t& prop) {
                        ^
                        &stream
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:980:25: note: default capture by reference
      .def("__repr__", [](const hipDeviceProp_t& prop) {
                        ^
                        &
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:990:16: error: variable 'stream' cannot be implicitly captured in a lambda with no capture-default specified
        return stream.str();
               ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:981:28: note: 'stream' declared here
        std::ostringstream stream;
                           ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:980:24: note: lambda expression begins here
      .def("__repr__", [](const hipDeviceProp_t& prop) {
                       ^
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:980:25: note: capture 'stream' by reference
      .def("__repr__", [](const hipDeviceProp_t& prop) {
                        ^
                        &stream
buck-out/v2/gen/fbcode/5839ccd20a633894/caffe2/__fb_C_impl_hipify_gen_eqsb_torch/csrc/cuda/Module.cpp__/out/torch/csrc/cuda/Module.cpp:980:25: note: default capture by reference
      .def("__repr__", [](const hipDeviceProp_t& prop) {
                        ^
                        &
4 errors generated.

For Meta folks, see D57138027

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request May 9, 2024
This reverts commit 3f36145.

Reverted #125083 on behalf of https://github.com/izaitsevfb due to: Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t' (comment)
@pytorchmergebot
Collaborator

@jeffdaily your PR has been successfully reverted.

@jeffdaily
Collaborator Author

Ouch, two reverts. Can any Meta folks help shepherd this one? The uuid attribute was added to hipDeviceProp_t in ROCm 6.0, which implies Meta isn't internally up to that version of ROCm yet. I could perhaps add a CMake config-time check for the attribute, some #ifdef protection, and an alternative implementation. Please advise.
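On the consumer side, Python code that wants the uuid can stay robust to builds where the binding was compiled out. A small hedged sketch; `FakeProps` and `stable_device_id` are made-up stand-ins for the real torch.cuda.get_device_properties(0) result and caller:

```python
# Hedged sketch: callers can probe for the attribute with getattr so the same
# code works on builds where the uuid binding was guarded out by an #ifdef.
class FakeProps:
    """Stand-in for a device-properties object; real ones may lack `uuid`."""
    name = "gfx90a"

def stable_device_id(props) -> str:
    # Prefer the stable uuid when the build exposes it, else fall back to name.
    u = getattr(props, "uuid", None)
    return str(u) if u is not None else props.name

print(stable_device_id(FakeProps()))  # gfx90a
```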

@jeffdaily
Collaborator Author

@malfet is there some #ifdef I could add that would remove this feature for your internal builds? I would like to land this but I don't know how to work around your internal ROCm builds failing with this PR.

@stellaraccident

@malfet is there some #ifdef I could add that would remove this feature for your internal builds? I would like to land this but I don't know how to work around your internal ROCm builds failing with this PR.

+1 - we've got multiple AMD GPU integrations stalling on the inability to tie device identifiers together concretely and would really like to see this PR merged (and stay merged).

@malfet
Contributor

malfet commented May 22, 2024

@malfet is there some #ifdef I could add that would remove this feature for your internal builds? I would like to land this but I don't know how to work around your internal ROCm builds failing with this PR.

FBCODE_CAFFE2 sounds like the one you could use, but what exactly is the problem with the internal builds? If it's a ROCm version problem, couldn't you just guard the code with a ROCm version check?

@xw285cornell do you know of any plans to update the internal runtime to ROCm 6?

@jeffdaily
Collaborator Author

@malfet is there some #ifdef I could add that would remove this feature for your internal builds? I would like to land this but I don't know how to work around your internal ROCm builds failing with this PR.

FBCODE_CAFFE2 sounds like the one you should use, but what exactly is the problem with the internal builds?

See the earlier comment for why this PR was reverted. It's a build issue: the uuid attribute was added in ROCm 6.0 as a backward-breaking change, as part of the major release bump. Your internal build seems to be based on a version of ROCm that predates this change. I don't have visibility into why your internal ROCm version isn't >= 6.0, and I don't want this PR to force you to upgrade; that's why I was asking for a work-around. Unless there's something else you can do as part of your Phabricator import process to help land this.

@jeffdaily
Collaborator Author

@malfet please double-check my use of FBCODE in bd9edca.

Any chance we could double-check if this will break Meta-internal builds before landing again?

@@ -909,6 +909,50 @@ PyObject* THCPModule_cudaGetSyncDebugMode(PyObject* self, PyObject* noargs) {
static void registerCudaDeviceProperties(PyObject* module) {
// Add _cudaDeviceProperties class to torch._C
auto m = py::handle(module).cast<py::module>();
// until internal build is using a rocm version with uuid attr
#ifndef FBCODE_CAFFE2
Contributor

Why not replace it with a definition specific to the code in question, rather than guard it with FBCODE_CAFFE2?

Suggested change
#ifndef FBCODE_CAFFE2
#if defined(USE_CUDA) || (defined(USE_ROCM) && TORCH_HIP_VERSION >= 600)

Collaborator Author

I wish I could, but your internal build is at a ROCm version that claims to be 6.0 but isn't quite: TORCH_HIP_VERSION would be >= 600, but the uuid attribute would still not be present.

@facebook-github-bot
Contributor

@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jeffdaily
Collaborator Author

@malfet any progress landing this?

@jeffdaily
Collaborator Author

@malfet ping, any progress landing this?

@stellaraccident

@malfet ping, any progress landing this?

+1, this is blocking some integrations we are doing.

@jeffdaily
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels
ciflow/rocm ciflow/trunk Trigger trunk jobs on your pull request Merged module: cuda Related to torch.cuda, and CUDA support in general module: rocm AMD GPU support for PyTorch open source release notes: cuda release notes category release notes: rocm mandatorylabel Reverted rocm This tag is for PRs from ROCm team triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can the CUDA device LUID be exposed as part of _CudaDeviceProperties?