add uuid in cudaDeviceProperties #125083
Conversation
Replaces pytorch#99967. Fixes pytorch#99903.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125083
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (4 Unrelated Failures) As of commit bd9edca with merge base a76faff:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
There are a couple of failing tests. HUD indicates the errors are related: https://hud.pytorch.org/pr/125083
Apologies, I should have moved this back to draft after my first commit once I realized it wasn't as straightforward as anticipated. The first commit passed, but there was no unit test coverage, and adding a unit test proved the first attempt wrong. Many commits later, I finally have a correct implementation and clean linting. Please review now.
Thanks, Jeff. Nit: maybe add a class-level comment saying that the str format is expected to match the format used by nvidia-smi and can be counted on for that (i.e., it is stable and will only be updated if future changes require it to match a different format from that tool)? If doing that, I think it would be more appropriate to implement as
I opted to drop the "GPU-" prefix that would normally appear with nvidia-smi; the way I have it implemented here is a pure UUID format. Would it have been better to keep the GPU- prefix as part of the string? Since this PR hasn't had maintainer approval yet, we have time to put the GPU- prefix back in if you think it is more appropriate. And I can certainly swap str for repr too.
I would be curious whether @ptrblck has a strong opinion about keeping the GPU- prefix here. Or any other comments?
Ok SGTM
We can revisit if there is any new opinion from @eqy later.
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot revert -m "Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t'" -c ghfirst
Looks like the same issue as here; how was it fixed?
For Meta folks, see D57138027
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 3f36145. Reverted #125083 on behalf of https://github.com/izaitsevfb due to "Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t'".
@jeffdaily your PR has been successfully reverted.
Ouch, 2 reverts. Can any Meta folks help shepherd this one? The uuid attribute was added to hipDeviceProp_t in ROCm 6.0, which implies Meta isn't yet on that version of ROCm internally. I could perhaps add a CMake config-time check for the attribute, with some #ifdef protection and an alternative implementation. Please advise.
@malfet is there some #ifdef I could add that would remove this feature for your internal builds? I would like to land this but I don't know how to work around your internal ROCm builds failing with this PR.
+1 - we've got multiple AMD GPU integrations stalling on the inability to tie device identifiers together concretely and would really like to see this PR merged (and stay merged).
FBCODE_CAFFE2 sounds like the one you could use, but what exactly is the problem with the internal builds? If it's a ROCm version problem, couldn't you just guard the code with a ROCm version check? @xw285cornell do you know of the plans to update the internal runtime to ROCm 6?
See the earlier comment for why this PR was reverted. It's a build issue: the uuid attribute was added in ROCm 6.0 as a backward-breaking change as part of the major release bump, and your internal build seems to be based on a version of ROCm that predates this change. I don't have visibility into why your internal ROCm version isn't >= 6.0, and I don't want this PR to force you to upgrade; that's why I was asking for a workaround. Unless there's something else you can do as part of your Phabricator import process to help land this.
@@ -909,6 +909,50 @@ PyObject* THCPModule_cudaGetSyncDebugMode(PyObject* self, PyObject* noargs) {
 static void registerCudaDeviceProperties(PyObject* module) {
   // Add _cudaDeviceProperties class to torch._C
   auto m = py::handle(module).cast<py::module>();
+  // until internal build is using a rocm version with uuid attr
+  #ifndef FBCODE_CAFFE2
Why not replace it with a definition specific to the code in question rather than guarding it with FBCODE_CAFFE2?

-#ifndef FBCODE_CAFFE2
+#if defined(USE_CUDA) || (defined(USE_ROCM) && TORCH_HIP_VERSION >= 600)
I wish I could, but your internal build is at a ROCm version that claims to be 6.0 but isn't quite: TORCH_HIP_VERSION would be >= 600, yet the uuid attribute would still not be present.
@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@malfet any progress landing this?
@malfet ping, any progress landing this?
+1 this is blocking some integrations we are doing
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Replaces #99967.
Fixes #99903.
cc @ptrblck @msaroufim @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang