HIP: can't use GPU with official Tensorflow or PyTorch ROCM containers with Ryzen 5600G #207109
Can you use the hip from nixos-unstable and tell me if it still gives you that error? |
I should also mention that I am working on native ROCm support for pytorch and tensorflow in nixpkgs so you don't need to use those docker containers, but that's going to take some time. |
Also try |
As far as I see, a Ryzen 5600G has a Vega GPU (gfx9), so I’m not surprised that everything crashes when you force gfx10.3 behavior – two generations later – with |
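For what it's worth, here is a minimal Python sketch, assuming a PyTorch build recent enough to expose torch.cuda.get_arch_list(), that prints which gfx code objects the installed torch binary actually ships; hipErrorNoBinaryForGpu usually means the device's target is missing from that list.

import torch

# List the GPU architectures this torch build ships code objects for.
# On a ROCm build these are gfx targets (e.g. gfx900, gfx906, gfx1030);
# an empty list usually means no usable device was detected.
print("HIP version:", torch.version.hip)
print("Built for:", torch.cuda.get_arch_list())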
About this generation thing: I have no idea what I am doing xD, I just saw people mentioning it on the Internet and decided to try.
Switched to latest unstable rn
Edit 1: I am now switching it to staging. It didn't start screaming during the build (yet). |
I'm in the same boat, it's how #197885 started lol. |
Try without the latest tag, again this should just be an issue with the docker container. |
Same problem on staging |
I haven't gotten tensorflow working yet, but you should be able to use pytorch once the next staging-next round and #206995 are merged. |
I think I found a bug in
lucasew@whiterun ~ 0$ nix shell github:Madouura/nixpkgs/df71e711026a37178f9a258f236db0e1a66e2f0b#legacyPackages.x86_64-linux.{python3Packages.torchWithRocm,roctracer,rccl,python3} -c python
Python 3.10.9 (main, Dec 6 2022, 18:44:57) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'torch' |
I haven't gotten that problem, I may have linked you a bad build. |
Oh, this is interesting. I didn't realize |
Tested with the following shell.nix (a workaround for that issue)
Same problem as with the container so far. But I returned to stable. I will try with the latest staging commit. |
Try this.
import torch, timeit
print(f"CUDA support: {torch.cuda.is_available()} (Should be \"True\")")
print(f"CUDA version: {torch.version.cuda} (Should be \"None\")")
print(f"HIP version: {torch.version.hip} (Should contain \"5.4\")")
# Storing ID of current CUDA device
cuda_id = torch.cuda.current_device()
print(f"Current CUDA device ID: {torch.cuda.current_device()}")
print(f"Current CUDA device name: {torch.cuda.get_device_name(cuda_id)} (Should be AMD, not NVIDIA)")
def batched_dot_mul_sum(a, b):
'''Computes batched dot by multiplying and summing'''
return a.mul(b).sum(-1)
def batched_dot_bmm(a, b):
'''Computes batched dot by reducing to bmm'''
a = a.reshape(-1, 1, a.shape[-1])
b = b.reshape(-1, b.shape[-1], 1)
return torch.bmm(a, b).flatten(-3)
x = torch.randn(10000, 1024, device='cuda')
t0 = timeit.Timer(
stmt='batched_dot_mul_sum(x, x)',
setup='from __main__ import batched_dot_mul_sum',
globals={'x': x})
t1 = timeit.Timer(
stmt='batched_dot_bmm(x, x)',
setup='from __main__ import batched_dot_bmm',
globals={'x': x})
# Ran each twice to show difference before/after warmup
print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'mul_sum(x, x): {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x): {t1.timeit(100) / 100 * 1e6:>5.1f} us')
If everything is working, the output should match what's in the parentheses, and if you have something like |
If that still doesn't work, it may honestly just be possible that the Ryzen 5600G just isn't supported. |
@Flakebi If you have an AMD GPU, could you run this check/benchmark as well to confirm it isn't just working for me and only me? |
Same problem. I built my NixOS config against staging right after #206421 was merged, because the latest staging failed in the middle of the build due to an unrelated package. This is the shell.nix I am using to provision torch, based on the commit you mentioned:
let
nixpkgs = builtins.fetchTarball "https://github.com/NixOS/nixpkgs/archive/f6d4e98b49a52fe564b832e20527b527fa2c90a6.tar.gz";
pkgs = import nixpkgs { };
in pkgs.mkShell {
buildInputs = with pkgs; [ python3Packages.torchWithRocm ];
}
This is my Python prompt after running nix-shell on the shell.nix above:
lucasew@whiterun ~/demo-hip-issue 0$ nix-shell
(shell:impure) lucasew@whiterun ~/demo-hip-issue 0$ python
Python 3.10.9 (main, Dec 6 2022, 18:44:57) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.tensor([[1,2],[3,4]]).to(torch.device('cuda')
...
... )
Memory access fault by GPU node-1 (Agent handle: 0x7817470) on address 0x735d000. Reason: Unknown.
Aborted (core dumped)
Whiterun is running https://github.com/lucasew/nixcfg/tree/811c58b6b9c743fab692fb3fc7817ded83974b6c and this is what I got in dmesg right after I ran that Python snippet.
And this is your script output:
(shell:impure) lucasew@whiterun ~/demo-hip-issue 0$ python test-pytorch
CUDA support: True (Should be "True")
CUDA version: None (Should be "None")
HIP version: 5.4.22802-0 (Should contain "5.4")
Current CUDA device ID: 0
Current CUDA device name: AMD Radeon Graphics (Should be AMD, not NVIDIA)
Segmentation fault (core dumped) |
So it's not torch itself, the commit, or nixpkgs then; everything as far as torch goes matches up. |
I do have my user in the "video" and "render" groups, just in case that solves your issue, but I doubt it. |
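As a hedged diagnostic sketch: ROCm needs read/write access to /dev/kfd and the /dev/dri/renderD* nodes, which is what the video/render groups normally grant, and that can be checked from Python like this.

import glob
import grp
import os

# Which of the relevant groups is the current user actually in?
my_groups = {grp.getgrgid(g).gr_name for g in os.getgroups()}
print("Relevant groups:", sorted(my_groups & {"video", "render"}))

# Can the process open the device nodes ROCm needs?
for node in ["/dev/kfd", *glob.glob("/dev/dri/renderD*")]:
    ok = os.access(node, os.R_OK | os.W_OK)
    print(f"{node}: {'accessible' if ok else 'NO ACCESS'}")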
I think I got something \o/
(shell:impure) lucasew@whiterun ~/demo-hip-issue 139$ HSA_OVERRIDE_GFX_VERSION=9.0.0 ./test-pytorch
CUDA support: True (Should be "True")
CUDA version: None (Should be "None")
HIP version: 5.4.22802-0 (Should contain "5.4")
Current CUDA device ID: 0
Current CUDA device name: AMD Radeon Graphics (Should be AMD, not NVIDIA)
mul_sum(x, x): 131.0 us
mul_sum(x, x): 9.2 us
bmm(x, x): 330.2 us
bmm(x, x): 18.9 us |
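For anyone reproducing this, a minimal sketch of the same override done from inside Python rather than the shell; the assumption is that nothing has initialized the HIP runtime yet in the process, and exporting HSA_OVERRIDE_GFX_VERSION before launching Python remains the more robust option.

import os

# The override has to be in the environment before the HIP/ROCr runtime starts,
# so set it before importing torch. 9.0.0 targets the 5600G's Vega (gfx9) iGPU.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "9.0.0")

import torch

print("Device:", torch.cuda.get_device_name(0))
x = torch.randn(1024, 1024, device="cuda")
print("Result:", (x @ x).sum().item())  # should finish without a memory access fault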
Ahh so it was |
Try both of those (and |
Tried to replicate after a fresh reboot. Same result. We got it 🥂 For the record, whiterun is running lucasew/nixcfg@d98b0e2 and I added the group definitions in the |
Glad we got it working! |
The official docker containers, right? That's not nixpkgs-related. |
...unless this is related to your |
You could also try adding |
The example that ended up working is based on
The full docker run command is [...]
67a4 is a container generated from the [...]
BTW, that |
Ugh, reading comprehension again... |
I don't think Docker gets in the way much here anymore, because the right device nodes appear to be bound from the host, and the stock seccomp profile that could block syscalls is disabled as well ( Have you looked into the stable-diffusion-webui issues about those segfaults? Maybe those give a few pointers:
I'm afraid I'm not able to give much insight about this myself, as I don't have a CUDA/ROCm-capable GPU (...unless the Steam Deck APU counts?). |
Well, isn't the Steam Deck GPU basically an RDNA2 GPU? That should work. @Madouura, what's your hardware, and where do you define the GPU stuff in your config? I may have made mistakes in my config. But yeah, it's based on that staging commit. |
Hopefully this should be enough. One is a 6900XT, the other is a 6800. |
Wait a minute... The likely reason why our
nixpkgs/pkgs/development/libraries/rocclr/default.nix, lines 19 to 27 in 0f0929f
IIRC, shouldn't the 5600G be gfx8? If so, that's definitely why. The official docker image isn't an option for you. |
Nope, I got that wrong. It's gfx9. |
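To double-check which target the hardware itself reports, here is a hedged sketch that parses the output of the rocminfo tool (shipped with ROCm); it assumes rocminfo is on PATH.

import re
import subprocess

# Parse rocminfo's agent dump for gfx targets; a Vega-based APU like the
# 5600G should report a gfx9* target rather than gfx10.
out = subprocess.run(["rocminfo"], capture_output=True, text=True, check=True).stdout
targets = sorted(set(re.findall(r"gfx[0-9a-f]+", out)))
print("Reported gfx targets:", targets)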
I suppose this issue can be closed now? |
I just want to test tensorflow first. But if the ROCm layer is known to be working, then I suppose there is no more work left for you to do in this issue. Thank you guys. You are awesome. |
Looks like there were some AMD changes in 6.0, go figure. |
Describe the bug
I have a Ryzen 5600G APU and I am trying to use Tensorflow or PyTorch to do some machine learning stuff. With either one, I am just trying to make it recognize the GPU and make it usable; so far I have only been able to use the GPU in Blender, either with blender-hip or via a workaround with blender-bin.
Steps To Reproduce
Steps to reproduce the behavior:
For PyTorch
docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/pytorch:latest
python
import torch
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" Aborted (core dumped)
For TensorFlow
docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/tensorflow:latest
python
import tensorflow as tf
tf.config.list_physical_devices()
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" Aborted (core dumped)
If I do an
export HSA_OVERRIDE_GFX_VERSION=10.3.0
and do any activity that actually uses the GPU, like torch.tensor([[1,2],[3,4]]).to(torch.device('cuda'))
it crashes and dmesg shows the following:
Expected behavior
Machine learning working the same as it would in Google Colab, I guess.
Additional context
Nixcfg revision used to replicate the issue: https://github.com/lucasew/nixcfg/tree/ff430dc0992d9247989f739a326536f87e345d98/nodes/whiterun
A PC with an i5 6400 + RX460 has the same problem, but I don't have access to it anymore to test possible fixes.
Notify maintainers
@NixOS/rocm-maintainers
Metadata
Please run
nix-shell -p nix-info --run "nix-info -m"
and paste the result.