
[Tracking] ROCm packages #197885

Open · 25 of 34 tasks
Madouura opened this issue Oct 26, 2022 · 66 comments

Labels: 5. scope: tracking (long-lived issue tracking long-term fixes or multiple sub-problems), 6.topic: hardware, 6.topic: rocm

Comments

@Madouura (Contributor) commented Oct 26, 2022

Tracking issue for ROCm derivations.

moar packages

Key

  • Package
    • Dependencies

WIP

Ready

TODO

Merged

ROCm-related

Notes

  • Update command: nix-shell maintainers/scripts/update.nix --argstr commit true --argstr keep-going true --arg predicate '(path: pkg: builtins.elem (pkg.pname or null) [ "rocm-llvm-llvm" "rocm-core" "rocm-cmake" "rocm-thunk" "rocm-smi" "rocm-device-libs" "rocm-runtime" "rocm-comgr" "rocminfo" "clang-ocl" "rdc" "rocm-docs-core" "hip-common" "hipcc" "clr" "hipify" "rocprofiler" "roctracer" "rocgdb" "rocdbgapi" "rocr-debug-agent" "rocprim" "rocsparse" "rocthrust" "rocrand" "rocfft" "rccl" "hipcub" "hipsparse" "hipfort" "hipfft" "tensile" "rocblas" "rocsolver" "rocwmma" "rocalution" "rocmlir" "hipsolver" "hipblas" "miopengemm" "composable_kernel" "half" "miopen" "migraphx" "rpp-hip" "mivisionx-hip" "hsa-amd-aqlprofile-bin" ])'

Won't implement

  • ROCmValidationSuite
    • Too many assumptions; not going to rewrite half the CMake files
  • rocm_bandwidth_test
    • Not really needed; will implement on request
  • atmi
    • Out of date
  • aomp
    • We basically already do this
  • Implement strictDeps for all derivations
    • Seems pointless for now, and I don't see many other derivations doing this
@Madouura (Contributor, Author) commented Oct 30, 2022

Updating to 5.3.1, marking all WIP until pushed to their respective PRs and verified.

@Madouura (Contributor, Author) commented Oct 30, 2022

If anyone is interested in helping me debug rocBLAS, here's the current derivation.
(Edit: already fixed.)

@Flakebi (Member) commented Oct 31, 2022

Hi, thanks a lot for your work on ROCm packages!

So far, the updates were all aggregated in a single rocm: 5.a.b -> 5.x.y PR. I think that makes more sense than splitting the package updates into individual PRs, for a couple of reasons:

  • Often, packages have backward- (and forward-) incompatible changes, e.g. the 5.3.0 version of rocm-runtime only works with 5.3.0 of rocm-comgr, but not with 5.2.0 or 5.4.0 (made-up example).
  • Nobody tests a mixture of versions, i.e. only the set with all packages at the same version is known to work.
  • If I want to test hip, OpenCL and other things for an update, it's easier to do it once (and compile everything a single time) rather than 10 times.

tl;dr: do you mind merging all your 5.3.1 updates into a single PR?

PS: Not sure how you did the update; I usually do it with (fish syntax):

  for f in rocm-smi rocm-cmake rocm-thunk rocm-runtime rocm-opencl-runtime rocm-device-libs rocm-comgr rocclr rocminfo llvmPackages_rocm.llvm hip; nix-shell maintainers/scripts/update.nix --argstr commit true --argstr package $f; end

@Madouura (Contributor, Author)

I was actually afraid of the opposite being true so I split them up.
Got it, I'll aggregate them.
Thanks for the tip on the update script, that would have saved me a lot of time.

@Madouura (Contributor, Author) commented Oct 31, 2022

I think hip should stay separate though, since there are other changes.
Actually, never mind; it's just an extra dependency, so it should be fine to split it.

@Madouura (Contributor, Author)

Took me a bit to compile torchWithRocm. I need to fix openai-triton to be free again.
Anyway, the test worked just fine for me.
nix-shell -I nixpkgs=/home/mado/Documents/Development/nixpkgs -p python3Packages.torchWithRocm

python test.py
tensor([[[[ 0.1783, -0.3823, -0.0870],
          [ 0.1783, -0.3823, -0.0870],
          [ 0.1783, -0.3823, -0.0870]]]], grad_fn=<ConvolutionBackward0>)
tensor([[[[ 0.1783, -0.3823, -0.0870],
          [ 0.1783, -0.3823, -0.0870],
          [ 0.1783, -0.3823, -0.0870]]]], device='cuda:0',
       grad_fn=<ConvolutionBackward0>)
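(The contents of test.py aren't shown in the thread. Below is a minimal sketch of the kind of smoke test that would produce output like the above, wrapped as a Nix derivation; the script body, the torch-rocm-smoke-test name, and the Conv2d shapes are assumptions, not the actual script.)

  { pkgs ? import <nixpkgs> { } }:

  pkgs.writers.writePython3Bin "torch-rocm-smoke-test"
    { libraries = [ pkgs.python3Packages.torchWithRocm ]; }
    ''
      import torch

      # Run the same convolution on the CPU and on the GPU; ROCm builds
      # of torch expose the HIP device under the "cuda" name, which is
      # why the second tensor above prints with device='cuda:0'.
      conv = torch.nn.Conv2d(1, 1, 3, padding=1)
      x = torch.ones(1, 1, 3, 3)
      print(conv(x))
      print(conv.cuda()(x.cuda()))
    ''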

@kurnevsky (Member)

OK, thanks. I assume you're using a different GPU? Maybe it's a problem specifically with the 7900 XTX...

@Madouura (Contributor, Author) commented Oct 23, 2023

It's possible your GPU may not be fully supported yet.
I believe your GPU is GFX11? I wonder if that's why.

@kurnevsky (Member) commented Oct 23, 2023

I believe your GPU is GFX11?

Yes.

@Madouura (Contributor, Author) commented Oct 28, 2023

New tensorflow-rocm WIP at Madouura@344aa78.
The current blocker is an LLVM mismatch.
Most likely, tensorflow 2.13.0 isn't nearly up to date enough for rocm 5.7.1.

@Madouura (Contributor, Author) commented Oct 30, 2023

@Flakebi I have some basic impureTests stuff at https://github.com/Madouura/nixpkgs/blob/pr/rocm/pkgs/development/rocm-modules/5/rocm-thunk/generic.nix as well as some other stuff.
Please tell me if you think this is the best way forward.

@Flakebi (Member) commented Oct 31, 2023

Nice!
I think we shouldn’t add anything to <package>.tests that is not also runnable as a (pure) nix test because these get parsed by scripts and bots.
Why not set the testScript = "${rocmPackages_5.rocm-smi-variants.shared}/bin/rocm-smi"?
That would make it easier to build most tests :)

I think the rocminfo test can check from the output that it actually detected something (like rocminfo | grep -E 'Device Type: +GPU' and rocm_agent_enumerator | grep -E 'gfx[^0]'). That makes sure we don't ship something that's unable to find GPUs.
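For what it's worth, here is a minimal sketch of that check expressed as a derivation. Beyond the two grep pipelines quoted above, everything is an assumption: the test name, and that the rocminfo package also ships rocm_agent_enumerator. Since GPU access is impure, a check like this could only run with sandboxing relaxed (or through the impureTests mechanism mentioned earlier):

  { runCommand, rocminfo }:

  # Fails the build if rocminfo reports no GPU device, or if
  # rocm_agent_enumerator finds nothing beyond the gfx000 (CPU) agent.
  runCommand "rocminfo-detects-gpu" { nativeBuildInputs = [ rocminfo ]; } ''
    rocminfo | grep -E 'Device Type: +GPU'
    rocm_agent_enumerator | grep -E 'gfx[^0]'
    touch $out
  ''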

@Madouura (Contributor, Author)

I'm going to take a bit of a break from ROCm and work on another project.
I'll try to work on the major updates/upgrades here and there, but until early-mid next year the other project is going to be my focus.
If there are any major issues, or you just need something explained, don't hesitate to ping me.

@gjz010 commented Dec 15, 2023

Hi. Thanks for maintaining ROCm for Nix!

When I try to use torchWithRocm I get the following error:

MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' naive_conv.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: naive_conv.cpp
MIOpen(HIP): Warning [BuildHip] hip runtime failed to load.
Error: Please provide architecture for which code is to be generated.
MIOpen Error: /build/source/src/hipoc/hipoc_program.cpp:304: Code object build failed. Source: naive_conv.cpp

Any idea what should be in the environment? I tried adding the recent meta.rocm-all but it didn't help.

Same problem here with the same GPU (7900 XTX). After running strace on your minimal example, I noticed:

openat(AT_FDCWD, "/nix/store/mkih90ygzxczv4k0fn6gapgi7i7wy292-rocm-llvm-libunwind-5.7.1/lib/libamdhip64.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
...
openat(AT_FDCWD, "./libamdhip64.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "./libamdhip64.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
write(2, "MIOpen(HIP): Error [Compile] 'hi"..., 145MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' naive_conv.cpp: HIPRTC_ERROR_COMPILATION (6)
) = 145

It appears that the important libamdhip64.so is not added to the runtime library path:

ls $(dirname $(nix-shell -p rocmPackages.meta.rocm-hip-runtime --run "which hipcc"))/../lib/libamdhip64.so
# /nix/store/09ic1qizx0aacml0vi83k9lgq23fz0wg-rocm-hip-runtime-meta/bin/../lib/libamdhip64.so

By setting the environment variable manually:

export LD_LIBRARY_PATH=/nix/store/bz15zrilgr04ghdiz4cd73sam5wvmhhw-clr-5.7.1/lib/

The problem is temporarily fixed and I can now run Stable Diffusion WebUI.
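(To avoid hard-coding a hash-specific store path in LD_LIBRARY_PATH, the same workaround can be expressed as a shell.nix. A small sketch, assuming the rocmPackages.clr and python3Packages.torchWithRocm attribute names; treat it as untested:)

  { pkgs ? import <nixpkgs> { } }:

  pkgs.mkShell {
    packages = [ pkgs.python3Packages.torchWithRocm ];
    # Put clr's lib directory (which contains libamdhip64.so) on the
    # runtime search path instead of exporting LD_LIBRARY_PATH by hand.
    LD_LIBRARY_PATH = pkgs.lib.makeLibraryPath [ pkgs.rocmPackages.clr ];
  }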

@Madouura (Contributor, Author) commented Dec 17, 2023

ROCm 6.0.0 has been released.
rocmPackages_5 is now in maintenance-mode.
I will eventually backport the changes I am making in rocmPackages_6 to rocmPackages_5; however, it is not a high priority.

@kurnevsky (Member) commented Dec 17, 2023

By setting environment variable manually

Interesting: now pytorch works for me, but it doesn't seem to work correctly. I'm trying to generate an image from SDXL + LoRA with diffusers, and it generates an incorrect image...

I tried identical code and model with manually defined seeds in Google Colab with CUDA; it works there. It also seems to work locally on CPU with f32 types.

(Or it might be some problem in one of the libs, since locally I use all Python libs from nix.)

@sersorrel (Contributor)

The export LD_LIBRARY_PATH=/nix/store/...-clr-5.7.1/lib solution fixed the same torchWithRocm problem for me, also with a 7900 XTX. I couldn't see how you got that path; it's returned by nix build --print-out-paths nixpkgs#rocmPackages.clr, right?

@ScatteredRay (Contributor)

Hey, giving this a try. Still very much WIP, but it's working so far for my current project.

@dwf (Contributor) commented Mar 21, 2024

@Madouura First, thanks for all your work on this front.

You left a comment to the effect that rocBLASLt is "Very broken with Tensile at the moment, only supports GFX9". It looks like other platforms might be supported now, but I wondered if you might be able to elaborate on the "very broken with Tensile" part. I notice that they ship a vendored "TensileLite"; was that what you were trying to use?

Any pointers you have on how I might manage to build this would be useful. I'm currently eyeing the rocBLAS derivation as a potentially good starting point.

Edit: no longer a priority for me

@yshui (Contributor) commented Apr 1, 2024

pytorch now fails to build after the 5 -> 6 transition, because it depends on miopengemm, which was removed.

@SomeoneSerge (Contributor)

I edited the description to add an entry for rocblaslt. It's apparently a dependency of zluda.

@errnoh mentioned this issue Apr 10, 2024
@samueldr added the 5. scope: tracking label Apr 23, 2024
@jalil-salame (Contributor)

Apparently pytorch now requires hipBLASLt:

python3.11-torch> CMake Error at cmake/public/LoadHIP.cmake:37 (find_package):
python3.11-torch>   By not providing "Findhipblaslt.cmake" in CMAKE_MODULE_PATH this project
python3.11-torch>   has asked CMake to find a package configuration file provided by
python3.11-torch>   "hipblaslt", but CMake did not find one.
python3.11-torch>   Could not find a package configuration file provided by "hipblaslt" with
python3.11-torch>   any of the following names:
python3.11-torch>     hipblasltConfig.cmake
python3.11-torch>     hipblaslt-config.cmake
python3.11-torch>   Add the installation prefix of "hipblaslt" to CMAKE_PREFIX_PATH or set
python3.11-torch>   "hipblaslt_DIR" to a directory containing one of the above files.  If
python3.11-torch>   "hipblaslt" provides a separate development package or SDK, be sure it has
python3.11-torch>   been installed.
python3.11-torch> Call Stack (most recent call first):
python3.11-torch>   cmake/public/LoadHIP.cmake:160 (find_package_and_print_version)
python3.11-torch>   cmake/Dependencies.cmake:1258 (include)
python3.11-torch>   CMakeLists.txt:754 (include)
python3.11-torch>
python3.11-torch> -- Configuring incomplete, errors occurred!

@ony (Contributor) commented Jun 16, 2024

As per pytorch/pytorch#119081 (comment), in 2.4.0+ (a future release) it should be possible to use something like this (as a nixpkgs overlay):

  final: prev: {
    pythonPackagesExtensions = prev.pythonPackagesExtensions ++ [
      (python-final: python-prev: {
        torch = python-prev.torch.overrideDerivation (oldAttrs: {
          TORCH_BLAS_PREFER_HIPBLASLT = 0;  # not yet in nixpkgs
        });
      })
    ];
  }

@AngryLoki

@ony, TORCH_BLAS_PREFER_HIPBLASLT is a runtime environment variable; pytorch still links against and requires hipblaslt even when it is unused. pytorch/pytorch#120551 should help, but I have no idea whether and when it will be accepted.

By the way, hipblaslt is not difficult to build. Just don't build the 6.0 release; skip directly to 6.1. When I tried, the bundled TensileLite in 6.0 generated a wall of unreadable errors, while 6.1 worked on the first attempt.
