Add experimental install guide for ROCm #1550

xzuyn · 2024-04-19T14:01:10Z

Description

This adds a guide on how to install Axolotl for ROCm users.

Currently you need to install the packages included in `pip install -e '.[deepspeed]'`, then uninstall torch, xformers, and bitsandbytes so that you can install the ROCm versions of torch and bitsandbytes. The process is definitely janky, since you install things you don't want just to uninstall them afterwards.

Installing the ROCm version of torch first to try to skip a step results in Axolotl failing to install, so this order is necessary unless requirements.txt or setup.py is changed.
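For illustration, the ordering described above can be written down as a small Python helper that only builds the pip commands and prints them. This is a sketch, not the guide from the PR: the function name is hypothetical, the wheel index URL in the usage line is an example that should be checked against PyTorch's install page for your ROCm version, and the bitsandbytes step is left out because the ROCm build comes from a fork.

```python
import shlex


def rocm_install_commands(rocm_index_url, extras="deepspeed"):
    """Return the pip commands for the install-then-uninstall flow, in order.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    return [
        # 1. Install Axolotl with its default dependencies.
        ["pip", "install", "-e", f".[{extras}]"],
        # 2. Remove the packages that need ROCm-specific builds.
        ["pip", "uninstall", "-y", "torch", "xformers", "bitsandbytes"],
        # 3. Reinstall torch from a ROCm wheel index supplied by the caller.
        ["pip", "install", "torch", "--index-url", rocm_index_url],
        # The ROCm build of bitsandbytes is installed separately (not shown).
    ]


# Example index URL (an assumption; pick the one matching your ROCm version).
for cmd in rocm_install_commands("https://download.pytorch.org/whl/rocm6.0"):
    print(shlex.join(cmd))
```

Keeping the steps as data makes the required order explicit and easy to review before anything is run.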

Improvements could be made to this setup by modifying setup.py to include `[amd]` and `[nvidia]` options, which would prevent torch, bitsandbytes, and xformers from being installed unconditionally. That way we would skip the install-then-uninstall step that currently precedes installing the required packages.
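A rough sketch of what such a split could look like is below. It is not Axolotl's actual setup.py: the helper names, the package pins, and the choice to leave the `amd` extra empty are all assumptions made for illustration.

```python
import re

# Packages that need vendor-specific builds (taken from the description above).
GPU_SPECIFIC = {"torch", "xformers", "bitsandbytes"}


def requirement_name(req):
    """Return the bare package name from a requirement string (simplified parser)."""
    return re.split(r"[<>=!~\[;\s]", req.strip(), maxsplit=1)[0].lower()


def split_requirements(requirements):
    """Split requirements into a vendor-neutral base list and GPU-specific ones."""
    base = [r for r in requirements if requirement_name(r) not in GPU_SPECIFIC]
    gpu = [r for r in requirements if requirement_name(r) in GPU_SPECIFIC]
    return base, gpu


# Hypothetical pins, for illustration only.
reqs = ["torch==2.1.2", "xformers==0.0.22", "bitsandbytes>=0.43.0", "peft", "transformers"]
base, nvidia = split_requirements(reqs)

extras_require = {
    "nvidia": nvidia,  # default CUDA wheels from PyPI
    "amd": [],         # assumption: ROCm builds are installed separately by the user
}
print(base, extras_require)
```

In a real setup.py, `base` would go to `install_requires` and the dictionary to `extras_require` in the `setuptools.setup()` call, so `pip install -e '.[nvidia]'` would pull the CUDA packages and `pip install -e '.[amd]'` would not.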

Motivation and Context

I still see people on places like Reddit asking if it's possible yet to train AI stuff using AMD hardware. I want more people to know it's possible, although still experimental.

How has this been tested?

Tested on my personal system: Ubuntu 22.04.4 with ROCm 6.1 and an RX 7900 XTX. I've been using Axolotl (and other PyTorch-based AI tools like kohya_ss) this way for months on various versions of PyTorch and ROCm without major issues.

The only time I've had issues was when ROCm 6.0.2 was released and training only output 0 loss after I upgraded. This might've just been an issue with how I upgraded.
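A quick sanity check after a setup like this is to confirm which backend the installed PyTorch build targets. The sketch below is not part of the PR; `detect_backend` is a hypothetical helper, and it relies on `torch.version.hip` being set on ROCm builds and `torch.version.cuda` on CUDA builds.

```python
def detect_backend(hip_version, cuda_version):
    """Classify a PyTorch build from its reported HIP and CUDA version strings."""
    if hip_version:
        return "rocm"
    if cuda_version:
        return "cuda"
    return "cpu"


try:
    import torch
except ImportError:
    print("torch is not installed")
else:
    backend = detect_backend(
        getattr(torch.version, "hip", None),
        getattr(torch.version, "cuda", None),
    )
    # ROCm builds expose the GPU through the torch.cuda namespace as well.
    print(backend, "GPU visible:", torch.cuda.is_available())
```

If this prints `cuda` or `cpu` on an AMD system, the ROCm build of torch was most likely replaced by a later install step.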

Screenshots (if appropriate)

Types of changes

This only adds to the README.md file.

Social Handles (Optional)

winglian · 2024-04-19T18:32:43Z

Thanks for this, @xzuyn. Would it be helpful if I handled this in the docker images for you? Do you use the docker images?

@ehartford does this line up with the AMD work that you've been doing?

ehartford · 2024-04-19T19:06:50Z

I'm happy to test it

xzuyn · 2024-04-19T21:09:35Z

> would it be helpful if I handled this in the docker images for you? Do you use the docker images?

The only time I use docker is with runpod, but then I'm using an NVIDIA GPU. Even though the setup is a little janky, the ROCm setup is fairly straightforward for me to do in a venv.

`main` branch is `0.41.3.post1`, so using `rocm` branch brings us to `0.42.0`

xzuyn · 2024-04-20T23:51:24Z

I updated the readme to use the rocm branch instead of the main branch as that seems to be newer.

However, arlo-phoenix now recommends using the official ROCm `rocm_enabled` branch of bitsandbytes instead of his fork. It's more up to date (26 commits behind & 0.44.0.dev0 vs. 140 commits behind & 0.42.0) and initially appears to install without issue, but when running Axolotl I get an error:

Could not find the bitsandbytes CUDA binary at PosixPath('/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_hip_nohipblaslt.so')
Could not load bitsandbytes native library: /media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 122, in <module>
    lib = get_native_library()
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 104, in get_native_library
    dll = ct.cdll.LoadLibrary(str(binary_path))
  File "/usr/lib/python3.10/ctypes/__init__.py", line 452, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: cannot open shared object file: No such file or directory

So for now this is the latest I can confirm works for me.
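When debugging an error like the one above, it can help to see which native libraries actually ended up in the installed bitsandbytes package without importing it (the import itself is what fails). The following is a minimal sketch with hypothetical helper names; it only lists files matching `libbitsandbytes*.so` and does not try to explain why one is missing.

```python
import importlib.util
from pathlib import Path


def list_native_libs(package_dir):
    """Return the names of libbitsandbytes*.so files found in package_dir."""
    return sorted(p.name for p in Path(package_dir).glob("libbitsandbytes*.so"))


def bitsandbytes_dir():
    """Locate the installed bitsandbytes package directory without importing it."""
    spec = importlib.util.find_spec("bitsandbytes")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


pkg = bitsandbytes_dir()
print(list_native_libs(pkg) if pkg else "bitsandbytes is not installed")
```

An empty list would suggest that the native build step did not produce or copy a library into the package.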

ehartford · 2024-04-29T05:24:54Z

I go the other way around: I modify the requirements.txt so it doesn't install torch, bitsandbytes, xformers, flash attention, triton, deepspeed, etc., then I install those manually myself.
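The approach in this comment could be scripted roughly as follows. This is a sketch under assumptions, not the commenter's actual workflow: the function name is hypothetical, and the package names in `SKIP` (for example `flash-attn` for flash attention) are guesses at how they are spelled in requirements.txt.

```python
import re

# Packages to leave out so they can be installed manually afterwards.
SKIP = {"torch", "bitsandbytes", "xformers", "flash-attn", "triton", "deepspeed"}


def strip_gpu_packages(text, skip=SKIP):
    """Return requirements text with the lines for skipped packages removed."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Simplified name parsing: cut at the first specifier, extra, or marker.
        name = re.split(r"[<>=!~\[;@\s]", stripped, maxsplit=1)[0]
        name = name.lower().replace("_", "-")
        if stripped and not stripped.startswith("#") and name in skip:
            continue  # drop this requirement; it gets installed by hand later
        kept.append(line)
    return "\n".join(kept) + "\n"


sample = "torch==2.1.2\npeft\n# optional\nflash-attn>=2.0\n"
print(strip_gpu_packages(sample), end="")
```

The filtered text can be written to a separate file and passed to `pip install -r`, which leaves the original requirements.txt untouched.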

lizamd · 2024-05-02T17:25:51Z

Hi @xzuyn, thanks for the effort. I have been trying this on ROCm with no luck yet. Can we connect internally at AMD? Is it OK to put contact information on your profile? My email is: [email protected]

Without this you get `NameError: name 'amdsmi' is not defined`

Add experimental install guide for ROCm

9921e4f

Use the newer version of the bitsandbytes fork

8e15925

`main` branch is `0.41.3.post1`, so using `rocm` branch brings us to `0.42.0`

Missed the clone location name

33727dc

Use stable ROCm PyTorch

a9a77a2

Without this you get `NameError: name 'amdsmi' is not defined`
