Add experimental install guide for ROCm #1550
Open
Description
This adds a guide on how to install Axolotl for ROCm users.
Currently you need to install the packages included in `pip install -e '.[deepspeed]'`, then uninstall `torch`, `xformers`, and `bitsandbytes`, so that you can then install the ROCm versions of `torch` and `bitsandbytes`. The process is definitely janky, since you install stuff you don't want just to uninstall it afterwards.
Installing the ROCm version of `torch` first to try to skip a step results in Axolotl failing to install, so this order is required unless `readme.txt` or `setup.py` is changed.
Improvements could be made to this setup by modifying `setup.py` to include `[amd]` and `[nvidia]` options, which would prevent `torch`, `bitsandbytes`, and `xformers` from being installed by default. That way we would skip the install-then-uninstall step that currently comes before installing the required packages.
Motivation and Context
I still see people on places like Reddit asking if it's possible yet to train AI stuff using AMD hardware. I want more people to know it's possible, although still experimental.
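One way to make this easier for those users is the `[amd]` / `[nvidia]` split suggested in the Description. The following is only a rough sketch of how `setup.py` could separate the GPU-specific packages from the rest. The helper names (`requirement_name`, `split_requirements`, `build_extras`) and the example requirement lines are hypothetical and are not taken from Axolotl's actual `setup.py`.

```python
# Hypothetical sketch of an [amd]/[nvidia] extras split for setup.py.
# Package names come from the PR description; everything else is assumed.

# Packages the description says must not be installed by default on ROCm.
GPU_SPECIFIC = {"torch", "xformers", "bitsandbytes"}


def requirement_name(requirement: str) -> str:
    """Return the bare, lowercased package name from a requirement line."""
    name = requirement.strip()
    for separator in ("==", ">=", "<=", "~=", "!=", ">", "<", "[", ";", " "):
        name = name.split(separator, 1)[0]
    return name.lower()


def split_requirements(lines):
    """Split requirement lines into (common, gpu_specific) lists."""
    common, gpu_specific = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if requirement_name(line) in GPU_SPECIFIC:
            gpu_specific.append(line)
        else:
            common.append(line)
    return common, gpu_specific


def build_extras(gpu_specific):
    """Build an extras_require-style mapping.

    Assumption: [nvidia] keeps the default builds, while [amd] installs
    none of them so the ROCm builds can be installed separately afterwards.
    """
    return {"nvidia": list(gpu_specific), "amd": []}


# Illustrative requirement lines only; the version pin is made up.
example_lines = ["torch", "xformers", "bitsandbytes>=0.1", "transformers", "# note", ""]
common, gpu_specific = split_requirements(example_lines)
extras = build_extras(gpu_specific)
print("install_requires:", common)
print("extras_require:", extras)
```

With a split like this, `common` would be passed as `install_requires` and `extras` as `extras_require` in the `setup()` call. Other dependencies that themselves require `torch` may still pull in a default build, so this sketch alone might not remove the uninstall step entirely.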
How has this been tested?
Tested on my personal system: Ubuntu 22.04.4 with ROCm 6.1 and an RX 7900 XTX. I've been using Axolotl (and other PyTorch-based AI tools like kohya_ss) this way for months on various versions of PyTorch and ROCm without major issues.
The only time I've had issues was when ROCm 6.0.2 was released and training only output 0 loss after I upgraded. This might've just been an issue with how I upgraded.
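For anyone reproducing this on their own machine, the install order from the Description and a quick post-install check can be sketched as below. This is a dry run that only prints the `pip` commands; it does not execute them. The two `ROCM_*` values are placeholders, because the description does not say which wheel index or `bitsandbytes` source was used, and the function names are hypothetical.

```python
import shlex
import sys

# Assumption: placeholders to fill in; the PR description does not name them.
ROCM_TORCH_INDEX_URL = "<ROCm PyTorch wheel index URL>"
ROCM_BITSANDBYTES_SOURCE = "<ROCm bitsandbytes package or repository>"


def install_plan(python=sys.executable):
    """Return the pip commands in the order the description says is required."""
    pip = [python, "-m", "pip"]
    return [
        # 1. Install Axolotl with its default dependencies.
        pip + ["install", "-e", ".[deepspeed]"],
        # 2. Remove the default builds that do not target ROCm.
        pip + ["uninstall", "-y", "torch", "xformers", "bitsandbytes"],
        # 3. Install the ROCm builds of torch and bitsandbytes.
        pip + ["install", "torch", "--index-url", ROCM_TORCH_INDEX_URL],
        pip + ["install", ROCM_BITSANDBYTES_SOURCE],
    ]


def torch_backend_report():
    """Report whether the installed torch is a ROCm (HIP) build with a usable GPU."""
    try:
        import torch
    except ImportError:
        return {"installed": False, "hip": None, "gpu_available": False}
    return {
        "installed": True,
        # torch.version.hip is a version string on ROCm builds, otherwise None.
        "hip": getattr(torch.version, "hip", None),
        # ROCm builds expose the GPU through the torch.cuda namespace.
        "gpu_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for command in install_plan():
        print(shlex.join(command))
    print(torch_backend_report())
```

After a successful ROCm install, `hip` should be a version string rather than `None`, and `gpu_available` should be `True`. This only confirms the backend is visible; it would not catch a problem like the 0 loss seen after the ROCm 6.0.2 upgrade.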
Screenshots (if appropriate)
Types of changes
This only adds content to the `README.md` file.
Social Handles (Optional)