Support AMD Ryzen Unified Memory Architecture (UMA) #107605
@winstonma Regarding PyTorch's caching: that means you should develop your own caching-strategy wrapper on top of your primitive allocator/deallocator. But I think we could do better by reusing the caching strategy developed by PyTorch, don't we? cc @jaykchen @cpuhrsch |
@yiakwy-xpu-ml-framework-team but it seems that the sample in torch-apu-helper can get a piece of shared memory from the system for PyTorch to use. Maybe it is just a simple demo without full functionality, but it shows there is a way for PyTorch to grab memory from the system. Compared to booting into the BIOS, defining a dedicated piece of memory for the GPU, and then rebooting, allocating a fixed piece of memory at program runtime is already an improvement.
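For context, here is a minimal sketch of what that demo boils down to (hypothetical file and function names; assumes a ROCm build of PyTorch and hipcc on PATH, with the callback signatures from PyTorch's pluggable-allocator documentation): a tiny HIP shim routes every "CUDA" allocation through hipHostMalloc, so tensors land in host-resident shared memory that the APU's GPU can address.

```python
# Sketch only: roughly the torch-apu-helper idea, not its actual code.
import subprocess, textwrap
import torch

src = textwrap.dedent("""
    #include <sys/types.h>
    #include <hip/hip_runtime.h>
    extern "C" {
    // Signatures required by torch.cuda.memory.CUDAPluggableAllocator.
    void* uma_alloc(ssize_t size, int device, hipStream_t stream) {
        void* ptr = nullptr;
        hipHostMalloc(&ptr, size, 0);  // pinned host memory, visible to the iGPU
        return ptr;
    }
    void uma_free(void* ptr, ssize_t size, int device, hipStream_t stream) {
        hipHostFree(ptr);
    }
    }
""")
with open("uma_alloc.hip", "w") as f:
    f.write(src)
subprocess.check_call(["hipcc", "-shared", "-fPIC", "-o", "uma_alloc.so", "uma_alloc.hip"])

allocator = torch.cuda.memory.CUDAPluggableAllocator("./uma_alloc.so", "uma_alloc", "uma_free")
torch.cuda.memory.change_current_allocator(allocator)  # must run before the first CUDA allocation
x = torch.randn(1024, device="cuda")  # now served by hipHostMalloc
```

As discussed below, this bypasses PyTorch's caching allocator entirely, which is where the trouble starts for larger applications. |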
llama.cpp added an option to use the shared memory on AMD APUs and I can confirm that it works. This is the pull request: |
@dkuku Thanks. From llama.cpp we know that it is doable (on Windows we can access video memory via UMA through DirectX; that's my guess). torch-apu-helper uses the same way to get PyTorch using the UMA, and it lets PyTorch use the video memory. However, when Stable Diffusion uses some other PyTorch API it fails (like the following example).
This shows that torch-apu-helper is not sufficient to get higher-level applications working; official support from PyTorch is needed (just like llama.cpp has).
Are you using an AMD APU or an AMD dedicated GPU? How much memory does your system originally have? I think I am a bit confused. |
@jithunnair-amd Could you please take a look? Thanks 🙏🏻 EDIT: Sorry, I didn't see that the first post was already CCed. |
@winstonma As for the PyTorch caching algorithm: I have successfully added an external cached-allocator wrapper (with exactly PyTorch's block algorithm; I am familiar with PyTorch's block and expandable-segments allocation strategies). It works well in PyTorch sample tests with any custom allocator, as long as the memory address is deemed "valid" by the device. The only issue is how to dump it into PyTorch's newly supported memory snapshot, which also needs a Python frame tracer. Hence it is better to have native support from the PyTorch code base.
This is because exporting memory info (blocks, segments) uses many statistics from PyTorch's cached allocator (blocks and segments are exclusive to the cached allocator). Note that the external allocator should support exporting a memory snapshot and liveness info like this: The whole feature involves 3k lines of C++ code. So I guess your hipHostMalloc demo doesn't work in practice.
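For reference, the snapshot tooling in question is driven from Python roughly like this (a minimal sketch; _record_memory_history and _dump_snapshot are private PyTorch hooks, so keyword names may differ between versions):

```python
import torch

# start recording allocator events plus Python/C++ stack frames
torch.cuda.memory._record_memory_history(max_entries=100_000)

x = torch.randn(4096, 4096, device="cuda")
y = x @ x

# dump a pickle viewable at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```

An external pluggable allocator gets none of this for free, which is the point above. |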
@dkuku How does llama.cpp handle the overhead of memory allocation? Does it cache the memory it allocates? |
I am not sure. But I think llama.cpp and torch-apu-helper both referenced the HIP Programming Manual. I think memory-allocation-wise torch-apu-helper works, but as you mentioned, modifications are needed in order to get full support from PyTorch. I didn't know it needs 3k lines of code to modify; I just thought the modification was mainly about asking the ROCm driver to assign the memory, and that after the memory is allocated everything should work. |
Hi @winstonma, I am afraid that this is not how PyTorch works:
In the allocating stage, PyTorch first requests an available block (which contains a valid GPU address plus an offset pointing to the remaining available space), then it splits the block so that the memory descriptor of the newly inserted block is exactly what you requested. This creates small memory segments, hence each time you request an allocation, PyTorch increases the age of the memory and releases the "oldest" blocks if necessary. The release stage just reverses the above order, except that the root block containing the GPU memory buffer will not be released (or not immediately released) by cudaFree or something similar.
This algorithm can be supported with around 2k lines of code, depending on whether you want to handle multiple streams. That part is tricky because a memory block can be used under one stream for computing and another stream for a copy, and one has to manually notify PyTorch to increase the event/stream/block references; otherwise PyTorch has no means of knowing whether the memory is still in use. Native support from PyTorch must consider this for "CUDAPluggableAllocator".
Another thing one must support is tracing (this can be done by an allocator-wrapper singleton with a pybind interface to override the native memory_snapshot function; you must override the cDLL loading process if you want a singleton as the interface), otherwise PyTorch will not generate memory snapshots. This needs another 1k lines of code: PyObject frame tracking, C++ code tracking, etc.
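To make the block/split/age description concrete, here is a heavily simplified toy version of that scheme (illustration only: single stream, no events, nothing like the real multi-thousand-line implementation; raw_alloc/raw_free stand in for the primitive device allocator):

```python
# Toy single-stream caching allocator: block splitting plus age-based release.
from dataclasses import dataclass

@dataclass
class Block:
    ptr: int           # device address (segment base + offset)
    size: int
    free: bool = True
    age: int = 0       # bumped on every allocation; stalest freed first

class CachingAllocator:
    def __init__(self, raw_alloc, raw_free, segment_size=1 << 20):
        self.raw_alloc, self.raw_free = raw_alloc, raw_free
        self.segment_size = segment_size
        self.blocks = []   # every block, cached ones included
        self.clock = 0

    def malloc(self, size):
        self.clock += 1
        # 1) reuse a cached block if one is big enough, else grow a new root segment
        blk = next((b for b in self.blocks if b.free and b.size >= size), None)
        if blk is None:
            seg = max(size, self.segment_size)
            blk = Block(self.raw_alloc(seg), seg)
            self.blocks.append(blk)
        # 2) split: the remainder stays behind as a smaller cached block
        if blk.size > size:
            self.blocks.append(Block(blk.ptr + size, blk.size - size))
            blk.size = size
        blk.free, blk.age = False, self.clock
        return blk.ptr

    def free(self, ptr):
        # return the block to the cache; nothing goes back to the device yet
        next(b for b in self.blocks if b.ptr == ptr).free = True

    def release_oldest(self):
        # under memory pressure, hand the stalest cached block back to the device
        # (the real allocator only ever releases whole root segments)
        cached = [b for b in self.blocks if b.free]
        if cached:
            oldest = min(cached, key=lambda b: b.age)
            self.blocks.remove(oldest)
            self.raw_free(oldest.ptr)
```

The hard parts the sketch skips, per the comment above, are exactly the stream/event bookkeeping and the snapshot tracing. |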
@yiakwy-xpu-ml-framework-team Thanks for explaining. It seems I oversimplified the whole process and thought AMD could take advantage of the existing PyTorch ROCm framework and include support for UMA with a small modification. Seems there is a long way to go. As llama.cpp shows, the driver is able to use UMA on Linux, so I wish the AMD team would get PyTorch working on UMA 🙏🏻 This feature is crucial because most APUs are installed in laptops, and laptop manufacturers don't allow users to modify the dedicated memory in the BIOS. Therefore laptop users cannot use the PyTorch ROCm version, which is really sad. |
If you use force-host-alloction-APU instead of torch-apu-helper, you can run Stable Diffusion on an APU. I have a 5600G APU and I'm using it with Fooocus.
Links to the ROCm and PyTorch versions I used:
ROCm 5.7 with PyTorch for 5.7
ROCm 6.0 with PyTorch for 6.0
|
I was able to run sd-webui around January on my AMD APU (7840HS, RDNA3, 4G VRAM). I used UMAF to adjust VRAM from 4G to 8G, and set …. Just leaving some words here as an alternative approach. |
@yiakwy-xpu-ml-framework-team I tested @qkiel's method and I can run Stable Diffusion on my laptop without modifying the VRAM split in the BIOS. I am running ROCm 6.0.2 with the PyTorch 2.2.1 stable version. I don't see any performance difference between the new method and the VRAM-modification method. Just wondering if PyTorch ROCm would consider including the force-host-alloction-APU method in a future release of ROCm PyTorch. Thank you very much. |
@winstonma great it worked :] @gonwan have you tried adding one more environment variable, HSA_ENABLE_SDMA=0?
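(For anyone following along: such a variable has to be in the environment before torch initializes the HIP runtime, e.g.

```python
import os
os.environ["HSA_ENABLE_SDMA"] = "0"  # set before importing torch
import torch
print(torch.cuda.is_available())
```

or exported in the shell before launching.)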
|
@qkiel Just tried to set HSA_ENABLE_SDMA=0 with both versions, no luck. With rocm6.0+pytorch-rocm5.7, it hangs when generating images, with no other logs. With rocm6.0+pytorch-rocm6.0, an invalid ISA error is reported. Any idea? |
I also have a problem with my Ryzen: when starting a model (I mostly saw it with Stable Diffusion), sometimes my screen flashes black and it even kicks me out of the X session. Have you seen this before? |
@gonwan You can also try setting …
@dkuku I've seen something like that when I launched Fooocus on the CPU instead of the GPU. |
With this patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.10-rc4&id=eb853413d02c8d9b27942429b261a9eef228f005, the default VRAM size for ROCm is now 1/2 of system memory. You can also change the default VRAM size via the ttm pages_limit parameter (sudo modprobe ttm pages_limit=xxx). |
@dkuku @qkiel On the CUDA platform, using a HostAllocator means the memory is page-locked and not pageable, which means the buffer size is usually small; so a HostAllocator is rarely used in models with large inputs. In PyTorch, the pluggable memory allocator means no support for the caching algorithm (blocks, segments, and memory aging) or for the recently supported profiling toolkit (added 3 months ago). This may result in degraded performance. Note that if Unified Memory is used, there is no need to copy between GPU and CPU:
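A minimal ctypes sketch of that zero-copy property (assumes a ROCm install with libamdhip64.so on the loader path; error handling elided):

```python
import ctypes

hip = ctypes.CDLL("libamdhip64.so")
host_ptr, dev_ptr = ctypes.c_void_p(), ctypes.c_void_p()

# allocate GPU-visible, page-locked host memory
assert hip.hipHostMalloc(ctypes.byref(host_ptr), ctypes.c_size_t(4096), 0) == 0
# fetch the device-side alias of the same physical pages
assert hip.hipHostGetDevicePointer(ctypes.byref(dev_ptr), host_ptr, 0) == 0

# On an APU both pointers refer to the same memory: a kernel reading dev_ptr
# sees CPU writes made through host_ptr, with no hipMemcpy in between.
print(hex(host_ptr.value), hex(dev_ptr.value))
hip.hipHostFree(host_ptr)
```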
|
I observe a 50% performance degradation when the GTT allocator hack is applied. I installed https://pypi.org/project/pytorch-rocm-gtt/. The hardware is the 5600G APU's GPU (gfx900), with ROCm 5.7.1. The torch version is 2.2.2+rocm5.7. The test code trains ResNet-34. |
@AGenchev You can try using hipMallocHost(&ptr, size) instead of hipHostMalloc(&ptr, size, 0). Since a hipHostMalloc allocation is coherent (i.e., uncached on an APU) by default, please refer to the HIP Programming Manual. |
Yes, it has an effect: the performance degradation decreased to roughly ~30%: |
@AGenchev Can you build the kernel with this patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.10-rc4&id=eb853413d02c8d9b27942429b261a9eef228f005? |
time = 7.20, which is very good and can be accepted as solving the problem. |
I'm closing this ticket because it was fixed in Linux kernel 6.9.9 (see the patch linked above). |
🚀 The feature, motivation and pitch
Background:
I am using an Asus Zenbook S13 OLED, which runs an AMD Ryzen 6800U APU. The APU comes with 680M integrated graphics. The graphics card uses shared memory from the system, and the default is 512MB (please see the screenshot below).
In a Windows environment the memory size changes dynamically with the amount of GPU memory required, but in a Linux environment it stays at 512MB (the result of setting Auto in the BIOS), so when I use Stable Diffusion, PyTorch hits an OOM situation. Since the notebook's BIOS doesn't allow users to modify the amount of dedicated memory, would it be possible for PyTorch to support UMA?
Here is the quote from AMD Ryzen UMA:
Alternatives
No response
Additional context
Another developer created torch-apu-helper, which uses CUDAPluggableAllocator to take advantage of the shared memory in PyTorch. However, when I try the code snippet with Stable Diffusion I get the following error:

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang