Support AMD Ryzen Unified Memory Architecture (UMA) #107605
@winstonma Regarding PyTorch's caching: that means you should develop your own caching-strategy wrapper on top of your primitive allocator/deallocator. But I think we could do better by reusing the caching strategy developed by PyTorch, don't we? cc @jaykchen @cpuhrsch |
@yiakwy-xpu-ml-framework-team but it seems that the sample in torch-apu-helper can get a piece of shared memory from the system for PyTorch to use. Maybe it is just a simple demo without full functionality, but it shows there is a way for PyTorch to grab memory from the system. Compared to booting into the BIOS, defining a dedicated piece of memory for the GPU, and then rebooting, allocating a fixed piece of memory at program runtime is already an improvement.
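For context, here is a minimal sketch of what that demo boils down to (hypothetical file and function names; assumes a ROCm build of PyTorch and hipcc on PATH, with the callback signatures from PyTorch's pluggable-allocator documentation): a tiny HIP shim routes every "CUDA" allocation through hipHostMalloc, so tensors land in host-resident shared memory that the APU's GPU can address.

```python
# Sketch only: roughly the torch-apu-helper idea, not its actual code.
import subprocess, textwrap
import torch

src = textwrap.dedent("""
    #include <sys/types.h>
    #include <hip/hip_runtime.h>
    extern "C" {
    // Signatures required by torch.cuda.memory.CUDAPluggableAllocator.
    void* uma_alloc(ssize_t size, int device, hipStream_t stream) {
        void* ptr = nullptr;
        hipHostMalloc(&ptr, size, 0);  // pinned host memory, visible to the iGPU
        return ptr;
    }
    void uma_free(void* ptr, ssize_t size, int device, hipStream_t stream) {
        hipHostFree(ptr);
    }
    }
""")
with open("uma_alloc.hip", "w") as f:
    f.write(src)
subprocess.check_call(["hipcc", "-shared", "-fPIC", "-o", "uma_alloc.so", "uma_alloc.hip"])

allocator = torch.cuda.memory.CUDAPluggableAllocator("./uma_alloc.so", "uma_alloc", "uma_free")
torch.cuda.memory.change_current_allocator(allocator)  # must run before the first CUDA allocation
x = torch.randn(1024, device="cuda")  # now served by hipHostMalloc
```

As discussed below, this bypasses PyTorch's caching allocator entirely, which is where the trouble starts for larger applications. |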
llama.cpp added an option to use the shared memory on AMD APUs and I can confirm that it works. This is the pull request: |
@dkuku Thanks. From llama.cpp we know that it is doable (on Windows we can access video memory via UMA through DirectX; that's my guess). torch-apu-helper uses the same way to get PyTorch using the UMA, and it lets PyTorch use the video memory. However, when Stable Diffusion uses some other PyTorch API it fails (like the following example).
This shows that torch-apu-helper is not sufficient to get higher-level applications working; official support from PyTorch is needed (just like llama.cpp has).
Are you using an AMD APU or an AMD dedicated GPU? How much memory does your system originally have? I think I am a bit confused. |
@jithunnair-amd Could you please take a look? Thanks 🙏🏻 EDIT: Sorry, I didn't see that the first post was already CCed. |
@winstonma As for the PyTorch caching algorithm: I have successfully added an external cached-allocator wrapper (with exactly PyTorch's block algorithm; I am familiar with PyTorch's block and expandable-segments allocation strategies). It works well in PyTorch sample tests with any custom allocator, as long as the memory address is deemed "valid" by the device. The only issue is how to dump it into PyTorch's newly supported memory snapshot, which also needs a Python frame tracer. Hence it is better to have native support from the PyTorch code base.
This is because exporting memory info (blocks, segments) uses many statistics from PyTorch's cached allocator (blocks and segments are exclusive to the cached allocator). Note that the external allocator should support exporting a memory snapshot and liveness info like this: The whole feature involves 3k lines of C++ code. So I guess your hipHostMalloc demo doesn't work in practice.
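For reference, the snapshot tooling in question is driven from Python roughly like this (a minimal sketch; _record_memory_history and _dump_snapshot are private PyTorch hooks, so keyword names may differ between versions):

```python
import torch

# start recording allocator events plus Python/C++ stack frames
torch.cuda.memory._record_memory_history(max_entries=100_000)

x = torch.randn(4096, 4096, device="cuda")
y = x @ x

# dump a pickle viewable at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```

An external pluggable allocator gets none of this for free, which is the point above. |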
@dkuku How does llama.cpp handle the overhead of memory allocation? Does it cache the memory it allocates? |
I am not sure. But I think llama.cpp and torch-apu-helper both referenced the HIP Programming Manual. I think memory-allocation-wise torch-apu-helper works, but as you mentioned, modifications are needed in order to get full support from PyTorch. I didn't know it needs 3k lines of code to modify; I just thought the modification was mainly about asking the ROCm driver to assign the memory, and that after the memory is allocated everything should work. |
Hi @winstonma, I am afraid that this is not how PyTorch works:
In the allocating stage, PyTorch first requests an available block (which contains a valid GPU address plus an offset pointing to the remaining available space), then it splits the block so that the memory descriptor of the newly inserted block is exactly what you requested. This creates small memory segments, hence each time you request an allocation, PyTorch increases the age of the memory and releases the "oldest" blocks if necessary. The release stage just reverses the above order, except that the root block containing the GPU memory buffer will not be released (or not immediately released) by cudaFree or something similar.
This algorithm can be supported with around 2k lines of code, depending on whether you want to handle multiple streams. That part is tricky because a memory block can be used under one stream for computing and another stream for a copy, and one has to manually notify PyTorch to increase the event/stream/block references; otherwise PyTorch has no means of knowing whether the memory is still in use. Native support from PyTorch must consider this for "CUDAPluggableAllocator".
Another thing one must support is tracing (this can be done by an allocator-wrapper singleton with a pybind interface to override the native memory_snapshot function; you must override the cDLL loading process if you want a singleton as the interface), otherwise PyTorch will not generate memory snapshots. This needs another 1k lines of code: PyObject frame tracking, C++ code tracking, etc.
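To make the block/split/age description concrete, here is a heavily simplified toy version of that scheme (illustration only: single stream, no events, nothing like the real multi-thousand-line implementation; raw_alloc/raw_free stand in for the primitive device allocator):

```python
# Toy single-stream caching allocator: block splitting plus age-based release.
from dataclasses import dataclass

@dataclass
class Block:
    ptr: int           # device address (segment base + offset)
    size: int
    free: bool = True
    age: int = 0       # bumped on every allocation; stalest freed first

class CachingAllocator:
    def __init__(self, raw_alloc, raw_free, segment_size=1 << 20):
        self.raw_alloc, self.raw_free = raw_alloc, raw_free
        self.segment_size = segment_size
        self.blocks = []   # every block, cached ones included
        self.clock = 0

    def malloc(self, size):
        self.clock += 1
        # 1) reuse a cached block if one is big enough, else grow a new root segment
        blk = next((b for b in self.blocks if b.free and b.size >= size), None)
        if blk is None:
            seg = max(size, self.segment_size)
            blk = Block(self.raw_alloc(seg), seg)
            self.blocks.append(blk)
        # 2) split: the remainder stays behind as a smaller cached block
        if blk.size > size:
            self.blocks.append(Block(blk.ptr + size, blk.size - size))
            blk.size = size
        blk.free, blk.age = False, self.clock
        return blk.ptr

    def free(self, ptr):
        # return the block to the cache; nothing goes back to the device yet
        next(b for b in self.blocks if b.ptr == ptr).free = True

    def release_oldest(self):
        # under memory pressure, hand the stalest cached block back to the device
        # (the real allocator only ever releases whole root segments)
        cached = [b for b in self.blocks if b.free]
        if cached:
            oldest = min(cached, key=lambda b: b.age)
            self.blocks.remove(oldest)
            self.raw_free(oldest.ptr)
```

The hard parts the sketch skips, per the comment above, are exactly the stream/event bookkeeping and the snapshot tracing. |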
@yiakwy-xpu-ml-framework-team Thanks for explaining. It seems I oversimplified the whole process and thought AMD could take advantage of the existing PyTorch ROCm framework and include support for UMA with a small modification. Seems there is a long way to go. As llama.cpp shows, the driver is able to use UMA on Linux, so I wish the AMD team would get PyTorch working on UMA 🙏🏻 This feature is crucial because most APUs are installed in laptops, and laptop manufacturers don't allow users to modify the dedicated memory in the BIOS. Therefore laptop users cannot use the PyTorch ROCm version, which is really sad. |
If you use force-host-alloction-APU instead of torch-apu-helper, you can run Stable Diffusion on an APU. I have a 5600G APU and I'm using it with Fooocus.
Links to the ROCm and PyTorch versions I used:
ROCm 5.7 with PyTorch for 5.7
ROCm 6.0 with PyTorch for 6.0
|
I was able to run sd-webui around January on my AMD APU (7840HS, RDNA3, 4G VRAM). I used UMAF to adjust VRAM from 4G to 8G, and set …. Just leaving some words here as an alternative approach. |
@yiakwy-xpu-ml-framework-team I tested @qkiel's method and I can run Stable Diffusion on my laptop without modifying the VRAM split in the BIOS. I am running ROCm 6.0.2 with the PyTorch 2.2.1 stable version. I don't see any performance difference between the new method and the VRAM-modification method. Just wondering if PyTorch ROCm would consider including the force-host-alloction-APU method in a future release of ROCm PyTorch. Thank you very much. |
@winstonma great it worked :] @gonwan have you tried adding one more environment variable, HSA_ENABLE_SDMA=0?
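(For anyone following along: such a variable has to be in the environment before torch initializes the HIP runtime, e.g.

```python
import os
os.environ["HSA_ENABLE_SDMA"] = "0"  # set before importing torch
import torch
print(torch.cuda.is_available())
```

or exported in the shell before launching.)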
|
@qkiel Just tried to set HSA_ENABLE_SDMA=0 with both versions, no luck. With rocm6.0+pytorch-rocm5.7, it hangs when generating images, with no other logs. With rocm6.0+pytorch-rocm6.0, an invalid ISA error is reported. Any idea? |
I also have a problem with my Ryzen: when starting a model (I mostly saw it with Stable Diffusion), sometimes my screen flashes black and it even kicks me out of the X session. Have you seen this before? |
@gonwan You can also try setting …
@dkuku I've seen something like that when I launched Fooocus on the CPU instead of the GPU. |
With this patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.10-rc4&id=eb853413d02c8d9b27942429b261a9eef228f005, the default VRAM size for ROCm is now 1/2 of system memory. You can also change the default VRAM size via the ttm pages_limit parameter (sudo modprobe ttm pages_limit=xxx). |
@dkuku @qkiel On the CUDA platform, using a HostAllocator means the memory is page-locked and not pageable, which means the buffer size is usually small; so a HostAllocator is rarely used in models with large inputs. In PyTorch, the pluggable memory allocator means no support for the caching algorithm (blocks, segments, and memory aging) or for the recently supported profiling toolkit (added 3 months ago). This may result in degraded performance. Note that if Unified Memory is used, there is no need to copy between GPU and CPU:
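A minimal ctypes sketch of that zero-copy property (assumes a ROCm install with libamdhip64.so on the loader path; error handling elided):

```python
import ctypes

hip = ctypes.CDLL("libamdhip64.so")
host_ptr, dev_ptr = ctypes.c_void_p(), ctypes.c_void_p()

# allocate GPU-visible, page-locked host memory
assert hip.hipHostMalloc(ctypes.byref(host_ptr), ctypes.c_size_t(4096), 0) == 0
# fetch the device-side alias of the same physical pages
assert hip.hipHostGetDevicePointer(ctypes.byref(dev_ptr), host_ptr, 0) == 0

# On an APU both pointers refer to the same memory: a kernel reading dev_ptr
# sees CPU writes made through host_ptr, with no hipMemcpy in between.
print(hex(host_ptr.value), hex(dev_ptr.value))
hip.hipHostFree(host_ptr)
```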
|
I observe a 50% performance degradation when the GTT allocator hack is applied. I installed https://pypi.org/project/pytorch-rocm-gtt/. The hardware is the 5600G APU's GPU (gfx900), with ROCm 5.7.1. The torch version is 2.2.2+rocm5.7. The test code trains ResNet-34. |
@AGenchev You can try using hipMallocHost(&ptr, size) instead of hipHostMalloc(&ptr, size, 0). Since a hipHostMalloc allocation is coherent (i.e., uncached on an APU) by default, please refer to the HIP Programming Manual. |
Yes, it has an effect: the performance degradation decreased to roughly ~30%: |
@AGenchev Can you build the kernel with this patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.10-rc4&id=eb853413d02c8d9b27942429b261a9eef228f005? |
time = 7.20, which is very good and can be accepted as solving the problem. |
I'm closing this ticket because it was fixed in Linux kernel 6.9.9 (see the patch linked above). |
🚀 The feature, motivation and pitch
Background:
I am using an Asus Zenbook S13 OLED, which runs an AMD Ryzen 6800U APU. The APU comes with 680M integrated graphics. The graphics card uses shared memory from the system, and the default is 512MB (please see the screenshot below).
In a Windows environment the memory size changes dynamically with the amount of GPU memory required, but in a Linux environment it stays at 512MB (the result of setting Auto in the BIOS), so when I use Stable Diffusion, PyTorch hits an OOM situation. Since the notebook's BIOS doesn't allow users to modify the amount of dedicated memory, would it be possible for PyTorch to support UMA?
Here is the quote from AMD Ryzen UMA:
Alternatives
No response
Additional context
Another developer created torch-apu-helper, which uses CUDAPluggableAllocator to take advantage of the shared memory in PyTorch. However, when I try the code snippet with Stable Diffusion I get the following error:

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang