Arrays larger than 4 GB crash #325
Comments
I did some further tests, and it seems that allocating more than 4 GB either returns garbage or randomly crashes. Example of allocating less than 4 GB on an A770 16GB: the mean is around 0.5, which is expected.
Example of allocating more than 4 GB on the CPU: the mean is around 0.5, which is expected.
Example of allocating more than 4 GB on an A770 16GB: the mean is around 0.014, which is completely wrong.
In conclusion, allocating more than 4 GB either crashes or returns complete garbage. |
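A minimal sketch of the kind of test being described (the original code blocks did not survive the page scrape, so the tensor shape and the `xpu` device usage here are illustrative assumptions, not the poster's actual code):

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  # registers the "xpu" device

# ~4.6 GB of float32 data (34000 * 34000 * 4 bytes), just past the 4 GB limit.
x = torch.rand(34000, 34000, dtype=torch.float32, device="xpu")

# Uniform values in [0, 1) should average ~0.5; on an affected A770 this
# reportedly either crashes or prints a value far from 0.5 (e.g. ~0.014).
print(x.mean().item())
```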
@jingxu10 |
It should be allocated by Level Zero. |
Will passing the module build options from https://spec.oneapi.io/level-zero/latest/core/PROG.html#module-build-options help? |
Hi, @BA8F0D39 |
Hi @BA8F0D39, thank you for using Intel products and IPEX. Is it possible to add the following flags and attach the log here when you hit the error?
Thank you. |
On Windows 11 with WSL
Code
|
On Ubuntu 22.04 with Linux 6.3, it also crashes, but only after I close Python.
Code
Crash
|
I believe this issue is caused by an incorrect environment setup. You can follow this blog to set up the IPEX environment on WSL2 with Docker: https://medium.com/intel-analytics-software/stable-diffusion-with-intel-arc-gpus-f2986bba8365 |
@cchheennhhaaoo On Ubuntu 22.04 with Linux 6.3, it also crashes, but only after I close Python.
Code
Crash
|
I am able to replicate the same issue on Fedora 37 with kernel 6.2 and Ubuntu 22.04 with kernel 5.19. Both instances involve a build from the latest |
It is weird that the crash error is only reported when you enable DEBUG flags; otherwise the code silently crashes.
|
Here are some quick findings I had: it's not exactly at 4 GB, and I don't think the gibberish is related...
For FP16, I have some other weird bugs where sometimes it works and sometimes it doesn't, even for small arrays (less than 10000x10000). Even across multiple consecutive runs, it might work 50 times in a row, then go bonkers for 10. For FP32, the gibberish starts appearing at around 30800x30800, which is 3.79456 GB. Before that, starting around 30400x30400, it alternates between gibberish and a good output across successive runs. With such numerical instability, I might write a script and test every possible combination at this point; it might be worth taking a look at other random sampling methods too. |
Just did another quick run for FP32 at 30800x30800 and this time it works just fine (even 32000x32000 works this time around); there is some weird instability going on... Quick thought: since I am not using a fixed seed in those tests, might it be that some "bad seeds" are causing the instability? |
@fredlarochelle If the adjacent memory locations just so happen to contain zeros, then the mean is around 0. If they just so happen to contain uniformly distributed values from 0 to 1, then the mean is 0.5. It could allow you to read other programs' data on the GPU. |
@BA8F0D39 That would make sense, but I still get the instability for FP16, and FP32 starts acting weird before it would actually overfill a 32-bit buffer. Add in the instability, and there is probably more than one problem going on at the same time. |
@fredlarochelle @BA8F0D39 Thanks for the feedback. The issue mentioned here (the so-called numerical instability) looks like one we met recently in internal testing. It might be caused by cache consistency after a global memory fence. We are following up. BTW, as for the crashes when allocating memory larger than 4 GB, we cannot reproduce them on the recommended driver. |
@arthuryuan1987 On Ubuntu 22.04 with the 5.19 out-of-tree driver stack (intel-i915-dkms intel-platform-vsec-dkms intel-platform-cse-dkms intel-fw-gpu), it randomly crashes, and it is not deterministic. On Ubuntu 22.04 with the 6.3 mainline kernel, it also randomly crashes. I can force it to crash 100% of the time by enabling the debug flags.
|
@arthuryuan1987 I am on Ubuntu 22.04.2 with 5.19.0-41-generic, on the latest driver, all following the installation instructions in the documentation, with a build from the latest commit in the |
I used a Vulkan GPU memory tester. It seems all memory regions above 4GB are corrupt and the read transfer speed is 1.9 GB/s.
|
@BA8F0D39 I checked the repo, https://github.com/GpuZelenograd/memtest_vulkan |
Could you please provide an update on the status of this issue? On the latest |
@cchheennhhaaoo |
I have exactly the same bug: during torch fine-tuning, my script crashes with
I have an Arc A770 with 16 GB of memory. To do this I use the latest transformers version, which integrates XPU compute. Is a fix to use all available memory planned? |
Please check this line in your repo. For the invalid-result issue, please refer to arthuryuan1987's comment above. |
@BA8F0D39 @fredlarochelle We don't plan to support this. You can still allocate >4 GB with 2.0.110+xpu because we disabled the allocation in master, not in the previously released drop. Could you please provide justification for why >4 GB allocation is required? |
Using an image larger than 768x512 in Stable Diffusion 1.5 results in a blank or garbled image, while PyTorch doesn't even use all of the 16 GB on the A770. Every LLM is bigger than 4 GB, and they all fail to load on the A770 even though they would fit in VRAM. Other huge models and datasets bigger than 4 GB run out of memory. |
@tye1 Pretty much what @BA8F0D39 said, plus that you need workarounds you don't need with Nvidia GPUs, for example using a smaller batch size and loading multiple separate batches on the GPU, as in the sketch below. The main problem, I would say, is that a lot of PyTorch code you find around the internet simply assumes that you can allocate more than 4 GB, since that is supported on Nvidia GPUs. |
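A rough sketch of the micro-batching workaround mentioned above, splitting a large batch into sub-batches so that no single device tensor exceeds 4 GB; `model`, the chunk size, and the `xpu` device string are placeholder assumptions, not code from the thread:

```python
import torch

def run_in_chunks(model, inputs, chunk_size=64):
    """Run `model` over `inputs` in sub-batches small enough that every
    device allocation stays well under the 4 GB limit."""
    outputs = []
    for chunk in torch.split(inputs, chunk_size, dim=0):
        chunk = chunk.to("xpu")        # each transfer/allocation is < 4 GB
        with torch.no_grad():
            out = model(chunk)
        outputs.append(out.cpu())      # move results off the device immediately
        del chunk, out                 # release device memory before the next chunk
    return torch.cat(outputs, dim=0)   # recombine on the CPU, not on the GPU
```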
@tye1 This is also an issue I have. For highly complex models and long sequence lengths, even a batch size of 1 can be larger than 4 GB. Such limits should be determined by the VRAM capacity of the GPU rather than by software, I would have thought. |
Are there any updates on this, or is the stance still "we don't plan to support this"? Asking since, if it's the latter, I'd be looking to sell my GPU sooner rather than later. |
Same here. Machine learning is the only reason I paid extra for a 16 GB card. |
Sorry for the late response. We disabled >4 GB memory allocation on the Arc A770 because there are some hardware limitations on Arc, and there would be a significant performance drop as the penalty in the trade-off. This is not acceptable in IPEX's usage scenarios, hence we have disabled it. |
Again, thank you for the great work on this project. But is there any possibility of keeping it enabled, with a warning when an allocation exceeds 4 GB noting that performance will be significantly reduced? I imagine it's still better than CPU processing, which is the only alternative I (and I'm sure others too) have available. Again, the only reason I bought this 16 GB card was the potential for machine learning, so only being able to use 4 of its 16 GB is really rather frustrating; I hope you can see where I'm coming from. I also understand if there's absolutely nothing you can do, it would just be really rather disappointing. If that is the case, perhaps this is something the Arc/IPEX team could work on to make it a possibility, if it isn't directly possible in this extension. Thank you |
I guess this would only impact the case where you have to allocate single big memory chunks that are larger than 4 GB. If your workload doesn't need such a big chunk, you can still allocate enough memory in total, up to 16 GB (maybe a little lower than that due to what the runtime/driver needs)? |
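If I read the comment above correctly, the limit is per allocation rather than on the total. A quick illustrative sketch of that distinction (sizes and device string are assumptions, not from the thread):

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

# Eight separate ~1.5 GB tensors: roughly 12 GB in total on a 16 GB A770,
# yet no single allocation exceeds the 4 GB limit.
chunks = [torch.empty(1500 * 1024 * 1024 // 4, dtype=torch.float32, device="xpu")
          for _ in range(8)]

# A single ~6 GB tensor, by contrast, would hit the per-allocation limit:
# big = torch.empty(6 * 1024**3 // 4, dtype=torch.float32, device="xpu")  # expected to fail
```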
Unfortunately that's the issue: with long sequences and large model sizes when using transformer encoders, my use case requires being able to move more than 4 GB in one go. Unless there is a way built into this extension that automatically splits the model into chunks before loading it into memory (same with samples and/or batches)? P.S. Or even a manual way to do this? |
Same here. It's the ONLY reason I bought this card, the first Intel product I've bought in 15 years, and it will be the last. |
I implemented a W/A in stable-diffusion-webui for this. I'm wondering whether such a mechanism could be implemented at the IPEX framework level. Adding IPEX W/As in upper-level applications is just not scalable. |
Whilst a neat idea, batches can easily be sent in smaller chunks via a purely software implementation with PyTorch, so I can't see much of a need for this at the core level. Forgive me if I'm wrong. Something that can't be fixed in software, only at the firmware/library level, is if you're already running stochastic gradient descent (batch size of 1) and are still exceeding the 4 GB limit. |
I'm OK with >4 GB allocations causing some slowdowns. But if this cannot be implemented at all, that means your GPUs are practically (not theoretically or technically) useless for Stable Diffusion, and I'm going to sell my A770. |
I am still working my way through the Intel Arc documentation with respect to how the global / process / surface / execution-unit / physical / virtual etc. addressing works at the architecture level, and I have no full idea of how the multiple Intel driver / compute software layers above the HW affect the memory limitations, but I'd like to better understand where these limitations lie in the HW / driver / compute SW stack. It is disappointing for the Arc A770-16 to be a 16 GB VRAM GPU and yet not let a programmer easily access as much data as desired anywhere in the card's VRAM (and also in the host application's data RAM while programming, ideally beyond even those limits, as I exemplify below). It makes me concerned for Battlemage, Celestial, and Druid as well, since apparently the programmer's model of memory access on Nvidia GPUs has been (IMO) so much better, even on their consumer GPUs, for several past generations. I gladly got the Arc A770-16 to use its 16 GB of RAM for GPGPU, and I can see from several Intel documents there are
From Intel documentation showing mostly hopeful capabilities (though maybe SW is turning some things into SW limitations?):
Please see the below just for contrast, in terms of what I'd consider (as a developer) the most ideal programming model. In contrast to the above Intel architecture, in the case exemplified below (already working on several generations of consumer NVIDIA GPUs), the developer is able to seamlessly access data anywhere in the VRAM of any of the GPUs in a system, but also CPU memory anywhere in their application's CPU address space, and in fact also CPU virtual addresses. Here are the citations about the programmer's view of memory (as I understand it to be relevant): https://developer.nvidia.com/blog/unified-memory-cuda-beginners/ Here are small relevant excerpts
I'm not sure why we can't have such a capability, with the programming model mapping to efficient HW operations. IMO it would be nice to see Battlemage, Celestial, Druid, Arcanist improve this aspect of the programming model. |
The thing is, if all they're concerned about is slowdowns, then wouldn't it be easy enough to embed a warning that these slowdowns occur when transferring data in chunks greater than 4 GB in size? Significant slowdowns would mean it at least still works. Some functionality is better than no functionality; I'm sure a lot of people would agree with that. @tye1 |
Exactly. I mean, these 4 GB limits (IIRC) have been variously mentioned here (wrt. PyTorch programmers), for the OpenCL implementation (OCL programmers), etc. OK, so the limitation is something that directly affects GPU/HPC/ML programmers. As a group that writes HPC / GPGPU code, I think we're especially used to benchmarking / analyzing / optimizing our code wrt. a myriad of trade-offs between capability, speed, complexity, etc. "Oh look, I'm going beyond {L1, L2, L3} cache size / cache line / page size -- significant performance drop." Same thing using RAM vs. registers, or accessing RAM non-sequentially, or about 50 other cases where real-world code must / should deviate from the ideal best-case performance strategy and must have the flexibility to do so as the programmer decides best at design time or even run time. I'd rather the most flexible / capable possibility "just work" easily, and if I have to optimize things somehow (if even possible) then I'll spend the time to optimize the DSA I used or choose new speed / capability trade-offs if that's even appropriate. I'm hoping our RAM / VRAM sizes will keep increasing substantially every generation (16 GB Arcs now, hopefully 32-48 GB ECC'd B / C / D / NV / AMD / whatever cards in the months / year or so to come), so it seems key to be able to actually use ("it just works" style) all the VRAM one has paid for (particularly since, as aforementioned, it IIRC "just works" on the CPU execution device, versus the GPU device having this unusual limit). |
Single >4GB VRAM allocations are possible on Arc, but currently they require 2 small workarounds in the application. For OpenCL, these are:
I've added this to my OpenCL-Wrapper in this commit, so anything built on top of it works on Arc out of the box. For Level Zero, the workarounds are similar: https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md I agree that >4 GB allocations should be enabled by default. Such a limitation is not contemporary in a time when AI and simulation models commonly use much larger VRAM capacity. Using the full 16 GB VRAM capacity of a 16 GB GPU has to work no matter what. ISVs should not have to manually add patches only for Arc to enable basic functionality. Better to eliminate this complication and just make it work, and provide the option to disable >4 GB allocations for optimization. |
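For readers coming from Python, a hedged sketch of what those two OpenCL-side workarounds look like through pyopencl. The `-cl-intel-greater-than-4GB-buffer-required` build option is the one named in the compute-runtime programmers guide linked above; the numeric value of `CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL` is an assumption taken from Intel's extension headers and should be verified against `cl_ext.h` or the OpenCL-Wrapper source:

```python
import pyopencl as cl

# Assumed value of the Intel extension flag; verify before relying on it.
CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL = 1 << 23

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Workaround 1: create the buffer with the "unrestricted size" flag so the
# runtime accepts a single allocation larger than 4 GB.
nbytes = 5 * 1024**3  # 5 GiB
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL,
                size=nbytes)

# Workaround 2: build every kernel that touches such a buffer with the
# dedicated build option.
src = "__kernel void fill(__global float* x) { x[get_global_id(0)] = 1.0f; }"
prg = cl.Program(ctx, src).build(options=["-cl-intel-greater-than-4GB-buffer-required"])
```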
I am very curious whether this problem has been solved in the latest stable version of IPEX. The only reason I bought an A770 was to develop LLMs. My code can even run correctly on a Moore Threads S80 but cannot run with Intel's IPEX. At some points I have to allocate large amounts of memory, especially when doing batch inference. |
I'd very much be interested in hearing if this is fixed yet. However, you can still do something for now if it isn't: You've also got the benefit of still being able to do things; I have a scenario that becomes impossible due to this limitation, which is having a model or an individual sequence that's greater than 4 GB, something that can't be split into parts and then recombined on the GPU once it's been sent over in parts smaller than 4 GB... 😞 |
Can I ask whether you are using WSL, Ubuntu 22.04 directly, or another system? Because when I run my previous code in WSL (it has been verified on CUDA, CPU, and even a Moore Threads S80 GPU), problems often occur, and I have already submitted an issue for one of them. I would like to know what system you are using, to see if I need to switch systems. :) |
I run directly on Windows nowadays, but I've previously tried it directly on Ubuntu and in WSL. It behaves the same. |
Describe the bug
Intel compute runtime doesn't allow allocating a buffer bigger than 4 GB.
intel/compute-runtime#627
When you allocate an array bigger than 4 GB with intel-extension-for-pytorch on an A770 16GB, it crashes.
Is it possible to allocate multiple buffers for an array instead of allocating one buffer for one array?
Versions